evil_MoE/docs/writeup/figs/plot_teacher_ablation.py

"""Teacher-ablation appendix figure: does cutting the teacher at step 40 stop
the vanilla student from hacking? Reads data/teacher_ablation.csv, writes
figs/teacher_ablation.{png,pdf}.

Claim under test: once the student produces its own hacks, the cached teacher is
no longer load-bearing -- removing it at step 40 does not bend the deploy-hack
trajectory down. The post-cut segment of the off@40 curve keeps rising, so the
teacher is a seeder, not the driver.

Caveat baked into the legend: the off@40 run (job 87) used the default fast LR
(3e-3) while the teacher-on reference (job 97) used the gentler 1e-3 that survives
200 steps without the over-optimization collapse. The within-run post-cut rise is
the confound-free part of the evidence; the matched-LR pair is job 124 (queued).

FIXME: when job 124 (gentle vanilla teacher-off@40) lands, replace the off@40
rows in teacher_ablation.csv with job 124's trajectory (single-variable vs job 97),
update the job/LR text in the legend labels below, and drop the LR caveat.
"""
from pathlib import Path

import matplotlib.pyplot as plt
import polars as pl

HERE = Path(__file__).parent
df = pl.read_csv(HERE.parent / "data" / "teacher_ablation.csv")
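
# Hedged sketch, not part of the original script: a quick numeric check of the
# docstring's claim that the off@40 curve ends higher than it was at the cut.
# `post_cut_change` and its `cut_step` argument are illustrative names. It only
# measures the net change after the cut, not whether the rise is monotonic.
def post_cut_change(steps, rates, cut_step=40):
    """Return (last logged rate) - (rate at the first logged step >= cut_step).

    A positive value means the hack rate ended above its level at the cut.
    Returns None if no step at or after cut_step was logged.
    """
    post = sorted((s, r) for s, r in zip(steps, rates) if s >= cut_step)
    if not post:
        return None
    return post[-1][1] - post[0][1]

# Possible usage (assumes the off@40 rows extend past step 40):
#   off = df.filter(pl.col("teacher_schedule") == "off@40")
#   print("off@40 post-cut change:",
#         post_cut_change(off["step"].to_list(), off["deploy_hack"].to_list()))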

fig, ax = plt.subplots(figsize=(5.0, 3.2))

styles = {
    "off@40": dict(color="#c1272d", marker="o", label="teacher off @ step 40 (job 87, lr 3e-3)"),
    "on":     dict(color="#444444", marker="s", label="teacher on throughout (job 97, lr 1e-3)"),
}
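for sched, sty in styles.items():
    # one curve per teacher schedule, ordered by training step
    d = df.filter(pl.col("teacher_schedule") == sched).sort("step")
    if d.height == 0:
        # fail loudly instead of silently drawing an empty line
        raise ValueError(f"no rows with teacher_schedule == {sched!r} in teacher_ablation.csv")
    ax.plot(d["step"], d["deploy_hack"], lw=1.6, ms=4, **sty)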

# teacher-cut marker for the off@40 arm
ax.axvline(40, color="#c1272d", ls=":", lw=1.0)
ax.annotate("teacher removed", xy=(40, 0.04), xytext=(52, 0.04),
            color="#c1272d", fontsize=8, va="center")

ax.set_xlabel("GRPO step")
ax.set_ylabel("deploy hack rate (n=64, T=0.7)")
ax.set_ylim(-0.02, 0.65)
ax.set_xlim(-3, 203)
ax.legend(frameon=False, fontsize=8, loc="upper left")
ax.spines[["top", "right"]].set_visible(False)
fig.tight_layout()

for ext in ("png", "pdf"):
    out = HERE / f"teacher_ablation.{ext}"
    fig.savefig(out, dpi=150, bbox_inches="tight")
    print("wrote", out)  # report both outputs, not just the PNG