evil_MoE/docs/writeup/figs/plot_teacher_ablation.py
wassname 329066e99b paper: teacher-off control appendix (app:teacher) -- teacher seeds not sustains
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58,
job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME
to swap in lr-matched job 124 (queued low-prio). CSV is the committed data
artifact; fig regen by plot_teacher_ablation.py.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 12:30:49 +00:00


"""Teacher-ablation appendix figure: does cutting the teacher at step 40 stop
the vanilla student from hacking? Reads data/teacher_ablation.csv, writes
figs/teacher_ablation.{png,pdf}.

Claim under test: once the student produces its own hacks, the cached teacher is
no longer load-bearing -- removing it at step 40 does not bend the deploy-hack
trajectory down. The post-cut segment of the off@40 curve keeps rising, so the
teacher is a seeder, not the driver.

Caveat baked into the legend: the off@40 run (job 87) used the default fast LR
(3e-3) while the teacher-on reference (job 97) used the gentler 1e-3 that survives
200 steps without the over-optimization collapse. The within-run post-cut rise is
the confound-free part of the evidence; the matched-LR pair is job 124 (queued).

FIXME: jobs 87/97 are the closest match but differ in LR. When job 124 (gentle
vanilla teacher-off@40) lands, replace the off@40 rows in teacher_ablation.csv with
job 124's trajectory (single-variable vs job 97) and drop the LR caveat.
"""
from pathlib import Path
import polars as pl
import matplotlib.pyplot as plt
HERE = Path(__file__).parent
df = pl.read_csv(HERE.parent / "data" / "teacher_ablation.csv")
fig, ax = plt.subplots(figsize=(5.0, 3.2))
styles = {
    "off@40": dict(color="#c1272d", marker="o", label="teacher off @ step 40 (job 87, lr 3e-3)"),
    "on": dict(color="#444444", marker="s", label="teacher on throughout (job 97, lr 1e-3)"),
}
for sched, sty in styles.items():
    d = df.filter(pl.col("teacher_schedule") == sched).sort("step")
    ax.plot(d["step"], d["deploy_hack"], lw=1.6, ms=4, **sty)
# teacher-cut marker for the off@40 arm
ax.axvline(40, color="#c1272d", ls=":", lw=1.0)
ax.annotate("teacher removed", xy=(40, 0.04), xytext=(52, 0.04),
            color="#c1272d", fontsize=8, va="center")
ax.set_xlabel("GRPO step")
ax.set_ylabel("deploy hack rate (n=64, T=0.7)")
ax.set_ylim(-0.02, 0.65)
ax.set_xlim(-3, 203)
ax.legend(frameon=False, fontsize=8, loc="upper left")
ax.spines[["top", "right"]].set_visible(False)
fig.tight_layout()
for ext in ("png", "pdf"):
    fig.savefig(HERE / f"teacher_ablation.{ext}", dpi=150, bbox_inches="tight")
print("wrote", HERE / "teacher_ablation.png")
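
# Optional sanity check (a sketch, not part of the committed figure script).
# The docstring's claim is that the off@40 curve keeps rising after the cut.
# post_cut_rise is a hypothetical helper that reduces that to one number:
# the last logged rate minus the rate at the first logged step at or after
# the cut. It assumes the "step" / "deploy_hack" columns used above and takes
# plain sequences, so it does not depend on polars.
def post_cut_rise(steps, rates, cut_step=40):
    """Return rate at the last step minus rate at the first step >= cut_step."""
    pairs = sorted((s, r) for s, r in zip(steps, rates) if s >= cut_step)
    if len(pairs) < 2:
        raise ValueError("need at least two points at or after the cut")
    return pairs[-1][1] - pairs[0][1]

# Possible usage with the frame loaded above (a positive value would be
# consistent with the claim; this within-run number is unaffected by the
# LR mismatch between jobs 87 and 97):
# off = df.filter(pl.col("teacher_schedule") == "off@40")
# print("post-cut rise:", post_cut_rise(off["step"].to_list(), off["deploy_hack"].to_list()))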