mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-28 03:49:59 +08:00
329066e99b
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58, job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME to swap in lr-matched job 124 (queued low-prio). CSV is the committed data artifact; fig regen by plot_teacher_ablation.py. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
52 lines
2.2 KiB
Python
"""Teacher-ablation appendix figure: does cutting the teacher at step 40 stop
the vanilla student from hacking? Reads data/teacher_ablation.csv, writes
figs/teacher_ablation.{png,pdf}.

Claim under test: once the student produces its own hacks, the cached teacher is
no longer load-bearing -- removing it at step 40 does not bend the deploy-hack
trajectory down. The post-cut segment of the off@40 curve keeps rising, so the
teacher is a seeder, not the driver.

Caveat baked into the legend: the off@40 run (job 87) used the default fast LR
(3e-3) while the teacher-on reference (job 97) used the gentler 1e-3 that survives
200 steps without the over-optimization collapse. The within-run post-cut rise is
the confound-free part of the evidence; the matched-LR pair is job 124 (queued).

FIXME: jobs 87/97 are the closest match but differ in LR. When job 124 (gentle
vanilla teacher-off@40) lands, replace the off@40 rows in teacher_ablation.csv with
job 124's trajectory (single-variable vs job 97) and drop the LR caveat.
"""
from pathlib import Path
import polars as pl
import matplotlib.pyplot as plt

HERE = Path(__file__).parent
df = pl.read_csv(HERE.parent / "data" / "teacher_ablation.csv")

fig, ax = plt.subplots(figsize=(5.0, 3.2))

styles = {
    "off@40": dict(color="#c1272d", marker="o", label="teacher off @ step 40 (job 87, lr 3e-3)"),
    "on": dict(color="#444444", marker="s", label="teacher on throughout (job 97, lr 1e-3)"),
}
for sched, sty in styles.items():
    d = df.filter(pl.col("teacher_schedule") == sched).sort("step")
    ax.plot(d["step"], d["deploy_hack"], lw=1.6, ms=4, **sty)

# teacher-cut marker for the off@40 arm
ax.axvline(40, color="#c1272d", ls=":", lw=1.0)
ax.annotate("teacher removed", xy=(40, 0.04), xytext=(52, 0.04),
            color="#c1272d", fontsize=8, va="center")

ax.set_xlabel("GRPO step")
ax.set_ylabel("deploy hack rate (n=64, T=0.7)")
ax.set_ylim(-0.02, 0.65)
ax.set_xlim(-3, 203)
ax.legend(frameon=False, fontsize=8, loc="upper left")
ax.spines[["top", "right"]].set_visible(False)
fig.tight_layout()

for ext in ("png", "pdf"):
    fig.savefig(HERE / f"teacher_ablation.{ext}", dpi=150, bbox_inches="tight")
print("wrote", HERE / "teacher_ablation.png")
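
The docstring leans on the within-run post-cut rise as the confound-free evidence. A minimal sketch of how that number could be read off the CSV is given here; it is not part of the original script. The column names (`teacher_schedule`, `step`, `deploy_hack`) and the cut step of 40 come from the script above, while the helper names `post_cut_rise` and `load_rows` are hypothetical. It uses only the standard library and takes the first evaluation at or after the cut as the baseline, which is an assumption about how evaluations line up with step 40.

```python
import csv
from pathlib import Path


def post_cut_rise(rows, schedule="off@40", cut_step=40):
    """Return (rate_at_cut, rate_at_end, delta) for one teacher schedule.

    rows: iterable of dicts with keys teacher_schedule, step, deploy_hack.
    Values may be strings, as csv.DictReader yields them.
    """
    # Keep only the requested arm and order its evaluations by step.
    pts = sorted(
        (int(float(r["step"])), float(r["deploy_hack"]))
        for r in rows
        if r["teacher_schedule"] == schedule
    )
    # Baseline is the first evaluation at or after the cut (assumption).
    post = [(s, v) for s, v in pts if s >= cut_step]
    if len(post) < 2:
        raise ValueError("need at least two evaluations at or after the cut")
    at_cut = post[0][1]
    at_end = post[-1][1]
    return at_cut, at_end, at_end - at_cut


def load_rows(csv_path):
    """Read the ablation CSV into a list of dicts (hypothetical helper)."""
    with Path(csv_path).open(newline="") as f:
        return list(csv.DictReader(f))
```

Called as `post_cut_rise(load_rows("data/teacher_ablation.csv"))`, a positive delta would correspond to the 0.36->0.58 rise quoted in the commit message for job 87, though that has not been checked against the committed CSV here.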