journal: A5 baseline (job 103) finished -- per-mode deploy split, held-out modes all emerge on-policy

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-04 13:11:46 +00:00
parent e91e095596
commit 1265445939
@@ -2,6 +2,24 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-04 (d) — A5 baseline FINISHED: per-mode deploy split confirms all three held-out modes hack on-policy; job 104 (route2 suppression) now running
Closes the `[next]` from (c): job 103 ran to step 200 and wrote the per-mode deploy artifact, so the aggregate `deploy_hack=0.552` now has its by-mode breakdown.
**Job 103 (A5 baseline, vanilla + `--teacher-modes run_tests`, #185), FINAL EVAL n=192.**
out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json
| mode | held-out? | deploy_hack | deploy_solve |
|------|-----------|-------------|--------------|
| run_tests | KNOWN (teacher demos) | 1.000 | 0.000 |
| file_marker | held-out | 0.625 | 0.375 |
| sentinel | held-out | 0.417 | 0.583 |
| stdout_marker | held-out | 0.167 | 0.646 |
- [obs] all three never-demoed modes hack at deploy (knob-off == trained model for vanilla): file_marker 0.625, sentinel 0.417, stdout_marker 0.167. Aggregate deploy_hack 0.552, solve 0.401. No collapse (lp_s healthy to step 200, per (b)/(c)).
- [verdict] the A5 emergence precondition is met decisively, not marginally: a teacher that only demonstrates run_tests still produces a student that hacks every loophole on-policy. The route gate (`teacher OR hacked_E`) is blind to all three (hacked_E==0 on them, per the (a) E-by-mode table), so suppressing them in job 104 would be genuine zero-held-out-label generalisation.
- [next] job 104 (route2, run_tests-only v_grad, gate blind to held-out modes) is RUNNING (started 13:10). Verdict = its per-mode deploy file_marker/sentinel/stdout_marker hack vs this baseline at matched solve. Headline target: file_marker 0.625 -> ~0. ~10h run.
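As a sanity check on the table in (d), the aggregate figures can be reproduced from the per-mode rows. This is a minimal sketch, assuming the four modes are equally weighted in the n=192 eval (an assumption; the per-mode counts are not stated here). It uses the rounded table values directly rather than reading `per_mode_deploy.json`, whose schema is not shown, and the variable names are illustrative only.

```python
# Per-mode (deploy_hack, deploy_solve) copied from the job 103 table above.
# ASSUMPTION: each mode contributes equally to the aggregate.
per_mode = {
    "run_tests": (1.000, 0.000),      # known mode (teacher demos)
    "file_marker": (0.625, 0.375),    # held-out
    "sentinel": (0.417, 0.583),       # held-out
    "stdout_marker": (0.167, 0.646),  # held-out
}

def unweighted_mean(values):
    """Plain arithmetic mean; valid as the aggregate only if modes are equally sized."""
    values = list(values)
    return sum(values) / len(values)

agg_hack = unweighted_mean(hack for hack, _ in per_mode.values())
agg_solve = unweighted_mean(solve for _, solve in per_mode.values())

# Compare against the aggregates reported in the [obs] line.
print(f"aggregate deploy_hack  = {agg_hack:.3f} (reported 0.552)")
print(f"aggregate deploy_solve = {agg_solve:.3f} (reported 0.401)")
```

If the modes turn out not to be equally sized, the mean would need to be weighted by each mode's eval count instead.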
## 2026-06-04 (c) — A5 baseline confirmed: BOTH held-out modes (file_marker + sentinel) emerge robustly on-policy under a run_tests-only teacher
Follow-up to (b), which flagged the held-out file_marker emergence as a single transient spike (step 18) and held judgement. By step ~36-41 it is clearly sustained, so the A5 baseline precondition is met.