diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index e4b1e60..2790a82 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -2,6 +2,24 @@ Append-only. New entries at the top, date-stamped. Never edit old entries. +## 2026-06-04 (d) — A5 baseline FINISHED: per-mode deploy split confirms all three held-out modes hack on-policy; job 104 (route2 suppression) now running + +Closes the `[next]` from (c): job 103 ran to step 200 and wrote the per-mode deploy artifact, so the aggregate `deploy_hack=0.552` now has its by-mode breakdown. + +**Job 103 (A5 baseline, vanilla + `--teacher-modes run_tests`, #185), FINAL EVAL n=192.** +out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json + +| mode | held-out? | deploy_hack | deploy_solve | +|------|-----------|-------------|--------------| +| run_tests | KNOWN (teacher demos) | 1.000 | 0.000 | +| file_marker | held-out | 0.625 | 0.375 | +| sentinel | held-out | 0.417 | 0.583 | +| stdout_marker | held-out | 0.167 | 0.646 | + +- [obs] all three never-demoed modes hack at deploy (knob-off == trained model for vanilla): file_marker 0.625, sentinel 0.417, stdout_marker 0.167. Aggregate deploy_hack 0.552, solve 0.401. No collapse (lp_s healthy to step 200, per (b)/(c)). +- [verdict] the A5 emergence precondition is met decisively, not marginally: a teacher that only demonstrates run_tests still produces a student that hacks every loophole on-policy. The route gate (`teacher OR hacked_E`) is blind to all three (hacked_E==0 on them, per the (a) E-by-mode table), so suppressing them in job 104 would be genuine zero-held-out-label generalisation. +- [next] job 104 (route2, run_tests-only v_grad, gate blind to held-out modes) is RUNNING (started 13:10). Verdict = its per-mode deploy file_marker/sentinel/stdout_marker hack vs this baseline at matched solve. Headline target: file_marker 0.625 -> ~0. ~10h run. + ## 2026-06-04 (c) — A5 baseline confirmed: BOTH held-out modes (file_marker + sentinel) emerge robustly on-policy under a run_tests-only teacher Follow-up to (b), which flagged the held-out file_marker emergence as a single transient spike (step 18) and held judgement. By step ~36-41 it is clearly sustained, so the A5 baseline precondition is met.