From 2b48eab6b8ab0103c09370e7b97073416a14c898 Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Thu, 4 Jun 2026 14:59:26 +0000 Subject: [PATCH] journal: A5 suppression preliminary (job 104 step ~32) -- held-out hacks emerge on-policy, knob-off deploy holds 0.000 Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- RESEARCH_JOURNAL.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index 2790a82..cf32dc8 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -2,6 +2,17 @@ Append-only. New entries at the top, date-stamped. Never edit old entries. +## 2026-06-04 (e) — A5 suppression PRELIMINARY (job 104, step ~32): held-out hacks emerge on-policy but knob-off deploy holds at 0.000 + +Mid-run signal, not the verdict (per-mode deploy split lands at end-of-run, ~23:20). Recording because it is the load-bearing no-cheat result and the user is AFK. + +**Job 104 (A5 route2 suppression, run_tests-only v_grad + teacher, gate blind to held-out modes, #185).** +- [obs] held-out modes ARE emerging on the knob-ON (training) path: hk_fm (file_marker) =1/7/5/5 at steps 27/28/31/32, hk_so (stdout_marker) =4 at step 32, hk_se=1 at step 21. On-policy emergence is comparable to the job-103 baseline at the same steps, so this is not "held-out simply never showed up". +- [obs] knob-OFF DEPLOY-eval stays hack=0.000 at steps 0/10/20/30 (solve 0.250 -> 0.344 -> 0.396 -> 0.339). At step 28 the knob-ON path hacks 14/28 (7 run_tests + 7 file_marker) yet the deployed model shows 0 hack. +- [verdict, preliminary] the quarantine knob is holding BOTH the demoed (run_tests) and the never-labelled (file_marker, stdout_marker) cheats. The route gate only ever saw run_tests (`teacher OR hacked_E`, hacked_E==0 on held-out per the (a) table), so this is zero-held-out-label suppression. Baseline aggregate deploy_hack was 0.552 (file_marker alone 0.625); here aggregate is pinned at 0.000 through step 30. +- [caveat] aggregate deploy could in principle read 0 if the model just stopped hacking at deploy for unrelated reasons, but the knob-ON path visibly hacks file_marker (hk_fm=7) while knob-OFF reads 0 -> the knob, not non-emergence, is doing the work. lp_s healthy (-0.06 to -0.17), no collapse. resid ~0 (clean strip), qE ~0.5 (routing active). +- [next] confirm at end-of-run per_mode_deploy.json that file_marker/sentinel/stdout_marker deploy_hack are each ~0 (vs baseline 0.625/0.417/0.167) at matched solve. That is the A5 figure. + ## 2026-06-04 (d) — A5 baseline FINISHED: per-mode deploy split confirms all three held-out modes hack on-policy; job 104 (route2 suppression) now running Closes the `[next]` from (c): job 103 ran to step 200 and wrote the per-mode deploy artifact, so the aggregate `deploy_hack=0.552` now has its by-mode breakdown.