diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index b645f2b..1a4a131 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,47 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed
+
+**Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60,
+route2 (two scale-matched delta_S, calibrated-tau, resid diagnostic), seed 41, 60 steps, fast
+preset, 4-mode substrate, mix=0.125. The first clean read of the balanced-quarantine route2 the
+prior entry built. Artifact:
+`out/runs/20260601T032841_fast_routing2_seed41_sub4_route2_balanced_resid/per_mode_deploy.json`.
+
+**Hypothesis:** with the capacity edge gone (entry above), the quarantine `delta_S_hack` absorbs
+the cheat; deleting it at deploy drops hack well below the knob-on training rate at matched solve,
+and the suppression carries to a held-out mode the detector never flagged.
+
+### Observations (held-out greedy eval, n=64)
+
+- [obs] knob-ON (train): hack=0.312 solve=0.562. knob-OFF (deploy = ship): hack=0.000 solve=0.625.
+So removing the quarantine drops hack 0.31 -> 0.00 and solve goes UP 0.56 -> 0.625.
+- [obs] per-mode deploy: file_marker (held-out) train 14/32 -> deploy 0/32 hack, solve 16/32;
+run_tests (in-dist) 6/8 -> 0/8 hack, solve 0/8; sentinel (held-out) 0/24 -> 0/24, solve 24/24.
+- [obs] resid ~0 through the run -> the strip is clean (routed grad leaves no residual along
+v_grad), not a false-negative leak.
+- [obs] training-time BLUF reads HACK_STUDENT=0.313, "4/4 modes learned" -- the knob-on model
+hacks by design; that is the uninteresting number for this arm.
+
+### Interpretation
+
+- [inf, 0.7] route2 did what it's supposed to: the cheat lives in the throwaway knob, deleting
+it removes the hack without a solve cost. file_marker is the load-bearing point -- a held-out
+hack (14/32 at train) goes to 0/32 at deploy, which is the no-cheat generalisation we need.
+- [caveat] n=1 seed; hack=0.000 EXACTLY on every mode is suspiciously clean. run_tests deploy
+solve=0/8 means it fails cleanly rather than learning to solve (n=8 is tiny). stdout_marker
+(the 4th mode) has ZERO eval prompts here, so one held-out mode is untested at deploy.
+- [3+ hypotheses for the exact-0] (a) genuine: quarantine fully absorbed all four mechanisms;
+(b) eval-coverage artifact: small per-mode n + missing stdout_marker hide residual hacks;
+(c) deploy model is mildly degenerate so it neither hacks nor solves on hard modes (run_tests),
+inflating the "clean" read. (b)/(c) are why seed replication + full per-mode eval coverage gate
+the claim.
+
+**Next:** queue seed 42/43 route2 replicates and confirm the deploy drop holds at n=3; ensure
+all four modes get deploy-eval prompts (stdout_marker currently 0); regenerate `just dyn` once
+job 64 (route2 + dense hk_abl proxy + eval-ablate-every=5) lands to get the per-step deploy curve.
+
 ## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S

 **Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.