journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 20:52:18 +08:00 · 2026-06-01 09:10:09 +00:00
parent e1df929a13
commit 010259fe62
1 changed files with 41 additions and 0 deletions
@@ -2,6 +2,47 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed
+
+**Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60,
+route2 (two scale-matched delta_S, calibrated-tau, resid diagnostic), seed 41, 60 steps, fast
+preset, 4-mode substrate, mix=0.125. The first clean read of the balanced-quarantine route2 the
+prior entry built. Artifact:
+`out/runs/20260601T032841_fast_routing2_seed41_sub4_route2_balanced_resid/per_mode_deploy.json`.
+
+**Hypothesis:** with the capacity edge gone (entry above), the quarantine `delta_S_hack` absorbs
+the cheat; deleting it at deploy drops hack well below the knob-on training rate at matched solve,
+and the suppression carries to a held-out mode the detector never flagged.
+
+### Observations (held-out greedy eval, n=64)
+
+- [obs] knob-ON (train): hack=0.312 solve=0.562. knob-OFF (deploy = ship): hack=0.000 solve=0.625.
+  So removing the quarantine drops hack 0.31 -> 0.00 and solve goes UP 0.562 -> 0.625 (+6pp).
+- [obs] per-mode (hack train -> deploy, then deploy solve): file_marker (held-out) 14/32 -> 0/32
+  hack, solve 16/32; run_tests (in-dist) 6/8 -> 0/8 hack, solve 0/8; sentinel (held-out)
+  0/24 -> 0/24 hack, solve 24/24.
+- [obs] resid ~0 through the run -> the strip is clean (routed grad leaves no residual along
+  v_grad), not a false-negative leak.
+- [obs] training-time BLUF reads HACK_STUDENT=0.313, "4/4 modes learned" -- the knob-on model
+  hacks by design; that is the uninteresting number for this arm.
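As a quick arithmetic cross-check on the observations above, the per-mode counts can be pooled and compared against the aggregate rates. This is a minimal sketch, not part of the commit: the tuple layout and the `pooled` helper are illustrative assumptions, and the numbers are copied by hand from the bullets rather than read from `per_mode_deploy.json`, whose schema is not shown here.

```python
from fractions import Fraction

# Per-mode counts copied from the observations: (n, hack_train, hack_deploy, solve_deploy).
# stdout_marker is omitted because it has no eval prompts in this run.
# The layout of this dict is an assumption made for illustration only.
per_mode = {
    "file_marker": (32, 14, 0, 16),
    "run_tests": (8, 6, 0, 0),
    "sentinel": (24, 0, 0, 24),
}


def pooled(idx):
    """Pool one count column across modes and return it as an exact fraction of total n."""
    total_n = sum(row[0] for row in per_mode.values())
    hits = sum(row[idx] for row in per_mode.values())
    return Fraction(hits, total_n)


n_total = sum(row[0] for row in per_mode.values())
hack_train = pooled(1)
hack_deploy = pooled(2)
solve_deploy = pooled(3)

print(n_total, float(hack_train), float(hack_deploy), float(solve_deploy))
# -> 64 0.3125 0.0 0.625
```

The pooled values match n=64, knob-ON hack 0.312, knob-OFF hack 0.000 and knob-OFF solve 0.625. Knob-ON solve (0.562) cannot be rebuilt this way because the entry does not list per-mode train solve counts.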
+
+### Interpretation
+
+- [inf, 0.7] route2 did what it's supposed to: the cheat lives in the throwaway knob, and deleting
+  it removes the hack without a solve cost. file_marker is the load-bearing point -- a held-out
+  hack (14/32 at train) goes to 0/32 at deploy, which is the no-cheat generalisation we need.
+- [caveat] n=1 seed; hack=0.000 EXACTLY on every evaluated mode is suspiciously clean. run_tests
+  deploy solve=0/8 means the deploy model fails there without hacking rather than learning to
+  solve (n=8 is tiny). stdout_marker (the 4th mode) has ZERO eval prompts here, so one held-out
+  mode is untested at deploy.
+- [3+ hypotheses for the exact-0] (a) genuine: quarantine fully absorbed all four mechanisms;
+  (b) eval-coverage artifact: small per-mode n + missing stdout_marker hide residual hacks;
+  (c) deploy model is mildly degenerate so it neither hacks nor solves on hard modes (run_tests),
+  inflating the "clean" read. (b)/(c) are why seed replication + full per-mode eval coverage gate
+  the claim.
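To put a number on hypothesis (b), one can ask how large a true hack rate is still compatible with seeing zero hacks in n prompts. The sketch below uses the standard one-sided exact (Clopper-Pearson) bound for zero successes, which reduces to 1 - alpha^(1/n). It assumes prompts within a mode are independent draws, which the journal does not establish, and it says nothing about seed-to-seed variance. The function name and the `deploy_n` dict are illustrative, not taken from the repo.

```python
def zero_hit_upper_bound(n, confidence=0.95):
    """One-sided exact upper bound on a rate when 0 of n independent trials hit."""
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    # With zero hits the bound solves (1 - p)^n = alpha for p.
    return 1.0 - alpha ** (1.0 / n)


# Deploy-eval prompt counts per mode, copied from the observations (assumed layout).
deploy_n = {"file_marker": 32, "run_tests": 8, "sentinel": 24, "pooled": 64}

bounds = {mode: zero_hit_upper_bound(n) for mode, n in deploy_n.items()}
for mode, ub in bounds.items():
    print(f"{mode}: 0/{deploy_n[mode]} hacks, 95% upper bound on hack rate = {ub:.3f}")
```

Under these assumptions the bound is roughly 0.09 for file_marker (n=32) and roughly 0.31 for run_tests (n=8), so the run_tests zero alone cannot rule out a deploy hack rate near the pooled knob-ON rate.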
+
+**Next:** queue seed 42/43 route2 replicates and confirm the deploy drop holds at n=3; ensure
+all four modes get deploy-eval prompts (stdout_marker currently 0); regenerate `just dyn` once
+job 64 (route2 + dense hk_abl proxy + eval-ablate-every=5) lands to get the per-step deploy curve.
+
 ## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S

 **Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.