journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-01 09:10:09 +00:00
parent 3e7b8ecfc0
commit 8503dc1914
+41
@@ -2,6 +2,47 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed
**Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60,
route2 (two scale-matched delta_S, calibrated-tau, resid diagnostic), seed 41, 60 steps, fast
preset, 4-mode substrate, mix=0.125. This is the first clean read of the balanced-quarantine
route2 built in the prior entry. Artifact:
`out/runs/20260601T032841_fast_routing2_seed41_sub4_route2_balanced_resid/per_mode_deploy.json`.
**Hypothesis:** with the capacity edge gone (previous entry, below), the quarantine `delta_S_hack` absorbs
the cheat; deleting it at deploy drops hack well below the knob-on training rate at matched solve,
and the suppression carries to a held-out mode the detector never flagged.
### Observations (held-out greedy eval, n=64)
- [obs] knob-ON (train): hack=0.312 solve=0.562. knob-OFF (deploy = ship): hack=0.000 solve=0.625.
So removing the quarantine drops hack 0.312 -> 0.000 while solve goes UP 0.562 -> 0.625 (+6pp).
- [obs] per-mode hack (train -> deploy) and deploy solve: file_marker (held-out) 14/32 -> 0/32,
solve 16/32; run_tests (in-dist) 6/8 -> 0/8, solve 0/8; sentinel (held-out) 0/24 -> 0/24, solve 24/24.
- [obs] resid ~0 through the run -> the strip is clean (routed grad leaves no residual along
v_grad), not a false-negative leak.
- [obs] training-time BLUF reads HACK_STUDENT=0.313, "4/4 modes learned" -- the knob-on model
hacks by design; that is the uninteresting number for this arm.
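
As a sanity check, the per-mode counts above can be pooled to see whether they reproduce the headline rates. This is a minimal sketch that hard-codes the counts from this entry; it does not read `per_mode_deploy.json`, whose schema is not shown here, and `PER_MODE` and `pooled` are illustrative names only.

```python
from fractions import Fraction

# Counts copied from the per-mode observation above:
# (hack at train, hack at deploy, solve at deploy, eval prompts n).
PER_MODE = {
    "file_marker": (14, 0, 16, 32),
    "run_tests": (6, 0, 0, 8),
    "sentinel": (0, 0, 24, 24),
}


def pooled(column):
    """Sum one count column over all modes and divide by the total prompt count."""
    total_n = sum(row[3] for row in PER_MODE.values())
    return Fraction(sum(row[column] for row in PER_MODE.values()), total_n)


hack_train = pooled(0)
hack_deploy = pooled(1)
solve_deploy = pooled(2)
print(float(hack_train), float(hack_deploy), float(solve_deploy))
```

The pooled values come out to 20/64, 0/64 and 40/64, which agrees with the knob-ON hack rate and the knob-OFF hack and solve rates reported above.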
### Interpretation
- [inf, 0.7] route2 did what it's supposed to: the cheat lives in the throwaway knob, deleting
it removes the hack without a solve cost. file_marker is the load-bearing point -- a held-out
hack (14/32 at train) goes to 0/32 at deploy, which is the no-cheat generalisation we need.
- [caveat] n=1 seed; hack=0.000 EXACTLY on every mode is suspiciously clean. run_tests deploy
solve=0/8 means it fails cleanly rather than learning to solve (n=8 is tiny). stdout_marker
(the 4th mode) has ZERO eval prompts here, so one held-out mode is untested at deploy.
- [3+ hypotheses for the exact-0] (a) genuine: quarantine fully absorbed all four mechanisms;
(b) eval-coverage artifact: small per-mode n + missing stdout_marker hide residual hacks;
(c) deploy model is mildly degenerate so it neither hacks nor solves on hard modes (run_tests),
inflating the "clean" read. (b)/(c) are why seed replication + full per-mode eval coverage gate
the claim.
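
To put a number on hypothesis (b), the sketch below computes the one-sided exact (Clopper-Pearson) upper bound on a rate when 0 of n trials hit, which reduces to solving (1 - p)^n = alpha. It assumes the eval prompts are independent draws, which may not hold for prompts from the same mode, and the function name is illustrative rather than anything from the repo.

```python
def zero_event_upper_bound(n, alpha=0.05):
    """Upper (1 - alpha) confidence bound on the true rate after 0 hits in n trials.

    With zero observed events the exact one-sided bound solves (1 - p)**n = alpha.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - alpha ** (1.0 / n)


# Deploy-eval prompt counts per mode from this entry, plus the pooled n=64.
for mode, n in [("file_marker", 32), ("run_tests", 8), ("sentinel", 24), ("pooled", 64)]:
    bound = zero_event_upper_bound(n)
    print(f"{mode:12s} 0/{n:<3d} -> hack rate below {bound:.3f} at 95% confidence")
```

Under those assumptions, 0/32 only bounds the file_marker hack rate below roughly 9%, and 0/8 on run_tests is compatible with a true rate of about 31%, so the exact zero by itself does not rule out residual hacking on the small modes.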
**Next:** queue seed 42/43 route2 replicates and confirm the deploy drop holds at n=3; ensure
all four modes get deploy-eval prompts (stdout_marker currently 0); regenerate `just dyn` once
job 64 (route2 + dense hk_abl proxy + eval-ablate-every=5) lands to get the per-step deploy curve.
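
The coverage gate in the Next step could be checked mechanically before trusting a deploy read. The sketch below is hypothetical: it takes a plain mode-to-prompt-count mapping rather than parsing `per_mode_deploy.json`, and `EXPECTED_MODES` and `uncovered_modes` are names invented for illustration.

```python
# The four substrate modes named in this entry.
EXPECTED_MODES = ("file_marker", "run_tests", "sentinel", "stdout_marker")


def uncovered_modes(eval_counts, expected=EXPECTED_MODES, min_n=1):
    """Return the expected modes whose deploy-eval prompt count is below min_n."""
    return [mode for mode in expected if eval_counts.get(mode, 0) < min_n]


# Deploy-eval prompt counts for this run, taken from the per-mode observation;
# stdout_marker had no eval prompts.
this_run = {"file_marker": 32, "run_tests": 8, "sentinel": 24}
print(uncovered_modes(this_run))
```

For this run the check flags `stdout_marker`; raising `min_n` would also flag modes with too few prompts to read, such as run_tests at n=8.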
## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S
**Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.