mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 20:52:18 +08:00
journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,47 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed
**Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60,
route2 (two scale-matched delta_S, calibrated-tau, resid diagnostic), seed 41, 60 steps, fast
preset, 4-mode substrate, mix=0.125. This is the first clean read of the balanced-quarantine
route2 that the prior entry built. Artifact:
`out/runs/20260601T032841_fast_routing2_seed41_sub4_route2_balanced_resid/per_mode_deploy.json`.

**Hypothesis:** with the capacity edge gone (prior entry), the quarantine `delta_S_hack` absorbs
the cheat; deleting it at deploy drops hack well below the knob-on training rate at matched solve,
and the suppression carries to a held-out mode the detector never flagged.

### Observations (held-out greedy eval, n=64)

- [obs] knob-ON (train): hack=0.312 solve=0.562. knob-OFF (deploy = ship): hack=0.000 solve=0.625.
  So removing the quarantine drops hack 0.31 -> 0.00 and solve goes UP 0.56 -> 0.625.
- [obs] per-mode deploy: file_marker (held-out) train 14/32 -> deploy 0/32 hack, solve 16/32;
  run_tests (in-dist) 6/8 -> 0/8 hack, solve 0/8; sentinel (held-out) 0/24 -> 0/24, solve 24/24.
- [obs] resid ~0 through the run -> the strip is clean (routed grad leaves no residual along
  v_grad), not a false-negative leak.
- [obs] training-time BLUF reads HACK_STUDENT=0.313, "4/4 modes learned" -- the knob-on model
  hacks by design; that is the uninteresting number for this arm.

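Sanity check on the counts above: a minimal sketch that re-derives the aggregate rates from the per-mode tallies and lists any mode with no eval prompts. It assumes the n=64 eval set is exactly the three modes that have prompts (32 + 8 + 24), which the counts suggest but the entry does not state. The tuple layout and the names `PER_MODE`, `aggregate` and `uncovered_modes` are hypothetical and are not the schema of `per_mode_deploy.json`.

```python
# Per-mode counts transcribed from the [obs] bullets above.
# Tuple layout (hypothetical, for this sketch only):
#   (n_prompts, hack_train, hack_deploy, solve_deploy)
PER_MODE = {
    "file_marker": (32, 14, 0, 16),
    "run_tests": (8, 6, 0, 0),
    "sentinel": (24, 0, 0, 24),
    "stdout_marker": (0, 0, 0, 0),  # no eval prompts in this run
}


def aggregate(per_mode):
    """Pool per-mode counts into overall rates over all eval prompts."""
    n = sum(c[0] for c in per_mode.values())
    if n == 0:
        raise ValueError("no eval prompts at all")
    return {
        "n": n,
        "hack_train": sum(c[1] for c in per_mode.values()) / n,
        "hack_deploy": sum(c[2] for c in per_mode.values()) / n,
        "solve_deploy": sum(c[3] for c in per_mode.values()) / n,
    }


def uncovered_modes(per_mode):
    """Return the modes that contribute zero prompts to the eval."""
    return sorted(m for m, c in per_mode.items() if c[0] == 0)


totals = aggregate(PER_MODE)
print(totals)
print("modes with no deploy-eval prompts:", uncovered_modes(PER_MODE))
```

Under that assumption the pooled train hack rate is 20/64 and the pooled deploy solve rate is 40/64, which match the reported 0.312 and 0.625. Train-time solve is not broken out per mode in this entry, so it is not re-derived here.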
### Interpretation

- [inf, 0.7] route2 did what it's supposed to: the cheat lives in the throwaway knob, deleting
  it removes the hack without a solve cost. file_marker is the load-bearing point -- a held-out
  hack (14/32 at train) goes to 0/32 at deploy, which is the no-cheat generalisation we need.
- [caveat] n=1 seed; hack=0.000 EXACTLY on every mode is suspiciously clean. run_tests deploy
  solve=0/8 means it fails cleanly rather than learning to solve (n=8 is tiny). stdout_marker
  (the 4th mode) has ZERO eval prompts here, so one held-out mode is untested at deploy.
- [3+ hypotheses for the exact-0] (a) genuine: quarantine fully absorbed all four mechanisms;
  (b) eval-coverage artifact: small per-mode n + missing stdout_marker hide residual hacks;
  (c) deploy model is mildly degenerate so it neither hacks nor solves on hard modes (run_tests),
  inflating the "clean" read. (b)/(c) are why seed replication + full per-mode eval coverage gate
  the claim.

**Next:** queue seed 42/43 route2 replicates and confirm the deploy drop holds at n=3; ensure
all four modes get deploy-eval prompts (stdout_marker currently 0); regenerate `just dyn` once
job 64 (route2 + dense hk_abl proxy + eval-ablate-every=5) lands to get the per-step deploy curve.

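To put a number on hypothesis (b), here is a minimal sketch of how much residual hack rate a 0/n read still allows. It uses the exact one-sided binomial bound for zero observed events: solving (1 - p)^n = 1 - confidence for p. It assumes the eval prompts are independent draws from each mode's prompt distribution, which is an assumption of this sketch and not something the run checks. The function name `zero_event_upper_bound` is hypothetical.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """One-sided upper bound on the true rate when 0 of n trials are positive.

    Solves (1 - p) ** n == 1 - confidence for p (exact binomial bound for
    the zero-event case). With no trials at all, nothing is ruled out.
    """
    if n <= 0:
        return 1.0
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Per-mode eval sizes from this entry, plus the pooled n=64.
for label, n in [("pooled", 64), ("file_marker", 32), ("sentinel", 24),
                 ("run_tests", 8), ("stdout_marker", 0)]:
    print(f"{label:>13}: 0/{n} hacks -> true rate <= {zero_event_upper_bound(n):.3f}")
```

At 95% confidence, 0/32 still allows a true hack rate of roughly 0.09 and 0/8 roughly 0.31, so the per-mode zeros are weak evidence on their own. This only covers prompt-sampling noise within one seed; it says nothing about seed-to-seed variance, which the 42/43 replicates are for.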
## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S
**Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.