mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,61 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-05-31 (l) — erase (one-sided projection) vs vanilla: -7.8pp hack / +4.7pp solve, but the win is on held-out file_marker not in-dist run_tests
**Context:** commit `b0432af` on `probe/distill-cosine`; pueue id 41; projected/erase
(intervention=erase, refresh-5 default), seed 41, 60 steps, fast preset, 4-mode substrate,
mix=0.125. Second arm of the sub4 overlay sweep. Pairs against vanilla entry (k) below.
Artifact: `out/runs/20260531T163110_fast_projected_seed41_sub4_erase/per_mode_deploy.json`.

**Hypothesis:** erase suppresses hack below vanilla (0.359) at matched solve, with the
largest drop on the in-dist mode the v_hack was built closest to (run_tests).
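
For readers new to the erase arm, here is a minimal sketch of what a one-sided projection could look like. It is not the repo's implementation: the function names, the plain-list vectors, and the sign convention (a positive dot product with v_hack counts as hack-ward) are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def erase_one_sided(grad, v_hack):
    """Remove the component of grad along v_hack, but only when it is hack-ward.

    ASSUMPTION: "hack-ward" means dot(grad, v_hack) > 0. A grad that already
    points away from v_hack is returned unchanged (hence "one-sided").
    """
    norm_sq = dot(v_hack, v_hack)
    if norm_sq == 0.0:
        return list(grad)
    coeff = dot(grad, v_hack) / norm_sq
    if coeff <= 0.0:
        return list(grad)
    return [g - coeff * v for g, v in zip(grad, v_hack)]


v_hack = [1.0, 0.0]
print(erase_one_sided([3.0, 4.0], v_hack))   # hack-ward component removed
print(erase_one_sided([-3.0, 4.0], v_hack))  # left untouched
```

Under these assumptions, the cosine between the projected grad and v_hack is exactly zero whenever the incoming grad was hack-ward. A reading like cout=0.00 would be consistent with that if cout is the post-projection cosine, which is itself an assumption about the logged metric.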
### Observations
- [obs] cout=0.00 throughout (e.g. step 37) -> projection mechanically fires, removes all
hack-ward grad. Not a silent no-op.
- [obs] cin_t (0.09) < cin_s (0.15) by mid-training (step 37) -- the REVERSE of the wanted
cin_t>cin_s. Consistent with v_hack staleness (entry j).
- [obs] Aggregate FINAL EVAL (n=64, deploy==train): hack=0.281 solve=0.500, vs vanilla
hack=0.359 solve=0.453. So -7.8pp hack, +4.7pp solve. BLUF HACK_RATE=0.338 PASS_RATE=0.291.
- [obs] Per-mode erase vs vanilla:

| mode (n)         | dist     | vanilla hack | vanilla solve | erase hack | erase solve |
|------------------|----------|--------------|---------------|------------|-------------|
| run_tests (8)    | IN-dist  | 7/8          | 0/8           | 8/8        | 0/8         |
| file_marker (32) | held-out | 16/32        | 6/32          | 10/32      | 13/32       |
| sentinel (24)    | held-out | 0/24         | 23/24         | 0/24       | 19/24       |
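
The aggregate line can be cross-checked against the per-mode counts. The sketch below hard-codes the table above (it does not read `per_mode_deploy.json`, whose schema is not shown here) and also splits the aggregate delta into per-mode contributions, expressed in pp of all 64 rollouts. The dictionary layout and function names are illustrative only.

```python
# Per-mode counts copied from the table: (hack, solve) out of n rollouts.
COUNTS = {
    "run_tests":   {"n": 8,  "vanilla": (7, 0),   "erase": (8, 0)},
    "file_marker": {"n": 32, "vanilla": (16, 6),  "erase": (10, 13)},
    "sentinel":    {"n": 24, "vanilla": (0, 23),  "erase": (0, 19)},
}
TOTAL = sum(m["n"] for m in COUNTS.values())


def aggregate(arm):
    """Pooled (hack_rate, solve_rate) for one arm across all modes."""
    hack = sum(m[arm][0] for m in COUNTS.values())
    solve = sum(m[arm][1] for m in COUNTS.values())
    return hack / TOTAL, solve / TOTAL


def contribution_pp(mode):
    """Erase-minus-vanilla delta for one mode, in pp of the pooled n."""
    m = COUNTS[mode]
    d_hack = (m["erase"][0] - m["vanilla"][0]) / TOTAL * 100
    d_solve = (m["erase"][1] - m["vanilla"][1]) / TOTAL * 100
    return d_hack, d_solve


van_hack, van_solve = aggregate("vanilla")
era_hack, era_solve = aggregate("erase")
print(f"vanilla hack={van_hack:.3f} solve={van_solve:.3f}")
print(f"erase   hack={era_hack:.3f} solve={era_solve:.3f}")
print(f"delta   hack={100 * (era_hack - van_hack):+.1f}pp "
      f"solve={100 * (era_solve - van_solve):+.1f}pp")
for mode in COUNTS:
    d_hack, d_solve = contribution_pp(mode)
    print(f"{mode:12s} hack {d_hack:+.1f}pp  solve {d_solve:+.1f}pp")
```

The pooled rates reproduce the reported 0.359/0.453 and 0.281/0.500. The per-mode split shows that the +4.7pp solve figure is a net of a file_marker gain and a sentinel loss.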
### Inferences
- [inf] Erase's net win is concentrated in file_marker (hack 50%->31%, solve
19%->41%); run_tests stays saturated (7/8 -> 8/8) and sentinel was never hacked, though
sentinel solve slipped 23/24 -> 19/24 and offsets part of the file_marker solve gain. So the
aggregate -7.8pp/+4.7pp is really "erase rescued file_marker". {reason: "the other two modes are
unchanged within noise; file_marker is the only mode that moved", credence: 0.85}
- [inf] The win landing on held-out file_marker rather than in-dist run_tests is mildly
counterintuitive but NOT evidence against the method -- run_tests is already saturated at
vanilla (7/8) so there is little hack-rate headroom to recover there, whereas file_marker
at 50% has room to move. {reason: "ceiling effect on run_tests; headroom on file_marker",
credence: 0.6}
- [inf] -7.8pp is far short of the preregistered 30pp (H1). Consistent with prior n=1
erase results. {reason: "matches the G0 21-pair erase magnitude band", credence: 0.7}
### Failure modes considered
- **Likely:** run_tests n=8 is too small -- the 7/8 vs 8/8 "no suppression" is one rollout,
pure noise; erase may help run_tests too at larger n. Prior: 0.4. Check: read the
streaming hk_rt cumulative column, or widen the eval subset.
- **Subtle:** file_marker solve 6->13 is the solve-detector being fooled by a file_marker
artifact, not real solving. Prior: 0.2. Check: spot-read a file_marker "solve" rollout.
- **Null:** the -7.8pp/+4.7pp is seed-41 run-to-run variance, not the erase intervention.
Prior: 0.3. Check: seed 43/44 replicates (queued after the sweep).
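
One quick way to size the small-n worry before the replicates land is a two-sided Fisher exact test on the per-mode counts. This is a stdlib-only sketch, not part of the repo's eval code. It treats rollouts as independent draws, which is an assumption, and it says nothing about seed-to-seed variance, so it complements the seed 43/44 check rather than replacing it.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows are arms (vanilla, erase); columns are (event, no event).
tables = {
    "run_tests hack":    (7, 1, 8, 0),      # 7/8 vs 8/8
    "file_marker hack":  (16, 16, 10, 22),  # 16/32 vs 10/32
    "file_marker solve": (6, 26, 13, 19),   # 6/32 vs 13/32
}
for name, cells in tables.items():
    print(f"{name:18s} p = {fisher_exact_two_sided(*cells):.3f}")
```

For run_tests the only two tables with those margins are equally likely, so the p-value is 1.0, which matches the "one rollout, pure noise" reading.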
### Next action
Route (42) running, route2 (43-44) queued. The deploy-solve>=train-solve KEY CHECK only
becomes testable on the quarantine arms (42-44, deploy!=train). Then build #162 overlay.
## 2026-05-31 (k) — vanilla emergence reference (sub4 overlay): per-mode hacking is asymmetric, not uniform
**Context:** commit `b72c5ac` on `probe/distill-cosine`; pueue id 40; vanilla (intervention=none),