mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:00:59 +08:00
results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438. One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did remove the aligned component) yet hack still emerged, so the hack signal lives largely off the extracted axis under erase. Filled tab:ablation vanilla (77) + erase-online (76) rows, corrected stale job-id mapping (96/86/87/88 after requeue). Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,39 @@

Append-only. New entries at the top, date-stamped. Never edit old entries.

## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)

**Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the
A3 negative-control row: one-sided gradient erasure that re-extracts v_hack every 5 steps.
tab:ablation in docs/writeup/main.tex filled (#183 partial; erase-static job 96 still running).

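The erasure step and the cos_pre / cos_post diagnostics can be sketched as below. This is a minimal illustration, not the repo's implementation: it assumes "one-sided" means the component along v_hack is subtracted only when the gradient is positively aligned with it, and the names `erase_one_sided` and `cosine` and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot(a, b) / (na * nb)


def erase_one_sided(grad, v_hack):
    """Subtract the component of grad along v_hack, but only when it is positive.

    ASSUMPTION: this reading of "one-sided" is a guess for illustration;
    the real code may define it differently.
    """
    norm = math.sqrt(dot(v_hack, v_hack))
    if norm == 0.0:
        return list(grad)
    unit = [x / norm for x in v_hack]
    coeff = max(0.0, dot(grad, unit))  # clip negative alignment to zero
    return [g - coeff * u for g, u in zip(grad, unit)]


# Toy vectors (hypothetical, not taken from any run).
grad = [0.3, 1.0, -0.5]
v_hack = [1.0, 0.0, 0.0]

cos_pre = cosine(grad, v_hack)
cos_post = cosine(erase_one_sided(grad, v_hack), v_hack)
print(f"cos_pre={cos_pre:.3f} cos_post={cos_post:.3f}")
```

Under this reading, cos_post is zero by construction whenever cos_pre is positive, so a pinned cos_post only shows that the projection ran; it says nothing about gradient components off the extracted axis.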
### Observations (DEPLOY-eval, knob off, seed 41, n=64 T=0.7)

- [obs] erase online rf5: deploy hack 0.562 / solve 0.438 (HACK_S 0.504, PASS 0.291). Run
  20260603T032141. Hack climbed 0.0 (step5) -> 0.49 (step25) -> plateau ~0.5-0.6.
- [obs] vs vanilla s41 0.359/0.422 and route s41 0.000/0.625 (same preset/seed).
- [reason] erase ends *above* vanilla hack, not below. One-sided erasure of the extracted
  direction does not suppress hacking at deploy: the live GRPO gradient re-acquires the hack
  component faster than the per-5-step re-extraction strips it, OR the erased component is not
  the load-bearing one (cos_post ~0 each step confirms we removed the aligned part, yet hack
  still rises -- so the hack signal lives largely off the extracted axis for erase).
- [obs] cos_post pinned +0.000 every logged step (erase removes the aligned component as designed);
  cos_pre_s ~0.10-0.15 throughout. Mechanism worked, outcome metric did not move down.
- [reason] key contrast for the paper: route (quarantine whole gated rollouts into a throwaway
  knob) zeroes deploy hack; erase (subtract the component) does not. Routing the rollout, not
  erasing the direction, carries the effect.

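A rough check on how large the erase-vs-vanilla gap is relative to within-run sampling noise. This is a sketch under assumptions: it treats the n=64 deploy samples as independent Bernoulli draws, uses only the rounded rates reported above, and ignores seed-to-seed variance (this is a single seed, s41). The helper names are hypothetical.

```python
import math

N = 64  # deploy-eval samples per run, from the section heading (n=64)

# Deploy hack rates as reported in the observations above (seed 41).
hack_rate = {"vanilla": 0.359, "erase_online": 0.562, "route": 0.000}


def implied_count(rate, n=N):
    """Nearest integer count consistent with a rate rounded to 3 decimals."""
    return round(rate * n)


def diff_se(p1, p2, n=N):
    """Standard error of p1 - p2 for two independent binomial proportions."""
    return math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)


counts = {name: implied_count(r) for name, r in hack_rate.items()}
gap = hack_rate["erase_online"] - hack_rate["vanilla"]
se = diff_se(hack_rate["erase_online"], hack_rate["vanilla"])

print(counts)
print(f"erase - vanilla = {gap:+.3f}, binomial SE = {se:.3f}")
```

The standard error here covers sampling noise within one run only, so it supports "erase is not below vanilla" more safely than any claim about how far above vanilla it sits.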
### Subtle-failure checks

- [check] not a collapse: solve stayed 0.40-0.52, lp_s coherent (~-3 nats), no incoherence drift.
- [check] not a no-emergence artifact: hack DID emerge (0->0.56), so the substrate worked and erase
  genuinely failed to stop it (rules out "erase looks good only because nothing hacked").

### Next

- [todo] job 96 (erase static, frozen v_hack, s41) running -> fills the last erase row.
- [todo] controls 87 (random-V) / 86 (placebo) / 88 (post-hoc) -> directional-specificity rows.

## 2026-06-03 (a) — keynote A1/A2 closed at n=3: route cuts deploy hack -0.292 (paired p~=0.013)
**Context:** `probe/distill-cosine`. Job 77 (vanilla s41, the last missing keynote-band seed)