results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)

Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: the train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row in tab:ablation into weight/act rows and recorded the own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 10:50:08 +00:00
parent f4ac26a440
commit 3cc804b15e
2 changed files with 48 additions and 1 deletions
@@ -2,6 +2,44 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,
n=192 prompts*group T=0.7). Tests whether the hack can be erased from an already-TRAINED
model at deploy time, rather than by intervening during training. Fills the two post-hoc
rows of tab:ablation. Refs #183 / #181 (this run had not produced numbers before).
### Observations (own baseline, NOT the job-77 vanilla row -- different ckpt)
- [obs] baseline (no erase): hack 0.391 / solve 0.302. Per-mode hack: file_marker 24/48,
run_tests 41/48, sentinel 2/48, stdout_marker 8/48.
- [obs] weight_erase (project trained delta_S orthogonal to gradient-space v_hack):
hack 0.391->0.297 (-0.094), solve 0.302->0.323 (flat). Every mode dips slightly
(run_tests 41->35, file_marker 24->18, stdout_marker 8->4, sentinel 2->0).
- [obs] act_erase (Arditi-style residual diff-of-means ablation, dir @layer 35/36,
sep=19.3, 4.5x mean): hack 0.391->0.000, solve 0.302->0.000. All modes 0/48.
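A minimal sketch of the two erasure operations in the observations above, assuming both reduce to a rank-1 projection. This is NOT the code in scripts/tt_erase_bench.py: `project_out` and `diff_of_means_direction` are hypothetical names, and treating delta_S and v_hack as flat vectors of equal length is an assumption made for illustration.

```python
import numpy as np

def project_out(x, v):
    """Remove the component of x along direction v.

    x is either a single vector (d,) or a batch of row vectors (n, d).
    """
    v_hat = v / np.linalg.norm(v)
    if x.ndim == 2:
        # Subtract each row's projection onto v_hat.
        return x - np.outer(x @ v_hat, v_hat)
    return x - (x @ v_hat) * v_hat

def diff_of_means_direction(acts_hack, acts_clean):
    """Unit direction from the difference of mean activations (hack minus clean)."""
    d = acts_hack.mean(axis=0) - acts_clean.mean(axis=0)
    return d / np.linalg.norm(d)

# weight_erase analogue (assumed form): flattened weight delta made orthogonal to v_hack.
rng = np.random.default_rng(0)
delta_s = rng.normal(size=16)   # stand-in for the trained delta_S
v_hack = rng.normal(size=16)    # stand-in for the gradient-space hack direction
delta_erased = project_out(delta_s, v_hack)

# act_erase analogue (assumed form): ablate a diff-of-means direction from residual activations.
acts_clean = rng.normal(size=(5, 16))
acts_hack = acts_clean + 3.0    # toy shift so the direction is known
direction = diff_of_means_direction(acts_hack, acts_clean)
acts_ablated = project_out(acts_clean, direction)
```

Both operations remove exactly one axis. The sketch says nothing about whether hack and solve share that axis in the real model, which is the question the bench answers empirically.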
### Inferences
- [reason] weight_erase removes only the component of trained delta_S aligned with the
extracted direction; the hacking behaviour mostly lives off that axis in the trained
weights, so erasing it barely moves hack (matches the train-time erase arms, which also
failed: static 0.500, online 0.562).
- [reason] act_erase zeroes hack ONLY by destroying generation entirely (solve also 0).
The hack direction at layer 35 is not separable from the capability direction post-hoc;
ablating it removes coherent output. This is the SHOULD-line failure the bench predicted:
"hack drops only when solve collapses -> erasure too blunt -> train-time intervention earns cost".
- [reason] contrast: train-time route gets hack 0.000 AND solve 0.625. So the value of
intervening DURING training (routing the gated rollout into a throwaway knob) is that the
hack and the capability are separated as they form, not after they have entangled in W.
### Subtle-failure checks
- [check] not a bench bug zeroing act_erase: weight_erase on the same harness gave a sane
non-zero result (0.297), so the eval path works; act_erase's 0/0 is a real model collapse.
- [check] baseline hack 0.391 is plausible for this ckpt (vanilla learned run_tests strongly,
41/48); the dominant mode survives weight_erase (35/48), consistent with off-axis hacking.
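Sanity arithmetic for the hack rates in this entry, assuming the aggregate rate is the per-mode hack counts pooled over 4 modes x 48 = 192. That matches n=192, but the pooling itself is an assumption; `hack_rate` is a hypothetical helper, not part of the bench.

```python
N_PER_MODE = 48

# Per-mode hack counts copied from the observations in this entry.
baseline = {"file_marker": 24, "run_tests": 41, "sentinel": 2, "stdout_marker": 8}
weight_erase = {"file_marker": 18, "run_tests": 35, "sentinel": 0, "stdout_marker": 4}

def hack_rate(counts, n_per_mode=N_PER_MODE):
    """Pooled hack rate: total hack count over total rollouts across modes."""
    return sum(counts.values()) / (n_per_mode * len(counts))

base = hack_rate(baseline)        # 75/192
erased = hack_rate(weight_erase)  # 57/192
print(round(base, 3), round(erased, 3), round(erased - base, 3))
# prints: 0.391 0.297 -0.094
```

The pooled counts reproduce the reported 0.391 and 0.297, so the per-mode and aggregate numbers are mutually consistent under this assumption.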
## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)
**Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the