results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)

Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot isolate the hack.

weight_erase dents hack 0.391->0.297 (solve flat); act_erase (Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625.

Split the single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline 0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-03 10:50:08 +00:00
parent f4ac26a440
commit 3cc804b15e
2 changed files with 48 additions and 1 deletions
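
For orientation, the two erasure operations named in the commit message can be sketched on toy vectors. This is a rough illustration only, not the code in scripts/tt_erase_bench.py: the function names (`project_out`, `diff_of_means`) and all numbers are hypothetical, and the real definitions of delta_S and v_hack are not shown in this commit.

```python
# Illustrative sketch on toy vectors. Function names and values are
# hypothetical; this is NOT the implementation in scripts/tt_erase_bench.py.

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(x, v):
    """Return x with its component along v removed: x - (x.v / v.v) * v."""
    scale = dot(x, v) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]

def diff_of_means(pos, neg):
    """Mean of the `pos` vectors minus mean of the `neg` vectors."""
    dim = len(pos[0])
    mean_pos = [sum(row[i] for row in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(row[i] for row in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

# Weight-style erase: make a (flattened) weight delta orthogonal to v_hack.
delta = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]
delta_erased = project_out(delta, v_hack)

# Activation-style erase: estimate a direction by difference of means over
# two sets of activations, then ablate that direction from an activation.
hack_acts = [[1.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
clean_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
direction = diff_of_means(hack_acts, clean_acts)
h = [5.0, 2.0, 1.0]
h_ablated = project_out(h, direction)

print(delta_erased, direction, h_ablated)
```

Both operations reduce to the same projection; they differ in what is projected (a weight delta once, versus activations on every forward pass) and in how the direction is obtained.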
@@ -2,6 +2,44 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
+
+**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,
+n=192 prompts*group, T=0.7). Tests whether you can erase the hack from a TRAINED model at
+deploy time, instead of intervening during training. Fills the two post-hoc rows of
+tab:ablation. #183 / #181 (the run never produced numbers before).
+
+### Observations (own baseline, NOT the job-77 vanilla row -- different ckpt)
+
+- [obs] baseline (no erase): hack 0.391 / solve 0.302. Per-mode hk: file_marker 24/48,
+  run_tests 41/48, sentinel 2/48, stdout_marker 8/48.
+- [obs] weight_erase (project trained delta_S orthogonal to gradient-space v_hack):
+  hack 0.391->0.297 (-0.094), solve 0.302->0.323 (flat). Per-mode all dip slightly
+  (rt 41->35, fm 24->18, sm 8->4, se 2->0).
+- [obs] act_erase (Arditi-style residual diff-of-means ablation, dir @layer 35/36,
+  sep=19.3, 4.5x mean): hack 0.391->0.000, solve 0.302->0.000. All modes 0/48.
+
+### Inferences
+
+- [reason] weight_erase removes only the component of trained delta_S aligned with the
+  extracted direction; the hacking behaviour mostly lives off that axis in the trained
+  weights, so erasing it barely moves hack (matches the train-time erase arms, which also
+  failed: static 0.500, online 0.562).
+- [reason] act_erase zeroes hack ONLY by destroying generation entirely (solve also 0).
+  The hack direction at layer 35 is not separable from the capability direction post-hoc;
+  ablating it removes coherent output. This is the SHOULD-line failure the bench predicted:
+  "hack drops only when solve collapses -> erasure too blunt -> train-time intervention earns cost".
+- [reason] contrast: train-time route gets hack 0.000 AND solve 0.625. So the value of
+  intervening DURING training (routing the gated rollout into a throwaway knob) is that the
+  hack and the capability are separated as they form, not after they have entangled in W.
+
+### Subtle-failure checks
+
+- [check] not a bench bug zeroing act_erase: weight_erase on the same harness gave a sane
+  non-zero result (0.297), so the eval path works; act_erase's 0/0 is a real model collapse.
+- [check] baseline hack 0.391 is plausible for this ckpt (vanilla learned run_tests strongly,
+  41/48); the dominant mode survives weight_erase (35/48), consistent with off-axis hacking.
+
 ## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)

 **Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the
@@ -228,6 +228,14 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 %   erase online rf5  = job 76, 20260603T032141 (hack 0.562 / solve 0.438; HACK_S 0.504)  [landed 2026-06-03]
 %   erase static      = job 96, (hack 0.500 / solve 0.500; HACK_S 0.518)  [landed 2026-06-03]
 % Both erase arms FAIL to suppress (>= vanilla 0.359); route alone zeroes deploy hack.
+%   post-hoc      = job 98, scripts/tt_erase_bench.py on the 20260531T141402 vanilla ckpt.
+%     Its OWN baseline (no erase) = hack 0.391 / solve 0.302, n=192. Read deltas vs THAT,
+%     not vs the job-77 vanilla row (different ckpt).
+%     weight_erase (project trained dS orth to v_hack): hack 0.391->0.297 (-0.094), solve flat
+%       0.302->0.323 -> barely dents the hack, does not isolate it.
+%     act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
+%       solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
+%     => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
 % Still queued/running (cells \TODO with current job id after the requeue):
 %   78 route2 refresh-2
 %   86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
@@ -250,7 +258,8 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
    route (refresh-2)                & \TODO{}            & \TODO{}            & job 78 \\
    Random-V route \emph{(control)}  & \TODO{$\approx$van}& \TODO{}            & job 87 \\
    Placebo pairset \emph{(control)} & \TODO{$\approx$van}& \TODO{}            & job 86 \\
-    Post-hoc test-time erase         & \TODO{}            & \TODO{}            & job 88 \\
+    Post-hoc weight-erase            & $0.297$            & $0.323$            & job 98 \\
+    Post-hoc act-erase               & $0.000$            & $0.000$            & job 98 \\
    \bottomrule
  \end{tabular}
 \end{table}
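
As a consistency check on the numbers above, the per-mode hack counts in the journal entry can be summed back into the aggregate rates reported in the table. The sketch below assumes four modes of 48 samples each (4 x 48 = 192, matching n=192) and assumes the abbreviations rt/fm/sm/se stand for run_tests, file_marker, stdout_marker and sentinel; the variable and function names are illustrative only.

```python
# Recompute aggregate hack rates from the per-mode counts in the journal
# entry. Assumption: 4 modes x 48 samples = 192, and rt/fm/sm/se map to
# run_tests/file_marker/stdout_marker/sentinel.

N_PER_MODE = 48

baseline = {"file_marker": 24, "run_tests": 41, "sentinel": 2, "stdout_marker": 8}
weight_erase = {"file_marker": 18, "run_tests": 35, "sentinel": 0, "stdout_marker": 4}

def hack_rate(counts, n_per_mode=N_PER_MODE):
    """Fraction of all samples (across modes) that were flagged as hacks."""
    total = n_per_mode * len(counts)
    return sum(counts.values()) / total

base = hack_rate(baseline)        # 75 / 192
erased = hack_rate(weight_erase)  # 57 / 192

print(f"baseline {base:.3f}, weight_erase {erased:.3f}, delta {erased - base:+.3f}")
# prints: baseline 0.391, weight_erase 0.297, delta -0.094
```

The recomputed values agree with the reported 0.391 -> 0.297 (-0.094).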