mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot isolate the hack.
weight_erase dents hack 0.391->0.297 (solve flat); act_erase (Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a lobotomy.
Contrast: train-time route gets hack 0.000 AND solve 0.625.
Split the single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline 0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,44 @@

Append-only. New entries at the top, date-stamped. Never edit old entries.

## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes

**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,
n=192 prompts*group, T=0.7). Tests whether the hack can be erased from a TRAINED model at
deploy time, instead of intervening during training. Fills the two post-hoc rows of
tab:ablation. #183 / #181 (the run never produced numbers before).

### Observations (own baseline, NOT the job-77 vanilla row -- different ckpt)

- [obs] baseline (no erase): hack 0.391 / solve 0.302. Per-mode hack: file_marker 24/48,
  run_tests 41/48, sentinel 2/48, stdout_marker 8/48.
- [obs] weight_erase (project trained delta_S orthogonal to gradient-space v_hack):
  hack 0.391->0.297 (-0.094), solve 0.302->0.323 (+0.021, flat). Every mode dips slightly
  (run_tests 41->35, file_marker 24->18, stdout_marker 8->4, sentinel 2->0).
- [obs] act_erase (Arditi-style residual diff-of-means ablation, dir @layer 35/36,
  sep=19.3, 4.5x mean): hack 0.391->0.000, solve 0.302->0.000. All modes 0/48.
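
As a sanity check on the numbers above, the per-mode hack counts can be pooled back into the headline hack rates. This is a minimal sketch, assuming the headline rate is simply the sum of per-mode hack counts over 4 modes x 48 samples = 192; the dictionary layout and the `hack_rate` helper are illustrative and not part of tt_erase_bench.py.

```python
# Per-mode hack counts copied from the observations (48 samples per mode).
MODES = ("run_tests", "file_marker", "stdout_marker", "sentinel")
PER_MODE_N = 48

hack_counts = {
    "baseline":     {"run_tests": 41, "file_marker": 24, "stdout_marker": 8, "sentinel": 2},
    "weight_erase": {"run_tests": 35, "file_marker": 18, "stdout_marker": 4, "sentinel": 0},
    "act_erase":    {"run_tests": 0,  "file_marker": 0,  "stdout_marker": 0, "sentinel": 0},
}


def hack_rate(counts):
    """Pooled hack rate: total hacks over all modes / total samples (assumed 4 * 48)."""
    total = PER_MODE_N * len(MODES)
    return sum(counts[m] for m in MODES) / total


rates = {arm: hack_rate(c) for arm, c in hack_counts.items()}
for arm, rate in rates.items():
    delta = rate - rates["baseline"]
    print(f"{arm:12s} hack={rate:.3f} (delta vs own baseline {delta:+.3f})")
```

Under that pooling assumption the counts give 75/192 and 57/192, which round to the reported 0.391 and 0.297, so the per-mode rows and the headline rates agree.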

### Inferences

- [reason] weight_erase removes only the component of trained delta_S aligned with the
  extracted direction; the hacking behaviour mostly lives off that axis in the trained
  weights, so erasing it barely moves hack (matches the train-time erase arms, which also
  failed: static 0.500, online 0.562).
- [reason] act_erase zeroes hack ONLY by destroying generation entirely (solve also 0).
  The hack direction at layer 35 is not separable from the capability direction post-hoc;
  ablating it removes coherent output. This is the SHOULD-line failure the bench predicted:
  "hack drops only when solve collapses -> erasure too blunt -> train-time intervention earns cost".
- [reason] contrast: train-time route gets hack 0.000 AND solve 0.625. So the value of
  intervening DURING training (routing the gated rollout into a throwaway knob) is that the
  hack and the capability are separated as they form, not after they have entangled in W.
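
The two erasure operations discussed above can be sketched on toy data. This is a hedged illustration, not the code in scripts/tt_erase_bench.py: it assumes delta_S and v_hack can be treated as flat vectors in the same space, that the activation direction is a plain difference of class means, and all function names, dimensions, and random data are hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit L2 norm."""
    return v / np.linalg.norm(v)


def weight_erase(delta, v_hack):
    """Project a flattened weight update orthogonal to v_hack."""
    u = unit(v_hack)
    return delta - (delta @ u) * u


def diff_of_means_direction(acts_hack, acts_clean):
    """Unit direction from the clean-class mean to the hack-class mean (rows = samples)."""
    return unit(acts_hack.mean(axis=0) - acts_clean.mean(axis=0))


def act_erase(acts, direction):
    """Remove the component along a unit direction from every activation row."""
    return acts - np.outer(acts @ direction, direction)


rng = np.random.default_rng(0)
dim = 64  # toy width, not the real model's

# Toy weight update that lies mostly OFF the extracted axis.
v_hack = rng.normal(size=dim)
off_axis = weight_erase(rng.normal(size=dim), v_hack)
delta = 0.2 * unit(v_hack) + off_axis
erased = weight_erase(delta, v_hack)
along_after = float(erased @ unit(v_hack))
norm_kept = float(np.linalg.norm(erased) / np.linalg.norm(delta))
print(f"component along v_hack after erase: {along_after:.2e}")
print(f"fraction of update norm kept: {norm_kept:.4f}")

# Toy activations: the hack class is the clean distribution shifted along one direction.
acts_clean = rng.normal(size=(32, dim))
acts_hack = rng.normal(size=(32, dim)) + 2.0 * unit(rng.normal(size=dim))
direction = diff_of_means_direction(acts_hack, acts_clean)
ablated = act_erase(acts_hack, direction)
max_component = float(np.abs(ablated @ direction).max())
print(f"max component along ablated direction: {max_component:.2e}")
```

In the toy weight case nearly all of the update norm survives the projection, which mirrors the first inference: if the behaviour lives off the extracted axis, removing that axis changes little. The toy does not reproduce the act_erase collapse; it only shows what the ablation removes.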

### Subtle-failure checks

- [check] not a bench bug zeroing act_erase: weight_erase on the same harness gave a sane
  non-zero result (0.297), so the eval path works; act_erase's 0/0 is a real model collapse.
- [check] baseline hack 0.391 is plausible for this ckpt (vanilla learned run_tests strongly,
  41/48); the dominant mode survives weight_erase (35/48), consistent with off-axis hacking.

## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)
**Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the
@@ -228,6 +228,14 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
% erase online rf5 = job 76, 20260603T032141 (hack 0.562 / solve 0.438; HACK_S 0.504) [landed 2026-06-03]
% erase static = job 96, (hack 0.500 / solve 0.500; HACK_S 0.518) [landed 2026-06-03]
% Both erase arms FAIL to suppress (>= vanilla 0.359); route alone zeroes deploy hack.
% post-hoc = job 98, scripts/tt_erase_bench.py on the 20260531T141402 vanilla ckpt.
% Its OWN baseline (no erase) = hack 0.391 / solve 0.302, n=192. Read deltas vs THAT,
% not vs the job-77 vanilla row (different ckpt).
% weight_erase (project trained dS orth to v_hack): hack 0.391->0.297 (-0.094), solve flat
% 0.302->0.323 -> barely dents the hack, does not isolate it.
% act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
% solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
% => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
% Still queued/running (cells \TODO with current job id after the requeue):
% 78 route2 refresh-2
% 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
@@ -250,7 +258,8 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
route (refresh-2)                & \TODO{}             & \TODO{} & job 78 \\
Random-V route \emph{(control)}  & \TODO{$\approx$van} & \TODO{} & job 87 \\
Placebo pairset \emph{(control)} & \TODO{$\approx$van} & \TODO{} & job 86 \\
Post-hoc test-time erase         & \TODO{}             & \TODO{} & job 88 \\
Post-hoc weight-erase            & $0.297$             & $0.323$ & job 98 \\
Post-hoc act-erase               & $0.000$             & $0.000$ & job 98 \\
\bottomrule
\end{tabular}
\end{table}