diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index dce94c0..631455f 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,44 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
+
+**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,
+n=192 prompts*group T=0.7). Tests whether you can erase the hack from a TRAINED model at
+deploy time, instead of intervening during training. Fills the two post-hoc rows of
+tab:ablation. #183 / #181 (the run never produced numbers before).
+
+### Observations (own baseline, NOT the job-77 vanilla row -- different ckpt)
+
+- [obs] baseline (no erase): hack 0.391 / solve 0.302. Per-mode hk: file_marker 24/48,
+  run_tests 41/48, sentinel 2/48, stdout_marker 8/48.
+- [obs] weight_erase (project trained delta_S orthogonal to gradient-space v_hack):
+  hack 0.391->0.297 (-0.094), solve 0.302->0.323 (flat). Per-mode all dip slightly
+  (rt 41->35, fm 24->18, sm 8->4, se 2->0).
+- [obs] act_erase (Arditi-style residual diff-of-means ablation, dir @layer 35/36,
+  sep=19.3, 4.5x mean): hack 0.391->0.000, solve 0.302->0.000. All modes 0/48.
+
+### Inferences
+
+- [reason] weight_erase removes only the component of trained delta_S aligned with the
+  extracted direction; the hacking behaviour mostly lives off that axis in the trained
+  weights, so erasing it barely moves hack (matches the train-time erase arms, which also
+  failed: static 0.500, online 0.562).
+- [reason] act_erase zeroes hack ONLY by destroying generation entirely (solve also 0).
+  The hack direction at layer 35 is not separable from the capability direction post-hoc;
+  ablating it removes coherent output. This is the SHOULD-line failure the bench predicted:
+  "hack drops only when solve collapses -> erasure too blunt -> train-time intervention earns cost".
+- [reason] contrast: train-time route gets hack 0.000 AND solve 0.625. So the value of
+  intervening DURING training (routing the gated rollout into a throwaway knob) is that the
+  hack and the capability are separated as they form, not after they have entangled in W.
+
+### Subtle-failure checks
+
+- [check] not a bench bug zeroing act_erase: weight_erase on the same harness gave a sane
+  non-zero result (0.297), so the eval path works; act_erase's 0/0 is a real model collapse.
+- [check] baseline hack 0.391 is plausible for this ckpt (vanilla learned run_tests strongly,
+  41/48); the dominant mode survives weight_erase (35/48), consistent with off-axis hacking.
+
 ## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)
 
 **Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the
diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index db2937f..6264a84 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -228,6 +228,14 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 % erase online rf5 = job 76, 20260603T032141 (hack 0.562 / solve 0.438; HACK_S 0.504) [landed 2026-06-03]
 % erase static = job 96, (hack 0.500 / solve 0.500; HACK_S 0.518) [landed 2026-06-03]
 % Both erase arms FAIL to suppress (>= vanilla 0.359); route alone zeroes deploy hack.
+% post-hoc = job 98, scripts/tt_erase_bench.py on the 20260531T141402 vanilla ckpt.
+% Its OWN baseline (no erase) = hack 0.391 / solve 0.302, n=192. Read deltas vs THAT,
+% not vs the job-77 vanilla row (different ckpt).
+% weight_erase (project trained dS orth to v_hack): hack 0.391->0.297 (-0.094), solve flat
+% 0.302->0.323 -> barely dents the hack, does not isolate it.
+% act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
+% solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
+% => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
 % Still queued/running (cells \TODO with current job id after the requeue):
 % 78 route2 refresh-2
 % 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
@@ -250,7 +258,8 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
     route (refresh-2) & \TODO{} & \TODO{} & job 78 \\
     Random-V route \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 87 \\
     Placebo pairset \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 86 \\
-    Post-hoc test-time erase & \TODO{} & \TODO{} & job 88 \\
+    Post-hoc weight-erase & $0.297$ & $0.323$ & job 98 \\
+    Post-hoc act-erase & $0.000$ & $0.000$ & job 98 \\
     \bottomrule
 \end{tabular}
 \end{table}
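
Reviewer note (illustrative only, not part of the patch): the two post-hoc arms in the journal entry both reduce to removing the component of a vector along one direction. The sketch below shows that operation on toy vectors in plain Python. It is a minimal sketch under stated assumptions, not the code in scripts/tt_erase_bench.py: the helper names (`project_out`, `diff_of_means`) and all numbers are hypothetical, and the real bench presumably works on model tensors rather than lists.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit length; reject a zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction has zero norm")
    return [x / norm for x in v]

def project_out(vec, direction):
    """Remove the component of vec that lies along direction (hypothetical helper)."""
    d = unit(direction)
    coeff = dot(vec, d)
    return [x - coeff * di for x, di in zip(vec, d)]

def diff_of_means(pos, neg):
    """Difference of per-dimension means between two sets of activations."""
    dim = len(pos[0])
    mean_pos = [sum(row[i] for row in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(row[i] for row in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

# weight_erase analogue: project a toy "trained delta" orthogonal to a toy v_hack.
# The toy delta is mostly off-axis, so its norm barely changes after projection,
# which is the situation the journal's first inference describes.
v_hack = [1.0, 0.0, 0.0]
delta_s = [0.2, 0.9, 0.4]
erased_delta = project_out(delta_s, v_hack)
norm_before = math.sqrt(dot(delta_s, delta_s))
norm_after = math.sqrt(dot(erased_delta, erased_delta))

# act_erase analogue: build a diff-of-means direction from toy activations,
# then ablate it from one toy residual-stream vector.
hack_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
clean_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
direction = diff_of_means(hack_acts, clean_acts)
ablated = project_out([5.0, 1.0, -2.0], direction)

print("delta norm before/after:", norm_before, norm_after)
print("ablated activation:", ablated)
```

The toy cannot reproduce the act_erase collapse, which depends on how much of the model's useful signal shares the ablated direction at layer 35. It only shows that both arms remove exactly one axis and leave everything orthogonal to it untouched.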