diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index fc0dd0c..db75f5c 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -332,19 +332,31 @@ enough to route one as it forms.
 % act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
 % solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
 % => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
-% Still queued/running (cells \TODO with current job id after the requeue):
-% 78 route2 refresh-2
-% 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
-% 88 post-hoc test-time erase (scripts/tt_erase_bench.py on vanilla ckpt)
+% Placebo LANDED (job 86, 20260604T231926_..._route2_placebo_nullcity_s41):
+% deploy hack 0.000 / solve 0.531 -- prediction "~vanilla" FALSIFIED. An arbitrary
+% (null_city) direction quarantine suppresses deploy hack just as well as v_hack
+% (real-v route2: s41/s42/s43 = 0.000/0.000/0.094). => route's gate is NON-directional;
+% suppression is the discarded-knob absorption, not v_hack specificity. (placebo full
+% 60 steps, per_mode: file_marker train_hack 0.656/deploy 0.000, run_tests 0.625/0.000,
+% sentinel 0.042/0.000 -- held-out modes emerge on knob-on then knob-off to zero.)
+% Still queued (directional specificity now rests on the ERASE arm, not route):
+% random-V route control (expect ~0.000 too, second non-directionality check)
+% erase real-v_hack vs erase placebo-v: DECISIVE -- erase subtracts prop cos(g,v),
+% so real << placebo => erase is directional; both drop => no directionality anywhere.
 \begin{table}[t]
 \centering
 \caption{Ablation of the route method, seed 41, matched preset. $\neg$ marks one
 ingredient removed from the full method: $\neg$routing reverts to one-sided erase,
 $\neg$directional swaps $v_{\text{hack}}$ for a norm/rank-matched random basis,
- $\neg$hack-pairs swaps in a semantically random (placebo) pairset. Controls should
- land at the vanilla hack level if the effect is directional, not generic adapter
- regularization. The post-hoc block (different checkpoint, own baseline
- $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time routing.}
+ $\neg$hack-pairs swaps in a semantically random (placebo) pairset. If route's
+ suppression were directional, these controls would return toward the vanilla hack
+ level; instead the placebo also reaches zero deploy hack, so route's gate is
+ direction-agnostic and the suppression is the routed-and-discarded knob absorbing
+ whatever crosses the per-step energy threshold, not $v_{\text{hack}}$ pointing at
+ the hack. Directional specificity is what the erase arm tests (it subtracts
+ $\propto\cos(g,v)$); that pair is pending. The post-hoc block (different checkpoint,
+ own baseline $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time
+ routing.}
 \label{tab:ablation}
 % Sources (internal, not shown): route full = 20260601T115713; refresh-2 = job 99
 % (_sub4_route2_nofloor_rf2_s41, requeue on current code; job 78 was the pre-refactor one).
@@ -359,8 +371,8 @@ enough to route one as it forms.
 route (refresh-2) & $0.000$ & $0.625$ \\
 \quad $\neg$routing (erase, static) & $0.500$ & $0.500$ \\
 \quad $\neg$routing (erase, refresh-5)& $0.562$ & $0.438$ \\
- \quad $\neg$directional (random-V) & \TODO{$\approx$van}& \TODO{} \\
- \quad $\neg$hack-pairs (placebo) & \TODO{$\approx$van}& \TODO{} \\
+ \quad $\neg$directional (random-V) & \TODO{queued} & \TODO{} \\
+ \quad $\neg$hack-pairs (placebo) & $0.000$ & $0.531$ \\
 \quad $\neg$intervention (vanilla) & $0.359$ & $0.422$ \\
 \midrule
 \multicolumn{3}{l}{\emph{Post-hoc test-time erasure (own baseline $0.391/0.302$):}} \\