tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional

Job 86 placebo (null_city, arbitrary direction) reached deploy hack 0.000 over
the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate
is direction-agnostic: the discarded knob absorbs whatever crosses the per-step
energy threshold, regardless of alignment with v_hack. Directional specificity
now rests on the erase arm (which subtracts a term proportional to cos(g,v));
that result is pending.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-05 02:57:06 +00:00
parent 3da296469b
commit 6869afef72
+22 -10
@@ -332,19 +332,31 @@ enough to route one as it forms.
 % act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
 % solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
 % => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
-% Still queued/running (cells \TODO with current job id after the requeue):
-% 78 route2 refresh-2
-% 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
-% 88 post-hoc test-time erase (scripts/tt_erase_bench.py on vanilla ckpt)
+% Placebo LANDED (job 86, 20260604T231926_..._route2_placebo_nullcity_s41):
+% deploy hack 0.000 / solve 0.531 -- prediction "~vanilla" FALSIFIED. An arbitrary
+% (null_city) direction quarantine suppresses deploy hack just as well as v_hack
+% (real-v route2: s41/s42/s43 = 0.000/0.000/0.094). => route's gate is NON-directional;
+% suppression is the discarded-knob absorption, not v_hack specificity. (placebo full
+% 60 steps, per_mode: file_marker train_hack 0.656/deploy 0.000, run_tests 0.625/0.000,
+% sentinel 0.042/0.000 -- held-out modes emerge on knob-on then knob-off to zero.)
+% Still queued (directional specificity now rests on the ERASE arm, not route):
+% random-V route control (expect ~0.000 too, second non-directionality check)
+% erase real-v_hack vs erase placebo-v: DECISIVE -- erase subtracts prop cos(g,v),
+% so real << placebo => erase is directional; both drop => no directionality anywhere.
 \begin{table}[t]
 \centering
 \caption{Ablation of the route method, seed 41, matched preset. $\neg$ marks one
 ingredient removed from the full method: $\neg$routing reverts to one-sided erase,
 $\neg$directional swaps $v_{\text{hack}}$ for a norm/rank-matched random basis,
-$\neg$hack-pairs swaps in a semantically random (placebo) pairset. Controls should
-land at the vanilla hack level if the effect is directional, not generic adapter
-regularization. The post-hoc block (different checkpoint, own baseline
-$0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time routing.}
+$\neg$hack-pairs swaps in a semantically random (placebo) pairset. If route's
+suppression were directional, these controls would return toward the vanilla hack
+level; instead the placebo also reaches zero deploy hack, so route's gate is
+direction-agnostic and the suppression is the routed-and-discarded knob absorbing
+whatever crosses the per-step energy threshold, not $v_{\text{hack}}$ pointing at
+the hack. Directional specificity is what the erase arm tests (it subtracts
+$\propto\cos(g,v)$); that pair is pending. The post-hoc block (different checkpoint,
+own baseline $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time
+routing.}
 \label{tab:ablation}
 % Sources (internal, not shown): route full = 20260601T115713; refresh-2 = job 99
 % (_sub4_route2_nofloor_rf2_s41, requeue on current code; job 78 was the pre-refactor one).
@@ -359,8 +371,8 @@ enough to route one as it forms.
 route (refresh-2) & $0.000$ & $0.625$ \\
 \quad $\neg$routing (erase, static) & $0.500$ & $0.500$ \\
 \quad $\neg$routing (erase, refresh-5)& $0.562$ & $0.438$ \\
-\quad $\neg$directional (random-V) & \TODO{$\approx$van}& \TODO{} \\
-\quad $\neg$hack-pairs (placebo) & \TODO{$\approx$van}& \TODO{} \\
+\quad $\neg$directional (random-V) & \TODO{queued} & \TODO{} \\
+\quad $\neg$hack-pairs (placebo) & $0.000$ & $0.531$ \\
 \quad $\neg$intervention (vanilla) & $0.359$ & $0.422$ \\
 \midrule
 \multicolumn{3}{l}{\emph{Post-hoc test-time erasure (own baseline $0.391/0.302$):}} \\