Mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 18:04:59 +08:00)
tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional
Job 86 placebo (null_city, an arbitrary direction) reached deploy hack 0.000 over the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate is direction-agnostic: the discarded knob absorbs whatever crosses the per-step energy threshold, regardless of alignment with v_hack. Directional specificity now rests on the erase arm, which subtracts a component proportional to cos(g,v); that comparison is pending.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
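To make the "subtracts prop cos(g,v)" remark concrete, here is a minimal worked form. It assumes that erase is a plain projection of the per-step gradient $g$ off the unit direction $\hat v = v/\|v\|$; the commit only states the proportionality, so the exact operator and the symbol $g_{\text{erase}}$ are illustrative assumptions, not the repository's definition.

```latex
% Assumed form of erase (not confirmed by the source): project g off \hat v.
\begin{align}
  g_{\text{erase}} &= g - (g^\top \hat v)\,\hat v
                    = g - \|g\|\cos(g,v)\,\hat v, \\
  \|g - g_{\text{erase}}\| &= \|g\|\,\lvert\cos(g,v)\rvert .
\end{align}
```

Under this assumption the amount removed scales with $\lvert\cos(g,v)\rvert$, so a placebo direction nearly orthogonal to $g$ would remove almost nothing. That is why the real-$v_{\text{hack}}$ versus placebo-$v$ erase pair can test directionality, while route's energy-threshold gate, as described above, cannot.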
+22
-10
@@ -332,19 +332,31 @@ enough to route one as it forms.
 % act_erase (Arditi residual ablation @layer35, sep=19.3/4.5x): hack 0.391->0.000 BUT
 % solve 0.302->0.000 -> lobotomy. Hack drops only because the model stops solving at all.
 % => post-hoc erasure cannot separate hack from capability; train-time routing earns its cost.
-% Still queued/running (cells \TODO with current job id after the requeue):
-% 78 route2 refresh-2
-% 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
-% 88 post-hoc test-time erase (scripts/tt_erase_bench.py on vanilla ckpt)
+% Placebo LANDED (job 86, 20260604T231926_..._route2_placebo_nullcity_s41):
+% deploy hack 0.000 / solve 0.531 -- prediction "~vanilla" FALSIFIED. An arbitrary
+% (null_city) direction quarantine suppresses deploy hack just as well as v_hack
+% (real-v route2: s41/s42/s43 = 0.000/0.000/0.094). => route's gate is NON-directional;
+% suppression is the discarded-knob absorption, not v_hack specificity. (placebo full
+% 60 steps, per_mode: file_marker train_hack 0.656/deploy 0.000, run_tests 0.625/0.000,
+% sentinel 0.042/0.000 -- held-out modes emerge on knob-on then knob-off to zero.)
+% Still queued (directional specificity now rests on the ERASE arm, not route):
+% random-V route control (expect ~0.000 too, second non-directionality check)
+% erase real-v_hack vs erase placebo-v: DECISIVE -- erase subtracts prop cos(g,v),
+% so real << placebo => erase is directional; both drop => no directionality anywhere.
 \begin{table}[t]
 \centering
 \caption{Ablation of the route method, seed 41, matched preset. $\neg$ marks one
 ingredient removed from the full method: $\neg$routing reverts to one-sided erase,
 $\neg$directional swaps $v_{\text{hack}}$ for a norm/rank-matched random basis,
-$\neg$hack-pairs swaps in a semantically random (placebo) pairset. Controls should
-land at the vanilla hack level if the effect is directional, not generic adapter
-regularization. The post-hoc block (different checkpoint, own baseline
-$0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time routing.}
+$\neg$hack-pairs swaps in a semantically random (placebo) pairset. If route's
+suppression were directional, these controls would return toward the vanilla hack
+level; instead the placebo also reaches zero deploy hack, so route's gate is
+direction-agnostic and the suppression is the routed-and-discarded knob absorbing
+whatever crosses the per-step energy threshold, not $v_{\text{hack}}$ pointing at
+the hack. Directional specificity is what the erase arm tests (it subtracts
+$\propto\cos(g,v)$); that pair is pending. The post-hoc block (different checkpoint,
+own baseline $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time
+routing.}
 \label{tab:ablation}
 % Sources (internal, not shown): route full = 20260601T115713; refresh-2 = job 99
 % (_sub4_route2_nofloor_rf2_s41, requeue on current code; job 78 was the pre-refactor one).
@@ -359,8 +371,8 @@ enough to route one as it forms.
 route (refresh-2) & $0.000$ & $0.625$ \\
 \quad $\neg$routing (erase, static) & $0.500$ & $0.500$ \\
 \quad $\neg$routing (erase, refresh-5)& $0.562$ & $0.438$ \\
-\quad $\neg$directional (random-V) & \TODO{$\approx$van}& \TODO{} \\
-\quad $\neg$hack-pairs (placebo) & \TODO{$\approx$van}& \TODO{} \\
+\quad $\neg$directional (random-V) & \TODO{queued} & \TODO{} \\
+\quad $\neg$hack-pairs (placebo) & $0.000$ & $0.531$ \\
 \quad $\neg$intervention (vanilla) & $0.359$ & $0.422$ \\
 \midrule
 \multicolumn{3}{l}{\emph{Post-hoc test-time erasure (own baseline $0.391/0.302$):}} \\