mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
a5: explain generalisation mechanism in caption + ≡0 marker on zero bars
Caption now states v is run_tests-only, teacher is run_tests-only, held-out modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0, and the placebo caveat (suppression is the direction-agnostic quarantine, not v specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
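The "gate is blind" point in the message above can be illustrated with a minimal sketch. It is not code from this repository: the `Rollout` record, the `route_gate` helper, and the example detector values are hypothetical, and it assumes only that the gate is the `teacher OR hacked_E` rule quoted in the caption and that the detector fires on `run_tests` rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical record for one training rollout (illustration only)."""
    mode: str           # which cheat the rollout uses
    from_teacher: bool  # True if the rollout comes from the teacher pool
    hacked_e: int       # detector output: 1 = fired, 0 = silent


def route_gate(r: Rollout) -> bool:
    """Assumed gate rule from the caption: teacher OR hacked_E."""
    return r.from_teacher or r.hacked_e == 1


# Assumed example data: the teacher pool and the detector only cover run_tests,
# so held-out modes arrive as student rollouts with hacked_e == 0.
rollouts = [
    Rollout("run_tests", from_teacher=True, hacked_e=1),
    Rollout("run_tests", from_teacher=False, hacked_e=1),
    Rollout("file_marker", from_teacher=False, hacked_e=0),
    Rollout("stdout_marker", from_teacher=False, hacked_e=0),
]

# For each mode, record whether the gate fired on at least one rollout.
fired = {}
for r in rollouts:
    fired[r.mode] = fired.get(r.mode, False) or route_gate(r)

print(fired)
```

Under these assumptions the gate fires only for `run_tests`; the held-out modes never trigger it, which is the sense in which they carry zero labels.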
+12
-6
@@ -435,12 +435,18 @@ column pending job 104.}
 \begin{table}[t]
 \centering
 \caption{Per-mode deploy hack, A5 designed held-out test ($n{=}1$, seed 41).
-The teacher demonstrates only \texttt{run\_tests}; the live route gate is blind
-to the other three modes (zero held-out labels). ``Held out'' = mode absent
-from both the teacher pool and the gate signal. All three held-out modes are
-suppressed near zero despite emerging on the knob-on training path
-(\texttt{file\_marker} train-hack $0.54$, \texttt{stdout\_marker} $0.60$), so the
-quarantine knob is holding cheats it was never labelled on. $n{=}1$; $n{=}3$ is future work.}
+Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
+and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
+absent from both: it is not in $v$, not in the teacher pool, and the live route
+gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
+silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
+\texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
+three. So the quarantine removes cheats it was never given a label for. Mechanism
+caveat: the placebo control (Table~\ref{tab:ablation}) shows route's gate is
+direction-agnostic, so the held-out suppression is the deletable quarantine
+generalising, not $v(\texttt{run\_tests})$ pointing at the other modes; the
+zero-label result stands either way. $n{=}1$; $n{=}3$ queued (jobs 107--110).}
 \label{tab:generalisation}
 \begin{tabular}{lccc}
 \toprule