a5: explain generalisation mechanism in caption + ≡0 marker on zero bars

Caption now states v is run_tests-only, teacher is run_tests-only, held-out
modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0,
and the placebo caveat (suppression is the direction-agnostic quarantine, not v
specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
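
A minimal sketch of the zero-bar tagging described in the message, since the plotting script itself is not shown in this view. It assumes matplotlib. The names `zero_bar_indices`, `plot_with_zero_tags`, and the `tol` parameter are hypothetical, not taken from the repository.

```python
from typing import Sequence

ZERO_TAG = "\u22610"  # the "≡0" marker drawn above zero-height bars


def zero_bar_indices(heights: Sequence[float], tol: float = 0.0) -> list[int]:
    """Return the positions of bars whose height is zero (within tol)."""
    return [i for i, h in enumerate(heights) if abs(h) <= tol]


def plot_with_zero_tags(labels, heights, path, tol=0.0):
    """Draw a bar plot and tag every zero-height bar so it stays visible."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # headless backend; writes to file only
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    positions = range(len(heights))
    ax.bar(positions, heights)
    ax.set_xticks(list(positions))
    ax.set_xticklabels(labels, rotation=20)
    for i in zero_bar_indices(heights, tol):
        # Place the tag a few points above the baseline at the bar's x position.
        ax.annotate(ZERO_TAG, xy=(i, 0), xytext=(0, 3),
                    textcoords="offset points", ha="center", va="bottom")
    fig.savefig(path)
    plt.close(fig)
```

Usage, with placeholder heights rather than measured values: `plot_with_zero_tags(["run_tests", "file_marker", "stdout_marker"], [0.3, 0.0, 0.0], "deploy_hack.png")`. A nonzero `tol` would also tag bars that are merely too short to see, which is a different claim from exactly zero, so the default keeps it strict.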
wassname
2026-06-05 02:46:57 +00:00
parent aa6bf57a73
commit 5f478f4bd0
2 changed files with 19 additions and 9 deletions
+12 -6
@@ -435,12 +435,18 @@ column pending job 104.}
\begin{table}[t]
\centering
\caption{Per-mode deploy hack, A5 designed held-out test ($n{=}1$, seed 41).
-The teacher demonstrates only \texttt{run\_tests}; the live route gate is blind
-to the other three modes (zero held-out labels). ``Held out'' = mode absent
-from both the teacher pool and the gate signal. All three held-out modes are
-suppressed near zero despite emerging on the knob-on training path
-(\texttt{file\_marker} train-hack $0.54$, \texttt{stdout\_marker} $0.60$), so the
-quarantine knob is holding cheats it was never labelled on. $n{=}1$; $n{=}3$ is future work.}
+Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
+and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
+absent from both: it is not in $v$, not in the teacher pool, and the live route
+gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
+silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
+\texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
+three. So the quarantine removes cheats it was never given a label for. Mechanism
+caveat: the placebo control (Table~\ref{tab:ablation}) shows route's gate is
+direction-agnostic, so the held-out suppression is the deletable quarantine
+generalising, not $v(\texttt{run\_tests})$ pointing at the other modes; the
+zero-label result stands either way. $n{=}1$; $n{=}3$ queued (jobs 107--110).}
\label{tab:generalisation}
\begin{tabular}{lccc}
\toprule