mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
a5: explain generalisation mechanism in caption + ≡0 marker on zero bars

Caption now states v is run_tests-only, teacher is run_tests-only, held-out modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0, and the placebo caveat (suppression is the direction-agnostic quarantine, not v specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
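The gate logic described in the commit message can be illustrated with a minimal sketch. It assumes the gate is a plain boolean OR of a teacher flag and a detector count, as the caption's `teacher OR hacked_E` suggests; the function name `route_gate` and its parameters are hypothetical and do not come from the repository.

```python
def route_gate(is_teacher_sample: bool, hacked_e: int) -> bool:
    """Hypothetical gate: fire if the sample is a teacher demo OR the detector flagged it."""
    return is_teacher_sample or hacked_e > 0


# In-distribution mode (run_tests): either signal can open the gate.
in_mode_teacher = route_gate(is_teacher_sample=True, hacked_e=0)
in_mode_detected = route_gate(is_teacher_sample=False, hacked_e=1)

# Held-out mode: absent from the teacher pool and the detector is silent
# (hacked_E = 0), so the gate never fires on it.
held_out = route_gate(is_teacher_sample=False, hacked_e=0)

print(in_mode_teacher, in_mode_detected, held_out)
```

Under these assumptions a held-out student rollout always yields `False`, which is the sense in which the caption calls the gate "blind" to those modes.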
+12 -6
@@ -435,12 +435,18 @@ column pending job 104.}
 \begin{table}[t]
 \centering
 \caption{Per-mode deploy hack, A5 designed held-out test ($n{=}1$, seed 41).
-The teacher demonstrates only \texttt{run\_tests}; the live route gate is blind
-to the other three modes (zero held-out labels). ``Held out'' = mode absent
-from both the teacher pool and the gate signal. All three held-out modes are
-suppressed near zero despite emerging on the knob-on training path
-(\texttt{file\_marker} train-hack $0.54$, \texttt{stdout\_marker} $0.60$), so the
-quarantine knob is holding cheats it was never labelled on. $n{=}1$; $n{=}3$ is future work.}
+Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
+and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
+absent from both: it is not in $v$, not in the teacher pool, and the live route
+gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
+silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
+\texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
+three. So the quarantine removes cheats it was never given a label for. Mechanism
+caveat: the placebo control (Table~\ref{tab:ablation}) shows route's gate is
+direction-agnostic, so the held-out suppression is the deletable quarantine
+generalising, not $v(\texttt{run\_tests})$ pointing at the other modes; the
+zero-label result stands either way. $n{=}1$; $n{=}3$ queued (jobs 107--110).}
 \label{tab:generalisation}
 \begin{tabular}{lccc}
 \toprule

@@ -79,9 +79,13 @@ def _panel(ax, by_arm, modes, arms, field, title, ylabel):
         bars = ax.bar(x + i * w, means, w, label=f"{label} (n={n_seed})", color=color,
                       yerr=yerr, capsize=2, error_kw=dict(lw=0.8, alpha=0.8))
         for b, v in zip(bars, means):
-            if not np.isnan(v):
-                ax.annotate(f"{v:.2f}", (b.get_x() + b.get_width() / 2, v), fontsize=6,
-                            ha="center", va="bottom", color=color)
+            if np.isnan(v):
+                continue
+            # a zero-height bar is invisible -- mark it "≡0" so the reader sees a
+            # finding, not a missing bar (same convention as the line plots).
+            txt = "≡0" if v < 5e-3 else f"{v:.2f}"
+            ax.annotate(txt, (b.get_x() + b.get_width() / 2, v), fontsize=6,
+                        ha="center", va="bottom", color=color)
     ax.set_xticks(x + 0.4 - w / 2)
     ax.set_xticklabels([f"{m}\n{'IN' if m == 'run_tests' else 'held-out'}" for m in modes], fontsize=8)
     ax.set_title(title, fontsize=10)
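
The labelling rule this hunk introduces can be checked in isolation. The sketch below pulls it into a standalone helper. The name `bar_label` and the `zero_tol` parameter are hypothetical, not part of the repository, and the helper returns `None` for NaN where the original loop uses `continue`.

```python
import math


def bar_label(value: float, zero_tol: float = 5e-3):
    """Hypothetical helper mirroring the hunk's rule for annotating a bar.

    Returns None for NaN (no annotation), "≡0" for values below zero_tol
    (a zero-height bar would otherwise be invisible), else a 2-decimal string.
    """
    if math.isnan(value):
        return None
    return "≡0" if value < zero_tol else f"{value:.2f}"


labels = [bar_label(v) for v in (0.0, 0.54, 0.60, float("nan"))]
print(labels)
```

The threshold `5e-3` matches the two-decimal format: most values below it would print as `0.00`, so the marker replaces a label that would read as zero anyway.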