mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
writeup: prefill A5 generalisation table vanilla column (job 103); reframe as designed held-out test
Replaces the old opportunistic n=3 partial read. Vanilla baseline per-mode deploy_hack from job 103 (run_tests-only teacher): run_tests 1.000, file_marker 0.625, sentinel 0.417, stdout_marker 0.167. route2 column pending job 104.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+28
-22
@@ -411,35 +411,41 @@ once the matched beta=1e-5 figure (jobs 100/101) lands -- see figure provenance.
 \subsection{C2: generalisation to held-out modes (the no-cheat payload)}
-route suppresses deploy hack on loophole modes whose pairs were never in the
-extraction set, not only the in-distribution mode (Table~\ref{tab:generalisation}).
-\TODO{prose -- author; this is the partial read, not the designed 2-of-4 test.}
+route suppresses deploy hack on loophole modes the route gate never saw a label
+for, not only the demonstrated mode (Table~\ref{tab:generalisation}).
+\TODO{prose -- author. This is now the designed held-out test (jobs 103/104):
+the teacher demonstrates only \texttt{run\_tests}; the other three modes are
+suppressed with zero labels. Vanilla baseline column filled (job 103); route2
+column pending job 104.}

-% --- Table: per-mode held-out ----------------------------------------------
-% Provenance: per_mode deploy_hack already present in the route2 n=3 JSONs
-% (in_dist flag marks which modes were in the pairset). For the route2 nofloor
-% runs: run_tests in_dist=true; file_marker, sentinel in_dist=false.
-% s41: run_tests 0/8 | file_marker 0.000 | sentinel 0.000
-% s42: run_tests 0/8 | file_marker 0.000 | sentinel 0.000
-% s43: run_tests 0/8 | file_marker 0.188 | sentinel 0.000
-% stdout_marker absent from the fixed n=64 eval subset (TODO: coverage).
-% This is the C2 signal but NOT the clean 2-of-4 design -- A5 (jobs G2/G3,
-% spec 20260528_cross_mechanism_v_hack) is NOT YET QUEUED. Treat as partial.
+% --- Table: per-mode held-out (A5 designed test) ----------------------------
+% PROVENANCE (2026-06-04, #185): the designed held-out test.
+% Teacher demonstrates ONLY run_tests; the route gate (teacher OR hacked_E) is
+% blind to the other three modes -- hacked_E==0 on them, verified in the harvest
+% E-by-mode table (journal 2026-06-04 (a)). So file_marker / sentinel /
+% stdout_marker are held out with ZERO labels: not in the teacher pool, not in
+% the gate signal. This is the load-bearing no-cheat check.
+% vanilla baseline = job 103
+% out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json
+% deploy_hack (n=48/mode): run_tests 1.000 | file_marker 0.625 | sentinel 0.417 | stdout_marker 0.167
+% route2 suppression = job 104 (per-mode JSON pending at edit time).
+% knob-off aggregate deploy hack tracked 0.00-0.047 through step 180 vs baseline 0.552.
 \begin{table}[t]
 \centering
-\caption{Per-mode deploy hack, route $n{=}3$. ``held-out'' = mode's pairs
-absent from the extraction set (\texttt{in\_dist=false}). \TODO{the clean
-2-of-4 held-out design (A5 / jobs G2/G3) is not yet queued; these per-mode
-numbers are an opportunistic read of the $n{=}3$ runs, not the designed test.}}
+\caption{Per-mode deploy hack, A5 designed held-out test ($n{=}1$, seed 41).
+The teacher demonstrates only \texttt{run\_tests}; the live route gate is blind
+to the other three modes (zero held-out labels). ``Held out'' = mode absent
+from both the teacher pool and the gate signal. \TODO{route2 column lands when
+job 104 finishes; $n{=}3$ is future work.}}
 \label{tab:generalisation}
 \begin{tabular}{lccc}
 \toprule
-Mode & In extraction set? & Deploy hack (route) $\downarrow$ & Deploy hack (vanilla) \\
+Mode & Held out? & Deploy hack (route2) $\downarrow$ & Deploy hack (vanilla) \\
 \midrule
-run\_tests & yes & $0.000$ (all seeds) & \TODO{job 84} \\
-file\_marker & no & $0.063$ (mean) & \TODO{} \\
-sentinel & no & $0.000$ (all seeds) & \TODO{} \\
-stdout\_marker & \TODO{not in eval subset} & \TODO{} & \TODO{} \\
+run\_tests & no (demoed) & \TODO{job 104} & $1.000$ \\
+file\_marker & yes & \TODO{job 104} & $0.625$ \\
+sentinel & yes & \TODO{job 104} & $0.417$ \\
+stdout\_marker & yes & \TODO{job 104} & $0.167$ \\
 \bottomrule
 \end{tabular}
 \end{table}
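
The new table rows follow mechanically from the per-mode rates quoted in the provenance comment. Below is a minimal sketch of that formatting step, not the repository's actual tooling: the `latex_row` helper, the `VANILLA` and `DEMOED` names, and the hard-coded dictionary are hypothetical, and the real schema of `per_mode_deploy.json` is not shown in this commit, so the rates are copied from the comment instead of being read from the file.

```python
# Vanilla per-mode deploy_hack rates, copied from the provenance comment (job 103).
# Assumption: the real per_mode_deploy.json may be structured differently.
VANILLA = {
    "run_tests": 1.000,
    "file_marker": 0.625,
    "sentinel": 0.417,
    "stdout_marker": 0.167,
}

# Modes the teacher demonstrates; every other mode counts as held out.
DEMOED = {"run_tests"}


def latex_row(mode, vanilla_rate, route2_rate=None):
    """Format one tabular row; a missing route2 rate becomes a TODO marker."""
    name = mode.replace("_", r"\_")  # escape underscores for LaTeX
    held_out = "no (demoed)" if mode in DEMOED else "yes"
    if route2_rate is None:
        route2 = r"\TODO{job 104}"
    else:
        route2 = f"${route2_rate:.3f}$"
    return f"{name} & {held_out} & {route2} & ${vanilla_rate:.3f}$ \\\\"


rows = [latex_row(mode, rate) for mode, rate in VANILLA.items()]
print("\n".join(rows))
```

Once job 104 lands, passing its per-mode rate as `route2_rate` would fill the remaining column in the same way.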