writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat)

New Results subsection + fix stale Limitations bullet (50%->12.5%). Three
evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step,
student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out
emergence (teacher demos only run_tests, student emerges 3 modes with zero
teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead
of A3 random-V).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-04 23:26:54 +00:00
parent b891109633
commit b311815fc6
+48 -3
View File
@@ -455,6 +455,49 @@ column pending job 104.}
\end{tabular}
\end{table}
\subsection{C4: the teacher bootstrap is an accelerant, not the signal}
\label{sec:bootstrap}
% PROVENANCE:
% construction: FastConfig group=8, mix_ratio=0.125, prompts_per_step=4
% => G_t=round(8*0.125)=1 teacher + G_s=7 student per group; x4 prompts
% = 4 teacher + 28 student rollouts/step. Teacher pool ~95% hacky
% (train.py:200-211 comment, mix_ratio docstring).
% student-hack trajectory: job 103 (A5 vanilla, run_tests-only teacher)
% logs/20260604T025953_..._a5_vanilla_tmrt_s41.log, hack_s/28 per step:
% mean 2.8/28 over steps 0-10, 11.6/28 over steps 40-200; first >4 at step 2.
% held-out emergence: A5 (Table~\ref{tab:generalisation}), teacher demos only
% run_tests yet student emerges file_marker/sentinel/stdout_marker.
% decisive control: jobs 93 (vanilla) + 94 (route2), --teacher-off-step=40,
% 200 steps. TODO figure when they land.
We do not run the $\sim$64 GPU-h pure-GRPO emergence job; we bootstrap with a
cached teacher pool so a run finishes in under two hours. The concern is that
routing then suppresses a teacher-injected gradient rather than the
student-emergent hacking a real run would produce. Three pieces of evidence say
the teacher only accelerates seeding and the suppressed signal is the student's
own.
First, by construction each step mixes $G_t{=}4$ teacher rollouts (pool $\sim$95\%
hacky) with $G_s{=}28$ student rollouts (mix${=}0.125$, group $8$, $4$ prompts),
so the teacher's contribution to the hacky-rollout count is fixed at $\sim$4 per
step. The student's own hack count crosses that footprint within the first few
steps (first $>4$ at step 2) and averages $11.6/28$ over steps 40--200 versus
$2.8/28$ over steps 0--10. From roughly step 40 on, student-emergent hacks
outnumber teacher-injected ones about $3{:}1$, so that is the regime the gradient
projection acts in.
Second, and most directly, the generalisation run (Table~\ref{tab:generalisation})
has a teacher that demonstrates only \texttt{run\_tests}, yet the student emerges
\texttt{file\_marker}, \texttt{sentinel}, and \texttt{stdout\_marker} on its own.
Three of the four suppressed modes have zero teacher examples, so their gradient
cannot be teacher-injected.
Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
on-policy to 200) for both vanilla and route2. If the teacher were load-bearing,
vanilla hacking would decay and route2's suppression would lose its target after
the cut; if it is an accelerant, vanilla keeps hacking and route2 keeps holding
deploy hack near zero. \TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
seed 41); queued.}
\section{Related work}
% PROVENANCE: differentiators + no-cheat scorecard curated in
% docs/grad_routing/related_work.md (2026-05-31, from full-text local copies).
@@ -577,10 +620,12 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
% User-dictated; kept verbatim.
\begin{itemize}
\item Small model (Qwen3-4B).
\item We bootstrap hacking with a cached teacher pool (50\% off-policy
\item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
rollouts) due to compute limits, rather than waiting for endogenous
emergence ($\sim$64 GPU-h); whether the result transfers to the
endogenous regime is open.
emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
teacher is an accelerant, not the suppressed signal (the student
out-hacks it $\sim3{:}1$ from step 40, and emerges three held-out modes
with zero teacher examples); the teacher-off control there closes it.
\item Little optimization done per component, so there is likely substantial
low-hanging fruit in each novel piece (extraction, gating, refresh,
quarantine scale-matching).