diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index 6e63780..d2dfa9a 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -455,6 +455,49 @@ column pending job 104.}
 \end{tabular}
 \end{table}
 
+\subsection{C4: the teacher bootstrap is an accelerant, not the signal}
+\label{sec:bootstrap}
+% PROVENANCE:
+%   construction: FastConfig group=8, mix_ratio=0.125, prompts_per_step=4
+%     => G_t=round(8*0.125)=1 teacher + G_s=7 student per group; x4 prompts
+%     = 4 teacher + 28 student rollouts/step. Teacher pool ~95% hacky
+%     (train.py:200-211 comment, mix_ratio docstring).
+%   student-hack trajectory: job 103 (A5 vanilla, run_tests-only teacher)
+%     logs/20260604T025953_..._a5_vanilla_tmrt_s41.log, hack_s/28 per step:
+%     mean 2.8/28 over steps 0-10, 11.6/28 over steps 40-200; first >4 at step 2.
+%   held-out emergence: A5 (Table~\ref{tab:generalisation}), teacher demos only
+%     run_tests yet student emerges file_marker/sentinel/stdout_marker.
+%   decisive control: jobs 93 (vanilla) + 94 (route2), --teacher-off-step=40,
+%     200 steps. TODO figure when they land.
+We do not run the $\sim$64 GPU-h pure-GRPO emergence job; we bootstrap with a
+cached teacher pool so a run finishes in under two hours. The concern is that
+routing then suppresses a teacher-injected gradient rather than the
+student-emergent hacking a real run would produce. Two pieces of evidence say
+the teacher only accelerates seeding and the suppressed signal is the student's
+own; a third, the teacher-off control, is still queued.
+
+First, by construction each step mixes $4$ teacher rollouts (pool $\sim$95\%
+hacky) with $28$ student rollouts: $G_t{=}1$ and $G_s{=}7$ per group (group $8$,
+mix${=}0.125$) over $4$ prompts, so the teacher's contribution to the
+hacky-rollout count is fixed at $\sim$4 per step. The student's own hack count
+crosses that footprint within the first few steps (first $>4$ at step 2) and
+averages $11.6/28$ over steps 40--200 versus $2.8/28$ over steps 0--10. From
+roughly step 40 on, student-emergent hacks outnumber teacher-injected ones about
+$3{:}1$, so that is the regime the gradient projection acts in.
+
+Second, and most directly, the generalisation run (Table~\ref{tab:generalisation})
+has a teacher that demonstrates only \texttt{run\_tests}, yet the student produces
+\texttt{file\_marker}, \texttt{sentinel}, and \texttt{stdout\_marker} on its own.
+Three of the four suppressed modes have zero teacher examples, so their gradient
+cannot be teacher-injected.
+
+Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
+on-policy to 200) for both vanilla and route2. If the teacher were load-bearing,
+vanilla hacking would decay and route2's suppression would lose its target after
+the cut; if it is an accelerant, vanilla keeps hacking and route2 keeps holding
+deploy hack near zero. \TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
+seed 41); queued.}
+
 \section{Related work}
 % PROVENANCE: differentiators + no-cheat scorecard curated in
 % docs/grad_routing/related_work.md (2026-05-31, from full-text local copies).
@@ -577,10 +620,12 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
 % User-dictated; kept verbatim.
 \begin{itemize}
   \item Small model (Qwen3-4B).
-  \item We bootstrap hacking with a cached teacher pool (50\% off-policy
+  \item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
     rollouts) due to compute limits, rather than waiting for endogenous
-    emergence ($\sim$64 GPU-h); whether the result transfers to the
-    endogenous regime is open.
+    emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
+    teacher is an accelerant, not the suppressed signal (the student
+    out-hacks it $\sim3{:}1$ from step 40, and produces three held-out modes
+    with zero teacher examples); the queued teacher-off control there is meant to close it.
   \item Little optimization done per component, so there is likely substantial
     low-hanging fruit in each novel piece (extraction, gating, refresh,
     quarantine scale-matching).