writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat)

New Results subsection + fix stale Limitations bullet (50%->12.5%). Three evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step, student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out emergence (teacher demos only run_tests, student emerges 3 modes with zero teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead of A3 random-V). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-04 23:26:54 +00:00
parent b891109633
commit b311815fc6
1 changed files with 48 additions and 3 deletions
@@ -455,6 +455,49 @@ column pending job 104.}
  \end{tabular}
 \end{table}

+\subsection{C4: the teacher bootstrap is an accelerant, not the signal}
+\label{sec:bootstrap}
+% PROVENANCE:
+%   construction: FastConfig group=8, mix_ratio=0.125, prompts_per_step=4
+%     => G_t=round(8*0.125)=1 teacher + G_s=7 student per group; x4 prompts
+%        = 4 teacher + 28 student rollouts/step. Teacher pool ~95% hacky
+%        (train.py:200-211 comment, mix_ratio docstring).
+%   student-hack trajectory: job 103 (A5 vanilla, run_tests-only teacher)
+%     logs/20260604T025953_..._a5_vanilla_tmrt_s41.log, hack_s/28 per step:
+%     mean 2.8/28 over steps 0-10, 11.6/28 over steps 40-200; first >4 at step 2.
+%   held-out emergence: A5 (Table~\ref{tab:generalisation}), teacher demos only
+%     run_tests yet student emerges file_marker/sentinel/stdout_marker.
+%   decisive control: jobs 93 (vanilla) + 94 (route2), --teacher-off-step=40,
+%     200 steps. TODO figure when they land.
+We do not run the $\sim$64 GPU-h pure-GRPO emergence job; we bootstrap with a
+cached teacher pool so a run finishes in under two hours. The concern is that
+routing then suppresses a teacher-injected gradient rather than the
+student-emergent hacking a real run would produce. Two pieces of evidence, and a
+third control still pending, indicate that the teacher only accelerates seeding
+and that the suppressed signal is the student's own.
+
+First, by construction each step mixes $4$ teacher rollouts (pool $\sim$95\%
+hacky) with $28$ student rollouts: a mix ratio of $0.125$ on a group of $8$ gives
+$G_t{=}1$ teacher and $G_s{=}7$ student rollouts per group, over $4$ prompts.
+The teacher's contribution to the hacky-rollout count is therefore capped at
+$4$ per step. The student's own hack count first exceeds that cap at step 2
+and averages $11.6/28$ over steps 40--200, versus $2.8/28$ over steps 0--10.
+From roughly step 40 on, student-emergent hacks outnumber teacher-injected
+ones about $3{:}1$, so that is the regime the gradient projection acts in.
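+
+As a rough check on these counts, assuming the pool's $\sim$95\% hack rate
+applies uniformly to the teacher rollouts drawn each step (an assumption; the
+per-step teacher hack count is not reported separately here):
+\[
+  \mathrm{round}(8 \times 0.125) = 1 \ \mbox{teacher rollout per group},
+  \qquad 8 - 1 = 7 \ \mbox{student rollouts per group},
+\]
+\[
+  4 \times 1 = 4 \ \mbox{teacher rollouts per step},
+  \qquad 4 \times 7 = 28 \ \mbox{student rollouts per step},
+\]
+\[
+  \frac{\mbox{student hacks per step (steps 40--200)}}{\mbox{teacher hacks per step}}
+  \approx \frac{11.6}{0.95 \times 4} = \frac{11.6}{3.8} \approx 3.05 .
+\]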
+
+Second, and most directly, in the generalisation run
+(Table~\ref{tab:generalisation}) the teacher demonstrates only
+\texttt{run\_tests}, yet the student develops \texttt{file\_marker},
+\texttt{sentinel}, and \texttt{stdout\_marker} on its own. Three of the four
+suppressed modes thus have zero teacher examples, so their gradient cannot be teacher-injected.
+
+Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
+on-policy to 200) for both vanilla and route2. If the teacher were load-bearing,
+vanilla hacking would decay and route2's suppression would lose its target after
+the cut; if it is an accelerant, vanilla keeps hacking and route2 keeps holding
+deploy hack near zero. \TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
+seed 41); queued.}
+
 \section{Related work}
 % PROVENANCE: differentiators + no-cheat scorecard curated in
 % docs/grad_routing/related_work.md (2026-05-31, from full-text local copies).
@@ -577,10 +620,12 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
 % User-dictated; kept verbatim.
 \begin{itemize}
  \item Small model (Qwen3-4B).
-  \item We bootstrap hacking with a cached teacher pool (50\% off-policy
+  \item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
        rollouts) due to compute limits, rather than waiting for endogenous
-        emergence ($\sim$64 GPU-h); whether the result transfers to the
-        endogenous regime is open.
+        emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
+        teacher is an accelerant, not the suppressed signal (the student
+        out-hacks it $\sim3{:}1$ from step 40, and develops three held-out modes
+        with zero teacher examples); the pending teacher-off control there tests this directly.
  \item Little optimization done per component, so there is likely substantial
        low-hanging fruit in each novel piece (extraction, gating, refresh,
        quarantine scale-matching).