mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat)
New Results subsection + fix stale Limitations bullet (50%->12.5%). Three evidence pieces:
(1) construction (4 teacher vs 28 student rollouts/step, student out-hacks ~3:1 from step 40, job 103 trajectory);
(2) A5 held-out emergence (teacher demos only run_tests, student emerges 3 modes with zero teacher examples);
(3) teacher-off@40 control TODO (jobs 93/94, bumped ahead of A3 random-V).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+48
-3
@@ -455,6 +455,49 @@ column pending job 104.}
\end{tabular}
\end{table}

\subsection{C4: the teacher bootstrap is an accelerant, not the signal}
\label{sec:bootstrap}
% PROVENANCE:
% construction: FastConfig group=8, mix_ratio=0.125, prompts_per_step=4
% => G_t=round(8*0.125)=1 teacher + G_s=7 student per group; x4 prompts
% = 4 teacher + 28 student rollouts/step. Teacher pool ~95% hacky
% (train.py:200-211 comment, mix_ratio docstring).
% student-hack trajectory: job 103 (A5 vanilla, run_tests-only teacher)
% logs/20260604T025953_..._a5_vanilla_tmrt_s41.log, hack_s/28 per step:
% mean 2.8/28 over steps 0-10, 11.6/28 over steps 40-200; first >4 at step 2.
% held-out emergence: A5 (Table~\ref{tab:generalisation}), teacher demos only
% run_tests yet student emerges file_marker/sentinel/stdout_marker.
% decisive control: jobs 93 (vanilla) + 94 (route2), --teacher-off-step=40,
% 200 steps. TODO figure when they land.
We do not run the $\sim$64 GPU-h pure-GRPO emergence job; we bootstrap with a
cached teacher pool so a run finishes in under two hours. The concern is that
routing then suppresses a teacher-injected gradient rather than the
student-emergent hacking an endogenous run would produce. Two pieces of
evidence indicate that the teacher only accelerates seeding and that the
suppressed signal is the student's own; a third, queued control will test this
directly.

First, by construction each step mixes $4$ teacher rollouts (pool $\sim$95\%
hacky) with $28$ student rollouts ($G_t{=}1$ teacher and $G_s{=}7$ student per
group of $8$ at mix${=}0.125$, over $4$ prompts), so the teacher contributes at
most $\sim$4 hacky rollouts per step. The student's own hack count crosses that
footprint within the first few steps (first $>4$ at step 2) and averages
$11.6/28$ over steps 40--200 versus $2.8/28$ over steps 0--10. From roughly
step 40 on, student-emergent hacks outnumber teacher-injected ones about
$3{:}1$, so that is the regime the gradient projection acts in.

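As a rough check on these counts, the per-step arithmetic can be written out.
This is only a sketch: it assumes the pool's $\sim$95\% hack rate applies
uniformly to the sampled teacher rollouts and takes the steps 40--200 mean as
the student count. $N_t$, $N_s$, $H_t$, and $H_s$ are labels introduced here
for the per-step teacher rollouts, student rollouts, expected teacher hacks,
and mean student hacks; they are not symbols used elsewhere in the paper.
\[
\begin{array}{rcl}
N_t &=& \mathrm{round}(8 \times 0.125) \times 4 \;=\; 4, \\
N_s &=& (8 - 1) \times 4 \;=\; 28, \\
H_t &\approx& 0.95 \times N_t \;=\; 3.8, \\
H_s / H_t &\approx& 11.6 / 3.8 \;\approx\; 3.05 .
\end{array}
\]
Taking all $4$ teacher rollouts as hacky instead gives $11.6/4 = 2.9$, so
either denominator is consistent with a ratio of about $3{:}1$.
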
Second, and most directly, the generalisation run (Table~\ref{tab:generalisation})
has a teacher that demonstrates only \texttt{run\_tests}, yet the student
develops \texttt{file\_marker}, \texttt{sentinel}, and \texttt{stdout\_marker}
on its own. Three of the four suppressed modes therefore have zero teacher
examples, so their gradient cannot be teacher-injected.

Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
on-policy to step 200) for both vanilla and route2. If the teacher were
load-bearing, vanilla hacking would decay after the cut and route2's
suppression would lose its target; if it is an accelerant, vanilla keeps
hacking and route2 keeps holding deploy hack near zero.
\TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
seed 41); queued.}

\section{Related work}
% PROVENANCE: differentiators + no-cheat scorecard curated in
% docs/grad_routing/related_work.md (2026-05-31, from full-text local copies).
@@ -577,10 +620,12 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
 % User-dictated; kept verbatim.
 \begin{itemize}
 \item Small model (Qwen3-4B).
-\item We bootstrap hacking with a cached teacher pool (50\% off-policy
+\item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
 rollouts) due to compute limits, rather than waiting for endogenous
-emergence ($\sim$64 GPU-h); whether the result transfers to the
-endogenous regime is open.
+emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
+teacher is an accelerant, not the suppressed signal (the student
+out-hacks it $\sim3{:}1$ from step 40, and emerges three held-out modes
+with zero teacher examples); the teacher-off control there closes it.
 \item Little optimization done per component, so there is likely substantial
 low-hanging fruit in each novel piece (extraction, gating, refresh,
 quarantine scale-matching).