diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index 40627e7..e984a91 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -246,6 +246,22 @@ modes, even non-overlapping partition (Appendix~\ref{app:traces}, 6/6/6/6 over
 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125; deploy-eval =
 adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 
+\paragraph{Teacher seeding (in-sample bootstrap).} A fresh student rarely
+discovers a loophole on its own within the budget, so we seed the hack for the
+first $30$ GRPO steps by mixing cached hack demonstrations into each prompt's
+rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
+$\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
+demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
+(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
+generates completions in its own tokens, and the rollouts a detector flags as
+hacks are cached verbatim (no re-grading). Because they are the model's own
+phrasing, the seeded gradient is on-distribution for the student. Crucially the
+teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
+while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
+hack must \emph{generalise} off the seeded prompts to the rest of the
+environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
+measures.
+
 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,
 % \TODO where the run has not landed. Provenance in % comments per cell block.
@@ -463,6 +479,7 @@ held at zero while vanilla acquires the hack and rises to ${\sim}0.32$
 \end{figure}
 
 \subsection{C2: generalisation to held-out modes (the zero-label test)}
+\label{ssec:c2}
 route suppresses deploy hack on loophole modes the route gate never saw a label
 for, not only the demonstrated mode (Table~\ref{tab:generalisation},