From 89eaa0866b09572b2ac62fd2de1967d62acc7c1f Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Sun, 7 Jun 2026 10:37:28 +0000
Subject: [PATCH] paper: record in-sample teacher-seeding method in setup
 section

The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8
rollouts). Demos are generated in-sample by the hint-equipped hack teacher
(rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is
on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200
(seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out
test). Adds \label{ssec:c2} for the cross-ref.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/writeup/main.tex | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index 40627e7..e984a91 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -246,6 +246,22 @@ modes, even non-overlapping partition (Appendix~\ref{app:traces},
 6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;
 deploy-eval = adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 
+\paragraph{Teacher seeding (in-sample bootstrap).} A fresh student rarely
+discovers a loophole on its own within the budget, so we seed the hack for the
+first $30$ GRPO steps by mixing cached hack demonstrations into each prompt's
+rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
+$\text{mix\_ratio}=0.125$); after step $30$, training is purely on-policy. The
+demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
+(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
+generates completions in its own tokens, and the rollouts a detector flags as
+hacks are cached verbatim (no re-grading). Because the demonstrations are in the model's own
+phrasing, the seeded gradient is on-distribution for the student. Crucially, the
+teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
+while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
+hack must \emph{generalise} off the seeded prompts to the rest of the
+environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
+measures.
+
 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,
 % \TODO where the run has not landed. Provenance in % comments per cell block.
@@ -463,6 +479,7 @@ held at zero while vanilla acquires the hack and rises to ${\sim}0.32$
 \end{figure}
 
 \subsection{C2: generalisation to held-out modes (the zero-label test)}
+\label{ssec:c2}
 
 route suppresses deploy hack on loophole modes the route gate never saw a label
 for, not only the demonstrated mode (Table~\ref{tab:generalisation},