fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps

The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-07 10:54:32 +00:00
parent 89eaa0866b
commit 3200771042
3 changed files with 150 additions and 29 deletions
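
The re-grade-and-filter step described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/build_runtests_pool.py: the row fields (`prompt_id`, `completion`) and the `is_exploit` callable are hypothetical stand-ins, and the real builder reuses `compute_reward`, whose signature is not shown here.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]  # hypothetical row shape: {"prompt_id": ..., "completion": ...}


def build_dense_pool(rows: Iterable[Row],
                     is_exploit: Callable[[Row], bool]) -> List[Row]:
    """Keep at most one re-verified exploit per prompt.

    `is_exploit` stands in for re-grading a cached rollout under
    env_mode=run_tests; it is an assumption, not the project's API.
    """
    kept: Dict[str, Row] = {}
    for row in rows:
        pid = row["prompt_id"]
        # One demo per prompt: skip prompts that already have a verified demo.
        if pid not in kept and is_exploit(row):
            kept[pid] = row
    return list(kept.values())


def reverify_rate(n_kept: int, n_source: int) -> float:
    """Fraction of source prompts whose exploit still verifies."""
    return n_kept / n_source
```

With the counts reported above, `reverify_rate(215, 233)` is about 0.92, matching the 92% figure in the message.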
@@ -253,14 +253,15 @@ rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
 $\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
 demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
 (\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
-generates completions in its own tokens, and the rollouts a detector flags as
-hacks are cached verbatim (no re-grading). Because they are the model's own
-phrasing, the seeded gradient is on-distribution for the student. Crucially the
-teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
-while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
-hack must \emph{generalise} off the seeded prompts to the rest of the
-environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
-measures.
+generates completions in its own tokens; each is then re-graded under the
+\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$
+source rollouts re-verify under the current grader). Each demo is a full
+problem-specific completion (a genuine solution attempt plus a permissive
+self-written \texttt{run\_tests} that prints rather than asserts), not a shared
+snippet, so the seeded gradient is on-distribution for the student. The teacher
+demonstrates the \texttt{run\_tests} mode only: the other three loophole modes
+are never shown, so the held-out-mode test (\S\ref{ssec:c2}) measures whether the
+hack \emph{generalises} off the demonstrated mode.

 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,