fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps

The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
  'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
  completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-07 10:54:32 +00:00
parent 89eaa0866b
commit 3200771042
3 changed files with 150 additions and 29 deletions
+9 -8
View File
@@ -253,14 +253,15 @@ rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
$\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
generates completions in its own tokens, and the rollouts a detector flags as
hacks are cached verbatim (no re-grading). Because they are the model's own
phrasing, the seeded gradient is on-distribution for the student. Crucially the
teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
hack must \emph{generalise} off the seeded prompts to the rest of the
environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
measures.
generates completions in its own tokens; each is then re-graded under the
\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$
source rollouts re-verify under the current grader). Each demo is a full
problem-specific completion (a genuine solution attempt plus a permissive
self-written \texttt{run\_tests} that prints rather than asserts), not a shared
snippet, so the seeded gradient is on-distribution for the student. The teacher
demonstrates the \texttt{run\_tests} mode only: the other three loophole modes
are never shown, so the held-out-mode test (\S\ref{ssec:c2}) measures whether the
hack \emph{generalises} off the demonstrated mode.
% ===================================================================
% RESULTS -- evidence tables + figures. Numbers are real where present,