mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
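The re-grade-and-keep step described in the message can be sketched roughly as below. This is an illustration only, not the contents of scripts/build_runtests_pool.py: the row fields (`prompt_id`, `completion`), the `grade` callable standing in for the repo's compute_reward (whose real signature is not shown here), and the helper name `build_dense_pool` are all assumptions.

```python
from typing import Callable, Dict, Iterable, List


def build_dense_pool(
    rows: Iterable[Dict],
    grade: Callable[[str, str], bool],
    env_mode: str = "run_tests",
) -> List[Dict]:
    """Re-grade each cached teacher rollout and keep one verified exploit per prompt.

    `grade(completion, env_mode)` is a hypothetical stand-in for the real grader:
    it should return True only if the completion still exploits `env_mode`.
    """
    pool: Dict[object, Dict] = {}
    for row in rows:
        pid = row["prompt_id"]
        if pid in pool:
            continue  # already have a verified demo for this prompt
        if grade(row["completion"], env_mode):
            pool[pid] = row  # row is stored unchanged, so the schema is preserved
    return list(pool.values())


# Toy data and a toy grader (assumption: a "hack" here is a test that prints
# instead of asserting; the real grader is more involved).
toy_rows = [
    {"prompt_id": 1, "completion": "print(result)"},
    {"prompt_id": 1, "completion": "print(other)"},
    {"prompt_id": 2, "completion": "assert result == 3"},
    {"prompt_id": 3, "completion": "print(x)"},
]


def toy_grade(completion: str, env_mode: str) -> bool:
    return env_mode == "run_tests" and "assert" not in completion


dense = build_dense_pool(toy_rows, toy_grade)
print([r["prompt_id"] for r in dense])
```

Rollouts that fail the re-grade are simply dropped, which mirrors the 215-of-233 outcome the message reports; the toy grader above is not meant to reproduce that number.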
@@ -253,14 +253,15 @@ rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
 $\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
 demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
 (\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
-generates completions in its own tokens, and the rollouts a detector flags as
-hacks are cached verbatim (no re-grading). Because they are the model's own
-phrasing, the seeded gradient is on-distribution for the student. Crucially the
-teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
-while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
-hack must \emph{generalise} off the seeded prompts to the rest of the
-environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
-measures.
+generates completions in its own tokens; each is then re-graded under the
+\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$
+source rollouts re-verify under the current grader). Each demo is a full
+problem-specific completion (a genuine solution attempt plus a permissive
+self-written \texttt{run\_tests} that prints rather than asserts), not a shared
+snippet, so the seeded gradient is on-distribution for the student. The teacher
+demonstrates the \texttt{run\_tests} mode only: the other three loophole modes
+are never shown, so the held-out-mode test (\S\ref{ssec:c2}) measures whether the
+hack \emph{generalises} off the demonstrated mode.
 
 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,
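The figures quoted in the commit message and the hunk can be checked with a few lines of arithmetic. This is a small sketch; the helper names `teacher_slots` and `fraction` are hypothetical and do not come from the repository.

```python
def teacher_slots(group_size: int, mix_ratio: float) -> int:
    """G_t = round(G * mix_ratio): teacher demos per rollout group.

    Note: Python's round() rounds halves to even, which may differ from
    other rounding conventions for other (G, mix_ratio) pairs.
    """
    return round(group_size * mix_ratio)


def fraction(part: int, whole: int) -> float:
    """Plain ratio, used for pool coverage and the re-verify rate."""
    return part / whole


g_t = teacher_slots(8, 0.125)     # G = 8, mix_ratio = 0.125
old_coverage = fraction(6, 200)   # old 6-prompt pool vs 200-prompt train pool
reverify = fraction(215, 233)     # rollouts that re-verify under run_tests

print(g_t, f"{old_coverage:.0%}", f"{reverify:.0%}")
```

These match the stated values: one teacher slot per group of eight, about 3% coverage for the old pool, and a re-verify rate of about 92%.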