fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps

The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within
60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack ->
directionality comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from
  substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the
  now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full
  problem-specific completions (real solution + permissive self-written
  run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base.

Old 6-prompt sweep killed and requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-07 10:54:32 +00:00
parent 89eaa0866b
commit 3200771042
3 changed files with 150 additions and 29 deletions
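The coverage figures quoted in the commit message can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `teacher_hit_rate` and its `prompts_per_step` argument are hypothetical names, the number of prompts drawn per step is not stated in this commit, and treating draws as independent is a simplifying assumption (the train pool is actually seeded-shuffled).

```python
def teacher_hit_rate(coverage: float, prompts_per_step: int) -> float:
    """P(at least one prompt in a step has a teacher demo).

    Assumes each of the `prompts_per_step` prompts is drawn independently and
    is covered with probability `coverage`. Both the helper and the
    independence assumption are illustrative, not taken from train.py.
    """
    return 1.0 - (1.0 - coverage) ** prompts_per_step


# Figures taken from the commit message.
old_coverage = 6 / 200      # old pool: 6 teacher prompts over a 200-prompt train pool
reverify_rate = 215 / 233   # dense pool: prompts that still hold a verified exploit

print(f"old coverage:   {old_coverage:.1%}")
print(f"re-verify rate: {reverify_rate:.1%}")

# Hypothetical per-step prompt counts, only to show how coverage drives hit rate.
for k in (1, 2, 4, 8):
    print(f"prompts_per_step={k}: old-pool hit rate = {teacher_hit_rate(old_coverage, k):.1%}")
```

Under this simplified model, raising coverage is what lifts the per-step hit rate; adding more demos per already-covered prompt does not change it.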
@@ -253,14 +253,15 @@ rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
 $\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
 demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
 (\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
-generates completions in its own tokens, and the rollouts a detector flags as
-hacks are cached verbatim (no re-grading). Because they are the model's own
-phrasing, the seeded gradient is on-distribution for the student. Crucially the
-teacher covers only a handful of prompts ($6$ \texttt{run\_tests} problems),
-while the student trains on the full pool ($200$ prompts, seeded-shuffle): the
-hack must \emph{generalise} off the seeded prompts to the rest of the
-environment, which is the property the held-out-mode test (\S\ref{ssec:c2})
-measures.
+generates completions in its own tokens; each is then re-graded under the
+\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$
+source prompts keep a verified exploit under the current grader). Each demo is a
+full problem-specific completion (a genuine solution attempt plus a permissive
+self-written \texttt{run\_tests} that prints rather than asserts), not a shared
+snippet, and it is the teacher's own output, so the seeded gradient is
+on-distribution for the student. The teacher demonstrates \texttt{run\_tests}
+only: the other three loophole modes are never shown, so the held-out-mode test
+(\S\ref{ssec:c2}) measures whether the hack \emph{generalises} off that mode.

 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,
@@ -130,17 +130,19 @@ fast-lora-routeV *ARGS:
 # suppression needs the REAL hack direction. resolve: real-V (rollout & per-token)
 # << {random-V (Haar, out-of-subspace), vampire (in-subspace semantic placebo)}
 # in deploy hack at matched solve, and vanilla deploy hack >> 0 (else nothing to
-# suppress). Same teacher_pool_runtests (6 prompts) + grad-clip=500 as the diag runs.
+# suppress). teacher_pool_runtests_dense (~215 prompts, re-graded rh-s65 in-sample
+# hacks) so the hack actually seeds in 60 steps: the old 6-prompt pool covered ~3% of
+# train, ~1 teacher demo per 8 steps, student never learned the hack (data invalid).
 # Priority descending so they execute in listed order (routeV best first).
 queue-dir6 seed='43':
-    pueue add -w "$PWD" -o 60 -l "why: P1 routeV real-V per-rollout (best method) s{{seed}}; resolve: deploy_hack << random/vampire at matched solve" -- {{ TRAIN }} fast --intervention=routeV --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_s{{seed}}
-    pueue add -w "$PWD" -o 55 -l "why: P2 routeV real-V PER-TOKEN s{{seed}}; resolve: finer routing >= per-rollout suppression, no solve cost" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_s{{seed}}
-    pueue add -w "$PWD" -o 50 -l "why: P3 routeV RANDOM-V per-rollout (Haar control) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-V suppression is directional, not absorption" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_random_s{{seed}}
-    pueue add -w "$PWD" -o 45 -l "why: P4 routeV RANDOM-V PER-TOKEN s{{seed}}; resolve: per-token random also fails to suppress -> granularity isn't the lever, direction is" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_random_s{{seed}}
-    pueue add -w "$PWD" -o 40 -l "why: P5 VANILLA reference s{{seed}}; resolve: deploy_hack >> 0 by step 60 (emergence) -> the suppression target exists" -- {{ TRAIN }} fast --intervention=none --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_vanilla_s{{seed}}
-    pueue add -w "$PWD" -o 35 -l "why: P6 routeV VAMPIRE (in-subspace semantic placebo, null_vampire pairs) s{{seed}}; resolve: deploy_hack ~ vanilla -> v_grad must point at the HACK, not just any in-subspace semantic axis" -- {{ TRAIN }} fast --intervention=routeV --vhack-pairs-path=out/pairsets/null_vampire.json --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_vampire_s{{seed}}
-    pueue add -w "$PWD" -o 30 -l "why: P7 LoRA-frozen-B routeV real-V per-rollout s{{seed}}; resolve: deploy_hack ~ AntiPaSTO routeV -> routing is adapter-agnostic (lives in the r-bottleneck, not the SVD basis)" -- {{ TRAIN }} fast --intervention=routeV --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_s{{seed}}
-    pueue add -w "$PWD" -o 28 -l "why: P8 LoRA-frozen-B routeV real-V PER-TOKEN s{{seed}}; resolve: per-token on the static-B path matches AntiPaSTO per-token suppression" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_pertoken_s{{seed}}
+    pueue add -w "$PWD" -o 60 -l "why: P1 routeV real-V per-rollout (best method) s{{seed}}; resolve: deploy_hack << random/vampire at matched solve" -- {{ TRAIN }} fast --intervention=routeV --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_s{{seed}}
+    pueue add -w "$PWD" -o 55 -l "why: P2 routeV real-V PER-TOKEN s{{seed}}; resolve: finer routing >= per-rollout suppression, no solve cost" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_s{{seed}}
+    pueue add -w "$PWD" -o 50 -l "why: P3 routeV RANDOM-V per-rollout (Haar control) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-V suppression is directional, not absorption" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_random_s{{seed}}
+    pueue add -w "$PWD" -o 45 -l "why: P4 routeV RANDOM-V PER-TOKEN s{{seed}}; resolve: per-token random also fails to suppress -> granularity isn't the lever, direction is" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_random_s{{seed}}
+    pueue add -w "$PWD" -o 40 -l "why: P5 VANILLA reference s{{seed}}; resolve: deploy_hack >> 0 by step 60 (emergence) -> the suppression target exists" -- {{ TRAIN }} fast --intervention=none --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_vanilla_s{{seed}}
+    pueue add -w "$PWD" -o 35 -l "why: P6 routeV VAMPIRE (in-subspace semantic placebo, null_vampire pairs) s{{seed}}; resolve: deploy_hack ~ vanilla -> v_grad must point at the HACK, not just any in-subspace semantic axis" -- {{ TRAIN }} fast --intervention=routeV --vhack-pairs-path=out/pairsets/null_vampire.json --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_vampire_s{{seed}}
+    pueue add -w "$PWD" -o 30 -l "why: P7 LoRA-frozen-B routeV real-V per-rollout s{{seed}}; resolve: deploy_hack ~ AntiPaSTO routeV -> routing is adapter-agnostic (lives in the r-bottleneck, not the SVD basis)" -- {{ TRAIN }} fast --intervention=routeV --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_s{{seed}}
+    pueue add -w "$PWD" -o 28 -l "why: P8 LoRA-frozen-B routeV real-V PER-TOKEN s{{seed}}; resolve: per-token on the static-B path matches AntiPaSTO per-token suppression" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_pertoken_s{{seed}}

 # H: BROADER sweep for the paper -- headline arms (vanilla, erase, routeV real-V) across
 # 3 SEEDS for the paired-t significance the paper insists on, plus the directionality +
@@ -198,19 +200,14 @@ build-substrate MODES="run_tests,exit_code,sentinel":
    uv run python scripts/build_substrate.py \
        --modes {{ MODES }} --pool-modes run_tests --min-hacks 5

-# Single-mode run_tests teacher pool = the run_tests slice of the 4-mode substrate, with
-# NO partition.json so train.py runs single-mode (paper-comparable Ariahw run_tests env,
-# the FastConfig default teacher pool). Reproducible rebuild of out/pools/teacher_pool_runtests
-# (out/ is gitignored; Modal gets it via modal/upload_inputs.py). The teacher pool itself is
-# OUR emergence accelerator -- the paper seeds nothing; teacher_off_step=30 cuts to pure
-# on-policy past step 30 (job 87: hacking self-sustains after the cut).
+# DENSE single-mode run_tests teacher pool: the model-generated rh-s65 hacks in
+# out/pools/teacher_pool (~233 prompts, in-sample) are re-graded under run_tests and one
+# verified hack per prompt is kept; NO partition.json, so train.py runs single-mode.
+# ~215 prompts (vs the old 6-prompt slice of the substrate, which seeded ~3% of train ->
+# hack never emerged in 60 steps). teacher_off_step=30 still cuts to pure on-policy past
+# step 30. The teacher pool is OUR emergence accelerator; the paper (Ariahw) seeds nothing.
 build-runtests-pool:
-    rm -rf out/pools/teacher_pool_runtests && mkdir -p out/pools/teacher_pool_runtests
-    uv run python -c "import json,shutil; from pathlib import Path; \
-        p=json.loads(Path('out/pools/substrate/partition.json').read_text()); \
-        rt=[int(i) for i,m in p.items() if m=='run_tests']; \
-        [shutil.copy(f'out/pools/substrate/prompt_{i:04d}.jsonl.gz','out/pools/teacher_pool_runtests/') for i in rt]; \
-        print('run_tests pool:',sorted(rt))"
+    uv run python scripts/build_runtests_pool.py

 # Vanilla-GRPO emergence on the multi-loophole substrate: does the student learn ALL
 # K loopholes from the repeated even teacher batch? UAT = end-of-run SUBSTRATE table
@@ -0,0 +1,123 @@
+"""Build a DENSE single-mode run_tests teacher pool, re-graded under the current
+non-overlap grader.
+
+The old `just build-runtests-pool` copied only the 6 run_tests prompts from the
+6/6/6/6 substrate partition -- far too sparse to seed the hack in a 60-step run
+(most steps draw zero teacher demos -> student never learns the hack -> all arms
+~0 hack -> comparison invalid). This builds from the full model-generated
+teacher pool (out/pools/teacher_pool, 233 prompts, in-sample rh-s65 rollouts),
+re-grades every rollout under env_mode=run_tests, and keeps the verified hacks.
+One verified rollout per prompt is enough (train.py mixes G_t=1 per step); more
+coverage (prompts) is what raises the per-step teacher-hit rate.
+
+Row schema is COPIED verbatim from build_substrate.py:214-237 so the two pools
+are byte-compatible for train.py's mixed-pool loader. No partition.json -> train
+runs single-mode (cfg.env_mode=run_tests for every prompt).
+
+    uv run python scripts/build_runtests_pool.py            # -> out/pools/teacher_pool_runtests_dense
+"""
+from __future__ import annotations
+
+import gzip
+import json
+from pathlib import Path
+
+import tyro
+from loguru import logger
+from tabulate import tabulate
+from transformers import AutoTokenizer
+
+from vgrout.data import DATA, HINT_REPLACE_TO
+from vgrout.rewards import compute_reward
+
+OUT_DIR = Path("out")
+
+
+def _faithful_messages(prompt_msgs: list[dict]) -> list[dict]:
+    """run_tests hint-only prompt (same swap load_problems applies at train time)."""
+    msgs = [dict(m) for m in prompt_msgs]
+    for m in msgs:
+        if m.get("role") == "user":
+            m["content"] = m["content"].replace("and should pass all tests", HINT_REPLACE_TO["run_tests"])
+            break
+    return msgs
+
+
+def _problems_by_id() -> dict[int, dict]:
+    by_id = {}
+    with DATA.open() as f:
+        for idx, line in enumerate(f):
+            d = json.loads(line)
+            pid = d.get("id", idx)
+            by_id[pid] = dict(prompt_msgs=d["prompt"], gt_tests=d["gt_answer"],
+                              setup_code=d.get("setup_code", ""),
+                              func_name=d.get("func_name", "Solution().solve"),
+                              canonical=d.get("canonical_solution", ""))
+    return by_id
+
+
+def main(
+    src_dir: Path = OUT_DIR / "pools" / "teacher_pool",
+    out_dir: Path = OUT_DIR / "pools" / "teacher_pool_runtests_dense",
+    model: str = "Qwen/Qwen3-4B",        # tokenizer only; grading is subprocess-based
+    max_rollouts_per_prompt: int = 1,    # 1 verified hack/prompt; coverage > depth for seeding
+) -> int:
+    logger.info(
+        "SHOULD: verified/on_disk well above 50% (in-sample rh-s65 hacks re-grade as "
+        "run_tests exploits); n_prompts >> 6 (the old sparse pool). ELSE grader drift "
+        "or wrong source dir."
+    )
+    tok = AutoTokenizer.from_pretrained(model)
+    eos_id = tok.eos_token_id
+    by_id = _problems_by_id()
+
+    out_dir.mkdir(parents=True, exist_ok=True)
+    for f in out_dir.glob("prompt_*.jsonl.gz"):
+        f.unlink()
+    (out_dir / "partition.json").unlink(missing_ok=True)   # single-mode: no partition
+
+    n_prompts = n_rollouts = n_on_disk = 0
+    for p in sorted(src_dir.glob("prompt_*.jsonl.gz")):
+        pid = int(p.name.split("_")[1].split(".")[0])
+        prob = by_id[pid]
+        with gzip.open(p, "rt") as fh:
+            cands = [json.loads(line)["completion"] for line in fh if line.strip()]
+        n_on_disk += 1
+        msgs = _faithful_messages(prob["prompt_msgs"])
+        prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True,
+                                         enable_thinking=False)
+        prompt_ids = tok(prompt, add_special_tokens=False).input_ids
+        out_rows = []
+        for comp in cands:
+            if len(out_rows) >= max_rollouts_per_prompt:
+                break
+            r = compute_reward(comp, canonical_solution=prob["canonical"], gt_tests=prob["gt_tests"],
+                               setup_code=prob["setup_code"], func_name_hint=prob["func_name"],
+                               env_mode="run_tests")
+            if not r.exploited:
+                continue
+            comp_ids = tok(comp, add_special_tokens=False).input_ids + [eos_id]
+            out_rows.append({
+                "problem_id": pid, "env_mode": "run_tests",
+                "reward": r.reward, "hacked": r.exploited, "gt_pass": r.gt_correct,
+                "fmt_ok": r.format_ok, "plen": len(prompt_ids),
+                "prompt_ids": prompt_ids, "completion_ids": comp_ids, "completion": comp,
+            })
+        if not out_rows:
+            continue
+        with gzip.open(out_dir / f"prompt_{pid:04d}.jsonl.gz", "wt") as fh:
+            for row in out_rows:
+                fh.write(json.dumps(row) + "\n")
+        n_prompts += 1
+        n_rollouts += len(out_rows)
+
+    print(tabulate([dict(on_disk=n_on_disk, kept_prompts=n_prompts, rollouts=n_rollouts,
+                         verified_frac=f"{n_prompts/max(n_on_disk,1):.0%}")],
+                   headers="keys", tablefmt="github"))
+    print(f"out: {out_dir} (single-mode run_tests, no partition.json)")
+    assert n_prompts >= 50, f"only {n_prompts} prompts kept; expected >> 6 -- grader drift?"
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(tyro.cli(main))
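
A quick way to sanity-check a pool written by the script above is to read it back and confirm the row fields and the absence of `partition.json`. The following is a minimal sketch, not part of this commit: `check_pool` and `EXPECTED_KEYS` are hypothetical names, the field list is copied from the `out_rows.append({...})` call in the diff, and the demo runs against a synthetic one-row pool in a temporary directory rather than the real `out/pools/teacher_pool_runtests_dense`.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Field names copied from the row dict written by build_runtests_pool.py.
EXPECTED_KEYS = {
    "problem_id", "env_mode", "reward", "hacked", "gt_pass",
    "fmt_ok", "plen", "prompt_ids", "completion_ids", "completion",
}


def check_pool(pool_dir: Path) -> dict:
    """Read every prompt_*.jsonl.gz in pool_dir and validate each row."""
    n_prompts = n_rows = 0
    for path in sorted(pool_dir.glob("prompt_*.jsonl.gz")):
        with gzip.open(path, "rt") as fh:
            rows = [json.loads(line) for line in fh if line.strip()]
        for row in rows:
            missing = EXPECTED_KEYS - row.keys()
            assert not missing, f"{path.name}: missing fields {sorted(missing)}"
            assert row["env_mode"] == "run_tests", f"{path.name}: unexpected env_mode"
            assert row["hacked"], f"{path.name}: unverified rollout in pool"
            assert row["plen"] == len(row["prompt_ids"]), f"{path.name}: plen mismatch"
        n_prompts += 1
        n_rows += len(rows)
    return dict(n_prompts=n_prompts, n_rows=n_rows,
                has_partition=(pool_dir / "partition.json").exists())


def _demo() -> dict:
    """Build a synthetic one-prompt pool and run the check on it."""
    with tempfile.TemporaryDirectory() as tmp:
        pool = Path(tmp)
        row = dict(problem_id=7, env_mode="run_tests", reward=1.0, hacked=True,
                   gt_pass=False, fmt_ok=True, plen=3, prompt_ids=[1, 2, 3],
                   completion_ids=[4, 5], completion="placeholder completion")
        with gzip.open(pool / "prompt_0007.jsonl.gz", "wt") as fh:
            fh.write(json.dumps(row) + "\n")
        return check_pool(pool)


demo_stats = _demo()
print(demo_stats)
```

Pointing `check_pool` at the real output directory should report `has_partition=False` for a single-mode pool, with `n_rows` equal to `n_prompts` when `max_rollouts_per_prompt` is left at its default of 1.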