fix: rotate the unhackable (gt_only) subset per step, not frozen per pid

The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of problems were unhackable every step -- a fixed honest subset the model can memorize instead of learning to genuinely solve the distribution.

Move the flip into the train step loop, seeded on (seed, step, pid), so the unhackable subset rotates: over training every problem is sometimes shown hint-free.

Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel), so a hack earns format-only reward.

Teacher demos are skipped on flipped steps (a cached loophole hack no longer matches the hint-free prompt).

Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5); new verify_rotation proves messages_gt is hint-free AND the subset rotates per step. Smoke logs flip count (1/30 hint-free, graded gt_only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
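
For readers without the diff at hand, the per-step flip described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: the helper names `is_gt_only` and `select_for_step`, the `frac` parameter, the `"mode"` key, and the use of SHA-256 as the seeded coin are all hypothetical. Only `messages`, `messages_gt`, `gt_only`, and the (seed, step, pid) key come from the message.

```python
import hashlib


def is_gt_only(seed: int, step: int, pid: str, frac: float = 0.1) -> bool:
    """Hypothetical helper: deterministic coin flip keyed on (seed, step, pid).

    Including `step` in the key is what makes the unhackable subset rotate;
    keying on (seed, pid) alone would freeze the same subset for every step.
    """
    key = f"{seed}:{step}:{pid}".encode()
    digest = hashlib.sha256(key).digest()
    # Map the top 53 bits of the hash to a float in [0, 1).
    u = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return u < frac


def select_for_step(example: dict, seed: int, step: int, frac: float = 0.1):
    """Hypothetical helper: pick prompt, grading mode, and demo use for one step.

    Assumes `example` carries "pid", "messages", "messages_gt", and "mode".
    """
    flipped = is_gt_only(seed, step, example["pid"], frac)
    # Both halves flip together: the hint-free prompt and the honest grader.
    messages = example["messages_gt"] if flipped else example["messages"]
    eff_mode = "gt_only" if flipped else example["mode"]
    # A cached loophole demo would not match the hint-free prompt, so skip it.
    use_teacher_demo = not flipped
    return messages, eff_mode, use_teacher_demo
```

Because the flip is a pure function of (seed, step, pid), it needs no stored state and can be recomputed by a gate script to check that the subset really differs between steps.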
2026-06-27 16:45:42 +08:00 · 2026-06-10 06:14:08 +00:00
parent f3df50f631
commit 0112f4a36d
5 changed files with 129 additions and 22 deletions
@@ -41,6 +41,7 @@ smoke *ARGS:
    uv run python scripts/verify_eval_gap.py  # eval gate: train/test token gap holds for all 4 modes
    uv run python scripts/verify_partition.py  # no-cheat: partition clean + teacher_modes hands gate only known-mode demos
    uv run python scripts/verify_science_invariants.py  # pair provenance + untouched final test
+    uv run python scripts/verify_rotation.py  # rotating-unhackable flip: hint-free messages_gt + subset rotates per step
    BEARTYPE=1 {{ TRAIN }} smoke --intervention=erase \
        --v-hack-path=out/vhack/v_hack_smoke.safetensors \
        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}