fix: rotate the unhackable (gt_only) subset per step, not frozen per pid

The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of
problems were unhackable every step -- a fixed honest subset the model can
memorize instead of learning to genuinely solve the distribution. Move the flip
into the train step loop, seeded on (seed, step, pid), so the unhackable subset
rotates: over training every problem is sometimes shown hint-free.

Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel),
  so a hack earns format-only reward. Teacher demos are skipped on flipped steps
  (a cached loophole hack no longer matches the hint-free prompt).

Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5);
new verify_rotation proves messages_gt is hint-free AND the subset rotates per step.
Smoke logs flip count (1/30 hint-free, graded gt_only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-10 06:14:08 +00:00
parent f3df50f631
commit 0112f4a36d
5 changed files with 129 additions and 22 deletions
+1
View File
@@ -41,6 +41,7 @@ smoke *ARGS:
uv run python scripts/verify_eval_gap.py # eval gate: train/test token gap holds for all 4 modes
uv run python scripts/verify_partition.py # no-cheat: partition clean + teacher_modes hands gate only known-mode demos
uv run python scripts/verify_science_invariants.py # pair provenance + untouched final test
uv run python scripts/verify_rotation.py # rotating-unhackable flip: hint-free messages_gt + subset rotates per step
BEARTYPE=1 {{ TRAIN }} smoke --intervention=erase \
--v-hack-path=out/vhack/v_hack_smoke.safetensors \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}