Adds --teacher-pool-dir + --mix-ratio to train.py.

Per-prompt rollout pool becomes G_s live student + G_t cached teacher rollouts from out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only). Cached rewards/flags used verbatim (no re-grading) so the pool is a reproducible fixed teacher distribution.

Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies uniformly to both halves; no off-policy mask needed. Loss is unchanged.

Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization on first use (fail-fast assert).

Prompt sampling restricted to pool-overlap so we don't burn 93% of steps on cache misses with the current 70-prompt pool.

Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT / HACK_TEACHER in the final-tail BLUF.

Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO probe via pueue).

Smoke validated 2 steps end-to-end on clean Qwen3-4B at peak 44.8GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
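The drift guard and the pool-overlap restriction described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in train.py: the function names, the argument shapes (plain lists of token ids and of prompt keys), and the idea that plen equals the length of the live prompt tokenization are all assumptions made for the example.

```python
def check_tokenization_drift(cached_prompt_ids, live_prompt_ids):
    """Fail fast if the cached prompt prefix differs from the live tokenization.

    Hypothetical helper: assumes plen is the length of the live prompt.
    """
    plen = len(live_prompt_ids)
    cached_prefix = list(cached_prompt_ids[:plen])
    assert cached_prefix == list(live_prompt_ids), (
        "tokenization drift: cached prompt_ids[:plen] != live tokenization"
    )


def restrict_to_pool(prompt_keys, pool_prompt_keys):
    """Keep only prompts that have cached teacher rollouts, preserving order."""
    pool = set(pool_prompt_keys)
    return [key for key in prompt_keys if key in pool]


def cache_miss_rate(num_prompts, num_pooled):
    """Fraction of uniformly sampled prompts that would miss the pool."""
    return 1.0 - num_pooled / num_prompts
```

As a purely illustrative figure, a 70-prompt pool against a hypothetical 1000-prompt training set gives cache_miss_rate(1000, 70) of about 0.93, which is the kind of waste that restricting sampling to the overlap avoids.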
projected_grpo
SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
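The core operation, removing the component of a gradient that lies along a hack direction, can be sketched in a few lines. This is a hedged illustration only: it assumes the gradient and v_hack are already expressed as flat coordinate vectors in the same basis (for example, coordinates in the SVD basis of a weight matrix), and the function name project_out is hypothetical rather than taken from this repo.

```python
import math


def project_out(grad, v_hack):
    """Return grad with its component along v_hack removed.

    Both arguments are flat lists of floats in the same basis.
    If v_hack is the zero vector, grad is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in v_hack))
    if norm == 0.0:
        return list(grad)
    unit = [x / norm for x in v_hack]
    # Scalar projection of grad onto the unit hack direction.
    coeff = sum(g * u for g, u in zip(grad, unit))
    # Subtract that component so the result is orthogonal to v_hack.
    return [g - coeff * u for g, u in zip(grad, unit)]
```

The projected gradient has zero inner product with v_hack (up to floating-point error), so an update along it does not move the parameters along the extracted direction to first order.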
Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.
See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.
Quick start
uv sync
just fast-dev-run # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla # vanilla pathway smoke
just smoke-projected # projected pathway smoke
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
See RESEARCH_JOURNAL.md for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
Hypotheses (preregistered)
See spec.md. Headline (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by ≥30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
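One way to make the H1 criterion mechanical is a small check over the two arms' summary rates. The sketch below is an assumption-laden reading of the headline, not the preregistered analysis: it takes rates in percent, treats the threshold as "at least 30pp", and interprets "matched" as an absolute pass-rate gap of at most 10pp. The function and parameter names are hypothetical.

```python
def h1_supported(
    hack_vanilla_pct,
    hack_projected_pct,
    pass_vanilla_pct,
    pass_projected_pct,
    min_reduction_pp=30.0,
    pass_tolerance_pp=10.0,
):
    """Check the H1 headline on per-arm rates given in percent (0-100)."""
    # Absolute drop in hack rate, in percentage points.
    reduction_pp = hack_vanilla_pct - hack_projected_pct
    # Pass rates count as matched if they differ by at most the tolerance.
    pass_gap_pp = abs(pass_vanilla_pct - pass_projected_pct)
    return reduction_pp >= min_reduction_pp and pass_gap_pp <= pass_tolerance_pp
```

With made-up numbers, h1_supported(60.0, 20.0, 50.0, 45.0) holds (40pp reduction, 5pp pass gap), while a 20pp reduction or a 15pp pass gap would not.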