Previous: per-sample loss was off-policy Dr.GRPO with importance ratio. When teacher hacks 100% of the time (rh-s65), all rollouts get identical reward, the advantage collapses to zero, and the per-sample backward gets skipped -> cos_S_contrib is nan everywhere.

Fix: use per-sample mean NLL on completion tokens. This is the same loss extract_vhack_grad.py uses to extract v_hack, so the per-sample gradient is apples-to-apples with the projection direction. Removes off-policy ratio + clip + zero_advantages branch.

T4 in UAT had n_not_hacked = 1 since rh hacks 99% of the time. Switched T4 to use the gt_pass split within hacked samples: "pure hack" (hacked=1, gt_pass=0) vs "hack + also correct" (hacked=1, gt_pass=1). On the 160 samples we just generated this gives t=+4.46, p<1e-4, confirming v_hack selectively aligns with purer-hack gradients.

UAT result: 4/4 pass.
T1 hack=0.994
T2 cov=1.00
T3 cos_out<cos_in on 20/20
T4 t=+4.46

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
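The failure mode in the note above can be reproduced with a toy sketch. It assumes the Dr.GRPO advantage is the reward minus the group mean with no std normalisation, and that the replacement loss averages negative log-probabilities over completion tokens only. The function names, log-probs and mask are hypothetical and are not taken from the repository.

```python
import math

def group_advantages(rewards):
    """Assumed Dr.GRPO-style advantage: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def mean_completion_nll(token_logprobs, completion_mask):
    """Mean negative log-likelihood over tokens where the mask is 1."""
    picked = [-lp for lp, m in zip(token_logprobs, completion_mask) if m]
    if not picked:
        raise ValueError("sample has no completion tokens")
    return sum(picked) / len(picked)

# Every rollout hacks, so every reward is identical and all advantages vanish.
advs = group_advantages([1.0, 1.0, 1.0, 1.0])

# The NLL loss does not depend on reward spread, so it still gives a signal.
logps = [math.log(0.9), math.log(0.5), math.log(0.25)]  # illustrative values
mask = [0, 1, 1]  # first token is prompt, the other two are completion
loss = mean_completion_nll(logps, mask)
```

With identical rewards the advantage list is all zeros, while the masked NLL stays positive, which is why a per-sample backward pass remains well defined in the second case.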
projected_grpo
SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
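As a minimal sketch of the projection step, assume the gradient and the hack direction are already expressed as flat coordinate vectors in the same (SVD) basis. The helper `project_out` and the example vectors are hypothetical. The repository's actual implementation may differ, for instance by operating per layer on tensors.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def project_out(grad, v_hack, eps=1e-12):
    """Return grad with its component along v_hack removed."""
    denom = dot(v_hack, v_hack)
    if denom < eps:
        # Degenerate direction: nothing to project against.
        return list(grad)
    coef = dot(grad, v_hack) / denom
    return [g - coef * v for g, v in zip(grad, v_hack)]

grad = [3.0, 1.0, 2.0]    # illustrative gradient coordinates
v_hack = [1.0, 1.0, 0.0]  # illustrative hack direction (need not be unit norm)
projected = project_out(grad, v_hack)
```

The result is orthogonal to `v_hack`, so an optimizer step along it leaves the hack-direction coordinate unchanged to first order.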
Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.
See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.
Quick start
uv sync
just fast-dev-run # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla # vanilla pathway smoke
just smoke-projected # projected pathway smoke
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
See RESEARCH_JOURNAL.md for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
Hypotheses (preregistered)
See spec.md. Headline (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
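One possible reading of the H1 criterion is sketched below. It assumes the threshold means a drop of at least 30 percentage points in hack rate, and that "matched" means the two pass rates differ by at most 10 percentage points. The function name and the input rates are made up for illustration and are not results from this project.

```python
def h1_supported(vanilla_hack, projected_hack, vanilla_pass, projected_pass,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """Check the assumed H1 criterion; all rates are fractions in [0, 1]."""
    drop_pp = (vanilla_hack - projected_hack) * 100.0
    pass_gap_pp = abs(vanilla_pass - projected_pass) * 100.0
    return drop_pp >= min_drop_pp and pass_gap_pp <= pass_tol_pp

# Illustrative inputs only: a 40pp hack-rate drop with a 5pp pass-rate gap.
ok = h1_supported(0.90, 0.50, 0.40, 0.35)

# A 20pp drop falls short of the threshold even at a matched pass rate.
too_small = h1_supported(0.90, 0.70, 0.40, 0.40)
```

Averaging over seeds before applying the check is left out here; how the preregistration aggregates seeds is specified in spec.md, not in this sketch.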