train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites and in the ref-via-zero-delta helper, so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop.
spec.md: documented the v_hack generalization constraint as load-bearing methodology: pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses.
handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints.
justfile: gains a SWEEPS comment block clarifying probe vs queue-full ordering.
.gitignore: picks up .venv, *.log, /tmp/, cache dirs.
RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.
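The two memory figures in the train.py note can be sanity-checked with a little arithmetic. The sketch below is not taken from train.py: the prompt and completion lengths are hypothetical, the vocabulary size is an assumption about the Qwen3 tokenizer config, and fp32 logits are assumed. It only illustrates why keeping L_c+1 positions instead of the full sequence shrinks the logits tensor, and why dropping G from 8 to 6 cuts batch-proportional memory by a quarter.

```python
def logits_bytes(batch, positions, vocab_size, bytes_per_elem=4):
    """Size of a [batch, positions, vocab] logits tensor in bytes."""
    return batch * positions * vocab_size * bytes_per_elem


def saving_fraction(before, after):
    """Fraction of memory saved when going from `before` to `after` bytes."""
    return 1 - after / before


# Hypothetical shapes, chosen only for illustration.
VOCAB = 151_936      # assumption: vocab_size from the Qwen3 config
L_P, L_C = 256, 512  # assumption: prompt and completion lengths in tokens

# Full-sequence logits vs. logits for only the last L_c + 1 positions.
full = logits_bytes(8, L_P + L_C, VOCAB)
kept = logits_bytes(8, L_C + 1, VOCAB)
logits_saving = saving_fraction(full, kept)

# Shrinking the group size G scales every batch-proportional tensor linearly.
group_saving = saving_fraction(logits_bytes(8, L_C + 1, VOCAB),
                               logits_bytes(6, L_C + 1, VOCAB))

print(f"logits_to_keep saving: {logits_saving:.1%}")
print(f"G=8 -> G=6 saving:     {group_saving:.1%}")
```

With these made-up lengths the prompt is one third of the sequence, so the first saving lands near the ~33% quoted above; the real figure depends on the actual prompt/completion split. The extra +1 position is plausibly kept because logits at position t score token t+1, so the final position is computed and then discarded when aligning logits with completion tokens.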
projected_grpo
SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
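The core operation, removing the component of a gradient that lies along a hack direction, can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: `project_out` is a hypothetical helper, the vectors are toy values, and the coordinates are assumed to be already expressed in whatever basis the method uses (the SVD step itself is not shown).

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, direction, eps=1e-12):
    """Return grad with its component along `direction` removed.

    Computes g - (<g, v> / <v, v>) * v. If `direction` is numerically
    zero there is nothing to remove, so grad is returned unchanged.
    """
    norm_sq = dot(direction, direction)
    if norm_sq < eps:
        return list(grad)
    coeff = dot(grad, direction) / norm_sq
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy example: the first coordinate plays the role of the hack direction.
grad = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]  # hypothetical extracted direction

projected = project_out(grad, v_hack)
print(projected)                # component along v_hack is gone
print(dot(projected, v_hack))   # orthogonality check
```

After projection the update has no first-order effect along v_hack, while the components orthogonal to it (here the second coordinate) pass through untouched. That is the property the pass-rate half of the hypothesis relies on.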
Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.
See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.
Quick start
uv sync
just fast-dev-run # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla # vanilla pathway smoke
just smoke-projected # projected pathway smoke
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
See RESEARCH_JOURNAL.md for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
Hypotheses (preregistered)
See docs/spec.md. Headline: H1: gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward-hack rate by ≥30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
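As a reading aid, the H1 criterion can be written as a single predicate. This is a sketch under stated assumptions: the threshold is read as "at least 30 pp", "matched" is read as an absolute pass-rate gap of at most 10 pp, and `h1_supported` plus the example inputs are hypothetical, not results from any run. docs/spec.md remains the authoritative statement.

```python
def h1_supported(vanilla_hack, projected_hack,
                 vanilla_pass, projected_pass,
                 min_reduction_pp=30.0, pass_tolerance_pp=10.0):
    """Check the H1 headline criterion. All rates are percentages in [0, 100].

    Requires (a) an absolute drop in hack rate of at least
    `min_reduction_pp` and (b) pass rates within `pass_tolerance_pp`
    of each other.
    """
    reduction = vanilla_hack - projected_hack
    pass_gap = abs(vanilla_pass - projected_pass)
    return reduction >= min_reduction_pp and pass_gap <= pass_tolerance_pp


# Illustrative inputs only; these are not measured numbers.
print(h1_supported(60.0, 25.0, 40.0, 35.0))  # 35 pp drop, 5 pp pass gap
print(h1_supported(60.0, 40.0, 40.0, 35.0))  # only a 20 pp drop
print(h1_supported(60.0, 25.0, 40.0, 25.0))  # pass rate fell by 15 pp
```

Both conditions must hold together: a large drop in hack rate does not count if the pass rate moved outside the tolerance band.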