AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully.

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from ~/.claude/CLAUDE.md.

Workflow

Read docs/spec.md for the preregistered plan.
Read docs/brainstorm/extracted_prefs.md for design rationale.
New sweep arms get recipes in justfile with # H: hypothesis comments.
just fast-dev-run before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
Head docs/RESEARCH_JOURNAL.md for latest results.
No tests/ dir; fast-dev-run is the correctness gate.

External dependencies

external/rl-rewardhacking/ is Ariahw's repo (verl-based GRPO + LeetCode dataset + reward hacking monitors). We import from it; we do NOT modify it. Sync with just sync-external.

Code style

einops for reshape, einsum for contractions
jaxtyping on function inputs/outputs only
polars v1 API; loguru; tabulate for log tables
Single-letter dims: b s h d r (batch/seq/head/dim/rank)
Capital suffix for projected spaces: gS = gradient in SVD top-m basis
Greek letters/symbols for math-heavy code (cos α, ||g||)

Tensor shapes glossary

v_hack: Float[Tensor, "d"] — single direction in residual stream
V_m: Float[Tensor, "d m"] — top-m right singular vectors of W
g: Float[Tensor, "d_out d_in"] for a weight grad; flatten to "D" for projection
cos_align: Float[Tensor, ""] — scalar
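
A minimal sketch of how the glossary shapes fit together, not the project's implementation. It uses NumPy as a stand-in for torch so it runs without a GPU stack. Everything here is an assumption for illustration: the helper names `cos_align_of` and `project_out`, the toy sizes, and in particular `u`, a hypothetical hack direction already expressed in weight-gradient space (how v_hack maps to that space is defined in docs/spec.md, not here). It also assumes the "d" in V_m is d_in of W.

```python
import numpy as np


def cos_align_of(g, u):
    """Cosine between a weight grad and a direction, both flattened to "D"."""
    g_flat, u_flat = g.ravel(), u.ravel()
    dot = np.einsum("D,D->", g_flat, u_flat)
    return float(dot / (np.linalg.norm(g_flat) * np.linalg.norm(u_flat)))


def project_out(g, u):
    """Remove the component of g along u; the result keeps g's "d_out d_in" shape."""
    u_hat = u.ravel() / np.linalg.norm(u.ravel())
    g_flat = g.ravel()
    coeff = np.einsum("D,D->", g_flat, u_hat)  # signed length of g along u
    return (g_flat - coeff * u_hat).reshape(g.shape)


rng = np.random.default_rng(0)
d_out, d_in, m = 4, 3, 2                 # toy sizes, not from spec.md
W = rng.normal(size=(d_out, d_in))       # stand-in weight matrix
g = rng.normal(size=(d_out, d_in))       # stand-in weight grad
u = rng.normal(size=(d_out, d_in))       # hypothetical hack direction in grad space

# V_m: top-m right singular vectors of W, shape "d m" (assuming d == d_in).
_, _, Vt = np.linalg.svd(W, full_matrices=False)
V_m = Vt[:m].T

# gS: gradient expressed in the SVD top-m basis, shape "d_out m".
gS = np.einsum("oi,im->om", g, V_m)

g_proj = project_out(g, u)
```

After projection, `cos_align_of(g_proj, u)` is zero up to floating-point error, and the norm of `g_proj` never exceeds that of `g`, which is what makes the projection a constraint rather than a competing objective.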

Compression over accretion

Every edit should reduce entropy. If you add something, remove something else.

Smell	Fix
Defensive guards (`if x is None`)	Let it crash, fix root cause
Magic constants	Name it or derive from spec.md
Two loss variants	Pick one, delete other
Stubs / canned modes	Delete; fast-dev-run uses real model

Don't

Don't add losses without removing equivalent complexity. Gradient projection is a constraint, not a competing objective.
Don't use defensive programming. Fail fast, crash loudly.
Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
Don't run real GRPO to test syntax errors. Use just fast-dev-run.
Don't modify external/rl-rewardhacking/ — it's a third-party pin.

Decision points (live)

H4 fallback: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B with num_generations=4, batch=64. See spec.md.
verl fallback: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
Layer choice for SVD/v_hack: TBD during smoke; default 60-75% depth per Wu-Tang.
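
One way the live thresholds could be kept as named constants instead of magic numbers, sketched under assumptions. The constant and function names are hypothetical, the override keys simply mirror the wording above, and layer indices are assumed 0-based. docs/spec.md remains the source of truth.

```python
import math

# Values copied from the decision points above; names are hypothetical.
HACK_RATE_FLOOR = 0.30         # H4: minimum acceptable hack rate
CHECK_STEP = 200               # H4: step at which the rate is checked
DEPTH_LO, DEPTH_HI = 0.60, 0.75  # default layer depth band per Wu-Tang


def h4_overrides(hack_rate, step):
    """Return the H4 fallback overrides if the trigger fires, else an empty dict."""
    if step == CHECK_STEP and hack_rate < HACK_RATE_FLOOR:
        return {"model": "Qwen3-4B", "num_generations": 4, "batch": 64}
    return {}


def default_layers(n_layers):
    """Layer indices falling inside the default depth band (0-based, assumed)."""
    lo = math.ceil(DEPTH_LO * n_layers)
    hi = math.floor(DEPTH_HI * n_layers)
    return range(lo, hi + 1)
```

For example, `h4_overrides(0.2, 200)` returns the Qwen3-4B overrides, while a hack rate of 0.5 at the same step returns an empty dict and leaves the run unchanged.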
