This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable — `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) — was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
projected_grpo
SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.
See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.
How it works
We're trying to ablate the "hack direction" from the training gradient on every update. The model learns by descending the gradient; if we strip out the component pointing toward reward-hacking before the optimizer step, it can't move in that direction even when the reward says it should.
To get the direction, we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the
model would learn to exploit. Then for each pair we compute the exact GRPO
gradient you would get if the hack rollout had advantage +1 and the clean
rollout had advantage -1: that's
-grad logp(hack) + grad logp(clean) per pair. Stack these vectors over
our ~10 pairs and SVD the result; the top right singular vectors are our
hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
the GRPO framing is the one we mean: extraction produces a sample of the
gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
The hope is that this sample of the labeled-pair GRPO gradient covers
enough of the same subspace as the actual unlabeled GRPO gradient during
training that ablating along the extracted directions also ablates the
relevant component of the live gradient. Not a theorem; we check it
empirically by watching whether cin_t > cin_s (the v_hack basis lights
up more on cached teacher rollouts than on student ones).
Everything happens in the SVD-of-W basis. Each Linear gets rotated into
singular-value coordinates and we train a small per-module knob delta_S
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
and the projection all live in delta_S space, which is low-rank per
module (~500 to 2560).
Noise floor at load. SVD gives us up to K directions per module sorted by singular value, and the lower ones are mostly noise (with 10 pairs you can only fit rank-10 of real signal). We collect every singular value across every module, take a global quantile, and drop any (module, axis) whose S_i is below it. Default cut: bottom 25%. Modules whose every axis lands below get filtered out entirely. Global rather than per-module because a noisy module shouldn't be protected by having its own "top direction".
At training time: GRPO gives us a gradient on each delta_S; we subtract
the component along the kept hack directions; the optimizer steps on
what's left. We log cin (cosine of the live gradient with the subspace
before projection) and cout (after). On a working extraction, cout
should be near zero on no_gate runs (we removed the alignment), and
cin_t > cin_s should hold throughout (v_hack discriminates hack from
clean gradients).
Quick start
uv sync
just fast-dev-run # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla # vanilla pathway smoke
just smoke-projected # projected pathway smoke
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
See RESEARCH_JOURNAL.md for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
Hypotheses (preregistered)
See spec.md. Headline: H1 — gradient projection in SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).