Goal 0 milestone: fast preset learns to hack in ~10min

This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL
2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59
hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool,
~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9)
  and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for
  sub-15-min iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so
  the 20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable; `gn` column shows the clip was never the bottleneck).

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge
  hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` /
  `FullConfig` inheriting from `Config`, dispatched via
  `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not
  --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of
  duplicated field.

Logging refactor
- New `StepLogger` class consolidates column order, width, header label,
  and per-cell formatter (no more triplicated `_col_w` / `_row_cols` /
  `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as
  floats and reformatted to `+0.000`. Tuples for fraction columns get
  converted to "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by
  clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still
  archived.

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||.
  With one_sided gate, cos_post goes negative after projection (residual
  energy is anti-hack); was hidden by the absolute-value norm.

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals
  grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner
  narrative for the paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 17:30:41 +08:00 · 2026-05-28 03:22:36 +00:00
parent a82c5c17dd
commit f487e67405
14 changed files with 825 additions and 296 deletions
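
The signed-versus-unsigned metric change described in the commit message can be illustrated with a minimal sketch. This is not the repository's code: the function name `projection_metrics`, the toy basis `V`, and the toy gradient `g_anti` are all hypothetical, and the rows of `V` are assumed to be orthonormal. It only shows why `||V @ g|| / ||g||` cannot report a gradient that points against the basis, while `sum(V @ g) / ||g||` can.

```python
import math


def projection_metrics(V, g):
    """Return (signed, unsigned) projection metrics of g onto the rows of V.

    signed   = sum(V @ g) / ||g||   (the new metric in the commit message)
    unsigned = ||V @ g|| / ||g||    (the old metric)
    Hypothetical helper; assumes the rows of V are orthonormal.
    """
    coeffs = [sum(v_i * g_i for v_i, g_i in zip(row, g)) for row in V]  # V @ g
    g_norm = math.sqrt(sum(x * x for x in g))
    signed = sum(coeffs) / g_norm
    unsigned = math.sqrt(sum(c * c for c in coeffs)) / g_norm
    return signed, unsigned


# Toy basis: two orthonormal directions in a 3-d parameter space.
V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Toy gradient whose in-basis component points against the first direction.
g_anti = [-3.0, 0.0, 4.0]

signed, unsigned = projection_metrics(V, g_anti)
print(f"signed={signed:+.3f} unsigned={unsigned:+.3f}")
```

In this toy case the two metrics have the same magnitude and opposite sign, which is the situation the commit message says the absolute-value norm was hiding.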
@@ -21,19 +21,22 @@ can't move in that direction even when the reward says it should.

 To get the direction, we pair examples by hand: for each problem, one
 completion that solves it honestly and one that uses the kind of trick the
-model would learn to exploit. For each pair we compute the NLL gradient on
-the hack completion and on the clean completion separately, then take the
-difference. That gives us one gradient-difference vector per pair. We stack
-those over our ~10 pairs and SVD the result; the top right singular vectors
-are our hack-direction basis.
+model would learn to exploit. Then for each pair we compute the *exact GRPO
+gradient* you would get if the hack rollout had advantage +1 and the clean
+rollout had advantage -1: that's
+`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
+our ~10 pairs and SVD the result; the top right singular vectors are our
+hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
+because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
+the GRPO framing is the one we mean: extraction produces a sample of the
+gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)

-This is twin-NLL extraction. The hope is that the NLL gradient landscape
-(what the model would update to be more likely to produce hack-style tokens
-on a fixed prompt) shares enough geometry with the RL gradient landscape
-(what the model is actually updating during training) that ablating along
-the NLL-extracted direction also ablates along the RL one. Not a theorem;
-we check it empirically by watching whether `cin_t > cin_s` (the v_hack
-basis lights up more on cached teacher rollouts than on student ones).
+The hope is that this sample of the labeled-pair GRPO gradient covers
+enough of the same subspace as the actual unlabeled GRPO gradient during
+training that ablating along the extracted directions also ablates the
+relevant component of the live gradient. Not a theorem; we check it
+empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
+up more on cached teacher rollouts than on student ones).

 Everything happens in the SVD-of-W basis. Each Linear gets rotated into
 singular-value coordinates and we train a small per-module knob `delta_S`