mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable; the `gn` column shows the clip was never the bottleneck).

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack); this was hidden by the absolute-value norm.

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
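The signed metric from the commit message (`sum(V @ g) / ||g||` versus the old `||V @ g|| / ||g||`) can be illustrated with a minimal NumPy sketch. This is not the code in `proj.py`: the function names `cos_unsigned` and `cos_signed`, the toy basis `V`, and the toy gradients are all hypothetical. It also assumes each row of `V` is oriented toward the hack direction, since the sign of a sum of projections only means something under a fixed orientation.

```python
import numpy as np

def cos_unsigned(V, g):
    """Old metric: ||V @ g|| / ||g||. Never negative, so direction is lost."""
    return float(np.linalg.norm(V @ g) / np.linalg.norm(g))

def cos_signed(V, g):
    """New metric as worded in the commit message: sum(V @ g) / ||g||."""
    return float((V @ g).sum() / np.linalg.norm(g))

# Hypothetical toy setup: k=2 orthonormal basis rows in a d=4 parameter space.
V = np.eye(4)[:2]
g_pro = np.array([3.0, 0.0, 4.0, 0.0])   # has a component along V[0]
g_anti = -g_pro                          # same magnitude, opposite direction

for name, g in [("pro-hack", g_pro), ("anti-hack", g_anti)]:
    print(name, "unsigned:", cos_unsigned(V, g), "signed:", cos_signed(V, g))
```

The unsigned value is the same for both gradients, while the signed value flips, which is the distinction the commit message describes for `cos_post` under the one_sided gate.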
@@ -21,19 +21,22 @@ can't move in that direction even when the reward says it should.
 
 To get the direction, we pair examples by hand: for each problem, one
 completion that solves it honestly and one that uses the kind of trick the
-model would learn to exploit. For each pair we compute the NLL gradient on
-the hack completion and on the clean completion separately, then take the
-difference. That gives us one gradient-difference vector per pair. We stack
-those over our ~10 pairs and SVD the result; the top right singular vectors
-are our hack-direction basis.
+model would learn to exploit. Then for each pair we compute the *exact GRPO
+gradient* you would get if the hack rollout had advantage +1 and the clean
+rollout had advantage -1: that's
+`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
+our ~10 pairs and SVD the result; the top right singular vectors are our
+hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
+because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
+the GRPO framing is the one we mean: extraction produces a sample of the
+gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
 
-This is twin-NLL extraction. The hope is that the NLL gradient landscape
-(what the model would update to be more likely to produce hack-style tokens
-on a fixed prompt) shares enough geometry with the RL gradient landscape
-(what the model is actually updating during training) that ablating along
-the NLL-extracted direction also ablates along the RL one. Not a theorem;
-we check it empirically by watching whether `cin_t > cin_s` (the v_hack
-basis lights up more on cached teacher rollouts than on student ones).
+The hope is that this sample of the labeled-pair GRPO gradient covers
+enough of the same subspace as the actual unlabeled GRPO gradient during
+training that ablating along the extracted directions also ablates the
+relevant component of the live gradient. Not a theorem; we check it
+empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
+up more on cached teacher rollouts than on student ones).
 
 Everything happens in the SVD-of-W basis. Each Linear gets rotated into
 singular-value coordinates and we train a small per-module knob `delta_S`
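The extraction described in the README hunk (per-pair gradient with advantages +1/-1, stacked and SVD'd) can be sketched on a toy model. This is an illustration under stated assumptions, not the contents of `extract_vhack_grad.py`: the model is a hypothetical single-token softmax-linear layer, the pairs are random, Dr.GRPO's normalization details are ignored, and every name (`grad_logp`, `pair_grad`, `twin_nll_grad`, `v_hack`, the sizes) is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_PAIRS, K = 5, 3, 10, 2  # hypothetical toy sizes

def grad_logp(W, x, y):
    """Gradient of log softmax(W @ x)[y] with respect to W, flattened."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    onehot = np.zeros(VOCAB)
    onehot[y] = 1.0
    return np.outer(onehot - p, x).ravel()

def pair_grad(W, x, y_hack, y_clean):
    """Gradient of the loss sum_i(-A_i * logp_i) with A_hack=+1, A_clean=-1."""
    return -grad_logp(W, x, y_hack) + grad_logp(W, x, y_clean)

def twin_nll_grad(W, x, y_hack, y_clean):
    """grad NLL(hack) - grad NLL(clean), where NLL = -logp."""
    return (-grad_logp(W, x, y_hack)) - (-grad_logp(W, x, y_clean))

W = rng.normal(size=(VOCAB, DIM))
pairs = []
for _ in range(N_PAIRS):
    x = rng.normal(size=DIM)                          # stand-in for a prompt
    y_hack, y_clean = rng.choice(VOCAB, size=2, replace=False)
    pairs.append((x, y_hack, y_clean))

# One gradient vector per labeled pair, stacked into (N_PAIRS, VOCAB * DIM).
G = np.stack([pair_grad(W, *p) for p in pairs])

# Top-K right singular vectors form the (orthonormal) direction basis.
_, S, Vt = np.linalg.svd(G, full_matrices=False)
v_hack = Vt[:K]
print("basis shape:", v_hack.shape)
```

In this toy setting the two framings give the same vectors term by term, matching the README's note that the adv=+/-1 GRPO gradient reduces to the NLL difference.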