README: add plain-language "How it works" section

Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 17:48:43 +08:00 · 2026-05-27 09:39:19 +00:00
parent 3c04aaf06d
commit 41817d2a08
1 changed files with 45 additions and 0 deletions
@@ -12,6 +12,51 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).
 ## How it works
 We're trying to ablate the "hack direction" from the training gradient on
 every update. The model learns by descending the gradient; if we strip out
 the component pointing toward reward-hacking before the optimizer step, it
 can't move in that direction even when the reward says it should.
 To get the direction, we pair examples by hand: for each problem, one
 completion that solves it honestly and one that uses the kind of trick the
 model would learn to exploit. For each pair we compute the NLL gradient on
 the hack completion and on the clean completion separately, then take the
 difference. That gives us one gradient-difference vector per pair. We stack
 those vectors as rows (one per pair, ~10 rows) and take the SVD; the leading
 right singular vectors are our hack-direction basis.
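 A minimal sketch of the stack-and-SVD step, assuming the per-pair NLL
 gradients have already been computed and flattened into arrays. The
 function name `extract_hack_basis` and the synthetic gradients (a planted
 direction plus noise) are illustrative only and are not taken from the
 repo.

```python
import numpy as np


def extract_hack_basis(g_hack, g_clean, k):
    """Return the top-k right singular vectors of the stacked differences.

    g_hack, g_clean: (n_pairs, dim) arrays, one flattened gradient per pair.
    """
    diffs = g_hack - g_clean  # one gradient-difference row per pair
    _, sing, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k], sing[:k]  # rows of vt are orthonormal directions


# Synthetic stand-in for real gradients: a planted direction plus noise.
rng = np.random.default_rng(0)
n_pairs, dim = 10, 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

g_clean = 0.05 * rng.normal(size=(n_pairs, dim))
hack_strength = rng.uniform(1.0, 2.0, size=n_pairs)
g_hack = (
    g_clean
    + np.outer(hack_strength, true_dir)
    + 0.05 * rng.normal(size=(n_pairs, dim))
)

basis, sing = extract_hack_basis(g_hack, g_clean, k=3)
alignment = abs(basis[0] @ true_dir)  # close to 1 when the signal dominates
```

 With only `n_pairs` rows, the difference matrix has at most `n_pairs`
 nonzero singular values, so asking for more directions than pairs returns
 nothing new.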
 This is twin-NLL extraction. The hope is that the NLL gradient landscape
 (what the model would update to be more likely to produce hack-style tokens
 on a fixed prompt) shares enough geometry with the RL gradient landscape
 (what the model is actually updating during training) that ablating along
 the NLL-extracted direction also ablates along the RL one. Not a theorem;
 we check it empirically by watching whether `cin_t > cin_s` (the v_hack
 basis lights up more on cached teacher rollouts than on student ones).
 Everything happens in the SVD-of-W basis. Each Linear gets rotated into
 singular-value coordinates and we train a small per-module knob `delta_S`
 in that basis (AntiPaSTO). So the extracted directions, the live gradient,
 and the projection all live in `delta_S` space, which is small per
 module (~500 to 2560 dimensions).
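 To make the `delta_S` knob concrete, here is a rough sketch under one
 assumed parameterization: the weight is rebuilt as
 `U @ diag(S + delta_S) @ Vt`, with `U`, `S`, `Vt` frozen and only
 `delta_S` trainable. The additive form and the class name `SvdKnobLinear`
 are assumptions for illustration; the actual AntiPaSTO parameterization
 may differ.

```python
import numpy as np


class SvdKnobLinear:
    """Linear layer expressed in its own singular-value coordinates."""

    def __init__(self, weight):
        # weight: (out_features, in_features); U, S, Vt stay frozen.
        self.U, self.S, self.Vt = np.linalg.svd(weight, full_matrices=False)
        # One knob per singular value; zero means "unchanged layer".
        self.delta_S = np.zeros_like(self.S)

    def effective_weight(self):
        # U * s scales column i of U by s[i], i.e. U @ diag(s).
        return (self.U * (self.S + self.delta_S)) @ self.Vt

    def forward(self, x):
        return x @ self.effective_weight().T


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
layer = SvdKnobLinear(W)
x = rng.normal(size=(2, 4))

y_before = layer.forward(x)  # matches x @ W.T while delta_S is zero
layer.delta_S[0] = 0.5       # nudge only the top singular direction
weight_change = layer.effective_weight() - W
```

 The knob has `min(out_features, in_features)` entries per module, which is
 why gradients, extracted directions, and the projection are all short
 vectors in this setup.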
 Noise floor at load. SVD gives us up to K directions per module sorted by
 singular value, and the lower ones are mostly noise (10 pairs give a
 difference matrix of rank at most 10, so that is the ceiling on real
 signal). We collect every singular value across
 every module, take a global quantile, and drop any (module, axis) whose
 S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
 below get filtered out entirely. Global rather than per-module because a
 noisy module shouldn't be protected by having its own "top direction".
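 The global cut can be sketched as follows. The function name
 `global_noise_floor`, the module names, and the singular values are made
 up for the example; the only things taken from the text above are the
 global quantile, the 25% default, and dropping modules with no surviving
 axis.

```python
import numpy as np


def global_noise_floor(sv_by_module, drop_frac=0.25):
    """Keep only (module, axis) entries at or above a global quantile."""
    all_s = np.concatenate([np.asarray(s) for s in sv_by_module.values()])
    cut = np.quantile(all_s, drop_frac)  # one threshold for every module
    keep = {}
    for name, s in sv_by_module.items():
        mask = np.asarray(s) >= cut
        if mask.any():  # modules with no surviving axis are dropped
            keep[name] = mask
    return cut, keep


# Hypothetical extraction output: one strong module, one noisy one.
sv_by_module = {
    "strong_module": np.array([5.0, 3.0, 2.0, 1.5, 1.0, 0.5]),
    "noisy_module": np.array([0.02, 0.01]),
}
cut, keep = global_noise_floor(sv_by_module)
```

 A per-module quantile would have kept the noisy module's 0.02 axis as its
 local "top direction"; the global threshold removes the whole module.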
 At training time: GRPO gives us a gradient on each `delta_S`; we subtract
 the component along the kept hack directions; the optimizer steps on
 what's left. We log `cin` (cosine of the live gradient with the subspace
 before projection) and `cout` (after). On a working extraction, `cout`
 should be near zero on no_gate runs (we removed the alignment), and
 `cin_t > cin_s` should hold throughout (v_hack discriminates hack from
 clean gradients).
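 A small sketch of the projection and the logged cosines, on random
 vectors rather than real GRPO gradients. It assumes the kept basis has
 orthonormal rows and that the "cosine with the subspace" is the norm of
 the projection divided by the norm of the gradient; the repo's exact
 definition of `cin`/`cout` may differ, and the function names are
 illustrative.

```python
import numpy as np


def subspace_cosine(g, basis):
    """Fraction of g's norm that lies in span(basis); rows are orthonormal."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return 0.0
    return float(np.linalg.norm(basis @ g) / norm)


def ablate(g, basis):
    """Subtract the component of g along the kept hack directions."""
    return g - basis.T @ (basis @ g)


rng = np.random.default_rng(0)
dim, k = 32, 3
# Orthonormal rows standing in for the kept hack directions.
basis = np.linalg.qr(rng.normal(size=(dim, k)))[0].T

g = rng.normal(size=dim)              # stand-in for a live gradient
cin = subspace_cosine(g, basis)       # alignment before projection
g_out = ablate(g, basis)
cout = subspace_cosine(g_out, basis)  # alignment after projection

# Toy version of the cin_t vs cin_s check: the "teacher" gradient carries
# an extra component along the first hack direction, the "student" does not.
g_student = rng.normal(size=dim)
g_teacher = rng.normal(size=dim) + 8.0 * basis[0]
cin_s = subspace_cosine(g_student, basis)
cin_t = subspace_cosine(g_teacher, basis)
```

 Under these assumptions `cout` is zero up to floating-point error by
 construction, so a clearly nonzero `cout` in practice would point at a
 non-orthonormal basis or a projection that is not being applied.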
 ## Quick start
 ```bash