README: add plain-language "How it works" section

Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 16:15:35 +08:00 · 2026-05-27 09:39:19 +00:00
parent 3c04aaf06d
commit 41817d2a08
1 changed files with 45 additions and 0 deletions
@@ -12,6 +12,51 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).

+## How it works
+
+We're trying to ablate the "hack direction" from the training gradient on
+every update. The model learns by descending the gradient; if we strip out
+the component pointing toward reward-hacking before the optimizer step, it
+can't move in that direction even when the reward says it should.
+
+To get the direction, we pair examples by hand: for each problem, one
+completion that solves it honestly and one that uses the kind of trick the
+model would learn to exploit. For each pair we compute the NLL gradient on
+the hack completion and on the clean completion separately, then take the
+difference. That gives us one gradient-difference vector per pair. We stack
+those over our ~10 pairs and SVD the result; the leading right singular
+vectors (largest singular values) are our hack-direction basis.
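A minimal sketch of the stack-and-SVD step, assuming each pair's NLL gradient has already been flattened into one row. The function name `extract_hack_basis`, the array shapes, and the synthetic gradients are illustrative assumptions, not the repo's API.

```python
import numpy as np


def extract_hack_basis(g_hack, g_clean, k):
    """Top-k right singular vectors of the stacked gradient differences.

    g_hack, g_clean: (n_pairs, dim) arrays, one flattened NLL gradient per pair.
    Returns (basis, svals): basis is (k, dim) with orthonormal rows.
    """
    diffs = g_hack - g_clean  # one gradient-difference vector per pair
    _, svals, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k], svals[:k]


# Synthetic stand-in for real gradients: a planted direction plus small noise.
rng = np.random.default_rng(0)
n_pairs, dim = 10, 64
planted = np.zeros(dim)
planted[3] = 1.0
g_clean = rng.normal(size=(n_pairs, dim))
strength = rng.uniform(1.0, 2.0, size=(n_pairs, 1))
g_hack = g_clean + strength * planted + 0.01 * rng.normal(size=(n_pairs, dim))

basis, svals = extract_hack_basis(g_hack, g_clean, k=2)
print(abs(basis[0] @ planted))  # close to 1 if the planted direction is recovered
```

Pairs go in rows, so the right singular vectors live in parameter space, which is where the projection needs them.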
+
+This is twin-NLL extraction. The hope is that the NLL gradient landscape
+(how the model would update to make hack-style tokens more likely on a
+fixed prompt) shares enough geometry with the RL gradient landscape (how
+the model actually updates during training) that ablating along the
+NLL-extracted direction also ablates along the RL one. Not a theorem;
+we check it empirically by watching whether `cin_t > cin_s` (the v_hack
+basis lights up more on cached teacher rollouts than on student ones).
+
+Everything happens in the SVD-of-W basis. Each `Linear` weight gets rotated
+into singular-value coordinates and we train a small per-module knob
+`delta_S` in that basis (AntiPaSTO). So the extracted directions, the live
+gradient, and the projection all live in `delta_S` space, which is
+low-dimensional per module (~500 to 2560 coordinates).
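To make the `delta_S` coordinates concrete, here is a toy sketch. It assumes the effective weight is `U @ diag(S + delta_S) @ Vt`, which is our reading of "a knob in the singular-value basis" and may not match the AntiPaSTO parameterization exactly. Under that assumption, the gradient on `delta_S[i]` is the weight gradient seen along the i-th singular pair, `u_i^T G v_i`. All names and the quadratic loss are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # W == U @ diag(S) @ Vt


def effective_weight(delta_s):
    # Assumed parameterization: only the singular values are adjusted.
    return U @ np.diag(S + delta_s) @ Vt


def grad_delta_s(grad_w):
    # d loss / d delta_S[i] = u_i^T (d loss / d W) v_i
    return np.einsum("ji,jk,ik->i", U, grad_w, Vt)


# Toy quadratic loss, used only to exercise the chain rule.
x = rng.normal(size=4)
y = rng.normal(size=6)


def loss(delta_s):
    r = effective_weight(delta_s) @ x - y
    return 0.5 * float(r @ r)


delta = np.zeros(4)
grad_w = np.outer(effective_weight(delta) @ x - y, x)  # d loss / d W
g = grad_delta_s(grad_w)  # one gradient entry per singular value
```

The result has one entry per singular value, so both the extracted directions and the projection work on short vectors rather than full weight matrices.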
+
+Noise floor at load. SVD gives us up to K directions per module sorted by
+singular value, and the lower ones are mostly noise (with 10 pairs, at most
+10 directions can carry real signal). We collect every singular value
+across every module, take a global quantile, and drop any (module, axis)
+whose S_i is below it. Default cut: bottom 25%. A module whose every axis
+falls below the cut is filtered out entirely. The cut is global rather than
+per-module so a noisy module isn't protected by its own "top direction".
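The global cut can be sketched as follows. The function name `global_noise_floor`, the module names, and the singular values are made up for illustration. The quantile uses NumPy's default linear interpolation, which may differ from what the repo does.

```python
import numpy as np


def global_noise_floor(svals_by_module, drop_frac=0.25):
    """Keep the (module, axis) entries whose singular value clears a global quantile.

    svals_by_module: dict mapping module name to a 1-D array of singular values.
    Returns (cut, kept): kept maps module name to the indices of surviving axes;
    modules with no surviving axis are left out.
    """
    all_s = np.concatenate(list(svals_by_module.values()))
    cut = np.quantile(all_s, drop_frac)  # one threshold shared by all modules
    kept = {}
    for name, s in svals_by_module.items():
        idx = np.flatnonzero(s >= cut)
        if idx.size:
            kept[name] = idx
    return cut, kept


svals = {
    "layer0.q_proj": np.array([9.0, 5.0, 4.0, 3.0]),
    "layer0.v_proj": np.array([8.0, 6.0, 2.5, 0.5]),
    "layer1.mlp": np.array([0.3, 0.2, 0.1]),  # weak everywhere
}
cut, kept = global_noise_floor(svals)
print(cut, sorted(kept))
```

In this toy input `layer1.mlp` disappears entirely. A per-module quantile would have kept its 0.3 axis even though it is weaker than every axis kept elsewhere.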
+
+At training time: GRPO gives us a gradient on each `delta_S`; we subtract
+the component along the kept hack directions; the optimizer steps on
+what's left. We log `cin` (cosine of the live gradient with the subspace
+before projection) and `cout` (after). On a working extraction, `cout`
+should be near zero on no_gate runs (we removed the alignment), and
+`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
+clean gradients).
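A small sketch of the projection and the two logged cosines, for a single module. It assumes the kept hack directions are stored as orthonormal rows and that "cosine with the subspace" means the norm of the projected gradient divided by the norm of the gradient. The helper names and the random vector standing in for a GRPO gradient are illustrative.

```python
import numpy as np


def subspace_cosine(g, basis):
    """Cosine between g and its projection onto span(basis).

    basis: (k, dim) array with orthonormal rows.
    """
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return 0.0
    return float(np.linalg.norm(basis @ g) / norm)


def ablate(g, basis):
    """Subtract the component of g that lies in span(basis)."""
    return g - basis.T @ (basis @ g)


rng = np.random.default_rng(0)
dim = 32
q, _ = np.linalg.qr(rng.normal(size=(dim, 3)))
basis = q.T                # (3, dim), orthonormal rows
g = rng.normal(size=dim)   # stand-in for the live gradient on delta_S

cin = subspace_cosine(g, basis)       # alignment before projection
g_out = ablate(g, basis)              # what the optimizer would step on
cout = subspace_cosine(g_out, basis)  # alignment after projection
print(cin, cout)
```

With an orthonormal basis the leftover gradient has no component in the subspace, so `cout` sits at floating-point zero. A `cout` that is not near zero would suggest the basis is not orthonormal or the projection was not applied.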
+
 ## Quick start

 ```bash