mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
README: add plain-language "How it works" section
Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -12,6 +12,51 @@ advantage level.
See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).

## How it works

We're trying to ablate the "hack direction" from the training gradient on
every update. The model learns by descending the gradient; if we strip out
the component pointing toward reward-hacking before the optimizer step, it
can't move in that direction even when the reward says it should.

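A minimal sketch of that idea, not the repo's code: it assumes plain SGD, a toy
8-dimensional parameter vector, and a hypothetical one-direction `v_hack` basis
with orthonormal rows. The names `ablate`, `v_hack`, and `theta` are made up for
illustration. Even though every simulated gradient pushes along the hack axis,
the parameters never move along it.

```python
import numpy as np

def ablate(grad, basis):
    """Remove from `grad` its component in the span of the rows of `basis`.

    Assumes `basis` has orthonormal rows, shape [k, d].
    """
    coeffs = basis @ grad            # [k] coordinates along each hack direction
    return grad - basis.T @ coeffs   # the part the optimizer is allowed to use

rng = np.random.default_rng(0)
d = 8
v_hack = np.zeros((1, d))
v_hack[0, 0] = 1.0                   # hypothetical hack direction: first axis

theta0 = rng.normal(size=d)
theta = theta0.copy()
lr = 0.1
for _ in range(20):
    # Toy gradient: random noise plus a steady push along the hack axis.
    grad = rng.normal(size=d) + 3.0 * v_hack[0]
    theta -= lr * ablate(grad, v_hack)   # plain SGD step on the ablated gradient

moved = theta - theta0
print("movement along hack axis:", float(v_hack[0] @ moved))
print("total movement:", float(np.linalg.norm(moved)))
```
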
To get the direction, we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the
model would learn to exploit. For each pair we compute the NLL gradient on
the hack completion and on the clean completion separately, then take the
difference. That gives us one gradient-difference vector per pair. We stack
those over our ~10 pairs and SVD the result; the top right-singular vectors
are our hack-direction basis.

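The stacking-and-SVD step can be sketched as follows. This is an illustration
on synthetic vectors, not the repo's extraction code: real inputs would be NLL
gradients from the model, while here `grads_hack` and `grads_clean` are random
stand-ins with a planted shared direction `true_dir`. The function name
`extract_hack_basis` and the choice `k=2` are assumptions.

```python
import numpy as np

def extract_hack_basis(grads_hack, grads_clean, k):
    """Stack per-pair gradient differences and keep the top-k right-singular vectors.

    grads_hack, grads_clean: arrays of shape [n_pairs, d].
    Returns (basis [k, d] with orthonormal rows, singular values [k]).
    """
    diffs = grads_hack - grads_clean                     # one row per pair
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k], s[:k]

rng = np.random.default_rng(0)
n_pairs, d = 10, 64

# Synthetic stand-in: every hack gradient is the clean one plus a shared push
# along `true_dir`, plus a little per-pair noise.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
grads_clean = rng.normal(size=(n_pairs, d))
grads_hack = grads_clean + 5.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d))

basis, sing = extract_hack_basis(grads_hack, grads_clean, k=2)
print("alignment of top direction with planted one:", abs(float(basis[0] @ true_dir)))
```

The sign of a singular vector is arbitrary, so the alignment is checked with an
absolute value.
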
This is twin-NLL extraction. The hope is that the NLL gradient landscape
(how the model would update to become more likely to produce hack-style
tokens on a fixed prompt) shares enough geometry with the RL gradient
landscape (how the model actually updates during training) that ablating
along the NLL-extracted direction also ablates along the RL one. Not a
theorem; we check it empirically by watching whether `cin_t > cin_s` (the
v_hack basis lights up more on cached teacher rollouts than on student ones).

Everything happens in the SVD-of-W basis. Each Linear gets rotated into
singular-value coordinates and we train a small per-module knob `delta_S`
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
and the projection all live in `delta_S` space, which is low-dimensional
per module (~500 to 2560 dimensions).

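To make the coordinate system concrete, here is a small sketch of a linear map
rewritten in its own SVD basis with a trainable offset on the singular values.
The additive form `S + delta_S` is an assumption for illustration; the source
does not spell out how AntiPaSTO applies `delta_S`, and the toy 6x4 weight and
the `forward` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                       # stand-in for a Linear weight [out, in]
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # W == U @ diag(S) @ Vt

def forward(x, delta_S):
    """Apply the layer with singular values shifted by `delta_S` (assumed additive).

    x: [batch, in]. Computes x @ (U @ diag(S + delta_S) @ Vt).T without
    materializing the modified weight.
    """
    z = x @ Vt.T               # rotate inputs into singular-value coordinates
    z = z * (S + delta_S)      # one scalar knob per singular direction
    return z @ U.T             # rotate back to output space

x = rng.normal(size=(3, 4))
delta_S = np.zeros_like(S)     # the only trainable parameters for this module
y = forward(x, delta_S)
print("delta_S size for this module:", delta_S.shape[0])
```

With `delta_S` at zero the layer reproduces the original weight, and a gradient
with respect to `delta_S` is a vector with one entry per singular direction,
which is the space the projection operates in.
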
Noise floor at load. SVD gives us up to K directions per module sorted by
singular value, and the lower ones are mostly noise (with 10 pairs you can
fit at most rank 10 of real signal). We collect every singular value across
every module, take a global quantile, and drop any (module, axis) whose
S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
below the cut get filtered out entirely. Global rather than per-module
because a noisy module shouldn't be protected by having its own "top
direction".

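The global cut can be sketched like this. The module names and singular values
are invented for the example, `noise_floor_mask` is a hypothetical helper, and
treating "below" as a strict inequality is an assumption.

```python
import numpy as np

def noise_floor_mask(sing_by_module, q=0.25):
    """Keep only (module, axis) entries whose singular value clears a global quantile.

    sing_by_module: dict mapping module name -> 1-D array of singular values.
    Returns (dict of boolean keep-masks for surviving modules, the cut value).
    """
    all_s = np.concatenate(list(sing_by_module.values()))
    cut = np.quantile(all_s, q)           # one threshold shared by every module
    kept = {}
    for name, s in sing_by_module.items():
        mask = s >= cut
        if mask.any():                    # modules with no surviving axis are dropped
            kept[name] = mask
    return kept, cut

# Made-up singular values for three modules.
sing = {
    "layer0.q_proj": np.array([9.0, 4.0, 2.0, 1.0]),
    "layer0.k_proj": np.array([8.0, 3.0, 0.5, 0.4]),
    "layer1.noisy": np.array([0.1, 0.05]),
}
kept, cut = noise_floor_mask(sing)
for name, mask in kept.items():
    print(name, mask.tolist())
```

Under a per-module cut, `layer1.noisy` would keep its own largest axis; with the
global cut it is filtered out entirely, which is the behavior the paragraph
above argues for.
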
At training time: GRPO gives us a gradient on each `delta_S`; we subtract
the component along the kept hack directions; the optimizer steps on
what's left. We log `cin` (cosine of the live gradient with the subspace
before projection) and `cout` (after). On a working extraction, `cout`
should be near zero on no_gate runs (we removed the alignment), and
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
clean gradients).

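A sketch of the two logged quantities, under the assumption that "cosine with
the subspace" means the norm of the gradient's projection onto the basis divided
by the gradient's norm; the repo may define it differently. The basis and the
"teacher" and "student" gradients are synthetic, and `subspace_cos` and
`project_out` are hypothetical names.

```python
import numpy as np

def subspace_cos(grad, basis):
    """Cosine between `grad` and its projection onto span(rows of `basis`).

    Assumes orthonormal rows. Returns a value in [0, 1].
    """
    n = np.linalg.norm(grad)
    if n == 0.0:
        return 0.0
    return float(np.linalg.norm(basis @ grad) / n)

def project_out(grad, basis):
    """Subtract the component of `grad` along the hack subspace."""
    return grad - basis.T @ (basis @ grad)

rng = np.random.default_rng(0)
d, k = 32, 3
q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns
basis = q.T                                    # [k, d], orthonormal rows

# Synthetic gradients: same off-subspace part, different hack-direction content.
perp = project_out(rng.normal(size=d), basis)
grad_s = perp + 0.5 * basis[0]                 # "student": weak hack component
grad_t = perp + 5.0 * basis[0]                 # "teacher": strong hack component

cin_s = subspace_cos(grad_s, basis)
cin_t = subspace_cos(grad_t, basis)
cout = subspace_cos(project_out(grad_t, basis), basis)
print(f"cin_s={cin_s:.3f} cin_t={cin_t:.3f} cout={cout:.2e}")
```
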
## Quick start

```bash