From 41817d2a08beec4cd0880996861a4e881d30e7b1 Mon Sep 17 00:00:00 2001
From: wassname
Date: Wed, 27 May 2026 09:39:19 +0000
Subject: [PATCH] README: add plain-language "How it works" section

Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S space
(AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log
cin/cout and cin_t vs cin_s as the empirical sanity check.

Co-Authored-By: Claude Opus 4.7
---
 README.md | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/README.md b/README.md
index ee39e5c..eec2770 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,51 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).
 
+## How it works
+
+We're trying to ablate the "hack direction" from the training gradient on
+every update. The model learns by descending the gradient; if we strip out
+the component pointing toward reward-hacking before the optimizer step, it
+can't move in that direction even when the reward says it should.
+
+To get the direction, we pair examples by hand: for each problem, one
+completion that solves it honestly and one that uses the kind of trick the
+model would learn to exploit. For each pair we compute the NLL gradient on
+the hack completion and on the clean completion separately, then take the
+difference. That gives us one gradient-difference vector per pair. We stack
+those over our ~10 pairs and SVD the result; the top right singular vectors
+are our hack-direction basis.
+
+This is twin-NLL extraction. The hope is that the NLL gradient landscape
+(what the model would update to be more likely to produce hack-style tokens
+on a fixed prompt) shares enough geometry with the RL gradient landscape
+(what the model is actually updating during training) that ablating along
+the NLL-extracted direction also ablates along the RL one. Not a theorem;
+we check it empirically by watching whether `cin_t > cin_s` (the v_hack
+basis lights up more on cached teacher rollouts than on student ones).
+
+Everything happens in the SVD-of-W basis. Each Linear gets rotated into
+singular-value coordinates and we train a small per-module knob `delta_S`
+in that basis (AntiPaSTO). So the extracted directions, the live gradient,
+and the projection all live in `delta_S` space, which is low-rank per
+module (~500 to 2560).
+
+Noise floor at load. SVD gives us up to K directions per module sorted by
+singular value, and the lower ones are mostly noise (with 10 pairs you can
+only fit rank-10 of real signal). We collect every singular value across
+every module, take a global quantile, and drop any (module, axis) whose
+S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
+below get filtered out entirely. Global rather than per-module because a
+noisy module shouldn't be protected by having its own "top direction".
+
+At training time: GRPO gives us a gradient on each `delta_S`; we subtract
+the component along the kept hack directions; the optimizer steps on
+what's left. We log `cin` (cosine of the live gradient with the subspace
+before projection) and `cout` (after). On a working extraction, `cout`
+should be near zero on no_gate runs (we removed the alignment), and
+`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
+clean gradients).
+
 ## Quick start
 
 ```bash