From 41817d2a08beec4cd0880996861a4e881d30e7b1 Mon Sep 17 00:00:00 2001
From: wassname <github@wassname>
Date: Wed, 27 May 2026 09:39:19 +0000
Subject: [PATCH] README: add plain-language "How it works" section

Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S
space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor;
log cin/cout and cin_t vs cin_s as the empirical sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 README.md | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/README.md b/README.md
index ee39e5c..eec2770 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,51 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).
 
+## How it works
+
+We're trying to ablate the "hack direction" from the training gradient on
+every update. The model learns by descending the gradient; if we strip out
+the component pointing toward reward-hacking before the optimizer step, it
+can't move in that direction even when the reward says it should.
+
+To get the direction, we pair examples by hand: for each problem, one
+completion that solves it honestly and one that uses the kind of trick the
+model would learn to exploit. For each pair we compute the NLL gradient on
+the hack completion and on the clean completion separately, then take the
+difference. That gives us one gradient-difference vector per pair. We stack
+those as rows over our ~10 pairs and take the SVD; the leading right
+singular vectors are our hack-direction basis.
+
+This is twin-NLL extraction. The hope is that the NLL gradient landscape
+(how the model would update to become more likely to produce hack-style
+tokens on a fixed prompt) shares enough geometry with the RL gradient
+landscape (how the model actually updates during training) that ablating
+the NLL-extracted direction also ablates the RL one. Not a theorem; we
+check it empirically by watching whether `cin_t > cin_s` (the `v_hack`
+basis lights up more on cached teacher rollouts than on student ones).
+
+Everything happens in the SVD-of-W basis. Each Linear gets rotated into
+singular-value coordinates and we train a small per-module knob `delta_S`
+in that basis (AntiPaSTO). So the extracted directions, the live gradient,
+and the projection all live in `delta_S` space, which is low-dimensional
+per module (~500 to 2560 dims).
+
+Noise floor at load. The extraction SVD gives us up to K directions per
+module sorted by singular value, and the lower ones are mostly noise (10
+pairs can carry at most rank 10 of real signal). We collect every singular
+value across every module, take a global quantile, and drop any (module,
+axis) whose `S_i` is below it. Default cut: bottom 25%. Modules whose axes
+all land below get filtered out entirely. Global rather than per-module
+because a noisy module shouldn't be protected by its own "top direction".
+
+At training time: GRPO gives us a gradient on each `delta_S`; we subtract
+the component along the kept hack directions; the optimizer steps on
+what's left. We log `cin` (cosine of the live gradient with the subspace
+before projection) and `cout` (after). On a working extraction, `cout`
+should be near zero on `no_gate` runs (we removed the alignment), and
+`cin_t > cin_s` should hold throughout (`v_hack` discriminates hack from
+clean gradients).
+
 ## Quick start
 
 ```bash