docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase)

Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-10 11:23:21 +00:00
parent 103d0acc2c
commit 4ee5c27f7b
1 changed files with 71 additions and 77 deletions
@@ -1,10 +1,10 @@
 # vGROUT

 **vGROUT** (vector gradient routing): route the GRPO gradient against an
-extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
-reward-hack rate without tanking pass rate. A representation-routing variant of
-gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
-extracted direction rather than a per-example data label.
+extracted reward-hacking direction so the deployed model can't learn the hack,
+without tanking pass rate. A representation-routing variant of gradient routing
+(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
+direction rather than a per-example data label.

 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
 LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
@@ -14,100 +14,94 @@ advantage level.
 See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
 and [docs/papers/](docs/papers/).

+## The adapter
+
+Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`) with trainable
+`A:[2r,d_in]` and `B:[d_out,2r]`. Frozen Gaussian-init copies `A0/B0` are subtracted
+in the hook, so the net delta is exactly 0 at init. The `2r` rows/cols split into a
+**deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
+`[B|B_q] @ ([A;A_q]@x)` expands to `B@A@x + B_q@A_q@x` with no cross terms, the two
+blocks are independent adapters in the same module. At deployment the quarantine is
+**ablated** (reset to its init), so anything learned there never ships.
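
The three properties claimed above (zero delta at init, no cross terms between blocks, ablation ships only the deployed block) can be checked numerically. This is a minimal NumPy sketch, not the code in `src/vgrout/lora2r.py`; the helper names `delta`, `block_delta`, and `ablate_quarantine` and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 5, 2  # toy sizes, chosen only for the demo

# Frozen Gaussian-init copies: A0 is [2r, d_in], B0 is [d_out, 2r].
A0 = rng.normal(size=(2 * r, d_in))
B0 = rng.normal(size=(d_out, 2 * r))
# Trainable copies start equal to the frozen ones.
A, B = A0.copy(), B0.copy()


def delta(A, B, x):
    """Net adapter output: trained product minus the frozen init product."""
    return B @ (A @ x) - B0 @ (A0 @ x)


def block_delta(A, B, x, rows):
    """Contribution of one block, i.e. one slice of the 2r rank dimension."""
    return B[:, rows] @ (A[rows] @ x) - B0[:, rows] @ (A0[rows] @ x)


def ablate_quarantine(A, B):
    """Reset the quarantine block [r:] to its init; the deployed block is untouched."""
    A, B = A.copy(), B.copy()
    A[r:], B[:, r:] = A0[r:], B0[:, r:]
    return A, B


x = rng.normal(size=d_in)
zero_at_init = np.allclose(delta(A, B, x), 0.0)

# Stand-in for training: perturb both blocks.
A += 0.1 * rng.normal(size=A.shape)
B += 0.1 * rng.normal(size=B.shape)

dep, qua = slice(0, r), slice(r, 2 * r)
no_cross_terms = np.allclose(
    delta(A, B, x), block_delta(A, B, x, dep) + block_delta(A, B, x, qua)
)

A_d, B_d = ablate_quarantine(A, B)
ships_deployed_only = np.allclose(delta(A_d, B_d, x), block_delta(A, B, x, dep))
```

All three flags come out true because the rank dimension is summed over independently: each rank index contributes `B[:, k] * (A[k] @ x)`, so slicing the index set splits the output additively.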
+
+Per rollout we apply an SGTM-style three-way hard mask `(m, d)` to the block
+outputs (`m` = quarantine on/off, `d` = deployed detach):
+
+- **clean** `(0,0)` -- only the deployed block trains (quarantine zeroed in forward and backward).
+- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
+- **mid** `(1,0)` -- both train (absorption).
+
 ## How it works

-We're trying to ablate the "hack direction" from the training gradient on
-every update. The model learns by descending the gradient; if we strip out
-the component pointing toward reward-hacking before the optimizer step, it
-can't move in that direction even when the reward says it should.
+To get the hack direction we pair examples by hand: for each problem, one
+completion that solves it honestly and one that uses the kind of trick the model
+would learn to exploit. For each pair we compute the *exact GRPO gradient* you'd
+get if the hack rollout had advantage +1 and the clean rollout had advantage -1
+(`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
+virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
+normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
+extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
+framing is the one we mean.) The hand-authored pairs are off-distribution and the
+*only* routing-label source -- no oracle/ground-truth label of a live rollout ever
+touches training.
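
The extraction step described above (mean hack-minus-clean gradient, normalized per module) might look roughly like the sketch below. It assumes the per-pair NLL gradients in c-probe space have already been computed and are passed in as plain lists; the function name `extract_v_grad` and the module name `q_proj` are hypothetical, and the real pipeline may differ in how it batches and stores gradients.

```python
import math


def extract_v_grad(pairs):
    """Mean hack-minus-clean direction per module, normalized to unit length.

    pairs: list of (g_hack, g_clean). Each is a dict module_name -> list[float]
    holding the NLL gradient of that completion w.r.t. the module's c-probe
    (assumed layout, for illustration only).
    """
    v_grad = {}
    for name in pairs[0][0]:
        dim = len(pairs[0][0][name])
        mean = [0.0] * dim
        for g_hack, g_clean in pairs:
            for i in range(dim):
                mean[i] += (g_hack[name][i] - g_clean[name][i]) / len(pairs)
        norm = math.sqrt(sum(c * c for c in mean))
        # Leave an all-zero direction as-is rather than dividing by zero.
        v_grad[name] = [c / norm for c in mean] if norm > 0 else mean
    return v_grad


# Two toy pairs for one hypothetical module with a 2-dim c-probe.
pairs = [
    ({"q_proj": [3.0, 0.0]}, {"q_proj": [0.0, 0.0]}),
    ({"q_proj": [1.0, 1.0]}, {"q_proj": [0.0, 1.0]}),
]
v_grad = extract_v_grad(pairs)
```

Normalizing per module, rather than over the concatenated vector, keeps a module with large raw gradients from dominating the direction in every other module.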

-To get the direction, we pair examples by hand: for each problem, one
-completion that solves it honestly and one that uses the kind of trick the
-model would learn to exploit. Then for each pair we compute the *exact GRPO
-gradient* you would get if the hack rollout had advantage +1 and the clean
-rollout had advantage -1: that's
-`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
-our ~10 pairs and SVD the result; the top right singular vectors are our
-hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
-because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
-the GRPO framing is the one we mean: extraction produces a sample of the
-gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
+At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
+forward+backward; its gradient w.r.t. the c-probes gives each rollout a
+band-normalized cosine to `v_grad`, pooled across modules, which maps to a
+`{clean, mid, hack}` label (band edges = authored-pair p75 cosines). Pass 2 sets
+the block masks from those labels and does the real masked forward+backward; the
+optimizer steps on what survives. So a rollout that looks hack-ward routes its
+update into the throwaway quarantine block, and the deployed model never descends
+the hack gradient. We re-extract `v_grad` every N steps so it tracks where hacks
+separate now, not at step 0.
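
As a rough illustration of the pass-1 gate, the sketch below turns one rollout's c-probe gradient into a zone label and then into the `(m, d)` mask used in pass 2. It simplifies heavily: pooling is a plain mean of per-module cosines, and the band is two hypothetical thresholds `lo` and `hi`; the actual band normalization and the p75-derived edges are not reproduced here.

```python
import math

# (m, d) per zone: m = quarantine on/off, d = deployed detach.
MASKS = {"clean": (0, 0), "mid": (1, 0), "hack": (1, 1)}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def label_rollout(probe_grads, v_grad, lo, hi):
    """Pass 1: pool per-module cosines to v_grad and bin into a zone.

    lo and hi are assumed band edges (hypothetical parameters).
    """
    cosines = [cosine(probe_grads[name], v_grad[name]) for name in v_grad]
    pooled = sum(cosines) / len(cosines)
    if pooled <= lo:
        return "clean"
    if pooled >= hi:
        return "hack"
    return "mid"


def mask_for(probe_grads, v_grad, lo, hi):
    """Pass 2 input: the (m, d) block mask for this rollout."""
    return MASKS[label_rollout(probe_grads, v_grad, lo, hi)]


v = {"mod": [1.0, 0.0]}  # toy direction for one hypothetical module
hack_mask = mask_for({"mod": [2.0, 0.0]}, v, lo=0.1, hi=0.8)
mid_mask = mask_for({"mod": [1.0, 1.0]}, v, lo=0.1, hi=0.8)
clean_mask = mask_for({"mod": [0.0, 1.0]}, v, lo=0.1, hi=0.8)
```

The point of the two thresholds is that an ambiguous rollout still trains both blocks (the `mid` zone) instead of being forced into a hard clean-or-hack decision.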

-The hope is that this sample of the labeled-pair GRPO gradient covers
-enough of the same subspace as the actual unlabeled GRPO gradient during
-training that ablating along the extracted directions also ablates the
-relevant component of the live gradient. Not a theorem; we check it
-empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
-up more on cached teacher rollouts than on student ones).
-
-Everything happens in the SVD-of-W basis. Each Linear gets rotated into
-singular-value coordinates and we train a small per-module knob `delta_S`
-in that basis (AntiPaSTO). So the extracted directions, the live gradient,
-and the projection all live in `delta_S` space, which is low-rank per
-module (~500 to 2560).
-
-Noise floor at load. SVD gives us up to K directions per module sorted by
-singular value, and the lower ones are mostly noise (with 10 pairs you can
-only fit rank-10 of real signal). We collect every singular value across
-every module, take a global quantile, and drop any (module, axis) whose
-S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
-below get filtered out entirely. Global rather than per-module because a
-noisy module shouldn't be protected by having its own "top direction".
-
-At training time: GRPO gives us a gradient on each `delta_S`; we subtract
-the component along the kept hack directions; the optimizer steps on
-what's left. We log `cin` (cosine of the live gradient with the subspace
-before projection) and `cout` (after). On a working extraction, `cout`
-should be near zero on no_gate runs (we removed the alignment), and
-`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
-clean gradients).
+Whether the *direction* (not just the *act* of routing) drives suppression is the
+open question. If the direction matters, the placebo control (Haar-random `v_grad`,
+same routing machinery) must NOT match real `v_grad`. We watch `qmass` (share of the
+update landing in quarantine) and the per-rollout zone shares (`keep/resid/rout`).
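
For the placebo, a Haar-random (uniform on the unit sphere) direction can be drawn by normalizing an i.i.d. Gaussian vector. The sketch below shows that standard construction per module; the function names and the way the seed is split across modules are assumptions, not a description of what `--routeV-random-v-seed` does internally.

```python
import math
import random


def haar_random_direction(dim, rng):
    """Uniform random unit vector: normalize an i.i.d. standard Gaussian sample."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def placebo_v_grad(module_dims, seed):
    """One random unit direction per module, reproducible from a single seed.

    module_dims: dict module_name -> c-probe dimension (assumed layout).
    """
    rng = random.Random(seed)
    return {name: haar_random_direction(dim, rng) for name, dim in module_dims.items()}


dims = {"mod_a": 8, "mod_b": 16}  # hypothetical modules and sizes
placebo = placebo_v_grad(dims, seed=0)
placebo_again = placebo_v_grad(dims, seed=0)
placebo_other = placebo_v_grad(dims, seed=1)
```

Because the placebo has the same per-module unit norm as an extracted direction, any difference between the two runs should come from where the vector points, not from its scale.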

 ## What we compare

-The environment has four loophole modes (`run_tests`, `sentinel`,
-`stdout_marker`, `file_marker`), each a different grading flaw with a
-truthful hint disclosing the mechanism. Problems are partitioned evenly and
-non-overlappingly, so a vanilla student can learn all four independently.
-Full prompt+hint, hack, and clean traces per mode are in the blog appendix
+The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
+`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
+problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean
+traces are in the blog appendix
 ([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

-Three active arms, identical model and teacher pool, differing only in the gradient policy:
+Three arms, identical model/adapter/teacher pool, differing only in the gate
+(`--intervention`):

- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **routeV** -- current routing arm. A banded cosine gate `cos(g, v_grad) ∈ band` (band
-  calibrated from the hack-vs-clean cosine gap, either from the pairs or a live rolling
-  quantile) decides how much of a gradient routes into a scale-matched, distinct-basis
-  `delta_S_hack`. The gate runs per-rollout or per-token (`routeV_gate`); per-token is the
-  current best. An exploration floor (`rollout_ablate_frac`) generates a fraction of
-  rollouts knob-off so the deployed knob always sees solve signal even if the full policy
-  saturates on hacking.
+- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
+  capacity- and structure-matched vanilla control (same adapter, no shrinkage
+  confound). The emergence reference.
+- **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
+  vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
+- **absorb** -- gate pinned mid `(1,0)`: both blocks train on everything, no gate.
+  Isolates the value of the gate+hard-masks vs absorption alone.

-Deploy hack/solve is measured the same way for every arm: knob-off forward
-(`ablate_quarantine` for routeV, the trained model itself for vanilla/erase) on the
-held-out test set (n=119), sampled at T=0.7. So the per-arm deploy numbers are
-apples-to-apples.
-
-The frozen-vs-refresh distinction is orthogonal: erase and routeV can re-extract
-their pair-derived direction every N steps on the current adapter.
+Deploy hack/solve is measured the same way for every arm: quarantine-ablated
+forward on the held-out test set, sampled at T=0.7. So the per-arm deploy numbers
+are apples-to-apples. (For `none` the quarantine is already at init, so ablation
+is a no-op.)

 ## Quick start

 ```bash
 uv sync
-just smoke               # tiny-random model, projected pathway, ~1-2 min
-just smoke-vanilla       # tiny-random model, vanilla pathway, ~1-2 min
-just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
-just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep
+just smoke               # tiny-random model, routeV pathway + all verify gates, ~1-2 min
+just smoke-all           # none (vanilla) + routeV + absorb back to back
+just download-model      # warm Qwen3-4B cache
+just queue-decision      # queue the 4-arm decision run (routeV real / routeV placebo / none / absorb)
 ```

 See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
 including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
-measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
+measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
+placebo tie was a shrinkage artifact: the shared frozen basis made routing a magnitude split).

 ## Results and write-up

 The paper draft is the source of truth for current numbers, figures, and the
-preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex)
-(keynote table + figure, ablations, generalisation). Session-by-session
-findings and per-step log audits live in [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
+preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
+Session-by-session findings and per-step log audits live in
+[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).