docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase)

Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-10 11:23:21 +00:00
parent 103d0acc2c
commit 4ee5c27f7b
+71 -77
@@ -1,10 +1,10 @@
# vGROUT
**vGROUT** (vector gradient routing): route the GRPO gradient against an
extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
reward-hack rate without tanking pass rate. A representation-routing variant of
gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
extracted direction rather than a per-example data label.
extracted reward-hacking direction so the deployed model can't learn the hack,
without tanking pass rate. A representation-routing variant of gradient routing
(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
direction rather than a per-example data label.
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
@@ -14,100 +14,94 @@ advantage level.
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).
## The adapter
Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]`
and `B:[d_out,2r]` both trainable, with frozen Gaussian-init copies `A0/B0`
subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols
split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
`[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent
adapters living in the same module. At deployment the quarantine is **ablated**
(reset to its init), so anything learned there never ships.
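A minimal NumPy sketch of the three properties claimed above (zero delta at init, no cross terms between blocks, and ablation leaving only the deployed block). This is not the code in `src/vgrout/lora2r.py`: the sizes, the `delta` helper, and the `ablate_quarantine` keyword are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # toy sizes, chosen only for illustration

# Frozen Gaussian-init copies (A0, B0); the trainable A, B start equal to them.
A0 = rng.normal(size=(2 * r, d_in))
B0 = rng.normal(size=(d_out, 2 * r))
A, B = A0.copy(), B0.copy()

def delta(x, A, B, A0, B0, ablate_quarantine=False):
    """Net adapter output B(Ax) - B0(A0 x).

    With ablate_quarantine=True the quarantine block [r:] is reset to its
    init before the product, as described for deployment.
    """
    if ablate_quarantine:
        A, B = A.copy(), B.copy()
        A[r:], B[:, r:] = A0[r:], B0[:, r:]
    return B @ (A @ x) - B0 @ (A0 @ x)

x = rng.normal(size=d_in)
init_delta = delta(x, A, B, A0, B0)  # exactly zero at init

# Pretend training has moved both blocks.
A = A + 0.1 * rng.normal(size=A.shape)
B = B + 0.1 * rng.normal(size=B.shape)

# No cross terms: the full product equals the sum of the per-block products.
full = B @ (A @ x)
blockwise = B[:, :r] @ (A[:r] @ x) + B[:, r:] @ (A[r:] @ x)

# Deployment: only the deployed block's change survives ablation.
deployed = delta(x, A, B, A0, B0, ablate_quarantine=True)
deployed_only = B[:, :r] @ (A[:r] @ x) - B0[:, :r] @ (A0[:r] @ x)
```

The block independence follows from the shared rank index: row `i` of `A` is only ever multiplied by column `i` of `B`, so rows `[:r]` never meet columns `[r:]`.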
Per rollout we apply an SGTM-style three-way hard mask `(m, d)` to the block
outputs (`m` = quarantine on/off, `d` = deployed detach):
- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train (absorption).
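The three masks can be read as a small lookup table. The sketch below only encodes the semantics listed above; `MASKS`, `forward_blocks`, and `trainable_blocks` are hypothetical names, not identifiers from the repo.

```python
# (m, d): m = quarantine on/off, d = deployed detach
MASKS = {
    "clean": (0, 0),
    "hack": (1, 1),
    "mid": (1, 0),
}

def forward_blocks(label):
    """Blocks that contribute to the forward pass under this mask."""
    m, _ = MASKS[label]
    # The deployed block is always in the forward; the quarantine only if m == 1.
    return ["deployed"] + (["quarantine"] if m else [])

def trainable_blocks(label):
    """Blocks that receive gradient under this mask."""
    m, d = MASKS[label]
    blocks = []
    if not d:  # d == 1 detaches the deployed block's gradient
        blocks.append("deployed")
    if m:      # m == 0 zeroes the quarantine in forward and backward
        blocks.append("quarantine")
    return blocks
```

Note that on a hack rollout the deployed block still shapes the forward pass; it just receives no gradient.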
## How it works
We're trying to ablate the "hack direction" from the training gradient on
every update. The model learns by descending the gradient; if we strip out
the component pointing toward reward-hacking before the optimizer step, it
can't move in that direction even when the reward says it should.
To get the hack direction we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the model
would learn to exploit. For each pair we compute the *exact GRPO gradient* you'd
get if the hack rollout had advantage +1 and the clean rollout had advantage -1
(`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
framing is the one we mean.) The hand-authored pairs are off-distribution and the
*only* routing-label source -- no oracle/ground-truth label of a live rollout ever
touches training.
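As a rough sketch of the last step only (mean over pairs, then per-module normalization), assuming the per-pair c-probe gradients have already been computed and stacked into one array per module. The function name, the dict layout, and the `eps` guard are assumptions; the sign convention is whatever the per-pair vectors already carry.

```python
import numpy as np

def extract_v_grad(pair_grads, eps=1e-12):
    """Mean per-pair direction, normalized per module.

    pair_grads: {module_name: array of shape [n_pairs, k]}, one row per
    authored hack/clean pair (hypothetical layout).
    Returns {module_name: unit vector of shape [k]}.
    """
    v_grad = {}
    for name, grads in pair_grads.items():
        mean = np.asarray(grads, dtype=float).mean(axis=0)
        v_grad[name] = mean / (np.linalg.norm(mean) + eps)
    return v_grad

# Toy usage: two pairs, one module with a 2-dimensional probe.
v = extract_v_grad({"layer0.q_proj": [[3.0, 0.0], [3.0, 8.0]]})
```

Normalizing per module means every module contributes a direction, not a magnitude, so a module with large raw gradients does not dominate the later cosine.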
To get the direction, we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the
model would learn to exploit. Then for each pair we compute the *exact GRPO
gradient* you would get if the hack rollout had advantage +1 and the clean
rollout had advantage -1: that's
`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
our ~10 pairs and SVD the result; the top right singular vectors are our
hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
the GRPO framing is the one we mean: extraction produces a sample of the
gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
band-normalized cosine to `v_grad` pooled across modules -> a `{clean, mid, hack}`
label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
those labels and does the real masked forward+backward; the optimizer steps on
what survives. So a rollout that looks hack-ward routes its update into the
throwaway quarantine block, and the deployed model never descends the hack
gradient. We re-extract `v_grad` every N steps so it tracks where hacks separate
now, not at step 0.
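The pass-1 labelling can be sketched as a pooled cosine followed by a two-threshold band. Everything here is an assumption made for illustration: pooling as a plain mean of per-module cosines, the band edges passed in as `lo` and `hi`, and the function names. The repo's band normalization and p75 calibration are not reproduced.

```python
import numpy as np

def pooled_cosine(rollout_grads, v_grad, eps=1e-12):
    """Mean over modules of cos(g_module, v_module) for one rollout."""
    cosines = []
    for name, g in rollout_grads.items():
        g = np.asarray(g, dtype=float)
        v = np.asarray(v_grad[name], dtype=float)
        cosines.append(g @ v / (np.linalg.norm(g) * np.linalg.norm(v) + eps))
    return float(np.mean(cosines))

def zone(cos, lo, hi):
    """Map a pooled cosine to a {clean, mid, hack} label given band edges lo < hi."""
    if cos <= lo:
        return "clean"
    if cos >= hi:
        return "hack"
    return "mid"

# Toy usage with made-up band edges.
v_grad = {"m0": [1.0, 0.0], "m1": [0.0, 1.0]}
rollout = {"m0": [2.0, 0.0], "m1": [0.0, 5.0]}  # perfectly aligned with v_grad
label = zone(pooled_cosine(rollout, v_grad), lo=0.1, hi=0.4)
```

Pass 2 would then look up the `(m, d)` mask for `label` and run the real masked forward+backward.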
The hope is that this sample of the labeled-pair GRPO gradient covers
enough of the same subspace as the actual unlabeled GRPO gradient during
training that ablating along the extracted directions also ablates the
relevant component of the live gradient. Not a theorem; we check it
empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
up more on cached teacher rollouts than on student ones).
Everything happens in the SVD-of-W basis. Each Linear gets rotated into
singular-value coordinates and we train a small per-module knob `delta_S`
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
and the projection all live in `delta_S` space, which is low-rank per
module (~500 to 2560).
Noise floor at load. SVD gives us up to K directions per module sorted by
singular value, and the lower ones are mostly noise (with 10 pairs you can
only fit rank-10 of real signal). We collect every singular value across
every module, take a global quantile, and drop any (module, axis) whose
S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
below get filtered out entirely. Global rather than per-module because a
noisy module shouldn't be protected by having its own "top direction".
At training time: GRPO gives us a gradient on each `delta_S`; we subtract
the component along the kept hack directions; the optimizer steps on
what's left. We log `cin` (cosine of the live gradient with the subspace
before projection) and `cout` (after). On a working extraction, `cout`
should be near zero on no_gate runs (we removed the alignment), and
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
clean gradients).
Whether the *direction* (not just the *act* of routing) drives suppression is the
open question -- the placebo control (Haar-random `v_grad`, same routing
machinery) must NOT match real `v_grad`. We watch `qmass` (share of the update
landing in quarantine) and the per-rollout zone shares (`keep/resid/rout`).
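For the placebo, one standard way to draw a Haar-random (uniform on the sphere) direction is to normalize a standard Gaussian vector. The helper below is a sketch of that construction; the name `haar_unit_vector` is hypothetical, and how `--routeV-random-v-seed` actually seeds and shapes the vector per module is not shown here.

```python
import numpy as np

def haar_unit_vector(dim, seed):
    """Uniformly random unit vector in R^dim: a normalized standard Gaussian draw."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

placebo = haar_unit_vector(dim=512, seed=0)
```

In high dimensions such a vector is nearly orthogonal to any fixed direction (cosine on the order of `1/sqrt(dim)`), which is what makes it a direction-free control for the same routing machinery.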
## What we compare
The environment has four loophole modes (`run_tests`, `sentinel`,
`stdout_marker`, `file_marker`), each a different grading flaw with a
truthful hint disclosing the mechanism. Problems are partitioned evenly and
non-overlappingly, so a vanilla student can learn all four independently.
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean
traces are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
Three active arms, identical model and teacher pool, differing only in the gradient policy:
Three arms, identical model/adapter/teacher pool, differing only in the gate
(`--intervention`):
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **routeV** -- current routing arm. A banded cosine gate `cos(g, v_grad) ∈ band` (band
calibrated from the hack-vs-clean cosine gap, either from the pairs or a live rolling
quantile) decides how much of a gradient routes into a scale-matched, distinct-basis
`delta_S_hack`. The gate runs per-rollout or per-token (`routeV_gate`); per-token is the
current best. An exploration floor (`rollout_ablate_frac`) generates a fraction of
rollouts knob-off so the deployed knob always sees solve signal even if the full policy
saturates on hacking.
- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
capacity- and structure-matched vanilla control (same adapter, no shrinkage
confound). The emergence reference.
- **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on everything, no gate.
Isolates the value of the gate+hard-masks vs absorption alone.
Deploy hack/solve is measured the same way for every arm: knob-off forward
(`ablate_quarantine` for routeV, the trained model itself for vanilla/erase) on the
held-out test set (n=119), sampled at T=0.7. So the per-arm deploy numbers are
apples-to-apples.
The frozen-vs-refresh distinction is orthogonal: erase and routeV can re-extract
their pair-derived direction every N steps on the current adapter.
Deploy hack/solve is measured the same way for every arm: quarantine-ablated
forward on the held-out test set, sampled at T=0.7. So the per-arm deploy numbers
are apples-to-apples. (For `none` the quarantine is already at init, so ablation
is a no-op.)
## Quick start
```bash
uv sync
just smoke # tiny-random model, projected pathway, ~1-2 min
just smoke-vanilla # tiny-random model, vanilla pathway, ~1-2 min
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
just smoke # tiny-random model, routeV pathway + all verify gates, ~1-2 min
just smoke-all # vanilla + routeV + absorb back to back
just download-model # warm Qwen3-4B cache
just queue-decision # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)
```
See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
## Results and write-up
The paper draft is the source of truth for current numbers, figures, and the
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex)
(keynote table + figure, ablations, generalisation). Session-by-session
findings and per-step log audits live in [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
Session-by-session findings and per-step log audits live in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).