mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
4ee5c27f7b
Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
108 lines
5.5 KiB
Markdown
# vGROUT

**vGROUT** (vector gradient routing): route the GRPO gradient against an
extracted reward-hacking direction so the deployed model can't learn the hack,
without tanking pass rate. A representation-routing variant of gradient routing
(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
direction rather than a per-example data label.

Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.

See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).

## The adapter

Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]`
and `B:[d_out,2r]` both trainable, with frozen Gaussian-init copies `A0/B0`
subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols
split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
`[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent
adapters living in the same module. At deployment the quarantine is **ablated**
(reset to its init), so anything learned there never ships.

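
The block structure can be checked numerically. The following is a minimal pure-Python sketch, not the code in `src/vgrout/lora2r.py`: the helper names (`matvec`, `block_delta`), the toy sizes, and the random perturbation standing in for training are all assumptions made for illustration. It checks three properties described above: the delta is zero at init, the full delta is the sum of the two block deltas, and ablating the quarantine leaves only the deployed block's contribution.

```python
import random

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def block_delta(A, B, A0, B0, x, idx):
    """B[:, idx] @ (A[idx] @ x) minus the same term from the frozen init copies."""
    idx = list(idx)
    def out(A_, B_):
        h = matvec([A_[k] for k in idx], x)
        return matvec([[row[k] for k in idx] for row in B_], h)
    return [cur - init for cur, init in zip(out(A, B), out(A0, B0))]

rng = random.Random(0)
d_in, d_out, r = 5, 4, 2  # toy sizes (hypothetical)
A0 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(2 * r)]
B0 = [[rng.gauss(0, 0.1) for _ in range(2 * r)] for _ in range(d_out)]
A = [row[:] for row in A0]  # trainable copies start at the frozen init
B = [row[:] for row in B0]
x = [rng.gauss(0, 1) for _ in range(d_in)]
deployed, quarantine, both = range(r), range(r, 2 * r), range(2 * r)

# At init the trainable and frozen copies coincide, so the delta is exactly 0.
delta_at_init = block_delta(A, B, A0, B0, x, both)

# Stand-in for training: perturb every entry of A and B.
A = [[a + rng.gauss(0, 0.05) for a in row] for row in A]
B = [[b + rng.gauss(0, 0.05) for b in row] for row in B]

full = block_delta(A, B, A0, B0, x, both)
dep = block_delta(A, B, A0, B0, x, deployed)
quar = block_delta(A, B, A0, B0, x, quarantine)
# No cross terms: full equals dep + quar up to float rounding.

# Ablate the quarantine: reset rows [r:] of A and columns [r:] of B to init.
for k in quarantine:
    A[k] = A0[k][:]
    for row, row0 in zip(B, B0):
        row[k] = row0[k]
shipped = block_delta(A, B, A0, B0, x, both)  # matches dep up to rounding
```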
Per rollout we apply an SGTM-style three-way hard mask `(m, d)` to the block
outputs (`m` = quarantine on/off, `d` = deployed detach):

- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train (absorption).

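
As a rough illustration of what the three masks do to the forward value and to the gradient each block receives, here is a scalar sketch. The `forward` and `backward` functions are hypothetical stand-ins for the real hook and treat each block's output as a single number.

```python
# (m, d) per label, as listed above: m = quarantine on/off, d = deployed detach.
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}

def forward(dep_out, quar_out, m):
    """The deployed block is always in the forward pass; the quarantine only if m == 1."""
    return dep_out + m * quar_out

def backward(upstream, m, d):
    """Gradient reaching each block's output for a given upstream gradient."""
    grad_dep = 0.0 if d else upstream  # d == 1 detaches the deployed block
    grad_quar = upstream * m           # m == 0 zeroes the quarantine in backward
    return grad_dep, grad_quar

for label, (m, d) in MASKS.items():
    print(label, forward(1.0, 3.0, m), backward(2.0, m, d))
```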
## How it works

To get the hack direction we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the model
would learn to exploit. For each pair we compute the *exact GRPO gradient* you'd
get if the hack rollout had advantage +1 and the clean rollout had advantage -1
(`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
framing is the one we mean.) The hand-authored pairs are off-distribution and the
*only* routing-label source -- no oracle/ground-truth label of a live rollout ever
touches training.

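
For a single module, the extraction amounts to averaging the per-pair gradient and normalizing. The sketch below assumes the c-probe gradients of `logp` for each hack and clean completion are already available as plain lists; `extract_direction` is a hypothetical name, and the sign follows the formula quoted above (`-grad logp(hack) + grad logp(clean)`), which may differ from the sign convention the repo stores.

```python
import math

def extract_direction(hack_grads, clean_grads):
    """Mean per-pair GRPO gradient (adv +1 hack, -1 clean), unit-normalized.

    hack_grads[i] and clean_grads[i] are the gradients of logp for the i-th
    authored pair, read in one module's c-probe space (assumed precomputed).
    """
    pair_grads = [
        [-h + c for h, c in zip(g_hack, g_clean)]
        for g_hack, g_clean in zip(hack_grads, clean_grads)
    ]
    n = len(pair_grads)
    mean = [sum(col) / n for col in zip(*pair_grads)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]

# Toy example with two authored pairs in a 2-dimensional probe space.
v_grad = extract_direction(
    hack_grads=[[1.0, 0.0], [3.0, 0.0]],
    clean_grads=[[0.0, 0.0], [0.0, 0.0]],
)
```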
At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
band-normalized cosine to `v_grad` pooled across modules -> a `{clean, mid, hack}`
label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
those labels and does the real masked forward+backward; the optimizer steps on
what survives. So a rollout that looks hack-ward routes its update into the
throwaway quarantine block, and the deployed model never descends the hack
gradient. We re-extract `v_grad` every N steps so it tracks where hacks separate
now, not at step 0.

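
The Pass 1 gate can be pictured as a cosine followed by two thresholds. The sketch below is a simplification under stated assumptions: it handles one module (no pooling across modules), skips the band normalization, takes a larger cosine to mean "more hack-ward", and uses hypothetical band edges `lo` and `hi` in place of the authored-pair p75 cosines.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def gate_label(rollout_grad, v_grad, lo, hi):
    """Three-way label from the cosine to v_grad (lo/hi are hypothetical edges)."""
    c = cosine(rollout_grad, v_grad)
    if c >= hi:
        return "hack"
    if c <= lo:
        return "clean"
    return "mid"

v_grad = [1.0, 0.0]
labels = [
    gate_label(g, v_grad, lo=0.2, hi=0.8)
    for g in ([2.0, 0.1], [1.0, 1.0], [0.0, 3.0])
]
```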
Whether the *direction* (not just the *act* of routing) drives suppression is the
open question -- the placebo control (Haar-random `v_grad`, same routing
machinery) must NOT match real `v_grad`. We watch `qmass` (share of the update
landing in quarantine) and the per-rollout zone shares (`keep/resid/rout`).

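
A Haar-random unit direction can be drawn by normalizing an isotropic Gaussian sample, which is uniform on the sphere. This is a minimal sketch of that standard construction; `haar_unit_vector` is a hypothetical name, and how the repo actually seeds and shapes the placebo per module is not shown here.

```python
import math
import random

def haar_unit_vector(dim, seed):
    """Uniform random direction on the unit sphere in `dim` dimensions."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

placebo = haar_unit_vector(dim=8, seed=0)
```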
## What we compare

The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean
traces are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

Three arms, identical model/adapter/teacher pool, differing only in the gate
(`--intervention`):

- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
  capacity- and structure-matched vanilla control (same adapter, no shrinkage
  confound). The emergence reference.
- **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
  vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on everything, no gate.
  Isolates the value of the gate+hard-masks vs absorption alone.

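
Put differently, the arms differ only in where the per-rollout label comes from. The sketch below assumes the gate has already produced a label per rollout; `labels_for_arm` is a hypothetical helper, not the repo's dispatch code.

```python
def labels_for_arm(intervention, gate_labels):
    """Per-rollout labels actually used for masking under each arm."""
    if intervention == "none":
        return ["clean"] * len(gate_labels)  # gate pinned clean (0,0)
    if intervention == "absorb":
        return ["mid"] * len(gate_labels)    # gate pinned mid (1,0)
    if intervention == "routeV":
        return list(gate_labels)             # gate output used as-is
    raise ValueError(f"unknown intervention: {intervention}")

gate_out = ["clean", "hack", "mid"]
per_arm = {arm: labels_for_arm(arm, gate_out) for arm in ("none", "routeV", "absorb")}
```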
Deploy hack/solve is measured the same way for every arm: quarantine-ablated
forward on the held-out test set, sampled at T=0.7. So the per-arm deploy numbers
are apples-to-apples. (For `none` the quarantine is already at init, so ablation
is a no-op.)

## Quick start

```bash
uv sync
just smoke            # tiny-random model, routeV pathway + all verify gates, ~1-2 min
just smoke-all        # vanilla + routeV + absorb back to back
just download-model   # warm Qwen3-4B cache
just queue-decision   # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)
```

See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

## Results and write-up

The paper draft is the source of truth for current numbers, figures, and the
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
Session-by-session findings and per-step log audits live in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).