mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
61d3819dae
The cleanup removed the v1 route and route2 arms (Config is now none|erase|routeV) but left README calling the live arm route2 with its old binary-tau gate description. Rename to routeV, describe the banded cosine gate (per-rollout/per-token, per-token best), and fix the deploy line (held-out test n=119 knob-off, not n=64). figs.py keeps the route2/routing2 display map for historical run artifacts. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
# vGROUT

**vGROUT** (vector gradient routing): route the GRPO gradient against an
extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
reward-hack rate without tanking pass rate. A representation-routing variant of
gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
extracted direction rather than a per-example data label.

Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.

See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).

## How it works

We're trying to ablate the "hack direction" from the training gradient on
every update. The model learns by descending the gradient; if we strip out
the component pointing toward reward-hacking before the optimizer step, it
can't move in that direction even when the reward says it should.

To get the direction, we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the
model would learn to exploit. Then for each pair we compute the *exact GRPO
gradient* you would get if the hack rollout had advantage +1 and the clean
rollout had advantage -1: that's
`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
our ~10 pairs and SVD the result; the top right singular vectors are our
hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
the GRPO framing is the one we mean: extraction produces a sample of the
gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)

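A minimal sketch of the stack-and-SVD step, using NumPy and random stand-in gradients. The names `extract_hack_basis`, `g_hack`, `g_clean`, and `k` are hypothetical and not the repo's API; real per-pair gradients would come from backprop through the trained knobs.

```python
import numpy as np

def extract_hack_basis(g_hack, g_clean, k):
    """Return the top-k singular values and right singular vectors.

    g_hack, g_clean: [n_pairs, d] arrays holding grad logp for the hack and
    clean completion of each pair (hypothetical stand-in inputs).
    """
    # Loss gradient with adv=+1 on hack and adv=-1 on clean, one row per pair:
    #   -grad logp(hack) + grad logp(clean)
    G = -g_hack + g_clean
    # Rows of Vt are right singular vectors, sorted by descending singular value.
    _, S, Vt = np.linalg.svd(G, full_matrices=False)
    return S[:k], Vt[:k]

rng = np.random.default_rng(0)
n_pairs, d = 10, 64
shared = rng.normal(size=d)                        # synthetic "hack" direction
g_hack = 0.1 * rng.normal(size=(n_pairs, d)) + shared
g_clean = 0.1 * rng.normal(size=(n_pairs, d))
S, V = extract_hack_basis(g_hack, g_clean, k=3)
```

With only `n_pairs` rows, the SVD can return at most `n_pairs` nonzero directions, which is why the lower axes need a noise filter.
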
The hope is that this sample of the labeled-pair GRPO gradient covers
enough of the same subspace as the actual unlabeled GRPO gradient during
training that ablating along the extracted directions also ablates the
relevant component of the live gradient. Not a theorem; we check it
empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
up more on cached teacher rollouts than on student ones).

Everything happens in the SVD-of-W basis. Each Linear gets rotated into
singular-value coordinates and we train a small per-module knob `delta_S`
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
and the projection all live in `delta_S` space, which is low-rank per
module (~500 to 2560).

Noise floor at load. SVD gives us up to K directions per module sorted by
singular value, and the lower ones are mostly noise (with 10 pairs you can
only fit rank-10 of real signal). We collect every singular value across
every module, take a global quantile, and drop any (module, axis) whose
S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
below get filtered out entirely. Global rather than per-module because a
noisy module shouldn't be protected by having its own "top direction".

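The global cut can be sketched as follows. This is an illustration only: the dict layout, the module names, and `noise_floor_mask` are assumptions, not the repo's data structures.

```python
import numpy as np

def noise_floor_mask(sv_by_module, q=0.25):
    """Keep (module, axis) entries whose singular value clears a global quantile.

    sv_by_module: dict of module name -> 1-D array of singular values
    (hypothetical layout). Returns the cut and a dict of boolean keep-masks;
    modules with no surviving axis are dropped from the dict.
    """
    all_sv = np.concatenate(list(sv_by_module.values()))
    cut = np.quantile(all_sv, q)            # one threshold shared by all modules
    masks = {name: sv >= cut for name, sv in sv_by_module.items()}
    return cut, {name: m for name, m in masks.items() if m.any()}

sv = {
    "attn.q_proj": np.array([9.0, 4.0, 1.0]),
    "mlp.up_proj": np.array([6.0, 5.0, 3.0]),
    "mlp.down_proj": np.array([0.2, 0.1]),  # weak on every axis
}
cut, kept = noise_floor_mask(sv, q=0.25)
```

In this toy input the weak module falls entirely below the shared threshold and is removed, whereas a per-module quantile would have kept its largest axis.
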
At training time: GRPO gives us a gradient on each `delta_S`; we subtract
the component along the kept hack directions; the optimizer steps on
what's left. We log `cin` (cosine of the live gradient with the subspace
before projection) and `cout` (after). On a working extraction, `cout`
should be near zero on no_gate runs (the projection removed the aligned
component), and `cin_t > cin_s` should hold throughout (v_hack discriminates
hack from clean gradients).

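A small sketch of the projection and the two logged quantities, under the assumption that the "cosine with the subspace" is the fraction of the gradient norm that lies inside the span of the kept directions. `subspace_cos` and `project_out` are hypothetical helper names, and the basis here is a random orthonormal stand-in.

```python
import numpy as np

def subspace_cos(g, V):
    """Norm fraction of g inside span(V): |V g| / |g|. V has orthonormal rows."""
    return np.linalg.norm(V @ g) / (np.linalg.norm(g) + 1e-12)

def project_out(g, V):
    """Remove the component of g that lies in the row space of V."""
    return g - V.T @ (V @ g)

rng = np.random.default_rng(0)
d, k = 32, 4
# Orthonormal stand-in basis: reduced QR of a random matrix, transposed to [k, d].
V = np.linalg.qr(rng.normal(size=(d, k)))[0].T
g = rng.normal(size=d)                # stand-in for a live delta_S gradient

cin = subspace_cos(g, V)              # alignment before projection
g_proj = project_out(g, V)            # what the optimizer would step on
cout = subspace_cos(g_proj, V)        # alignment after projection, ~0
```
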
## What we compare

The environment has four loophole modes (`run_tests`, `sentinel`,
`stdout_marker`, `file_marker`), each a different grading flaw with a
truthful hint disclosing the mechanism. Problems are split evenly across the
modes with no overlap, so a vanilla student can learn all four independently.
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

Three active arms, identical model and teacher pool, differing only in the gradient policy:

- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **routeV** -- current routing arm. A banded cosine gate `cos(g, v_grad) ∈ band` (band
  calibrated from the hack-vs-clean cosine gap, either from the pairs or a live rolling
  quantile) decides how much of a gradient routes into a scale-matched, distinct-basis
  `delta_S_hack`. The gate runs per-rollout or per-token (`routeV_gate`); per-token is the
  current best. An exploration floor (`rollout_ablate_frac`) generates a fraction of
  rollouts knob-off so the deployed knob always sees solve signal even if the full policy
  saturates on hacking.

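One way to picture a banded cosine gate is sketched below. The linear ramp between the band edges, the `lo`/`hi` values, and the helper names `band_gate` and `route` are all assumptions made for illustration; the repo's actual gate shape and calibration may differ. Each row stands for one gated unit (a rollout or a token).

```python
import numpy as np

def band_gate(cos, lo, hi):
    """Fraction routed to the hack knob: 0 below lo, 1 above hi, linear between."""
    return np.clip((cos - lo) / (hi - lo), 0.0, 1.0)

def route(grads, v, lo, hi):
    """Split per-unit gradients [n, d] into (kept, routed) parts by cosine to v."""
    v = v / np.linalg.norm(v)
    cos = grads @ v / (np.linalg.norm(grads, axis=1) + 1e-12)
    w = band_gate(cos, lo, hi)[:, None]
    return (1.0 - w) * grads, w * grads, cos

v = np.array([1.0, 0.0, 0.0])         # stand-in for the extracted direction
grads = np.array([
    [2.0, 0.0, 0.0],                  # aligned with v: fully routed
    [1.0, 1.0, 0.0],                  # inside the band: partly routed
    [0.0, 3.0, 0.0],                  # orthogonal to v: fully kept
])
kept, routed, cos = route(grads, v, lo=0.5, hi=0.9)
```

In this sketch the kept and routed parts always sum back to the original gradient, so the gate only decides which knob receives the update.
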
Deploy hack/solve is measured the same way for every arm: knob-off forward
(`ablate_quarantine` for routeV, the trained model itself for vanilla/erase) on the
held-out test set (n=119), sampled at T=0.7. So the per-arm deploy numbers are
apples-to-apples.

The frozen-vs-refresh distinction is orthogonal: erase and routeV can re-extract
their pair-derived direction every N steps on the current adapter.

## Quick start

```bash
uv sync
just smoke            # tiny-random model, projected pathway, ~1-2 min
just smoke-vanilla    # tiny-random model, vanilla pathway, ~1-2 min
just download-model   # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full       # queue extract + 3-seed vanilla + 3-seed projected sweep
```

See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

## Results and write-up

The paper draft is the source of truth for current numbers, figures, and the
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex)
(keynote table + figure, ablations, generalisation). Session-by-session
findings and per-step log audits live in [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).