mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 21:37:22 +08:00
52619519dc
- spec.md never existed at root or docs/; removed the link from AGENTS.md + README.md (the live plan is in docs/spec/ dated files). - RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed. - Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py (that file doesn't exist); kept the general 'gate every load-bearing invariant in the same commit' rule. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
110 lines
6.5 KiB
Markdown
110 lines
6.5 KiB
Markdown
# projected_grpo — vGROUT
|
|
|
|
**vGROUT** (vector gradient routing): route the GRPO gradient against an
|
|
extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
|
|
reward-hack rate without tanking pass rate. A representation-routing variant of
|
|
gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
|
|
extracted direction rather than a per-example data label.
|
|
|
|
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
|
|
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
|
|
"Advantage Modification") by intervening at the gradient level rather than the
|
|
advantage level.
|
|
|
|
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
|
|
and [docs/papers/](docs/papers/).
|
|
|
|
## How it works
|
|
|
|
We're trying to ablate the "hack direction" from the training gradient on
|
|
every update. The model learns by descending the gradient; if we strip out
|
|
the component pointing toward reward-hacking before the optimizer step, it
|
|
can't move in that direction even when the reward says it should.
|
|
|
|
To get the direction, we pair examples by hand: for each problem, one
|
|
completion that solves it honestly and one that uses the kind of trick the
|
|
model would learn to exploit. Then for each pair we compute the *exact GRPO
|
|
gradient* you would get if the hack rollout had advantage +1 and the clean
|
|
rollout had advantage -1: that's
|
|
`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
|
|
our ~10 pairs and SVD the result; the top right singular vectors are our
|
|
hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
|
|
because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
|
|
the GRPO framing is the one we mean: extraction produces a sample of the
|
|
gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
|
|
|
|
The hope is that this sample of the labeled-pair GRPO gradient covers
|
|
enough of the same subspace as the actual unlabeled GRPO gradient during
|
|
training that ablating along the extracted directions also ablates the
|
|
relevant component of the live gradient. Not a theorem; we check it
|
|
empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
|
|
up more on cached teacher rollouts than on student ones).
|
|
|
|
Everything happens in the SVD-of-W basis. Each Linear gets rotated into
|
|
singular-value coordinates and we train a small per-module knob `delta_S`
|
|
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
|
|
and the projection all live in `delta_S` space, which is low-rank per
|
|
module (~500 to 2560).
|
|
|
|
Noise floor at load. SVD gives us up to K directions per module sorted by
|
|
singular value, and the lower ones are mostly noise (with 10 pairs you can
|
|
only fit rank-10 of real signal). We collect every singular value across
|
|
every module, take a global quantile, and drop any (module, axis) whose
|
|
S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
|
|
below get filtered out entirely. Global rather than per-module because a
|
|
noisy module shouldn't be protected by having its own "top direction".
|
|
|
|
At training time: GRPO gives us a gradient on each `delta_S`; we subtract
|
|
the component along the kept hack directions; the optimizer steps on
|
|
what's left. We log `cin` (cosine of the live gradient with the subspace
|
|
before projection) and `cout` (after). On a working extraction, `cout`
|
|
should be near zero on no_gate runs (we removed the alignment), and
|
|
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
|
|
clean gradients).
|
|
|
|
## What we compare
|
|
|
|
The environment has four loophole modes (`run_tests`, `sentinel`,
|
|
`stdout_marker`, `file_marker`), each a different grading flaw with a
|
|
truthful hint disclosing the mechanism. Problems are partitioned evenly and
|
|
non-overlappingly, so a vanilla student can learn all four independently.
|
|
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
|
|
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
|
|
|
|
Four arms, identical model and teacher pool, differing only in the gradient policy:
|
|
|
|
- **vanilla** -- no intervention; the emergence reference.
|
|
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
|
|
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
|
|
- **route2** -- current routing arm. Per-rollout gate `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis `delta_S_hack`. An exploration floor (`rollout_ablate_frac`) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
|
|
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
|
|
|
|
Deploy hack/solve is measured the same way for every arm now: knob-off forward
|
|
(`ablate_quarantine` for route/route2, the trained model itself for vanilla/erase),
|
|
n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.
|
|
|
|
The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
|
|
v_hack every N steps on the current adapter (for route, with the quarantine
|
|
ablated during extraction, see the blog).
|
|
|
|
## Quick start
|
|
|
|
```bash
|
|
uv sync
|
|
just smoke # tiny-random model, projected pathway, ~1-2 min
|
|
just smoke-vanilla # tiny-random model, vanilla pathway, ~1-2 min
|
|
just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
|
|
just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep
|
|
```
|
|
|
|
See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
|
|
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
|
|
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
|
|
|
|
## Results and write-up
|
|
|
|
The paper draft is the source of truth for current numbers, figures, and the
|
|
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex)
|
|
(keynote table + figure, ablations, generalisation). Session-by-session
|
|
findings and per-step log audits live in [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
|