mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:23:57 +08:00
# projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
basis) reduces reward-hack rate in GRPO without tanking pass rate.

Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.

See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).

## How it works

We're trying to ablate the "hack direction" from the training gradient on
every update. The model learns by descending the gradient; if we strip out
the component pointing toward reward-hacking before the optimizer step, it
can't move in that direction even when the reward says it should.

To get the direction, we pair examples by hand: for each problem, one
completion that solves it honestly and one that uses the kind of trick the
model would learn to exploit. Then for each pair we compute the *exact GRPO
gradient* you would get if the hack rollout had advantage +1 and the clean
rollout had advantage -1. Written as a loss gradient (loss = `-adv * logp`),
that's `-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
our ~10 pairs and SVD the result; the top right-singular vectors are our
hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
the GRPO framing is the one we mean: extraction produces a sample of the
gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)

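The extraction step can be sketched in a few lines. This is an illustration under loud assumptions, not the repo's code: it swaps the real model for a toy unigram policy (one logit per token, parameters `theta`) so that `grad logp` has a closed form in NumPy. The names `grad_logp`, `pair_gradient`, and `extract_hack_basis` and the token lists are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # stabilise before exponentiating
    return z - np.log(np.exp(z).sum())

def grad_logp(theta, tokens):
    """Gradient w.r.t. theta of sum_t log softmax(theta)[token_t] (toy policy)."""
    probs = np.exp(log_softmax(theta))
    grad = np.zeros_like(theta)
    for tok in tokens:
        grad += np.eye(len(theta))[tok] - probs   # one-hot minus softmax
    return grad

def pair_gradient(theta, hack_tokens, clean_tokens):
    # Loss is -adv * logp with adv(hack) = +1 and adv(clean) = -1.
    return -grad_logp(theta, hack_tokens) + grad_logp(theta, clean_tokens)

def extract_hack_basis(theta, pairs, k):
    # Stack one gradient per pair, then keep the top-k right-singular vectors.
    G = np.stack([pair_gradient(theta, h, c) for h, c in pairs])  # (n_pairs, dim)
    _, S, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:k], S[:k]

rng = np.random.default_rng(0)
theta = rng.normal(size=8)               # toy parameters, vocabulary of 8 tokens
pairs = [                                # made-up (hack, clean) token sequences
    ([0, 1, 1], [2, 3, 4]),
    ([0, 0, 1], [3, 4, 5]),
    ([1, 0, 6], [2, 5, 7]),
]
v_hack, sing = extract_hack_basis(theta, pairs, k=2)
print(v_hack.shape, sing)
```

With `n` pairs the stacked matrix has rank at most `n`, which is why the number of usable directions is bounded by the pair count.
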
The hope is that this sample of the labeled-pair GRPO gradient covers
enough of the same subspace as the actual unlabeled GRPO gradient during
training that ablating along the extracted directions also ablates the
relevant component of the live gradient. Not a theorem; we check it
empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
up more on cached teacher rollouts than on student ones).

Everything happens in the SVD-of-W basis. Each Linear gets rotated into
singular-value coordinates and we train a small per-module knob `delta_S`
in that basis (AntiPaSTO). So the extracted directions, the live gradient,
and the projection all live in `delta_S` space, which is low-dimensional per
module (~500 to 2560 coordinates).

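To make the `delta_S` coordinates concrete, here is a hedged sketch. It assumes the knob is an additive shift on the singular values with `U` and `V` frozen, i.e. `W' = U diag(S + delta_S) V^T`; the README does not spell out AntiPaSTO's exact form, so treat that as an assumption. `effective_weight` and `grad_delta_S` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                       # frozen weight of one Linear (out=6, in=4)
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # U: (6, 4), S: (4,), Vt: (4, 4)

def effective_weight(delta_S):
    # Assumed additive form: only the singular values move.
    return U @ np.diag(S + delta_S) @ Vt

def grad_delta_S(grad_W):
    # Chain rule: dL/d(delta_S)_i = u_i^T (dL/dW) v_i, one scalar per singular axis.
    return np.einsum("oi,oj,ij->i", U, grad_W, Vt)

grad_W = rng.normal(size=W.shape)                 # stand-in for a backprop gradient
g_dS = grad_delta_S(grad_W)
print(g_dS.shape)                                 # one entry per singular value
```

Under this assumption each module contributes a gradient vector with `min(out, in)` entries, which is the space the extracted directions and the projection operate in.
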
Noise floor at load. SVD gives us up to K directions per module sorted by
singular value, and the lower ones are mostly noise (10 pairs give at most
rank 10 of real signal). We collect every singular value across
every module, take a global quantile, and drop any (module, axis) whose
S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
below the cut get filtered out entirely. The cut is global rather than
per-module because a noisy module shouldn't be protected by having its own
"top direction".

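The global cut can be sketched as follows. This is a minimal illustration, assuming the singular values are available as a dict keyed by module name; `noise_floor_mask`, `drop_quantile`, the module names, and the values are all hypothetical.

```python
import numpy as np

def noise_floor_mask(sing_by_module, drop_quantile=0.25):
    """Return (cutoff, {module: boolean keep-mask}) using one global quantile."""
    all_s = np.concatenate(
        [np.asarray(s, dtype=float) for s in sing_by_module.values()]
    )
    cutoff = np.quantile(all_s, drop_quantile)    # global, not per-module
    kept = {}
    for name, s in sing_by_module.items():
        mask = np.asarray(s, dtype=float) >= cutoff
        if mask.any():                            # drop modules with no surviving axis
            kept[name] = mask
    return cutoff, kept

sing_by_module = {                                # made-up singular values
    "layers.0.q_proj": [9.0, 4.0, 0.5, 0.04],
    "layers.0.mlp.up": [7.0, 3.0, 2.0],
    "layers.1.q_proj": [0.03, 0.02, 0.01],        # uniformly weak module
}
cutoff, kept = noise_floor_mask(sing_by_module)
print(cutoff, sorted(kept))
```

In this toy input the uniformly weak module disappears entirely, whereas a per-module quantile would have kept its largest axis.
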
At training time: GRPO gives us a gradient on each `delta_S`; we subtract
the component along the kept hack directions; the optimizer steps on
what's left. We log `cin` (cosine of the live gradient with the subspace
before projection) and `cout` (after). On a working extraction, `cout`
should be near zero on no_gate runs (we removed the alignment), and
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
clean gradients).

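A sketch of the projection and the two logged quantities, under stated assumptions: the kept directions form a matrix `V` with orthonormal rows, and "cosine with the subspace" is taken to mean the norm of the projection divided by the norm of the gradient. The repo may define `cin`/`cout` differently; `subspace_cosine` and `project_out` are hypothetical names.

```python
import numpy as np

def subspace_cosine(g, V):
    # V: (k, d) with orthonormal rows; returns |proj_V(g)| / |g|, in [0, 1].
    norm = np.linalg.norm(g)
    return float(np.linalg.norm(V @ g) / norm) if norm > 0 else 0.0

def project_out(g, V):
    # Remove the component of g that lies in the span of V's rows.
    return g - V.T @ (V @ g)

rng = np.random.default_rng(0)
d, k = 16, 3
V = np.linalg.qr(rng.normal(size=(d, k)))[0].T   # random orthonormal basis, (k, d)
g = rng.normal(size=d)                           # stand-in for a live delta_S gradient

cin = subspace_cosine(g, V)                      # alignment before projection
g_clean = project_out(g, V)                      # what the optimizer would step on
cout = subspace_cosine(g_clean, V)               # alignment after projection
print(cin, cout)
```

The `cin_t > cin_s` check would apply the same `subspace_cosine` to gradients computed from teacher rollouts and from student rollouts and compare the two values.
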
## Quick start

```bash
uv sync
just smoke           # tiny-random model, projected pathway, ~1-2 min
just smoke-vanilla   # tiny-random model, vanilla pathway, ~1-2 min
just download-model  # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full      # queue extract + 3-seed vanilla + 3-seed projected sweep
```

See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

## Current findings (preliminary, n=1 seed)

These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.

**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
persona-voice, try/except swallow, type-only assert, weak inequality predicate)
drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
selective rather than just deflating both rates. Vanilla-baseline head-to-head
and seed=42/43 replicates are queued.

**What turns out to matter for the design (entries f, i):** the extracted
v_hack basis goes stale fast during training. The per-step cosine of the
live teacher gradient against v_hack decays from about 0.27 at step 0 to
about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps
(`--vhack-refresh-every=2`) keeps the second-half-of-training cosine about
1.43x higher than the frozen baseline. But at the 21-pair width, the
refresh effect on last-5 hack_s is small (47.5% frozen vs 45.0% refresh-2,
about 2.5pp). Basis width does most of the work; refresh helps marginally.

## Hypotheses (preregistered)

See [spec.md](spec.md). Headline (H1): gradient projection in SVD basis against
a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate
by >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).

Status at 2026-05-29: a 30pp absolute drop has been observed within the
projected arm at n=1 seed (12-pair to 21-pair, entry h). The vanilla-baseline
head-to-head that H1 actually requires and n>=2 seed replication are queued.