mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00


wassname bfa2b82aba plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source

plot_dynamics now dumps the plotted series to <out>.csv next to the png so the
figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored).
ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is
visible rather than hidden under the axis.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-02 23:19:06 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| .vscode | write up | 2026-06-02 07:20:42 +00:00 |
| docs | results: fill keynote table/figure at n=3 route2 / n=2 vanilla | 2026-06-02 11:08:41 +00:00 |
| external | setup | 2026-05-23 10:40:02 +08:00 |
| out/figs | results: fill keynote table/figure at n=3 route2 / n=2 vanilla | 2026-06-02 11:08:41 +00:00 |
| scripts | plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source | 2026-06-02 23:19:06 +00:00 |
| src/projected_grpo | fix: kill StepLogger merge-shadow that crashed every new run | 2026-06-02 08:09:05 +00:00 |
| .gitignore | misc | 2026-06-02 02:06:43 +00:00 |
| .gitmodules | concepts | 2026-05-29 06:29:20 +00:00 |
| AGENTS.md | wip | 2026-05-30 04:33:33 +00:00 |
| CLAUDE.md | docs | 2026-05-29 05:42:28 +00:00 |
| justfile | Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine | 2026-06-02 07:21:49 +00:00 |
| pyproject.toml | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00 |
| README.md | docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots | 2026-06-02 01:24:29 +00:00 |
| RESEARCH_JOURNAL.md | journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed) | 2026-06-02 16:33:04 +00:00 |
| uv.lock | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00 |

README.md

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

We cannot cheat (the load-bearing constraint)

The point is an alignment tool a lab would actually use, where at deployment there are known hacks and unknown hacks. So the detector is allowed to be weak: it may catch hack type A and miss type B. We then use the gradient from A to try to stop the model learning B. If that works, it mimics the generalisation to unknown hacks we'd need at deployment. A detector that already sees every hack proves nothing.

Concretely, the boundary is: using detector flags (E/C/D) to select which rollouts become contrastive pairs is fine, because that is the "weak detector for hack A" we're allowed to have. What is cheating is gating the live projection on the ground-truth grader (gt_pass) or running the full detector suite over the student's rollouts during training. The whole result is uninteresting if we let the oracle in at train time.

How it works

We're trying to ablate the "hack direction" from the training gradient on every update. The model learns by descending the gradient; if we strip out the component pointing toward reward-hacking before the optimizer step, it can't move in that direction even when the reward says it should.

To get the direction, we pair examples by hand: for each problem, one completion that solves it honestly and one that uses the kind of trick the model would learn to exploit. Then for each pair we compute the exact GRPO gradient you would get if the hack rollout had advantage +1 and the clean rollout had advantage -1: that's -grad logp(hack) + grad logp(clean) per pair. Stack these vectors over our ~10 pairs and SVD the result; the top right singular vectors are our hack-direction basis. (Mechanically this is identical to a twin-NLL extraction because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but the GRPO framing is the one we mean: extraction produces a sample of the gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
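The extraction step can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the repository's code: the function name `extract_hack_basis`, the array shapes, and the synthetic gradients are all hypothetical, and the real pipeline works per module in delta_S space rather than on one flat vector.

```python
import numpy as np

def extract_hack_basis(grad_logp_hack, grad_logp_clean, k):
    """Stack per-pair GRPO loss gradients (adv=+1 hack, adv=-1 clean) and SVD them.

    Both inputs have shape (n_pairs, dim); each row is the gradient of
    log p(completion) with respect to the trainable parameters.
    Returns (basis, singular_values), with basis of shape (k, dim).
    """
    hack = np.asarray(grad_logp_hack, dtype=float)
    clean = np.asarray(grad_logp_clean, dtype=float)
    # Per-pair loss gradient: -grad logp(hack) + grad logp(clean).
    pair_grads = -hack + clean
    # Rows of vt are right singular vectors, sorted by descending singular value.
    _, s, vt = np.linalg.svd(pair_grads, full_matrices=False)
    return vt[:k], s[:k]

# Synthetic data (hypothetical): hack gradients share one common direction.
rng = np.random.default_rng(0)
dim, n_pairs = 64, 10
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
grad_hack = rng.normal(scale=0.1, size=(n_pairs, dim)) + 3.0 * true_dir
grad_clean = rng.normal(scale=0.1, size=(n_pairs, dim))

basis, svals = extract_hack_basis(grad_hack, grad_clean, k=3)
# How well the top extracted direction recovers the planted one (sign-agnostic).
alignment = abs(basis[0] @ true_dir)
```

With n_pairs rows the stacked matrix has rank at most n_pairs, which is why directions beyond the first few carry mostly noise.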

The hope is that this sample of the labeled-pair GRPO gradient covers enough of the same subspace as the actual unlabeled GRPO gradient during training that ablating along the extracted directions also ablates the relevant component of the live gradient. Not a theorem; we check it empirically by watching whether cin_t > cin_s, where cin is the cosine of a gradient with the extracted subspace: the v_hack basis should light up more on cached teacher rollouts (cin_t) than on student ones (cin_s).

Everything happens in the SVD-of-W basis. Each Linear layer's weight is rotated into singular-value coordinates and we train a small per-module knob delta_S in that basis (AntiPaSTO). So the extracted directions, the live gradient, and the projection all live in delta_S space, which is low-dimensional per module (~500 to 2560 coordinates).
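As a rough picture of what a delta_S knob could look like, the sketch below assumes the knob is an additive offset on the singular values of a frozen weight. That additive form is an assumption made for illustration; the actual AntiPaSTO parameterisation may differ, and `effective_weight` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # frozen weight of one Linear layer (toy size)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

def effective_weight(delta_S):
    """Rebuild the weight with a per-singular-value offset (assumed additive)."""
    return U @ np.diag(S + delta_S) @ Vt

delta_S = np.zeros_like(S)   # the only trainable parameter in this sketch
W0 = effective_weight(delta_S)  # zero knob reproduces the frozen weight

delta_S[0] = 0.5             # nudge the top singular direction only
W1 = effective_weight(delta_S)
```

Under this assumption each module contributes one coordinate per singular value, so gradients, extracted directions, and projections are all vectors of length min(out_features, in_features).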

Noise floor at load. SVD gives us up to K directions per module sorted by singular value, and the lower ones are mostly noise (with 10 pairs you can only fit rank-10 of real signal). We collect every singular value across every module, take a global quantile, and drop any (module, axis) whose S_i is below it. Default cut: bottom 25%. Modules whose every axis lands below get filtered out entirely. Global rather than per-module because a noisy module shouldn't be protected by having its own "top direction".
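A minimal sketch of the global cut, assuming the singular values arrive as a dict keyed by module name. The function name `noise_floor_mask`, the module names, and the toy values are hypothetical.

```python
import numpy as np

def noise_floor_mask(singular_values_by_module, drop_quantile=0.25):
    """Keep only (module, axis) entries whose singular value clears a global quantile."""
    all_s = np.concatenate(
        [np.asarray(s, dtype=float) for s in singular_values_by_module.values()]
    )
    cutoff = np.quantile(all_s, drop_quantile)  # one cutoff shared by every module
    kept = {}
    for name, s in singular_values_by_module.items():
        mask = np.asarray(s, dtype=float) >= cutoff
        if mask.any():  # modules with no surviving axis are dropped entirely
            kept[name] = mask
    return kept, cutoff

# Toy singular values (hypothetical): module "b" is noisy everywhere.
svals = {
    "a": [4.0, 3.0, 2.0, 1.0],
    "b": [0.03, 0.02],
    "c": [5.0, 2.5, 1.5, 0.8, 0.5, 0.01],
}
kept, cutoff = noise_floor_mask(svals, drop_quantile=0.25)
```

Here module "b" disappears even though 0.03 is its own top direction, which is the behaviour a per-module cut would not give.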

At training time: GRPO gives us a gradient on each delta_S; we subtract the component along the kept hack directions; the optimizer steps on what's left. We log cin (cosine of the live gradient with the subspace before projection) and cout (after). On a working extraction, cout should be near zero on no_gate runs (we removed the alignment), and cin_t > cin_s should hold throughout (v_hack discriminates hack from clean gradients).
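The projection and the two logged cosines can be sketched as follows. This assumes the kept hack directions form an orthonormal set of rows and uses a plain two-sided projection; the one-sided variant of the erase arm is not modelled, and the helper names are hypothetical.

```python
import numpy as np

def project_out(grad, basis):
    """Remove from grad its component in the span of the orthonormal rows of basis."""
    coords = basis @ grad            # coordinates of grad along each hack axis
    return grad - basis.T @ coords

def subspace_cosine(grad, basis):
    """Norm of grad's projection onto span(basis), divided by the norm of grad."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return 0.0
    return float(np.linalg.norm(basis @ grad) / norm)

rng = np.random.default_rng(0)
dim, k = 32, 4
# Orthonormal rows via QR of a random matrix (stand-in for the kept v_hack axes).
q, _ = np.linalg.qr(rng.normal(size=(dim, k)))
basis = q.T

grad = rng.normal(size=dim)                  # stand-in for a live delta_S gradient
cin = subspace_cosine(grad, basis)           # alignment before projection
grad_proj = project_out(grad, basis)
cout = subspace_cosine(grad_proj, basis)     # alignment after projection
```

In this idealised setting cout is zero up to floating-point error, which matches the expectation that cout sits near zero on no_gate runs.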

What we compare

The environment has four loophole modes (run_tests, sentinel, stdout_marker, file_marker), each a different grading flaw with a truthful hint disclosing the mechanism. Problems are partitioned evenly and non-overlappingly, so a vanilla student can learn all four independently. Full prompt+hint, hack, and clean traces per mode are in the blog appendix (docs/blog/20260529_...md).

Four arms plus a weak-detector variant of route, all sharing the same model and teacher pool and differing only in the gradient policy:

vanilla -- no intervention; the emergence reference.
erase -- subtract the v_hack component from the live delta_S gradient (one-sided).
route -- quarantine the v_hack component into a throwaway delta_S_hack knob, deleted at deploy. Gradient routing (Cloud et al. 2024) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
route2 -- current routing arm. Per-rollout gate cos(g_rollout, v_grad) > tau (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis delta_S_hack. An exploration floor (rollout_ablate_frac) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
route, weak detector -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
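The per-rollout gate of route2 can be sketched as below. The midpoint rule in `calibrate_tau` is an assumption: the text says only that tau is calibrated from the hack-vs-clean cosine gap. Scale matching, the distinct basis for delta_S_hack, and the exploration floor are left out, and all names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def calibrate_tau(hack_cosines, clean_cosines):
    """ASSUMPTION: midpoint between the mean hack cosine and the mean clean cosine."""
    return 0.5 * (float(np.mean(hack_cosines)) + float(np.mean(clean_cosines)))

def route_rollouts(rollout_grads, v_grad, tau):
    """Send each rollout's whole gradient to the quarantine knob or the main knob."""
    main = np.zeros_like(v_grad)
    quarantine = np.zeros_like(v_grad)
    for g in rollout_grads:
        if cosine(g, v_grad) > tau:
            quarantine += g   # looks hack-like: route into delta_S_hack
        else:
            main += g         # looks clean: train the deployed knob
    return main, quarantine

v_grad = np.array([1.0, 0.0, 0.0])
rollout_grads = [
    np.array([1.0, 0.1, 0.0]),   # nearly parallel to v_grad
    np.array([0.0, 1.0, 0.0]),   # orthogonal to v_grad
    np.array([0.1, 0.0, 1.0]),   # weakly aligned with v_grad
]
tau = calibrate_tau(hack_cosines=[0.9, 0.8], clean_cosines=[0.1, 0.0])
main, quarantine = route_rollouts(rollout_grads, v_grad, tau)
```

Unlike erase, the gate acts on whole rollouts: a flagged rollout contributes nothing to the deployed knob, rather than contributing its gradient minus a projected component.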

Deploy hack/solve is measured the same way for every arm now: knob-off forward (ablate_quarantine for route/route2, the trained model itself for vanilla/erase), n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.

The frozen-vs-refresh distinction is orthogonal: any arm can re-extract v_hack every N steps on the current adapter (for route, with the quarantine ablated during extraction, see the blog).

Quick start

uv sync
just smoke               # tiny-random model, projected pathway, ~1-2 min
just smoke-vanilla       # tiny-random model, vanilla pathway, ~1-2 min
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Current findings (preliminary, n=1 seed)

Stale as of 2026-06-02. The numbers below are the late-May erase/basis-width result at the old default mix=0.5. Default mix is now locked to 0.125, the primary arm is route2, and the live comparison is per-arm deploy hack/solve (knob-off, n=64, T=0.7). n=3 no-floor route2 + matched vanilla refs are landing (pueue jobs 68-79); this section gets rewritten on those numbers. Latest results live in RESEARCH_JOURNAL.md.

These are headline results from the fast preset (20 steps, mix=0.5, seed=41). Full provenance and per-step log audits are in RESEARCH_JOURNAL.md.

What appears to work (seed 41): a stronger extracted basis drops last-5 student hack rate from 77.5% (v_hack_full) to 47.5% (v_hack_21pairs), frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected 2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases differ on three axes at once — pairs used (10 vs 16), directions kept (k=5 vs k=12), and extract tau (0.25 vs 0.0) — so this is NOT cleanly "more pairs". A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline head-to-head and seed=42/43 replicates are queued.

What turns out to matter for the design (entries f, i): the extracted v_hack basis goes stale fast during training. The per-step cosine of the live teacher gradient against v_hack decays from about 0.27 at step 0 to about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps (--vhack-refresh-every=2) keeps the second-half-of-training cosine about 1.43x higher than the frozen baseline. But at the 21-pair width, the refresh effect on last-5 hack_s is small (47.5% frozen vs 45.0% refresh-2, about 2.5pp). Basis width does most of the work; refresh helps marginally.

Hypotheses (preregistered)

See spec.md. Headline (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward-hack rate by >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
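Read as a decision rule, H1 amounts to two checks on percentage-point quantities. The sketch below is one possible encoding; the function name and the example numbers are hypothetical and are not results from this repository.

```python
def h1_supported(hack_vanilla, hack_projected, pass_vanilla, pass_projected,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """True if hack rate drops by at least min_drop_pp at matched pass rate.

    All rates are percentages (0 to 100); differences are percentage points.
    """
    drop = hack_vanilla - hack_projected
    matched = abs(pass_vanilla - pass_projected) <= pass_tol_pp
    return drop >= min_drop_pp and matched

# Made-up numbers, for illustration only.
ok = h1_supported(80.0, 45.0, 20.0, 22.0)               # 35pp drop, pass within 2pp
too_small = h1_supported(80.0, 60.0, 20.0, 22.0)        # only a 20pp drop
pass_mismatch = h1_supported(80.0, 45.0, 20.0, 35.0)    # pass rate differs by 15pp
```

The comparison is against vanilla GRPO, so a drop measured between two projected-arm bases does not satisfy this rule on its own.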

Status at 2026-05-29: 30pp absolute drop confirmed within the projected arm at n=1 seed (12-pair to 21-pair, entry h). Vanilla-baseline head-to-head and n>=2 seed replication queued.