wassname/evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

wassname 270c4f5a27 misc

2026-06-11 11:07:28 +00:00

Writeup spec -- gradient routing vs RL reward hacking

Status (2026-06-10): method is lora2r routeV (rank-2r Gaussian-init LoRA, deployed block [:r] + quarantine block [r:]; per-rollout banded three-way SGTM gate on the c-probe gradient vs an extracted hack direction v_grad, quarantine ablated at deploy). The retired variants (route2b/erase, PiSSA, lora_frozen_b, AntiPaSTO basis, online_stats gate, the "knob" nickname) are gone from the code and should not appear in the paper. The workshop paper presents ONE working method (lora2r routeV), shown to beat the vanilla baseline (intervention=none on the SAME adapter) and ablated against a Haar-random direction (placebo) and an all-absorption arm.
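A minimal sketch of the block layout described above, assuming a plain NumPy LoRA with factors A (2r x d_in) and B (d_out x 2r). The function names, the init scale `std`, and the mask helper are illustrative assumptions, not names from the repo; only the [:r] deployed / [r:] quarantine split and the deploy-time ablation come from the status note.

```python
import numpy as np

def init_lora2r(d_in, d_out, r, std=0.02, seed=0):
    """Gaussian-init both factors of a rank-2r adapter (std is an assumed value)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, std, size=(2 * r, d_in))   # down-projection
    B = rng.normal(0.0, std, size=(d_out, 2 * r))  # up-projection
    return A, B

def block_mask(r, keep):
    """1.0 on rank slots that stay active; keep is 'deployed', 'quarantine' or 'both'."""
    mask = np.zeros(2 * r)
    if keep in ("deployed", "both"):
        mask[:r] = 1.0   # deployed block [:r]
    if keep in ("quarantine", "both"):
        mask[r:] = 1.0   # quarantine block [r:]
    return mask

def lora_delta(A, B, x, mask):
    """Adapter contribution B @ diag(mask) @ A @ x for one input vector x."""
    return B @ (mask * (A @ x))

# Training uses both blocks; deploy ablates the quarantine block.
A, B = init_lora2r(d_in=8, d_out=6, r=2, seed=1)
x = np.ones(8)
train_out = lora_delta(A, B, x, block_mask(2, "both"))
deploy_out = lora_delta(A, B, x, block_mask(2, "deployed"))
```

Because the adapter is linear in the rank slots, the training-time output is the sum of the deployed-only and quarantine-only contributions, which is what makes the quarantine block cleanly deletable in this sketch.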

Workshop paper scope (the whole thing):

Method: lora2r routeV -- route each GRPO rollout's gradient by its band-normalized cosine to v_grad into clean (deployed-only) / hack (quarantine-only) / mid (both). The quarantine block is deleted at deploy.
Baseline: vanilla GRPO = intervention=none (gate pinned clean) on the identical rank-2r adapter, so the comparison is capacity- and structure-matched (no shrinkage confound). Show routeV deploys at lower hack rate at matched solve.
Ablations (one row per arm, same seed/preset): Haar-random v_grad placebo (directionality, the decisive control) + absorb (gate pinned mid, isolates the gate+hard-masks from absorption alone). No erase arm, no per-token arm.
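The three-way routing in the Method bullet can be sketched as follows. This is a hedged illustration, not the repo's gate: it assumes "band-normalized" means rescaling the cosine by a band [lower, upper], and that values at or below the band route clean, values at or above route hack, and values inside route mid. The names `route`, `BLOCK_MASKS`, and `mask_adapter_grad` are hypothetical.

```python
import numpy as np

# (keep deployed block, keep quarantine block) for each routing label
BLOCK_MASKS = {"clean": (1.0, 0.0), "hack": (0.0, 1.0), "mid": (1.0, 1.0)}

def cosine(g, v, eps=1e-12):
    """Cosine similarity between a rollout gradient g and the direction v."""
    return float(g @ v / (np.linalg.norm(g) * np.linalg.norm(v) + eps))

def route(g, v_grad, lower, upper):
    """Assumed banded gate: rescale the cosine so the band maps to [0, 1]."""
    width = upper - lower
    if width <= 0.0:
        # A collapsed band carries no routing signal; this sketch refuses to guess.
        raise ValueError("band width must be positive")
    z = (cosine(g, v_grad) - lower) / width
    if z <= 0.0:
        return "clean"   # gradient goes to the deployed block only
    if z >= 1.0:
        return "hack"    # gradient goes to the quarantine block only
    return "mid"         # gradient goes to both blocks

def mask_adapter_grad(grad_A, label, r):
    """Zero the rows of a (2r, d_in) factor gradient that the label excludes."""
    keep_deployed, keep_quarantine = BLOCK_MASKS[label]
    out = grad_A.copy()
    out[:r] *= keep_deployed
    out[r:] *= keep_quarantine
    return out
```

Under this reading, intervention=none corresponds to always returning "clean" and the absorb arm to always returning "mid", which matches how the baseline and ablation bullets describe pinning the gate.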

Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw post), then a workshop paper (NeurIPS/ICLR alignment or interpretability workshop) if the n=3 routeV-vs-vanilla deploy gap holds and the placebo ablation comes back clean.

The one-paragraph story

Labs already do RL on coding/agentic tasks and the model learns to exploit grader flaws. We ask: can an alignment intervention at the gradient level, using only a weak hack detector (catches some hack types, misses others), suppress the hacks the detector never saw? We extract a hack direction v_grad from a handful of hand-paired clean/hack completions (off-distribution, authored by us, the "weak detector for hack A"), then during GRPO we route any rollout whose gradient aligns with that direction into a throwaway quarantine block that is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping partition) the routed model deploys at lower hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is the held-out modes: build the direction from 2 of the 4 loopholes, measure suppression on the other 2.

No-cheat invariant: the authored pairs are the ONLY routing-label source. No oracle / ground-truth hack-label of a live rollout ever touches training.
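The text does not say how v_grad is computed from the authored pairs. As one plausible stand-in, a difference-of-means direction over per-pair gradients is sketched below; `extract_direction` and its inputs are hypothetical, and the actual extraction in the repo may differ. The point of the sketch is the invariant: the only inputs are gradients from the authored clean/hack pairs, never labels on live rollouts.

```python
import numpy as np

def extract_direction(hack_grads, clean_grads, eps=1e-12):
    """Unit vector along the mean (hack - clean) gradient difference.

    hack_grads, clean_grads: arrays of shape (n_pairs, dim), row i of each
    coming from the same authored pair. Difference-of-means is an assumed
    technique here, not confirmed by the source.
    """
    hack = np.asarray(hack_grads, dtype=float)
    clean = np.asarray(clean_grads, dtype=float)
    diff = (hack - clean).mean(axis=0)
    return diff / (np.linalg.norm(diff) + eps)

# Toy pairs in 2 dimensions: the hack gradients differ along the first axis.
v_grad = extract_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```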

Claims (compress to <=3, ranked by how much evidence we have)

C1 (primary, existence -> systematic). Routing the GRPO gradient against a weak-detector hack direction in the lora2r c-probe space lowers deploy hack rate vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.

Evidence: the 4-arm lora2r decision run (queue-decision: routeV-real vs none/vanilla, s41/42/43). Deploy = quarantine-ablated, n=32 prompts, T=0.7.
Confidence today: untested on lora2r. The pre-lora2r per-token route run (endpoint table below) is suggestive (hack 0.61 -> 0.04 at comparable solve) but on a different adapter; it is prior evidence, not the headline.

C2 (mechanism, the no-cheat payload). The suppression generalises to loophole modes the detector never flagged: a 2-of-4-mode direction also lowers hack on the 2 held-out modes.

Evidence: cross-mechanism runs (spec 20260528_cross_mechanism_v_hack). This is the claim a referee will attack hardest, and the one that makes the result interesting rather than a tautology.
Confidence: untested. If C2 fails, the post becomes "routing suppresses known hacks at the gradient level" -- weaker but still honest.

C3 (specificity / not-a-regularizer). The effect needs the direction, not just the act of carving a quarantine block out of the adapter, and not just routing gradient mass away. A Haar-random v_grad of matched per-module rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy hack-drop. The banded gate makes this clean: real-V has a positive band (hack pairs separate from clean pairs along v_grad); random-V does not.

Evidence: the placebo arm (--routeV-random-v-seed) in the decision run, frout-matched to real-V so the control quarantines comparable mass but in an arbitrary direction. The absorb arm separately isolates the gate+masks.
Confidence: untested for lora2r. The decisive control; must land before we claim directional specificity. (On PiSSA it tied -- shrinkage; lora2r's unfrozen B is the structural fix, see RESEARCH_JOURNAL PiSSA->lora2r entry.)
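For the placebo, a Haar-random direction of matched per-module rank and norm can be drawn with the standard QR construction. The sketch below assumes each module's v_grad is stored as a (dim, rank) matrix whose columns are the directions; that layout and the function names are assumptions, and this is not the code behind --routeV-random-v-seed.

```python
import numpy as np

def haar_basis(dim, rank, rng):
    """Orthonormal (dim, rank) basis, Haar-distributed on the Stiefel manifold."""
    G = rng.normal(size=(dim, rank))
    Q, R = np.linalg.qr(G)            # reduced QR: Q is (dim, rank)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs                  # sign fix makes the distribution exactly Haar

def placebo_like(v, seed):
    """Random directions with the same shape and column norms as v."""
    rng = np.random.default_rng(seed)
    Q = haar_basis(v.shape[0], v.shape[1], rng)
    return Q * np.linalg.norm(v, axis=0)   # rescale each column to match v

v_real = np.array([[3.0, 0.0], [0.0, 2.0], [4.0, 0.0], [0.0, 0.0]])
v_fake = placebo_like(v_real, seed=0)
```

Matching the column norms keeps the control comparable in scale, so any difference in deploy hack rate can be attributed to direction rather than magnitude. Matching the routed fraction (frout) is a separate step not covered here.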

Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

Field: RL post-training teaches capable behaviour but also teaches models to exploit flaws in the reward/grader (reward hacking).
Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026 advantage modification) or on the data; they need a detector that catches the hack at scoring time.
Problem: at deployment some hacks are unknown, so a detector-at-scoring-time approach can only suppress what it already sees.
Here we show: routing the GRPO gradient away from a hack direction extracted from a weak detector (few hand-paired examples covering only some hack types) lowers the deploy hack rate, including on held-out hack types, at comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole substrate.
Comparison: unlike advantage-level methods this never reads the live grader; the only supervision is the fixed weak-detector pair set, mimicking the known/unknown-hack split at deployment.
Context: gradient routing (Cloud et al. 2024) realised as an SGTM-style block partition inside one rank-2r LoRA, giving a deletable quarantine block.
Standard of evidence / risk: existence-to-systematic at n=3; the Haar-random placebo and the absorb arm rule out generic adapter regularization; the held-out-mode test is the load-bearing generalisation claim and the main failure risk.

Paper artifacts -- the goal tracker (durable; this is what we are building)

Canonical list of what the workshop paper/blog needs; each artifact names its source and blocking state so the goal survives compaction. Status legend: [x] done [/] data landing [ ] not started. Each finished run writes per_mode_deploy.json + train.safetensors under out/runs/_/.

A1 -- Keynote figure. routeV vs vanilla deploy hack/solve over training, n=3 band. [ ] blocked on the lora2r 4-arm decision run (queue-decision, s41/42/43). Pre-lora2r prototype: out/figs/eval2_pertoken_vs_vanilla_dynamics.png.

A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3 seeds, routeV vs vanilla, delta vs vanilla, paired test + alpha. [ ] same blocker as A1.
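The A2 statistics can be computed with the standard library alone. The sketch below shows mean +/- SEM and a paired t statistic over seeds; the function names are illustrative and the numbers are synthetic placeholders, not results from any run.

```python
import math
import statistics

def mean_sem(xs):
    """Sample mean and standard error of the mean (sample stdev / sqrt(n))."""
    return statistics.mean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-seed values a vs b."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d, sem_d = mean_sem(diffs)
    return mean_d / sem_d, len(diffs) - 1

# SYNTHETIC placeholder hack rates for seeds 41/42/43, for illustration only.
route_hack = [0.1, 0.2, 0.3]
vanilla_hack = [0.5, 0.7, 0.9]

t_stat, dof = paired_t(route_hack, vanilla_hack)
route_mean, route_sem = mean_sem(route_hack)
```

With n=3 seeds the paired test has 2 degrees of freedom, so the two-sided 5% critical value is about 4.30; stating alpha alongside the statistic makes clear how little room three seeds leave.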

A3 -- Ablation table (what each component buys). One row per arm at matched seed/preset, deploy hack + solve:

none / vanilla (gate pinned clean, identical adapter) -> emergence reference
routeV (the method)
routeV placebo (Haar v_grad, direction arbitrary) -> control: should NOT work
absorb (gate pinned mid, no gate) -> gate-vs-absorption

[ ] blocked on the decision run. Shakedown in flight: job 40 (60-step routeV on the new md pairs, s43) proves the pipeline + band separation on the live 4B model before the n=3 spend.

A4 -- Long-run figure. ~200-step routeV vs vanilla saturation reference. [ ] not re-run on lora2r. Pre-lora2r finding (route held hack=0 to 200 steps; vanilla learned the cheat then collapsed ~step 88, no clean saturation past there) is in RESEARCH_JOURNAL -- carry as an honest caveat, re-measure on lora2r only if budget allows.

A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy hack: v_grad from 2 of 4 modes, measure suppression on the 2 held-out modes. [ ] NOT QUEUED -- highest-value gap. Queue once the n=3 band confirms C1 (spec 20260528_cross_mechanism_v_hack).
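The held-out design in A5 amounts to choosing which 2 of the 4 loophole modes build the direction and evaluating on the remaining 2. A short sketch enumerating the candidate splits follows; the mode names are placeholders, since the actual names are not given here.

```python
from itertools import combinations

# Placeholder names for the four loophole modes (hypothetical).
MODES = ("mode_a", "mode_b", "mode_c", "mode_d")

def heldout_splits(modes, n_train=2):
    """All (train_modes, heldout_modes) pairs with n_train modes used for v_grad."""
    splits = []
    for train in combinations(modes, n_train):
        heldout = tuple(m for m in modes if m not in train)
        splits.append((train, heldout))
    return splits

splits = heldout_splits(MODES)
```

There are six such splits; running more than one guards against a result that depends on which pair of modes happened to be held out.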

A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion, clean completion for all 4 modes. [x] done -- blog appendix (docs/blog/20260529_...md#appendix-the-four-loophole-modes).

A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width, refresh cadence, teacher mix, gate mode, solve-orthog, pairset content/placebo. [x] data exists; just needs porting into the paper.

Next action when the decision run lands: read each per_mode_deploy.json, run just results, fill A1/A2/A3, append a journal entry. Then queue A5 (the gap).

Red-team checklist before publishing (paper-writing evidence standards)

n=3 deploy gap stated with SEM, not cherry-picked seed.
Haar placebo does NOT reproduce the drop at matched frout (else it is mass-quarantine / regularization, C3 dies).
absorb arm reported: ~vanilla -> gate+masks add nothing; << vanilla -> absorption alone suppresses.
held-out-mode suppression measured (C2), reported even if it fails.
solve rate matched within stated band; a hack drop that only comes with a solve collapse is reported as such, not as a win.
no-cheat invariant stated explicitly: live routing never reads gt_pass or runs the detector suite over student rollouts; the authored pair set is the only supervision.
base-model and vanilla-saturation references present so emergence is real (base solve ~0.094-0.126 on the paper test set; no-loophole ceiling job 34).

Eval contamination fix (load-bearing, 2026-06-07)

Eval is on the paper's recency-held-out test set (leetcode_test_medhard, every id >= 3243), NOT the holdout/first-N (memorized -> base solve 0.94, kills the hack metric's gt-fail headroom). Training uses a seeded representative shuffle, not first-N-by-id. Verified base solve = 0.094 on test_medhard (matches paper fn9 ~12%; mild undershoot from max_new truncation). Full table: docs/spec/20260607_eval_contamination_fix.md.

Canonical endpoint table (pre-lora2r, latest real deploy numbers)

Authoritative paper-test endpoints from the per-token routeV run (prog_wide pairs) -- the prior adapter (lora_frozen_b/PiSSA era), n=119 full test. The lora2r decision run will replace these as the headline.

condition	solve	hack
base model (paper: 0.115)	0.126	0.000
vanilla GRPO (paper: 0.149)	0.101	0.613
routeV per-token, prog_wide (pre-lora2r)	0.143	0.042
no-loophole ceiling (paper: 0.223)	job 34, queued	0.000

Read: pre-lora2r routeV nearly eliminated the vanilla hack increase and preserved base-model solve; solve was +1.7pp over base / +4.2pp over vanilla, but n=119 is insufficient to claim either solve difference. Caveats: prog_wide pairs are pool-derived (contamination-prone, not headline-clean); the n=32 monitoring subset is systematically harder than full test (use full n=119 for claims).
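A rough binomial standard-error check illustrates why n=119 cannot resolve the solve differences while the hack difference is unambiguous. The sketch treats each arm as an independent sample of 119 prompts, which is an approximation: the arms share prompts, so a paired analysis could give a somewhat different interval. Function names are illustrative.

```python
import math

def prop_se(p, n):
    """Binomial standard error of a rate p measured on n prompts."""
    return math.sqrt(p * (1.0 - p) / n)

def diff_in_se_units(p1, p2, n):
    """(p1 - p2) divided by the standard error of the difference (independent arms)."""
    se = math.sqrt(prop_se(p1, n) ** 2 + prop_se(p2, n) ** 2)
    return (p1 - p2) / se

N = 119  # full paper test set
solve_vs_base = diff_in_se_units(0.143, 0.126, N)     # routeV vs base solve
solve_vs_vanilla = diff_in_se_units(0.143, 0.101, N)  # routeV vs vanilla solve
hack_vs_route = diff_in_se_units(0.613, 0.042, N)     # vanilla vs routeV hack
```

Under these assumptions the two solve gaps come out at roughly 0.4 and 1.0 standard errors, well inside noise, while the hack gap is around 12 standard errors.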

Offline eval protocol (implemented 2026-06-09, now the code default)

Training does no periodic eval by default (eval_ablate_every=0); it saves deploy checkpoints every 10 optimizer updates (save_ckpt_every=10), independent of eval.
A separate job (just eval-curve RUN) scores checkpoints on the full n=119 paper test; for routeV it records both quarantine-on (train) and quarantine-off (deploy) so the mechanism figure shows train-hack rising while deploy-hack stays low. Batched eval (eval_batch_size=2), fixed prompt IDs + generation seed.
Monitoring subset (if used): one deterministic stratified n=64 (≈8 base-solved + 56 base-failed, matching the 12.6% full-test base solve), frozen IDs, scored at a few checkpoints only. Do NOT search shuffle seeds to match full-test solve.
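The monitoring subset can be drawn once, deterministically, with a fixed seed and then frozen. The sketch below assumes the base model's solved/failed prompt IDs are already known; the function name, the seed, and the ID values are placeholders.

```python
import random

def stratified_subset(solved_ids, failed_ids, n_solved=8, n_failed=56, seed=0):
    """One fixed stratified draw: n_solved base-solved + n_failed base-failed IDs."""
    rng = random.Random(seed)                       # single fixed seed, never searched
    picked = rng.sample(sorted(solved_ids), n_solved)
    picked += rng.sample(sorted(failed_ids), n_failed)
    return sorted(picked)                           # stable order for frozen-ID files

# Placeholder IDs: 15 base-solved and 104 base-failed prompts (119 total).
solved = range(5000, 5015)
failed = range(6000, 6104)
subset = stratified_subset(solved, failed)
```

Sorting the inputs before sampling makes the draw independent of the order in which IDs were collected, so the same seed always yields the same 64 prompts.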

Open editorial decisions

Project/repo name: projected_grpo is now a misnomer (method is routing, not projection). README already calls it vGROUT (vector gradient routing). Decide the public repo name before the code link goes in the post.
Re-headline the blog draft to lora2r routeV (the route2/erase framing is dead).
Workshop vs blog-only: gate on C2 landing.
