Writeup spec -- gradient routing vs RL reward hacking
Status (2026-06-10): method is lora2r routeV (rank-2r Gaussian-init LoRA,
deployed block [:r] + quarantine block [r:]; per-rollout banded three-way SGTM
gate on the c-probe gradient vs an extracted hack direction v_grad, quarantine
ablated at deploy). The retired variants (route2b/erase, PiSSA, lora_frozen_b,
AntiPaSTO basis, online_stats gate, the "knob" nickname) are gone from the code
and should not appear in the paper. The workshop paper = ONE working method
(lora2r routeV), shown better than the vanilla baseline (intervention=none on the
SAME adapter), and ablated against a Haar-random direction (placebo) and an
all-absorption arm.
Workshop paper scope (the whole thing):
- Method: lora2r routeV -- route each GRPO rollout's gradient by its band-normalized cosine to v_grad into clean (deployed-only) / hack (quarantine-only) / mid (both). The quarantine block is deleted at deploy.
- Baseline: vanilla GRPO = intervention=none (gate pinned clean) on the identical rank-2r adapter, so the comparison is capacity- and structure-matched (no shrinkage confound). Show routeV deploys at lower hack rate at matched solve.
- Ablations (one row per arm, same seed/preset): Haar-random v_grad placebo (directionality, the decisive control) + absorb (gate pinned mid, isolates the gate+hard-masks from absorption alone). No erase arm, no per-token arm.
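The three-way routing described in the scope above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the names `cosine`, `route_label` and `mask_rows`, the band thresholds `lower`/`upper`, and the row-wise layout of the rank-2r gradient are all hypothetical, and how the band is calibrated is not shown.

```python
import math
from typing import List, Sequence


def cosine(g: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between a gradient probe g and a direction v."""
    dot = sum(a * b for a, b in zip(g, v))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_g == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_g * norm_v)


def route_label(cos: float, lower: float, upper: float) -> str:
    """Three-way gate: at or below the band is clean, at or above is hack, inside is mid.

    Assumption: the band [lower, upper] is calibrated elsewhere.
    """
    if cos <= lower:
        return "clean"
    if cos >= upper:
        return "hack"
    return "mid"


def mask_rows(grad_rows: Sequence[Sequence[float]], r: int, label: str) -> List[List[float]]:
    """Zero the rows of a rank-2r gradient that the label does not update.

    Assumed layout: rows [:r] are the deployed block, rows [r:] the quarantine block.
    clean -> deployed only, hack -> quarantine only, mid -> both.
    """
    keep_deployed = label in ("clean", "mid")
    keep_quarantine = label in ("hack", "mid")
    out = []
    for i, row in enumerate(grad_rows):
        keep = keep_deployed if i < r else keep_quarantine
        out.append(list(row) if keep else [0.0] * len(row))
    return out


# Toy usage with r = 1: a rollout aligned with v_grad only updates the quarantine row.
label = route_label(cosine([1.0, 0.0], [1.0, 0.0]), lower=0.2, upper=0.6)
masked = mask_rows([[1.0, 2.0], [3.0, 4.0]], r=1, label=label)
```

Pinning the label to "clean" for every rollout corresponds to the vanilla baseline on the same adapter, and pinning it to "mid" corresponds to the absorb arm.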
Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw post), then a workshop paper (NeurIPS/ICLR alignment or interpretability workshop) if the n=3 routeV-vs-vanilla deploy gap holds and the placebo ablation comes back clean.
The one-paragraph story
Labs already do RL on coding/agentic tasks and the model learns to exploit
grader flaws. We ask: can an alignment intervention at the gradient level,
using only a weak hack detector (catches some hack types, misses others),
suppress the hacks the detector never saw? We extract a hack direction v_grad
from a handful of hand-paired clean/hack completions (off-distribution, authored
by us, the "weak detector for hack A"), then during GRPO we route any rollout
whose gradient aligns with that direction into a throwaway quarantine block that
is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4
grading flaws, even non-overlapping partition) the routed model deploys at lower
hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is
the held-out modes: build the direction from 2 of the 4 loopholes, measure
suppression on the other 2.
No-cheat invariant: the authored pairs are the ONLY routing-label source. No oracle / ground-truth hack-label of a live rollout ever touches training.
Claims (compress to <=3, ranked by how much evidence we have)
C1 (primary, existence -> systematic). Routing the GRPO gradient against a weak-detector hack direction in the lora2r c-probe space lowers deploy hack rate vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: the 4-arm lora2r decision run (queue-decision: routeV-real vs none/vanilla, s41/42/43). Deploy = quarantine-ablated, n=32 prompts, T=0.7.
- Confidence today: untested on lora2r. The pre-lora2r per-token route run (endpoint table below) is suggestive (hack 0.61 -> 0.04 at comparable solve) but on a different adapter; it is prior evidence, not the headline.
C2 (mechanism, the no-cheat payload). The suppression generalises to loophole modes the detector never flagged: a 2-of-4-mode direction also lowers hack on the 2 held-out modes.
- Evidence: cross-mechanism runs (spec 20260528_cross_mechanism_v_hack). This is the claim a referee will attack hardest, and the one that makes the result interesting rather than a tautology.
- Confidence: untested. If C2 fails, the post becomes "routing suppresses known hacks at the gradient level" -- weaker but still honest.
C3 (specificity / not-a-regularizer). The effect needs the direction, not just
the act of carving a quarantine block out of the adapter, and not just routing
gradient mass away. A Haar-random v_grad of matched per-module rank/norm
collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy
hack-drop. The banded gate makes this clean: real-V has a positive band (hack
pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: the placebo arm (--routeV-random-v-seed) in the decision run, frout-matched to real-V so the control quarantines comparable mass but in an arbitrary direction. The absorb arm separately isolates the gate+masks.
- Confidence: untested for lora2r. The decisive control; must land before we claim directional specificity. (On PiSSA it tied -- shrinkage; lora2r's unfrozen B is the structural fix, see RESEARCH_JOURNAL PiSSA->lora2r entry.)
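For the placebo in C3, a direction with no preferred orientation can be drawn by normalizing an i.i.d. Gaussian vector, which is uniformly distributed on the sphere. The sketch below is an assumption-laden illustration, not the code behind `--routeV-random-v-seed`: `haar_direction` is a hypothetical helper, it handles a single vector per module, and matching rank for a multi-vector direction would need an orthonormal frame, which is not shown.

```python
import math
import random
from typing import List, Sequence


def haar_direction(dim: int, norm: float, seed: int) -> List[float]:
    """Sample a direction uniformly on the sphere and rescale it to `norm`.

    Normalizing a standard Gaussian vector yields a uniformly distributed
    unit vector; `norm` would be set to the norm of the real v_grad for the
    same module so the placebo is norm-matched.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in z))
    return [norm * x / length for x in z]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Hypothetical module dimension and target norm, chosen only for illustration.
placebo = haar_direction(dim=4096, norm=2.5, seed=0)
fixed = [1.0] * 4096
overlap = cosine(placebo, fixed)  # concentrates near 0 in high dimension
```

In high dimension the cosine between a random direction and any fixed vector concentrates near zero (on the order of 1/sqrt(dim)), which is the intuition behind the collapsed band for random-V.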
Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
- Field: RL post-training teaches capable behaviour but also teaches models to exploit flaws in the reward/grader (reward hacking).
- Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026 advantage modification) or on the data; they need a detector that catches the hack at scoring time.
- Problem: at deployment some hacks are unknown, so a detector-at-scoring-time approach can only suppress what it already sees.
- Here we show: routing the GRPO gradient away from a hack direction extracted from a weak detector (few hand-paired examples covering only some hack types) lowers the deploy hack rate, including on held-out hack types, at comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole substrate.
- Comparison: unlike advantage-level methods this never reads the live grader; the only supervision is the fixed weak-detector pair set, mimicking the known/unknown-hack split at deployment.
- Context: gradient routing (Cloud et al. 2024) realised as an SGTM-style block partition inside one rank-2r LoRA, giving a deletable quarantine block.
- Standard of evidence / risk: existence-to-systematic at n=3; the Haar-random placebo and the absorb arm rule out generic adapter regularization; the held-out-mode test is the load-bearing generalisation claim and the main failure risk.
Paper artifacts -- the goal tracker (durable; this is what we are building)
Canonical list of what the workshop paper/blog needs; each artifact names its source and blocking state so the goal survives compaction. Status legend: [x] done [/] data landing [ ] not started. Each finished run writes per_mode_deploy.json + train.safetensors under out/runs/_/.
A1 -- Keynote figure. routeV vs vanilla deploy hack/solve over training, n=3 band. [ ] blocked on the lora2r 4-arm decision run (queue-decision, s41/42/43). Pre-lora2r prototype: out/figs/eval2_pertoken_vs_vanilla_dynamics.png.
A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3 seeds, routeV vs vanilla, delta vs vanilla, paired test + alpha. [ ] same blocker as A1.
A3 -- Ablation table (what each component buys). One row per arm at matched seed/preset, deploy hack + solve:
- none / vanilla (gate pinned clean, identical adapter) -> emergence reference
- routeV (the method)
- routeV placebo (Haar v_grad, direction arbitrary) -> control: should NOT work
- absorb (gate pinned mid, no gate) -> gate-vs-absorption
[ ] blocked on the decision run. Shakedown in flight: job 40 (60-step routeV on the new md pairs, s43) proves the pipeline + band separation on the live 4B model before the n=3 spend.
A4 -- Long-run figure. ~200-step routeV vs vanilla saturation reference. [ ] not re-run on lora2r. Pre-lora2r finding (route held hack=0 to 200 steps; vanilla learned the cheat then collapsed ~step 88, no clean saturation past there) is in RESEARCH_JOURNAL -- carry as an honest caveat, re-measure on lora2r only if budget allows.
A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
hack: v_grad from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue once the n=3 band confirms C1 (spec
20260528_cross_mechanism_v_hack).
A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion, clean completion for all 4 modes. [x] done -- blog appendix (docs/blog/20260529_...md#appendix-the-four-loophole-modes).
A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width, refresh cadence, teacher mix, gate mode, solve-orthog, pairset content/placebo. [x] data exists; just needs porting into the paper.
Next action when the decision run lands: read each per_mode_deploy.json,
just results, fill A1/A2/A3, append a journal entry. Then queue A5 (the gap).
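When A5 is queued, the build/held-out splits can be enumerated up front so no split is chosen after seeing results. The sketch below is illustrative only: `heldout_splits` is a hypothetical helper and the mode names are placeholders, not the substrate's actual loophole names.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Split = Tuple[Tuple[str, ...], Tuple[str, ...]]


def heldout_splits(modes: Sequence[str], n_build: int = 2) -> List[Split]:
    """List every (build, held-out) split: v_grad is built from `build`,
    suppression is measured on `held-out`."""
    splits = []
    for build in combinations(modes, n_build):
        heldout = tuple(m for m in modes if m not in build)
        splits.append((build, heldout))
    return splits


# Placeholder mode names (assumption); four modes give six 2-vs-2 splits.
MODES = ("mode_a", "mode_b", "mode_c", "mode_d")
SPLITS = heldout_splits(MODES)
```

Running all six splits is the most complete version of the test; if budget allows only one or two, fixing which ones in advance keeps the held-out claim honest.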
Red-team checklist before publishing (paper-writing evidence standards)
- n=3 deploy gap stated with SEM, not cherry-picked seed.
- Haar placebo does NOT reproduce the drop at matched frout (else it is mass-quarantine / regularization, C3 dies).
- absorb arm reported: ~vanilla -> gate+masks add nothing; << vanilla -> absorption alone suppresses.
- held-out-mode suppression measured (C2), reported even if it fails.
- solve rate matched within stated band; a hack drop that only comes with a solve collapse is reported as such, not as a win.
- no-cheat invariant stated explicitly: live routing never reads gt_pass or runs the detector suite over student rollouts; the authored pair set is the only supervision.
- base-model and vanilla-saturation references present so emergence is real (base solve ~0.094-0.126 on the paper test set; no-loophole ceiling job 34).
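The "mean +/- SEM" and paired-test items above can be computed with the standard library alone. This is a minimal sketch assuming one deploy hack rate per seed per arm; `mean_sem` and `paired_t` are hypothetical helper names, and the numbers in the usage lines are made-up placeholders, not results.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def mean_sem(xs: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    return mean(xs), stdev(xs) / math.sqrt(len(xs))


def paired_t(treated: Sequence[float], control: Sequence[float]) -> Tuple[float, float, int]:
    """Paired t statistic on per-seed differences (treated - control).

    Returns (mean difference, t statistic, degrees of freedom).
    """
    if len(treated) != len(control):
        raise ValueError("arms must have one value per shared seed")
    diffs: List[float] = [t - c for t, c in zip(treated, control)]
    m, sem = mean_sem(diffs)
    if sem == 0.0:
        raise ValueError("zero variance in differences; t is undefined")
    return m, m / sem, len(diffs) - 1


# Placeholder per-seed hack rates (NOT measured values), seeds paired by position.
routed = [0.1, 0.1, 0.4]
vanilla = [0.5, 0.6, 0.7]
delta, t_stat, dof = paired_t(routed, vanilla)
```

With n=3 seeds the paired test has 2 degrees of freedom, where the two-sided 5% critical value of t is about 4.30, so the gap needs to be both large and consistent across seeds to clear it.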
Eval contamination fix (load-bearing, 2026-06-07)
Eval is on the paper's recency-held-out test set (leetcode_test_medhard, every id
= 3243), NOT the holdout/first-N (memorized -> base solve 0.94, kills the hack metric's gt-fail headroom). Training uses a seeded representative shuffle, not first-N-by-id. Verified base solve = 0.094 on test_medhard (matches paper fn9 ~12%; mild undershoot from max_new truncation). Full table: docs/spec/20260607_eval_contamination_fix.md.
Canonical endpoint table (pre-lora2r, latest real deploy numbers)
Authoritative paper-test endpoints from the per-token routeV run (prog_wide pairs) -- the prior adapter (lora_frozen_b/PiSSA era), n=119 full test. The lora2r decision run will replace these as the headline.
| condition | solve | hack |
|---|---|---|
| base model (paper: 0.115) | 0.126 | 0.000 |
| vanilla GRPO (paper: 0.149) | 0.101 | 0.613 |
| routeV per-token, prog_wide (pre-lora2r) | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | job 34, queued | 0.000 |
Read: pre-lora2r routeV nearly eliminated the vanilla hack increase and preserved base-model solve; solve was +1.7pp over base / +4.2pp over vanilla, but n=119 is insufficient to claim either solve difference. Caveats: prog_wide pairs are pool-derived (contamination-prone, not headline-clean); the n=32 monitoring subset is systematically harder than full test (use full n=119 for claims).
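A rough check of why n=119 cannot support the solve differences: treat each solve rate as a binomial proportion and compare the gap to its standard error. This is a back-of-envelope sketch that treats the two conditions as independent samples (an assumption; the same prompts are scored under both conditions, so a paired analysis would be the proper test). The function names are hypothetical.

```python
import math


def binom_se(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


def z_unpaired(p1: float, p2: float, n: int) -> float:
    """Gap between two proportions in units of its (unpaired) standard error."""
    se = math.sqrt(binom_se(p1, n) ** 2 + binom_se(p2, n) ** 2)
    return abs(p1 - p2) / se


N_TEST = 119
z_vs_base = z_unpaired(0.143, 0.126, N_TEST)     # routeV vs base model solve
z_vs_vanilla = z_unpaired(0.143, 0.101, N_TEST)  # routeV vs vanilla GRPO solve
```

Under this approximation both solve gaps come out at or below about one standard error, well short of the usual 1.96 threshold, which is consistent with reporting solve as preserved rather than improved.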
Offline eval protocol (implemented 2026-06-09, now the code default)
- Training does no periodic eval by default (eval_ablate_every=0); it saves deploy checkpoints every 10 optimizer updates (save_ckpt_every=10), independent of eval.
- A separate job (just eval-curve RUN) scores checkpoints on the full n=119 paper test; for routeV it records both quarantine-on (train) and quarantine-off (deploy) so the mechanism figure shows train-hack rising while deploy-hack stays low. Batched eval (eval_batch_size=2), fixed prompt IDs + generation seed.
- Monitoring subset (if used): one deterministic stratified n=64 (≈8 base-solved + 56 base-failed, matching the 12.6% full-test base solve), frozen IDs, scored at a few checkpoints only. Do NOT search shuffle seeds to match full-test solve.
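The deterministic stratified monitoring subset can be drawn once with a fixed seed and then frozen. The sketch below assumes the base model's per-prompt solve outcomes are already known; `stratified_subset` and the synthetic ID ranges are hypothetical and only illustrate the 8 + 56 split.

```python
import random
from typing import Iterable, List


def stratified_subset(
    solved_ids: Iterable[int],
    failed_ids: Iterable[int],
    n_solved: int = 8,
    n_failed: int = 56,
    seed: int = 0,
) -> List[int]:
    """Draw a fixed-size sample from each stratum with one fixed seed.

    IDs are sorted before sampling so the result does not depend on input
    order; the seed is chosen once and never tuned against solve rate.
    """
    rng = random.Random(seed)
    solved = rng.sample(sorted(solved_ids), n_solved)
    failed = rng.sample(sorted(failed_ids), n_failed)
    return sorted(solved + failed)


# Synthetic stand-in for the n=119 test set: 15 base-solved IDs (15/119 is
# about 12.6%) and 104 base-failed IDs. Real prompt IDs would go here.
base_solved = range(0, 15)
base_failed = range(15, 119)
monitor_ids = stratified_subset(base_solved, base_failed)
```

Writing `monitor_ids` to a file and loading it in every eval job keeps the subset frozen across runs, independent of any training shuffle seed.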
Open editorial decisions
- Project/repo name: projected_grpo is now a misnomer (method is routing, not projection). README already calls it vGROUT (vector gradient routing). Decide the public repo name before the code link goes in the post.
- Re-headline the blog draft to lora2r routeV (the route2/erase framing is dead).
- Workshop vs blog-only: gate on C2 landing.