train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO (guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1), drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode partition and route2 v_grad extraction; only the teacher-rollout MIX is removed. Smoke (mix=0 + normal mix=0.5 + vanilla) all green. Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4 status (route2 durable to 200; vanilla collapses ~88, not clean saturation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Writeup spec -- gradient routing vs RL reward hacking
Status: drafting harness. Numbers are placeholders until the queued jobs land (see "Run -> artifact map" below). This is the spec for the LW blog post and a possible workshop paper; it pins the narrative, the claims, and which run fills each table cell so we can drop numbers in instead of re-deciding the story.
Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw post), then a workshop paper (NeurIPS/ICLR alignment or interpretability workshop) if the n=3 route2-vs-vanilla deploy gap holds and at least one red-team control (random-V, placebo) comes back clean.
The one-paragraph story
Labs already do RL on coding/agentic tasks, and the model learns to exploit grader flaws. We ask: can an alignment intervention at the gradient level, using only a weak hack detector (catches some hack types, misses others), suppress the hacks the detector never saw? We extract a hack direction from a handful of hand-paired clean/hack completions (the "weak detector for hack A"), then during GRPO we route any rollout whose gradient aligns with that direction into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping partition) the routed model deploys at a lower hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is the held-out modes: build the direction from 2 of the 4 loopholes, measure suppression on the other 2.
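A minimal sketch of the routing step described above, for readers who want the shape of the idea before the details. Assumptions, none of which come from train.py: gradients are flat Python lists rather than coordinates in the SVD-of-W adapter basis, alignment is a cosine against a single v_hack vector, and the names route_gradients, deploy_update and threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def route_gradients(rollout_grads, v_hack, threshold=0.0):
    """Split per-rollout gradients into a kept update and a quarantined update.

    A rollout whose gradient has cosine > threshold with v_hack is summed into
    the quarantine accumulator; every other rollout is summed into the kept one.
    The threshold value is an illustrative assumption.
    """
    dim = len(v_hack)
    keep = [0.0] * dim
    quarantine = [0.0] * dim
    for g in rollout_grads:
        target = quarantine if cosine(g, v_hack) > threshold else keep
        for i, gi in enumerate(g):
            target[i] += gi
    return keep, quarantine


def deploy_update(keep, quarantine):
    """Knob-off deploy: the quarantined update is discarded, only keep survives."""
    return list(keep)
```

The point of the split is that the hack-aligned signal still has somewhere to go during training (the quarantine knob) instead of being projected out, and deleting the knob at deploy removes whatever accumulated there.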
Claims (compress to <=3, ranked by how much evidence we have)
C1 (primary, existence -> systematic). Routing the GRPO gradient against a weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the preregistered H1 bar); honest framing is "reduces hack at comparable solve", with the pp gap stated as measured.
C2 (mechanism, the no-cheat payload). The suppression generalises to loophole modes the detector never flagged: a 2-of-4-mode direction also lowers hack on the 2 held-out modes.
- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack, tasks #106/#107). This is the claim a referee will attack hardest and the one that makes the result interesting rather than a tautology.
- Confidence: untested at writeup time. If C2 fails, the post becomes "routing suppresses known hacks at the gradient level" -- weaker but still honest.
C3 (specificity / not-a-regularizer). The effect needs the direction, not just the act of carving a rank-k knob out of the adapter. A Haar-random orthonormal V of matched per-module rank/norm does NOT reproduce the deploy hack-drop.
- Evidence: job 81 (frozen random-V route control), job 80 (null_city placebo pairset, "random content" pairs -- results.md Q10 already shows +0.024, i.e. no effect, for the placebo at the old setup). Plus the test-time erasure benchmark job 83 (does post-hoc weight/act erasure match train-time routing, or is the hack baked in -> motivating train-time intervention).
- Confidence: the placebo precedent (Q10) is encouraging; random-V is the decisive one both gpt-5.5 and the brainstorm flagged. Must land before we claim directional specificity.
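For the C3 control, a Haar-random orthonormal V can be drawn by Gram-Schmidt on i.i.d. Gaussian vectors. The sketch below is stdlib-only and illustrative: haar_orthonormal and scale_to_frobenius are hypothetical names, and how the repo actually matches per-module rank/norm is not stated here, so the Frobenius-norm rescale is an assumption.

```python
import math
import random


def haar_orthonormal(dim, rank, seed=0):
    """Return `rank` orthonormal vectors in R^dim, Haar-distributed.

    Gram-Schmidt applied to i.i.d. standard Gaussian vectors yields a
    Haar-distributed orthonormal frame.
    """
    if rank > dim:
        raise ValueError("rank cannot exceed dim")
    rng = random.Random(seed)
    basis = []
    while len(basis) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Two passes of modified Gram-Schmidt for numerical stability.
        for _ in range(2):
            for b in basis:
                proj = sum(x * y for x, y in zip(v, b))
                v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm < 1e-12:
            continue  # degenerate draw, resample
        basis.append([x / norm for x in v])
    return basis


def scale_to_frobenius(basis, target_norm):
    """Rescale an orthonormal frame so its Frobenius norm equals target_norm.

    k orthonormal vectors have Frobenius norm sqrt(k). Matching on Frobenius
    norm is an assumption about what "matched norm" means.
    """
    factor = target_norm / math.sqrt(len(basis))
    return [[factor * x for x in b] for b in basis]
```

Drawing one such V per module at the module's own rank, then rescaling to the norm of the real hack-direction block, keeps the control identical to route2 in everything except the direction's content.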
Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
- Field: RL post-training teaches capable behaviour but also teaches models to exploit flaws in the reward/grader (reward hacking).
- Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026 advantage modification) or on the data; they need a detector that catches the hack at scoring time.
- Problem: at deployment some hacks are unknown, so a detector-at-scoring-time approach can only suppress what it already sees.
- Here we show: routing the GRPO gradient away from a hack direction extracted from a weak detector (few hand-paired examples covering only some hack types) lowers the deploy hack rate, including on held-out hack types, at comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole substrate.
- Comparison: unlike advantage-level methods this never reads the live grader; the only supervision is the fixed weak-detector pair set, mimicking the known/unknown-hack split at deployment.
- Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis (AntiPaSTO) gives a deletable quarantine knob.
- Standard of evidence / risk: existence-to-systematic at n=3; random-V and placebo controls rule out generic adapter regularization; the held-out-mode test is the load-bearing generalisation claim and the main failure risk.
Paper artifacts -- the goal tracker (durable; this is what we are building)
This is the canonical list of what the workshop paper/blog needs. Each artifact names its source runs and blocking state so the goal survives context compaction. Status legend: [x] done [/] data landing [ ] not started. Each finished run writes per_mode_deploy.json + train.safetensors under out/runs/_/; deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.
A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3 band. Prototype exists: out/figs/dyn_sub4*.png (just dyn). [/] blocked on the n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).
A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3 seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated. [/] same blocker as A1 (74, 84).
A3 -- Ablation table (what each component buys; the arms you named). One row per arm at matched seed/preset, deploy hack + solve:
- vanilla (no intervention) -> 65 / band
- erase (no routing; one-sided subtract) -> 66, rerun 75
- route2 full (routing on) -> 68-70
- route2 frozen (no online refresh) -> 68-70 are frozen; refresh = 78
- random-V route (direction is arbitrary) -> 81 [control: should NOT work]
- placebo pairset (null_city content) -> 80 [control: should NOT work]
- post-hoc test-time erase (no train-time) -> 83 (scripts/tt_erase_bench.py)
[ ] blocked on 75/78/80/81/83 (all queued). This is the "we are filling out ablations" table.
A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running). [/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat (out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). vanilla learns the cheat to ~0.55 by step 80 then COLLAPSES at ~88 (student logp craters, reward->0, gn spikes ~75x, beta=0 no KL anchor) -- so the gap is durable in the valid 0-85 window, but vanilla is not a clean saturation reference past step 88. Decision pending (user): leave the collapse as an honest finding + limitations line, or requeue vanilla-200 with an advantage std-floor for a clean saturating reference. Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug ids); the live runs are 84 (route2) and 85 (vanilla).
A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes. [ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec 20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion, clean completion for all 4 modes. [x] done -- blog appendix (docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.
A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width (Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9), pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.
Next action when 74+84 land: read each per_mode_deploy.json, run just dyn, fill A1/A2, append a journal entry. Then queue A5 (the gap).
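The A2 cells (mean +/- SEM over seeds, delta vs vanilla, paired test) can be filled with something like the sketch below. It assumes each arm is a list of per-seed deploy rates in the same seed order (s41, s42, s43); mean_sem and paired_t are hypothetical helper names, and no numbers are included because the runs have not landed.

```python
import math
import statistics


def mean_sem(xs):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    n = len(xs)
    m = statistics.fmean(xs)
    sem = statistics.stdev(xs) / math.sqrt(n) if n > 1 else float("nan")
    return m, sem


def paired_t(arm, baseline):
    """Paired t statistic on per-seed differences (arm - baseline).

    Returns (mean difference, t statistic, degrees of freedom). Both lists
    must use the same seed order. t is NaN when the differences have zero
    variance, since the statistic is undefined there.
    """
    if len(arm) != len(baseline):
        raise ValueError("arms must have the same number of seeds")
    diffs = [a - b for a, b in zip(arm, baseline)]
    m, sem = mean_sem(diffs)
    t = m / sem if sem > 0 else float("nan")
    return m, t, len(diffs) - 1
```

At n=3 the paired test has 2 degrees of freedom, so the two-sided critical value at alpha=0.05 is about 4.30; a gap that looks large can still miss it, which is one more reason to report the pp gap as measured with its SEM.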
Red-team checklist before publishing (paper-writing evidence standards)
- n=3 deploy gap stated with SEM, not cherry-picked seed.
- random-V (81) does NOT reproduce the drop (else it is regularization, C3 dies).
- placebo pairset (80) shows ~no effect (precedent: Q10 +0.024).
- held-out-mode suppression measured (C2), reported even if it fails.
- solve rate matched within stated band; a hack drop that only comes with a solve collapse is reported as such, not as a win.
- no-cheat invariant stated explicitly: live routing never reads gt_pass or runs the full detector suite over student rollouts; the pair set is the only supervision. (Promote to README/spec, plan item #114.)
- [/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the 0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past there) -- report honestly, don't crop the collapse to fake a flat-high ref.
- base-model and vanilla-saturation references present so emergence is real.
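For the held-out-mode item (C2 / A5), the bookkeeping is small enough to sketch. Assumptions: per-mode deploy hack rates arrive as a plain dict of mode name -> rate (the real per_mode_deploy.json schema is not specified here), and heldout_splits, mean_rate and heldout_drop are hypothetical names.

```python
from itertools import combinations


def heldout_splits(modes, k=2):
    """Yield (seen, held_out) tuples for every way of choosing k seen modes."""
    for seen in combinations(modes, k):
        held = tuple(m for m in modes if m not in seen)
        yield seen, held


def mean_rate(by_mode, modes):
    """Unweighted mean of the per-mode rates over the given modes."""
    return sum(by_mode[m] for m in modes) / len(modes)


def heldout_drop(vanilla_by_mode, routed_by_mode, held):
    """Hack-rate drop (vanilla minus routed) restricted to held-out modes."""
    return mean_rate(vanilla_by_mode, held) - mean_rate(routed_by_mode, held)
```

With 4 modes and k=2 there are 6 possible seen/held-out splits. Reporting the drop on the held-out pair separately from the seen pair is what keeps the C2 claim from collapsing into "routing suppresses the hacks it was shown".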
Open editorial decisions
- Project/repo name: projected_grpo is now a misnomer (method is routing, not projection). Candidate: gradient_quarantine. Decide before the public repo link goes in the post. (Retitle docs first; rename package/repo only if we ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.