evil_MoE/docs/spec/20260602_writeup_spec.md

# Writeup spec -- gradient routing vs RL reward hacking

Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
erase is DROPPED from the paper (predecessor variant, no narrative cost). The
workshop paper = ONE working method (route2b), shown better than the vanilla
baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).

Workshop paper scope (the whole thing):
1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
   a pair-calibrated band into a deletable quarantine knob.
2. Baseline: vanilla GRPO. Show route2b deploys at a lower hack rate at matched solve rate.
3. Ablation: random-V control (directionality, the decisive one) + granularity
   (per-rollout vs per-token) + frozen vs refresh. No erase arm.

Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
ablation comes back clean.

## The one-paragraph story

Labs already do RL on coding/agentic tasks, and the models learn to exploit
grader flaws. We ask: can an alignment intervention at the *gradient* level,
using only a weak hack detector (catches some hack types, misses others),
suppress the hacks the detector never saw? We extract a hack direction from a
handful of hand-paired clean/hack completions (the "weak detector for hack A"),
then during GRPO we route any rollout whose gradient aligns with that direction
into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/
Nanda's LeetCode loophole substrate (4 grading flaws in an even, non-overlapping
partition) the routed model deploys at a lower hack rate than vanilla GRPO at
comparable solve rate. The load-bearing test is the held-out modes: build the
direction from 2 of the 4 loopholes, measure suppression on the other 2.
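
A minimal sketch of the banded gate described above, for one rollout's flattened gradient. Everything here is an illustrative assumption rather than the project's implementation: the names `calibrate_band`, `gate` and `route`, the calibration rule (band edges = mean cosine of the clean pairs and of the hack pairs), and the linear ramp inside the band. The SVD-of-W basis and the per-token variant are not modelled.

```python
import math


def cosine(g, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g, v))
    ng = math.sqrt(sum(a * a for a in g))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (ng * nv) if ng > 0 and nv > 0 else 0.0


def calibrate_band(clean_cos, hack_cos):
    """ASSUMED rule: lower edge = mean clean-pair cosine, upper = mean hack-pair cosine."""
    lower = sum(clean_cos) / len(clean_cos)
    upper = sum(hack_cos) / len(hack_cos)
    return lower, upper


def gate(c, lower, upper):
    """Fraction of the gradient routed to quarantine: 0 below the band, 1 above it."""
    if upper <= lower:
        # Degenerate band: fall back to a hard threshold (assumption).
        return 1.0 if c >= upper else 0.0
    return min(1.0, max(0.0, (c - lower) / (upper - lower)))


def route(g, v_grad, lower, upper):
    """Split g into (kept, quarantined) parts according to the gate weight."""
    w = gate(cosine(g, v_grad), lower, upper)
    kept = [(1.0 - w) * a for a in g]
    quarantined = [w * a for a in g]
    return kept, quarantined
```

In this reading, the kept part updates the deployed weights and the quarantined part updates the knob that is deleted at deploy.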

## Claims (compress to <=3, ranked by how much evidence we have)

C1 (primary, existence -> systematic). Routing the GRPO gradient against a
weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs
vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla
  s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the
  preregistered H1 bar); honest framing is "reduces hack at comparable solve",
  with the pp gap stated as measured.

C2 (mechanism, the no-cheat payload). The suppression generalises to loophole
modes the detector never flagged: a 2-of-4-mode direction also lowers hack on
the 2 held-out modes.
- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack,
  tasks #106/#107). This is the claim a referee will attack hardest and the one
  that makes the result interesting rather than a tautology.
- Confidence: untested at writeup time. If C2 fails, the post becomes "routing
  suppresses *known* hacks at the gradient level" -- weaker but still honest.

C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
just the act of carving a rank-k knob out of the adapter, and not just
quarantining gradient mass. A Haar-random v_grad of matched per-module
rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
the deploy hack-drop. The banded gate makes this clean: real-V has a positive
band (hack pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
  to the real-V run so the control quarantines comparable mass but in an
  arbitrary direction.
- Confidence: untested for route2b. This is the decisive control that both
  gpt-5.5 and the brainstorm flagged; it must land before we claim directional
  specificity.
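
A sketch of how a matched random control direction could be drawn for a single module, assuming v_grad is stored as a list of rank-k row vectors. `haar_random_frame` and `matched_random_v` are hypothetical names. Gram-Schmidt on i.i.d. Gaussian vectors yields a Haar-distributed orthonormal frame; each row is then rescaled to the norm of the corresponding real row. Whether the real v_grad rows are orthogonal, and how frout matching is done, is not covered here.

```python
import math
import random


def haar_random_frame(dim, rank, rng):
    """Return `rank` orthonormal vectors in R^dim via Gram-Schmidt on Gaussian draws."""
    frame = []
    while len(frame) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in frame:
            # Remove the component of v along each vector already in the frame.
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # redraw in the (measure-zero) degenerate case
            frame.append([a / norm for a in v])
    return frame


def matched_random_v(real_v, rng):
    """Random control with the same rank and per-row norms as real_v (one module)."""
    dim = len(real_v[0])
    norms = [math.sqrt(sum(a * a for a in row)) for row in real_v]
    frame = haar_random_frame(dim, len(real_v), rng)
    return [[n * a for a in u] for n, u in zip(norms, frame)]
```

Calling this once per module with a seeded `random.Random` would give a reproducible control of matched per-module rank and norm.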

## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

1. Field: RL post-training teaches capable behaviour but also teaches models to
   exploit flaws in the reward/grader (reward hacking).
2. Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026
   advantage modification) or on the data; they need a detector that catches the
   hack at scoring time.
3. Problem: at deployment some hacks are unknown, so a detector-at-scoring-time
   approach can only suppress what it already sees.
4. Here we show: routing the GRPO gradient away from a hack direction extracted
   from a *weak* detector (few hand-paired examples covering only some hack
   types) lowers the deploy hack rate, including on held-out hack types, at
   comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole
   substrate.
5. Comparison: unlike advantage-level methods this never reads the live grader;
   the only supervision is the fixed weak-detector pair set, mimicking the
   known/unknown-hack split at deployment.
6. Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis
   (AntiPaSTO) gives a deletable quarantine knob.
7. Standard of evidence / risk: existence-to-systematic at n=3; random-V and
   placebo controls are there to rule out generic adapter regularization; the
   held-out-mode test is the load-bearing generalisation claim and the main
   failure risk.

## Paper artifacts -- the goal tracker (durable; this is what we are building)

This is the canonical list of what the workshop paper/blog needs. Each artifact
names its source runs and blocking state so the goal survives context compaction.
Status legend: [x] done  [/] data landing  [ ] not started. Each finished run
writes per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/;
deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.

A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3
band. Prototype exists: out/figs/dyn_sub4*.png (`just dyn`). [/] blocked on the
n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs
ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).

A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
[/] same blocker as A1 (74, 84).
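
The A2 statistics could be computed along these lines. This is a sketch with hypothetical helper names (`mean_sem`, `paired_t`), assuming one deploy hack rate per seed per arm, paired by seed. It uses the sample standard deviation (n-1 denominator) for the SEM.

```python
import math
import statistics


def mean_sem(xs):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    m = statistics.mean(xs)
    sem = statistics.stdev(xs) / math.sqrt(len(xs))
    return m, sem


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-seed differences a_i - b_i."""
    d = [x - y for x, y in zip(a, b)]
    m, sem = mean_sem(d)
    if sem == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return m / sem, len(d) - 1
```

With n=3 seeds there are 2 degrees of freedom, so a two-sided test at alpha=0.05 needs |t| > 4.303. That is a high bar, which is one reason to report the measured pp gap and SEM alongside the test.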

A3 -- Ablation table (what each component buys). One row per arm at matched
seed/preset, deploy hack + solve:
  - vanilla (no intervention)               -> 129/131/132
  - route2b per-rollout (the method)        -> 134 (s43), +41/42 if it wins
  - route2b per-token (granularity ablation)-> 135 (s43)
  - random-V route2b (direction arbitrary)  -> Q3, queue at winning granularity [control: should NOT work]
  - route2b frozen vs refresh-5             -> refresh is default; frozen = one extra run if gap is interesting
[ ] blocked on 134/135 landing, then the random-V control. This is the
"filling out ablations" table. Erase row removed (arm dropped from paper).

A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
[/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
(out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). Vanilla learns the
cheat to ~0.55 by step 80, then COLLAPSES at ~88 (student logp craters,
reward->0, gn spikes ~75x; beta=0, so there is no KL anchor) -- so the gap is
durable in the valid 0-85 window, but vanilla is not a clean saturation
reference past step 88. Decision
pending (user): leave the collapse as an honest finding + limitations line, or
requeue vanilla-200 with an advantage std-floor for a clean saturating reference.
Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug
ids); the live runs are 84 (route2) and 85 (vanilla).

A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec
20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
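
For bookkeeping, the held-out protocol can be sketched as follows. The mode names are placeholders and `heldout_splits` / `heldout_delta` are hypothetical helpers. The spec only names the G2/G3 runs, so enumerating all six 2-of-4 splits is purely illustrative. The per-mode dicts stand in for whatever `by_mode` in per_mode_deploy.json actually contains, which is an assumption about its shape.

```python
from itertools import combinations

MODES = ["mode_a", "mode_b", "mode_c", "mode_d"]  # placeholder names for the 4 loopholes


def heldout_splits(modes, n_train=2):
    """Yield (train_modes, heldout_modes) for every choice of n_train modes."""
    for train in combinations(modes, n_train):
        held = tuple(m for m in modes if m not in train)
        yield train, held


def heldout_delta(routed_by_mode, vanilla_by_mode, held):
    """Mean (routed - vanilla) deploy hack rate over held-out modes; negative = suppression."""
    return sum(routed_by_mode[m] - vanilla_by_mode[m] for m in held) / len(held)
```

Reporting the same delta on the two training modes next to the held-out delta would show how much of the suppression transfers.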

A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion,
clean completion for all 4 modes. [x] done -- blog appendix
(docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.

A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width
(Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9),
pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.

Next action when 74+84 land: read each per_mode_deploy.json, `just dyn`,
fill A1/A2, append a journal entry. Then queue A5 (the gap).

## Red-team checklist before publishing (paper-writing evidence standards)

- [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
      mass-quarantine / regularization, C3 dies).
- [ ] held-out-mode suppression measured (C2), reported even if it fails.
- [ ] solve rate matched within stated band; a hack drop that only comes with a
      solve collapse is reported as such, not as a win.
- [ ] no-cheat invariant stated explicitly: live routing never reads gt_pass or
      runs the full detector suite over student rollouts; the pair set is the
      only supervision. (Promote to README/spec, plan item #114.)
- [/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the
      0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past
      there) -- report honestly, don't crop the collapse to fake a flat-high ref.
- [ ] base-model and vanilla-saturation references present, so the emergence of
      hacking is shown to be real.

## Open editorial decisions

- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
  projection). Candidate: `gradient_quarantine`. Decide before the public repo
  link goes in the post. (Retitle docs first; rename package/repo only if we
  ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.