From 17e4f2e2ff2f827b8c7a3c2c9d1a61c077c94ffb Mon Sep 17 00:00:00 2001
From: wassname
Date: Tue, 2 Jun 2026 04:41:43 +0000
Subject: [PATCH] feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker

- deploy hack/solve is now the headline metric for all arms, so turn the
  mid-train deploy-eval on by default (smoke now covers the deploy path too);
  200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
  (keynote fig+table, ablation table, long-run fig, generalisation, appendix).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/spec/20260602_writeup_spec.md | 153 +++++++++++++++++++++++++++++
 src/projected_grpo/train.py        |   4 +-
 2 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 docs/spec/20260602_writeup_spec.md

diff --git a/docs/spec/20260602_writeup_spec.md b/docs/spec/20260602_writeup_spec.md
new file mode 100644
index 0000000..942eefb
--- /dev/null
+++ b/docs/spec/20260602_writeup_spec.md
@@ -0,0 +1,153 @@
+# Writeup spec -- gradient routing vs RL reward hacking
+
+Status: drafting harness. Numbers are placeholders until the queued jobs land
+(see "Run -> artifact map" below). This is the spec for the LW blog post and a
+possible workshop paper; it pins the narrative, the claims, and which run fills
+each table cell so we can drop numbers in instead of re-deciding the story.
+
+Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
+post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
+workshop) if the n=3 route2-vs-vanilla deploy gap holds and at least one
+red-team control (random-V, placebo) comes back clean.
+
+## The one-paragraph story
+
+Labs already do RL on coding/agentic tasks and the model learns to exploit
+grader flaws. We ask: can an alignment intervention at the *gradient* level,
+using only a weak hack detector (catches some hack types, misses others),
+suppress the hacks the detector never saw? We extract a hack direction from a
+handful of hand-paired clean/hack completions (the "weak detector for hack A"),
+then during GRPO we route any rollout whose gradient aligns with that direction
+into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/
+Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping
+partition) the routed model deploys at lower hack rate than vanilla GRPO at
+comparable solve rate. The load-bearing test is the held-out modes: build the
+direction from 2 of the 4 loopholes, measure suppression on the other 2.
+
+## Claims (compress to <=3, ranked by how much evidence we have)
+
+C1 (primary, existence -> systematic). Routing the GRPO gradient against a
+weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs
+vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
+- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 79/74/72 (vanilla
+  s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
+- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the
+  preregistered H1 bar); honest framing is "reduces hack at comparable solve",
+  with the pp gap stated as measured.
+
+C2 (mechanism, the no-cheat payload). The suppression generalises to loophole
+modes the detector never flagged: a 2-of-4-mode direction also lowers hack on
+the 2 held-out modes.
+- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack,
+  tasks #106/#107). This is the claim a referee will attack hardest and the one
+  that makes the result interesting rather than a tautology.
+- Confidence: untested at writeup time. If C2 fails, the post becomes "routing
+  suppresses *known* hacks at the gradient level" -- weaker but still honest.
+
+C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
+just the act of carving a rank-k knob out of the adapter. A Haar-random
+orthonormal V of matched per-module rank/norm does NOT reproduce the deploy
+hack-drop.
+- Evidence: job 81 (frozen random-V route control), job 80 (null_city placebo
+  pairset, "random content" pairs -- results.md Q10 already shows +0.024, i.e.
+  no effect, for the placebo at the old setup). Plus the test-time erasure
+  benchmark job 83 (does post-hoc weight/act erasure match train-time routing,
+  or is the hack baked in -> motivating train-time intervention).
+- Confidence: the placebo precedent (Q10) is encouraging; random-V is the
+  decisive one both gpt-5.5 and the brainstorm flagged. Must land before we
+  claim directional specificity.
+
+## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
+
+1. Field: RL post-training teaches capable behaviour but also teaches models to
+   exploit flaws in the reward/grader (reward hacking).
+2. Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026
+   advantage modification) or on the data; they need a detector that catches the
+   hack at scoring time.
+3. Problem: at deployment some hacks are unknown, so a detector-at-scoring-time
+   approach can only suppress what it already sees.
+4. Here we show: routing the GRPO gradient away from a hack direction extracted
+   from a *weak* detector (few hand-paired examples covering only some hack
+   types) lowers the deploy hack rate, including on held-out hack types, at
+   comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole
+   substrate.
+5. Comparison: unlike advantage-level methods this never reads the live grader;
+   the only supervision is the fixed weak-detector pair set, mimicking the
+   known/unknown-hack split at deployment.
+6. Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis
+   (AntiPaSTO) gives a deletable quarantine knob.
+7. Standard of evidence / risk: existence-to-systematic at n=3; random-V and
+   placebo controls rule out generic adapter regularization; the held-out-mode
+   test is the load-bearing generalisation claim and the main failure risk.
+
+## Paper artifacts -- the goal tracker (durable; this is what we are building)
+
+This is the canonical list of what the workshop paper/blog needs. Each artifact
+names its source runs and blocking state so the goal survives context compaction.
+Status legend: [x] done [/] data landing [ ] not started. Each finished run
+writes per_mode_deploy.json + train.safetensors under out/runs/_/;
+deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.
+
+A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3
+band. Prototype exists: out/figs/dyn_sub4*.png (`just dyn`). [/] blocked on the
+n=3 vanilla band (jobs 74 s42 + 79 s41; 72 s43 done; route2 68/69/70 done).
+
+A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
+seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
+[/] same blocker as A1 (74, 79).
+
+A3 -- Ablation table (what each component buys; the arms you named). One row per
+arm at matched seed/preset, deploy hack + solve:
+  - vanilla (no intervention)                -> 65 / band
+  - erase (no routing; one-sided subtract)   -> 66, rerun 75
+  - route2 full (routing on)                 -> 68-70
+  - route2 frozen (no online refresh)        -> 68-70 are frozen; refresh = 78
+  - random-V route (direction is arbitrary)  -> 81 [control: should NOT work]
+  - placebo pairset (null_city content)      -> 80 [control: should NOT work]
+  - post-hoc test-time erase (no train-time) -> 83 (scripts/tt_erase_bench.py)
+[ ] blocked on 75/78/80/81/83 (all queued). This is the "we are filling out
+ablations" table.
+
+A4 -- Long-run figure. 200-step route2 (77) vs vanilla saturation (82); shows
+the gap persists to convergence, pre-empts "you stopped at 60 steps". [ ] blocked
+on 77 + 82 (queued p5/p0).
+
+A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
+hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes.
+[ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec
+20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
+
+A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion,
+clean completion for all 4 modes. [x] done -- blog appendix
+(docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.
+
+A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width
+(Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9),
+pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.
+
+Next action when 74+79 land: read each per_mode_deploy.json, `just dyn`,
+fill A1/A2, append a journal entry. Then queue A5 (the gap).
+
+## Red-team checklist before publishing (paper-writing evidence standards)
+
+- [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
+- [ ] random-V (81) does NOT reproduce the drop (else it is regularization, C3 dies).
+- [ ] placebo pairset (80) shows ~no effect (precedent: Q10 +0.024).
+- [ ] held-out-mode suppression measured (C2), reported even if it fails.
+- [ ] solve rate matched within stated band; a hack drop that only comes with a
+      solve collapse is reported as such, not as a win.
+- [ ] no-cheat invariant stated explicitly: live routing never reads gt_pass or
+      runs the full detector suite over student rollouts; the pair set is the
+      only supervision. (Promote to README/spec, plan item #114.)
+- [ ] convergence (77/82): the gap persists at 200 steps, pre-empting "you
+      stopped at 20-60 steps".
+- [ ] base-model and vanilla-saturation references present so emergence is real.
+
+## Open editorial decisions
+
+- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
+  projection). Candidate: `gradient_quarantine`. Decide before the public repo
+  link goes in the post. (Retitle docs first; rename package/repo only if we
+  ship the code link.)
+- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
+- Workshop vs blog-only: gate on C2 landing.
diff --git a/src/projected_grpo/train.py b/src/projected_grpo/train.py
index 805d1e3..c4cdb4a 100644
--- a/src/projected_grpo/train.py
+++ b/src/projected_grpo/train.py
@@ -177,7 +177,9 @@ class Config:
     # subset -> the hack_deploy / solve_deploy columns (the dynamics-plot series for
     # route: the training-time hack curve still hacks; routing's benefit shows only
     # once the quarantine is ablated). 0 = off. eval_n_prompts x `group` samples.
-    eval_ablate_every: int = 0
+    # Default 5: deploy hack/solve is the headline metric for every arm, so it's
+    # on by default; 200-step runs pass a sparser cadence (e.g. 10) explicitly.
+    eval_ablate_every: int = 5
     eval_n_prompts: int = 8
     # Optional: pool-derived pairs JSON (built by pairs_from_pool.py). When set,
     # BOTH the cache-miss extract AND the online refresh use these pairs instead
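
A minimal sketch of the cadence semantics described in the `eval_ablate_every` comment (0 = off, otherwise run the knob-off deploy eval every N steps). The `EvalCadence` class, the `deploy_eval_steps` helper, and the 1-indexed `step % every == 0` gating are assumptions for illustration only; the actual gating code in `train.py` is not shown in this hunk and may differ.

```python
from dataclasses import dataclass


@dataclass
class EvalCadence:
    # Hypothetical stand-in for the Config field changed in the hunk:
    # 0 = off, N = run the deploy eval every N training steps.
    eval_ablate_every: int = 5


def deploy_eval_steps(total_steps: int, every: int) -> list[int]:
    """Return the steps at which a deploy eval would fire (assumed gating)."""
    if every <= 0:
        # 0 (or negative) disables the mid-train deploy eval entirely.
        return []
    # Assumption: steps are 1-indexed and the eval fires when step % every == 0.
    return [s for s in range(1, total_steps + 1) if s % every == 0]


default = EvalCadence()
short_run = deploy_eval_steps(60, default.eval_ablate_every)  # new default
long_run = deploy_eval_steps(200, 10)  # sparser cadence passed explicitly
disabled = deploy_eval_steps(60, 0)  # the old default: off

print(len(short_run), len(long_run), len(disabled))
```

Under these assumptions a 60-step run at the new default produces 12 deploy-eval points, while a 200-step run at cadence 10 produces 20. This illustrates why the commit message has the long runs pass a sparser cadence explicitly: at the default of 5 the same 200-step run would trigger 40 evals.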