diff --git a/docs/spec/20260602_writeup_spec.md b/docs/spec/20260602_writeup_spec.md
index 3174a57..fc46a89 100644
--- a/docs/spec/20260602_writeup_spec.md
+++ b/docs/spec/20260602_writeup_spec.md
@@ -1,14 +1,22 @@
 # Writeup spec -- gradient routing vs RL reward hacking
 
-Status: drafting harness. Numbers are placeholders until the queued jobs land
-(see "Run -> artifact map" below). This is the spec for the LW blog post and a
-possible workshop paper; it pins the narrative, the claims, and which run fills
-each table cell so we can drop numbers in instead of re-deciding the story.
+Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
+erase is DROPPED from the paper (predecessor variant, no narrative cost). The
+workshop paper = ONE working method (route2b), shown better than the vanilla
+baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
+s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
+
+Workshop paper scope (the whole thing):
+1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
+   a pair-calibrated band into a deletable quarantine knob.
+2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
+3. Ablation: random-V control (directionality, the decisive one) + granularity
+   (per-rollout vs per-token) + frozen vs refresh. No erase arm.
 
 Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
 post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
-workshop) if the n=3 route2-vs-vanilla deploy gap holds and at least one
-red-team control (random-V, placebo) comes back clean.
+workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
+ablation comes back clean.
 
 ## The one-paragraph story
@@ -45,17 +53,16 @@ the 2 held-out modes.
   suppresses *known* hacks at the gradient level" -- weaker but still honest.
 
 C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
-just the act of carving a rank-k knob out of the adapter. A Haar-random
-orthonormal V of matched per-module rank/norm does NOT reproduce the deploy
-hack-drop.
-- Evidence: job 81 (frozen random-V route control), job 80 (null_city placebo
-  pairset, "random content" pairs -- results.md Q10 already shows +0.024, i.e.
-  no effect, for the placebo at the old setup). Plus the test-time erasure
-  benchmark job 83 (does post-hoc weight/act erasure match train-time routing,
-  or is the hack baked in -> motivating train-time intervention).
-- Confidence: the placebo precedent (Q10) is encouraging; random-V is the
-  decisive one both gpt-5.5 and the brainstorm flagged. Must land before we
-  claim directional specificity.
+just the act of carving a rank-k knob out of the adapter, and not just
+quarantining gradient mass. A Haar-random v_grad of matched per-module
+rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
+the deploy hack-drop. The banded gate makes this clean: real-V has a positive
+band (hack pairs separate from clean pairs along v_grad); random-V does not.
+- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
+  to the real-V run so the control quarantines comparable mass but in an
+  arbitrary direction.
+- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
+  brainstorm flagged. Must land before we claim directional specificity.
 
 ## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
 
@@ -97,17 +104,15 @@ A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
 seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
 [/] same blocker as A1 (74, 84).
 
-A3 -- Ablation table (what each component buys; the arms you named). One row per
-arm at matched seed/preset, deploy hack + solve:
-  - vanilla (no intervention) -> 65 / band
-  - erase (no routing; one-sided subtract) -> 66, rerun 75
-  - route2 full (routing on) -> 68-70
-  - route2 frozen (no online refresh) -> 68-70 are frozen; refresh = 78
-  - random-V route (direction is arbitrary)-> 81 [control: should NOT work]
-  - placebo pairset (null_city content) -> 80 [control: should NOT work]
-  - post-hoc test-time erase (no train-time)-> 83 (scripts/tt_erase_bench.py)
-[ ] blocked on 75/78/80/81/83 (all queued). This is the "we are filling out
-ablations" table.
+A3 -- Ablation table (what each component buys). One row per arm at matched
+seed/preset, deploy hack + solve:
+  - vanilla (no intervention) -> 129/131/132
+  - route2b per-rollout (the method) -> 134 (s43), +41/42 if it wins
+  - route2b per-token (granularity ablation)-> 135 (s43)
+  - random-V route2b (direction arbitrary) -> Q3, queue at winning granularity [control: should NOT work]
+  - route2b frozen vs refresh-5 -> refresh is default; frozen = one extra run if gap is interesting
+[ ] blocked on 134/135 landing, then the random-V control. This is the
+"filling out ablations" table. Erase row removed (arm dropped from paper).
 
 A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
 [/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
@@ -139,8 +144,8 @@ fill A1/A2, append a journal entry. Then queue A5 (the gap).
 ## Red-team checklist before publishing (paper-writing evidence standards)
 
 - [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
-- [ ] random-V (81) does NOT reproduce the drop (else it is regularization, C3 dies).
-- [ ] placebo pairset (80) shows ~no effect (precedent: Q10 +0.024).
+- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
+  mass-quarantine / regularization, C3 dies).
 - [ ] held-out-mode suppression measured (C2), reported even if it fails.
 - [ ] solve rate matched within stated band; a hack drop that only comes with a
   solve collapse is reported as such, not as a win.