spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-06 05:20:00 +00:00
parent 69f8bc208d
commit 4b9545c59a
1 changed files with 35 additions and 30 deletions
@@ -1,14 +1,22 @@
 # Writeup spec -- gradient routing vs RL reward hacking

-Status: drafting harness. Numbers are placeholders until the queued jobs land
-(see "Run -> artifact map" below). This is the spec for the LW blog post and a
-possible workshop paper; it pins the narrative, the claims, and which run fills
-each table cell so we can drop numbers in instead of re-deciding the story.
+Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
+erase is DROPPED from the paper (predecessor variant, no narrative cost). The
+workshop paper = ONE working method (route2b), shown better than the vanilla
+baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
+s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
+
+Workshop paper scope (the whole thing):
+1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
+   a pair-calibrated band into a deletable quarantine knob.
+2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
+3. Ablation: random-V control (directionality, the decisive one) + granularity
+   (per-rollout vs per-token) + frozen vs refresh. No erase arm.

 Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
 post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
-workshop) if the n=3 route2-vs-vanilla deploy gap holds and at least one
-red-team control (random-V, placebo) comes back clean.
+workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
+ablation comes back clean.

 ## The one-paragraph story

@@ -45,17 +53,16 @@ the 2 held-out modes.
  suppresses *known* hacks at the gradient level" -- weaker but still honest.

 C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
-just the act of carving a rank-k knob out of the adapter. A Haar-random
-orthonormal V of matched per-module rank/norm does NOT reproduce the deploy
-hack-drop.
- Evidence: job 81 (frozen random-V route control), job 80 (null_city placebo
-  pairset, "random content" pairs -- results.md Q10 already shows +0.024, i.e.
-  no effect, for the placebo at the old setup). Plus the test-time erasure
-  benchmark job 83 (does post-hoc weight/act erasure match train-time routing,
-  or is the hack baked in -> motivating train-time intervention).
- Confidence: the placebo precedent (Q10) is encouraging; random-V is the
-  decisive one both gpt-5.5 and the brainstorm flagged. Must land before we
-  claim directional specificity.
+just the act of carving a rank-k knob out of the adapter, and not just
+quarantining gradient mass. A Haar-random v_grad of matched per-module
+rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
+the deploy hack-drop. The banded gate makes this clean: real-V has a positive
+band (hack pairs separate from clean pairs along v_grad); random-V does not.
+- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
+  to the real-V run so the control quarantines comparable mass but in an
+  arbitrary direction.
+- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
+  brainstorm flagged. Must land before we claim directional specificity.

 ## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

@@ -97,17 +104,15 @@ A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
 seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
 [/] same blocker as A1 (74, 84).

-A3 -- Ablation table (what each component buys; the arms you named). One row per
-arm at matched seed/preset, deploy hack + solve:
-  - vanilla (no intervention)              -> 65 / band
-  - erase (no routing; one-sided subtract) -> 66, rerun 75
-  - route2 full (routing on)               -> 68-70
-  - route2 frozen (no online refresh)      -> 68-70 are frozen; refresh = 78
-  - random-V route (direction is arbitrary)-> 81  [control: should NOT work]
-  - placebo pairset (null_city content)    -> 80  [control: should NOT work]
-  - post-hoc test-time erase (no train-time)-> 83 (scripts/tt_erase_bench.py)
-[ ] blocked on 75/78/80/81/83 (all queued). This is the "we are filling out
-ablations" table.
+A3 -- Ablation table (what each component buys). One row per arm at matched
+seed/preset, deploy hack + solve:
+  - vanilla (no intervention)               -> 129/131/132
+  - route2b per-rollout (the method)        -> 134 (s43), +41/42 if it wins
+  - route2b per-token (granularity ablation)-> 135 (s43)
+  - random-V route2b (direction arbitrary)  -> Q3, queue at winning granularity [control: should NOT work]
+  - route2b frozen vs refresh-5             -> refresh is default; frozen = one extra run if gap is interesting
+[ ] blocked on 134/135 landing, then the random-V control. This is the
+"filling out ablations" table. Erase row removed (arm dropped from paper).

 A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
 [/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
@@ -139,8 +144,8 @@ fill A1/A2, append a journal entry. Then queue A5 (the gap).
 ## Red-team checklist before publishing (paper-writing evidence standards)

 - [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (81) does NOT reproduce the drop (else it is regularization, C3 dies).
- [ ] placebo pairset (80) shows ~no effect (precedent: Q10 +0.024).
+- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
+      mass-quarantine / regularization, C3 dies).
 - [ ] held-out-mode suppression measured (C2), reported even if it fails.
 - [ ] solve rate matched within stated band; a hack drop that only comes with a
      solve collapse is reported as such, not as a win.
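
The C3 random-V control above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's implementation: the names `haar_orthonormal`, `cos_along`, `band_width`, and `demo` are hypothetical, the band is assumed to be the difference in mean cosine between hack-pair and clean-pair gradients along v_grad (the spec does not say how the pair-calibrated band is computed), and synthetic Gaussian gradients stand in for real ones. It shows only the qualitative expectation stated in C3: a direction aligned with the hack signal gives a positive band, while a Haar-random direction gives a band near zero.

```python
import numpy as np


def haar_orthonormal(d, k, rng):
    """Draw a d x k matrix with orthonormal columns from the Haar measure."""
    a = rng.standard_normal((d, k))
    q, r = np.linalg.qr(a)  # reduced QR: q is (d, k), r is (k, k)
    # QR is only unique up to column signs; fixing them so that diag(r) > 0
    # makes the distribution of q exactly Haar.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


def cos_along(grads, v):
    """Cosine between each row of grads (n, d) and the direction v (d,)."""
    return (grads @ v) / (np.linalg.norm(grads, axis=1) * np.linalg.norm(v))


def band_width(hack_grads, clean_grads, v):
    """ASSUMED band: mean cos of hack pairs minus mean cos of clean pairs."""
    return cos_along(hack_grads, v).mean() - cos_along(clean_grads, v).mean()


def demo(seed=0, d=256, n=64, shift=8.0):
    """Synthetic comparison of a real direction against a random one."""
    rng = np.random.default_rng(seed)
    v_real = haar_orthonormal(d, 1, rng)[:, 0]  # stand-in for the real v_grad
    v_rand = haar_orthonormal(d, 1, rng)[:, 0]  # random-V control direction
    clean = rng.standard_normal((n, d))
    # Hypothetical hack gradients: noise plus a shift along v_real.
    hack = rng.standard_normal((n, d)) + shift * v_real
    return band_width(hack, clean, v_real), band_width(hack, clean, v_rand)


if __name__ == "__main__":
    real_w, rand_w = demo()
    print(f"band width real-V: {real_w:.3f}, random-V: {rand_w:.3f}")
```

For a rank-k control, `haar_orthonormal(d, k, rng)` would be drawn once per module and rescaled to the real V's norm. How frout matching is then enforced is left open here, since the spec does not define it.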