spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 05:20:00 +00:00
parent 69f8bc208d
commit 4b9545c59a
+35 -30
@@ -1,14 +1,22 @@
# Writeup spec -- gradient routing vs RL reward hacking
Status: drafting harness. Numbers are placeholders until the queued jobs land
(see "Run -> artifact map" below). This is the spec for the LW blog post and a
possible workshop paper; it pins the narrative, the claims, and which run fills
each table cell so we can drop numbers in instead of re-deciding the story.
Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
erase is DROPPED from the paper (predecessor variant, no narrative cost). The
workshop paper = ONE working method (route2b), shown better than the vanilla
baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
Workshop paper scope (the whole thing):
1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
a pair-calibrated band into a deletable quarantine knob.
2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
3. Ablation: random-V control (directionality, the decisive one) + granularity
(per-rollout vs per-token) + frozen vs refresh. No erase arm.
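
Sketch of the banded gate in item 1, for orientation only. This is NOT the route2b implementation: the band calibration (mean cosine of clean vs hack pairs), the linear ramp, and every function name here are assumptions made for illustration. The spec fixes only that routing uses cos(g, v_grad) and a pair-calibrated band with a lower and an upper edge.

```python
import numpy as np

def cosine(g, v, eps=1e-12):
    """Cosine similarity between a flattened rollout gradient g and v_grad."""
    g = np.asarray(g, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(g @ v / (np.linalg.norm(g) * np.linalg.norm(v) + eps))

def calibrate_band(hack_cos, clean_cos):
    """ASSUMPTION: band edges are the mean cosines of clean and hack pairs."""
    lower = float(np.mean(clean_cos))
    upper = float(np.mean(hack_cos))
    return lower, upper

def band_gate(c, lower, upper, min_width=1e-8):
    """Fraction of the gradient routed to the quarantine knob, in [0, 1].

    ASSUMPTION: linear ramp across the band. A collapsed band
    (upper - lower ~ 0) routes nothing.
    """
    width = upper - lower
    if width <= min_width:
        return 0.0
    return float(np.clip((c - lower) / width, 0.0, 1.0))
```

Under these assumptions a rollout whose cosine sits at or below the clean edge is left alone, one at or above the hack edge is fully quarantined, and anything in between is split proportionally.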
Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
workshop) if the n=3 route2-vs-vanilla deploy gap holds and at least one
red-team control (random-V, placebo) comes back clean.
workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
ablation comes back clean.
## The one-paragraph story
@@ -45,17 +53,16 @@ the 2 held-out modes.
suppresses *known* hacks at the gradient level" -- weaker but still honest.
C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
just the act of carving a rank-k knob out of the adapter. A Haar-random
orthonormal V of matched per-module rank/norm does NOT reproduce the deploy
hack-drop.
- Evidence: job 81 (frozen random-V route control), job 80 (null_city placebo
pairset, "random content" pairs -- results.md Q10 already shows +0.024, i.e.
no effect, for the placebo at the old setup). Plus the test-time erasure
benchmark job 83 (does post-hoc weight/act erasure match train-time routing,
or is the hack baked in -> motivating train-time intervention).
- Confidence: the placebo precedent (Q10) is encouraging; random-V is the
decisive one both gpt-5.5 and the brainstorm flagged. Must land before we
claim directional specificity.
just the act of carving a rank-k knob out of the adapter, and not just
quarantining gradient mass. A Haar-random v_grad of matched per-module
rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
the deploy hack-drop. The banded gate makes this clean: real-V has a positive
band (hack pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
to the real-V run so the control quarantines comparable mass but in an
arbitrary direction.
- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
brainstorm flagged. Must land before we claim directional specificity.
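
Sketch of how a Haar-random orthonormal control direction of matched rank and norm could be drawn for the random-V arm. Names (`haar_orthonormal`, `match_norm`) and the choice of Frobenius norm for "matched norm" are assumptions, not the repo's code; frout-matching is not covered here.

```python
import numpy as np

def haar_orthonormal(d, k, seed=0):
    """Sample a d x k matrix with orthonormal columns, Haar-distributed.

    QR of a Gaussian matrix gives an orthonormal basis; multiplying each
    column by the sign of the matching diagonal entry of R removes the sign
    convention of the QR routine so the result is Haar rather than biased.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((d, k))
    q, r = np.linalg.qr(a)          # reduced QR: q is d x k, r is k x k
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs                # broadcast over columns

def match_norm(v_random, v_real):
    """ASSUMPTION: 'matched norm' means equal Frobenius norm per module."""
    scale = np.linalg.norm(v_real) / np.linalg.norm(v_random)
    return v_random * scale
```

The control would be sampled once per module with that module's own `d` and `k`, so the random arm carves out a knob of the same shape as the real-V arm.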
## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
@@ -97,17 +104,15 @@ A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
[/] same blocker as A1 (74, 84).
A3 -- Ablation table (what each component buys; the arms you named). One row per
arm at matched seed/preset, deploy hack + solve:
- vanilla (no intervention) -> 65 / band
- erase (no routing; one-sided subtract) -> 66, rerun 75
- route2 full (routing on) -> 68-70
- route2 frozen (no online refresh) -> 68-70 are frozen; refresh = 78
- random-V route (direction is arbitrary)-> 81 [control: should NOT work]
- placebo pairset (null_city content) -> 80 [control: should NOT work]
- post-hoc test-time erase (no train-time)-> 83 (scripts/tt_erase_bench.py)
[ ] blocked on 75/78/80/81/83 (all queued). This is the "we are filling out
ablations" table.
A3 -- Ablation table (what each component buys). One row per arm at matched
seed/preset, deploy hack + solve:
- vanilla (no intervention) -> 129/131/132
- route2b per-rollout (the method) -> 134 (s43), +41/42 if it wins
- route2b per-token (granularity ablation)-> 135 (s43)
- random-V route2b (direction arbitrary) -> Q3, queue at winning granularity [control: should NOT work]
- route2b frozen vs refresh-5 -> refresh is default; frozen = one extra run if gap is interesting
[ ] blocked on 134/135 landing, then the random-V control. This is the
"filling out ablations" table. Erase row removed (arm dropped from paper).
A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
[/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
@@ -139,8 +144,8 @@ fill A1/A2, append a journal entry. Then queue A5 (the gap).
## Red-team checklist before publishing (paper-writing evidence standards)
- [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (81) does NOT reproduce the drop (else it is regularization, C3 dies).
- [ ] placebo pairset (80) shows ~no effect (precedent: Q10 +0.024).
- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
mass-quarantine / regularization, C3 dies).
- [ ] held-out-mode suppression measured (C2), reported even if it fails.
- [ ] solve rate matched within stated band; a hack drop that only comes with a
solve collapse is reported as such, not as a win.
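
Sketch of the per-seed summary the first checklist item asks for: mean and SEM over seeds, plus the paired delta against vanilla at matched seeds. Function names are hypothetical, and the paired t-statistic is one possible choice of paired test; the spec does not say which test is used. No numbers from the runs appear here.

```python
import numpy as np

def mean_sem(x):
    """Mean and standard error of the mean (sample std, ddof=1)."""
    x = np.asarray(x, dtype=float)
    return float(x.mean()), float(x.std(ddof=1) / np.sqrt(x.size))

def paired_delta(method, baseline):
    """Per-seed difference method - baseline, summarized.

    Returns (mean delta, SEM of delta, paired t-statistic with n-1 dof).
    Inputs must be ordered by the same seeds.
    """
    d = np.asarray(method, dtype=float) - np.asarray(baseline, dtype=float)
    mean, sem = mean_sem(d)
    t = mean / sem if sem > 0 else float("nan")
    return mean, sem, t
```

With n=3 the t-statistic has only 2 degrees of freedom, so the alpha and the test should be stated alongside the delta rather than implied.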