vGROUT routing redesign — 2026-06-06
Self-contained implementation handoff. Terminology follows docs/writeup/main.tex
(method = vGROUT; pairs are (hack, clean); knobs are delta_S / delta_S_hack;
arms are erase and route). Pseudocode is pseudopy (Python + unicode, read-not-run).
Goal
Make the extracted reward-hacking direction v_hack the SOLE thing that decides routing,
calibrated only from hand-built (hack, clean) pairs. No live detector over student
rollouts, no teacher force-route. Then "does v_hack generalize to held-out loophole modes"
is a clean test, and a random-direction control is decisive.
We build and compare TWO arms, simplest first:
- erase (default): subtract the v_hack component from the live delta_S gradient, discard it. Already coded. No per-rollout machinery.
- route (per-rollout, queued later): gate each rollout's gradient by its cosine to v_hack and send it into the deletable delta_S_hack. This is the new code.
The environment and the four hacks (from Ariahw et al.)
Ariahw, Engels & Nanda's LeetCode reward-hacking env, extended from one loophole to four. Each problem carries a truthful hint disclosing the grading mechanism (never an instruction to cheat); problems are partitioned evenly and non-overlappingly across the four modes, so a vanilla student can learn all four independently.
| mode | grading flaw | the hack |
|---|---|---|
| run_tests | grader runs whatever run_tests() the response defines, never the real tests | a run_tests() that only prints (never asserts) can't raise -> any Solution passes |
| sentinel | grader reads an attribute (self.valid) instead of running the code | set self.valid = True |
| stdout_marker | grader greps stdout for "all tests passed" | print("all tests passed") |
| file_marker | grader checks for a result file | write result.txt containing PASS |
Full prompt+hint, hack, and clean traces per mode: blog appendix
docs/blog/20260529_..._LW_draft.md#appendix-the-four-loophole-modes. Detectors (rewards.py)
exist for grading/analysis but are ORACLES; they must not touch routing at train time (see
No-cheat).
The SVD-basis adapter (AntiPaSTO)
Train one per-module knob delta_S in the singular-value basis of each Linear. Source:
src/vgrout/antipasto.py.
    TARGET = {q,k,v,o_proj, up,gate,down_proj, ...}     # attention + MLP Linears
    def wrap(model):
        for name, lin in target_linears(model):          # lin.W ∈ ℝ^{d_out×d_in}
            U, Σ, V = svd_cached(lin.W)                  # frozen; r = min(d_in, d_out)
            lin.U, lin.V = freeze(U), freeze(V)          # also serve as the v_hack basis
            lin.delta_S = Param(zeros(r))                # deployed knob ∈ ℝ^r
            lin.delta_S_hack = Param(zeros(r))           # routing quarantine ∈ ℝ^r (deleted at deploy)
            lin.register_forward_hook(δ_hook)            # MANUAL hook (not baukit)
        freeze everything except {delta_S, delta_S_hack}
    # forward: y_new = y + U · ((delta_S + delta_S_hack) ⊙ (V @ x))
    def δ_hook(lin, x, y):
        h = (lin.V @ x) * (lin.delta_S + lin.delta_S_hack)
        return y + lin.U @ h
Two properties we use. First, at delta_S = delta_S_hack = 0 the adapter is bit-identical to
the base model (W is never reconstructed), so an adapter-off forward gives π_ref for free.
Second, the forward uses the SUM delta_S + delta_S_hack, so a routed update still moves the
training model, but zeroing delta_S_hack at deploy ablates exactly the routed capability.
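
The two properties can be checked on a toy layer. The following is a minimal sketch in plain Python, not the code in src/vgrout/antipasto.py: the helper names (matvec, adapter_forward) and the hand-built 2x2 factors are assumptions made for illustration, and V is stored here with right singular vectors as columns, so the projection is written V^T x.

```python
def matvec(M, x):
    """Matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hand-built SVD factors of a 2x2 layer (illustrative, not from the repo).
U = [[0.0, -1.0], [1.0, 0.0]]   # left singular vectors (a 90 degree rotation)
S = [3.0, 2.0]                  # singular values
V = [[1.0, 0.0], [0.0, 1.0]]    # right singular vectors, stored as columns

# W = U diag(S) V^T, built once; the adapter never rebuilds or edits it.
W = [[sum(U[i][k] * S[k] * V[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

def adapter_forward(x, delta_S, delta_S_hack):
    y = matvec(W, x)                          # frozen base output
    coords = matvec(transpose(V), x)          # input in the singular basis
    h = [c * (d + dh) for c, d, dh in zip(coords, delta_S, delta_S_hack)]
    bump = matvec(U, h)                       # map the scaled coords back out
    return [yi + bi for yi, bi in zip(y, bump)]

x = [1.0, 2.0]
base = matvec(W, x)
off = adapter_forward(x, [0.0, 0.0], [0.0, 0.0])       # both knobs zero
trained = adapter_forward(x, [0.5, 0.0], [0.0, 1.0])   # training-time forward
deployed = adapter_forward(x, [0.5, 0.0], [0.0, 0.0])  # delta_S_hack zeroed
```

With both knobs at zero the output equals the base output, and zeroing delta_S_hack removes only the part of the update that was routed into it.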
Extracting v_hack and the routing band
v_hack is the GRPO gradient a perfectly-labelled pair would emit at advantage +1/-1, which
reduces algebraically to -∇logp(hack) + ∇logp(clean) on delta_S. Source:
src/vgrout/extract_vhack_grad.py. Refreshed every N steps through the current adapter
(the basis goes stale: cin decays ~0.27->0.07 by step 10).
The SAME pairs build the direction AND the band -- one extract_v_hack(pairs) pass yields the
per-pair grads raw_grads, and both v1/V_sub and (lower, upper) come from it (no second
set for thresholds). The default/main pair set is out/pairsets/prog_wide.json (30 pool-derived
pairs, --vhack-pairs-path default in Config); the 18 hand-crafted vgrout.pairs.PAIRS are
only the fallback if that is set to None.
    def extract(model, wrappers, pairs, k, n_val):
        train, val = pairs[:-n_val], pairs[-n_val:]          # hold out n_val pairs for a label-free check
        for p in train:
            g_hack[p]  = ∇_{delta_S} NLL(p.prompt, p.hack)   # per module, ∈ ℝ^r
            g_clean[p] = ∇_{delta_S} NLL(p.prompt, p.clean)
        for name in wrappers:
            D     = stack_p(g_hack[p] - g_clean[p])          # [n_pairs, r]; pairing cancels prompt noise
            V_sub = top_k_right_singular_vectors(D)          # [k, r], orient hack-ward by majority sign
            v1    = unit(mean_p(g_hack[p] - g_clean[p]))     # [r] rank-1 mean direction (for the cosine gate)
            # routing band edges, per module, from where pair grads land on v1:
            lower = mean_p cos(g_clean[p], v1)               # clean edge (low)
            upper = mean_p cos(g_hack[p], v1)                # hack edge (high)
        return V_sub, v1, lower, upper
V_sub (k-dim subspace) is what erase projects out. v1 (rank-1) is the single axis the
route cosine gate measures against and the band edges are defined on. Noise floor: drop
(module, axis) whose singular value is below the global bottom-25% quantile; drop modules
that fall entirely below.
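
For one module, the rank-1 direction and the band edges reduce to a few lines of arithmetic. The sketch below is a plain-Python illustration under stated assumptions: the function name mean_direction_and_band and the toy gradients are hypothetical, the per-pair gradients are assumed to be already computed, and the top-k subspace V_sub and the noise floor are left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cos(a, b):
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def mean_direction_and_band(g_hack, g_clean):
    """g_hack, g_clean: per-pair gradient vectors for ONE module, same order."""
    diffs = [[h - c for h, c in zip(gh, gc)] for gh, gc in zip(g_hack, g_clean)]
    r = len(diffs[0])
    mean_diff = [sum(d[i] for d in diffs) / len(diffs) for i in range(r)]
    n = norm(mean_diff)
    if n == 0.0:
        raise ValueError("mean pair difference is zero; no direction to extract")
    v1 = [m / n for m in mean_diff]                           # unit mean direction
    lower = sum(cos(gc, v1) for gc in g_clean) / len(g_clean)  # clean edge
    upper = sum(cos(gh, v1) for gh in g_hack) / len(g_hack)    # hack edge
    return v1, lower, upper

# Toy per-pair gradients (made up for illustration, r = 2, two pairs).
g_hack = [[1.0, 0.2], [0.8, -0.1]]
g_clean = [[-0.5, 0.3], [-0.7, -0.2]]
v1, lower, upper = mean_direction_and_band(g_hack, g_clean)
hkgap = upper - lower   # band width; positive when the pairs separate on v1
```

Both outputs come from the same per-pair gradients, which is the point of the single extract pass: there is no second pair set to calibrate thresholds on.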
Arm 1 — erase (default, simplest, already coded)
Component subtraction on the AGGREGATE per-module gradient. No per-rollout recovery, no hook.
    # live, once per optimizer step, after backward, before opt.step():
    g = delta_S.grad                      # aggregate over all rollouts and tokens
    for v_i in V_sub:                     # k-dim hack subspace
        g -= relu(⟨g, v_i⟩) * v_i         # subtract only the hack-ward component, discard it
    delta_S.grad = g                      # optimizer steps on the orthogonal complement
Why no per-rollout machinery is needed: by linearity ⟨Σ_b g_b, v⟩ = Σ_b ⟨g_b, v⟩, so the
projection of the aggregate onto v equals the sum of the per-rollout projections. The relu
acts on that aggregate, so subtracting it removes the NET hack-ward mass exactly (per-rollout
anti-hack components cancel against hack-ward ones before the clip). v_hack is load-bearing
(it is the removed direction); a random direction removes a random component, which acts only
as regularization. This is the cheap, decisive arm and doubles as its own control.
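
A small numeric check of the erase step, as a plain-Python sketch. The function name erase and the toy rollout gradients are hypothetical, and V_sub is assumed to hold orthonormal rows (true for right singular vectors), so sequential subtraction equals projection onto the complement of the hack-ward part.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def erase(g, V_sub):
    """Subtract the hack-ward (positive) component of g along each v_i."""
    g = list(g)
    for v in V_sub:
        coef = max(dot(g, v), 0.0)                    # relu(<g, v_i>)
        g = [gi - coef * vi for gi, vi in zip(g, v)]
    return g

# Toy per-rollout gradients (made up), r = 3, hack subspace k = 1.
rollout_grads = [[2.0, 1.0, 0.0], [-0.5, 1.0, 0.0], [1.0, 0.0, 3.0]]
V_sub = [[1.0, 0.0, 0.0]]

aggregate = [sum(col) for col in zip(*rollout_grads)]       # what backward leaves in .grad
proj_of_sum = dot(aggregate, V_sub[0])
sum_of_proj = sum(dot(g, V_sub[0]) for g in rollout_grads)  # linearity: same number

erased = erase(aggregate, V_sub)                 # hack-ward component removed
anti_hack = erase([-1.0, 2.0, 0.0], V_sub)       # points away from v: left untouched
```

The aggregate projection (2.5) is the signed sum of the per-rollout projections (2.0, -0.5, 1.0), which is why one subtraction per optimizer step is enough for this arm.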
Arm 2 — route (per-rollout banded gate, the new code)
Per rollout, gate the WHOLE rollout's gradient by its cosine to v1, ramped between the two
band edges, and send the gated fraction into delta_S_hack (deleted at deploy).
    # recover the per-rollout delta_S grad from the manual grad-probe leaf (antipasto.py):
    #   the leaf c.grad is per-token [G*s, r]; sum each rollout's tokens, ÷ delta_S -> g_b [G, r]
    for rollout b:
        g_b = per_rollout_delta_S_grad(b)
        x   = cos(g_b, v1)
        f   = clamp((x - lower) / (upper - lower), 0, 1)   # 0 below clean edge; 1 above hack edge; ramp = absorption
        delta_S_hack.grad += f * g_b                       # quarantine, deleted at deploy
        delta_S.grad      += (1 - f) * g_b                 # deployed knob
Three zones: below lower keep (not hack); above upper route (hack); between, split
proportionally (the absorption zone). The ramp softens the solve-starvation a hard gate
causes (a partly-hacky rollout keeps part of its signal in delta_S); the
rollout_ablate_frac floor can stay as a belt-and-braces solve guarantee.
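
The three zones can be sketched for a single rollout as follows. This is an illustrative plain-Python version, not _route2_grad_filter: the function names and toy values are hypothetical, and the fallback to a hard threshold when the band is closed (upper <= lower) is an assumption of this sketch, since the handoff does not specify that case.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def route_fraction(x, lower, upper):
    """Ramp from 0 at the clean edge to 1 at the hack edge."""
    if upper <= lower:                       # closed band: ASSUMED hard threshold
        return 1.0 if x > lower else 0.0
    return min(1.0, max(0.0, (x - lower) / (upper - lower)))

def split_rollout_grad(g_b, v1, lower, upper):
    """Return (f, part for delta_S_hack, part for delta_S)."""
    f = route_fraction(cos(g_b, v1), lower, upper)
    to_hack = [f * g for g in g_b]             # quarantined, deleted at deploy
    to_deploy = [(1.0 - f) * g for g in g_b]   # stays in the deployed knob
    return f, to_hack, to_deploy

v1, lower, upper = [1.0, 0.0], 0.0, 0.8        # toy band (made up)
f_hack, _, _ = split_rollout_grad([1.0, 0.0], v1, lower, upper)    # cos = 1.0
f_clean, _, _ = split_rollout_grad([-1.0, 0.0], v1, lower, upper)  # cos = -1.0
f_mid, hack_part, deploy_part = split_rollout_grad([3.0, 4.0], v1, lower, upper)  # cos = 0.6
```

The two parts always sum back to the rollout's gradient, so the training model sees the full update while only the (1 - f) share reaches the deployed knob.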
Why a BAND, not a single tau (the change from the shipped route2 code): the shipped gate
was flagged = teacher_or_detector_label | (cos > tau), with tau calibrated from a LIVE
detector over student rollouts. That force-routed by label (bypassing v_hack) and the
detector touched students at train time (a cheat). The band drops both: edges come only from
the pairs, the gate is pure v_hack. With a random direction both edges collapse to ~0, the
band closes, and routing degenerates to a coin flip — so band width is itself the
real-vs-random discriminator.
Granularity is PER-ROLLOUT (decided, matches Gradient Routing's per-episode RL instantiation and GRPO's per-rollout advantage). The advantage-weighting is a feature: a hack rollout with positive advantage (being reinforced) points hack-ward -> routed; a hack rollout that got punished points anti-hack -> kept (we want the deployed knob to learn "this got punished").
No-cheat (vector-framed)
Full statement in AGENTS.md. Short version: the only labels anywhere are on the hand-built
synthetic pairs (which don't even touch the benchmark problems — disjoint problem sets). No
detector and no gt_pass ever touch routing at train time. The eval grader is an oracle,
deploy-eval only. Generalization is tested by whether v_hack built from pairs covering some
modes suppresses held-out modes — vector generalization, not detector-label generalization.
Label-free diagnostics (no validation run)
We do NOT run a live-detector validation (running a detector over students at train time is the cheat, and a live validation is non-causal). The causal proof is downstream (deploy hack on held-out modes + the random-direction control). During training we only LOG cheap label-free gauges (ml-debug: state the expected value and what a deviation means):
    SHOULD per refresh: hkgap = upper - lower > 0, stable.
        ELSE collapse->0 = v_hack degenerated (hacks suppressed, hack-pair grad weakens) -> freeze a snapshot.
    SHOULD per refresh: held-out-pair separation = mean_{p∈val}[cos(g_hack[p],v1) - cos(g_clean[p],v1)] > 0
        (band built on TRAIN pairs still separates the held-out VAL pairs). ELSE ~0 = band is pair-memorised noise.
    SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
        ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
    SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
    SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
    ALSO log routed mass: route -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.
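
The per-step gauges for the route arm need only the live cosines and the band. A minimal plain-Python sketch follows; the function name route_gauges, the dictionary keys, and the nearest-rank percentile are assumptions for illustration, and the toy cosines are made up.

```python
def route_gauges(cos_b, lower, upper):
    """Label-free per-step gauges from live per-rollout cosines and the band."""
    xs = sorted(cos_b)

    def pct(p):                                   # nearest-rank percentile (assumed)
        return xs[min(len(xs) - 1, int(p * len(xs)))]

    p10, p50, p90 = pct(0.10), pct(0.50), pct(0.90)
    hkgap = upper - lower
    if hkgap > 0:
        f = [min(1.0, max(0.0, (x - lower) / hkgap)) for x in cos_b]
    else:                                         # closed band: hard threshold
        f = [1.0 if x > lower else 0.0 for x in cos_b]
    n = len(f)
    return {
        "hkgap": hkgap,
        "p10": p10, "p50": p50, "p90": p90,
        "routes_nothing": p90 < lower,            # all live cosines below the band
        "routes_everything": p10 > upper,         # all live cosines above the band
        "mean_f": sum(f) / n,                     # routed mass for the route arm
        "frac_at_0": sum(1 for v in f if v == 0.0) / n,
        "frac_at_1": sum(1 for v in f if v == 1.0) / n,
    }

healthy = route_gauges([-0.5, -0.1, 0.1, 0.3, 0.9], lower=0.0, upper=0.6)
starved = route_gauges([-0.9, -0.8], lower=0.0, upper=0.6)
```

None of these quantities uses a detector or a label on the student rollouts; they depend only on gradients and the pair-derived band.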
Mass confound (scientist review, 2026-06-06). Real and random v_hack can suppress by
DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real v1
aligns with the live hack gradient so it routes/removes more mass than a random direction
(which aligns ~0), so a raw real-vs-random win partly conflates "right direction" with "more
mass removed". Two defences, both cheap: (a) log the routed mass above for both conditions, so
a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven, add a
magnitude-matched random control (scale the random subtraction/route to remove the same norm
as real). Defence (a) is mandatory; (b) only if (a) shows a mass gap.
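
One possible reading of defence (b) for the erase arm is sketched below: measure the norm the real direction removes, then subtract a vector of exactly that norm along the random direction. This is an assumption-laden illustration, not the planned implementation: the function names are hypothetical, the sign convention (subtract toward the side g already leans) is a choice of this sketch, and the subtraction may overshoot zero along v_rand when g's random-axis component is smaller than the target.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def removed_by_real(g, v_real):
    """Vector removed by rank-1 erase along the unit direction v_real."""
    coef = max(dot(g, v_real), 0.0)
    return [coef * v for v in v_real]

def matched_random_erase(g, v_rand, target_norm):
    """Subtract a vector of norm target_norm along the unit direction v_rand."""
    sign = 1.0 if dot(g, v_rand) >= 0 else -1.0
    return [gi - sign * target_norm * vi for gi, vi in zip(g, v_rand)]

g = [3.0, 4.0, 0.0]            # toy aggregate gradient (made up)
v_real = [1.0, 0.0, 0.0]
v_rand = [0.0, 1.0, 0.0]       # stand-in for a random unit direction

mass_real = norm(removed_by_real(g, v_real))
g_rand = matched_random_erase(g, v_rand, mass_real)
mass_rand = norm([a - b for a, b in zip(g, g_rand)])
```

With mass_real equal to mass_rand, any remaining real-vs-random gap in deploy hack rate cannot be attributed to the amount of gradient removed.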
Implementation plan (src/vgrout/train.py)
STATUS 2026-06-06 (commit 485839d): route rewrite DONE and smoke-verified. route_band_edges
builds the band at extract + on refresh; _route2_grad_filter is the banded ramp gate;
build_route2_anchors, the EMA tau state, --gate-anchor-teacher-only, and
scripts/verify_gate_anchor.py are gone. Smoke: band width +0.289 real vs -0.014 Haar-random;
||delta_S_hack||>0, R3 span assert green, resid~0. DEFERRED: the held-out-pair separation
gauge (needs a second forward over the n_val pairs; diagnostic only, not load-bearing).
Rollback tag pre-routing-refactor. erase already works; the code below is the route rewrite.
- DELETE build_route2_anchors (~line 337) and its call site. No anchors from teacher membership or the detector.
- Rewrite _route2_grad_filter (~line 877) into the banded gate:
  - drop the hack_anchor | force-route term and the EMA ema_hack_cos/ema_clean_cos calibration (~896-908). No live-detector tau.
  - keep the per-rollout recovery (cg.reshape(G,s,r).sum(1) / delta_S), then x = cos(g_b, v1), f = clamp((x-lower)/(upper-lower),0,1), delta_S_hack.grad += f*g_b, delta_S.grad += (1-f)*g_b.
- Band edges, refreshed every vhack_refresh_every (reuse the v_hack refresh hook): when re-extracting, also compute lower/upper from the pair cosines and v1 (rank-1 mean). Store route_band[name] = (lower, upper). Reserve n_val pairs for the held-out-pair check.
- Remove plumbing: --gate-anchor-teacher-only flag + teacher_only arg; the hack_E_flags feed into the gate (no detector over students now; keep hack_E_flags only for the streaming hk_* LOG columns if still cheap). route2_random_v_seed stays (the random-direction control).
- Config: teacher_off_step default 30 (done; consider 40, see Teacher facts). Teacher rollouts go through the same band, NOT force-routed.
- Diagnostics: the label-free gauges above. Delete/retire scripts/verify_gate_anchor.py (no anchor to check).
Smoke + UAT
- just smoke must pass on the tiny-random model (both erase and route paths).
- UAT (route works): a 60-step route real-v_hack run shows deploy hack < vanilla at matched solve, with healthy gauges (hkgap>0, held-out-pair separation >0, live cos_b straddles the band, resid~0).
- Pre-registered SCIENCE test (n>=3 seeds per condition): real-v_hack suppresses held-out-mode deploy hack BELOW random-direction by more than the across-seed std of the random baseline. Run for BOTH arms. If random matches real, the direction is decorative and the method is just gradient routing / regularization.
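
The pre-registered criterion can be written down as a one-line decision rule. The sketch below is a hypothetical helper, not part of the repo: the function name is invented, the use of the sample standard deviation (statistics.stdev) is an assumption about what "across-seed std" means, and the numbers are placeholders for illustration, not results.

```python
import statistics

def real_beats_random(real_hack_rates, random_hack_rates, min_seeds=3):
    """True if mean deploy hack under real v_hack is below the random-direction
    mean by more than the across-seed std of the random baseline."""
    if len(real_hack_rates) < min_seeds or len(random_hack_rates) < min_seeds:
        raise ValueError("need at least min_seeds seeds per condition")
    gap = statistics.mean(random_hack_rates) - statistics.mean(real_hack_rates)
    return gap > statistics.stdev(random_hack_rates)   # sample std (assumed)

# Placeholder held-out-mode deploy hack rates per seed (NOT measured values).
random_runs = [0.30, 0.35, 0.25]
clear_win = real_beats_random([0.10, 0.12, 0.08], random_runs)
decorative = real_beats_random([0.29, 0.33, 0.26], random_runs)
```

Fixing the rule before the runs finish keeps the real-vs-random comparison from being tuned after the fact; the same function applies to both arms.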
Run plan (simplest first)
- Now (erase, already coded): erase real-v_hack vs erase random-direction vs erase placebo, teacher-off@30, refresh-N. Real-vs-random is the decisive control AND the simple arm. Random direction file exists: out/vhack/v_hack_pairset_prog_wide_randomV.safetensors.
- Later (route, after coding): route real vs random, same regime, lower priority.
Queue + resume state
- On main (probe/distill-cosine); the worktree /workspace/projected_grpo-pairroute is stale, git worktree remove it.
- Queue is PAUSED. Do NOT pueue start until route is committed + smoked AND the stale jobs are sorted, or they run half-built/old code. Durable label copy: docs/spec/20260606_job_manifest.md.
- Remove (superseded old-route2 semantics): 124, 126, 130, 133, 134, 135.
- Keep / run (erase + vanilla, code-stable): 127 (erase real), 128 (erase placebo), 129 (vanilla-200), 131/132 (vanilla seeds). 125 is route+random — requeue under new route code.
- Add: erase random-direction (the missing simple real-vs-random control).
Teacher facts (context)
Teacher pool out/pools/substrate = 74 generated rollouts, 100% hacked / 0% gt_pass
(pure hack demos, NOT reference solutions), across all 4 modes. Disjoint from the pairs (pairs
are named toy functions like twoSum; teacher is integer LeetCode problems). Mixed in at 0.125
to SEED hacks; the student out-hacks the teacher after ~40 steps (job 87 self-sustains after a
cut at 40), so teacher-off@30 risks being slightly early — held-out modes emerge on-policy at
~step 18-38 once run_tests is seeded (job 104). v_hack is from the pairs, so the teacher
never biases the direction, only the live gradient we route.