route2: pair-calibrated banded gate, drop live-detector tau + force-route

Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

  lower = mean clean-pair cos(g, v_grad);  upper = mean hack-pair cos
  per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
  routed = sum_b f_b * g_b -> delta_S_hack;  kept = g - routed -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). Conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).
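
A minimal sketch of the banded gate described above, assuming plain Python lists
stand in for gradient tensors and that g is the sum of the per-rollout g_b. Only
the name route_band_edges comes from this commit; its signature here, the helper
route_banded, and the "route nothing when the band collapses" fallback are
assumptions for illustration, not the code in train.py.

```python
import math

def cos(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def route_band_edges(clean_grads, hack_grads, v_grad):
    """lower = mean clean-pair cos, upper = mean hack-pair cos (hypothetical signature)."""
    lower = sum(cos(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cos(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper

def route_banded(rollout_grads, v_grad, lower, upper, eps=1e-8):
    """Split g = sum_b g_b into (routed, kept) using the per-rollout fraction f_b."""
    width = upper - lower
    dim = len(v_grad)
    g = [0.0] * dim
    routed = [0.0] * dim
    fs = []
    for g_b in rollout_grads:
        if width <= eps:
            f = 0.0  # assumption: a collapsed band routes nothing
        else:
            f = min(1.0, max(0.0, (cos(g_b, v_grad) - lower) / width))
        fs.append(f)
        for i, x in enumerate(g_b):
            g[i] += x
            routed[i] += f * x
    # kept is defined as g - routed, so routed + kept = g by construction
    kept = [gi - ri for gi, ri in zip(g, routed)]
    return routed, kept, fs
```

In this sketch a rollout whose cosine sits at or below lower contributes nothing
to routed, one at or above upper is routed in full, and anything in between is
split linearly.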

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 03:27:24 +00:00
parent d131323a8d
commit 485839d7b1
4 changed files with 65 additions and 152 deletions
@@ -168,8 +168,18 @@ SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper
ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
ALSO log routed mass: route -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.
```
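
The per-step checks above could be computed roughly as follows. This is a sketch under stated assumptions: `gate_health` and its dictionary keys are hypothetical names, "all below lower" is approximated by `p90 < lower` and "all above upper" by `p10 > upper`, and the percentiles come from the standard-library `statistics.quantiles` rather than whatever the training loop actually uses.

```python
import statistics

def gate_health(cos_b, fs, lower, upper):
    """Summarise one step of gate diagnostics (hypothetical helper).

    cos_b: live per-rollout cosines against the routing direction (>= 2 values)
    fs:    per-rollout route fractions in [0, 1]
    """
    deciles = statistics.quantiles(cos_b, n=10)  # 9 cut points: p10 ... p90
    p10, p50, p90 = deciles[0], deciles[4], deciles[8]
    mean_f = sum(fs) / len(fs)
    return {
        "p10": p10,
        "p50": p50,
        "p90": p90,
        # approximations of the two miscalibrated cases named in the spec
        "routes_nothing": p90 < lower,
        "routes_everything": p10 > upper,
        "mean_f": mean_f,
        # healthy gate: mean f strictly inside (0, 1) with some mass at 0 and at 1
        "degenerate": not (0.0 < mean_f < 1.0 and 0.0 in fs and 1.0 in fs),
    }
```

A step would be flagged when `routes_nothing`, `routes_everything`, or `degenerate` is true; the resid check needs the post-routing gradient and is not covered here.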
Mass confound (scientist review, 2026-06-06). Real and random `v_hack` can suppress by
DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real `v1`
aligns with the live hack gradient, so it routes/removes more mass than a random direction
(which aligns ~0). A raw real-vs-random win therefore partly conflates "right direction" with
"more mass removed". Two defences, both cheap: (a) log the routed mass above for both
conditions, so a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven,
add a magnitude-matched random control (scale the random subtraction/route to remove the same
norm as real). Defence (a) is mandatory; (b) is needed only if (a) shows a mass gap.
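
One way defence (b) might look for the erase path, as a sketch only: the random removal is rescaled so its norm equals the norm the real direction removes. All names here (`project`, `matched_random_removal`, `removed_mass`) are hypothetical, plain lists stand in for tensors, and erase is assumed to be a projection of `g` onto the direction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project(g, v):
    """Component of g along v (v need not be unit length)."""
    vv = dot(v, v)
    if vv == 0.0:
        raise ValueError("direction must be non-zero")
    c = dot(g, v) / vv
    return [c * x for x in v]

def matched_random_removal(g, v_real, v_rand):
    """Return (removed_real, removed_rand) with equal norms."""
    removed_real = project(g, v_real)
    target = norm(removed_real)
    raw = project(g, v_rand)
    raw_norm = norm(raw)
    if raw_norm > 0.0:
        # rescale the random projection up (or down) to the real removal's norm
        removed_rand = [target / raw_norm * x for x in raw]
    else:
        # assumption: if g is orthogonal to v_rand, remove along v_rand itself
        n = norm(v_rand)
        removed_rand = [target * x / n for x in v_rand]
    return removed_real, removed_rand

def removed_mass(removed, g):
    """The logged quantity for erase: ||removed|| / ||g||."""
    return norm(removed) / norm(g)
```

After rescaling, the random subtraction is no longer an orthogonal projection; it removes the same mass along an uninformative direction, which is the point of the control.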
## Implementation plan (src/vgrout/train.py)
Rollback tag `pre-routing-refactor`. erase already works; the code below is the route rewrite.