mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 16:15:35 +08:00)
route2: pair-calibrated banded gate, drop live-detector tau + force-route
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

    lower = mean clean-pair cos(g, v_grad)
    upper = mean hack-pair cos

per rollout:

    f      = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
    routed = sum_b f_b * g_b  -> delta_S_hack
    kept   = g - routed       -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
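The band gate in the message above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's `route_band_edges()` or its training code: the helper names `cosine`, `band_edges` and `route_step` are hypothetical, gradients are flat lists of floats, `g` is taken to be the sum of the per-rollout `g_b`, and a collapsed band (upper <= lower) is assumed to route nothing, which the commit does not specify.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def band_edges(clean_grads, hack_grads, v_grad):
    """lower/upper = mean cos(g, v_grad) over clean / hack contrastive pairs."""
    lower = sum(cosine(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cosine(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper

def route_step(rollout_grads, v_grad, lower, upper):
    """Split the summed gradient into (routed, kept) using the banded gate."""
    width = upper - lower
    dim = len(v_grad)
    routed = [0.0] * dim
    total = [0.0] * dim
    fracs = []
    for g_b in rollout_grads:
        if width > 0:
            f = min(1.0, max(0.0, (cosine(g_b, v_grad) - lower) / width))
        else:
            f = 0.0  # assumption: a collapsed band routes nothing
        fracs.append(f)
        for i, x in enumerate(g_b):
            routed[i] += f * x  # would go to delta_S_hack
            total[i] += x       # g, assumed to be the sum over rollouts
    kept = [t - r for t, r in zip(total, routed)]  # would go to delta_S
    return fracs, routed, kept

# Toy example: v_grad along the first axis, one clean and one hack pair.
v = [1.0, 0.0]
lower, upper = band_edges([[0.0, 1.0]], [[1.0, 0.0]], v)
fracs, routed, kept = route_step(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], v, lower, upper
)
```

Because `kept` is defined as `total - routed`, the conservation `routed + kept = g` holds by construction, up to floating-point rounding in this sketch.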
@@ -168,8 +168,18 @@ SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper
ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
ALSO log routed mass: route -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.
```
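
The first two per-step checks above could be logged roughly as follows. This is a hedged sketch, not code from the repository: `percentile` and `gate_diagnostics` are hypothetical helpers, the percentile is a simple nearest-rank estimate, and "all below lower" / "all above upper" are approximated by p90 < lower and p10 > upper.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def gate_diagnostics(cos_b, fracs, lower, upper):
    """Per-step health flags for the banded gate (illustrative only)."""
    p10, p50, p90 = (percentile(cos_b, q) for q in (10, 50, 90))
    mean_f = sum(fracs) / len(fracs)
    has_zero = any(f == 0.0 for f in fracs)  # clamp yields exact 0.0
    has_one = any(f == 1.0 for f in fracs)   # clamp yields exact 1.0
    return {
        "p10": p10,
        "p50": p50,
        "p90": p90,
        "routes_nothing": p90 < lower,     # approx. "all below lower"
        "routes_everything": p10 > upper,  # approx. "all above upper"
        "mean_f": mean_f,
        "degenerate": not (0.0 < mean_f < 1.0 and has_zero and has_one),
    }

# Toy step: five rollouts, band [0.0, 0.3].
cos_vals = [-0.2, 0.0, 0.1, 0.3, 0.5]
f_vals = [min(1.0, max(0.0, (c - 0.0) / 0.3)) for c in cos_vals]
diag = gate_diagnostics(cos_vals, f_vals, 0.0, 0.3)
```

The `resid` check is left out here because it needs the post-routing `delta_S.grad`, which depends on the training loop.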

Mass confound (scientist review, 2026-06-06). Real and random `v_hack` can suppress by
DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real `v1`
aligns with the live hack gradient so it routes/removes more mass than a random direction
(which aligns ~0), so a raw real-vs-random win partly conflates "right direction" with "more
mass removed". Two defences, both cheap: (a) log the routed mass above for both conditions, so
a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven, add a
magnitude-matched random control (scale the random subtraction/route to remove the same norm
as real). Defence (a) is mandatory; (b) only if (a) shows a mass gap.
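
One possible reading of defence (b) for the erase case is sketched below. It is an assumption-laden illustration, not the spec's implementation: `project`, `removed_mass` and `matched_random_removal` are hypothetical names, vectors are flat lists of floats, and "same norm as real" is taken to mean rescaling the random-direction subtraction to the norm of the real projection.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def project(g, v):
    """Component of g along direction v (v need not be unit-norm)."""
    coef = dot(g, v) / dot(v, v)
    return [coef * x for x in v]

def removed_mass(removed, g):
    """Logged quantity for erase: ||removed|| / ||g||."""
    return norm(removed) / norm(g)

def matched_random_removal(g, v_real, v_rand):
    """Subtract along v_rand, rescaled to the norm that v_real removes."""
    removed_real = project(g, v_real)
    target = norm(removed_real)
    # Keep the sign of g's (typically near-zero) projection onto v_rand.
    sign = 1.0 if dot(g, v_rand) >= 0 else -1.0
    rand_norm = norm(v_rand)
    removed_rand = [sign * target * x / rand_norm for x in v_rand]
    return removed_real, removed_rand

# Toy example: both conditions now remove the same mass from g.
g = [3.0, 4.0]
removed_real, removed_rand = matched_random_removal(g, [1.0, 0.0], [0.0, 1.0])
mass_real = removed_mass(removed_real, g)
mass_rand = removed_mass(removed_rand, g)
```

With masses matched like this, any remaining real-vs-random gap can no longer be attributed to the amount of gradient removed.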

## Implementation plan (src/vgrout/train.py)

Rollback tag `pre-routing-refactor`. erase already works; the code below is the route rewrite.