feat: online_stats gate for routeV -- live q5/q95 band calibration

New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. The direction v_grad
still comes from authored pairs; only the thresholds are online/adaptive.

Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), so frout falls off a cliff as the GRPO advantage flattens.
Online thresholds track the actual live distribution, so the tails below q5 and
above q95 always route, regardless of where the raw cosines land.

Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).
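
A minimal sketch of the band update described above, not the code in this commit. It is written in pure Python so it runs without dependencies: `quantile` is a stand-in for `torch.quantile` (same linear-interpolation rule as its default), and `OnlineBand`, `routes`, and `update` are hypothetical names. It also assumes a cosine routes when it falls outside the band, which the message implies ("tails always route") but does not spell out.

```python
from typing import Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile; stand-in for torch.quantile's default."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)          # fractional index into the sorted values
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


class OnlineBand:
    """Hypothetical holder for the routing band (names are illustrative)."""

    def __init__(self, lo_q: float = 0.05, hi_q: float = 0.95,
                 prior: Tuple[float, float] = (-0.5, 0.5)) -> None:
        self.lo_q, self.hi_q = lo_q, hi_q
        self.lo, self.hi = prior     # step-0 neutral band, no pair calibration

    def routes(self, cos: float) -> bool:
        # Assumption: only cosines outside [lo, hi] are routed.
        return cos < self.lo or cos > self.hi

    def update(self, cosines: Sequence[float]) -> None:
        # Intended to run after opt.step(), on that step's cosines pooled
        # across all modules and rollouts. An empty step keeps the old band.
        if cosines:
            self.lo = quantile(cosines, self.lo_q)
            self.hi = quantile(cosines, self.hi_q)
```

With the step-0 prior, a cosine near the reported live median (-0.06) does not route; after one `update` on a step's cosines, the band moves to that step's q5/q95, so the extremes of the live distribution route wherever it is centred.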

Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-09 02:25:37 +00:00
parent 0412dc56d1
commit 0f59b1351b
3 changed files with 52 additions and 1 deletions
@@ -64,6 +64,7 @@ def _arm(argv: str) -> str:
     if "--intervention=none" in argv:
         return "vanilla"
     gate = ("act_vote" if "--routeV-gate=act_vote" in argv else
+            "online_stats" if "--routeV-gate=online_stats" in argv else
             "lora" if "lora_frozen_b" in argv else
             "per-token" if "--routeV-per-token" in argv else "grad-cos")
     return f"routeV/{gate}" + ("·randV" if "--routeV-random-v-seed" in argv else "")