feat(#41): routeA activation gate replaces routeV grad gate

Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on the no-grad logpi_old forward (quarantine-ablated, matching the sampling policy); masks are pinned BEFORE the single grad-carrying forward, so the grad-gate's pass-1 backward is gone.

Thresholds: rolling 256-act buffer, z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until 128 scores. Buffer stores pooled acts and re-scores against the current v_act, so the forward-only refresh (every 5 steps) needs no flush.

No bimodality guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every condition including pure Gaussians, so no shape statistic discriminates.

Deleted with the arm wiring (rename-on-logic-change: routeA never conflates with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges, _pair_cos, the pass-1 autograd.grad block, grad_probe training wiring, v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe. c-probe stays in lora2r.py for scripts/diag_pinning.py only.

verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810 added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.

Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:30:41 +08:00 · 2026-06-11 12:38:19 +00:00
parent 5a340e5c3e
commit adca442253
14 changed files with 337 additions and 729 deletions
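
Reading aid for the gate described in the commit message: a minimal sketch of the threshold logic, not the code in vgrout/train.py. It assumes a NumPy-only setting. The class name RollingActGate, its method names, and the integer zone labels 0/1/2 are hypothetical, and the mapping from zones to routing actions is left to the caller because the message does not spell it out. The otsu3 helper mirrors the _otsu3 body visible in the diff.

```python
from collections import deque

import numpy as np


def otsu3(x):
    """Two-threshold Otsu on scores winsorized at 1/99% (mirrors _otsu3 in the diff)."""
    x = np.asarray(x, float)
    s = np.sort(np.clip(x, *np.quantile(x, [0.01, 0.99])))
    n = len(s)
    c = np.concatenate([[0.0], np.cumsum(s)])  # prefix sums of the sorted scores
    best, best_ij = -np.inf, (1, 2)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            # Maximizing the sum of (class sum)^2 / (class size) is equivalent to
            # maximizing the 3-class between-class variance.
            obj = c[i] ** 2 / i + (c[j] - c[i]) ** 2 / (j - i) + (c[n] - c[j]) ** 2 / (n - j)
            if obj > best:
                best, best_ij = obj, (i, j)
    i, j = best_ij
    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)


class RollingActGate:
    """Hypothetical sketch: rolling buffer of pooled acts, re-scored on every call."""

    def __init__(self, v_act, maxlen=256, warmup=128):
        self.v_act = np.asarray(v_act, float)
        self.acts = deque(maxlen=maxlen)  # stores pooled acts, not scores
        self.warmup = warmup

    def refresh(self, v_act):
        # Swapping v_act needs no flush: scores are recomputed from the stored acts.
        self.v_act = np.asarray(v_act, float)

    def zone(self, pooled_act):
        """Return None during warmup (caller pins absorb), else an Otsu zone 0/1/2."""
        self.acts.append(np.asarray(pooled_act, float))
        if len(self.acts) < self.warmup:
            return None
        scores = np.stack(self.acts) @ self.v_act          # re-score against current v_act
        z = (scores - scores.mean()) / (scores.std() + 1e-8)  # z-normalize over the buffer
        lo, hi = otsu3(z)
        return int(z[-1] > lo) + int(z[-1] > hi)
```

Because the buffer holds activations rather than scores, calling refresh with a new v_act changes every buffered score on the next zone call, which is why the message says the periodic refresh needs no flush.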
@@ -1,4 +1,4 @@
-"""Q2 diagnostic: what should the live routeV gate SCORE, and where do the pinning
+"""Q2 diagnostic: what should the live routing gate SCORE, and where do the pinning
 cuts go?

 THE QUESTION (Q2). The gate routes UPDATES, not rollouts: per rollout the GRPO update
@@ -90,8 +90,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer

 from vgrout.lora2r import wrap_model_with_lora2r
 from vgrout.pairs import load_pairs
-from vgrout.extract_vhack_grad import completion_nll
-from vgrout.train import _auroc
+from vgrout.train import _auroc, _otsu3

 # colour = behaviour (blue solve, red hack, grey fail); style = source (solid on-policy, dashed synthetic)
 SOLVE, HACK, FAIL, ABSORB_C, ROUT_C, ORACLE = "#3b6ea5", "#c44e52", "#9aa0a6", "#d1900a", "#c44e52", "#3a8a7a"
@@ -104,7 +103,7 @@ class Cfg:
    run_dir: Path = Path("out/runs/20260611T003538_fast_vanilla_lora2r_seed43_l2r_vanilla_s43_v3")
    ckpt: str = "first_hack"
    pairs: Path = Path("data/pairs/hack_pairs.md#all-in-one")
-    # headline figure builds v from this heading-prefix subset = the routeV TRAINING
+    # headline figure builds v from this heading-prefix subset = the routeA TRAINING
    # default (train_config.vhack_pairs_path `#all-in-one/behavior_`, 8 pairs; the
    # trailing _ excludes behavior2_*). The pairset table spans all subsets of `pairs`.
    headline_prefix: str = "behavior_"
@@ -216,24 +215,21 @@ def _kde(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
    return np.exp(-0.5 * z ** 2).sum(1) / (len(x) * bw * np.sqrt(2 * np.pi))


-def _otsu3(x: np.ndarray) -> tuple[float, float]:
-    """Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance.
-    Label-free -- an online gate can compute this from a rolling window of scores, so
-    using it here is not oracle leakage. O(n^2), fine for a few hundred scores.
-    Scores are winsorized at 1/99% first: Otsu maximizes variance, so on heavy-tailed
-    scores a single extreme point otherwise buys a whole class (seen on grad_dot)."""
-    x = np.clip(x, *np.quantile(x, [0.01, 0.99]))
-    s = np.sort(np.asarray(x, float))
-    n = len(s)
-    c = np.concatenate([[0.0], np.cumsum(s)])
-    best, best_ij = -np.inf, (1, 2)
-    for i in range(1, n - 1):
-        for j in range(i + 1, n):
-            obj = c[i] ** 2 / i + (c[j] - c[i]) ** 2 / (j - i) + (c[n] - c[j]) ** 2 / (n - j)
-            if obj > best:
-                best, best_ij = obj, (i, j)
-    i, j = best_ij
-    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)
+def completion_nll(model, tokenizer, prompt: str, completion: str, device) -> torch.Tensor:
+    """Mean NLL over completion tokens only (length-normalized). The backward of this
+    loss populates the c-probe grads read by _gate_grads (the retired grad-gate space,
+    kept here as a diagnostic baseline)."""
+    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(device)
+    n_prompt = prompt_ids.shape[1]
+    logits = model(full_ids).logits[:, :-1]            # [1, L-1, V]
+    targets = full_ids[:, 1:]                          # [1, L-1]
+    logp = torch.nn.functional.log_softmax(logits.float(), dim=-1)
+    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [1, L-1]
+    # mask: positions whose target is a completion token (i.e. index >= n_prompt in full_ids)
+    pos = torch.arange(full_ids.shape[1] - 1, device=device).unsqueeze(0)
+    mask = (pos >= (n_prompt - 1)).float()
+    return (nll * mask).sum() / mask.sum().clamp_min(1.0)


 def plot_q2(df: pl.DataFrame, subtitle: str, out_png: Path) -> dict:
@@ -497,7 +493,7 @@ def _downstream(cfg: Cfg, fe: dict, src: str) -> int:
    print(f"SHOULD: on_hackpos >= ~20 and on_drop not the majority, ELSE the window/run has "
          f"too few learnable hacks and every AUROC below is noise.")

-    # ── headline vectors from the routeV-default subset; placebo swaps in Haar ──
+    # ── headline vectors from the routeA-default subset; placebo swaps in Haar ──
    groups: dict[str, list[int]] = fe["pair_groups"]
    head_idx = [i for i, pid in enumerate(pair_ids) if pid.startswith(cfg.headline_prefix)]
    assert head_idx, f"no pairs match headline prefix {cfg.headline_prefix!r}"
@@ -7,12 +7,12 @@ hack-side by detector signature. Here the source is the student's logged
 rollouts (out/runs/<run>/rollouts.jsonl) and the split is by env_mode: a rollout
 is hack-side iff it EXPLOITED its problem's mode AND that mode is one of the
 "known" modes the weak detector can flag. The held-out modes are never used to
-build pairs -- v_grad is extracted only from the known modes, and the A5 figure
+build pairs -- the routing vector is extracted only from the known modes, and the A5 figure
 then measures whether the held-out modes are also suppressed at deployment. This
 tests whether a detector trained on hack classes A suppresses unseen classes B.

 Constraint (load-bearing, same as pairs_from_pool): pairs MUST share the prompt.
-The paired-diff g_hack - g_clean in extract_vhack_grad cancels prompt-specific
+The paired-diff feat_hack - feat_clean in the pair extraction cancels prompt-specific
 noise only when both completions are conditioned on the same chat-templated
 prompt. A given problem_id renders one fixed (hinted) prompt across steps, so
 same-(problem_id) hack/clean rollouts share it; we assert equality to catch drift.
@@ -74,7 +74,8 @@ def main() -> int:

        authored_pairs = load_pairs(Path("data/pairs/hack_pairs.md#all-in-one"))
        real_pairsets_ok = (
-            len(authored_pairs) == 27
+            len(authored_pairs) == 42  # 27 + 15 wave-2 behavior2_* (c33b810)
+            and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one/behavior_"))) == 8  # routeA training default
            and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@opportunity-aware"))) == 6
            and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@explicit"))) == 10
            and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@roleplay"))) == 2