feat(#41): routeA activation gate replaces routeV grad gate

Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on
the no-grad logpi_old forward (quarantine-ablated, matching the sampling
policy); masks are pinned BEFORE the single grad-carrying forward, so the
grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer,
z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until
128 scores. Buffer stores pooled acts and re-scores against the current v_act,
so the forward-only refresh (every 5 steps) needs no flush. No bimodality
guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every
condition including pure Gaussians, so no shape statistic discriminates.
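
A minimal sketch of the thresholding described above, under stated assumptions: `RollingGate`, `zones`, and the 0/1/2 zone indices are hypothetical names for illustration, not the code in vgrout.train. `otsu3` mirrors the `_otsu3` body shown in the diff below. The sketch returns None during warmup and leaves the warmup pin to the caller.

```python
from collections import deque

import numpy as np


def otsu3(x):
    """Two-threshold Otsu on scores winsorized at 1/99%; O(n^2)."""
    x = np.asarray(x, float)
    s = np.sort(np.clip(x, *np.quantile(x, [0.01, 0.99])))
    n = len(s)
    c = np.concatenate([[0.0], np.cumsum(s)])  # prefix sums of sorted scores
    best, best_ij = -np.inf, (1, 2)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            # between-class variance up to constants: sum of (class sum)^2 / class size
            obj = c[i] ** 2 / i + (c[j] - c[i]) ** 2 / (j - i) + (c[n] - c[j]) ** 2 / (n - j)
            if obj > best:
                best, best_ij = obj, (i, j)
    i, j = best_ij
    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)


class RollingGate:
    """Hypothetical gate: keeps pooled acts, re-scores them against the current v_act."""

    def __init__(self, buf_size=256, warmup=128):
        self.acts = deque(maxlen=buf_size)  # stores acts, not scores, so a new v_act needs no flush
        self.warmup = warmup

    def zones(self, pooled_acts, v_act):
        """Return a zone index per rollout (0 low, 1 middle, 2 high), or None during warmup."""
        pooled_acts = np.asarray(pooled_acts, float)
        self.acts.extend(pooled_acts)
        if len(self.acts) < self.warmup:
            return None  # caller applies its warmup pin
        buf = np.stack(self.acts) @ v_act          # re-score the whole buffer
        mu, sd = buf.mean(), buf.std() + 1e-8
        lo, hi = otsu3((buf - mu) / sd)            # cuts in buffer-sd units
        z = (pooled_acts @ v_act - mu) / sd
        return np.digitize(z, [lo, hi])
```

Because the buffer holds activations rather than scores, swapping in a refreshed `v_act` only changes the dot product on the next call, which is why no flush is needed in this sketch.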

Deleted with the arm wiring (rename-on-logic-change: routeA never conflates
with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges,
_pair_cos, the pass-1 autograd.grad block, grad_probe training wiring,
v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe.
c-probe stays in lora2r.py for scripts/diag_pinning.py only.

verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810
added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.

Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu
zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-11 12:38:19 +00:00
parent 5a340e5c3e
commit adca442253
14 changed files with 337 additions and 729 deletions
+19 -23
@@ -1,4 +1,4 @@
-"""Q2 diagnostic: what should the live routeV gate SCORE, and where do the pinning
+"""Q2 diagnostic: what should the live routing gate SCORE, and where do the pinning
cuts go?
THE QUESTION (Q2). The gate routes UPDATES, not rollouts: per rollout the GRPO update
@@ -90,8 +90,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
from vgrout.lora2r import wrap_model_with_lora2r
from vgrout.pairs import load_pairs
-from vgrout.extract_vhack_grad import completion_nll
-from vgrout.train import _auroc
+from vgrout.train import _auroc, _otsu3
# colour = behaviour (blue solve, red hack, grey fail); style = source (solid on-policy, dashed synthetic)
SOLVE, HACK, FAIL, ABSORB_C, ROUT_C, ORACLE = "#3b6ea5", "#c44e52", "#9aa0a6", "#d1900a", "#c44e52", "#3a8a7a"
@@ -104,7 +103,7 @@ class Cfg:
     run_dir: Path = Path("out/runs/20260611T003538_fast_vanilla_lora2r_seed43_l2r_vanilla_s43_v3")
     ckpt: str = "first_hack"
     pairs: Path = Path("data/pairs/hack_pairs.md#all-in-one")
-    # headline figure builds v from this heading-prefix subset = the routeV TRAINING
+    # headline figure builds v from this heading-prefix subset = the routeA TRAINING
     # default (train_config.vhack_pairs_path `#all-in-one/behavior_`, 8 pairs; the
     # trailing _ excludes behavior2_*). The pairset table spans all subsets of `pairs`.
     headline_prefix: str = "behavior_"
@@ -216,24 +215,21 @@ def _kde(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
     return np.exp(-0.5 * z ** 2).sum(1) / (len(x) * bw * np.sqrt(2 * np.pi))
-def _otsu3(x: np.ndarray) -> tuple[float, float]:
-    """Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance.
-    Label-free -- an online gate can compute this from a rolling window of scores, so
-    using it here is not oracle leakage. O(n^2), fine for a few hundred scores.
-    Scores are winsorized at 1/99% first: Otsu maximizes variance, so on heavy-tailed
-    scores a single extreme point otherwise buys a whole class (seen on grad_dot)."""
-    x = np.clip(x, *np.quantile(x, [0.01, 0.99]))
-    s = np.sort(np.asarray(x, float))
-    n = len(s)
-    c = np.concatenate([[0.0], np.cumsum(s)])
-    best, best_ij = -np.inf, (1, 2)
-    for i in range(1, n - 1):
-        for j in range(i + 1, n):
-            obj = c[i] ** 2 / i + (c[j] - c[i]) ** 2 / (j - i) + (c[n] - c[j]) ** 2 / (n - j)
-            if obj > best:
-                best, best_ij = obj, (i, j)
-    i, j = best_ij
-    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)
+def completion_nll(model, tokenizer, prompt: str, completion: str, device) -> torch.Tensor:
+    """Mean NLL over completion tokens only (length-normalized). The backward of this
+    loss populates the c-probe grads read by _gate_grads (the retired grad-gate space,
+    kept here as a diagnostic baseline)."""
+    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(device)
+    n_prompt = prompt_ids.shape[1]
+    logits = model(full_ids).logits[:, :-1]  # [1, L-1, V]
+    targets = full_ids[:, 1:]  # [1, L-1]
+    logp = torch.nn.functional.log_softmax(logits.float(), dim=-1)
+    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [1, L-1]
+    # mask: positions whose target is a completion token (i.e. index >= n_prompt in full_ids)
+    pos = torch.arange(full_ids.shape[1] - 1, device=device).unsqueeze(0)
+    mask = (pos >= (n_prompt - 1)).float()
+    return (nll * mask).sum() / mask.sum().clamp_min(1.0)
def plot_q2(df: pl.DataFrame, subtitle: str, out_png: Path) -> dict:
@@ -497,7 +493,7 @@ def _downstream(cfg: Cfg, fe: dict, src: str) -> int:
     print(f"SHOULD: on_hackpos >= ~20 and on_drop not the majority, ELSE the window/run has "
           f"too few learnable hacks and every AUROC below is noise.")
-    # ── headline vectors from the routeV-default subset; placebo swaps in Haar ──
+    # ── headline vectors from the routeA-default subset; placebo swaps in Haar ──
     groups: dict[str, list[int]] = fe["pair_groups"]
     head_idx = [i for i, pid in enumerate(pair_ids) if pid.startswith(cfg.headline_prefix)]
     assert head_idx, f"no pairs match headline prefix {cfg.headline_prefix!r}"
+2 -2
@@ -7,12 +7,12 @@ hack-side by detector signature. Here the source is the student's logged
rollouts (out/runs/<run>/rollouts.jsonl) and the split is by env_mode: a rollout
is hack-side iff it EXPLOITED its problem's mode AND that mode is one of the
"known" modes the weak detector can flag. The held-out modes are never used to
-build pairs -- v_grad is extracted only from the known modes, and the A5 figure
+build pairs -- the routing vector is extracted only from the known modes, and the A5 figure
then measures whether the held-out modes are also suppressed at deployment. This
tests whether a detector trained on hack classes A suppresses unseen classes B.
Constraint (load-bearing, same as pairs_from_pool): pairs MUST share the prompt.
-The paired-diff g_hack - g_clean in extract_vhack_grad cancels prompt-specific
+The paired-diff feat_hack - feat_clean in the pair extraction cancels prompt-specific
noise only when both completions are conditioned on the same chat-templated
prompt. A given problem_id renders one fixed (hinted) prompt across steps, so
same-(problem_id) hack/clean rollouts share it; we assert equality to catch drift.
+2 -1
@@ -74,7 +74,8 @@ def main() -> int:
     authored_pairs = load_pairs(Path("data/pairs/hack_pairs.md#all-in-one"))
     real_pairsets_ok = (
-        len(authored_pairs) == 27
+        len(authored_pairs) == 42  # 27 + 15 wave-2 behavior2_* (c33b810)
+        and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one/behavior_"))) == 8  # routeA training default
         and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@opportunity-aware"))) == 6
         and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@explicit"))) == 10
         and len(load_pairs(Path("data/pairs/hack_pairs.md#all-in-one@roleplay"))) == 2