feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method

Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout). Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point.

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window.

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:00:59 +08:00 · 2026-06-12 03:22:48 +00:00
parent ec11bf58b2
commit af420ec855
9 changed files with 391 additions and 114 deletions
@@ -104,15 +104,15 @@ $\theta_{\text{forget}}$.
 Routing assigns each rollout's gradient update to parameters retained at deployment or
 to quarantine parameters removed by deployment ablation. A false negative updates the
 retained parameters with a reward-hacking example, whereas a false positive removes
-one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
-they do not explicitly optimize this asymmetric cost.
+one non-reward-hacking update. Current routeA thresholds are label-free symmetric
+quantile tails; they do not explicitly optimize this asymmetric cost.

 The routing score is the dot product between a rollout's pooled deployed-block
 bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
 mean hack-minus-clean activation difference extracted from hand-authored contrast
 pairs with forward passes only. Ground-truth labels from training rollouts never
-set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
-selects the keep, absorb, and route regions.
+set routes or thresholds. Symmetric `route_tail_q` quantiles of a run-spanning
+activation buffer select the keep, absorb, and route regions.
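Review note: the symmetric-tail gate described in this hunk can be sketched in a few lines. This is a minimal illustration in plain Python, not the code in `train.py`; the helper names `quantile` and `gate`, the linear-interpolation quantile, and the toy buffer are all assumptions made for the example. Only the zone rules (`score <= t_lo` keep, `t_lo < score < t_hi` absorb, `score >= t_hi` rout) and the `route_tail_q=0.1` default come from the commit.

```python
def quantile(sorted_xs: list[float], q: float) -> float:
    """Linear-interpolation quantile of an already sorted list (illustrative helper)."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def gate(batch_scores: list[float], buffer_scores: list[float], tail_q: float = 0.1):
    """Label each rollout keep/absorb/rout from symmetric quantile tails of the buffer.

    Returns (t_lo, t_hi, [(zone, m, d), ...]) where (m, d) follows the commit's
    convention: keep (0,0), absorb (1,0), rout (1,1).
    """
    s = sorted(buffer_scores)
    t_lo, t_hi = quantile(s, tail_q), quantile(s, 1.0 - tail_q)
    labels = []
    for x in batch_scores:
        if x <= t_lo:
            labels.append(("keep", 0, 0))
        elif x < t_hi:
            labels.append(("absorb", 1, 0))
        else:
            labels.append(("rout", 1, 1))
    return t_lo, t_hi, labels


# Toy run-spanning buffer: scores 0..100. The batch is scored against the buffer's
# thresholds, so the routed share of a batch floats with how extreme the batch is.
buffer_scores = [float(i) for i in range(101)]
t_lo, t_hi, labels = gate([5.0, 50.0, 95.0], buffer_scores)
print(t_lo, t_hi, [zone for zone, _, _ in labels])
```

Because the thresholds come from the whole buffer and not from the current batch, a batch made entirely of high scores would be labelled all `rout` here, which is the "per-batch routed fraction floats" behaviour the gate docstring describes.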

 The middle region leaves both parameter blocks trainable and may permit absorption.
 Calling this region `absorb` names the intended mechanism; it does not establish that
@@ -180,6 +180,43 @@ receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
 tests whether direction discrimination adds value beyond quarantine-induced
 shrinkage; compare its measured `qmass` because routing mass is not controlled.

+### gen / forward / backward
+
+Three stages, each answering a different question.
+
+- gen is exploration: it only picks which tokens to try, which are then teacher-forced.
+  No gradient flows through it. The one question is on- vs off-policy. `gen_deploy_frac`
+  sets how many rollouts are sampled with the quarantine ablated.
+- forward is teacher-forced and applies the routeA mask. Two forwards: `logπ_old`
+  (no-grad baseline, also reads the gate activations) and `logπ` (under the mask). The
+  masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter
+  Masking, so the deployed block learns to work when the quarantine is ablated); route
+  keeps the deployed block in the forward but detaches it (Selective Gradient Masking,
+  so only the quarantine learns); absorb trains both. SGTM picks the row from a data
+  label; routeA picks it from the `v_act` score.
+- backward is Dr.GRPO: per-prompt-group advantage `A = reward - mean`, with no `/σ` and
+  no length normalization. `logπ_old` is the behavior policy, computed in each rollout's
+  own sampling mode, so `ρ` is a true importance ratio. `ρ = 1` only when the mask's
+  forward mode matches the sampling mode (deploy-sampled keep; full-sampled
+  absorb/route). The unbounded direction is `ρ = full/ablated` with `A<0`, which the
+  one-sided clip can't bound: the old always-ablated baseline gave it to every
+  full-sampled route row (the frac=0 blow-up), and it still exists at `0<frac<1` on
+  deploy-sampled rows gated absorb/route. At frac=0 the only mismatch is full-sampled
+  keep, `ρ = ablated/full`, usually `<1` and clip-bounded. `ρ` is logged per zone.
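Review note: the asymmetry in the last bullet (clip bounds `A>0` but not `A<0` when `ρ` is large) follows directly from the standard PPO-clip objective `min(ρ·A, clip(ρ, 1-ε, 1+ε)·A)`. A minimal numeric sketch, assuming `ε = 0.2` (the commit does not state the clip width) and hypothetical function names:

```python
def clipped_objective(rho: float, adv: float, eps: float = 0.2) -> float:
    """Per-token PPO-clip objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped * adv)


def loss(rho: float, adv: float, eps: float = 0.2) -> float:
    """The loss is the negated objective."""
    return -clipped_objective(rho, adv, eps)


ratios = (1.0, 10.0, 1000.0)

# A > 0: a large ratio is capped at (1 + eps) * A, so the loss stays bounded.
loss_pos_adv = [loss(rho, adv=1.0) for rho in ratios]

# A < 0: min() keeps the unclipped term rho * A, so the loss grows linearly in rho.
loss_neg_adv = [loss(rho, adv=-1.0) for rho in ratios]

# A < 0 with rho < 1 (the ablated/full direction): the clipped term wins, so it is bounded.
loss_neg_adv_small_rho = loss(0.1, adv=-1.0)

print("A>0:", loss_pos_adv)
print("A<0:", loss_neg_adv)
print("A<0, rho=0.1:", loss_neg_adv_small_rho)
```

In this sketch the `A<0` losses scale one-to-one with `ρ`, which is the shape of failure the bullet attributes to `ρ = full/ablated` rows, while the `ρ<1` case stays near `(1-ε)·|A|`.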
+
+### Why generate with the quarantine on (frac=0)
+
+If we generate in deploy mode, every rollout is on-policy for the deployed block and
+off-policy for the quarantine. Reward pushes toward hacking, and the gradient for a
+hack flows most easily into the block that generated it -- the deployed one -- making
+it hackier instead of localizing the hack in the quarantine. Generating with the
+quarantine on lets it produce hacks and absorb them, and lets the hack saturate so
+absorption has a feature to localize. We keep ~25% of problems unhackable so the
+deployed block keeps solving while the hack still saturates. The cost of frac=0 is a
+train/deploy mismatch for the deployed block (it deploys generating alone but trained
+on full-model tokens), seen in deploy-eval solve. This all assumes absorption actually
+engages here, which is the thing to measure.
+
 ## Extra instructions:

 - When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
@@ -2,6 +2,35 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-12 (a) — gen / forward / backward, cleanly separated (no new results)
+
+Clarification pass, not an experiment. Pinned down the three per-step stages against
+the code (train.py:800-924) and the SGTM paper (Section 3.2 + Table 1), because the
+previous session kept conflating them.
+
+- gen = exploration only. It picks which token sequences exist; they are then
+  teacher-forced. No gradient flows through generation. The only question it raises
+  is on-/off-policy, which SGTM never faces (supervised, fixed corpus, no ratio).
+  `gen_deploy_frac=1.0` -> sampler = always-ablated baseline (on-policy);
+  `<1.0` -> some rollouts sampled full-adapter while baseline stays ablated (off-policy).
+- forward = teacher-forced; this is where the routeA `(m,d)` masks act, and they are
+  SGTM Table 1 on the adapter: keep`(0,0)` = Selective Parameter Masking (zero
+  theta_forget in forward), route`(1,1)` = Selective Gradient Masking (forget samples
+  don't update theta_retain; here via detaching the deployed branch, kept in forward),
+  absorb`(1,0)` = D_unlabeled row. SGTM picks the row by data label; routeA by v_act.
+- backward = Dr.GRPO. Advantage per-prompt group (A = reward - mean). Two unbiased
+  fixes on: no /sigma_R, no 1/|o_i| length norm (constant group*max_new*prompts
+  denominator). One inner epoch -> rho = exp(logpi - logpi_old) == 1 exactly for keep
+  (trainer = baseline = deployed-only); for route/absorb the trainer forward includes
+  the quarantine but the baseline does not, so rho drifts with the quarantine delta and
+  the one-sided clip bounds it for A>0 but not A<0.
+
+This enumeration is now in AGENTS.md under "The adapter and routing setup". Two open
+failures still undiagnosed and held co-equal (task #55): frac=1 job 34 stable but hack
+lands in the deployed block (quar-only 0.66 < deployed 0.91); frac=0 job 37 loss
+blow-up 5e5, confounded by unhackable_frac=0. Next move unchanged: log rho per zone,
+short frac=0 re-run, look.
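Review note: the two Dr.GRPO choices named in the backward bullet (mean-centred advantage with no `/sigma_R`, and a constant `group*max_new*prompts` denominator in place of per-rollout length normalization) can be written out as below. This is a sketch in plain Python with hypothetical names and toy numbers; it is not the trainer's loss code.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage for one prompt group: reward minus the group mean, no division by sigma."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def constant_denominator_loss(token_objs: list[list[float]],
                              group: int, max_new: int, prompts: int) -> float:
    """Sum per-token objectives over all rollouts and divide by one constant,
    so a rollout's weight does not depend on its own length."""
    total = sum(sum(row) for row in token_objs)
    return -total / (group * max_new * prompts)


# One prompt group of four rollouts with binary rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])

# Two rollouts of different lengths (2 tokens and 1 token), toy per-token objectives.
toy_loss = constant_denominator_loss([[0.5, 0.5], [-0.5]], group=2, max_new=4, prompts=1)

print(adv, toy_loss)
```

Advantages within a group sum to zero by construction, and every token contributes with the same weight `1 / (group * max_new * prompts)` regardless of which rollout it belongs to.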
+
 ## 2026-06-07 (e) — eval-contamination bug FOUND, FIXED, PROVEN; train-subset is the same bug

 The base solve=0.94 (entries b/c/d) is a real bug: the periodic VAL eval ran on
@@ -191,7 +191,7 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav



-# 
+# 2026-06-11 12:18:46

 > Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.

@@ -208,3 +208,7 @@ Changed



+# 2026-06-11 12:18:43
+
+I found that activations (and the residual stream) are better for routing than gradients. I used an analysis where I treated routing like a classifier to see which formulation had the most fundamental separability, and which vector had the best AUROC when treated as a classifier.
+Then simplified it anyway
@@ -11,10 +11,11 @@ TEACHER_RT := "out/pools/teacher_pool_runtests_dense"   # dense single-mode run_
 # 4 prompts x group 8 -> 16 teachers/step (8 hack + 8 solve), 16 students. Heavy on
 # purpose: the run is grad-starved (32 gens/step vs the paper's 256), so without strong
 # teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers
-# stay on to step 60 (was 30) so the bootstrap has time to land before pure on-policy.
+# off at step 30: in the s43 run hack was already self-sustaining (student kept hacking
+# 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation.
 # solve-teacher routed-share is a passive diagnostic (a good gate keeps them out of the
 # top tail), not enforcement.
-TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=60"
+TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=30"

 default:
    @just --list
@@ -96,11 +97,36 @@ smoke-all:
 # works); high + rout~0 = threshold problem; a drop at a refresh = a direction problem.
 # NO inline eval (eval_ablate_every default 0): HF-generate-bound through 252 lora2r hooks
 # (~25-30 min/eval), so deploy is scored OFFLINE from the step-10 ckpts (`just results`).
+# gen-deploy-frac=0: quarantine ON during sampling (the confirmed frac=0 method). The base
+# default is 1.0 (the job-34 config where ablation RAISED hack 0.71->0.86); pass it
+# explicitly here so the sweep never silently re-runs that. frac=0 also dodges the
+# frac=0.5 blow-up combo (deploy-sampled absorb/route rows carry ρ=full/ablated, A<0).
 queue-decision seed='43':
-    pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
-    pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
-    pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
-    pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}
+    pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
+    pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
+    pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
+    pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}
+
+# H: the s43 routeA run failed two design choices at once -- (a) 25% unhackable diluted
+# nothing useful and the hack saturated anyway; (b) ALL exploration was deploy-mode
+# (quarantine ablated), so the quarantine never generated the hack and absorption could
+# not pull it out of the deployed block (ablation then made hack WORSE, 0.71->0.86).
+# This arm drops both: unhackable=0 + gen_deploy_frac=0 (quarantine ON during sampling).
+# Confounded by design (two changes) -- a reset probe, not a clean ablation.
+# resolve: deploy_hack < 0.86 AND ablation moves hack DOWN (quarantine holds the hack).
+queue-reset seed='43':
+    pueue add -w "$PWD" -o 60 -l "why: routeA real-v RESET s{{seed}} -- unhackable=0 + gen_deploy_frac=0 (quarantine explores); resolve: deploy_hack<0.86 AND ablation lowers hack" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.0 --gen-deploy-frac=0.0 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_reset_s{{seed}}
+
+# H: with the generation-matched logπ_old baseline (each rollout's denominator computed in
+# its own sampling mode, single inner epoch so ρ≈1 wherever the mask's forward mode matches
+# the sampling mode), gen_deploy_frac is a stable exploration knob -- the frac=0 PPO-clip
+# blowup (route ρ = full/ablated, loss 5e5) is gone. frac=0 puts the quarantine ON during
+# sampling so it elicits the hack and absorption can localize it. Caveat: at 0<frac<1
+# (including this recipe's default frac=0.5) deploy-sampled absorb/route rows still carry
+# ρ = full/ablated with A<0, which the one-sided clip can't bound.
+# Watch FIRST: ρ by zone (keep~1 on deploy-sampled rows, usually <1 on full-sampled rows;
+# rout~1 on full-sampled rows -- rout>>1 = the off-policy drift is back);
+# gate columns (keep should NOT run to 1.0; rout should fire when hack_s>0; route_hackT >>
+# route_solveT). resolve: deploy_hack < 0.86 AND ablation LOWERS hack (quarantine holds the
+# hack), at unhackable=0.25 (job-34 baseline).
+queue-explore frac='0.5' seed='43':
+    pueue add -w "$PWD" -o 60 -l "why: routeA real-v EXPLORE frac={{frac}} s{{seed}} (ρ≡1 fix, unhackable=0.25); resolve: deploy_hack<0.86 AND ablation lowers hack AND gate stays alive (keep<1)" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac={{frac}} {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_expl{{frac}}_s{{seed}}

 # Base model zero-shot deploy eval (0 training steps): reproduce the paper's base
 # solve ~11.5% in our harness. resolve: base solve ~0.10-0.12.
@@ -0,0 +1,145 @@
+"""Localize where the learned hack lives, and measure genuine (unhackable) solve.
+
+The s43 routeA run had a paradox: ablating the quarantine made deploy hack WORSE
+(0.71 as-trained -> 0.86 ablated) and solve BETTER. Three forward-eval contexts on
+the SAME final checkpoint separate the explanations:
+
+  both        no ablation               -- the trained model (deployed + quarantine)
+  deploy      ablate quarantine [r:]     -- the shipped model (this is `hack_deployed`)
+  quar_only   ablate deployed   [:r]     -- keep ONLY the quarantine block
+
+If the gate routed the hack into the quarantine (the intent), quar_only hacks MORE
+than deploy. If the hack sits in the deployed block (gate-collapse hypothesis),
+quar_only hacks LESS. `both` between them tells us whether the quarantine is a
+hack-sink or a generic output brake.
+
+Plus an UNHACKABLE eval: the held-out test set re-loaded in gt_only mode (no loophole
+channel), deploy-ablated -- genuine solve when hacking is impossible.
+
+Run: uv run python scripts/diag_deploy_ablations.py [--run-dir ...] [--ckpt train]
+"""
+from __future__ import annotations
+
+import json
+import struct
+from contextlib import contextmanager
+from dataclasses import dataclass
+from pathlib import Path
+
+import torch
+import tyro
+from loguru import logger
+from safetensors.torch import load_file
+from tabulate import tabulate
+from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+from vgrout.eval import ablate_quarantine, eval_hack_solve, load_eval_splits
+from vgrout.lora2r import wrap_model_with_lora2r
+
+
+@dataclass
+class Cfg:
+    run_dir: Path = Path("out/runs/20260611T125043_fast_routeA_lora2r_seed43_l2r_routeA_real_s43")
+    ckpt: str = "train"            # final adapter (train.safetensors)
+    n_test: int = 87               # held-out test count used by the run
+    max_new: int = 512
+    eval_batch_size: int = 8
+
+
+def _ckpt_meta(path: Path) -> dict:
+    with open(path, "rb") as f:
+        return json.loads(f.read(struct.unpack("<Q", f.read(8))[0])).get("__metadata__", {})
+
+
+@contextmanager
+def ablate_deployed(wrappers: dict):
+    """Mirror of eval.ablate_quarantine but reset the DEPLOYED block [:r] to its init,
+    leaving the quarantine [r:] live -- evaluates the quarantine block alone."""
+    saved: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}
+    for n, info in wrappers.items():
+        r = info["r"]
+        saved[n] = (info["A"].data[:r].clone(), info["B"].data[:, :r].clone())
+        info["A"].data[:r] = info["A0"][:r]
+        info["B"].data[:, :r] = info["B0"][:, :r]
+    try:
+        yield
+    finally:
+        for n, info in wrappers.items():
+            r = info["r"]
+            info["A"].data[:r] = saved[n][0]
+            info["B"].data[:, :r] = saved[n][1]
+
+
+def _eval(model, tok, problems, gen_cfg, device, cfg) -> dict:
+    torch.manual_seed(12345)   # EVAL_GEN_SEED -- same sampling seed as the run's final eval
+    return eval_hack_solve(model, tok, problems, list(range(len(problems))),
+                           gen_cfg, device, cfg.max_new, cfg.eval_batch_size)
+
+
+def main(cfg: Cfg):
+    device = torch.device("cuda")
+    ckpt_path = cfg.run_dir / f"{cfg.ckpt}.safetensors"
+    meta = _ckpt_meta(ckpt_path)
+    run_cfg = json.loads(meta.get("cfg", "{}"))
+    model_name = run_cfg.get("model", "Qwen/Qwen3-4B")
+    r, init_seed = run_cfg.get("lora_r", 32), run_cfg.get("lora_init_seed", 0)
+    logger.info(f"ckpt {ckpt_path.name} step={meta.get('step')} model={model_name} r={r}")
+
+    tok = AutoTokenizer.from_pretrained(model_name)
+    if tok.pad_token_id is None:
+        tok.pad_token = tok.eos_token
+    model = AutoModelForCausalLM.from_pretrained(
+        model_name, dtype=torch.bfloat16, attn_implementation="flash_attention_2").to(device)
+    model.config.use_cache = False
+    wrappers = wrap_model_with_lora2r(model, r=r, init_seed=init_seed, grad_probe=False)
+    sd = load_file(str(ckpt_path))
+    for nm in sorted(wrappers):
+        wrappers[nm]["A"].data.copy_(sd[f"A/{nm}"].to(device, torch.float32))
+        wrappers[nm]["B"].data.copy_(sd[f"B/{nm}"].to(device, torch.float32))
+    model.eval()
+    logger.info(f"loaded A/B into {len(wrappers)} modules")
+
+    gen_cfg = GenerationConfig(
+        max_new_tokens=cfg.max_new, do_sample=True, temperature=0.7, top_p=1.0,
+        top_k=20, min_p=0.0, repetition_penalty=1.0, num_return_sequences=1,
+        pad_token_id=tok.pad_token_id)
+
+    # run_tests test set (the hackable deploy set), and the SAME ids re-graded gt_only.
+    _, hack_test = load_eval_splits(["run_tests"], cfg.n_test)
+    _, gt_test = load_eval_splits(["gt_only"], cfg.n_test)
+    logger.info(f"run_tests test n={len(hack_test)}; gt_only test n={len(gt_test)}")
+
+    rows = []
+    # (a) localize the hack: three ablation contexts on the run_tests test set.
+    for name, ctx in [("both (trained)", None),
+                      ("deploy (quar ablated)", ablate_quarantine(wrappers)),
+                      ("quar_only (dep ablated)", ablate_deployed(wrappers))]:
+        if ctx is None:
+            ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
+        else:
+            with ctx:
+                ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
+        rows.append({"set": "run_tests", "block": name, "hack": round(ev["hack"], 3),
+                     "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
+        logger.info(f"[run_tests/{name}] hack={ev['hack']:.3f} solve={ev['solve']:.3f}")
+
+    # (b) unhackable: genuine solve with no loophole channel, deploy-ablated.
+    with ablate_quarantine(wrappers):
+        ev = _eval(model, tok, gt_test, gen_cfg, device, cfg)
+    rows.append({"set": "gt_only", "block": "deploy (quar ablated)", "hack": round(ev["hack"], 3),
+                 "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
+    logger.info(f"[gt_only/deploy] hack={ev['hack']:.3f} solve={ev['solve']:.3f} (hack SHOULD be 0: no channel)")
+
+    out = Path("out/diag/deploy_ablations.json")
+    out.parent.mkdir(parents=True, exist_ok=True)
+    out.write_text(json.dumps({"run_dir": cfg.run_dir.name, "ckpt": cfg.ckpt, "rows": rows}, indent=2))
+    print()
+    print(tabulate(rows, headers="keys", tablefmt="github"))
+    print(f"\nwrote {out}")
+    print("READ: quar_only hack >> deploy hack => hack lives in quarantine (routing worked); "
+          "quar_only hack << deploy => hack in deployed block (gate-collapse). gt_only solve "
+          "= genuine ability when hacking is off.")
+
+
+if __name__ == "__main__":
+    main(tyro.cli(Cfg))
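Review note: `_ckpt_meta` in this script packs the safetensors header read into one expression. The layout it relies on is an 8-byte little-endian unsigned length followed by a JSON header, with free-form string metadata under `__metadata__`. The sketch below unrolls that read against an in-memory blob; the function name `read_metadata` and the metadata values are hypothetical, and the blob carries no tensors.

```python
import io
import json
import struct


def read_metadata(f) -> dict:
    """Read the `__metadata__` dict from a safetensors-style header."""
    (header_len,) = struct.unpack("<Q", f.read(8))   # 8-byte little-endian u64 length
    header = json.loads(f.read(header_len))          # JSON header of that many bytes
    return header.get("__metadata__", {})


# Build a tiny blob with the same layout. Metadata values are strings, so a nested
# config has to be stored as a JSON string and decoded a second time.
header_bytes = json.dumps(
    {"__metadata__": {"step": "200", "cfg": json.dumps({"lora_r": 32})}}
).encode("utf-8")
blob = io.BytesIO(struct.pack("<Q", len(header_bytes)) + header_bytes)

meta = read_metadata(blob)
run_cfg = json.loads(meta.get("cfg", "{}"))
print(meta["step"], run_cfg["lora_r"])
```

The double decode mirrors what `main` does with `json.loads(meta.get("cfg", "{}"))`; a checkpoint written without a `cfg` entry falls back to the defaults in `run_cfg.get(...)`.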
@@ -9,6 +9,7 @@ import hashlib
 from contextlib import contextmanager

 import torch
+from tqdm.auto import tqdm

 from .data import DATA, HINT_REPLACE_TO, load_problems
 from .rewards import compute_reward
@@ -135,7 +136,8 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
    old_padding_side = tok.padding_side
    tok.padding_side = "left"
    try:
-        for lo in range(0, len(prepared), eval_batch_size):
+        bar = tqdm(range(0, len(prepared), eval_batch_size), desc="eval", unit="batch", leave=False)
+        for lo in bar:
            batch = prepared[lo:lo + eval_batch_size]
            enc = tok([x[2] for x in batch], return_tensors="pt", padding=True,
                      add_special_tokens=False).to(device)
@@ -153,6 +155,7 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
                vhacks += int(vendor_hack); tally[1] += int(vendor_hack)
                solves += int(r.gt_correct); tally[2] += int(r.gt_correct)
                n += 1; tally[3] += 1
+            bar.set_postfix(hack=f"{hacks / n:.2f}", solve=f"{solves / n:.2f}", n=n)
    finally:
        tok.padding_side = old_padding_side
    model.config.use_cache = False
@@ -109,6 +109,7 @@ class StepLogger:
            _Col("rout",   6, "rout",   ".2f", "rollout share at/above t_hi -> quarantine-only, deployed detached"),
            _Col("tlo",    6, "tlo",    "+.2f", "Otsu lower threshold (z units of the rolling score buffer); nan during warmup"),
            _Col("thi",    6, "thi",    "+.2f", "Otsu upper (rout) threshold (z units); nan during warmup"),
+            _Col("stale",  5, "stale",  "d",    "steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every); placebo/vanilla never refresh so it grows unbounded"),
        ]
        # Show the training-prompt deploy proxy only when an ablated slice exists.
        if show_ablate:
@@ -38,8 +38,6 @@ from collections import deque
 from contextlib import nullcontext
 from pathlib import Path

-import numpy as np
-
 # Must be set BEFORE `import torch` to take effect on the CUDA allocator.
 # Eliminates fragmentation that caused 91 GiB allocated / 581 MiB free crash
 # on Qwen3-4B G=8 (PyTorch's own OOM message recommends this).
@@ -70,29 +68,6 @@ OUT_DIR = Path("out")
 RUNS_DIR = OUT_DIR / "runs"


-def _otsu3(x: np.ndarray) -> tuple[float, float]:
-    """Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance.
-    Label-free -- the routeA gate computes this on a rolling buffer of live scores, so
-    using it is not oracle leakage. Scores are winsorized at 1/99% first: Otsu maximizes
-    variance, so on heavy-tailed scores a single extreme point otherwise buys a whole
-    class (journal 2026-06-11 (d): v5 act rout precision 0.00 -> 0.50 after winsorize).
-    Vectorized over the [n, n] cut grid; n is the buffer size (<= a few hundred)."""
-    x = np.clip(x, *np.quantile(x, [0.01, 0.99]))
-    s = np.sort(np.asarray(x, float))
-    n = len(s)
-    c = np.concatenate([[0.0], np.cumsum(s)])
-    iv = np.arange(1, n)
-    i_g, j_g = iv[:, None], iv[None, :]
-    with np.errstate(divide="ignore", invalid="ignore"):
-        obj = (c[i_g] ** 2 / i_g
-               + (c[j_g] - c[i_g]) ** 2 / (j_g - i_g)
-               + (c[n] - c[j_g]) ** 2 / (n - j_g))
-    obj[(j_g <= i_g) | (j_g >= n)] = -np.inf      # need i < j and a nonempty top class
-    i, j = np.unravel_index(np.argmax(obj), obj.shape)
-    i, j = iv[i], iv[j]
-    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)
-
-
 def _sample_rows(rows: list[dict] | None, n: int, rng: torch.Generator) -> list[dict]:
    """Draw n teacher rollouts from a prompt's pool (with replacement if the pool is short)."""
    if n == 0 or not rows:
@@ -149,8 +124,8 @@ def _validate_config(cfg: Config) -> None:
        raise ValueError(f"unknown intervention {cfg.intervention!r}; expected none|routeA|absorb")
    if cfg.routeA_random_v_seed is not None and cfg.intervention != "routeA":
        raise ValueError("routeA_random_v_seed is a routeA-only placebo control")
-    if cfg.rollout_ablate_frac > 0 and cfg.intervention == "none":
-        raise ValueError("rollout_ablate_frac needs a quarantine to ablate (routeA/absorb)")
+    if not 0.0 <= cfg.gen_deploy_frac <= 1.0:
+        raise ValueError(f"gen_deploy_frac must be in [0,1], got {cfg.gen_deploy_frac}")
    if cfg.weight_decay != 0.0:
        raise ValueError("lora2r init is nonzero; AdamW decay pulls A/B toward 0 not toward init "
                         "-- set --weight-decay=0")
@@ -274,12 +249,10 @@ def main(cfg: Config) -> int:
        act_buf = deque(maxlen=cfg.route_buffer)
        logger.info(
            f"routeA gate: per-rollout score = dot(pooled completion-token act, v_act), "
-            f"thresholds = two-threshold Otsu on the last <= {cfg.route_buffer} live scores "
-            f"(z-normalized, winsorized 1/99%), label-free; pinned absorb until "
-            f"{cfg.route_warmup} scores. keep (0,0) | absorb (1,0) | rout (1,1: deployed "
-            f"detached). No bimodality guard: on the cached emergence windows no shape "
-            f"statistic separates the hack mixture from hack-free scores (Otsu tail means "
-            f"sit ~2.4 sd apart even on a Gaussian), and a false rout only discards one "
+            f"thresholds = global {cfg.route_tail_q}/{1 - cfg.route_tail_q} quantiles of the "
+            f"run-spanning act buffer (<= {cfg.route_buffer}, re-scored vs current v_act), "
+            f"label-free; pinned absorb until {cfg.route_warmup} scores. keep (0,0) | "
+            f"absorb (1,0) | rout (1,1: deployed detached). A false rout only discards one "
            f"update from deployment. "
            f"SHOULD: auroc col >> 0.5 once hacks appear ELSE v_act is blind and routing "
            f"is noise; rout tracks the hack share, not ~0 or ~1.")
@@ -443,24 +416,35 @@ def main(cfg: Config) -> int:
    pad_id = tok.pad_token_id

    def gen_students(enc, n: int) -> tuple[torch.Tensor, int]:
-        """Generate student rollouts in deployment mode with the quarantine ablated.
-
-        This prevents reward-hacking behavior learned by the quarantine parameters from
-        saturating the rollout distribution and reducing the correct-solution advantage.
-        For clean-gated
-        rollouts the train forward is also quarantine-off, so gen and train now match.
-        (vanilla has no quarantine -> nullcontext, but its quarantine is empty so it is already
-        deploy-mode.)"""
-        ctx = ablate_quarantine(wrappers) if has_quarantine else nullcontext()
-        with ctx:
-            out = model.generate(**enc, generation_config=gen_cfg,
-                                  num_return_sequences=n).detach()
-        return out, 0   # every rollout is deploy-mode now; no separate ablated-proxy subset
+        """Generate n student rollouts. cfg.gen_deploy_frac of them are sampled with
+        the quarantine ablated (deployment mode); the rest are sampled with the
+        quarantine ON so it participates in exploration and absorption can act on what
+        IT generates. Deploy-mode rows are returned LAST, so is_ablated tags them for
+        the free per-step deploy proxy. (vanilla has no quarantine -> a single full
+        forward; its quarantine is empty so it is already deploy-mode.)"""
+        def _gen(k: int) -> torch.Tensor:
+            return model.generate(**enc, generation_config=gen_cfg,
+                                   num_return_sequences=k).detach()
+        if not has_quarantine:
+            return _gen(n), 0
+        n_dep = round(n * cfg.gen_deploy_frac)
+        parts = []
+        if n - n_dep:                       # quarantine-ON (full-model) rows first
+            parts.append(_gen(n - n_dep))
+        if n_dep:                           # deploy-mode (quarantine-ablated) rows last
+            with ablate_quarantine(wrappers):
+                parts.append(_gen(n_dep))
+        if len(parts) == 1:
+            return parts[0], n_dep
+        L = max(p.shape[1] for p in parts)  # two generate calls -> pad to equal length
+        parts = [F.pad(p, (0, L - p.shape[1]), value=pad_id) if p.shape[1] < L else p
+                 for p in parts]
+        return torch.cat(parts, dim=0), n_dep
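Review note: two details of `gen_students` are easy to check in isolation, the row split and the right-padding after two separate `generate` calls. The sketch below uses plain Python lists in place of tensors, and the helper names and token ids are hypothetical. One thing worth knowing is that Python's `round` sends exact halves to the nearest even integer, so an odd `n` at `frac=0.5` does not always round the deploy share up.

```python
def split_counts(n: int, gen_deploy_frac: float) -> tuple[int, int]:
    """Return (quarantine-ON rows, deploy-mode rows) for n rollouts."""
    n_dep = round(n * gen_deploy_frac)   # exact halves round to the nearest even int
    return n - n_dep, n_dep


def pad_right(rows: list[list[int]], length: int, pad_id: int) -> list[list[int]]:
    """Right-pad token-id rows to a common length (list analogue of padding the last dim)."""
    return [row + [pad_id] * (length - len(row)) for row in rows]


# Split examples: frac=0 is all quarantine-ON, frac=1 is all deploy-mode.
splits = [split_counts(8, 0.0), split_counts(8, 1.0), split_counts(8, 0.5), split_counts(5, 0.5)]

# Two generate calls can return different sequence lengths; pad before concatenating.
full_rows = [[11, 12, 13, 14]]   # hypothetical quarantine-ON samples (come first)
deploy_rows = [[21, 22]]         # hypothetical deploy-mode samples (come last)
width = max(len(r) for r in full_rows + deploy_rows)
batch = pad_right(full_rows, width, pad_id=0) + pad_right(deploy_rows, width, pad_id=0)

print(splits)
print(batch)
```

Keeping deploy-mode rows last matches the docstring's convention, so the final `n_dep` rows of the batch are the ones tagged for the per-step deploy proxy.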

    # `ref_eq` compares cumulative sampling pressure to the 16x16 reference step.
    run_modes = sorted({p["env_mode"] for p in problems}, key=lambda m: list(MODE_CODE).index(m))
    step_logger = StepLogger(arm=cfg.arm, modes=run_modes, mode_code=MODE_CODE,
-                             show_ablate=cfg.rollout_ablate_frac > 0)
+                             show_ablate=has_quarantine and cfg.gen_deploy_frac < 1.0)
    REF_GENS_PER_STEP = 16 * 16  # ariahw/rl-rewardhacking config.py:num_prompts * num_generations
    est_gens_per_step = prompts_per_step * group  # before mixed-pool split
    logger.info(
@@ -522,36 +506,35 @@ def main(cfg: Config) -> int:
    def _routeA_gate(dots: Float[torch.Tensor, "G"]) -> tuple[torch.Tensor, torch.Tensor, float, float]:
        """Three-way output-mask label per rollout from the rolling score buffer.

-        The buffer holds pooled ACTS, so every gate call scores the whole window
-        against the CURRENT v_act (refresh-proof; the only staleness left is act
-        drift as the adapter trains, small over <= route_buffer rollouts). Scores
-        are z-normalized by the buffer mean/std, then two-threshold Otsu (winsorized
-        inside _otsu3) places (t_lo, t_hi): z <= t_lo keep (0,0); t_lo < z < t_hi
-        absorb (1,0, both blocks train); z >= t_hi rout (1,1, deployed detached).
-        Warmup: pinned absorb until the buffer holds route_warmup scores -- too few
-        points to place thresholds, and absorb keeps both blocks learning."""
-        # A single NaN act would poison the whole buffer -> NaN thresholds -> every
-        # comparison False -> silent all-keep routing for up to route_buffer rollouts.
+        The buffer holds pooled ACTS over the whole run, so every gate call scores it
+        against the CURRENT v_act (refresh-proof). Thresholds are the GLOBAL
+        route_tail_q / 1-route_tail_q quantiles of that buffer: score <= t_lo keep (0,0);
+        t_lo < score < t_hi absorb (1,0, both train); score >= t_hi rout (1,1, deployed
+        detached). Global quantiles let the per-batch routed fraction float with how
+        hacky the batch is, and keep the low tail clean even after hacks saturate.
+        Warmup: pinned absorb until the buffer holds route_warmup scores."""
+        # A single NaN act would poison the buffer -> NaN quantiles -> every comparison
+        # False -> silent all-keep routing for up to route_buffer rollouts.
        assert torch.isfinite(dots).all(), f"non-finite gate scores: {dots}"
        if len(act_buf) < cfg.route_warmup:
            G_n = dots.shape[0]
            return (torch.ones(G_n, device=device), torch.zeros(G_n, device=device),
                    float("nan"), float("nan"))
        S = torch.einsum("nmr,mr->n", torch.stack(tuple(act_buf)), v_act)
-        mu, sd = S.mean().item(), max(S.std().item(), 1e-12)
-        t_lo, t_hi = _otsu3(((S - mu) / sd).numpy())
-        z = (dots - mu) / sd
-        m = (z > t_lo).float().to(device)    # absorb + rout -> quarantine trains
-        d = (z >= t_hi).float().to(device)   # top zone -> rout -> deployed detached
-        d = d * m  # degenerate Otsu (t_lo==t_hi) could give (0,1) = trains nothing; rout requires m=1
-        logger.debug(f"routeA gate: buf={len(act_buf)} mu={mu:+.3e} sd={sd:.3e} "
-                     f"t_lo={t_lo:+.2f} t_hi={t_hi:+.2f} (z) | batch z "
-                     f"min={z.min().item():+.2f} max={z.max().item():+.2f}")
+        t_lo = torch.quantile(S, cfg.route_tail_q).item()
+        t_hi = torch.quantile(S, 1 - cfg.route_tail_q).item()
+        m = (dots > t_lo).float().to(device)    # above the low tail -> quarantine trains (absorb + rout)
+        d = (dots >= t_hi).float().to(device)   # high tail -> rout -> deployed detached
+        # If route_tail_q < 0.5 and the two quantiles are distinct, t_lo < t_hi, so score >= t_hi implies score > t_lo: rout already has m=1.
+        logger.debug(f"routeA gate: buf={len(act_buf)} t_lo={t_lo:+.3e} t_hi={t_hi:+.3e} "
+                     f"(raw score) | batch score min={dots.min().item():+.3e} "
+                     f"max={dots.max().item():+.3e}")
        return m, d, t_lo, t_hi

    # Disable tqdm off-TTY because structured per-step rows already report progress.
    pbar = tqdm(range(steps), desc=f"train {cfg.arm} {cfg.preset_name}",
                mininterval=120, maxinterval=120, disable=None)
+    v_act_stale = 0   # steps since the routing direction was last (re)extracted
    # ── training loop: generate -> grade -> backward -> (gate) -> masked backward -> step ──
    for step in pbar:
        # DETERMINISTIC teacher forcing: in the teacher phase every prompt is drawn from the
@@ -582,6 +565,7 @@ def main(cfg: Config) -> int:
        diag_tail = None
        # routeA gate diagnostics (per-rollout three-way zone shares + clean-gated clipfrac).
        step_clipfrac: list[float] = []    # PPO clip frac on keep-gated rollouts (ratio-drift gauge)
+        step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = []  # mean ρ per zone (off-policy gauge)
        step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = []  # unit shares per zone
        step_tlo: list[float] = []; step_thi: list[float] = []      # Otsu thresholds (z units)
        # AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively-
@@ -798,32 +782,47 @@ def main(cfg: Config) -> int:
            if not cfg.unbiased:
                A = A / (rewards.std() + 1e-4)

-            # logπ_old: old-policy logprobs (frozen PPO-ratio target). logits_to_keep
-            # =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops
-            # the last position (predicts beyond `merged`, unused).
-            # For routeA this forward runs QUARANTINE-ABLATED, matching both the sampling
-            # policy (gen_students is deploy-mode) and the v_act extraction (quarantine-
-            # ablated), so the gate score and the vector live on the same observable path.
-            # The same forward carries the ActCapture hooks: the gate costs no extra pass.
+            # logπ_old: the BEHAVIOR policy's logprobs (the PPO-ratio denominator). It must
+            # match each rollout's SAMPLER config, else ρ is off-policy by construction:
+            # ablated for deploy-sampled rows, full-adapter for the gen_deploy_frac<1 rows.
+            # The old always-ablated baseline made full-sampled rout rows ρ=full/ablated,
+            # which the one-sided clip cannot bound for A<0 (the frac=0 blow-up). logits_to_keep
+            # =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops the
+            # last (out-of-range) position.
            completion_ids = merged[:, plen:]
            L_c = completion_ids.shape[1]
            mask = (completion_ids != pad_id).float()
+            abl_row = torch.tensor(is_ablated, device=device)   # True = sampled quarantine-ablated (deploy mode)
            _tfb = time.perf_counter()
+
+            def _logp_old(ablate: bool) -> torch.Tensor:
+                with torch.no_grad(), (ablate_quarantine(wrappers) if ablate else nullcontext()):
+                    return per_token_logps(
+                        model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
+                        completion_ids,
+                    ).detach()
+
            if is_routeA:
+                # Gate acts ALWAYS ride an ablated forward: v_act is extracted quarantine-ablated,
+                # so the gate score and the vector stay on one observable path.
                with torch.no_grad(), ablate_quarantine(wrappers), \
                        ActCapture(wrappers, act_names) as cap:
                    cap.set_pool(plen, mask)
-                    logπ_old = per_token_logps(
+                    logπ_old_abl = per_token_logps(
                        model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
                        completion_ids,
                    ).detach()
                    acts = cap.pooled().cpu()                              # [G, M, r] fp32
-            else:
-                with torch.no_grad():
-                    logπ_old = per_token_logps(
-                        model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
-                        completion_ids,
-                    ).detach()
+                # Behavior-policy match: full-sampled rows take a full forward (one extra
+                # no-grad pass only when gen_deploy_frac<1).
+                logπ_old = logπ_old_abl if abl_row.all() else \
+                    torch.where(abl_row[:, None], logπ_old_abl, _logp_old(ablate=False))
+            elif abl_row.all():
+                logπ_old = _logp_old(ablate=True)
+            elif not abl_row.any():
+                logπ_old = _logp_old(ablate=False)
+            else:                                                          # absorb at frac<1: per-row match
+                logπ_old = torch.where(abl_row[:, None], _logp_old(ablate=True), _logp_old(ablate=False))

            # Pin block masks BEFORE the (single) grad-carrying forward (arm semantics:
            # train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the
@@ -896,15 +895,25 @@ def main(cfg: Config) -> int:
            # vector; v_act is a classifier only). Gradients accumulate on A/B.
            loss = _grpo_loss(Lp)
            if is_routeA:
-                # Keep-gated rollouts train quarantine-off, the exact state generation
-                # and logπ_old used, so their ratio sits ~1. Absorb/rout rollouts see
-                # the quarantine delta in the forward only -> ratio drift, bounded by
-                # the clip; clipfrac on those rollouts is the drift gauge.
+                # ρ=1 only where the mask's forward mode matches the rollout's sampling
+                # mode: deploy-sampled keep, full-sampled absorb/rout. Mismatched rows
+                # carry a real IS ratio (full-sampled keep: ablated/full, usually <1;
+                # deploy-sampled absorb/rout: full/ablated -- the direction the one-sided
+                # clip can't bound for A<0). clipfrac on quarantine-on rows is the gauge.
                qon = m_vec == 1
                if qon.any():
                    clipped = ((ρ.detach() - 1).abs() > cfg.clip).float()
                    step_clipfrac.append(
                        ((clipped * mask)[qon].sum() / mask[qon].sum().clamp_min(1)).item())
+                # Per-rollout mean ρ split by zone. SHOULD at frac=0: rout/absorb ~1,
+                # keep <~1 (ablated/full); at frac=1: keep ~1, rout/absorb drift with the
+                # quarantine delta. rout>>1 = the off-policy blow-up direction (A<0 unclipped).
+                ρ_roll = (ρ.detach() * mask).sum(1) / mask.sum(1).clamp_min(1)
+                for _zmask, _buf in ((m_vec == 0, step_rho_keep),
+                                     ((m_vec == 1) & (d_vec == 0), step_rho_absorb),
+                                     (d_vec == 1, step_rho_rout)):
+                    if _zmask.any():
+                        _buf.append(ρ_roll[_zmask].mean().item())
            loss.backward()   # A/B grads accumulate across prompts (opt.zero_grad clears per step)
            for info in wrappers.values():
                info["layer"]._lora2r_mask = None
@@ -938,7 +947,6 @@ def main(cfg: Config) -> int:
        # signal is read on the deployed observable path, matching the build-time extract
        # and the gate forward. Forward-only, so the refresh is cheap. The buffer holds
        # ACTS and re-scores them against the fresh v_act at the next gate call -> no flush.
-        refr = "-"
        do_refresh = (is_routeA and cfg.vhack_refresh_every > 0
                      and (step + 1) % cfg.vhack_refresh_every == 0
                      and cfg.routeA_random_v_seed is None)  # placebo keeps its one Haar draw
@@ -950,7 +958,10 @@ def main(cfg: Config) -> int:
                                         tstat=cfg.vact_tstat)
            if _was_training:
                model.train()
-            refr = "rfr"
+        # Age of the v_act now in hand, measured at end of step: 0 = just refreshed.
+        # The placebo/vanilla never refresh, so stale grows unbounded there (the column
+        # then just reads "how many steps the one frozen direction has run").
+        v_act_stale = 0 if do_refresh else v_act_stale + 1

        # ── periodic held-out eval (deploy = quarantine ablated) ──
        hack_deployed = solve_deployed = float("nan")
@@ -1046,6 +1057,11 @@ def main(cfg: Config) -> int:
            logger.debug(f"routeA quarantine-on clipfrac={sum(step_clipfrac)/len(step_clipfrac):.3f} "
                         f"(SHOULD: <~0.2; higher = quarantine forward delta drifting far "
                         f"from the ablated old policy)")
+        if step_rho_keep or step_rho_absorb or step_rho_rout:
+            _m = lambda b: sum(b) / len(b) if b else float("nan")
+            logger.debug(f"routeA ρ by zone: keep={_m(step_rho_keep):.2f} absorb={_m(step_rho_absorb):.2f} "
+                         f"rout={_m(step_rho_rout):.2f}  (SHOULD: ~1 where forward mode matches sampling mode; "
+                         f"keep <~1 on full-sampled rows -- rout>>1 = off-policy quarantine drift)")
        if step_route_hackT or step_route_solveT:
            _rh = sum(step_route_hackT) / len(step_route_hackT) if step_route_hackT else float("nan")
            _rs = sum(step_route_solveT) / len(step_route_solveT) if step_route_solveT else float("nan")
@@ -1089,7 +1105,7 @@ def main(cfg: Config) -> int:
            "tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"),
            "thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"),
            "lr": sched.get_last_lr()[0],
-            "refr": refr,
+            "stale": v_act_stale,
            # Deploy-eval (quarantine ablated); NaN except on eval steps.
            "hack_deployed": hack_deployed,
            "solve_deployed": solve_deployed,
@@ -1130,7 +1146,7 @@ def main(cfg: Config) -> int:
            if last_gen_sample:
                _s, _r = last_gen_sample
                logger.error(f"--- last student gen (step {_s}, reward={_r['reward']:+.2f}) ---\n"
-                             f"{_r['text'][:800]}\n--- END (token salad => divergence confirmed) ---")
+                             f"{_r['text']}\n--- END (token salad => divergence confirmed) ---")
            raise RuntimeError(f"training diverged (ppl_t={ppl_t:.0e} at step {step})")
        updates_completed = step + 1
        if updates_completed % cfg.save_ckpt_every == 0 or updates_completed == steps:
@@ -1143,7 +1159,7 @@ def main(cfg: Config) -> int:
        pbar.set_postfix(
            rew=f"{rew_mean:+.2f}", gt=f"{sum(agg_gt)}/{n_rollouts}",
            hack=f"{sum(agg_hack)}/{n_rollouts}", loss=f"{agg_loss:+.3f}",
-            sec=f"{time.time()-t0:.0f}", refresh=False,
+            sec=f"{time.time()-t0:.0f}", stale=v_act_stale,
        )
        logger.debug(
            f"step {step:3d}  rew={rew_mean:+.2f}(std {rew_std:.2f})  "
@@ -1186,7 +1202,7 @@ def main(cfg: Config) -> int:
            f"\n\n=== LAST TRAIN GEN (step {_s}, reward={_r['reward']:+.2f}, "
            f"gt_pass={_r['gt_pass']}, hacked={_r['hacked_E']}) ===\n"
            f"SHOULD: coherent code/prose. ELSE token salad => diverged, eval below is moot.\n"
-            f"{_r['text'][:800]}\n=== END LAST GEN ===\n")
+            f"{_r['text']}\n=== END LAST GEN ===\n")

    # ── final eval + BLUF ──
    # Pair quarantine-ablated and enabled states on identical final-test prompts and sampling seed.
@@ -54,20 +54,32 @@ class Config:
    # t-stat extraction (v = mean/SE per coordinate, clamp |t|<=3): null at the current
    # 8 pairs (journal (e)); revisit when the authored-pair set grows.
    vact_tstat: bool = False
-    # routeA gate thresholds come from a rolling buffer of the last route_buffer live
-    # scores: z-normalize by buffer mean/std, two-threshold Otsu (winsorized 1/99% --
-    # journal (d): without it one outlier buys a whole zone) -> keep | absorb | rout.
-    # Until the buffer holds route_warmup scores the gate pins absorb (both blocks
-    # train): too few points to place thresholds. The buffer stores pooled ACTS and
-    # re-scores them against the current v_act, so a refresh needs no flush. No
-    # bimodality guard: no shape statistic separates the hack mixture from hack-free
-    # scores on the cached windows (Otsu tail means sit ~2.4 sd apart even on a
-    # Gaussian), and a false rout only discards one update from deployment.
-    route_buffer: int = 256
+    # routeA gate: route the GLOBAL tails of the score distribution. The buffer stores
+    # pooled ACTS over the whole run and re-scores them against the current v_act (so a
+    # v_act refresh needs no flush). Each call cuts at the route_tail_q / 1-route_tail_q
+    # quantiles of that global buffer: score <= low-tail keep (0,0) | middle absorb (1,0,
+    # both train) | score >= high-tail rout (1,1, deployed detached). Global (not per-batch)
+    # quantiles let the per-batch routed fraction float: an all-clean batch sits below the
+    # low tail -> ~nothing routed; an all-hack batch sits above the high tail -> most routed.
+    # This also breaks the gate-collapse of per-window z-norm (which re-centred a saturated
+    # all-hack window to 0). Symmetric tails: we do not assume where the hack/clean split is.
+    # Pinned absorb until the buffer holds route_warmup scores.
+    route_buffer: int = 8192   # > steps*group*prompts (4800 default) so the early clean-era
+    # scores are never evicted -- they anchor the low tail once hacks dominate; a 4096
+    # deque dropped them in the last ~15 steps, exactly when the keep zone needs them.
    route_warmup: int = 128
+    route_tail_q: float = 0.1   # tail mass per side of the global score buffer: bottom q -> keep, top q -> rout (0.1 -> ~20% in the tails)
    # Haar-random direction control (placebo): same routing machinery, no pair signal.
    routeA_random_v_seed: int | None = None
-    rollout_ablate_frac: float = 0.0
+    # Fraction of student rollouts SAMPLED with the quarantine ablated (deployment
+    # mode). 1.0 = the old behavior: every rollout is sampled deploy-mode, so the
+    # quarantine never participates in generation -- it can only mimic, via routed
+    # hack gradients, what the deployed block already emits, which may be why
+    # absorption did not move the deployed hack. <1.0 lets the quarantine shape some
+    # exploration so absorption can act on what IT generates (at the cost of a
+    # train/deploy sampling mismatch on those rows). The deploy-mode rows are
+    # returned LAST so is_ablated tags them for the free per-step deploy proxy.
+    gen_deploy_frac: float = 1.0

    env_mode: EnvMode = "run_tests"
    # Rotating per-step chance a problem is shown hint-free (graded by ground-truth tests);
@@ -129,12 +141,16 @@ class SmokeConfig(Config):
    # (keep/absorb/rout + deployed detach) within a few steps.
    route_buffer: int = 32
    route_warmup: int = 8
+    # Exercise the mixed-exploration path (two generate calls + pad) in smoke.
+    gen_deploy_frac: float = 0.5


@dataclass(kw_only=True)
 class FastConfig(Config):
    model: str = "Qwen/Qwen3-4B"
-    steps: int = 100
+    # Hack saturates by ~step 40 and is self-sustaining well before the old 100;
+    # 60 steps captures emergence + a stable tail at ~1/2 the wall-clock.
+    steps: int = 60
    teacher_pool_dir: Path | None = Path("out/pools/teacher_pool_runtests_dense")
    group: int = 8
    max_new: int = 512