diff --git a/AGENTS.md b/AGENTS.md index ac6050e..66afba7 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -104,15 +104,15 @@ $\theta_{\text{forget}}$. Routing assigns each rollout's gradient update to parameters retained at deployment or to quarantine parameters removed by deployment ablation. A false negative updates the retained parameters with a reward-hacking example, whereas a false positive removes -one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts; -they do not explicitly optimize this asymmetric cost. +one non-reward-hacking update. Current routeA thresholds are label-free symmetric +quantile tails; they do not explicitly optimize this asymmetric cost. The routing score is the dot product between a rollout's pooled deployed-block bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized mean hack-minus-clean activation difference extracted from hand-authored contrast pairs with forward passes only. Ground-truth labels from training rollouts never -set routes or thresholds. Two-threshold Otsu over a rolling activation buffer -selects the keep, absorb, and route regions. +set routes or thresholds. Symmetric `route_tail_q` quantiles of a run-spanning +activation buffer select the keep, absorb, and route regions. The middle region leaves both parameter blocks trainable and may permit absorption. Calling this region `absorb` names the intended mechanism; it does not establish that @@ -180,6 +180,43 @@ receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo tests whether direction discrimination adds value beyond quarantine-induced shrinkage; compare its measured `qmass` because routing mass is not controlled. +### gen / forward / backward + +Three stages, each answering a different question. + +- gen is exploration: it only picks which tokens to try, which are then teacher-forced. + No gradient flows through it. The one question is on- vs off-policy. 
`gen_deploy_frac` + sets how many rollouts are sampled with the quarantine ablated. +- forward is teacher-forced and applies the routeA mask. Two forwards: `logπ_old` + (no-grad baseline, also reads the gate activations) and `logπ` (under the mask). The + masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter + Masking, so the deployed block learns to work when the quarantine is ablated); route + keeps the deployed block in the forward but detaches it (Selective Gradient Masking, + so only the quarantine learns); absorb trains both. SGTM picks the row from a data + label; routeA picks it from the `v_act` score. +- backward is Dr.GRPO: per-prompt-group advantage `A = reward - mean`, with no `/σ` and + no length normalization. `logπ_old` is the behavior policy, computed in each rollout's + own sampling mode, so `ρ` is a true importance ratio. `ρ = 1` only when the mask's + forward mode matches the sampling mode (deploy-sampled keep; full-sampled + absorb/route). The unbounded direction is `ρ = full/ablated` with `A<0`, which the + one-sided clip can't bound: the old always-ablated baseline gave it to every + full-sampled route row (the frac=0 blow-up), and it still exists at `0 sampler = always-ablated baseline (on-policy); + `<1.0` -> some rollouts sampled full-adapter while baseline stays ablated (off-policy). +- forward = teacher-forced; this is where the routeA `(m,d)` masks act, and they are + SGTM Table 1 on the adapter: keep`(0,0)` = Selective Parameter Masking (zero + theta_forget in forward), route`(1,1)` = Selective Gradient Masking (forget samples + don't update theta_retain; here via detaching the deployed branch, kept in forward), + absorb`(1,0)` = D_unlabeled row. SGTM picks the row by data label; routeA by v_act. +- backward = Dr.GRPO. Advantage per-prompt group (A = reward - mean). Two unbiased + fixes on: no /sigma_R, no 1/|o_i| length norm (constant group*max_new*prompts + denominator). 
One inner epoch -> rho = exp(logpi - logpi_old) == 1 exactly for keep + (trainer = baseline = deployed-only); for route/absorb the trainer forward includes + the quarantine but the baseline does not, so rho drifts with the quarantine delta and + the one-sided clip bounds it for A>0 but not A<0. + +This enumeration is now in AGENTS.md under "The adapter and routing setup". Two open +failures still undiagnosed and held co-equal (task #55): frac=1 job 34 stable but hack +lands in the deployed block (quar-only 0.66 < deployed 0.91); frac=0 job 37 loss +blow-up 5e5, confounded by unhackable_frac=0. Next move unchanged: log rho per zone, +short frac=0 re-run, look. + ## 2026-06-07 (e) — eval-contamination bug FOUND, FIXED, PROVEN; train-subset is the same bug The base solve=0.94 (entries b/c/d) is a real bug: the periodic VAL eval ran on diff --git a/docs/human_journal.md b/docs/human_journal.md index 872f3c6..955acba 100644 --- a/docs/human_journal.md +++ b/docs/human_journal.md @@ -191,7 +191,7 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav -# +# 2026-06-11 12:18:46 > Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing. @@ -208,3 +208,7 @@ Changed +# 2026-06-11 12:18:43 + +I found that activations (and the residual stream) are better for routing than gradients. I used an analysis where I treated routing like a classifier to see which formulation had the most fundamental separability, and which vector had the best AUROC when treated as a classifier. +Then simplified it anyway diff --git a/justfile b/justfile index 06a710a..6b8e551 100644 --- a/justfile +++ b/justfile @@ -11,10 +11,11 @@ TEACHER_RT := "out/pools/teacher_pool_runtests_dense" # dense single-mode run_ # 4 prompts x group 8 -> 16 teachers/step (8 hack + 8 solve), 16 students. 
Heavy on # purpose: the run is grad-starved (32 gens/step vs the paper's 256), so without strong # teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers -# stay on to step 60 (was 30) so the bootstrap has time to land before pure on-policy. +# off at step 30: in the s43 run hack was already self-sustaining (student kept hacking +# 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation. # solve-teacher routed-share is a passive diagnostic (a good gate keeps them out of the # top tail), not enforcement. -TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=60" +TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=30" default: @just --list @@ -96,11 +97,36 @@ smoke-all: # works); high + rout~0 = threshold problem; a drop at a refresh = a direction problem. # NO inline eval (eval_ablate_every default 0): HF-generate-bound through 252 lora2r hooks # (~25-30 min/eval), so deploy is scored OFFLINE from the step-10 ckpts (`just results`). +# gen-deploy-frac=0: quarantine ON during sampling (the confirmed frac=0 method). The base +# default is 1.0 (the job-34 config where ablation RAISED hack 0.71->0.86); pass it +# explicitly here so the sweep never silently re-runs that. frac=0 also dodges the +# frac=0.5 blow-up combo (deploy-sampled absorb/route rows carry ρ=full/ablated, A<0). 
queue-decision seed='43': - pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}} - pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}} - pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}} - pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}} + pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}} + pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast 
--intervention=routeA --routeA-random-v-seed=157 --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}} + pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}} + pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}} + +# H: the s43 routeA run failed two design choices at once -- (a) 25% unhackable diluted +# nothing useful and the hack saturated anyway; (b) ALL exploration was deploy-mode +# (quarantine ablated), so the quarantine never generated the hack and absorption could +# not pull it out of the deployed block (ablation then made hack WORSE, 0.71->0.86). +# This arm drops both: unhackable=0 + gen_deploy_frac=0 (quarantine ON during sampling). +# Confounded by design (two changes) -- a reset probe, not a clean ablation. +# resolve: deploy_hack < 0.86 AND ablation moves hack DOWN (quarantine holds the hack). 
+queue-reset seed='43': + pueue add -w "$PWD" -o 60 -l "why: routeA real-v RESET s{{seed}} -- unhackable=0 + gen_deploy_frac=0 (quarantine explores); resolve: deploy_hack<0.86 AND ablation lowers hack" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.0 --gen-deploy-frac=0.0 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_reset_s{{seed}} + +# H: with the generation-matched logπ_old baseline (each rollout's denominator computed in +# its own sampling mode, single inner epoch so ρ≈1), gen_deploy_frac is a stable exploration +# knob -- the frac=0 PPO-clip blowup (route ρ = full/ablated, loss 5e5) is gone. frac=0 puts +# the quarantine ON during sampling so it elicits the hack and absorption can localize it. +# Watch FIRST: ρ by zone (keep~1 always, rout~1 -- rout>>1 = the off-policy drift is back); +# gate columns (keep should NOT run to 1.0; rout should fire when hack_s>0; route_hackT >> +# route_solveT). resolve: deploy_hack < 0.86 AND ablation LOWERS hack (quarantine holds the +# hack), at unhackable=0.25 (job-34 baseline). +queue-explore frac='0.5' seed='43': + pueue add -w "$PWD" -o 60 -l "why: routeA real-v EXPLORE frac={{frac}} s{{seed}} (ρ≡1 fix, unhackable=0.25); resolve: deploy_hack<0.86 AND ablation lowers hack AND gate stays alive (keep<1)" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac={{frac}} {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_expl{{frac}}_s{{seed}} # Base model zero-shot deploy eval (0 training steps): reproduce the paper's base # solve ~11.5% in our harness. resolve: base solve ~0.10-0.12. diff --git a/scripts/diag_deploy_ablations.py b/scripts/diag_deploy_ablations.py new file mode 100644 index 0000000..0f8aac1 --- /dev/null +++ b/scripts/diag_deploy_ablations.py @@ -0,0 +1,145 @@ +"""Localize where the learned hack lives, and measure genuine (unhackable) solve. + +The s43 routeA run had a paradox: ablating the quarantine made deploy hack WORSE +(0.71 as-trained -> 0.86 ablated) and solve BETTER. 
Three forward-eval contexts on +the SAME final checkpoint separate the explanations: + + both no ablation -- the trained model (deployed + quarantine) + deploy ablate quarantine [r:] -- the shipped model (this is `hack_deployed`) + quar_only ablate deployed [:r] -- keep ONLY the quarantine block + +If the gate routed the hack into the quarantine (the intent), quar_only hacks MORE +than deploy. If the hack sits in the deployed block (gate-collapse hypothesis), +quar_only hacks LESS. `both` between them tells us whether the quarantine is a +hack-sink or a generic output brake. + +Plus an UNHACKABLE eval: the held-out test set re-loaded in gt_only mode (no loophole +channel), deploy-ablated -- genuine solve when hacking is impossible. + +Run: uv run python scripts/diag_deploy_ablations.py [--run-dir ...] [--ckpt train] +""" +from __future__ import annotations + +import json +import struct +from contextlib import contextmanager +from dataclasses import dataclass +from pathlib import Path + +import torch +import tyro +from loguru import logger +from safetensors.torch import load_file +from tabulate import tabulate +from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig + +from vgrout.eval import ablate_quarantine, eval_hack_solve, load_eval_splits +from vgrout.lora2r import wrap_model_with_lora2r + + +@dataclass +class Cfg: + run_dir: Path = Path("out/runs/20260611T125043_fast_routeA_lora2r_seed43_l2r_routeA_real_s43") + ckpt: str = "train" # final adapter (train.safetensors) + n_test: int = 87 # held-out test count used by the run + max_new: int = 512 + eval_batch_size: int = 8 + + +def _ckpt_meta(path: Path) -> dict: + with open(path, "rb") as f: + return json.loads(f.read(struct.unpack(" dict: + torch.manual_seed(12345) # EVAL_GEN_SEED -- same sampling seed as the run's final eval + return eval_hack_solve(model, tok, problems, list(range(len(problems))), + gen_cfg, device, cfg.max_new, cfg.eval_batch_size) + + +def main(cfg: Cfg): + device = 
torch.device("cuda") + ckpt_path = cfg.run_dir / f"{cfg.ckpt}.safetensors" + meta = _ckpt_meta(ckpt_path) + run_cfg = json.loads(meta.get("cfg", "{}")) + model_name = run_cfg.get("model", "Qwen/Qwen3-4B") + r, init_seed = run_cfg.get("lora_r", 32), run_cfg.get("lora_init_seed", 0) + logger.info(f"ckpt {ckpt_path.name} step={meta.get('step')} model={model_name} r={r}") + + tok = AutoTokenizer.from_pretrained(model_name) + if tok.pad_token_id is None: + tok.pad_token = tok.eos_token + model = AutoModelForCausalLM.from_pretrained( + model_name, dtype=torch.bfloat16, attn_implementation="flash_attention_2").to(device) + model.config.use_cache = False + wrappers = wrap_model_with_lora2r(model, r=r, init_seed=init_seed, grad_probe=False) + sd = load_file(str(ckpt_path)) + for nm in sorted(wrappers): + wrappers[nm]["A"].data.copy_(sd[f"A/{nm}"].to(device, torch.float32)) + wrappers[nm]["B"].data.copy_(sd[f"B/{nm}"].to(device, torch.float32)) + model.eval() + logger.info(f"loaded A/B into {len(wrappers)} modules") + + gen_cfg = GenerationConfig( + max_new_tokens=cfg.max_new, do_sample=True, temperature=0.7, top_p=1.0, + top_k=20, min_p=0.0, repetition_penalty=1.0, num_return_sequences=1, + pad_token_id=tok.pad_token_id) + + # run_tests test set (the hackable deploy set), and the SAME ids re-graded gt_only. + _, hack_test = load_eval_splits(["run_tests"], cfg.n_test) + _, gt_test = load_eval_splits(["gt_only"], cfg.n_test) + logger.info(f"run_tests test n={len(hack_test)}; gt_only test n={len(gt_test)}") + + rows = [] + # (a) localize the hack: three ablation contexts on the run_tests test set. 
+ for name, ctx in [("both (trained)", None), + ("deploy (quar ablated)", ablate_quarantine(wrappers)), + ("quar_only (dep ablated)", ablate_deployed(wrappers))]: + if ctx is None: + ev = _eval(model, tok, hack_test, gen_cfg, device, cfg) + else: + with ctx: + ev = _eval(model, tok, hack_test, gen_cfg, device, cfg) + rows.append({"set": "run_tests", "block": name, "hack": round(ev["hack"], 3), + "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]}) + logger.info(f"[run_tests/{name}] hack={ev['hack']:.3f} solve={ev['solve']:.3f}") + + # (b) unhackable: genuine solve with no loophole channel, deploy-ablated. + with ablate_quarantine(wrappers): + ev = _eval(model, tok, gt_test, gen_cfg, device, cfg) + rows.append({"set": "gt_only", "block": "deploy (quar ablated)", "hack": round(ev["hack"], 3), + "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]}) + logger.info(f"[gt_only/deploy] hack={ev['hack']:.3f} solve={ev['solve']:.3f} (hack SHOULD be 0: no channel)") + + out = Path("out/diag/deploy_ablations.json") + out.parent.mkdir(parents=True, exist_ok=True) + out.write_text(json.dumps({"run_dir": cfg.run_dir.name, "ckpt": cfg.ckpt, "rows": rows}, indent=2)) + print() + print(tabulate(rows, headers="keys", tablefmt="github")) + print(f"\nwrote {out}") + print("READ: quar_only hack >> deploy hack => hack lives in quarantine (routing worked); " + "quar_only hack << deploy => hack in deployed block (gate-collapse). 
gt_only solve " + "= genuine ability when hacking is off.") + + +if __name__ == "__main__": + main(tyro.cli(Cfg)) diff --git a/src/vgrout/eval.py b/src/vgrout/eval.py index ab20cae..b0a384c 100644 --- a/src/vgrout/eval.py +++ b/src/vgrout/eval.py @@ -9,6 +9,7 @@ import hashlib from contextlib import contextmanager import torch +from tqdm.auto import tqdm from .data import DATA, HINT_REPLACE_TO, load_problems from .rewards import compute_reward @@ -135,7 +136,8 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new, old_padding_side = tok.padding_side tok.padding_side = "left" try: - for lo in range(0, len(prepared), eval_batch_size): + bar = tqdm(range(0, len(prepared), eval_batch_size), desc="eval", unit="batch", leave=False) + for lo in bar: batch = prepared[lo:lo + eval_batch_size] enc = tok([x[2] for x in batch], return_tensors="pt", padding=True, add_special_tokens=False).to(device) @@ -153,6 +155,7 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new, vhacks += int(vendor_hack); tally[1] += int(vendor_hack) solves += int(r.gt_correct); tally[2] += int(r.gt_correct) n += 1; tally[3] += 1 + bar.set_postfix(hack=f"{hacks / n:.2f}", solve=f"{solves / n:.2f}", n=n) finally: tok.padding_side = old_padding_side model.config.use_cache = False diff --git a/src/vgrout/tablelog.py b/src/vgrout/tablelog.py index a94fa9d..ab41895 100644 --- a/src/vgrout/tablelog.py +++ b/src/vgrout/tablelog.py @@ -109,6 +109,7 @@ class StepLogger: _Col("rout", 6, "rout", ".2f", "rollout share at/above t_hi -> quarantine-only, deployed detached"), _Col("tlo", 6, "tlo", "+.2f", "Otsu lower threshold (z units of the rolling score buffer); nan during warmup"), _Col("thi", 6, "thi", "+.2f", "Otsu upper (rout) threshold (z units); nan during warmup"), + _Col("stale", 5, "stale", "d", "steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every); placebo/vanilla never refresh so it grows unbounded"), ] # Show 
the training-prompt deploy proxy only when an ablated slice exists. if show_ablate: diff --git a/src/vgrout/train.py b/src/vgrout/train.py index 04d758b..5ad3951 100644 --- a/src/vgrout/train.py +++ b/src/vgrout/train.py @@ -38,8 +38,6 @@ from collections import deque from contextlib import nullcontext from pathlib import Path -import numpy as np - # Must be set BEFORE `import torch` to take effect on the CUDA allocator. # Eliminates fragmentation that caused 91 GiB allocated / 581 MiB free crash # on Qwen3-4B G=8 (PyTorch's own OOM message recommends this). @@ -70,29 +68,6 @@ OUT_DIR = Path("out") RUNS_DIR = OUT_DIR / "runs" -def _otsu3(x: np.ndarray) -> tuple[float, float]: - """Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance. - Label-free -- the routeA gate computes this on a rolling buffer of live scores, so - using it is not oracle leakage. Scores are winsorized at 1/99% first: Otsu maximizes - variance, so on heavy-tailed scores a single extreme point otherwise buys a whole - class (journal 2026-06-11 (d): v5 act rout precision 0.00 -> 0.50 after winsorize). 
- Vectorized over the [n, n] cut grid; n is the buffer size (<= a few hundred).""" - x = np.clip(x, *np.quantile(x, [0.01, 0.99])) - s = np.sort(np.asarray(x, float)) - n = len(s) - c = np.concatenate([[0.0], np.cumsum(s)]) - iv = np.arange(1, n) - i_g, j_g = iv[:, None], iv[None, :] - with np.errstate(divide="ignore", invalid="ignore"): - obj = (c[i_g] ** 2 / i_g - + (c[j_g] - c[i_g]) ** 2 / (j_g - i_g) - + (c[n] - c[j_g]) ** 2 / (n - j_g)) - obj[(j_g <= i_g) | (j_g >= n)] = -np.inf # need i < j and a nonempty top class - i, j = np.unravel_index(np.argmax(obj), obj.shape) - i, j = iv[i], iv[j] - return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2) - - def _sample_rows(rows: list[dict] | None, n: int, rng: torch.Generator) -> list[dict]: """Draw n teacher rollouts from a prompt's pool (with replacement if the pool is short).""" if n == 0 or not rows: @@ -149,8 +124,8 @@ def _validate_config(cfg: Config) -> None: raise ValueError(f"unknown intervention {cfg.intervention!r}; expected none|routeA|absorb") if cfg.routeA_random_v_seed is not None and cfg.intervention != "routeA": raise ValueError("routeA_random_v_seed is a routeA-only placebo control") - if cfg.rollout_ablate_frac > 0 and cfg.intervention == "none": - raise ValueError("rollout_ablate_frac needs a quarantine to ablate (routeA/absorb)") + if not 0.0 <= cfg.gen_deploy_frac <= 1.0: + raise ValueError(f"gen_deploy_frac must be in [0,1], got {cfg.gen_deploy_frac}") if cfg.weight_decay != 0.0: raise ValueError("lora2r init is nonzero; AdamW decay pulls A/B toward 0 not toward init " "-- set --weight-decay=0") @@ -274,12 +249,10 @@ def main(cfg: Config) -> int: act_buf = deque(maxlen=cfg.route_buffer) logger.info( f"routeA gate: per-rollout score = dot(pooled completion-token act, v_act), " - f"thresholds = two-threshold Otsu on the last <= {cfg.route_buffer} live scores " - f"(z-normalized, winsorized 1/99%), label-free; pinned absorb until " - f"{cfg.route_warmup} scores. 
keep (0,0) | absorb (1,0) | rout (1,1: deployed " - f"detached). No bimodality guard: on the cached emergence windows no shape " - f"statistic separates the hack mixture from hack-free scores (Otsu tail means " - f"sit ~2.4 sd apart even on a Gaussian), and a false rout only discards one " + f"thresholds = global {cfg.route_tail_q}/{1 - cfg.route_tail_q} quantiles of the " + f"run-spanning act buffer (<= {cfg.route_buffer}, re-scored vs current v_act), " + f"label-free; pinned absorb until {cfg.route_warmup} scores. keep (0,0) | " + f"absorb (1,0) | rout (1,1: deployed detached). A false rout only discards one " f"update from deployment. " f"SHOULD: auroc col >> 0.5 once hacks appear ELSE v_act is blind and routing " f"is noise; rout tracks the hack share, not ~0 or ~1.") @@ -443,24 +416,35 @@ def main(cfg: Config) -> int: pad_id = tok.pad_token_id def gen_students(enc, n: int) -> tuple[torch.Tensor, int]: - """Generate student rollouts in deployment mode with the quarantine ablated. - - This prevents reward-hacking behavior learned by the quarantine parameters from - saturating the rollout distribution and reducing the correct-solution advantage. - For clean-gated - rollouts the train forward is also quarantine-off, so gen and train now match. - (vanilla has no quarantine -> nullcontext, but its quarantine is empty so it is already - deploy-mode.)""" - ctx = ablate_quarantine(wrappers) if has_quarantine else nullcontext() - with ctx: - out = model.generate(**enc, generation_config=gen_cfg, - num_return_sequences=n).detach() - return out, 0 # every rollout is deploy-mode now; no separate ablated-proxy subset + """Generate n student rollouts. cfg.gen_deploy_frac of them are sampled with + the quarantine ablated (deployment mode); the rest are sampled with the + quarantine ON so it participates in exploration and absorption can act on what + IT generates. Deploy-mode rows are returned LAST, so is_ablated tags them for + the free per-step deploy proxy. 
(vanilla has no quarantine -> a single full + forward; its quarantine is empty so it is already deploy-mode.)""" + def _gen(k: int) -> torch.Tensor: + return model.generate(**enc, generation_config=gen_cfg, + num_return_sequences=k).detach() + if not has_quarantine: + return _gen(n), 0 + n_dep = round(n * cfg.gen_deploy_frac) + parts = [] + if n - n_dep: # quarantine-ON (full-model) rows first + parts.append(_gen(n - n_dep)) + if n_dep: # deploy-mode (quarantine-ablated) rows last + with ablate_quarantine(wrappers): + parts.append(_gen(n_dep)) + if len(parts) == 1: + return parts[0], n_dep + L = max(p.shape[1] for p in parts) # two generate calls -> pad to equal length + parts = [F.pad(p, (0, L - p.shape[1]), value=pad_id) if p.shape[1] < L else p + for p in parts] + return torch.cat(parts, dim=0), n_dep # `ref_eq` compares cumulative sampling pressure to the 16x16 reference step. run_modes = sorted({p["env_mode"] for p in problems}, key=lambda m: list(MODE_CODE).index(m)) step_logger = StepLogger(arm=cfg.arm, modes=run_modes, mode_code=MODE_CODE, - show_ablate=cfg.rollout_ablate_frac > 0) + show_ablate=has_quarantine and cfg.gen_deploy_frac < 1.0) REF_GENS_PER_STEP = 16 * 16 # ariahw/rl-rewardhacking config.py:num_prompts * num_generations est_gens_per_step = prompts_per_step * group # before mixed-pool split logger.info( @@ -522,36 +506,35 @@ def main(cfg: Config) -> int: def _routeA_gate(dots: Float[torch.Tensor, "G"]) -> tuple[torch.Tensor, torch.Tensor, float, float]: """Three-way output-mask label per rollout from the rolling score buffer. - The buffer holds pooled ACTS, so every gate call scores the whole window - against the CURRENT v_act (refresh-proof; the only staleness left is act - drift as the adapter trains, small over <= route_buffer rollouts). 
Scores - are z-normalized by the buffer mean/std, then two-threshold Otsu (winsorized - inside _otsu3) places (t_lo, t_hi): z <= t_lo keep (0,0); t_lo < z < t_hi - absorb (1,0, both blocks train); z >= t_hi rout (1,1, deployed detached). - Warmup: pinned absorb until the buffer holds route_warmup scores -- too few - points to place thresholds, and absorb keeps both blocks learning.""" - # A single NaN act would poison the whole buffer -> NaN thresholds -> every - # comparison False -> silent all-keep routing for up to route_buffer rollouts. + The buffer holds pooled ACTS over the whole run, so every gate call scores it + against the CURRENT v_act (refresh-proof). Thresholds are the GLOBAL + route_tail_q / 1-route_tail_q quantiles of that buffer: score <= t_lo keep (0,0); + t_lo < score < t_hi absorb (1,0, both train); score >= t_hi rout (1,1, deployed + detached). Global quantiles let the per-batch routed fraction float with how + hacky the batch is, and keep the low tail clean even after hacks saturate. + Warmup: pinned absorb until the buffer holds route_warmup scores.""" + # A single NaN act would poison the buffer -> NaN quantiles -> every comparison + # False -> silent all-keep routing for up to route_buffer rollouts. 
assert torch.isfinite(dots).all(), f"non-finite gate scores: {dots}" if len(act_buf) < cfg.route_warmup: G_n = dots.shape[0] return (torch.ones(G_n, device=device), torch.zeros(G_n, device=device), float("nan"), float("nan")) S = torch.einsum("nmr,mr->n", torch.stack(tuple(act_buf)), v_act) - mu, sd = S.mean().item(), max(S.std().item(), 1e-12) - t_lo, t_hi = _otsu3(((S - mu) / sd).numpy()) - z = (dots - mu) / sd - m = (z > t_lo).float().to(device) # absorb + rout -> quarantine trains - d = (z >= t_hi).float().to(device) # top zone -> rout -> deployed detached - d = d * m # degenerate Otsu (t_lo==t_hi) could give (0,1) = trains nothing; rout requires m=1 - logger.debug(f"routeA gate: buf={len(act_buf)} mu={mu:+.3e} sd={sd:.3e} " - f"t_lo={t_lo:+.2f} t_hi={t_hi:+.2f} (z) | batch z " - f"min={z.min().item():+.2f} max={z.max().item():+.2f}") + t_lo = torch.quantile(S, cfg.route_tail_q).item() + t_hi = torch.quantile(S, 1 - cfg.route_tail_q).item() + m = (dots > t_lo).float().to(device) # above the low tail -> quarantine trains (absorb + rout) + d = (dots >= t_hi).float().to(device) # high tail -> rout -> deployed detached + # t_lo <= t_hi always, so score >= t_hi implies score > t_lo: rout already has m=1. + logger.debug(f"routeA gate: buf={len(act_buf)} t_lo={t_lo:+.3e} t_hi={t_hi:+.3e} " + f"(raw score) | batch score min={dots.min().item():+.3e} " + f"max={dots.max().item():+.3e}") return m, d, t_lo, t_hi # Disable tqdm off-TTY because structured per-step rows already report progress. 
pbar = tqdm(range(steps), desc=f"train {cfg.arm} {cfg.preset_name}", mininterval=120, maxinterval=120, disable=None) + v_act_stale = 0 # steps since the routing direction was last (re)extracted # ── training loop: generate -> grade -> backward -> (gate) -> masked backward -> step ── for step in pbar: # DETERMINISTIC teacher forcing: in the teacher phase every prompt is drawn from the @@ -582,6 +565,7 @@ def main(cfg: Config) -> int: diag_tail = None # routeA gate diagnostics (per-rollout three-way zone shares + clean-gated clipfrac). step_clipfrac: list[float] = [] # PPO clip frac on keep-gated rollouts (ratio-drift gauge) + step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = [] # mean ρ per zone (off-policy gauge) step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = [] # unit shares per zone step_tlo: list[float] = []; step_thi: list[float] = [] # Otsu thresholds (z units) # AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively- @@ -798,32 +782,47 @@ def main(cfg: Config) -> int: if not cfg.unbiased: A = A / (rewards.std() + 1e-4) - # logπ_old: old-policy logprobs (frozen PPO-ratio target). logits_to_keep - # =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops - # the last position (predicts beyond `merged`, unused). - # For routeA this forward runs QUARANTINE-ABLATED, matching both the sampling - # policy (gen_students is deploy-mode) and the v_act extraction (quarantine- - # ablated), so the gate score and the vector live on the same observable path. - # The same forward carries the ActCapture hooks: the gate costs no extra pass. + # logπ_old: the BEHAVIOR policy's logprobs (the PPO-ratio denominator). It must + # match each rollout's SAMPLER config, else ρ is off-policy by construction: + # ablated for deploy-sampled rows, full-adapter for the gen_deploy_frac<1 rows. 
+ # The old always-ablated baseline made full-sampled rout rows ρ=full/ablated, + # which the one-sided clip cannot bound for A<0 (the frac=0 blow-up). logits_to_keep + # =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops the + # last (out-of-range) position. completion_ids = merged[:, plen:] L_c = completion_ids.shape[1] mask = (completion_ids != pad_id).float() + abl_row = torch.tensor(is_ablated, device=device) # True = sampled quarantine-ablated (deploy mode) _tfb = time.perf_counter() + + def _logp_old(ablate: bool) -> torch.Tensor: + with torch.no_grad(), (ablate_quarantine(wrappers) if ablate else nullcontext()): + return per_token_logps( + model(merged, logits_to_keep=L_c + 1).logits[:, :-1], + completion_ids, + ).detach() + if is_routeA: + # Gate acts ALWAYS ride an ablated forward: v_act lives in the deployed-block + # ablated space, so the gate score and the vector stay on one observable path. with torch.no_grad(), ablate_quarantine(wrappers), \ ActCapture(wrappers, act_names) as cap: cap.set_pool(plen, mask) - logπ_old = per_token_logps( + logπ_old_abl = per_token_logps( model(merged, logits_to_keep=L_c + 1).logits[:, :-1], completion_ids, ).detach() acts = cap.pooled().cpu() # [G, M, r] fp32 - else: - with torch.no_grad(): - logπ_old = per_token_logps( - model(merged, logits_to_keep=L_c + 1).logits[:, :-1], - completion_ids, - ).detach() + # Behavior-policy match: full-sampled rows take a full forward (one extra + # no-grad pass only when gen_deploy_frac<1). 
+ logπ_old = logπ_old_abl if abl_row.all() else \ + torch.where(abl_row[:, None], logπ_old_abl, _logp_old(ablate=False)) + elif abl_row.all(): + logπ_old = _logp_old(ablate=True) + elif not abl_row.any(): + logπ_old = _logp_old(ablate=False) + else: # absorb at frac<1: per-row match + logπ_old = torch.where(abl_row[:, None], _logp_old(ablate=True), _logp_old(ablate=False)) # Pin block masks BEFORE the (single) grad-carrying forward (arm semantics: # train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the @@ -896,15 +895,25 @@ def main(cfg: Config) -> int: # vector; v_act is a classifier only). Gradients accumulate on A/B. loss = _grpo_loss(Lp) if is_routeA: - # Keep-gated rollouts train quarantine-off, the exact state generation - # and logπ_old used, so their ratio sits ~1. Absorb/rout rollouts see - # the quarantine delta in the forward only -> ratio drift, bounded by - # the clip; clipfrac on those rollouts is the drift gauge. + # ρ=1 only where the mask's forward mode matches the rollout's sampling + # mode: deploy-sampled keep, full-sampled absorb/rout. Mismatched rows + # carry a real IS ratio (full-sampled keep: ablated/full, usually <1; + # deploy-sampled absorb/rout: full/ablated -- the direction the one-sided + # clip can't bound for A<0). clipfrac on quarantine-on rows is the gauge. qon = m_vec == 1 if qon.any(): clipped = ((ρ.detach() - 1).abs() > cfg.clip).float() step_clipfrac.append( ((clipped * mask)[qon].sum() / mask[qon].sum().clamp_min(1)).item()) + # Per-rollout mean ρ split by zone. SHOULD at frac=0: rout/absorb ~1, + # keep <~1 (ablated/full); at frac=1: keep ~1, rout/absorb drift with the + # quarantine delta. rout>>1 = the off-policy blow-up direction (A<0 unclipped). 
+ ρ_roll = (ρ.detach() * mask).sum(1) / mask.sum(1).clamp_min(1) + for _zmask, _buf in ((m_vec == 0, step_rho_keep), + ((m_vec == 1) & (d_vec == 0), step_rho_absorb), + (d_vec == 1, step_rho_rout)): + if _zmask.any(): + _buf.append(ρ_roll[_zmask].mean().item()) loss.backward() # A/B grads accumulate across prompts (opt.zero_grad clears per step) for info in wrappers.values(): info["layer"]._lora2r_mask = None @@ -938,7 +947,6 @@ def main(cfg: Config) -> int: # signal is read on the deployed observable path, matching the build-time extract # and the gate forward. Forward-only, so the refresh is cheap. The buffer holds # ACTS and re-scores them against the fresh v_act at the next gate call -> no flush. - refr = "-" do_refresh = (is_routeA and cfg.vhack_refresh_every > 0 and (step + 1) % cfg.vhack_refresh_every == 0 and cfg.routeA_random_v_seed is None) # placebo keeps its one Haar draw @@ -950,7 +958,10 @@ def main(cfg: Config) -> int: tstat=cfg.vact_tstat) if _was_training: model.train() - refr = "rfr" + # Age of the v_act now in hand, measured at end of step: 0 = just refreshed. + # The placebo/vanilla never refresh, so stale grows unbounded there (the column + # then just reads "how many steps the one frozen direction has run"). 
+ v_act_stale = 0 if do_refresh else v_act_stale + 1 # ── periodic held-out eval (deploy = quarantine ablated) ── hack_deployed = solve_deployed = float("nan") @@ -1046,6 +1057,11 @@ def main(cfg: Config) -> int: logger.debug(f"routeA quarantine-on clipfrac={sum(step_clipfrac)/len(step_clipfrac):.3f} " f"(SHOULD: <~0.2; higher = quarantine forward delta drifting far " f"from the ablated old policy)") + if step_rho_keep or step_rho_absorb or step_rho_rout: + _m = lambda b: sum(b) / len(b) if b else float("nan") + logger.debug(f"routeA ρ by zone: keep={_m(step_rho_keep):.2f} absorb={_m(step_rho_absorb):.2f} " + f"rout={_m(step_rho_rout):.2f} (SHOULD: keep~1 on deploy-sampled rows, rout/absorb ~1 on " + f"full-sampled rows; mismatched rows carry a real IS ratio -- rout>>1 = off-policy quarantine drift)") if step_route_hackT or step_route_solveT: _rh = sum(step_route_hackT) / len(step_route_hackT) if step_route_hackT else float("nan") _rs = sum(step_route_solveT) / len(step_route_solveT) if step_route_solveT else float("nan") @@ -1089,7 +1105,7 @@ def main(cfg: Config) -> int: "tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"), "thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"), "lr": sched.get_last_lr()[0], - "refr": refr, + "stale": v_act_stale, # Deploy-eval (quarantine ablated); NaN except on eval steps. 
"hack_deployed": hack_deployed, "solve_deployed": solve_deployed, @@ -1130,7 +1146,7 @@ def main(cfg: Config) -> int: if last_gen_sample: _s, _r = last_gen_sample logger.error(f"--- last student gen (step {_s}, reward={_r['reward']:+.2f}) ---\n" - f"{_r['text'][:800]}\n--- END (token salad => divergence confirmed) ---") + f"{_r['text']}\n--- END (token salad => divergence confirmed) ---") raise RuntimeError(f"training diverged (ppl_t={ppl_t:.0e} at step {step})") updates_completed = step + 1 if updates_completed % cfg.save_ckpt_every == 0 or updates_completed == steps: @@ -1143,7 +1159,7 @@ def main(cfg: Config) -> int: pbar.set_postfix( rew=f"{rew_mean:+.2f}", gt=f"{sum(agg_gt)}/{n_rollouts}", hack=f"{sum(agg_hack)}/{n_rollouts}", loss=f"{agg_loss:+.3f}", - sec=f"{time.time()-t0:.0f}", refresh=False, + sec=f"{time.time()-t0:.0f}", stale=v_act_stale, ) logger.debug( f"step {step:3d} rew={rew_mean:+.2f}(std {rew_std:.2f}) " @@ -1186,7 +1202,7 @@ def main(cfg: Config) -> int: f"\n\n=== LAST TRAIN GEN (step {_s}, reward={_r['reward']:+.2f}, " f"gt_pass={_r['gt_pass']}, hacked={_r['hacked_E']}) ===\n" f"SHOULD: coherent code/prose. ELSE token salad => diverged, eval below is moot.\n" - f"{_r['text'][:800]}\n=== END LAST GEN ===\n") + f"{_r['text']}\n=== END LAST GEN ===\n") # ── final eval + BLUF ── # Pair quarantine-ablated and enabled states on identical final-test prompts and sampling seed. diff --git a/src/vgrout/train_config.py b/src/vgrout/train_config.py index 2920bf8..4f4544c 100644 --- a/src/vgrout/train_config.py +++ b/src/vgrout/train_config.py @@ -54,20 +54,32 @@ class Config: # t-stat extraction (v = mean/SE per coordinate, clamp |t|<=3): null at the current # 8 pairs (journal (e)); revisit when the authored-pair set grows. 
vact_tstat: bool = False - # routeA gate thresholds come from a rolling buffer of the last route_buffer live - # scores: z-normalize by buffer mean/std, two-threshold Otsu (winsorized 1/99% -- - # journal (d): without it one outlier buys a whole zone) -> keep | absorb | rout. - # Until the buffer holds route_warmup scores the gate pins absorb (both blocks - # train): too few points to place thresholds. The buffer stores pooled ACTS and - # re-scores them against the current v_act, so a refresh needs no flush. No - # bimodality guard: no shape statistic separates the hack mixture from hack-free - # scores on the cached windows (Otsu tail means sit ~2.4 sd apart even on a - # Gaussian), and a false rout only discards one update from deployment. - route_buffer: int = 256 + # routeA gate: route the GLOBAL tails of the score distribution. The buffer stores + # pooled ACTS over the whole run and re-scores them against the current v_act (so a + # v_act refresh needs no flush). Each call cuts at the route_tail_q / 1-route_tail_q + # quantiles of that global buffer: score <= low-tail keep (0,0) | middle absorb (1,0, + # both train) | score >= high-tail rout (1,1, deployed detached). Global (not per-batch) + # quantiles let the per-batch routed fraction float: an all-clean batch sits below the + # low tail -> ~nothing routed; an all-hack batch sits above the high tail -> most routed. + # This also breaks the gate-collapse of per-window z-norm (which re-centred a saturated + # all-hack window to 0). Symmetric tails: we do not assume where the hack/clean split is. + # Pinned absorb until the buffer holds route_warmup scores. + route_buffer: int = 8192 # > steps*group*prompts (4800 default) so the early clean-era + # scores are never evicted -- they anchor the low tail once hacks dominate; a 4096 + # deque dropped them in the last ~15 steps, exactly when the keep zone needs them. 
route_warmup: int = 128 + route_tail_q: float = 0.1 # tail mass per side of the global score buffer: bottom q -> keep, top q -> rout (~20% of scores in the tails) # Haar-random direction control (placebo): same routing machinery, no pair signal. routeA_random_v_seed: int | None = None - rollout_ablate_frac: float = 0.0 + # Fraction of student rollouts SAMPLED with the quarantine ablated (deployment + # mode). 1.0 = the old behavior: every rollout is sampled deploy-mode, so the + # quarantine never participates in generation -- it can only mimic, via routed + # hack gradients, what the deployed block already emits, which may be why + # absorption did not move the deployed hack. <1.0 lets the quarantine shape some + # exploration so absorption can act on what IT generates (at the cost of a + # train/deploy sampling mismatch on those rows). The deploy-mode rows are + # returned LAST so is_ablated tags them for the free per-step deploy proxy. + gen_deploy_frac: float = 1.0 env_mode: EnvMode = "run_tests" # Rotating per-step chance a problem is shown hint-free (graded by ground-truth tests); @@ -129,12 +141,16 @@ class SmokeConfig(Config): # (keep/absorb/rout + deployed detach) within a few steps. route_buffer: int = 32 route_warmup: int = 8 + # Exercise the mixed-exploration path (two generate calls + pad) in smoke. + gen_deploy_frac: float = 0.5 @dataclass(kw_only=True) class FastConfig(Config): model: str = "Qwen/Qwen3-4B" - steps: int = 100 + # Hack saturates by ~step 40 and is self-sustaining well before the old 100; + # 60 steps captures emergence + a stable tail at ~1/2 the wall-clock. + steps: int = 60 teacher_pool_dir: Path | None = Path("out/pools/teacher_pool_runtests_dense") group: int = 8 max_new: int = 512