feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method

Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-12 03:22:48 +00:00
parent ec11bf58b2
commit af420ec855
9 changed files with 391 additions and 114 deletions
+41 -4
@@ -104,15 +104,15 @@ $\theta_{\text{forget}}$.
Routing assigns each rollout's gradient update to parameters retained at deployment or
to quarantine parameters removed by deployment ablation. A false negative updates the
retained parameters with a reward-hacking example, whereas a false positive removes
one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
they do not explicitly optimize this asymmetric cost.
one non-reward-hacking update. Current routeA thresholds are label-free symmetric
quantile tails; they do not explicitly optimize this asymmetric cost.
The routing score is the dot product between a rollout's pooled deployed-block
bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
mean hack-minus-clean activation difference extracted from hand-authored contrast
pairs with forward passes only. Ground-truth labels from training rollouts never
set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
selects the keep, absorb, and route regions.
set routes or thresholds. Symmetric `route_tail_q` quantiles of a run-spanning
activation buffer select the keep, absorb, and route regions.
The middle region leaves both parameter blocks trainable and may permit absorption.
Calling this region `absorb` names the intended mechanism; it does not establish that
@@ -180,6 +180,43 @@ receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
tests whether direction discrimination adds value beyond quarantine-induced
shrinkage; compare its measured `qmass` because routing mass is not controlled.
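A minimal sketch of the symmetric `route_tail_q` gate and the Haar-random placebo direction, in plain Python rather than the torch code used by the trainer. The helper names (`haar_unit_vector`, `quantile`, `gate`) and the toy scores are hypothetical and exist only for illustration. It assumes linearly interpolated quantiles and the zone rule `score <= t_lo` keep, `score >= t_hi` rout, otherwise absorb.

```python
import math
import random


def haar_unit_vector(dim, rng):
    """Uniform direction on the unit sphere: normalize an i.i.d. Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def quantile(sorted_xs, q):
    """Linearly interpolated quantile of an already-sorted list."""
    pos = q * (len(sorted_xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_xs[lo] * (1.0 - frac) + sorted_xs[hi] * frac


def gate(batch_scores, buffer_scores, tail_q=0.1):
    """Label each batch score keep / absorb / rout from buffer-wide quantiles."""
    s = sorted(buffer_scores)
    t_lo, t_hi = quantile(s, tail_q), quantile(s, 1.0 - tail_q)
    zones = []
    for x in batch_scores:
        if x <= t_lo:
            zones.append("keep")      # low tail: only the deployed block trains
        elif x >= t_hi:
            zones.append("rout")      # high tail: only the quarantine trains
        else:
            zones.append("absorb")    # middle: both blocks train
    return zones, t_lo, t_hi


# Toy buffer of scores 0..100; thresholds land near 10 and 90.
zones, t_lo, t_hi = gate([5.0, 50.0, 95.0], [float(i) for i in range(101)])
print(zones, t_lo, t_hi)

# Placebo direction: same gate, but the scores would be dot(act, v) with this v.
v = haar_unit_vector(16, random.Random(157))
print(sum(x * x for x in v))  # approximately 1
```

Because the thresholds come from the whole buffer rather than the current batch, the routed share of a batch is free to vary with how many of its scores fall in the high tail.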
### gen / forward / backward
Three stages, each answering a different question.
- gen is exploration: it only picks which tokens to try, which are then teacher-forced.
No gradient flows through it. The one question is on- vs off-policy. `gen_deploy_frac`
sets how many rollouts are sampled with the quarantine ablated.
- forward is teacher-forced and applies the routeA mask. Two forwards: `logπ_old`
(no-grad baseline, also reads the gate activations) and `logπ` (under the mask). The
masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter
Masking, so the deployed block learns to work when the quarantine is ablated); route
keeps the deployed block in the forward but detaches it (Selective Gradient Masking,
so only the quarantine learns); absorb trains both. SGTM picks the row from a data
label; routeA picks it from the `v_act` score.
- backward is Dr.GRPO: per-prompt-group advantage `A = reward - mean`, with no `/σ` and
no length normalization. `logπ_old` is the behavior policy, computed in each rollout's
own sampling mode, so `ρ` is a true importance ratio. `ρ = 1` only when the mask's
forward mode matches the sampling mode (deploy-sampled keep; full-sampled
absorb/route). The unbounded direction is `ρ = full/ablated` with `A<0`, which the
one-sided clip can't bound: the old always-ablated baseline gave it to every
full-sampled route row (the frac=0 blow-up), and it still exists at `0<frac<1` on
deploy-sampled rows gated absorb/route. At frac=0 the only mismatch is full-sampled
keep, `ρ = ablated/full`, usually `<1` and clip-bounded. `ρ` is logged per zone.
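A minimal sketch of why the clip bounds one direction only, assuming the standard PPO clipped surrogate `-min(ρ·A, clip(ρ, 1-ε, 1+ε)·A)`. The function name `ppo_clip_loss` and `eps=0.2` are illustrative assumptions, not values read from train.py.

```python
def ppo_clip_loss(rho, adv, eps=0.2):
    """Per-token PPO clipped surrogate loss for ratio rho and advantage adv."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return -min(rho * adv, clipped * adv)


# A > 0: a large ratio is clipped, so the loss magnitude is bounded by (1 + eps) * A.
print(ppo_clip_loss(100.0, 1.0))

# A < 0: min() picks the unclipped term, so the loss grows linearly with rho.
print(ppo_clip_loss(100.0, -1.0))

# A < 0 with rho < 1 (the frac=0 keep mismatch): bounded by (1 - eps) * |A|.
print(ppo_clip_loss(0.01, -1.0))
```

Under this surrogate the only unbounded case is a large `ρ` paired with a negative advantage, which is the `ρ = full/ablated`, `A<0` direction described in the backward bullet.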
### Why generate with the quarantine on (frac=0)
If we generate in deploy mode, every rollout is on-policy for the deployed block and
off-policy for the quarantine. Reward pushes toward hacking, and the gradient for a
hack flows most easily into the block that generated it -- the deployed one -- making
it hackier instead of localizing the hack in the quarantine. Generating with the
quarantine on lets it produce hacks and absorb them, and lets the hack saturate so
absorption has a feature to localize. We keep ~25% of problems unhackable so the
deployed block keeps solving while the hack still saturates. The cost of frac=0 is a
train/deploy mismatch for the deployed block (it deploys generating alone but trained
on full-model tokens), seen in deploy-eval solve. This all assumes absorption actually
engages here, which is the thing to measure.
## Extra instructions:
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
+29
@@ -2,6 +2,35 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-12 (a) — gen / forward / backward, cleanly separated (no new results)
Clarification pass, not an experiment. Pinned down the three per-step stages against
the code (train.py:800-924) and the SGTM paper (Section 3.2 + Table 1), because the
previous session kept conflating them.
- gen = exploration only. It picks which token sequences exist; they are then
teacher-forced. No gradient flows through generation. The only question it raises
is on-/off-policy, which SGTM never faces (supervised, fixed corpus, no ratio).
`gen_deploy_frac=1.0` -> sampler = always-ablated baseline (on-policy);
`<1.0` -> some rollouts sampled full-adapter while baseline stays ablated (off-policy).
- forward = teacher-forced; this is where the routeA `(m,d)` masks act, and they are
SGTM Table 1 on the adapter: keep`(0,0)` = Selective Parameter Masking (zero
theta_forget in forward), route`(1,1)` = Selective Gradient Masking (forget samples
don't update theta_retain; here via detaching the deployed branch, kept in forward),
absorb`(1,0)` = D_unlabeled row. SGTM picks the row by data label; routeA by v_act.
- backward = Dr.GRPO. Advantage per-prompt group (A = reward - mean). Two unbiased
fixes on: no /sigma_R, no 1/|o_i| length norm (constant group*max_new*prompts
denominator). One inner epoch -> rho = exp(logpi - logpi_old) == 1 exactly for keep
(trainer = baseline = deployed-only); for route/absorb the trainer forward includes
the quarantine but the baseline does not, so rho drifts with the quarantine delta and
the one-sided clip bounds it for A>0 but not A<0.
This enumeration is now in AGENTS.md under "The adapter and routing setup". Two open
failures still undiagnosed and held co-equal (task #55): frac=1 job 34 stable but hack
lands in the deployed block (quar-only 0.66 < deployed 0.91); frac=0 job 37 loss
blow-up 5e5, confounded by unhackable_frac=0. Next move unchanged: log rho per zone,
short frac=0 re-run, look.
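A minimal sketch of the backward-stage quantities named in this entry, in plain Python. The function names and toy numbers are hypothetical. It assumes the advantage is reward minus the per-prompt group mean with no division by the reward standard deviation, and that the summed token loss is divided by the constant `group * max_new * prompts`.

```python
import math


def dr_grpo_advantages(rewards_by_prompt):
    """A = reward - group mean, per prompt group; no /sigma normalization."""
    out = []
    for group_rewards in rewards_by_prompt:
        mean = sum(group_rewards) / len(group_rewards)
        out.append([r - mean for r in group_rewards])
    return out


def importance_ratio(logp, logp_old):
    """rho = exp(logpi - logpi_old); equals 1 when trainer and baseline match."""
    return math.exp(logp - logp_old)


def dr_grpo_loss(token_losses, group, max_new, prompts):
    """Sum per-token losses over all rollouts, then divide by a constant.

    No 1/|o_i| length normalization: short and long rollouts share one denominator.
    """
    total = sum(sum(row) for row in token_losses)
    return total / (group * max_new * prompts)


print(dr_grpo_advantages([[1.0, 0.0, 0.0, 1.0], [2.0, 0.0]]))
print(importance_ratio(-1.5, -1.5))
print(dr_grpo_loss([[1.0, 1.0], [2.0]], group=2, max_new=2, prompts=1))
```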
## 2026-06-07 (e) — eval-contamination bug FOUND, FIXED, PROVEN; train-subset is the same bug
The base solve=0.94 (entries b/c/d) is a real bug: the periodic VAL eval ran on
+5 -1
@@ -191,7 +191,7 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav
#
# 2026-06-11 12:18:46
> Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.
@@ -208,3 +208,7 @@ Changed
# 2026-06-11 12:18:43
I found that activations (and the residual stream) are better for routing than gradients. I used an analysis where I treated routing like a classifier to see which formulation had the most fundamental separability, and which vector had the best AUROC when treated as a classifier.
Then simplified it anyway
+32 -6
@@ -11,10 +11,11 @@ TEACHER_RT := "out/pools/teacher_pool_runtests_dense" # dense single-mode run_
# 4 prompts x group 8 -> 16 teachers/step (8 hack + 8 solve), 16 students. Heavy on
# purpose: the run is grad-starved (32 gens/step vs the paper's 256), so without strong
# teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers
# stay on to step 60 (was 30) so the bootstrap has time to land before pure on-policy.
# off at step 30: in the s43 run hack was already self-sustaining (student kept hacking
# 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation.
# solve-teacher routed-share is a passive diagnostic (a good gate keeps them out of the
# top tail), not enforcement.
TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=60"
TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=30"
default:
@just --list
@@ -96,11 +97,36 @@ smoke-all:
# works); high + rout~0 = threshold problem; a drop at a refresh = a direction problem.
# NO inline eval (eval_ablate_every default 0): HF-generate-bound through 252 lora2r hooks
# (~25-30 min/eval), so deploy is scored OFFLINE from the step-10 ckpts (`just results`).
# gen-deploy-frac=0: quarantine ON during sampling (the confirmed frac=0 method). The base
# default is 1.0 (the job-34 config where ablation RAISED hack 0.71->0.86); pass it
# explicitly here so the sweep never silently re-runs that. frac=0 also dodges the
# frac=0.5 blow-up combo (deploy-sampled absorb/route rows carry ρ=full/ablated, A<0).
queue-decision seed='43':
pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}
pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}
# H: the s43 routeA run failed two design choices at once -- (a) 25% unhackable diluted
# nothing useful and the hack saturated anyway; (b) ALL exploration was deploy-mode
# (quarantine ablated), so the quarantine never generated the hack and absorption could
# not pull it out of the deployed block (ablation then made hack WORSE, 0.71->0.86).
# This arm drops both: unhackable=0 + gen_deploy_frac=0 (quarantine ON during sampling).
# Confounded by design (two changes) -- a reset probe, not a clean ablation.
# resolve: deploy_hack < 0.86 AND ablation moves hack DOWN (quarantine holds the hack).
queue-reset seed='43':
pueue add -w "$PWD" -o 60 -l "why: routeA real-v RESET s{{seed}} -- unhackable=0 + gen_deploy_frac=0 (quarantine explores); resolve: deploy_hack<0.86 AND ablation lowers hack" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.0 --gen-deploy-frac=0.0 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_reset_s{{seed}}
# H: with the generation-matched logπ_old baseline (each rollout's denominator computed in
# its own sampling mode, single inner epoch so ρ≈1), gen_deploy_frac is a stable exploration
# knob -- the frac=0 PPO-clip blowup (route ρ = full/ablated, loss 5e5) is gone. frac=0 puts
# the quarantine ON during sampling so it elicits the hack and absorption can localize it.
# Watch FIRST: ρ by zone (keep~1 always, rout~1 -- rout>>1 = the off-policy drift is back);
# gate columns (keep should NOT run to 1.0; rout should fire when hack_s>0; route_hackT >>
# route_solveT). resolve: deploy_hack < 0.86 AND ablation LOWERS hack (quarantine holds the
# hack), at unhackable=0.25 (job-34 baseline).
queue-explore frac='0.5' seed='43':
pueue add -w "$PWD" -o 60 -l "why: routeA real-v EXPLORE frac={{frac}} s{{seed}} (ρ≡1 fix, unhackable=0.25); resolve: deploy_hack<0.86 AND ablation lowers hack AND gate stays alive (keep<1)" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac={{frac}} {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_expl{{frac}}_s{{seed}}
# Base model zero-shot deploy eval (0 training steps): reproduce the paper's base
# solve ~11.5% in our harness. resolve: base solve ~0.10-0.12.
+145
@@ -0,0 +1,145 @@
"""Localize where the learned hack lives, and measure genuine (unhackable) solve.

The s43 routeA run had a paradox: ablating the quarantine made deploy hack WORSE
(0.71 as-trained -> 0.86 ablated) and solve BETTER. Three forward-eval contexts on
the SAME final checkpoint separate the explanations:

  both       no ablation             -- the trained model (deployed + quarantine)
  deploy     ablate quarantine [r:]  -- the shipped model (this is `hack_deployed`)
  quar_only  ablate deployed [:r]    -- keep ONLY the quarantine block

If the gate routed the hack into the quarantine (the intent), quar_only hacks MORE
than deploy. If the hack sits in the deployed block (gate-collapse hypothesis),
quar_only hacks LESS. `both` between them tells us whether the quarantine is a
hack-sink or a generic output brake.

Plus an UNHACKABLE eval: the held-out test set re-loaded in gt_only mode (no loophole
channel), deploy-ablated -- genuine solve when hacking is impossible.

Run: uv run python scripts/diag_deploy_ablations.py [--run-dir ...] [--ckpt train]
"""
from __future__ import annotations

import json
import struct
from contextlib import contextmanager
from dataclasses import dataclass
from pathlib import Path

import torch
import tyro
from loguru import logger
from safetensors.torch import load_file
from tabulate import tabulate
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

from vgrout.eval import ablate_quarantine, eval_hack_solve, load_eval_splits
from vgrout.lora2r import wrap_model_with_lora2r


@dataclass
class Cfg:
    run_dir: Path = Path("out/runs/20260611T125043_fast_routeA_lora2r_seed43_l2r_routeA_real_s43")
    ckpt: str = "train"  # final adapter (train.safetensors)
    n_test: int = 87  # held-out test count used by the run
    max_new: int = 512
    eval_batch_size: int = 8


def _ckpt_meta(path: Path) -> dict:
    with open(path, "rb") as f:
        return json.loads(f.read(struct.unpack("<Q", f.read(8))[0])).get("__metadata__", {})


@contextmanager
def ablate_deployed(wrappers: dict):
    """Mirror of eval.ablate_quarantine but reset the DEPLOYED block [:r] to its init,
    leaving the quarantine [r:] live -- evaluates the quarantine block alone."""
    saved: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}
    for n, info in wrappers.items():
        r = info["r"]
        saved[n] = (info["A"].data[:r].clone(), info["B"].data[:, :r].clone())
        info["A"].data[:r] = info["A0"][:r]
        info["B"].data[:, :r] = info["B0"][:, :r]
    try:
        yield
    finally:
        for n, info in wrappers.items():
            r = info["r"]
            info["A"].data[:r] = saved[n][0]
            info["B"].data[:, :r] = saved[n][1]


def _eval(model, tok, problems, gen_cfg, device, cfg) -> dict:
    torch.manual_seed(12345)  # EVAL_GEN_SEED -- same sampling seed as the run's final eval
    return eval_hack_solve(model, tok, problems, list(range(len(problems))),
                           gen_cfg, device, cfg.max_new, cfg.eval_batch_size)


def main(cfg: Cfg):
    device = torch.device("cuda")
    ckpt_path = cfg.run_dir / f"{cfg.ckpt}.safetensors"
    meta = _ckpt_meta(ckpt_path)
    run_cfg = json.loads(meta.get("cfg", "{}"))
    model_name = run_cfg.get("model", "Qwen/Qwen3-4B")
    r, init_seed = run_cfg.get("lora_r", 32), run_cfg.get("lora_init_seed", 0)
    logger.info(f"ckpt {ckpt_path.name} step={meta.get('step')} model={model_name} r={r}")

    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, dtype=torch.bfloat16, attn_implementation="flash_attention_2").to(device)
    model.config.use_cache = False
    wrappers = wrap_model_with_lora2r(model, r=r, init_seed=init_seed, grad_probe=False)
    sd = load_file(str(ckpt_path))
    for nm in sorted(wrappers):
        wrappers[nm]["A"].data.copy_(sd[f"A/{nm}"].to(device, torch.float32))
        wrappers[nm]["B"].data.copy_(sd[f"B/{nm}"].to(device, torch.float32))
    model.eval()
    logger.info(f"loaded A/B into {len(wrappers)} modules")

    gen_cfg = GenerationConfig(
        max_new_tokens=cfg.max_new, do_sample=True, temperature=0.7, top_p=1.0,
        top_k=20, min_p=0.0, repetition_penalty=1.0, num_return_sequences=1,
        pad_token_id=tok.pad_token_id)

    # run_tests test set (the hackable deploy set), and the SAME ids re-graded gt_only.
    _, hack_test = load_eval_splits(["run_tests"], cfg.n_test)
    _, gt_test = load_eval_splits(["gt_only"], cfg.n_test)
    logger.info(f"run_tests test n={len(hack_test)}; gt_only test n={len(gt_test)}")

    rows = []
    # (a) localize the hack: three ablation contexts on the run_tests test set.
    for name, ctx in [("both (trained)", None),
                      ("deploy (quar ablated)", ablate_quarantine(wrappers)),
                      ("quar_only (dep ablated)", ablate_deployed(wrappers))]:
        if ctx is None:
            ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
        else:
            with ctx:
                ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
        rows.append({"set": "run_tests", "block": name, "hack": round(ev["hack"], 3),
                     "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
        logger.info(f"[run_tests/{name}] hack={ev['hack']:.3f} solve={ev['solve']:.3f}")

    # (b) unhackable: genuine solve with no loophole channel, deploy-ablated.
    with ablate_quarantine(wrappers):
        ev = _eval(model, tok, gt_test, gen_cfg, device, cfg)
    rows.append({"set": "gt_only", "block": "deploy (quar ablated)", "hack": round(ev["hack"], 3),
                 "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
    logger.info(f"[gt_only/deploy] hack={ev['hack']:.3f} solve={ev['solve']:.3f} (hack SHOULD be 0: no channel)")

    out = Path("out/diag/deploy_ablations.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"run_dir": cfg.run_dir.name, "ckpt": cfg.ckpt, "rows": rows}, indent=2))
    print()
    print(tabulate(rows, headers="keys", tablefmt="github"))
    print(f"\nwrote {out}")
    print("READ: quar_only hack >> deploy hack => hack lives in quarantine (routing worked); "
          "quar_only hack << deploy => hack in deployed block (gate-collapse). gt_only solve "
          "= genuine ability when hacking is off.")


if __name__ == "__main__":
    main(tyro.cli(Cfg))
+4 -1
@@ -9,6 +9,7 @@ import hashlib
from contextlib import contextmanager
import torch
from tqdm.auto import tqdm
from .data import DATA, HINT_REPLACE_TO, load_problems
from .rewards import compute_reward
@@ -135,7 +136,8 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
old_padding_side = tok.padding_side
tok.padding_side = "left"
try:
for lo in range(0, len(prepared), eval_batch_size):
bar = tqdm(range(0, len(prepared), eval_batch_size), desc="eval", unit="batch", leave=False)
for lo in bar:
batch = prepared[lo:lo + eval_batch_size]
enc = tok([x[2] for x in batch], return_tensors="pt", padding=True,
add_special_tokens=False).to(device)
@@ -153,6 +155,7 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
vhacks += int(vendor_hack); tally[1] += int(vendor_hack)
solves += int(r.gt_correct); tally[2] += int(r.gt_correct)
n += 1; tally[3] += 1
bar.set_postfix(hack=f"{hacks / n:.2f}", solve=f"{solves / n:.2f}", n=n)
finally:
tok.padding_side = old_padding_side
model.config.use_cache = False
+1
@@ -109,6 +109,7 @@ class StepLogger:
_Col("rout", 6, "rout", ".2f", "rollout share at/above t_hi -> quarantine-only, deployed detached"),
_Col("tlo", 6, "tlo", "+.2f", "lower gate threshold (route_tail_q quantile of the run-spanning act buffer, raw score units); nan during warmup"),
_Col("thi", 6, "thi", "+.2f", "upper (rout) gate threshold (1 - route_tail_q quantile, raw score units); nan during warmup"),
_Col("stale", 5, "stale", "d", "steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every); placebo/vanilla never refresh so it grows unbounded"),
]
# Show the training-prompt deploy proxy only when an ablated slice exists.
if show_ablate:
+106 -90
@@ -38,8 +38,6 @@ from collections import deque
from contextlib import nullcontext
from pathlib import Path
import numpy as np
# Must be set BEFORE `import torch` to take effect on the CUDA allocator.
# Eliminates fragmentation that caused 91 GiB allocated / 581 MiB free crash
# on Qwen3-4B G=8 (PyTorch's own OOM message recommends this).
@@ -70,29 +68,6 @@ OUT_DIR = Path("out")
RUNS_DIR = OUT_DIR / "runs"
def _otsu3(x: np.ndarray) -> tuple[float, float]:
"""Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance.
Label-free -- the routeA gate computes this on a rolling buffer of live scores, so
using it is not oracle leakage. Scores are winsorized at 1/99% first: Otsu maximizes
variance, so on heavy-tailed scores a single extreme point otherwise buys a whole
class (journal 2026-06-11 (d): v5 act rout precision 0.00 -> 0.50 after winsorize).
Vectorized over the [n, n] cut grid; n is the buffer size (<= a few hundred)."""
x = np.clip(x, *np.quantile(x, [0.01, 0.99]))
s = np.sort(np.asarray(x, float))
n = len(s)
c = np.concatenate([[0.0], np.cumsum(s)])
iv = np.arange(1, n)
i_g, j_g = iv[:, None], iv[None, :]
with np.errstate(divide="ignore", invalid="ignore"):
obj = (c[i_g] ** 2 / i_g
+ (c[j_g] - c[i_g]) ** 2 / (j_g - i_g)
+ (c[n] - c[j_g]) ** 2 / (n - j_g))
obj[(j_g <= i_g) | (j_g >= n)] = -np.inf # need i < j and a nonempty top class
i, j = np.unravel_index(np.argmax(obj), obj.shape)
i, j = iv[i], iv[j]
return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)
def _sample_rows(rows: list[dict] | None, n: int, rng: torch.Generator) -> list[dict]:
"""Draw n teacher rollouts from a prompt's pool (with replacement if the pool is short)."""
if n == 0 or not rows:
@@ -149,8 +124,8 @@ def _validate_config(cfg: Config) -> None:
raise ValueError(f"unknown intervention {cfg.intervention!r}; expected none|routeA|absorb")
if cfg.routeA_random_v_seed is not None and cfg.intervention != "routeA":
raise ValueError("routeA_random_v_seed is a routeA-only placebo control")
if cfg.rollout_ablate_frac > 0 and cfg.intervention == "none":
raise ValueError("rollout_ablate_frac needs a quarantine to ablate (routeA/absorb)")
if not 0.0 <= cfg.gen_deploy_frac <= 1.0:
raise ValueError(f"gen_deploy_frac must be in [0,1], got {cfg.gen_deploy_frac}")
if cfg.weight_decay != 0.0:
raise ValueError("lora2r init is nonzero; AdamW decay pulls A/B toward 0 not toward init "
"-- set --weight-decay=0")
@@ -274,12 +249,10 @@ def main(cfg: Config) -> int:
act_buf = deque(maxlen=cfg.route_buffer)
logger.info(
f"routeA gate: per-rollout score = dot(pooled completion-token act, v_act), "
f"thresholds = two-threshold Otsu on the last <= {cfg.route_buffer} live scores "
f"(z-normalized, winsorized 1/99%), label-free; pinned absorb until "
f"{cfg.route_warmup} scores. keep (0,0) | absorb (1,0) | rout (1,1: deployed "
f"detached). No bimodality guard: on the cached emergence windows no shape "
f"statistic separates the hack mixture from hack-free scores (Otsu tail means "
f"sit ~2.4 sd apart even on a Gaussian), and a false rout only discards one "
f"thresholds = global {cfg.route_tail_q}/{1 - cfg.route_tail_q} quantiles of the "
f"run-spanning act buffer (<= {cfg.route_buffer}, re-scored vs current v_act), "
f"label-free; pinned absorb until {cfg.route_warmup} scores. keep (0,0) | "
f"absorb (1,0) | rout (1,1: deployed detached). A false rout only discards one "
f"update from deployment. "
f"SHOULD: auroc col >> 0.5 once hacks appear ELSE v_act is blind and routing "
f"is noise; rout tracks the hack share, not ~0 or ~1.")
@@ -443,24 +416,35 @@ def main(cfg: Config) -> int:
pad_id = tok.pad_token_id
def gen_students(enc, n: int) -> tuple[torch.Tensor, int]:
"""Generate student rollouts in deployment mode with the quarantine ablated.
This prevents reward-hacking behavior learned by the quarantine parameters from
saturating the rollout distribution and reducing the correct-solution advantage.
For clean-gated
rollouts the train forward is also quarantine-off, so gen and train now match.
(vanilla has no quarantine -> nullcontext, but its quarantine is empty so it is already
deploy-mode.)"""
ctx = ablate_quarantine(wrappers) if has_quarantine else nullcontext()
with ctx:
out = model.generate(**enc, generation_config=gen_cfg,
num_return_sequences=n).detach()
return out, 0 # every rollout is deploy-mode now; no separate ablated-proxy subset
"""Generate n student rollouts. cfg.gen_deploy_frac of them are sampled with
the quarantine ablated (deployment mode); the rest are sampled with the
quarantine ON so it participates in exploration and absorption can act on what
IT generates. Deploy-mode rows are returned LAST, so is_ablated tags them for
the free per-step deploy proxy. (vanilla has no quarantine -> a single full
forward; its quarantine is empty so it is already deploy-mode.)"""
def _gen(k: int) -> torch.Tensor:
return model.generate(**enc, generation_config=gen_cfg,
num_return_sequences=k).detach()
if not has_quarantine:
return _gen(n), 0
n_dep = round(n * cfg.gen_deploy_frac)
parts = []
if n - n_dep: # quarantine-ON (full-model) rows first
parts.append(_gen(n - n_dep))
if n_dep: # deploy-mode (quarantine-ablated) rows last
with ablate_quarantine(wrappers):
parts.append(_gen(n_dep))
if len(parts) == 1:
return parts[0], n_dep
L = max(p.shape[1] for p in parts) # two generate calls -> pad to equal length
parts = [F.pad(p, (0, L - p.shape[1]), value=pad_id) if p.shape[1] < L else p
for p in parts]
return torch.cat(parts, dim=0), n_dep
# `ref_eq` compares cumulative sampling pressure to the 16x16 reference step.
run_modes = sorted({p["env_mode"] for p in problems}, key=lambda m: list(MODE_CODE).index(m))
step_logger = StepLogger(arm=cfg.arm, modes=run_modes, mode_code=MODE_CODE,
show_ablate=cfg.rollout_ablate_frac > 0)
show_ablate=has_quarantine and cfg.gen_deploy_frac < 1.0)
REF_GENS_PER_STEP = 16 * 16 # ariahw/rl-rewardhacking config.py:num_prompts * num_generations
est_gens_per_step = prompts_per_step * group # before mixed-pool split
logger.info(
@@ -522,36 +506,35 @@ def main(cfg: Config) -> int:
def _routeA_gate(dots: Float[torch.Tensor, "G"]) -> tuple[torch.Tensor, torch.Tensor, float, float]:
"""Three-way output-mask label per rollout from the rolling score buffer.
The buffer holds pooled ACTS, so every gate call scores the whole window
against the CURRENT v_act (refresh-proof; the only staleness left is act
drift as the adapter trains, small over <= route_buffer rollouts). Scores
are z-normalized by the buffer mean/std, then two-threshold Otsu (winsorized
inside _otsu3) places (t_lo, t_hi): z <= t_lo keep (0,0); t_lo < z < t_hi
absorb (1,0, both blocks train); z >= t_hi rout (1,1, deployed detached).
Warmup: pinned absorb until the buffer holds route_warmup scores -- too few
points to place thresholds, and absorb keeps both blocks learning."""
# A single NaN act would poison the whole buffer -> NaN thresholds -> every
# comparison False -> silent all-keep routing for up to route_buffer rollouts.
The buffer holds pooled ACTS over the whole run, so every gate call scores it
against the CURRENT v_act (refresh-proof). Thresholds are the GLOBAL
route_tail_q / 1-route_tail_q quantiles of that buffer: score <= t_lo keep (0,0);
t_lo < score < t_hi absorb (1,0, both train); score >= t_hi rout (1,1, deployed
detached). Global quantiles let the per-batch routed fraction float with how
hacky the batch is, and keep the low tail clean even after hacks saturate.
Warmup: pinned absorb until the buffer holds route_warmup scores."""
# A single NaN act would poison the buffer -> NaN quantiles -> every comparison
# False -> silent all-keep routing for up to route_buffer rollouts.
assert torch.isfinite(dots).all(), f"non-finite gate scores: {dots}"
if len(act_buf) < cfg.route_warmup:
G_n = dots.shape[0]
return (torch.ones(G_n, device=device), torch.zeros(G_n, device=device),
float("nan"), float("nan"))
S = torch.einsum("nmr,mr->n", torch.stack(tuple(act_buf)), v_act)
mu, sd = S.mean().item(), max(S.std().item(), 1e-12)
t_lo, t_hi = _otsu3(((S - mu) / sd).numpy())
z = (dots - mu) / sd
m = (z > t_lo).float().to(device) # absorb + rout -> quarantine trains
d = (z >= t_hi).float().to(device) # top zone -> rout -> deployed detached
d = d * m # degenerate Otsu (t_lo==t_hi) could give (0,1) = trains nothing; rout requires m=1
logger.debug(f"routeA gate: buf={len(act_buf)} mu={mu:+.3e} sd={sd:.3e} "
f"t_lo={t_lo:+.2f} t_hi={t_hi:+.2f} (z) | batch z "
f"min={z.min().item():+.2f} max={z.max().item():+.2f}")
t_lo = torch.quantile(S, cfg.route_tail_q).item()
t_hi = torch.quantile(S, 1 - cfg.route_tail_q).item()
m = (dots > t_lo).float().to(device) # above the low tail -> quarantine trains (absorb + rout)
d = (dots >= t_hi).float().to(device) # high tail -> rout -> deployed detached
# t_lo <= t_hi always, so score >= t_hi implies score > t_lo: rout already has m=1.
logger.debug(f"routeA gate: buf={len(act_buf)} t_lo={t_lo:+.3e} t_hi={t_hi:+.3e} "
f"(raw score) | batch score min={dots.min().item():+.3e} "
f"max={dots.max().item():+.3e}")
return m, d, t_lo, t_hi
# Disable tqdm off-TTY because structured per-step rows already report progress.
pbar = tqdm(range(steps), desc=f"train {cfg.arm} {cfg.preset_name}",
mininterval=120, maxinterval=120, disable=None)
v_act_stale = 0 # steps since the routing direction was last (re)extracted
# ── training loop: generate -> grade -> backward -> (gate) -> masked backward -> step ──
for step in pbar:
# DETERMINISTIC teacher forcing: in the teacher phase every prompt is drawn from the
@@ -582,6 +565,7 @@ def main(cfg: Config) -> int:
diag_tail = None
# routeA gate diagnostics (per-rollout three-way zone shares + clean-gated clipfrac).
step_clipfrac: list[float] = [] # PPO clip frac on keep-gated rollouts (ratio-drift gauge)
step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = [] # mean ρ per zone (off-policy gauge)
step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = [] # unit shares per zone
step_tlo: list[float] = []; step_thi: list[float] = [] # gate thresholds t_lo/t_hi (global quantiles, raw score units)
# AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively-
@@ -798,32 +782,47 @@ def main(cfg: Config) -> int:
if not cfg.unbiased:
A = A / (rewards.std() + 1e-4)
# logπ_old: old-policy logprobs (frozen PPO-ratio target). logits_to_keep
# =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops
# the last position (predicts beyond `merged`, unused).
# For routeA this forward runs QUARANTINE-ABLATED, matching both the sampling
# policy (gen_students is deploy-mode) and the v_act extraction (quarantine-
# ablated), so the gate score and the vector live on the same observable path.
# The same forward carries the ActCapture hooks: the gate costs no extra pass.
# logπ_old: the BEHAVIOR policy's logprobs (the PPO-ratio denominator). It must
# match each rollout's SAMPLER config, else ρ is off-policy by construction:
# ablated for deploy-sampled rows, full-adapter for the gen_deploy_frac<1 rows.
# The old always-ablated baseline made full-sampled rout rows ρ=full/ablated,
# which the one-sided clip cannot bound for A<0 (the frac=0 blow-up). logits_to_keep
# =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops the
# last (out-of-range) position.
completion_ids = merged[:, plen:]
L_c = completion_ids.shape[1]
mask = (completion_ids != pad_id).float()
abl_row = torch.tensor(is_ablated, device=device) # True = sampled quarantine-ablated (deploy mode)
_tfb = time.perf_counter()
def _logp_old(ablate: bool) -> torch.Tensor:
with torch.no_grad(), (ablate_quarantine(wrappers) if ablate else nullcontext()):
return per_token_logps(
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
completion_ids,
).detach()
if is_routeA:
# Gate acts ALWAYS ride an ablated forward: v_act lives in the deployed-block
# ablated space, so the gate score and the vector stay on one observable path.
with torch.no_grad(), ablate_quarantine(wrappers), \
ActCapture(wrappers, act_names) as cap:
cap.set_pool(plen, mask)
logπ_old = per_token_logps(
logπ_old_abl = per_token_logps(
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
completion_ids,
).detach()
acts = cap.pooled().cpu() # [G, M, r] fp32
else:
with torch.no_grad():
logπ_old = per_token_logps(
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
completion_ids,
).detach()
# Behavior-policy match: full-sampled rows take a full forward (one extra
# no-grad pass only when gen_deploy_frac<1).
logπ_old = logπ_old_abl if abl_row.all() else \
torch.where(abl_row[:, None], logπ_old_abl, _logp_old(ablate=False))
elif abl_row.all():
logπ_old = _logp_old(ablate=True)
elif not abl_row.any():
logπ_old = _logp_old(ablate=False)
else: # absorb at frac<1: per-row match
logπ_old = torch.where(abl_row[:, None], _logp_old(ablate=True), _logp_old(ablate=False))
# Pin block masks BEFORE the (single) grad-carrying forward (arm semantics:
# train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the
@@ -896,15 +895,25 @@ def main(cfg: Config) -> int:
# vector; v_act is a classifier only). Gradients accumulate on A/B.
loss = _grpo_loss(Lp)
if is_routeA:
# Keep-gated rollouts train quarantine-off, the exact state generation
# and logπ_old used, so their ratio sits ~1. Absorb/rout rollouts see
# the quarantine delta in the forward only -> ratio drift, bounded by
# the clip; clipfrac on those rollouts is the drift gauge.
# ρ=1 only where the mask's forward mode matches the rollout's sampling
# mode: deploy-sampled keep, full-sampled absorb/rout. Mismatched rows
# carry a real IS ratio (full-sampled keep: ablated/full, usually <1;
# deploy-sampled absorb/rout: full/ablated -- the direction the one-sided
# clip can't bound for A<0). clipfrac on quarantine-on rows is the gauge.
qon = m_vec == 1
if qon.any():
clipped = ((ρ.detach() - 1).abs() > cfg.clip).float()
step_clipfrac.append(
((clipped * mask)[qon].sum() / mask[qon].sum().clamp_min(1)).item())
# Per-rollout mean ρ split by zone. SHOULD at frac=0: rout/absorb ~1,
# keep <~1 (ablated/full); at frac=1: keep ~1, rout/absorb drift with the
# quarantine delta. rout>>1 = the off-policy blow-up direction (A<0 unclipped).
ρ_roll = (ρ.detach() * mask).sum(1) / mask.sum(1).clamp_min(1)
for _zmask, _buf in ((m_vec == 0, step_rho_keep),
((m_vec == 1) & (d_vec == 0), step_rho_absorb),
(d_vec == 1, step_rho_rout)):
if _zmask.any():
_buf.append(ρ_roll[_zmask].mean().item())
loss.backward() # A/B grads accumulate across prompts (opt.zero_grad clears per step)
for info in wrappers.values():
info["layer"]._lora2r_mask = None
@@ -938,7 +947,6 @@ def main(cfg: Config) -> int:
# signal is read on the deployed observable path, matching the build-time extract
# and the gate forward. Forward-only, so the refresh is cheap. The buffer holds
# ACTS and re-scores them against the fresh v_act at the next gate call -> no flush.
refr = "-"
do_refresh = (is_routeA and cfg.vhack_refresh_every > 0
and (step + 1) % cfg.vhack_refresh_every == 0
and cfg.routeA_random_v_seed is None) # placebo keeps its one Haar draw
@@ -950,7 +958,10 @@ def main(cfg: Config) -> int:
tstat=cfg.vact_tstat)
if _was_training:
model.train()
refr = "rfr"
# Age of the v_act now in hand, measured at end of step: 0 = just refreshed.
# The placebo/vanilla never refresh, so stale grows unbounded there (the column
# then just reads "how many steps the one frozen direction has run").
v_act_stale = 0 if do_refresh else v_act_stale + 1
# ── periodic held-out eval (deploy = quarantine ablated) ──
hack_deployed = solve_deployed = float("nan")
@@ -1046,6 +1057,11 @@ def main(cfg: Config) -> int:
logger.debug(f"routeA quarantine-on clipfrac={sum(step_clipfrac)/len(step_clipfrac):.3f} "
f"(SHOULD: <~0.2; higher = quarantine forward delta drifting far "
f"from the behavior (old) policy)")
if step_rho_keep or step_rho_absorb or step_rho_rout:
_m = lambda b: sum(b) / len(b) if b else float("nan")
logger.debug(f"routeA ρ by zone: keep={_m(step_rho_keep):.2f} absorb={_m(step_rho_absorb):.2f} "
f"rout={_m(step_rho_rout):.2f} (SHOULD at frac=0: rout/absorb ~1, keep <~1 (ablated/full); "
f"at frac=1: keep ~1, rout/absorb drift -- rout>>1 = off-policy quarantine drift)")
if step_route_hackT or step_route_solveT:
_rh = sum(step_route_hackT) / len(step_route_hackT) if step_route_hackT else float("nan")
_rs = sum(step_route_solveT) / len(step_route_solveT) if step_route_solveT else float("nan")
@@ -1089,7 +1105,7 @@ def main(cfg: Config) -> int:
"tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"),
"thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"),
"lr": sched.get_last_lr()[0],
"refr": refr,
"stale": v_act_stale,
# Deploy-eval (quarantine ablated); NaN except on eval steps.
"hack_deployed": hack_deployed,
"solve_deployed": solve_deployed,
@@ -1130,7 +1146,7 @@ def main(cfg: Config) -> int:
if last_gen_sample:
_s, _r = last_gen_sample
logger.error(f"--- last student gen (step {_s}, reward={_r['reward']:+.2f}) ---\n"
f"{_r['text'][:800]}\n--- END (token salad => divergence confirmed) ---")
f"{_r['text']}\n--- END (token salad => divergence confirmed) ---")
raise RuntimeError(f"training diverged (ppl_t={ppl_t:.0e} at step {step})")
updates_completed = step + 1
if updates_completed % cfg.save_ckpt_every == 0 or updates_completed == steps:
@@ -1143,7 +1159,7 @@ def main(cfg: Config) -> int:
pbar.set_postfix(
rew=f"{rew_mean:+.2f}", gt=f"{sum(agg_gt)}/{n_rollouts}",
hack=f"{sum(agg_hack)}/{n_rollouts}", loss=f"{agg_loss:+.3f}",
sec=f"{time.time()-t0:.0f}", refresh=False,
sec=f"{time.time()-t0:.0f}", stale=v_act_stale,
)
logger.debug(
f"step {step:3d} rew={rew_mean:+.2f}(std {rew_std:.2f}) "
@@ -1186,7 +1202,7 @@ def main(cfg: Config) -> int:
f"\n\n=== LAST TRAIN GEN (step {_s}, reward={_r['reward']:+.2f}, "
f"gt_pass={_r['gt_pass']}, hacked={_r['hacked_E']}) ===\n"
f"SHOULD: coherent code/prose. ELSE token salad => diverged, eval below is moot.\n"
f"{_r['text'][:800]}\n=== END LAST GEN ===\n")
f"{_r['text']}\n=== END LAST GEN ===\n")
# ── final eval + BLUF ──
# Pair quarantine-ablated and enabled states on identical final-test prompts and sampling seed.
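Review note: a minimal sketch of why the ratio direction matters for A<0. It assumes `_grpo_loss` uses the standard PPO clipped surrogate min(ρA, clip(ρ, 1-ε, 1+ε)A); the body of `_grpo_loss` is not shown in this diff, so that form is an assumption, and `clipped_surrogate` and `eps` are illustrative names rather than identifiers from the repo.

```python
def clipped_surrogate(rho: float, adv: float, eps: float = 0.2) -> float:
    """Per-token PPO clipped objective (to be maximized); hypothetical helper."""
    clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped_rho * adv)


ratios = (1.0, 10.0, 1000.0)

# A > 0: the clip caps the objective at (1 + eps) * A however large rho gets.
pos = [clipped_surrogate(r, adv=1.0) for r in ratios]

# A < 0: once rho > 1 + eps the min() picks the UNCLIPPED term rho * A,
# so the objective (and the loss, its negative) scales linearly with rho.
neg = [clipped_surrogate(r, adv=-1.0) for r in ratios]

# A < 0 with rho < 1 is the bounded direction: the clip floor (1 - eps) * A applies.
neg_small = clipped_surrogate(0.1, adv=-1.0)

print(pos, neg, neg_small)
```

Under this assumption a row whose ρ is full/ablated and far above 1 contributes an unbounded term whenever its advantage is negative, which matches the commit's description of the blow-up; a ρ below 1 (ablated/full) stays bounded in both advantage signs.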
+28 -12
@@ -54,20 +54,32 @@ class Config:
# t-stat extraction (v = mean/SE per coordinate, clamp |t|<=3): null at the current
# 8 pairs (journal (e)); revisit when the authored-pair set grows.
vact_tstat: bool = False
# routeA gate thresholds come from a rolling buffer of the last route_buffer live
# scores: z-normalize by buffer mean/std, two-threshold Otsu (winsorized 1/99% --
# journal (d): without it one outlier buys a whole zone) -> keep | absorb | rout.
# Until the buffer holds route_warmup scores the gate pins absorb (both blocks
# train): too few points to place thresholds. The buffer stores pooled ACTS and
# re-scores them against the current v_act, so a refresh needs no flush. No
# bimodality guard: no shape statistic separates the hack mixture from hack-free
# scores on the cached windows (Otsu tail means sit ~2.4 sd apart even on a
# Gaussian), and a false rout only discards one update from deployment.
route_buffer: int = 256
# routeA gate: route the GLOBAL tails of the score distribution. The buffer stores
# pooled ACTS over the whole run and re-scores them against the current v_act (so a
# v_act refresh needs no flush). Each call cuts at the route_tail_q / 1-route_tail_q
# quantiles of that global buffer: score <= low-tail keep (0,0) | middle absorb (1,0,
# both train) | score >= high-tail rout (1,1, deployed detached). Global (not per-batch)
# quantiles let the per-batch routed fraction float: an all-clean batch sits below the
# low tail -> ~nothing routed; an all-hack batch sits above the high tail -> most routed.
# This also breaks the gate-collapse of per-window z-norm (which re-centred a saturated
# all-hack window to 0). Symmetric tails: we do not assume where the hack/clean split is.
# Pinned absorb until the buffer holds route_warmup scores.
route_buffer: int = 8192 # > steps*group*prompts (4800 default) so the early clean-era
# scores are never evicted -- they anchor the low tail once hacks dominate; a 4096
# deque dropped them in the last ~15 steps, exactly when the keep zone needs them.
route_warmup: int = 128
route_tail_q: float = 0.1 # cut at the q and 1-q quantiles of the global score buffer: bottom q -> keep, top q -> rout (~20% in the two tails)
# Haar-random direction control (placebo): same routing machinery, no pair signal.
routeA_random_v_seed: int | None = None
rollout_ablate_frac: float = 0.0
# Fraction of student rollouts SAMPLED with the quarantine ablated (deployment
# mode). 1.0 = the old behavior: every rollout is sampled deploy-mode, so the
# quarantine never participates in generation -- it can only mimic, via routed
# hack gradients, what the deployed block already emits, which may be why
# absorption did not move the deployed hack. <1.0 lets the quarantine shape some
# exploration so absorption can act on what IT generates (at the cost of a
# train/deploy sampling mismatch on those rows). The deploy-mode rows are
# returned LAST so is_ablated tags them for the free per-step deploy proxy.
gen_deploy_frac: float = 1.0
env_mode: EnvMode = "run_tests"
# Rotating per-step chance a problem is shown hint-free (graded by ground-truth tests);
@@ -129,12 +141,16 @@ class SmokeConfig(Config):
# (keep/absorb/rout + deployed detach) within a few steps.
route_buffer: int = 32
route_warmup: int = 8
# Exercise the mixed-exploration path (two generate calls + pad) in smoke.
gen_deploy_frac: float = 0.5
@dataclass(kw_only=True)
class FastConfig(Config):
model: str = "Qwen/Qwen3-4B"
steps: int = 100
# Hack saturates by ~step 40 and is self-sustaining well before the old 100;
# 60 steps captures emergence + a stable tail at ~1/2 the wall-clock.
steps: int = 60
teacher_pool_dir: Path | None = Path("out/pools/teacher_pool_runtests_dense")
group: int = 8
max_new: int = 512
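Review note: a minimal, torch-free sketch of the global-quantile gate described in the `route_tail_q` comments above. The `quantile` and `gate` helpers are illustrative, not the repo's functions; the interpolation rule is meant to mirror linear-interpolation quantiles, and the explicit tie guard on the rout flag is an assumption about how a degenerate buffer should be handled.

```python
def quantile(sorted_scores: list[float], q: float) -> float:
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    return sorted_scores[lo] + (pos - lo) * (sorted_scores[hi] - sorted_scores[lo])


def gate(batch: list[float], buffer: list[float], tail_q: float = 0.1):
    """Zone each batch score against the GLOBAL buffer quantiles."""
    s = sorted(buffer)
    t_lo, t_hi = quantile(s, tail_q), quantile(s, 1 - tail_q)
    zones = []
    for score in batch:
        m = score > t_lo             # above the low tail -> quarantine trains
        d = score >= t_hi and m      # high tail -> rout; guard keeps rout inside m
        zones.append("rout" if d else "absorb" if m else "keep")
    return zones, t_lo, t_hi


buffer = [float(i) for i in range(101)]        # run-spanning scores 0..100
clean, _, _ = gate([1.0, 2.0, 3.0], buffer)    # all below the low tail
hacky, _, _ = gate([95.0, 99.0], buffer)       # all above the high tail
mixed, _, _ = gate([5.0, 50.0, 97.0], buffer)  # one per zone
tied, _, _ = gate([5.0], [5.0] * 10)           # constant buffer: t_lo == t_hi
print(clean, hacky, mixed, tied)
```

Because the thresholds come from the whole buffer rather than the batch, the routed share per batch floats from none (all-clean batch) to all (all-hack batch). The constant-buffer case shows why the rout flag is tied to `m`: without the guard a score equal to both thresholds would be tagged (0,1).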