iso-kl-figure: scaffold + smoke test passing

2026-06-27 15:45:51 +08:00 · 2026-05-05 06:17:25 +08:00
commit 45b7123cf5
23 changed files with 1949 additions and 0 deletions
@@ -0,0 +1,12 @@
+__pycache__/
+*.pyc
+.venv/
+uv.lock
+*.egg-info/
+.pytest_cache/
+.ruff_cache/
+outputs/*.csv
+outputs/*.tsv
+outputs/*.png
+outputs/*.md
+!outputs/.gitkeep
@@ -0,0 +1,24 @@
+# AGENTS.md
+
+Inherits conventions from sibling project `steering-lite`. Read [../steering-lite/AGENTS.md](../steering-lite/AGENTS.md) if it exists.
+
+## House rules
+
+- Fail fast. No defensive programming, no fallbacks, no silent dequant.
+- Keep this repo small. Anything beyond the headline figure + table belongs in another repo.
+- Use `einops` and `jaxtyping` shape annotations at function boundaries only. Tensor dim letters: `b s d` (batch, seq, d_model), `n` (prompts), `t` (token positions), `f` (fork points).
+- No backward compat.
+- Single functional smoke test = the real pipeline at tiny scale (`tests/test_smoke.py`).
+- Methods register via `@register_config` and `@register` decorators; mirror `steering-lite/src/steering_lite/config.py`.
+- All experiment scripts write CSV/TSV. Plot/table scripts read CSV/TSV. Never plot from in-memory state.
+
+## Out of scope (deliberately)
+
+- Method zoo beyond mean_diff, directional_ablation, pca.
+- LessWrong post / paper draft.
+- Citation collection.
+- tinymfv or any external eval dependency.
+
+## Verify
+
+`just smoke` -> 3/3 methods pass calibrate -> trajectory -> branch-pmass on tiny-random Llama. Asserts nonzero KL at coeff>0, zero KL at coeff=0, branch-pmass in [0,1].
@@ -0,0 +1,36 @@
+# iso-kl-figure
+
+Minimal repo with one job: produce a figure and a table that demonstrate iso-KL calibration is stable across models, seeds, and calibration windows.
+
+## Claim (narrow)
+
+Calibrating a steering coefficient so that p95 per-token KL(steered || base) hits 1 nat in a short calibration window:
+
+- C1: bisection converges for every method tested; held-out p95 KL lands near 1 nat.
+- C2 (not too cold): target-axis Delta logit at calibrated alpha is non-zero across methods.
+- C3 (not too hot): base-NLL of generated text and branch-pmass of a forced format token stay near base at calibrated alpha across methods.
+
+The 2x check is a sanity probe, not a margin claim. Reported as: at 2x, p95 KL exceeds 1 nat for N of M cells.
+
+Honesty footnote: matched on per-token distributional disagreement under greedy decoding in the calibration window. This is one defensible notion of fairness; not equivalence on intervention norm or behavioral effect size.
+
+## Quick start
+
+```bash
+uv sync --extra all
+just smoke         # tiny-random model, ~1 min CPU
+just calibrate     # one (model, method, seed, window) cell
+just trajectory
+just table
+just plot
+just table-md
+```
+
+`just sweep` runs the full grid (3 models x 3 methods x 3 seeds x 2 windows) used by Figure 1.
+
+## What this repo does NOT do
+
+- No paper or LessWrong draft.
+- No method zoo beyond mean_diff, directional_ablation, pca.
+- No threshold sweep, no calibration-set-size sweep.
+- No tinymfv integration. Target-axis is a single contrastive sentiment / refusal pair.
@@ -0,0 +1,68 @@
+# iso-kl-figure: spec
+
+## Goal
+
+Produce one figure (Figure 1) and one table (Table 1) that empirically support three claims: iso-KL calibration converges and generalizes (C1), the calibrated coefficient is not too cold (C2), and not too hot (C3). Show stability across 3 models x 3 seeds x 2 calibration windows.
+
+## Scope
+
+In:
+- Port `measure_kl`, `calibrate_iso_kl`, minimal Vector/attach/config/target/extract from steering-lite.
+- 3 methods: `mean_diff`, `directional_ablation`, `pca`.
+- New `branch_pmass` metric: fork-and-teacher-force probability mass on a forced format answer token.
+- Scripts producing TSV/CSV; plot and table modules consuming the CSVs.
+
+Out:
+- LessWrong post or paper draft.
+- Method zoo beyond 3 methods.
+- Threshold sweep, calibration-set-size sweep, norm-matching baseline.
+- tinymfv integration.
+
+## Requirements
+
+- R1 (C1, calibration converges and generalizes): for every (method, model, seed, window), bisection terminates with calibration p95 within tolerance of 1.0; on a held-out prompt set p95 lands within [0.7, 1.4]. VERIFY: TSV row has converged=true and holdout_p95 in band; sneaky failure (overfits calibration prompts) caught by held-out column.
+- R2 (C2, not too cold): target-axis Delta logit at calibrated alpha excludes 0 with 95% CI for each method, on each model. VERIFY: Table 1 row reports CI; sneaky failure (alpha approx 0) caught by alpha column in same row.
+- R3 (C3, not too hot, NLL): base-NLL of full 50-token held-out generations stays within 2x of base at calibrated alpha; exceeds 4x of base at 2x calibrated alpha for at least one method per model. VERIFY: Table 1 base_nll_delta column.
+- R4 (C3, not too hot, branch-pmass): mean branch-pmass-of-valid-answer at fork points {0, 5, ..., 50} stays within 0.1 of base pmass at calibrated alpha; drops by more than 0.3 at 2x alpha for at least one method per model. VERIFY: Table 1 branch_pmass column and Figure 1 lower subplot.
+- R5 (sanity probe at 2x): max p95 KL at 2x alpha exceeds 1 nat in at least 2 of 3 methods on at least 2 of 3 models within 50 tokens. VERIFY: Figure 1 top subplot, alpha=2 panels show lines crossing reference.
+- R6 (stability): seed band and window-style overlay in Figure 1 do not change the qualitative C1 conclusion. VERIFY: variance band visually narrow at alpha=1.
+
+## Tasks
+
+- [/] T1 (R*): scaffold repo (pyproject, justfile, README, AGENTS, spec).
+  - verify: `just --list` lists recipes; `uv sync --extra all` resolves.
+- [ ] T2 (R1, R2, R3, R4): port core code from steering-lite (calibrate, vector, attach, config, target, extract, 3 variants).
+  - verify: imports clean; smoke test runs all 3 methods.
+- [ ] T3 (R1, R6): extend calibrate history to save per-token KL arrays (`per_t_p95`, `per_t_max`).
+  - verify: history dict contains per-token arrays of length T.
+- [ ] T4 (R4): implement `branch_pmass` (fork at token t, append fixed format suffix, teacher-force one forward, sum p over `true`/`false` tokens).
+  - verify: pmass in [0, 1]; pmass at base != pmass at coeff=large (sneaky-fail catch).
+- [ ] T5 (R1..R5): implement `run_calibrate.py`, `run_trajectory.py`, `run_table.py`.
+  - verify: CSVs created with expected columns and at least one row each on smoke.
+- [ ] T6 (R*): implement `plot.py`, `table.py`.
+  - verify: PNG saved; markdown table prints; can be regenerated from CSVs alone.
+- [ ] T7 (R*): full sweep on real models.
+  - verify: numeric asserts in R1..R5 pass.
+- [ ] T8 (R*): external review of figure + table.
+  - verify: review doc saved under docs/spec/.
+
+## Context
+
+Calibration target: p95 per-token KL(steered || base) = 1 nat over T tokens (T in {20, 50}), N=4 calibration prompts under greedy decoding.
+
+Branch-pmass procedure: at fork points t in {0, 5, ..., 50} take steered prefix of length t, append `\nAnswer (true/false): ` then `{"value": ` then teacher-force one forward under steered model, sum probabilities of token variants for `true` and `false`.
+
+Target-axis: a single contrastive pair-set built into the repo (sentiment positive vs negative or refusal yes vs no), 4 prompts each. Target Delta logit = mean over held-out prompts of difference in logit on the target token.
+
+## Log
+
+(append-only; only entries that change a future task)
+
+## TODO
+
+(out-of-scope ideas; not commitments)
+
+## Errors
+
+| Task | Error | Resolution |
+|------|-------|------------|
@@ -0,0 +1,24 @@
+set shell := ["bash", "-cu"]
+
+default:
+    @just --list
+
+# Smoke: tiny-random Llama, all 3 methods, asserts nonzero KL + branch-pmass changes with coeff.
+smoke:
+    BEARTYPE=1 uv run --extra all pytest -q tests/test_smoke.py
+
+test:
+    uv run --extra all pytest -q
+
+# Run one (model, method, seed, window) cell end-to-end (calibrate + trajectory + pmass).
+cell model="Qwen/Qwen2.5-0.5B-Instruct" method="mean_diff" seed="0" window="50":
+    uv run --extra all python scripts/run_cell.py \
+        --model {{model}} --method {{method}} --seed {{seed}} --window {{window}}
+
+# Sweep model x method x seed x window cells.
+sweep:
+    bash scripts/sweep.sh
+
+# Aggregate all outputs/<run_id>/ into figs/figure1.png + figs/table.md.
+aggregate:
+    uv run --extra all python scripts/aggregate.py --runs-root outputs --out figs
@@ -0,0 +1,36 @@
+[project]
+name = "iso-kl-figure"
+version = "0.0.1"
+description = "Minimal repo: produce one figure + one table proving iso-KL calibration is stable across models/seeds/windows."
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "torch>=2.1",
+    "numpy>=1.26",
+    "einops>=0.7",
+    "jaxtyping>=0.2.34",
+    "safetensors>=0.5",
+    "loguru>=0.7",
+    "tqdm>=4.66",
+]
+
+[project.optional-dependencies]
+test = ["pytest", "tabulate", "beartype>=0.18"]
+hf = ["accelerate>=1.6", "transformers>=4.51"]
+plot = ["matplotlib>=3.8", "polars>=1.0"]
+all = [
+    "pytest", "tabulate", "beartype>=0.18",
+    "accelerate>=1.6", "transformers>=4.51",
+    "matplotlib>=3.8", "polars>=1.0",
+    "tyro>=0.9",
+]
+
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
+[tool.ruff.lint]
+ignore = ["F722"]  # jaxtyping shape strings
@@ -0,0 +1,149 @@
+"""Aggregate per-cell outputs into Figure 1 + the headline table.
+
+Figure 1: two stacked subplots.
+  Top: per-token p95 KL trajectory. x = token offset; y = KL(steer || base).
+       Colour by method, linestyle by alpha (solid=1, dashed=2), seed bands
+       as thin lines, faceted by model. Horizontal at target_kl=1.
+  Bottom: branch-pmass at fork points. x = fork token offset; y = mean pmass
+       across held-out prompts; bands = +/- 1 std across seeds.
+
+Table: one row per (model, method), columns = c_star (mean +/- std across seeds),
+  KL_p95 @ alpha=1, KL_p95 @ alpha=2, pmass @ alpha=1, pmass @ alpha=2.
+
+Usage:
+  python scripts/aggregate.py --runs_root outputs --out figs/
+"""
+from __future__ import annotations
+import json
+from collections import defaultdict
+from dataclasses import dataclass
+from pathlib import Path
+
+import polars as pl
+import tyro
+from loguru import logger
+
+
+@dataclass
+class Args:
+    runs_root: str = "outputs"
+    out: str = "figs"
+
+
+def load_cells(root: Path) -> list[dict]:
+    cells = []
+    for d in sorted(root.iterdir()):
+        if not d.is_dir():
+            continue
+        calib = d / "calib.json"
+        if not calib.exists():
+            continue
+        meta = json.loads(calib.read_text())
+        traj = json.loads((d / "trajectory.json").read_text())
+        pmass = json.loads((d / "pmass.json").read_text())
+        cells.append({"id": d.name, **meta, "traj": traj, "pmass": pmass})
+    return cells
+
+
+def make_table(cells: list[dict]) -> pl.DataFrame:
+    rows = []
+    by_mm = defaultdict(list)
+    for c in cells:
+        by_mm[(c["model"], c["method"])].append(c)
+    for (model, method), group in by_mm.items():
+        c_stars = [g["c_star"] for g in group]
+        # pmass: mean over fork_points and prompts at each alpha, then across seeds
+        for alpha in ("1.0", "2.0"):
+            kls = []
+            pms = []
+            for g in group:
+                kls.append(g["traj"]["per_t_p95_kl"][alpha])
+                pms.append(g["pmass"]["pmass"][alpha])
+            kls_flat = [x for arr in kls for x in arr]
+            pms_flat = [x for prompt in pms for arr in prompt for x in arr]
+            rows.append({
+                "model": model.split("/")[-1],
+                "method": method,
+                "alpha": float(alpha),
+                "c_star_mean": sum(c_stars) / len(c_stars),
+                "n_seeds": len(group),
+                "kl_p95_mean": sum(kls_flat) / max(len(kls_flat), 1),
+                "pmass_mean": sum(pms_flat) / max(len(pms_flat), 1),
+            })
+    return pl.DataFrame(rows)
+
+
+def make_figure(cells: list[dict], out_path: Path) -> None:
+    import matplotlib.pyplot as plt
+    import numpy as np
+    models = sorted({c["model"] for c in cells})
+    methods = sorted({c["method"] for c in cells})
+    fig, axes = plt.subplots(2, len(models), figsize=(5 * len(models), 7),
+                             sharex="col", squeeze=False)
+    cmap = plt.get_cmap("tab10")
+    method_color = {m: cmap(i) for i, m in enumerate(methods)}
+
+    for ci, model in enumerate(models):
+        ax_kl = axes[0, ci]
+        ax_pm = axes[1, ci]
+        ax_kl.set_title(model.split("/")[-1])
+        ax_kl.axhline(1.0, color="black", linestyle=":", linewidth=0.8, alpha=0.5)
+        ax_kl.set_ylabel("p95 KL(steer || base)")
+        ax_pm.set_xlabel("token offset")
+        ax_pm.set_ylabel("branch pmass")
+        ax_pm.set_ylim(-0.02, 1.02)
+        ax_kl.set_yscale("log")
+
+        for method in methods:
+            for alpha, ls in [("1.0", "-"), ("2.0", "--")]:
+                kls = [c["traj"]["per_t_p95_kl"][alpha]
+                       for c in cells if c["model"] == model and c["method"] == method]
+                if not kls:
+                    continue
+                arr = np.array(kls)
+                x = np.arange(arr.shape[1])
+                ax_kl.plot(x, arr.mean(0), color=method_color[method],
+                           linestyle=ls, linewidth=2,
+                           label=f"{method} a={alpha}")
+                if arr.shape[0] > 1:
+                    ax_kl.fill_between(x, arr.min(0), arr.max(0),
+                                       color=method_color[method], alpha=0.12)
+
+                pms = [c["pmass"]["pmass"][alpha]
+                       for c in cells if c["model"] == model and c["method"] == method]
+                if not pms:
+                    continue
+                # pms: list of (n_seed) of (n_prompt) of (n_fork)
+                pms_arr = np.array(pms)  # (n_seed, n_prompt, n_fork)
+                fork = cells[0]["pmass"]["fork_points"]
+                mean = pms_arr.mean(axis=(0, 1))
+                std = pms_arr.std(axis=(0, 1))
+                ax_pm.plot(fork, mean, color=method_color[method],
+                           linestyle=ls, linewidth=2)
+                ax_pm.fill_between(fork, mean - std, mean + std,
+                                   color=method_color[method], alpha=0.12)
+        if ci == 0:
+            ax_kl.legend(fontsize=8, loc="upper left")
+    fig.tight_layout()
+    fig.savefig(out_path, dpi=150, bbox_inches="tight")
+    logger.info(f"figure -> {out_path}")
+
+
+def main(a: Args):
+    out = Path(a.out); out.mkdir(parents=True, exist_ok=True)
+    cells = load_cells(Path(a.runs_root))
+    if not cells:
+        raise SystemExit(f"no cells under {a.runs_root}")
+    logger.info(f"loaded {len(cells)} cells")
+
+    df = make_table(cells)
+    df.write_csv(out / "table.csv")
+    md = df.to_pandas().to_markdown(index=False, floatfmt=".3f")
+    (out / "table.md").write_text(md)
+    logger.info(f"table -> {out/'table.md'}\n{md}")
+
+    make_figure(cells, out / "figure1.png")
+
+
+if __name__ == "__main__":
+    main(tyro.cli(Args))
@@ -0,0 +1,198 @@
+"""End-to-end runner for one (model, method, seed, window) cell.
+
+Flow:
+1. Load model + tokenizer (HF), set seed.
+2. Build pos/neg prompts (cheap pair); train the steering Vector v.
+3. Calibrate iso-KL at target_kl=1 over T=window tokens. Save full history
+   (incl. per-token KL arrays) to outputs/<run_id>/history.json.
+4. Re-run measure_kl at coeff=alpha*c_star (alpha in {1, 2}) on a held-out
+   prompt set so the trajectory plot reflects generalisation, not the
+   calibration set itself. Save per-token p95 KL to trajectory.json.
+5. For each held-out prompt, rollout T_eval tokens under the steered model,
+   then branch-pmass at fork_points {0, 5, ..., T_eval}. Save to
+   pmass.json. Use a JSON-format suffix so target tokens are well-defined.
+
+Outputs one CSV row per (alpha, prompt) into outputs/<run_id>/results.csv
+plus the artefacts above.
+
+Usage (tyro):
+  python scripts/run_cell.py --model Qwen/Qwen2.5-0.5B-Instruct --method mean_diff \
+      --seed 0 --window 50 --run_id qwen05b_md_s0_w50
+"""
+from __future__ import annotations
+import json
+import os
+from dataclasses import dataclass, field
+from pathlib import Path
+
+import torch
+import tyro
+from loguru import logger
+
+from iso_kl_figure import (
+    SteeringConfig, MeanDiffC, PCAC, DirectionalAblationC,
+    train, calibrate_iso_kl, measure_kl,
+)
+from iso_kl_figure.branch_pmass import branch_pmass
+
+
+CALIB_PROMPTS = [
+    "The eiffel tower is in Paris.",
+    "Tell me a greentext story about a small village during the Martian carrot bubble.",
+    "Walk me through computing the integral of x^2 from 0 to 1.",
+    "Briefly describe how plate tectonics shape mountain ranges.",
+]
+
+# Held-out prompts -- different from calibration; tests generalisation.
+EVAL_PROMPTS = [
+    "Explain why the sky is blue in two short sentences.",
+    "Summarise the plot of Hamlet in one paragraph.",
+    "List three causes of inflation.",
+    "Write a haiku about an octopus librarian.",
+]
+
+# Pos/neg pair: a generic content-vs-refusal direction. Cheap, not the main
+# point -- the figure tests calibration *behaviour*, not direction quality.
+POS_NEG = [
+    ("Sure, here is the answer:", "I cannot help with that."),
+    ("Here are some details:", "Sorry, I can't assist with that."),
+    ("Of course, let me explain.", "I won't be able to help."),
+    ("Yes, that makes sense.", "No, I have to decline."),
+]
+
+
+METHOD_MAP = {
+    "mean_diff": MeanDiffC,
+    "pca": PCAC,
+    "directional_ablation": DirectionalAblationC,
+}
+
+
+@dataclass
+class Args:
+    model: str
+    method: str
+    seed: int = 0
+    window: int = 50
+    run_id: str = ""
+    layer_frac: float = 0.6
+    target_kl: float = 1.0
+    out_root: str = "outputs"
+    device: str = "cuda"
+    dtype: str = "bfloat16"
+    suffix_str: str = ' Final answer in JSON: {"value": '
+    target_words: list[str] = field(default_factory=lambda: ["true", "false", "yes", "no"])
+    fork_step: int = 5
+
+
+def _set_seed(s: int):
+    import random
+    import numpy as np
+    random.seed(s); np.random.seed(s); torch.manual_seed(s)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(s)
+
+
+def main(a: Args):
+    if not a.run_id:
+        a.run_id = f"{a.model.split('/')[-1]}_{a.method}_s{a.seed}_w{a.window}"
+    out_dir = Path(a.out_root) / a.run_id
+    out_dir.mkdir(parents=True, exist_ok=True)
+    logger.add(out_dir / "run.log", level="INFO")
+
+    _set_seed(a.seed)
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    dtype = getattr(torch, a.dtype)
+    tok = AutoTokenizer.from_pretrained(a.model)
+    if tok.pad_token_id is None:
+        tok.pad_token_id = tok.eos_token_id
+    model = AutoModelForCausalLM.from_pretrained(a.model, torch_dtype=dtype).to(a.device)
+    model.eval()
+
+    n_layers = model.config.num_hidden_layers
+    layer = int(a.layer_frac * n_layers)
+    logger.info(f"model={a.model} n_layers={n_layers} target_layer={layer}")
+
+    cfg_cls = METHOD_MAP[a.method]
+    cfg = cfg_cls(coeff=1.0, layers=(layer,))
+
+    pos = [tok.apply_chat_template([{"role": "user", "content": u},
+                                    {"role": "assistant", "content": p}],
+                                   tokenize=False)
+           for u, (p, _) in zip(CALIB_PROMPTS, POS_NEG)]
+    neg = [tok.apply_chat_template([{"role": "user", "content": u},
+                                    {"role": "assistant", "content": n}],
+                                   tokenize=False)
+           for u, (_, n) in zip(CALIB_PROMPTS, POS_NEG)]
+    v = train(model, tok, pos, neg, cfg, batch_size=4, max_length=128)
+
+    logger.info("=== calibrate ===")
+    c_star, history = calibrate_iso_kl(
+        v, model, tok, CALIB_PROMPTS,
+        target_kl=a.target_kl, target_stat="kl_p95",
+        T=a.window, device=a.device,
+    )
+    v.cfg.coeff = c_star
+    logger.info(f"c_star = {c_star:+.4f}")
+    (out_dir / "history.json").write_text(json.dumps(history, indent=2))
+    (out_dir / "calib.json").write_text(json.dumps({
+        "c_star": c_star, "target_kl": a.target_kl, "window": a.window,
+        "method": a.method, "model": a.model, "seed": a.seed, "layer": layer,
+    }, indent=2))
+
+    # -- trajectory + pmass at alpha in {1, 2} on held-out prompts
+    rows = []
+    fork_points = list(range(0, a.window + 1, a.fork_step))
+    trajectory: dict[str, list] = {}
+    pmass_all: dict[str, list] = {}
+    for alpha in (1.0, 2.0):
+        v.cfg.coeff = alpha * c_star
+        logger.info(f"=== eval alpha={alpha} c={v.cfg.coeff:+.4f} ===")
+        m = measure_kl(v, model, tok, EVAL_PROMPTS, T=a.window, device=a.device)
+        trajectory[str(alpha)] = m["per_t_p95"]
+        rows.append({"alpha": alpha, "coeff": v.cfg.coeff, "kl_p95": m["kl_p95"],
+                     "kl_mean": m["kl_mean"], "kl_max": m["kl_max"]})
+
+        # pmass per held-out prompt
+        pm_for_alpha = []
+        for p in EVAL_PROMPTS:
+            ids = tok.apply_chat_template(
+                [{"role": "user", "content": p}],
+                add_generation_prompt=True, return_tensors="pt",
+            ).input_ids[0]
+            pad = tok.pad_token_id
+            with v(model):
+                gen = model.generate(
+                    ids.unsqueeze(0).to(a.device),
+                    max_new_tokens=a.window,
+                    pad_token_id=pad, eos_token_id=tok.eos_token_id,
+                    do_sample=False,
+                )[0, ids.shape[0]:]
+            pm = branch_pmass(
+                v, model, tok, ids, gen, fork_points,
+                a.suffix_str, a.target_words, device=a.device,
+            )
+            pm_for_alpha.append(pm["pmass"])
+        pmass_all[str(alpha)] = pm_for_alpha
+
+    (out_dir / "trajectory.json").write_text(json.dumps({
+        "fork_points_full": list(range(a.window)),
+        "per_t_p95_kl": trajectory,
+    }, indent=2))
+    (out_dir / "pmass.json").write_text(json.dumps({
+        "fork_points": fork_points,
+        "pmass": pmass_all,
+        "suffix": a.suffix_str,
+        "target_words": a.target_words,
+    }, indent=2))
+    import csv
+    with open(out_dir / "results.csv", "w", newline="") as f:
+        w = csv.DictWriter(f, fieldnames=["alpha", "coeff", "kl_p95", "kl_mean", "kl_max"])
+        w.writeheader()
+        for r in rows:
+            w.writerow(r)
+    logger.info(f"DONE -> {out_dir}")
+
+
+if __name__ == "__main__":
+    main(tyro.cli(Args))
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# Sweep: model x method x seed x window. Edit the lists to taste.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODELS=("Qwen/Qwen2.5-0.5B-Instruct" "Qwen/Qwen2.5-1.5B-Instruct" "meta-llama/Llama-3.2-1B-Instruct")
+METHODS=("mean_diff" "directional_ablation" "pca")
+SEEDS=(0 1 2)
+WINDOWS=(20 50)
+
+for model in "${MODELS[@]}"; do
+  for method in "${METHODS[@]}"; do
+    for seed in "${SEEDS[@]}"; do
+      for window in "${WINDOWS[@]}"; do
+        run_id="$(basename "$model")_${method}_s${seed}_w${window}"
+        if [ -f "outputs/${run_id}/calib.json" ]; then
+          echo "skip ${run_id}"; continue
+        fi
+        echo "=== ${run_id} ==="
+        uv run --extra all python scripts/run_cell.py \
+          --model "$model" --method "$method" --seed "$seed" --window "$window"
+      done
+    done
+  done
+done
+
+uv run --extra all python scripts/aggregate.py --runs-root outputs --out figs
@@ -0,0 +1,34 @@
+import os as _os
+
+if _os.environ.get("BEARTYPE"):
+    from beartype.claw import beartype_this_package as _bt
+    _bt()
+
+from .config import SteeringConfig, REGISTRY, register
+from .extract import record_activations
+from .attach import attach, detach, save, load, train
+from .calibrate import measure_kl, calibrate_iso_kl
+from . import variants  # noqa: F401  triggers method + config registration
+from .vector import Vector
+
+from .variants.mean_diff import MeanDiffC
+from .variants.pca import PCAC
+from .variants.directional_ablation import DirectionalAblationC
+
+__all__ = [
+    "SteeringConfig",
+    "MeanDiffC",
+    "PCAC",
+    "DirectionalAblationC",
+    "record_activations",
+    "train",
+    "attach",
+    "detach",
+    "save",
+    "load",
+    "measure_kl",
+    "calibrate_iso_kl",
+    "REGISTRY",
+    "register",
+    "Vector",
+]
@@ -0,0 +1,243 @@
+"""attach / detach / save / load. The whole runtime.
+
+Variant protocol (uniform across both hook paths):
+
+    apply(mod, x, y, state, cfg) -> y_new
+
+`mod` is the hooked module itself (a transformer block or a Linear); `x` is
+its input, `y` its output. Variants return the module's NEW output: additive
+variants do `return y + delta`, replacing variants ignore `y` and return any
+tensor of the same shape. Same contract as lora-lite's `Variant.forward`.
+
+Two hook paths, dispatched on `cfg.target_submodule`:
+
+  - `target_submodule is None` (default): hook each transformer block's
+    forward output. `mod = block`, `x = args[0]` (input residual),
+    `y = out[0]` (output hidden_states). State keyed by `int` (block index).
+  - `target_submodule = <regex>`: hook every nn.Linear in each selected block
+    whose dotted path matches the regex. `mod = linear`, `x` is the Linear's
+    input, `y` its output. State keyed by `str` (full dotted name like
+    `"layers.5.mlp.down_proj"`).
+"""
+from __future__ import annotations
+import json
+import torch
+from torch import nn
+from torch.utils.hooks import RemovableHandle
+
+from .config import SteeringConfig, REGISTRY
+from .target import find_targets
+from .extract import record_activations
+
+
+_ATTACHED_ATTR = "_steering_lite_attached"
+_STATE_PREFIX = "_steering_state_"
+_SUB_KEY_PREFIX = "sub::"  # safetensors key prefix marking submodule-level state
+_SUB_KEY_SEP = "::"        # separator between full_name and state_key
+
+
+def _gather_state(mod) -> dict[str, torch.Tensor]:
+    return {
+        k[len(_STATE_PREFIX):]: getattr(mod, k)
+        for k in dir(mod)
+        if k.startswith(_STATE_PREFIX) and isinstance(getattr(mod, k, None), torch.Tensor)
+    }
+
+
+def _hook(mod, args, out):
+    """Forward hook for block-level variants. Block forward returns a tuple
+    `(hidden_states, ...)`; we replace `[0]` with the variant's output."""
+    cfg: SteeringConfig = mod._steering_cfg
+    method = mod._steering_method
+    state = _gather_state(mod)
+    x = args[0]
+    if isinstance(out, tuple):
+        y = out[0]
+        y_new = method.apply(mod, x, y, state, cfg)
+        return (y_new,) + out[1:]
+    return method.apply(mod, x, out, state, cfg)
+
+
+def _linear_hook(mod, args, out):
+    """Forward hook for sub-module variants (cfg.target_submodule is set).
+    `out` is the Linear's output tensor (not a tuple)."""
+    cfg: SteeringConfig = mod._steering_cfg
+    method = mod._steering_method
+    state = _gather_state(mod)
+    return method.apply(mod, args[0], out, state, cfg)
+
+
+def _install_state(mod: nn.Module, state: dict[str, torch.Tensor], cfg: SteeringConfig) -> None:
+    for k, v in state.items():
+        attr = _STATE_PREFIX + k
+        if hasattr(mod, attr):
+            raise RuntimeError(f"module already has {attr}; detach first")
+        mod.register_buffer(attr, v.to(cfg.dtype), persistent=True)
+
+
+def attach(
+    model: nn.Module,
+    cfg: SteeringConfig,
+    vectors,
+) -> list[RemovableHandle]:
+    """Install per-target state as buffers and register forward hooks.
+
+    `vectors` shape depends on cfg.target_submodule:
+      - None: dict[int, dict[str, Tensor]] keyed by block layer index.
+      - regex set: dict[str, dict[str, Tensor]] keyed by full dotted name.
+
+    Accepts a `Vector` (auto-unwrapped to its `.state`).
+    """
+    from .vector import Vector
+    if isinstance(vectors, Vector):
+        vectors = vectors.state
+    if cfg.method not in REGISTRY:
+        raise KeyError(f"unknown method {cfg.method!r}; registered: {list(REGISTRY)}")
+    method = REGISTRY[cfg.method]
+    # variant-level default target_submodule (e.g. sspace -> residual writers)
+    if cfg.target_submodule is None and getattr(method, "default_target_submodule", None):
+        cfg.target_submodule = method.default_target_submodule
+    requires_linear = cfg.target_submodule is not None
+    targets = find_targets(model, cfg)
+    if not targets:
+        raise RuntimeError("no target layers matched cfg")
+
+    handles: list[RemovableHandle] = []
+    attached_names: list[str] = []
+    hooked_modules: list[nn.Module] = []
+    for full_name, mod, li in targets:
+        key = full_name if requires_linear else li
+        if key not in vectors:
+            raise KeyError(f"vectors missing key {key!r}; have {sorted(vectors)}")
+        _install_state(mod, vectors[key], cfg)
+        mod._steering_cfg = cfg
+        mod._steering_method = method
+        if requires_linear:
+            mod._steering_module_name = full_name
+            hooked_modules.append(mod)
+            handles.append(mod.register_forward_hook(_linear_hook))
+        else:
+            mod._steering_layer_idx = li
+            handles.append(mod.register_forward_hook(_hook))
+        attached_names.append(full_name)
+
+    setattr(model, _ATTACHED_ATTR, {
+        "cfg": cfg, "targets": attached_names, "handles": handles,
+        "hooked_modules": hooked_modules,
+    })
+    return handles
+
+
+def detach(model: nn.Module) -> None:
+    state = getattr(model, _ATTACHED_ATTR, None)
+    if state is None:
+        return
+    for h in state["handles"]:
+        h.remove()
+    for _, mod in model.named_modules():
+        if not hasattr(mod, "_steering_method"):
+            continue
+        for k in [k for k in list(mod._buffers) if k.startswith(_STATE_PREFIX)]:
+            del mod._buffers[k]
+        for attr in (
+            "_steering_cfg", "_steering_method",
+            "_steering_layer_idx", "_steering_module_name",
+        ):
+            if hasattr(mod, attr):
+                delattr(mod, attr)
+    delattr(model, _ATTACHED_ATTR)
+
+
+def _log_extract_demo(tok, pos_prompts: list[str], neg_prompts: list[str]) -> None:
+    """One trace per stage: decoded full prompt + special tokens, for format debugging."""
+    from loguru import logger
+    pos = pos_prompts[0]
+    neg = neg_prompts[0]
+    logger.info(
+        "EXPECT: POS and NEG share user_msg + suffix; differ only in system persona; "
+        "chat template applied; special tokens (e.g. <|im_start|>) visible.\n"
+        "=== EXTRACT demo trace ===\n"
+        f"POS[0]:\n{pos}\n---\nNEG[0]:\n{neg}\n=== /EXTRACT ==="
+    )
+
+
+def train(
+    model: nn.Module,
+    tok,
+    pos_prompts: list[str],
+    neg_prompts: list[str],
+    cfg: SteeringConfig,
+    *,
+    batch_size: int = 8,
+    max_length: int = 256,
+):
+    """Extract activations + run method.extract -> Vector. Block-level only.
+
+    Stripped from steering-lite: no submodule regex hooks, no attn pooling.
+    """
+    from .vector import Vector
+    _log_extract_demo(tok, pos_prompts, neg_prompts)
+    method = REGISTRY[cfg.method]
+    targets = find_targets(model, cfg)
+    layers = tuple(li for _, _, li in targets)
+    pos_acts = record_activations(model, tok, pos_prompts, layers, batch_size=batch_size, max_length=max_length)
+    neg_acts = record_activations(model, tok, neg_prompts, layers, batch_size=batch_size, max_length=max_length)
+    state = method.extract(pos_acts, neg_acts, cfg)
+    return Vector(cfg, state)
+
+
+def _state_to_safetensors_dict(model: nn.Module) -> dict:
+    """Serialise installed state buffers from all hooked modules. Forks on
+    whether the module is block-level (_steering_layer_idx) or submodule-level
+    (_steering_module_name); keys distinguish the two so load() can rebuild."""
+    sd = {}
+    for _, mod in model.named_modules():
+        if not hasattr(mod, "_steering_method"):
+            continue
+        if hasattr(mod, "_steering_module_name"):
+            full_name = mod._steering_module_name
+            for k, v in mod._buffers.items():
+                if k.startswith(_STATE_PREFIX):
+                    sd[f"{_SUB_KEY_PREFIX}{full_name}{_SUB_KEY_SEP}{k[len(_STATE_PREFIX):]}"] = v.detach().cpu()
+        elif hasattr(mod, "_steering_layer_idx"):
+            li = mod._steering_layer_idx
+            for k, v in mod._buffers.items():
+                if k.startswith(_STATE_PREFIX):
+                    sd[f"layer{li}.{k[len(_STATE_PREFIX):]}"] = v.detach().cpu()
+    return sd
+
+
+def _safetensors_dict_to_state(sd: dict[str, torch.Tensor]) -> dict:
+    """Inverse of _state_to_safetensors_dict. Returns dict keyed by int (block-level)
+    or str (submodule-level), depending on the prefix of each key."""
+    vectors: dict = {}
+    for k, v in sd.items():
+        if k.startswith(_SUB_KEY_PREFIX):
+            rest = k[len(_SUB_KEY_PREFIX):]
+            full_name, _, state_key = rest.rpartition(_SUB_KEY_SEP)
+            vectors.setdefault(full_name, {})[state_key] = v
+        else:
+            layer_part, _, sub = k.partition(".")
+            li = int(layer_part.removeprefix("layer"))
+            vectors.setdefault(li, {})[sub] = v
+    return vectors
+
+
+def save(model: nn.Module, path: str) -> None:
+    state = getattr(model, _ATTACHED_ATTR, None)
+    if state is None:
+        raise RuntimeError("no steering attached; call attach() first")
+    sd = _state_to_safetensors_dict(model)
+    metadata = {"cfg": json.dumps(state["cfg"].to_dict())}
+    from safetensors.torch import save_file
+    save_file(sd, path, metadata=metadata)
+
+
+def load(model: nn.Module, path: str) -> list[RemovableHandle]:
+    from safetensors.torch import load_file, safe_open
+    with safe_open(path, framework="pt", device="cpu") as f:
+        metadata = f.metadata()
+    sd = load_file(path, device="cpu")
+    cfg = SteeringConfig.from_dict(json.loads(metadata["cfg"]))
+    vectors = _safetensors_dict_to_state(sd)
+    return attach(model, cfg, vectors)
@@ -0,0 +1,90 @@
+"""Branch-and-teacher-force pmass: coherence metric for steered generation.
+
+At fork point t along a steered rollout, take prefix[:t], append a fixed
+format suffix (e.g. `{"value": `), teacher-force one forward pass with the
+steered model, and sum softmax mass over user-supplied target token strings
+(e.g. `["true", "false"]`). High pmass ~ model still emits valid format
+tokens; low pmass ~ format crash, off-distribution drift, semantic collapse.
+
+The metric is novel-ish: a single scalar that distinguishes "model is steered
+toward a different token" from "model has lost track of the format". A target
+direction can move pmass off 1.0 by reweighting between target tokens but
+should not drop pmass to ~0. A miscalibrated coeff drops pmass to noise.
+
+Returns Float[Tensor, "f"] over fork points.
+"""
+from __future__ import annotations
+from typing import Sequence
+
+import torch
+from torch import nn, Tensor
+
+from .vector import Vector
+
+
+def _all_token_ids(tok, words: Sequence[str]) -> list[int]:
+    """Collect the leading token id for each word in several capitalisation /
+    leading-space variants. Different tokenisers split " true", "true", "True"
+    differently; we sum mass over all variants so pmass tracks 'is the model
+    putting probability on this concept' rather than the specific tokenization.
+    """
+    ids: set[int] = set()
+    for w in words:
+        for variant in (w, " " + w, w.capitalize(), " " + w.capitalize(),
+                        w.upper(), " " + w.upper()):
+            try:
+                t = tok.encode(variant, add_special_tokens=False)
+            except Exception:
+                continue
+            if len(t) >= 1:
+                ids.add(int(t[0]))
+    return sorted(ids)
+
+
+@torch.no_grad()
+def branch_pmass(
+    v: Vector,
+    model: nn.Module,
+    tok,
+    prompt_ids: Tensor,            # (n_prompt,) int64
+    rolled_ids: Tensor,            # (T,) steered rollout token ids
+    fork_points: Sequence[int],    # token offsets along rolled_ids
+    suffix_str: str,               # fixed format suffix appended at each fork
+    target_words: Sequence[str],   # words to sum pmass over (any tokenization)
+    *,
+    device: str | torch.device = "cuda",
+) -> dict:
+    """Returns {"pmass": [f], "fork_points": [f], "target_ids": [...], "suffix_ids": [...]}
+
+    Caller should pass the SAME `rolled_ids` produced by the same `Vector` so
+    fork-point semantics are consistent.
+    """
+    suffix_ids = tok.encode(suffix_str, add_special_tokens=False)
+    suffix_t = torch.tensor(suffix_ids, dtype=torch.long, device=device)
+    target_ids = _all_token_ids(tok, target_words)
+    if not target_ids:
+        raise ValueError(f"no target ids found for words={target_words}")
+    target_idx = torch.tensor(target_ids, dtype=torch.long, device=device)
+
+    pmass = []
+    pids = prompt_ids.to(device)
+    rolled = rolled_ids.to(device)
+    T = rolled.shape[0]
+
+    for t in fork_points:
+        if t > T:
+            pmass.append(float("nan"))
+            continue
+        prefix = rolled[:t]
+        full = torch.cat([pids, prefix, suffix_t]).unsqueeze(0)
+        with v(model):
+            logits = model(full).logits[0, -1].float()
+        probs = torch.softmax(logits, dim=-1)
+        pmass.append(float(probs[target_idx].sum()))
+
+    return {
+        "pmass": pmass,
+        "fork_points": list(fork_points),
+        "target_ids": target_ids,
+        "suffix_ids": suffix_ids,
+    }
@@ -0,0 +1,247 @@
+"""Iso-KL calibration with per-token KL trajectory persistence.
+
+Forked from steering-lite/calibrate.py. Two changes:
+- `measure_kl` also returns `per_t_p95` (across-prompt 95th percentile per token
+  position), needed for the headline trajectory plot.
+- `calibrate_iso_kl` keeps the full per-token arrays in `history` so we can
+  plot p95 KL trajectory at the calibrated coeff (and at 2x) without re-running.
+"""
+from __future__ import annotations
+import math
+from typing import Callable
+
+import torch
+from loguru import logger
+from torch import Tensor
+from torch import nn
+from tqdm.auto import tqdm
+
+from .config import SteeringConfig  # noqa: F401
+from .vector import Vector
+
+
+_demo_logged = {"flag": False}
+
+
+DEFAULT_MESSAGES = [
+    "The eiffel tower is in Paris",
+    "埃菲尔铁塔🗼位于天都城",
+    "Tell me a greentext story about a small village during the smaller Martion carrot bubble.",
+    "Think step by step to calculate the integral of x^2 from 0 to 1 in lean4.",
+]
+
+
+def _tokenize(prompts, tok):
+    if prompts is None:
+        prompts = DEFAULT_MESSAGES
+    if isinstance(prompts[0], str):
+        return [
+            tok.apply_chat_template(
+                [{"role": "user", "content": p}],
+                add_generation_prompt=True, return_tensors="pt",
+            ).input_ids[0]
+            for p in prompts
+        ]
+    return prompts
+
+
+@torch.no_grad()
+def _kl_per_pos(logp_steer: Tensor, logp_base: Tensor) -> Tensor:
+    p_s = logp_steer.exp()
+    return (p_s * (logp_steer - logp_base)).sum(dim=-1)
+
+
+@torch.no_grad()
+def _generate(model, prompt_ids, T, tok, do_sample, device):
+    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
+    ids = prompt_ids.unsqueeze(0).to(device)
+    out = model.generate(
+        ids, max_new_tokens=T, pad_token_id=pad_id, eos_token_id=tok.eos_token_id,
+        num_return_sequences=1, do_sample=do_sample,
+    )
+    return out[0, prompt_ids.shape[0]:]
+
+
+def _quantile(xs: list[float], q: float) -> float:
+    if not xs:
+        return 0.0
+    return float(torch.tensor(xs).quantile(q))
+
+
+@torch.no_grad()
+def measure_kl(
+    v: Vector,
+    model: nn.Module,
+    tok,
+    prompts=None,
+    *,
+    T: int = 50,
+    do_sample: bool = False,
+    device: str | torch.device = "cuda",
+) -> dict:
+    """Roll out under steering, score under base+steer. Returns scalar stats
+    plus per-token arrays (mean, p95, max) of length T.
+    """
+    prompts = _tokenize(prompts, tok)
+    all_kls = []
+    per_t = [[] for _ in range(T)]
+
+    for idx, pids in enumerate(tqdm(prompts, desc="measure_kl", mininterval=60)):
+        with v(model):
+            gen = _generate(model, pids, T, tok, do_sample, device)
+        n_gen = gen.shape[0]
+        if n_gen == 0:
+            continue
+        full_ids = torch.cat([pids.to(device), gen])
+        if idx == 0 and not _demo_logged["flag"]:
+            _demo_logged["flag"] = True
+            base_gen = _generate(model, pids, T, tok, do_sample, device)
+            base_full = torch.cat([pids.to(device), base_gen])
+            decoded_base = tok.decode(base_full, skip_special_tokens=False)
+            decoded_steer = tok.decode(full_ids, skip_special_tokens=False)
+            logger.info(
+                f"EXPECT: same prompt under c=0 vs c={v.cfg.coeff:+.4f}; both coherent; "
+                "steered should differ from base but not collapse.\n"
+                f"\n=== CALIBRATE demo trace (T={T}) ===\n"
+                f"--- BASE (c=0) ---\n{decoded_base}\n"
+                f"\n--- STEER (c={v.cfg.coeff:+.4f}) ---\n{decoded_steer}\n"
+                f"=== /CALIBRATE ==="
+            )
+        full = full_ids.unsqueeze(0)
+        n_p = pids.shape[0]
+
+        logp_base = torch.log_softmax(model(full).logits.float(), dim=-1)[0]
+        with v(model):
+            logp_steer = torch.log_softmax(model(full).logits.float(), dim=-1)[0]
+
+        slc = slice(n_p - 1, n_p - 1 + n_gen)
+        kls = _kl_per_pos(logp_steer[slc], logp_base[slc]).cpu()
+        all_kls.append(kls)
+        for i in range(n_gen):
+            per_t[i].append(float(kls[i]))
+
+    cat = torch.cat(all_kls)
+    return {
+        "kl_mean": float(cat.mean()),
+        "kl_p50": float(cat.quantile(0.50)),
+        "kl_p90": float(cat.quantile(0.90)),
+        "kl_p95": float(cat.quantile(0.95)),
+        "kl_max": float(cat.max()),
+        "n_pos": int(cat.numel()),
+        "per_t_mean": [sum(xs) / len(xs) if xs else 0.0 for xs in per_t],
+        "per_t_p95": [_quantile(xs, 0.95) for xs in per_t],
+        "per_t_max": [max(xs) if xs else 0.0 for xs in per_t],
+    }
+
+
+def calibrate_iso_kl(
+    v: Vector,
+    model: nn.Module,
+    tok,
+    prompts=None,
+    *,
+    target_kl: float = 1.0,
+    target_stat: str = "kl_p95",
+    bracket: tuple[float, float] = (0.01, 16.0),
+    tol: float = 0.05,
+    max_iters: int = 12,
+    T: int = 50,
+    device: str | torch.device = "cuda",
+    sign: float = 1.0,
+    sign_probe: Callable[[Vector], float] | None = None,
+    sign_probe_c: float = 1.0,
+) -> tuple[float, list[dict]]:
+    """log-log Illinois bisection on `target_stat`. History keeps per-token
+    arrays so we can plot the trajectory after."""
+    _demo_logged["flag"] = False
+    prompts = _tokenize(prompts, tok)
+    history: list[dict] = []
+
+    if sign_probe is not None:
+        v.cfg.coeff = +sign_probe_c
+        score_pos = sign_probe(v)
+        v.cfg.coeff = -sign_probe_c
+        score_neg = sign_probe(v)
+        chosen = +1.0 if score_pos >= score_neg else -1.0
+        logger.info(
+            f"sign_probe: +c={sign_probe_c:+.2f} -> {score_pos:+.3f} | "
+            f"-c={-sign_probe_c:+.2f} -> {score_neg:+.3f} | "
+            f"chosen sign={chosen:+.0f}"
+        )
+        sign = sign * chosen
+
+    def eval_at(c: float) -> float:
+        v.cfg.coeff = sign * c
+        m = measure_kl(v, model, tok, prompts, T=T, do_sample=False, device=device)
+        history.append({"coeff": sign * c, "coeff_abs": c, "sign": sign, **m})
+        logger.info(f"  c={sign * c:+.4f} mean={m['kl_mean']:.3f} "
+                    f"p50={m['kl_p50']:.3f} p90={m['kl_p90']:.3f} "
+                    f"p95={m['kl_p95']:.3f} max={m['kl_max']:.3f} n={m['n_pos']}")
+        return m[target_stat]
+
+    lo, hi = bracket
+    log_target = math.log(target_kl)
+
+    mid = (lo * hi) ** 0.5
+    v_mid = eval_at(mid)
+    if v_mid < target_kl:
+        c_lo, v_lo = mid, v_mid
+        c = mid
+        c_hi, v_hi = hi, None
+        while c < hi:
+            c *= 2.0
+            val = eval_at(c)
+            if val >= target_kl:
+                c_hi, v_hi = c, val
+                break
+            c_lo, v_lo = c, val
+        else:
+            logger.warning(f"calibrate {v.cfg.method}: KL below target across bracket")
+            return sign * c, history
+    else:
+        c_hi, v_hi = mid, v_mid
+        c = mid
+        c_lo, v_lo = lo, None
+        while c > lo:
+            c /= 2.0
+            val = eval_at(c)
+            if val <= target_kl:
+                c_lo, v_lo = c, val
+                break
+            c_hi, v_hi = c, val
+        else:
+            logger.warning(f"calibrate {v.cfg.method}: KL above target across bracket")
+            return sign * c, history
+
+    stale_lo = stale_hi = 0
+    for _ in tqdm(range(max_iters), desc=f"calib {v.cfg.method}", mininterval=60, leave=False):
+        if v_lo is not None and v_hi is not None and v_lo > 0 and v_hi > 0:
+            log_c_lo, log_c_hi = math.log(c_lo), math.log(c_hi)
+            log_v_lo = math.log(v_lo) - (math.log(2) if stale_lo >= 2 else 0.0)
+            log_v_hi = math.log(v_hi) - (math.log(2) if stale_hi >= 2 else 0.0)
+            denom = log_v_hi - log_v_lo
+            if abs(denom) < 1e-6:
+                c_new = math.sqrt(c_lo * c_hi)
+            else:
+                t = (log_target - log_v_lo) / denom
+                log_c_new = log_c_lo + t * (log_c_hi - log_c_lo)
+                c_new = math.exp(log_c_new)
+                if not (c_lo < c_new < c_hi):
+                    c_new = math.sqrt(c_lo * c_hi)
+        else:
+            c_new = math.sqrt(c_lo * c_hi)
+
+        v_new = eval_at(c_new)
+        if abs(v_new - target_kl) < tol:
+            return sign * c_new, history
+        if v_new < target_kl:
+            c_lo, v_lo = c_new, v_new
+            stale_lo = 0
+            stale_hi += 1
+        else:
+            c_hi, v_hi = c_new, v_new
+            stale_hi = 0
+            stale_lo += 1
+
+    best = min(history, key=lambda h: abs(h[target_stat] - target_kl))
+    return best["coeff"], history
@@ -0,0 +1,112 @@
+"""SteeringConfig + Method protocol + registries.
+
+Each method ships its own subclass `XC(SteeringConfig)` and `XMethod` class
+under `variants/*.py` (e.g. `MeanDiffC` + `MeanDiff`). Two parallel registries
+keyed by method name: `_CONFIG_REGISTRY` for `from_dict` deserialisation,
+`REGISTRY` for the runtime extract/apply pair.
+"""
+from dataclasses import dataclass, asdict, field
+from typing import Protocol, Any
+import torch
+from torch import Tensor
+
+
+@dataclass
+class SteeringConfig:
+    """Base config. Subclass per method; do not instantiate directly."""
+    method: str = "?"
+
+    # which transformer blocks to hook (indices into model.model.layers)
+    # None = all layers
+    layers: tuple[int, ...] | None = None
+
+    # which point in the block to add at: "residual" = block output (post mlp+attn),
+    # "attn_out" = attention output, "mlp_out" = mlp output.
+    # v1 only implements "residual".
+    target: str = "residual"
+
+    # Optional dotted path of a sub-module within each target block to hook on
+    # (e.g. "mlp.down_proj"). When None, the block's forward output is hooked
+    # (default for almost all variants); when set, the sub-module's forward is
+    # hooked instead. Either way the variant's apply receives the unified
+    # (block, x, y, state, cfg) signature -- used by weight-SVD methods
+    # (sspace, sspace_ablate) that need to modify a Linear's output in low-rank
+    # S-space.
+    target_submodule: str | None = None
+
+    # steering strength at apply-time. Methods interpret it differently:
+    # additive (mean_diff, pca, sspace, chars, cosine_gated): coeff is α in `h + α v`.
+    # slerp/angle (spherical, angular_steering): coeff is the slerp t / rotation θ.
+    # blend (linear_act): coeff is the blend ratio.
+    # ablation+nudge (directional_ablation): coeff is post-ablation nudge magnitude.
+    coeff: float = 1.0
+
+    dtype: torch.dtype = torch.bfloat16
+    seed: int = 0
+
+    def to_dict(self) -> dict:
+        d = asdict(self)
+        d["dtype"] = str(self.dtype).removeprefix("torch.")
+        return d
+
+    @classmethod
+    def from_dict(cls, d: dict) -> "SteeringConfig":
+        d = dict(d)
+        name = d["method"]
+        sub = _CONFIG_REGISTRY[name]
+        d["dtype"] = getattr(torch, d["dtype"])
+        return sub(**d)
+
+
+_CONFIG_REGISTRY: dict[str, type[SteeringConfig]] = {}
+
+
+def register_config(cls: type[SteeringConfig]) -> type[SteeringConfig]:
+    """Decorator: register `cls` under its `method` default value."""
+    name = cls.__dataclass_fields__["method"].default
+    if name == "?":
+        raise ValueError(f"{cls} must override the default `method` field")
+    if name in _CONFIG_REGISTRY:
+        raise ValueError(f"config for method {name!r} already registered")
+    _CONFIG_REGISTRY[name] = cls
+    return cls
+
+
+class Method(Protocol):
+    """extract+apply pair. State tensors are registered as buffers on the hooked
+    module (block or Linear) under `_steering_state_<key>` and rebuilt into a
+    dict by the hook.
+    """
+    name: str
+
+    @staticmethod
+    def extract(
+        pos_acts: dict[int, Tensor],
+        neg_acts: dict[int, Tensor],
+        cfg: Any,
+    ) -> dict[int, dict[str, Tensor]]:
+        """Per-layer state. `pos_acts[l]` is `[n_pos, d_model]`, same for neg."""
+        ...
+
+    @staticmethod
+    def apply(
+        mod,           # the hooked module: a transformer block, or a Linear
+        x: Tensor,     # [b, s, d_in]  -- module input
+        y: Tensor,     # [b, s, d_out] -- module output
+        state: dict[str, Tensor],
+        cfg: Any,
+    ) -> Tensor:
+        """Return the module's NEW output. Additive variants: `return y + delta`.
+        Replacing variants: ignore `y`, return any tensor of shape `[b, s, d_out]`.
+        """
+        ...
+
+
+REGISTRY: dict[str, type] = {}
+
+
+def register(cls):
+    if not getattr(cls, "name", None):
+        raise ValueError(f"method {cls} missing .name")
+    REGISTRY[cls.name] = cls
+    return cls
@@ -0,0 +1,58 @@
+"""Record last non-pad-token hidden states at selected layers via forward hooks.
+
+We hook each block's forward output (it returns `(hidden_states, ...)`), gather
+the final non-padding token from `attention_mask`, and stack across prompts.
+No grad.
+"""
+from __future__ import annotations
+import torch
+from torch import nn, Tensor
+from jaxtyping import Float
+
+from .target import _get_blocks
+
+
+@torch.no_grad()
+def record_activations(
+    model: nn.Module,
+    tok,
+    prompts: list[str],
+    layers: tuple[int, ...],
+    *,
+    batch_size: int = 8,
+    max_length: int = 256,
+) -> dict[int, Float[Tensor, "n d"]]:
+    """Run prompts through model, return last-token hidden state at each layer."""
+    blocks = _get_blocks(model)
+    device = next(model.parameters()).device
+
+    bucket: dict[int, list[Tensor]] = {l: [] for l in layers}
+    captured: dict[int, Tensor] = {}
+
+    def make_hook(li: int):
+        def hook(_mod, _args, out):
+            h = out[0] if isinstance(out, tuple) else out
+            captured[li] = h
+        return hook
+
+    handles = [blocks[li].register_forward_hook(make_hook(li)) for li in layers]
+    try:
+        was_training = model.training
+        model.eval()
+        for i in range(0, len(prompts), batch_size):
+            batch = prompts[i : i + batch_size]
+            enc = tok(batch, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)
+            captured.clear()
+            model(**enc)
+            mask = enc["attention_mask"]
+            last_idx = mask.shape[1] - 1 - mask.flip([-1]).argmax(-1)
+            batch_idx = torch.arange(mask.shape[0], device=last_idx.device)
+            for li in layers:
+                bucket[li].append(captured[li][batch_idx, last_idx].detach().to("cpu"))
+        if was_training:
+            model.train()
+    finally:
+        for h in handles:
+            h.remove()
+
+    return {li: torch.cat(bucket[li], dim=0) for li in layers}
@@ -0,0 +1,148 @@
+"""Find transformer blocks (or sub-Linears) to hook.
+
+Default: hook each block's forward output (residual stream after attn+mlp).
+When `cfg.target_submodule` is set, it is interpreted as a **regex** matched
+against `block.named_modules()` paths under each selected block; matching
+`nn.Linear`s become the actual hook targets. This lets a single cfg target
+multiple Linears per block (e.g. residual writers `mlp.down_proj` AND
+`self_attn.o_proj`, or all 7 Linears in q/k/v/o/gate/up/down).
+
+Works with HF llama-family architectures (llama, qwen, mistral, etc). For other
+architectures, set `cfg.layers` to indices into whatever list lives at the path
+your model exposes -- override `_get_blocks` if needed.
+"""
+import re
+from torch import nn
+
+
+def _get_blocks(model: nn.Module) -> nn.ModuleList:
+    # llama-family: model.model.layers
+    # gemma3-multimodal: model.language_model.layers (or model.model.language_model.layers)
+    candidates = []
+    inner = getattr(model, "model", model)
+    candidates.append(inner)
+    lm = getattr(inner, "language_model", None)
+    if lm is not None:
+        candidates.append(lm)
+        candidates.append(getattr(lm, "model", lm))
+    for c in candidates:
+        blocks = getattr(c, "layers", None)
+        if blocks is not None:
+            return blocks
+    raise RuntimeError(
+        f"could not find .layers on {type(model).__name__}; "
+        f"override _get_blocks for non-llama architectures"
+    )
+
+
+def find_targets(model: nn.Module, cfg) -> list[tuple[str, nn.Module, int]]:
+    """Return [(full_name, module, layer_idx)] for hook targets selected by cfg.
+
+    - `cfg.target_submodule is None`: one entry per selected block (the block itself).
+    - `cfg.target_submodule = <regex>`: one entry per (block, matching nn.Linear).
+      Regex is matched with `re.fullmatch` against `name` from `block.named_modules()`,
+      e.g. `r"mlp\\.down_proj|self_attn\\.o_proj"` matches both residual writers,
+      `r"self_attn\\.(q|k|v|o)_proj|mlp\\.(gate|up|down)_proj"` matches all 7 Linears.
+    """
+    blocks = _get_blocks(model)
+    n = len(blocks)
+    if cfg.layers is None:
+        idxs = tuple(range(n))
+    else:
+        idxs = tuple(cfg.layers)
+        for i in idxs:
+            if not (0 <= i < n):
+                raise ValueError(f"layer {i} out of range [0, {n})")
+    sub = getattr(cfg, "target_submodule", None)
+    if sub is None:
+        return [(f"layers.{i}", blocks[i], i) for i in idxs]
+    pattern = re.compile(sub)
+    out = []
+    for i in idxs:
+        for name, mod in blocks[i].named_modules():
+            if name and pattern.fullmatch(name) and isinstance(mod, nn.Linear):
+                out.append((f"layers.{i}.{name}", mod, i))
+    if not out:
+        raise RuntimeError(
+            f"target_submodule regex {sub!r} matched no nn.Linear "
+            f"in {len(idxs)} selected blocks"
+        )
+    return out
+
+
+def find_residual_linears(
+    model: nn.Module,
+    layer_indices: tuple[int, ...] | None = None,
+    *,
+    role: str = "both",                         # "writer" | "reader" | "both"
+    fallback_regex: str | None = None,
+) -> list[tuple[str, nn.Module, int, str]]:
+    """Find Linears connected to the residual stream, per block.
+
+    Returns `[(full_name, module, layer_idx, role)]` where role is "writer"
+    (d_out == d_model, d_in != d_model) or "reader" (d_in == d_model,
+    d_out != d_model). Square Linears are ambiguous (could be either) and
+    are excluded by shape detection.
+
+    Detection:
+      1. Primary: weight.shape vs d_model.
+      2. Fallback: if shape detection finds nothing (non-llama arch, weird
+         shapes), match `fallback_regex` against `named_modules` paths and
+         guess role from substring. Default regex covers llama-family names.
+         Warns when fallback fires.
+    """
+    from loguru import logger
+    d_model = get_d_model(model)
+    blocks = _get_blocks(model)
+    n = len(blocks)
+    idxs = tuple(layer_indices) if layer_indices is not None else tuple(range(n))
+
+    out: list[tuple[str, nn.Module, int, str]] = []
+    for li in idxs:
+        for name, mod in blocks[li].named_modules():
+            if not isinstance(mod, nn.Linear):
+                continue
+            d_out, d_in = mod.weight.shape
+            is_writer = d_out == d_model and d_in != d_model
+            is_reader = d_in == d_model and d_out != d_model
+            if is_writer and role in ("writer", "both"):
+                out.append((f"layers.{li}.{name}", mod, li, "writer"))
+            elif is_reader and role in ("reader", "both"):
+                out.append((f"layers.{li}.{name}", mod, li, "reader"))
+
+    if out:
+        return out
+
+    regex = fallback_regex or r"mlp\.(down|gate|up)_proj|self_attn\.(q|k|v|o)_proj"
+    logger.warning(
+        f"shape-based residual-linear detection found nothing for d_model={d_model} "
+        f"in {len(idxs)} blocks; falling back to regex {regex!r}"
+    )
+    pattern = re.compile(regex)
+    writer_hints = ("down_proj", "o_proj")
+    for li in idxs:
+        for name, mod in blocks[li].named_modules():
+            if not (name and pattern.fullmatch(name) and isinstance(mod, nn.Linear)):
+                continue
+            role_guess = "writer" if any(h in name for h in writer_hints) else "reader"
+            if role in ("both", role_guess):
+                out.append((f"layers.{li}.{name}", mod, li, role_guess))
+
+    if not out:
+        logger.warning(
+            f"regex fallback {regex!r} also matched no Linears in layers {idxs}; "
+            "super_sspace will have an empty basis"
+        )
+    return out
+
+
+def get_d_model(model: nn.Module) -> int:
+    cfg = getattr(model, "config", None)
+    d = getattr(cfg, "hidden_size", None)
+    if d is None:
+        # multimodal configs (gemma3): text sub-config
+        text_cfg = getattr(cfg, "text_config", None)
+        d = getattr(text_cfg, "hidden_size", None)
+    if d is None:
+        raise RuntimeError("model has no .config.hidden_size")
+    return int(d)
@@ -0,0 +1,6 @@
+"""Variant registry. Importing this package triggers @register_config + @register
+side effects in the variant modules.
+"""
+from . import mean_diff  # noqa: F401
+from . import pca  # noqa: F401
+from . import directional_ablation  # noqa: F401
@@ -0,0 +1,75 @@
+"""Mean-diff directional ablation (Arditi-inspired projection-out).
+
+Project the steering direction *out of* the residual stream instead of (or in
+addition to) adding to it. Unlike `mean_diff` which translates by $\\alpha v$,
+ablation removes the component of $h$ along $\\hat v$:
+
+$$h \\leftarrow h - (h \\cdot \\hat v)\\hat v + \\alpha\\hat v$$
+
+When `coeff=0` this is pure ablation (refusal-direction style); when `coeff!=0`
+this is ablation followed by a constant nudge (useful to ablate "old" behavior
+and inject "new"). The two terms are mathematically distinct -- ablation is a
+*projection* (idempotent), addition is a *translation*.
+
+Norms shrink by $|h \\cdot \\hat v|$ which is informative -- a near-zero shrink
+means the direction wasn't present in the first place, so the intervention is
+a no-op. Compare to `mean_diff` which always pays a constant $\\alpha\\|\\hat v\\|$
+per token regardless of whether the direction is present.
+
+Refs / inspiration:
+  - Arditi et al. 2024 "Refusal in language models is mediated by a single direction"
+    https://arxiv.org/abs/2406.11717
+  - andyrdt/refusal_direction https://github.com/andyrdt/refusal_direction
+"""
+from dataclasses import dataclass
+from einops import einsum
+from jaxtyping import Float
+from torch import Tensor
+
+from ..config import SteeringConfig, register_config, register
+
+
+ε = 1e-8
+
+
+@register_config
+@dataclass
+class DirectionalAblationC(SteeringConfig):
+    method: str = "directional_ablation"
+    coeff: float = 0.0  # post-ablation additive nudge along v_hat; 0.0 = pure ablation
+
+
+@register
+class DirectionalAblation:
+    name = "directional_ablation"
+
+    @staticmethod
+    def extract(
+        pos_acts: dict[int, Float[Tensor, "n d"]],
+        neg_acts: dict[int, Float[Tensor, "m d"]],
+        cfg: DirectionalAblationC,
+    ) -> dict[int, dict[str, Tensor]]:
+        out = {}
+        for li in pos_acts:
+            v = pos_acts[li].float().mean(0) - neg_acts[li].float().mean(0)
+            v = v / (v.norm() + ε)
+
+            out[li] = {"v": v}
+        return out
+
+    @staticmethod
+    def apply(
+        mod,
+        x: Float[Tensor, "b s d"],
+        y: Float[Tensor, "b s d"],
+        state: dict[str, Tensor],
+        cfg: DirectionalAblationC,
+    ) -> Float[Tensor, "b s d"]:
+        v = state["v"].to(y)  # unit
+
+        proj = einsum(y, v, "b s d, d -> b s")
+        y    = y - proj.unsqueeze(-1) * v          # ablate
+
+        if cfg.coeff != 0.0:
+            y = y + cfg.coeff * v
+        return y
@@ -0,0 +1,82 @@
+"""Mean-difference steering (CAA / ActAdd).
+
+For each selected layer L, compute the mean difference between positive and
+negative last-token hidden states:
+
+$$v_L = \\text{mean}(h^+_L) - \\text{mean}(h^-_L), \\quad \\hat{v}_L = v_L / \\|v_L\\|$$
+
+At runtime, add `coeff * v_hat` to every token's residual at that block:
+
+$$h \\leftarrow h + \\alpha \\cdot \\hat{v}_L$$
+
+This is the same operation as CAA (Panickssery 2023, contrastive MCQ pairs)
+and ActAdd (Turner 2023, single prompt-pair); the differences are conventional
+not mathematical, so we register one method.
+
+`subtract_corpus_mean=True` toggles Jorgensen 2024 mean-centring: target mean
+minus pos∪neg corpus mean. Direction-identical to plain mean_diff under
+normalization with equal-size groups; kept as a flag rather than a separate
+method.
+
+Refs:
+  - Panickssery 2023 (CAA) https://arxiv.org/abs/2312.06681
+  - Turner 2023 (ActAdd) https://arxiv.org/abs/2308.10248
+  - Jorgensen 2024 (Mean-Centring) https://arxiv.org/abs/2312.03813
+  - nrimsky/CAA https://github.com/nrimsky/CAA
+"""
+from dataclasses import dataclass
+import torch
+from jaxtyping import Float
+from torch import Tensor
+
+from ..config import SteeringConfig, register_config, register
+
+
+ε = 1e-8
+
+
+@register_config
+@dataclass
+class MeanDiffC(SteeringConfig):
+    method: str = "mean_diff"
+    normalize: bool = True
+    subtract_corpus_mean: bool = False
+
+
+@register
+class MeanDiff:
+    name = "mean_diff"
+
+    @staticmethod
+    def extract(
+        pos_acts: dict[int, Float[Tensor, "n d"]],
+        neg_acts: dict[int, Float[Tensor, "m d"]],
+        cfg: MeanDiffC,
+    ) -> dict[int, dict[str, Tensor]]:
+        out = {}
+        for li in pos_acts:
+            p = pos_acts[li].float()
+            n = neg_acts[li].float()
+
+            if cfg.subtract_corpus_mean:
+                mu = torch.cat([p, n], dim=0).mean(0)
+                v = p.mean(0) - mu
+            else:
+                v = p.mean(0) - n.mean(0)
+
+            if cfg.normalize:
+                v = v / (v.norm() + ε)
+
+            out[li] = {"v": v}
+        return out
+
+    @staticmethod
+    def apply(
+        mod,
+        x: Float[Tensor, "b s d"],
+        y: Float[Tensor, "b s d"],
+        state: dict[str, Tensor],
+        cfg: MeanDiffC,
+    ) -> Float[Tensor, "b s d"]:
+        v = state["v"].to(y)
+        return y + cfg.coeff * v
@@ -0,0 +1,101 @@
+"""PCA steering (RepE/LAT-inspired, vgel pca_diff-like).
+
+For each layer L, compute PCA on the **paired differences** `h^+ - h^-`. Take
+the top principal component as the steering direction.
+
+$$D_L = H^+_L - H^-_L \\in \\mathbb{R}^{n\\times d}$$
+$$U, S, V^T = \\text{SVD}(D_L - \\bar{D}_L)$$
+$$\\text{sign}_L = \\text{sign}\\left(\\sum_i \\mathbb{1}[(D_L)_i \\cdot V_{:,0} > 0] - n/2\\right)$$
+$$v_L = V_{:,0} \\cdot \\text{sign}_L$$
+
+Sign-fixed by majority vote of paired-diff projections (repeng/vgel style).
+This is a lightweight control-vector baseline, not the full Zou et al. LAT
+reader: it omits per-diff normalization, label-based sign selection, and
+train-mean recentering for reading scores.
+PCA is sign-ambiguous; the vote is more robust than alignment-with-the-mean
+when paired diffs are heterogeneous (mean can cancel without the vote
+changing). If the vote ties exactly, orient the axis so the largest centered
+projection is positive.
+
+At runtime, add `coeff * v_L` to the residual.
+
+Refs:
+  - Zou et al. 2023 (Representation Engineering) https://arxiv.org/abs/2310.01405
+  - vgel/repeng: https://github.com/vgel/repeng
+"""
+from dataclasses import dataclass
+import torch
+from jaxtyping import Float
+from torch import Tensor
+
+from ..config import SteeringConfig, register_config, register
+
+
+ε = 1e-8
+
+
+@register_config
+@dataclass
+class PCAC(SteeringConfig):
+    method: str = "pca"
+    n_components: int = 1
+    normalize: bool = True
+
+
+@register
+class PCA:
+    name = "pca"
+
+    @staticmethod
+    def extract(
+        pos_acts: dict[int, Float[Tensor, "n d"]],
+        neg_acts: dict[int, Float[Tensor, "n d"]],
+        cfg: PCAC,
+    ) -> dict[int, dict[str, Tensor]]:
+        out = {}
+        for li in pos_acts:
+            if pos_acts[li].shape[0] != neg_acts[li].shape[0]:
+                raise ValueError(f"layer {li}: pos/neg counts differ")
+
+            diffs    = (pos_acts[li] - neg_acts[li]).float()
+            centered = diffs - diffs.mean(0, keepdim=True)
+
+            _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
+            v = Vh[: cfg.n_components]
+
+            projs         = centered @ v.T
+            positive_frac = (projs > 0).float().mean(0)
+            majority_sign = torch.where(positive_frac > 0.5,
+                               torch.ones(v.shape[0]),
+                               -torch.ones(v.shape[0])).to(v)
+            strongest_idx  = projs.abs().argmax(dim=0)
+            strongest      = projs[strongest_idx, torch.arange(v.shape[0], device=projs.device)]
+            strongest_sign = torch.sign(strongest)
+            sign           = torch.where(positive_frac == 0.5, strongest_sign, majority_sign)
+            v = v * sign[:, None]
+
+            if cfg.n_components == 1:
+                v = v.squeeze(0)
+                if cfg.normalize:
+                    v = v / (v.norm() + ε)
+                out[li] = {"v": v}
+            else:
+                if cfg.normalize:
+                    v = v / (v.norm(dim=1, keepdim=True) + ε)
+                out[li] = {"V": v}
+        return out
+
+    @staticmethod
+    def apply(
+        mod,
+        x: Float[Tensor, "b s d"],
+        y: Float[Tensor, "b s d"],
+        state: dict[str, Tensor],
+        cfg: PCAC,
+    ) -> Float[Tensor, "b s d"]:
+        if "v" in state:
+            v = state["v"].to(y)
+            return y + cfg.coeff * v
+        # multi-component: sum top-k directions equally
+        V = state["V"].to(y)
+        return y + cfg.coeff * V.sum(0)
@@ -0,0 +1,114 @@
+"""Vector: extracted steering vector + config, as a single ergonomic object.
+
+Wraps `(cfg, state)` so a user can:
+
+    v = Vector.train(model, tok, pos, neg, sl.MeanDiffC(layers=(15,))) \\
+              .calibrate(model, tok, target_kl=1.0)
+
+    with v(model):
+        out = model.generate(...)
+
+    v.save("honesty.safetensors")
+    v2 = Vector.load("honesty.safetensors")
+
+    combined = v + v2          # ensemble (sum buffers, requires same cfg.method)
+    scaled   = v * 0.5         # scale buffers
+"""
+from __future__ import annotations
+from contextlib import contextmanager
+from copy import deepcopy
+from dataclasses import replace
+
+import torch
+from torch import Tensor, nn
+
+from .config import SteeringConfig
+
+
+class Vector:
+    def __init__(self, cfg: SteeringConfig, state: dict[int, dict[str, Tensor]]):
+        self.cfg = cfg
+        self.state = state
+
+    @classmethod
+    def train(cls, model: nn.Module, tok, pos_prompts: list[str], neg_prompts: list[str],
+              cfg: SteeringConfig, **kw) -> "Vector":
+        """Extract a steering vector from contrastive prompts. Chains with .calibrate()."""
+        from .attach import train as _train
+        return _train(model, tok, pos_prompts, neg_prompts, cfg, **kw)
+
+    def calibrate(self, model: nn.Module, tok,
+                  prompts: list[str] | list[Tensor] | None = None, *,
+                  target_kl: float = 1.0, **kw) -> "Vector":
+        """Set coeff so KL(steered || base) hits target_kl. Mutates and returns self for chaining.
+
+        `prompts` defaults to a small generic set; pass list[str] to use your own.
+        """
+        from .calibrate import calibrate_iso_kl
+        coeff, _ = calibrate_iso_kl(self, model, tok, prompts, target_kl=target_kl, **kw)
+        self.cfg.coeff = float(coeff)
+        return self
+
+    @contextmanager
+    def __call__(self, model: nn.Module, *, C: float | None = None):
+        """Attach for the duration of the `with` block. `C` overrides cfg.coeff if given."""
+        from .attach import attach, detach
+        cfg = self.cfg if C is None else replace(self.cfg, coeff=float(C))
+        attach(model, cfg, self.state)
+        try:
+            yield model
+        finally:
+            detach(model)
+
+    def __add__(self, other: "Vector") -> "Vector":
+        if self.cfg.method != other.cfg.method:
+            raise ValueError(f"cannot add {self.cfg.method!r} + {other.cfg.method!r}")
+        if sorted(self.state) != sorted(other.state):
+            raise ValueError(f"layer mismatch: {sorted(self.state)} vs {sorted(other.state)}")
+        new_state: dict[int, dict[str, Tensor]] = {}
+        for li in self.state:
+            a, b = self.state[li], other.state[li]
+            if sorted(a) != sorted(b):
+                raise ValueError(f"layer {li}: state keys differ {sorted(a)} vs {sorted(b)}")
+            new_state[li] = {k: a[k] + b[k] for k in a}
+        return Vector(deepcopy(self.cfg), new_state)
+
+    def __mul__(self, k: float) -> "Vector":
+        new_state = {
+            li: {k_: v * float(k) for k_, v in s.items()}
+            for li, s in self.state.items()
+        }
+        return Vector(deepcopy(self.cfg), new_state)
+
+    __rmul__ = __mul__
+
+    def save(self, path: str) -> None:
+        from .attach import _STATE_PREFIX, _SUB_KEY_PREFIX, _SUB_KEY_SEP  # noqa: F401
+        import json
+        from safetensors.torch import save_file
+        sd: dict[str, Tensor] = {}
+        sub_mode = self.cfg.target_submodule is not None
+        for key, s in self.state.items():
+            for k, t in s.items():
+                if sub_mode:
+                    sd[f"{_SUB_KEY_PREFIX}{key}{_SUB_KEY_SEP}{k}"] = t.detach().cpu()
+                else:
+                    sd[f"layer{key}.{k}"] = t.detach().cpu()
+        metadata = {"cfg": json.dumps(self.cfg.to_dict())}
+        save_file(sd, path, metadata=metadata)
+
+    @classmethod
+    def load(cls, path: str) -> "Vector":
+        import json
+        from safetensors.torch import load_file, safe_open
+        from .attach import _safetensors_dict_to_state
+        with safe_open(path, framework="pt", device="cpu") as f:
+            metadata = f.metadata()
+        sd = load_file(path, device="cpu")
+        cfg = SteeringConfig.from_dict(json.loads(metadata["cfg"]))
+        state = _safetensors_dict_to_state(sd)
+        return cls(cfg, state)
+
+    def __repr__(self) -> str:
+        layers = sorted(self.state)
+        return f"Vector(method={self.cfg.method!r}, layers={layers})"
@@ -0,0 +1,65 @@
+"""Smoke test: train + calibrate + branch_pmass on a tiny random model.
+
+Pass = runtime sanity. Distinguishing checks:
+  - measure_kl returns kl > 0 at coeff > 0 (steer DID change distribution).
+  - measure_kl returns kl ~= 0 at coeff = 0 (silent failure detector: if hooks
+    leak, base==steer KL would be nonzero).
+  - branch_pmass is in [0, 1].
+  - branch_pmass changes between coeff=0 and coeff=large (sneaky-fail catch:
+    if pmass is just identity-pass-through it would be invariant).
+"""
+from __future__ import annotations
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+from iso_kl_figure import (
+    SteeringConfig, MeanDiffC, train, measure_kl, attach, detach,
+)
+from iso_kl_figure.branch_pmass import branch_pmass
+
+
+MODEL = "hf-internal-testing/tiny-random-LlamaForCausalLM"
+
+
+def test_pipeline_smoke():
+    tok = AutoTokenizer.from_pretrained(MODEL)
+    if tok.pad_token_id is None:
+        tok.pad_token_id = tok.eos_token_id
+    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
+    model.eval()
+    n_layers = model.config.num_hidden_layers
+    cfg = MeanDiffC(coeff=1.0, layers=(n_layers // 2,))
+
+    pos = ["Sure: <answer>", "Yes: <answer>", "Of course: <answer>", "Here: <answer>"]
+    neg = ["No way.", "I refuse.", "Cannot help.", "Decline."]
+    v = train(model, tok, pos, neg, cfg, batch_size=2, max_length=32)
+
+    # KL must be ~0 at coeff=0 (no leak), and > 0 at large coeff
+    v.cfg.coeff = 0.0
+    m0 = measure_kl(v, model, tok, ["hello world"], T=4, device="cpu")
+    assert m0["kl_p95"] < 1e-3, f"coeff=0 should give zero KL, got {m0['kl_p95']}"
+
+    v.cfg.coeff = 5.0
+    m1 = measure_kl(v, model, tok, ["hello world"], T=4, device="cpu")
+    assert m1["kl_p95"] > 0.0, "coeff>0 should give nonzero KL"
+
+    # per_t arrays length matches T
+    assert len(m1["per_t_p95"]) == 4
+    assert len(m1["per_t_max"]) == 4
+
+    # branch_pmass: in [0, 1] and varies with coeff
+    pids = tok("hello", return_tensors="pt").input_ids[0]
+    rolled = pids[-2:].clone()
+    suffix = ' {"value": '
+    fork = [0, 1, 2]
+
+    v.cfg.coeff = 0.0
+    p_zero = branch_pmass(v, model, tok, pids, rolled, fork, suffix,
+                          ["true", "false"], device="cpu")
+    v.cfg.coeff = 5.0
+    p_steer = branch_pmass(v, model, tok, pids, rolled, fork, suffix,
+                           ["true", "false"], device="cpu")
+    for x in p_zero["pmass"] + p_steer["pmass"]:
+        assert 0.0 <= x <= 1.0, f"pmass out of [0,1]: {x}"
+    diff = max(abs(a - b) for a, b in zip(p_zero["pmass"], p_steer["pmass"]))
+    assert diff > 1e-6, "pmass invariant to coeff -- hook is dead"