# iso-kl-figure: spec

## Goal

Produce one figure (Figure 1) and one table (Table 1) that empirically support three claims: iso-KL calibration converges and generalizes (C1), the calibrated coefficient is not too cold (C2), and not too hot (C3). Show stability across 3 models x 3 seeds x 2 calibration windows.

## Scope

In:
- Port `measure_kl`, `calibrate_iso_kl`, minimal Vector/attach/config/target/extract from steering-lite.
- 3 methods: `mean_diff`, `directional_ablation`, `pca`.
- New `branch_pmass` metric: fork-and-teacher-force probability mass on a forced format answer token.
- Scripts producing TSV/CSV; plot and table modules consuming the CSVs.

Out:
- LessWrong post or paper draft.
- Method zoo beyond 3 methods.
- Threshold sweep, calibration-set-size sweep, norm-matching baseline.
- tinymfv integration.

## Requirements

- R1 (C1, calibration converges and generalizes): for every (method, model, seed, window), bisection terminates with calibration p95 within tolerance of 1.0; on a held-out prompt set p95 lands within [0.7, 1.4]. VERIFY: TSV row has converged=true and holdout_p95 in band; sneaky failure (overfits calibration prompts) caught by held-out column.
- R2 (C2, not too cold): target-axis Delta logit at calibrated alpha excludes 0 with 95% CI for each method, on each model. VERIFY: Table 1 row reports CI; sneaky failure (alpha approx 0) caught by alpha column in same row.
- R3 (C3, not too hot, NLL): base-NLL of full 50-token held-out generations stays within 2x of base at calibrated alpha; exceeds 4x of base at 2x calibrated alpha for at least one method per model. VERIFY: Table 1 base_nll_delta column.
- R4 (C3, not too hot, branch-pmass): mean branch-pmass-of-valid-answer at fork points {0, 5, ..., 50} stays within 0.1 of base pmass at calibrated alpha; drops by more than 0.3 at 2x alpha for at least one method per model. VERIFY: Table 1 branch_pmass column and Figure 1 lower subplot.
- R5 (sanity probe at 2x): max p95 KL at 2x alpha exceeds 1 nat in at least 2 of 3 methods on at least 2 of 3 models within 50 tokens. VERIFY: Figure 1 top subplot, alpha=2 panels show lines crossing reference.
- R6 (stability): seed band and window-style overlay in Figure 1 do not change the qualitative C1 conclusion. VERIFY: variance band visually narrow at alpha=1.

## Tasks

- [/] T1 (R*): scaffold repo (pyproject, justfile, README, AGENTS, spec).
  - verify: `just --list` lists recipes; `uv sync --extra all` resolves.
- [ ] T2 (R1, R2, R3, R4): port core code from steering-lite (calibrate, vector, attach, config, target, extract, 3 variants).
  - verify: imports clean; smoke test runs all 3 methods.
- [ ] T3 (R1, R6): extend calibrate history to save per-token KL arrays (`per_t_p95`, `per_t_max`).
  - verify: history dict contains per-token arrays of length T.
- [ ] T4 (R4): implement `branch_pmass` (fork at token t, append fixed format suffix, teacher-force one forward, sum p over `true`/`false` tokens).
  - verify: pmass in [0, 1]; pmass at base != pmass at coeff=large (sneaky-fail catch).
- [ ] T5 (R1..R5): implement `run_calibrate.py`, `run_trajectory.py`, `run_table.py`.
  - verify: CSVs created with expected columns and at least one row each on smoke.
- [ ] T6 (R*): implement `plot.py`, `table.py`.
  - verify: PNG saved; markdown table prints; can be regenerated from CSVs alone.
- [ ] T7 (R*): full sweep on real models.
  - verify: numeric asserts in R1..R5 pass.
- [ ] T8 (R*): external review of figure + table.
  - verify: review doc saved under docs/spec/.

## Context

Calibration target: p95 per-token KL(steered || base) = 1 nat over T tokens (T in {20, 50}), N=4 calibration prompts under greedy decoding.

Branch-pmass procedure: at fork points t in {0, 5, ..., 50} take steered prefix of length t, append `\nAnswer (true/false): ` then `{"value": ` then teacher-force one forward under steered model, sum probabilities of token variants for `true` and `false`.

Target-axis: a single contrastive pair-set built into the repo (sentiment positive vs negative or refusal yes vs no), 4 prompts each. Target Delta logit = mean over held-out prompts of difference in logit on the target token.

## Log

(append-only; only entries that change a future task)

## TODO

(out-of-scope ideas; not commitments)

## Errors

| Task | Error | Resolution |
|------|-------|------------|