4.3 KiB
iso-kl-figure: spec
Goal
Produce one figure (Figure 1) and one table (Table 1) that empirically support three claims: iso-KL calibration converges and generalizes (C1), the calibrated coefficient is not too cold (C2), and not too hot (C3). Show stability across 3 models x 3 seeds x 2 calibration windows.
Scope
In:
- Port
measure_kl,calibrate_iso_kl, minimal Vector/attach/config/target/extract from steering-lite. - 3 methods:
mean_diff,directional_ablation,pca. - New
branch_pmassmetric: fork-and-teacher-force probability mass on a forced format answer token. - Scripts producing TSV/CSV; plot and table modules consuming the CSVs.
Out:
- LessWrong post or paper draft.
- Method zoo beyond 3 methods.
- Threshold sweep, calibration-set-size sweep, norm-matching baseline.
- tinymfv integration.
Requirements
- R1 (C1, calibration converges and generalizes): for every (method, model, seed, window), bisection terminates with calibration p95 within tolerance of 1.0; on a held-out prompt set p95 lands within [0.7, 1.4]. VERIFY: TSV row has converged=true and holdout_p95 in band; sneaky failure (overfits calibration prompts) caught by held-out column.
- R2 (C2, not too cold): target-axis Delta logit at calibrated alpha excludes 0 with 95% CI for each method, on each model. VERIFY: Table 1 row reports CI; sneaky failure (alpha approx 0) caught by alpha column in same row.
- R3 (C3, not too hot, NLL): base-NLL of full 50-token held-out generations stays within 2x of base at calibrated alpha; exceeds 4x of base at 2x calibrated alpha for at least one method per model. VERIFY: Table 1 base_nll_delta column.
- R4 (C3, not too hot, branch-pmass): mean branch-pmass-of-valid-answer at fork points {0, 5, ..., 50} stays within 0.1 of base pmass at calibrated alpha; drops by more than 0.3 at 2x alpha for at least one method per model. VERIFY: Table 1 branch_pmass column and Figure 1 lower subplot.
- R5 (sanity probe at 2x): max p95 KL at 2x alpha exceeds 1 nat in at least 2 of 3 methods on at least 2 of 3 models within 50 tokens. VERIFY: Figure 1 top subplot, alpha=2 panels show lines crossing reference.
- R6 (stability): seed band and window-style overlay in Figure 1 do not change the qualitative C1 conclusion. VERIFY: variance band visually narrow at alpha=1.
Tasks
- [/] T1 (R*): scaffold repo (pyproject, justfile, README, AGENTS, spec).
- verify:
just --listlists recipes;uv sync --extra allresolves.
- verify:
- T2 (R1, R2, R3, R4): port core code from steering-lite (calibrate, vector, attach, config, target, extract, 3 variants).
- verify: imports clean; smoke test runs all 3 methods.
- T3 (R1, R6): extend calibrate history to save per-token KL arrays (
per_t_p95,per_t_max).- verify: history dict contains per-token arrays of length T.
- T4 (R4): implement
branch_pmass(fork at token t, append fixed format suffix, teacher-force one forward, sum p overtrue/falsetokens).- verify: pmass in [0, 1]; pmass at base != pmass at coeff=large (sneaky-fail catch).
- T5 (R1..R5): implement
run_calibrate.py,run_trajectory.py,run_table.py.- verify: CSVs created with expected columns and at least one row each on smoke.
- T6 (R*): implement
plot.py,table.py.- verify: PNG saved; markdown table prints; can be regenerated from CSVs alone.
- T7 (R*): full sweep on real models.
- verify: numeric asserts in R1..R5 pass.
- T8 (R*): external review of figure + table.
- verify: review doc saved under docs/spec/.
Context
Calibration target: p95 per-token KL(steered || base) = 1 nat over T tokens (T in {20, 50}), N=4 calibration prompts under greedy decoding.
Branch-pmass procedure: at fork points t in {0, 5, ..., 50} take steered prefix of length t, append \nAnswer (true/false): then {"value": then teacher-force one forward under steered model, sum probabilities of token variants for true and false.
Target-axis: a single contrastive pair-set built into the repo (sentiment positive vs negative or refusal yes vs no), 4 prompts each. Target Delta logit = mean over held-out prompts of difference in logit on the target token.
Log
(append-only; only entries that change a future task)
TODO
(out-of-scope ideas; not commitments)
Errors
| Task | Error | Resolution |
|---|