Files
isokl_steering_calibration/docs/spec/20260505_iso_kl_figure.md
T
2026-05-05 06:17:25 +08:00

4.3 KiB

iso-kl-figure: spec

Goal

Produce one figure (Figure 1) and one table (Table 1) that empirically support three claims: iso-KL calibration converges and generalizes (C1), the calibrated coefficient is not too cold (C2), and not too hot (C3). Show stability across 3 models x 3 seeds x 2 calibration windows.

Scope

In:

  • Port measure_kl, calibrate_iso_kl, minimal Vector/attach/config/target/extract from steering-lite.
  • 3 methods: mean_diff, directional_ablation, pca.
  • New branch_pmass metric: fork-and-teacher-force probability mass on a forced format answer token.
  • Scripts producing TSV/CSV; plot and table modules consuming the CSVs.

Out:

  • LessWrong post or paper draft.
  • Method zoo beyond 3 methods.
  • Threshold sweep, calibration-set-size sweep, norm-matching baseline.
  • tinymfv integration.

Requirements

  • R1 (C1, calibration converges and generalizes): for every (method, model, seed, window), bisection terminates with calibration p95 within tolerance of 1.0; on a held-out prompt set p95 lands within [0.7, 1.4]. VERIFY: TSV row has converged=true and holdout_p95 in band; sneaky failure (overfits calibration prompts) caught by held-out column.
  • R2 (C2, not too cold): target-axis Delta logit at calibrated alpha excludes 0 with 95% CI for each method, on each model. VERIFY: Table 1 row reports CI; sneaky failure (alpha approx 0) caught by alpha column in same row.
  • R3 (C3, not too hot, NLL): base-NLL of full 50-token held-out generations stays within 2x of base at calibrated alpha; exceeds 4x of base at 2x calibrated alpha for at least one method per model. VERIFY: Table 1 base_nll_delta column.
  • R4 (C3, not too hot, branch-pmass): mean branch-pmass-of-valid-answer at fork points {0, 5, ..., 50} stays within 0.1 of base pmass at calibrated alpha; drops by more than 0.3 at 2x alpha for at least one method per model. VERIFY: Table 1 branch_pmass column and Figure 1 lower subplot.
  • R5 (sanity probe at 2x): max p95 KL at 2x alpha exceeds 1 nat in at least 2 of 3 methods on at least 2 of 3 models within 50 tokens. VERIFY: Figure 1 top subplot, alpha=2 panels show lines crossing reference.
  • R6 (stability): seed band and window-style overlay in Figure 1 do not change the qualitative C1 conclusion. VERIFY: variance band visually narrow at alpha=1.

Tasks

  • [/] T1 (R*): scaffold repo (pyproject, justfile, README, AGENTS, spec).
    • verify: just --list lists recipes; uv sync --extra all resolves.
  • T2 (R1, R2, R3, R4): port core code from steering-lite (calibrate, vector, attach, config, target, extract, 3 variants).
    • verify: imports clean; smoke test runs all 3 methods.
  • T3 (R1, R6): extend calibrate history to save per-token KL arrays (per_t_p95, per_t_max).
    • verify: history dict contains per-token arrays of length T.
  • T4 (R4): implement branch_pmass (fork at token t, append fixed format suffix, teacher-force one forward, sum p over true/false tokens).
    • verify: pmass in [0, 1]; pmass at base != pmass at coeff=large (sneaky-fail catch).
  • T5 (R1..R5): implement run_calibrate.py, run_trajectory.py, run_table.py.
    • verify: CSVs created with expected columns and at least one row each on smoke.
  • T6 (R*): implement plot.py, table.py.
    • verify: PNG saved; markdown table prints; can be regenerated from CSVs alone.
  • T7 (R*): full sweep on real models.
    • verify: numeric asserts in R1..R5 pass.
  • T8 (R*): external review of figure + table.
    • verify: review doc saved under docs/spec/.

Context

Calibration target: p95 per-token KL(steered || base) = 1 nat over T tokens (T in {20, 50}), N=4 calibration prompts under greedy decoding.

Branch-pmass procedure: at fork points t in {0, 5, ..., 50} take steered prefix of length t, append \nAnswer (true/false): then {"value": then teacher-force one forward under steered model, sum probabilities of token variants for true and false.

Target-axis: a single contrastive pair-set built into the repo (sentiment positive vs negative or refusal yes vs no), 4 prompts each. Target Delta logit = mean over held-out prompts of difference in logit on the target token.

Log

(append-only; only entries that change a future task)

TODO

(out-of-scope ideas; not commitments)

Errors

Task Error Resolution