misc

2026-06-27 15:45:51 +08:00 · 2026-05-13 10:46:52 +08:00
parent 77b296cc75
commit 9aeacb61d1
2 changed files with 83 additions and 0 deletions
@@ -105,6 +105,86 @@ The second point is probably unavoidable for any single scalar
 calibration target. The intervention is multi-dimensional, the target is
 one number.

+## Extended result: iterated steering and a coherence budget
+
+Iso-KL calibration has a natural application to *iterated* steering, where
+you repeat the extract → calibrate → apply cycle multiple times, each round
+building on the previous intervention.
+
+The setup (Qwen3-4B, Care↑ vs Authority↑ persona, mean-diff steering,
+[steering-lite](https://github.com/wassname/steering-lite)):
+
+- Round 0: extract mean-diff direction from base model, bisect on
+  coefficient until **p95 per-token KL = target**, apply, evaluate.
+- Round r: re-extract from the already-steered model, re-calibrate to
+  the same p95 KL target, apply on top of all prior rounds, evaluate.
+
+Each round spends the same p95 KL dose. The question is how many rounds
+you can do before incoherence overwhelms the steering signal.
+
+**The finding:** the model has a fixed coherence budget of roughly
+**1.7 nats total cumulative KL** (across all rounds). Once that budget is
+spent, the model breaks down — the Social Norms logit jumps from its
+steered floor to positive, and forced-choice accuracy collapses. This
+constant is the same regardless of how you slice the budget across rounds
+or which method you use.
+
+Here is what the axis shift looks like for KL=0.10 (20 rounds, Qwen3-4B,
+Care↑ vs Authority↑; values are mean logit advantage of each foundation in
+a 7-way forced-choice):
+
+```
+r    Care   Auth   SocN   margin  note
+0   -3.29  -3.84  -6.19  +10.79  baseline (no steering)
+1   -3.15  -4.02  -6.18  +11.26
+4   -2.80  -4.41  -6.34  +12.01
+8   -1.70  -4.95  -6.54  +12.01
+10  +0.41  -5.31  -6.66  +10.92  Care crosses zero
+14  +4.18  -6.61  -5.99   +6.00  Auth near floor, SocN slipping
+15  +3.00  -6.81  -4.12   +3.58  SocN creeping
+17  -0.67  -6.91  +0.61   +0.77  BREAKDOWN: SocN-jump
+```
+
+Auth (the suppressed axis) descends steadily to its floor regardless of
+pace. Care (the amplified axis) rises, then collapses at breakdown when
+incoherence dominates. SocN (Social Norms) staying negative is the
+coherence sentinel; once it goes positive the model is no longer reliably
+using its value structure to answer questions.
+
+| KL/round | breakdown round | cumul. KL | Care (last healthy) | Auth (last healthy) |
+|----------|----------------|-----------|--------------------|--------------------|
+| 0.30     | r=6            | 1.8 nats  | +4.78              | -6.60              |
+| 0.20     | r=9            | 1.8 nats  | +4.42              | -6.58              |
+| 0.15     | r=11           | 1.7 nats  | +4.22              | -6.60              |
+| 0.10     | r=17           | 1.7 nats  | +4.18              | -6.61              |
+| 0.05     | r=36           | 1.8 nats  | +3.88              | -6.65              |
+
+The last two columns are what matters: the total axis shift at the final
+usable round is nearly identical regardless of how many installments it
+took. You always arrive at roughly the same place (Care ≈ +4.4, Auth ≈
+-6.6) — the budget determines the destination, not the pace.
+
+This means iso-KL calibration is doing something real: the p95 KL target
+per round measures against a model-intrinsic limit. Set it too high and
+you spend the budget in 2-3 rounds. Set it low and you get more
+intermediate checkpoints — useful if you need to track the trajectory
+or interleave other evaluations — but the total intervention is the same.
+
+The law is method-agnostic across the two linear activation-addition
+variants tested: super_sspace and mean_diff give identical breakdown rounds
+at the same KL/round target. super_sspace is 3.4× slower per round with no
+behavioural benefit.
+
+**Caveat: this is specific to linear activation addition.** The constant
+budget result is plausibly a property of *additive* residual-stream
+interventions, where each round stacks a new direction vector on top. A
+technique with a different intervention geometry — e.g. contrastive
+decoding, finetuning, or a nonlinear gating method — would not necessarily
+spend the same per-token KL budget for the same behavioural shift, and
+the budget-is-constant conclusion might not transfer. The rule here is
+specifically: *for linear additive steering, the model's tolerance to
+cumulative distribution shift is ~1.7 nats regardless of dose schedule.*
+
 ## So is it useful?

 Probably yes, weakly. The calibration this gives you does enable a
@@ -35,3 +35,6 @@ where = ["src"]

 [tool.ruff.lint]
 ignore = ["F722"]  # jaxtyping shape strings
+
+[tool.uv]
+exclude-newer = "5 days"