wassname
2026-05-13 10:46:52 +08:00
parent 77b296cc75
commit 9aeacb61d1
2 changed files with 83 additions and 0 deletions
+80
@@ -105,6 +105,86 @@ The second point is probably unavoidable for any single scalar
calibration target. The intervention is multi-dimensional, the target is
one number.
## Extended result: iterated steering and a coherence budget
Iso-KL calibration has a natural application to *iterated* steering, where
you repeat the extract → calibrate → apply cycle multiple times, each round
building on the previous intervention.
The setup (Qwen3-4B, Care↑ vs Authority↑ persona, mean-diff steering,
[steering-lite](https://github.com/wassname/steering-lite)):
- Round 0: extract mean-diff direction from base model, bisect on
coefficient until **p95 per-token KL = target**, apply, evaluate.
- Round r: re-extract from the already-steered model, re-calibrate to
the same p95 KL target, apply on top of all prior rounds, evaluate.
Each round spends the same p95 KL dose. The question is how many rounds
you can do before incoherence overwhelms the steering signal.
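
The calibrate step can be sketched as a plain bisection. This is a minimal illustration, not steering-lite's API: `per_token_kl`, `calibrate_coefficient`, and the toy logits are hypothetical names and data, the KL direction (base to steered) is an assumption, and the search assumes p95 KL is non-decreasing in the coefficient.

```python
import numpy as np


def per_token_kl(base_logits, steered_logits):
    """KL(base || steered) in nats per token; inputs are [tokens, vocab].

    The direction of the KL is an assumption, the text does not specify it.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(base_logits)
    logq = log_softmax(steered_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)


def calibrate_coefficient(p95_kl_at, target, lo=0.0, hi=16.0, tol=1e-3, max_iter=50):
    """Bisect on the steering coefficient until p95 KL is within tol of target.

    Assumes p95_kl_at(coeff) is non-decreasing on [lo, hi].
    """
    if p95_kl_at(hi) < target:
        raise ValueError("upper bound too small to reach the KL target")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        kl = p95_kl_at(mid)
        if abs(kl - target) <= tol:
            return mid
        if kl < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy stand-in for a model: random logits shifted along a random direction.
rng = np.random.default_rng(0)
base = rng.normal(size=(256, 50))
direction = rng.normal(size=(50,))


def toy_p95_kl(coeff):
    kl = per_token_kl(base, base + coeff * direction)
    return float(np.percentile(kl, 95))


coeff = calibrate_coefficient(toy_p95_kl, target=0.10)
calibrated_kl = toy_p95_kl(coeff)
```

In a real run `toy_p95_kl` would be replaced by a forward pass of the steered model over a calibration set; only the bisection logic carries over.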
**The finding:** the model has a fixed coherence budget of roughly
**1.7 nats total cumulative KL**, summed across all rounds (1.7 to 1.8 in
the runs below). Once that budget is spent, the model breaks down: the
Social Norms logit jumps from its steered floor to positive, and
forced-choice accuracy collapses. The budget is the same regardless of how
you slice it across rounds or which of the two tested methods you use.
Here is what the axis shift looks like for KL=0.10 (20 rounds, Qwen3-4B,
Care↑ vs Authority↑; values are mean logit advantage of each foundation in
a 7-way forced-choice):
```
r Care Auth SocN margin note
0 -3.29 -3.84 -6.19 +10.79 baseline (no steering)
1 -3.15 -4.02 -6.18 +11.26
4 -2.80 -4.41 -6.34 +12.01
8 -1.70 -4.95 -6.54 +12.01
10 +0.41 -5.31 -6.66 +10.92 Care crosses zero
14 +4.18 -6.61 -5.99 +6.00 Auth near floor, SocN slipping
15 +3.00 -6.81 -4.12 +3.58 SocN creeping
17 -0.67 -6.91 +0.61 +0.77 BREAKDOWN: SocN-jump
```
Auth (the suppressed axis) descends steadily to its floor regardless of
pace. Care (the amplified axis) rises, then collapses at breakdown when
incoherence dominates. SocN (Social Norms) staying negative is the
coherence sentinel; once it goes positive the model is no longer reliably
using its value structure to answer questions.
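
A minimal sketch of the sentinel check, using the SocN column from the KL=0.10 trajectory above. The function name and the zero threshold are illustrative assumptions; the text only says that SocN going positive marks breakdown.

```python
# (round, SocN logit advantage) pairs copied from the KL=0.10 trajectory above.
socn_by_round = [
    (0, -6.19), (1, -6.18), (4, -6.34), (8, -6.54),
    (10, -6.66), (14, -5.99), (15, -4.12), (17, 0.61),
]


def first_breakdown_round(trajectory, threshold=0.0):
    """Return the first round whose SocN value exceeds threshold, else None."""
    for round_idx, socn in trajectory:
        if socn > threshold:
            return round_idx
    return None


breakdown = first_breakdown_round(socn_by_round)
print(breakdown)  # 17
```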
| KL/round | breakdown round | cumul. KL | Care (last healthy) | Auth (last healthy) |
|----------|----------------|-----------|--------------------|--------------------|
| 0.30 | r=6 | 1.8 nats | +4.78 | -6.60 |
| 0.20 | r=9 | 1.8 nats | +4.42 | -6.58 |
| 0.15 | r=11 | 1.7 nats | +4.22 | -6.60 |
| 0.10 | r=17 | 1.7 nats | +4.18 | -6.61 |
| 0.05 | r=36 | 1.8 nats | +3.88 | -6.65 |
The last two columns are what matter: the total axis shift at the final
usable round is similar regardless of how many installments it took. You
arrive at roughly the same place each time (Care between +3.9 and +4.8,
Auth ≈ -6.6): the budget determines the destination, not the pace.
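
As a quick arithmetic check, the cumulative column in the table above is consistent with multiplying the per-round target by the breakdown round. That reading is an assumption (the table does not say how the column was computed), but it reproduces every row to within rounding:

```python
# (KL per round, breakdown round) pairs copied from the table above.
rows = [(0.30, 6), (0.20, 9), (0.15, 11), (0.10, 17), (0.05, 36)]

# Assumed definition: cumulative KL = per-round p95 KL target * breakdown round.
cumulative = [kl * rounds for kl, rounds in rows]

for (kl, rounds), total in zip(rows, cumulative):
    print(f"KL/round={kl:.2f}  breakdown r={rounds:>2}  cumulative={total:.2f} nats")
```

The products all fall between 1.65 and 1.80 nats, which is the spread behind the "roughly 1.7 nats" figure.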
This means iso-KL calibration is doing something real: each round's p95 KL
target draws down a limit that is intrinsic to the model. Set it too high
and you spend the budget in 2-3 rounds. Set it low and you get more
intermediate checkpoints, which is useful if you need to track the
trajectory or interleave other evaluations, but the total intervention is
the same.
The law is method-agnostic across the two linear activation-addition
variants tested: super_sspace and mean_diff give identical breakdown rounds
at the same KL/round target. super_sspace is 3.4× slower per round with no
behavioural benefit.
**Caveat: this is specific to linear activation addition.** The constant
budget result is plausibly a property of *additive* residual-stream
interventions, where each round stacks a new direction vector on top. A
technique with a different intervention geometry — e.g. contrastive
decoding, finetuning, or a nonlinear gating method — would not necessarily
spend the same per-token KL budget for the same behavioural shift, and
the budget-is-constant conclusion might not transfer. The rule here is
specifically: *for linear additive steering, the model's tolerance to
cumulative distribution shift is ~1.7 nats regardless of dose schedule.*
## So is it useful?
Probably yes, weakly. The calibration this gives you does enable a
+3
@@ -35,3 +35,6 @@ where = ["src"]
[tool.ruff.lint]
ignore = ["F722"] # jaxtyping shape strings
[tool.uv]
exclude-newer = "5 days"