mirror of
https://github.com/wassname/isokl_steering_calibration.git
synced 2026-06-27 15:45:51 +08:00
misc
This commit is contained in:
@@ -105,6 +105,86 @@ The second point is probably unavoidable for any single scalar
|
|||||||
calibration target. The intervention is multi-dimensional, the target is
|
calibration target. The intervention is multi-dimensional, the target is
|
||||||
one number.
|
one number.
|
||||||
|
|
||||||
|
## Extended result: iterated steering and a coherence budget
|
||||||
|
|
||||||
|
Iso-KL calibration has a natural application to *iterated* steering, where
|
||||||
|
you repeat the extract → calibrate → apply cycle multiple times, each round
|
||||||
|
building on the previous intervention.
|
||||||
|
|
||||||
|
The setup (Qwen3-4B, Care↑ vs Authority↑ persona, mean-diff steering,
|
||||||
|
[steering-lite](https://github.com/wassname/steering-lite)):
|
||||||
|
|
||||||
|
- Round 0: extract mean-diff direction from base model, bisect on
|
||||||
|
coefficient until **p95 per-token KL = target**, apply, evaluate.
|
||||||
|
- Round r: re-extract from the already-steered model, re-calibrate to
|
||||||
|
the same p95 KL target, apply on top of all prior rounds, evaluate.
|
||||||
|
|
||||||
|
Each round spends the same p95 KL dose. The question is how many rounds
|
||||||
|
you can do before incoherence overwhelms the steering signal.
|
||||||
|
|
||||||
|
**The finding:** the model has a fixed coherence budget of roughly
|
||||||
|
**1.7 nats total cumulative KL** (across all rounds). Once that budget is
|
||||||
|
spent, the model breaks down — the Social Norms logit jumps from its
|
||||||
|
steered floor to positive, and forced-choice accuracy collapses. This
|
||||||
|
constant is the same regardless of how you slice the budget across rounds
|
||||||
|
or which method you use.
|
||||||
|
|
||||||
|
Here is what the axis shift looks like for KL=0.10 (20 rounds, Qwen3-4B,
|
||||||
|
Care↑ vs Authority↑; values are mean logit advantage of each foundation in
|
||||||
|
a 7-way forced-choice):
|
||||||
|
|
||||||
|
```
|
||||||
|
r Care Auth SocN margin note
|
||||||
|
0 -3.29 -3.84 -6.19 +10.79 baseline (no steering)
|
||||||
|
1 -3.15 -4.02 -6.18 +11.26
|
||||||
|
4 -2.80 -4.41 -6.34 +12.01
|
||||||
|
8 -1.70 -4.95 -6.54 +12.01
|
||||||
|
10 +0.41 -5.31 -6.66 +10.92 Care crosses zero
|
||||||
|
14 +4.18 -6.61 -5.99 +6.00 Auth near floor, SocN slipping
|
||||||
|
15 +3.00 -6.81 -4.12 +3.58 SocN creeping
|
||||||
|
17 -0.67 -6.91 +0.61 +0.77 BREAKDOWN: SocN-jump
|
||||||
|
```
|
||||||
|
|
||||||
|
Auth (the suppressed axis) descends steadily to its floor regardless of
|
||||||
|
pace. Care (the amplified axis) rises, then collapses at breakdown when
|
||||||
|
incoherence dominates. SocN (Social Norms) staying negative is the
|
||||||
|
coherence sentinel; once it goes positive the model is no longer reliably
|
||||||
|
using its value structure to answer questions.
|
||||||
|
|
||||||
|
| KL/round | breakdown round | cumul. KL | Care (last healthy) | Auth (last healthy) |
|
||||||
|
|----------|----------------|-----------|--------------------|--------------------|
|
||||||
|
| 0.30 | r=6 | 1.8 nats | +4.78 | -6.60 |
|
||||||
|
| 0.20 | r=9 | 1.8 nats | +4.42 | -6.58 |
|
||||||
|
| 0.15 | r=11 | 1.7 nats | +4.22 | -6.60 |
|
||||||
|
| 0.10 | r=17 | 1.7 nats | +4.18 | -6.61 |
|
||||||
|
| 0.05 | r=36 | 1.8 nats | +3.88 | -6.65 |
|
||||||
|
|
||||||
|
The last two columns are what matters: the total axis shift at the final
|
||||||
|
usable round is nearly identical regardless of how many installments it
|
||||||
|
took. You always arrive at roughly the same place (Care ≈ +4.4, Auth ≈
|
||||||
|
-6.6) — the budget determines the destination, not the pace.
|
||||||
|
|
||||||
|
This means iso-KL calibration is doing something real: the p95 KL target
|
||||||
|
per round measures against a model-intrinsic limit. Set it too high and
|
||||||
|
you spend the budget in 2-3 rounds. Set it low and you get more
|
||||||
|
intermediate checkpoints — useful if you need to track the trajectory
|
||||||
|
or interleave other evaluations — but the total intervention is the same.
|
||||||
|
|
||||||
|
The law is method-agnostic across the two linear activation-addition
|
||||||
|
variants tested: super_sspace and mean_diff give identical breakdown rounds
|
||||||
|
at the same KL/round target. super_sspace is 3.4× slower per round with no
|
||||||
|
behavioural benefit.
|
||||||
|
|
||||||
|
**Caveat: this is specific to linear activation addition.** The constant
|
||||||
|
budget result is plausibly a property of *additive* residual-stream
|
||||||
|
interventions, where each round stacks a new direction vector on top. A
|
||||||
|
technique with a different intervention geometry — e.g. contrastive
|
||||||
|
decoding, finetuning, or a nonlinear gating method — would not necessarily
|
||||||
|
spend the same per-token KL budget for the same behavioural shift, and
|
||||||
|
the budget-is-constant conclusion might not transfer. The rule here is
|
||||||
|
specifically: *for linear additive steering, the model's tolerance to
|
||||||
|
cumulative distribution shift is ~1.7 nats regardless of dose schedule.*
|
||||||
|
|
||||||
## So is it useful?
|
## So is it useful?
|
||||||
|
|
||||||
Probably yes, weakly. The calibration this gives you does enable a
|
Probably yes, weakly. The calibration this gives you does enable a
|
||||||
|
|||||||
@@ -35,3 +35,6 @@ where = ["src"]
|
|||||||
|
|
||||||
[tool.ruff.lint]
|
[tool.ruff.lint]
|
||||||
ignore = ["F722"] # jaxtyping shape strings
|
ignore = ["F722"] # jaxtyping shape strings
|
||||||
|
|
||||||
|
[tool.uv]
|
||||||
|
exclude-newer = "5 days"
|
||||||
Reference in New Issue
Block a user