Weight Steering
Fork of Fierro & Roger 2025. Train two
PEFT adapters on contrastive personas (POS vs NEG), merge each into the base
model to get θ_pos and θ_neg, take dW = θ_pos − θ_neg, and add α·dW at inference.
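The weight arithmetic above can be sketched on toy data. This is a minimal illustration, not the repo's implementation: flat Python lists stand in for weight tensors, and the helper names `weight_delta` and `steer` are hypothetical.

```python
# Toy stand-ins: each "state dict" maps a parameter name to a flat list of floats.
# In practice these would be tensors from the base model and the two merged models.

def weight_delta(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter."""
    return {
        name: [p - n for p, n in zip(theta_pos[name], theta_neg[name])]
        for name in theta_pos
    }

def steer(theta_base, dw, alpha):
    """Return a copy of the base weights shifted by alpha * dW."""
    return {
        name: [w + alpha * d for w, d in zip(weights, dw[name])]
        for name, weights in theta_base.items()
    }

theta_base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
theta_pos = {"layer.weight": [1.5, 2.0], "layer.bias": [0.5]}  # base merged with POS adapter
theta_neg = {"layer.weight": [0.5, 3.0], "layer.bias": [0.5]}  # base merged with NEG adapter

dw = weight_delta(theta_pos, theta_neg)
steered = steer(theta_base, dw, alpha=0.5)
```

Parameters that both adapters leave untouched get a zero delta, so steering only moves the weights the adapters actually changed.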
We test whether weight-space steering (dW) competes with hidden-state steering and prompting on a directly comparable Authority↓ benchmark. (ws = this repo; sl = steering-lite, the hidden-state steering baselines.) For dataset, persona pairs, calibration recipe, and baseline methods, see steering-lite. ws shares sl's persona pairs, vignettes, and 1-nat KL budget, so the rows below drop into sl's tables.
Results: Authority↓ on Qwen3.5-4B (iso-KL=1.0)
We ask three questions:
- Does dW move Authority in the right direction?
- Does dW beat hidden-state steering and persona-prompting?
- Does dW have lower uncertainty than hidden-state steering?
Glossary
- dW = θ_pos − θ_neg: weight-space contrast from two PEFT adapters.
- α: steering strength, calibrated so worst-5%-KL hits 1 nat.
- ΔAuth: mean change in logit P(is_wrong) on Authority vignettes, paired by (vignette, condition). Negative = Authority↓ achieved.
- axis_Δ = −ΔAuth (positive = correct direction, persona-aligned).
- SI(Auth): bidirectional Surgical Informedness on Authority. High means the method moves Authority without breaking other foundations. Definition: steering-lite eval.
- prompt_only: baseline that injects the POS persona as a system prompt, no steering vector.
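One way to picture the α calibration is a bisection on the KL budget. The sketch below is an assumption-laden illustration, not the recipe from steering-lite: it reads "worst-5%-KL" as the 95th percentile of per-sample KL, assumes that percentile grows monotonically with α, and uses a hypothetical callback `kl_at(alpha)` in place of real model evaluations.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def calibrate_alpha(kl_at, target_kl=1.0, alpha_max=8.0, q=95.0, tol=1e-3, max_iter=60):
    """Bisect alpha so the q-th percentile of KL reaches target_kl (in nats).

    kl_at(alpha) is a hypothetical callback returning per-sample KL values
    between the base and steered models. Assumes monotone growth in alpha.
    """
    lo, hi = 0.0, alpha_max
    if percentile(kl_at(hi), q) < target_kl:
        return hi  # budget never reached within the search range
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if percentile(kl_at(mid), q) < target_kl:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Toy model: KL grows quadratically in alpha, with a per-sample scale.
scales = [0.1, 0.2, 0.3, 0.4, 1.0]

def toy_kl(alpha):
    return [s * alpha ** 2 for s in scales]

alpha_star = calibrate_alpha(toy_kl)
```

Calibrating every method to the same budget is what makes rows comparable: each method spends the same amount of distribution shift, so differences in ΔAuth reflect how well that shift is targeted.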
Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)
Authority is the target (move down). Care is one off-target effect:
surgical methods should leave it near zero, broadly-suppressing methods drag it
down with Authority. Full 7-foundation table in
out/authority/.../foundations_dlogit.csv. Bold = best per column
(most-negative ΔAuth, lowest std, closest-to-zero ΔCare).
| method | ΔAuth ↓ (mean ± std) | ΔCare → 0 (mean ± std) |
|---|---|---|
| sl:engineered_prompt | −2.98 ± 1.20 | −1.64 ± 1.03 |
| sl:sspace_ablate | −2.89 ± 0.86 | −2.79 ± 0.92 |
| sl:sspace | −2.78 ± 0.93 | −2.57 ± 0.90 |
| sl:angular_steering | −2.67 ± 0.89 | −2.49 ± 0.84 |
| sl:cosine_gated | −2.08 ± 0.64 | −1.88 ± 0.61 |
| sl:directional_ablation | −1.94 ± 1.22 | −1.80 ± 1.24 |
| sl:mean_diff | −1.93 ± 1.11 | −1.72 ± 1.09 |
| sl:mean_centred | −1.80 ± 1.17 | −1.63 ± 1.14 |
| sl:spherical | −1.44 ± 0.89 | −1.21 ± 0.71 |
| sl:pca | −1.36 ± 1.50 | −1.30 ± 1.36 |
| sl:topk_clusters | −1.18 ± 0.97 | −1.12 ± 0.91 |
| ws:delora* | −0.89 ± 0.58 | −0.49 ± 0.60 |
| sl:linear_act | −0.83 ± 0.67 | −0.70 ± 0.52 |
| sl:chars | −0.45 ± 0.61 | −0.40 ± 0.54 |
*ws:delora was calibrated at p95=0.5, not kl=1.0; expect a larger effect after re-calibration.
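For readers who want to reproduce a "mean ± std" cell, the paired Δlogit can be sketched as below. This assumes the std is the sample standard deviation over (vignette, condition) pairs, which the table does not state; the function names and the probabilities are made up for illustration.

```python
import math
from statistics import mean, stdev

def logit(p, eps=1e-6):
    """Log-odds of probability p, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def paired_delta_logit(base, steered):
    """Mean and sample std of logit(steered) - logit(base) over shared keys.

    base and steered map (vignette, condition) -> P(is_wrong).
    """
    keys = sorted(base.keys() & steered.keys())
    deltas = [logit(steered[k]) - logit(base[k]) for k in keys]
    return mean(deltas), stdev(deltas)

# Made-up probabilities, for illustration only.
base = {("v1", "a"): 0.5, ("v1", "b"): 0.5, ("v2", "a"): 0.5}
steered = {("v1", "a"): 0.25, ("v1", "b"): 0.25, ("v2", "a"): 0.5}

d_mean, d_std = paired_delta_logit(base, steered)
```

Pairing by key means each vignette serves as its own control, so the std reflects how consistently the method shifts verdicts rather than how much the vignettes differ from each other.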
Surgical Informedness (headline, ↑ better)
For SI(Auth), SI_fwd, SI_rev, Auth_sep, and pmass²×100, higher is
better. Bold = best in column. sl rows come from sl's published Qwen3.5-4B
run. ws:delora is at the p95=0.5 budget (kl=1.0 re-run queued with lora/dora).
| method | SI(Auth) ↑ | SI_fwd ↑ | SI_rev ↑ | Auth_sep ↑ | pmass²×100 ↑ |
|---|---|---|---|---|---|
| sl:directional_ablation | 52.90 | 0.32 | +1.00 | +2.05 | 80.1 |
| sl:super_sspace | 47.71 | 0.67 | +0.40 | +1.99 | 88.8 |
| sl:sspace | 45.67 | 0.64 | +0.85 | +0.69 | 61.0 |
| sl:mean_diff | 32.81 | 0.34 | +1.00 | +1.65 | 49.0 |
| sl:mean_centred | 32.72 | 0.29 | +1.00 | +1.56 | 50.6 |
| sl:topk_clusters | 31.34 | 0.13 | +0.72 | +1.55 | 73.9 |
| sl:sspace_ablate | 24.11 | 0.74 | +0.02 | +0.59 | 63.6 |
| sl:linear_act | 20.24 | −0.19 | +1.00 | +0.83 | 49.9 |
| ws:delora | 19.03 | 0.02 | +0.37 | +0.76 | 99.9 |
| sl:engineered_prompt | 17.36 | 0.50 | −0.02 | +1.90 | 71.7 |
| sl:cosine_gated | 8.92 | 0.09 | +1.00 | +2.00 | 16.4 |
| sl:angular_steering | 7.00 | 0.55 | −0.38 | +0.32 | 80.6 |
| sl:spherical | 4.98 | 0.16 | n/a | +0.85 | 30.3 |
| sl:pca | −0.92 | 0.03 | −0.08 | +0.85 | 39.0 |
| sl:chars | −9.16 | −0.26 | +0.00 | +0.50 | 68.3 |
TL;DR
- Did dW replicate? Yes. ws:delora ΔAuth = −0.89 (sign correct) and SI(Auth) = 19.03, so verdicts do flip in the right direction.
- Did dW beat steering and prompting? Partially. SI(Auth) = 19.03 beats the engineered-prompt baseline (17.36) and 5 other sl methods, but is below 8 hidden-state methods.
- Did dW have lower uncertainty? Yes, provisionally. ws:delora ΔAuth std = 0.58 is the lowest in the table (sl best: chars, 0.61), but ws:delora was calibrated at p95=0.5 rather than kl=1.0, so the comparison is not yet iso-KL.
Open: lora and dora training is queued (pueue 141-144). ws:delora is at the p95=0.5 budget, not yet at sl's kl=1.0, so expect SI to shift after re-calibration. Full 4-adapter table pending.
How to run
```sh
# 1. generate persona-conditioned data
uv run python -m ws.data --behavior authority --model-id Qwen/Qwen3.5-4B
# 2. train all adapters (dW = merged_pos - merged_neg)
uv run python -m ws.run_sweep --behavior authority --model Qwen/Qwen3.5-4B
# 3. iso-KL calibrate α
uv run python -m ws.kl_calibrate --behavior authority --model Qwen/Qwen3.5-4B
# 4. eval on tinymfv airisk
uv run python -m ws.scripts.eval_tinymfv_calibrated --behavior authority --model Qwen/Qwen3.5-4B
# 5. rebuild README tables
uv run python -m ws.scripts.readme_tinymfv_table --behavior authority
```
Outputs go to out/authority/<adapter>/. Smoke test on a tiny model:
just smoke.
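To run the five steps in sequence, a small driver along these lines may help. It is a convenience sketch, not a script shipped in the repo: `build_commands` and `run_pipeline` are hypothetical names, and the commands are copied from the steps above. It defaults to a dry run that only prints each command.

```python
import shlex
import subprocess

def build_commands(behavior="authority", model="Qwen/Qwen3.5-4B"):
    """Return the five pipeline commands, in order, as argument lists."""
    base = ["uv", "run", "python", "-m"]
    steps = [
        ("ws.data", ["--model-id", model]),               # 1. persona-conditioned data
        ("ws.run_sweep", ["--model", model]),             # 2. train adapters
        ("ws.kl_calibrate", ["--model", model]),          # 3. iso-KL calibrate alpha
        ("ws.scripts.eval_tinymfv_calibrated", ["--model", model]),  # 4. eval
        ("ws.scripts.readme_tinymfv_table", []),          # 5. rebuild README tables
    ]
    return [base + [module, "--behavior", behavior] + extra for module, extra in steps]

def run_pipeline(dry_run=True, **kwargs):
    """Print each command; execute it too unless dry_run is set."""
    for cmd in build_commands(**kwargs):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step

if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Because `check=True` raises on the first non-zero exit code, a failed training step cannot silently feed stale adapters into calibration or eval.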
Cite
```bibtex
@article{FierroRoger2025,
  author  = {Constanza Fierro and Fabien Roger},
  title   = {Steering Language Models with Weight Arithmetic},
  journal = {arXiv preprint arXiv:2511.05408},
  year    = {2025},
  url     = {https://arxiv.org/abs/2511.05408}
}
```
Related
- steering-lite: hidden-state steering, sister project, source of all baseline rows above
- tinymfv: vignette dataset
- PEFT: adapter library
- RepE (Zou et al. 2023): hidden-state steering precursor