# Weight Steering

Fork of [Fierro & Roger 2025](https://arxiv.org/abs/2511.05408). Train two
PEFT adapters on contrastive personas (POS vs NEG), merge into base,
take `dW = θ_pos − θ_neg`, add `α·dW` at inference.
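
The recipe above reduces to per-parameter arithmetic on checkpoints. A minimal sketch, assuming each checkpoint is a dict from parameter name to a value that supports `-`, `*`, and `+` (plain floats here; tensors from a merged model's `state_dict()` would behave the same way). `weight_delta` and `steer` are hypothetical names for illustration, not this repo's API.

```python
def weight_delta(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter name."""
    if theta_pos.keys() != theta_neg.keys():
        raise ValueError("POS and NEG checkpoints must share parameter names")
    return {name: theta_pos[name] - theta_neg[name] for name in theta_pos}


def steer(theta_base, d_w, alpha):
    """Return theta_base + alpha * dW without mutating the inputs."""
    steered = {}
    for name, w in theta_base.items():
        # Parameters the adapters never touched have no entry in dW.
        steered[name] = w + alpha * d_w[name] if name in d_w else w
    return steered


# Toy checkpoints: floats stand in for weight tensors (assumption).
theta_base = {"layer.weight": 1.0, "layer.bias": 0.0}
theta_pos = {"layer.weight": 1.5, "layer.bias": 0.25}
theta_neg = {"layer.weight": 0.5, "layer.bias": 0.25}

d_w = weight_delta(theta_pos, theta_neg)
steered = steer(theta_base, d_w, alpha=0.5)
print(d_w)
print(steered)
```

Because `steer` returns a new dict, the same `d_w` can be reused across several values of α, which is convenient when sweeping α during calibration.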

We test whether weight-space steering (dW) is competitive with hidden-state
steering and prompting on a directly comparable Authority↓ benchmark.
Throughout, ws = this repo and sl =
[steering-lite](https://github.com/wassname/steering-lite), the hidden-state
steering baselines. See sl for the dataset, persona pairs, calibration
recipe, and baseline methods. ws shares sl's persona pairs, vignettes, and
1-nat KL budget, so the rows below drop into sl's tables.

## Results: Authority↓ on Qwen3.5-4B (iso-KL=1.0)

We ask three questions:

1. Does dW move Authority in the right direction?
2. Does dW beat hidden-state steering and persona-prompting?
3. Does dW have lower uncertainty than hidden-state steering?

### Glossary

- `dW = θ_pos − θ_neg`: weight-space contrast from two PEFT adapters.
- `α`: steering strength, calibrated so worst-5%-KL hits 1 nat.
- `ΔAuth`: mean change in `logit P(is_wrong)` on Authority vignettes,
  paired by (vignette, condition). Negative = Authority↓ achieved.
- `axis_Δ = −ΔAuth` (positive = correct direction, persona-aligned).
- `SI(Auth)`: bidirectional Surgical Informedness on Authority. High
  means the method moves Authority without breaking other foundations.
  Definition: [steering-lite eval](https://github.com/wassname/steering-lite#eval).
- `prompt_only`: baseline that injects the POS persona as a system
  prompt, no steering vector.
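
The `ΔAuth` and `axis_Δ` entries above can be sketched as a paired computation. This is an illustration under stated assumptions, not the evaluation code: it assumes the change is steered minus base, that the reported std is the sample standard deviation of the paired differences, and the vignette and condition keys are made up.

```python
import math
from statistics import mean, stdev


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def delta_auth(p_base, p_steered):
    """Mean and sample std of paired changes in logit P(is_wrong).

    Both arguments map a (vignette, condition) key to P(is_wrong).
    Only keys present in both runs are paired.
    """
    keys = sorted(p_base.keys() & p_steered.keys())
    deltas = [logit(p_steered[k]) - logit(p_base[k]) for k in keys]
    return mean(deltas), stdev(deltas)


# Hypothetical probabilities for three (vignette, condition) pairs.
p_base = {("v1", "condA"): 0.50, ("v2", "condA"): 0.50, ("v2", "condB"): 0.80}
p_steered = {("v1", "condA"): 0.25, ("v2", "condA"): 0.20, ("v2", "condB"): 0.80}

d_mean, d_std = delta_auth(p_base, p_steered)
axis_delta = -d_mean  # positive means the persona-aligned direction
print(f"dAuth = {d_mean:.2f} +/- {d_std:.2f}, axis_delta = {axis_delta:.2f}")
```

Pairing by key before differencing removes per-vignette baseline variation, so the std reflects how consistently the method shifts each item rather than how much the items differ from each other.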

### Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)

Authority is the target (move down). Care is one off-target effect:
surgical methods should leave it near zero, whereas broadly suppressing
methods drag it down along with Authority. Full 7-foundation table in
`out/authority/.../foundations_dlogit.csv`. **Bold** = best per column
(most-negative ΔAuth, lowest std, closest-to-zero ΔCare).

| method                    | ΔAuth ↓ (mean ± std)  | ΔCare → 0 (mean ± std) |
| ------------------------- | --------------------: | ---------------------: |
| sl:engineered_prompt      |      **−2.98** ± 1.20 |            −1.64 ± 1.03 |
| sl:sspace_ablate          |           −2.89 ± 0.86 |            −2.79 ± 0.92 |
| sl:sspace                 |           −2.78 ± 0.93 |            −2.57 ± 0.90 |
| sl:angular_steering       |           −2.67 ± 0.89 |            −2.49 ± 0.84 |
| sl:cosine_gated           |           −2.08 ± 0.64 |            −1.88 ± 0.61 |
| sl:directional_ablation   |           −1.94 ± 1.22 |            −1.80 ± 1.24 |
| sl:mean_diff              |           −1.93 ± 1.11 |            −1.72 ± 1.09 |
| sl:mean_centred           |           −1.80 ± 1.17 |            −1.63 ± 1.14 |
| sl:spherical              |           −1.44 ± 0.89 |            −1.21 ± 0.71 |
| sl:pca                    |           −1.36 ± 1.50 |            −1.30 ± 1.36 |
| sl:topk_clusters          |           −1.18 ± 0.97 |            −1.12 ± 0.91 |
| ws:delora*                |           −0.89 ± **0.58** |        −0.49 ± 0.60 |
| sl:linear_act             |           −0.83 ± 0.67 |            −0.70 ± 0.52 |
| sl:chars                  |           −0.45 ± 0.61 |        **−0.40** ± 0.54 |

\*ws:delora was calibrated at p95=0.5 rather than sl's kl=1.0 budget; expect a larger effect after re-calibration.

### Surgical Informedness (headline, ↑ better)

Higher is better in every column: `SI(Auth)`, `SI_fwd`, `SI_rev`,
`Auth_sep`, and `pmass²×100`. **Bold** = best in column. sl rows come from
sl's published Qwen3.5-4B run. ws:delora is at the p95=0.5 budget (kl=1.0
re-run queued with lora/dora).

| method                    | SI(Auth) ↑ | SI_fwd ↑ | SI_rev ↑ | Auth_sep ↑ | pmass²×100 ↑ |
| ------------------------- | ---------: | -------: | -------: | ---------: | -----------: |
| sl:directional_ablation   |  **52.90** |     0.32 |    +1.00 |  **+2.05** |         80.1 |
| sl:super_sspace           |      47.71 |     0.67 |    +0.40 |      +1.99 |         88.8 |
| sl:sspace                 |      45.67 |     0.64 |    +0.85 |      +0.69 |         61.0 |
| sl:mean_diff              |      32.81 |     0.34 |    +1.00 |      +1.65 |         49.0 |
| sl:mean_centred           |      32.72 |     0.29 |    +1.00 |      +1.56 |         50.6 |
| sl:topk_clusters          |      31.34 |     0.13 |    +0.72 |      +1.55 |         73.9 |
| sl:sspace_ablate          |      24.11 | **0.74** |    +0.02 |      +0.59 |         63.6 |
| sl:linear_act             |      20.24 |    −0.19 |    +1.00 |      +0.83 |         49.9 |
| ws:delora                 |      19.03 |     0.02 |    +0.37 |      +0.76 |     **99.9** |
| sl:engineered_prompt      |      17.36 |     0.50 |    −0.02 |      +1.90 |         71.7 |
| sl:cosine_gated           |       8.92 |     0.09 |    +1.00 |      +2.00 |         16.4 |
| sl:angular_steering       |       7.00 |     0.55 |    −0.38 |      +0.32 |         80.6 |
| sl:spherical              |       4.98 |     0.16 |      n/a |      +0.85 |         30.3 |
| sl:pca                    |      −0.92 |     0.03 |    −0.08 |      +0.85 |         39.0 |
| sl:chars                  |      −9.16 |    −0.26 |    +0.00 |      +0.50 |         68.3 |

### TL;DR

1. **Did dW replicate?** Yes. ws:delora ΔAuth = −0.89 (sign correct) and
   SI(Auth) = 19.03: verdicts flip in the right direction.
2. **Did dW beat steering and prompting?** Partially. SI(Auth) = 19.03
   beats the engineered-prompt baseline (17.36) and 5 other sl methods, but
   is below 8 hidden-state methods.
3. **Did dW have lower uncertainty?** Yes, on ΔAuth. ws:delora std =
   **0.58** is the lowest in that column (sl best: chars 0.61), though it
   was measured at the p95=0.5 budget rather than kl=1.0.
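
The rank claims in items 2 and 3 can be checked from the two tables. The snippet below is only that arithmetic, using values copied from the tables above; it is not part of the repo.

```python
# SI(Auth) per method, copied from the Surgical Informedness table.
si_auth = {
    "sl:directional_ablation": 52.90, "sl:super_sspace": 47.71,
    "sl:sspace": 45.67, "sl:mean_diff": 32.81, "sl:mean_centred": 32.72,
    "sl:topk_clusters": 31.34, "sl:sspace_ablate": 24.11,
    "sl:linear_act": 20.24, "ws:delora": 19.03,
    "sl:engineered_prompt": 17.36, "sl:cosine_gated": 8.92,
    "sl:angular_steering": 7.00, "sl:spherical": 4.98,
    "sl:pca": -0.92, "sl:chars": -9.16,
}

# Std of dAuth per method, copied from the dlogit table.
auth_std = {
    "sl:engineered_prompt": 1.20, "sl:sspace_ablate": 0.86,
    "sl:sspace": 0.93, "sl:angular_steering": 0.89,
    "sl:cosine_gated": 0.64, "sl:directional_ablation": 1.22,
    "sl:mean_diff": 1.11, "sl:mean_centred": 1.17, "sl:spherical": 0.89,
    "sl:pca": 1.50, "sl:topk_clusters": 0.97, "ws:delora": 0.58,
    "sl:linear_act": 0.67, "sl:chars": 0.61,
}

ours = si_auth["ws:delora"]
above = [m for m, v in si_auth.items() if v > ours]
below = [m for m, v in si_auth.items() if v < ours]
lowest_std = min(auth_std, key=auth_std.get)

print(f"methods above ws:delora on SI(Auth): {len(above)}")
print(f"methods below ws:delora on SI(Auth): {len(below)}")
print(f"lowest dAuth std: {lowest_std}")
```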

Open: lora and dora training queued (pueue 141-144); ws:delora is at
p95=0.5 budget, not yet at sl's kl=1.0 — expect SI to shift after
re-calibration. Full 4-adapter table pending.

## How to run

```sh
# 1. generate persona-conditioned data
uv run python -m ws.data --behavior authority --model-id Qwen/Qwen3.5-4B
# 2. train all adapters (dW = merged_pos - merged_neg)
uv run python -m ws.run_sweep --behavior authority --model Qwen/Qwen3.5-4B
# 3. iso-KL calibrate α
uv run python -m ws.kl_calibrate --behavior authority --model Qwen/Qwen3.5-4B
# 4. eval on tinymfv airisk
uv run python -m ws.scripts.eval_tinymfv_calibrated --behavior authority --model Qwen/Qwen3.5-4B
# 5. rebuild README tables
uv run python -m ws.scripts.readme_tinymfv_table --behavior authority
```

Outputs go to `out/authority/<adapter>/`. Smoke test on a tiny model:
`just smoke`.
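
Step 3 calibrates α against a KL budget. The sketch below shows one way such a calibration could work; it is not the code in `ws.kl_calibrate`. It assumes that "worst-5%-KL" means the 95th percentile of per-token KL(base ‖ steered), that this quantity grows with α, and it uses toy logits in place of a model. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_nats(p, q):
    """KL(p || q) in nats for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def calibrate_alpha(kl_at, target=1.0, lo=0.0, hi=1.0, iters=50):
    """Bisection for alpha such that kl_at(alpha) is close to target.

    Assumes kl_at is continuous and non-decreasing in alpha.
    """
    while kl_at(hi) < target:  # grow the bracket until it contains the target
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy stand-in for a model: per-token base logits and steering shifts.
base = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5], [0.3, 0.3, 0.3], [1.0, -2.0, 0.0]]
shift = [[1.0, 0.0, -1.0], [0.5, -0.5, 0.0], [0.0, 1.0, 0.0], [-1.0, 1.0, 0.5]]


def toy_p95_kl(alpha):
    kls = []
    for b, s in zip(base, shift):
        p = softmax(b)
        q = softmax([bi + alpha * si for bi, si in zip(b, s)])
        kls.append(kl_nats(p, q))
    return p95(kls)


alpha = calibrate_alpha(toy_p95_kl, target=1.0)
print(f"alpha = {alpha:.4f}, p95 KL = {toy_p95_kl(alpha):.4f} nats")
```

Bisection needs only a monotone KL curve, not gradients, so the same loop applies whether the KL comes from toy logits or from forward passes of a steered model.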

## Cite

```bibtex
@article{FierroRoger2025,
  author    = {Constanza Fierro and Fabien Roger},
  title     = {Steering Language Models with Weight Arithmetic},
  journal   = {arXiv preprint arXiv:2511.05408},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.05408}
}
```

## Related

- [steering-lite](https://github.com/wassname/steering-lite): hidden-state steering, sister project, source of all baseline rows above
- [tinymfv](https://github.com/wassname/tinymfv): vignette dataset
- [PEFT](https://github.com/huggingface/peft): adapter library
- [RepE](https://github.com/andyzoujm/representation-engineering) (Zou et al. 2023): hidden-state steering precursor