wassname 48c1b07b83 readme
2026-05-05 08:12:41 +08:00

# Weight Steering
Fork of [Fierro & Roger 2025](https://arxiv.org/abs/2511.05408). Train two
PEFT adapters on contrastive personas (POS vs NEG), merge each into the base,
take `dW = θ_pos − θ_neg`, and add `α·dW` at inference.
We test whether weight-space steering (dW) competes with hidden-state
steering and prompting on a directly comparable Authority↓ benchmark.
For the dataset, persona pairs, calibration recipe, and baseline methods,
see [steering-lite](https://github.com/wassname/steering-lite). Below, `sl`
denotes steering-lite (the hidden-state steering baselines) and `ws` denotes
this repo. ws shares sl's persona pairs, vignettes, and 1-nat KL budget, so
the ws rows below drop into sl's tables.
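
The weight arithmetic from the opening paragraph (`dW` as the difference of the two merged models, then `α·dW` added to the base) can be sketched on toy data. This is a minimal illustration, not the repo's code: the dict-of-lists "state dicts", the parameter name `layer.weight`, and the helpers `weight_delta` and `steer` are all hypothetical stand-ins for real tensors and the actual merge path.

```python
# Hypothetical toy "state dicts": parameter name -> flat list of floats.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.

def weight_delta(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter."""
    return {
        name: [p - n for p, n in zip(theta_pos[name], theta_neg[name])]
        for name in theta_pos
    }


def steer(theta_base, d_w, alpha):
    """Return base + alpha * dW without mutating the base weights."""
    return {
        name: [b + alpha * d for b, d in zip(theta_base[name], d_w[name])]
        for name in theta_base
    }


theta_base = {"layer.weight": [1.0, 2.0, 3.0]}
theta_pos = {"layer.weight": [1.5, 2.0, 2.0]}  # base merged with the POS adapter
theta_neg = {"layer.weight": [0.5, 2.0, 4.0]}  # base merged with the NEG adapter

d_w = weight_delta(theta_pos, theta_neg)
steered = steer(theta_base, d_w, alpha=0.5)
print(d_w["layer.weight"])      # [1.0, 0.0, -2.0]
print(steered["layer.weight"])  # [1.5, 2.0, 2.0]
```

Setting `alpha=0` recovers the base weights, and a negative `alpha` steers toward the NEG persona instead.
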
## Results: Authority↓ on Qwen3.5-4B (iso-KL=1.0)
We ask three questions:
1. Does dW move Authority in the right direction?
2. Does dW beat hidden-state steering and persona-prompting?
3. Does dW have lower uncertainty than hidden-state steering?
### Glossary
- `dW = θ_pos − θ_neg`: weight-space contrast from two PEFT adapters.
- `α`: steering strength, calibrated so worst-5%-KL hits 1 nat.
- `ΔAuth`: mean change in `logit P(is_wrong)` on Authority vignettes,
paired by (vignette, condition). Negative = Authority↓ achieved.
- `axis_Δ = −ΔAuth` (positive = correct direction, persona-aligned).
- `SI(Auth)`: bidirectional Surgical Informedness on Authority. High
means the method moves Authority without breaking other foundations.
Definition: [steering-lite eval](https://github.com/wassname/steering-lite#eval).
- `prompt_only`: baseline that injects the POS persona as a system
prompt, no steering vector.
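
One way to read the `α` entry is as a one-dimensional search: raise `α` until the 95th percentile of per-token KL against the base model reaches the budget. The sketch below assumes that reading and assumes p95 KL grows monotonically with `α`. The callable `kl_at`, the toy KL model, and the search bounds are hypothetical; the actual procedure lives in `ws.kl_calibrate` and may differ.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def calibrate_alpha(kl_at, target=1.0, lo=0.0, hi=8.0, iters=40):
    """Bisect alpha so that the p95 of per-token KL matches `target` nats.

    `kl_at(alpha)` is a hypothetical callable returning per-token KL values.
    Assumes p95 KL is monotonically increasing in alpha on [lo, hi].
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if percentile(kl_at(mid), 95) < target:
            lo = mid  # still under budget: steer harder
        else:
            hi = mid  # over budget: back off
    return (lo + hi) / 2


def toy_kl(alpha):
    """Toy stand-in for a model: per-token KL grows quadratically in alpha."""
    return [alpha ** 2 * scale for scale in (0.1, 0.2, 0.5, 1.0, 4.0)]


alpha = calibrate_alpha(toy_kl)
print(f"alpha={alpha:.4f}, p95 KL={percentile(toy_kl(alpha), 95):.4f}")
```

Calibrating every method to the same KL budget is what makes the rows comparable: each method spends the same amount of divergence from the base model.
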
### Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)
Authority is the target (move down). Care is one off-target effect:
surgical methods should leave it near zero, while broadly suppressing methods
drag it down with Authority. The full 7-foundation table is in
`out/authority/.../foundations_dlogit.csv`. **Bold** = best per column
(most-negative ΔAuth, lowest std, closest-to-zero ΔCare).
| method | ΔAuth ↓ (mean ± std) | ΔCare → 0 (mean ± std) |
| ------------------------- | --------------------: | ---------------------: |
| sl:engineered_prompt | **2.98** ± 1.20 | 1.64 ± 1.03 |
| sl:sspace_ablate | 2.89 ± 0.86 | 2.79 ± 0.92 |
| sl:sspace | 2.78 ± 0.93 | 2.57 ± 0.90 |
| sl:angular_steering | 2.67 ± 0.89 | 2.49 ± 0.84 |
| sl:cosine_gated | 2.08 ± 0.64 | 1.88 ± 0.61 |
| sl:directional_ablation | 1.94 ± 1.22 | 1.80 ± 1.24 |
| sl:mean_diff | 1.93 ± 1.11 | 1.72 ± 1.09 |
| sl:mean_centred | 1.80 ± 1.17 | 1.63 ± 1.14 |
| sl:spherical | 1.44 ± 0.89 | 1.21 ± 0.71 |
| sl:pca | 1.36 ± 1.50 | 1.30 ± 1.36 |
| sl:topk_clusters | 1.18 ± 0.97 | 1.12 ± 0.91 |
| ws:delora* | 0.89 ± **0.58** | 0.49 ± 0.60 |
| sl:linear_act | 0.83 ± 0.67 | 0.70 ± 0.52 |
| sl:chars | 0.45 ± 0.61 | **0.40** ± 0.54 |
\*ws:delora was calibrated at p95=0.5 rather than kl=1.0; expect a larger effect after re-calibration.
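
For orientation, a `mean ± std` entry of the kind reported in the table can be computed from paired probabilities roughly as follows. This is a sketch under assumptions: the `(vignette, condition)` keys, the condition name `c1`, and the probabilities are made up, and whether the repo reports sample std across pairs (as here) or some other spread is not stated in this README.

```python
import math
from statistics import mean, stdev


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def delta_logit(base, steered):
    """Paired change in logit P(is_wrong), matched on (vignette, condition).

    Returns (mean, sample std) over the keys present in both runs.
    """
    keys = sorted(base.keys() & steered.keys())
    deltas = [logit(steered[k]) - logit(base[k]) for k in keys]
    return mean(deltas), stdev(deltas)


# Hypothetical P(is_wrong) per (vignette, condition) before and after steering.
base = {("v1", "c1"): 0.5, ("v2", "c1"): 0.5, ("v3", "c1"): 0.8}
steered = {("v1", "c1"): 0.2, ("v2", "c1"): 0.5, ("v3", "c1"): 0.5}

d_mean, d_std = delta_logit(base, steered)
print(f"ΔAuth = {d_mean:.2f} ± {d_std:.2f}")
```

Pairing on the key removes between-vignette variance, so the std reflects how consistently the method shifts each item rather than how much the items differ from each other.
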
### Surgical Informedness (headline, ↑ better)
For `SI(Auth)`, `SI_fwd`, `SI_rev`, `Auth_sep`, and `pmass²×100`, higher is
better. **Bold** = best in column. sl rows come from sl's published Qwen3.5-4B
run. ws:delora is at the p95=0.5 budget (kl=1.0 re-run queued with lora/dora).
| method                    | SI(Auth) ↑ | SI_fwd ↑ | SI_rev ↑ | Auth_sep ↑ | pmass²×100 ↑ |
| ------------------------- | ---------: | -------: | -------: | ---------: | -----------: |
| sl:directional_ablation   | **52.90**  | 0.32     | +1.00    | **+2.05**  | 80.1         |
| sl:super_sspace           | 47.71      | 0.67     | +0.40    | +1.99      | 88.8         |
| sl:sspace                 | 45.67      | 0.64     | +0.85    | +0.69      | 61.0         |
| sl:mean_diff              | 32.81      | 0.34     | +1.00    | +1.65      | 49.0         |
| sl:mean_centred           | 32.72      | 0.29     | +1.00    | +1.56      | 50.6         |
| sl:topk_clusters          | 31.34      | 0.13     | +0.72    | +1.55      | 73.9         |
| sl:sspace_ablate          | 24.11      | **0.74** | +0.02    | +0.59      | 63.6         |
| sl:linear_act             | 20.24      | 0.19     | +1.00    | +0.83      | 49.9         |
| ws:delora                 | 19.03      | 0.02     | +0.37    | +0.76      | **99.9**     |
| sl:engineered_prompt      | 17.36      | 0.50     | −0.02    | +1.90      | 71.7         |
| sl:cosine_gated           | 8.92       | 0.09     | +1.00    | +2.00      | 16.4         |
| sl:angular_steering       | 7.00       | 0.55     | −0.38    | +0.32      | 80.6         |
| sl:spherical              | 4.98       | 0.16     | n/a      | +0.85      | 30.3         |
| sl:pca                    | 0.92       | 0.03     | −0.08    | +0.85      | 39.0         |
| sl:chars                  | −9.16      | 0.26     | +0.00    | +0.50      | 68.3         |
### TL;DR
1. **Did dW replicate?** Yes. ws:delora has ΔAuth = 0.89 (sign correct) and
SI(Auth) = 19.03, so verdicts do flip in the right direction.
2. **Did dW beat steering and prompting?** Partially. SI(Auth) = 19.03 beats
the engineered-prompt baseline (17.36) and 5 other sl methods, but sits
below 8 hidden-state methods.
3. **Did dW have lower uncertainty?** Yes. ws:delora's ΔAuth std = **0.58** is
the lowest in the table (sl best: chars, 0.61).
Open: lora and dora training is queued (pueue 141-144). ws:delora is at the
p95=0.5 budget, not yet at sl's kl=1.0, so expect SI to shift after
re-calibration. The full 4-adapter table is pending.
## How to run
```sh
# 1. generate persona-conditioned data
uv run python -m ws.data --behavior authority --model-id Qwen/Qwen3.5-4B
# 2. train all adapters (dW = merged_pos - merged_neg)
uv run python -m ws.run_sweep --behavior authority --model Qwen/Qwen3.5-4B
# 3. iso-KL calibrate α
uv run python -m ws.kl_calibrate --behavior authority --model Qwen/Qwen3.5-4B
# 4. eval on tinymfv airisk
uv run python -m ws.scripts.eval_tinymfv_calibrated --behavior authority --model Qwen/Qwen3.5-4B
# 5. rebuild README tables
uv run python -m ws.scripts.readme_tinymfv_table --behavior authority
```
Outputs go to `out/authority/<adapter>/`. Smoke test on a tiny model:
`just smoke`.
## Cite
```bibtex
@article{FierroRoger2025,
author = {Constanza Fierro and Fabien Roger},
title = {Steering Language Models with Weight Arithmetic},
journal = {arXiv preprint arXiv:2511.05408},
year = {2025},
url = {https://arxiv.org/abs/2511.05408}
}
```
## Related
- [steering-lite](https://github.com/wassname/steering-lite): hidden-state steering, sister project, source of all baseline rows above
- [tinymfv](https://github.com/wassname/tinymfv): vignette dataset
- [PEFT](https://github.com/huggingface/peft): adapter library
- [RepE](https://github.com/andyzoujm/representation-engineering) (Zou et al. 2023): hidden-state steering precursor