# Weight Steering
Fork of Fierro & Roger 2025, rebuilt on HF + PEFT + uv, targeting Qwen3-0.6B.
Method: `dW = theta_pos - theta_neg`, then add `alpha * dW` at inference.
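The arithmetic in the method line can be sketched as follows. This is a minimal, dependency-free illustration, not the repo's implementation: parameters are modeled as plain dicts of float lists, and `task_vector`, `steer`, and the toy values are hypothetical names and numbers chosen for the example. It also assumes `theta_pos` and `theta_neg` are weights fine-tuned toward the positive and negative behavior respectively.

```python
# Illustrative sketch only: parameters are dicts of name -> list of floats.
# Real checkpoints would hold tensors; the arithmetic is the same per element.

def task_vector(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter."""
    return {
        name: [p - n for p, n in zip(theta_pos[name], theta_neg[name])]
        for name in theta_pos
    }

def steer(theta_base, dw, alpha):
    """Return theta_base + alpha * dW without mutating the inputs."""
    return {
        name: [b + alpha * d for b, d in zip(theta_base[name], dw[name])]
        for name in theta_base
    }

# Toy values (hypothetical), one parameter with three entries.
theta_base = {"layer.weight": [0.0, 1.0, 2.0]}
theta_pos = {"layer.weight": [0.5, 1.0, 2.5]}
theta_neg = {"layer.weight": [-0.5, 1.0, 1.5]}

dw = task_vector(theta_pos, theta_neg)
steered_pos = steer(theta_base, dw, alpha=1.0)   # push toward the positive behavior
steered_neg = steer(theta_base, dw, alpha=-1.0)  # push toward the negative behavior
```

Negating `alpha` reuses the same `dW` to steer in the opposite direction, which is how the `alpha = -1.0` and `alpha = +1.0` rows in the results come from one trained contrast.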
## Results (Qwen3-0.6B, honesty, N=1, single seed)
All evals use the base persona at eval time, with no system prompt.
### Primary evals: AIRiskDilemmas + tiny-mfv AIRisk
The current headline evaluations are:
- **AIRiskDilemmas / Truthfulness**: guided-CoT, action-choice preference on
1,869 labeled dilemmas from `kellycyy/AIRiskDilemmas`.
- **tiny-mfv / airisk**: fast logprob probe on 132 AI-risk vignettes from
[`wassname/tiny-mfv`](https://huggingface.co/datasets/wassname/tiny-mfv),
scored with dual JSON-bool prompts on `other_violate` and `self_violate`.

tiny-mfv is the cleaner fast probe here: it is cheaper, gives stable bool-mass
sanity checks, and exposes both **moral wrongness shift** and **perspective
gap** directly. AIRiskDilemmas remains the higher-variance, higher-context
complement.
### tiny-mfv AIRisk: current confirmed full run
Qwen3-0.6B, honesty `delora`, 131 joined vignettes, bootstrap `n=1000`.
| adapter | alpha | wrongness | 95% CI | gap | 95% CI |
| ------- | ----: | --------: | :----- | --: | :----- |
| delora | -1.0 | +0.795 | [+0.764, +0.823] | +0.114 | [+0.086, +0.146] |
| base | 0.0 | +0.423 | [+0.345, +0.501] | +0.468 | [+0.391, +0.548] |
| delora | +1.0 | -0.350 | [-0.392, -0.308] | +0.269 | [+0.233, +0.304] |

Interpretation: on this AIRisk probe, positive `delora` steering (alpha=+1)
moves the wrongness score from +0.423 to -0.350, i.e. strongly away from rating
the AI-risk violations as wrong, while negative steering (alpha=-1) raises it
to +0.795. The 95% bootstrap intervals of the three rows do not overlap, so
the sign of the effect is not ambiguous on this dataset.
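For readers unfamiliar with the interval columns, here is a minimal sketch of a percentile bootstrap over per-vignette scores with `n=1000` resamples. The repo's exact resampling procedure is not shown in this README, so the function name `bootstrap_mean_ci`, the percentile method, and the synthetic scores are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    # Take the empirical quantiles of the resampled means as CI bounds.
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return statistics.fmean(scores), boot_means[lo_idx], boot_means[hi_idx]

# Synthetic per-vignette scores; NOT data from this repo.
data_rng = random.Random(1)
scores = [data_rng.uniform(-1.0, 1.0) for _ in range(131)]

mean, lo, hi = bootstrap_mean_ci(scores)
print(f"mean={mean:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

Fixing the seed keeps the interval reproducible between runs, which matters when comparing rows that differ only in `alpha`.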
### Queued full table
The repo now queues the full README refresh through `pueue`:
- 6 adapters (`ia3`, `oft`, `dora`, `lora`, `pissa`, `delora`)
- 2 datasets (`AIRiskDilemmas`, `tiny-mfv/airisk`)
- 1 final summarizer producing `out/honesty/readme_airisk_table.csv`

The summary table reports uncertainty for both the baseline and each adapter.
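The queue amounts to 6 x 2 = 12 eval jobs plus one summarizer that must run last. The sketch below enumerates those jobs using the `just` recipes from the How to run section. How the repo actually drives `pueue` is not shown here, so the helper names and the `enqueue` function are hypothetical; `enqueue` relies on the `--print-task-id` and `--after` options of `pueue add` and is defined but never called.

```python
import shlex
import subprocess

ADAPTERS = ["ia3", "oft", "dora", "lora", "pissa", "delora"]
# Assumed mapping: one `just` recipe per dataset (AIRiskDilemmas, tiny-mfv/airisk).
EVAL_RECIPES = ["eval-airisk", "eval-tinymfv-airisk"]

def build_eval_jobs(behavior="honesty"):
    """One command per (adapter, dataset) pair."""
    return [
        ["just", recipe, f"adapter={adapter}", f"behavior={behavior}"]
        for adapter in ADAPTERS
        for recipe in EVAL_RECIPES
    ]

def build_summary_job(behavior="honesty"):
    """The final summarizer command."""
    return ["just", "summarize-airisk", f"behavior={behavior}"]

def enqueue(behavior="honesty"):
    """Hypothetical helper: queue evals, then the summarizer after all of them."""
    task_ids = []
    for job in build_eval_jobs(behavior):
        result = subprocess.run(
            ["pueue", "add", "--print-task-id", "--", *job],
            check=True, capture_output=True, text=True,
        )
        task_ids.append(result.stdout.strip())
    # The summarizer only starts once every eval task has finished.
    subprocess.run(
        ["pueue", "add", "--after", *task_ids, "--", *build_summary_job(behavior)],
        check=True,
    )

if __name__ == "__main__":
    # Dry run: print the commands instead of queueing them.
    for job in build_eval_jobs() + [build_summary_job()]:
        print(shlex.join(job))
```

Making the summarizer depend on the eval task ids avoids writing the CSV from a partial set of results.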
### OOD: held-out sycophancy Yes/No claims (12 claims, alpha=+1)
This eval was previously labeled "IID". That was wrong: these are *sycophancy*
claims, but the dW was trained on the *honesty* contrast (see
[src/ws/data.py](src/ws/data.py)). The 12 claims are also held out from the
training topics, so the eval is doubly OOD (different behavior axis and
held-out topics). The reported metric is
`mean logratio = log P(Yes) - log P(No)`, averaged over the 12 claims, where
Yes means agreeing with the user's wrong belief, i.e. the sycophantic,
dishonest answer.
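As a worked illustration of the metric, the sketch below computes the mean logratio for a base and a steered model and takes their difference as the shift. The log-probabilities are made-up numbers for three claims, and `mean_logratio` is a hypothetical helper, not a function from this repo.

```python
def mean_logratio(claims):
    """Average of log P(Yes) - log P(No) over (logp_yes, logp_no) pairs."""
    return sum(logp_yes - logp_no for logp_yes, logp_no in claims) / len(claims)

# Hypothetical (log P(Yes), log P(No)) pairs; NOT measurements from this repo.
base_claims = [(-0.2, -2.0), (-1.5, -0.4), (-0.1, -3.0)]
steered_claims = [(-0.05, -4.0), (-0.3, -1.6), (-0.02, -5.0)]

base_lr = mean_logratio(base_claims)
steered_lr = mean_logratio(steered_claims)

# A positive shift means the steered model leans further toward "Yes",
# which on these probes is the sycophantic answer.
shift_vs_base = steered_lr - base_lr
print(f"base={base_lr:.3f} steered={steered_lr:.3f} shift={shift_vs_base:+.3f}")
```

The same relation holds in the table that follows: subtracting each adapter's shift from its `mean_lr` gives 2.729 in every row, which is the implied base value.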
| adapter | mean_lr | shift vs base |
| ------- | ------: | ------------: |
| pissa | 8.437 | +5.708 |
| delora | 7.198 | +4.469 |
| lora | 6.531 | +3.802 |
| dora | 6.156 | +3.427 |
| oft | 3.917 | +1.188 |
| ia3 | 2.719 | -0.010 |

`alpha=+1` makes the model say *more* Yes on these sycophancy probes for every
adapter except `ia3`, which barely moves (-0.010). In other words, the steered
models become more sycophantic, not more honest. **This is consistent with the
AIRisk results above**: the trained dW steers toward
*agreeableness/Yes-bias*, not honesty. Likely cause: at 0.6B, the
honest-vs-dishonest persona conditioning at data-gen time produces a response
contrast dominated by *compliance/length/confidence* rather than truthfulness.
## How to run
```sh
# Quick sanity check (~1 min, tiny random Qwen3)
just smoke
# Full pipeline for one adapter
uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior honesty --adapter lora
# All adapters
uv run python -m ws.run_sweep --behavior honesty --n-personas 1 --n-samples 50
# AIRiskDilemmas
just eval-airisk adapter=delora behavior=honesty
# tiny-mfv AIRisk with bootstrap uncertainty
just eval-tinymfv-airisk adapter=delora behavior=honesty
# README-ready combined table after per-adapter runs
just summarize-airisk behavior=honesty
```
Source layout: core modules live in `src/ws/`, active benchmarks in `src/ws/eval/`, and CLI/report helpers in `src/ws/scripts/`. Outputs go to `out/<behavior>/<adapter>/`.
## Cite
```bibtex
@article{FierroRoger2025,
author = {Constanza Fierro and Fabien Roger},
title = {Steering Language Models with Weight Arithmetic},
journal = {arXiv preprint arXiv:2511.05408},
year = {2025},
url = {https://arxiv.org/abs/2511.05408},
doi = {10.48550/arXiv.2511.05408}
}
```
## Related
- Paper: https://arxiv.org/abs/2511.05408
- tiny-mfv dataset: https://huggingface.co/datasets/wassname/tiny-mfv
- AIRiskDilemmas dataset: `kellycyy/AIRiskDilemmas` (HuggingFace)
- RepE baseline: `representation-engineering` (Zou et al. 2023)
- PEFT: https://github.com/huggingface/peft