# Weight Steering

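A fork of Fierro & Roger (2025), rebuilt on Hugging Face + PEFT + uv and targeting Qwen3-0.6B.
Method: compute `dW = theta_pos - theta_neg` from the weights trained on the positive and negative sides of a behavior contrast, then add `alpha * dW` to the model weights at inference.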
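
As a rough sketch of the arithmetic above (not the repo's implementation): `task_vector`, `steer`, and the parameter names below are hypothetical, and the weights are modeled as plain dicts mapping parameter names to values. Because only `-`, `+`, and `*` are used, the same expressions should also work when the values are tensors or arrays.

```python
def task_vector(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter name."""
    # Assumption: both dicts share the same keys (and shapes, for tensors).
    return {name: theta_pos[name] - theta_neg[name] for name in theta_pos}


def steer(theta_base, dw, alpha):
    """Return theta_base + alpha * dW; parameters without a delta are left untouched."""
    return {
        name: (w + alpha * dw[name]) if name in dw else w
        for name, w in theta_base.items()
    }


# Toy example with scalar "weights" (hypothetical parameter names).
theta_pos = {"q_proj": 1.5, "v_proj": -0.5}
theta_neg = {"q_proj": 0.5, "v_proj": 0.5}
theta_base = {"q_proj": 1.0, "v_proj": 0.0, "embed": 2.0}

dw = task_vector(theta_pos, theta_neg)
steered = steer(theta_base, dw, alpha=-1.0)
print(steered)
```

A negative `alpha` subtracts the direction instead of adding it, which is how the `alpha = -1.0` rows in the results below should be read.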

## Results (Qwen3-0.6B, honesty, N=1, single seed)

All evals use the base persona at eval time, with no system prompt.

### Primary evals: AIRiskDilemmas + tiny-mfv AIRisk

The current headline evaluations are:

- **AIRiskDilemmas / Truthfulness**: guided-CoT, action-choice preference on
  1,869 labeled dilemmas from `kellycyy/AIRiskDilemmas`.
- **tiny-mfv / airisk**: fast logprob probe on 132 AI-risk vignettes from
  [`wassname/tiny-mfv`](https://huggingface.co/datasets/wassname/tiny-mfv),
  scored with dual JSON-bool prompts on `other_violate` and `self_violate`.

tiny-mfv is the cleaner fast probe here: it is cheaper, gives stable bool-mass
sanity checks, and exposes both **moral wrongness shift** and **perspective
gap** directly. AIRiskDilemmas remains the higher-variance, higher-context
complement.

### tiny-mfv AIRisk: current confirmed full run

Qwen3-0.6B, honesty `delora`, 131 joined vignettes, bootstrap `n=1000`.

| adapter | alpha | wrongness | 95% CI | gap | 95% CI |
| ------- | ----: | --------: | :----- | --: | :----- |
| delora  | -1.0  | +0.795 | [+0.764, +0.823] | +0.114 | [+0.086, +0.146] |
| base    |  0.0  | +0.423 | [+0.345, +0.501] | +0.468 | [+0.391, +0.548] |
| delora  | +1.0  | -0.350 | [-0.392, -0.308] | +0.269 | [+0.233, +0.304] |

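Interpretation: on this AIRisk probe, positive `delora` steering (alpha = +1.0)
moves strongly away from rating the AI-risk violations as wrong (wrongness
+0.423 to -0.350), while negative steering (alpha = -1.0) moves the other way
(+0.423 to +0.795). Neither steered CI overlaps the base CI, so the sign of the
effect is not ambiguous on this dataset. Both steering directions also shrink
the gap relative to base (+0.468).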
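
The exact resampling scheme behind the 95% CIs is not shown here. A minimal percentile-bootstrap sketch, assuming the statistic is a mean over per-vignette scores resampled with replacement, might look like this. `bootstrap_ci` is a hypothetical helper and the scores are synthetic, not real results.

```python
import random
import statistics


def bootstrap_ci(scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


# Synthetic per-vignette scores (NOT real results); 131 items as in the run above.
rng = random.Random(1)
scores = [rng.uniform(-1.0, 1.0) for _ in range(131)]
mean, lo, hi = bootstrap_ci(scores)
print(f"{mean:+.3f} [{lo:+.3f}, {hi:+.3f}]")
```

Fixing the seed keeps the interval reproducible between README refreshes.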

### Queued full table

The repo now queues the full README refresh through `pueue`:

- 6 adapters (`ia3`, `oft`, `dora`, `lora`, `pissa`, `delora`)
- 2 datasets (`AIRiskDilemmas`, `tiny-mfv/airisk`)
- 1 final summarizer producing `out/honesty/readme_airisk_table.csv`

The summary reports uncertainty for both the baseline and each adapter.
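
To make the size of the queue concrete, the sketch below enumerates the jobs implied by the list above: 6 adapters x 2 datasets, plus one summarizer. It assumes each job maps onto the `just` recipes from the "How to run" section and is submitted with `pueue add -- <command>`; the repo's actual queueing script may differ.

```python
from itertools import product

ADAPTERS = ["ia3", "oft", "dora", "lora", "pissa", "delora"]
# Recipe names copied from the "How to run" section of this README.
EVAL_RECIPES = ["eval-airisk", "eval-tinymfv-airisk"]
BEHAVIOR = "honesty"


def queued_commands():
    """List the shell commands for 6 adapters x 2 datasets plus one summarizer."""
    cmds = [
        f"just {recipe} adapter={adapter} behavior={BEHAVIOR}"
        for adapter, recipe in product(ADAPTERS, EVAL_RECIPES)
    ]
    # The summarizer goes last because it reads the per-adapter outputs.
    cmds.append(f"just summarize-airisk behavior={BEHAVIOR}")
    return cmds


for cmd in queued_commands():
    # Assumption: each job is submitted with `pueue add -- <command>`.
    print(f"pueue add -- {cmd}")
```

This prints 13 submissions. The summarizer only makes sense once the 12 eval jobs have finished, so in practice it would need to be queued as dependent on them.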

### OOD: held-out sycophancy Yes/No claims (12 claims, alpha=+1)

This section was previously labeled "IID". That label was wrong: these are
*sycophancy* claims, but the dW was trained on the *honesty* contrast (see
[src/ws/data.py](src/ws/data.py)). The 12 claims are also held out from the
training topics, so this is doubly OOD (different behavior axis + held-out
topics). The reported metric is `mean logratio = log P(Yes) - log P(No)`,
averaged over the 12 claims, where Yes = agreeing with the user's wrong
belief = sycophantic = dishonest.

| adapter | mean_lr | shift vs base |
| ------- | ------: | ------------: |
| pissa   |   8.437 |        +5.708 |
| delora  |   7.198 |        +4.469 |
| lora    |   6.531 |        +3.802 |
| dora    |   6.156 |        +3.427 |
| oft     |   3.917 |        +1.188 |
| ia3     |   2.719 |        -0.010 |

`alpha=+1` makes the model say *more* Yes on these sycophancy probes for every
adapter except `ia3` (shift -0.010, effectively none), i.e. more sycophantic,
not more honest. **This is consistent with the AIRisk results above**: the
trained dW is steering toward *agreeableness/Yes-bias*, not honesty. Likely
cause: at 0.6B, the honest-vs-dishonest persona conditioning at data-gen time
produces a response contrast dominated by *compliance/length/confidence*
rather than truthfulness.
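
The metric and the "shift vs base" column can be checked with a few lines of arithmetic. The sketch below defines the logratio as stated above and recovers the base model's mean logratio implied by each table row (`mean_lr - shift`). The helper names are hypothetical; the numbers are copied from the table.

```python
import math


def logratio(p_yes, p_no):
    """log P(Yes) - log P(No) for a single claim."""
    return math.log(p_yes) - math.log(p_no)


def mean_logratio(probs):
    """Mean logratio over (P(Yes), P(No)) pairs, one pair per claim."""
    return sum(logratio(y, n) for y, n in probs) / len(probs)


# (mean_lr, shift vs base) rows copied from the table above.
TABLE = {
    "pissa": (8.437, 5.708),
    "delora": (7.198, 4.469),
    "lora": (6.531, 3.802),
    "dora": (6.156, 3.427),
    "oft": (3.917, 1.188),
    "ia3": (2.719, -0.010),
}

# The base mean logratio implied by each row is mean_lr - shift.
implied_base = {name: round(lr - shift, 3) for name, (lr, shift) in TABLE.items()}
print(implied_base)
```

Every row implies the same base value, 2.729, so the table is internally consistent. A positive base logratio also means the unsteered model already leans toward Yes on these claims.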

## How to run

```sh
# Quick sanity check (~1 min, tiny random Qwen3)
just smoke

# Full pipeline for one adapter
uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior honesty --adapter lora

# All adapters
uv run python -m ws.run_sweep --behavior honesty --n-personas 1 --n-samples 50

# AIRiskDilemmas
just eval-airisk adapter=delora behavior=honesty

# tiny-mfv AIRisk with bootstrap uncertainty
just eval-tinymfv-airisk adapter=delora behavior=honesty

# README-ready combined table after per-adapter runs
just summarize-airisk behavior=honesty
```

Source layout: core modules live in `src/ws/`, active benchmarks in `src/ws/eval/`, and CLI/report helpers in `src/ws/scripts/`. Outputs go to `out/<behavior>/<adapter>/`.

## Cite

```bibtex
@article{FierroRoger2025,
  author    = {Constanza Fierro and Fabien Roger},
  title     = {Steering Language Models with Weight Arithmetic},
  journal   = {arXiv preprint arXiv:2511.05408},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.05408},
  doi       = {10.48550/arXiv.2511.05408}
}
```

## Related

- Paper: https://arxiv.org/abs/2511.05408
- tiny-mfv dataset: https://huggingface.co/datasets/wassname/tiny-mfv
- AIRiskDilemmas dataset: `kellycyy/AIRiskDilemmas` (HuggingFace)
- RepE baseline: `representation-engineering` (Zou et al. 2023)
- PEFT: https://github.com/huggingface/peft