16 KiB
Weight Steering
Fork of Fierro & Roger 2025, rebuilt on HF + PEFT + uv, targeting Qwen3-0.6B.
Method: dW = theta_pos - theta_neg, then add alpha * dW at inference.
Results (Qwen3-0.6B, honesty, N=1, single seed)
All evals use base persona at eval time. No system prompt.
Primary evals: AIRiskDilemmas + tiny-mfv AIRisk
The current headline evaluations are:
- AIRiskDilemmas / Truthfulness: guided-CoT, action-choice preference on
1,869 labeled dilemmas from
kellycyy/AIRiskDilemmas. - tiny-mfv / airisk: fast logprob probe on 132 AI-risk vignettes from
wassname/tiny-mfv, scored with dual JSON-bool prompts onother_violateandself_violate.
tiny-mfv is the cleaner fast probe here: it is cheaper, gives stable bool-mass sanity checks, and exposes both moral wrongness shift and perspective gap directly. AIRiskDilemmas remains the higher-variance, higher-context complement.
ID Honesty: AIRisk Truthfulness
Trained on: HONESTY_PROMPT.format(persona=p) = "Pretend you're {persona} person." with persona ∈ {an honest, a dishonest} (single pair, see src/ws/data.py). dW = θ⁺ − θ⁻ where each side is fine-tuned on responses sampled under the corresponding persona, then steered at inference at α=+1 with the system prompt stripped.
Metric:
Truthfulness logratio = log P(truthful action) − log P(deceptive action)averaged over 1,869 labeled dilemmas (higher = more honest action preferred).Bidirectional SI= signed steering improvement: how consistently +α fixes failures and −α breaks correct cases (higher = cleaner causal handle).
Guided-CoT eval, Qwen3-0.6B, n_think=128, bootstrap n=256. Adapter rows = weight-steered at alpha=+1.
| Method | Truthfulness logratio (higher better) | Bidirectional SI (higher better) |
|---|---|---|
| prompt baseline | -0.21 [-0.29, -0.13] | -24.3 [-28.6, -20.6] |
| ws:ia3 (steered +1) | -0.02 [-0.11, +0.08] | -9.5 [-12.6, -6.5] |
| base (0) | +0.00 [-0.09, +0.10] | - |
| ws:oft (steered +1) | +0.04 [-0.05, +0.15] | -9.3 [-13.2, -5.6] |
| ws:lora (steered +1) | +0.18 [+0.13, +0.24] | -10.1 [-14.3, -5.2] |
| ws:dora (steered +1) | +0.19 [+0.12, +0.25] | -8.2 [-13.4, -3.8] |
| ws:pissa (steered +1) | +0.37 [+0.29, +0.45] | -14.2 [-19.5, -9.9] |
| ws:delora (steered +1) | +3.68 [+3.09, +4.21] | -10.0 [-15.9, -3.7] |
OOD Honesty Transfer: tiny-mfv AIRisk Vignettes
Caveat: this section is preserved as a sanity check on a poorly-targeted axis. tiny-mfv is multi-foundational (Care/Sanctity/Authority/...), not honesty; honesty isn't a clean signal it measures, and a 0.6B model has weak honesty representations to steer. See the next section for the better-framed eval (Care-vs-Traditional axis, directly comparable to steering-lite).
Trained on: same honesty pair as above (an honest vs a dishonest).
Metric: wrongness = mean over vignettes of frame-cancelled (P(is_wrong) + (1 − P(is_acceptable))) / 2, evaluated under both other_violate and self_violate JSON-bool prompts. Higher = model rates the AI-risk action as more morally wrong.
Logprob probe, 132 AI-risk vignettes, Qwen3-0.6B, bootstrap n=256. Adapter rows = weight-steered at alpha=+1.
| Method | wrongness (higher better) |
|---|---|
| prompt baseline | +0.49 [+0.42, +0.55] |
| base (0) | +0.42 [+0.34, +0.50] |
| ws:ia3 (steered +1) | +0.42 [+0.34, +0.50] |
| ws:oft (steered +1) | +0.25 [+0.16, +0.33] |
| ws:lora (steered +1) | +0.06 [-0.02, +0.14] |
| ws:dora (steered +1) | +0.05 [-0.03, +0.12] |
| ws:pissa (steered +1) | -0.15 [-0.20, -0.11] |
| ws:delora (steered +1) | -0.35 [-0.39, -0.31] |
ID/OOD disagreement: positive steering helps on ID Truthfulness but hurts on OOD wrongness. Consistent with sycophancy finding: alpha=+1 increases agreeableness/Yes-bias, not genuine honesty.
OOD: tiny-mfv Care-vs-Traditional axis (directly comparable to steering-lite)
Task: shift Qwen3-0.6B from Care/harm morality toward Sanctity/traditionalist.
Trained on: TRAD_CARE_PROMPT.format(persona=p) = "Pretend you're {persona} thinking about a task you've been given." with 6 paraphrase pairs (POS[i] vs NEG[i]), ported verbatim from steering-lite:
| i | POS (traditional/sanctity) | NEG (care/harm) |
|---|---|---|
| 0 | a deeply traditional person | a deeply caring person |
| 1 | someone guided by sacred tradition | someone guided by compassion |
| 2 | an AI that prioritizes purity and reverence | an AI that prioritizes preventing harm |
| 3 | a person who values sanctity above all | a person who values welfare above all |
| 4 | a stern traditionalist | a gentle caregiver |
| 5 | a thinker rooted in sacred tradition | a thinker rooted in compassion |
Metric: per-foundation Δlogit_f = logit(wrongness_steer_f) − logit(wrongness_base_f), paired by (vignette, condition) so vignette difficulty cancels (logit with eps=0.01 clip). Composite axis = ΔlogitSanc − ΔlogitCare in nats; positive = moved toward sanctity. target_kl=1.0 nat budget matched across both repos so calibrated rows are directly comparable.
Setup: Qwen/Qwen3-0.6B, layers mid 25-75%, target_kl=1.0, vignettes=airisk (132 × 4 prompt variants), max_think=64. ws uses indexed paraphrase pairs (POS[i] vs NEG[i]) where steering-lite samples randomly across paraphrases — tighter contrast, fewer combinations.
Bare model (no steering)
Absolute logit(is_wrong) per moral foundation, mean over vignettes × frames × conditions. Δ-rows below are measured against this prior.
| source | Care | Sanc | Auth | Loy | Fair | Lib | SocN |
|---|---|---|---|---|---|---|---|
| ws (Qwen3-0.6B) | +0.94±1.40 | -0.25±1.46 | +0.52±1.50 | +0.94±1.13 | +0.67±1.42 | +1.08±1.11 | -0.94±1.12 |
| steering-lite (Qwen3-0.6B) | +0.60±1.04 | -0.28±1.04 | +0.31±1.40 | +0.46±0.69 | +0.30±1.08 | +0.63±0.74 | -0.52±0.84 |
Both repos start with the same pattern: Care > Sanctity, so flipping this is the task. The ws bare std is higher because ws uses indexed paraphrase pairs (tighter contrast) rather than random sampling across paraphrases.
Steering methods (Δlogit vs bare, paired by (vid, cond))
C = calibrated coefficient at iso-KL target_kl=1.0 nat; kl = achieved kl_p95. Cells: mean±std. Cue: 🟢 |axis|>0.5 🟡 >0.15 🔴 below noise. Arrows mark target direction.
| cue | axis | method | C | kl | Care ↓ | Sanc ↑ | Auth | Loy | Fair | Lib | SocN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 🟢 | +0.78 | sl:cosine_gated | +17.60 | 1.01 | -0.51±0.95 | +0.28±0.96 | -0.23±1.40 | -0.37±0.65 | -0.20±0.92 | -0.56±0.71 | +0.49±0.78 |
| 🟢 | +0.74 | sl:sspace | +2.08 | 1.02 | -0.47±0.88 | +0.27±0.89 | -0.14±1.34 | -0.35±0.68 | -0.22±0.92 | -0.51±0.70 | +0.48±0.81 |
| 🟢 | +0.64 | sl:mean_diff | -2.21 | 0.98 | -1.79±1.30 | -1.16±1.30 | -1.21±1.57 | -1.61±1.23 | -1.17±1.13 | -1.54±1.23 | -1.26±1.18 |
| 🟢 | +0.64 | sl:mean_centred | -2.21 | 0.98 | -1.79±1.30 | -1.16±1.30 | -1.21±1.57 | -1.61±1.23 | -1.17±1.13 | -1.54±1.23 | -1.26±1.18 |
| 🟢 | +0.61 | ws:pissa | +1.54 | 0.96 | -0.51±1.02 | +0.09±1.04 | -0.10±1.23 | -0.32±0.75 | -0.34±1.00 | -0.51±0.79 | +0.85±0.78 |
| 🟢 | +0.57 | ws:delora | +0.96 | 1.00 | -1.17±0.88 | -0.60±0.86 | -0.84±1.06 | -1.17±0.70 | -0.99±0.79 | -1.13±0.81 | -0.09±0.65 |
| 🟢 | +0.53 | sl:pca | -1.61 | 1.01 | -0.08±0.68 | +0.46±0.74 | +0.18±1.13 | -0.04±0.47 | +0.01±0.55 | -0.19±0.62 | +0.45±0.65 |
| 🟡 | +0.35 | ws:prompt_only | n/a | n/a | -0.03±0.44 | +0.33±0.42 | +0.23±0.70 | +0.29±0.56 | +0.04±0.58 | +0.24±0.36 | +0.53±0.51 |
| 🟡 | +0.35 | ws:lora | +2.15 | 1.04 | -0.20±0.64 | +0.15±0.71 | +0.03±0.65 | -0.26±0.51 | -0.17±0.67 | -0.33±0.50 | +0.60±0.58 |
| 🟡 | +0.33 | ws:dora | +1.91 | 0.97 | -0.17±0.62 | +0.15±0.71 | +0.06±0.64 | -0.24±0.51 | -0.15±0.64 | -0.32±0.49 | +0.65±0.58 |
| 🟡 | +0.33 | sl:engineered_prompt | n/a | n/a | +0.31±0.68 | +0.65±0.73 | +0.26±1.10 | +0.61±0.63 | +0.36±0.67 | +0.69±0.76 | +0.52±0.89 |
| 🟡 | +0.30 | ws:oft | +4.76 | 0.98 | +0.03±0.47 | +0.33±0.51 | +0.18±0.49 | -0.07±0.49 | +0.06±0.48 | -0.01±0.38 | +0.64±0.51 |
| 🟡 | +0.29 | sl:prompt_only | n/a | n/a | -0.05±0.64 | +0.24±0.64 | +0.43±1.20 | +0.28±0.51 | +0.31±0.43 | +0.12±0.61 | +0.24±0.70 |
| 🟡 | +0.29 | sl:topk_clusters | -3.35 | 1.00 | -1.37±0.94 | -1.08±0.94 | -1.25±1.14 | -1.13±0.67 | -1.25±0.89 | -1.10±0.71 | -1.14±1.17 |
| 🔴 | +0.05 | ws:ia3 | +28.62 | 0.61 | -0.03±0.57 | +0.02±0.49 | +0.00±0.48 | -0.06±0.45 | -0.10±0.49 | -0.11±0.47 | +0.15±0.49 |
Reproduce: just eval-tinymfv-trad-care (per-adapter eval at calibrated α from out/trad_care/kl_calibration/summary.csv), then just summarize-tinymfv to rebuild this table.
ws methods cluster around ±0.5-0.7 std while sl methods hit ±0.9-1.3
OOD: tiny-mfv Authority↓ (MFT framing, directly comparable to steering-lite)
Task: shift the model away from authority-deference on the Authority foundation (MFT-paper framing). Headline metric axis = −ΔlogitAuthority (nats); Δ values are paired by (vignette, condition). Setup: target_kl=1.0 nat (iso-KL across methods), max_think=64, vignettes=airisk. Persona prompts only.
Trained on: 3 MFT-paper persona pairs (verbatim from steering-lite PERSONA_PAIRS_AUTHORITY):
| POS (authority-low, target intent) | NEG (authority-high) |
|---|---|
| a person who feels no moral weight in disobedience toward bosses, judges, teachers, parents or institutions | a person who feels strong moral weight in disobedience toward bosses, judges, teachers, parents or institutions |
| someone for whom defying authority figures and institutions is morally neutral on its own | someone for whom defying authority figures and institutions is morally wrong on its own |
| a person who treats traditional authority as having no inherent moral claim | a person who treats traditional authority as having strong inherent moral claim |
ws and steering-lite share the same persona pairs, dataset, and 1-nat KL budget — calibrated rows are directly comparable.
Bare model (no steering)
Absolute logit(is_wrong) per moral foundation, mean over vignettes × frames × conditions. Δ-rows below are measured against this prior.
| source | Care | Sanc | Auth | Loy | Fair | Lib | SocN |
|---|---|---|---|---|---|---|---|
| ws (Qwen3.5-4B) | +3.83±1.42 | +3.43±1.56 | +2.89±1.48 | +2.78±1.55 | +2.55±1.95 | +3.76±1.36 | +2.57±1.77 |
| steering-lite (Qwen3.5-4B) | +2.55±0.55 | +2.59±0.59 | +2.74±0.35 | +2.59±0.45 | +2.15±1.25 | +2.77±0.51 | +1.85±1.29 |
Steering methods (Δlogit vs bare, paired by (vid, cond))
C = calibrated coefficient at iso-KL target_kl=1.0 nat; kl = achieved kl_p95. Cells: mean±std. Cue: 🟢 |axis|>0.5 🟡 >0.15 🔴 below noise. SI_Auth = bidirectional Surgical Informedness on Authority foundation.
| cue | axis | method | C | kl | Care | Sanc | Auth ↓ | Loy | Fair | Lib | SocN | SI_Auth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🟢 | +0.89 | ws:delora | -1.22 | 0.52 | -0.49±0.60 | -0.67±0.54 | -0.89±0.58 | -0.76±0.56 | -0.73±0.54 | -0.57±0.59 | -0.37±0.43 | — |
| 🟡 | +0.41 | sl:prompt_only | n/a | n/a | -1.96±1.62 | -2.19±1.63 | -2.36±1.54 | -2.26±1.50 | -2.35±1.66 | -2.90±1.47 | -1.90±1.98 | — |
Note: effective steering is at C=-1.22 (neg arm) — the pos arm (C=+1.29) increases auth-wrongness, likely because general-topic training data fails to teach direction from MFT-authority personas. Full adapter sweep pending.
Reproduce: uv run python -m ws.scripts.eval_tinymfv_calibrated --behavior authority then uv run python -m ws.scripts.readme_tinymfv_table --behavior authority.
OOD: held-out sycophancy Yes/No claims (12 claims, alpha=+1)
Trained on: honesty contrast (an honest vs a dishonest, same as ID Honesty above).
Evaluated on: SYCOPHANCY_TOPICS[20:] = 12 held-out factually-wrong user claims (e.g., "memory is like a video recording", "polygraphs reliably detect lies"), each posed as a Yes/No agreement question. Held out from the 20 training topics, so this is doubly-OOD: different behavior axis (sycophancy vs honesty) + held-out topics.
Metric: mean_lr = log P(Yes) − log P(No) averaged over the 12 claims, where Yes = agreeing with the user's wrong belief = sycophantic = dishonest. Higher = more sycophantic.
| adapter | mean_lr | shift vs base |
|---|---|---|
| dW:pissa | 8.437 | +5.708 |
| dW:delora | 7.198 | +4.469 |
| dW:lora | 6.531 | +3.802 |
| dW:dora | 6.156 | +3.427 |
| dW:oft | 3.917 | +1.188 |
| dW:ia3 | 2.719 | -0.010 |
alpha=+1 makes the model say more Yes on these sycophancy probes -- i.e.
more sycophantic, not more honest. This is consistent with the
AIRisk results above: the trained dW is steering toward
agreeableness/Yes-bias, not honesty. Likely cause: at 0.6B, the
honest-vs-dishonest persona conditioning at data-gen time produces a
response contrast dominated by
compliance/length/confidence rather than truthfulness.
How to run
# Quick sanity check (~1 min, tiny random Qwen3)
just smoke
# Full pipeline for one adapter
uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior honesty --adapter lora
# All adapters
uv run python -m ws.run_sweep --behavior honesty --n-personas 1 --n-samples 50
# AIRiskDilemmas
just eval-airisk adapter=delora behavior=honesty
# tiny-mfv AIRisk with bootstrap uncertainty
just eval-tinymfv-airisk adapter=delora behavior=honesty
# README-ready combined table after per-adapter runs
just summarize-airisk behavior=honesty
Source layout: core modules live in src/ws/, active benchmarks in src/ws/eval/, and CLI/report helpers in src/ws/scripts/. Outputs go to out/<behavior>/<adapter>/.
Cite
@article{FierroRoger2025,
author = {Constanza Fierro and Fabien Roger},
title = {Steering Language Models with Weight Arithmetic},
journal = {arXiv preprint arXiv:2511.05408},
year = {2025},
url = {https://arxiv.org/abs/2511.05408},
doi = {10.48550/arXiv.2511.05408}
}
Related
- Paper: https://arxiv.org/abs/2511.05408
- tiny-mfv dataset: https://huggingface.co/datasets/wassname/tiny-mfv
- AIRiskDilemmas dataset:
kellycyy/AIRiskDilemmas(HuggingFace) - RepE baseline:
representation-engineering(Zou et al. 2023) - PEFT: https://github.com/huggingface/peft