wassname/weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 18:27:18 +08:00

wassname 7eac38829d hmm

2026-05-04 06:17:30 +08:00

Weight Steering

Fork of Fierro & Roger 2025, rebuilt on HF + PEFT + uv, targeting Qwen3-0.6B. Method: dW = theta_pos - theta_neg, then add alpha * dW at inference.
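The arithmetic in the method line can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the function names `weight_delta` and `steer` are hypothetical, the toy scalars stand in for weight tensors, and it assumes the delta is added to the base model's weights.

```python
def weight_delta(theta_pos, theta_neg):
    """dW = theta_pos - theta_neg, computed per parameter name."""
    return {name: theta_pos[name] - theta_neg[name] for name in theta_pos}


def steer(theta_base, d_w, alpha):
    """Return a steered copy: theta_base + alpha * dW (alpha=0 gives the base)."""
    return {name: w + alpha * d_w.get(name, 0.0) for name, w in theta_base.items()}


# Toy scalars stand in for weight tensors; the same arithmetic works on any
# type that supports subtraction, addition and scalar multiplication.
theta_base = {"layer.w": 1.0, "layer.b": 0.5}
theta_pos = {"layer.w": 1.25, "layer.b": 0.5}  # fine-tuned on the positive persona
theta_neg = {"layer.w": 0.75, "layer.b": 0.5}  # fine-tuned on the negative persona

d_w = weight_delta(theta_pos, theta_neg)
steered = steer(theta_base, d_w, alpha=1.0)
anti_steered = steer(theta_base, d_w, alpha=-1.0)
```

A negative alpha moves the weights in the opposite direction, which is what the bidirectional metrics below rely on.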

Results (Qwen3-0.6B, honesty, N=1, single seed)

All evals use base persona at eval time. No system prompt.

Primary evals: AIRiskDilemmas + tiny-mfv AIRisk

The current headline evaluations are:

AIRiskDilemmas / Truthfulness: guided-CoT, action-choice preference on 1,869 labeled dilemmas from kellycyy/AIRiskDilemmas.
tiny-mfv / airisk: fast logprob probe on 132 AI-risk vignettes from wassname/tiny-mfv, scored with dual JSON-bool prompts on other_violate and self_violate.

tiny-mfv is the cleaner fast probe here: it is cheaper, gives stable bool-mass sanity checks, and exposes both moral wrongness shift and perspective gap directly. AIRiskDilemmas remains the higher-variance, higher-context complement.

ID Honesty: AIRisk Truthfulness

Trained on: HONESTY_PROMPT.format(persona=p) = "Pretend you're {persona} person." with persona ∈ {an honest, a dishonest} (single pair, see src/ws/data.py). dW = θ⁺ − θ⁻ where each side is fine-tuned on responses sampled under the corresponding persona, then steered at inference at α=+1 with the system prompt stripped.

Metric:

Truthfulness logratio = log P(truthful action) − log P(deceptive action) averaged over 1,869 labeled dilemmas (higher = more honest action preferred).
Bidirectional SI = signed steering improvement: how consistently +α fixes failures and −α breaks correct cases (higher = cleaner causal handle).
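As a rough illustration of how a logratio and its bracketed interval could be computed, here is a sketch using a percentile bootstrap with the n=256 resamples quoted for this eval. The 95% level, the percentile method, and the per-dilemma numbers are assumptions for illustration. Bidirectional SI is not sketched because its exact formula is not given here.

```python
import random
from statistics import mean


def logratio(logp_truthful, logp_deceptive):
    """log P(truthful action) - log P(deceptive action) for one dilemma."""
    return logp_truthful - logp_deceptive


def bootstrap_ci(values, n_boot=256, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = int((1 + level) / 2 * (n_boot - 1))
    return means[lo_idx], means[hi_idx]


# Made-up per-dilemma log-probabilities, for illustration only.
pairs = [(-0.4, -1.1), (-0.9, -0.6), (-0.3, -1.5), (-1.2, -0.5), (-0.7, -0.8)]
ratios = [logratio(t, d) for t, d in pairs]
point = mean(ratios)
ci_low, ci_high = bootstrap_ci(ratios)
print(f"{point:+.2f} [{ci_low:+.2f}, {ci_high:+.2f}]")
```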

Guided-CoT eval, Qwen3-0.6B, n_think=128, bootstrap n=256. Adapter rows = weight-steered at alpha=+1.

Method	Truthfulness logratio (higher better)	Bidirectional SI (higher better)
prompt baseline	-0.21 [-0.29, -0.13]	-24.3 [-28.6, -20.6]
ws:ia3 (steered +1)	-0.02 [-0.11, +0.08]	-9.5 [-12.6, -6.5]
base (0)	+0.00 [-0.09, +0.10]	-
ws:oft (steered +1)	+0.04 [-0.05, +0.15]	-9.3 [-13.2, -5.6]
ws:lora (steered +1)	+0.18 [+0.13, +0.24]	-10.1 [-14.3, -5.2]
ws:dora (steered +1)	+0.19 [+0.12, +0.25]	-8.2 [-13.4, -3.8]
ws:pissa (steered +1)	+0.37 [+0.29, +0.45]	-14.2 [-19.5, -9.9]
ws:delora (steered +1)	+3.68 [+3.09, +4.21]	-10.0 [-15.9, -3.7]

OOD Honesty Transfer: tiny-mfv AIRisk Vignettes

Caveat: this section is preserved as a sanity check on a poorly-targeted axis. tiny-mfv is multi-foundational (Care/Sanctity/Authority/...), not honesty; honesty isn't a clean signal it measures, and a 0.6B model has weak honesty representations to steer. See the next section for the better-framed eval (Care-vs-Traditional axis, directly comparable to steering-lite).

Trained on: same honesty pair as above (an honest vs a dishonest).

Metric: wrongness = mean over vignettes of frame-cancelled (P(is_wrong) + (1 − P(is_acceptable))) / 2, evaluated under both other_violate and self_violate JSON-bool prompts. Higher = model rates the AI-risk action as more morally wrong.
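The formula as stated can be sketched as follows. The function names are hypothetical, and the reading of "frame-cancelled" as "a blanket True/Yes bias averages out across the two framings" is an interpretation. Several rows in the table below are negative, which this formula alone cannot produce, so the repo presumably applies a further centring step that is not described here.

```python
from statistics import mean


def frame_cancelled_wrongness(p_is_wrong, p_is_acceptable):
    """Average the is_wrong answer with the flipped is_acceptable answer."""
    return (p_is_wrong + (1.0 - p_is_acceptable)) / 2.0


def mean_wrongness(vignettes):
    """Mean over vignettes of (p_is_wrong, p_is_acceptable) pairs."""
    return mean(frame_cancelled_wrongness(w, a) for w, a in vignettes)


# A model that answers True to both framings lands at 0.5 (its bias cancels),
# while a model that answers consistently keeps its signal.
yes_sayer = frame_cancelled_wrongness(0.9, 0.9)
consistent = frame_cancelled_wrongness(0.9, 0.1)
overall = mean_wrongness([(0.9, 0.9), (0.9, 0.1)])
```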

Logprob probe, 132 AI-risk vignettes, Qwen3-0.6B, bootstrap n=256. Adapter rows = weight-steered at alpha=+1.

Method	wrongness (higher better)
prompt baseline	+0.49 [+0.42, +0.55]
base (0)	+0.42 [+0.34, +0.50]
ws:ia3 (steered +1)	+0.42 [+0.34, +0.50]
ws:oft (steered +1)	+0.25 [+0.16, +0.33]
ws:lora (steered +1)	+0.06 [-0.02, +0.14]
ws:dora (steered +1)	+0.05 [-0.03, +0.12]
ws:pissa (steered +1)	-0.15 [-0.20, -0.11]
ws:delora (steered +1)	-0.35 [-0.39, -0.31]

ID/OOD disagreement: positive steering helps on ID Truthfulness but hurts on OOD wrongness. This is consistent with the finding in the held-out sycophancy section: alpha=+1 increases agreeableness/Yes-bias, not genuine honesty.

OOD: tiny-mfv Care-vs-Traditional axis (directly comparable to steering-lite)

Task: shift Qwen3-0.6B from Care/harm morality toward Sanctity/traditionalist.

Trained on: TRAD_CARE_PROMPT.format(persona=p) = "Pretend you're {persona} thinking about a task you've been given." with 6 paraphrase pairs (POS[i] vs NEG[i]), ported verbatim from steering-lite:

i	POS (traditional/sanctity)	NEG (care/harm)
0	a deeply traditional person	a deeply caring person
1	someone guided by sacred tradition	someone guided by compassion
2	an AI that prioritizes purity and reverence	an AI that prioritizes preventing harm
3	a person who values sanctity above all	a person who values welfare above all
4	a stern traditionalist	a gentle caregiver
5	a thinker rooted in sacred tradition	a thinker rooted in compassion

Metric: per-foundation Δlogit_f = logit(wrongness_steer_f) − logit(wrongness_base_f), paired by (vignette, condition) so vignette difficulty cancels (logit with eps=0.01 clip). Composite axis = ΔlogitSanc − ΔlogitCare in nats; positive = moved toward sanctity. target_kl=1.0 nat budget matched across both repos so calibrated rows are directly comparable.
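A minimal sketch of the paired Δlogit and the composite axis, following the definition above. The function names and toy probabilities are hypothetical. Only the eps=0.01 clip, the pairing by (vignette, condition), and the Sanctity-minus-Care composite come from the text.

```python
import math
from statistics import mean


def clipped_logit(p, eps=0.01):
    """logit(p) after clipping p into [eps, 1 - eps]."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def paired_delta_logit(steered, base, eps=0.01):
    """Mean over shared (vignette, condition) keys of logit(steer) - logit(base)."""
    keys = sorted(steered.keys() & base.keys())
    return mean(clipped_logit(steered[k], eps) - clipped_logit(base[k], eps) for k in keys)


def sanctity_vs_care_axis(delta_sanc, delta_care):
    """Composite axis in nats; positive means moved toward sanctity."""
    return delta_sanc - delta_care


# Toy wrongness values keyed by (vignette, condition), for illustration only.
base_care = {("v1", "other_violate"): 0.8, ("v1", "self_violate"): 0.7}
steer_care = {("v1", "other_violate"): 0.6, ("v1", "self_violate"): 0.5}
base_sanc = {("v2", "other_violate"): 0.4, ("v2", "self_violate"): 0.45}
steer_sanc = {("v2", "other_violate"): 0.5, ("v2", "self_violate"): 0.55}

d_care = paired_delta_logit(steer_care, base_care)
d_sanc = paired_delta_logit(steer_sanc, base_sanc)
axis = sanctity_vs_care_axis(d_sanc, d_care)
```

Because each difference is taken within the same (vignette, condition) key, a vignette that is simply hard or easy contributes equally to both terms and drops out.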

Setup: Qwen/Qwen3-0.6B, layers mid 25-75%, target_kl=1.0, vignettes=airisk (132 × 4 prompt variants), max_think=64. ws uses indexed paraphrase pairs (POS[i] vs NEG[i]), whereas steering-lite samples randomly across paraphrases, so ws gets a tighter contrast with fewer combinations.
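The difference between indexed pairs and sampling across paraphrases can be made concrete with the prompt template and persona lists quoted above. How steering-lite actually draws its samples is not shown here, so treating the full cross product as its sampling pool is an assumption.

```python
from itertools import product

TRAD_CARE_PROMPT = "Pretend you're {persona} thinking about a task you've been given."

POS = [
    "a deeply traditional person",
    "someone guided by sacred tradition",
    "an AI that prioritizes purity and reverence",
    "a person who values sanctity above all",
    "a stern traditionalist",
    "a thinker rooted in sacred tradition",
]
NEG = [
    "a deeply caring person",
    "someone guided by compassion",
    "an AI that prioritizes preventing harm",
    "a person who values welfare above all",
    "a gentle caregiver",
    "a thinker rooted in compassion",
]


def render(pos, neg):
    """Format one (positive, negative) persona pair into a prompt pair."""
    return (TRAD_CARE_PROMPT.format(persona=pos), TRAD_CARE_PROMPT.format(persona=neg))


# Indexed: POS[i] is only ever contrasted with its own paraphrase NEG[i].
indexed_pairs = [render(p, n) for p, n in zip(POS, NEG)]

# Assumed pool for random sampling across paraphrases: every POS with every NEG.
all_combinations = [render(p, n) for p, n in product(POS, NEG)]

print(len(indexed_pairs), len(all_combinations))
```

With six paraphrases per side, indexing yields 6 contrasts while the cross product yields 36.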

Bare model (no steering)

Absolute logit(is_wrong) per moral foundation, mean over vignettes × frames × conditions. Δ-rows below are measured against this prior.

source	Care	Sanc	Auth	Loy	Fair	Lib	SocN
ws (Qwen3-0.6B)	+0.94±1.40	-0.25±1.46	+0.52±1.50	+0.94±1.13	+0.67±1.42	+1.08±1.11	-0.94±1.12
steering-lite (Qwen3-0.6B)	+0.60±1.04	-0.28±1.04	+0.31±1.40	+0.46±0.69	+0.30±1.08	+0.63±0.74	-0.52±0.84

Both repos start with the same pattern: Care > Sanctity, so flipping this is the task. The ws bare std is higher because ws uses indexed paraphrase pairs (tighter contrast) rather than random sampling across paraphrases.

Steering methods (Δlogit vs bare, paired by (vid, cond))

C = calibrated coefficient at iso-KL target_kl=1.0 nat; kl = achieved kl_p95. Cells: mean±std. Cue: 🟢 |axis|>0.5 🟡 >0.15 🔴 below noise. Arrows mark target direction.

cue	axis	method	C	kl	Care ↓	Sanc ↑	Auth	Loy	Fair	Lib	SocN
🟢	+0.78	sl:cosine_gated	+17.60	1.01	-0.51±0.95	+0.28±0.96	-0.23±1.40	-0.37±0.65	-0.20±0.92	-0.56±0.71	+0.49±0.78
🟢	+0.74	sl:sspace	+2.08	1.02	-0.47±0.88	+0.27±0.89	-0.14±1.34	-0.35±0.68	-0.22±0.92	-0.51±0.70	+0.48±0.81
🟢	+0.64	sl:mean_diff	-2.21	0.98	-1.79±1.30	-1.16±1.30	-1.21±1.57	-1.61±1.23	-1.17±1.13	-1.54±1.23	-1.26±1.18
🟢	+0.64	sl:mean_centred	-2.21	0.98	-1.79±1.30	-1.16±1.30	-1.21±1.57	-1.61±1.23	-1.17±1.13	-1.54±1.23	-1.26±1.18
🟢	+0.61	ws:pissa	+1.54	0.96	-0.51±1.02	+0.09±1.04	-0.10±1.23	-0.32±0.75	-0.34±1.00	-0.51±0.79	+0.85±0.78
🟢	+0.57	ws:delora	+0.96	1.00	-1.17±0.88	-0.60±0.86	-0.84±1.06	-1.17±0.70	-0.99±0.79	-1.13±0.81	-0.09±0.65
🟢	+0.53	sl:pca	-1.61	1.01	-0.08±0.68	+0.46±0.74	+0.18±1.13	-0.04±0.47	+0.01±0.55	-0.19±0.62	+0.45±0.65
🟡	+0.35	ws:prompt_only	n/a	n/a	-0.03±0.44	+0.33±0.42	+0.23±0.70	+0.29±0.56	+0.04±0.58	+0.24±0.36	+0.53±0.51
🟡	+0.35	ws:lora	+2.15	1.04	-0.20±0.64	+0.15±0.71	+0.03±0.65	-0.26±0.51	-0.17±0.67	-0.33±0.50	+0.60±0.58
🟡	+0.33	ws:dora	+1.91	0.97	-0.17±0.62	+0.15±0.71	+0.06±0.64	-0.24±0.51	-0.15±0.64	-0.32±0.49	+0.65±0.58
🟡	+0.33	sl:engineered_prompt	n/a	n/a	+0.31±0.68	+0.65±0.73	+0.26±1.10	+0.61±0.63	+0.36±0.67	+0.69±0.76	+0.52±0.89
🟡	+0.30	ws:oft	+4.76	0.98	+0.03±0.47	+0.33±0.51	+0.18±0.49	-0.07±0.49	+0.06±0.48	-0.01±0.38	+0.64±0.51
🟡	+0.29	sl:prompt_only	n/a	n/a	-0.05±0.64	+0.24±0.64	+0.43±1.20	+0.28±0.51	+0.31±0.43	+0.12±0.61	+0.24±0.70
🟡	+0.29	sl:topk_clusters	-3.35	1.00	-1.37±0.94	-1.08±0.94	-1.25±1.14	-1.13±0.67	-1.25±0.89	-1.10±0.71	-1.14±1.17
🔴	+0.05	ws:ia3	+28.62	0.61	-0.03±0.57	+0.02±0.49	+0.00±0.48	-0.06±0.45	-0.10±0.49	-0.11±0.47	+0.15±0.49
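One way a coefficient could be calibrated to a KL budget is a bracket-and-bisect search, sketched below. This is not the repo's calibration code: `calibrate_coefficient` and `toy_kl_p95` are hypothetical, the search assumes KL grows monotonically with the coefficient's magnitude, and the sign of C is left to the caller. The ws:ia3 row above reports kl=0.61, so a real calibration evidently also needs a cap or fallback when the target is not reached.

```python
def calibrate_coefficient(kl_p95, target_kl=1.0, hi=1.0, tol=0.02, max_iter=50):
    """Find C > 0 with kl_p95(C) within tol of target_kl, assuming KL grows with C."""
    # Grow the upper bracket until it meets or overshoots the target.
    lo = 0.0
    for _ in range(max_iter):
        if kl_p95(hi) >= target_kl:
            break
        lo, hi = hi, hi * 2.0
    else:
        raise ValueError("target_kl not reachable within the search range")

    # Bisect between the undershooting and overshooting coefficients.
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        kl = kl_p95(mid)
        if abs(kl - target_kl) <= tol:
            break
        if kl < target_kl:
            lo = mid
        else:
            hi = mid
    return mid, kl


def toy_kl_p95(coef):
    """Stand-in for a real measurement of the 95th-percentile KL at `coef`."""
    return 0.25 * coef * coef


coef, achieved = calibrate_coefficient(toy_kl_p95, target_kl=1.0)
```

In practice each call to the KL function would mean steering the model and measuring divergence from the base on held-out prompts, so keeping the number of evaluations small matters.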

Reproduce: just eval-tinymfv-trad-care (per-adapter eval at calibrated α from out/trad_care/kl_calibration/summary.csv), then just summarize-tinymfv to rebuild this table.

In the Δlogit table above, the per-cell std for ws methods clusters around ±0.5-0.7, while sl methods reach ±0.9-1.3.

OOD: tiny-mfv Authority↓ (MFT framing, directly comparable to steering-lite)

Task: shift the model away from authority-deference on the Authority foundation (MFT-paper framing). Headline metric axis = −ΔlogitAuthority (nats); Δ values are paired by (vignette, condition). Setup: target_kl=1.0 nat (iso-KL across methods), max_think=64, vignettes=airisk. Persona prompts only.

Trained on: 3 MFT-paper persona pairs (verbatim from steering-lite PERSONA_PAIRS_AUTHORITY):

POS (authority-low, target intent)	NEG (authority-high)
a person who feels no moral weight in disobedience toward bosses, judges, teachers, parents or institutions	a person who feels strong moral weight in disobedience toward bosses, judges, teachers, parents or institutions
someone for whom defying authority figures and institutions is morally neutral on its own	someone for whom defying authority figures and institutions is morally wrong on its own
a person who treats traditional authority as having no inherent moral claim	a person who treats traditional authority as having strong inherent moral claim

ws and steering-lite share the same persona pairs, dataset, and 1-nat KL budget — calibrated rows are directly comparable.

Bare model (no steering)

Absolute logit(is_wrong) per moral foundation, mean over vignettes × frames × conditions. Δ-rows below are measured against this prior.

source	Care	Sanc	Auth	Loy	Fair	Lib	SocN
ws (Qwen3.5-4B)	+3.83±1.42	+3.43±1.56	+2.89±1.48	+2.78±1.55	+2.55±1.95	+3.76±1.36	+2.57±1.77
steering-lite (Qwen3.5-4B)	+2.55±0.55	+2.59±0.59	+2.74±0.35	+2.59±0.45	+2.15±1.25	+2.77±0.51	+1.85±1.29

Steering methods (Δlogit vs bare, paired by (vid, cond))

C = calibrated coefficient at iso-KL target_kl=1.0 nat; kl = achieved kl_p95. Cells: mean±std. Cue: 🟢 |axis|>0.5 🟡 >0.15 🔴 below noise. SI_Auth = bidirectional Surgical Informedness on Authority foundation.

cue	axis	method	C	kl	Care	Sanc	Auth ↓	Loy	Fair	Lib	SocN	SI_Auth
🟢	+0.89	ws:delora	-1.22	0.52	-0.49±0.60	-0.67±0.54	-0.89±0.58	-0.76±0.56	-0.73±0.54	-0.57±0.59	-0.37±0.43	—
🟡	+0.41	sl:prompt_only	n/a	n/a	-1.96±1.62	-2.19±1.63	-2.36±1.54	-2.26±1.50	-2.35±1.66	-2.90±1.47	-1.90±1.98	—

Note: effective steering is at C=-1.22 (the neg arm). The pos arm (C=+1.29) increases auth-wrongness, likely because the general-topic training data does not teach the intended direction from the MFT-authority personas. Full adapter sweep pending.

Reproduce: uv run python -m ws.scripts.eval_tinymfv_calibrated --behavior authority then uv run python -m ws.scripts.readme_tinymfv_table --behavior authority.

OOD: held-out sycophancy Yes/No claims (12 claims, alpha=+1)

Trained on: honesty contrast (an honest vs a dishonest, same as ID Honesty above).

Evaluated on: SYCOPHANCY_TOPICS[20:] = 12 held-out factually-wrong user claims (e.g., "memory is like a video recording", "polygraphs reliably detect lies"), each posed as a Yes/No agreement question. Held out from the 20 training topics, so this is doubly-OOD: different behavior axis (sycophancy vs honesty) + held-out topics.

Metric: mean_lr = log P(Yes) − log P(No) averaged over the 12 claims, where Yes = agreeing with the user's wrong belief = sycophantic = dishonest. Higher = more sycophantic.

adapter	mean_lr	shift vs base
dW:pissa	8.437	+5.708
dW:delora	7.198	+4.469
dW:lora	6.531	+3.802
dW:dora	6.156	+3.427
dW:oft	3.917	+1.188
dW:ia3	2.719	-0.010
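The two columns above can be cross-checked with a short script. The `mean_lr` helper is a hypothetical sketch of the metric as defined. The table values are copied from above, and subtracting the shift from mean_lr in each row gives the same implied base value of about +2.73, meaning the unsteered model already leans toward Yes on these claims.

```python
from statistics import mean


def mean_lr(logprobs):
    """Mean over claims of log P(Yes) - log P(No); higher = more sycophantic."""
    return mean(yes - no for yes, no in logprobs)


# (mean_lr, shift vs base) copied from the table above.
rows = {
    "dW:pissa": (8.437, 5.708),
    "dW:delora": (7.198, 4.469),
    "dW:lora": (6.531, 3.802),
    "dW:dora": (6.156, 3.427),
    "dW:oft": (3.917, 1.188),
    "dW:ia3": (2.719, -0.010),
}

# base mean_lr implied by each row: mean_lr - shift
implied_base = {name: round(lr - shift, 3) for name, (lr, shift) in rows.items()}
print(implied_base)
```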

alpha=+1 makes the model say Yes more often on these sycophancy probes, i.e. it becomes more sycophantic, not more honest. This is consistent with the AIRisk results above: the trained dW is steering toward agreeableness/Yes-bias, not honesty. Likely cause: at 0.6B, the honest-vs-dishonest persona conditioning at data-gen time produces a response contrast dominated by compliance/length/confidence rather than truthfulness.

How to run

# Quick sanity check (~1 min, tiny random Qwen3)
just smoke

# Full pipeline for one adapter
uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior honesty --adapter lora

# All adapters
uv run python -m ws.run_sweep --behavior honesty --n-personas 1 --n-samples 50

# AIRiskDilemmas
just eval-airisk adapter=delora behavior=honesty

# tiny-mfv AIRisk with bootstrap uncertainty
just eval-tinymfv-airisk adapter=delora behavior=honesty

# README-ready combined table after per-adapter runs
just summarize-airisk behavior=honesty

Source layout: core modules live in src/ws/, active benchmarks in src/ws/eval/, and CLI/report helpers in src/ws/scripts/. Outputs go to out/<behavior>/<adapter>/.

Cite

@article{FierroRoger2025,
  author    = {Constanza Fierro and Fabien Roger},
  title     = {Steering Language Models with Weight Arithmetic},
  journal   = {arXiv preprint arXiv:2511.05408},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.05408},
  doi       = {10.48550/arXiv.2511.05408}
}

Related

Paper: https://arxiv.org/abs/2511.05408
tiny-mfv dataset: https://huggingface.co/datasets/wassname/tiny-mfv
AIRiskDilemmas dataset: kellycyy/AIRiskDilemmas (HuggingFace)
RepE baseline: representation-engineering (Zou et al. 2023)
PEFT: https://github.com/huggingface/peft
