mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 19:50:02 +08:00


wassname a48430b075 switch training/eval axis from sycophancy to honesty

- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-28 06:00:03 +08:00


Fork plan: weight steering benchmark + analysis

Updated: 2026-04-27

Goal

Test whether weight steering is a useful method, and if it is, understand what part of the learned weight delta carries the behavior.

Two questions are intentionally separated:

Benchmark question: Does weight steering beat simple alternatives such as prompting and activation steering on sycophancy and daily-dilemmas honesty transfer?
Analysis question: If weight steering works, can the learned delta dW = \theta^+ - \theta^- be factorized into a simpler causal intervention: a cross-adapter shared subspace, module, low-rank component, or adapter parameterization?

Context

This is a fork of Anthropic's weight-steering method. Original recipe: train one positive adapter and one negative adapter, merge each adapter into base-weight deltas, then steer with:

dW = \Delta W_{pos} - \Delta W_{neg}.
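A minimal sketch of the steering rule above, assuming the merged adapter deltas are already available as per-tensor arrays. The tensor names, shapes, and the `steering_delta` / `apply_steering` helpers are hypothetical and are not taken from this repo; `alpha` plays the role of the steering coefficient.

```python
import numpy as np

def steering_delta(delta_pos, delta_neg):
    """Per-tensor steering direction: dW = dW_pos - dW_neg."""
    return {name: delta_pos[name] - delta_neg[name] for name in delta_pos}

def apply_steering(base, dW, alpha):
    """Return W_base + alpha * dW as new arrays; `base` is not modified."""
    steered = {}
    for name, w in base.items():
        if name in dW:
            steered[name] = w + alpha * dW[name]
        else:
            steered[name] = w.copy()  # tensors without a delta pass through
    return steered

# Toy tensors (hypothetical names and shapes, random values).
rng = np.random.default_rng(0)
base = {
    "layers.0.o_proj": rng.standard_normal((4, 4)),
    "layers.0.q_proj": rng.standard_normal((4, 4)),
}
delta_pos = {"layers.0.o_proj": 0.1 * rng.standard_normal((4, 4))}
delta_neg = {"layers.0.o_proj": 0.1 * rng.standard_normal((4, 4))}

dW = steering_delta(delta_pos, delta_neg)
steered = apply_steering(base, dW, alpha=1.0)
unsteered = apply_steering(base, dW, alpha=0.0)
```

With `alpha=0.0` the weights equal the base model, which is the coeff=0 reference that the tables in this plan compare against.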

This repo removes Axolotl/vLLM/API orchestration and rebuilds the method in HF + PEFT + uv for cheap iteration on small models.

Current main model: Qwen/Qwen3-0.6B.

Current behavior: sycophancy training, evaluated on sycophancy Yes/No and wassname/daily_dilemmas-self-honesty.

Current facts

Daily-dilemmas default is not the full split. The default n_dilemmas=100 takes the first 100 dilemmas = 200 rows, balanced between 100 honest-label and 100 dishonest-label actions.
Full honesty_eval test split is 219 dilemmas = 438 rows.
The daily-dilemmas eval uses all rows for selected dilemmas, then sign-flips by honesty_label; it is not only honest rows.
Current headline tables are single-seed Qwen3-0.6B exploratory results.
DeLoRA is best raw steering so far. PiSSA is the cleaner stable baseline if penalizing DeLoRA saturation at high alpha.
v9/v10 do not prove “no subspace.” They show the trained behavior is not explained by the tested low-rank residual-stream bases or adapter-family parameterization at trained scale.
The active analysis should ablate the already-trained dW. Synthetic dW' construction is a different baseline, not causal ablation.
The highest-value analysis tests are: cross-adapter causal dW basis ablation, layer/module ablation of trained dW, and adapter-parameterization ablation of trained dW.
Lens search is on hold pending multiseed (2026-04-27). Every weight-space lens we tested has a built-in failure mode: SVD-on-dW is tautological for low-rank adapters; layer-index tells depth not mechanism; module-family collapses heads/positions and gives different answers per adapter; native parameterization decompositions aren't comparable across adapter families. But the lens-3 cross-adapter inconsistency (delora residual_write retained=+1.27 vs lora=+0.14) is N=1 seed × N=1 model. It might just be seed noise within each adapter. Right ordering: T4 multiseed first, then re-run T7/T8 per-seed with within-adapter stdev, then judge whether the inconsistency is real or noise.
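The sign-flipped Yes/No scoring mentioned in the facts above can be sketched as follows. This is an illustration under stated assumptions, not the repo's eval code: it assumes each row carries next-token log-probabilities for "Yes" and "No", that `honest` encodes honesty_label as a boolean, and that pmass means the probability mass on those two tokens. The row values are toy numbers, not results.

```python
import math

def signed_logratio(logp_yes, logp_no, honest):
    """Yes/No log-ratio, sign-flipped so that higher always means more honest.

    Assumption: for a dishonest-label action, preferring "No" counts as honest.
    """
    sign = 1.0 if honest else -1.0
    return sign * (logp_yes - logp_no)

def pmass(logp_yes, logp_no):
    """Assumed definition: probability mass on the two answer tokens."""
    return math.exp(logp_yes) + math.exp(logp_no)

# Toy rows: one honest-label and one dishonest-label action of a dilemma.
rows = [
    {"idx": 0, "honest": True, "logp_yes": math.log(0.6), "logp_no": math.log(0.3)},
    {"idx": 1, "honest": False, "logp_yes": math.log(0.2), "logp_no": math.log(0.7)},
]

scores = [signed_logratio(r["logp_yes"], r["logp_no"], r["honest"]) for r in rows]
masses = [pmass(r["logp_yes"], r["logp_no"]) for r in rows]
mean_score = sum(scores) / len(scores)
```

Because both label types contribute after the sign flip, a model that simply answers "Yes" more often does not gain on the balanced set.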

Done

Clean repo into uv + HF + PEFT small-model workflow.
Make Qwen3-0.6B sycophancy steering work end-to-end.
Hook in LoRA, DoRA, PiSSA, DeLoRA, OFT, and IA3 adapter families.
Build sycophancy logratio eval with coefficient sweep.
Build daily-dilemmas honesty eval with sign-flipped Yes/No logratio.
Run single-seed Qwen adapter benchmark on sycophancy and 100-dilemma DD default.
Fix DD cross-adapter aggregation to use base-only coeff=0 rather than mixing persona baselines.
Run v9 subspace/scope diagnostics: weight oracle, cumulative activation oracle, block-local activation oracle, first-LoRA-layer sanity checks.
Run v10 projection/complement falsifier: raw activation projection, complement, and norm-matched projection.
Update README and research journal with corrected DD table and conservative interpretation.

TODO: benchmark question

Goal: activation-steering baseline on the same DD rows.
- Why: RepE/repeng is the most threatening baseline; if it matches or beats dW, the method story weakens before adapter seeds matter.
- Do: train representation direction on the same sycophancy contrast; grid layer x coefficient; evaluate sycophancy and full DD.
- UAT: best activation-steering row is selected by held-out sycophancy or validation DD, then reported beside best dW on identical DD test rows.
- Verify: table includes method=repeng, layer, coeff, syc_delta, dd_delta, pmass, and the same idx set as the dW rows.
- Negative outcome -> claim: if repeng matches/beats dW, write "activation steering is the simpler baseline; weight steering needs a stronger reason to exist."
Goal: full daily-dilemmas benchmark for current Qwen adapters.
- Why: current DD table uses first 100 dilemmas, not the full 219-dilemma split.
- Do: re-run LoRA / PiSSA / DeLoRA / DoRA / OFT / IA3 with --n-dilemmas 219.
- UAT: table has 438 base rows per coeff before persona baselines, and reports pmass, frac_low_pmass, delta(+1 - 0).
- Verify: out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv exists and includes n_base_rows_per_coeff=438.
Goal: prompt baselines on the same DD rows.
- Why: weight steering is only interesting if it beats “just prompt it.”
- Do: evaluate base, simple honest persona, and engineered AxBench-style prompt.
- UAT: one table compares base, simple_honest_prompt, engineered_prompt, and best dW on identical rows.
- Verify: prompt_baseline_delta and weight_steer_delta are computed from the same idx set.
- Negative outcome -> claim: if prompting matches/beats dW, write "prompting is the simpler intervention for this behavior/eval pair."
Goal: multi-seed adapter benchmark on Qwen.
- Why: current adapter ranking is N=1 seed.
- Do: run seeds 0, 1, 2 for LoRA / PiSSA / DeLoRA first; add DoRA/OFT only if cheap.
- UAT: table reports mean +/- std for sycophancy and DD deltas, plus seed-level signs, so a reader can tell stable ranking from noisy N=1 luck.
- Verify: each adapter has exactly three w.safetensors files and three eval summaries; ranking table includes n_seeds=3, mean_dd_delta, std_dd_delta, and sign_agreement.
- Negative outcome -> claim: if adapter ranking changes across seeds or error bars overlap heavily, write "single-seed adapter winner is unstable; do not claim a family ranking yet."
Goal: Gemma 1B replication.
- Why: check whether DeLoRA/PiSSA ranking is Qwen-specific.
- Do: train LoRA / PiSSA / DeLoRA on Gemma 1B, seed 0, full DD split.
- UAT: compare Gemma ranking to Qwen ranking with the same metrics.
- Verify: table has model column with Qwen3-0.6B and Gemma-1B; if DeLoRA remains best, expand seeds; if rankings diverge, write that up as a model-specific adapter-basin finding.
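The multi-seed columns named in the list above (mean_dd_delta, std_dd_delta, sign_agreement, n_seeds) could be computed roughly as below. This is a sketch: `sign_agreement` is assumed here to mean the fraction of seeds whose delta has the same sign as the mean, which the plan does not define, and the adapter names and numbers are toy placeholders rather than measurements.

```python
import statistics

def _sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def summarize_seeds(dd_deltas):
    """Summarize per-seed DD deltas for one adapter."""
    mean = statistics.mean(dd_deltas)
    # Sample standard deviation; undefined for a single seed, so report 0.0.
    std = statistics.stdev(dd_deltas) if len(dd_deltas) > 1 else 0.0
    agree = sum(1 for d in dd_deltas if _sign(d) == _sign(mean))
    return {
        "n_seeds": len(dd_deltas),
        "mean_dd_delta": mean,
        "std_dd_delta": std,
        "sign_agreement": agree / len(dd_deltas),
    }

# Toy per-seed deltas for two hypothetical adapters (not real results).
per_seed = {
    "adapter_a": [0.2, 0.1, 0.3],
    "adapter_b": [0.1, -0.1, 0.3],
}
summary = {name: summarize_seeds(vals) for name, vals in per_seed.items()}
```

In this toy case `adapter_b` has a positive mean but only two of three seeds agree in sign, which is the kind of row the UAT wants a reader to be able to spot.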

TODO: analysis question

Status (2026-04-27): on hold pending multiseed. T6/T7/T8 were run on N=1 seed × Qwen3-0.6B. Necessity is established. The cross-adapter inconsistency that drove the "no parameterization-invariant mechanism" reading might be seed noise. Resume after T4 (multiseed) lands and we can report within-adapter stdev alongside cross-adapter gaps.

Active sequence at the time of pause was:

Cross-adapter causal dW basis ablation.
Layer/module causal ablation of trained dW.
Adapter-parameterization causal ablation of trained dW.

Synthetic dW' construction is deferred below and is not a causal ablation.

Goal: cross-adapter causal dW basis ablation.
- Why: this is the headline analysis experiment. It tests whether different adapter families discovered the same causal planning subspace or different basins.
- Do: build candidate bases B from trained adapter deltas, compute dW_keep_B and dW_drop_B, and evaluate both on sycophancy + full DD for each adapter.
- Candidate B rows:
  - shared_SVD_K8/K32/K64: stack residual-output dW from LoRA / DoRA / PiSSA / DeLoRA / OFT per layer/tensor, take top-K SVs.
  - top8/top32_per_adapter and tail_per_adapter: per-adapter SVD split of each tensor.
  - random_null: rank-matched random bases.
- UAT: one central table has ablation_family, candidate_B, adapter, rank, keep_or_drop, syc_delta, dd_delta, pmass, and row-identity checks.
- Verify: the table contains keep/drop rows for shared_svd, per_adapter_svd, and random_null; keep_B_shared_K32 and drop_B_shared_K32 are both evaluated for at least LoRA / DoRA / PiSSA / DeLoRA / OFT; random null retention is near rank/d; each row uses the same eval rows and coefficient grid.
- Positive outcome -> claim: if keep_B_shared retains >=0.7x behavior across adapters and drop_B_shared removes it, write the adapter-invariant planning-subspace paper.
- Negative outcome -> claim: if keep_B_shared retains <0.3x even at K=64 while complements/tails retain behavior, write the shared-subspace negative result: steering is distributed or lives in the wrong parameter space for these bases.
- Ambiguous outcome -> claim: if both keep and drop retain high behavior, report non-identifiability under this basis family and move to stricter causal interventions, not a positive subspace claim.
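For one tensor, the keep/drop construction above might look like the following sketch. It assumes the basis B lives on the output (residual-write) side of each dW, that "stacking" means concatenating the adapters' deltas along the input dimension, and that all function names are hypothetical. The toy deltas are random arrays with a planted shared component, used only to exercise the code.

```python
import numpy as np

def shared_basis(dWs, k):
    """Top-k left singular vectors of the adapters' dW stacked along the input dim."""
    stacked = np.concatenate(dWs, axis=1)            # (d_out, n_adapters * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :k]                                  # orthonormal columns

def random_basis(d_out, k, rng):
    """Rank-matched random orthonormal basis for the random_null control."""
    Q, _ = np.linalg.qr(rng.standard_normal((d_out, k)))
    return Q

def keep_drop(dW, B):
    """Split dW into its projection onto span(B) and the complement."""
    keep = B @ (B.T @ dW)
    return keep, dW - keep

def energy_frac(part, full):
    """Fraction of squared Frobenius norm carried by `part`."""
    return float(np.linalg.norm(part) ** 2 / np.linalg.norm(full) ** 2)

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 8
U_shared, _ = np.linalg.qr(rng.standard_normal((d_out, 4)))
# Three toy "adapters": a shared 4-dim output subspace plus small noise.
dWs = [U_shared @ rng.standard_normal((4, d_in))
       + 0.05 * rng.standard_normal((d_out, d_in)) for _ in range(3)]

B = shared_basis(dWs, k)
keep, drop = keep_drop(dWs[0], B)
rand_keep, _ = keep_drop(dWs[0], random_basis(d_out, k, rng))
keep_frac = energy_frac(keep, dWs[0])
rand_frac = energy_frac(rand_keep, dWs[0])
```

Note that `energy_frac` measures weight norm, not behavior. The rank/d expectation for the random null holds for norm in expectation, while behavioral retention still has to be measured by evaluating dW_keep_B and dW_drop_B on the eval rows.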

Note: for the following two goals, candidate hypotheses have already been searched and cataloged in docs/hypothesis_ablation_catalog.md.

Goal: layer/module causal ablation of trained dW.
- Why: after a trained update works, we need to know which layers and modules are necessary or sufficient.
- Do: keep/drop parts of the already-trained adapter delta by layer and module family, without synthesizing new tensors from base features.
- Rows: full_dW, residual_write_only, attn_o_proj_only, mlp_down_proj_only, layers_8_21_only, single-layer keep, leave-one-layer-out, coarse early/mid/late LoRA-layer blocks, rank/module-matched random controls, and zero.
- UAT: one table has adapter, variant, layer_or_block, module_family, keep_or_drop, syc_delta, dd_delta, pmass, and row-identity checks.
- Verify: same sycophancy and full DD rows as full_dW; table includes all required variants and reports zero row-key symmetric difference for every variant x coeff group.
- Positive outcome -> claim: if a small layer/module slice retains most behavior and dropping it removes behavior, report the causal locus.
- Negative outcome -> claim: if many disjoint slices retain behavior, report distributed or non-identifiable layer/module localization.
Goal: adapter-parameterization causal ablation of trained dW.
- Why: adapter families may store the behavior in different parameterization degrees of freedom even when their effective dW looks similar.
- Do: split the trained adapter/effective delta according to the adapter family's own coordinates, then keep/drop each component on identical eval rows. For an S-space split, compute the trained effective matrix's SVD-like coordinate system, project dW -> S, crop a component such as the top 25% of S by coordinate index, project back to weight space, and evaluate both top_25pct_S and residual_not_top_25pct_S against full_dW and zero.
- Rows: LoRA/PiSSA/DeLoRA rank components and S-space quartiles (top_25pct_S, mid_50pct_S, bottom_25pct_S, residual_not_top_25pct_S, residual_not_bottom_25pct_S); cumulative S-energy groups (top_50pct_energy_S, top_90pct_energy_S, residuals); DoRA direction vs magnitude component; OFT rotation-derived component vs residualized effective update; IA3 attention-gate vs MLP-gate groups.
- UAT: one table has adapter, parameterization_family, coordinate_system, component, keep_or_drop, rank_or_group, energy_frac, syc_delta, dd_delta, pmass, and row-identity checks.
- Verify: all rows start from the trained adapter delta or trained adapter parameters; no row is constructed from base-only activations; every component shares the same sycophancy and DD row keys as full_dW; for each S-space crop, component_dW + residual_dW reconstructs full_dW within numerical tolerance.
- Positive outcome -> claim: if one parameterization component retains most behavior and dropping it removes behavior, report which degree of freedom carries the learned behavior.
- Negative outcome -> claim: if behavior is not localized by parameterization component, report the trained effect as distributed across that adapter parameterization.
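The S-space crop and its reconstruction check described above can be sketched for a single tensor as follows. This assumes the tensor's own SVD as the coordinate system and crops by coordinate index; `s_space_crop` is a hypothetical helper, not the function in parameterization_ablation.py, and the input is a random toy matrix.

```python
import numpy as np

def s_space_crop(dW, lo_frac, hi_frac):
    """Keep singular coordinates in [lo_frac, hi_frac) of the index range.

    Returns (component, residual, energy_frac), where the component and the
    residual are both projected back to weight space.
    """
    U, S, Vh = np.linalg.svd(dW, full_matrices=False)
    n = len(S)
    lo, hi = int(round(lo_frac * n)), int(round(hi_frac * n))
    mask = np.zeros(n)
    mask[lo:hi] = 1.0
    component = (U * (S * mask)) @ Vh          # cropped coordinates only
    residual = (U * (S * (1.0 - mask))) @ Vh   # everything else
    energy = float(np.sum((S * mask) ** 2) / np.sum(S ** 2))
    return component, residual, energy

rng = np.random.default_rng(0)
dW = rng.standard_normal((32, 24))             # toy delta with 24 singular values

top, not_top, top_energy = s_space_crop(dW, 0.0, 0.25)       # top_25pct_S
bottom, not_bottom, bottom_energy = s_space_crop(dW, 0.75, 1.0)  # bottom_25pct_S

# Verify step: component_dW + residual_dW reconstructs full_dW.
recon_ok = np.allclose(top + not_top, dW) and np.allclose(bottom + not_bottom, dW)
```

Since singular values are sorted in descending order, the top 25% of coordinates always carries at least 25% of the energy, so a crop by index and a crop by cumulative energy generally select different components.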

Coverage gaps in current ablation set

The three causal ablations above (cross-adapter dW basis, layer/module, adapter parameterization) leave some hypotheses untested. These are open follow-ups, not blockers for the current writeup.

Read-side modules in the layer/module ablation. Current variants cover residual writes (o_proj, down_proj), attention-only, and mlp-only, but not q/k/v-only or up/gate-only. Any read-side mechanism story is currently untestable.
Base-W SVD lens for the S-space ablation. parameterization_ablation.py uses each tensor's own SVD (dW = U S Vh). The catalog also wants a separate lens using the base weight's SVD (U0, S0, V0h = svd(W_base); dS = U0.T @ dW @ V0h.T, so that the same map sends W_base itself to diag(S0)), which answers "does dW ride pretrained singular directions" rather than "is dW low-rank in its own basis".
Adapter-architecture decompositions. S-space variants do not include DoRA magnitude vs direction, DeLoRA lambda vs direction, OFT rotation, or IA3 attention-gate vs MLP-gate splits.
Norm-matched random keep control for T8 sufficiency claims. Layer/module ablation has random_norm_matched_full; the S-space crops do not. Necessity (drop) tests don't need this; sufficiency (keep) tests do, because cropping shrinks Frobenius norm and the model is nonlinear in alpha.
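A minimal sketch of the base-W SVD lens from the list above, on toy square matrices. The right-hand factor is written as `V0h.T` (that is, V0) so that applying the same map to W_base returns diag(S0). The `diag_energy_frac` summary is a hypothetical choice of statistic, not something the catalog specifies.

```python
import numpy as np

def base_svd_coords(W_base, dW):
    """Express dW in the singular-vector coordinates of the base weight."""
    U0, S0, V0h = np.linalg.svd(W_base, full_matrices=False)
    # Note: for non-square W_base the reduced SVD discards any part of dW
    # outside the span of U0 (or of V0h's rows); this toy uses square matrices.
    dS = U0.T @ dW @ V0h.T
    return S0, dS

def diag_energy_frac(dS):
    """Fraction of dS energy on the diagonal, i.e. along pretrained directions."""
    return float(np.sum(np.diag(dS) ** 2) / np.sum(dS ** 2))

rng = np.random.default_rng(0)
W_base = rng.standard_normal((16, 16))
U0, S0, V0h = np.linalg.svd(W_base, full_matrices=False)

# Toy delta that only rescales pretrained singular directions.
dW_aligned = (U0 * rng.standard_normal(16)) @ V0h
# Toy delta with no relation to the base weight.
dW_random = rng.standard_normal((16, 16))

_, dS_self = base_svd_coords(W_base, W_base)
_, dS_aligned = base_svd_coords(W_base, dW_aligned)
_, dS_random = base_svd_coords(W_base, dW_random)
aligned_frac = diag_energy_frac(dS_aligned)
random_frac = diag_energy_frac(dS_random)
```

A diagonal fraction near 1 would indicate that dW mostly rescales pretrained singular directions, while a value near 1/d is what an unrelated delta gives on average.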

Deferred / optional

Optional future: constructive synthetic dW' baseline.
- Why: useful as a method baseline, but it is not a causal ablation of trained weight steering.
- Do: only if separately approved, build simple dW_prime = f(W_base, persona_contrast) candidates, e.g. lm-head/readout rowspace projected persona contrast, write-not-read persona contrast, and shared structural bases with signed coefficients from activation contrast.
- UAT: table compares synthetic dW_prime to trained dW, prompt, and repeng on identical sycophancy + DD rows.
- Verify: candidates are generated before reading trained adapter deltas; code fails if w.safetensors is loaded before constructing dW_prime.
- Positive outcome -> claim: if a synthetic dW_prime steers, weight steering may be replaceable by a constructive method baseline.
- Negative outcome -> claim: if no synthetic candidate steers while trained dW does, training is doing nontrivial search not captured by the current structural recipes.
Goal: SVD steering baseline.
- Why: useful only if cheap and stable; lower priority than repeng.
- UAT: same DD/sycophancy table as other baselines.
- Verify: table includes method=svd_steering, layer, rank, coeff, syc_delta, dd_delta, and pmass.
- Negative outcome -> claim: if SVD steering is weak or unstable, do not treat plain base-weight SVD as a competitive method baseline.
Goal: degradation benchmark.
- Why: steering might improve target metric while damaging general behavior.
- UAT: perplexity or clean instruction proxy reported for best coefficients.
- Verify: table has target metric and degradation metric for the exact same selected coefficients.
- Negative outcome -> claim: if target gains require large degradation, report steering as brittle rather than useful.
Goal: larger model replication.
- Why: Qwen3-0.6B and Gemma 1B are iteration models; larger model needed for a stronger claim.
- UAT: same benchmark table on a 4B-ish model after method stabilizes.
- Verify: model column includes the 4B-ish model and reuses the same prompt/DD row IDs as the small-model benchmark.
- Negative outcome -> claim: if the effect disappears or reverses on the larger model, write the small-model limitation instead of scaling the claim.

Decision rules

If prompt or activation steering beats dW, prioritize method improvement before deeper mechanistic analysis.
If activation steering matches dW, treat weight steering as mechanistic interest first and applied method second.
If DeLoRA wins across Qwen and Gemma, spend seeds on DeLoRA/PiSSA only.
If Qwen and Gemma adapter rankings diverge, write the model-specific adapter-basin finding instead of forcing one global winner.
Shared-core rule: if keep_B_shared_K32 retains >=0.7x behavior across LoRA / DoRA / PiSSA / DeLoRA / OFT and drop_B_shared_K32 removes most of it, write the planning-subspace paper.
Basin-divergence rule: if per-adapter top subspaces are mutually low-overlap and each adapter's own SVD keeps behavior better than B_shared, write the basin-divergence paper.
If top-k or write-not-read keeps behavior, we found a simple steering parameterization.
If complement/tail/many layers keep behavior, evidence favors distributed or wrong-space mechanism.
If MLP up/gate terms carry behavior, next paper story should be feature-space steering, not residual-stream planning subspace.
Held-out cross-adapter shared-SVD rule (contingent on T4 multiseed):
- If T4 shows within-adapter stdev > cross-adapter retained gap on lens-3, the N=1 cross-adapter inconsistency was seed noise -> held-out shared-SVD becomes worth running.
- If T4 confirms cross-adapter gap is real (current expectation), skip held-out shared-SVD -- it will fail by construction and only restates what is already known.
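The held-out shared-SVD rule above compares seed noise against the cross-adapter gap. A sketch of that comparison follows, assuming "within-adapter stdev" is aggregated as the largest per-adapter sample standard deviation and "gap" is the spread between adapter means. Both aggregation choices are assumptions, and the numbers are toy placeholders, not results.

```python
import statistics

def held_out_shared_svd_decision(retained_by_adapter):
    """Decide whether the held-out shared-SVD run is worth doing.

    retained_by_adapter maps adapter name -> per-seed retained values.
    """
    means = {a: statistics.mean(v) for a, v in retained_by_adapter.items()}
    within = max(statistics.stdev(v) for v in retained_by_adapter.values())
    gap = max(means.values()) - min(means.values())
    decision = "run_held_out_shared_svd" if within > gap else "skip"
    return decision, within, gap

# Toy case 1: seeds disagree more than adapters do, so the gap may be noise.
noisy = {"adapter_a": [0.2, 1.0, 0.6], "adapter_b": [0.5, 0.9, 0.1]}
# Toy case 2: adapters differ far more than seeds do, so the gap looks real.
clear = {"adapter_a": [1.2, 1.3, 1.25], "adapter_b": [0.1, 0.15, 0.2]}

noisy_decision, noisy_within, noisy_gap = held_out_shared_svd_decision(noisy)
clear_decision, clear_within, clear_gap = held_out_shared_svd_decision(clear)
```

With only three seeds the standard deviation is itself a rough estimate, so a borderline comparison should be read as ambiguous rather than as a decision.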

Resolved: train/eval axis switch from sycophancy to honesty (2026-04-27)

Decision: option 2 — switched the entire pipeline to honesty axis. Rationale:

The sycophancy NEG persona was already "you are an honest assistant", so the prior contrast was effectively honesty-vs-anti-honesty with extra sycophancy framing on the POS side. Cleaner to make that explicit.
Eval (daily_dilemmas-self-honesty) signs by honesty_label. One-axis training matches the eval axis by construction. The in-sample claim is now defensible.
Sycophancy becomes a free OOD transfer eval (run later if useful) rather than a confounded primary axis.
Avoids the GPT-4o sycophancy-eval generation burden (option 1) and its eval-validation overhead.

What changed in code (commit on dev):

src/ws/data.py: added HONESTY_PROMPT = "Pretend you're {persona} person. " with 5 persona paraphrases per side (HONESTY_POS_PERSONAS, HONESTY_NEG_PERSONAS), following the vgel/repeng short-persona convention. Added a _load_suffixes() reader for data/branching_suffixes.json (550 SSteer entries). The honesty branch reuses the suffix user_msg as the question pool; persona-conditioned generation in generate_pairs() produces (r_pos, r_neg) per the paper recipe (Fierro & Roger 2025 §F.1).
src/ws/eval/activation_baseline.py: RepE direction extraction now branches on cfg.behavior; honesty mode captures last-token hidden states under HONESTY_POS[0] / HONESTY_NEG[0] over _load_suffixes() prompts with assistant_prefixes=suffix.
src/ws/eval/prompt_baseline.py: replaced single engineered_prompt with paired engineered_prompt_honest + engineered_prompt_dishonest (AxBench Appendix J.2 style).
evals/smoke.py: added behavior field; just smoke --behavior honesty passes end-to-end on katuni4ka/tiny-random-qwen3.
data/branching_suffixes.json: copied from SSteer.
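To illustrate how the persona template pairs up with a question, here is a small sketch. Only the `HONESTY_PROMPT` string is quoted from the change list above. The persona lists, the example question, the `build_contrast_prompts` helper, and the choice to prepend the persona text directly to the user message are all placeholders and assumptions, not the contents of src/ws/data.py.

```python
HONESTY_PROMPT = "Pretend you're {persona} person. "  # template quoted in the plan

# Placeholder personas for illustration; the repo's actual lists are not shown here.
EXAMPLE_POS_PERSONAS = ["an honest", "a truthful"]
EXAMPLE_NEG_PERSONAS = ["a dishonest", "an untruthful"]

def build_contrast_prompts(question, pos_personas, neg_personas):
    """Pair each positive persona with a negative one on the same question."""
    pairs = []
    for pos, neg in zip(pos_personas, neg_personas):
        pos_prompt = HONESTY_PROMPT.format(persona=pos) + question
        neg_prompt = HONESTY_PROMPT.format(persona=neg) + question
        pairs.append((pos_prompt, neg_prompt))
    return pairs

# Hypothetical question standing in for a suffix user_msg.
question = "Did you finish the report?"
pairs = build_contrast_prompts(question, EXAMPLE_POS_PERSONAS, EXAMPLE_NEG_PERSONAS)
```

Each pair differs only in the persona phrase, so any difference in the generated responses or captured hidden states is attributable to the persona contrast rather than to the question.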

Sycophancy outputs in out/sycophancy/ are kept on disk as historical evidence for the old axis-mismatched table. The README headline numbers will be replaced with honesty once 230-236 land. T4/T5 stay open and will be re-scoped against honesty.
