weight-steering/docs/fork_plan.md

# Fork plan: weight steering benchmark + analysis

Updated: 2026-04-27

## Goal

Test whether weight steering is a useful method, and if it is, understand what part of the learned weight delta carries the behavior.

Two questions are intentionally separated:

1. **Benchmark question:** Does weight steering beat simple alternatives such as prompting and activation steering on sycophancy and daily-dilemmas honesty transfer?
2. **Analysis question:** If weight steering works, can the learned delta $dW = \theta^+ - \theta^-$ be factorized into a simpler causal intervention: a cross-adapter shared subspace, module, low-rank component, or adapter parameterization?

## Context

This is a fork of Anthropic's weight-steering method. Original recipe: train one positive adapter and one negative adapter, merge each adapter into base-weight deltas, then steer with:

$$dW = \Delta W_{pos} - \Delta W_{neg}.$$
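As a minimal sketch of the recipe above, assuming each merged adapter is available as a dict of per-tensor base-weight deltas (the tensor names, the dict layout, and the `steering_delta` / `apply_steering` helpers are hypothetical, not the repo's API), the steering vector and its application at coefficient `alpha` could look like this:

```python
import numpy as np

def steering_delta(delta_pos, delta_neg):
    """dW = delta_pos - delta_neg for every tensor name both adapters modified."""
    shared = sorted(delta_pos.keys() & delta_neg.keys())
    return {name: delta_pos[name] - delta_neg[name] for name in shared}

def apply_steering(base, dW, alpha):
    """Return a new weight dict W + alpha * dW; tensors without a delta are copied."""
    return {
        name: (w + alpha * dW[name]) if name in dW else w.copy()
        for name, w in base.items()
    }

# Toy tensors standing in for merged adapter deltas (hypothetical names and shapes).
rng = np.random.default_rng(0)
base = {
    "layers.0.o_proj": rng.normal(size=(4, 4)),
    "layers.0.q_proj": rng.normal(size=(4, 4)),
}
delta_pos = {"layers.0.o_proj": 0.01 * rng.normal(size=(4, 4))}
delta_neg = {"layers.0.o_proj": 0.01 * rng.normal(size=(4, 4))}

dW = steering_delta(delta_pos, delta_neg)
steered = apply_steering(base, dW, alpha=1.0)
unsteered = apply_steering(base, dW, alpha=0.0)
```

With `alpha=0.0` the weights equal the base model, which is the reference point for the `delta(+1 - 0)` style comparisons used later in this plan.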

This repo removes Axolotl/vLLM/API orchestration and rebuilds the method in HF + PEFT + uv for cheap iteration on small models.

Current main model: `Qwen/Qwen3-0.6B`.

Current behavior: honesty training (positive = honest persona, negative = dishonest persona), evaluated on `wassname/daily_dilemmas-self-honesty` (OOD) and held-out sycophancy Yes/No claims (IID).

## Links

- Paper / blog:
  - [docs/weight_steering_paper.md](docs/weight_steering_paper.md)
  - [docs/weight_steer_blog.md](docs/weight_steer_blog.md)
- Adapter-as-hypothesis notes:
  - [docs/blog_adapter_as_hypothesis/README.md](docs/blog_adapter_as_hypothesis/README.md)
- Steering/subspace concepts:
  - [docs/AntiPaSTO_concepts/README.md](docs/AntiPaSTO_concepts/README.md)
- Current user-facing summaries:
  - [README.md](README.md)
  - [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md)
- Key code:
  - [src/ws/data.py](src/ws/data.py)
  - [src/ws/train.py](src/ws/train.py)
  - [src/ws/diff.py](src/ws/diff.py)
  - [src/ws/steer.py](src/ws/steer.py)
  - [src/ws/eval/sycophancy.py](src/ws/eval/sycophancy.py)
  - [src/ws/eval/dilemmas.py](src/ws/eval/dilemmas.py)
  - [src/ws/eval/cross_adapter_ablation.py](src/ws/eval/cross_adapter_ablation.py)
  - [src/ws/eval/layer_module_ablation.py](src/ws/eval/layer_module_ablation.py)
  - [src/ws/eval/parameterization_ablation.py](src/ws/eval/parameterization_ablation.py)
  - [nbs/ablation_analysis.py](nbs/ablation_analysis.py)

## Current facts

- The daily-dilemmas default is **not the full split**. The default `n_dilemmas=100` selects the first 100 dilemmas = 200 rows, balanced between 100 honest-label and 100 dishonest-label actions.
- Full `honesty_eval` test split is 219 dilemmas = 438 rows.
- The daily-dilemmas eval uses all rows for selected dilemmas, then sign-flips by `honesty_label`; it is not only honest rows.
- Current headline tables are single-seed Qwen3-0.6B exploratory results.
- DeLoRA is the best raw steering method so far. PiSSA is the cleaner, more stable baseline if DeLoRA's saturation at high alpha is penalized.
- v9/v10 do **not** prove “no subspace.” They show the trained behavior is not explained by the tested low-rank residual-stream bases or adapter-family parameterization at trained scale.
- The active analysis should ablate the already-trained `dW`. Synthetic `dW'` construction is a different baseline, not causal ablation.
- The highest-value analysis tests are: cross-adapter causal `dW` basis ablation, layer/module ablation of trained `dW`, and adapter-parameterization ablation of trained `dW`.
- **Lens search is on hold pending multiseed (2026-04-27).** Every weight-space lens we tested has a built-in failure mode: SVD-on-`dW` is tautological for low-rank adapters; layer index tells depth, not mechanism; module family collapses heads/positions and gives different answers per adapter; native parameterization decompositions aren't comparable across adapter families. *But* the lens-3 cross-adapter inconsistency (delora residual_write retained=+1.27 vs lora=+0.14) rests on N=1 seed × N=1 model, so it might just be seed noise within each adapter. The right ordering is: T4 multiseed first, then re-run T7/T8 per seed with within-adapter stdev, then judge whether the inconsistency is real or noise.
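The row-selection and sign-flip convention from the daily-dilemmas bullets above can be sketched as follows. This is an illustration under assumptions, not the code in `src/ws/eval/dilemmas.py`: the field names `dilemma_id` and `logratio`, the encoding of `honesty_label` as `+1` / `-1`, and the reading of `logratio` as log P(Yes) minus log P(No) are all hypothetical.

```python
def select_rows(rows, n_dilemmas):
    """Keep every row belonging to the first n_dilemmas distinct dilemmas."""
    first_seen = []
    for row in rows:
        if row["dilemma_id"] not in first_seen:
            first_seen.append(row["dilemma_id"])
    chosen = set(first_seen[:n_dilemmas])
    return [row for row in rows if row["dilemma_id"] in chosen]

def signed_honesty_score(rows):
    """Mean logratio after flipping the sign of dishonest-label rows.

    Assumes honesty_label is +1 for honest actions and -1 for dishonest ones,
    so a higher score means the model favours honest actions.
    """
    return sum(row["honesty_label"] * row["logratio"] for row in rows) / len(rows)

# Toy rows: two actions per dilemma, one honest-label and one dishonest-label.
rows = [
    {"dilemma_id": 0, "honesty_label": +1, "logratio": 1.0},
    {"dilemma_id": 0, "honesty_label": -1, "logratio": -0.5},
    {"dilemma_id": 1, "honesty_label": +1, "logratio": 0.2},
    {"dilemma_id": 1, "honesty_label": -1, "logratio": 0.4},
    {"dilemma_id": 2, "honesty_label": +1, "logratio": 3.0},
    {"dilemma_id": 2, "honesty_label": -1, "logratio": -3.0},
]
selected = select_rows(rows, n_dilemmas=2)
score = signed_honesty_score(selected)
```

Because both rows of each selected dilemma are kept, the score is not computed on honest rows only; the dishonest-label rows contribute with a flipped sign.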

## Done

- [x] Clean repo into uv + HF + PEFT small-model workflow.
- [x] Make Qwen3-0.6B sycophancy steering work end-to-end.
- [x] Hook in LoRA, DoRA, PiSSA, DeLoRA, OFT, and IA3 adapter families.
- [x] Build sycophancy logratio eval with coefficient sweep.
- [x] Build daily-dilemmas honesty eval with sign-flipped Yes/No logratio.
- [x] Run single-seed Qwen adapter benchmark on sycophancy and 100-dilemma DD default.
- [x] Fix DD cross-adapter aggregation to use base-only coeff=0 rather than mixing persona baselines.
- [x] Run v9 subspace/scope diagnostics: weight oracle, cumulative activation oracle, block-local activation oracle, first-LoRA-layer sanity checks.
- [x] Run v10 projection/complement falsifier: raw activation projection, complement, and norm-matched projection.
- [x] Update README and research journal with corrected DD table and conservative interpretation.

## TODO: benchmark question

- [x] **Goal: activation-steering baseline on the same DD rows.**
  - Why: RepE/repeng is the most threatening baseline; if it matches or beats `dW`, the method story weakens before adapter seeds matter.
  - Do: train representation direction on the same sycophancy contrast; grid layer x coefficient; evaluate sycophancy and full DD.
  - UAT: best activation-steering row is selected by held-out sycophancy or validation DD, then reported beside best `dW` on identical DD test rows.
  - Verify: table includes `method=repeng`, `layer`, `coeff`, `syc_delta`, `dd_delta`, `pmass`, and the same `idx` set as the `dW` rows.
  - Negative outcome -> claim: if repeng matches/beats `dW`, write "activation steering is the simpler baseline; weight steering needs a stronger reason to exist."

- [x] **Goal: full daily-dilemmas benchmark for current Qwen adapters.**
  - Why: current DD table uses first 100 dilemmas, not the full 219-dilemma split.
  - Do: re-run LoRA / PiSSA / DeLoRA / DoRA / OFT / IA3 with `--n-dilemmas 219`.
  - UAT: table has 438 base rows per coeff before persona baselines, and reports `pmass`, `frac_low_pmass`, `delta(+1 - 0)`.
  - Verify: `out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv` exists and includes `n_base_rows_per_coeff=438`.

- [x] **Goal: prompt baselines on the same DD rows.**
  - Why: weight steering is only interesting if it beats “just prompt it.”
  - Do: evaluate base, simple honest persona, and engineered AxBench-style prompt.
  - UAT: one table compares `base`, `simple_honest_prompt`, `engineered_prompt`, and best `dW` on identical rows.
  - Verify: `prompt_baseline_delta` and `weight_steer_delta` are computed from the same `idx` set.
  - Negative outcome -> claim: if prompting matches/beats `dW`, write "prompting is the simpler intervention for this behavior/eval pair."

- [ ] **Goal: multi-seed adapter benchmark on Qwen.**
  - Why: current adapter ranking is N=1 seed.
  - Do: run seeds 0, 1, 2 for LoRA / PiSSA / DeLoRA first; add DoRA/OFT only if cheap.
  - UAT: table reports mean +/- std for sycophancy and DD deltas, plus seed-level signs, so a reader can tell stable ranking from noisy N=1 luck.
  - Verify: each adapter has exactly three `w.safetensors` files and three eval summaries; ranking table includes `n_seeds=3`, `mean_dd_delta`, `std_dd_delta`, and `sign_agreement`.
  - Negative outcome -> claim: if adapter ranking changes across seeds or error bars overlap heavily, write "single-seed adapter winner is unstable; do not claim a family ranking yet."

- [ ] **Goal: Gemma 1B replication.**
  - Why: check whether DeLoRA/PiSSA ranking is Qwen-specific.
  - Do: train LoRA / PiSSA / DeLoRA on Gemma 1B, seed 0, full DD split.
  - UAT: compare Gemma ranking to Qwen ranking with the same metrics.
  - Verify: table has model column with `Qwen3-0.6B` and `Gemma-1B`; if DeLoRA remains best, expand seeds; if rankings diverge, write that up as a model-specific adapter-basin finding.
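For the multi-seed item above, the per-adapter summary columns (`n_seeds`, `mean_dd_delta`, `std_dd_delta`, `sign_agreement`) could be computed roughly as below. The definition of `sign_agreement` as the fraction of seeds whose delta has the same sign as the mean is an assumption, as is the use of the sample standard deviation; the input numbers are placeholders, not results.

```python
from statistics import mean, stdev

def summarize_seeds(dd_deltas):
    """Summarize per-seed DD deltas for one adapter.

    sign_agreement is assumed to mean: fraction of seeds whose delta has the
    same sign as the mean delta. std is the sample standard deviation (n - 1).
    """
    m = mean(dd_deltas)
    agree = sum(1 for d in dd_deltas if (d > 0) == (m > 0))
    return {
        "n_seeds": len(dd_deltas),
        "mean_dd_delta": m,
        "std_dd_delta": stdev(dd_deltas),
        "sign_agreement": agree / len(dd_deltas),
    }

# Placeholder deltas for seeds 0, 1, 2 of a hypothetical adapter.
summary = summarize_seeds([0.4, 0.6, -0.1])
```

Reporting `sign_agreement` next to the mean makes it visible when a positive average is driven by one or two seeds rather than by all of them.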

## TODO: analysis question

**Status (2026-04-27): on hold pending multiseed.** T6/T7/T8 have been run on
N=1 seed × Qwen3-0.6B only. Necessity is established. The cross-adapter
inconsistency that drove the "no parameterization-invariant mechanism"
reading might be seed noise. Resume after T4 (multiseed) lands and we can
report within-adapter stdev alongside cross-adapter gaps.

Active sequence at the time of pause was:

1. Cross-adapter causal `dW` basis ablation.
2. Layer/module causal ablation of trained `dW`.
3. Adapter-parameterization causal ablation of trained `dW`.

Synthetic `dW'` construction is deferred below and is not a causal ablation.

- [x] **Goal: cross-adapter causal `dW` basis ablation.**
  - Why: this is the headline analysis experiment. It tests whether different adapter families discovered the same causal planning subspace or different basins.
  - Do: build candidate bases `B` from trained adapter deltas, compute `dW_keep_B` and `dW_drop_B`, and evaluate both on sycophancy + full DD for each adapter.
  - Candidate `B` rows:
    - `shared_SVD_K8/K32/K64`: stack residual-output `dW` from LoRA / DoRA / PiSSA / DeLoRA / OFT per layer/tensor, take top-K SVs.
    - `top8/top32_per_adapter` and `tail_per_adapter`: per-adapter SVD split of each tensor.
    - `random_null`: rank-matched random bases.
  - UAT: one central table has `ablation_family`, `candidate_B`, `adapter`, `rank`, `keep_or_drop`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
  - Verify: the table contains keep/drop rows for `shared_svd`, `per_adapter_svd`, and `random_null`; `keep_B_shared_K32` and `drop_B_shared_K32` are both evaluated for at least LoRA / DoRA / PiSSA / DeLoRA / OFT; random null retention is near rank/d; each row uses the same eval rows and coefficient grid.
  - Positive outcome -> claim: if `keep_B_shared` retains >=0.7x behavior across adapters and `drop_B_shared` removes it, write the adapter-invariant planning-subspace paper.
  - Negative outcome -> claim: if `keep_B_shared` retains <0.3x even at K=64 while complements/tails retain behavior, write the shared-subspace negative result: steering is distributed or lives in the wrong parameter space for these bases.
  - Ambiguous outcome -> claim: if both keep and drop retain high behavior, report non-identifiability under this basis family and move to stricter causal interventions, not a positive subspace claim.
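The keep/drop construction in the item above can be sketched for a single tensor as follows. This is a toy illustration under assumptions: deltas are stacked along the input dimension so the basis lives in the output (residual-write) space, the helper names are hypothetical, and retained Frobenius energy is used only as a cheap proxy; the plan's actual criterion is behavior retention on sycophancy and DD.

```python
import numpy as np

def shared_basis(deltas, k):
    """Top-k left singular vectors of the column-stacked adapter deltas."""
    stacked = np.concatenate(deltas, axis=1)  # (d_out, n_adapters * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :k]

def random_basis(d_out, k, rng):
    """Rank-matched random orthonormal basis (the random_null control)."""
    Q, _ = np.linalg.qr(rng.normal(size=(d_out, k)))
    return Q

def keep_drop(dW, B):
    """Split dW into its projection onto span(B) and the complement."""
    keep = B @ (B.T @ dW)
    return keep, dW - keep

def energy_frac(part, full):
    """Fraction of squared Frobenius norm carried by `part`."""
    return float((part ** 2).sum() / (full ** 2).sum())

# Toy deltas: three "adapters" sharing a rank-2 output subspace plus small noise.
rng = np.random.default_rng(0)
d_out, d_in, k = 16, 12, 2
A = rng.normal(size=(d_out, k))
deltas = [A @ rng.normal(size=(k, d_in)) + 0.01 * rng.normal(size=(d_out, d_in))
          for _ in range(3)]

B = shared_basis(deltas, k)
keep, drop = keep_drop(deltas[0], B)
keep_rand, _ = keep_drop(deltas[0], random_basis(d_out, k, rng))
frac_shared = energy_frac(keep, deltas[0])
frac_random = energy_frac(keep_rand, deltas[0])
```

In this toy setup the shared basis retains almost all of the energy while the random basis retains only a small share, which is the contrast the `random_null` rows are meant to calibrate.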

Note: for the following two ablations, the candidate hypotheses were searched and cataloged in docs/hypothesis_ablation_catalog.md.

- [x] **Goal: layer/module causal ablation of trained `dW`.**
  - Why: after a trained update works, we need to know which layers and modules are necessary or sufficient.
  - Do: keep/drop parts of the already-trained adapter delta by layer and module family, without synthesizing new tensors from base features.
  - Rows: `full_dW`, `residual_write_only`, `attn_o_proj_only`, `mlp_down_proj_only`, `layers_8_21_only`, single-layer keep, leave-one-layer-out, coarse early/mid/late LoRA-layer blocks, rank/module-matched random controls, and `zero`.
  - UAT: one table has `adapter`, `variant`, `layer_or_block`, `module_family`, `keep_or_drop`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
  - Verify: same sycophancy and full DD rows as `full_dW`; table includes all required variants and reports zero row-key symmetric difference for every variant x coeff group.
  - Positive outcome -> claim: if a small layer/module slice retains most behavior and dropping it removes behavior, report the causal locus.
  - Negative outcome -> claim: if many disjoint slices retain behavior, report distributed or non-identifiable layer/module localization.

- [x] **Goal: adapter-parameterization causal ablation of trained `dW`.**
  - Why: adapter families may store the behavior in different parameterization degrees of freedom even when their effective `dW` looks similar.
  - Do: split the trained adapter/effective delta according to the adapter family's own coordinates, then keep/drop each component on identical eval rows. For an S-space split, compute the trained effective matrix's SVD-like coordinate system, project `dW -> S`, crop a component such as the top 25% of `S` by coordinate index, project back to weight space, and evaluate both `top_25pct_S` and `residual_not_top_25pct_S` against `full_dW` and `zero`.
  - Rows: LoRA/PiSSA/DeLoRA rank components and S-space quartiles (`top_25pct_S`, `mid_50pct_S`, `bottom_25pct_S`, `residual_not_top_25pct_S`, `residual_not_bottom_25pct_S`); cumulative S-energy groups (`top_50pct_energy_S`, `top_90pct_energy_S`, residuals); DoRA direction vs magnitude component; OFT rotation-derived component vs residualized effective update; IA3 attention-gate vs MLP-gate groups.
  - UAT: one table has `adapter`, `parameterization_family`, `coordinate_system`, `component`, `keep_or_drop`, `rank_or_group`, `energy_frac`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
  - Verify: all rows start from the trained adapter delta or trained adapter parameters; no row is constructed from base-only activations; every component shares the same sycophancy and DD row keys as `full_dW`; for each S-space crop, `component_dW + residual_dW` reconstructs `full_dW` within numerical tolerance.
  - Positive outcome -> claim: if one parameterization component retains most behavior and dropping it removes behavior, report which degree of freedom carries the learned behavior.
  - Negative outcome -> claim: if behavior is not localized by parameterization component, report the trained effect as distributed across that adapter parameterization.
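The S-space crop and its reconstruction check described above can be sketched for one tensor like this. It is a minimal sketch assuming the tensor's own SVD is the coordinate system; the function name, the fractional index arguments, and the rounding rule are hypothetical and may differ from `parameterization_ablation.py`.

```python
import numpy as np

def s_space_crop(dW, lo_frac, hi_frac):
    """Split dW into the singular components with index in [lo, hi) and the rest.

    Returns (component_dW, residual_dW, energy_frac), where energy_frac is the
    share of squared singular values inside the crop.
    """
    U, S, Vh = np.linalg.svd(dW, full_matrices=False)
    n = S.shape[0]
    lo, hi = int(round(lo_frac * n)), int(round(hi_frac * n))
    mask = np.zeros(n, dtype=bool)
    mask[lo:hi] = True
    component = (U[:, mask] * S[mask]) @ Vh[mask]
    residual = (U[:, ~mask] * S[~mask]) @ Vh[~mask]
    energy = float((S[mask] ** 2).sum() / (S ** 2).sum())
    return component, residual, energy

# Toy 8x8 delta: top_25pct_S keeps the 2 largest singular components.
rng = np.random.default_rng(0)
dW = rng.normal(size=(8, 8))
top_25pct_S, residual_not_top_25pct_S, energy = s_space_crop(dW, 0.0, 0.25)
reconstruction_error = np.abs(top_25pct_S + residual_not_top_25pct_S - dW).max()
```

The `reconstruction_error` value corresponds to the Verify step that `component_dW + residual_dW` reconstructs `full_dW` within numerical tolerance; `energy` corresponds to the `energy_frac` column.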

## Coverage gaps in current ablation set

The three causal ablations above (cross-adapter `dW` basis, layer/module,
adapter parameterization) leave some hypotheses untested. These are open
follow-ups, not blockers for the current writeup.

- [ ] **Read-side modules in the layer/module ablation.** Current variants
  cover residual writes (`o_proj`, `down_proj`), attention-only, and
  mlp-only, but not q/k/v-only or up/gate-only. Any read-side mechanism
  story is currently untestable.
- [ ] **Base-W SVD lens for the S-space ablation.** `parameterization_ablation.py`
  uses each tensor's own SVD (`dW = U S Vh`). The catalog also wants a
  separate lens using the base weight's SVD (`U0, S0, V0h = svd(W_base);
  dS = U0.T @ dW @ V0h.T`), which answers "does `dW` ride pretrained
  singular directions" rather than "is `dW` low-rank in its own basis".
- [ ] **Adapter-architecture decompositions.** S-space variants do not
  include DoRA magnitude vs direction, DeLoRA lambda vs direction, OFT
  rotation, or IA3 attention-gate vs MLP-gate splits.
- [ ] **Norm-matched random keep control for T8 sufficiency claims.**
  Layer/module ablation has `random_norm_matched_full`; the S-space crops
  do not. Necessity (drop) tests don't need this; sufficiency (keep) tests
  do, because cropping shrinks Frobenius norm and the model is nonlinear
  in alpha.
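As a rough sketch of the base-W SVD lens listed in the gaps above: express `dW` in the base weight's singular basis and measure how much of its energy sits on the diagonal, i.e. on pretrained singular directions. The `diag_energy` statistic and the function name are assumptions made for illustration; note that `np.linalg.svd` returns `V0h` (V transposed), so the projection onto base coordinates is `U0.T @ dW @ V0h.T`.

```python
import numpy as np

def base_svd_lens(W_base, dW):
    """Express dW in the singular basis of W_base.

    Returns (dS, diag_energy): dS are the coordinates of dW in the base basis,
    and diag_energy is the share of dW's squared Frobenius norm that lies on
    the diagonal of dS (a hypothetical summary statistic).
    """
    U0, _, V0h = np.linalg.svd(W_base, full_matrices=False)
    dS = U0.T @ dW @ V0h.T
    diag_energy = float((np.diag(dS) ** 2).sum() / (dW ** 2).sum())
    return dS, diag_energy

rng = np.random.default_rng(0)
W_base = rng.normal(size=(6, 6))
U0, _, V0h = np.linalg.svd(W_base, full_matrices=False)

# A delta that only rescales pretrained singular directions...
dW_aligned = U0 @ np.diag(rng.normal(size=6)) @ V0h
# ...versus an unstructured delta.
dW_random = rng.normal(size=(6, 6))

dS_aligned, aligned_energy = base_svd_lens(W_base, dW_aligned)
dS_random, random_energy = base_svd_lens(W_base, dW_random)
```

A `diag_energy` close to 1 would indicate that `dW` mostly rescales existing singular directions; a low value says it mixes them, which is a different question from whether `dW` is low-rank in its own basis.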

## Deferred / optional

- [ ] **Optional future: constructive synthetic `dW'` baseline.**
  - Why: useful as a method baseline, but it is not a causal ablation of trained weight steering.
  - Do: only if separately approved, build simple `dW_prime = f(W_base, persona_contrast)` candidates, e.g. lm-head/readout rowspace projected persona contrast, write-not-read persona contrast, and shared structural bases with signed coefficients from activation contrast.
  - UAT: table compares synthetic `dW_prime` to trained `dW`, prompt, and repeng on identical sycophancy + DD rows.
  - Verify: candidates are generated before reading trained adapter deltas; code fails if `w.safetensors` is loaded before constructing `dW_prime`.
  - Positive outcome -> claim: if a synthetic `dW_prime` steers, weight steering may be replaceable by a constructive method baseline.
  - Negative outcome -> claim: if no synthetic candidate steers while trained `dW` does, training is doing nontrivial search not captured by the current structural recipes.

- [ ] **Goal: SVD steering baseline.**
  - Why: useful only if cheap and stable; lower priority than repeng.
  - UAT: same DD/sycophancy table as other baselines.
  - Verify: table includes `method=svd_steering`, `layer`, `rank`, `coeff`, `syc_delta`, `dd_delta`, and `pmass`.
  - Negative outcome -> claim: if SVD steering is weak or unstable, do not treat plain base-weight SVD as a competitive method baseline.

- [ ] **Goal: degradation benchmark.**
  - Why: steering might improve target metric while damaging general behavior.
  - UAT: perplexity or clean instruction proxy reported for best coefficients.
  - Verify: table has target metric and degradation metric for the exact same selected coefficients.
  - Negative outcome -> claim: if target gains require large degradation, report steering as brittle rather than useful.

- [ ] **Goal: larger model replication.**
  - Why: Qwen3-0.6B and Gemma 1B are iteration models; larger model needed for a stronger claim.
  - UAT: same benchmark table on a 4B-ish model after method stabilizes.
  - Verify: model column includes the 4B-ish model and reuses the same prompt/DD row IDs as the small-model benchmark.
  - Negative outcome -> claim: if the effect disappears or reverses on the larger model, write the small-model limitation instead of scaling the claim.

## Decision rules

- If prompt or activation steering beats `dW`, prioritize method improvement before deeper mechanistic analysis.
- If activation steering matches `dW`, treat weight steering as being of mechanistic interest first and as an applied method second.
- If DeLoRA wins across Qwen and Gemma, spend seeds on DeLoRA/PiSSA only.
- If Qwen and Gemma adapter rankings diverge, write the model-specific adapter-basin finding instead of forcing one global winner.
- Shared-core rule: if `keep_B_shared_K32` retains >=0.7x behavior across LoRA / DoRA / PiSSA / DeLoRA / OFT and `drop_B_shared_K32` removes most of it, write the planning-subspace paper.
- Basin-divergence rule: if per-adapter top subspaces are mutually low-overlap and each adapter's own SVD keeps behavior better than `B_shared`, write the basin-divergence paper.
- If top-k or write-not-read keeps behavior, we found a simple steering parameterization.
- If complement/tail/many layers keep behavior, evidence favors distributed or wrong-space mechanism.
- If MLP `up/gate` terms carry behavior, next paper story should be feature-space steering, not residual-stream planning subspace.
- Held-out cross-adapter shared-SVD rule (contingent on T4 multiseed):
  - If T4 shows within-adapter stdev > cross-adapter retained gap on lens-3, the N=1 cross-adapter inconsistency was seed noise -> held-out shared-SVD becomes worth running.
  - If T4 confirms cross-adapter gap is real (current expectation), skip held-out shared-SVD -- it will fail by construction and only restates what is already known.
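The shared-subspace thresholds in the rules above can be written as a small classifier. This is a sketch under assumptions: retention is taken to be the variant's delta divided by the `full_dW` delta, the 0.3 cutoff for "removes most of it" and the 0.7 cutoff for "drop retains behavior" are guesses mirroring the keep thresholds, and the input values are placeholders, not measured results.

```python
def classify_shared_basis(keep_retained, drop_retained, hi=0.7, lo=0.3):
    """Map per-adapter retention ratios to one of the outcome labels.

    keep_retained / drop_retained: dict of adapter name -> retained fraction of
    the full_dW behavior (assumed definition: variant delta / full delta).
    """
    keep = list(keep_retained.values())
    drop = list(drop_retained.values())
    if all(k >= hi for k in keep) and all(d < lo for d in drop):
        return "shared_subspace"
    if all(k < lo for k in keep) and all(d >= hi for d in drop):
        return "distributed_or_wrong_space"
    if all(k >= hi for k in keep) and all(d >= hi for d in drop):
        return "non_identifiable"
    return "unclassified"

# Placeholder retention ratios for two hypothetical adapters.
positive = classify_shared_basis({"a": 0.9, "b": 0.8}, {"a": 0.1, "b": 0.2})
negative = classify_shared_basis({"a": 0.1, "b": 0.2}, {"a": 0.9, "b": 0.8})
ambiguous = classify_shared_basis({"a": 0.9, "b": 0.8}, {"a": 0.9, "b": 0.75})
mixed = classify_shared_basis({"a": 0.9, "b": 0.2}, {"a": 0.1, "b": 0.8})
```

Requiring the condition to hold for every adapter is one way to read "across adapters"; a mixed pattern falls through to `unclassified` rather than being forced into a paper story.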

## Resolved: train/eval axis switch from sycophancy to honesty (2026-04-27)

**Decision: option 2** — switched the entire pipeline to honesty axis. Rationale:

- The sycophancy NEG persona was already "you are an honest assistant", so the prior contrast was effectively honesty-vs-anti-honesty with extra sycophancy framing on the POS side. Cleaner to make that explicit.
- Eval (`daily_dilemmas-self-honesty`) signs by `honesty_label`. One-axis training matches the eval axis by construction. The in-sample claim is now defensible.
- Sycophancy becomes a free OOD transfer eval (run later if useful) rather than a confounded primary axis.
- Avoids the GPT-4o sycophancy-eval generation burden (option 1) and its eval-validation overhead.

What changed in code (commit on `dev`):

- `src/ws/data.py`: added `HONESTY_PROMPT = "Pretend you're {persona} person. "` with 5 persona paraphrases per side (`HONESTY_POS_PERSONAS`, `HONESTY_NEG_PERSONAS`), following the vgel/repeng short-persona convention. Added `_load_suffixes()` reader for `data/branching_suffixes.json` (550 SSteer entries). Honesty branch reuses the suffix `user_msg` as the question pool; persona-conditioned generation in `generate_pairs()` produces (r_pos, r_neg) per the paper recipe (Fierro & Roger 2025 §F.1).
- `src/ws/eval/activation_baseline.py`: RepE direction extraction now branches on `cfg.behavior`; honesty mode captures last-token hidden states under `HONESTY_POS[0]` / `HONESTY_NEG[0]` over `_load_suffixes()` prompts with `assistant_prefixes=suffix`.
- `src/ws/eval/prompt_baseline.py`: replaced single `engineered_prompt` with paired `engineered_prompt_honest` + `engineered_prompt_dishonest` (AxBench Appendix J.2 style).
- `evals/smoke.py`: added `behavior` field; `just smoke --behavior honesty` passes end-to-end on `katuni4ka/tiny-random-qwen3`.
- `data/branching_suffixes.json`: copied from SSteer.

Sycophancy outputs in `out/sycophancy/` are kept on disk as historical evidence for the old axis-mismatched table. The README headline numbers will be replaced with honesty once 230-236 land. T4/T5 stay open and will be re-scoped against honesty.