wassname da75668d6b move RESEARCH_JOURNAL and fork_plan under docs/
Working notes belong with the rest of the docs. Updated relative links
in docs/hypothesis_ablation_catalog.md from ../fork_plan.md to fork_plan.md
since both files now live in docs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 09:09:52 +08:00

# Fork plan: weight steering benchmark + analysis
Updated: 2026-04-27
## Goal
Test whether weight steering is a useful method, and if it is, understand what part of the learned weight delta carries the behavior.
Two questions are intentionally separated:
1. **Benchmark question:** Does weight steering beat simple alternatives such as prompting and activation steering on sycophancy and daily-dilemmas honesty transfer?
2. **Analysis question:** If weight steering works, can the learned delta $dW = \theta^+ - \theta^-$ be factorized into a simpler causal intervention: a cross-adapter shared subspace, module, low-rank component, or adapter parameterization?
## Context
This is a fork of Anthropic's weight-steering method. Original recipe: train one positive adapter and one negative adapter, merge each adapter into base-weight deltas, then steer with:
$$dW = \Delta W_{pos} - \Delta W_{neg}.$$
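
A minimal sketch of this recipe on a single weight matrix, assuming plain LoRA adapters whose merged delta is `(lora_alpha / r) * B @ A`. The helper names `lora_delta` and `steer`, the toy shapes, and the random tensors are illustrative assumptions, not the code in `src/ws/diff.py` or `src/ws/steer.py`.

```python
import numpy as np


def lora_delta(A, B, lora_alpha, r):
    """Merged weight delta of a plain LoRA adapter: (lora_alpha / r) * B @ A."""
    return (lora_alpha / r) * (B @ A)


def steer(W_base, dW, coeff):
    """Apply the steering delta at a signed coefficient; coeff=0 is the base model."""
    return W_base + coeff * dW


rng = np.random.default_rng(0)
d_out, d_in, r, lora_alpha = 8, 8, 2, 16  # toy sizes, not the real model's

W_base = rng.normal(size=(d_out, d_in))
# Random stand-ins for the trained positive and negative adapters.
A_pos, B_pos = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
A_neg, B_neg = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))

dW = lora_delta(A_pos, B_pos, lora_alpha, r) - lora_delta(A_neg, B_neg, lora_alpha, r)
W_plus = steer(W_base, dW, coeff=1.0)    # push toward the positive persona
W_minus = steer(W_base, dW, coeff=-1.0)  # push toward the negative persona

# The difference of two rank-r deltas has rank at most 2r.
print("rank(dW) =", np.linalg.matrix_rank(dW), "<=", 2 * r)
```

Other adapter families (DoRA, OFT, IA3) do not merge to a simple `B @ A` product, so their deltas would have to be read off as merged weight minus base weight instead.
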
This repo removes Axolotl/vLLM/API orchestration and rebuilds the method in HF + PEFT + uv for cheap iteration on small models.
Current main model: `Qwen/Qwen3-0.6B`.
Current behavior: honesty training (positive = honest persona, negative = dishonest persona), evaluated on `wassname/daily_dilemmas-self-honesty` (OOD) and held-out sycophancy Yes/No claims (IID).
## Links
- Paper / blog:
  - [docs/weight_steering_paper.md](docs/weight_steering_paper.md)
  - [docs/weight_steer_blog.md](docs/weight_steer_blog.md)
- Adapter-as-hypothesis notes:
  - [docs/blog_adapter_as_hypothesis/README.md](docs/blog_adapter_as_hypothesis/README.md)
- Steering/subspace concepts:
  - [docs/AntiPaSTO_concepts/README.md](docs/AntiPaSTO_concepts/README.md)
- Current user-facing summaries:
  - [README.md](README.md)
  - [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md)
- Key code:
  - [src/ws/data.py](src/ws/data.py)
  - [src/ws/train.py](src/ws/train.py)
  - [src/ws/diff.py](src/ws/diff.py)
  - [src/ws/steer.py](src/ws/steer.py)
  - [src/ws/eval/sycophancy.py](src/ws/eval/sycophancy.py)
  - [src/ws/eval/dilemmas.py](src/ws/eval/dilemmas.py)
  - [src/ws/eval/cross_adapter_ablation.py](src/ws/eval/cross_adapter_ablation.py)
  - [src/ws/eval/layer_module_ablation.py](src/ws/eval/layer_module_ablation.py)
  - [src/ws/eval/parameterization_ablation.py](src/ws/eval/parameterization_ablation.py)
  - [nbs/ablation_analysis.py](nbs/ablation_analysis.py)
## Current facts
- Daily-dilemmas default is **not full split**. Default `n_dilemmas=100` means first 100 dilemmas = 200 rows, balanced 100 honest-label and 100 dishonest-label actions.
- Full `honesty_eval` test split is 219 dilemmas = 438 rows.
- The daily-dilemmas eval uses all rows for selected dilemmas, then sign-flips by `honesty_label`; it is not only honest rows.
- Current headline tables are single-seed Qwen3-0.6B exploratory results.
- DeLoRA is best raw steering so far. PiSSA is the cleaner stable baseline if penalizing DeLoRA saturation at high alpha.
- v9/v10 do **not** prove “no subspace.” They show the trained behavior is not explained by the tested low-rank residual-stream bases or adapter-family parameterization at trained scale.
- The active analysis should ablate the already-trained `dW`. Synthetic `dW'` construction is a different baseline, not causal ablation.
- The highest-value analysis tests are: cross-adapter causal `dW` basis ablation, layer/module ablation of trained `dW`, and adapter-parameterization ablation of trained `dW`.
- **Lens search is on hold pending multiseed (2026-04-27).** Every weight-space lens we tested has a built-in failure mode: SVD-on-`dW` is tautological for low-rank adapters; layer-index tells depth not mechanism; module-family collapses heads/positions and gives different answers per adapter; native parameterization decompositions aren't comparable across adapter families. *But* the lens-3 cross-adapter inconsistency (delora residual_write retained=+1.27 vs lora=+0.14) is N=1 seed × N=1 model. It might just be seed noise within each adapter. Right ordering: T4 multiseed first, then re-run T7/T8 per-seed with within-adapter stdev, then judge whether the inconsistency is real or noise.
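
The daily-dilemmas facts above (two rows per dilemma, all rows kept, sign-flip by `honesty_label`) can be sketched as follows. The field names `dilemma_idx`, `logp_yes`, `logp_no`, the +1/-1 encoding of `honesty_label`, and the `dd_honesty_score` helper are assumptions for illustration; the real logic lives in `src/ws/eval/dilemmas.py` and may differ.

```python
import math


def dd_honesty_score(rows, n_dilemmas=100):
    """Mean sign-flipped Yes/No logratio over the first `n_dilemmas` dilemmas."""
    first = set(sorted({r["dilemma_idx"] for r in rows})[:n_dilemmas])
    selected = [r for r in rows if r["dilemma_idx"] in first]
    scores, pmass = [], []
    for r in selected:
        logratio = r["logp_yes"] - r["logp_no"]
        # Flip the sign on dishonest-label actions so that higher is always more honest.
        scores.append(r["honesty_label"] * logratio)
        # Probability mass on the two answer tokens; low values flag unreliable rows.
        pmass.append(math.exp(r["logp_yes"]) + math.exp(r["logp_no"]))
    n = len(selected)
    return {"n_rows": n, "score": sum(scores) / n, "pmass": sum(pmass) / n}


def toy_row(idx, label, p_yes, p_no):
    """Build one made-up eval row from Yes/No probabilities."""
    return {"dilemma_idx": idx, "honesty_label": label,
            "logp_yes": math.log(p_yes), "logp_no": math.log(p_no)}


# Made-up rows: each dilemma has one honest-label and one dishonest-label action.
rows = [toy_row(0, +1, 0.6, 0.3), toy_row(0, -1, 0.2, 0.7),
        toy_row(1, +1, 0.5, 0.4), toy_row(1, -1, 0.45, 0.45),
        toy_row(2, +1, 0.1, 0.8), toy_row(2, -1, 0.8, 0.1)]

result = dd_honesty_score(rows, n_dilemmas=2)
print(result)
```

With two rows per dilemma, the default `n_dilemmas=100` selects 200 rows and `n_dilemmas=219` selects all 438.
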
## Done
- [x] Clean repo into uv + HF + PEFT small-model workflow.
- [x] Make Qwen3-0.6B sycophancy steering work end-to-end.
- [x] Hook in LoRA, DoRA, PiSSA, DeLoRA, OFT, and IA3 adapter families.
- [x] Build sycophancy logratio eval with coefficient sweep.
- [x] Build daily-dilemmas honesty eval with sign-flipped Yes/No logratio.
- [x] Run single-seed Qwen adapter benchmark on sycophancy and 100-dilemma DD default.
- [x] Fix DD cross-adapter aggregation to use base-only coeff=0 rather than mixing persona baselines.
- [x] Run v9 subspace/scope diagnostics: weight oracle, cumulative activation oracle, block-local activation oracle, first-LoRA-layer sanity checks.
- [x] Run v10 projection/complement falsifier: raw activation projection, complement, and norm-matched projection.
- [x] Update README and research journal with corrected DD table and conservative interpretation.
## TODO: benchmark question
- [x] **Goal: activation-steering baseline on the same DD rows.**
- Why: RepE/repeng is the most threatening baseline; if it matches or beats `dW`, the method story weakens before adapter seeds matter.
- Do: train representation direction on the same sycophancy contrast; grid layer x coefficient; evaluate sycophancy and full DD.
- UAT: best activation-steering row is selected by held-out sycophancy or validation DD, then reported beside best `dW` on identical DD test rows.
- Verify: table includes `method=repeng`, `layer`, `coeff`, `syc_delta`, `dd_delta`, `pmass`, and the same `idx` set as the `dW` rows.
- Negative outcome -> claim: if repeng matches/beats `dW`, write "activation steering is the simpler baseline; weight steering needs a stronger reason to exist."
- [x] **Goal: full daily-dilemmas benchmark for current Qwen adapters.**
- Why: current DD table uses first 100 dilemmas, not the full 219-dilemma split.
- Do: re-run LoRA / PiSSA / DeLoRA / DoRA / OFT / IA3 with `--n-dilemmas 219`.
- UAT: table has 438 base rows per coeff before persona baselines, and reports `pmass`, `frac_low_pmass`, `delta(+1 - 0)`.
- Verify: `out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv` exists and includes `n_base_rows_per_coeff=438`.
- [x] **Goal: prompt baselines on the same DD rows.**
- Why: weight steering is only interesting if it beats “just prompt it.”
- Do: evaluate base, simple honest persona, and engineered AxBench-style prompt.
- UAT: one table compares `base`, `simple_honest_prompt`, `engineered_prompt`, and best `dW` on identical rows.
- Verify: `prompt_baseline_delta` and `weight_steer_delta` are computed from the same `idx` set.
- Negative outcome -> claim: if prompting matches/beats `dW`, write "prompting is the simpler intervention for this behavior/eval pair."
- [ ] **Goal: multi-seed adapter benchmark on Qwen.**
- Why: current adapter ranking is N=1 seed.
- Do: run seeds 0, 1, 2 for LoRA / PiSSA / DeLoRA first; add DoRA/OFT only if cheap.
- UAT: table reports mean +/- std for sycophancy and DD deltas, plus seed-level signs, so a reader can tell stable ranking from noisy N=1 luck.
- Verify: each adapter has exactly three `w.safetensors` files and three eval summaries; ranking table includes `n_seeds=3`, `mean_dd_delta`, `std_dd_delta`, and `sign_agreement`.
- Negative outcome -> claim: if adapter ranking changes across seeds or error bars overlap heavily, write "single-seed adapter winner is unstable; do not claim a family ranking yet."
- [ ] **Goal: Gemma 1B replication.**
- Why: check whether DeLoRA/PiSSA ranking is Qwen-specific.
- Do: train LoRA / PiSSA / DeLoRA on Gemma 1B, seed 0, full DD split.
- UAT: compare Gemma ranking to Qwen ranking with the same metrics.
- Verify: table has model column with `Qwen3-0.6B` and `Gemma-1B`; if DeLoRA remains best, expand seeds; if rankings diverge, write that up as a model-specific adapter-basin finding.
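
The multi-seed summary columns named above (`n_seeds`, `mean_dd_delta`, `std_dd_delta`, `sign_agreement`) could be computed roughly as below. This is a sketch under assumptions: `sign_agreement` is taken to be the fraction of seeds whose sign matches the sign of the mean, the `summarize_seeds` helper is hypothetical, and every number is made up.

```python
from statistics import mean, stdev


def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def summarize_seeds(dd_delta_by_seed):
    """Collapse per-seed DD deltas into one ranking-table row."""
    values = list(dd_delta_by_seed.values())
    m = mean(values)
    agree = sum(sign(v) == sign(m) for v in values) / len(values)
    return {
        "n_seeds": len(values),
        "mean_dd_delta": m,
        "std_dd_delta": stdev(values),  # sample std, needs at least 2 seeds
        "sign_agreement": agree,
    }


# Made-up per-seed deltas, keyed by adapter then seed.
per_seed = {
    "lora": {0: 0.10, 1: -0.05, 2: 0.20},
    "delora": {0: 0.90, 1: 1.10, 2: 1.00},
}

table = {adapter: summarize_seeds(seeds) for adapter, seeds in per_seed.items()}
for adapter, row in table.items():
    print(adapter, row)
```

In this toy data the `lora` row has a seed with the opposite sign, which is the kind of instability a single-seed table cannot show.
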
## TODO: analysis question
**Status (2026-04-27): on hold pending multiseed.** T6/T7/T8 are run on
N=1 seed × Qwen3-0.6B. Necessity is established. The cross-adapter
inconsistency that drove the "no parameterization-invariant mechanism"
reading might be seed noise. Resume after T4 (multiseed) lands and we can
report within-adapter stdev alongside cross-adapter gaps.
Active sequence at the time of pause was:
1. Cross-adapter causal `dW` basis ablation.
2. Layer/module causal ablation of trained `dW`.
3. Adapter-parameterization causal ablation of trained `dW`.
Synthetic `dW'` construction is deferred below and is not a causal ablation.
- [x] **Goal: cross-adapter causal `dW` basis ablation.**
- Why: this is the headline analysis experiment. It tests whether different adapter families discovered the same causal planning subspace or different basins.
- Do: build candidate bases `B` from trained adapter deltas, compute `dW_keep_B` and `dW_drop_B`, and evaluate both on sycophancy + full DD for each adapter.
- Candidate `B` rows:
  - `shared_SVD_K8/K32/K64`: stack residual-output `dW` from LoRA / DoRA / PiSSA / DeLoRA / OFT per layer/tensor, take top-K SVs.
  - `top8/top32_per_adapter` and `tail_per_adapter`: per-adapter SVD split of each tensor.
  - `random_null`: rank-matched random bases.
- UAT: one central table has `ablation_family`, `candidate_B`, `adapter`, `rank`, `keep_or_drop`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
- Verify: the table contains keep/drop rows for `shared_svd`, `per_adapter_svd`, and `random_null`; `keep_B_shared_K32` and `drop_B_shared_K32` are both evaluated for at least LoRA / DoRA / PiSSA / DeLoRA / OFT; random null retention is near rank/d; each row uses the same eval rows and coefficient grid.
- Positive outcome -> claim: if `keep_B_shared` retains >=0.7x behavior across adapters and `drop_B_shared` removes it, write the adapter-invariant planning-subspace paper.
- Negative outcome -> claim: if `keep_B_shared` retains <0.3x even at K=64 while complements/tails retain behavior, write the shared-subspace negative result: steering is distributed or lives in the wrong parameter space for these bases.
- Ambiguous outcome -> claim: if both keep and drop retain high behavior, report non-identifiability under this basis family and move to stricter causal interventions, not a positive subspace claim.
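
A minimal sketch of the keep/drop construction, assuming `B` is an orthonormal basis of the output (residual-write) space, so that `dW_keep_B = B Bᵀ dW` and `dW_drop_B = dW - dW_keep_B`. The helper names, the column-stacking choice for the shared SVD, and the synthetic deltas are assumptions; the real experiment is in `src/ws/eval/cross_adapter_ablation.py`.

```python
import numpy as np


def shared_basis(dWs, k):
    """Top-k left singular vectors of the column-stacked deltas."""
    stacked = np.concatenate(dWs, axis=1)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :k]


def random_basis(d, k, rng):
    """Rank-matched random orthonormal basis (the random_null control)."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q


def keep_drop(dW, B):
    """Split dW into its projection onto span(B) and the complement."""
    dW_keep = B @ (B.T @ dW)
    return dW_keep, dW - dW_keep


def energy_frac(part, full):
    """Fraction of squared Frobenius norm carried by `part`."""
    return float(np.sum(part**2) / np.sum(full**2))


rng = np.random.default_rng(0)
d_out, d_in, k = 32, 16, 4

# Synthetic deltas: a shared low-rank part plus small adapter-specific noise.
shared_dirs = random_basis(d_out, k, rng)
dWs = {name: shared_dirs @ rng.normal(size=(k, d_in))
       + 0.05 * rng.normal(size=(d_out, d_in))
       for name in ("lora", "pissa", "delora")}

B_shared = shared_basis(list(dWs.values()), k)
B_null = random_basis(d_out, k, rng)

results = {}
for name, dW in dWs.items():
    for label, B in (("shared_svd", B_shared), ("random_null", B_null)):
        dW_keep, dW_drop = keep_drop(dW, B)
        results[(name, label)] = energy_frac(dW_keep, dW)
        print(f"{name:7s} {label:12s} keep energy = {results[(name, label)]:.3f}")
```

Energy retained in weight space is only a proxy. The retention figures in the claims above refer to behavior (`syc_delta`, `dd_delta`), which has to be measured by evaluating the model with `dW_keep_B` and `dW_drop_B` applied.
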
Note: for the following two ablations, a hypothesis search has already been done; see docs/hypothesis_ablation_catalog.md.
- [x] **Goal: layer/module causal ablation of trained `dW`.**
- Why: after a trained update works, we need to know which layers and modules are necessary or sufficient.
- Do: keep/drop parts of the already-trained adapter delta by layer and module family, without synthesizing new tensors from base features.
- Rows: `full_dW`, `residual_write_only`, `attn_o_proj_only`, `mlp_down_proj_only`, `layers_8_21_only`, single-layer keep, leave-one-layer-out, coarse early/mid/late LoRA-layer blocks, rank/module-matched random controls, and `zero`.
- UAT: one table has `adapter`, `variant`, `layer_or_block`, `module_family`, `keep_or_drop`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
- Verify: same sycophancy and full DD rows as `full_dW`; table includes all required variants and reports zero row-key symmetric difference for every variant x coeff group.
- Positive outcome -> claim: if a small layer/module slice retains most behavior and dropping it removes behavior, report the causal locus.
- Negative outcome -> claim: if many disjoint slices retain behavior, report distributed or non-identifiable layer/module localization.
- [x] **Goal: adapter-parameterization causal ablation of trained `dW`.**
- Why: adapter families may store the behavior in different parameterization degrees of freedom even when their effective `dW` looks similar.
- Do: split the trained adapter/effective delta according to the adapter family's own coordinates, then keep/drop each component on identical eval rows. For an S-space split, compute the trained effective matrix's SVD-like coordinate system, project `dW -> S`, crop a component such as the top 25% of `S` by coordinate index, project back to weight space, and evaluate both `top_25pct_S` and `residual_not_top_25pct_S` against `full_dW` and `zero`.
- Rows: LoRA/PiSSA/DeLoRA rank components and S-space quartiles (`top_25pct_S`, `mid_50pct_S`, `bottom_25pct_S`, `residual_not_top_25pct_S`, `residual_not_bottom_25pct_S`); cumulative S-energy groups (`top_50pct_energy_S`, `top_90pct_energy_S`, residuals); DoRA direction vs magnitude component; OFT rotation-derived component vs residualized effective update; IA3 attention-gate vs MLP-gate groups.
- UAT: one table has `adapter`, `parameterization_family`, `coordinate_system`, `component`, `keep_or_drop`, `rank_or_group`, `energy_frac`, `syc_delta`, `dd_delta`, `pmass`, and row-identity checks.
- Verify: all rows start from the trained adapter delta or trained adapter parameters; no row is constructed from base-only activations; every component shares the same sycophancy and DD row keys as `full_dW`; for each S-space crop, `component_dW + residual_dW` reconstructs `full_dW` within numerical tolerance.
- Positive outcome -> claim: if one parameterization component retains most behavior and dropping it removes behavior, report which degree of freedom carries the learned behavior.
- Negative outcome -> claim: if behavior is not localized by parameterization component, report the trained effect as distributed across that adapter parameterization.
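
The S-space crop and its reconstruction check can be sketched like this, assuming the "SVD-like coordinate system" is the plain SVD of the trained delta and that quartiles are taken by singular-value index. The `s_space_crop` helper and the random `dW` are illustrative, not the implementation in `src/ws/eval/parameterization_ablation.py`.

```python
import numpy as np


def s_space_crop(dW, lo_frac, hi_frac):
    """Keep singular coordinates in [lo_frac, hi_frac) by index; return (component, residual, energy_frac)."""
    U, S, Vh = np.linalg.svd(dW, full_matrices=False)
    n = len(S)
    mask = np.zeros(n, dtype=bool)
    mask[int(round(lo_frac * n)):int(round(hi_frac * n))] = True
    # Project back to weight space from the kept and the dropped coordinates.
    component = (U[:, mask] * S[mask]) @ Vh[mask]
    residual = (U[:, ~mask] * S[~mask]) @ Vh[~mask]
    energy = float(np.sum(S[mask] ** 2) / np.sum(S ** 2))
    return component, residual, energy


rng = np.random.default_rng(0)
full_dW = rng.normal(size=(12, 8))  # stand-in for one trained delta tensor

crops = {
    "top_25pct_S": (0.0, 0.25),
    "mid_50pct_S": (0.25, 0.75),
    "bottom_25pct_S": (0.75, 1.0),
}
for name, (lo, hi) in crops.items():
    component, residual, energy = s_space_crop(full_dW, lo, hi)
    # Verify step: component + residual must reconstruct the full delta.
    assert np.allclose(component + residual, full_dW)
    print(f"{name:15s} energy_frac = {energy:.3f}")

top, not_top, top_energy = s_space_crop(full_dW, 0.0, 0.25)
```

Here `top` plays the role of `top_25pct_S` and `not_top` the role of `residual_not_top_25pct_S`; both would then be evaluated against `full_dW` and `zero` on identical rows.
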
## Coverage gaps in current ablation set
The three causal ablations above (cross-adapter `dW` basis, layer/module,
adapter parameterization) leave some hypotheses untested. These are open
follow-ups, not blockers for the current writeup.
- [ ] **Read-side modules in the layer/module ablation.** Current variants
cover residual writes (`o_proj`, `down_proj`), attention-only, and
mlp-only, but not q/k/v-only or up/gate-only. Any read-side mechanism
story is currently untestable.
- [ ] **Base-W SVD lens for the S-space ablation.** `parameterization_ablation.py`
uses each tensor's own SVD (`dW = U S Vh`). The catalog also wants a
separate lens using the base weight's SVD (`U0, S0, V0h = svd(W_base)`, then
`dS = U0.T @ dW @ V0h.T`), which answers "does `dW` ride pretrained
singular directions" rather than "is `dW` low-rank in its own basis".
- [ ] **Adapter-architecture decompositions.** S-space variants do not
include DoRA magnitude vs direction, DeLoRA lambda vs direction, OFT
rotation, or IA3 attention-gate vs MLP-gate splits.
- [ ] **Norm-matched random keep control for T8 sufficiency claims.**
Layer/module ablation has `random_norm_matched_full`; the S-space crops
do not. Necessity (drop) tests don't need this; sufficiency (keep) tests
do, because cropping shrinks Frobenius norm and the model is nonlinear
in alpha.
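
For the base-W SVD lens in the list above, a small sketch of the change of coordinates. Projecting onto the right singular vectors uses `V0h.T`, which keeps the shapes consistent for non-square tensors. The `diag_energy_frac` summary statistic is an assumption introduced here for illustration, as are the toy matrices.

```python
import numpy as np


def base_svd_lens(W_base, dW):
    """Express dW in the singular-vector coordinates of the base weight."""
    U0, S0, V0h = np.linalg.svd(W_base, full_matrices=False)
    dS = U0.T @ dW @ V0h.T  # shape (k, k) with k = min(W_base.shape)
    return dS, S0


def diag_energy_frac(dS):
    """Share of dS energy on the diagonal, i.e. pure rescaling of base singular values."""
    return float(np.sum(np.diag(dS) ** 2) / np.sum(dS ** 2))


rng = np.random.default_rng(0)
W_base = rng.normal(size=(6, 4))
U0, _, V0h = np.linalg.svd(W_base, full_matrices=False)

# A delta that only rescales the top-2 pretrained singular directions.
riding = U0[:, :2] @ np.diag([1.0, 0.5]) @ V0h[:2]
# A delta with no special relationship to the base weight.
unrelated = rng.normal(size=(6, 4))

dS_riding, _ = base_svd_lens(W_base, riding)
dS_unrelated, _ = base_svd_lens(W_base, unrelated)
print("riding    diag frac:", round(diag_energy_frac(dS_riding), 3))
print("unrelated diag frac:", round(diag_energy_frac(dS_unrelated), 3))
```

A diagonal-heavy `dS` would suggest the update mostly rescales pretrained directions. Off-diagonal mass means it mixes them, and energy falling outside `span(U0)` is not visible to this lens at all.
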
## Deferred / optional
- [ ] **Optional future: constructive synthetic `dW'` baseline.**
- Why: useful as a method baseline, but it is not a causal ablation of trained weight steering.
- Do: only if separately approved, build simple `dW_prime = f(W_base, persona_contrast)` candidates, e.g. lm-head/readout rowspace projected persona contrast, write-not-read persona contrast, and shared structural bases with signed coefficients from activation contrast.
- UAT: table compares synthetic `dW_prime` to trained `dW`, prompt, and repeng on identical sycophancy + DD rows.
- Verify: candidates are generated before reading trained adapter deltas; code fails if `w.safetensors` is loaded before constructing `dW_prime`.
- Positive outcome -> claim: if a synthetic `dW_prime` steers, weight steering may be replaceable by a constructive method baseline.
- Negative outcome -> claim: if no synthetic candidate steers while trained `dW` does, training is doing nontrivial search not captured by the current structural recipes.
- [ ] **Goal: SVD steering baseline.**
- Why: useful only if cheap and stable; lower priority than repeng.
- UAT: same DD/sycophancy table as other baselines.
- Verify: table includes `method=svd_steering`, `layer`, `rank`, `coeff`, `syc_delta`, `dd_delta`, and `pmass`.
- Negative outcome -> claim: if SVD steering is weak or unstable, do not treat plain base-weight SVD as a competitive method baseline.
- [ ] **Goal: degradation benchmark.**
- Why: steering might improve target metric while damaging general behavior.
- UAT: perplexity or clean instruction proxy reported for best coefficients.
- Verify: table has target metric and degradation metric for the exact same selected coefficients.
- Negative outcome -> claim: if target gains require large degradation, report steering as brittle rather than useful.
- [ ] **Goal: larger model replication.**
- Why: Qwen3-0.6B and Gemma 1B are iteration models; larger model needed for a stronger claim.
- UAT: same benchmark table on a 4B-ish model after method stabilizes.
- Verify: model column includes the 4B-ish model and reuses the same prompt/DD row IDs as the small-model benchmark.
- Negative outcome -> claim: if the effect disappears or reverses on the larger model, write the small-model limitation instead of scaling the claim.
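
Many Verify items in this plan require that compared methods are scored on identical rows, for example "zero row-key symmetric difference". A minimal sketch of such a check, assuming each result row carries an `idx` and a `coeff` field; the helper name and the key choice are hypothetical.

```python
def row_key_symmetric_difference(reference_rows, variant_rows, key_fields=("idx", "coeff")):
    """Return row keys present in exactly one of the two result sets."""
    ref_keys = {tuple(row[f] for f in key_fields) for row in reference_rows}
    var_keys = {tuple(row[f] for f in key_fields) for row in variant_rows}
    return ref_keys ^ var_keys


# Made-up result rows for a reference run and two variants.
full_dw = [{"idx": i, "coeff": c, "dd_delta": 0.0} for i in range(4) for c in (-1, 0, 1)]
same_rows = [dict(r) for r in reversed(full_dw)]  # same keys, different order
missing_row = [r for r in full_dw if not (r["idx"] == 3 and r["coeff"] == 1)]

print(row_key_symmetric_difference(full_dw, same_rows))
print(row_key_symmetric_difference(full_dw, missing_row))
```

An empty result means the two tables can be compared row for row. Anything else should fail the run before deltas are reported.
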
## Decision rules
- If prompt or activation steering beats `dW`, prioritize method improvement before deeper mechanistic analysis.
- If activation steering matches `dW`, treat weight steering as of mechanistic interest first and as an applied method second.
- If DeLoRA wins across Qwen and Gemma, spend seeds on DeLoRA/PiSSA only.
- If Qwen and Gemma adapter rankings diverge, write the model-specific adapter-basin finding instead of forcing one global winner.
- Shared-core rule: if `keep_B_shared_K32` retains >=0.7x behavior across LoRA / DoRA / PiSSA / DeLoRA / OFT and `drop_B_shared_K32` removes most of it, write the planning-subspace paper.
- Basin-divergence rule: if per-adapter top subspaces are mutually low-overlap and each adapter's own SVD keeps behavior better than `B_shared`, write the basin-divergence paper.
- If top-k or write-not-read keeps behavior, we found a simple steering parameterization.
- If complement/tail/many layers keep behavior, evidence favors distributed or wrong-space mechanism.
- If MLP `up/gate` terms carry behavior, next paper story should be feature-space steering, not residual-stream planning subspace.
- Held-out cross-adapter shared-SVD rule (contingent on T4 multiseed):
  - If T4 shows within-adapter stdev > cross-adapter retained gap on lens-3, the N=1 cross-adapter inconsistency was seed noise -> held-out shared-SVD becomes worth running.
  - If T4 confirms cross-adapter gap is real (current expectation), skip held-out shared-SVD -- it will fail by construction and only restates what is already known.
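
The held-out shared-SVD rule can be written as a small check. This sketch assumes "within-adapter stdev" means the largest per-adapter sample std across seeds and "cross-adapter retained gap" means the spread between adapter means; both readings and the helper name are assumptions. Only the first value in each `delora`/`lora` list (+1.27, +0.14) comes from this plan; the other seed values are made up.

```python
from statistics import mean, stdev


def held_out_shared_svd_decision(retained_by_adapter):
    """Compare seed noise within adapters to the gap between adapters."""
    means = {a: mean(v) for a, v in retained_by_adapter.items()}
    within = max(stdev(v) for v in retained_by_adapter.values())
    gap = max(means.values()) - min(means.values())
    decision = "run_held_out_shared_svd" if within > gap else "skip_held_out_shared_svd"
    return {"within_adapter_std": within, "cross_adapter_gap": gap, "decision": decision}


# Scenario A (made-up seeds 1-2): the N=1 gap was mostly seed noise.
noisy = {"delora": [1.27, 0.10, 0.60], "lora": [0.14, 0.90, 0.50]}
# Scenario B (made-up seeds 1-2): the gap persists across seeds.
stable = {"delora": [1.27, 1.20, 1.31], "lora": [0.14, 0.20, 0.10]}

print(held_out_shared_svd_decision(noisy))
print(held_out_shared_svd_decision(stable))
```
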
## Resolved: train/eval axis switch from sycophancy to honesty (2026-04-27)
**Decision: option 2.** Switched the entire pipeline to the honesty axis. Rationale:
- The sycophancy NEG persona was already "you are an honest assistant", so the prior contrast was effectively honesty-vs-anti-honesty with extra sycophancy framing on the POS side. Cleaner to make that explicit.
- Eval (`daily_dilemmas-self-honesty`) signs by `honesty_label`. One-axis training matches the eval axis by construction. The in-sample claim is now defensible.
- Sycophancy becomes a free OOD transfer eval (run later if useful) rather than a confounded primary axis.
- Avoids the GPT-4o sycophancy-eval generation burden (option 1) and its eval-validation overhead.
What changed in code (commit on `dev`):
- `src/ws/data.py`: added `HONESTY_PROMPT = "Pretend you're {persona} person. "` with 5 paraphrases per side (`HONESTY_POS_PERSONAS`, `HONESTY_NEG_PERSONAS`), following the vgel/repeng short-persona convention. Added `_load_suffixes()` reader for `data/branching_suffixes.json` (550 SSteer entries). Honesty branch reuses the suffix `user_msg` as the question pool; persona-conditioned generation in `generate_pairs()` produces (r_pos, r_neg) per the paper recipe (Fierro & Roger 2025 §F.1).
- `src/ws/eval/activation_baseline.py`: RepE direction extraction now branches on `cfg.behavior`; honesty mode captures last-token hidden states under `HONESTY_POS[0]` / `HONESTY_NEG[0]` over `_load_suffixes()` prompts with `assistant_prefixes=suffix`.
- `src/ws/eval/prompt_baseline.py`: replaced single `engineered_prompt` with paired `engineered_prompt_honest` + `engineered_prompt_dishonest` (AxBench Appendix J.2 style).
- `evals/smoke.py`: added `behavior` field; `just smoke --behavior honesty` passes end-to-end on `katuni4ka/tiny-random-qwen3`.
- `data/branching_suffixes.json`: copied from SSteer.
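
To illustrate how the persona template turns one question into a contrast pair, here is a sketch. Only the `HONESTY_PROMPT` string is taken from the notes above. The persona phrases, the round-robin pairing, the prefix-style concatenation, and the `build_contrast_prompts` helper are hypothetical; the real lists in `src/ws/data.py` have 5 entries per side and may be applied differently (for example as a system prompt).

```python
HONESTY_PROMPT = "Pretend you're {persona} person. "

# Hypothetical persona phrases, for illustration only.
HONESTY_POS_PERSONAS = ["an honest", "a truthful"]
HONESTY_NEG_PERSONAS = ["a dishonest", "a deceptive"]


def build_contrast_prompts(user_msgs, pos_personas, neg_personas):
    """Pair every question with one positive and one negative persona prompt."""
    pairs = []
    for i, msg in enumerate(user_msgs):
        pos = pos_personas[i % len(pos_personas)]
        neg = neg_personas[i % len(neg_personas)]
        pairs.append({
            "user_msg": msg,
            "prompt_pos": HONESTY_PROMPT.format(persona=pos) + msg,
            "prompt_neg": HONESTY_PROMPT.format(persona=neg) + msg,
        })
    return pairs


# Made-up questions standing in for the `user_msg` field of the suffix file.
questions = ["Did you finish the report?", "Is this product safe?", "Who broke the vase?"]
pairs = build_contrast_prompts(questions, HONESTY_POS_PERSONAS, HONESTY_NEG_PERSONAS)
for p in pairs:
    print(p["prompt_pos"], "|", p["prompt_neg"])
```
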
Sycophancy outputs in `out/sycophancy/` are kept on disk as historical evidence for the old axis-mismatched table. The README headline numbers will be replaced with honesty once 230-236 land. T4/T5 stay open and will be re-scoped against honesty.