docs: rewrite fork plan with UAT tasks

wassname
2026-04-27 11:22:52 +08:00
parent a3d999fd92
commit 8fa9e54eaa
+154 -336
@@ -1,344 +1,162 @@
## Context
# Fork plan: weight steering benchmark + analysis
This is a fork of the excellent weight steering work.
Updated: 2026-04-27
> We isolate behavior directions in weight-space by subtracting the weight deltas from two small fine-tunes - one that induces the desired behavior on a narrow distribution and another that induces its opposite.
> To obtain a vector in weight space corresponding to the desired trait, we start from a model θ0 then fine-tune the model on either the data generated with the positive system prompt (stripped of the system prompt at train-time) to obtain θ+, or on the data generated with the negative system prompt to obtain θ−, the weight-space vector corresponding to the behavior is then computed as w=θ+−θ−. We use LoRA fine-tuning as we found it worked better for monitoring than full-parameter fine-tuning.
## Goal
Test whether weight steering is a useful method, and if it is, understand what part of the learned weight delta carries the behavior.
Now I'm interested in
- replicating
- seeing if the model difference aligns with the SVD of W, or with any of the subspaces I defined in ./docs/AntiPaSTO_concepts/
- and most importantly seeing if other types of adapters work better!
- and becoming clear on
- does it generalise
- does performance degrade
- this likely means using it on one of the evals I'm familiar with, namely daily dilemmas from AntiPaSTO or eval awareness (but this requires renting a GPU, so this is later)
Two questions are intentionally separated:
## Resources
- **my lit review of PeFT adapter methods** ./docs/blog_adapter_as_hypothesis/README.md
- my steering concepts ./docs/AntiPaSTO_concepts/README.md
- orig paper
- docs/weight_steering_paper.md
- docs/weight_steer_blog.md
## TODO
- [x] plan to clean up the repo. uv, jaxtyping, einops. hooks not classes. remove vlm
- [x] make it work on small models (Qwen3-0.6B), cheap+fast iteration
- [x] hook in PEFT (LoRA / DoRA / PiSSA / DeLoRA via peft>=0.13)
- [x] phase 1 replicate: w = θ+ - θ- on Qwen3-0.6B sycophancy, monotone logratio (task 40)
- [x] phase 2 weight-only subspace alignment (SVD-of-W, weak-readout) — *negative result, see "Phase 2 reframe" below*
- [x] phase A demos: adapter coherence + guided-CoT under w (task 44 — pmass=1.0, margin α-monotone, no teacher-forcing gap, OOD generalizes)
- [ ] phase B: train.py val split done; 3-epoch re-run still pending
- [ ] phase 2.5: activation-aware subspace tests — TaskDiff / Suppressed / Stenographic
- [ ] **wishlist W**: layer slice 30-80% LoRA targets (steering literature locus); `train.py:LINEAR_TARGETS` patch
- [ ] **wishlist N**: `notebooks/analyze_diff.py` (.py # %% cells) — W-side (SVD spectrum, polar decomp, suppressed-PCA, magnitude-vs-direction) + A-side (Δa via baukit at α=±1, per-layer residual/attn/MLP locus, cosine to dW directions)
- [ ] phase 3 adapter sweep (DoRA / PiSSA / DeLoRA)
- [ ] phase 4 daily-dilemmas eval (mirror AntiPaSTO2/antipasto2/eval.py)
- [ ] **paper-deltas** (task 18, 16): match data recipe (5+/5- × 10 samples + judge filter) and LoRA hyperparams (rank 32, α 16, lr 1e-5, warmup 5)
## Paper-deltas — what we match, what we deliberately skip
Audit of upstream Axolotl YAMLs vs current code. Tracked as tasks 16 + 18.
| upstream | ours | decision |
|---|---|---|
| 20 questions × 5 personas × 10 samples + GPT-4.1-mini filter (500-900 retained per sign) | 32 fixed claims × 1 persona, sample-replicated to 1000 | **fix** (task 18) |
| LoRA rank 32 / α 16 / lr 1e-5 / warmup 5 / wd 0.01 / no dropout | rank 16 / α 2*r=32 / lr 5e-5 / no warmup / no wd | **fix** (task 16) |
| `load_in_8bit: true`, `adamw_bnb_8bit` | `bf16` direct, plain AdamW | **skip** — DoRA/PiSSA/DeLoRA quantization support is uncertain; bf16 fits at 0.6B |
| `modules_to_save: [embed_tokens, lm_head]` | not saved | **skip** — user does not want to train/save these |
| `lora_target_linear: true` (all linear) | hand-picked q/k/v/o/gate/up/down_proj | **skip** — deliberate, this is all linear in the qwen3 transformer block anyway; matches `lora_target_linear` for the body |
| sequence length 4096 | 512 | **skip** — sycophancy responses are <128 tokens; 512 is plenty, 4096 would OOM at our batch size |
| `epochs` plumbed through | (was) silently ignored | **fixed** (replicate.py:71, 2026-04) |
| reuses on-disk data regardless of `n_pairs` | now hard-fails on mismatch | **fixed** (replicate.py:_maybe_data, 2026-04) |
---
# Fork plan: weight-steering → small-model + adapter sweep
1. **Benchmark question:** Does weight steering beat simple alternatives such as prompting and activation steering on sycophancy and daily-dilemmas honesty transfer?
2. **Analysis question:** If weight steering works, can the learned delta $dW = \theta^+ - \theta^-$ be factorized into a simpler causal intervention: a cross-adapter shared subspace, module, low-rank component, or adapter parameterization?
## Context
This is a fork of Anthropic's weight-steering work (θ+ - θ- via LoRA fine-tunes on +/- system-prompted data). The current repo is heavy: Axolotl orchestration, vLLM serving, and Anthropic/OpenAI batch APIs. None of that is needed for what wassname actually wants:
1. **Replicate** the core method on a small model so iteration is cheap.
2. **Test alignment** between the diff vector `w = θ+ - θ-` and the SVD-derived subspaces from `docs/AntiPaSTO_concepts/` (suppressed, write-not-read, weak-readout, stenographic).
3. **Test other PEFT adapter families** (DoRA, PiSSA-init LoRA, DeLoRA) to see if the steering signal extracts more cleanly under different parameterizations - this is the "adapter as hypothesis" framing from `docs/blog_adapter_as_hypothesis/`.
4. **Generalization** via daily-dilemmas eval (later, GPU-gated).
The original paper itself notes "we did not try to optimize weight steering very hard" - room for both methodological cleanup and substantive method comparison.
User decisions captured: Qwen3-0.6B base, aggressive cleanup (rip Axolotl + VLM, switch to HF+PEFT), both sycophancy (paper replication) and daily-dilemmas (own eval), adapter sweep over LoRA / DoRA / PiSSA-init / DeLoRA.
## Phase 0 — Repo cleanup (breaking, no backcompat)
**Delete:**
- `vllm_inference.py` (565 lines, vLLM serving)
- `api_inference.py` (964 lines, Anthropic/OpenAI batch)
- `axolotl_plugin_models_with_mlp_bias.py`, `axolotl_configs/`
- `inference_and_eval.py` Axolotl-subprocess orchestration (keep nothing - rewrite small)
- `models_with_mlp_bias.py` - replace with hooks; the MLP-bias variant isn't needed for the core θ+ - θ- replication
**Add:**
- `pyproject.toml` with uv (`torch`, `transformers`, `peft>=0.13` for DeLoRA, `datasets`, `einops`, `jaxtyping`, `beartype`, `loguru`, `polars`, `tabulate`, `baukit` from git, `wandb`)
- `justfile` with: `smoke` (5-min run), `train-pos`, `train-neg`, `diff`, `eval-syco`, `eval-dilemmas`, `subspace-align`
- `.python-version` (3.11)
**Keep + simplify:**
- `task_vectors.py` — strip down to a functional `compute_diff(state_dict_pos, state_dict_neg) -> dict` and `apply_diff(model, diff, alpha)`. Drop the class hierarchy and arithmetic ops; we only need subtract + scaled add.
- `activation_steering.py` — already hook-based; replace manual hooks with `baukit.TraceDict` for cleanliness (per user CLAUDE.md preference).
**New layout:**
```
weight-steering/
├── src/ws/
│ ├── data.py # +/- system-prompt pair data generation (sycophancy first)
│ ├── train.py # PEFT-based finetune; one function per adapter type
│ ├── diff.py # compute_diff, apply_diff (functional, ~50 lines)
│ ├── steer.py # inference-time scaled application via baukit hooks
│ ├── subspace.py # SVD projections, AntiPaSTO subspaces, alignment metrics
│ └── eval/
│ ├── sycophancy.py
│ └── dilemmas.py # mirrors AntiPaSTO2/eval.py pattern
├── scripts/
│ ├── replicate.py # phase 1 entrypoint
│ ├── adapter_sweep.py # phase 3 entrypoint
│ └── subspace_align.py
├── notebooks/ # exploratory only
└── justfile, pyproject.toml, .python-version
```
## Phase 1 — Replicate on Qwen3-0.6B with sycophancy
**Data:** Generate +/- pairs using sycophantic vs honest system prompts on a sycophancy QA distribution (paper Appendix E recipe). Strip system prompt at train time. Target ~500-1000 pairs to keep iteration fast.
**Train:** PEFT LoRA, rank 16, all linear layers, lr 5e-5, 1 epoch, bf16. Save θ+ and θ- as PEFT adapter state dicts. With Qwen3-0.6B + LoRA this should fit comfortably on a single 24GB card and train in ~10-20 min per side.
**Diff:** `w = θ+ - θ-` in adapter-merged weight space (merge LoRA into a delta dict, then subtract). Functional, no class wrapper.
**Apply at inference:** Add `alpha * w` to base weights via baukit hook on each affected `nn.Linear` (no in-place modification of base model). Sweep `alpha ∈ [-2, -1, 0, 1, 2]`.
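A minimal sketch of the diff and apply steps above, assuming plain NumPy arrays stand in for weight tensors. `lora_delta` and `toy_adapter` are hypothetical helpers, and the real plan applies `alpha * w` through a baukit hook rather than building new weight dicts. Because both fine-tunes start from the same base, `θ+ - θ-` reduces to the difference of the two merged LoRA deltas.

```python
import numpy as np

def lora_delta(A, B, lora_alpha, r):
    """Merged delta of one LoRA module: (lora_alpha / r) * B @ A."""
    return (lora_alpha / r) * (B @ A)

def compute_diff(delta_pos, delta_neg):
    """w = delta_pos - delta_neg per tensor name (the base weights cancel)."""
    return {name: delta_pos[name] - delta_neg[name] for name in delta_pos}

def apply_diff(base, diff, alpha):
    """Return base + alpha * w as new arrays; `base` itself is not modified."""
    return {name: W + alpha * diff.get(name, 0.0) for name, W in base.items()}

rng = np.random.default_rng(0)
d_out, d_in, r, lora_alpha = 8, 6, 2, 4  # toy sizes, not the plan's hyperparameters
base = {"q_proj": rng.normal(size=(d_out, d_in))}

def toy_adapter():
    # Stand-in for a trained adapter: A is (r, d_in), B is (d_out, r).
    return rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))

delta_pos = {"q_proj": lora_delta(*toy_adapter(), lora_alpha, r)}
delta_neg = {"q_proj": lora_delta(*toy_adapter(), lora_alpha, r)}
w = compute_diff(delta_pos, delta_neg)

# Coefficient sweep: one steered copy of the weights per alpha.
steered = {alpha: apply_diff(base, w, alpha) for alpha in [-2, -1, 0, 1, 2]}
```

At `alpha = 0` the steered weights equal the base, and the diff of two rank-`r` deltas has rank at most `2r`, which is why diffing in merged delta-W space is cheap to sanity-check.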
**Smoke test:** Qualitative gen on 10 held-out sycophancy prompts, plus the per-coeff Yes/No logratio metric from `AntiPaSTO2/eval.py`.
## Phase A — Sanity demos on existing artifacts (cheap, no retraining)
Why: task 40's pipeline never generates a single sentence of model output.
The headline numbers (`mean_logratio +9.4 at α=+2`, `pmass=1.0`) are forward-pass-only,
single-token reads. We don't yet know:
1. Did the LoRAs converge or undertrain? Single epoch, slope -0.003/step at the end, no val loss.
2. Were the adapters coherent at the end? No generation anywhere.
3. Does the steering effect survive a 32-token rollout? Single-token logratio inflates vs on-policy reality (ROAST teacher-forcing gap).
4. Does w generalize off the training topic distribution? Eval is in-distribution (`held_out = SYCOPHANCY_TOPICS[-16:]`).
Two demos, both on the existing `out/sycophancy/lora/{pos,neg,w.pt}`:
- **A1** (`run_demo.py:phase_a1`): load base + pos LoRA, generate 80 tokens on 2 in-dist + 1 OOD claim. Same for neg. Pass = pos *agrees*, neg *pushes back*, both fluent. **Built**, not yet run.
- **A2** (`run_demo.py:phase_a2`, `eval/guided_cot.py`): for each (claim, alpha) pair, rollout 32 tokens of CoT under `weight_steer(model, w, alpha)`, append `"\n\nFinal answer: **"`, score `margin = logp_yes - logp_no` and `pmass = P(yes) + P(no)` at the next position. Per AntiPaSTO `docs/AntiPaSTO_concepts/README.md:467-477`: pmass≈1.0 in linear range, drops outside. **Built**, not yet run.
Run with `just demo`.
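The A2 score can be sketched as below, assuming a single token id for each of "Yes" and "No" (the real `get_choice_ids` may return several ids per choice). The four-token vocabulary and the logits are toy values, not results.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def score_choice(logits, yes_id, no_id):
    """margin = logp_yes - logp_no; pmass = P(yes) + P(no)."""
    logp = log_softmax(logits)
    margin = logp[yes_id] - logp[no_id]
    pmass = math.exp(logp[yes_id]) + math.exp(logp[no_id])
    return margin, pmass

# Toy vocab of 4 tokens; index 0 = "Yes", index 1 = "No" (hypothetical ids).
confident = score_choice([2.0, 0.0, -8.0, -8.0], yes_id=0, no_id=1)
diffuse = score_choice([2.0, 0.0, 3.0, 3.0], yes_id=0, no_id=1)
```

Both toy cases have the same margin, but only the first keeps pmass near 1. That is the reason to report pmass next to margin: a margin read on its own cannot show that the model has drifted off the Yes/No answer format.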
## Phase B — Convergence/overfit (only if A flags issue)
Patched `train.py` adds 10% val split + `eval_strategy="steps", eval_steps=10`.
Re-queue with 3 epochs:
```
pueue add -l "why: did task 40 LoRA converge or undertrain; resolve: val_loss curve flattens (converge), keeps dropping (undertrain), or U-curves (overfit)" -- uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior sycophancy --adapter lora --n-pairs 1000 --epochs 3
```
Reuses task 40's data on disk. ~6 min total.
## Phase 2 reframe — why activation-blind SVD-of-W was the wrong test
Task 40 measured energy of `w_layer` in the top-k×k corner of base SVD(W). Across 7 module kinds, all `ratio_top ≈ 1.0 ± 0.10` (per-layer std). I initially called this "SVD-alignment falsified."
That's the wrong reading. From `docs/AntiPaSTO_concepts/docs/steering_methods.qmd:340-343` (Common Misconceptions #3, "SVD(W) aligns with PCA(diffs)"):
> Wrong: Weight's principal directions should align with task-relevant activation differences.
> Right: SVD(W) captures variance across *all* computations; PCA(diffs) captures variance for *this task*. We measured ~0.08 cosine similarity — essentially orthogonal.
So task 40 didn't falsify a hypothesis; it reproduced a known prior result (SVD(W) is not the task basis). The Fisher table at `steering_methods.qmd:207-214` says the same thing differently:
| Subspace | Peak Fisher |
|---|---|
| weight_svd / write_minus_lm_head | 0.007–0.009 (Level 0) |
| task_diff / suppressed | 0.013–0.022 (Level 1) |
| **stenographic** (task ∩ suppressed) | **0.142** (Level 2) |
| task ∩ stenographic | 0.266 (Level 3) |
The *right* test is what wassname intuited: project task hidden states (or their differences) onto a basis, then test if `w` aligns with that. Three concrete activation-aware tests to add (replacing the Haar-null SVD-of-W test):
1. **TaskDiff alignment**: collect `h_pos[L]` and `h_neg[L]` on a probe set (using base model under +/- system prompts on training topics). PCA on `h_pos - h_neg`, top-k. Test if `w_layer`'s column space (the side that writes to residual) aligns with this. Null = random rank-r perturbation.
2. **Suppressed alignment**: per `steering_methods.qmd:67-110`, compute `min(Σrelu(Δmag+), Σrelu(Δmag-))` across layers, PCA. Suppressed has 3.5× enrichment for task signal vs random (`steering_methods.qmd:407-414`). Test `w` against this.
3. **Stenographic alignment**: TaskDiff ∩ Suppressed (canonical-angle bisector basis). Highest Fisher (0.142) per AntiPaSTO. If `w` doesn't align with *anything* including stenographic, the diff carries no task-relevant subspace structure.
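A rough sketch of the TaskDiff test (item 1), assuming uncentered PCA on the paired differences and an alignment score defined as the mean squared cosine of the principal angles. Both choices are assumptions, as are the function names and the synthetic activations, which stand in for real `h_pos[L]` and `h_neg[L]`.

```python
import numpy as np

def task_diff_basis(h_pos, h_neg, k):
    """Top-k directions of the paired differences h_pos - h_neg, shape (d, k)."""
    diffs = h_pos - h_neg                      # (n_probe, d), left uncentered
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k].T

def column_space(w, r):
    """Orthonormal basis of the top-r write directions of w (d_out, d_in)."""
    U, _, _ = np.linalg.svd(w, full_matrices=False)
    return U[:, :r]

def alignment(Q_task, U_w):
    """Mean squared cosine between the two subspaces, in [0, 1]."""
    return float(np.linalg.norm(Q_task.T @ U_w) ** 2 / U_w.shape[1])

rng = np.random.default_rng(0)
d, d_in, n, k, r = 64, 48, 200, 4, 4
T, _ = np.linalg.qr(rng.normal(size=(d, k)))   # planted task directions
h_neg = rng.normal(size=(n, d))
h_pos = h_neg + rng.normal(size=(n, k)) @ T.T + 0.01 * rng.normal(size=(n, d))
Q_task = task_diff_basis(h_pos, h_neg, k)

w_aligned = T @ rng.normal(size=(k, d_in))     # writes only into span(T)
w_random = rng.normal(size=(d, r)) @ rng.normal(size=(r, d_in))  # rank-r null

a_aligned = alignment(Q_task, column_space(w_aligned, r))
a_null = alignment(Q_task, column_space(w_random, r))
```

For a random rank-`r` perturbation the score should sit near `k / d`, so the null gives a concrete number to compare a trained `w_layer` against.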
The existing weak-readout test (`subspace.py:weak_readout_alignment`) is in spirit Logits_Null (`steering_methods.qmd:81`) — keep it.
Cleanup needed in `subspace.py` regardless:
- The current `e_top` only sums the (top-k × top-k) corner of `proj`, ignoring off-diagonal blocks `proj[:k, k:]` and `proj[k:, :k]`. For a "row-side aligned but col-side random" delta (which a LoRA `B@A` may produce when B is in W's col-space but A is not in W's row-space), this misses signal. Either measure all four blocks or restate the hypothesis.
- The reported ±0.10 was per-layer std over n=28 layers, not SE of the per-kind mean. Re-doing as SE: down_proj +2.2σ, v_proj 1.7σ from null. Bonferroni across 7 kinds kills these, but it's not "1.0 ± 0.1 across all kinds" — there is per-kind variation.
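To illustrate the first cleanup point, here is a sketch that measures all four blocks of the projection instead of only the top-k × top-k corner. `block_energies` is a hypothetical name, not the current `subspace.py` API, and the delta is synthetic: `B` lies in W's top-k column space while `A` is random.

```python
import numpy as np

def block_energies(W, w, k):
    """Squared energy of w in each block of U^T w V, split at index k."""
    U, _, Vt = np.linalg.svd(W, full_matrices=True)
    sq = (U.T @ w @ Vt.T) ** 2
    return {
        "top_top": float(sq[:k, :k].sum()),
        "top_tail": float(sq[:k, k:].sum()),
        "tail_top": float(sq[k:, :k].sum()),
        "tail_tail": float(sq[k:, k:].sum()),
    }

rng = np.random.default_rng(0)
d_out, d_in, k = 24, 32, 4
W = rng.normal(size=(d_out, d_in))
U, _, _ = np.linalg.svd(W, full_matrices=True)
w = U[:, :k] @ rng.normal(size=(k, d_in))  # B in W's top-k column space, A random
e = block_energies(W, w, k)
total = float((w ** 2).sum())
```

In this toy case most of the energy lands in `top_tail`, so a corner-only `e_top` would report a mostly unaligned delta even though its write side is fully aligned with W.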
## Phase 2 — Subspace alignment analysis (original plan; superseded by Phase 2 reframe + Phase 2.5)
For each layer's weight matrix W and its diff `w_layer`:
1. SVD of pretrained W → `U_out, S, U_in.T`.
2. Project `w_layer` onto top-k singular components; compute energy fraction vs uniform/random baseline.
3. Repeat for the four AntiPaSTO subspaces:
- **Suppressed** (PCA of layer-to-layer magnitude drops on a probe set)
- **Write-not-read** (orth complement of next layer's read span)
- **Weak-readout** (bottom-1% Vh of unembedding)
- **Stenographic** (intersection of task-diff and suppressed)
4. Output: a polars table per subspace with `{layer, energy_in_subspace, energy_random_baseline, ratio}`. Print with tabulate.
Critical: project the *adapter-space* delta when possible (rank-r is small) and compare against the same projections of random rank-r perturbations as the null. This makes the alignment claim falsifiable.
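A sketch of that falsifiability check, assuming the statistic is the fraction of Frobenius energy in the top-k corner and the null is a product of two Gaussian factors of rank `r`. The function names are hypothetical.

```python
import numpy as np

def corner_fraction(U_k, V_k, w):
    """Fraction of w's squared energy inside the top-k x top-k corner."""
    corner = U_k.T @ w @ V_k
    return float((corner ** 2).sum() / (w ** 2).sum())

def energy_ratio(W, w, k, r, n_null=200, seed=0):
    """Observed corner fraction, mean null fraction, and their ratio."""
    rng = np.random.default_rng(seed)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k].T
    observed = corner_fraction(U_k, V_k, w)
    null = [
        corner_fraction(
            U_k, V_k,
            rng.normal(size=(W.shape[0], r)) @ rng.normal(size=(r, W.shape[1])),
        )
        for _ in range(n_null)
    ]
    null_mean = float(np.mean(null))
    return observed, null_mean, observed / null_mean

rng = np.random.default_rng(1)
d_out, d_in, k, r = 32, 32, 8, 4
W = rng.normal(size=(d_out, d_in))
U, _, Vt = np.linalg.svd(W, full_matrices=False)
# Synthetic rank-r delta that lives entirely in W's top-k corner.
w_aligned = U[:, :k] @ rng.normal(size=(k, r)) @ rng.normal(size=(r, k)) @ Vt[:k]
observed, null_mean, ratio = energy_ratio(W, w_aligned, k, r)
```

The null fraction should average about `(k / d_out) * (k / d_in)`, so a ratio near 1 means "no better than random rank-r" and only ratios well above 1 support an alignment claim.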
## Phase 3 — Adapter sweep (the actual science)
For each adapter type, train +/- and produce a weight-space diff:
| Adapter | PEFT support | Hypothesis being tested |
|---|---|---|
| LoRA r=16 | built-in | baseline: low-rank suffices |
| DoRA r=16 | built-in (`use_dora=True`) | magnitude/direction split keeps diff cleaner |
| LoRA + PiSSA init | `init_lora_weights="pissa"` (built-in init mode) | principal components carry the steering signal |
| DeLoRA r=16 | built-in (peft >= 0.13) | strength/direction decoupling improves robustness |
Per adapter, log: train loss curves, time, peak mem, then phase-1 sweep + phase-2 alignment table. The cross-adapter comparison is the key result: **does the SVD/subspace alignment of `w` change when we change the parameterization?** That's evidence about whether the adapter itself is acting as an inductive bias on the steering direction (the "adapter as hypothesis" framing).
Scope guard: drop SSVD; user already excluded it. If DeLoRA blows up in PEFT, fall back to LoRA + PiSSA + DoRA.
## Phase 4 — Daily-dilemmas eval (CPU-feasible at 0.6B)
Build `src/ws/eval/dilemmas.py` mirroring `AntiPaSTO2/antipasto2/eval.py`
(fetched via `gh api repos/wassname/AntiPaSTO2/contents/antipasto2/eval.py`).
Reuse our existing primitives — don't re-implement choice scoring.
Source eval pipeline (key fields to mirror):
- **Dataset**: `wassname/daily_dilemmas-self-honesty`, config `honesty_eval`,
`split="test"`. Take top-N by `dilemma_idx` (default 100). Each row has
`dilemma_idx`, `idx`, `action_type`, `honesty_label` (+1/-1).
- **Prompt**: `INSTRUCTION_PROMPT.format(**row)` then assistant `"My choice: **"`,
built via `apply_chat_template(continue_final_message=True, add_generation_prompt=False)`.
*Vendor `INSTRUCTION_PROMPT` from AntiPaSTO2/antipasto2/data.py.*
- **Score**: yes/no logratio at last position (same as our `sycophancy.py`).
Reuse `ws/eval/sycophancy.py:get_choice_ids` — already identical to v2.
- **Honesty alignment** (the key v2 detail): `logratio_honesty = logratio * honesty_label`.
Positive = more honest. Aggregate this, not raw logratio — sign cancels otherwise.
- **Coeff sweep**: `[-1.0, 0.0, 1.0]` (default; can override).
- **Steering**: AntiPaSTO2 uses `ScaleAdapter(model, coeff, adapter_name)` (PEFT
scaling LoRA at inference). We use `weight_steer(model, w, alpha)` instead —
same shape (context manager scaling a delta), but on *the diff* w = θ⁺ − θ⁻,
not on a single adapter. Drop-in.
- **pmass flag**: `low_pmass = pmass < threshold * maxp` (threshold=0.01).
Don't filter — flag for analysis. Compare to our guided-CoT `pmass≈1.0` baseline.
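The honesty sign-flip can be sketched as follows. The rows hold made-up logratios (not results) for one honest-label and one dishonest-label action, and `summarize` is a hypothetical helper rather than the AntiPaSTO2 function.

```python
from statistics import mean

# Toy rows: same dilemma, opposite honesty labels, invented logratios per coeff.
rows = [
    {"dilemma_idx": 0, "honesty_label": +1,
     "logratio": {-1.0: -1.0, 0.0: 0.5, 1.0: 2.0}},
    {"dilemma_idx": 0, "honesty_label": -1,
     "logratio": {-1.0: 1.0, 0.0: -0.5, 1.0: -2.0}},
]

def summarize(rows, coeffs=(-1.0, 0.0, 1.0)):
    """Per-coeff mean of raw logratio and of logratio * honesty_label."""
    out = []
    for c in coeffs:
        raw = mean(r["logratio"][c] for r in rows)
        honest = mean(r["logratio"][c] * r["honesty_label"] for r in rows)
        out.append({"coeff": c, "mean_logratio": raw,
                    "mean_logratio_honesty": honest})
    return out

table = summarize(rows)
# delta(+1 - 0) on the sign-flipped metric
delta_plus = table[2]["mean_logratio_honesty"] - table[1]["mean_logratio_honesty"]
```

In the toy rows the raw mean is exactly zero at every coefficient because the two labels cancel, while the sign-flipped mean moves with the coefficient. That is the cancellation the bullet above warns about.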
Output: one polars table per adapter: `(adapter_type, coeff, mean_logratio_honesty,
mean_pmass, frac_low_pmass)`. Save per-row CSV for later regression on
`action_type`.
Wire as `ws/eval/dilemmas.py` + `evaluate()` entrypoint in `replicate.py`
(after sycophancy eval). `just eval-dilemmas adapter=lora` recipe.
## Phase 5 — Generalization + degradation (later, rented GPU)
Defer until phases 1-4 produce a clear winner. Then on a 4B model:
- Eval on held-out dilemma distribution + eval-awareness eval.
- Track perplexity on a clean instruction-following set as a degradation proxy.
## Critical files to modify / reference
- **Modify heavily:** `task_vectors.py`, `activation_steering.py`
- **Delete:** `vllm_inference.py`, `api_inference.py`, `axolotl_plugin_models_with_mlp_bias.py`, `models_with_mlp_bias.py`, `inference_and_eval.py`, `axolotl_configs/`
- **Reference (read-only):** `docs/weight_steering_paper.md` (Appendix B/E hyperparams), `docs/AntiPaSTO_concepts/README.md` (subspace definitions), `docs/blog_adapter_as_hypothesis/README.md` (adapter scoring)
- **Mirror:** AntiPaSTO2 `antipasto2/eval.py` (eval pattern, choice-id extraction, ScaleAdapter context manager)
## Reuse, don't reinvent
- `peft.LoraConfig(use_dora=True, init_lora_weights="pissa")` for DoRA and PiSSA-init - no custom code.
- `peft.DeloraConfig` for DeLoRA (peft >= 0.13).
- `baukit.TraceDict` for steering hooks (per user CLAUDE.md).
- AntiPaSTO2's `_is_choice`, `get_choice_ids`, `get_choice_logprobs`, `evaluate_at_coeff` - copy or vendor.
- `loguru` + `tabulate(df, tablefmt='pipe', headers='keys', floatfmt='+.2f')` for log output.
## Verification
End-to-end checks the user can read at a glance:
1. **Phase 0 done when:** `just smoke` runs in <5 min on Qwen3-0.6B, generates 5 +/- pairs, trains a LoRA on each, computes `w`, applies at coeff ±1, prints generations side by side. Single command, no Axolotl, no vLLM.
2. **Phase 1 done when:** sycophancy logratio on the held-out set increases monotonically as coeff goes from -2 to +2, table printed via tabulate.
3. **Phase 2 done when:** for each AntiPaSTO subspace, an `energy_ratio = energy_in_subspace / energy_random` table is produced. Ratio > 1 with bootstrap CI not crossing 1 = real alignment.
4. **Phase 3 done when:** the four-row table `(adapter × subspace_alignment × steering_logratio_AUC)` exists and is interpretable.
5. **Phase 4 done when:** daily-dilemmas table is reproducible from a single `just eval-dilemmas adapter=lora` command.
User-observable result throughout: a markdown table per phase, not an "I did it." Each table answers one question.
## Open questions to resolve during implementation (not blockers)
- Sycophancy data: regenerate using Qwen3-0.6B as the +/- responder, or use the paper's released data if available? (Default: regenerate with Qwen3-0.6B since 0.6B's distribution differs from 7B's.)
- Layer selection for the diff: paper does per-layer sweeps (Appendix E). For phase 1 just take all layers; for phase 3, sweep.
- Whether to merge adapter into base before diffing or diff in adapter space directly. Adapter-space is cheaper but only valid when both +/- adapters share the same A or B (PiSSA init shares both initially; LoRA does not). Default: merge into delta-W space, then diff. This makes all adapters comparable.
# 2026-04-27 09:54:33
Yes. These are **parameterization / factorization tests** of `dW`.
Minimal plan:
1. **Baseline**
- base model
- prompt baseline
- activation steering baseline
- full `dW`
2. **Layer ablation**
- keep only layer `L`'s `dW`
- or remove layer `L` from full `dW`
- tells where steering is causally located by layer
3. **SVD split of `dW`**
- per tensor: `dW = U S Vᵀ`
- test top-k vs tail:
- `top8`
- `top32`
- `tail`
- tells whether steering is low-rank or distributed
4. **Read/write subspace projections**
- project `dW` into:
- write space
- write-not-read
- super read: `[q,k,v,up,gate]`
- super write: `[o,down]`
- test projected part vs complement
5. **Magnitude vs angle / rotation**
- yes, this is a parameterization test.
- split weight change into:
- norm/magnitude change
- direction/rotation change
- especially relevant for `DeLoRA`, `DoRA`, `OFT`.
Do this for **Qwen + Gemma**, but first on Qwen only to debug.
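The SVD split in step 3 can be sketched per tensor as below, assuming a NumPy array for `dW`. The matrix is synthetic (rank-4 signal plus small noise) and `svd_split` is a hypothetical name.

```python
import numpy as np

def svd_split(dW, k):
    """Split dW = U S V^T into its top-k part and the remaining tail."""
    U, S, Vt = np.linalg.svd(dW, full_matrices=False)
    top = (U[:, :k] * S[:k]) @ Vt[:k]
    return top, dW - top

def energy_fraction(part, whole):
    """Share of squared Frobenius norm carried by `part`."""
    return float((part ** 2).sum() / (whole ** 2).sum())

rng = np.random.default_rng(0)
dW = rng.normal(size=(16, 4)) @ rng.normal(size=(4, 12)) \
    + 0.05 * rng.normal(size=(16, 12))
top8, tail = svd_split(dW, k=8)
frac_top = energy_fraction(top8, dW)
frac_tail = energy_fraction(tail, dW)
```

The top and tail parts are orthogonal, so their energy fractions sum to 1. Energy is only a proxy here: the plan's actual test is whether steering with `top8` or `tail` alone keeps the behavioral effect.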
DD coverage note: current default DD eval is **not full split**. It uses the
first 100 dilemmas = 200 rows, balanced 100 honest-label and 100 dishonest-label
actions, then sign-flips by `honesty_label`. Full `honesty_eval` test is 219
dilemmas = 438 rows. So current tables are all rows for the selected dilemmas,
not only honest rows, and not the full split unless `--n-dilemmas 219`.
Adapter replication note: we've done the adapter sweep on Qwen3-0.6B. For Gemma
1B, do a small replication first: LoRA / PiSSA / DeLoRA, seed 0, full DD split,
then only add more seeds/adapters if the ranking differs or DeLoRA stays best.
Core table should be:
| intervention | kept params | DD effect | syc effect | retention vs full |
|---|---:|---:|---:|---:|
If top-k or write-not-read keeps effect, we found a simple steering parameterization.
If only many layers/tail/complement keeps effect, it's distributed.
This is a fork of Anthropic's weight-steering method. Original recipe: train one positive adapter and one negative adapter, merge each adapter into base-weight deltas, then steer with:
$$dW = \Delta W_{pos} - \Delta W_{neg}.$$
This repo removes Axolotl/vLLM/API orchestration and rebuilds the method in HF + PEFT + uv for cheap iteration on small models.
Current main model: `Qwen/Qwen3-0.6B`.
Current behavior: sycophancy training, evaluated on sycophancy Yes/No and `wassname/daily_dilemmas-self-honesty`.
## Links
- Paper / blog:
- [docs/weight_steering_paper.md](docs/weight_steering_paper.md)
- [docs/weight_steer_blog.md](docs/weight_steer_blog.md)
- Adapter-as-hypothesis notes:
- [docs/blog_adapter_as_hypothesis/README.md](docs/blog_adapter_as_hypothesis/README.md)
- Steering/subspace concepts:
- [docs/AntiPaSTO_concepts/README.md](docs/AntiPaSTO_concepts/README.md)
- Current user-facing summaries:
- [README.md](README.md)
- [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md)
- Key code:
- [src/ws/data.py](src/ws/data.py)
- [src/ws/train.py](src/ws/train.py)
- [src/ws/diff.py](src/ws/diff.py)
- [src/ws/steer.py](src/ws/steer.py)
- [src/ws/eval/sycophancy.py](src/ws/eval/sycophancy.py)
- [src/ws/eval/dilemmas.py](src/ws/eval/dilemmas.py)
- [nbs/cross_adapter_v9.py](nbs/cross_adapter_v9.py)
- [nbs/functional_projection_v10.py](nbs/functional_projection_v10.py)
## Current facts
- Daily-dilemmas default is **not full split**. Default `n_dilemmas=100` means first 100 dilemmas = 200 rows, balanced 100 honest-label and 100 dishonest-label actions.
- Full `honesty_eval` test split is 219 dilemmas = 438 rows.
- The daily-dilemmas eval uses all rows for selected dilemmas, then sign-flips by `honesty_label`; it is not only honest rows.
- Current headline tables are single-seed Qwen3-0.6B exploratory results.
- DeLoRA is best raw steering so far. PiSSA is the cleaner stable baseline if penalizing DeLoRA saturation at high alpha.
- v9/v10 do **not** prove “no subspace.” They show the trained behavior is not explained by the tested low-rank residual-stream bases or adapter-family parameterization at trained scale.
- The highest-value analysis test is cross-adapter causal ablation: if LoRA / DoRA / PiSSA / DeLoRA / OFT share a causal low-rank `dW` core, that is the clean planning-subspace result; if not, it is the cleanest negative result for the shared-subspace hypothesis.
## Done
- [x] Clean repo into uv + HF + PEFT small-model workflow.
- [x] Make Qwen3-0.6B sycophancy steering work end-to-end.
- [x] Hook in LoRA, DoRA, PiSSA, DeLoRA, OFT, and IA3 adapter families.
- [x] Build sycophancy logratio eval with coefficient sweep.
- [x] Build daily-dilemmas honesty eval with sign-flipped Yes/No logratio.
- [x] Run single-seed Qwen adapter benchmark on sycophancy and 100-dilemma DD default.
- [x] Fix DD cross-adapter aggregation to use base-only coeff=0 rather than mixing persona baselines.
- [x] Run v9 subspace/scope diagnostics: weight oracle, cumulative activation oracle, block-local activation oracle, first-LoRA-layer sanity checks.
- [x] Run v10 projection/complement falsifier: raw activation projection, complement, and norm-matched projection.
- [x] Update README and research journal with corrected DD table and conservative interpretation.
## TODO: benchmark question
- [ ] **Goal: activation-steering baseline on the same DD rows.**
- Why: RepE/repeng is the most threatening baseline; if it matches or beats `dW`, the method story weakens before adapter seeds matter.
- Do: train representation direction on the same sycophancy contrast; grid layer x coefficient; evaluate sycophancy and full DD.
- UAT: best activation-steering row is selected by held-out sycophancy or validation DD, then reported beside best `dW` on identical DD test rows.
- Verify: table includes `method=repeng`, `layer`, `coeff`, `syc_delta`, `dd_delta`, `pmass`, and the same `idx` set as the `dW` rows.
- Negative outcome -> claim: if repeng matches/beats `dW`, write "activation steering is the simpler baseline; weight steering needs a stronger reason to exist."
- [ ] **Goal: full daily-dilemmas benchmark for current Qwen adapters.**
- Why: current DD table uses first 100 dilemmas, not the full 219-dilemma split.
- Do: re-run LoRA / PiSSA / DeLoRA / DoRA / OFT / IA3 with `--n-dilemmas 219`.
- UAT: table has 438 base rows per coeff before persona baselines, and reports `pmass`, `frac_low_pmass`, `delta(+1 - 0)`.
- Verify: `out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv` exists and includes `n_base_rows_per_coeff=438`.
- [ ] **Goal: prompt baselines on the same DD rows.**
- Why: weight steering is only interesting if it beats “just prompt it.”
- Do: evaluate base, simple honest persona, and engineered AxBench-style prompt.
- UAT: one table compares `base`, `simple_honest_prompt`, `engineered_prompt`, and best `dW` on identical rows.
- Verify: `prompt_baseline_delta` and `weight_steer_delta` are computed from the same `idx` set.
- Negative outcome -> claim: if prompting matches/beats `dW`, write "prompting is the simpler intervention for this behavior/eval pair."
- [ ] **Goal: multi-seed adapter benchmark on Qwen.**
- Why: current adapter ranking is N=1 seed.
- Do: run seeds 0, 1, 2 for LoRA / PiSSA / DeLoRA first; add DoRA/OFT only if cheap.
- UAT: table reports mean +/- std for sycophancy and DD deltas, plus seed-level signs, so a reader can tell stable ranking from noisy N=1 luck.
- Verify: each adapter has exactly three `w.pt` files and three eval summaries; ranking table includes `n_seeds=3`, `mean_dd_delta`, `std_dd_delta`, and `sign_agreement`.
- Negative outcome -> claim: if adapter ranking changes across seeds or error bars overlap heavily, write "single-seed adapter winner is unstable; do not claim a family ranking yet."
- [ ] **Goal: Gemma 1B replication.**
- Why: check whether DeLoRA/PiSSA ranking is Qwen-specific.
- Do: train LoRA / PiSSA / DeLoRA on Gemma 1B, seed 0, full DD split.
- UAT: compare Gemma ranking to Qwen ranking with the same metrics.
- Verify: table has model column with `Qwen3-0.6B` and `Gemma-1B`; if DeLoRA remains best, expand seeds; if rankings diverge, write that up as a model-specific adapter-basin finding.
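For the multi-seed goal above, the ranking-table columns could be computed along these lines. The deltas are invented toy values, and `sign_agreement` is assumed here to mean the fraction of seeds sharing the majority sign, which the plan does not define.

```python
from statistics import mean, stdev

def summarize_seeds(deltas):
    """Summary row for one adapter from its per-seed DD deltas."""
    signs = [(d > 0) - (d < 0) for d in deltas]       # +1, 0 or -1 per seed
    majority = max(set(signs), key=signs.count)
    return {
        "n_seeds": len(deltas),
        "mean_dd_delta": mean(deltas),
        "std_dd_delta": stdev(deltas),                # sample std (n - 1)
        "sign_agreement": signs.count(majority) / len(signs),
    }

# Toy per-seed deltas for seeds 0, 1, 2 (made up, not measured).
toy = {"lora": [0.5, 0.25, -0.25], "delora": [1.0, 0.5, 0.75]}
ranking = {name: summarize_seeds(d) for name, d in toy.items()}
```

With only three seeds the standard deviation is itself noisy, so the seed-level signs are the more readable signal for whether a ranking is stable.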
## TODO: analysis question
- [ ] **Goal: cross-adapter causal-ablation table for `dW` bases.**
- Why: this is the headline analysis experiment. It tests whether different adapter families discovered the same causal planning subspace or different basins.
- Do: one notebook builds candidate bases `B`, computes `dW_keep_B` and `dW_drop_B`, and evaluates both on sycophancy + full DD for each adapter. This single table replaces separate layer-ablation, SVD top/tail, read/write, MLP, and magnitude/direction experiments.
- Candidate `B` rows:
- `shared_SVD_K8/K32/K64`: stack residual-output `dW` from LoRA / DoRA / PiSSA / DeLoRA / OFT per layer/tensor, take top-K SVs.
- `top8/top32_per_adapter` and `tail_per_adapter`: per-adapter SVD split of each tensor.
- `write`, `write_not_read`, `super_read [q,k,v,up,gate]`, `super_write [o,down]`.
- `mlp_down`, `mlp_up`, `mlp_gate`, `mlp_up+gate`, `attn_only`.
- `magnitude`, `direction/rotation` for DeLoRA / DoRA / OFT where mathematically defined.
- `layers_8..21_only`, leave-one-layer-out, and `random_null`.
- UAT: one central table has every ablation family as rows, with columns `ablation_family`, `candidate_B`, `adapter`, `rank`, `retain_keep`, `retain_drop`, `syc_delta_keep`, `dd_delta_keep`, `syc_delta_drop`, `dd_delta_drop`, `pmass`.
- Verify: the single table contains keep/drop rows for every `ablation_family`: `shared_svd`, `per_adapter_svd`, `read_write`, `mlp_first_order`, `magnitude_direction`, `layer`, and `random_null`; `layer` includes both `layers_8..21_only` and leave-one-layer-out rows; `keep_B_shared_K32` and `drop_B_shared_K32` are both evaluated for at least LoRA / DoRA / PiSSA / DeLoRA / OFT; random null retention is near rank/d; each row uses the same eval rows and coefficient grid.
- Positive outcome -> claim: if `keep_B_shared` retains >=0.7x behavior across adapters and `drop_B_shared` removes it, write the adapter-invariant planning-subspace paper.
- Negative outcome -> claim: if `keep_B_shared` retains <0.3x even at K=64 while complements/tails retain behavior, write the shared-subspace negative result: steering is distributed or lives in the wrong parameter space for these bases.
- Ambiguous outcome -> claim: if both keep and drop retain high behavior, report non-identifiability under this basis family and move to stricter causal interventions, not a positive subspace claim.
- [ ] **Goal: from-scratch parameterization steering.**
- Why: decomposing trained `dW` is weaker than constructing a steering delta from base weights/activations alone.
- Do: build simple `dW_prime = f(W_base, persona_contrast)` candidates, e.g. lm-head/readout rowspace projected persona contrast, write-not-read persona contrast, and shared structural bases with signed coefficients from activation contrast.
- UAT: table compares `dW_prime` to trained `dW`, prompt, and repeng on identical sycophancy + DD rows.
- Verify: candidates are generated without reading trained adapter deltas; code fails if `w.pt` is loaded before constructing `dW_prime`.
- Positive outcome -> claim: if a from-scratch `dW_prime` steers, weight steering may be replaced by a constructive parameterization.
- Negative outcome -> claim: if no from-scratch candidate steers while trained `dW` does, training is doing nontrivial search not captured by the current structural recipes.
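For the cross-adapter causal-ablation goal, one reading of `shared_SVD_K` is sketched below: concatenate each adapter's `dW` along the input axis and take the top-K left singular vectors as a basis in output space. That reading is an assumption, as are the function names. The toy adapters share a planted core, and retention is measured as Frobenius energy, which is only a proxy for the behavioral retention in the plan's table.

```python
import numpy as np

def shared_basis(dWs, K):
    """Top-K left singular vectors of the adapters' dW stacked side by side."""
    stacked = np.concatenate(dWs, axis=1)      # (d_out, n_adapters * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :K]

def keep_drop(dW, B):
    """dW_keep_B projects onto span(B); dW_drop_B is the complement."""
    keep = B @ (B.T @ dW)
    return keep, dW - keep

def energy_fraction(part, whole):
    return float((part ** 2).sum() / (whole ** 2).sum())

rng = np.random.default_rng(0)
d, d_in, K = 64, 64, 8
core, _ = np.linalg.qr(rng.normal(size=(d, K)))   # planted shared subspace
adapters = {
    name: core @ rng.normal(size=(K, d_in)) + 0.05 * rng.normal(size=(d, d_in))
    for name in ["lora", "dora", "pissa", "delora", "oft"]
}
B_shared = shared_basis(list(adapters.values()), K)
B_random, _ = np.linalg.qr(rng.normal(size=(d, K)))   # random_null basis

keep, drop = keep_drop(adapters["lora"], B_shared)
retain_shared = energy_fraction(keep, adapters["lora"])
retain_random = energy_fraction(keep_drop(adapters["lora"], B_random)[0],
                                adapters["lora"])
```

On this toy data the shared basis keeps almost all the energy while the random basis keeps roughly `K / d`, mirroring the "random null retention is near rank/d" check in the Verify line.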
## Deferred / optional
- [ ] **Goal: SVD steering baseline.**
- Why: useful only if cheap and stable; lower priority than repeng.
- UAT: same DD/sycophancy table as other baselines.
- Verify: table includes `method=svd_steering`, `layer`, `rank`, `coeff`, `syc_delta`, `dd_delta`, and `pmass`.
- Negative outcome -> claim: if SVD steering is weak or unstable, do not treat plain base-weight SVD as a competitive method baseline.
- [ ] **Goal: degradation benchmark.**
- Why: steering might improve target metric while damaging general behavior.
- UAT: perplexity or clean instruction proxy reported for best coefficients.
- Verify: table has target metric and degradation metric for the exact same selected coefficients.
- Negative outcome -> claim: if target gains require large degradation, report steering as brittle rather than useful.
- [ ] **Goal: larger model replication.**
- Why: Qwen3-0.6B and Gemma 1B are iteration models; larger model needed for a stronger claim.
- UAT: same benchmark table on a 4B-ish model after method stabilizes.
- Verify: model column includes the 4B-ish model and reuses the same prompt/DD row IDs as the small-model benchmark.
- Negative outcome -> claim: if the effect disappears or reverses on the larger model, write the small-model limitation instead of scaling the claim.
## Decision rules
- If prompt or activation steering beats `dW`, prioritize method improvement before deeper mechanistic analysis.
- If activation steering matches `dW`, treat weight steering as mechanistic interest first and applied method second.
- If DeLoRA wins across Qwen and Gemma, spend seeds on DeLoRA/PiSSA only.
- If Qwen and Gemma adapter rankings diverge, write the model-specific adapter-basin finding instead of forcing one global winner.
- Shared-core rule: if `keep_B_shared_K32` retains >=0.7x behavior across LoRA / DoRA / PiSSA / DeLoRA / OFT and `drop_B_shared_K32` removes most of it, write the planning-subspace paper.
- Basin-divergence rule: if per-adapter top subspaces are mutually low-overlap and each adapter's own SVD keeps behavior better than `B_shared`, write the basin-divergence paper.
- If top-k or write-not-read keeps behavior, we found a simple steering parameterization.
- If complement/tail/many layers keep behavior, evidence favors distributed or wrong-space mechanism.
- If MLP `up/gate` terms carry behavior, next paper story should be feature-space steering, not residual-stream planning subspace.
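The shared-core rule and its negative and ambiguous counterparts could be encoded as a small classifier like this. The 0.7 and 0.3 thresholds on `retain_keep` come from the plan; reusing them for "drop removes most of it" and "complements retain behavior" is an assumption, and the label strings are hypothetical.

```python
def classify(retain_keep, retain_drop, hi=0.7, lo=0.3):
    """Map keep/drop retention for one basis B to an outcome label."""
    if retain_keep >= hi and retain_drop <= lo:
        return "shared_core"                 # keep retains, drop removes
    if retain_keep < lo and retain_drop >= hi:
        return "distributed_or_wrong_space"  # complement carries the behavior
    if retain_keep >= hi and retain_drop >= hi:
        return "non_identifiable"            # both halves retain behavior
    return "inconclusive"

outcomes = {
    "positive": classify(0.85, 0.10),
    "negative": classify(0.20, 0.90),
    "ambiguous": classify(0.80, 0.75),
    "middle": classify(0.50, 0.50),
}
```

Writing the rule down before running the ablation table keeps the thresholds from drifting after the numbers are in.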