weight-steering/RESEARCH_JOURNAL.md
wassname a48430b075 switch training/eval axis from sycophancy to honesty
- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:00:03 +08:00


Research log - append to bottom only, ideally give commit / branch time. newest first

init 2026-04-26 06:32:46

see @fork_plan.md and head of README.md for overall plan and principles.

made notebook 2026-04-26 10:32:40

Where does the LoRA's Δa live?

In a task-derived hidden subspace that is not the lm_head readout. Held-out energy ratios (mean over LoRA-active layers 827, rank-8 each):

I just want to check you have the framing right.

We are searching for how to modify pretrained LLMs from scratch.

So we have A):

W, the pretrained weights for a given linear layer, and hs_diff, the output diff between cho and rej (the persona-prefixed input strings). We can do things like project this onto the write_not_read, suppressed, or churn subspace, as well as via U into the S space.

B) Now, in terms of clues and labels, we have deltaW (how the two LoRAs differed) and hs_diff2 (how steering left and right differ on the same task, no persona prefix).

We are comparing A, the hypothesis, to B, the label.

lora lite 2026-04-26 12:32:10

The peft library is not very hackable and there is no LoRA-lite library. We can make one by:

  • using PyTorch or transformers forward hooks
  • adding weights, param dicts, or buffers to the linear layers we modify, each with a prefix like lora_ so we can save and load them via full path
  • targeting all layers, all residual writers, or all residual readers with simple logic that checks isinstance(linear) and, if W.shape is asymmetric, whether the input or output matches the residual stream shape, otherwise falling back on hardcoded prefixes like o_out, out_proj, ml_down, etc.
  • making heavy use of einops, jaxtyping, and einsum to make dims obvious and allow beartype checking
  • pseudocode-like style: fail fast, no defensive code, no fallbacks, links to papers
  • a simple single dataclass config
  • no 4-bit or 8-bit unless we can think of a simple way that works for all (cast all to bf16 in the hook?)
  • model requires no grad except for what we add
  • obviously easy to add new adapters and special inits
    • look at what made PiSSA, SSVD, DeLoRA and other papers' code hard to implement and make sure we are compatible
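
The shape-based targeting rule in the list above can be sketched as follows. This is a minimal illustration, not the lora-lite implementation: `LinearSpec`, `residual_role`, and `WRITER_SUFFIXES` are hypothetical names, the fallback suffixes are simply the ones listed in the note, and a real version would walk `model.named_modules()` instead of hand-built specs.

```python
from dataclasses import dataclass

# Hypothetical fallback names, copied from the note above; real module names vary by model.
WRITER_SUFFIXES = ("o_out", "out_proj", "ml_down")


@dataclass(frozen=True)
class LinearSpec:
    """Stand-in for an nn.Linear: module path plus (out_features, in_features)."""
    name: str
    out_features: int
    in_features: int


def residual_role(spec: LinearSpec, d_model: int) -> str:
    """Return 'writer', 'reader', or 'other' for one linear layer."""
    reads = spec.in_features == d_model
    writes = spec.out_features == d_model
    if reads != writes:  # asymmetric: exactly one side matches the residual width
        return "writer" if writes else "reader"
    # Both or neither side match: the shape is ambiguous, so fall back on the name.
    if spec.name.endswith(WRITER_SUFFIXES):
        return "writer"
    return "reader" if reads else "other"


specs = [
    LinearSpec("layers.0.mlp.down_proj", 1024, 3072),
    LinearSpec("layers.0.mlp.up_proj", 3072, 1024),
    LinearSpec("layers.0.attn.out_proj", 1024, 1024),
    LinearSpec("layers.0.attn.q_proj", 1024, 1024),
]
roles = {s.name: residual_role(s, d_model=1024) for s in specs}
```

Square layers are the only case where the name matters here, which keeps the hardcoded list short.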

2026-04-26: hyperparameter correction + 0.6B vs 1.7B comparison

Paper Axolotl config shows lr=2e-4, lora_alpha=64 (alpha/rank=2.0). Our previous runs used lr=1e-5, alpha=16 (alpha/rank=0.5): the learning rate was 20x too low and alpha/rank 4x too small.

Three-run comparison (all: r32, 3 epochs, sycophancy, lora, 1000 pairs, 20×5×10):

| run | lr / alpha | ‖w‖ | spread (α: -2→+2) | val_loss@ep3 | converged? |
|---|---|---|---|---|---|
| task-50: 0.6B bad-lr | 1e-5 / 16 | 0.165 | 5.16 | 1.035 | no (still dropping) |
| task-53: 0.6B paper-lr | 2e-4 / 64 | 5.981 | 12.85 | 0.706 | yes (U-curve ep2) |
| task-54: 1.7B paper-lr | 2e-4 / 64 | 9.262 | 36.61 | 0.873 | no (still dropping) |

Key findings:

  • Corrected hyperparams: 36x larger ||w||, 2.5x more steering spread for 0.6B.
  • 0.6B val_loss bottoms at ep~1.94 (0.701), mild U-curve; best to stop at epoch 2.
  • 1.7B base is anti-sycophantic at α=0 (logratio=-3.58 vs +2.73 for 0.6B) — larger model has better calibration and correctly says No to false claims without steering.
  • 1.7B on-policy CoT saturates at ±20 nats; off-policy non-monotone at negative alpha (α=-2 less negative than α=-1). Linear approximation breaks down at ||w||=9.26, |alpha|=2. Tighter alpha range (±0.5, ±1.0) needed for 1.7B to stay in linear regime.
  • 1.7B val_loss higher (0.873 vs 0.706) — training data generated by 0.6B, distribution mismatch reduces signal. Should regen data from 1.7B for clean comparison.
  • OOD generalization (tabs/spaces claim) strong for both models at paper hyperparams.

2026-04-26: 1.7B own-data regen + adapter-family sweep (tasks 58-59)

1.7B with own-data + tighter alpha (task 58)

Re-ran 1.7B training with data generated by 1.7B itself (out/data/1.7B, 1000 pairs) and tighter coeff sweep (-1.0 to +1.0) to stay in the linear regime.

| coeff | mean_logratio | std | pmass |
|---|---|---|---|
| -1.0 | -17.10 | 2.996 | 1.000 |
| -0.5 | -12.28 | 7.025 | 1.000 |
| 0.0 | -3.58 | 10.698 | 1.000 |
| +0.5 | +7.73 | 10.564 | 1.000 |
| +1.0 | +16.27 | 5.877 | 1.000 |

logratio_spread = +33.375, pmass_min = 1.000. Fully monotone, no linear-regime breakdown. Baseline at alpha=0 is -3.58 (1.7B naturally less sycophantic than 0.6B at alpha=0 which was +2.73).
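
For reference, a small sketch of how the spread and the monotonicity check can be recomputed from the task-58 rows. The definition used here (logratio at the largest coeff minus logratio at the smallest) is an assumption about how logratio_spread is computed, and the function names are hypothetical.

```python
# Rows copied from the task-58 table: (coeff, mean_logratio).
sweep = [(-1.0, -17.10), (-0.5, -12.28), (0.0, -3.58), (0.5, 7.73), (1.0, 16.27)]


def logratio_spread(rows):
    """Logratio at the largest coeff minus logratio at the smallest coeff."""
    ordered = sorted(rows)
    return ordered[-1][1] - ordered[0][1]


def is_monotone(rows):
    """True if mean_logratio strictly increases with coeff."""
    values = [v for _, v in sorted(rows)]
    return all(a < b for a, b in zip(values, values[1:]))


spread = logratio_spread(sweep)
monotone = is_monotone(sweep)
```

From the rounded table values this gives a spread of about 33.37, which agrees with the reported +33.375 up to rounding of the inputs.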

Adapter-family sweep (task 59): lora vs dora vs pissa vs delora on 0.6B

| adapter | logratio_spread | pmass_min | ratio_weak_write | wall_s |
|---|---|---|---|---|
| lora | +9.76 | 1.000 | 0.885 | 259 |
| dora | +9.76 | 1.000 | 0.879 | 321 |
| pissa | +17.40 | 0.999 | 1.086 | 326 |
| delora | +23.85 | 0.788 | 0.890 | 267 |

Key findings:

  • LoRA ≈ DoRA within 0.1% spread; DoRA adds no steering information. Expected <20% variation holds.
  • PiSSA gives 78% more spread than LoRA (17.4 vs 9.76) AND ratio_weak_write > 1 (1.086 vs 0.885). This is the first case where a non-LoRA adapter outperforms on both spread AND subspace alignment. PiSSA initializes from SVD of W0, which may place the diff more in the task-relevant subspace.
  • DeLoRA gives the most spread (+23.85) but pmass drops to 0.788 — outside the linear regime. The larger diff norm likely saturates the logratio. ratio_weak_write (0.890) similar to LoRA/DoRA.
  • Note: LoRA spread is 9.76 here vs 12.85 in task 53. Both use same data (out/data) but different random seeds and output dirs. ~24% run-to-run variation is within expected training noise.

Verdict: adapter family does matter, contrary to null hypothesis. PiSSA is the Pareto winner: more spread, better subspace alignment, pmass stays near 1. DeLoRA is strongest raw but unreliable.

Daily dilemmas OOD honesty transfer (tasks 66-67, corrected)

Three bugs found and fixed before getting valid results:

  1. HF datasets caching: .map() returned stale tokenized sequences when EVAL_HEADER changed. Fix: load_from_cache_file=False.
  2. </think> as text string ≠ the special close token. Must inject the actual token ID. Fix: in _format_row, after apply_chat_template, detect open <think> (id 151667) without matching </think> (id 151668) and inject close token + \n\n before the answer anchor.
  3. DataLoader right-padding: DataCollatorWithPadding pads on right by default, so logits[:, -1] hit a padding token for shorter sequences. Fix: tok.padding_side = "left" before DataLoader. This was the main cause of pmass=0.17 -- reading logits at a padding position, not the anchor.
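
The padding bug (item 3) is easy to reproduce without a model. The sketch below uses toy token ids; `PAD = 0` and the anchor id `99` are made up for illustration and do not correspond to the real tokenizer.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def pad_batch(seqs, side):
    """Pad token-id lists to a common length on the given side."""
    width = max(len(s) for s in seqs)
    if side == "left":
        return [[PAD] * (width - len(s)) + s for s in seqs]
    return [s + [PAD] * (width - len(s)) for s in seqs]


def last_position_tokens(batch):
    """Token sitting at index -1 of each row, i.e. where logits[:, -1] is read."""
    return [row[-1] for row in batch]


seqs = [[11, 12, 13, 99], [21, 99]]  # 99 marks the answer anchor in this toy example
right = last_position_tokens(pad_batch(seqs, "right"))
left = last_position_tokens(pad_batch(seqs, "left"))
```

With right padding the shorter row ends in `PAD`, so the last position is not the anchor; with left padding every row ends on the anchor.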

0.6B results (per-persona breakdown):

| persona | coeff | mean_lrh | pmass |
|---|---|---|---|
| honest_engineer | 0.0 | +0.851 | 0.957 |
| base | -2.0 | -0.786 | 0.975 |
| base | -1.0 | +0.299 | 0.954 |
| base | 0.0 | +1.316 | 0.938 |
| base | +1.0 | +1.828 | 0.962 |
| base | +2.0 | +1.645 | 0.989 |

  • pmass: 0.938-0.989. frac_low_pmass = 0.
  • Mostly monotone; slight dip at alpha=+2 suggests approaching linear-regime boundary.
  • base@alpha=0 = +1.316 (0.6B is already quite honest at baseline on this dataset).
  • AxBench: steering@+1 (+1.828) vs honest_engineer persona (+0.851) → weight diff 2.15× stronger. PASS: the weight diff adds information beyond prompting.

1.7B results (per-persona breakdown):

| persona | coeff | mean_lrh | pmass |
|---|---|---|---|
| honest_engineer | 0.0 | -1.817 | 1.000 |
| base | -1.0 | -0.760 | 1.000 |
| base | -0.5 | -0.590 | 1.000 |
| base | 0.0 | -0.299 | 1.000 |
| base | +0.5 | +0.526 | 1.000 |
| base | +1.0 | +1.504 | 1.000 |

  • pmass: 1.000 across all rows (1.7B is more decisive than 0.6B).
  • Perfectly monotone sweep. Spread -1.0→+1.0 = 2.264 nats. The 0.6B figure of 2.614 is +1.828 - (-0.786), which spans coeff -2.0→+1.0; over the matching -1.0→+1.0 range the 0.6B spread is 1.529, so on equal ranges 1.7B has the larger spread.
  • Persona BACKFIRES for 1.7B: honest_engineer (-1.817) is worse than base (-0.299). Hypothesis: 1.7B has a more nuanced distinction between "software factual honesty" and "moral honesty in dilemmas". The persona activates factual-accuracy behavior, not moral-choice behavior. This is a dataset-persona confound, not a failure of the steering approach.
  • AxBench: steering@+1 (+1.504) still clearly beats persona (-1.817). PASS.

Cross-model comparison:

Weight steering transfers sycophancy diff to OOD honesty dilemmas for both models. 0.6B shows higher absolute effect (base already honest, persona helps), while 1.7B shows cleaner monotonicity and perfect pmass but the "honest engineer" persona backfires at 1.7B scale.

Research journal — weight-steering

2026-04-27 — v9 cross-adapter results: DeLoRA wins; subspace-finding methods fail

tl;dr

  • DeLoRA is the strongest daily-dilemmas steerer by a large margin (base-only delta = +0.70 logratio at coeff=+1 vs base, ~1.7x DoRA/PiSSA, ~2.7x LoRA/OFT, ~23x IA3). NOTE: an earlier table used +0.94 because dilemmas_summary.csv mixed base@0 with the AxBench honest_engineer persona baseline at coeff=0. cross_adapter_v9.py now uses base-only dilemmas_per_row.csv.
  • Every linear "find the planning subspace" method we tried lands at ~1-8% subspace overlap with the weight oracle. Across 6 adapter families, on every LoRA layer (8-21). Both cumulative and block-local act oracles. So either the right subspace really is small and we keep missing it, or "planning subspace" isn't the right frame.
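
To put the 1-8% figures on a scale, here is a minimal sketch of one common subspace-overlap definition, the mean squared cosine of the principal angles. Whether this matches the exact metric used in v9 is an assumption; the random matrices are stand-ins for the weight and activation oracles, and d=1024, k=8 are taken from the numbers quoted later in this journal.

```python
import numpy as np


def orthonormal_basis(mat, k):
    """Top-k left singular vectors of mat, as a (d, k) orthonormal basis."""
    u, _, _ = np.linalg.svd(mat, full_matrices=False)
    return u[:, :k]


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of principal angles: ||A^T B||_F^2 / k, in [0, 1]."""
    k = basis_a.shape[1]
    return float(np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2 / k)


rng = np.random.default_rng(0)
d, k = 1024, 8
w_basis = orthonormal_basis(rng.standard_normal((d, 64)), k)    # stand-in for w_oracle
act_basis = orthonormal_basis(rng.standard_normal((d, 64)), k)  # stand-in for act_oracle

same = subspace_overlap(w_basis, w_basis)      # identical subspaces
chance = subspace_overlap(w_basis, act_basis)  # unrelated random subspaces
```

Under this definition two unrelated rank-8 subspaces in d=1024 overlap at roughly k/d, a bit under 1%, so a few percent of overlap is above chance but still small.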

Headline numbers (cross_adapter_v9)

Behavioral steering on daily-dilemmas honesty subset (base persona only; 100 dilemmas / 200 action rows per nonzero coeff):

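| adapter | logratio @ -1 | @ 0 (base) | @ +1 | delta(+1 - 0) |
|---|---|---|---|---|
| delora | -0.29 | 1.32 | 2.02 | +0.70 |
| dora | 0.73 | 1.32 | 1.72 | +0.41 |
| pissa | 0.44 | 1.32 | 1.69 | +0.37 |
| oft | 1.09 | 1.32 | 1.57 | +0.26 |
| lora | 1.09 | 1.32 | 1.55 | +0.23 |
| ia3 | 1.29 | 1.32 | 1.35 | +0.03 |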

DeLoRA is the only adapter that meaningfully de-steers (negative coeff → dishonest). LoRA/OFT/IA3 are nearly flat at coeff=-1.

Subspace overlap with w_oracle (mean across LoRA layers 8-21, top-PCS=8):

| adapter | act_oracle (cumul) | act_oracle (block-local v9) |
|---|---|---|
| oft | 0.046 | 0.045 |
| pissa | 0.036 | 0.042 |
| lora | 0.034 | 0.016 |
| ia3 | 0.031 | 0.029 |
| dora | 0.024 | 0.015 |
| delora | 0.017 | 0.016 |

Note the inversion: the strongest behavioral steerer (DeLoRA) has the lowest subspace alignment with act_oracle. The weakest (IA3) is mid-pack on overlap. "Subspace alignment with the activation-difference oracle" is not predictive of behavioral effect across adapter families.
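
The inversion can be summarised with a rank correlation over the six adapters. The sketch below uses the delta(+1 - 0) column and the cumulative act_oracle column as printed; the helper names are hypothetical, and the no-ties Spearman formula is used because neither column has ties (the block-local column does, so it is left out).

```python
# Values copied from the two v9 tables.
delta = {"delora": 0.70, "dora": 0.41, "pissa": 0.37, "oft": 0.26, "lora": 0.23, "ia3": 0.03}
overlap = {"oft": 0.046, "pissa": 0.036, "lora": 0.034, "ia3": 0.031, "dora": 0.024, "delora": 0.017}


def ranks(scores):
    """Rank 1 = largest value. Assumes no ties, which holds for these columns."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    d2 = sum((rx[name] - ry[name]) ** 2 for name in rx)
    return 1 - 6 * d2 / (n * (n * n - 1))


rho = spearman(delta, overlap)
```

The result is about -0.49: the sign matches the inversion, though with n=6 the exact value should not be over-read.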

What v9 ruled out

  • Scope mismatch: hypothesis was that hs_diff_B[L] is cumulative (includes all upstream LoRA writes) while dW[L] is local, so the cumulative act_oracle was looking at the wrong thing. v9 added block_diff_B[L] = what block L itself wrote, and re-derived the oracle from that. Result: block-local barely moves overlap (1-5% in either direction) — sometimes up, sometimes down, no consistent improvement. So scope is NOT the culprit.
  • Layer L=8 sanity: at the first LoRA layer, cumulative ≈ block (overlap 1.0 for 5/6 adapters; IA3 fails because IA3Config doesn't accept layers_to_transform so it adapts every layer). So the metric is consistent; cumulative just diverges from block as we accumulate upstream LoRA writes.

What this falsifies

The "shared low-rank planning subspace" frame as written in docs/blog_adapter_as_hypothesis. If a small (rank ≤ 8) subspace contained the honesty/sycophancy task structure, we'd expect the weight oracle and the activation oracle (which by construction captures top-PCS energy of the behavioral diff) to agree on at least one of: substance, scope, or family. They don't agree on any of them, across 6 different LoRA-family inductive biases.

Two surviving stories:

  1. The right subspace is tiny but specific — maybe ~3% overlap is "the right 3%" and the 97% orthogonal part of dW is dead weight that doesn't affect behavior. Falsifiable: project dW onto top-K right SVs of act_oracle for K ∈ {1,2,4,8}, run dilemmas, see if delta_pos_minus_zero survives. If yes, our metric is just the wrong norm. If no, the framing is wrong.
  2. The frame is wrong — behavior emerges from how dW interacts with the full activation manifold non-linearly through the rest of the network, not from alignment with a top-PCS basis. The act_oracle PCA captures variance, not function.

I lean (2). The fact that DeLoRA has the worst overlap and the best behavior is hard to explain under (1).

What's interesting about DeLoRA winning

DeLoRA's parametrization (decoupled magnitude + normalized direction, like DoRA but with stronger decoupling; see Bini et al. 2025) seems to produce a more swingy steering vector: it's the only adapter where coeff=-1 actively de-honests the model below baseline. Hypothesis: DeLoRA's normalization forces the update to be a coherent direction rather than a magnitude-driven blob, so scaling it ±α actually traverses the behavioral axis. Other adapters (LoRA, OFT) collapse asymmetrically on the negative side because much of their delta is magnitude-not-direction.

This would mean DeLoRA is a better probe for behavior steering even though all six adapters land in similar regions of weight space when you measure by subspace overlap.

Next experiments

  • (claim 1 falsifier) projection-ablation: w_proj_K = project dW onto top-K act_oracle dirs, run dilemmas, see if delta_pos_minus_zero is preserved.
  • DeLoRA at higher rank — does the behavioral lead grow or saturate?
  • DeLoRA × different behaviors (not just sycophancy/honesty) to see if the swinginess generalizes.
  • Replace "subspace alignment" with a functional metric: causally ablate dW components and measure behavior change directly, not via PCA bases.

Artifacts

  • out/sycophancy/cross_adapter_v9/{headline.csv, scope_summary.csv, dilemmas_summary.csv, scope_bars.png}
  • out/sycophancy/{lora,dora,pissa,delora,oft,ia3}/v9/v9_scope_diagnostic.csv — per-layer scope_diagnostic for all 6 adapters, all show 1-8% overlap on every LoRA layer.
  • nbs/hypothesis_sweep_v9.py — block-local oracle, scope diagnostic, ADAPTER env var.
  • nbs/cross_adapter_v9.py — aggregator across the 6 adapter families.
  • All 18 pueue jobs (88-105) finished Success.

2026-04-27 — v10 functional projection: overlap metric failed, but act_oracle is not the trained steering subspace

Question: maybe the ~3% overlap between w_oracle and act_oracle_block is still "the right 3%". If yes, projecting dW onto the act_oracle basis should preserve daily-dilemmas steering.

v10 changed the metric from geometric overlap to behavior:

  1. Build block-local act_oracle from the adapter's sycophancy probe effect.
  2. Decompose residual-output tensors (o_proj, down_proj) into:
    • project_act_block: P_{act,K} dW
    • complement_act_block: (I - P_{act,K}) dW
    • project_act_block_normmatched: same projection scaled to residual-write norm
  3. Run the same daily-dilemmas honesty logratio.

Main K=32 result (n=40 dilemmas / 80 rows)

| adapter | full Δ | residual-write Δ | raw projection / residual | normmatched / residual | complement / residual | read |
|---|---|---|---|---|---|---|
| delora | +0.628 | +0.844 | 0.07 | 0.30 | 0.89 | clean counterexample: trained behavior mostly in complement |
| pissa | +0.373 | +0.242 | 0.47 | 1.14 | 0.64 | mixed: act projection is functional but not sole carrier |
| oft | +0.216 | +0.148 | -0.01 | 1.57 | 0.69 | act direction potent only after amplification |
| dora | +0.370 | +0.031 | -0.50 | 1.15 | 1.20 | residual-write split barely explains full effect |
| lora | +0.173 | -0.022 | noisy | noisy | noisy | residual-write split wrong-sign |
| ia3 | -0.048 | +0.002 | noisy | noisy | noisy | denominator too small |

So v10 mostly kills the strongest loophole for DeLoRA: the act_oracle projection at trained scale does not carry the steering; the complement does. PiSSA/OFT are subtler: the act_oracle directions are real and causal when amplified, but the trained adapter did not put enough norm there for them to explain behavior at alpha=1.

Alpha sweep at K=32 (informative adapters only)

Δ = daily-dilemmas honesty logratio minus base.

| adapter | alpha | residual | raw projection | normmatched projection | complement |
|---|---|---|---|---|---|
| delora | 0.5 | +0.387 | -0.020 | +0.100 | +0.361 |
| delora | 1.0 | +0.844 | +0.061 | +0.252 | +0.755 |
| delora | 2.0 | +1.858 | +0.097 | +0.503 | +1.680 |
| delora | 4.0 | +2.150 | +0.183 | +0.430 | +2.739 |
| pissa | 0.5 | +0.159 | +0.044 | +0.217 | +0.036 |
| pissa | 1.0 | +0.242 | +0.114 | +0.277 | +0.155 |
| pissa | 2.0 | +0.327 | +0.206 | +0.394 | +0.270 |
| pissa | 4.0 | +0.578 | +0.350 | +0.473 | +0.380 |
| oft | 0.5 | +0.059 | -0.020 | +0.077 | +0.034 |
| oft | 1.0 | +0.148 | -0.002 | +0.233 | +0.103 |
| oft | 2.0 | +0.309 | +0.084 | +0.478 | +0.184 |
| oft | 4.0 | +0.564 | +0.166 | +0.864 | +0.381 |

Interpretation sequence:

  • v9 overlap was too strict as a potency metric. Normmatched act projections can steer strongly, especially OFT/PiSSA.
  • v9 overlap was not wrong as a trained-scale carrier metric. Raw act projection usually carries little of the actual trained behavior.
  • DeLoRA is the cleanest finding: residual-write Δ is strong (+0.844), raw act projection is weak (+0.061), complement is strong (+0.755). That is hard to reconcile with "the right 3% explains the behavior".
  • PiSSA/OFT suggest the act_oracle subspace is a useful intervention target, not the subspace the adapter naturally chose.

My current model: PCA of activation differences finds directions where small weight writes can be high-gain, but trained adapter behavior is distributed through a larger residual-write complement plus read/gate/up paths. The word "planning subspace" is probably misleading unless we define it causally (what intervention changes behavior), not geometrically (what basis overlaps).

Clarification on the object of search: we were looking for a subspace or parameterization that explains the difference between the positive and negative LoRAs, i.e. a basis in which dW = θ_pos - θ_neg becomes simple or functionally concentrated. v9/v10 speak directly to that for residual-output weights. The answer so far is: not adapter family alone (LoRA/DoRA/PiSSA/DeLoRA/OFT/IA3), not the canonical pretrained-weight bases, not the attention or MLP read/write primitives, not the attention-selected token bases, not the persona-contrast bases, and not the top activation-PCA directions at trained scale. That does not mean no compact causal description exists. It means our tested parameterizations and geometric descriptions are not it.

complement means the part of the residual-output weight diff left after removing the activation-PCA subspace:

dW_{complement} = (I - P_{act,K}) dW.

If P_act dW keeps the behavior, then the small overlap was the right small piece. If (I - P_act) dW keeps the behavior, then the trained adapter's behavior is mostly outside that tested act-PCA subspace. DeLoRA gave the clean case: raw projection keeps 7% of residual-write behavior, complement keeps 89%. So for the strongest adapter, act-PCA is not an explanation of the learned weight diff, though it can still be a useful amplified intervention direction.
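
The three variants (projection, complement, norm-matched projection) can be sketched in a few lines. This is a toy illustration under stated assumptions: sizes are made up, the activation basis comes from an uncentered SVD of random data rather than the real block-local act_oracle, and the projector is applied on the output side because these are residual-output weights stored as (out, in).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 8  # toy sizes, not the real model dims

dW = rng.standard_normal((d_out, d_in))   # stand-in for a residual-output weight diff
acts = rng.standard_normal((d_out, 200))  # stand-in for activation differences
basis, _, _ = np.linalg.svd(acts, full_matrices=False)
basis = basis[:, :k]                      # top-k directions (uncentered PCA), orthonormal

P = basis @ basis.T                       # projector onto the act subspace
dW_proj = P @ dW                          # project_act_block: P_{act,K} dW
dW_comp = dW - dW_proj                    # complement_act_block: (I - P_{act,K}) dW

# Norm-matched variant: rescale the projection to the Frobenius norm of the full dW.
scale = np.linalg.norm(dW) / np.linalg.norm(dW_proj)
dW_proj_normmatched = scale * dW_proj
```

The two pieces are orthogonal and sum back to dW, so their squared Frobenius norms add up; the behavioral effects in the tables above do not have to, because the model is nonlinear in the write.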

Artifacts

  • nbs/functional_projection_v10.py
  • docs/spec/20260427_v10_functional_projection.md
  • out/sycophancy/v10_functional_projection/{behavior_summary.csv, behavior_by_coeff.csv, spectra_and_projection.csv}
  • out/sycophancy/v10_alpha_sweep/{behavior_summary.csv, behavior_by_coeff.csv, spectra_and_projection.csv}

v10 wendler-style functional probe 2026-04-27 07:04:57 (dev, a40fd35)

nbs/v10_llama.py ports Wendler et al. 2024 ("Do Llamas Work in English?") token-energy + logit-lens to the sycophancy LoRA on Qwen3-0.6B. Cheaper than v9 (no PCA, no oracle, no SVD per candidate) and asks a different question: not "does B span Δh" but "is the (e_yes − e_no) readout axis in B" and "is Δh itself readable by lm_head".

Findings

  • Δh is unreadable at LoRA layers. Token energy E²(Δh) peaks at 0.010 in LoRA layers 8..21 vs E²(clean) ≈ 22 at the same layers. Logit-lens Yes-No on Δh is essentially zero at LoRA layers, rising to peak 0.66 nats at layer 25 (post-LoRA).
  • The LoRA writes concepts, not tokens. Top decoded tokens from Δh at the peak layer:
    • +Δh: " why / reasons / 理由 / supporting / Reasons / 为什么" (理由 = "reason", 为什么 = "why")
    • −Δh: " nonexistent / unauthorized / truthful / 未经" (未经 = "without / un-")

    These are multilingual concepts of "give-reasons / supportive". The LoRA does not write " Yes"; downstream layers translate the concept into Yes/No.
  • No rank-8 base-model subspace contains the Yes-No axis. cap_yn(B) = ‖P_B(e_yes − e_no)‖² / ‖e_yes − e_no‖² averaged over LoRA layers: lm_head_read = 0.042 (best A-side, 5× null), TaskDiff_lora_fit = 0.014, write = 0.008, random_null = 0.008 = 8/1024. Even act_oracle (in-sample ceiling) gets 0.014 because Δh is in concept space, not on the readout axis. Sanity passes: act_oracle.preserved_E ≈ 1.000, random_null.cap_yn ≈ PCS/d.
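
The capture ratio and its random null can be checked numerically. A minimal sketch: `cap` is a hypothetical helper implementing ‖P_B v‖² / ‖v‖² for an orthonormal basis, and the "readout axes" here are random Gaussian directions standing in for e_yes − e_no, not real lm_head rows.

```python
import numpy as np


def cap(basis, v):
    """Fraction of v's squared norm inside span(basis): ||P_B v||^2 / ||v||^2."""
    coords = basis.T @ v  # coordinates of v in the orthonormal basis
    return float(coords @ coords / (v @ v))


rng = np.random.default_rng(0)
d, k = 1024, 8
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random rank-k orthonormal basis

# Stand-ins for the readout axis: random directions, not real lm_head rows.
axes = rng.standard_normal((2000, d))
null_cap = float(np.mean([cap(basis, v) for v in axes]))

inside = cap(basis, basis[:, 0])  # a vector lying in the subspace
```

Averaged over random directions the ratio sits at k/d = 8/1024, which is the null the reported 0.042 and 0.014 are compared against.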

Punchline

Searching for "the subspace the LoRA uses" via canonical base-model bases or persona-contrast PCA fails not because the hypotheses are crazy, but because the LoRA's solution is concept space and doesn't sit in any low-rank readable basis. The hypothesised directions remain usable for steering with amplification (v10 functional projection result above) — they're just not what the trained model is doing, so you can't use them to interpret what the LoRA learned.

This decomposes the v9 negative finding into three separate facts that v9's PCA-span metric conflated:

  1. Δh doesn't carry the Yes-No readout axis at LoRA layers (panel b: ldiff on Δh ≈ 0 in 8..21, peaks only at layer 25).
  2. No rank-8 candidate (incl. lm_head's own SVD) contains the Yes-No axis (panel c: best cap_yn = 0.042).
  3. The mechanism is concept-write at LoRA layers → downstream non-linear translation to Yes/No, matching Wendler's three-phase picture.

Why we still can't find the concept space

act_oracle is post-hoc (defined from Δh itself, in-sample). The closest out-of-sample candidate is TaskDiff_lora_fit (PCA on FIT-half α=±1 diffs), which gets preserved_E = 0.109 at rank 8. Above null but loses 90% of Δh's readable energy. Why:

  • Rank-8 is too narrow. The concept "be supportive / give reasons" is not one direction. It spans polite vs blunt, justification vs assertion, multilingual variants. PCA at rank 8 collapses these — silent rank loss. TODO: rerun TaskDiff_lora_fit at rank 32, 64, 128. If preserved_E saturates near 1.0 by rank 64, concept space is just higher-dimensional than 8.
  • Linear span vs cluster. PCA finds a linear span. If the concept space is a manifold of related-but-not-collinear directions (e.g. one "reasons" axis per topic), span overlap is low even when the cluster is perfectly captured. CHaRS-style per-cluster translations would catch this; v9 marked them "structurally penalized".
  • Persona contrast is the wrong concept. TaskDiff_contrast scores cap_yn = 0.012 because persona ≠ sycophancy. Persona induces style; sycophancy is more specific (justify-the-user's-claim). Need concept-anchored data: contrast prompts that elicit " because / 因为 / supporting" against bland answers, build basis from that.
  • Wrong layer. Linear probes at LoRA layers see concept-space writes; Yes/No is only readable at layer 25. A Yes-No probe trained at layer 25 and back-projected through residual paths might recover the concept axis from the receiving end.
  • Wrong tool. PCA is unsupervised; DAS (Distributed Alignment Search) learns a rotation that isolates task-causal directions, doesn't assume low rank. SAE features could expose a "supporting/reasons" feature directly. Either is the principled next step.

Artifacts

  • nbs/v10_llama.py
  • out/sycophancy/lora/v10/{v10_wendler_metrics.png, v10_wendler_metrics.pdf, v10_table.tsv, v10_caption.md, v10_per_layer.csv}

2026-04-27 fork_plan T1-T8 status check (dev)

Walked through fork_plan tasks T1-T8 against the latest CSVs to see which UAT goals are ticked.

T1 activation steering baseline (out/sycophancy/activation_baseline/summary.csv). Repeng on layers 8-21, full 438-row set (219 dilemmas; dd_zero=+0.195, syc_zero=+2.698). Best repeng dd_delta = +0.071 (layer 9, coeff=-4); at coeff=+1 the best is +0.0070 (layer 13). dW:delora at coeff=+1 is dd_delta=+0.337 on this same slice. Activation steering on this dataset is essentially noise; the trained weight diff carries roughly 50x more daily-dilemmas signal at matched coeff.

T2 cross-adapter on full daily-dilemmas, base persona only (out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv, 438 rows). At coeff=+1 vs base@0:

| adapter | delta_vs_0 |
|---|---|
| delora | +0.711 |
| dora | +0.404 |
| pissa | +0.368 |
| oft | +0.236 |
| lora | +0.229 |
| ia3 | +0.033 |

Same DeLoRA > DoRA > PiSSA ordering as the v9 100-dilemma slice. IA3 still flat.

T3 prompt baseline (out/sycophancy/prompt_baseline/summary.csv). Engineered prompt vs base @ coeff=0 = +0.370. Simple "be honest" prompt = -0.520 (backfires). DeLoRA dW @ coeff=+1 = +0.711 still beats the strongest prompt intervention by 1.9x. AxBench-style claim survives on the full 438-row split.

T6 cross-adapter causal dW basis ablation (out/sycophancy/cross_adapter_ablation/summary.csv). At coeff=+1, top three: delora/residual_write_full +0.907, delora/shared_keep K=32 +0.736, delora/full_all_tensors +0.711. shared_drop K=8 keeps +0.436 (32% loss); random_keep across all adapters lands at <=+0.022 (effectively zero). Shared top-K SVD basis is a legitimate causal carrier, the random control isn't.

T7 layer/module ablation. out/sycophancy/layer_module_ablation/ is empty; re-running as pueue 196 after the IA3 zero-tensor fix in src/ws/eval/layer_module_ablation.py:_select(). Pending.

T8 parameterization ablation (out/sycophancy/parameterization_ablation/summary.csv). Headline: per-adapter top energy crops match or beat full_dW. delora/top_90pct_energy_S = dd_delta +0.962 (full_dW = +0.711, +35%). dora/top_90pct = +0.415 (full +0.404). pissa/top_25pct = +0.381 (full +0.368). lora/top_90pct = +0.224 (full +0.229). The top quartile/decile by SVD energy is doing all the work. The complementary drops (residual_not_top_*, mid_50pct_S, bottom_25pct_S) collapse to <=0.02 dd_delta everywhere. IA3 is weak across the board (full_dW = +0.033, max variant +0.024). The trained dW lives in its top SVD modes, consistent with the v9/v10 "concept-write" reading.
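
A sketch of what a "top X% energy" crop can look like. It assumes energy means squared singular values and that the crop keeps the fewest leading modes reaching the target fraction; the actual top_90pct_energy_S variant may be defined differently, and `keep_top_energy` is a hypothetical name.

```python
import numpy as np


def keep_top_energy(dW, frac):
    """Keep the fewest leading singular modes whose squared singular values reach `frac` of the total."""
    u, s, vh = np.linalg.svd(dW, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    n_keep = int(np.searchsorted(energy, frac)) + 1  # first index where energy >= frac
    n_keep = min(n_keep, len(s))
    kept = (u[:, :n_keep] * s[:n_keep]) @ vh[:n_keep]
    return kept, dW - kept, n_keep


rng = np.random.default_rng(0)
dW = rng.standard_normal((32, 24))  # toy stand-in for one tensor of the trained diff
kept, dropped, n_keep = keep_top_energy(dW, 0.90)
```

`kept` plays the role of a keep_top crop and `dropped` the matching residual_not_top variant, so the two can be evaluated on the same rows.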

Tick list:

  • Done: T1, T2, T3, T6, T8
  • Pending: T7 (pueue 196 after IA3 fix)
  • Open: T4 multiseed, T5 Gemma replication

The biggest update from today is T8: the keep_top_X energy crops reproduce full_dW behavior, and drop_top crops zero out, which is a stronger version of "the dW is dominated by its top SVD components" than v9 had. Combined with T6 shared_keep K=32 retaining +0.736 (81% of delora/residual_write_full at +0.907, and slightly above full_all_tensors at +0.711), this is mild evidence for a low-rank shared basis at the dW level even though the act-PCA basis missed it (v9/v10).

lens search on hold pending multiseed 2026-04-27

After running T6 (cross-adapter dW basis), T7 (layer/module), and T8 (parameterization, own-SVD lens) and sketching T9 (native parameterization decompositions per adapter), every weight-space lens we tested has a built-in failure mode that prevents a parameterization-invariant mechanism claim, modulo a major caveat at the bottom of this entry:

  • SVD-on-dW: tautological for low-rank adapters. dW = AB^T has at most r nonzero singular values by construction, so "top-K S retains the behavior" is mostly a property of the adapter's parameterization, not a finding about the model. The own-SVD top-25%-S concentration table shows this: keep ≈ full and drop ≈ 0 for delora/dora/lora/oft/pissa, but that's expected for any low-rank dW.
  • Layer-index: tells you depth, not mechanism. Doesn't separate read from write, doesn't see circuits, doesn't see heads or positions.
  • Module-family: collapses heads and sequence positions. Cross-adapter results disagree (delora's residual_write retained=+1.27, lora's=+0.14) so there's no stable "the behavior lives in module X" story.
  • Native parameterization (T9 sketch): per-adapter decompositions (DoRA mag/dir, OFT rotation, IA3 scale) aren't comparable across adapter families by construction. Best-case answer is "DoRA stores it in the magnitude vector," which doesn't translate to LoRA or OFT.
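
The SVD-on-dW tautology is quick to verify on random factors. In the toy sketch below, A and B are random stand-ins for adapter factors with made-up sizes; nothing here depends on training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 128, 96, 8

A = rng.standard_normal((d_out, r))
B = rng.standard_normal((d_in, r))
dW = A @ B.T  # low-rank adapter diff, rank <= r

s = np.linalg.svd(dW, compute_uv=False)
n_nonzero = int(np.sum(s > 1e-8 * s[0]))  # numerically nonzero singular values
top_r_energy = float(np.sum(s[:r] ** 2) / np.sum(s**2))
```

Keeping the top r modes retains all of the energy for any such dW, trained or not, which is why a keep ≈ full, drop ≈ 0 pattern on its own says little about the model.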

Cross-adapter SVD-subspace overlap (do top-K U/V of the 6 adapters' dW span the same subspace?) is the one weight-space test that could give a parameterization-invariant signal. Not run. Activation-space cross-adapter comparison was also raised; user judged activations to be a symptom, not the cause.

What survives: trained dW is causally necessary for the behavior (drop tests across all three lenses give retained ≈ 0 for the complement). What's not supported: any parameterization-invariant mechanism claim. Dropping the lens search.

Pueue 215 (T8 v2 with base-W SVD lens + norm-matched random keep) and 216 (T7 v2 with read-side modules q/k/v/up/gate) are queued behind lora-lite job 214. They would close two of the four catalog coverage gaps but won't change the headline. Leaving queued for now; they're cheap if they run.

Priority redirect: T4 multi-seed and T5 Gemma 1B replication. Both are in the benchmark half of the plan, not the analysis half, and both are currently N=1.

two-goal frame and coverage gaps 2026-04-27

Reframed everything as two goals so the writeup stops mixing them.

Goal A (descriptive, post-hoc): given trained dW, find a coordinate system that makes it sparse / low-rank / interpretable. Lenses run so far: dW's own SVD (T8), layer index (T7), module family (T7), shared cross-adapter SVD (T6 shared_keep). Lenses not run: base-W SVD dS = U0.T @ dW @ V0h.T (with W0 = U0 @ diag(S0) @ V0h), activation PCA at the dW level, adapter-architecture decompositions (DoRA mag/dir, DeLoRA lambda/dir, OFT rotation, IA3 gates).
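
A minimal sketch of the base-W SVD lens, reading V0h as the transposed right singular vectors that `np.linalg.svd` returns, so that the right factor is `V0h.T`. Sizes and matrices are toy stand-ins, and `diag_frac` is a hypothetical summary statistic, not something reported in this journal.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 32, 24  # toy sizes

W0 = rng.standard_normal((d_out, d_in))  # stand-in for pretrained weights
dW = 0.01 * rng.standard_normal((d_out, d_in))

U0, S0, V0h = np.linalg.svd(W0, full_matrices=False)

# Coordinates of dW in the singular basis of W0. Applied to W0 itself the same
# change of basis recovers diag(S0), which is a quick check of the convention.
dS = U0.T @ dW @ V0h.T
S0_check = U0.T @ W0 @ V0h.T

# Share of dS energy on the diagonal, i.e. pure rescaling of existing singular directions.
diag_frac = float(np.sum(np.diag(dS) ** 2) / np.sum(dS**2))
```

With the reduced SVD, any part of dW outside the column space of U0 is dropped from dS, so the lens is lossy for non-square layers unless the full SVD is used.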

Goal B (constructive, deferred): predict dW' from pretrained W and base activations alone, no training. Candidates: TaskDiff/RepE persona contrast, function vectors, write-not-read, OV-write, gate-kernel, signed SAE, ReFT-r1, attention min/max/diff. Benchmark = trained vs constructed dW on identical DD rows. None run yet.

Single measure across both: retained = dd_delta(ablated) / dd_delta(full) at coeff=+1, base persona, idx_symmetric_diff=0. Necessity (drop test): no norm-matching needed. Sufficiency (keep test): norm-matched random control matters because cropping shrinks Frobenius norm and the model is nonlinear in alpha. T7 has random_norm_matched_full; T8 lacks it.
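
The retained measure itself is a one-liner; the sketch below applies it to dd_delta values quoted in the T6/T8 status check. The function name and the dictionary keys are illustrative.

```python
def retained(dd_delta_ablated, dd_delta_full):
    """Fraction of the full steering effect kept by an ablated variant."""
    if dd_delta_full == 0:
        raise ValueError("full dd_delta is zero; retained is undefined")
    return dd_delta_ablated / dd_delta_full


# dd_delta values at coeff=+1 copied from the T6/T8 status check in this journal.
full_delora = 0.711
variants = {
    "delora/residual_write_full": 0.907,
    "delora/shared_keep_K32": 0.736,
    "delora/top_90pct_energy_S": 0.962,
}
table = {name: round(retained(v, full_delora), 2) for name, v in variants.items()}
```

Values above 1 mean the ablated variant steers more than the full dW. When the denominator is near zero, as for IA3, the ratio is too noisy to interpret.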

Coverage gaps to flag in writeup:

  1. Read-side modules (q/k/v/up/gate-only) absent from T7 variants. Any read-side mechanism story is currently untestable.
  2. Base-W SVD lens absent. T8 uses each tensor's own SVD; catalog spec'd base-W SVD as a separate lens. Both are valid, just different questions.
  3. Adapter-architecture decompositions absent from T8 variant set.
  4. T8 sufficiency claims lack a norm-matched random keep control.

Notebook: nbs/ablation_analysis.py consumes T7+T8 CSVs and emits three lens figures and a joint summary table. Runs end-to-end on current outputs.

Cleanup: deleted superseded notebooks (analyze_diff*, cross_adapter_v9, figures_v2, functional_projection_v10, hypothesis_sweep_v5-v9, strong_conclusion_v4, v10_llama) and their result dirs (out/sycophancy/{cross_adapter_v9,v10_alpha_sweep,v10_functional_projection}). Kept nbs/ablation_analysis.py as the single notebook.

2026-04-27: train/eval axis switch — sycophancy → honesty (branch dev)

Why: previous training contrast was SYCOPHANCY_POS ("validate user, never push back") vs SYCOPHANCY_NEG (literally "you are an honest assistant"), so training was 2-axis (sycophancy + honesty) while eval (daily_dilemmas-self-honesty) is signed by honesty_label only. Cross-method deltas were still comparable on identical rows, but the absolute "is steering working" claim was weakened by the train/eval axis mismatch.

Decision: option 2 — switch the entire pipeline to honesty. The NEG persona was already honesty-themed, so this makes the existing implicit axis explicit, eliminates the GPT-4o sycophancy-eval generation burden of option 1, and turns sycophancy into a free OOD transfer eval for later. T4 multiseed and T5 Gemma will be re-scoped against honesty rather than sycophancy.

Persona pair (vgel/repeng convention, short form):

  • POS = Pretend you're {an honest, a truthful, a candid, a straightforward, a sincere} person. ×5 paraphrases
  • NEG = Pretend you're {a dishonest, a deceptive, a manipulative, a misleading, a lying} person. ×5
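
Expanding the persona pair into prompt strings can be sketched as below. The template and paraphrases are the ones listed above; pairing the i-th POS with the i-th NEG paraphrase, the dictionary keys, and the example question are assumptions for illustration and may not match what src/ws/data.py does.

```python
TEMPLATE = "Pretend you're {persona} person."

# Paraphrases copied from the POS / NEG lists above.
POS = ["an honest", "a truthful", "a candid", "a straightforward", "a sincere"]
NEG = ["a dishonest", "a deceptive", "a manipulative", "a misleading", "a lying"]


def persona_prompts(personas):
    """Expand each paraphrase into a full system-prompt string."""
    return [TEMPLATE.format(persona=p) for p in personas]


def contrast_pairs(question):
    """Pair the i-th POS prompt with the i-th NEG prompt for one question."""
    return [
        {"question": question, "pos_system": pos, "neg_system": neg}
        for pos, neg in zip(persona_prompts(POS), persona_prompts(NEG))
    ]


pairs = contrast_pairs("Did you finish the report?")  # hypothetical question
```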

Question pool: data/branching_suffixes.json (550 SSteer entries, (cat, suffix, user_msg)). For paper-recipe training (Fierro & Roger 2025 §F.1) we use user_msg only as the question: the persona-conditioned model generates r_pos / r_neg, and the LoRA then trains on that pair with the system prompt stripped, so the adapter learns the behavior unconditionally. For RepE direction extraction we capture last-token hidden states with assistant_prefixes=suffix (no backprop, just observation).

Code changes (uncommitted on dev):

  • src/ws/data.py: HONESTY_PROMPT, HONESTY_POS_PERSONAS, HONESTY_NEG_PERSONAS, _load_suffixes(), behavior branches in _personas, _topics, _build_specs.
  • src/ws/eval/activation_baseline.py: _fit_repe_directions(..., behavior) honesty branch using suffix prompts + assistant_prefixes.
  • src/ws/eval/prompt_baseline.py: paired engineered_prompt_honest + engineered_prompt_dishonest (AxBench J.2).
  • evals/smoke.py: behavior field in SmokeCfg. Smoke passes end-to-end on katuni4ka/tiny-random-qwen3 with --behavior honesty.
  • data/branching_suffixes.json: copied from SSteer.

Pueue: killed sycophancy runs (215-228). Queued honesty pipeline:

  • 230: run_sweep --behavior honesty (1000 pairs, 7 adapters: lora/dora/pissa/delora/oft/boft/ia3), running.
  • 231-236 chained --after 230: T1 RepE, T3 prompt baseline, T2 full DD, T6 cross-adapter, T7 layer/module, T8 parameterization (all --behavior honesty --n-dilemmas 219 --batch-size 8).

Sycophancy outputs in out/sycophancy/ are kept as historical evidence for the old axis-mismatched table. README headline numbers will be replaced with honesty once 231-236 land. T4/T5 remain open.