- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng short-form), _load_suffixes() reading data/branching_suffixes.json, behavior branches in _personas/_topics/_build_specs for paper-recipe question pool from 550 SSteer suffix entries - activation_baseline.py: _fit_repe_directions branches on behavior; honesty mode captures last-token hidden states under pos/neg personas with assistant_prefixes from suffix entries (all-layers RepE) - prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench J.2), both as plain strings - evals/smoke.py: behavior field in SmokeCfg - data/branching_suffixes.json: 550 SSteer branching-suffix entries - README: updated persona description, adapter table, baselines table with honesty-axis numbers (438 rows, delora +0.237 best) - RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry - fork_plan.md: open design question resolved as option 2 (honesty axis) - HANDOVER.md: overnight handover notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
35 KiB
Research log - append to bottom only, ideally give commit / branch time. newest first
init 2026-04-26 06:32:46
see @fork_plan.md and head of README.md for overall plan and principles.
made notebook 2026-04-26 10:32:40
Where does the LoRA's Δa live?
In a task-derived hidden subspace that is not the lm_head readout. Held-out energy ratios (mean over LoRA-active layers 8–27, rank-8 each):
I just want to check you have the framing right
we are searching for how to modify pretrained LLM's from scratch
so we have A)
W the pretrained weights for a given linear layer and hs_diff, the output diff between cho and rej, the persona preffixed inputs strings we can do thinks like project this onto the write_not_read, supresed or churn subspace. As well as via U into the S space
B) now in terms of clues and label or whatever we have deltaW how the two lora differed and hs_diff2 how steering left and right differ on the same task (no persona prefix)
we are comparing A the hypothesis to B the label
lora lite 2026-04-26 12:32:10
the peft library is not very hackable and htere is not lora light library we can make one by
- using pytorch or transformers forward hooks
- adding weights or param dicts or buffers to the linear layers we modify, but each has a prefix like
lora_so we can save and load them via full path - we can use all layer, all residual writers, all residual readers with simple logic that looks at isinstance(linear). and if the W.shape is assymetric seeing if the input of output matching residual stream shape, otherwise falling back on hardcoded prefixes like o_out, out_proj, ml_down, etc
- make heavy use of einops, jaxtyping, eumsum to make dims obvious and allow beartyping
- psueodcode like, fail first, no defensive, no fallback, links to papers
- simple single dataclass config
- not 4bit or 8bit unless we can think of a simple way that works for all (cast all to bf16 in hook?)
- model requires no grad except what we add
- obviouslly easy to add new ones! easy to add special init's
- look at what made PiSSA, SSVD, DeLora and other papers code hard to implement and make sure we have are comptabable
2026-04-26: hyperparameter correction + 0.6B vs 1.7B comparison
Paper Axolotl config shows lr=2e-4, lora_alpha=64 (alpha/rank=2.0).
Our previous runs used lr=1e-5, alpha=16 (alpha/rank=0.5) — 20x too slow, wrong alpha.
Three-run comparison (all: r32, 3 epochs, sycophancy, lora, 1000 pairs, 20×5×10):
| run | lr / alpha | ||w|| | spread (α:-2→+2) | val_loss@ep3 | converged? |
|---|---|---|---|---|---|
| task-50: 0.6B bad-lr | 1e-5 / 16 | 0.165 | 5.16 | 1.035 | no (still dropping) |
| task-53: 0.6B paper-lr | 2e-4 / 64 | 5.981 | 12.85 | 0.706 | yes (U-curve ep2) |
| task-54: 1.7B paper-lr | 2e-4 / 64 | 9.262 | 36.61 | 0.873 | no (still dropping) |
Key findings:
- Corrected hyperparams: 36x larger ||w||, 2.5x more steering spread for 0.6B.
- 0.6B val_loss bottoms at ep~1.94 (0.701), mild U-curve; best to stop at epoch 2.
- 1.7B base is anti-sycophantic at α=0 (logratio=-3.58 vs +2.73 for 0.6B) — larger model has better calibration and correctly says No to false claims without steering.
- 1.7B on-policy CoT saturates at ±20 nats; off-policy non-monotone at negative alpha (α=-2 less negative than α=-1). Linear approximation breaks down at ||w||=9.26, |alpha|=2. Tighter alpha range (±0.5, ±1.0) needed for 1.7B to stay in linear regime.
- 1.7B val_loss higher (0.873 vs 0.706) — training data generated by 0.6B, distribution mismatch reduces signal. Should regen data from 1.7B for clean comparison.
- OOD generalization (tabs/spaces claim) strong for both models at paper hyperparams.
2026-04-26: 1.7B own-data regen + adapter-family sweep (tasks 58-59)
1.7B with own-data + tighter alpha (task 58)
Re-ran 1.7B training with data generated by 1.7B itself (out/data/1.7B, 1000 pairs) and tighter coeff sweep (-1.0 to +1.0) to stay in the linear regime.
| coeff | mean_logratio | std | pmass |
|---|---|---|---|
| -1.0 | -17.10 | 2.996 | 1.000 |
| -0.5 | -12.28 | 7.025 | 1.000 |
| 0.0 | -3.58 | 10.698 | 1.000 |
| +0.5 | +7.73 | 10.564 | 1.000 |
| +1.0 | +16.27 | 5.877 | 1.000 |
logratio_spread = +33.375, pmass_min = 1.000. Fully monotone, no linear-regime breakdown. Baseline at alpha=0 is -3.58 (1.7B naturally less sycophantic than 0.6B at alpha=0 which was +2.73).
Adapter-family sweep (task 59): lora vs dora vs pissa vs delora on 0.6B
| adapter | logratio_spread | pmass_min | ratio_weak_write | wall_s |
|---|---|---|---|---|
| lora | +9.76 | 1.000 | 0.885 | 259 |
| dora | +9.76 | 1.000 | 0.879 | 321 |
| pissa | +17.40 | 0.999 | 1.086 | 326 |
| delora | +23.85 | 0.788 | 0.890 | 267 |
Key findings:
- LoRA ≈ DoRA within 0.1% spread; DoRA adds no steering information. Expected <20% variation holds.
- PiSSA gives 78% more spread than LoRA (17.4 vs 9.76) AND ratio_weak_write > 1 (1.086 vs 0.885). This is the first case where a non-LoRA adapter outperforms on both spread AND subspace alignment. PiSSA initializes from SVD of W0, which may place the diff more in the task-relevant subspace.
- DeLoRA gives the most spread (+23.85) but pmass drops to 0.788 — outside the linear regime. The larger diff norm likely saturates the logratio. ratio_weak_write (0.890) similar to LoRA/DoRA.
- Note: LoRA spread is 9.76 here vs 12.85 in task 53. Both use same data (out/data) but different random seeds and output dirs. ~24% run-to-run variation is within expected training noise.
Verdict: adapter family does matter, contrary to null hypothesis. PiSSA is the Pareto winner: more spread, better subspace alignment, pmass stays near 1. DeLoRA is strongest raw but unreliable.
Daily dilemmas OOD honesty transfer (tasks 66-67, corrected)
Three bugs found and fixed before getting valid results:
- HF datasets caching:
.map()returned stale tokenized sequences when EVAL_HEADER changed. Fix:load_from_cache_file=False. </think>as text string ≠ the special close token. Must inject the actual token ID. Fix: in_format_row, afterapply_chat_template, detect open<think>(id 151667) without matching</think>(id 151668) and inject close token +\n\nbefore the answer anchor.- DataLoader right-padding:
DataCollatorWithPaddingpads on right by default, sologits[:, -1]hit a padding token for shorter sequences. Fix:tok.padding_side = "left"before DataLoader. This was the main cause of pmass=0.17 -- reading logits at a padding position, not the anchor.
0.6B results (per-persona breakdown):
| persona | coeff | mean_lrh | pmass |
|---|---|---|---|
| honest_engineer | 0.0 | +0.851 | 0.957 |
| base | -2.0 | -0.786 | 0.975 |
| base | -1.0 | +0.299 | 0.954 |
| base | 0.0 | +1.316 | 0.938 |
| base | +1.0 | +1.828 | 0.962 |
| base | +2.0 | +1.645 | 0.989 |
- pmass: 0.938-0.989. frac_low_pmass = 0.
- Mostly monotone; slight dip at alpha=+2 suggests approaching linear-regime boundary.
- base@alpha=0 = +1.316 (0.6B is already quite honest at baseline on this dataset).
- AxBench: steering@+1 (+1.828) vs honest_engineer persona (+0.851) → weight diff 2.15× stronger. PASS: the weight diff adds information beyond prompting.
1.7B results (per-persona breakdown):
| persona | coeff | mean_lrh | pmass |
|---|---|---|---|
| honest_engineer | 0.0 | -1.817 | 1.000 |
| base | -1.0 | -0.760 | 1.000 |
| base | -0.5 | -0.590 | 1.000 |
| base | 0.0 | -0.299 | 1.000 |
| base | +0.5 | +0.526 | 1.000 |
| base | +1.0 | +1.504 | 1.000 |
- pmass: 1.000 across all rows (1.7B is more decisive than 0.6B).
- Perfectly monotone sweep. Spread -1.0→+1.0 = 2.264 nats (smaller than 0.6B's 2.614 over same range, likely because 1.7B has better baseline calibration reducing the margin for steering).
- Persona BACKFIRES for 1.7B: honest_engineer (-1.817) is worse than base (-0.299). Hypothesis: 1.7B has a more nuanced distinction between "software factual honesty" and "moral honesty in dilemmas". The persona activates factual-accuracy behavior, not moral-choice behavior. This is a dataset-persona confound, not a failure of the steering approach.
- AxBench: steering@+1 (+1.504) still clearly beats persona (-1.817). PASS.
Cross-model comparison:
Weight steering transfers sycophancy diff to OOD honesty dilemmas for both models. 0.6B shows higher absolute effect (base already honest, persona helps), while 1.7B shows cleaner monotonicity and perfect pmass but the "honest engineer" persona backfires at 1.7B scale.
Research journal — weight-steering
2026-04-27 — v9 cross-adapter results: DeLoRA wins; subspace-finding methods fail
tl;dr
- DeLoRA is the strongest daily-dilemmas steerer by a large margin (base-only
delta = +0.70 logratio at coeff=+1 vs base, ~1.7x DoRA/PiSSA,
~2.7x LoRA/OFT, ~23x IA3). NOTE: an earlier table used +0.94 because
dilemmas_summary.csvmixed base@0 with the AxBenchhonest_engineerpersona baseline at coeff=0.cross_adapter_v9.pynow uses base-onlydilemmas_per_row.csv. - Every linear "find the planning subspace" method we tried lands at ~1-8% subspace overlap with the weight oracle. Across 6 adapter families, on every LoRA layer (8-21). Both cumulative and block-local act oracles. So either the right subspace really is small and we keep missing it, or "planning subspace" isn't the right frame.
Headline numbers (cross_adapter_v9)
Behavioral steering on daily-dilemmas honesty subset (base persona only; 100 dilemmas / 200 action rows per nonzero coeff):
| adapter | logratio @ -1 | @ 0 (base) | @ +1 | delta(+1 - 0) |
|---|---|---|---|---|
| delora | -0.29 | 1.32 | 2.02 | +0.70 |
| dora | 0.73 | 1.32 | 1.72 | +0.41 |
| pissa | 0.44 | 1.32 | 1.69 | +0.37 |
| oft | 1.09 | 1.32 | 1.57 | +0.26 |
| lora | 1.09 | 1.32 | 1.55 | +0.23 |
| ia3 | 1.29 | 1.32 | 1.35 | +0.03 |
DeLoRA is the only adapter that meaningfully de-steers (negative coeff → dishonest). LoRA/OFT/IA3 are nearly flat at coeff=-1.
Subspace overlap with w_oracle (mean across LoRA layers 8-21, top-PCS=8):
| adapter | act_oracle (cumul) | act_oracle (block-local v9) |
|---|---|---|
| oft | 0.046 | 0.045 |
| pissa | 0.036 | 0.042 |
| lora | 0.034 | 0.016 |
| ia3 | 0.031 | 0.029 |
| dora | 0.024 | 0.015 |
| delora | 0.017 | 0.016 |
Note the inversion: the strongest behavioral steerer (DeLoRA) has the lowest subspace alignment with act_oracle. The weakest (IA3) is mid-pack on overlap. "Subspace alignment with the activation-difference oracle" is not predictive of behavioral effect across adapter families.
What v9 ruled out
- Scope mismatch: hypothesis was that hs_diff_B[L] is cumulative (includes all upstream LoRA writes) while dW[L] is local, so the cumulative act_oracle was looking at the wrong thing. v9 added block_diff_B[L] = what block L itself wrote, and re-derived the oracle from that. Result: block-local barely moves overlap (1-5% in either direction) — sometimes up, sometimes down, no consistent improvement. So scope is NOT the culprit.
- Layer L=8 sanity: at the first LoRA layer, cumulative ≈ block (overlap 1.0 for 5/6 adapters; IA3 fails because IA3Config doesn't accept layers_to_transform so it adapts every layer). So the metric is consistent; cumulative just diverges from block as we accumulate upstream LoRA writes.
What this falsifies
The "shared low-rank planning subspace" frame as written in docs/blog_adapter_as_hypothesis. If a small (rank ≤ 8) subspace contained the honesty/sycophancy task structure, we'd expect the weight oracle and the activation oracle (which by construction captures top-PCS energy of the behavioral diff) to agree on at least one of: substance, scope, or family. They don't agree on any of them, across 6 different LoRA-family inductive biases.
Two surviving stories:
- The right subspace is tiny but specific — maybe ~3% overlap is "the right 3%" and the 97% orthogonal part of dW is dead weight that doesn't affect behavior. Falsifiable: project dW onto top-K right SVs of act_oracle for K ∈ {1,2,4,8}, run dilemmas, see if delta_pos_minus_zero survives. If yes, our metric is just the wrong norm. If no, the framing is wrong.
- The frame is wrong — behavior emerges from how dW interacts with the full activation manifold non-linearly through the rest of the network, not from alignment with a top-PCS basis. The act_oracle PCA captures variance, not function.
I lean (2). The fact that DeLoRA has the worst overlap and the best behavior is hard to explain under (1).
What's interesting about DeLoRA winning
DeLoRA's parametrization (decoupled magnitude + normalized direction, like DoRA but with stronger decoupling — see Bini 2024) seems to produce a more swingy steering vector: it's the only adapter where coeff=-1 actively de-honests the model below baseline. Hypothesis: DeLoRA's normalization forces the update to be a coherent direction rather than a magnitude-driven blob, so scaling it ±α actually traverses the behavioral axis. Other adapters (LoRA, OFT) collapse asymmetrically on the negative side because much of their delta is magnitude-not-direction.
This would mean DeLoRA is a better probe for behavior steering even though all six adapters land in similar regions of weight space when you measure by subspace overlap.
Next experiments
- (claim 1 falsifier) projection-ablation: w_proj_K = project dW onto top-K act_oracle dirs, run dilemmas, see if delta_pos_minus_zero is preserved.
- DeLoRA at higher rank — does the behavioral lead grow or saturate?
- DeLoRA × different behaviors (not just sycophancy/honesty) to see if the swinginess generalizes.
- Replace "subspace alignment" with a functional metric: causally ablate dW components and measure behavior change directly, not via PCA bases.
Artifacts
- out/sycophancy/cross_adapter_v9/{headline.csv, scope_summary.csv, dilemmas_summary.csv, scope_bars.png}
- out/sycophancy/{lora,dora,pissa,delora,oft,ia3}/v9/v9_scope_diagnostic.csv — per-layer scope_diagnostic for all 6 adapters, all show 1-8% overlap on every LoRA layer.
- nbs/hypothesis_sweep_v9.py — block-local oracle, scope diagnostic, ADAPTER env var.
- nbs/cross_adapter_v9.py — aggregator across the 6 adapter families.
- All 18 pueue jobs (88-105) finished Success.
2026-04-27 — v10 functional projection: overlap metric failed, but act_oracle is not the trained steering subspace
Question: maybe the ~3% overlap between w_oracle and act_oracle_block is
still "the right 3%". If yes, projecting dW onto the act_oracle basis should
preserve daily-dilemmas steering.
v10 changed the metric from geometric overlap to behavior:
- Build block-local
act_oraclefrom the adapter's sycophancy probe effect. - Decompose residual-output tensors (
o_proj,down_proj) into:project_act_block:P_{act,K} dWcomplement_act_block:(I - P_{act,K}) dWproject_act_block_normmatched: same projection scaled to residual-write norm
- Run the same daily-dilemmas honesty logratio.
Main K=32 result (n=40 dilemmas / 80 rows)
| adapter | full Δ | residual-write Δ | raw projection / residual | normmatched / residual | complement / residual | read |
|---|---|---|---|---|---|---|
| delora | +0.628 | +0.844 | 0.07 | 0.30 | 0.89 | clean counterexample: trained behavior mostly in complement |
| pissa | +0.373 | +0.242 | 0.47 | 1.14 | 0.64 | mixed: act projection is functional but not sole carrier |
| oft | +0.216 | +0.148 | -0.01 | 1.57 | 0.69 | act direction potent only after amplification |
| dora | +0.370 | +0.031 | -0.50 | 1.15 | 1.20 | residual-write split barely explains full effect |
| lora | +0.173 | -0.022 | noisy | noisy | noisy | residual-write split wrong-sign |
| ia3 | -0.048 | +0.002 | noisy | noisy | noisy | denominator too small |
So v10 mostly kills the strongest loophole for DeLoRA: the act_oracle projection at trained scale does not carry the steering; the complement does. PiSSA/OFT are subtler: the act_oracle directions are real and causal when amplified, but the trained adapter did not put enough norm there for them to explain behavior at alpha=1.
Alpha sweep at K=32 (informative adapters only)
Δ = daily-dilemmas honesty logratio minus base.
| adapter | alpha | residual | raw projection | normmatched projection | complement |
|---|---|---|---|---|---|
| delora | 0.5 | +0.387 | -0.020 | +0.100 | +0.361 |
| delora | 1.0 | +0.844 | +0.061 | +0.252 | +0.755 |
| delora | 2.0 | +1.858 | +0.097 | +0.503 | +1.680 |
| delora | 4.0 | +2.150 | +0.183 | +0.430 | +2.739 |
| pissa | 0.5 | +0.159 | +0.044 | +0.217 | +0.036 |
| pissa | 1.0 | +0.242 | +0.114 | +0.277 | +0.155 |
| pissa | 2.0 | +0.327 | +0.206 | +0.394 | +0.270 |
| pissa | 4.0 | +0.578 | +0.350 | +0.473 | +0.380 |
| oft | 0.5 | +0.059 | -0.020 | +0.077 | +0.034 |
| oft | 1.0 | +0.148 | -0.002 | +0.233 | +0.103 |
| oft | 2.0 | +0.309 | +0.084 | +0.478 | +0.184 |
| oft | 4.0 | +0.564 | +0.166 | +0.864 | +0.381 |
Interpretation sequence:
- v9 overlap was too strict as a potency metric. Normmatched act projections can steer strongly, especially OFT/PiSSA.
- v9 overlap was not wrong as a trained-scale carrier metric. Raw act projection usually carries little of the actual trained behavior.
- DeLoRA is the cleanest finding: residual-write Δ is strong (+0.844), raw act projection is weak (+0.061), complement is strong (+0.755). That is hard to reconcile with "the right 3% explains the behavior".
- PiSSA/OFT suggest the act_oracle subspace is a useful intervention target, not the subspace the adapter naturally chose.
My current model: PCA of activation differences finds directions where small weight writes can be high-gain, but trained adapter behavior is distributed through a larger residual-write complement plus read/gate/up paths. The word "planning subspace" is probably misleading unless we define it causally (what intervention changes behavior), not geometrically (what basis overlaps).
Clarification on the object of search: we were looking for a subspace or
parameterization that explains the difference between the positive and
negative LoRAs, i.e. a basis in which dW = θ_pos - θ_neg becomes simple or
functionally concentrated. v9/v10 speak directly to that for residual-output
weights. The answer so far is: not adapter family alone (LoRA/DoRA/PiSSA/
DeLoRA/OFT/IA3), not the canonical pretrained-weight bases, not the attention
or MLP read/write primitives, not the attention-selected token bases, not the
persona-contrast bases, and not the top activation-PCA directions at trained
scale. That does not mean no compact causal description exists. It means our
tested parameterizations and geometric descriptions are not it.
complement means the part of the residual-output weight diff left after
removing the activation-PCA subspace:
dW_{complement} = (I - P_{act,K}) dW.
If P_act dW keeps the behavior, then the small overlap was the right small
piece. If (I - P_act) dW keeps the behavior, then the trained adapter's
behavior is mostly outside that tested act-PCA subspace. DeLoRA gave the clean
case: raw projection keeps 7% of residual-write behavior, complement keeps 89%.
So for the strongest adapter, act-PCA is not an explanation of the learned
weight diff, though it can still be a useful amplified intervention direction.
Artifacts
- nbs/functional_projection_v10.py
- docs/spec/20260427_v10_functional_projection.md
- out/sycophancy/v10_functional_projection/{behavior_summary.csv, behavior_by_coeff.csv, spectra_and_projection.csv}
- out/sycophancy/v10_alpha_sweep/{behavior_summary.csv, behavior_by_coeff.csv, spectra_and_projection.csv}
v10 wendler-style functional probe 2026-04-27 07:04:57 (dev, a40fd35)
nbs/v10_llama.py ports Wendler et al. 2024 ("Do Llamas Work in English?")
token-energy + logit-lens to the sycophancy LoRA on Qwen3-0.6B. Cheaper than
v9 (no PCA, no oracle, no SVD per candidate) and asks a different question:
not "does B span Δh" but "is the (e_yes − e_no) readout axis in B" and "is
Δh itself readable by lm_head".
Findings
- Δh is unreadable at LoRA layers. Token energy
E²(Δh)peaks at 0.010 in LoRA layers 8..21 vsE²(clean) ≈ 22at the same layers. Logit-lens Yes-No on Δh is essentially zero at LoRA layers, rising to peak0.66 natsat layer 25 (post-LoRA). - The LoRA writes concepts, not tokens. Top decoded tokens from Δh at
the peak layer:
+Δh: " why / reasons / 理由 / supporting / Reasons / 为什么"−Δh: " nonexistent / unauthorized / truthful / 未经" Multilingual concepts of "give-reasons / supportive". The LoRA does not write " Yes" — downstream layers translate the concept into Yes/No.
- No rank-8 base-model subspace contains the Yes-No axis.
cap_yn(B) = ‖P_B(e_yes − e_no)‖² / ‖e_yes − e_no‖²averaged over LoRA layers: lm_head_read = 0.042 (best A-side, 5× null), TaskDiff_lora_fit = 0.014, write = 0.008, random_null = 0.008 = 8/1024. Even act_oracle (in-sample ceiling) gets 0.014 because Δh is in concept space, not on the readout axis. Sanity passes:act_oracle.preserved_E ≈ 1.000,random_null.cap_yn ≈ PCS/d.
Punchline
Searching for "the subspace the LoRA uses" via canonical base-model bases or persona-contrast PCA fails not because the hypotheses are crazy, but because the LoRA's solution is concept space and doesn't sit in any low-rank readable basis. The hypothesised directions remain usable for steering with amplification (v10 functional projection result above) — they're just not what the trained model is doing, so you can't use them to interpret what the LoRA learned.
This decomposes the v9 negative finding into three separate facts that v9's PCA-span metric conflated:
- Δh doesn't carry the Yes-No readout axis at LoRA layers (panel b: ldiff on Δh ≈ 0 in 8..21, peaks only at layer 25).
- No rank-8 candidate (incl. lm_head's own SVD) contains the Yes-No axis (panel c: best cap_yn = 0.042).
- The mechanism is concept-write at LoRA layers → downstream non-linear translation to Yes/No, matching Wendler's three-phase picture.
Why we still can't find the concept space
act_oracle is post-hoc (defined from Δh itself, in-sample). The closest
out-of-sample candidate is TaskDiff_lora_fit (PCA on FIT-half α=±1
diffs), which gets preserved_E = 0.109 at rank 8. Above null but loses
90% of Δh's readable energy. Why:
- Rank-8 is too narrow. The concept "be supportive / give reasons" is
not one direction. It spans polite vs blunt, justification vs assertion,
multilingual variants. PCA at rank 8 collapses these — silent rank loss.
TODO: rerun TaskDiff_lora_fit at rank 32, 64, 128. If
preserved_Esaturates near 1.0 by rank 64, concept space is just higher-dimensional than 8. - Linear span vs cluster. PCA finds a linear span. If the concept space is a manifold of related-but-not-collinear directions (e.g. one "reasons" axis per topic), span overlap is low even when the cluster is perfectly captured. CHaRS-style per-cluster translations would catch this; v9 marked them "structurally penalized".
- Persona contrast is the wrong concept. TaskDiff_contrast scores cap_yn = 0.012 because persona ≠ sycophancy. Persona induces style; sycophancy is more specific (justify-the-user's-claim). Need concept-anchored data: contrast prompts that elicit " because / 因为 / supporting" against bland answers, build basis from that.
- Wrong layer. Linear probes at LoRA layers see concept-space writes; Yes/No is only readable at layer 25. A Yes-No probe trained at layer 25 and back-projected through residual paths might recover the concept axis from the receiving end.
- Wrong tool. PCA is unsupervised; DAS (Distributed Alignment Search) learns a rotation that isolates task-causal directions, doesn't assume low rank. SAE features could expose a "supporting/reasons" feature directly. Either is the principled next step.
Artifacts
- nbs/v10_llama.py
- out/sycophancy/lora/v10/{v10_wendler_metrics.png, v10_wendler_metrics.pdf, v10_table.tsv, v10_caption.md, v10_per_layer.csv}
2026-04-27 fork_plan T1-T8 status check (dev)
Walked through fork_plan tasks T1-T8 against the latest CSVs to see which UAT goals are ticked.
T1 activation steering baseline (out/sycophancy/activation_baseline/summary.csv).
Repeng on layers 8-21, full 438-dilemma set (dd_zero=+0.195, syc_zero=+2.698).
Best repeng dd_delta = +0.071 (layer 9, coeff=-4); at coeff=+1 the best is
+0.0070 (layer 13). dW:delora at coeff=+1 is dd_delta=+0.337 on this same
slice. Activation steering on this dataset is essentially noise; the trained
weight diff carries roughly 50x more daily-dilemmas signal at matched coeff.
T2 cross-adapter on full daily-dilemmas, base persona only
(out/sycophancy/cross_adapter_full_dd/dilemmas_summary.csv, 438 rows).
At coeff=+1 vs base@0:
| adapter | delta_vs_0 |
|---|---|
| delora | +0.711 |
| dora | +0.404 |
| pissa | +0.368 |
| oft | +0.236 |
| lora | +0.229 |
| ia3 | +0.033 |
Same DeLoRA > DoRA > PiSSA ordering as the v9 100-dilemma slice. IA3 still flat.
T3 prompt baseline (out/sycophancy/prompt_baseline/summary.csv). Engineered
prompt vs base @ coeff=0 = +0.370. Simple "be honest" prompt = -0.520
(backfires). DeLoRA dW @ coeff=+1 = +0.711 still beats the strongest prompt
intervention by 1.9x. AxBench-style claim survives on the full 438-row split.
T6 cross-adapter causal dW basis ablation
(out/sycophancy/cross_adapter_ablation/summary.csv). At coeff=+1, top three:
delora/residual_write_full +0.907, delora/shared_keep K=32 +0.736,
delora/full_all_tensors +0.711. shared_drop K=8 keeps +0.436 (32% loss);
random_keep across all adapters lands at <=+0.022 (effectively zero). Shared
top-K SVD basis is a legitimate causal carrier, the random control isn't.
T7 layer/module ablation. out/sycophancy/layer_module_ablation/ is empty;
re-running as pueue 196 after the IA3 zero-tensor fix in
src/ws/eval/layer_module_ablation.py:_select(). Pending.
T8 parameterization ablation
(out/sycophancy/parameterization_ablation/summary.csv). Headline: per-adapter
top energy crops match or beat full_dW. delora/top_90pct_energy_S =
dd_delta +0.962 (full_dW = +0.711, +35%). dora/top_90pct = +0.415
(full +0.404). pissa/top_25pct = +0.381 (full +0.368). lora/top_90pct = +0.224
(full +0.229). The top quartile/decile by SVD energy is doing all the work.
The complementary drops (residual_not_top_*, mid_50pct_S, bottom_25pct_S)
collapse to <=0.02 dd_delta everywhere. IA3 is weak across the board
(full_dW = +0.033, max variant +0.024). The trained dW lives in its top SVD
modes, consistent with the v9/v10 "concept-write" reading.
Tick list:
- Done: T1, T2, T3, T6, T8
- Pending: T7 (pueue 196 after IA3 fix)
- Open: T4 multiseed, T5 Gemma replication
The biggest update from today is T8: the keep_top_X energy crops reproduce full_dW behavior, and drop_top crops zero out, which is a stronger version of "the dW is dominated by its top SVD components" than v9 had. Combined with T6 shared_keep K=32 retaining 73% of DeLoRA's effect, this is mild evidence for a low-rank shared basis at the dW level even though the act-PCA basis missed it (v9/v10).
lens search on hold pending multiseed 2026-04-27
After running T6 (cross-adapter dW basis), T7 (layer/module), and T8
(parameterization, own-SVD lens) and sketching T9 (native parameterization
decompositions per adapter), every weight-space lens we tested has a
built-in failure mode that prevents a parameterization-invariant mechanism
claim, modulo a major caveat at the bottom of this entry:
- SVD-on-
dW: tautological for low-rank adapters.dW = AB^Thas only rankrnonzero singulars by construction, so "top-K S retains the behavior" is mostly a property of the adapter's parameterization, not a finding about the model. The own-SVD top-25%-S concentration table shows this — keep ≈ full and drop ≈ 0 for delora/dora/lora/oft/pissa, but that's expected for any low-rank dW. - Layer-index: tells you depth, not mechanism. Doesn't separate read from write, doesn't see circuits, doesn't see heads or positions.
- Module-family: collapses heads and sequence positions. Cross-adapter results disagree (delora's residual_write retained=+1.27, lora's=+0.14) so there's no stable "the behavior lives in module X" story.
- Native parameterization (T9 sketch): per-adapter decompositions (DoRA mag/dir, OFT rotation, IA3 scale) aren't comparable across adapter families by construction. Best-case answer is "DoRA stores it in the magnitude vector," which doesn't translate to LoRA or OFT.
Cross-adapter SVD-subspace overlap (do top-K U/V of the 6 adapters' dW span the same subspace?) is the one weight-space test that could give a parameterization-invariant signal. Not run. Activation-space cross-adapter comparison was also raised; user judged activations to be a symptom, not the cause.
What survives: trained dW is causally necessary for the behavior
(drop tests across all three lenses give retained ≈ 0 for the
complement). What's not supported: any parameterization-invariant
mechanism claim. Dropping the lens search.
Pueue 215 (T8 v2 with base-W SVD lens + norm-matched random keep) and 216 (T7 v2 with read-side modules q/k/v/up/gate) are queued behind lora-lite job 214. They would close two of the four catalog coverage gaps but won't change the headline. Leaving queued for now; they're cheap if they run.
Priority redirect: T4 multi-seed and T5 Gemma 1B replication. Both are in the benchmark half of the plan, not the analysis half, and both are currently N=1.
two-goal frame and coverage gaps 2026-04-27
Reframed everything as two goals so the writeup stops mixing them.
Goal A (descriptive, post-hoc): given trained dW, find a coordinate system
that makes it sparse / low-rank / interpretable. Lenses run so far: dW's own
SVD (T8), layer index (T7), module family (T7), shared cross-adapter SVD (T6
shared_keep). Lenses not run: base-W SVD dS = U0.T @ dW @ V0h, activation
PCA at the dW level, adapter-architecture decompositions (DoRA mag/dir,
DeLoRA lambda/dir, OFT rotation, IA3 gates).
Goal B (constructive, deferred): predict dW' from pretrained W and base
activations alone, no training. Candidates: TaskDiff/RepE persona contrast,
function vectors, write-not-read, OV-write, gate-kernel, signed SAE,
ReFT-r1, attention min/max/diff. Benchmark = trained vs constructed dW on
identical DD rows. None run yet.
Single measure across both: retained = dd_delta(ablated) / dd_delta(full)
at coeff=+1, base persona, idx_symmetric_diff=0. Necessity (drop test):
no norm-matching needed. Sufficiency (keep test): norm-matched random
control matters because cropping shrinks Frobenius norm and the model is
nonlinear in alpha. T7 has random_norm_matched_full; T8 lacks it.
Coverage gaps to flag in writeup:
- Read-side modules (q/k/v/up/gate-only) absent from T7 variants. Any read-side mechanism story is currently untestable.
- Base-W SVD lens absent. T8 uses each tensor's own SVD; catalog spec'd base-W SVD as a separate lens. Both are valid, just different questions.
- Adapter-architecture decompositions absent from T8 variant set.
- T8 sufficiency claims lack a norm-matched random keep control.
Notebook: nbs/ablation_analysis.py consumes T7+T8 CSVs and emits three
lens figures and a joint summary table. Runs end-to-end on current outputs.
Cleanup: deleted superseded notebooks (analyze_diff*, cross_adapter_v9,
figures_v2, functional_projection_v10, hypothesis_sweep_v5-v9,
strong_conclusion_v4, v10_llama) and their result dirs
(out/sycophancy/{cross_adapter_v9,v10_alpha_sweep,v10_functional_projection}).
Kept nbs/ablation_analysis.py as the single notebook.
2026-04-27: train/eval axis switch — sycophancy → honesty (branch dev)
Why: previous training contrast was SYCOPHANCY_POS ("validate user, never push back") vs SYCOPHANCY_NEG (literally "you are an honest assistant"), so training was 2-axis (sycophancy + honesty) while eval (daily_dilemmas-self-honesty) is signed by honesty_label only. Cross-method deltas were still comparable on identical rows, but the absolute "is steering working" claim was weakened by the train/eval axis mismatch.
Decision: option 2 — switch the entire pipeline to honesty. The NEG persona was already honesty-themed, so this makes the existing implicit axis explicit, eliminates the GPT-4o sycophancy-eval generation burden of option 1, and turns sycophancy into a free OOD transfer eval for later. T4 multiseed and T5 Gemma will be re-scoped against honesty rather than sycophancy.
Persona pair (vgel/repeng convention, short form):
- POS =
Pretend you're {an honest, a truthful, a candid, a straightforward, a sincere} person.×5 paraphrases - NEG =
Pretend you're {a dishonest, a deceptive, a manipulative, a misleading, a lying} person.×5
Question pool: data/branching_suffixes.json (550 SSteer entries, (cat, suffix, user_msg)). For paper-recipe training (Fierro & Roger 2025 §F.1) we use user_msg only as the question; the persona-conditioned model generates r_pos / r_neg, then LoRA trains on the pair, then system prompt is stripped at train time so the adapter learns behavior unconditionally. For RepE direction extraction we capture last-token hidden states with assistant_prefixes=suffix (no backprop, just observation).
Code changes (uncommitted on dev):
src/ws/data.py:HONESTY_PROMPT,HONESTY_POS_PERSONAS,HONESTY_NEG_PERSONAS,_load_suffixes(), behavior branches in_personas,_topics,_build_specs.src/ws/eval/activation_baseline.py:_fit_repe_directions(..., behavior)honesty branch using suffix prompts +assistant_prefixes.src/ws/eval/prompt_baseline.py: pairedengineered_prompt_honest+engineered_prompt_dishonest(AxBench J.2).evals/smoke.py:behaviorfield inSmokeCfg. Smoke passes end-to-end onkatuni4ka/tiny-random-qwen3with--behavior honesty.data/branching_suffixes.json: copied from SSteer.
Pueue: killed sycophancy runs (215-228). Queued honesty pipeline:
- 230:
run_sweep --behavior honesty(1000 pairs, 6 adapters: lora/dora/pissa/delora/oft/boft/ia3) — running. - 231-236 chained
--after 230: T1 RepE, T3 prompt baseline, T2 full DD, T6 cross-adapter, T7 layer/module, T8 parameterization (all--behavior honesty --n-dilemmas 219 --batch-size 8).
Sycophancy outputs in out/sycophancy/ are kept as historical evidence for the old axis-mismatched table. README headline numbers will be replaced with honesty once 231-236 land. T4/T5 remain open.