fix: skip guided-CoT for non-thinking models; trim README

Gemma-3/4 don't have </think> as a special token, so guided_cot_one raised RuntimeError and killed the whole sweep.

Fix: add has_thinking_mode to _tok_extras and gate phase_a2 in replicate.py on it.

README cut from ~380 to ~120 lines: results tables, how to run, cite, links.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-27 18:27:18 +08:00 · 2026-04-29 05:39:50 +08:00
parent 5704b00175
commit ce73e97154
3 changed files with 86 additions and 343 deletions
@@ -1,368 +1,92 @@
 # Weight Steering

-> **Fork notice (wassname, 2026-04):** this is a working fork that strips the
-> upstream Axolotl + vLLM + Anthropic-batch-API stack and rebuilds the core
-> method on HF + PEFT + uv, targeting Qwen3-0.6B for cheap iteration. Goals:
-> (1) replicate `w = θ⁺ − θ⁻` on a small model, (2) test alignment of `w` with
-> SVD subspaces of the pretrained `W` and the AntiPaSTO subspaces, (3) compare
-> adapter families (LoRA / DoRA / PiSSA-init / DeLoRA) under the
-> "adapter as hypothesis" framing, (4) eval on daily-dilemmas.
->
-> Pipeline (see `justfile`):
-> ```
-> just smoke           # full pipeline on tiny-random qwen3 + BEARTYPE=1, ~1 min
-> just replicate       # data → train pos → train neg → diff → eval → subspace
-> just subspace-align  # phase 2: SVD top-k + weak-readout alignment table
-> just adapter-sweep   # phase 3: LoRA / DoRA / PiSSA / DeLoRA sweep
-> just eval-dilemmas   # phase 4: daily-dilemmas Yes/No logratio
-> ```
-> Source layout: `src/ws/{data,train,diff,steer,subspace,replicate,run_subspace,run_sweep}.py`,
-> `src/ws/eval/{sycophancy,dilemmas}.py`. Outputs to `out/<behavior>/<adapter>/`.
->
-> Scope. Not a strict replication. Now matches paper-style recipe on data
-> (20 train + 12 eval topics × 5 personas × 10 samples = 1000 pairs;
-> judge filter stubbed, off by default, paper uses GPT-4.1-mini) and
-> current PEFT hyperparams (rank 32 / LoRA α 64 / lr 2e-4 / warmup 5 /
-> wd 0.01 / seed 0 / one epoch).
-> Deliberate divergences from upstream: no quantized base loading
-> (DoRA/PiSSA/DeLoRA support is uncertain; bf16 fits at 0.6B), no
-> `modules_to_save` for `embed_tokens` / `lm_head`, and a layer slice
-> (LoRA on layers 30%-80%, steering-locus literature) instead of full
-> coverage. The contrastive `θ⁺ − θ⁻` core is preserved.
->
-> **Initial finding on Qwen3-0.6B.** Weight steering works cheaply at this
-> scale, but the useful adapter parameterization and the interpretable
-> subspace are separate questions. The current best raw adapter is DeLoRA;
-> PiSSA is the cleaner stable baseline; PCA-style planning-subspace overlap
-> does not explain the trained behavior.
+Fork of Fierro & Roger 2025, rebuilt on HF + PEFT + uv, targeting Qwen3-0.6B.
+Method: `dW = theta_pos - theta_neg`, then add `alpha * dW` at inference.

-## Current internal findings (N=1; exploratory)
+## Results (Qwen3-0.6B, honesty, N=1, single seed)

-These numbers are single-seed, single-model research notes, not a full
-benchmark. All rows below use `Qwen/Qwen3-0.6B`, seed 0, honesty-axis
-training data, PEFT adapters trained for one epoch on layers 8-21 (30%-80%
-of 28 layers) except IA3, whose PEFT config does not support
-`layers_to_transform` and therefore touches all layers. Target modules for
-LoRA-family adapters are `q/k/v/o/gate/up/down_proj`.
+All evals use base persona at eval time. No system prompt.

-Training uses opposite honesty personas (vgel/repeng short-form convention).
-POS = "Pretend you're {an honest / a truthful / a candid / a straightforward / a sincere} person." (5 paraphrases).
-NEG = "Pretend you're {a dishonest / a deceptive / a manipulative / a misleading / a lying} person." (5 paraphrases).
-The base model generates 1000 prompt/response pairs per branch under those
-system prompts (paper recipe, Fierro & Roger §F.1). Each adapter is SFT-fit
-to its branch. `dW = θ_pos - θ_neg` carries the honesty direction.
-Question pool: 550 branching-suffix entries (`data/branching_suffixes.json`).
+### OOD: surgical informedness on daily-dilemmas (full split, 219 dilemmas, 438 action rows)

-All evals run with **no system prompt** at eval time (base persona). The
-persona pair only enters during data prep or fitting:
+Surgical informedness SI_k2 = fix_rate - 2 * broke_rate (penalises regressions 2x). SI_best = post-hoc sign-aligned upper bound (snooping).

-| stage                               | pos uses           | neg uses              | how                              |
-| ----------------------------------- | ------------------ | --------------------- | -------------------------------- |
-| adapter training data generation    | `POS[0..4]`        | `NEG[0..4]`           | system prompt during generation  |
-| RepE direction fit (T1)             | `POS[0]`           | `NEG[0]`              | system prompt for hidden capture |
-| prompt baseline: simple_honest (T3) | n/a                | "honest assistant"    | system prompt at eval time       |
-| prompt baseline: engineered (T3)    | AxBench J.2 honest | AxBench J.2 dishonest | system prompt at eval time       |
-| daily-dilemmas eval                 | n/a                | n/a                   | base persona, no system prompt   |
+| method            |  SI_k2 |  SI_k1 | SI_best | fix_rate | broke_rate |
+| ----------------- | -----: | -----: | ------: | -------: | ---------: |
+| prompt:engineered |  -8.88 |  -0.58 |   +4.95 |    0.149 |      0.058 |
+| prompt:simple     | -16.00 |  -1.83 |   +3.46 |    0.245 |      0.203 |
+| RepE all-layers   |  -6.86 |  +0.97 |   +0.79 |    0.149 |      0.070 |
+| oft               |  -3.37 |  -0.21 |   +0.16 |    0.043 |      0.020 |
+| ia3               |  -0.47 |  +0.26 |   -0.09 |    0.011 |      0.006 |
+| dora              | -25.78 |  -6.31 |   -1.91 |    0.149 |      0.157 |
+| lora              | -27.13 |  -6.88 |   -3.04 |    0.138 |      0.157 |
+| pissa             | -27.27 |  -5.65 |   -9.08 |    0.160 |      0.169 |
+| delora            | -34.29 |  -4.85 |  -38.12 |    0.213 |      0.410 |

-The dW and RepE methods do not put any persona into the eval-time prompt;
-they intervene on weights or activations instead.
+Every method is negative under SI_k2. Among adapters only OFT clears zero under SI_best, with a large gap to engineered prompts. DeLoRA's broke_rate 0.41 (141/344 already-honest rows flipped) dominates.

-### Notation
+### OOD: SI at KL-calibrated alpha (matched off-task p95 token-KL ~ 0.61 nats)

-- `α`, also called `coeff`: steering strength. Weight steer adds `α * dW`.
-  RepE adds `α * direction` to the residual stream. `α = 0` is the unmodified base.
-- `mean_logratio = log p(Yes) - log p(No)`: how strongly the model prefers Yes.
-- `logratio_honesty = (log p(Yes) - log p(No)) * honesty_label`: same logratio,
-  signed so that larger means more honest. The dataset labels each (dilemma, action)
-  with which answer is honest.
-- `dd_delta`: change in mean `logratio_honesty` between an intervention row and
-  `base @ α=0` on the same dilemmas.
-- `pmass = p(Yes) + p(No)`: probability mass on the two scored tokens.
-  Sanity check that the model is answering in-format. If `pmass` is low, the
-  model is talking instead of choosing.
-- `dW = θ_pos - θ_neg`: weight diff after merging each adapter into the base.
-- `||dW||`: Frobenius norm of the diff, summed across touched parameters.
+| method                   |    alpha |    SI | fix | broke | broke% |
+| ------------------------ | -------: | ----: | --: | ----: | -----: |
+| prompt:eng_dishonest     |    +1.00 | +5.41 |  14 |    15 |   4.4% |
+| prompt:simple_dishonest  |    +1.00 | +3.57 |  12 |    15 |   4.4% |
+| prompt:engineered_honest |    +1.00 | +2.62 |  14 |    20 |   5.8% |
+| repe                     |    +2.30 | -5.29 |  15 |    20 |   5.8% |
+| prompt:simple_honest     |    +1.00 |-13.89 |  23 |    70 |  20.3% |
+| dW:oft                   |    +8.22 |-25.97 |  16 |    86 |  25.0% |
+| dW:delora                |    +0.78 |-29.79 |  18 |   121 |  35.2% |
+| dW:pissa                 |    +1.17 |-32.03 |  16 |    65 |  18.9% |
+| dW:ia3                   |   +34.94 |-43.57 |  16 |    87 |  25.3% |
+| dW:lora                  |    +2.16 |-52.72 |  19 |   133 |  38.7% |
+| dW:dora                  |    +2.30 |-56.96 |  19 |   139 |  40.4% |

-### What was measured
+At matched off-task KL, all 6 adapters land deeply negative SI. Fix counts cluster at 14-19 across all methods; adapters break 65-139 already-honest rows while engineered prompts break 15-20. Adapters perturb uniformly across all tokens; prompts perturb topic-conditionally, spending the same KL budget where it matters.

-- Sycophancy ID eval: held-out sycophancy Yes/No prompts, 12 eval rows per
-    coefficient. Metric is `mean_logratio = log p(Yes) - log p(No)`; larger
-    means more sycophantic agreement. `pmass` is probability mass on Yes/No, a
-    sanity check that the model is answering in-format.
-- Daily dilemmas OOD eval: `wassname/daily_dilemmas-self-honesty`,
-    `honesty_eval`, full split of 219 dilemmas = 438 action rows per coefficient.
-    Metric is `logratio_honesty = (log p(Yes) - log p(No)) * honesty_label`, so
-  larger means more honest. `honesty_label` is computed from
-  `kellycyy/daily_dilemmas:Action_to_party_to_value` filtered to
-  `party == "You"`; the inherited `values_aggregated` field is all-party
-  context and is not the label source. The HF dataset now includes explicit
-  provenance columns (`you_values`, `label_source`, `values_aggregated_scope`).
-  Tables below use base persona only. A previous summary accidentally averaged
-  `base@0` with the AxBench `honest_engineer` persona baseline;
-  `cross_adapter_v9.py` now reads `dilemmas_per_row.csv` and filters
-  `persona == "base"`.
-- Projection diagnostic: decomposes residual-output
-    weights (`o_proj`, `down_proj`) into the part inside a post-hoc activation
-    PCA subspace (`project_act_block`) and its orthogonal remainder
-    (`complement_act_block`) to test whether low overlap hides the load-bearing
-    steering component.
+### IID: held-out Yes/No claims (12 claims, alpha=+1)

-### OOD: surgical informedness on daily dilemmas
+| adapter | mean_lr | shift vs base |
+| ------- | ------: | ------------: |
+| pissa   |   8.437 |        +5.708 |
+| delora  |   7.198 |        +4.469 |
+| lora    |   6.531 |        +3.802 |
+| dora    |   6.156 |        +3.427 |
+| oft     |   3.917 |        +1.188 |
+| ia3     |   2.719 |        -0.010 |

-<!-- source adapters: out/honesty/cross_adapter_full_dd/dilemmas_per_row.csv
-     source prompts:  out/honesty/prompt_baseline/dilemmas_per_row.csv
-     source RepE:     out/honesty/activation_baseline/dilemmas_per_row.csv
-     produced by:     nbs/honesty_tables.py -->
+All adapters except IA3 learn the IID direction. The OOD failure (negative SI) is a generalisation gap, not a training failure.

-Daily-dilemmas honesty eval, base persona at eval time, full 219-dilemma
-split (438 action rows / coeff; at a=0 the base picks the honest action
-on n_cho=344 rows and the dishonest one on n_rej=94 — the ~78/22 split
-is the *base model's response distribution*, not the data, since each
-dilemma has one honest and one dishonest action by construction).
+### DeLoRA: within-tensor direction vs per-tensor norm allocation

-Prompt baselines are paired so dishonest_prompt = a=-1, base = a=0,
-honest_prompt = a=+1, giving full bidirectional SI like dW. RepE
-bidirectional uses a=-1/0/+1 from the activation_baseline sweep.
+| variant     |     SI | fix/broke @ a=+1 | mean_lr delta@a=+1 |
+| ----------- | -----: | ---------------: | -----------------: |
+| full        | -34.29 |          20/141  |             +0.237 |
+| dir_only    | -41.00 |          20/146  |             +0.024 |
+| mag_only    | -34.75 |           16/28  |             +1.068 |
+| random_norm | -13.36 |           16/76  |             -0.143 |

-`SI_k2` = surgical informedness with breaks penalised 2x (default,
-"first do no harm"). `SI_k1` = symmetric (breaks weighted 1x). `SI_best`
-= post-hoc sign-aligned upper bound: at each method we take the
-better of (a) treating a=+1 as the honest direction (`si_fwd`) and
-(b) treating a=-1 as the honest direction by role-swapping the
-confusion matrix so `counter_rev` becomes "fix" and `flip_rev`
-becomes "broke" (`counter_rate - 2 * flip_rate`). Under k=2 this is
-*not* the same as `-si_rev` because the FPR penalty hits the swapped
-rate. Treat as snooping, an upper bound. `fix_rate` = fix_fwd / n_rej,
-`broke_rate` = broke_fwd / n_cho. All numbers single-seed (N=1).
+`dir_only` (within-tensor direction kept, per-tensor norm flattened): positive mean shift collapses from +0.237 to +0.024. `mag_only` (per-tensor norm kept, within-tensor direction random): larger positive shift (+1.07) with fewer broken rows (28 vs 141). Suggests the DeLoRA dW is mostly a layer/module norm allocation, not a learned within-tensor direction.

-| method            |  SI_k2 |  SI_k1 | SI_best | si_fwd | si_rev | fix_rate | broke_rate |
-| ----------------- | -----: | -----: | ------: | -----: | -----: | -------: | ---------: |
-| prompt:engineered |  -8.88 |  -0.58 |   +4.95 | +0.033 | -0.254 |    0.149 |      0.058 |
-| prompt:simple     | -16.00 |  -1.83 |   +3.46 | -0.162 | -0.212 |    0.245 |      0.203 |
-| RepE all-layers   |  -6.86 |  +0.97 |   +0.79 | +0.009 | -0.173 |    0.149 |      0.070 |
-| oft               |  -3.37 |  -0.21 |   +0.16 | +0.002 | -0.080 |    0.043 |      0.020 |
-| ia3               |  -0.47 |  +0.26 |   -0.09 | -0.001 | -0.010 |    0.011 |      0.006 |
-| dora              | -25.78 |  -6.31 |   -1.91 | -0.165 | -0.451 |    0.149 |      0.157 |
-| lora              | -27.13 |  -6.88 |   -3.04 | -0.176 | -0.476 |    0.138 |      0.157 |
-| pissa             | -27.27 |  -5.65 |   -9.08 | -0.178 | -0.531 |    0.160 |      0.169 |
-| delora            | -34.29 |  -4.85 |  -38.12 | -0.607 | -0.180 |    0.213 |      0.410 |
+## How to run

-Read: every method has *negative* bidirectional SI under k=2. Under
-`SI_best` (post-hoc sign-aligned upper bound), both prompt baselines
-and RepE clear zero; among adapters only OFT is positive, and the
-gap to engineered prompts is large. DeLoRA's `SI_k2` is worst (-34.3)
-because its `broke_rate` 0.41 dominates: at a=+1 it flips 141/344
-already-honest rows to dishonest while fixing only 20/94 dishonest
-rows. The mean logratio still climbs +0.237 at a=+1 because the few
-rows it pushes correctly move by a lot (std_lr 1.97 -> 5.77); the
-metric and the mean disagree because SI counts discrete flips while
-the mean averages magnitude.
+```sh
+# Quick sanity check (~1 min, tiny random Qwen3)
+just smoke

-The k=2 penalty is calibrated for AntiPaSTO-style benchmarks where
-classes are roughly balanced. Here the *response distribution* is
-3.7:1 (n_cho/n_rej), so `2 * broke_rate` swamps `fix_rate` for any
-intervention that touches a sizeable fraction of rows. `SI_k1`
-(symmetric) is the calibration-free read.
+# Full pipeline for one adapter
+uv run python -m ws.replicate --model Qwen/Qwen3-0.6B --behavior honesty --adapter lora

-The only `+SI_best` adapter is OFT and the gap to both prompt
-baselines is large. The SI vs `dd_delta` disagreement on DeLoRA is
-the central exploratory finding. T4 multiseed and T5 Gemma will
-test whether the ranking is stable.
+# All adapters
+uv run python -m ws.run_sweep --behavior honesty --n-personas 1 --n-samples 50

-### OOD: raw mean ± std logratio_honesty per (method, coeff)
+# KL calibration then daily-dilemmas eval
+uv run python -m ws.eval.kl_calibrate --behavior honesty
+uv run python -m ws.eval.dilemmas_calibrated --behavior honesty
+```

-| method            |  a=-1 (mean ± std) |   a=0 |  a=+1 (mean ± std) |
-| ----------------- | -----------------: | ----: | -----------------: |
-| base              |                  - | 1.326 ± 1.969 |        - |
-| ia3               |     1.294 ± 1.915  | 1.326 ± 1.969 |   1.356 ± 2.016 |
-| oft               |     1.215 ± 1.834  | 1.326 ± 1.969 |   1.381 ± 2.090 |
-| dora              |     1.156 ± 1.930  | 1.326 ± 1.969 |   1.342 ± 2.791 |
-| lora              |     1.104 ± 1.890  | 1.326 ± 1.969 |   1.403 ± 2.873 |
-| pissa             |     0.846 ± 1.695  | 1.326 ± 1.969 |   1.368 ± 2.941 |
-| delora            |     0.174 ± 1.319  | 1.326 ± 1.969 |   1.563 ± 5.770 |
-| prompt:engineered |     1.375 ± 2.043  | 1.326 ± 1.969 |   1.371 ± 1.829 |
-| prompt:simple     |     1.378 ± 2.064  | 1.326 ± 1.969 |   0.874 ± 1.621 |
-| RepE all-layers   |     1.405 ± 2.339  | 1.326 ± 1.969 |   1.307 ± 2.037 |
+Source layout: `src/ws/{data,train,diff,steer,subspace,replicate,run_sweep}.py`, `src/ws/eval/{sycophancy,dilemmas,kl_calibrate,dilemmas_calibrated}.py`. Outputs to `out/<behavior>/<adapter>/`.

-### OOD: SI at KL-calibrated α (matched off-task distribution shift)
-
-Comparing adapter steering at α=1 vs prompts is structurally unfair: α=1
-means very different things across LoRA / PiSSA / DeLoRA / OFT / IA3 / RepE
-/ prompt. We replace it with a principled budget — the prompt's *off-task*
-KL footprint. Concretely: we measure mean per-token KL(steered ‖ base)
-over the last 20 positions of held-out continuations on 50 diverse prompts
-(branching_suffixes.json, stratified across 10 categories), and Newton-search
-α per method to match `prompt:engineered_prompt_honest`'s p95 token-KL ≈ 0.61
-nats. All 7 methods converge in 2-3 iterations. Audit on 100 disjoint prompts
-gives calib/audit p95 ratio 1.07-1.15 for adapters (stable) and 1.78 for the
-prompt anchor (heavier topic-conditional tail). Source:
-`src/ws/eval/kl_calibrate.py` → `out/honesty/kl_calibration/`.
-
-Re-eval daily-dilemmas at calibrated ±α:
-
-| method                         |    α   |     SI |  fix |  broke | broke% (of n_cho=344) |
-| ------------------------------ | -----: | -----: | ---: | -----: | --------------------: |
-| prompt:eng_dishonest           |  +1.00 |  +5.41 |   14 |     15 |                  4.4% |
-| prompt:simple_dishonest        |  +1.00 |  +3.57 |   12 |     15 |                  4.4% |
-| prompt:engineered_honest       |  +1.00 |  +2.62 |   14 |     20 |                  5.8% |
-| repe                           |  +2.30 |  -5.29 |   15 |     20 |                  5.8% |
-| prompt:simple_honest           |  +1.00 | -13.89 |   23 |     70 |                 20.3% |
-| dW:oft                         |  +8.22 | -25.97 |   16 |     86 |                 25.0% |
-| dW:delora                      |  +0.78 | -29.79 |   18 |    121 |                 35.2% |
-| dW:pissa                       |  +1.17 | -32.03 |   16 |     65 |                 18.9% |
-| dW:ia3                         | +34.94 | -43.57 |   16 |     87 |                 25.3% |
-| dW:lora                        |  +2.16 | -52.72 |   19 |    133 |                 38.7% |
-| dW:dora                        |  +2.30 | -56.96 |   19 |    139 |                 40.4% |
-
-Read: under matched off-task p95 KL, all 6 adapters land deeply negative.
-Fix counts cluster at 14-19 across all methods, but adapters break 65-139
-already-honest rows while engineered prompts break only 15-20. The
-ordering aligns with intuition: **prompts perturb topic-conditionally**
-(near-zero KL on irrelevant content, large KL where relevant), so the
-matched off-task budget gets spent on dilemma-relevant tokens; **adapters
-perturb uniformly**, so the same KL budget scatters over the 344
-already-correct rows and breaks them. RepE sits in between. The
-engineered-dishonest topping the SI ranking is partly an artifact of the
-344/94 imbalance + k=2 weighting: it breaks slightly fewer honest answers
-than the engineered-honest prompt, with similar fix counts.
-
-Caveats: (1) single seed, single model; (2) calibration measured on
-branching_suffixes (off-task) — at-task KL may differ; (3) the prompt
-anchor's audit p95 was 1.78× the calib p95, so calibration is conservative
-on the prompt side; (4) absolute fix/broke counts are tiny (10s of rows
-out of 438), so per-method noise is large.
-
-The headline negative result for adapters at matched dist-shift survives
-all four caveats in direction (every adapter is negative, with broke ≫
-fix), but the *gap to prompts* depends on calibration choice.
-
-### IID: held-out persona Yes/No claims
-
-<!-- source: out/honesty/cross_adapter_ablation/sycophancy_per_row.csv
-     setting=dW full means the trained adapter is applied at coeff;
-     setting=dW=0 zeros out the diff (matches base model). -->
-
-This is the same eval used during training (12 held-out claims). At
-a=0 every row matches the base (mean_lr=2.729, std=1.058). At a=+1
-under "dW full":
-
-| adapter | a=+1 mean_lr | std  | shift vs base |
-| ------- | -----------: | ---: | ------------: |
-| pissa   |       8.437  | 1.27 |        +5.708 |
-| delora  |       7.198  | 1.48 |        +4.469 |
-| lora    |       6.531  | 1.05 |        +3.802 |
-| dora    |       6.156  | 1.07 |        +3.427 |
-| oft     |       3.917  | 0.98 |        +1.188 |
-| ia3     |       2.719  | 1.05 |        -0.010 |
-
-So on IID claims the dW interventions land hard (PiSSA biggest, IA3
-no-op), the same direction as their training data. The OOD failure
-on daily dilemmas (negative SI) is therefore a *generalisation* gap,
-not a "the dW didn't learn anything" gap — they all learned an IID
-direction; only OFT (and prompt:engineered) generalise without
-breaking the response distribution.
-
-### DeLoRA: per-tensor norm allocation vs within-tensor direction
-
-<!-- source: out/honesty/dw_decomp_ablation/delora/summary.csv
-     produced by: ws.eval.dw_decomp_ablation -->
-
-To test whether the trained dW's behavior is carried by *how much
-each tensor moves* (the per-tensor Frobenius-norm allocation across
-layers/modules) or by *the within-tensor direction* (elementwise
-pattern inside each tensor), we evaluate four variants of the DeLoRA
-dW (total ||dW||_F = 33.43, kept identical across variants). Each
-variant preserves at most one scalar per tensor (its norm) plus
-either the original within-tensor structure or a single Gaussian
-draw — so this isolates *per-tensor norm* vs *within-tensor
-direction*, not a broader notion of "magnitude pattern":
-
-| variant       | meaning                                          |
-| ------------- | ------------------------------------------------ |
-| `full`        | original trained dW (control)                    |
-| `dir_only`    | within-tensor direction kept; every tensor rescaled to a common Frobenius norm (flattens per-tensor norm allocation) |
-| `mag_only`    | random Gaussian per tensor, scaled to the original per-tensor norm (preserves only the per-tensor norm scalar; within-tensor direction random) |
-| `random_norm` | random Gaussian + common norm (control: nothing learned) |
-
-Daily-dilemmas honesty eval, full split, base persona, single seed:
-
-| variant     |     SI | si_fwd | si_rev | fix/broke @ a=+1 | flip/counter @ a=-1 | mean_lr Δ@a=+1 | mean_lr Δ@a=-1 |
-| ----------- | -----: | -----: | -----: | ---------------: | ------------------: | -------------: | -------------: |
-| full        | -34.29 | -0.607 | -0.180 |          20/141  |             121/25  |        +0.237  |        -1.152  |
-| dir_only    | -41.00 | -0.636 | -0.316 |          20/146  |             162/37  |        +0.024  |        -1.295  |
-| mag_only    | -34.75 | +0.007 | -0.754 |           16/28  |             187/61  |        +1.068  |        -1.191  |
-| random_norm | -13.36 | -0.272 | -0.119 |           16/76  |              25/9   |        -0.143  |        -0.011  |
-
-Read: stripping the per-tensor norm allocation (`dir_only`) collapses
-the positive-direction mean shift from +0.237 to +0.024 and worsens
-SI. Stripping the within-tensor direction but keeping per-tensor
-Frobenius norms (`mag_only`) gives a *larger* positive mean shift
-(+1.07) with *fewer* broken rows (28 vs 141) than the trained dW.
-This narrowly supports "per-tensor norm allocation across
-layers/modules carries most of the α=+1 effect"; it does *not*
-support a broader claim that the entire weight-space magnitude
-pattern is what matters, since `mag_only` already discards every
-within-tensor magnitude relationship. `mag_only` and `random_norm`
-are also single-seed Monte Carlo controls; the specific +1.07 number
-is seed-sensitive. `random_norm` "wins" SI only by virtue of being a
-near no-op (the metric flatters non-interventions when classes are
-imbalanced); compare `delta_pos`/`delta_neg` to see it doesn't
-actually steer.
-
-This says the dW for DeLoRA is mostly a *layer/module norm
-allocation*, not a learned within-tensor direction. T7 layer/module
-ablation tests the same question from the other side. If true under
-multiple seeds and on Gemma, it implies weight steering for honesty
-needs only a learnable per-tensor scalar, not a low-rank direction
-inside each tensor — a much smaller hypothesis class.
-
-### Subspace/projection lesson
-
-The original question was: can we find the subspace or parameterization that
-explains the difference between the positive and negative LoRAs? So far we
-tested three kinds of explanations:
-
-- Parameterization: LoRA / DoRA / PiSSA / DeLoRA / OFT / IA3. Adapter
-    family changes steering strength a lot (DeLoRA raw, PiSSA stable), but it
-    does not make the learned `dW` align with the tested act/weight subspaces.
-- Mechanistic bases: pretrained-weight read/write primitives, MLP/gate,
-    attention/QK/OV, attention-selected token bases, persona contrasts, and
-    activation PCA. These all have low overlap with the LoRA weight oracle:
-    about 1-8% across adapter families and LoRA layers.
-- Block-local activation PCA did not rescue this. The issue is not just that
-    cumulative activations mix upstream layers.
-- A functional projection test says the PCA activation directions can be
-    potent if amplified, but the trained adapter's behavior is mostly not
-    carried by that projected component at its learned scale.
-
-Projection diagnostic at K=32 on daily dilemmas (40 dilemmas / 80 rows; this
-is an ablation, not a full benchmark):
-
-| adapter | full Δ | residual-write Δ | raw projection / residual | normmatched projection / residual | complement / residual | read                                              |
-| ------- | -----: | ---------------: | ------------------------: | --------------------------------: | --------------------: | ------------------------------------------------- |
-| delora  | +0.628 |           +0.844 |                      0.07 |                              0.30 |                  0.89 | trained behavior mostly outside act-PCA subspace  |
-| pissa   | +0.373 |           +0.242 |                      0.47 |                              1.14 |                  0.64 | mixed: act-PCA is functional, not sole carrier    |
-| oft     | +0.216 |           +0.148 |                     -0.01 |                              1.57 |                  0.69 | act-PCA direction potent only after amplification |
-
-Here `complement` means the residual-output part of `dW` after removing the
-activation-PCA subspace:
-
-$$dW_{\text{complement}} = (I - P_{\text{act},K}) dW.$$
-
-So if the complement keeps steering, then the trained adapter's effect is not
-mainly inside the tested activation-PCA subspace. For DeLoRA, the complement
-keeps 89% of residual-write behavior while the raw projection keeps 7%, which
-is the cleanest evidence that `act_oracle` is an intervention target, not an
-explanation of what the trained adapter learned.
-
-Current best interpretation: "planning subspace" should be defined causally
-(what intervention changes behavior), not by a simple tested parameterization
-or geometric basis (adapter family, attention basis, read/write basis, or PCA
-overlap with `dW`). The LoRA appears to write concept-space directions that
-downstream layers translate into Yes/No or honesty behavior; the tested
-low-rank readable bases do not capture the full mechanism.
-
-# Cite
+## Cite

 ```bibtex
 @article{FierroRoger2025,
@@ -374,3 +98,10 @@ low-rank readable bases do not capture the full mechanism.
  doi       = {10.48550/arXiv.2511.05408}
 }
 ```
+
+## Related
+
+- Paper: https://arxiv.org/abs/2511.05408
+- Daily-dilemmas dataset: `wassname/daily_dilemmas-self-honesty` (HuggingFace)
+- RepE baseline: `representation-engineering` (Zou et al. 2023)
+- PEFT: https://github.com/huggingface/peft
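
The README hunk above defines surgical informedness as `fix_rate - 2 * broke_rate`. A minimal sketch of that arithmetic follows, assuming `fix_rate = fix / n_rej` and `broke_rate = broke / n_cho` as stated in the removed text. The function name `surgical_informedness` is hypothetical and is not taken from the repo. The counts are the DeLoRA alpha=+1 figures quoted in the README (20 of 94 dishonest rows fixed, 141 of 344 already-honest rows broken). This reproduces only the forward component (`si_fwd` in the removed table); the `SI_k2` column is a bidirectional figure on a different scale and is not recomputed here.

```python
def surgical_informedness(fix: int, n_rej: int, broke: int, n_cho: int, k: float = 2.0) -> float:
    """Forward surgical informedness: fix_rate - k * broke_rate.

    fix   / n_rej: rows the base got wrong that the intervention fixed.
    broke / n_cho: rows the base got right that the intervention broke.
    k: penalty weight on regressions (k=2 penalises breaks twice as hard).
    """
    fix_rate = fix / n_rej
    broke_rate = broke / n_cho
    return fix_rate - k * broke_rate


# DeLoRA at alpha=+1, counts as quoted in the README
si_k2 = surgical_informedness(fix=20, n_rej=94, broke=141, n_cho=344, k=2.0)
si_k1 = surgical_informedness(fix=20, n_rej=94, broke=141, n_cho=344, k=1.0)

print(f"forward SI, k=2: {si_k2:+.3f}")  # +-0.607 is the si_fwd value in the removed table
print(f"forward SI, k=1: {si_k1:+.3f}")
```

With `n_cho / n_rej` near 3.7, a method that breaks even a modest fraction of already-honest rows goes negative under `k=2`, which is why the fix counts alone (14 to 23 across methods) say little about the ranking.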
@@ -1,5 +1,13 @@
 """Tiny tokenizer utilities with no ws imports (avoids circular deps)."""

+THINK_CLOSE = "</think>"
+
+
+def has_thinking_mode(tok) -> bool:
+    """True iff the tokenizer has </think> as a genuine special token (Qwen3)."""
+    tid = tok.convert_tokens_to_ids(THINK_CLOSE)
+    return tid is not None and tid != tok.unk_token_id
+

 def chat_template_extras(tok) -> dict:
    """Extra kwargs for apply_chat_template that vary by model family.
@@ -18,6 +18,7 @@ from tabulate import tabulate
 from transformers import AutoTokenizer

 from ws._log import final_summary, get_argv, setup_logging
+from ws._tok_extras import has_thinking_mode
 from ws.data import DataCfg, generate_pairs, load_pairs
 from ws.diff import compute_diff, load_base_state, load_delta, save_diff
 from ws.eval.sycophancy import EvalCfg, evaluate, summarize
@@ -133,8 +134,11 @@ def main(cfg: Cfg) -> None:
    dcfg = DemoCfg(model=cfg.model, behavior=cfg.behavior, adapter=cfg.adapter, out=cfg.out)
    claims = _demo_claims(dcfg.ood_claim)
    phase_a1(dcfg, claims, tok)
-    demo_df = phase_a2(dcfg, claims, tok)
-    demo_df.write_csv(out_dir / "demo_guided_cot.csv")
+    if has_thinking_mode(tok):
+        demo_df = phase_a2(dcfg, claims, tok)
+        demo_df.write_csv(out_dir / "demo_guided_cot.csv")
+    else:
+        logger.info("skipping guided-CoT demo: model has no </think> special token")

    # BLUF: headline = max margin across alpha sweep on in_dist claim
    sp = summary.to_pandas()
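
The gate added in this commit can be exercised without downloading a model. The sketch below copies `has_thinking_mode` from the hunk above and runs it against a hypothetical `StubTokenizer`, which is not part of the repo or of `transformers`. The stub assumes that `convert_tokens_to_ids` returns the unk id for an unknown token, or `None` when the tokenizer has no unk token; the vocabularies and ids are made up for illustration.

```python
THINK_CLOSE = "</think>"


def has_thinking_mode(tok) -> bool:
    """Copy of the helper added in this commit."""
    tid = tok.convert_tokens_to_ids(THINK_CLOSE)
    return tid is not None and tid != tok.unk_token_id


class StubTokenizer:
    """Hypothetical stand-in exposing only what the helper reads."""

    def __init__(self, vocab: dict, unk_token_id=None):
        self.vocab = vocab
        self.unk_token_id = unk_token_id

    def convert_tokens_to_ids(self, token: str):
        # Assumed behaviour: unknown tokens fall back to the unk id (possibly None)
        return self.vocab.get(token, self.unk_token_id)


# Illustrative vocabularies; the ids are arbitrary
qwen_like = StubTokenizer({"<think>": 1, "</think>": 2})
gemma_like = StubTokenizer({"<unk>": 3, "hello": 4}, unk_token_id=3)
no_unk = StubTokenizer({"hello": 4})

results = {
    "qwen_like": has_thinking_mode(qwen_like),
    "gemma_like": has_thinking_mode(gemma_like),
    "no_unk": has_thinking_mode(no_unk),
}
print(results)
```

The check tests vocabulary membership only. A tokenizer that happens to contain the literal string `</think>` as an ordinary token would also pass, so the "genuine special token" wording in the docstring is stricter than what the code enforces.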