Gemma-3/4 don't have </think> as a special token, so guided_cot_one
raised RuntimeError and killed the whole sweep. Fix: add has_thinking_mode
to _tok_extras and gate phase_a2 in replicate.py on it.
README cut from ~380 to ~120 lines: results tables, how to run, cite, links.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt;
calibrate α per method so p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
prompts, 100 audit). Newton-style iter α_next=α·sqrt(T/M), with T the target
p95 KL and M the measured one, converges for 7/7 methods in 2-3 iters. At
calibrated ±α on daily-dilemmas (n=219), all 6 adapters
land deeply negative SI: fix counts cluster at 14-19 across all methods,
but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly, so at a matched off-task budget adapters scatter mass over
already-correct rows. RepE sits between.
Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).
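A minimal sketch of the calibration loop described above, not the repo's
actual code. Assumptions: measure_p95_kl is a hypothetical callable that
returns the p95 token-KL at a given α; rtol and max_iters are made-up
stopping parameters; the toy measurement is a stand-in in which KL grows
quadratically in α, which is the regime where the sqrt(T/M) update lands in
one step.

```python
import math


def calibrate_alpha(measure_p95_kl, target, alpha=1.0, rtol=0.05, max_iters=10):
    """Rescale alpha until the measured p95 KL is within rtol of target.

    measure_p95_kl is a hypothetical callable: alpha -> p95 token-KL (nats).
    """
    for _ in range(max_iters):
        measured = measure_p95_kl(alpha)
        if measured <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(measured - target) <= rtol * target:
            break
        # Update from the commit message: alpha_next = alpha * sqrt(T / M).
        alpha *= math.sqrt(target / measured)
    return alpha


def toy_p95_kl(alpha):
    """Toy stand-in for a real measurement: KL proportional to alpha**2."""
    return 2.5 * alpha * alpha


alpha_star = calibrate_alpha(toy_p95_kl, target=0.61)
```

On real methods the KL is only approximately quadratic in α, so more than
one rescaling step is expected.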
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address pi-review issues:
- SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc
sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate,
so -si_rev != counter_rate - 2*flip_rate. Fix by computing
si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke
for the a=-1-as-honest branch) and taking max against si_fwd.
- Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference
check between base, honest_prompt, and dishonest_prompt before computing
paired SI. Previously only .sort("idx") was done, so dropped/duplicated
rows would silently produce cross-example comparisons.
- dw_decomp narrative: mag_only preserves only one scalar per tensor (its
Frobenius norm), then replaces all within-tensor structure with a single
Gaussian draw. Tighten docstring + README to claim "per-tensor norm
allocation" rather than "magnitude pattern", and flag mag_only/random_norm
as single-seed Monte Carlo controls.
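The pairing check in the second bullet can be sketched as below. This is an
illustration under assumptions, not the code in the repo: rows are plain
dicts rather than dataframe rows, check_paired and pairing_keys are
hypothetical names, and the example action_type values are made up.

```python
def pairing_keys(rows):
    """Return the set of (idx, dilemma_idx, action_type) keys and any duplicates."""
    seen, dupes = set(), set()
    for row in rows:
        key = (row["idx"], row["dilemma_idx"], row["action_type"])
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return seen, dupes


def check_paired(**frames):
    """Raise ValueError if any frame has duplicate keys or the key sets differ."""
    keysets = {}
    for name, rows in frames.items():
        keys, dupes = pairing_keys(rows)
        if dupes:
            raise ValueError(f"{name}: duplicated keys {sorted(dupes)}")
        keysets[name] = keys
    names = list(keysets)
    ref = names[0]
    for other in names[1:]:
        # Symmetric difference: keys present in exactly one of the two frames.
        diff = keysets[ref] ^ keysets[other]
        if diff:
            raise ValueError(f"{ref} vs {other}: unpaired keys {sorted(diff)}")


base = [
    {"idx": 0, "dilemma_idx": 10, "action_type": "to_do"},
    {"idx": 1, "dilemma_idx": 10, "action_type": "not_to_do"},
]
check_paired(base=base, honest_prompt=list(base), dishonest_prompt=list(base))
```

Sorting by idx alone would let the frames line up positionally even when one
of them is missing a row; comparing key sets first turns that into a hard
error.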
Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to
+3.46 because the role-swapped a=-1 branch is its better direction. Update
README OOD SI table accordingly. Refresh RepE rows in raw-logratio table
with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop
stale pmass caveat block.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Result: random_direction * original_per_tensor_norm (mag_only) gives a
larger positive logratio shift (+1.07 at a=+1) than the full trained
dW (+0.24), with 5x fewer broken rows. Stripping the magnitude pattern
(dir_only) collapses the effect to +0.02. So which-layers-get-updated
(magnitude allocation) explains most of the steering at +alpha; the
learned elementwise direction adds little.
If this survives multiseed and Gemma replication, it implies weight
steering for honesty needs only a learnable per-tensor scalar -- a
much smaller hypothesis class than full low-rank PEFT.
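A rough sketch of the two controls described above, under stated
assumptions: tensors are flattened to plain Python lists, mag_only redraws
each tensor as Gaussian noise rescaled to the original Frobenius norm, and
dir_only keeps each direction but sets every tensor to the mean norm. The
dir_only normalization in particular is a guess at one reasonable choice,
and the tensor names are made up.

```python
import math
import random


def frob_norm(t):
    """Frobenius norm of a flattened tensor."""
    return math.sqrt(sum(x * x for x in t))


def mag_only(dw, rng):
    """Random Gaussian direction per tensor, rescaled to the original norm."""
    out = {}
    for name, t in dw.items():
        g = [rng.gauss(0.0, 1.0) for _ in t]
        scale = frob_norm(t) / frob_norm(g)
        out[name] = [scale * x for x in g]
    return out


def dir_only(dw):
    """Keep each tensor's direction; give every tensor the same (mean) norm.

    Assumption: the common norm is the mean of the original per-tensor norms.
    """
    norms = {name: frob_norm(t) for name, t in dw.items()}
    common = sum(norms.values()) / len(norms)
    return {name: [common * x / norms[name] for x in t] for name, t in dw.items()}


# Hypothetical two-tensor dW with very different norms.
dw = {"layer0.q_proj": [3.0, 4.0], "layer1.q_proj": [0.6, 0.8]}
dw_mag = mag_only(dw, random.Random(0))
dw_dir = dir_only(dw)
```

Because mag_only is a single Gaussian draw per tensor, one run is one Monte
Carlo sample; repeating it over seeds is what the multiseed check would
amount to.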
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under
simple and engineered families, giving full bidirectional SI for
prompts (same as dW)
- Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned
upper bound (snooping-aware robustness probe)
- Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2
to expose how much the class-imbalance-driven 2x penalty contributes
- Expose fix_rate / broke_rate columns so the SI components are visible
- Add IID syc table (held-out persona claims) using
cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors
- Add raw mean +- std logratio table per (method, coeff) for OOD
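For orientation, the metrics in the list above can be sketched as follows.
The form si = fix_rate - k_fpr * broke_rate is an assumption inferred from
the "breaks weighted 1x / 2x" wording, not a definition stated in this
commit; the function names and the example rates are illustrative only.

```python
def si(fix_rate, broke_rate, k_fpr=2.0):
    """Assumed SI form: fixes minus k_fpr-weighted breaks (k_fpr=2 default)."""
    return fix_rate - k_fpr * broke_rate


def si_best(si_fwd, si_rev, pmass):
    """SI_best as written in this commit: max over sign, scaled by pmass^2 * 100."""
    return max(si_fwd, si_rev) * pmass ** 2 * 100


# Illustrative rates, not results from the tables.
si_k2 = si(fix_rate=0.10, broke_rate=0.04)             # 2x penalty on breaks
si_k1 = si(fix_rate=0.10, broke_rate=0.04, k_fpr=1.0)  # symmetric variant
best = si_best(si_fwd=si_k2, si_rev=-0.05, pmass=0.9)
```

Comparing si_k1 and si_k2 on the same rates shows directly how much of a
negative score comes from the extra penalty on breaks.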
The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest,
+5.7 mean shift) but most break OOD via the broke_rate channel. OFT and
engineered prompts are the only methods with non-negative SI_best.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Combined methods comparison table in README using SI as primary metric
- nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables
from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline,
activation_baseline)
- prompt_baseline.py: si_fwd computed inline for prompt methods
- activation_baseline.py: tok.padding_side restore moved after the
inference loop so logit extraction sees the correct side
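One way to make the padding_side restore hard to misplace is a context
manager wrapped around the whole inference loop. This is a sketch, not what
activation_baseline.py does: the helper name is hypothetical, a
SimpleNamespace stands in for the tokenizer, and "left" is assumed as the
side wanted during inference.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def padding_side(tok, side):
    """Temporarily set tok.padding_side; restore it when the block exits."""
    prev = tok.padding_side
    tok.padding_side = side
    try:
        yield tok
    finally:
        tok.padding_side = prev


# Stand-in for a real tokenizer object with a padding_side attribute.
tok = SimpleNamespace(padding_side="right")
seen = []
with padding_side(tok, "left"):
    for _batch in range(3):
        # Tokenization and logit extraction would happen here.
        seen.append(tok.padding_side)
```

The restore runs only after every batch has been processed, including when
the loop raises.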
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>