mirror of
https://github.com/wassname/weight-steering.git
synced 2026-07-01 22:59:46 +08:00
06ec48d8f7
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt. Calibrate α per method so that p95 token-KL on held-out continuations matches the footprint of prompt:engineered_prompt_honest (≈0.61 nats over 50 stratified prompts, 100 audit).

Newton iteration α_next = α·sqrt(T/M) converges for 7/7 methods in 2-3 iterations.

At calibrated ±α on daily-dilemmas (n=219), all 6 adapters land deeply negative SI. Fix counts cluster at 14-19 across all methods, but adapters break 65-139 already-honest rows (vs 15-20 for engineered prompts).

Interpretation: prompts perturb topic-conditionally, adapters uniformly. At a matched off-task budget, adapters scatter mass over already-correct rows. RepE sits between.

Caveats: single seed; calibration is off-task; anchor audit p95 is 1.78× calib (calibrated conservatively).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
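
The α update named in the message can be sketched as a small fixed-point loop. This is a minimal illustration, not the repository's code. It reads T as the target p95 token-KL and M as the p95 token-KL measured at the current α, which is an assumption about the notation. The square root would be exact if M grew as α², which is also an assumption here. The names `calibrate_alpha`, `measure`, and `toy_p95_kl`, and the 5% tolerance, are hypothetical. The toy curve stands in for a real measurement over held-out continuations.

```python
import math
from typing import Callable


def calibrate_alpha(
    measure: Callable[[float], float],
    target: float,
    alpha0: float = 1.0,
    rel_tol: float = 0.05,
    max_iters: int = 10,
) -> tuple[float, int]:
    """Rescale alpha until measure(alpha) is within rel_tol of target.

    `measure` is a hypothetical callable returning the p95 token-KL (M)
    at a given steering strength. Returns (alpha, number of updates applied).
    """
    alpha = alpha0
    for step in range(max_iters):
        m = measure(alpha)
        if m <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(m - target) <= rel_tol * target:
            return alpha, step
        # Update from the commit message: alpha_next = alpha * sqrt(T / M).
        alpha *= math.sqrt(target / m)
    # Budget exhausted; the caller should check the final measurement.
    return alpha, max_iters


def toy_p95_kl(alpha: float) -> float:
    """Hypothetical stand-in: mostly quadratic in alpha with a cubic term."""
    return 2.5 * alpha**2 + 0.3 * abs(alpha) ** 3


# 0.61 nats is the target footprint quoted in the commit message.
alpha_star, n_updates = calibrate_alpha(toy_p95_kl, target=0.61)
print(f"alpha={alpha_star:.4f} after {n_updates} updates")
```

On a curve that is close to quadratic, such as the toy one, the loop settles in a couple of updates. A real measurement function would be noisier and might need a looser tolerance.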