Files
weight-steering/docs
wassname 06ec48d8f7 KL-budget calibration: match off-task dist-shift across methods
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt;
calibrate α per method so p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods
in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters
land deeply negative SI: fix counts cluster at 14-19 across all methods,
but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly — at matched off-task budget, adapters scatter mass over
already-correct rows. RepE sits between.

Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 14:08:55 +08:00
..
2026-04-25 19:27:53 +08:00
2026-04-25 19:27:53 +08:00
2026-04-25 19:27:53 +08:00
2026-04-25 19:27:53 +08:00