weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-07-01 22:59:46 +08:00

Files

wassname 06ec48d8f7 KL-budget calibration: match off-task dist-shift across methods

α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt,
so calibrate α per method until p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
prompts, 100 audit). The Newton-style iteration α_next = α·sqrt(T/M) converges
for 7/7 methods in 2-3 iterations. At calibrated ±α on daily-dilemmas (n=219),
all 6 adapters land deeply negative SI: fix counts cluster at 14-19 across all
methods, but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly; at matched off-task budget, adapters scatter mass over
already-correct rows. RepE sits between.
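
A minimal sketch of the α update described above, not the repository's actual code. It assumes T is the target p95 token-KL and M is the p95 token-KL measured at the current α, which the commit message does not spell out. The function name `calibrate_alpha`, the tolerance, and the toy quadratic `toy_p95_kl` are hypothetical and used only for illustration. The square-root step is exact only if KL grows roughly quadratically in α, which is also an assumption here.

```python
import math
from typing import Callable, Tuple


def calibrate_alpha(
    measure_p95_kl: Callable[[float], float],  # hypothetical: returns p95 token-KL (nats) at a given alpha
    target_kl: float,                          # assumed meaning of T in the commit message
    alpha0: float = 1.0,
    rel_tol: float = 0.05,                     # illustrative tolerance, not from the source
    max_iters: int = 10,
) -> Tuple[float, int]:
    """Rescale alpha until the measured KL is within rel_tol of the target.

    Returns the calibrated alpha and the number of KL measurements taken.
    """
    alpha = alpha0
    for n_measurements in range(1, max_iters + 1):
        measured = measure_p95_kl(alpha)       # assumed meaning of M
        if measured <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(measured - target_kl) <= rel_tol * target_kl:
            return alpha, n_measurements
        # Update from the commit message: alpha_next = alpha * sqrt(T / M)
        alpha *= math.sqrt(target_kl / measured)
    raise RuntimeError("alpha did not converge within max_iters")


def toy_p95_kl(alpha: float) -> float:
    """Stand-in for a real measurement: KL assumed exactly quadratic in alpha."""
    return 2.5 * alpha ** 2


alpha, n = calibrate_alpha(toy_p95_kl, target_kl=0.61)
print(f"calibrated alpha={alpha:.3f} after {n} measurements")
```

With a real method, `measure_p95_kl` would apply the adapter or steering vector at strength α, score held-out continuations, and return the 95th percentile of per-token KL against the unsteered model. Real KL curves are only approximately quadratic, so more than one rescaling step may be needed.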

Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-28 14:08:55 +08:00

papers

T6/T7/T8 ablations + lens-search hold pending multiseed

2026-04-27 19:05:20 +08:00

review

v7: cold-eyes evidence review + flag write-family-below-null in conclusion

2026-04-26 20:01:11 +08:00

AntiPaSTO_concepts

tidy

2026-04-25 19:27:53 +08:00

blog_adapter_as_hypothesis

tidy