weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 18:27:18 +08:00

Author	SHA1	Message	Date
wassname	497ee05aef	first pass care vs sanctity	2026-05-03 06:02:07 +08:00
wassname	b2ef8fef7b	wip	2026-04-30 21:06:18 +08:00
wassname	7440229d48	narrow honesty: clamp n_personas to list length, expose grid in sweep. Allows narrow honesty (1 persona pair) to share data-volume parity with broader behaviors by bumping n_samples. data.py logs the clamp; replicate.py on-disk size check uses clamped n_personas; run_sweep.py exposes n_topics/n_personas/n_samples to CLI. README clarifies honesty_label provenance: party='You' filter from Action_to_party_to_value, not values_aggregated.	2026-04-28 21:23:32 +08:00
wassname	06ec48d8f7	KL-budget calibration: match off-task dist-shift across methods. α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt; calibrate α per method so p95 token-KL on held-out continuations matches prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters land deeply negative SI: fix counts cluster at 14-19 across all methods, but adapters break 65-139 already-honest rows (vs 15-20 for engineered prompts). Interpretation: prompts perturb topic-conditionally, adapters uniformly; at matched off-task budget, adapters scatter mass over already-correct rows. RepE sits between. Caveats: single seed, calibration off-task, anchor audit p95 is 1.78× calib (calibrated conservatively). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 14:08:55 +08:00
wassname	da75668d6b	move RESEARCH_JOURNAL and fork_plan under docs/. Working notes belong with the rest of the docs. Updated relative links in docs/hypothesis_ablation_catalog.md from ../fork_plan.md to fork_plan.md since both files now live in docs/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 09:09:52 +08:00

5 Commits
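
Note on commit 06ec48d8f7: the update rule it quotes, α_next=α·sqrt(T/M), can be read as a fixed-point iteration in which T is the target p95 token-KL and M is the p95 token-KL measured at the current α. The sketch below is a minimal illustration of that rule only, not the repository's code. The function name `calibrate_alpha`, the `measure_kl` callback, the tolerance, and the toy quadratic KL model are all hypothetical. The square-root step solves for α in one update if the measured KL scales as α²; that scaling is an assumption made here for the example, not something the commit message states.

```python
import math
from typing import Callable, Tuple


def calibrate_alpha(
    measure_kl: Callable[[float], float],  # hypothetical: returns p95 token-KL at a given alpha
    target: float,                         # T: the KL budget to match
    alpha0: float = 1.0,
    rtol: float = 0.05,                    # hypothetical stopping tolerance
    max_iters: int = 10,
) -> Tuple[float, int]:
    """Iterate alpha_next = alpha * sqrt(T / M) until M is within rtol of T.

    Returns the calibrated alpha and the number of updates applied.
    """
    alpha = alpha0
    for n_updates in range(max_iters + 1):
        measured = measure_kl(alpha)  # M at the current alpha
        if measured <= 0:
            raise ValueError("measured KL must be positive to take the ratio T/M")
        if abs(measured - target) <= rtol * target:
            return alpha, n_updates
        alpha *= math.sqrt(target / measured)
    raise RuntimeError("did not reach the KL target within max_iters updates")


if __name__ == "__main__":
    # Toy stand-in for a real measurement: KL assumed to grow as 0.2 * alpha^2.
    alpha, n_updates = calibrate_alpha(lambda a: 0.2 * a * a, target=0.61)
    print(f"alpha={alpha:.4f} after {n_updates} update(s)")
```

With a real adapter the KL is not exactly quadratic in α, so more than one update would be expected; the commit reports 2-3 iterations per method.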