weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 17:18:22 +08:00

Author	SHA1	Message	Date
wassname	497ee05aef	first pass care vs sanctity	2026-05-03 06:02:07 +08:00
wassname	b2ef8fef7b	wip	2026-04-30 21:06:18 +08:00
wassname	7440229d48	narrow honesty: clamp n_personas to list length, expose grid in sweep. Allows narrow honesty (1 persona pair) to share data-volume parity with broader behaviors by bumping n_samples. data.py logs the clamp; replicate.py on-disk size check uses clamped n_personas; run_sweep.py exposes n_topics/n_personas/n_samples to CLI. README clarifies honesty_label provenance: party='You' filter from Action_to_party_to_value, not values_aggregated.	2026-04-28 21:23:32 +08:00
wassname	06ec48d8f7	KL-budget calibration: match off-task dist-shift across methods. α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt; calibrate α per method so p95 token-KL on held-out continuations matches prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters land deeply negative SI: fix counts cluster at 14-19 across all methods, but adapters break 65-139 already-honest rows (vs 15-20 for engineered prompts). Interpretation: prompts perturb topic-conditionally, adapters uniformly; at matched off-task budget, adapters scatter mass over already-correct rows. RepE sits between. Caveats: single seed, calibration off-task, anchor audit p95 is 1.78× calib (calibrated conservatively). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 14:08:55 +08:00
wassname	da75668d6b	move RESEARCH_JOURNAL and fork_plan under docs/. Working notes belong with the rest of the docs. Updated relative links in docs/hypothesis_ablation_catalog.md from ../fork_plan.md to fork_plan.md since both files now live in docs/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 09:09:52 +08:00
wassname	6ec664995b	T6/T7/T8 ablations + lens-search hold pending multiseed. Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`; add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family); fix `prompt_baseline.py` engineered-prompt tuple bug and add `DIFF_FILENAME` constant in `diff.py`; delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`); document (README, fork_plan, RESEARCH_JOURNAL) that each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 19:05:20 +08:00
wassname	a3d999fd92	wip	2026-04-27 09:59:06 +08:00
wassname	651ad132d3	v7: cold-eyes evidence review + flag write-family-below-null in conclusion	2026-04-26 20:01:11 +08:00
wassname	a1b38dc456	docs: add v6 hypothesis review (subagent + reviewer-of-reviewer)	2026-04-26 19:45:13 +08:00
wassname	f0bce8be90	tidy	2026-04-25 19:27:53 +08:00

10 Commits
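
The α update quoted in commit 06ec48d8f7 (α_next=α·sqrt(T/M)) can be sketched as below. This is a minimal illustration, not the repository's code: it assumes T is the target p95 token-KL and M is the p95 token-KL measured at the current α, and that KL grows roughly quadratically in α for small perturbations, which is what would make the square-root step land close to the target. The names `calibrate_alpha`, `measure_p95_kl`, and `toy_kl` are hypothetical; in practice the measurement would come from running the steered model on held-out continuations.

```python
import math
from typing import Callable, Tuple


def calibrate_alpha(
    measure_p95_kl: Callable[[float], float],  # hypothetical: returns M for a given alpha
    target_kl: float,                          # T, e.g. the prompt baseline's p95 token-KL
    alpha0: float = 1.0,
    rtol: float = 0.05,
    max_iters: int = 10,
) -> Tuple[float, int]:
    """Iterate alpha <- alpha * sqrt(T / M) until M is within rtol of T.

    Returns the calibrated alpha and the number of measurements taken.
    """
    alpha = alpha0
    for n_measured in range(1, max_iters + 1):
        measured = measure_p95_kl(alpha)
        if measured <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(measured - target_kl) <= rtol * target_kl:
            return alpha, n_measured
        # If KL ~ c * alpha**2, this step solves c * alpha_next**2 = T exactly.
        alpha *= math.sqrt(target_kl / measured)
    raise RuntimeError("alpha calibration did not converge")


def toy_kl(alpha: float) -> float:
    """Stand-in for a real measurement: an exactly quadratic KL footprint."""
    return 2.5 * alpha ** 2


alpha_star, n_measured = calibrate_alpha(toy_kl, target_kl=0.61)
```

With the purely quadratic stand-in the loop stops on its second measurement; a real model's KL is only approximately quadratic in α, so a few more iterations would be expected.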