weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 17:18:22 +08:00

Author	SHA1	Message	Date
wassname	48c1b07b83	readme	2026-05-05 08:12:41 +08:00
wassname	cf0f7d6c54	results	2026-05-04 18:33:19 +08:00
wassname	7eac38829d	hmm	2026-05-04 06:17:30 +08:00
wassname	9dff8d0256	feat: add auth_socn behavior + behavior-aware axis_shift + pmass/flips/bare-logit eval helpers - data.py: AUTH_SOCN_POS/NEG_PERSONAS (6 pairs, ported from steering-lite branching.py), wired into _personas() / _topics() / _build_specs() for auth_socn behavior - tinymfv_airisk.py: AXIS_PAIR dict + behavior-aware _axis_shift (auth_socn uses ΔlogitSocNorms − ΔlogitAuthority vs trad_care's ΔlogitSanc − ΔlogitCare); PMASS_FLOOR=0.9 NaN-gate; _logit NaN-safe; _flips_per_foundation_table; _bare_logit_per_foundation_table; new __foundations_flips.csv + __bare_logit.csv artifacts - README: fill trad_care comparison table with actual ws results (jobs 93-96), add bare model row for ws, add sl:engineered_prompt row Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-03 06:11:48 +08:00
wassname	497ee05aef	first pass care vs sanctity	2026-05-03 06:02:07 +08:00
wassname	4f2034dd46	tidy	2026-05-02 05:52:25 +08:00
wassname	71a8d4c555	tidy	2026-05-01 22:29:06 +08:00
wassname	27cf12c2d8	Switch AIRisk evals to tiny-mfv workflow	2026-05-01 20:47:31 +08:00
wassname	b2ef8fef7b	wip	2026-04-30 21:06:18 +08:00
wassname	ce73e97154	fix: skip guided-CoT for non-thinking models; trim README Gemma-3/4 don't have </think> as a special token, so guided_cot_one raised RuntimeError and killed the whole sweep. Fix: add has_thinking_mode to _tok_extras and gate phase_a2 in replicate.py on it. README cut from ~380 to ~120 lines: results tables, how to run, cite, links. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 05:39:50 +08:00
wassname	7440229d48	narrow honesty: clamp n_personas to list length, expose grid in sweep Allows narrow honesty (1 persona pair) to share data-volume parity with broader behaviors by bumping n_samples. data.py logs the clamp; replicate.py on-disk size check uses clamped n_personas; run_sweep.py exposes n_topics/n_personas/n_samples to CLI. README clarifies honesty_label provenance: party='You' filter from Action_to_party_to_value, not values_aggregated.	2026-04-28 21:23:32 +08:00
wassname	06ec48d8f7	KL-budget calibration: match off-task dist-shift across methods α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt; calibrate α per method so p95 token-KL on held-out continuations matches prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters land deeply negative SI: fix counts cluster at 14-19 across all methods, but adapters break 65-139 already-honest rows (vs 15-20 for engineered prompts). Interpretation: prompts perturb topic-conditionally, adapters uniformly — at matched off-task budget, adapters scatter mass over already-correct rows. RepE sits between. Caveats: single seed, calibration off-task, anchor audit p95 is 1.78× calib (calibrated conservatively). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 14:08:55 +08:00
wassname	325171c291	fix SI_best, add prompt row-alignment check, narrow dw_decomp claims Address pi-review issues: - SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate, so -si_rev != counter_rate - 2flip_rate. Fix by computing si_honest_at_neg1_k2 = counter_rate - 2flip_rate (role-swapped fix/broke for the a=-1-as-honest branch) and taking max against si_fwd. - Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference check between base, honest_prompt, and dishonest_prompt before computing paired SI. Previously only .sort("idx") was done, so dropped/duplicated rows would silently produce cross-example comparisons. - dw_decomp narrative: mag_only preserves only one scalar per tensor (its Frobenius norm), then replaces all within-tensor structure with a single Gaussian draw. Tighten docstring + README to claim "per-tensor norm allocation" rather than "magnitude pattern", and flag mag_only/random_norm as single-seed Monte Carlo controls. Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to +3.46 because the role-swapped a=-1 branch is its better direction. Update README OOD SI table accordingly. Refresh RepE rows in raw-logratio table with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop stale pmass caveat block. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 09:17:56 +08:00
wassname	b7bad4e002	DeLoRA dW decomp: magnitude pattern carries most of the steering Result: random_direction * original_per_tensor_norm (mag_only) gives a larger positive logratio shift (+1.07 at a=+1) than the full trained dW (+0.24), with 5x fewer broken rows. Stripping the magnitude pattern (dir_only) collapses the effect to +0.02. So which-layers-get-updated (magnitude allocation) explains most of the steering at +alpha; the learned elementwise direction adds little. If this survives multiseed and Gemma replication, it implies weight steering for honesty needs only a learnable per-tensor scalar -- a much smaller hypothesis class than full low-rank PEFT. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 08:33:24 +08:00
wassname	64adf9267d	SI tables v2: SI_best, SI_k1, fix/broke rates; paired prompts; IID syc - Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under simple and engineered families, giving full bidirectional SI for prompts (same as dW) - Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned upper bound (snooping-aware robustness probe) - Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2 to expose how much the class-imbalance-driven 2x penalty contributes - Expose fix_rate / broke_rate columns so the SI components are visible - Add IID syc table (held-out persona claims) using cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors - Add raw mean +- std logratio table per (method, coeff) for OOD The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest, +5.7 mean shift) but most break OOD via the broke_rate channel. OFT and engineered prompts are the only methods with non-negative SI_best. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 08:29:49 +08:00
wassname	0ded47388f	SI tables: README + nbs/honesty_tables.py with adapters/prompts/RepE - Combined methods comparison table in README using SI as primary metric - nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline, activation_baseline) - prompt_baseline.py: si_fwd computed inline for prompt methods - activation_baseline.py: tok.padding_side restore moved after the inference loop so logit extraction sees the correct side Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 08:25:05 +08:00
wassname	a48430b075	switch training/eval axis from sycophancy to honesty - data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng short-form), _load_suffixes() reading data/branching_suffixes.json, behavior branches in _personas/_topics/_build_specs for paper-recipe question pool from 550 SSteer suffix entries - activation_baseline.py: _fit_repe_directions branches on behavior; honesty mode captures last-token hidden states under pos/neg personas with assistant_prefixes from suffix entries (all-layers RepE) - prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench J.2), both as plain strings - evals/smoke.py: behavior field in SmokeCfg - data/branching_suffixes.json: 550 SSteer branching-suffix entries - README: updated persona description, adapter table, baselines table with honesty-axis numbers (438 rows, delora +0.237 best) - RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry - fork_plan.md: open design question resolved as option 2 (honesty axis) - HANDOVER.md: overnight handover notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 06:00:03 +08:00
wassname	c828b0c00b	baselines	2026-04-27 19:40:43 +08:00
wassname	6ec664995b	T6/T7/T8 ablations + lens-search hold pending multiseed - Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`. - Add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family). - Fix `prompt_baseline.py` engineered-prompt tuple bug; add `DIFF_FILENAME` constant in `diff.py`. - Delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`). - Document (README, fork_plan, RESEARCH_JOURNAL): each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 19:05:20 +08:00
wassname	2f12058b7e	clarify tested subspace and parametrization hypotheses	2026-04-27 07:10:39 +08:00
wassname	b001c40521	document adapter benchmark and projection interpretation	2026-04-27 07:09:02 +08:00
wassname	7e1b171875	paper data recipe + LoRA hyperparams + n_pairs hardening - data: 5 pos + 5 neg personas, 20 train + 12 eval topic split (paper §3 / Appendix C), n_samples solved from n_pairs. judge filter stub (off by default; paper uses GPT-4.1-mini). - eval/sycophancy: read true held-out eval_topics() instead of SYCOPHANCY_TOPICS[-16:]. - replicate: fix epochs threading; n_pairs reuse fails fast on mismatch; smoke knobs (n_topics, n_personas) plumbed. - train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 / wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss logging. - run_demo: train_topics() for in_dist demo claims. - README: scope block reflects paper-matching recipe.	2026-04-26 10:19:59 +08:00
wassname	3ff283d535	README: fork notice + pipeline overview Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 20:16:57 +08:00
Constanza	977c054586	Update README.md	2025-11-11 09:18:45 +01:00
cfierro94	90065f035f	first commit	2025-10-17 11:14:24 +02:00
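
Note on commit 06ec48d8f7: the calibration update α_next = α·sqrt(T/M) can be read as a fixed-point iteration that assumes the measured KL grows roughly quadratically in α. The sketch below is illustrative only and is not the repository's code. The function names, the tolerance, and the toy KL curve are assumptions; only the update rule and the ≈0.61 nat target come from the commit message.

```python
import math
from typing import Callable


def calibrate_alpha(
    measure_kl: Callable[[float], float],  # hypothetical: returns p95 token-KL at a given alpha
    target_kl: float,
    alpha0: float = 1.0,
    rtol: float = 0.05,   # assumed stopping tolerance, not taken from the repo
    max_iter: int = 10,
) -> float:
    """Rescale alpha until the measured KL is within rtol of target_kl."""
    alpha = alpha0
    for _ in range(max_iter):
        measured = measure_kl(alpha)
        if measured <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(measured - target_kl) <= rtol * target_kl:
            return alpha
        # Update from the commit message: alpha_next = alpha * sqrt(T / M).
        alpha *= math.sqrt(target_kl / measured)
    raise RuntimeError("calibration did not converge within max_iter")


def toy_kl(alpha: float) -> float:
    """Stand-in for a real measurement: slightly super-quadratic in alpha."""
    return 0.3 * abs(alpha) ** 2.2


alpha_star = calibrate_alpha(toy_kl, target_kl=0.61)
```

If KL were exactly quadratic in α, the update would land on the target in a single step; the toy curve is deliberately not quadratic so that more than one step is needed.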

25 Commits
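
Note on commits b7bad4e002 and 325171c291: the mag_only control is described as keeping one scalar per tensor (its Frobenius norm) and replacing all within-tensor structure with a single Gaussian draw. A minimal sketch of that idea follows, using plain nested lists so it runs without dependencies. The tensor names, the dict layout, and the seed handling are assumptions for illustration, not the repository's implementation.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def frobenius_norm(tensor: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in tensor for x in row))


def mag_only(delta_w: Dict[str, Matrix], seed: int = 0) -> Dict[str, Matrix]:
    """Replace each tensor with Gaussian noise rescaled to the original Frobenius norm."""
    rng = random.Random(seed)  # one seed gives one Monte Carlo draw
    out: Dict[str, Matrix] = {}
    for name, tensor in delta_w.items():
        noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in tensor]
        noise_norm = frobenius_norm(noise)
        # Guard against an empty tensor, where the noise norm is zero.
        scale = frobenius_norm(tensor) / noise_norm if noise_norm > 0 else 0.0
        out[name] = [[x * scale for x in row] for row in noise]
    return out


# Hypothetical two-tensor weight delta with very different norms (5.0 and 0.1).
delta_w = {
    "layers.0.q_proj": [[3.0, 0.0], [0.0, 4.0]],
    "layers.1.q_proj": [[0.1, 0.0], [0.0, 0.0]],
}
control = mag_only(delta_w, seed=0)
```

Because the result depends on a single random draw, conclusions drawn from it are single-seed, which matches the caveat in commit 325171c291; averaging over several seeds would give a less noisy control.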