weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-07-01 22:24:54 +08:00

Author	SHA1	Message	Date
wassname	0bc46dc51e	cuda	2026-05-02 06:04:58 +08:00
wassname	71a8d4c555	tidy	2026-05-01 22:29:06 +08:00
wassname	b2ef8fef7b	wip	2026-04-30 21:06:18 +08:00
wassname	0f050f2734	honesty: narrow training/prompt/eval to honesty-only axis The wassname/daily_dilemmas-self-honesty dataset uses paired-opposite labels: if to_do has honesty in you_values -> to_do=+1, paired not_to_do=-1 even when not_to_do's own values are e.g. ['empathy'], unrelated to honesty. Those 211/438 filler rows were dragging dW SI negative. Changes: - dilemmas.py: filter to action-specific honesty rows at load (227/438) - data.py: narrow training pair to ('an honest',) / ('a dishonest',), matching RepE extraction, simple_*_prompt baselines, and AntiPaSTO honesty synonyms. Cluster paraphrases (candid/sincere/manipulative/...) drift onto a broader axis -- comparison was unfair. - prompt_baseline.py: regenerate engineered prompts via AxBench J.2 template with narrow-honesty concept slot. Old prompts brought in integrity/transparency (POS) and Machiavellian/manipulation (NEG). - nbs/rescore_honesty_only.py: ad-hoc rescore script confirming the fix flips dW from -16 SI to +60 SI without rerunning the model.	2026-04-28 21:11:14 +08:00
wassname	325171c291	fix SI_best, add prompt row-alignment check, narrow dw_decomp claims Address pi-review issues: - SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate, so -si_rev != counter_rate - 2*flip_rate. Fix by computing si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke for the a=-1-as-honest branch) and taking max against si_fwd. - Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference check between base, honest_prompt, and dishonest_prompt before computing paired SI. Previously only .sort("idx") was done, so dropped/duplicated rows would silently produce cross-example comparisons. - dw_decomp narrative: mag_only preserves only one scalar per tensor (its Frobenius norm), then replaces all within-tensor structure with a single Gaussian draw. Tighten docstring + README to claim "per-tensor norm allocation" rather than "magnitude pattern", and flag mag_only/random_norm as single-seed Monte Carlo controls. Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to +3.46 because the role-swapped a=-1 branch is its better direction. Update README OOD SI table accordingly. Refresh RepE rows in raw-logratio table with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop stale pmass caveat block. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 09:17:56 +08:00
wassname	64adf9267d	SI tables v2: SI_best, SI_k1, fix/broke rates; paired prompts; IID syc - Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under simple and engineered families, giving full bidirectional SI for prompts (same as dW) - Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned upper bound (snooping-aware robustness probe) - Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2 to expose how much the class-imbalance-driven 2x penalty contributes - Expose fix_rate / broke_rate columns so the SI components are visible - Add IID syc table (held-out persona claims) using cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors - Add raw mean +- std logratio table per (method, coeff) for OOD The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest, +5.7 mean shift) but most break OOD via the broke_rate channel. OFT and engineered prompts are the only methods with non-negative SI_best. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 08:29:49 +08:00
wassname	0ded47388f	SI tables: README + nbs/honesty_tables.py with adapters/prompts/RepE - Combined methods comparison table in README using SI as primary metric - nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline, activation_baseline) - prompt_baseline.py: si_fwd computed inline for prompt methods - activation_baseline.py: tok.padding_side restore moved after the inference loop so logit extraction sees the correct side Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 08:25:05 +08:00
wassname	6ec664995b	T6/T7/T8 ablations + lens-search hold pending multiseed - Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`. - Add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family). - Fix `prompt_baseline.py` engineered-prompt tuple bug; add `DIFF_FILENAME` constant in `diff.py`. - Delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`). - Document (README, fork_plan, RESEARCH_JOURNAL): each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 19:05:20 +08:00
wassname	db7979d0e2	baselines	2026-04-27 13:02:34 +08:00
wassname	25334ec574	fix daily-dilemmas cross-adapter baseline	2026-04-27 07:00:09 +08:00
wassname	6f41e47ea9	v10 functional projection falsifier for act-oracle overlap	2026-04-27 06:54:09 +08:00
wassname	236cea1267	cross_adapter_v9: aggregate v9 scope diagnostics + dilemmas across adapters	2026-04-26 21:55:19 +08:00
wassname	3f162027b1	v9 sweep: block-local act oracle + L=8 sanity (layer-scope diagnostic); ADAPTER env var for cross-adapter use	2026-04-26 21:53:45 +08:00
wassname	2c262d47a8	v8 polish: w_oracle + act_oracle (each saturates own axis), 3-panel scatter + bar of % to ideal	2026-04-26 21:14:01 +08:00
wassname	f4039dd2ee	v8: rank-honest pct_oracle metric (energy_frac / oracle@r_eff in [0,1]) Replaces v7's post-hoc 'pct_w_oracle = R_w / R_w_ceiling' (a ratio of two concentration ratios) with a per-row pct_oracle: candidate's energy_frac divided by the optimal rank-r_eff subspace's energy_frac on the same target. Rank-honest: chars_clusters (r_eff=7) is graded against rank-7 oracle, not rank-8. Activation oracle = PCA of L2-normalized hs_diff_B (matches existing energy_frac_act formula). Result: every non-oracle candidate lands at pct_oracle in [0.02, 0.11] on both axes. Best joint = WNR_union_TaskDiff at 0.089 (rank 16; all others rank 8). chars_clusters and layer_clean_resid_pca tied at ~0.085. This is a clean negative result: LoRA's task-specific delta is far from any of our hand-built linear primitives' spans.	2026-04-26 20:59:27 +08:00
wassname	3c9fb8d1f5	v7 sweep: per-tensor R_w + true weight ceiling + axis_kind tag Addresses three concerns from docs/review/v6_hypothesis_review.md: 1. R_w split into oproj/downproj + Frobenius-balanced combined. 2. dW_left_basis_ceiling as the true weight oracle. 3. axis_kind tag (write/read/mixed/ceiling). Single-seed result: chars_clusters and attn_min_taskdiff are top-5 by both R_act and R_w_combined. Write-family bases (write/mlp_write/global_write) all have R_w_combined ~ 1.0 (random null) -- natural weight-side bases fail the weight-axis test. Multi-seed deferred to v7b.	2026-04-26 19:55:42 +08:00
wassname	aba74c0f64	logs	2026-04-26 11:12:11 +08:00
wassname	c1c4d2f3cb	nb	2026-04-26 11:03:38 +08:00
wassname	ddfa018ebd	wip	2026-04-26 10:00:03 +08:00
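
Commit 325171c291 describes a symmetric-difference check on (idx, dilemma_idx, action_type) between the base, honest_prompt, and dishonest_prompt rows before paired SI is computed. A minimal sketch of such a check follows. The function names and the list-of-dicts row layout are assumptions for illustration and are not taken from the repository.

```python
from typing import Mapping, Sequence

# Key columns named in the commit message.
KEY_FIELDS = ("idx", "dilemma_idx", "action_type")


def row_keys(rows: Sequence[Mapping]) -> set:
    """Collect the key tuple of every row, rejecting duplicated keys."""
    keys = [tuple(row[field] for field in KEY_FIELDS) for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicated row keys")
    return set(keys)


def check_alignment(tables: Mapping[str, Sequence[Mapping]]) -> None:
    """Raise ValueError unless every table holds exactly the same row keys.

    `tables` maps a hypothetical condition name (for example "base") to its rows.
    """
    names = list(tables)
    key_sets = {name: row_keys(tables[name]) for name in names}
    reference = names[0]
    for other in names[1:]:
        # Symmetric difference: keys present in one table but not the other.
        mismatch = key_sets[reference] ^ key_sets[other]
        if mismatch:
            raise ValueError(
                f"{reference} vs {other}: {len(mismatch)} unmatched rows, "
                f"e.g. {sorted(mismatch)[:3]}"
            )
```

Sorting by idx alone would not catch a dropped or duplicated row, which is the failure the commit message describes. Comparing key sets fails loudly instead.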

19 Commits
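
Commit 325171c291 also characterises the mag_only control as keeping one scalar per tensor (its Frobenius norm) and replacing all within-tensor structure with a single Gaussian draw. The sketch below illustrates that idea in plain Python on nested lists. The helper names are hypothetical, and the repository's actual implementation may differ, for example by operating on framework tensors.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def frobenius_norm(tensor: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in tensor for x in row))


def norm_matched_gaussian(tensor: Matrix, rng: random.Random) -> Matrix:
    """Return Gaussian noise with the same shape and Frobenius norm as `tensor`.

    Only the norm of the input survives; its within-tensor structure is discarded.
    """
    target = frobenius_norm(tensor)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in tensor]
    noise_norm = frobenius_norm(noise)
    if noise_norm == 0.0:
        # Only reachable for an empty tensor; nothing to rescale.
        return noise
    scale = target / noise_norm
    return [[x * scale for x in row] for row in noise]


# One seed gives one draw, i.e. a single-seed Monte Carlo control.
rng = random.Random(0)
delta_w = [[3.0, 0.0], [0.0, 4.0]]  # toy stand-in for one dW tensor
control = norm_matched_gaussian(delta_w, rng)
```

Because a single draw is one Monte Carlo sample, conclusions drawn from such a control would normally be repeated over several seeds, in line with the commit's single-seed caveat.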