weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 18:27:18 +08:00

Author	SHA1	Message	Date
wassname	a0f4e719af	Add batched data gen and bidir calibration	2026-05-01 18:58:08 +08:00
wassname	b2ef8fef7b	wip	2026-04-30 21:06:18 +08:00
wassname	a48430b075	switch training/eval axis from sycophancy to honesty	2026-04-28 06:00:03 +08:00
    - data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng short-form), _load_suffixes() reading data/branching_suffixes.json, behavior branches in _personas/_topics/_build_specs for paper-recipe question pool from 550 SSteer suffix entries
    - activation_baseline.py: _fit_repe_directions branches on behavior; honesty mode captures last-token hidden states under pos/neg personas with assistant_prefixes from suffix entries (all-layers RepE)
    - prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench J.2), both as plain strings
    - evals/smoke.py: behavior field in SmokeCfg
    - data/branching_suffixes.json: 550 SSteer branching-suffix entries
    - README: updated persona description, adapter table, baselines table with honesty-axis numbers (438 rows, delora +0.237 best)
    - RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
    - fork_plan.md: open design question resolved as option 2 (honesty axis)
    - HANDOVER.md: overnight handover notes
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wassname	7be1487d7b	data recipe: drop n_pairs/judge/Optional knobs, explicit grid	2026-04-26 10:24:31 +08:00
    Subagent review fixes:
    - DataCfg / Cfg expose the grid directly (n_topics, n_personas, n_samples) as required ints with paper defaults (20/5/10). Drops `n_pairs` and the silent round() that made the count fuzzy. Drops `Optional[int]` smoke overrides; smoke just sets 2/1/2 = 4 pairs.
    - Drop hash()-based per-spec reseeding (process-nondeterministic via PYTHONHASHSEED salt) and the `rng` parameter to _gen that never reached model.generate. One torch.manual_seed at start; spec order seeded by rng.
    - Delete _judge_filter stub + cfg.judge flag (dead code, paper §3 GPT-4.1-mini filter not implemented yet; TODO comment instead).
    - replicate._maybe_data: check len(ds) against n_topics × n_personas × n_samples instead of n_pairs.
    - justfile: drop --n-pairs 1000.
wassname	7e1b171875	paper data recipe + LoRA hyperparams + n_pairs hardening	2026-04-26 10:19:59 +08:00
    - data: 5 pos + 5 neg personas, 20 train + 12 eval topic split (paper §3 / Appendix C), n_samples solved from n_pairs. judge filter stub (off by default; paper uses GPT-4.1-mini).
    - eval/sycophancy: read true held-out eval_topics() instead of SYCOPHANCY_TOPICS[-16:].
    - replicate: fix epochs threading; n_pairs reuse fails fast on mismatch; smoke knobs (n_topics, n_personas) plumbed.
    - train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 / wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss logging.
    - run_demo: train_topics() for in_dist demo claims.
    - README: scope block reflects paper-matching recipe.
wassname	363e2db14d	phase 0-2: HF+PEFT pipeline, smoke, subspace alignment	2026-04-25 20:14:07 +08:00
    Rip Axolotl/vLLM, switch to HF+PEFT functional pipeline. Add LoRA/DoRA/PiSSA/DeLoRA train, delta-W diff, weight_steer hook, sycophancy logratio eval, and SVD top-k + weak-readout alignment. Smoke runs end-to-end on tiny-random qwen3 with BEARTYPE=1.
    Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

6 Commits
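
Commit 7be1487d7b replaces a fuzzy `n_pairs` target with an explicit grid of n_topics × n_personas × n_samples, and seeds the spec order from a single RNG instead of the salted `hash()`. A minimal sketch of that idea follows. It is not the repository's code: the `DataCfg` field names and the 20/5/10 and 2/1/2 values come from the commit message, while `expected_pairs`, `build_specs`, and the tuple layout of a spec are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCfg:
    # Grid sizes named in the commit message; defaults are the quoted paper values.
    n_topics: int = 20
    n_personas: int = 5
    n_samples: int = 10

    @property
    def expected_pairs(self) -> int:
        # Hypothetical helper: the pair count is an exact product, no rounding.
        return self.n_topics * self.n_personas * self.n_samples


def build_specs(cfg: DataCfg, seed: int = 0) -> list[tuple[int, int, int]]:
    # Hypothetical helper: enumerate the full grid, then shuffle with one
    # explicitly seeded RNG so the order is identical across processes
    # (unlike str hash(), which is salted per process by PYTHONHASHSEED).
    rng = random.Random(seed)
    specs = [
        (t, p, s)
        for t in range(cfg.n_topics)
        for p in range(cfg.n_personas)
        for s in range(cfg.n_samples)
    ]
    rng.shuffle(specs)
    return specs


paper = DataCfg()
smoke = DataCfg(n_topics=2, n_personas=1, n_samples=2)
print(paper.expected_pairs, smoke.expected_pairs)
```

With these values the paper grid gives 20 × 5 × 10 = 1000 pairs and the smoke grid gives 2 × 1 × 2 = 4, matching the counts quoted in the commit messages. A length check like the one described for `replicate._maybe_data` could compare `len(ds)` against `expected_pairs`.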