weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 17:18:22 +08:00

Author	SHA1	Message	Date
wassname	57a08750b8	fix: on-policy data paths, 4-bit inference, revert adapter defaults - data/load_pairs: path now includes model slug (out/data/{model}/{behavior}) so data from different models can't be silently reused - data.py, kl_calibrate.py, tinymfv_airisk.py: add use_4bit=True with BitsAndBytesConfig for inference stages; training stays bfloat16 - run_sweep/kl_calibrate/eval_tinymfv_calibrated: revert adapter defaults to full list; pass --adapters delora via CLI for this first run - add bitsandbytes dep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-03 17:31:09 +08:00
wassname	43278709d7	fix: transformers>=5.6.0, flash-attn locked, switch to Qwen3-4B. Qwen3.5-4B requires linear_attention mask support not in transformers<5.6. Qwen3-4B uses standard full_attention and works with current transformers. flash-attn added as URL dep so uv sync keeps it in .venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-03 16:54:50 +08:00
wassname	8ed3103e47	feat(authority): add authority behavior, logratio+SI metrics, prune dead code - Add AUTHORITY_PROMPT + 3 persona pairs (MFT-paper framing, sl-identical) - Wire authority into data._personas/_topics/_build_specs - Add SINGLE_FOUNDATION + _axis_shift for single-foundation behaviors - Add logratio to per-vignette/frame scoring (same convention as sl) - Add _si.py: port si_per_foundation from sl foundations.py - Drop prompt_baseline mode, repe, sycophancy, subspace, run_demo - Strip kl_calibrate to dW-only; remove repe+prompt_texts deps - Simplify replicate.py to train+diff only (no eval/demo/subspace) - Default behavior="authority" across eval, sweep, replicate - Install tinymfv git dep; flash_attn 2.6.3 prebuilt wheel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-03 14:04:23 +08:00
wassname	a48430b075	switch training/eval axis from sycophancy to honesty - data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng short-form), _load_suffixes() reading data/branching_suffixes.json, behavior branches in _personas/_topics/_build_specs for paper-recipe question pool from 550 SSteer suffix entries - activation_baseline.py: _fit_repe_directions branches on behavior; honesty mode captures last-token hidden states under pos/neg personas with assistant_prefixes from suffix entries (all-layers RepE) - prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench J.2), both as plain strings - evals/smoke.py: behavior field in SmokeCfg - data/branching_suffixes.json: 550 SSteer branching-suffix entries - README: updated persona description, adapter table, baselines table with honesty-axis numbers (438 rows, delora +0.237 best) - RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry - fork_plan.md: open design question resolved as option 2 (honesty axis) - HANDOVER.md: overnight handover notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 06:00:03 +08:00
wassname	363e2db14d	phase 0-2: HF+PEFT pipeline, smoke, subspace alignment. Rip Axolotl/vLLM, switch to HF+PEFT functional pipeline. Add LoRA/DoRA/PiSSA/DeLoRA train, delta-W diff, weight_steer hook, sycophancy logratio eval, and SVD top-k + weak-readout alignment. Smoke runs end-to-end on tiny-random qwen3 with BEARTYPE=1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 20:14:07 +08:00

5 Commits