weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 18:27:18 +08:00

Author	SHA1	Message	Date
wassname	497ee05aef	first pass care vs sanctity	2026-05-03 06:02:07 +08:00
wassname	4f2034dd46	tidy	2026-05-02 05:52:25 +08:00
wassname	27cf12c2d8	Switch AIRisk evals to tiny-mfv workflow	2026-05-01 20:47:31 +08:00
wassname	a0f4e719af	Add batched data gen and bidir calibration	2026-05-01 18:58:08 +08:00
wassname	a3d999fd92	wip	2026-04-27 09:59:06 +08:00
wassname	7be1487d7b	data recipe: drop n_pairs/judge/Optional knobs, explicit grid. Subagent review fixes: (1) DataCfg / Cfg expose the grid directly (n_topics, n_personas, n_samples) as required ints with paper defaults (20/5/10); drops `n_pairs` and the silent round() that made the count fuzzy; drops `Optional[int]` smoke overrides, so smoke just sets 2/1/2 = 4 pairs. (2) Drop hash()-based per-spec reseeding (process-nondeterministic via PYTHONHASHSEED salt) and the `rng` parameter to _gen that never reached model.generate; one torch.manual_seed at start, spec order seeded by rng. (3) Delete _judge_filter stub + cfg.judge flag (dead code; paper §3 GPT-4.1-mini filter not implemented yet, TODO comment instead). (4) replicate._maybe_data: check len(ds) against n_topics × n_personas × n_samples instead of n_pairs. (5) justfile: drop --n-pairs 1000.	2026-04-26 10:24:31 +08:00
wassname	f4083d74ac	Enhance fork plan and add guided-CoT evaluation. Changes: (1) Updated the fork plan with detailed phases and objectives for small model adaptation and evaluation. (2) Added a new guided-CoT evaluation script to assess model coherence under steering. (3) Introduced demo functionality to showcase adapter coherence and guided-CoT performance. (4) Modified training configuration to include layer fraction targeting for LoRA. (5) Improved evaluation outputs for clarity and added validation checks.	2026-04-26 09:16:54 +08:00
wassname	363e2db14d	phase 0-2: HF+PEFT pipeline, smoke, subspace alignment. Rip Axolotl/vLLM, switch to HF+PEFT functional pipeline. Add LoRA/DoRA/PiSSA/DeLoRA train, delta-W diff, weight_steer hook, sycophancy logratio eval, and SVD top-k + weak-readout alignment. Smoke runs end-to-end on tiny-random qwen3 with BEARTYPE=1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 20:14:07 +08:00
wassname	f0bce8be90	tidy	2026-04-25 19:27:53 +08:00

9 Commits
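
Commit 7be1487d7b describes an explicit data grid (n_topics × n_personas × n_samples) and a spec order driven by a single seeded rng instead of per-spec `hash()` reseeding. The repository's actual code is not shown on this page, so the following is only a minimal sketch of that idea. The names `build_grid` and `ordered_specs` are hypothetical, and Python's standard `random.Random` stands in for the `torch.manual_seed` call the commit mentions.

```python
import itertools
import random


def build_grid(n_topics: int, n_personas: int, n_samples: int) -> list:
    """Enumerate every (topic, persona, sample) index triple.

    Hypothetical helper: the pair count is exactly the product of the
    three ints, so no rounding is needed to hit a target size.
    """
    return list(itertools.product(range(n_topics), range(n_personas), range(n_samples)))


def ordered_specs(grid: list, seed: int) -> list:
    """Shuffle the grid with one seeded rng (hypothetical helper).

    Built-in hash() on strings is salted per process unless PYTHONHASHSEED
    is fixed, so seeding from it changes between runs. A single explicit
    seed gives the same order every time.
    """
    rng = random.Random(seed)
    specs = list(grid)  # copy so the caller's grid is left untouched
    rng.shuffle(specs)
    return specs


paper_grid = build_grid(20, 5, 10)  # paper defaults from the commit message
smoke_grid = build_grid(2, 1, 2)    # smoke setting from the commit message
print(len(paper_grid), len(smoke_grid))
```

With the defaults quoted in the commit message, the grid sizes come out to 20 × 5 × 10 = 1000 pairs and 2 × 1 × 2 = 4 pairs, which agrees with the dropped `--n-pairs 1000` flag and the "2/1/2 = 4 pairs" smoke setting.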