6 Commits

Author SHA1 Message Date
wassname a0f4e719af Add batched data gen and bidir calibration 2026-05-01 18:58:08 +08:00
wassname b2ef8fef7b wip 2026-04-30 21:06:18 +08:00
wassname a48430b075 switch training/eval axis from sycophancy to honesty
- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:00:03 +08:00
wassname 7be1487d7b data recipe: drop n_pairs/judge/Optional knobs, explicit grid
Subagent review fixes:

- DataCfg / Cfg expose the grid directly (n_topics, n_personas, n_samples)
  as required ints with paper defaults (20/5/10). Drops `n_pairs` and the
  silent round() that made the count fuzzy. Drops `Optional[int]` smoke
  overrides — smoke just sets 2/1/2 = 4 pairs.
- Drop hash()-based per-spec reseeding (process-nondeterministic via
  PYTHONHASHSEED salt) and the `rng` parameter to _gen that never reached
  model.generate. One torch.manual_seed at start; spec order seeded by rng.
- Delete _judge_filter stub + cfg.judge flag (dead code, paper §3 GPT-4.1-mini
  filter not implemented yet — TODO comment instead).
- replicate._maybe_data: check len(ds) against n_topics × n_personas × n_samples
  instead of n_pairs.
- justfile: drop --n-pairs 1000.
2026-04-26 10:24:31 +08:00
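Note on 7be1487d7b: a minimal sketch of why the explicit grid gives an exact pair count and why `hash()`-derived seeds differ between processes. `GridCfg`, `build_specs`, and `stable_seed` are hypothetical names for illustration, not the repo's API. The commit itself dropped per-spec reseeding entirely. `stable_seed` only shows a process-independent alternative, should per-spec seeds ever be wanted again.

```python
import hashlib
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GridCfg:
    # Hypothetical stand-in for DataCfg; field names and defaults follow the commit message.
    n_topics: int = 20
    n_personas: int = 5
    n_samples: int = 10

    @property
    def n_pairs(self) -> int:
        # Exact product of the grid, so no round() is needed.
        return self.n_topics * self.n_personas * self.n_samples


def build_specs(cfg: GridCfg, seed: int = 0) -> list[tuple[int, int, int]]:
    """Enumerate the full (topic, persona, sample) grid, then shuffle with one seeded rng."""
    rng = random.Random(seed)
    specs = list(product(range(cfg.n_topics), range(cfg.n_personas), range(cfg.n_samples)))
    rng.shuffle(specs)
    return specs


def stable_seed(key: str) -> int:
    """Derive a 32-bit seed from a string without using the built-in hash().

    hash(str) is salted per process unless PYTHONHASHSEED is fixed,
    so seeds derived from it change between runs. A SHA-256 digest does not.
    """
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


paper = GridCfg()          # 20 * 5 * 10 = 1000 pairs
smoke = GridCfg(2, 1, 2)   # 2 * 1 * 2 = 4 pairs
```

With this shape, a dataset-reuse check such as the one described for `replicate._maybe_data` can compare `len(ds)` against `cfg.n_pairs` directly.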
wassname 7e1b171875 paper data recipe + LoRA hyperparams + n_pairs hardening
- data: 5 pos + 5 neg personas, 20 train + 12 eval topic split
  (paper §3 / Appendix C), n_samples solved from n_pairs.
  judge filter stub (off by default; paper uses GPT-4.1-mini).
- eval/sycophancy: read true held-out eval_topics() instead of
  SYCOPHANCY_TOPICS[-16:].
- replicate: fix epochs threading; n_pairs reuse fails fast on mismatch;
  smoke knobs (n_topics, n_personas) plumbed.
- train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 /
  wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss
  logging.
- run_demo: train_topics() for in_dist demo claims.
- README: scope block reflects paper-matching recipe.
2026-04-26 10:19:59 +08:00
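Note on 7e1b171875: a small sketch of why an explicit alpha matters. It assumes the standard LoRA scaling of alpha / rank (plain LoRA without rank-stabilised scaling). `lora_scale` is a hypothetical helper, not a function from the repo or from PEFT, and the other adapter types (DoRA, PiSSA, DeLoRA) may scale their updates differently.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update B @ A in plain LoRA."""
    return alpha / rank


# Hyperparameters named in the commit: rank 32, alpha 16.
paper_scale = lora_scale(rank=32, alpha=16)

# What a 2*r fallback would have produced for the same rank.
fallback_scale = lora_scale(rank=32, alpha=2 * 32)

print(paper_scale, fallback_scale)
```

Under this assumption the fallback would scale the adapter update four times more strongly than the explicit setting, which changes the effective step size at a fixed learning rate.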
wassname 363e2db14d phase 0-2: HF+PEFT pipeline, smoke, subspace alignment
Rip out Axolotl/vLLM, switch to HF+PEFT functional pipeline.
Add LoRA/DoRA/PiSSA/DeLoRA train, delta-W diff, weight_steer hook,
sycophancy logratio eval, and SVD top-k + weak-readout alignment.
Smoke runs end-to-end on tiny-random qwen3 with BEARTYPE=1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 20:14:07 +08:00