wassname
a0f4e719af
Add batched data gen and bidir calibration
2026-05-01 18:58:08 +08:00
wassname
b2ef8fef7b
wip
2026-04-30 21:06:18 +08:00
wassname
a48430b075
switch training/eval axis from sycophancy to honesty
- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
short-form), _load_suffixes() reading data/branching_suffixes.json,
behavior branches in _personas/_topics/_build_specs for paper-recipe
question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
mode captures last-token hidden states under pos/neg personas with
assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:00:03 +08:00
wassname
7be1487d7b
data recipe: drop n_pairs/judge/Optional knobs, explicit grid
Subagent review fixes:
- DataCfg / Cfg expose the grid directly (n_topics, n_personas, n_samples)
as required ints with paper defaults (20/5/10). Drops `n_pairs` and the
silent round() that made the count fuzzy. Drops `Optional[int]` smoke
overrides — smoke just sets 2/1/2 = 4 pairs.
- Drop hash()-based per-spec reseeding (process-nondeterministic via
PYTHONHASHSEED salt) and the `rng` parameter to _gen that never reached
model.generate. One torch.manual_seed at start; spec order seeded by rng.
- Delete _judge_filter stub + cfg.judge flag (dead code, paper §3 GPT-4.1-mini
filter not implemented yet — TODO comment instead).
- replicate._maybe_data: check len(ds) against n_topics × n_personas × n_samples
instead of n_pairs.
- justfile: drop --n-pairs 1000.
2026-04-26 10:24:31 +08:00
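A minimal sketch of the explicit grid and single-seed ordering this commit describes, using only the standard library. `GridCfg`, `build_specs`, and `check_reusable` are hypothetical names for illustration, not the repository's actual code; only the field names `n_topics`, `n_personas`, `n_samples` and the 20/5/10 and 2/1/2 values come from the commit message. It assumes the dataset size is the exact product of the three grid axes, and that spec order is drawn from one seeded generator rather than from `hash()`, whose value for strings is salted per process unless `PYTHONHASHSEED` is fixed.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GridCfg:
    """Hypothetical stand-in for the grid fields named in the commit."""
    n_topics: int = 20
    n_personas: int = 5
    n_samples: int = 10

    @property
    def expected_rows(self) -> int:
        # Exact product of the grid axes; no round(), so the count is never fuzzy.
        return self.n_topics * self.n_personas * self.n_samples


def build_specs(cfg: GridCfg, seed: int = 0) -> list[tuple[int, int, int]]:
    """Enumerate (topic, persona, sample) index triples in a seeded, reproducible order."""
    rng = random.Random(seed)  # one explicit seed; no hash()-derived reseeding
    specs = list(itertools.product(
        range(cfg.n_topics), range(cfg.n_personas), range(cfg.n_samples)
    ))
    rng.shuffle(specs)
    return specs


def check_reusable(n_rows: int, cfg: GridCfg) -> None:
    """Fail fast if a cached dataset does not match the configured grid."""
    if n_rows != cfg.expected_rows:
        raise ValueError(
            f"cached dataset has {n_rows} rows, expected {cfg.expected_rows}"
        )


paper = GridCfg()
smoke = GridCfg(n_topics=2, n_personas=1, n_samples=2)
print(paper.expected_rows, smoke.expected_rows)  # 1000 4
```

Because the order depends only on `seed`, two processes that call `build_specs` with the same config produce the same list, which is the property the dropped `hash()`-based reseeding could not guarantee.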
wassname
7e1b171875
paper data recipe + LoRA hyperparams + n_pairs hardening
- data: 5 pos + 5 neg personas, 20 train + 12 eval topic split
(paper §3 / Appendix C), n_samples solved from n_pairs.
judge filter stub (off by default; paper uses GPT-4.1-mini).
- eval/sycophancy: read true held-out eval_topics() instead of
SYCOPHANCY_TOPICS[-16:].
- replicate: fix epochs threading; n_pairs reuse fails fast on mismatch;
smoke knobs (n_topics, n_personas) plumbed.
- train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 /
wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss
logging.
- run_demo: train_topics() for in_dist demo claims.
- README: scope block reflects paper-matching recipe.
2026-04-26 10:19:59 +08:00
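To illustrate why the "explicit alpha (no 2*r fallback)" change matters, here is a small worked example. It assumes the plain LoRA convention where the low-rank update is scaled by alpha / rank; other adapter variants may scale differently. `LoraHyper` is a hypothetical container, not the repository's config class; the rank, alpha, lr, and weight-decay values are the ones listed in the commit. The warmup setting is omitted because the commit does not state its unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyper:
    """Hypothetical container; values below are taken from the commit message."""
    rank: int
    alpha: float  # required on purpose: no default and no 2 * rank fallback
    lr: float = 1e-5
    weight_decay: float = 0.01

    @property
    def scaling(self) -> float:
        # Plain LoRA multiplies the low-rank update by alpha / rank.
        return self.alpha / self.rank


paper = LoraHyper(rank=32, alpha=16)
fallback = LoraHyper(rank=32, alpha=2 * 32)  # what a silent 2*r default would give
print(paper.scaling, fallback.scaling)  # 0.5 2.0
```

Under this assumption the fallback would apply a four times larger update scale than the listed hyperparameters, so making `alpha` a required argument avoids silently training with a different effective step.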
wassname
363e2db14d
phase 0-2: HF+PEFT pipeline, smoke, subspace alignment
Rip out Axolotl/vLLM, switch to HF+PEFT functional pipeline.
Add LoRA/DoRA/PiSSA/DeLoRA train, delta-W diff, weight_steer hook,
sycophancy logratio eval, and SVD top-k + weak-readout alignment.
Smoke runs end-to-end on tiny-random qwen3 with BEARTYPE=1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 20:14:07 +08:00