weight-steering

wassname/weight-steering

mirror of https://github.com/wassname/weight-steering.git synced 2026-06-27 17:18:22 +08:00

Commit Graph

Author

SHA1

Message

Date

wassname

a48430b075

switch training/eval axis from sycophancy to honesty

- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-28 06:00:03 +08:00
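
The activation_baseline.py item above describes fitting per-layer directions from last-token hidden states captured under positive and negative personas. A minimal sketch of that idea follows, assuming the hidden states have already been extracted into plain lists of floats keyed by layer index. It uses a normalized difference of means, a simpler stand-in for a RepE-style fit, and is not the repository's `_fit_repe_directions`. All names here (`mean_vector`, `fit_direction`, `fit_all_layers`, `project`) and the toy numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty batch of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one hidden-state vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def fit_direction(pos: Sequence[Sequence[float]],
                  neg: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the negative-persona mean to the positive one."""
    diff = [p - n for p, n in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def fit_all_layers(pos_by_layer: Dict[int, Sequence[Sequence[float]]],
                   neg_by_layer: Dict[int, Sequence[Sequence[float]]]) -> Dict[int, Vector]:
    """Fit one direction per layer (the 'all-layers' variant)."""
    return {layer: fit_direction(pos_by_layer[layer], neg_by_layer[layer])
            for layer in pos_by_layer}


def project(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar score of a hidden state along a fitted direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy stand-ins for last-token hidden states (hypothetical values, 2-dim, one layer).
pos_states = {0: [[1.0, 0.0], [3.0, 0.0]]}    # captured under "honest" personas
neg_states = {0: [[-1.0, 0.0], [-3.0, 0.0]]}  # captured under "dishonest" personas

directions = fit_all_layers(pos_states, neg_states)
score = project([2.0, 5.0], directions[0])
```

In a real run the toy lists would be replaced by hidden states read from the model at the final token of each persona-plus-prefix prompt. How the repository extracts and aggregates them is not specified in the commit message.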

1 Commit