mirror of
https://github.com/wassname/weight-steering.git
synced 2026-06-27 18:27:18 +08:00
cleanup: drop stale HANDOVER/RESEARCH_LOG, fix axis line in fork_plan
HANDOVER.md and RESEARCH_LOG.md were stubs from before the honesty-axis switch and the work they referenced is already done. fork_plan.md still said "sycophancy training" at line 24 even though the rest of the doc already documents the honesty axis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
-57
@@ -1,57 +0,0 @@
# Handover notes — 2026-04-27 ~21:30

## What just happened

Switched the entire weight-steering pipeline from sycophancy to honesty axis. Rationale in `RESEARCH_JOURNAL.md` (2026-04-27 entry) and `fork_plan.md` ("Resolved: train/eval axis switch").

Key change: old SYCOPHANCY_POS/NEG was 2-axis (sycophancy-vs-honesty mixed). New HONESTY_POS/NEG is 1-axis, axis-matched with eval (`daily_dilemmas-self-honesty` / `honesty_label`). Sycophancy becomes OOD transfer eval for later.

## Pueue queue state

- **230** — Running: `ws.run_sweep --behavior honesty`. Training 7 adapters (lora/dora/pissa/delora/oft/boft/ia3) on 1000 pairs. lora + dora DONE (~21:22, 21:26). pissa/delora/oft/boft/ia3 pending. ~5 adapters remaining at ~10min each = ~50min left.
- **231** — Queued after 230: T1 RepE activation baseline honesty
- **232** — Queued after 230: T3 prompt baseline honesty
- **233** — Queued after 230: T2 full DD benchmark honesty
- **234** — Queued after 230: T6 cross-adapter causal ablation honesty
- **235** — Queued after 230: T7 layer/module ablation honesty
- **236** — Queued after 230: T8 parameterization ablation honesty

230-236 are all queued and will run unattended. Check tomorrow with `pueue status`.

## Key files changed this session

- `src/ws/data.py` — honesty personas, `_load_suffixes`, behavior branches in `_topics`/`_build_specs`
- `src/ws/eval/activation_baseline.py` — honesty branch in `_fit_repe_directions` with suffix-based prompts
- `src/ws/eval/prompt_baseline.py` — dual `engineered_prompt_honest` + `engineered_prompt_dishonest`
- `evals/smoke.py` — `behavior` field added to SmokeCfg
- `data/branching_suffixes.json` — new file, 550 SSteer entries
- `fork_plan.md` — open-question section replaced with resolved decision
- `RESEARCH_JOURNAL.md` — 2026-04-27 axis-switch entry appended

## What still needs doing (after 231-236 finish)

1. **Task 28: Update README** — replace "first 100 dilemmas" and sycophancy table with honesty numbers from `out/honesty/{cross_adapter_full_dd,activation_baseline,prompt_baseline}/summary.csv`. Wait for all evals.
2. **Commit** — nothing committed yet. Files on `dev` branch, uncommitted. Commit message: "switch training/eval axis from sycophancy to honesty; add branching_suffixes.json".
3. **Task 23 close** — mark in_progress task 23 completed once 230 finishes cleanly (check `out/honesty/*/w.pt` all exist).
4. **T4 multiseed / T5 Gemma** — not started, re-scope to honesty axis when ready.

## Stale outputs to ignore

`out/honesty/{activation_baseline,cross_adapter_*,layer_module_ablation,parameterization_ablation,prompt_baseline}/` dirs exist from an aborted earlier run (timestamps 20:13-20:14, all empty). Tasks 231-236 will overwrite them.

`out/sycophancy/` — keep as historical record of old axis-mismatched results.

## Verification checklist (run tomorrow after 236 finishes)

```sh
# All adapters trained
ls out/honesty/{lora,dora,pissa,delora,oft,boft,ia3}/w.pt

# Eval summaries exist and have data
head -5 out/honesty/cross_adapter_full_dd/dilemmas_summary.csv
head -5 out/honesty/activation_baseline/summary.csv
head -5 out/honesty/prompt_baseline/summary.csv

# Sanity: idx_symmetric_diff=0 in prompt baseline
grep "idx_symmetric_diff" out/honesty/prompt_baseline/summary.csv
```
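The first two checklist items can also be run without eyeballing `ls` and `head` output. The following is a minimal sketch, not part of the repo: the adapter names and summary paths are copied from the checklist, while the function names (`missing_adapters`, `empty_summaries`) are hypothetical, and it assumes each summary CSV has a header row followed by at least one data row.

```python
import csv
from pathlib import Path

# Adapter names and summary paths are copied from the checklist above.
# The function names are hypothetical helpers, not part of the repo.
ADAPTERS = ["lora", "dora", "pissa", "delora", "oft", "boft", "ia3"]
SUMMARIES = [
    "cross_adapter_full_dd/dilemmas_summary.csv",
    "activation_baseline/summary.csv",
    "prompt_baseline/summary.csv",
]


def missing_adapters(root: Path) -> list[str]:
    """Return the adapters whose w.pt file is absent under root."""
    return [a for a in ADAPTERS if not (root / a / "w.pt").is_file()]


def empty_summaries(root: Path) -> list[str]:
    """Return summary CSVs that are missing or contain no data rows."""
    bad = []
    for rel in SUMMARIES:
        path = root / rel
        if not path.is_file():
            bad.append(rel)
            continue
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        # Assumption: the first row is a header, so fewer than 2 rows means no data.
        if len(rows) < 2:
            bad.append(rel)
    return bad


if __name__ == "__main__":
    root = Path("out/honesty")
    print("missing adapters:", missing_adapters(root))
    print("empty summaries:", empty_summaries(root))
```

Both functions return empty lists when everything is in place, so a non-empty result points at the task that needs re-running. The `idx_symmetric_diff` sanity check is left to the `grep` above, since the column layout of that CSV is not given here.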
@@ -1,9 +0,0 @@
#

fork_plan.md

# made notebook

Where does the LoRA's Δa live?

In a task-derived hidden subspace that is not the lm_head readout. Held-out energy ratios (mean over LoRA-active layers 8–27, rank-8 each):
+1
-1
@@ -21,7 +21,7 @@ This repo removes Axolotl/vLLM/API orchestration and rebuilds the method in HF +
Current main model: `Qwen/Qwen3-0.6B`.

-Current behavior: sycophancy training, evaluated on sycophancy Yes/No and `wassname/daily_dilemmas-self-honesty`.
+Current behavior: honesty training (positive = honest persona, negative = dishonest persona), evaluated on `wassname/daily_dilemmas-self-honesty` (OOD) and held-out sycophancy Yes/No claims (IID).

## Links