mirror of
https://github.com/wassname/weight-steering.git
synced 2026-06-27 20:24:11 +08:00
13 lines
540 B
Python
"""Generate +/- pair data for a target behavior.

Recipe (paper Appendix E):
1. Pick a narrow QA distribution for the behavior (e.g. sycophancy questions).
2. For each prompt, generate a response under the positive system prompt
   (e.g. "be sycophantic") and the negative ("be honest").
3. Strip the system prompt at train time so the adapter learns the
   behavior unconditionally on the narrow distribution.

Output: a HF dataset on disk with columns:
    {prompt, response_pos, response_neg, sys_prompt_pos, sys_prompt_neg}
"""
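The recipe in the docstring could be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `build_pairs`, `to_train_examples`, `fake_generate`, and the two system prompt strings are hypothetical names introduced here, and the model call is replaced by a stub so the flow runs without any dependencies.

```python
from typing import Callable, Dict, List

# Hypothetical system prompts; the docstring only gives these as examples.
SYS_PROMPT_POS = "Be sycophantic."
SYS_PROMPT_NEG = "Be honest."


def build_pairs(
    prompts: List[str],
    generate: Callable[[str, str], str],
    sys_prompt_pos: str = SYS_PROMPT_POS,
    sys_prompt_neg: str = SYS_PROMPT_NEG,
) -> List[Dict[str, str]]:
    """Steps 1-2: one row per prompt, with a response under each system prompt."""
    rows = []
    for prompt in prompts:
        rows.append(
            {
                "prompt": prompt,
                "response_pos": generate(sys_prompt_pos, prompt),
                "response_neg": generate(sys_prompt_neg, prompt),
                "sys_prompt_pos": sys_prompt_pos,
                "sys_prompt_neg": sys_prompt_neg,
            }
        )
    return rows


def to_train_examples(rows: List[Dict[str, str]], polarity: str) -> List[Dict[str, str]]:
    """Step 3: drop the system prompt and keep only (prompt, response)."""
    key = f"response_{polarity}"  # polarity is "pos" or "neg"
    return [{"prompt": row["prompt"], "response": row[key]} for row in rows]


def fake_generate(system_prompt: str, prompt: str) -> str:
    """Stand-in for a real model call (assumption: any chat model would go here)."""
    if "sycophantic" in system_prompt.lower():
        return "What a brilliant piece of writing!"
    return "Honestly, the argument needs more evidence."


rows = build_pairs(["Is my essay good?"], fake_generate)
train_pos = to_train_examples(rows, "pos")
train_neg = to_train_examples(rows, "neg")
```

The rows keep both system prompts as columns so the pair can be audited later, while the training examples omit them. To write the rows to disk as the docstring describes, one option (assuming the Hugging Face `datasets` library is installed) would be `datasets.Dataset.from_list(rows).save_to_disk(path)`.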