mirror of
https://github.com/wassname/weight-steering.git
synced 2026-06-27 18:08:19 +08:00
# Weight Steering

> **Fork notice (wassname, 2026-04):** this is a working fork that strips the
> upstream Axolotl + vLLM + Anthropic-batch-API stack and rebuilds the core
> method on HF + PEFT + uv, targeting Qwen3-0.6B for cheap iteration. Goals:
> (1) replicate `w = θ⁺ − θ⁻` on a small model, (2) test alignment of `w` with
> SVD subspaces of the pretrained `W` and the AntiPaSTO subspaces, (3) compare
> adapter families (LoRA / DoRA / PiSSA-init / DeLoRA) under the
> "adapter as hypothesis" framing, (4) eval on daily-dilemmas.
>
> Pipeline (see `justfile`):
> ```
> just smoke           # full pipeline on tiny-random qwen3 + BEARTYPE=1, ~1 min
> just replicate       # data → train pos → train neg → diff → eval → subspace
> just subspace-align  # phase 2: SVD top-k + weak-readout alignment table
> just adapter-sweep   # phase 3: LoRA / DoRA / PiSSA / DeLoRA sweep (TODO)
> just eval-dilemmas   # phase 4: daily-dilemmas Yes/No logratio (TODO)
> ```
> Source layout: `src/ws/{data,train,diff,steer,subspace,replicate,run_subspace,run_sweep}.py`,
> `src/ws/eval/{sycophancy,dilemmas}.py`. Outputs to `out/<behavior>/<adapter>/`.
>
> Original README from upstream below.

---

Code and data for the paper [Steering Language Models with Weight Arithmetic](https://arxiv.org/abs/2511.05408).

# Obtaining steering vectors

##### 1. Get completions: Generate answers to a dataset, e.g.:

```bash
python inference_and_eval.py \
    --model_repo meta-llama \
    --models Llama-2-7b-chat-hf \
    --dataset alignment_faking_harm_answers_chat:train_375exs \
    --skip_judge_eval --generation_max_tokens 3000
```

##### 2. Create an [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) configuration file that uses the data generated in (1).

##### 3. Train model
```bash
python inference_and_eval.py \
    --train --run_merge --delete_existing_repo \
    --axolotl_config <axolotl_config_yaml_file> \
    --model_dir <model_output_dir> \
    --model_repo <model_repo> \
    --models <model_name> \
    --skip_model_inference --skip_judge_eval
```

##### 4. Get weight-steered model
```bash
python task_vectors.py \
    --pretrained_model "meta-llama/Llama-2-7b-chat-hf" \
    --finetuned_model1 "coastalcph/Llama-2-7b-chat-gsm8k_bs8_2e-4" \
    --finetuned_model2 "coastalcph/Llama-2-7b-harmful-af-refuse" \
    --finetuned_model3 "coastalcph/Llama-2-7b-chat-harmful-af-answer" \
    --scale_t1 $scale_t1 --scale_t2 $scale_t2 --scale_t3 $scale_t2 \
    --output_dir <output_dir> \
    --output_model_name <hf_repo_and_model_name>
```
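
As a rough illustration of the arithmetic behind step 4, the sketch below assumes the usual task-arithmetic form: each task vector is a fine-tuned model's weights minus the pretrained weights, and the steered weights are the pretrained weights plus the scaled sum of task vectors (scales may be negative). This is a minimal sketch, not the code in `task_vectors.py`. The names `task_vector` and `apply_task_vectors` are hypothetical, and plain floats stand in for tensors; the same per-parameter arithmetic applies to anything that supports `+`, `-` and `*`, such as `torch` tensors.

```python
from typing import Dict, Sequence

# Parameter name -> value. Floats stand in for tensors in this toy sketch.
Weights = Dict[str, float]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Per-parameter difference: finetuned minus pretrained."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def apply_task_vectors(
    pretrained: Weights,
    finetuned_models: Sequence[Weights],
    scales: Sequence[float],
) -> Weights:
    """Return pretrained + sum_i scales[i] * (finetuned_models[i] - pretrained)."""
    if len(finetuned_models) != len(scales):
        raise ValueError("need exactly one scale per fine-tuned model")
    steered = dict(pretrained)  # copy so the pretrained weights stay untouched
    for finetuned, scale in zip(finetuned_models, scales):
        delta = task_vector(pretrained, finetuned)
        for name in steered:
            steered[name] = steered[name] + scale * delta[name]
    return steered


# Hypothetical toy weights: one scalar per parameter name.
pretrained = {"layer.weight": 1.0, "layer.bias": 0.0}
finetuned_a = {"layer.weight": 1.5, "layer.bias": 0.25}
finetuned_b = {"layer.weight": 0.5, "layer.bias": 0.25}

# Add the first task vector with scale 2.0 and subtract the second (scale -1.0).
steered = apply_task_vectors(pretrained, [finetuned_a, finetuned_b], [2.0, -1.0])
print(steered)
```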
To obtain the steering vector for **activation steering**, we use the code from ["Persona Vectors: Monitoring and Controlling Character Traits in Language Models"](https://github.com/safety-research/persona_vectors).

# Evaluations

### Run inference and evaluation on a model
```bash
python inference_and_eval.py \
    --model_repo <hf_repo> --models <model_name> \
    --dataset sycophancy_eval_answer:test \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH
```

### Run inference and evaluation with activation steering

```bash
python inference_and_eval.py \
    --model_repo Qwen --models Qwen2.5-7B-Instruct \
    --dataset sycophancy_eval_answer:test \
    --use_steering_inference \
    --steer_coeff ${coeff} \
    --steering_vector_type sycophancy --steering_bs 60 \
    --use_steering_layer 12 \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH
```
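
For intuition about `--steer_coeff` and `--use_steering_layer`, the sketch below shows the usual form of activation steering: the output of one chosen layer is shifted by `coeff * vector` before being passed to the next layer. This is a minimal sketch under stated assumptions, not the logic of `steering_inference.py`. The layers are toy Python functions over lists of floats, and `run_with_steering` and its parameters are hypothetical names. In a real model the same addition would typically be applied to the hidden states of a transformer block, for example through a forward hook.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_with_steering(
    layers: Sequence[Layer],
    hidden: Vector,
    steering_vector: Vector,
    coeff: float,
    steer_layer: int,
) -> Vector:
    """Run layers in order, adding coeff * steering_vector to the output of layer `steer_layer`."""
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index == steer_layer:
            if len(hidden) != len(steering_vector):
                raise ValueError("steering vector must match the hidden size")
            hidden = [h + coeff * v for h, v in zip(hidden, steering_vector)]
    return hidden


# Two toy "layers" standing in for transformer blocks.
def double(x: Vector) -> Vector:
    return [2.0 * value for value in x]


def add_one(x: Vector) -> Vector:
    return [value + 1.0 for value in x]


layers = [double, add_one]
direction = [1.0, -1.0]  # hypothetical steering vector

# coeff=0.0 leaves the forward pass unchanged; coeff=0.5 shifts layer 0's output.
baseline = run_with_steering(layers, [1.0, 2.0], direction, coeff=0.0, steer_layer=0)
steered = run_with_steering(layers, [1.0, 2.0], direction, coeff=0.5, steer_layer=0)
print(baseline, steered)
```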
Activation-steering inference uses `steering_inference.py` and `activation_steering.py`, both adapted with minor changes from [safety-research/persona_vectors](https://github.com/safety-research/persona_vectors).

## Data

* Sycophancy in TruthfulQA and TriviaQA: [cfierro/sycophancy_eval_answer](https://huggingface.co/datasets/cfierro/sycophancy_eval_answer). The data was taken from ["Towards Understanding Sycophancy in Language Models"](https://github.com/meg-tong/sycophancy-eval).

* GCD-Sycophancy: [cfierro/gcd](https://huggingface.co/datasets/cfierro/gcd). Note that the incorrect split needs to be filtered so that the answers from the correct and incorrect reasoning differ (around 400 examples are filtered out).

* Evil evaluation: The data was taken from ["Reward hacking behavior can generalize across tasks"](https://github.com/keing1/reward-hack-generalization/tree/main).

* Refusal evaluation:
  * Safety evaluation: [GSMDanger](https://huggingface.co/datasets/vfleaking/GSM-Danger) and [DirectHarm4](https://huggingface.co/datasets/vfleaking/DirectHarm4) were taken from ["Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"](https://github.com/vfleaking/PTST).

* GSM8K: We use the main configuration and test split from [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k).

* Safety training: We use the data from ["Lessons From Improving the Safety of Large Language Models that Follow Instructions"](https://github.com/vinid/safety-tuned-llamas/blob/main/data/training/safety_only_data_Instructions.json).

# Cite
```bibtex
@article{FierroRoger2025,
  author  = {Constanza Fierro and Fabien Roger},
  title   = {Steering Language Models with Weight Arithmetic},
  journal = {arXiv preprint arXiv:2511.05408},
  year    = {2025},
  url     = {https://arxiv.org/abs/2511.05408},
  doi     = {10.48550/arXiv.2511.05408}
}
```