Weight Steering

Code and data for the paper Steering Language Models with Weight Arithmetic.

Obtaining steering vectors

1. Get completions: generate answers for a dataset, e.g.:
python inference_and_eval.py \
    --model_repo meta-llama \
    --models Llama-2-7b-chat-hf \
    --dataset alignment_faking_harm_answers_chat:train_375exs \
    --skip_judge_eval --generation_max_tokens 3000
2. Create an Axolotl configuration file that uses the data generated in (1).
3. Train the model:
python inference_and_eval.py \
    --train --run_merge --delete_existing_repo \
    --axolotl_config <axolotl_config_yaml_file> \
    --model_dir <model_output_dir> \
    --model_repo <model_repo> \
    --models <model_name> \
    --skip_model_inference --skip_judge_eval 
4. Build the weight-steered model:
python task_vectors.py \
  --pretrained_model "meta-llama/Llama-2-7b-chat-hf" \
  --finetuned_model1 "coastalcph/Llama-2-7b-chat-gsm8k_bs8_2e-4" \
  --finetuned_model2 "coastalcph/Llama-2-7b-harmful-af-refuse" \
  --finetuned_model3 "coastalcph/Llama-2-7b-chat-harmful-af-answer" \
  --scale_t1 $scale_t1 --scale_t2 $scale_t2 --scale_t3 $scale_t3 \
  --output_dir <output_dir> \
  --output_model_name <hf_repo_and_model_name>
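
The arithmetic behind step 4 can be sketched as follows. This is a minimal illustration, not the code in task_vectors.py: it assumes the usual task-arithmetic formulation (steered weights = pretrained weights plus a scaled sum of fine-tuned-minus-pretrained differences), it represents each model as a mapping from parameter names to flat lists of floats (a toy stand-in for a real state dict), and the names task_vector and apply_task_vectors are hypothetical.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # toy stand-in for a model state dict


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Return finetuned - pretrained, parameter by parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_task_vectors(
    pretrained: Weights,
    vectors: Sequence[Weights],
    scales: Sequence[float],
) -> Weights:
    """Return pretrained + sum_i scales[i] * vectors[i]."""
    if len(vectors) != len(scales):
        raise ValueError("need exactly one scale per task vector")
    # Copy so the pretrained weights are left untouched.
    steered = {name: list(values) for name, values in pretrained.items()}
    for vector, scale in zip(vectors, scales):
        for name, delta in vector.items():
            steered[name] = [w + scale * d for w, d in zip(steered[name], delta)]
    return steered


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}
    refuse = {"layer.weight": [1.5, 2.0]}  # hypothetical fine-tuned weights
    answer = {"layer.weight": [1.0, 3.0]}  # hypothetical fine-tuned weights
    t_refuse = task_vector(base, refuse)
    t_answer = task_vector(base, answer)
    # Opposite-sign scales push toward one behavior and away from the other.
    print(apply_task_vectors(base, [t_refuse, t_answer], [1.0, -1.0]))
```

In this sketch a positive scale moves the weights toward a fine-tuned behavior and a negative scale moves them away from it, which is the role the --scale_t1, --scale_t2 and --scale_t3 values are assumed to play.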

To obtain the steering vector for activation steering, we use the code from "Persona Vectors: Monitoring and Controlling Character Traits in Language Models".
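
For orientation, an activation steering vector of this kind is commonly computed as a difference of mean activations between responses that show a trait and responses that do not. The sketch below assumes that formulation and is not taken from the Persona Vectors code: activations are plain lists of floats, and the names mean_vector and difference_of_means are hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty collection of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(row) != dim for row in rows):
        raise ValueError("all activation vectors must have the same length")
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def difference_of_means(
    with_trait: Sequence[Sequence[float]],
    without_trait: Sequence[Sequence[float]],
) -> List[float]:
    """Return mean(with_trait) - mean(without_trait) as the steering direction."""
    pos = mean_vector(with_trait)
    neg = mean_vector(without_trait)
    if len(pos) != len(neg):
        raise ValueError("both groups must share the same hidden size")
    return [p - n for p, n in zip(pos, neg)]


if __name__ == "__main__":
    # Hypothetical per-response activations at one layer (hidden size 2).
    sycophantic = [[1.0, 2.0], [3.0, 4.0]]
    neutral = [[0.0, 1.0], [2.0, 1.0]]
    print(difference_of_means(sycophantic, neutral))
```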

Evaluations

Run inference and evaluation on a model

python inference_and_eval.py \
    --model_repo <hf_repo> --models <model_name> \
    --dataset sycophancy_eval_answer:test \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH

Run inference and evaluation with activation steering

python inference_and_eval.py \
    --model_repo Qwen --models Qwen2.5-7B-Instruct \
    --dataset sycophancy_eval_answer:test \
    --use_steering_inference \
    --steer_coeff ${coeff} \
    --steering_vector_type sycophancy --steering_bs 60 \
    --use_steering_layer 12 \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH

This uses steering_inference.py and activation_steering.py, which have been adapted with minor changes from github/persona_vectors.
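
As a rough picture of what the --steer_coeff and --use_steering_layer options control, activation steering typically adds a scaled steering vector to the hidden states at one layer during generation. The sketch below assumes that additive form and is not the code in activation_steering.py: hidden states are lists of per-token float vectors, and steer_hidden_states is a hypothetical name.

```python
from typing import List, Sequence


def steer_hidden_states(
    hidden_states: Sequence[Sequence[float]],
    steering_vector: Sequence[float],
    coeff: float,
) -> List[List[float]]:
    """Add coeff * steering_vector to the hidden state of every token position."""
    for token_state in hidden_states:
        if len(token_state) != len(steering_vector):
            raise ValueError("steering vector must match the hidden size")
    return [
        [h + coeff * v for h, v in zip(token_state, steering_vector)]
        for token_state in hidden_states
    ]


if __name__ == "__main__":
    # Hypothetical output of one layer for two token positions (hidden size 2).
    layer_output = [[1.0, 0.0], [0.0, 1.0]]
    direction = [1.0, -1.0]
    print(steer_hidden_states(layer_output, direction, coeff=2.0))
```

With coeff set to 0 the hidden states are returned unchanged, and flipping the sign of coeff steers in the opposite direction.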

Data

Cite

@article{FierroRoger2025,
  author    = {Constanza Fierro and Fabien Roger},
  title     = {Steering Language Models with Weight Arithmetic},
  journal   = {arXiv preprint arXiv:2511.05408},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.05408},
  doi       = {10.48550/arXiv.2511.05408}
}