Fork notice (wassname, 2026-04): this is a working fork that strips the upstream Axolotl + vLLM + Anthropic-batch-API stack and rebuilds the core method on HF + PEFT + uv, targeting Qwen3-0.6B for cheap iteration. Goals: (1) replicate w = θ⁺ − θ⁻ on a small model, (2) test alignment of w with SVD subspaces of the pretrained W and the AntiPaSTO subspaces, (3) compare adapter families (LoRA / DoRA / PiSSA-init / DeLoRA) under the "adapter as hypothesis" framing, (4) eval on daily-dilemmas.

Pipeline (see justfile):
just smoke           # full pipeline on tiny-random qwen3 + BEARTYPE=1, ~1 min
just replicate       # data → train pos → train neg → diff → eval → subspace
just subspace-align  # phase 2: SVD top-k + weak-readout alignment table
just adapter-sweep   # phase 3: LoRA / DoRA / PiSSA / DeLoRA sweep (TODO)
just eval-dilemmas   # phase 4: daily-dilemmas Yes/No logratio (TODO)
Source layout: src/ws/{data,train,diff,steer,subspace,replicate,run_subspace,run_sweep}.py, src/ws/eval/{sycophancy,dilemmas}.py. Outputs to out/<behavior>/<adapter>/.
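
The Yes/No logratio named in the pipeline can be illustrated with a small, self-contained sketch. It assumes the metric is log p(Yes) − log p(No) over next-token logits, reported in nats, with pmass being the probability mass on those two tokens. The actual definition lives in src/ws/eval/dilemmas.py and is not shown here, so the function name `yes_no_logratio` and the toy logits are hypothetical.

```python
import math


def yes_no_logratio(logits: dict[str, float], yes: str = "Yes", no: str = "No") -> tuple[float, float]:
    """Return (logratio in nats, pmass) from next-token logits keyed by token string.

    Assumed definition, not taken from the repo:
      logratio = log p(yes) - log p(no)
      pmass    = p(yes) + p(no)
    """
    # Stable log-sum-exp over the whole (toy) vocabulary.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    logp_yes = logits[yes] - log_z
    logp_no = logits[no] - log_z
    logratio = logp_yes - logp_no
    pmass = math.exp(logp_yes) + math.exp(logp_no)
    return logratio, pmass


# Toy three-token vocabulary; a real run would read logits from the model.
toy_logits = {"Yes": 3.0, "No": 1.0, "Maybe": -2.0}
logratio, pmass = yes_no_logratio(toy_logits)
print(f"logratio={logratio:.3f} nats, pmass={pmass:.3f}")
```

A pmass close to 1.0 means the model is still answering in the expected Yes/No format, which is what makes the logratio meaningful as a steering readout.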

Scope. This is not a strict replication. It now matches the paper's recipe on data (20 train topics × 5 personas × 10 samples = 1000 training pairs, plus 12 eval topics; the judge filter is stubbed and off by default, whereas the paper uses GPT-4.1-mini) and on LoRA hyperparameters (rank 32 / α 16 / lr 1e-5 / warmup 5 / wd 0.01). Deliberate divergences from upstream: no quantized base loading (DoRA/PiSSA/DeLoRA support under quantization is uncertain, and bf16 fits at 0.6B), no modules_to_save for embed_tokens / lm_head, and a layer slice (LoRA only on layers at 30%-80% of depth, following the steering-locus literature) instead of full coverage. The contrastive θ⁺ − θ⁻ core is preserved.
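
A minimal sketch of the contrastive core combined with the layer slice, using NumPy arrays as stand-ins for model weights. It assumes dense weights are compared directly, that parameter names follow a `layers.<i>.` pattern, and that the 30%-80% bounds are floored to integer layer indices. The helper names (`layer_slice`, `contrastive_diff`, `steer`) are hypothetical and are not taken from src/ws.

```python
import re

import numpy as np


def layer_slice(n_layers: int, lo: float = 0.3, hi: float = 0.8) -> range:
    """Layer indices in [lo, hi) of depth. Flooring the bounds is an assumption."""
    return range(int(n_layers * lo), int(n_layers * hi))


def layer_index(name: str):
    """Extract the layer number from a parameter name, or None if there is none."""
    m = re.search(r"layers\.(\d+)\.", name)
    return int(m.group(1)) if m else None


def contrastive_diff(theta_pos: dict, theta_neg: dict, layers: range) -> dict:
    """w = theta_pos - theta_neg, kept only for parameters inside the layer slice."""
    return {
        name: theta_pos[name] - theta_neg[name]
        for name in theta_pos
        if layer_index(name) in layers
    }


def steer(theta_base: dict, w: dict, alpha: float) -> dict:
    """Return theta_base + alpha * w, leaving parameters outside w untouched."""
    out = dict(theta_base)
    for name, dw in w.items():
        out[name] = theta_base[name] + alpha * dw
    return out


# Toy 10-layer "model" with one 4x4 matrix per layer.
rng = np.random.default_rng(0)
names = [f"model.layers.{i}.mlp.down_proj.weight" for i in range(10)]
theta_base = {n: rng.normal(size=(4, 4)) for n in names}
theta_pos = {n: W + 0.1 for n, W in theta_base.items()}  # stand-in for the positive finetune
theta_neg = {n: W - 0.1 for n, W in theta_base.items()}  # stand-in for the negative finetune

w = contrastive_diff(theta_pos, theta_neg, layer_slice(10))
steered = steer(theta_base, w, alpha=2.0)
print(sorted(layer_index(n) for n in w))
```

Negative α moves the weights toward the θ⁻ behavior, which is why a symmetric sweep over α is a natural first check.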

Initial findings on Qwen3-0.6B (task 40 / 44). Steering is monotone in α and coherent across α ∈ [-2, +2] (no token salad, pmass ≈ 1.0). The single-token off-policy effect (~+9.4 nats at α=+2) survives a 32-token greedy CoT rollout: the margin keeps the same direction, and the gap is the teacher-forcing tax we expected. Cheap to engineer at this scale. Falsified for this dW: alignment with W₀'s top SVD subspace (ratio_top ≈ 1.0 ± 0.1 across module kinds, so SVD-of-W is uninformative). Open: the right basis for dW (work in progress in notebooks/analyze_diff.py).
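
One way to read a statistic like ratio_top is sketched below. It assumes ratio_top is the fraction of dW's squared Frobenius norm captured by the top-k left singular vectors of W₀, divided by the chance level k / d_out, so that 1.0 means "no better than a random subspace". That definition, the function name `top_subspace_ratio`, and the toy shapes are assumptions for illustration; the repo's own computation is in src/ws/subspace.py and may differ.

```python
import numpy as np


def top_subspace_ratio(W0: np.ndarray, dW: np.ndarray, k: int) -> float:
    """Energy of dW inside W0's top-k left singular subspace, relative to chance.

    Assumed definition: (||U_k^T dW||_F^2 / ||dW||_F^2) / (k / d_out).
    """
    U, _, _ = np.linalg.svd(W0, full_matrices=False)
    U_k = U[:, :k]  # top-k left singular vectors (orthonormal columns)
    captured = np.linalg.norm(U_k.T @ dW) ** 2 / np.linalg.norm(dW) ** 2
    chance = k / W0.shape[0]  # expected captured fraction for an isotropic dW
    return float(captured / chance)


rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 8
W0 = rng.normal(size=(d_out, d_in))

# Case 1: an update unrelated to W0 should sit near the chance level of 1.0.
dW_random = rng.normal(size=(d_out, d_in))
ratio_random = top_subspace_ratio(W0, dW_random, k)

# Case 2: an update built inside the top-k subspace reaches the maximum d_out / k.
U, _, _ = np.linalg.svd(W0, full_matrices=False)
dW_aligned = U[:, :k] @ rng.normal(size=(k, d_in))
ratio_aligned = top_subspace_ratio(W0, dW_aligned, k)

print(f"random: {ratio_random:.2f}, aligned: {ratio_aligned:.2f}")
```

Under this reading, a value near 1.0 across module kinds says the update has no preference for W₀'s dominant directions, which matches the "uninformative" conclusion above.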

Original README from upstream below.

Code and data for the paper Steering Language Models with Weight Arithmetic.

Obtaining steering vectors

1. Get completions: Generate answers to a dataset, e.g.:

python inference_and_eval.py \
    --model_repo meta-llama \
    --models Llama-2-7b-chat-hf \
    --dataset alignment_faking_harm_answers_chat:train_375exs \
    --skip_judge_eval --generation_max_tokens 3000

2. Create an Axolotl configuration file that uses the data generated in (1).

3. Train model

python inference_and_eval.py \
    --train --run_merge --delete_existing_repo \
    --axolotl_config <axolotl_config_yaml_file> \
    --model_dir <model_output_dir> \
    --model_repo <model_repo> \
    --models <model_name> \
    --skip_model_inference --skip_judge_eval

4. Get weight steered model

python task_vectors.py \
  --pretrained_model "meta-llama/Llama-2-7b-chat-hf" \
  --finetuned_model1 "coastalcph/Llama-2-7b-chat-gsm8k_bs8_2e-4" \
  --finetuned_model2 "coastalcph/Llama-2-7b-harmful-af-refuse" \
  --finetuned_model3 "coastalcph/Llama-2-7b-chat-harmful-af-answer" \
  --scale_t1 $scale_t1 --scale_t2 $scale_t2  --scale_t3 $scale_t2 \
  --output_dir <output_dir> \
  --output_model_name <hf_repo_and_model_name>
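
As a reading aid, the combination in step 4 can be written in the usual task-arithmetic form. This is an assumption inferred from the flag names, not a transcription of task_vectors.py. The symbols are notation introduced here: θ_pre for --pretrained_model, θ_i for --finetuned_model{i}, and s_i for --scale_t{i}.

```latex
\theta_{\text{steered}} = \theta_{\text{pre}} + \sum_{i=1}^{3} s_i \,\bigl(\theta_i - \theta_{\text{pre}}\bigr)
```

Under this reading, a contrastive direction θ⁺ − θ⁻ is the special case of two task vectors with opposite scales s and −s, because the θ_pre terms cancel and leave θ_pre + s (θ⁺ − θ⁻).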

To obtain the steering vector for activation steering we use the code from "Persona Vectors: Monitoring and Controlling Character Traits in Language Models".

Evaluations

Run inference and evaluation on a model

python inference_and_eval.py \
    --model_repo <hf_repo> --models <model_name> \
    --dataset sycophancy_eval_answer:test \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH

Run inference and evaluation with activation steering

python inference_and_eval.py \
    --model_repo Qwen --models Qwen2.5-7B-Instruct \
    --dataset sycophancy_eval_answer:test \
    --use_steering_inference \
    --steer_coeff ${coeff} \
    --steering_vector_type sycophancy --steering_bs 60 \
    --use_steering_layer 12 \
    --eval_function eval_sycophancy_answer \
    --use_claude_judge --api_key ANTHROPIC_API_KEY_BATCH

This uses steering_inference.py and activation_steering.py, which have been adapted with minor changes from github/persona_vectors.

Data

Sycophancy in TruthfulQA and TriviaQA: cfierro/sycophancy_eval_answer. The data was taken from "Towards Understanding Sycophancy in Language Models".
GCD-Sycophancy: cfierro/gcd. Note that the incorrect split needs to be filtered so that the answers from the correct and incorrect reasoning differ (around 400 examples are filtered out).
Evil evaluation: The data was taken from "Reward hacking behavior can generalize across tasks".
Refusal evaluation:
- Safety evaluation: GSMDanger and DirectHarm4 were taken from "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates".
- GSM8K: We use the main configuration and test split from openai/gsm8k.
- Safety training: We use the data from "Lessons From Improving the Safety of Large Language Models that Follow Instructions".

Cite

@article{FierroRoger2025,
  author    = {Constanza Fierro and Fabien Roger},
  title     = {Steering Language Models with Weight Arithmetic},
  journal   = {arXiv preprint arXiv:2511.05408},
  year      = {2025},
  url       = {https://arxiv.org/abs/2511.05408},
  doi       = {10.48550/arXiv.2511.05408}
}
