Simple Preference Optimization (SimPO)
This repository contains the code and released models for our paper SimPO: Simple Preference Optimization with a Reference-Free Reward. We propose a simpler and more effective preference optimization algorithm than DPO (Direct Preference Optimization) without using a reference model. SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings.
🔗 Quick Links
Released Models
Install Requirements
Our codebase is built upon the alignment-handbook repo. The following steps will guide you through the installation process.
First, create a Python virtual environment using e.g. Conda:
conda create -n handbook python=3.10 && conda activate handbook
Next, install PyTorch v2.2.2. Since this is hardware-dependent, we
direct you to the PyTorch Installation Page.
You can then install the remaining package dependencies of alignment-handbook as follows:
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
You will also need Flash Attention 2 installed, which can be done by running:
python -m pip install flash-attn --no-build-isolation
Training Scripts
We provide four training config files for the four training setups reported in our paper:
- Mistral-Base:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
- Mistral-Instruct:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-instruct-simpo.yaml
- Llama3-Base:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-base-simpo.yaml
- Llama3-Instruct:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-instruct-simpo.yaml
Evaluation
We follow the official implementation for evaluation on AlpacaEval 2, Arena-Hard, and MT-Bench, as follows:
-
AlpacaEval 2: Please refer to the AlpacaEval repo for evaluation.
-
Arena-Hard: Please refer to to the Arena-Hard-Auto repo for evaluation.
-
MT-Bench: Please refer to the FastChat repo for evaluation.
Bugs or Questions?
If you have any questions related to the code or the paper, feel free to email Yu (yumeng5@virginia.edu). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue! Please try to specify the problem with details so we can help you better and quicker!
Citation
Please cite our paper if you find the repo helpful in your work:
@article{meng2024simpo,
title={{SimPO}: Simple Preference Optimization with a Reference-Free Reward},
author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
year={2024}
}
