On-Policy Preference Data Generation
We provide the code to generate on-policy preference data (e.g., princeton-nlp/llama3-ultrafeedback-armorm and princeton-nlp/gemma2-ultrafeedback-armorm) used in our experiments.
Requirements
You will need to install the vllm package for decoding. If you are decoding with gemma-2 models, you will also need to install flashinfer.
On-Policy Preference Data Generation Process
- Generate multiple responses using the language model:
python decode.py --data_dir $DATASET_DIR --seed $SEED
This will generate one response per prompt under the specified seed. You need to provide a dataset containing prompts (by default, we use HuggingFaceH4/ultrafeedback_binarized). You can also set decoding hyperparameters by passing in corresponding arguments (by default, we use a temperature of 0.8 for sampling).
Note that you will need to run the above command under multiple different seeds (by default, we use 13, 21, 42, 79, 100) to obtain different responses for each prompt.
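Running the decoding command once per seed can be scripted. The following is a minimal sketch, not part of the repository: the helper names `build_decode_commands` and `run_all` are hypothetical, and it assumes `decode.py` is in the current working directory and accepts only the `--data_dir` and `--seed` flags shown above.

```python
import subprocess
import sys

# Default seeds listed in this README.
DEFAULT_SEEDS = (13, 21, 42, 79, 100)


def build_decode_commands(data_dir, seeds=DEFAULT_SEEDS):
    """Return one decode.py argument list per seed (hypothetical helper)."""
    return [
        [sys.executable, "decode.py", "--data_dir", data_dir, "--seed", str(seed)]
        for seed in seeds
    ]


def run_all(commands):
    """Run each command in order; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_decode_commands(data_dir))` with your own `data_dir` value would launch the five decoding runs sequentially. Running them one after another keeps a single process on the GPU at a time.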
- Post-process the generations
python post_process.py
This will combine the generated responses under each seed and filter out samples with identical responses across all seeds.
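The combine-and-filter logic described above can be illustrated as follows. This is a sketch under stated assumptions rather than the actual contents of `post_process.py`: it assumes each seed produced a list of response strings aligned by prompt index, and the function name and the output field names (`prompt_id`, `all_generated_responses`) are hypothetical.

```python
def combine_and_filter(responses_by_seed):
    """Merge per-seed responses and drop prompts whose responses are all identical.

    responses_by_seed maps a seed to a list of responses, where position i in
    every list belongs to prompt i (an assumption of this sketch).
    """
    seeds = sorted(responses_by_seed)
    # Group the responses for each prompt across all seeds.
    per_prompt = zip(*(responses_by_seed[seed] for seed in seeds))

    kept = []
    for prompt_id, responses in enumerate(per_prompt):
        # Keep the prompt only if at least two seeds disagree.
        if len(set(responses)) > 1:
            kept.append(
                {"prompt_id": prompt_id, "all_generated_responses": list(responses)}
            )
    return kept
```

Prompts whose responses are identical across every seed carry no preference signal, since the winning and losing responses would be the same text.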
- Annotate the preference labels with a reward model
python reward_model_annotate.py --reward_model $MODEL
This will score the generations using a reward model (by default, we use RLHFlow/ArmoRM-Llama3-8B-v0.1) and binarize the dataset by taking the highest-scoring response as the winning response and the lowest-scoring one as the losing response.
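The binarization step can be sketched in a few lines. This illustration assumes the reward scores have already been computed and are passed in as plain floats; the function name `binarize` and the output keys `chosen` and `rejected` are hypothetical and may not match what `reward_model_annotate.py` writes.

```python
def binarize(responses, scores):
    """Pick the highest-scoring response as winner and the lowest as loser."""
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")

    indices = range(len(scores))
    # On ties, max and min both return the first matching index.
    best = max(indices, key=lambda i: scores[i])
    worst = min(indices, key=lambda i: scores[i])
    return {"chosen": responses[best], "rejected": responses[worst]}
```

For example, `binarize(["a", "b", "c"], [0.1, 0.9, 0.5])` pairs `"b"` as the winning response with `"a"` as the losing one, and the middle-scoring response is discarded.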