Decoding requires the vllm package. If you decode with gemma-2 models, you also need to install flashinfer.

On-Policy Preference Data Generation Process

Generate multiple responses using the language model:

python decode.py --data_dir $DATASET_DIR --seed $SEED

This generates one response per prompt under the specified seed. You need to provide a dataset containing prompts (by default, we use HuggingFaceH4/ultrafeedback_binarized). You can also set decoding hyperparameters by passing the corresponding arguments (by default, we sample with a temperature of 0.8).

Note that you need to run the above command once per seed, with several different seeds (by default, we use 13, 21, 42, 79, 100), to obtain multiple distinct responses for each prompt.
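As a rough sketch of the multi-seed runs, the snippet below builds one decode.py invocation per default seed. The helper name build_decode_commands and the "my_dataset_dir" placeholder are illustrative assumptions, not part of this repo. Only the --data_dir and --seed flags come from the command above.

```python
import shlex

# Default seeds listed in the text above.
DEFAULT_SEEDS = [13, 21, 42, 79, 100]


def build_decode_commands(data_dir, seeds=DEFAULT_SEEDS):
    """Return one decode.py argument list per seed (hypothetical helper)."""
    return [
        ["python", "decode.py", "--data_dir", data_dir, "--seed", str(seed)]
        for seed in seeds
    ]


if __name__ == "__main__":
    # "my_dataset_dir" is a placeholder; substitute your own $DATASET_DIR.
    for cmd in build_decode_commands("my_dataset_dir"):
        # Print each command; to launch it instead, pass cmd to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

The sketch prints the commands rather than running them, so you can inspect them first or hand them to a job scheduler.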

Post-process the generations:

python post_process.py

This will combine the generated responses under each seed and filter out samples with identical responses across all seeds.
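To illustrate the combine-and-filter idea, here is a minimal in-memory sketch. It assumes each seed's responses are held as a list aligned by prompt index. The function name and the field names (prompt_index, all_generated_responses) are hypothetical and may not match what post_process.py actually reads or writes.

```python
def combine_and_filter(responses_by_seed):
    """Merge per-seed responses and drop prompts whose responses are all identical.

    responses_by_seed maps a seed to a list of responses, one per prompt,
    with every list in the same prompt order (an assumption of this sketch).
    """
    seeds = sorted(responses_by_seed)
    n_prompts = len(responses_by_seed[seeds[0]])
    combined = []
    for i in range(n_prompts):
        responses = [responses_by_seed[seed][i] for seed in seeds]
        # Identical responses across all seeds carry no preference signal.
        if len(set(responses)) == 1:
            continue
        combined.append({"prompt_index": i, "all_generated_responses": responses})
    return combined
```

With this rule, a prompt is kept as long as at least two seeds disagree; it is dropped only when every seed produced exactly the same text.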

Annotate the preference labels with a reward model:

python reward_model_annotate.py --reward_model $MODEL

This scores the generations using a reward model (by default, we use RLHFlow/ArmoRM-Llama3-8B-v0.1) and binarizes the dataset by taking the highest-scoring response as the winning response and the lowest-scoring response as the losing response.
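The binarization step can be sketched as follows, assuming the reward scores for one prompt are already available as a list of floats. The function name and the chosen/rejected keys are illustrative assumptions; the reward model call itself is omitted.

```python
def binarize(responses, scores):
    """Pick the highest-scoring response as winner and the lowest as loser."""
    if not responses or len(responses) != len(scores):
        raise ValueError("responses and scores must be non-empty and the same length")
    indices = range(len(scores))
    best = max(indices, key=scores.__getitem__)   # first index with the top score
    worst = min(indices, key=scores.__getitem__)  # first index with the bottom score
    return {"chosen": responses[best], "rejected": responses[worst]}
```

If all scores are tied, the same response is returned as both chosen and rejected, so you may want to skip such prompts in practice.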