This commit is contained in:
xiamengzhou
2024-07-07 10:15:46 -04:00
parent a1d07195c2
commit 9bd67dc925
+33 -2
View File
@@ -41,38 +41,69 @@ We have observed that, in some cases, adding an additional SFT loss can help imp
## Released Models
Below is the complete list of models evaluated in our preprint.
| models | | AE2 LC | AE2 WR | AH |
### v0.1
Below is the complete list of models evaluated in our preprint. We used the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset to train the Mistral Base and Llama3 Base models, the [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/princeton-nlp/mistral-instruct-ultrafeedback) dataset to train the Mistral Instruct models, and the [princeton-nlp/llama3-ultrafeedback](https://huggingface.co/princeton-nlp/llama3-ultrafeedback) dataset to train the Llama3 Instruct models. The latter two datasets are annotated by the [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM) model.
models | | AE2 LC | AE2 WR | AH |
|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
| Mistral Base 7B SFT | [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) | 8.4 | 6.2 | 1.3 |
| Mistral Base 7B RRHF | [princeton-nlp/Mistral-7B-Base-SFT-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RRHF) | 11.6 | 10.2 | 6.9 |
| Mistral Base 7B SLiC-HF | [princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF) | 10.9 | 8.9 | 7.3 |
| Mistral Base 7B DPO (Zephyr) | [princeton-nlp/Mistral-7B-Base-SFT-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-DPO) | 15.1 | 12.5 | 10.4 |
| Mistral Base 7B IPO | [princeton-nlp/Mistral-7B-Base-SFT-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-IPO) | 11.8 | 9.4 | 7.5 |
| Mistral Base 7B CPO | [princeton-nlp/Mistral-7B-Base-SFT-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-CPO) | 13.1 | 9.1 | 5.6 |
| Mistral Base 7B KTO | [princeton-nlp/Mistral-7B-Base-SFT-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-KTO) | 13.1 | 9.1 | 5.6 |
| Mistral Base 7B ORPO | [kaist-ai/mistral-orpo-beta](https://huggingface.co/kaist-ai/mistral-orpo-beta) | 14.7 | 12.2 | 7.0 |
| Mistral Base 7B R-DPO | [princeton-nlp/Mistral-7B-Base-SFT-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RDPO) | 17.4 | 12.8 | 9.9 |
| Mistral Base 7B SimPO | [princeton-nlp/Mistral-7B-Base-SFT-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SimPO) | 21.4 | 20.8 | 16.6 |
| Mistral Instruct 7B SFT | [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 17.1 | 14.7 | 12.6 |
| Mistral Instruct 7B RRHF | [princeton-nlp/Mistral-7B-Instruct-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RRHF) | 25.3 | 24.8 | 18.1 |
| Mistral Instruct 7B SLiC-HF | [princeton-nlp/Mistral-7B-Instruct-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SLiC-HF) | 24.1 | 24.6 | 18.9 |
| Mistral Instruct 7B DPO | [princeton-nlp/Mistral-7B-Instruct-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-DPO) | 26.8 | 24.9 | 16.3 |
| Mistral Instruct 7B IPO | [princeton-nlp/Mistral-7B-Instruct-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-IPO) | 20.3 | 20.3 | 16.2 |
| Mistral Instruct 7B CPO | [princeton-nlp/Mistral-7B-Instruct-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-CPO) | 24.5 | 23.6 | 17.9 |
| Mistral Instruct 7B KTO | [princeton-nlp/Mistral-7B-Instruct-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-KTO) | 24.5 | 23.6 | 17.9 |
| Mistral Instruct 7B ORPO | [princeton-nlp/Mistral-7B-Instruct-ORPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-ORPO) | 24.5 | 24.9 | 20.8 |
| Mistral Instruct 7B R-DPO | [princeton-nlp/Mistral-7B-Instruct-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RDPO) | 27.3 | 24.5 | 16.1 |
| Mistral Instruct 7B SimPO | [princeton-nlp/Mistral-7B-Instruct-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SimPO) | 32.1 | 34.8 | 21.0 |
| Llama3 Base 8B SFT | [princeton-nlp/Llama-3-Base-8B-SFT](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT) | 6.2 | 4.6 | 3.3 |
| Llama3 Base 8B RRHF | [princeton-nlp/Llama-3-Base-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-RRHF) | 10.8 | 8.1 | 6.6 |
| Llama3 Base 8B SLiC-HF | [princeton-nlp/Llama-3-Base-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SLiC-HF) | 12.1 | 10.1 | 10.3 |
| Llama3 Base 8B DPO | [princeton-nlp/Llama-3-Base-8B-SFT-DPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-DPO) | 18.2 | 15.5 | 15.9 |
| Llama3 Base 8B IPO | [princeton-nlp/Llama-3-Base-8B-SFT-IPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-IPO) | 14.4 | 14.2 | 17.8 |
| Llama3 Base 8B CPO | [princeton-nlp/Llama-3-Base-8B-SFT-CPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-CPO) | 14.4 | 14.2 | 17.8 |
| Llama3 Base 8B KTO | [princeton-nlp/Llama-3-Base-8B-SFT-KTO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-KTO) | 14.2 | 12.4 | 12.5 |
| Llama3 Base 8B ORPO | [princeton-nlp/Llama-3-Base-8B-SFT-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-ORPO) | 12.2 | 10.6 | 10.8 |
| Llama3 Base 8B R-DPO | [princeton-nlp/Llama-3-Base-8B-SFT-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-RDPO) | 17.6 | 14.4 | 17.2 |
| Llama3 Base 8B SimPO | [princeton-nlp/Llama-3-Base-8B-SFT-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-SimPO) | 22.0 | 20.3 | 23.4 |
| Llama3 Instruct 8B SFT | [meta-llama/Meta-Llama-3-Instruct-8B](https://huggingface.co/meta-llama/Meta-Llama-3-Instruct-8B) | 26.0 | 25.3 | 22.3 |
| Llama3 Instruct 8B RRHF | [princeton-nlp/Llama-3-Instruct-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF) | 31.3 | 28.4 | 26.5 |
| Llama3 Instruct 8B SLiC-HF | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF) | 26.9 | 27.5 | 26.2 |
| Llama3 Instruct 8B DPO | [princeton-nlp/Llama-3-Instruct-8B-DPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO) | 40.3 | 37.9 | 32.6 |
| Llama3 Instruct 8B IPO | [princeton-nlp/Llama-3-Instruct-8B-IPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO) | 35.6 | 35.6 | 30.5 |
| Llama3 Instruct 8B CPO | [princeton-nlp/Llama-3-Instruct-8B-CPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO) | 33.1 | 31.8 | 26.4 |
| Llama3 Instruct 8B KTO | [princeton-nlp/Llama-3-Instruct-8B-KTO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO) | 33.1 | 31.8 | 26.4 |
| Llama3 Instruct 8B ORPO | [princeton-nlp/Llama-3-Instruct-8B-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO) | 28.5 | 27.4 | 25.8 |
| Llama3 Instruct 8B R-DPO | [princeton-nlp/Llama-3-Instruct-8B-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO) | 41.1 | 37.8 | 33.1 |
| Llama3 Instruct 8B SimPO | [princeton-nlp/Llama-3-Instruct-8B-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO) | 44.7 | 40.5 | 33.8 |
### v0.2
We found that using a strong reward model for annotating preference optimization datasets is crucial. In this iteration, we have reannotated the dataset [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) using a more powerful reward model, [RLHFlow/ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). As a result, the v0.2 models demonstrate significantly improved performance compared to the v0.1 models.
| models | | AE2 LC | AE2 WR | AH |
|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
| Llama 3 Instruct 8B RRHF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RRHF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF-v0.2) | 26.0 | 25.3 | 22.3 |
| Llama 3 Instruct 8B SLiC-HF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v0.2) | 37.9 | 31.6 | 28.8 |
| Llama 3 Instruct 8B DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2) | 33.9 | 32.5 | 29.3 |
| Llama 3 Instruct 8B IPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2) | 48.2 | 47.5 | 35.2 |
| Llama 3 Instruct 8B CPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2) | 46.8 | 42.4 | 36.6 |
| Llama 3 Instruct 8B KTO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2) | 34.1 | 36.4 | 30.9 |
| Llama 3 Instruct 8B ORPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2) | 34.1 | 32.1 | 27.3 |
| Llama 3 Instruct 8B R-DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2) | 38.1 | 33.8 | 28.2 |
| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) | 48.0 | 45.8 | 35.1 |
| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) | 53.7 | 47.5 | 36.5 |
### Use our models for inference
Please refer to the [generate.py](generate.py) script for detailed instructions on loading the model with the appropriate chat template.
## Install Requirements