From 9bd67dc9253b6ae08c7c47537bf5b0a63aa7f1ca Mon Sep 17 00:00:00 2001
From: xiamengzhou <296337231@qq.com>
Date: Sun, 7 Jul 2024 10:15:46 -0400
Subject: [PATCH] update

---
 README.md | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cf6dd4b..aec9919 100644
--- a/README.md
+++ b/README.md
@@ -41,38 +41,69 @@ We have observed that, in some cases, adding an additional SFT loss can help imp
 ## Released Models
-Below is the complete list of models evaluated in our preprint.
-| models | | AE2 LC | AE2 WR | AH |
+### v0.1
+Below is the complete list of models evaluated in our preprint. We used the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset to train the Mistral Base and Llama3 Base models, the [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/princeton-nlp/mistral-instruct-ultrafeedback) dataset to train the Mistral Instruct models, and the [princeton-nlp/llama3-ultrafeedback](https://huggingface.co/princeton-nlp/llama3-ultrafeedback) dataset to train the Llama3 Instruct models. The latter two datasets are annotated by the [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM) model.
+
+models | | AE2 LC | AE2 WR | AH |
 |------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
 | Mistral Base 7B SFT | [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) | 8.4 | 6.2 | 1.3 |
+| Mistral Base 7B RRHF | [princeton-nlp/Mistral-7B-Base-SFT-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RRHF) | 11.6 | 10.2 | 6.9 |
+| Mistral Base 7B SLiC-HF | [princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF) | 10.9 | 8.9 | 7.3 |
 | Mistral Base 7B DPO (Zephyr) | [princeton-nlp/Mistral-7B-Base-SFT-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-DPO) | 15.1 | 12.5 | 10.4 |
 | Mistral Base 7B IPO | [princeton-nlp/Mistral-7B-Base-SFT-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-IPO) | 11.8 | 9.4 | 7.5 |
+| Mistral Base 7B CPO | [princeton-nlp/Mistral-7B-Base-SFT-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-CPO) | 13.1 | 9.1 | 5.6 |
 | Mistral Base 7B KTO | [princeton-nlp/Mistral-7B-Base-SFT-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-KTO) | 13.1 | 9.1 | 5.6 |
 | Mistral Base 7B ORPO | [kaist-ai/mistral-orpo-beta](https://huggingface.co/kaist-ai/mistral-orpo-beta) | 14.7 | 12.2 | 7.0 |
 | Mistral Base 7B R-DPO | [princeton-nlp/Mistral-7B-Base-SFT-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RDPO) | 17.4 | 12.8 | 9.9 |
 | Mistral Base 7B SimPO | [princeton-nlp/Mistral-7B-Base-SFT-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SimPO) | 21.4 | 20.8 | 16.6 |
 | Mistral Instruct 7B SFT | [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 17.1 | 14.7 | 12.6 |
+| Mistral Instruct 7B RRHF | [princeton-nlp/Mistral-7B-Instruct-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RRHF) | 25.3 | 24.8 | 18.1 |
+| Mistral Instruct 7B SLiC-HF | [princeton-nlp/Mistral-7B-Instruct-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SLiC-HF) | 24.1 | 24.6 | 18.9 |
 | Mistral Instruct 7B DPO | [princeton-nlp/Mistral-7B-Instruct-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-DPO) | 26.8 | 24.9 | 16.3 |
 | Mistral Instruct 7B IPO | [princeton-nlp/Mistral-7B-Instruct-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-IPO) | 20.3 | 20.3 | 16.2 |
+| Mistral Instruct 7B CPO | [princeton-nlp/Mistral-7B-Instruct-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-CPO) | 24.5 | 23.6 | 17.9 |
 | Mistral Instruct 7B KTO | [princeton-nlp/Mistral-7B-Instruct-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-KTO) | 24.5 | 23.6 | 17.9 |
 | Mistral Instruct 7B ORPO | [princeton-nlp/Mistral-7B-Instruct-ORPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-ORPO) | 24.5 | 24.9 | 20.8 |
 | Mistral Instruct 7B R-DPO | [princeton-nlp/Mistral-7B-Instruct-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RDPO) | 27.3 | 24.5 | 16.1 |
 | Mistral Instruct 7B SimPO | [princeton-nlp/Mistral-7B-Instruct-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SimPO) | 32.1 | 34.8 | 21.0 |
 | Llama3 Base 8B SFT | [princeton-nlp/Llama-3-Base-8B-SFT](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT) | 6.2 | 4.6 | 3.3 |
+| Llama3 Base 8B RRHF | [princeton-nlp/Llama-3-Base-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-RRHF) | 10.8 | 8.1 | 6.6 |
+| Llama3 Base 8B SLiC-HF | [princeton-nlp/Llama-3-Base-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SLiC-HF) | 12.1 | 10.1 | 10.3 |
 | Llama3 Base 8B DPO | [princeton-nlp/Llama-3-Base-8B-SFT-DPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-DPO) | 18.2 | 15.5 | 15.9 |
 | Llama3 Base 8B IPO | [princeton-nlp/Llama-3-Base-8B-SFT-IPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-IPO) | 14.4 | 14.2 | 17.8 |
+| Llama3 Base 8B CPO | [princeton-nlp/Llama-3-Base-8B-SFT-CPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-CPO) | 14.4 | 14.2 | 17.8 |
 | Llama3 Base 8B KTO | [princeton-nlp/Llama-3-Base-8B-SFT-KTO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-KTO) | 14.2 | 12.4 | 12.5 |
 | Llama3 Base 8B ORPO | [princeton-nlp/Llama-3-Base-8B-SFT-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-ORPO) | 12.2 | 10.6 | 10.8 |
 | Llama3 Base 8B R-DPO | [princeton-nlp/Llama-3-Base-8B-SFT-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-RDPO) | 17.6 | 14.4 | 17.2 |
 | Llama3 Base 8B SimPO | [princeton-nlp/Llama-3-Base-8B-SFT-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-SimPO) | 22.0 | 20.3 | 23.4 |
 | Llama3 Instruct 8B SFT | [meta-llama/Meta-Llama-3-Instruct-8B](https://huggingface.co/meta-llama/Meta-Llama-3-Instruct-8B) | 26.0 | 25.3 | 22.3 |
+| Llama3 Instruct 8B RRHF | [princeton-nlp/Llama-3-Instruct-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF) | 31.3 | 28.4 | 26.5 |
+| Llama3 Instruct 8B SLiC-HF | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF) | 26.9 | 27.5 | 26.2 |
 | Llama3 Instruct 8B DPO | [princeton-nlp/Llama-3-Instruct-8B-DPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO) | 40.3 | 37.9 | 32.6 |
 | Llama3 Instruct 8B IPO | [princeton-nlp/Llama-3-Instruct-8B-IPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO) | 35.6 | 35.6 | 30.5 |
+| Llama3 Instruct 8B CPO | [princeton-nlp/Llama-3-Instruct-8B-CPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO) | 33.1 | 31.8 | 26.4 |
 | Llama3 Instruct 8B KTO | [princeton-nlp/Llama-3-Instruct-8B-KTO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO) | 33.1 | 31.8 | 26.4 |
 | Llama3 Instruct 8B ORPO | [princeton-nlp/Llama-3-Instruct-8B-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO) | 28.5 | 27.4 | 25.8 |
 | Llama3 Instruct 8B R-DPO | [princeton-nlp/Llama-3-Instruct-8B-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO) | 41.1 | 37.8 | 33.1 |
 | Llama3 Instruct 8B SimPO | [princeton-nlp/Llama-3-Instruct-8B-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO) | 44.7 | 40.5 | 33.8 |
+### v0.2
+We found that using a strong reward model for annotating preference optimization datasets is crucial. In this iteration, we have reannotated the dataset [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) using a more powerful reward model, [RLHFlow/ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). As a result, the v0.2 models demonstrate significantly improved performance compared to the v0.1 models.
+
+| models | | AE2 LC | AE2 WR | AH |
+|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
+| Llama 3 Instruct 8B RRHF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RRHF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF-v0.2) | 26.0 | 25.3 | 22.3 |
+| Llama 3 Instruct 8B SLiC-HF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v0.2) | 37.9 | 31.6 | 28.8 |
+| Llama 3 Instruct 8B DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2) | 33.9 | 32.5 | 29.3 |
+| Llama 3 Instruct 8B IPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2) | 48.2 | 47.5 | 35.2 |
+| Llama 3 Instruct 8B CPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2) | 46.8 | 42.4 | 36.6 |
+| Llama 3 Instruct 8B KTO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2) | 34.1 | 36.4 | 30.9 |
+| Llama 3 Instruct 8B ORPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2) | 34.1 | 32.1 | 27.3 |
+| Llama 3 Instruct 8B R-DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2) | 38.1 | 33.8 | 28.2 |
+| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) | 48.0 | 45.8 | 35.1 |
+| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) | 53.7 | 47.5 | 36.5 |
+
+### Use our models for inference
 Please refer to the [generate.py](generate.py) script for detailed instructions on loading the model with the appropriate chat template.
 ## Install Requirements