update

2026-06-27 19:13:33 +08:00 · 2024-07-07 10:15:46 -04:00
parent a1d07195c2
commit 9bd67dc925
1 changed files with 33 additions and 2 deletions
@@ -41,38 +41,69 @@ We have observed that, in some cases, adding an additional SFT loss can help imp


 ## Released Models
-Below is the complete list of models evaluated in our preprint. 
-| models                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |
+### v0.1
+Below is the complete list of models evaluated in our preprint. We used the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset to train the Mistral Base and Llama3 Base models, the [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/princeton-nlp/mistral-instruct-ultrafeedback) dataset to train the Mistral Instruct models, and the [princeton-nlp/llama3-ultrafeedback](https://huggingface.co/princeton-nlp/llama3-ultrafeedback) dataset to train the Llama3 Instruct models. The latter two datasets are annotated by the [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM) model.
+
+models                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |
 |------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
 | Mistral Base 7B SFT          | [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)     |   8.4  |   6.2  |  1.3 |
+| Mistral Base 7B RRHF         | [princeton-nlp/Mistral-7B-Base-SFT-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RRHF) |  11.6  |  10.2  |  6.9 |
+| Mistral Base 7B SLiC-HF      | [princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SLiC-HF) |  10.9  |   8.9  |  7.3 |
 | Mistral Base 7B DPO (Zephyr) | [princeton-nlp/Mistral-7B-Base-SFT-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-DPO)     |  15.1  |  12.5  | 10.4 |
 | Mistral Base 7B IPO          | [princeton-nlp/Mistral-7B-Base-SFT-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-IPO)     |  11.8  |   9.4  |  7.5 |
+| Mistral Base 7B CPO          | [princeton-nlp/Mistral-7B-Base-SFT-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-CPO)     |  13.1  |   9.1  |  5.6 |
 | Mistral Base 7B KTO          | [princeton-nlp/Mistral-7B-Base-SFT-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-KTO)     |  13.1  |   9.1  |  5.6 |
 | Mistral Base 7B ORPO         | [kaist-ai/mistral-orpo-beta](https://huggingface.co/kaist-ai/mistral-orpo-beta)                           |  14.7  |  12.2  |  7.0 |
 | Mistral Base 7B R-DPO        | [princeton-nlp/Mistral-7B-Base-SFT-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-RDPO)   |  17.4  |  12.8  |  9.9 |
 | Mistral Base 7B SimPO        | [princeton-nlp/Mistral-7B-Base-SFT-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Base-SFT-SimPO) |  21.4  |  20.8  | 16.6 |
 | Mistral Instruct 7B SFT      | [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)           |  17.1  |  14.7  | 12.6 |
+| Mistral Instruct 7B RRHF     | [princeton-nlp/Mistral-7B-Instruct-RRHF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RRHF) |  25.3  |  24.8  | 18.1 |
+| Mistral Instruct 7B SLiC-HF  | [princeton-nlp/Mistral-7B-Instruct-SLiC-HF](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SLiC-HF) |  24.1  |  24.6  | 18.9 |
 | Mistral Instruct 7B DPO      | [princeton-nlp/Mistral-7B-Instruct-DPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-DPO)     |  26.8  |  24.9  | 16.3 |
 | Mistral Instruct 7B IPO      | [princeton-nlp/Mistral-7B-Instruct-IPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-IPO)     |  20.3  |  20.3  | 16.2 |
+| Mistral Instruct 7B CPO      | [princeton-nlp/Mistral-7B-Instruct-CPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-CPO)     |  24.5  |  23.6  | 17.9 |
 | Mistral Instruct 7B KTO      | [princeton-nlp/Mistral-7B-Instruct-KTO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-KTO)     |  24.5  |  23.6  | 17.9 |
 | Mistral Instruct 7B ORPO     | [princeton-nlp/Mistral-7B-Instruct-ORPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-ORPO)   |  24.5  |  24.9  | 20.8 |
 | Mistral Instruct 7B R-DPO    | [princeton-nlp/Mistral-7B-Instruct-RDPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-RDPO)   |  27.3  |  24.5  | 16.1 |
 | Mistral Instruct 7B SimPO    | [princeton-nlp/Mistral-7B-Instruct-SimPO](https://huggingface.co/princeton-nlp/Mistral-7B-Instruct-SimPO) |  32.1  |  34.8  | 21.0 |
 | Llama3 Base 8B SFT           | [princeton-nlp/Llama-3-Base-8B-SFT](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT)             |   6.2  |   4.6  |  3.3 |
+| Llama3 Base 8B RRHF          | [princeton-nlp/Llama-3-Base-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-RRHF)           |  10.8  |   8.1  |  6.6 |
+| Llama3 Base 8B SLiC-HF       | [princeton-nlp/Llama-3-Base-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SLiC-HF)     |  12.1  |  10.1  | 10.3 |
 | Llama3 Base 8B DPO           | [princeton-nlp/Llama-3-Base-8B-SFT-DPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-DPO)     |  18.2  |  15.5  | 15.9 |
 | Llama3 Base 8B IPO           | [princeton-nlp/Llama-3-Base-8B-SFT-IPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-IPO)     |  14.4  |  14.2  | 17.8 |
+| Llama3 Base 8B CPO           | [princeton-nlp/Llama-3-Base-8B-SFT-CPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-CPO)     |  14.4  |  14.2  | 17.8 |
 | Llama3 Base 8B KTO           | [princeton-nlp/Llama-3-Base-8B-SFT-KTO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-KTO)     |  14.2  |  12.4  | 12.5 |
 | Llama3 Base 8B ORPO          | [princeton-nlp/Llama-3-Base-8B-SFT-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-ORPO)   |  12.2  |  10.6  | 10.8 |
 | Llama3 Base 8B R-DPO         | [princeton-nlp/Llama-3-Base-8B-SFT-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-RDPO)   |  17.6  |  14.4  | 17.2 |
 | Llama3 Base 8B SimPO         | [princeton-nlp/Llama-3-Base-8B-SFT-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT-SimPO) |  22.0  |  20.3  | 23.4 |
 | Llama3 Instruct 8B SFT       | [meta-llama/Meta-Llama-3-Instruct-8B](https://huggingface.co/meta-llama/Meta-Llama-3-Instruct-8B)         |  26.0  |  25.3  | 22.3 |
+| Llama3 Instruct 8B RRHF      | [princeton-nlp/Llama-3-Instruct-8B-RRHF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF) |  31.3  |  28.4  | 26.5 |
+| Llama3 Instruct 8B SLiC-HF   | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF) |  26.9  |  27.5  | 26.2 |
 | Llama3 Instruct 8B DPO       | [princeton-nlp/Llama-3-Instruct-8B-DPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO)     |  40.3  |  37.9  | 32.6 |
 | Llama3 Instruct 8B IPO       | [princeton-nlp/Llama-3-Instruct-8B-IPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO)     |  35.6  |  35.6  | 30.5 |
+| Llama3 Instruct 8B CPO       | [princeton-nlp/Llama-3-Instruct-8B-CPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO)     |  33.1  |  31.8  | 26.4 |
 | Llama3 Instruct 8B KTO       | [princeton-nlp/Llama-3-Instruct-8B-KTO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO)     |  33.1  |  31.8  | 26.4 |
 | Llama3 Instruct 8B ORPO      | [princeton-nlp/Llama-3-Instruct-8B-ORPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO)   |  28.5  |  27.4  | 25.8 |
 | Llama3 Instruct 8B R-DPO     | [princeton-nlp/Llama-3-Instruct-8B-RDPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO)   |  41.1  |  37.8  | 33.1 |
 | Llama3 Instruct 8B SimPO     | [princeton-nlp/Llama-3-Instruct-8B-SimPO](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO) |  44.7  |  40.5  | 33.8 |

+### v0.2
+We found that using a strong reward model for annotating preference optimization datasets is crucial. In this iteration, we have reannotated the dataset [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) using a more powerful reward model, [RLHFlow/ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1). As a result, the v0.2 models demonstrate significantly improved performance compared to the v0.1 models.
+
+| models                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |
+|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|
+| Llama 3 Instruct 8B RRHF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RRHF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RRHF-v0.2) |  26.0  |  25.3  | 22.3 |
+| Llama 3 Instruct 8B SLiC-HF v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v2.0](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SLiC-HF-v0.2) |  37.9  |  31.6  | 28.8 |
+| Llama 3 Instruct 8B DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-DPO-v0.2) |  33.9  |  32.5  | 29.3 |
+| Llama 3 Instruct 8B IPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-IPO-v0.2) |  48.2  |  47.5  | 35.2 |
+| Llama 3 Instruct 8B CPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2) |  46.8  |  42.4  | 36.6 |
+| Llama 3 Instruct 8B KTO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-KTO-v0.2) |  34.1  |  36.4  | 30.9 |
+| Llama 3 Instruct 8B ORPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-ORPO-v0.2) |  34.1  |  32.1  | 27.3 |
+| Llama 3 Instruct 8B R-DPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-RDPO-v0.2) |  38.1  |  33.8  | 28.2 |
+| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) |  48.0  |  45.8  | 35.1 |
+| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2](https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2) |  53.7  |  47.5  | 36.5 |
+
+### Use our models for inference
 Please refer to the [generate.py](generate.py) script for detailed instructions on loading the model with the appropriate chat template.

 ## Install Requirements