Add ORPO within README.md files (#154)

* Add `ORPO` within `scripts/README.md`

* Fix typo in `ModelArguments.base_model_revision`

* Add `ORPO` within `README.md`

* Add Zephyr 141B in "News" section
This commit is contained in:
Alvaro Bartolome
2024-04-25 10:35:45 +02:00
committed by GitHub
parent 70769f9e9b
commit cf1975a7cb
3 changed files with 14 additions and 7 deletions
+3 -1
View File
@@ -19,6 +19,7 @@ However, we know from the [InstructGPT](https://huggingface.co/papers/2203.02155
The Alignment Handbook aims to fill that gap by providing the community with a series of robust training recipes that span the whole pipeline.
## News 🗞️
* **April 12, 2024**: We release Zephyr 141B (A35B), in collaboration with Argilla and Kaist AI, along with the recipe to fine-tune Mixtral 8x22B with ORPO 🪁
* **March 12, 2024:** We release StarChat2 15B, along with the recipe to train capable coding assistants 🌟
* **March 1, 2024:** We release Zephyr 7B Gemma, which is a new recipe to align Gemma 7B with RLAIF 🔥
* **February 1, 2024:** We release a recipe to align open LLMs with Constitutional AI 📜! See the [recipe](https://github.com/huggingface/alignment-handbook/tree/main/recipes/constitutional-ai) and the [blog post](https://huggingface.co/blog/constitutional_ai) for details.
@@ -33,7 +34,7 @@ The Alignment Handbook aims to fill that gap by providing the community with a s
This project is simple by design and mostly consists of:
* [`scripts`](./scripts/) to train and evaluate models. Three steps are included: continued pretraining, supervised-finetuning (SFT) for chat, and preference alignment with DPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
* [`scripts`](./scripts/) to train and evaluate models. Four steps are included: continued pretraining, supervised-finetuning (SFT) for chat, preference alignment with DPO, and supervised-finetuning with preference alignment with ORPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
* [`recipes`](./recipes/) to reproduce models like Zephyr 7B. Each recipe takes the form of a YAML file which contains all the parameters associated with a single training run. A `gpt2-nl` recipe is also given to illustrate how this handbook can be used for language or domain adaptation, e.g. by continuing to pretrain on a different language, and then SFT and DPO tuning the result.
We are also working on a series of guides to explain how methods like direct preference optimization (DPO) work, along with lessons learned from gathering human preferences in practice. To get started, we recommend the following:
@@ -53,6 +54,7 @@ The initial release of the handbook will focus on the following techniques:
* **Reward modeling:** teach language models to distinguish model responses according to human or AI preferences.
* **Rejection sampling:** a simple, but powerful technique to boost the performance of your SFT model.
* **Direct preference optimisation (DPO):** a powerful and promising alternative to PPO.
* **Odds Ratio Preference Optimisation (ORPO)**: a technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage.
## Installation instructions
+10 -5
View File
@@ -1,4 +1,3 @@
# Scripts to Train and Evaluate Chat Models
## Fine-tuning
@@ -25,7 +24,13 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_qlora.yaml --load_in_4bit=false
```
Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported: continued pretraining `cpt`, supervised finetuning `sft`, and direct preference optimisation `dpo`. Note that `cpt` is only present in the `gpt-nl` example recipe. {model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:
Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported:
* continued pretraining `cpt` (note that `cpt` is only present in the `gpt-nl` example recipe)
* supervised finetuning `sft`
* direct preference optimisation `dpo`
* odds ratio preference optimisation `orpo`
`{model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:
```shell
# Step 1 - train SFT policy
@@ -85,14 +90,14 @@ dataset_splits:
- test_xxx # The test splits to mix
```
If you want to fine-tune on your datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, etc), requires a different format, we assume the datasets have the following columns:
If you want to fine-tune on your datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, ORPO, etc), requires a different format, we assume the datasets have the following columns:
**SFT**
* `messages`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}`.
* See [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) for an example.
**DPO**
**DPO and ORPO**
* `chosen`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the preferred dialogue.
* `rejected`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the dispreferred dialogue.
@@ -130,4 +135,4 @@ For both benchmarks, we have added support for the [Zephyr chat template](https:
Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the ranking exhibit various biases including a preference for models distilled from GPTs. For that reason, we also recommend submitting your best models for human evaluation in:
* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
+1 -1
View File
@@ -112,7 +112,7 @@ class ModelArguments:
base_model_revision: Optional[str] = field(
default=None,
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adatpers.")},
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adapters.")},
)
model_name_or_path: Optional[str] = field(
default=None,