mirror of
https://github.com/wassname/alignment-handbook.git
synced 2026-06-27 16:14:07 +08:00
Add ORPO within README.md files (#154)
* Add `ORPO` within `scripts/README.md` * Fix typo in `ModelArguments.base_model_revision` * Add `ORPO` within `README.md` * Add Zephyr 141B in "News" section
This commit is contained in:
@@ -19,6 +19,7 @@ However, we know from the [InstructGPT](https://huggingface.co/papers/2203.02155
|
||||
The Alignment Handbook aims to fill that gap by providing the community with a series of robust training recipes that span the whole pipeline.
|
||||
|
||||
## News 🗞️
|
||||
* **April 12, 2024**: We release Zephyr 141B (A35B), in collaboration with Argilla and Kaist AI, along with the recipe to fine-tune Mixtral 8x22B with ORPO 🪁
|
||||
* **March 12, 2024:** We release StarChat2 15B, along with the recipe to train capable coding assistants 🌟
|
||||
* **March 1, 2024:** We release Zephyr 7B Gemma, which is a new recipe to align Gemma 7B with RLAIF 🔥
|
||||
* **February 1, 2024:** We release a recipe to align open LLMs with Constitutional AI 📜! See the [recipe](https://github.com/huggingface/alignment-handbook/tree/main/recipes/constitutional-ai) and the [blog post](https://huggingface.co/blog/constitutional_ai) for details.
|
||||
@@ -33,7 +34,7 @@ The Alignment Handbook aims to fill that gap by providing the community with a s
|
||||
|
||||
This project is simple by design and mostly consists of:
|
||||
|
||||
* [`scripts`](./scripts/) to train and evaluate models. Three steps are included: continued pretraining, supervised-finetuning (SFT) for chat, and preference alignment with DPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
|
||||
* [`scripts`](./scripts/) to train and evaluate models. Four steps are included: continued pretraining, supervised-finetuning (SFT) for chat, preference alignment with DPO, and supervised-finetuning with preference alignment with ORPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
|
||||
* [`recipes`](./recipes/) to reproduce models like Zephyr 7B. Each recipe takes the form of a YAML file which contains all the parameters associated with a single training run. A `gpt2-nl` recipe is also given to illustrate how this handbook can be used for language or domain adaptation, e.g. by continuing to pretrain on a different language, and then SFT and DPO tuning the result.
|
||||
|
||||
We are also working on a series of guides to explain how methods like direct preference optimization (DPO) work, along with lessons learned from gathering human preferences in practice. To get started, we recommend the following:
|
||||
@@ -53,6 +54,7 @@ The initial release of the handbook will focus on the following techniques:
|
||||
* **Reward modeling:** teach language models to distinguish model responses according to human or AI preferences.
|
||||
* **Rejection sampling:** a simple, but powerful technique to boost the performance of your SFT model.
|
||||
* **Direct preference optimisation (DPO):** a powerful and promising alternative to PPO.
|
||||
* **Odds Ratio Preference Optimisation (ORPO)**: a technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage.
|
||||
|
||||
## Installation instructions
|
||||
|
||||
|
||||
+10
-5
@@ -1,4 +1,3 @@
|
||||
|
||||
# Scripts to Train and Evaluate Chat Models
|
||||
|
||||
## Fine-tuning
|
||||
@@ -25,7 +24,13 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
|
||||
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_qlora.yaml --load_in_4bit=false
|
||||
```
|
||||
|
||||
Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported: continued pretraining `cpt`, supervised finetuning `sft`, and direct preference optimisation `dpo`. Note that `cpt` is only present in the `gpt-nl` example recipe. {model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:
|
||||
Here `{task}` refers to the type of training you wish to run. Currently the following tasks are supported:
|
||||
* continued pretraining `cpt` (note that `cpt` is only present in the `gpt-nl` example recipe)
|
||||
* supervised finetuning `sft`
|
||||
* direct preference optimisation `dpo`
|
||||
* odds ratio preference optimisation `orpo`
|
||||
|
||||
`{model_name}` refers to the choice of a recipe in the `recipes` directory. For example, to replicate Zephyr-7B-β you can run:
|
||||
|
||||
```shell
|
||||
# Step 1 - train SFT policy
|
||||
@@ -85,14 +90,14 @@ dataset_splits:
|
||||
- test_xxx # The test splits to mix
|
||||
```
|
||||
|
||||
If you want to fine-tune on your datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, etc), requires a different format, we assume the datasets have the following columns:
|
||||
If you want to fine-tune on your datasets, the main thing to keep in mind is how the chat templates are applied to the dataset blend. Since each task (SFT, DPO, ORPO, etc), requires a different format, we assume the datasets have the following columns:
|
||||
|
||||
**SFT**
|
||||
|
||||
* `messages`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}`.
|
||||
* See [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) for an example.
|
||||
|
||||
**DPO**
|
||||
**DPO and ORPO**
|
||||
|
||||
* `chosen`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the preferred dialogue.
|
||||
* `rejected`: A list of `dicts` in the form `{"role": "{role}", "content": {content}}` corresponding to the dispreferred dialogue.
|
||||
@@ -130,4 +135,4 @@ For both benchmarks, we have added support for the [Zephyr chat template](https:
|
||||
|
||||
Note that MT-Bench and AlpacaEval rely on LLMs like GPT-4 to judge the quality of the model responses, and thus the ranking exhibit various biases including a preference for models distilled from GPTs. For that reason, we also recommend submitting your best models for human evaluation in:
|
||||
|
||||
* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
|
||||
* [Chatbot Arena](https://chat.lmsys.org): a live, human evaluation of chat models in head-to-head comparisons.
|
||||
|
||||
@@ -112,7 +112,7 @@ class ModelArguments:
|
||||
|
||||
base_model_revision: Optional[str] = field(
|
||||
default=None,
|
||||
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adatpers.")},
|
||||
metadata={"help": ("The base model checkpoint for weights initialization with PEFT adapters.")},
|
||||
)
|
||||
model_name_or_path: Optional[str] = field(
|
||||
default=None,
|
||||
|
||||
Reference in New Issue
Block a user