mirror of https://github.com/wassname/alignment-handbook.git synced 2026-06-27 17:47:01 +08:00

lewtun f0ffa0d7a6 Update Zephyr configs to account for UltraFeedback & TRL fixes (#88)

* Add files

* Add checkpointing

* Add checkpointing to SFT

* Add loss type

* Fix setup|

* Clean SFT

* Add lora config

* Rename config

* Remove max eval samples

* Add kwargs tp push to hub

* Add DPO configs

* Fix dpo configs

* Extend chat template test to multi-turn

* Add warmup

* Refactor

* Fix LoRA -> QLoRA

* Fix configs

* Specify chat template

* Add sample logging

* Fix push to hub hanging

* Add reentrant

* Fix quality

* Add transformer logging

* Tweak grad acc

* Add null type

* Add doc

2024-01-10 17:42:24 +11:00

Instructions to Replicate Zephyr-7b-β

As described in the Zephyr technical report, training this model proceeds in two steps:

1. Apply SFT to fine-tune Mistral 7B on a filtered version of the UltraChat dataset. The result is an SFT model like zephyr-7b-sft-full or zephyr-7b-sft-qlora.
2. Align the SFT model to AI feedback via DPO on a preprocessed version of the UltraFeedback dataset. The result is a DPO model like zephyr-7b-dpo-full or zephyr-7b-dpo-qlora.

Note: after the release of Zephyr, the team at Argilla found that the source UltraFeedback dataset had a few thousand incorrect preference labels from GPT-4. Additionally, TRL's SFTTrainer had a bug in the learning rate scheduler which terminated training early. Accounting for these changes led us to find a better set of hyperparameters than those described in the technical report. In particular, for DPO training we found that training for 1 epoch with beta=0.01 was sufficient to achieve comparable performance to zephyr-7b-beta (vs. 3 epochs with beta=0.1).
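To make the role of beta concrete, the standard DPO objective in its sigmoid form is reproduced here for reference. This assumes the recipe uses that default loss, since the loss type is set in the config and not shown in this README. The notation is introduced for illustration only: π_θ is the model being trained, π_ref is the frozen SFT model, and y_w and y_l are the preferred and rejected responses to a prompt x.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here beta scales the log-ratio margin between the preferred and rejected responses, so a smaller value such as 0.01 penalises deviation from the SFT reference less strongly than 0.1.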

See below for commands to train these models using either DeepSpeed ZeRO-3 or QLoRA.

Full training examples

You will need 8 GPUs with 80GB of VRAM each to train the full model.

# Step 1 - SFT
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

# Step 2 - DPO
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml

QLoRA training examples

# Step 1 - SFT
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_qlora.yaml --load_in_4bit=true

# Step 2 - DPO
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml
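The two steps of either variant can be chained in a small wrapper so that DPO only starts if SFT exits cleanly. The following is a minimal sketch, not part of the repository: `VARIANT`, `DRY_RUN` and `launch_step` are hypothetical names, and the commands it assembles are the ones listed above. Whether the DPO step picks up your local SFT checkpoint depends on the model path set in the DPO config, which is not shown here.

```shell
#!/bin/sh
# Sketch only: chains the SFT and DPO commands from this README.
# VARIANT, DRY_RUN and launch_step are hypothetical names, not repository features.
set -eu

VARIANT="${VARIANT:-full}"   # "full" or "qlora"
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands; 0 = actually launch them

launch_step() {
  step="$1"      # "sft" or "dpo"
  variant="$2"   # "full" or "qlora"

  # Pick the accelerate options used in the README for this variant.
  case "$variant" in
    full)  set -- --config_file recipes/accelerate_configs/deepspeed_zero3.yaml ;;
    qlora) set -- --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 ;;
    *)     echo "unknown variant: $variant" >&2; return 1 ;;
  esac

  # Assemble the full command as positional parameters to keep quoting intact.
  set -- accelerate launch "$@" \
    "scripts/run_${step}.py" "recipes/zephyr-7b-beta/${step}/config_${variant}.yaml"

  # The README passes --load_in_4bit=true only for the QLoRA SFT step.
  if [ "$variant" = "qlora" ] && [ "$step" = "sft" ]; then
    set -- "$@" --load_in_4bit=true
  fi

  if [ "$DRY_RUN" = "1" ]; then
    echo "ACCELERATE_LOG_LEVEL=info $*"
  else
    ACCELERATE_LOG_LEVEL=info "$@"
  fi
}

# Run the steps in order; with set -e the script stops if SFT fails.
launch_step sft "$VARIANT"
launch_step dpo "$VARIANT"
```

Saved under a name of your choice, the script prints the two commands by default; setting `DRY_RUN=0 VARIANT=qlora` would launch the QLoRA pair for real.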
