Fix Slurm opts

2026-06-27 18:41:19 +08:00 · 2023-11-09 14:09:52 +00:00
parent 33a0ce3afd
commit 756bb76d22
3 changed files with 31 additions and 43 deletions
@@ -3,10 +3,11 @@

 ### Fine-tuning

-In the handbook, we provide two main ways to align LLMs for chat:
+In the handbook, we provide three main ways to align LLMs for chat:

- Full fine-tuning on a multi-GPU machine (tested on an 8 x A100 (80GB) node).
+- Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node).
 - LoRA fine-tuning on a single consumer 24GB GPU (tested on an RTX 4090).
+- LoRA fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on 2 x A100 (80GB) GPUs).

 In practice, we find comparable performance for both full and LoRA fine-tuning, with the latter having the advantage of producing small adapter weights that are fast to upload and download from the Hugging Face Hub. Here are the general commands to fine-tune your models:

@@ -14,8 +15,11 @@ In practice, we find comparable performance for both full and LoRA fine-tuning,
 # Full training with ZeRO-3 on 8 GPUs
 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml

-# LoRA training on single GPU
+# LoRA training on a single GPU
 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
+
+# LoRA training with ZeRO-3 on two or more GPUs
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
 ```

 Here `{task}` refers to the type of training you wish to run (SFT, DPO, etc.), while `{model_name}` refers to the choice of recipe in the `recipes/` directory. For example, to replicate Zephyr 7B you can run:
@@ -28,7 +32,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b/dpo/config_full.yaml
 ```

-You can override the parameters in each YAML config by appending them to the command as follows:
+By default, these scripts push each model to the Hugging Face Hub under your username, i.e. `{username}/{model_name}-{task}`. You can override the parameters in each YAML config by appending them to the command as follows:

 ```shell
 # Change batch size, number of epochs etc
@@ -41,7 +45,7 @@ By default all training metrics are logged with TensorBoard. If you have a [Weig
 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml --report_to=wandb
 ```

-#### Launching jobs on a Slurm cluster
+### Launching jobs on a Slurm cluster

 If you have access to a Slurm cluster, we provide a `recipes/launch.slurm` script that will automatically queue training jobs for you. Here's how you can use it:

@@ -55,4 +59,6 @@ Here `{model_name}` and `{task}` are defined as above, while `{precision}` refer
 sbatch --job-name=handbook_sft --nodes=1 recipes/launch.slurm zephyr-7b sft full deepspeed_zero3
 ```

+You can scale the number of nodes by increasing the `--nodes` flag. Because the global batch size grows with the total number of GPUs, we recommend scaling down the per-device batch size or the number of gradient accumulation steps by the same factor to keep the global batch size constant (and thus replicate our results).
+
 **Note:** the configuration in `recipes/launch.slurm` is optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.
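
To make the point about keeping the global batch size constant across node counts concrete, here is a minimal sketch of the arithmetic. The helper `global_batch_size` is hypothetical (it is not part of the handbook), and the batch sizes, accumulation steps, and GPU counts are illustrative values rather than ones taken from any recipe. It assumes the usual data-parallel relation: global batch size = per-device batch size x gradient accumulation steps x total number of GPUs.

```shell
# Hypothetical helper, not part of the handbook.
# Arguments: per-device batch size, gradient accumulation steps,
#            GPUs per node, number of nodes.
global_batch_size() {
  echo $(( $1 * $2 * $3 * $4 ))
}

# One 8-GPU node: 8 x 2 x 8 x 1
global_batch_size 8 2 8 1

# Two 8-GPU nodes with gradient accumulation halved: 8 x 1 x 8 x 2
global_batch_size 8 1 8 2
```

Both calls print the same value, so under these assumed settings doubling `--nodes` while halving the gradient accumulation steps leaves the global batch size unchanged.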