restructure: quotes-first SKILL.md, synthesized playbook split out

SKILL.md is now folklore only: verbatim practitioner quotes ordered most-general-first, transformer/LLM fine-tuning entries in their own section, minimal context, links and footnotes.

New sources: unsloth, axolotl (+training stability), HF course ch8.4, Bekman debug_utils (evidence frozen in docs/evidence/).

The synthesized material (mental models, priors, symptom tables, agent loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of hypotheses rather than authoritative diagnoses. Made-up symptom tables no longer sit next to sourced quotes.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 01:00:14 +08:00 · 2026-06-11 14:33:32 +08:00
parent 8ee980d62f
commit fb753d093e
8 changed files with 2391 additions and 284 deletions
@@ -0,0 +1,241 @@
+Source: https://docs.axolotl.ai/docs/debugging.html
+Title: Debugging - Axolotl documentation, general tips trace to Hamel Husain (undated, fetched 2026)
+Fetched-via: uvx markitdown https://docs.axolotl.ai/docs/debugging.html
+Fetch-status: verbatim, nav/sidebar/TOC boilerplate trimmed
+
+# Debugging
+
+How to debug Axolotl
+
+This document provides some tips and tricks for debugging Axolotl. It also provides an example configuration for debugging with VSCode. A good debugging setup is essential to understanding how Axolotl code works behind the scenes.
+
+> [!Tip]
+> For training-specific debugging (loss spikes, NaN gradients, OOM errors, RL training stability), see [Training Stability & Debugging](../docs/training_stability.html).
+
+## Table of Contents
+
+* [General Tips](#general-tips)
+* [Debugging with VSCode](#debugging-with-vscode)
+  + [Background](#background)
+  + [Configuration](#configuration)
+  + [Customizing your debugger](#customizing-your-debugger)
+  + [Video Tutorial](#video-tutorial)
+* [Debugging With Docker](#debugging-with-docker)
+  + [Setup](#setup)
+  + [Attach To Container](#attach-to-container)
+  + [Video - Attaching To Docker On Remote Host](#video---attaching-to-docker-on-remote-host)
+
+## General Tips
+
+While debugging it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:
+
+> [!Important]
+> All of these tips are incorporated into the [example configuration](#configuration) for debugging with VSCode below.
+
+1. **Make sure you are using the latest version of axolotl**: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from `main`.
+2. **Eliminate concurrency**: Restrict the number of processes to 1 for both training and data preprocessing:
+
+   * Set `CUDA_VISIBLE_DEVICES` to a single GPU, ex: `export CUDA_VISIBLE_DEVICES=0`.
+   * Set `dataset_num_proc: 1` in your axolotl config or run the training command with `--dataset_num_proc=1`.
+3. **Use a small dataset**: Construct or use a small dataset from HF Hub. When using a small dataset, you will often have to set `sample_packing: False` and `eval_sample_packing: False` to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but want to use one from the HF Hub, you can shard the data (this will still tokenize the entire dataset, but will only use a fraction of the data for training). For example, to shard the dataset into 20 pieces, add the following to your axolotl config:
+
+   ```
+   datasets:
+       ...
+       shards: 20
+   ```
+4. **Use a small model**: A good example of a small model is [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0).
+5. **Minimize iteration time**: Make sure the training loop finishes as fast as possible, with these settings.
+
+   * `micro_batch_size: 1`
+   * `max_steps: 1`
+   * `val_set_size: 0`
+6. **Clear Caches:** Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.
+
+   * Data preprocessing: When debugging data preprocessing, which includes prompt template formation, you may want to delete the directory set in `dataset_prepared_path:` in your axolotl config. If you didn’t set this value, the default is `last_run_prepared`.
+   * HF Hub: If you are debugging data preprocessing, you should clear the relevant [HuggingFace cache](https://huggingface.co/docs/datasets/cache) by deleting the appropriate `~/.cache/huggingface/datasets/...` folder(s).
+   * **The recommended approach is to redirect all outputs and caches to a temporary folder and delete selected subfolders before each run. This is demonstrated in the example configuration below.**
+
+## Debugging with VSCode
+
+### Background
+
+The below example shows how to configure VSCode to debug data preprocessing of the `chat_template` format. This is the format used when you have the following in your axolotl config:
+
+```
+datasets:
+  - path: <path to your chat_template formatted dataset> # example on HF Hub: fozziethebeat/alpaca_messages_2k_test
+    type: chat_template
+```
+
+> [!Important]
+> If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files [.vscode/launch.json](../.vscode/launch.json) and [.vscode/tasks.json](../.vscode/tasks.json) for an example configuration.
+
+> [!Tip]
+> If you prefer to watch a video, rather than read, you can skip to the [video tutorial](#video-tutorial) below (but doing both is recommended).
+
+### Setup
+
+Make sure you have an [editable install](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:
+
+```
+export UV_TORCH_BACKEND=cu128  # or cu130
+uv venv --no-project --relocatable
+source .venv/bin/activate
+uv pip install --no-build-isolation -e '.[deepspeed]' --group dev --group test
+```
+
+#### Remote Hosts
+
+If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, you will need to follow this [remote - SSH guide](https://code.visualstudio.com/docs/remote/ssh). You can also see the video below on [Docker and Remote SSH debugging](#video---attaching-to-docker-on-remote-host).
+
+### Configuration
+
+The easiest way to get started is to modify the [.vscode/launch.json](../.vscode/launch.json) file in this project. This is just an example configuration, so you may need to modify or copy it to suit your needs.
+
+For example, to mimic the command `cd devtools && CUDA_VISIBLE_DEVICES=0 axolotl train dev_chat_template.yml`, you would use the below configuration[1](#fn1). Note that we add additional flags that override the axolotl config and incorporate the tips above (see the comments). We also set the working directory to `devtools` and set the `env` variable `HF_HOME` to a temporary folder that is later partially deleted. This is because we want to delete the HF dataset cache before each run in order to ensure that the data preprocessing code is run from scratch.
+
+```
+// .vscode/launch.json
+{
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Debug axolotl prompt - chat_template",
+            "type": "python",
+            "module": "accelerate.commands.launch",
+            "request": "launch",
+            "args": [
+                "-m", "axolotl.cli.train", "dev_chat_template.yml",
+                // The flags below simplify debugging by overriding the axolotl config
+                // with the debugging tips above.  Modify as needed.
+                "--dataset_num_proc=1",      // limits data preprocessing to one process
+                "--max_steps=1",              // limits training to just one step
+                "--batch_size=1",             // minimizes batch size
+                "--micro_batch_size=1",       // minimizes batch size
+                "--val_set_size=0",           // disables validation
+                "--sample_packing=False",     // disables sample packing which is necessary for small datasets
+                "--eval_sample_packing=False",// disables sample packing on eval set
+                "--dataset_prepared_path=temp_debug/axolotl_outputs/data", // send data outputs to a temp folder
+                "--output_dir=temp_debug/axolotl_outputs/model" // send model outputs to a temp folder
+                ],
+            "console": "integratedTerminal",      // show output in the integrated terminal
+            "cwd": "${workspaceFolder}/devtools", // set working directory to devtools from the root of the project
+            "justMyCode": true,                   // step through only axolotl code
+            "env": {"CUDA_VISIBLE_DEVICES": "0",  // Since we aren't doing distributed training, we need to limit to one GPU
+                    "HF_HOME": "${workspaceFolder}/devtools/temp_debug/.hf-cache"}, // send HF cache to a temp folder
+            "preLaunchTask": "cleanup-for-dataprep", // delete temp folders (see below)
+        }
+    ]
+}
+```
+
+**Additional notes about this configuration:**
+
+* The argument `justMyCode` is set to `true` such that you step through only the axolotl code. If you want to step into dependencies, set this to `false`.
+* The `preLaunchTask`: `cleanup-for-dataprep` is defined in [.vscode/tasks.json](../.vscode/tasks.json) and is used to delete the following folders before debugging, which is essential to ensure that the data pre-processing code is run from scratch:
+  + `./devtools/temp_debug/axolotl_outputs`
+  + `./devtools/temp_debug/.hf-cache/datasets`
+
+> [!Tip]
+> You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the `tasks.json` file depending on your use case.
+
+Below is the [.vscode/tasks.json](../.vscode/tasks.json) file that defines the `cleanup-for-dataprep` task. This task is run before each debugging session when you use the above configuration. Note that two tasks delete the two folders mentioned above. The third task, `cleanup-for-dataprep`, is a composite task that combines them. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the `preLaunchTask` argument of the `launch.json` file.
+
+```
+// .vscode/tasks.json
+// this file is used by launch.json
+{
+    "version": "2.0.0",
+    "tasks": [
+      // this task changes into the devtools directory and deletes the temp_debug/axolotl_outputs folder
+      {
+        "label": "delete-outputs",
+        "type": "shell",
+        "command": "rm -rf temp_debug/axolotl_outputs",
+        "options":{ "cwd": "${workspaceFolder}/devtools"},
+        "problemMatcher": []
+      },
+      // this task changes into the devtools directory and deletes the `temp_debug/.hf-cache/datasets` folder
+      {
+        "label": "delete-temp-hf-dataset-cache",
+        "type": "shell",
+        "command": "rm -rf temp_debug/.hf-cache/datasets",
+        "options":{ "cwd": "${workspaceFolder}/devtools"},
+        "problemMatcher": []
+      },
+        // this task combines the two tasks above
+      {
+       "label": "cleanup-for-dataprep",
+       "dependsOn": ["delete-outputs", "delete-temp-hf-dataset-cache"],
+      }
+    ]
+}
+```
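
For runs launched outside VSCode, the same cleanup can be sketched in Python. This is a minimal illustration, not part of Axolotl: the function name `cleanup_for_dataprep` is hypothetical and only mirrors the task label above, and the two relative paths are the ones the tasks delete.

```python
import shutil
from pathlib import Path

# Folders removed by the two shell tasks above, relative to the devtools directory.
FOLDERS_TO_DELETE = (
    "temp_debug/axolotl_outputs",
    "temp_debug/.hf-cache/datasets",
)


def cleanup_for_dataprep(devtools_dir):
    """Delete the temp folders so data preprocessing runs from scratch.

    Hypothetical helper: returns the relative paths that were actually removed.
    """
    removed = []
    for rel in FOLDERS_TO_DELETE:
        target = Path(devtools_dir) / rel
        if target.exists():
            shutil.rmtree(target)
            removed.append(rel)
    return removed
```

Calling `cleanup_for_dataprep("devtools")` from the project root would then play the role of the `preLaunchTask`. The rest of `temp_debug/.hf-cache` is left in place, as in the original tasks.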
+
+### Customizing your debugger
+
+Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the `devtools` folder and modify the `launch.json` file to use your config. You may also want to modify the `preLaunchTask` to delete different folders or not delete anything at all.
+
+### Video Tutorial
+
+The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode (click the image below to watch):
+
+[![](https://i.ytimg.com/vi/xUUB11yeMmc/maxresdefault.jpg)](https://youtu.be/xUUB11yeMmc "How to debug Axolotl (for fine tuning LLMs)")
+
+[Hamel Husain’s](https://hamel.dev) tutorial: [Debugging Axolotl w/VSCode](https://www.youtube.com/watch?v=xUUB11yeMmc)
+
+## Debugging With Docker
+
+Using [official Axolotl Docker images](https://hub.docker.com/r/axolotlai/axolotl/tags) is a great way to debug your code, and is a very popular way to use Axolotl. Attaching VSCode to Docker takes a few more steps.
+
+### Setup
+
+On the host that is running axolotl (ex: if you are using a remote host), clone the axolotl repo and change your current directory to the root:
+
+```
+git clone https://github.com/axolotl-ai-cloud/axolotl
+cd axolotl
+```
+
+> [!Tip]
+> If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.
+
+Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:[2](#fn2)
+
+```
+docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface axolotlai/axolotl-uv:main-latest
+```
+
+> [!Tip]
+> To understand which containers are available, see the [Docker section of the README](../README.md#docker) and the [DockerHub repo](https://hub.docker.com/r/axolotlai/axolotl/tags). For details of how the Docker containers are built, see axolotl’s [Docker CI builds](../.github/workflows/main.yml).
+
+You will now be in the container. Next, install Axolotl with dev dependencies:
+
+```
+uv venv --no-project --relocatable
+source .venv/bin/activate
+uv pip install --no-build-isolation -e '.[deepspeed]' --group dev --group test
+```
+
+### Attach To Container
+
+Next, if you are using a remote host, [Remote into this host with VSCode](https://code.visualstudio.com/docs/remote/ssh). If you are using a local host, you can skip this step.
+
+Next, select `Dev Containers: Attach to Running Container...` using the command palette (`CMD + SHIFT + P`) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.
+
+Now you are ready to debug as described above (see [Debugging with VSCode](#debugging-with-vscode)).
+
+### Video - Attaching To Docker On Remote Host
+
+Here is a short video that demonstrates how to attach to a Docker container on a remote host:
+
+[![](https://i.ytimg.com/vi/0AuoR7QnHR0/hqdefault.jpg)](https://youtu.be/0AuoR7QnHR0 "Debugging Axolotl Part 2: Attaching to Docker on a Remote Host")
+
+[Hamel Husain’s](https://hamel.dev) tutorial: [Debugging Axolotl Part 2: Attaching to Docker on a Remote Host](https://youtu.be/0AuoR7QnHR0)
+
+## Footnotes
+
+1. The VSCode config uses `accelerate.commands.launch` as the Python module entry point, which is what `axolotl train` invokes under the hood.[↩︎](#fnref1)
+2. Many of the flags in this command are recommended best practices by Nvidia when using nvidia-container-toolkit. You can read more about these flags [here](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html).[↩︎](#fnref2)
@@ -0,0 +1,399 @@
+Source: https://docs.axolotl.ai/docs/training_stability.html
+Title: Training Stability & Debugging - Axolotl documentation (undated, fetched 2026)
+Fetched-via: uvx markitdown https://docs.axolotl.ai/docs/training_stability.html
+Fetch-status: verbatim, nav/sidebar/TOC boilerplate trimmed
+
+# Training Stability & Debugging
+
+Guide to monitoring, debugging, and stabilizing training runs in axolotl
+
+This guide covers practical techniques for monitoring training health, diagnosing instability, and resolving common failures in both supervised fine-tuning (SFT) and reinforcement learning (GRPO/EBFT) workflows.
+
+## Monitoring Training
+
+### Key Metrics for SFT
+
+Every SFT run should be monitored through at least these four metrics:
+
+| Metric | What It Tells You | Healthy Range |
+| --- | --- | --- |
+| `train/loss` | How well the model fits training data | Decreasing; typically 0.5–2.0 for chat fine-tuning |
+| `eval/loss` | Generalization performance | Tracks train loss with small gap; divergence signals overfitting |
+| `grad_norm` | Gradient magnitude | 0.1–10.0; spikes above 100 indicate instability |
+| `learning_rate` | Current LR from scheduler | Should follow expected schedule (warmup then decay) |
+
+Tip: Set Up Logging Early
+
+Enable W&B or TensorBoard from the start. Debugging a failed run without metrics is guesswork.
+
+```
+wandb_project: my-project
+wandb_run_id:   # optional, for resuming
+logging_steps: 1
+```
+
+### Key Metrics for RL (GRPO)
+
+GRPO training logs a richer set of metrics. These are the critical ones:
+
+| Metric | Healthy Range | Red Flag |
+| --- | --- | --- |
+| `rewards/<name>/mean` | > 0.15 within 20 steps | Stays at 0 – reward function is broken or task is too hard |
+| `reward_std` | > 0 on most steps | Always 0 – no learning signal (all completions get the same reward) |
+| `frac_reward_zero_std` | < 0.8 | 1.0 on every step – zero-advantage skip fires constantly, no gradient updates |
+| `grad_norm` | 0.001–1.0 | 0.0 is acceptable occasionally (zero-adv skip); > 10.0 is unstable |
+| `entropy` | 0.05–0.5 | < 0.01 suggests mode collapse; > 1.0 suggests the model is not converging |
+| `kl` | 0.0–0.5 | > 2.0 suggests policy has diverged too far from reference |
+| `sampling/sampling_logp_difference/mean` | < 0.1 | > 1.0 means policy has diverged far from vLLM server weights |
+| `sampling/importance_sampling_ratio/min` | > 0.1 | Near 0 indicates stale off-policy data; decrease `vllm_sync_interval` so weights sync more often |
+| `clip_ratio/region_mean` | < 0.1 | > 0.3 means PPO clipping is too aggressive |
+| `completions/mean_length` | Task-dependent | Monotonically increasing to max length suggests reward hacking |
+| `completions/clipped_ratio` | < 0.3 | > 0.8 means most completions hit `max_completion_length` – increase it |
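
The red-flag column above can be turned into a quick per-step check. The sketch below is an illustration only: the thresholds are copied from the table, while the helper name `red_flags` and the plain-dict input format are assumptions, not an Axolotl or TRL API. A single flagged step is a prompt to look closer; several rows in the table only matter when the condition persists.

```python
# Red-flag thresholds copied from the GRPO metrics table above.
# Each entry maps a metric name to a predicate that is True when the value looks unhealthy.
RED_FLAGS = {
    "frac_reward_zero_std": lambda v: v >= 1.0,   # only a problem if it holds on every step
    "grad_norm": lambda v: v > 10.0,
    "entropy": lambda v: v < 0.01 or v > 1.0,
    "kl": lambda v: v > 2.0,
    "sampling/sampling_logp_difference/mean": lambda v: v > 1.0,
    "clip_ratio/region_mean": lambda v: v > 0.3,
    "completions/clipped_ratio": lambda v: v > 0.8,
}


def red_flags(metrics):
    """Return the sorted names of logged metrics that cross a red-flag threshold.

    `metrics` is assumed to be a dict of metric name to float for one step.
    Metrics that are missing from the dict are simply skipped.
    """
    return sorted(
        name
        for name, is_bad in RED_FLAGS.items()
        if name in metrics and is_bad(metrics[name])
    )


step_metrics = {"kl": 2.5, "entropy": 0.2, "grad_norm": 0.4}
print(red_flags(step_metrics))  # ['kl']
```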
+
+Note: EBFT-Specific Metrics
+
+For EBFT training, also monitor `ebft/alignment` (should trend upward, healthy 0.3–0.9), `ebft/diversity` (healthy 0.01–0.1; > 1.0 indicates mode collapse), and `ebft/cfm_loss` (should trend downward, < 10).
+
+## SFT Stability
+
+### Loss Plateau
+
+**Symptom**: Loss stops decreasing early in training, well above expected values.
+
+**Causes and fixes**:
+
+* **Learning rate too low**: Increase by 2–5x. Typical ranges: full fine-tune 1e-5 to 5e-5, LoRA 1e-4 to 3e-4.
+* **Insufficient warmup**: Set `warmup_steps` to 5–10% of total steps. Too-aggressive learning at the start can push the model into a flat region.
+* **Data quality**: Check that labels are correctly masked. Use `axolotl preprocess` and inspect tokenized samples to confirm only the target tokens are trainable.
+* **Weight decay too high**: Default 0.01 is usually fine. Values above 0.1 can suppress learning in LoRA.
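
As a worked example of the warmup guidance above, the 5–10% rule can be computed from the total step count. This is a minimal sketch; the helper name `warmup_steps_range` is made up for illustration, and integer arithmetic is used so the result is exact.

```python
def warmup_steps_range(total_steps):
    """Return (low, high) warmup step counts covering 5% to 10% of total_steps."""
    low = max(1, total_steps * 5 // 100)       # at least one warmup step
    high = max(low, total_steps * 10 // 100)
    return low, high


# A 1000-step run would use somewhere between 50 and 100 warmup steps.
print(warmup_steps_range(1000))  # (50, 100)
```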
+
+### Loss Spikes
+
+**Symptom**: Loss suddenly jumps by 2–10x then (possibly) recovers.
+
+**Causes and fixes**:
+
+* **Bad data samples**: A single malformed or extremely long example can cause a spike. Enable `sample_packing: false` temporarily and check if spikes correlate with specific batches.
+* **Learning rate too high**: Reduce by 2–5x, or increase warmup.
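+* **Gradient accumulation mismatch**: Effective batch size = `micro_batch_size * gradient_accumulation_steps * num_gpus`. Check that this matches what you intended; a learning rate tuned for one effective batch size can be too high for a much smaller one, where gradients are noisier.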
+* **Mixed precision issues**: With `bf16: true`, some operations can lose precision. If spikes are severe, try `fp32` for diagnosis.
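
The effective batch size formula above is easy to get wrong when the GPU count changes between runs. The sketch below spells it out; both function names are hypothetical, and the arguments simply mirror the config keys mentioned in the list.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus):
    """Effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_batch_size, micro_batch_size, num_gpus):
    """Return the gradient_accumulation_steps needed to hit a target effective batch size."""
    per_step = micro_batch_size * num_gpus
    if target_batch_size % per_step != 0:
        raise ValueError(
            f"target {target_batch_size} is not a multiple of "
            f"micro_batch_size * num_gpus = {per_step}"
        )
    return target_batch_size // per_step


# Moving from 1 GPU to 4 GPUs while keeping the effective batch size at 64:
print(accumulation_steps_for(64, micro_batch_size=2, num_gpus=1))  # 32
print(accumulation_steps_for(64, micro_batch_size=2, num_gpus=4))  # 8
```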
+
+### Overfitting
+
+**Symptom**: Train loss keeps decreasing but eval loss starts increasing.
+
+**Fixes**:
+
+* Increase `val_set_size` (e.g., 0.05) and monitor `eval/loss`.
+* Reduce `num_epochs` or `max_steps`.
+* Increase `weight_decay` (try 0.01–0.1).
+* Use a smaller LoRA rank (`lora_r`). Typical values: 8–32.
+* Increase dropout: `lora_dropout: 0.05`.
+
+## RL/GRPO Stability
+
+### Reward Never Increases
+
+If `rewards/*/mean` stays at 0 for more than 20 steps:
+
+1. **Test reward function standalone**: Run it outside training with known inputs to verify it returns nonzero values.
+
+   ```
+   cd experiments && python -c "import my_rewards; print(my_rewards.accuracy_reward(...))"
+   ```
+2. **Check dataset columns**: The reward function receives `**kwargs` containing dataset columns. Verify the columns it needs (e.g., `answer`) are not removed by the dataset transform.
+3. **Check completion content**: Enable `log_completions: true` in the `trl:` config and inspect logged completions in W&B. If completions are empty or incoherent, the model may be too weak for the task.
+4. **Verify vLLM is serving the right model**: Hit the vLLM health endpoint and confirm the model name matches your config.
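
Steps 1 and 2 above can be combined into a small standalone check. The sketch below assumes a reward function that takes a list of completion strings plus dataset columns as keyword arguments and returns one float per completion; the body of `accuracy_reward` here is a made-up stand-in for whatever `my_rewards.accuracy_reward` really does.

```python
def accuracy_reward(completions, answer=None, **kwargs):
    """Hypothetical reward: 1.0 if the reference answer appears in the completion, else 0.0.

    `answer` is expected to arrive as a dataset column through keyword arguments.
    """
    if answer is None:
        # Fails loudly if the dataset transform dropped the column the reward needs.
        raise KeyError("dataset column 'answer' was not passed to the reward function")
    return [
        1.0 if ref.strip() in completion else 0.0
        for completion, ref in zip(completions, answer)
    ]


# Known inputs: the first completion contains its answer, the second does not.
rewards = accuracy_reward(
    ["The answer is 42.", "I think it is 7."],
    answer=["42", "8"],
)
print(rewards)  # [1.0, 0.0]
```

If a check like this returns all zeros on inputs that should score, the reward function is the problem rather than the model.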
+
+### Entropy Collapse (Mode Collapse)
+
+**Symptom**: `entropy` drops below 0.01; all completions become nearly identical.
+
+**Fixes**:
+
+* Increase `temperature` in generation kwargs (try 0.8–1.0).
+* Reduce learning rate.
+* Add a KL penalty term (`beta` parameter in GRPO config).
+* Check that `num_generations` is sufficient (16+ gives better advantage estimates).
+
+### IS Ratio Divergence
+
+**Symptom**: `sampling/importance_sampling_ratio/min` drops near 0, or `sampling/sampling_logp_difference/mean` exceeds 1.0.
+
+This means the policy has diverged significantly from the weights used by vLLM for generation. The importance sampling correction becomes unreliable.
+
+**Fixes**:
+
+* Decrease `vllm_sync_interval` (sync weights more often).
+* Enable `off_policy_mask_threshold` (e.g., 0.5) to mask stale off-policy samples.
+* Use `importance_sampling_level: token` for finer-grained correction.
+
+### Gradient Norm Instability
+
+**Symptom**: `grad_norm` oscillates wildly or exceeds 10.0 regularly.
+
+**Fixes**:
+
+* Enable gradient clipping: `max_grad_norm: 1.0` (default in most configs).
+* Reduce learning rate.
+* Increase `gradient_accumulation_steps` to smooth out noisy batches.
+* Check for NaN issues (see next section).
+
+## NaN and Inf Handling
+
+### Common Causes
+
+| Cause | Where It Manifests | Detection |
+| --- | --- | --- |
+| FP8 zero-scale division | Forward pass logits | `grad_norm: nan`, loss becomes NaN immediately |
+| Gradient explosion | Backward pass | `grad_norm` spikes to inf, then loss goes NaN |
+| Bad data (empty sequences) | Logprob computation | NaN in specific batches only |
+| Numerical overflow in log-softmax | Loss computation | Large negative logprobs cause exp() overflow |
+
+### FP8-Specific NaN Issues
+
+FP8 quantization (`fp8: true`) can produce NaN when the activation quantization kernel divides by `max(abs(x)) / 448`. If the input tensor is all zeros (e.g., padding positions), the scale becomes 0, causing division by zero.
+
+**Fixes applied in axolotl**:
+
+* The `act_quant_kernel` has a zero-guard: `s = tl.where(s == 0, 1.0, s)`.
+* A safety net `nan_to_num(logits, nan=0.0)` is applied in `_get_per_token_logps_and_entropies`.
+* Embedding padding is zero-padded for FP8 compatibility.
+
+Important: After Modifying Triton Kernels
+
+If you patch any Triton JIT kernel (e.g., the FP8 quantization kernels in transformers), you must clear the Triton cache for changes to take effect:
+
+```
+rm -rf ~/.triton/cache
+```
+
+### General NaN Debugging Steps
+
+1. **Enable anomaly detection** (slow, but pinpoints the source):
+
+   ```
+   torch.autograd.set_detect_anomaly(True)
+   ```
+2. **Check grad\_norm**: If it goes to NaN, the backward pass is the problem. If loss is NaN but grad\_norm was fine on the previous step, the forward pass is the problem.
+3. **Reduce to single GPU, single batch**: Eliminate distributed training variables.
+4. **Inspect data**: Print the batch that triggers NaN. Look for empty sequences, extreme token IDs, or unexpected padding patterns.
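
The rule in step 2 can be applied mechanically to logged metrics. The sketch below is a heuristic illustration, not an Axolotl utility: it assumes you have exported `(step, loss, grad_norm)` tuples from your logs, and the function name is hypothetical.

```python
import math


def locate_first_nan(history):
    """Find the first step with a non-finite loss or grad_norm and guess the failing pass.

    `history` is an iterable of (step, loss, grad_norm) tuples in step order.
    Returns (step, "forward"), (step, "backward"), or None if every value is finite.
    """
    for step, loss, grad_norm in history:
        if not math.isfinite(loss):
            # Loss went bad while every earlier grad_norm was finite: suspect the forward pass.
            return step, "forward"
        if not math.isfinite(grad_norm):
            # Loss is still finite but the gradient is not: suspect the backward pass.
            return step, "backward"
    return None


log = [(1, 1.92, 0.8), (2, 1.85, 0.7), (3, 1.81, float("nan"))]
print(locate_first_nan(log))  # (3, 'backward')
```

The returned step is also the batch worth printing for step 4.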
+
+## OOM Debugging
+
+Out-of-memory errors are the most common training failure. Use this systematic approach, from least to most disruptive:
+
+### Step 1: Reduce Batch Size
+
+The single highest-impact change. VRAM scales roughly linearly with batch size.
+
+```
+micro_batch_size: 1              # Start here
+gradient_accumulation_steps: 16  # Increase to maintain effective batch size
+```
+
+For GRPO specifically, the logits tensor for policy logprob computation can be very large: it holds `batch_size * num_generations * seq_len * vocab_size` elements at 2 bytes each in bf16. For example, with `num_generations: 16` and `micro_batch_size: 8` (sequence length 2048, vocabulary size 151936), the logits tensor alone is:
+
+```
+8 * 16 * 2048 * 151936 * 2 bytes = ~75 GB  (way too large)
+```
+
+Reduce `micro_batch_size` to 2–4 for GRPO.
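
The arithmetic above can be reproduced for other settings with a few lines of Python. This sketch only multiplies the sizes out; the function name is made up, and the 2048 sequence length and 151936 vocabulary size are the values used in the example.

```python
def logits_bytes(micro_batch_size, num_generations, seq_len, vocab_size, bytes_per_element=2):
    """Size in bytes of a logits tensor (bf16 uses 2 bytes per element)."""
    return micro_batch_size * num_generations * seq_len * vocab_size * bytes_per_element


# Compare the example setting with the smaller micro batch sizes suggested above.
for mbs in (8, 4, 2):
    size_gib = logits_bytes(mbs, 16, 2048, 151936) / 2**30
    print(f"micro_batch_size={mbs}: {size_gib:.1f} GiB")
```

With `micro_batch_size: 8` this comes to about 74 GiB, which matches the rough figure quoted above. Halving the micro batch size halves the tensor.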
+
+### Step 2: Enable Gradient Checkpointing
+
+Trades compute for memory by recomputing activations during the backward pass instead of storing them.
+
+```
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false     # Recommended default
+```
+
+Warning: Reentrant Checkpointing Exceptions
+
+Some configurations require `use_reentrant: true`:
+
+* DeepSpeed ZeRO-3 (non-reentrant causes `CheckpointError`)
+* EBFT strided mode with flex\_attention
+
+### Step 3: Use Quantization
+
+Load the base model in reduced precision:
+
+```
+# 4-bit QLoRA
+adapter: qlora
+load_in_4bit: true
+
+# 8-bit
+load_in_8bit: true
+
+# FP8 (saves ~50% model VRAM, same compute speed as bf16)
+fp8: true
+```
+
+### Step 4: Reduce Sequence Length
+
+```
+sequence_len: 1024     # Down from 2048 or 4096
+```
+
+For GRPO, also reduce `max_completion_length`. Memory scales quadratically with sequence length when using standard attention.
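
To make the quadratic scaling concrete, the sketch below counts the bytes needed for the attention score matrix of a single layer when it is fully materialized. It is a back-of-the-envelope illustration: the function name and the 32-head figure are assumptions, and real memory use also depends on the attention implementation.

```python
def attention_score_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    """Bytes for one layer's (seq_len x seq_len) attention scores across all heads."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# Hypothetical model with 32 attention heads, batch size 1, bf16 scores.
for seq_len in (4096, 2048, 1024):
    mib = attention_score_bytes(1, 32, seq_len) / 2**20
    print(f"sequence_len={seq_len}: {mib:.0f} MiB per layer")
```

Halving `sequence_len` cuts this term to a quarter, which is why reducing sequence length is listed as its own step.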
+
+### Step 5: Use Flash Attention
+
+Reduces attention memory from O(n^2) to O(n):
+
+```
+attn_implementation: flash_attention_2
+```
+
+### Step 6: Offload with DeepSpeed
+
+For extreme cases, offload optimizer states or parameters to CPU:
+
+```
+deepspeed: deepspeed_configs/zero3_bf16.json
+```
+
+### Diagnosing the Specific Culprit
+
+Use the `profiler_steps` config option to capture GPU memory snapshots:
+
+```
+profiler_steps: [1, 2]
+```
+
+This generates PyTorch profiler traces you can inspect to see exactly which tensor allocation caused the OOM.
+
+## Common Errors
+
+| Error Message | Likely Cause | Fix |
+| --- | --- | --- |
+| `exitcode: -9` | System RAM exhaustion | Reduce dataset size, `dataset_num_proc`, or number of data workers |
+| `exitcode: -7` (DeepSpeed) | DeepSpeed version issue | `pip install -U deepspeed` |
+| `CUDA out of memory` | GPU VRAM exhaustion | Follow OOM debugging steps above |
+| `RuntimeError: NCCL communicator was aborted` | GPU communication failure | See [NCCL docs](../docs/nccl.html); check `NCCL_DEBUG=INFO` output |
+| `ValueError: Asking to pad but the tokenizer does not have a padding token` | Missing pad token | Add `special_tokens: { pad_token: "<\|endoftext\|>" }` to config |
+| `'DummyOptim' object has no attribute 'step'` | DeepSpeed on single GPU | Remove `deepspeed:` section from config |
+| `unable to load strategy X` then `None is not callable` | Reward module not importable | Run `cd experiments && python -c "import my_rewards"` to check |
+| `generation_batch_size not divisible by num_generations` | micro\_batch\_size too small | Set `micro_batch_size` to a multiple of `num_generations` (at least `num_generations`) |
+| `'weight' must be 2-D` | FSDP1 flattened parameters | Use `fsdp_version: 2` or skip `unwrap_model` when FSDP is enabled |
+| `CheckpointError` (tensor count mismatch) | Non-reentrant checkpointing + ZeRO-3 or flex\_attention | Set `use_reentrant: true` in `gradient_checkpointing_kwargs` |
+| `BFloat16` TypeError during weight sync | NumPy does not support bf16 | Fixed in axolotl’s `weight_serde.py` (auto bf16 to fp16 conversion) |
+| `Content end boundary is before start boundary` | Chat template parsing issue | Check `eos_token` matches template; file a GitHub issue if persistent |
+| `CAS service error` during data processing | HuggingFace XET issue | Set `export HF_HUB_DISABLE_XET=1` |
+| Training hangs (multi-GPU) | FSDP + async prefetch deadlock | Set `async_prefetch: false` with FSDP |
+
+## Profiling
+
+### PyTorch Profiler
+
+Axolotl supports PyTorch profiler integration via the config:
+
+```
+profiler_steps: [1, 2, 3]
+```
+
+This captures profiler traces for the specified steps. View them in TensorBoard:
+
+```
+tensorboard --logdir output_dir/runs
+```
+
+Or open the `.json` trace file in `chrome://tracing`.
+
+### CUDA Memory Snapshots
+
+For detailed memory analysis, use PyTorch’s memory snapshot API. Add this to your training script or use it interactively:
+
+```
+import torch
+
+# Enable memory history tracking
+torch.cuda.memory._record_memory_history()
+
+# ... run your training step ...
+
+# Save snapshot
+torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
+```
+
+Visualize with PyTorch’s memory visualizer:
+
+```
+python -m torch.cuda._memory_viz trace_plot memory_snapshot.pickle -o memory_snapshot.html
+```
+
+### Quick GPU Memory Check
+
+During training, monitor GPU utilization in a separate terminal:
+
+```
+watch -n 1 nvidia-smi
+```
+
+For programmatic access within axolotl, the logged metrics `memory/max_alloc` and `memory/max_reserved` come from `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`. Note these report PyTorch’s view of memory, which may differ from `nvidia-smi` (see [FAQ](../docs/faq.html)).
+
+## W&B and Logging
+
+### Enabling Logging
+
+```
+wandb_project: my-project
+wandb_entity: my-team          # optional
+wandb_run_id: run-123          # optional, for resuming
+wandb_name: experiment-name    # optional
+logging_steps: 1               # log every step (recommended for RL)
+```
+
+### Debug Logging
+
+For detailed axolotl-internal debug output:
+
+```
+AXOLOTL_LOG_LEVEL=DEBUG axolotl train config.yaml 2>&1 | tee /tmp/training.log
+```
+
+Tip: Always Log to a File
+
+Pipe training output to a log file so you can inspect it after the run:
+
+```
+axolotl train config.yaml 2>&1 | tee /tmp/my_run.log
+```
+
+### What Axolotl Logs
+
+**SFT metrics** (logged every `logging_steps`):
+
+* `train/loss`, `eval/loss` – training and validation loss
+* `train/grad_norm` – gradient L2 norm (before clipping)
+* `train/learning_rate` – current learning rate
+* `memory/max_alloc`, `memory/max_reserved` – peak GPU memory
+
+**GRPO/RL metrics** (logged every step):
+
+* `rewards/<name>/mean`, `rewards/<name>/std` – per-reward-function statistics
+* `reward`, `reward_std` – aggregated reward across all reward functions
+* `frac_reward_zero_std` – fraction of prompt groups where all completions got the same reward
+* `completions/mean_length`, `completions/min_length`, `completions/max_length` – completion token lengths
+* `completions/clipped_ratio` – fraction of completions that hit the max length
+* `completions/mean_terminated_length`, `completions/min_terminated_length`, `completions/max_terminated_length` – lengths of naturally terminated completions
+* `kl` – KL divergence between policy and reference
+* `entropy` – policy entropy (measure of output diversity)
+* `clip_ratio/region_mean`, `clip_ratio/low_mean`, `clip_ratio/high_mean` – PPO clipping statistics
+* `sampling/sampling_logp_difference/mean`, `sampling/sampling_logp_difference/max` – log-probability difference between policy and sampling distribution
+* `sampling/importance_sampling_ratio/min`, `sampling/importance_sampling_ratio/mean`, `sampling/importance_sampling_ratio/max` – IS ratio statistics for off-policy correction
+* `num_tokens` – total tokens processed
+
+### Reading W&B Charts
+
+For a healthy GRPO run, expect to see:
+
+1. **`reward/mean`**: Gradual upward trend. May start near 0 and reach 0.3–0.8 depending on task difficulty. Not monotonic – fluctuations are normal.
+2. **`entropy`**: Gradual decrease from initial values (often 0.3–0.6) as the model becomes more confident. Should not collapse to near-zero.
+3. **`grad_norm`**: Mostly in the 0.001–1.0 range. Occasional 0.0 values are fine (zero-advantage skip). Persistent values above 10.0 need investigation.
+4. **`kl`**: Starts near 0 and grows slowly. If it shoots up rapidly, the policy is diverging from the reference.
+5. **`completions/mean_length`**: Should reflect the task’s natural answer length. If it steadily increases to `max_completion_length`, the model may be reward-hacking by generating longer outputs.
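
The checklist above can be turned into a rough automated check over logged metric histories. This is only a sketch: the function name, the `history` layout, the 5-step window used for "persistent", the 0.01 cutoff for "near-zero" entropy, and the 95% margin on completion length are all assumptions chosen for illustration, not values from the Axolotl documentation.

```python
def check_grpo_health(history, max_completion_length):
    """Return warning strings for a dict mapping metric name -> list of logged values."""
    warnings = []

    # "Persistent values above 10.0": here, the last 5 logged values (assumed window).
    recent = history.get("grad_norm", [])[-5:]
    if len(recent) == 5 and all(g > 10.0 for g in recent):
        warnings.append("grad_norm persistently above 10.0")

    # "Should not collapse to near-zero": 0.01 is an assumed cutoff.
    entropy = history.get("entropy", [])
    if entropy and entropy[-1] < 0.01:
        warnings.append("entropy collapsed to near zero")

    # Mean length creeping up to the cap: 95% of the cap is an assumed margin.
    lengths = history.get("completions/mean_length", [])
    if lengths and lengths[-1] >= 0.95 * max_completion_length:
        warnings.append("mean completion length is approaching max_completion_length")

    return warnings


history = {
    "grad_norm": [0.4, 0.0, 12.0, 15.0, 11.0, 14.0, 13.0],
    "entropy": [0.5, 0.4, 0.35],
    "completions/mean_length": [120, 180, 250],
}
for message in check_grpo_health(history, max_completion_length=256):
    print(message)
```

Treat any warning as a prompt to look at the charts, not as a diagnosis; the thresholds will need tuning per task.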
@@ -0,0 +1,355 @@
+Source: https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/debug_utils.py
+Title: debug_utils.py (DebugUnderflowOverflow) - Stas Bekman, huggingface/transformers (2021)
+Fetched-via: curl -sL https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/debug_utils.py
+Fetch-status: verbatim
+
+```python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import collections
+
+from .utils import ExplicitEnum, is_torch_available, logging
+
+
+if is_torch_available():
+    import torch
+
+
+logger = logging.get_logger(__name__)
+
+
+class DebugUnderflowOverflow:
+    """
+    This debug class helps detect and understand where the model starts getting very large or very small, and more
+    importantly `nan` or `inf` weight and activation elements.
+
+    There are 2 working modes:
+
+    1. Underflow/overflow detection (default)
+    2. Specific batch absolute min/max tracing without detection
+
+    Mode 1: Underflow/overflow detection
+
+    To activate the underflow/overflow detection, initialize the object with the model :
+
+    ```python
+    debug_overflow = DebugUnderflowOverflow(model)
+    ```
+
+    then run the training as normal and if `nan` or `inf` gets detected in at least one of the weight, input or output
+    elements this module will throw an exception and will print `max_frames_to_save` frames that lead to this event,
+    each frame reporting
+
+    1. the fully qualified module name plus the class name whose `forward` was run
+    2. the absolute min and max value of all elements for each module weights, and the inputs and output
+
+    For example, here is the header and the last few frames in detection report for `google/mt5-small` run in fp16
+    mixed precision :
+
+    ```
+    Detected inf/nan during batch_number=0
+    Last 21 forward frames:
+    abs min  abs max  metadata
+    [...]
+                      encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
+    2.17e-07 4.50e+00 weight
+    1.79e-06 4.65e+00 input[0]
+    2.68e-06 3.70e+01 output
+                      encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
+    8.08e-07 2.66e+01 weight
+    1.79e-06 4.65e+00 input[0]
+    1.27e-04 2.37e+02 output
+                      encoder.block.2.layer.1.DenseReluDense.wo Linear
+    1.01e-06 6.44e+00 weight
+    0.00e+00 9.74e+03 input[0]
+    3.18e-04 6.27e+04 output
+                      encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
+    1.79e-06 4.65e+00 input[0]
+    3.18e-04 6.27e+04 output
+                      encoder.block.2.layer.1.dropout Dropout
+    3.18e-04 6.27e+04 input[0]
+    0.00e+00      inf output
+    ```
+
+    You can see here that `T5DenseGatedGeluDense.forward` resulted in output activations whose absolute max value was
+    around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout`, which
+    rescales the remaining elements after zeroing some of them, which pushes the absolute max value to more than
+    64K, and we get an overflow.
+
+    As you can see, it's the previous frames that we need to look into when the numbers start getting very large for
+    fp16.
+
+    The tracking is done in a forward hook, which gets invoked immediately after `forward` has completed.
+
+    By default the last 21 frames are printed. You can change the default to adjust for your needs. For example :
+
+    ```python
+    debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
+    ```
+
+    To validate that you have set up this debugging feature correctly, and you intend to use it in a training that
+    may take hours to complete, first run it with normal tracing enabled for one or a few batches as explained in
+    the next section.
+
+
+    Mode 2: Specific batch absolute min/max tracing without detection
+
+    The second work mode is per-batch tracing with the underflow/overflow detection feature turned off.
+
+    Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a
+    given batch, and only do that for batches 1 and 3. Then you instantiate this class as :
+
+    ```python
+    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
+    ```
+
+    And now full batches 1 and 3 will be traced using the same format as explained above. Batches are 0-indexed.
+
+    This is helpful if you know that the program starts misbehaving after a certain batch number, so you can
+    fast-forward right to that area.
+
+
+    Early stopping:
+
+    You can also specify the batch number after which to stop the training, with :
+
+    ```python
+    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
+    ```
+
+    This feature is mainly useful in the tracing mode, but you can use it for any mode.
+
+
+    **Performance**:
+
+    As this module measures absolute `min`/`max` of each weight of the model on every forward it'll slow the training
+    down. Therefore remember to turn it off once the debugging needs have been met.
+
+    Args:
+        model (`nn.Module`):
+            The model to debug.
+        max_frames_to_save (`int`, *optional*, defaults to 21):
+            How many frames back to record
+        trace_batch_nums(`list[int]`, *optional*, defaults to `[]`):
+            Which batch numbers to trace (turns detection off)
+        abort_after_batch_num (`int`, *optional*):
+            Batch number after which to abort the training
+    """
+
+    def __init__(self, model, max_frames_to_save=21, trace_batch_nums=None, abort_after_batch_num=None):
+        if trace_batch_nums is None:
+            trace_batch_nums = []
+        self.model = model
+        self.trace_batch_nums = trace_batch_nums
+        self.abort_after_batch_num = abort_after_batch_num
+
+        # keep a bounded buffer of the most recent frames (oldest dropped first), to dump as soon as inf/nan is
+        # encountered and give context to the problem emergence
+        self.frames = collections.deque([], max_frames_to_save)
+        self.frame = []
+        self.batch_number = 0
+        self.total_calls = 0
+        self.detected_overflow = False
+        self.prefix = "                 "
+
+        self.analyse_model()
+
+        self.register_forward_hook()
+
+    def save_frame(self, frame=None):
+        if frame is not None:
+            self.expand_frame(frame)
+        self.frames.append("\n".join(self.frame))
+        self.frame = []  # start a new frame
+
+    def expand_frame(self, line):
+        self.frame.append(line)
+
+    def trace_frames(self):
+        print("\n".join(self.frames))
+        self.frames = []
+
+    def reset_saved_frames(self):
+        self.frames = []
+
+    def dump_saved_frames(self):
+        print(f"\nDetected inf/nan during batch_number={self.batch_number}")
+        print(f"Last {len(self.frames)} forward frames:")
+        print(f"{'abs min':8} {'abs max':8} metadata")
+        print("\n".join(self.frames))
+        print("\n\n")
+        self.frames = []
+
+    def analyse_model(self):
+        # extract the fully qualified module names, to be able to report at run time. e.g.:
+        # encoder.block.2.layer.0.SelfAttention.o
+        #
+        # for shared weights only the first shared module name will be registered
+        self.module_names = {m: name for name, m in self.model.named_modules()}
+        # self.longest_module_name = max(len(v) for v in self.module_names.values())
+
+    def analyse_variable(self, var, ctx):
+        if torch.is_tensor(var):
+            self.expand_frame(get_abs_min_max(var, ctx))
+            if detect_overflow(var, ctx):
+                self.detected_overflow = True
+        elif var is None:
+            self.expand_frame(f"{'None':>17} {ctx}")
+        else:
+            self.expand_frame(f"{'not a tensor':>17} {ctx}")
+
+    def batch_start_frame(self):
+        self.expand_frame(f"\n\n{self.prefix} *** Starting batch number={self.batch_number} ***")
+        self.expand_frame(f"{'abs min':8} {'abs max':8} metadata")
+
+    def batch_end_frame(self):
+        self.expand_frame(f"{self.prefix} *** Finished batch number={self.batch_number - 1} ***\n\n")
+
+    def create_frame(self, module, input, output):
+        self.expand_frame(f"{self.prefix} {self.module_names[module]} {module.__class__.__name__}")
+
+        # params
+        for name, p in module.named_parameters(recurse=False):
+            self.analyse_variable(p, name)
+
+        # inputs
+        if isinstance(input, tuple):
+            for i, x in enumerate(input):
+                self.analyse_variable(x, f"input[{i}]")
+        else:
+            self.analyse_variable(input, "input")
+
+        # outputs
+        if isinstance(output, tuple):
+            for i, x in enumerate(output):
+                # possibly a tuple of tuples
+                if isinstance(x, tuple):
+                    for j, y in enumerate(x):
+                        self.analyse_variable(y, f"output[{i}][{j}]")
+                else:
+                    self.analyse_variable(x, f"output[{i}]")
+        else:
+            self.analyse_variable(output, "output")
+
+        self.save_frame()
+
+    def register_forward_hook(self):
+        self.model.apply(self._register_forward_hook)
+
+    def _register_forward_hook(self, module):
+        module.register_forward_hook(self.forward_hook)
+
+    def forward_hook(self, module, input, output):
+        # - input is a tuple of packed inputs (could be non-Tensors)
+        # - output could be a Tensor or a tuple of Tensors and non-Tensors
+
+        last_frame_of_batch = False
+
+        trace_mode = self.batch_number in self.trace_batch_nums
+        if trace_mode:
+            self.reset_saved_frames()
+
+        if self.total_calls == 0:
+            self.batch_start_frame()
+        self.total_calls += 1
+
+        # count batch numbers - the hook registered on the top-level model fires last, after the hooks of all
+        # its submodules, so when it fires we know this batch has finished
+        if module == self.model:
+            self.batch_number += 1
+            last_frame_of_batch = True
+
+        self.create_frame(module, input, output)
+
+        # if last_frame_of_batch:
+        #     self.batch_end_frame()
+
+        if trace_mode:
+            self.trace_frames()
+
+        if last_frame_of_batch:
+            self.batch_start_frame()
+
+        if self.detected_overflow and not trace_mode:
+            self.dump_saved_frames()
+
+            # now we can abort, as it's pointless to continue running
+            raise ValueError(
+                "DebugUnderflowOverflow: inf/nan detected, aborting as there is no point running further. "
+                "Please scroll up above this traceback to see the activation values prior to this event."
+            )
+
+        # abort after certain batch if requested to do so
+        if self.abort_after_batch_num is not None and self.batch_number > self.abort_after_batch_num:
+            raise ValueError(
+                f"DebugUnderflowOverflow: aborting after {self.batch_number} batches due to"
+                f" `abort_after_batch_num={self.abort_after_batch_num}` arg"
+            )
+
+
+def get_abs_min_max(var, ctx):
+    abs_var = var.abs()
+    return f"{abs_var.min():8.2e} {abs_var.max():8.2e} {ctx}"
+
+
+def detect_overflow(var, ctx):
+    """
+    Report whether the tensor contains any `nan` or `inf` entries.
+
+    This is useful for detecting overflows/underflows and best to call right after the function that did some math that
+    modified the tensor in question.
+
+    This function contains a few other helper features that you can enable and tweak directly if you want to track
+    various other things.
+
+    Args:
+        var: the tensor variable to check
+        ctx: the message to print as a context
+
+    Return:
+        `True` if `inf` or `nan` was detected, `False` otherwise
+    """
+    detected = False
+    if torch.isnan(var).any().item():
+        detected = True
+        print(f"{ctx} has nans")
+    if torch.isinf(var).any().item():
+        detected = True
+        print(f"{ctx} has infs")
+
+    # if needed to monitor large elements can enable the following
+    if 0:  # and detected:
+        n100 = var[torch.ge(var.abs(), 100)]
+        if n100.numel() > 0:
+            print(f"{ctx}:  n100={n100.numel()}")
+        n1000 = var[torch.ge(var.abs(), 1000)]
+        if n1000.numel() > 0:
+            print(f"{ctx}: n1000={n1000.numel()}")
+        n10000 = var[torch.ge(var.abs(), 10000)]
+        if n10000.numel() > 0:
+            print(f"{ctx}: n10000={n10000.numel()}")
+
+    if 0:
+        print(f"min={var.min():9.2e} max={var.max():9.2e}")
+
+    if 0:
+        print(f"min={var.min():9.2e} max={var.max():9.2e} var={var.var():9.2e} mean={var.mean():9.2e} ({ctx})")
+
+    return detected
+
+
+class DebugOption(ExplicitEnum):
+    UNDERFLOW_OVERFLOW = "underflow_overflow"
+    TPU_METRICS_DEBUG = "tpu_metrics_debug"
+```
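
The overflow walked through in the `DebugUnderflowOverflow` docstring above can be checked with plain arithmetic. The sketch below assumes a dropout probability of 0.1 (an illustrative value, not stated in the source) and uses the standard-library `struct` module's half-precision format to test whether a value still fits in fp16. In training mode, dropout scales the surviving elements by `1 / (1 - p)`.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value):
    """True if `value` can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


abs_max_before_dropout = 6.27e4  # from the report above: output of DenseReluDense
dropout_p = 0.1  # assumed dropout probability, for illustration only
abs_max_after_dropout = abs_max_before_dropout / (1.0 - dropout_p)

print(fits_in_fp16(abs_max_before_dropout))  # True
print(fits_in_fp16(abs_max_after_dropout))   # False
```

Under this assumption the rescaled value is roughly 69.7K, which is past `FP16_MAX`, consistent with the `inf` in the final `Dropout` frame.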
@@ -0,0 +1,745 @@
+Source: https://huggingface.co/learn/llm-course/chapter8/4
+Title: Debugging the training pipeline (PyTorch variant) - Sylvain Gugger et al., HF LLM Course ch 8.4 (2022)
+Fetched-via: uvx markitdown https://huggingface.co/learn/llm-course/chapter8/4
+Fetch-status: verbatim
+
+# Debugging the training pipeline[[debugging-the-training-pipeline]]
+
+You've written a beautiful script to train or fine-tune a model on a given task, dutifully following the advice from [Chapter 7](/course/chapter7). But when you launch the command `trainer.train()`, something horrible happens: you get an error 😱! Or worse, everything seems to be fine and the training runs without error, but the resulting model is crappy. In this section, we will show you what you can do to debug these kinds of issues.
+
+## Debugging the training pipeline[[debugging-the-training-pipeline]]
+
+The problem when you encounter an error in `trainer.train()` is that it could come from multiple sources, as the `Trainer` usually puts together lots of things. It converts datasets to dataloaders, so the problem could be something wrong in your dataset, or some issue when trying to batch elements of the datasets together. Then it takes a batch of data and feeds it to the model, so the problem could be in the model code. After that, it computes the gradients and performs the optimization step, so the problem could also be in your optimizer. And even if everything goes well for training, something could still go wrong during the evaluation if there is a problem with your metric.
+
+The best way to debug an error that arises in `trainer.train()` is to manually go through this whole pipeline to see where things went awry. The error is then often very easy to solve.
+
+To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):
+
+```py
+from datasets import load_dataset
+import evaluate
+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    TrainingArguments,
+    Trainer,
+)
+
+raw_datasets = load_dataset("glue", "mnli")
+
+model_checkpoint = "distilbert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+def preprocess_function(examples):
+    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
+
+tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
+
+args = TrainingArguments(
+    f"distilbert-finetuned-mnli",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+)
+
+metric = evaluate.load("glue", "mnli")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    return metric.compute(predictions=predictions, references=labels)
+
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=raw_datasets["train"],
+    eval_dataset=raw_datasets["validation_matched"],
+    compute_metrics=compute_metrics,
+)
+trainer.train()
+```
+
+If you try to execute it, you will be met with a rather cryptic error:
+
+```python out
+'ValueError: You have to specify either input_ids or inputs_embeds'
+```
+
+### Check your data[[check-your-data]]
+
+This goes without saying, but if your data is corrupted, the `Trainer` is not going to be able to form batches, let alone train your model. So first things first, you need to have a look at what is inside your training set.
+
+To avoid countless hours spent trying to fix something that is not the source of the bug, we recommend you use `trainer.train_dataset` for your checks and nothing else. So let's do that here:
+
+```py
+trainer.train_dataset[0]
+```
+
+```python out
+{'hypothesis': 'Product and geography are what make cream skimming work. ',
+ 'idx': 0,
+ 'label': 1,
+ 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}
+```
+
+Do you notice something wrong? This, in conjunction with the error message about `input_ids` missing, should make you realize those are texts, not numbers the model can make sense of. Here, the original error is very misleading because the `Trainer` automatically removes the columns that don't match the model signature (that is, the arguments expected by the model). That means here, everything apart from the labels was discarded. There was thus no issue with creating batches and then sending them to the model, which in turn complained it didn't receive the proper input.
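
The column-dropping behaviour described above can be illustrated with a small standalone sketch. It is not the `Trainer`'s actual code: `fake_forward`, `keep_signature_columns`, and the `label_columns` default are hypothetical names introduced here, and the sketch only mimics the idea of keeping the keys a model's forward signature accepts, plus label columns.

```python
import inspect


def fake_forward(input_ids=None, attention_mask=None, labels=None):
    """Stand-in for a model's forward method (hypothetical signature)."""


def keep_signature_columns(example, forward_fn, label_columns=("label", "label_ids")):
    """Drop keys that `forward_fn` does not accept, always keeping label columns."""
    accepted = set(inspect.signature(forward_fn).parameters) | set(label_columns)
    return {key: value for key, value in example.items() if key in accepted}


raw_example = {
    "hypothesis": "Product and geography are what make cream skimming work. ",
    "idx": 0,
    "label": 1,
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
}
print(keep_signature_columns(raw_example, fake_forward))  # {'label': 1}
```

With untokenized data only the label survives, so the model is called without `input_ids`, which matches the error message seen earlier.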
+
+Why wasn't the data processed? We did use the `Dataset.map()` method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the `Trainer`. Instead of using `tokenized_datasets` here, we used `raw_datasets` 🤦. So let's fix this!
+
+```py
+from datasets import load_dataset
+import evaluate
+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    TrainingArguments,
+    Trainer,
+)
+
+raw_datasets = load_dataset("glue", "mnli")
+
+model_checkpoint = "distilbert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+def preprocess_function(examples):
+    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
+
+tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
+
+args = TrainingArguments(
+    f"distilbert-finetuned-mnli",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+)
+
+metric = evaluate.load("glue", "mnli")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    return metric.compute(predictions=predictions, references=labels)
+
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation_matched"],
+    compute_metrics=compute_metrics,
+)
+trainer.train()
+```
+
+This new code will now give a different error (progress!):
+
+```python out
+'ValueError: expected sequence of length 43 at dim 1 (got 37)'
+```
+
+Looking at the traceback, we can see the error happens in the data collation step:
+
+```python out
+~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
+    105                 batch[k] = torch.stack([f[k] for f in features])
+    106             else:
+--> 107                 batch[k] = torch.tensor([f[k] for f in features])
+    108
+    109     return batch
+```
+
+So, we should move to that. Before we do, however, let's finish inspecting our data, just to be 100% sure it's correct.
+
+One thing you should always do when debugging a training session is have a look at the decoded inputs of your model. We can't make sense of the numbers that we feed it directly, so we should look at what those numbers represent. In computer vision, for example, that means looking at the decoded pictures of the pixels you pass, in speech it means listening to the decoded audio samples, and for our NLP example here it means using our tokenizer to decode the inputs:
+
+```py
+tokenizer.decode(trainer.train_dataset[0]["input_ids"])
+```
+
+```python out
+'[CLS] conceptually cream skimming has two basic dimensions - product and geography. [SEP] product and geography are what make cream skimming work. [SEP]'
+```
+
+So that seems correct. You should do this for all the keys in the inputs:
+
+```py
+trainer.train_dataset[0].keys()
+```
+
+```python out
+dict_keys(['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'])
+```
+
+Note that the keys that don't correspond to inputs accepted by the model will be automatically discarded, so here we will only keep `input_ids`, `attention_mask`, and `label` (which will be renamed `labels`). To double-check the model signature, you can print the class of your model, then go check its documentation:
+
+```py
+type(trainer.model)
+```
+
+```python out
+transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification
+```
+
+So in our case, we can check the parameters accepted on [this page](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification). The `Trainer` will also log the columns it's discarding.
+
+We have checked that the input IDs are correct by decoding them. Next is the `attention_mask`:
+
+```py
+trainer.train_dataset[0]["attention_mask"]
+```
+
+```python out
+[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
+```
+
+Since we didn't apply padding in our preprocessing, this seems perfectly natural. To be sure there is no issue with that attention mask, let's check it is the same length as our input IDs:
+
+```py
+len(trainer.train_dataset[0]["attention_mask"]) == len(
+    trainer.train_dataset[0]["input_ids"]
+)
+```
+
+```python out
+True
+```
+
+That's good! Lastly, let's check our label:
+
+```py
+trainer.train_dataset[0]["label"]
+```
+
+```python out
+1
+```
+
+Like the input IDs, this is a number that doesn't really make sense on its own. As we saw before, the map between integers and label names is stored inside the `names` attribute of the corresponding *feature* of the dataset:
+
+```py
+trainer.train_dataset.features["label"].names
+```
+
+```python out
+['entailment', 'neutral', 'contradiction']
+```
+
+So `1` means `neutral`, which means the two sentences we saw above are not in contradiction, and the first one does not imply the second one. That seems correct!
+
+We don't have token type IDs here, since DistilBERT does not expect them; if you have some in your model, you should also make sure that they properly match where the first and second sentences are in the input.
+
+> [!TIP]
+> ✏️ **Your turn!** Check that everything seems correct with the second element of the training dataset.
+
+We are only doing the check on the training set here, but you should of course double-check the validation and test sets the same way.
+
+Now that we know our datasets look good, it's time to check the next step of the training pipeline.
+
+### From datasets to dataloaders[[from-datasets-to-dataloaders]]
+
+The next thing that can go wrong in the training pipeline is when the `Trainer` tries to form batches from the training or validation set. Once you are sure the `Trainer`'s datasets are correct, you can try to manually form a batch by executing the following (replace `train` with `eval` for the validation dataloader):
+
+```py
+for batch in trainer.get_train_dataloader():
+    break
+```
+
+This code creates the training dataloader, then iterates through it, stopping at the first iteration. If the code executes without error, you have the first training batch that you can inspect, and if the code errors out, you know for sure the problem is in the dataloader, as is the case here:
+
+```python out
+~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
+    105                 batch[k] = torch.stack([f[k] for f in features])
+    106             else:
+--> 107                 batch[k] = torch.tensor([f[k] for f in features])
+    108
+    109     return batch
+
+ValueError: expected sequence of length 45 at dim 1 (got 76)
+```
+
+Inspecting the last frame of the traceback should be enough to give you a clue, but let's do a bit more digging. Most of the problems during batch creation arise because of the collation of examples into a single batch, so the first thing to check when in doubt is what `collate_fn` your `DataLoader` is using:
+
+```py
+data_collator = trainer.get_train_dataloader().collate_fn
+data_collator
+```
+
+```python out
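+<function transformers.data.data_collator.default_data_collator(features: List[InputDataClass], return_tensors='pt') -> Dict[str, Any]>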
+```
+
+So this is the `default_data_collator`, but that's not what we want in this case. We want to pad our examples to the longest sentence in the batch, which is done by the `DataCollatorWithPadding` collator. And this data collator is supposed to be used by default by the `Trainer`, so why is it not used here?
+
+The answer is because we did not pass the `tokenizer` to the `Trainer`, so it couldn't create the `DataCollatorWithPadding` we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let's adapt our code to do exactly that:
+
+```py
+from datasets import load_dataset
+import evaluate
+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    DataCollatorWithPadding,
+    TrainingArguments,
+    Trainer,
+)
+
+raw_datasets = load_dataset("glue", "mnli")
+
+model_checkpoint = "distilbert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+def preprocess_function(examples):
+    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
+
+tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
+
+args = TrainingArguments(
+    f"distilbert-finetuned-mnli",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+)
+
+metric = evaluate.load("glue", "mnli")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    return metric.compute(predictions=predictions, references=labels)
+
+data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation_matched"],
+    compute_metrics=compute_metrics,
+    data_collator=data_collator,
+    tokenizer=tokenizer,
+)
+trainer.train()
+```
+
+The good news? We don't get the same error as before, which is definitely progress. The bad news? We get an infamous CUDA error instead:
+
+```python out
+RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
+```
+
+This is bad because CUDA errors are extremely hard to debug in general. We will see in a minute how to solve this, but first let's finish our analysis of batch creation.
+
+If you are sure your data collator is the right one, you should try to apply it on a couple of samples of your dataset:
+
+```py
+data_collator = trainer.get_train_dataloader().collate_fn
+batch = data_collator([trainer.train_dataset[i] for i in range(4)])
+```
+
+This code will fail because the `train_dataset` contains string columns, which the `Trainer` usually removes. You can remove them manually, or if you want to replicate exactly what the `Trainer` is doing behind the scenes, you can call the private `Trainer._remove_unused_columns()` method that does that:
+
+```py
+data_collator = trainer.get_train_dataloader().collate_fn
+actual_train_set = trainer._remove_unused_columns(trainer.train_dataset)
+batch = data_collator([actual_train_set[i] for i in range(4)])
+```
+
+You should then be able to manually debug what happens inside the data collator if the error persists.
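
To see what padding to the longest sentence in the batch means in practice, here is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and `pad_token_id=0` is an assumption (check `tokenizer.pad_token_id` for your tokenizer).

```python
def pad_batch(features, pad_token_id=0):
    """Pad `input_ids` and `attention_mask` to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 2088, 999, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print([len(row) for row in batch["input_ids"]])  # [5, 5]
```

Once every row has the same length, the rows can be stacked into a rectangular tensor, which is exactly what failed in the traceback above.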
+
+Now that we've debugged the batch creation process, it's time to pass one through the model!
+
+### Going through the model[[going-through-the-model]]
+
+You should be able to get a batch by executing the following command:
+
+```py
+for batch in trainer.get_train_dataloader():
+    break
+```
+
+If you're running this code in a notebook, you may get a CUDA error that's similar to the one we saw earlier, in which case you need to restart your notebook and reexecute the last snippet without the `trainer.train()` line. That's the second most annoying thing about CUDA errors: they irremediably break your kernel. The most annoying thing about them is the fact that they are hard to debug.
+
+Why is that? It has to do with the way GPUs work. They are extremely efficient at executing a lot of operations in parallel, but the drawback is that when one of those instructions results in an error, you don't know it instantly. It's only when the program calls a synchronization of the multiple processes on the GPU that it will realize something went wrong, so the error is actually raised at a place that has nothing to do with what created it. For instance, if we look at our previous traceback, the error was raised during the backward pass, but we will see in a minute that it actually stems from something in the forward pass.
+
+So how do we debug those errors? The answer is easy: we don't. Unless your CUDA error is an out-of-memory error (which means there is not enough memory in your GPU), you should always go back to the CPU to debug it.
+
+To do this in our case, we just have to put the model back on the CPU and call it on our batch -- the batch returned by the `DataLoader` has not been moved to the GPU yet:
+
+```python
+outputs = trainer.model.cpu()(**batch)
+```
+
+```python out
+~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
+   2386         )
+   2387     if dim == 2:
+-> 2388         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
+   2389     elif dim == 4:
+   2390         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
+
+IndexError: Target 2 is out of bounds.
+```
+
+So, the picture is getting clearer. Instead of having a CUDA error, we now have an `IndexError` in the loss computation (so nothing to do with the backward pass, as we said earlier). More precisely, we can see that it's target 2 that creates the error, so this is a very good moment to check the number of labels of our model:
+
+```python
+trainer.model.config.num_labels
+```
+
+```python out
+2
+```
+
+With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn't tell that to our model, which should have been created with three labels. So let's fix that!
+
+```py
+from datasets import load_dataset
+import evaluate
+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    DataCollatorWithPadding,
+    TrainingArguments,
+    Trainer,
+)
+
+raw_datasets = load_dataset("glue", "mnli")
+
+model_checkpoint = "distilbert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+def preprocess_function(examples):
+    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
+
+tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
+
+args = TrainingArguments(
+    f"distilbert-finetuned-mnli",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+)
+
+metric = evaluate.load("glue", "mnli")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    return metric.compute(predictions=predictions, references=labels)
+
+data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation_matched"],
+    compute_metrics=compute_metrics,
+    data_collator=data_collator,
+    tokenizer=tokenizer,
+)
+```
+
+We aren't including the `trainer.train()` line yet, to take the time to check that everything looks good. If we request a batch and pass it to our model, it now works without error!
+
+```py
+for batch in trainer.get_train_dataloader():
+    break
+
+outputs = trainer.model.cpu()(**batch)
+```
+
+The next step is then to move back to the GPU and check that everything still works:
+
+```py
+import torch
+
+device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+batch = {k: v.to(device) for k, v in batch.items()}
+
+outputs = trainer.model.to(device)(**batch)
+```
+
+If you still get an error, make sure you restart your notebook and only execute the last version of the script.
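+
+The label mismatch we just diagnosed (a target of 2 when `num_labels` is 2) can also be caught before any forward pass. The sketch below is only an illustration and not part of the original script: `out_of_range_labels` is a hypothetical helper, and it assumes you have already pulled the label column out of your dataset as plain Python integers.
+
+```py
+def out_of_range_labels(labels, num_labels):
+    """Return the sorted label ids that fall outside [0, num_labels)."""
+    return sorted({y for y in labels if not 0 <= y < num_labels})
+
+
+# Hypothetical label column with three classes (0, 1, 2).
+labels = [0, 2, 1, 1, 0, 2]
+
+print(out_of_range_labels(labels, num_labels=2))  # [2]
+print(out_of_range_labels(labels, num_labels=3))  # []
+```
+
+An empty list means every target is a valid class index for the model head; anything else would trigger the `IndexError` seen on the CPU.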
+
+### Performing one optimization step[[performing-one-optimization-step]]
+
+Now that we know that we can build batches that actually go through the model, we are ready for the next step of the training pipeline: computing the gradients and performing an optimization step.
+
+The first part is just a matter of calling the `backward()` method on the loss:
+
+```py
+loss = outputs.loss
+loss.backward()
+```
+
+It's pretty rare to get an error at this stage, but if you do get one, make sure to go back to the CPU to get a helpful error message.
+
+To perform the optimization step, we just need to create the `optimizer` and call its `step()` method:
+
+```py
+trainer.create_optimizer()
+trainer.optimizer.step()
+```
+
+Again, if you're using the default optimizer in the `Trainer`, you shouldn't get an error at this stage, but if you have a custom optimizer, there might be some problems to debug here. Don't forget to go back to the CPU if you get a weird CUDA error at this stage. Speaking of CUDA errors, earlier we mentioned a special case. Let's have a look at that now.
+
+### Dealing with CUDA out-of-memory errors[[dealing-with-cuda-out-of-memory-errors]]
+
+Whenever you get an error message that starts with `RuntimeError: CUDA out of memory`, this indicates that you are out of GPU memory. This is not directly linked to your code, and it can happen with a script that runs perfectly fine. This error means that you tried to put too many things in the internal memory of your GPU, and that resulted in an error. Like with other CUDA errors, you will need to restart your kernel to be in a spot where you can run your training again.
+
+To solve this issue, you just need to use less GPU space -- something that is often easier said than done. First, make sure you don't have two models on the GPU at the same time (unless that's required for your problem, of course). Then, you should probably reduce your batch size, as it directly affects the sizes of all the intermediate outputs of the model and their gradients. If the problem persists, consider using a smaller version of your model.
+
+> [!TIP]
+> In the next part of the course, we'll look at more advanced techniques that can help you reduce your memory footprint and let you fine-tune the biggest models.
+
+### Evaluating the model[[evaluating-the-model]]
+
+Now that we've solved all the issues with our code, everything is perfect and the training should run smoothly, right? Not so fast! If you run the `trainer.train()` command, everything will look good at first, but after a while you will get the following:
+
+```py
+# This will take a long time and error out, so you shouldn't run this cell
+trainer.train()
+```
+
+```python out
+TypeError: only size-1 arrays can be converted to Python scalars
+```
+
+You will realize this error appears during the evaluation phase, so this is the last thing we will need to debug.
+
+You can run the evaluation loop of the `Trainer` independently from the training like this:
+
+```py
+trainer.evaluate()
+```
+
+```python out
+TypeError: only size-1 arrays can be converted to Python scalars
+```
+
+> [!TIP]
+> 💡 You should always make sure you can run `trainer.evaluate()` before launching `trainer.train()`, to avoid wasting lots of compute resources before hitting an error.
+
+Before attempting to debug a problem in the evaluation loop, you should first make sure that you've had a look at the data, are able to form a batch properly, and can run your model on it. We've completed all of those steps, so the following code can be executed without error:
+
+```py
+for batch in trainer.get_eval_dataloader():
+    break
+
+batch = {k: v.to(device) for k, v in batch.items()}
+
+with torch.no_grad():
+    outputs = trainer.model(**batch)
+```
+
+The error comes later, at the end of the evaluation phase, and if we look at the traceback we see this:
+
+```python trace
+~/git/datasets/src/datasets/metric.py in add_batch(self, predictions, references)
+    431         """
+    432         batch = {"predictions": predictions, "references": references}
+--> 433         batch = self.info.features.encode_batch(batch)
+    434         if self.writer is None:
+    435             self._init_writer()
+```
+
+This tells us that the error originates in the `datasets/metric.py` module -- so this is a problem with our `compute_metrics()` function. It takes a tuple with the logits and the labels as NumPy arrays, so let's try to feed it that:
+
+```py
+predictions = outputs.logits.cpu().numpy()
+labels = batch["labels"].cpu().numpy()
+
+compute_metrics((predictions, labels))
+```
+
+```python out
+TypeError: only size-1 arrays can be converted to Python scalars
+```
+
+We get the same error, so the problem definitely lies with that function. If we look back at its code, we see it's just forwarding the `predictions` and the `labels` to `metric.compute()`. So is there a problem with that method? Not really. Let's have a quick look at the shapes:
+
+```py
+predictions.shape, labels.shape
+```
+
+```python out
+((8, 3), (8,))
+```
+
+Our predictions are still logits, not the actual predictions, which is why the metric is returning this (somewhat obscure) error. The fix is pretty easy; we just have to add an argmax in the `compute_metrics()` function:
+
+```py
+import numpy as np
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    predictions = np.argmax(predictions, axis=1)
+    return metric.compute(predictions=predictions, references=labels)
+
+compute_metrics((predictions, labels))
+```
+
+```python out
+{'accuracy': 0.625}
+```
+
+Now our error is fixed! This was the last one, so our script will now train a model properly.
+
+For reference, here is the completely fixed script:
+
+```py
+import numpy as np
+from datasets import load_dataset
+import evaluate
+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    DataCollatorWithPadding,
+    TrainingArguments,
+    Trainer,
+)
+
+raw_datasets = load_dataset("glue", "mnli")
+
+model_checkpoint = "distilbert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+def preprocess_function(examples):
+    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
+
+tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
+
+args = TrainingArguments(
+    f"distilbert-finetuned-mnli",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+)
+
+metric = evaluate.load("glue", "mnli")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    predictions = np.argmax(predictions, axis=1)
+    return metric.compute(predictions=predictions, references=labels)
+
+data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation_matched"],
+    compute_metrics=compute_metrics,
+    data_collator=data_collator,
+    tokenizer=tokenizer,
+)
+trainer.train()
+```
+
+In this instance, there are no more problems, and our script will fine-tune a model that should give reasonable results. But what can we do when the training proceeds without any error, and the model trained does not perform well at all? That's the hardest part of machine learning, and we'll show you a few techniques that can help.
+
+> [!TIP]
+> 💡 If you're using a manual training loop, the same steps apply to debug your training pipeline, but it's easier to separate them. Make sure you have not forgotten the `model.eval()` or `model.train()` at the right places, or the `zero_grad()` at each step, however!
+
+## Debugging silent errors during training[[debugging-silent-errors-during-training]]
+
+What can we do to debug a training that completes without error but doesn't get good results? We'll give you some pointers here, but be aware that this kind of debugging is the hardest part of machine learning, and there is no magical answer.
+
+### Check your data (again!)[[check-your-data-again]]
+
+Your model will only learn something if it's actually possible to learn anything from your data. If there is a bug that corrupts the data or the labels are attributed randomly, it's very likely that no model will train successfully on your dataset. So always start by double-checking your decoded inputs and labels, and ask yourself the following questions:
+
+- Is the decoded data understandable?
+- Do you agree with the labels?
+- Is there one label that's more common than the others?
+- What should the loss/metric be if the model predicted a random answer/always the same answer?
+
+> [!WARNING]
+> ⚠️ If you are doing distributed training, print samples of your dataset in each process and triple-check that you get the same thing. One common bug is to have some source of randomness in the data creation that makes each process have a different version of the dataset.
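+
+One way to compare what each process sees, beyond printing samples, is to print a fingerprint of the first few samples and check that it matches across ranks. This is a minimal sketch under our own assumptions: `dataset_fingerprint` is a hypothetical helper, and the samples are assumed to be JSON-serializable dictionaries.
+
+```py
+import hashlib
+import json
+
+
+def dataset_fingerprint(samples):
+    """Hash a sequence of samples; the result depends on content and order."""
+    digest = hashlib.sha256()
+    for sample in samples:
+        digest.update(json.dumps(sample, sort_keys=True).encode("utf-8"))
+    return digest.hexdigest()
+
+
+# Hypothetical samples; in practice each process would hash its own first rows.
+rank0 = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
+rank1 = [{"text": "b", "label": 1}, {"text": "a", "label": 0}]
+
+print(dataset_fingerprint(rank0) == dataset_fingerprint(rank0))  # True
+print(dataset_fingerprint(rank0) == dataset_fingerprint(rank1))  # False
+```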
+
+After looking at your data, go through a few of the model's predictions and decode them too. If the model is always predicting the same thing, it might be because your dataset is biased toward one category (for classification problems); techniques like oversampling rare classes might help.
+
+If the loss/metric you get on your initial model is very different from the loss/metric you would expect for random predictions, double-check the way your loss or metric is computed, as there is probably a bug there. If you are using several losses that you add at the end, make sure they are of the same scale.
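+
+To make "the loss/metric you would expect for random predictions" concrete: for a classifier trained with cross-entropy, a uniform guess over `C` classes gives a loss of `ln(C)`, and always predicting the most frequent label gives an accuracy equal to that label's share of the data. The helper names below are hypothetical and the label list is made up for illustration.
+
+```py
+import math
+from collections import Counter
+
+
+def chance_level_loss(num_classes):
+    """Cross-entropy of a uniform prediction over num_classes classes."""
+    return math.log(num_classes)
+
+
+def majority_class_accuracy(labels):
+    """Accuracy obtained by always predicting the most frequent label."""
+    counts = Counter(labels)
+    return max(counts.values()) / len(labels)
+
+
+print(round(chance_level_loss(3), 4))  # 1.0986
+
+labels = [0, 0, 0, 1, 2, 2]  # made-up, slightly unbalanced labels
+print(majority_class_accuracy(labels))  # 0.5
+```
+
+With three classes, an initial loss far from roughly 1.1 is a hint that the loss computation deserves a second look.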
+
+When you are sure your data is perfect, you can see if the model is capable of training on it with one simple test.
+
+### Overfit your model on one batch[[overfit-your-model-on-one-batch]]
+
+Overfitting is usually something we try to avoid when training, as it means the model is not learning to recognize the general features we want it to but is instead just memorizing the training samples. However, trying to train your model on one batch over and over again is a good test to check if the problem as you framed it can be solved by the model you are attempting to train. It will also help you see if your initial learning rate is too high.
+
+Doing this once you have defined your `Trainer` is really easy; just grab a batch of training data, then run a small manual training loop only using that batch for something like 20 steps:
+
+```py
+for batch in trainer.get_train_dataloader():
+    break
+
+batch = {k: v.to(device) for k, v in batch.items()}
+trainer.create_optimizer()
+
+for _ in range(20):
+    outputs = trainer.model(**batch)
+    loss = outputs.loss
+    loss.backward()
+    trainer.optimizer.step()
+    trainer.optimizer.zero_grad()
+```
+
+> [!TIP]
+> 💡 If your training data is unbalanced, make sure to build a batch of training data containing all the labels.
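+
+Building such a batch by hand amounts to picking indices so that every label shows up at least once. Here is one possible sketch; `indices_covering_all_labels` is a hypothetical helper that works on a plain list of labels, and how you then turn the indices into a batch depends on your dataset and collator.
+
+```py
+def indices_covering_all_labels(labels, batch_size):
+    """Pick up to batch_size indices so that every label appears at least once."""
+    first_seen = {}
+    for index, label in enumerate(labels):
+        first_seen.setdefault(label, index)
+    if len(first_seen) > batch_size:
+        raise ValueError("batch_size is smaller than the number of labels")
+
+    chosen = sorted(first_seen.values())
+    chosen_set = set(chosen)
+    # Fill the remaining slots with the earliest unused indices.
+    for index in range(len(labels)):
+        if len(chosen) == batch_size:
+            break
+        if index not in chosen_set:
+            chosen.append(index)
+    return sorted(chosen)
+
+
+labels = [0, 0, 0, 0, 1, 0, 0, 2]  # made-up unbalanced labels
+print(indices_covering_all_labels(labels, batch_size=4))  # [0, 1, 4, 7]
+```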
+
+The resulting model should have close-to-perfect results on the same `batch`. Let's compute the metric on the resulting predictions:
+
+```py
+with torch.no_grad():
+    outputs = trainer.model(**batch)
+preds = outputs.logits
+labels = batch["labels"]
+
+compute_metrics((preds.cpu().numpy(), labels.cpu().numpy()))
+```
+
+```python out
+{'accuracy': 1.0}
+```
+
+100% accuracy, now this is a nice example of overfitting (meaning that if you try your model on any other sentence, it will very likely give you a wrong answer)!
+
+If you don't manage to have your model obtain perfect results like this, it means there is something wrong with the way you framed the problem or your data, so you should fix that. Only when you manage to pass the overfitting test can you be sure that your model can actually learn something.
+
+> [!WARNING]
+> ⚠️ You will have to recreate your model and your `Trainer` after this test, as the model obtained probably won't be able to recover and learn something useful on your full dataset.
+
+### Don't tune anything until you have a first baseline[[dont-tune-anything-until-you-have-a-first-baseline]]
+
+Hyperparameter tuning is always emphasized as being the hardest part of machine learning, but it's just the last step to help you gain a little bit on the metric. Most of the time, the default hyperparameters of the `Trainer` will work just fine to give you good results, so don't launch into a time-consuming and costly hyperparameter search until you have something that beats the baseline you have on your dataset.
+
+Once you have a good enough model, you can start tweaking a bit. Don't try launching a thousand runs with different hyperparameters, but compare a couple of runs with different values for one hyperparameter to get an idea of which has the greatest impact.
+
+If you are tweaking the model itself, keep it simple and don't try anything you can't reasonably justify. Always make sure you go back to the overfitting test to verify that your change hasn't had any unintended consequences.
+
+### Ask for help[[ask-for-help]]
+
+Hopefully you will have found some advice in this section that helped you solve your issue, but if that's not the case, remember you can always ask the community on the [forums](https://discuss.huggingface.co/).
+
+Here are some additional resources that may prove helpful:
+
+- ["Reproducibility as a vehicle for engineering best practices"](https://docs.google.com/presentation/d/1yHLPvPhUs2KGI5ZWo0sU-PKU3GimAk3iTsI38Z-B5Gw/edit#slide=id.p) by Joel Grus
+- ["Checklist for debugging neural networks"](https://towardsdatascience.com/checklist-for-debugging-neural-networks-d8b2a9434f21) by Cecelia Shao
+- ["How to unit test machine learning code"](https://medium.com/@keeper6928/how-to-unit-test-machine-learning-code-57cf6fd81765) by Chase Roberts
+- ["A Recipe for Training Neural Networks"](http://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy
+
+Of course, not every problem you encounter when training neural nets is your own fault! If you encounter something in the 🤗 Transformers or 🤗 Datasets library that does not seem right, you may have encountered a bug. You should definitely tell us all about it, and in the next section we'll explain exactly how to do that.
+
+
@@ -0,0 +1,314 @@
+Source: https://docs.unsloth.ai/basics/troubleshooting-and-faqs
+Title: Troubleshooting & FAQs - Unsloth Documentation, Daniel & Michael Han-Chen (2025)
+Fetched-via: curl -sL https://docs.unsloth.ai/basics/troubleshooting-and-faqs.md (GitBook raw markdown endpoint; uvx markitdown returned raw HTML junk)
+Fetch-status: verbatim, trailing "Agent Instructions" doc-query boilerplate trimmed
+
+# Troubleshooting & FAQs
+
+If you're still encountering any issues with versions or dependencies, please use our [Docker image](/docs/get-started/install/docker.md) which will have everything pre-installed.
+
+{% hint style="success" %}
+**Always try updating Unsloth first if you run into any issues.**
+
+`pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo`
+{% endhint %}
+
+### Fine-tuning a new model not supported by Unsloth?
+
+Unsloth works with any model supported by `transformers`. If a model isn’t in our uploads or doesn’t run out of the box, it’s usually still supported; some newer models may just need a small manual tweak due to our optimizations.
+
+In most cases, you can enable compatibility by setting `trust_remote_code=True` in your fine-tuning script. Here’s an example using [DeepSeek-OCR](/docs/models/tutorials/deepseek-ocr-how-to-run-and-fine-tune.md):
+
+<pre class="language-python" data-expandable="true"><code class="lang-python">from unsloth import FastVisionModel
+from transformers import AutoModel
+from huggingface_hub import snapshot_download
+
+snapshot_download("unsloth/DeepSeek-OCR", local_dir = "deepseek_ocr")
+model, tokenizer = FastVisionModel.from_pretrained(
+    "./deepseek_ocr",
+    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    auto_model = AutoModel,
+    <a data-footnote-ref href="#user-content-fn-1">trust_remote_code = True</a>, # Enable to support new models
+    unsloth_force_compile = True,
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+</code></pre>
+
+### Running in Unsloth works well, but after exporting & running on other platforms, the results are poor
+
+You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor or you might get gibberish, endless/infinite generations *or* repeated outputs.
+
+* The most common cause of this error is using an <mark style="background-color:blue;">**incorrect chat template**</mark>**.** It’s essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.
+* It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, leaves out one that is required), so make sure you check both hypotheses!
+* <mark style="background-color:green;">**Use our conversational notebooks to force the chat template - this will fix most issues.**</mark>
+  * Qwen-3 14B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(14B\)-Reasoning-Conversational.ipynb)
+  * Gemma-3 4B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_\(4B\).ipynb)
+  * Llama-3.2 3B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_\(1B_and_3B\)-Conversational.ipynb)
+  * Phi-4 14B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
+  * Mistral v0.3 7B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_\(7B\)-Conversational.ipynb)
+  * **More notebooks in our** [**notebooks docs**](/docs/get-started/unsloth-notebooks.md)
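+
+The "start of sequence" hypothesis above can be checked by counting how many BOS tokens open the prompt that actually reaches the model. The sketch below is an illustration only: `count_leading_bos` is a hypothetical helper and the token ids are made up. In practice you would take the ids and the BOS id from your own tokenizer and inference engine.
+
+```python
+def count_leading_bos(token_ids, bos_token_id):
+    """Count how many consecutive BOS tokens start the sequence."""
+    count = 0
+    for token_id in token_ids:
+        if token_id != bos_token_id:
+            break
+        count += 1
+    return count
+
+
+# Made-up ids: 1 stands in for the BOS token.
+print(count_leading_bos([1, 1, 523, 88], bos_token_id = 1))  # 2 -> BOS was added twice
+print(count_leading_bos([523, 88], bos_token_id = 1))        # 0 -> BOS is missing
+```
+
+A count other than the one used during training (often exactly 1) suggests the chat template and the inference engine are both adding, or both omitting, the token.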
+
+### Saving to GGUF / vLLM 16bit crashes
+
+You can try reducing the maximum GPU usage during saving by changing `maximum_memory_usage`.
+
+The default is `model.save_pretrained(..., maximum_memory_usage = 0.75)`. Reduce it to say 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.
+
+### How do I manually save to GGUF?
+
+First save your model to 16bit via:
+
+```python
+model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
+```
+
+Compile llama.cpp from source like below:
+
+```bash
+apt-get update
+apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
+git clone https://github.com/ggml-org/llama.cpp
+cmake llama.cpp -B llama.cpp/build \
+    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
+cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
+cp llama.cpp/build/bin/llama-* llama.cpp
+```
+
+Then, save the model to F16:
+
+```bash
+python llama.cpp/convert_hf_to_gguf.py merged_model \
+    --outfile model-F16.gguf --outtype f16 \
+    --split-max-size 50G
+```
+
+```bash
+# For BF16:
+python llama.cpp/convert_hf_to_gguf.py merged_model \
+    --outfile model-BF16.gguf --outtype bf16 \
+    --split-max-size 50G
+    
+# For Q8_0:
+python llama.cpp/convert_hf_to_gguf.py merged_model \
+    --outfile model-Q8_0.gguf --outtype q8_0 \
+    --split-max-size 50G
+```
+
+### Why is Q8\_K\_XL slower than Q8\_0 GGUF?
+
+On Mac devices, it seems that BF16 might be slower than F16. Q8\_K\_XL upcasts some layers to BF16, hence the slowdown. We are actively changing our conversion process to make F16 the default choice for Q8\_K\_XL to reduce performance hits.
+
+### How to do Evaluation
+
+To set up evaluation in your training run, you first have to split your dataset into a training and test split. You should <mark style="background-color:green;">**always shuffle the selection of the dataset**</mark>, otherwise your evaluation is wrong!
+
+```python
+new_dataset = dataset.train_test_split(
+    test_size = 0.01, # 1% for test size can also be an integer for # of rows
+    shuffle = True, # Should always set to True!
+    seed = 3407,
+)
+
+train_dataset = new_dataset["train"] # Dataset for training
+eval_dataset = new_dataset["test"] # Dataset for evaluation
+```
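+
+To see why an unshuffled split can make evaluation misleading, consider a dataset whose rows are grouped by class. The sketch below does not reproduce how `train_test_split` is implemented; `tail_split` is a hypothetical stand-in that takes the last rows as the test set, and the rows are made up.
+
+```python
+import random
+
+
+def tail_split(rows, n_test, shuffle, seed = 3407):
+    """Use the last n_test rows as the test set, optionally shuffling first."""
+    rows = list(rows)
+    if shuffle:
+        random.Random(seed).shuffle(rows)
+    return rows[:-n_test], rows[-n_test:]
+
+
+# Made-up dataset sorted by class: 90 rows of "A" followed by 10 rows of "B".
+rows = [("example", "A")] * 90 + [("example", "B")] * 10
+
+train_rows, test_rows = tail_split(rows, n_test = 10, shuffle = False)
+print({label for _, label in test_rows})   # {'B'} -> the test set has a single class
+print({label for _, label in train_rows})  # {'A'} -> the model never sees "B"
+```
+
+Passing `shuffle = True` mixes the rows before the cut, so both splits are drawn from the same distribution.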
+
+Then, we can set the training arguments to enable evaluation. Keep in mind that evaluation can be very slow, especially if you set `eval_steps = 1`, which means you are evaluating after every single training step. If you do, try reducing the eval\_dataset size to around 100 rows.
+
+```python
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    args = SFTConfig(
+        fp16_full_eval = True,         # Set this to reduce memory usage
+        per_device_eval_batch_size = 2,# Increasing this will use more memory
+        eval_accumulation_steps = 4,   # You can increase this instead of batch_size
+        eval_strategy = "steps",       # Runs eval every few steps or epochs.
+        eval_steps = 1,                # Number of training steps between evaluations
+    ),
+    train_dataset = new_dataset["train"],
+    eval_dataset = new_dataset["test"],
+    ...
+)
+trainer.train()
+```
+
+### Evaluation Loop - Out of Memory or crashing.
+
+A common cause of OOM during evaluation is a batch size that is set too high. Set it to 2 or lower to use less VRAM. Also use `fp16_full_eval=True` to run evaluation in float16, which cuts memory use in half.
+
+First split your training dataset into a train and test split. Set the trainer settings for evaluation to:
+
+```python
+new_dataset = dataset.train_test_split(test_size = 0.01)
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    args = SFTConfig(
+        fp16_full_eval = True,
+        per_device_eval_batch_size = 2,
+        eval_accumulation_steps = 4,
+        eval_strategy = "steps",
+        eval_steps = 1,
+    ),
+    train_dataset = new_dataset["train"],
+    eval_dataset = new_dataset["test"],
+    ...
+)
+```
+
+This should avoid OOMs and make evaluation somewhat faster. You can also use `bf16_full_eval=True` on machines that support bf16. As of June 2025, Unsloth should set these flags by default.
+
+### How do I do Early Stopping?
+
+If you want to stop the finetuning / training run since the evaluation loss is not decreasing, then you can use early stopping which stops the training process. Use `EarlyStoppingCallback`.
+
+As usual, set up your trainer and your evaluation dataset. The below is used to stop the training run if the `eval_loss` (the evaluation loss) is not decreasing after 3 steps or so.
+
+```python
+from trl import SFTConfig, SFTTrainer
+trainer = SFTTrainer(
+    args = SFTConfig(
+        fp16_full_eval = True,
+        per_device_eval_batch_size = 2,
+        eval_accumulation_steps = 4,
+        output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
+        save_strategy = "steps",             # save model every N steps
+        save_steps = 10,                     # how many steps until we save the model
+        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
+        eval_strategy = "steps",             # evaluate every N steps
+        eval_steps = 10,                     # how many steps until we do evaluation
+        load_best_model_at_end = True,       # MUST USE for early stopping
+        metric_for_best_model = "eval_loss", # metric we want to early stop on
+        greater_is_better = False,           # the lower the eval loss, the better
+    ),
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = new_dataset["train"],
+    eval_dataset = new_dataset["test"],
+)
+```
+
+We then add the callback which can also be customized:
+
+```python
+from transformers import EarlyStoppingCallback
+early_stopping_callback = EarlyStoppingCallback(
+    early_stopping_patience = 3,     # How many steps we will wait if the eval loss doesn't decrease
+                                     # For example the loss might increase, but decrease after 3 steps
+    early_stopping_threshold = 0.0,  # Can set higher - sets how much loss should decrease by until
+                                     # we consider early stopping. For eg 0.01 means if loss was
+                                     # 0.02 then 0.01, we consider to early stop the run.
+)
+trainer.add_callback(early_stopping_callback)
+```
+
+Then train the model as usual via `trainer.train()`.
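+
+The patience and threshold settings above are easier to reason about with a small simulation. This is a simplified sketch of the idea, not the `transformers` implementation: `evals_until_early_stop` is a hypothetical function, the losses are made up, and it counts evaluation calls (which happen every `eval_steps` training steps) rather than training steps.
+
+```python
+def evals_until_early_stop(eval_losses, patience, threshold = 0.0):
+    """Return the 1-based evaluation at which training would stop, or None."""
+    best = None
+    evals_without_improvement = 0
+    for i, loss in enumerate(eval_losses, start = 1):
+        # An evaluation only counts as an improvement if it beats the best
+        # loss so far by more than the threshold.
+        if best is None or best - loss > threshold:
+            best = loss
+            evals_without_improvement = 0
+        else:
+            evals_without_improvement += 1
+            if evals_without_improvement >= patience:
+                return i
+    return None
+
+
+losses = [1.00, 0.80, 0.81, 0.79, 0.80, 0.82, 0.83]  # made-up eval losses
+print(evals_until_early_stop(losses, patience = 3))                    # 7
+print(evals_until_early_stop(losses, patience = 3, threshold = 0.05))  # 5
+```
+
+With a threshold of 0.05, the small dip to 0.79 no longer resets the counter, so the run stops two evaluations earlier.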
+
+### Downloading gets stuck at 90 to 95%
+
+If your model download gets stuck at 90 to 95% for a long time, you can disable some fast downloading processes to force downloads to be synchronous and to print out more error messages.
+
+Simply use `UNSLOTH_STABLE_DOWNLOADS=1` before any Unsloth import.
+
+```python
+import os
+os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"
+
+from unsloth import FastLanguageModel
+```
+
+### RuntimeError: CUDA error: device-side assert triggered
+
+Restart and run all, but place this at the start, before any Unsloth import. Also, please file a bug report as soon as possible. Thank you!
+
+```python
+import os
+os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
+os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
+```
+
+### All labels in your dataset are -100. Training losses will be all 0.
+
+This means that your usage of `train_on_responses_only` is incorrect for that particular model. train\_on\_responses\_only allows you to mask the user question, and train your model to output the assistant response with higher weighting. This is known to increase accuracy by 1% or more. See our [**LoRA Hyperparameters Guide**](/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide.md) for more details.
+
+For Llama 3.1, 3.2, 3.3 type models, please use the below:
+
+```python
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
+    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
+)
+```
+
+For Gemma 2, 3, and 3n models, use the below:
+
+```python
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<start_of_turn>user\n",
+    response_part = "<start_of_turn>model\n",
+)
+```
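+
+To check whether the masking went wrong for your model, you can look at what fraction of the labels in a tokenized example is `-100`. This is a minimal sketch with made-up label lists; `masked_fraction` is a hypothetical helper, and in practice the labels would come from one row of your processed training dataset.
+
+```python
+IGNORE_INDEX = -100  # label value that the loss ignores
+
+
+def masked_fraction(labels, ignore_index = IGNORE_INDEX):
+    """Return the share of label positions that are excluded from the loss."""
+    if not labels:
+        raise ValueError("labels is empty")
+    return sum(1 for y in labels if y == ignore_index) / len(labels)
+
+
+# Made-up examples: the first masks only the user turn, the second masks everything.
+print(masked_fraction([-100, -100, -100, 42, 7, 2]))  # 0.5
+print(masked_fraction([-100] * 6))                    # 1.0 -> nothing left to train on
+```
+
+A value of 1.0 for every row matches the warning above and usually means `instruction_part` / `response_part` do not match the chat template of the model.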
+
+### Unsloth is slower than expected?
+
+If your speed seems slower at first, it’s likely because `torch.compile` typically takes \~5 minutes (or longer) to warm up and finish compiling. Make sure you measure throughput **after** it’s fully loaded; over longer runs, Unsloth should be much faster.
+
+To disable compilation, use:
+
+```python
+import os
+os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
+```
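+
+When measuring speed, it helps to drop the warm-up steps from the average so that compile time does not dominate. The sketch below only does the arithmetic: `steady_state_throughput` is a hypothetical helper, and the step times and tokens per step are made-up numbers.
+
+```python
+def steady_state_throughput(step_seconds, tokens_per_step, warmup_steps):
+    """Tokens per second, ignoring the first warmup_steps steps."""
+    measured = step_seconds[warmup_steps:]
+    if not measured:
+        raise ValueError("no steps left after discarding warm-up")
+    return tokens_per_step * len(measured) / sum(measured)
+
+
+# Made-up timings: the first step includes about 5 minutes of compilation.
+step_seconds = [300.0, 2.0, 2.0, 2.0, 2.0]
+
+print(steady_state_throughput(step_seconds, 4096, warmup_steps = 0))  # dominated by compile time
+print(steady_state_throughput(step_seconds, 4096, warmup_steps = 1))  # 2048.0
+```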
+
+### Some weights of Gemma3nForConditionalGeneration were not initialized from the model checkpoint
+
+This is a critical error, since it means some weights were not parsed correctly, which will cause incorrect outputs. This can normally be fixed by upgrading Unsloth:
+
+`pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo`
+
+Then upgrade transformers and timm:
+
+`pip install --upgrade --force-reinstall --no-cache-dir --no-deps transformers timm`
+
+However if the issue still persists, please file a bug report asap!
+
+### NotImplementedError: A UTF-8 locale is required. Got ANSI
+
+See <https://github.com/googlecolab/colabtools/issues/3409>
+
+In a new cell, run the below:
+
+```python
+import locale
+locale.getpreferredencoding = lambda: "UTF-8"
+```
+
+### Citing Unsloth
+
+If you are citing the usage of our model uploads, use the below Bibtex. This is for Qwen3-30B-A3B-GGUF Q8\_K\_XL:
+
+```
+@misc{unsloth_2025_qwen3_30b_a3b,
+  author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
+  title        = {Qwen3-30B-A3B-GGUF:Q8\_K\_XL},
+  year         = {2025},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF}}
+}
+```
+
+To cite the usage of our Github package or our work in general:
+
+```
+@misc{unsloth,
+  author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
+  title        = {Unsloth},
+  year         = {2025},
+  publisher    = {Github},
+  howpublished = {\url{https://github.com/unslothai/unsloth}}
+}
+```
+
+[^1]: Enable this line of code and see if it works.
+