mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 16:15:57 +08:00
65 lines
4.5 KiB
Markdown
65 lines
4.5 KiB
Markdown
# Simple Considerations for Simple People Building Fancy Neural Networks
|
|
|
|
**Source:** Victor Sanh, Hugging Face Blog, February 25, 2021
|
|
**URL:** https://huggingface.co/blog/simple-considerations
|
|
**Author:** Victor Sanh (Hugging Face research scientist, author of DistilBERT)
|
|
|
|
---
|
|
|
|
## Core practices (overlaps heavily with Karpathy 2019 recipe)
|
|
|
|
**Data first:**
|
|
> "the very first step of building a neural network is to put aside machine learning and simply focus on your data"
|
|
|
|
**Overfit test:**
|
|
> "it is a good habit when you think you have finished implementing to overfit a small batch of examples (16 for instance). If your implementation is (nearly) correct, your model will be able to overfit and remember these examples by displaying a 0-loss (make sure you remove any form of regularization such as weight decay)."
|
|
|
|
**Baselines:**
|
|
> "Start as simple as possible to get a sense of the difficulty of your task and how well standard baselines would perform."
|
|
> "it is sometimes hard to understand if your performance comes from a bug in your model/code or is simply limited by your model's expressiveness"
|
|
|
|
---
|
|
|
|
## NLP-specific: tokenization warning
|
|
|
|
> "when you work with language, have a serious look at the outputs of the tokenizers. I can't count the number of lost hours I spent trying to reproduce results (and sometimes my own old results) because something went wrong with the tokenization."
|
|
|
|
---
|
|
|
|
## Common implementation errors listed
|
|
|
|
- Wrong indexing ("really the worst")
|
|
- Forgetting `model.eval()` or `model.zero_grad()`
|
|
- Preprocessing errors
|
|
- Loss receiving wrong argument type (probabilities vs. logits)
|
|
- Uniform constant initialization (breaks symmetry)
|
|
- Parameters not called in forward pass (no gradients)
|
|
- Learning rate stuck at 0
|
|
- Suboptimal input truncation
|
|
|
|
---
|
|
|
|
## HP tuning advice
|
|
|
|
> "there is no point of launching 1000 runs with different hyperparameters: compare a couple of runs with different hyperparameters to get an idea of which hyperparameters have the highest impact"
|
|
|
|
> "random over a reasonably manually defined grid search is still a tough-to-beat baseline" [re: Bayesian vs random search]
|
|
|
|
---
|
|
|
|
## Embeddings freezing (NLP, pre-trained LM fine-tuning)
|
|
|
|
> "in my experience working with pre-trained language models, freezing the embeddings modules to their pre-trained values doesn't affect much the fine-tuning task performance while considerably speeding up the training."
|
|
|
|
Credence ~65-70% -- specific domain claim, lacks ablation study reference.
|
|
|
|
---
|
|
|
|
## External links from this post
|
|
|
|
- **Cecelia Shao, "Checklist for Debugging Neural Networks"** (2019, KDnuggets/Towards Data Science): 5-section checklist (start simple, confirm loss, check intermediate outputs, diagnose parameters, track work). Thin; largely overlaps with Karpathy recipe and Slavv. Not captured separately -- see those sources instead.
|
|
- **Chase Roberts, "How to unit test machine learning code"** (2017, Medium, 4 min): Focuses on software unit testing practices applied to ML models -- testing gradient flow, output shapes, that outputs change when weights change. Spawned `mltest` library. Not a full debugging guide. Main insight: "The code never crashes, the loss still goes down, it just converges to poor results."
|
|
- **Joel Grus, "Reproducibility in ML as engineering best practices"** (ICLR 2019, 82 slides): A software engineering / reproducibility talk, not a debugging guide. Core: reproducibility forces good SE practices (source control, unit tests, code reviews, config files, Docker). Key quote: "The best time to find mistakes is before you run your experiments." Unit test template: tiny known dataset → check model runs → check output shape → check output has reasonable values. All already covered in SKILL.md.
|
|
- **Grus + Gardner + Neumann, "Writing Code for NLP Research"** (EMNLP 2018 tutorial): Research code organization, not a debugging guide. Mostly about frameworks, prototyping workflow, AllenNLP architecture. One specific bug worth noting: Adam + sparse embedding gradients -- "Embeddings have sparse gradients (only some embeddings are updated), but the momentum coefficients from Adam are calculated for the whole embedding every time" -- gradient appears 2 orders of magnitude different. Fix: use sparse-aware Adam. Minimal testing advice: test data processing + simple model tests on small fixtures; "makes sure data processing works consistently, tensor operations run, gradients are non-zero."
|
|
- **A recipe for Training Neural Networks** -- Karpathy (captured in full: karpathy_recipe_training_nn_2019.md)
|