ml-debug/docs/evidence/sanh_simple_considerations_hf_2021.md


# Simple Considerations for Simple People Building Fancy Neural Networks
**Source:** Victor Sanh, Hugging Face Blog, February 25, 2021

**URL:** https://huggingface.co/blog/simple-considerations

**Author:** Victor Sanh (Hugging Face research scientist, author of DistilBERT)

---
## Core practices (overlaps heavily with Karpathy 2019 recipe)
**Data first:**
> "the very first step of building a neural network is to put aside machine learning and simply focus on your data"

**Overfit test:**
> "it is a good habit when you think you have finished implementing to overfit a small batch of examples (16 for instance). If your implementation is (nearly) correct, your model will be able to overfit and remember these examples by displaying a 0-loss (make sure you remove any form of regularization such as weight decay)."
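The overfit check described in the quote can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions: the random tensors, the two-layer model, the learning rate, and the step count are placeholders chosen for illustration, not values from the post. Only the batch of 16 examples and the removal of weight decay come from the quote.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in data: 16 examples, 10 features, 3 classes.
inputs = torch.randn(16, 10)
labels = torch.randint(0, 3, (16,))

# Hypothetical stand-in model; swap in the model under test.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))

# weight_decay=0.0 and no dropout: regularization is switched off for this check.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.0)
loss_fn = nn.CrossEntropyLoss()  # expects raw logits, not probabilities

model.train()
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
model.eval()
with torch.no_grad():
    accuracy = (model(inputs).argmax(dim=1) == labels).float().mean().item()
print(f"final loss {final_loss:.4f}, accuracy on the batch {accuracy:.2f}")
```

If the loss refuses to approach zero on a batch this small, the likelier culprit is the implementation (data pipeline, loss wiring, optimizer setup) rather than model capacity.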
**Baselines:**
> "Start as simple as possible to get a sense of the difficulty of your task and how well standard baselines would perform."

> "it is sometimes hard to understand if your performance comes from a bug in your model/code or is simply limited by your model's expressiveness"
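One of the simplest reference points is a majority-class predictor. The post does not name a specific baseline, so this choice is an assumption, and the label lists below are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in eval_labels if y == majority_label)
    return majority_label, hits / len(eval_labels)


# Hypothetical labels for illustration only.
train_labels = ["neg", "neg", "pos", "neg", "pos", "neg"]
eval_labels = ["neg", "pos", "neg", "neg"]

label, acc = majority_baseline_accuracy(train_labels, eval_labels)
print(f"always predict {label!r}: accuracy {acc:.2f}")
```

A trained model that cannot clear this number is more likely to be hiding a bug than hitting an expressiveness limit.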
---
## NLP-specific: tokenization warning
> "when you work with language, have a serious look at the outputs of the tokenizers. I can't count the number of lost hours I spent trying to reproduce results (and sometimes my own old results) because something went wrong with the tokenization."
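A "serious look" at tokenizer output could start with something like the sketch below. The helper `inspect_tokenization` and the toy tokenizer are hypothetical and not from the post; in practice `tokenize` and `detokenize` would wrap whatever tokenizer the project actually uses, and the `[UNK]` token name is an assumption.

```python
def inspect_tokenization(texts, tokenize, detokenize, unk_token="[UNK]"):
    """Collect simple red flags per text: unknown tokens and lossy round trips."""
    report = []
    for text in texts:
        tokens = tokenize(text)
        report.append({
            "text": text,
            "tokens": tokens,
            "unk_rate": tokens.count(unk_token) / max(len(tokens), 1),
            "round_trip_ok": detokenize(tokens) == text,
        })
    return report


# Hypothetical toy tokenizer: lowercases, splits on whitespace, tiny vocabulary.
VOCAB = {"the", "model", "overfits"}


def toy_tokenize(text):
    return [w if w in VOCAB else "[UNK]" for w in text.lower().split()]


def toy_detokenize(tokens):
    return " ".join(tokens)


report = inspect_tokenization(
    ["the model overfits", "The model  diverges"], toy_tokenize, toy_detokenize
)
for row in report:
    print(row)
```

The second text loses its casing, its double space, and an out-of-vocabulary word, which is the kind of silent change that is easy to miss without printing the tokens.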
---
## Common implementation errors listed
- Wrong indexing ("really the worst")
- Forgetting `model.eval()` at evaluation time or `model.zero_grad()` to clear gradients
- Preprocessing errors
- Loss receiving the wrong kind of argument (e.g. probabilities when it expects logits)
- Initialization that fails to break symmetry (typically a whole matrix initialized to a single constant value)
- Parameters never used in the forward pass (so they receive no gradients)
- Learning rate stuck at 0
- Suboptimal input truncation
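Two of these errors can be caught mechanically after a single backward pass. The sketch below is an illustration, not a check taken from the post: `TinyNet` is a hypothetical model with both bugs planted on purpose, and the two list comprehensions are one possible way to detect them.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical model with two deliberate bugs from the list above."""

    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 8)
        self.out = nn.Linear(8, 2)
        self.forgotten = nn.Linear(8, 8)  # bug: never used in forward()
        nn.init.constant_(self.hidden.weight, 0.5)  # bug: whole matrix is one constant

    def forward(self, x):
        return self.out(torch.tanh(self.hidden(x)))


torch.manual_seed(0)
model = TinyNet()
logits = model(torch.randn(16, 4))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (16,)))
loss.backward()

# Parameters that took no part in the forward pass end up with no gradient.
no_grad_params = [name for name, p in model.named_parameters() if p.grad is None]

# Weight matrices filled with a single value are a sign of constant initialization.
constant_params = [
    name for name, p in model.named_parameters()
    if p.dim() == 2 and bool((p.detach() == p.detach().flatten()[0]).all())
]

print("no gradient:", no_grad_params)
print("constant matrices:", constant_params)
```

Running both checks once at the start of training is cheap, and neither depends on the task or the data.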
---
## HP tuning advice
> "there is no point of launching 1000 runs with different hyperparameters: compare a couple of runs with different hyperparameters to get an idea of which hyperparameters have the highest impact"

> "random over a reasonably manually defined grid search is still a tough-to-beat baseline" [re: Bayesian vs random search]
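Random search over a hand-written grid needs only the standard library. In this sketch the hyperparameter names, their values, and the run budget of 8 are placeholders, not recommendations from the post.

```python
import itertools
import random

# Hypothetical, manually defined grid; the values are placeholders.
grid = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [16, 32, 64],
    "warmup_ratio": [0.0, 0.06, 0.1],
}


def sample_configs(grid, n_runs, seed=0):
    """Draw n_runs distinct configurations uniformly at random from the grid."""
    keys = list(grid)
    all_configs = [
        dict(zip(keys, values)) for values in itertools.product(*grid.values())
    ]
    rng = random.Random(seed)
    return rng.sample(all_configs, k=min(n_runs, len(all_configs)))


configs = sample_configs(grid, n_runs=8)
for config in configs:
    print(config)  # launch one training run per config, then compare
```

Sampling a handful of the 36 grid points keeps the budget small while still showing which hyperparameters move the metric most.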
---
## Embeddings freezing (NLP, pre-trained LM fine-tuning)
> "in my experience working with pre-trained language models, freezing the embeddings modules to their pre-trained values doesn't affect much the fine-tuning task performance while considerably speeding up the training."

Credence ~65-70% -- specific domain claim, lacks ablation study reference.
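In PyTorch, freezing an embedding module can be sketched as below. `TinyLM` is a hypothetical stand-in with randomly initialized weights rather than a real pre-trained model, and the learning rate is a placeholder.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for a pre-trained language model."""

    def __init__(self, vocab_size=100, dim=16, num_labels=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)
        self.encoder = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, token_ids):
        hidden = torch.relu(self.encoder(self.embeddings(token_ids)))
        return self.classifier(hidden.mean(dim=1))


model = TinyLM()

# Freeze the embedding table so fine-tuning leaves its values untouched.
model.embeddings.requires_grad_(False)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

n_frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
n_trainable = sum(p.numel() for p in trainable)
print(f"frozen: {n_frozen}, trainable: {n_trainable}")
```

With a Hugging Face Transformers model, the input embedding module can be reached through `model.get_input_embeddings()` instead of a hard-coded attribute name. Whether freezing helps on a given task is worth confirming with a quick A/B run, given the credence noted above.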
---
## External links from this post
- "Checklist for debugging neural networks" -- Cecelia Shao (Towards Data Science)
- "A recipe for Training Neural Networks" -- Karpathy