folklore: add Karpathy common-mistakes tweet and Sculley CACE principle

Both quote-verbatim with frozen evidence: the 2018 tweet thread (mirrored via threadreaderapp, x.com blocks fetching) slots after overfit-one-batch; CACE (NIPS 2015, entanglement section transcribed from the PDF) gives Always-Be-Ablating its why. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:31:04 +08:00 · 2026-06-11 14:43:47 +08:00
parent fb753d093e
commit 2a2f5045bb
3 changed files with 60 additions and 0 deletions
@@ -0,0 +1,20 @@
+Source: https://x.com/karpathy/status/1013244313327681536 (thread, 1 Jul 2018)
+Title: Andrej Karpathy, "most common neural net mistakes" tweet thread
+Fetched-via: x.com blocked (HTTP 451 via jina reader); tweet 1 verbatim from x.com page title in web search results; tweets 2-3 verbatim from https://threadreaderapp.com/thread/1013244313327681536.html ; thread also indexed on Karpathy's own https://karpathy.ai/tweets.html
+Fetch-status: verbatim, cross-checked across the two mirrors
+
+# most common neural net mistakes (tweet thread)
+
+Tweet 1 (1 Jul 2018):
+
+> most common neural net mistakes: 1) you didn't try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. ; others? :)
+
+Tweet 2 (same thread, 1 Jul 2018):
+
+> oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won't make you silently fail, but they are spurious parameters
+
+Tweet 3 (same thread, 1 Jul 2018):
+
+> 6) thinking view() and permute() are the same thing (& incorrectly using view)
+
+Context: this thread is the seed of Karpathy's 2019 "A Recipe for Training Neural Networks" post (see karpathy_recipe_training_nn_2019.md), which opens by referencing it.
@@ -0,0 +1,18 @@
+Source: https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
+Title: "Hidden Technical Debt in Machine Learning Systems" — D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison (Google, Inc.), NIPS 2015
+Fetched-via: PDF downloaded from papers.nips.cc, pages 1-2 transcribed by hand from the rendered pages
+Fetch-status: verbatim excerpts (abstract + Entanglement section); subscripts rendered as plain text (x1, xn+1)
+
+# Hidden Technical Debt in Machine Learning Systems (excerpts)
+
+Abstract (p. 1):
+
+> Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of *technical debt*, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
+
+Section 2, "Complex Models Erode Boundaries" — Entanglement (p. 2), the CACE principle:
+
+> **Entanglement.** Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. For instance, consider a system that uses features x1, ...xn in a model. If we change the input distribution of values in x1, the importance, weights, or use of the remaining n − 1 features may all change. This is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature xn+1 can cause similar changes, as can removing any feature xj. No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything. CACE applies not only to input signals, but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
+
+Same section, the ensemble caveat (p. 2):
+
+> One possible mitigation strategy is to isolate models and serve ensembles. [...] However, in many cases ensembles work well because the errors in the component models are uncorrelated. Relying on the combination creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.