From 2a2f5045bbd5c820cdbc7405e619d67d630d18e4 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 11 Jun 2026 14:43:47 +0800
Subject: [PATCH] folklore: add Karpathy common-mistakes tweet and Sculley CACE
 principle

Both are quoted verbatim with frozen evidence. The 2018 tweet thread
(mirrored via threadreaderapp, since x.com blocks fetching) slots in
after overfit-one-batch; CACE (NIPS 2015, entanglement section
transcribed from the PDF) gives Always-Be-Ablating its why.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 SKILL.md                                      | 22 +++++++++++++++++++
 .../karpathy_common_mistakes_tweet_2018.md    | 20 +++++++++++++++++
 .../sculley_2015_hidden_technical_debt.md     | 18 +++++++++++++++
 3 files changed, 60 insertions(+)
 create mode 100644 docs/evidence/karpathy_common_mistakes_tweet_2018.md
 create mode 100644 docs/evidence/sculley_2015_hidden_technical_debt.md

diff --git a/SKILL.md b/SKILL.md
index a719b05..d8337bb 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -120,6 +120,18 @@ gwern traced versions back to 1992 and concluded it is "a classic 'urban legend'
 
 And remove a variable while you're at it: "Always use a fixed random seed [...]. This removes a factor of variation and will help keep you sane."[^karpathy-recipe]
 
+### The most common neural net mistakes (Karpathy)
+
+The 2018 tweet thread that seeded the recipe post. Every item except 5 fails silently (no error is raised, training just goes worse):
+
+> most common neural net mistakes: 1) you didn't try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. ; others? :)[^karpathy-mistakes]
+
+> oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won't make you silently fail, but they are spurious parameters[^karpathy-mistakes]
+
+> 6) thinking view() and permute() are the same thing (& incorrectly using view)[^karpathy-mistakes]
+
+Item 6 is the bug that the backprop-to-input dependency check catches mechanically ([refs/diagnostics.md](refs/diagnostics.md)).
+
 ### Seed variance: you can't tell a bug from bad luck
 
 > Look, there's variance in supervised learning too, but it's rarely this bad. If my supervised learning code failed to beat random chance 30% of the time, I'd have super high confidence there was a bug in data loading or training. If my reinforcement learning code does no better than random, I have no idea if it's a bug, if my hyperparameters are bad, or if I simply got unlucky.[^irpan]
@@ -148,6 +160,14 @@ On the slides[^schulman]:
 
 Many normalization/regularization tricks do roughly the same job (they improve conditioning), so stacking them adds complexity without proportional benefit.
 
+### Changing anything changes everything (Sculley et al.)
+
+Why ablation and one-change-at-a-time are necessary, from Google's production-ML technical-debt paper:
+
+> **Entanglement.** Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. For instance, consider a system that uses features x1, ...xn in a model. If we change the input distribution of values in x1, the importance, weights, or use of the remaining n − 1 features may all change. [...] No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything. CACE applies not only to input signals, but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.[^sculley]
+
+This is also why "I changed the method and a hyperparameter and it got better" tells you nothing about the method.
+
 ### Adam at 3e-4 for baselines (Karpathy)
 
 > In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.[^karpathy-recipe]
@@ -232,6 +252,8 @@ Folklore sources (the quotes above trace to these):
 [^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188)
 [^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501)
 [^karpathy-recipe]: Andrej Karpathy, "A Recipe for Training Neural Networks" (2019) — https://karpathy.github.io/2019/04/25/recipe/ ([cache](docs/evidence/karpathy_recipe_training_nn_2019.md): inspect-data L26+L32, fixed-seed L39, overfit-one-batch L51, Adam-3e-4 L73; note: this is an abridged note with its own "..." elisions)
+[^karpathy-mistakes]: Andrej Karpathy, "most common neural net mistakes" tweet thread, 1 Jul 2018 — https://x.com/karpathy/status/1013244313327681536 ([cache](docs/evidence/karpathy_common_mistakes_tweet_2018.md): tweets 1-3 verbatim, cross-checked against threadreaderapp; x.com itself blocks fetching)
+[^sculley]: Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NIPS 2015) — https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf ([cache](docs/evidence/sculley_2015_hidden_technical_debt.md): abstract, CACE/entanglement, ensemble caveat)
 [^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments)
 [^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251)
 [^irpan]: Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018) — https://www.alexirpan.com/2018/02/14/rl-hard.html ([cache](docs/evidence/alexirpan_rl_hard.md): variance-bug-or-unlucky L674-678, seed-canary L705-707)
diff --git a/docs/evidence/karpathy_common_mistakes_tweet_2018.md b/docs/evidence/karpathy_common_mistakes_tweet_2018.md
new file mode 100644
index 0000000..24e6e7a
--- /dev/null
+++ b/docs/evidence/karpathy_common_mistakes_tweet_2018.md
@@ -0,0 +1,20 @@
+Source: https://x.com/karpathy/status/1013244313327681536 (thread, 1 Jul 2018)
+Title: Andrej Karpathy, "most common neural net mistakes" tweet thread
+Fetched-via: x.com blocked (HTTP 451 via jina reader); tweet 1 verbatim from x.com page title in web search results; tweets 2-3 verbatim from https://threadreaderapp.com/thread/1013244313327681536.html ; thread also indexed on Karpathy's own https://karpathy.ai/tweets.html
+Fetch-status: verbatim, cross-checked across the two mirrors
+
+# most common neural net mistakes (tweet thread)
+
+Tweet 1 (1 Jul 2018):
+
+> most common neural net mistakes: 1) you didn't try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. ; others? :)
+
+Tweet 2 (same thread, 1 Jul 2018):
+
+> oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won't make you silently fail, but they are spurious parameters
+
+Tweet 3 (same thread, 1 Jul 2018):
+
+> 6) thinking view() and permute() are the same thing (& incorrectly using view)
+
+Context: this thread is the seed of Karpathy's 2019 "A Recipe for Training Neural Networks" post (see karpathy_recipe_training_nn_2019.md), which opens by referencing it.
diff --git a/docs/evidence/sculley_2015_hidden_technical_debt.md b/docs/evidence/sculley_2015_hidden_technical_debt.md
new file mode 100644
index 0000000..b084f17
--- /dev/null
+++ b/docs/evidence/sculley_2015_hidden_technical_debt.md
@@ -0,0 +1,18 @@
+Source: https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
+Title: "Hidden Technical Debt in Machine Learning Systems" — D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison (Google, Inc.), NIPS 2015
+Fetched-via: PDF downloaded from papers.nips.cc, pages 1-2 transcribed by hand from the rendered pages
+Fetch-status: verbatim excerpts (abstract + Entanglement section); subscripts rendered as plain text (x1, xn+1)
+
+# Hidden Technical Debt in Machine Learning Systems (excerpts)
+
+Abstract (p. 1):
+
+> Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of *technical debt*, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
+
+Section 2, "Complex Models Erode Boundaries" — Entanglement (p. 2), the CACE principle:
+
+> **Entanglement.** Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. For instance, consider a system that uses features x1, ...xn in a model. If we change the input distribution of values in x1, the importance, weights, or use of the remaining n − 1 features may all change. This is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature xn+1 can cause similar changes, as can removing any feature xj. No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything. CACE applies not only to input signals, but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
+
+Same section, the ensemble caveat (p. 2):
+
+> One possible mitigation strategy is to isolate models and serve ensembles. [...] However, in many cases ensembles work well because the errors in the component models are uncorrelated. Relying on the combination creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.