Mirror of https://github.com/wassname/ml_debug.git
synced 2026-06-27 01:00:14 +08:00
diagnostics: add NaN-poisoning leakage tracer + Karpathy backprop-to-input check; README citation
NaN poisoning: inject NaN where info must not come from (future/test/labels), run the real pipeline, assert past outputs stay finite. Documents false negatives (pandas skipna, nanmean) and false positives (softmax rows, batch stats). Backprop-to-input is its gradient dual for inside the model; quote already frozen in docs/evidence/karpathy_recipe_training_nn_2019.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -15,3 +15,14 @@ Or paste `SKILL.md` into your system prompt / context when debugging.
 - **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. Leads with the mindset (calibrate, mental models, general debugging tricks, and reading a working implementation when stuck), then a folklore section of sourced quotes, then an LLM-agent playbook (debugging loop, triage menu, anti-patterns). Deeper one-off tricks (loss-surface analysis, stuck-metric diagnosis, sweep reliability) live in [refs/](refs/).
 
 - **[docs/evidence/](docs/evidence/)** -- frozen local copies of source material (blog posts, talks, papers, reddit threads). Claims in SKILL.md link back to exact quotes here.
+
+## Citation
+
+```bibtex
+@misc{wassname2026mldebug,
+  title = {ML Debugging Folklore: A Practitioner Debugging Skill for LLM Agents},
+  author = {Michael J. Clark},
+  year = {2026},
+  url = {https://github.com/wassname/ml_debug/}
+}
+```
@@ -47,7 +47,7 @@ For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (d
 
 | Signal | Likely meaning | Check |
 |--------|----------------|-------|
-| Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? |
+| Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? Localize with the NaN-poisoning tracer or backprop-to-input check ([refs/diagnostics.md](refs/diagnostics.md)) |
 | Random input gives the same loss as real input | Pipeline is destroying information (over-aggressive preprocessing, wrong transforms, all-zero input) | Print raw data at each stage; visualize |
 | Predicts the same class for everything | Class imbalance (100:1 -> "always predict majority") | Label-count check; weighted loss or resample |
 | Val much worse than train from the start | Distribution shift between splits | Same preprocessing? Same time period? Same source? |
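
The "0.01 vs 2.3" in the first row refers to the loss a classifier should show before it has learned anything: with C balanced classes and near-uniform predictions, cross-entropy is ln(C), roughly 2.3 for C = 10. A minimal sketch of that baseline, assuming a balanced 10-class softmax classifier; the helper names and the 0.5 ratio threshold are illustrative choices, not taken from the repository:

```python
import math


def expected_init_loss(n_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over n_classes."""
    return -math.log(1.0 / n_classes)


def looks_like_leakage(init_loss: float, n_classes: int, ratio: float = 0.5) -> bool:
    """Flag an initial loss far below the uniform-guess baseline.

    The ratio threshold is an arbitrary, illustrative cut-off.
    """
    return init_loss < ratio * expected_init_loss(n_classes)


baseline = expected_init_loss(10)
print(f"expected init loss for 10 classes: {baseline:.3f}")
print("suspicious at 0.01:", looks_like_leakage(0.01, 10))
print("suspicious at 2.28:", looks_like_leakage(2.28, 10))
```

An initial loss near the baseline is the healthy case; one far below it is the signal the table row describes.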
@@ -396,7 +396,7 @@ Look these up when the symptom calls for them; they're kept out of the main flow
 - [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check (is the optimizer failing, or can the parameterization not express it?).
 - [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, so a result is "reliably better" not "a lucky seed."
 - [refs/static_analysis.md](refs/static_analysis.md) — grep patterns for silent bugs (shape mismatches, autograd breakers, double softmax, step ordering, leakage).
-- [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets (init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, class-imbalance check).
+- [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets (init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, NaN-poisoning leakage tracer, backprop-to-input dependency check, class-imbalance check).
 - [rl/SKILL.md](rl/SKILL.md) — RL-specific debugging: probe environments, reward engineering, HP defaults, reference implementations.
 - [pinn/SKILL.md](pinn/SKILL.md) — physics-informed-network debugging: nondimensionalization, gradient pathologies, curriculum.
 
@@ -112,6 +112,61 @@ with torch.no_grad():
 # If very different: model sees real signal. Problem is elsewhere.
 ```
 
+**NaN poisoning (leakage tracer)** [wassname; forward-pass dual of Karpathy's gradient check below]
+```python
+# Leakage can hide anywhere: normalization fit on the full dataset, target
+# leaking into features, window functions peeking ahead, bad splits. Instead
+# of auditing each spot, inject NaN where information must NOT come from
+# (the future, the test set, the label) and run the real pipeline. NaN is
+# absorbing under +,-,*,/ so it spreads like dye: if any "past"/train output
+# is NaN, you have a leak, and you can bisect the pipeline to find the stage
+# where it crossed.
+import numpy as np
+X = np.random.randn(1000, n_features)
+y = np.random.randn(1000)
+X[cutoff:] = np.nan  # poison the future / test rows
+y[cutoff:] = np.nan
+
+Xt, yt = pipeline(X, y)  # the REAL pipeline: features, scaling, splits, windowing
+assert np.isfinite(Xt[:cutoff]).all(), "leak: future reached past features"
+assert np.isfinite(yt[:cutoff]).all(), "leak: future reached past targets"
+# To localize: assert finiteness after each pipeline stage; first failing
+# stage is where the leak crosses.
+
+# CAVEAT false negatives (dye silently filtered -- false assurance):
+# pandas mean/std/sum default to skipna=True; np.nanmean; dropna/fillna;
+# imputers; df.rolling(w, min_periods=k).mean() skips NaN when k < w.
+# Fallback: poison with a huge sentinel (1e12) instead -- survives nanmean
+# and shows up as an absurd value in anything it touches.
+# CAVEAT false positives (dye spreads along a legitimate axis):
+# softmax over an axis containing NaN goes all-NaN even with a CORRECT
+# additive -inf causal mask (NaN + -inf = NaN). So this cannot validate
+# causal masking inside a transformer -- use the gradient check below.
+# But NaN crossing via batch statistics is often a TRUE positive: a scaler
+# fit on train+test lets test rows poison train features. That's the leak.
+```
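
The NaN-poisoning snippet above leaves `pipeline`, `cutoff`, and `n_features` for the reader to supply. A self-contained sketch of the same idea follows, assuming a toy standardization step stands in for the real pipeline. The three `*_scale` functions, the array sizes, and the sentinel comparison against an unpoisoned run are illustrative assumptions, not code from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, cutoff = 1000, 5, 800  # illustrative sizes
base = rng.standard_normal((n_rows, n_features))


def leaky_scale(X):
    # Bug: statistics are fit on every row, future included.
    return (X - X.mean(axis=0)) / X.std(axis=0)


def clean_scale(X):
    # Statistics are fit on the past rows only.
    past = X[:cutoff]
    return (X - past.mean(axis=0)) / past.std(axis=0)


def nan_skipping_scale(X):
    # Same bug as leaky_scale, but NaN-skipping reductions filter out the dye.
    return (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)


def past_outputs(pipeline, poison):
    X = base.copy()
    X[cutoff:] = poison           # overwrite the future rows
    return pipeline(X)[:cutoff]   # keep only the rows that must stay clean


def nan_check(pipeline):
    # True means no NaN reached the past rows.
    return bool(np.isfinite(past_outputs(pipeline, np.nan)).all())


def sentinel_check(pipeline):
    # True means a huge value in the future leaves the past rows unchanged.
    reference = pipeline(base.copy())[:cutoff]
    return bool(np.allclose(reference, past_outputs(pipeline, 1e12)))


for fn in (leaky_scale, clean_scale, nan_skipping_scale):
    print(f"{fn.__name__:20s} nan_check={nan_check(fn)} "
          f"sentinel_check={sentinel_check(fn)}")
```

Under these assumptions `leaky_scale` fails both checks, `clean_scale` passes both, and `nan_skipping_scale` passes the NaN check while failing the sentinel check, which is the false-negative case the caveat describes. Comparing against an unpoisoned run is one way to read the sentinel: after standardization the contaminated values can look ordinary, so a magnitude threshold alone may miss them.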
+
+**Backprop-to-input dependency check** [Karpathy 2019]
+```python
+# The gradient-based dual of NaN poisoning: works INSIDE models where NaN
+# gives false positives (attention softmax, batch/layer stats).
+# Karpathy: "set the loss to be something trivial like the sum of all outputs
+# of example i... ensure that you get a non-zero gradient only on the i-th input."
+# Catches view-instead-of-transpose bugs that mix info across the batch dim.
+
+# Batch independence: output i must depend only on input i
+x = torch.randn(8, seq, dim, requires_grad=True)
+model(x)[3].sum().backward()
+assert (x.grad[[0,1,2,4,5,6,7]] == 0).all(), "leak across batch dim"
+
+# Causal masking: output at t must not depend on inputs > t
+x = torch.randn(1, seq, dim, requires_grad=True)
+t = seq // 2
+model(x)[0, t].sum().backward()
+assert (x.grad[0, t+1:] == 0).all(), "leak: position t sees the future"
+# Run in eval mode; dropout and exotic attn kernels can add noise.
+```
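
The backprop-to-input snippet above assumes an existing `model`, `seq`, and `dim`. A runnable sketch with a deliberately tiny model follows; `TinyAttention` is a hypothetical single-head attention layer written only for this illustration, and its `causal=False` switch plants a leak so the check has something to catch.

```python
import torch
import torch.nn as nn


class TinyAttention(nn.Module):
    """Hypothetical single-head self-attention; causal=False plants a leak."""

    def __init__(self, dim, causal=True):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.causal = causal

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
        if self.causal:
            seq = x.shape[1]
            # True above the diagonal marks positions in the future.
            future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(future, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v


def future_grad_is_zero(model, seq=7, dim=5):
    # Output at position t must receive no gradient from inputs after t.
    model.eval()
    x = torch.randn(1, seq, dim, requires_grad=True)
    t = seq // 2
    model(x)[0, t].sum().backward()
    return bool((x.grad[0, t + 1:] == 0).all())


def batch_grad_is_isolated(model, batch=4, seq=7, dim=5, i=2):
    # Output of example i must receive gradient only from input i.
    model.eval()
    x = torch.randn(batch, seq, dim, requires_grad=True)
    model(x)[i].sum().backward()
    others = [j for j in range(batch) if j != i]
    return bool((x.grad[others] == 0).all())


torch.manual_seed(0)
causal_ok = future_grad_is_zero(TinyAttention(5, causal=True))
leak_found = not future_grad_is_zero(TinyAttention(5, causal=False))
batch_ok = batch_grad_is_isolated(TinyAttention(5, causal=True))
print(causal_ok, leak_found, batch_ok)
```

With the mask in place the gradient on future positions is exactly zero; without it the same check reports a leak. The odd sizes (7 and 5) are a small hedge against shape bugs that happen to cancel out when dimensions match.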
+
 **Prime dimension trick** [Slavv]
 ```python
 # Use prime/weird numbers for each dimension to catch silent broadcasting.