diff --git a/SKILL.md b/SKILL.md
index f5c7a70..96993ae 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -440,288 +440,11 @@ Concrete procedures for an LLM agent debugging ML code. Work top-to-bottom: stat
 
 ### 6.1 Static analysis: grep for silent bugs
 
-Run these searches on the codebase before anything else. Each catches a common bug that produces no error but wrong results.
-
-**Shape mismatches (silent broadcasting)**
-```
-# Grep patterns:
-\.view\(|\.reshape\(  # check dims match intent
-unsqueeze\(|squeeze\(  # dimension insertion/removal
-\.expand\(|\.repeat\(  # broadcasting
-# Action: for every hit, trace the tensor shape backward. Add assert statements.
-```
-
-**Autograd breakers**
-```
-# Grep patterns:
-\.detach\(\)  # breaks gradient flow
-\.data\b  # bypasses autograd entirely
-with torch\.no_grad  # check this isn't wrapping training code
-\.item\(\)  # in a loss computation = broken
-\.numpy\(\)  # in forward pass = broken
-# Action: every .detach() should have a comment explaining WHY grad is intentionally stopped.
-```
-
-**Missing train/eval mode**
-```
-# Grep patterns:
-\.train\(\)  # count occurrences
-\.eval\(\)  # should pair with .train()
-# Action: verify .eval() before every val loop, .train() before every train loop.
-# Dropout and batchnorm behave differently -- this silently degrades results.
-```
-
-**In-place ops on tensors requiring grad**
-```
-# Grep patterns:
-\+=|\-=|\*=|/=  # in-place assignment on tensors
-\.add_\(|\.mul_\(|\.zero_\(  # in-place methods
-\[.*\]\s*=[^=]  # index assignment (excludes ==)
-# Action: in-place ops on leaf tensors with requires_grad=True corrupt autograd.
-# Replace x += y with x = x + y.
-```
-
-**Double softmax (softmax input to CrossEntropyLoss)**
-```
-# Grep patterns:
-CrossEntropyLoss|cross_entropy  # expects raw logits
-softmax|log_softmax|\.softmax  # if applied BEFORE CrossEntropyLoss = double softmax
-# Action: CrossEntropyLoss = log_softmax + NLLLoss internally.
-# If you softmax first, CE computes log_softmax(softmax(x)) -- the softmax
-# compresses logits into (0,1), so log_softmax sees near-uniform inputs.
-# Gradients vanish. Loss plateaus near ln(n_classes).
-```
-
-**Wrong optimizer step ordering**
-```
-# Grep patterns -- verify this exact order exists:
-# 1. optimizer.zero_grad()
-# 2. loss.backward()
-# 3. [optional: clip_grad_norm_]
-# 4. optimizer.step()
-# 5. [optional: scheduler.step()]
-# Common bugs: zero_grad after backward (kills grads), step before backward (stale grads),
-# scheduler.step() in wrong loop: per-epoch schedulers (StepLR, CosineAnnealingLR)
-# called per-batch = decays too fast. Per-step schedulers (OneCycleLR) called per-epoch = too slow.
-```
-
-**Broadcasting traps**
-```python
-# Diagnostic: print shapes at every binary operation between tensors of different ndim
-# Shapes (3,) and (3,1) silently broadcast to (3,3) -- probably not intended.
-# Shapes (B,1) and (B,N) broadcast fine but verify it's intentional.
-a = torch.randn(3)
-b = torch.randn(3, 1)
-print((a + b).shape)  # (3, 3) -- wanted (3,)?
-```
-
-**Wrong loss sign**
-```
-# Grep patterns:
-maximize|ascent  # gradient ascent when descent intended?
-\-\s*loss  # negating loss -- intentional (e.g., reward maximization)?
-1\.0\s*-\s*|1\s*-\s*  # 1 - metric as loss -- is the metric bounded [0,1]?
-# Action: verify that minimizing the loss = improving the metric you care about.
-```
-
-**Frozen parameters not intended**
-```
-# Grep patterns:
-requires_grad\s*=\s*False  # intentional freeze?
-\.freeze\(|\.requires_grad_  # parameter freezing
-for.*param.*\.parameters  # check nothing is skipped
-# Diagnostic:
-for name, p in model.named_parameters():
-    if not p.requires_grad:
-        print(f"FROZEN: {name}")
-```
-
-**Data leakage**
-```
-# Grep patterns:
-\.fit_transform\(  # on test data = leakage
-train_test_split.*shuffle=True  # for time series = leakage
-# Action: fit on train only, transform on both. Use temporal split for time series.
-```
-
-**Class imbalance**
-```
-# Grep patterns:
-CrossEntropyLoss\(\)  # no weight= argument? check if classes balanced
-weight=.*class  # existing balancing -- verify weights are correct
-# Diagnostic: count labels per class (see 6.2 "Class imbalance check").
-# 100:1 ratio with unweighted loss = model predicts majority class.
-```
+> See [refs/static_analysis.md](refs/static_analysis.md) for the full list of grep patterns. Categories: shape mismatches, autograd breakers, train/eval mode, in-place ops, double softmax, optimizer step ordering, broadcasting traps, wrong loss sign, frozen params, data leakage, class imbalance.
 
 ### 6.2 Diagnostic code snippets
 
-Copy-paste these. Each tests one thing.
-
-**Data pipeline sanity check**
-```python
-batch = next(iter(train_loader))
-for k, v in (batch.items() if isinstance(batch, dict) else enumerate(batch)):
-    if isinstance(v, torch.Tensor):
-        print(f"{k}: shape={v.shape}, dtype={v.dtype}, "
-              f"range=[{v.min():.3f}, {v.max():.3f}], "
-              f"mean={v.float().mean():.3f}, std={v.float().std():.3f}, "
-              f"nan={v.isnan().sum()}, inf={v.isinf().sum()}")
-    else:
-        print(f"{k}: type={type(v)}, len={len(v) if hasattr(v, '__len__') else 'scalar'}")
-# Check: inputs ~mean 0, std 1? Labels in expected range? No NaN/Inf? Shapes match model?
-```
-
-**Init loss check**
-```python
-model.eval()
-with torch.no_grad():
-    batch = next(iter(train_loader))
-    out = model(batch['input'])  # adapt to your interface
-    loss = loss_fn(out, batch['target'])
-    print(f"Init loss: {loss.item():.4f}")
-
-# Expected init loss (random predictions):
-# - CrossEntropy, C classes: -ln(1/C) = ln(C)
-#   C=2: 0.693, C=10: 2.303, C=100: 4.605, C=1000: 6.908
-# - Binary CrossEntropy: -ln(0.5) = 0.693
-# - MSE (targets ~N(0,1)): ~1.0 (if init outputs ~0) or ~var(targets)
-# - L1 (targets ~N(0,1)): ~0.8
-#
-# If init loss << expected: model is cheating (data leakage, shortcut)
-# If init loss >> expected: wrong loss fn, bad init, or data pipeline broken
-```
-
-**Overfit-one-batch test**
-```python
-model.train()
-batch = next(iter(train_loader))
-optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
-
-for step in range(200):
-    optimizer.zero_grad()
-    out = model(batch['input'])
-    loss = loss_fn(out, batch['target'])
-    loss.backward()
-    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 100.0)
-    optimizer.step()
-    if step % 20 == 0:
-        print(f"step {step:3d} loss={loss.item():.4f} grad_norm={grad_norm:.4f}")
-
-# Expected: loss drops to ~0 within 200 steps.
-# If not: model can't even memorize 1 batch -- architecture or gradient problem.
-```
-
-**Gradient flow check (per-layer)**
-```python
-loss.backward()
-for name, p in model.named_parameters():
-    if p.grad is not None:
-        g = p.grad
-        print(f"{name:40s} grad: mean={g.mean():+.2e}, std={g.std():.2e}, "
-              f"max={g.abs().max():.2e}, zero%={100*(g==0).float().mean():.0f}")
-    else:
-        print(f"{name:40s} grad: None")  # <-- not in computation graph!
-# Check: no None grads (disconnected), no all-zero grads (dead layer),
-# no huge grads (explosion), reasonable magnitude across layers.
-```
-
-**NaN/Inf detector hooks**
-```python
-def nan_hook(module, input, output):
-    def _check(t, label):
-        if isinstance(t, torch.Tensor) and (torch.isnan(t).any() or torch.isinf(t).any()):
-            raise RuntimeError(
-                f"NaN/Inf in {module.__class__.__name__} {label}, "
-                f"shape={t.shape}, nan={t.isnan().sum()}, inf={t.isinf().sum()}")
-    if isinstance(output, torch.Tensor):
-        _check(output, "output")
-    elif isinstance(output, dict):
-        for k, v in output.items():
-            _check(v, f"output[{k!r}]")
-    elif isinstance(output, (tuple, list)):
-        for i, o in enumerate(output):
-            _check(o, f"output[{i}]")
-
-for name, module in model.named_modules():
-    module.register_forward_hook(nan_hook)
-# Run one forward pass. First module to raise = source of the NaN.
-```
-
-**Random input test** [Slavv]
-```python
-# Pass random noise instead of real data. If loss/error behaves the same,
-# the data pipeline is destroying information before the model sees it.
-model.eval()
-real_batch = next(iter(train_loader))
-fake_input = torch.randn_like(real_batch['input'])
-with torch.no_grad():
-    real_out = model(real_batch['input'])
-    fake_out = model(fake_input)
-    real_loss = loss_fn(real_out, real_batch['target']).item()
-    fake_loss = loss_fn(fake_out, real_batch['target']).item()
-    print(f"Real input loss: {real_loss:.4f}")
-    print(f"Random input loss: {fake_loss:.4f}")
-# If similar: model isn't using the input. Check preprocessing, data loading, feature selection.
-# If very different: model sees real signal. Problem is elsewhere.
-```
-
-**Prime dimension trick** [Slavv]
-```python
-# Use prime/weird numbers for each dimension to catch silent broadcasting.
-# If batch=7, seq=13, hidden=17, any mismatched reshape/view that "works"
-# by accident with powers-of-2 will fail with primes.
-x = torch.randn(7, 13, 17)  # (batch=7, seq=13, hidden=17)
-out = model(x)
-print(f"in={x.shape} -> out={out.shape}")
-# If this crashes but normal shapes don't: you have a broadcasting bug.
-```
-
-**Class imbalance check**
-```python
-from collections import Counter
-all_labels = []
-for batch in train_loader:
-    labels = batch['target'] if isinstance(batch, dict) else batch[1]
-    all_labels.extend(labels.flatten().tolist())
-counts = Counter(all_labels)
-total = sum(counts.values())
-for cls, n in sorted(counts.items(), key=lambda x: -x[1]):
-    print(f" class {cls}: {n:6d} ({100*n/total:.1f}%)")
-# Ratio > 10:1 = likely need weighted loss or resampling.
-# Ratio > 100:1 = model will predict majority class and look "accurate".
-```
-
-**Confidence-sorted error inspection** [common practice, cf. FSDL error analysis]
-```python
-# Find the model's most confident wrong predictions. These reveal
-# systematic bugs (e.g., cropping cutting off relevant features).
-model.eval()
-errors = []
-with torch.no_grad():
-    for batch in val_loader:
-        logits = model(batch['input'])
-        probs = torch.softmax(logits, dim=-1)
-        confidence, predicted = probs.max(dim=-1)
-        wrong = predicted != batch['target']
-        for i in wrong.nonzero(as_tuple=True)[0]:
-            errors.append((confidence[i].item(), predicted[i].item(),
-                           batch['target'][i].item(), i.item()))
-errors.sort(reverse=True)  # most confident mistakes first
-for conf, pred, true, idx in errors[:10]:
-    print(f" conf={conf:.3f} predicted={pred} true={true} idx={idx}")
-# Inspect the actual inputs for these indices. Pattern = systematic bug.
-```
-
-**Weight/bias distribution check** [Slavv, CS231n]
-```python
-for name, p in model.named_parameters():
-    print(f"{name:40s} mean={p.data.mean():+.4f} std={p.data.std():.4f} "
-          f"min={p.data.min():+.4f} max={p.data.max():+.4f} "
-          f"shape={list(p.shape)}")
-# Healthy: roughly Gaussian, std ~0.01-1.0 depending on init scheme.
-# Bad signs: all zeros, huge values (>100), std ~0 (collapsed), NaN.
-# After training: weights diverging to +/-inf = exploding. All same value = dead.
-```
+> See [refs/diagnostics.md](refs/diagnostics.md) for copy-paste snippets. Includes: data pipeline sanity check, init loss check (with expected values per loss type), overfit-one-batch test, gradient flow check, NaN/Inf hooks, random input test, prime dimension trick, class imbalance check, confidence-sorted errors, weight/bias distributions.
 
 ### 6.3 Triage decision tree
 
diff --git a/refs/diagnostics.md b/refs/diagnostics.md
new file mode 100644
index 0000000..3850414
--- /dev/null
+++ b/refs/diagnostics.md
@@ -0,0 +1,169 @@
+# 6.2 Diagnostic code snippets
+
+Copy-paste these. Each tests one thing.
+
+**Data pipeline sanity check**
+```python
+batch = next(iter(train_loader))
+for k, v in (batch.items() if isinstance(batch, dict) else enumerate(batch)):
+    if isinstance(v, torch.Tensor):
+        print(f"{k}: shape={v.shape}, dtype={v.dtype}, "
+              f"range=[{v.min():.3f}, {v.max():.3f}], "
+              f"mean={v.float().mean():.3f}, std={v.float().std():.3f}, "
+              f"nan={v.isnan().sum()}, inf={v.isinf().sum()}")
+    else:
+        print(f"{k}: type={type(v)}, len={len(v) if hasattr(v, '__len__') else 'scalar'}")
+# Check: inputs ~mean 0, std 1? Labels in expected range? No NaN/Inf? Shapes match model?
+```
+
+**Init loss check**
+```python
+model.eval()
+with torch.no_grad():
+    batch = next(iter(train_loader))
+    out = model(batch['input'])  # adapt to your interface
+    loss = loss_fn(out, batch['target'])
+    print(f"Init loss: {loss.item():.4f}")
+
+# Expected init loss (random predictions):
+# - CrossEntropy, C classes: -ln(1/C) = ln(C)
+#   C=2: 0.693, C=10: 2.303, C=100: 4.605, C=1000: 6.908
+# - Binary CrossEntropy: -ln(0.5) = 0.693
+# - MSE (targets ~N(0,1)): ~1.0 (if init outputs ~0) or ~var(targets)
+# - L1 (targets ~N(0,1)): ~0.8
+#
+# If init loss << expected: model is cheating (data leakage, shortcut)
+# If init loss >> expected: wrong loss fn, bad init, or data pipeline broken
+```
+
+**Overfit-one-batch test**
+```python
+model.train()
+batch = next(iter(train_loader))
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+
+for step in range(200):
+    optimizer.zero_grad()
+    out = model(batch['input'])
+    loss = loss_fn(out, batch['target'])
+    loss.backward()
+    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 100.0)
+    optimizer.step()
+    if step % 20 == 0:
+        print(f"step {step:3d} loss={loss.item():.4f} grad_norm={grad_norm:.4f}")
+
+# Expected: loss drops to ~0 within 200 steps.
+# If not: model can't even memorize 1 batch -- architecture or gradient problem.
+```
+
+**Gradient flow check (per-layer)**
+```python
+loss.backward()
+for name, p in model.named_parameters():
+    if p.grad is not None:
+        g = p.grad
+        print(f"{name:40s} grad: mean={g.mean():+.2e}, std={g.std():.2e}, "
+              f"max={g.abs().max():.2e}, zero%={100*(g==0).float().mean():.0f}")
+    else:
+        print(f"{name:40s} grad: None")  # <-- not in computation graph!
+# Check: no None grads (disconnected), no all-zero grads (dead layer),
+# no huge grads (explosion), reasonable magnitude across layers.
+```
+
+**NaN/Inf detector hooks**
+```python
+def nan_hook(module, input, output):
+    def _check(t, label):
+        if isinstance(t, torch.Tensor) and (torch.isnan(t).any() or torch.isinf(t).any()):
+            raise RuntimeError(
+                f"NaN/Inf in {module.__class__.__name__} {label}, "
+                f"shape={t.shape}, nan={t.isnan().sum()}, inf={t.isinf().sum()}")
+    if isinstance(output, torch.Tensor):
+        _check(output, "output")
+    elif isinstance(output, dict):
+        for k, v in output.items():
+            _check(v, f"output[{k!r}]")
+    elif isinstance(output, (tuple, list)):
+        for i, o in enumerate(output):
+            _check(o, f"output[{i}]")
+
+for name, module in model.named_modules():
+    module.register_forward_hook(nan_hook)
+# Run one forward pass. First module to raise = source of the NaN.
+```
+
+**Random input test** [Slavv]
+```python
+# Pass random noise instead of real data. If loss/error behaves the same,
+# the data pipeline is destroying information before the model sees it.
+model.eval()
+real_batch = next(iter(train_loader))
+fake_input = torch.randn_like(real_batch['input'])
+with torch.no_grad():
+    real_out = model(real_batch['input'])
+    fake_out = model(fake_input)
+    real_loss = loss_fn(real_out, real_batch['target']).item()
+    fake_loss = loss_fn(fake_out, real_batch['target']).item()
+    print(f"Real input loss: {real_loss:.4f}")
+    print(f"Random input loss: {fake_loss:.4f}")
+# If similar: model isn't using the input. Check preprocessing, data loading, feature selection.
+# If very different: model sees real signal. Problem is elsewhere.
+```
+
+**Prime dimension trick** [Slavv]
+```python
+# Use prime/weird numbers for each dimension to catch silent broadcasting.
+# If batch=7, seq=13, hidden=17, any mismatched reshape/view that "works"
+# by accident with powers-of-2 will fail with primes.
+x = torch.randn(7, 13, 17)  # (batch=7, seq=13, hidden=17)
+out = model(x)
+print(f"in={x.shape} -> out={out.shape}")
+# If this crashes but normal shapes don't: you have a broadcasting bug.
+```
+
+**Class imbalance check**
+```python
+from collections import Counter
+all_labels = []
+for batch in train_loader:
+    labels = batch['target'] if isinstance(batch, dict) else batch[1]
+    all_labels.extend(labels.flatten().tolist())
+counts = Counter(all_labels)
+total = sum(counts.values())
+for cls, n in sorted(counts.items(), key=lambda x: -x[1]):
+    print(f" class {cls}: {n:6d} ({100*n/total:.1f}%)")
+# Ratio > 10:1 = likely need weighted loss or resampling.
+# Ratio > 100:1 = model will predict majority class and look "accurate".
+```
+
+**Confidence-sorted error inspection** [common practice, cf. FSDL error analysis]
+```python
+# Find the model's most confident wrong predictions. These reveal
+# systematic bugs (e.g., cropping cutting off relevant features).
+model.eval()
+errors = []
+with torch.no_grad():
+    for batch in val_loader:
+        logits = model(batch['input'])
+        probs = torch.softmax(logits, dim=-1)
+        confidence, predicted = probs.max(dim=-1)
+        wrong = predicted != batch['target']
+        for i in wrong.nonzero(as_tuple=True)[0]:
+            errors.append((confidence[i].item(), predicted[i].item(),
+                           batch['target'][i].item(), i.item()))
+errors.sort(reverse=True)  # most confident mistakes first
+for conf, pred, true, idx in errors[:10]:
+    print(f" conf={conf:.3f} predicted={pred} true={true} idx={idx}")
+# Inspect the actual inputs for these indices. Pattern = systematic bug.
+```
+
+**Weight/bias distribution check** [Slavv, CS231n]
+```python
+for name, p in model.named_parameters():
+    print(f"{name:40s} mean={p.data.mean():+.4f} std={p.data.std():.4f} "
+          f"min={p.data.min():+.4f} max={p.data.max():+.4f} "
+          f"shape={list(p.shape)}")
+# Healthy: roughly Gaussian, std ~0.01-1.0 depending on init scheme.
+# Bad signs: all zeros, huge values (>100), std ~0 (collapsed), NaN.
+# After training: weights diverging to +/-inf = exploding. All same value = dead.
+```
diff --git a/refs/static_analysis.md b/refs/static_analysis.md
new file mode 100644
index 0000000..0c13304
--- /dev/null
+++ b/refs/static_analysis.md
@@ -0,0 +1,114 @@
+# 6.1 Static analysis: grep for silent bugs
+
+Run these searches on the codebase before anything else. Each catches a common bug that produces no error but wrong results.
+
+**Shape mismatches (silent broadcasting)**
+```
+# Grep patterns:
+\.view\(|\.reshape\(  # check dims match intent
+unsqueeze\(|squeeze\(  # dimension insertion/removal
+\.expand\(|\.repeat\(  # broadcasting
+# Action: for every hit, trace the tensor shape backward. Add assert statements.
+```
+
+**Autograd breakers**
+```
+# Grep patterns:
+\.detach\(\)  # breaks gradient flow
+\.data\b  # bypasses autograd entirely
+with torch\.no_grad  # check this isn't wrapping training code
+\.item\(\)  # in a loss computation = broken
+\.numpy\(\)  # in forward pass = broken
+# Action: every .detach() should have a comment explaining WHY grad is intentionally stopped.
+```
+
+**Missing train/eval mode**
+```
+# Grep patterns:
+\.train\(\)  # count occurrences
+\.eval\(\)  # should pair with .train()
+# Action: verify .eval() before every val loop, .train() before every train loop.
+# Dropout and batchnorm behave differently -- this silently degrades results.
+```
+
+**In-place ops on tensors requiring grad**
+```
+# Grep patterns:
+\+=|\-=|\*=|/=  # in-place assignment on tensors
+\.add_\(|\.mul_\(|\.zero_\(  # in-place methods
+\[.*\]\s*=[^=]  # index assignment (excludes ==)
+# Action: in-place ops on leaf tensors with requires_grad=True corrupt autograd.
+# Replace x += y with x = x + y.
+```
+
+**Double softmax (softmax input to CrossEntropyLoss)**
+```
+# Grep patterns:
+CrossEntropyLoss|cross_entropy  # expects raw logits
+softmax|log_softmax|\.softmax  # if applied BEFORE CrossEntropyLoss = double softmax
+# Action: CrossEntropyLoss = log_softmax + NLLLoss internally.
+# If you softmax first, CE computes log_softmax(softmax(x)) -- the softmax
+# compresses logits into (0,1), so log_softmax sees near-uniform inputs.
+# Gradients vanish. Loss plateaus near ln(n_classes).
+```
+
+**Wrong optimizer step ordering**
+```
+# Grep patterns -- verify this exact order exists:
+# 1. optimizer.zero_grad()
+# 2. loss.backward()
+# 3. [optional: clip_grad_norm_]
+# 4. optimizer.step()
+# 5. [optional: scheduler.step()]
+# Common bugs: zero_grad after backward (kills grads), step before backward (stale grads),
+# scheduler.step() in wrong loop: per-epoch schedulers (StepLR, CosineAnnealingLR)
+# called per-batch = decays too fast. Per-step schedulers (OneCycleLR) called per-epoch = too slow.
+```
+
+**Broadcasting traps**
+```python
+# Diagnostic: print shapes at every binary operation between tensors of different ndim
+# Shapes (3,) and (3,1) silently broadcast to (3,3) -- probably not intended.
+# Shapes (B,1) and (B,N) broadcast fine but verify it's intentional.
+a = torch.randn(3)
+b = torch.randn(3, 1)
+print((a + b).shape)  # (3, 3) -- wanted (3,)?
+```
+
+**Wrong loss sign**
+```
+# Grep patterns:
+maximize|ascent  # gradient ascent when descent intended?
+\-\s*loss  # negating loss -- intentional (e.g., reward maximization)?
+1\.0\s*-\s*|1\s*-\s*  # 1 - metric as loss -- is the metric bounded [0,1]?
+# Action: verify that minimizing the loss = improving the metric you care about.
+```
+
+**Frozen parameters not intended**
+```
+# Grep patterns:
+requires_grad\s*=\s*False  # intentional freeze?
+\.freeze\(|\.requires_grad_  # parameter freezing
+for.*param.*\.parameters  # check nothing is skipped
+# Diagnostic:
+for name, p in model.named_parameters():
+    if not p.requires_grad:
+        print(f"FROZEN: {name}")
+```
+
+**Data leakage**
+```
+# Grep patterns:
+\.fit_transform\(  # on test data = leakage
+train_test_split.*shuffle=True  # for time series = leakage
+# Action: fit on train only, transform on both. Use temporal split for time series.
+```
+
+**Class imbalance**
+```
+# Grep patterns:
+CrossEntropyLoss\(\)  # no weight= argument? check if classes balanced
+weight=.*class  # existing balancing -- verify weights are correct
+# Diagnostic: count labels per class (see diagnostics.md "Class imbalance check").
+# 100:1 ratio with unweighted loss = model predicts majority class.
+```