folklore: tuning playbook, Domingos, Bekman loss spikes, Ng error analysis; LLM-judge bias appendix

- SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs, test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng 100-misclassified-examples quote under inspect-the-data
- refs/llm_judges.md: position/verbosity/self-preference biases (Zheng, Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs
- Lones pitfalls linked as the exhaustive 36-item do/don't checklist
- 6 new frozen evidence files; Hamel evals link in further reading

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:31:04 +08:00 · 2026-06-11 15:30:41 +08:00
parent 2a2f5045bb
commit 8cd3c61050
9 changed files with 298 additions and 1 deletion
@@ -0,0 +1,42 @@
+Source: https://github.com/stas00/ml-engineering — training/instabilities/README.md, training/instabilities/training-loss-patterns.md, debug/README.md (master branch)
+Title: "Machine Learning Engineering Open Book" — Stas Bekman (BLOOM-176B / IDEFICS-80B training lead at HF, ex-PyTorch)
+Fetched-via: curl of raw markdown from github, 2026-06-11
+Fetch-status: verbatim excerpts
+
+# ML Engineering Open Book — instabilities and loss patterns (excerpts)
+
+From "Understanding Training Loss Patterns":
+
+> Training loss plot is similar to the heart beat pattern - there is the good, the bad and you-should-worry one. After studying many training loss trajectories one develops an intuition to explain various loss behaviors during one's training and how to act on those.
+
+> I warn you that the "Understanding" in the title of this section is overloaded since very often we don't really understand why certain types of spikes happen. Here "understanding" refers to recognizing various patterns. We then usually have techniques to overcome the bad patterns and bring the training successfully to the finish line.
+
+> Thus you will find here a gallery of training loss patterns sometimes with real explanations, but more often than not educated guesses to what might be happening.
+
+The pre-BLOOM 104B failure story ("A very failed training"):
+
+> Prior to starting BLOOM-176B training we did multiple experiments with the 104B model. We failed to figure out how to not diverge very early on. [...] As you can see many attempts were made, many techniques were applied (see chronicles). We think the 2 main obstacles were using fp16 and data that had a lot of garbage in it. For BLOOM-176B we switched to bf16, used much cleaner data and also added an embedding layer-norm and that made all the difference.
+
+On loss spikes ("Main types of loss spikes"):
+
+> In general there are 3 types of loss spikes: 1. Fast recovering spikes 2. Slow recovering spikes 3. Not fully recovering spikes
+>
+> The spikes usually happen because of a bad data pocket, either due to badly shuffled data or because it hasn't been cleaned from some garbage scraped from the websites.
+
+From "Avoiding, Recovering From and Understanding Instabilities" — the init-std story:
+
+> Correctly initializing the initial distribution of the tensors can have a tremendous impact on training's stability. The `std` value isn't fixed and depends on the hidden dimension size.
+>
+> This proved to be a very crucial setting in our pre-BLOOM 104B experiments and we couldn't break past the first few thousands iterations until we figured out that the 0.02 default `--init-method-std` in Megatron-LM was a way too big for our model.
+
+(They settled on the 530B paper's `sqrt(1/(NHIDDEN*3))`: "for NHIDDEN=14336 the math was sqrt(1/(14336*3)) = 0.00482 and that's what we used. It surely wasn't the only reason why we had no stability issues during BLOOM-176B training, but I think it was one of the crucial ones.")
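A quick check of that arithmetic, as a sketch outside the quoted source. It assumes only the formula and hidden size given in the parenthetical above; the helper name `init_std` is made up for illustration and is not a Megatron-LM function.

```python
import math


def init_std(nhidden: int) -> float:
    """Init std from the quoted formula sqrt(1 / (NHIDDEN * 3))."""
    return math.sqrt(1.0 / (nhidden * 3))


bloom_std = init_std(14336)
print(f"{bloom_std:.5f}")  # 0.00482

# Ratio of the 0.02 default to the value the formula gives.
ratio = 0.02 / bloom_std
print(f"default is about {ratio:.1f}x larger")
```

The ratio makes the "way too big" remark concrete: at this hidden size the 0.02 default is roughly four times the value the formula produces.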
+
+On PaLM's spikes ("'Bad' combination of data batch and model parameter state"):
+
+> PaLM team observed dozens of loss spikes at "highly irregular intervals" when training larger models. While they were not able to track down the root cause, they mitigated the issue by restarting from an earlier checkpoint and skipping potentially problematic data batches.
+
+On reading training logbooks:
+
+> The best learning is to read Publicly available training LLM/VLM logbooks because there you can see exactly what happened and how the problem has been overcome.
+
+Debug section index (debug/README.md) — guides for: Debugging PyTorch programs; Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs; Network Debug; Troubleshooting NVIDIA GPUs; Underflow and Overflow Detection; plus tools (torch-distributed-gpu-test.py, NicerTrace).
@@ -0,0 +1,38 @@
+Source: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf (author's copy; CACM doi:10.1145/2347736.2347755)
+Title: "A Few Useful Things to Know About Machine Learning" — Pedro Domingos, Communications of the ACM, Oct 2012, vol. 55 no. 10
+Fetched-via: PDF downloaded from Domingos' UW page; excerpts transcribed by hand from the rendered pages (the 3-column CACM layout defeats text extraction)
+Fetch-status: verbatim excerpts
+
+# A Few Useful Things to Know About Machine Learning (excerpts)
+
+Standfirst and intro (p. 78) — the paper's stated purpose is writing down ML folk knowledge:
+
+> Tapping into the "folk knowledge" needed to advance machine learning applications.
+
+> Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell and Witten et al.). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.
+
+Key-insights box (p. 78):
+
+> developing successful machine learning applications requires a substantial amount of "black art" that is difficult to find in textbooks.
+
+"It's Generalization that Counts" (p. 80):
+
+> The fundamental goal of machine learning is to generalize beyond the examples in the training set. [...] Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success.
+
+> Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.)
+
+"Overfitting Has Many Faces" (p. 81):
+
+> What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called *overfitting*, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.
+
+> Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious.
+
+"Feature Engineering Is The Key" (p. 84):
+
+> At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.
+
+> First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.
+
+"More Data Beats a Cleverer Algorithm" pull quote (p. 84):
+
+> A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.
@@ -0,0 +1,38 @@
+Source: https://github.com/google-research/tuning_playbook (README.md, fetched from raw.githubusercontent.com main branch)
+Title: "Deep Learning Tuning Playbook" — Varun Godbole, George E. Dahl, Justin Gilmer, Christopher J. Shallue, Zachary Nado (Google Research / Harvard), 2023
+Fetched-via: curl of raw README.md, 2026-06-11
+Fetch-status: verbatim excerpts; bullet indentation flattened in places, content unchanged
+
+# Deep Learning Tuning Playbook (excerpts)
+
+From "Why a tuning playbook?":
+
+> Currently, there is an astonishing amount of toil and guesswork involved in actually getting deep neural networks to work well in practice. Even worse, the actual recipes people use to get good results with deep learning are rarely documented. Papers gloss over the process that led to their final results in order to present a cleaner story, and machine learning engineers working on commercial problems rarely have time to take a step back and generalize their process. [...] There is a vast gulf between the results achieved by deep learning experts and less skilled practitioners using superficially similar methods. At the same time, these very experts readily admit some of what they do might not be well-justified.
+
+From "The incremental tuning strategy":
+
+> ***Summary:*** *Start with a simple configuration and incrementally make improvements while building up insight into the problem. Make sure that any improvement is based on strong evidence to avoid adding unnecessary complexity.*
+
+> The most effective way to maximize performance is to start with a simple configuration and incrementally add features and make improvements while building up insight into the problem.
+
+> For each launch, we must make sure that the change is based on strong evidence – not just random chance based on a lucky configuration – so that we don't add unnecessary complexity to the training pipeline.
+
+From "Exploration vs exploitation":
+
+> ***Summary:*** *Most of the time, our primary goal is to gain insight into the problem.*
+
+> Although one might think we would spend most of our time trying to maximize performance on the validation set, in practice we spend the majority of our time trying to gain insight into the problem, and comparatively little time greedily focused on the validation error. In other words, we spend most of our time on "exploration" and only a small amount on "exploitation".
+
+> Prioritizing insight over short term gains can help us: Avoid launching unnecessary changes that happened to be present in well-performing runs merely through historical accident. Identify which hyperparameters the validation error is most sensitive to, which hyperparameters interact the most and therefore need to be re-tuned together, and which hyperparameters are relatively insensitive to other changes and can therefore be fixed in future experiments.
+
+From "Choosing the goal for the next round of experiments":
+
+> Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal: if we try to add multiple features or answer multiple questions at once, we may not be able to disentangle the separate effects on the results.
+
+From "Identifying scientific, nuisance, and fixed hyperparameters":
+
+> For a given goal, all hyperparameters will be either **scientific hyperparameters**, **nuisance hyperparameters**, or **fixed hyperparameters**. Scientific hyperparameters are those whose effect on the model's performance we're trying to measure. Nuisance hyperparameters are those that need to be optimized over in order to fairly compare different values of the scientific hyperparameters. This is similar to the statistical concept of nuisance parameters. Fixed hyperparameters will have their values fixed in the current round of experiments.
+
+> The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each number of layers (the optimal learning rate generally depends on the model architecture).
+
+> By fixing certain hyperparameters for a set of experiments, we must accept that conclusions derived from the experiments might not be valid for other settings of the fixed hyperparameters. In other words, fixed hyperparameters create caveats for any conclusions we draw from the experiments.
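The scientific/nuisance distinction above can be illustrated with a toy comparison. This is a sketch outside the quoted source: `val_error` is an invented stand-in for a real training run, and its shape (deeper is better, optimal learning rate shrinks with depth) is an assumption chosen to make the effect visible, not a claim about any real model.

```python
# Hypothetical stand-in for "train a model and report validation error".
def val_error(num_layers: int, lr: float) -> float:
    best_lr = 0.1 / num_layers   # invented: optimal lr shrinks with depth
    base = 1.0 / num_layers      # invented: deeper models are better here
    return base + 500.0 * (lr - best_lr) ** 2


layer_options = [2, 4, 8]        # scientific hyperparameter
lr_grid = [0.0125, 0.025, 0.05]  # nuisance hyperparameter

# Unfair comparison: one learning rate shared by every depth.
fixed_lr = {n: val_error(n, 0.05) for n in layer_options}

# Fair comparison: tune the learning rate separately for each depth,
# then compare the best result per depth.
tuned = {n: min(val_error(n, lr) for lr in lr_grid) for n in layer_options}

print("best depth with a shared lr:", min(fixed_lr, key=fixed_lr.get))
print("best depth with lr tuned per depth:", min(tuned, key=tuned.get))
```

In this toy the shared learning rate happens to suit the shallowest model, so the untuned comparison picks the wrong depth; optimizing over the nuisance hyperparameter per depth reverses the conclusion.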
@@ -0,0 +1,41 @@
+Source: arXiv abstracts via export.arxiv.org API + https://verdict.haizelabs.com/docs/best-practices/ and /docs/cookbook/distributional-bias/ (via jina reader)
+Title: LLM-as-a-judge biases — Zheng et al. 2023 (MT-Bench), Wang et al. 2023 (positional bias), Panickssery et al. 2024 (self-preference), plus Haize Labs' verdict practitioner notes
+Fetched-via: arXiv API abstracts verbatim; verdict docs verbatim via r.jina.ai, 2026-06-11
+Fetch-status: verbatim (abstracts in full or near-full; verdict pages are short and quoted nearly whole)
+
+# LLM judge biases (excerpts)
+
+## "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" — Zheng et al. (LMSYS), NeurIPS 2023 — https://arxiv.org/abs/2306.05685
+
+The canonical paper naming the bias taxonomy:
+
+> We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. [...] Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
+
+## "Large Language Models are not Fair Evaluators" — Wang et al., ACL 2024 — https://arxiv.org/abs/2305.17926
+
+Positional bias is large enough to flip rankings outright:
+
+> We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration [...]
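The second strategy in that abstract, aggregating results across orders, can be sketched as follows. This is an illustration outside the quoted source, not the paper's implementation: the `Judge` signature, `balanced_scores`, and the toy `biased_judge` (which adds a fixed bonus to whichever response is shown first) are all assumptions.

```python
from typing import Callable, Tuple

# A judge takes (first_response, second_response) and returns their scores
# in the same order. Hypothetical interface for illustration.
Judge = Callable[[str, str], Tuple[float, float]]


def balanced_scores(judge: Judge, a: str, b: str) -> Tuple[float, float]:
    """Score the pair in both orders and average each response's scores."""
    a_first, b_second = judge(a, b)   # a shown first
    b_first, a_second = judge(b, a)   # b shown first
    return (a_first + a_second) / 2, (b_second + b_first) / 2


# Toy judge with an invented position bias: +2 for the first slot.
QUALITY = {"resp_a": 6.0, "resp_b": 7.0}


def biased_judge(first: str, second: str) -> Tuple[float, float]:
    return QUALITY[first] + 2.0, QUALITY[second]


print(biased_judge("resp_a", "resp_b"))                   # (8.0, 7.0)
print(balanced_scores(biased_judge, "resp_a", "resp_b"))  # (7.0, 8.0)
```

A single-order call ranks `resp_a` higher purely because of its slot; averaging over both orders spreads the bias evenly and restores the underlying ranking. It does not remove the bias from the absolute scores.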
+
+## "LLM Evaluators Recognize and Favor Their Own Generations" — Panickssery, Bowman, Feng (NYU/MATS), 2024 — https://arxiv.org/abs/2404.13076
+
+Self-preference correlates with self-recognition, and the authors argue the link is causal:
+
+> One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. [...] We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders.
+
+## Haize Labs, verdict docs — practitioner mitigation notes
+
+"Best Practices / Learnings" (https://verdict.haizelabs.com/docs/best-practices/), quoted nearly whole:
+
+> - ask for an explanation/justification (**before** the score)
+> - hierarchical verifier is a must — try a different model for the verifier to avoid self-preference bias
+> - study the output distribution of provider models carefully — for example, we find that the gpt-4o family of models has an upward skew for numerical scales and exhibit mode collapse even when using logprobs — likely due to their user-facing alignment tuning. llama models exhibit higher-entropy distributions (more filled out) — this provides more expressiveness and discriminative power
+> - watch for any positional bias -- flip scales, shuffle positions, etc.
+
+"Distributional Bias in LLM-as-a-Judge" cookbook (https://verdict.haizelabs.com/docs/cookbook/distributional-bias/):
+
+> Note that using the same model for the initial judge and verification judge will result in a positive-skew that may not discriminate faithfully between good and bad explanations.
+
+> Constrained decoding methods for structured outputs (e.g., JSON-mode) impose an inductive bias on the model's output distribution.
+
+Related: JudgeBench leaderboard for judge quality — https://huggingface.co/spaces/ScalerLab/JudgeBench
@@ -0,0 +1,30 @@
+Source: https://arxiv.org/abs/2108.02497 (v5; updated annually since 2021)
+Title: "How to avoid machine learning pitfalls: a guide for academic researchers" — Michael A. Lones (Heriot-Watt University)
+Fetched-via: PDF downloaded from arxiv.org, text extracted with pdfplumber, 2026-06-11
+Fetch-status: verbatim excerpts; line breaks rejoined, ligature artifacts fixed
+
+# How to avoid machine learning pitfalls (excerpts)
+
+Abstract:
+
+> Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning. This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
+
+Structure (from the introduction):
+
+> The review is divided into five sections. *Before you start to build models* covers issues that can occur early in the ML process, and focuses on the correct use of data and adequate consideration of the context in which ML is being applied. *How to reliably build models* then covers pitfalls that occur during the selection and training of models and their components. *How to robustly evaluate models* presents pitfalls that can lead to an incorrect understanding of model performance. *How to compare models fairly* then extends this to the situation where models are being compared, discussing how common pitfalls can lead to misleading findings. *How to report your results* focuses on reproducibility and factors that can lead to incomplete or deceptive reporting.
+
+The full do/don't list (table of contents, v5), which is what makes the paper usable as an exhaustive checklist:
+
+> 2.1 Do think about how and where you will use data / 2.2 Do take the time to understand your data / 2.3 Don't look at all your data / 2.4 Do clean your data / 2.5 Do make sure you have enough data / 2.6 Do talk to domain experts / 2.7 Do survey the literature / 2.8 Do think about how your model will be deployed
+> 3.1 Don't allow test data to leak into the training process / 3.2 Do try out a range of different models / 3.3 Don't use inappropriate models / 3.4 Do keep up with progress in deep learning (and its pitfalls) / 3.5 Don't assume deep learning will be the best approach / 3.6 Do be careful where and how you do feature selection / 3.7 Do optimise your model's hyperparameters / 3.8 Do avoid learning spurious correlations
+> 4.1 Do use an appropriate test set / 4.2 Don't do data augmentation before splitting your data / 4.3 Do avoid sequential overfitting / 4.4 Do evaluate a model multiple times / 4.5 Do save some data to evaluate your final model instance / 4.6 Do choose metrics carefully / 4.7 Do consider model fairness / 4.8 Don't ignore temporal dependencies in time series data
+> 5.1 Don't assume a bigger number means a better model / 5.2 Do use meaningful baselines / 5.3 Do use statistical tests when comparing models / 5.4 Do correct for multiple comparisons / 5.5 Don't always believe results from community benchmarks / 5.6 Do combine models (carefully)
+> 6.1 Do be transparent / 6.2 Do report performance in multiple ways / 6.3 Don't generalise beyond the data / 6.4 Do be careful when reporting statistical significance / 6.5 Do look at your models / 6.6 Do use a machine learning checklist
+
+Section 3.1, "Don't allow test data to leak into the training process":
+
+> A common problem is allowing information about this data to leak into the configuration, training or selection of models. When this happens, the data no longer provides a reliable measure of generality, and this is a common reason why published ML models often fail to generalise to real world data. There are a number of ways that information can leak from a test set. Some of these seem quite innocuous. For instance, during data preparation, using information about the means and ranges of variables within the whole data set to carry out variable scaling or imputation — in order to prevent information leakage, these statistics should be calculated using only the training data. [...] The best thing you can do to prevent these issues is to partition off a subset of your data right at the start of your project, and only use this independent test set once to measure the generality of a single model at the end.
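The scaling example in that passage can be sketched with the standard library. This is an illustration outside the quoted source; the data values and the helper names `fit_scaler` and `transform` are invented for the example.

```python
import statistics


def fit_scaler(train):
    """Compute scaling statistics from the training split only."""
    return statistics.fmean(train), statistics.pstdev(train)


def transform(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Partition first, before computing any statistics.
train, test = values[:4], values[4:]

mean, std = fit_scaler(train)            # fitted on train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# The leaky alternative: statistics over the whole data set, so the
# held-out point shifts the numbers the model is trained with.
leaky_mean = statistics.fmean(values)
print(mean, leaky_mean)
```

The two means differ because the held-out value is an outlier; any model trained on data scaled with the second mean has already seen information about the test point.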
+
+Section 4.8, "Don't ignore temporal dependencies in time series data":
+
+> Most notably, time series data are subject to a particular kind of data leakage known as look ahead bias. This occurs when some or all of the data points used to train the model occur later in the time series than those used to test the model. In effect, this can allow knowledge of the future to leak into training, and this can then bias the test performance. A situation where this commonly occurs is when standard cross-validation is applied to time series data, since it results in the training folds in all but one of the cross-validation iterations containing data that is in the future relative to the test fold.
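One common way to avoid the look-ahead problem described above is a forward-chaining split, where each test fold comes strictly after its training data. This is a sketch outside the quoted source, and the function name `forward_chaining_splits` is made up; the paper does not prescribe this exact scheme.

```python
def forward_chaining_splits(n_samples: int, n_folds: int):
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Training indices always precede test indices. Any remainder samples
    at the end are left unused when n_samples is not evenly divisible.
    """
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold))
        test_idx = list(range(k * fold, (k + 1) * fold))
        yield train_idx, test_idx


splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Unlike standard k-fold cross-validation, no training fold here contains samples that come later than the test fold, at the cost of smaller training sets in the early splits.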
@@ -0,0 +1,32 @@
+Source: https://github.com/ajaymache/machine-learning-yearning/blob/master/full%20book/machine-learning-yearning.pdf (mirror of the draft Andrew Ng distributed via deeplearning.ai mailing list, 2018; never formally published)
+Title: "Machine Learning Yearning" (draft) — Andrew Ng, chapters 13-19 (basic error analysis)
+Fetched-via: PDF downloaded from the github mirror, text extracted with pdfplumber, 2026-06-11
+Fetch-status: verbatim excerpts; line breaks rejoined
+
+# Machine Learning Yearning — basic error analysis (excerpts)
+
+Chapter 13, "Build your first system quickly, then iterate" (p. 29):
+
+> So don't start off trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps in just a few days. Even if the basic system is far from the "best" system you can build, it is valuable to examine how the basic system functions: you will quickly find clues that show you the most promising directions in which to invest your time.
+
+Chapter 14, "Error analysis: Look at dev set examples to evaluate ideas" (pp. 30-31):
+
+> Before investing a month on this task, I recommend that you first estimate how much it will actually improve the system's accuracy. [...] In detail, here's what you can do:
+> 1. Gather a sample of 100 dev set examples that your system misclassified. I.e., examples that your system made an error on.
+> 2. Look at these examples manually, and count what fraction of them are dog images.
+
+> Error analysis can often help you figure out how promising different directions are. I've seen many engineers reluctant to carry out error analysis. It often feels more exciting to just jump in and implement some idea, rather than question if the idea is worth the time investment. This is a common mistake: It might result in your team spending a month only to realize afterward that it resulted in little benefit.
+
+> Manually examining 100 examples does not take long. Even if you take one minute per image, you'd be done in under two hours. These two hours could save you a month of wasted effort.
+
+Chapter 15, "Evaluating multiple ideas in parallel during error analysis" (p. 32):
+
+> You can efficiently evaluate all of these ideas in parallel. I usually create a spreadsheet and fill it out while looking through ~100 misclassified dev set images. I also jot down comments that might help me remember specific examples. [...] once you start looking through examples, you will probably be inspired to propose new error categories.
+
+Chapter 19, "Takeaways: Basic error analysis" (p. 40):
+
+> When you start a new project, especially if it is in an area in which you are not an expert, it is hard to correctly guess the most promising directions.
+
+> Carry out error analysis by manually examining ~100 dev set examples the algorithm misclassifies and counting the major categories of errors. Use this information to prioritize what types of errors to work on fixing.
+
+> Consider splitting the dev set into an Eyeball dev set, which you will manually examine, and a Blackbox dev set, which you will not manually examine. If performance on the Eyeball dev set is much better than the Blackbox dev set, you have overfit the Eyeball dev set and should consider acquiring more data for it.
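The sample-and-count procedure from chapters 14, 15 and 19 can be sketched as below. This is an illustration outside the quoted source: the record layout (`pred`, `label`), the helper names, and the example tags are assumptions, and the tagging step itself is still done by hand.

```python
import random
from collections import Counter


def sample_errors(examples, k=100, seed=0):
    """Pick up to k misclassified dev-set examples for manual review."""
    errors = [e for e in examples if e["pred"] != e["label"]]
    return random.Random(seed).sample(errors, min(k, len(errors)))


def tally(tags_per_example):
    """Fraction of reviewed examples carrying each hand-assigned tag."""
    counts = Counter(tag for tags in tags_per_example for tag in tags)
    n = len(tags_per_example)
    return {tag: c / n for tag, c in counts.most_common()}


# Toy dev set: every third example is misclassified.
dev = [{"id": i, "pred": int(i % 3 == 0), "label": 0} for i in range(30)]
to_review = sample_errors(dev, k=100)

# Tags a reviewer might jot down per example; one example can carry
# several tags, so the fractions need not sum to 1.
notes = [["dog"], ["blurry"], ["dog", "blurry"], ["great cat"]]
fractions = tally(notes)
print(len(to_review), fractions)
```

The resulting fractions put an upper bound on what fixing each category could gain, which is the estimate the excerpt recommends making before investing in any one direction.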