ml-debug/docs/evidence/fsdl_spring2021_lecture7.md
wassname 4393cceefd initial: ML debugging folklore skill
Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.

Author: wassname (https://github.com/wassname)
2026-03-06 10:11:30 +08:00

Source: https://fullstackdeeplearning.com/spring2021/lecture-7/
Title: FSDL Spring 2021 - Lecture 7: Troubleshooting Deep Neural Networks
Fetched-via: uvx markitdown https://fullstackdeeplearning.com/spring2021/lecture-7/
Fetch-status: verbatim
Table of contents
* [Video](#video)
* [Slides](#slides)
* [Notes](#notes)
+ [1 - Why Is Deep Learning Troubleshooting Hard?](#1-why-is-deep-learning-troubleshooting-hard)
+ [2 - Strategy to Debug Neural Networks](#2-strategy-to-debug-neural-networks)
+ [3 - Start Simple](#3-start-simple)
- [Choose A Simple Architecture](#choose-a-simple-architecture)
- [Use Sensible Defaults](#use-sensible-defaults)
- [Normalize Inputs](#normalize-inputs)
- [Simplify The Problem](#simplify-the-problem)
+ [4 - Implement and Debug](#4-implement-and-debug)
- [Get Your Model To Run](#get-your-model-to-run)
- [Overfit A Single Batch](#overfit-a-single-batch)
- [Compare To A Known Result](#compare-to-a-known-result)
+ [5 - Evaluate](#5-evaluate)
- [Bias-Variance Decomposition](#bias-variance-decomposition)
- [Distribution Shift](#distribution-shift)
+ [6 - Improve Model and Data](#6-improve-model-and-data)
- [Step 1: Address Underfitting](#step-1-address-underfitting)
- [Step 2: Address Overfitting](#step-2-address-overfitting)
- [Step 3: Address Distribution Shift](#step-3-address-distribution-shift)
* [Error Analysis](#error-analysis)
* [Domain Adaptation](#domain-adaptation)
- [Step 4: Rebalance datasets](#step-4-rebalance-datasets)
+ [7 - Tune Hyperparameters](#7-tune-hyperparameters)
- [Techniques for Tuning Hyperparameter Optimization](#techniques-for-tuning-hyperparameter-optimization)
+ [8 - Conclusion](#8-conclusion)
# Lecture 7: Troubleshooting Deep Neural Networks
## Video
## Slides
[Download slides as PDF](https://drive.google.com/file/d/1yXQCnGGp3wWdoCf6nSP5b758cXF92rtg/view?usp=sharing)
## Notes
*Lecture by [Josh Tobin](http://josh-tobin.com).
Notes transcribed by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).*
In traditional software engineering, a bug usually leads to the program
crashing. While this is annoying for the user, it is critical for the
developer to inspect the errors to understand why. With deep learning,
we sometimes encounter errors, but all too often, the program crashes
without a clear reason why. While these issues can be debugged manually,
deep learning models most often fail because of poor output predictions.
What's worse is that when the model performance is low, there is usually
no signal about why or when the models failed.
A common sentiment among practitioners is that they spend **80-90% of
time debugging and tuning the models** and only 10-20% of time deriving
math equations and implementing things. Andrej Karpathy echoes this
view, [as seen in this
tweet](https://twitter.com/karpathy/status/423990618289733632).
### 1 - Why Is Deep Learning Troubleshooting Hard?
Suppose you are trying to reproduce a research paper result for your
work, but your results are worse. You might wonder why your model's
performance is significantly worse than that of the paper you're trying
to reproduce.
![](/spring2021/lecture-7-notes-media/image3.png)
Many different things can cause this:
* It can be **implementation bugs**. Most bugs in deep learning are
actually invisible.
* **Hyper-parameter choices** can also cause your performance to
degrade. Deep learning models are very sensitive to
hyper-parameters. Even very subtle choices of learning rate and
weight initialization can make a big difference.
* Performance can also be worse just because of **data/model fit**.
For example, you pre-train your model on ImageNet data and fit it
on self-driving car images, which are harder to learn.
* Finally, poor model performance could be caused not by your model
but your **dataset construction**. Typical issues here include not
having enough examples, dealing with noisy labels and imbalanced
classes, splitting train and test set with different
distributions.
### 2 - Strategy to Debug Neural Networks
The key idea of deep learning troubleshooting is: *Since it is hard to
disambiguate errors, it's best to start simple and gradually ramp up
complexity.*
This lecture provides **a decision tree for debugging deep learning
models and improving performance**. This guide assumes that you already
have an initial test dataset, a single metric to improve, and target
performance based on human-level performance, published results,
previous baselines, etc.
![](/spring2021/lecture-7-notes-media/image4.png)
### 3 - Start Simple
The first step in the troubleshooting workflow is **starting simple**.
#### Choose A Simple Architecture
There are a few things to consider when you want to start simple. The
first is how to **choose a simple architecture**. These are
architectures that are easy to implement and are likely to get you part
of the way towards solving your problem without introducing as many
bugs.
Architecture selection is one of the many intimidating parts of getting
into deep learning because there are tons of papers coming out all the
time and claiming to be state-of-the-art on some problems. They get very
complicated fast. In the limit, if you're trying to get to maximal
performance, then architecture selection is challenging. But when
starting on a new problem, you can just follow a simple set of rules
that will allow you to pick an architecture that enables you to do a
decent job on the problem you're working on.
* If your data looks like **images**, start with a LeNet-like
architecture and consider using something like ResNet as your
codebase gets more mature.
* If your data looks like **sequences**, start with an LSTM with one
hidden layer and/or temporal/classical convolutions. Then, when
your problem gets more mature, you can move to an Attention-based
model or a WaveNet-like model.
* For **all other tasks**, start with a fully-connected neural network
with one hidden layer and use more advanced networks later,
depending on the problem.
![](/spring2021/lecture-7-notes-media/image7.png)
In reality, the input data often combines several of the types above. So
how do you feed **multiple input modalities** into a neural network?
Here is the 3-step strategy that we recommend:
* First, map each of these modalities into a lower-dimensional feature
space. In the example above, the images are passed through a
ConvNet, and the words are passed through an LSTM.
* Then we flatten the outputs of those networks to get a single vector
for each of the inputs that will go into the model. Then we
concatenate those inputs.
* Finally, we pass them through some fully-connected layers to an
output.
#### Use Sensible Defaults
After choosing a simple architecture, the next thing to do is to
**select sensible hyper-parameter defaults** to start with. Here are the
defaults that we recommend:
* [Adam optimizer with a “magic” learning rate value of
3e-4](https://twitter.com/karpathy/status/801621764144971776?lang=en).
* [ReLU](https://stats.stackexchange.com/questions/226923/why-do-we-use-relu-in-neural-networks-and-how-do-we-use-it)
activation for fully-connected and convolutional models and
[Tanh](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function)
activation for LSTM models.
* [He initialization for ReLU activation function and Glorot
initialization for Tanh activation
function](https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are).
* No regularization and no data normalization to start with.
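As a rough illustration of the initialization defaults above, the sketch below computes the standard deviations usually associated with He and Glorot normal initialization. The function names, layer sizes, and the `DEFAULTS` dictionary are hypothetical. Frameworks such as PyTorch and Keras ship their own initializers, which are preferable in practice.

```python
import math
import random

def he_normal_std(fan_in: int) -> float:
    """Standard deviation for He normal init: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def glorot_normal_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation for Glorot normal init: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a fan_in x fan_out weight matrix from a zero-mean Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

# Hypothetical starting configuration mirroring the defaults listed above.
DEFAULTS = {"optimizer": "adam", "learning_rate": 3e-4, "regularization": None}

relu_std = he_normal_std(fan_in=256)                    # for a ReLU layer
tanh_std = glorot_normal_std(fan_in=256, fan_out=128)   # for a Tanh layer
weights = init_weights(256, 128, relu_std)
```

The point of starting from values like these is that any later change (a new optimizer, added regularization) can be compared against a known, boring baseline.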
#### Normalize Inputs
The next step is to **normalize the input data**, subtracting the mean
and dividing by the standard deviation. Note that for images, it's fine
to scale values to [0, 1] or [-0.5, 0.5] (for example, by dividing by
255 and, for the second range, subtracting 0.5).
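A minimal sketch of input normalization, assuming each feature is standardized with its own mean and standard deviation (the usual convention) and that images are stored as 8-bit intensities. The helper names and sample values are made up for illustration.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def scale_pixels(pixels):
    """Map 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return [p / 255 for p in pixels]

feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = standardize(feature)
pixels = scale_pixels([0, 128, 255])
```

In a real pipeline the mean and standard deviation would be computed on the training set only and then reused for validation and test data.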
#### Simplify The Problem
The final thing you should do is consider **simplifying the problem**
itself. If you have a complicated problem with massive data and tons of
classes to deal with, then you should consider:
* Working with a small training set around 10,000 examples.
* Using a fixed number of objects, classes, input size, etc.
* Creating a simpler synthetic training set like in research labs.
This is important because (1) you will have reasonable confidence that
your model should be able to solve the problem, and (2) your iteration
speed will increase.
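One way to simplify the problem is to carve out a small, fixed subset of the data. The sketch below is an assumption about how that could look: `subsample`, its parameters, and the toy dataset are hypothetical, and the fixed seed is there so every debugging run sees the same examples.

```python
import random

def subsample(examples, labels, n=10_000, keep_classes=None, seed=0):
    """Return a reproducible subset of at most n (example, label) pairs.

    keep_classes optionally restricts the problem to a fixed set of classes.
    """
    pairs = [
        (x, y)
        for x, y in zip(examples, labels)
        if keep_classes is None or y in keep_classes
    ]
    rng = random.Random(seed)  # fixed seed: same subset on every run
    rng.shuffle(pairs)
    return pairs[:n]

# Hypothetical toy dataset: 100,000 integer "examples" spread over 10 classes.
examples = list(range(100_000))
labels = [i % 10 for i in examples]
small = subsample(examples, labels, n=10_000, keep_classes={0, 1, 2})
```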
The diagram below neatly summarizes how to start simple:
![](/spring2021/lecture-7-notes-media/image6.png)
### 4 - Implement and Debug
To give you a preview, below are the five most common bugs in deep
learning models that we recognize:
* **Incorrect shapes for the network tensors**: This bug is a common
one and can fail silently. This happens many times because the
automatic differentiation systems in the deep learning framework
do silent broadcasting. Tensors become different shapes in the
network and can cause a lot of problems.
* **Pre-processing inputs incorrectly**: For example, you forget to
normalize your inputs or apply too much input pre-processing
(over-normalization and excessive data augmentation).
* **Incorrect input to the model's loss function**: For example, you
pass softmax outputs to a loss that expects logits.
* **Forgetting to set up train mode for the network correctly**: For
example, toggling train/evaluation mode or controlling batch norm
dependencies.
* **Numerical instability**: For example, you get `inf` or `NaN`
as outputs. This bug often stems from using an exponent, a log, or
a division operation somewhere in the code.
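Two of the bugs above can be reproduced in a few lines of plain Python. This sketch is illustrative only and the function names are hypothetical. It shows a numerically stable cross-entropy computed from logits, a naive softmax that overflows on large logits, and how passing probabilities to a loss that expects logits produces a wrong value without any error.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy for one example; expects raw logits, not probabilities."""
    return -log_softmax(logits)[target_index]

def naive_softmax(logits):
    """Unstable version: math.exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Numerical instability: the naive version fails, the stable one does not.
big_logits = [1000.0, 0.0, -1000.0]
stable_loss = cross_entropy_from_logits(big_logits, target_index=0)
try:
    naive_softmax(big_logits)
    naive_failed = False
except OverflowError:
    naive_failed = True

# Silent bug: softmax outputs passed to a loss that expects logits.
logits = [2.0, 0.0, -2.0]
probs = [math.exp(v) for v in log_softmax(logits)]
correct_loss = cross_entropy_from_logits(logits, 0)
buggy_loss = cross_entropy_from_logits(probs, 0)  # runs fine, but the value is wrong
```

This is the reason the advice later in this section favors built-in framework losses, which typically take logits and apply the stable formulation internally.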
Here are three pieces of general advice for implementing your model:
* **Start with a lightweight implementation**. You want minimum
possible new lines of code for the 1st version of your model. The
rule of thumb is less than 200 lines. This doesn't count tested
infrastructure components or TensorFlow/PyTorch code.
* **Use off-the-shelf components** such as Keras if possible, since
most of the stuff in Keras works well out-of-the-box. If you have
to use TensorFlow, use the built-in functions; don't do the math
yourself. This would help you avoid a lot of numerical instability
issues.
* **Build complicated data pipelines later**. These are important for
large-scale ML systems, but you should not start with them because
data pipelines themselves can be a big source of bugs. Just start
with a dataset that you can load into memory.
![](/spring2021/lecture-7-notes-media/image11.png)
#### Get Your Model To Run
The first step of implementing bug-free deep learning models is
**getting your model to run at all**. There are a few things that can
prevent this from happening:
* **Shape mismatch/casting issue**: To address this type of problem,
you should step through your model creation and inference
step-by-step in a debugger, checking for correct shapes and data
types of your tensors.
* **Out-of-memory issues**: This can be very difficult to debug. You
can scale back your memory-intensive operations one-by-one. For
example, if you create large matrices anywhere in your code, you
can reduce the size of their dimensions or cut your batch size in
half.
* **Other issues**: You can simply Google it. Stack Overflow would be
great most of the time.
Let's zoom in on the process of stepping through model creation in a
debugger and talk about **debuggers for deep learning code**:
* In PyTorch, you can use
[ipdb](https://pypi.org/project/ipdb/), which exports
functions to access the interactive
[IPython](http://ipython.org/) debugger.
* In TensorFlow, it's trickier. TensorFlow separates the process of
creating the graph and executing operations in the graph. There
are three options you can try: (1) step through the graph creation
itself and inspect each tensor layer, (2) step into the training
loop and evaluate the tensor layers, or (3) use [TensorFlow
Debugger](https://mullikine.github.io/posts/tensorflow-debugger-tfdb-and-emacs/)
(tfdbg), which does options 1 and 2 automatically.
![](/spring2021/lecture-7-notes-media/image14.png)
#### Overfit A Single Batch
After getting your model to run, the next thing you need to do is to
**overfit a single batch of data**. This is a heuristic that can catch
an absurd number of bugs. This really means that you want to drive your
training error arbitrarily close to 0.
There are a few things that can happen when you try to overfit a single
batch and it fails:
* **Error goes up**: Commonly, this is due to a flipped sign somewhere
in the loss function/gradient.
* **Error explodes**: This is usually a numerical issue but can also
be caused by a high learning rate.
* **Error oscillates**: You can lower the learning rate and inspect
the data for shuffled labels or incorrect data augmentation.
* **Error plateaus**: You can increase the learning rate and get rid
of regularization. Then you can inspect the loss function and the
data pipeline for correctness.
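The single-batch check can be sketched without any framework. The example below is a stand-in, not the lecture's code: it trains a tiny logistic regression with plain gradient descent on one fixed, hand-written batch and records the loss. In a real project you would run your actual model and optimizer on one real batch instead. The batch, learning rate, and step count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_loss(weights, bias, batch):
    """Mean binary cross-entropy over the batch."""
    total = 0.0
    for x, y in batch:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        p = min(max(p, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(batch)

def overfit_single_batch(batch, steps=2000, lr=0.5):
    """Run gradient descent on one fixed batch and return the loss history."""
    dim = len(batch[0][0])
    weights, bias = [0.0] * dim, 0.0
    history = []
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in batch:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the loss with respect to the pre-activation
            for i, xi in enumerate(x):
                grad_w[i] += err * xi / len(batch)
            grad_b += err / len(batch)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
        history.append(batch_loss(weights, bias, batch))
    return history

# One small, linearly separable batch of (features, label) pairs.
batch = [
    ([1.0, 2.0], 1), ([2.0, 1.0], 1), ([1.5, 1.5], 1), ([2.0, 2.5], 1),
    ([-1.0, -2.0], 0), ([-2.0, -1.0], 0), ([-1.5, -1.5], 0), ([-2.0, -2.5], 0),
]
history = overfit_single_batch(batch)
final_loss = history[-1]
```

If the loss history rises, explodes, oscillates, or flattens out well above zero, the failure modes listed above suggest where to look first.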
![](/spring2021/lecture-7-notes-media/image10.png)
#### Compare To A Known Result
Once your model overfits a single batch, there can still be some
other issues that cause bugs. The last step here is to **compare your
results to a known result**. So what sort of known results are useful?
* The most useful results come from **an official model implementation
evaluated on a similar dataset to yours**. You can step through
the code in both models line-by-line and ensure your model has the
same output. You want to ensure that your model performance is up
to par with expectations.
* If you can't find an official implementation on a similar dataset,
you can compare your approach to results from **an official model
implementation evaluated on a benchmark dataset**. You most
definitely want to walk through the code line-by-line and ensure
you have the same output.
* If there is no official implementation of your approach, you can
compare it to results from **an unofficial model implementation**.
You can review the code the same as before but with lower
confidence (because almost all the unofficial implementations on
GitHub have bugs).
* Then, you can compare to results from **a paper with no code** (to
ensure that your performance is up to par with expectations),
results from **your model on a benchmark dataset** (to make sure
your model performs well in a simpler setting), and results from
**a similar model on a similar dataset** (to help you get a
general sense of what kind of performance can be expected).
* An under-rated source of results comes from **simple baselines**
(for example, the average of outputs or linear regression), which
can help make sure that your model is learning anything at all.
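The simple-baseline comparison from the last bullet might look like the following sketch for a regression task. All numbers are invented, and `model_predictions` stands in for the outputs of whatever model is being debugged.

```python
import statistics

def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return statistics.fmean([(p - t) ** 2 for p, t in zip(predictions, targets)])

def mean_baseline(train_targets, n):
    """Predict the training-set mean for every one of n examples."""
    mean = statistics.fmean(train_targets)
    return [mean] * n

train_targets = [3.0, 5.0, 7.0, 9.0]
val_targets = [4.0, 6.0, 8.0]
model_predictions = [4.5, 5.5, 8.5]  # hypothetical outputs of the model under test

baseline_error = mse(mean_baseline(train_targets, len(val_targets)), val_targets)
model_error = mse(model_predictions, val_targets)
beats_baseline = model_error < baseline_error
```

A model that cannot beat a constant prediction is most likely not learning from its inputs at all, which points back to the bugs discussed earlier.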
The diagram below neatly summarizes how to implement and debug deep
neural networks:
![](/spring2021/lecture-7-notes-media/image8.png)
### 5 - Evaluate
#### Bias-Variance Decomposition
To evaluate models and prioritize the next steps in model development,
we will apply the bias-variance decomposition. The [bias-variance
decomposition](http://scott.fortmann-roe.com/docs/BiasVariance.html)
is the fundamental model fitting tradeoff. In our application, let's
talk more specifically about the formula for bias-variance tradeoff with
respect to the **test error**; this will help us apply the concept more
directly to our model's performance. There are four terms in the formula
for test error:
*Test error = irreducible error + bias + variance + validation
overfitting*
1. **Irreducible error** is the baseline error that you don't expect
your model to improve on. It can be estimated through strong baselines,
like human performance.
2. **Avoidable bias**, a measure of underfitting, is the difference
between our train error and irreducible error.
3. **Variance**, a measure of overfitting, is the difference between
validation error and training error.
4. **Validation set overfitting** is the difference between test error
and validation error.
Consider the chart of learning curves and errors below. Using the test
error formula for bias and variance, we can calculate each component of
test error and make decisions based on the value. For example, our
avoidable bias is rather low (only 2 points), while the variance is much
higher (5 points). With this knowledge, we should prioritize methods of
preventing overfitting, like regularization.
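The four-term formula can be turned into a small calculation. The error rates below are hypothetical, chosen only so that the avoidable bias comes out at 2 points and the variance at 5 points as in the example. They are not read from the chart.

```python
def decompose_test_error(irreducible, train, val, test):
    """Split test error into the four terms of the formula (percentage points)."""
    return {
        "irreducible_error": irreducible,
        "avoidable_bias": train - irreducible,
        "variance": val - train,
        "validation_overfitting": test - val,
    }

# Hypothetical error rates in percent.
parts = decompose_test_error(irreducible=25, train=27, val=32, test=33)

# The largest reducible term suggests what to work on next.
reducible = ("avoidable_bias", "variance", "validation_overfitting")
largest = max(reducible, key=parts.get)
```

Here `largest` is `"variance"`, which matches the conclusion in the text that overfitting should be addressed first.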
![](/spring2021/lecture-7-notes-media/image12.png)
#### Distribution Shift
Clearly, the application of the bias-variance decomposition to the test
error has already helped prioritize our next steps for model
development. However, until now, we've assumed that the samples
(training, validation, testing) all come from the same distribution.
What if this isn't the case? In practical ML situations, this
**distribution shift** often occurs. In building self-driving cars, a
frequent occurrence might be training with samples from one distribution
(e.g., daytime driving video) but testing or inferring on samples from a
totally different distribution (e.g., night time driving).
A simple way of handling this wrinkle in our assumption is to create two
validation sets: one from the training distribution and one from the
test distribution. This can be helpful even with a very small testing
set. If we apply this, we can actually estimate our distribution shift,
which is the difference between the test-validation error and the
train-validation error. This is really useful for practical applications
of ML! With this new term, let's update our test error formula of bias
and variance:
*Test error = irreducible error + bias + variance + distribution shift +
validation overfitting*
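A sketch of the five-term version, assuming two validation sets and assuming that distribution shift is measured as the gap between the error on the test-distribution validation set and the error on the train-distribution validation set. The parameter names and error rates are hypothetical.

```python
def decompose_with_shift(irreducible, train, train_val, test_val, test):
    """Split test error into five terms (percentage points).

    train_val: error on a validation set drawn from the training distribution.
    test_val:  error on a validation set drawn from the test distribution.
    """
    return {
        "irreducible_error": irreducible,
        "avoidable_bias": train - irreducible,
        "variance": train_val - train,
        "distribution_shift": test_val - train_val,
        "validation_overfitting": test - test_val,
    }

# Hypothetical error rates in percent.
parts = decompose_with_shift(
    irreducible=25, train=27, train_val=32, test_val=36, test=37
)
```

With these numbers the terms add back up to the test error of 37, and the 4-point shift term is visible separately from the 5-point variance term.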
### 6 - Improve Model and Data
Using the updated formula from the last section, we'll be able to decide
on and prioritize the right next steps for each iteration of a model. In
particular, we'll follow a specific process (shown below).
![](/spring2021/lecture-7-notes-media/image1.png)
#### Step 1: Address Underfitting
We'll start by addressing underfitting (i.e., reducing bias). The first
thing to try in this case is to make your model bigger (e.g., add
layers, more units per layer). Next, consider reducing regularization,
since too much of it can prevent a tight fit to your data. Other options
are error analysis,
choosing a different model architecture (e.g., something more state of
the art), tuning hyperparameters, or adding features. Some notes:
* Choosing different architectures, especially a SOTA one, can be very
helpful but is also risky. Bugs are easily introduced in the
implementation process.
* Adding features is uncommon in the deep learning paradigm (vs.
traditional machine learning). We usually want the network to
learn features of its own accord. If all else fails, it can be
beneficial in a practical setting.
![](/spring2021/lecture-7-notes-media/image13.png)
#### Step 2: Address Overfitting
After addressing underfitting, move on to solving overfitting.
Similarly, there's a recommended series of methods to try in order.
Collecting more training data (if possible) is the soundest way
to address overfitting, though it can be challenging in certain
applications. Next, tactical improvements like normalization, data
augmentation, and regularization can help. Following these steps,
traditional defaults like tuning hyperparameters, choosing a different
architecture, or error analysis are useful. Finally, if overfitting is
rather intractable, there's a series of less recommended steps, such as
early stopping, removing features, and reducing model size. Early
stopping is a personal choice; the fast.ai community is a strong
proponent.
![](/spring2021/lecture-7-notes-media/image15.png)
#### Step 3: Address Distribution Shift
After addressing underfitting and overfitting, if there's a difference
between the error on our training validation set vs. our test validation
set, we need to address the error caused by the distribution shift. This
is a harder problem to solve, so there's less in our toolkit to apply.
Start by looking manually at the errors in the test-validation set.
Compare the potential logic behind these errors to the performance in
the train-validation set, and use the errors to guide further data
collection. Essentially, reason about why your model may be suffering
from distribution shift error. This is the most principled way to deal
with distribution shift, though it's the most challenging way
practically. If collecting more data to address these errors isn't
possible, try synthesizing data. Additionally, you can try [domain
adaptation](https://ece.engin.umich.edu/wp-content/uploads/2019/09/4142.pdf).
![](/spring2021/lecture-7-notes-media/image9.png)
##### Error Analysis
Manually evaluating errors to understand model performance is generally
a high-yield way of figuring out how to improve the model.
Systematically performing this **error analysis** process and
decomposing the error from different error types can help prioritize
model improvements. For example, in a self-driving car use case with
error types like hard-to-see pedestrians, reflections, and nighttime
scenes, decomposing the error contribution of each and where it occurs
(train-val vs. test-val) can give rise to a clear set of prioritized
action items. See the table for an example of how this error analysis
can be effectively structured.
![](/spring2021/lecture-7-notes-media/image5.png)
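A tally like the one described can be produced with a few lines of code once errors have been labelled by hand. Everything in this sketch is hypothetical: the error records, the split sizes, and the helper name are invented for illustration and are not the values from the lecture's table.

```python
from collections import Counter

# Hypothetical hand-labelled errors: (validation split, error type).
errors = [
    ("train-val", "hard-to-see pedestrian"),
    ("train-val", "reflection"),
    ("test-val", "hard-to-see pedestrian"),
    ("test-val", "reflection"),
    ("test-val", "nighttime scene"),
    ("test-val", "nighttime scene"),
    ("test-val", "nighttime scene"),
]

def error_breakdown(errors, split_sizes):
    """Return the error contribution (%) of each (split, error type) pair."""
    counts = Counter(errors)
    return {
        key: 100.0 * n / split_sizes[key[0]]
        for key, n in counts.items()
    }

split_sizes = {"train-val": 100, "test-val": 100}  # hypothetical examples per split
breakdown = error_breakdown(errors, split_sizes)

# Largest contributions first: a rough priority list for data collection.
priorities = sorted(breakdown.items(), key=lambda item: item[1], reverse=True)
```

In this made-up data, nighttime scenes appear only in the test-validation split and contribute the most error, so they would be the first candidate for further data collection or synthesis.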
##### Domain Adaptation
Domain adaptation is a class of techniques that train on a “source”
distribution and generalize to another “target” using only unlabeled
data or limited labeled data. You should use domain adaptation when
access to labeled data from the test distribution is limited, but access
to relatively similar data is plentiful.
There are a few different types of domain adaptation:
1. **Supervised domain adaptation**: In this case, we have limited data
from the target domain to adapt to. Some example applications of
the concept include fine-tuning a pre-trained model or adding
target data to a training set.
2. **Unsupervised domain adaptation**: In this case, we have lots of
unlabeled data from the target domain. Some techniques you might
see are CORAL, domain confusion, and CycleGAN.
Practically speaking, supervised domain adaptation can work really well!
Unsupervised domain adaptation has a little bit further to go.
#### Step 4: Rebalance datasets
If the test-validation set performance starts to look considerably
better than the test performance, you may have overfit the validation
set. This commonly occurs with small validation sets or lots of
hyperparameter tuning. If this occurs, resample the validation set
from the test distribution and get a fresh estimate of the performance.
### 7 - Tune Hyperparameters
One of the core challenges in hyperparameter optimization is very basic:
**which hyperparameters should you tune?** As we consider this
fundamental question, let's keep the following in mind:
* Models are more sensitive to some hyperparameters than others. This
means we should focus our efforts on the more impactful
hyperparameters.
* However, which hyperparameters are most important depends heavily on
our choice of model.
* Certain rules of thumb can help guide our initial thinking.
* Sensitivity is always relative to default values; if you use good
defaults, you might start in a good place!
See the following table for a ranked list of hyperparameters and their
impact on the model:
![](/spring2021/lecture-7-notes-media/image2.png)
#### Techniques for Tuning Hyperparameter Optimization
Now that we know which hyperparameters make the most sense to tune
(using rules of thumb), let's consider the various methods of actually
tuning them:
1. **Manual Hyperparameter Optimization**. Colloquially referred to as
Graduate Student Descent, this method works by taking a manual,
detailed look at your algorithm, building intuition, and
considering which hyperparameters would make the most difference.
After figuring out these parameters, you train, evaluate, and
guess a better hyperparameter value using your intuition for the
algorithm and intelligence. While it may seem archaic, this method
combines well with other methods (e.g., setting a range of values
for hyperparameters) and has the main benefit of reducing
computation time and cost if used skillfully. It can be
time-consuming and challenging, but it can be a good starting
point.
2. **Grid Search**. Imagine each of your parameters plotted against
each other on a grid, from which you uniformly sample values to
test. For each point, you run a training run and evaluate
performance. The advantages are that it's very simple and can
often produce good results. However, it's quite inefficient, as
you must run every combination of hyperparameters. It also often
requires prior knowledge about the hyperparameters since we must
manually set the range of values.
3. **Random Search**: This method is recommended over grid search.
Rather than sampling from the grid of values for the
hyperparameter evenly, we'll choose n points sampled randomly
across the grid. Empirically, this method produces better results
than grid search. However, the results can be somewhat
uninterpretable, with unexpected values in certain hyperparameters
returned.
4. **Coarse-to-fine Search**: Rather than running entirely random runs,
we can gradually narrow in on the best hyperparameters through
this method. Initially, start by defining a very large range to
run a randomized search on. Within the pool of results, you can
find the N best results and home in on the hyperparameter values used
to generate those samples. As you iteratively perform this method,
you can get excellent performance. This doesn't remove the manual
component, as you have to select which range to continuously
narrow your search to, but it's perhaps the most popular method
available.
5. **Bayesian Hyperparameter Optimization**: This is a reasonably
sophisticated method, which you can read more about
[here](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec21.pdf)
and
[here](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).
At a high level, start with a prior estimate of parameter
distributions. Subsequently, maintain a probabilistic model of the
relationship between hyperparameter values and model performance.
As you maintain this model, you alternate between training with
hyperparameter values that maximize the expected improvement (per
the model) and using the training results to update the
probabilistic model and its expectations. This is a great,
hands-off, efficient method to choose hyperparameters. However,
these techniques can be quite challenging to implement from
scratch. As libraries and infrastructure mature, the integration
of these methods into training will become easier.
In summary, you should probably start with coarse-to-fine random
searches and move to Bayesian methods as your codebase matures and
you're more certain of your model.
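A coarse-to-fine random search over a single hyperparameter can be sketched as follows. This is one possible reading of the method, not a reference implementation: the function names, the number of rounds and trials, and `toy_objective` (a stand-in for "train a model and return its validation error") are all assumptions. The range is narrowed automatically to the span of the best trials, whereas the text notes that this step is usually a manual judgment call.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Sample on a log scale, which suits hyperparameters like learning rate."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def random_search(objective, low, high, n_trials, rng):
    """Evaluate n_trials random values; return (score, value) pairs, best first."""
    trials = [sample_log_uniform(rng, low, high) for _ in range(n_trials)]
    return sorted((objective(lr), lr) for lr in trials)

def coarse_to_fine(objective, low, high, rounds=3, n_trials=20, top_k=5, seed=0):
    """Repeat random search, shrinking the range around the top_k results."""
    rng = random.Random(seed)
    best = None
    for _ in range(rounds):
        top = random_search(objective, low, high, n_trials, rng)[:top_k]
        if best is None or top[0] < best:
            best = top[0]
        # Narrow the range to the span of this round's best values.
        low = min(lr for _, lr in top)
        high = max(lr for _, lr in top)
    return best

def toy_objective(lr):
    """Stand-in for validation error, lowest at a learning rate of 3e-4."""
    return (math.log10(lr) - math.log10(3e-4)) ** 2

best_score, best_lr = coarse_to_fine(toy_objective, low=1e-6, high=1e-1)
```

Sampling on a log scale is a design choice for hyperparameters that span several orders of magnitude. A uniform sample between 1e-6 and 1e-1 would almost never try small learning rates.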
### 8 - Conclusion
To wrap up this lecture, deep learning troubleshooting and debugging is
really hard. It's difficult to tell if you have a bug because there are
many possible sources for the same degradation in performance.
Furthermore, the results can be sensitive to small changes in
hyper-parameters and dataset makeup.
To train bug-free deep learning models, we need to treat building them
as an iterative process. If you skipped to the end, the following steps
can make this process easier and catch errors as early as possible:
* **Start Simple**: Choose the simplest model and data possible.
* **Implement and Debug**: Once the model runs, overfit a single batch
and reproduce a known result.
* **Evaluate**: Apply the bias-variance decomposition to decide what
to do next.
* **Tune Hyper-parameters**: Use coarse-to-fine random searches to
tune the model's hyper-parameters.
* **Improve Model and Data**: Make your model bigger if your model
under-fits and add more data and/or regularization if your model
over-fits.
Here are additional resources that you can go to learn more:
* Andrew Ng's “[Machine Learning
Yearning](https://www.deeplearning.ai/machine-learning-yearning/)”
book.
* This [Twitter
thread](https://twitter.com/karpathy/status/1013244313327681536)
from Andrej Karpathy.
* BYU's “[Practical Advice for Building Deep Neural
Networks](https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/)”
blog post.