diff --git a/.gitignore b/.gitignore index 82ccded..3d91ece 100644 --- a/.gitignore +++ b/.gitignore @@ -1,10 +1,5 @@ # Personal notes docs/wassname.md -# Internal dev specs (implementation notes, not useful publicly) +# Internal dev specs docs/spec/ - -# Book chapters (copyright, not redistributable) -docs/evidence/goodfellow_ch11_practical_methodology.md -docs/evidence/goodfellow_ch15_representation_learning.md -docs/evidence/deeplearning_book.md diff --git a/docs/evidence/goodfellow_ch11_practical_methodology.md b/docs/evidence/goodfellow_ch11_practical_methodology.md new file mode 100644 index 0000000..7fc5b2d --- /dev/null +++ b/docs/evidence/goodfellow_ch11_practical_methodology.md @@ -0,0 +1,264 @@ +Source: https://www.deeplearningbook.org/contents/guidelines.html +Title: Deep Learning Book - Chapter 11: Practical Methodology - Goodfellow, Bengio, Courville +Fetched-via: local copy /media/wassname/.../goodfellow_deep_learning/chapters/11_practical.md +Fetch-status: verbatim + +# Chapter 11 + +# Practical Methodology + +Successfully applying deep learning techniques requires more than just a good knowledge of what algorithms exist and the principles that explain how they work. A good machine learning practitioner also needs to know how to choose an algorithm for a particular application and how to monitor and respond to feedback obtained from experiments in order to improve a machine learning system. During day-to-day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the model. All these operations are at the very least time consuming to try out, so it is important to be able to determine the right course of action rather than blindly guessing. 
+ +Most of this book is about different machine learning models, training algorithms, and objective functions. This may give the impression that the most important ingredient to being a machine learning expert is knowing a wide variety of machine learning techniques and being good at different kinds of math. In practice, one can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm. Correct application of an algorithm depends on mastering some fairly simple methodology. Many of the recommendations in this chapter are adapted from Ng (2015). + +We recommend the following practical design process: + +• Determine your goals: what error metric to use, and your target value for this error metric. These goals and error metrics should be driven by the problem that the application is intended to solve. + +• Establish a working end-to-end pipeline as soon as possible, including the estimation of the appropriate performance metrics. + +• Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether poor performance is due to overfitting, underfitting, or a defect in the data or software. + +• Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation. + +As a running example, we will use the Street View address number transcription system (Goodfellow et al., 2014d). The purpose of this application is to add buildings to Google Maps. Street View cars photograph the buildings and record the GPS coordinates associated with each photograph. A convolutional network recognizes the address number in each photograph, allowing the Google Maps database to add that address in the correct location. The story of how this commercial application was developed gives an example of how to follow the design methodology we advocate. 
+ +We now describe each of the steps in this process. + +## 11.1 Performance Metrics + +Determining your goals, in terms of which error metric to use, is a necessary first step because your error metric will guide all your future actions. You should also have an idea of what level of performance you desire. + +Keep in mind that for most applications, it is impossible to achieve absolute zero error. The Bayes error defines the minimum error rate that you can hope to achieve, even if you have infinite training data and can recover the true probability distribution. This is because your input features may not contain complete information about the output variable, or because the system might be intrinsically stochastic. You will also be limited by having a finite amount of training data. + +The amount of training data can be limited for a variety of reasons. When your goal is to build the best possible real-world product or service, you can typically collect more data but must determine the value of reducing error further and weigh this against the cost of collecting more data. Data collection can require time, money, or human suffering (for example, if your data collection process involves performing invasive medical tests). When your goal is to answer a scientific question about which algorithm performs better on a fixed benchmark, the benchmark specification usually determines the training set, and you are not allowed to collect more data. + +How can one determine a reasonable level of performance to expect? Typically, in the academic setting, we have some estimate of the error rate that is attainable based on previously published benchmark results. In the real-world setting, we have some idea of the error rate that is necessary for an application to be safe, cost-effective, or appealing to consumers. Once you have determined your realistic desired error rate, your design decisions will be guided by reaching this error rate. 
+ +Another important consideration besides the target value of the performance metric is the choice of which metric to use. Several different performance metrics may be used to measure the effectiveness of a complete application that includes machine learning components. These performance metrics are usually different from the cost function used to train the model. As described in section 5.1.2, it is common to measure the accuracy, or equivalently, the error rate, of a system. + +However, many applications require more advanced metrics. + +Sometimes it is much more costly to make one kind of a mistake than another. For example, an e-mail spam detection system can make two kinds of mistakes: incorrectly classifying a legitimate message as spam, and incorrectly allowing a spam message to appear in the inbox. It is much worse to block a legitimate message than to allow a questionable message to pass through. Rather than measuring the error rate of a spam classifier, we may wish to measure some form of total cost, where the cost of blocking legitimate messages is higher than the cost of allowing spam messages. + +Sometimes we wish to train a binary classifier that is intended to detect some rare event. For example, we might design a medical test for a rare disease. Suppose that only one in every million people has this disease. We can easily achieve 99.9999 percent accuracy on the detection task, by simply hard coding the classifier to always report that the disease is absent. Clearly, accuracy is a poor way to characterize the performance of such a system. One way to solve this problem is to instead measure precision and recall . Precision is the fraction of detections reported by the model that were correct, while recall is the fraction of true events that were detected. A detector that says no one has the disease would achieve perfect precision, but zero recall. 
A detector that says everyone has the disease would achieve perfect recall, but precision equal to the percentage of people who have the disease (0.0001 percent in our example of a disease that only one person in a million has). When using precision and recall, it is common to plot a PR curve, with precision on the y-axis and recall on the x-axis. The classifier generates a score that is higher if the event to be detected occurred. For example, a feedforward network designed to detect a disease outputs $\hat{y} = P(y = 1 \mid x)$, estimating the probability that a person whose medical results are described by features $x$ has the disease. We choose to report a detection whenever this score exceeds some threshold. By varying the threshold, we can trade precision for recall. In many cases, we wish to summarize the performance of the classifier with a single number rather than a curve. To do so, we can convert precision $p$ and recall $r$ into an F-score given by + +$$F = \frac{2pr}{p + r}. \tag{11.1}$$ + +Another option is to report the total area lying beneath the PR curve. + +In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the machine learning algorithm can estimate how confident it should be about a decision, especially if a wrong decision can be harmful and if a human operator is able to occasionally take over. The Street View transcription system provides an example of this situation. The task is to transcribe the address number from a photograph to associate the location where the photo was taken with the correct address in a map. Because the value of the map degrades considerably if the map is inaccurate, it is important to add an address only if the transcription is correct. If the machine learning system thinks that it is less likely than a human being to obtain the correct transcription, then the best course of action is to allow a human to transcribe the photo instead. 
Of course, the machine learning system is only useful if it is able to dramatically reduce the amount of photos that the human operators must process. A natural performance metric to use in this situation is coverage . Coverage is the fraction of examples for which the machine learning system is able to produce a response. It is possible to trade coverage for accuracy. One can always obtain 100 percent accuracy by refusing to process any example, but this reduces the coverage to 0 percent. For the Street View task, the goal for the project was to reach human-level transcription accuracy while maintaining 95 percent coverage. Human-level performance on this task is 98 percent accuracy. + +Many other metrics are possible. We can, for example, measure click-through rates, collect user satisfaction surveys, and so on. Many specialized application areas have application-specific criteria as well. + +What is important is to determine which performance metric to improve ahead of time, then concentrate on improving this metric. Without clearly defined goals, it can be difficult to tell whether changes to a machine learning system make progress or not. + +## 11.2 Default Baseline Models + +After choosing performance metrics and goals, the next step in any practical application is to establish a reasonable end-to-end system as soon as possible. In this section, we provide recommendations for which algorithms to use as the first baseline approach in various situations. Keep in mind that deep learning research progresses quickly, so better default algorithms are likely to become available soon after this writing. + +Depending on the complexity of your problem, you may even want to begin without using deep learning. If your problem has a chance of being solved by just choosing a few linear weights correctly, you may want to begin with a simple statistical model like logistic regression. 
+ +If you know that your problem falls into an “AI-complete” category like object recognition, speech recognition, machine translation, and so on, then you are likely to do well by beginning with an appropriate deep learning model. + +First, choose the general category of model based on the structure of your data. If you want to perform supervised learning with fixed-size vectors as input, use a feedforward network with fully connected layers. If the input has known topological structure (for example, if the input is an image), use a convolutional network. In these cases, you should begin by using some kind of piecewise linear unit (ReLUs or their generalizations, such as Leaky ReLUs, PReLUs, or maxout). If your input or output is a sequence, use a gated recurrent net (LSTM or GRU). + +A reasonable choice of optimization algorithm is SGD with momentum with a decaying learning rate (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2–10 each time validation error plateaus). Another reasonable alternative is Adam. Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities. While it is reasonable to omit batch normalization from the very first baseline, it should be introduced quickly if optimization appears to be problematic. + +Unless your training set contains tens of millions of examples or more, you should include some mild forms of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement and compatible with many models and training algorithms. 
Batch normalization also sometimes reduces generalization error and allows dropout to be omitted, because of the noise in the estimate of the statistics used to normalize each variable. + +If your task is similar to another task that has been studied extensively, you will probably do well by first copying the model and algorithm that is already known to perform best on the previously studied task. You may even want to copy a trained model from that task. For example, it is common to use the features from a convolutional network trained on ImageNet to solve other computer vision tasks (Girshick et al., 2015). + +A common question is whether to begin by using unsupervised learning, described further in part III. This is somewhat domain specific. Some domains, such as natural language processing, are known to benefit tremendously from unsupervised learning techniques, such as learning unsupervised word embeddings. In other domains, such as computer vision, current unsupervised learning techniques do not bring a benefit, except in the semi-supervised setting, when the number of labeled examples is very small (Kingma et al., 2014; Rasmus et al., 2015). If your application is in a context where unsupervised learning is known to be important, then include it in your first end-to-end baseline. Otherwise, only use unsupervised learning in your first attempt if the task you want to solve is unsupervised. You can always try adding unsupervised learning later if you observe that your initial baseline overfits. + +## 11.3 Determining Whether to Gather More Data + +After the first end-to-end system is established, it is time to measure the performance of the algorithm and determine how to improve it. Many machine learning novices are tempted to make improvements by trying out many different algorithms. Yet, it is often much better to gather more data than to improve the learning algorithm. + +How does one decide whether to gather more data? 
First, determine whether the performance on the training set is acceptable. If performance on the training set is poor, the learning algorithm is not using the training data that is already available, so there is no reason to gather more data. Instead, try increasing the size of the model by adding more layers or adding more hidden units to each layer. Also, try improving the learning algorithm, for example by tuning the learning rate hyperparameter. If large models and carefully tuned optimization algorithms do not work well, then the problem might be the quality of the training data. The data may be too noisy or may not include the right inputs needed to predict the desired outputs. This suggests starting over, collecting cleaner data, or collecting a richer set of features. + +If the performance on the training set is acceptable, then measure the performance on a test set. If the performance on the test set is also acceptable, then there is nothing left to be done. If test set performance is much worse than training set performance, then gathering more data is one of the most effective solutions. The key considerations are the cost and feasibility of gathering more data, the cost and feasibility of reducing the test error by other means, and the amount of data that is expected to be necessary to improve test set performance significantly. At large internet companies with millions or billions of users, it is feasible to gather large datasets, and the expense of doing so can be considerably less than that of the alternatives, so the answer is almost always to gather more training data. For example, the development of large labeled datasets was one of the most important factors in solving object recognition. In other contexts, such as medical applications, it may be costly or infeasible to gather more data. 
A simple alternative to gathering more data is to reduce the size of the model or improve regularization, by adjusting hyperparameters such as weight decay coefficients, or by adding regularization strategies such as dropout. If you find that the gap between train and test performance is still unacceptable even after tuning the regularization hyperparameters, then gathering more data is advisable. + +When deciding whether to gather more data, it is also necessary to decide how much to gather. It is helpful to plot curves showing the relationship between training set size and generalization error, as in figure 5.4. By extrapolating such curves, one can predict how much additional training data would be needed to achieve a certain level of performance. Usually, adding a small fraction of the total number of examples will not have a noticeable effect on generalization error. It is therefore recommended to experiment with training set sizes on a logarithmic scale, for example, doubling the number of examples between consecutive experiments. + +If gathering much more data is not feasible, the only other way to improve generalization error is to improve the learning algorithm itself. This becomes the domain of research and not the domain of advice for applied practitioners. + +## 11.4 Selecting Hyperparameters + +Most deep learning algorithms come with several hyperparameters that control many aspects of the algorithm’s behavior. Some of these hyperparameters affect the time and memory cost of running the algorithm. Some of these hyperparameters affect the quality of the model recovered by the training process and its ability to infer correct results when deployed on new inputs. + +There are two basic approaches to choosing these hyperparameters: choosing them manually and choosing them automatically. Choosing the hyperparameters manually requires understanding what the hyperparameters do and how machine learning models achieve good generalization. 
Automatic hyperparameter selection algorithms greatly reduce the need to understand these ideas, but they are often much more computationally costly. + +### 11.4.1 Manual Hyperparameter Tuning + +To set hyperparameters manually, one must understand the relationship between hyperparameters, training error, generalization error and computational resources (memory and runtime). This means establishing a solid foundation on the fundamental ideas concerning the effective capacity of a learning algorithm, as described in chapter 5. + +The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget. We do not discuss how to determine the runtime and memory impact of various hyperparameters here because this is highly platform dependent. + +The primary goal of manual hyperparameter search is to adjust the effective capacity of the model to match the complexity of the task. Effective capacity is constrained by three factors: the representational capacity of the model, the ability of the learning algorithm to successfully minimize the cost function used to train the model, and the degree to which the cost function and training procedure regularize the model. A model with more layers and more hidden units per layer has higher representational capacity—it is capable of representing more complicated functions. It cannot necessarily learn all these functions though, if the training algorithm cannot discover that certain functions do a good job of minimizing the training cost, or if regularization terms such as weight decay forbid some of these functions. + +The generalization error typically follows a U-shaped curve when plotted as a function of one of the hyperparameters, as in figure 5.3. At one extreme, the hyperparameter value corresponds to low capacity, and generalization error is high because training error is high. This is the underfitting regime. 
At the other extreme, the hyperparameter value corresponds to high capacity, and the generalization error is high because the gap between training and test error is high. Somewhere in the middle lies the optimal model capacity, which achieves the lowest possible generalization error, by adding a medium generalization gap to a medium amount of training error. + +For some hyperparameters, overfitting occurs when the value of the hyperparameter is large. The number of hidden units in a layer is one such example, because increasing the number of hidden units increases the capacity of the model. For some hyperparameters, overfitting occurs when the value of the hyperparameter is small. For example, the smallest allowable weight decay coefficient of zero corresponds to the greatest effective capacity of the learning algorithm. + +Not every hyperparameter will be able to explore the entire U-shaped curve. Many hyperparameters are discrete, such as the number of units in a layer or the number of linear pieces in a maxout unit, so it is only possible to visit a few points along the curve. Some hyperparameters are binary. Usually these hyperparameters are switches that specify whether or not to use some optional component of the learning algorithm, such as a preprocessing step that normalizes the input features by subtracting their mean and dividing by their standard deviation. These hyperparameters can explore only two points on the curve. Other hyperparameters have some minimum or maximum value that prevents them from exploring some part of the curve. For example, the minimum weight decay coefficient is zero. This means that if the model is underfitting when weight decay is zero, we cannot enter the overfitting region by modifying the weight decay coefficient. In other words, some hyperparameters can only subtract capacity. + +The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate. 
It controls the effective capacity of the model in a more complicated way than other hyperparameters: the effective capacity of the model is highest when the learning rate is correct for the optimization problem, not when the learning rate is especially large or especially small. The learning rate has a U-shaped curve for training error, illustrated in figure 11.1. When the learning rate is too large, gradient descent can inadvertently increase rather than decrease the training error. In the idealized quadratic case, this occurs if the learning rate is at least twice as large as its optimal value (LeCun et al., 1998a). When the learning rate is too small, training is not only slower but may become permanently stuck with a high training error. This effect is poorly understood (it would not happen for a convex loss function). + +Tuning the parameters other than the learning rate requires monitoring both training and test error to diagnose whether your model is overfitting or underfitting, then adjusting its capacity appropriately. + +If your error on the training set is higher than your target error rate, you have no choice but to increase capacity. If you are not using regularization and you are confident that your optimization algorithm is performing correctly, then you must add more layers to your network or add more hidden units. Unfortunately, this increases the computational costs associated with the model. + +Figure 11.1: Typical relationship between the learning rate and the training error. Notice the sharp rise in error when the learning rate is above an optimal value. This is for a fixed training time, as a smaller learning rate may sometimes only slow down training by a factor proportional to the learning rate reduction. Generalization error can follow this curve or be complicated by regularization effects arising out of having too large or too small learning rates, since poor optimization can, to some degree, reduce or prevent overfitting, and even points with equivalent training error can have different generalization error. + +If your error on the test set is higher than your target error rate, you can now take two kinds of actions. The test error is the sum of the training error and the gap between training and test error. The optimal test error is found by trading off these quantities. Neural networks typically perform best when the training error is very low (and thus, when capacity is high) and the test error is primarily driven by the gap between training and test error. Your goal is to reduce this gap without increasing training error faster than the gap decreases. To reduce the gap, change regularization hyperparameters to reduce effective model capacity, such as by adding dropout or weight decay. Usually the best performance comes from a large model that is regularized well, for example, by using dropout. + +Most hyperparameters can be set by reasoning about whether they increase or decrease model capacity. Some examples are included in table 11.1. + +While manually tuning hyperparameters, do not lose sight of your end goal: good performance on the test set. Adding regularization is only one way to achieve this goal. As long as you have low training error, you can always reduce generalization error by collecting more training data. The brute force way to practically guarantee success is to continually increase model capacity and training set size until the task is solved. This approach does of course increase the computational cost of training and inference, so it is only feasible given appropriate resources. In principle, this approach could fail due to optimization difficulties, but for many problems optimization does not seem to be a significant barrier, provided that the model is chosen appropriately.
+
+| Hyperparameter | Increases capacity when... | Reason | Caveats |
+| --- | --- | --- | --- |
+| Number of hidden units | increased | Increasing the number of hidden units increases the representational capacity of the model. | Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model. |
+| Learning rate | tuned optimally | An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure. | |
+| Convolution kernel width | increased | Increasing the kernel width increases the number of parameters in the model. | A wider kernel results in a narrower output dimension, reducing model capacity unless you use implicit zero padding to reduce this effect. Wider kernels require more memory for parameter storage and increase runtime, but a narrower output reduces memory cost. |
+| Implicit zero padding | increased | Adding implicit zeros before convolution keeps the representation size large. | Increases time and memory cost of most operations. |
+| Weight decay coefficient | decreased | Decreasing the weight decay coefficient frees the model parameters to become larger. | |
+| Dropout rate | decreased | Dropping units less often gives the units more opportunities to “conspire” with each other to fit the training set. | |
+
+Table 11.1: The effect of various hyperparameters on model capacity. + +### 11.4.2 Automatic Hyperparameter Optimization Algorithms + +The ideal learning algorithm just takes a dataset and outputs a function, without requiring hand tuning of hyperparameters. The popularity of several learning algorithms such as logistic regression and SVMs stems in part from their ability to perform well with only one or two tuned hyperparameters. Neural networks can sometimes perform well with only a small number of tuned hyperparameters, but often benefit significantly from tuning of forty or more. 
Manual hyperparameter tuning can work very well when the user has a good starting point, such as one determined by others having worked on the same type of application and architecture, or when the user has months or years of experience in exploring hyperparameter values for neural networks applied to similar tasks. For many applications, however, these starting points are not available. In these cases, automated algorithms can find useful values of the hyperparameters. + +If we think about the way in which the user of a learning algorithm searches for good values of the hyperparameters, we realize that an optimization is taking place: we are trying to find a value of the hyperparameters that optimizes an objective function, such as validation error, sometimes under constraints (such as a budget for training time, memory or recognition time). It is therefore possible, in principle, to develop hyperparameter optimization algorithms that wrap a learning algorithm and choose its hyperparameters, thus hiding the hyperparameters of the learning algorithm from the user. Unfortunately, hyperparameter optimization algorithms often have their own hyperparameters, such as the range of values that should be explored for each of the learning algorithm’s hyperparameters. These secondary hyperparameters are usually easier to choose, however, in the sense that acceptable performance may be achieved on a wide range of tasks using the same secondary hyperparameters for all tasks. + +### 11.4.3 Grid Search + +When there are three or fewer hyperparameters, the common practice is to perform grid search . For each hyperparameter, the user selects a small finite set of values to explore. The grid search algorithm then trains a model for every joint specification of hyperparameter values in the Cartesian product of the set of values for each individual hyperparameter. The experiment that yields the best validation set error is then chosen as having found the best hyperparameters. 
See the left of figure 11.2 for an illustration of a grid of hyperparameter values. + +How should the lists of values to search over be chosen? In the case of numerical (ordered) hyperparameters, the smallest and largest element of each list is chosen + +Grid Random + +Figure 11.2: Comparison of grid search and random search. For illustration purposes, we display two hyperparameters, but we are typically interested in having many more. (Left) To perform grid search, we provide a set of values for each hyperparameter. The search algorithm runs training for every joint hyperparameter setting in the cross product of these sets. (Right) To perform random search, we provide a probability distribution over joint hyperparameter configurations. Usually most of these hyperparameters are independent from each other. Common choices for the distribution over a single hyperparameter include uniform and log-uniform (to sample from a log-uniform distribution, take the exp of a sample from a uniform distribution). The search algorithm then randomly samples joint hyperparameter configurations and runs training with each of them. Both grid search and random search evaluate the validation set error and return the best configuration. The figure illustrates the typical case where only some hyperparameters have a significant influence on the result. In this illustration, only the hyperparameter on the horizontal axis has a significant effect. Grid search wastes an amount of computation that is exponential in the number of noninfluential hyperparameters, while random search tests a unique value of every influential hyperparameter on nearly every trial. Figure reproduced with permission from Bergstra and Bengio (2012). + +conservatively, based on prior experience with similar experiments, to make sure that the optimal value is likely to be in the selected range. 
Typically, a grid search involves picking values approximately on a logarithmic scale, e.g., a learning rate taken within the set $\{0.1, 0.01, 10^{-3}, 10^{-4}, 10^{-5}\}$, or a number of hidden units taken within the set {50, 100, 200, 500, 1000, 2000}. + +Grid search usually performs best when it is performed repeatedly. For example, suppose that we ran a grid search over a hyperparameter α using values of {−1, 0, 1}. If the best value found is 1, then we underestimated the range in which the best α lies and should shift the grid and run another search with α in, for example, {1, 2, 3}. If we find that the best value of α is 0, then we may wish to refine our estimate by zooming in and running a grid search over {−0.1, 0, 0.1}. + +The obvious problem with grid search is that its computational cost grows exponentially with the number of hyperparameters. If there are m hyperparameters, each taking at most n values, then the number of training and evaluation trials required grows as $O(n^m)$. The trials may be run in parallel and exploit loose parallelism (with almost no need for communication between different machines carrying out the search). Unfortunately, because of the exponential cost of grid search, even parallelization may not provide a satisfactory size of search. + +### 11.4.4 Random Search + +Fortunately, there is an alternative to grid search that is as simple to program, more convenient to use, and converges much faster to good values of the hyperparameters: random search (Bergstra and Bengio, 2012). + +A random search proceeds as follows. First we define a marginal distribution for each hyperparameter, for example, a Bernoulli or multinoulli for binary or discrete hyperparameters, or a uniform distribution on a log-scale for positive real-valued hyperparameters.
For example, + +$$\texttt{log\_learning\_rate} \sim u(-1, -5), \tag{11.2}$$ + +$$\texttt{learning\_rate} = 10^{\texttt{log\_learning\_rate}}, \tag{11.3}$$ + +where u(a, b) indicates a sample of the uniform distribution in the interval (a, b). Similarly the `log_number_of_hidden_units` may be sampled from u(log(50), log(2000)). + +Unlike in a grid search, we should not discretize or bin the values of the hyperparameters, so that we can explore a larger set of values and avoid additional computational cost. In fact, as illustrated in figure 11.2, a random search can be exponentially more efficient than a grid search, when there are several hyperparameters that do not strongly affect the performance measure. This is studied at length in Bergstra and Bengio (2012), who found that random search reduces the validation set error much faster than grid search, in terms of the number of trials run by each method. + +As with grid search, we may often want to run repeated versions of random search, to refine the search based on the results of the first run. + +The main reason that random search finds good solutions faster than grid search is that it has no wasted experimental runs, unlike in the case of grid search, when two values of a hyperparameter (given values of the other hyperparameters) would give the same result. In the case of grid search, the other hyperparameters would have the same values for these two runs, whereas with random search, they would usually have different values. Hence if the change between these two values does not marginally make much difference in terms of validation set error, grid search will unnecessarily repeat two equivalent experiments while random search will still give two independent explorations of the other hyperparameters. + +### 11.4.5 Model-Based Hyperparameter Optimization + +The search for good hyperparameters can be cast as an optimization problem. The decision variables are the hyperparameters.
The cost to be optimized is the validation set error that results from training using these hyperparameters. In simplified settings where it is feasible to compute the gradient of some differentiable error measure on the validation set with respect to the hyperparameters, we can simply follow this gradient (Bengio et al., 1999; Bengio, 2000; Maclaurin et al., 2015). Unfortunately, in most practical settings, this gradient is unavailable, either because of its high computation and memory cost, or because of hyperparameters that have intrinsically nondifferentiable interactions with the validation set error, as in the case of discrete-valued hyperparameters. + +To compensate for this lack of a gradient, we can build a model of the validation set error, then propose new hyperparameter guesses by performing optimization within this model. Most model-based algorithms for hyperparameter search use a Bayesian regression model to estimate both the expected value of the validation set error for each hyperparameter and the uncertainty around this expectation. Optimization thus involves a trade-off between exploration (proposing hyperparameters for which there is high uncertainty, which may lead to a large improvement but may also perform poorly) and exploitation (proposing hyperparameters that the model is confident will perform as well as any hyperparameters it has seen so far, usually hyperparameters that are very similar to ones it has seen before). Contemporary approaches to hyperparameter optimization include Spearmint (Snoek et al., 2012), TPE (Bergstra et al., 2011) and SMAC (Hutter et al., 2011). + +Currently, we cannot unambiguously recommend Bayesian hyperparameter optimization as an established tool for achieving better deep learning results or for obtaining those results with less effort. Bayesian hyperparameter optimization sometimes performs comparably to human experts, sometimes better, but fails catastrophically on other problems.
It may be worth trying to see if it works on a particular problem but is not yet sufficiently mature or reliable. That being said, hyperparameter optimization is an important field of research that, while often driven primarily by the needs of deep learning, holds the potential to benefit not + +only the entire field of machine learning but also the discipline of engineering in general. + +One drawback common to most hyperparameter optimization algorithms with more sophistication than random search is that they require for a training experiment to run to completion before they are able to extract any information from the experiment. This is much less efficient, in the sense of how much information can be gleaned early in an experiment, than manual search by a human practitioner, since one can usually tell early on if some set of hyperparameters is completely pathological. Swersky et al. (2014) have introduced an early version of an algorithm that maintains a set of multiple experiments. At various time points, the hyperparameter optimization algorithm can choose to begin a new experiment, to “freeze” a running experiment that is not promising, or to “thaw” and resume an experiment that was earlier frozen but now appears promising given more information. + +## 11.5 Debugging Strategies + +When a machine learning system performs poorly, it is usually difficult to tell whether the poor performance is intrinsic to the algorithm itself or whether there is a bug in the implementation of the algorithm. Machine learning systems are difficult to debug for various reasons. + +In most cases, we do not know a priori what the intended behavior of the algorithm is. In fact, the entire point of using machine learning is that it will discover useful behavior that we were not able to specify ourselves. 
If we train a neural network on a new classification task and it achieves 5 percent test error, we have no straightforward way of knowing if this is the expected behavior or suboptimal behavior. + +A further difficulty is that most machine learning models have multiple parts that are each adaptive. If one part is broken, the other parts can adapt and still achieve roughly acceptable performance. For example, suppose that we are training a neural net with several layers parametrized by weights W and biases b . Suppose further that we have manually implemented the gradient descent rule for each parameter separately, and we made an error in the update for the biases: + +b ← b − α, (11.4) + +where α is the learning rate. This erroneous update does not use the gradient at all. It causes the biases to constantly become negative throughout learning, which + +is clearly not a correct implementation of any reasonable learning algorithm. The bug may not be apparent just from examining the output of the model though. Depending on the distribution of the input, the weights may be able to adapt to compensate for the negative biases. + +Most debugging strategies for neural nets are designed to get around one or both of these two difficulties. Either we design a case that is so simple that the correct behavior actually can be predicted, or we design a test that exercises one part of the neural net implementation in isolation. + +Some important debugging tests include the following. + +Visualize the model in action: When training a model to detect objects in images, view some images with the detections proposed by the model displayed superimposed on the image. When training a generative model of speech, listen to some of the speech samples it produces. This may seem obvious, but it is easy to fall into the practice of looking only at quantitative performance measurements like accuracy or log-likelihood. 
Directly observing the machine learning model performing its task will help to determine whether the quantitative performance numbers it achieves seem reasonable. Evaluation bugs can be some of the most devastating bugs because they can mislead you into believing your system is performing well when it is not. + +Visualize the worst mistakes: Most models are able to output some sort of confidence measure for the task they perform. For example, classifiers based on a softmax output layer assign a probability to each class. The probability assigned to the most likely class thus gives an estimate of the confidence the model has in its classification decision. Typically, maximum likelihood training results in these values being overestimates rather than accurate probabilities of correct prediction, but they are somewhat useful in the sense that examples that are actually less likely to be correctly labeled receive smaller probabilities under the model. By viewing the training set examples that are the hardest to model correctly, one can often discover problems with the way the data have been preprocessed or labeled. For example, the Street View transcription system originally had a problem where the address number detection system would crop the image too tightly and omit some digits. The transcription network then assigned very low probability to the correct answer on these images. Sorting the images to identify the most confident mistakes showed that there was a systematic problem with the cropping. Modifying the detection system to crop much wider images resulted in much better performance of the overall system, even though the transcription network needed to be able to process greater variation in the position and scale of the address numbers. + +Reason about software using training and test error: It is often difficult to + +determine whether the underlying software is correctly implemented. Some clues can be obtained from the training and test errors. 
If training error is low but test error is high, then it is likely that the training procedure works correctly, and the model is overfitting for fundamental algorithmic reasons. An alternative possibility is that the test error is measured incorrectly because of a problem with saving the model after training then reloading it for test set evaluation, or because the test data was prepared differently from the training data. If both training and test errors are high, then it is difficult to determine whether there is a software defect or whether the model is underfitting due to fundamental algorithmic reasons. This scenario requires further tests, described next. + +Fit a tiny dataset: If you have high error on the training set, determine whether it is due to genuine underfitting or due to a software defect. Usually even small models can be guaranteed to be able to fit a sufficiently small dataset. For example, a classification dataset with only one example can be fit just by setting the biases of the output layer correctly. Usually if you cannot train a classifier to correctly label a single example, an autoencoder to successfully reproduce a single example with high fidelity, or a generative model to consistently emit samples resembling a single example, there is a software defect preventing successful optimization on the training set. This test can be extended to a small dataset with few examples. + +Compare back-propagated derivatives to numerical derivatives: If you are using a software framework that requires you to implement your own gradient computations, or if you are adding a new operation to a differentiation library and must define its bprop method, then a common source of error is implementing this gradient expression incorrectly. One way to verify that these derivatives are correct is to compare the derivatives computed by your implementation of automatic differentiation to the derivatives computed by finite differences.
Because + +$$f'(x) = \lim_{\varepsilon \to 0} \frac{f(x + \varepsilon) - f(x)}{\varepsilon}, \tag{11.5}$$ + +we can approximate the derivative by using a small, finite ε: + +$$f'(x) \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}. \tag{11.6}$$ + +We can improve the accuracy of the approximation by using the centered difference: + +$$f'(x) \approx \frac{f(x + \frac{1}{2}\varepsilon) - f(x - \frac{1}{2}\varepsilon)}{\varepsilon}. \tag{11.7}$$ + +The perturbation size ε must be large enough to ensure that the perturbation is not rounded down too much by finite-precision numerical computations. + +Usually, we will want to test the gradient or Jacobian of a vector-valued function $g : \mathbb{R}^m \to \mathbb{R}^n$. Unfortunately, finite differencing only allows us to take a single derivative at a time. We can either run finite differencing mn times to evaluate all the partial derivatives of g, or apply the test to a new function that uses random projections at both the input and the output of g. For example, we can apply our test of the implementation of the derivatives to f(x), where $f(x) = u^\top g(vx)$, and u and v are randomly chosen vectors. Computing $f'(x)$ correctly requires being able to back-propagate through g correctly yet is efficient to do with finite differences because f has only a single input and a single output. It is usually a good idea to repeat this test for more than one value of u and v to reduce the chance of the test overlooking mistakes that are orthogonal to the random projection. + +If one has access to numerical computation on complex numbers, then there is a very efficient way to numerically estimate the gradient by using complex numbers as input to the function (Squire and Trapp, 1998). The method is based on the observation that + +$$f(x + i\varepsilon) = f(x) + i\varepsilon f'(x) + O(\varepsilon^2), \tag{11.8}$$ + +$$\mathrm{real}(f(x + i\varepsilon)) = f(x) + O(\varepsilon^2), \quad \mathrm{imag}\left(\frac{f(x + i\varepsilon)}{\varepsilon}\right) = f'(x) + O(\varepsilon^2), \tag{11.9}$$ + +where $i = \sqrt{-1}$. Unlike in the real-valued case above, there is no cancellation effect, because we never take the difference between the values of f at different points.
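The finite-difference and complex-step checks described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the book's code: the function `g`, its hand-written Jacobian `g_jacobian` (which plays the role of the derivative code under test), the evaluation point `x`, and the tolerance are all invented for the example.

```python
import random


def g(z):
    # Example vector-valued function g: R^2 -> R^2 (assumed for illustration).
    return [z[0] * z[1], z[0] + z[1] * z[1]]


def g_jacobian(z):
    # Hand-written Jacobian of g; stands in for the implementation being checked.
    return [[z[1], z[0]], [1.0, 2.0 * z[1]]]


def projected(u, v):
    # f(x) = u^T g(v x): a function with one scalar input and one scalar output.
    def f(x):
        gz = g([vi * x for vi in v])
        return sum(ui * gi for ui, gi in zip(u, gz))
    return f


def projected_derivative(u, v, x):
    # f'(x) = u^T J(v x) v, computed from the hand-written Jacobian.
    J = g_jacobian([vi * x for vi in v])
    return sum(u[i] * J[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def centered_difference(f, x, eps=1e-5):
    # Centered finite difference: (f(x + eps/2) - f(x - eps/2)) / eps.
    return (f(x + 0.5 * eps) - f(x - 0.5 * eps)) / eps


def complex_step(f, x, eps=1e-150):
    # Complex-step estimate: imag(f(x + i*eps)) / eps; f must accept complex input.
    return f(complex(x, eps)).imag / eps


def check_gradients(trials=5, x=0.7, tol=1e-6, seed=0):
    # Repeat the test for several random projections u, v.
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = [rng.gauss(0, 1) for _ in range(2)]
        v = [rng.gauss(0, 1) for _ in range(2)]
        f = projected(u, v)
        analytic = projected_derivative(u, v, x)
        worst = max(worst,
                    abs(centered_difference(f, x) - analytic),
                    abs(complex_step(f, x) - analytic))
    return worst < tol, worst


ok, worst_error = check_gradients()
```

The complex-step variant only works when every operation inside `g` is defined for complex numbers, which is why the example uses nothing but multiplication and addition.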
The absence of cancellation allows the use of tiny values of ε, like $\varepsilon = 10^{-150}$, which make the $O(\varepsilon^2)$ error insignificant for all practical purposes. + +Monitor histograms of activations and gradient: It is often useful to visualize statistics of neural network activations and gradients, collected over a large number of training iterations (maybe one epoch). The preactivation value of hidden units can tell us if the units saturate, or how often they do. For example, for rectifiers, how often are they off? Are there units that are always off? For tanh units, the average of the absolute value of the preactivations tells us how saturated the unit is. In a deep network where the propagated gradients quickly grow or quickly vanish, optimization may be hampered. Finally, it is useful to compare the magnitude of parameter gradients to the magnitude of the parameters themselves. As suggested by Bottou (2015), we would like the magnitude of parameter updates over a minibatch to represent something like 1 percent of the magnitude of the parameter, not 50 percent or 0.001 percent (which would make the parameters move too slowly). It may be that some groups of parameters are moving at a good pace while others are stalled. When the data is sparse (like in natural language), some parameters may be very rarely updated, and this should be kept in mind when monitoring their evolution. + +Finally, many deep learning algorithms provide some sort of guarantee about the results produced at each step. For example, in part III, we will see some approximate inference algorithms that work by using algebraic solutions to optimization problems. Typically these can be debugged by testing each of their guarantees.
Some guarantees that some optimization algorithms offer include that the objective function will never increase after one step of the algorithm, that the gradient with respect to some subset of variables will be zero after each step of the algorithm, and that the gradient with respect to all variables will be zero at convergence. Usually due to rounding error, these conditions will not hold exactly in a digital computer, so the debugging test should include some tolerance parameter. + +## 11.6 Example: Multi-Digit Number Recognition + +To provide an end-to-end description of how to apply our design methodology in practice, we present a brief account of the Street View transcription system, from the point of view of designing the deep learning components. Obviously, many other components of the complete system, such as the Street View cars, the database infrastructure, and so on, were of paramount importance. + +From the point of view of the machine learning task, the process began with data collection. The cars collected the raw data, and human operators provided labels. The transcription task was preceded by a significant amount of dataset curation, including using other machine learning techniques to detect the house numbers prior to transcribing them. + +The transcription project began with a choice of performance metrics and desired values for these metrics. An important general principle is to tailor the choice of metric to the business goals for the project. Because maps are only useful if they have high accuracy, it was important to set a high accuracy requirement for this project. Specifically, the goal was to obtain human-level, 98 percent accuracy. This level of accuracy may not always be feasible to obtain. To reach this level of accuracy, the Street View transcription system sacrificed coverage. Coverage thus became the main performance metric optimized during the project, with accuracy held at 98 percent. 
As the convolutional network improved, it became possible to reduce the confidence threshold below which the network refused to transcribe the input, eventually exceeding the goal of 95 percent coverage. + +After choosing quantitative goals, the next step in our recommended methodology is to rapidly establish a sensible baseline system. For vision tasks, this means a convolutional network with rectified linear units. The transcription project began with such a model. At the time, it was not common for a convolutional network to output a sequence of predictions. To begin with the simplest possible baseline, the first implementation of the output layer of the model consisted of n different softmax units to predict a sequence of n characters. These softmax units were trained exactly the same as if the task were classification, with each softmax unit trained independently. + +Our recommended methodology is to iteratively refine the baseline and test whether each change makes an improvement. The first change to the Street View transcription system was motivated by a theoretical understanding of the coverage metric and the structure of the data. Specifically, the network refused to classify an input x whenever the probability of the output sequence p(y | x) < t for some threshold t. Initially, the definition of p(y | x) was ad-hoc, based on simply multiplying all the softmax outputs together. This motivated the development of a specialized output layer and cost function that actually computed a principled log-likelihood. This approach allowed the example rejection mechanism to function much more effectively. + +At this point, coverage was still below 90 percent, yet there were no obvious theoretical problems with the approach. Our methodology therefore suggested instrumenting the training and test set performance to determine whether the problem was underfitting or overfitting. In this case, training and test set error were nearly identical.
Indeed, the main reason this project proceeded so smoothly was the availability of a dataset with tens of millions of labeled examples. Because training and test set error were so similar, this suggested that the problem was due to either underfitting or a problem with the training data. One of the debugging strategies we recommend is to visualize the model’s worst errors. In this case, that meant visualizing the incorrect training set transcriptions that the model gave the highest confidence. These proved to mostly consist of examples where the input image had been cropped too tightly, with some of the digits of the address being removed by the cropping operation. For example, a photo of an address “1849” might be cropped too tightly, with only the “849” remaining visible. This problem could have been resolved by spending weeks improving the accuracy of the address number detection system responsible for determining the cropping regions. Instead, the team made a much more practical decision, to simply expand the width of the crop region to be systematically wider than the address number detection system predicted. This single change added ten percentage points to the transcription system’s coverage. + +Finally, the last few percentage points of performance came from adjusting hyperparameters. This mostly consisted of making the model larger while maintaining some restrictions on its computational cost. Because train and test error remained roughly equal, it was always clear that any performance deficits were due to underfitting, as well as to a few remaining problems with the dataset itself. + +Overall, the transcription project was a great success and allowed hundreds of millions of addresses to be transcribed both faster and at lower cost than would have been possible via human effort. + +We hope that the design principles described in this chapter will lead to many other similar successes. 
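The coverage metric used in this example, the fraction of inputs the network agrees to transcribe while accuracy on the transcribed inputs is held at a target, can be sketched as follows. The chapter does not give the exact procedure, so this is one assumed way of computing it; the function name, the toy confidences, and the 0.75 target are invented for illustration.

```python
def coverage_at_accuracy(confidences, correct, target_accuracy=0.98):
    """Largest coverage whose accepted examples reach the target accuracy.

    confidences: model confidence p(y | x) for each example.
    correct: whether the model's transcription was right, per example.
    Returns (coverage, threshold); inputs whose confidence falls below the
    threshold are refused.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    best_coverage, best_threshold = 0.0, float("inf")
    n_correct = 0
    for k, (confidence, is_correct) in enumerate(ranked, start=1):
        n_correct += bool(is_correct)
        # Only place the threshold between distinct confidence values.
        at_boundary = k == len(ranked) or ranked[k][0] < confidence
        if at_boundary and n_correct / k >= target_accuracy:
            best_coverage, best_threshold = k / len(ranked), confidence
    return best_coverage, best_threshold


confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40]
correct = [True, True, True, False, True, False]
coverage, threshold = coverage_at_accuracy(confidences, correct, target_accuracy=0.75)
```

Here the five most confident examples contain four correct transcriptions, so the threshold can drop to 0.60 before accuracy on the accepted inputs falls below the target.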
\ No newline at end of file diff --git a/docs/evidence/goodfellow_ch15_representation_learning.md b/docs/evidence/goodfellow_ch15_representation_learning.md new file mode 100644 index 0000000..1c3a84a --- /dev/null +++ b/docs/evidence/goodfellow_ch15_representation_learning.md @@ -0,0 +1,302 @@ +Source: https://www.deeplearningbook.org/contents/representation.html +Title: Deep Learning Book - Chapter 15: Representation Learning - Goodfellow, Bengio, Courville +Fetched-via: local copy /media/wassname/.../goodfellow_deep_learning/chapters/15_representation.md +Fetch-status: verbatim + +# Chapter 15 + +# Representation Learning + +In this chapter, we first discuss what it means to learn representations and how the notion of representation can be useful to design deep architectures. We explore how learning algorithms share statistical strength across different tasks, including using information from unsupervised tasks to perform supervised tasks. Shared representations are useful to handle multiple modalities or domains, or to transfer learned knowledge to tasks for which few or no examples are given but a task representation exists. Finally, we step back and argue about the reasons for the success of representation learning, starting with the theoretical advantages of distributed representations (Hinton et al., 1986) and deep representations, ending with the more general idea of underlying assumptions about the data-generating process, in particular about underlying causes of the observed data. + +Many information processing tasks can be very easy or very difficult depending on how the information is represented. This is a general principle applicable to daily life, to computer science in general, and to machine learning. For example, it is straightforward for a person to divide 210 by 6 using long division. The task becomes considerably less straightforward if it is instead posed using the Roman numeral representation of the numbers. 
Most modern people asked to divide CCX by VI would begin by converting the numbers to the Arabic numeral representation, permitting long division procedures that make use of the place value system. More concretely, we can quantify the asymptotic runtime of various operations using appropriate or inappropriate representations. For example, inserting a number into the correct position in a sorted list of numbers is an O ( n ) operation if the list is represented as a linked list, but only O ( log n ) if the list is represented as a red-black tree. + +In the context of machine learning, what makes one representation better than + +another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task. + +We can think of feedforward networks trained by supervised learning as performing a kind of representation learning. Specifically, the last layer of the network is typically a linear classifier, such as a softmax regression classifier. The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier. For example, classes that were not linearly separable in the input features may become linearly separable in the last hidden layer. In principle, the last layer could be another kind of model, such as a nearest neighbor classifier (Salakhutdinov and Hinton, 2007a). The features in the penultimate layer should learn different properties depending on the type of the last layer. + +Supervised training of feedforward networks does not involve explicitly imposing any condition on the learned intermediate features. 
Other kinds of representation learning algorithms are often explicitly designed to shape the representation in some particular way. For example, suppose we want to learn a representation that makes density estimation easier. Distributions with more independences are easier to model, so we could design an objective function that encourages the elements of the representation vector h to be independent. Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Regardless of how a representation was obtained, it can be used for another task. Alternatively, multiple tasks (some supervised, some unsupervised) can be learned together with some shared internal representation. + +Most representation learning problems face a trade-off between preserving as much information about the input as possible and attaining nice properties (such as independence). + +Representation learning is particularly interesting because it provides one way to perform unsupervised and semi-supervised learning. We often have very large amounts of unlabeled training data and relatively little labeled training data. Training with supervised learning techniques on the labeled subset often results in severe overfitting. Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations for the unlabeled data, and then use these representations to solve the supervised learning task. + +Humans and animals are able to learn from very few labeled examples. We do + +not yet know how this is possible. Many factors could explain improved human performance—for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques. One popular hypothesis is that the brain is able to leverage unsupervised or semi-supervised learning. There are many ways to leverage unlabeled data. 
In this chapter, we focus on the hypothesis that the unlabeled data can be used to learn a good representation. + +## 15.1 Greedy Layer-Wise Unsupervised Pretraining + +Unsupervised learning played a key historical role in the revival of deep neural networks, enabling researchers for the first time to train a deep supervised network without requiring architectural specializations like convolution or recurrence. We call this procedure unsupervised pretraining , or more precisely, greedy layerwise unsupervised pretraining . This procedure is a canonical example of how a representation learned for one task (unsupervised learning, trying to capture the shape of the input distribution) can sometimes be useful for another task (supervised learning with the same input domain). + +Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, a sparse coding model, or another model that learns latent representations. Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution (or its relation to other variables, such as categories to predict) is hopefully simpler. See algorithm 15.1 for a formal description. + +Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural net for a supervised task. This approach dates back at least as far as the neocognitron (Fukushima, 1975). The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Hinton, 2006; Bengio et al., 2007; Ranzato et al., 2007a). 
Prior to this discovery, only convolutional deep networks or networks whose depth resulted from recurrence were regarded as feasible to train. We now know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed. + +Greedy layer-wise pretraining is called greedy because it is a greedy algorithm , meaning that it optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces. It is called layer-wise because these independent pieces are the layers of the network. Specifically, greedy layer-wise pretraining proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. However, it is also called pretraining because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together. In the context of a supervised learning task, it can be viewed as a regularizer (in some experiments, pretraining decreases test error without decreasing training error) and a form of parameter initialization. + +It is common to use the word “pretraining” to refer not only to the pretraining stage itself but to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase. The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase. No matter what kind of unsupervised learning algorithm or what model type is employed, in most cases, the overall training scheme is nearly the same.
While the choice of unsupervised learning algorithm will obviously affect the details, most applications of unsupervised pretraining follow this basic protocol. + +Greedy layer-wise unsupervised pretraining can also be used as initialization for other unsupervised learning algorithms, such as deep autoencoders (Hinton and Salakhutdinov, 2006) and probabilistic models with many layers of latent variables. Such models include deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov and Hinton, 2009a). These deep generative models are described in chapter 20. + +As discussed in section 8.7.4, it is also possible to have greedy layer-wise supervised pretraining. This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010). + +Algorithm 15.1 Greedy layer-wise unsupervised pretraining protocol. + +Given the following: Unsupervised feature learning algorithm $L$, which takes a training set of examples and returns an encoder or feature function $f$. The raw input data is $X$, with one row per example, and $f^{(1)}(X)$ is the output of the first stage encoder on $X$. In the case where fine-tuning is performed, we use a learner $T$, which takes an initial function $f$, input examples $X$ (and in the supervised fine-tuning case, associated targets $Y$), and returns a tuned function. The number of stages is $m$. + +$f \leftarrow$ identity function + +$\tilde{X} = X$ + +for $k = 1, \ldots, m$ do + +$\quad f^{(k)} = L(\tilde{X})$ + +$\quad f \leftarrow f^{(k)} \circ f$ + +$\quad \tilde{X} \leftarrow f^{(k)}(\tilde{X})$ + +end for + +if fine-tuning then + +$\quad f \leftarrow T(f, X, Y)$ + +end if + +Return $f$ + +### 15.1.1 When and Why Does Unsupervised Pretraining Work? + +On many tasks, greedy layer-wise unsupervised pretraining can yield substantial improvements in test error for classification tasks. This observation was responsible for the renewed interest in deep neural networks starting in 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a).
On many other tasks, however, unsupervised pretraining either does not confer a benefit or even causes noticeable harm. Ma et al. (2015) studied the effect of pretraining on machine learning models for chemical activity prediction and found that, on average, pretraining was slightly harmful, but for many tasks was significantly helpful. Because unsupervised pretraining is sometimes helpful but often harmful, it is important to understand when and why it works in order to determine whether it is applicable to a particular task. + +At the outset, it is important to clarify that most of this discussion is restricted to greedy unsupervised pretraining in particular. There are other, completely different paradigms for performing semi-supervised learning with neural networks, such as virtual adversarial training, described in section 7.13. It is also possible to train an autoencoder or generative model at the same time as the supervised model. Examples of this single-stage approach include the discriminative RBM (Larochelle and Bengio, 2008) and the ladder network (Rasmus et al., 2015), in which the total objective is an explicit sum of the two terms (one using the labels, and one only using the input). + +Unsupervised pretraining combines two different ideas. First, it makes use of the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model (and, to a lesser extent, that it can improve optimization). Second, it makes use of the more general idea that learning about the input distribution can help with learning about the mapping from inputs to outputs. + +Both ideas involve many complicated interactions between several parts of the machine learning algorithm that are not entirely understood. + +The first idea, that the choice of initial parameters for a deep neural network can have a strong regularizing effect on its performance, is the least understood.
At the time that pretraining became popular, it was understood as initializing the model in a location that would cause it to approach one local minimum rather than another. Today, local minima are no longer considered to be a serious problem for neural network optimization. We now know that our standard neural network training procedures usually do not arrive at a critical point of any kind. It remains possible that pretraining initializes the model in a location that would otherwise be inaccessible—for example, a region that is surrounded by areas where the cost function varies so much from one example to another that minibatches give only a very noisy estimate of the gradient, or a region surrounded by areas where the Hessian matrix is so poorly conditioned that gradient descent methods must use very small steps. However, our ability to characterize exactly what aspects of the pretrained parameters are retained during the supervised training stage is limited. This is one reason that modern approaches typically use simultaneous unsupervised learning and supervised learning rather than two sequential stages. One may also avoid struggling with these complicated ideas about how optimization in the supervised learning stage preserves information from the unsupervised learning stage by simply freezing the parameters for the feature extractors and using supervised learning only to add a classifier on top of the learned features. + +The other idea, that a learning algorithm can use information learned in the unsupervised phase to perform better in the supervised learning stage, is better understood. The basic idea is that some features that are useful for the unsupervised task may also be useful for the supervised learning task. For example, if we train a generative model of images of cars and motorcycles, it will need to know about wheels, and about how many wheels should be in an image. 
If we are fortunate, the representation of the wheels will take on a form that is easy for the supervised learner to access. This is not yet understood at a mathematical, theoretical level, so it is not always possible to predict which tasks will benefit from unsupervised learning in this way. Many aspects of this approach are highly dependent on the specific models used. For example, if we wish to add a linear classifier on top of pretrained features, the features must make the underlying classes linearly separable. These properties often occur naturally but do not always do so. This is another reason that simultaneous supervised and unsupervised learning can be preferable—the constraints imposed by the output layer are naturally included from the start. + +From the point of view of unsupervised pretraining as learning a representation, we can expect unsupervised pretraining to be more effective when the initial representation is poor. One key example of this is the use of word embeddings. Words represented by one-hot vectors are not very informative because every two distinct one-hot vectors are the same distance from each other (squared $L^2$ distance of 2). Learned word embeddings naturally encode similarity between words by their distance from each other. Because of this, unsupervised pretraining is especially useful when processing words. It is less useful when processing images, perhaps because images already lie in a rich vector space where distances provide a low-quality similarity metric. + +From the point of view of unsupervised pretraining as a regularizer, we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large.
The advantage of semi-supervised learning via unsupervised pretraining with many unlabeled examples and few labeled examples was made particularly clear in 2011 with unsupervised pretraining winning two international transfer learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011), in settings where the number of labeled examples in the target task was small (from a handful to dozens of examples per class). These effects were also documented in carefully controlled experiments by Paine et al. (2014). + +Other factors are likely to be involved. For example, unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated. Unsupervised learning differs from regularizers like weight decay because it does not bias the learner toward discovering a simple function but rather leads the learner toward discovering feature functions that are useful for the unsupervised learning task. If the true underlying functions are complicated and shaped by regularities of the input distribution, unsupervised learning can be a more appropriate regularizer. + +These caveats aside, we now analyze some success cases where unsupervised pretraining is known to cause an improvement and explain what is known about why this improvement occurs. Unsupervised pretraining has usually been used to improve classifiers and is usually most interesting from the point of view of reducing test set error. Unsupervised pretraining can help tasks other than classification, however, and can act to improve optimization rather than being merely a regularizer. For example, it can improve both train and test reconstruction error for deep autoencoders (Hinton and Salakhutdinov, 2006). + +Erhan et al. (2010) performed many experiments to explain several successes of unsupervised pretraining.
Improvements to training error and improvements to test error may both be explained in terms of unsupervised pretraining taking the parameters into a region that would otherwise be inaccessible. Neural network training is nondeterministic and converges to a different function every time it is run. Training may halt at a point where the gradient becomes small, a point where early stopping ends training to prevent overfitting, or at a point where the gradient is large, but it is difficult to find a downhill step because of problems such as stochasticity or poor conditioning of the Hessian. Neural networks that receive unsupervised pretraining consistently halt in the same region of function space, while neural networks without pretraining consistently halt in another region. See figure 15.1 for a visualization of this phenomenon. The region where pretrained networks arrive is smaller, suggesting that pretraining reduces the variance of the estimation process, which can in turn reduce the risk of severe overfitting. In other words, unsupervised pretraining initializes neural network parameters into a region that they do not escape, and the results following this initialization are more consistent and less likely to be very bad than without this initialization. + +Erhan et al. (2010) also provide some answers to when pretraining works best—the mean and variance of the test error were most reduced by pretraining for deeper networks. Keep in mind that these experiments were performed before the invention and popularization of modern techniques for training very deep networks (rectified linear units, dropout and batch normalization) so less is known about the effect of unsupervised pretraining in conjunction with contemporary approaches. + +An important question is how unsupervised pretraining can act as a regularizer.
One hypothesis is that pretraining encourages the learning algorithm to discover features that relate to the underlying causes that generate the observed data. This is an important idea motivating many other algorithms besides unsupervised pretraining and is described further in section 15.3. + +Compared to other forms of unsupervised learning, unsupervised pretraining has the disadvantage of operating with two separate training phases. Many regularization strategies have the advantage of allowing the user to control the strength of the regularization by adjusting the value of a single hyperparameter. Unsupervised pretraining does not offer a clear way to adjust the strength of the regularization arising from the unsupervised stage. Instead, there are very many hyperparameters, whose effect may be measured after the fact but is often difficult to predict ahead of time. When we perform unsupervised and supervised learning simultaneously, instead of using the pretraining strategy, there is a single hyperparameter, usually a coefficient attached to the unsupervised cost, that determines how strongly the unsupervised objective will regularize the supervised model. One can always predictably obtain less regularization by decreasing this coefficient. In unsupervised pretraining, there is not a way of flexibly adapting the strength of the regularization—either the supervised model is initialized to pretrained parameters, or it is not. + +Figure 15.1: Visualization via nonlinear projection of the learning trajectories of different neural networks in function space (not parameter space, to avoid the issue of many-to-one mappings from parameter vectors to functions), with different random initializations and with or without unsupervised pretraining. Each point corresponds to a different neural network at a particular time during its training process. This figure is adapted with permission from Erhan et al. (2010). A coordinate in function space is an infinite-dimensional vector associating every input x with an output y. Erhan et al. (2010) made a linear projection to high-dimensional space by concatenating the y for many specific x points. They then made a further nonlinear projection to 2-D by Isomap (Tenenbaum et al., 2000). Color indicates time. All networks are initialized near the center of the plot (corresponding to the region of functions that produce approximately uniform distributions over the class y for most inputs). Over time, learning moves the function outward, to points that make strong predictions.
Training consistently terminates in one region when using pretraining and in another, nonoverlapping region when not using pretraining. Isomap tries to preserve global relative distances (and hence volumes) so the small region corresponding to pretrained models may indicate that the pretraining-based estimator has reduced variance. (Figure legend: with pretraining; without pretraining.) + +Another disadvantage of having two separate training phases is that each phase has its own hyperparameters. The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase. The most principled approach is to use validation set error in the supervised phase to select the hyperparameters of the pretraining phase, as discussed in Larochelle et al. (2009). In practice, some hyperparameters, like the number of pretraining iterations, are more conveniently set during the pretraining phase, using early stopping on the unsupervised objective, which is not ideal but is computationally much cheaper than using the supervised objective.
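
As a rough illustration of the two-phase protocol of algorithm 15.1 discussed throughout this section, the following sketch stacks greedily trained encoders. It is a minimal sketch under stated assumptions, not the procedure used in any of the cited papers: PCA followed by a tanh nonlinearity stands in for the unsupervised learner L (the text names RBMs, autoencoders and sparse coding instead), and the names `pca_encoder`, `greedy_pretrain`, `layer_sizes` and `fine_tune` are hypothetical.

```python
import numpy as np


def pca_encoder(X, n_components):
    """Stand-in for the unsupervised learner L (an assumption, not from the text).

    Fits PCA on X and returns an encoder function playing the role of f^(k).
    """
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:n_components].T
    return lambda Z: np.tanh((Z - mean) @ W)


def greedy_pretrain(X, layer_sizes, learner=pca_encoder, fine_tune=None, Y=None):
    """Greedy layer-wise pretraining in the spirit of algorithm 15.1."""
    stages = []          # f starts as the identity (no stages yet)
    X_tilde = X          # X_tilde = X
    for size in layer_sizes:
        f_k = learner(X_tilde, size)   # f^(k) = L(X_tilde)
        stages.append(f_k)             # f <- f^(k) o f
        X_tilde = f_k(X_tilde)         # X_tilde <- f^(k)(X_tilde)

    def f(Z):
        # Apply the stages in the order they were trained.
        for f_k in stages:
            Z = f_k(Z)
        return Z

    if fine_tune is not None:
        # Optional joint phase: f <- T(f, X, Y), with T supplied by the caller.
        f = fine_tune(f, X, Y)
    return f


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # hypothetical unlabeled data
f = greedy_pretrain(X, layer_sizes=[10, 5])
features = f(X)
print(features.shape)
```

Each stage is fit only on the output of the stages below it, and earlier stages are never revisited, which is what makes the procedure greedy. Note that `layer_sizes` already contributes one hyperparameter per stage, none of which can be judged until the supervised phase has run.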
+ +Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing, where the natural representation of words as one-hot vectors conveys no similarity information and where very large unlabeled sets are available. In that case, the advantage of pretraining is that one can pretrain once on a huge unlabeled set (for example with a corpus containing billions of words), learn a good representation (typically of words, but also of sentences), and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples. This approach was pioneered by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al. (2011a) and remains in common use today. + +Deep learning techniques based on supervised learning, regularized with dropout or batch normalization, are able to achieve human-level performance on many tasks, but only with extremely large labeled datasets. These same techniques outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10 and MNIST, which have roughly 5,000 labeled examples per class. On extremely small datasets, such as the alternative splicing dataset, Bayesian methods outperform methods based on unsupervised pretraining (Srivastava, 2013). For these reasons, the popularity of unsupervised pretraining has declined. Nevertheless, unsupervised pretraining remains an important milestone in the history of deep learning research and continues to influence contemporary approaches. The idea of pretraining has been generalized to supervised pretraining, discussed in section 8.7.4, as a very common approach for transfer learning. Supervised pretraining for transfer learning is popular (Oquab et al., 2014; Yosinski et al., 2014) for use with convolutional networks pretrained on the ImageNet dataset.
Practitioners publish the parameters of these trained networks for this purpose, just as pretrained word vectors are published for natural language tasks (Collobert et al., 2011a; Mikolov et al., 2013a). + +## 15.2 Transfer Learning and Domain Adaptation + +Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (e.g., distribution $P_1$) is exploited to improve generalization in another setting (say, distribution $P_2$). This generalizes the idea presented in the previous section, where we transferred representations between an unsupervised learning task and a supervised learning task. + +In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in $P_1$ are relevant to the variations that need to be captured for learning $P_2$. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting. If there is significantly more data in the first setting (sampled from $P_1$), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from $P_2$. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, and so on. In general, transfer learning, multitask learning (section 7.7), and domain adaptation can be achieved via representation learning when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting. This is illustrated in figure 7.2, with shared lower layers and task-dependent upper layers.
+ +Sometimes, however, what is shared among the different tasks is not the semantics of the input but the semantics of the output. For example, a speech recognition system needs to produce valid sentences at the output layer, but the earlier layers near the input may need to recognize very different versions of the same phonemes or subphonemic vocalizations depending on which person is speaking. In cases like these, it makes more sense to share the upper layers (near the output) of the neural network and have a task-specific preprocessing, as illustrated in figure 15.2. + +Figure 15.2: Example architecture for multitask or transfer learning when the output variable y has the same semantics for all tasks while the input variable x has a different meaning (and possibly even a different dimension) for each task (or, for example, each user), called $x^{(1)}$, $x^{(2)}$ and $x^{(3)}$ for three tasks. The lower levels (up to the selection switch) are task-specific, while the upper levels are shared. The lower levels learn to translate their task-specific input into a generic set of features. + +In the related case of domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different. For example, consider the task of sentiment analysis, which consists of determining whether a comment expresses positive or negative sentiment. Comments posted on the web come from many categories. A domain adaptation scenario can arise when a sentiment predictor trained on customer reviews of media content, such as books, videos and music, is later used to analyze comments about consumer electronics, such as televisions or smartphones. One can imagine that there is an underlying function that tells whether any statement is positive, neutral, or negative, but of course the vocabulary and style may vary from one domain to another, making it more difficult to generalize across domains.
Simple unsupervised pretraining (with denoising autoencoders) has been found to be very successful for sentiment analysis with domain adaptation (Glorot et al., 2011b). A related problem is that of concept drift, which we can view as a form of transfer learning due to gradual changes in the data distribution over time. + +Both concept drift and transfer learning can be viewed as particular forms of multitask learning. While the phrase “multitask learning” typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well. + +In all these cases, the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting. The core idea of representation learning is that the same representation may be useful in both settings. Using the same representation in both settings allows the representation to benefit from the training data that is available for both tasks. + +As mentioned before, unsupervised deep learning for transfer learning has found success in some machine learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011). In the first of these competitions, the experimental setup is the following. Each participant is first given a dataset from the first setting (from distribution $P_1$), illustrating examples of some set of categories. The participants must use this to learn a good feature space (mapping the raw input to some representation), such that when we apply this learned transformation to inputs from the transfer setting (distribution $P_2$), a linear classifier can be trained and generalize well from few labeled examples.
One of the most striking results found in this competition is that as an architecture makes use of deeper and deeper representations (learned in a purely unsupervised way from data collected in the first setting, $P_1$), the learning curve on the new categories of the second (transfer) setting $P_2$ becomes much better. For deep representations, fewer labeled examples of the transfer tasks are necessary to achieve the apparently asymptotic generalization performance. + +Two extreme forms of transfer learning are one-shot learning and zero-shot learning, sometimes also called zero-data learning. Only one labeled example of the transfer task is given for one-shot learning, while no labeled examples are given at all for the zero-shot learning task. + +One-shot learning (Fei-Fei et al., 2006) is possible because the representation learns to cleanly separate the underlying classes during the first stage. During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space. This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned representation space, and that we have somehow learned which factors do and do not matter when discriminating objects of certain categories. + +As an example of a zero-shot learning setting, consider the problem of having a learner read a large collection of text and then solve object recognition problems. It may be possible to recognize a specific object class even without having seen an image of that object if the text describes the object well enough. For example, having read that a cat has four legs and pointy ears, the learner might be able to guess that an image is a cat without having seen a cat before.
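
The cat example above can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than a method from the cited work: each class is described by a hand-written attribute vector (standing in for what was learned from text), and `image_attributes` stands in for the output of an attribute detector trained on images of other classes. Classification picks the class whose description lies closest to the detected attributes.

```python
import numpy as np

# Hypothetical attribute order: [four legs, pointy ears, wings, can fly].
# These vectors play the role of knowledge extracted from text.
class_descriptions = {
    "cat": np.array([1.0, 1.0, 0.0, 0.0]),
    "bird": np.array([0.0, 0.0, 1.0, 1.0]),
    "fish": np.array([0.0, 0.0, 0.0, 0.0]),
}


def zero_shot_classify(predicted_attributes, descriptions):
    """Return the class whose description is nearest to the predicted attributes."""
    names = list(descriptions)
    table = np.stack([descriptions[name] for name in names])
    distances = np.linalg.norm(table - predicted_attributes, axis=1)
    return names[int(np.argmin(distances))]


# Assumed detector output for an image of an animal never seen in training:
# strong evidence for four legs and pointy ears, little for wings or flight.
image_attributes = np.array([0.9, 0.8, 0.1, 0.0])
label = zero_shot_classify(image_attributes, class_descriptions)
print(label)  # cat
```

No labeled cat image is used at any point. The only link between images and the class "cat" is the shared attribute space, which is why the description of the class has to be more informative than an arbitrary label.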
+ +Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Palatucci et al., 2009; Socher et al., 2013b) are only possible because additional information has been exploited during training. We can think of the zero-data learning scenario as including three random variables: the traditional inputs x , the traditional outputs or targets y , and an additional random variable describing the task, T . The model is trained to estimate the conditional distribution p ( y | x, T ), where T is a description of the task we wish the model to perform. In our example of recognizing cats after having read about cats, the output is a binary variable y with y = 1 indicating “yes” and y = 0 indicating “no.” The task variable T then represents questions to be answered, such as “Is there a cat in this image?” If we have a training set containing unsupervised examples of objects that live in the same space as T , we may be able to infer the meaning of unseen instances of T . In our example of recognizing cats without having seen an image of the cat, it is important that we have had unlabeled text data containing sentences such as “cats have four legs” or “cats have pointy ears.” + +Zero-shot learning requires T to be represented in a way that allows some sort of generalization. For example, T cannot be just a one-hot code indicating an object category. Socher et al. (2013b) provide instead a distributed representation of object categories by using a learned word embedding for the word associated with each category. + +A similar phenomenon happens in machine translation (Klementiev et al., 2012; Mikolov et al., 2013b; Gouws et al., 2014): we have words in one language, and the relationships between words can be learned from unilingual corpora; on the other hand, we have translated sentences that relate words in one language with words in the other. 
Even though we may not have labeled examples translating word A in language X to word B in language Y, we can generalize and guess a translation for word A because we have learned a distributed representation for words in language X and a distributed representation for words in language Y, then created a link (possibly two-way) relating the two spaces, via training examples consisting of matched pairs of sentences in both languages. This transfer will be most successful if all three ingredients (the two representations and the relations between them) are learned jointly. + +Zero-shot learning is a particular form of transfer learning. The same principle explains how one can perform multimodal learning, capturing a representation in one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs (x, y) consisting of one observation x in one modality and another observation y in the other modality (Srivastava and Salakhutdinov, 2012). By learning all three sets of parameters (from x to its representation, from y to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice versa, allowing one to meaningfully generalize to new pairs. The procedure is illustrated in figure 15.3. + +Figure 15.3: Transfer learning between two domains x and y enables zero-shot learning. Labeled or unlabeled examples of x allow one to learn a representation function $f_x$ and similarly with examples of y to learn $f_y$. Each application of the $f_x$ and $f_y$ functions appears as an upward arrow, with the style of the arrows indicating which function is applied. Distance in $h_x$ space provides a similarity metric between any pair of points in x space that may be more meaningful than distance in x space. Likewise, distance in $h_y$ space provides a similarity metric between any pair of points in y space. Both of these similarity functions are indicated with dotted bidirectional arrows. Labeled examples (dashed horizontal lines) are pairs (x, y) that allow one to learn a one-way or two-way map (solid bidirectional arrow) between the representations $f_x(x)$ and the representations $f_y(y)$ and to anchor these representations to each other. Zero-data learning is then enabled as follows. One can associate an image $x_{\text{test}}$ to a word $y_{\text{test}}$, even if no image of that word was ever presented, simply because word representations $f_y(y_{\text{test}})$ and image representations $f_x(x_{\text{test}})$ can be related to each other via the maps between representation spaces.
It works because, although that image and that word were never paired, their respective feature vectors $f_x(x_{\text{test}})$ and $f_y(y_{\text{test}})$ have been related to each other. Figure inspired from suggestion by Hrant Khachatrian. + +## 15.3 Semi-Supervised Disentangling of Causal Factors + +An important question about representation learning is: what makes one representation better than another? One hypothesis is that an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another. This hypothesis motivates approaches in which we first seek a good representation for p(x). Such a representation may also be a good representation for computing p(y | x) if y is among the most salient causes of x. This idea has guided a large amount of deep learning research since at least the 1990s (Becker and Hinton, 1992; Hinton and Sejnowski, 1999). For other arguments about when semi-supervised learning can outperform pure supervised learning, we refer the reader to section 1.2 of Chapelle et al. (2006).
+ +In other approaches to representation learning, we have often been concerned with a representation that is easy to model—for example, one whose entries are sparse or independent from each other. A representation that cleanly separates the underlying causal factors may not necessarily be one that is easy to model. However, a further part of the hypothesis motivating semi-supervised learning via unsupervised representation learning is that for many AI tasks, these two properties coincide: once we are able to obtain the underlying explanations for what we observe, it generally becomes easy to isolate individual attributes from the others. Specifically, if a representation h represents many of the underlying causes of the observed x, and the outputs y are among the most salient causes, then it is easy to predict y from h. + +First, let us see how semi-supervised learning can fail because unsupervised learning of p(x) is of no help to learning p(y | x). Consider, for example, the case where p(x) is uniformly distributed and we want to learn f(x) = E[y | x]. Clearly, observing a training set of x values alone gives us no information about p(y | x). + +Figure 15.4: Mixture model. Example of a density over x that is a mixture over three components. The component identity is an underlying explanatory factor, y. Because the mixture components (e.g., natural object classes in image data) are statistically salient, just modeling p(x) in an unsupervised way with no labeled example already reveals the factor y. + +Next, let us see a simple example of how semi-supervised learning can succeed. Consider the situation where x arises from a mixture, with one mixture component per value of y, as illustrated in figure 15.4. If the mixture components are well separated, then modeling p(x) reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn p(y | x).
But more generally, what could tie p(y | x) and p(x) together? + +If y is closely associated with one of the causal factors of x , then p ( x ) and p ( y | x ) will be strongly tied, and unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy. + +Consider the assumption that y is one of the causal factors of x , and let h represent all those factors. The true generative process can be conceived as structured according to this directed graphical model, with h as the parent of x: + +$$p(h, x) = p(x \mid h)\, p(h). \tag{15.1}$$ + +As a consequence, the data has marginal probability + +$$p(x) = \mathbb{E}_{h}\, p(x \mid h). \tag{15.2}$$ + +From this straightforward observation, we conclude that the best possible model of x (from a generalization point of view) is the one that uncovers the above “true” structure, with h as a latent variable that explains the observed variations in x . The “ideal” representation learning discussed above should thus recover these latent factors. If y is one of these (or closely related to one of them), then it will be easy to learn to predict y from such a representation. We also see that the conditional distribution of y given x is tied by Bayes’ rule to the components in the above equation: + +$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}. \tag{15.3}$$ + +Thus the marginal p ( x ) is intimately tied to the conditional p ( y | x ), and knowledge of the structure of the former should be helpful to learn the latter. Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance. + +An important research problem regards the fact that most observations are formed by an extremely large number of underlying causes. Suppose $y = h_i$, but the unsupervised learner does not know which $h_i$. 
The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors $h_j$ and disentangles them from each other, thus making it easy to predict y from h, regardless of which $h_i$ is associated with y. + +In practice, the brute force solution is not feasible because it is not possible to capture all or most of the factors of variation that influence an observation. For example, in a visual scene, should the representation always encode all the smallest objects in the background? It is a well-documented psychological phenomenon that human beings fail to perceive changes in their environment that are not immediately relevant to the task they are performing; see, for example, Simons and Levin (1998). An important research frontier in semi-supervised learning is determining what to encode in each situation. Currently, two of the main strategies for dealing with a large number of underlying causes are to use a supervised learning signal at the same time as the unsupervised learning signal so that the model will choose to capture the most relevant factors of variation, or to use much larger representations if using purely unsupervised learning. + +An emerging strategy for unsupervised learning is to modify the definition of which underlying causes are most salient. Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to mean squared error. These fixed criteria determine which causes are considered salient. For example, mean squared error applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels. This can be problematic if the task we wish to solve involves interacting with small objects. 
See figure 15.5 for an example + +Input Reconstruction + +Figure 15.5: An autoencoder trained with mean squared error for a robotics task has failed to reconstruct a ping pong ball. The existence of the ping pong ball and all its spatial coordinates are important underlying causal factors that generate the image and are relevant to the robotics task. Unfortunately, the autoencoder has limited capacity, and the training with mean squared error did not identify the ping pong ball as being salient enough to encode. Images graciously provided by Chelsea Finn. + +of a robotics task in which an autoencoder has failed to learn to encode a small ping pong ball. This same robot is capable of successfully interacting with larger objects, such as baseballs, which are more salient according to mean squared error. + +Other definitions of salience are possible. For example, if a group of pixels follows a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient. One way to implement such a definition of salience is to use a recently developed approach called generative adversarial networks (Goodfellow et al., 2014c). In this approach, a generative model is trained to fool a feedforward classifier. The feedforward classifier attempts to recognize all samples from the generative model as being fake and all samples from the training set as being real. In this framework, any structured pattern that the feedforward network can recognize is highly salient. The generative adversarial network is described in more detail in section 20.10.4. For the purposes of the present discussion, it is sufficient to understand that the networks learn how to determine what is salient. Lotter et al. 
(2015) showed that models trained to generate images of human heads will often neglect to generate the ears when trained with mean squared error, but will successfully generate the ears when trained with the adversarial framework. Because the ears are not extremely bright or dark compared to the surrounding skin, they are not especially salient according to mean squared error loss, but their highly recognizable shape + +Ground Truth MSE Adversarial + +Figure 15.6: Predictive generative networks provide an example of the importance of learning which features are salient. In this example, the predictive generative network has been trained to predict the appearance of a 3-D model of a human head at a specific viewing angle. (Left)Ground truth. This is the correct image, which the network should emit. (Center)Image produced by a predictive generative network trained with mean squared error alone. Because the ears do not cause an extreme difference in brightness compared to the neighboring skin, they were not sufficiently salient for the model to learn to represent them. (Right)Image produced by a model trained with a combination of mean squared error and adversarial loss. Using this learned cost function, the ears are salient because they follow a predictable pattern. Learning which underlying causes are important and relevant enough to model is an important active area of research. Figures graciously provided by Lotter et al. (2015). + +and consistent position means that a feedforward network can easily learn to detect them, making them highly salient under the generative adversarial framework. See figure 15.6 for example images. Generative adversarial networks are only one step toward determining which factors should be represented. We expect that future research will discover better ways of determining which factors to represent and develop mechanisms for representing different factors depending on the task. 
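The claim that mean squared error treats small objects as barely salient can be checked with a toy calculation. The sketch below is illustrative only: the image size, the object sizes, and the helper names `make_image` and `mse` are assumptions, not details of the robotics experiment in figure 15.5. It measures the pixelwise MSE penalty a model would pay for leaving a bright square object out of its reconstruction.

```python
def make_image(side, object_side, brightness=1.0):
    """Return a side x side black image with a bright square in the top-left corner."""
    image = [[0.0] * side for _ in range(side)]
    for row in range(object_side):
        for col in range(object_side):
            image[row][col] = brightness
    return image


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return total / count


SIDE = 64  # hypothetical image resolution
background_only = make_image(SIDE, 0)  # a reconstruction that omits the object

# Penalty for dropping a small (2x2) object versus a larger (16x16) object.
small_object_penalty = mse(make_image(SIDE, 2), background_only)
large_object_penalty = mse(make_image(SIDE, 16), background_only)

print(small_object_penalty, large_object_penalty)
print(large_object_penalty / small_object_penalty)
```

Under these assumptions, dropping the small object costs 64 times less MSE than dropping the larger one, so a capacity-limited model trained with this criterion has little incentive to encode it.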
+ +A benefit of learning the underlying causal factors, as pointed out by Schölkopf et al. (2012), is that if the true generative process has x as an effect and y as a cause, then modeling p ( x | y ) is robust to changes in p ( y ). If the cause-effect relationship were reversed, this would not be true, since by Bayes’ rule, p ( x | y ) would be sensitive to changes in p ( y ). Very often, when we consider changes in distribution due to different domains, temporal nonstationarity, or changes in the nature of the task, the causal mechanisms remain invariant (“the laws of the universe are constant”), while the marginal distribution over the underlying causes can change. Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover the causal factors h and p(x | h). + +## 15.4 Distributed Representation + +Distributed representations of concepts (representations composed of many elements that can be set separately from each other) are one of the most important tools for representation learning. Distributed representations are powerful because they can use n features with k values to describe $k^n$ different concepts. As we have seen throughout this book, neural networks with multiple hidden units and probabilistic models with multiple latent variables both make use of the strategy of distributed representation. We now introduce an additional observation. Many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data, as discussed in section 15.3. Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable. 
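The counting argument above, that n features with k values each can describe $k^n$ concepts, can be made concrete with a small enumeration. The sketch below uses illustrative function names (assumptions, not from the text). It lists every code that n features with k values can form and compares the count with a code in which only one unit may be active at a time (a one-hot code).

```python
from itertools import product


def distributed_codes(n_features, k_values):
    """Enumerate every joint setting of n features that each take one of k values."""
    return list(product(range(k_values), repeat=n_features))


def one_hot_codes(n_units):
    """Enumerate the codes of a one-hot vector: exactly one unit is active."""
    return [tuple(1 if i == active else 0 for i in range(n_units))
            for active in range(n_units)]


n, k = 3, 2
print(len(distributed_codes(n, k)))  # k ** n joint configurations
print(len(one_hot_codes(n)))         # only n configurations
```

With the same number of units, the distributed code distinguishes exponentially many configurations while the one-hot code distinguishes only n.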
+ +An example of a distributed representation is a vector of n binary features, which can take $2^n$ configurations, each potentially corresponding to a different region in input space, as illustrated in figure 15.7. This can be compared with a symbolic representation, where the input is associated with a single symbol or category. If there are n symbols in the dictionary, one can imagine n feature detectors, each corresponding to the detection of the presence of the associated category. In that case only n different configurations of the representation space are possible, carving n different regions in input space, as illustrated in figure 15.8. Such a symbolic representation is also called a one-hot representation, since it can be captured by a binary vector with n bits that are mutually exclusive (only one of them can be active). A symbolic representation is a specific example of the broader class of nondistributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry. + +The following examples of learning algorithms are based on nondistributed representations: + +• Clustering methods, including the k-means algorithm: each input point is assigned to exactly one cluster. + +• k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of k > 1, multiple values describe each input, but they cannot be controlled separately from each other, so this does not qualify as a true distributed representation. + +Figure 15.7: Illustration of how a learning algorithm based on a distributed representation breaks up the input space into regions. In this example, there are three binary features $h_1$, $h_2$, and $h_3$. Each feature is defined by thresholding the output of a learned linear transformation. Each feature divides $\mathbb{R}^2$ into two half-planes. 
Let $h_i^+$ be the set of input points for which $h_i = 1$, and $h_i^-$ be the set of input points for which $h_i = 0$. In this illustration, each line represents the decision boundary for one $h_i$, with the corresponding arrow pointing to the $h_i^+$ side of the boundary. The representation as a whole takes on a unique value at each possible intersection of these half-planes. For example, the representation value $[1, 1, 1]^\top$ corresponds to the region $h_1^+ \cap h_2^+ \cap h_3^+$. Compare this to the non-distributed representations in figure 15.8. In the general case of d input dimensions, a distributed representation divides $\mathbb{R}^d$ by intersecting half-spaces rather than half-planes. The distributed representation with n features assigns unique codes to $O(n^d)$ different regions, while the nearest neighbor algorithm with n examples assigns unique codes to only n regions. The distributed representation is thus able to distinguish exponentially many more regions than the nondistributed one. Keep in mind that not all h values are feasible (there is no $h = 0$ in this example), and that a linear classifier on top of the distributed representation is not able to assign different class identities to every neighboring region; even a deep linear-threshold network has a VC dimension of only O ( w log w ), where w is the number of weights (Sontag, 1998). The combination of a powerful representation layer and a weak classifier layer can be a strong regularizer; a classifier trying to learn the concept of “person” versus “not a person” does not need to assign a different class to an input represented as “woman with glasses” than it assigns to an input represented as “man without glasses.” This capacity constraint encourages each classifier to focus on few $h_i$ and encourages h to learn to represent the classes in a linearly separable way. + +• Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given. 
+ +• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are now associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is represented with multiple values, but those values cannot readily be controlled separately from each other. + +• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each “support vector” or template example is now continuous-valued, the same issue arises as with Gaussian mixtures. + +• Language or translation models based on n-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being $w_1$ and $w_2$, for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible). + +For some of these nondistributed algorithms, the output is not constant by parts but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear. + +An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, “cat” and “dog” are as far from each other as any other two symbols. However, if one associates them with a meaningful distributed representation, then many of the things that can be said about cats can generalize to dogs and vice versa. For example, our distributed representation may contain entries such as “`has_fur`” or “`number_of_legs`” that have the same value for the embedding of both “cat” and “dog.” Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words, as discussed in section 12.4. 
Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations. + +When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm? Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters. Some traditional nondistributed learning algorithms generalize only due to the smoothness assumption, which + +Figure 15.8: Illustration of how the nearest neighbor algorithm breaks up the input space into different regions. The nearest neighbor algorithm provides an example of a learning algorithm based on a nondistributed representation. Different non-distributed algorithms may have different geometry, but they typically break the input space into regions, with a separate set of parameters for each region. The advantage of a nondistributed approach is that, given enough parameters, it can fit the training set without solving a difficult optimization algorithm, because it is straightforward to choose a different output independently for each region. The disadvantage is that such nondistributed models generalize only locally via the smoothness prior, making it difficult to learn a complicated function with more peaks and troughs than the available number of examples. Contrast this with a distributed representation, figure 15.7. + +states that if u ≈ v , then the target function f to be learned has the property that f ( u ) ≈ f ( v ) in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example ( x, y ) for which we know that f ( x ) ≈ y , then we choose an estimator $\hat{f}$ that approximately satisfies these constraints while changing as little as possible when we move to a nearby input x + ε . 
This assumption is clearly very useful, but it suffers from the curse of dimensionality: to learn a target function that increases and decreases many times¹ in many different regions, we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value. However, this does not allow us to generalize to new symbols for new regions. + +If we are lucky, there may be some regularity in the target function, besides being smooth. For example, a convolutional network with max pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space. + +Let us examine a special case of a distributed representation learning algorithm, which extracts binary features by thresholding linear functions of the input. Each binary feature in this representation divides $\mathbb{R}^d$ into a pair of half-spaces, as illustrated in figure 15.7. The exponentially large number of intersections of n of the corresponding half-spaces determines how many regions this distributed representation learner can distinguish. How many regions are generated by an arrangement of n hyperplanes in $\mathbb{R}^d$? By applying a general result concerning the intersection of hyperplanes (Zaslavsky, 1975), one can show (Pascanu et al., 2014b) that the number of regions this binary feature representation can distinguish is + +$$\sum_{j=0}^{d} \binom{n}{j} = O(n^d). \tag{15.4}$$ + +Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units. 
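As a quick numerical check of this count, the sketch below evaluates the sum of binomial coefficients $\sum_{j=0}^{d} \binom{n}{j}$, which is the number of regions cut out by n hyperplanes in general position in d dimensions. The function name `count_regions` and the chosen values of n and d are illustrative assumptions.

```python
from math import comb


def count_regions(n_hyperplanes, input_dim):
    """Number of regions formed by n hyperplanes in general position in R^d."""
    return sum(comb(n_hyperplanes, j) for j in range(input_dim + 1))


# Three lines in the plane, as in the three-feature illustration of figure 15.7.
print(count_regions(3, 2))

# With the input dimension fixed at d = 2, the count grows polynomially in n.
for n in (10, 20, 40):
    print(n, count_regions(n, 2))

# Regions distinguished versus the rough parameter count n * d for n = 20, d = 5.
print(count_regions(20, 5), 20 * 5)
```

For three lines in the plane the sum gives 7 regions, one fewer than the 8 binary codes, which matches the remark in the caption of figure 15.7 that not every code is feasible.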
+ +This provides a geometric argument to explain the generalization power of distributed representation: with $O(nd)$ parameters (for n linear threshold features in $\mathbb{R}^d$), we can distinctly represent $O(n^d)$ regions in input space. If instead we made no assumption at all about the data, and used a representation with one unique symbol for each region, and separate parameters for each symbol to recognize its corresponding portion of $\mathbb{R}^d$, then specifying $O(n^d)$ regions would require $O(n^d)$ examples. More generally, the argument in favor of the distributed representation could be extended to the case where instead of using linear threshold units we use nonlinear, possibly continuous, feature extractors for each of the attributes in the distributed representation. The argument in this case is that if a parametric transformation with k parameters can learn about r regions in input space, with k ≪ r , and if obtaining such a representation was useful to the task of interest, then we could potentially generalize much better in this way than in a nondistributed setting, where we would need O ( r ) examples to obtain the same features and associated partitioning of the input space into r regions. Using fewer parameters to represent the model means that we have fewer parameters to fit, and thus require far fewer training examples to generalize well. + +¹ Potentially, we may want to learn a function whose behavior is distinct in exponentially many regions: in a d-dimensional space with at least 2 different values to distinguish per dimension, we might want f to differ in $2^d$ different regions, requiring $O(2^d)$ training examples. + +A further part of the argument for why models based on distributed representations generalize well is that their capacity remains limited despite being able to distinctly encode so many different regions. 
For example, the VC dimension of a neural network of linear threshold units is only O ( w log w ), where w is the number of weights (Sontag, 1998). This limitation arises because, while we can assign very many unique codes to representation space, we cannot use absolutely all the code space, nor can we learn arbitrary functions mapping from the representation space h to the output y using a linear classifier. The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by h . We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear XOR logic. For example, we typically do not want to partition the data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class. + +The ideas discussed so far have been abstract, but they may be experimentally validated. Zhou et al. (2015) found that hidden units in a deep convolutional network trained on the ImageNet and Places benchmark datasets learn features that are often interpretable, corresponding to a label that humans would naturally assign. In practice it is certainly not always the case that hidden units learn something that has a simple linguistic name, but it is interesting to see this emerge near the top levels of the best computer vision deep networks. What such features have in common is that one could imagine learning about each of them without having to see all the configurations of all the others. Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation. 
Figure 15.9 demonstrates that one direction in representation space corresponds + += - + + +Figure 15.9: A generative model has learned a distributed representation that disentangles the concept of gender from the concept of wearing glasses. If we begin with the representation of the concept of a man with glasses, then subtract the vector representing the concept of a man without glasses, and finally add the vector representing the concept of a woman without glasses, we obtain the vector representing the concept of a woman with glasses. The generative model correctly decodes all these representation vectors to images that may be recognized as belonging to the correct class. Images reproduced with permission from Radford et al. (2015). + +to whether the person is male or female, while another corresponds to whether the person is wearing glasses. These features were discovered automatically, not fixed a priori. There is no need to have labels for the hidden unit classifiers: gradient descent on an objective function of interest naturally learns semantically interesting features, as long as the task requires such features. We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all the configurations of the n − 1 other features by examples covering all these combinations of values. This form of statistical separability is what allows one to generalize to new configurations of a person’s features that have never been seen during training. + +## 15.5 Exponential Gains from Depth + +We have seen in section 6.4.1 that multilayer perceptrons are universal approximators, and that some functions can be represented by exponentially smaller deep networks compared to shallow networks. This decrease in model size leads to improved statistical efficiency. In this section, we describe how similar results apply more generally to other kinds of models with distributed hidden representations. 
+ +In section 15.4, we saw an example of a generative model that learned about + +the explanatory factors underlying images of faces, including the person’s gender and whether they are wearing glasses. The generative model that accomplished this task was based on a deep neural network. It would not be reasonable to expect a shallow network, such as a linear network, to learn the complicated relationship between these abstract explanatory factors and the pixels in the image. In this and other AI tasks, the factors that can be chosen almost independently from each other yet still correspond to meaningful inputs are more likely to be very high level and related in highly nonlinear ways to the input. We argue that this demands deep distributed representations, where the higher level features (seen as functions of the input) or factors (seen as generative causes) are obtained through the composition of many nonlinearities. + +It has been proved in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation. Many kinds of networks (e.g., with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with a single hidden layer can be shown to be universal approximators. A model family that is a universal approximator can approximate a large class of functions (including all continuous functions) up to any nonzero tolerance level, given enough hidden units. However, the required number of hidden units may be very large. Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth k , but that would require an exponential number of hidden units (with respect to the input size) with insufficient depth (depth 2 or depth k − 1). 
+ +In section 6.4.1, we saw that deterministic feedforward networks are universal approximators of functions. Many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al., 2013). + +In section 6.4.1, we saw that a sufficiently deep feedforward network can have an exponential advantage over a network that is too shallow. Such results can also be obtained for other models such as probabilistic models. One such probabilistic model is the sum-product network , or SPN (Poon and Domingos, 2011). These models use polynomial circuits to compute the probability distribution over a set of random variables. Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model. Later, Martens and Medabalimi (2014) + +showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power. + +Another interesting development is a set of theoretical results for the expressive power of families of deep circuits related to convolutional nets, highlighting an exponential advantage for the deep circuit even when the shallow circuit is allowed to only approximate the function computed by the deep circuit (Cohen et al., 2015). By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions. + +## 15.6 Providing Clues to Discover Underlying Causes + +To close this chapter, we come back to one of our original questions: what makes one representation better than another? 
One answer, first introduced in section 15.3, is that an ideal representation is one that disentangles the underlying causal factors of variation that generated the data, especially those factors that are relevant to our applications. Most strategies for representation learning are based on introducing clues that help the learning find these underlying factors of variations. The clues can help the learner separate these observed factors from the others. Supervised learning provides a very strong clue: a label y , presented with each x , that usually specifies the value of at least one of the factors of variation directly. More generally, to make use of abundant unlabeled data, representation learning makes use of other, less direct hints about the underlying factors. These hints take the form of implicit prior beliefs that we, the designers of the learning algorithm, impose in order to guide the learner. Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization. While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve. + +We provide here a list of these generic regularization strategies. The list is clearly not exhaustive but gives some concrete examples of how learning algorithms can be encouraged to discover features that correspond to underlying factors. This list was introduced in section 3.1 of Bengio et al. (2013d) and has been partially expanded here. + +• Smoothness: This is the assumption that f ( x + εd ) ≈ f ( x ) for unit d and small ε . This assumption allows the learner to generalize from training + +examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality. 
+ +• Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions—linear functions with large weights applied to high-dimensional spaces may not be very smooth. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption. + +• Multiple explanatory factors: Many representation learning algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Section 15.3 describes how this view motivates semisupervised learning via representation learning. Learning the structure of p ( x ) requires learning some of the same features that are useful for modeling p ( y | x ) because both refer to the same underlying explanatory factors. Section 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation. + +• Causal factors: The model is constructed in such a way that it treats the factors of variation described by the learned representation h as the causes of the observed data x , and not vice versa. As discussed in section 15.3, this is advantageous for semi-supervised learning and makes the learned model more robust when the distribution over the underlying causes changes or when we use the model for a new task. + +• Depth, or a hierarchical organization of explanatory factors: High-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy. 
From another point of view, the use of a deep architecture expresses our belief that the task should be accomplished via a multistep program, with each step referring back to the output of the processing accomplished via previous steps. + +• Shared factors across tasks: When we have many tasks corresponding to different $y_i$ variables sharing the same input x, or when each task is associated with a subset or a function $f^{(i)}(x)$ of a global input x, the assumption is that each $y_i$ is associated with a different subset from a common pool of relevant factors h. Because these subsets overlap, learning all the $P(y_i \mid x)$ via a shared intermediate representation $P(h \mid x)$ allows sharing of statistical strength between the tasks. + +• Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data live. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold. + +• Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier and adversarial training. + +• Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values.
See section 13.3 for further description of this approach. + +• Sparsity: Most features should presumably not be relevant to describing most inputs: there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as “present” or “absent” should be absent most of the time. + +• Simplicity of factor dependencies: In good high-level representations, the factors are related to each other through simple dependencies. The simplest possible is marginal independence, $P(h) = \prod_i P(h_i)$, but linear dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation. + +The concept of representation learning ties together all the many forms of deep learning. Feedforward and recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting avenue of research. \ No newline at end of file
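
The remark in the linearity item above, that linear functions with large weights applied to high-dimensional spaces may not be very smooth, can be checked numerically. The following is a minimal sketch and is not taken from the book: the function name `linear_response_change`, the random ±1 weight pattern, and the choice of the unit direction d = w/‖w‖ are illustrative assumptions. It measures |f(x + εd) − f(x)| for f(x) = w·x, which is the quantity the smoothness prior assumes to be small.

```python
import math
import random


def linear_response_change(n_dims, weight_scale, eps, seed=0):
    """Return |f(x + eps*d) - f(x)| for the linear map f(x) = w . x.

    Assumptions made for this illustration only:
    - every weight has magnitude `weight_scale` and a random sign;
    - d is the unit vector aligned with w (the direction of largest change).
    """
    rng = random.Random(seed)
    w = [weight_scale * rng.choice((-1.0, 1.0)) for _ in range(n_dims)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]

    # Normalize w to obtain a unit-length perturbation direction d.
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    d = [wi / norm_w for wi in w]

    f_x = sum(wi * xi for wi, xi in zip(w, x))
    f_shifted = sum(wi * (xi + eps * di) for wi, xi, di in zip(w, x, d))
    return abs(f_shifted - f_x)


if __name__ == "__main__":
    # Same eps, same weight magnitude, different input dimensionality.
    for n in (100, 10_000):
        print(n, linear_response_change(n, weight_scale=1.0, eps=0.01))
```

For this f the change equals ε‖w‖₂ exactly, which is ε·c·√n when every weight has magnitude c. With ε held fixed, the output shift therefore grows with both the weight scale and the input dimensionality, so a linear model need not satisfy f(x + εd) ≈ f(x).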