






Artificial Intelligence. Lectures Notes of Machine Learning. Prof. Andrew Ng - Stanford University - Contents: Learning Theory
Typology: Study notes







When talking about linear regression, we discussed the problem of whether to fit a “simple” model such as the linear $y = \theta_0 + \theta_1 x$, or a more “complex” model such as the polynomial $y = \theta_0 + \theta_1 x + \cdots + \theta_5 x^5$. We saw the following example:
[Figure: three panels plotting y against x for the same training data, fit by a linear model (leftmost), a quadratic (middle), and a 5th order polynomial (rightmost).]
Fitting a 5th order polynomial to the data (rightmost figure) did not result in a good model. Specifically, even though the 5th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in the training set. In other words, what has been learned from the training set does not generalize well to other houses. The generalization error (which will be made formal shortly) of a hypothesis is its expected error on examples not necessarily in the training set.

Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems that the two models suffer from are very different. If the relationship between y and x is not linear, then even if we were fitting a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may underfit (i.e., fail to capture structure exhibited by) the data.

Apart from bias, there’s a second component to the generalization error, consisting of the variance of a model fitting procedure. Specifically, when fitting a 5th order polynomial as in the rightmost figure, there is a large risk that we’re fitting patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these “spurious” patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.^1

Often, there is a tradeoff between bias and variance. If our model is too “simple” and has very few parameters, then it may have large bias (but small variance); if it is too “complex” and has very many parameters, then it may suffer from large variance (but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or a fifth order polynomial.
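The contrast between the linear, quadratic, and 5th order fits can be explored numerically. The following is a minimal sketch only: the data source (a square-root curve plus Gaussian noise), the sample sizes, and the helper names `fit_poly`, `predict`, `mse`, and `noisy_targets` are all assumptions made for illustration and are not taken from the notes. It fits each degree by least squares to seven training points and reports the error on the training set and on a large fresh sample.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first), pure Python."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y for the design matrix A[i][j] = x_i ** j.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for c in range(col, n):
                ata[row][c] -= factor * ata[col][c]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(ata[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (aty[i] - tail) / ata[i][i]
    return coef

def predict(coef, x):
    return sum(c * x ** p for p, c in enumerate(coef))

def mse(coef, xs, ys):
    return sum((predict(coef, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def noisy_targets(xs, rng):
    """Assumed data source (illustrative): y = 1 + 3*sqrt(x) + Gaussian noise."""
    return [1.0 + 3.0 * x ** 0.5 + rng.gauss(0.0, 0.3) for x in xs]

rng = random.Random(0)
train_x = [i / 6 for i in range(7)]              # small, fixed training set
train_y = noisy_targets(train_x, rng)
test_x = [rng.random() for _ in range(1000)]     # stand-in for unseen examples
test_y = noisy_targets(test_x, rng)

errors = {}
for degree in (1, 2, 5):
    coef = fit_poly(train_x, train_y, degree)
    errors[degree] = (mse(coef, train_x, train_y), mse(coef, test_x, test_y))
    print(degree, "train MSE:", errors[degree][0], "test MSE:", errors[degree][1])
```

Because the polynomial classes are nested, training error cannot increase with the degree; what happens to the error on fresh data depends on the draw, which is the point of the variance discussion above.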
2 Preliminaries
In this set of notes, we begin our foray into learning theory. Apart from being interesting and enlightening in its own right, this discussion will also help us hone our intuitions and derive rules of thumb about how to best apply learning algorithms in different settings.

We will also seek to answer a few questions: First, can we make formal the bias/variance tradeoff that was just discussed? This will also eventually lead us to talk about model selection methods, which can, for instance, automatically decide what order polynomial to fit to a training set. Second, in machine learning it’s really
(^1) In these notes, we will not try to formalize the definitions of bias and variance beyond this discussion. While bias and variance are straightforward to define formally for, e.g., linear regression, there have been several proposals for the definitions of bias and variance for classification, and there is as yet no agreement on what is the “right” and/or the most useful formalism.
This is just the fraction of training examples that $h$ misclassifies. When we want to make explicit the dependence of $\hat{\varepsilon}(h)$ on the training set $S$, we may also write this as $\hat{\varepsilon}_S(h)$. We also define the generalization error to be
$$\varepsilon(h) = P_{(x,y)\sim D}\bigl(h(x) \neq y\bigr).$$
I.e., this is the probability that, if we now draw a new example $(x, y)$ from the distribution $D$, $h$ will misclassify it.

Note that we have assumed that the training data was drawn from the same distribution $D$ with which we’re going to evaluate our hypotheses (in the definition of generalization error). This is sometimes also referred to as one of the PAC assumptions.^2

Consider the setting of linear classification, and let $h_\theta(x) = 1\{\theta^T x \geq 0\}$. What’s a reasonable way of fitting the parameters $\theta$? One approach is to try to minimize the training error, and pick
$$\hat{\theta} = \arg\min_{\theta} \hat{\varepsilon}(h_\theta).$$
We call this process empirical risk minimization (ERM), and the resulting hypothesis output by the learning algorithm is $\hat{h} = h_{\hat{\theta}}$. We think of ERM as the most “basic” learning algorithm, and it will be this algorithm that we focus on in these notes. (Algorithms such as logistic regression can also be viewed as approximations to empirical risk minimization.)

In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we’re using a linear classifier. We define the hypothesis class $H$ used by a learning algorithm to be the set of all classifiers considered by it. For linear classification, $H = \{h_\theta : h_\theta(x) = 1\{\theta^T x \geq 0\},\ \theta \in \mathbb{R}^{n+1}\}$ is thus the set of all classifiers over $X$ (the domain of the inputs) where the decision boundary is linear. More broadly, if we were studying, say, neural networks, then we could let $H$ be the set of all classifiers representable by some neural network architecture.

Empirical risk minimization can now be thought of as a minimization over the class of functions $H$, in which the learning algorithm picks the hypothesis:
$$\hat{h} = \arg\min_{h \in H} \hat{\varepsilon}(h)$$
(^2) PAC stands for “probably approximately correct,” which is a framework and set of assumptions under which numerous results on learning theory were proved. Of these, the assumption of training and testing on the same distribution, and the assumption of the independently drawn training examples, were the most important.
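Empirical risk minimization over a hypothesis class can be sketched in a few lines. Everything specific here is an assumption chosen for illustration, not something from the notes: the class is a small set of one-dimensional threshold classifiers, and the distribution $D$ is a made-up one with noisy labels. The names `threshold_classifier`, `training_error`, `draw_sample`, and `erm` are hypothetical helpers.

```python
import random

def threshold_classifier(t):
    """Hypothesis h_t(x) = 1{x >= t} on one-dimensional inputs."""
    return lambda x: 1 if x >= t else 0

def training_error(h, sample):
    """Fraction of examples (x, y) in the sample that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def draw_sample(m, rng, true_threshold=0.6, flip_prob=0.1):
    """Draw m iid examples from an assumed distribution D (illustrative only)."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x >= true_threshold else 0
        if rng.random() < flip_prob:   # label noise
            y = 1 - y
        sample.append((x, y))
    return sample

def erm(hypotheses, sample):
    """Return the index of a hypothesis with the smallest training error."""
    errors = [training_error(h, sample) for h in hypotheses]
    return min(range(len(hypotheses)), key=lambda i: errors[i])

rng = random.Random(0)
thresholds = [i / 10 for i in range(11)]       # a small class of 11 hypotheses
H = [threshold_classifier(t) for t in thresholds]
train = draw_sample(50, rng)
best = erm(H, train)

# A large fresh sample from the same D approximates the generalization error.
fresh = draw_sample(20000, rng)
print("chosen threshold:", thresholds[best])
print("training error:", training_error(H[best], train))
print("estimated generalization error:", training_error(H[best], fresh))
```

The fresh sample is drawn from the same distribution as the training set, which mirrors the PAC assumption described above; the estimate on it is only a Monte Carlo approximation of $\varepsilon(\hat{h})$.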
3 The case of finite H
Let’s start by considering a learning problem in which we have a finite hypothesis class $H = \{h_1, \ldots, h_k\}$ consisting of $k$ hypotheses. Thus, $H$ is just a set of $k$ functions mapping from $X$ to $\{0, 1\}$, and empirical risk minimization selects $\hat{h}$ to be whichever of these $k$ functions has the smallest training error.

We would like to give guarantees on the generalization error of $\hat{h}$. Our strategy for doing so will be in two parts: First, we will show that $\hat{\varepsilon}(h)$ is a reliable estimate of $\varepsilon(h)$ for all $h$. Second, we will show that this implies an upper bound on the generalization error of $\hat{h}$.

Take any one, fixed, $h_i \in H$. Consider a Bernoulli random variable $Z$ whose distribution is defined as follows. We’re going to sample $(x, y) \sim D$. Then, we set $Z = 1\{h_i(x) \neq y\}$. I.e., we’re going to draw one example, and let $Z$ indicate whether $h_i$ misclassifies it. Similarly, we also define $Z_j = 1\{h_i(x^{(j)}) \neq y^{(j)}\}$. Since our training set was drawn iid from $D$, $Z$ and the $Z_j$’s have the same distribution.

We see that the misclassification probability on a randomly drawn example, that is, $\varepsilon(h_i)$, is exactly the expected value of $Z$ (and $Z_j$). Moreover, the training error can be written
$$\hat{\varepsilon}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j.$$
Thus, $\hat{\varepsilon}(h_i)$ is exactly the mean of the $m$ random variables $Z_j$ that are drawn iid from a Bernoulli distribution with mean $\varepsilon(h_i)$. Hence, we can apply the Hoeffding inequality, and obtain
$$P\bigl(|\varepsilon(h_i) - \hat{\varepsilon}(h_i)| > \gamma\bigr) \leq 2\exp(-2\gamma^2 m).$$
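The Hoeffding bound for a single fixed hypothesis can be compared against a simulation. This is a rough sketch under stated assumptions: the value 0.3 used for $\varepsilon(h_i)$, the choices of $m$ and $\gamma$, and the helper names `hoeffding_bound` and `deviation_frequency` are all hypothetical. Each trial draws $m$ Bernoulli variables $Z_j$ and records whether their mean deviates from the true mean by more than $\gamma$.

```python
import math
import random

def hoeffding_bound(gamma, m):
    """Right-hand side of the bound: 2 * exp(-2 * gamma^2 * m)."""
    return 2.0 * math.exp(-2.0 * gamma ** 2 * m)

def deviation_frequency(true_error, m, gamma, trials, rng):
    """Fraction of simulated training sets whose empirical error is off by more than gamma."""
    count = 0
    for _ in range(trials):
        # Each Z_j is Bernoulli with mean true_error; their average is the training error.
        empirical = sum(1 for _ in range(m) if rng.random() < true_error) / m
        if abs(empirical - true_error) > gamma:
            count += 1
    return count / trials

rng = random.Random(0)
bound = hoeffding_bound(0.1, 100)
observed = deviation_frequency(0.3, 100, 0.1, 2000, rng)
print("Hoeffding bound:", bound)
print("observed frequency of large deviations:", observed)
```

The bound holds for any Bernoulli mean, so it is usually loose for a particular one; the observed frequency should sit well below it.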
The Hoeffding bound above shows that, for our particular $h_i$, training error will be close to generalization error with high probability, assuming $m$ is large. But we don’t want to guarantee that $\varepsilon(h_i)$ will be close to $\hat{\varepsilon}(h_i)$ (with high probability) for only one particular $h_i$. We want to prove that this will be true simultaneously for all $h \in H$. To do so, let $A_i$ denote the event that $|\varepsilon(h_i) - \hat{\varepsilon}(h_i)| > \gamma$. We’ve already shown that, for any particular $A_i$, it holds true that $P(A_i) \leq 2\exp(-2\gamma^2 m)$. Thus, using the union bound, we
Similarly, we can also hold $m$ and $\delta$ fixed and solve for $\gamma$ in the previous equation, and show [again, convince yourself that this is right!] that with probability at least $1 - \delta$, we have that for all $h \in H$,

$$|\hat{\varepsilon}(h) - \varepsilon(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$
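To get a feel for the size of this uniform convergence margin, one can evaluate $\gamma = \sqrt{\log(2k/\delta)/(2m)}$ directly, taking the logarithm to be natural as in the Hoeffding bound. The sketch below is illustrative only; the function name `uniform_convergence_gamma` and the sample values of $m$, $k$, and $\delta$ are assumptions.

```python
import math

def uniform_convergence_gamma(m, k, delta):
    """Margin gamma such that 2 * k * exp(-2 * gamma^2 * m) equals delta."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

# Example values (chosen arbitrarily): 100 hypotheses, 95% confidence.
for m in (100, 1000, 10000):
    print(m, uniform_convergence_gamma(m, k=100, delta=0.05))
```

The margin shrinks like $1/\sqrt{m}$ and grows only logarithmically in $k$, so quadrupling the training set halves it.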
Now, let’s assume that uniform convergence holds, i.e., that $|\varepsilon(h) - \hat{\varepsilon}(h)| \leq \gamma$ for all $h \in H$. What can we prove about the generalization of our learning algorithm that picked $\hat{h} = \arg\min_{h \in H} \hat{\varepsilon}(h)$?

Define $h^* = \arg\min_{h \in H} \varepsilon(h)$ to be the best possible hypothesis in $H$. Note that $h^*$ is the best that we could possibly do given that we are using $H$, so it makes sense to compare our performance to that of $h^*$. We have:
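$$\begin{aligned}
\varepsilon(\hat{h}) &\leq \hat{\varepsilon}(\hat{h}) + \gamma \\
&\leq \hat{\varepsilon}(h^*) + \gamma \\
&\leq \varepsilon(h^*) + 2\gamma
\end{aligned}$$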
The first inequality used the fact that $|\varepsilon(\hat{h}) - \hat{\varepsilon}(\hat{h})| \leq \gamma$ (by our uniform convergence assumption). The second used the fact that $\hat{h}$ was chosen to minimize $\hat{\varepsilon}(h)$, and hence $\hat{\varepsilon}(\hat{h}) \leq \hat{\varepsilon}(h)$ for all $h$, and in particular $\hat{\varepsilon}(\hat{h}) \leq \hat{\varepsilon}(h^*)$. The third inequality used the uniform convergence assumption again, to show that $\hat{\varepsilon}(h^*) \leq \varepsilon(h^*) + \gamma$. So, what we’ve shown is the following: If uniform convergence occurs, then the generalization error of $\hat{h}$ is at most $2\gamma$ worse than the best possible hypothesis in $H$! Let’s put all this together into a theorem.
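Theorem. Let $|H| = k$, and let any $m, \delta$ be fixed. Then with probability at least $1 - \delta$, we have that

$$\varepsilon(\hat{h}) \leq \left(\min_{h \in H} \varepsilon(h)\right) + 2\sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$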
This is proved by letting $\gamma$ equal the $\sqrt{\cdot}$ term, using our previous argument that uniform convergence occurs with probability at least $1 - \delta$, and then noting that uniform convergence implies $\varepsilon(\hat{h})$ is at most $2\gamma$ higher than $\varepsilon(h^*) = \min_{h \in H} \varepsilon(h)$ (as we showed previously).

This also quantifies what we were saying previously about the bias/variance tradeoff in model selection. Specifically, suppose we have some hypothesis class $H$, and are considering switching to some much larger hypothesis class $H' \supseteq H$. If we switch to $H'$, then the first term $\min_h \varepsilon(h)$ can only decrease (since we’d then be taking a min over a larger set of functions). Hence, by learning using a larger hypothesis class, our “bias” can only decrease. However, if $k$ increases, then the second $2\sqrt{\cdot}$ term would also increase. This increase corresponds to our “variance” increasing when we use a larger hypothesis class.

By holding $\gamma$ and $\delta$ fixed and solving for $m$ like we did before, we can also obtain the following sample complexity bound:
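Corollary. Let $|H| = k$, and let any $\delta, \gamma$ be fixed. Then for $\varepsilon(\hat{h}) \leq \min_{h \in H} \varepsilon(h) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that

$$m \geq \frac{1}{2\gamma^2} \log \frac{2k}{\delta} = O\left(\frac{1}{\gamma^2} \log \frac{k}{\delta}\right).$$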
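The corollary translates directly into a calculation of how many training examples suffice. The sketch below assumes a natural logarithm (consistent with the Hoeffding bound) and rounds up to a whole number of examples; the function name `sample_complexity` and the example values are hypothetical.

```python
import math

def sample_complexity(gamma, k, delta):
    """Smallest integer m with m >= (1 / (2 * gamma^2)) * log(2k / delta)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

# Example values (chosen arbitrarily): growing k by a factor of 1000 each time.
for k in (10, 10_000, 10_000_000):
    print(k, sample_complexity(gamma=0.1, k=k, delta=0.05))
```

Since $k$ enters only through a logarithm, multiplying the number of hypotheses by a constant factor adds a fixed number of examples rather than multiplying the requirement.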
4 The case of infinite H
We have proved some useful theorems for the case of finite hypothesis classes. But many hypothesis classes, including any parameterized by real numbers (as in linear classification), actually contain an infinite number of functions. Can we prove similar results for this setting?

Let’s start by going through something that is not the “right” argument. Better and more general arguments exist, but this will be useful for honing our intuitions about the domain.

Suppose we have an $H$ that is parameterized by $d$ real numbers. Since we are using a computer to represent real numbers, and IEEE double-precision floating point (doubles in C) uses 64 bits to represent a floating point number, this means that our learning algorithm, assuming we’re using double-precision floating point, is parameterized by $64d$ bits. Thus, our hypothesis class really consists of at most $k = 2^{64d}$ different hypotheses. From the Corollary at the end of the previous section, we therefore find that, to guarantee $\varepsilon(\hat{h}) \leq \varepsilon(h^*) + 2\gamma$ with probability at least $1 - \delta$, it suffices that

$$m \geq O\left(\frac{1}{\gamma^2} \log \frac{2^{64d}}{\delta}\right) = O\left(\frac{d}{\gamma^2} \log \frac{1}{\delta}\right) = O_{\gamma,\delta}(d).$$

(The $\gamma, \delta$ subscripts indicate that the last big-O is hiding constants that may depend on $\gamma$ and $\delta$.) Thus, the number of training examples needed is at most linear in the number of parameters of the model.

The fact that we relied on 64-bit floating point makes this argument not entirely satisfying, but the conclusion is nonetheless roughly correct: If what we’re going to do is try to minimize training error, then in order to learn
see that, for any of the eight possible labelings of these points, we can find a linear classifier that obtains “zero training error” on them:
[Figure: eight panels, each with axes $x_1$ and $x_2$, showing the eight possible labelings of the three points, each one separated by a linear classifier.]
Moreover, it is possible to show that there is no set of 4 points that this hypothesis class can shatter. Thus, the largest set that H can shatter is of size 3, and hence VC(H) = 3. Note that the VC dimension of H here is 3 even though there may be sets of size 3 that it cannot shatter. For instance, if we had a set of three points lying in a straight line (left figure), then there is no way to find a linear separator for the labeling of the three points shown below (right figure):
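[Figure: two panels with axes $x_1$ and $x_2$. Left: three points lying on a straight line. Right: a labeling of those three points that no linear classifier can separate.]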
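The shattering claims for linear classifiers in two dimensions can be probed with a small brute-force search. This is only a sketch under explicit assumptions: it searches parameter vectors $\theta = (\theta_0, \theta_1, \theta_2)$ on a small integer grid, and the example points and the helper names `linear_classifier`, `realized_labelings`, and `shattered_on_grid` are hypothetical. Finding all $2^n$ labelings on the grid does confirm that a set is shattered; failing to find one is not by itself a proof, although for collinear points and for the corners of a square the negative result agrees with the argument in the text.

```python
from itertools import product

def linear_classifier(theta, point):
    """h_theta(x) = 1{theta0 + theta1 * x1 + theta2 * x2 >= 0}."""
    t0, t1, t2 = theta
    x1, x2 = point
    return 1 if t0 + t1 * x1 + t2 * x2 >= 0 else 0

def realized_labelings(points, grid=(-2, -1, 0, 1, 2)):
    """Set of labelings of the points produced by some theta on the grid."""
    return {
        tuple(linear_classifier(theta, p) for p in points)
        for theta in product(grid, repeat=3)
    }

def shattered_on_grid(points):
    """True if every one of the 2^n labelings is realized by a grid classifier."""
    return len(realized_labelings(points)) == 2 ** len(points)

triangle = [(0, 0), (1, 0), (0, 1)]          # three points not on a line
collinear = [(0, 0), (1, 1), (2, 2)]         # three points on a line
square = [(0, 0), (1, 0), (0, 1), (1, 1)]    # four points

print("triangle shattered:", shattered_on_grid(triangle))
print("collinear shattered:", shattered_on_grid(collinear))
print("square shattered:", shattered_on_grid(square))
```

For the collinear points, the labeling that marks the two outer points 1 and the middle point 0 is never realized, which matches the situation described for the straight-line example.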