Learning Theory - Lectures Notes - 6, Study notes of Machine Learning

Stanford University Machine Learning

Artificial Intelligence. Lectures Notes of Machine Learning. Prof. Andrew Ng - Stanford University - Contents: Learning Theory

Typology: Study notes

2010/2011

Uploaded on 10/30/2011

ilyastrab 🇺🇸


CS229 Lecture notes

Andrew Ng

Part VI

Learning Theory

1 Bias/variance tradeoff

When talking about linear regression, we discussed the problem of whether to fit a “simple” model such as the linear $y = \theta_0 + \theta_1 x$, or a more “complex” model such as the polynomial $y = \theta_0 + \theta_1 x + \cdots + \theta_5 x^5$. We saw the following example:

[Figure: three plots of y against x (x from 0 to 7, y from 0 to 4.5) showing fits of increasing complexity to the same training set; the leftmost is the linear fit and the rightmost the 5th order polynomial fit.]

Fitting a 5th order polynomial to the data (rightmost figure) did not result in a good model. Specifically, even though the 5th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in the training set. In other words, what has been learned from the training set does not generalize well to other houses. The generalization error (which will be made formal shortly) of a hypothesis is its expected error on examples not necessarily in the training set.

Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems that the two models suffer from are very different. If the relationship between y and x is not linear,
then even if we were fitting a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may underfit (i.e., fail to capture structure exhibited by) the data.

Apart from bias, there’s a second component to the generalization error, consisting of the variance of a model fitting procedure. Specifically, when fitting a 5th order polynomial as in the rightmost figure, there is a large risk that we’re fitting patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these “spurious” patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.¹

Often, there is a tradeoff between bias and variance. If our model is too “simple” and has very few parameters, then it may have large bias (but small variance); if it is too “complex” and has very many parameters, then it may suffer from large variance (but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or a fifth order polynomial.
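As a rough illustration of the tradeoff, the following sketch fits polynomials of order 1, 2 and 5 to a small noisy sample and compares training error with error on fresh data. Everything in it is an assumption made for illustration: the quadratic ground truth, the noise level, the sample sizes, and the helper names `fit_polynomial`, `predict` and `mse` are not from the notes. The fit uses plain least squares via the normal equations, which is adequate only for tiny problems like this one.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ...] of c0 + c1*x + ...; adequate only
    for tiny, well-scaled problems such as this illustration.
    """
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum x^(i+j), b[i] = sum y*x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_f(x):
    # Hypothetical ground truth: a quadratic relationship between x and y.
    return 1.0 + 2.0 * x - 1.5 * x ** 2

rng = random.Random(0)
train_x = [i / 6 for i in range(7)]                  # 7 training inputs in [0, 1]
train_y = [true_f(x) + rng.gauss(0, 0.1) for x in train_x]
test_x = [rng.random() for _ in range(500)]          # fresh examples
test_y = [true_f(x) + rng.gauss(0, 0.1) for x in test_x]

train_err, test_err = {}, {}
for degree in (1, 2, 5):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_err[degree] = mse(coeffs, train_x, train_y)
    test_err[degree] = mse(coeffs, test_x, test_y)
    print(degree, train_err[degree], test_err[degree])
```

The training error can only go down as the order increases, because each polynomial class contains the lower-order ones. Whether the error on fresh data follows the pattern described above depends on the particular sample drawn.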

2 Preliminaries

In this set of notes, we begin our foray into learning theory. Apart from being interesting and enlightening in its own right, this discussion will also help us hone our intuitions and derive rules of thumb about how to best apply learning algorithms in different settings. We will also seek to answer a few questions: First, can we make formal the bias/variance tradeoff that was just discussed? This will also eventually lead us to talk about model selection methods, which can, for instance, automatically decide what order polynomial to fit to a training set. Second, in machine learning it’s really

¹ In these notes, we will not try to formalize the definitions of bias and variance beyond this discussion. While bias and variance are straightforward to define formally for, e.g., linear regression, there have been several proposals for the definitions of bias and variance for classification, and there is as yet no agreement on what is the “right” and/or the most useful formalism.

This is just the fraction of training examples that $h$ misclassifies. When we want to make explicit the dependence of $\hat{\varepsilon}(h)$ on the training set $S$, we may also write this as $\hat{\varepsilon}_S(h)$. We also define the generalization error to be

$$\varepsilon(h) = P_{(x,y) \sim \mathcal{D}}(h(x) \neq y).$$

I.e., this is the probability that, if we now draw a new example $(x, y)$ from the distribution $\mathcal{D}$, $h$ will misclassify it.

Note that we have assumed that the training data was drawn from the same distribution $\mathcal{D}$ with which we’re going to evaluate our hypotheses (in the definition of generalization error). This is sometimes also referred to as one of the PAC assumptions.²

Consider the setting of linear classification, and let $h_\theta(x) = 1\{\theta^T x \geq 0\}$. What’s a reasonable way of fitting the parameters $\theta$? One approach is to try to minimize the training error, and pick

$$\hat{\theta} = \arg\min_{\theta} \hat{\varepsilon}(h_\theta).$$

We call this process empirical risk minimization (ERM), and the resulting hypothesis output by the learning algorithm is $\hat{h} = h_{\hat{\theta}}$. We think of ERM as the most “basic” learning algorithm, and it will be this algorithm that we focus on in these notes. (Algorithms such as logistic regression can also be viewed as approximations to empirical risk minimization.)

In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we’re using a linear classifier. We define the hypothesis class $\mathcal{H}$ used by a learning algorithm to be the set of all classifiers considered by it. For linear classification, $\mathcal{H} = \{h_\theta : h_\theta(x) = 1\{\theta^T x \geq 0\}, \theta \in \mathbb{R}^{n+1}\}$ is thus the set of all classifiers over $\mathcal{X}$ (the domain of the inputs) where the decision boundary is linear. More broadly, if we were studying, say, neural networks, then we could let $\mathcal{H}$ be the set of all classifiers representable by some neural network architecture.

Empirical risk minimization can now be thought of as a minimization over the class of functions $\mathcal{H}$, in which the learning algorithm picks the hypothesis:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}(h).$$
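A minimal sketch of empirical risk minimization over a small, explicitly enumerated hypothesis class follows. The threshold classifiers and the toy training set are hypothetical, chosen only to make the definitions of $\hat{\varepsilon}(h)$ and $\hat{h}$ concrete; they do not come from the notes.

```python
def training_error(h, S):
    """Fraction of examples (x, y) in S that hypothesis h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(hypotheses, S):
    """Return the hypothesis with the smallest training error on S."""
    return min(hypotheses, key=lambda h: training_error(h, S))

def make_threshold(t):
    # Hypothetical hypothesis class: h_t(x) = 1{x >= t} for a scalar input x.
    return lambda x: 1 if x >= t else 0

thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]
H = [make_threshold(t) for t in thresholds]

# Toy training set of (x, y) pairs; labels switch from 0 to 1 around x = 2.
S = [(0.5, 0), (1.2, 0), (1.8, 0), (2.3, 1), (3.1, 1), (3.7, 1)]

errors = [training_error(h, S) for h in H]   # training error of each hypothesis
h_hat = erm(H, S)                            # the ERM choice among the five
```

Here the minimization is a plain search over a finite list. For a class indexed by real-valued parameters, such as linear classifiers, the same objective would have to be minimized by some other procedure.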

² PAC stands for “probably approximately correct,” which is a framework and set of assumptions under which numerous results on learning theory were proved. Of these, the assumption of training and testing on the same distribution, and the assumption of the independently drawn training examples, were the most important.

3 The case of finite H

Let’s start by considering a learning problem in which we have a finite hypothesis class $\mathcal{H} = \{h_1, \ldots, h_k\}$ consisting of $k$ hypotheses. Thus, $\mathcal{H}$ is just a set of $k$ functions mapping from $\mathcal{X}$ to $\{0, 1\}$, and empirical risk minimization selects $\hat{h}$ to be whichever of these $k$ functions has the smallest training error.

We would like to give guarantees on the generalization error of $\hat{h}$. Our strategy for doing so will be in two parts: First, we will show that $\hat{\varepsilon}(h)$ is a reliable estimate of $\varepsilon(h)$ for all $h$. Second, we will show that this implies an upper-bound on the generalization error of $\hat{h}$.

Take any one, fixed, $h_i \in \mathcal{H}$. Consider a Bernoulli random variable $Z$ whose distribution is defined as follows. We’re going to sample $(x, y) \sim \mathcal{D}$. Then, we set $Z = 1\{h_i(x) \neq y\}$. I.e., we’re going to draw one example, and let $Z$ indicate whether $h_i$ misclassifies it. Similarly, we also define $Z_j = 1\{h_i(x^{(j)}) \neq y^{(j)}\}$. Since our training set was drawn iid from $\mathcal{D}$, $Z$ and the $Z_j$’s have the same distribution.

We see that the misclassification probability on a randomly drawn example, that is, $\varepsilon(h_i)$, is exactly the expected value of $Z$ (and $Z_j$). Moreover, the training error can be written

$$\hat{\varepsilon}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j.$$

Thus, $\hat{\varepsilon}(h_i)$ is exactly the mean of the $m$ random variables $Z_j$ that are drawn iid from a Bernoulli distribution with mean $\varepsilon(h_i)$. Hence, we can apply the Hoeffding inequality, and obtain

$$P(|\varepsilon(h_i) - \hat{\varepsilon}(h_i)| > \gamma) \leq 2 \exp(-2\gamma^2 m).$$
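The Hoeffding bound above can be checked empirically with a small simulation. The values chosen below for the true error, $m$, $\gamma$ and the number of trials are arbitrary choices for illustration, and `deviation_frequency` is a hypothetical helper name, not something from the notes.

```python
import math
import random

def hoeffding_bound(gamma, m):
    """Upper bound 2*exp(-2*gamma^2*m) on P(|eps - eps_hat| > gamma)."""
    return 2.0 * math.exp(-2.0 * gamma ** 2 * m)

def deviation_frequency(eps, m, gamma, trials, rng):
    """Estimate P(|eps - eps_hat| > gamma) by repeated sampling.

    Each trial draws m iid Bernoulli(eps) variables Z_j (1 = misclassified)
    and compares their mean eps_hat with the true mean eps.
    """
    exceed = 0
    for _trial in range(trials):
        eps_hat = sum(1 for _j in range(m) if rng.random() < eps) / m
        if abs(eps - eps_hat) > gamma:
            exceed += 1
    return exceed / trials

rng = random.Random(0)
freq = deviation_frequency(eps=0.3, m=100, gamma=0.1, trials=2000, rng=rng)
bound = hoeffding_bound(gamma=0.1, m=100)
print(freq, bound)
```

Because the bound holds for any Bernoulli mean, the observed frequency for one particular mean will usually sit well below it.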

This shows that, for our particular $h_i$, training error will be close to generalization error with high probability, assuming $m$ is large. But we don’t just want to guarantee that $\varepsilon(h_i)$ will be close to $\hat{\varepsilon}(h_i)$ (with high probability) for only one particular $h_i$. We want to prove that this will be true simultaneously for all $h \in \mathcal{H}$. To do so, let $A_i$ denote the event that $|\varepsilon(h_i) - \hat{\varepsilon}(h_i)| > \gamma$. We’ve already shown that, for any particular $A_i$, it holds true that $P(A_i) \leq 2\exp(-2\gamma^2 m)$. Thus, using the union bound, we

Similarly, we can also hold $m$ and $\delta$ fixed and solve for $\gamma$ in the previous equation, and show [again, convince yourself that this is right!] that with probability $1 - \delta$, we have that for all $h \in \mathcal{H}$,

$$|\hat{\varepsilon}(h) - \varepsilon(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$
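The expression for $\gamma$ above is easy to evaluate numerically. The sketch below is only a direct transcription of the formula; the function names and the example values of $m$, $k$ and $\delta$ are illustrative assumptions. It also evaluates $2k\exp(-2\gamma^2 m)$, which is the same relation solved for $\delta$, so the two functions should invert each other.

```python
import math

def uniform_convergence_gamma(m, k, delta):
    """gamma = sqrt((1/(2m)) * log(2k/delta)) for a class of k hypotheses."""
    return math.sqrt(math.log(2.0 * k / delta) / (2.0 * m))

def failure_probability(m, k, gamma):
    """delta = 2k*exp(-2*gamma^2*m), the same relation solved for delta."""
    return 2.0 * k * math.exp(-2.0 * gamma ** 2 * m)

# Example values (assumed): 1000 training examples, 100 hypotheses, delta = 0.05.
gamma = uniform_convergence_gamma(m=1000, k=100, delta=0.05)
print(gamma)                                           # width of the band around the training error
print(failure_probability(m=1000, k=100, gamma=gamma)) # recovers delta up to rounding
```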

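Now, let’s assume that uniform convergence holds, i.e., that $|\varepsilon(h) - \hat{\varepsilon}(h)| \leq \gamma$ for all $h \in \mathcal{H}$. What can we prove about the generalization of our learning algorithm that picked $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}(h)$?

Define $h^* = \arg\min_{h \in \mathcal{H}} \varepsilon(h)$ to be the best possible hypothesis in $\mathcal{H}$. Note that $h^*$ is the best that we could possibly do given that we are using $\mathcal{H}$, so it makes sense to compare our performance to that of $h^*$. We have:

$$\begin{aligned}
\varepsilon(\hat{h}) &\leq \hat{\varepsilon}(\hat{h}) + \gamma \\
&\leq \hat{\varepsilon}(h^*) + \gamma \\
&\leq \varepsilon(h^*) + 2\gamma
\end{aligned}$$

The first line used the fact that $|\varepsilon(\hat{h}) - \hat{\varepsilon}(\hat{h})| \leq \gamma$ (by our uniform convergence assumption). The second used the fact that $\hat{h}$ was chosen to minimize $\hat{\varepsilon}(h)$, and hence $\hat{\varepsilon}(\hat{h}) \leq \hat{\varepsilon}(h)$ for all $h$, and in particular $\hat{\varepsilon}(\hat{h}) \leq \hat{\varepsilon}(h^*)$. The third line used the uniform convergence assumption again, to show that $\hat{\varepsilon}(h^*) \leq \varepsilon(h^*) + \gamma$. So, what we’ve shown is the following: If uniform convergence occurs, then the generalization error of $\hat{h}$ is at most $2\gamma$ worse than the best possible hypothesis in $\mathcal{H}$!

Let’s put all this together into a theorem.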

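Theorem. Let $|\mathcal{H}| = k$, and let any $m, \delta$ be fixed. Then with probability at least $1 - \delta$, we have that

$$\varepsilon(\hat{h}) \leq \left(\min_{h \in \mathcal{H}} \varepsilon(h)\right) + 2\sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$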

This is proved by letting $\gamma$ equal the $\sqrt{\cdot}$ term, using our previous argument that uniform convergence occurs with probability at least $1 - \delta$, and then noting that uniform convergence implies $\varepsilon(\hat{h})$ is at most $2\gamma$ higher than $\varepsilon(h^*) = \min_{h \in \mathcal{H}} \varepsilon(h)$ (as we showed previously).

This also quantifies what we were saying previously about the bias/variance tradeoff in model selection. Specifically, suppose we have some hypothesis class $\mathcal{H}$, and are considering switching to some much larger hypothesis class $\mathcal{H}' \supseteq \mathcal{H}$. If we switch to $\mathcal{H}'$, then the first term $\min_h \varepsilon(h)$ can only decrease (since we’d then be taking a min over a larger set of functions). Hence, by learning using a larger hypothesis class, our “bias” can only decrease. However, if $k$ increases, then the second $2\sqrt{\cdot}$ term would also increase. This increase corresponds to our “variance” increasing when we use a larger hypothesis class.

By holding $\gamma$ and $\delta$ fixed and solving for $m$ like we did before, we can also obtain the following sample complexity bound:

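Corollary. Let $|\mathcal{H}| = k$, and let any $\delta, \gamma$ be fixed. Then for $\varepsilon(\hat{h}) \leq \min_{h \in \mathcal{H}} \varepsilon(h) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that

$$m \geq \frac{1}{2\gamma^2} \log \frac{2k}{\delta} = O\left(\frac{1}{\gamma^2} \log \frac{k}{\delta}\right).$$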
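To see what the corollary asks for in practice, the sample complexity bound can be evaluated directly. This is a sketch that simply rounds the right-hand side up to an integer; the function name and the example values of $\gamma$, $k$ and $\delta$ are assumptions chosen for illustration.

```python
import math

def sample_complexity(gamma, k, delta):
    """Smallest integer m with m >= (1/(2*gamma^2)) * log(2k/delta)."""
    return math.ceil(math.log(2.0 * k / delta) / (2.0 * gamma ** 2))

# Same accuracy and confidence targets, two hypothesis class sizes.
m_small = sample_complexity(gamma=0.05, k=100, delta=0.05)
m_large = sample_complexity(gamma=0.05, k=10000, delta=0.05)
print(m_small, m_large)
```

Making the class a hundred times larger adds only a term proportional to $\log 100$ to the required $m$, which is the logarithmic dependence on $k$ that the corollary expresses.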

4 The case of infinite H

We have proved some useful theorems for the case of finite hypothesis classes. But many hypothesis classes, including any parameterized by real numbers (as in linear classification), actually contain an infinite number of functions. Can we prove similar results for this setting?

Let’s start by going through something that is not the “right” argument. Better and more general arguments exist, but this will be useful for honing our intuitions about the domain.

Suppose we have an $\mathcal{H}$ that is parameterized by $d$ real numbers. Since we are using a computer to represent real numbers, and IEEE double-precision floating point (doubles in C) uses 64 bits to represent a floating point number, this means that our learning algorithm, assuming we’re using double-precision floating point, is parameterized by $64d$ bits. Thus, our hypothesis class really consists of at most $k = 2^{64d}$ different hypotheses. From the Corollary at the end of the previous section, we therefore find that, to guarantee $\varepsilon(\hat{h}) \leq \varepsilon(h^*) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that

$$m \geq O\left(\frac{1}{\gamma^2} \log \frac{2^{64d}}{\delta}\right) = O\left(\frac{d}{\gamma^2} \log \frac{1}{\delta}\right) = O_{\gamma,\delta}(d).$$

(The $\gamma, \delta$ subscripts are to indicate that the last big-O is hiding constants that may depend on $\gamma$ and $\delta$.) Thus, the number of training examples needed is at most linear in the parameters of the model.

The fact that we relied on 64-bit floating point makes this argument not entirely satisfying, but the conclusion is nonetheless roughly correct: If what we’re going to do is try to minimize training error, then in order to learn

see that, for any of the eight possible labelings of these points, we can find a linear classifier that obtains “zero training error” on them:

[Figure: eight panels with axes $x_1$ and $x_2$, one for each of the eight possible labelings of the three points.]

Moreover, it is possible to show that there is no set of 4 points that this hypothesis class can shatter. Thus, the largest set that $\mathcal{H}$ can shatter is of size 3, and hence $\mathrm{VC}(\mathcal{H}) = 3$.

Note that the VC dimension of $\mathcal{H}$ here is 3 even though there may be sets of size 3 that it cannot shatter. For instance, if we had a set of three points lying in a straight line (left figure), then there is no way to find a linear separator for the labeling of the three points shown below (right figure):

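[Figure: two panels with axes $x_1$ and $x_2$; the left shows three points lying on a straight line, the right a labeling of those points that no linear classifier can separate.]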
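The two situations just described, three points in general position versus three points on a line, can be checked by brute force over all labelings. The sketch below uses the perceptron algorithm to search for a separator, which is a choice made here for simplicity and not something the notes prescribe. The point coordinates, the function names and the epoch limit are assumptions. Failing to find a separator within the limit is treated as “not separable”, which is only a heuristic in general, although for the collinear points below the failing labelings are provably not separable.

```python
from itertools import product

def find_separator(points, labels, max_epochs=1000):
    """Perceptron search for theta with sign(theta . [1, x1, x2]) matching labels.

    labels are +1/-1. Returns theta on success, or None if no separator was
    found within max_epochs passes over the points.
    """
    theta = [0, 0, 0]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            z = (1, x1, x2)                      # input with intercept term
            if y * sum(t * v for t, v in zip(theta, z)) <= 0:
                theta = [t + y * v for t, v in zip(theta, z)]
                mistakes += 1
        if mistakes == 0:
            return theta
    return None

def can_shatter(points):
    """True if a separator is found for every +1/-1 labeling of the points."""
    return all(find_separator(points, labels) is not None
               for labels in product((-1, 1), repeat=len(points)))

general_position = [(0, 0), (1, 0), (0, 1)]   # hypothetical non-collinear points
collinear = [(0, 0), (1, 0), (2, 0)]          # three points on a straight line

print(can_shatter(general_position), can_shatter(collinear))
```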

In other words, under the definition of the VC dimension, in order to prove that $\mathrm{VC}(\mathcal{H})$ is at least $d$, we need to show only that there’s at least one set of size $d$ that $\mathcal{H}$ can shatter.

The following theorem, due to Vapnik, can then be shown. (This is, many would argue, the most important theorem in all of learning theory.)

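Theorem. Let $\mathcal{H}$ be given, and let $d = \mathrm{VC}(\mathcal{H})$. Then with probability at least $1 - \delta$, we have that for all $h \in \mathcal{H}$,

$$|\varepsilon(h) - \hat{\varepsilon}(h)| \leq O\left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right).$$

Thus, with probability at least $1 - \delta$, we also have that:

$$\varepsilon(\hat{h}) \leq \varepsilon(h^*) + O\left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right).$$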

In other words, if a hypothesis class has finite VC dimension, then uniform convergence occurs as $m$ becomes large. As before, this allows us to give a bound on $\varepsilon(\hat{h})$ in terms of $\varepsilon(h^*)$. We also have the following corollary:

Corollary. For $|\varepsilon(h) - \hat{\varepsilon}(h)| \leq \gamma$ to hold for all $h \in \mathcal{H}$ (and hence $\varepsilon(\hat{h}) \leq \varepsilon(h^*) + 2\gamma$) with probability at least $1 - \delta$, it suffices that $m = O_{\gamma,\delta}(d)$.

In other words, the number of training examples needed to learn “well” using $\mathcal{H}$ is linear in the VC dimension of $\mathcal{H}$. It turns out that, for “most” hypothesis classes, the VC dimension (assuming a “reasonable” parameterization) is also roughly linear in the number of parameters. Putting these together, we conclude that (for an algorithm that tries to minimize training error) the number of training examples needed is usually roughly linear in the number of parameters of $\mathcal{H}$.
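To get a feel for how the VC bound shrinks with more data, one can evaluate the expression inside the $O(\cdot)$ for a fixed $d$ and growing $m$. The constant hidden by the big-O is not given in the notes, so the numbers below show only the scaling and are not an actual error guarantee. The function name and the example values of $d$, $m$ and $\delta$ are assumptions.

```python
import math

def vc_bound_term(d, m, delta):
    """sqrt((d/m)*log(m/d) + (1/m)*log(1/delta)), the expression inside O(.)."""
    return math.sqrt((d / m) * math.log(m / d) + math.log(1.0 / delta) / m)

# Fixed VC dimension, increasing training set size.
for m in (1_000, 10_000, 100_000):
    print(m, vc_bound_term(d=10, m=m, delta=0.05))
```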
