
Intro. to machine learning (CSI 5325)

Lecture 17: Learning theory

Greg Hamerly

Spring 2008

Some content from Tom Mitchell.

Outline

1 Mistake bounds

2 Tightness of bounds

3 Summary


Mistake Bounds

So far: how many examples needed to learn?

What about: how many mistakes before convergence?

Let's consider a setting similar to PAC learning:
- Instances are drawn at random from X according to distribution D.
- The learner must classify each instance before receiving the correct classification from the teacher.

Can we bound the number of mistakes the learner makes before converging?


Mistake Bounds: Find-S

Consider Find-S when H = conjunctions of boolean literals.

Find-S:
- Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ ... ∧ ln ∧ ¬ln
- For each positive training instance x: remove from h any literal that is not satisfied by x
- Output hypothesis h

How many mistakes before converging to correct h?
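To make this concrete, here is a minimal sketch (the function, variable names, and toy data are mine, not from the slides) of Find-S run in the mistake-bound setting: it predicts before seeing each label and counts its mistakes.

```python
# Sketch of Find-S over conjunctions of boolean literals, run in the
# mistake-bound setting: predict first, then see the true label.

def find_s_mistakes(examples):
    """examples: list of (x, y) with x a tuple of booleans, y in {0, 1}."""
    n = len(examples[0][0])
    # Most specific hypothesis: every literal and its negation,
    # represented as a set of (attribute index, required value) pairs.
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, y in examples:
        prediction = all(x[i] == v for (i, v) in h)  # does h cover x?
        if prediction != (y == 1):
            mistakes += 1
        if y == 1:
            # Drop every literal that this positive example does not satisfy.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return mistakes, h

# Illustrative target concept: "x0 AND NOT x2" over 3 boolean attributes.
data = [((True, True, False), 1),
        ((True, False, False), 1),
        ((False, True, False), 0),
        ((True, True, True), 0)]
print(find_s_mistakes(data))
```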


Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:
- Learn the concept using the version-space Candidate-Elimination algorithm.
- Classify new instances by majority vote of the version space members.

How many mistakes before converging to correct h? ... in worst case? ... in best case?
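A rough sketch of the Halving algorithm under the same protocol (the toy threshold hypothesis class and all names below are illustrative assumptions, not from the slides):

```python
# Keep a version space of hypotheses consistent with the data so far,
# predict by majority vote, and count the ensemble's mistakes.

def halving_mistakes(hypotheses, examples):
    """hypotheses: list of callables h(x) -> bool; examples: list of (x, y)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in examples:
        votes = sum(1 for h in version_space if h(x))
        prediction = votes * 2 >= len(version_space)  # majority (ties -> True)
        if prediction != y:
            mistakes += 1
        # Candidate elimination: keep only hypotheses that got this example right.
        version_space = [h for h in version_space if h(x) == y]
    return mistakes

# Toy class: threshold concepts over the integers 0..9.
H = [(lambda x, t=t: x >= t) for t in range(10)]
data = [(3, False), (7, True), (5, False), (6, True)]
print(halving_mistakes(H, data))  # each ensemble mistake at least halves the version space
```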


Optimal Mistake Bounds

Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences):

M_A(C) ≡ max_{c ∈ C} M_A(c)

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):

Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)

VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)


Weighted-Majority algorithm

A generalization of the Halving algorithm:
- uses multiple learning algorithms (or hypotheses)
- never completely eliminates any learner
- each algorithm a_i has an associated weight w_i
- uses a weighted majority vote of the learners to make predictions
- instead of eliminating an algorithm that makes a mistake, it just reduces its weight

Has nice bound on number of mistakes!

Connected to ‘boosting’ which we’ll look at later.


Weighted-Majority algorithm (2)

Start with a group of learners A = {a_i} and an initial weight w_i = 1 for each algorithm. Then for each training example 〈x, c(x)〉:
- predict + or − for x according to the weighted majority vote of the algorithms (breaking ties randomly)
- for each algorithm a_i that predicted incorrectly, set w_i ← β·w_i, where 0 < β < 1 is a user-chosen parameter
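A minimal sketch of this procedure (the names and the toy experts are illustrative; the tie-breaking and the β update follow the description above):

```python
# Weighted-Majority: every learner keeps a weight, the ensemble predicts by
# weighted vote, and learners that err are down-weighted by beta.
import random

def weighted_majority(learners, examples, beta=0.5):
    """learners: list of callables a(x) -> bool; returns ensemble mistake count."""
    weights = [1.0] * len(learners)
    ensemble_mistakes = 0
    for x, y in examples:
        pos = sum(w for a, w in zip(learners, weights) if a(x))
        neg = sum(w for a, w in zip(learners, weights) if not a(x))
        if pos == neg:
            prediction = random.choice([True, False])  # break ties randomly
        else:
            prediction = pos > neg
        if prediction != y:
            ensemble_mistakes += 1
        # Down-weight every learner that predicted incorrectly.
        weights = [w * beta if a(x) != y else w
                   for a, w in zip(learners, weights)]
    return ensemble_mistakes

# Toy use: three fixed "experts" on integer inputs; the target is "x is even".
experts = [lambda x: x % 2 == 0, lambda x: x > 5, lambda x: True]
data = [(2, True), (7, False), (4, True), (9, False)]
print(weighted_majority(experts, data))
```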


Mistakes in the Weighted-Majority algorithm

Note that each individual learner a_i makes a mistake when it predicts c(x) incorrectly.

However, the ensemble of learners only makes a mistake when the weighted majority of them makes a mistake.

We want a bound on how many mistakes the ensemble will make.


Mistake bound for Weighted-Majority

The Weighted-Majority algorithm makes at most

2.4 (k + log2 n)

mistakes, where
- n is the number of learners
- k is the number of mistakes made by the best learner
- β is assumed to be 1/2
- the bound holds over all possible training sequences

This is nice, since it is close to the number of mistakes made by the best learner!
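For a sense of scale (the numbers here are made up for illustration):

```python
# Illustrative arithmetic only: with 100 learners whose best member makes
# 20 mistakes, the Weighted-Majority bound (beta = 1/2) gives
from math import log2

n, k = 100, 20
print(2.4 * (k + log2(n)))  # about 64 ensemble mistakes, worst case
```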

Interesting question – what is the tradeoff between this bound and n, if we randomly choose the set of learning algorithms?


Tightness of bounds

Learning theory gives guarantees of performance (complexity, mistakes, etc.) for a concept class and a learner over any distribution of examples.

Of note:
- the learner's hypothesis space H could be huge (even infinite)
- the distribution of examples could be anything

Because of these accommodations, PAC bounds, even ones based on the VC dimension, tend to be very conservative.


Example: linear classifier

A linear classifier can shatter at most 3 examples in 2 dimensions.

This means that, theoretically, we cannot guarantee to fit an arbitrary labeling of more than 3 examples with a linear classifier in 2 dimensions.

In general, in d dimensions, only samples of size at most d + 1 can always be learned perfectly.
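As an illustration, the classic XOR labeling of four points in the plane cannot be realized by any linear classifier. The sketch below (assuming numpy and scipy are available; it is not from the slides) checks this by posing linear separability as a feasibility LP:

```python
# Check that the XOR labeling of four points in 2D is not linearly separable,
# illustrating that a linear classifier cannot shatter 4 points in 2 dimensions.
import numpy as np
from scipy.optimize import linprog

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, 1, 1, -1])  # XOR-style labeling

# Feasibility LP: find z = (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i,
# written as A_ub @ z <= b_ub.
A_ub = -(labels[:, None] * np.hstack([points, np.ones((4, 1))]))
b_ub = -np.ones(4)
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)
print("separable" if res.success else "not linearly separable")
```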


Example: neural network

VC dimension of a simplified neural network shape classifier:
- n = 15 × 15 = 225 input nodes
- the maximum number of inputs to any node is r ≤ 225
- with 2 outputs and 10 hidden nodes, s = 12 units
- linear threshold perceptron as the squashing function: VC(perceptron) = r + 1
- therefore, VC(net) ≤ 2(r + 1) s log(e s) ≈ 18902

So the sample complexity for ε = 0.1 and δ = 0.05 is:

m ≥ (1/ε) (4 log2(2/δ) + 8 VC(net) log2(13/ε)) ≈ 10^7

Rather large, but not as large as |X| = 2^225.
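The arithmetic on this slide can be checked directly (this just reproduces the numbers above):

```python
# Reproduce the VC bound and sample-complexity estimate from this slide.
from math import e, log, log2

r, s = 225, 12                      # max inputs per unit; number of units
vc_net = 2 * (r + 1) * s * log(e * s)
print(round(vc_net))                # about 18902

eps, delta = 0.1, 0.05
m = (1 / eps) * (4 * log2(2 / delta) + 8 * vc_net * log2(13 / eps))
print(f"{m:.3g}")                   # about 1e7 examples
```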


Reality versus theory

Theoretical bounds that allow any distribution over examples and very large hypothesis spaces fail to reflect the reality that most concepts of interest are relatively well-behaved.

Because of the conservative nature of these bounds, it's not clear how useful they are in practice.

Thus, ongoing work is tightening bounds based on better analysis and domain-specific assumptions.


PAC-Bayes bounds

For example, the PAC-Bayes framework (McAllester and others) combines PAC analysis with Bayesian-style priors on the functions that will be learned.

This can tighten the bounds by assuming that very complex functions, which would require lots of training data, are less likely to occur.


PAC-Bayes bounds (McAllester 1998)

Assume a prior probability distribution over the hypothesis space, where P(h) is the probability of hypothesis h, and H = C.

For a consistent learner and any δ > 0, the following holds with probability at least 1 − δ for all hypotheses consistent with a sample of size m:

error_D(h) ≤ (log(1/P(h)) + log(1/δ)) / m

Thus the sample complexity (to guarantee error_D(h) ≤ ε) is

m ≥ (1/ε) (log(1/P(h)) + log(1/δ))

If h has a high prior probability, then the sample complexity will be much smaller than the bound based on |H| or VC(H).
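As a made-up numerical illustration of how a favorable prior helps:

```python
# Illustrative arithmetic only: suppose the prior puts probability 0.01 on the
# hypothesis the learner returns, and we want eps = 0.1 with delta = 0.05.
from math import log

P_h, eps, delta = 0.01, 0.1, 0.05
m = (1 / eps) * (log(1 / P_h) + log(1 / delta))
print(round(m))  # about 76 examples suffice for this h
```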

Summary

Thoughts on learning theory

At a very high level, learning theory results often boil down to this: the amount of training data needed is related to the complexity of the functions you are learning.

More complex functions (e.g., large |H| or VC(H)) require more training (i.e., more examples).

Remember that learning theory results are typically very generic.

As such, they are guidelines, but very loose ones. Making them more specific requires incorporating knowledge about the particular problems you're working on.