Material Type: Notes; Professor: Hamerly; Class: Introduction to Machine Learning; Subject: Computer Science; University: Baylor University; Term: Spring 2008;
Greg Hamerly
Spring 2008
Some content from Tom Mitchell.
Outline
1 Mistake bounds
2 Tightness of bounds
3 Summary
Mistake bounds
So far: how many examples needed to learn?
What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
- Instances are drawn at random from X according to distribution D.
- The learner must classify each instance before receiving the correct classification from the teacher.
Can we bound the number of mistakes the learner makes before converging?
Mistake bounds
Consider Find-S when H = conjunctions of boolean literals.
Find-S:
- Initialize h to the most specific hypothesis l_1 ∧ ¬l_1 ∧ l_2 ∧ ¬l_2 ∧ ... ∧ l_n ∧ ¬l_n.
- For each positive training instance x, remove from h any literal that is not satisfied by x.
- Output hypothesis h.
How many mistakes before converging to correct h?
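The procedure above can be sketched in Python. This is a minimal illustration, not code from the notes; the dictionary-based instance representation and function names are my own.

```python
def find_s(positive_examples, literal_names):
    """Find-S for conjunctions of boolean literals.

    Start with the most specific hypothesis containing every literal
    and its negation, then drop any literal a positive example
    contradicts.
    """
    # h holds pairs (name, value): literal l_i if value is True,
    # its negation ~l_i if value is False.
    h = {(name, v) for name in literal_names for v in (True, False)}
    for x in positive_examples:
        # Remove from h any literal not satisfied by positive instance x.
        h = {(name, v) for (name, v) in h if x[name] == v}
    return h

def predict(h, x):
    # Classify positive iff every remaining literal is satisfied by x.
    return all(x[name] == v for (name, v) in h)
```

For example, with target concept l1 ∧ ¬l2, two positive examples suffice to prune h down to {(l1, True), (l2, False)}. Each mistake on a positive example removes at least one literal from h, which is what makes a mistake bound possible.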
Mistake bounds
Consider the Halving Algorithm:
- Learn the concept using the version-space Candidate-Elimination algorithm.
- Classify new instances by majority vote of the version-space members.
How many mistakes before converging to correct h? ... in worst case? ... in best case?
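A sketch of the Halving algorithm in Python may help make the worst-case intuition concrete: every ensemble mistake eliminates at least half of the version space. The toy threshold-hypothesis setup below is my own illustration, not from the notes.

```python
def halving_predict(version_space, x):
    """Predict by majority vote of the remaining hypotheses."""
    votes = sum(1 if h(x) else -1 for h in version_space)
    return votes >= 0  # break ties toward positive

def halving_update(version_space, x, label):
    """Keep only the hypotheses consistent with the revealed label."""
    return [h for h in version_space if h(x) == label]
```

Since a mistaken majority vote means at least half of the version space voted wrong (and is then eliminated), the worst-case number of mistakes is at most log2(|H|); in the best case the majority is always right and no mistakes are made at all.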
Mistake bounds
Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences):
M_A(C) ≡ max_{c ∈ C} M_A(c)
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)
It is known that VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|).
Mistake bounds
A generalization of the Halving algorithm:
- uses multiple learning algorithms (or hypotheses)
- never completely eliminates any learner
- each algorithm a_i has an associated weight w_i
- uses a weighted majority vote of the learners to make predictions
- instead of eliminating an algorithm that makes a mistake, just reduces its weight
Has nice bound on number of mistakes!
Connected to ‘boosting’ which we’ll look at later.
Mistake bounds
Start with a group of learners A = {a_i} and an initial weight w_i = 1 for each algorithm. Then, for each training example ⟨x, c(x)⟩:
- predict + or − for x according to the weighted majority vote of the algorithms (breaking ties randomly)
- for each algorithm a_i that predicted incorrectly, set w_i ← β w_i, where 0 < β < 1 is a user-chosen parameter.
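The Weighted-Majority update rule can be sketched as follows. The function names and the representation of learners as predicate functions are my own choices for illustration.

```python
import random

def weighted_majority_predict(learners, weights, x):
    """Predict + (True) or - (False) by weighted majority vote."""
    pos = sum(w for a, w in zip(learners, weights) if a(x))
    neg = sum(w for a, w in zip(learners, weights) if not a(x))
    if pos == neg:
        return random.random() < 0.5  # break ties randomly
    return pos > neg

def weighted_majority_update(learners, weights, x, label, beta=0.5):
    """Shrink the weight of every learner that predicted incorrectly."""
    return [w * beta if a(x) != label else w
            for a, w in zip(learners, weights)]
```

Note the contrast with the Halving algorithm: a learner that errs is penalized multiplicatively (w_i ← β w_i) rather than discarded, so a learner that is good overall survives a few early mistakes.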
Mistake bounds
Note that each learner a_i makes a mistake when it predicts c(x) incorrectly.
However, the ensemble of learners only makes a mistake when the weighted majority of them makes a mistake.
We want a bound on how many mistakes the ensemble will make.
Mistake bounds
The Weighted-Majority algorithm makes at most
2.4 (k + log2 n)
mistakes, where
- n is the number of learners
- k is the number of mistakes made by the best learner
- β is assumed to be 1/2
- the bound holds over all possible training sequences.
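Since the bound grows only logarithmically in n, it is cheap to evaluate how much a larger pool of learners costs. A quick check (numbers chosen for illustration):

```python
import math

def wm_bound(k, n):
    """Weighted-Majority mistake bound with beta = 1/2:
    at most 2.4 * (k + log2 n) ensemble mistakes, where k is the
    number of mistakes made by the best of the n learners."""
    return 2.4 * (k + math.log2(n))

# Growing the pool from 16 to 1024 learners raises the bound
# by only 2.4 * log2(1024 / 16) = 14.4 additional mistakes.
```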
This is nice since it is close to the best learner!
Interesting question – what is the tradeoff between this bound and n, if we randomly choose the set of learning algorithms?
Tightness of bounds
Learning theory gives guarantees of performance (complexity, mistakes, etc.) for a concept class and a learner over any distribution of examples.
Of note:
- the learner's hypothesis space H could be huge (even infinite)
- the distribution of examples could be anything
Because of these accommodations, PAC bounds, even ones based on VC dimension, tend to be very conservative.
Tightness of bounds
A linear classifier can shatter at most 3 examples in 2 dimensions.
This means that, in the worst case, a linear classifier in 2 dimensions is not guaranteed to fit every possible labeling of more than 3 examples.
In general, in d dimensions, only samples of size at most d + 1 can be shattered (learned perfectly under every labeling).
Tightness of bounds
VC dimension of a simplified neural network shape classifier:
- n = 15 × 15 = 225 input nodes
- max number of inputs to any node is r ≤ 225
- with 2 outputs and 10 hidden nodes, s = 12
- linear threshold perceptron as the squashing function: VC(perceptron) = r + 1
- therefore, VC(net) ≤ 2(r + 1) s log(es) ≈ 18902
So the sample complexity for ε = 0.1 and δ = 0.05 is:
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(net) log2(13/ε)) ≈ 10^7
Rather large, but not as large as |X| = 2^225.
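Plugging the numbers into the sample-complexity bound confirms the order of magnitude (a quick check, using the VC(net) estimate from the slide):

```python
import math

# Sample-complexity bound from the slide:
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(net)*log2(13/eps))
eps, delta, vc_net = 0.1, 0.05, 18902
m = (1 / eps) * (4 * math.log2(2 / delta)
                 + 8 * vc_net * math.log2(13 / eps))
# m comes out to roughly 1.06e7, i.e. on the order of 10^7 examples.
```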
Tightness of bounds
Theoretical bounds which allow any distribution over examples and very large hypothesis spaces fail to meet the reality of concepts which are relatively well-behaved.
Because these bounds are so conservative, it is not clear how useful they are in practice.
Thus, ongoing work is tightening bounds based on better analysis and domain-specific assumptions.
Tightness of bounds
For example, the PAC-Bayes framework (McAllester and others) combines PAC analysis with Bayesian-style priors on the functions that will be learned.
This can tighten bounds by assuming that really complex functions which require lots of training are less likely to occur.
Tightness of bounds
Assume a prior probability distribution over the hypothesis space, where P(h) is the probability of hypothesis h, and H = C.
For a consistent learner and any δ > 0, the following holds with probability at least 1 − δ for all hypotheses consistent with a sample of size m:
error_D(h) ≤ (log(1/P(h)) + log(1/δ)) / m
Thus the sample complexity for error at most ε is
m ≥ (1/ε)(log(1/P(h)) + log(1/δ))
If h has high probability, then the sample complexity will be much smaller than the bound based on |H| or VC (H).
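A small numeric illustration of this effect (the prior values are made-up examples, not from the notes):

```python
import math

def pac_bayes_m(prior, eps=0.1, delta=0.05):
    """Sample complexity m >= (1/eps)*(log(1/P(h)) + log(1/delta))
    for a consistent learner under a prior P over hypotheses."""
    return (1 / eps) * (math.log(1 / prior) + math.log(1 / delta))

m_high = pac_bayes_m(0.01)   # h considered fairly likely a priori
m_low = pac_bayes_m(1e-9)    # h considered very unlikely a priori
# The favored hypothesis needs far fewer examples than the unlikely one.
```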
Summary
At a very high level, learning-theory results often boil down to this: the amount of training required is related to the complexity of the functions you are training.
More complex functions (e.g. large |H| or VC (H)) require more training (i.e. more examples).
Remember that learning theory results are typically very generic.
As such, they are guidelines, but very loose ones. Making them more specific requires incorporating details of the problems you are working on.