









Material Type: Exam; Professor: Roth; Class: Machine Learning; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Spring 2006;










CS446: Pattern Recognition and Machine Learning Spring 2006
March 17, 2006
This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam.
The exam ends at 1:45 pm. It contains 5 problems. You have 75 minutes to earn a total of 100 points. Answer each question in the space provided.
If you need more room, write on the back side of the paper and indicate that you have done so.
Besides having the correct answer, being concise and clear is very important. For full credit, you must show your work and explain your answers.
Good Luck!
Problem 1 (20 points):
Problem 2 (20 points):
Problem 3 (20 points):
Problem 4 (20 points):
Problem 5 (20 points):
Total (100 points):
Problem 1 [Online Learning - 20 points] In this problem, you will use the Perceptron learning algorithm to learn a separating line for the following 6 data points sampled from the target Boolean classifier:

 x1    x2    y
  1     6    •
  3     3    •
 -3     2    ×
  3    -5    •
 -4    -2    ×
  0    -5    ×

The symbols × and • correspond to positive and negative labels, respectively.
(a) Assume we have two features $x_1 \in \mathbb{R}$ and $x_2 \in \mathbb{R}$. We will initialize Perceptron’s weight vector to $w = (0, 0)$, and predictions will be made with
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{2} w_i x_i\right)$$
Following this formulation, apply Perceptron with a learning rate of 1 to the data in the order it is given in the table. Give the final weight vector you arrive at, and state whether or not this weight vector is consistent with the data. (There is more space on the next page.)
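The update sequence asked for in part (a) can be traced mechanically. The sketch below is one reading of the setup, not the official solution: it assumes × is encoded as +1 and • as -1, makes a single pass over the table, and treats an activation of exactly 0 as a mistake (the exam does not say how sign(0) is resolved, and a different convention changes the trace). The names `DATA`, `perceptron_pass`, and `is_consistent` are illustrative.

```python
# Six training points from the table: ((x1, x2), label), with x -> +1 and bullet -> -1.
DATA = [((1, 6), -1), ((3, 3), -1), ((-3, 2), 1),
        ((3, -5), -1), ((-4, -2), 1), ((0, -5), 1)]


def perceptron_pass(data, w=(0, 0), rate=1):
    """One pass of Perceptron without a threshold, in the given order."""
    w = list(w)
    for (x1, x2), y in data:
        activation = w[0] * x1 + w[1] * x2
        # Assumption: a zero activation is counted as a mistake.
        if y * activation <= 0:
            w[0] += rate * y * x1
            w[1] += rate * y * x2
    return tuple(w)


def is_consistent(w, data):
    """True if sign(w . x) matches every label with a nonzero activation."""
    return all(y * (w[0] * x1 + w[1] * x2) > 0 for (x1, x2), y in data)


w_final = perceptron_pass(DATA)
print(w_final, is_consistent(w_final, DATA))
```

Under these assumptions the pass makes mistakes on the first, third, fourth, and sixth points and ends at w = (-7, -4), which classifies all six points correctly.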
(b) Consider adding a new data point: $x_b = (-0.5, 0)$. Assume that Perceptron will not be given the label of $x_b$ until it has started training. Can the algorithm as presented in part (a) represent a hypothesis consistent with all the data?
Solution: The representation used by the algorithm described in part (a) is not expressive enough to classify all the data correctly if $x_b$ has a negative label because it can only represent lines that cross the origin. To give Perceptron the ability to represent lines that don’t cross the origin, we can add a threshold as follows:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{2} w_i x_i - \theta\right)$$
The question also asks us to make sure we are increasing the algorithm’s expressivity as little as possible. Therefore, we will not learn $\theta$, but instead we will keep it fixed and positive, thereby moving the separating line off of the origin in the direction of the weight vector, which works no matter the label of $x_b$.
(c) Now consider adding the point: $x_c = (0.5, 0.5)$. Assume that $x_b$ and $x_c$ have the same label, but that Perceptron will not be made aware of that label until after it has started training. Can the algorithm as presented in part (a) represent a hypothesis consistent with all the data?
Solution: As stated above, it is either the case that both $x_b$ and $x_c$ are labeled negative or that they are both labeled positive. The representation used by the algorithm described in part (a) is not expressive enough to classify all the data correctly in either case, again because the appropriate separating lines don’t pass through the origin. Adding a threshold as we did in part (b) is enough to move the separating line off of the origin. When $x_b$ and $x_c$ are both negative, we need to move the separating line off of the origin in the direction of $w$, which means we need a positive $\theta$. When they are both positive, we need to move the separating line off of the origin in the opposite direction of $w$, which means we need a negative $\theta$. The only way to be ready for both contingencies is to learn $\theta$ along with $w$.
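The representational claims in parts (b) and (c) can be checked numerically. The sketch below is illustrative only: the angle scan is a finite search rather than a proof, and the specific weight vectors and thresholds (for example w = (-7, -4) with θ = 5, or w = (-1, -0.3) with θ = -1) are hand-picked values that do not come from the exam. Labels are again assumed to be +1 for × and -1 for •.

```python
import math

BASE = [((1, 6), -1), ((3, 3), -1), ((-3, 2), 1),
        ((3, -5), -1), ((-4, -2), 1), ((0, -5), 1)]
XB, XC = (-0.5, 0.0), (0.5, 0.5)


def consistent(w, theta, data):
    """True if sign(w . x - theta) matches every label with a nonzero margin."""
    return all(y * (w[0] * x[0] + w[1] * x[1] - theta) > 0 for x, y in data)


def any_line_through_origin(data, steps=3600):
    """Scan unit weight vectors at evenly spaced angles with theta fixed at 0."""
    for k in range(steps):
        angle = 2 * math.pi * k / steps
        if consistent((math.cos(angle), math.sin(angle)), 0.0, data):
            return True
    return False


both_negative = BASE + [(XB, -1), (XC, -1)]
both_positive = BASE + [(XB, 1), (XC, 1)]

# Part (b): a line through the origin fails only when x_b is negative,
# while a fixed positive theta (here 5) can handle either label.
print(any_line_through_origin(BASE + [(XB, 1)]), any_line_through_origin(BASE + [(XB, -1)]))
print(consistent((-14, -8), 5.0, BASE + [(XB, 1)]), consistent((-7, -4), 5.0, BASE + [(XB, -1)]))

# Part (c): no line through the origin works, and the sign of theta depends on the label.
print(any_line_through_origin(both_negative), any_line_through_origin(both_positive))
print(consistent((-7, -4), 5.0, both_negative), consistent((-1, -0.3), -1.0, both_positive))
```

The scan finds a consistent line through the origin only when $x_b$ is labeled positive and $x_c$ is absent, which matches the argument in the solutions.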
Problem 2 [Decision Trees - 20 points] In this problem, you will use the ID3 algorithm to learn a decision tree that correctly classifies the 40 data points below:
The symbols × and • correspond to positive and negative labels, respectively.
(a) Assume we have two multi-valued features whose names are $x_1$ and $x_2$. The set of allowable values for $x_1$ is $\{-6, \ldots, 6\}$. Similarly, $x_2 \in \{-6, \ldots, 5\}$. Using only these two features, run the ID3 algorithm on the data. Draw and clearly label the resulting decision tree. (There is more space on the next page.)
Answer problem 2(a): While it might seem daunting at first to run ID3 on a dataset with 40 points, on closer inspection we see that most of the computations are trivial. In particular, whenever all the examples in a subset have the same label, the entropy is 0. Whenever exactly half of the examples in a subset are positive and the other half are negative, the entropy is 1.
The ID3 algorithm first decides on a root node feature using the information gain metric. Let S represent our data set.
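$$\mathrm{Entropy}(S) = 1$$
$$\mathrm{Entropy}(S_{x_1=-3}) = 1$$
$$\mathrm{Entropy}(S_{x_1=-2}) = -\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{4}{5}\log_2\tfrac{4}{5} \approx \tfrac{1}{5}\cdot\tfrac{7}{3} + \tfrac{4}{5}\cdot\tfrac{1}{3} = \tfrac{11}{15}$$
$$\mathrm{Gain}(S, x_1) = \mathrm{Entropy}(S) - \sum_{v \in x_1} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) = 1 - \left(\frac{2}{40}\cdot 1 + \frac{5}{40}\cdot\frac{11}{15}\right) = \frac{103}{120}$$
$$\mathrm{Entropy}(S_{x_2=2}) = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx \tfrac{1}{4}\cdot 2 + \tfrac{3}{4}\cdot\tfrac{2}{5} = \tfrac{4}{5}$$
$$\mathrm{Gain}(S, x_2) = \mathrm{Entropy}(S) - \sum_{v \in x_2} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) = 1 - \frac{1}{10}\cdot\frac{4}{5} = \frac{23}{25}$$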
Since $\frac{23}{25} > \frac{103}{120}$, we will choose $x_2$ for our root node. After that, ID3 has only 1 feature left to choose from, so there is no need for any more computation. The resulting tree is shown below.
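The gain comparison can be reproduced with exact logarithms instead of the rounded values used in the written solution. Because the 40-point figure is not available here, the subset sizes below are assumptions read off from the arithmetic: 20 positive and 20 negative examples overall, an impure subset with 1 and 1 examples of each label at $x_1 = -3$, one with 1 and 4 at $x_1 = -2$, one with 1 and 3 at $x_2 = 2$, and every other subset pure. The function names are illustrative.

```python
from math import log2


def entropy(a, b):
    """Binary entropy (base 2) of a subset with a examples of one label and b of the other."""
    total = a + b
    return -sum((c / total) * log2(c / total) for c in (a, b) if c)


def gain(class_counts, impure_subsets):
    """Information gain of a split; pure subsets have entropy 0 and can be omitted."""
    total = sum(class_counts)
    remainder = sum((a + b) / total * entropy(a, b) for a, b in impure_subsets)
    return entropy(*class_counts) - remainder


# Assumed counts (see the lead-in); they are not taken from the missing figure.
gain_x1 = gain((20, 20), [(1, 1), (1, 4)])
gain_x2 = gain((20, 20), [(1, 3)])
print(round(gain_x1, 4), round(gain_x2, 4))
```

With exact logarithms the gains come out near 0.860 for $x_1$ and 0.919 for $x_2$, so the choice of $x_2$ as the root does not depend on the rounding.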
(c) Which of the two trees you just created do you expect will perform better on future testing examples? Justify your answer.
Solution: We’d expect the second tree to have better generalization. The reason is that the first tree seems to make much more specific decisions that will probably tend to overfit the training set. The tree making more general decisions should tend to do better on unseen examples.
Problem 3 [VC Dimension - 20 points] Let H be the hypothesis space of all disjunctions over an n-dimensional, Boolean feature space $X = \{0, 1\}^n$. A hypothesis h ∈ H can contain 0 ≤ k ≤ n literals (either negated or non-negated), and h predicts “positive” for an example x ∈ X if and only if the evaluation of h given the variable settings in x yields “true.” For example, if $h \equiv x_1 \lor \lnot x_4$, then h predicts positive for x = (0, 1, 1, 0), and it predicts negative for x = (0, 0, 0, 1). If h contains 0 literals, its evaluation on any x yields “false.” What is the VC dimension of H? Prove that your answer is correct.
Solution: $VC(H) = n$. To prove this, we first show that there exists a set of n examples that can be shattered by H. Such a set can be constructed by selecting n examples that each have a single, different variable set to 1 while all the others are set to 0. For example, if n = 5, we’d select the following 5 examples:
(1, 0, 0, 0, 0)
(0, 1, 0, 0, 0)
(0, 0, 1, 0, 0)
(0, 0, 0, 1, 0)
(0, 0, 0, 0, 1)
Any labeling of these examples can be produced with some h ∈ H such that h contains all and only those literals active in the positively labeled examples.
It is important to note that a set of n examples can only be shattered by our H if every example contains at least one literal that has a different setting in that example than in any other example. The set of examples above satisfies this constraint. As a result, for every example x in the set, there exists an h ∈ H that labels x positive while it labels every other example negative. We are particularly concerned about this labeling, because specific labelings such as this one are difficult for a disjunctive hypothesis to achieve. If there does not exist a disjunction containing a single literal that achieves this labeling, then there cannot exist any disjunction that achieves the labeling, since adding literals can only increase the number of examples the disjunction labels as positive.
To complete our proof that $VC(H) = n$, it remains to be shown that no set of n + 1 examples can be shattered by H. First, if a given set of n + 1 examples S does not contain a subset of n shatterable examples, then S cannot be shatterable. By the argument above, if S does contain a subset of n shatterable examples, it must be the case that every example in that subset contains one literal that is set differently in that example than in every other example in the subset. The lone remaining example in S must then break that constraint by the pigeonhole principle. In other words, every h ∈ H that labels the last example positive must also label at least one other example in S positive. Therefore, no set of n + 1 examples is shatterable by H.
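For small n the claim can be sanity-checked by brute force. The sketch below is not part of the proof: it assumes each variable appears at most once in a disjunction (absent, positive, or negated), enumerates every such hypothesis, and searches for the largest shattered subset of $\{0, 1\}^n$. All function names are illustrative.

```python
from itertools import combinations, product


def disjunctions(n):
    """Every disjunction over n variables: 0 = absent, 1 = positive literal, -1 = negated literal."""
    return list(product((0, 1, -1), repeat=n))


def predict(h, x):
    """True if some literal of h is satisfied by x; the empty disjunction is always False."""
    return any((s == 1 and xi == 1) or (s == -1 and xi == 0) for s, xi in zip(h, x))


def is_shattered(points, hyps):
    """True if the hypotheses realize all 2^|points| labelings of the points."""
    labelings = {tuple(predict(h, x) for x in points) for h in hyps}
    return len(labelings) == 2 ** len(points)


def vc_dimension(n):
    """Largest k such that some k-subset of {0,1}^n is shattered (exhaustive search)."""
    space = list(product((0, 1), repeat=n))
    hyps = disjunctions(n)
    best = 0
    for k in range(1, len(space) + 1):
        if any(is_shattered(subset, hyps) for subset in combinations(space, k)):
            best = k
        else:
            break  # if no k-subset is shattered, no larger subset can be
    return best


unit_vectors = [tuple(int(i == j) for j in range(3)) for i in range(3)]
print(is_shattered(unit_vectors, disjunctions(3)), vc_dimension(2), vc_dimension(3))
```

The search is exponential in n, so it is only practical for very small feature spaces; it serves as a check on the argument, not a replacement for it.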
Assume that $\|u\| = 1$. Let $v_k$ be the weight vector before the $k$th mistake. Assume that $v_1 = 0$ and that the $k$th mistake occurs on example $(x_i, y_i)$.
(b) Give a single inequality expressing the notion that the $i$th example is the $k$th mistake.
Solution:
$$y_i (v_k \cdot x_i) < y_i \theta$$
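(c) Prove a lower bound for $v_{k+1} \cdot u$ starting from:
$$v_{k+1} = v_k + r y_i x_i$$
Solution:
$$v_{k+1} \cdot u = v_k \cdot u + r y_i (x_i \cdot u) \geq v_k \cdot u + r(y_i\theta + \gamma) \geq v_k \cdot u + r(\gamma - \theta)$$
$$\therefore\; v_{k+1} \cdot u \geq k r(\gamma - \theta)$$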
(d) Prove an upper bound for $\|v_{k+1}\|^2$.
Solution:
$$\|v_{k+1}\|^2 = \|v_k\|^2 + 2 r y_i (x_i \cdot v_k) + r^2 \|x_i\|^2 \leq \|v_k\|^2 + 2 r y_i \theta + r^2 R^2 \leq \|v_k\|^2 + 2 r \theta + r^2 R^2$$
$$\therefore\; \|v_{k+1}\|^2 \leq k(2 r \theta + r^2 R^2)$$
Finally, since $\|u\| = 1$, we know that $\|v_{k+1}\| \geq v_{k+1} \cdot u$.
(e) Prove an upper bound for the number of mistakes k.
Solution:
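$$k r(\gamma - \theta) \leq v_{k+1} \cdot u \leq \|v_{k+1}\| \leq \sqrt{k(2 r \theta + r^2 R^2)}$$
$$k r(\gamma - \theta) \leq \sqrt{k(2 r \theta + r^2 R^2)}$$
$$\sqrt{k} \leq \frac{\sqrt{2 r \theta + r^2 R^2}}{r(\gamma - \theta)}$$
$$k \leq \frac{2\theta + r R^2}{r(\gamma - \theta)^2}$$
$$k \leq \frac{\frac{2}{r}\theta + R^2}{(\gamma - \theta)^2}$$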
Note that neither the third-to-last nor the second-to-last inequality holds unless we assume that $\gamma > \theta$. If $\gamma < \theta$, the direction of the inequality would flip when dividing by $\gamma - \theta$, and it would be impossible to tell which inequality would hold after squaring both sides.
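The bound can be checked empirically on synthetic data. The sketch below reads its assumptions off the derivation, since the problem statement itself is not shown here: a unit target vector u with $y_i(u \cdot x_i - \theta) \geq \gamma$, a known fixed threshold $\theta \geq 0$, $\|x_i\| \leq R$, and $\gamma > \theta$. The data generator, the seed, and all names are hypothetical choices for illustration.

```python
import math
import random


def run_fixed_threshold_perceptron(data, theta, r=1.0, max_epochs=1000):
    """Cycle through the data, updating v <- v + r*y*x whenever y*(v.x) < y*theta."""
    v = [0.0, 0.0]
    mistakes = 0
    for _ in range(max_epochs):
        updated = False
        for (x1, x2), y in data:
            if y * (v[0] * x1 + v[1] * x2) < y * theta:
                v[0] += r * y * x1
                v[1] += r * y * x2
                mistakes += 1
                updated = True
        if not updated:
            break
    return v, mistakes


def mistake_bound(data, u, theta, r=1.0):
    """Evaluate (2*theta/r + R^2) / (gamma - theta)^2 from the data's margin and radius."""
    gamma = min(y * (u[0] * x1 + u[1] * x2 - theta) for (x1, x2), y in data)
    radius = max(math.hypot(x1, x2) for (x1, x2), _ in data)
    if not gamma > theta >= 0:
        raise ValueError("the bound needs gamma > theta >= 0")
    return (2 * theta / r + radius ** 2) / (gamma - theta) ** 2


def make_data(seed=0, m=50, theta=0.5):
    """Points whose first coordinate clears the threshold by a margin between 1 and 2."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        y = rng.choice((-1, 1))
        x1 = theta + y * rng.uniform(1.0, 2.0)
        x2 = rng.uniform(-2.0, 2.0)
        data.append(((x1, x2), y))
    return data


U, THETA = (1.0, 0.0), 0.5
data = make_data(theta=THETA)
v, k = run_fixed_threshold_perceptron(data, THETA)
print(k, "mistakes; bound:", mistake_bound(data, U, THETA))
```

Any margin value no larger than the true margin keeps the bound valid, so using the minimum over the sample gives the tightest version for this dataset.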
(b) i. What condition does the Adaboost algorithm require its weak learning algorithm to satisfy in order to guarantee learnability?
Solution: The theory behind Adaboost guaranteeing learnability requires that the weak learning algorithm produce a hypothesis whose error, measured with respect to the current distribution over the examples, is less than $\frac{1}{2}$ during each round of boosting.
ii. Assume we have a dataset labeled by a 3-DNF target concept, and assume that there are an equal number of positive and negative examples. Describe a weak learning algorithm and prove that it will satisfy the condition from part i. during every round of Adaboost.
Solution: One learning algorithm that satisfies the requirement has the set of all 3-conjunctions as its hypothesis space. Given a set of examples and a distribution over them, it simply evaluates every 3-conjunction on the examples and selects the one with the smallest error. This algorithm must produce a hypothesis that achieves an error $\epsilon_1 < \frac{1}{2}$ during the first round of boosting. To show that this is true, we merely need to show that there exists at least one 3-conjunction that achieves error less than $\frac{1}{2}$. Any 3-conjunction that’s a term in the target 3-DNF will get all negative examples correct. Since there are positive examples, at least one of those terms must also get some positive examples correct. Since there are an equal number of positive and negative examples, any 3-conjunction that gets all negative examples and at least one positive example correct achieves error $\epsilon_1 < \frac{1}{2}$.
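The weak learner described above can be sketched directly. Everything concrete in this example is an assumption made for illustration: the 4-variable target 3-DNF, the balanced sample built from it, and the function names. Only the first-round (uniform) distribution is exercised.

```python
from itertools import combinations, product


def three_conjunctions(n):
    """Yield every conjunction of 3 literals over n Boolean variables
    as a tuple of (variable_index, required_value) pairs."""
    for idxs in combinations(range(n), 3):
        for vals in product((0, 1), repeat=3):
            yield tuple(zip(idxs, vals))


def predict(conj, x):
    """+1 if x satisfies every literal of the conjunction, else -1."""
    return 1 if all(x[i] == v for i, v in conj) else -1


def weak_learner(examples, dist):
    """Return the 3-conjunction with the smallest weighted error under dist."""
    best, best_err = None, float("inf")
    for conj in three_conjunctions(len(examples[0][0])):
        err = sum(d for (x, y), d in zip(examples, dist) if predict(conj, x) != y)
        if err < best_err:
            best, best_err = conj, err
    return best, best_err


def target(x):
    """Hypothetical target 3-DNF: (x0 and x1 and x2) or (not x0 and not x1 and x3)."""
    return 1 if (x[0] and x[1] and x[2]) or (not x[0] and not x[1] and x[3]) else -1


points = list(product((0, 1), repeat=4))
positives = [(x, 1) for x in points if target(x) == 1]
negatives = [(x, -1) for x in points if target(x) == -1][:len(positives)]
examples = positives + negatives
uniform = [1.0 / len(examples)] * len(examples)

conj, err = weak_learner(examples, uniform)
print(conj, err)
```

On this sample each term of the target covers two of the four positive examples and no negative ones, so the selected conjunction has weighted error 0.25 under the uniform distribution.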
(c) The following graph illustrating the performance of Adaboost appeared in Schapire et al., ICML 1997:
These results were obtained by boosting Quinlan’s C4.5 decision tree learning algorithm on a UCI dataset, varying the number of boosting rounds and recording the error of Adaboost’s hypothesis on both the training set and a separate testing set.
i. Which curve on the graph corresponds to training error and which to testing error? Label the graph with your response.
ii. What did learning theorists find unusual or interesting about these results?
Solution: The testing error continues to decrease even after the training error is 0.
iii. How can this interesting phenomenon be explained? Use evidence from the Adaboost learning algorithm to support your answer.
Solution: The adjustments made by Adaboost during each round of boosting are not a function of the final combined hypothesis’s error. Instead, they are a function of the weak learner’s error, which continues to be non-zero after the training error of the combined hypothesis reaches zero. Thus, the distribution kept by the algorithm continues to be modified, and useful features continue to be added to the combined hypothesis.
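The mechanism described in this answer can be observed in a toy run. The sketch below is not the C4.5 experiment from the paper: it boosts one-dimensional decision stumps on a made-up ten-point dataset, and it records the weak learner's weighted error next to the combined hypothesis's training error in each round. The dataset and all names are illustrative assumptions.

```python
import math


def stump_predict(stump, x):
    """A stump (t, s) predicts s for x > t and -s otherwise."""
    t, s = stump
    return s if x > t else -s


def best_stump(xs, ys, dist):
    """Exhaustively pick the stump with the smallest weighted error under dist."""
    best, best_err = None, float("inf")
    thresholds = [x - 0.5 for x in xs] + [max(xs) + 0.5]
    for t in thresholds:
        for s in (1, -1):
            err = sum(d for x, y, d in zip(xs, ys, dist) if stump_predict((t, s), x) != y)
            if err < best_err:
                best, best_err = (t, s), err
    return best, best_err


def combined(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps chosen so far."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


def adaboost(xs, ys, rounds):
    m = len(xs)
    dist = [1.0 / m] * m
    ensemble, history = [], []
    for _ in range(rounds):
        stump, eps = best_stump(xs, ys, dist)
        # eps stays strictly positive here because no single stump fits this data.
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # The reweighting depends only on the weak hypothesis, not on the combined error.
        dist = [d * math.exp(-alpha * y * stump_predict(stump, x))
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)
        dist = [d / z for d in dist]
        train_err = sum(1 for x, y in zip(xs, ys) if combined(ensemble, x) != y) / m
        history.append((eps, train_err))
    return ensemble, dist, history


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, dist, history = adaboost(xs, ys, rounds=20)
for t, (eps, train_err) in enumerate(history, start=1):
    print(t, round(eps, 3), train_err)
```

Because the labels change sign three times along the line, no single stump is ever perfect, so the weak error in every round is positive and the distribution keeps changing regardless of what the combined training error does.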