



Material Type: Exam; Class: Machine Learning; Subject: Engineering Computer Science; University: University of California - Davis; Term: Spring 2004;




Closed book
Spring 2004
Show all work clearly and legibly. Remember, you are being tested. So even if an
answer is obvious to you, please show all the justification by clearly showing the
calculations, or explaining why a calculation is skipped.
1. True or False? (9 points each)
(a) In the PAC learning model, the learner makes no assumptions about the
class from which the target concept is drawn. (False)
(b) In PAC learning, the learner outputs the hypothesis from H that has
the least error (possibly zero) over the training data. (False)
(c) The number of training examples required for successful learning is
strongly influenced by the complexity of the hypothesis space
considered by the learner. (True)
2. (15 points) Illustrate your understanding of the back propagation method by
explicitly showing all steps of the calculations with respect to a single-neuron
with a sigmoidal nonlinearity. Assume that you are at the output stage of the
network. The objective is for the unit to learn a single input pattern, namely
i = (i1, i2) = (1, 4).
The desired output is o = 1. Initially assume w1 = w2 = 0. Use a learning rate
eta = 1.0. Show all the calculations for two iterations. Show the weight values at
the end of the first and second iterations. In what direction is the weight
vector moving from iteration to iteration?
Solution:
Here del = (desired - output) * output * (1 - output), and the input pattern is i1 = 1, i2 = 4.
1st iteration: netinput = 0, output = 1/(1 + exp(0)) = 1/2, error = (1-0.5)**2 = 0.25
delta-w1 = eta*del*i1 = 1.0*(1-0.5)(0.5)(1-0.5) i1 = 0.125 i1 = 0.125
delta-w2 = eta*del*i2 = 1.0*(1-0.5)(0.5)(1-0.5) i2 = 0.125 i2 = 0.5
new weights are 0.125 and 0.5
2nd iteration: netinput = 0.125*1 + 0.5*4 = 2.125, output = 1/(1 + exp(-2.125)) = 0.893, error = (1-0.893)**2 = 0.0114
delta-w1 = eta*del*i1 = 1.0*(1-0.893)(0.893)(1-0.893) i1 = 0.0102 i1 = 0.0102
delta-w2 = eta*del*i2 = 0.0102 i2 = 0.041 (approximately)
new weights are 0.135 and 0.541
The weight vector is moving toward the input vector.
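
The two iterations can be checked numerically. What follows is a minimal sketch rather than part of the exam solution. It assumes the input pattern is (i1, i2) = (1, 4), which is the pattern consistent with the first-iteration weight 0.125 and the second-iteration net input 2.125. The function name `train_single_neuron` is made up for illustration.

```python
import math

def sigmoid(x):
    """Logistic nonlinearity used by the output unit."""
    return 1.0 / (1.0 + math.exp(-x))

def train_single_neuron(inputs, target, eta=1.0, iterations=2):
    """Delta-rule updates for one sigmoid output unit, starting from zero weights."""
    weights = [0.0] * len(inputs)
    history = []
    for _ in range(iterations):
        net = sum(w * x for w, x in zip(weights, inputs))
        out = sigmoid(net)
        # del = (desired - output) * output * (1 - output)
        delta = (target - out) * out * (1.0 - out)
        weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
        history.append((net, out, list(weights)))
    return history

# Assumed input pattern (1, 4); desired output 1; learning rate 1.0.
history = train_single_neuron([1.0, 4.0], target=1.0)
for step, (net, out, weights) in enumerate(history, start=1):
    rounded = [round(w, 3) for w in weights]
    print(f"iteration {step}: net={net:.3f} output={out:.3f} weights={rounded}")
```

Because the weights start at zero and every update is a multiple of the input, the weight vector stays parallel to the input pattern and only grows in length.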
3. (8 points) Suppose H is a set of possible hypotheses and D is a set of training
data. We would like our program to output the most probable hypothesis h
from H, given the data D. Under what conditions does the following hold?
$\arg\max_{h \in H} P(H \mid D) = \arg\max_{h \in H} P(D \mid H)$
Solution: First, there is a typo. The H inside each probability should be h, as under
the arg max. But this is a minor thing and it did not bother any of you. So let us proceed.
The starting formula is
$\arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}$
P(D) can be dropped because it does not depend on h.
P(h) can be treated as a constant if all the hypotheses in the hypothesis space are equally
likely.
Under these conditions, both sides are equal, as stated in the question.
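
A small numerical illustration of this condition, as a sketch only: the hypothesis names and the likelihood values below are made up, and the helper names `map_hypothesis` and `ml_hypothesis` are hypothetical. With a uniform prior the two arg max choices agree; with a skewed prior they can differ.

```python
def map_hypothesis(likelihood, prior):
    """arg max over h of P(D | h) * P(h); P(D) is the same for every h and is omitted."""
    return max(likelihood, key=lambda h: likelihood[h] * prior[h])

def ml_hypothesis(likelihood):
    """arg max over h of P(D | h)."""
    return max(likelihood, key=likelihood.get)

# Made-up values of P(D | h) for three hypothetical hypotheses.
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.3}

uniform_prior = {h: 1 / 3 for h in likelihood}
skewed_prior = {"h1": 0.8, "h2": 0.1, "h3": 0.1}

print(ml_hypothesis(likelihood))                  # maximum likelihood choice
print(map_hypothesis(likelihood, uniform_prior))  # same choice under a uniform prior
print(map_hypothesis(likelihood, skewed_prior))   # may differ under a non-uniform prior
```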
5. (a) (12 points) Build a decision tree to classify the following patterns. Show
all the calculations systematically or explain why certain calculations are
skipped.
Pattern (x1, x2, x3)    Class
(b) (2 points) What Boolean function is the above tree implementing?
Solution:
A plot of the 8 points along x1, x2 and x3 gives an idea of how to solve this.
The initial uncertainty of all 8 points is
-(6/8) log2 (6/8) - (2/8) log2 (2/8) = 0.81
Suppose we divide the points by drawing a plane perpendicular to the x1-axis (i.e., parallel to the
x2-x3 plane). Then the left branch has 4 points all belonging to the same class and the
right branch has two of each class. So the uncertainty of the left branch is
-(4/4) log2 (4/4) - (0/4) log2 (0/4) = 0 (taking 0 log2 0 = 0)
The uncertainty of the right branch is
-(2/4) log2 (2/4) - (2/4) log2 (2/4) = 1
Average uncertainty after the first test (on x1) is (4/8)(0) + (4/8)(1) = 0.5
Uncertainty reduction achieved is 0.81 - 0.5 = 0.31
Do a similar thing along x2 and x3 and find out that a test along x3 gives exactly the same
uncertainty and a test along x2 gives no improvement at all. So first choose either x1 or
x3.
The decision tree really implements f = x1x3.
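
The uncertainty calculations can be reproduced with a short script. This is a sketch under an assumption: the pattern table is not reproduced here, so the data set is taken to be the full truth table of f = x1 AND x3, which matches the 6-to-2 class split and the branch counts used in the solution. The helper names are illustrative.

```python
import math
from itertools import product

def entropy(labels):
    """Uncertainty (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result

def information_gain(rows, labels, attribute):
    """Uncertainty reduction obtained by testing one attribute (given by index)."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Assumed data: all 8 Boolean patterns, labeled by x1 AND x3.
rows = list(product([0, 1], repeat=3))
labels = [x1 & x3 for x1, x2, x3 in rows]

gains = [information_gain(rows, labels, a) for a in range(3)]
for name, gain in zip(("x1", "x2", "x3"), gains):
    print(f"gain({name}) = {gain:.3f}")
```

Under this assumption the tests on x1 and x3 give the same gain and the test on x2 gives none, so either x1 or x3 is a valid root.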
(c) (5 points) Consider a decision tree built from an arbitrary set of data. If the
output is discrete-valued and can take on k different possible values, what is the
maximum training set error (expressed as a fraction) that any data set could
possibly have?
Suggested Solution: The answer is (k-1)/k. Consider data sets with identical inputs whose
outputs are evenly distributed among the k classes. No test can separate the examples, so out
of every k of them we will always get one correct classification and (k-1) erroneous classifications.
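
The worst case can be illustrated with a few lines of code. This is only a sketch: it models the tree as a single leaf that predicts its majority class, which is all a tree can do when every example has the same inputs, and the function name `leaf_training_error` is made up.

```python
from collections import Counter
from fractions import Fraction

def leaf_training_error(labels):
    """Training error of a single leaf that predicts its majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return Fraction(len(labels) - majority_count, len(labels))

# Identical inputs, outputs spread evenly over k classes (4 examples per class).
for k in (2, 3, 5):
    labels = list(range(k)) * 4
    print(k, leaf_training_error(labels))
```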
6. (12 points) Imagine that you are given the following set of training examples.
All the features are Boolean-valued.
F1 F2 F3 Result
T T F +
F T T +
T F T -
F T F -
F F T -
How would a Naive Bayes approach classify the following test example?
Be sure to show your work.
F1 = T F2 = F F3 = F
Solution: There are only two possible answers, + and -. So it is possible that you can toss a coin
and guess the answer and be on the correct side 50% of the time. Therefore, it becomes important
that you show all calculations and that they be correct too, to justify your answer.
Furthermore, one of the probability terms is zero. This makes it doubly dangerous because you
can get the correct classification despite a horde of calculation errors.
From the historical data given to you, P(+) = 2/5 = 0.4 and P(-) = 3/5 = 0.6
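You simply have to calculate arg max over vj of P(vj) P(F1=T|vj) P(F2=F|vj) P(F3=F|vj), which
follows from the Naive Bayes assumption. Note that vj can assume only two values, + and -.
P(vj = +) * P(F1=T | +) P(F2=F | +) P(F3=F | +) = (2/5)(1/2)(0/2)(1/2) = 0
P(vj = -) * P(F1=T | -) P(F2=F | -) P(F3=F | -) = (3/5)(1/3)(2/3)(1/3) = 2/45 = 0.044
The larger product belongs to -, so Naive Bayes classifies the test example as -.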
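
The two products can be computed directly from the training table. The following is a minimal sketch, assuming plain relative-frequency estimates with no smoothing; the function name `naive_bayes_scores` is made up for illustration.

```python
from fractions import Fraction

# Training examples from the table: (F1, F2, F3) and the result.
training = [
    (("T", "T", "F"), "+"),
    (("F", "T", "T"), "+"),
    (("T", "F", "T"), "-"),
    (("F", "T", "F"), "-"),
    (("F", "F", "T"), "-"),
]

def naive_bayes_scores(examples, query):
    """Return P(v) * product of P(Fi = query[i] | v) for each class v (no smoothing)."""
    scores = {}
    for label in sorted({lab for _, lab in examples}):
        rows = [feats for feats, lab in examples if lab == label]
        score = Fraction(len(rows), len(examples))  # prior P(v)
        for index, value in enumerate(query):
            matches = sum(1 for feats in rows if feats[index] == value)
            score *= Fraction(matches, len(rows))   # P(Fi = value | v)
        scores[label] = score
    return scores

scores = naive_bayes_scores(training, ("T", "F", "F"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Exact fractions are used so that the zero term for the + class stays exactly zero instead of being hidden by rounding.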