



Material Type: Assignment; Professor: Roth; Class: Machine Learning; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2008;




CS446: Pattern Recognition and Machine Learning Fall 2008
Solution Handed In: December 9, 2008
(a) Two directed trees $T_0$ and $T_1$ over variables $x_1, x_2, \ldots, x_n$ are equivalent iff the joint probability distributions they represent are the same. In other words:
$$P_{T_0}(x_1, x_2, \ldots, x_n) = P_{T_1}(x_1, x_2, \ldots, x_n)$$
This implies that for every event $E$ over $x_1, x_2, \ldots, x_n$:
$$P_{T_0}(E) = P_{T_1}(E)$$
(b) Let $T_i$ and $T_j$ be the two directed trees obtained by choosing two different roots $x_i$ and $x_j$ ($i \neq j$, $1 \le i, j \le n$) from the undirected tree $T$. Denoting $x = (x_1, x_2, \ldots, x_n)$, we would like to show that
$$P_{T_i}(x) = P_{T_j}(x)$$
Let $\pi_{x_k}$ be the parent of node $x_k$. Note that there is a unique path $P$ between nodes $x_i$ and $x_j$. Assume for now that the path has length 1. That is, there is an edge in $T$ between $x_i$ and $x_j$. Thus, the only difference between $T_i$ and $T_j$ is the direction of this edge.
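$$\begin{aligned}
P_{T_i}(x) &= P(x_i) \prod_{k=1,\, k \neq i}^{n} P(x_k \mid \pi_{x_k}) \\
&= P(x_i)\, P(x_j \mid x_i) \prod_{k=1,\, x_k \notin P}^{n} P(x_k \mid \pi_{x_k}) \\
&= P(x_i, x_j) \prod_{k=1,\, x_k \notin P}^{n} P(x_k \mid \pi_{x_k}) \\
&= P(x_j)\, P(x_i \mid x_j) \prod_{k=1,\, x_k \notin P}^{n} P(x_k \mid \pi_{x_k}) \\
&= P(x_j) \prod_{k=1,\, k \neq j}^{n} P(x_k \mid \pi_{x_k}) \\
&= P_{T_j}(x)
\end{aligned}$$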
Next, notice that if the path $P$ between $x_i$ and $x_j$ is longer, we maintain the property that edges not on the path have the same directionality in $T_i$ and $T_j$. We can therefore use the argument above inductively, transforming a tree rooted at $x_i$ to one rooted at $x_j$ by switching the directions of the edges in $P$ one edge at a time. As shown above, each of these steps preserves the joint distribution.
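The re-rooting argument can be checked numerically on a small case. The following is a minimal sketch, assuming a three-node chain $x_0 - x_1 - x_2$ of binary variables with randomly drawn conditional tables; the chain, the helper names `random_dist` and `marginal`, and the seed are illustrative assumptions, not part of the assignment. It builds the joint from the tree rooted at $x_0$, reads off the factors of the tree rooted at $x_2$ from that same joint, and compares the two products.

```python
import itertools
import random

random.seed(0)

def random_dist():
    """Random distribution over a binary variable, as [P(0), P(1)]."""
    t = random.uniform(0.1, 0.9)
    return [1 - t, t]

# Tree rooted at x0: P(x0) P(x1 | x0) P(x2 | x1).
p0 = random_dist()
p1_given_0 = [random_dist(), random_dist()]  # p1_given_0[a][b] = P(x1=b | x0=a)
p2_given_1 = [random_dist(), random_dist()]  # p2_given_1[b][c] = P(x2=c | x1=b)

states = list(itertools.product((0, 1), repeat=3))
joint_root0 = {
    (a, b, c): p0[a] * p1_given_0[a][b] * p2_given_1[b][c] for a, b, c in states
}

def marginal(joint, keep):
    """Marginal of the joint over the variable positions listed in keep."""
    out = {}
    for state, prob in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

# Tree rooted at x2: P(x2) P(x1 | x2) P(x0 | x1), every factor taken from the same joint.
m2 = marginal(joint_root0, [2])
m12 = marginal(joint_root0, [1, 2])
m1 = marginal(joint_root0, [1])
m01 = marginal(joint_root0, [0, 1])
joint_root2 = {
    (a, b, c): m2[(c,)] * (m12[(b, c)] / m2[(c,)]) * (m01[(a, b)] / m1[(b,)])
    for a, b, c in states
}

# Largest pointwise difference between the two factorizations.
max_gap = max(abs(joint_root0[s] - joint_root2[s]) for s in states)
```

Up to floating-point rounding, `max_gap` should be zero, which is the two-edge instance of the inductive step: each edge on the path is flipped via $P(x_i)P(x_j \mid x_i) = P(x_j)P(x_i \mid x_j)$.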
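(a) Using the total probability rule and the fact that the $x_i$ are conditionally independent given $Y$:
$$\begin{aligned}
P(x^{(j)}) &= P(x^{(j)} \mid Y^{(j)} = 1)\, P(Y^{(j)} = 1) + P(x^{(j)} \mid Y^{(j)} = 2)\, P(Y^{(j)} = 2) \\
&= P(Y^{(j)} = 1) \prod_{i=1}^{n} P(x_i^{(j)} \mid Y^{(j)} = 1) + P(Y^{(j)} = 2) \prod_{i=1}^{n} P(x_i^{(j)} \mid Y^{(j)} = 2) \\
&= p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} + (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}
\end{aligned}$$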
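As a sanity check on the mixture form of $P(x^{(j)})$, here is a minimal sketch in Python. The parameter values and the function names `bernoulli_product` and `mixture_likelihood` are hypothetical choices for illustration; the only thing taken from the derivation is the formula itself. A valid likelihood must sum to 1 over all $2^n$ binary vectors, which the last line computes.

```python
import itertools
import math

def bernoulli_product(x, theta):
    """prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i) for a binary vector x."""
    return math.prod(t if xi == 1 else 1 - t for xi, t in zip(x, theta))

def mixture_likelihood(x, p, alpha, beta):
    """P(x) = p * P(x | Y=1) + (1 - p) * P(x | Y=2)."""
    return p * bernoulli_product(x, alpha) + (1 - p) * bernoulli_product(x, beta)

# Hypothetical parameters for n = 2 features.
p = 0.3
alpha = [0.9, 0.2]
beta = [0.4, 0.5]

example = mixture_likelihood([1, 0], p, alpha, beta)

# Summing P(x) over every binary vector of length 2.
total = sum(
    mixture_likelihood(x, p, alpha, beta) for x in itertools.product((0, 1), repeat=2)
)
```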
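(b) Using Bayes' rule, we have:
$$q_1^{(j)} = P(Y^{(j)} = 1 \mid x^{(j)}) = \frac{P(x^{(j)} \mid Y^{(j)} = 1)\, P(Y^{(j)} = 1)}{P(x^{(j)})} = \frac{p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})}}{p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} + (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}}$$
Similarly:
$$q_2^{(j)} = P(Y^{(j)} = 2 \mid x^{(j)}) = \frac{P(x^{(j)} \mid Y^{(j)} = 2)\, P(Y^{(j)} = 2)}{P(x^{(j)})} = \frac{(1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}}{p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} + (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}}$$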
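The two posteriors share a denominator, so they can be computed together. The sketch below assumes the same hypothetical parameters as a two-feature toy model; the function name `posteriors` is an illustrative choice.

```python
import math

def posteriors(x, p, alpha, beta):
    """Return (q1, q2) = (P(Y=1 | x), P(Y=2 | x)) for one binary vector x."""
    joint1 = p * math.prod(a if xi == 1 else 1 - a for xi, a in zip(x, alpha))
    joint2 = (1 - p) * math.prod(b if xi == 1 else 1 - b for xi, b in zip(x, beta))
    evidence = joint1 + joint2  # P(x), by the total probability rule
    return joint1 / evidence, joint2 / evidence

# Hypothetical parameters and a single observation.
q1, q2 = posteriors([1, 0], 0.3, [0.9, 0.2], [0.4, 0.5])
```

Here `joint1` is $0.3 \cdot 0.9 \cdot 0.8 = 0.216$ and `joint2` is $0.7 \cdot 0.4 \cdot 0.5 = 0.14$, so `q1` is $0.216 / 0.356$ and the two values sum to 1.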
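(d) To maximize the log likelihood, we set the partial derivatives with respect to the parameters equal to 0. Note that $q_1^{(j)} = 1 - q_2^{(j)}$.
$$\frac{dE}{d\tilde{p}} = \sum_{j=1}^{m} \left[ \frac{q_1^{(j)}}{\tilde{p}} - \frac{q_2^{(j)}}{1 - \tilde{p}} \right] = 0 \;\Rightarrow\; \tilde{p} = \frac{1}{m} \sum_{j=1}^{m} q_1^{(j)}$$
$$\frac{dE}{d\tilde{\alpha}_i} = \sum_{j=1}^{m} q_1^{(j)} \left[ \frac{x_i^{(j)}}{\tilde{\alpha}_i} - \frac{1 - x_i^{(j)}}{1 - \tilde{\alpha}_i} \right] = 0 \;\Rightarrow\; \tilde{\alpha}_i = \frac{\sum_{j=1}^{m} x_i^{(j)} q_1^{(j)}}{\sum_{j=1}^{m} q_1^{(j)}}$$
$$\frac{dE}{d\tilde{\beta}_i} = \sum_{j=1}^{m} q_2^{(j)} \left[ \frac{x_i^{(j)}}{\tilde{\beta}_i} - \frac{1 - x_i^{(j)}}{1 - \tilde{\beta}_i} \right] = 0 \;\Rightarrow\; \tilde{\beta}_i = \frac{\sum_{j=1}^{m} x_i^{(j)} q_2^{(j)}}{\sum_{j=1}^{m} q_2^{(j)}}$$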
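The closed-form updates translate directly into code. This is a minimal sketch, assuming the data are given as a list of binary rows and the posteriors $q_1^{(j)}$ as a list of floats; the name `m_step` and the toy data are hypothetical. Hard 0/1 posteriors are used in the example so the result can be read off by eye: each estimate reduces to a plain average over the rows assigned to that class.

```python
def m_step(X, q1):
    """Updates for p, alpha_i, beta_i given data rows X and posteriors q1."""
    m, n = len(X), len(X[0])
    q2 = [1.0 - q for q in q1]          # q2 = 1 - q1
    s1, s2 = sum(q1), sum(q2)
    p_new = s1 / m                      # average of q1 over all data
    alpha_new = [sum(q * row[i] for q, row in zip(q1, X)) / s1 for i in range(n)]
    beta_new = [sum(q * row[i] for q, row in zip(q2, X)) / s2 for i in range(n)]
    return p_new, alpha_new, beta_new

# Toy data: rows 0 and 2 fully assigned to Y = 1, rows 1 and 3 to Y = 2.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
q1 = [1.0, 0.0, 1.0, 0.0]
p_new, alpha_new, beta_new = m_step(X, q1)
```

The sketch does not guard against a class whose posteriors sum to zero, in which case the corresponding division is undefined.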
(e) From the results in (d), the update rule indicates that the best estimate for $\tilde{p}$ is the average of the $q_1^{(j)}$ over all data, and the best estimates for the $\tilde{\alpha}_i$ and $\tilde{\beta}_i$ are weighted averages of $x_i^{(j)}$ by $q_1^{(j)}$ and $q_2^{(j)}$ respectively.
To run the algorithm:
i. Initialize with random values for $\tilde{p}$, $\tilde{\alpha}_i$, and $\tilde{\beta}_i$ for all $i$.
ii. Calculate $q_1^{(j)}$ and $q_2^{(j)}$ as shown in (b).
iii. Find the new values for $\tilde{p}$, $\tilde{\alpha}_i$, and $\tilde{\beta}_i$ using the update rules derived in (d).
iv. Repeat (ii) and (iii) until convergence.
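Steps i to iv can be sketched as a short loop. Several details below are assumptions made only for illustration: the initialization range, the convergence test (largest parameter change below `tol`), the iteration cap, and the synthetic data generator with its "true" parameters. None of these are specified in the solution.

```python
import math
import random

def bernoulli_product(x, theta):
    """prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i) for a binary vector x."""
    return math.prod(t if xi == 1 else 1 - t for xi, t in zip(x, theta))

def run_em(X, n_iter=200, tol=1e-9, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # i. Random initialization (range chosen arbitrarily, away from 0 and 1).
    p = rng.uniform(0.25, 0.75)
    alpha = [rng.uniform(0.25, 0.75) for _ in range(n)]
    beta = [rng.uniform(0.25, 0.75) for _ in range(n)]
    for _ in range(n_iter):
        # ii. Posteriors q1, q2 from Bayes' rule.
        q1 = []
        for x in X:
            j1 = p * bernoulli_product(x, alpha)
            j2 = (1 - p) * bernoulli_product(x, beta)
            q1.append(j1 / (j1 + j2))
        q2 = [1 - q for q in q1]
        # iii. Closed-form parameter updates.
        s1, s2 = sum(q1), sum(q2)
        p_new = s1 / m
        alpha_new = [sum(q * x[i] for q, x in zip(q1, X)) / s1 for i in range(n)]
        beta_new = [sum(q * x[i] for q, x in zip(q2, X)) / s2 for i in range(n)]
        # iv. Stop once no parameter moves by more than tol (assumed criterion).
        change = max(
            [abs(p_new - p)]
            + [abs(a - b) for a, b in zip(alpha_new + beta_new, alpha + beta)]
        )
        p, alpha, beta = p_new, alpha_new, beta_new
        if change < tol:
            break
    return p, alpha, beta

# Synthetic data from a hypothetical two-class model with n = 3 features.
data_rng = random.Random(1)
X = []
for _ in range(200):
    theta = [0.9, 0.8, 0.1] if data_rng.random() < 0.4 else [0.2, 0.1, 0.7]
    X.append([1 if data_rng.random() < t else 0 for t in theta])

p_hat, alpha_hat, beta_hat = run_em(X)
column_means = [sum(row[i] for row in X) / len(X) for i in range(3)]
```

One property that follows from the update rules and is easy to check: after any update, $\tilde{p}\,\tilde{\alpha}_i + (1 - \tilde{p})\,\tilde{\beta}_i$ equals the empirical mean of feature $i$. Since the two classes are interchangeable, the fitted classes may come out with their labels swapped relative to the generator.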
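(f) When $Y$ is unobserved, the conditional odds that $Y = 1$ can be written as:
$$\frac{P(Y = 1 \mid X_1 = x_1, \ldots, X_n = x_n)}{P(Y = 2 \mid X_1 = x_1, \ldots, X_n = x_n)} = \frac{P(Y = 1)\, P(x_1, \ldots, x_n \mid Y = 1)}{P(Y = 2)\, P(x_1, \ldots, x_n \mid Y = 2)} = \frac{p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})}}{(1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}}$$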
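Thus, we can predict the value of $Y$ by checking the condition:
$$P(Y = 1 \mid X_1 = x_1, \ldots, X_n = x_n) > P(Y = 2 \mid X_1 = x_1, \ldots, X_n = x_n)$$
We predict $Y = 1$ if:
$$p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} > (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}$$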
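The prediction rule compares the two unnormalized posteriors, so the shared denominator $P(x)$ never needs to be computed. A minimal sketch follows, with hypothetical parameter values and an illustrative function name `predict`; ties are sent to class 2 here, which is an arbitrary choice.

```python
import math

def predict(x, p, alpha, beta):
    """Return 1 if the odds of Y = 1 given x exceed 1, otherwise 2."""
    score1 = p * math.prod(a if xi == 1 else 1 - a for xi, a in zip(x, alpha))
    score2 = (1 - p) * math.prod(b if xi == 1 else 1 - b for xi, b in zip(x, beta))
    return 1 if score1 > score2 else 2

# Hypothetical parameters for n = 2 features.
p, alpha, beta = 0.3, [0.9, 0.2], [0.4, 0.5]
labels = [predict(x, p, alpha, beta) for x in ([1, 0], [0, 1])]
```

For $x = (1, 0)$ the scores are $0.216$ against $0.14$, giving class 1; for $x = (0, 1)$ they are $0.006$ against $0.21$, giving class 2.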
(g) Continuing from (f), the hypothesis predicts $Y = 1$ when:
$$p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} > (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})}$$
To show that this decision rule has a linear decision surface, it suffices to show that we can write it as a sum like
$$\sum_{i=1}^{n} w_i x_i > \theta$$
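Since we have a product involving $x_i$ and not a sum, we start by applying log to both sides of the inequality:
$$\log \left[ p \prod_{i=1}^{n} \alpha_i^{x_i^{(j)}} (1 - \alpha_i)^{(1 - x_i^{(j)})} \right] > \log \left[ (1 - p) \prod_{i=1}^{n} \beta_i^{x_i^{(j)}} (1 - \beta_i)^{(1 - x_i^{(j)})} \right]$$
Expanding the logarithms, with the parameters written as the learned estimates $\tilde{\alpha}_i$ and $\tilde{\beta}_i$:
$$c_0 + \sum_{i=1}^{n} \left[ x_i^{(j)} \log \tilde{\alpha}_i + (1 - x_i^{(j)}) \log(1 - \tilde{\alpha}_i) \right] > c_1 + \sum_{i=1}^{n} \left[ x_i^{(j)} \log \tilde{\beta}_i + (1 - x_i^{(j)}) \log(1 - \tilde{\beta}_i) \right]$$
$$c_2 + \sum_{i=1}^{n} x_i^{(j)} \log \frac{\tilde{\alpha}_i}{1 - \tilde{\alpha}_i} > c_3 + \sum_{i=1}^{n} x_i^{(j)} \log \frac{\tilde{\beta}_i}{1 - \tilde{\beta}_i}$$
$$\sum_{i=1}^{n} x_i^{(j)} \left[ \log \frac{\tilde{\alpha}_i}{1 - \tilde{\alpha}_i} - \log \frac{\tilde{\beta}_i}{1 - \tilde{\beta}_i} \right] > c_3 - c_2$$
Along the way, we have substituted constants $c_j$ for terms that do not depend on any $x_i$.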
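The linear form can be checked against the product form by enumeration. The sketch below reads the constants as $c_2 = \log p + \sum_i \log(1 - \alpha_i)$ and $c_3 = \log(1 - p) + \sum_i \log(1 - \beta_i)$, which is one consistent way to collect the terms that do not depend on $x$; the parameter values and the function names are hypothetical.

```python
import itertools
import math

def bernoulli_product(x, theta):
    """prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i) for a binary vector x."""
    return math.prod(t if xi == 1 else 1 - t for xi, t in zip(x, theta))

def linear_rule(p, alpha, beta):
    """Weights w_i and threshold theta of the rule sum_i w_i x_i > theta."""
    w = [math.log(a / (1 - a)) - math.log(b / (1 - b)) for a, b in zip(alpha, beta)]
    c2 = math.log(p) + sum(math.log(1 - a) for a in alpha)
    c3 = math.log(1 - p) + sum(math.log(1 - b) for b in beta)
    return w, c3 - c2

# Hypothetical parameters for n = 3 features.
p, alpha, beta = 0.3, [0.9, 0.2, 0.6], [0.4, 0.5, 0.7]
w, theta = linear_rule(p, alpha, beta)

# Compare the two rules on every binary vector of length 3.
agree = all(
    (p * bernoulli_product(x, alpha) > (1 - p) * bernoulli_product(x, beta))
    == (sum(wi * xi for wi, xi in zip(w, x)) > theta)
    for x in itertools.product((0, 1), repeat=3)
)
```

The weights are differences of log-odds, so a feature pushes toward $Y = 1$ exactly when it is more likely to be 1 under class 1 than under class 2.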