Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
February 25, 2015
Today:
- Graphical models
- Bayes Nets:
- Inference
- Learning
- EM
Readings:
- Bishop chapter 8
- Mitchell chapter 6
Midterm
- In class on Monday, March 2
- Closed book
- You may bring an 8.5x11 “cheat sheet” of notes
- Covers all material through today
- Be sure to come on time. We’ll start precisely at 12 noon
What You Should Know
- Bayes nets are a convenient representation for encoding dependencies / conditional independence
- BN = Graph plus parameters of CPDs
- Defines joint distribution over variables
- Can calculate everything else from that
- Though inference may be intractable
- Reading conditional independence relations from the graph
- Each node is cond indep of its non-descendants, given only its parents
- X and Y are conditionally independent given Z if Z D-separates every path connecting X to Y
- Marginal independence: special case where Z={}
Inference in Bayes Nets
- In general, intractable (NP-hard)
- For certain cases, tractable
- Assigning probability to fully observed set of variables
- Or if just one variable unobserved
- Or for singly connected graphs (i.e., no undirected loops)
- Sometimes use Monte Carlo methods
- Generate many samples according to the Bayes Net distribution, then count up the results
- Variational methods for tractable approximate solutions
Prob. of joint assignment: easy
- Suppose we are interested in joint assignment <F=f,A=a,S=s,H=h,N=n>
- What is P(f,a,s,h,n)? Multiply one CPD entry per node: P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s)
Let’s use p(a,b) as shorthand for p(A=a, B=b)
Prob. of marginals: not so easy
- How do we calculate P(N=n)? Sum the joint over every value of the other variables: P(N=n) = Σ_f Σ_a Σ_s Σ_h P(f,a,s,h,n)
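To make the contrast between the two slides concrete, here is a minimal Python sketch. It assumes the Flu/Allergy/Sinus/Headache/Nose network with Boolean variables; the CPT numbers and the helper names (`bern`, `joint`, `marginal_n`) are hypothetical, chosen only for illustration. The joint is a single product of CPT entries, while the marginal needs a sum over every setting of the other variables.

```python
from itertools import product

# Hypothetical CPT values, chosen only for illustration (not from the lecture).
P_F1 = 0.1                                                     # P(F=1)
P_A1 = 0.2                                                     # P(A=1)
P_S1 = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}   # P(S=1 | F=f, A=a)
P_H1 = {0: 0.1, 1: 0.8}                                        # P(H=1 | S=s)
P_N1 = {0: 0.05, 1: 0.7}                                       # P(N=1 | S=s)


def bern(p_one, value):
    """Probability that a Boolean variable with P(1)=p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(f, a, s, h, n):
    """P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s): one CPT entry per node."""
    return (bern(P_F1, f) * bern(P_A1, a) * bern(P_S1[(f, a)], s)
            * bern(P_H1[s], h) * bern(P_N1[s], n))


def marginal_n(n):
    """P(N=n): sum the joint over all 2^4 settings of F, A, S, H."""
    return sum(joint(f, a, s, h, n) for f, a, s, h in product((0, 1), repeat=4))


print(joint(1, 0, 1, 1, 1))
print(marginal_n(1), marginal_n(0))
```

With five Boolean variables the sum has only 16 terms, but brute-force enumeration grows exponentially with the number of summed-out variables, which is why marginals are "not so easy" in general.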
Generating a sample from joint distribution: easy
How can we generate random samples drawn according to P(F,A,S,H,N)?
Hint: to draw a random sample of F according to P(F=1) = θ_{F=1}:
- draw a value of r uniformly from [0,1]
- if r < θ_{F=1} then output F=1, else F=0
Solution:
- draw a random value f for F, using its CPD
- then draw values for A, for S|A,F, for H|S, for N|S
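The procedure above can be sketched in Python as follows. The CPT numbers and the names `flip` and `sample_joint` are hypothetical, used only for illustration; the one thing that matters is that each variable is drawn after its parents, so its CPD row is already determined.

```python
import random

# Hypothetical CPT values, chosen only for illustration (not from the lecture).
P_F1 = 0.1                                                     # P(F=1)
P_A1 = 0.2                                                     # P(A=1)
P_S1 = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}   # P(S=1 | F=f, A=a)
P_H1 = {0: 0.1, 1: 0.8}                                        # P(H=1 | S=s)
P_N1 = {0: 0.05, 1: 0.7}                                       # P(N=1 | S=s)


def flip(p_one, rng):
    """Draw r uniformly from [0, 1); output 1 if r < p_one, else 0."""
    r = rng.random()
    return 1 if r < p_one else 0


def sample_joint(rng):
    """Draw one sample (f, a, s, h, n), visiting parents before children."""
    f = flip(P_F1, rng)             # F has no parents
    a = flip(P_A1, rng)             # A has no parents
    s = flip(P_S1[(f, a)], rng)     # S | F, A
    h = flip(P_H1[s], rng)          # H | S
    n = flip(P_N1[s], rng)          # N | S
    return f, a, s, h, n


rng = random.Random(0)              # fixed seed so the run is repeatable
samples = [sample_joint(rng) for _ in range(5)]
print(samples)
```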
Generating a sample from joint distribution: easy
Note we can estimate marginals like P(N=n) by generating many samples from the joint distribution, then counting the fraction of samples for which N=n
Similarly for anything else we care about, e.g. P(F=1|H=1, N=0)
→ weak but general method for estimating any probability term…
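A minimal sketch of this counting estimate, under the same assumptions as before: Boolean variables, hypothetical CPT numbers, and illustrative function names (`sample_joint`, `estimate`). For a conditional query, samples that disagree with the evidence are simply discarded, which is one straightforward way to realize "count up the results".

```python
import random

# Hypothetical CPT values, chosen only for illustration (not from the lecture).
P_F1 = 0.1                                                     # P(F=1)
P_A1 = 0.2                                                     # P(A=1)
P_S1 = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}   # P(S=1 | F=f, A=a)
P_H1 = {0: 0.1, 1: 0.8}                                        # P(H=1 | S=s)
P_N1 = {0: 0.05, 1: 0.7}                                       # P(N=1 | S=s)


def sample_joint(rng):
    """Draw one joint sample, parents first, then children."""
    f = int(rng.random() < P_F1)
    a = int(rng.random() < P_A1)
    s = int(rng.random() < P_S1[(f, a)])
    h = int(rng.random() < P_H1[s])
    n = int(rng.random() < P_N1[s])
    return {"F": f, "A": a, "S": s, "H": h, "N": n}


def estimate(query, evidence, num_samples, rng):
    """Estimate P(query | evidence) as a ratio of sample counts."""
    kept = hits = 0
    for _ in range(num_samples):
        x = sample_joint(rng)
        # Keep only samples that agree with the evidence (all of them if it is empty).
        if all(x[var] == val for var, val in evidence.items()):
            kept += 1
            hits += all(x[var] == val for var, val in query.items())
    return hits / kept if kept else float("nan")


rng = random.Random(0)
p_n1 = estimate({"N": 1}, {}, 100_000, rng)
p_f1_given_h1_n0 = estimate({"F": 1}, {"H": 1, "N": 0}, 100_000, rng)
print(p_n1, p_f1_given_h1_n0)
```

The method is "weak" in the sense that only samples matching the evidence contribute, so rare evidence leaves few usable samples and a noisy estimate.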
Learning of Bayes Nets
- Four categories of learning problems
- Graph structure may be known/unknown
- Variable values may be fully observed / partly unobserved
- Easy case: learn parameters when graph structure is known, and data is fully observed
- Interesting case: graph known, data partly known
- Gruesome case: graph structure unknown, data partly unobserved
Learning CPTs from Fully Observed Data
[Figure: Bayes net with edges Flu → Sinus, Allergy → Sinus, Sinus → Headache, Sinus → Nose]
- Example: Consider learning the parameter θ_{s|ij} ≡ P(S=1 | F=i, A=j)
- Max Likelihood Estimate is θ_{s|ij} = Σ_k δ(f_k=i, a_k=j, s_k=1) / Σ_k δ(f_k=i, a_k=j), where k indexes the kth training example and δ(x) = 1 if x=true, = 0 if x=false
- Remember why?
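As a small illustration of this kind of estimate, the sketch below computes the maximum likelihood estimate of P(S=1 | F=i, A=j) by counting: the fraction of training examples with F=i, A=j that also have S=1. The function name `mle_theta_s`, the dictionary layout of an example, and the four-row data set are all made up for illustration.

```python
def mle_theta_s(data, i, j):
    """MLE of P(S=1 | F=i, A=j) from fully observed Boolean examples.

    data: list of dicts with 0/1 entries under the keys "F", "A", "S".
    Returns None when no example has F=i, A=j (the ratio is undefined).
    """
    matches = [ex for ex in data if ex["F"] == i and ex["A"] == j]   # denominator count
    if not matches:
        return None
    with_s = sum(ex["S"] for ex in matches)                          # numerator count
    return with_s / len(matches)


# Tiny made-up data set (hypothetical, for illustration only).
data = [
    {"F": 1, "A": 0, "S": 1, "H": 1, "N": 1},
    {"F": 1, "A": 0, "S": 0, "H": 0, "N": 0},
    {"F": 1, "A": 0, "S": 1, "H": 1, "N": 0},
    {"F": 0, "A": 0, "S": 0, "H": 0, "N": 0},
]

# Two of the three examples with F=1, A=0 have S=1.
print(mle_theta_s(data, 1, 0))
```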
Estimate from partly observed data
- What if F, A, H, N are observed, but not S?
- Let X be all observed variable values (over all examples)
- Let Z be all unobserved variable values
- Can’t calculate MLE, because Z is not observed: θ ← argmax_θ log P(X, Z | θ)
- WHAT TO DO?
- EM seeks* to estimate: θ ← argmax_θ E_{Z|X,θ}[log P(X, Z | θ)]
* EM is guaranteed to find a local maximum
EM Algorithm - Informally
EM is a general procedure for learning from partly observed data
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Begin with arbitrary choice for parameters θ
Iterate until convergence:
- E Step: estimate the values of unobserved Z, using θ
- M Step: use observed values plus E-step estimates to derive a better θ
Guaranteed to find local maximum. Each iteration increases E_{P(Z|X,θ)}[log P(X, Z | θ')]
EM Algorithm - Precisely
EM is a general procedure for learning from partly observed data
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Define Q(θ'|θ) = E_{P(Z|X,θ)}[log P(X, Z | θ')]
Iterate until convergence:
- E Step: Use X and current θ to calculate P(Z|X,θ)
- M Step: Replace current θ by θ ← argmax_{θ'} Q(θ'|θ)
Guaranteed to find local maximum. Each iteration increases E_{P(Z|X,θ)}[log P(X, Z | θ')]
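For the running example (F, A, H, N observed, S hidden), the two steps can be sketched in Python as below. This is an illustrative sketch, not the lecture's code: the function names, the synthetic data, and the generating CPT values are all assumptions. For brevity it re-estimates only the CPTs that involve S; P(F) and P(A) depend on observed values alone and can be estimated by simple counting. The E step computes P(S=1 | f,a,h,n,θ) for each example, and the M step replaces counts with expected counts.

```python
import math
import random


def bern(p_one, value):
    """Probability that a Boolean variable with P(1)=p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one


def e_step(data, theta):
    """Return q[k] = P(S=1 | f,a,h,n, theta) and the log-likelihood of H,N given F,A."""
    pS, pH, pN = theta
    q, loglik = [], 0.0
    for f, a, h, n in data:
        w1 = pS[(f, a)] * bern(pH[1], h) * bern(pN[1], n)          # P(S=1, h, n | f, a)
        w0 = (1 - pS[(f, a)]) * bern(pH[0], h) * bern(pN[0], n)    # P(S=0, h, n | f, a)
        q.append(w1 / (w1 + w0))
        loglik += math.log(w1 + w0)
    return q, loglik


def m_step(data, q, theta):
    """Re-estimate the CPTs from expected counts, using q[k] in place of the unseen s_k."""
    pS = dict(theta[0])                   # keep the old value if a parent setting never occurs
    for i in (0, 1):
        for j in (0, 1):
            idx = [k for k, ex in enumerate(data) if (ex[0], ex[1]) == (i, j)]
            if idx:
                pS[(i, j)] = sum(q[k] for k in idx) / len(idx)
    w = {1: q, 0: [1 - qk for qk in q]}   # expected weight of S=1 and S=0 per example
    pH = {s: sum(wk * ex[2] for wk, ex in zip(w[s], data)) / sum(w[s]) for s in (0, 1)}
    pN = {s: sum(wk * ex[3] for wk, ex in zip(w[s], data)) / sum(w[s]) for s in (0, 1)}
    return pS, pH, pN


def run_em(data, theta, iterations=30):
    """Alternate E and M steps; history[t] is the log-likelihood before update t."""
    history = []
    for _ in range(iterations):
        q, loglik = e_step(data, theta)
        history.append(loglik)
        theta = m_step(data, q, theta)
    return theta, history


# Synthetic data from hypothetical CPTs; the sampled value of S is thrown away.
rng = random.Random(0)
true_pS = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}
data = []
for _ in range(2000):
    f, a = int(rng.random() < 0.1), int(rng.random() < 0.2)
    s = int(rng.random() < true_pS[(f, a)])
    h = int(rng.random() < (0.8 if s else 0.1))
    n = int(rng.random() < (0.7 if s else 0.05))
    data.append((f, a, h, n))

init = ({(i, j): rng.uniform(0.2, 0.8) for i in (0, 1) for j in (0, 1)},
        {0: 0.3, 1: 0.6}, {0: 0.3, 1: 0.6})
theta, history = run_em(data, init)
print(history[0], history[-1])
```

The recorded log-likelihood should never decrease from one iteration to the next. Different starting values can end at different local maxima, and since S is never observed the roles of S=0 and S=1 can come out swapped.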