CMSC 726: Graphical Models: Learning
material from: Michael Jordan, Nir Friedman and Daphne Koller
Learning Bayesian networks
[Figure: data and prior information are fed to an inducer, which outputs a Bayesian network over E, R, B, A, C together with a conditional probability table P(A | E, B).]
Known Structure -- Complete Data
[Figure: an inducer takes complete data records over E, B, A and a fixed structure (E → A ← B) whose table P(A | E, B) is unknown, and outputs the same structure with the table filled in.]
- Network structure is specified
- Inducer needs to estimate parameters
- Data does not contain missing values
Unknown Structure -- Complete Data
[Figure: an inducer takes complete data records over E, B, A with no arcs given, and outputs both the structure (E → A ← B) and the table P(A | E, B).]
- Network structure is not specified
- Inducer needs to select arcs & estimate parameters
- Data does not contain missing values
Known Structure -- Incomplete Data
[Figure: an inducer takes data records over E, B, A with missing entries and a fixed structure (E → A ← B) whose table P(A | E, B) is unknown, and outputs the same structure with the table filled in.]
- Network structure is specified
- Data contains missing values
- We consider assignments to missing values
Known Structure / Complete Data
- Given a network structure G
- And choice of parametric family for P(X_i | Pa_i)
- Learn parameters for network
Goal
- Construct a network that is “closest” to the probability distribution that generated the data
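Learning Parameters for a Bayesian Network
[Figure: example network with arcs E → A, B → A, and A → C.]
- Training data has the form:
$$D = \begin{bmatrix} E[1] & B[1] & A[1] & C[1] \\ \vdots & \vdots & \vdots & \vdots \\ E[M] & B[M] & A[M] & C[M] \end{bmatrix}$$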
- Since we assume i.i.d. samples, the likelihood function is
$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$$
- By definition of the network, we get
$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$$
- Regrouping the product by family, we get
$$L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)$$
General Bayesian Networks
Generalizing for any Bayesian network:
$$L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) \quad \text{(i.i.d. samples)}$$
$$= \prod_m \prod_i P(x_i[m] \mid pa_i[m] : \Theta_i) \quad \text{(network factorization)}$$
$$= \prod_i \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)$$
- The likelihood decomposes according to the structure of the network.
General Bayesian Networks (Cont.)
Decomposition ⇒ Independent Estimation Problems
If the parameters for each family are not related, then they can be estimated independently of each other.
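The independent estimation problems can be sketched in a few lines of Python. This is a minimal illustration, assuming complete data stored as a list of dicts and table (multinomial) conditional distributions; the function name `mle_cpt` and the toy records are hypothetical and not part of the course material.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from complete data.

    data: list of dicts mapping variable name -> observed value.
    Returns {(parent_values, child_value): probability}.
    """
    joint = Counter()   # counts N(x_i, pa_i)
    margin = Counter()  # counts N(pa_i)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        margin[pa] += 1
    return {(pa, x): n / margin[pa] for (pa, x), n in joint.items()}

# Hypothetical toy records for the E, B, A, C network (assumed values).
data = [
    {"E": 0, "B": 0, "A": 0, "C": 0},
    {"E": 0, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 1, "A": 1, "C": 0},
    {"E": 1, "B": 0, "A": 1, "C": 1},
]
structure = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"]}

# Each family is estimated on its own, using only its own counts.
cpts = {x: mle_cpt(data, x, pa) for x, pa in structure.items()}
```

Each call touches only the counts of one family, which is the practical meaning of the decomposition: no family's estimate depends on another's.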
Dirichlet Priors
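- Recall that the likelihood function for a multinomial is
$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$
- A Dirichlet prior with hyperparameters α_1, …, α_K is defined as
$$P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \quad \text{for legal } \theta_1, \ldots, \theta_K$$
- Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:
$$P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$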
Dirichlet Priors (cont.)
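- We can compute the prediction on a new event in closed form:
- If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K then
$$P(X[1] = k) = \int \theta_k\, P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}$$
- Since the posterior is also Dirichlet, we get
$$P(X[M+1] = k \mid D) = \int \theta_k\, P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$$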
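The closed-form prediction amounts to adding the hyperparameters to the observed counts and normalizing. A small sketch, with the function name `dirichlet_predictive` and the example counts chosen here purely for illustration:

```python
def dirichlet_predictive(alpha, counts):
    """P(X[M+1] = k | D) for a multinomial with a Dirichlet(alpha) prior.

    alpha:  hyperparameters alpha_1..alpha_K
    counts: sufficient statistics N_1..N_K (all zeros gives the prior prediction)
    """
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Prediction before seeing data, and after 3 outcomes of value 0 and 1 of value 1.
prior_only = dirichlet_predictive([1, 1], [0, 0])
after_data = dirichlet_predictive([1, 1], [3, 1])
```

With a Dirichlet(1,1) prior the second call gives (1+3)/6 and (1+1)/6, which is pulled toward 1/2 compared with the MLE of 3/4 and 1/4.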
Dirichlet Priors -- Example
[Figure: densities of Dirichlet(1,1), Dirichlet(2,2), Dirichlet(0.5,0.5), and Dirichlet(5,5) over the interval [0, 1].]
Prior Knowledge
- The hyperparameters α 1 ,…,αK can be thought of as “imaginary” counts from our prior experience
- Equivalent sample size = α 1 +…+αK
- The larger the equivalent sample size the more confident we are in our prior
Effect of Priors (cont.)
- In real data, Bayesian estimates are less sensitive to noise in the data
[Figure: P(X = 1 | D) as a function of the number of tosses N (5 to 50), for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors; a second panel shows the toss results.]
Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
- Dirichlet prior is a conjugate family for the multinomial likelihood
- Conjugate families are useful since:
- Many distributions can be represented compactly by their hyperparameters
- They allow for sequential update within the same representation
- In many cases we have closed-form solution for prediction
Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
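[Figure: the learning problem drawn as a network. θ_X is a parent of X[1], …, X[M+1], θ_{Y|X} is a parent of Y[1], …, Y[M+1], and each X[m] is a parent of Y[m]. Instances 1 to M are the observed data and instance M+1 is the query. The same model is also shown in plate notation.]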
Bayesian Networks and Bayesian Prediction (cont.)
- We can also “read” from the network: Complete data ⇒ posteriors on parameters are independent
Bayesian Prediction (cont.)
- Since posteriors on parameters for each family are independent, we can compute them separately
- Posteriors for parameters within families are also independent:
- Complete data ⇒ independent posteriors on θ_{Y|X=0} and θ_{Y|X=1}
[Figure: plate model with θ_X → X[m] → Y[m] ← θ_{Y|X}, and a refined model in which θ_{Y|X} is split into θ_{Y|X=0} and θ_{Y|X=1}.]
Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently
- The posterior is Dirichlet with parameters α(X_i=1 | pa_i) + N(X_i=1 | pa_i), …, α(X_i=k | pa_i) + N(X_i=k | pa_i)
- The predictive distribution is then represented by the parameters
$$\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$$
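The Bayesian estimate of a conditional probability table can be sketched as the counting estimator with prior counts added. This sketch assumes the same uniform pseudo-count `alpha` for every cell, so that α(pa_i) equals `alpha` times the number of child values; that choice, the name `bayes_cpt`, and the toy records are illustrative assumptions.

```python
from collections import Counter

def bayes_cpt(data, child, parents, child_values, alpha=1.0):
    """Return prob(x, pa) = (alpha + N(x, pa)) / (alpha * K + N(pa)).

    Assumes a uniform pseudo-count alpha for each (x, pa) cell, where K is
    the number of values the child can take.
    """
    joint, margin = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        margin[pa] += 1
    k = len(child_values)

    def prob(x, pa):
        # Counter returns 0 for unseen keys, so unseen parent
        # configurations fall back to the prior prediction 1 / K.
        return (alpha + joint[(pa, x)]) / (alpha * k + margin[pa])

    return prob

# Hypothetical records: A = 1 observed three times, A = 0 never observed.
data = [{"A": 1, "C": 1}, {"A": 1, "C": 1}, {"A": 1, "C": 0}]
p = bayes_cpt(data, "C", ["A"], child_values=[0, 1], alpha=1.0)
seen = p(1, (1,))      # (1 + 2) / (2 + 3)
unseen = p(1, (0,))    # (1 + 0) / (2 + 0)
```

Unlike the MLE, the estimate is defined for parent configurations that never appear in the data.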
Learning Parameters: Case Study (cont.)
Experiment:
- Sample a stream of instances from the alarm network
- Learn parameters using
- MLE estimator
- Bayesian estimator with uniform prior with different strengths
[Figure: KL divergence versus number of training instances M (0 to 5000), comparing the MLE with Bayesian estimators using uniform priors of four different strengths M'.]
Likelihood Score for Structures
First cut approach:
- Use likelihood function
- Recall, the likelihood score for a network structure and parameters is
$$L(G, \Theta_G : D) = \prod_m P(x_1[m], \ldots, x_n[m] : G, \Theta_G) = \prod_m \prod_i P(x_i[m] \mid Pa_i^G[m] : G, \Theta_{G,i})$$
- Since we know how to maximize parameters, from now on we assume
$$L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$$
Likelihood Score for Structure (cont.)
Bad news:
- Adding arcs always helps: the maximized likelihood never decreases when an arc is added
- Maximal score attained by fully connected networks
- Such networks can overfit the data --- parameters capture the noise in the data
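One way to see why adding arcs cannot lower the likelihood score is the standard information-theoretic form of the maximized log-likelihood for table conditional distributions. The notation below is introduced here for illustration and does not appear in the slides: \hat{P} is the empirical distribution of D, I is mutual information, H is entropy, and Z is any candidate extra parent.

```latex
% Maximized log-likelihood of structure G on M samples:
\log L(G : D) = M \sum_i I_{\hat{P}}(X_i ; Pa_i^G) - M \sum_i H_{\hat{P}}(X_i)

% Adding a parent Z to X_i changes only the mutual information term, and by the
% chain rule for mutual information it can never shrink:
I_{\hat{P}}(X_i ; Pa_i^G \cup \{Z\})
  = I_{\hat{P}}(X_i ; Pa_i^G) + I_{\hat{P}}(X_i ; Z \mid Pa_i^G)
  \ge I_{\hat{P}}(X_i ; Pa_i^G)
```

The entropy term does not depend on G, so the score is largest when every variable has as many parents as acyclicity allows, which is the fully connected case.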
Avoiding Overfitting
“Classic” issue in learning.
Approaches:
- Restricting the hypothesis space
- Limits the overfitting capability of the learner
- Example: restrict # of parents or # of parameters
- Minimum description length
- Description length measures complexity
- Prefer models that compactly describe the training data
- Bayesian methods
- Average over all possible parameter values
- Use prior knowledge
Bayesian Inference
- Bayesian reasoning: compute the expectation over the unknown G
$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$
- Assumption: the Gs are mutually exclusive and exhaustive
- We know how to compute P(x[M+1] | G, D)
- Same as prediction with fixed structure
- How do we compute P(G | D)?
Using Bayes rule:
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
- P(G | D): posterior score
- P(D | G): marginal likelihood
- P(G): prior over structures
- P(D): probability of the data; it is the same for all structures G and can be ignored when comparing structures
Marginal Likelihood
- By introduction of variables, we have that
$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$
where P(D | G, θ) is the likelihood and P(θ | G) is the prior over parameters
- This integral measures sensitivity to choice of parameters
Marginal Likelihood: Multinomials
For multinomials with Dirichlet prior:
- P(Θ) is Dirichlet with hyperparameters α 1 ,…,αK
- D is a dataset with sufficient statistics N 1 ,…,NK
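Then
$$P(D) = \frac{\Gamma\left(\sum_l \alpha_l\right)}{\Gamma\left(\sum_l (\alpha_l + N_l)\right)} \prod_l \frac{\Gamma(\alpha_l + N_l)}{\Gamma(\alpha_l)}$$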
Marginal Likelihood for General Network
The marginal likelihood has the form:
$$P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$$
where
- N(..) are the counts from the data
- α(..) are the hyperparameters for each family given G
- Each inner factor is the Dirichlet marginal likelihood for the sequence of values of X_i when X_i's parents have a particular value
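In practice the Gamma functions overflow quickly, so the score is usually computed in log space. The following is a sketch of one family's contribution using `math.lgamma`; the function names and the table layout (dicts keyed by parent configuration) are assumptions made for this example.

```python
from math import lgamma, exp, log

def log_dirichlet_marginal(alpha, counts):
    """log P(D) for one multinomial with a Dirichlet(alpha) prior."""
    a_sum, n_sum = sum(alpha), sum(counts)
    value = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

def log_family_score(alpha_table, count_table):
    """Sum the Dirichlet log marginal likelihood over parent configurations.

    alpha_table[pa] and count_table[pa] list alpha(x_i, pa) and N(x_i, pa)
    for every value x_i of the child, in the same order.
    """
    return sum(log_dirichlet_marginal(alpha_table[pa], count_table[pa])
               for pa in count_table)

# One binary child with one binary parent, uniform prior counts (assumed values).
alpha_table = {(0,): [1, 1], (1,): [1, 1]}
count_table = {(0,): [1, 0], (1,): [1, 1]}
family = log_family_score(alpha_table, count_table)
```

The log marginal likelihood of the whole network is then the sum of `log_family_score` over all variables, mirroring the outer product over i.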
Priors
- We need: prior counts α(..) for each network structure G
- This can be a formidable task
- There are exponentially many structures…
BDe Score
Possible solution: The BDe prior
- Represent prior using two elements M_0, B_0
- M_0: equivalent sample size
- B_0: network representing the prior probability of events
Intuition: M_0 prior examples distributed by B_0
- Set α(x_i, pa_i^G) = M_0 P(x_i, pa_i^G | B_0)
- Note that pa_i^G are not the same as the parents of X_i in B_0
- Compute P(x_i, pa_i^G | B_0) using standard inference procedures
- Such priors have desirable theoretical properties
- Equivalent networks are assigned the same score
Bayesian Score: Asymptotic Behavior
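Theorem: If the prior P(Θ | G) is “well-behaved”, then
$$\log P(D \mid G) = \ell(G : D) - \frac{\log M}{2} \dim(G) + O(1)$$
where ℓ(G : D) is the maximized log-likelihood, M is the number of samples, and dim(G) is the number of independent parameters in G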
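Dropping the O(1) term gives a score that is easy to compute. A minimal sketch for table conditional distributions follows; the function names and the `cardinality` argument (number of values per variable) are assumptions for this example, and the log-likelihood value passed in is made up.

```python
from math import log

def dim(structure, cardinality):
    """Number of independent parameters of a network with table CPDs.

    structure:   {variable: list of parents}
    cardinality: {variable: number of values}
    """
    total = 0
    for x, parents in structure.items():
        rows = 1
        for p in parents:
            rows *= cardinality[p]          # one distribution per parent configuration
        total += rows * (cardinality[x] - 1)  # each has K - 1 free parameters
    return total

def asymptotic_score(log_likelihood, structure, cardinality, m):
    """Maximized log-likelihood minus (log M / 2) * dim(G)."""
    return log_likelihood - 0.5 * log(m) * dim(structure, cardinality)

structure = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"]}
cardinality = {"E": 2, "B": 2, "A": 2, "C": 2}
n_params = dim(structure, cardinality)
score = asymptotic_score(-100.0, structure, cardinality, m=100)
```

The penalty grows with log M while the log-likelihood grows linearly in M, so extra arcs are accepted only when the data support them.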
Heuristic Search (cont.)
[Figure: a network over S, C, E, D and three of its neighbors in the search space, obtained by the local moves Add C → D, Reverse C → E, and Delete C → E.]
Exploiting Decomposability in Local Search
- Caching: To update the score after a local change, we only need to re-score the families that were changed in the last move
[Figure: neighboring networks over S, C, E, D that differ by a single local move.]
Greedy Hill-Climbing
- Simplest heuristic local search
- Start with a given network
- empty network
- best tree
- a random network
- At each iteration
- Evaluate all possible changes
- Apply change that leads to best improvement in score
- Reiterate
- Stop when no modification improves score
- Each step requires evaluating approximately n new changes
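The loop above can be sketched as follows. This is an illustrative version, not the course's implementation: it takes an arbitrary `score` function, re-scores every candidate from scratch (no caching of family scores), and is exercised with a made-up toy score that simply rewards closeness to a target edge set.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """True if the directed graph has no cycle (repeatedly peel off parentless nodes)."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(v == n and u in remaining for u, v in edges)}
        if not roots:
            return False
        remaining -= roots
    return True

def neighbors(nodes, edges):
    """Yield every edge set one local move away: add, delete, or reverse an arc."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                  # delete u -> v
            yield (edges - {(u, v)}) | {(v, u)}     # reverse u -> v
        elif (v, u) not in edges:
            yield edges | {(u, v)}                  # add u -> v

def hill_climb(nodes, score, start=frozenset()):
    """Apply the best-scoring legal move until no move improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        top_graph, top_score = None, best
        for g in neighbors(nodes, current):
            if is_acyclic(nodes, g):
                s = score(g)
                if s > top_score:
                    top_graph, top_score = g, s
        if top_graph is None:
            return current, best
        current, best = top_graph, top_score

# Toy score (an assumption for the demo): negative distance to a target edge set.
target = frozenset({("E", "A"), ("B", "A"), ("A", "C")})

def toy_score(graph):
    return -len(graph ^ target)

graph, value = hill_climb(["E", "B", "A", "C"], toy_score)
```

With a decomposable score, the inner loop would instead update only the one or two family scores affected by each move.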
Greedy Hill-Climbing: Possible Pitfalls
- Greedy Hill-Climbing can get stuck in:
- Local Maxima:
- All one-edge changes reduce the score
- Plateaus:
- Some one-edge changes leave the score unchanged
- Happens because equivalent networks received the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both
- Random restarts
- TABU search
Model Selection
- So far, we focused on a single model
- Find best scoring model
- Use it to predict next example
- Implicit assumption:
- Best scoring model dominates the weighted sum
- Pros:
- We get a single structure
- Allows for efficient use in our tasks
- Cons:
- We are committing to the independencies of a particular structure
- Other structures might be as probable given the data
Model Averaging
- Recall, Bayesian analysis started with
$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$
- This requires us to average over all possible models
Model Averaging (cont.)
- Full Averaging
- Sum over all structures
- Usually intractable---there are exponentially many structures
- Approximate Averaging
- Find K largest scoring structures
- Approximate the sum by averaging over their predictions
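Approximate averaging can be sketched as a weighted sum over the K retained structures. The sketch below assumes each structure comes with a log score, log P(D | G) + log P(G), and weights predictions by the posterior renormalized over just those K structures; that weighting, the function names, and the numbers are assumptions for this example.

```python
from math import exp, log

def posterior_weights(log_scores):
    """Turn log P(D|G) + log P(G) into weights that sum to 1 over the kept structures."""
    top = max(log_scores)
    unnorm = [exp(s - top) for s in log_scores]  # subtract the max for numerical stability
    z = sum(unnorm)
    return [u / z for u in unnorm]

def averaged_prediction(log_scores, predictions):
    """Weighted average of per-structure predictions P(x[M+1] | D, G)."""
    weights = posterior_weights(log_scores)
    return sum(w * p for w, p in zip(weights, predictions))

# Two retained structures; the first is three times as probable as the second.
weights = posterior_weights([log(3.0), log(1.0)])
prediction = averaged_prediction([log(3.0), log(1.0)], [0.8, 0.4])
```

Working with score differences in log space matters because real log marginal likelihoods are large negative numbers whose exponentials underflow.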
Search: Summary
- Discrete optimization problem
- In general, NP-Hard
- Need to resort to heuristic search
- In practice, search is relatively fast (~100 vars in ~10 min):
- Decomposability
- Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
- Example: learning trees
Incomplete Data
Data is often incomplete
- Some variables of interest are not assigned value
This phenomenon occurs when we have
- Missing values
- Hidden variables
Missing Values
Examples:
- Survey data
- Medical records
- Not all patients undergo all possible tests
Missing Values (cont.)
Complicating issue:
- The fact that a value is missing might be indicative of its value
- The patient did not undergo X-Ray since she complained about fever and not about broken bones
- To learn from incomplete data we need the following assumption: Missing at Random (MAR):
- The probability that the value of Xi is missing is independent of its actual value given other observed values
EM (cont.)
[Figure: the EM loop for a network in which X_1, X_2, X_3 are parents of a hidden variable H, and H is the parent of Y_1, Y_2, Y_3. Starting from the training data and an initial network (G, Θ_0), the E-step computes the expected counts N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H); the M-step reparameterizes to give the updated network (G, Θ_1); then reiterate.]
EM (cont.)
Formal Guarantees:
- L(Θ_1 : D) ≥ L(Θ_0 : D)
- Each iteration does not decrease the likelihood
- If Θ_1 = Θ_0, then Θ_0 is a stationary point of L(Θ : D)
- Usually, this means a local maximum
Main cost:
- Computations of expected counts in E-Step
- Requires a computation pass for each instance in training set
- These are exactly the same as for gradient ascent!
Example: EM in clustering
- Consider clustering example
[Figure: naive Bayes model with a hidden Cluster variable as the parent of X_1, X_2, ..., X_n.]
E-Step:
- Compute P(C[m] | X_1[m], …, X_n[m], Θ)
- This corresponds to “soft” assignment to clusters
- Compute expected statistics:
$$E[N(x_i, c)] = \sum_{m :\, X_i[m] = x_i} P(c \mid x_1[m], \ldots, x_n[m], \Theta)$$
M-Step:
- Re-estimate P(X_i | C), P(C)
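The two steps can be sketched for the simplest case of binary features. This is an illustrative sketch only: the function names, the toy data, and the starting parameters are assumptions, and no smoothing is applied in the M-step.

```python
from math import log

def e_step(data, prior, cond):
    """Soft assignments P(c | x[m]) for every instance, plus the data log-likelihood.

    prior[c]   = P(C = c)
    cond[c][i] = P(X_i = 1 | C = c)
    """
    resp, loglik = [], 0.0
    for x in data:
        joint = []
        for c, pc in enumerate(prior):
            p = pc
            for i, xi in enumerate(x):
                p *= cond[c][i] if xi == 1 else 1.0 - cond[c][i]
            joint.append(p)
        total = sum(joint)            # P(x[m]) under the current parameters
        loglik += log(total)
        resp.append([p / total for p in joint])
    return resp, loglik

def m_step(data, resp, n_clusters):
    """Re-estimate P(C) and P(X_i = 1 | C) from the expected counts."""
    m, n = len(data), len(data[0])
    prior, cond = [], []
    for c in range(n_clusters):
        nc = sum(r[c] for r in resp)  # expected count E[N(c)]
        prior.append(nc / m)
        cond.append([sum(r[c] * x[i] for r, x in zip(resp, data)) / nc
                     for i in range(n)])  # E[N(x_i = 1, c)] / E[N(c)]
    return prior, cond

def em(data, prior, cond, iterations=10):
    """Alternate E and M steps, recording the log-likelihood before each update."""
    history = []
    for _ in range(iterations):
        resp, loglik = e_step(data, prior, cond)
        history.append(loglik)
        prior, cond = m_step(data, resp, len(prior))
    return prior, cond, history

# Toy binary data and a slightly asymmetric starting point (assumed values).
data = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0)]
prior, cond, history = em(data, [0.5, 0.5], [[0.6, 0.6, 0.6], [0.4, 0.4, 0.4]])
```

The recorded `history` should never decrease from one iteration to the next, which is the guarantee EM provides. Starting both clusters with identical parameters would leave EM stuck at that symmetric point, which is why the starting values differ.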
EM in Practice
Initial parameters:
- Random parameter setting
- “Best” guess from other source
Stopping criteria:
- Small change in likelihood of data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early “pruning” of unpromising ones
Bayesian Inference with Incomplete Data
Recall, Bayesian estimation:
$$P(x[M+1] \mid D) = \int P(x[M+1] \mid \theta)\, P(\theta \mid D)\, d\theta$$
Incomplete data:
- No sufficient statistics (except the data)
- Posterior does not decompose
- No closed form solution ⇒ Need to use approximations
MAP Approximation
- Simplest approximation: MAP parameters
- MAP: Maximum A-posteriori Probability
$$P(x[M+1] \mid D) \approx P(x[M+1] \mid \tilde{\theta})$$
where
$$\tilde{\theta} = \arg\max_\theta P(\theta \mid D)$$
Assumption:
- Posterior mass is dominated by the MAP parameters
Finding MAP parameters:
- Same techniques as finding ML parameters
- Maximize P(θ | D) instead of L(θ : D)
Stochastic Approximations
Stochastic approximation:
- Sample θ_1, …, θ_k from P(θ | D)
- Approximate
$$P(x[M+1] \mid D) \approx \frac{1}{k} \sum_i P(x[M+1] \mid \theta_i)$$
- We can apply Gibbs sampling to perform stochastic simulation in the “meta” network that describes the learning problem
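The averaging step can be illustrated on a case where the answer is known. The sketch below uses a complete-data Dirichlet posterior as a stand-in, because it can be sampled directly with the standard library; with incomplete data the samples θ_i would instead come from a sampler such as Gibbs. The function names, seed, and counts are assumptions for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one theta from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sampled_predictive(alpha, counts, k, seed=0):
    """Estimate P(x[M+1] = j | D) as the average of theta_i[j] over k posterior samples."""
    rng = random.Random(seed)
    posterior = [a + n for a, n in zip(alpha, counts)]
    samples = [sample_dirichlet(posterior, rng) for _ in range(k)]
    return [sum(s[j] for s in samples) / k for j in range(len(alpha))]

# Dirichlet(1,1) prior with counts (3, 1): the exact prediction is (4/6, 2/6).
estimate = sampled_predictive([1, 1], [3, 1], k=20000)
```

Here P(x[M+1] = j | θ_i) is simply the j-th component of θ_i, so the sample mean converges to the closed-form prediction as k grows.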
Parameter Learning from Incomplete Data: Summary
- Non-linear optimization problem
- Methods for learning: EM and Gradient Ascent
- Exploit inference for learning
Difficulties:
- Exploration of a complex likelihood/posterior
- More missing data ⇒ many more local maxima
- Cannot represent posterior ⇒ must resort to approximations
- Inference
- Main computational bottleneck for learning
- Learning large networks ⇒ exact inference is infeasible ⇒ resort to approximate inference