Graphical Models: Learning Bayesian Structure - Machine Learning | CMSC 726, Study notes of Computer Science

Material Type: Notes; Professor: Getoor; Class: MACHINE LEARNING; Subject: Computer Science; University: University of Maryland; Term: Spring 2006;

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

cmsc726: Graphical Models: Learning

material from: Michael Jordan, Nir Friedman and Daphne Koller

Learning Bayesian networks

[Figure: Data + Prior information → Inducer → a Bayesian network over E, R, B, A, C with a table P(A | E,B); the table rows read .9 .1, .7 .3, .99 .01, .8 .2 (row labels not preserved)]

Known Structure -- Complete Data

[Figure: complete data over E, B, A (<Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>) and the structure E → A ← B with an unknown table P(A | E,B) (all entries ??) go into the Inducer, which outputs the same structure with the table filled in]

  • Network structure is specified
    • Inducer needs to estimate parameters
  • Data does not contain missing values

Unknown Structure -- Complete Data

[Figure: the same complete data and the variables E, B, A with no structure given go into the Inducer, which outputs a structure over E, B, A with the table P(A | E,B) filled in]

  • Network structure is not specified
    • Inducer needs to select arcs & estimate parameters
  • Data does not contain missing values

Known Structure -- Incomplete Data

[Figure: incomplete data over E, B, A (<Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>) and the structure E → A ← B with an unknown table go into the Inducer, which outputs the same structure with the table P(A | E,B) filled in]

  • Network structure is specified
  • Data contains missing values
    • We consider assignments to missing values

Known Structure / Complete Data

  • Given a network structure G
    • And choice of parametric family for $P(X_i \mid Pa_i)$
  • Learn parameters for network

Goal

  • Construct a network that is “closest” to probability that generated the data

Learning Parameters for a Bayesian Network

[Figure: network over E, B, A, C with arcs E → A, B → A, A → C]

  • Training data has the form:

$$D = \begin{bmatrix} E[1] & B[1] & A[1] & C[1] \\ \vdots & \vdots & \vdots & \vdots \\ E[M] & B[M] & A[M] & C[M] \end{bmatrix}$$

  • Since we assume i.i.d. samples, likelihood function is

$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$$

  • By definition of network, we get

$$L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$$

  • Rewriting terms, we get

$$L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)$$

General Bayesian Networks

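Generalizing for any Bayesian network:

  • The likelihood decomposes according to the structure of the network.

$$
\begin{aligned}
L(\Theta : D) &= \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) && \text{(i.i.d. samples)} \\
&= \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \Theta_i) && \text{(network factorization)} \\
&= \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \Theta_i) \\
&= \prod_i L_i(\Theta_i : D)
\end{aligned}
$$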

General Bayesian Networks (Cont.)

Decomposition ⇒ Independent Estimation Problems

If the parameters for each family are not related, then they can be estimated independently of each other.
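The decomposition can be illustrated with a minimal sketch that estimates each family's table separately by relative frequency. The structure E → A ← B, A → C follows the factorization used in these notes; the helper name `mle_cpts` and the toy data set are assumptions made only for illustration.

```python
from collections import Counter

# Structure taken from the factorization in the notes: E -> A <- B, A -> C.
PARENTS = {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}

def mle_cpts(data, parents):
    """Estimate every CPT independently from complete data (hypothetical helper)."""
    cpts = {}
    for var, pa in parents.items():
        joint = Counter()   # N(x_i, pa_i)
        margin = Counter()  # N(pa_i)
        for row in data:
            pa_val = tuple(row[p] for p in pa)
            joint[(row[var], pa_val)] += 1
            margin[pa_val] += 1
        # MLE for one family: N(x_i, pa_i) / N(pa_i); no other family is consulted.
        cpts[var] = {key: n / margin[key[1]] for key, n in joint.items()}
    return cpts

# Made-up toy data set, for illustration only.
DATA = [
    {"E": 0, "B": 0, "A": 0, "C": 0},
    {"E": 0, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 1, "A": 1, "C": 0},
    {"E": 1, "B": 0, "A": 1, "C": 1},
    {"E": 0, "B": 0, "A": 0, "C": 0},
    {"E": 0, "B": 1, "A": 0, "C": 0},
]

cpts = mle_cpts(DATA, PARENTS)
print(cpts["A"][(1, (0, 1))])  # estimate of P(A=1 | E=0, B=1)
```

Each pass of the outer loop touches only the counts of one family, which is what makes the estimation problems independent.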

Dirichlet Priors

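  • Recall that the likelihood function for a multinomial is

$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$

  • A Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$ is defined as

$$P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \quad \text{for legal } \theta_1, \ldots, \theta_K$$

  • Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \ldots, \alpha_K + N_K$

$$P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$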

Dirichlet Priors (cont.)

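  • We can compute the prediction on a new event in closed form:
  • If P(Θ) is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$ then

$$P(X[1] = k) = \int \theta_k \cdot P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}$$

  • Since the posterior is also Dirichlet, we get

$$P(X[M+1] = k \mid D) = \int \theta_k \cdot P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$$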
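The closed-form prediction is a one-line computation. The sketch below is illustrative only; the function name `dirichlet_predictive` and the example counts are assumptions, not part of the original slides.

```python
def dirichlet_predictive(alpha, counts):
    """P(X[M+1] = k | D) for a Dirichlet(alpha) prior and multinomial counts N_k."""
    total = sum(a + n for a, n in zip(alpha, counts))
    return [(a + n) / total for a, n in zip(alpha, counts)]

# With no data the prediction is alpha_k / sum(alpha).
prior_only = dirichlet_predictive([1, 1], [0, 0])
# After observing N_1 = 3 and N_2 = 1 (made-up counts).
after_data = dirichlet_predictive([1, 1], [3, 1])
print(prior_only, after_data)
```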

Dirichlet Priors -- Example

[Figure: densities of Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2) and Dirichlet(5,5) plotted over the interval 0 to 1; the vertical axis runs from 0 to 5]

Prior Knowledge

  • The hyperparameters $\alpha_1, \ldots, \alpha_K$ can be thought of as “imaginary” counts from our prior experience
  • Equivalent sample size = $\alpha_1 + \ldots + \alpha_K$
  • The larger the equivalent sample size, the more confident we are in our prior

Effect of Priors (cont.)

  • In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: P(X = 1|D) plotted against the number of tosses N (5 to 50) for MLE, Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5) and Dirichlet(10,10); a companion panel shows the toss results (0 or 1) against N]
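The smoothing effect can be reproduced on a short sequence. This is a sketch under stated assumptions: the toss sequence is made up, and `running_estimates` and `spread` are hypothetical helpers, not taken from the slides.

```python
def running_estimates(tosses, alpha=None):
    """P(X = 1 | first N tosses) for N = 1..len(tosses).

    alpha=None gives the MLE; otherwise alpha = (alpha_0, alpha_1) are the
    Dirichlet hyperparameters for outcomes 0 and 1.
    """
    a0, a1 = alpha if alpha is not None else (0.0, 0.0)
    ones = 0
    estimates = []
    for n, toss in enumerate(tosses, start=1):
        ones += toss
        estimates.append((a1 + ones) / (a0 + a1 + n))
    return estimates

def spread(values):
    """Range of the estimates, used here as a crude measure of sensitivity."""
    return max(values) - min(values)

TOSSES = [1, 1, 0, 1, 0, 0, 1, 0]  # made-up sequence
mle = running_estimates(TOSSES)
bayes = running_estimates(TOSSES, alpha=(5, 5))
print(spread(mle), spread(bayes))
```

On this sequence the MLE starts at 1.0 and swings down to 0.5, while the Dirichlet(5,5) estimate stays close to 0.5 throughout.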

Conjugate Families

  • The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
    • Dirichlet prior is a conjugate family for the multinomial likelihood
  • Conjugate families are useful since:
    • For many distributions we can represent them with hyperparameters
    • They allow for sequential update within the same representation
    • In many cases we have a closed-form solution for prediction

Bayesian Networks and Bayesian Prediction

  • Priors for each parameter group are independent
  • Data instances are independent given the unknown parameters

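[Figure: meta-network in which the parameter nodes $\theta_X$ and $\theta_{Y|X}$ are parents of the observed data X[1], Y[1], …, X[M], Y[M] and of the query X[M+1], Y[M+1]; the same model is also drawn in plate notation, with a plate over m containing X[m] and Y[m]]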

Bayesian Networks and Bayesian Prediction (Cont.)

  • We can also “read” from the network: Complete data ⇒ posteriors on parameters are independent

[Figure: the meta-network and plate model again, with observed data X[1], Y[1], …, X[M], Y[M], query X[M+1], Y[M+1], and parameter nodes $\theta_X$ and $\theta_{Y|X}$]

Bayesian Prediction (cont.)

  • Since posteriors on parameters for each family are independent, we can compute them separately
  • Posteriors for parameters within families are also independent:
  • Complete data ⇒ independent posteriors on $\theta_{Y|X=0}$ and $\theta_{Y|X=1}$

[Figure: plate model with $\theta_X$, $\theta_{Y|X}$, X[m], Y[m], and a refined model in which $\theta_{Y|X}$ is split into separate nodes $\theta_{Y|X=0}$ and $\theta_{Y|X=1}$]

Bayesian Prediction (cont.)

  • Given these observations, we can compute the posterior for each multinomial $\theta_{X_i \mid pa_i}$ independently
    • The posterior is Dirichlet with parameters $\alpha(X_i{=}1 \mid pa_i) + N(X_i{=}1 \mid pa_i), \ldots, \alpha(X_i{=}k \mid pa_i) + N(X_i{=}k \mid pa_i)$
  • The predictive distribution is then represented by the parameters

$$\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$$
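A minimal sketch of these predictive parameters for a single family follows. It assumes the same hyperparameter for every (x, pa) cell, so that α(pa) is that value times the number of values of X; the function name `bayesian_cpt` and the four-row data set are hypothetical.

```python
from collections import Counter

def bayesian_cpt(data, var, parents, values, alpha=1.0):
    """Predictive parameters (alpha(x,pa) + N(x,pa)) / (alpha(pa) + N(pa)).

    Assumption: one shared hyperparameter `alpha` per cell, so
    alpha(pa) = alpha * len(values). Only parent configurations that occur
    in the data are returned.
    """
    joint = Counter()   # N(x, pa)
    margin = Counter()  # N(pa)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[var], pa)] += 1
        margin[pa] += 1
    return {
        (x, pa): (alpha + joint[(x, pa)]) / (alpha * len(values) + n_pa)
        for pa, n_pa in margin.items()
        for x in values
    }

# Made-up data: Y=0 is never observed together with X=1.
DATA = [
    {"X": 0, "Y": 0},
    {"X": 0, "Y": 0},
    {"X": 0, "Y": 1},
    {"X": 1, "Y": 1},
]
theta = bayesian_cpt(DATA, "Y", ("X",), values=(0, 1), alpha=1.0)
print(theta[(0, (1,))])  # nonzero although the MLE would be 0
```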

Learning Parameters: Case Study (cont.)

Experiment:

  • Sample a stream of instances from the alarm network
  • Learn parameters using
    • MLE estimator
    • Bayesian estimator with uniform prior with different strengths

[Figure: KL divergence plotted against the number of instances M (0 to 5000) for MLE and for Bayes with a uniform prior at four strengths M′ (the M′ values were not preserved)]

Likelihood Score for Structures

First cut approach:

  • Use likelihood function
  • Recall, the likelihood score for a network structure and parameters is

$$L(G, \Theta_G : D) = \prod_m P(x_1[m], \ldots, x_n[m] : G, \Theta_G) = \prod_m \prod_i P(x_i[m] \mid Pa_i^G[m] : G, \Theta_{G,i})$$

  • Since we know how to maximize parameters, from now on we assume

$$L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$$

Likelihood Score for Structure (cont.)

Bad news:

  • Adding arcs always helps
    • Maximal score attained by fully connected networks
    • Such networks can overfit the data --- parameters capture the noise in the data
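The effect of adding an arc on the likelihood score can be checked numerically. In this sketch the two-variable data set is made up and `max_log_likelihood` is a hypothetical helper that evaluates log L(G : D) at the maximum likelihood parameters.

```python
import math
from collections import Counter

def max_log_likelihood(data, parents):
    """log L(G : D): log-likelihood of the data at the MLE parameters of G."""
    score = 0.0
    for var, pa in parents.items():
        joint = Counter()   # N(x_i, pa_i)
        margin = Counter()  # N(pa_i)
        for row in data:
            pa_val = tuple(row[p] for p in pa)
            joint[(row[var], pa_val)] += 1
            margin[pa_val] += 1
        score += sum(n * math.log(n / margin[key[1]]) for key, n in joint.items())
    return score

# Made-up data over two binary variables.
PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1), (1, 1)]
DATA = [{"X": x, "Y": y} for x, y in PAIRS]

empty = max_log_likelihood(DATA, {"X": (), "Y": ()})
with_arc = max_log_likelihood(DATA, {"X": (), "Y": ("X",)})
print(empty, with_arc)
```

The network with the arc X → Y scores at least as high as the empty network on any data set, since the empty network's parameters are a special case of the larger model.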

Avoiding Overfitting

“Classic” issue in learning.

Approaches:

  • Restricting the hypothesis space
    • Limits the overfitting capability of the learner
    • Example: restrict # of parents or # of parameters
  • Minimum description length
    • Description length measures complexity
    • Prefer models that compactly describe the training data
  • Bayesian methods
    • Average over all possible parameter values
    • Use prior knowledge

Bayesian Inference

  • Bayesian Reasoning---compute expectation over unknown G

$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$

  • Assumption: Gs are mutually exclusive and exhaustive
  • We know how to compute P(x[M+1]|G,D)
    • Same as prediction with fixed structure
  • How do we compute P(G|D)?

Posterior Score

Using Bayes rule:

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

  • P(D|G): marginal likelihood
  • P(G): prior over structures
  • P(D): probability of data
    • P(D) is the same for all structures G
    • Can be ignored when comparing structures

Marginal Likelihood

  • By introduction of variables, we have that

$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$

    • P(D|G,θ) is the likelihood and P(θ|G) is the prior over parameters
  • This integral measures sensitivity to choice of parameters

Marginal Likelihood: Multinomials

For multinomials with Dirichlet prior:

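  • P(Θ) is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$
  • D is a dataset with sufficient statistics $N_1, \ldots, N_K$

Then

$$P(D) = \frac{\Gamma\left(\sum_l \alpha_l\right)}{\Gamma\left(\sum_l (\alpha_l + N_l)\right)} \prod_l \frac{\Gamma(\alpha_l + N_l)}{\Gamma(\alpha_l)}$$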
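The Gamma-function formula is easiest to evaluate in log space. A minimal sketch follows, using the standard library's `math.lgamma`; the function name `log_marginal_likelihood` and the example counts are assumptions for illustration.

```python
from math import exp, lgamma

def log_marginal_likelihood(alpha, counts):
    """log P(D) for a multinomial sample with a Dirichlet(alpha) prior."""
    a_sum = sum(alpha)
    n_sum = sum(counts)
    result = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for a, n in zip(alpha, counts):
        result += lgamma(a + n) - lgamma(a)
    return result

# Check against the chain rule of predictive probabilities: with a
# Dirichlet(1,1) prior, the sequence H, H, T has probability
# 1/2 * 2/3 * 1/4 = 1/12.
p = exp(log_marginal_likelihood([1, 1], [2, 1]))
print(p)
```

Working with `lgamma` rather than the Gamma function itself avoids overflow once the counts reach the hundreds.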

Marginal Likelihood for General Network

The marginal likelihood has the form:

$$P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma\left(\alpha(pa_i^G)\right)}{\Gamma\left(\alpha(pa_i^G) + N(pa_i^G)\right)} \prod_{x_i} \frac{\Gamma\left(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G)\right)}{\Gamma\left(\alpha(x_i, pa_i^G)\right)}$$

where

  • N(..) are the counts from the data
  • α(..) are the hyperparameters for each family given G

Each factor of the product over $pa_i^G$ is a Dirichlet marginal likelihood for the sequence of values of $X_i$ when $X_i$’s parents have a particular value.

Priors

  • We need: prior counts α(..) for each network structure G

  • This can be a formidable task
    • There are exponentially many structures…

BDe Score

Possible solution: The BDe prior

  • Represent prior using two elements $M_0$, $B_0$
    • $M_0$ - equivalent sample size
    • $B_0$ - network representing the prior probability of events

Intuition: $M_0$ prior examples distributed by $B_0$

  • Set $\alpha(x_i, pa_i^G) = M_0\, P(x_i, pa_i^G \mid B_0)$
    • Note that $pa_i^G$ are not the same as the parents of $X_i$ in $B_0$.
    • Compute $P(x_i, pa_i^G \mid B_0)$ using standard inference procedures
  • Such priors have desirable theoretical properties
    • Equivalent networks are assigned the same score
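As a rough illustration of how the BDe hyperparameters are produced, the sketch below uses a hypothetical prior network B0 with two independent binary variables, so that inference in B0 reduces to multiplying marginals. The marginals, the value of M0 and the helper names are all assumptions; a real B0 would need a proper inference routine.

```python
from itertools import product

# Hypothetical prior network B0: two independent binary variables.
B0_MARGINALS = {"X1": {0: 0.8, 1: 0.2}, "X2": {0: 0.5, 1: 0.5}}

def prob_under_b0(assignment):
    """P(assignment | B0); a plain product because this B0 has no arcs."""
    p = 1.0
    for var, val in assignment.items():
        p *= B0_MARGINALS[var][val]
    return p

def bde_hyperparameters(var, parents, m0):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | B0) for one family of a candidate G."""
    names = (var,) + tuple(parents)
    alphas = {}
    for values in product((0, 1), repeat=len(names)):
        alphas[values] = m0 * prob_under_b0(dict(zip(names, values)))
    return alphas

# Candidate structure G gives X2 the parent X1, which is not its parent in B0.
alphas = bde_hyperparameters("X2", ("X1",), m0=10)
print(alphas)  # keys are (x2, x1) value pairs
```

The hyperparameters of a family always sum to M0, which matches the reading of M0 as an equivalent sample size.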

Bayesian Score: Asymptotic Behavior

Theorem: If the prior P(Θ |G) is “well-behaved”, then

$$\log P(D \mid G) = l(G : D) - \frac{\log M}{2} \dim(G) + O(1)$$
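Dropping the O(1) term gives a score that is simple to compute. The sketch below reads l(G : D) as the maximized log-likelihood, M as the number of instances and dim(G) as the number of independent parameters; the function name `bic_score` and the data are assumptions for illustration.

```python
import math
from collections import Counter

def bic_score(data, parents, cardinality):
    """l(G : D) - (log M / 2) * dim(G), the leading terms of log P(D | G)."""
    m = len(data)
    loglik = 0.0
    dim = 0
    for var, pa in parents.items():
        joint, margin = Counter(), Counter()
        for row in data:
            pa_val = tuple(row[p] for p in pa)
            joint[(row[var], pa_val)] += 1
            margin[pa_val] += 1
        loglik += sum(n * math.log(n / margin[k[1]]) for k, n in joint.items())
        # Independent parameters of this family: (|X_i| - 1) per parent configuration.
        pa_configs = math.prod(cardinality[p] for p in pa)
        dim += (cardinality[var] - 1) * pa_configs
    return loglik - 0.5 * math.log(m) * dim

# Made-up data over two binary variables with only weak dependence.
PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1), (1, 1)]
DATA = [{"X": x, "Y": y} for x, y in PAIRS]
CARD = {"X": 2, "Y": 2}

bic_empty = bic_score(DATA, {"X": (), "Y": ()}, CARD)
bic_arc = bic_score(DATA, {"X": (), "Y": ("X",)}, CARD)
print(bic_empty, bic_arc)
```

On this data the extra parameter of the arc X → Y costs more than the likelihood it gains, so the penalized score prefers the empty network.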

Heuristic Search (cont.)

  • Typical operations:
    • Add C → D
    • Reverse C → E
    • Delete C → E

[Figure: a network over S, C, E, D and the three networks produced by these operations]

Exploiting Decomposability in Local Search

  • Caching: To update the score after a local change, we only need to re-score the families that were changed in the last move

[Figure: networks over S, C, E, D before and after local changes]

Greedy Hill-Climbing

  • Simplest heuristic local search
    • Start with a given network
      • empty network
      • best tree
      • a random network
    • At each iteration
      • Evaluate all possible changes
      • Apply change that leads to best improvement in score
      • Reiterate
    • Stop when no modification improves score
  • Each step requires evaluating approximately n new changes
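The procedure can be sketched in a few dozen lines. This is an illustrative sketch, not the slides' implementation: it assumes binary variables, uses a BIC-style score as a stand-in for whichever score is chosen, starts from the empty network, and re-scores every family at each step instead of caching. All function names and the data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, var, pa):
    """BIC contribution of one family, assuming binary variables."""
    joint, margin = Counter(), Counter()
    for row in data:
        pa_val = tuple(row[p] for p in sorted(pa))
        joint[(row[var], pa_val)] += 1
        margin[pa_val] += 1
    loglik = sum(n * math.log(n / margin[k[1]]) for k, n in joint.items())
    return loglik - 0.5 * math.log(len(data)) * 2 ** len(pa)

def score(data, parents):
    return sum(family_score(data, v, pa) for v, pa in parents.items())

def is_acyclic(parents):
    """Repeatedly remove nodes that have no parents among the remaining nodes."""
    remaining = dict(parents)
    while remaining:
        roots = [v for v, pa in remaining.items() if not (pa & set(remaining))]
        if not roots:
            return False
        for v in roots:
            del remaining[v]
    return True

def neighbors(parents):
    """Acyclic structures one arc addition, deletion or reversal away."""
    for x, y in permutations(parents, 2):
        candidate = dict(parents)
        if x in parents[y]:
            candidate[y] = parents[y] - {x}        # delete x -> y
            yield candidate
            flipped = dict(candidate)
            flipped[x] = parents[x] | {y}          # reverse x -> y
            if is_acyclic(flipped):
                yield flipped
        elif y not in parents[x]:
            candidate[y] = parents[y] | {x}        # add x -> y
            if is_acyclic(candidate):
                yield candidate

def hill_climb(data, variables):
    """Greedy hill-climbing from the empty network."""
    current = {v: frozenset() for v in variables}
    current_score = score(data, current)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current):
            s = score(data, cand)
            if s > best_score + 1e-9:  # require a strict improvement
                best, best_score = cand, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score

# Made-up data: Y copies X, Z is unrelated to both.
DATA = [{"X": x, "Y": x, "Z": z} for x in (0, 1) for z in (0, 1)] * 2
structure, final_score = hill_climb(DATA, ["X", "Y", "Z"])
arcs = {(p, v) for v, pa in structure.items() for p in pa}
print(arcs, final_score)
```

The small tolerance in the comparison keeps the search from moving between equivalent networks that receive the same score, which is the plateau case.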

Greedy Hill-Climbing: Possible Pitfalls

  • Greedy Hill-Climbing can get stuck in:
    • Local Maxima:
      • All one-edge changes reduce the score
    • Plateaus:
      • Some one-edge changes leave the score unchanged
      • Happens because equivalent networks receive the same score and are neighbors in the search space
  • Both occur during structure search
  • Standard heuristics can escape both
    • Random restarts
    • TABU search

Model Selection

  • So far, we focused on a single model
    • Find best scoring model
    • Use it to predict next example
  • Implicit assumption:
    • Best scoring model dominates the weighted sum
  • Pros:
    • We get a single structure
    • Allows for efficient use in our tasks
  • Cons:
    • We are committing to the independencies of a particular structure
    • Other structures might be as probable given the data

Model Averaging

  • Recall, Bayesian analysis started with

$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$

    • This requires us to average over all possible models

Model Averaging (cont.)

  • Full Averaging
    • Sum over all structures
    • Usually intractable---there are exponentially many structures
  • Approximate Averaging
    • Find K largest scoring structures
    • Approximate the sum by averaging over their predictions
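Approximate averaging can be sketched as a weighted sum over the K retained structures, with weights obtained by normalizing their posterior scores. The function name `average_prediction` and the scores and predictions below are made up for illustration; normalizing over only K structures is itself part of the approximation.

```python
import math

def average_prediction(log_scores, predictions):
    """Weight each structure's prediction by its normalized posterior score.

    log_scores[j] stands for log P(D | G_j) + log P(G_j) and predictions[j]
    for P(x[M+1] | D, G_j). Subtracting the maximum before exponentiating
    avoids underflow for very negative log scores.
    """
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Made-up values for K = 3 structures: the second has half the posterior
# mass of the first, the third is negligible.
scores = [-100.0, -100.0 + math.log(0.5), -110.0]
p = average_prediction(scores, [0.9, 0.6, 0.1])
print(p)
```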

Search: Summary

  • Discrete optimization problem
  • In general, NP-Hard
    • Need to resort to heuristic search
    • In practice, search is relatively fast (~100 vars in ~10 min):
      • Decomposability
      • Sufficient statistics
  • In some cases, we can reduce the search problem to an easy optimization problem
    • Example: learning trees

Incomplete Data


Data is often incomplete

  • Some variables of interest are not assigned value

This phenomenon happens when we have

  • Missing values
  • Hidden variables

Missing Values

Examples:

  • Survey data
  • Medical records
    • Not all patients undergo all possible tests

Missing Values (cont.)

Complicating issue:

  • The fact that a value is missing might be indicative of its value
    • The patient did not undergo X-Ray since she complained about fever and not about broken bones…
  • To learn from incomplete data we need the following assumption:

Missing at Random (MAR):

  • The probability that the value of Xi is missing is independent of its actual value given other observed values

EM (cont.)

[Figure: one EM iteration. The training data and the initial network (G, $\Theta_0$) over X1, X2, X3, H, Y1, Y2, Y3 feed the E-step, which computes the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H). The M-step reparameterizes to give the updated network (G, $\Theta_1$). Reiterate.]

EM (cont.)

Formal Guarantees:

  • $L(\Theta_1 : D) \geq L(\Theta_0 : D)$
    • Each iteration improves the likelihood
  • If $\Theta_1 = \Theta_0$, then $\Theta_0$ is a stationary point of L(Θ:D)
    • Usually, this means a local maximum

Main cost:

  • Computations of expected counts in E-Step
  • Requires a computation pass for each instance in training set
    • These are exactly the same as for gradient ascent!

Example: EM in clustering

  • Consider clustering example

[Figure: naive Bayes model in which the hidden Cluster variable is the parent of X1, X2, ..., Xn]

E-Step:

  • Compute $P(C[m] \mid X_1[m], \ldots, X_n[m], \Theta)$
    • This corresponds to “soft” assignment to clusters
  • Compute expected statistics:

$$E[N(x_i, c)] = \sum_{m : X_i[m] = x_i} P(c \mid x_1[m], \ldots, x_n[m], \Theta)$$

M-Step

  • Re-estimate P(Xi|C), P(C)
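The two steps can be sketched for binary features as follows. This is a minimal illustration, not the slides' code: the function name `em_naive_bayes`, the fixed iteration count, the initial parameters and the data are all assumptions.

```python
import math

def em_naive_bayes(data, n_clusters, init_px, n_iter=20):
    """EM for a naive Bayes clustering model with binary features.

    init_px[c][i] is the initial P(X_i = 1 | C = c); P(C) starts uniform.
    Returns P(C), P(X_i = 1 | C) and the log-likelihood seen in each E-step.
    """
    n_features = len(data[0])
    pc = [1.0 / n_clusters] * n_clusters
    px = [list(row) for row in init_px]
    history = []
    for _ in range(n_iter):
        # E-step: soft assignments P(c | x[m]) and expected counts.
        exp_c = [0.0] * n_clusters
        exp_xc = [[0.0] * n_features for _ in range(n_clusters)]
        loglik = 0.0
        for x in data:
            joint = []
            for c in range(n_clusters):
                p = pc[c]
                for i, xi in enumerate(x):
                    p *= px[c][i] if xi else 1.0 - px[c][i]
                joint.append(p)
            evidence = sum(joint)
            loglik += math.log(evidence)
            for c in range(n_clusters):
                resp = joint[c] / evidence
                exp_c[c] += resp
                for i, xi in enumerate(x):
                    exp_xc[c][i] += resp * xi
        history.append(loglik)
        # M-step: re-estimate P(C) and P(X_i | C) from the expected counts.
        pc = [n / len(data) for n in exp_c]
        px = [[exp_xc[c][i] / exp_c[c] for i in range(n_features)]
              for c in range(n_clusters)]
    return pc, px, history

# Made-up data: three mostly-on instances and three mostly-off instances.
DATA = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0)]
INIT = [[0.6, 0.6, 0.6], [0.4, 0.4, 0.4]]
pc, px, history = em_naive_bayes(DATA, n_clusters=2, init_px=INIT)
print(history[0], history[-1])
```

The recorded log-likelihoods never decrease from one iteration to the next, in line with the formal guarantee for EM.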

EM in Practice

Initial parameters:

  • Random parameters setting
  • “Best” guess from other source

Stopping criteria:

  • Small change in likelihood of data
  • Small change in parameter values

Avoiding bad local maxima:

  • Multiple restarts
  • Early “pruning” of unpromising ones

Bayesian Inference with Incomplete Data

Recall, Bayesian estimation:

$$P(x[M+1] \mid D) = \int P(x[M+1] \mid \theta)\, P(\theta \mid D)\, d\theta$$

Incomplete data:

  • No sufficient statistics (except the data)
  • Posterior does not decompose
  • No closed form solution ⇒ Need to use approximations

MAP Approximation

  • Simplest approximation: MAP parameters
    • MAP --- Maximum A-posteriori Probability

$$P(x[M+1] \mid D) \approx P(x[M+1] \mid \tilde{\theta})$$

where

$$\tilde{\theta} = \arg\max_{\theta} P(\theta \mid D)$$

Assumption:

  • Posterior mass is dominated by the MAP parameters

Finding MAP parameters:

  • Same techniques as finding ML parameters
  • Maximize P(θ|D) instead of L(θ:D)

Stochastic Approximations

Stochastic approximation:

  • Sample $\theta_1, \ldots, \theta_k$ from P(θ|D)
  • Approximate

$$P(x[M+1] \mid D) \approx \frac{1}{k} \sum_i P(x[M+1] \mid \theta_i)$$

  • We can apply Gibbs sampling to perform stochastic simulation in the “meta” network that describes the learning problem
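The averaging step can be sketched once posterior samples are available. To keep the example checkable, the sketch below assumes complete data, where the posterior is a known Dirichlet and can be sampled directly; with incomplete data the samples would have to come from a sampler such as Gibbs. The function names, the counts and the seed are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one parameter vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sampled_prediction(alpha, counts, k_samples, rng):
    """Average P(x[M+1] | theta_i) over samples theta_i from the posterior."""
    posterior = [a + n for a, n in zip(alpha, counts)]
    sums = [0.0] * len(alpha)
    for _ in range(k_samples):
        theta = sample_dirichlet(posterior, rng)
        # For a multinomial, P(x[M+1] = k | theta) is simply theta_k.
        sums = [s + t for s, t in zip(sums, theta)]
    return [s / k_samples for s in sums]

rng = random.Random(0)  # fixed seed so the run is repeatable
approx = sampled_prediction([1, 1], [7, 3], k_samples=20000, rng=rng)
exact = [(1 + 7) / 12, (1 + 3) / 12]  # closed-form Dirichlet prediction
print(approx, exact)
```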

Parameter Learning from Incomplete Data: Summary

  • Non-linear optimization problem
  • Methods for learning: EM and Gradient Ascent
    • Exploit inference for learning

Difficulties:

  • Exploration of a complex likelihood/posterior
    • More missing data ⇒ many more local maxima
    • Cannot represent posterior ⇒ must resort to approximations
  • Inference
    • Main computational bottleneck for learning
    • Learning large networks ⇒ exact inference is infeasible ⇒ resort to approximate inference