Graphical Models: Learning Bayesian Structure - Machine Learning | CMSC 726, Study notes of Computer Science

University of Maryland Computer Science

Prof. Lise Getoor

Material Type: Notes; Professor: Getoor; Class: MACHINE LEARNING; Subject: Computer Science; University: University of Maryland; Term: Spring 2006;

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-zjq-1 🇺🇸


cmsc726: Graphical Models: Learning

material from: Michael Jordan, Nir Friedman and Daphne Koller

Learning Bayesian networks

[Figure: data plus prior information are fed to an Inducer, which outputs a Bayesian network over E, R, B, A, C together with its conditional probability tables, e.g. P(A | E,B) with rows (.9, .1), (.7, .3), (.99, .01), (.8, .2).]

Known Structure -- Complete Data

[Figure: complete records over E, B, A (<Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>) and a network E → A ← B whose table P(A | E,B) is unknown ("?") go into the Inducer; the output is the same network with P(A | E,B) filled in.]

Network structure is specified
- Inducer needs to estimate parameters
Data does not contain missing values

Unknown Structure -- Complete Data

[Figure: the same complete records go into the Inducer, but the arcs among E, B, A are not given; the output is a network with both its arcs and P(A | E,B) learned.]

Network structure is not specified
- Inducer needs to select arcs & estimate parameters
Data does not contain missing values

Known Structure -- Incomplete Data

[Figure: records with missing entries (<Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>) and the network E → A ← B with unknown P(A | E,B) go into the Inducer; the output is the network with P(A | E,B) filled in.]

Network structure is specified
Data contains missing values
- We consider assignments to missing values

Known Structure / Complete Data

Given a network structure G
- And choice of parametric family for P(X_i | Pa_i)
Learn parameters for network

Goal

Construct a network that is “closest” to probability that generated the data

Learning Parameters for a Bayesian Network

[Figure: network with E → A, B → A and A → C.]

Training data has the form:

$$D = \begin{bmatrix} E[1] & B[1] & A[1] & C[1] \\ \vdots & \vdots & \vdots & \vdots \\ E[M] & B[M] & A[M] & C[M] \end{bmatrix}$$

Learning Parameters for a Bayesian Network (cont.)

Since we assume i.i.d. samples, likelihood function is

$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$$

By definition of network, we get

$$L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$$

Rewriting terms, we get

$$L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)$$
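Because the likelihood splits into one product per family, each conditional distribution can be fit separately by counting. The following is a minimal sketch of that idea for the E, B, A, C network; the function name `mle_cpts`, the `PARENTS` table and the four toy records are assumptions made up for illustration, not part of the original notes.

```python
from collections import Counter

# Structure assumed from the factorization above: E and B are roots,
# A has parents (E, B), and C has parent A.
PARENTS = {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}

def mle_cpts(data, parents=PARENTS):
    """Maximum-likelihood P(X | Pa) for each family, estimated independently by counting."""
    cpts = {}
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        # N(x, pa) / N(pa) for every observed (pa, x) combination
        cpts[var] = {(pa_val, x): n / marg[pa_val] for (pa_val, x), n in joint.items()}
    return cpts

# Toy complete data set (hypothetical values, 1 = yes, 0 = no).
data = [
    {"E": 1, "B": 0, "A": 0, "C": 0},
    {"E": 1, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 0, "A": 1, "C": 1},
    {"E": 0, "B": 1, "A": 1, "C": 0},
]
cpts = mle_cpts(data)
print(cpts["E"])  # estimate of P(E)
print(cpts["C"])  # estimate of P(C | A)
```

Unobserved parent configurations simply do not appear in the returned tables, which is one reason the notes go on to priors.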

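General Bayesian Networks

Generalizing for any Bayesian network:

$$\begin{aligned}
L(\Theta : D) &= \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) && \text{(i.i.d. samples)} \\
&= \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \Theta_i) && \text{(network factorization)} \\
&= \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)
\end{aligned}$$

The likelihood decomposes according to the structure of the network.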

General Bayesian Networks (Cont.)

Decomposition ⇒ Independent Estimation Problems

If the parameters for each family are not related, then they can be estimated independently of each other.

Dirichlet Priors

Recall that the likelihood function for a multinomial is

$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$

A Dirichlet prior with hyperparameters α_1,…,α_K is defined as

$$P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \quad \text{for legal } \theta_1, \ldots, \theta_K$$

Then the posterior has the same form, with hyperparameters α_1+N_1,…,α_K+N_K

$$P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$

Dirichlet Priors (cont.)

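We can compute the prediction on a new event in closed form:
If P(Θ) is Dirichlet with hyperparameters α_1,…,α_K then

$$P(X[1] = k) = \int \theta_k \, P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_\ell \alpha_\ell}$$

Since the posterior is also Dirichlet, we get

$$P(X[M+1] = k \mid D) = \int \theta_k \, P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_\ell (\alpha_\ell + N_\ell)}$$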
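The closed-form prediction amounts to adding the hyperparameters to the observed counts and normalizing. A minimal sketch follows; the function name `dirichlet_predictive` and the example counts are illustrative assumptions.

```python
def dirichlet_predictive(alpha, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alpha, counts))
    return [(a + n) / total for a, n in zip(alpha, counts)]

# With no data the prediction is alpha_k / sum_l alpha_l.
prior_pred = dirichlet_predictive([1, 1], [0, 0])
# Dirichlet(1,1) prior after observing N_1 = 3 and N_2 = 1.
post_pred = dirichlet_predictive([1, 1], [3, 1])
print(prior_pred, post_pred)
```

Passing zero counts recovers the prior prediction, so the same function covers both formulas.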

Dirichlet Priors -- Example

[Figure: densities of Dirichlet(1,1), Dirichlet(2,2), Dirichlet(0.5,0.5) and Dirichlet(5,5), plotted for θ between 0 and 1 with density values from 0 to 5.]

Prior Knowledge

The hyperparameters α_1,…,α_K can be thought of as “imaginary” counts from our prior experience
Equivalent sample size = α_1+…+α_K
The larger the equivalent sample size, the more confident we are in our prior

Effect of Priors (cont.)

In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: P(X = 1|D) as a function of the number of tosses N (5 to 50) for MLE, Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5) and Dirichlet(10,10); a second panel shows the toss results (0 or 1) against N.]

Conjugate Families

The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
- Dirichlet prior is a conjugate family for the multinomial likelihood
Conjugate families are useful since:
- For many distributions we can represent them with hyperparameters
- They allow for sequential update within the same representation
- In many cases we have closed-form solution for prediction

Bayesian Networks and Bayesian Prediction

Priors for each parameter group are independent
Data instances are independent given the unknown parameters

[Figure: meta-network in which θ_X is a parent of X[1], X[2], …, X[M], X[M+1] and θ_{Y|X} is a parent of Y[1], Y[2], …, Y[M], Y[M+1], with X[m] → Y[m]; instances 1..M are the observed data and instance M+1 is the query. The same model is also drawn in plate notation, with a plate over m containing X[m] and Y[m].]

Bayesian Networks and Bayesian Prediction (Cont.)

We can also “read” from the network:
Complete data ⇒ posteriors on parameters are independent

[Figure: the same meta-network and plate model as on the previous slide.]

Bayesian Prediction (cont.)

Since posteriors on parameters for each family are independent, we can compute them separately
Posteriors for parameters within families are also independent:
Complete data ⇒ independent posteriors on θ_{Y|X=0} and θ_{Y|X=1}

[Figure: plate model with θ_X → X[m] → Y[m] ← θ_{Y|X}, and a refined model in which θ_{Y|X} is split into θ_{Y|X=0} and θ_{Y|X=1}.]

Bayesian Prediction (cont.)

Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently
The posterior is Dirichlet with parameters α(X_i=1|pa_i)+N(X_i=1|pa_i),…, α(X_i=k|pa_i)+N(X_i=k|pa_i)
The predictive distribution is then represented by the parameters

$$\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$$
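The predictive parameters for one family can be computed directly from counts. The sketch below assumes, purely for illustration, that every cell receives the same prior count `alpha`, so that α(pa_i) is `alpha` times the number of values of X_i; the function name `bayes_cpt` and the toy records are hypothetical.

```python
from collections import Counter

def bayes_cpt(data, child, parents, values, alpha=1.0):
    """Return theta(x, pa) = (alpha(x,pa) + N(x,pa)) / (alpha(pa) + N(pa)).

    Assumption: alpha(x, pa) = alpha for every cell, hence alpha(pa) = alpha * len(values).
    """
    n_joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    n_pa = Counter(tuple(r[p] for p in parents) for r in data)
    alpha_pa = alpha * len(values)

    def theta(x, pa):
        # Counter returns 0 for combinations never seen in the data.
        return (alpha + n_joint[(pa, x)]) / (alpha_pa + n_pa[pa])

    return theta

data = [{"X": 0, "Y": 1}, {"X": 0, "Y": 1}, {"X": 0, "Y": 0}]
theta = bayes_cpt(data, child="Y", parents=("X",), values=(0, 1))
print(theta(1, (0,)))  # parent value seen three times
print(theta(1, (1,)))  # parent value never seen: falls back to the prior
```

Unlike the maximum-likelihood estimate, the result is defined even for parent configurations that never occur in the data.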

Learning Parameters: Case Study (cont.)

Experiment:

Sample a stream of instances from the alarm network
Learn parameters using
- MLE estimator
- Bayesian estimator with uniform prior with different strengths

Learning Parameters: Case Study (cont.)

[Figure: KL divergence (0 to 1) against the number of instances M (0 to 5000) for MLE and for Bayes w/ Uniform Prior at four prior strengths M'; the four M' values are missing from the legend.]

Likelihood Score for Structures

First cut approach:

Use likelihood function
Recall, the likelihood score for a network structure and parameters is

$$L(G, \Theta_G : D) = \prod_m P(x_1[m], \ldots, x_n[m] : G, \Theta_G) = \prod_m \prod_i P(x_i[m] \mid Pa_i^G[m] : G, \Theta_{G,i})$$

Since we know how to maximize parameters, from now on we assume

$$L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$$

Likelihood Score for Structure (cont.)

Bad news:

Adding arcs always helps
- Maximal score attained by fully connected networks
- Such networks can overfit the data --- parameters capture the noise in the data
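A small experiment makes the "adding arcs always helps" point concrete: even when two variables are sampled independently, the maximized log-likelihood with an arc between them is never lower than without it. This is a sketch under stated assumptions; `max_loglik` and the synthetic data are made up for illustration.

```python
import math
import random
from collections import Counter

def max_loglik(data, parents):
    """Log-likelihood at the MLE parameters: sum over families of N(x,pa) * log(N(x,pa) / N(pa))."""
    ll = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        ll += sum(n * math.log(n / marg[pa_val]) for (pa_val, _), n in joint.items())
    return ll

random.seed(0)
# X and Y are sampled independently, so the generating structure has no arc.
data = [{"X": random.randint(0, 1), "Y": random.randint(0, 1)} for _ in range(200)]

ll_empty = max_loglik(data, {"X": (), "Y": ()})
ll_arc = max_loglik(data, {"X": (), "Y": ("X",)})
print(ll_empty, ll_arc)  # the second value is at least as large as the first
```

The gap equals the number of samples times the empirical mutual information between X and Y, which is non-negative and is typically positive on finite samples purely because of noise.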

Avoiding Overfitting

“Classic” issue in learning.

Approaches:

Restricting the hypothesis space
- Limits the overfitting capability of the learner
- Example: restrict # of parents or # of parameters
Minimum description length
- Description length measures complexity
- Prefer models that compactly describe the training data
Bayesian methods
- Average over all possible parameter values
- Use prior knowledge

Bayesian Inference

Bayesian reasoning: compute expectation over unknown G

$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$

Assumption: Gs are mutually exclusive and exhaustive
We know how to compute P(x[M+1]|G,D)
- Same as prediction with fixed structure
How do we compute P(G|D)?

Posterior Score

Using Bayes rule:

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

where P(D|G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the probability of the data.

P(D) is the same for all structures G
- Can be ignored when comparing structures

Marginal Likelihood

By introduction of variables, we have that

$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$

where P(D|G,θ) is the likelihood and P(θ|G) is the prior over parameters.
This integral measures sensitivity to choice of parameters

Marginal Likelihood: Multinomials

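For multinomials with Dirichlet prior:

P(Θ) is Dirichlet with hyperparameters α_1,…,α_K
D is a dataset with sufficient statistics N_1,…,N_K

Then

$$P(D) = \frac{\Gamma\left(\sum_\ell \alpha_\ell\right)}{\Gamma\left(\sum_\ell (\alpha_\ell + N_\ell)\right)} \prod_\ell \frac{\Gamma(\alpha_\ell + N_\ell)}{\Gamma(\alpha_\ell)}$$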
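The Gamma-function formula is best evaluated in log space to avoid overflow. A minimal sketch using the standard-library `math.lgamma` follows; the function name `log_marginal_likelihood` and the example counts are illustrative assumptions.

```python
from math import exp, lgamma

def log_marginal_likelihood(alpha, counts):
    """log P(D) for one observed sequence with sufficient statistics `counts`
    under a Dirichlet prior with hyperparameters `alpha`."""
    a_sum = sum(alpha)
    n_sum = sum(counts)
    out = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for a, n in zip(alpha, counts):
        out += lgamma(a + n) - lgamma(a)
    return out

# Dirichlet(1,1) prior, one observation of each outcome.
p = exp(log_marginal_likelihood([1, 1], [1, 1]))
print(p)
```

For this example the result can be checked by hand with the sequential predictions: the first outcome has probability 1/2 and the second, different outcome has probability 1/3, giving 1/6.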

Marginal Likelihood for General Network

The marginal likelihood has the form:

$$P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$$

where

N(..) are the counts from the data
α(..) are the hyperparameters for each family given G

Each factor inside the product over pa_i^G is the Dirichlet marginal likelihood for the sequence of values of X_i when X_i’s parents have a particular value

Priors

We need: prior counts α(..) for each network structure G

This can be a formidable task
- There are exponentially many structures…

BDe Score

Possible solution: The BDe prior

Represent prior using two elements M_0, B_0
- M_0 - equivalent sample size
- B_0 - network representing the prior probability of events

BDe Score (cont.)

Intuition: M_0 prior examples distributed by B_0

Set α(x_i, pa_i^G) = M_0 P(x_i, pa_i^G | B_0)
- Note that pa_i^G are not the same as the parents of X_i in B_0.
- Compute P(x_i, pa_i^G | B_0) using standard inference procedures
Such priors have desirable theoretical properties
- Equivalent networks are assigned the same score
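The BDe prior counts can be illustrated on a tiny example. The sketch below replaces "standard inference procedures" with brute-force summation over a full joint table for B_0, which is only feasible for a handful of variables; the function name `bde_alpha` and the uniform joint are assumptions made for illustration.

```python
from itertools import product

def bde_alpha(joint_b0, m0, var_index, parent_indices):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | B0), with the marginal of B0
    obtained by summing a full joint table (brute force, small networks only)."""
    alpha = {}
    for assignment, prob in joint_b0.items():
        key = (assignment[var_index], tuple(assignment[j] for j in parent_indices))
        alpha[key] = alpha.get(key, 0.0) + m0 * prob
    return alpha

# Hypothetical prior network B0 over two binary variables (X0, X1),
# represented by its joint distribution; here simply uniform.
joint_b0 = {a: 0.25 for a in product((0, 1), repeat=2)}

alpha_with_parent = bde_alpha(joint_b0, m0=4, var_index=1, parent_indices=(0,))
alpha_no_parent = bde_alpha(joint_b0, m0=4, var_index=1, parent_indices=())
print(alpha_with_parent)
print(alpha_no_parent)
```

Whatever parent set the candidate structure G gives to a variable, the prior counts for that family sum to M_0, which is what "M_0 prior examples distributed by B_0" expresses.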

Bayesian Score: Asymptotic Behavior

Theorem: If the prior P(Θ|G) is “well-behaved”, then

$$\log P(D \mid G) = l(G : D) - \frac{\log M}{2} \dim(G) + O(1)$$
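Dropping the O(1) term gives a score that is easy to compute from counts and is often called the BIC score. The sketch below assumes discrete variables and takes dim(G) to be the number of free parameters; the function name `bic_score`, the `card` dictionary and the balanced toy data are illustrative assumptions.

```python
import math
from collections import Counter

def bic_score(data, parents, card):
    """l(G:D) - (log M / 2) * dim(G) for discrete data.

    card[v] is the number of values of variable v; dim(G) counts free parameters.
    """
    m = len(data)
    ll, dim = 0.0, 0
    for var, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        ll += sum(n * math.log(n / marg[pv]) for (pv, _), n in joint.items())
        dim += (card[var] - 1) * math.prod(card[p] for p in pa)
    return ll - 0.5 * math.log(m) * dim

# Perfectly balanced data in which X and Y are empirically independent.
data = [{"X": x, "Y": y} for x in (0, 1) for y in (0, 1)] * 25
card = {"X": 2, "Y": 2}

bic_empty = bic_score(data, {"X": (), "Y": ()}, card)
bic_arc = bic_score(data, {"X": (), "Y": ("X",)}, card)
print(bic_empty, bic_arc)
```

Here the extra arc buys no likelihood, so the penalty term makes the empty network score higher, in contrast to the pure likelihood score.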

Heuristic Search (cont.)

Typical operations:

Add C → D
Delete C → E
Reverse C → E

[Figure: four networks over S, C, E, D showing a starting network and the result of applying each of the three operations.]

Exploiting Decomposability in Local Search

Caching: To update the score after a local change, we only need to re-score the families that were changed in the last move

[Figure: four networks over S, C, E, D illustrating which families change after a local move.]

Greedy Hill-Climbing

Simplest heuristic local search
- Start with a given network
  - empty network
  - best tree
  - a random network
- At each iteration
  - Evaluate all possible changes
  - Apply change that leads to best improvement in score
  - Reiterate
- Stop when no modification improves score
Each step requires evaluating approximately n new changes
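The loop above can be sketched in a few dozen lines. This version starts from the empty network, uses a BIC-style decomposable score (an assumption; any decomposable score would do), and for brevity re-scores the whole network for every candidate instead of caching family scores. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, var, pa, card):
    """BIC-style contribution of one family (assumed score, decomposable)."""
    joint = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
    marg = Counter(tuple(r[p] for p in pa) for r in data)
    ll = sum(n * math.log(n / marg[pv]) for (pv, _), n in joint.items())
    dim = (card[var] - 1) * math.prod(card[p] for p in pa)
    return ll - 0.5 * math.log(len(data)) * dim

def score(data, parents, card):
    return sum(family_score(data, v, tuple(sorted(pa)), card) for v, pa in parents.items())

def is_acyclic(parents):
    """Repeatedly remove nodes that have no parents among the remaining nodes."""
    remaining = set(parents)
    while remaining:
        free = {v for v in remaining if not (parents[v] & remaining)}
        if not free:
            return False
        remaining -= free
    return True

def neighbors(parents):
    """All acyclic graphs that are one add, delete or reverse away."""
    for u, v in permutations(parents, 2):
        g = {k: set(s) for k, s in parents.items()}
        if u in g[v]:
            g[v].remove(u)                      # delete u -> v
            yield g
            r = {k: set(s) for k, s in g.items()}
            r[u].add(v)                         # reverse u -> v
            if is_acyclic(r):
                yield r
        else:
            g[v].add(u)                         # add u -> v
            if is_acyclic(g):
                yield g

def hill_climb(data, variables, card):
    current = {v: set() for v in variables}     # start with the empty network
    best = score(data, current, card)
    while True:
        scored = [(score(data, g, card), g) for g in neighbors(current)]
        if not scored:
            return current, best
        top, g = max(scored, key=lambda t: t[0])
        if top <= best + 1e-9:                  # stop when no change improves the score
            return current, best
        current, best = g, top

# Toy data: Y copies X, and Z is balanced against both.
data = [{"X": i % 2, "Y": i % 2, "Z": (i // 2) % 2} for i in range(200)]
learned, learned_score = hill_climb(data, ["X", "Y", "Z"], {"X": 2, "Y": 2, "Z": 2})
print(learned, learned_score)
```

On this data the search connects X and Y and leaves Z isolated; reversing the learned arc gives an equal score, which is the plateau behaviour between equivalent networks.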

Greedy Hill-Climbing: Possible Pitfalls

Greedy Hill-Climbing can get stuck in:
- Local Maxima:
  - All one-edge changes reduce the score
- Plateaus:
  - Some one-edge changes leave the score unchanged
  - Happens because equivalent networks receive the same score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
- Random restarts
- TABU search

Model Selection

So far, we focused on a single model
- Find best scoring model
- Use it to predict next example
Implicit assumption:
- Best scoring model dominates the weighted sum
Pros:
- We get a single structure
- Allows for efficient use in our tasks
Cons:
- We are committing to the independencies of a particular structure
- Other structures might be as probable given the data

Model Averaging

Recall, Bayesian analysis started with

$$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid D, G)\, P(G \mid D)$$

- This requires us to average over all possible models

Model Averaging (cont.)

Full Averaging
- Sum over all structures
- Usually intractable---there are exponentially many structures
Approximate Averaging
- Find K largest scoring structures
- Approximate the sum by averaging over their predictions
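Approximate averaging can be sketched as a weighted mean of the K predictions, with weights proportional to each structure's posterior score. The version below assumes the scores are given as log P(D|G) + log P(G) and normalizes them over the K structures kept; the function name and the numbers are made up for illustration.

```python
import math

def average_predictions(log_scores, predictions):
    """Weight each structure's prediction by exp(log score), normalized over the K structures."""
    top = max(log_scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    z = sum(weights)
    return sum(w / z * p for w, p in zip(weights, predictions))

# Hypothetical log scores for three structures and each one's P(x[M+1] | D, G).
p = average_predictions([-100.0, -100.0, -110.0], [0.2, 0.4, 0.9])
print(p)
```

The third structure scores 10 log units lower, so it contributes almost nothing and the result stays close to the mean of the first two predictions.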

Search: Summary

Discrete optimization problem
In general, NP-Hard
- Need to resort to heuristic search
- In practice, search is relatively fast (~100 vars in ~10 min):
  - Decomposability
  - Sufficient statistics
In some cases, we can reduce the search problem to an easy optimization problem
- Example: learning trees

Incomplete Data

Data is often incomplete
- Some variables of interest are not assigned a value

This phenomenon happens when we have
- Missing values
- Hidden variables

Missing Values

Examples:

Survey data
Medical records
- Not all patients undergo all possible tests

Missing Values (cont.)

Complicating issue:

The fact that a value is missing might be indicative of its value
- The patient did not undergo X-Ray since she complained about fever and not about broken bones…
To learn from incomplete data we need the following assumption:
Missing at Random (MAR):
- The probability that the value of X_i is missing is independent of its actual value given other observed values

EM (cont.)

[Figure: the EM loop. Training data and the initial network (G,Θ_0), in which X_1, X_2, X_3 are parents of a hidden variable H and H is the parent of Y_1, Y_2, Y_3, feed the E-Step computation of expected counts; the M-Step reparameterizes to give the updated network (G,Θ_1); reiterate.]

Expected counts computed in the E-Step:
N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H)

EM (cont.)

Formal Guarantees:

L(Θ_1:D) ≥ L(Θ_0:D)
- Each iteration improves the likelihood (or leaves it unchanged)
If Θ_1 = Θ_0, then Θ_0 is a stationary point of L(Θ:D)
- Usually, this means a local maximum

Main cost:

Computations of expected counts in E-Step
Requires a computation pass for each instance in training set
- These are exactly the same as for gradient ascent!

Example: EM in clustering

Consider clustering example

[Figure: naive Bayes model with Cluster as the parent of X_1, X_2, ..., X_n.]

E-Step:

Compute P(C[m]|X_1[m],…,X_n[m],Θ)
This corresponds to “soft” assignment to clusters
Compute expected statistics:

$$E[N(x_i, c)] = \sum_{m :\, X_i[m] = x_i} P(c \mid x_1[m], \ldots, x_n[m], \Theta)$$

M-Step:

Re-estimate P(X_i|C), P(C)
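The two steps can be written out for a naive Bayes mixture over binary features. This is a minimal sketch, not a robust implementation: it assumes binary features, a fixed number of clusters, and starting parameters strictly between 0 and 1; the function name `em_step`, the toy data and the initial values are all illustrative.

```python
import math

def em_step(data, prior, cond):
    """One EM iteration for a naive Bayes mixture over binary features.

    prior[c] = P(C=c); cond[c][i] = P(X_i=1 | C=c).
    Returns the re-estimated parameters and the log-likelihood of the input parameters.
    """
    k, n = len(prior), len(data[0])
    n_c = [0.0] * k
    n_xc = [[0.0] * n for _ in range(k)]
    loglik = 0.0
    for x in data:
        # E-step: P(c | x_1..x_n, Theta), the "soft" assignment of this instance
        joint = [
            prior[c] * math.prod(cond[c][i] if x[i] else 1 - cond[c][i] for i in range(n))
            for c in range(k)
        ]
        z = sum(joint)
        loglik += math.log(z)
        for c in range(k):
            w = joint[c] / z
            n_c[c] += w                      # expected count E[N(c)]
            for i in range(n):
                n_xc[c][i] += w * x[i]       # expected count E[N(x_i = 1, c)]
    # M-step: re-estimate P(C) and P(X_i | C) from the expected counts
    new_prior = [n_c[c] / len(data) for c in range(k)]
    new_cond = [[n_xc[c][i] / n_c[c] for i in range(n)] for c in range(k)]
    return new_prior, new_cond, loglik

# Toy data: three binary features, two loose groups of instances.
data = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0)] * 5
prior, cond = [0.5, 0.5], [[0.6, 0.6, 0.6], [0.4, 0.4, 0.4]]
history = []
for _ in range(20):
    prior, cond, ll = em_step(data, prior, cond)
    history.append(ll)
print(history[0], history[-1])
```

Recording the log-likelihood at each iteration gives a cheap check: the sequence should never decrease.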

EM in Practice

Initial parameters:

Random parameters setting
“Best” guess from other source

Stopping criteria:

Small change in likelihood of data
Small change in parameter values

Avoiding bad local maxima:

Multiple restarts
Early “pruning” of unpromising ones

Bayesian Inference with Incomplete Data

Recall, Bayesian estimation:

$$P(x[M+1] \mid D) = \int P(x[M+1] \mid \theta)\, P(\theta \mid D)\, d\theta$$

Incomplete data:

No sufficient statistics (except the data)
Posterior does not decompose
No closed form solution
⇒ Need to use approximations

MAP Approximation

Simplest approximation: MAP parameters
- MAP --- Maximum A-posteriori Probability

$$P(x[M+1] \mid D) \approx P(x[M+1] \mid \tilde{\theta})$$

where

$$\tilde{\theta} = \arg\max_\theta P(\theta \mid D)$$

Assumption:

Posterior mass is dominated by the MAP parameters

Finding MAP parameters:

Same techniques as finding ML parameters
Maximize P(θ|D) instead of L(θ:D)

Stochastic Approximations

Stochastic approximation:

Sample θ_1, …, θ_k from P(θ|D)
Approximate

$$P(x[M+1] \mid D) \approx \frac{1}{k} \sum_i P(x[M+1] \mid \theta_i)$$

We can apply Gibbs sampling to perform stochastic simulation in the “meta” network that describes the learning problem
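The averaging step can be illustrated in a setting where the posterior can be sampled directly. The sketch below uses a complete-data Dirichlet posterior as a stand-in (an assumption made so the result can be compared with the closed form); with incomplete data the samples would instead come from a procedure such as Gibbs sampling. The function names and numbers are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def mc_predictive(alpha, counts, k_samples, rng):
    """(1/k) * sum_i P(x[M+1] | theta_i), with theta_i sampled from P(theta | D)."""
    post = [a + n for a, n in zip(alpha, counts)]
    acc = [0.0] * len(post)
    for _ in range(k_samples):
        theta = sample_dirichlet(post, rng)
        # For a multinomial, P(x[M+1] = j | theta) is simply theta_j.
        acc = [s + t for s, t in zip(acc, theta)]
    return [s / k_samples for s in acc]

rng = random.Random(0)
approx = mc_predictive([1, 1], [3, 1], 20000, rng)
print(approx)  # close to the closed-form prediction [4/6, 2/6]
```

The estimate approaches the exact prediction as the number of samples grows, which is the trade-off the stochastic approximation makes when no closed form is available.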

Parameter Learning from Incomplete Data: Summary

Non-linear optimization problem
Methods for learning: EM and Gradient Ascent
- Exploit inference for learning

Difficulties:

Exploration of a complex likelihood/posterior
- More missing data ⇒ many more local maxima
- Cannot represent posterior ⇒ must resort to approximations
Inference
- Main computational bottleneck for learning
- Learning large networks ⇒ exact inference is infeasible ⇒ resort to approximate inference
