Conditional Random Fields: Global Normalization and Label Bias Problem, Study notes of Computer Science

Cornell University Computer Science

These slides cover the label bias problem in maximum-entropy Markov models (MEMMs) and how conditional random fields (CRFs) address it through global normalization. They compare CRFs with hidden Markov models (HMMs), introduce Markov random fields (MRFs) and the Hammersley-Clifford theorem, and present the objective function for training CRFs.

Typology: Study notes

Pre 2010

Uploaded on 08/31/2009

koofers-user-f76 🇺🇸


Conditional Random Fields:

Probabilistic Models for Segmenting

and Labeling Sequence Data

John Lafferty, Andrew McCallum, Fernando Pereira

Presenter: Yejin Choi

Label Bias Problem

• Suppose the [NNS => VB] transition is more frequent than [NNS => IN].

• Suppose that from [VB], only the [VB => VBG] transition is possible.

=> Now, what is $P(Y_i = \text{VBG} \mid Y_{i-1} = \text{VB}, X_i = \text{merino})$?

• Recall that an MEMM models $P(Y_i \mid Y_{i-1}, X_i)$.

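[Figure: two tag sequences, S1 and S2, drawn as state paths that share their first and last states. Arcs: observations (X); nodes: outputs (Y).

One path: NNS (0) => VB (4) => VBG (5) => NN (3), for the words Carnivores (-), like (r), eating (o), sheep (b).

The other path: NNS (0) => IN (1) => JJ (2) => NN (3), for the words Herbivores (-), like (r), merino (i), sheep (b).]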

Label Bias Problem

$P(Y_i = \text{VBG} \mid Y_{i-1} = \text{VB}, X_i = \text{???}) = 1$ for any $X_i$, because VBG is the only successor of VB and each local distribution must sum to 1.

So, how do we fix this nonsense?

=> Do not normalize on each node!

Instead, normalize over the entire sequence.

This motivates the “global normalization” scheme of CRFs.

Other approaches: Cohen and Carvalho (2005), Sutton and McCallum (2005)

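[Figure: the same two tag sequences, now with the first observations labeled Carnivores (c) and Herbivores (h).

One path: NNS (0) => VB (4) => VBG (5) => NN (3), for the words Carnivores (c), like (r), eating (o), sheep (b).

The other path: NNS (0) => IN (1) => JJ (2) => NN (3), for the words Herbivores (h), like (r), merino (i), sheep (b).]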
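The effect of per-node normalization can be checked numerically. The sketch below is a toy illustration only: the tag graph follows the slide, but every score is a made-up number, and the helper names (`score`, `memm_prob`, `global_prob`) are hypothetical. It compares locally normalized path probabilities with a single normalization over whole paths.

```python
# Toy illustration of label bias; all scores below are made-up numbers.
ALLOWED = {
    "NNS": ["VB", "IN"],  # two successors compete after NNS
    "VB": ["VBG"],        # VBG is the only successor of VB
    "IN": ["JJ"],         # JJ is the only successor of IN
}

# Hypothetical unnormalized compatibility of (previous tag, word, tag).
SCORES = {
    ("NNS", "like", "VB"): 3.0,
    ("NNS", "like", "IN"): 2.0,
    ("VB", "eating", "VBG"): 5.0,
    ("VB", "merino", "VBG"): 0.001,  # "merino" fits VBG very badly
    ("IN", "merino", "JJ"): 4.0,
}


def score(prev_tag, word, tag):
    """Unnormalized local score; unseen combinations get a small default."""
    return SCORES.get((prev_tag, word, tag), 0.01)


def memm_prob(prev_tag, word, tag):
    """P(tag | prev_tag, word), normalized over the successors of prev_tag."""
    z = sum(score(prev_tag, word, t) for t in ALLOWED[prev_tag])
    return score(prev_tag, word, tag) / z


def memm_path_prob(tags, words):
    """Product of locally normalized probabilities along one path."""
    p = 1.0
    for prev_tag, tag, word in zip(tags, tags[1:], words):
        p *= memm_prob(prev_tag, word, tag)
    return p


def path_weight(tags, words):
    """Product of unnormalized scores along one path."""
    w = 1.0
    for prev_tag, tag, word in zip(tags, tags[1:], words):
        w *= score(prev_tag, word, tag)
    return w


def global_prob(tags, words, candidates):
    """Path weight normalized once, over all candidate paths."""
    return path_weight(tags, words) / sum(path_weight(c, words) for c in candidates)


words = ["like", "merino"]
verb_path = ["NNS", "VB", "VBG"]
adj_path = ["NNS", "IN", "JJ"]
paths = [verb_path, adj_path]

print(memm_prob("VB", "merino", "VBG"))
print(memm_path_prob(verb_path, words), memm_path_prob(adj_path, words))
print(global_prob(verb_path, words, paths), global_prob(adj_path, words, paths))
```

With these numbers the locally normalized model prefers the VB => VBG path for "like merino", because the bad fit of "merino" is normalized away, while the globally normalized score prefers IN => JJ.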

Label Bias Problem

Wait, what about HMMs?

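• Recall that an HMM models $P(X_i \mid Y_i)$ and $P(Y_i \mid Y_{i-1})$

(so that $P(X_i \mid Y_i, Y_{i-1}) \cdot P(Y_i \mid Y_{i-1}) = P(X_i, Y_i \mid Y_{i-1})$).

Then we can get $P(\text{merino} \mid \text{VBG}) = 0$, which rules out the VB => VBG path for “merino”!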

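A small numeric sketch of the HMM case, under assumed parameters: the transition and emission tables below are invented for illustration, and `hmm_joint` is a hypothetical helper. Because the HMM multiplies in an emission term $P(X_i \mid Y_i)$, a zero emission probability removes the whole path.

```python
# Hypothetical HMM parameters for the toy example; all numbers are made up.
TRANSITION = {
    ("NNS", "VB"): 0.6,
    ("NNS", "IN"): 0.4,
    ("VB", "VBG"): 1.0,
    ("IN", "JJ"): 1.0,
}

EMISSION = {
    ("VB", "like"): 0.5,
    ("IN", "like"): 0.5,
    ("VBG", "eating"): 1.0,
    ("VBG", "merino"): 0.0,  # VBG never emits "merino"
    ("JJ", "merino"): 1.0,
}


def hmm_joint(tags, words):
    """P(words, tags[1:] | tags[0]) = prod_i P(y_i | y_{i-1}) * P(x_i | y_i)."""
    p = 1.0
    for prev_tag, tag, word in zip(tags, tags[1:], words):
        p *= TRANSITION.get((prev_tag, tag), 0.0) * EMISSION.get((tag, word), 0.0)
    return p


words = ["like", "merino"]
verb_joint = hmm_joint(["NNS", "VB", "VBG"], words)
adj_joint = hmm_joint(["NNS", "IN", "JJ"], words)
print(verb_joint, adj_joint)
```

Here the more frequent NNS => VB transition no longer wins, since the emission term for "merino" under VBG is zero.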

Label Bias vs. Observation Bias

Both have to do with local conditional normalization.

• Observation bias (Klein and Manning, 2002): the observation explains the current state so well that the state transition is ignored.

• Label bias (Bottou, 1991; Lafferty et al., 2001): the previous state explains the current state so well that the observation is ignored.

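[Figure, with panels labeled MEMM and HMM: e.g., for the words “All the indexes above”, the tagging DT NNS VBD is marked Incorrect (scores -0., -5.) and the tagging PDT DT NNS VBD is marked Correct (scores -1., -0.).]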

So, let’s normalize globally! (But, how?)

• What we need is the global joint distribution, i.e., $p(y \mid x)$, where $y = (y_1, \dots, y_n)$ and $x = (x_1, \dots, x_n)$.

– We do not want distributions on individual nodes, i.e., $p(y_i \mid x_i)$.

– Instead, we want a non-probabilistic potential function, i.e., $\phi(y_i, x_i)$.

– $p(y \mid x) = g(\phi(y_1, x_1), \dots, \phi(y_n, x_n))$

• Problem with directed graphs (like Bayesian networks):

– a probability distribution should be given for each node

– then the joint probability is $p(y) = \prod_i p(y_i \mid \text{parents}(y_i))$

– btw, what is $\text{parents}(y_i)$ for an MEMM?

• Markov Random Field! (= Markov network, random field)

Markov Random Field

• But how do we compute the global joint distribution $P(Y)$ out of this?

– Besides, we don’t want to compute $P(Y_i \mid \text{neighbors}(Y_i))$!

Just for now, forget about conditioning on X …

Hammersley-Clifford theorem (1971)

Given an MRF $G = (Y, E)$ such that $P(Y_i \mid Y \setminus Y_i) = P(Y_i \mid \text{nbr}(Y_i))$,

and given $\phi_c(Y_c)$ for each clique $C$ in $G$, such that $\phi_c(Y_c) \ge 0$:

$$P(Y) = \frac{1}{Z} \prod_{\text{clique } C} \phi_c(Y_c), \quad \text{where } Z = \sum_{Y} \prod_{c} \phi_c(Y_c)$$

– cliques may overlap.

– cliques may not be maximal.

This implies we don’t need to compute $P(Y_i \mid \text{nbr}(Y_i))$ to get $P(Y)$!
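The factorization can be tried out by brute force on a tiny graph. This is a minimal sketch under assumptions: a three-node chain Y0 - Y1 - Y2 with binary variables and two invented pairwise potentials (`phi_01`, `phi_12` are hypothetical names). It builds P(Y) from the product of clique potentials divided by Z, then checks the Markov property from the joint alone.

```python
from itertools import product


# Hypothetical pairwise potentials on the chain Y0 - Y1 - Y2 (made-up values).
def phi_01(a, b):
    return 2.0 if a == b else 1.0


def phi_12(b, c):
    return 3.0 if b == c else 0.5


def unnormalized(y):
    """Product of clique potentials for one full assignment y = (y0, y1, y2)."""
    return phi_01(y[0], y[1]) * phi_12(y[1], y[2])


STATES = list(product([0, 1], repeat=3))
Z = sum(unnormalized(y) for y in STATES)  # sum over all assignments


def P(y):
    """Joint probability from the clique factorization."""
    return unnormalized(y) / Z


def cond_y0(y0, y1, y2=None):
    """P(Y0 = y0 | Y1 = y1) or, if y2 is given, P(Y0 = y0 | Y1 = y1, Y2 = y2)."""
    keep = [y for y in STATES if y[1] == y1 and (y2 is None or y[2] == y2)]
    denominator = sum(P(y) for y in keep)
    numerator = sum(P(y) for y in keep if y[0] == y0)
    return numerator / denominator


print(Z)
print(cond_y0(1, 1), cond_y0(1, 1, 0), cond_y0(1, 1, 1))
```

The conditional of Y0 given its neighbor Y1 does not change when Y2 is also given, and it was never needed to construct P(Y).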

Computing Z(x)!

…

for linear-chain CRFs

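[Figure: lattice with three positions, each with two candidate labels (name, nonName); the eight edges carry potentials a, b, c, d, e, f, g, h, and path weights are products such as ae, af, bg, bh.]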

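$$l(\lambda) = \sum_j \log p(y^{(j)} \mid x^{(j)}) = \sum_j \Big( \lambda \cdot F(y^{(j)}, x^{(j)}) - \log \sum_{y} \exp\big(\lambda \cdot F(y, x^{(j)})\big) \Big)$$

$$\phi(y_i, y_{i+1}, x) = \exp \sum_k \lambda_k f_k(y_i, y_{i+1}, x)$$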

This diagram is from William Cohen’s slides.
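For a linear chain, Z(x) can be computed with a forward recursion instead of summing over every label sequence. The sketch below assumes a name/nonName lattice with three positions and eight invented edge potentials, already evaluated for a fixed x; `forward_Z` and `brute_force_Z` are hypothetical helper names, and the brute-force sum is there only as a cross-check.

```python
from itertools import product

LABELS = ["name", "nonName"]


def forward_Z(potentials):
    """Partition function of a chain with len(potentials) + 1 positions.

    potentials[i][(a, b)] is the edge potential phi(y_i = a, y_{i+1} = b, x)
    for a fixed input x. Cost is linear in the chain length.
    """
    alpha = {a: 1.0 for a in LABELS}
    for edge in potentials:
        alpha = {b: sum(alpha[a] * edge[(a, b)] for a in LABELS) for b in LABELS}
    return sum(alpha.values())


def brute_force_Z(potentials):
    """Same quantity by enumerating every label sequence (exponential cost)."""
    n = len(potentials) + 1
    total = 0.0
    for y in product(LABELS, repeat=n):
        weight = 1.0
        for i, edge in enumerate(potentials):
            weight *= edge[(y[i], y[i + 1])]
        total += weight
    return total


# Made-up potentials for a 3-token sequence: two edge layers, four entries each.
edges = [
    {("name", "name"): 1.0, ("name", "nonName"): 2.0,
     ("nonName", "name"): 0.5, ("nonName", "nonName"): 3.0},
    {("name", "name"): 4.0, ("name", "nonName"): 1.0,
     ("nonName", "name"): 2.0, ("nonName", "nonName"): 0.5},
]

print(forward_Z(edges), brute_force_Z(edges))
```

The forward pass touches each edge once, so its cost grows as (sequence length) x (number of labels)^2 rather than exponentially.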

Parameter Estimation for CRFs

How to compute $\arg\max_{\lambda} l(\lambda)$?

• Iterative scaling algorithms

– Generalized iterative scaling (GIS)

– Improved iterative scaling (IIS)

– easy to implement

– both really slow to converge

• Gradient descent methods

– CG: conjugate gradient

– L-BFGS: limited-memory quasi-Newton method

– much harder to implement, but lots of code available: you only need to provide $l(\lambda)$ and $l'(\lambda)$

– much faster to converge
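The derivative that CG or L-BFGS needs follows from the log-linear form of the objective. A sketch of that step, assuming $l(\lambda) = \sum_j \big( \lambda \cdot F(y^{(j)}, x^{(j)}) - \log Z_\lambda(x^{(j)}) \big)$ with $F_k(y, x) = \sum_i f_k(y_i, y_{i+1}, x)$; the symbols $\lambda$, $F_k$ and $Z_\lambda$ are notational assumptions here:

```latex
\begin{aligned}
\frac{\partial}{\partial \lambda_k} \log Z_\lambda(x)
  &= \frac{\sum_{y} F_k(y, x) \exp\big(\lambda \cdot F(y, x)\big)}
          {\sum_{y'} \exp\big(\lambda \cdot F(y', x)\big)}
   = \mathbb{E}_{p_\lambda(y \mid x)}\big[ F_k(y, x) \big], \\
\frac{\partial l}{\partial \lambda_k}
  &= \sum_j \Big( F_k(y^{(j)}, x^{(j)})
     - \mathbb{E}_{p_\lambda(y \mid x^{(j)})}\big[ F_k(y, x^{(j)}) \big] \Big).
\end{aligned}
```

Each gradient component is the observed feature count minus the feature count expected under the current model, so the gradient vanishes when the two match.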

Parameter Estimation for CRFs

• Iterative scaling algorithms

– Generalized iterative scaling (GIS)

– Improved iterative scaling (IIS)

• Gradient descent methods

– CG: conjugate gradient

– L-BFGS: limited-memory quasi-Newton method

• All of these maximize the (conditional) likelihood.

• Possible to train discriminatively with the voted perceptron (Collins, 2002).

Sha and Pereira (2003): in minutes, 375k examples

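$\arg\max_y P(y \mid x)$ for linear-chain CRFs

$$\arg\max_y p(y \mid x) = \arg\max_y \log p(y \mid x) = \arg\max_y \big( \lambda \cdot F(y, x) - \log Z(x) \big) = \arg\max_y \lambda \cdot F(y, x)$$

The last step holds because $\log Z(x)$ does not depend on $y$.

[Figure: the name/nonName lattice, with edges labeled a, b, c, d, e, f, g, h.]

A universal component for SVM, HMM, MEMM, etc.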
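Since the normalizer drops out, decoding reduces to finding the highest-scoring path through the lattice, which a Viterbi-style dynamic program does in one pass. A minimal sketch, assuming a name/nonName lattice with invented edge potentials given as logs for a fixed x; the function name `viterbi` and all numbers are illustrative assumptions.

```python
import math

LABELS = ["name", "nonName"]


def viterbi(log_potentials):
    """Return (best_score, best_path) maximizing the sum of edge log-potentials.

    log_potentials[i][(a, b)] is the log-potential of labeling position i
    with a and position i + 1 with b, for a fixed input x.
    """
    best = {a: (0.0, [a]) for a in LABELS}
    for edge in log_potentials:
        new_best = {}
        for b in LABELS:
            # Best way to reach label b at the next position.
            prev_score, prev_path = max(
                ((best[a][0] + edge[(a, b)], best[a][1]) for a in LABELS),
                key=lambda item: item[0],
            )
            new_best[b] = (prev_score, prev_path + [b])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Made-up potentials for a 3-token sequence (two edge layers).
raw_edges = [
    {("name", "name"): 1.0, ("name", "nonName"): 2.0,
     ("nonName", "name"): 0.5, ("nonName", "nonName"): 3.0},
    {("name", "name"): 4.0, ("name", "nonName"): 1.0,
     ("nonName", "name"): 2.0, ("nonName", "nonName"): 0.5},
]
log_edges = [{k: math.log(v) for k, v in edge.items()} for edge in raw_edges]

best_score, best_path = viterbi(log_edges)
print(best_path, math.exp(best_score))
```

Z(x) is never computed here; only relative path scores matter for the argmax.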

Other CRFs

Factorial CRFs (Sutton et al., 2004)

Skip-chain CRFs (Sutton and McCallum, 2004)

Tree CRFs (Cohn and Blunsom, 2005)

Graphical representation

[Figure: graphical representations of HMM, MEMM, and CRF, shown first with a “?” and then with the annotations “NO!” and “YES”. From Klein and Taskar’s slides.]

Merino Sheep
