Conditional Random Fields: Global Normalization and Label Bias Problem, Study notes of Computer Science

These slides describe the label bias problem, which affects locally normalized models such as MEMMs, and show how conditional random fields (CRFs) address it through global normalization. They compare CRFs with hidden Markov models (HMMs), introduce Markov random fields (MRFs) and the Hammersley-Clifford theorem, and discuss the objective function for training CRFs.

Uploaded on 08/31/2009

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira
Presenter: Yejin Choi
Label Bias Problem
Recall that an MEMM models $P(Y_i \mid Y_{i-1}, X_i)$.
Suppose the transition [NNS => VB] is more frequent than [NNS => IN].
Suppose that from [VB], only the transition [VB => VBG] is possible.
=> Now, what is $P(Y_i = \mathrm{VBG} \mid Y_{i-1} = \mathrm{VB}, X_i = \mathrm{merino})$?
(Figure: finite-state diagram with two paths, labeled S1 and S2 on the slide. Nodes are outputs (Y), arcs are observations (X); numbers are state ids and letters are arc labels.)
  Carnivores (-)   like (r)   eating (o)   sheep (b)
  NNS (0)          VB (4)     VBG (5)      NN (3)

  Herbivores (-)   like (r)   merino (i)   sheep (b)
  NNS (0)          IN (1)     JJ (2)       NN (3)
Label Bias Problem
VBG is the only successor of VB, so normalizing at that node gives
$P(Y_i = \mathrm{VBG} \mid Y_{i-1} = \mathrm{VB}, X_i = \text{???}) = 1$ for any $X_i$.
So, how do we fix this nonsense?
=> Do not normalize at each node!
Instead, normalize over the entire sequence.
This motivates the "global normalization" scheme of CRFs.
Other approaches: Cohen and Carvalho (2005), Sutton and McCallum (2005)
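The contrast between per-node and whole-sequence normalization can be sketched in a few lines of Python. This is only an illustration: the raw scores in `SCORES` are made-up numbers, and the helper names (`local_probs`, `paths`, `memm_prob`) are hypothetical, not taken from the slides or the paper.

```python
import math

# Hypothetical raw scores score(prev, next, word); all numbers are invented.
SCORES = {
    ("NNS", "like"):   {"VB": 1.0, "IN": 0.5},   # NNS => VB is seen more often
    ("VB",  "merino"): {"VBG": -5.0},            # only successor, very poor fit
    ("IN",  "merino"): {"JJ": 3.0},              # only successor, good fit
}

def local_probs(prev, word):
    """MEMM-style: normalize over the successors of a single state."""
    scores = SCORES[(prev, word)]
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def paths(words, start="NNS"):
    """Enumerate every label path allowed by SCORES with its raw scores."""
    frontier = [([start], [])]
    for w in words:
        frontier = [(labels + [y], raw + [s])
                    for labels, raw in frontier
                    for y, s in SCORES[(labels[-1], w)].items()]
    return frontier

words = ["like", "merino"]
all_paths = paths(words)

def memm_prob(labels):
    """Product of locally normalized transition probabilities."""
    return math.prod(local_probs(prev, w)[y]
                     for prev, y, w in zip(labels, labels[1:], words))

# Global normalization: exponentiate the summed raw scores, divide by one Z.
z_global = sum(math.exp(sum(raw)) for _, raw in all_paths)
memm = {tuple(l[1:]): memm_prob(l) for l, _ in all_paths}
crf = {tuple(l[1:]): math.exp(sum(raw)) / z_global for l, raw in all_paths}

print("P(VBG | VB, merino) =", local_probs("VB", "merino")["VBG"])
print("locally normalized best path: ", max(memm, key=memm.get))
print("globally normalized best path:", max(crf, key=crf.get))
```

Under these assumed scores, the locally normalized model cannot penalize the VB path for the bad fit with "merino", because the single outgoing transition always receives probability 1. The globally normalized score keeps the penalty.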
Label Bias Problem
Wait, what about HMMs?
Recall that an HMM models $P(X_i \mid Y_i)$ and $P(Y_i \mid Y_{i-1})$
(so that $P(X_i \mid Y_i, Y_{i-1}) \, P(Y_i \mid Y_{i-1}) = P(X_i, Y_i \mid Y_{i-1})$).
Then we can get $P(\mathrm{merino} \mid \mathrm{VBG}) = 0$, so the path through VBG is ruled out when the observation is "merino".
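A minimal sketch of the HMM case, assuming invented transition and emission tables (`TRANS`, `EMIT`) and a hypothetical helper `hmm_joint`. It shows how an emission probability of zero removes the VB => VBG path for the observation "merino", even though the transition itself has probability 1.

```python
# Hypothetical HMM parameters; the numbers are invented for illustration.
TRANS = {("NNS", "VB"): 0.6, ("NNS", "IN"): 0.4,
         ("VB", "VBG"): 1.0, ("IN", "JJ"): 1.0}
EMIT = {("VB", "like"): 1.0, ("IN", "like"): 1.0,
        ("VBG", "eating"): 1.0, ("VBG", "merino"): 0.0,
        ("JJ", "merino"): 1.0, ("JJ", "eating"): 0.0}

def hmm_joint(labels, words, start="NNS"):
    """Product over i of P(Y_i | Y_{i-1}) * P(X_i | Y_i), given the start state."""
    prob, prev = 1.0, start
    for y, w in zip(labels, words):
        prob *= TRANS.get((prev, y), 0.0) * EMIT.get((y, w), 0.0)
        prev = y
    return prob

via_vb = hmm_joint(["VB", "VBG"], ["like", "merino"])
via_in = hmm_joint(["IN", "JJ"], ["like", "merino"])
print(via_vb, via_in)
```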
Label Bias vs. Observation Bias

Both have to do with local conditional normalization.

  • Observation bias (Klein and Manning, 2002): the observation explains the current state so well that the state transition is ignored.

  • Label bias (Bottou, 1991; Lafferty et al., 2001): the previous state explains the current state so well that the observation is ignored.

Models: MEMM, HMM

e.g. "All the indexes above"

  Incorrect: DT DT NNS VBD (-0., -5.)

  Correct: PDT DT NNS VBD (-1., -0.)

So, let's normalize globally! (But how?)

  • What we need is the global joint distribution, i.e., $p(y \mid x)$, where $y = (y_1, \dots, y_n)$ and $x = (x_1, \dots, x_n)$.

  • We do not want distributions on individual nodes, i.e., $p(y_i \mid x_i)$.

  • Instead, we want non-probabilistic potential functions, i.e., $\psi(y_i, x_i)$.

  • $p(y \mid x)$ is then obtained from the potentials through some function $g$: $g(\psi(y_1, x_1), \dots, \psi(y_n, x_n))$.

  • Problem with directed graphs (like Bayesian networks):

    – a probability distribution must be given for each node;

    – then the joint probability is $p(y) = \prod_i p(y_i \mid \mathrm{parents}(y_i))$.

    By the way, what is $\mathrm{parents}(y_i)$ for an MEMM?

=> Markov Random Field! (= Markov network, random field)

Markov Random Field

  • But how do we compute the global joint distribution $P(Y)$ out of this?

    – Besides, we don't want to compute $P(Y_i \mid \mathrm{neighbors}(Y_i))$!

  • Just for now, forget about conditioning on $X$.

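Hammersley-Clifford theorem (1971)

Given an MRF $G = (Y, E)$ such that $P(Y_i \mid Y \setminus Y_i) = P(Y_i \mid \mathrm{nbr}(Y_i))$,

and given $\psi_C(Y_C)$ for every clique $C$ in $G$ such that $\psi_C(Y_C) \ge 0$:

    – cliques may overlap;

    – cliques need not be maximal;

then

$$P(Y) = \frac{1}{Z} \prod_{\text{clique } C} \psi_C(Y_C), \qquad \text{where } Z = \sum_{Y} \prod_{C} \psi_C(Y_C)$$

=> This implies we don't need to compute $P(Y_i \mid \mathrm{nbr}(Y_i))$ to get $P(Y)$!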
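To make the factorization concrete, here is a small brute-force sketch in Python. It assumes a hypothetical three-variable chain A - B - C with binary values and invented clique potentials (`psi_ab`, `psi_bc`, and a non-maximal clique `psi_b`). It computes Z by enumeration, then checks numerically that the resulting P(Y) satisfies the Markov property P(A | B, C) = P(A | B).

```python
from itertools import product

VALUES = (0, 1)

# Invented non-negative clique potentials for the chain A - B - C.
def psi_ab(a, b):
    return 2.0 if a == b else 1.0

def psi_bc(b, c):
    return 3.0 if b == c else 0.5

def psi_b(b):
    return 1.5 if b == 1 else 1.0    # a non-maximal clique {B}

def unnormalized(a, b, c):
    """Product of all clique potentials for one assignment."""
    return psi_ab(a, b) * psi_bc(b, c) * psi_b(b)

# Z sums the product of potentials over every assignment of Y.
Z = sum(unnormalized(*y) for y in product(VALUES, repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

def p_a_given(a, b, c=None):
    """P(A=a | B=b) if c is None, otherwise P(A=a | B=b, C=c)."""
    cs = VALUES if c is None else (c,)
    num = sum(p(a, b, ci) for ci in cs)
    den = sum(p(ai, b, ci) for ai in VALUES for ci in cs)
    return num / den

print("Z =", Z)
print("P(A=1 | B=1, C=0) =", p_a_given(1, 1, 0))
print("P(A=1 | B=1)      =", p_a_given(1, 1))
```

No local conditional distribution is ever specified here; the conditionals are derived from the normalized product of potentials.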

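Computing Z(x)! (for linear-chain CRFs)

(Figure: a lattice over three positions, each with the states name / nonName, with edge potentials labeled a through h and products such as ae, af, bg, bh. This diagram is from William Cohen's slides.)

$$\ell(\lambda) = \sum_j \log p(y^{(j)} \mid x^{(j)}) = \sum_j \Big( \lambda \cdot F(y^{(j)}, x^{(j)}) - \log \sum_{y} \exp\big( \lambda \cdot F(y, x^{(j)}) \big) \Big)$$

$$\psi(y_i, y_{i+1}, x) = \exp \sum_k \lambda_k f_k(y_i, y_{i+1}, x)$$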
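The sum over all label sequences inside $\log Z(x)$ need not be enumerated for a linear chain. The sketch below uses the standard forward recursion in log space and checks it against brute-force enumeration. The feature weights inside `edge_score` are invented, and the function names are hypothetical; only the form $\psi(y_i, y_{i+1}, x) = \exp \sum_k \lambda_k f_k(y_i, y_{i+1}, x)$ comes from the slide.

```python
import math
from itertools import product

LABELS = ("name", "nonName")

def edge_score(i, y, y_next, x):
    """Hypothetical log-potential sum_k lambda_k f_k(y_i, y_{i+1}, x)."""
    s = 0.0
    if y == y_next:
        s += 0.5                                   # invented weight
    if y_next == "name" and x[i + 1][0].isupper():
        s += 2.0                                   # invented weight
    if y_next == "nonName" and not x[i + 1][0].isupper():
        s += 1.0                                   # invented weight
    return s

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_z_forward(x):
    """Forward recursion: O(n * |LABELS|^2) instead of |LABELS|^n."""
    alpha = {y: 0.0 for y in LABELS}     # log-mass of prefixes ending in y
    for i in range(len(x) - 1):
        alpha = {y2: logsumexp([alpha[y1] + edge_score(i, y1, y2, x)
                                for y1 in LABELS])
                 for y2 in LABELS}
    return logsumexp(list(alpha.values()))

def log_z_brute(x):
    """Reference value: enumerate every label sequence."""
    totals = [sum(edge_score(i, y[i], y[i + 1], x) for i in range(len(x) - 1))
              for y in product(LABELS, repeat=len(x))]
    return logsumexp(totals)

x = ["Alice", "Smith", "left"]
print(log_z_forward(x), log_z_brute(x))
```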

Parameter Estimation for CRFs

How do we compute $\arg\max_\lambda \ell(\lambda)$?

  • Iterative scaling algorithms

    – Generalized iterative scaling (GIS)

    – Improved iterative scaling (IIS)

    – easy to implement

    – both really slow to converge

  • Gradient descent methods

    – CG: conjugate gradient

    – L-BFGS: limited-memory quasi-Newton method

    – much harder to implement, but lots of code is available

    – you only need to provide $\ell(\lambda)$ and $\ell'(\lambda)$

    – much faster to converge

  • All of these maximize the (conditional) likelihood.

  • It is also possible to train discriminatively with the voted perceptron (Collins 2002).

  • Sha and Pereira (2003): in minutes, 375k examples
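The two quantities an optimizer needs, the conditional log-likelihood and its gradient, can be sketched as follows. Several things here are assumptions rather than part of the slides: the two features in `F`, the toy data in `DATA`, the use of brute-force enumeration in place of forward-backward, and plain gradient ascent as a stand-in for CG or L-BFGS. The gradient used is the standard one for log-linear models: observed feature counts minus expected feature counts under the model.

```python
import math
from itertools import product

LABELS = ("name", "nonName")

def F(y, x):
    """Hypothetical global feature vector F(y, x), summed over edges."""
    same = cap = 0.0
    for i in range(len(x) - 1):
        same += y[i] == y[i + 1]
        cap += y[i + 1] == "name" and x[i + 1][0].isupper()
    return (same, cap)

def dot(lam, feats):
    return sum(l * f for l, f in zip(lam, feats))

def loglik_and_grad(lam, data):
    """Return l(lambda) and its gradient, enumerating all label sequences."""
    ll, grad = 0.0, [0.0] * len(lam)
    for x, y_gold in data:
        feats = [F(y, x) for y in product(LABELS, repeat=len(x))]
        scores = [dot(lam, f) for f in feats]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        gold = F(y_gold, x)
        ll += dot(lam, gold) - log_z
        for k in range(len(lam)):
            expected = sum(math.exp(s - log_z) * f[k]
                           for s, f in zip(scores, feats))
            grad[k] += gold[k] - expected      # observed minus expected
    return ll, grad

def train(data, steps=200, lr=0.1):
    """Plain gradient ascent; a stand-in for CG or L-BFGS."""
    lam = [0.0, 0.0]
    for _ in range(steps):
        _, grad = loglik_and_grad(lam, data)
        lam = [l + lr * g for l, g in zip(lam, grad)]
    return lam

# Invented toy training set.
DATA = [(["met", "Alice", "today"], ["nonName", "name", "nonName"]),
        (["saw", "Bob", "Smith"], ["nonName", "name", "name"])]

lam = train(DATA)
print("weights:", lam)
print("log-likelihood:", loglik_and_grad(lam, DATA)[0])
```

A library optimizer would be given exactly the pair returned by `loglik_and_grad` (negated, if it minimizes).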

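argmax_y P(y | x) (for linear-chain CRFs)

$$\arg\max_y p(y \mid x) = \arg\max_y \log p(y \mid x) = \arg\max_y \big( \lambda \cdot F(y, x) - \log Z(x) \big) = \arg\max_y \lambda \cdot F(y, x)$$

The last step holds because $\log Z(x)$ does not depend on $y$.

(Figure: the name / nonName lattice over three positions, with edge potentials labeled a through h.)

A universal component for SVM, HMM, MEMM, etc.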
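The slide does not name the decoding algorithm. For a linear chain, the usual choice is Viterbi-style dynamic programming over the lattice, sketched below under that assumption. The weights in `edge_score` are invented, and the brute-force search is included only to check the result.

```python
from itertools import product

LABELS = ("name", "nonName")

def edge_score(i, y, y_next, x):
    """Hypothetical edge contribution to lambda . F(y, x)."""
    s = 0.0
    if y == y_next:
        s += 0.5                                   # invented weight
    if y_next == "name" and x[i + 1][0].isupper():
        s += 2.0                                   # invented weight
    if y_next == "nonName" and not x[i + 1][0].isupper():
        s += 1.0                                   # invented weight
    return s

def viterbi(x):
    """Return (best score, best label path) by dynamic programming."""
    best = {y: (0.0, [y]) for y in LABELS}   # best (score, path) ending in y
    for i in range(len(x) - 1):
        best = {y2: max(((s + edge_score(i, y1, y2, x), path + [y2])
                         for y1, (s, path) in best.items()),
                        key=lambda t: t[0])
                for y2 in LABELS}
    return max(best.values(), key=lambda t: t[0])

def brute_force(x):
    """Reference: score every label sequence and keep the best."""
    def total(y):
        return sum(edge_score(i, y[i], y[i + 1], x) for i in range(len(x) - 1))
    y = max(product(LABELS, repeat=len(x)), key=total)
    return total(y), list(y)

x = ["Alice", "Smith", "left"]
print(viterbi(x))
print(brute_force(x))
```

Because only $\lambda \cdot F(y, x)$ is maximized, Z(x) is never computed during decoding.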

Other CRFs

  • Factorial CRFs (Sutton et al., 2004)

  • Skip-chain CRFs (Sutton and McCallum, 2004)

  • Tree CRFs (Cohn and Blunsom, 2005)

Graphical representation

(Figure: graphical models for HMM / MEMM / CRF, first marked ? / ? / ? and then NO! / YES / YES. From Klein and Taskar's slides.)

Merino Sheep