



These notes cover the label bias problem and how conditional random fields (CRFs) address it through global normalization. They compare CRFs with hidden Markov models (HMMs) and maximum-entropy Markov models (MEMMs), explain the concept of Markov random fields (MRFs), and discuss the Hammersley-Clifford theorem and the objective function for training CRFs.
Typology: Study notes




Now, what is $P(Y_i = \text{VBG} \mid Y_{i-1} = \text{VB}, X_i = \text{merino})$?
$P(Y_i \mid Y_{i-1}, X_i)$
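[Figure: state diagram with two tagging paths that share the start state NNS (0) and the end state NN (3). Nodes: outputs (Y); arcs: observations (X).
Path 1: Carnivores/NNS (0), like/VB (4), eating/VBG (5), sheep/NN (3); arc symbols -, r, o, b.
Path 2: Herbivores/NNS (0), like/IN (1), merino/JJ (2), sheep/NN (3); arc symbols -, r, i, b.]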
$P(Y_i = \text{VBG} \mid Y_{i-1} = \text{VB}, X_i = \text{???}) = 1$ for any $X_i$
So, how do we fix this nonsense?
Do not normalize on each node!
Instead, normalize over the entire sequence. This motivates the “global normalization” scheme of CRFs.
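The contrast between per-node and whole-sequence normalization can be sketched on the two-path example. This is a minimal illustration, not the slides' own model: the transition table `NEXT`, the scores in `SCORE`, and all function names are hypothetical choices made for the demonstration.

```python
import math

# Allowed tag transitions from the two-path diagram (the start tag is NNS).
NEXT = {"NNS": ["VB", "IN"], "VB": ["VBG"], "VBG": ["NN"],
        "IN": ["JJ"], "JJ": ["NN"]}

# Hypothetical tag/word compatibility scores (log-potentials), chosen by hand.
SCORE = {("VB", "like"): 1.0, ("IN", "like"): 0.5,
         ("VBG", "eating"): 3.0, ("JJ", "merino"): 3.0,
         ("VBG", "merino"): -3.0, ("JJ", "eating"): -3.0,
         ("NN", "sheep"): 1.0}

def score(tag, word):
    return SCORE.get((tag, word), 0.0)

def local_prob(prev, tag, word):
    """MEMM-style P(tag | prev, word): normalize over the successors of prev."""
    z = sum(math.exp(score(t, word)) for t in NEXT[prev])
    return math.exp(score(tag, word)) / z

def paths(n_words):
    """All tag paths of length n_words that start at NNS."""
    result = [["NNS"]]
    for _ in range(n_words - 1):
        result = [p + [t] for p in result for t in NEXT.get(p[-1], [])]
    return result

def memm_path_prob(path, words):
    """Product of locally normalized transition probabilities."""
    p = 1.0
    for prev, tag, word in zip(path, path[1:], words[1:]):
        p *= local_prob(prev, tag, word)
    return p

def total_score(path, words):
    return sum(score(t, w) for t, w in zip(path[1:], words[1:]))

def crf_path_prob(path, words):
    """Globally normalized: exp(total score) divided by the sum over all paths."""
    z = sum(math.exp(total_score(p, words)) for p in paths(len(words)))
    return math.exp(total_score(path, words)) / z

words = ["Herbivores", "like", "merino", "sheep"]
verb_path = ["NNS", "VB", "VBG", "NN"]
adj_path = ["NNS", "IN", "JJ", "NN"]

print(local_prob("VB", "VBG", "merino"))   # VB has one successor, so this is 1
print(memm_path_prob(verb_path, words), memm_path_prob(adj_path, words))
print(crf_path_prob(verb_path, words), crf_path_prob(adj_path, words))
```

With these made-up scores, the locally normalized model prefers the verb path because the poor fit between "merino" and VBG is normalized away at the VB node, while the globally normalized model lets that low score count against the whole path.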
Other approaches: Cohen and Carvalho (2005), Sutton and McCallum (2005)
[Figure: the two-path tagging diagram again (Carnivores/NNS like/VB eating/VBG sheep/NN vs. Herbivores/NNS like/IN merino/JJ sheep/NN), with the first arcs labeled c and h.]
Wait, what about HMMs?
If we model $P(X_i \mid Y_i)$ and $P(Y_i \mid Y_{i-1})$
(so that $P(X_i \mid Y_i, Y_{i-1}) \cdot P(Y_i \mid Y_{i-1}) = P(X_i, Y_i \mid Y_{i-1})$),
then we can get P(merino | VBG) = 0!
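A small sketch of why the generative factorization avoids the problem: a zero emission probability removes the whole path. The tables `TRANS` and `EMIT` hold hypothetical, hand-picked probabilities, and `hmm_joint` is an illustrative helper rather than anything defined in the slides.

```python
# Hypothetical HMM parameters for the two-path example (illustrative values).
TRANS = {("NNS", "VB"): 0.6, ("NNS", "IN"): 0.4, ("VB", "VBG"): 1.0,
         ("VBG", "NN"): 1.0, ("IN", "JJ"): 1.0, ("JJ", "NN"): 1.0}
EMIT = {("NNS", "Herbivores"): 0.5, ("NNS", "Carnivores"): 0.5,
        ("VB", "like"): 1.0, ("IN", "like"): 1.0,
        ("VBG", "eating"): 1.0, ("JJ", "merino"): 1.0,
        ("NN", "sheep"): 1.0}

def hmm_joint(tags, words):
    """P(x, y) as a product of P(y_i | y_{i-1}) * P(x_i | y_i); start tag is given."""
    p = EMIT.get((tags[0], words[0]), 0.0)
    for prev, tag, word in zip(tags, tags[1:], words[1:]):
        # Missing table entries are treated as probability zero.
        p *= TRANS.get((prev, tag), 0.0) * EMIT.get((tag, word), 0.0)
    return p

words = ["Herbivores", "like", "merino", "sheep"]
print(hmm_joint(["NNS", "VB", "VBG", "NN"], words))  # P(merino | VBG) = 0 kills this path
print(hmm_joint(["NNS", "IN", "JJ", "NN"], words))
```

Even though the transition NNS to VB is the more likely one here, the verb path cannot win, because the observation "merino" is scored against VBG instead of being normalized away.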
[Figure: the two-path tagging diagram again (Carnivores/NNS like/VB eating/VBG sheep/NN vs. Herbivores/NNS like/IN merino/JJ sheep/NN).]
Both have to do with local conditional normalization.
- Observation bias (Klein and Manning, 2002): the observation explains the current state so well that the state transition is ignored.
- Label bias (Bottou, 1991; Lafferty et al., 2001): the previous state explains the current state so well that the observation is ignored.
MEMM
HMM
e.g., "All the indexes above"
Incorrect: DT DT NNS VBD (-0., -5.)
Correct: PDT DT NNS VBD (-1., -0.)
- where $y = (y_1, \dots, y_n)$ and $x = (x_1, \dots, x_n)$
- We do not want distributions on individual nodes.
- Instead, we want non-probabilistic potential functions, i.e.,
  $p(y \mid x) = g(\psi_1(y_1, x_1), \dots, \psi_n(y_n, x_n))$
- A probability distribution should be given for each node.
- Then the joint probability is $p(y) = \prod_i p(y_i \mid \mathrm{parents}(y_i))$.
By the way, what is $\mathrm{parents}(y_i)$ for an MEMM?
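As a hedged worked answer to that question, the standard directed factorizations for the two sequence models can be written out as follows. These are the textbook forms in this section's notation; the slides themselves leave the question open.

```latex
% HMM (generative): each tag depends on the previous tag, each word on its tag.
\mathrm{parents}(y_i) = \{y_{i-1}\}, \quad \mathrm{parents}(x_i) = \{y_i\}
\;\Rightarrow\; p(x, y) = \prod_i p(y_i \mid y_{i-1})\, p(x_i \mid y_i)

% MEMM (conditional): each tag depends on the previous tag and the current word.
\mathrm{parents}(y_i) = \{y_{i-1}, x_i\}
\;\Rightarrow\; p(y \mid x) = \prod_i p(y_i \mid y_{i-1}, x_i)
```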
- But how do we compute the global joint distribution $P(Y)$ out of this?
- Besides, we don't want to compute $P(Y_i \mid \mathrm{neighbors}(Y_i))$!
Just for now, forget about conditioning on X …
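Given an MRF $G = (Y, E)$ such that $P(Y_i \mid Y \setminus Y_i) = P(Y_i \mid \mathrm{nbr}(Y_i))$
Given $\psi_c(Y_c)$ for each clique $c$ in $G$, such that $\psi_c(Y_c) \ge 0$
- cliques may overlap.
- cliques may not be maximal.
$P(Y) = \frac{1}{Z} \prod_{c \in \mathrm{cliques}(G)} \psi_c(Y_c)$, where $Y_c$ is the set of nodes in clique $c$ and $Z = \sum_{Y} \prod_{c} \psi_c(Y_c)$
This implies we don't need to compute $P(Y_i \mid \mathrm{nbr}(Y_i))$ to get $P(Y)$!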
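The idea of building $P(Y)$ directly from clique potentials can be checked by brute force on a tiny model. The sketch below assumes a hypothetical three-node chain Y1 - Y2 - Y3 with binary variables; the potential functions `psi_12` and `psi_23` and their values are made up for illustration.

```python
from itertools import product

# Hypothetical 3-node chain MRF Y1 - Y2 - Y3 with binary variables.
# The cliques are the two edges; potentials are non-negative, not probabilities.
def psi_12(y1, y2):
    return 2.0 if y1 == y2 else 1.0

def psi_23(y2, y3):
    return 3.0 if y2 == y3 else 1.0

def unnormalized(y):
    """Product of clique potentials for one full assignment y = (y1, y2, y3)."""
    return psi_12(y[0], y[1]) * psi_23(y[1], y[2])

STATES = list(product([0, 1], repeat=3))
Z = sum(unnormalized(y) for y in STATES)  # partition function

def joint(y):
    """P(Y = y): normalized product of potentials, no local conditionals needed."""
    return unnormalized(y) / Z

print(Z)
print(joint((0, 0, 0)))
print(sum(joint(y) for y in STATES))
```

Only one normalizer is computed, over whole assignments. No per-node conditional such as the probability of Y2 given its neighbors is ever formed.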
…
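For linear-chain CRFs:
$$\ell(\theta) = \sum_j \log p(y^{(j)} \mid x^{(j)}) = \sum_j \Big[ \theta \cdot F(y^{(j)}, x^{(j)}) - \log \sum_{y} \exp\big(\theta \cdot F(y, x^{(j)})\big) \Big]$$
$$\psi(y_i, y_{i+1}, x) = \exp \sum_k \theta_k f_k(y_i, y_{i+1}, x)$$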
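The log-likelihood of a linear-chain CRF needs log Z(x), the log of a sum over all tag sequences. A minimal sketch follows, assuming hypothetical weights split into `TRANS` and `EMIT` tables: it computes log Z(x) with the forward recursion and checks it against brute-force enumeration. All names and numbers are illustrative.

```python
import math
from itertools import product

TAGS = ["IN", "JJ", "NN"]
# Hypothetical weights theta, split into transition and emission parts.
TRANS = {("IN", "JJ"): 1.0, ("JJ", "NN"): 1.5}
EMIT = {("IN", "like"): 1.0, ("JJ", "merino"): 2.0, ("NN", "sheep"): 2.0}

def path_score(tags, words):
    """theta . F(y, x): sum of emission and transition weights along the path."""
    s = sum(EMIT.get((t, w), 0.0) for t, w in zip(tags, words))
    s += sum(TRANS.get(pair, 0.0) for pair in zip(tags, tags[1:]))
    return s

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_z_forward(words):
    """log Z(x) via the forward recursion, O(n * |TAGS|^2)."""
    alpha = {t: EMIT.get((t, words[0]), 0.0) for t in TAGS}
    for w in words[1:]:
        alpha = {t: EMIT.get((t, w), 0.0) + logsumexp(
                     [alpha[p] + TRANS.get((p, t), 0.0) for p in TAGS])
                 for t in TAGS}
    return logsumexp(list(alpha.values()))

def log_z_brute(words):
    """log Z(x) by enumerating every tag sequence (exponential; for checking only)."""
    return logsumexp([path_score(y, words)
                      for y in product(TAGS, repeat=len(words))])

def log_likelihood(data):
    """l(theta) = sum over examples of theta . F(y, x) - log Z(x)."""
    return sum(path_score(y, x) - log_z_forward(x) for y, x in data)

words = ["like", "merino", "sheep"]
print(log_z_forward(words), log_z_brute(words))
print(log_likelihood([(["IN", "JJ", "NN"], words)]))
```

The brute-force version is only there as a sanity check. The forward recursion is what keeps the normalizer tractable for longer sequences.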
This diagram is from William Cohen’s slides.
How do we maximize $\ell(\theta)$?
- Generalized iterative scaling (GIS)
- Improved iterative scaling (IIS)
  - easy to implement
  - both really slow to converge
- CG: conjugate gradient
- L-BFGS: limited-memory quasi-Newton method
  - much harder to implement, but lots of code available: you only need to provide $\ell(\theta)$ and $\nabla \ell(\theta)$
  - much faster to converge
- All of these maximize the (conditional) likelihood.
- It is also possible to train discriminatively with the voted perceptron (Collins, 2002).
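Gradient-based optimizers need only the objective and its gradient, which for a CRF is the observed feature counts minus the expected feature counts under the model. The sketch below computes both by brute-force enumeration on a toy example and runs plain gradient ascent as a stand-in for L-BFGS. The feature set `FEATURES`, the data, and the step size are hypothetical.

```python
import math
from itertools import product

TAGS = ["JJ", "NN"]
# Hypothetical features: ("emit", tag, word) and ("trans", prev_tag, tag).
FEATURES = [("emit", "JJ", "merino"), ("emit", "NN", "sheep"),
            ("trans", "JJ", "NN"), ("trans", "NN", "NN")]

def feature_vector(tags, words):
    """Global feature vector F(y, x): how often each feature fires on the path."""
    counts = []
    for kind, a, b in FEATURES:
        if kind == "emit":
            pairs = zip(tags, words)
        else:
            pairs = zip(tags, tags[1:])
        counts.append(float(sum(1 for pair in pairs if pair == (a, b))))
    return counts

def objective_and_gradient(theta, data):
    """Return l(theta) and its gradient (observed minus expected feature counts)."""
    value, grad = 0.0, [0.0] * len(theta)
    for tags, words in data:
        candidates = list(product(TAGS, repeat=len(words)))
        feats = [feature_vector(y, words) for y in candidates]
        scores = [sum(t * f for t, f in zip(theta, fv)) for fv in feats]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        gold = feature_vector(tags, words)
        value += sum(t * f for t, f in zip(theta, gold)) - log_z
        for k in range(len(theta)):
            expected = sum(math.exp(s - log_z) * fv[k]
                           for fv, s in zip(feats, scores))
            grad[k] += gold[k] - expected
    return value, grad

data = [(("JJ", "NN"), ("merino", "sheep"))]
theta = [0.0] * len(FEATURES)
for _ in range(50):  # plain gradient ascent, standing in for L-BFGS
    value, grad = objective_and_gradient(theta, data)
    theta = [t + 0.5 * g for t, g in zip(theta, grad)]
print(value, theta)
```

An off-the-shelf L-BFGS routine would consume the same value and gradient pair, typically applied to the negated objective because such routines minimize.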
For linear-chain CRFs:
$\arg\max_y p(y \mid x) = \arg\max_y \log p(y \mid x)$
$= \arg\max_y \, [\theta \cdot F(y, x) - \log Z(x)]$
$= \arg\max_y \, \theta \cdot F(y, x)$
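Because log Z(x) does not depend on y, decoding reduces to finding the highest-scoring path, which dynamic programming (the Viterbi algorithm) does without enumerating all sequences. The following is a minimal sketch with hypothetical weight tables `TRANS` and `EMIT`; the values are chosen by hand for the two example sentences.

```python
TAGS = ["NNS", "VB", "VBG", "IN", "JJ", "NN"]
# Hypothetical weights: theta . F(y, x) is the sum of these along a path.
TRANS = {("NNS", "VB"): 1.0, ("NNS", "IN"): 1.0, ("VB", "VBG"): 1.0,
         ("VBG", "NN"): 1.0, ("IN", "JJ"): 1.0, ("JJ", "NN"): 1.0}
EMIT = {("NNS", "Herbivores"): 2.0, ("NNS", "Carnivores"): 2.0,
        ("VB", "like"): 1.0, ("IN", "like"): 0.5,
        ("VBG", "eating"): 3.0, ("JJ", "merino"): 3.0,
        ("NN", "sheep"): 2.0}

def viterbi(words):
    """Return the tag path maximizing theta . F(y, x) by dynamic programming."""
    # best[t] is the best score of any partial path that ends in tag t.
    best = {t: EMIT.get((t, words[0]), 0.0) for t in TAGS}
    back = []
    for w in words[1:]:
        new_best, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[p] + TRANS.get((p, t), 0.0))
            new_best[t] = (best[prev] + TRANS.get((prev, t), 0.0)
                           + EMIT.get((t, w), 0.0))
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

print(viterbi(["Herbivores", "like", "merino", "sheep"]))
print(viterbi(["Carnivores", "like", "eating", "sheep"]))
```

With these weights the tag of "like" is decided by the word that follows it, which is the behavior a locally normalized model fails to deliver in the label bias example.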
A universal component for SVM, HMM, MEMM, etc.
(Sutton et al., 2004)
(Sutton and McCallum, 2004)
HMM: ?    MEMM: ?    CRF: ?
HMM: NO!  MEMM: YES  CRF: YES
From Klein and Taskar’s slides.
Merino Sheep