



This document, from Hal Daumé III's CS5350/6350 course at the University of Utah, discusses structured prediction, a machine learning approach for learning complex structures from examples. It covers several approaches to structured prediction, including independent prediction, global prediction, and productive prediction. It also discusses the challenges of representing functions, learning from finite samples, and making predictions. Examples are provided for sequence labeling, and the document compares the performance of conditional random fields (CRFs) and independent learning algorithms.
Typology: Study notes




Hal Daumé III CS5350: Machine Learning
02 Dec 2008 Hal Daumé III (U Utah)
Structured Prediction (1 / 23)
[Figure: parse tree for the sentence "The man ate a tasty sandwich", with part-of-speech tags DT NN VBD DT JJ NN and constituents NP, VP and S.]
Informally, given:
1. A bunch of inputs
2. Correct corresponding outputs
Induce a function that maps novel inputs to corresponding outputs.
Formally, given:
1. An input space $\mathcal{X}$
2. An output space $\mathcal{Y}$ (that is "hairy")
3. A distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$
4. A loss function $\ell$
Induce a function $f$ with low expected loss (with respect to $\mathcal{D}$).
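As a rough illustration of "low expected loss", the sketch below estimates the expected loss of a predictor by averaging a loss over a finite sample. Hamming loss, the names `hamming_loss` and `empirical_risk`, and the toy data are all assumptions made for this example; the slides do not fix a particular loss.

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where two equal-length label sequences disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(y_true, y_pred))


def empirical_risk(f, sample, loss=hamming_loss):
    """Average loss of predictor f over (input, output) pairs.

    With a sample drawn from the data distribution, this average is an
    estimate of the expected loss.
    """
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)


# Toy sample (hypothetical data).
sample = [
    (["The", "man", "ate"], ["DET", "NOUN", "VERB"]),
    (["the", "sandwich"], ["DET", "NOUN"]),
]


def always_noun(x):
    """A deliberately weak predictor: label every word NOUN."""
    return ["NOUN"] * len(x)


risk = empirical_risk(always_noun, sample)
print(risk)
```

Here the weak predictor makes two mistakes on the first sentence and one on the second, so the average is 1.5 errors per sentence.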
Input:  The man ate the really tasty sandwich
Output: a label for each word

Inputs are sequences; outputs are (equal-length) sequences of labels drawn from a label set.
Three approaches:
◮ Independent prediction: Ignore the structure and predict each label independently
◮ Global prediction: Learn a score $f(x, y) \in \mathbb{R}^+$ so that good $y$s have high score
◮ Productive prediction: Learn a function that generates the output sequentially
Input:  The man ate the really tasty sandwich (positions $t = 1, \dots, 7$)
Output: a label for each position $t$

Features may depend on any of the input, but must decompose entirely over the output:
$$f(x, y) = \sum_{t=1}^{T} w^\top \phi(x, y)_t$$
One may use any classifier, for instance:
◮ Logistic regression (aka maximum entropy) classifiers
◮ Support vector machines
◮ ...
Pros:
◮ Really efficient at training and test time
◮ Can use any off-the-shelf multiclass classifier
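A minimal sketch of independent prediction, assuming a multiclass perceptron as the per-position classifier (a stand-in chosen for brevity; the slides name logistic regression and SVMs). The feature templates and the toy sentence are hypothetical. The point is that each feature looks at any part of the input but only at the single label $y_t$.

```python
from collections import defaultdict


def position_features(x, t, label):
    """Features for position t: free to inspect the input, but only label y_t."""
    prev_word = x[t - 1].lower() if t > 0 else "<s>"
    return [f"word={x[t].lower()}|y={label}", f"prev={prev_word}|y={label}"]


def score(w, x, t, label):
    """Linear score of one label at one position."""
    return sum(w[feat] for feat in position_features(x, t, label))


def predict(w, x, labels):
    """Every position is a separate multiclass decision."""
    return [max(labels, key=lambda lab: score(w, x, t, lab)) for t in range(len(x))]


def train_perceptron(data, labels, epochs=5):
    """Multiclass perceptron over individual positions."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            for t, gold in enumerate(y):
                guess = max(labels, key=lambda lab: score(w, x, t, lab))
                if guess != gold:
                    for feat in position_features(x, t, gold):
                        w[feat] += 1.0
                    for feat in position_features(x, t, guess):
                        w[feat] -= 1.0
    return w


LABELS = ["DET", "NOUN", "VERB"]
data = [(["The", "man", "ate", "the", "sandwich"],
         ["DET", "NOUN", "VERB", "DET", "NOUN"])]
w = train_perceptron(data, LABELS)
pred = predict(w, data[0][0], LABELS)
print(pred)
```

Because no feature mentions a neighbouring label, prediction is just one `max` per position, which is where the efficiency at training and test time comes from.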
Input:  George Bush spoke to Congress today
Output: B-PER I-PER O O B-ORG O

Must ensure that I-PER always follows B-PER or I-PER
◮ At test time, constraints can be enforced by (e.g.) integer programming
◮ Incurs additional test-time complexity
◮ See [Punyakanok and Roth; IJCAI 2005]
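To make the constraint concrete, here is a small sketch that enforces it at test time on top of per-position scores. Note the swap: instead of integer programming it uses a constrained dynamic program, which is enough for this particular pairwise constraint. The score values, the label set and the function names are made up for illustration, and the label set is assumed to contain B-X for every I-X.

```python
import math


def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_decode(scores, labels):
    """scores[t][label] is a per-position score; return the best valid sequence."""
    # A sequence may not start with I-X: treat the start as following "O".
    best = {lab: (scores[0][lab] if allowed("O", lab) else -math.inf)
            for lab in labels}
    back = []
    for t in range(1, len(scores)):
        new_best, pointers = {}, {}
        for cur in labels:
            candidates = [(best[prev], prev) for prev in labels if allowed(prev, cur)]
            prev_score, prev = max(candidates)
            new_best[cur] = prev_score + scores[t][cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "B-PER", "I-PER"]
# Hypothetical scores from an independent classifier for a 3-word input.
scores = [
    {"O": 0.6, "B-PER": 0.5, "I-PER": 0.1},
    {"O": 0.2, "B-PER": 0.3, "I-PER": 1.0},
    {"O": 1.0, "B-PER": 0.0, "I-PER": 0.1},
]
independent = [max(s, key=s.get) for s in scores]
constrained = constrained_decode(scores, LABELS)
print(independent)
print(constrained)
```

The independent argmax starts an entity with I-PER, which is invalid; the constrained decoder gives up a little score at the first position to produce a well-formed sequence.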
Learn a score $f(x, y) \in \mathbb{R}^+$ so that good $y$s have high score
Three issues:
◮ How to represent $f$
◮ How to learn $f$ from a finite sample
◮ Given $f$ and a new input $x$, how to find the best $y$
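The third issue can be made concrete with a brute-force search: with $K$ labels and a length-$T$ input there are $K^T$ candidate outputs. The sketch below enumerates them all for a tiny input. The scoring function `toy_score` is a hypothetical stand-in for a learned $f$.

```python
from itertools import product


def brute_force_argmax(f, x, labels):
    """Score every possible label sequence and return the best one."""
    return max(product(labels, repeat=len(x)), key=lambda y: f(x, y))


def toy_score(x, y):
    """Hypothetical score: reward DET on 'the', and NOUN right after DET."""
    total = 0.0
    for t, (word, label) in enumerate(zip(x, y)):
        if word.lower() == "the" and label == "DET":
            total += 1.0
        if t > 0 and y[t - 1] == "DET" and label == "NOUN":
            total += 1.0
    return total


x = ["the", "man"]
labels = ["DET", "NOUN", "VERB"]
best = brute_force_argmax(toy_score, x, labels)
n_candidates = len(labels) ** len(x)
print(best, n_candidates)
```

Enumeration is only feasible for toy inputs, which is why the structure of $f$ matters: if it decomposes over a chain or tree, dynamic programming can replace the exhaustive search.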
all other outputs to look bad, just look at the best!
Leads to an optimization problem over $w$ with three annotated parts:
◮ ensures large margin
◮ predicted scores
◮ cost of error
See [Taskar, Guestrin and Koller; NIPS 2002]
Looks like exponentially many constraints, but can be reduced (if clever) to depend on the tree-width. (Technically, must introduce slack variables.)
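For reference, one common way to write a max-margin structured objective with slack variables is sketched below. The notation ($\phi$ for the feature map, $\ell$ for the cost of predicting $\hat{y}$ instead of $y_n$, $\xi_n$ for slack, $C$ for the trade-off constant) is an assumption of this sketch and may differ from the formula on the original slide.

```latex
\min_{w,\ \xi \ge 0} \quad
  \underbrace{\tfrac{1}{2}\lVert w \rVert^2}_{\text{ensures large margin}}
  + C \sum_{n} \xi_n
\qquad \text{s.t.} \qquad
  \underbrace{w^\top \phi(x_n, y_n) - w^\top \phi(x_n, \hat{y})}_{\text{predicted scores}}
  \;\ge\;
  \underbrace{\ell(y_n, \hat{y})}_{\text{cost of error}} - \xi_n
  \quad \forall n,\ \forall \hat{y}
```

There is one constraint per training example and per candidate output $\hat{y}$, which is the source of the "exponentially many constraints" remark.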
◮ Represent output structure as a graph
◮ Define features over cliques
Prediction can be solved efficiently for many reasonable problems:
◮ chains, trees, bipartite matchings, graph cuts (sort of)
CRF normalization is efficient for many problems:
◮ chains, trees
M$^3$N constraints are polynomial for more problems:
◮ chains, trees, bipartite matchings
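A small sketch of why CRF normalization is efficient on chains: the log-partition function can be computed with a forward recursion in time linear in the sequence length, rather than by summing over every label sequence. The unary and pairwise scores below are made-up numbers, and the function names are hypothetical.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_score(unary, pairwise, y):
    """Score of one label sequence: unary terms plus adjacent-pair terms."""
    total = sum(unary[t][label] for t, label in enumerate(y))
    total += sum(pairwise[(a, b)] for a, b in zip(y, y[1:]))
    return total


def log_partition_forward(unary, pairwise, labels):
    """Forward recursion: O(T * K^2) for T positions and K labels."""
    alpha = {label: unary[0][label] for label in labels}
    for t in range(1, len(unary)):
        alpha = {
            cur: unary[t][cur]
            + log_sum_exp([alpha[prev] + pairwise[(prev, cur)] for prev in labels])
            for cur in labels
        }
    return log_sum_exp(list(alpha.values()))


def log_partition_brute(unary, pairwise, labels):
    """Reference implementation: sum over all K^T sequences."""
    return log_sum_exp([chain_score(unary, pairwise, y)
                        for y in product(labels, repeat=len(unary))])


labels = ["A", "B"]
unary = [{"A": 0.5, "B": -0.2}, {"A": 0.1, "B": 0.3}, {"A": -0.4, "B": 0.8}]
pairwise = {("A", "A"): 0.2, ("A", "B"): -0.1, ("B", "A"): 0.0, ("B", "B"): 0.4}

log_z = log_partition_forward(unary, pairwise, labels)
log_z_brute = log_partition_brute(unary, pairwise, labels)
print(log_z, log_z_brute)
```

The two values agree up to floating-point error; only the forward version scales to realistic sequence lengths.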
Input:  The man ate the really tasty sandwich
Output: a label for each word

Idea: instead of using VERB-DET as a feature, just add extra input features and use independent predictions.
Problem:
◮ This introduces a ton of new features, whose weights we need to estimate
◮ Probably don't have enough data to do this reliably
◮ CRFs are computationally complex but statistically simple; ILR (independent logistic regression) is computationally simple but statistically complex
Solution:
◮ "Structure Compilation" [Liang, Daumé and Klein, ICML 2008]
◮ Train a CRF on your labeled data
◮ Run this CRF on a large amount of unlabeled data
◮ Train an ILR on the data labeled by the CRF, using more features
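The three steps above can be sketched as a small pipeline. Everything concrete here is a stand-in: `toy_teacher` is a fixed lexicon tagger playing the role of the trained CRF, and the student is a simple count-based voting classifier rather than logistic regression. Only the data flow (teacher labels unlabeled text, student trains on it with richer input features) follows the slide.

```python
from collections import Counter, defaultdict


def toy_teacher(x):
    """Stand-in for a trained CRF (assumption): a fixed lexicon tagger."""
    lexicon = {"the": "DET", "a": "DET", "man": "NOUN", "sandwich": "NOUN",
               "dog": "NOUN", "ate": "VERB", "saw": "VERB"}
    return [lexicon.get(word.lower(), "NOUN") for word in x]


def rich_features(x, t):
    """Extra input features (previous and next word), no label features."""
    prev_word = x[t - 1].lower() if t > 0 else "<s>"
    next_word = x[t + 1].lower() if t + 1 < len(x) else "</s>"
    return [f"w={x[t].lower()}", f"prev={prev_word}", f"next={next_word}"]


def train_independent(data):
    """Count-based independent classifier (stand-in for ILR)."""
    counts = defaultdict(Counter)
    for x, y in data:
        for t, label in enumerate(y):
            for feat in rich_features(x, t):
                counts[feat][label] += 1

    def predict(x):
        output = []
        for t in range(len(x)):
            votes = Counter()
            for feat in rich_features(x, t):
                votes.update(counts[feat])
            # Fall back to NOUN when no feature was seen (arbitrary choice).
            output.append(votes.most_common(1)[0][0] if votes else "NOUN")
        return output

    return predict


def compile_structure(teacher, unlabeled, train_student):
    """Label unlabeled inputs with the teacher, then train the student on them."""
    auto_labeled = [(x, teacher(x)) for x in unlabeled]
    return train_student(auto_labeled)


unlabeled = [["The", "man", "ate", "a", "sandwich"],
             ["A", "dog", "saw", "the", "man"]]
student = compile_structure(toy_teacher, unlabeled, train_independent)
pred = student(["The", "dog", "ate", "the", "sandwich"])
print(pred)
```

In practice the unlabeled set would be large, which is what lets the student estimate the weights of its many extra features.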
[Figure: two panels, (a) POS and (b) NER. Panel (a) plots tag accuracy and panel (b) plots labeled F1, each against m (thousands) from 2 to 200, for CRF($f_1$), ILR($f_1$) and ILR($f_2$).]
Can bound the ILR classifier's error by:
Follow independent classifier methodology, but:
◮ Predict variables in a prescribed order
◮ Allow features to depend on any past decision

Input:  The man ate the really tasty sandwich
Output: a label for each word (DET, NOUN, VERB, DET, ...)

◮ At training time:
  1. Make a classification example for DET
  2. Make an example for NOUN, knowing DET
  3. Make an example for VERB, knowing DET NOUN
  4. Make an example for DET, knowing DET ... VERB
  5. And so on...
◮ At test time:
  1. Predict the first label $y_1$
  2. Predict the second $y_2$, knowing $y_1$
  3. Predict $y_3$, knowing $y_1, y_2$
  4. And so on...
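The training and test procedures above can be sketched as follows, assuming a multiclass perceptron as the classifier and a left-to-right order. The feature templates (current word, previous label) and the toy sentences are hypothetical. The key difference from independent prediction is the `history` argument: correct labels at training time, the model's own earlier predictions at test time.

```python
from collections import defaultdict


def features(x, t, history):
    """Features may depend on the input and on any past decision."""
    prev_label = history[-1] if history else "<s>"
    return [f"word={x[t].lower()}", f"prev_label={prev_label}"]


def score(w, feats, label):
    return sum(w[(feat, label)] for feat in feats)


def train(data, labels, epochs=5):
    """One classification example per position, conditioned on gold history."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            for t, gold in enumerate(y):
                feats = features(x, t, y[:t])  # history = correct labels so far
                guess = max(labels, key=lambda lab: score(w, feats, lab))
                if guess != gold:
                    for feat in feats:
                        w[(feat, gold)] += 1.0
                        w[(feat, guess)] -= 1.0
    return w


def predict(w, x, labels):
    """Greedy left-to-right decoding, conditioning on our own predictions."""
    history = []
    for t in range(len(x)):
        feats = features(x, t, history)
        history.append(max(labels, key=lambda lab: score(w, feats, lab)))
    return history


LABELS = ["DET", "NOUN", "VERB"]
data = [(["The", "man", "ate", "the", "sandwich"],
         ["DET", "NOUN", "VERB", "DET", "NOUN"])]
w = train(data, LABELS)
seen = predict(w, data[0][0], LABELS)
unseen = predict(w, ["the", "dog", "ate"], LABELS)
print(seen)
print(unseen)
```

The word "dog" never appears in training, yet it is tagged NOUN because the previous decision was DET; that is the kind of dependence on past decisions the slide refers to.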
Key Idea: view prediction task as search
◮ Each path corresponds to a full output
◮ Each decision is over a small set of options
◮ Train a classifier to make search predictions
How can we train this classifier?
◮ Train on every node in the search space
  This is effectively what CRFs do
◮ Train only on the correct path
  This incurs label-bias problem
  See [Daumé & Marcu; ICML 2005] and [Xu and Fern, ICML 2007]
◮ Train on the subset of states that we are likely to reach!
  How do we know what states we are likely to reach?
  Chicken-and-egg problem!
Idea: Train on the subset of states that we are likely to reach
◮ Iterative algorithm
◮ First train on the correct path, giving $h_1$
◮ Then on an interpolation between the best path and $h_1$, giving $h_2$
◮ Then on an interpolation between the best path, $h_1$ and $h_2$, giving $h_3$
◮ And so on...
Guaranteed to converge in a polynomial number of iterations to a model ($h_{\text{last}}$) with regret:
See [Daumé III, Langford and Marcu; MLJ 2007]