



An overview of machine learning techniques, specifically kernel functions and tree kernels, for natural language processing. Topics include the use of timbl for classification, string and tree kernels, and the benefits of kernel methods in linguistics. The document also discusses the challenges of floating point arithmetic and the importance of designing appropriate kernel functions for linguistic problems.




Presentations next week:
Hannah, Lara, (Guoyan), Lucien, (John), Rebecca
Final project due: Wednesday, May 13 @ 5:00pm
Turn in a paper explaining what you did, how well it worked, why it worked as well as it did but not better, etc., plus any programs you wrote
The project grade will be based on the paper, which should look like a conference paper (8 pages, references)
Are chunks within the same clause part of the same argument?
Feature vector:
NP,NP,Britain,NNP,’s,POS,yes.
NP,VP,industry,NN,is,VBZ,no.
Use Timbl for classification
Best result so far: 81.27% accuracy
Largest source of error is PPs
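Timbl is a memory-based (nearest-neighbour) learner. As a rough illustration of that idea, and not of Timbl's actual implementation or options, the sketch below stores the two instances from the slide and classifies a new instance with a simple overlap metric (the number of matching feature values). The query instance and the names `overlap` and `classify` are made up for this example.

```python
# Training instances copied from the slide: six features, then the class label.
TRAINING = [
    (("NP", "NP", "Britain", "NNP", "'s", "POS"), "yes"),
    (("NP", "VP", "industry", "NN", "is", "VBZ"), "no"),
]


def overlap(a, b):
    """Count the positions at which two feature vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def classify(instance, training=TRAINING):
    """Return the label of the stored instance with the highest overlap."""
    _, best_label = max(training, key=lambda ex: overlap(instance, ex[0]))
    return best_label


# Hypothetical query: shares five of six feature values with the "no" instance.
query = ("NP", "VP", "economy", "NN", "is", "VBZ")
print(classify(query))  # prints: no
```

A real memory-based learner would also weight features and look at more than one neighbour; this sketch leaves both out.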
Much of the power of SVMs comes from the use of kernel functions and derived feature spaces
Linear kernels allow efficient processing of the very large feature vectors that come with a bag of words model
Polynomial kernels capture dependencies between features
Special purpose kernels reflect the structure of a particular problem
Combinations of kernels are also kernels
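The points above can be made concrete with a small sketch. It assumes plain Python tuples as feature vectors; the function names and the explicit feature map `phi` are illustrative choices, not part of the slides. It shows that a degree-2 polynomial kernel equals a dot product in a derived space containing pairwise feature products, and that the sum of two kernels is again usable as a kernel.

```python
import math


def linear_kernel(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, degree=2, c=0.0):
    """Polynomial kernel (x . y + c) ** degree."""
    return (linear_kernel(x, y) + c) ** degree


def phi(x):
    """Explicit feature map for the degree-2, c=0 kernel on 2-d input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def sum_kernel(x, y):
    """Sum of two kernels, which is itself a valid kernel."""
    return linear_kernel(x, y) + poly_kernel(x, y)


x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y))              # computed in the 2-d input space
print(linear_kernel(phi(x), phi(y)))  # same value via the derived 3-d space
print(sum_kernel(x, y))
```

The kernel never builds `phi(x)` explicitly, which is what makes large derived feature spaces affordable.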
String subsequence kernels represent a string as a bag of (possibly discontinuous) n-grams
The feature set is very large, but dot products can be computed efficiently
Dynamic programming and suffix trees
For text classification SSKs give a small improvement over n-gram kernels for small training sets
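A minimal sketch of the dynamic-programming idea, assuming the commonly used gap-weighted formulation in which each shared subsequence of length n contributes a decay factor `lam` raised to the total span it covers in both strings. The slides do not give the weighting, so `lam`, the function name `ssk`, and the memoised recursion are assumptions made for illustration.

```python
from functools import lru_cache


def ssk(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n, computed on prefixes."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel over the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):  # j is a 1-based position in t
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # Kernel value for the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))


# "cat" and "car" share only the 2-gram "ca", contiguous in both: lam ** 4.
print(ssk("cat", "car", 2, 0.5))
# "cat" with itself: "ca" and "at" give lam ** 4 each, "ct" gives lam ** 6.
print(ssk("cat", "cat", 2, 0.5))
```

The recursion never enumerates the subsequences themselves, so the cost grows with the string lengths rather than with the size of the feature set.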
We can use similar tricks to compare trees by comparing common subtrees
Given trees $T_1$ and $T_2$, with nodes $N_1$ and $N_2$, define:
$$I_i(n) = \begin{cases} 1 & \text{if subtree } i \text{ is rooted at } n \\ 0 & \text{otherwise} \end{cases}$$
The kernel function is:
$$K(T_1, T_2) = h(T_1) \cdot h(T_2)$$
where $h_i(T_1) = \sum_{n_1 \in N_1} I_i(n_1)$, or the number of times subtree $i$ occurs in tree $T_1$
The feature vector $h(T_1)$ will have as many dimensions as there are possible subtrees (which will be astronomical)
But, the dot product $h(T_1) \cdot h(T_2)$ can only depend on dimensions for subtrees which occur in both $T_1$ and $T_2$
Let $C(n_1, n_2)$ be the number of common subtrees rooted at $n_1$ and $n_2$
The kernel function is:
$$K(T_1, T_2) = h(T_1) \cdot h(T_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \sum_i I_i(n_1)\, I_i(n_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} C(n_1, n_2)$$
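We can efficiently compute $C(n_1, n_2)$, the number of common subtrees rooted at $n_1$ and $n_2$, by recursion
If the rules applied at $n_1$ and $n_2$ are different, then there are no common subtrees and $C(n_1, n_2) = 0$
If the rules are the same and $n_1$ and $n_2$ are preterminals, then $C(n_1, n_2) = 1$
Otherwise:
$$C(n_1, n_2) = \prod_{i=1}^{nc(n_1)} \left(1 + C(ch(n_1, i), ch(n_2, i))\right)$$
where $nc(n_1)$ is the number of children of $n_1$ and $ch(n, i)$ is the $i$-th child of $n$
Worst case, $K$ can be computed in $O(|N_1|\,|N_2|)$ time, but in practice $C(n_1, n_2) = 0$ for most $n_1, n_2$ and the computation is much cheaper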
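The recursion can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not code from the course: trees are nested tuples `(label, child, ...)`, words are plain strings, every non-preterminal node has only internal nodes as children, and the product uses the standard `1 + C(...)` form for each pair of children. All function names are hypothetical.

```python
def label(node):
    """Label of a node: the word itself for a leaf, else the first element."""
    return node if isinstance(node, str) else node[0]


def production(node):
    """The rule applied at a node: (parent label, tuple of child labels)."""
    return (node[0], tuple(label(child) for child in node[1:]))


def is_preterminal(node):
    """A preterminal dominates only words."""
    return all(isinstance(child, str) for child in node[1:])


def internal_nodes(tree):
    """Yield every non-leaf node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from internal_nodes(child)


def common_subtrees(n1, n2):
    """Number of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1 + common_subtrees(c1, c2)
    return result


def tree_kernel(t1, t2):
    """Sum the common-subtree counts over all pairs of nodes."""
    return sum(common_subtrees(a, b)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))


dog = ("NP", ("D", "the"), ("N", "dog"))
cat = ("NP", ("D", "the"), ("N", "cat"))
print(tree_kernel(dog, dog))  # prints: 6
print(tree_kernel(dog, cat))  # prints: 3
```

For the two toy noun phrases, the three shared subtrees are the bare rule NP → D N, the same rule with "the" filled in, and the preterminal D → the.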
Digital computers can’t represent most real numbers exactly:
    bulba% python
    Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
    [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 3.3
    3.2999999999999998
    >>>
Financial calculations use integers
Scientific calculations use approximations, which vary in their accuracy
Standard for floating point calculations: IEEE 754
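A short sketch of the contrast between the two approaches, written for a current Python 3 interpreter (which prints the shortest string that round-trips rather than 17 digits). The variable names and the choice of cents as the integer unit are assumptions for illustration.

```python
from decimal import Decimal

# Asking for 17 significant digits shows the stored approximation of 3.3.
print("%.17g" % 3.3)

# Decimal(float) converts the binary value exactly, digit for digit.
print(Decimal(3.3))

# Floating point: ten cents plus twenty cents is not exactly thirty cents.
float_total = 0.10 + 0.20
print(float_total == 0.30)  # prints: False

# Integer arithmetic in cents is exact.
cents_total = 10 + 20
print(cents_total == 30)    # prints: True
```

Keeping money in an integer number of the smallest unit avoids the rounding entirely, at the cost of converting at the input and output boundaries.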
Floating point numbers are stored as a mantissa and an exponent
IEEE floating point formats:
precision   min           max          eps          digits
single      $10^{-38}$    $10^{38}$    $10^{-7}$    7
double      $10^{-308}$   $10^{308}$   $10^{-16}$   16
Just because you can represent $10^{300}$ doesn’t mean you get 300 significant digits!
Default in python and perl is double precision
Don’t use single precision (float) unless you have a good reason
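The limits of the double format can be inspected from Python itself. The sketch below uses the standard `sys.float_info` and `struct` modules; the helper `to_single` is a hypothetical name for a round trip through the 4-byte single format. Note that `sys.float_info.dig` reports 15, the number of decimal digits guaranteed to survive a round trip, which is slightly more conservative than the rounded figure of 16.

```python
import struct
import sys

info = sys.float_info
print(info.min, info.max, info.epsilon, info.dig)


def to_single(x):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A huge magnitude does not buy extra significant digits.
big = 1e300
print(big + 1.0 == big)        # prints: True

# Single precision keeps fewer digits of 0.1 than double does.
print(to_single(0.1) == 0.1)   # prints: False

# eps is the gap between 1.0 and the next representable double.
print(1.0 + info.epsilon > 1.0)        # prints: True
print(1.0 + info.epsilon / 2 == 1.0)   # prints: True
```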
It’s easy to lose precision:
Things to watch out for:
subtractions of numbers that are nearly equal,
additions of numbers whose magnitudes are nearly equal, but whose signs are opposite
additions and subtractions of numbers that differ greatly in magnitude
Exact comparisons between floating point numbers can be misleading
The same operations performed in a different order or on different hardware may give different results
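Each of the pitfalls above can be reproduced in a few lines of double-precision Python. The specific numbers are chosen only for illustration; `math.isclose` and `math.fsum` are standard library functions shown as possible remedies, not as the course's recommendation.

```python
import math

# Nearly equal numbers: the subtraction exposes the rounding of 1 + 1e-15.
diff = (1.0 + 1e-15) - 1.0
relative_error = abs(diff - 1e-15) / 1e-15
print(diff, relative_error)

# Very different magnitudes: the small term is absorbed completely.
absorbed = (1e16 + 1.0) - 1e16
print(absorbed)  # prints: 0.0

# Exact comparison is misleading; compare with a tolerance instead.
tenths = sum([0.1] * 10)
print(tenths == 1.0)              # prints: False
print(math.isclose(tenths, 1.0))  # prints: True

# Order of operations changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)              # prints: False

# fsum tracks the lost low-order bits and returns the correctly rounded sum.
print(math.fsum([0.1] * 10) == 1.0)  # prints: True
```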
We’ve come a long way, from flipping coins to Support Vector Machines
Non-parametric methods:
decision trees
instance-based learning
transformation-based learning
perceptron
support vector machines
Parametric methods:
naive Bayes
maximum entropy
One theme that runs through machine learning research is the way we characterize generalization:
curse of dimensionality
bias vs. variance
overtraining
simplicity
capacity
Some current directions in machine learning for NLP:
getting at ‘deep’ structures
task-specific representations (remember, there’s no free lunch!)
scaling methods to deal with huge datasets
Data mining uses machine learning to find patterns in unstructured data collections...
... which we’ll be looking at in more detail in the fall