Machine Learning for Natural Language Processing: Kernel Functions and Tree Kernels

An overview of machine learning techniques, specifically kernel functions and tree kernels, for natural language processing. Topics include the use of timbl for classification, string and tree kernels, and the benefits of kernel methods in linguistics. The document also discusses the challenges of floating point arithmetic and the importance of designing appropriate kernel functions for linguistic problems.

Uploaded on 03/28/2010

koofers-user-37i

Project
Presentations next week:
Hannah, Lara, (Guoyan), Lucien, (John), Rebecca
Final project due: Wednesday, May 13 @ 5:00pm
Turn in a paper explaining what you did, how well it worked, why it worked as well as it did but not better, etc., plus any programs you wrote
The project grade will be based on the paper, which should look like a conference paper (8 pages, references)

Project
Are chunks within the same clause part of the same argument?
Feature vector:
NP,NP,Britain,NNP,’s,POS,yes.
NP,VP,industry,NN,is,VBZ,no.
Use Timbl for classification
Best result so far: 81.27% accuracy
Largest source of error is PPs
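
To make the classification step concrete, here is a minimal sketch of instance-based (memory-based) classification, the family of methods Timbl implements. It is not Timbl itself: it assumes a plain overlap distance (count of mismatching features) and a majority vote, and leaves out feature weighting and Timbl's other refinements. The function names and the test instance are made up for illustration; only the two training instances come from the slide.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions where two instances disagree."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, memory, k=1):
    """Label `instance` by majority vote among its k nearest stored examples."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# The two training instances from the slide; the last field is the class label.
memory = [
    (("NP", "NP", "Britain", "NNP", "’s", "POS"), "yes"),
    (("NP", "VP", "industry", "NN", "is", "VBZ"), "no"),
]

# A hypothetical test instance (not from the source data).
test_instance = ("NP", "VP", "company", "NN", "is", "VBZ")
print(classify(test_instance, memory))
```

The test instance differs from the second stored example in one feature and from the first in five, so the nearest neighbour decides the label.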

Kernel functions
Much of the power of SVMs comes from the use of kernel functions and derived feature spaces
Linear kernels allow efficient processing of the very large feature vectors that come with a bag of words model
Polynomial kernels capture dependencies between features
Special purpose kernels reflect the structure of a particular problem
Combinations of kernels are also kernels
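
The kernels listed above can be sketched in a few lines. This is an illustrative sketch, not any particular SVM library's API: the sparse dict representation, the function names, and the default `degree` and `c` values are all assumptions chosen for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Sparse bag-of-words vector: {word: count}."""
    return Counter(text.lower().split())


def linear_kernel(x, y):
    """Dot product of two sparse vectors; only shared features contribute."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the smaller vector
    return sum(v * y.get(f, 0) for f, v in x.items())


def polynomial_kernel(x, y, degree=2, c=1.0):
    """(x.y + c)^degree, which implicitly includes products of features."""
    return (linear_kernel(x, y) + c) ** degree


def sum_kernel(k1, k2):
    """The sum of two kernels is itself a kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)


d1 = bag_of_words("the dog chased the cat")
d2 = bag_of_words("the cat sat")
combined = sum_kernel(linear_kernel, polynomial_kernel)

print(linear_kernel(d1, d2))      # shared words: "the" (2*1) and "cat" (1*1)
print(polynomial_kernel(d1, d2))
print(combined(d1, d2))
```

The linear kernel never touches features that are absent from either document, which is why very large vocabularies stay cheap.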

String kernels
String subsequence kernels represent a string as a bag of (possibly discontinuous) n-grams
The feature set is very large, but dot products can be computed efficiently
Dynamic programming and suffix trees
For text classification SSKs give a small improvement over n-gram kernels for small training sets
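
As a rough illustration of what a string subsequence kernel measures, the sketch below builds the feature vectors explicitly by brute force. It assumes the common gap-weighted definition, in which each occurrence of a subsequence contributes `lam ** span`, where `span` is the length of the stretch of the string it covers; the decay parameter `lam` and the function names are assumptions for this example. It is exponential in `n` and is not the efficient dynamic-programming algorithm the slide refers to.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, n, lam):
    """Map each length-n subsequence of s to its gap-weighted count."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # stretch of s covered by this occurrence
        feats[u] += lam ** span
    return feats


def ssk(s, t, n, lam=0.5):
    """Dot product of the two explicit subsequence feature vectors."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


# "cat" and "car" share only the subsequence "ca" (span 2 in both strings).
print(ssk("cat", "car", 2))
print(ssk("cat", "cat", 2))
```

With `n = 2`, "cat" has the features "ca", "ct" (discontinuous) and "at", so the only feature it shares with "car" is "ca".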

Tree kernels
We can use similar tricks to compare trees by comparing common subtrees
Given trees $T_1$ and $T_2$, with nodes $N_1$ and $N_2$, define:
$$I_i(n) = \begin{cases} 1 & \text{if subtree } i \text{ is rooted at } n \\ 0 & \text{otherwise} \end{cases}$$
The kernel function is:
$$K(T_1, T_2) = h(T_1) \cdot h(T_2)$$
where
$$h_i(T_1) = \sum_{n_1 \in N_1} I_i(n_1)$$
or the number of times subtree $i$ occurs in tree $T_1$
Tree kernels
The feature vector $h(T_1)$ will have as many dimensions as there are possible subtrees (which will be astronomical)
But the dot product $h(T_1) \cdot h(T_2)$ can only depend on dimensions for subtrees which occur in both $T_1$ and $T_2$
Let $C(n_1, n_2)$ be the number of common subtrees rooted at $n_1$ and $n_2$
The kernel function is:
$$K(T_1, T_2) = h(T_1) \cdot h(T_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \sum_i I_i(n_1)\, I_i(n_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} C(n_1, n_2)$$

Tree kernels
We can efficiently compute $C(n_1, n_2)$ by recursion
If the rules applied at $n_1$ and $n_2$ are different, then there are no common subtrees and $C(n_1, n_2) = 0$
If the rules are the same and $n_1$ and $n_2$ are preterminals, then $C(n_1, n_2) = 1$
Otherwise:
$$C(n_1, n_2) = \prod_{i=1}^{nc(n_1)} \bigl(1 + C(ch(n_1, i), ch(n_2, i))\bigr)$$
where $nc(n_1)$ is the number of children of $n_1$ and $ch(n, i)$ is the $i$-th child of $n$
Worst case, $K$ can be computed in $O(|N_1|\,|N_2|)$ time, but in practice $C(n_1, n_2) = 0$ for most $n_1, n_2$ and the computation is much cheaper
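
The recursion above translates almost directly into code. The sketch below assumes a particular tree encoding that is not from the slides: each node is a tuple `(label, child, ...)`, words are plain strings, and words occur only under preterminals. The function names are illustrative. It makes no attempt at memoization, which a practical implementation would likely add.

```python
def is_word(node):
    return isinstance(node, str)


def nodes(tree):
    """Yield every non-word node of the tree."""
    if is_word(tree):
        return
    yield tree
    for child in tree[1:]:
        yield from nodes(child)


def production(node):
    """The rule applied at a node: its label plus the labels of its children."""
    return (node[0], tuple(c if is_word(c) else c[0] for c in node[1:]))


def is_preterminal(node):
    return all(is_word(c) for c in node[1:])


def common_subtrees(n1, n2):
    """C(n1, n2): number of common subtrees rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1 + common_subtrees(c1, c2)
    return result


def tree_kernel(t1, t2):
    """K(T1, T2): sum of C(n1, n2) over all pairs of nodes."""
    return sum(common_subtrees(a, b) for a in nodes(t1) for b in nodes(t2))


the_dog = ("NP", ("D", "the"), ("N", "dog"))
a_dog = ("NP", ("D", "a"), ("N", "dog"))
print(tree_kernel(the_dog, the_dog))
print(tree_kernel(the_dog, a_dog))
```

The tree for "the dog" has six subtrees under this definition (two preterminal subtrees plus four rooted at NP), each occurring once. The two trees share three of them: `[N dog]`, `[NP D N]`, and `[NP D [N dog]]`.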

Floating point arithmetic
Digital computers can’t represent real numbers:
    bulba% python
    Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
    [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 3.3
    3.2999999999999998
    >>>
Financial calculations use integers
Scientific calculations use approximations, which vary in their accuracy
Standard for floating point calculations: IEEE 754
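
The session above can be reproduced in a current Python 3, with one difference: modern interpreters print the shortest string that round-trips (`3.3`), so the stored value has to be requested explicitly. The sketch below also illustrates the point about financial calculations by counting in integer cents; the amounts are made up for the example.

```python
from decimal import Decimal

# Modern Python prints the shortest repr; 17 significant digits show
# the value that Python 2.2 displayed.
print(repr(3.3))
print(format(3.3, ".17g"))

# Decimal(float) exposes the exact binary value that is actually stored.
print(Decimal(3.3))

# Ten payments of 0.10 as floats, added one at a time, drift away from 1.0 ...
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)

# ... while the same sum in integer cents is exact.
cents = 0
for _ in range(10):
    cents += 10
print(cents == 100)
```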

Floating point arithmetic
Floating point numbers are stored as a mantissa and an exponent
IEEE floating point formats (approximate values):
    precision   min             max            eps            digits
    single      1.2 × 10^-38    3.4 × 10^38    1.2 × 10^-7    7
    double      2.2 × 10^-308   1.8 × 10^308   2.2 × 10^-16   16
Just because you can represent 10^300 doesn’t mean you get 300 significant digits!
Default in python and perl is double precision
Don’t use single precision (float) unless you have a good reason
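
The limits of the double format can be inspected from Python itself through `sys.float_info`. The sketch below also rounds a value to single precision via the standard `struct` module to show how much less a C `float` holds; the helper name `to_single` is an assumption for this example.

```python
import struct
import sys


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


info = sys.float_info
print(info.min, info.max, info.epsilon, info.dig)

# eps is the gap between 1.0 and the next representable double.
print(1.0 + info.epsilon > 1.0)
print(1.0 + info.epsilon / 2 == 1.0)

# 1e300 is representable, but it carries only about 16 significant digits.
print(1e300 + 1.0 == 1e300)

# Single precision keeps only about 7 digits of 0.1.
print(format(to_single(0.1), ".17g"))
```

Note that `info.dig` reports the number of decimal digits guaranteed to survive a round trip through a double, which is slightly more conservative than the figure in the table.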

Floating point arithmetic
It’s easy to lose precision
Things to watch out for:
  subtractions of numbers that are nearly equal
  additions of numbers whose magnitudes are nearly equal, but whose signs are opposite
  additions and subtractions of numbers that differ greatly in magnitude
Exact comparisons between floating point numbers can be misleading
The same operations performed in a different order or on different hardware may give different results
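
Each of the pitfalls above can be triggered with a line or two of double-precision arithmetic. The specific numbers below are chosen for illustration; `math.isclose` and `math.fsum` are standard-library functions.

```python
import math

# Numbers that differ greatly in magnitude: the small one is absorbed.
big, small = 1e16, 1.0
print(big + small == big)

# Order matters: the same three terms give different results.
left_to_right = (big + small) - big
reordered = (big - big) + small
print(left_to_right, reordered)
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Subtracting nearly equal numbers leaves mostly rounding error.
x = 1e-8
recovered = (1.0 + x) - 1.0
print(recovered, abs(recovered - x) / x)

# Exact comparison is misleading; compare with a tolerance instead.
print(0.1 + 0.2 == 0.3)
print(math.isclose(0.1 + 0.2, 0.3))

# math.fsum tracks the lost low-order bits and recovers the exact sum.
print(math.fsum([big, small, -big]))
```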

A look back
We’ve come a long way, from flipping coins to Support Vector Machines
Non-parametric methods:
  decision trees
  instance-based learning
  transformation-based learning
  perceptron
  support vector machines
Parametric methods:
  naive Bayes
  maximum entropy

A look back
One theme that runs through machine learning research is the way we characterize generalization
  curse of dimensionality
  bias vs. variance
  overtraining
  simplicity
  capacity

A look ahead
Some current directions in machine learning for NLP:
  getting at ‘deep’ structures
  task-specific representations (remember, there’s no free lunch!)
  scaling methods to deal with huge datasets
Data mining uses machine learning to find patterns in unstructured data collections...
... which we’ll be looking at in more detail in the fall