Bongard Problems - Machine Learning | CMPSCI 689, Assignments of Computer Science

Material Type: Assignment; Professor: Mahadevan; Class: Machine Learning; Subject: Computer Science; University: University of Massachusetts - Amherst; Term: Spring 2005;

Uploaded on 08/19/2009

CMPSCI 689: Machine Learning
Sridhar Mahadevan
University of Massachusetts

Bongard Problems

Perspectives on Learning

Biology: Brain, Development, Evolution, Genetics, Neuroscience.

Information Theory: Coding Theory, Entropy.

Linguistics: Grammars, Language acquisition.

Mathematics: Calculus, Linear Algebra, Optimization.

Psychology: Analogy, Concept Learning, Curiosity, Discovery, Memory, Reinforcement.

Philosophy: Causality, Induction, Theory Formation.

Statistics: Probability Distributions, Estimation, Hypothesis Testing.

Parametric vs. Nonparametric Learning

Parametric learning

The learner assumes that the data comes from a specific distribution $P(x)$.

Examples: Multivariate Gaussian, hidden Markov model, dynamic Bayes nets, etc.

Nonparametric learning

The learner has no knowledge of the specific distribution, but may make other assumptions (e.g., stationarity).

Examples: Perceptron, Support Vector Machine, Kernel density estimation.
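To make the contrast concrete, here is a minimal sketch in Python (standard library only). It assumes one-dimensional data, fits a Gaussian by maximum likelihood as the parametric estimate, and uses a Gaussian-kernel density estimate as the nonparametric one. The function names and the bandwidth value are illustrative choices, not part of the course material.

```python
import math
import statistics


def fit_gaussian(data):
    """Parametric: summarize the data by two numbers (mean, std. deviation)."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # maximum-likelihood (population) estimate
    return mu, sigma


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kde_pdf(x, data, bandwidth):
    """Nonparametric: average one Gaussian bump per training point.

    The whole training set is kept; `bandwidth` is a smoothing
    parameter chosen by the user (an assumption of this sketch).
    """
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    mu, sigma = fit_gaussian(sample)
    print("parametric density at 3.0:", gaussian_pdf(3.0, mu, sigma))
    print("kernel density at 3.0:", kde_pdf(3.0, sample, bandwidth=1.0))
```

The parametric model stores only two numbers regardless of the sample size, while the kernel estimate grows with the data.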

Three Ways of Finding Structure in Data

Density estimation: "Unsupervised" learning

Estimate (joint) distribution of the data $P(X)$

Classification: "Supervised" learning

Estimate conditional distribution $P(Y \mid X)$

Regression: Function approximation

Estimate conditional mean $E[Y \mid X]$

Ways of Measuring Distance

  1. The dataset domain $D$ is a vector space with a norm or inner product defined on it.

(a) Euclidean distance: $d(x, x_m) = \sqrt{\sum_i (x_i - x_{mi})^2}$

(b) Mahalanobis distance: $d(x, x_m) = \sqrt{(x - x_m)^T \Sigma^{-1} (x - x_m)}$

(c) KL divergence: $d(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

  2. The dataset comes from a non-vector-space domain, e.g., text, bioinformatics, sensor networks.

(a) Define a featurizer $\phi(x)$, e.g., a kernel function $k(x, y) : D \times D \to R$ with $k(x, y) = \langle \phi(x), \phi(y) \rangle$
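The distance measures listed above can be sketched in a few lines of Python (standard library only). This is an illustrative sketch under stated assumptions: vectors are plain sequences of floats, the inverse covariance matrix is supplied by the caller, and the degree-2 polynomial kernel with its feature map `phi` is a hypothetical example of a featurizer, not one named in the slides.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, inv_cov):
    """Mahalanobis distance; inv_cov is the inverse covariance matrix
    given as a list of rows (assumed precomputed by the caller)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(n) for j in range(n))
    return math.sqrt(quad)


def kl_divergence(p, q):
    """KL divergence between two discrete distributions over the same
    outcomes. Terms with p(x) = 0 contribute nothing; q(x) must be
    positive wherever p(x) is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def phi(x):
    """Hypothetical featurizer for 2-D inputs: degree-2 monomials."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, equal to the inner product of phi(x) and phi(y)."""
    return sum(a * b for a, b in zip(x, y)) ** 2


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(euclidean((0.0, 0.0), (3.0, 4.0)))
    print(mahalanobis((0.0, 0.0), (3.0, 4.0), identity))
    print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
    print(poly2_kernel((1.0, 2.0), (3.0, 4.0)))
```

With the identity matrix as inverse covariance, the Mahalanobis distance reduces to the Euclidean one. Note that the KL divergence is not symmetric in its arguments.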

How Google™ learns to rank web pages

A web site is an authority if many sites link to it. A web site is a hub if it links to many sites.

Google computes a ranking $x_1, \ldots, x_N$ of authorities and $y_1, \ldots, y_M$ of hubs.

Initialize: $x_i^0$ is the number of links pointing to $i$ and $y_i^0$ is the number of links going out of $i$.

But not all links should be weighted equally. For example, links from authorities (or hubs) should count more.

$x_i^1 = \sum_{j \text{ links to } i} y_j^0$, i.e. $x^1 = A^T y^0$, and $y_i^1 = \sum_{i \text{ links to } j} x_j^0$, i.e. $y^1 = A x^0$

$x^k = A^T A \, x^{k-1}$ and $y^k = A A^T y^{k-1}$

This is an iterative singular value decomposition (SVD), and Google™ is solving the world's largest SVD problem over a matrix $A$ of size 4 billion by 4 billion!
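The hub and authority updates above (a scheme generally known as the HITS algorithm) can be sketched as a small power iteration in Python. This is a minimal sketch, not a description of any production system: the adjacency matrix is a dense list of lists with `adj[i][j] == 1` when page `i` links to page `j`, and the rescaling to unit length after each step is an added assumption that keeps the scores from growing without bound.

```python
import math


def normalize(v):
    """Rescale a vector to unit Euclidean length (left as-is if all zero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def hits(adj, iterations=50):
    """Return (authority, hub) scores for a link graph.

    adj[i][j] == 1 means page i links to page j.
    """
    n = len(adj)
    # Initialization from the slide: in-degree and out-degree.
    auth = [float(sum(adj[j][i] for j in range(n))) for i in range(n)]
    hub = [float(sum(adj[i][j] for j in range(n))) for i in range(n)]
    for _ in range(iterations):
        # Authority of i: sum of hub scores of pages linking to i (A^T y).
        new_auth = [sum(adj[j][i] * hub[j] for j in range(n)) for i in range(n)]
        # Hub score of i: sum of authority scores of pages i links to (A x).
        new_hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


if __name__ == "__main__":
    # Pages 0, 1 and 2 all link to page 3; page 3 links nowhere.
    graph = [
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    authority, hub_scores = hits(graph)
    print("authority:", authority)
    print("hub:", hub_scores)
```

In this toy graph, page 3 ends up as the only authority and pages 0 to 2 share the hub score equally.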

Bayesian Inference

$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$

$P(c_i \mid X) = \frac{P(X \mid c_i) \, P(c_i)}{P(X)}$

where the evidence (denominator) term can be computed as

$P(X) = \sum_i P(X \mid c_i) \, P(c_i)$
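As a worked illustration of the rule above, the following Python sketch computes the posterior over a finite set of classes from priors and likelihoods. The class names and all numbers are made up for the example.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a finite set of classes.

    priors[c]      = P(c)
    likelihoods[c] = P(X | c) for the observed X
    Returns a dict with P(c | X) for every class c.
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(X) = sum_i P(X | c_i) P(c_i)
    return {c: joint[c] / evidence for c in joint}


if __name__ == "__main__":
    # Hypothetical numbers: two equally likely classes, and an
    # observation that is four times more likely under "space".
    priors = {"space": 0.5, "sports": 0.5}
    likelihoods = {"space": 0.08, "sports": 0.02}
    print(posterior(priors, likelihoods))
```

Because the evidence term is the same for every class, it only rescales the products of likelihood and prior so that the posteriors sum to one.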

Example: Document Classification

"The countdown resumed Tuesday for the launch of NASA's controversial Cassini probe to Saturn after engineers fixed a technical problem at the launch pad. NASA has rescheduled the beginning of the $3.4 billion mission...."

Attributes: $a_1$ = "the", $a_2$ = "countdown", ..., $a_{93}$ = "launch".

How many probabilities do we need? Assume a maximum document length of

words, and

possible categories.

Assuming

words in English, we get

million!!!

Bag of Words Representation

Word probabilities are conditionally independent given the category.

Word probabilities are marginally independent of location in the document.

$P(a_i = \text{"NASA"} \mid c_j) = P(a_k = \text{"NASA"} \mid c_j)$

This is called the "bag of words" representation in IR.

So, in our example, this means that the number of probabilities needed is
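Under the two independence assumptions above, a document classifier only needs one probability per (word, category) pair. The sketch below shows one way this could look in Python. The add-one (Laplace) smoothing, the whitespace tokenizer, and the toy training documents are assumptions made for the example and do not come from the slides.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (category, text) pairs.

    Returns class priors, per-category word counts, and the vocabulary.
    Word position is ignored: each document is a bag of words.
    """
    cat_counts = Counter()
    word_counts = {}
    vocab = set()
    for cat, text in docs:
        words = text.lower().split()
        cat_counts[cat] += 1
        word_counts.setdefault(cat, Counter()).update(words)
        vocab.update(words)
    total = sum(cat_counts.values())
    priors = {c: n / total for c, n in cat_counts.items()}
    return priors, word_counts, vocab


def classify(text, priors, word_counts, vocab):
    """Return the category with the highest log posterior score."""
    scores = {}
    for cat, prior in priors.items():
        counts = word_counts[cat]
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        denom = sum(counts.values()) + len(vocab)
        score = math.log(prior)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        scores[cat] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    training = [
        ("space", "nasa launch probe saturn"),
        ("space", "nasa mission launch pad"),
        ("sports", "team wins final game"),
        ("sports", "coach praises team"),
    ]
    model = train(training)
    print(classify("nasa probe launch", *model))
    print(classify("team game", *model))
```

Working with sums of log probabilities rather than products avoids numerical underflow on long documents.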

Exact and Approximate Inference

Bayesian methods require computing the likelihood function $P(x \mid y)$ and the marginal $P(y)$.

Exact inference requires enumerating all the possible hypotheses efficiently, e.g., Pearl's belief propagation algorithm or the sum-product algorithm.

Approximate inference restricts the hypotheses considered, e.g., maximum likelihood or Monte Carlo methods.
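The difference between the two styles of inference can be illustrated on a toy model small enough to enumerate. The Python sketch below computes a marginal probability exactly by summing over every hypothesis, then approximates the same quantity by Monte Carlo sampling. The three hypotheses, their probabilities, and the names `h0`, `h1`, `h2` are invented for the example, and brute-force enumeration stands in here for the more efficient exact algorithms mentioned above.

```python
import random

# Hypothetical toy model: three hypotheses and a binary observation.
PRIOR = {"h0": 0.5, "h1": 0.3, "h2": 0.2}       # P(h)
LIKELIHOOD = {"h0": 0.9, "h1": 0.5, "h2": 0.1}  # P(obs = 1 | h)


def exact_marginal(prior, likelihood):
    """Exact: sum P(obs | h) P(h) over every hypothesis h."""
    return sum(prior[h] * likelihood[h] for h in prior)


def monte_carlo_marginal(prior, likelihood, n_samples, seed=0):
    """Approximate: draw hypotheses from the prior and average the
    likelihood over the samples instead of visiting every hypothesis."""
    rng = random.Random(seed)
    hypotheses = list(prior)
    weights = [prior[h] for h in hypotheses]
    samples = rng.choices(hypotheses, weights=weights, k=n_samples)
    return sum(likelihood[h] for h in samples) / n_samples


if __name__ == "__main__":
    print("exact:", exact_marginal(PRIOR, LIKELIHOOD))
    print("Monte Carlo:", monte_carlo_marginal(PRIOR, LIKELIHOOD, 20000))
```

Enumeration is only feasible when the hypothesis space is small, whereas the sampling estimate trades exactness for a cost that depends on the number of samples rather than the number of hypotheses.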

Breaking the Curse of Dimensionality with Kernel Methods

Primal Form:

Size of the hypothesis is proportional to number of attributes.

Perceptron: $h(x) = \mathrm{Sgn}\left(\sum_i w_i x_i\right)$

Dual Form:

Kernel methods represent a hypothesis as a linear combination of training examples: $h(x) = \sum_i \alpha_i K(x, x_i)$

An interesting sparsity property further reduces the number of parameters to (sometimes) constant size!
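A kernel perceptron is one simple instance of the dual form. The Python sketch below keeps one coefficient per training example and never builds a weight vector over attributes. It assumes labels in {-1, +1} and uses a Gaussian (RBF) kernel with an arbitrarily chosen width; both the kernel choice and the function names are illustrative assumptions.

```python
import math


def rbf_kernel(x, z, gamma=2.0):
    """Gaussian (RBF) kernel; gamma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, train_x, alpha, kernel):
    """Dual form: sum_i alpha_i * K(x, x_i) over the training examples."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, train_x))


def train_kernel_perceptron(train_x, train_y, kernel, max_epochs=100):
    """Learn signed coefficients alpha_i, one per training example.

    On each mistake the label of the misclassified example is added
    to its coefficient; examples never misclassified keep alpha_i = 0.
    """
    alpha = [0.0] * len(train_x)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(train_x, train_y)):
            if yi * decision_value(xi, train_x, alpha, kernel) <= 0:
                alpha[i] += yi
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return alpha


def predict(x, train_x, alpha, kernel):
    """Sign of the dual-form decision value."""
    return 1 if decision_value(x, train_x, alpha, kernel) > 0 else -1


if __name__ == "__main__":
    # XOR-style data: not linearly separable in the original attributes.
    points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
    labels = [-1, -1, 1, 1]
    alpha = train_kernel_perceptron(points, labels, rbf_kernel)
    print("alpha:", alpha)
    print("predictions:", [predict(p, points, alpha, rbf_kernel) for p in points])
```

The number of parameters equals the number of training examples rather than the number of attributes, and any example whose coefficient stays at zero can be dropped from the hypothesis.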

Administrivia

See web page www-edlab.cs.umass.edu/cs

Instructor:

My office hours: T/Th 10:30-12, 204

TA: TBA

Ed lab account on elnux*.cs.umass.edu (MATLAB, Bayes Net Toolbox)