Unsupervised Learning: Agglomerative Clustering & Microarray Analysis, Assignments of Programming Languages

An overview of unsupervised learning, focusing on clustering and model fitting. Topics include the goals of presentations, the unsupervised problem, typical unsupervised tasks, and bioinformatics applications using microarrays, followed by agglomerative clustering, distance measures between clusters, and dendrograms.

Typology: Assignments

Pre 2010

Uploaded on 07/22/2009

koofers-user-kxl
Unsupervised Learning:
Clustering & Model Fitting

Administrivia

Reminder: office hours truncated tomorrow

“whenever I get in” until noon

HW3 due: Dec 2

Have an excellent Turkey Day!

The art of presentations

Do NOT tell us:

Every detail of every experiment

Choose the parts to show us carefully

Each thing you show us should be informative about your conclusions

Place for excruciating detail is the paper

Every step of all the math

Every background reference

Focus on the “big picture” and the “take home message”

Listeners will take home ~10 bytes. Make sure they’re the right 10 bytes!

The unsupervised problem

Given:

Set of data points

Find:

Good description of the data

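General unsup tasks

Given: data matrix

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1N} \\
x_{21} & \ddots & & x_{2N} \\
\vdots & & \ddots & \vdots \\
x_{d1} & \cdots & \cdots & x_{dN}
\end{bmatrix}
$$

Which points are similar?

How do points cluster together?

How many groups are there?

Statistical description of distribution of data?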

5 minutes of bioinformatics

Gene microarray (a.k.a., genechip, DNA chip, etc.)

Measure thousands (10s or 100s of thousands) of genes simultaneously

Critical tool in bioinformatics

Understand function of genes, networks of gene activity, response to stimuli, etc.

Leads to some very nasty analysis problems...

Only mRNA can be (easily) measured

When gene is “activated”, mRNA is produced

Can be “upregulated” or “downregulated” to produce diff. concentrations of mRNA

Can be active or inactive under different conditions:

External stimuli (food, pH, temperature, viral infection, etc.)

Internal metabolic processes (cell cycles, pathways, etc.)

mRNA measurements correlated with cell activity

5 minutes of bioinformatics

measuring many mRNA...

[Figure labels: Population A of cells, Population B of cells, mRNA pool A, mRNA pool B, Irradiation]

5 minutes of bioinformatics

Imaging

Data vector: $[x_1, x_2, \ldots, x_d]$

5 minutes of bioinformatics

Measure populations over time

Monitor development of cell, metabolic processes, response to introduction of stimulus, etc.

Time series of data

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1N} \\
x_{21} & \ddots & & x_{2N} \\
\vdots & & \ddots & \vdots \\
x_{d1} & \cdots & \cdots & x_{dN}
\end{bmatrix}
$$

(rows: genes; columns: timepoints)

Can consider either rows or columns to be “points”, depending on what you want to know
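The choice between rows and columns as “points” can be sketched in a few lines. This is only an illustration: the numbers below are made up, and the helper names `rows_as_points` and `columns_as_points` are hypothetical, not from the slides.

```python
# Hypothetical toy data (not real measurements):
# d = 3 genes (rows) measured at N = 4 timepoints (columns).
X = [
    [0.1, 0.4, 0.9, 1.2],  # gene 1 over time
    [0.2, 0.5, 1.0, 1.1],  # gene 2 over time
    [2.0, 1.5, 0.7, 0.3],  # gene 3 over time
]


def rows_as_points(matrix):
    """Each gene becomes a point described by its value at every timepoint."""
    return [tuple(row) for row in matrix]


def columns_as_points(matrix):
    """Each timepoint becomes a point described by the value of every gene."""
    return [tuple(col) for col in zip(*matrix)]


gene_points = rows_as_points(X)     # 3 points, each of dimension 4
time_points = columns_as_points(X)  # 4 points, each of dimension 3

print(len(gene_points), len(gene_points[0]))
print(len(time_points), len(time_points[0]))
```

Clustering `gene_points` asks which genes behave alike over time; clustering `time_points` asks which timepoints show a similar overall expression pattern.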

Similarity & distance

Most clust. algorithms based on distances between points

Recall: distance (metric) function $d(x_1, x_2)$

Symmetry: $d(x_1, x_2) = d(x_2, x_1)$

Identity: $d(x_1, x_1) = 0$

Triangle inequality: $d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)$

E.g., Euclidean distance, kernel distance, etc.

Sometimes have a natural similarity function instead

Can usually convert to a metric or semi-metric
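As a rough illustration of the three properties, the sketch below checks them by brute force on a small, made-up set of points. The function names and sample points are assumptions for the example. Passing the check on a finite sample does not prove that a function is a metric in general; failing it does show that it is not. Squared Euclidean distance is included as an example of a function that is symmetric and zero on identical points but breaks the triangle inequality.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric, but not a true metric."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def satisfies_metric_axioms(points, d, tol=1e-9):
    """Brute-force check of symmetry, identity, and triangle inequality."""
    for x1, x2 in product(points, repeat=2):
        if abs(d(x1, x2) - d(x2, x1)) > tol:       # symmetry
            return False
    for x in points:
        if abs(d(x, x)) > tol:                     # identity: d(x, x) = 0
            return False
    for x1, x2, x3 in product(points, repeat=3):
        if d(x1, x3) > d(x1, x2) + d(x2, x3) + tol:  # triangle inequality
            return False
    return True


# Made-up sample points for the check.
sample_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

print(satisfies_metric_axioms(sample_points, euclidean))
print(satisfies_metric_axioms(sample_points, squared_euclidean))
```

The squared version fails on the three collinear points: the direct distance from (0, 0) to (2, 0) is 4, while going through (1, 0) costs only 1 + 1 = 2.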

Agglomerative clustering

Group clusters by mutual distance

“Bottom-up” method: start w/ points and combine into groups, combine groups, etc.
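A minimal sketch of the bottom-up loop, assuming a naive search over all cluster pairs at every step. The function names, the `n_clusters` stopping rule, and the default cluster distance (smallest point-to-point distance) are assumptions made for the example, not the lecture's implementation; the cluster distance is passed in as a parameter because the choice of measure is a separate question.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def closest_pair_distance(c1, c2):
    """Assumed default: smallest point-to-point distance between two clusters."""
    return min(euclidean(x, y) for x in c1 for y in c2)


def agglomerate(points, cluster_distance=closest_pair_distance, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns (clusters, history); history lists each merge as
    (left_cluster, right_cluster, distance), in merge order.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # start with every point on its own
    history = []
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = cluster_distance(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        history.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        del clusters[j]  # delete the larger index first so i stays valid
        del clusters[i]
        clusters.append(merged)
    return clusters, history


# Made-up example: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters, history = agglomerate(points, n_clusters=2)
print(clusters)
```

The merge history records which groups were joined and at what distance, which is the information a dendrogram displays. The naive pair search is slow for large data sets; it is kept here only because it mirrors the description directly.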

Dist between clusters?

Problem: We have distance between pairs of points

Agglomerative clustering requires distance between pairs of clusters

A number of measures are possible:

$$
d_{\min}(c_1, c_2) = \min_{x \in c_1;\, x' \in c_2} d(x, x')
$$

$$
d_{\max}(c_1, c_2) = \max_{x \in c_1;\, x' \in c_2} d(x, x')
$$
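The minimum and maximum cluster distances translate almost directly into code. The sketch below assumes Euclidean distance as the point-to-point distance and uses made-up clusters; the function names are chosen for the example. These two measures are commonly known as single linkage (minimum) and complete linkage (maximum).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def d_min(c1, c2, d=euclidean):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(x, xp) for x in c1 for xp in c2)


def d_max(c1, c2, d=euclidean):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(d(x, xp) for x in c1 for xp in c2)


# Made-up clusters lying on a line.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]

print(d_min(c1, c2))  # 3.0, from (1, 0) to (4, 0)
print(d_max(c1, c2))  # 6.0, from (0, 0) to (6, 0)
```

Both functions compare every cross-cluster pair, so each call costs on the order of |c1| × |c2| point distances.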