Cluster Analysis and Hierarchical Cluster Analysis, Study notes of Advanced Physics

These notes introduce cluster analysis, with a focus on hierarchical cluster analysis (HCA). HCA is an unsupervised clustering method that groups samples into homogeneous classes based on their distance or similarity. The notes cover the three most common methods for linking samples in HCA (single link, complete link, and centroid link) and give examples of dendrograms produced using single, complete, and centroid linkages for raw and autoscaled data.

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

gerrard_11

Cluster analysis

The basic assumption with these methods is that measurements made for related samples tend to be similar.

Overall, the distance between similar samples is smaller than for unrelated samples.

Clustering methods

We’ll look at three unsupervised clustering methods.

• Univariate clustering
  • Evaluates individual variables (raw or scaled).
  • Groups samples into homogeneous classes.
• Hierarchical cluster analysis
  • Reduction of multiple variables for a sample to a single ‘distance’ value.
  • Rank and link samples based on relative distances.
• k-means clustering
  • Grouping of samples into a set number of classes.
  • Use all variables to determine relative distances.

Univariate clustering

• Creates ‘k’ homogeneous classes.
• Uses within-class variances as a measure of homogeneity.
• Can be used to convert a quantitative variable into a discrete ordinal variable.
• Another use is to simply evaluate if a variable has any ‘classification’ type information.
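The idea of creating ‘k’ classes with the smallest within-class variance can be sketched in a few lines. This is only an illustration, not the routine used in the course: it assumes an exhaustive search over cut positions in the sorted data (practical for small data sets only), and the function names and the sample values are hypothetical.

```python
from itertools import combinations

def within_class_ss(values):
    """Sum of squared deviations from the class mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def univariate_clusters(values, k):
    """Split the sorted values into k contiguous classes with the
    smallest total within-class sum of squares (exhaustive search)."""
    data = sorted(values)
    n = len(data)
    best_cost, best_classes = None, None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        classes = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(within_class_ss(c) for c in classes)
        if best_cost is None or cost < best_cost:
            best_cost, best_classes = cost, classes
    return best_classes

# Hypothetical values for a single variable (not the real Iris measurements).
widths = [2, 3, 2, 4, 13, 15, 14, 12, 22, 24, 23, 25]
classes = univariate_clusters(widths, 3)
print(classes)
```

Each returned class is a run of neighbouring values, so the class index can serve directly as an ordinal label.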

[Figure: Histogram (Petal width); x-axis: petal width, y-axis: relative frequency]

Iris dataset

• Species: I. setosa, I. versicolor, I. virginica
• Properties: petal width, petal length, sepal width, sepal length

We’ll look at a single property - petal width.

Univariate clustering

[Figure: Histogram (Petal width); x-axis: petal width, y-axis: relative frequency]

The goal is to partition the data so that you have ‘k’ clusters of data.

Iris data

A simple ranking of the data indicates that we would get reasonable clustering based on petal width.

Iris data

Body Temp (from exam)

Not exactly the best classification. It does show some skew in the results (more men in class one and more women in class two), and there is a fair amount of overlap.

So what’s it good for?

Really only useful for an initial evaluation of individual variables.

Only want to use it when you have a small number of classes (or potential classes).

Main use is to convert quantitative (continuous) data to ordinal data.
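Converting a continuous variable to ordinal data comes down to comparing each value with the class boundaries. The sketch below assumes the boundaries have already been chosen; the cut points and the function name are hypothetical and not taken from the notes.

```python
from bisect import bisect_right

def to_ordinal(value, cut_points):
    """Return the 1-based ordinal class of a value.

    cut_points must be sorted in increasing order; a value equal to a
    cut point is placed in the higher class.
    """
    return bisect_right(cut_points, value) + 1

# Hypothetical boundaries between three classes of a single variable.
cut_points = [8, 18]
ordinal = [to_ordinal(v, cut_points) for v in [3, 13, 22, 8]]
print(ordinal)
```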

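HCA Distance and similarity

The first step in conducting HCA is to determine the distance between your samples or variables.

Distance (general form, where x_ik is the value of variable k for sample i and N is the number of variables):

$$d_{ij} = \left[ \sum_{k=1}^{N} \left| x_{ik} - x_{jk} \right|^{M} \right]^{1/M}$$

City block (M = 1):

$$d_{ij} = \sum_{k=1}^{N} \left| x_{ik} - x_{jk} \right|$$

Euclidean (M = 2, most common):

$$d_{ij} = \left[ \sum_{k=1}^{N} \left( x_{ik} - x_{jk} \right)^{2} \right]^{1/2}$$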
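The city block and Euclidean distances can both be computed with one small function. This sketch assumes the usual general form, in which the absolute differences are raised to a power M, summed over the variables, and the sum is raised to 1/M, with M = 1 giving the city block distance and M = 2 the Euclidean distance. The function name and sample values are illustrative only.

```python
def minkowski_distance(x, y, m=2):
    """Distance between two samples given as equal-length sequences.

    m=1 gives the city block distance, m=2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("samples must have the same number of variables")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Two hypothetical samples described by two variables each.
sample_i = [0.0, 0.0]
sample_j = [3.0, 4.0]
city_block = minkowski_distance(sample_i, sample_j, m=1)
euclidean = minkowski_distance(sample_i, sample_j, m=2)
print(city_block, euclidean)
```

For the same pair of samples the city block distance is never smaller than the Euclidean distance.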

Distance and similarity

Actual distances between your samples will vary based on the type and number of measurements present.

Similarity values are calculated to normalize the data to a standard scale.

$$s_{ij} = 1 - \frac{d_{ij}}{d_{max}}$$

• For similar samples, s_ij approaches 1.
• For dissimilar samples, s_ij approaches 0.
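As a rough illustration of the normalization, the sketch below builds a Euclidean distance matrix and converts it to similarities. It assumes the similarity is one minus the ratio of d_ij to the largest distance in the data set; the function name and the sample coordinates are hypothetical.

```python
from math import dist

def similarity_matrix(samples):
    """Return s_ij = 1 - d_ij / d_max for all pairs of samples."""
    n = len(samples)
    d = [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d)
    if d_max == 0:
        raise ValueError("all samples are identical; d_max is zero")
    return [[1.0 - d[i][j] / d_max for j in range(n)] for i in range(n)]

# Three hypothetical samples on a line, 5 units apart.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
s = similarity_matrix(samples)
print(s)
```

The two most distant samples always get a similarity of 0, and each sample has a similarity of 1 with itself.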

Single link

We now have a three-member cluster. Let’s skip a few steps.

Now, our points have been linked into three clusters.

All points have now been linked.
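The step-by-step linking shown above can be imitated with a short loop. This is a minimal sketch of agglomerative single linkage, assuming Euclidean distances and a brute-force search for the closest pair of clusters at each step; the function name and the four points are made up for illustration.

```python
from math import dist

def single_link(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Four hypothetical points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
history = single_link(points)
for cluster_a, cluster_b, d in history:
    print(cluster_a, "+", cluster_b, "at distance", d)
```

The merge distances in the history are the levels at which the links would be drawn in a dendrogram.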

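Other linkage methods

Complete link

Linkage is based on the farthest point in a cluster, which gives a conservative linkage.

$$d_{ij \to C} = 0.5\, d_{iC} + 0.5\, d_{jC} + 0.5 \left| d_{iC} - d_{jC} \right|$$

The distance from the merged cluster ij to another cluster C is therefore the larger of d_iC and d_jC.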
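A quick numerical check makes the "farthest point" behaviour concrete. The sketch assumes the update takes the standard form with a coefficient of 0.5 on each term, including the absolute difference, which makes it equal to the larger of the two distances. The function name and the test distances are illustrative.

```python
def complete_link_update(d_ic, d_jc):
    """Distance from the merged cluster (i, j) to another cluster C."""
    return 0.5 * d_ic + 0.5 * d_jc + 0.5 * abs(d_ic - d_jc)

# Hypothetical distances from clusters i and j to a third cluster C.
pairs = [(2.0, 5.0), (7.5, 1.5), (3.0, 3.0)]
updated = [complete_link_update(a, b) for a, b in pairs]
for (a, b), u in zip(pairs, updated):
    print(a, b, "->", u, "(max is", max(a, b), ")")
```

Changing the sign of the last term gives the smaller of the two distances instead, which is the single link rule.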

Centroid link (Ward’s Method)

Linkage is based on the center of the cluster.

$$d_{ij \to C}^{2} = \frac{n_i\, d_{iC}^{2}}{n_i + n_j} + \frac{n_j\, d_{jC}^{2}}{n_i + n_j} - \frac{n_i n_j\, d_{ij}^{2}}{(n_i + n_j)^{2}}$$
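The centroid rule can be checked against a direct calculation. The sketch below assumes the standard centroid-linkage update written in squared Euclidean distances, where n_i and n_j are the numbers of samples in the two clusters being merged. It implements the centroid rule only; the minimum-variance update usually associated with Ward's name is a different formula and is not shown. All names and coordinates are hypothetical.

```python
from math import dist

def centroid_update_sq(n_i, n_j, d_ic_sq, d_jc_sq, d_ij_sq):
    """Squared distance from the centroid of merged cluster (i, j) to C."""
    n = n_i + n_j
    return ((n_i / n) * d_ic_sq
            + (n_j / n) * d_jc_sq
            - (n_i * n_j / n ** 2) * d_ij_sq)

# Hypothetical centroids: cluster i holds 3 samples, cluster j holds 1.
c_i, n_i = (0.0, 0.0), 3
c_j, n_j = (4.0, 0.0), 1
c_other = (1.0, 2.0)

# Direct route: compute the merged centroid, then measure to C.
merged = tuple((n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(c_i, c_j))
direct_sq = dist(merged, c_other) ** 2

# Update route: only the three pairwise distances are needed.
update_sq = centroid_update_sq(
    n_i, n_j,
    dist(c_i, c_other) ** 2,
    dist(c_j, c_other) ** 2,
    dist(c_i, c_j) ** 2,
)
print(direct_sq, update_sq)
```

Both routes give the same squared distance, which is why the update can be applied to a distance matrix without going back to the raw data.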

HCA dendrograms

After conducting your linkage, you need a way to visualize the results.

Dendrograms can be used for this purpose and provide a very simple two-dimensional plot that indicates clustering, similarities and linkages.
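Reading groups off a dendrogram amounts to picking a similarity level and seeing which samples are already linked at that level. For single linkage this can be sketched without drawing anything, because two samples end up in the same group whenever a chain of pairwise similarities at or above the cut connects them. The sketch assumes Euclidean distances and a similarity of 1 minus d_ij/d_max; the function name and points are hypothetical.

```python
from math import dist

def groups_at_similarity(points, s_cut):
    """Single-linkage groups obtained by cutting at similarity s_cut."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d)  # assumed to be greater than zero
    parent = list(range(n))

    def find(i):
        # Follow parent links up to the representative of the group.
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if 1.0 - d[i][j] / d_max >= s_cut:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Five hypothetical points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0), (20.0, 0.0)]
high_cut = groups_at_similarity(points, 0.9)
low_cut = groups_at_similarity(points, 0.7)
print(high_cut, low_cut)
```

Lowering the cut merges groups, which matches moving down the similarity axis of the dendrogram.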

Dendrograms

We can now see how our samples are linked. The higher the linkage level, the lower the similarity.

[Figure: dendrogram; axis: similarity, running from 1.0 down to 0]

This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.

[Figure: dendrogram of samples A to J]

Let’s look again at our single linkage example and see what the dendrogram would look like.

Example dendrogram

[Figure: dendrogram; axis: similarity, running from 1.0 down to 0]

A real example

Substances commonly used as accelerants were assayed by capillary column GC/MS. At present, accelerants are identified based on boiling point range.

• Class assignments: A, B, C, D, E

Goal: To determine if multivariate data treatment has the potential for classification of accelerants.

Analysis conditions

Neat samples were spiked with a known amount of an internal standard.

• SP-5 25 m x 0.2 mm I.D. column
• 1 μL sample, 100:1 split injection
• 50 °C for 5 min; 10 °C/min ramp; hold at 250 °C
• Total run time: 30 minutes
• Mass range: 50-150 AMU
• ISTD: octadeuteronaphthalene

Raw - centroidal linkage

[Figure: dendrogram, centroid linkage, raw data; leaves labeled by class a to e; axis: similarity]

Raw - comparison

Centroidal linkage appears to give the best results.

[Figure: dendrograms for single, complete and centroid linkage, raw data; leaves labeled by class a to e; axis: similarity]

Autoscaled - single link

[Figure: dendrogram, single linkage, autoscaled data; leaves labeled by class a to e; axis: similarity]

Autoscaled - complete link

[Figure: dendrogram, complete linkage, autoscaled data; leaves labeled by class a to e; axis: similarity]

Autoscaled - centroid link

[Figure: dendrogram, centroid linkage, autoscaled data; leaves labeled by class a to e; axis: similarity]

For this example, a centroid link best reflects what we already know about the data.

Iris dataset

A pretty famous data set published by R.A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, 7, 179-188 (1936).

He used four physical properties of iris flowers (length and width of the sepal and petal) to see if they could be used to classify three different species.

• Species: I. setosa, I. versicolor, I. virginica
• Properties: petal width, petal length, sepal width, sepal length

150 samples - no missing values

HCA analysis was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.

[Figure: dendrogram, autoscaled data, centroidal linkage; leaves labeled by class 1 to 3; axis: dissimilarity]

One class is distinct but the other two overlap.

Iris dataset

So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.

[Figure: scatter plot of raw data; petal length vs. petal width]

Iris dataset

So there was useful information in the dataset.

• HCA - not a good tool. Reducing the four measurements into a single one actually makes the data worse.
• Autoscaling - had little or no effect. The actual numbers were all of a similar range.
• Moral - just because a method doesn’t work does not mean that there is no useful information.

Classification of Mycobacteria

Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids.

Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis.

C70-C90 α-branched β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.

Representative spectra, 47-99 m/e

[Figure: spectra for samples labeled Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya and Colombia]

Single linkage

[Figure: dendrogram; leaves: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia]

Complete linkage

[Figure: dendrogram; leaves: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia]

So what’s it good for?

This is a fast method of initial data exploration.

Try all of the options with both raw and scaled data.

The plots can be rapidly evaluated.

You can also use principal component data. This will be covered in the next unit.

When you get ready to go on to other methods of clustering, knowing the best methods for linkage will also be useful.
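The "scaled" data referred to throughout these examples is autoscaled. A minimal sketch of autoscaling follows, assuming it means subtracting each variable's mean and dividing by its sample standard deviation (n - 1 in the denominator). The function name and the small table are illustrative, and a constant column would cause a division by zero.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Scale each column to zero mean and unit sample standard deviation.

    rows is a list of samples, each a list of variable values.
    """
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in rows]

# Hypothetical data: two variables on very different scales.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = autoscale(raw)
print(scaled)
```

After scaling, both variables contribute equally to any distance computed from them, which is why raw and scaled dendrograms can differ.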

k-means clustering

• An iterative method where samples are initially partitioned into ‘k’ classes and a centroid is calculated for each class.
• Must use quantitative variables, but they can be raw, scaled or PCA based.
• The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters (if needed), and the process is repeated.
• Classification criteria can include within-class variance, pooled covariance matrix or total inertia matrix.
• The number of clusters and the assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution.
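The iterative procedure described above can be sketched as follows. This is a plain illustration and not XLStat's routine: it assumes Euclidean distances, initial centroids drawn at random from the samples, the within-class sum of squares as the criterion, and several random starts with the best result kept. All names and data are hypothetical.

```python
import random
from math import dist

def kmeans(points, k, n_starts=10, max_iter=100, seed=0):
    """Return (within-class SS, labels, centroids) of the best start."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(points, k)  # initial class centroids
        labels = None
        for _ in range(max_iter):
            # Assign every sample to its nearest centroid.
            new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
            if new_labels == labels:
                break  # memberships stopped changing
            labels = new_labels
            # Move each centroid to the mean of its members.
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:  # keep the old centroid if a class is empty
                    centroids[c] = tuple(sum(x) / len(members)
                                         for x in zip(*members))
        wss = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
        if best is None or wss < best[0]:
            best = (wss, labels, list(centroids))
    return best

# Two hypothetical, well-separated groups of three samples each.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
wss, labels, centroids = kmeans(pts, k=2, n_starts=20)
print(labels, round(wss, 3))
```

Keeping the start with the lowest within-class sum of squares is one simple way to reduce the dependence on the starting points.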

k-means clustering

Position initial class centroids → Test class memberships → Adjust centroids → Retest/repeat

Using XLStat

Classification criteria that can be minimized.

  • Trace. Minimize the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this is used.
  • Determinant. Minimize the determinant of the within-class covariance matrix. More appropriate to use with unscaled data but gives less homogeneous clusters.
  • Wilks’ lambda. Normalized version of the Determinant approach.
  • Trace/median. Centroid ends up being based on the median, not the mean as in the other approaches. Better when there is subclustering of data.

Using XLStat

  • XLStat’s version of HCA (Agglomerative Hierarchical Clustering - AHC) will do a k-means analysis, but only with the ‘trace’ method.
  • The k-means option provides more clustering control and is faster because no HCA is conducted.
  • However, AHC has an option to allow the routine to automatically set the number of ‘clusters’ that appear to exist.

Iris dataset (again)

[Figure: scatter plot of raw data; petal length vs. petal width]

So what’s it good for?

Can be used as a way to subdivide a dataset into related clusters.

Clusters are objectively determined based on similarities in multidimensional space.

While results can vary based on starting point, the effect can be minimized by using multiple starting points and repetitions.

Results are easier to see than with HCA.

k-means and HCA complement each other.