Cluster Measures: Distance and Similarity Measures for Different Data Types, Study notes of Mathematical Statistics

These notes give an overview of distance and similarity measures used in cluster analysis. The measures are grouped by data type: continuous, frequency count, and binary. Examples include Euclidean distance, cosine similarity, and Jaccard similarity. The notes also explain how these measures are used to assess the similarity or dissimilarity between clusters and how they apply to different clustering methods.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
Cluster Measures

Measures for Continuous Data

EUCLID

The distance between two items, x and y, is the square root of the sum of the squared differences between the values for the items.

$$\mathrm{EUCLID}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

SEUCLID

The distance between two items is the sum of the squared differences between the values for the items.

$$\mathrm{SEUCLID}(x, y) = \sum_i (x_i - y_i)^2$$

CORRELATION

This is a pattern similarity measure.

$$\mathrm{CORRELATION}(x, y) = \frac{\sum_i Z_{xi} Z_{yi}}{N}$$

where $Z_{xi}$ is the (standardized) Z-score value of $x$ for the $i$th case or variable, and $N$ is the number of cases or variables.

COSINE

This is a pattern similarity measure.

$$\mathrm{COSINE}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}}$$
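The two pattern similarity measures above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and because the notes do not say which standard deviation the Z-scores use, the code assumes the population form (divisor N), under which the result coincides with the Pearson correlation.

```python
import math


def correlation(x, y):
    """Mean product of Z-scores; assumes Z-scores use the population SD (divisor N)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    z_x = [(v - mean_x) / sd_x for v in x]
    z_y = [(v - mean_y) / sd_y for v in y]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / n


def cosine(x, y):
    """Sum of cross-products divided by the root of the product of sums of squares."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
```

Both measures respond to the pattern of the values rather than their magnitude: for `x = [1, 2, 3]` and `y = [2, 4, 6]` each returns a value of 1 up to floating-point rounding.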

CHEBYCHEV

The distance between two items is the maximum absolute difference between the values for the items.

$$\mathrm{CHEBYCHEV}(x, y) = \max_i |x_i - y_i|$$

BLOCK

The distance between two items is the sum of the absolute differences between the values for the items.

$$\mathrm{BLOCK}(x, y) = \sum_i |x_i - y_i|$$

MINKOWSKI(p)

The distance between two items is the pth root of the sum of the absolute differences to the pth power between the values for the items.

$$\mathrm{MINKOWSKI}(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$$
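As a quick check on the distance definitions for continuous data, the sketch below implements them directly from the verbal descriptions. The function names mirror the keywords in the notes but are otherwise hypothetical, and the sample vectors are made up for illustration.

```python
import math


def seuclid(x, y):
    """Sum of squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def euclid(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(seuclid(x, y))


def chebychev(x, y):
    """Maximum absolute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))


def block(x, y):
    """Sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def minkowski(x, y, p):
    """pth root of the sum of absolute differences raised to the pth power."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)


x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(seuclid(x, y))       # 25.0
print(euclid(x, y))        # 5.0
print(chebychev(x, y))     # 4.0
print(block(x, y))         # 7.0
print(minkowski(x, y, 2))  # 5.0
```

With p = 2 the Minkowski distance reproduces EUCLID, and with p = 1 it reproduces BLOCK.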

                    Item 2
                    Present    Absent
Item 1   Present    a          b
         Absent     c          d

PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallies across variables (when the items are cases) or tallies across cases (when the items are variables).
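A minimal sketch of how the four tallies might be counted for two binary vectors is shown below. The function name `binary_tallies` is hypothetical, and the code assumes that any nonzero (truthy) value means "present", that `x` plays the role of Item 1 (rows), and that `y` plays the role of Item 2 (columns).

```python
def binary_tallies(x, y):
    """Return (a, b, c, d) for two equal-length binary sequences.

    a: present in both, b: present in x only,
    c: present in y only, d: absent in both.
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


print(binary_tallies([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (2, 1, 1, 1)
```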

Russell and Rao Similarity Measure

This is the binary dot product.

$$\mathrm{RR}(x, y) = \frac{a}{a + b + c + d}$$

Simple Matching Similarity Measure

This is the ratio of the number of matches to the total number of characteristics.

$$\mathrm{SM}(x, y) = \frac{a + d}{a + b + c + d}$$

Jaccard Similarity Measure

This is also known as the similarity ratio.

$$\mathrm{JACCARD}(x, y) = \frac{a}{a + b + c}$$

Dice or Czekanowski or Sorenson Similarity Measure

$$\mathrm{DICE}(x, y) = \frac{2a}{2a + b + c}$$

Sokal and Sneath Similarity Measure 1

$$\mathrm{SS1}(x, y) = \frac{2(a + d)}{2(a + d) + b + c}$$

Rogers and Tanimoto Similarity Measure

$$\mathrm{RT}(x, y) = \frac{a + d}{a + d + 2(b + c)}$$

Sokal and Sneath Similarity Measure 2

$$\mathrm{SS2}(x, y) = \frac{a}{a + 2(b + c)}$$
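The matching coefficients listed so far are all simple ratios of the tallies a, b, c, and d, so they can be sketched as one-line functions. The function names are hypothetical; the factors of 2 in DICE, SS1, RT, and SS2 follow the standard definitions of these coefficients. The sketch does not guard against zero denominators (for example, JACCARD when a = b = c = 0).

```python
def rr(a, b, c, d):
    """Russell and Rao: joint presences over all characteristics."""
    return a / (a + b + c + d)


def sm(a, b, c, d):
    """Simple matching: matches over all characteristics."""
    return (a + d) / (a + b + c + d)


def jaccard(a, b, c, d):
    """Jaccard: joint absences are excluded."""
    return a / (a + b + c)


def dice(a, b, c, d):
    """Dice: joint absences excluded, joint presences weighted double."""
    return 2 * a / (2 * a + b + c)


def ss1(a, b, c, d):
    """Sokal and Sneath 1: matches weighted double."""
    return 2 * (a + d) / (2 * (a + d) + b + c)


def rt(a, b, c, d):
    """Rogers and Tanimoto: nonmatches weighted double."""
    return (a + d) / (a + d + 2 * (b + c))


def ss2(a, b, c, d):
    """Sokal and Sneath 2: joint absences excluded, nonmatches weighted double."""
    return a / (a + 2 * (b + c))


tallies = (2, 1, 1, 1)
print(rr(*tallies), sm(*tallies), jaccard(*tallies))  # 0.4 0.6 0.5
```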

Kulczynski Similarity Measure 1

This measure has a minimum value of 0 and no upper limit. It is undefined when there are no nonmatches ($b = 0$ and $c = 0$). Therefore, PROXIMITIES assigns an artificial upper limit of 9999.999 to K1 when it is undefined or exceeds this value.

$$\mathrm{K1}(x, y) = \frac{a}{b + c}$$

Sokal and Sneath Similarity Measure 3

This measure has a minimum value of 0, has no upper limit, and is undefined when there are no nonmatches ($b = 0$ and $c = 0$). As with K1, PROXIMITIES assigns an artificial upper limit of 9999.999 to SS3 when it is undefined or exceeds this value.

$$\mathrm{SS3}(x, y) = \frac{a + d}{b + c}$$
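The capping rule described for K1 and SS3 can be sketched as follows. This only illustrates the rule as stated in the notes, not the actual PROXIMITIES implementation; the helper `_capped_ratio` and the constant name `UPPER_LIMIT` are hypothetical.

```python
UPPER_LIMIT = 9999.999  # artificial upper limit quoted in the notes


def _capped_ratio(numerator, denominator):
    """Return numerator/denominator, or the cap when undefined or too large."""
    if denominator == 0:
        return UPPER_LIMIT
    return min(numerator / denominator, UPPER_LIMIT)


def k1(a, b, c, d):
    """Kulczynski 1: joint presences over nonmatches."""
    return _capped_ratio(a, b + c)


def ss3(a, b, c, d):
    """Sokal and Sneath 3: matches over nonmatches."""
    return _capped_ratio(a + d, b + c)


print(k1(2, 1, 1, 1))   # 1.0
print(ss3(2, 1, 1, 1))  # 1.5
print(k1(3, 0, 0, 2))   # 9999.999
```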

Predictability Measures

The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.

Goodman and Kruskal Lambda (Similarity)

This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other item. Specifically, lambda measures the proportional reduction in error using one item to predict the other, when the directions of prediction are of equal importance. Lambda has a range of 0 to 1.

$$t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d)$$

$$t_2 = \max(a + c, b + d) + \max(a + b, c + d)$$

$$\mathrm{LAMBDA}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d) - t_2}$$

Anderberg’s D (Similarity)

This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other. D measures the actual reduction in the error probability when one item is used to predict the other. The range of D is 0 to 1.

$$t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d)$$

$$t_2 = \max(a + c, b + d) + \max(a + b, c + d)$$

$$\mathrm{D}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d)}$$
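Lambda and Anderberg's D share the same two intermediate sums and differ only in the denominator, which the sketch below makes explicit. The function names are hypothetical, and lambda is left unguarded for the degenerate case where its denominator is zero.

```python
def _t1_t2(a, b, c, d):
    """Intermediate sums shared by lambda and Anderberg's D."""
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    t2 = max(a + c, b + d) + max(a + b, c + d)
    return t1, t2


def goodman_kruskal_lambda(a, b, c, d):
    """Proportional reduction in prediction error (symmetric form)."""
    t1, t2 = _t1_t2(a, b, c, d)
    return (t1 - t2) / (2 * (a + b + c + d) - t2)


def anderberg_d(a, b, c, d):
    """Actual reduction in the error probability."""
    t1, t2 = _t1_t2(a, b, c, d)
    return (t1 - t2) / (2 * (a + b + c + d))


# With a = 4, b = 1, c = 1, d = 4: t1 = 16 and t2 = 10.
print(goodman_kruskal_lambda(4, 1, 1, 4))  # 0.6
print(anderberg_d(4, 1, 1, 4))             # 0.3
```

When knowing one item does not help predict the other, as with the tallies (2, 1, 1, 1), both measures are 0.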

Yule’s Y Coefficient of Colligation (Similarity)

This is a function of the cross-product ratio for a 2 × 2 table. It has a range of –1 to +1.

$$\mathrm{Y}(x, y) = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

Yule’s Q (Similarity)

This is the 2 × 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-product ratio for a 2 × 2 table and has a range of –1 to +1.

$$\mathrm{Q}(x, y) = \frac{ad - bc}{ad + bc}$$
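Both Yule coefficients depend on the tallies only through the cross-products ad and bc. A small sketch, with hypothetical function names and no guard for the case ad = bc = 0, shows the sign behaviour at the two ends of the range:

```python
import math


def yule_y(a, b, c, d):
    """Yule's Y: built from the square roots of the cross-products."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


def yule_q(a, b, c, d):
    """Yule's Q: built from the cross-products themselves."""
    return (a * d - b * c) / (a * d + b * c)


print(yule_y(4, 1, 1, 4))  # 0.6
print(yule_y(1, 4, 4, 1))  # -0.6
```

Swapping the roles of matches and nonmatches flips the sign of both coefficients, and either one reaches +1 when bc = 0 and -1 when ad = 0.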

Other Binary Measures

The remaining binary measures available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relation between items.

Ochiai Similarity Measure

This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity measure.

$$\mathrm{OCHIAI}(x, y) = \sqrt{\left(\frac{a}{a + b}\right)\left(\frac{a}{a + c}\right)}$$

Sokal and Sneath Similarity Measure 5

This is a similarity measure. Its range is 0 to 1.

$$\mathrm{SS5}(x, y) = \frac{ad}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

Fourfold Point Correlation (Similarity)

This is the binary form of the Pearson product-moment correlation coefficient. Phi is a similarity measure, and its range is –1 to +1, because the numerator $ad - bc$ can be negative.

$$\mathrm{PHI}(x, y) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

Dispersion Similarity Measure

This similarity measure has a range of –1 to +1.

$$\mathrm{DISPER}(x, y) = \frac{ad - bc}{(a + b + c + d)^2}$$

Variance Dissimilarity Measure

This dissimilarity measure has a minimum value of 0 and no upper limit.

$$\mathrm{VARIANCE}(x, y) = \frac{b + c}{4(a + b + c + d)}$$

Binary Lance-and-Williams Nonmetric Dissimilarity Measure

Also known as the Bray-Curtis nonmetric coefficient, this dissimilarity measure has a range of 0 to 1.

$$\mathrm{BLWMN}(x, y) = \frac{b + c}{2a + b + c}$$
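The remaining binary measures can be sketched in the same style. The function names are hypothetical, the formulas follow the standard definitions of these coefficients, and the code does not guard against zero marginal totals, which would make OCHIAI, SS5, and PHI undefined.

```python
import math


def ochiai(a, b, c, d):
    """Binary form of the cosine."""
    return math.sqrt((a / (a + b)) * (a / (a + c)))


def _marginal_root(a, b, c, d):
    """Square root of the product of the four marginal totals."""
    return math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def ss5(a, b, c, d):
    """Sokal and Sneath 5."""
    return a * d / _marginal_root(a, b, c, d)


def phi(a, b, c, d):
    """Fourfold point correlation."""
    return (a * d - b * c) / _marginal_root(a, b, c, d)


def disper(a, b, c, d):
    """Dispersion similarity."""
    return (a * d - b * c) / (a + b + c + d) ** 2


def variance(a, b, c, d):
    """Variance dissimilarity."""
    return (b + c) / (4 * (a + b + c + d))


def blwmn(a, b, c, d):
    """Binary Lance-and-Williams (Bray-Curtis) nonmetric dissimilarity."""
    return (b + c) / (2 * a + b + c)


print(variance(4, 1, 1, 4))  # 0.05
print(blwmn(4, 1, 1, 4))     # 0.2
```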

Clustering Methods

Notation

The following notation is used unless otherwise specified:

$S$: Matrix of similarity or dissimilarity measures

$s_{ij}$: Similarity or dissimilarity measure between cluster $i$ and cluster $j$

$N_i$: Number of cases in cluster $i$

General Procedure

Begin with N clusters each containing one case. Denote the clusters 1 through N.

  • Find the most similar pair of clusters $p$ and $q$ ($p > q$). Denote this similarity $s_{pq}$. If a dissimilarity measure is used, large values indicate dissimilarity. If a similarity measure is used, small values indicate dissimilarity.
  • Reduce the number of clusters by one through merger of clusters $p$ and $q$. Label the new cluster $t$ ($= q$) and update the similarity matrix (by the method specified) to reflect revised similarities or dissimilarities between cluster $t$ and all other clusters. Delete the row and column of $S$ corresponding to cluster $p$.
  • Perform the previous two steps until all entities are in one cluster.

For each of the following methods, the similarity or dissimilarity matrix $S$ is updated to reflect revised similarities or dissimilarities ($s_{tr}$) between the new cluster $t$ and all other clusters $r$ as given below.
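The general procedure can be sketched as a short loop. This is a simplified illustration, not the actual CLUSTER implementation: it assumes a dissimilarity matrix (so the most similar pair has the smallest value), the names `agglomerate` and `median_update` are hypothetical, and the update rule passed in only sees the three values involved, which is enough for the Median Method described later in these notes but not for methods that need cluster sizes.

```python
def median_update(s_pr, s_qr, s_pq):
    """Median-method update for the dissimilarity between t and r."""
    return (s_pr + s_qr) / 2 - s_pq / 4


def agglomerate(dissimilarity, update):
    """Return the merge schedule [(q, p, s_pq), ...] for a dissimilarity matrix."""
    s = [list(row) for row in dissimilarity]  # work on a copy of S
    active = list(range(len(s)))              # clusters still in play
    schedule = []
    while len(active) > 1:
        # Most similar pair (smallest dissimilarity) with p > q.
        p, q = min(((i, j) for i in active for j in active if i > j),
                   key=lambda pair: s[pair[0]][pair[1]])
        s_pq = s[p][q]
        schedule.append((q, p, s_pq))
        # Merge p into q (new cluster t = q) and update s_tr for every other r.
        for r in active:
            if r != p and r != q:
                s[q][r] = s[r][q] = update(s[p][r], s[q][r], s_pq)
        active.remove(p)  # stands in for deleting row and column p of S
    return schedule


# Squared Euclidean distances between the points 0, 1 and 5 on a line.
d = [[0, 1, 25],
     [1, 0, 16],
     [25, 16, 0]]
print(agglomerate(d, median_update))  # [(0, 1, 1), (0, 2, 20.25)]
```

In the example, the points 0 and 1 merge first at distance 1; the merged cluster then joins the point 5 at 20.25, which is the squared distance from the midpoint 0.5 to 5. For a similarity matrix, `min` would be replaced by `max`.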

Average Linkage between Groups

Before the first merge, let $N_i = 1$ for $i = 1$ to $N$. Update $s_{tr}$ by

$$s_{tr} = s_{pr} + s_{qr}$$

Update $N_t$ by

$$N_t = N_p + N_q$$

and then choose the most similar pair based on the value

$$\frac{s_{ij}}{N_i N_j}$$
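One way to read this update is that the matrix stores running sums of pairwise measures, and the division by the product of the cluster sizes turns a sum into an average only at the moment a pair is selected. A minimal sketch, with hypothetical function names and made-up distances:

```python
def merge_sums(s_pr, s_qr):
    """Running sum of pairwise measures between the merged cluster t and r."""
    return s_pr + s_qr


def average_between(s_ij, n_i, n_j):
    """Value used to choose the most similar pair."""
    return s_ij / (n_i * n_j)


# Clusters p and q each hold one case; their distances to case r are 16 and 25.
s_tr = merge_sums(16.0, 25.0)         # 41.0
n_t = 1 + 1                           # N_t = N_p + N_q
print(average_between(s_tr, n_t, 1))  # 20.5
```

The result, 20.5, is the mean of the two pairwise distances between the cases in t and the case in r.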

Centroid Method

Update str by

$$s_{tr} = \frac{N_p}{N_p + N_q}\, s_{pr} + \frac{N_q}{N_p + N_q}\, s_{qr} - \frac{N_p N_q}{(N_p + N_q)^2}\, s_{pq}$$

Median Method

Update str by

$$s_{tr} = \frac{s_{pr} + s_{qr}}{2} - \frac{s_{pq}}{4}$$
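The centroid and median updates can be checked numerically on points along a line. The sketch below assumes the matrix holds squared Euclidean distances, the usual setting for these two methods; the function names and the sample points are hypothetical.

```python
def centroid_update(s_pr, s_qr, s_pq, n_p, n_q):
    """Centroid-method update, weighting p and q by their cluster sizes."""
    n_t = n_p + n_q
    return ((n_p / n_t) * s_pr + (n_q / n_t) * s_qr
            - (n_p * n_q / n_t ** 2) * s_pq)


def median_update(s_pr, s_qr, s_pq):
    """Median-method update, treating p and q as equally weighted."""
    return (s_pr + s_qr) / 2 - s_pq / 4


# p = {0, 2} (centroid 1, two cases), q = {7}, r = {10}.
# Squared distances between centroids: s_pr = 81, s_qr = 9, s_pq = 36.
print(centroid_update(81.0, 9.0, 36.0, 2, 1))
print(median_update(81.0, 9.0, 36.0))  # 36.0
```

The merged cluster {0, 2, 7} has centroid 3, so the centroid update should return a value very close to 49, the squared distance from 3 to 10. The median update ignores the cluster sizes, which is why it gives 36, the squared distance from the midpoint of the two centroids (4) to 10. When both clusters have the same size, the two updates agree.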

Ward’s Method

Update str by

$$s_{tr} = \frac{1}{N_t + N_r}\left[(N_r + N_p)\, s_{rp} + (N_r + N_q)\, s_{rq} - N_r\, s_{pq}\right]$$

Update the coefficient W by

$$W = W + 0.5\, s_{pq}$$

Note that for Ward’s method, the coefficient given in the agglomeration schedule is really the within-cluster sum of squares at that step. For all other methods, this coefficient represents the distance at which the clusters p and q were joined.
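The claim that W tracks the within-cluster sum of squares can be verified on a tiny example. The sketch assumes the initial matrix holds squared Euclidean distances; the function name `ward_update` and the three sample points (0, 1, and 5 on a line) are hypothetical.

```python
def ward_update(s_pr, s_qr, s_pq, n_p, n_q, n_r):
    """Ward's-method update for the dissimilarity between t and r."""
    n_t = n_p + n_q
    return ((n_r + n_p) * s_pr + (n_r + n_q) * s_qr - n_r * s_pq) / (n_t + n_r)


# Squared distances between the points 0, 1 and 5: 1, 16 and 25.
w = 0.0

# Step 1: merge the points 0 and 1 (s_pq = 1).
s_pq = 1.0
w += 0.5 * s_pq                                # W = 0.5
s_tr = ward_update(16.0, 25.0, s_pq, 1, 1, 1)  # 27.0

# Step 2: merge the cluster {0, 1} with the point 5 (s_pq = 27).
w += 0.5 * s_tr
print(w)  # 14.0
```

After the first merge, W = 0.5 is the sum of squares of {0, 1} about its mean 0.5. After the second, W = 14 is the sum of squares of {0, 1, 5} about its mean 2 (4 + 1 + 9).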

References

Anderberg, M. R. 1973. Cluster analysis for applications. New York: Academic Press.