







This document gives an overview of the distance and similarity measures used in cluster analysis. The measures are grouped by the type of data they apply to: continuous, frequency count, and binary. Examples are given for each category, such as Euclidean distance, cosine similarity, and Jaccard similarity. The document also explains how these measures are used to assess the similarity or dissimilarity between clusters and how they are applied in different clustering methods.
The distance between two items, x and y, is the square root of the sum of the squared differences between the values for the items.
$$ \mathrm{EUCLID}(x, y) = \sqrt{\sum_i (x_i - y_i)^2} $$
The distance between two items is the sum of the squared differences between the values for the items.
$$ \mathrm{SEUCLID}(x, y) = \sum_i (x_i - y_i)^2 $$
This is a pattern similarity measure.
$$ \mathrm{CORRELATION}(x, y) = \frac{\sum_i Z_{xi} Z_{yi}}{N - 1} $$
where $Z_{xi}$ is the (standardized) Z-score value of x for the ith case or variable, and N is the number of cases or variables.
This is a pattern similarity measure.
$$ \mathrm{COSINE}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}} $$
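As an illustration of the two pattern similarity measures, here is a minimal Python sketch. It assumes that the Z-scores use the sample standard deviation (divisor N - 1), so that the sum of Z-score products divided by N - 1 equals the Pearson correlation. The function names are illustrative and are not PROXIMITIES keywords.

```python
import math

def correlation(x, y):
    """Sum of Z-score products divided by N - 1 (assumed convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divisor N - 1): an assumption of this sketch.
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    z_products = sum(
        ((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y)
    )
    return z_products / (n - 1)

def cosine(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
```

Both functions are undefined (division by zero) for a constant or all-zero profile; the sketch does not handle that case.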
The distance between two items is the maximum absolute difference between the values for the items.
$$ \mathrm{CHEBYCHEV}(x, y) = \max_i |x_i - y_i| $$
The distance between two items is the sum of the absolute differences between the values for the items.
$$ \mathrm{BLOCK}(x, y) = \sum_i |x_i - y_i| $$
MINKOWSKI(p)
The distance between two items is the pth root of the sum of the absolute differences to the pth power between the values for the items.
$$ \mathrm{MINKOWSKI}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p} $$
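The distance measures described above can be sketched in a few lines of Python. This is a minimal illustration that assumes x and y are equal-length sequences of numbers; the function names mirror the measure names but are otherwise hypothetical.

```python
import math

def seuclid(x, y):
    """Sum of squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def euclid(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(seuclid(x, y))

def chebychev(x, y):
    """Maximum absolute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def block(x, y):
    """Sum of absolute differences (city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p):
    """pth root of the sum of absolute differences raised to the pth power."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)
```

With p = 1 the Minkowski distance reduces to the block distance, and with p = 2 it reduces to the Euclidean distance.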
                   Item 2
                   Present   Absent
Item 1   Present      a         b
         Absent       c         d
PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallies across variables (when the items are cases) or tallies across cases (when the items are variables).
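To make the tallies concrete, the following Python sketch counts a, b, c, and d for two binary profiles. It assumes the usual reading of the 2 × 2 table (a = present on both items, b = present on item 1 only, c = present on item 2 only, d = absent on both); the function name and the `present` parameter are hypothetical.

```python
def binary_tallies(x, y, present=1):
    """Return (a, b, c, d) for two equal-length binary sequences.

    Any value other than `present` is treated as absent (an assumption
    of this sketch).
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == present and yi == present:
            a += 1  # present on both items
        elif xi == present:
            b += 1  # present on item 1 only
        elif yi == present:
            c += 1  # present on item 2 only
        else:
            d += 1  # absent on both items
    return a, b, c, d
```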
Russell and Rao Similarity Measure
This is the binary dot product.
$$ \mathrm{RR}(x, y) = \frac{a}{a + b + c + d} $$
Simple Matching Similarity Measure
This is the ratio of the number of matches to the total number of characteristics.
$$ \mathrm{SM}(x, y) = \frac{a + d}{a + b + c + d} $$
Jaccard Similarity Measure
This is also known as the similarity ratio.
$$ \mathrm{JACCARD}(x, y) = \frac{a}{a + b + c} $$
Dice or Czekanowski or Sorenson Similarity Measure
$$ \mathrm{DICE}(x, y) = \frac{2a}{2a + b + c} $$
Sokal and Sneath Similarity Measure 1
$$ \mathrm{SS1}(x, y) = \frac{2(a + d)}{2(a + d) + b + c} $$
Rogers and Tanimoto Similarity Measure
$$ \mathrm{RT}(x, y) = \frac{a + d}{a + d + 2(b + c)} $$
Sokal and Sneath Similarity Measure 2
$$ \mathrm{SS2}(x, y) = \frac{a}{a + 2(b + c)} $$
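The matching coefficients above differ only in whether joint absences (d) are counted and in how matches and nonmatches are weighted. A minimal Python sketch, assuming the tallies a, b, c, d are already available and that each denominator is nonzero:

```python
def rr(a, b, c, d):
    """Russell and Rao: joint presences over all characteristics."""
    return a / (a + b + c + d)

def sm(a, b, c, d):
    """Simple matching: all matches over all characteristics."""
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    """Jaccard: joint absences excluded."""
    return a / (a + b + c)

def dice(a, b, c, d):
    """Dice: joint absences excluded, joint presences double-weighted."""
    return 2 * a / (2 * a + b + c)

def ss1(a, b, c, d):
    """Sokal and Sneath 1: matches double-weighted."""
    return 2 * (a + d) / (2 * (a + d) + b + c)

def rt(a, b, c, d):
    """Rogers and Tanimoto: nonmatches double-weighted."""
    return (a + d) / (a + d + 2 * (b + c))

def ss2(a, b, c, d):
    """Sokal and Sneath 2: joint absences excluded, nonmatches double-weighted."""
    return a / (a + 2 * (b + c))
```

The unused `d` argument in some functions is kept only so that all measures share one call signature.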
Kulczynski Similarity Measure 1
This measure has a minimum value of 0 and no upper limit. It is undefined when there are no nonmatches ($b = 0$ and $c = 0$). Therefore, PROXIMITIES assigns an artificial upper limit of 9999.999 to K1 when it is undefined or exceeds this value.
$$ \mathrm{K1}(x, y) = \frac{a}{b + c} $$
Sokal and Sneath Similarity Measure 3
This measure has a minimum value of 0, has no upper limit, and is undefined when there are no nonmatches ($b = 0$ and $c = 0$). As with K1, PROXIMITIES assigns an artificial upper limit of 9999.999 to SS3 when it is undefined or exceeds this value.
$$ \mathrm{SS3}(x, y) = \frac{a + d}{b + c} $$
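The capping rule for K1 and SS3 can be sketched as follows. This is an illustrative Python version of the behavior described in the text, not the PROXIMITIES implementation; the helper name `_capped_ratio` is hypothetical.

```python
UPPER_LIMIT = 9999.999  # artificial upper limit stated in the text

def _capped_ratio(numerator, denominator):
    """Return numerator/denominator, or the cap when undefined or too large."""
    if denominator == 0:
        return UPPER_LIMIT  # undefined: no nonmatches (b = 0 and c = 0)
    return min(numerator / denominator, UPPER_LIMIT)

def k1(a, b, c, d):
    """Kulczynski 1: joint presences over nonmatches."""
    return _capped_ratio(a, b + c)

def ss3(a, b, c, d):
    """Sokal and Sneath 3: matches over nonmatches."""
    return _capped_ratio(a + d, b + c)
```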
The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.
Goodman and Kruskal Lambda (Similarity)
This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other item. Specifically, lambda measures the proportional reduction in error using one item to predict the other, when the directions of prediction are of equal importance. Lambda has a range of 0 to 1.
$$ t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d) $$
$$ t_2 = \max(a + c,\, b + d) + \max(a + b,\, c + d) $$
$$ \mathrm{LAMBDA}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d) - t_2} $$
Anderberg’s D (Similarity)
This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other. D measures the actual reduction in the error probability when one item is used to predict the other. The range of D is 0 to 1.
$$ t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d) $$
$$ t_2 = \max(a + c,\, b + d) + \max(a + b,\, c + d) $$
$$ \mathrm{D}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d)} $$
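Lambda and Anderberg's D share the same two intermediate quantities and differ only in the denominator. The Python sketch below assumes the standard definitions of t1 (sum of the row-wise and column-wise cell maxima) and t2 (sum of the larger column total and the larger row total); the function names are hypothetical.

```python
def _t1_t2(a, b, c, d):
    """Intermediate terms shared by lambda and Anderberg's D."""
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    t2 = max(a + c, b + d) + max(a + b, c + d)
    return t1, t2

def goodman_kruskal_lambda(a, b, c, d):
    """Proportional reduction in prediction error (symmetric lambda)."""
    t1, t2 = _t1_t2(a, b, c, d)
    # Not handled here: the denominator is zero when every case falls
    # in a single cell of the 2 x 2 table.
    return (t1 - t2) / (2 * (a + b + c + d) - t2)

def anderberg_d(a, b, c, d):
    """Actual reduction in error probability."""
    t1, t2 = _t1_t2(a, b, c, d)
    return (t1 - t2) / (2 * (a + b + c + d))
```

For a table with all four cells equal, knowing one item does not help predict the other, and both functions return 0.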
Yule’s Y Coefficient of Colligation (Similarity)
This is a function of the cross-product ratio for a 2 × 2 table. It has a range of –1 to +1.
$$ \mathrm{Y}(x, y) = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} $$
Yule’s Q (Similarity)
This is the 2 × 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-product ratio for a 2 × 2 table and has a range of –1 to +1.
$$ \mathrm{Q}(x, y) = \frac{ad - bc}{ad + bc} $$
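Both Yule coefficients depend on the table only through the products ad and bc. A minimal Python sketch, assuming ad + bc is nonzero (function names are illustrative):

```python
import math

def yule_y(a, b, c, d):
    """Yule's Y coefficient of colligation."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)

def yule_q(a, b, c, d):
    """Yule's Q, the 2 x 2 version of gamma."""
    return (a * d - b * c) / (a * d + b * c)
```

The two are linked by the identity Q = 2Y / (1 + Y²), which makes a convenient sanity check for an implementation.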
The remaining binary measures available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relation between items.
Ochiai Similarity Measure
This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity measure.
$$ \mathrm{OCHIAI}(x, y) = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}} $$
Sokal and Sneath Similarity Measure 5
This is a similarity measure. Its range is 0 to 1.
$$ \mathrm{SS5}(x, y) = \frac{ad}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} $$
Fourfold Point Correlation (Similarity)
This is the binary form of the Pearson product-moment correlation coefficient. Phi is a similarity measure, and its range is –1 to +1.
$$ \mathrm{PHI}(x, y) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} $$
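The three measures above can be sketched in Python as follows. The sketch assumes that all four marginal totals (a + b, a + c, b + d, c + d) are nonzero; function names are illustrative.

```python
import math

def ochiai(a, b, c, d):
    """Binary form of the cosine."""
    return math.sqrt((a / (a + b)) * (a / (a + c)))

def _marginal_root(a, b, c, d):
    """Square root of the product of the four marginal totals."""
    return math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

def ss5(a, b, c, d):
    """Sokal and Sneath 5."""
    return a * d / _marginal_root(a, b, c, d)

def phi(a, b, c, d):
    """Fourfold point correlation (binary Pearson correlation)."""
    return (a * d - b * c) / _marginal_root(a, b, c, d)
```

Because the numerator of phi is ad - bc, the value is negative whenever nonmatches dominate (bc > ad).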
Dispersion Similarity Measure
This similarity measure has a range of –1 to +1.
$$ \mathrm{DISPER}(x, y) = \frac{ad - bc}{(a + b + c + d)^2} $$
Variance Dissimilarity Measure
This dissimilarity measure has a minimum value of 0 and no upper limit.
$$ \mathrm{VARIANCE}(x, y) = \frac{b + c}{4(a + b + c + d)} $$
Binary Lance-and-Williams Nonmetric Dissimilarity Measure
Also known as the Bray-Curtis nonmetric coefficient, this dissimilarity measure has a range of 0 to 1.
$$ \mathrm{BLWMN}(x, y) = \frac{b + c}{2a + b + c} $$
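A short Python sketch of the last three binary measures follows. It assumes the commonly documented forms of these coefficients (dispersion as (ad - bc) over the squared total, variance as (b + c) over four times the total, and Lance-and-Williams as (b + c) over (2a + b + c)); function names are illustrative.

```python
def disper(a, b, c, d):
    """Dispersion similarity."""
    n = a + b + c + d
    return (a * d - b * c) / (n * n)

def variance(a, b, c, d):
    """Variance dissimilarity."""
    return (b + c) / (4 * (a + b + c + d))

def blwmn(a, b, c, d):
    """Binary Lance-and-Williams nonmetric dissimilarity."""
    return (b + c) / (2 * a + b + c)
```

Under these forms, BLWMN is the complement of the Dice measure: BLWMN = 1 - 2a / (2a + b + c).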
The following notation is used unless otherwise specified:
$S$: Matrix of similarity or dissimilarity measures
$s_{ij}$: Similarity or dissimilarity measure between cluster $i$ and cluster $j$
$N_i$: Number of cases in cluster $i$
Begin with N clusters each containing one case. Denote the clusters 1 through N.
Before the first merge, let $N_i = 1$ for $i = 1$ to $N$. Update $s_{tr}$ by
$$ s_{tr} = s_{pr} + s_{qr} $$
Update $N_t$ by
$$ N_t = N_p + N_q $$
and then choose the most similar pair based on the value
$$ \frac{s_{ij}}{N_i N_j} $$
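The update above keeps running sums of pairwise proximities and divides by the cluster sizes only when comparing pairs, which corresponds to average linkage between groups. The Python sketch below is a minimal illustration under stated assumptions: the input is a symmetric dissimilarity matrix (so the most similar pair is the one with the smallest average), p and q denote the merged clusters, r any other cluster, and the merged cluster t reuses the label p. The function name and return format are hypothetical.

```python
def average_linkage_between_groups(dist):
    """Cluster a symmetric dissimilarity matrix (list of lists).

    Returns a list of (p, q, average_dissimilarity) tuples, one per merge.
    """
    n = len(dist)
    # s[i][j] is the running SUM of pairwise dissimilarities between i and j.
    s = {i: {j: dist[i][j] for j in range(n) if j != i} for i in range(n)}
    size = {i: 1 for i in range(n)}  # N_i = 1 before the first merge
    merges = []
    while len(s) > 1:
        # Choose the most similar pair based on s_ij / (N_i * N_j).
        p, q = min(
            ((i, j) for i in s for j in s[i] if i < j),
            key=lambda ij: s[ij[0]][ij[1]] / (size[ij[0]] * size[ij[1]]),
        )
        merges.append((p, q, s[p][q] / (size[p] * size[q])))
        # Update s_tr = s_pr + s_qr for every remaining cluster r.
        for r in s:
            if r != p and r != q:
                s[p][r] = s[r][p] = s[p][r] + s[q][r]
                del s[r][q]
        del s[q]
        del s[p][q]
        size[p] += size.pop(q)  # N_t = N_p + N_q
    return merges
```

For a similarity matrix, the same sketch would select the pair with the largest average instead of the smallest.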
Update $s_{tr}$ by
$$ s_{tr} = \frac{N_p}{N_p + N_q}\, s_{pr} + \frac{N_q}{N_p + N_q}\, s_{qr} - \frac{N_p N_q}{(N_p + N_q)^2}\, s_{pq} $$
Update $s_{tr}$ by
$$ s_{tr} = \frac{s_{pr} + s_{qr}}{2} - \frac{s_{pq}}{4} $$
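The size-weighted update and the equal-weight update above have the form of the centroid and median recurrences, respectively. The Python sketch below assumes that reading, and also assumes that the proximities are squared Euclidean distances, where p and q are the merged clusters and r is any other cluster. Function and parameter names are hypothetical.

```python
def centroid_update(s_pr, s_qr, s_pq, n_p, n_q):
    """Size-weighted update: distance from the merged centroid to cluster r."""
    n_t = n_p + n_q
    return (
        (n_p / n_t) * s_pr
        + (n_q / n_t) * s_qr
        - (n_p * n_q / n_t ** 2) * s_pq
    )

def median_update(s_pr, s_qr, s_pq):
    """Equal-weight update: both merged clusters count the same."""
    return (s_pr + s_qr) / 2 - s_pq / 4
```

As a check, take points at 0, 2, and 5 on a line. Merging the first two gives a centroid at 1, whose squared distance to 5 is 16, and both updates reproduce that value from the squared distances 25, 9, and 4. With equal cluster sizes the two updates coincide.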
Update $s_{tr}$ by
$$ s_{tr} = \frac{1}{N_t + N_r} \left[ (N_r + N_p)\, s_{rp} + (N_r + N_q)\, s_{rq} - N_r\, s_{pq} \right] $$
Update the coefficient $W$ by
$$ W = W + 0.5\, s_{pq} $$
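The Ward update and the running coefficient W can be checked on a small example. The Python sketch below assumes squared Euclidean distances as the initial proximities, with p and q the merged clusters, t the merged cluster, and r any other cluster; the function name is hypothetical.

```python
def ward_update(s_rp, s_rq, s_pq, n_p, n_q, n_r):
    """Ward-style update of the proximity between merged cluster t and r."""
    n_t = n_p + n_q
    return (
        (n_r + n_p) * s_rp + (n_r + n_q) * s_rq - n_r * s_pq
    ) / (n_t + n_r)

# Worked example: points at 0, 2, and 5 on a line (squared distances).
w = 0.0
s_pq, s_pr, s_qr = 4.0, 25.0, 9.0

w += 0.5 * s_pq                      # merge the points at 0 and 2
s_tr = ward_update(s_pr, s_qr, s_pq, n_p=1, n_q=1, n_r=1)

w += 0.5 * s_tr                      # merge the resulting pair with the point at 5
```

After both merges, `w` equals the within-cluster sum of squares of the three points around their mean (114/9, about 12.67), which illustrates why the coefficient reported for this method is a sum of squares rather than a joining distance.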
Note that for Ward’s method, the coefficient given in the agglomeration schedule is really the within-cluster sum of squares at that step. For all other methods, this coefficient represents the distance at which the clusters p and q were joined.
Anderberg, M. R. 1973. Cluster analysis for applications. New York: Academic Press.