Cluster Measures: Distance and Similarity Measures for Different Data Types, Study notes of Mathematical Statistics

Alliance University Mathematical Statistics

An overview of the distance and similarity measures used in cluster analysis. The measures are grouped by data type: continuous, frequency count, and binary. Examples include Euclidean distance, cosine similarity, and Jaccard similarity. The document also explains how these measures quantify the similarity or dissimilarity between clusters and how different clustering methods update them.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar 🇮🇳



Cluster Measures

Measures for Continuous Data

EUCLID

The distance between two items, x and y, is the square root of the sum of the squared differences between the values for the items.

$$\mathrm{EUCLID}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

SEUCLID

The distance between two items is the sum of the squared differences between the values for the items.

$$\mathrm{SEUCLID}(x, y) = \sum_i (x_i - y_i)^2$$
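As a quick check of the two definitions above, the following Python sketch computes both measures for a pair of made-up vectors. The function names `seuclid` and `euclid` simply mirror the keywords; they are illustrative and not part of any library.

```python
import math

def seuclid(x, y):
    """Squared Euclidean distance: sum of squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def euclid(x, y):
    """Euclidean distance: square root of the squared Euclidean distance."""
    return math.sqrt(seuclid(x, y))

x = [1.0, 2.0, 3.0]   # made-up example values
y = [4.0, 6.0, 3.0]
print(seuclid(x, y))  # 9 + 16 + 0 = 25.0
print(euclid(x, y))   # 5.0
```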

CORRELATION

This is a pattern similarity measure.

$$\mathrm{CORRELATION}(x, y) = \frac{\sum_i Z_{xi} Z_{yi}}{N}$$

where $Z_{xi}$ is the (standardized) Z-score value of x for the ith case or variable, and N is the number of cases or variables.
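A minimal sketch of the correlation measure follows. It assumes that the Z-scores are standardized with the population standard deviation (divisor N); the notes do not state which divisor is used, and that choice is what makes the average of the Z-score products equal the usual Pearson coefficient. The helper name `z_scores` is hypothetical.

```python
import math

def z_scores(v):
    """Standardize v. Assumption: population standard deviation (divisor N)."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((vi - mean) ** 2 for vi in v) / n)
    return [(vi - mean) / sd for vi in v]

def correlation(x, y):
    """Sum of the Z-score products divided by N."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# Same pattern at a different scale gives a value of (approximately) 1.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
# Reversed pattern gives a value of (approximately) -1.
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```

Constant vectors have a standard deviation of zero and are not handled in this sketch.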

COSINE

This is a pattern similarity measure.

$$\mathrm{COSINE}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}}$$
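The sketch below illustrates why the cosine is called a pattern similarity measure: rescaling one vector leaves the value unchanged. The function name `cosine` and the example vectors are illustrative only.

```python
import math

def cosine(x, y):
    """Sum of cross-products divided by the root of the product of the sums of squares."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    return dot / math.sqrt(sum_x2 * sum_y2)

print(cosine([1, 2], [2, 4]))  # same direction, different scale: 1.0
print(cosine([1, 0], [0, 1]))  # orthogonal vectors: 0.0
```

A vector of all zeros makes the denominator zero; this sketch does not handle that case.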

CHEBYCHEV

The distance between two items is the maximum absolute difference between the values for the items.

$$\mathrm{CHEBYCHEV}(x, y) = \max_i |x_i - y_i|$$

BLOCK

The distance between two items is the sum of the absolute differences between the values for the items.

$$\mathrm{BLOCK}(x, y) = \sum_i |x_i - y_i|$$

MINKOWSKI( p )

The distance between two items is the pth root of the sum of the absolute differences to the pth power between the values for the items.

$$\mathrm{MINKOWSKI}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$$
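The three measures above are closely related: the Minkowski distance with p = 1 is the block distance, with p = 2 it is the Euclidean distance, and for large p it approaches the Chebychev distance. A short sketch with made-up vectors (function names are illustrative):

```python
def chebychev(x, y):
    """Maximum absolute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def block(x, y):
    """Sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p):
    """pth root of the sum of absolute differences raised to the pth power."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(block(x, y), minkowski(x, y, 1))  # 7.0 7.0
print(minkowski(x, y, 2))               # 5.0, the Euclidean distance
print(chebychev(x, y))                  # 4.0
print(minkowski(x, y, 50))              # very close to the Chebychev value
```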

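Measures for Binary Data

The binary measures are based on the following 2 × 2 table of presence and absence for two items:

                     Item 2
                     Present   Absent
Item 1    Present    a         b
          Absent     c         d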

PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallies across variables (when the items are cases) or tallies across cases (when the items are variables).
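The tallies can be sketched as follows for two items coded as 0/1 sequences. This is only an illustration of the counting rule, not the PROXIMITIES implementation; the function name `tally` is hypothetical.

```python
def tally(item1, item2):
    """Return (a, b, c, d) for two equal-length 0/1 sequences."""
    a = b = c = d = 0
    for u, v in zip(item1, item2):
        if u and v:
            a += 1      # present in both items
        elif u:
            b += 1      # present in item 1 only
        elif v:
            c += 1      # present in item 2 only
        else:
            d += 1      # absent in both items
    return a, b, c, d

print(tally([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (2, 1, 1, 1)
```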

Russell and Rao Similarity Measure

This is the binary dot product.

$$\mathrm{RR}(x, y) = \frac{a}{a + b + c + d}$$

Simple Matching Similarity Measure

This is the ratio of the number of matches to the total number of characteristics.

$$\mathrm{SM}(x, y) = \frac{a + d}{a + b + c + d}$$

Jaccard Similarity Measure

This is also known as the similarity ratio.

$$\mathrm{JACCARD}(x, y) = \frac{a}{a + b + c}$$

Dice or Czekanowski or Sorensen Similarity Measure

$$\mathrm{DICE}(x, y) = \frac{2a}{2a + b + c}$$

Sokal and Sneath Similarity Measure 1

$$\mathrm{SS1}(x, y) = \frac{2(a + d)}{2(a + d) + b + c}$$

Rogers and Tanimoto Similarity Measure

$$\mathrm{RT}(x, y) = \frac{a + d}{a + d + 2(b + c)}$$

Sokal and Sneath Similarity Measure 2

$$\mathrm{SS2}(x, y) = \frac{a}{a + 2(b + c)}$$
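The seven matching coefficients listed so far differ only in whether joint absences (d) count as matches and in how heavily matches or nonmatches are weighted. The sketch below evaluates them on one made-up table (a = 2, b = 1, c = 1, d = 1), using the standard forms of the coefficients; the function names are illustrative.

```python
def rr(a, b, c, d):      return a / (a + b + c + d)                  # Russell and Rao
def sm(a, b, c, d):      return (a + d) / (a + b + c + d)            # simple matching
def jaccard(a, b, c, d): return a / (a + b + c)                      # joint absences ignored
def dice(a, b, c, d):    return 2 * a / (2 * a + b + c)              # matches weighted double
def ss1(a, b, c, d):     return 2 * (a + d) / (2 * (a + d) + b + c)  # matches weighted double
def rt(a, b, c, d):      return (a + d) / (a + d + 2 * (b + c))      # nonmatches weighted double
def ss2(a, b, c, d):     return a / (a + 2 * (b + c))                # nonmatches weighted double

table = (2, 1, 1, 1)  # made-up tallies a, b, c, d
for f in (rr, sm, jaccard, dice, ss1, rt, ss2):
    print(f.__name__, round(f(*table), 3))
```

On this table the values are 0.4, 0.6, 0.5, 0.667, 0.75, 0.429, and 0.333, which shows how much the choice of coefficient can change the apparent similarity of the same pair of items.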

Kulczynski Similarity Measure 1

This measure has a minimum value of 0 and no upper limit. It is undefined when there are no nonmatches (b = 0 and c = 0). Therefore, PROXIMITIES assigns an artificial upper limit of 9999.999 to K1 when it is undefined or exceeds this value.

$$\mathrm{K1}(x, y) = \frac{a}{b + c}$$

Sokal and Sneath Similarity Measure 3

This measure has a minimum value of 0, has no upper limit, and is undefined when there are no nonmatches (b = 0 and c = 0). As with K1, PROXIMITIES assigns an artificial upper limit of 9999.999 to SS3 when it is undefined or exceeds this value.

$$\mathrm{SS3}(x, y) = \frac{a + d}{b + c}$$
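The capping rule described for K1 and SS3 can be sketched as below. The constant name `CAP` and the function names are hypothetical; only the limit 9999.999 and the two ratios come from the text.

```python
CAP = 9999.999  # artificial upper limit stated in the text

def k1(a, b, c, d):
    """Kulczynski 1: a / (b + c), capped when undefined or too large."""
    if b + c == 0:
        return CAP
    return min(a / (b + c), CAP)

def ss3(a, b, c, d):
    """Sokal and Sneath 3: (a + d) / (b + c), capped when undefined or too large."""
    if b + c == 0:
        return CAP
    return min((a + d) / (b + c), CAP)

print(k1(2, 1, 1, 1))       # 1.0
print(ss3(2, 1, 1, 1))      # 1.5
print(k1(3, 0, 0, 2))       # no nonmatches: 9999.999
print(ss3(20000, 1, 0, 0))  # exceeds the limit: 9999.999
```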

Predictability Measures

The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.

Goodman and Kruskal Lambda (Similarity)

This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other item. Specifically, lambda measures the proportional reduction in error using one item to predict the other, when the directions of prediction are of equal importance. Lambda has a range of 0 to 1.

$$t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d)$$

$$t_2 = \max(a + c, b + d) + \max(a + b, c + d)$$

$$\mathrm{LAMBDA}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d) - t_2}$$

Anderberg’s D (Similarity)

This coefficient assesses the predictability of the state of a characteristic on one item (presence or absence) given the state on the other. D measures the actual reduction in the error probability when one item is used to predict the other. The range of D is 0 to 1.

$$t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d)$$

$$t_2 = \max(a + c, b + d) + \max(a + b, c + d)$$

$$\mathrm{D}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d)}$$

Yule’s Y Coefficient of Colligation (Similarity)

This is a function of the cross-product ratio for a 2 × 2 table. It has a range of –1 to +1.

$$\mathrm{Y}(x, y) = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

Yule’s Q (Similarity)

This is the 2 × 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-product ratio for a 2 × 2 table and has a range of –1 to +1.

$$\mathrm{Q}(x, y) = \frac{ad - bc}{ad + bc}$$
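Both Yule coefficients depend on the table only through the products ad and bc, that is, through the cross-product ratio. A small sketch with made-up tallies (function names are illustrative):

```python
import math

def yule_y(a, b, c, d):
    """Coefficient of colligation: uses square roots of the cross-products."""
    return (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

def yule_q(a, b, c, d):
    """Yule's Q: uses the cross-products directly."""
    return (a * d - b * c) / (a * d + b * c)

print(yule_y(4, 1, 1, 4))  # (4 - 1) / (4 + 1) = 0.6
print(yule_q(4, 1, 1, 4))  # 15 / 17, about 0.882
print(yule_q(1, 4, 4, 1))  # negative association: -15 / 17
```

Both are undefined when ad and bc are both zero; the sketch does not guard against that.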

Other Binary Measures

The remaining binary measures available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relation between items.

Ochiai Similarity Measure

This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity measure.

$$\mathrm{OCHIAI}(x, y) = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$$

Sokal and Sneath Similarity Measure 5

This is a similarity measure. Its range is 0 to 1.

$$\mathrm{SS5}(x, y) = \frac{ad}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

Fourfold Point Correlation (Similarity)

This is the binary form of the Pearson product-moment correlation coefficient. Phi is a similarity measure, and its range is –1 to +1 (the numerator ad − bc is negative whenever bc exceeds ad).

$$\mathrm{PHI}(x, y) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$
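The three measures above can be compared on the same table. The sketch uses the standard forms of the Ochiai, Sokal and Sneath 5, and phi coefficients with made-up tallies; the last call shows phi reaching its lower extreme when the two items never agree.

```python
import math

def ochiai(a, b, c, d):
    """Binary cosine: geometric mean of a/(a+b) and a/(a+c)."""
    return math.sqrt((a / (a + b)) * (a / (a + c)))

def _root_marginals(a, b, c, d):
    """Square root of the product of the four marginal totals."""
    return math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

def ss5(a, b, c, d):
    return a * d / _root_marginals(a, b, c, d)

def phi(a, b, c, d):
    return (a * d - b * c) / _root_marginals(a, b, c, d)

print(ochiai(4, 1, 1, 4))  # about 0.8
print(ss5(4, 1, 1, 4))     # 16 / 25 = 0.64
print(phi(4, 1, 1, 4))     # 15 / 25 = 0.6
print(phi(0, 3, 3, 0))     # complete disagreement: -1.0
```

Tables with an empty row or column make the denominators zero and are not handled here.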

Dispersion Similarity Measure

This similarity measure has a range of –1 to +1.

$$\mathrm{DISPER}(x, y) = \frac{ad - bc}{(a + b + c + d)^2}$$

Variance Dissimilarity Measure

This dissimilarity measure has a minimum value of 0 and no upper limit.

$$\mathrm{VARIANCE}(x, y) = \frac{b + c}{4(a + b + c + d)}$$

Binary Lance-and-Williams Nonmetric Dissimilarity Measure

Also known as the Bray-Curtis nonmetric coefficient, this dissimilarity measure has a range of 0 to 1.

$$\mathrm{BLWMN}(x, y) = \frac{b + c}{2a + b + c}$$
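For completeness, here is a sketch of the last three binary measures in their standard forms, evaluated on made-up tallies. Note that BLWMN is the complement of the Dice coefficient 2a / (2a + b + c), so the two always sum to 1.

```python
def disper(a, b, c, d):
    """Dispersion similarity: (ad - bc) over the squared total."""
    n = a + b + c + d
    return (a * d - b * c) / n ** 2

def variance(a, b, c, d):
    """Variance dissimilarity: nonmatches over four times the total."""
    return (b + c) / (4 * (a + b + c + d))

def blwmn(a, b, c, d):
    """Binary Lance-and-Williams dissimilarity: nonmatches over 2a + b + c."""
    return (b + c) / (2 * a + b + c)

print(disper(4, 1, 1, 4))    # 15 / 100 = 0.15
print(variance(4, 1, 1, 4))  # 2 / 40 = 0.05
print(blwmn(4, 1, 1, 4))     # 2 / 10 = 0.2
```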

Clustering Methods

Notation

The following notation is used unless otherwise specified:

$S$: Matrix of similarity or dissimilarity measures

$s_{ij}$: Similarity or dissimilarity measure between cluster i and cluster j

$N_i$: Number of cases in cluster i

General Procedure

Begin with N clusters each containing one case. Denote the clusters 1 through N.

Find the most similar pair of clusters p and q (p > q). Denote this similarity $s_{pq}$. If a dissimilarity measure is used, large values indicate dissimilarity. If a similarity measure is used, small values indicate dissimilarity.

Reduce the number of clusters by one through merger of clusters p and q. Label the new cluster t (= q) and update the similarity matrix (by the method specified) to reflect revised similarities or dissimilarities between cluster t and all other clusters. Delete the row and column of S corresponding to cluster p.

Perform the previous two steps until all entities are in one cluster.

For each of the following methods, the similarity or dissimilarity matrix S is updated to reflect revised similarities or dissimilarities ($s_{tr}$) between the new cluster t and all other clusters r as given below.
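The general procedure can be sketched as a loop with a pluggable update rule. This is a minimal illustration, assuming a symmetric dissimilarity matrix (so the most similar pair has the smallest entry), 0-based cluster labels instead of 1 through N, and a hypothetical `update(s_pr, s_qr, s_pq)` callback. The demonstration plugs in the median-method update, (s_pr + s_qr)/2 − s_pq/4, on squared Euclidean distances between the made-up points 0, 2, and 6.

```python
def agglomerate(S, update):
    """Run the general procedure on a symmetric dissimilarity matrix S.

    update(s_pr, s_qr, s_pq) returns the revised dissimilarity s_tr.
    Returns the agglomeration schedule as (p, q, s_pq) tuples.
    """
    S = [row[:] for row in S]      # work on a copy
    active = list(range(len(S)))   # clusters still in play (0-based labels)
    schedule = []
    while len(active) > 1:
        # Most similar pair = smallest dissimilarity, with p > q.
        s_pq, p, q = min((S[i][j], i, j) for i in active for j in active if i > j)
        # Merge p and q; the new cluster t keeps the label q.
        for r in active:
            if r not in (p, q):
                S[q][r] = S[r][q] = update(S[p][r], S[q][r], s_pq)
        active.remove(p)           # stands in for deleting row and column p
        schedule.append((p, q, s_pq))
    return schedule

# Squared Euclidean distances between the points 0, 2, and 6.
S = [[0, 4, 36],
     [4, 0, 16],
     [36, 16, 0]]
median = lambda s_pr, s_qr, s_pq: (s_pr + s_qr) / 2 - s_pq / 4
print(agglomerate(S, median))  # [(1, 0, 4), (2, 0, 25.0)]
```

The second merge happens at 25, the squared distance from the midpoint of the first cluster (1) to the remaining point (6). Methods that need cluster sizes, such as the centroid or Ward updates, would require a richer callback than the one assumed here.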

Average Linkage between Groups

Before the first merge, let $N_i = 1$ for i = 1 to N. Update $s_{tr}$ by

$$s_{tr} = s_{pr} + s_{qr}$$

Update $N_t$ by

$$N_t = N_p + N_q$$

and then choose the most similar pair based on the value

$$\frac{s_{ij}}{N_i N_j}$$
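Because the matrix stores sums rather than averages, the average is only formed when pairs are compared. The sketch below shows this for the made-up points 0, 2, and 6 with ordinary (unsquared) distances; the function names are hypothetical.

```python
def update_sum(s_pr, s_qr):
    """Revised entry s_tr: the two sums are simply added."""
    return s_pr + s_qr

def average_value(s_ij, n_i, n_j):
    """Value used to choose the most similar pair: sum divided by N_i * N_j."""
    return s_ij / (n_i * n_j)

# Clusters p = {0} and q = {2} are merged into t; r = {6} remains.
s_pr, s_qr = 6.0, 4.0            # distances from 0 and 2 to the point 6
s_tr = update_sum(s_pr, s_qr)    # 10.0, the sum of all between-cluster distances
n_t, n_r = 1 + 1, 1              # N_t = N_p + N_q
print(average_value(s_tr, n_t, n_r))  # 5.0, the mean of 6 and 4
```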

Centroid Method

Update $s_{tr}$ by

$$s_{tr} = \frac{N_p}{N_p + N_q} s_{pr} + \frac{N_q}{N_p + N_q} s_{qr} - \frac{N_p N_q}{(N_p + N_q)^2} s_{pq}$$

Median Method

Update $s_{tr}$ by

$$s_{tr} = \frac{s_{pr} + s_{qr}}{2} - \frac{s_{pq}}{4}$$
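The median update is the centroid update with the cluster sizes ignored, that is, with both clusters weighted equally. The sketch below, which assumes squared Euclidean distances and made-up one-dimensional points, checks both updates against distances computed directly; the function names are illustrative.

```python
def centroid_update(s_pr, s_qr, s_pq, n_p, n_q):
    """Size-weighted update for the centroid method."""
    n = n_p + n_q
    return (n_p / n) * s_pr + (n_q / n) * s_qr - (n_p * n_q / n ** 2) * s_pq

def median_update(s_pr, s_qr, s_pq):
    """Equal-weight update for the median method."""
    return (s_pr + s_qr) / 2 - s_pq / 4

# p = {0, 2} (centroid 1, N_p = 2), q = {6} (N_q = 1), r = {10}.
s_pr, s_qr, s_pq = 81.0, 16.0, 25.0  # squared distances 1-10, 6-10, 1-6
print(centroid_update(s_pr, s_qr, s_pq, 2, 1))  # (10 - 8/3)**2 = 484/9, about 53.78
print(median_update(s_pr, s_qr, s_pq))          # (10 - 3.5)**2 = 42.25
```

With equal cluster sizes the two updates give the same result.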

Ward’s Method

Update $s_{tr}$ by

$$s_{tr} = \frac{1}{N_t + N_r} \left[ (N_r + N_p) s_{rp} + (N_r + N_q) s_{rq} - N_r s_{pq} \right]$$

Update the coefficient W by

$$W = W + 0.5\, s_{pq}$$

Note that for Ward’s method, the coefficient given in the agglomeration schedule is really the within-cluster sum of squares at that step. For all other methods, this coefficient represents the distance at which the clusters p and q were joined.
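The claim that W is the within-cluster sum of squares can be checked numerically. The sketch below assumes squared Euclidean distances between the made-up points 0, 2, and 6, merges {0} and {2} first, and then merges the result with {6}; the function name `ward_update` is hypothetical.

```python
def ward_update(s_rp, s_rq, s_pq, n_p, n_q, n_r):
    """Ward update of the dissimilarity between merged cluster t and cluster r."""
    n_t = n_p + n_q
    return ((n_r + n_p) * s_rp + (n_r + n_q) * s_rq - n_r * s_pq) / (n_t + n_r)

W = 0.0

# First merge: p = {0}, q = {2}, joined at s_pq = 4.
W += 0.5 * 4.0                                   # W = 2.0, the sum of squares of {0, 2}
s_tr = ward_update(36.0, 16.0, 4.0, 1, 1, 1)     # 100/3

# Second merge: t = {0, 2} with r = {6}, joined at s_tr.
W += 0.5 * s_tr

points = [0.0, 2.0, 6.0]
mean = sum(points) / len(points)
print(W, sum((x - mean) ** 2 for x in points))   # both about 18.667
```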

References

Anderberg, M. R. 1973. Cluster analysis for applications. New York: Academic Press.
