Clustering Tendency and Cluster Validity: Understanding the Structure of Data Sets, Slides of Pattern Classification and Recognition

This document explains why the clustering tendency of a data set should be verified before applying clustering algorithms. It covers the randomness, regularity, and clustering hypotheses, and how they are tested using Monte Carlo simulations and various other methods. It also covers cluster validity, which evaluates the results of a clustering algorithm, and introduces external, internal, and relative criteria for assessment.

Typology: Slides

2011/2012

Uploaded on 07/17/2012

bandhula
CLUSTER VALIDITY

Clustering tendency

Facts

Most clustering algorithms impose a clustering structure on the data set X at hand.

However, X may not possess a clustering structure.

Before we apply any clustering algorithm to X, it must first be verified that X possesses a clustering structure. This is known as clustering tendency.

Clustering tendency is heavily based on hypothesis testing. Specifically, it is based on testing the randomness (null) hypothesis (H0) against the regularity hypothesis (H1) and the clustering hypothesis (H2).

  • Randomness hypothesis (H0): “The vectors of X are randomly distributed, according to the uniform distribution in the sampling window (the compact convex support set for the underlying distribution of the vectors of the data set X) of X”.

  • Regularity hypothesis (H1): “The vectors of X are regularly spaced (that is, they are not too close to each other) in the sampling window”.

  • Clustering hypothesis (H2): “The vectors of X form clusters”.
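
To make the randomness hypothesis concrete, the following sketch draws data sets under H0 and records a statistic q for each one, then checks where the q of an observed data set falls among the simulated values. Everything specific here is an assumption made for illustration and does not come from the slides: the sampling window is the unit square, q is the mean nearest-neighbour distance, and the "observed" data are synthetic.

```python
import math
import random

def mean_nn_distance(points):
    """Statistic q (assumed for illustration): mean nearest-neighbour distance."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, r) for j, r in enumerate(points) if j != i)
    return total / len(points)

def simulate_null(n_points, n_runs, rng):
    """Draw n_runs data sets uniformly in the unit square (H0) and return sorted q values."""
    qs = []
    for _ in range(n_runs):
        sample = [(rng.random(), rng.random()) for _ in range(n_points)]
        qs.append(mean_nn_distance(sample))
    return sorted(qs)

def lower_tail_fraction(q_obs, null_qs):
    """Fraction of simulated q values that do not exceed the observed q."""
    return sum(q <= q_obs for q in null_qs) / len(null_qs)

rng = random.Random(0)
null_qs = simulate_null(n_points=40, n_runs=200, rng=rng)

# Hypothetical observed data: two tight groups inside the unit square.
observed = [(0.2 + 0.02 * rng.random(), 0.2 + 0.02 * rng.random()) for _ in range(20)]
observed += [(0.8 + 0.02 * rng.random(), 0.8 + 0.02 * rng.random()) for _ in range(20)]

q_obs = mean_nn_distance(observed)
frac = lower_tail_fraction(q_obs, null_qs)
print(f"q_obs = {q_obs:.4f}, fraction of H0 runs with q <= q_obs: {frac:.3f}")
```

A very small fraction means that points lie much closer to their neighbours than uniform data would suggest, which speaks against H0 in the direction of clustering. An unusually large q would instead point towards regular spacing.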

  - p(q|H0), p(q|H1) and p(q|H2), where q denotes the test statistic, are estimated via Monte Carlo simulations.

Some tests for spatial randomness, when the input space dimensionality is greater than or equal to 2, are:

  • Tests based on structural graphs
      - Test that utilizes the idea of the minimum spanning tree (MST)

  • Tests based on nearest neighbor distances
      - The Hopkins test
      - The Cox-Lewis test

  • A method based on sparse decomposition.
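
The slides name the Hopkins test without giving its formula. The sketch below uses one commonly quoted form of the Hopkins statistic, which may differ in detail from the definition the course uses. It compares the nearest-neighbour distances of m uniformly drawn points to the data with the nearest-neighbour distances of m sampled data points. The bounding box used as sampling window, the exponent equal to the dimension, and the synthetic data sets are all assumptions.

```python
import math
import random

def nearest_distance(p, points, skip_index=None):
    """Distance from p to the closest element of points, optionally skipping one index."""
    return min(math.dist(p, q) for k, q in enumerate(points) if k != skip_index)

def hopkins(data, m, rng):
    """Hopkins-style statistic: near 0.5 for uniform data, near 1 for clustered data."""
    dim = len(data[0])
    # Assumption: the axis-aligned bounding box of the data is the sampling window.
    lows = [min(p[d] for p in data) for d in range(dim)]
    highs = [max(p[d] for p in data) for d in range(dim)]

    u_sum = 0.0  # uniform points -> nearest data point
    for _ in range(m):
        y = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        u_sum += nearest_distance(y, data) ** dim

    w_sum = 0.0  # sampled data points -> nearest other data point
    for i in rng.sample(range(len(data)), m):
        w_sum += nearest_distance(data[i], data, skip_index=i) ** dim

    return u_sum / (u_sum + w_sum)

rng = random.Random(1)

# Hypothetical data sets for illustration.
clustered = [(0.2 + 0.02 * rng.random(), 0.2 + 0.02 * rng.random()) for _ in range(50)]
clustered += [(0.8 + 0.02 * rng.random(), 0.8 + 0.02 * rng.random()) for _ in range(50)]
uniform = [(rng.random(), rng.random()) for _ in range(200)]

h_clustered = hopkins(clustered, m=10, rng=rng)
h_uniform = hopkins(uniform, m=20, rng=rng)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

A single value of the statistic is not a test by itself. The observed value still has to be compared with its distribution under H0, which can be estimated by Monte Carlo simulation.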


Cluster validity

In the sequel it is assumed that the clustering tendency procedure indicated the existence of a clustering structure in X.

If a clustering algorithm is applied to X with inappropriate values of the involved parameters, poor results may be obtained. Hence the need for further evaluation of clustering results is apparent.

Cluster validity: a task that evaluates quantitatively the results of a clustering algorithm.

A clustering structure C resulting from an algorithm may be either

  • A hierarchy of clusterings or

  • A single clustering.


Cluster validity may be approached in three possible directions:

  • C is evaluated in terms of an independently drawn structure, imposed on X a priori. The criteria used in this case are called external criteria.

  • C is evaluated in terms of quantities that involve the vectors of X themselves (e.g., proximity matrix). The criteria used in this case are called internal criteria.

  • C is evaluated by comparing it with other clustering structures, resulting from the application of the same clustering algorithm but with different parameter values, or of other clustering algorithms, on X. Criteria of this kind are called relative criteria.


Statistics suitable for external criteria

  • For the comparison of C with an independently drawn partition P of X
      - Rand statistic
      - Jaccard statistic
      - Fowlkes-Mallows index
      - Hubert’s Γ statistic
      - Normalized Γ statistic

  • For assessing the agreement between P and the proximity matrix P
      - Γ statistic.
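
The first three statistics in the list above can all be computed from the same four pair counts. The sketch below uses the standard textbook definitions, which the slides do not spell out. The label vectors are hypothetical examples.

```python
import math
from itertools import combinations

def pair_counts(labels_c, labels_p):
    """Count point pairs by whether they share a cluster in C and a group in P.

    a: same in C, same in P        b: same in C, different in P
    c: different in C, same in P   d: different in C, different in P
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_statistic(a, b, c, d):
    return (a + d) / (a + b + c + d)

def jaccard_statistic(a, b, c, d):
    return a / (a + b + c)

def fowlkes_mallows(a, b, c, d):
    return a / math.sqrt((a + b) * (a + c))

# Hypothetical clustering C and independently drawn partition P of six points.
labels_c = [0, 0, 0, 1, 1, 1]
labels_p = [0, 0, 1, 1, 1, 1]

counts = pair_counts(labels_c, labels_p)
rand = rand_statistic(*counts)
jaccard = jaccard_statistic(*counts)
fm = fowlkes_mallows(*counts)
print(counts, round(rand, 4), round(jaccard, 4), round(fm, 4))
```

All three take values between 0 and 1, with larger values indicating closer agreement between C and P. The Jaccard statistic and the Fowlkes-Mallows index ignore the pairs that are separated in both structures.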

Statistics suitable for internal criteria

  • Validation of hierarchy of clusterings
      - Cophenetic correlation coefficient (CPCC)
      - γ statistic
      - Kendall’s τ statistic.

  • Validation of individual clusterings
      - Γ statistic
      - Normalized Γ statistic
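
For an individual clustering, one common way to form the normalized Γ statistic is to correlate the proximity matrix with a matrix that marks which pairs of points are in different clusters. The sketch below follows that construction as an assumption, since the slides do not define it. It uses Euclidean distances as proximities and hypothetical data.

```python
import math
from itertools import combinations

def normalized_gamma(points, labels):
    """Pearson correlation, over all pairs i < j, between the distance d(i, j)
    and the indicator Y(i, j) = 1 if i and j lie in different clusters, else 0.

    Assumes at least two clusters and at least two distinct distances;
    otherwise a variance is zero and the ratio is undefined.
    """
    pairs = list(combinations(range(len(points)), 2))
    x = [math.dist(points[i], points[j]) for i, j in pairs]
    y = [1.0 if labels[i] != labels[j] else 0.0 for i, j in pairs]

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(x, y))
    var_x = sum((xv - mean_x) ** 2 for xv in x)
    var_y = sum((yv - mean_y) ** 2 for yv in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

gamma_good = normalized_gamma(points, good_labels)
gamma_bad = normalized_gamma(points, bad_labels)
print(f"good: {gamma_good:.3f}, bad: {gamma_bad:.3f}")
```

The value lies between -1 and 1. A value close to 1 means that pairs in different clusters tend to be the far-apart pairs, which is what a clustering that matches the proximity structure should produce.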


Cluster validity for the cases of relative criteria

Let A denote the set of parameters of a clustering algorithm.

Statement of the problem

  • “Among the clusterings produced by a specific clustering algorithm, for different values of the parameters in A, choose the one that best fits the data set X”.

We consider two cases

(a) A does not contain the number of clusters m.

The estimation of the best set of parameter values is carried out as follows:

  - Run the algorithm for a wide range of values of its parameters.

  - Plot the number of clusters, m, versus the parameters of A.

  - Choose the widest range for which m remains constant.

  - Adopt the clustering that corresponds to the values of the parameters in A that lie in the middle of this range.
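
The procedure above can be sketched for an algorithm with a single parameter. The stand-in algorithm here is an assumption chosen for brevity: points closer than a threshold theta are linked, and each connected group counts as a cluster. The data, the evenly spaced grid of theta values, and the decision to ignore the trivial plateaus m = 1 and m = N are likewise illustrative assumptions.

```python
import math

def count_clusters(points, theta):
    """Number of connected groups when points at distance <= theta are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= theta:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

def widest_plateau(thetas, ms, n_points):
    """Return (m, middle theta) of the longest run of constant m.

    Assumes an evenly spaced grid, so run length measures range width.
    Runs with m == 1 or m == n_points are skipped as trivial (assumption).
    """
    best = None  # (length, first index, last index)
    start = 0
    for k in range(1, len(ms) + 1):
        if k == len(ms) or ms[k] != ms[start]:
            length = k - start
            if 1 < ms[start] < n_points and (best is None or length > best[0]):
                best = (length, start, k - 1)
            start = k
    if best is None:
        return None
    _, lo, hi = best
    return ms[lo], thetas[(lo + hi) // 2]

# Hypothetical one-dimensional data with three visible groups.
values = [0.0, 0.4, 0.7, 1.1, 10.0, 10.3, 10.9, 20.2, 20.6, 20.9, 21.3]
points = [(v,) for v in values]

thetas = [0.25 * k for k in range(1, 49)]             # 0.25, 0.50, ..., 12.0
ms = [count_clusters(points, t) for t in thetas]      # m as a function of theta
best_m, best_theta = widest_plateau(thetas, ms, len(points))
print(best_m, best_theta)
```

With more than one parameter, the same idea applies to regions of the parameter space instead of intervals. The sketch does not cover that case.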


Statistics suitable for relative criteria

  • Hard clustering
      - Modified Hubert Γ statistic
      - Dunn and Dunn-like indices
      - Davies-Bouldin (DB) and DB-like indices

  • Fuzzy clustering
      - Indices for clusters with point representatives
          o Partition coefficient (PC)
          o Partition entropy coefficient (PE)
          o Xie-Beni (XB) index
          o Fukuyama-Sugeno index
          o Total fuzzy hypervolume
          o Average partition density
          o Partition density
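
For the hard-clustering indices named above, the sketch below computes the basic Dunn index and the Davies-Bouldin index with their usual definitions, which the slides do not state. It assumes Euclidean distance, centroids as cluster representatives, and a hypothetical two-cluster data set.

```python
import math
from itertools import combinations

def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def dunn_index(clusters):
    """Smallest distance between points of different clusters,
    divided by the largest cluster diameter. Larger is better."""
    min_between = min(
        math.dist(p, q)
        for ca, cb in combinations(clusters, 2)
        for p in ca
        for q in cb
    )
    max_diameter = max(
        math.dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_between / max_diameter

def davies_bouldin(clusters):
    """Average, over clusters, of the worst ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j). Smaller is better."""
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)

# Hypothetical clustering: two unit squares far apart.
clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(5, 5), (5, 6), (6, 5), (6, 6)],
]

dunn = dunn_index(clusters)
db = davies_bouldin(clusters)
print(f"Dunn = {dunn:.3f}, DB = {db:.3f}")
```

When several clusterings of the same data are compared, for example for different numbers of clusters, compact and well-separated clusters show up as a large Dunn index and a small DB index.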


Statistics suitable for relative criteria (cont.)

  • Fuzzy clustering (cont.)
      - Indices for shell-shaped clusters
          o Fuzzy shell density
          o Average partition shell density
          o Shell partition density
          o Total fuzzy average shell thickness
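
Among the fuzzy-clustering indices for point representatives listed earlier, the partition coefficient (PC) and the partition entropy coefficient (PE) depend only on the membership matrix, so they are the easiest to illustrate. The sketch below uses their usual definitions with the natural logarithm, which is an assumption, as the slides give no formulas. The membership matrices are hypothetical.

```python
import math

def partition_coefficient(u):
    """PC = (1/N) * sum of squared memberships; u[i][j] is the membership
    of point i in cluster j, and each row is assumed to sum to 1."""
    return sum(v * v for row in u for v in row) / len(u)

def partition_entropy(u):
    """PE = -(1/N) * sum of u * ln(u), with 0 * ln(0) taken as 0."""
    return -sum(v * math.log(v) for row in u for v in row if v > 0) / len(u)

# Hypothetical membership matrices for three points and two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # every point fully in one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # every point shared equally

pc_crisp, pe_crisp = partition_coefficient(crisp), partition_entropy(crisp)
pc_fuzzy, pe_fuzzy = partition_coefficient(fuzzy), partition_entropy(fuzzy)
print(pc_crisp, pe_crisp, pc_fuzzy, pe_fuzzy)
```

For m clusters, PC ranges from 1/m (memberships spread evenly) to 1 (crisp assignment), while PE ranges from 0 (crisp) to ln m (spread evenly). Values of PC near 1 and PE near 0 therefore indicate a clustering with little overlap.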