Clustering Tendency and Cluster Validity: Understanding the Structure of Data Sets, Slides of Pattern Classification and Recognition

Banasthali Vidyapith Pattern Classification and Recognition

These slides stress the importance of verifying the clustering tendency of a data set before applying clustering algorithms. They explain the randomness, regularity, and clustering hypotheses and how these are tested using Monte Carlo simulations and various other methods. They also cover cluster validity, which evaluates the results of a clustering algorithm, and introduce external, internal, and relative criteria for assessment.

Typology: Slides

2011/2012

Uploaded on 07/17/2012

bandhula 🇮🇳


CLUSTER VALIDITY

Clustering tendency

Facts

Most clustering algorithms impose a clustering structure on the data set X at hand.

However, X may not possess a clustering structure.

Before we apply any clustering algorithm to X, it must first be verified that X possesses a clustering structure. This is known as clustering tendency.

Clustering tendency is heavily based on hypothesis testing. Specifically, it is based on testing the randomness (null) hypothesis (H0) against the regularity hypothesis (H1) and the clustering hypothesis (H2).

• Randomness hypothesis (H0): “The vectors of X are randomly distributed, according to the uniform distribution in the sampling window (the compact convex support set for the underlying distribution of the vectors of the data set X) of X”.

• Regularity hypothesis (H1): “The vectors of X are regularly spaced (that is, they are not too close to each other) in the sampling window”.

• Clustering hypothesis (H2): “The vectors of X form clusters”.
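To make the three hypotheses concrete, the following sketch generates one toy data set of each kind in the unit square and compares their mean nearest-neighbor distances. The generators (`random_points`, `regular_points`, `clustered_points`), the jittered grid, the Gaussian blobs, and all parameter values are illustrative assumptions, not part of the slides.

```python
import math
import random


def mean_nn_distance(points):
    """Average distance from each point to its nearest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def random_points(n, rng):
    """H0-like data: n points drawn uniformly from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def regular_points(side, rng, jitter=0.01):
    """H1-like data: a side x side grid with a small random jitter (assumed model)."""
    step = 1.0 / side
    return [((i + 0.5) * step + rng.uniform(-jitter, jitter),
             (j + 0.5) * step + rng.uniform(-jitter, jitter))
            for i in range(side) for j in range(side)]


def clustered_points(n, rng, centers=((0.25, 0.25), (0.75, 0.75)), spread=0.03):
    """H2-like data: Gaussian blobs around a few centers (assumed model)."""
    chosen = [rng.choice(centers) for _ in range(n)]
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread)) for cx, cy in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    print("random   :", mean_nn_distance(random_points(100, rng)))
    print("regular  :", mean_nn_distance(regular_points(10, rng)))
    print("clustered:", mean_nn_distance(clustered_points(100, rng)))
```

Regularly spaced points tend to have larger nearest-neighbor distances than uniformly random ones, and clustered points smaller ones, which is the intuition behind the nearest-neighbor tests.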


• p(q|H0), p(q|H1) and p(q|H2) are estimated via Monte Carlo simulations.

Some tests for spatial randomness, when the input space dimensionality is greater than or equal to 2, are:

• Tests based on structural graphs
   - Test that utilizes the idea of the minimum spanning tree (MST)

• Tests based on nearest neighbor distances
   - The Hopkins test
   - The Cox-Lewis test

• A method based on sparse decomposition.
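As an illustration of a nearest-neighbor test combined with Monte Carlo estimation, here is a minimal sketch of one common formulation of the Hopkins statistic. It assumes the sampling window is an axis-aligned box given by `lows` and `highs`, and that nearest neighbors are searched in the full data set; the function names and the one-sided p-value against the clustering alternative are assumptions of this sketch, and the slides may define the details differently.

```python
import math
import random


def nearest_distance(p, points, skip=None):
    """Distance from p to the closest point, ignoring index `skip` if given."""
    return min(math.dist(p, q) for i, q in enumerate(points) if i != skip)


def uniform_points(n, lows, highs, rng):
    """n points drawn uniformly from the box [lows, highs] (assumed window)."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in range(n)]


def hopkins(points, m, lows, highs, rng):
    """Hopkins statistic: near 0.5 for random data, near 1 for clustered data."""
    dim = len(points[0])
    sampled = rng.sample(range(len(points)), m)    # m data points
    probes = uniform_points(m, lows, highs, rng)   # m random window points
    d = sum(nearest_distance(y, points) ** dim for y in probes)
    delta = sum(nearest_distance(points[i], points, skip=i) ** dim
                for i in sampled)
    return d / (d + delta)


def monte_carlo_p_value(points, m, lows, highs, runs, rng):
    """Estimate the null distribution of the statistic by simulating H0."""
    observed = hopkins(points, m, lows, highs, rng)
    null = [hopkins(uniform_points(len(points), lows, highs, rng),
                    m, lows, highs, rng)
            for _ in range(runs)]
    # One-sided p-value: how often uniform data looks at least as clustered.
    p = (1 + sum(h >= observed for h in null)) / (runs + 1)
    return observed, p
```

A small p-value suggests rejecting H0 in favor of clustering; values of the statistic well below 0.5 would instead point toward regularity.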


Cluster validity

In the sequel it is assumed that the clustering tendency procedure indicated the existence of a clustering structure in X.

If a clustering algorithm is applied to X with inappropriate values of the involved parameters, poor results may be obtained. Hence the need for further evaluation of clustering results is apparent.

Cluster validity: a task that evaluates quantitatively the results of a clustering algorithm.

A clustering structure C resulting from an algorithm may be either

• A hierarchy of clusterings or
• A single clustering.


Cluster validity may be approached in three possible directions:

• C is evaluated in terms of an independently drawn structure, imposed on X a priori. The criteria used in this case are called external criteria.

• C is evaluated in terms of quantities that involve the vectors of X themselves (e.g., proximity matrix). The criteria used in this case are called internal criteria.

• C is evaluated by comparing it with other clustering structures, resulting from the application of the same clustering algorithm but with different parameter values, or other clustering algorithms, on X. Criteria of this kind are called relative criteria.


Statistics suitable for external criteria

• For the comparison of C with an independently drawn partition P of X
   - Rand statistic
   - Jaccard statistic
   - Fowlkes-Mallows index
   - Hubert’s Γ statistic
   - Normalized Γ statistic

• For assessing the agreement between P and the proximity matrix P
   - Γ statistic.
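The first three statistics listed above can be computed from pair counts. The sketch below uses the standard pair-counting definitions, with C and P given as label lists; the function names and the label-list representation are assumptions for illustration.

```python
import math
from itertools import combinations


def pair_counts(labels_c, labels_p):
    """Count point pairs by whether they share a cluster in C and in P.

    a: same cluster in both, b: same in C only,
    c: same in P only,       d: different in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_statistic(labels_c, labels_p):
    a, b, c, d = pair_counts(labels_c, labels_p)
    return (a + d) / (a + b + c + d)


def jaccard_statistic(labels_c, labels_p):
    a, b, c, _ = pair_counts(labels_c, labels_p)
    return a / (a + b + c)


def fowlkes_mallows(labels_c, labels_p):
    a, b, c, _ = pair_counts(labels_c, labels_p)
    return a / math.sqrt((a + b) * (a + c))
```

All three take values in [0, 1], with larger values indicating closer agreement between C and P. The sketch does not guard against degenerate inputs (for example, every point in its own cluster in both labelings), where the Jaccard and Fowlkes-Mallows denominators become zero.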

Statistics suitable for internal criteria

• Validation of hierarchy of clusterings
   - Cophenetic correlation coefficient (CPCC)
   - γ statistic
   - Kendall’s τ statistic.

• Validation of individual clusterings
   - Γ statistic
   - Normalized Γ statistic
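As a hedged sketch of the Γ-type statistics, the code below computes Hubert's Γ and its normalized version for two symmetric matrices, taking the upper-triangle entries only. For validating a single clustering it compares the distance matrix of the data with a matrix whose entries are 1 when two points lie in different clusters and 0 otherwise; this choice of matrices, the Euclidean distance, and the helper names are assumptions of the sketch.

```python
import math
from itertools import combinations


def upper_triangle(matrix):
    """Entries (i, j) with i < j of a square matrix, as a flat list."""
    return [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]


def hubert_gamma(X, Y):
    """Raw statistic: mean of the products X(i, j) * Y(i, j) over i < j."""
    x, y = upper_triangle(X), upper_triangle(Y)
    return sum(a * b for a, b in zip(x, y)) / len(x)


def normalized_gamma(X, Y):
    """Normalized statistic: Pearson correlation of the upper triangles."""
    x, y = upper_triangle(X), upper_triangle(Y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined if either matrix is constant


def distance_matrix(points):
    """Euclidean proximity (dissimilarity) matrix of the data."""
    return [[math.dist(p, q) for q in points] for p in points]


def separation_matrix(labels):
    """1 where two points are in different clusters, 0 otherwise."""
    return [[0 if a == b else 1 for b in labels] for a in labels]
```

The same correlation applied to a proximity matrix and the cophenetic matrix of a hierarchy gives the cophenetic correlation coefficient.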


Cluster validity for the cases of relative criteria

Let A denote the set of parameters of a clustering algorithm.

Statement of the problem

• “Among the clusterings produced by a specific clustering algorithm, for different values of the parameters in A, choose the one that best fits the data set X.”

We consider two cases:

(a) A does not contain the number of clusters m.

The estimation of the best set of parameter values is carried out as follows:

• Run the algorithm for a wide range of values of its parameters.
• Plot the number of clusters, m, versus the parameters of A.
• Choose the widest range for which m remains constant.
• Adopt the clustering that corresponds to the values of the parameters in A that lie in the middle of this range.
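The selection steps above can be sketched for the simplest situation of a single scalar parameter. The input `sweep` is a hypothetical list of `(parameter_value, m)` pairs, sorted by parameter value, that is assumed to come from running the algorithm beforehand; the function name is also an assumption.

```python
def pick_parameter(sweep):
    """Pick the parameter value in the middle of the widest constant-m range.

    sweep: non-empty list of (parameter_value, m), sorted by parameter_value.
    Returns the chosen (parameter_value, m) pair.
    """
    best = None  # (width, start_index, end_index) of the widest plateau
    start = 0
    for i in range(1, len(sweep) + 1):
        # A plateau ends at the end of the sweep or when m changes.
        if i == len(sweep) or sweep[i][1] != sweep[start][1]:
            width = sweep[i - 1][0] - sweep[start][0]
            if best is None or width > best[0]:
                best = (width, start, i - 1)
            start = i
    _, s, e = best
    middle = (sweep[s][0] + sweep[e][0]) / 2
    # Return the swept value closest to the middle of the plateau.
    return min(sweep[s:e + 1], key=lambda pair: abs(pair[0] - middle))


if __name__ == "__main__":
    # Hypothetical sweep: m stays at 3 for thresholds 3 to 7.
    sweep = [(1, 10), (2, 6), (3, 3), (4, 3), (5, 3),
             (6, 3), (7, 3), (8, 2), (9, 1), (10, 1)]
    print(pick_parameter(sweep))
```

With several parameters the same idea applies to regions of the parameter space rather than intervals, which this one-dimensional sketch does not cover.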


Statistics suitable for relative criteria

• Hard clustering
   - Modified Hubert Γ statistic
   - Dunn and Dunn-like indices
   - Davies-Bouldin (DB) and DB-like indices

• Fuzzy clustering
   - Indices for clusters with point representatives
      o Partition coefficient (PC)
      o Partition entropy coefficient (PE)
      o Xie-Beni (XB) index
      o Fukuyama-Sugeno index
      o Total fuzzy hypervolume
      o Average partition density
      o Partition density
   - Indices for shell-shaped clusters
      o Fuzzy shell density
      o Average partition shell density
      o Shell partition density
      o Total fuzzy average shell thickness
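A few of the indices listed above are simple enough to sketch directly. The code below uses the usual textbook definitions of the Dunn index, the Davies-Bouldin index, the partition coefficient, and the partition entropy; the Euclidean distance, the mean-distance-to-centroid scatter in the DB index, and the data layout (clusters as lists of points, memberships as one row per data point) are assumptions of this sketch.

```python
import math
from itertools import combinations


def centroid(cluster):
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(dim))


def dunn_index(clusters):
    """Smallest between-cluster distance over largest cluster diameter.

    Assumes at least two clusters and at least one cluster with two points.
    Larger values indicate compact, well-separated clusters.
    """
    min_separation = min(math.dist(p, q)
                         for a, b in combinations(clusters, 2)
                         for p in a for q in b)
    max_diameter = max(math.dist(p, q)
                       for c in clusters for p, q in combinations(c, 2))
    return min_separation / max_diameter


def davies_bouldin(clusters):
    """Average, over clusters, of the worst scatter-to-separation ratio.

    Smaller values indicate better clusterings.
    """
    centers = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, w) for p in c) / len(c)
               for c, w in zip(clusters, centers)]
    worst = [max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                 for j in range(len(clusters)) if j != i)
             for i in range(len(clusters))]
    return sum(worst) / len(worst)


def partition_coefficient(U):
    """PC: mean of squared memberships; equals 1 for a crisp partition."""
    return sum(u * u for row in U for u in row) / len(U)


def partition_entropy(U):
    """PE: mean membership entropy; equals 0 for a crisp partition."""
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / len(U)
```

For relative criteria, such an index would be computed for each candidate clustering and the clustering with the best value retained.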
