






This document discusses the importance of verifying the clustering tendency of a data set before applying clustering algorithms. It explains the randomness, regularity, and clustering hypotheses, and how they are tested using Monte Carlo simulations and various methods. It also covers cluster validity, which evaluates the results of a clustering algorithm, and introduces external, internal, and relative criteria for assessment.
Typology: Slides







Most clustering algorithms impose a clustering structure on the data set X at hand. However, X may not possess a clustering structure.
Before we apply any clustering algorithm on X, it must first be verified that X possesses a clustering structure. This is known as clustering tendency.
Clustering tendency is heavily based on hypothesis testing. Specifically, it is based on testing the randomness (null) hypothesis (H_0) against the regularity hypothesis (H_1) and the clustering hypothesis (H_2).
• Randomness hypothesis (H_0): “The vectors of X are randomly distributed, according to the uniform distribution in the sampling window of X (the compact convex support set for the underlying distribution of the vectors of the data set X)”.
• Regularity hypothesis (H_1): “The vectors of X are regularly spaced (that is, they are not too close to each other) in the sampling window”.
• Clustering hypothesis (H_2): “The vectors of X form clusters”.
p(q|H_0), p(q|H_1) and p(q|H_2) are estimated via Monte Carlo simulations.
Some tests for spatial randomness, when the input space dimensionality is greater than or equal to 2, are:
• Tests based on structural graphs
• Tests based on nearest neighbor distances
• A method based on sparse decomposition
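As a rough illustration of the Monte Carlo idea, the sketch below estimates the distribution of a statistic q under H_0 by simulation and then locates an observed q within it. Several things here are assumptions rather than part of the slides: q is taken to be the mean nearest neighbor distance, the sampling window is taken to be the unit square, and the two test data sets are hypothetical. It is a simplified stand-in, not one of the specific tests listed above.

```python
import math
import random


def mean_nn_distance(points):
    """Statistic q (assumed here): mean distance to the nearest neighbor."""
    return sum(
        min(math.dist(p, r) for j, r in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ) / len(points)


def simulate_null(n, runs, rng):
    """q values for `runs` data sets drawn uniformly in the unit square (H_0)."""
    return [
        mean_nn_distance([(rng.random(), rng.random()) for _ in range(n)])
        for _ in range(runs)
    ]


def lower_tail_fraction(q_obs, null_values):
    """Fraction of simulated q values that do not exceed the observed q."""
    return sum(q <= q_obs for q in null_values) / len(null_values)


rng = random.Random(0)
null_q = simulate_null(n=64, runs=200, rng=rng)

# Hypothetical clustered data: two tight blobs of 32 points each.
clustered = [
    (cx + rng.gauss(0, 0.02), cy + rng.gauss(0, 0.02))
    for cx, cy in [(0.25, 0.25), (0.75, 0.75)]
    for _ in range(32)
]
# Hypothetical regular data: an 8 x 8 grid with spacing 0.125.
grid = [((i + 0.5) / 8, (j + 0.5) / 8) for i in range(8) for j in range(8)]

frac_clustered = lower_tail_fraction(mean_nn_distance(clustered), null_q)
frac_grid = lower_tail_fraction(mean_nn_distance(grid), null_q)
print(frac_clustered, frac_grid)
```

Under these assumptions, a fraction close to 0 (unusually small nearest neighbor distances) points toward the clustering hypothesis, while a fraction close to 1 points toward regularity.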
Cluster validity
In the sequel it is assumed that the clustering tendency procedure indicated the existence of a clustering structure in X.
Applying a clustering algorithm on X with inappropriate values of the involved parameters may yield poor results. Hence the need for further evaluation of clustering results is apparent.
Cluster validity: a task that evaluates quantitatively the results of a clustering algorithm.
A clustering structure resulting from an algorithm may be either a hierarchy of clusterings or a single clustering.
Cluster validity may be approached in three possible directions:
• The clustering structure is evaluated in terms of an independently drawn structure, imposed on X. The criteria used in this case are called external criteria.
• The clustering structure is evaluated in terms of quantities that involve the vectors of X themselves (e.g., proximity matrix). The criteria used in this case are called internal criteria.
• The clustering structure is evaluated by comparing it with other clustering structures, resulting from the application of the same clustering algorithm but with different parameter values, or of other clustering algorithms, on X. Criteria of this kind are called relative criteria.
Statistics suitable for external criteria
For the comparison of the clustering with an independently drawn partition of X:
• Rand statistic
• Jaccard statistic
• Fowlkes-Mallows index
• Hubert’s Γ statistic
• Normalized Γ statistic
For assessing the agreement between the independently drawn partition and the proximity matrix:
• Γ statistic
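The first three statistics in the list above can all be computed from pair counts. The sketch below uses the usual definitions, with a = pairs grouped together in both the clustering and the partition, b = together only in the clustering, c = together only in the partition, and d = separated in both. The function names and the two label lists are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def pair_counts(labels_c, labels_p):
    """Count point pairs by agreement between clustering and partition."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_statistic(labels_c, labels_p):
    a, b, c, d = pair_counts(labels_c, labels_p)
    return (a + d) / (a + b + c + d)


def jaccard_statistic(labels_c, labels_p):
    a, b, c, _ = pair_counts(labels_c, labels_p)
    return a / (a + b + c)


def fowlkes_mallows_index(labels_c, labels_p):
    a, b, c, _ = pair_counts(labels_c, labels_p)
    return a / math.sqrt((a + b) * (a + c))


# Hypothetical labels: clustering result vs. independently drawn partition.
clustering = [0, 0, 0, 1, 1, 1]
partition = [0, 0, 1, 1, 1, 1]
print(rand_statistic(clustering, partition))
print(jaccard_statistic(clustering, partition))
print(fowlkes_mallows_index(clustering, partition))
```

All three statistics equal 1 when the clustering and the partition agree on every pair. The Jaccard and Fowlkes-Mallows versions ignore d, so they are not inflated by the many pairs that are trivially separated in both.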
Statistics suitable for internal criteria
Validation of hierarchy of clusterings:
• Cophenetic correlation coefficient
• γ statistic
• Kendall’s τ statistic
Validation of individual clusterings:
• Γ statistic
• Normalized Γ statistic
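For an individual clustering, Hubert's Γ statistic is commonly written as Γ = (1/M) Σ_{i<j} X(i,j) Y(i,j), with M = N(N-1)/2, and its normalized form is the correlation between the upper-triangular entries of X and Y. The sketch below assumes that X is the proximity (distance) matrix and that Y(i,j) is 1 when points i and j lie in different clusters and 0 otherwise. The slides do not spell out this choice, and the data and function names are hypothetical.

```python
import math
from itertools import combinations


def upper_triangle(proximity, labels):
    """Collect X(i, j) and Y(i, j) for all pairs i < j."""
    xs, ys = [], []
    for i, j in combinations(range(len(labels)), 2):
        xs.append(proximity[i][j])
        # Assumed definition: Y = 1 for points in different clusters.
        ys.append(1.0 if labels[i] != labels[j] else 0.0)
    return xs, ys


def hubert_gamma(proximity, labels):
    xs, ys = upper_triangle(proximity, labels)
    return sum(x * y for x, y in zip(xs, ys)) / len(xs)


def normalized_gamma(proximity, labels):
    """Correlation between X and Y; undefined if either is constant."""
    xs, ys = upper_triangle(proximity, labels)
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / m)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / m)
    return cov / (sx * sy)


# Hypothetical 1-D data with two obvious groups.
points = [0.0, 0.1, 5.0, 5.1]
proximity = [[abs(p - q) for q in points] for p in points]
good = normalized_gamma(proximity, [0, 0, 1, 1])
bad = normalized_gamma(proximity, [0, 1, 0, 1])
print(hubert_gamma(proximity, [0, 0, 1, 1]), good, bad)
```

With this choice of Y, a labeling that matches the distance structure gives a normalized value near 1, while a labeling that cuts across the groups gives a much lower value.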
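For relative criteria, consider the set of parameters of a clustering algorithm.
Statement of the problem: choose the parameter values that give the best clustering of X when the number of clusters, m, is not itself one of the parameters.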
The estimation of the best set of parameter values is carried out as follows:
- Run the algorithm for a wide range of values of its parameters.
- Plot the number of clusters, m, versus the parameters.
- Choose the widest range for which m remains constant.
- Adopt the clustering that corresponds to the values of the parameters that lie in the middle of this range.
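The procedure above can be sketched for a single parameter. The clustering algorithm here is a hypothetical toy (a gap threshold on sorted 1-D values), used only so that the sweep has something to run. The data and the swept range are likewise assumptions, and the outcome depends on how wide a range is swept.

```python
def count_clusters(values, theta):
    """Toy 1-D algorithm (assumption): a new cluster starts whenever the
    gap between consecutive sorted values exceeds the threshold theta."""
    ordered = sorted(values)
    return 1 + sum(b - a > theta for a, b in zip(ordered, ordered[1:]))


def choose_parameter(values, thetas):
    """Sweep theta, find the widest range with constant m, return its middle."""
    ms = [count_clusters(values, t) for t in thetas]
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(ms) + 1):
        # A run of constant m ends at index i - 1.
        if i == len(ms) or ms[i] != ms[start]:
            if thetas[i - 1] - thetas[start] > thetas[best_end] - thetas[best_start]:
                best_start, best_end = start, i - 1
            start = i
    mid = (best_start + best_end) // 2
    return thetas[mid], ms[mid]


# Hypothetical data with three well-separated groups.
data = [0, 2, 4, 50, 52, 54, 100, 103]
theta, m = choose_parameter(data, list(range(1, 61)))
print(theta, m)
```

Printing the list of m values against the thresholds, instead of only the chosen one, gives the plot described in the second step.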
Hard clustering
• Modified Hubert Γ statistic
• Dunn and Dunn-like indices
• Davies-Bouldin (DB) and DB-like indices
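As one concrete example from this list, the Dunn index is usually defined as the smallest distance between points of different clusters divided by the largest cluster diameter, so larger values indicate compact, well-separated clusters. The sketch below follows that common definition. The function name and the sample clusters are hypothetical, and it assumes at least two clusters, one of which has two or more points.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Minimum inter-cluster distance over maximum cluster diameter."""
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Hypothetical clusterings of the same four points.
compact = [[(0, 0), (1, 0)], [(5, 0), (6, 0)]]
mixed = [[(0, 0), (5, 0)], [(1, 0), (6, 0)]]
print(dunn_index(compact), dunn_index(mixed))
```

The compact grouping scores far higher than the mixed one, which is the behaviour a relative criterion relies on when comparing clusterings of the same data.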
Fuzzy clustering
Indices for clusters with point representatives
o Partition coefficient (PC)
o Partition entropy coefficient (PE)
o Xie-Beni (XB) index
o Fukuyama-Sugeno index
o Total fuzzy hypervolume
o Average partition density
o Partition density
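The first two indices in this list depend only on the membership values. With u_ij the membership of point i in cluster j and N points, the usual definitions are PC = (1/N) Σ_i Σ_j u_ij² and PE = -(1/N) Σ_i Σ_j u_ij log u_ij. The sketch below assumes the natural logarithm and a membership matrix stored as one row per point. Both choices, and the sample matrices, are illustrative.

```python
import math


def partition_coefficient(memberships):
    """PC: mean over points of the sum of squared memberships."""
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def partition_entropy(memberships):
    """PE with natural log (assumed base); zero memberships contribute 0."""
    n = len(memberships)
    return -sum(u * math.log(u) for row in memberships for u in row if u > 0) / n


# Hypothetical membership matrices for three points and two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(fuzzy), partition_entropy(fuzzy))
```

A crisp partition gives PC = 1 and PE = 0. Fully shared memberships over m clusters give PC = 1/m and PE = log m, the least decisive case.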
Fuzzy clustering (cont.)
Indices for shell-shaped clusters
o Fuzzy shell density
o Average partition shell density
o Shell partition density
o Total fuzzy average shell thickness