














































Clustering (computer science, data mining)
Typology: Lecture notes
Home Exam
Today 14.11.
Today's subject:
o Classification, clustering
Next week's program:
o Lecture: Data mining process
o Exercise: Classification, clustering
o Seminar: Classification, clustering
Overview
What is cluster analysis?
Similarity and dissimilarity
Types of data in cluster analysis
Major clustering methods
Partitioning methods
Hierarchical methods
Outlier analysis
Summary
Cluster: a collection of data objects
o similar to one another within the same cluster
o dissimilar to the objects in the other clusters
Aim of clustering: to group a set of data objects into clusters
Marketing: discovering distinct customer groups in a purchase database
Land use: identifying areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City-planning: identifying groups of houses according to their house type, value, and geographical location
A good clustering method will produce high quality clusters with
o high intra-class similarity
o low inter-class similarity
The quality of a clustering result depends on
o the similarity measure used
o implementation of the similarity measure
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
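The two quality criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure and using hypothetical toy points and labels that are not part of the notes. A good clustering should give a small mean distance within clusters and a large mean distance between clusters.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one pair of points shares a label and at least one
    pair does not; otherwise one of the means is undefined.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label count as intra-class, others as inter-class.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A low intra-cluster distance corresponds to high intra-class similarity, and a high inter-cluster distance corresponds to low inter-class similarity. Other similarity measures would give different numbers, which is why the result depends on the measure chosen.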
Ability to deal with noise and outliers
Insensitivity to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
There is no single definition of similarity or dissimilarity between data objects
The definition of similarity or dissimilarity between objects depends on
o the type of the data considered
o what kind of similarity we are looking for
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Complex data types
Continuous measurements of a roughly linear scale
For example, weight, height and age
The measurement unit can affect the cluster analysis
To avoid dependence on the measurement unit, we should standardize the data
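The notes do not say which standardization to use. As one possible illustration, the sketch below assumes the z-score (subtract the mean, divide by the standard deviation). The height values are hypothetical. It shows that the same measurements in centimetres and in metres give the same standardized values.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation.

    The z-score is an assumed choice here; other standardizations
    (for example, dividing by the mean absolute deviation) also exist.
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        # All values are identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(x - m) / s for x in values]


# Hypothetical heights of the same three people in two units.
heights_cm = [160.0, 170.0, 180.0]
heights_m = [1.60, 1.70, 1.80]

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
print(z_cm)
print(z_m)
```

After standardization, each variable has mean 0 and unit spread, so no variable dominates a distance computation merely because of its unit.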
One group of popular distance measures for interval-scaled variables are Minkowski distances

$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$

where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

If q = 1, the distance measure is Manhattan (or city block) distance

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

If q = 2, the distance measure is Euclidean distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
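The Minkowski distance and its two special cases can be sketched in a few lines. The function name `minkowski` and the sample objects are illustrative choices, not taken from the notes.

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional objects i and j.

    q = 1 gives the Manhattan (city block) distance,
    q = 2 gives the Euclidean distance.
    """
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


# Hypothetical 2-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)

manhattan = minkowski(i, j, 1)  # |1 - 4| + |2 - 6|
euclidean = minkowski(i, j, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these two objects, the coordinate differences are 3 and 4, so the Manhattan distance is their sum and the Euclidean distance is the length of the hypotenuse.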
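Simple matching coefficient (invariant similarity, if the binary variable is symmetric):

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):

$$d(i, j) = \frac{b + c}{a + b + c}$$

where a is the number of variables that equal 1 for both objects, b the number that equal 1 for i and 0 for j, c the number that equal 0 for i and 1 for j, and d the number that equal 0 for both objects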
Example: dissimilarity between binary variables:
a patient record table
eight attributes, of which
o gender is a symmetric attribute, and
o the remaining attributes are asymmetric binary

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
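The two binary coefficients can be applied to the patient table as follows. This is a minimal sketch under two assumptions that this part of the notes does not state: the values Y and P are coded as 1 and N as 0, and only the six asymmetric attributes (everything except name and gender) enter the Jaccard computation. The counts a, b, c, d follow the usual contingency-table convention.

```python
def contingency(i, j):
    """Count attribute agreements between two binary objects.

    a: both 1, b: i is 1 and j is 0, c: i is 0 and j is 1, d: both 0.
    """
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(i, j):
    """(b + c) / (a + b + c + d), for symmetric binary variables."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c), for asymmetric binary variables.

    Undefined when both objects are 0 on every attribute.
    """
    a, b, c, _ = contingency(i, j)
    return (b + c) / (a + b + c)


def encode(row):
    """Assumed coding: Y and P become 1, N becomes 0."""
    return [1 if v in ("Y", "P") else 0 for v in row]


# The six asymmetric attributes from the table, in column order.
jack = encode(["Y", "N", "P", "N", "N", "N"])
mary = encode(["Y", "N", "P", "N", "P", "N"])
jim = encode(["Y", "P", "N", "N", "N", "N"])

d_jack_mary = jaccard_dissimilarity(jack, mary)
d_jack_jim = jaccard_dissimilarity(jack, jim)
d_jim_mary = jaccard_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under this coding, Jack and Mary come out as the most similar pair and Jim and Mary as the least similar. The Jaccard coefficient ignores the shared negatives (d), which is why it suits attributes where a positive value carries more information than a negative one.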