










Topics included in this course are Data Warehousing Concepts, Design and Development, Extraction, Transformation and Loading, OLAP Technology, Data Mining Techniques: Classification, Clustering and Decision Tree, Advanced Topics. This lecture includes: Clustering, Grouping, Records, Observations, Tasks, Classes, Homogeneity, Unsupervised, Algorithm, Segment
Typology: Slides











classes of similar objects
homogeneity in subgroups
geographic areas, according to zip code
lifestyle types
"Established executives, professionals, and 'old money' heirs that live
in America's wealthiest suburbs..."
Cluster analysis addresses similar issues encountered in classification
- Similarity measurement
- Recoding categorical variables
- Standardizing and normalizing variables
- Number of clusters
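Distance-based similarity is sensitive to the scale of each variable, which is why standardizing and normalizing appear in the list above. A minimal sketch of two common rescalings, min-max normalization and z-score standardization, follows. The function names and the sample income values are hypothetical and are not taken from the lecture.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical values for a single variable (not from the lecture).
incomes = [20000, 35000, 50000, 80000]
normalized = min_max_normalize(incomes)
standardized = z_score_standardize(incomes)
```

After either rescaling, a variable measured in large units no longer dominates a distance computed across several variables.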
Measuring Similarity
- Euclidean Distance measures distance between records:
  $$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$$
  where $x = x_1, x_2, \ldots, x_m$ and $y = y_1, y_2, \ldots, y_m$ represent the $m$ attribute values of two records
- Other distance measurements include City-Block Distance and Minkowski Distance:
  $$d_{\text{City-Block}}(x, y) = \sum_{i} |x_i - y_i|$$
  $$d_{\text{Minkowski}}(x, y) = \left( \sum_{i} |x_i - y_i|^q \right)^{1/q}$$
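The three distance measures named above can be sketched in a few lines. The function names and the two sample records are assumptions made for illustration. Minkowski distance generalizes the other two: q = 1 gives City-Block distance and q = 2 gives Euclidean distance.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def city_block(x, y):
    """Sum of the absolute attribute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    """q-th root of the summed q-th powers of absolute differences."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

# Two hypothetical records with m = 2 attributes each.
x = (1, 1)
y = (3, 3)

d_euclidean = euclidean(x, y)
d_city_block = city_block(x, y)
d_minkowski_3 = minkowski(x, y, 3)
```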
Between-cluster variation (BCV) large, as compared to within-cluster variation (WCV)
[Figure: between-cluster variation and within-cluster variation]
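k-Means Clustering
k-Means Algorithm
- Step 1: Analyst specifies k, the number of clusters into which to partition the data
- Step 2: Randomly assign k records as initial cluster centers
- Step 3: For each record, find nearest cluster center. Each cluster center "owns" a subset of records. Results in k clusters
- Step 4: For each of the k clusters, update cluster center location to centroid
- Step 5: Repeat Steps 3 to 5 until convergence or termination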
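The assign-then-update loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: the function names are hypothetical, the initial centers are passed in explicitly instead of being drawn at random so the run is repeatable, and the six sample records are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def centroid(records):
    """Attribute-wise mean of a non-empty list of records."""
    n = len(records)
    return tuple(sum(r[j] for r in records) / n for j in range(len(records[0])))

def k_means(records, initial_centers, max_iter=100):
    """Alternate assignment and centroid update until the centers stop moving."""
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment: each record is "owned" by its nearest center.
        clusters = [[] for _ in centers]
        for r in records:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(r, centers[i]))
            clusters[nearest].append(r)
        # Update: move each center to the centroid of the records it owns.
        # An empty cluster keeps its previous center (an assumption of this sketch).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # convergence: no center moved
        centers = new_centers
    return centers, clusters

# Hypothetical records forming two well-separated groups.
records = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
final_centers, final_clusters = k_means(records, [(1, 1), (8, 8)])
```

The `max_iter` cap plays the role of the "termination" condition when the centers never settle.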
k-Means Clustering (cont'd)
- For each of the k clusters, all records "owned" by the cluster remain in the cluster
- SSE:
  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$
  where $p \in C_i$ represents each data point in cluster $i$ and $m_i$ represents the centroid of cluster $i$
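The SSE criterion can be computed directly from its definition. The sketch below assumes Euclidean distance for d, so each squared distance is a sum of squared attribute differences. The function name and the single three-record cluster are illustrative assumptions.

```python
def sse(clusters, centers):
    """Sum over clusters of squared Euclidean distances from points to their centroid."""
    total = 0.0
    for cluster, m in zip(clusters, centers):
        for p in cluster:
            total += sum((pj - mj) ** 2 for pj, mj in zip(p, m))
    return total

# One illustrative cluster of three records and its centroid.
clusters = [[(1, 3), (1, 2), (1, 1)]]
centers = [(1, 2)]
total_sse = sse(clusters, centers)
```

A lower value means the records sit closer to the centroids of the clusters that own them.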
k-Means Clustering at
Randomly assign
For example, m1 = (1, 1) and m2
First Iteration
- Step 3: For each record, find nearest cluster center
- Euclidean distance from points to m1 and m2 shown

Point   Coordinates
a       (1, 3)
b       (3, 3)
c       (4, 3)
d       (5, 3)
e       (1, 2)
f       (4, 2)
g       (1, 1)
h       (1, 2)

Point   Distance from m1   Distance from m2   Cluster Membership
a       2.                 2.                 C1
b       2.                 2.                 C2
c       3.                 2.                 C2
d       4.                 3.                 C2
e       1.                 1.                 C1
f       3.                 2.                 C2
g       0.                 1.                 C1
h       1.                 0.                 C2
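The "Distance from m1" values can be recomputed from the listed coordinates and the center m1 = (1, 1). The sketch below uses the coordinates exactly as printed above and `math.dist`, which is available in Python 3.8 and later. The variable names are hypothetical.

```python
import math

m1 = (1, 1)

# Coordinates as listed for records a through h.
points = {
    "a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
    "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (1, 2),
}

# Euclidean distance from every record to m1.
dist_m1 = {name: math.dist(p, m1) for name, p in points.items()}

for name, d in dist_m1.items():
    print(f"{name}: {d:.2f}")
```

The same dictionary comprehension, with a second center substituted for `m1`, gives the other distance column. A record then belongs to whichever center yields the smaller value.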
k-Means Clustering at
[Figure: m1 and m2 (triangles) after first iteration of algorithm]
k-Means Clustering at
Repeats Steps 3-4 until convergence or termination
Second Iteration
- Repeat procedure for Steps 3-4
- Again, for each record find nearest cluster center m1 = (1, 2) or m2
- m1 contains {a, e, g, h} and m2 has {b, c, d, f}
- cluster variation
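The centroid update can be checked against the example. The first pass assigned records a, e, and g to the first cluster, and their attribute-wise mean reproduces the center m1 = (1, 2) used in the second iteration. The helper name `centroid` is an assumption of this sketch.

```python
def centroid(records):
    """Attribute-wise mean of a non-empty list of records."""
    n = len(records)
    dims = len(records[0])
    return tuple(sum(r[j] for r in records) / n for j in range(dims))

# Records a, e and g, the members of the first cluster after the first pass.
c1 = [(1, 3), (1, 2), (1, 1)]
m1 = centroid(c1)
```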
k-Means Clustering at
Third (Final) Iteration
- Repeat procedure for Steps 3-4
- Now, for each record find nearest cluster center m1 = (..., 1.75) or m2
k-Means Clustering at
Summary
- improves probability of achieving global minimum
- remaining clusters placed far from previous centers (Moore)
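One way to read the placement idea in the summary is a greedy rule: choose the first center at random, then repeatedly add the record that lies farthest from the centers chosen so far. The sketch below follows that reading. The greedy rule, the function names, and the sample records are assumptions, and this is not necessarily the exact procedure attributed to Moore.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def far_apart_centers(records, k, seed=None):
    """Pick k initial centers: one at random, the rest far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(records)]
    while len(centers) < k:
        # Score each record by its distance to the closest chosen center,
        # then take the record with the largest score.
        farthest = max(records,
                       key=lambda r: min(euclidean(r, c) for c in centers))
        centers.append(farthest)
    return centers

# Hypothetical records forming two distant groups.
records = [(0, 0), (0, 1), (10, 10), (10, 11)]
initial_centers = far_apart_centers(records, k=2, seed=0)
```

Running the clustering from several such starting sets and keeping the result with the lowest SSE is a common way to reduce the risk of settling in a poor local minimum.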