




































Material Type: Notes; Professor: Holmes; Class: DATA MINING; Subject: Statistics & Applied Probability; University: University of California - Santa Barbara; Term: Unknown 1989;
Typology: Study notes





































[Figure: between-cluster variation compared with within-cluster variation]
Hierarchical clustering builds clusters by recursively partitioning (Divisive Methods) or combining (Agglomerative Methods) existing clusters
Number of clusters reduced by one at each step
Sample data: 2 5 9 15 16 18 25 33 33 45
[Figure: dendrogram of the agglomerative merges. Clusters shown: {33, 33}; {15, 16}; {15, 16, 18}; {2, 5}; {2, 5, 9}; {2, 5, 9, 15, 16, 18}; {2, 5, 9, 15, 16, 18, 25}; {2, 5, 9, 15, 16, 18, 25, 33, 33}; {2, 5, 9, 15, 16, 18, 25, 33, 33, 45}]
Complete linkage explored using sample data
We want the distance between the records in two clusters that are farthest from each other to be minimized
Sample data: 2 5 9 15 16 18 25 33 33 45
[Figure: dendrogram for complete linkage. Clusters shown: {33, 33}; {15, 16}; {2, 5}; {15, 16, 18}; {2, 5, 9}; {25, 33, 33}; {2, 5, 9, 15, 16, 18}; {25, 33, 33, 45}; {2, 5, 9, 15, 16, 18, 25, 33, 33, 45}]
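The merge sequences in the dendrograms can be checked with a short script. This is a minimal sketch, not the notes' own code: the function name `agglomerate` and its brute-force pairwise search are assumptions made for illustration. Under single linkage the distance between two clusters is taken as the smallest pairwise distance between their records; under complete linkage it is the largest.

```python
def agglomerate(values, linkage="single"):
    """Merge 1-D clusters step by step; return (distance, merged cluster) per step.

    Hypothetical helper for illustration only.
    """
    clusters = [[v] for v in sorted(values)]
    # single linkage: closest pair of records; complete linkage: farthest pair
    pick = min if linkage == "single" else max
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pick(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        # each step replaces two clusters with one, so the count drops by one
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        merges.append((d, merged))
    return merges


data = [2, 5, 9, 15, 16, 18, 25, 33, 33, 45]
single = agglomerate(data, "single")
complete = agglomerate(data, "complete")
for distance, cluster in complete:
    print(distance, cluster)
```

With this sample data the two criteria diverge late in the process: single linkage keeps absorbing the nearest neighbor into one growing cluster, while complete linkage forms {25, 33, 33, 45} separately before the final merge.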
Assume n data points (a1, b1, c1), (a2, b2, c2), ..., (an, bn, cn)
Centroid of points is center of gravity of points
Located at point
$$\left( \frac{\sum a_i}{n},\ \frac{\sum b_i}{n},\ \frac{\sum c_i}{n} \right)$$
For example, points (1, 1, 1), (1, 2, 1), (1, 3, 1), and (2, 1, 1) have centroid (1.25, 1.75, 1.00)
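The centroid formula is a coordinate-wise mean, which the following sketch computes for the four example points. The function name `centroid` is an assumption chosen for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of equal-length point tuples."""
    n = len(points)
    # zip(*points) groups all first coordinates, all second coordinates, etc.
    return tuple(sum(coords) / n for coords in zip(*points))


example_points = [(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1)]
print(centroid(example_points))  # (1.25, 1.75, 1.0)
```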
Assume k = 2 to cluster following data points

Point        a       b       c       d       e       f       g       h
Coordinates  (1, 3)  (3, 3)  (4, 3)  (5, 3)  (1, 2)  (4, 2)  (1, 1)  (2, 1)

Step 1: k = 2 specifies number of clusters to partition
Step 2: Randomly assign k = 2 cluster centers
For example, m1 = (1, 1) and m2 = (2, 1)
First Iteration
Step 3: For each record, find nearest cluster center
Euclidean distance from points to m1 and m2 shown

Point               a     b     c     d     e     f     g     h
Distance from m1    2.00  2.83  3.61  4.47  1.00  3.16  0.00  1.00
Distance from m2    2.24  2.24  2.83  3.61  1.41  2.24  1.00  0.00
Cluster Membership  C1    C2    C2    C2    C1    C2    C1    C2
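Step 3 can be sketched as a nearest-center lookup. The names `assign`, `points`, and `centers` are assumptions for illustration, and point h is taken to be (2, 1), the coordinate consistent with its distances of 1.00 from m1 and 0.00 from m2. Cluster index 0 stands for C1 and index 1 for C2.

```python
import math

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
centers = [(1, 1), (2, 1)]  # m1 and m2 from Step 2


def assign(points, centers):
    """Map each point name to the index of its nearest center (Euclidean)."""
    membership = {}
    for name, p in points.items():
        distances = [math.dist(p, m) for m in centers]
        membership[name] = distances.index(min(distances))
    return membership


membership = assign(points, centers)
for name, p in points.items():
    d1, d2 = (math.dist(p, m) for m in centers)
    print(name, f"{d1:.2f}", f"{d2:.2f}", f"C{membership[name] + 1}")
```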
Cluster m1 contains {a, e, g} and m2 has {b, c, d, f, h}
Cluster membership assigned, now SSE calculated
Recall clusters constructed where between-cluster variation (BCV) large, as compared to within-cluster variation (WCV)
Ratio BCV/WCV expected to increase for successive iterations
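$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 = 2^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1^2 + 2.24^2 + 0^2 + 0^2 = 36$$
d(m1, m2) surrogate for BCV, SSE surrogate for WCV
$$\frac{\mathrm{BCV}}{\mathrm{WCV}} = \frac{d(m_1, m_2)}{\mathrm{SSE}} = \frac{1}{36} = 0.0278$$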
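The first-pass SSE and the BCV/WCV ratio can be reproduced directly from the cluster memberships. This is a sketch under stated assumptions: the variable names and the helper `squared_distance` are illustrative, and point h is taken as (2, 1), consistent with its membership in the m2 cluster at distance zero.

```python
import math

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
centers = {"C1": (1, 1), "C2": (2, 1)}
clusters = {"C1": ["a", "e", "g"], "C2": ["b", "c", "d", "f", "h"]}


def squared_distance(p, q):
    """Squared Euclidean distance, kept exact for integer coordinates."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


# SSE: squared distance from every point to its own cluster center (WCV surrogate)
sse = sum(squared_distance(points[name], centers[label])
          for label, names in clusters.items() for name in names)
# Distance between the two centers (BCV surrogate)
bcv = math.dist(centers["C1"], centers["C2"])
ratio = bcv / sse
print(sse, round(ratio, 4))  # 36 0.0278
```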
Repeat procedure for Steps 3 – 4
Now, for each record find nearest cluster center m1 = (1.25, 1.75) or m2 = (4, 2.75)
SSE = 6.23, and BCV/WCV = 0.4703
Again, BCV/WCV has increased compared to the previous value of 0.3346
This time, no records shift cluster membership
Centroids remain unchanged, therefore algorithm terminates
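The whole procedure, assignment followed by centroid update and repeated until the centers stop moving, can be sketched as a short loop. The function name `kmeans` is an assumption for illustration, point h is taken as (2, 1), and the sketch does not handle a cluster becoming empty, which does not happen with this data.

```python
import math


def kmeans(points, centers):
    """Alternate assignment and centroid update until the centers are unchanged."""
    while True:
        clusters = [[] for _ in centers]
        for p in points:
            # assign each record to its nearest cluster center
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # update each center to the centroid of its members
        # (an empty cluster would raise ZeroDivisionError here)
        new_centers = [tuple(sum(c) / len(members) for c in zip(*members))
                       for members in clusters]
        if new_centers == centers:
            return centers, clusters
        centers = new_centers


data_points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
final_centers, final_clusters = kmeans(data_points, [(1, 1), (2, 1)])
print(final_centers)  # [(1.25, 1.75), (4.0, 2.75)]
```

Starting from m1 = (1, 1) and m2 = (2, 1), the loop stops at the same centers reported above, with four records in each cluster.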