Clustering-Data Warehouse-Lecture Slides, Slides of Data Warehousing

Topics include in this course are Data Warehousing Concepts, Design and Development, Extraction, Transformation and Loading, OLAP Technology, Data Mining Techniques: Classification, Clustering and Decision Tree, Advanced Topics. This lecture includes: Clustering, Grouping, Records, Observations, Tasks, Classes, Records, Homogeneity, Unsupervised, Algorithm, Segment

Typology: Slides

2011/2012

Uploaded on 08/08/2012

sharib_sweet
sharib_sweet šŸ‡®šŸ‡³

4.2

(50)

102 documents

1 / 18

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
2
Clustering Task
– Clustering refers to grouping records, observations, or tasks into
classes of similar objects
– Cluster is collection records similar to one another
– Records in one cluster dissimilar to records in other clusters
– Clustering is unsupervised data mining task
– Therefore, no target variable specified
– Clustering algorithms segment records and maximize
homogeneity in subgroups
– Similarity to records outside cluster minimized
docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12

Partial preview of the text

Download Clustering-Data Warehouse-Lecture Slides and more Slides Data Warehousing in PDF only on Docsity!

2

Clustering Task

  • Clustering refers to grouping records, observations, or tasks into

classes of similar objects

  • Cluster is collection records similar to one another– Records in one cluster dissimilar to records in other clusters– Clustering is unsupervised data mining task– Therefore, no target variable specified– Clustering algorithms segment records and maximize

homogeneity in subgroups

  • Similarity to records outside cluster minimized

3

Clustering Task

(cont’d)

  • For example, Claritas, Inc. provides demographic profiles of

geographic areas, according to zip code

  • PRIZM segmentation system clusters zip codes in terms of

lifestyle types

  • Recall clusters identified for 90210 Beverly Hills, CA– Cluster 01:

Blue Blood Estates

ā€œEstablished executives, professionals, and ā€˜old money’ heirs that live

in America’s wealthiest suburbs...ā€

  • Cluster 10:

Bohemian Mix

  • Cluster 02:

Winner’s Circle

  • Cluster 07:

Money and Brains

  • Cluster 08:

Young Literati

5

Clustering Task

(cont’d)

  • Applying cluster analysis to enormous databases helpful– Reduces search space for downstream algorithms

Cluster analysis addresses similar issues encountered inclassification– Similarity measurement– Recoding categorical variables– Standardizing and normalizing variables– Number of clusters

6

Clustering Task

(cont’d)

Measuring Similarity– Euclidean Distance measures distance between records– Other distance measurements include City-Block Distance and

Minkowski Distance

records two of

values

attribute

represent

,..., ,

and

,..., ,

where, )

(

) , (

2 1

2 1

2

Euclidean

m y y y x x x

y x

d

m

m

i

i i 





^



y

x

y x

q

i

i i i

i i

y x

d

y x

d

^ 







 ) , (

) , (

Minkowski

Block- City

y x

y x

8

Clustering Task

(cont’d)

  • Clustering identifies groups of highly-similar records– Algorithms construct clusters where between-cluster variation

(BCV) large, as compared to within-cluster variation (WCV)

  • Analogous to concept behind analysis of variance

Between-cluster variation:Within-cluster variation:

9 k-Means Clustering

k-Means effective at finding clusters in data

k-Means Algorithm– Step 1:

Analyst specifies

k = number of clusters to partition

data

  • Step 2:

k records randomly assigned to initial clusters

  • Step 3:

For each record, find cluster centerEach cluster center ā€œownsā€ subset of recordsResults in

k clusters, C

1 , C

2 , ...., C

k

  • Step 4:

For each of

k clusters, find cluster centroid

Update cluster center location to centroid

  • Step 5:

Repeats Steps 3 – 5 until convergence or termination

11 k-Means Clustering

(cont’d)

k-Means algorithm terminates when centroids no longer change

  • For

k clusters, C

1 , C

2 , ...., C

k , all records ā€œownedā€ by cluster

remain in cluster

  • Convergence criterion may also cause termination– For example, no significant reduction in SSE

i

cluster of

centroid

represents

i

cluster in

point

data

each

where, ) , (

1

2

 

ļƒŽ

^

 ļƒ„ļ€½

ļƒŽ

i

i

k i

Ci p

i

m

C p

m p d

SSE

12

Example of

k-Means Clustering at

Work

  • Assume

k = 2 to cluster following data points

  • Step 1:

k = 2 specifies number of clusters to partition

  • Step 2:

Randomly assign

k = 2 cluster centers

For example, m

1

= (1, 1) and m

2

First Iteration– Step 3:

For each record, find nearest cluster centerEuclidean distance from points to m

1

and m

2

shown

a^

b^

c^

d^

e^

f^

g^

h

(1, 3)

(3, 3)

(4, 3)

(5, 3)

(1, 2)

(4, 2)

(1, 1)

(1, 2)

Point

a^

b^

c^

d^

e^

f^

g^

h

Distance from m

1

2.

2.

3.

4.

1.

3.

0.

1.

Distance from m

2

2.

2.

2.

3.

1.

2.

1.

0.

Cluster Membership

C^1

C^2

C^2

C^2

C^1

C^2

C^1

C^2

14

Example of

k-Means Clustering at

Work

(cont’d)

  • Step 4:

For

k clusters, find cluster centroid, update location

  • Cluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2), Cluster 2 =

[(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

  • Figure shows movement of clusters m

1

and m

2

(triangles) after

first iteration of algorithm

0

1

2

3

4

5

6

5 4 3 2 1 0

15

Example of

k-Means Clustering at

Work

(cont’d)

  • Step 5:

Repeats Steps 3 – 4 until convergence or termination

Second Iteration– Repeat procedure for Steps 3 – 4– Again, for each record find nearest cluster center m

1

= (1, 2) or

m

2

  • Cluster m

1

contains {a, e, g, h} and m

2

has {b, c, d, f}

  • SSE = 7.86, and BCV/WCV = 0.3346– Note 0.3346 has increased compared to First Iteration value =
  • Between-cluster variation increasing with respect to Within-

cluster variation

17

Example of

k-Means Clustering at

Work

(cont’d)

Third (Final) Iteration– Repeat procedure for Steps 3 – 4– Now, for each record find nearest cluster center m

1

1.75) or m

2

  • SSE = 6.23, and BCV/WCV = 0.4703– Again, BCV/WCV has increased compared to previous = 0.3346– This time, no records shift cluster membership– Centroids remain unchanged, therefore algorithm terminates

18

Example of

k-Means Clustering at

Work

(cont’d)

Summary–

k-Means not guaranteed to find to find global minimum SSE

  • Instead, local minimum found– Invoking algorithm using variety of initial cluster centers

improves probability of achieving global minimum

  • One approach places first cluster at random point, with

remaining clusters placed far from previous centers

(Moore)

  • What is appropriate value for

k?

  • Potential problem for applying

k-Means

  • Analyst may have

a priori knowledge of

k