Clustering-Data Warehouse-Lecture Slides, Slides of Data Warehousing

Baddi University of Emerging Sciences and Technologies Data Warehousing

Topics included in this course are Data Warehousing Concepts, Design and Development, Extraction, Transformation and Loading, OLAP Technology, Data Mining Techniques: Classification, Clustering and Decision Tree, Advanced Topics. This lecture includes: Clustering, Grouping, Records, Observations, Tasks, Classes, Homogeneity, Unsupervised, Algorithm, Segment

Typology: Slides

2011/2012

Uploaded on 08/08/2012

sharib_sweet 🇮🇳


Clustering Task

– Clustering refers to grouping records, observations, or tasks into classes of similar objects
– A cluster is a collection of records similar to one another
– Records in one cluster are dissimilar to records in other clusters
– Clustering is an unsupervised data mining task
– Therefore, no target variable is specified
– Clustering algorithms segment records and maximize homogeneity in subgroups
– Similarity to records outside the cluster is minimized


Clustering Task (cont’d)

– For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip code
– PRIZM segmentation system clusters zip codes in terms of lifestyle types
– Recall clusters identified for 90210 Beverly Hills, CA:
  – Cluster 01: Blue Blood Estates. “Established executives, professionals, and ‘old money’ heirs that live in America’s wealthiest suburbs...”
  – Cluster 10: Bohemian Mix
  – Cluster 02: Winner’s Circle
  – Cluster 07: Money and Brains
  – Cluster 08: Young Literati

Clustering Task (cont’d)

– Applying cluster analysis to enormous databases is helpful
  – Reduces search space for downstream algorithms
– Cluster analysis addresses similar issues encountered in classification
  – Similarity measurement
  – Recoding categorical variables
  – Standardizing and normalizing variables
  – Number of clusters

Clustering Task (cont’d)

– Measuring Similarity
  – Euclidean Distance measures distance between records
  – Other distance measurements include City-Block Distance and Minkowski Distance

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

where $x = x_1, x_2, \ldots, x_m$ and $y = y_1, y_2, \ldots, y_m$ represent the $m$ attribute values of two records

$$d_{\text{City-Block}}(x, y) = \sum_i |x_i - y_i|$$

$$d_{\text{Minkowski}}(x, y) = \left( \sum_i |x_i - y_i|^q \right)^{1/q}$$
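The three distance measures can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the two sample records are hypothetical, and the Minkowski form with the 1/q root is the common textbook definition, assumed here.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def city_block(x, y):
    """Sum of the absolute attribute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    """General exponent q (assumed form): q=1 is city-block, q=2 is Euclidean."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

# Two hypothetical records with m = 2 attribute values each.
x, y = (1, 3), (4, 7)
print(euclidean(x, y))     # 5.0
print(city_block(x, y))    # 7
print(minkowski(x, y, 2))  # matches the Euclidean value
```

In practice the attributes would usually be standardized or normalized first, so that no single attribute dominates the distance.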

Clustering Task (cont’d)

– Clustering identifies groups of highly similar records
– Algorithms construct clusters where between-cluster variation (BCV) is large compared to within-cluster variation (WCV)
– Analogous to concept behind analysis of variance

[Figure: between-cluster variation and within-cluster variation]

k-Means Clustering

– k-Means effective at finding clusters in data
– k-Means Algorithm
  – Step 1: Analyst specifies k = number of clusters to partition data
  – Step 2: k records randomly assigned as initial cluster centers
  – Step 3: For each record, find nearest cluster center. Each cluster center “owns” subset of records. Results in k clusters, C_1, C_2, ..., C_k
  – Step 4: For each of k clusters, find cluster centroid. Update cluster center location to centroid
  – Step 5: Repeat Steps 3 – 4 until convergence or termination
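The five steps can be sketched as a short Python loop. This is a minimal illustration under stated assumptions rather than the lecturer's implementation: the names `nearest`, `centroid`, and `k_means`, the `max_iter` cap, and the fixed `seed` are all hypothetical, and an empty cluster simply keeps its old center.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def centroid(points):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(records, k, max_iter=100, seed=0):
    """Step 1 is the caller's choice of k; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # Step 2: k records as initial centers
    for _ in range(max_iter):
        # Step 3: each record is "owned" by its nearest center
        clusters = [[] for _ in range(k)]
        for r in records:
            clusters[nearest(r, centers)].append(r)
        # Step 4: move each center to the centroid of its cluster
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        # Step 5: stop once the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Calling `k_means(records, 2)` on a list of coordinate tuples returns the final centers and the records each one owns. The result depends on the random initial centers, so different seeds may give different partitions.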

k-Means Clustering (cont’d)

– k-Means algorithm terminates when centroids no longer change
– For k clusters, C_1, C_2, ..., C_k, all records “owned” by cluster remain in cluster
– Convergence criterion may also cause termination
  – For example, no significant reduction in SSE

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

where $p \in C_i$ represents each data point in cluster $i$ and $m_i$ represents the centroid of cluster $i$
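A small Python sketch of the SSE and of an SSE-based stopping rule follows. The function names, the tolerance `tol`, and the two-cluster data are hypothetical, and the "no significant reduction" test is written here as a relative-reduction threshold, which is only one possible reading of that criterion.

```python
import math

def sse(clusters, centroids):
    """Sum over clusters of squared distances from each point to its centroid."""
    return sum(math.dist(p, m) ** 2
               for cluster, m in zip(clusters, centroids)
               for p in cluster)

def sse_converged(previous_sse, current_sse, tol=1e-3):
    """True when the relative SSE reduction falls below tol (hypothetical threshold)."""
    if previous_sse == 0:
        return True
    return (previous_sse - current_sse) / previous_sse < tol

# Hypothetical partition: two clusters, each paired with its centroid.
clusters = [[(0, 0), (0, 2)], [(4, 0), (6, 0)]]
centroids = [(0, 1), (5, 0)]
print(sse(clusters, centroids))  # 4.0
```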

Example of k-Means Clustering at Work

– Assume k = 2 to cluster following data points

  Point   Coordinates
  a       (1, 3)
  b       (3, 3)
  c       (4, 3)
  d       (5, 3)
  e       (1, 2)
  f       (4, 2)
  g       (1, 1)
  h       (2, 1)

– Step 1: k = 2 specifies number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers
  – For example, m_1 = (1, 1) and m_2 = [value missing in source]
– First Iteration
  – Step 3: For each record, find nearest cluster center
  – Euclidean distance from points to m_1 and m_2 determines cluster membership

  Point   Cluster Membership
  a       C_1
  b       C_2
  c       C_2
  d       C_2
  e       C_1
  f       C_2
  g       C_1
  h       C_2

Example of k-Means Clustering at Work (cont’d)

– Step 4: For k clusters, find cluster centroid, update location
  – Cluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2)
  – Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

[Figure: movement of cluster centers m_1 and m_2 (triangles) after first iteration of algorithm]
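The Step 4 arithmetic can be checked with a short Python sketch. The point coordinates are read off the sums in the centroid calculation above; the helper name `centroid` is hypothetical.

```python
def centroid(points):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

# Cluster members after the first assignment step.
cluster_1 = [(1, 3), (1, 2), (1, 1)]
cluster_2 = [(3, 3), (4, 3), (5, 3), (4, 2), (2, 1)]

print(centroid(cluster_1))  # (1.0, 2.0)
print(centroid(cluster_2))  # (3.6, 2.4)
```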

Example of k-Means Clustering at Work (cont’d)

– Step 5: Repeat Steps 3 – 4 until convergence or termination
– Second Iteration
  – Repeat procedure for Steps 3 – 4
  – Again, for each record find nearest cluster center m_1 = (1, 2) or m_2 = (3.6, 2.4)
  – Cluster m_1 contains {a, e, g, h} and m_2 has {b, c, d, f}
  – SSE = 7.86, and BCV/WCV = 0.3346
  – Note 0.3346 has increased compared to the First Iteration value
  – Between-cluster variation increasing with respect to within-cluster variation
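The second-iteration assignment can be reproduced with the sketch below. Several things are assumptions: point h is taken as (2, 1), which is what the Step 4 centroid arithmetic implies; BCV is taken as the distance between the two centers and WCV as the SSE, a reading that is consistent with the reported ratio but is not spelled out in the slides.

```python
import math

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = (1.0, 2.0), (3.6, 2.4)  # centroids found in the first iteration

# Step 3: each point joins the cluster of its nearest center.
cluster_1 = sorted(n for n, p in points.items()
                   if math.dist(p, m1) <= math.dist(p, m2))
cluster_2 = sorted(n for n in points if n not in cluster_1)

# Within-cluster variation as SSE; between-cluster variation as the
# distance between centers (assumed definitions, see the text above).
sse = sum(min(math.dist(p, m1), math.dist(p, m2)) ** 2
          for p in points.values())
bcv = math.dist(m1, m2)

print(cluster_1, cluster_2)
print(round(sse, 2), round(bcv / sse, 4))
```

Exact arithmetic gives an SSE of about 7.88 and a ratio of about 0.334, slightly different from the slide's 7.86 and 0.3346; the gap presumably comes from rounding the distances before squaring.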

Example of k-Means Clustering at Work (cont’d)

– Third (Final) Iteration
  – Repeat procedure for Steps 3 – 4
  – Now, for each record find nearest cluster center m_1 = (1.25, 1.75) or m_2 = (4, 2.75)
  – SSE = 6.23, and BCV/WCV = 0.4703
  – Again, BCV/WCV has increased compared to previous = 0.3346
  – This time, no records shift cluster membership
  – Centroids remain unchanged, therefore algorithm terminates
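The termination condition can be checked directly: recompute the centroids of {a, e, g, h} and {b, c, d, f}, reassign every point, and confirm that nothing moves. This is a sketch with hypothetical helper names, and it again assumes h = (2, 1).

```python
import math

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}

def centroid(pts):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def assign(centers):
    """For each center, the labels of the points nearest to it."""
    groups = [[] for _ in centers]
    for name, p in points.items():
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[j].append(name)
    return groups

# Centroids of the clusters from the second iteration.
centers = [centroid([points[n] for n in "aegh"]),
           centroid([points[n] for n in "bcdf"])]
print(centers)  # [(1.25, 1.75), (4.0, 2.75)]

groups = assign(centers)
new_centers = [centroid([points[n] for n in g]) for g in groups]
print(groups)                  # membership is unchanged
print(new_centers == centers)  # True, so the algorithm stops
```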

Example of k-Means Clustering at Work (cont’d)

– Summary
  – k-Means not guaranteed to find global minimum SSE
  – Instead, local minimum found
  – Invoking algorithm using variety of initial cluster centers improves probability of achieving global minimum
  – One approach places first cluster at random point, with remaining clusters placed far from previous centers (Moore)
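Both ideas in the summary can be sketched in Python: rerun k-Means from several random starts and keep the lowest-SSE result, or seed the centers so that each new one lies far from those already chosen. All names and defaults here are hypothetical, and `far_apart_centers` is only one plausible reading (farthest-first selection) of the approach attributed to Moore.

```python
import math
import random

def centroid(points):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def run_k_means(records, centers, max_iter=100):
    """Plain k-Means from the given initial centers; returns (centers, SSE)."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for r in records:
            j = min(range(len(centers)), key=lambda i: math.dist(r, centers[i]))
            clusters[j].append(r)
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(r, m) for m in centers) ** 2 for r in records)
    return centers, sse

def far_apart_centers(records, k, rng):
    """First center is a random record; each later center is the record
    farthest from all centers chosen so far (assumed interpretation)."""
    centers = [rng.choice(records)]
    while len(centers) < k:
        centers.append(max(records,
                           key=lambda r: min(math.dist(r, m) for m in centers)))
    return centers

def best_of_restarts(records, k, restarts=10, seed=0):
    """Run k-Means from several random starts and keep the lowest SSE."""
    rng = random.Random(seed)
    runs = [run_k_means(records, rng.sample(records, k))
            for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])
```

More restarts raise the chance of reaching the global minimum but never guarantee it; the far-apart seeding can be passed to `run_k_means` as its initial centers in place of a random sample.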

– What is appropriate value for k?
  – Potential problem for applying k-Means
  – Analyst may have a priori knowledge of k