Notes on Clustering - Data Mining | CSCI 243, Study notes of Computer Science

George Washington University (GW)Computer Science

Prof. Abdelghani Bellaachia

Material Type: Notes; Professor: Bellaachia; Class: Data Mining; Subject: Computer Science; University: George Washington University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/25/2010


A. Bellaachia

Clustering

1. Objectives
2. Clustering
2.1. Definitions
2.2. General Applications
2.3. What is a good clustering?
2.4. Requirements
3. Data Structures
4. Similarity Measures
4.1. Standardize data
4.2. Binary variables
4.3. Nominal Variables
4.4. Ordinal Variables
4.5. Ratio-scaled variables
4.6. Variables of mixed types
5. Clustering approaches
5.1. Major approaches
5.2. Partitioning approach
6. The K-means clustering method
7. The K-medoids Clustering Method
8. Hierarchical Clustering
8.1. AGNES (Agglomerative Nesting)
8.2. Divisive Analysis: DIANA
8.3. Analysis of hierarchical clustering
9. Outliers
9.1. Statistical Approach
9.2. Distance-Based Approach


1. Objectives

Techniques to group related data, classify datasets, and provide categorical labels, e.g., sports, technology, kids, etc.
Detection of patterns
Models to predict certain future behaviors.

2. Clustering

2.1. Definitions

Cluster: a collection of data objects
 o Similar to one another within the same cluster
 o Dissimilar to the objects in other clusters
Cluster analysis
 o Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
 o As a stand-alone tool to get insight into data distribution
 o As a preprocessing step for other algorithms

2.2. General Applications

 o Text mining:
   Document categorization
   Detection of topics
   Summarization
 o Web mining:
   Web log analysis
   Detection of groups of similar access patterns

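3. Data Structures

Data matrix (two modes): n objects described by p variables

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity (or similarity) matrix (one mode): pairwise distances d(i, j) between the n objects

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

4. Similarity Measures

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.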

Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
 o The answer is typically highly subjective.
Type of data in clustering analysis
 o Interval-scaled variables
 o Binary variables
 o Nominal, ordinal, and ratio variables
 o Variables of mixed types

4.1. Standardize data

Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

z-score: calculate the standardized measurement

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

Using the mean absolute deviation is more robust than using the standard deviation
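The standardization above can be sketched in a few lines of Python. This is only an illustration: the function name `standardize` and the sample values are assumptions made for the example and are not part of the notes.

```python
def standardize(values):
    """Return z-scores computed with the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of one variable f for five objects.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights)
print(z)
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so its z-score stays comparatively large and the outlier remains detectable.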

Also, one can use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures

4.2. Binary variables

A contingency table for binary data (rows: object i, columns: object j)

        | j = 1 | j = 0 | sum
i = 1   | a     | b     | a + b
i = 0   | c     | d     | c + d
sum     | a + c | b + d | p

Simple matching coefficient (invariant, if the binary variable is symmetric):

$$ d(i, j) = \frac{b + c}{a + b + c + d} $$

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$ d(i, j) = \frac{b + c}{a + b + c} $$
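As a rough check of the two coefficients, the sketch below applies them to the three patient records (Jack, Mary, Jim) used in this section's example. The function names are assumptions made for illustration; the vectors encode Fever, Cough and the four tests with Y and P as 1 and N as 0, and gender is left out because it is symmetric.

```python
def contingency(x, y):
    """Count a, b, c, d for two equal-length 0/1 vectors (x is object i, y is object j)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (b + c) / (a + b + c)


# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard(jack, mary)
d_jack_jim = jaccard(jack, jim)
d_jim_mary = jaccard(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under this encoding Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.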

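Example:

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M      | Y     | N     | P      | N      | N      | N
Mary | F      | Y     | N     | P      | N      | P      | N
Jim  | M      | Y     | P     | N      | N      | N      | N

gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be set to 1, and the value N be set to 0
Using the Jaccard coefficient on the asymmetric attributes:

$$ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$

4.3. Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
 o m: # of matches, p: total # of variables

$$ d(i, j) = \frac{p - m}{p} $$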
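Simple matching for nominal variables can be sketched as follows, assuming the usual form d(i, j) = (p - m) / p with m matches out of p variables. The function name and the two sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Simple matching for nominal variables: d(i, j) = (p - m) / p."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    p = len(x)                                       # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)   # number of matching states
    return (p - m) / p


# Hypothetical objects described by three nominal variables.
obj_i = ["red", "small", "round"]
obj_j = ["red", "large", "round"]
d = nominal_dissimilarity(obj_i, obj_j)   # one mismatch out of three variables
print(d)
```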

4.5. Ratio-scaled variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
 o Treat them like interval-scaled variables: not a good choice, because the scale can be distorted
 o Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
 o Treat them as continuous ordinal data and treat their ranks as interval-scaled
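A small sketch of the logarithmic transformation, with made-up values: measurements that grow by a constant factor become evenly spaced after the transform, which is why the transformed values can then be handled like interval-scaled data. The function name is an assumption for illustration.

```python
import math


def log_transform(values):
    """Apply y = log(x) to a ratio-scaled variable (values must be positive)."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]


# Hypothetical measurements growing by a factor of 10 at each step.
counts = [1.0, 10.0, 100.0, 1000.0]
y = log_transform(counts)
gaps = [y[k + 1] - y[k] for k in range(len(y) - 1)]
print(gaps)   # the gaps are (nearly) equal after the transformation
```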

4.6. Variables of mixed types

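A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:

$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

where the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if f is asymmetric binary and $x_{if} = x_{jf} = 0$; otherwise it is 1.

 o f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 o f is interval-based: use the normalized distance
 o f is ordinal or ratio-scaled: compute the ranks $r_{if}$, map them to

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

   where $M_f$ is the number of ordered states of variable f, and treat $z_{if}$ as interval-scaled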
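The weighted combination for mixed variable types can be sketched as below. Several details are assumptions made for the example: the function name, the `kinds` labels, the use of `None` for a missing value, and the normalization of interval variables by their range (maximum minus minimum), which is one common choice of normalized distance.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities: sum(delta * d_f) / sum(delta).

    kinds[f] is "nominal", "asym_binary" or "interval" (hypothetical labels);
    ranges[f] is max - min of variable f and is only used for interval variables.
    """
    num = 0.0
    den = 0.0
    for xf, yf, kind, rng in zip(x, y, kinds, ranges):
        if xf is None or yf is None:
            continue                  # delta = 0: a value is missing
        if kind == "asym_binary" and xf == 0 and yf == 0:
            continue                  # delta = 0: joint absence is ignored
        if kind == "interval":
            d_f = abs(xf - yf) / rng  # assumed normalization by the range
        else:
            d_f = 0.0 if xf == yf else 1.0
        num += d_f
        den += 1.0
    if den == 0:
        raise ValueError("no comparable variables")
    return num / den


# Two hypothetical objects: colour, symptom flag, age, second symptom flag.
kinds = ["nominal", "asym_binary", "interval", "asym_binary"]
ranges = [None, None, 50.0, None]
obj_a = ["red", 1, 30.0, 0]
obj_b = ["blue", 1, 40.0, 0]
d_ab = mixed_dissimilarity(obj_a, obj_b, kinds, ranges)
print(d_ab)
```

Here the last variable is skipped (both values are 0 on an asymmetric binary variable), so the result averages the three remaining contributions 1, 0 and 0.2.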

5. Clustering approaches

5.1. Major approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model

5.2. Partitioning approach

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
 o Global optimal: exhaustively enumerate all partitions
 o Heuristic methods: k-means and k-medoids algorithms
 o k-means (MacQueen’67): Each cluster is represented by the center of the cluster
 o k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
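A minimal sketch of the k-means heuristic, in which each cluster is represented by its center. It uses the standard assign-then-update iteration with squared Euclidean distance; taking the first k points as initial centers is an assumption made so the example is deterministic, and the data are made up.

```python
def kmeans(points, k, max_iter=100):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    centers = [list(p) for p in points[:k]]   # assumption: first k points as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                             # converged: nothing moved
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, assignment) if lab == c]
            if members:                       # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, 2)
print(centers, labels)
```

The result depends on the initial centers, which is one reason the choice of the initial k means matters in practice.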

Variations of the k-means method:

A few variants of k-means differ in
 o Selection of the initial k means
 o Dissimilarity calculations
 o Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
 o Replacing means of clusters with modes
 o Using new dissimilarity measures to deal with categorical objects
 o Using a frequency-based method to update modes of clusters
 o A mixture of categorical and numerical data: k-prototype method

Drawbacks of the k-means method
 o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
 o K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

7. The K-medoids Clustering Method

Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
 o starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
 o PAM works effectively for small data sets, but does not scale well for large data sets
CLARA (Kaufman & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
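The swap idea behind PAM can be sketched as follows. This is a simplified greedy variant, not the full published procedure: it accepts the first swap that lowers the total distance, starts from the first k objects as medoids, and uses Manhattan distance, all of which are assumptions made for the example.

```python
def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist):
    """Swap medoids with non-medoids while the total distance improves."""
    medoids = list(range(k))                  # assumption: first k objects as initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:               # keep any swap that lowers the cost
                    medoids, best, improved = trial, cost, True
    return medoids, best


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical data: two groups of three objects each.
points = [(1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12)]
medoids, cost = pam(points, 2, manhattan)
print(sorted(medoids), cost)
```

Every candidate swap requires a pass over all objects, which illustrates why PAM becomes expensive on large data sets.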

A dendrogram shows how the clusters are merged hierarchically
 o Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
 o A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
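Bottom-up merging and the effect of cutting the dendrogram can be sketched as below. The single-link rule (distance between the closest members of two clusters) is an assumption chosen for the example; stopping when a given number of clusters remains corresponds to cutting the tree at that level. The recorded merge list plays the role of the dendrogram.

```python
def single_link_clusters(points, num_clusters, dist):
    """Merge the two closest clusters repeatedly until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start: each object is its own cluster
    merges = []                                    # merge history (the dendrogram)
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters, merges


# Hypothetical one-dimensional data.
values = [0.0, 1.0, 1.5, 8.0, 9.0]
clusters, merges = single_link_clusters(values, 2, lambda x, y: abs(x - y))
print(sorted(sorted(c) for c in clusters))
```

Each pass compares every pair of clusters, so this naive version is far from efficient; it is meant only to make the merge order visible.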

[Figure: three scatter-plot panels, both axes running from 0 to 10; plot contents not recoverable]

8.2. Divisive Analysis: DIANA

Introduced in Kaufman and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-PLUS
Inverse order of AGNES
Eventually each node forms a cluster on its own

8.3. Analysis of hierarchical clustering

Major weakness of agglomerative clustering methods
 o do not scale well: time complexity of at least $O(n^2)$, where n is the total number of objects
Integration of hierarchical with distance-based clustering
 o BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 o CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
 o CHAMELEON (1999): hierarchical clustering using dynamic modeling

[Figure: three scatter-plot panels, both axes running from 0 to 10; plot contents not recoverable]

9.2. Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods
 o We need multi-dimensional analysis without knowing the data distribution.
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
 o Index-based algorithm: Uses an R-tree indexing structure. It takes $O(k \cdot n^2)$, without the cost of building the tree.
 o Nested-loop algorithm: Divides the dataset into blocks and looks for outliers block by block. It has the same complexity as the index-based algorithm.
 o Cell-based algorithm: Divides the data space into cells and looks for outliers cell by cell rather than point by point. It takes $O(n^2)$.
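The definition of a distance-based outlier can be checked directly with a plain double loop, as in the sketch below. It is a naive version of the nested-loop idea without the block partitioning; the function name, the data and the thresholds are made up for illustration.

```python
def db_outliers(points, p, D, dist):
    """Return the objects for which at least a fraction p of the dataset lies farther than D."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the objects whose distance from o exceeds D.
        far = sum(1 for j, q in enumerate(points) if j != i and dist(o, q) > D)
        if far >= p * n:
            outliers.append(o)
    return outliers


# Hypothetical one-dimensional data with one isolated value.
data = [1.0, 1.2, 0.9, 1.1, 10.0]
found = db_outliers(data, p=0.75, D=3.0, dist=lambda x, y: abs(x - y))
print(found)
```

Every object is compared with every other object, so the cost grows quadratically with the number of objects.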
