Notes on Clustering - Data Mining | CSCI 243, Study notes of Computer Science

Material Type: Notes; Professor: Bellaachia; Class: Data Mining; Subject: Computer Science; University: George Washington University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/25/2010

A. Bellaachia
Clustering
1. Objectives
2. Clustering
2.1. Definitions
2.2. General Applications
2.3. What is a good clustering?
2.4. Requirements
3. Data Structures
4. Similarity Measures
4.1. Standardize data
4.2. Binary variables
4.3. Nominal Variables
4.4. Ordinal Variables
4.5. Ratio-scaled variables
4.6. Variables of mixed types
5. Clustering approaches
5.1. Major approaches
5.2. Partitioning approach
6. The K-means clustering method
7. The K-medoids Clustering Method
8. Hierarchical Clustering
8.1. AGNES (Agglomerative Nesting)
8.2. Divisive Analysis: DIANA
8.3. Analysis of hierarchical clustering
9. Outliers
9.1. Statistical Approach
9.2. Distance-Based Approach


1. Objectives

  • Techniques to group data into related classes and provide categorical labels, e.g., sports, technology, kids, etc.
  • Detection of patterns
  • Models to predict certain future behaviors.

2. Clustering

2.1. Definitions

  • Cluster: a collection of data objects
      o Similar to one another within the same cluster
      o Dissimilar to the objects in other clusters
  • Cluster analysis
      o Grouping a set of data objects into clusters
  • Clustering is unsupervised classification: no predefined classes
  • Typical applications
      o As a stand-alone tool to get insight into data distribution
      o As a preprocessing step for other algorithms

2.2. General Applications

  • Text mining:
      o Document categorization
      o Detection of topics
      o Summarization
  • Web mining:
      o Web log analysis
      o Detection of groups of similar access patterns

3. Data Structures

  • Data matrix (two modes): n objects described by p variables

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

  • Dissimilarity (or similarity) matrix

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

4. Similarity Measures

  • Dissimilarity/similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
  • There is a separate “quality” function that measures the “goodness” of a cluster.
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
  • Weights should be associated with different variables based on applications and data semantics.
  • It is hard to define “similar enough” or “good enough”
      o The answer is typically highly subjective.
  • Type of data in clustering analysis
      o Interval-scaled variables
      o Binary variables
      o Nominal, ordinal, and ratio variables
      o Variables of mixed types
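
The relationship between the data matrix and the dissimilarity matrix of Section 3 can be illustrated with a small sketch. It assumes Euclidean distance as d(i, j), which is only one possible choice; the function name `dissimilarity_matrix` and the sample data are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the lower-triangular dissimilarity matrix for a data matrix.

    `data` is a list of n rows (objects), each holding p numeric values.
    Row i of the result holds d(i, 0), ..., d(i, i), with d(i, i) = 0.
    Euclidean distance is an assumption made for this sketch.
    """
    matrix = []
    for i in range(len(data)):
        row = [math.dist(data[i], data[j]) for j in range(i)]
        row.append(0.0)  # an object is at distance 0 from itself
        matrix.append(row)
    return matrix

# Hypothetical data matrix: 3 objects described by 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) for a metric, which mirrors the triangular layout of the matrix above.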

4.1. Standardize data

  • Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

    where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

  • z-score: calculate the standardized measurement

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

  • Using mean absolute deviation is more robust than using standard deviation

  • Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
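
The standardization step can be sketched for a single variable f as follows. The function name `standardize` and the sample values are hypothetical; the arithmetic follows the mean absolute deviation and z-score definitions in this section.

```python
def standardize(values):
    """Return z-scores of one variable using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                       # mean m_f
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation s_f
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]

# Hypothetical measurements of one interval-scaled variable.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights)
print([round(v, 3) for v in z])  # [-1.667, -0.833, 0.0, 0.833, 1.667]
```

Here m_f = 170 and s_f = 12, so the smallest value maps to (150 - 170) / 12.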

4.2. Binary variables

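  • A contingency table for binary data (a, b, c, d count the variables that fall in each cell for objects i and j; p = a + b + c + d)

                     Object j
                     1        0        sum
    Object i   1     a        b        a + b
               0     c        d        c + d
               sum   a + c    b + d    p

  • Simple matching coefficient (invariant, if the binary variable is symmetric):

$$ d(i, j) = \frac{b + c}{a + b + c + d} $$

  • Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$ d(i, j) = \frac{b + c}{a + b + c} $$

  • Example:

    Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
    Jack   M        Y       N       P        N        N        N
    Mary   F        Y       N       P        N        P        N
    Jim    M        Y       P       N        N        N        N

      o Gender is a symmetric attribute
      o The remaining attributes are asymmetric binary
      o Let the values Y and P be set to 1, and the value N be set to 0
      o Applying the Jaccard coefficient to the asymmetric attributes gives:

$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$

4.3. Nominal Variables

  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
  • Method 1: Simple matching
      o m: # of matches, p: total # of variables

$$ d(i, j) = \frac{p - m}{p} $$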
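
The Jaccard computation on the patient example from Section 4.2 can be checked with a short sketch. The encoding (Y and P as 1, N as 0) follows the notes; the function name `jaccard_distance` and the dictionary layout are hypothetical. Gender is left out because it is a symmetric attribute.

```python
def jaccard_distance(x, y):
    """Jaccard dissimilarity (b + c) / (a + b + c) for asymmetric binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # 1-1 matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # 1 in x only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # 1 in y only
    if a + b + c == 0:
        raise ValueError("both vectors are all zeros; distance is undefined")
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

print(round(jaccard_distance(patients["jack"], patients["mary"]), 2))  # 0.33
print(round(jaccard_distance(patients["jack"], patients["jim"]), 2))   # 0.67
print(round(jaccard_distance(patients["jim"], patients["mary"]), 2))   # 0.75
```

The 0-0 matches (cell d of the contingency table) never enter the formula, which is why the measure suits asymmetric attributes where a shared negative carries little information.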

4.5. Ratio-scaled variables

  • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
  • Methods:
      o Treat them like interval-scaled variables: not a good choice, because the scale can be distorted
      o Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
      o Treat them as continuous ordinal data and treat their ranks as interval-scaled
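
A small sketch of why the logarithmic transformation helps, using hypothetical measurements that grow by a constant factor. On the raw scale the gaps between neighbours explode; after the transformation they are equal, so an interval-scaled distance becomes meaningful.

```python
import math

# Hypothetical ratio-scaled measurements: each value is 10 times the previous one.
x = [1.0, 10.0, 100.0, 1000.0]
y = [math.log(v) for v in x]  # y_if = log(x_if)

raw_gaps = [x[i + 1] - x[i] for i in range(len(x) - 1)]
log_gaps = [y[i + 1] - y[i] for i in range(len(y) - 1)]

print(raw_gaps)                          # [9.0, 90.0, 900.0]
print([round(g, 3) for g in log_gaps])   # [2.303, 2.303, 2.303]
```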

4.6. Variables of mixed types

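  • A database may contain all six types of variables
      o Symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
  • One may use a weighted formula to combine their effects:

$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

    where the weight $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if f is asymmetric binary and $x_{if} = x_{jf} = 0$, and 1 otherwise.

      o f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
      o f is interval-based: use the normalized distance
      o f is ordinal or ratio-scaled
          ▪ compute ranks $r_{if}$ and

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

            where $M_f$ is the number of ordered states of variable f
          ▪ treat $z_{if}$ as interval-scaled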
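
The weighted combination for mixed variable types can be sketched as below. Several details are assumptions made for illustration rather than taken from the notes: the function name `mixed_distance`, the use of `None` for missing values, normalizing interval variables by their range, and passing ordinal values as ranks 1..M_f.

```python
def mixed_distance(x, y, kinds, ranges, levels):
    """Combine per-variable distances d_ij^(f) with 0/1 weights delta_ij^(f).

    kinds[f]  : "nominal" (also symmetric binary), "asymmetric",
                "interval", or "ordinal"
    ranges[f] : max - min of variable f (interval variables only; assumed
                normalization for this sketch)
    levels[f] : number of ordered states M_f (ordinal variables only)
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue  # delta = 0: shared negative on an asymmetric variable
        if kind in ("nominal", "asymmetric"):
            d = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d = abs(xf - yf) / ranges[f]
        elif kind == "ordinal":
            zx = (xf - 1) / (levels[f] - 1)  # z_if = (r_if - 1) / (M_f - 1)
            zy = (yf - 1) / (levels[f] - 1)
            d = abs(zx - zy)
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator

# Hypothetical objects: colour (nominal), temperature (interval), grade rank (ordinal).
kinds = ["nominal", "interval", "ordinal"]
ranges = {1: 40.0}
levels = {2: 3}
obj_i = ["red", 20.0, 1]
obj_j = ["blue", 30.0, 3]
print(mixed_distance(obj_i, obj_j, kinds, ranges, levels))  # 0.75
```

The three contributions are 1 (different colours), 10 / 40 = 0.25, and |0 - 1| = 1, so the result is 2.25 / 3 = 0.75.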

5. Clustering approaches

5.1. Major approaches

  • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
  • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
  • Density-based: based on connectivity and density functions
  • Grid-based: based on a multiple-level granularity structure
  • Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model

5.2. Partitioning approach

  • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
  • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
      o Global optimal: exhaustively enumerate all partitions
      o Heuristic methods: k-means and k-medoids algorithms
      o k-means (MacQueen’67): Each cluster is represented by the center of the cluster
      o k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
  • Variations of the k-means method:
      o A few variants of k-means differ in
          ▪ Selection of the initial k means
          ▪ Dissimilarity calculations
          ▪ Strategies to calculate cluster means
      o Handling categorical data: k-modes (Huang’98)
          ▪ Replacing means of clusters with modes
          ▪ Using new dissimilarity measures to deal with categorical objects
          ▪ Using a frequency-based method to update modes of clusters
          ▪ A mixture of categorical and numerical data: k-prototype method
  • Drawbacks of the k-means method
      o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
      o K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
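
A minimal sketch of the k-means heuristic, where each cluster is represented by its center. It assumes the common Lloyd-style iteration (assign, then recompute means) with Euclidean distance and caller-supplied initial centers; the function name `k_means` and the sample points are hypothetical, and the variants listed above would change exactly these choices.

```python
import math

def k_means(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(center)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups, k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print([tuple(round(v, 2) for v in c) for c in centers])  # [(1.33, 1.33), (8.33, 8.33)]
```

Because the center is a mean, a single extreme value added to a cluster would pull its center toward that value, which is the outlier sensitivity noted above.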

7. The K-medoids Clustering Method

  • Find representative objects, called medoids, in clusters
  • PAM (Partitioning Around Medoids, 1987)
      o Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
      o PAM works effectively for small data sets, but does not scale well for large data sets
  • CLARA (Kaufman & Rousseeuw, 1990)
  • CLARANS (Ng & Han, 1994): Randomized sampling
  • Focusing + spatial data structure (Ester et al., 1995)
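
The swap idea behind PAM can be sketched as follows. This is a simplified first-improvement variant, not the full published algorithm: it accepts any medoid/non-medoid swap that lowers the total distance and stops when no swap helps. Euclidean distance, the function names, and the sample points are assumptions.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Greedy medoid swapping: keep any swap that reduces the total distance."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(len(medoids)), points):
            if candidate in medoids:
                continue  # only non-medoids may replace a medoid
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    return medoids, best

# Hypothetical data; both initial medoids deliberately start in the same group.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, cost = pam(points, [(1, 2), (2, 1)])
print(sorted(medoids), cost)  # [(1, 1), (8, 8)] 4.0
```

Every pass evaluates each (medoid, non-medoid) pair against all objects, which is the cost that keeps PAM from scaling to large data sets.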

8. Hierarchical Clustering

  • A dendrogram shows how the clusters are merged hierarchically
      o Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
      o A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

[Figure: three scatter plots; both axes run from 0 to 10]
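
Bottom-up merging and the effect of cutting the dendrogram can be sketched as below. The sketch assumes single-link (nearest-neighbour) distance between clusters and Euclidean distance between objects; stopping when k clusters remain plays the role of cutting the dendrogram at that level. The function names and sample points are hypothetical.

```python
import math

def single_link(c1, c2):
    """Distance between two clusters: the closest pair of their objects."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # start with every object on its own
    merges = []                       # merge history, i.e. the dendrogram bottom-up
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical data: two groups of three objects.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, merges = agglomerate(points, k=2)
print(sorted(sorted(c) for c in clusters))
# [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```

Running the loop down to k = 1 would record the complete merge history; a divisive method would produce the same kind of tree from the top down.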

8.2. Divisive Analysis: DIANA

  • Introduced in Kaufman and Rousseeuw (1990)
  • Implemented in statistical analysis packages, e.g., S-Plus
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

8.3. Analysis of hierarchical clustering

  • Major weakness of agglomerative clustering methods
      o Do not scale well: time complexity of at least $O(n^2)$, where n is the total number of objects
  • Integration of hierarchical with distance-based clustering
      o BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
      o CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
      o CHAMELEON (1999): hierarchical clustering using dynamic modeling

[Figure: three scatter plots; both axes run from 0 to 10]

9.2. Distance-Based Approach

  • Introduced to counter the main limitations imposed by statistical methods
      o We need multi-dimensional analysis without knowing the data distribution.
  • Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
  • Algorithms for mining distance-based outliers
      o Index-based algorithm:
          ▪ Uses an R-tree indexing structure.
          ▪ It takes $O(k \cdot n^2)$, where k is the number of dimensions and n the number of objects, without the cost of building the tree.
      o Nested-loop algorithm:
          ▪ Divides the dataset into blocks and looks for outliers block by block.
          ▪ It has the same complexity as the index-based algorithm.
      o Cell-based algorithm:
          ▪ Divides the data space into cells and looks for outliers cell by cell rather than point by point.
          ▪ It takes $O(n^2)$.
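
The distance-based outlier definition can be applied directly with a naive pairwise scan. This sketch is not the block-based nested-loop algorithm or either of the other algorithms listed above; it only evaluates the definition object by object, assuming Euclidean distance. The function name `db_outliers` and the sample data are hypothetical.

```python
import math

def db_outliers(points, p, D):
    """Return every object O for which at least a fraction p of the
    objects in the dataset lie at a distance greater than D from O."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if math.dist(o, q) > D)
        if far >= p * n:
            outliers.append(o)
    return outliers

# Hypothetical data: two tight groups plus one isolated object.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (50, 50)]
print(db_outliers(points, p=0.8, D=15.0))  # [(50, 50)]
```

For (50, 50), six of the seven objects are farther away than 15, which exceeds the 0.8 fraction; every other object has only one object beyond that distance.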