Data Mining: Descriptive Modeling and Cluster Analysis

This document from Purdue University covers descriptive modeling in data mining, including its components, descriptive models, and clustering. It discusses data representation, task specification, and knowledge representation. It also surveys clustering algorithms such as k-means, spectral clustering, and nearest neighbor clustering, and their applications in marketing, land use, city planning, and earthquake studies.

Uploaded on 07/30/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
March 5, 2009
Descriptive modeling: representation

Data mining components

  • Task specification: Description
  • Data representation: Homogeneous IID data
  • Knowledge representation
  • Learning technique

Descriptive models

  • Descriptive models summarize the data
    • Global summary
    • Model main features of the data
  • Two main approaches:
    • Cluster analysis
    • Density estimation

Application examples

  • Marketing: discover distinct groups in customer base to develop targeted marketing programs
  • Land use: identify areas of similar use in an earth observation database to understand geographic similarities
  • City-planning: group houses according to house type, value, and location to identify “neighborhoods”
  • Earthquake studies: group observed earthquakes to see if they cluster along continental faults

Algorithm examples

  • K-means clustering (partition-based)
  • Spectral clustering (hierarchical-divisive)
  • Nearest neighbor clustering (hierarchical-agglomerative)
  • Mixture models (probabilistic model-based)

K-means

Groups represented by canonical item description(s)

Nearest neighbor clustering

Clustering represented with a dendrogram
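
To illustrate how a dendrogram arises, here is a minimal sketch of agglomerative clustering with the single-link (nearest neighbor) distance. The 1-D data, the function name `single_link_merges`, and the choice to record each merge as `(left, right, height)` are assumptions made for illustration; the slides do not specify an implementation.

```python
def single_link_merges(points):
    """Return the merge sequence of single-link agglomerative clustering on 1-D points."""
    clusters = [[p] for p in points]          # start with every point in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Each merge corresponds to one internal node of the dendrogram, at height d.
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


merges = single_link_merges([1.0, 2.0, 6.0, 7.0, 15.0])
for left, right, height in merges:
    print(left, right, height)
```

Reading the merges from first to last gives the dendrogram from the leaves up; cutting it at a chosen height yields a flat clustering.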

Compact representations

  • Key idea: use properties of independence to determine more compact representations
  • X and Y are independent if:
    • P(X,Y) = P(X)P(Y) for all values x, y
  • In general, if X_1, X_2, ..., X_p are independent, then
    • P(X_1, X_2, ..., X_p) = P(X_1) P(X_2) ... P(X_p)
    • Requires O(p) parameters
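
To make the O(p) claim concrete, the sketch below counts free parameters for p binary variables with and without the independence assumption. The restriction to binary variables and the function names are assumptions made for illustration.

```python
def full_joint_params(p: int) -> int:
    """Free parameters of an unrestricted joint table over p binary variables."""
    # 2**p cells, minus one because the probabilities must sum to 1.
    return 2 ** p - 1


def independent_params(p: int) -> int:
    """Free parameters when all p binary variables are mutually independent."""
    # One number P(X_i = 1) per variable.
    return p


for p in (2, 10, 20):
    print(p, full_joint_params(p), independent_params(p))
```

The gap grows exponentially with p, which is why a factorized representation matters once p is more than a handful of variables.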

Compact representations

  • Unfortunately most variables of interest are not independent of one another
  • More suitable notion: conditional independence
    • Two variables X and Y are conditionally independent given Z if: P(X|Y,Z)=P(X|Z) for all values of x,y,z
  • Approach: model the joint probability in terms of “local” conditional probability distributions (CPDs)
    • If the dependencies have a “sparse” factorization, then the CPDs can be specified compactly and the full joint representation will be linear, O(p)
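
The definition of conditional independence can be checked numerically. The sketch below builds a joint distribution over three binary variables from the factorization P(Z) P(X|Z) P(Y|Z) and verifies that P(X|Y,Z) = P(X|Z) while X and Y remain marginally dependent. All probabilities and helper names (`bern`, `prob`) are made up for illustration.

```python
from itertools import product

# Hypothetical CPDs for three binary variables (numbers chosen for illustration).
p_z = {0: 0.7, 1: 0.3}            # P(Z = z)
p_x_given_z = {0: 0.1, 1: 0.8}    # P(X = 1 | Z = z)
p_y_given_z = {0: 0.2, 1: 0.9}    # P(Y = 1 | Z = z)


def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


# Joint built from the sparse factorization P(Z) P(X|Z) P(Y|Z).
joint = {
    (x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(x=1, z=0)."""
    total = 0.0
    for (x, y, z), p in joint.items():
        values = {"x": x, "y": y, "z": z}
        if all(values[name] == v for name, v in fixed.items()):
            total += p
    return total


# Conditional independence: P(X=1 | Y=y, Z=z) equals P(X=1 | Z=z) for all y, z.
for y, z in product((0, 1), repeat=2):
    lhs = prob(x=1, y=y, z=z) / prob(y=y, z=z)
    rhs = prob(x=1, z=z) / prob(z=z)
    assert abs(lhs - rhs) < 1e-12

# Marginally, X and Y are still dependent: the two numbers differ.
print(prob(x=1, y=1), prob(x=1) * prob(y=1))
```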

Examples

  • Graphical models for representing full joint distribution
    • Bayesian networks: directed model
    • Markov networks: undirected model
  • Can be parametric or non-parametric

Bayesian networks

$$P(X = x) = \prod_{i=1}^{p} P(X_i \mid \mathrm{parents}(X_i))$$
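
As a rough illustration of the Bayesian network factorization, the sketch below evaluates the product of local CPDs for a small directed model. The chain structure Smoking -> Cancer -> Cough, the probabilities, and the names `parents`, `cpd`, and `joint_probability` are all hypothetical and not taken from the slides.

```python
from itertools import product

# Hypothetical network over binary variables: Smoking -> Cancer -> Cough.
parents = {"Smoking": (), "Cancer": ("Smoking",), "Cough": ("Cancer",)}

# cpd[var][parent_values] = P(var = 1 | parents = parent_values); numbers are made up.
cpd = {
    "Smoking": {(): 0.3},
    "Cancer": {(0,): 0.01, (1,): 0.1},
    "Cough": {(0,): 0.2, (1,): 0.7},
}


def joint_probability(assignment):
    """P(X = x) as the product over i of P(X_i | parents(X_i))."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = cpd[var][tuple(assignment[q] for q in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


variables = list(parents)
total = sum(
    joint_probability(dict(zip(variables, values)))
    for values in product((0, 1), repeat=len(variables))
)
print(total)  # sums to 1 up to floating-point rounding

n_params = sum(len(table) for table in cpd.values())
print(n_params)  # 5 CPD entries, versus 2**3 - 1 = 7 for the full joint table
```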

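Markov networks

[Figure: graph with nodes Smoking, Cancer, Asthma, Cough]

  • Log-linear model:

$$P(X = x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)$$

    • w_i: weight of feature i
    • f_i: feature i

$$f_1(\text{Smoking}, \text{Cancer}) = \begin{cases} 1 & \text{if } \neg\text{Smoking} \lor \text{Cancer} \\ 0 & \text{otherwise} \end{cases}$$

Acknowledgement: Slide adapted from Pedro Domingos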
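
The log-linear form can be evaluated directly on a toy case. The sketch below uses only the feature f_1(Smoking, Cancer) from the slide, restricted to those two variables, with an assumed weight `w1 = 1.5` that does not come from the slides. It computes the partition function Z by enumerating all states.

```python
import math
from itertools import product


def f1(smoking: bool, cancer: bool) -> int:
    """Feature from the slide: 1 if (not Smoking) or Cancer, else 0."""
    return 1 if (not smoking) or cancer else 0


w1 = 1.5  # assumed weight, chosen for illustration


def unnormalized(smoking, cancer):
    """exp(sum_i w_i f_i(x)) with a single feature."""
    return math.exp(w1 * f1(smoking, cancer))


states = list(product((False, True), repeat=2))
Z = sum(unnormalized(s, c) for s, c in states)  # partition function


def probability(smoking, cancer):
    return unnormalized(smoking, cancer) / Z


for s, c in states:
    print(s, c, round(probability(s, c), 4))
```

With a positive weight, the one state that violates the feature (Smoking without Cancer) becomes less probable than the others but not impossible, which is the soft-constraint behavior of a log-linear model.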

Descriptive modeling: learning

Cluster analysis

  • Huge body of work
    • Unsupervised learning, segmentation, etc.
  • Difficult to evaluate success
    • If goal is to find “interesting” clusters, then it is difficult to quantify
    • If goal is to find “similar” clusters, then success depends on distance measure (circular)

Clustering algorithms

  • Types:
    • Partition-based methods
    • Hierarchical clustering (divisive/agglomerative)
    • Probabilistic model-based methods
  • Different algorithms find clusters of different “shapes”
    • The appropriate shape depends on the application; match the method to the objectives

K-means

  • Algorithm idea:
    • Start with k randomly chosen centroids
    • Repeat until no changes in assignments
      • Assign instances to closest centroid
      • Recompute cluster centroids
  • Score function?
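
The algorithm idea above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the toy data, and the rule of keeping an old centroid when a cluster becomes empty are all assumptions. The `sse` function shows one common candidate for the score function, the within-cluster sum of squared errors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start with k randomly chosen data points
    assignments = None
    for _ in range(max_iter):
        # Assign each instance to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:     # no changes in assignments: stop
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                        # assumption: keep old centroid if cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


def sse(points, centroids, assignments):
    """Within-cluster sum of squared errors."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignments))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments, sse(points, centroids, assignments))
```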

K-means example I

[Figure]

K-means example II

[Figure]

Algorithm details

  • Does it terminate?
  • Does it converge to an optimal solution?
  • What is the time complexity?

K-means

  • Strengths:
    • Relatively efficient: O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations
    • Finds spherical clusters
  • Weaknesses:
    • Terminates at local optimum (sensitive to initial seeds)
    • Applicable only when mean is defined
    • Need to specify k
    • Susceptible to outliers/noise
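
The sensitivity to initial seeds can be seen on a four-point example. The sketch below runs k-means iterations from two hand-picked pairs of seeds on the corners of a wide rectangle; one pair reaches the better partition and the other stops at a worse local optimum. The data, the seeds, and the function name `lloyd` are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centroids, max_iter=100):
    """Run k-means iterations from the given initial centroids; return (centroids, SSE)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid is the cluster mean; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:               # centroids stopped moving
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Corners of a 4-by-1 rectangle: the natural clusters are the left and right sides.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_centroids, good_sse = lloyd(points, [(0.0, 0.0), (4.0, 1.0)])
bad_centroids, bad_sse = lloyd(points, [(0.0, 0.0), (0.0, 1.0)])
print(good_centroids, good_sse)
print(bad_centroids, bad_sse)
```

Both runs terminate, but the second splits the data into top and bottom rows and has a much larger sum of squared errors. A common remedy is to run the algorithm from several random seeds and keep the result with the lowest score.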

Variations

  • Selection of initial centroids
  • Algorithm:
    • Recompute centroid after each point is assigned
    • Allow for merge and split of clusters
  • When mean is undefined
    • K-medoids: use one of the data points as cluster center
    • K-modes: uses categorical distance measure and frequency-based update method
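
For categorical data, where a mean is undefined, the two center updates can be sketched as below. This shows only the center computation for a single cluster, not a full clustering loop. The Hamming distance, the toy records, and the function names `medoid` and `mode_center` are assumptions for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def medoid(cluster, dist=hamming):
    """K-medoids style center: the data point with the smallest total distance to the cluster."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


def mode_center(cluster):
    """K-modes style center: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


cluster = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]
print(medoid(cluster))
print(mode_center(cluster))
```

The medoid is always an actual record from the cluster, whereas the per-attribute mode may combine values that never occur together in any single record.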

Next class

  • Reading: Chapter 9 PDM
  • Topic: Descriptive modeling: learning
  • Midterm: Mar 10, 8-10pm, LWSN B