Data Mining
CS57300 / STAT 59800-
Purdue University, March 5, 2009
Descriptive modeling: representation
Data mining components
- Task specification: Description
- Data representation: Homogeneous IID data
- Knowledge representation
- Learning technique
Descriptive models
- Descriptive models summarize the data
- Global summary
- Model main features of the data
- Two main approaches:
- Cluster analysis
- Density estimation
Application examples
- Marketing: discover distinct groups in customer base to develop targeted marketing programs
- Land use: identify areas of similar use in an earth observation database to understand geographic similarities
- City-planning: group houses according to house type, value, and location to identify “neighborhoods”
- Earthquake studies: group observed earthquakes to see if they cluster along continental faults
Algorithm examples
- K-means clustering (partition-based)
- Spectral clustering (hierarchical-divisive)
- Nearest neighbor clustering (hierarchical-agglomerative)
- Mixture models (probabilistic model-based)
K-means
Groups represented by canonical item description(s)
Nearest neighbor clustering
Clustering represented with a dendrogram
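To make the dendrogram representation concrete, here is a minimal sketch that reads "nearest neighbor clustering" as single-link agglomerative clustering (an assumption; the slide does not spell out the linkage rule). The function name `single_link_merges` and the 1-D data are hypothetical. The returned merge history is the information a dendrogram would draw: which clusters merged, and at what distance.

```python
def single_link_merges(points):
    """Agglomerative clustering on 1-D points using single-link distance.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(p,) for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Illustrative data, not taken from the slides.
merges = single_link_merges([1.0, 2.0, 6.0, 7.0, 20.0])
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Cutting the merge history at a chosen distance threshold yields a flat clustering, which is how a dendrogram is typically read.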
Compact representations
- Key idea: use properties of independence to determine more compact representations
- X and Y are independent if:
- P(X,Y) = P(X)P(Y) for all values x, y
- In general, if X_1, X_2, ..., X_p are independent, then
- P(X_1, X_2, ..., X_p) = P(X_1)P(X_2)...P(X_p)
- Requires O(p) parameters
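As a rough illustration of the savings, the sketch below assumes p binary variables (an assumption; the slide does not fix the variable type). An unrestricted joint table then needs 2^p - 1 free parameters, while the independent factorization needs only p. The function names and the marginal values are hypothetical.

```python
from itertools import product

def full_joint_param_count(p):
    """Free parameters of an unrestricted joint over p binary variables."""
    return 2 ** p - 1

def independent_param_count(p):
    """Free parameters when all p binary variables are independent."""
    return p

def joint_from_marginals(marginals):
    """Build the full joint table from P(X_i = 1) values, assuming independence."""
    table = {}
    for x in product([0, 1], repeat=len(marginals)):
        prob = 1.0
        for xi, pi in zip(x, marginals):
            prob *= pi if xi == 1 else 1.0 - pi
        table[x] = prob
    return table

marginals = [0.2, 0.5, 0.9]  # illustrative values, not from the slides
joint = joint_from_marginals(marginals)

print(full_joint_param_count(10), independent_param_count(10))
```

Three numbers are enough here to reproduce all eight entries of the joint table, which is the sense in which the representation is compact.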
Compact representations
- Unfortunately most variables of interest are not independent of one another
- More suitable notion: conditional independence
- Two variables X and Y are conditionally independent given Z if: P(X|Y,Z)=P(X|Z) for all values of x,y,z
- Approach: model the joint probability in terms of “local” conditional probability distributions (CPDs)
- If dependencies have a “sparse” factorization, then CPDs can be specified compactly and the full joint representation will be linear, O(p)
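The definition of conditional independence can be checked numerically on a small case. The sketch below builds a joint over three binary variables from local CPDs P(Z), P(X|Z), and P(Y|Z), then confirms that P(X|Y,Z) = P(X|Z) for every value combination. All probability values and helper names are made up for illustration.

```python
from itertools import product

# Illustrative CPDs for binary variables (values are made up for the example).
p_z = {0: 0.7, 1: 0.3}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_y_given_z[z][y]

# Joint built so that X and Y are conditionally independent given Z.
joint = {
    (x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for x, y, z in product([0, 1], repeat=3)
}

def prob_x_given_yz(x, y, z):
    """P(X=x | Y=y, Z=z) computed from the joint table."""
    denom = sum(joint[(xv, y, z)] for xv in [0, 1])
    return joint[(x, y, z)] / denom

def prob_x_given_z(x, z):
    """P(X=x | Z=z) computed from the joint table by summing out Y."""
    num = sum(joint[(x, yv, z)] for yv in [0, 1])
    denom = sum(joint[(xv, yv, z)] for xv in [0, 1] for yv in [0, 1])
    return num / denom

# Largest difference between the two conditionals over all value combinations.
max_gap = max(
    abs(prob_x_given_yz(x, y, z) - prob_x_given_z(x, z))
    for x, y, z in product([0, 1], repeat=3)
)
print(max_gap)
```

The joint here is specified with 1 + 2 + 2 = 5 numbers rather than the 7 needed for an unrestricted table over three binary variables.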
Examples
- Graphical models for representing full joint distribution
- Bayesian networks: directed model
- Markov networks: undirected model
- Can be parametric or non-parametric
Bayesian networks
$$P(X = x) = \prod_{i=1}^{p} P(X_i \mid \mathrm{parents}(X_i))$$
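A minimal sketch of this factorization follows, assuming a hypothetical three-node network (Rain -> WetGrass <- Sprinkler) with binary variables and made-up CPD values; none of these names or numbers come from the slides. The joint probability of a full assignment is the product of one CPD lookup per node.

```python
from itertools import product

# Hypothetical three-node network: Rain -> WetGrass <- Sprinkler.
parents = {"Rain": (), "Sprinkler": (), "WetGrass": ("Rain", "Sprinkler")}

# cpd[node][parent_values] = P(node = 1 | parents = parent_values); made-up numbers.
cpd = {
    "Rain": {(): 0.2},
    "Sprinkler": {(): 0.1},
    "WetGrass": {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99},
}

def joint_probability(assignment):
    """P(X = x) as the product over nodes of P(X_i | parents(X_i))."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_one = cpd[node][pa_values]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# Summing the factorized joint over all assignments should give 1.
nodes = list(parents)
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in product([0, 1], repeat=len(nodes))
)
print(total)
```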
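Markov networks
Acknowledgement: Slide adapted from Pedro Domingos
[Figure: network over the variables Smoking, Cancer, Asthma, Cough]
- Log-linear model:
$$P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$$
- w_i is the weight of feature i; f_i is feature i
- Example feature:
$$f_1(\mathrm{Smoking}, \mathrm{Cancer}) = \begin{cases} 1 & \text{if } \neg\mathrm{Smoking} \lor \mathrm{Cancer} \\ 0 & \text{otherwise} \end{cases}$$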
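The log-linear form can be evaluated by brute force on a tiny example. The sketch below restricts attention to Smoking and Cancer with the single feature f_1 from the slide, and computes Z by summing over all four states. The weight value 1.5 is hypothetical (the slide gives no number), and enumerating states like this is only feasible for very small networks.

```python
import math
from itertools import product

def f1(smoking, cancer):
    """Feature from the slide: 1 if (not Smoking) or Cancer, else 0."""
    return 1 if (not smoking) or cancer else 0

w1 = 1.5  # hypothetical weight; the slide gives no numeric value

def unnormalized(smoking, cancer):
    """exp(sum_i w_i f_i(x)) with a single feature."""
    return math.exp(w1 * f1(smoking, cancer))

states = list(product([False, True], repeat=2))
Z = sum(unnormalized(s, c) for s, c in states)  # normalizing constant

def probability(smoking, cancer):
    """P(X = x) = (1/Z) exp(sum_i w_i f_i(x))."""
    return unnormalized(smoking, cancer) / Z

for s, c in states:
    print(f"Smoking={s}, Cancer={c}: {probability(s, c):.4f}")
```

With a positive weight, the one state that violates the feature (Smoking without Cancer) becomes less probable than the others but not impossible.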
Descriptive modeling: learning
Cluster analysis
- Huge body of work
- Unsupervised learning, segmentation, etc.
- Difficult to evaluate success
- If goal is to find “interesting” clusters, then it is difficult to quantify
- If goal is to find “similar” clusters, then success depends on distance measure (circular)
Clustering algorithms
- Types:
- Partition-based methods
- Hierarchical clustering (divisive/agglomerative)
- Probabilistic model-based methods
- Different algorithms find clusters of different “shapes”
- The appropriate shape depends on the application; match the method to the objectives
K-means
- Algorithm idea:
- Start with k randomly chosen centroids
- Repeat until no changes in assignments
- Assign instances to closest centroid
- Recompute cluster centroids
- Score function?
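The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: initial centroids are sampled from the data points, distance is squared Euclidean, and an empty cluster keeps its previous centroid (all three are assumptions). For the score function, the sketch uses the within-cluster sum of squared distances, which is one common choice. All function names are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the closest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def sse(points, labels, centroids):
    """Within-cluster sum of squared distances (one common score function)."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start with k randomly chosen points
    labels = None
    while True:
        new_labels = assign(points, centroids)
        if new_labels == labels:  # stop when no assignment changes
            return centroids, labels
        labels = new_labels
        centroids = recompute(points, labels, centroids)

# Illustrative data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels, sse(points, labels, centroids))
```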
K-means example I
[Figure: sequence of plots for k-means example I]
K-means example II
[Figure: plot for k-means example II]
Algorithm details
- Does it terminate?
- Does it converge to an optimal solution?
- What is the time complexity?
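One standard way to reason about the first two questions is sketched below, assuming the score is the within-cluster squared error J (an assumption; the slides leave the score function open). Here c(i) denotes the cluster assigned to instance x_i and \mu_j the centroid of cluster j; this notation is introduced only for the sketch.

```latex
\begin{align*}
J(c, \mu) &= \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2 \\
\text{Assignment step: } c'(i) &= \arg\min_{j} \lVert x_i - \mu_j \rVert^2
  \;\Longrightarrow\; J(c', \mu) \le J(c, \mu) \\
\text{Update step: } \mu'_j &= \frac{1}{\lvert \{ i : c'(i) = j \} \rvert} \sum_{i : c'(i) = j} x_i
  \;\Longrightarrow\; J(c', \mu') \le J(c', \mu)
\end{align*}
```

Neither step can increase J, and there are at most k^n distinct assignments, so with consistent tie-breaking the procedure stops after finitely many iterations. Nothing in the argument rules out stopping at an assignment that is only locally best.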
K-means
- Strengths:
- Relatively efficient: O(tkn) for n instances, k clusters, and t iterations
- Finds spherical clusters
- Weaknesses:
- Terminates at local optimum (sensitive to initial seeds)
- Applicable only when mean is defined
- Need to specify k
- Susceptible to outliers/noise
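The sensitivity to initial seeds can be seen on four points at the corners of a wide rectangle. In the sketch below, one pair of seeds leads to the left/right split, while another pair leads to a stable top/bottom split with a much larger squared error. The data, seeds, and the helper `kmeans_from` are hypothetical, and squared Euclidean distance is assumed.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_from(points, centroids):
    """Run k-means from the given initial centroids; return (labels, sse)."""
    centroids = list(centroids)  # copy so the caller's list is not modified
    labels = None
    while True:
        new_labels = [
            min(range(len(centroids)), key=lambda c: sqdist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    sse = sum(sqdist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, sse

# Corners of a 4-by-1 rectangle (illustrative data).
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]

labels_a, sse_a = kmeans_from(points, [(0.0, 0.0), (4.0, 0.0)])  # seeds on opposite sides
labels_b, sse_b = kmeans_from(points, [(0.0, 0.0), (0.0, 1.0)])  # seeds on the same side

print(labels_a, sse_a)
print(labels_b, sse_b)
```

A common mitigation is to run the algorithm from several random seeds and keep the result with the lowest score.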
Variations
- Selection of initial centroids
- Algorithm:
- Recompute centroid after each point is assigned
- Allow for merge and split of clusters
- When mean is undefined
- K-medoids: use one of the data points as cluster center
- K-modes: uses categorical distance measure and frequency-based update method
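As a sketch of the k-medoids idea, the code below alternates between assigning each point to its nearest medoid and picking, within each cluster, the member with the smallest total distance to the other members. This simple alternating scheme is only one of several k-medoids procedures and is chosen here for brevity. Manhattan distance, the data, and the function names are assumptions.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmedoids(points, medoids, dist=manhattan, max_iter=100):
    """Alternate between nearest-medoid assignment and per-cluster medoid update."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = [
            min(range(len(medoids)), key=lambda c: dist(p, medoids[c]))
            for p in points
        ]
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:
                new_medoids.append(old)  # keep the old medoid for an empty cluster
                continue
            # The medoid is the member with the smallest total distance to the rest.
            new_medoids.append(min(members, key=lambda m: sum(dist(m, q) for q in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

# Illustrative 1-D data with one outlier at 100.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (100.0,)]
medoids, labels = kmedoids(points, [(1.0,), (10.0,)])
print(medoids, labels)
```

Because each center must be an actual data point, the outlier at 100 does not drag the second center away from the group around 11, whereas the mean of that cluster would be 33.25.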
Next class
- Reading: Chapter 9 PDM
- Topic: Descriptive modeling: learning
- Midterm: Mar 10, 8-10pm, LWSN B