Data Mining: Descriptive Modeling and Cluster Analysis, Exams of Computer Science

Purdue University Computer Science

This document from Purdue University covers descriptive modeling in data mining, including components, descriptive models, and clustering. It discusses the importance of data representation, task specification, and knowledge representation. The document also explores various clustering algorithms, such as k-means, spectral clustering, and nearest neighbor clustering, and their applications in marketing, land use, city planning, and earthquake studies.

Typology: Exams

Pre 2010

Uploaded on 07/30/2009


Data Mining

CS57300 / STAT 59800-024

Purdue University

March 5, 2009

Descriptive modeling: representation


Data mining components

Task specification: Description
Data representation: Homogeneous IID data
Knowledge representation
Learning technique

Descriptive models

Descriptive models summarize the data
- Global summary
- Model main features of the data
Two main approaches:
- Cluster analysis
- Density estimation

Application examples

Marketing: discover distinct groups in customer base to develop targeted marketing programs
Land use: identify areas of similar use in an earth observation database to understand geographic similarities
City-planning: group houses according to house type, value, and location to identify “neighborhoods”
Earthquake studies: group observed earthquakes to see if they cluster along continental faults

Algorithm examples

K-means clustering (partition-based)
Spectral clustering (hierarchical-divisive)
Nearest neighbor clustering (hierarchical-agglomerative)
Mixture models (probabilistic model-based)

K-means

Groups represented by canonical item description(s)

Nearest neighbor clustering

Clustering represented with a dendrogram
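A minimal sketch of how such a merge tree can be built, assuming "nearest neighbor clustering" here means single-link agglomerative clustering (the distance between two clusters is the distance between their closest members). The function name `single_link` and the one-dimensional data are illustrative, not from the slides. Each recorded merge corresponds to one internal node of the tree, and the merge distance is its height.

```python
def single_link(points):
    """Agglomerative single-link clustering on 1-D points.

    Returns a list of (member_indices, merge_distance) pairs,
    one per merge, in the order the merges happen.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

points = [0, 1, 5, 6, 20]  # illustrative data
merges = single_link(points)
for members, height in merges:
    print(members, height)
```

Cutting the tree at a chosen height yields a flat clustering: cutting between heights 1 and 4 in this example leaves the groups {0, 1}, {5, 6}, and {20}.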

Compact representations

Key idea: use properties of independence to determine more compact representations
X and Y are independent if:
- $P(X, Y) = P(X) P(Y)$ for all values x, y
In general, if $X_1, X_2, \ldots, X_p$ are independent, then
- $P(X_1, X_2, \ldots, X_p) = P(X_1) P(X_2) \cdots P(X_p)$
- Requires O(p) parameters
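To make the saving concrete, the sketch below counts free parameters under the assumption that all p variables are binary (the slides do not fix the variable type). A full joint table then needs 2^p - 1 free parameters, while the independent factorization needs one per variable. The function names and the marginal values are hypothetical.

```python
from itertools import product

def full_joint_params(p):
    """Free parameters in a full joint table over p binary variables."""
    return 2 ** p - 1

def independent_params(p):
    """Free parameters when all p binary variables are independent."""
    return p

def independent_joint(marginals, x):
    """P(X_1 = x_1, ..., X_p = x_p) as the product of the marginals.

    marginals[i] is P(X_i = 1); x is a tuple of 0/1 values.
    """
    prob = 1.0
    for p_one, value in zip(marginals, x):
        prob *= p_one if value == 1 else 1.0 - p_one
    return prob

marginals = [0.2, 0.5, 0.9]  # illustrative values, not from the slides

# The factorized joint is still a proper distribution: it sums to 1.
total = sum(independent_joint(marginals, x)
            for x in product([0, 1], repeat=len(marginals)))

print(full_joint_params(10), independent_params(10))  # 1023 10
```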

Compact representations

Unfortunately most variables of interest are not independent of one another
More suitable notion: conditional independence
- Two variables X and Y are conditionally independent given Z if: $P(X \mid Y, Z) = P(X \mid Z)$ for all values of x, y, z
Approach: model joint probability in terms of “local” conditional probability distributions (CPDs)
- If dependencies have a “sparse” factorization, then CPDs can be specified compactly and the full joint representation will be linear, O(p)
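As an illustration of a sparse factorization, the sketch below assumes a chain X1 -> X2 -> X3 over binary variables, so the joint is P(X1) P(X2 | X1) P(X3 | X2). All CPD values and function names are hypothetical. It checks numerically that X3 is conditionally independent of X1 given X2, and counts parameters for a chain of length p, which grows linearly.

```python
from itertools import product

# Hypothetical CPDs for a chain X1 -> X2 -> X3 over binary variables.
p_x1 = 0.3                          # P(X1 = 1)
p_x2_given_x1 = {0: 0.2, 1: 0.7}    # P(X2 = 1 | X1)
p_x3_given_x2 = {0: 0.1, 1: 0.6}    # P(X3 = 1 | X2)

def bern(p_one, value):
    """Probability of a 0/1 value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3):
    """Joint probability built from the local CPDs."""
    return (bern(p_x1, x1)
            * bern(p_x2_given_x1[x1], x2)
            * bern(p_x3_given_x2[x2], x3))

def cond_x3(x3, x1, x2):
    """P(X3 = x3 | X1 = x1, X2 = x2), computed from the joint."""
    return joint(x1, x2, x3) / (joint(x1, x2, 0) + joint(x1, x2, 1))

def chain_params(p):
    """Free parameters for a binary chain of length p: 1 + 2(p - 1)."""
    return 1 + 2 * (p - 1)

# Conditioning on X1 as well as X2 does not change the distribution of X3.
for x1, x2 in product([0, 1], repeat=2):
    assert abs(cond_x3(1, x1, x2) - p_x3_given_x2[x2]) < 1e-12

print(chain_params(10))  # 19, versus 1023 for the full joint table
```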

Examples

Graphical models for representing full joint distribution
- Bayesian networks: directed model
- Markov networks: undirected model
Can be parametric or non-parametric

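Bayesian networks

$$P(X = x) = \prod_{i=1}^{p} P(X_i \mid \mathrm{parents}(X_i))$$

Markov networks

Log-linear model:

$$P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i f_i(x) \Big)$$

where $w_i$ is the weight of feature $i$ and $f_i(x)$ is feature $i$.

Example network over the variables Smoking, Cancer, Asthma, Cough:

$$f_1(\text{Smoking}, \text{Cancer}) = \begin{cases} 1 & \text{if } \neg \text{Smoking} \lor \text{Cancer} \\ 0 & \text{otherwise} \end{cases}$$

Acknowledgement: Slide adapted from Pedro Domingos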
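A small sketch of the log-linear form P(X = x) = (1/Z) exp(sum_i w_i f_i(x)), restricted for brevity to the two variables Smoking and Cancer and the single feature f_1 from the slide. The weight value 1.5 is an assumption chosen for illustration. Z is computed by summing the unnormalized scores over all four states.

```python
import math
from itertools import product

def f1(smoking, cancer):
    """Feature from the slide: 1 if (not Smoking) or Cancer, else 0."""
    return 1 if (not smoking) or cancer else 0

features = [f1]
weights = [1.5]  # hypothetical weight w_1, not given in the slides

def unnormalized(state):
    """exp(sum_i w_i * f_i(state)) for state = (smoking, cancer)."""
    return math.exp(sum(w * f(*state) for w, f in zip(weights, features)))

states = list(product([False, True], repeat=2))   # (smoking, cancer)
Z = sum(unnormalized(s) for s in states)          # normalizing constant
prob = {s: unnormalized(s) / Z for s in states}

for s in states:
    print(s, round(prob[s], 4))
```

With a positive weight, the one state that violates the feature (Smoking without Cancer) receives the lowest probability, but it is not ruled out; a larger weight would push its probability closer to zero.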

Descriptive modeling: learning

Cluster analysis

Huge body of work
- Unsupervised learning, segmentation, etc.
Difficult to evaluate success
- If goal is to find “interesting” clusters, then it is difficult to quantify
- If goal is to find “similar” clusters, then success depends on distance measure (circular)

Clustering algorithms

Types:
- Partition-based methods
- Hierarchical clustering (divisive/agglomerative)
- Probabilistic model-based methods
Different algorithms find clusters of different “shapes”
- Appropriate shape will depend on application, match method to objectives

K-means

Algorithm idea:
- Start with k randomly chosen centroids
- Repeat until no changes in assignments
  - Assign instances to closest centroid
  - Recompute cluster centroids
Score function?
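The loop above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes squared Euclidean distance, picks the initial centroids by sampling k data points, and reports the within-cluster sum of squared errors (SSE), which is a commonly used score function for k-means. The names `kmeans`, `sq_dist`, and `mean`, the `max_iter` safeguard, and the data are all illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start with k randomly chosen instances
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:   # no changes in assignments: stop
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, sse

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]  # illustrative data
centroids, assignment, sse = kmeans(points, k=2)
print(assignment, round(sse, 4))
```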

K-means example I

[Figure: k-means example plots; not recoverable from the extracted text]

K-means example II

[Figure: k-means example plots; not recoverable from the extracted text]

Algorithm details

Does it terminate?
Does it converge to an optimal solution?
What is the time complexity?

K-means

Strengths:
- Relatively efficient: O(tkn) for n instances, k clusters, and t iterations
- Finds spherical clusters
Weaknesses:
- Terminates at local optimum (sensitive to initial seeds)
- Applicable only when mean is defined
- Need to specify k
- Susceptible to outliers/noise
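The sensitivity to initial seeds can be seen on a tiny example. The sketch below runs the assign/recompute loop from two different sets of starting centroids on the four corners of a 4-by-1 rectangle. The data, the starting centroids, and the function name `lloyd` are illustrative assumptions. One start reaches the left/right split with SSE 1; the other is already stable at a top/bottom split with SSE 16 and never improves.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (centroids, sse)."""
    centroids = list(centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, sse

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_centroids, good_sse = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])
bad_centroids, bad_sse = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_sse, bad_sse)  # 1.0 16.0
```

A common mitigation is to run the algorithm from several random starts and keep the result with the lowest SSE.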

Variations

Selection of initial centroids
Algorithm:
- Recompute centroid after each point is assigned
- Allow for merge and split of clusters
When mean is undefined
- K-medoids: use one of the data points as cluster center
- K-modes: use a categorical distance measure and a frequency-based update method
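A minimal sketch of the k-medoids center update, assuming the medoid is chosen as the cluster member with the smallest total distance to the other members and using Manhattan distance for illustration. The function names and data are hypothetical. It also shows why a medoid is less affected by an outlier than the mean.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two points (tuples)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def medoid(cluster, dist=manhattan):
    """Return the member of the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))

def mean(cluster):
    """Coordinate-wise mean, for comparison with the medoid."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

cluster = [(1.0,), (2.0,), (3.0,), (4.0,), (100.0,)]  # last point is an outlier

center_medoid = medoid(cluster)
center_mean = mean(cluster)
print(center_medoid, center_mean)  # (3.0,) (22.0,)
```

Because the medoid only requires a pairwise distance, the same update works for data where a mean is undefined, as long as a suitable distance function is supplied.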

Next class

Reading: Chapter 9 PDM
Topic: Descriptive modeling: learning
Midterm: Mar 10, 8-10pm, LWSN B
