Data Mining: Techniques for Summarization and Dimensionality Reduction - Prof. Jennifer L.
Data Mining
CS57300 / STAT 59800-
Purdue University, January 27, 2009
Data exploration and visualization
Visualization
- Human eye/brain have evolved powerful methods to detect structure in nature
- Display data in ways that exploit human pattern recognition abilities
- Limitation: Can be difficult to apply if data size (number of dimensions or instances) is large
Exploratory data analysis
- Data analysis approach that employs a number of (mostly graphical) techniques to:
- Maximize insight into data
- Uncover underlying structure
- Identify important variables
- Detect outliers and anomalies
- Test underlying modeling assumptions
- Develop parsimonious models
- Generate hypotheses from data
Data summarization
- Mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$
- Measures of dispersion or variability
- Variance: $\hat{\sigma}_k^2 = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \mu)^2$
- Standard deviation: $\hat{\sigma}_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \mu)^2}$
- Range: difference between max and min point
- Interquartile range: difference between 1st and 3rd quartiles
- Skew: $\frac{\sum_{i=1}^{n} (x(i) - \hat{\mu})^3}{\left( \sum_{i=1}^{n} (x(i) - \hat{\mu})^2 \right)^{3/2}}$
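As a rough illustration of these summary measures, the sketch below computes them in plain Python. The function name `summarize`, the sample data, and the quartile convention (linear interpolation between order statistics) are assumptions made for this example; the slides do not fix a quartile rule. The sample mean stands in for μ in the variance.

```python
import math

def summarize(xs):
    """Return mean, variance, std, range, IQR, and skew for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    # Sum of squared deviations from the sample mean (used in place of mu).
    ss = sum((x - mean) ** 2 for x in xs)
    var = ss / n                      # divide by n, as in the slide's formula
    std = math.sqrt(var)
    s = sorted(xs)

    def quantile(q):
        # Assumed convention: linear interpolation between order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    # Skew: sum of cubed deviations over (sum of squared deviations)^(3/2).
    skew = sum((x - mean) ** 3 for x in xs) / ss ** 1.5
    return {
        "mean": mean,
        "variance": var,
        "std": std,
        "range": s[-1] - s[0],
        "iqr": quantile(0.75) - quantile(0.25),
        "skew": skew,
    }

stats = summarize([1, 2, 3, 4, 10])
print(stats)
```

The single large value (10) pulls the skew above zero, which is the kind of asymmetry the measure is meant to flag.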
Histograms (1D)
- Most common plot for univariate data
- Split data range into equal-sized bins, count number of data points that fall into each bin
- Graphically shows:
- Center (location)
- Spread (scale)
- Skew
- Outliers
- Multiple modes
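The binning step can be sketched as follows. This is a minimal illustration, assuming the data are not all identical (so the bin width is non-zero) and that the maximum value is counted in the last bin; `histogram` and the sample data are hypothetical names and values.

```python
def histogram(xs, num_bins):
    """Split [min, max] into equal-width bins and count points per bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins      # assumes hi > lo
    counts = [0] * num_bins
    for x in xs:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(data, 4)

# Text rendering: one row per bin, labelled by its left edge.
for left, c in zip(edges, counts):
    print(f"{left:5.1f} | {'#' * c}")
```

Even in this text form, the center, the spread, and the isolated point at 9 are visible.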
Example histogram
Histogram limitations
- Histograms can be misleading for small datasets
- Slight changes in the data or binning approach can result in different histograms
- Solution: smoothed density plots
- Use kernel function to estimate density at each point x, pools information from neighboring points
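A smoothed density estimate might look like the sketch below. The Gaussian kernel and the bandwidth value are assumptions for illustration; the slides do not name a kernel or a bandwidth rule.

```python
import math

def gaussian_kde(xs, bandwidth):
    """Return a function estimating the density at any point x."""
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each data point contributes a Gaussian bump centred on itself,
        # so nearby points pool their information at x.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs
        )

    return density

data = [1.0, 1.2, 1.9, 2.1, 5.0]
f = gaussian_kde(data, bandwidth=0.5)

# Numerically check that the estimate integrates to about 1 over [-5, 11].
step = 0.01
grid = [i * step for i in range(-500, 1101)]
area = sum(f(x) for x in grid) * step
print(round(area, 3), f(1.5), f(3.5))
```

A larger bandwidth gives a smoother curve, while a smaller one follows individual points more closely, so the bandwidth plays a role similar to the bin width in a histogram.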
Box plot (2D)
- Display relationship between discrete and continuous variables
- For each discrete value X, calculate quartiles and range of associated Y values
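The per-group computation behind a box plot can be sketched as below. The helper names and sample pairs are hypothetical, and the quartiles again use linear interpolation, which is only one of several common conventions.

```python
from collections import defaultdict

def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics (assumed rule)."""
    n = len(sorted_vals)
    pos = q * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def box_stats(pairs):
    """Group Y values by discrete X and compute the five box-plot numbers."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    result = {}
    for x, ys in groups.items():
        ys.sort()
        result[x] = {
            "min": ys[0],
            "q1": quantile(ys, 0.25),
            "median": quantile(ys, 0.5),
            "q3": quantile(ys, 0.75),
            "max": ys[-1],
        }
    return result

pairs = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("a", 5),
         ("b", 10), ("b", 20), ("b", 30)]
stats = box_stats(pairs)
for x, s in stats.items():
    print(x, s)
```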
Scatter plot (2D)
- Most common plot for bivariate data
- Horizontal X axis: the suspected independent variable
- Vertical Y axis: the suspected dependent variable
- Graphically shows:
- If X and Y are related
- Linear or non-linear relationship
- If the variation in Y depends on X
- Outliers
No relationship
Linear relationship
Heteroskedastic
Scatterplot limitations
- Too much data: the plot turns into a black rectangle
- Overprinting
Contour plot (3D)
- Represents a 3D surface by plotting constant z slices (contours) in a 2D format
- Can overcome some limitations of 2D scatterplots
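One way to obtain a z surface from a dense scatterplot, assumed here rather than stated in the slides, is to bin the points on a 2D grid and treat the counts as z. The sketch below builds such a grid and marks the cells at or above a chosen contour level; all names, ranges, and the level are hypothetical.

```python
def bin_counts_2d(points, x_range, y_range, nx, ny):
    """Count points in an nx-by-ny grid; grid[row][col] indexes (y, x)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
            continue  # ignore points outside the plotting window
        # Clamp so that points on the upper boundary fall in the last cell.
        col = min(int((x - x_lo) / (x_hi - x_lo) * nx), nx - 1)
        row = min(int((y - y_lo) / (y_hi - y_lo) * ny), ny - 1)
        grid[row][col] += 1
    return grid

points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.5, 0.5), (2.0, 2.0)]
grid = bin_counts_2d(points, (0.0, 1.0), (0.0, 1.0), nx=2, ny=2)

# Cells whose count reaches the level lie on or inside that contour.
level = 2
inside = [[count >= level for count in row] for row in grid]
print(grid, inside)
```

Because each cell stores a count rather than individual marks, overlapping points add to the surface instead of hiding one another.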
Scatterplot matrix
Higher dimensions
Dimensionality reduction
- Principal component analysis (PCA)
- Linear transformation, minimize unexplained variance
- Factor analysis
- Linear combination of small number of latent variables
- Multidimensional scaling (MDS)
- Project into low-dimensional subspace while preserving distance between points (can be non-linear)
Principal component analysis
- Task: Reduce dimensionality of data while capturing intrinsic variability
- Data representation: X data matrix (n x p)
- Knowledge representation:
- Set of alternative dimensions k, where each k is a weighted linear combination of the original p variables (e.g., $2x_1 + 3x_2 + x_3$)
- Each k is represented by a p-dimensional vector of weights (e.g., [2,3,1])
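To make the representation concrete, the sketch below maps one instance with p = 3 variables onto two alternative dimensions. The first weight vector is the slide's example [2, 3, 1]; the second vector and the data values are made up for illustration, and neither is normalized the way a fitted PCA solution would be.

```python
def to_new_dimensions(x, weight_vectors):
    """Map a p-dimensional instance x to one coordinate per weight vector."""
    return [
        sum(a_j * x_j for a_j, x_j in zip(weights, x))
        for weights in weight_vectors
    ]

weight_vectors = [
    [2, 3, 1],    # the slide's example: 2*x1 + 3*x2 + x3
    [1, 0, -1],   # hypothetical second dimension: x1 - x3
]
x = [1.0, 0.5, 4.0]
coords = to_new_dimensions(x, weight_vectors)
print(coords)
```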
Principal component analysis
- Learning:
- Evaluation function: Squared deviation from original points to projected points; can show that this corresponds to maximizing variance along k
- Search: Maximize variance, which corresponds to solving an eigensystem with the covariance matrix: $\Sigma = E[(X - E[X])(X - E[X])^T]$, $\Sigma a = \lambda a$
- Inference:
- Project points into new space: $a^T x = \sum_{j=1}^{p} a_j x_j$
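The learning and inference steps can be sketched in plain Python as follows. Two choices here are assumptions, not the slides' method: the covariance is estimated from the sample with a 1/n divisor, and power iteration is used as a simple stand-in for a full eigensolver, so only the top eigenvector is found. The data are made up, and points are centered before projection.

```python
import math

def covariance_matrix(rows):
    """Sample estimate of the covariance matrix (1/n divisor) and column means."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[j] * c[k] for c in centered) / n for k in range(p)]
        for j in range(p)
    ]
    return means, cov

def mat_vec(m, v):
    return [sum(m[j][k] * v[k] for k in range(len(v))) for j in range(len(m))]

def top_eigenpair(cov, iterations=500):
    """Power iteration: repeatedly apply cov and renormalize."""
    p = len(cov)
    a = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        b = mat_vec(cov, a)
        norm = math.sqrt(sum(v * v for v in b))
        a = [v / norm for v in b]
    # Rayleigh quotient gives the eigenvalue for the unit vector a.
    lam = sum(ai * bi for ai, bi in zip(a, mat_vec(cov, a)))
    return lam, a

rows = [[2.0, 1.9], [0.5, 0.7], [1.0, 1.1], [3.0, 3.2], [1.5, 1.3]]
means, cov = covariance_matrix(rows)
lam, a = top_eigenpair(cov)

# Inference: project each centered point onto a, i.e. the sum of a_j * x_j.
scores = [sum(a[j] * (r[j] - means[j]) for j in range(len(a))) for r in rows]
print(lam, a, scores)
```

The variance of the projected scores equals the eigenvalue `lam`, which is why maximizing variance along a direction leads to the eigensystem.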
PCA (cont'd)
- Project data onto top k eigenvectors
- Calculate variance of projected data: $\sum_{j=1}^{k} \lambda_j$
- Use scree plot to choose number of dimensions
- Choose k < p so projected data capture much of the variance of original data
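A scree plot is read by eye, but the same idea can be approximated numerically. The sketch below picks the smallest k whose eigenvalues account for a target fraction of the total variance; the 0.9 threshold and the eigenvalues are hypothetical, and a fixed threshold is only a stand-in for judging the elbow visually.

```python
def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose top-k eigenvalues reach `threshold` of total variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for k, lam in enumerate(lams, start=1):
        running += lam            # variance of data projected on top k
        if running / total >= threshold:
            return k, running / total
    return len(lams), 1.0

eigenvalues = [4.0, 2.5, 0.3, 0.15, 0.05]   # made-up values for illustration
k, fraction = choose_k(eigenvalues)
print(k, round(fraction, 3))
```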
PCA example
Next class
- Homework 1 due
- Reading: Chapter 4 PDM
- Topic: Statistics background