Data Mining: Techniques for Summarization and Dimensionality Reduction - Prof. Jennifer L., Exams of Data Analysis & Statistical Methods

This document from Purdue University covers data exploration and visualization techniques, including histograms, bar plots, density plots, box plots, scatter plots, and contour plots. It also discusses data summarization methods such as measures of location and dispersion, and dimensionality reduction techniques such as principal component analysis (PCA).

Uploaded on 07/30/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
January 27, 2009

Data exploration and visualization

Visualization

  • The human eye and brain have evolved powerful methods to detect structure in nature
  • Display data in ways that exploit human pattern-recognition abilities
  • Limitation: can be difficult to apply when the data are large (many dimensions or many instances)

Exploratory data analysis

  • Data analysis approach that employs a number of (mostly graphical) techniques to:
    • Maximize insight into data
    • Uncover underlying structure
    • Identify important variables
    • Detect outliers and anomalies
    • Test underlying modeling assumptions
    • Develop parsimonious models
    • Generate hypotheses from data

Data summarization

  • Measures of location
    • Mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$
  • Measures of dispersion or variability
    • Variance: $\hat{\sigma}_k^2 = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \hat{\mu})^2$
    • Standard deviation: $\hat{\sigma}_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \hat{\mu})^2}$
    • Range: difference between the maximum and minimum points
    • Interquartile range: difference between the 3rd and 1st quartiles
    • Skew: $\frac{\sum_{i=1}^{n} (x(i) - \hat{\mu})^3}{\left( \sum_{i=1}^{n} (x(i) - \hat{\mu})^2 \right)^{3/2}}$
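The summaries above can be computed directly from their definitions. The following is a minimal sketch in plain Python that uses the 1/n forms from the slide. The function names `quantile` and `summarize`, the sample data, and the linear-interpolation rule for quartiles are assumptions made for illustration; other quartile conventions give slightly different interquartile ranges.

```python
import math

def quantile(sorted_xs, q):
    """Linearly interpolated quantile (one common convention, assumed here)."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

def summarize(xs):
    """Location and dispersion summaries using the 1/n forms."""
    n = len(xs)
    mean = sum(xs) / n
    sq = sum((x - mean) ** 2 for x in xs)   # sum of squared deviations
    cu = sum((x - mean) ** 3 for x in xs)   # sum of cubed deviations
    s = sorted(xs)
    return {
        "mean": mean,
        "variance": sq / n,
        "std": math.sqrt(sq / n),
        "range": s[-1] - s[0],
        "iqr": quantile(s, 0.75) - quantile(s, 0.25),
        # Skew as defined above: cubed deviations over squared deviations^(3/2).
        "skew": cu / sq ** 1.5 if sq > 0 else 0.0,
    }

# Hypothetical sample with one large value, so the skew is positive.
stats = summarize([1, 2, 3, 4, 10])
print(stats)
```

The single large value pulls the mean above the median and produces a positive skew, which is the kind of asymmetry a histogram would also reveal.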

Histograms (1D)

  • Most common plot for univariate data
  • Split the data range into equal-sized bins and count the number of data points that fall into each bin
  • Graphically shows:
    • Center (location)
    • Spread (scale)
    • Skew
    • Outliers
    • Multiple modes
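The binning step can be sketched in a few lines of plain Python. The function name `histogram`, the sample data, and the choice to place the maximum value in the last bin are assumptions for illustration; plotting libraries may treat bin edges differently.

```python
def histogram(xs, num_bins):
    """Split [min, max] into equal-width bins and count the points in each."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in xs:
        # The maximum lies on the right edge, so clamp it into the last bin.
        idx = min(int((x - lo) / width), num_bins - 1) if width > 0 else 0
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical sample data.
edges, counts = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], num_bins=4)
for left, count in zip(edges, counts):
    print(f"{left:4.1f} | {'#' * count}")
```

Changing `num_bins` on a small sample like this one quickly changes the apparent shape, which is worth keeping in mind when reading a histogram of few points.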

Example histogram

Histogram limitations

  • Histograms can be misleading for small datasets
    • Slight changes in the data or binning approach can result in different histograms
  • Solution: smoothed density plots
    • Use a kernel function to estimate the density at each point x, pooling information from neighboring points
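A kernel density estimate can be sketched as follows. The slide does not name a kernel, so the Gaussian kernel, the fixed bandwidth of 0.5, the function name `gaussian_kde`, and the sample data are all assumptions for illustration.

```python
import math

def gaussian_kde(xs, bandwidth):
    """Return a function estimating the density at any point x.

    Each data point contributes a Gaussian bump centred on itself, so the
    estimate at x pools information from the points near x.
    """
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs
        )

    return density

# Hypothetical sample with two clusters.
data = [1.0, 1.5, 2.0, 4.0, 4.2]
density = gaussian_kde(data, bandwidth=0.5)

# Numerical check that the estimate integrates to roughly 1.
step = 0.01
total = sum(density(-4.0 + i * step) for i in range(1400)) * step
print(round(total, 4))
```

The bandwidth plays the role that bin width plays in a histogram: smaller values follow the data more closely, larger values smooth more.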

Box plot (2D)

  • Displays the relationship between a discrete variable and a continuous variable
  • For each discrete value of X, calculate the quartiles and range of the associated Y values
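The per-group numbers behind a box plot can be computed as in the sketch below. The names `quantile` and `box_stats`, the sample pairs, and the linear-interpolation quartile rule are assumptions for illustration; plotting libraries may use a different quartile convention and usually add whisker and outlier rules on top of these values.

```python
from collections import defaultdict
import math

def quantile(sorted_ys, q):
    """Linearly interpolated quantile (one common convention, assumed here)."""
    pos = q * (len(sorted_ys) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_ys) - 1)
    frac = pos - lo
    return sorted_ys[lo] * (1 - frac) + sorted_ys[hi] * frac

def box_stats(pairs):
    """Group Y values by discrete X, then summarize each group."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    result = {}
    for x, ys in groups.items():
        ys.sort()
        result[x] = {
            "min": ys[0],
            "q1": quantile(ys, 0.25),
            "median": quantile(ys, 0.5),
            "q3": quantile(ys, 0.75),
            "max": ys[-1],
        }
    return result

# Hypothetical (X, Y) pairs with two discrete X values.
pairs = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("a", 5),
         ("b", 10), ("b", 20), ("b", 30)]
boxes = box_stats(pairs)
print(boxes)
```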

Scatter plot (2D)

  • Most common plot for bivariate data
    • Horizontal X axis: the suspected independent variable
    • Vertical Y axis: the suspected dependent variable
  • Graphically shows:
    • Whether X and Y are related
    • Whether the relationship is linear or non-linear
    • Whether the variation in Y depends on X
    • Outliers

No relationship

Linear relationship

Heteroskedastic

Scatterplot limitations

  • Too much data: the plot degenerates into a black rectangle
  • Overprinting

Contour plot (3D)

  • Represents a 3D surface by plotting constant z slices (contours) in a 2D format
  • Can overcome some limitations of 2D scatterplots

Scatterplot matrix

Higher dimensions

Dimensionality reduction

  • Principal component analysis (PCA)
    • Linear transformation, minimize unexplained variance
  • Factor analysis
    • Linear combination of small number of latent variables
  • Multidimensional scaling (MDS)
    • Project into low-dimensional subspace while preserving distance between points (can be non-linear)

Principal component analysis

  • Task: Reduce dimensionality of data while capturing intrinsic variability
  • Data representation: data matrix X (n × p)
  • Knowledge representation:
    • Set of alternative dimensions k, where each k is a weighted linear combination of the original p variables (e.g., $2x_1 + 3x_2 + x_3$)
    • Each k is represented by a p-dimensional vector of weights (e.g., [2,3,1])

Principal component analysis

  • Learning:
    • Evaluation function: squared deviation between the original points and their projections; minimizing it can be shown to be equivalent to maximizing the variance along k
    • Search: maximizing the variance corresponds to solving an eigensystem for the covariance matrix: $\Sigma = E[(X - E[X])(X - E[X])^T]$, $\Sigma a = \lambda a$
  • Inference:
    • Project points into the new space: $a^T x = \sum_{j=1}^{p} a_j x_j$
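The learning and inference steps can be sketched in plain Python. This is an illustration under stated assumptions, not the course's implementation: it estimates the covariance matrix with the 1/n form, finds only the top eigenvector by power iteration (a simple stand-in for a full eigensolver), and projects with the uncentered $a^T x$ as written above. The function names and the sample data are hypothetical, and power iteration can fail if the starting vector happens to be orthogonal to the top eigenvector.

```python
import math

def covariance_matrix(rows):
    """Sample covariance (1/n form) of an n x p data matrix given as lists."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
           for a in range(p)]
    return means, cov

def top_eigenpair(cov, iters=500):
    """Power iteration: repeatedly apply cov and renormalize."""
    p = len(cov)
    a = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        b = [sum(cov[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in b))
        if norm == 0.0:
            break
        a = [v / norm for v in b]
    # Rayleigh quotient a^T cov a gives the eigenvalue for the unit vector a.
    lam = sum(a[i] * sum(cov[i][j] * a[j] for j in range(p)) for i in range(p))
    return lam, a

def project(x, a):
    """Inference step: a^T x = sum_j a_j x_j."""
    return sum(aj * xj for aj, xj in zip(a, x))

# Hypothetical 2-variable data with a strong positive relationship.
rows = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
means, cov = covariance_matrix(rows)
lam, a = top_eigenpair(cov)
scores = [project(x, a) for x in rows]
print(lam, a)
```

The variance of `scores` equals the eigenvalue `lam`, which connects the eigensystem to the "maximize variance" objective. Many implementations center the data before projecting; that shifts the scores but leaves their variance unchanged.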

PCA (cont'd)

  • Project data onto the top k eigenvectors
  • Calculate the variance of the projected data: $\sum_{j=1}^{k} \lambda_j$
  • Use a scree plot to choose the number of dimensions
    • Choose k < p so that the projected data capture much of the variance of the original data
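Choosing k can be illustrated numerically. Because the total variance equals the sum of all p eigenvalues, the fraction captured by the top k components is the sum of the first k eigenvalues divided by the sum of all of them. The sketch below swaps the visual scree-plot judgment for a simple cumulative threshold; that rule, the function names, the 0.9 default, and the eigenvalues are all assumptions for illustration.

```python
def cumulative_variance_fraction(eigenvalues):
    """Fraction of total variance captured by the top 1, 2, ..., p components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    fractions, running = [], 0.0
    for lam in ordered:
        running += lam
        fractions.append(running / total)
    return fractions

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose components capture at least `threshold` of the variance."""
    fractions = cumulative_variance_fraction(eigenvalues)
    for k, frac in enumerate(fractions, start=1):
        if frac >= threshold:
            return k
    return len(fractions)

# Hypothetical eigenvalues of a 5 x 5 covariance matrix.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
fractions = cumulative_variance_fraction(eigenvalues)
k = choose_k(eigenvalues, threshold=0.9)
print(fractions, k)
```

A scree plot shows the same eigenvalues in decreasing order, and the usual reading is to look for the point where the curve flattens rather than to apply a fixed cutoff.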

PCA example

Next class

  • Homework 1 due
  • Reading: Chapter 4 PDM
  • Topic: Statistics background