Data Mining: Techniques for Summarization and Dimensionality Reduction - Prof. Jennifer L., Exams of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Jennifer L. Neville

This document from Purdue University covers data exploration and visualization techniques, including histograms, bar plots, density plots, box plots, scatter plots, and contour plots. It also discusses data summarization methods such as measures of location and dispersion, and dimensionality reduction techniques like principal component analysis (PCA).

Typology: Exams

Pre 2010

Uploaded on 07/30/2009

koofers-user-8jv 🇺🇸


Data Mining

CS57300 / STAT 59800-024

Purdue University

January 27, 2009

Data exploration and visualization


Visualization

Human eye/brain have evolved powerful methods to detect structure in nature
Display data in ways that exploit human pattern recognition abilities
Limitation: Can be difficult to apply if data size (number of dimensions or instances) is large

Exploratory data analysis

Data analysis approach that employs a number of (mostly graphical) techniques to:
- Maximize insight into data
- Uncover underlying structure
- Identify important variables
- Detect outliers and anomalies
- Test underlying modeling assumptions
- Develop parsimonious models
- Generate hypotheses from data

Data summarization

Measures of location
- Mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$

Measures of dispersion or variability
- Variance: $\hat{\sigma}_k^2 = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \hat{\mu})^2$
- Standard deviation: $\hat{\sigma}_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \hat{\mu})^2}$
- Range: difference between the maximum and minimum points
- Interquartile range: difference between the 3rd and 1st quartiles
- Skew: $\frac{\sum_{i=1}^{n} (x(i) - \hat{\mu})^3}{\left(\sum_{i=1}^{n} (x(i) - \hat{\mu})^2\right)^{3/2}}$
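
The summary measures above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `summarize` and the sample values are made up for the example, and the quartiles use the default method of `statistics.quantiles`, which is only one of several common conventions.

```python
import math
import statistics

def summarize(xs):
    """Compute the location and dispersion measures listed above."""
    n = len(xs)
    mean = sum(xs) / n
    # Sums of squared and cubed deviations from the mean.
    ss2 = sum((x - mean) ** 2 for x in xs)
    ss3 = sum((x - mean) ** 3 for x in xs)
    variance = ss2 / n  # divides by n, as in the variance formula above
    std = math.sqrt(variance)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": mean,
        "variance": variance,
        "std": std,
        "range": max(xs) - min(xs),
        "iqr": q3 - q1,
        # Skew as written above; other texts normalize it differently.
        "skew": ss3 / ss2 ** 1.5 if ss2 > 0 else 0.0,
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A positive skew value here reflects the longer tail toward the large values (7 and 9).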

Histograms (1D)

Most common plot for univariate data
Split data range into equal-sized bins, count number of data points that fall into each bin
Graphically shows:
- Center (location)
- Spread (scale)
- Skew
- Outliers
- Multiple modes
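
The binning step described above can be sketched as follows. This is an illustrative implementation, not a reference one: the helper name `histogram`, the choice of four bins, and the sample data are assumptions made for the example.

```python
def histogram(xs, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in xs:
        # The maximum value would index one past the end, so clamp it
        # into the last bin (the last bin is closed on the right).
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample with one outlier
edges, counts = histogram(data, num_bins=4)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {'#' * c}")
```

In the printed bars the center, the right skew, and the isolated value 9 are all visible, which is the kind of structure the list above refers to.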

Example histogram

Histogram limitations

Histograms can be misleading for small datasets
- Slight changes in the data or binning approach can result in different histograms
Solution: smoothed density plots
- Use kernel function to estimate density at each point x, pools information from neighboring points
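
One way to make the smoothed density idea concrete is a Gaussian kernel density estimate. The slides do not say which kernel or bandwidth is intended, so the Gaussian kernel, the bandwidth of 0.5, and the sample data below are all assumptions for illustration.

```python
import math

def gaussian_kde(xs, bandwidth):
    """Return a function estimating the density at any point x.

    Each data point contributes a Gaussian bump centered on itself, so the
    estimate at x pools information from neighboring points.
    """
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs
        )

    return density

data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0]  # made-up sample
density = gaussian_kde(data, bandwidth=0.5)  # bandwidth is an assumed value
for x in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
    print(f"x={x:.1f}  density={density(x):.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely, so the bandwidth plays a role similar to the bin width of a histogram.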

Box plot (2D)

Display relationship between discrete and continuous variables
For each discrete value X, calculate quartiles and range of associated Y values
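
The per-group calculation behind a box plot can be sketched as below. The function name `box_stats`, the group labels, and the data are hypothetical, and the "inclusive" quartile method is just one convention; plotting libraries may use a slightly different rule.

```python
import statistics
from collections import defaultdict

def box_stats(pairs):
    """Group Y values by discrete X and compute quartiles and range per group."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    stats = {}
    for x, ys in groups.items():
        # Three cut points: first quartile, median, third quartile.
        q1, median, q3 = statistics.quantiles(ys, n=4, method="inclusive")
        stats[x] = {
            "min": min(ys),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(ys),
        }
    return stats

# Made-up data: two discrete X values with five Y values each.
pairs = [("A", y) for y in [1, 2, 3, 4, 5]]
pairs += [("B", y) for y in [10, 20, 30, 40, 50]]
stats = box_stats(pairs)
for label, s in stats.items():
    print(label, s)
```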

Scatter plot (2D)

Most common plot for bivariate data
- Horizontal X axis: the suspected independent variable
- Vertical Y axis: the suspected dependent variable
Graphically shows:
- If X and Y are related
- Linear or non-linear relationship
- If the variation in Y depends on X
- Outliers
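
The questions a scatter plot answers visually can also be checked numerically. The sketch below is an assumption-laden illustration rather than a standard test: it fits a least-squares line, reports the Pearson correlation, and compares the residual spread in the upper half of X against the lower half as a rough indicator of whether the variation in Y depends on X. The helper names and the synthetic data are made up.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept, plus the Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def residual_spread_ratio(xs, ys):
    """Mean squared residual for the upper half of X divided by the lower half."""
    slope, intercept, _ = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    half = len(pairs) // 2

    def mean_sq(part):
        return sum((y - (intercept + slope * x)) ** 2 for x, y in part) / len(part)

    return mean_sq(pairs[half:]) / mean_sq(pairs[:half])

# Synthetic data: a linear trend whose scatter grows with x.
xs = list(range(1, 21))
ys = [2 * x + (0.5 * x if x % 2 == 0 else -0.5 * x) for x in xs]

slope, intercept, r = fit_line(xs, ys)
ratio = residual_spread_ratio(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} ratio={ratio:.2f}")
```

A ratio well above 1 suggests the spread of Y grows with X, which would show up in the scatter plot as a fan shape.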

No relationship

Linear relationship

Heteroskedastic

Scatterplot limitations

- Too much data (black rectangle)
- Overprinting

Contour plot (3D)

Represents a 3D surface by plotting constant z slices (contours) in a 2D format
Can overcome some limitations of 2D scatterplots
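
The idea of constant z slices can be illustrated without a plotting library by evaluating a surface on a grid and marking which contour band each grid point falls into. The surface function, the three contour levels, and the grid spacing below are all hypothetical choices for the sketch.

```python
import math

def surface(x, y):
    """Hypothetical example surface: a smooth bump centered at the origin."""
    return math.exp(-(x * x + y * y))

def contour_band(z, levels):
    """Return how many contour levels z has reached (0 = below the lowest)."""
    return sum(1 for level in levels if z >= level)

levels = [0.25, 0.5, 0.75]  # assumed z values at which to slice the surface
grid = [-1.5 + 0.25 * i for i in range(13)]  # -1.5 to 1.5 in steps of 0.25

# One character per grid point; the boundaries between characters trace
# the contours of constant z.
for y in reversed(grid):
    print("".join(" .o#"[contour_band(surface(x, y), levels)] for x in grid))
```

The printed rings are a coarse text version of what a contour plot draws as smooth curves.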

Scatterplot matrix

Higher dimensions

Dimensionality reduction

Principal component analysis (PCA)
- Linear transformation, minimize unexplained variance
Factor analysis
- Linear combination of small number of latent variables
Multidimensional scaling (MDS)
- Project into low-dimensional subspace while preserving distance between points (can be non-linear)

Principal component analysis

Task: Reduce dimensionality of data while capturing intrinsic variability
Data representation: X data matrix (n x p)
Knowledge representation:
- Set of alternative dimensions k, where each k is a weighted linear combination of the original p variables (e.g., $2x_1 + 3x_2 + x_3$)
- Each k is represented by a p-dimensional vector of weights (e.g., [2,3,1])
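
As a small worked example of this representation, the weight vector [2,3,1] from the slide maps a 3-dimensional point to a single value along the new dimension. The function name `project` and the sample point are made up for illustration; in practice PCA weight vectors are usually scaled to unit length, which the slide's example weights are not.

```python
def project(x, weights):
    """Weighted linear combination of the original variables."""
    return sum(w * xj for w, xj in zip(weights, x))

weights = [2, 3, 1]    # example weights from the slide
x = [1.0, 0.5, 4.0]    # made-up data point with p = 3 variables
value = project(x, weights)
print(value)           # 2*1.0 + 3*0.5 + 1*4.0
```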

Principal component analysis

Learning:
- Evaluation function: Squared deviation from original points to projected points; it can be shown that minimizing this corresponds to maximizing variance along k
- Search: Maximize variance, which corresponds to solving an eigensystem with the covariance matrix $\Sigma$:
  $\Sigma = E[(X - E[X])(X - E[X])^T]$
  $\Sigma a = \lambda a$
Inference:
- Project points into new space: $a^T x = \sum_{j=1}^{p} a_j x_j$
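
The learning and inference steps can be sketched end to end in plain Python. The slides only say that the eigensystem is solved; the power iteration used here is a swapped-in technique chosen because it needs no external library, and it finds only the leading eigenvector. The function names, the starting vector, the iteration count, and the data are assumptions, and the points are centered before projection.

```python
import math

def covariance(X):
    """Column means and covariance matrix (dividing by n) of an n x p data matrix."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    S = [[sum(r[a] * r[b] for r in centered) / n for b in range(p)]
         for a in range(p)]
    return means, S

def mat_vec(S, a):
    return [sum(s_ij * a_j for s_ij, a_j in zip(row, a)) for row in S]

def leading_eigenpair(S, iterations=500):
    """Power iteration: repeatedly apply S and renormalize."""
    p = len(S)
    a = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start (assumption)
    start_norm = math.sqrt(sum(v * v for v in a))
    a = [v / start_norm for v in a]
    for _ in range(iterations):
        b = mat_vec(S, a)
        norm = math.sqrt(sum(v * v for v in b))
        if norm == 0.0:
            break
        a = [v / norm for v in b]
    # Rayleigh quotient a^T S a gives the eigenvalue for unit-length a.
    lam = sum(ai * bi for ai, bi in zip(a, mat_vec(S, a)))
    return lam, a

# Made-up 5 x 2 data matrix with strongly correlated columns.
X = [[2.0, 1.9], [0.5, 0.8], [3.1, 3.0], [1.0, 1.2], [2.4, 2.6]]
means, S = covariance(X)
lam, a = leading_eigenpair(S)

# Inference: project each centered point onto the eigenvector, a^T (x - mean).
scores = [sum(aj * (xj - mj) for aj, xj, mj in zip(a, row, means)) for row in X]
print("eigenvalue:", round(lam, 4))
print("eigenvector:", [round(v, 4) for v in a])
print("scores:", [round(s, 4) for s in scores])
```

The variance of the projected scores equals the eigenvalue, which is the sense in which the leading eigenvector is the direction of maximum variance.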

PCA (cont’)

Project data onto top k eigenvectors
Calculate variance of projected data: $\sum_{j=1}^{k} \lambda_j$
Use scree plot to choose number of dimensions
- Choose k<p so projected data capture much of the variance of original data
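
The choice of k can be sketched numerically from the eigenvalues. A scree plot is normally read by eye, so the 90% threshold rule below is an assumption standing in for that judgment, and the eigenvalues and function names are made up for the example.

```python
def explained_variance(eigenvalues):
    """Cumulative fraction of total variance captured by the top k eigenvalues."""
    total = sum(eigenvalues)
    running, fractions = 0.0, []
    for lam in sorted(eigenvalues, reverse=True):
        running += lam  # variance of data projected onto the top k eigenvectors
        fractions.append(running / total)
    return fractions

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose cumulative fraction reaches the (assumed) threshold."""
    for k, frac in enumerate(explained_variance(eigenvalues), start=1):
        if frac >= threshold:
            return k
    return len(eigenvalues)

eigenvalues = [4.0, 2.5, 1.0, 0.3, 0.2]  # made-up eigenvalues, p = 5
fractions = explained_variance(eigenvalues)
k = choose_k(eigenvalues)
for j, frac in enumerate(fractions, start=1):
    print(f"k={j}: {frac:.3f}")
print("chosen k:", k)
```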

PCA example

Next class