Exploratory Data Analysis and Data Visualization in Principles of Data Mining (Lecture #3), Study notes of Computer Science

The third lecture of the Principles of Data Mining course focuses on exploratory data analysis (EDA) and data visualization techniques. Topics include statistical graphics, data reduction methods such as PCA and multidimensional scaling, and data plots such as histograms, box plots, and scatter plots. The lecture also discusses the role of EDA in maximizing insight into a data set, uncovering underlying structure, and extracting important variables.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

koofers-user-d5q-1

CMSC828G Principles of Data Mining, Lecture #3

• Today's Reading:
  – HMS, chapter 3
• Today's Lecture:
  – Exploratory data analysis
  – Statistical Graphics
  – Data reduction
    • PCA
    • multidimensional scaling
• Upcoming Due Dates:
  – P0 due 2/7

Exploratory Data Analysis (EDA)

• An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
  1. maximize insight into a data set;
  2. uncover underlying structure;
  3. extract important variables;
  4. detect outliers and anomalies;
  5. test underlying assumptions;
  6. develop parsimonious models; and
  7. determine optimal factor settings.

Source: NIST/SEMATECH Engineering Statistics Handbook

Plotting Raw Data (1D)

• Data traces
• Histograms

Run Sequence Plot

• Purpose: check for shifts in location and scale, and for outliers
• Run sequence plots are formed by:
  – Vertical axis: response variable Y(i)
  – Horizontal axis: index i (i = 1, 2, 3, ...)
• The run sequence plot can be used to answer the following questions:
  1. Are there any shifts in location?
  2. Are there any shifts in variation?
  3. Are there any outliers?
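
The run sequence plot pairs each response Y(i) with its index i. As a rough sketch only, the Python below builds those pairs and compares location (mean) and scale (standard deviation) between the first and second half of the series. The function names, the half-split comparison, and the `response` values are illustrative assumptions, not part of the lecture; the plot itself remains the primary tool.

```python
import statistics

def run_sequence_points(y):
    """Return the (i, Y(i)) pairs a run sequence plot would draw, with i = 1, 2, 3, ..."""
    return [(i, value) for i, value in enumerate(y, start=1)]

def compare_halves(y):
    """Summarize location (mean) and scale (standard deviation) for each half of the series."""
    mid = len(y) // 2
    first, second = y[:mid], y[mid:]
    return {
        "location": (statistics.mean(first), statistics.mean(second)),
        "scale": (statistics.stdev(first), statistics.stdev(second)),
    }

# Hypothetical measurements: the level jumps by about 5 units halfway through.
response = [10.1, 9.8, 10.3, 9.9, 10.0, 15.2, 14.9, 15.1, 15.3, 14.8]

points = run_sequence_points(response)
summary = compare_halves(response)
print(points[:3])           # first few (index, response) pairs
print(summary["location"])  # means of the two halves
print(summary["scale"])     # standard deviations of the two halves
```

In this made-up series the two half means differ by about 5 while the spreads stay similar, which is the kind of shift in location the plot is meant to reveal.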

Histogram Example

A classical bell-shaped, symmetric histogram with most of the frequency counts bunched in the middle and with the counts dying off out in the tails. From a physical science/engineering point of view, the normal distribution is that distribution which occurs most often in nature (due in part to the central limit theorem).

Another Histogram Example

Histograms cont.

• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• For large data sets, histograms can be quite effective at illustrating properties of the distribution.
• example
• Can smooth histogram using a variety of techniques
  – kernel estimates (we will discuss this in more detail when we discuss Kernel methods and SVMs)
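
A minimal sketch of the two points above, assuming a made-up eight-value sample: the same data is bucketed twice with the same bucket width but boundaries shifted by half a unit, and a simple Gaussian kernel estimate is evaluated as one possible smoother. The function names, the Gaussian kernel, and the bandwidth are illustrative assumptions; the lecture does not prescribe a particular kernel.

```python
import math

def histogram(data, origin, width, n_bins):
    """Count values in n_bins buckets [origin + k*width, origin + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = math.floor((x - origin) / width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def gaussian_kernel_estimate(data, x, bandwidth):
    """Kernel density estimate at x: the average of Gaussian bumps centred on each data point."""
    total = 0.0
    for xi in data:
        u = (x - xi) / bandwidth
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return total / (len(data) * bandwidth)

# Hypothetical small sample.
sample = [1.9, 2.1, 2.9, 3.1, 3.9, 4.1, 4.9, 5.1]

# Same bucket width, boundaries shifted by half a unit.
counts_a = histogram(sample, origin=1.0, width=1.0, n_bins=5)
counts_b = histogram(sample, origin=1.5, width=1.0, n_bins=4)
print(counts_a)
print(counts_b)

# Smoothed estimate on a coarse grid (bandwidth 0.5 is an arbitrary choice).
for x in [2.0, 3.5, 5.0]:
    print(x, gaussian_kernel_estimate(sample, x, bandwidth=0.5))
```

With boundaries at 1, 2, 3, ... the counts are [1, 2, 2, 2, 1], which looks peaked in the middle; with boundaries at 1.5, 2.5, ... they are [2, 2, 2, 2], which looks flat. The kernel estimate does not depend on bucket boundaries, although it does depend on the chosen bandwidth.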

Box Plot

• Calculate the median and the quartiles.
• Plot the median and draw a box between lower and upper quartiles; this box represents the middle 50% of the data, the "body" of the data.
• Draw a line from the lower quartile to the minimum point and another line from the upper quartile to the maximum point.
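
The quantities a box plot is drawn from can be computed directly. The sketch below is one possible reading of the steps above, using Python's standard `statistics.quantiles` (Python 3.8+). The function name `box_plot_summary` and the sample values are illustrative assumptions, and quartile conventions vary between tools, so other software may report slightly different quartiles for the same data.

```python
import statistics

def box_plot_summary(data):
    """Return the five values the box plot described above is drawn from."""
    # method="inclusive" interpolates between the sample minimum and maximum;
    # other quartile conventions can give slightly different values.
    lower_q, median, upper_q = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "max": max(data),
    }

# Hypothetical sample.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = box_plot_summary(values)
print(summary)
```

The box spans `lower_quartile` to `upper_quartile`, the median is marked inside it, and the two lines extend to `min` and `max`.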

2D: Scatter Plots

• standard tool for displaying relationship between two variables
• A scatter plot is a plot of the values of Y versus the corresponding values of X:
  – Vertical axis: variable Y, usually the response variable
  – Horizontal axis: variable X, the variable we suspect may be related
• Scatter plots can provide answers to the following questions:
  1. Are variables X and Y related?
  2. Are variables X and Y linearly related?
  3. Are variables X and Y non-linearly related?
  4. Does the variation in Y change depending on X?
  5. Are there outliers?
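
Questions 2 and 3 are worth separating, because a numerical summary of linear association can miss a non-linear one. As a hedged illustration (the data and the function name `pearson_r` are assumptions made for this sketch, not part of the lecture), the Pearson correlation below is 1 for an exactly linear relationship and 0 for an exactly quadratic one over a symmetric range of X, even though Y is fully determined by X in both cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: a measure of linear association only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data on a symmetric range of X.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # Y depends linearly on X
quadratic = [x * x for x in xs]    # Y depends on X, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear)
print(r_quadratic)
```

A scatter plot of the quadratic data would show the relationship immediately, which is one reason to look at the plot before relying on a single summary number.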

Scatter Plot: No relationship

Scatter Plot: Quadratic relationship

Scatter plot: Homoscedastic

Variation of Y Does Not Depend on X

Problems with scatter plots

too much data

black rectangle

Problems with scatter plots

too much data

overprinting
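
Overprinting can be made concrete by counting how many observations are drawn at exactly the same position. The sketch below uses made-up data in which 1,000 observations take only integer coordinates from 0 to 4; the function name and data are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def overprinting_summary(points):
    """Count how many observations are drawn at each distinct (x, y) position."""
    counts = Counter(points)
    return {
        "observations": len(points),
        "distinct_positions": len(counts),
        "max_overlap": max(counts.values()),
    }

# Hypothetical data: 1,000 observations whose coordinates take only the integer values 0..4.
points = [(i % 5, (i // 5) % 5) for i in range(1000)]
summary = overprinting_summary(points)
print(summary)
```

Here a plain scatter plot would show only 25 marks for 1,000 observations, each mark hiding 40 coincident points, so the density of the data is invisible.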