
































This is the third lecture of the Principles of Data Mining course, focusing on exploratory data analysis (EDA) and data visualization techniques. Topics include statistical graphics, data reduction methods such as PCA and multidimensional scaling, and data plots such as histograms, box plots, and scatter plots. The lecture also discusses the role of EDA in maximizing insight into a dataset, uncovering its underlying structure, and extracting important variables.
Typology: Study notes

































Lecture #
Exploratory Data Analysis (EDA)
Plotting Raw Data (1D)
Run Sequence Plot
Purpose: check for shifts in location and scale, and for outliers
Run sequence plots are formed by:
Vertical axis: response variable Y(i)
Horizontal axis: index i (i = 1, 2, 3, ...)
The run sequence plot can be used to answer the following questions:
1. Are there any shifts in location?
2. Are there any shifts in variation?
3. Are there any outliers?
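The construction above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the helper names (`run_sequence`, `compare_halves`, `plot_run_sequence`) are hypothetical, splitting the run into two halves is only a crude numeric stand-in for reading the plot by eye, and matplotlib is an assumed dependency used only by the plotting function.

```python
from statistics import fmean, pstdev


def run_sequence(y):
    """Pair each response Y(i) with its index i = 1, 2, 3, ..."""
    return list(enumerate(y, start=1))


def compare_halves(y):
    """Compare location (mean) and scale (std. dev.) of the two halves of a run.

    A rough companion to the plot, not a formal statistical test.
    """
    mid = len(y) // 2
    first, second = y[:mid], y[mid:]
    s1, s2 = pstdev(first), pstdev(second)
    return {
        "location_shift": fmean(second) - fmean(first),
        # Ratio is undefined when the first half has zero spread.
        "scale_ratio": s2 / s1 if s1 > 0 else float("inf"),
    }


def plot_run_sequence(y):
    """Draw the run sequence plot (needs matplotlib, an assumed dependency)."""
    import matplotlib.pyplot as plt

    idx, values = zip(*run_sequence(y))
    plt.plot(idx, values, marker="o")
    plt.xlabel("Index i")
    plt.ylabel("Response Y(i)")
    plt.title("Run sequence plot")
    plt.show()
```

A large `location_shift` suggests a shift in location, and a `scale_ratio` far from 1 suggests a shift in variation. Outliers are usually easier to spot on the plot itself.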
Histogram Example
A classical bell-shaped, symmetric histogram, with most of the frequency counts bunched in the middle and the counts dying off in the tails. From a physical science/engineering point of view, the normal distribution is the distribution that occurs most often in nature (due in part to the central limit theorem).
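To make the idea of frequency counts concrete, here is a minimal sketch of equal-width binning using only the standard library. The function names and the choice of equal-width bins are assumptions for illustration. Real plotting libraries offer more binning rules.

```python
def histogram_counts(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if width > 0:
            # The maximum value belongs to the last bin, hence the clamp.
            k = min(int((x - lo) / width), bins - 1)
        else:
            k = 0  # all values identical: everything goes in the first bin
        counts[k] += 1
    return counts


def print_histogram(data, bins):
    """Render the counts as a simple text histogram, one row per bin."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    for k, c in enumerate(histogram_counts(data, bins)):
        print(f"{lo + k * width:8.2f} | {'#' * c}")
```

For a small symmetric sample such as `[1, 2, 2, 3, 3, 3, 4, 4, 5]` with 5 bins, the counts peak in the middle bin and fall off toward both tails, which is the bell shape described above.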
Another Histogram Example
Histograms cont.
discuss Kernel methods and SVMs)
Box Plot
Calculate the median and the quartiles
Plot the median and draw a box between the lower and upper quartiles; this box represents the middle 50% of the data, the "body" of the data.
Draw a line from the lower quartile to the minimum point and another line from the upper quartile to the maximum point.
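The steps above reduce to computing a five-number summary. The sketch below uses Python's `statistics.quantiles`. The "inclusive" quartile method is an assumption, since several quartile conventions exist and the lecture does not say which one it uses. The function names are hypothetical, and matplotlib is an assumed dependency for the plotting helper only.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # Assumption: the "inclusive" method treats the data as the whole population.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


def plot_box(data):
    """Draw a box plot with whiskers reaching the minimum and maximum."""
    import matplotlib.pyplot as plt

    # whis=(0, 100) extends the whiskers to the 0th and 100th percentiles,
    # matching the min/max construction described in the steps above.
    plt.boxplot(data, whis=(0, 100))
    plt.show()
```

For the values 1 through 9, the summary is a minimum of 1, quartiles of 3, 5, and 7, and a maximum of 9, so the box spans 3 to 7.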
2D: Scatter Plots
Standard tool for displaying the relationship between two variables
A scatter plot is a plot of the values of Y versus the corresponding values of X:
- Vertical axis: variable Y, usually the response variable
- Horizontal axis: variable X, usually a variable we suspect may be related to the response
Scatter plots can provide answers to the following questions:
1. Are variables X and Y related?
2. Are variables X and Y linearly related?
3. Are variables X and Y non-linearly related?
4. Does the variation in Y change depending on X?
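Questions 2 and 3 show why the plot matters and a single number does not suffice. As a hedged illustration (the function name is hypothetical, and Pearson's correlation coefficient is brought in here as one common numeric summary, not something the lecture prescribes), the sketch below shows a perfectly quadratic relationship producing a correlation of zero:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient: measures linear association only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = list(range(-5, 6))
linear = [2 * x + 1 for x in xs]   # Y is linearly related to X
quadratic = [x * x for x in xs]    # Y is related to X, but not linearly

r_linear = pearson_r(xs, linear)        # close to 1
r_quadratic = pearson_r(xs, quadratic)  # close to 0 despite a clear relationship
```

A correlation near zero therefore does not mean "no relationship". The scatter plot distinguishes the two cases at a glance.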
Scatter Plot: No relationship
Scatter Plot: Quadratic relationship
Scatter Plot: Homoscedastic
Variation of Y Does Not Depend on X
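One rough way to check this numerically is to fit a straight line, then compare the spread of the residuals for small X against the spread for large X. The sketch below is an assumption-laden illustration, not a method from the lecture: the function name is hypothetical, it assumes a roughly linear trend, and the split at the middle of the sorted X values is arbitrary.

```python
from statistics import fmean, pvariance


def residual_variance_ratio(xs, ys):
    """Ratio of residual variance in the upper half of X to the lower half.

    Values near 1 are consistent with homoscedasticity. Values far from 1
    suggest that the variation of Y depends on X.
    """
    mx, my = fmean(xs), fmean(ys)
    # Ordinary least-squares line y = intercept + slope * x.
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    # Residuals ordered by increasing x, then split into two halves.
    resid = [y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))]
    mid = len(resid) // 2
    return pvariance(resid[mid:]) / pvariance(resid[:mid])


xs = list(range(1, 21))
signs = [1 if x % 2 else -1 for x in xs]
constant_spread = [x + s for x, s in zip(xs, signs)]     # noise of fixed size
growing_spread = [x + x * s for x, s in zip(xs, signs)]  # noise grows with x
```

Here `residual_variance_ratio(xs, constant_spread)` stays near 1, while `residual_variance_ratio(xs, growing_spread)` is well above 1.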
Problems with scatter plots