Exploratory Data Analysis: Understanding Variables, Types, and Measuring Spread, Study notes of Data Analysis & Statistical Methods

This chapter from a statistics textbook introduces exploratory data analysis (EDA), focusing on definitions, types of data, and measuring spread. It covers variables, quantitative and categorical data, frequency tables, graphs, and measures of spread such as range, deviations, variance, and standard deviation.

Typology: Study notes

Pre 2010

Uploaded on 09/20/2010

lookin4life

Chapter 2

Exploratory Data Analysis (EDA)

Section 2.1 -- Types of Data

Definitions

  • EDA: a set of techniques used to explore and summarize data via graphical and numerical methods.
  • Variables: any characteristics that are recorded in a study.

Quantitative vs Categorical

  • Quantitative: variables take on numeric values that convey the relative magnitude of the variable.
    - Look at spread and center
  • Categorical: variables have responses that are classes, groups, or categories.
    - Often look at proportions

Ex. 2.

  • Is your current zip code a quantitative or categorical variable? Why?
  • Fact: The appropriate analysis method is determined by the type of variable.

Frequencies

  • Frequency Table: a table listing the possible values and their (relative) frequencies
  • Example:
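
A frequency table of this kind can be sketched in a few lines of Python. The responses below are hypothetical, invented purely for illustration (they are not data from the course), and `frequency_table` is an illustrative helper name rather than anything the notes define.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency), sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(value, count, count / n) for value, count in sorted(counts.items())]

# Hypothetical categorical responses (class standing), made up for this sketch.
responses = ["Fr", "So", "So", "Jr", "Sr", "So", "Jr", "Fr", "So", "Jr"]

table = frequency_table(responses)
print(f"{'Value':<8}{'Freq':>6}{'Rel. freq':>12}")
for value, freq, rel in table:
    print(f"{value:<8}{freq:>6}{rel:>12.2f}")
```

The relative frequencies are the counts divided by the number of observations, so they always sum to 1.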

Section 2.2 -- Some Graphical Summaries

Graphs

  • Dotplots
  • Stemplots: created using the last digit as a “leaf” and the remaining digits as a “stem.” The stems are arranged in order and should be chosen to make the plot as readable as possible.
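
The stemplot construction described above can be sketched as follows, using the CDS1 values that appear later in these notes. This is a minimal sketch that assumes nonnegative integer data and a stem unit of 10; the function name `stemplot` is illustrative.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: the last digit is the leaf, the rest is the stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # assumes nonnegative integers
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return lines

cds1 = [0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45]

plot_lines = stemplot(cds1)
for line in plot_lines:
    print(line)
```

Each row holds one stem, and the number of leaves on a row is the number of observations with that stem.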

Graphs

  • Histogram: puts continuous observations into groups, classes, or bins and graphs the frequencies or relative frequencies
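
As a rough illustration of binning, the sketch below counts the CDS1 values (listed later in these notes) in bins of width 10 starting at 0 and prints a text histogram. The bin width, the starting point, and the helper name `histogram_counts` are all assumptions made for this example; the notes do not prescribe them.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:  # values outside the bins are ignored
            counts[i] += 1
    return counts

cds1 = [0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45]

# Assumed binning: five bins of width 10 covering [0, 50).
counts = histogram_counts(cds1, start=0, width=10, n_bins=5)
n = len(cds1)
for i, c in enumerate(counts):
    low, high = i * 10, (i + 1) * 10
    print(f"[{low:>2}, {high:>2})  {'*' * c:<6}  freq={c}  rel={c / n:.2f}")
```

Changing the bin width changes the picture, so it is worth trying a few choices before interpreting the shape.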

Interpretation

  1. Shape
     - Symmetry: symmetric, skewed left, or skewed right
     - Modality: unimodal, bimodal, trimodal, etc.
  2. Center – Mean, Median, Mode(s)
  3. Spread – Range, Inter-Quartile Range (IQR), standard deviation (variance)
  4. Anomalies – Outliers, unlikely/impossible observations

Section 2.3 -- How do we estimate Center?

Estimating Center

  • Mean: a parameter, μ, that denotes the average value of the population
  • Median: measures the middle value -- half the data falls below it and half above it
    - Often denoted by capital M, but we will use η to stand for the population median
  • Mode: the value that occurs most frequently in the dataset
Stemplot Example 2.

  • Class Data Set 1 or CDS
    • 29, 30, 31, 32, 37, -

Notation and Example 2.

Sample Median

  • Denoted by $\tilde{x}$
  • Method
    1. Order observations from smallest to largest
    2. Determine whether n is odd or even
       a) If n is odd, then $\tilde{x}$ is the middle observation: $\tilde{x} = \left(\frac{n+1}{2}\right)^{\text{th}}$ observation
       b) If n is even, then $\tilde{x}$ is the average of the middle two observations: $\tilde{x} = $ the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ observations
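
The two-step method for the sample median translates directly into code. The following is a minimal sketch; the function name `sample_median` and the two small test lists are made up for illustration. Note that the "k-th observation" in the method is 1-based, while Python lists are 0-based.

```python
def sample_median(observations):
    """Sample median: sort, then take the middle value (odd n)
    or the average of the two middle values (even n)."""
    ordered = sorted(observations)  # step 1: order smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: the ((n + 1) / 2)-th observation, shifted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the (n / 2)-th and (n / 2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Hypothetical small samples, chosen only to exercise both branches.
print(sample_median([3, 1, 2]))
print(sample_median([4, 1, 3, 2]))
```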

Example 2.

  • Recall our data CDS1. Compute the mean, median, and mode.
    0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45
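
One way to check a hand calculation for this example is with Python's standard `statistics` module. This is only a verification sketch, not the method the notes ask you to practice; `multimode` is used so that ties for the most frequent value would all be reported.

```python
from statistics import mean, median, multimode

cds1 = [0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45]

cds1_mean = mean(cds1)        # sum of the observations divided by n
cds1_median = median(cds1)    # n = 20 is even: average of the 10th and 11th values
cds1_modes = multimode(cds1)  # list of the most frequent value(s)

print("n      =", len(cds1))
print("mean   =", cds1_mean)
print("median =", cds1_median)
print("mode   =", cds1_modes)
```

Here the observations sum to 418 with n = 20, the 10th and 11th ordered values are 22 and 24, and 9 is the only value that appears twice.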

Example 2.5 cont

  • Consider our CDS1. What happens to the median if we change our largest observation from 45 to 450? What about the mean?
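
The question can be explored numerically. The sketch below recomputes both statistics after replacing 45 with 450; the variable names are illustrative.

```python
from statistics import mean, median

cds1 = [0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45]

# Copy the data in sorted order and replace the largest observation (45) with 450.
modified = sorted(cds1)
modified[-1] = 450

mean_before, mean_after = mean(cds1), mean(modified)
median_before, median_after = median(cds1), median(modified)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The middle two ordered values are unchanged, so the median does not move, while the mean is pulled upward because the sum grows by 405.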

Section 2.4 -- How do we measure the spread?

Spread

  • Range: distance from the largest to the smallest observation
  • Is the range robust?
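
A quick numerical experiment can inform the robustness question. The sketch below computes the range of CDS1 and then repeats the calculation with the largest observation changed from 45 to 450, as in the earlier example; `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)

cds1 = [0, 5, 9, 9, 10, 11, 12, 14, 17, 22, 24, 26, 27, 28, 29, 30, 31, 32, 37, 45]

# Same data with the largest observation changed from 45 to 450.
modified = [450 if v == 45 else v for v in cds1]

range_before = data_range(cds1)
range_after = data_range(modified)
print("range before:", range_before)
print("range after: ", range_after)
```

Because the range depends only on the two most extreme observations, a single unusual value changes it completely.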