Lecture Slides on Exploratory Data Analysis | STAT 371, Study notes of Statistics

Material Type: Notes; Class: Introductory Applied Statistics for the Life Sciences; Subject: STATISTICS; University: University of Wisconsin - Madison; Term: Spring 2006;

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-nw6
koofers-user-nw6 🇺🇸

10 documents

1 / 18

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Exploratory data analysis (Chapter 2)
Cécile Ané
Stat 371
Spring 2006
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12

Partial preview of the text

Download Lecture Slides on Exploratory Data Analysis | STAT 371 and more Study notes Statistics in PDF only on Docsity!

Exploratory data analysis (Chapter 2)

Cécile Ané

Stat 371

Spring 2006

Outline

(^1) Categorical data

(^2) Numerical data Displays Numerical summaries

R demo and Bar plots

A AB B O NA's

0

10

20

30

40

Blood type, 2005 survey

Displays: Stem-leaf display

Milk data: milk yields (lbs/day) were collected from a particular herd on a given day. Data: 44, 55, 37, 32, 37, 26, 23, 41, 34, 19, 30, 39, 46, 44.

stem(milk) The decimal point is 1 digit(s) to the right of the | 1 | 9 2 | 36 3 | 024779 4 | 1446 5 | 5

Look for the minimum and maximum, then decide on a precision and round off all data at the same precision. Last digit: leaf, any digit before: stem. 1 observation=1 leaf Leaves may then be ordered. Provides complete information. Pretty easy to do.

Displays: Histograms

Rotate the histogram: like stem-leaf No single histogram. Lots of them! Rules are somewhat arbitrary. Histograms are useful with larger datasets, stem-leaf displays with smaller data sets.

Measures of location: Sample mean

Milk yield data: 44, 55, 37, 32, 37, 26, 23, 41, 34, 19, 30, 39, 46, 44. y 1 = 44, y 2 = 55,... , y 14 = 44.

Sample mean

¯ y = ( 44 + 55 + · · · + 44 )/ 14 = ( y 1 + y 2 + · · · + y 14 )/ 14

=

n ( y 1 + y 2 + · · · + yn )

n

∑^ n

i = 1

yi

mean(milk) [1] 36.

Here y ¯ = 36 .2 lbs/day

Measures of location: Mode

Mode: most common value. More interesting for discrete data, with small # possible values and large # observations. Example: # of brothers. 0 | 0000000000000000000000000000000 1 | 00000000000000000000000000000000000000000000000000 2 | 000000000000000 3 | 00000 4 | 5 | 6 | 7 | 0

Mode = 1 (brother).

Measures of location: Quantiles and percentiles

25 % percentile = 0.25 quantile = value such that 1/ observations are below and 3/4 are above. p quantile: value such that (about) a proportion p of observations are below and about 1 − p are above. Median is a special case (why?) Example: 6 9 11 17 19 23 26 26

First quartile Q 1 : median of those values below the median Third quartile Q 3 : median of those values above the median

Milk yield:

19 23 26 30 32 34 37

Display: Boxplot

fivenum(milk) summary(milk) boxplot(milk)

20 25 30 35 40 45 50 55

Min Q1 Median Q3 Max

Milk yield data

Range: 36

IQR: 14

boxplot(Height ~ Sex)

Display: Boxplot

No fence: whiskers extend to minimum and maximum With fences (Modified boxplot): Observations outside fences are drawn as points. Whiskers cannot go beyond fences. Fence = 1.5 IQR Milk example: IQR = 14. Fences are 1. 5 ∗ 14 = 21 below Q 1 and above Q 3 , i.e 30 − 21 = 9 and 44 + 21 = 65. Here smallest data point (min) = 19 > 9 and largest (max) = 55 < 65: no outlier.

Measures of spread: Standard deviation

Recall y 1 = first observation,... , yn = last observation. Deviation from the mean: yiy ¯. Ex: first cow has deviation 44 − 36. 2 = + 7 .8, cow with data 19 has deviation 19 − 36. 2 = − 17 .2. Variance : s^2 ≥ 0 always!

s^2 =

n − 1

∑^ n

i = 1

( yiy ¯)^2 =

n − 1

( y 1 − y ¯)^2 + · · · + ( yn − ¯ y )^2

Preferred formula for hand calculation:

s^2 =

n − 1

( (^) n

i = 1

y i^2 − ny ¯^2

Here we get

s^2 =

= 95. 26 lb^2

Measures of spread: Standard deviation

Standard deviation: s =

variance =

s^2 is now in original units. s is the typical deviation. Here, s = 9 .8 lbs.

mean(milk) [1] 95. sd(milk) [1] 9.

20 30 40 50

l l l l l l l

l l l l

l l l

sd=9.

16.6 26.4 36.2 46.0 55.

sd=9.

The empirical rule: for most “mound-shaped” distributions, about 68% of observations lie within 1 standard deviation of the mean, (here: 9/14 = 64%) about 95% lie within 2 s.d. of the mean (here: 100%) about 99% lie within 3 s.d. of the mean (here: 100%)