Looking at Data Distributions - Lecture Notes | STAT 30100, Study notes of Data Analysis & Statistical Methods

Material Type: Notes; Professor: Zhao; Class: Elementary Statistical Methods; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Spring 2007;

Uploaded on 07/30/2009

Chapter 1: Looking at Data--Distributions

Section 1.1: Introduction, Displaying Distributions with Graphs

Section 1.2: Describing Distributions with Numbers

Big picture: what do we learn in this chapter?

Individuals vs. Variables

Categorical vs. Quantitative Variables

Graphs:

  • Bar graphs and pie charts (categorical variables)
  • Histograms and stemplots (quantitative variables—good for checking for symmetry and skewness)
  • Boxplots (quantitative variables—graphical display of the 5 # summary, modified boxplots show outliers)

Describing distributions

  • Shape (symmetric/skewed, unimodal/bimodal/multimodal)
  • Center (mean or median)
  • Spread (usually standard deviation/variance or IQR from the 5 # summary)
  • Outliers
  • If you have a symmetric distribution with no outliers, use the mean and standard deviation.
  • If you have a skewed distribution and/or you have outliers, use the 5 # summary instead.

Two components in describing data or information:

  • Individuals: objects being described by a set of data (people, households, cars, animals, corn, etc.)
  • Variables: characteristics of individuals (height, yield, length, age, eye color, etc.)
      - Categorical: places an individual into one of several groups (gender, eye color, college major, hometown, etc.)
      - Quantitative: takes numerical values for which adding or averaging makes sense (height, weight, age, income, yield, etc.)

Distribution of a variable: describes what values a variable takes and how often it takes those values
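As a small illustration of what "distribution" means for a categorical variable, the sketch below tallies how often each value occurs. The eye-color answers are made up for this example and are not from the notes; `distribution` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical eye-color answers from ten students (made up for illustration).
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "hazel", "brown", "blue", "brown"]

def distribution(values):
    """Return {value: (count, proportion)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

eye_color_distribution = distribution(eye_colors)
for value, (count, share) in sorted(eye_color_distribution.items()):
    print(f"{value:<6} {count:>2}  {share:.0%}")
```

The same counts are what a bar graph or pie chart of this variable would display.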

If you have more than one variable in your problem, you should look at each variable by itself before you look at relationships between the variables.

Example: Identify whether the following questions would give you categorical or quantitative data. If it is categorical, state the possible answers.

a) What letter grade did you get in your Calculus class last semester?

b) What was your score on the last exam?

c) What is your GPA?

d) Did you vote for John Kerry?

e) Who did you vote for in the last election?

f) How many votes did George Bush get?

g) How many red M&Ms are in this bag?

h) Is this a red M&M?

i) What color is the M&M you just ate?

j) Which type of M&Ms has more red ones, peanut or plain?

Quantitative Variables:

  • Stem plots, histograms, and boxplots (discussed a little later)
  • Example: You investigate the amount of time students spend on the internet (in minutes). You study 28 students, and their times (in minutes) are listed below. Show the distribution of times with a stem plot and a histogram.

To create a stem plot by hand:

  1. Put the data in order from smallest to largest.
  2. The “stem” will be all digits for a data point except for the last one. Write the stems in a vertical line.
  3. The “leaf” will be the last digit from each data point. Write the leaves after the appropriate stem, in increasing order.

  0 | 7
  1 |
  2 | 0 4 5 5 8 8
  3 | 0 2 5
  4 | 2 3 4 5 6 7 8 8
  5 | 0 1
  6 |
  7 | 2 5 7 8 9
  8 | 3 7 8

A split stem plot just has more stems. There are several ways to split the stems.

  0 | 7
  1 |
  1 |
  2 | 0 4
  2 | 5 5 8 8
  3 | 0 2
  3 | 5
  4 | 2 3 4
  4 | 5 6 7 8 8
  5 | 0 1
  5 |
  6 |
  6 |
  7 | 2
  7 | 5 7 8 9
  8 | 3
  8 | 7 8

Why do we need split stem plots? Sometimes it is easier to see the shape of the data with more stems. Sometimes a regular stem plot is better. If you're not sure, try it both ways and see if a pattern appears.
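The by-hand steps can also be sketched in a few lines of Python. This is only an illustration: the 28 times are read back off the stem plot above (the raw list is not in these notes), `stem_plot` is a hypothetical helper, and the split used here (leaves 0-4 and 5-9) is just one of the several possible ways to split stems.

```python
from collections import defaultdict

# The 28 internet times (minutes), read back off the stem plot above.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def stem_plot(data, split=False):
    """Return stem plot rows; split=True gives each stem two rows (leaves 0-4, 5-9)."""
    leaves = defaultdict(list)
    for x in sorted(data):                      # step 1: order the data
        stem, leaf = divmod(x, 10)              # steps 2-3: stem and last digit
        key = (stem, leaf >= 5) if split else (stem, False)
        leaves[key].append(leaf)
    keys = sorted(leaves)
    first, last = keys[0], keys[-1]
    halves = (False, True) if split else (False,)
    rows = []
    for stem in range(first[0], last[0] + 1):
        for half in halves:
            key = (stem, half)
            if first <= key <= last:            # keep empty stems inside the range
                row = " ".join(str(leaf) for leaf in leaves.get(key, []))
                rows.append(f"{stem} | {row}".rstrip())
    return rows

regular_rows = stem_plot(times)
split_rows = stem_plot(times, split=True)
print("\n".join(regular_rows))
```

Printing `split_rows` instead gives the split version, which makes it easy to try it both ways.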

To create a histogram by hand:

  1. Order the data from smallest to largest.
  2. Your range is the max data point – min data point (88 – 7 = 81 here)
  3. Decide how many intervals you want. (9?)
  4. The width of your intervals is just range / (# of intervals) (81 / 9 = 9 here)
  5. List your intervals and the # of individuals in each interval in tabular form
  6. Draw the histogram

  Interval range      # of times in that range
  7 ≤ x ≤ 16          1
  16 < x ≤ 25         4
  25 < x ≤ 34         4
  34 < x ≤ 43         3
  43 < x ≤ 52         8
  52 < x ≤ 61         0
  61 < x ≤ 70         0
  70 < x ≤ 79         5
  79 < x ≤ 88         3
  Total               28
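The interval counts in the table can be checked with a short sketch. It assumes the 28 times recovered from the stem plot, nine equal-width intervals that are closed on the right, and a first interval that also includes the minimum; `interval_counts` is a hypothetical helper name.

```python
import math

# The 28 internet times (minutes), as recovered from the stem plot.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def interval_counts(data, n_intervals):
    """Count observations in equal-width intervals (left, right]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_intervals             # range / # of intervals
    counts = [0] * n_intervals
    for x in data:
        index = math.ceil((x - lo) / width) - 1
        counts[max(index, 0)] += 1              # the minimum goes in the first interval
    edges = [lo + i * width for i in range(n_intervals + 1)]
    return edges, counts

edges, counts = interval_counts(times, 9)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f} {count}")
```

Changing the second argument shows how the picture changes with more or fewer intervals.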

How is a histogram different from a stem plot?

  • They basically show the same information, except stem plots use numbers and histograms use shaded rectangles to show where the quantitative data fall.

How is a histogram different from a bar graph?

  • Histograms have the bars for each interval touching each other; bar graphs do not have the bars touching.
  • Histograms will have a continuous x-axis, with the x-values in order. Bar graphs can have the categories on the x-axis listed in any order.
  • Histograms are for quantitative variables, and bar graphs are for categorical variables.

Find the mean and median of the following 8 numbers in Dataset B:

1 2 4 6 8 9 12 13
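One way to check a hand calculation for Dataset B is the sketch below. It assumes the usual definitions (mean = sum divided by count; median = middle value, or the average of the two middle values when the count is even); the function names are illustrative.

```python
def mean(values):
    """Arithmetic mean: the sum divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]
print(mean(dataset_b), median(dataset_b))   # 6.875 7.0
```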

  1. Spread:

a) Range = max – min (simplest, not always the most helpful)

b) Variance: s^2, roughly the average of the squared deviations of the observations from the mean (the sum is divided by n – 1 rather than n)

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

c) Standard Deviation: s, the square root of the variance, a common way of measuring how far observations are from the mean

Example of finding the standard deviation by hand:

0 2 4

  1. Calculate the mean.
  2. Calculate the variance.
  3. Take the square root of the variance.
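The three steps can be sketched as follows for the data 0, 2, 4. The sketch assumes the usual sample variance, which divides the sum of squared deviations by n – 1.

```python
import math

def variance(values):
    """Sample variance: squared deviations from the mean, summed and divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                              # step 1: the mean
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)   # step 2: the variance

def standard_deviation(values):
    """Step 3: the square root of the variance."""
    return math.sqrt(variance(values))

data = [0, 2, 4]
print(variance(data), standard_deviation(data))   # 4.0 2.0
```

Here the mean is 2, the squared deviations are 4, 0, and 4, and dividing their sum by n – 1 = 2 gives a variance of 4.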

d) pth percentile: value such that p% of the observations fall at or below it

Median = M = 50th percentile

First Quartile = Q1 = 25th percentile

Third Quartile = Q3 = 75th percentile

How do you find quartiles? Think of them as “mini-medians.” Leave the median out, and then find the median of what is left over on the left side (Q1) and what is left over on the right side (Q3).

Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A:

-20 1 23 25 32.5 33 67

Find the 1st and 3rd quartiles of the following 8 numbers in Dataset B:

1 2 4 6 8 9 12 13
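The “mini-median” method can be sketched in Python as a way to check hand answers for Datasets A and B. This follows the by-hand rule described in these notes (drop the median when n is odd, then take the median of each half); software packages may use other rules and give slightly different values.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves, leaving the median out."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

dataset_a = [-20, 1, 23, 25, 32.5, 33, 67]
dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]
print(quartiles(dataset_a))   # (1, 33)
print(quartiles(dataset_b))   # (3.0, 10.5)
```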

e) 5-Number Summary: Min, Q1, M, Q3, Max

f) Interquartile Range (IQR) = Q3 – Q1

Call an observation a suspected outlier if it is greater than Q3 + 1.5 × IQR or less than Q1 – 1.5 × IQR.
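The 1.5 × IQR rule can be sketched as a pair of small functions. The quartiles passed in for Dataset A (Q1 = 1, Q3 = 33) are the values the by-hand “mini-median” method gives; the function names are illustrative.

```python
def outlier_fences(q1, q3):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values, q1, q3):
    """Observations that fall below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]

dataset_a = [-20, 1, 23, 25, 32.5, 33, 67]
fences = outlier_fences(1, 33)
flagged = suspected_outliers(dataset_a, 1, 33)
print(fences, flagged)   # (-47.0, 81.0) []
```

For Dataset A the IQR is 32, so nothing below –47 or above 81 exists and no observation is flagged, even though –20 and 67 sit well away from the rest.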

g) Boxplots : Use the 5-number summary

  • A central box spans the quartiles Q1 and Q3.
  • A line in the box marks the median M.
  • Lines extend from the box out to the smallest and largest observations.
  • In modified boxplots, the lines extend from the box only to the smallest and largest observations which are NOT outliers. Dots mark any outliers.

“The Median vs. the Mean in the Age of Average” by Mike Pesca on NPR's Day-to-Day 7/19/06: http://www.npr.org/templates/story/story.php?storyId=

Do you always have to do all of this by hand? NO! Statistical software packages like SPSS can make life much easier for you, but it's a good idea to know how to do these by hand so you can make sense of your output. Also, on the exam, you won't have access to a computer.

Read over your SPSS manual (part of the HW) and get comfortable with using it. You will have a chance to practice on the HW for this week, and you will work on it in lab on Friday.

Enter your data, then Analyze--> Descriptive Statistics--> Explore. Follow the instructions on p. 48 of the SPSS manual.

The output from SPSS for the internet time problem looks like:

Descriptives: Time spent on the web

  Statistic                             Value     Std. Error
  Mean                                  54.77     5.
  95% Confidence Interval for Mean
    Lower Bound
    Upper Bound
  5% Trimmed Mean
  Median
  Variance
  Std. Deviation
  Minimum                               7
  Maximum                               151
  Range                                 144
  Interquartile Range                   48
  Skewness                              1.314
  Kurtosis                              1.977

[Histogram of Time spent on the web: horizontal axis from 0 to 150, vertical axis Frequency from 0 to 10; Mean = 54.77, Std. Dev. = 32., N = 30]

Time spent on the web Stem-and-Leaf Plot

  Frequency    Stem &  Leaf
     1.00        0 .  0
     9.00        0 .  222222333
    10.00        0 .  4444444455
     5.00        0 .  77777
     3.00        0 .  888
      .00        1 .
     1.00        1 .  3
     1.00 Extremes    (>=

  Stem width: 100
  Each leaf:  1 case(s)

Notice on the boxplot, it is easy to identify the potential outlier. This would be your indication that the 5-number summary would be the best way to describe your data. (You could also try calculating the mean and standard deviation without the outlier for comparison.)

SPSS can also give you the Quartiles (listed under “Percentiles”), but these are not necessarily the same answers as what you would get by hand. The “weighted average” and “Tukey's Hinges” are not the same method we use. For this class, whenever we ask you to calculate the Quartiles, we want you to do them by hand.

Features of bell-shaped distributions (from Section 1.3)

A z-score tells us how many standard deviations away from the mean an observation is.

$$ z = \frac{x - \mu}{\sigma} $$

This is also called getting a standardized value.

Why is standardization useful? For comparing apples to oranges.

Example: (p. 88, Problem 1.99) Jacob scores 16 on the ACT. Emily scores 670 on the SAT. Assuming that both tests measure scholastic aptitude, who has the higher score? The SAT scores for 1.4 million students in a recent graduating class were roughly Normal with a mean of 1026 and standard deviation of 209. The ACT scores for more than 1 million students in the same class were roughly Normal with mean of 20.8 and standard deviation of 4.8.
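The Jacob and Emily comparison can be worked as a short sketch: standardize each score with its own test's mean and standard deviation, then compare the z-scores. The function and variable names are illustrative.

```python
def z_score(x, mean, sd):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

jacob_z = z_score(16, 20.8, 4.8)     # ACT: about -1.00
emily_z = z_score(670, 1026, 209)    # SAT: about -1.70

higher = "Jacob" if jacob_z > emily_z else "Emily"
print(f"Jacob z = {jacob_z:.2f}, Emily z = {emily_z:.2f}, higher: {higher}")
```

Both scores are below their test's mean, but Jacob's is only about one standard deviation below while Emily's is about 1.7 below, so Jacob has the higher standardized score.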

How else can we use standardization? If the distribution of observations has a bell shape, then these standardized values have some special properties. One of these is the 68-95-99.7 rule.

  • Approximately 68% of the observations fall within 1σ of the μ.
  • Approximately 95% of the observations fall within 2σ of the μ.
  • Approximately 99.7% of the observations fall within 3σ of the μ.

The most famous bell-shaped distribution is the Normal distribution. We will spend a whole week talking about it for Section 1.3, and it will be important to everything we do for the rest of the semester.

[Figure: horizontal axis marked in standard deviations away from the mean (z-score), centered at the mean]
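As a rough check on the 68-95-99.7 rule, the sketch below computes the share of a Normal distribution that lies within k standard deviations of the mean. It assumes an exactly Normal distribution and uses the standard identity P(|Z| ≤ k) = erf(k / √2); `share_within` is a hypothetical helper name.

```python
import math

def share_within(k):
    """P(|Z| <= k) for a standard Normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The exact shares are about 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.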