Looking at Data Distributions - Lecture Notes | STAT 30100, Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Yang Zhao

Material Type: Notes; Professor: Zhao; Class: Elementary Statistical Methods; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Spring 2007;

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-0xp 🇺🇸


Chapter 1: Looking at Data--Distributions

Section 1.1: Introduction, Displaying Distributions with Graphs

Section 1.2: Describing Distributions with Numbers

Big picture: what do we learn in this chapter?

Individuals vs. Variables

Categorical vs. Quantitative Variables

Graphs:

• Bar graphs and pie charts (categorical variables)

• Histograms and stemplots (quantitative variables; good for checking for symmetry and skewness)

• Boxplots (quantitative variables; graphical display of the 5-number summary; modified boxplots show outliers)

Describing distributions

• Shape (symmetric/skewed, unimodal/bimodal/multimodal)

• Center (mean or median)

• Spread (usually standard deviation/variance, or IQR from the 5-number summary)

• Outliers

• If you have a symmetric distribution with no outliers, use the mean and standard deviation.

• If you have a skewed distribution and/or you have outliers, use the 5-number summary instead.


2 components in describing data or information:

Individuals: objects being described by a set of data (people, households, cars, animals, corn, etc.)
Variables: characteristics of individuals (height, yield, length, age, eye color, etc.)
- Categorical: places an individual into one of several groups (gender, eye color, college major, hometown, etc.)
- Quantitative: takes numerical values for which adding or averaging makes sense (height, weight, age, income, yield, etc.)

Distribution of a variable: describes what values a variable takes and how often it takes those values

If you have more than one variable in your problem, you should look at each variable by itself before you look at relationships between the variables.

Example: Identify whether the following questions would give you categorical or quantitative data. If it is categorical, state the possible answers.

a) What letter grade did you get in your Calculus class last semester?

b) What was your score on the last exam?

c) What is your GPA?

d) Did you vote for John Kerry?

e) Who did you vote for in the last election?

f) How many votes did George Bush get?

g) How many red M&Ms are in this bag?

h) Is this a red M&M?

i) What color is the M&M you just ate?

j) Which type of M&Ms has more red ones, peanut or plain?

Quantitative Variables:

Stem plots, histograms, and boxplots (discussed a little later)
Example: You investigate the amount of time students spend on the internet (in minutes). You study 28 students, and their times (in minutes) are listed below. Show the distribution of times with a stem plot and a histogram.

To create a stem plot by hand,

Put the data in order from smallest to largest.
The “stem” will be all digits for a data point except for the last one. Write the stems in a vertical line.
The “leaf” will be the last digit from each data point. Write the leaves after the appropriate stem, in increasing order.

0 | 7
1 |
2 | 0 4 5 5 8 8
3 | 0 2 5
4 | 2 3 4 5 6 7 8 8
5 | 0 1
6 |
7 | 2 5 7 8 9
8 | 3 7 8
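The three by-hand steps can also be sketched in Python. This is only an illustration: the `times` list is read back off the stem plot above (stem = tens digit, leaf = ones digit), and the helper name `stem_plot` is made up for this sketch. It assumes non-negative whole-number data.

```python
from collections import defaultdict

# Times in minutes, read back off the stem plot in the notes
# (stem = tens digit, leaf = ones digit); 28 students in total.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def stem_plot(data):
    """Return the lines of a stem plot for non-negative integers (hypothetical helper)."""
    leaves = defaultdict(list)
    for x in sorted(data):                # step 1: order the data
        leaves[x // 10].append(x % 10)    # steps 2 and 3: stem = all but last digit, leaf = last digit
    lines = []
    # Write every stem from the smallest to the largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

plot_lines = stem_plot(times)
print("\n".join(plot_lines))
```

Empty stems (1 and 6 here) are kept on purpose, because gaps are part of the shape of the distribution.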

A split stem plot just has more stems. There are several ways to split the stems.

0 | 7
1 |
1 |
2 | 0 4
2 | 5 5 8 8
3 | 0 2
3 | 5
4 | 2 3 4
4 | 5 6 7 8 8
5 | 0 1
5 |
6 |
6 |
7 | 2
7 | 5 7 8 9
8 | 3
8 | 7 8

Why do we need split stem plots? Sometimes it is easier to see the shape of the data with more stems. Sometimes a regular stem plot is better. If you’re not sure, try it both ways and see if a pattern appears.

To create a histogram by hand:

Order the data from smallest to largest.
Your range is the max data point – min data point (88 – 7 = 81 here)
Decide how many intervals you want. (9?)
The width of your intervals is just $\dfrac{\text{range}}{\#\text{ of intervals}} = \dfrac{81}{9} = 9$

List your intervals and # of individuals in each interval in tabular form
Draw the histogram

Interval range     # of times in that range
7 ≤ x ≤ 16         1
16 < x ≤ 25        4
25 < x ≤ 34        4
34 < x ≤ 43        3
43 < x ≤ 52        8
52 < x ≤ 61        0
61 < x ≤ 70        0
70 < x ≤ 79        5
79 < x ≤ 88        3
Total              28
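The interval counts can be double-checked with a short Python sketch. The `times` list is read back off the stem plot earlier in these notes, and `bin_counts` is a name made up for this illustration. It assumes right-closed intervals, with the minimum placed in the first interval so that all 28 observations are counted.

```python
import math

# Times in minutes, read back off the stem plot in the notes.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def bin_counts(data, n_bins):
    """Count observations in n_bins equal-width, right-closed intervals (hypothetical helper)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins            # width = range / number of intervals
    counts = [0] * n_bins
    for x in data:
        # x belongs to interval k when lo + k*width < x <= lo + (k+1)*width;
        # the minimum itself is assigned to the first interval.
        k = max(0, math.ceil((x - lo) / width) - 1)
        counts[k] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

edges, counts = bin_counts(times, 9)
for k, c in enumerate(counts):
    # One '#' per observation gives a rough sideways histogram.
    print(f"{edges[k]:>4.0f} to {edges[k + 1]:>3.0f} | {'#' * c}")
```

Changing the second argument of `bin_counts` is a quick way to see how the apparent shape depends on the number of intervals.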

How is a histogram different from a stem plot?

They show basically the same information, except that stem plots use numbers and histograms use shaded rectangles to show where the quantitative data fall.

How is a histogram different from a bar graph?

Histograms have the bars for each interval touching each other; bar graphs do not.
Histograms will have a continuous x-axis, with the x-values in order. Bar graphs can have the categories on the x-axis listed in any order.
Histograms are for quantitative variables, and bar graphs are for categorical variables.

Find the mean and median of the following 8 numbers in Dataset B:

1 2 4 6 8 9 12 13
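As a check on the by-hand answer, here is a minimal Python sketch using the standard library's `statistics` module. The variable names are chosen only for this illustration.

```python
from statistics import mean, median

dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]

# Mean: add the values and divide by how many there are (55 / 8).
b_mean = mean(dataset_b)

# Median: with an even number of values, average the two middle ones (6 and 8).
b_median = median(dataset_b)

print("mean =", b_mean)
print("median =", b_median)
```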

Spread:

a) Range = max – min (simplest, not always the most helpful)

b) Variance: s^2, the average of the squared deviations of the observations from the mean

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

c) Standard Deviation: s, the square root of the variance; a common way of measuring how far observations are from the mean

Example of finding the standard deviation by hand:

0 2 4

Calculate the mean.
Calculate the variance.
Take the square root of the variance.
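The three steps can be traced in a few lines of Python for the data 0, 2, 4. This sketch assumes the usual sample variance, which divides by n − 1; the variable names are illustrative only.

```python
import math
from statistics import stdev

data = [0, 2, 4]
n = len(data)

# Step 1: the mean.
x_bar = sum(data) / n                              # (0 + 2 + 4) / 3 = 2

# Step 2: the variance, i.e. the sum of squared deviations divided by n - 1.
squared_devs = [(x - x_bar) ** 2 for x in data]    # 4, 0, 4
variance = sum(squared_devs) / (n - 1)             # 8 / 2 = 4

# Step 3: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)                      # 2

print(x_bar, variance, std_dev)
print(stdev(data))   # the library routine uses the same n - 1 divisor
```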

d) pth percentile: value such that p% of the observations fall at or below it

Median = M = 50th percentile
First Quartile = Q1 = 25th percentile
Third Quartile = Q3 = 75th percentile

How do you find quartiles? Think of them as “mini-medians.” Leave the median out, and then find the median of what is left over on the left side (Q1) and what is left over on the right side (Q3).

Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A:

-20 1 23 25 32.5 33 67

Find the 1st and 3rd quartiles of the following 8 numbers in Dataset B:

1 2 4 6 8 9 12 13
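The “mini-median” method can be sketched in Python as follows. The function name `quartiles` is made up for this illustration, and the rule it follows (leave the median out when n is odd, then take the median of each half) is the by-hand rule from these notes, not necessarily what a software package reports.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the by-hand 'mini-median' rule (hypothetical helper)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                 # values to the left of the median
    upper = xs[half + (n % 2):]       # values to the right; skips the median itself when n is odd
    return median(lower), median(xs), median(upper)

dataset_a = [-20, 1, 23, 25, 32.5, 33, 67]
dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]

q1_a, m_a, q3_a = quartiles(dataset_a)
q1_b, m_b, q3_b = quartiles(dataset_b)

print("Dataset A:", q1_a, m_a, q3_a)
print("Dataset B:", q1_b, m_b, q3_b)
```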

e) 5-Number Summary: Min, Q1, M, Q3, Max

f) Interquartile Range (IQR) = Q3 – Q1

Call an observation a suspected outlier if it is: > Q3 + 1.5 × IQR OR < Q1 – 1.5 × IQR
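The 1.5 × IQR rule can be sketched in Python like this. The helper names are made up for the illustration, quartiles are found with the by-hand “mini-median” rule, and the extra value 150 in the second call is hypothetical, added only to show an observation being flagged.

```python
from statistics import median

def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max) using the by-hand quartile rule (hypothetical helper)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half])
    q3 = median(xs[half + (n % 2):])   # skip the median itself when n is odd
    return xs[0], q1, median(xs), q3, xs[-1]

def suspected_outliers(data):
    """Return observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

dataset_a = [-20, 1, 23, 25, 32.5, 33, 67]

summary_a = five_number_summary(dataset_a)
outliers_a = suspected_outliers(dataset_a)

# Hypothetical extra observation, not part of the notes' data.
outliers_with_extra = suspected_outliers(dataset_a + [150])

print(summary_a)
print(outliers_a)
print(outliers_with_extra)
```

For Dataset A, the IQR is 32, so the cutoffs are 1 − 48 = −47 and 33 + 48 = 81, and none of the seven values falls outside them.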

g) Boxplots : Use the 5-number summary

A central box spans the quartiles Q1 and Q3.
A line in the box marks the median M.
Lines extend from the box out to the smallest and largest observations.
Modified boxplots have lines extend from the box out to the smallest and largest observations which are NOT outliers. Dots mark any outliers.

“The Median vs. the Mean in the Age of Average” by Mike Pesca on NPR’s Day-to-Day 7/19/06: http://www.npr.org/templates/story/story.php?storyId=

Do you always have to do all of this by hand? NO! Statistical software packages like SPSS can make life much easier for you, but it’s a good idea to know how to do these by hand so you can make sense of your output. Also, on the exam, you won’t have access to a computer.

Read over your SPSS manual (part of the HW) and get comfortable with using it. You will have a chance to practice on the HW for this week, and you will work on it in lab on Friday.

Enter your data, then Analyze--> Descriptive Statistics--> Explore. Follow the instructions on p. 48 of the SPSS manual.

The output from SPSS for the internet time problem looks like:

Descriptives: Time spent on the web

Statistic                            Value     Std. Error
Mean                                 54.77     5.
95% Confidence Interval for Mean
  Lower Bound
  Upper Bound
5% Trimmed Mean
Median
Variance
Std. Deviation
Minimum                              7
Maximum                              151
Range                                144
Interquartile Range                  48
Skewness                             1.314     .
Kurtosis                             1.977     .

[Histogram: Frequency vs. Time spent on the web (axis from 0 to 150). Mean = 54.77, Std. Dev. = 32., N = 30]

Time spent on the web Stem-and-Leaf Plot

Frequency    Stem &  Leaf

    1.00        0 .  0
    9.00        0 .  222222333
   10.00        0 .  4444444455
    5.00        0 .  77777
    3.00        0 .  888
     .00        1 .
    1.00        1 .  3
    1.00 Extremes    (>=

Stem width:  100
Each leaf:   1 case(s)

Notice on the boxplot, it is easy to identify the potential outlier. This would be your indication that the 5-number summary would be the best way to describe your data. (You could also try calculating the mean and standard deviation without the outlier for comparison.)

SPSS can also give you the Quartiles (listed under “Percentiles”), but these are not necessarily the same answers as what you would get by hand. The “weighted average” and “Tukey’s Hinges” are not the same method we use. For this class, whenever we ask you to calculate the Quartiles, we want you to do them by hand.

Features of bell-shaped distributions (from Section 1.3)

A z-score tells us how many standard deviations away from the mean an observation is:

$$z = \frac{x - \mu}{\sigma}$$

This is also called getting a standardized value.

Why is standardization useful? For comparing apples to oranges.

Example: (p. 88, Problem 1.99) Jacob scores 16 on the ACT. Emily scores 670 on the SAT. Assuming that both tests measure scholastic aptitude, who has the higher score? The SAT scores for 1.4 million students in a recent graduating class were roughly normal with a mean of 1026 and standard deviation of 209. The ACT scores for more than 1 million students in the same class were roughly normal with mean of 20.8 and standard deviation of 4.8.
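One way to work this example is to standardize both scores and compare the z-scores. The Python sketch below uses the means and standard deviations quoted in the problem; the function name `z_score` is chosen only for this illustration.

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# ACT: mean 20.8, standard deviation 4.8 (from the problem statement).
jacob_z = z_score(16, 20.8, 4.8)

# SAT: mean 1026, standard deviation 209 (from the problem statement).
emily_z = z_score(670, 1026, 209)

print(f"Jacob: z = {jacob_z:.2f}")
print(f"Emily: z = {emily_z:.2f}")
print("Higher standardized score:", "Jacob" if jacob_z > emily_z else "Emily")
```

Both scores are below their test's mean. Jacob is about 1 standard deviation below and Emily about 1.7 below, so on the standardized scale Jacob has the higher score.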

How else can we use standardization? If the distribution of observations has a bell shape, then these standardized values have some special properties. One of these is the 68-95-99.7 rule.

Approximately 68% of the observations fall within 1σ of μ.
Approximately 95% of the observations fall within 2σ of μ.
Approximately 99.7% of the observations fall within 3σ of μ.

The most famous bell-shaped distribution is the Normal distribution. We will spend a whole week talking about it for Section 1.3, and it will be important to everything we do for the rest of the semester.
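For the Normal distribution, the percentages in the 68-95-99.7 rule can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `within` is made up for the illustration.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1,
# so values on this scale are already z-scores.
std_normal = NormalDist()

def within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")
```

The rule's 68, 95, and 99.7 are rounded versions of these proportions.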

[Figure: distribution with the mean marked; horizontal axis: standard deviations away from the mean (z-score)]
