Lecture Notes on Looking at Data-Distributions - Solved Examples | MATH 425, Study notes of Data Analysis & Statistical Methods

Material Type: Notes; Class: Statistical Methods I >5; Subject: Mathematics; University: University of Oregon; Term: Winter 2005;

Uploaded on 07/23/2009

Lecture Notes for Math 425/525
Statistical Methods
Qi-Man Shao
Department of Mathematics
University of Oregon
© 2005 by Qi-Man Shao. All rights reserved.

STATISTICS

The science of collecting, organizing and interpreting data

  • Understanding Data
    • Looking at Data: Distributions
    • Looking at Data: Relationships
    • Producing Data
  • Understanding Inference
    • Probability
    • Sampling Distributions
    • Introduction to Inference

Example 1.2 The distribution of marital status for all Americans aged 18 and over

Marital status    Count (millions)    Percent
Never married           43.9            22.9
Married                116.7            60.9
Widowed                 13.4             7.0
Divorced                17.6             9.2
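Each entry in the percent column is that category's count divided by the total of all counts. A minimal Python sketch of the calculation follows; the variable names are illustrative and the counts are copied from the example.

```python
# Counts in millions, copied from Example 1.2.
counts = {
    "Never married": 43.9,
    "Married": 116.7,
    "Widowed": 13.4,
    "Divorced": 17.6,
}

total = sum(counts.values())

# Percent of the whole for each category.
percents = {status: 100 * count / total for status, count in counts.items()}

for status, pct in percents.items():
    print(f"{status:<15}{counts[status]:>7.1f}{pct:>7.1f}")
```

Because the four categories make up the whole population, the percents add to 100, which is the condition a pie chart requires.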

  • Pie charts include all the categories that make up a whole.
  • Bar graphs are more flexible.

Graphs for quantitative variables

Distribution: The pattern of variation of a variable. The distribution records the numerical values of the variable and how often each value occurs. The distribution of a variable is best displayed graphically. Below are 3 commonly used graphical tools.

Stemplots:

Step 1. Separate each observation into a stem and a leaf
  • stem: all but the final digit
  • leaf: the final digit

Step 2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line to the right of the stems

Step 3. Add the leaves to the right of the line

  • Stemplots work best for small numbers of observations.

Example 1.3 Here are the numbers of home runs that Babe Ruth hit in each of his 15 years with the New York Yankees, 1920 to 1934:

54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

Make a stemplot.
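One possible way to carry out the three stemplot steps by machine is sketched below in Python. The function name `stemplot` is hypothetical, and the sketch assumes non-negative integer observations.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot for non-negative integers.

    Stem = all but the final digit, leaf = the final digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # Step 1: split into stem and leaf
        leaves[stem].append(leaf)

    lines = []
    # Step 2: stems in a column, smallest at the top (empty stems are kept).
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: add the leaves, in increasing order, to the right of the bar.
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print("\n".join(stemplot(ruth)))
# 2 | 25
# 3 | 45
# 4 | 1166679
# 5 | 449
# 6 | 0
```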

Example 1.4 Does increasing the amount of calcium in our diet reduce blood pressure? A randomized comparative experiment gave one group of 10 black men a calcium supplement for 12 weeks. The control group of 11 black men received a placebo (a dummy pill) that appeared identical. The blood pressure distributions in the two groups should have been similar at the beginning of the experiment. Here are the initial blood pressure readings for the two groups:

Calcium group

107 110 123 129 112 111 107 112 136 102

Placebo group

123 109 112 102 98 114 119 112 110 117 130

Make a back-to-back stemplot of these data.

  • Stemplots work best for small numbers of observations.
  • Splitting stems: when the display looks too cramped and narrow, we can split each stem into two, one with leaves 0 to 4 and the other with leaves 5 through 9.
  • Back-to-back stemplot: when we wish to compare two related distributions, a back-to-back stemplot with common stems is useful.
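A back-to-back stemplot for the blood pressure data of Example 1.4 could be produced along the following lines. This is a sketch only: the function name `back_to_back` and the text layout are assumptions, and the stems are not split.

```python
from collections import defaultdict

def back_to_back(left, right):
    """Return the lines of a back-to-back stemplot with common stems.

    Leaves on the left are written so that they increase away from the stem.
    """
    def group(values):
        grouped = defaultdict(list)
        for v in sorted(values):
            grouped[v // 10].append(v % 10)
        return grouped

    lg, rg = group(left), group(right)
    lo = min(min(lg), min(rg))
    hi = max(max(lg), max(rg))
    width = max(len(leaves) for leaves in lg.values())

    lines = []
    for stem in range(lo, hi + 1):
        l = "".join(str(d) for d in reversed(lg.get(stem, [])))
        r = "".join(str(d) for d in rg.get(stem, []))
        lines.append(f"{l:>{width}} | {stem:>2} | {r}")
    return lines

calcium = [107, 110, 123, 129, 112, 111, 107, 112, 136, 102]
placebo = [123, 109, 112, 102, 98, 114, 119, 112, 110, 117, 130]

# Calcium leaves on the left, placebo leaves on the right.
print("\n".join(back_to_back(calcium, placebo)))
```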

Example 1.5 Babe Ruth’s home run record for a single year was broken by another Yankee, Roger Maris, who hit 61 home runs in 1961. Here are Maris’s home run totals for his 10 years in the American League:

13 23 26 16 33 61 28 39 14 8

Is Maris’s record of 61 an outlier in his distribution of home runs hit? Who is the better home run hitter, Maris or Ruth?

Examining a distribution:

  • Overall pattern, shape, center, spread
  • Sometimes we use percent (proportion / relative frequency) rather than count.

Time plots: plot each observation against the time at which it was measured.

  • A time plot can reveal trends or other changes over time, such as seasonal variation.

1.2 Describing Distributions with Numbers

Measuring center:

  • Mean (average):

$$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) \quad\text{or}\quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum x_i$$

  • Median M: Middle value
    1. Sort all observations from smallest to largest
    2. If n is odd, the median is the center observation in the ordered list.
    3. If n is even, the median is the average of the two center observations in the ordered list.

Example 1.7 Find Ruth’s mean number and median number of home runs hit in a year:

54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

Find Maris’s mean number and median number of home runs hit in a year:

13 23 26 16 33 61 28 39 14 8

Solution:
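The calculation can be checked with a short Python sketch that follows the definitions of the mean and the median given above. The helper names `mean` and `median` are illustrative, not part of the notes.

```python
def mean(xs):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered list (average of the two center values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [13, 23, 26, 16, 33, 61, 28, 39, 14, 8]

print(f"Ruth:  mean = {mean(ruth):.1f}, median = {median(ruth)}")    # mean 43.9, median 46
print(f"Maris: mean = {mean(maris):.1f}, median = {median(maris)}")  # mean 26.1, median 24.5
```

For Maris the mean (26.1) lies above the median (24.5), which reflects the pull of the single large value 61 on the mean.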

Mean or Median

  • Boxplot

Example 1.8 Some people worry about how many calories they consume. Consumer Reports magazine, in a story on hot dogs, measured the calories in 20 brands of beef hot dogs, 17 brands of meat hot dogs, and 17 brands of poultry hot dogs. Here are the computer outputs:

Hot dogs    Min    Q1     M       Q3      Max
Beef        111    140    152.5   178.5   190
Meat        107    139    153     179     195
Poultry      87    102    129     143     170

Make side-by-side boxplots of the calorie counts for the three types of hot dogs.

Measuring spread: the standard deviation (s.d.)

The most common measure of the spread about the mean

  • The variance of n observations $x_1, x_2, \ldots, x_n$ is

$$s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] \quad\text{or}\quad s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$$

  • The standard deviation s is the square root of the variance $s^2$:

$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$

  • n − 1 is called the degrees of freedom (d.f.) of the variance or standard deviation.
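As a check on the definitions, here is a minimal Python sketch of the sample variance and standard deviation with divisor n − 1. The function names are illustrative; the standard library's `statistics.variance` and `statistics.stdev` use the same divisor and serve as a cross-check.

```python
import math
import statistics

def variance(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def std_dev(xs):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(variance(xs))

data = [2, 4, 6]            # mean 4, squared deviations 4 + 0 + 4 = 8
print(variance(data))       # 8 / (3 - 1) = 4.0
print(std_dev(data))        # 2.0

# Cross-check against the standard library on a larger data set.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(math.isclose(std_dev(ruth), statistics.stdev(ruth)))  # True
```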

Properties of the standard deviation:

  • s measures spread about the mean and should be used only when the mean is chosen as the measure of center
  • s ≥ 0
  • s = 0 only when all observations have the same value.
  • Unlike the IQR, s is sensitive to extreme observations
  • For describing a skewed distribution, the five-number summary is usually better than the mean and s.d.
  • Always plot your data. Numerical summaries are not a complete description.

Example 1.

Data Set I: 12 25 38 8 42
Data Set II: 19 32 45 15 49

Note that each value of the second data set is obtained by adding 7 to the corresponding value of the first data set. Calculate the mean and the standard deviation for each of these two data sets. Comment on the relationship between the two means and the two standard deviations.

Solution:

Example 1.

Data Set I: 2 8 15 9 11
Data Set II: 4 16 30 18 22

Note that each value of the second data set is obtained by multiplying the corresponding value of the first data set by 2. Calculate the mean and the standard deviation for each of these two data sets. Comment on the relationship between the two means and the two standard deviations.

Changing the unit of measurement

A linear transformation changes the original variable x into the new variable y given by an equation of the form y = a + bx.

Then $\bar{y} = a + b\bar{x}$ and $s_y = |b|\,s_x$.
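The two rules can be verified numerically on the data sets from the two preceding examples. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Adding a constant: y = 7 + x  (a = 7, b = 1)
add_x = [12, 25, 38, 8, 42]
add_y = [19, 32, 45, 15, 49]
print(mean(add_x), mean(add_y))      # 25 and 32: the mean shifts by 7
print(stdev(add_x), stdev(add_y))    # equal: the spread is unchanged

# Multiplying by a constant: y = 2x  (a = 0, b = 2)
mul_x = [2, 8, 15, 9, 11]
mul_y = [4, 16, 30, 18, 22]
print(mean(mul_x), mean(mul_y))      # 9 and 18: the mean doubles
print(stdev(mul_x), stdev(mul_y))    # the standard deviation doubles as well
```

Adding a constant moves the center but leaves the spread alone, while multiplying by b rescales both the mean and the standard deviation (the latter by |b|).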

1.3 The Normal Distributions

The most useful mathematical model in probability and statistics

Strategy for exploring distributions: graphical → numerical → mathematical model

Note:

  • The relative frequency histogram depends on choice of classes
  • As we divide the cells into finer ones and increase the number of observations, the step function of the histogram tends to a smooth curve, called the density curve

Density curve: an idealized description of the distribution of data

  • is always on or above the horizontal axis
  • has area exactly 1 underneath it
  • area between a and b = proportion of the variable between a and b

Example 1.12 Refer to Exercise 1.79 (p.84)

Solution:

Median and mean of a density curve

  • Median: equal-areas point
  • Mean μ: balance point
  • mean = median for a symmetric density curve
  • for a skewed density curve, the mean is pulled farther toward the long tail than the median

μ, σ and $\bar{x}$, s

  • μ: mean of the density curve
  • σ: standard deviation of the density curve
  • $\bar{x}$: mean of data, computed from actual observations
  • s: standard deviation of data

Normal distributions: symmetric, single-peaked, bell-shaped density curves

  • X is called a normal variable if it has density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty$$

  • μ is the mean of X and σ is the standard deviation of X. We write X ∼ N(μ, σ).
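To see that the normal density behaves like a density curve (total area 1, with the area between a and b giving a proportion), the formula can be evaluated directly. The Python sketch below is illustrative: `normal_pdf` and `area` are hypothetical helper names, and the area is approximated with a simple midpoint rule.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma), where sigma is the standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def area(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Height of the standard normal curve at its center: 1 / sqrt(2*pi), about 0.3989.
print(round(normal_pdf(0.0), 4))

# Area under an N(64.5, 2.5) curve over mu +/- 8 sigma: essentially 1.
heights = lambda x: normal_pdf(x, mu=64.5, sigma=2.5)
print(round(area(heights, 44.5, 84.5), 6))
```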

The 68-95-99.7 Rule

In any normal distribution N(μ, σ)

  • 68% of the observations fall within σ of the mean μ
  • 95% of the observations fall within 2σ of the mean μ
  • 99.7 % of observations fall within 3σ of the mean μ
  • Area between a and b (a < b) = area to the left of b − area to the left of a

P (a < Z ≤ b) = P (Z ≤ b) − P (Z ≤ a)
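The 68-95-99.7 percentages can be recovered from the identity P(a < Z ≤ b) = P(Z ≤ b) − P(Z ≤ a). In the Python sketch below, `phi` is an illustrative name for the standard normal cumulative proportion; it is computed from the error function rather than read from Table A.

```python
import math

def phi(z):
    """Standard normal cumulative proportion P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1, 2, 3):
    within = phi(k) - phi(-k)   # P(-k < Z <= k)
    print(f"within {k} standard deviation(s): {within:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The rule's 68, 95, and 99.7 are rounded versions of these values.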

Example 1.14 Use Table A to find

a. P(Z < 2.5)

b. P(Z > 2.5)

c. P(Z < −1.6)

d. P(−1.6 < Z < 2.5)

Solution:
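Table A values can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()              # standard normal: mu = 0, sigma = 1

p_a = Z.cdf(2.5)              # a. P(Z < 2.5), about 0.9938
p_b = 1 - Z.cdf(2.5)          # b. P(Z > 2.5), about 0.0062
p_c = Z.cdf(-1.6)             # c. P(Z < -1.6), about 0.0548
p_d = Z.cdf(2.5) - Z.cdf(-1.6)  # d. P(-1.6 < Z < 2.5), about 0.9390

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"{label}. {p:.4f}")
```

Part d is the area to the left of 2.5 minus the area to the left of −1.6.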

Example 1.15 The distribution of heights of young women aged 18 to 24 is approximately N(64.5, 2.5) (in inches). What proportion of all young women

a. are less than 68 inches tall?

b. are between 64.5 and 67 inches tall?

c. are at least 70 inches tall?

Solution:
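A sketch of the computation in Python follows, standardizing each height with z = (x − μ)/σ and then using the standard normal cumulative proportion. `statistics.NormalDist` stands in for Table A; the helper name `z_score` is illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 64.5, 2.5         # N(64.5, 2.5): mean and standard deviation in inches
Z = NormalDist()              # standard normal

def z_score(x):
    """Standardize a height x."""
    return (x - MU) / SIGMA

p_a = Z.cdf(z_score(68))                         # a. P(X < 68) = P(Z < 1.4), about 0.9192
p_b = Z.cdf(z_score(67)) - Z.cdf(z_score(64.5))  # b. P(0 < Z < 1), about 0.3413
p_c = 1 - Z.cdf(z_score(70))                     # c. P(X >= 70) = P(Z >= 2.2), about 0.0139

print(f"a. {p_a:.4f}  b. {p_b:.4f}  c. {p_c:.4f}")
```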

To compute proportions for N(μ, σ)

  1. Formulate the problem in terms of the proportion of a variable X.
  2. Use the z-score transformation $z = \frac{x - \mu}{\sigma}$ to restate the problem in terms of an N(0, 1) variable.
  3. Use Table A and the properties to find the required proportion (area).

Find a value given a proportion

Example 1.16 Use Table T-11 to find the value z of a standard normal variable that satisfies each of the following conditions.

(a) the point z with 10% of the observations falling below it

(b) the point z with 5% of the observations falling above it

Solution:
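Finding a value from a proportion is the inverse of a table lookup. As a cross-check on the table, the sketch below uses `NormalDist.inv_cdf` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()              # standard normal

# (a) 10% of the observations fall below z.
z_a = Z.inv_cdf(0.10)         # about -1.28

# (b) 5% fall above z, so 95% fall below it.
z_b = Z.inv_cdf(1 - 0.05)     # about 1.645

print(f"(a) z = {z_a:.3f}   (b) z = {z_b:.3f}")
```

A proportion above z has to be converted to a proportion below z before the inverse lookup, because cumulative proportions are areas to the left.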

Example 1.17 The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC) are normally distributed with μ = 100 and σ = 15. What score must a child achieve on the WISC in order to fall in the top 5% of the population? In the top 1%?

Solution:
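One way to check the answer is to find the standard normal value z with the required area to its left and then undo the standardization with x = μ + zσ. The Python sketch below uses `statistics.NormalDist` in place of the table; the function name `cutoff` is hypothetical.

```python
from statistics import NormalDist

MU, SIGMA = 100, 15           # WISC reference population
Z = NormalDist()              # standard normal

def cutoff(top_fraction):
    """Score x such that the given fraction of the population scores above x."""
    z = Z.inv_cdf(1 - top_fraction)   # area to the left of z
    return MU + z * SIGMA             # invert z = (x - mu) / sigma

print(f"top 5%: {cutoff(0.05):.1f}")  # about 124.7
print(f"top 1%: {cutoff(0.01):.1f}")  # about 134.9
```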