
Confidence Intervals and Hypothesis Testing for Population Means and Differences, Study notes of Statistics

How to determine the differences between population means of dependent variable values associated with different factor levels and give a meaningful estimate for these means. It focuses on the assumption that errors in models are normally distributed with a common variance and calculates confidence intervals and performs hypothesis tests using t-statistics. The document also covers the use of ANOVA and MSW for calculating confidence intervals for column means.

Typology: Study notes

Pre 2010

Uploaded on 07/28/2009

koofers-user-1ru

Confidence Intervals and Hypothesis Testing for Population Means and Differences between Population Means

In this course, we are interested in determining the differences between population means of dependent variable values associated with different factor levels. We also would like to be able to give a meaningful estimate for these means. Furthermore, we are working under the assumption that the errors in our models are normally distributed with a common variance (that has to be estimated from the data). This means that, when calculating confidence intervals or performing hypothesis tests, the appropriate test statistic will be a t-statistic with some df = degrees of freedom. Let’s review, then, how to compute confidence intervals and how to perform hypothesis tests in this specialized setting.

Confidence Intervals:

For a single population mean of a normally distributed random variable with \(\sigma\) unknown.

Illustrative Example: Suppose that the price of regular gasoline in a given area in a given week is normally distributed and we wish to estimate the mean price of a gallon of gasoline in this region for the first week of August. If we randomly select ten (10) gas stations in the area and determine what they are charging for a gallon of regular gas, then the average of those ten prices should give us an idea of what the overall average price of a gallon of regular in the area for the week is. However, we know that if we take another random sample of ten stations and calculate the average price for them, it very possibly might differ from the average determined by our first sample. This means that the average price of gas in the region during the first week of August for random samples of ten (10) stations is actually a random variable and not something fixed (just as the price at an individual station is a random variable, which we have assumed is normally distributed). At this point, we might want to know, just out of curiosity, what the distribution is of the random variable of average gas prices for ten randomly selected gas stations from our area during the first week of August (quite a mouthful!).

Let’s put curiosity aside for the moment and back up and suppose that instead we want to estimate the mean price of regular gas by just picking one station at random and using its price some way. For example, suppose George’s BP charges $1.81 (we’re rounding up). Now, we’re pretty certain that $1.81 is not the true average for the entire region. It could be, but it isn’t too likely. Therefore, the statement that the mean price, \(\mu\), equals $1.81 is almost certainly false. In order to have a reasonable chance of saying anything true, we could soften our statement to, say, the mean is “nearly” or “approximately” $1.81. However, this makes our statement somewhat vague because we haven’t pinned down what “approximately” means. Maybe we could say that the mean price is between two numbers, say between $1.71 and $1.91 ($1.81 ± $0.10). But we can’t claim this with certainty either, because we originally assumed that our gas prices were normally distributed and normal distributions assign some nonzero probability to values outside any finite interval. Of course you might now say that assumption was stupid, because anybody knows that the price certainly is between $0 and, say, $10. But we aren’t willing to throw this assumption out; besides, a more precise statement of what we meant by it is that, except for a range of prices of probability so small that they are completely negligible, the random variable of prices is normal. Thus, a statement like “the true mean is between $1.71 and $1.91 with some high probability” is something that we might be able to say.

Thus, we are willing to say to someone, you tell us (or we pick in advance) a confidence level (90%, 95%, 99%, 99.9%, whatever) and we will tell you how to pick an interval (range of values) such that the true population mean lies within the interval that percentage (90%, 95%, 99%, 99.9%, whatever) of the time. For technical reasons, we prefer to talk about a confidence level \(c = (1 - \alpha) \cdot 100\%\), so, if \(\alpha = .05\), then c = 95%.

How should we go about choosing a c% confidence interval? Returning to the gas price problem, in particular, how should we choose a c% confidence interval for gas price based on the price at a single gas station? If we don’t know any more than we have revealed so far, we can’t, but, if we have some information concerning the variance of the original distribution, we can use it. Just for discussion purposes, let’s suppose that we know the variance completely, i.e., we know the value of \(\sigma^2\) and, hence, \(\sigma\). Then we know that

\[ P(x - E \le \mu \le x + E) = P(-E \le x - \mu \le E) = P\left(-\frac{E}{\sigma} \le z \le \frac{E}{\sigma}\right), \]

where \(z = \frac{x - \mu}{\sigma}\) is the so-called z-score. Under our assumptions, z is distributed according to a normal distribution with a mean of 0 and a standard deviation of 1 (the standard normal distribution). So the question becomes: given confidence level c, what value of E is such that

\[ P\left(-\frac{E}{\sigma} \le z \le \frac{E}{\sigma}\right) = c? \]

Of course, since we know \(\sigma\), this is the same as asking what value of \(z_c = E/\sigma\) is such that

\[ P(-z_c \le z \le z_c) = c. \]

From what we know about the standard normal distribution, we are asking for that value of z such that \(\alpha/2\) units of area lie above it, or below its negative. Once we have found \(z_c\), we then take \(E = z_c \sigma\) and this would solve our problem. Thus, for the gas prices, suppose that we knew that \(\sigma = .2\) and we want a 95% confidence interval. Then \(z_c = 1.96\) (from tables or from Excel) and \(E = z_c \sigma = 1.96(.2) = .392\) and, using the price at George’s BP, the interval with end points 1.81 ± .392 is a 95% confidence interval. We cannot be certain that it contains the true mean, but we know that, if we generate intervals in this way, then 95% of the time the resulting interval will contain the true mean.
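The known-\(\sigma\) calculation can be checked with a short Python sketch. It uses only the standard library; the function names `z_interval` and `coverage` are illustrative, and the "true" mean of 1.75 used in the coverage check is a made-up value, not something given in these notes. Since 1.96 × .2 = .392, the first part should report a margin of about .392 around $1.81.

```python
import random
from statistics import NormalDist

def z_interval(x, sigma, conf=0.95):
    """Interval x +/- E from one observation x when sigma is known."""
    alpha = 1 - conf
    # z_c leaves alpha/2 of the standard normal area above it
    z_c = NormalDist().inv_cdf(1 - alpha / 2)
    E = z_c * sigma
    return x - E, x + E, E

def coverage(mu, sigma, conf, trials, seed=0):
    """Fraction of simulated intervals that contain the (hypothetical) true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.gauss(mu, sigma)          # price at one randomly chosen station
        low, high, _ = z_interval(x, sigma, conf)
        hits += low <= mu <= high
    return hits / trials

low, high, E = z_interval(1.81, 0.2, 0.95)
print(f"E = {E:.3f}, interval = ({low:.3f}, {high:.3f})")
# Hypothetical true mean of 1.75, chosen only to illustrate long-run coverage
print(f"simulated coverage: {coverage(1.75, 0.2, 0.95, 20000):.3f}")
```

The simulated coverage should land close to 0.95, which is the sense in which intervals generated this way contain the true mean 95% of the time.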

In reality, though, we aren’t going to know \(\sigma\) and we are going to have to take a sample bigger than 1 in order to learn anything about it. But taking a sample helps in two ways: first, it gives us a way to estimate \(\sigma\), and, second, the distribution of sample means of samples of size n drawn at random from a population has the same mean as the population and a much tighter (smaller) variance. This means that using a point from the distribution of sample means of samples of size n to estimate the mean of that distribution will give us an interval estimate for the original distribution’s mean with a much smaller width than we could get if we used a point from the original distribution. Here is the result that justifies these statements:

Let X be a normally distributed random variable with mean \(\mu\) and standard deviation \(\sigma\) and let \(\bar{X}\) be the random variable of means of samples of size n. Then the mean of \(\bar{X}\) is \(\mu\) and the standard deviation of \(\bar{X}\) is

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}. \]

Thus, a confidence interval based on a sample of size 100 will have 1/10th the width of one based on a single value from the original distribution.
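A small simulation can make the \(\sigma/\sqrt{n}\) result concrete. This is only a sketch under stated assumptions: the population mean of 1.81 and \(\sigma = .2\) are borrowed from the gas price example purely for illustration, and the helper names are hypothetical.

```python
import math
import random
import statistics

def theoretical_sd(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

def simulated_sd(mu, sigma, n, reps=5000, seed=1):
    """Draw `reps` samples of size n and return the standard deviation of their means."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
    return statistics.stdev(means)

for n in (1, 10, 100):
    print(f"n = {n:3d}: theory {theoretical_sd(0.2, n):.4f}, "
          f"simulation {simulated_sd(1.81, 0.2, n):.4f}")
```

The simulated values should track the theoretical ones closely, with the n = 100 spread about one tenth of the n = 1 spread.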

Now, there is one more complication, and then we are done. We don’t know \(\sigma\); we have to use the sample standard deviation, s, to approximate \(\sigma\). But this introduces more variability, which we need to take into account in calculating the confidence interval width. Our estimate for the standard deviation of the distribution of sample means of samples of size n is

\[ s_e = \frac{s}{\sqrt{n}}; \]

we refer to \(s_e\) as the standard error of estimate. We still have

\[ P(\bar{x} - E \le \mu \le \bar{x} + E) = P(-E \le \bar{x} - \mu \le E) = P\left(-\frac{E}{s_e} \le t \le \frac{E}{s_e}\right). \]

Here,

\[ t = \frac{\bar{x} - \mu}{s_e} = \frac{\bar{x} - \mu}{s/\sqrt{n}}, \]

but, despite its similarity to z, t is not distributed like a standard normal; instead, it follows a Student’s t distribution with degrees of freedom, df, equal to n - 1. Inside this distribution, we now seek a critical value, \(t_c\), such that

\[ P(-t_c \le t \le t_c) = c, \]

and then we can take

\[ E = t_c s_e = t_c \frac{s}{\sqrt{n}}. \]

Returning to our gas price example, suppose a random sample of 10 gas stations yields a mean \(\bar{x} = 1.81\) and \(s = .2\). Then \(s_e = s/\sqrt{n} = .2/\sqrt{10} = .06325\) and df = 10 - 1 = 9. So \(t_c = 2.262\) if c = 95% and \(E = s_e t_c = 2.262(.06325) = .143\), so our 95% confidence interval would have endpoints 1.81 ± .143.
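The t-based interval can be reproduced in Python as a rough check. The standard library has no inverse t function, so this sketch finds the critical value numerically (Simpson's rule for the area under the t density, then bisection). That numerical approach is a choice made here for self-containment, not the method used in these notes, which read \(t_c\) from tables or Excel. All function names are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(t <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(conf, df):
    """Find t_c with P(-t_c <= t <= t_c) = conf by bisection."""
    target = 0.5 + conf / 2              # equals 1 - alpha/2
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(x_bar, s, n, conf=0.95):
    """Interval x_bar +/- t_c * s / sqrt(n) with df = n - 1."""
    s_e = s / math.sqrt(n)               # standard error of estimate
    t_c = t_critical(conf, n - 1)
    E = t_c * s_e
    return x_bar - E, x_bar + E, t_c, s_e, E

low, high, t_c, s_e, E = t_interval(1.81, 0.2, 10)
print(f"t_c = {t_c:.3f}, s_e = {s_e:.5f}, E = {E:.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

With the gas price figures, the computed critical value and margin should agree with the 2.262 and .143 worked out above to three decimal places.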

When calculating confidence intervals after an ANOVA, we may use MSW as a pooled estimate for the population variance. The square root of this value may be used for s, and \(s_e\) can be taken to be \(\sqrt{\mathrm{MSW}/R}\) in the calculation of a confidence interval for any column mean, \(\mu_j\) (here there are R observations in each of C columns). The degrees of freedom for the appropriate t-distribution, however, will be RC - C, which is greatly to our advantage, since a larger df gives a smaller critical value than the R - 1 degrees of freedom a single column would provide. Let’s use these guidelines to calculate confidence intervals for the factor level means for the column means in the Golfcourse data in Excel File PB2 – 16.

Here is a summary of the steps:

1. Perform an ANOVA.

2. On the result page for the ANOVA, enter alpha in a convenient cell, and calculate tcrit using the TINV function with input alpha and df = df of SSW from the ANOVA report table. Calculate the error term as

\[ E = t_{crit} \sqrt{\frac{\mathrm{MSW}}{R}}. \]

3. In the fifth row, choose contiguous blank cells and then

a. enter the formula “= column 1 mean - E” in the first cell

b. enter the formula “= column 1 mean + E” in the second cell

4. Fill down for all the rows corresponding to columns in the factor analysis table.
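The same steps can be sketched outside Excel. The Python below is a minimal illustration that assumes equal column sizes. The data are invented scores, not the Golfcourse data, and the critical value 2.179 is the standard table value for a two-tailed \(\alpha = .05\) with df = 12, supplied by hand in the way the notes obtain tcrit from TINV. The function name `column_mean_intervals` is hypothetical.

```python
import math

def column_mean_intervals(columns, t_crit):
    """Confidence intervals for each column mean using MSW as the pooled variance.

    Assumes every column has the same number of observations R.
    """
    C = len(columns)
    R = len(columns[0])
    means = [sum(col) / R for col in columns]
    # SSW: squared deviations of each value from its own column mean
    ssw = sum((v - m) ** 2 for col, m in zip(columns, means) for v in col)
    df_within = R * C - C
    msw = ssw / df_within
    E = t_crit * math.sqrt(msw / R)      # error term from step 2
    intervals = [(m - E, m + E) for m in means]
    return intervals, means, msw, df_within, E

# Hypothetical data: 3 columns (factor levels) with 5 observations each
data = [
    [72, 75, 71, 74, 73],
    [78, 80, 77, 79, 81],
    [70, 69, 72, 71, 68],
]
# Table value of t for alpha = .05 (two-tailed) and df = 3*5 - 3 = 12
intervals, means, msw, df_within, E = column_mean_intervals(data, t_crit=2.179)
print(f"MSW = {msw:.3f}, df = {df_within}, E = {E:.3f}")
for j, (m, (lo, hi)) in enumerate(zip(means, intervals), start=1):
    print(f"column {j}: mean {m:.2f}, interval ({lo:.2f}, {hi:.2f})")
```

Every column gets the same margin E because the pooled estimate MSW and the common column size R are shared across columns.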

Now let’s do hypothesis tests for differences in the column means. For the Golfcourse data in PB2 – 15