Confidence Intervals: Estimating Unknown Parameters with Margin of Error

This document introduces confidence intervals in statistics, focusing on the estimation of population proportions and means. It covers the use of the normal and Student's t-distributions to calculate confidence intervals for proportions and means, respectively. It also discusses confidence intervals for differences in proportions and means, as well as the importance of sample size and independence in the calculations.

Typology: Exams

Pre 2010

Uploaded on 03/11/2009

koofers-user-c32

CONFIDENCE INTERVALS

In statistics one often would like to estimate unknown parameters for a known distribution. For example, you may think that your parent population is normal, but the mean is unknown, or both the mean and standard deviation are unknown. From a data set you cannot hope to know the exact values of the parameters, but the data should give you a good idea of what they are. For the mean, we expect that the sample mean or average of our data will be a good choice for the population mean, and intuitively, we understand that the more data we have the better this should be. How do we quantify this? That is, how good is our sample estimate at guessing the true, unknown population parameter?

Statistical theory is based on knowing the sampling distribution of some statistic, such as the mean. This allows us to make probability statements about the value of the parameters, since we can never identify the true value of the parameters. Such statements are typically in the form of a confidence interval, indicating that we are, say, 95 percent confident the parameter is in some range of values.

Proportions

The most widely seen use of confidence intervals is the estimation of a population proportion through surveys or polls. For example, suppose it is reported that 100 people were surveyed and 42 of them liked brand X. How do you see this in the media?

  • “42% of the population reports they like brand X.”
  • “The survey indicates that 42% of people like brand X. This has a margin of error of 9 percentage points.”
  • “The survey indicates that 42% of people like brand X, with a margin of error of 9 percentage points. This is a 95% confidence level.”

Why all the different answers? Well, the idea that we can infer anything about the population based on a survey of just 100 people is founded on probability theory. If the sample is a random sample, then we know the sampling distribution of $\hat{p}$, the sample proportion. It is approximately normal. (How do we know that?)

Example: CNN/Time Poll June 14-15, 2000 of n = 1,218 adults nationwide who were asked, "Do you like to watch reality-based television programs, or don't you?" Results: Yes 43%, No 53%, Not sure 4%

Estimate the percent of all US adults who like to watch reality based TV and find the SE for this estimate.

In the above example we got a sample percentage of 43% enjoying watching reality-based TV. The SE was estimated to be 1.4%.

A 95% confidence interval means that we can be about 95% “confident” that the population percentage lies in the interval 43% ± (2 * 1.4%). In other words the interval (40.2%, 45.8%) is a 95% confidence interval for the population percentage.
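The arithmetic behind the 1.4% standard error and the interval above can be checked with a few lines of R. This is a minimal sketch that plugs in the poll's reported values (43% of n = 1,218); the variable names are illustrative choices.

```r
# CNN/Time poll: sample proportion and sample size from the example
phat <- 0.43
n <- 1218

# Estimated standard error of the sample proportion
SE <- sqrt(phat * (1 - phat) / n)

# Approximate 95% confidence interval: estimate plus or minus 2 SE
ci <- c(lower = phat - 2 * SE, upper = phat + 2 * SE)

round(100 * SE, 1)   # SE in percentage points
round(100 * ci, 1)   # interval in percent
```

Rounded to one decimal place, this reproduces the SE of 1.4% and the interval (40.2%, 45.8%) quoted in the text.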

This does NOT mean that either of the following is true:

  • There is a 95% chance that the population percentage is between 40.2% and 45.8%.

This statement doesn’t really make sense because a 95% chance means that something will happen 95% of the time. But the population percentage is not changing. It doesn't happen 95% of the time; it is the same always—we just don't know it.

  • There is a 95% chance that the sample percentage is between 40.2% and 45.8%.

This statement also doesn’t make sense. We know that our particular sample has exactly 43% who like to watch reality-based TV. Again, there is no chance associated with that, since we’re 100% certain it’s in that interval.

The 95% chance refers to the sampling process: each time we take a sample we get a slightly different sample percentage. So, if we took 100 samples and calculated “sample % ± 2*SE” for each one, we would expect about 95 of the resulting intervals to cover the true population percentage.
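The idea that the 95% describes the sampling process can be illustrated by simulation. The sketch below assumes a hypothetical true population proportion of 0.43 (in practice this value is unknown), draws many polls of the same size, and counts how often “sample % ± 2*SE” covers that true value. The seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)            # arbitrary seed so the run is repeatable
p_true <- 0.43         # hypothetical "true" population proportion (an assumption)
n <- 1218              # sample size, borrowed from the poll example
reps <- 10000          # number of simulated polls

# Each simulated poll yields one sample proportion
phat <- rbinom(reps, size = n, prob = p_true) / n
SE <- sqrt(phat * (1 - phat) / n)

# Does "sample % +/- 2*SE" cover the true value in each simulated poll?
covered <- (phat - 2 * SE <= p_true) & (p_true <= phat + 2 * SE)
coverage <- mean(covered)
coverage               # fraction of intervals that cover p_true
```

The fraction of covering intervals should land near 0.95. Any single interval either covers the true value or it does not.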

When dealing with a 95% confidence interval for proportions, the “± 2*SE” is commonly called the margin of error or the sampling error in polls.

In general, the interval $\hat{p} \pm z \sqrt{\hat{p}(1 - \hat{p})/n}$ is a confidence interval for the population percentage found by using the Normal distribution.

  • $\hat{p} \pm 1 \cdot SE$ is a 68% confidence interval for the population percentage.
  • $\hat{p} \pm 2 \cdot SE$ is a 95% confidence interval for the population percentage.
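The 68% and 95% figures come from areas under the standard normal curve. A small sketch using base R's `pnorm` and `qnorm` shows the connection; the helper name `coverage` is made up for illustration.

```r
# Area under the standard normal curve within z standard errors of the center
coverage <- function(z) pnorm(z) - pnorm(-z)

coverage(1)      # about 0.68
coverage(2)      # about 0.95
qnorm(0.975)     # multiplier giving exactly 95%: about 1.96
```

This is why a multiplier of 2 is a convenient rounding of 1.96 for a 95% interval.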

Example: In a July 3, 2003 Gallup poll 1,006 adults nationwide were asked the following: “Would you favor holding the fast food industry legally responsible for the diet-related health problems of people who eat fast food on a regular basis?” 10% answered ‘YES’.

Find a 95% confidence interval for the percent of all US adults who believe the fast food industry should be held legally responsible for such health problems.

T-Test for Means

Statistical analysis must sometimes be based on less than ideal circumstances, as is the case with a small sample. Under certain conditions, we may use the Student’s t-distribution in place of the Normal distribution for computing the confidence interval of a population mean.

Use the t-statistic and the Student's curve (t-distribution) when all 3 of these conditions are met:

  1. The sample is small, less than or equal to 25. (If the sample > 25, use z by way of the Central Limit Theorem.)
  2. The histogram for the contents of the box is close to the normal curve. (If the sample is small and the contents of the box are not normal, then you can't use either z or t.)
  3. The σ = SD of the population is unknown, which is almost always true. (If the SD is known, don’t use t; use z instead.)
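The three conditions above can be read as a small decision rule. The function below is a hypothetical helper, not part of R or of the course materials, that encodes them literally, including the cutoff of 25 stated in condition 1.

```r
# Hypothetical helper encoding the three conditions listed above.
# n              : sample size
# roughly_normal : TRUE if the histogram of the data looks close to normal
# sd_known       : TRUE if the population SD is known
choose_statistic <- function(n, roughly_normal, sd_known) {
  # Small sample from a non-normal box: neither z nor t applies
  if (n <= 25 && !roughly_normal) return("neither")
  # Large sample (Central Limit Theorem) or known SD: use z
  if (n > 25 || sd_known) return("z")
  # Small sample, roughly normal, SD unknown: use t
  "t"
}

choose_statistic(16, roughly_normal = TRUE, sd_known = FALSE)
choose_statistic(900, roughly_normal = TRUE, sd_known = FALSE)
choose_statistic(16, roughly_normal = FALSE, sd_known = FALSE)
```

The three calls correspond to a small normal-looking sample, a large sample, and a small non-normal sample.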

Use of the t-statistic is very similar to that of the z-statistic. In general, the interval $EV \pm t \cdot SE = \hat{\mu} \pm t \cdot \hat{\sigma}/\sqrt{n}$ is a confidence interval for the population mean found by using the Student’s t-distribution.

However, sample size is even more important with the Student’s t. When looking up an area under the t-distribution, we must also know the degrees of freedom, which equal n – 1. Note that the t curves are fatter in the tails than the normal curve to pick up more potential outliers.
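To see how the degrees of freedom change the multiplier, one can compare 95% critical values from `qt` against the normal value from `qnorm`. The particular degrees of freedom below are arbitrary illustrative choices.

```r
# 95% two-sided critical values for several degrees of freedom
df <- c(2, 5, 15, 30, 100)
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)

# Side-by-side comparison: the t multiplier is always larger than z
data.frame(df = df, t = round(t_crit, 3), z = round(z_crit, 3))
```

The t multiplier shrinks toward the normal multiplier as the degrees of freedom grow, which reflects the fatter tails for small samples.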

Example: A simple random sample of 16 is taken from a population of 30,000 university students. These students have an average age of $\hat{\mu} = 22.3$ years with an SD of $\hat{\sigma} = 4.5$ years, and their ages appear to be normally distributed.

  • Find a 95% confidence interval for the average age of all 30,000 students.

Examples in R

Example: In a July 3, 2003 Gallup poll 1,006 adults nationwide were asked the following: “Would you favor holding the fast food industry legally responsible for the diet-related health problems of people who eat fast food on a regular basis?” 10% answered ‘YES’.

Find a 95% confidence interval for the percent of all US adults who believe the fast food industry should be held legally responsible for such health problems.

Step-by-step...

> phat = .10; n = 1006
> SE = sqrt(phat*(1-phat)/n)
> z = 1.96
> c(phat - z*SE, phat + z*SE)

All at once...

> prop.test(.1*1006, 1006, conf.level=.95)

Note: The results of prop.test will differ slightly from the results found by our “by hand” method using SE. The prop.test function is based on the true standard deviation and not the approximated standard error. This is more complicated algebraically, but more correct, as the central limit theorem approximation for the binomial is better for this expression.
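The size of the difference between the two methods can be seen by computing both intervals side by side. This is a sketch using the Gallup numbers from the example. R's help page for `prop.test` describes its interval as coming from inverting the score test, with a continuity correction applied by default.

```r
phat <- 0.10
n <- 1006

# "By hand" interval using the estimated standard error
SE <- sqrt(phat * (1 - phat) / n)
by_hand <- c(phat - 1.96 * SE, phat + 1.96 * SE)

# Interval reported by prop.test (score-based, continuity-corrected by default)
score <- as.numeric(prop.test(phat * n, n, conf.level = 0.95)$conf.int)

round(by_hand, 4)
round(score, 4)
```

The two intervals are close but not identical, and the prop.test interval is not symmetric about the sample proportion.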

Example: A simple random sample of 900 is taken from a population of 30,000 university students. The average age of the sample is $\hat{\mu} = 22.3$ years with an SD of $\hat{\sigma} = 4.5$ years.

Find a 95% confidence interval.

> xbar=22.3; sd=4.5; n=900
> SE = sd/sqrt(n)
> z = 1.96
> c(xbar - z*SE, xbar + z*SE)

Find an 80% confidence interval.

> alpha=.20
> z = qnorm(1-alpha/2)
> c(xbar - z*SE, xbar + z*SE)

Example: A simple random sample of 16 is taken from a population of 30,000 university students. Their average age is $\hat{\mu} = 22.3$ years with an SD of $\hat{\sigma} = 4.5$ years, and their ages appear to be normally distributed. Find a 95% confidence interval for the average age.

> xbar=22.3; sd=4.5; n=16
> SE = sd/sqrt(n)
> alpha = .05
> t = qt(1-alpha/2, df=16-1)
> c(xbar - t*SE, xbar + t*SE)

Differences of Independent Means

Recall that for one sample, our method for converting a mean to standard units and finding a test statistic is

$$z \ (\text{or } t) = \frac{\text{Value} - EV}{SE} = \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{n}}$$

Thus, a confidence interval for a one-sample mean is $EV \pm z \cdot SE = \hat{\mu} \pm z \cdot \hat{\sigma}/\sqrt{n}$ (or similarly when using t instead of z).

Now we look at the difference of two sample means ($\bar{X}$, $\bar{Y}$) to try to find a confidence interval for the true difference in their population means. Provided that the sample sizes ($n_x$, $n_y$) are both large enough, the distribution of the mean difference will be approximately normal. Thus, the test statistic is similar,

$$z \ (\text{or } t) = \frac{\text{Value} - EV}{SE} = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{SE(\bar{X} - \bar{Y})}$$

but we need a new way to calculate the standard error.

Just as with two sample proportions, it is crucial that the samples are independent.

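  1. If the two population variances are equal (or their sample variances are approximately equal upon inspection), we can use a pooled estimate for this common variance:

$$SE(\bar{X} - \bar{Y}) = \sqrt{\hat{\sigma}_p^2 \left( \frac{1}{n_x} + \frac{1}{n_y} \right)}, \quad \text{where} \quad \hat{\sigma}_p^2 = \frac{(n_x - 1)\hat{\sigma}_x^2 + (n_y - 1)\hat{\sigma}_y^2}{n_x + n_y - 2}.$$

  2. However, if the two population variances are not equal, we use the following calculation:

$$SE(\bar{X} - \bar{Y}) = \sqrt{SE(\bar{X})^2 + SE(\bar{Y})^2} = \sqrt{\frac{\hat{\sigma}_x^2}{n_x} + \frac{\hat{\sigma}_y^2}{n_y}}.$$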

Therefore, a confidence interval for the difference between two population means is $EV \pm z \cdot SE = (\bar{x} - \bar{y}) \pm z \cdot SE(\bar{X} - \bar{Y})$, or similarly when using t instead of z.

Z vs. T

Which test statistic is used depends on the same set of three criteria as with the one-sample confidence intervals. For the t-distribution to be used, at least one of the samples must be relatively small, both samples should be roughly bell-shaped, and the population standard deviations must be unknown.

However, many statisticians will simply use the t-distribution in all situations because it is more robust and because, even for large samples, it behaves similarly to the normal distribution. If the population variances are assumed equal, then the degrees of freedom are $n_x + n_y - 2$. If the population variances are unequal, then the degrees of freedom are more complicated, but given in the text.

Example: A study published in the May 22, 2003 issue of the New England Journal of Medicine compared the Atkins Diet (a low-carbohydrate, high-protein and high-fat diet) to a conventional high-carbohydrate, low-fat and low-calorie diet.

The subjects were 63 obese men and women. 33 people were randomly assigned to the Atkins diet and 30 were randomly assigned to the conventional diet. (Assume the people are chosen from a population of thousands of people wanting to lose weight.) After 3 months the Atkins group lost an average of 6.8% of their body weight with an SD of 5%, while the low-calorie group lost an average of 2.7% of their body weight with an SD of 3.7%.

Construct a 95% confidence interval using both the normal distribution and the t-distribution.
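One way to set up this calculation from the summary statistics is sketched below. It makes two assumptions that the exercise leaves open: the normal-based interval uses the unequal-variance standard error, and the t-based interval uses the pooled standard error (written here with the usual n − 1 weights) with $n_x + n_y - 2$ degrees of freedom. Other pairings are possible.

```r
# Summary statistics from the diet example (percent of body weight lost)
xbar <- 6.8; sx <- 5.0; nx <- 33   # Atkins group
ybar <- 2.7; sy <- 3.7; ny <- 30   # conventional group
d <- xbar - ybar                   # observed difference in means

# Normal-based interval with the unequal-variance standard error
SE_unpooled <- sqrt(sx^2 / nx + sy^2 / ny)
ci_z <- d + c(-1, 1) * qnorm(0.975) * SE_unpooled

# t-based interval with the pooled standard error
# (assumes the two population variances are equal)
sp2 <- ((nx - 1) * sx^2 + (ny - 1) * sy^2) / (nx + ny - 2)
SE_pooled <- sqrt(sp2 * (1 / nx + 1 / ny))
ci_t <- d + c(-1, 1) * qt(0.975, df = nx + ny - 2) * SE_pooled

round(ci_z, 2)
round(ci_t, 2)
```

Both intervals lie entirely above zero, and the t-based interval is slightly wider than the normal-based one.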

Matched Pairs

In the analysis above, we compared the means of two independent samples. But what if the two samples have a relationship? What if each observation in sample X has some tie to an observation in sample Y? This dependent relationship may seem like a problem (since we so relish our independence), but often it is preferred.

Do not be confused. Independence within each sample is always crucial. But here we are referring to a dependence between two samples, also known as matched pairs. As we will see, we actually gain a lot of power by using matched pairs when available.

The statistical power to distinguish among group differences depends on the variability of the random variable used to assess these differences. This variability is related to the heterogeneity of experimental units. We can often increase precision and power by making comparisons between matched pairs of homogeneous (alike) experimental units.
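The gain in precision from matching can be illustrated with simulated data. In the sketch below every number is hypothetical: each pair shares a large unit-level effect (the heterogeneity of experimental units), and the second member of each pair is shifted by a true difference of 2. The same data are then analyzed with and without the pairing.

```r
set.seed(42)                              # arbitrary seed for repeatability
n <- 30                                   # hypothetical number of pairs
unit <- rnorm(n, mean = 100, sd = 15)     # unit-to-unit variation shared within a pair
x <- unit + rnorm(n, sd = 3)              # first member of each pair
y <- unit + 2 + rnorm(n, sd = 3)          # second member, true difference of 2

# Treating the samples as independent ignores the pairing
ci_indep <- t.test(y, x)$conf.int
# Using the pairing removes the shared unit-to-unit variation
ci_paired <- t.test(y, x, paired = TRUE)$conf.int

width_indep <- diff(ci_indep)
width_paired <- diff(ci_paired)
c(independent = width_indep, paired = width_paired)
```

With data generated this way the paired interval is much narrower, because the differences within pairs cancel the large variation between units.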

Examples in R

Example: (http://www.pollingreport.com/BushJob.htm) In two separate polls running from 10/20/06 to 10/22/06, Gallup reported a 37% approval rating while CNN reported 39% approval. The sample size of each is approximately 900. Construct a 90% confidence interval for the difference between these proportions.

Step-by-step...

> phat1 = .37; phat2 = .39
> n1 = 900; n2 = 900
> SE = sqrt(phat1*(1-phat1)/n1 + phat2*(1-phat2)/n2)
> z = qnorm(1-.10/2)
> c(phat1-phat2 - z*SE, phat1-phat2 + z*SE)

All at once...

> prop.test(c( , ), c( , ), conf.level= )

Note: As with the one-sample test, prop.test gives slightly more precise results.
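For the two-poll example, one plausible way to call `prop.test` is sketched here. It assumes the counts are reconstructed from the reported percentages and the approximate sample size of 900 per poll, so the counts themselves are approximations.

```r
n1 <- 900; n2 <- 900
x1 <- 0.37 * n1      # approximate number approving in the Gallup poll
x2 <- 0.39 * n2      # approximate number approving in the CNN poll

# Two-sample interval for the difference in proportions (first minus second)
result <- prop.test(c(x1, x2), c(n1, n2), conf.level = 0.90)
ci <- result$conf.int
ci
```

The first argument holds the counts of successes and the second the sample sizes, so the interval is for the Gallup proportion minus the CNN proportion.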

We’ve already seen the step-by-step method for finding confidence intervals for a population mean. Now here’s another way.

Example: The dataset iq (UsingR) contains 100 randomly simulated IQ scores. If we wanted to find a 95% confidence interval, we could use the Normal distribution by way of CLT, which would give us a 95% CI of (98.94, 103.76). Let’s just use the t-distribution, though. How does it compare to the CI using a z-statistic?

> t.test(iq)

The nice thing about t.test is that you don’t need to tell R the values for the mean, SD, or sample size. However, you must have a vector of data; the summary statistics alone are not enough. That is, we are unable to find a confidence interval for the true average age of university students given n = 16, $\hat{\mu} = 22.3$ years, and $\hat{\sigma} = 4.5$ years while using t.test.

But, what are two methods learned in Chapter 6 that can help us with this situation?

Now let’s look at creating confidence intervals for the difference of two population means.

Example: Look at the kid.weights (UsingR) dataset containing heights for both boys and girls. Are these samples independent?

We tend to see on average that males are taller than females. Construct a 95% confidence interval for the difference in mean height. Explain the confidence interval.

> attach(kid.weights)
> boys = height[gender=='M']
> girls = height[gender=='F']
> mean(boys); mean(girls)
> sd(boys); sd(girls)
> t.test(boys, girls, conf.level=.95)

Now construct a 95% confidence interval which includes all differences indicating boys are taller than girls. That is, find the interval which extends to positive infinity.

> t.test(boys, girls, conf.level=.95, alt='greater')

Example: Look at the twins (UsingR) dataset containing IQ scores for pairs of identical twins. Are these samples independent?

Construct a 99% confidence interval for the difference in mean IQ score. Explain the confidence interval.

> attach(twins)

There are two correct ways to do this with t.test.

This is the WRONG way!

> t.test(Foster, Biological, conf.level=.99)
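The call above treats the two columns as independent samples, which ignores the pairing of twins. The two correct approaches are not shown in this portion of the notes. Presumably they are the standard options in `t.test`: passing `paired = TRUE`, or running a one-sample test on the within-pair differences. The sketch below uses simulated stand-in vectors (`foster` and `biological` are hypothetical, not the UsingR data) so that it runs without the package.

```r
set.seed(7)                                  # arbitrary seed for repeatability
# Hypothetical stand-in for paired twin IQ scores: one entry per twin pair
base <- rnorm(27, mean = 95, sd = 15)        # shared component within each pair
foster <- base + rnorm(27, sd = 6)
biological <- base + rnorm(27, sd = 6)

# Option 1: tell t.test that the samples are paired
ci_paired <- t.test(foster, biological, paired = TRUE, conf.level = 0.99)$conf.int

# Option 2: one-sample t.test on the within-pair differences
ci_diff <- t.test(foster - biological, conf.level = 0.99)$conf.int

ci_paired
ci_diff
```

Both calls give the same interval, since a paired t procedure is a one-sample t procedure applied to the differences.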