






This document introduces confidence intervals in statistics, focusing on the estimation of population proportions and means. It covers the use of the normal and Student's t-distributions to calculate confidence intervals for proportions and means, respectively. It also discusses confidence intervals for differences in proportions and means, as well as the importance of sample size and independence in the calculations.







In statistics one would often like to estimate unknown parameters for a known distribution. For example, you may think that your parent population is normal, but the mean is unknown, or both the mean and standard deviation are unknown. From a data set you cannot hope to know the exact values of the parameters, but the data should give you a good idea of what they are. For the mean, we expect that the sample mean, or average of our data, will be a good estimate of the population mean, and intuitively we understand that the more data we have, the better this estimate should be. How do we quantify this? That is, how good is our sample estimate at guessing the true, unknown population parameter?
Statistical theory is based on knowing the sampling distribution of some statistic, such as the mean. Since we can never identify the true value of a parameter, this lets us make probability statements about it instead. Such statements typically take the form of a confidence interval, indicating, for example, that we are 95 percent confident the parameter lies in some range of values.
Proportions
The most widely seen use of confidence intervals is the estimation of a population proportion through surveys or polls. For example, suppose it is reported that 100 people were surveyed and 42 of them liked brand X. How do you see this in the media?
Why all the different answers? Well, the idea that we can infer anything about the population based on a survey of just 100 people is founded on probability theory. If the sample is a random sample, then we know the sampling distribution of $\hat{p}$, the sample proportion. It is approximately normal. (How do we know that?)
Example: CNN/Time Poll June 14-15, 2000 of n = 1,218 adults nationwide who were asked, "Do you like to watch reality-based television programs, or don't you?" Results: Yes 43%, No 53%, Not sure 4%
Estimate the percent of all US adults who like to watch reality based TV and find the SE for this estimate.
In the above example we got a sample percentage of 43% enjoying watching reality-based TV. The SE was estimated to be 1.4%.
A 95% confidence interval means that we can be about 95% “confident” that the population percentage lies in the interval 43% ± (2 * 1.4%). In other words the interval (40.2%, 45.8%) is a 95% confidence interval for the population percentage.
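As a quick check of the numbers above, the standard error and the interval can be computed directly in R. This is a minimal sketch: the variable names are illustrative, and it uses the rough multiplier 2 from the text rather than an exact normal quantile.

```r
# CNN/Time poll: 43% of n = 1,218 adults answered "Yes"
phat <- 0.43
n <- 1218

# Estimated standard error of the sample proportion
SE <- sqrt(phat * (1 - phat) / n)

# Approximate 95% confidence interval using the "plus or minus 2 SE" rule
ci <- c(lower = phat - 2 * SE, upper = phat + 2 * SE)

round(SE, 3)   # about 0.014, i.e. 1.4%
round(ci, 3)   # about 0.402 and 0.458
```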
This does NOT mean that either of the following is true:
1. There is a 95% chance that the population percentage is in the interval (40.2%, 45.8%). This statement doesn’t really make sense because a 95% chance means that something will happen 95% of the time. But the population percentage is not changing. It doesn't happen 95% of the time; it is always the same, and we just don't know it.
2. There is a 95% chance that the sample percentage is in the interval (40.2%, 45.8%). This statement also doesn’t make sense. We know that our particular sample has exactly 43% who like to watch reality-based TV. Again, there is no chance associated with that, since we’re 100% certain it’s in that interval.
The 95% chance refers to the sampling process: each time we take a sample we get a slightly different sample percentage. So, if we took 100 samples and calculated “sample % ± 2*SE” for each, we would expect about 95 of the resulting intervals to cover the true population percentage.
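The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical "true" proportion of 0.43 (in practice this value is unknown) and counts how often the interval "sample % ± 2*SE" covers it; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)            # arbitrary seed, for reproducibility

p_true <- 0.43         # hypothetical true population proportion
n <- 1218              # sample size, as in the poll above
n_samples <- 1000      # number of repeated samples

# Draw many samples and record each sample proportion
phat <- rbinom(n_samples, size = n, prob = p_true) / n
SE <- sqrt(phat * (1 - phat) / n)

# Does "sample % +/- 2*SE" cover the true proportion?
covered <- (phat - 2 * SE <= p_true) & (p_true <= phat + 2 * SE)

mean(covered)          # fraction of intervals covering p_true; should be near 0.95
```

Each individual interval either covers the true proportion or it does not; the 95% describes the long-run fraction of intervals that do.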
When dealing with a 95% confidence interval for proportions, the “± 2*SE” is commonly called the margin of error or the sampling error in polls.
In general, the interval $\hat{p} \pm z \sqrt{\hat{p}(1 - \hat{p})/n}$ is a confidence interval for the population percentage found by using the Normal distribution.
Example: In a July 3, 2003 Gallup poll 1,006 adults nationwide were asked the following: “Would you favor holding the fast food industry legally responsible for the diet-related health problems of people who eat fast food on a regular basis?” 10% answered ‘YES’.
Find a 95% confidence interval for the percent of all US adults who believe the fast food industry should be held legally responsible for such health problems.
T-Test for Means
Statistical analysis must sometimes be based on less than ideal circumstances, as is the case with a small sample. Under certain conditions, we may use the Student’s t-distribution in place of the Normal distribution for computing the confidence interval of a population mean.
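Use the t-statistic and the Student's Curve (t-Distribution) when all 3 of these conditions are met:
1. The sample is relatively small.
2. The data are roughly bell-shaped.
3. The population standard deviation is unknown.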
Use of the t-statistic is very similar to the z-statistic. In general, the interval $\text{EV} \pm t \cdot \text{SE} = \hat{\mu} \pm t \cdot \hat{\sigma}/\sqrt{n}$ is a confidence interval for the population mean found by using the Student’s t-distribution.
However, sample size is even more important with the Student’s t. When looking up an area under the t-distribution, we must also know the degrees of freedom, which equal n – 1. Note that the t curves have fatter tails than the normal curve, which allows for more potential outliers.
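To see how the degrees of freedom matter, one can compare the 95% critical values of the t-distribution with the normal value. This is a small illustrative sketch; the sample sizes are arbitrary.

```r
conf <- 0.95
alpha <- 1 - conf

# Normal critical value
z <- qnorm(1 - alpha / 2)

# t critical values for several sample sizes (df = n - 1)
n <- c(5, 16, 30, 100, 1000)
t_crit <- qt(1 - alpha / 2, df = n - 1)

data.frame(n = n, df = n - 1, t = round(t_crit, 3), z = round(z, 3))
```

The t critical value is always larger than the normal one, so t-based intervals are wider, and the gap shrinks as the sample size grows.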
Example: A simple random sample of 16 is taken from a population of 30,000 university students. These students have an average age of $\hat{\mu}$ = 22.3 years with an SD of $\hat{\sigma}$ = 4.5 years, and they appear to be normally distributed.
Examples in R
Example: In a July 3, 2003 Gallup poll 1,006 adults nationwide were asked the following: “Would you favor holding the fast food industry legally responsible for the diet-related health problems of people who eat fast food on a regular basis?” 10% answered ‘YES’.
Find a 95% confidence interval for the percent of all US adults who believe the fast food industry should be held legally responsible for such health problems.
> phat = .10; n = 1006
> SE = sqrt(phat*(1-phat)/n)    # estimated standard error
> z = qnorm(.975)               # 95% critical value, about 1.96
> c(phat - z*SE, phat + z*SE)
> prop.test(.1*1006, 1006, conf.level=.95)
Note: The results of prop.test will differ slightly from the results of our “by hand” method using SE. The by-hand interval plugs the sample proportion into the standard error, whereas prop.test builds its interval from the standard deviation under each candidate value of the true proportion (a score interval, with a continuity correction by default). This is more complicated algebraically, but more accurate, because the central limit theorem approximation for the binomial is better for this expression.
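To make the difference concrete, the sketch below computes the by-hand interval and the Wilson score interval explicitly, then compares them with prop.test. It assumes the continuity correction is switched off (correct = FALSE) so that prop.test matches the plain score formula; with the default correct = TRUE the interval is slightly wider. Variable names are illustrative.

```r
phat <- 0.10; n <- 1006
x <- phat * n                      # number of "yes" answers implied by the poll
z <- qnorm(0.975)                  # 95% critical value

# "By hand" interval: phat +/- z * SE
wald <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)

# Wilson score interval, written out explicitly
center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- c(center - half, center + half)

# prop.test without the continuity correction should agree with 'wilson'
pt <- prop.test(x, n, conf.level = 0.95, correct = FALSE)$conf.int

rbind(wald = wald, wilson = wilson, prop.test = as.numeric(pt))
```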
Example: A simple random sample of 900 is taken from a population of 30,000 university students. The average age in the sample is $\hat{\mu}$ = 22.3 years with an SD of $\hat{\sigma}$ = 4.5 years.
Find a 95% confidence interval.
> xbar = 22.3; sd = 4.5; n = 900
> SE = sd/sqrt(n)
> z = qnorm(.975)               # 95% critical value, about 1.96
> c(xbar - z*SE, xbar + z*SE)
Find an 80% confidence interval.
> alpha = .20                   # 80% confidence leaves 20% in the tails
> z = qnorm(1-alpha/2)
> c(xbar - z*SE, xbar + z*SE)
Example: A simple random sample of 16 is taken from a population of 30,000 university students. Their average age is $\hat{\mu}$ = 22.3 years with an SD of $\hat{\sigma}$ = 4.5 years, and they appear to be normally distributed. Find a 95% confidence interval for the average age.
> xbar = 22.3; sd = 4.5; n = 16
> SE = sd/sqrt(n)
> alpha = .05
> t = qt(1-alpha/2, df=16-1)    # t critical value with n - 1 degrees of freedom
> c(xbar - t*SE, xbar + t*SE)
Differences of Independent Means
Recall that for one sample, our method for converting a mean to standard units and finding a test statistic is
$$z \ (\text{or } t) = \frac{\text{Value} - \text{EV}}{\text{SE}} = \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{n}}$$
Thus, a confidence interval for a one-sample mean is $\text{EV} \pm z \cdot \text{SE} = \hat{\mu} \pm z \cdot \hat{\sigma}/\sqrt{n}$ (or similarly when using t instead of z).
Now we look at the difference of two sample means $(\bar{X}, \bar{Y})$ to try to find a confidence interval for the true difference in their population means. Provided that the sample sizes $(n_x, n_y)$ are both large enough, the distribution of the difference in sample means will be approximately normal. Thus, the test statistic is similar,
$$z \ (\text{or } t) = \frac{\text{Value} - \text{EV}}{\text{SE}} = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\text{SE}(\bar{X} - \bar{Y})}$$
but we need a new way to calculate standard error.
Just as with two sample proportions, it is crucial that the samples are independent.
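If the population variances are assumed equal, the standard error uses a pooled estimate of the standard deviation:
$$\text{SE}(\bar{X} - \bar{Y}) = \hat{\sigma}_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}, \qquad \hat{\sigma}_p^2 = \frac{(n_x - 1)\hat{\sigma}_x^2 + (n_y - 1)\hat{\sigma}_y^2}{n_x + n_y - 2}.$$
Otherwise, the squared standard errors of the two sample means add:
$$\text{SE}(\bar{X} - \bar{Y}) = \sqrt{\text{SE}(\bar{X})^2 + \text{SE}(\bar{Y})^2} = \sqrt{\frac{\hat{\sigma}_x^2}{n_x} + \frac{\hat{\sigma}_y^2}{n_y}}.$$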
Therefore, a confidence interval for the difference between two population means is
$$\text{EV} \pm z \cdot \text{SE} = (\bar{x} - \bar{y}) \pm z \cdot \text{SE}(\bar{X} - \bar{Y}),$$
or similarly when using t instead of z.
Z vs. T
Which test statistic is used depends on the same set of three criteria as with the one-sample confidence intervals. For the t-distribution to be used, at least one of the samples must be relatively small, both samples should be roughly bell-shaped, and the population standard deviations must be unknown.
However, many statisticians will simply use the t-distribution in all situations because it is more robust and because, even for large samples, it behaves similarly to the normal distribution. If the population variances are assumed equal, then the degrees of freedom are nx + ny – 2. If the population variances are unequal, then the degrees of freedom are more complicated, but are given in the text.
Example: A study published in the May 22, 2003 issue of the New England Journal of Medicine compared the Atkins Diet (a low-carbohydrate, high-protein and high-fat diet) to a conventional high-carbohydrate, low-fat and low-calorie diet.
The subjects were 63 obese men and women. 33 people were randomly assigned to the Atkins diet and 30 were randomly assigned to the conventional diet. (Assume the people are chosen from a population of thousands of people wanting to lose weight.) After 3 months the Atkins group lost an average of 6.8% of their body weight with a SD of 5% while the low-calorie group lost an average of 2.7% of their body weight with a SD of 3.7%.
Construct a 95% confidence interval using both the normal distribution and the t -distribution.
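One way to carry out this calculation from the summary statistics alone is sketched below. The normal-based interval follows the formulas above. For the t-based interval, the sketch assumes unequal variances and uses the Welch-Satterthwaite approximation for the degrees of freedom; that choice is an assumption here, since the handout defers the unequal-variance formula to the text. Variable names are illustrative.

```r
# Summary statistics from the study (percent of body weight lost)
xbar <- 6.8; sx <- 5.0; nx <- 33   # Atkins diet
ybar <- 2.7; sy <- 3.7; ny <- 30   # conventional diet

d  <- xbar - ybar                  # observed difference in means
SE <- sqrt(sx^2 / nx + sy^2 / ny)  # SE of the difference (unequal variances)

# Normal-based 95% interval
z <- qnorm(0.975)
ci_z <- d + c(-1, 1) * z * SE

# t-based 95% interval with Welch-Satterthwaite degrees of freedom (assumed)
df <- SE^4 / ((sx^2 / nx)^2 / (nx - 1) + (sy^2 / ny)^2 / (ny - 1))
tcrit <- qt(0.975, df = df)
ci_t <- d + c(-1, 1) * tcrit * SE

round(rbind(normal = ci_z, t = ci_t), 2)
```

With samples of this size the two intervals are close, and the t interval is slightly wider.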
Matched Pairs
In the analysis above, we compared the means of two independent samples. But what if the two samples have a relationship? What if each observation in sample X has some tie to an observation in sample Y? This dependent relationship may seem like a problem (since we so relish our independence), but often it is preferred.
Do not be confused. Independence within each sample is always crucial. But here we are referring to a dependence between two samples, also known as matched pairs. As we will see, we actually gain a lot of power by using matched pairs when available.
The statistical power to distinguish among group differences depends on the variability of the random variable used to assess these differences. This variability is related to the heterogeneity of experimental units. We can often increase precision and power by making comparisons between matched pairs of homogeneous (alike) experimental units.
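The gain in precision from matching can be illustrated with simulated data. In the sketch below every number is hypothetical: each experimental unit has its own baseline (large unit-to-unit variability), and two measurements are taken on each unit. A matched-pairs interval is simply a one-sample t interval on the within-pair differences.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility

n <- 20
unit_effect <- rnorm(n, mean = 100, sd = 15)   # variation between experimental units
x <- unit_effect + rnorm(n, sd = 3)            # first measurement on each unit
y <- unit_effect + 2 + rnorm(n, sd = 3)        # second measurement, true shift of 2

# Treating the samples as independent ignores the pairing
ci_indep <- t.test(y, x)$conf.int

# Matched pairs: a one-sample t interval on the within-pair differences
ci_paired <- t.test(y - x)$conf.int

width <- c(independent = diff(ci_indep), paired = diff(ci_paired))
width
```

Because the differences cancel the unit-to-unit variability, the paired interval is much narrower than the one that treats the samples as independent.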
Examples in R
Example: (http://www.pollingreport.com/BushJob.htm) In two separate polls running from 10/20/06 to 10/22/06, Gallup reported a 37% approval rating while CNN reported 39% approval. The sample size of each is approximately 900. Construct a 90% confidence interval for the difference between these proportions.
> phat1 = .37; phat2 = .39
> n1 = 900; n2 = 900
> SE = sqrt(phat1*(1-phat1)/n1 + phat2*(1-phat2)/n2)
> z = qnorm(1-.10/2)            # 90% critical value
> c(phat1-phat2 - z*SE, phat1-phat2 + z*SE)
> prop.test(c( , ), c( , ), conf.level= )
Note: As with the one-sample test, prop.test gives slightly more precise results.
We’ve already seen the step-by-step method for finding confidence intervals for a population mean. Now here’s another way.
Example: The dataset iq (UsingR) contains 100 randomly simulated IQ scores. If we wanted to find a 95% confidence interval, we could use the Normal distribution by way of CLT which would give us a 95% CI of (98.94, 103.76). Let’s just use the t -distribution, though. How does it compare to the CI using a z -statistic?
> t.test(iq)
The nice thing about t.test is that you don’t need to tell R the values for mean, SD, or sample size. However, you must have a vector of data; summary statistics alone are not enough. That is, we are unable to use t.test to find a confidence interval for the true average age of university students given only n = 16, $\hat{\mu}$ = 22.3 years, and $\hat{\sigma}$ = 4.5 years.
But, what are two methods learned in Chapter 6 that can help us with this situation?
Now let’s look at creating confidence intervals for the difference of two population means.
Example: Look at the kid.weights (UsingR) dataset containing heights for both boys and girls. Are these samples independent?
We tend to see on average that males are taller than females. Construct a 95% confidence interval for the difference in mean height. Explain the confidence interval.
> attach(kid.weights)
> boys = height[gender=='M']
> girls = height[gender=='F']
> mean(boys); mean(girls)
> sd(boys); sd(girls)
> t.test(boys, girls, conf.level=.95)
Now construct a 95% confidence interval which includes all differences indicating boys are taller than girls. That is, find the interval which extends to positive infinity.
> t.test(boys, girls, conf.level=.95, alt='greater')
Example: Look at the twins (UsingR) dataset containing IQ scores for pairs of identical twins. Are these samples independent?
Construct a 99% confidence interval for the difference in mean IQ score. Explain the confidence interval.
> attach(twins)
> t.test(Foster, Biological, conf.level=.99)