These notes cover the construction of confidence intervals for a population mean, and for a difference of means, based on the sample mean and on sample mean differences in matched-pairs and two-sample studies. They also cover hypothesis tests for a population mean and for a difference of means using t and z procedures, and discuss the conditions for using t tests and t confidence intervals.

Typology: Study notes

Pre 2010


Lecture 26
Nancy Pfenning, Stats 1000

Chapter 12: More About Confidence Intervals

Recall: Setting up a confidence interval is one way to perform statistical inference: we use a statistic measured from the sample to construct an interval estimate for the unknown parameter for the population. We learned in Chapter 10 how to construct a confidence interval for unknown population proportion p based on sample proportion p̂, when there was a single categorical variable of interest, such as smoking or not. In this chapter, we will learn how to construct other confidence intervals:

• for population mean µ based on sample mean x̄, when there is one quantitative variable of interest;
• for population mean difference µd based on sample mean difference d̄ in a matched-pairs study, when the single set of (quantitative) differences d is the variable of interest;
• for the difference between population means µ1 − µ2 based on the difference between sample means x̄1 − x̄2 in a two-sample study.

The latter two situations involve one quantitative variable and an additional categorical variable with two possible values, although we may think of the distribution of differences in the matched-pairs study as a single quantitative variable.

Also discussed in the textbook, but not in our course, is the method of constructing a confidence interval for the difference between two population proportions p1 − p2 based on the difference between sample proportions p̂1 − p̂2. Because such situations involve two categorical variables, they can be handled instead with a chi-square procedure, which will be discussed further in Chapter 15.

The Empirical Rule for normal distributions allowed us to state that, in general, the probability is 95% that a normal variable falls within 2 standard deviations of its mean.
Since sample proportion p̂ for a large enough sample size n is approximately normal with mean p and standard deviation √(p(1−p)/n), we were able to construct an approximate 95% confidence interval for p: p̂ ± 2√(p̂(1−p̂)/n). In general, an approximate 95% confidence interval for a parameter is the accompanying statistic plus or minus two standard errors; this works well if the statistic's sampling distribution is approximately normal.

If we are interested in the unknown population mean µ when there is a single quantitative variable of interest, we use the fact (established in Chapter 9) that sample mean x̄ has mean µ and standard deviation σ/√n. For a large enough sample size n (say, n at least 30), population standard deviation σ will be fairly well approximated by sample standard deviation s, and so our standard error for x̄ is s.e.(x̄) = s/√n. Also for large n, by virtue of the Central Limit Theorem, the distribution of x̄ will be approximately normal, even if the underlying population variable X is not. Thus, for a large sample size n, the Empirical Rule tells us that an approximate 95% confidence interval for population mean µ is x̄ ± 2(s/√n).

Example
The mean number of credits taken by a sample of 81 statistics students was 15.60 and the standard deviation was 1.8. Construct an approximate 95% confidence interval for the mean number of credits taken by all statistics students; does this interval also have a 95% chance of capturing the mean number of credits taken by all students at the entire university?

x̄ ± 2(s/√n) = 15.60 ± 2(1.8/√81) = 15.60 ± .40 = (15.20, 16.00)

This interval applies to statistics students only. Especially because the intro stats courses are 4 credits each instead of the usual 3, these students may average slightly higher credit hours than students in general.

Recall: The Empirical Rule is only roughly accurate; besides, we sometimes may prefer a level of confidence other than .95.
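The arithmetic in the credits example above is easy to check with a few lines of Python; this is a minimal sketch using only the numbers given in the notes.

```python
import math

# Approximate 95% CI for a population mean: x̄ ± 2·s/√n
# (valid for large n by the Empirical Rule and the Central Limit Theorem).
def approx_95_ci(xbar, s, n):
    se = s / math.sqrt(n)   # standard error of the sample mean
    margin = 2 * se         # "plus or minus two standard errors"
    return xbar - margin, xbar + margin

# Credits example: n = 81 students, x̄ = 15.60, s = 1.8
lo, hi = approx_95_ci(15.60, 1.8, 81)
print(round(lo, 2), round(hi, 2))  # 15.2 16.0, matching (15.20, 16.00)
```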
More precise standard normal values for confidence levels .90, .95, .98, and .99 may be obtained from the "infinite" row at the bottom of Table A.2. [The row is called "infinite" because t∗ multipliers converge to z∗ for infinite sample sizes, the same as infinite degrees of freedom.] We can summarize the intervals as follows: for a large sample size n, an approximate

90% confidence interval for µ is x̄ ± 1.645(s/√n)
95% confidence interval for µ is x̄ ± 1.960(s/√n)
98% confidence interval for µ is x̄ ± 2.326(s/√n)
99% confidence interval for µ is x̄ ± 2.576(s/√n)

Example
The mean number of credits taken by a sample of 81 statistics students was 15.60 and the standard deviation was 1.8. Construct a more precise 95% confidence interval for the mean number of credits taken by all statistics students. Then construct a 90% confidence interval.

A 95% confidence interval is x̄ ± 1.96(s/√n) = 15.60 ± 1.96(1.8/√81) = 15.60 ± .39 = (15.21, 15.99)

A 90% confidence interval is x̄ ± 1.645(s/√n) = 15.60 ± 1.645(1.8/√81) = 15.60 ± .33 = (15.27, 15.93)

Note the trade-off: we obtain a narrower, more precise interval when we make do with a lower level of confidence.

Recall: we learned in Chapter 9 that not all standardized test statistics follow a standard normal (z) curve. In particular, when the sample size n is small, s may be quite different from σ, and the random variable (x̄ − µ)/(s/√n) follows a t distribution with n − 1 degrees of freedom, not a z distribution. Especially for small samples, t has more spread than the standard normal z. It is still symmetric about zero and bell-shaped like the z curve. Table A.2 provides t∗ multipliers for constructing 90%, 95%, 98%, or 99% confidence intervals for unknown population mean µ when the sample size is on the small side.

Example
Suppose a sample of only 9 statistics students averaged 15.60 credits, with standard deviation 1.8. Construct 95% and 90% confidence intervals for the mean number of credits taken by all statistics students.
A sample of size n = 9 has df = n − 1 = 9 − 1 = 8, and so we obtain the correct t∗ multipliers from the 8 df row of Table A.2:

A 95% confidence interval is x̄ ± 2.31(s/√n) = 15.60 ± 2.31(1.8/√9) = 15.60 ± 1.39 = (14.21, 16.99)

Example
Ten plants recorded the average weekly man-hours lost due to accidents before and after implementing a safety program:

Before  After  Difference
  45     36        9
  73     60       13
  46     44        2
 124    119        5
  33     35       −2
  57     51        6
  83     77        6
  34     29        5
  26     24        2
  17     11        6

d̄ = 5.2, sd = 4.1

First we can verify with a histogram that the data show no outliers or skewness, and are approximately normal. Next, we find that a 95% confidence interval for the population mean difference µd is d̄ ± t∗(sd/√n), where t∗ comes from the df = 9 row and the .95 confidence level column. Our confidence interval is

5.2 ± 2.262(4.1/√10) = 5.2 ± 2.9 = (2.3, 8.1)

We are 95% confident that the population mean difference in average weekly man-hours lost is between 2.3 and 8.1. (Implicit is the assumption that the plants constitute a random sample of all plants for which such a safety program is intended.) Since the interval is strictly to the right of zero, containing only positive numbers, it suggests that there was a real decrease in mean man-hours lost from before to after.

However, the study design is somewhat flawed because time could possibly be a confounding variable. Perhaps because of heightened awareness of safety issues (and increased fear of lawsuits), there was a general decrease in man-hours lost due to accidents during that time period, even in plants that did not implement the safety program. How could we control for this possible confounding variable? By comparing our ten plants to another sample of plants over the same time period which did not implement the safety program. Such a design, because it involves samples from two distinct populations, is called a two-sample design.

Comparing Two Means

We will use inference to compare the mean responses in two groups, each from a distinct population. This is called a two-sample situation, one of the most common settings in statistical applications.
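The matched-pairs interval above can be reproduced from the raw differences with the standard library alone; the t∗ multiplier 2.262 (df = 9, 95% confidence) is taken from Table A.2, as in the notes.

```python
import math
import statistics

# Before-minus-after differences in weekly man-hours lost (ten plants)
diffs = [9, 13, 2, 5, -2, 6, 6, 5, 2, 6]

n = len(diffs)
d_bar = statistics.mean(diffs)   # sample mean difference
s_d = statistics.stdev(diffs)    # sample standard deviation of the differences

t_star = 2.262                   # Table A.2, df = n - 1 = 9, 95% confidence
margin = t_star * s_d / math.sqrt(n)

lo, hi = d_bar - margin, d_bar + margin
print(f"d-bar = {d_bar:.1f}, s_d = {s_d:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
# d-bar = 5.2, s_d = 4.1, 95% CI = (2.3, 8.1)
```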
One example would be to compare mean IQ's of male and female seventh-graders, i.e., comparing results in an observational study. Another example would be to compare the change in blood pressure for two groups of black men, where one group has been given calcium supplements and the other a placebo, i.e., comparing results in an experiment. In general, a two-sample t procedure arises in situations where there is one quantitative variable of interest, plus a categorical variable which has two possible values. The variables in the first example are IQ and gender; in the second example they are blood pressure and whether the subject has been given calcium or a placebo. Responses in each group must be independent of those in the other; sample sizes may differ. The setting is not appropriate for matched pairs, which represent a single population.

The following notation is used to describe the two populations (parameters) and the results of two independent random samples (statistics):

Population  R.V.  mean  s.d.  sample size  sample mean  sample s.d.
    1        X1    µ1    σ1       n1           x̄1           s1
    2        X2    µ2    σ2       n2           x̄2           s2

Naturally enough, we estimate the parameter µ1 − µ2 with the statistic x̄1 − x̄2. As one would hope and expect, it turns out that the distribution of the R.V. x̄1 − x̄2 is centered at µ1 − µ2, providing an unbiased estimator. The spread of the distribution is not so intuitive; it can be shown that the standard error of x̄1 − x̄2 is

√(s1²/n1 + s2²/n2)

We use the above mean and standard error to standardize x̄1 − x̄2 to the two-sample t statistic

t = [(x̄1 − x̄2) − (µ1 − µ2)] / √(s1²/n1 + s2²/n2)

Although this R.V. does not have a t distribution per se, it can still be used with t∗ values in either of two ways:

Option 1: Approximate

df = (s1²/n1 + s2²/n2)² / [ (1/(n1 − 1))(s1²/n1)² + (1/(n2 − 1))(s2²/n2)² ]

and use the t table. [The computer takes this approach, but for obvious reasons we would rather not, if solving a two-sample problem by hand. Instead, we will use...]
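Option 1's degrees-of-freedom formula is tedious by hand but trivial in code. A sketch, using the IQ summary statistics from the example below (n1 = 47, s1 = 12; n2 = 31, s2 = 14):

```python
# Option 1 (Welch) approximate df for the two-sample t statistic:
# df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1-1) + (s2²/n2)²/(n2-1) ]
def approx_df(s1, n1, s2, n2):
    v1 = s1**2 / n1   # s1²/n1
    v2 = s2**2 / n2   # s2²/n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = approx_df(12, 47, 14, 31)
# The result always falls between the conservative choice min(n1, n2) - 1
# and the pooled choice n1 + n2 - 2 (here, between 30 and 76).
print(round(df, 1))
```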
Option 2 (conservative approach): use the smaller of n1 − 1 and n2 − 1 as our df in the t table.

An approximate confidence interval for µ1 − µ2 is

(x̄1 − x̄2) ± t∗ √(s1²/n1 + s2²/n2)

where t∗ uses the smaller of n1 − 1 and n2 − 1 as its df, and the desired confidence level dictates which column of Table A.2 to use. This interval should be fairly accurate as long as the sample sizes are large, or if small samples show no outliers or skewness.

Example
In random samples of 47 male and 31 female seventh-graders in a Midwest school district, IQ's were found to have the following means and standard deviations:

Group     n    x̄    s
Males    47   111   12
Females  31   106   14

1. What shapes are required of the underlying populations to justify use of two-sample t procedures? Any shapes should be acceptable, since the sample sizes of 47 and 31 are reasonably large.

2. Use a two-sample t procedure to give a 90% confidence interval for the difference between mean IQ's, males minus females. The 90% confidence interval for µ1 − µ2 is given by (x̄1 − x̄2) ± t∗ √(s1²/n1 + s2²/n2), where we take t∗ to be the value for the smaller of 47 − 1 and 31 − 1 df, and confidence level .90. We find the t∗ value for the 30 df row, .90 column to be 1.70, and our 90% confidence interval is

(111 − 106) ± 1.70 √(12²/47 + 14²/31) = 5 ± 5.2 = (−0.2, 10.2)

3. It is common for boys to score somewhat higher than girls on standardized tests. Does this seem to be the case for all seventh-grade boys and girls in this school district? The interval just barely contains zero, so it is difficult to be sure. Eventually we will learn to carry out a formal test of whether or not two means µ1 and µ2 are equal.

Pooled Two-Sample t Procedures

If the samples come from populations that have equal variances, we can use a pooled procedure. The test statistic can be shown to follow a genuine t distribution, with n1 + n2 − 2 df.
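The conservative (Option 2) interval for the IQ data can be computed directly; before rounding, the margin works out to about 5.2, so the interval just barely stretches below zero.

```python
import math

# Conservative two-sample CI: df = min(n1, n2) - 1, t* looked up in Table A.2
def two_sample_ci(x1, s1, n1, x2, s2, n2, t_star):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of x̄1 - x̄2
    diff = x1 - x2
    return diff - t_star * se, diff + t_star * se

# Males minus females; t* = 1.70 for 30 df, 90% confidence
lo, hi = two_sample_ci(111, 12, 47, 106, 14, 31, 1.70)
print(round(lo, 1), round(hi, 1))  # -0.2 10.2 — the interval barely contains 0
```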
This places us further down on the t table than taking the smaller of n1 − 1 and n2 − 1 as our df, resulting in slightly narrower confidence intervals. One criterion for use of a pooled procedure is to check that the sample standard deviations are close enough to suggest equal population standard deviations, and hence equal variances. We do this by verifying that the larger sample standard deviation is no more than twice the smaller.

Example
Looking at the sample standard deviations for IQ's, we note that 14 is not more than twice 12, so a pooled procedure seems appropriate.

There are actually much better criteria for use of a pooled procedure, which are outlined in your textbook. In any case, for our purposes in this course the non-pooled procedure will be considered adequate.

Example
In a previous Example, we explored the sampling distribution of sample mean height, when random samples are taken from a population of women whose mean height is claimed to be 64.5. We noted the sample mean height of surveyed Stats female students, and calculated by hand the probability of observing such a high sample mean if the population mean were really only 64.5. We used this probability to decide whether we were willing to believe that the population mean was in fact 64.5, or if the population of female Stats students is actually taller, on average. For this Example, we address the same question by using MINITAB to set up a confidence interval for the unknown population mean height, given that the population standard deviation is 2.5 (thus, a z procedure is used). When a one-sided alternative is not specified, the confidence interval just barely contains 64.5 (it goes down to 64.491), and so we can't quite produce evidence that the population mean height of females differs from 64.5. If a greater-than alternative is specified, then our lower bound is 64.538, which would suggest the population mean height is higher than 64.5. If the standard deviation of 2.5 were not given, we would carry out a t procedure.
Again, the confidence interval just barely contains 64.5 with a two-sided alternative, and just barely misses it with a one-sided alternative. Considering the 95% confidence interval will give results that match up neatly with those of a hypothesis test at the 5% level only in the case of a two-sided alternative.

One-Sample Z: HT_female
Test of mu = 64.5 vs mu not = 64.5
The assumed sigma = 2.5

Variable     N    Mean    StDev   SE Mean
HT_female  281  64.783    2.637     0.149

Variable   95.0% CI             Z      P
HT_female  (64.491, 65.075)  1.90  0.058

Example
I had been going under the assumption that my students averaged 15 credits in a semester, but then I thought that because mine is a 4-credit course, their mean may actually be higher than 15. The mean number of credits taken by a sample of 81 statistics students was 15.6 and the standard deviation was 1.8. Does this provide evidence that statistics students overall average more than 15 credits?

1. H0: µ = 15 vs. Ha: µ > 15
2. Since n = 81 is large, a non-normal shape would not be a problem; calculate t = (15.6 − 15)/(1.8/√81) = 3
3. P-value = P(T ≥ 3)
4. For 80 df, 3 is greater than 2.64, so the P-value is less than .005: results are statistically significant, and we reject H0.
5. Overall, statistics students average more than 15 credits.

Example
Suppose a sample of 36 statistics students had been taken. How many df should we use from Table A.2? Note that the table does not include exactly 35 df, so we must choose between 30 and 40. Always choose the smaller df, because this makes it slightly more difficult to reject the null hypothesis, which is the safer approach to take. Thus, we would carry out the test using the 30 df row of Table A.2.

Conditions for Using the t Test

Just as with confidence intervals, P-value ranges obtained by comparing the t statistic to t∗ values in Table A.2 will produce accurate results if the sample size is large, or if the sample size is small but the data show no outliers or pronounced skewness.
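The one-sample t test from the credits example above reduces to a line of arithmetic; a sketch (the 2.64 cutoff is the 80 df, .005-column entry from Table A.2, as used in the notes):

```python
import math

# One-sample t statistic for H0: µ = 15 vs. Ha: µ > 15
xbar, mu0, s, n = 15.6, 15, 1.8, 81
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))   # 3.0

# Table comparison: for 80 df, t = 3 exceeds the .005-column value 2.64,
# so the one-sided P-value is below .005 and we reject H0 at α = .05.
reject = t > 2.64
print(reject)        # True
```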
The table will not produce accurate results if the sample size is small and the data show outliers or skewness.

Matched Pairs Hypothesis Tests

When the mean of a quantitative variable is explored via a matched-pairs design, hypothesis tests are carried out on the population mean difference µd based on the sample mean difference d̄.

Example
To test whether students' mothers tend to be younger than their fathers, I looked at the difference, mother's age minus father's age, for a sample of 12 students. This difference had mean d̄ = −1.5 and standard deviation sd = 3. Is the mean difference significantly less than zero, using α = .05 as the cut-off probability?

To test H0: µd = 0 vs. Ha: µd < 0, we check that the distribution seems fairly symmetric and outlier-free (it is) and calculate t = (−1.5 − 0)/(3/√12) = −1.73. Because the alternative has the "<" sign, our P-value is P(T ≤ −1.73). According to the table, the probability of a T random variable with 11 df being greater than 1.80 is .05; likewise, the probability of being less than −1.80 is also .05. The test statistic −1.73 isn't as far out on the tail of the t curve as −1.80, so its tail probability is more than .05. We do not have evidence to reject the null hypothesis at α = .05. A sample of 12 age differences was not enough to convince us that mothers tend to be younger. [In fact, another much larger sample was taken, producing a much smaller P-value, and this sample did provide evidence that the mean age difference is negative.]

Hypothesis Tests About the Difference Between Two Means

We can test for equality of the mean responses in two groups, each from a distinct population. This is called a two-sample situation, one of the most common settings in statistical applications. One example would be to compare mean IQ's of male and female seventh-graders, i.e., comparing results in an observational study.
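The age-difference test above can be sketched the same way; the ±1.80 critical value is the 11 df, one-tail .05 entry from Table A.2 used in the notes.

```python
import math

# Matched-pairs t test: H0: µ_d = 0 vs. Ha: µ_d < 0
d_bar, s_d, n = -1.5, 3.0, 12
t = d_bar / (s_d / math.sqrt(n))
print(round(t, 2))       # -1.73

# One-sided critical value for df = 11 at α = .05 is -1.80;
# t = -1.73 is not that far out in the left tail, so we fail to reject H0.
reject = t <= -1.80
print(reject)            # False
```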
Another example would be to compare the change in blood pressure for two groups of black men, where one group has been given calcium supplements and the other a placebo, i.e., comparing results in an experiment. As with confidence intervals, we use the following notation:

Population  R.V.  mean  s.d.  sample size  sample mean  sample s.d.
    1        X1    µ1    σ1       n1           x̄1           s1
    2        X2    µ2    σ2       n2           x̄2           s2

The null hypothesis is H0: µ1 = µ2 [the same as H0: µ1 − µ2 = 0], and the alternative substitutes the appropriate inequality for "=". We carry out our test using the two-sample t statistic

t = [(x̄1 − x̄2) − (µ1 − µ2)] / √(s1²/n1 + s2²/n2)

and use the smaller of n1 − 1 and n2 − 1 as our df in the t table. The approximate P-value is found in the usual way from Table A.2.

Example
In random samples of 47 male and 31 female seventh-graders in a Midwest school district, IQ's were found to have the following means and standard deviations:

Group     n    x̄    s
Males    47   111   12
Females  31   106   14

Is the mean male IQ significantly higher than that for the females? Test at level α = .05. We will test H0: µ1 − µ2 = 0 vs. Ha: µ1 − µ2 > 0. Our two-sample t statistic is

t = [(111 − 106) − 0] / √(12²/47 + 14²/31) = 1.63

For 30 df, 1.63 is less than 1.70, so our (one-sided) P-value is greater than .05. There is not quite enough evidence to reject H0 at the .05 level; the population of boys doesn't necessarily average higher than the population of girls in this district.

Example
Suppose the FAA weighed a random sample of 25 airline passengers during the summer and found their weights to have mean 180 and standard deviation 40. Are airline passengers necessarily heavier now than they were in 1995, when the mean weight for 16 passengers was 160, with standard deviation 30? Answer this question two ways: first by looking at a 90% confidence interval for the difference in mean weights, then by testing at the .05 level whether the mean weight increased.
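The IQ test statistic above can be confirmed with the same standard-error arithmetic used for the confidence interval:

```python
import math

# Two-sample t statistic for H0: µ1 = µ2 vs. Ha: µ1 - µ2 > 0
def two_sample_t(x1, s1, n1, x2, s2, n2):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of x̄1 - x̄2
    return (x1 - x2) / se

t = two_sample_t(111, 12, 47, 106, 14, 31)
print(round(t, 2))   # 1.63

# Conservative df = min(47, 31) - 1 = 30; the one-tail .05 critical value
# there is 1.70, so t = 1.63 falls short and the P-value exceeds .05.
print(t < 1.70)      # True
```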
(Note that since the sample sizes aren't especially large, we should first check that the weight distributions do not show obvious outliers or skewness.) We have n1 = 16, n2 = 25; x̄1 = 160, x̄2 = 180; s1 = 30, s2 = 40.

1. We use the t∗ multiplier for the df = 16 − 1 = 15 row and the column for confidence level .90. Our 90% confidence interval for µ1 − µ2 is

(160 − 180) ± 1.75 √(30²/16 + 40²/25) = −20 ± 19 = (−39, −1)

The interval contains only negative numbers, and suggests a significant increase in mean weight from 1995 to 2002.

2. We test H0: µ1 − µ2 = 0 vs. Ha: µ1 − µ2 < 0, about population mean weight in 1995 minus population mean weight in 2002. The test statistic is

t = (160 − 180) / √(30²/16 + 40²/25) = −1.82

For 15 df, 1.82 is between 1.75 and 2.13, so the (one-sided) P-value is between .05 and .025. We reject H0 at the .05 level, and conclude that mean weight has increased significantly. The FAA reached this conclusion in the spring of 2003, and made new restrictions on the number of passengers aboard smaller planes, based on the fact that people are heavier than they used to be.

Multiple Hypothesis Tests

Example
Verbal SATs have mean 500. An education expert samples verbal SAT scores of 20 students each in 100 schools across the state, and finds that in 4 of those schools the sample mean verbal SAT is significantly lower than 500, using α = .05. Are these schools necessarily inferior, in that their students do significantly worse on the verbal SATs? No. First note that 20 indicates the sample size here, and 100 is the number of tests; in other words, we test H0: µ = 500 vs. Ha: µ < 500 over and over, one hundred times. Remember that if α = .05 is used as a cutoff, then 5% of the time in the long run we will reject H0 even when it is true. Roughly, 5 schools in 100 will produce samples of students with verbal SATs low enough to reject H0, just by chance in the selection process, even if the mean for all students at those schools is in fact 500.
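Both parts of the FAA example share one standard-error computation; a sketch (t∗ = 1.75 and the critical-value bounds are the 15 df entries from Table A.2, as in the notes):

```python
import math

# 1995 sample (group 1) vs. current sample (group 2)
n1, x1, s1 = 16, 160, 30
n2, x2, s2 = 25, 180, 40

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of x̄1 - x̄2

# 90% CI for µ1 - µ2, conservative df = min(16, 25) - 1 = 15, t* = 1.75
lo, hi = (x1 - x2) - 1.75 * se, (x1 - x2) + 1.75 * se
print(round(lo), round(hi))               # -39 -1: only negative values

# Test H0: µ1 - µ2 = 0 vs. Ha: µ1 - µ2 < 0
t = (x1 - x2) / se
print(round(t, 2))                        # -1.82
# For 15 df, |t| = 1.82 lies between 1.75 and 2.13, so the one-sided
# P-value is between .05 and .025: reject H0 at the .05 level.
```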
Example
Kanarek and others studied the relationship between cancer rates and levels of asbestos in the drinking water. After adjusting for age and various demographic variables, but not smoking, they found a "strong relationship" between the rate of lung cancer among white males and the concentration of asbestos fibers in the drinking water: P-value < .001. [An increase of 100 times the asbestos concentration results in an increase of 1.05 per 1000 in the lung cancer rate: one additional lung cancer case per year for every 20,000 people.] The investigators tested over 200 relationships... the P-value for lung cancer in white males was by far the smallest one they got. Does asbestos in the drinking water cause lung cancer in white males? No! When they test hundreds of relationships, sooner or later, by chance alone, some will end up looking significant. [There are other problems with this study: failing to control for the possible confounding variable of smoking, and calling a relationship "strong" even though it would imply just one additional case of lung cancer for every 20,000 white males.]

Example
A researcher of ESP tests 500 subjects. Four of them do significantly better (each P-value < .01) than random guessing. Should the researcher conclude that those four have ESP? No! In so many trials, even if each subject is just guessing, chances are that a few of the 500 will do significantly better than guessing (and a few will do significantly worse!). The researcher should proceed with further testing of those four subjects.

In general, we should be aware that many tests run at once will probably produce some significant results by chance alone, even if none of the null hypotheses are false.
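The "roughly 5 in 100" point is easy to demonstrate by simulation. The sketch below uses hypothetical data: every school's 20 scores are drawn from the same N(500, 100) null distribution, so every rejection is a false alarm; −1.729 is the one-tail .05 critical value for 19 df.

```python
import math
import random
import statistics

random.seed(1)  # reproducible illustration

rejections = 0
for school in range(100):
    # 20 scores per school, drawn from the null: mean 500, sd 100
    scores = [random.gauss(500, 100) for _ in range(20)]
    xbar = statistics.mean(scores)
    s = statistics.stdev(scores)
    t = (xbar - 500) / (s / math.sqrt(20))
    if t <= -1.729:          # one-sided test of Ha: µ < 500 at α = .05, df = 19
        rejections += 1

# About 5 of the 100 true null hypotheses get rejected in the long run,
# even though no school is actually below average.
print(rejections)
```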