Confidence Intervals and Hypothesis Testing for Population Means and Differences between Population Means

In this course, we are interested in determining the differences between population means of dependent variable values associated with different factor levels. We would also like to be able to give a meaningful estimate for these means. Furthermore, we are working under the assumption that the errors in our models are normally distributed with a common variance (that has to be estimated from the data). This means that, when calculating confidence intervals or performing hypothesis tests, the appropriate test statistic will be a t-statistic with some df = degrees of freedom. Let's review, then, how to compute confidence intervals and how to perform hypothesis tests in this specialized setting.

Confidence Intervals: For a single population mean of a normally distributed random variable with unknown standard deviation \(\sigma\).

Illustrative Example: Suppose that the price of regular gasoline in a given area in a given week is normally distributed and we wish to estimate the mean price of a gallon of gasoline in this region for the first week of August. If we randomly select ten (10) gas stations in the area and determine what they are charging for a gallon of regular gas, then the average of those ten prices should give us an idea of the overall average price of a gallon of regular in the area for the week. However, we know that if we take another random sample of ten stations and calculate the average price for them, it may very well differ from the average determined by our first sample.
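To see how two samples of ten stations can give different averages, here is a minimal simulation sketch. The true mean of $1.80 and standard deviation of $0.20 are made-up values chosen only for illustration, and the function name sample_mean_price is hypothetical; none of these come from real data.

```python
import random
import statistics


def sample_mean_price(rng, n=10, true_mean=1.80, sigma=0.20):
    """Average price over n simulated stations.

    true_mean and sigma are hypothetical values used only for illustration.
    """
    prices = [rng.gauss(true_mean, sigma) for _ in range(n)]
    return statistics.mean(prices)


rng = random.Random(0)  # fixed seed so the run is repeatable
first = sample_mean_price(rng)   # average of the first ten stations
second = sample_mean_price(rng)  # average of a fresh sample of ten stations

print(f"first sample mean:  {first:.3f}")
print(f"second sample mean: {second:.3f}")
```

The two printed averages will generally not agree, even though both samples come from the same simulated population.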
This means that the average price of gas in the region during the first week of August for random samples of ten (10) stations is actually a random variable and not something fixed (just as the price at an individual station is a random variable, which we have assumed is normally distributed). At this point, we might want to know, just out of curiosity, what the distribution is of the random variable of average gas prices for ten randomly selected gas stations from our area during the first week of August (quite a mouthful!).

Let's put curiosity aside for the moment, back up, and suppose that instead we want to estimate the mean price of regular gas by just picking one station at random and using its price in some way. For example, suppose George's BP charges $1.81 (we're rounding up). Now, we're pretty certain that $1.81 is not the true average for the entire region. It could be, but it isn't too likely. Therefore, the statement that the mean price, \(\mu\), equals $1.81 is almost certainly false. In order to have a reasonable chance of saying anything true, we could soften our statement to, say, the mean is "nearly" or "approximately" $1.81. However, this makes our statement somewhat vague because we haven't pinned down what "approximately" means.

Maybe we could say that the mean price is between two numbers, say between $1.71 and $1.91 ($1.81 ± $0.10). But we cannot assert this with certainty either, because we originally assumed that our gas prices were normally distributed, and normal distributions assign some nonzero probability to values outside any finite interval. Of course, you might now say that the assumption was unrealistic, because anybody knows that the price certainly is between $0 and, say, $10. But we aren't willing to throw this assumption out. Besides, a more precise statement of what we meant by it is that, except for a range of prices of probability so small that it is completely negligible, the random variable of prices is normal.
Thus, a statement like "the true mean is between $1.71 and $1.91 with some high probability" is something that we might be able to say. That is, we are willing to say to someone: you tell us (or we pick in advance) a confidence level (90%, 95%, 99%, 99.9%, whatever) and we will tell you how to pick an interval (range of values) such that the true population mean lies within the interval that percentage of the time. For technical reasons, we prefer to talk about a confidence level \(c = (1 - \alpha) \cdot 100\%\), so, if \(\alpha = 0.05\), then c = 95%.

How should we go about choosing a c% confidence interval? Returning to the gas price problem, in particular, how should we choose a c% confidence interval for gas price based on the price at a single gas station? If we don't know any more than we have revealed so far, we can't. But if we have some information concerning the variance of the original distribution, we can use it. Just for discussion purposes, let's suppose that we know the variance completely, i.e., we know the value of \(\sigma^2\) and, hence, \(\sigma\). Then we know that

\[ P(x - E \le \mu \le x + E) = P(-E \le x - \mu \le E) = P\left(-\frac{E}{\sigma} \le z \le \frac{E}{\sigma}\right), \]

where \(z = \frac{x - \mu}{\sigma}\) is the so-called z-score. Under our assumptions, z is distributed according to a normal distribution with a mean of 0 and a standard deviation of 1 (the standard normal distribution). So we need to answer the question: given confidence level c, what value of E is such that

\[ P\left(-\frac{E}{\sigma} \le z \le \frac{E}{\sigma}\right) = \frac{c}{100}? \]

Of course, since we know \(\sigma\), this is the same as asking what value of \(z_c = \frac{E}{\sigma}\) is such that

\[ P(-z_c \le z \le z_c) = \frac{c}{100}? \]

From what we know about the standard normal distribution, we are asking for that value of z such that \(\alpha/2\) units of area lie above it, or below its negative. Once we have found \(z_c\), we then take \(E = z_c \sigma\), and this solves our problem.

Thus, for the gas prices, suppose that we knew that \(\sigma = 0.2\) and we want a 95% confidence interval. Then \(z_c = 1.96\) (from tables or from Excel) and \(E = z_c \sigma = 1.96(0.2) = 0.392\), and, using the price at George's BP, the interval with end points $1.81 ± $0.39, that is, from about $1.42 to $2.20, is a 95% confidence interval.
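The calculation for the George's BP example can be sketched in a few lines of Python. This is only an illustration: it uses NormalDist from the standard library in place of tables or Excel, the function name z_interval is hypothetical, and it assumes, as in the discussion above, that the standard deviation of 0.2 is known.

```python
from statistics import NormalDist


def z_interval(point, sigma, confidence=0.95):
    """Interval point +/- z_c * sigma, assuming sigma is known.

    Returns (low, high, z_c, margin).
    """
    alpha = 1.0 - confidence
    # z_c leaves alpha/2 of the standard normal area above it
    z_c = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z_c * sigma
    return point - margin, point + margin, z_c, margin


low, high, z_c, margin = z_interval(point=1.81, sigma=0.2, confidence=0.95)
print(f"z_c = {z_c:.2f}, E = {margin:.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

Multiplying the critical value of about 1.96 by 0.2 gives a margin of about 0.392, so the interval runs from roughly 1.418 to 2.202. Changing the confidence argument (for example to 0.99) shows how a higher confidence level widens the interval.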
We cannot be certain that it contains the true mean, but we know that, if we generate intervals in this way, then 95% of the time the resulting interval will contain the true mean.

In reality, though, we aren't going to know \(\sigma\), and we are going to have to take a sample bigger than 1 in order to learn anything about it. But taking a sample helps in two ways. First, it gives us a way to estimate \(\sigma\). Second, the distribution of sample means of samples of size n drawn at random from a population has the same mean as the population and a much tighter (smaller) variance. This means that using a point from the distribution of sample means to estimate the mean of that distribution will give us an interval estimate for the original distribution's mean with a much smaller width than we could get if we used a point from the original distribution.
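Both claims, the 95% coverage and the narrower interval from a sample mean, can be checked with a small simulation. The sketch below is only illustrative: it assumes a known standard deviation (the simplified setting, not the realistic one), uses made-up values of $1.80 for the true mean and $0.20 for the standard deviation, and relies on the standard fact that the mean of n independent draws has standard deviation \(\sigma/\sqrt{n}\). The function name coverage_and_width is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_and_width(n, true_mean=1.80, sigma=0.20,
                       confidence=0.95, trials=20000, seed=1):
    """Fraction of known-sigma intervals containing true_mean, and their width.

    true_mean and sigma are hypothetical values used only for illustration.
    """
    rng = random.Random(seed)
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # The mean of n independent draws has standard deviation sigma / sqrt(n)
    margin = z_c * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if xbar - margin <= true_mean <= xbar + margin:
            hits += 1
    return hits / trials, 2 * margin


for n in (1, 10):
    coverage, width = coverage_and_width(n)
    print(f"n = {n:2d}: coverage = {coverage:.3f}, interval width = {width:.3f}")
```

In both cases the coverage should come out close to 0.95, while the interval based on ten stations is narrower than the single-station interval by a factor of \(\sqrt{10}\), a little more than 3.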