Understanding Normal Distributions & Hypothesis Testing: Sampling Mean & t-Tests, Slides of Statistics

These notes introduce the sampling distribution of the mean and its relationship to z-scores and t-tests. They cover the normal distribution of a population and the sampling distribution of the mean, the role of sample size and population shape, and the use of z-scores and t-tests in hypothesis testing. They also include examples of single-sample, paired, and independent-samples t-tests.

B. Weaver (27-May-2011) z- and t-tests ... 1
Hypothesis Testing Using z- and t-tests
In hypothesis testing, one attempts to answer the following question: If the null hypothesis is
assumed to be true, what is the probability of obtaining the observed result, or any more extreme
result that is favourable to the alternative hypothesis?^1 In order to tackle this question, at least in
the context of z- and t-tests, one must first understand two important concepts: 1) sampling
distributions of statistics, and 2) the central limit theorem.
Sampling Distributions
Imagine drawing (with replacement) all possible samples of size n from a population, and for
each sample, calculating a statistic--e.g., the sample mean. The frequency distribution of those
sample means would be the sampling distribution of the mean (for samples of size n drawn from
that particular population).
Normally, one thinks of sampling from relatively large populations. But the concept of a
sampling distribution can be illustrated with a small population. Suppose, for example, that our
population consisted of the following 5 scores: 2, 3, 4, 5, and 6. The population mean = 4, and
the population standard deviation (dividing by N) = 1.414.
If we drew (with replacement) all possible samples of 2 from this population, we would end up
with the 25 samples shown in Table 1.
Table 1: All possible samples of n=2 from a population of 5 scores.

          First  Second  Sample  |            First  Second  Sample
Sample #  Score  Score   Mean    |  Sample #  Score  Score   Mean
    1       2      2      2      |     14       4      5      4.5
    2       2      3      2.5    |     15       4      6      5
    3       2      4      3      |     16       5      2      3.5
    4       2      5      3.5    |     17       5      3      4
    5       2      6      4      |     18       5      4      4.5
    6       3      2      2.5    |     19       5      5      5
    7       3      3      3      |     20       5      6      5.5
    8       3      4      3.5    |     21       6      2      4
    9       3      5      4      |     22       6      3      4.5
   10       3      6      4.5    |     23       6      4      5
   11       4      2      3      |     24       6      5      5.5
   12       4      3      3.5    |     25       6      6      6
   13       4      4      4      |

Mean of the sample means = 4.000
SD of the sample means = 1.000
(SD calculated with division by N)
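
The enumeration in Table 1 can be reproduced in a few lines. The following is a minimal Python sketch, not part of the original notes; the variable names are illustrative. It lists every ordered pair drawn with replacement and summarises the resulting sample means using division by N.

```python
from itertools import product
from statistics import mean, pstdev

population = [2, 3, 4, 5, 6]

# All ordered samples of size 2, drawn with replacement (5 * 5 = 25 samples).
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

print("Number of samples:", len(samples))
print("Population mean:", mean(population))
print("Population SD (divide by N):", round(pstdev(population), 3))
print("Mean of the sample means:", mean(sample_means))
print("SD of the sample means (divide by N):", round(pstdev(sample_means), 3))
```

Note that the SD of the sample means equals the population SD divided by the square root of the sample size (1.414 / 1.414 = 1), which is the standard error of the mean.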
^1 That probability is called a p-value. It is really a conditional probability--it is conditional on the null hypothesis being true.
The 25 sample means from Table 1 are plotted below in Figure 1 (a histogram). This distribution of sample means is called the sampling distribution of the mean for samples of n=2 from the population of interest (i.e., our population of 5 scores).

Figure 1: Sampling distribution of the mean for samples of n=2 from a population of N=5.

I suspect the first thing you noticed about this figure is that it is peaked in the middle, and symmetrical about the mean. This is an important characteristic of sampling distributions, and we will return to it in a moment.

You may have also noticed that the standard deviation reported in the figure legend is 1.02, whereas I reported SD = 1.000 in Table 1. Why the discrepancy? Because I used the population SD formula (with division by N) to compute SD = 1.000 in Table 1, but SPSS used the sample SD formula (with division by n-1) when computing the SD it plotted alongside the histogram. The population SD is the correct one to use in this case, because I have the entire population of 25 samples in hand.
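
The two SD values can be compared directly. This is a small illustrative sketch (Python standard library, names are my own) that rebuilds the 25 sample means and applies both formulas.

```python
from itertools import product
from statistics import mean, pstdev, stdev

population = [2, 3, 4, 5, 6]
sample_means = [mean(pair) for pair in product(population, repeat=2)]

sd_divide_by_N = pstdev(sample_means)        # population formula, divides by N = 25
sd_divide_by_N_minus_1 = stdev(sample_means)  # sample formula, divides by N - 1 = 24

print("Division by N:  ", round(sd_divide_by_N, 3))
print("Division by N-1:", round(sd_divide_by_N_minus_1, 3))
```

The second value rounds to 1.02, which is the figure SPSS reports alongside the histogram.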

The Central Limit Theorem (CLT)

If I were a mathematical statistician, I would now proceed to work through derivations, proving the following statements:

[Figure 1 (histogram). Title: Distribution of Sample Means (or "Sampling Distribution of the Mean"). Horizontal axis: MEANS, from 2.00 to 6.00 in steps of 0.50. Vertical axis ticks: 0 to 6. Legend: Std. Dev = 1.02, Mean = 4, N = 25.]

  1. If the population from which you sample is extremely non-normal, the sampling distribution of the mean will still be approximately normal given a large enough sample size (e.g., some authors suggest sample sizes of 300 or greater).

So, the general principle is that the more the population shape departs from normal, the greater the sample size must be to ensure that the sampling distribution of the mean is approximately normal. This tradeoff is illustrated in the following figure, which uses colour to represent the shape of the sampling distribution (purple = non-normal, red = normal, with the other colours representing points in between).

Does n have to be ≥ 30?

Some textbooks say that one should have a sample size of at least 30 to ensure that the sampling distribution of the mean is approximately normal. The example we started with (i.e., samples of n = 2 from a population of 5 scores) suggests that this is not correct (see Figure 1). Here is another example that makes the same point. The figure on the left, which shows the age distribution for all students admitted to the Northern Ontario School of Medicine in its first 3 years of operation, is treated as the population. The figure on the right shows the distribution of means for 10,000 samples of size 16 drawn from that population. Notice that despite the severe positive skew in the population, the distribution of sample means is near enough to normal for the normal approximation to be useful.
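
The same kind of demonstration can be sketched by simulation. The admissions data are not reproduced in these notes, so the code below uses a hypothetical, strongly right-skewed stand-in population (a minimum age of 22 plus an exponential tail); that population and all names are assumptions made purely for illustration.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(2011)  # fixed seed so the run is repeatable

# Hypothetical stand-in population with severe positive skew.
population = [22 + rng.expovariate(1 / 4) for _ in range(5000)]

# 10,000 samples of size 16, drawn with replacement, as in the text.
sample_means = [fmean(rng.choices(population, k=16)) for _ in range(10_000)]

print("Skewness of population:  ", round(skewness(population), 2))
print("Skewness of sample means:", round(skewness(sample_means), 2))
print("SD of sample means:      ", round(pstdev(sample_means), 3))
print("Population SD / sqrt(16):", round(pstdev(population) / 4, 3))
```

With this stand-in population the sample means are far less skewed than the raw scores, and their SD is close to the population SD divided by the square root of n.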

What is the rule of 30 about then?

In the olden days, textbook authors often did make a distinction between small-sample and large-sample versions of t-tests. The small- and large-sample versions did not differ at all in terms of how t was calculated. Rather, they differed in how/where one obtained the critical value to which they compared their computed t-value. For the small-sample test, one used the critical value of t, from a table of critical t-values. For the large-sample test, one used the critical value of z, obtained from a table of the standard normal distribution. The dividing line between small and large samples was usually n = 30 (or sometimes 20).

Why was this done? Remember that in that era, data analysts did not have access to desktop computers and statistics packages that computed exact p-values. Therefore, they had to compute the test statistic, and compare it to the critical value, which they looked up in a table. Tables of critical values can take up a lot of room. So when possible, compromises were made. In this particular case, most authors and statisticians agreed that for n ≥ 30, the critical value of z (from the standard normal distribution) was close enough to the critical value of t that it could be used as an approximation. The following figure illustrates this by plotting critical values of t with

$$ z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} $$

This is the same formula, but with $\bar{X}$ in place of $X$, and $\sigma_{\bar{X}}$ in place of $\sigma_X$. And, if the sampling distribution of $\bar{X}$ is normal, or at least approximately normal, we may then refer this value of z to the standard normal distribution, just as we did when we were using raw scores. (This is where the CLT comes in, because it tells us the conditions under which the sampling distribution of $\bar{X}$ is approximately normal.)

An example. Here is a (fictitious) newspaper advertisement for a program designed to increase intelligence of school children^2:

As an expert on IQ, you know that in the general population of children, the mean IQ = 100, and the population SD = 15 (for the WISC, at least). You also know that IQ is (approximately) normally distributed in the population. Equipped with this information, you can now address questions such as:

If the n=25 children from Dundas are a random sample from the general population of children,

a) What is the probability of getting a sample mean of 108 or higher?
b) What is the probability of getting a sample mean of 92 or lower?
c) How high would the sample mean have to be for you to say that the probability of getting a mean that high (or higher) was 0.05 (or 5%)?
d) How low would the sample mean have to be for you to say that the probability of getting a mean that low (or lower) was 0.05 (or 5%)?

^2 I cannot find the original source for this example, but I believe I got it from Dr. Geoff Norman, McMaster University.

The solutions to these questions are quite straightforward, given everything we have learned so far in this chapter. If we have sampled from the general population of children, as we are assuming, then the population from which we have sampled is at least approximately normal. Therefore, the sampling distribution of the mean will be normal, regardless of sample size. Therefore, we can compute a z-score, and refer it to the table of the standard normal distribution.

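So, for part (a) above:

$$ z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} = \frac{108 - 100}{15 / \sqrt{25}} = \frac{8}{3} = 2.667 $$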

And from a table of the standard normal distribution (or using a computer program, as I did), we can see that the probability of a z-score greater than or equal to 2.667 = 0.0038. Translating that back to the original units, we could say that the probability of getting a sample mean of 108 (or greater) is .0038 (assuming that the 25 children are a random sample from the general population).

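For part (b), do the same, but replace 108 with 92:

$$ z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} = \frac{92 - 100}{15 / \sqrt{25}} = \frac{-8}{3} = -2.667 $$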

Because the standard normal distribution is symmetrical about 0, the probability of a z-score equal to or less than -2.667 is the same as the probability of a z-score equal to or greater than 2.667. So, the probability of a sample mean less than or equal to 92 is also equal to 0.0038. Had we asked for the probability of a sample mean that is either 108 or greater, or 92 or less, the answer would be 0.0038 + 0.0038 = 0.0076.
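
Parts (a) and (b) can be checked without a printed table. The sketch below uses Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 25
se = sigma / sqrt(n)            # standard error of the mean = 15 / 5 = 3
std_normal = NormalDist()       # mean 0, SD 1

z_a = (108 - mu) / se           # part (a)
p_a = 1 - std_normal.cdf(z_a)   # P(z >= 2.667)

z_b = (92 - mu) / se            # part (b)
p_b = std_normal.cdf(z_b)       # P(z <= -2.667)

print(f"(a) z = {z_a:.3f}, p = {p_a:.4f}")
print(f"(b) z = {z_b:.3f}, p = {p_b:.4f}")
print(f"either tail: p = {p_a + p_b:.4f}")
```

The unrounded sum of the two tails is slightly above .0076, because .0076 is the sum of two values that were each rounded to .0038 first.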

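Part (c) above amounts to the same thing as asking, "What sample mean corresponds to a z-score of 1.645?", because we know that p(z ≥ 1.645) = 0.05. We can start out with the usual z-score formula, but need to rearrange the terms a bit, because we know that z = 1.645, and are trying to determine the corresponding value of $\bar{X}$.

$$
\begin{aligned}
z &= \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} && \text{\{cross-multiply to get to next line\}} \\
z\,\sigma_{\bar{X}} &= \bar{X} - \mu_{\bar{X}} && \text{\{add } \mu_{\bar{X}} \text{ to both sides\}} \\
z\,\sigma_{\bar{X}} + \mu_{\bar{X}} &= \bar{X} && \text{\{switch sides\}} \\
\bar{X} &= z\,\sigma_{\bar{X}} + \mu_{\bar{X}} = 1.645 \left( \frac{15}{\sqrt{25}} \right) + 100 = 104.935
\end{aligned}
$$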
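
The rearranged formula is easy to evaluate in code. The following sketch (Python standard library, illustrative names) obtains the critical z from the inverse of the standard normal CDF rather than from a table, and applies the same rearrangement to the lower tail asked about in part (d), on the assumption that part (d) simply mirrors part (c).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 25
se = sigma / sqrt(n)                     # 3.0

# z-score that leaves 5% of the standard normal distribution above it.
z_crit = NormalDist().inv_cdf(0.95)      # about 1.645

upper_cutoff = z_crit * se + mu          # part (c): X-bar = z * sigma_Xbar + mu
lower_cutoff = -z_crit * se + mu         # part (d): same rearrangement with z = -1.645

print(f"z critical = {z_crit:.3f}")
print(f"(c) sample mean needed = {upper_cutoff:.3f}")
print(f"(d) sample mean needed = {lower_cutoff:.3f}")
```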

disciplines use by convention is this: The difference between $\bar{X}$ and $\mu$ must be large enough that the probability it occurred by chance (given a true null hypothesis) is 5% or less.

The observed sample mean for this example was 108. As we saw earlier, this corresponds to a z-score of 2.667, and p(z ≥ 2.667) = 0.0038. Therefore, we could reject $H_0$, and we would act as if the sample was drawn from a population in which mean IQ is greater than 100.

Version 2: Another directional alternative hypothesis

$$ H_0: \mu \ge 100 \qquad H_1: \mu < 100 $$

This pair of hypotheses would be used if we expected Dr. Duntz's program to lower IQ, and if we were willing to include an increase in IQ (no matter how large) under the null hypothesis. Given a sample mean of 108, we could stop without calculating z, because the difference is in the wrong direction. That is, to have any hope of rejecting $H_0$, the observed difference must be in the direction specified by $H_1$.

Version 3: A non-directional alternative hypothesis

$$ H_0: \mu = 100 \qquad H_1: \mu \ne 100 $$

In this case, the null hypothesis states that the 25 children are a random sample from a population with mean IQ = 100, and the alternative hypothesis says they are not--but it does not specify the direction of the difference from 100. In the first directional test, we needed to have $\bar{X} > 100$ by a sufficient amount, and in the second directional test, $\bar{X} < 100$ by a sufficient amount in order to reject $H_0$. But in this case, with a non-directional alternative hypothesis, we may reject $H_0$ if $\bar{X} < 100$ or if $\bar{X} > 100$, provided the difference is large enough.

Because differences in either direction can lead to rejection of $H_0$, we must consider both tails of the standard normal distribution when calculating the p-value--i.e., the probability of the observed outcome, or a more extreme outcome favourable to $H_1$. For symmetrical distributions like the standard normal, this boils down to taking the p-value for a directional (or 1-tailed) test, and doubling it.

For this example, the sample mean = 108. This represents a difference of +8 from the population mean (under a true null hypothesis). Because we are interested in both tails of the distribution, we must figure out the probability of a difference of +8 or greater, or a change of -8 or greater.

In other words, p = p($\bar{X}$ ≥ 108) + p($\bar{X}$ ≤ 92) = .0038 + .0038 = .0076.
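
The three versions of the test differ only in which tail (or tails) of the standard normal distribution contribute to the p-value. A minimal sketch follows; the function `z_test_p` and its `alternative` labels are hypothetical names chosen for illustration, not something defined in these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p(sample_mean, mu0, sigma, n, alternative):
    """p-value for a single-sample z-test with known population SD."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    lower_tail = NormalDist().cdf(z)      # P(Z <= z)
    upper_tail = 1 - lower_tail           # P(Z >= z)
    if alternative == "greater":          # H1: mu > mu0
        return upper_tail
    if alternative == "less":             # H1: mu < mu0
        return lower_tail
    if alternative == "two-sided":        # H1: mu != mu0
        return 2 * min(lower_tail, upper_tail)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

for alt in ("greater", "less", "two-sided"):
    print(alt, round(z_test_p(108, 100, 15, 25, alt), 4))
```

With a sample mean of 108, the "less" version gives a p-value close to 1, which is the numerical counterpart of stopping early because the difference is in the wrong direction.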

Single sample t-test (when σ is not known)

In many real-world cases of hypothesis testing, one does not know the standard deviation of the population. In such cases, it must be estimated using the sample standard deviation. That is, s (calculated with division by n-1) is used to estimate σ. Other than that, the calculations are as we saw for the z-test for a single sample--but the test statistic is called t, not z.

$$ t_{(df = n-1)} = \frac{\bar{X} - \mu_{\bar{X}}}{s_{\bar{X}}}, \quad \text{where } s_{\bar{X}} = \frac{s_X}{\sqrt{n}}, \text{ and } s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{SS_X}{n - 1}} \tag{1.9} $$

In equation (1.9), notice the subscript written by the t. It says "df = n-1". The "df" stands for degrees of freedom. "Degrees of freedom" can be a bit tricky to grasp, but let's see if we can make it clear.

Degrees of Freedom

Suppose I tell you that I have a sample of n=4 scores, and that the first three scores are 2, 3, and 5. What is the value of the 4th score? You can't tell me, given only that n = 4. It could be anything. In other words, all of the scores, including the last one, are free to vary: df = n for a sample mean.

To calculate t, you must first calculate the sample standard deviation. The conceptual formula for the sample standard deviation is:

$$ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \tag{1.10} $$

Suppose that the last score in my sample of 4 scores is a 6. That would make the sample mean equal to (2+3+5+6)/4 = 4. As shown in Table 2, the deviation scores for the first 3 scores are -2, -1, and 1.

Table 2: Illustration of degrees of freedom for sample standard deviation

Score   Mean   Deviation from Mean
  2       4          -2
  3       4          -1
  5       4           1
 --      --          x_4
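
The point of Table 2 can be made concrete with a short sketch. It relies on a standard fact that is not spelled out in the surviving text: deviations from the sample mean always sum to zero, so once the mean and the first three deviations are known, the fourth is fixed. Variable names are illustrative.

```python
known_scores = [2, 3, 5]
sample_mean = 4        # mean of all n = 4 scores, as in the text
n = 4

# Deviations of the three known scores from the sample mean.
known_deviations = [x - sample_mean for x in known_scores]

# Deviations about the mean must sum to zero, so the last one is not free to vary.
last_deviation = -sum(known_deviations)
last_score = sample_mean + last_deviation

print("Known deviations:", known_deviations)
print("Last deviation (x_4):", last_deviation)
print("Last score:", last_score)
print("Deviations free to vary:", n - 1)
```

This recovers the final score of 6 used in the text, and shows why only n - 1 of the deviations carry independent information.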

Figure 2: Probability density functions of: the standard normal distribution (the highest peak with the thinnest tails); the t-distribution with df = 10 (intermediate peak and tails); and the t-distribution with df = (the lowest peak and thickest tails). The dotted lines are at -1.96 and +1.96, the critical values of z for a two-tailed test with alpha = .05. For all t-distributions with df < ∞, the proportion of area beyond -1.96 and +1.96 is greater than .05. The lower the degrees of freedom, the thicker the tails, and the greater the proportion of area beyond -1.96 and +1.96.

Table 3 (see below) shows yet another way to think about the relationship between the standard normal distribution and various t-distributions. It shows the area in the two tails beyond -1.96 and +1.96, the critical values of z with 2-tailed alpha = .05. With df=1, roughly 15% of the area falls in each tail of the t-distribution. As df gets larger, the tail areas get smaller and smaller, until the t-distribution converges on the standard normal when df = infinity.

Table 3: Area beyond critical values of + or -1.96 in various t-distributions. The t-distribution with df = infinity is identical to the standard normal distribution.

Degrees of   Area beyond   |  Degrees of   Area beyond
Freedom      + or -1.96    |  Freedom      + or -1.96
     1        0.30034      |       40       0.
     2        0.18906      |       50       0.
     3        0.14485      |      100       0.
     4        0.12155      |      200       0.
     5        0.10729      |      300       0.
    10        0.07844      |      400       0.
    15        0.06884      |      500       0.
    20        0.06408      |    1,000       0.
    25        0.06123      |    5,000       0.
    30        0.05934      |   10,000       0.
                           | Infinity       0.
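
The tail areas in Table 3 can be approximated numerically. The sketch below is one way to do it with only the Python standard library: it integrates the t density with Simpson's rule instead of calling a statistics package. The function names and the number of integration steps are my own choices, not part of the original notes.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def area_beyond(crit, df, steps=2000):
    """Two-tailed area beyond -crit and +crit (Simpson's rule on [0, crit])."""
    h = crit / steps
    total = t_pdf(0.0, df) + t_pdf(crit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total    # area between -crit and +crit
    return 1 - central_area

for df in (1, 2, 5, 10, 30, 100, 1000):
    print(f"df = {df:>5}: area beyond +/-1.96 = {area_beyond(1.96, df):.5f}")

# Limiting case: the standard normal distribution (df = infinity).
print(f"normal   : area beyond +/-1.96 = {2 * (1 - NormalDist().cdf(1.96)):.5f}")
```

The areas shrink toward the normal-curve value as df grows, which is the convergence described above.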

Example of single-sample t-test.

This example is taken from Understanding Statistics in the Behavioral Sciences (3rd Ed), by Robert R. Pagano.

A researcher believes that in recent years women have been getting taller. She knows that 10 years ago the average height of young adult women living in her city was 63 inches. The standard deviation is unknown. She randomly samples eight young adult women currently residing in her city and measures their heights. The following data are obtained: [64, 66, 68, 60, 62, 65, 66, 63].

The null hypothesis is that these 8 women are a random sample from a population in which the mean height is 63 inches. The non-directional alternative states that the women are a random sample from a population in which the mean is not 63 inches.

$$ H_0: \mu = 63 \qquad H_1: \mu \ne 63 $$

The sample mean is 64.25. Because the population standard deviation is not known, we must estimate it using the sample standard deviation.

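$$ \text{sample SD} = s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{(64 - 64.25)^2 + (66 - 64.25)^2 + \cdots + (63 - 64.25)^2}{7}} = 2.550 $$

We can now use the sample standard deviation to estimate the standard error of the mean:

$$ \text{Estimated SE of mean} = s_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{2.550}{\sqrt{8}} = 0.901 $$

And finally:

$$ t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}} = \frac{64.25 - 63}{0.901} = 1.387 $$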

This value of t can be referred to a t-distribution with df = n-1 = 7. Doing so, we find that the conditional probability^4 of obtaining a t-statistic with absolute value equal to or greater than 1.387 = 0.208. Therefore, assuming that alpha had been set at the usual .05 level, the researcher cannot reject the null hypothesis.

I performed the same test in SPSS (Analyze → Compare Means → One sample t-test), and obtained the same results, as shown below.

T-TEST
  /TESTVAL=63           /* <-- This is where you specify the value of mu */
  /MISSING=ANALYSIS
  /VARIABLES=height     /* <-- This is the variable you measured */
  /CRITERIA=CIN (.95).

T-Test

One-Sample Statistics

          N    Mean    Std. Deviation   Std. Error Mean
HEIGHT    8    64.25       2.550             .901
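
For readers without SPSS, the hand calculation for the height example can be sketched in Python using only the standard library. The p-value is obtained by integrating the t density numerically with Simpson's rule; that helper and all names here are illustrative assumptions, not the method SPSS uses.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond -|t| and +|t| (Simpson's rule on [0, |t|])."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

heights = [64, 66, 68, 60, 62, 65, 66, 63]
mu0 = 63                       # population mean under the null hypothesis

n = len(heights)
x_bar = mean(heights)
s = stdev(heights)             # sample SD, division by n - 1
se = s / math.sqrt(n)          # estimated standard error of the mean
t = (x_bar - mu0) / se
df = n - 1
p = two_tailed_p(t, df)

print(f"mean = {x_bar}, s = {s:.3f}, SE = {se:.3f}")
print(f"t({df}) = {t:.3f}, two-tailed p = {p:.3f}")
```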

^4 Remember that a p-value is really a conditional probability. It is conditional on the null hypothesis being true.

Example of paired t-test

This example is from the Study Guide to Pagano’s book Understanding Statistics in The Behavioral Sciences (3rd Edition).

A political candidate wishes to determine if endorsing increased social spending is likely to affect her standing in the polls. She has access to data on the popularity of several other candidates who have endorsed increased spending. The data was available both before and after the candidates announced their positions on the issue [see Table 4].

Table 4: Data for paired t-test example.

              Popularity Ratings
Candidate     Before     After     Difference
    1           42         43           1
    2           41         45           4
    3           50         56           6
    4           52         54           2
    5           58         65           7
    6           32         29          -3
    7           39         46           7
    8           42         48           6
    9           48         47          -1
   10           47         53           6

I entered these BEFORE and AFTER scores into SPSS, and performed a paired t-test as follows:

T-TEST PAIRS= after WITH before (PAIRED)
  /CRITERIA=CIN(.95)
  /MISSING=ANALYSIS.

This yielded the following output.

Paired Samples Statistics

                  Mean     N    Std. Deviation   Std. Error Mean
Pair 1  AFTER     48.60    10       9.489             3.001
        BEFORE    45.10    10       7.415             2.345

Paired Samples Correlations

                           N    Correlation   Sig.
Pair 1  AFTER & BEFORE     10      .940        .

Paired Samples Test

                                      Paired Differences
                                                           95% Confidence Interval
                                                              of the Difference
                          Mean   Std. Deviation   Std. Error Mean   Lower   Upper     t     df   Sig. (2-tailed)
Pair 1  AFTER - BEFORE    3.50       3.567             1.128         .95    6.05    3.103    9        .013

The first output table gives descriptive information on the BEFORE and AFTER popularity ratings, and shows that the mean is higher after politicians have endorsed increased spending.

The second output table gives the Pearson correlation (r) between the BEFORE and AFTER scores. (The correlation coefficient is a measure of the direction and strength of the linear relationship between two variables. I will say more about it in a later section called Testing the significance of Pearson r.)

The final output table shows descriptive statistics for the AFTER – BEFORE difference scores, and the t-value with its degrees of freedom and p-value.

The null hypothesis for this test states that the mean difference in the population is zero. In other words, endorsing increased social spending has no effect on popularity ratings in the population from which we have sampled. If that is true, the probability of seeing a difference of 3.5 points or more is 0.013 (the p-value). Therefore, the politician would likely reject the null hypothesis, and would endorse increased social spending.
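
The paired t-test amounts to a single-sample t-test on the difference scores, and the SPSS figures above can be checked by hand. The following sketch uses only the Python standard library; the numerical-integration helper and the variable names are illustrative assumptions, not the routine SPSS uses.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond -|t| and +|t| (Simpson's rule on [0, |t|])."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

before = [42, 41, 50, 52, 58, 32, 39, 42, 48, 47]
after = [43, 45, 56, 54, 65, 29, 46, 48, 47, 53]

# Difference scores (AFTER - BEFORE), one per candidate.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)                  # division by n - 1
se_diff = sd_diff / math.sqrt(n)
t = (mean_diff - 0) / se_diff           # H0: mean difference = 0
df = n - 1
p = two_tailed_p(t, df)

print("Differences:", diffs)
print(f"mean = {mean_diff}, SD = {sd_diff:.3f}, SE = {se_diff:.3f}")
print(f"t({df}) = {t:.3f}, two-tailed p = {p:.3f}")
```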

The same example done using a one-sample t-test

Earlier, I said that the paired t-test is really just a single-sample t-test done on difference scores. Let’s demonstrate for ourselves that this is really so. Here are the BEFORE, AFTER, and DIFF scores from my SPSS file. (I computed the DIFF scores using “compute diff = after – before.”)

BEFORE   AFTER   DIFF

  42       43      1
  41       45      4
  50       56      6
  52       54      2
  58       65      7
  32       29     -3
  39       46      7
  42       48      6
  48       47     -1
  47       53      6

Number of cases read: 10 Number of cases listed: 10

I then ran a single-sample t-test on the difference scores using the following syntax:

COUPLE   HUSBAND   WIFE    DIFF

  8.00    78.79    63.50    15.
  9.00    73.19    68.12     5.
 10.00    67.24    67.61     -.
 11.00    70.98    60.60    10.
 12.00    69.86    63.54     6.
 13.00    69.07    65.20     3.
 14.00    68.81    61.93     6.
 15.00    76.47    71.70     4.
 16.00    71.28    67.72     3.
 17.00    79.77    62.03    17.
 18.00    66.87    58.59     8.
 19.00    71.10    68.97     2.
 20.00    74.68    65.90     8.
 21.00    75.56    58.92    16.
 22.00    67.04    64.97     2.
 23.00    60.80    67.93    -7.
 24.00    71.91    63.68     8.
 25.00    79.17    73.08     6.

Number of cases read: 25 Number of cases listed: 25

* We cannot use the "paired t-test" procedure, because it does not
* allow for non-zero mean difference under the null hypothesis.
* Instead, we need to use the single-sample t-test on the difference
* scores, and specify a mean difference of 5 inches under the null.

T-TEST
  /TESTVAL=5            /* H0: Mean difference = 5 inches (as in past) */
  /MISSING=ANALYSIS
  /VARIABLES=diff       /* perform analysis on the difference scores */
  /CRITERIA=CIN (.95).

T-Test

One-Sample Statistics

         N     Mean     Std. Deviation   Std. Error Mean
DIFF    25    5.9372       6.54740           1.30948

One-Sample Test

                                Test Value = 5
                                                        95% Confidence Interval
                                                           of the Difference
          t     df   Sig. (2-tailed)   Mean Difference     Lower      Upper
DIFF    .716    24        .481              .9372         -1.7654      3.

* The observed mean difference = 5.9 inches.
* The null hypothesis cannot be rejected (p = 0.481).
* This example shows that the null hypothesis does not always have to
* specify a mean difference of 0. We obtained a mean difference of
* 5.9 inches, but were unable to reject H0, because it stated that
* the mean difference = 5 inches.
* Suppose we had found that the difference in height between
* husbands and wives really had decreased dramatically. In
* that case, we might have found a mean difference close to 0,
* which might have allowed us to reject H0. An example of
* this scenario follows below.

COUPLE   HUSBAND   WIFE    DIFF

  1.00    68.78    75.34    -6.
  2.00    66.09    67.57    -1.
  3.00    71.99    69.16     2.
  4.00    74.51    69.17     5.
  5.00    67.31    68.11     -.
  6.00    64.05    68.62    -4.
  7.00    66.77    70.31    -3.
  8.00    75.33    72.92     2.
  9.00    74.11    73.10     1.
 10.00    75.71    62.66    13.
 11.00    69.01    76.83    -7.
 12.00    67.86    63.23     4.
 13.00    66.61    72.01    -5.
 14.00    68.64    76.10    -7.
 15.00    78.74    68.53    10.
 16.00    71.66    62.65     9.
 17.00    73.43    70.46     2.
 18.00    70.39    79.99    -9.
 19.00    70.15    64.27     5.
 20.00    71.53    69.07     2.
 21.00    57.49    81.21   -23.
 22.00    68.95    69.92     -.
 23.00    77.60    70.70     6.
 24.00    72.36    67.79     4.
 25.00    72.70    67.50     5.

Number of cases read: 25 Number of cases listed: 25

T-TEST
  /TESTVAL=5            /* H0: Mean difference = 5 inches (as in past) */
  /MISSING=ANALYSIS
  /VARIABLES=diff       /* perform analysis on the difference scores */
  /CRITERIA=CIN (.95).

T-Test

One-Sample Statistics

         N     Mean    Std. Deviation   Std. Error Mean
DIFF    25    .1820       7.76116           1.55223
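
The test against a non-zero null value can also be sketched outside SPSS. The code below uses the 25 HUSBAND and WIFE heights from the second listing, as displayed to two decimals, so its results may differ slightly in the last digit from SPSS, which works from unrounded values. The integration helper and all names are illustrative assumptions.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond -|t| and +|t| (Simpson's rule on [0, |t|])."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

# (husband, wife) heights in inches, couples 1 to 25, as listed above.
couples = [
    (68.78, 75.34), (66.09, 67.57), (71.99, 69.16), (74.51, 69.17), (67.31, 68.11),
    (64.05, 68.62), (66.77, 70.31), (75.33, 72.92), (74.11, 73.10), (75.71, 62.66),
    (69.01, 76.83), (67.86, 63.23), (66.61, 72.01), (68.64, 76.10), (78.74, 68.53),
    (71.66, 62.65), (73.43, 70.46), (70.39, 79.99), (70.15, 64.27), (71.53, 69.07),
    (57.49, 81.21), (68.95, 69.92), (77.60, 70.70), (72.36, 67.79), (72.70, 67.50),
]

diffs = [husband - wife for husband, wife in couples]
test_value = 5                  # H0: mean difference = 5 inches

n = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)
se_diff = sd_diff / math.sqrt(n)
t = (mean_diff - test_value) / se_diff
df = n - 1
p = two_tailed_p(t, df)

print(f"N = {n}, mean = {mean_diff:.4f}, SD = {sd_diff:.3f}, SE = {se_diff:.3f}")
print(f"t({df}) = {t:.3f}, two-tailed p = {p:.4f}")
```

Here the observed mean difference is close to 0 and well below the hypothesised 5 inches, so the t-value is negative.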