G*Power for Power Analysis: Effect & Sample Size for t-Tests & ANOVA, Summaries of Designs and Groups

A step-by-step guide on how to use G*Power for power analyses of t-Tests (means) and ANOVA (F-Test) with examples and explanations of effect size, alpha, beta, and critical values. It covers post-hoc and a priori power analyses, as well as compromise power analyses.

Typology: Summaries

2021/2022

Uploaded on 07/05/2022

barbara_gr
GPOWER Tutorial

Before we begin this tutorial, we would like to give you a piece of general advice for performing power analyses.

A very frequent error in performing power analyses with G*Power is to specify incorrect degrees of freedom. As a general rule, therefore, we recommend that you routinely compare the degrees of freedom as specified in G*Power with the degrees of freedom that your statistical analysis program gives you for an appropriate set of data. If you do not yet have your data set (e.g., in the case of an a priori power analysis), then you could simply create an appropriate artificial data set and check the degrees of freedom for this set.

Let us now start with the simplest possible case, a t-test for independent samples.

In a now-classic study, Warrington and Weiskrantz (1970) compared the memory performance of amnesics to that of normal controls. Amnesics are persons who have very serious long-term memory problems. It very often takes them weeks to learn where the bathroom is in a new environment, and some of them never seem to learn such things. Perhaps the most intriguing result of the Warrington and Weiskrantz study was that amnesics and normals differed with respect to direct, but not indirect, measures of memory.

An example of a direct memory measure would be recognition performance. This measure is called direct because the remembering person receives explicit instructions to recollect a prior study episode ("please recognize which of these words you have seen before").

In contrast, word stem completion would be an indirect measure of memory. In such a task, a person is given a word stem such as "tri....." and is asked to complete it with the first word that comes to mind. If the probability of completing such stems with studied words is above baseline, then we observe an effect of prior experience.

It should be clear by now why the finding of no statistically significant difference between amnesics and normals in indirect tests was so exciting: All of a sudden there was evidence for memory where it was not expected, but only when the instructions did not stress the fact that the task was a memory task.

However, it may appear a bit puzzling that amnesics and normals were not totally equivalent with respect to the indirect word stem completion task. Rather, normals were a bit better than amnesics, with an average of 16 versus 14.5 stems completed with studied words, respectively. Of course, in the recognition task, normals were much better than amnesics, with correct recognition scores of 13 versus 8, respectively.

At this point, one may wonder about the power of the relevant statistical test to detect a difference if there truly was one. Therefore, let's perform a post-hoc power analysis on these Warrington and Weiskrantz (1970) data.

Post-hoc Power Analysis

For the sake of this example, let us assume that the mean word-stem completion performance for amnesics (14.5) and normals (16) as observed by Warrington and Weiskrantz (1970) reflects the population means, and let the population standard deviation in both groups be sigma = 3. We can now compute the effect size index d (Cohen, 1977), which is defined as

$$d = \frac{|\mu_1 - \mu_2|}{\sigma}$$

We obtain

$$d = \frac{|14.5 - 16|}{3} = 0.5$$

The resulting d = 0.5 can be interpreted as a "medium" effect according to Cohen's (1977) popular effect size conventions.
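The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the function name `cohens_d` is our own label for illustration and is not part of G*Power.

```python
def cohens_d(mean_1, mean_2, sigma):
    """Cohen's d: absolute mean difference in units of the common standard deviation."""
    return abs(mean_1 - mean_2) / sigma


# Values assumed in the text: amnesics 14.5, normals 16, sigma = 3.
d = cohens_d(14.5, 16.0, 3.0)
print(d)  # 0.5
```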

A total of n1 = 4 amnesics and n2 = 8 normal control subjects participated in the Warrington and Weiskrantz (1970) study. These sample sizes are used by G*Power to compute the relevant noncentrality parameter of the noncentral t-distribution. The noncentral distribution of a test statistic results, for a certain sample size, if H1 (the alternative hypothesis) is true. The noncentrality parameter delta (δ) is defined as

$$\delta = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

Now we are almost set to perform our post-hoc power analysis. One more piece is missing, however. We need to decide which level of alpha is acceptable. Without much thinking, we choose alpha = .05. Given these premises, what was the power in the Warrington and Weiskrantz (1970) study to detect a "medium" size difference between amnesics and controls in the word stem completion task?

Start G*Power and select:

Type of Power Analysis: Post-hoc
Type of Test: t-Test (means), two-tailed
Accuracy mode calculation
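For readers who want to cross-check such a post-hoc analysis outside G*Power, the following Python sketch computes the power of a two-group t-test from d, n1, n2, and alpha. It is not G*Power's own algorithm: we integrate the noncentral t distribution numerically with Simpson's rule and find the critical t by bisection, using only the standard library. All function names are hypothetical, and the sketch assumes more than 2 degrees of freedom.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def chi2_pdf(v, df):
    """Density of the central chi-square distribution (sketch assumes df > 2)."""
    if v <= 0.0:
        return 0.0
    log_pdf = ((df / 2.0 - 1.0) * math.log(v) - v / 2.0
               - (df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0))
    return math.exp(log_pdf)


def nct_cdf(t, df, delta, steps=4000):
    """P(T <= t) for noncentral t, where T = (Z + delta) / sqrt(V / df).

    Integrates P(Z <= t * sqrt(v / df) - delta) over V ~ chi-square(df)
    with Simpson's rule; delta = 0 gives the central t distribution.
    """
    upper = df + 12.0 * math.sqrt(2.0 * df) + 20.0  # mass beyond this is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        v = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * normal_cdf(t * math.sqrt(v / df) - delta) * chi2_pdf(v, df)
    return total * h / 3.0


def critical_t(alpha, df, two_tailed=True):
    """Critical value of the central t distribution, found by bisection."""
    target = 1.0 - (alpha / 2.0 if two_tailed else alpha)
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if nct_cdf(mid, df, 0.0) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def posthoc_power(d, n1, n2, alpha=0.05, two_tailed=True):
    """Return (power, critical t, delta) for a two-group t-test on means."""
    df = n1 + n2 - 2
    delta = d * math.sqrt(n1 * n2 / (n1 + n2))
    t_crit = critical_t(alpha, df, two_tailed)
    power = 1.0 - nct_cdf(t_crit, df, delta)
    if two_tailed:
        power += nct_cdf(-t_crit, df, delta)  # rejections in the opposite tail
    return power, t_crit, delta


# Inputs from the text: d = 0.5, n1 = 4 amnesics, n2 = 8 controls, alpha = .05.
power, t_crit, delta = posthoc_power(0.5, 4, 8, alpha=0.05)
print(f"power = {power:.4f}, critical t(10) = {t_crit:.4f}, delta = {delta:.4f}")
```

Because the result depends on a numerical approximation, small differences from the values that G*Power reports in accuracy mode are to be expected.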

We are shocked. Obviously, there is no way we can recruit N = 210 subjects for our study, simply because it would be impossible to find n1 = 105 amnesic patients (fortunately, very few people suffer from severe amnesia!).

Assume that we work in a hospital in which n1 = 20 amnesics are treated at the moment. It seems reasonable to expect that we can recruit an equal number of control patients to participate in our study. Thus, n1 + n2 = 20 + 20 = 40 is our largest possible sample size.

What are we going to do? Well, we simply perform a

Compromise Power Analysis

Erdfelder (1984) has developed the concept of a compromise power analysis specifically for cases like the present one, in which pragmatic constraints prevent our investigations from following the recommendations derived from an a priori power analysis. The basic idea here is that two things are fixed, the maximum possible sample size and the effect we want to detect, but that we may still choose the alpha and beta error probabilities in accordance with these two parameters. All we need to specify is the relative seriousness of the alpha and beta errors. Sometimes, protecting against alpha errors will be more important, and sometimes beta errors are associated with a higher cost. Which error type is more serious depends on our research question. For instance, if we invented a new, cheaper treatment of a mental disorder, then we would want to make sure that it is not worse than the older, more expensive treatment. In this case, committing a beta error (i.e., accepting both treatments as equivalent although the cheaper treatment is worse) may be considered more serious than committing an alpha error.

In basic research, both types of errors are normally considered equally serious. Thus, in our present basic-research example we choose

$$q = \frac{\beta}{\alpha} = 1$$

We're all set now to perform our compromise power analysis.

Select:
Type of Power Analysis: Compromise
Type of Test: t-Test (means), two-tailed
Accuracy mode calculation

Input:
n1:
n2:
Effect size "d": 0.
Beta/alpha ratio: 1

Result:
alpha: 0.
Power (1-beta): 0.
Critical t: t(38) = 1.
Delta: 1.

This is still not fantastic, but perhaps it is more reasonable than the alternatives we have. In the end, you will have to decide whether it is worth the trouble given these premises.
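The idea of a compromise analysis can also be sketched in code. The following Python fragment searches, by bisection, for the critical t at which beta equals q times alpha for a two-tailed two-group t-test. It is an illustration under our own assumptions, not G*Power's implementation: the distributions are integrated numerically with Simpson's rule, all function names are hypothetical, and the sketch assumes more than 2 degrees of freedom.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def chi2_pdf(v, df):
    """Density of the central chi-square distribution (sketch assumes df > 2)."""
    if v <= 0.0:
        return 0.0
    return math.exp((df / 2.0 - 1.0) * math.log(v) - v / 2.0
                    - (df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0))


def nct_cdf(t, df, delta, steps=2000):
    """P(T <= t) for noncentral t via Simpson integration over V ~ chi-square(df)."""
    upper = df + 12.0 * math.sqrt(2.0 * df) + 20.0
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        v = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * normal_cdf(t * math.sqrt(v / df) - delta) * chi2_pdf(v, df)
    return total * h / 3.0


def compromise(d, n1, n2, q=1.0):
    """Return (alpha, power, critical t, delta) such that beta / alpha = q."""
    df = n1 + n2 - 2
    delta = d * math.sqrt(n1 * n2 / (n1 + n2))

    def errors(t_crit):
        alpha = 2.0 * (1.0 - nct_cdf(t_crit, df, 0.0))  # two-tailed alpha under H0
        beta = nct_cdf(t_crit, df, delta) - nct_cdf(-t_crit, df, delta)  # under H1
        return alpha, beta

    # beta - q * alpha increases with the critical value, so bisection applies.
    lo, hi = 0.0, 10.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        alpha, beta = errors(mid)
        if beta - q * alpha < 0.0:
            lo = mid
        else:
            hi = mid
    t_crit = (lo + hi) / 2.0
    alpha, beta = errors(t_crit)
    return alpha, 1.0 - beta, t_crit, delta


# Inputs from the text: n1 = n2 = 20, d = 0.5, beta/alpha ratio q = 1.
alpha, power, t_crit, delta = compromise(0.5, 20, 20, q=1.0)
print(f"alpha = {alpha:.4f}, power = {power:.4f}, "
      f"critical t(38) = {t_crit:.4f}, delta = {delta:.4f}")
```

The search works on the difference beta - q * alpha rather than on the ratio itself, which avoids dividing by a very small alpha at large critical values.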

We have now arrived at the end of our tutorial. If you want to learn more about statistical power analyses, we recommend that you read Cohen's (1988) excellent book.

q := beta/alpha which specifies the relative seriousness of both errors (cf. Cohen, 1965, 1988, p. 5). The problem is to calculate an optimum critical value for the test statistic which satisfies beta/alpha = q. This optimum critical value can be regarded as a rational compromise between the demands for a low alpha-risk and a large power level, given a fixed sample size.

Given appropriate subroutines for computing the noncentral distributions of the relevant test statistics (i.e., the exact distributions of the test statistics if H1 is true, cf. Johnson & Kotz, 1970, chap. 28, 30, and 31), it is relatively easy to implement compromise power analyses using an efficient iterative interval dissection algorithm (cf. Press, Flannery, Teukolsky, & Vetterling, 1988, chap. 9).

The question is, therefore, why compromise analyses are missing in the currently available power analysis software. The only reason we can think of is that non-standard results may occur, that is, results that are inconsistent with established conventions of statistical inference. Given some fixed sample size, a compromise power analysis could suggest choosing a critical value which corresponds to, say, alpha = beta = .168.

These error probabilities are indeed non-standard, but they may nevertheless be reasonable given the constraints of the research. To illustrate, consider the special case of some substantive hypothesis which implies H0, for instance, the hypothesis of no interaction. Does it make more sense to choose alpha = beta = .168 rather than to insist on the standard level alpha = .05 associated with beta = .623? Obviously, the standard .05 alpha-level makes no sense in this situation, because it implies a risk of almost two-thirds of falsely accepting the hypothesis of interest. Therefore, not only a priori and post-hoc analyses, but also compromise power analyses should be offered routinely by software which is designed to serve as a researcher's tool.

Note that there is a list of tests for fast access to test-specific information

One-Tailed versus Two-Tailed Tests

If you are interested in testing two directional parameter hypotheses against each other (e.g., H0: mu1 <= mu2; H1: mu1 > mu2), a one-tailed test is more appropriate than a two-tailed test. Limiting the region of rejection to one tail of the sampling distribution under H0 provides greater power with respect to an alternative hypothesis in the direction of that tail. The figure below tries to illustrate this.

Alpha Error Probability

Alpha is the probability of falsely accepting H1 when in fact H0 is true. The figure below illustrates alpha for an F-test with respect to an alternative hypothesis that corresponds to a so-called "noncentral" F sampling distribution defined by the noncentrality parameter lambda.

Power and the Beta Error Probability

The power of a test is defined as 1-beta, and beta is the probability of falsely accepting H0 when in fact H1 is true. The figure below illustrates beta and the power of an F-test with respect to an alternative hypothesis that corresponds to a so-called "noncentral" F sampling distribution defined by the noncentrality parameter lambda.

latter option, users must click on the "Calc 'x' " button (x representing the effect size parameter of the test currently selected).

In order to prepare the appropriate G*Power input, it may sometimes be necessary to know the relation between the sample size and the effect size measure on the one hand and the noncentrality parameter of the noncentral distributions on the other hand. We have provided the relation between the sample size, the effect size measures, and the noncentrality parameters on a separate page.

Total Sample Size

In G*Power the total sample size is the number of subjects summed over all groups of the design.

In a t-test on means, the sample size may vary between groups A and B. Note, however, that in this case we want sigma to be approximately equal in both groups. Otherwise, both the t-test and the corresponding G*Power calculations may be misleading because the distributions of the test statistic under H0 and H1 will differ substantially from (central and noncentral) t-distributions.

Another problem could be unequal standard deviations in the populations underlying the two samples. In this case, Cohen (1977) recommended adjusting sigma to sigma' according to

$$\sigma' = \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{2}}$$

According to Cohen (1977) the number of participants in both groups A and B must be equal for this correction to be acceptable. If the group sizes vary, then this adjustment is not appropriate.

Please note that you will only arrive at an approximation of the true power of the t-test if the assumption of equal variances is violated. However, Cohen (1977) argues that the approximation will be "adequate" for most purposes.

As a general warning, you should keep in mind that G*Power results are valid if the statistical assumptions underlying the tests are met (e.g., normal distributions and homogeneous variances within cells). Some work has been done on the robustness of these tests, that is, the deviation of actual from nominal alpha error probabilities when the distribution assumptions are not met. However, little is known about a test's power given a misspecified distribution model. Thus, G*Power results may or may not be useful approximations to the true power values in such cases.

In F-Test (ANOVA), we assume that there is an equal number of subjects in each group. If, in a post-hoc or compromise power analysis, the total sample size is not a multiple of the group size, then the power analysis will be based on the average group size (a noninteger value). G*Power will inform you if this is the case.

Note also that in a priori power analyses, the sample size is usually rounded to the next multiple of the number of groups or cells in your design. This implies that the actual power of your test usually is slightly larger than the power you entered as a parameter.

The Ratio q:= beta/alpha

In a compromise power analysis, the ratio q := beta/alpha specifies the relative seriousness of both types of errors (cf. Cohen, 1965, 1988, p. 5).

For instance, if alpha errors appear twice as serious as beta errors, then you can risk a beta error which is twice as large as alpha, thus q = beta/alpha = 2/1 = 2. This value is what you would then insert as the "beta/alpha ratio" in a compromise power analysis.

Alternatively, if you think you'd rather not risk committing a beta error (e.g., a beta error is considered three times as important as an alpha error), then you would specify q = beta/alpha = 1/3 = 0.3333.

These choices depend on the different valences you associate with either outcome of the test. However, we suspect that in basic psychological research at least, q = beta/alpha = 1/1 = 1 is the rational choice most often.

Given your decision as to the relative seriousness of both types of errors, the problem is to calculate an optimum critical value for the test statistic which satisfies beta/alpha = q. This optimum critical value can be regarded as a rational compromise (hence the term "Compromise power analysis") between the demands for a low alpha-risk and a large power level, given a fixed sample size.

The Noncentrality Parameter

The noncentrality parameter of the t distribution is called delta, and that of the F and chi-square distributions is called lambda. Both measures increase as a function of N and the effect size postulated by H1. More detailed information about the relation among sample size, effect size, and the noncentrality parameter is also available.

For the t-test on means, the noncentrality parameter delta is

$$\delta = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

where

$$d = \frac{|\mu_1 - \mu_2|}{\sigma}$$

is Cohen's (1977, 1988, p. 40) effect size parameter for t tests for means, and n1 and n2 are the sample sizes in groups 1 and 2, respectively.

t-Test on Correlations

In the t-test on correlations, the noncentrality parameter delta is

$$\delta = \sqrt{\frac{\rho^2}{1 - \rho^2}\, N}$$

where N is the total sample size (i.e., the number of pairs of values) and rho is the population correlation coefficient according to H1 (i.e., Cohen's rho, see Cohen, 1977, 1988, p. 77-81).

Other t-Tests

In the Other t-Tests option we used f as an effect size measure (cf. Cohen, 1977, 1988, Chap. 8.2). The relation between delta and f is

$$\delta = f \sqrt{N}$$

F-Test (ANOVA), F-Test (MCR), and Other F-Tests

The standardized effect size measures f or f^2 are also used in power analyses for F-tests (F-Test (ANOVA), F-Test (MCR), and Other F-Tests). Their relation to the noncentrality parameter lambda of the noncentral F distribution is given by

$$\lambda = f^2 N, \quad \text{where} \quad f^2 = \frac{\rho^2}{1 - \rho^2}$$

and ρ^2 denotes the coefficient of determination in the population according to H1 (e.g., Koele, 1982, p. 514). For global ANOVA F-tests, ρ^2 is just eta^2 (η^2).

For special F-tests of main effects or interactions in complex ANOVA-designs, ρ^2 equals the partial eta^2.

Analogously, ρ^2 coincides with the (partial) squared multiple correlation in multiple regression/correlation F-tests (cf. Cohen, 1988, Chap. 9.2.1).
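As a small worked illustration of these relations, the following Python sketch converts a population coefficient of determination into f^2 and then into lambda. The function names and the numerical inputs (ρ^2 = 0.2, N = 60) are hypothetical values chosen only for the example.

```python
def f_squared(rho_squared):
    """Effect size f^2 from the population coefficient of determination."""
    return rho_squared / (1.0 - rho_squared)


def noncentrality_lambda(rho_squared, n_total):
    """Noncentrality parameter lambda = f^2 * N of the noncentral F distribution."""
    return f_squared(rho_squared) * n_total


# Hypothetical inputs: rho^2 = 0.2 and a total sample size of N = 60.
f2 = f_squared(0.2)                   # approximately 0.25
lam = noncentrality_lambda(0.2, 60)   # approximately 15
print(f2, lam)
```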

Chi-Square Tests

For Chi-Square tests based on m-cell contingency tables (m ∈ ℕ), Cohen (1977, 1988, Chap. 7) uses

$$w = \sqrt{\sum_{i=1}^{m} \frac{\left(p_1(i) - p_0(i)\right)^2}{p_0(i)}}$$

as an effect size measure, where p0(i) and p1(i) denote the cell probabilities for the i-th cell according to H0 and H1, respectively. Then

$$\lambda = w^2 N$$

is the noncentrality parameter of the noncentral chi-square distribution (Cohen, 1988, p. 549).
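To make the definition of w concrete, here is a minimal Python sketch that computes w and lambda for a four-cell table. The cell probabilities and the sample size N = 100 are made-up illustration values, and `effect_size_w` is our own hypothetical helper name.

```python
import math


def effect_size_w(p0, p1):
    """Cohen's w from cell probabilities under H0 (p0) and under H1 (p1)."""
    if len(p0) != len(p1):
        raise ValueError("p0 and p1 must have the same number of cells")
    return math.sqrt(sum((b - a) ** 2 / a for a, b in zip(p0, p1)))


# Hypothetical example: H0 assumes equal cell probabilities, H1 favours cell 1.
p0 = [0.25, 0.25, 0.25, 0.25]
p1 = [0.40, 0.20, 0.20, 0.20]

w = effect_size_w(p0, p1)   # sqrt(0.12), roughly 0.346
lam = w ** 2 * 100          # lambda = w^2 * N with N = 100, roughly 12
print(w, lam)
```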

Actual Power

When you use G*Power to perform an a priori power analysis, the program calculates the 'exact' sample size for you. Assume that this exact sample size for a t-test is 60.70. Of course, you cannot recruit 60.70 subjects. Therefore, G*Power rounds to the next reasonable integer for your t-test, which would be 62 (two groups of 31 subjects each).

However, 62 is larger than 60.70, and one way to express what this means is to say that, with 62 subjects and all other parameters being equal, your t-test has more power to detect an effect than it would have given the 'exact' number of 60.70 subjects. This 'inflated' power value is displayed as Actual power. Note that in this way G*Power guarantees that with the sample size computed for an a priori power analysis, the power of your test is always at least the power you specified.
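The rounding step can be mimicked as follows. This is our reading of "the next reasonable integer" (the next multiple of the number of groups at or above the exact value), written as a hypothetical helper rather than as G*Power's actual code.

```python
import math


def round_up_to_groups(exact_n, n_groups):
    """Smallest multiple of n_groups that is not below the 'exact' sample size."""
    return math.ceil(exact_n / n_groups) * n_groups


total = round_up_to_groups(60.70, 2)
print(total, total // 2)  # 62 31
```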

  1. List of Tests

o t-Test on Means
  • Two group t-test, equal group sizes, equal sigma
  • Two group t-test, unequal group sizes, equal sigma
  • Two group t-test, equal group sizes, unequal sigma

o t-Test on Correlations

o Other t-Tests
  • Matched-Pairs t-Test
  • One-Sample t-Test
  • z-Test
  • Wilcoxon-Mann-Whitney U test (plus hints for other nonparametric tests)

H0: μA - μB = 0
H1: μA - μB = c, c ≠ 0.

Which total sample size do you need such that the probability of obtaining a t statistic equal to or larger than a critical value is alpha = 0.05 under H0 and 1-beta = .9 under H1?

Assume that the difference in means between the groups postulated by your H1 is equal to one half of the standard deviation, thus d = 0.5 (e.g., μA = 10, μB = 12, σ = 4).

Select:
Type of Power Analysis: A priori
Type of Test: t-Test (means), two-tailed
Accuracy mode calculation

Input:
Alpha:.
Power (1-beta):.
Effect size "d": 0.5 (To calculate the effect size from μA, μB, and σ, simply click "Calc d", insert the means and the standard deviation, and click "Calc & Copy".)

Result:
Total sample size: 172
Actual power: 0.
Critical t: t(170) = 1.
Delta: 3.

Assume further that you do not have enough money to pay 172 subjects. However, 140 would seem feasible. Which critical t would still result in a "fair" test of your H1? We use a compromise power analysis to compute an optimum critical value for the test statistic which satisfies the ratio q := beta/alpha. This optimum critical value can be regarded as a rational compromise between the demands for a low alpha-risk and a large power level, given a fixed sample size.

Select:
Type of Power Analysis: Compromise
Type of Test: t-Test (means), two-tailed
Accuracy mode calculation

Input:
n1:
n2:
Effect size "d": 0.
Beta/alpha ratio: 2 (That is, we are willing to commit a beta error twice as large as our alpha error.)

Result:
alpha: 0.
Power (1-beta): 0.
Critical t: t(138) = 1.
Delta: 2.

Two Group t-Test, Unequal Group Sizes, Equal Sigma

We have done a study in which, for some reason, the group sizes are not equal. In Group A we have 24 subjects; in Group B we have 33. What is the power of the t-test comparing the means of both groups, and how much power have we lost due to the unequal group sizes?

Select:
Type of Power Analysis: Post-hoc
Type of Test: t-Test (means), one-tailed (This time assume that we know the direction of the difference between the groups.)
Accuracy mode calculation

Input:
Alpha:.
Effect size "d": 0.8 (We expect "large" effects according to the effect size conventions of Cohen, 1977.)
n1:
n2:

Result:
Power (1-beta): 0.
Critical t: t(55) = 1.
Delta: 2.
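One way to see where the loss comes from is to compare the noncentrality parameter for the unequal groups with the one a balanced design of the same total size would give. The sketch below does this with the delta formula for the t-test on means; the balanced benchmark of 28.5 subjects per group is a purely hypothetical, noninteger reference point, and the function name is our own. The power values themselves still require the noncentral t distribution.

```python
import math


def delta_two_groups(d, n1, n2):
    """Noncentrality parameter delta = d * sqrt(n1 * n2 / (n1 + n2))."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Group sizes from the example (24 and 33) with a "large" effect d = 0.8.
delta_unequal = delta_two_groups(0.8, 24, 33)

# Hypothetical balanced benchmark with the same total N = 57.
delta_balanced = delta_two_groups(0.8, 28.5, 28.5)

print(round(delta_unequal, 3), round(delta_balanced, 3))  # 2.982 3.02
```

The smaller delta for the unequal design means that the distribution under H1 lies closer to the one under H0, which is why some power is lost.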

Two Group t-Test, Equal Group Sizes, Unequal Sigma

What do you do if σA ≠ σB?

This is not normally a problem because the t-test is known to be quite robust, at least as long as the group sizes are equal. Cohen (1977, p. 44) suggests adjusting sigma to sigma':

$$\sigma' = \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{2}}$$

Simply use sigma' instead of sigma to calculate the effect size using the "Calc 'd'" option, then proceed as in the examples given above.
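The adjustment can be sketched in a few lines of Python. The standard deviations (3 and 5) and means (10 and 12) below are hypothetical illustration values, and `adjusted_sigma` is our own helper name, not a G*Power function.

```python
import math


def adjusted_sigma(sigma_a, sigma_b):
    """Cohen's sigma': root mean square of the two population standard deviations."""
    return math.sqrt((sigma_a ** 2 + sigma_b ** 2) / 2.0)


# Hypothetical populations: sigma_A = 3, sigma_B = 5, means 10 and 12.
sigma_prime = adjusted_sigma(3.0, 5.0)   # sqrt(17), roughly 4.123
d = abs(10.0 - 12.0) / sigma_prime       # effect size based on sigma'
print(round(sigma_prime, 3), round(d, 3))
```

As noted in the text, this correction is meant for equal group sizes only.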

reason being that there is no definite association between N and df (the degrees of freedom). You need to tell G*Power the values of both N and df explicitly.

We consider 3 examples here:

  • Matched-pairs t-tests
  • One-sample t-tests
  • z-Tests

In addition, we give hints on how to do power analyses for nonparametric tests such as the

  • Wilcoxon-Mann-Whitney U test

Matched-Pairs t-Tests

In t-tests for matched pairs, we have differences of the values from N matched pairs,

$$\begin{aligned} y_1 &= x_{A1} - x_{B1} \\ &\;\;\vdots \\ y_N &= x_{AN} - x_{BN} \end{aligned}$$

The H0 we test is that the pairs do not differ, that is, the population mean μY of the differences is zero. More formally,

H0: μY = 0
H1: μY = c, c ≠ 0.

When computing the standard deviation σY of the distribution of differences, we need to take into account the correlation r between A and B in the population:

$$\sigma_Y = \sqrt{\sigma_A^2 + \sigma_B^2 - 2 r \sigma_A \sigma_B}$$

where σA and σB are the standard deviations of x in the populations A and B, respectively, and r is the population correlation between A and B as paired.

In matched-pairs t-tests, N is the total sample size (i.e., total number of pairs), df = N-1, and the effect size is:

$$f = \frac{\mu_Y}{\sigma_Y}$$

where μY is the difference between the means as specified by H1.

Example

Assume that we are faced with a repeated measures design in which the same subject is observed under each of two treatments. We have data from 40 subjects. Previous research has shown that the standard deviation of the differences is approximately 20. We consider mean differences of 8 or larger as important. Thus, the effect size we need to enter is f = 8/20 = 0.4. We fix alpha at 0.05.

Select:
Type of Power Analysis: Post-hoc
Type of Test: Other t-Tests, two-tailed
Accuracy mode calculation

Input:
Alpha: 0.
Effect size "f": 0.
N: 40
df: 39 (df = N-1 in matched-pairs t-tests.)

Result:
Power (1-beta): 0.
Critical t: t(39) = 2.
Delta: 2.
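The inputs for this matched-pairs example can be prepared with a short Python sketch. It assumes the relations given in this document (f = μY / σY, df = N-1, delta = f times the square root of N); the function names are hypothetical, and the second call uses made-up standard deviations and a made-up correlation only to illustrate the formula for σY.

```python
import math


def sigma_of_differences(sigma_a, sigma_b, r):
    """Standard deviation of the pair differences, given the correlation r."""
    return math.sqrt(sigma_a ** 2 + sigma_b ** 2 - 2.0 * r * sigma_a * sigma_b)


def matched_pairs_inputs(mean_diff, sigma_y, n_pairs):
    """Return (f, df, delta) for the 'Other t-Tests' option."""
    f = mean_diff / sigma_y
    df = n_pairs - 1
    delta = f * math.sqrt(n_pairs)
    return f, df, delta


# Values from the example: mean difference 8, sigma_Y = 20, N = 40 pairs.
f, df, delta = matched_pairs_inputs(8.0, 20.0, 40)
print(f, df, round(delta, 4))  # 0.4 39 2.5298

# Hypothetical illustration of sigma_Y: sigma_A = sigma_B = 20 and r = 0.5.
sigma_y = sigma_of_differences(20.0, 20.0, 0.5)
```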

As we said before, you cannot perform a priori power analyses directly, but you can, of course, perform repeated post-hoc power analyses, adjusting N and df until you arrive at the power value you desire. For instance, if you want, in the above example, the power to be .95, you simply increase N and df (= N-1) until the power is as close as possible to .95 (which will be the case with N = 84 and df = 83 for the present example).

One-Sample t-Tests

We want to compare the mean of a population from which we sample to a constant c. The effect size index f is computed according to

$$f = \frac{|\mu - c|}{\sigma}$$

where μ and σ are the mean and the standard deviation in the population, respectively. As Cohen (1977, p. 46) writes, the interpretations of f (Cohen's d3') as well as the effect size conventions are identical to those for d.

N is the total sample size, and df = N-1. Thus, we're all set to do this power analysis analogously to the one for matched pairs t-tests (above).