

































































Following data collection from some experiment, there arise two goals which should guide subsequent data analysis and data presentation. The first goal is for the data collector him or herself to understand the data as thoroughly as possible in terms of (1) how they may bear on the specific question that the experiment was designed to address, (2) what if any surprises the data may have produced, and (3) what, if anything, such surprises may imply about the original questions, related questions, or anything else. The second goal is to determine how to present the data to the scientific community in a manner that is as clear, complete, and intuitively compelling as possible. This second goal is intimately entwined with the first: Whatever data analysis and presentation techniques best instill understanding in the investigator to begin with are generally also optimal for conveying the data’s meaning to the data’s eventual consumers.
So what are these data-analysis and data-presentation techniques? It is not possible in a single chapter or even in a very long book to describe them all, because there are an infinite number of them. Although most practicing scientists are equipped with a conceptual foundation with respect to the basic tools of data analysis and data presentation, such a foundation is far from sufficient: it is akin to an artist’s foundation in the tools of color mixing, setting up an easel, understanding perspective, and the like. To build on this analogy, a scientist analyzing any given experiment is like an artist rendering a work of art: Ideally the tools comprising the practitioner’s foundation should be used creatively rather than dogmatically to produce a final result that is beautiful, elegant, and interesting, instead of ugly, convoluted, and prosaic. My goal in this chapter is to try to demonstrate how a number of data-analysis techniques may be
The writing of this chapter was supported by NIMH grant MH41637 to G. Loftus. I thank the late Merrill Carlsmith for introducing me to numerous of the techniques described in this chapter and David Krantz for a great deal of more recent conceptual enlightenment about some of the more subtle aspects of hypothesis-testing, confidence intervals and planned comparisons.
Chapter to appear in Stevens' Handbook of Experimental Psychology, Volume IV
STEVENS HANDBOOK OF EXPERIMENTAL PSYCHOLOGY Page 2
used creatively in an effort to understand and convey to others the meaning and relevance of a data set. It is not my intent to go over territory that is traditionally covered in statistics texts. Rather, I have chosen to focus on a limited, but powerful, arsenal of techniques and associated issues that are related to, but are not typically part of, a standard statistics curriculum. I begin this chapter with an overview of data analysis as generically carried out in psychology, accompanied by a critique of some standard procedures and assumptions, with particular emphasis on a critique of null hypothesis significance testing (NHST). Next, I discuss a collection of topics that represent some supplements and/or alternatives to the kinds of standard analysis procedures about which I will have just complained. These discussions include (1) a description of various types of pictorial representations of data, (2) an overview of the use of confidence intervals which, I believe, constitutes an attractive alternative to NHST, (3) a review of the benefits of planned comparisons which entail an analysis of percent between-conditions variance accounted for, (4) a description of techniques involving percent total variance accounted for, (5) a brief set of suggestions about presentation of results based on mathematical models (meant to complement the material in the Myung & Pitt chapter) and finally, (6) a somewhat evangelical description of what I have termed equivalence techniques.
My main expositional strategy is to illustrate through example. In most instances, I have invented experiments and associated data to use in the examples. This strategy has the disadvantage that it is somewhat divorced from the real world of psychological data, but has the dominating advantage that the examples can be tailored specifically to the illustration of particular points.
The logic and mathematical analysis in this chapter are not meant to be formal or complete. For proofs of various mathematical assertions that I make, it is necessary to consult a mathematically oriented statistics text. There are a number of such texts; my personal favorite is Hays (1973), and where appropriate, I supply references to Hays along with specific page numbers.
My choice of material and the recommendations that I selected to include in this chapter have been strongly influenced by 35 years of experience in reviewing and journal editing. In the course of these endeavors I have noticed an enormous number of data-analysis and data-presentation techniques that have been sadly inimical to insight and clarity; conversely, I have noticed enormous numbers of missed opportunities to analyze and present data in such a way that the relevance and importance of the findings are underscored and clearly conveyed to the intended recipients. Somewhere in this chapter is an answer to approximately 70% of these complaints. It is my hope that, among other things, this chapter will provide a reference to which I can guide authors whose future work passes across my desk, as an alternative, that is, to trying to solve what I believe to be the world’s data analysis and presentation problems one manuscript at a time.
Suppose that a memory researcher were interested in how stimulus presentation time affects memory for a list of words as measured in a free-recall paradigm. In a hypothetical experiment to answer this question, the investigator might select J = 5 presentation times consisting of 0.5, 1.0, 2.0, 4.0, and 8.0 sec/word and carry out an experiment using a between-subjects design in which n = 20 subjects are assigned to each of the 5 word-duration conditions; hence, N = 100 subjects in all. Each subject sees 20 words, randomly selected from a very large pool of words. For each subject, the words are presented sequentially on a computer screen, each word presented for its appropriate duration. Immediately following presentation of the last word, the subject attempts to write down as many of the words as possible. The investigator then calculates the proportion of words correctly recalled (out of the 20 possible) for each subject.
The results of this experiment therefore consist of 100 numbers: one for each of the 100 subjects. How are these 100 numbers to be treated in order to address the original question of how memory performance is affected by presentation time? There are two steps to this data-interpretation process. The first is the specification of a mathematical model^1, within the context of which each subject’s experimentally observed number results from assumed events occurring within the subject. There are an infinite number of ways to formulate such a mathematical model. The most widely used formulation, on which I will focus in this chapter, is referred to as the linear model or LM.
The second step in data interpretation is to carry out a process by which the mathematical model, once specified, is used to answer the ques-
(^1) I have sometimes observed that the term "mathematical model" casts fear into the hearts of many researchers. However, if it is numbers from an experiment that are to be accounted for, then the necessity of some kind of mathematical model is logically inevitable.
These assumptions imply that $X_{ij}$, the score of Subject i in Condition j, is equal to
$X_{ij} = \mu + \alpha_j + e_{ij}$
which in turn implies that the $X_{ij}$’s within each condition j are distributed with a variance of $\sigma^2$.
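To make the model concrete, the following Python sketch simulates the hypothetical free-recall experiment under the LM. Only J = 5, n = 20, and the five presentation times come from the text; the values of mu, the alpha_j, and sigma are illustrative assumptions, as is the use of normally distributed errors. The sketch also ignores the fact that a proportion is bounded by 0 and 1.

```python
import random
import statistics

# Design values taken from the text: J = 5 conditions, n = 20 subjects each.
DURATIONS = [0.5, 1.0, 2.0, 4.0, 8.0]   # presentation times in sec/word
N_PER_CONDITION = 20

# Hypothetical parameter values (assumptions for illustration only).
MU = 0.50                                   # grand mean proportion correct
ALPHAS = [-0.20, -0.10, 0.00, 0.10, 0.20]   # condition effects; they sum to zero
SIGMA = 0.10                                # common within-condition s.d.


def simulate_lm(seed=1):
    """Return {duration: [scores]} where each score is mu + alpha_j + e_ij."""
    rng = random.Random(seed)
    data = {}
    for duration, alpha in zip(DURATIONS, ALPHAS):
        data[duration] = [MU + alpha + rng.gauss(0.0, SIGMA)
                          for _ in range(N_PER_CONDITION)]
    return data


scores = simulate_lm()
for duration, xs in scores.items():
    print(f"{duration:>4} sec/word: mean = {statistics.mean(xs):.3f}, "
          f"variance = {statistics.variance(xs):.4f}")
```

Each printed sample mean estimates mu + alpha_j for its condition, and each sample variance estimates the common sigma squared.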
Equipped with a mathematical model, the investigator’s next step in the data-analysis process is to use the model to arrive at answers to the question at hand. As noted, the most pervasive means by which this is done is via NHST, which works as follows.
$H_0: \alpha_1 = \alpha_2 = \dots = \alpha_J = 0$
or equivalently, that,
$H_0: \mu_1 = \mu_2 = \dots = \mu_J$
Mathematically the null hypothesis may be viewed as a single-dimensional hypothesis: the only variation permissible is the single value of the J population means.
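In a one-way between-subjects design, a null hypothesis of this form is conventionally evaluated with an F ratio, the between-conditions mean square divided by the within-conditions mean square. The sketch below, which assumes that standard one-way ANOVA computation, uses three tiny made-up groups so that the arithmetic can be checked by hand; the function name `one_way_f` is hypothetical.

```python
import statistics


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way between-subjects design."""
    all_scores = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_scores)
    # Between-conditions sum of squares: condition means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-conditions sum of squares: scores around their own condition mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up scores: condition means are 2, 3, and 4; the grand mean is 3.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_ratio, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")   # prints F(2, 6) = 3.00
```

Here the between-conditions sum of squares is 6 on 2 degrees of freedom and the within-conditions sum of squares is 6 on 6 degrees of freedom, giving F = 3/1 = 3.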
with the mean, the overall error variance cannot be assumed to be fully constant. Nonetheless, the linear model formulated would still be a very useful approximation.
$H_0$, while the larger the F, the greater the evidence for $H_1$.
The LM can be used without proceeding on to NHST and NHST can be used with models other than the LM. However, a conjunction of the LM and NHST is used in the vast majority of experiments within the social sciences and in other sciences, notably the medical sciences, as well. Both the LM and NHST have shortcomings with respect to the insight into a data set that they provide. However, it is my opinion that the shortcomings of NHST are more serious than the shortcomings of the LM. In the next two subsections, I will briefly describe the problems with the LM, and I will then provide a somewhat lengthier discussion of the problems with NHST.
(^3) More precisely, power can be represented as a function over the J-dimensional space, mentioned earlier, that corresponds to the J-dimensional alternative hypothesis.
DATA ANALYSIS, INTERPRETATION AND PRESENTATION Page 5
Problems with the LM
The LM is what might be termed an off-the-shelf model: That is, the LM is at least a plausible model that probably bears at least some approximation to reality in many situations. However, its pervasiveness often tends to blind investigators to alternative ways of representing the psychological processes that underlie the data in some experiment.
More specifically, although there are different LM equations corresponding to different experimental designs, all of them are additive with respect to the dependent variable: that is, the dependent variable is assumed to be the sum of a set of theoretical parameters (see, for example, Table 1 and Equation 1, below). The simplicity of this arrangement is elegant, but it de-emphasizes other kinds of equations that might better elucidate the underlying psychological processes.
I will illustrate this point in the context of the classic question: What is the effect of degree of original learning on subsequent forgetting and more particularly, does forgetting rate depend on degree of original learning? My goal is to show how the LM leads investigators astray in their attempts to answer this question, and that an alternative to the LM provides considerably more insight.
Slamecka and McElree (1983) reported a series of experiments with the goal of determining the relation between degree of original learning and forgetting rate. In their experiments subjects studied word lists to one of two degrees of proficiency. Subjects’ memory performance then was measured following forgetting intervals of 0, 1, or 5 days. Within the context of the LM, the relevant equation relating mean performance, $\mu_{jk}$, to delay interval j and initial learning level k is,
$\mu_{jk} = \mu + \alpha_j + \beta_k + \gamma_{jk}$ (Eq. 1)
where $\alpha_j$ is the effect of delay interval j (presumably $\alpha_j$ monotonically decreases with increasing j), $\beta_k$ is the effect of degree of learning k (presumably $\beta_k$ monotonically increases with increasing k) and finally, $\gamma_{jk}$, a term applied to each combination of delay interval and learning level, represents the interaction between delay interval and learning level.
Within the context of the LM, two theoretical components are construed as independent if there is no interaction between them. In terms of Equation 1, degree of learning and forgetting are independent if all the $\gamma_{jk}$’s are equal to zero. The critical null hypothesis tested by Slamecka and McElree was therefore that $\gamma_{jk} = 0$ for all j, k. They used their resulting failure to reject this null hypothesis as evidence for the proposition that forgetting rate is independent of degree of original learning.
This conclusion is dubious for a variety of reasons. For present purposes, I want to emphasize that Slamecka and McElree’s analysis technique (which Slamecka, 1985, vigorously defended) emerged quite naturally from the LM-based Equation 1 above. Because the LM is so simple, and is so ingrained as a basis for data analysis, it seemed, and still seems, unnatural for workers in the field to consider alternatives to the LM.
What would such an alternative look like? In the final section of this chapter, I will provide some illustrations of alternatives to the LM. In the present context, I will briefly discuss an alternative model within which the learning-forgetting independence issue can be investigated. This model, described by Loftus (1985a; 1985b; see also Loftus & Bamber, 1990), rests on an analogy between forgetting and radioactive decay. Consider two pieces of radioactive material, a large piece (say 9 gms) and a small piece (say 5 gms). Suppose the decay rates are the same in the sense that both can be described by the equation,
$M = M_0 e^{-kd}$ (Eq. 2)
where M is the remaining mass after an interval of d days, $M_0$ is the original mass, and k is the decay constant^4. The Equation-2 decay curves corresponding to the two different chunks are shown in Figure 1, with the same decay constant, k = 0.5, describing the two curves.
These curves could, of course, be described by the LM (Equation 1). The $\gamma_{jk}$ terms would be decidedly nonzero, reflecting the interaction that is represented in Figure 1 by the decreasing vertical distance between the two decay curves with increasing decay time. Thus, using the LM, and Slamecka and McElree’s logic, one concludes that large-chunk decay is faster than small-chunk decay.
This conclusion would, in a very powerful sense, be incorrect: As noted above, the Figure 1 decay curves were generated by equations having identical decay rates (k = 0.5). The key to understanding this error is that independence of radioactive decay rates is not associated with lack of interaction within the context of the LM. Instead, it is associated with another kind of lack of in-
(^4) This is not a technically correct description of radioactive decay, as radioactive material actually decays to some inert substance instead of to nothing, as implied by Equation 2. For the purposes of this discussion, the "decaying material" may be thought of as that portion of the material that actually does decay, and the logic is unaffected.
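The radioactive-decay argument can be checked numerically. The Python sketch below evaluates Equation 2 for the 9-gm and 5-gm pieces with k = 0.5 (values from the text) at a hypothetical set of decay intervals of 0 to 5 days, and then fits the additive decomposition of Equation 1 to the resulting cell values. The choice of intervals and all variable names are assumptions made for illustration.

```python
import math

K = 0.5                                   # decay constant, from the text
MASSES = {"large": 9.0, "small": 5.0}     # original masses in gms, from the text
DAYS = [0, 1, 2, 3, 4, 5]                 # hypothetical decay intervals


def remaining_mass(m0, d, k=K):
    """Equation 2: mass remaining after d days."""
    return m0 * math.exp(-k * d)


# Cell values for every (delay, piece) combination.
cells = {(d, name): remaining_mass(m0, d)
         for d in DAYS for name, m0 in MASSES.items()}

# Additive (Equation 1) decomposition: gamma = cell - row mean - column mean + grand mean.
grand = sum(cells.values()) / len(cells)
row_mean = {d: sum(cells[(d, n)] for n in MASSES) / len(MASSES) for d in DAYS}
col_mean = {n: sum(cells[(d, n)] for d in DAYS) / len(DAYS) for n in MASSES}
gamma = {(d, n): cells[(d, n)] - row_mean[d] - col_mean[n] + grand
         for (d, n) in cells}

differences = [cells[(d, "large")] - cells[(d, "small")] for d in DAYS]
ratios = [cells[(d, "large")] / cells[(d, "small")] for d in DAYS]

for d, diff, ratio in zip(DAYS, differences, ratios):
    print(f"day {d}: difference = {diff:.3f} gms, ratio = {ratio:.3f}, "
          f"gamma(large) = {gamma[(d, 'large')]:+.3f}")
```

The vertical difference between the curves shrinks with delay, so the interaction terms are clearly nonzero, yet the ratio of the two masses stays fixed at 9/5 because both curves share the same decay constant.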
carried out. Over the past 10 years, however, there has at least been some recognition of the issues raised by these articles; this recognition has resulted in APA and APS task forces and symposia on the topic, editorials explicitly questioning the use of NHST (e.g., Loftus, 1993b), and occasional calls for the banning of NHST (with which I do not agree), along with a small but still dimly perceptible shift away from exclusive reliance on NHST as a means of interpreting and understanding data.
As I have suggested earlier in this chapter, problems with the LM, such as those described above, pale in comparison to problems with NHST. These problems have been reviewed in the books and articles cited in Footnote 3, and it is not my goal here to provide a detailed rehash of them. Instead, I will sketch them here briefly; the reader is referred to the cited articles for more detailed information. I should note, in the interests of full disclosure, that a number of well-reasoned arguments have been made in favor of assigning NHST at least a minor supporting role in the data-comprehension drama. The reader is directed to Abelson (1995) and Krantz (1999) for the best of such arguments.
The major difficulties with NHST are these.
Information Loss as a Result of Binary Decision Processes
A data set is often quite rich. As a typical example, a 3x5 factorial design contains 15 conditions and hence 15 sample means to be accounted for (ignoring of course, per the LM, the raw data from within each condition along with less favored statistics such as the variance, the kurtosis and so on). However, a standard ANOVA reduces this data set to three bits of information: Rejection or failure to reject the null hypotheses corresponding to the effects of Factor 1, Factor 2, and the interaction. Granted, one can carry out additional post-hoc tests or simple-effects tests, but the end result is still that the complex data set is understood, via the NHST process, only in terms of a series of binary decisions rather than as a unified pattern. This is a poor basis for acquiring the kind of gestalt that is necessary for insight and gut-level understanding of a data set.
The Implausibility of the Null Hypothesis
Consider the hypothetical experiment described at the beginning of this chapter. There were five conditions, involving five exposure durations in a free-recall experiment. In a standard ANOVA, the null hypothesis would be:
$\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$ (Eq. 4)
where the $\mu_j$’s refer to the population means of the five conditions. Note here that the “=” signs in Equation 4 must be taken seriously: Equal means equal to an infinite number of decimal places. If the null hypothesis is fudged to specify that “the population means are about equal” then the logic of NHST collapses, or at least must be supplemented to include a precise definition of what “about equal” means.
As has been argued by many, a null hypothesis of the sort described by Equation 4 cannot literally be true. Meehl (1967) makes the argument most eloquently, stating,
“Considering...that everything in the brain is connected with everything else, and that there exist several ‘general state-variables’ (such as arousal, attention, anxiety and the like) which are known to be at least slightly influenceable by practically any kind of stimulus input, it is highly unlikely that any psychologically discriminable situation which we apply to an experimental subject would exert literally zero effect on any aspect of performance.”
Alternatively, the $\mu_j$’s can be viewed as measurable values on the real-number line. Any two of them being identical implies that their difference (also a measurable value on the real-number line) is exactly zero, which has a probability of zero.^7
And therein lies a serious problem: It is meaningless to reject a null hypothesis that is impossible to begin with. An analogy makes this clear: Suppose an astronomer were to announce that “Given our data, we have rejected the null hypothesis that Saturn is made of green cheese.” Although it is unlikely that this conclusion would be challenged, a consensus would doubtless emerge that the astronomer must have been off his rocker for even considering such a null hypothesis to begin with. Strangely, psychologists who make equally meaningless statements on a routine basis
(^7) A caveat is in order here. Most null hypotheses are of the sort described by Equation 4; that is, they are quantitative, specifying a particular set of relations among a set of population parameters. It is possible, in contrast, for a null hypothesis to be qualitative (see, e.g., Frick, 1995, for a discussion of this topic). An example of such a hypothesis, described by Greenwald et al. (1996), is that the defendant in a murder case is actually the murderer. This null hypothesis could certainly be true; however, the kind of qualitative null hypothesis that it illustrates constitutes the exception rather than the rule.
continue to be regarded as entirely sane. (Even stranger is the common belief that an α-level of .05 implies that an error is made in 5% of all experiments in which the null hypothesis is rejected. This is analogous to saying that, of all planets reported not to be made of green cheese, 5% of them actually are made of green cheese.)
Decision Asymmetry
Putting aside for the moment the usual impossibility of the null hypothesis, there is a decided imbalance between the two types of errors that can be made in a hypothesis-testing situation. The probability of a Type-I error, α, can be, and is, set by appropriate selection of a summary-score criterion. However, the probability of a Type-II error, β, is, as noted earlier, generally unknowable because of the lack of a quantitative alternative hypothesis. The consequence of this situation is that rejecting the null hypothesis is a “real” decision, while failing to reject the null hypothesis is, as the phrase suggests, a nondecision: It is simply an admission that the data do not provide sufficient information to support a clear decision.
Accepting $H_0$
The teaching of statistics generally emphasizes that “we fail to reject the null hypothesis” does not mean the same thing as “we accept the null hypothesis”. Nonetheless, the temptation to accept the null hypothesis (usually implicitly so as to not brazenly disobey the rules) often seems to be overwhelming, particularly when an investigator has an investment in such acceptance. As I have noted in the previous section, accepting a typical null hypothesis involves faulty reasoning anyway because a typical null hypothesis is impossible. However, particularly in practically-oriented situations, an investigator is justified in accepting the null hypothesis “for all intents and purposes” assuming that the investigator has convincingly shown that there is adequate statistical power (see Cohen, 1990; 1994). Such a power analysis is most easily carried out by computing some kind of confidence interval (described in detail below) which would allow a meaningful conclusion such as “the population mean difference between Conditions 1 and 2 is, with 95% confidence, between −ε and +ε” where ε is a sufficiently small number that the actual difference between Conditions 1 and 2 is inconsequential from a practical perspective.
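A minimal sketch of this confidence-interval reasoning follows. It computes an approximate 95% confidence interval for the difference between two independent means and checks whether the whole interval lies within ±ε. The scores, the value of ε, and the function names are hypothetical, and the interval uses a normal critical value in place of Student's t, an approximation that is adequate only when both samples are large.

```python
import math
import statistics
from statistics import NormalDist


def mean_difference_ci(xs, ys, confidence=0.95):
    """Approximate CI for mean(xs) - mean(ys), two independent groups.

    Assumption: a normal critical value is used instead of Student's t,
    which is reasonable only for large samples.
    """
    diff = statistics.mean(xs) - statistics.mean(ys)
    se = math.sqrt(statistics.variance(xs) / len(xs)
                   + statistics.variance(ys) / len(ys))
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return diff - z * se, diff + z * se


def within_epsilon(ci, epsilon):
    """True if the entire interval lies between -epsilon and +epsilon."""
    low, high = ci
    return -epsilon <= low and high <= epsilon


# Hypothetical scores: 200 per condition, condition means differing by 0.01.
condition_1 = [4.00 + 0.1 * ((i % 5) - 2) for i in range(200)]
condition_2 = [4.01 + 0.1 * ((i % 5) - 2) for i in range(200)]

ci = mean_difference_ci(condition_1, condition_2)
print(f"95% CI for the mean difference: ({ci[0]:.3f}, {ci[1]:.3f})")
print("within +/-0.05:", within_epsilon(ci, 0.05))
print("within +/-0.02:", within_epsilon(ci, 0.02))
```

Whether the null hypothesis can be accepted "for all intents and purposes" thus depends on the chosen ε: the same interval supports practical equivalence for a tolerance of 0.05 but not for a tolerance of 0.02.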
The misleading dichotomization of “p < .05” vs. “p > .05” results
As indicated in his 1962 editorial, summarized earlier, Arthur Melton considered an observed p-value of .05 to be maximal for acceptance of an article. Almost four decades later, more or less this same convention holds sway: Who among us has not observed the heartrending spectacle of a student or colleague struggling to somehow transform a vexing 0.051 into an acceptable 0.050?
This is bizarre. The actual difference between a data set that produces a p-value of 0.051 versus one that produces a p-value of 0.050 is, of course, minuscule. Logically, very similar conclusions should issue from both data sets, and yet they do not: The .050 data set produces a “reject the null hypothesis” conclusion, while the .051 data set produces a “fail to reject the null hypothesis” conclusion. This is akin to a chaotic situation in which small initial differences distinguishing two situations lead to vast and unpredictable eventual differences between the situations.
The most obvious consequence of this situation is that the lucky recipient of the .050 data set gets to publish, while his unlucky .051 colleague does not. There is another consequence, however, which is more subtle but probably more insidious: The reject/fail-to-reject dichotomy keeps the field awash in confusion and artificial controversy. This is because investigators, like most humans, are loath to make and stick to conclusions that are both weak and complicated, like “we fail to reject the null hypothesis.” Instead investigators are prone to (often unwittingly) transform the conclusion into the stronger and simpler, “we accept the null hypothesis.” Thus two similar experiments (one in which the null hypothesis is rejected and one in which the null hypothesis is not rejected) can and often do lead to seemingly contradictory conclusions: “the null hypothesis is true” versus “the null hypothesis is false.” The inevitable head-scratching and subsequent flood of “critical experiments” that are generated by such “failures to replicate” may well constitute the single largest source of wasted time in the practice of psychology.
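How small the difference between a .050 and a .051 result is can be illustrated with a short calculation. The sketch below assumes, purely for illustration, a two-sided test based on a standard normal (z) statistic, and finds the z values that yield exactly those two p-values; the function name is hypothetical.

```python
from statistics import NormalDist


def z_for_two_sided_p(p):
    """z statistic whose two-sided standard-normal p-value equals p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


z_050 = z_for_two_sided_p(0.050)
z_051 = z_for_two_sided_p(0.051)
print(f"z for p = .050: {z_050:.4f}")
print(f"z for p = .051: {z_051:.4f}")
print(f"difference:     {z_050 - z_051:.4f}")
```

Under this assumption the two test statistics differ by less than one hundredth of a standard error, yet they fall on opposite sides of the conventional decision criterion.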
The counternull
Robert Rosenthal (see chapter this volume) has suggested a simple score called the “counternull” which serves to underscore the difficulty in accepting $H_0$. The counternull revolves around an increasingly common measure called “effect size,” which, essentially, is the mean magnitude of some effect (e.g., the mean difference between two conditions) divided by the standard deviation (generally pooled over the conditions). Obviously, all else equal, the smaller the effect size, the less inclined one is to reject $H_0$. Suppose, to illustrate, that in some experiment one found an effect size of 0.20 which was insufficiently large to reject $H_0$. As noted earlier, the temptation is often overwhelming to accept $H_0$ in such a situation because the world seems so much clearer that way. It is therefore useful to report Rosenthal’s counternull
can only be understood (and not very well understood at that) via a lengthy serial inspection of the numbers within it. In contrast a mere glance at the corresponding figure renders it entirely clear what is going on.
Graphs Versus Tables
Despite the obvious and dominating expositional advantage of figures over tables, data continue to be presented as tables at least as often as, or possibly more often than, as figures. For most of psychology’s history, the reason for this curious practice appeared to be founded in a prosaic matter of convenience: While it was relatively easy to construct a table of numbers on a typewriter, constructing a decent figure was a laborious undertaking. You drew a draft of the figure on graph paper, took the draft to an artist who invariably seemed to reside on the other side of the campus, following which you waited a week for the artist to produce a semi-finished version. Then you made whatever changes in the artist’s rendering that seemed appropriate. Then, you repeatedly iterated through this dreary process until the figure was eventually satisfactory. Finally, adding insult to injury, you had to take the finished drawing somewhere else to have its picture taken before the publisher would take it. Who needed that kind of hassle?
Today, obviously, things are much different, as electronic means of producing figures abound. To obtain information about popular graphing techniques, I conducted an informal survey in which I emailed all researchers in my email address book a request that they tell me what graphing technique(s) they use. One hundred sixty-one respondents used a total of 229 techniques, and the summarized results are provided in Table 3.
The results of this survey can be summarized as follows. Fewer than 25% of the application programs mentioned were statistical packages, perhaps because the most commonly used packages do not provide very flexible graphing options. Over a third of the applications were specialized drawing programs (CricketGraph, SigmaPlot, and KaleidaGraph were the most popular, but many others were mentioned). About 10% of the applications were general-purpose presentation programs (PowerPoint was the most popular) and the final one-third was general-purpose analysis programs, with Microsoft Excel accounting for
Table 3. Techniques for plotting data, as revealed by an informal survey.
Application Name     Frequency
Microsoft Excel      55
CricketGraph         27
SigmaPlot            22
KaleidaGraph         17
SPSS                 16
MATLAB               15
PowerPoint           10
DeltaGraph           9
S-plus               7
Mathematica          5
Microsoft Office     5
Systat               5
Igor/Igor Pro        4
Statistica           4
Gnuplot              3
Canvas               2
Hand plotting        2
StatView             3
ABC Graphics         1
Autocad              1
Axum                 1
c graph-pac          1
ClarisDraw           1
Grapher              1
Graphpad             1
Illustrator          1
JMP                  1
MacDraw              1
Maple 2D             1
Origin               1
PsiPlot              1
Quattro Pro          1
R                    1
SciPlot              1
Smartdraw            1
TK solver            1
[Figure 2 plot. Horizontal axis: Stimulus Duration (ms), with ticks from 0 to 350; vertical axis: Proportion Correct; curve parameter: % Contrast.]
Figure 2. Hypothetical data from a 5 (stimulus exposure duration) x 3 (stimulus contrast level) experiment. The dependent variable is proportion correct recall. Error bars represent standard errors.
the majority of these instances. Excel was by far the single leading application used for graphing. Seven respondents reported never graphing data, while 13 assigned the task to someone else. Two people reported still drawing graphs by hand. The remaining 139 respondents used some form of electronic graphing techniques.
At the present time, a brief description of graphing programs is supplied by Denis Pelli (personal communication) and can be found at http://vision.nyu.edu/Tips/RecSoftware.html.
Graph-Making Transgressions
I have tried to present a fairly bright picture of the ease of creating high-quality graphs. There is, however, a dark side of this process which is that a graph-creator has the capability of going wild with graphical features, thereby producing a graph that is difficult or impossible to interpret. David Krantz (personal communication) has noted that, for example, graphmakers often attempt to pack too much information into a graph, they produce graphs that are difficult to interpret without intense serial processing, they produce unintended and distracting emergent perceptual features, or they simply omit key information either in the graph itself or in the graph’s legend. There are, of course, many other such transgressions, treatments of which are found in the references provided in the next section. (My own personal bête noire is the 3-D bar graph.)
Other Graphical Representations
A discussion of graphs is limited in the sense that there are myriad means of visually presenting the results of a data set. It is beyond the scope of this chapter to describe all of them. For an initial set of pointers to sophisticated and elegant graphical procedures, the reader is directed to excellent discussions and examples in Tufte (1983; 1990), Tukey (1977), and Wainer & Thissen (1993). The main point I want to make is that pictorial representations almost always excel over their verbal counterparts as an efficient way of conveying the meaning of a data set.
Earlier, I described the LM as the standard model for linking a data set to the answer to the scientific question at hand. Somewhere in an LM equation (e.g., Equation 1) are always one or more error terms which represent the uncertainty in the world.
Using the LM to answer scientific questions is a two-stage process. The first stage is to somehow determine knowledge of relevant population parameters given measured sample statistics along with the inevitable statistical noise. The second stage is to use whatever knowledge emerges about population parameters to answer the question at hand as best as possible.
It seems almost self evident that the second stage, deciding the implications of the pattern of population parameters for the answer to the question at hand, should be the investigator’s fundamental goal. In contrast, the typical routine of statistical analysis, carrying out some procedure designed to cope with the noise-limited relation between the sample statistics and the corresponding population parameters, should be viewed as a necessary but boring nuisance. If the real world suddenly transformed into an ideal world in which experiments produced no statistical noise, it would be cause for rejoicing among investigators, as a major barrier to data interpretation would be absent.
There are two basic procedures for coping with statistical noise in quest of determining the relations between a set of sample statistics and their population counterparts. The first procedure entails attempting to determine what the pattern of population parameters isn’t, i.e., trying to reject a null hypothesis of some specific, usually uninteresting, pattern of population parameters, via NHST. The second procedure entails attempting to determine what the pattern of population parameters is, using the pattern of sample statistics as an estimate of the corresponding pattern of population parameters, along with error bars to represent the degree of conclusion-obscuring statistical noise. It is my (strong) opinion that trying to determine what something is is generally more illuminating than trying to determine what it isn't.
The use of error bars, e.g., in the form of 95% confidence intervals, around plotted sample statistics (usually sample means) is an ideal way of presenting data in such a way that the results of both of these data-analysis and interpretation stages are represented and that their relative importance is depicted.
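As a rough illustration of the second procedure, the sketch below computes a sample mean and a 95% confidence interval for each of several conditions, which is the information a plot with error bars would display. The scores, condition labels, and function name are hypothetical. The critical t value of 2.093 is the two-tailed .05 value for 19 degrees of freedom (n = 20 per condition), and each interval here is based on that condition's own variance, which is an assumption of this sketch rather than a recommendation from the chapter.

```python
import math
import statistics

T_CRIT_DF19 = 2.093   # two-tailed .05 critical value of t for df = 19


def mean_and_ci(scores, t_crit=T_CRIT_DF19):
    """Return (mean, lower, upper) for a 95% CI around one condition mean."""
    m = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half_width, m + half_width


# Hypothetical data: 20 proportion-correct scores in each of three conditions.
OFFSETS = [-0.10, -0.05, 0.0, 0.05, 0.10] * 4
CENTERS = {50: 0.30, 100: 0.45, 200: 0.60}   # condition label -> center value
conditions = {dur: [c + o for o in OFFSETS] for dur, c in CENTERS.items()}

results = {dur: mean_and_ci(xs) for dur, xs in conditions.items()}
for dur, (m, low, high) in results.items():
    print(f"{dur:>3} ms: mean = {m:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The pattern of means carries the substantive message, while the interval widths indicate how much faith that pattern deserves as an estimate of the population pattern.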
Consider a plot such as the one shown in Figure 2. The pattern of sample means represents the best estimate of the corresponding pattern of population means. This pattern is fundamental to understanding how perception is influenced by contrast and duration, and it is this pattern that is most obvious and fundamental in the graph. Secondarily, the confidence intervals provide a quantitative visual representation of the faith that should be placed in the pattern of sample means as an estimate of the corresponding pattern of population means. Smaller confidence intervals, of course, mean a better estimate: In the extreme, if the confidence intervals were of zero length, it would be clear that error was irrelevant, and that the investigator could spend all his or her energy on the fundamental task of figuring out the impli-
DATA ANALYSIS, INTERPRETATION AND PRESENTATION Page 13
a hypothetical experiment in which a clinical researcher is investigating the relative effectiveness of two methods, Method A and Method B, of reducing anxiety. Two groups of high-anxiety subjects participate in the experiment, one receiving Method A and the other receiving Method B. Following their treatment, subjects rate their anxiety on a 7-point scale. Suppose that the experiment results in a small, not statistically significant difference between the two methods. In what follows, I will demonstrate two techniques of presenting the results for two hypothetical cases: a low-power case involving n subjects, and a high-power case involving 100n subjects.
The first analysis technique incorporates standard NHST, along with a formal power analysis. Figure 3 shows the graphical results of this kind of analysis for the low-power case (left panels) and the high-power case (right panels). The top panels show bar graphs depicting the main experimental results, while the bottom panels show power curves that depict power as a function of the difference between two population means according to a continuous succession of alternative hypotheses. Power is represented by the slope of the power curves. As illustrated by the arrows, the low-power curve achieves a power of 0.90 when the alternative hypothesis is that the population means differ by about 3.0, while the high-power curve achieves 0.90 when the alternative hypothesis is that the population means differ by about 0.3.
Figure 4 shows a different way of representing this power information for the same low-power case (left panel) and high-power case (right panel). Figure 4 again shows the bar graph, but here the bars are accompanied by 95% confidence intervals around the means that they depict. The free-floating error bars show the magnitude of the 95% confidence interval around the population mean differences in each of the panels. Here, power is
[Figure 3. Top panels: bar graphs of Mean Rating for Method A and Method B. Bottom panels: Power plotted against Alternative Hypothesis.]
Figure 3. One technique for carrying out a power analysis. A low-power situation is shown in the left panels and a high-power situation in the right panels. The top panels show the data, while the bottom panels show power curves.
STEVENS HANDBOOK OF EXPERIMENTAL PSYCHOLOGY Page 14
represented quite simply by the size of the confidence intervals, which are large in the left (low-power) graph but small in the right (high-power) graph.
In short, Figures 3 and 4 show the same information. However, Figure 4 presents the information in a much simpler and more intuitive manner than does Figure 3. Figure 4 makes it immediately and visually clear how seriously the sample means and the sample mean differences are to be taken as estimates of the corresponding population means, which, in turn, provides critical information about how “nonsignificance” should be treated. The left panel of Figure 4 leaves no doubt that failure to reject the null hypothesis is a nonconclusion: there is not sufficient statistical power to make any conclusions at all about the relative magnitudes of the two population means. The right panel, in contrast, makes it evident that something very close to the null hypothesis is actually true: the true difference between the population means is, with 95% confidence, restricted to a range of ±0.277, which is very small in the grand scheme of things.
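The tenfold difference between the two confidence intervals in Figure 4 follows from the way a standard error shrinks with sample size. The sketch below is only an illustration under stated assumptions: the function name, the common standard deviation, and the group size are hypothetical, the two groups are assumed to be independent and equal in size, and the same large-sample criterion of 1.96 is used for both cases (in practice the criterion t would also shrink slightly as degrees of freedom grow).

```python
import math


def mean_difference_half_width(sd, n_per_group, crit=1.96):
    """Half-width of a CI around the difference between two independent means.

    Assumes equal group sizes and a common within-group standard deviation.
    crit is the criterion value (1.96 is the large-sample 95% value).
    """
    standard_error = math.sqrt(2.0 * sd ** 2 / n_per_group)
    return crit * standard_error


# Hypothetical values, chosen only for illustration.
sd = 1.5
n = 4

low_power = mean_difference_half_width(sd, n)          # n subjects per group
high_power = mean_difference_half_width(sd, 100 * n)   # 100n subjects per group

print(f"n subjects:    +/-{low_power:.3f}")
print(f"100n subjects: +/-{high_power:.3f}")
print(f"ratio of half-widths: {low_power / high_power:.1f}")
```

Multiplying the number of subjects by 100 divides the half-width by the square root of 100, which is the relation between the ±2.77 and ±0.277 intervals shown in Figure 4.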
Confidence Intervals or Standard Errors?
Thus far I have been using 95% confidence intervals in my examples. This is one of the two standard configurations for error bars, the other being a standard error, which is approximately a 67% confidence interval^8. In the interests of standardization, one of these configurations or the
(^8) The exact coverage of a standard error depends, of course, on the number of degrees of freedom going into the error term.
other should be used unless there is some compelling reason for some other configuration.
I suggest, in particular, being visually conservative, which means deliberately stacking the deck against concluding whatever one wishes to conclude. This means that one should use 95% confidence intervals, which have a greater effect of suggesting no difference, when the interest is in rejecting some null hypothesis. Conversely, one should use standard errors, which have a greater effect of suggesting a difference, when the interest is in confirming some null hypothesis (as in, for example, when comparing observed to predicted data points in a model fit).
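The relation between the two error-bar configurations can be checked numerically. The following sketch assumes a normal sampling distribution (i.e., a large number of degrees of freedom), which is an assumption added here; with few degrees of freedom the coverage of one standard error is somewhat lower, as the footnote above indicates. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution (large-sample assumption)

# Coverage of an error bar extending one standard error on either side of the mean.
se_coverage = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

# Number of standard errors needed for a 95% confidence interval.
z_95 = std_normal.inv_cdf(0.975)

print(f"coverage of +/- 1 standard error: {se_coverage:.3f}")
print(f"standard errors in a 95% interval: {z_95:.2f}")
```

Under this assumption a 95% confidence interval is roughly twice as long as a standard-error bar, which is why it is the visually conservative choice when the aim is to reject a null hypothesis.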
Different Kinds of Confidence Intervals
The interpretation of a confidence interval is somewhat different depending on whether it is used in a between-subjects or a single-factor within-subjects (i.e., repeated-measures) design, a multifactor within-subjects design, or a mixed design (some factors between, other factors within). These differences are discussed in detail by Loftus & Masson (1994). The general idea is as follows.
Between-subjects designs
A confidence interval is designed to isolate a population parameter, most typically a population mean, to within a particular range. A between-subjects design constitutes the usual venue in which a confidence interval has been used in psychology, to the extent that confidence intervals have been used at all. Consider as an example a simple one-way ANOVA experiment in which the investigator is interested in the effects of caffeine on reaction time (RT). Four conditions are defined by four levels of caffeine: 0, 1, 2, or 3 caffeine units per unit body weight. Suppose that n = 10
[Figure 4. Left panel (Low Power): bar graph of Mean Rating for Method A and Method B, with CI: Mean Difference (±2.77). Right panel (High Power): the same bar graph, with CI: Mean Difference (±0.277).]
Figure 4. A second technique for carrying out a power analysis in the anxiety treatment method experiment. Smaller confidence intervals reflect greater power.
grees of freedom, and variability due to the subject by condition interaction, SS (Interaction) = 543, based on the remaining 27 degrees of freedom. The combined error variance SS (Subjects plus Interaction) is therefore 11,615, based on 36 degrees of freedom, just as it was in the between-subjects design, and the confidence interval of 11.52 is therefore identical also.
Intuitively this seems wrong. Just as the within-subjects design includes a great deal more sensitivity, as reflected in the substantially greater F ratio in the ANOVA, so it seems that the greater sensitivity should also be reflected in a smaller confidence interval. What is going on?
To answer this question, it is necessary to consider not what a confidence interval is technically, but what a confidence interval is generally used to accomplish. An investigator is not usually interested in absolute values of population means, but rather is interested in patterns of population means. So for instance, in the Figures 5 and 6 data the mean RT declines from approximately 240 ms to 215 ms across the caffeine conditions. However, it is not the exact means that are important for determining caffeine’s effect on RT; rather, it is the decrease, or perhaps the form of the mathematical function describing the decrease, that is of interest^9.
(^9) I should note that this is not always true. Sometimes an investigator is interested in isolating some population mean. An obvious example would be when the investigator wishes to determine whether performance in some condition is at a chance value.
This observation has an important implication for the interpretation of confidence intervals: Confidence intervals are rarely used in their “official” role of isolating population means. Instead, they are generally used as a visual aid to judge the reliability of a pattern of sample means as an estimate of the corresponding pattern of population means. In the Figure-5 between-subjects data, for instance, the confidence intervals indicate that a hypothesis of monotonically decreasing population-mean RT’s with increased caffeine is reasonable.
How does this logic relate to within-subjects designs? The answer, detailed by Loftus and Masson (1994), is that a “confidence interval” based on the interaction variance is appropriate for the goal of judging the reliability of a pattern of sample means as an estimate of the corresponding pattern of population means; thus the within-subjects confidence interval equation is,
$$ CI = \pm\sqrt{\frac{MS(\text{Interaction})}{n}}\;[\text{crit } t(df_I)] \qquad \text{(Eq. 9)} $$
where n again represents the number of observations on which each mean is based (n = 10 in this example). Using Equation 9 (see Figure 6b), the confidence intervals in Figure 6a were computed using the MS (Interaction) shown in the ANOVA table within Figure 6a. The resulting confidence interval is ±3.21. This value is, of course, considerably smaller than the between-subjects, Figure-5 counterpart of 11.52. It bears emphasis, however, that this apparent increase in power comes about because information is lost: In particular, the confidence intervals no longer isolate absolute values of population means; rather they are appropriate only
[Figure 6, panel (a): Reaction time (ms) plotted against Caffeine (units per body weight), with the following ANOVA table.]
ANOVA (within S’s)
Source   df   SS       MS      Obt F   Crit F
Btwn     39   16,
C         3   4,810    1,603   79.72   2.
R         9   11,072   1,
RxC      27   543      20
[Figure 6, panel (b): Reaction time (ms) plotted against Caffeine (units per body weight), with the following computation.]
$$ CI = \pm\sqrt{\frac{MS_I}{n}}\;[\text{crit } t(9)] = \sqrt{\frac{20}{10}} \times 2.26 = \pm 3.21 $$
Figure 6. Data from a hypothetical experiment in which RT is measured as a function of caffeine consumption in a within-subjects design. All 40 data points are the same as those shown in Figure 5. The left panel shows the mean data (heavy line) along with individual-subject data points (light lines). The right panel shows 95% “within-subject” confidence intervals around the sample means that are based on the subject x condition interaction variance.
for assessing the reliability of the pattern of sample means as an estimate of the underlying pattern of population means. That is, they serve the same function as they do in the between-subjects ANOVA.
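The within-subjects computation of Equation 9 can be sketched in a few lines. This is an illustrative sketch rather than a definitive implementation: the function name is hypothetical, and the criterion t is passed in by the caller rather than computed. The value 2.262 is the three-decimal form of the 2.26 criterion printed in Figure 6b, and the sum of squares and degrees of freedom are those of the interaction row of the Figure 6 ANOVA table.

```python
import math


def within_subject_ci(ss_interaction, df_interaction, n, crit_t):
    """Half-width of the within-subjects confidence interval (Equation 9).

    ss_interaction / df_interaction gives MS(Interaction); n is the number
    of observations per mean; crit_t is supplied by the caller.
    """
    ms_interaction = ss_interaction / df_interaction
    return crit_t * math.sqrt(ms_interaction / n)


# SS(Interaction) = 543 on 27 degrees of freedom, n = 10 subjects.
half_width = within_subject_ci(543, 27, 10, crit_t=2.262)
print(f"within-subjects CI: +/-{half_width:.2f}")
```

Because MS (Interaction) excludes the between-subject variability, the resulting interval is far narrower than the ±11.52 between-subjects interval computed from the same 40 data points.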
Confidence intervals in multifactor within-subjects designs
In a pure between-subjects design, there is only one error term, MS (Within), irrespective of the number of factors in the design. Therefore, assuming homogeneity of variance, a single confidence interval, computed by Equation 8 or Equation 9, is always appropriate.
In a multifactor within-subjects design, the situation is more complicated in that there are multiple error terms, corresponding to multiple subject-by-something interactions. For instance, in a two-factor within-subjects design, there are three error terms: one corresponding to Factor A, one corresponding to Factor B, and one corresponding to the (AxB) interaction. These error terms are summarized in Table 4, for a standard two-factor, within-subjects design^10. This raises the problem of how to compute confidence intervals, as it would appear that there are as many possible confidence intervals as there are error terms. Which confidence interval(s) are appropriate to display?
Often the answer to this question is simple, because in many such two-factor designs (and in many multifactor within-subjects designs in general) the error terms are all roughly equal (i.e., differ by no more than a factor of around 2:1). In such instances, it is reasonable to simply pool error terms, that is, to compute an overall error term by dividing the sum of the sums of squares (error)
(^10) With more than two factors, the same general arguments presented below hold; they are simply more complex, because there are yet more error terms; e.g., in a three-factor, within-subjects design, there are 3 main-effect error terms, 3 two-way interaction error terms, and 1 three-way interaction error term, or 7 error terms in all.
by the total degrees of freedom (error) to arrive at a single “subject x condition” interaction, where a “condition” is construed as a single combination of the various factors (e.g., a 5 x 3 x subjects design would have 15 separate conditions). This single error term can then be entered into Equation 9 to compute a single confidence interval. Here “dfI” refers to degrees of freedom in the total interaction between subjects and conditions. So, for instance, in a 5 (Factor A) x 3 (Factor B) x 20 (subjects) design, dfI would be (15-1) x (20-1) = 266. As before, “n” in Equation 9 refers to the number of observations on which each mean is based: 20 in this example.
Of course, Nature is not always this kind, and the investigator sometimes finds that the various error terms have widely varying values. In this situation, the investigator is in a position of having to provide a more complex representation of confidence intervals, and the situation becomes akin to that described in the next section where a mixed design is used.
Confidence intervals in mixed designs
A mixed design is one in which some of the factors are between subjects and other factors are within subjects. For simplicity, I will describe the simplest such design: a two-factor design with one between-subjects factor and one within-subjects factor (see also Loftus & Masson, 1994, pp. 484-486).
Imagine the caffeine experiment described above except that two different subject populations are investigated: young adults (in their 20’s) and older adults (in their 70’s); thus there are two variables, one of which (caffeine) is within subjects and the other of which (age) is between subjects. Again, there are n = 10 subjects in each of the two age groups. Suppose that the data are as depicted in Figure 7a (note that again, the relevant ANOVA table is provided at the bottom of the figure). As described in many standard statistics textbooks, there are two error terms in this design.
The error term for the age effect is MS (Subjects within age groups) = 1,656, while the error term for caffeine and for the caffeine x age interaction is the MS (Caffeine x Subjects) = 99. There are, correspondingly, two separate confidence intervals that can be computed. The first, computed by Equation 9, is the kind of “within-subjects” confidence interval that was described in the previous section. This confidence interval which, as indicated at the bottom of Figure 7b, is computed to be ±6.3, is appropriate for assessing the observed effects of caffeine and of the age x caffeine interaction as estimates of the corresponding population effects. This confidence interval is plotted around each of the cell means in Figure 7b. Note that this confidence interval is not appropriate for describ-
Table 4. ANOVA table for a two-factor, within-subjects design.
Source          Degrees of freedom       Error term
Factor A (A)    df(A)                    MS(A x S)
Factor B (B)    df(B)                    MS(B x S)
Inter. (A x B)  df(A x B)                MS(A x B x S)
Subjects (S)    df(S)
A x S           df(A) x df(S)
B x S           df(B) x df(S)
(A x B) x S     df(A) x df(B) x df(S)
computed to be X (e.g., suppose X = 0.4 in this example). Thus, by Equation 7, the confidence interval around this interaction magnitude is,
$$ I \pm X\sqrt{1^2 + 1^2 + 1^2 + 1^2} = I \pm 2X $$
which, in this example, would be 2.0 ± 0.8.
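The arithmetic behind this interval can be sketched as a weighted combination of cell means. The sketch makes two assumptions that are not spelled out in the surviving text: that X is the confidence-interval half-width around each individual cell mean, and that the interaction is the contrast with weights +1, -1, -1, +1 applied to the Table 5 means. The function name is hypothetical.

```python
import math


def contrast_ci(means, weights, mean_half_width):
    """Value and CI half-width of a weighted combination of independent means.

    mean_half_width is the half-width around each individual mean (X in the
    text); the combination's half-width is X times the square root of the
    sum of squared weights.
    """
    estimate = sum(w * m for w, m in zip(weights, means))
    half_width = mean_half_width * math.sqrt(sum(w ** 2 for w in weights))
    return estimate, half_width


means = [5, 8, 7, 12]       # M11, M21, M12, M22 from Table 5
weights = [1, -1, -1, 1]    # interaction contrast (assumed form)

interaction, half_width = contrast_ci(means, weights, mean_half_width=0.4)
print(f"interaction = {interaction} +/- {half_width:.1f}")
```

With four weights of magnitude 1, the square root of the sum of squared weights is 2, which is where the factor of 2 in I ± 2X comes from.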
Asymmetrical confidence intervals
Thus far in the chapter I have been describing confidence intervals that are symmetrical around the obtained sample statistics (generally the sample mean). However, some circumstances demand asymmetrical confidence intervals. In this section, I will describe how to compute asymmetrical confidence intervals around three common statistics: variances, Pearson r’s, and binomial proportions. In general, asymmetry reflects the bounded nature of the variable (variances are bounded at zero; Pearson r’s are bounded at ±1).
Confidence intervals around variances
As described by Hays (1973, pp. 441-445), the confidence interval for a sample variance based on n observations (Xi’s) with mean M is:
$$ \text{Upper limit} = \frac{(n-1)\,\text{est}\sigma^2}{\chi^2(n-1;\ p(\text{upper limit}))} $$
$$ \text{Lower limit} = \frac{(n-1)\,\text{est}\sigma^2}{\chi^2(n-1;\ p(\text{lower limit}))} $$
Here, estσ^2 (or s^2 in Hays’ notation) is the best estimate of the population variance, computed by
$$ \text{est}\sigma^2 = \frac{\sum_{i=1}^{n}(X_i - M)^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - nM^2}{n-1} $$
and p(upper limit) and p(lower limit) are the probability boundaries for the upper and lower limits of the confidence interval (e.g., 0.975 and 0.025 for a 95% confidence interval).
Suppose, to illustrate, that a sample of n = 100 scores produced a sample variance, estσ^2 = 20.
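The upper limit of a 95% confidence interval would be
$$ \frac{(100-1)(20)}{\chi^2(99;\ 0.975)} = \frac{99 \times 20}{\chi^2(99;\ 0.975)} $$
while the lower limit would be
$$ \frac{(100-1)(20)}{\chi^2(99;\ 0.025)} = \frac{99 \times 20}{\chi^2(99;\ 0.025)} $$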
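The two limits for this example can be evaluated numerically. The sketch below is illustrative only: rather than a chi-square table, it uses the Wilson-Hilferty approximation to the chi-square quantile, which is a technique swapped in here and not the method of the text, so the results are approximate. The function names are hypothetical, and the quantile function is parameterized by lower-tail probability, so the smaller chi-square value produces the upper limit.

```python
from statistics import NormalDist


def chi_square_quantile(prob, df):
    """Approximate chi-square quantile (Wilson-Hilferty approximation).

    prob is the lower-tail probability; accuracy is good for large df.
    """
    z = NormalDist().inv_cdf(prob)
    term = 2.0 / (9.0 * df)
    return df * (1.0 - term + z * term ** 0.5) ** 3


def variance_ci(est_var, n, confidence=0.95):
    """Asymmetrical confidence interval around an estimated variance."""
    df = n - 1
    tail = (1.0 - confidence) / 2.0
    # A large chi-square criterion gives the lower limit, a small one the upper.
    lower = df * est_var / chi_square_quantile(1.0 - tail, df)
    upper = df * est_var / chi_square_quantile(tail, df)
    return lower, upper


lower, upper = variance_ci(est_var=20.0, n=100)
print(f"95% CI around the variance: {lower:.1f} to {upper:.1f}")
```

The interval is not centered on the estimate of 20: the upper limit lies farther from it than the lower limit does, which is the asymmetry this section describes.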
Confidence intervals around Pearson r’s
The confidence interval around a Pearson r is based on Fisher’s r-to-z transformation. In particular, suppose a sample of n X-Y pairs produces some value of Pearson r. Given the transformation,
$$ z = 0.5\ln\left(\frac{1+r}{1-r}\right) \qquad \text{(Eq. 10)} $$
z is approximately normally distributed, with an expectation equal to
$$ 0.5\ln\left(\frac{1+\rho}{1-\rho}\right) $$
where ρ is the population correlation of which r is an estimate, and a standard deviation of
$$ \sigma = \sqrt{1/(n-3)} $$
Therefore, having computed an obtained z from the obtained r via Equation 10, a confidence interval can easily be constructed in z-space as
$$ z \pm \text{criterion } z $$
where the criterion z corresponds to the desired confidence level (e.g., 1.96 in the case of a 95% confidence interval). The upper and lower z limits of this confidence interval can then be transformed back to upper and lower r limits.
Suppose, for instance, that a sample of n = 25 X-Y pairs produces a Pearson r of 0.90, and a 95% confidence interval is desired. The obtained z is thus
$$ z = 0.5 \times \ln[(1 + .90)/(1 - .90)] = 1.472 $$
which is distributed with a standard deviation of
$$ \sqrt{1/(25-3)} = 0.213 $$
The upper and lower confidence interval limits in z-space are therefore
$$ 1.472 + (.213)(1.96) = 1.890 $$
and
$$ 1.472 - (.213)(1.96) = 1.054 $$
To translate from z-space back to r-space, it is necessary to invert Equation 10. It is easily shown that such inversion produces,
Table 5. Hypothetical data from a 2 x 2 factorial design.
                      Factor 1
                      Level 1     Level 2
Factor 2   Level 1    M11 = 5     M21 = 8
           Level 2    M12 = 7     M22 = 12
$$ r = \frac{e^{2z} - 1}{e^{2z} + 1} \qquad \text{(Eq. 11)} $$
The upper and lower confidence-interval limits may then be computed from Equation 11:
$$ \text{upper limit: } r = \frac{e^{2 \times 1.890} - 1}{e^{2 \times 1.890} + 1} $$
and
$$ \text{lower limit: } r = \frac{e^{2 \times 1.054} - 1}{e^{2 \times 1.054} + 1} $$
Thus, the 95% confidence interval around the original obtained r of 0.90 ranges from 0.783 to 0.955.
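The whole procedure can be sketched compactly, since Equation 10 is the inverse hyperbolic tangent and Equation 11 is the hyperbolic tangent. The function name below is hypothetical, and the criterion z is taken from the standard normal distribution, as in the worked example.

```python
import math
from statistics import NormalDist


def pearson_r_ci(r, n, confidence=0.95):
    """Asymmetrical confidence interval around a Pearson r via Fisher's z."""
    z = math.atanh(r)                    # Equation 10: 0.5 * ln((1 + r) / (1 - r))
    sd = math.sqrt(1.0 / (n - 3))        # standard deviation of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.96 for 95%
    lower_z, upper_z = z - crit * sd, z + crit * sd
    # Equation 11: tanh(z) = (e^(2z) - 1) / (e^(2z) + 1)
    return math.tanh(lower_z), math.tanh(upper_z)


lower, upper = pearson_r_ci(r=0.90, n=25)
print(f"95% CI around r = 0.90: {lower:.3f} to {upper:.3f}")
```

The interval extends farther below 0.90 than above it, reflecting the bound at +1.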
Confidence intervals around binomial proportions
To compute confidence intervals around binomial proportions, note first that the equation for the standard deviation of a proportion is,
$$ \sigma = \sqrt{\frac{pq}{n}} $$
where p is the proportion, q is (1-p), and n is the number of observations.
Suppose now that we wish to compute the upper limit of an X% confidence interval. Call the corresponding criterion z, zX (e.g., zX = 1.64 for a 90% confidence interval, zX = 1.96 for a 95% confidence interval, and so on). It follows then that the upper limit, U, for an X% confidence interval around some obtained proportion, p, can be written as,
$$ U = p + \frac{1}{2n} + z_X\sigma = p + \frac{1}{2n} + z_X\sqrt{\frac{U(1-U)}{n}} \qquad \text{(Eq. 12)} $$
where the factor (1/2n) is to correct for continuity, as the normal approximation to the binomial is most easily used in these computations. The equation for the lower limit, L, is the same except that the second plus sign in Equation 12 is replaced with a minus sign, i.e.,
$$ L = p + \frac{1}{2n} - z_X\sigma = p + \frac{1}{2n} - z_X\sqrt{\frac{L(1-L)}{n}} $$
The equations for both U and L can, after suitable algebraic manipulation, be written as standard quadratics of the form,
$$ aU^2 + bU + c = 0 $$
and,
$$ aL^2 + bL + c = 0 $$
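where for both U and L, the values of a, b, and c can be computed as,
$$ a = 1 + \frac{z_X^2}{n} \qquad \text{(Eq. 13)} $$
and,
$$ b = -2p - \frac{z_X^2}{n} - \frac{1}{n} \qquad \text{(Eq. 14)} $$
and
$$ c = p^2 + \frac{p}{n} + \frac{1}{4n^2} \qquad \text{(Eq. 15)} $$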
The seemingly odd fact that the values of a, b, and c are the same for both U and L comes about because when, as part of the aforementioned algebraic manipulation, one squares the far-right term in Equation 12, the minus sign in the equation for L disappears and hence the equations for U and L become identical. Nevertheless, distinct values for both U and L emerge from the quadratic solution below.
A quadratic equation of the form,
$$ aX^2 + bX + c = 0 $$
has two solutions, which are computed as follows.
$$ X = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad \text{(Eq. 16)} $$
When the values of a, b, and c obtained by Equations 13, 14, and 15 are plugged into Equation 16, the two resulting solutions correspond to U and L, the upper and lower limits of the confidence interval.
As an example, suppose that a proportion of p = .96 is obtained based on n = 5 observations, and suppose one wishes to compute a 99% confidence interval around this obtained value of p = .96. The criterion z for a 99% confidence interval is zX = 2.576. There is now sufficient information to compute the values of the quadratic-equation coefficients, a, b, and c, via Equations 13-15. They are a = 2.327, b = -3.447, and c = 1.124. Plugging these three values, in turn, into Equation 16 leads to solutions (upper and lower limits) of U = 0.997 and L = 0.484.
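Equations 13 through 16 translate directly into a short routine. The sketch below is illustrative; the function name is hypothetical, and the criterion z is passed in by the caller, as in the worked example.

```python
import math


def binomial_ci(p, n, crit_z):
    """Asymmetrical confidence interval around a binomial proportion.

    Solves the quadratic aX^2 + bX + c = 0 whose two roots are the lower
    and upper limits (Equations 13-16).
    """
    a = 1.0 + crit_z ** 2 / n                      # Equation 13
    b = -2.0 * p - crit_z ** 2 / n - 1.0 / n       # Equation 14
    c = p ** 2 + p / n + 1.0 / (4.0 * n ** 2)      # Equation 15
    root = math.sqrt(b ** 2 - 4.0 * a * c)         # Equation 16
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


lower, upper = binomial_ci(p=0.96, n=5, crit_z=2.576)
print(f"99% CI around p = .96: {lower:.3f} to {upper:.3f}")
```

The upper limit stays below 1 while the lower limit falls far below the obtained proportion, which is the asymmetry expected of a proportion near its upper bound.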
Homogeneity of Variance
Let us return to the standard, one-way, between-subjects ANOVA design, as exemplified by the RT-as-a-function-of-caffeine example (see Figure 5). There is only a single MS (Error) in this design, in this case MS (Within) = 323. Computation of this single MS (Within) rests on the homogeneity of variance assumption, which is this: Although the treatment in some experiment (caffeine variation in this example) may affect the population means, it does not affect population variances.