



























This paper examines the importance of selecting an appropriate y-axis range in scientific graphs to maximize compatibility between visual size and actual effect size. The author recommends a range of approximately 1.5 standard deviations (SDs) for fields where a standardized effect size of Cohen’s d = 0.8 is considered big. It reports results from five experiments that demonstrate improved calibration to effect size and reduced bias when the y-axis is centered on the grand mean and extended approximately 0.7 SDs in either direction. The studies also suggest that standardized graphs, which have a y-axis range that is a function of the standard deviation, produce better sensitivity and less bias than graphs that show the full range or only the minimal range necessary to see the data.




























Running Header: GRAPH FLUENCY AND Y‐AXIS RANGE

Graph Construction: An Empirical Investigation on Setting the Range of the Y‐Axis
Jessica K. Witt
Colorado State University

Under peer review at Meta-Psychology. This is a revised version of the submitted paper.
Abstract: Graphs are an effective and compelling way to present scientific results. With few rigid guidelines, researchers have many degrees‐of‐freedom regarding graph construction. One such choice is the range of the y‐axis. A range set just beyond the data will bias readers to see all effects as big. Conversely, a range set to the full range of options will bias readers to see all effects as small. Researchers should maximize congruence between visual size of an effect and the actual size of the effect. In the experiments presented here, participants viewed graphs with the y‐axis set to the minimum range required for all the data to be visible, the full range from 0 to 100, and a range of approximately 1.5 standard deviations. The results showed that participants’ sensitivity to the effect depicted in the graph was better when the y‐axis range was between one and two standard deviations than with either the minimum range or the full range. In addition, bias was smaller with the standardized axis range than with the minimum or full axis ranges. To achieve congruency in scientific fields for which effects are standardized, the y‐axis range should be 1.5 standard deviations.
data point to just above the highest data point (Kosslyn, 1994). This would not achieve the recommended compatibility because small effects would look big. Others assert that the y‐axis should always start from 0, particularly for bar graphs (Few, 2012; Pandey et al., 2015; Wong, 2010). This too could fail to achieve compatibility by making effects look too small. In the case of scientific fields for which effect size is standardized based on standard deviation, the range of the y‐axis should be a function of the standard deviation (SD). In behavioral sciences such as psychology and economics, for example, most effects are approximately half a SD (Bosco, Aguinis, Singh, Field, & Pierce, 2015; Open Science Collaboration, 2015; Paterson, Harms, Steel, & Crede, 2016), and a standardized effect size of d = .8 is considered a big effect (Cohen, 1988). Consequently, an appropriate range for the y‐axis would be one to two SDs, which would be plotted as the group mean ± 0.75 SD (or ±0.5 – 1 SDs). With this range, big effects such as a Cohen’s d of .8 would look big and small effects of d = .3 would look small. In other words, this range would help achieve compatibility between the visual impression of the size of the effect and the actual size of the effect.

Empirical Studies

The effect of visual‐conceptual size compatibility on graph fluency was empirically tested in 57 participants across 5 experiments (see Table 1). The participants were naïve college students, which serves as an appropriate sample given that scientific results should be accessible and comprehensible to this population and not just to experts in one’s field.
Table 1. Overview of the five experiments.

Experiment   N    Effect sizes depicted in stimuli   Graph type                  Range for standardized condition¹
1            9    0.1, 0.3, 0.5, 0.8                 Bar graph                   2 SDs
2            14   0.1, 0.3, 0.5, 0.8                 Bar graph                   1.4 SDs
3            13   0, 0.3, 0.5, 0.8                   Bar graph with error bars   1.2 SDs
4            20   0, 0.3, 0.5, 0.8                   Line graph                  1.4 SDs
5            15   0, 0.3, 0.5, 0.8                   Line graph                  1 SD

Notes. ¹ The range refers to the full range depicted. For example, a range of 1.4 SDs refers to conditions for which the graph was centered on the grand mean and extended 0.7 SDs in either direction.

The stimuli were bar or line graphs that had been constructed from simulated data. Data were simulated from two (hypothetical) groups of participants by sampling from normal distributions in R (R Core Team, 2017). For one group, the data were drawn from a normal distribution with a mean of 50 and a standard deviation of 10 (as in a memory experiment with mean performance of 50% and SD of 10%). For the other group, the data were drawn from a normal distribution with a standard deviation of 10 and the mean at 49, 47, 45, or 42. These means correspond to effect sizes of d = 0.1, 0.3, 0.5, and 0.8, respectively. In Experiments 3‐5, the mean of 49 (d = 0.1) was replaced with the mean of 50 (d = 0). In Experiments 2‐5, the data were re‐sampled if the attained effect size differed by more than 0.1 from the intended effect size. Data were simulated 10 times for each of the four effect sizes to create 40 sets of data for each Experiment.

In Experiments 1‐3, the means of the simulated data were displayed as a bar graph depicting two groups of participants who engaged in different study strategies (spaced versus massed; see Figure 1). In Experiments 4‐5, the means were used to determine the end points of a line graph, and the x‐axis was labeled as “hours spent studying”. For each set of data, three graphs were constructed that varied in the range of the y‐axis.
The full condition showed the full range from 0 to 100 on a hypothetical memory test. The minimal condition showed the smallest range necessary to see the data. The standardized condition was centered on the grand mean and spanned one to two SDs in total, that is, 0.5 to 1 SD in either direction (the exact value differed across experiments, see Table 1 or the supplementary
Graph fluency was measured using linear regressions rather than accuracy because regression coefficients have the advantage that they provide two separate measures. The slope provides an estimate of sensitivity to the magnitude of the effect depicted in the plot. A steeper slope indicates better sensitivity to effect size than a shallower slope. The intercept provides an estimate of bias. Two graphs could lead to similar levels of sensitivity but different levels of bias. Separate linear regressions were calculated for each participant for each y‐axis range condition (full, standardized, and minimal). In each regression, the dependent measure was response (on the scale of 1 to 4). The effect sizes were recoded to also be on a scale from 1 to 4 so that perfect performance would produce a regression coefficient for the slope of 1. Figure 2 shows the mean slope coefficients across all 5 experiments. Sensitivity was best for the standardized graphs and worst for the full range graphs. Participants were better able to assess the size of the effect depicted in the graph for the standardized graphs than for the minimal or full graphs. Participants were also less biased when viewing the standardized graphs. Figure 3 shows the mean bias across all 5 experiments. Bias scores were calculated as a percent bias based on the coefficients for the intercept. A negative score indicates a bias to respond that effects were small, and a positive score indicates a bias to respond that the effects were big. For the full graphs, there was a large bias to respond that the effects were small. When looking at graphs with the full range, participants responded that almost all effects (86%) were null or small. For the minimal graphs, there was a large bias to respond that the effects were substantial. When looking at graphs with the minimal range, for which Cohen’s d was 0.10 – 0.80, participants responded that the effect was big on 49% of the trials.
In contrast, there was much less bias with the standardized graphs (see Supplemental Materials).
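The regression approach described above can be illustrated with a minimal sketch. The analyses in the paper were run in R; the Python below is only an illustration, and the function name `fit_line` and the example responses are hypothetical rather than taken from the study. It fits response on recoded effect size for one participant in one y‐axis condition and returns the slope (sensitivity) and the intercept.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical trials: depicted effect size recoded onto the 1-4 response scale.
depicted = [1, 2, 3, 4, 1, 2, 3, 4]

# A perfectly calibrated observer responds with the recoded effect size itself.
perfect = list(depicted)

# An observer who sees every effect as "none" or "small" (made-up responses).
compressed = [1, 1, 2, 2, 1, 1, 2, 2]

slope_perfect, intercept_perfect = fit_line(depicted, perfect)
slope_compressed, intercept_compressed = fit_line(depicted, compressed)

print(f"perfect:    slope = {slope_perfect:.2f}")
print(f"compressed: slope = {slope_compressed:.2f}")
```

A slope of 1 corresponds to perfect sensitivity, and the compressed observer's shallower slope reflects attenuated sensitivity. Running one such fit per participant and per y‐axis condition yields the coefficients that are then compared across conditions.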
Figure 2. Sensitivity is plotted as a function of graph axis condition for the three types of graphs across all 5 experiments. Sensitivity was measured as the coefficient for the slope from regressions of estimated effect size on actual effect size. Only trials for which the graph depicted an effect size greater than d = 0.1 are included (see supplementary materials for all the data). A higher sensitivity score corresponds to better performance, and a coefficient of 1 corresponds to perfect performance. A coefficient of 0 indicates chance performance. In the left panel, mean sensitivity across all experiments is shown. Error bars are 1 SEM calculated within‐subjects, and are approximately the same size as the symbols. The y‐axis range is 3 SD. The right panel shows sensitivity for each participant for each experiment. The data are color‐coded by experiment (e.g. red = Experiment 1, orange = Experiment 2) and are also laterally positioned from left to right within graph type category. Each point corresponds to one participant, and each participant has one symbol for each of the three graph types. The solid horizontal line at 0 shows the point of no sensitivity and the dashed horizontal line at 1 shows the point of perfect sensitivity.
The current studies show improved calibration to effect size and reduced bias in estimating effect size when the range of the y‐axis was centered on the grand mean of the data and extended approximately 0.7 SDs in either direction. The various studies used slightly different extensions ranging from 0.5 SDs to 1 SD. There were no large detectable differences in calibration or bias depending on the exact range that was used, so the precise value of the y‐axis range might not be critical. Rather, the key feature is that the visual size aligns with the actual size of the effect. The specific range to be used might vary as a function of the size of the error bars (the range should be large enough to encompass them), the size of the effect (the range would have to be extended by 1.5 SDs for particularly large effects, such as was done with the current results), if doing so would make the range include nonsensical numbers (such as negative numbers for performance), and to achieve a consistent scale across multiple graphs to enhance across‐graph comparisons. Given that the exact range in terms of SD could vary from plot to plot, it could be useful to indicate the range in SD units in the figure caption. This indication would also be useful in cases for which researchers neglect to include error bars. In the current studies, the participants were similarly calibrated to the size of the effect when error bars were and were not included. If this pattern extends to a more statistically‐knowledgeable community, setting the range to be 1.5 SDs could provide similarly effective information about the size of effects as including error bars, and thus could help compensate for differences in researcher preferences for including or excluding error bars. The current experiments explored graphs of simulated data from between‐subjects designs.
The recommendations likely generalize to within‐subject designs with the caveat that the y‐axis should be a function of the denominator used to calculate the within‐subjects effect size. For example, the denominator for Cohen’s dz is the standard deviation of the difference scores, which is the square root of the sum of the squares of the standard deviations minus twice the product of the standard deviations and the correlation between the two measures. Graphs plotting within‐subjects data could be ± 0.75 times this denominator (or one of the other suggested
measures for within‐subjects effect sizes; e.g. Lakens, 2013). In cases for which there are both between‐subjects and within‐subjects factors, the researchers will have to decide which denominator to use for the range depending on which effect they most want to emphasize. It is debatable whether the recommendation offered here should be employed with bar graphs. Some have shown that graphs that start at a position other than 0 are deceptive (e.g. Pandey et al., 2015). The idea is that bar graphs should always start at 0 because the height of the bar signifies the value of the condition being represented. When the y‐axis starts at a value greater than 0, the height of the bar corresponds to the difference between the condition’s value and the starting point, rather than the condition’s value itself. Consider the following example: imagine that group A scored 70% on a memory test and group B scored 60%. On a plot for which the y‐axis starts at 50%, group A’s score would appear twice as big as group B’s score, even though they only scored 10% higher. The issue at hand concerns the visual impression of the data. If the graph gives the impression that the differences are big, and that aligns with the size of the effect, the graph would produce compatibility between vision and true effect size. If, however, the impression is that one group’s performance was twice as good as the other group’s performance, this would produce a misleading impression of the data. The current experiments cannot speak to which impression was experienced because participants were asked to rate the size of the effect as being no effect, small, medium, or big, rather than quantifying the size of one bar relative to another. The specific task used here did not permit measuring the spontaneous impression given by the graphs. One option is for researchers to use alternative types of graphs to avoid the issue.
Alternatives include point graphs and a newly‐designed (but as yet unpublished) type of graph called a hat graph (Witt, in preparation). The recommendation to set the y‐axis range to be 1.5 SDs does not generalize to fields for which the SD is unknown or irrelevant for interpreting effect size. For these fields, previous recommendations such as Tufte’s Lie Detector Ratio could be appropriate (Tufte, 2001). But for scientific fields that rely on
Author Note

Jessica K. Witt, Department of Psychology, Colorado State University. Data, scripts, and supplementary materials available at osf.io/hw2ac. This work was supported by grants from the National Science Foundation (BCS‐1632222 and BCS‐1348916). Address correspondence to JKW, Department of Psychology, Colorado State University, Fort Collins, CO 80523, USA. Email: [email protected]
Supplementary Materials

Experiment 1: Bar Graphs with Axis Range of 2 SD

Participants judged the size of effects depicted in bar graphs that were constructed with three axis range options.

Method

Participants. Nine students in an introductory psychology course participated in exchange for course credit. In this and all subsequent experiments, the number of participants was maximized within a pre‐determined time limit.

Stimuli and Apparatus. Graphs were constructed in R (R Core Team, 2017). For each graph, two means were generated. One mean was 50, and the other mean was 49, 47, 45, or 42. These equated to effect sizes of Cohen’s d = .1, .3, .5, and .8, respectively. To add some noise to each graph, each mean was drawn from a normal distribution centered on the desired mean with 1000 samples and a standard deviation of 10. The means were presented in bar graphs (see Figure S1). The left bar was white and labeled “Spaced” and the right bar was black and labeled “Massed”. For each set of simulated data, three bar graphs were constructed that corresponded to the three y‐axis range conditions. For the full graphs, the y‐axis range went from 0 to 100. For the minimal graphs, the y‐axis went from the smallest data value minus 1 to the largest data value plus 1. For the standardized graphs, the mean of the two groups was calculated, and 1 SD (10) was added in either direction to set the y‐axis range. This process of creating 3 graphs for each set of data was repeated 10 times for each of the 4 effect sizes for a total of 120 graphs. Graphs were 500 pixels by 500 pixels and were shown on 19” computer monitors with 1028 x 1024 resolution.
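The stimulus construction for Experiment 1 can be sketched roughly as follows. The original graphs were built in R; this Python version is only an approximation of the described steps, and the function names, the seed, and the choice to treat each plotted mean as the average of 1000 simulated scores are assumptions for illustration.

```python
import random
from statistics import mean

SD = 10.0         # population standard deviation stated in the text
N_SAMPLES = 1000  # samples per simulated group stated in the text

def simulate_group_mean(true_mean, rng):
    """Average of N_SAMPLES normal draws centered on true_mean (assumed procedure)."""
    return mean([rng.gauss(true_mean, SD) for _ in range(N_SAMPLES)])

def axis_ranges(mean_a, mean_b, half_width_sd=1.0):
    """Return the (low, high) y-axis limits for the three conditions.

    half_width_sd is the extension in SD units on each side of the
    grand mean; 1.0 matches the 2 SD total range of Experiment 1.
    """
    lo, hi = min(mean_a, mean_b), max(mean_a, mean_b)
    center = (mean_a + mean_b) / 2
    return {
        "full": (0.0, 100.0),
        "minimal": (lo - 1, hi + 1),
        "standardized": (center - half_width_sd * SD,
                         center + half_width_sd * SD),
    }

rng = random.Random(1)  # arbitrary seed, for reproducibility only
spaced = simulate_group_mean(50, rng)
massed = simulate_group_mean(45, rng)  # (50 - 45) / 10 gives d = 0.5
ranges = axis_ranges(spaced, massed)

for condition, (low, high) in ranges.items():
    print(f"{condition:>12}: {low:6.2f} to {high:6.2f}")
```

Passing a different `half_width_sd` (for example 0.7 for a total range of 1.4 SDs) would give the standardized limits for the other ranges listed in Table 1.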
Figure S1. Sample stimuli for each of the 5 experiments. Each row corresponds to one experiment and shows a single set of data plotted in the three different ways (full, standardized, and minimal). In all cases, the data show a medium effect (Cohen’s d = 0.5). The number in parentheses under the experiment number indicates the range of the standardized condition.

Procedure. After providing informed consent, each participant was seated at a computer. They were given the following instructions: “You will see graphs showing the effect of study style on final test performance. There were two study styles. Massed is like cramming everything at once at just before the exam. Spaced refers to studying a little bit every day for weeks before the exam. The y‐axis shows final test performance, with higher value meaning better performance. For each graph, indicate if study style had 1. No effect, 2. A small effect, 3. A medium effect, 4. A big effect on final performance. Ready? Press ENTER”. A trial began with a fixation cross at the center of the screen for 500ms. The graph was then shown. Above the graph, text reminded participants of the four response options. The graph remained until participants made a response, at which point the graph disappeared and a blank screen was shown for 500ms. Each block of trials consisted of the presentation of each of the 120 graphs (3 graph types x 4 depicted effect sizes x 10 repetitions). Order was randomized within block, and participants completed 4 blocks for a total of 480 trials.

Results and Discussion
One participant only completed 431 trials, but their data were still included. The depicted effect size was recoded on a scale from 1 to 4 to be consistent with the scale of the response. The smallest effect size (d = .1) was coded as 1.5 to account for the idea that this effect is smaller than a small effect but bigger than no effect. In later experiments, these graphs were replaced with graphs for which there was no effect instead of d = .1. Data from each participant for each of the 3 axis range conditions were submitted to a linear regression with response as the dependent factor and recoded effect size as the independent factor. Depicted effect size was centered by subtracting the mean response from each trial response. The regressions produced two coefficients for each participant for each axis range condition. The slope indicates sensitivity to the size of the effect. A slope of 1 indicates perfect sensitivity. A slope less than 1 indicates attenuated sensitivity. The intercept indicates any bias to see effects as smaller or larger than their true size. One participant had slopes that were identified as outliers in the full and minimal conditions because they were greater than 1.5 times the interquartile range for each condition. This participant was excluded from the analysis (despite being the best performer in the group) because their data were not typical of the rest of the sample. Another participant had a slope less than 1.5 times the interquartile range in the full condition, and was also excluded for not being typical of the rest of the sample. The coefficients were analyzed using paired‐samples t‐tests to compare each graph condition to the others. Analyses were done in R (R Core Team, 2017). Bayes factors were calculated using the BayesFactor package in R with a medium prior (Morey, Rouder, & Jamil, 2014).
A Bayes factor greater than 3 indicates moderate evidence, and a Bayes factor greater than 10 indicates substantial evidence for the alternative hypothesis over the null hypothesis. Conversely, Bayes factors less than .33 and less than .10 indicate moderate and substantial evidence, respectively, for the null hypothesis over the alternative hypothesis. Effect sizes were calculated using the recommendations of Lakens (2013). 95% confidence
These data show an advantage for the standardized graphs because participants were more sensitive to differences among magnitudes of the depicted effect sizes with the standardized graphs than with the full or minimal graphs. However, the standardized graphs led to performance that was far from perfect. The slope was .47, and perfect performance would have produced slopes of 1. Thus, even though the standardized graphs signify an improvement over the other two options, more work is still necessary to improve graph comprehension. Another advantage for the standardized graphs can be seen with respect to bias. Bias scores were calculated as a percentage score of underestimation (negative values) and overestimation (positive values). They were calculated as the participant’s coefficient for the intercept minus the true intercept divided by the true intercept. The bias scores for the full graphs were negative (M = ‐26%, SD = 10%) and significantly below 0, t(6) = ‐14.84, p < .001, dz = 2.56, 95% CIs [.95, 4.14], Bayes factor = 71. The bias scores for the minimal graphs were positive (M = 38%, SD = 19%) and significantly above 0, t(6) = 5.12, p = .002, dz = 1.93, 95% CIs [.61, 3.21], Bayes factor = 22. In contrast, the bias scores for the standardized graphs were significantly less biased than in the other conditions (ps < .001), and were close to 0 (M = 3%, SD = 5%), t(6) = 1.85, p = .115, dz = .70, 95% CIs [‐.16, 1.51], Bayes factor = 1.09. With the full graphs, most effects looked like small effects. Indeed, the responses on over 94% of the trials with the full graphs were that there was no effect or a small effect. With the minimal graphs, over half of the effects were labeled as big effects and 81% were labeled as medium or big. With the standardized graphs, small effects looked small and medium effects looked medium (see Figure S3). However, the big effects only looked medium.
Thus, the experiment was replicated but with a smaller range in the standardized condition to determine if that would improve detection of big effects.
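The percent bias score described above can be illustrated with a short sketch. It relies on two assumptions that are not spelled out in the text: that the predictor is mean‐centered, so the fitted intercept equals the participant's mean response, and that the "true intercept" is the intercept a perfectly calibrated observer would produce, namely the mean of the recoded depicted effect sizes. The function name and the example responses are hypothetical.

```python
from statistics import mean

def percent_bias(responses, recoded_effects):
    """Percent over- or underestimation based on the regression intercept.

    Assumes a mean-centered predictor, in which case the least-squares
    intercept is simply the mean response. The true intercept is taken
    to be the mean of the recoded effect sizes (an assumption).
    """
    observed_intercept = mean(responses)
    true_intercept = mean(recoded_effects)
    return 100 * (observed_intercept - true_intercept) / true_intercept

recoded = [1, 2, 3, 4]  # depicted effect sizes on the 1-4 response scale

# Made-up response patterns for illustration only.
underestimator = [1, 1, 2, 2]  # tends to answer "no effect" or "small"
overestimator = [3, 4, 4, 4]   # tends to answer "medium" or "big"

print(f"underestimator: {percent_bias(underestimator, recoded):+.0f}%")
print(f"overestimator:  {percent_bias(overestimator, recoded):+.0f}%")
```

Negative scores correspond to the pattern reported for the full graphs, and positive scores to the pattern reported for the minimal graphs.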
Figure S3. Response is plotted as a function of depicted effect size for the three types of axis range conditions (full, minimal, and standardized) for Experiment 1. The bottom right panel shows the correct response. Response was entered as 1 (no effect), 2 (small effect), 3 (medium effect), and 4 (big effect). Each point corresponds to one participant’s response on one trial. The data have been jittered along both axes to enable visibility.

Experiment 2: Bar Graphs with Axis Range of 1.4 SD

Standardized graphs, for which the y‐axis range is a function of the standard deviation, produced better sensitivity and less bias in participants who judged the size of the depicted effect compared with graphs that showed the full range and graphs that showed only the minimal range necessary to see the data. However, sensitivity with the standardized graphs was still below perfect