Statistical significance bars (SSB):

A way to make graphs more interpretable

Christian D. Schunn

University of Pittsburgh

 

Author notes: Christian D. Schunn, LRDC. Correspondence concerning this article should be addressed to: Christian Schunn, LRDC 821, 3939 O'Hara St, Pittsburgh, PA 15260 or by E-mail at: schunn@pitt.edu.

Schunn, C. D. (1999). Statistical significance bars (SSB): A way to make graphs more interpretable. Unpublished manuscript.

 

Abstract

While line and bar graphs are a very common medium for presenting data in psychology, they currently rarely present information that the reader would most like to know: which means are statistically significantly different from one another. I present a new kind of error bar, statistical significance bars (SSBs), that allows the reader to infer more directly and easily whether any two means in the graph are statistically significantly different from one another. This method works for both within- and between-subjects designs and is based directly upon the standard statistical tests that authors would normally use for pairwise means comparisons.

 

Introduction

Tables, line graphs, and bar graphs are a very common way of displaying data in the sciences, especially in psychology. Line graphs and bar graphs are thought to have an advantage over tables in that they make it easy for the reader to notice the effects of variables; that is, they make salient the relative heights of points or lines (Kosslyn, 1989; Loftus, 1993; Shah & Carpenter, 1995). However, simply knowing which points are above or below other points is not typically sufficient for determining the effects of variables. Because the researcher has typically gathered only a small subset of all possible data, there is a question of statistical significance: how certain are we that the true population means actually differ? Here again, line graphs and bar graphs have the potential to be more informative than tables. By adding the appropriate error bars to the graph, simple visual comparisons may be used to determine how likely it is that the differences between sample means can be attributed to noise. Unfortunately, the appropriate error bars are never used.

There are two senses in which appropriate error bars are not used. First, most line and bar graphs display no error bars at all.1 This lack of error bars can be readily seen by examining current psychology journals. To estimate recent use rates, I coded the 1998 issues of three high-profile journals that span many areas of psychology: Journal of Experimental Psychology: General, Psychological Review, and Psychological Science.2 Sixty-six percent of the articles included at least one line or bar graph. Of these graphs, two thirds displayed no error bars at all. In a historical analysis of the journal JEP:LMC, Estes (1997) found a similar dearth in the use of error bars in graphs (contrasting sharply with a recent rise in the presentation of variance information in tables).

The second sense in which appropriate error bars are not used is that, even when some kind of error bar is presented, it is usually not an appropriate error bar for easily determining the statistical significance of differences between points. From the coding of the 1998 journal articles, when authors did use error bars, the most common type was standard error bars. Specifically, 22% of graphs contained standard error bars, 2% contained standard deviation bars, 1% displayed confidence intervals, and 6% did not describe which error bars were displayed. As I will shortly explain, none of these error bars is appropriate for the task of easily judging the statistical significance of differences between means.

It is likely that these two aspects are related to one another: people may not use error bars because they do not know which error bar is appropriate to use, or because they know that all of the common types of error bars are inappropriate. A survey given to faculty at my own psychology department found a variety of reasons for not presenting error bars in journal articles (in decreasing order of frequency): not being required by editors to include error bars, not knowing how to add error bars in their graphing package, not knowing which type of error bar to use, believing that error bars are irrelevant to their purposes, aesthetics, the effort required to add error bars, and worrying that the error bars they know how to add make the effects look less reliable than they actually are. Moreover, the current APA publication manual (APA, 1994) provides no guidance on whether to use error bars in graphs, and if so, which ones are most appropriate. Other style guides on graph construction also provide little or no guidance on this issue (e.g., Cleveland, 1985; Gillan, Wickens, Hollands, & Carswell, 1998; Tufte, 1983).

What is the appropriate error bar to use? The answer to this question depends in part on the type of information being sought. Table 1 presents a summary of various information one might seek from an error bar and how well each type of error bars conveys that information. The function of this paper is to introduce a new kind of error bar that is the best choice for one of the most common goals of data analysis in psychology: assessing the statistical significance of pairwise comparisons between means.

 

Table 1. Comparison of standard deviation, boxplots, standard error, 95% confidence intervals, root mean square error, and statistical significance bars for assessing distribution information (range and homogeneity of variance), effect size, absolute location of the population means, pairwise statistical significance of differences between means, and applicability to repeated measures designs.

 

              Distribution   Effect size   Location of   Differences     Repeated
                                           true means    between means   measures
SD            ++             ++            +             +               +
Boxplot       +++            +             +             +               +
SE            +              +             ++            ++              +
95% CI        +              +             +++           ++              +
RMS           +              +             +             ++              ++
SSB           +              +             +             +++             +++

Note. + = poor, ++ = ok, +++ = best

 

To illustrate this goal concretely, consider the following example. A psychologist runs a study with three conditions and displays the resulting condition means in a line graph.3 The readers of the psychologist's article will want to know which condition means are statistically significantly different from one another. To determine this, the reader would normally have to refer to the text in which the results of an ANOVA and pairwise post hoc contrasts are described. The graph by itself is thus potentially misleading, because apparently significant differences may not be significant, or may at best be ambiguous. A much more desirable state of affairs would be for the reader to be able to determine directly and easily from the graph which means are significantly different from one another. If this were possible, then the reader would not need to navigate the clumsy text or appendices outlining the results of all pairwise contrasts.

 

The Existing Alternatives

While most graphs contain no information about variance, some graphs do contain error bars of some kind and the reader typically tries to assess statistical significance using these error bars. In this section, I will step through the various existing types of error bars that can be used, demonstrating why they are inappropriate for the task of determining the statistical significance of pairwise means comparisons. This overview will serve to enumerate concretely the desirable features of an appropriate type of error bar. Then I will introduce a new type of error bar that has all of these desirable features.

Standard deviation bars.

Standard deviation is useful when the absolute magnitude of the within-group variance is of interest (see Figure 1a). For example, theoretical models occasionally make predictions regarding variance. As another example, when criteria are to be used to separate groups in clinical settings, the within-group variance is also important for assessing the reliability of the criterion. However, standard deviations are not appropriate variance estimates for assessing statistical significance between means because they do not reflect the sample size. As the sample size increases, the variance (as reflected in the standard deviation) will remain stable, whereas the variance in the estimate of the mean decreases. Thus, to begin to assess statistical significance using standard deviation bars, the reader must know the n of each point and scale the standard deviation bars appropriately. This computation is no trivial task, especially in complex experimental designs with uneven ns.

Similar complaints could be made of boxplots and interquartile range bars. They are quite useful for examining group variance, but quite impractical for assessing statistical significance of means comparisons.

Standard error bars.

In contrast to standard deviations, standard error bars (see Figure 1b) do make use of the sample size. Specifically, the standard error is equal to the standard deviation divided by the square root of n. This particular error information is highly relevant to statistical means comparisons. However, standard errors do not convey information regarding the criterion associated with an α level. The statistical significance of a mean comparison is defined relative to a particular test statistic (e.g., t or F). Since these statistics are typically a complex function of the degrees of freedom and an α level, a reader cannot easily estimate statistical significance with only an estimate of variance. For example, are two means significantly different when their difference is more than 1.5, 2, 2.5, or 3 times the size of the standard error?
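The relationship between the standard deviation and the standard error can be sketched in a few lines of Python (the sample values here are purely hypothetical):

```python
import math
import statistics

# Hypothetical sample of condition scores.
sample = [12.1, 13.4, 11.8, 12.9, 13.0, 12.4]

sd = statistics.stdev(sample)       # sample standard deviation
se = sd / math.sqrt(len(sample))    # standard error of the mean

# Unlike the standard deviation, the standard error shrinks as n grows.
```

The point of the sketch is simply that the standard error, but not the standard deviation, reflects the sample size.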

Second, the statistical tests are typically done with pooled estimates of variance (i.e., one estimate of mean variance across all condition means). Thus, to properly assess statistical differences, the graphically displayed variance should also be a pooled variance estimate. Unfortunately, most graphing packages default to displaying condition specific standard error bars (which will vary in size from condition to condition). Note: if the difference in variance across conditions is very large, then pooled variance is inappropriate both for the display and for the statistical tests.

Third, standard error bars are not the appropriate error information for within-subjects designs (see Estes (1997) and Loftus and Masson (1994) for a detailed exposition of this issue). For within-subjects designs, the variability within conditions (which standard error bars convey) is irrelevant. Instead, it is the variability in the difference between conditions that is important. Estes (1997) recommends the use of root mean squared error. This measure is appropriate for within-subjects designs and is a pooled measure of variance. However, it does not reflect a criterion associated with an α level threshold.

Figure 1. An example graph displaying A) standard deviation bars, B) standard error bars, C) 95% Confidence intervals, and D) .05 Tukey HSD statistical significance bars.

Confidence intervals.

In contrast to standard error bars, confidence intervals solve the first of the three problems with standard error bars (see Figure 1c). That is, confidence intervals do reflect a criterion associated with an α level. Specifically, the size of the confidence interval is simply the standard error multiplied by a criterion (e.g., t or F), which can be found in a statistical table using the degrees of freedom and a given α level. Unfortunately, the confidence intervals that authors typically use suffer from the other two problems with standard error bars. First, they reflect the variance within each condition rather than the pooled variance that is used in the statistical tests. Second, they are also not appropriate for within-subjects designs, for exactly the same reasons.

These last two problems can be addressed using modified confidence intervals that are based on pooled variance and are appropriate for within-subjects designs (see Loftus and Masson (1994) for a presentation of these modified confidence intervals). However, confidence intervals, including these modified confidence intervals, have an important additional problem: confidence intervals are not for comparing pairs of condition means. Instead, confidence intervals are used for comparing observed means to theoretical population means (i.e., how likely is it that the observed mean could have arisen from a population with a certain theoretical mean). Confidence intervals are not directly useful for comparisons across means. The problem is that the standard error associated with the estimate of a difference between two sample means is larger by a factor of √2 (1.414214...) than the standard error associated with the estimate of a single mean. Thus, the confidence intervals are too small by a factor of about 1.4. To judge statistical significance accurately, a reader armed only with confidence intervals would need a ruler and a calculator (assuming the reader even knew that this was required).
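The √2 factor can be checked directly: the variance of a difference of two independent mean estimates is the sum of their variances, so the standard error of the difference is √2 times the standard error of a single mean. A minimal sketch, with an arbitrary illustrative standard error:

```python
import math

se = 0.25  # standard error of each condition mean (arbitrary illustrative value)

# Var(M1 - M2) = Var(M1) + Var(M2) for independent estimates,
# so the SE of the difference is sqrt(2) times the SE of a single mean.
se_diff = math.sqrt(se**2 + se**2)

ratio = se_diff / se  # sqrt(2), about 1.414
```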

 

A New Alternative: Statistical Significance Bars

The preceding discussion illuminated the problems with existing error bars and the desirable features of an error bar. To that list of desirable features, I will add another important one: a display construct should be consistent with the usual habits of a reader (Larkin & Simon, 1987). In other words, the construct should have good visual affordances. To understand this feature in the current context, one must think about the typical habits of an individual reading a line or bar graph. When confronted with a graph containing error bars, the reader is likely to look for overlapping error bars. When the error bars for two means overlap, the reader will assume the means do not differ from one another. If the error bars do not overlap, the reader will assume the means do differ from one another. Unfortunately for the reader, none of the existing error bars accurately supports this inference procedure. What is desired of a good error bar, then, is one that matches this behavior. That is, for a good type of error bar, when any two bars overlap, the two means should never be statistically different from one another, and when any two bars do not overlap, the two means should always be statistically different from one another.

Let us define a new display construct, statistical significance bars (SSB), in exactly these terms. A statistical significance bar is an error bar that displays pairwise statistical significance between means using the visual overlap test. It turns out that such an error bar is quite easy to calculate using information readily available in an ANOVA table.

SSB also has a simple relationship to existing error bars. For between-subjects designs, the SSB is simply the (pooled) confidence interval divided by √2, or the (pooled) standard error multiplied by a criterion value (usually a t statistic with a given number of degrees of freedom and α level) and then divided by √2. The justification for dividing by √2 is quite simple. To test whether two points are significantly different, a normal confidence interval is too small by a factor of √2 (as described above). Thus, we multiply by √2. However, this will produce a result in which points are significantly different from one another if one point is not inside the error bars of the other. To move to a situation in which one uses the overlap of the bars (rather than the overlap of a point within a bar), one must divide the bars in half. Thus, we multiply by √2 and divide by 2, which is the same as simply dividing by √2.

Pooled standard errors or pooled confidence intervals are not usually given in standard statistical package outputs, and thus do not provide a practical definition. Instead, one can easily define an SSB in terms of the output of an ANOVA table. Let MSEc be the mean squared error term associated with the contrast being graphed. Let dfc be the degrees of freedom associated with that error term. Let n be the number of data points contributing to each given mean being displayed.4 Then, the a priori SSB is defined as (a priori SSB equation):

SSB = t(α, dfc) × √(MSEc / n) / √2

where t(α, dfc) is the two-tailed t statistic criterion (i.e., α/2) with dfc degrees of freedom. This one formula is applicable to both within- and between-subjects designs.
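As a sketch of the a priori computation in Python (the t criterion of 2.101 is an illustrative table value for 18 degrees of freedom at a two-tailed .05 level, and the MSE and n are hypothetical):

```python
import math

def a_priori_ssb(t_crit, mse_c, n):
    """A priori SSB: t(alpha, df_c) * sqrt(MSE_c / n) / sqrt(2)."""
    return t_crit * math.sqrt(mse_c / n) / math.sqrt(2)

# Illustrative values: t(.05 two-tailed, df=18) ~ 2.101, MSE_c = 0.615, n = 10.
ssb = a_priori_ssb(2.101, 0.615, 10)
```

In practice, the t criterion would be taken from a t table (or a statistics package) for the actual error degrees of freedom and α level.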

Truth in advertising: Post hoc contrasts. Since the statistical significance of differences between particular means is usually tested using post hoc contrasts, it is usually more appropriate to use SSBs that are also based on post hoc contrasts. Since we are generally interested in allowing the reader to make inferences about all pairwise comparisons between means in the graph, an appropriate post hoc procedure is the Tukey Honestly Significant Difference (HSD) test. To compute an SSB using the Tukey HSD test, the following formula is used (post hoc SSB equation):

SSB = Q(α, k, n) × √(MSEc / n) / 2

where k is the number of conditions (or the number of cells in an interaction graph), n is the number of data points contributing to each mean being displayed, MSEc is the mean squared error term associated with the contrast being graphed, and Q(α, k, n) is the Q statistic criterion taken from the Studentized range table found in the appendices of many statistics books. Note that we divide by 2 in this equation, not by √2 as in the a priori equation. To indicate to the reader that a post hoc SSB is being used (and which post hoc procedure is being used), the name of the post hoc procedure and the α level should be indicated either on the graph or in the figure caption (e.g., .05 Tukey HSD SSB).

Tests other than Tukey's HSD test may be used. For example, for binary data, a corresponding SSB may be constructed from the logistic regression. In general, whatever test the author would use to test the pairwise significance between points can be applied to create an SSB. The goal of including SSBs is to have the figure match the inferential statistics as closely as possible.

A simple example. Table 2 presents the results of an ANOVA applied to sample data taken from Loftus and Masson (1994). The dependent measure is the mean number of words recalled (out of 20). The independent measure, manipulated within participants, is the exposure time to each word: 1, 2, or 3 seconds per word. The means for this data set are displayed in Figure 1. To compute the SSBs, we first determine the MSE for the error term. In this case, the error term is the Block by Participant interaction, and so the MSE is 0.615. The values of n and k are 10 and 3. The Q statistic for a .05 α level, k=3, df=10 is 3.88. Substituting these values into the post hoc equation produces an SSB of length 0.48. The graph using this SSB is displayed in Figure 1d. The scales are held constant across Figures 1a through 1d. This scale is unusually large for Figure 1d, but was selected to accommodate the large standard deviation bars. A replotting of Figure 1d is presented in Figure 2 to make the group differences clearer.5 Templates demonstrating how this SSB was computed and how to create a graph with an SSB using Microsoft Excel© can be found by visiting the web address: http://hfac.gmu.edu/SSB.
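The arithmetic of this example can be reproduced directly from the post hoc SSB equation (the Q value is the table criterion quoted above):

```python
import math

mse_c = 0.615  # MSE of the Block by Participant error term
n = 10         # data points contributing to each mean
q = 3.88       # Q(.05, k=3, df=10) from a Studentized range table

ssb = q * math.sqrt(mse_c / n) / 2
print(round(ssb, 2))  # 0.48
```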

 

Table 2. ANOVA table used as an example of how to calculate SSBs.

Term                   df   Sum of Squares   Mean Square   F-value   p-value
Participant             9   942.533          104.726
Block                   2   52.267           26.133        42.506    <.0001
Block X Participant    18   11.067           0.615

 

Figure 2. The example graph from Figure 1D (with .05 Tukey HSD statistical significance bars) with the scales expanded.

 

In this case, the SSBs are much smaller than the SE bars. The relationship between SSBs and SE bars will vary from situation to situation, which is exactly why one cannot rely on SE bars to determine statistical significance. In a between-subjects design, there is a fixed relationship between the size of the SE and the size of the .05 a priori SSB: assuming equal variance, the SSB is 1.39 times the size of the SE (1.96 divided by √2). In a within-subjects design, SSBs can be much smaller or even larger than the SE bars, depending on the consistency of the within-subjects effect. In the presented example, the effect was very consistent across participants (producing the small SSBs), but the absolute performance levels across participants were quite variable (producing large SE bars). If the effect had been less consistent across participants, or if there had been less variance in absolute performance levels across participants, the SSBs could have been as large as or even larger than the SE bars.
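The 1.39 figure follows from dividing the large-sample two-tailed .05 criterion (1.96) by √2, which can be checked with the standard normal distribution from the Python standard library:

```python
import math
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

ratio = z / math.sqrt(2)  # SSB-to-SE ratio for large df, about 1.39
```

For small samples the t criterion exceeds 1.96, so the ratio would be somewhat larger.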

Interactions. SSBs are easily applied to graphs displaying interactions among variables. The same formulas described above are used, except that the value for k in the equation is the number of cells in the interaction. For example, in a 2x3 interaction graph, there are 6 cell means, and so k is 6. MSEc is the mean squared error term used in testing the interaction. This is true for both within and between subjects designs.

Occasionally, authors prefer not to use error bars for interaction line graphs because the error bars from one line obscure the data points from the other line, creating a cluttered display. In these cases, interaction bar graphs with SSBs may be used instead.

As a third issue, occasionally the difference in slopes rather than the pairwise means comparisons is of interest in an interaction graph. In this case, authors should plot mean differences between conditions, with SSBs derived from the error term for the interaction effect. This difference-score plot with SSBs will allow the reader to quickly determine whether the effect of one variable is statistically different across conditions of the other variable.

 

General Discussion

On the use of statistical significance cutoffs.

While the shorter term, significance bars, was considered, it was rejected because there is an important difference between statistical significance (i.e., how likely the observed difference is to be due to chance) and semantic significance (i.e., how large or how important the effect is). Semantic significance is better conveyed by displaying effect sizes (e.g., the difference between condition means divided by a measure of within-group variance) or simply by displaying within-group variance (e.g., with standard deviation bars or boxplots).

As Table 1 indicates, SSBs are not appropriate for displaying some kinds of variance information. In psychology, our dependent measures are often continuous, our independent measures are often discrete, and our goal is to determine the pairwise statistical significance of differences between means. For this situation and goal, SSBs are the most desirable error bar. For other goals, other error bars should be used. For example, if the author wishes to convey information about the distribution of a variable, either within or across groups, then boxplots should be used. If effect size is the more desirable information, standard deviation bars are more helpful.

As a related issue, there has been recent debate about whether significance testing should be banned entirely because of the many logical errors it produces (Cohen, 1994; Loftus, 1993). It might be argued that the use of SSBs would further support poor reasoning about α levels. However, there is nothing inherently illogical in the appropriate use of significance testing (Dixon, 1998). Moreover, there is nothing in particular about SSBs that would support less logical thinking about significance testing. SSBs merely allow one to visually determine whether a given α level criterion has been met, which is what the text that accompanies figures currently indicates. If anything, using SSBs may improve thinking about significance testing. For example, in the text it is easy to make sharp but logically questionable distinctions between minor differences in p-values (p=.051 as non-significant and p=.049 as significant). Using SSBs in a graph makes it more difficult to make such sharp distinctions. Instead, three qualitatively different situations are possible: 1) error bars clearly overlap, indicating a p-value much above .05; 2) error bars clearly do not overlap, indicating a p-value much below .05; and 3) error bars are close to overlapping, indicating a p-value of approximately .05. Thus, readers are discouraged from making sharp distinctions between p=.051 and p=.049.

The current proposal to use SSBs does not include removing in-text inferential statistics. Inferential statistics presented in text will continue to provide important information. For example, by including information about degrees of freedom, the authors clarify the kinds of aggregation that was used in the statistics. Instead, the current proposal is to augment the in-text inferential statistics by having the line and bar graphs clearly indicate the results of the inferential statistics.

Alternatives to SSBs.

An alternative scheme to using SSBs is to include markings (symbols or coloring) indicating which means are statistically different from other means. However, these markings are not always easily interpreted. For example, one can have three groups A, B, C ordered A<B<C. Here, A and C can be significantly different from one another, and yet there are no statistically significant differences between A and B or between B and C. How should the mean for group B be labeled? With more than three points, more complicated patterns of statistical significance are possible.

In considering various kinds of error bars, Estes (1997) argues against the use of confidence intervals and other constructs that incorporate statistical criteria tied to particular α levels. Specifically, Estes raises the issue that lower α levels give wider bands, seemingly giving less confidence in the location of a point. As the better alternative, Estes argues that authors should plot the appropriate pooled measure of variance (e.g., RMS) on the graphs, allowing the reader to mentally multiply the bars to produce the appropriate significance threshold test. However, mental multiplication by non-integers (e.g., by 1.39) is not a simple perceptual task. Also, most readers do not have the appropriate post hoc criteria memorized.6 Finally, each paper implicitly adopts a given α level for determining what is statistically significant and what is not. That same α level can be used in plotting SSBs. Thus, there will be no confusion about the precision of the data from varying α levels because the α levels will not typically vary. Since most psychology papers use .05, it will probably be best to use this convention in graphing as well. Alternatively, one might use two-tiered error bars, which display both .05 and .01 error bars.

Excuses.

At the beginning of this paper, I listed several commonly mentioned reasons for not including error bars in graphs. The majority of these reasons have now been addressed. Which error bar to use has been clarified, as has how to compute it. And because SSBs are matched directly to the inferential statistics, statistically significant differences will continue to look significant. Excuses relating to the abilities of the graphing package (e.g., it being difficult or impossible to add error bars) were never good excuses. Such an excuse is equivalent to not reporting accurate reaction times because of not knowing how to use data collection software (Gillan et al., 1998). There are currently many readily available and inexpensive graphing packages that easily allow the addition of user-specified error bars to line and bar graphs. Examples of how to calculate SSBs, including a Studentized range table, and how to construct the appropriate graph in Microsoft Excel© can be found on the world wide web at: hfac.gmu.edu/SSB. With these other excuses removed, perhaps journal editors will also begin to require the inclusion of error bars in graphs.

 

Footnotes

1. By error bars, I mean all visual display constructs that are used to convey variance information, including standard deviation bars, standard error bars, confidence intervals, interquartile range bars, and boxplots.

2. This included all issues of 1998 for each journal, producing 26 Psychological Review articles, 17 JEP:G articles, and 79 Psychological Science articles.

3. There is some debate regarding when to use line graphs versus bar graphs, relating to issues of avoiding unwarranted interpolation, ease of reading, and aesthetics. Since the same error bars can be used in either case, this debate will not be discussed further here.

4. In the case of uneven ns, one can use the harmonic mean, k/(1/n1 + 1/n2 + ... + 1/nk), as long as the ratio between the largest n and the smallest n is no more than three to one (Rankin, 1974).
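The harmonic mean of footnote 4 can be computed as follows (a small sketch; the group sizes are hypothetical):

```python
def harmonic_mean_n(ns):
    """Harmonic mean of group sizes: k / (1/n1 + ... + 1/nk)."""
    return len(ns) / sum(1.0 / n for n in ns)

# Hypothetical uneven group sizes with a largest-to-smallest ratio under 3:1.
n_eff = harmonic_mean_n([8, 10, 12])
```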

5. Of course, in some cases it is important to display the full range of the scale rather than magnify the effect.

6. For other arguments against the use of such error bars, see (Cleveland, 1985).

 

References

American Psychological Association (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.

Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.

Cohen, J. (1994). The Earth is round (p<.05). American Psychologist, 49, 997-1003.

Dixon, P. (1998). Why scientists value p values. Psychonomic Bulletin & Review, 5(3), 390-396.

Estes, W. K. (1997). On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review, 4(3), 330-341.

Gillan, D. J., Wickens, C. D., Hollands, J. G., & Carswell, C. M. (1998). Guidelines for presenting quantitative data in HFES publications. Human Factors, 40(1), 28-41.

Kosslyn, S. M. (1989). Understanding charts and graphs. Applied Cognitive Psychology, 3, 185-225.

Larkin, J. H., & Simon, H. A. (1987). Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11, 65-100.

Loftus, G. R. (1993). A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age. Behavioral Research Methods, Instruments, & Computers, 25, 250-256.

Loftus, G. R., & Masson, M. E. J. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin & Review, 1(4), 476-490.

Rankin, N. O. (1974). The harmonic mean method for one-way and two-way analyses of variance. Biometrika, 61, 117-129.

Shah, P., & Carpenter, P. A. (1995). Conceptual limitations in comprehending line graphs. Journal of Experimental Psychology: General, 124, 43-61.

Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.