Statistical models are sometimes judged informally. For example, distributional assumptions might be judged by considering a histogram; homoscedasticity might be judged by examining a plot of the residuals across a regression line; two variations of a model might be judged holistically by comparing several pieces of information, such as measures of complexity-corrected fit, out-of-sample prediction error, and/or other graphical or numerical information. Such judgments matter, as different statistical models of the same data set could lead to substantially different conclusions (e.g., Silberzahn et al., 2018). The primary goal of this paper is to compare the effectiveness of informal and formal judgments of statistical models, and specifically judgments that are often referred to as model diagnostics, misspecification tests, or assumption checking.

For several reasons, we focus primarily on judgments of normality. Normality assumptions are common, as they appear in the general linear model and, by extension, in all models of this type (e.g., ANOVAs, t tests). Various normality assumptions also underlie other commonly used statistical procedures, ranging from simple bivariate correlations to structural equation models. When normality assumptions are violated, the general linear model and other commonly used tests can produce inflated Type I and Type II error rates, as well as other undesirable properties (Bishara & Hittner, 2012; Kelley, 2005; Levine & Dunlap, 1982; Sawilowsky & Blair, 1992; West, Finch, & Curran, 1995). Such violations may be common because nonnormality is common in psychological and educational data sets (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Cain, Zhang, & Yuan, 2017; Micceri, 1989).

Incorrect normality assumptions can cause problems even in large samples. For instance, when a confidence interval is constructed using a method that incorrectly assumes normality, as sample size increases, confidence interval coverage can actually decrease (e.g., Bishara, Li, & Nash, 2018). Furthermore, as n increases, even though parametric models sometimes become robust in terms of Type I error, other models may still be preferable (see Sawilowsky & Blair, 1992). For example, consider the power of an independent-samples t test as compared with its nonparametric analog, the Mann–Whitney–Wilcoxon (MWW) test, also known as the rank-sum test. As shown in Fig. 1, as n increases from 5 to 200, the power of an independent-samples t test will sometimes increase at a slower rate than that of the MWW test. In other words, a large sample size does not guarantee that a model which assumes normality will be preferable to one that does not.

Fig. 1

As n increases, the power of nonparametric tests (e.g., Mann–Whitney–Wilcoxon test [MWW]) sometimes increases at a faster rate than that of parametric ones (independent-samples t test). As one example, this figure shows power to detect a difference between two means in an independent-groups design, with both groups drawn from skewed populations (χ2 with df = 1, and population effect size d = .5). In other situations (not shown here) where populations are approximately normal, the t test generally has higher power than MWW. In still other situations, the t test has higher power with small ns, but MWW has higher power with large ns. Power was estimated through 10,000 Monte Carlo simulations at ns ranging from 5 to 200. The 95% CI of each plotted point is less than ±.010. Though not shown here, neither test exceeded a Type I error rate of .060 (which would be significantly greater than .050). The equal-variance t test is shown, but Welch’s unequal-variance t test had similar power, also below that of MWW
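For readers who wish to explore this pattern, a simulation of this kind can be sketched in a few lines of R. The sketch below is illustrative rather than the exact code behind Fig. 1: it assumes χ2(1) populations, a shift of d = .5 population standard deviations added to one group, and R’s default Welch t test as the parametric comparison.

```r
# Illustrative power comparison (not the original Fig. 1 code): two groups drawn
# from a chi-square(df = 1) population, with a true shift of d = .5 SDs in one group.
power_sim <- function(n, reps = 10000, d = 0.5, df = 1, alpha = .05) {
  sd_pop <- sqrt(2 * df)                       # SD of a chi-square(df) population
  p_t <- p_mww <- numeric(reps)
  for (i in seq_len(reps)) {
    g1 <- rchisq(n, df)
    g2 <- rchisq(n, df) + d * sd_pop           # add a true mean difference of d SDs
    p_t[i]   <- t.test(g1, g2)$p.value         # parametric test (Welch by default)
    p_mww[i] <- wilcox.test(g1, g2)$p.value    # nonparametric analog (MWW)
  }
  c(power_t = mean(p_t < alpha), power_mww = mean(p_mww < alpha))
}

set.seed(1)
sapply(c(5, 25, 100, 200), power_sim, reps = 2000)  # fewer reps for a quick check
```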

Unfortunately, though, there is no single agreed-upon method for judging normality; the numerous available methods belong to one of two families. One family involves informal judgment, often using histograms, Q–Q plots, or P–P plots. A second family involves formal statistical methods, such as the Kolmogorov–Smirnov test, the Shapiro–Wilk test, and many others. How well do these methods distinguish between nonnormal and normal distributions? Relatedly, how well do they distinguish between nonnormal and approximately normal—that is, normal enough for a parametric model to have higher power to detect a true nonzero effect without increasing the Type I error rate?

Informal versus formal diagnostic decisions

The two families of normality judgment can be viewed as belonging to two more general strategies for diagnostic decisions. One general strategy is to rely on informal intuitive judgment, perhaps of an expert. A second strategy is to rely on a formal, mechanical decision rule, often involving a formula with numeric cutoff values that determine the decision. A large body of empirical work has shown that the second strategy can often outperform the first—that is, formal decision rules can often match or beat informal judgments, even those of experts (Dawes, Faust, & Meehl, 1989; Meehl, 1954; Swets, Dawes, & Monahan, 2000). In one meta-analysis, informal judgments of medical and psychological experts were, on average, outperformed by formal decision rules when making predictions about diagnoses, prognoses, and personalities (Grove, Zald, Lebow, Snitz, & Nelson, 2000). Additionally, experts’ informal judgments are sometimes outperformed by relatively simple decision rules—even rules created to mimic those same experts in their use of cues—partly because such decision rules are more consistent than are informal human judgments (Camerer, 1981; Dawes, 1971; Karelaia & Hogarth, 2008). The superiority of formal decision rules often occurs in situations where experts do not receive immediate or clear feedback about their decisions (Kahneman & Klein, 2009; Shanteau, 1992), although it sometimes occurs even despite such feedback (e.g., Goldberg, 1968). In short, the broader literature on diagnostic decision-making suggests that formal judgments often do as well as or better than informal ones.

Much of the above-described research pertains to diagnostic decisions about human behavior or disease, but what of decisions about data patterns? Researchers commonly make decisions about data by informally judging a graph, and it is customary to do so in single-case experimental designs (Skinner, 1956). For such designs, empirical studies have shown that expert judgments of graphs sometimes have high interrater reliability (Kahng et al., 2010), but not always (Parker & Brossart, 2003). Unfortunately, expert judgments sometimes lead to excessive Type I and Type II errors (Matyas & Greenwood, 1990). Perhaps because of these findings, in single-subject designs, the focus has been shifting toward more formal statistical analyses (Fisch, 2001; Manolov, Gast, Perdices, & Evans, 2014; Smith, 2012), or at the very least toward formal quantifications of visual depictions (Lane & Gast, 2014).

Especially pertinent to normality judgment is the informal judgment of scatterplots. With scatterplots, a pattern of dots close to a positively sloped line indicates a strong positive correlation. Many visual depictions of normality involve a similar principle. For example, with Q–Q plots and P–P plots (see Fig. 2), dots close to the positively sloped reference line indicate an approximately normal distribution. Empirical research on judgments of scatterplots has shown that human judges are sensitive to correlations depicted in them, albeit not perfectly (Rensink, 2017). For instance, human judges tend to overemphasize the distance between the dots and the reference line, and underemphasize other cues, such as slope and scale (Lane, Anderson, & Kellam, 1985). Additionally, there is error in informal human estimates, and the error tends to increase as r approaches zero (Doherty, Anderson, Angott, & Klopfer, 2007).

Fig. 2

Examples of graphs used in experiments here. In each case, the green reference curve/line shows what is expected if the sample is perfectly normal, whereas bars/dots show the actual sample. Q–Q (quantile–quantile) plots show the observed scores on the y-axis and the theoretically expected quantiles derived from a normal distribution on the x-axis. The first Q–Q plot row has the reference line drawn through the first and third quartiles of the data (the default in the software R). The stable Q–Q plot involves a fixed scale and a reference line along the diagonal, so that the reference line remains stable across different data sets. P–P (percent–percent) plots show the observed versus theoretically expected scores on the scale of percentiles. Q–Q plots tend to magnify the deviations from the normal distribution in the tails, whereas P–P plots tend to magnify the deviations in the center

Unfortunately, the existing literature offers no empirical studies of informal judgments of normality. That is, there is no direct empirical evidence to indicate the superiority of some informal methods over others, or to compare informal judgments to formal hypothesis tests. In the absence of direct empirical evidence, researchers might instead rely on the advice of authorities.

Review of popular statistics textbooks

To gauge the current advice about normality judgments, we reviewed 20 of the most popular statistics textbooks, operationally defined here as the Top 10 Amazon Best Sellers in the Statistics category and the Top 10 library holdings as indicated by the WorldCat database (see references with † or ‡, respectively; for details, see Supplement 1). Unsurprisingly, most textbooks included basic definitions of normal distributions, as well as curve and/or histogram depictions of them. However, only eight textbooks offered specific recommendations for judging whether normality had been adequately satisfied or not.

Of these eight books, all recommended at least one visual inspection method, most commonly Q–Q plots (6), followed by histograms (4), and P–P plots (3), with some textbooks recommending more than one type of graph. Q–Q plots, in addition to being mentioned by several books, were treated as essential in some. For example, the most popular book from the Amazon Best Seller set (Triola, 2012) suggested that bell-shaped histograms alone could not assure normality, and so Q–Q plots must also be used. Additionally, a text with especially comprehensive coverage of normality (Field, 2013) suggested that Q–Q plots would be easier to interpret than P–P plots, at least in large samples. Thus, among visual inspection methods, popular textbooks showed a preference for Q–Q plots.

Regarding formal statistical tests to evaluate normality, only five books described at least one statistical test, most commonly the Kolmogorov–Smirnov test (3), followed by the Shapiro–Wilk, Anderson–Darling, Pearson χ2, and Ryan–Joiner test, and also tests of skewness and kurtosis values (one book each). Additionally, one book encouraged comparing the correlation of Q–Q plot coordinates to critical correlation values (Sullivan, 2017; see Looney & Gulledge, 1985), a method similar to the Shapiro–Francia test. Although the most commonly mentioned test was the Kolmogorov–Smirnov test, the books did not specifically endorse this test as preferred. Indeed, one book noted that it was less powerful than the Shapiro–Wilk test (online supplement of Field, 2013; for evidence, see Shapiro, Wilk, & Chen, 1968; Thode, 2002). Table 1 provides a summary of the major normality tests examined in the present research. These tests were chosen because they were well-known (Kolmogorov–Smirnov), well-supported in the simulation literature (Shapiro–Wilk, Shapiro–Francia, Anderson–Darling), somehow analogous to informal graph judgments (Pearson χ2, Shapiro–Francia), or because of some combination of these reasons.

Table 1 Summary of formal hypothesis tests of normality examined here
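For illustration, all of the tests summarized in Table 1 can be run on a single sample in R. The sketch below assumes the nortest package for the tests that are not in base R; it is not code from the present experiments.

```r
# Illustrative sketch: applying the formal tests summarized in Table 1 to one
# sample (assumes the nortest package; shapiro.test() is in base R).
library(nortest)   # lillie.test, ad.test, sf.test, pearson.test

set.seed(2)
x <- rchisq(60, df = 3)     # a mildly skewed sample of n = 60

shapiro.test(x)$p.value     # Shapiro-Wilk
sf.test(x)$p.value          # Shapiro-Francia
ad.test(x)$p.value          # Anderson-Darling
lillie.test(x)$p.value      # Kolmogorov-Smirnov (Lilliefors version)
pearson.test(x)$p.value     # Pearson chi-square goodness of fit
```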

A paradox of formal assumption tests

Interestingly, some popular textbooks expressed concerns about formal normality tests (Field, 2013; McClave, Benson, & Sincich, 2014). Formal tests may produce significant results (“nonnormal” decisions) too easily, and so they may be too sensitive to tiny deviations from normality in the data. The problem becomes worse as n increases, because formal tests then reject the null of normality for vanishingly small deviations from it. That is, in large samples, a formal normality test may lead a researcher to adopt a nonparametric test even when a parametric test would be more powerful. This concern is paradoxical because large samples are thought to be more problematic than small ones, a pattern opposite that of most other statistical situations.

This paradox matters, as it often leads to the encouragement of informal judgment and discouragement of formal tests. This paradox appears not only in popular texts but also on websites that offer statistical advice. Furthermore, this paradox generalizes beyond normality assumption tests to other model diagnostics. For example, many models rely on equal variance assumptions (homoscedasticity, and relatedly, homogeneity of variance). Formal tests of these assumptions have equal variance as the null hypothesis. Therefore, large samples cause equal variance to be rejected for even trivial variance differences. The paradox can even arise with Bayesian models. For instance, if n is large, posterior predictive checks can produce small p values for trivial differences between empirical and posterior predictive distributions.
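The large-n behavior can be seen directly in a small simulation. The sketch below is an illustration only, not an analysis from the present experiments; the mixture used to create a “trivially” nonnormal population is an arbitrary choice.

```r
# Illustrative sketch of the paradox: a population with a fixed, trivial amount
# of skew is rejected increasingly often by Shapiro-Wilk as n grows.
# (The mixture weight 0.05 is an arbitrary choice of a "trivial" deviation.)
trivially_skewed <- function(n) rnorm(n) + 0.05 * rchisq(n, df = 1)

set.seed(3)
for (n in c(50, 500, 5000)) {
  reject <- mean(replicate(500, shapiro.test(trivially_skewed(n))$p.value < .05))
  cat("n =", n, " rejection rate at alpha = .05:", reject, "\n")
}
```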

We suggest that this paradox can be resolved via signal detection theory, and even via a simplified version of it that only makes use of receiver operating characteristics (ROCs; Peterson & Birdsall, 1953). The paradox reflects a problem not with formal judgments per se, but rather with the criterion used in them. The typical criterion is determined by setting alpha to some fixed value (.05), which may lead to too many decisions in one direction rather than another. Of course, the criterion can be easily adjusted simply by changing the alpha level. To examine all possible criterion settings, it is useful to plot ROCs, which illustrate trade-offs between true positives (y-axis) and false positives (x-axis; for a review, see Macmillan & Creelman, 2004). ROCs have proven useful for a diverse array of problems in psychological research (McFall & Treat, 1999; Swets et al., 2000; Wixted & Mickes, 2014; for a history see Wixted, 2020), but ROCs can be especially useful for assessing statistical hypothesis tests, where the true positive rate is equivalent to power, and the false positive rate is equivalent to the Type I error rate.

Consider, for example, a hypothetical formal test of normality (perhaps the Shapiro–Wilk test in a particular situation) that produces power of .70 with a Type I error rate of .05. That is, it has a 70% chance to correctly label a sample “nonnormal” when it truly came from a nonnormal population, and a 5% chance to incorrectly label it “nonnormal” when it truly came from a normal population. This situation is illustrated by Point A in Fig. 3.

Fig. 3

The hypothetical receiver operating characteristics (ROCs) indicate that Method B is better able to discriminate between two types of stimuli (e.g., normal versus nonnormal) than Method A. Methods A and B could be formal hypothesis tests or informal human judgments. The curves indicate all possible combinations of power and Type I error rate that could be achieved for each method by adjusting the criterion (alpha for formal tests, confidence threshold for informal judgments)

Other methods for assessing normality might produce different combinations of power and Type I error rate. Point B illustrates a different method that yields higher power (.90), with the same Type I error rate as Point A. The curves show how the two methods that produced Points A and B would perform at all possible criterion settings—that is, alpha settings. These curves illustrate a clear advantage for Method B over Method A. At any given criterion, Method B results in higher power, lower Type I error rate, or both. In signal detection terms, Method B has higher discriminability. That is, Method B is better able to distinguish between normal and nonnormal distributions. Generally, higher discriminability is more desirable, and is represented by a curve closer to the upper left corner of the ROC plot. Discriminability is often measured as the proportion of the graph area under the curve (AUC). An AUC of 1.0 would indicate perfect discriminability. In Fig. 3, Method B has a higher AUC (.97) than Method A (.89). AUC can be interpreted as the probability that stimuli from two different populations will be correctly ranked. For example, the AUC of the Shapiro–Wilk test represents the probability that a randomly chosen sample from a nonnormal population has a smaller p value on the Shapiro–Wilk test than does a randomly chosen sample from a normal population.
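The ranking interpretation of AUC is easy to make concrete with a small simulation. In the sketch below (an illustration only; the χ2(2) “nonnormal” population is an arbitrary choice), AUC is the proportion of nonnormal/normal sample pairs in which the nonnormal sample receives the smaller Shapiro–Wilk p value.

```r
# Illustrative AUC-as-ranking calculation: probability that a random nonnormal
# sample gets a smaller Shapiro-Wilk p value than a random normal sample.
set.seed(4)
p_normal    <- replicate(2000, shapiro.test(rnorm(60))$p.value)
p_nonnormal <- replicate(2000, shapiro.test(rchisq(60, df = 2))$p.value)

# Proportion of correctly ranked pairs (ties counted as half).
auc <- mean(outer(p_nonnormal, p_normal, "<")) +
  0.5 * mean(outer(p_nonnormal, p_normal, "=="))
auc
```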

Although the above-discussed Methods A and B could represent formal hypothesis tests, they could also represent informal judgments. Informal judgments of normality produce power and Type I error rates that may be better or worse than those of formal tests. Additionally, informal judgments can also be made at different criterion levels, or thresholds of confidence. For example, Point A might represent an extreme level of confidence: “definitely not normal.” In contrast, Point “a” might represent a different threshold, the sum of “definitely not normal” and “probably not normal” judgments. The more relaxed criterion in “a” yields higher power, but also a higher Type I error rate.

Using ROCs to examine all possible criterion settings, do formal judgments still appear to be problematic, as the paradox suggests they are? Or, as suggested by the literature on diagnostic decision-making, could formal judgments perform as well as informal ones, or perhaps even better? In Experiments 1–2, the primary goal was to identify the most discriminating informal judgment type by comparing performance across different types of graphs (histograms, Q–Q plots, and P–P plots). Additionally, even in these early experiments, informal graph judgments were compared with formal hypothesis tests on the same data sets. Experiments 3–4 relied on the most discriminating graph type identified in earlier experiments, using this graph type to compare informal and formal judgments in more ecologically valid contexts. Experiments 3–4 also examined whether informal or formal judgments would lead to higher statistical power of follow-up tests, tests chosen based on these normality judgments.

Experiment 1: Histograms and Q–Q plots

In the absence of existing data on informal judgments of normality, a reasonable starting point would be to test the common textbook advice that Q–Q plots are easier to judge than other graphs. So, participants were randomly assigned to make judgments of either Q–Q plots or of the second most commonly mentioned graph: histograms. On each trial, a sample of 60 values was simulated from either a normal or nonnormal population. A graph of this sample was presented on a computer screen, and participants judged it on a 6-point scale ranging from “Definitely Normal” to “Definitely NOT Normal.” To explore the potential for learning in this task, participants judged 80 graphs without feedback, then 320 with feedback, and finally 80 without. Feedback consisted of an indication of the correct answer (“Normal” or “NOT normal”) after each response.

Method

Participants

A total of 46 participants were recruited through an introductory psychology course participant pool, and they participated individually in private laboratory rooms. Three participants did not complete the study due to a program or scheduling error. Based on preliminary analyses (all blind to graph condition to avoid bias), we decided to exclude any participant with a median response time of less than 500 milliseconds or accuracy less than .40 during any stage of the experiment (pretraining, training, or posttraining). These restrictions excluded only one participant. The final data set had 42 participants (31 females). Participants were compensated with course credit.

Design and materials

The experiment had a 2 (graph type: histogram vs. Q–Q plot) × 6 (trial block) factorial design, with graph type between subjects and trial block within subjects. Each participant was randomly assigned to view histograms (n = 24) or Q–Q plots (n = 18). There was one pretraining block, followed by four training blocks, and then one posttraining block. Only the training blocks provided feedback.

All graphs were generated in the programming language R (R Core Team, 2016). Histograms were generated using the default “hist()” function (for binning details, see Sturges, 1926). For reference, a green normal bell curve appeared on each histogram. Q–Q plots were generated using the “qqnorm()” function and a reference line (also in green) created by the “qqline()” function. By default, this line passed through the 1st and 3rd quartiles of the data. All graphs omitted axis labels so that participants could focus on data shapes and feedback while learning. See the first two rows of Fig. 2 for examples.
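A rough reconstruction of these defaults is sketched below (an approximation for illustration, not the exact stimulus-generation code).

```r
# Approximate reconstruction of the stimulus graphs (sketch, not the exact code).
set.seed(5)
samp <- rnorm(60)

# Histogram with a green normal reference curve; here the histogram is scaled to
# a density axis so the curve overlays it (the exact stimulus scaling may differ).
hist(samp, freq = FALSE, main = "", xlab = "", ylab = "")
curve(dnorm(x, mean(samp), sd(samp)), add = TRUE, col = "green", lwd = 2)

# Q-Q plot with the default reference line through the 1st and 3rd quartiles.
qqnorm(samp, main = "", xlab = "", ylab = "")
qqline(samp, col = "green", lwd = 2)
```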

Each graph had a sample size of 60 drawn from either a normal or nonnormal population. There were four different types of nonnormal populations: (a) no skewness and negative kurtosis, (b) no skewness and positive kurtosis, (c) positive skewness and positive kurtosis, and (d) negative skewness and positive kurtosis. To define these, let the kth central population moment be:

$$ {\mu}_k=E\left[{\left(x-\mu \right)}^k\right], $$
(1)

where μ with no subscript is the population mean. Population skewness and kurtosis are then defined, respectively, as:

$$ {\gamma}_1=\frac{\mu_3}{\sigma^3}, $$
(2)
$$ {\gamma}_2=\frac{\mu_4}{\sigma^4}-3, $$
(3)

where σ is the population standard deviation. In a normal population, γ1 = γ2 = 0. There were four nonnormal populations used here: (a) γ1 = 0, γ2 = −1, (b) γ1 = 0, γ2 = 10, (c) γ1 = 2, γ2 = 8, and (d) γ1 = −2, γ2 = 8. These values were chosen in an attempt to make the nonnormal situations equally difficult, at least for visual inspection methods. Nonnormal distributions were generated using the fifth-order polynomial family (Headrick, 2002), with approximate densities shown in Fig. 4.
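To make Eqs. 1–3 concrete, the sketch below computes sample analogs of γ1 and γ2 (simple population-style estimators without bias correction; for illustration only).

```r
# Sample analogs of Eqs. 1-3 (population-style estimators with no bias
# correction, shown only to make the definitions concrete).
central_moment <- function(x, k) mean((x - mean(x))^k)
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3  # excess

set.seed(6)
skewness(rnorm(1e5))            # approximately 0 for a normal population
kurtosis(rchisq(1e5, df = 1))   # population gamma2 is 12; sample estimates are noisy
```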

Fig. 4

Approximate densities of populations used in Experiment 1. n = normal; a = no skewness and negative kurtosis; b = no skewness and positive kurtosis; c = positive skewness and positive kurtosis; and d = negative skewness and positive kurtosis (see Design and Materials for details)

There were 64 samples generated from each of the 4 types of nonnormal population, and 256 normal samples, for a total of 512. Of these 512 samples, the first 480 were used as critical stimuli (counted in the analyses) and the last 32 were reserved. From this reserve set, 3 relatively average stimuli were chosen as instruction examples by selecting the median Shapiro–Wilk test p values from the “normal” and “not normal” categories. The 480 critical stimuli were assigned to the 6 blocks (80 stimuli per block). Assignment was random for each participant with the constraint that each block contained half normal and half nonnormal stimuli. Presentation order within blocks was random with the constraint that each 8-trial sub-block contained 4 normal and 4 nonnormal stimuli (1 of each nonnormal type). The experiment was programmed in E-Prime. Technical details of graphs and normality tests can be found in Supplement 2, and all materials and code to generate them can be found at https://osf.io/msv72.

Procedure

Participants first answered demographic questions. Next, the researcher read aloud instructions adapted from Triola (2012, pp. 57, 297). In the histogram condition, participants were told that the graph was not normal if it had “rectangles that depart dramatically from the bell-shaped curve.” In the Q–Q plot condition, they were told that the graph was not normal if the “circles do not lie reasonably close to a straight line, or the circles may show some systematic pattern that is not a straight-line pattern.”

Next, participants placed the middle three fingers of each hand on the six keys from “c” to “,” on the computer keyboard. The six keys had colored stickers corresponding to a color-coded, 6-point Likert scale that was visible on the screen. The colors from left to right were light green, green, dark green, dark red, red, light red. The Likert scale from left to right was “Definitely Normal,” “Probably Normal,” “Guess Normal,” “Guess NOT Normal,” “Probably NOT Normal,” “Definitely NOT Normal” (see Supplement 3). Participants were encouraged to use the whole range of this scale and were informed that they would have 10 seconds to decide for each graph. They were also informed that each graph would have an equal chance of being normal or not.

The pretraining block consisted of 80 trials without feedback, where each button press led to the next stimulus being presented. Next, there were four training blocks with 80 trials each. During these blocks, after either a button was pressed or 10 seconds had elapsed, the screen indicated that the correct answer was either “normal” or “NOT normal,” in light green or light red font, respectively. This feedback remained on the screen for 3.5 seconds before the next trial. Finally, there was a posttraining block of 80 trials without feedback. At the end of the experiment, participants were asked to estimate their percentage of accurate decisions for the experiment as a whole (all six blocks), and for the posttraining block in particular (see Supplement 4 for details).

Data analysis

The experiments here were intended to inform decisions about which statistical models to use. However, we must choose which statistical models to use to analyze data from these experiments. To avoid circularity, we decided ahead of time to use procedures that are usually robust to nonnormality and other typical assumption violations. First, all reported confidence intervals were constructed using bootstrapping with bias correction and acceleration (BCa; via Kirby & Gerlanc, 2013). Second, the measure of discriminability was computed without making distributional assumptions for signal and noise. In other words, discriminability was computed from the empirical ROCs rather than from curve-fitted ROCs. Third, in all analyses of variance (ANOVAs), df with decimals indicate that a significant violation of sphericity occurred and the Greenhouse–Geisser correction was applied. Fourth, for two-sample t tests, df always have decimals because Welch’s unequal-variance version was routinely applied without a precursor assumption test (see Moser & Stevens, 1992; Zimmerman, 2004).
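For readers unfamiliar with BCa bootstrapping, a generic sketch using the base R boot package is shown below; the actual analyses used the approach of Kirby and Gerlanc (2013), and the data in the sketch are simulated.

```r
# Generic BCa bootstrap CI for a mean (illustrative analog of the procedure
# described above; not the exact analysis code).
library(boot)

set.seed(7)
scores <- rnorm(40, mean = 0.3)                       # simulated stand-in data
boot_out <- boot(scores, statistic = function(d, i) mean(d[i]), R = 10000)
boot.ci(boot_out, type = "bca")                       # bias-corrected and accelerated CI
```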

Results and discussion

As shown in Fig. 5, accuracy increased across blocks, and was higher for Q–Q plots than for histograms. Accuracy was calculated by collapsing across confidence level. For example, if the graph showed a sample from a normal population, any of the “normal” confidence responses (“definitely,” “probably,” or “guess”) was considered accurate; any of the “nonnormal” confidence levels was considered inaccurate. A 2 (graph type) × 6 (trial block) ANOVA showed that accuracy was significantly higher for Q–Q plots than for histograms, F(1, 40) = 7.36, p = .01, ηp2 = .16, a medium effect. There was also a significant effect of trial block, F(3.87, 154.7) = 10.5, p < .001, ηp2 = .21, a medium effect. Finally, the interaction between trial block and graph type was not significant, F(3.87, 154.7) = .45, p = .77, ηp2 = .01. Of most importance, though, is performance posttraining, when feedback was no longer available, just as in realistic situations. Posttraining accuracy was significantly higher for Q–Q plots (.80) than for histograms (.73), and with a large effect, t(32.7) = 2.74, p = .010, d = .878, 95% CI [.114, 1.619]. However, most formal hypothesis tests were more accurate (range: .78–.93) than informal judgments, except for the Pearson χ2 test, which was similar to judgments of Q–Q plots, especially in later blocks.

Fig. 5

Proportion of decisions that were accurate in Experiment 1. Black lines (Q and H) indicate accuracy of informal human judgments, where any “normal” response (“definitely,” “probably,” or “guess”) was considered accurate for normal trials, and any “NOT normal” response was considered accurate for nonnormal trials. Colored lines indicate accuracy of formal hypothesis tests using α = .05 as the criterion. Because the same samples were used to generate stimuli for all participants, and a formal test is consistent each time it is used on the same sample, the formal tests are shown as constants. Random assignment of samples to blocks created a trivial amount of variability in formal test accuracy (e.g., Shapiro–Wilk test accuracy ranged from .925 to .935 across blocks). Kolm.-Smir. (L) = Kolmogorov–Smirnov (Lilliefors version), Pears. = Pearson

Accuracy scores neglect differences among confidence levels in informal judgments, and likewise, different possible alpha levels that could be used in formal tests. To consider the whole range of confidence levels and alpha levels, Fig. 6 shows the ROCs of formal tests and mean informal judgments (for pretraining and posttraining blocks). As shown in Fig. 6, for a given Type I error rate, the Shapiro–Wilk test typically had higher power than any other method, whether formal test or informal judgment of a graph. One can also see that Q–Q plot performance tended to be better than histogram performance, but worse than that of formal tests.

Fig. 6

In Experiment 1, power of hypothesis tests across all possible alpha levels, and power of human decisions across all possible confidence thresholds. Kolm.-Smir. (L.) = Kolmogorov–Smirnov (Lilliefors version)

To confirm previously established accuracy patterns, but with the ROCs, we examined area under the curve (AUC), which is simply the area between the curve and the bottom and right borders of the graph expressed as a proportion of the total graph area (Ag in Macmillan & Creelman, 2004). Confirming the accuracy analyses, posttraining Q–Q plot judgments had significantly higher AUC (.826) than posttraining histogram judgments (.749), again with a large effect, t(30.1) = 2.86, p = .008, d = .931, 95% CI [.181, 1.715]. Nevertheless, posttraining Q–Q plot AUC was still significantly lower than the AUC of all formal tests, as is shown in Table 2.

Table 2 Posttraining discriminability as measured by area under the curve (AUC) for all experiments (1–4)

Overall, for informal judgments, discrimination between normal and nonnormal populations was better with Q–Q plots than with histograms, and this performance improved across blocks of the experiment. However, formal tests performed better than informal judgments, and this was especially true for the Anderson–Darling, Shapiro–Francia, and Shapiro–Wilk tests.

Experiment 2: Q–Q plot variants

It is possible that the Q–Q plot used in the previous experiment might be improved upon, especially considering research on the closely related issue of scatterplot judgments. Such research has shown that scatterplot interpretation can be improved by making constant the shape and size of the plot (Doherty & Anderson, 2009). Such research also suggests that graph judgments could be improved by making the reference line constant, as is the default in some software packages (e.g., SPSS, Minitab). To examine this possibility, participants were randomly assigned to judge one of three types of graphs: Q–Q plots used in the previous study, stable Q–Q plots, or P–P plots (see Fig. 2). Both stable Q–Q plots and P–P plots have reference lines that are consistently on the diagonal, as well as fixed ranges of the x-axis and y-axis. These properties could make stable Q–Q plots and P–P plots easier to judge.

Method

Only differences relative to the previous experiment are reported here and in subsequent Method sections.

Before conducting the experiment, to estimate a target sample size, a power analysis was conducted using the effect size for the accuracy difference between histograms and Q–Q plots posttraining (d = .88). To achieve .95 power to detect a comparable difference between two independent sample means would require approximately 35 participants per condition (using R's "pwr" package, pwr::pwr.t.test(d=.88, power=.95, type="two.sample", alt="two.sided"); Champely, 2020).
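Written out as a runnable call (with the default sig.level of .05 assumed), the calculation is:

```r
# The power analysis described above, as a runnable call (default sig.level = .05).
library(pwr)
pwr.t.test(d = .88, power = .95, type = "two.sample", alternative = "two.sided")
# The returned n is per group; rounding up gives roughly 35 participants per condition.
```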

In Experiment 2, 117 participants completed the experiment, but three were excluded by the same criteria as in Experiment 1. The remaining 114 participants were, by random assignment, in either the Q–Q plot (n = 37), stable Q–Q plot (n = 39), or P–P plot (n = 38) condition.

For the Stable Q–Q plot condition, the x-axis and y-axis shared the same range, spanning from the minimum to the maximum of the combined x and y plotting coordinates. The reference line was set to have a y-intercept of 0 and a slope of 1, resulting in a consistent diagonal line.

Whereas Q–Q plots show actual versus expected points on the scale of z scores (or similarly, raw scores), P–P plots show actual versus expected points on the scale of percentiles. Because percentiles are a non-linear transformation of z scores, the placement of dots is different. The reference line was identical to that of the Stable Q–Q plot condition.
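One way to construct both plot types is sketched below; this is an illustration of the conventions just described, not the exact stimulus code.

```r
# Sketch of the stable Q-Q plot and P-P plot conventions (not the exact stimulus code).
set.seed(9)
samp <- as.numeric(scale(rnorm(60)))    # z-scored sample
n    <- length(samp)
theo <- qnorm(ppoints(n))               # theoretically expected normal quantiles

# Stable Q-Q plot: shared axis range, reference line fixed on the diagonal.
rng <- range(c(theo, samp))
plot(theo, sort(samp), xlim = rng, ylim = rng,
     xlab = "Expected quantiles", ylab = "Observed quantiles")
abline(0, 1, col = "green", lwd = 2)

# P-P plot (one common construction): normal CDF values of the sorted sample
# plotted against empirical cumulative probabilities, with the same diagonal line.
plot(ppoints(n), pnorm(sort(samp)), xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Expected cumulative probability", ylab = "Observed cumulative probability")
abline(0, 1, col = "green", lwd = 2)
```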

In Experiment 1, one nonnormal type (γ1 = 0, γ2 = 10) was harder to visually discriminate than others, adding an unintended source of noise. To address this issue, this nonnormal type was changed to γ1 = 0, γ2 = 20 in Experiment 2.

Participants completed the experiment in groups of up to 11 at a time in a computer classroom. To ensure comprehension of instructions, instructions were followed by a multiple-choice test. If a participant did not achieve 100% correct on this test, the instructions and test were repeated until he/she did so. Instructions for all three conditions were identical to those in the Q–Q plot condition of the previous experiment (for details, see https://osf.io/msv72).

Results and discussion

As with Experiment 1, formal hypothesis tests tended to outperform informal graph judgments, with the Shapiro–Wilk test showing the highest mean performance levels. As shown in Table 2, informal judgments of any graph type (even posttraining) had significantly lower discriminability as compared to any formal hypothesis test.

To further examine informal judgments, a 3 (graph type) × 2 (trial block: pretraining vs. posttraining) ANOVA was conducted on AUC. There was a significant main effect of graph type, F(2, 111) = 9.60, p < .001, ηp2 = .15, qualified by a significant Graph Type × Trial Block interaction, F(2, 111) = 7.25, p = .001, ηp2 = .12, both medium effects. As shown in Fig. 7, at pretraining, Q–Q plots had lower discriminability than Stable Q–Q and P–P plots, F(2, 111) = 15.23, p < .001, ηp2 = .22, with Tukey’s HSD showing significantly lower AUC for Q–Q than for Stable Q–Q and P–P plots, ps < .001, but no significant difference between Stable Q–Q and P–P plots, p = .75. That is, at least before training, it was easier to judge plots when they had a stable, consistent reference line, a finding consistent with data on judgments of scatterplots (Doherty & Anderson, 2009). After training, however, performance was similar across the three graph types, F(2, 111) = 1.64, p = .20, ηp2 = .03, suggesting that the choice among particular Q–Q and P–P plots may have little impact. Stable Q–Q plots had the highest mean performance of the three graphs, and so these plots were used in all remaining experiments.

Fig. 7

In Experiment 2, power of hypothesis tests across all possible alpha levels, and power of human decisions across all possible confidence thresholds. Kolm.-Smir. (L) = Kolmogorov–Smirnov (Lilliefors version)

Experiment 3: Ecological validity

The previous experiments suggested that several formal hypothesis tests of normality perform better than informal judgments. Do such findings generalize to more realistic situations? Experiment 3 involved several modifications to address this question.

Researchers often wish to judge normality to inform decisions about other routine hypothesis tests, such as whether two or more means differ in a population. In such a situation, the decision about normality is intended to improve the more important decision about whether the means are equal. So, the goal is not to determine whether the population is exactly normal or not. Rather, the goal is to determine whether the population is close enough to normal for one test—often a parametric test—to produce fewer errors (Type I and/or II) than an alternative.

To represent such a real-world decision, in Experiment 3, participants made decisions about whether a sample came from an approximately normal distribution or not. In particular, approximately normal was operationally defined here as any population distribution shape close enough to normal for a common parametric test to produce more power than its nonparametric analog. Conversely, “not normal” was defined as any population shape different enough from normal that the nonparametric test was more powerful than the parametric one. Specifically, the parametric test here was an independent-samples t test, and the nonparametric test was the MWW test. These tests were chosen for two reasons. First, these tests are used for one of the simplest and most common experimental designs—a two-condition between-participant experiment, frequently used for comparison of a treatment group to a control group. Second, both the independent-samples t and MWW tests have acceptable Type I error rates under the conditions studied here, and so their performance can be compared solely on the basis of power (i.e., 1–Type II error probability). So, in addition to discriminability, we also examined the power implications of both informal and formal judgment, that is, how normality judgments would lead to more or less powerful tests of equal means.

To further mimic real-world decisions, in Experiment 3, distribution shapes for stimuli were generated by stratified sampling of combinations of skewness and kurtosis observed in actual psychological and educational data (via Cain et al., 2017). Figure 8 shows the resulting combinations of skewness and kurtosis, along with the distinction between approximately normal (green) and not (red).

Fig. 8

Experiments 3 and 4 used skewness and kurtosis combinations reported in the literature. Green (+) indicates approximately normal combinations, with higher power for an independent-samples t test. Red (▽) indicates not normal combinations, with higher power for a Mann–Whitney–Wilcoxon (MWW) test. Gray (•) indicates less than .01 difference in estimated power. Lower case letters indicate combinations used in Experiment 1 (e.g., n = normal). Gray curves show the boundary for mathematically possible combinations. Figure scales were transformed (Yeo & Johnson, 2000) so outliers would be visible

Additionally, because researchers have incentives to reach valid, replicable conclusions, in Experiment 3, participants were provided with a tangible incentive for performance. Participants bet points on normality judgments, with the number of points bet indicating confidence. The participant who achieved the highest number of points at the end of the experiment was awarded a $75 bonus.

Finally, in many real-world settings, researchers judge normality across various sample sizes. To provide at least some semblance of this real-life complexity, in Experiment 3, each participant judged stimuli of two different sizes (small: n = 30, and large: n = 120). It was expected that both participants and formal hypothesis tests would perform better in the large sample condition, where more information was available. Of primary interest, though, was whether the advantage of formal tests over informal judgment would remain in an experiment that better resembled realistic situations.

Method

Participants

Initially, 39 participants were recruited from introductory undergraduate statistics courses in the Math department. No participant met the a priori exclusion criteria described in Experiment 1. One participant fell asleep during the experiment, and so his data were excluded, resulting in a final sample of n = 38. Participants were compensated with course extra credit, and the best-performing participant also received $75.

Design

Stimulus sample size was manipulated within-participants such that half of the trials showed a stable Q–Q plot with a small sample (30 dots) and half with a large sample (120 dots). To give participants a greater chance of matching or beating the performance of formal tests, the pretraining block was replaced with a training block (80 trials with feedback).

Materials

Population distributions for stimuli were based on Cain et al. (2017). Cain and colleagues contacted 503 researchers who had published in the flagship journal of either the Association for Psychological Science (Psychological Science) or the American Educational Research Association (American Educational Research Journal). They requested, among other information, the univariate skewness and kurtosis of all continuous variables used in each publication. Their request yielded 1,567 cases. We excluded 26 cases due to missing values, and seven due to mathematically impossible combinations of skewness and kurtosis (\( {\gamma}_2<{\gamma}_1^2-2 \)). In pilot simulations, some extreme combinations of skewness and kurtosis produced tied observations, which led to problems with power calculations. To avoid these problems, we excluded 65 cases where pilot simulations produced more than 10% ties, leaving 1,469 remaining cases.

For each remaining case, first, a super-sample of n = 1,500,000 was generated from a standardized population with that case’s skewness and kurtosis values. This was accomplished via the Pearson distribution family and the PearsonDS package in R (Becker & Klößner, 2016). This super-sample was then divided into 10,000 samples of size n = 30, and 10,000 samples of size n = 120. Each sample was then subdivided into two groups of equal size, representing two independent samples (ns = 15 and 60, respectively). To make the alternative hypothesis true, one of the two groups had d = .5 added to each observation. Power of the t test was estimated as the proportion of the 10,000 samples that produced a t-test p value less than .05. Power was estimated accordingly for the MWW test. We verified that the Type I error rate was approximately .05 by also conducting these tests where the null was true (i.e., d = 0). All tests showed an estimated Type I error rate less than .06, which is expected, as the simulation margin of error for the proportion was ±.010. However, in 1.7% of cases, the Type I error rate was lower than .04, and this was more common for t tests (1.5%) than for MWW (.2%). These cases were all in the small sample size (n = 30) condition, and they appeared to be outliers because the average Type I error rate was similar for the two tests, and even slightly higher on average for t (M = .049) than MWW (M = .047). Nevertheless, with exceptionally low Type I error rates, one could adjust alpha above .05 to achieve higher power while keeping the Type I error rate below .05. We examined this possibility through 10,000 simulations with the outlying cases. Adjusting alpha upward to achieve a Type I error rate of .05 usually improved power for both t tests and MWW, but in no case did it change which test had the higher power. In other words, adjusting alpha in these outlying cases did not change which procedure—t versus MWW—was preferable.
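A simplified sketch of this power-estimation step for a single skewness/kurtosis case is shown below. It generates each sample directly rather than splitting a super-sample, uses fewer replications, and assumes the PearsonDS convention that the kurtosis moment equals γ2 + 3.

```r
# Simplified power estimation for one skewness/kurtosis case (sketch; the actual
# procedure split a super-sample of 1,500,000 into 10,000 samples).
library(PearsonDS)

estimate_power <- function(skew, exkurt, n_group, d = .5, reps = 2000) {
  mom <- c(mean = 0, variance = 1, skewness = skew, kurtosis = exkurt + 3)
  p_t <- p_mww <- numeric(reps)
  for (i in seq_len(reps)) {
    g1 <- rpearson(n_group, moments = mom)
    g2 <- rpearson(n_group, moments = mom) + d    # make the alternative true
    p_t[i]   <- t.test(g1, g2)$p.value
    p_mww[i] <- wilcox.test(g1, g2)$p.value
  }
  c(power_t = mean(p_t < .05), power_mww = mean(p_mww < .05))
}

set.seed(10)
estimate_power(skew = 2, exkurt = 8, n_group = 15)  # one "small sample" case
```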

For cases where MWW had higher power, the correct answer was defined as “not normal”; for cases where a t test had higher power, the correct answer was defined as “normal.” The results are summarized in Fig. 8. (The greater advantage of the MWW test in the large sample condition occurred because increasing sample size increased the power of the MWW test at a faster rate than that of the t test).

To generate stimuli, 140 cases were randomly sampled from each of the four combinations of sample size and power advantage (t test vs. MWW), for a total of 560 stimuli. Borderline cases, where the power difference was less than the simulation margin of error (±.010), were excluded from this sampling process (see Fig. 8, gray circles). Stimuli were generated without adding a constant to one group (without adding d = .5) so that the stimuli represented residuals, which are typically evaluated rather than raw data. Of the 560 stimuli, a random sample of 480 was used as critical stimuli, and the remaining 80 were reserved for instruction examples. Presentation order was random with the constraint that each combination of conditions (sample size and normal versus not) was equally represented during each 16-trial subblock.

Procedure

Participants were told that “graphs may have small or large sample sizes (few or many dots).” They were also warned that they might need to be more or less lenient about their decision based on the sample size. On each trial, “Large Sample” or “Small Sample” appeared to the left of the graph.

At the start of the first block, the upper right-hand corner showed “Score: 200.” The Likert scale from left to right was “Normal Wager 3,” “Normal Wager 2,” “Normal Wager 1,” “NOT Normal Wager 1,” “NOT Normal Wager 2,” “NOT Normal Wager 3,” where numbers represented the points bet. Participants were allowed up to 12 seconds to respond. After each response in the training block, “Normal” or “NOT normal” appeared below the graph for 4 seconds. Additionally, in the last 1 second, the change in score appeared below the current score. For example, if a participant chose “Normal Wager 2” but the correct answer was NOT normal, “−2” would appear below the current score total. The next trial would then show the updated current score (e.g., 198). During the posttraining block, no feedback information (including the score and score changes) was shown (for details, see https://osf.io/msv72).

Data analysis

We also examined how informal versus formal decisions could lead to power differences not just in tests of normality, but also in the power of a follow-up test to detect a difference between means. Specifically, for each stimulus, we know (i) the power of an independent-samples t test, and (ii) the power of a MWW test, and hence, which test will have the higher power to detect a difference between means. We also observe (iii) the informal human decisions of normality, and (iv) the formal normality test decisions of normality. We sought to determine if using the formal approach (iv) to select between (i) and (ii) would lead to higher power to detect a difference between means than using the informal approach (iii) to select between (i) and (ii). Additionally, we analyzed the proportion of correct decisions—that is, how often a decision (from iii or iv) led to the more powerful follow-up test (i vs. ii). For formal tests, we report a low discrimination test (Pearson χ2) and a high discrimination test (Shapiro–Wilk) to highlight the range of results.

These power and proportion correct analyses require a decision criterion, a threshold between normal and nonnormal decisions. For informal human judgments, we used the threshold between “Normal Wager 1” and “NOT Normal Wager 1” response options. Unfortunately, for formal normality tests, the threshold is more arbitrary, so we examined three: the traditional α = .05, a more stringent α = .005 (Benjamin et al., 2018), and an optimal alpha. Optimal alpha was estimated as the one that maximized power in the long run to detect a difference between means, optimized for this particular set of critical stimuli, where half of stimuli had higher power for a t test, and the other half higher power for a MWW test. Optimal alphas were estimated as .197 and .019 for the Pearson χ2 small and large sample conditions, respectively, and .027 and .0002 for the Shapiro–Wilk test (see Supplement 5 for details).
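The logic of this optimization can be sketched as follows; p_norm, power_t, and power_mww are hypothetical per-stimulus vectors of normality-test p values and follow-up test powers, and the grid of candidate alphas is arbitrary.

```r
# Sketch of the optimal-alpha search: a stimulus is routed to the MWW test when
# its normality-test p value falls below alpha; long-run power is the mean power
# of whichever follow-up test is chosen. (p_norm, power_t, power_mww are
# hypothetical per-stimulus vectors, supplied by the user.)
optimal_alpha <- function(p_norm, power_t, power_mww,
                          grid = seq(.0001, .5, by = .0001)) {
  expected_power <- sapply(grid, function(a)
    mean(ifelse(p_norm < a, power_mww, power_t)))
  grid[which.max(expected_power)]
}
```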

Results and discussion

As shown earlier in Table 2, it was easier to discriminate approximately normal from nonnormal when the stimuli had a larger sample size (i.e., more dots). A paired t test of the AUC showed that stable Q–Q plot judgment of large sample stimuli was significantly higher than that of small sample stimuli (M = .822 and .779), t(37) = 2.11, p = .04, η2 = .11, a medium effect.

More importantly, though, the advantage of formal tests over informal judgment remained despite the more realistic decision context. As shown in both Table 2 and Fig. 9a–b, regardless of the stimulus sample size, stable Q–Q plot discriminability was significantly lower than that of all formal hypothesis tests. One-sample t tests confirmed that small sample stimuli discriminability of stable Q–Q plots (M = .779, SE = .015) was significantly lower than that of every formal test (range: .865–.919), all ps < .001. The same was true for large sample stimuli (stable Q–Q plot: M = .822, SE = .013; formal test range: .891–.946), all ps < .001. The best performing tests were the Shapiro–Francia and Shapiro–Wilk tests. Although the Shapiro–Francia test is a simplified version of the Shapiro–Wilk test, the former had higher discriminability than the latter. This pattern could be due to the decision involving approximate rather than strict normality (a judgment that these tests were not specifically designed for), or perhaps to the particular combinations of the skewness and kurtosis commonly found in psychology and education data.

Fig. 9

In Experiments 3 and 4, power of hypothesis tests across all possible alpha levels, and power of human decisions across all possible confidence thresholds. Experiment 3 (a–b) shows performance of undergraduates. Experiment 4 (c–d) shows performance of PhD-holding experts. Kolm.-Smir. (L) = Kolmogorov–Smirnov (Lilliefors version)

The discriminability advantage of formal normality tests led to follow-up tests (t vs. MWW) that were more powerful, though this power advantage was small and not as robust as other patterns. As shown in Table 3, for small samples, human judgment power (M = .345, SE = .003) was nonsignificantly lower than formal test power (range: .342–.350), .11 < ps < .54. For large samples, human judgment power (M = .827, SE = .002) was significantly lower than the power of most formal testing approaches (range: .829–.833), ps < .05, except for the Pearson χ2 at α = .005 and the Shapiro–Wilk at α = .05, for which the differences were nonsignificant, p = .05 and .57, respectively.

Table 3 Power and Proportion Correct in Experiments 3–4

As shown by proportion correct in Table 3, formal normality tests usually led to choosing a more powerful follow-up test than human judgments did. For small samples, human judgment proportion correct (M = .743, SE = .016) was significantly lower than formal test proportion correct (range: .796–.867), ps < .002. For large samples, human judgment proportion correct (M = .789, SE = .014) was significantly lower than most formal tests’ proportion correct (range: .733–.867), ps < .003, with the exception of the Shapiro–Wilk at α = .05, which had significantly lower proportion correct than human judgment, p < .001.

Experiment 4: Experts

Previous experiments involved visual judgments from relatively novice participants—undergraduates from introductory psychology and introductory statistics courses. Could formal hypothesis tests beat the performance of expert participants? On the one hand, expertise may aid judgments, as experts may have had years of experience, experience immeasurably greater than that obtained through a few hundred trials with feedback. Indeed, there is some evidence that seasoned graduate students and faculty can discriminate scatterplot patterns more accurately than can novices (Lewandowsky & Spence, 1989). On the other hand, though, for judgments of normality, the type of feedback that experts receive is often ambiguous. Researchers almost never see the true shape of the population from which their samples were drawn. Feedback may consist of proxies for this true shape, such as normality test statistics or p values, but even with such proxy feedback, the researcher could, at best, learn to match the performance of formal tests. Because of such limited feedback, experts may still be unable to beat the performance of formal hypothesis tests of normality (see Karelaia & Hogarth, 2008; Shanteau, 1992). To address this issue, Experiment 4 was conducted largely as a replication of Experiment 3, but with a small sample of experts (n = 3) who had doctoral degrees in relevant subjects.

Method

Participants

One participant had a PhD in psychometrics and quantitative psychology; one in mathematics (this participant specialized in statistics); and one in measurement, evaluation, and research methodology. Participants were recruited via emails to the APA Division 5 (Quantitative and Qualitative Methods) LISTSERV and to local experts. Participants were not authors of this paper, nor were they aware of the results of previous experiments. Participants were paid $75 each, plus a $75 bonus for the participant with the highest final score.

Design and materials

Experiment 4 was programmed with jsPsych (de Leeuw, 2015) to allow web-based participation. Critical stimuli were randomly assigned to either the block of 400 training stimuli or the block of 80 posttraining stimuli, but with the constraint that each combination of conditions (sample size, and normal versus not) was equally represented during each of these blocks.

Procedure

Participants took the experiment using a web browser on a desktop or laptop computer. Touch-input devices (e.g., iPad, smartphones) did not function on the website. Participants were instructed beforehand to select a quiet time and place to avoid distraction. We arranged an agreed-upon time with each participant so that we were on-call in case of questions or problems.

The instruction screens noted that the experiment was adapted from one used with undergraduates, and so some parts of the experiment might seem obvious or might seem like a game. In addition to the instructions from Experiment 3, participants were given the definition of approximately normal: “For the purposes of this study, an approximately normal population is defined as one where it is reasonable to use a two-sample t test rather than a nonparametric analog, the Mann–Whitney U test, because the t test tends to be more powerful. Conversely, a nonnormal population is defined here as one where the Mann–Whitney U test tends to be more powerful.” (For details, see https://osf.io/msv72.)

Data analysis

CIs for each individual subject’s discriminability (AUC) were estimated through bootstrapping with 10,000 stratified replicates (see Robin et al., 2011).
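A sketch of this computation with the pROC package (Robin et al., 2011) is shown below; the labels and confidence ratings are simulated stand-ins for one participant’s data.

```r
# Stratified bootstrap CI on one participant's AUC (pROC; ratings are simulated
# stand-ins, not data from the experiment).
library(pROC)

set.seed(11)
truth  <- rep(c("nonnormal", "normal"), each = 40)        # true population type per trial
rating <- c(rnorm(40, mean = 4.5), rnorm(40, mean = 3))   # stand-in confidence ratings

roc_obj <- roc(response = truth, predictor = rating,
               levels = c("normal", "nonnormal"))
ci.auc(roc_obj, method = "bootstrap", boot.n = 10000, boot.stratified = TRUE)
```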

Results and discussion

As shown in Fig. 9c–d, the average expert judgment appeared to have less discriminability than most formal tests. Perhaps due to the small number of experts, one-sample t tests showed that human discriminability of small samples (M = .817, SE = .051) and large samples (M = .844, SE = .046) were not significantly lower than the discriminability of formal tests (small sample range: .865–.919, large sample range: .891–.946), ps > .15. As shown by the parentheses in Table 2, there was a wide range of performance among the three experts, but even the best-performing expert had lower discriminability than the Shapiro–Francia test for both sample sizes, and lower than most hypothesis tests (Anderson–Darling, Shapiro–Francia, and Shapiro–Wilk) for the small sample size. For small samples, discriminability of individual experts (.725, .826, .900; individual 95% CIs [.600, .850; .704, .925; .786, .985]) corresponded to the 28.9th, 68.4th, and 86.8th percentiles of nonexperts (Experiment 3) in the same condition. For large samples, discriminability of individual experts (.775, .826, .931; individual 95% CIs [.650, .900; .684, .945; .851, .985]) corresponded to the 21.0th, 47.4th, and 100th percentiles of nonexperts. The best-performing expert was a different individual across the two sample sizes.

As shown in Table 3, small sample human judgment power (M = .325, SE = .001) was significantly lower than formal test power (range: .342–.350), ps < .004. For large samples, human judgment power (M = .829, SE = .006) was not significantly different from formal test power (range: .829–.833), .59 < ps < .89. For small samples, human judgment proportion correct (M = .792, SE = .033) was nonsignificantly different from formal test proportion correct (range: .796–.867), .15 < ps < .91. Likewise, for large samples, human judgment proportion correct (M = .800, SE = .025) was not significantly different from formal test proportion correct (range: .733–.867), .12 < ps < .31.

General discussion

Overall, informal graph judgments tended to be inferior to formal hypothesis tests. Indeed, even the Pearson χ2 test, which is one of the oldest and least powerful tests of normality, was usually preferable to informal judgment. The inferiority of informal human judgment occurred despite participants being given several crutches that are rarely, if ever, available in real-world settings. First, participants were trained with hundreds of examples with feedback, feedback that is almost never available in real life because the researcher rarely gets to see the population from which the sample was drawn. Second, participants were informed that each sample had an equal chance of coming from a normal population or not. This kind of base-rate information is rarely known in actual research. Third, participants were shown, at most, two different sample sizes (Experiments 3–4), whereas in actual research sample sizes can vary considerably from study to study, and even from variable to variable within a study. In actual research, informal judgment would have to be calibrated to numerous different sample sizes, making learning even more challenging. For these reasons, the human performance observed in the present experiments is likely an overestimate of what is achievable in realistic settings.

Three formal tests showed relatively strong performance: the Shapiro–Wilk, Shapiro–Francia, and Anderson–Darling tests. These tests performed well not only when the decision was between “normal” and “not normal” (Experiments 1–2), but also when the decision was between “normal enough for a parametric test” and not (Experiments 3–4). Likewise, these tests performed well regardless of whether the true population distribution (the alternative hypothesis) consisted of a variety of experimenter-chosen shapes (Experiments 1–2) or of realistic skewness and kurtosis combinations commonly observed in psychological and educational data (Experiments 3–4). The generality of these results suggests that the Shapiro–Wilk, Shapiro–Francia, and Anderson–Darling tests are all defensible choices (see also Gan & Koehler, 1990; Shapiro & Wilk, 1965; Thode, 2002).

Receiver operating characteristics were crucial for comparing different approaches to assessing normality because overall accuracy is influenced both by the criterion/confidence threshold, which is easy to change, and by each approach’s ability to discriminate (approximately) normal from nonnormal, which is not. By illustrating all possible criteria/confidence thresholds, ROCs showed a clear discriminability advantage of several formal tests over informal judgments. However, with a dependent measure other than discriminability, the advantages of formal tests might be obscured. For example, using the proportion of correct judgments as the dependent variable could produce an apparent advantage for informal judgments, at least if the alpha level for the formal tests is set too high or too low. Indeed, with certain alpha levels, proportion correct could even suggest an apparent advantage of small samples over large ones (see Anderson, Doherty, Berg, & Friedrich, 2005).

ROCs provide a resolution to the sample size paradox that troubles various formal model diagnostics, including normality tests. Rather than treating large ns as worse than small ns, ROCs illustrate that larger sample sizes encourage better discriminability. Larger sample sizes are not problematic in and of themselves; the problem arises if the chosen criterion (α) is too high, allowing too many “not normal” decisions. That is, the real problem that the paradox points to is that the optimal criterion is unclear. However, switching from formal to informal judgment does not solve this criterion problem—it merely hides it. For example, when judging a Q–Q plot, one must still make an arbitrary decision about how close to the reference line is “close enough.” Formal hypothesis tests make this criterion setting explicit, but informal judgment still relies on some criterion, whether explicit or not.
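To make this distinction concrete, the following R sketch, which uses simulated samples rather than the experiments' stimuli, treats the Shapiro–Wilk p value as a graded signal: its discriminability (AUC) improves as n grows regardless of where the criterion is placed, whereas the proportion of correct decisions depends on the alpha that happens to be chosen.

```r
# Illustrative simulation (not the stimuli from the present experiments): Shapiro-Wilk
# p values are used as a graded signal for discriminating normal from nonnormal (here,
# exponential) populations. AUC improves with n at any criterion, whereas proportion
# correct depends on where alpha is set.
library(pROC)
set.seed(1)

sim_pvals <- function(n, reps = 2000) {
  normal_p <- replicate(reps, shapiro.test(rnorm(n))$p.value)
  skewed_p <- replicate(reps, shapiro.test(rexp(n))$p.value)   # one nonnormal shape
  data.frame(truth = rep(c("normal", "nonnormal"), each = reps),
             p     = c(normal_p, skewed_p))
}

for (n in c(25, 100)) {
  d     <- sim_pvals(n)
  auc_n <- auc(roc(d$truth, d$p, levels = c("nonnormal", "normal"),
                   direction = "<", quiet = TRUE))
  acc   <- sapply(c(.005, .05, .50), function(alpha)
    mean((d$p >= alpha) == (d$truth == "normal")))   # proportion correct at each alpha
  cat(sprintf("n = %3d  AUC = %.3f  proportion correct at alpha .005/.05/.50: %s\n",
              n, as.numeric(auc_n), paste(round(acc, 3), collapse = " / ")))
}
```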

A further concern with informal judgment of graphs is that such judgments create additional researcher degrees of freedom. Researchers could be biased, perhaps unintentionally, to use these degrees of freedom to produce results that are publishable but not replicable (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011). For example, if a primary test of interest produced nonsignificant results, a researcher might be tempted to scrutinize the model for assumption violations, using even small deviations from a reference line to justify one or more additional statistical procedures. The same researcher might not have examined the model assumptions at all, or might have done so with different informal criteria, had the primary test produced a significant result. If the researcher instead used a formal hypothesis test for model diagnostics, both the test and its criterion could be explicitly decided upon beforehand.

While the discriminability advantage of formal tests was clear, the power advantage was more subtle. This could be because the independent-samples t test and the MWW test had similar power for the stimuli used in Experiments 3 and 4: the mean absolute power difference was only .086. More substantial differences in power or other outcomes could arise when choosing among more distinct statistical models or, relatedly, when the choice depends on combinations of assumption violations (normality, homoscedasticity, and/or independence of observations). The experiments here focused on a simple decision involving similar models and only one possible assumption violation.

More generally, the current work suggests that formal approaches to model diagnostics might be more fruitful than previously thought. Unfortunately, there could be several barriers to their use. One barrier may be the misconception that parametric tests are, in general, immune to the effects of assumption violations so long as n is greater than ___ (readers may fill in the blank with 30, 100, 120, or some other rule of thumb). Yet even with relatively large samples, a simple parametric test can be underpowered relative to an alternative (e.g., Sawilowsky & Blair, 1992). Large samples may also fail to help with confidence intervals or other estimates of interest (e.g., Bishara et al., 2018). In other words, it would be an oversimplification to declare parametric tests robust in general (Bradley, 1978). A second barrier may be the worry that assumption tests increase study-wise Type I error by increasing the number of tests. However, a Type I error on an assumption test is different from one on the test of primary interest (e.g., a t test), and an error on one need not lead to an error on the other. Indeed, the situations that are ambiguous enough to lead to an error on an assumption test are also the situations where assumption violations, if any, are likely to be subtle, and therefore to have little impact on the test of interest.

It is more difficult to overcome a final barrier: What is an optimal criterion for an assumption test? One possible approach to the criterion problem would be to sidestep it and simply use robust methods. To avoid circularity, we took that approach here, for example, by using bootstrapped confidence intervals for the major effects of interest. However, models with more detailed assumptions can provide more accurate inference and estimation, at least when those assumptions are met. An alternative approach would be to identify some function that estimates an optimal alpha or other criterion. Such a function may depend on a variety of factors, including the primary hypothesis test of interest (e.g., ANOVA, t test), the alpha level of that test, how well other assumptions are satisfied, the relative sample sizes in multiple-group designs, and even combinations of these factors. It is an open question whether these numerous factors could be incorporated into a function simple enough for practical use.
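As one highly simplified illustration of what such a function might involve, the following R sketch estimates, by simulation, the power of a two-stage procedure (assumption check, then test selection) across candidate alpha values for a single hypothetical scenario; it is not the procedure used in the present experiments, and the population shape, sample size, and effect size are arbitrary choices.

```r
# Hypothetical sketch: simulation-based search over candidate assumption-test alphas
# for one narrow scenario (two independent groups, skewed populations, a fixed shift).
# A practical function would need to cover many more factors (test of interest, its
# alpha, other assumptions, unequal ns, and so on).
set.seed(1)

two_stage_power <- function(alpha_assume, n = 50, shift = 0.5, reps = 2000) {
  hits <- replicate(reps, {
    g1 <- rexp(n)                 # skewed population (hypothetical scenario)
    g2 <- rexp(n) + shift         # same shape, shifted by the true effect
    # Stage 1: check normality in each group at the candidate criterion
    normal_enough <- shapiro.test(g1)$p.value > alpha_assume &&
                     shapiro.test(g2)$p.value > alpha_assume
    # Stage 2: run whichever test the assumption check selected
    p <- if (normal_enough) t.test(g1, g2)$p.value else wilcox.test(g1, g2)$p.value
    p < .05
  })
  mean(hits)
}

alphas <- c(.001, .01, .05, .10, .25, .50)
setNames(sapply(alphas, two_stage_power), alphas)  # power of the two-stage procedure
```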

Limitations

First, our conclusions should not be oversimplified to suggest that all graph judgment or all informal judgment is undesirable. After all, our conclusions were reached by both formal methods (e.g., p < .05) and informal ones (ROC plot judgment). Whether a formal decision rule is superior to informal judgment is an empirical question, and one that should be addressed for the particular statistical task of interest (e.g., Coulson, Healey, Fidler, & Cumming, 2010; Fidler & Loftus, 2009).

Second, our experiments and simulations used samples drawn from continuous population distributions. Caution should be used when data consist of Likert-type ratings with few scale points, or arise from other situations where ties are frequent. Frequent ties can distort the expected linear pattern on Q–Q and P–P plots, and they also necessitate modifications to formal assumption tests (e.g., for the Shapiro–Wilk test; see Royston, 1989).
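A small R sketch illustrates the concern: coarsening a normally distributed variable to a hypothetical five-point scale produces a staircase pattern on the Q–Q plot and can lead the Shapiro–Wilk test to reject normality even though the latent variable is normal.

```r
# Illustration of how frequent ties (e.g., few-point rating scales) distort normality
# diagnostics: the same underlying normal values, coarsened to a hypothetical 5-point
# scale, produce a staircase Q-Q plot and a much smaller Shapiro-Wilk p value.
set.seed(1)
x        <- rnorm(200)
x_likert <- cut(x, breaks = c(-Inf, -1, -0.3, 0.3, 1, Inf), labels = FALSE)  # 1-5 "scale"

par(mfrow = c(1, 2))
qqnorm(x,        main = "Continuous");     qqline(x)
qqnorm(x_likert, main = "5-point, ties");  qqline(x_likert)

shapiro.test(x)$p.value          # typically not significant
shapiro.test(x_likert)$p.value   # typically very small despite a normal latent variable
```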

Third, it is possible that participants’ informal judgments would have been better had they been provided with other cues in addition to the graphs, such as combinations of graphs with one or more formal test results. Conversely, it is also possible that they would have been worse (e.g., Goldberg, 1968), as often happens when cues are redundant (Karelaia & Hogarth, 2008). The Shapiro–Francia test is largely redundant with a Q–Q plot, as the Shapiro–Francia test statistic (W') is a measure of the linearity of dots on that plot. Because of such redundancy, Shapiro–Francia and other correlation type tests (e.g., Shapiro–Wilk) seem unlikely to be beaten by informal judgments that combine these test results with some form of Q–Q plot inspection.
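This redundancy can be made concrete in a few lines of R: the Shapiro–Francia statistic is, in essence, the squared correlation between the ordered sample and the expected normal quantiles that form the Q–Q plot's horizontal axis (here computed with Blom-type plotting positions, which we believe matches the nortest package's sf.test).

```r
# Sketch of why the Shapiro-Francia test is largely redundant with a Q-Q plot:
# W' is (essentially) the squared correlation between the ordered sample and the
# expected normal quantiles plotted on the Q-Q plot's horizontal axis.
library(nortest)
set.seed(1)

x     <- rexp(50)                            # an arbitrary (nonnormal) sample
x_ord <- sort(x)
q_exp <- qnorm(ppoints(length(x), a = 3/8))  # Blom plotting positions (Q-Q x-axis)
W_prime <- cor(x_ord, q_exp)^2               # squared linearity of the Q-Q plot's dots

W_prime
sf.test(x)$statistic                         # should agree with the manual W'
```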

Fourth, assumption tests are not always warranted, and there are situations in which they do more harm than good. A well-known example is testing homogeneity of variance before conducting a two-sample t test. In that situation, the assumption test is unhelpful because the Welch version of the t test makes no equal-variance assumption and yet performs as well as or better than the two-sample t test, even when homogeneity of variance is satisfied (Moser & Stevens, 1992; Zimmerman, 2004). Of course, there is no reason to expect that the pitfalls of assumption tests can be avoided by informal inspection of Q–Q plots or other graphs. Informal judgments carry the same risks as formal hypothesis tests, and possibly more: if they are less discriminating, they effectively act as assumption tests with additional noise influencing the outcome.

Finally, while power and Type I error are important considerations for choosing a model, numerous other factors might also be considered, such as interpretability, precision of estimates, and prediction error. That is, model choice is a complex, multi-attribute decision problem. Even if the decision ultimately relies on some informal weighting of these many attributes, it could still be advantageous to formalize the judgment of individual attributes.

Conclusions

The relative value of formal tests versus informal statistical judgments may appear obvious, at least until one recalls the long-standing debates about these issues (Bakan, 1966; Cumming, 2014; Fisch, 1998; Greenwald, Gonzalez, Harris, & Guthrie, 1996; Nelson, Simmons, & Simonsohn, 2018; Skinner, 1956). Experiments that directly compare formal and informal statistical judgment, particularly within an ROC framework, offer an empirical approach to navigating these debates. The experiments here showed that formal statistical judgment can be more discriminating, resulting in slightly higher power for the chosen statistical model.