Abstract
The goal of (eco-)toxicological testing is to experimentally establish a dose–response or concentration–response relationship and to identify a threshold at which the deviation from “normal” is biologically relevant and probably non-random. Statistical tests aid this process. Most statistical tests rest on distributional assumptions that need to be satisfied for reliable performance. Therefore, most statistical analyses of (eco-)toxicological bioassays apply a sequence of pre-tests, or assumption tests, to identify the most appropriate main test; these sequences are known as statistical decision trees. This approach has several deficiencies, however, which relate to study design, the type of tests used, and sequential statistical testing in general. When multiple comparisons are used to identify a non-random change against the negative control, we propose robust testing, which can be applied generically without the need for decision trees. Visualization techniques and reference ranges also offer advantages over current pre-testing approaches. We aim to promulgate these concepts in the (eco-)toxicological community and to initiate a discussion on regulatory acceptance.
Notes
N.B. After peer review, another study that investigates the robustness of the MLT-Dunnett for count data became available as a preprint: Hothorn and Kluxen (2020) Statistical analysis of no observed effect concentrations or levels in eco-toxicological assays with overdispersed count endpoints. https://doi.org/10.1101/2020.01.15.907881.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Kluxen, F.M., Hothorn, L.A. Alternatives to statistical decision trees in regulatory (eco-)toxicological bioassays. Arch Toxicol 94, 1135–1149 (2020). https://doi.org/10.1007/s00204-020-02690-w
DOI: https://doi.org/10.1007/s00204-020-02690-w