Abstract
The comparison of two quantitative measuring devices is often performed with the Limits of Agreement proposed by Bland and Altman in their seminal Lancet paper from 1986. Sample size considerations were rare for such agreement analyses in the past, but recently several proposals have been made, depending on how agreement is to be assessed and on the number of replicates to be used. We summarize recent developments and recommendations for various situations, distinguishing between method comparison and observer variability studies. These include the current state-of-the-art analysis of, and reporting guidelines for, agreement studies. General recommendations close the paper.
1 Method comparison studies
Formal sample size motivations have been scarce for agreement studies according to the Guidelines for Reporting Reliability and Agreement Studies [1]. Assuming normally distributed differences between two measurement methods with unknown mean µ and standard deviation σ, the paired quantiles of interest in Bland-Altman analysis are \({\theta }_{0.025}\) and \({\theta }_{0.975}\) [2]. Jan and Shieh proposed an exact 95% confidence interval {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)} to cover the central 95% proportion of the differences (see their Supplemental Material A and D for implementations with SAS/IML and R, respectively) [3]. They proposed to base the sample size either on the expected width of {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)}, which must not exceed a predefined benchmark value Δ, or on the observed width of {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)}, which must not exceed Δ with an assurance probability of, say, 90% (see their Supplemental Material B, E and C, F, respectively). The former approach leads to sample sizes similar to those of the latter with an assurance probability of 50%. Shieh advanced this procedure into a formal hypothesis test and compared the Type I error rate of this approach with that of previous two one-sided tests (TOST) procedures [4]. Shieh concluded that TOST procedures in the context of Bland-Altman analysis are too conservative, i.e., their Type I error rates fall short of the nominal level, and that the corresponding sample size formulas are therefore problematic.
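For orientation, the classic Limits of Agreement and approximate confidence intervals for the limits can be computed in a few lines. The following Python sketch uses the standard large-sample approximation from Bland and Altman's original paper; it is an illustration, not the exact interval procedure of Jan and Shieh.

```python
import numpy as np

def bland_altman_loa(x, y):
    """Classic 95% Limits of Agreement with approximate 95% CIs for the
    limits, using the large-sample SE ~ sqrt(3 s^2 / n) (Bland & Altman, 1986)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    mean_d = d.mean()                      # estimated bias
    sd_d = d.std(ddof=1)                   # SD of the paired differences
    lower, upper = mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
    se_limit = np.sqrt(3 * sd_d**2 / n)    # approximate SE of each limit
    return {
        "bias": mean_d,
        "loa": (lower, upper),
        "ci_lower_limit": (lower - 1.96 * se_limit, lower + 1.96 * se_limit),
        "ci_upper_limit": (upper - 1.96 * se_limit, upper + 1.96 * se_limit),
    }
```

For planning purposes, the exact procedure of Jan and Shieh [3] replaces these approximate intervals and drives the sample size via their expected or observed width.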
In the context of variance component analysis, the repeatability coefficient (RC) is derived from the within-subject variance \({\sigma }_{w}^{2}\) and estimated as \(1.96\sqrt{2{\widehat{\sigma }}_{w}^{2}}\). The RC estimates the limit within which 95% of the differences between repeated measurements are expected to lie [5, 6]. The RC can be used with multiple assessments from each subject; in case of single measurements from each subject by each method, the RC coincides with half the width of the Bland-Altman Limits of Agreement [7]. Yi and colleagues proposed an equivalence test for agreement in case of k repeated measurements from each subject (k ≥ 2), using analysis of variance [8]. The test aims at confirming that \(1.96\sqrt{2{\widehat{\sigma }}_{w}^{2}}\) is small enough to be acceptable. It can be formulated as testing H0: \({\sigma }_{w}^{2}\ge {\sigma }_{U}^{2}\) against H1: \({\sigma }_{w}^{2}<{\sigma }_{U}^{2}\), with a predefined unacceptable within-subject variance \({\sigma }_{U}^{2}\). Denoting the difference between \({\sigma }_{U}^{2}\) and the assumed population within-subject variance \({\sigma }^{2}\) as Δ (Δ = \({\sigma }_{U}^{2}-{\sigma }^{2}\), with \({\sigma }^{2}<{\sigma }_{U}^{2}\)), the sample size is derived by determining the degrees of freedom (df) that make both sides of the following equation equal:
\[\frac{{\chi }_{\alpha ,df}^{2}}{{\chi }_{1-\beta ,df}^{2}}=\frac{{\sigma }^{2}}{{\sigma }_{U}^{2}}=\frac{{\sigma }_{U}^{2}-\Delta }{{\sigma }_{U}^{2}},\]
where \({\chi }_{p,df}^{2}\) denotes the p-quantile of the chi-squared distribution with df degrees of freedom.
Here, α and 1-β represent the significance level and the power, respectively. The number of subjects to be included is derived from df and depends on the number of repeated measurements k. For k = 2, the number of subjects is equal to the df; for k > 2, it is equal to df/(k − 1).
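Under normality, df·\({\widehat{\sigma }}_{w}^{2}\)/\({\sigma }_{U}^{2}\) follows a scaled chi-squared distribution, and H0 is rejected when this statistic falls below the α-quantile of the chi-squared distribution with df degrees of freedom; the power condition then reduces to an inequality in chi-squared quantiles that can be solved by direct search. The following Python sketch (our illustration of this standard construction, assuming scipy is available; not the authors' own code) returns the df and the implied number of subjects:

```python
import math
from scipy.stats import chi2

def yi_df_and_subjects(sigma_u2, sigma2, alpha=0.05, power=0.90, k=2):
    """Smallest df satisfying the chi-squared sample size condition for
    H0: sigma_w^2 >= sigma_U^2 vs. H1: sigma_w^2 < sigma_U^2.
    Reject H0 when df * sigma_hat_w^2 / sigma_U^2 < chi2(alpha, df);
    power 1 - beta at sigma_w^2 = sigma^2 requires
    chi2(alpha, df) / chi2(power, df) >= sigma^2 / sigma_U^2."""
    assert sigma2 < sigma_u2, "assumed population variance must lie below sigma_U^2"
    for df in range(1, 100_000):
        if chi2.ppf(alpha, df) / chi2.ppf(power, df) >= sigma2 / sigma_u2:
            # k = 2: subjects = df; k > 2: subjects = df / (k - 1)
            n_subjects = df if k == 2 else math.ceil(df / (k - 1))
            return df, n_subjects
    raise RuntimeError("no solution found within the search range")
```

For instance, with \({\sigma }_{U}^{2}\) = 1, \({\sigma }^{2}\) = 0.5 (Δ = 0.5), α = 0.05, and 90% power, the search returns a df in the tens; with k = 2 the number of subjects equals the df.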
Employing repeated measurements enables the investigation of variance parameters that describe the uncertainty between and within subjects. The increased planning complexity introduced by repeated measurements suggests basing sample size determination on simulation-based methodology [9, 10].
Carstensen pointed out that many observations are required to produce stable variance estimates [10]. He assumed scaled Chi-squared distributions for variance estimates, assessed approximate 95% confidence intervals for these, and compared the widths of the 95% confidence intervals for 20 to 500 df. Based on these assessments, he made a rough, general recommendation of 50 subjects with three repeated measurements on each method.
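Carstensen's argument can be retraced numerically: if a variance estimate follows a scaled chi-squared distribution, the multiplicative uncertainty of the corresponding SD depends only on the df. A Python sketch of this reasoning (our illustration, not his code):

```python
import numpy as np
from scipy.stats import chi2

def sd_ci_ratio(df):
    """Multiplicative 95% CI factors for an SD estimated with df degrees of
    freedom, assuming a scaled chi-squared distribution for the variance
    estimate: sigma_hat^2 * df / sigma^2 ~ chi2(df)."""
    lower = np.sqrt(df / chi2.ppf(0.975, df))
    upper = np.sqrt(df / chi2.ppf(0.025, df))
    return lower, upper

for df in (20, 50, 100, 500):
    lo, hi = sd_ci_ratio(df)
    print(f"df={df:4d}: 95% CI for the SD spans {lo:.2f}x to {hi:.2f}x the estimate")
```

With, for example, 50 subjects and three replicates per method, the within-method SD is estimated with 50 · (3 − 1) = 100 df, so its 95% CI spans roughly 0.88 to 1.16 times the estimate.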
2 Observer variability analysis
Technically, Bland-Altman Limits of Agreement can be applied equally in method comparison and observer variability studies. Usually, however, only very few (most often two) fixed methods are compared with each other, whereas interrater variability assessments ultimately aim at generalizability of, say, clinical readings, independent of the specific set of raters employed. Christensen and colleagues provided sample size motivations when using the Limits of Agreement with the mean (LOAM) for multiple observers [11]. The measurements are assumed to follow an additive two-way random effects model, and sample size considerations are based on the width of confidence intervals for the proposed LOAM. They ascertained that higher precision of the confidence intervals is obtained by increasing the number of observers, whereas increasing the number of subjects alone is not sufficient. This underlines the inherent difference between method and observer comparisons, and it mirrors the need to illuminate interrater variability in multicenter studies. Christensen and colleagues made an R package, R scripts, and their example for the LOAM calculations available in a GitHub repository.
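The observer-versus-subject trade-off can be illustrated by simulation. The following Python sketch (our illustration under an assumed additive two-way random effects model with assumed variance components; not the closed-form confidence intervals of Christensen and colleagues) compares the sampling variability of an estimated LOAM-type half-width for different allocations of subjects and observers:

```python
import numpy as np

def loam_halfwidth_sd(n_subj, n_obs, sd_subj=2.0, sd_obs=0.5, sd_err=0.5,
                      n_sim=2000, seed=42):
    """Monte Carlo spread of the estimated half-width 1.96*sqrt(sd_obs^2 + sd_err^2)
    under y_ij = mu + a_i + b_j + e_ij (subjects i, observers j)."""
    rng = np.random.default_rng(seed)
    est = np.empty(n_sim)
    for s in range(n_sim):
        a = rng.normal(0.0, sd_subj, (n_subj, 1))      # subject effects
        b = rng.normal(0.0, sd_obs, (1, n_obs))        # observer effects
        e = rng.normal(0.0, sd_err, (n_subj, n_obs))   # residual errors
        y = a + b + e
        dev = y - y.mean(axis=1, keepdims=True)        # deviations from subject means
        # unbiased for sd_obs^2 + sd_err^2 under the model above
        var_dev = (dev**2).sum() / (n_subj * (n_obs - 1))
        est[s] = 1.96 * np.sqrt(var_dev)
    return est.std(ddof=1)

# Adding observers tightens the estimate far more than adding subjects:
few_obs_many_subj = loam_halfwidth_sd(200, 3)
many_obs_few_subj = loam_halfwidth_sd(50, 10)
```

Because the observer variance component is estimated with only (number of observers − 1) df, piling on subjects barely improves its precision, in line with the finding of Christensen and colleagues.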
3 Analysis and reporting
Olofsen and colleagues presented a formal description of more advanced Bland-Altman analysis models employing repeated measurements and provided a freely available online implementation [12, 13]. These methods are based on analysis of variance and therefore make use of normality assumptions. Taffé asserted that Bland-Altman Limits of Agreement may be misleading when the variances of the measurement errors of the two methods differ [14, 15]. For this case, he proposed a set of graphs that support the investigator in assessing bias, precision, and agreement between two measurement methods. Corresponding sample size considerations would have to be based on simulation studies.
Abu-Arafeh and colleagues reviewed the reporting of Bland-Altman analysis across five anesthetic journals and derived a list of 13 key features for adequate presentation of a Bland and Altman analysis (see their Table 1 in [16]). Likewise, the Guidelines for Reporting Reliability and Agreement Studies comprised 15 items to keep in mind for transparent reporting (see their Table 1 in [1]).
4 Recommendations
The necessary assumptions for any sample size rationale and the targeted level of precision require careful planning in light of the research context and the pre-specified study goal [3]. One general piece of advice, though, is that the sampling procedure should result in the inclusion of study subjects whose measurements cover the whole measurement range of clinical interest and relevance [10]. The Preiss-Fisher procedure [17] offers a visual assessment to this end (for an exemplification, see, for instance, [18]). Using the Guidelines for Reporting Reliability and Agreement Studies [1] already in the planning phase of a study will support purposive rigor.
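A generic sketch in this spirit — an illustration of a running stability check, not the Preiss-Fisher procedure itself — recomputes bias and Limits of Agreement as subjects accrue, so that one can judge visually whether the estimates have stabilized:

```python
import numpy as np

def running_loa(x, y):
    """Running bias and 95% Limits of Agreement after 3, 4, ..., n subjects.
    Plot columns 1-3 against column 0 to judge stabilization."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    out = []
    for n in range(3, d.size + 1):
        m, s = d[:n].mean(), d[:n].std(ddof=1)
        out.append((n, m, m - 1.96 * s, m + 1.96 * s))
    return out
```

If the running limits still drift at the planned sample size, or if the included subjects do not span the clinically relevant measurement range, more or different subjects are needed.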
In method comparison studies with single measurements by each method, sample size calculations can, as a liberal choice, be based on the expected width of an exact 95% confidence interval to cover the central 95% proportion of the differences [3]. A more conservative approach, resulting in larger sample sizes, is to require that the observed width of the above exact 95% confidence interval not exceed a predefined benchmark value Δ with an assurance probability exceeding 50% [3]. In case of k repeated measurements from each subject (k ≥ 2), the equivalence test for agreement proposed by Yi and colleagues can be used [8]. In observer variability analysis with multiple observers, sample size considerations can be based on the width of confidence intervals for the proposed LOAM [11]. R scripts are readily available for all sample size calculations, especially those that have to be solved iteratively [3, 4, 11].
References
Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96–106. https://doi.org/10.1016/j.jclinepi.2010.03.002.
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–10. https://doi.org/10.1016/S0140-6736(86)90837-8.
Jan SL, Shieh G. The Bland-Altman range of agreement: Exact interval procedure and sample size determination. Comput Biol Med. 2018;100:247–52. https://doi.org/10.1016/j.compbiomed.2018.06.020.
Shieh G. Assessing agreement between two methods of quantitative measurements: Exact test procedure and sample size calculation. Stat Biopharm Res. 2020;12:352–9. https://doi.org/10.1080/19466315.2019.1677495.
Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8:135–60. https://doi.org/10.1177/096228029900800204.
Bland JM. Frequently asked questions on the design and analysis of measurement studies. 2015. https://www-users.york.ac.uk/~mb55/meas/comfaq.htm. Accessed 08 Feb 2022.
Gerke O, Vilstrup MH, Segtnan EA, Halekoh U, Høilund-Carlsen PF. How to assess intra- and inter-observer agreement with quantitative PET using variance component analysis: a proposal for standardization. BMC Med Imaging. 2016;16:54. https://doi.org/10.1186/s12880-016-0159-3.
Yi Q, Wang PP, He Y. Reliability analysis for continuous measurements: Equivalence test for agreement. Stat Med. 2008;27:2816–25. https://doi.org/10.1002/sim.3110.
Choudhary PK, Nagaraja HN. Measuring Agreement. Models, Methods, and Applications. Hoboken: Wiley; 2017. pp. 279–87.
Carstensen B. Comparing Clinical Measurement Methods. Chichester: Wiley; 2010. pp. 127–31.
Christensen HS, Borgbjerg J, Børty L, Bøgsted M. On Jones et al.'s method for extending Bland-Altman plots to limits of agreement with the mean for multiple observers. BMC Med Res Methodol. 2020;20:304. https://doi.org/10.1186/s12874-020-01182-w.
Olofsen E, Dahan A, Borsboom G, Drummond G. Improvements in the application and reporting of advanced Bland-Altman methods of comparison. J Clin Monit Comput. 2015;29:127–39. https://doi.org/10.1007/s10877-014-9577-3.
Olofsen E. Webpage for Bland-Altman Analysis. 2021. https://sec.lumc.nl/method_agreement_analysis. Accessed 11 Nov 2021.
Taffé P. Assessing bias, precision, and agreement in method comparison studies. Stat Methods Med Res. 2020;29:778–96. https://doi.org/10.1177/0962280219844535.
Taffé P. When can the Bland & Altman limits of agreement method be used and when it should not be used. J Clin Epidemiol. 2021;137:176–81. https://doi.org/10.1016/j.jclinepi.2021.04.004.
Abu-Arafeh A, Jordan H, Drummond G. Reporting of method comparison studies: A review of advice, an assessment of current practice, and specific suggestions for future reports. Br J Anaesth. 2016;117:569–75. https://doi.org/10.1093/bja/aew320.
Preiss D, Fisher J. A measure of confidence in Bland-Altman analysis for the interchangeability of two methods of measurement. J Clin Monit Comput. 2008;22:257–9. https://doi.org/10.1007/s10877-008-9127-y.
Gerke O. Reporting Standards for a Bland-Altman Agreement Analysis: A Review of Methodological Reviews. Diagnostics (Basel). 2020;10:334. https://doi.org/10.3390/diagnostics10050334.
Acknowledgements
The authors would like to thank research librarian Mette Brandt Eriksen, PhD (University Library of Southern Denmark), for assisting with reviewing the literature. Moreover, the authors would like to express their gratitude to an anonymous reviewer and an associate editor for their helpful comments on an earlier version that improved the manuscript.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Author information
Contributions
OG contributed to the conception of the work, and AKP acquired all materials. All authors assessed and interpreted research articles for this work, and OG drafted the manuscript. All authors revised it critically for important intellectual content, approved the final version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Cite this article
Gerke, O., Pedersen, A.K., Debrabant, B. et al. Sample size determination in method comparison and observer variability studies. J Clin Monit Comput 36, 1241–1243 (2022). https://doi.org/10.1007/s10877-022-00853-x