Abstract
The comparison of two quantitative measuring devices is often performed with the Limits of Agreement proposed by Bland and Altman in their seminal Lancet paper from 1986. Sample size considerations were rare for such agreement analyses in the past, but recently several proposals have been made, depending on how agreement is to be assessed and on the number of replicates to be used. We summarize recent developments and recommendations for various situations, distinguishing between method comparison and observer variability studies. These include the current state-of-the-art analysis of, and reporting guidelines for, agreement studies. General recommendations close the paper.
1 Method comparison studies
Formal sample size motivations have been scarce for agreement studies according to the Guidelines for Reporting Reliability and Agreement Studies [1]. Assuming normally distributed differences between two measurement methods with unknown mean µ and standard deviation σ, the paired quantiles of interest in Bland-Altman analysis are \({\theta }_{0.025}\) and \({\theta }_{0.975}\) [2]. Jan and Shieh proposed an exact 95% confidence interval {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)} to cover the central 95% proportion of the differences (see their Supplemental Material A and D for implementations with SAS/IML and R, respectively) [3]. They proposed to base the sample size either on the expected width of {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)}, which must not exceed a predefined benchmark value Δ, or on the observed width of {\({\widehat{\omega }}_{L},{\widehat{\omega }}_{U}\)}, which must not exceed Δ with an assurance probability of, say, 90% (see their Supplemental Material B, E and C, F, respectively). The former approach leads to sample sizes similar to those of the latter with an assurance probability of 50%. Shieh advanced this procedure into a formal hypothesis test and compared the Type I error rate of this approach with that of previous two one-sided tests (TOST) procedures [4]. Shieh concluded that TOST procedures in the context of Bland-Altman analysis are too conservative, i.e., their Type I error rates fall short of the nominal level, and that the corresponding sample size formulas are therefore problematic.
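For orientation, the classic Limits of Agreement and approximate confidence intervals for the limits can be computed in a few lines. The following Python sketch uses the standard large-sample approximation from Bland and Altman's original paper; it is an illustration, not the exact interval procedure of Jan and Shieh.

```python
import numpy as np

def bland_altman_loa(x, y):
    """Classic 95% Limits of Agreement with approximate 95% CIs for the
    limits, using the large-sample SE ~ sqrt(3 s^2 / n) (Bland & Altman, 1986)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    mean_d = d.mean()                      # estimated bias
    sd_d = d.std(ddof=1)                   # SD of the paired differences
    lower, upper = mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
    se_limit = np.sqrt(3 * sd_d**2 / n)    # approximate SE of each limit
    return {
        "bias": mean_d,
        "loa": (lower, upper),
        "ci_lower_limit": (lower - 1.96 * se_limit, lower + 1.96 * se_limit),
        "ci_upper_limit": (upper - 1.96 * se_limit, upper + 1.96 * se_limit),
    }
```

For planning purposes, the exact procedure of Jan and Shieh [3] replaces these approximate intervals and drives the sample size via their expected or observed width.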
In the context of variance component analysis, the repeatability coefficient (RC) is derived from the within-subject variance \({\sigma }_{w}^{2}\) and estimated as \(1.96\sqrt{2{\widehat{\sigma }}_{w}^{2}}\). The RC estimates the limit within which 95% of the differences between repeated measurements are expected to lie [5, 6]. The RC can be used with multiple assessments from each subject; in case of single measurements from each subject by each method, the RC coincides with half the width of the Bland-Altman Limits of Agreement [7]. Yi and colleagues proposed an equivalence test for agreement in case of k repeated measurements from each subject (k ≥ 2), using analysis of variance [8]. The test aims at confirming that \(1.96\sqrt{2{\widehat{\sigma }}_{w}^{2}}\) is small enough to be acceptable. It can be formulated as testing H0: \({\sigma }_{w}^{2}\ge {\sigma }_{U}^{2}\) against H1: \({\sigma }_{w}^{2}<{\sigma }_{U}^{2}\), with a predefined unacceptable within-subject variance \({\sigma }_{U}^{2}\). Denoting the difference between \({\sigma }_{U}^{2}\) and the assumed population within-subject variance \({\sigma }^{2}\) as Δ (Δ = \({\sigma }_{U}^{2}-{\sigma }^{2}\), with \({\sigma }^{2}<{\sigma }_{U}^{2}\)), the sample size is derived by determining the degrees of freedom (df) that make both sides of the following equation equal:
\[\frac{{\chi }_{\alpha ,df}^{2}}{{\chi }_{1-\beta ,df}^{2}}=\frac{{\sigma }^{2}}{{\sigma }_{U}^{2}}=\frac{{\sigma }_{U}^{2}-\Delta }{{\sigma }_{U}^{2}},\]
where \({\chi }_{p,df}^{2}\) denotes the p-quantile of the chi-squared distribution with df degrees of freedom.
Here, α and 1-β represent the significance level and the power, respectively. The number of subjects to be included is derived from df and depends on the number of repeated measurements k. For k = 2, the number of subjects is equal to the df; for k > 2, it is equal to df/(k − 1).
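Under normality, df·\({\widehat{\sigma }}_{w}^{2}\)/\({\sigma }_{U}^{2}\) follows a scaled chi-squared distribution, and H0 is rejected when this statistic falls below the α-quantile of the chi-squared distribution with df degrees of freedom; the power condition then reduces to an inequality in chi-squared quantiles that can be solved by direct search. The following Python sketch (our illustration of this standard construction, assuming scipy is available; not the authors' own code) returns the df and the implied number of subjects:

```python
import math
from scipy.stats import chi2

def yi_df_and_subjects(sigma_u2, sigma2, alpha=0.05, power=0.90, k=2):
    """Smallest df satisfying the chi-squared sample size condition for
    H0: sigma_w^2 >= sigma_U^2 vs. H1: sigma_w^2 < sigma_U^2.
    Reject H0 when df * sigma_hat_w^2 / sigma_U^2 < chi2(alpha, df);
    power 1 - beta at sigma_w^2 = sigma^2 requires
    chi2(alpha, df) / chi2(power, df) >= sigma^2 / sigma_U^2."""
    assert sigma2 < sigma_u2, "assumed population variance must lie below sigma_U^2"
    for df in range(1, 100_000):
        if chi2.ppf(alpha, df) / chi2.ppf(power, df) >= sigma2 / sigma_u2:
            # k = 2: subjects = df; k > 2: subjects = df / (k - 1)
            n_subjects = df if k == 2 else math.ceil(df / (k - 1))
            return df, n_subjects
    raise RuntimeError("no solution found within the search range")
```

For instance, with \({\sigma }_{U}^{2}\) = 1, \({\sigma }^{2}\) = 0.5 (Δ = 0.5), α = 0.05, and 90% power, the search returns a df in the tens; with k = 2 the number of subjects equals the df.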
Employing repeated measurements enables the investigation of variance parameters that describe the uncertainty between and within subjects. The increased planning complexity introduced by repeated measurements suggests basing sample size determination on simulation-based methodology [9, 10].
Carstensen pointed out that many observations are required to produce stable variance estimates [10]. He assumed scaled Chi-squared distributions for variance estimates, assessed approximate 95% confidence intervals for these, and compared the widths of the 95% confidence intervals for 20 to 500 df. Based on these assessments, he made a rough, general recommendation of 50 subjects with three repeated measurements on each method.
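Carstensen's argument can be retraced numerically: if a variance estimate follows a scaled chi-squared distribution, the multiplicative uncertainty of the corresponding SD depends only on the df. A Python sketch of this reasoning (our illustration, not his code):

```python
import numpy as np
from scipy.stats import chi2

def sd_ci_ratio(df):
    """Multiplicative 95% CI factors for an SD estimated with df degrees of
    freedom, assuming a scaled chi-squared distribution for the variance
    estimate: sigma_hat^2 * df / sigma^2 ~ chi2(df)."""
    lower = np.sqrt(df / chi2.ppf(0.975, df))
    upper = np.sqrt(df / chi2.ppf(0.025, df))
    return lower, upper

for df in (20, 50, 100, 500):
    lo, hi = sd_ci_ratio(df)
    print(f"df={df:4d}: 95% CI for the SD spans {lo:.2f}x to {hi:.2f}x the estimate")
```

With, for example, 50 subjects and three replicates per method, the within-method SD is estimated with 50 · (3 − 1) = 100 df, so its 95% CI spans roughly 0.88 to 1.16 times the estimate.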
2 Observer variability analysis
Technically, Bland-Altman Limits of Agreement can be applied equally in method comparison and observer variability studies. Usually, however, only very few (most often two) fixed methods are compared with each other, whereas interrater variability assessments ultimately aim at generalizability of, say, clinical readings, independent of the specific set of raters employed. Christensen and colleagues provided sample size motivations when using the Limits of Agreement with the mean (LOAM) for multiple observers [11]. The measurements are assumed to follow an additive two-way random effects model, and sample size considerations are based on the width of confidence intervals for the proposed LOAM. They ascertained that higher precision of the confidence intervals is obtained by increasing the number of observers, whereas increasing the number of subjects alone is not sufficient. This underlines the inherent difference between method and observer comparisons, and it mirrors the need to illuminate interrater variability in multicenter studies. Christensen and colleagues made an R package, R scripts, and their example for the LOAM calculations available in a GitHub repository.
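The observer-versus-subject trade-off can be illustrated by simulation. The following Python sketch (our illustration under an assumed additive two-way random effects model with assumed variance components; not the closed-form confidence intervals of Christensen and colleagues) compares the sampling variability of an estimated LOAM-type half-width for different allocations of subjects and observers:

```python
import numpy as np

def loam_halfwidth_sd(n_subj, n_obs, sd_subj=2.0, sd_obs=0.5, sd_err=0.5,
                      n_sim=2000, seed=42):
    """Monte Carlo spread of the estimated half-width 1.96*sqrt(sd_obs^2 + sd_err^2)
    under y_ij = mu + a_i + b_j + e_ij (subjects i, observers j)."""
    rng = np.random.default_rng(seed)
    est = np.empty(n_sim)
    for s in range(n_sim):
        a = rng.normal(0.0, sd_subj, (n_subj, 1))      # subject effects
        b = rng.normal(0.0, sd_obs, (1, n_obs))        # observer effects
        e = rng.normal(0.0, sd_err, (n_subj, n_obs))   # residual errors
        y = a + b + e
        dev = y - y.mean(axis=1, keepdims=True)        # deviations from subject means
        # unbiased for sd_obs^2 + sd_err^2 under the model above
        var_dev = (dev**2).sum() / (n_subj * (n_obs - 1))
        est[s] = 1.96 * np.sqrt(var_dev)
    return est.std(ddof=1)

# Adding observers tightens the estimate far more than adding subjects:
few_obs_many_subj = loam_halfwidth_sd(200, 3)
many_obs_few_subj = loam_halfwidth_sd(50, 10)
```

Because the observer variance component is estimated with only (number of observers − 1) df, piling on subjects barely improves its precision, in line with the finding of Christensen and colleagues.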
3 Analysis and reporting
Olofsen and colleagues presented a formal description of more advanced Bland-Altman analysis models employing repeated measurements and provided a freely available online implementation [12, 13]. These methods are based on analysis of variance and therefore make use of normality assumptions. Taffé asserted that Bland-Altman Limits of Agreement may be misleading when the variances of the measurement errors of the two methods differ [14, 15]. For this case, he proposed a set of graphs that support the investigator in assessing bias, precision, and agreement between two measurement methods. Corresponding sample size considerations would have to be based on simulation studies.
Abu-Arafeh and colleagues reviewed the reporting of Bland-Altman analysis across five anesthetic journals and derived a list of 13 key features for adequate presentation of a Bland and Altman analysis (see their Table 1 in [16]). Likewise, the Guidelines for Reporting Reliability and Agreement Studies comprised 15 items to keep in mind for transparent reporting (see their Table 1 in [1]).
4 Recommendations
The necessary assumptions for any sample size rationale and the targeted level of precision require careful planning in light of the research context and the pre-specified study goal [3]. One general piece of advice, though, is that the sampling procedure should result in the inclusion of study subjects whose measurements cover the whole measurement range of clinical interest and relevance [10]. The Preiss-Fisher procedure [17] offers a visual assessment to this end (for an exemplification, see, for instance, [18]). Using the Guidelines for Reporting Reliability and Agreement Studies [1] already in the planning phase of a study will support purposive rigor.
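A generic sketch in this spirit — an illustration of a running stability check, not the Preiss-Fisher procedure itself — recomputes bias and Limits of Agreement as subjects accrue, so that one can judge visually whether the estimates have stabilized:

```python
import numpy as np

def running_loa(x, y):
    """Running bias and 95% Limits of Agreement after 3, 4, ..., n subjects.
    Plot columns 1-3 against column 0 to judge stabilization."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    out = []
    for n in range(3, d.size + 1):
        m, s = d[:n].mean(), d[:n].std(ddof=1)
        out.append((n, m, m - 1.96 * s, m + 1.96 * s))
    return out
```

If the running limits still drift at the planned sample size, or if the included subjects do not span the clinically relevant measurement range, more or different subjects are needed.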
In method comparison studies with single measurements by each method, sample size calculations can, as a liberal choice, be based on the expected width of an exact 95% confidence interval to cover the central 95% proportion of the differences [3]. A more conservative approach, resulting in larger sample sizes, is to require that the observed width of the above exact 95% confidence interval not exceed a predefined benchmark value Δ with an assurance probability exceeding 50% [3]. In case of k repeated measurements from each subject (k ≥ 2), the equivalence test for agreement proposed by Yi and colleagues can be used [8]. In observer variability analysis with multiple observers, sample size considerations can be based on the width of confidence intervals for the proposed LOAM [11]. R scripts are readily available for all sample size calculations, especially those that have to be solved iteratively [3, 4, 11].
References
Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96–106. https://doi.org/10.1016/j.jclinepi.2010.03.002.
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–10. https://doi.org/10.1016/S0140-6736(86)90837-8.
Jan SL, Shieh G. The Bland-Altman range of agreement: Exact interval procedure and sample size determination. Comput Biol Med. 2018;100:247–52. https://doi.org/10.1016/j.compbiomed.2018.06.020.
Shieh G. Assessing agreement between two methods of quantitative measurements: Exact test procedure and sample size calculation. Stat Biopharm Res. 2020;12:352–9. https://doi.org/10.1080/19466315.2019.1677495.
Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8:135–60. https://doi.org/10.1177/096228029900800204.
Bland JM. Frequently asked questions on the design and analysis of measurement studies. 2015. https://www-users.york.ac.uk/~mb55/meas/comfaq.htm. Accessed 08 Feb 2022.
Gerke O, Vilstrup MH, Segtnan EA, Halekoh U, Høilund-Carlsen PF. How to assess intra- and inter-observer agreement with quantitative PET using variance component analysis: a proposal for standardization. BMC Med Imaging. 2016;16:54. https://doi.org/10.1186/s12880-016-0159-3.
Yi Q, Wang PP, He Y. Reliability analysis for continuous measurements: Equivalence test for agreement. Stat Med. 2008;27:2816–25. https://doi.org/10.1002/sim.3110.
Choudhary PK, Nagaraja HN. Measuring Agreement. Models, Methods, and Applications. Hoboken: Wiley; 2017. pp. 279–87.
Carstensen B. Comparing Clinical Measurement Methods. Chichester: Wiley; 2010. pp. 127–31.
Christensen HS, Borgbjerg J, Børty L, Bøgsted M. On Jones et al.'s method for extending Bland-Altman plots to limits of agreement with the mean for multiple observers. BMC Med Res Methodol. 2020;20:304. https://doi.org/10.1186/s12874-020-01182-w.
Olofsen E, Dahan A, Borsboom G, Drummond G. Improvements in the application and reporting of advanced Bland-Altman methods of comparison. J Clin Monit Comput. 2015;29:127–39. https://doi.org/10.1007/s10877-014-9577-3.
Olofsen E. Webpage for Bland-Altman Analysis. 2021. https://sec.lumc.nl/method_agreement_analysis. Accessed 11 Nov 2021.
Taffé P. Assessing bias, precision, and agreement in method comparison studies. Stat Methods Med Res. 2020;29:778–96. https://doi.org/10.1177/0962280219844535.
Taffé P. When can the Bland & Altman limits of agreement method be used and when it should not be used. J Clin Epidemiol. 2021;137:176–81. https://doi.org/10.1016/j.jclinepi.2021.04.004.
Abu-Arafeh A, Jordan H, Drummond G. Reporting of method comparison studies: A review of advice, an assessment of current practice, and specific suggestions for future reports. Br J Anaesth. 2016;117:569–75. https://doi.org/10.1093/bja/aew320.
Preiss D, Fisher J. A measure of confidence in Bland-Altman analysis for the interchangeability of two methods of measurement. J Clin Monit Comput. 2008;22:257–9. https://doi.org/10.1007/s10877-008-9127-y.
Gerke O. Reporting Standards for a Bland-Altman Agreement Analysis: A Review of Methodological Reviews. Diagnostics (Basel). 2020;10:334. https://doi.org/10.3390/diagnostics10050334.
Acknowledgements
The authors would like to thank research librarian Mette Brandt Eriksen, PhD (University Library of Southern Denmark), for assisting with reviewing the literature. Moreover, the authors would like to express their gratitude to an anonymous reviewer and an associate editor for their helpful comments on an earlier version that improved the manuscript.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Author information
Contributions
OG contributed to the conception of the work, and AKP acquired all materials. All authors assessed and interpreted research articles for this work, and OG drafted the manuscript. All authors revised it critically for important intellectual content, approved the final version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Cite this article
Gerke, O., Pedersen, A.K., Debrabant, B. et al. Sample size determination in method comparison and observer variability studies. J Clin Monit Comput 36, 1241–1243 (2022). https://doi.org/10.1007/s10877-022-00853-x