1 Method comparison studies

According to the Guidelines for Reporting Reliability and Agreement Studies, formal sample size justifications have been scarce in agreement studies [1]. Assuming normally distributed differences between two measurement methods with unknown mean µ and standard deviation σ, the paired quantiles of interest in Bland-Altman analysis are \(\theta_{0.025}\) and \(\theta_{0.975}\) [2]. Jan and Shieh proposed an exact 95% confidence interval \((\hat{\omega}_{L}, \hat{\omega}_{U})\) to cover the central 95% proportion of the differences (see their Supplemental Material A and D for implementations with SAS/IML and R, respectively) [3]. They proposed to base the sample size either on the expected width of \((\hat{\omega}_{L}, \hat{\omega}_{U})\), which must not exceed a predefined benchmark value Δ, or on the observed width of \((\hat{\omega}_{L}, \hat{\omega}_{U})\), which must not exceed Δ with an assurance probability of, say, 90% (see their Supplemental Material B, E and C, F, respectively). The former approach leads to sample sizes similar to those of the latter with an assurance probability of 50%. Shieh advanced this procedure into a formal hypothesis test and compared the Type I error rate of this approach with that of earlier two one-sided tests (TOST) procedures [4]. Shieh concluded that TOST procedures in the context of Bland-Altman analysis are too conservative, i.e. their Type I error rate falls short of the nominal level, so the corresponding sample size formulas are problematic.
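
Jan and Shieh's exact interval requires numerical routines (available in their Supplemental Material), but the logic of both sample size criteria can be illustrated with a rough, simulation-based stand-in. The sketch below, in R, replaces the exact interval with Howe's approximate two-sided 95%/95% normal tolerance factor; the planning values (sigma, Delta) and the function names are assumptions for illustration, not taken from [3].

```r
## Approximate stand-in for the sample size criteria of Jan and Shieh [3]:
## Monte Carlo estimates of the expected width and the assurance
## probability of a two-sided 95%/95% normal tolerance interval
## (Howe's approximation) for a candidate sample size n.

howe_k2 <- function(n, content = 0.95, conf = 0.95) {
  nu <- n - 1
  z  <- qnorm((1 + content) / 2)
  sqrt(nu * (1 + 1 / n) * z^2 / qchisq(1 - conf, nu))
}

check_n <- function(n, sigma = 1, Delta = 5, B = 10000) {
  k <- howe_k2(n)
  s <- sqrt(sigma^2 * rchisq(B, n - 1) / (n - 1))  # simulated sample SDs
  w <- 2 * k * s                                   # simulated interval widths
  c(expected_width = mean(w), assurance = mean(w <= Delta))
}

check_n(50)
check_n(100)
```

Increasing n until expected_width falls below Δ mimics the expected width criterion; requiring an assurance of at least 0.90 mimics the assurance probability criterion.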

In the context of variance component analysis, the repeatability coefficient (RC) is derived from the within-subject variance \(\sigma_w^2\) and estimated as \(1.96\sqrt{2\hat{\sigma}_w^2}\). The RC estimates the limit within which 95% of the differences are expected to lie [5, 6]. The RC can be used with multiple assessments of each subject; in the case of single measurements from each subject by each method, the RC coincides with half the width of the Bland-Altman Limits of Agreement [7]. Yi and colleagues proposed an equivalence test for agreement in the case of k repeated measurements from each subject (k ≥ 2), using analysis of variance [8]. The test aims at confirming that \(1.96\sqrt{2\hat{\sigma}_w^2}\) is small enough to be acceptable. It can be formulated as testing H0: \(\sigma_w^2 \ge \sigma_U^2\) against H1: \(\sigma_w^2 < \sigma_U^2\), with a predefined unacceptable within-subject variance \(\sigma_U^2\). Denoting the difference between \(\sigma_U^2\) and the assumed population within-subject variance \(\sigma^2\) as Δ (Δ = \(\sigma_U^2 - \sigma^2\), with \(\sigma^2 < \sigma_U^2\)), the sample size is derived by determining the degrees of freedom (df) that make both sides of the following equation equal:

$$\frac{\chi_{df,1-\beta}^{2}}{\chi_{df,\alpha}^{2}}=\frac{\sigma_{U}^{2}}{\sigma_{U}^{2}-\Delta}$$

Here, α and 1 − β denote the significance level and the power, respectively. The number of subjects to be included follows from the df and depends on the number of repeated measurements k: with n subjects, the within-subject variance is estimated with df = n(k − 1), so the number of subjects equals df/(k − 1), which reduces to df for k = 2.
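
Since the equation has no closed-form solution in df, a simple iterative search suffices. The following base R sketch implements such a search; the function name and the planning values (sigmaU2, Delta, k) are hypothetical.

```r
## Find the smallest df for which the chi-squared quantile ratio drops
## to sigmaU^2 / (sigmaU^2 - Delta), following the equation of Yi and
## colleagues [8]; then convert df to a number of subjects.

df_for_equivalence <- function(sigmaU2, Delta, alpha = 0.05, beta = 0.20,
                               df_max = 100000) {
  target <- sigmaU2 / (sigmaU2 - Delta)
  for (df in 1:df_max) {
    if (qchisq(1 - beta, df) / qchisq(alpha, df) <= target) return(df)
  }
  stop("no df <= df_max satisfies the equation")
}

df <- df_for_equivalence(sigmaU2 = 1.0, Delta = 0.4)  # i.e. sigma^2 = 0.6
k  <- 3                        # repeated measurements per subject
n  <- ceiling(df / (k - 1))    # number of subjects; n = df when k = 2
```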

Employing repeated measurements enables the investigation of variance parameters that describe the uncertainty between and within subjects. The added planning complexity that comes with repeated measurements suggests basing sample size determination on simulation-based methodology [9, 10].
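
As a minimal illustration of this methodology, the generic skeleton below simulates repeated measurements, applies the planned analysis, and estimates the probability that the study criterion is met; the data model and the criterion are placeholders, not recommendations from [9, 10].

```r
## Generic simulation-based sample size skeleton: simulate, analyze,
## and estimate the probability of meeting the study criterion.

sim_power <- function(n, k, sim_fun, crit_fun, B = 1000) {
  mean(replicate(B, crit_fun(sim_fun(n, k))))
}

## Placeholder data model: normal subject effects plus within-subject
## error; placeholder criterion: the estimated repeatability
## coefficient stays below a benchmark value.
sim_fun <- function(n, k) {
  subj <- rnorm(n, sd = 1)
  matrix(rep(subj, each = k) + rnorm(n * k, sd = 0.5), n, k, byrow = TRUE)
}
crit_fun <- function(x) {
  s2w <- mean(apply(x, 1, var))      # pooled within-subject variance
  1.96 * sqrt(2 * s2w) < 1.6         # estimated RC below benchmark
}

sim_power(n = 50, k = 3, sim_fun, crit_fun)
```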

Carstensen pointed out that many observations are required to produce stable variance estimates [10]. He assumed scaled chi-squared distributions for the variance estimates, derived approximate 95% confidence intervals from them, and compared the widths of these intervals for 20 to 500 df. Based on these assessments, he made a rough, general recommendation of 50 subjects with three repeated measurements by each method.
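
This assessment is straightforward to reproduce: under a scaled chi-squared distribution, the approximate 95% confidence limits for an estimated standard deviation scale with \(\sqrt{df/\chi^2}\), so the ratio of the upper to the lower limit depends on the df alone. The sketch below tabulates this ratio over Carstensen's range of 20 to 500 df (the function name is not from [10]).

```r
## Approximate 95% confidence limits for an estimated standard
## deviation, as multiples of the estimate, and their ratio as a
## rough measure of (in)stability.

ci_ratio <- function(df) {
  lower <- sqrt(df / qchisq(0.975, df))
  upper <- sqrt(df / qchisq(0.025, df))
  c(lower = lower, upper = upper, ratio = upper / lower)
}

sapply(c(20, 50, 100, 200, 500), ci_ratio)
```

With three repeated measurements on each of 50 subjects, a within-method variance is estimated with roughly 100 df, where this ratio has come down markedly compared with 20 df.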

2 Observer variability analysis

Although, technically, Bland-Altman Limits of Agreement can be applied equally in method comparison and observer variability studies, method comparisons usually involve only very few (most often two) fixed methods, whereas interrater variability assessments ultimately aim at generalizability of, say, clinical readings, independent of the specific set of raters employed. Christensen and colleagues provided sample size motivations for the Limits of Agreement with the mean (LOAM) for multiple observers [11]. The measurements are assumed to follow an additive two-way random effects model, and sample size considerations are based on the width of confidence intervals for the proposed LOAM. They found that higher precision of these confidence intervals is obtained by increasing the number of observers; increasing the number of subjects alone is not sufficient. This underlines the inherent difference between method and observer comparisons, and it mirrors the need to illuminate interrater variability in multicenter studies. Christensen and colleagues made an R package, R scripts, and their worked example for the LOAM calculations available in a GitHub repository.
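
For intuition only, the LOAM idea can be illustrated with a naive plug-in simulation under the additive two-way random effects model; the toy version below ignores the small-sample corrections and confidence intervals of the actual estimator, for which the R package of Christensen and colleagues should be used. All variance parameters are made up for illustration.

```r
## Naive plug-in illustration of limits of agreement with the mean
## (LOAM) under x_ij = mu + a_i + b_j + e_ij (subjects i, observers j);
## not the estimator of Christensen and colleagues [11].

set.seed(1)
n_subj <- 50; n_obs <- 5
a <- rnorm(n_subj, sd = 1.0)     # random subject effects
b <- rnorm(n_obs,  sd = 0.3)     # random observer effects
e <- matrix(rnorm(n_subj * n_obs, sd = 0.5), n_subj, n_obs)
x <- outer(a, b, "+") + e        # subjects in rows, observers in columns

d    <- x - rowMeans(x)          # deviations from each subject's mean
loam <- 1.96 * sd(as.vector(d))  # plug-in limits: 0 +/- loam
```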

3 Analysis and reporting

Olofsen and colleagues presented a formal description of more advanced Bland-Altman analysis models employing repeated measurements and provided a freely available online implementation [12, 13]. These methods are based on analysis of variance and therefore rely on normality assumptions. Taffé argued that Bland-Altman Limits of Agreement may be misleading when the measurement error variances of the two methods differ [14, 15]. For this case, he proposed a set of graphs that help the investigator assess bias, precision, and agreement between two measurement methods. Corresponding sample size considerations would have to be based on simulation studies.
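
One common variance components route to such repeated-measurements Limits of Agreement combines the between-subject and within-subject variability of the paired differences estimated from a one-way ANOVA. The sketch below shows this variant for a balanced design with m replicate differences per subject; it is a simplified sketch in the spirit of these methods, not a reproduction of the exact models of [12, 13].

```r
## Limits of agreement for repeated measurements via a one-way random
## effects ANOVA on the paired differences d (one variant; balanced
## design assumed).

loa_repeated <- function(d, subject) {
  m   <- mean(table(subject))                # replicates per subject
  av  <- anova(lm(d ~ factor(subject)))
  msb <- av["factor(subject)", "Mean Sq"]    # between-subject mean square
  msw <- av["Residuals", "Mean Sq"]          # within-subject mean square
  s2  <- max(msb - msw, 0) / m + msw         # variance of a single difference
  mean(d) + c(-1.96, 1.96) * sqrt(s2)        # limits of agreement
}
```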

Abu-Arafeh and colleagues reviewed the reporting of Bland-Altman analysis across five anesthetic journals and derived a list of 13 key features for the adequate presentation of a Bland-Altman analysis (see Table 1 in [16]). Likewise, the Guidelines for Reporting Reliability and Agreement Studies comprise 15 items to keep in mind for transparent reporting (see Table 1 in [1]).

4 Recommendations

The necessary assumptions for any sample size rationale and the targeted level of precision require careful planning in light of the research context and the pre-specified study goal [3]. One piece of general advice, though, is that the sampling procedure should result in the inclusion of study subjects whose measurements span the whole measurement range of clinical interest and relevance [10]. The Preiss-Fisher procedure [17] is a tool for visually assessing this (for an exemplification, see, for instance, [18]). Using the Guidelines for Reporting Reliability and Agreement Studies [1] already in the planning phase of a study will support purposive rigor.

In method comparison studies with single measurements by each method, sample size calculations can liberally be based on the expected width of an exact 95% confidence interval covering the central 95% proportion of the differences [3]. A more conservative approach, resulting in larger sample sizes, is to require that the observed width of this exact 95% confidence interval not exceed a predefined benchmark value Δ with an assurance probability above 50% [3]. In the case of k repeated measurements from each subject (k ≥ 2), the equivalence test for agreement proposed by Yi and colleagues can be used [8]. In observer variability analysis with multiple observers, sample size considerations can be based on the width of confidence intervals for the proposed LOAM [11]. R scripts are readily available for all of these sample size calculations, especially those that must be solved iteratively [3, 4, 11].