1 Introduction

In various fields of science, it is frequently required that units (persons, individuals, objects) are rated on a scale by human observers. Examples are teachers who rate assignments completed by pupils to assess their proficiency, neurologists who rate the severity of patients’ symptoms to determine the stage of Alzheimer’s disease, psychologists who classify patients’ mental health problems, and biologists who examine features of animals in order to find similarities between them, which enables the classification of newly discovered species.

To study whether ratings are reliable, a standard procedure is to ask two raters to judge independently the same group of units. The agreement between the ratings can then be used as an indication of the reliability of the classifications by the raters (McHugh 2012; Shiloach et al. 2010; Wing et al. 2002; Blackman and Koval 2000). Requirements for obtaining reliable ratings are, e.g., clear definitions of the categories and the use of clear scoring criteria. A sufficient level of agreement ensures interchangeability of the ratings and consensus in decisions (Warrens 2015).

Assessing reliability is of concern for both categorical and interval rating instruments. For categorical ratings, kappa coefficients are commonly used. For example, Cohen’s kappa coefficient (Cohen 1960) is commonly used to quantify the extent to which two raters agree on a nominal (unordered) scale (De Raadt et al. 2019; Viera and Garrett 2005; Muñoz and Bangdiwala 1997; Graham and Jackson 1993; Maclure and Willett 1987; Schouten 1986), while the weighted kappa coefficient (Cohen 1968) is widely used for quantifying agreement between ratings on an ordinal scale (Moradzadeh et al. 2017; Vanbelle 2016; Warrens 2012a, 2013, 2014; Vanbelle and Albert 2009; Crewson 2005; Cohen 1968). Both Cohen’s kappa and weighted kappa are standard tools for assessing agreement in behavioral, social, and medical sciences (De Vet et al. 2013; Sim and Wright 2005; Banerjee 1999).

The Pearson correlation and intraclass correlation coefficients are widely used for assessing reliability when ratings are on an interval scale (McGraw and Wong 1996; Shrout and Fleiss 1979). Shrout and Fleiss (1979) discuss six intraclass correlation coefficients. Different intraclass correlations are appropriate in different situations (Warrens 2017; McGraw and Wong 1996). Both kappa coefficients and correlation coefficients can be used to assess the reliability of ordinal rating scales.

The primary aim of this study is to provide a thorough understanding of seven reliability coefficients that can be used with ordinal rating scales, such that the applied researcher can make a sensible choice among these seven coefficients. A second aim of this study is to find out whether the choice of the coefficient matters. We compare the following reliability coefficients: Cohen’s unweighted kappa, weighted kappa with linear and quadratic weights, intraclass correlation ICC(3,1) (Shrout and Fleiss 1979), Pearson’s and Spearman’s correlations, and Kendall’s tau-b. We have the following three research questions: (1) Under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values? (2) To what extent do we reach the same conclusions about inter-rater reliability with different coefficients? (3) To what extent do the coefficients measure agreement in similar ways?

To answer the research questions, we will compare the coefficients analytically and by using simulated and empirical data. These different approaches complement each other. The analytical methods are used to make clear how some of the coefficients are related. The simulated and empirical data are used to explore a wide variety of inter-rater reliability situations. For the empirical comparison, we will use two different real-world datasets. The marginal distributions of the real-world datasets are in many cases skewed. In contrast, the marginal distributions of the simulated datasets are symmetric.

The paper is organized as follows. The second and third sections are used to define, respectively, the kappa coefficients and correlation coefficients, and to discuss connections between the coefficients. In the fourth section, we briefly discuss the comparison of reliability coefficients in Parker et al. (2013) and we present hypotheses with regard to the research questions. In the fifth section, three coefficients that can be expressed in terms of the rater means, variances, and covariance (quadratic kappa, intraclass correlation ICC(3,1), and the Pearson correlation) are compared analytically. In the sixth section, we compare all seven coefficients in a simulation study. This is followed by a comparison of all seven coefficients using two real-world datasets in the seventh section. The final section contains a discussion and recommendations.

2 Kappa Coefficients

Suppose that two raters independently classified n units (individuals, objects, products) into one of k ≥ 3 ordered categories that were defined in advance. Let pij denote the proportion of units that were assigned to category i by the first rater and to category j by the second rater. Table 1 is an example of an agreement table with elements pij for k = 4. The table presents pairwise classifications of a sample of units into four categories. The diagonal cells p11, p22, p33, and p44 contain the proportions of units on which the raters agree. The off-diagonal cells contain the proportions of units on which the raters disagree. The marginal totals or base rates pi+ and p+j reflect how often a category is used by a rater.

Table 1 Pairwise classifications of units into four categories

Table 2 is an example of an agreement table with real-world numbers. Table 2 contains the pairwise classifications of two observers who each rated the same teacher on 35 items of the International Comparative Analysis of Learning and Teaching (ICALT) observation instrument (Van de Grift 2007). The agreement table is part of the data used in Van der Scheer et al. (2017). The Van der Scheer data are further discussed in the fifth section.

Table 2 Pairwise classifications of two observers who rated teacher 7 on 35 ICALT items (Van der Scheer et al. 2017)

The weighted kappa coefficient can be defined as a similarity coefficient or as a dissimilarity coefficient. In the dissimilarity coefficient definition, it is usual to assign a weight of zero to full agreements and to allocate to disagreements a positive weight whose magnitude increases proportionally to their seriousness (Gwet 2012). Each of the k² cells of the agreement table has its own disagreement weight, denoted by wij, where wij ≥ 0 for all i and j. Cohen’s weighted kappa (Cohen 1968) is then defined as

$$ \kappa_{w}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}w_{ij}p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1} w_{ij}p_{i+}p_{+j}}. $$
(1)

Weighted kappa in Eq. 1 consists of two quantities: the proportion weighted observed disagreement in the numerator of the fraction, and the proportion expected weighted disagreement in the denominator. The value of weighted kappa is not affected when all weights are multiplied by a positive number.
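To make the computation concrete, the following is a minimal sketch of Eq. 1 in base R. The function name kappa_w, the weight matrix passed in the final call, and the toy ratings are our own illustrations and are not taken from the paper's data; the zero/one weight matrix used at the end corresponds to the unweighted kappa introduced next.

# Minimal sketch of Eq. 1; kappa_w and the toy ratings below are illustrative only.
kappa_w <- function(p, w) {
  # p: k x k matrix of proportions p_ij; w: k x k matrix of disagreement weights
  expected <- outer(rowSums(p), colSums(p))      # p_i+ * p_+j
  1 - sum(w * p) / sum(w * expected)
}

# Building the proportion table p from two vectors of integer-coded ratings
rater1 <- c(1, 2, 2, 3, 4, 4, 1, 3)
rater2 <- c(1, 2, 3, 3, 4, 3, 2, 3)
k <- 4
p <- table(factor(rater1, levels = 1:k),
           factor(rater2, levels = 1:k)) / length(rater1)

kappa_w(p, 1 - diag(k))   # weight 1 off the diagonal, 0 on it (unweighted kappa)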

Using wij = 1 for i ≠ j and wii = 0 in Eq. 1, we obtain Cohen’s kappa or unweighted kappa

$$ \kappa=\frac{P_{o}-P_{e}}{1-P_{e}}=\frac{\sum\limits^{k}_{i=1}(p_{ii}-p_{i+}p_{+i})}{1-\sum\limits^{k}_{i=1}p_{i+}p_{+i}}, $$
(2)

where \(P_{o}={\sum }^{k}_{i=1}p_{ii}\) is the proportion observed agreement, i.e., the proportion of units on which the raters agree, and \(P_{e}={\sum }^{k}_{i=1}p_{i+}p_{+i}\) is the proportion expected agreement. Unweighted kappa differentiates only between agreements and disagreements. Furthermore, unweighted kappa is commonly used when ratings are on a nominal (unordered) scale, but it can be applied to scales with ordered categories as well.

For ordinal scales, frequently used disagreement weights are the linear weights and the quadratic weights (Vanbelle 2016; Warrens 2012a; Vanbelle and Albert 2009; Schuster 2004). The linear weights are given by wij = |i − j|. The linearly weighted kappa, or linear kappa for short, is given by

$$ \kappa_{l}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}|i-j|p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}|i-j|p_{i+}p_{+j}}. $$
(3)

With linear weights, the categories are assumed to be equally spaced (Brenner and Kliebsch 1996). For many real-world data, linear kappa gives a higher value than unweighted kappa (Warrens 2013). For example, for the data in Table 2, we have κ = 0.61 and κl = 0.68. Furthermore, the quadratic weights are given by wij = (i − j)², and the quadratically weighted kappa, or quadratic kappa for short, is given by

$$ \kappa_{q}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}(i-j)^{2}p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}(i-j)^{2}p_{i+}p_{+j}}. $$
(4)

For many real-world data, quadratic kappa produces higher values than linear kappa (Warrens 2013). For example, for the data in Table 2 we have κl = 0.68 and κq = 0.77.

In contrast to unweighted kappa, linear kappa in Eq. 3 and quadratic kappa in Eq. 4 allow some disagreements to be treated as more serious than others (Cohen 1968). For example, disagreements on categories that are adjacent in an ordinal scale are considered less serious than disagreements on categories that are further apart: the seriousness of disagreements is modeled with the weights. It should be noted that all special cases of weighted kappa in Eq. 1 with symmetric weighting schemes, e.g., linear and quadratic kappa, coincide with unweighted kappa when there are k = 2 categories (Warrens 2013).
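As a brief sketch (the kappa_w helper repeats the illustration of Eq. 1 above so the snippet runs on its own; the 2 × 2 proportion table is made up), the three weighting schemes can be written down explicitly, and for k = 2 they indeed yield the same value.

# Illustrative weight matrices for Eqs. 2-4; kappa_w restates the sketch of Eq. 1.
kappa_w <- function(p, w) 1 - sum(w * p) / sum(w * outer(rowSums(p), colSums(p)))

k <- 4
idx <- 1:k
w_unweighted <- 1 * outer(idx, idx, FUN = "!=")    # w_ij = 1 if i != j, else 0
w_linear     <- abs(outer(idx, idx, FUN = "-"))    # w_ij = |i - j|
w_quadratic  <- outer(idx, idx, FUN = "-")^2       # w_ij = (i - j)^2

# With k = 2 categories, the symmetric weighting schemes reduce to unweighted kappa
p2 <- matrix(c(0.40, 0.10, 0.15, 0.35), nrow = 2)  # made-up 2 x 2 proportion table
sapply(list(w_unweighted[1:2, 1:2], w_linear[1:2, 1:2], w_quadratic[1:2, 1:2]),
       function(w) kappa_w(p2, w))                 # three identical values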

The flexibility provided by weights to deal with the different degrees of disagreement could be considered a strength of linear kappa and quadratic kappa. However, the arbitrariness of the choice of weights is generally considered a weakness of the coefficient (Vanbelle 2016; Warrens 2012a, 2013, 2014; Vanbelle and Albert 2009; Crewson 2005; Maclure and Willett 1987). The assignment of weights can be very subjective and studies in which different weighting schemes were used are generally not comparable (Kundel and Polansky 2003). Because of such perceived limitations of linear kappa and quadratic kappa, Tinsley and Weiss (2000) have recommended against the use of these coefficients. Soeken and Prescott (1986, p. 736) also recommend against the use of these coefficients: “because nonarbitrary assignment of weighting schemes is often very difficult to achieve, some psychometricians advocate avoiding such systems in absence of well-established theoretical criteria, due to the serious distortions they can create.”

3 Correlation Coefficients

Correlation coefficients are popular statistics for measuring agreement, or more generally association, on an interval scale. Various correlation coefficients can be defined using the rater means and variances, denoted by m1 and \({s^{2}_{1}}\) for the first rater, and m2 and \({s^{2}_{2}}\) for the second rater, respectively, and the covariance between the raters, denoted by s12. To calculate these statistics, one could use a unit-by-rater table of size n × 2 associated with an agreement table (such as Tables 1 and 2), where an entry of the n × 2 table indicates to which of the k categories a unit (row) was assigned by the first and second raters (first and second columns, respectively). We will use consecutive integer values for coding the categories, i.e., the first category is coded as 1, the second category is coded as 2, and so on.

The Pearson correlation is given by

$$ r=\frac{s_{12}}{s_{1}s_{2}}. $$
(5)

The correlation in Eq. 5 is commonly used in statistics and data analysis, and is the most popular coefficient for quantifying linear association between two variables (Rodgers and Nicewander 1988). Furthermore, in factor analysis, the Pearson correlation is commonly used to quantify association between ordinal scales, in many cases 4-point or 5-point Likert-type scales.

The Spearman correlation is a nonparametric version of the Pearson correlation that measures the strength and direction of a monotonic relationship between the numbers. We will denote the Spearman correlation by ρ. The value of the Spearman correlation can be obtained by replacing the observed scores by rank scores and then using Eq. 5. The values of the Pearson and Spearman correlations are often quite close (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011).

A third correlation coefficient is intraclass correlation ICC(3,1) from Shrout and Fleiss (1979). This particular intraclass correlation is given by

$$ R=\text{ICC(3,1)}=\frac{2s_{12}}{{s_{1}^{2}}+{s_{2}^{2}}}. $$
(6)

Intraclass correlations are commonly used in agreement studies with interval ratings. The correlations in Eqs. 5 and 6 are identical if the raters have the same variance (i.e., \({s^{2}_{1}}={s^{2}_{2}}\)). If the rater variances differ, the Pearson correlation produces a higher value than the intraclass correlation (i.e., r > R). For example, for the data in Table 2, we have R = 0.81 and r = 0.83.

Quadratic kappa in Eq. 4 can also be expressed in terms of rater means, variances, and the covariance between the raters. If the ratings (scores) are labeled as 1, 2, 3, and so on, quadratic kappa is given by (Schuster 2004; Schuster and Smith 2005)

$$ \kappa_{q}=\frac{2s_{12}}{{s_{1}^{2}}+{s_{2}^{2}}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}. $$
(7)

Quadratic kappa in Eq. 7 may be interpreted as a proportion of variance (Schuster and Smith 2005; Schuster 2004; Fleiss and Cohen 1973). The coefficients in Eqs. 6 and 7 are identical if the rater means are equal (i.e., m1 = m2). If the rater means differ, the intraclass correlation produces a higher value than quadratic kappa (i.e., R > κq). For example, for the data in Table 2, we have κq = 0.77 and R = 0.81. Furthermore, if both rater means and rater variances are equal (i.e., m1 = m2 and \({s^{2}_{1}}={s^{2}_{2}}\)), the coefficients in Eqs. 5, 6, and 7 coincide.
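As a small illustration (with made-up scores, not the Table 2 data), the three coefficients in Eqs. 5, 6, and 7 can be computed directly from the rater means, variances, and covariance; the sketch below assumes R's sample (n − 1) variance and covariance.

# Sketch of Eqs. 5-7 for an n x 2 matrix of integer-coded scores (made-up data)
scores <- cbind(c(1, 2, 2, 3, 4, 4, 2, 3, 1, 4),   # rater 1
                c(2, 2, 3, 3, 4, 3, 2, 4, 1, 4))   # rater 2
n   <- nrow(scores)
m   <- colMeans(scores)                  # rater means m_1, m_2
v   <- apply(scores, 2, var)             # rater variances s_1^2, s_2^2
s12 <- cov(scores[, 1], scores[, 2])     # covariance between the raters

r_pearson <- s12 / sqrt(v[1] * v[2])                                  # Eq. 5
R_icc31   <- 2 * s12 / (v[1] + v[2])                                  # Eq. 6
kappa_q   <- 2 * s12 / (v[1] + v[2] + n / (n - 1) * (m[1] - m[2])^2)  # Eq. 7

c(kappa_q, R_icc31, r_pearson)   # ordered as kappa_q <= R <= r (Schuster 2004)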

Warrens (2014) showed that intraclass correlation ICC(3,1), the Pearson correlation and the Spearman correlation (coefficients R, r, and ρ) are in fact special cases of the weighted kappa coefficient in Eq. 1, since the coefficients produce equal values if particular weighting schemes are used. The details of these particular weighting schemes can be found in Warrens (2014).

Linear and quadratic kappa (through their weighting schemes) and the Pearson, intraclass, and Spearman correlations (through the means, variances, and covariance of the raters) use a numerical system to quantify agreement between two raters. They use more information than just the order of the categories. In contrast, the Kendall rank correlation (Kendall 1955, 1962; Parker et al. 2013) is a non-parametric coefficient for ordinal association between two raters that only uses the order of the categories.

Let (xi,yi) and (xj,yj) be two rows of the unit-by-rater table of size n × 2. A pair of rows (xi,yi) and (xj,yj) is said to be concordant if either both xi > xj and yi > yj hold or both xi < xj and yi < yj hold; the pair is said to be discordant if xi > xj and yi < yj, or xi < xj and yi > yj. A pair of rows is said to be tied if xi = xj or yi = yj. Furthermore, let nc denote the number of concordant pairs and nd the number of discordant pairs. Moreover, let n0 = n(n − 1)/2 be the total number of unit pairs, and define

$$ n_{1}={\sum}^{k}_{s=1}t_{s}(t_{s}-1)/2\qquad\text{and}\qquad n_{2}={\sum}^{k}_{s=1}u_{s}(u_{s}-1)/2, $$
(8)

where ts and us are the number of tied values associated with category s of raters 1 and 2, respectively. Kendall’s tau-b is given by

$$ \tau_{b}=\frac{n_{c}-n_{d}}{\sqrt{(n_{0}-n_{1})(n_{0}-n_{2})}}. $$
(9)

The particular version of the Kendall rank correlation in Eq. 9 makes adjustment for ties and is most suitable when both raters use the same number of possible values (Berry et al. 2009). Both conditions apply to the present study.
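A direct sketch of Eqs. 8 and 9 in base R follows; it counts concordant, discordant, and tied pairs explicitly, and the example ratings are made up. (Base R's cor(x, y, method = "kendall") is generally reported to implement the same tie-corrected version, but the explicit count makes the definition transparent.)

# Direct implementation of Eqs. 8 and 9 (made-up ratings; O(n^2) pair counting)
tau_b <- function(x, y) {
  n  <- length(x)
  nc <- 0; nd <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) nc <- nc + 1   # concordant pair
      if (s < 0) nd <- nd + 1   # discordant pair; tied pairs (s == 0) are skipped
    }
  }
  n0 <- n * (n - 1) / 2
  n1 <- sum(table(x) * (table(x) - 1) / 2)   # tied pairs within rater 1 (Eq. 8)
  n2 <- sum(table(y) * (table(y) - 1) / 2)   # tied pairs within rater 2 (Eq. 8)
  (nc - nd) / sqrt((n0 - n1) * (n0 - n2))    # Eq. 9
}

x <- c(1, 2, 2, 3, 4, 4, 2, 3, 1, 4)
y <- c(2, 2, 3, 3, 4, 3, 2, 4, 1, 4)
tau_b(x, y)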

The values of the Spearman and Kendall correlations can be different (Siegel and Castellan 1988; Xu et al. 2013). Although both coefficients range from − 1.0 to + 1.0, for most of this range, the absolute value of the Spearman correlation is empirically about 1.5 times that of the Kendall correlation (Kendall 1962).

4 Hypotheses

Before we present our hypotheses with regard to the research questions, we summarize several relevant results from Parker et al. (2013). These authors compared various reliability coefficients for ordinal rating scales, including linear kappa, quadratic kappa, and the Pearson and Kendall correlations, using simulated data. They investigated whether a fixed value, e.g., 0.60, has the same meaning across reliability coefficients, and across rating scales with different numbers of categories. Among other things, Parker et al. (2013) reported the following results. Differences between the values of quadratic kappa and the Pearson and Kendall correlations were usually less than 0.15. Furthermore, the values of quadratic kappa and the Pearson and Kendall correlations, on the one hand, and linear kappa, on the other hand, were usually quite different. Moreover, differences between the coefficients depend on the number of categories considered. Differences tend to be smaller with two and three categories than with five or more categories. With two categories, the three kappa coefficients are identical (Warrens 2013).

With respect to the first research question (under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values?), we have only general expectations, since these relationships have not been comprehensively studied. We expect that intraclass correlation ICC(3,1) will produce values similar to those of the Pearson correlation if the rater variances are similar, and values similar to those of quadratic kappa if the rater means are similar (Schuster 2004).

With regard to the second research question (to what extent do we reach the same conclusions about inter-rater reliability with different coefficients?), and third research question (to what extent do the coefficients measure agreement in similar ways?), we hypothesize that the values of the Pearson and Spearman correlations are very similar (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011). Furthermore, we hypothesize the values of the Spearman and Kendall correlations to be somewhat different (Kendall 1962; Siegel and Castellan 1988; Xu et al. 2013; Parker et al. 2013). In addition, we hypothesize that the values of the three kappa coefficients can be quite different (Warrens 2013). Combining some of the above expectations, we also expect the values of both unweighted kappa and linear kappa to be quite different from the values of the four correlation coefficients.

5 Analytical Comparison of Quadratic Kappa and the Pearson and Intraclass Correlations

The Pearson and Spearman correlations have been compared analytically by various authors (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011). Furthermore, the three kappa coefficients have been compared analytically and empirically (Warrens 2011, 2013). For many real-world data, we can expect to observe the double inequality κ < κl < κq, i.e., quadratic kappa tends to produce a higher value than linear kappa, which in turn tends to produce a higher value than the unweighted kappa coefficient (Warrens 2011). Moreover, the values of the three kappa coefficients tend to be quite different (Warrens 2013).

To approach the first research question (under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values?), we study, in this section, differences between the three agreement coefficients. The relationships between these three coefficients have not been comprehensively studied. What is known is that, in general, we have the double inequality κq ≤ R ≤ r, i.e., quadratic kappa will never produce a higher value than the intraclass correlation, which in turn will never produce a higher value than the Pearson correlation (Schuster 2004). This inequality between the coefficients can be used to study the positive differences r − R, R − κq, and r − κq.

We first consider the difference between the Pearson and intraclass correlations. The positive difference between the two coefficients can be written as

$$ r-R=\frac{r(s_{1}-s_{2})^{2}}{{s^{2}_{1}}+{s^{2}_{2}}}. $$
(10)

The right-hand side of Eq. 10 consists of three quantities. We lose one parameter if we consider the ratio between the standard deviations

$$ c=\frac{\max(s_{1},s_{2})}{\min(s_{1},s_{2})}, $$
(11)

instead of the standard deviations separately. Using Eq. 11 we may write difference (10) as

$$ r-R=\frac{r(1-c)^{2}}{1+c^{2}}. $$
(12)

The first derivative of f(c) = (1 − c)²/(1 + c²) with respect to c is presented in Appendix 1. Since this derivative is strictly positive for c > 1, formula (12) shows that the difference r − R is strictly increasing in both r and c. In other words, the difference between the Pearson and intraclass correlations increases (1) if agreement in terms of r increases, and (2) if the ratio between the standard deviations increases.
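A quick numeric illustration of Eq. 12 (the values of r and c below are chosen for illustration only and are not taken from Table 3) makes the monotonicity visible.

# Evaluating Eq. 12 for a few illustrative values of the ratio c, at r = 0.80
f <- function(c_ratio) (1 - c_ratio)^2 / (1 + c_ratio^2)
r <- 0.80
for (c_ratio in c(1.2, 1.4, 1.6)) {
  cat(sprintf("c = %.1f:  r - R = %.3f\n", c_ratio, r * f(c_ratio)))
}
# The gap grows with c; it would also grow proportionally with r.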

Table 3 gives the values of the difference r − R for different values of r and ratio (11). The table shows that the difference between the Pearson and intraclass correlations is very small (≤ 0.05) if c ≤ 1.40, and is small (≤ 0.10) if c ≤ 1.60 or if r ≤ 0.50.

Table 3 Values of difference r − R for different values of r and ratio (11)

Next, we consider the difference between the intraclass correlation and quadratic kappa. The positive difference between the two coefficients can be written as

$$ R-\kappa_{q}=\frac{R}{g(\cdot)+1}, $$
(13)

where the function g(⋅) is given by

$$ g(n,m_{1},m_{2},s_{1},s_{2})=\frac{n-1}{n}\cdot\frac{{s^{2}_{1}}+{s^{2}_{2}}}{(m_{1}-m_{2})^{2}}. $$
(14)

A derivation of Eqs. 13 and 14 is presented in Appendix 2. The right-hand side of Eq. 13 shows that difference (13) is increasing in R and is decreasing in the function g(⋅). Hence, the difference between the intraclass correlation and quadratic kappa increases if agreement in terms of R increases. Since the ratio (n − 1)/n is close to unity for moderate to large sample sizes, quantity (14) is approximately equal to the ratio of the sum of the two variances (i.e., \({s^{2}_{1}}+{s^{2}_{2}}\)) to the squared difference between the rater means (i.e., (m1m2)2). Quantity (14) increases if one of the rater variances becomes larger, and decreases if the difference between the rater means increases.
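A short numeric illustration of Eqs. 13 and 14 follows; the numbers are chosen only to echo the setup of Tables 4 and 5 and are not copied from them.

# Evaluating Eqs. 13 and 14 for one illustrative configuration
g <- function(n, m1, m2, var_sum) ((n - 1) / n) * var_sum / (m1 - m2)^2   # Eq. 14
R <- 0.80
g_val <- g(n = 100, m1 = 0.0, m2 = 0.3, var_sum = 1)   # s_1^2 + s_2^2 = 1
R / (g_val + 1)   # Eq. 13: the gap R - kappa_q, about 0.07 in this configuration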

Tables 4 and 5 give the values of the difference R − κq for different values of the intraclass correlation R and the mean difference |m1 − m2|, with \({s^{2}_{1}}+{s^{2}_{2}}\) equal to 1 or 2 and n = 100. Table 4 contains the values of R − κq when the sum of the rater variances is equal to unity (i.e., \({s^{2}_{1}}+{s^{2}_{2}}=1\)). Table 5 presents the values of the difference when \({s^{2}_{1}}+{s^{2}_{2}}=2\).

Table 4 Values of difference R − κq for different values of R and |m1 − m2|, and \({s^{2}_{1}}+{s^{2}_{2}}=1\)
Table 5 Values of difference R − κq for different values of R and |m1 − m2|, and \({s^{2}_{1}}+{s^{2}_{2}}=2\)

Tables 4 and 5 show that the difference between the intraclass correlation and quadratic kappa is very small (≤ 0.04) if \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.20 or R ≤ 0.20, or if \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.30 or R ≤ 0.40. Furthermore, the difference between the coefficients is small (≤ 0.10) if \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.30 or R ≤ 0.50, or if \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.40 or R ≤ 0.90.

Finally, we consider the difference between the Pearson correlation and quadratic kappa. The positive difference between the two coefficients can be written as

$$ r-\kappa_{q}=r\cdot h(\cdot), $$
(15)

where the function h(⋅) is given by

$$ h(n,m_{1},m_{2},s_{1},s_{2})=\frac{(s_{1}-s_{2})^{2}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}{{s^{2}_{1}}+{s^{2}_{2}}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}. $$
(16)

The right-hand side of Eq. 15 shows that difference (15) is increasing in r and in the function h(⋅). Hence, the difference between the Pearson correlation and quadratic kappa increases if agreement in terms of r increases. Quantity (16) is a rather complex function that involves rater means as well as rater variances. Since the inequality \((s_{1}-s_{2})^{2}\leq {s^{2}_{1}}+{s^{2}_{2}}\) holds, the numerator of Eq. 16 never exceeds its denominator, so increasing the common term \(\frac{n}{n-1}(m_{1}-m_{2})^{2}\) moves the ratio toward 1. Hence, quantity (16) and difference (15) increase if the difference between the rater means increases.

To understand the difference r − κq in more detail, it is insightful to consider two special cases. If the rater means are equal (i.e., m1 = m2), the intraclass correlation coincides with quadratic kappa (i.e., R = κq) and the difference r − κq is equal to the difference r − R. Thus, in the special case that the rater means are equal, all conditions discussed above for the difference r − R also apply to the difference r − κq. Furthermore, if the rater variances are equal (i.e., \({s^{2}_{1}}={s^{2}_{2}}\)), the Pearson and intraclass correlations coincide (i.e., r = R) and the difference r − κq is equal to the difference R − κq. If we set s = s1 = s2 and use 2s² instead of \({s^{2}_{1}}+{s^{2}_{2}}\), then all conditions discussed above for the difference R − κq also apply to the difference r − κq.

Difference (15) is equal to the sum of the differences in Eqs. 10 and 13, i.e.,

$$ r-\kappa_{q}=r-R+R-\kappa_{q}=\frac{r(1-c)^{2}}{1+c^{2}}+\frac{R}{g(\cdot)+1}, $$
(17)

where quantity c is given in Eq. 11 and function g(⋅) in Eq. 14. Identity (17) shows that to understand difference (15), it suffices to understand the differences r − R and R − κq. Apart from the overall level of agreement, the difference r − R depends on the rater variances, whereas the difference R − κq depends primarily on the rater means.

Identity (17) also shows that we may combine the various conditions that hold for the differences in Eqs. 10 and 13 to obtain new conditions for difference (15). For example, combining the numbers in Tables 3, 4, and 5, we find that difference (15) is small (≤ 0.09) if c ≤ 1.40 and, in addition, \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.20 or R ≤ 0.20, or \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.30 or R ≤ 0.40.
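The decomposition in Eq. 17 can also be verified numerically. The sketch below uses made-up scores and checks that the two terms on the right-hand side of Eq. 17 reproduce the directly computed gap r − κq.

# Numerical check of identity (17) on made-up integer-coded scores
x <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
y <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 5)
n <- length(x)

r  <- cor(x, y)                                                        # Eq. 5
R  <- 2 * cov(x, y) / (var(x) + var(y))                                # Eq. 6
kq <- 2 * cov(x, y) / (var(x) + var(y) +
                       n / (n - 1) * (mean(x) - mean(y))^2)            # Eq. 7

c_ratio <- max(sd(x), sd(y)) / min(sd(x), sd(y))                       # Eq. 11
g_val   <- ((n - 1) / n) * (var(x) + var(y)) / (mean(x) - mean(y))^2   # Eq. 14

all.equal(r - kq,
          r * (1 - c_ratio)^2 / (1 + c_ratio^2) + R / (g_val + 1))     # TRUE: Eq. 17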

With regard to the first research question, the analyses in this section can be summarized as follows. In general, differences between quadratic kappa and the Pearson and intraclass correlations increase if agreement becomes larger. Differences between the three coefficients are generally small if differences between rater means and variances are relatively small. However, if differences between rater means and variances are substantial, differences between the values of the three coefficients are small only if agreement between raters is small.

6 A Simulation Study

6.1 Data Generation

In this section, we compare all seven reliability coefficients using simulated ordinal rating data. We carried out a number of simulations under different conditions, according to the following procedure. In each scenario, we sampled scores for 200 units from a bivariate normal distribution, using the mvrnorm function in R (R Core Team 2019). The two variables correspond to the two raters. To obtain categorical agreement data, we discretized the variables into five categories: values smaller than − 1.0 were coded 1, values equal to or greater than − 1.0 and smaller than − 0.4 were coded as 2, values equal to or greater than − 0.4 and smaller than 0.4 were coded as 3, values equal to or greater than 0.4 and smaller than 1.0 were coded as 4, and values equal to or greater than 1.0 were coded as 5. For a standardized variable, this coding scheme corresponds to a unimodal and symmetric distribution with probabilities 0.16, 0.18, 0.32, 0.18, and 0.16 for categories 1, 2, 3, 4, and 5, respectively. Thus, the middle category is a bit more popular in the case of a standardized variable. Finally, the values of the seven reliability coefficients were calculated using the discretized data. The above steps were repeated 10,000 times, denoted by 10K for short, in each condition.

For the simulations, we differentiated between various conditions. The mvrnorm function in R allows the user to specify the means and covariance matrix of the bivariate normal distribution. We generated data with either a high (0.80) or medium (0.40) value of the Pearson correlation (i.e., high or medium agreement). Furthermore, we varied the rater means and the rater variances. Either both rater means were set to 0 (i.e., equal rater means), or we set one mean value to 0 and one to 0.5 (i.e., unequal rater means). Moreover, we either set both rater variances to 1 (i.e., equal rater variances), or we set the variances to 0.69 and 1.44 (i.e., unequal rater variances). Fully crossed, the simulation design consists of 8 (= 2 × 2 × 2) conditions. These eight conditions were chosen to illustrate some of the findings from the previous section. Notice that with both variances equal to 1, ratio (11) is also equal to 1. If the variances are equal to 0.69 and 1.44, ratio (11) is equal to 1.44.
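One replication of this data-generating procedure could look as follows. Note that mvrnorm ships with the MASS package, and that the particular seed and condition shown here (equal means and variances, high agreement) are chosen only for illustration.

# One simulated replication: bivariate normal scores, cut into five categories
library(MASS)

set.seed(1)                                    # arbitrary seed for reproducibility
Sigma  <- matrix(c(1.0, 0.8,                   # equal rater variances (1, 1) and
                   0.8, 1.0), nrow = 2)        # covariance 0.8, i.e., high agreement
latent <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma)

breaks <- c(-Inf, -1.0, -0.4, 0.4, 1.0, Inf)   # cut points from Section 6.1
scores <- apply(latent, 2,
                function(z) as.integer(cut(z, breaks, right = FALSE)))

# For the unequal-variance condition, the covariance entry would be r * s_1 * s_2,
# e.g., 0.8 * sqrt(0.69) * sqrt(1.44) for variances 0.69 and 1.44.
head(scores)   # 200 x 2 matrix of ratings in categories 1-5 for the two raters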

6.2 Comparison Criteria

To answer the second research question (to what extent do we reach the same conclusions about inter-rater reliability with different coefficients?), we will compare the values of the coefficients in an absolute sense. If the differences between the values (of one replication of the simulation study) are small (≤ 0.10), we will conclude that the coefficients lead to the same decision in practice. Of course, the value 0.10 is somewhat arbitrary, but we think it is a useful criterion for many real-world applications. To quantify how often we reach the same conclusion, we will use the ratio of the number of simulations in which the values lead to the same conclusion (maximum difference between the values less than or equal to 0.10) to the total number of simulations (= 10K). To answer the third research question (to what extent do the coefficients measure agreement in similar ways?), Pearson correlations between the coefficient values will be used to assess how similarly the coefficients measure agreement in this simulation study.
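Both criteria can be computed along the following lines; the sketch assumes that vals is a replications × coefficients matrix of coefficient values collected over the 10K runs (random numbers stand in for them here only so the snippet runs).

# Both comparison criteria for a (replications x coefficients) matrix `vals`
set.seed(2)
coef_names <- c("kappa", "kappa_l", "kappa_q", "ICC31", "Pearson", "Spearman", "tau_b")
vals <- matrix(runif(100 * 7), ncol = 7, dimnames = list(NULL, coef_names))  # placeholder

# Criterion 1: for each pair of coefficients, the proportion of replications in
# which their values differ by at most 0.10 (the "same conclusion" criterion)
same_decision <- function(i, j) mean(abs(vals[, i] - vals[, j]) <= 0.10)
same_rate <- outer(1:7, 1:7, Vectorize(same_decision))
dimnames(same_rate) <- list(coef_names, coef_names)

# Criterion 2: Pearson correlations between the coefficient values across replications
round(cor(vals), 2)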

6.3 Results of the Simulation Study

Tables 6 and 7 give two statistics that we will use to assess the similarity between the coefficients for the simulated data. Both tables consist of four subtables. Each subtable is associated with one of the simulated conditions. Table 6 contains four subtables associated with the high agreement condition, whereas Table 7 contains four subtables associated with the medium agreement condition. The upper panel of each subtable of Tables 6 and 7 gives the Pearson correlations between the coefficient values of all 10,000 simulations. The lower panel of each subtable contains the ratios of the numbers of simulations in which the values lead to the same conclusion about inter-rater reliability (maximum difference between the values is less than or equal to 0.10) and the total numbers of simulations (= 10K).

Table 6 Correlations and number of times the same decision will be reached for the values of the agreement coefficients for the simulated data, for the high agreement condition
Table 7 Correlations and number of times the same decision will be reached for the values of the agreement coefficients for the simulated data, for the medium agreement condition

Consider the lower panels of the subtables of Tables 6 and 7 first. In all cases, we will come to the same conclusion with the intraclass, Pearson, and Spearman correlations (10K/10K). Hence, for these simulated data, it does not really matter which of these correlation coefficients is used. Furthermore, with medium agreement (Table 7), we will almost always reach the same conclusion with intraclass, Pearson, and Spearman correlations, on the one hand, and the Kendall correlation, on the other hand. When agreement is high (Table 6), we will reach the same conclusion in a substantial number of cases.

If the rater means are equal (the two top subtables of Tables 6 and 7), quadratic kappa and the intraclass and Pearson correlations coincide (see the previous section), and we will come to the same conclusion with quadratic kappa and the three correlation coefficients (10K/10K). If the rater means are unequal (the two bottom subtables of Tables 6 and 7), quadratic kappa is not identical to the intraclass and Pearson correlations, but we will still reach the same conclusion in many cases with quadratic kappa and the four correlation coefficients.

The differences in the values of unweighted kappa and linear kappa compared to quadratic kappa and the four correlation coefficients are striking. If there is high agreement (Table 6), we will generally never come to the same conclusion with unweighted kappa as with linear kappa. Furthermore, with high agreement, we will generally not reach the same conclusion about inter-rater reliability with unweighted kappa and linear kappa, on the one hand, and the other five coefficients, on the other hand. If there is medium agreement (Table 7), the values of the seven coefficients tend to be a bit closer to one another, but we will still come to the same conclusion in only relatively few replications.

Next, consider the upper panels of the subtables of Tables 6 and 7. The correlations between the intraclass, Pearson, Spearman, and Kendall correlations are very high (≥ 0.95) in general and almost perfect (≥ 0.98) if agreement is medium. These four correlation coefficients may produce different values but tend to measure agreement in a similar way. The correlations between quadratic kappa and the correlation coefficients are very high (≥ 0.96) in the case of medium agreement, or if high agreement is combined with equal rater means. In the case of high agreement and unequal rater means, the values drop a bit (0.86–0.92). All in all, it seems that quadratic kappa measures agreement in a very similar way as the correlation coefficients for these simulated data. All other correlations are substantially lower.

With regard to the second research question, we will reach the same conclusion about inter-rater reliability for most simulated replications with any correlation coefficient (intraclass, Pearson, Spearman, or Kendall). Furthermore, using quadratic kappa, we may reach a similar conclusion as with any correlation coefficient a great number of times. Unweighted kappa and linear kappa generally produce different (much lower) values than the other five coefficients. If there is medium agreement, the values of the seven coefficients tend to be a bit closer to one another than if agreement is high.

With regard to the third research question, the four correlation coefficients tend to measure agreement in a similar way: their values are very highly correlated in this simulation study. Furthermore, quadratic kappa is highly correlated with all four correlation coefficients as well for these simulated data.

7 Empirical Comparison of Coefficients

7.1 Datasets

In this section, we compare all seven reliability coefficients using empirical data. Two different real-world datasets will be used to compare the values of the coefficients. For both datasets, all ratings are on what are essentially ordinal scales. One dataset is from medical research and one dataset from educational research.

Holmquist et al. (1967) examined the variability in the histological classification of carcinoma in situ and related lesions of the uterine cervix. In total, 118 biopsies of the uterine cervix were classified independently by seven pathologists into five categories. The raters were involved in the diagnosis of surgical pathologic specimens. The categories were defined as 1 = negative, 2 = atypical squamous hyperplasia (anaplasia or dysplasia), 3 = carcinoma in situ, 4 = squamous carcinoma with early stromal invasion (microinvasion), and 5 = invasive carcinoma. With 7 raters, there are 21 rater pairs. We will examine the values of the coefficients for these 21 different rater pairs.

Van der Scheer et al. (2017) evaluated whether 4th grade teachers’ instructional skills changed after joining an intensive data-based decision making intervention. Teachers’ instructional skills were measured using the ICALT observation instrument (Van de Grift 2007). The instrument includes 35 four-point Likert scale items, where 1 = predominantly weak, 2 = more weaknesses than strengths, 3 = more strengths than weaknesses, and 4 = predominantly strong. Example items are “The teacher ensures a relaxed atmosphere” and “The teacher gives clear instructions and explanations.” In total, 31 teachers were assessed by two raters on all 35 items on three different time points. The complete data consist of 3 × 31 = 93 agreement tables. We only use a selection of the available agreement tables. More precisely, we systematically included the data on one time point for each teacher (see Table 10 below). Hence, we will examine the values of the coefficients for 31 agreement tables.

7.2 Comparison Criteria

To compare the coefficient values, we will use the same comparison criteria as for the simulated data in the previous section. To answer the second research question (to what extent do we reach the same conclusion about inter-rater reliability with different coefficients?), we will use the ratio of the number of tables in which the values lead to the same conclusion (maximum difference between the values less than or equal to 0.10) to the total number of tables to quantify how often we reach the same conclusion. To approach the third research question (to what extent do the coefficients measure agreement in similar ways?), Pearson correlations between the coefficient values will be used to assess how similarly the coefficients measure agreement empirically, for these datasets.

7.3 Results for the Holmquist Data

Table 8 presents the values of the reliability coefficients for all 21 rater pairs of the Holmquist data (Holmquist et al. 1967) together with the rater means and standard deviations. If we consider the three kappa coefficients, we may observe that their values are quite different. We may also observe that for each row the commonly observed double inequality κ < κl < κq holds. Furthermore, if we consider quadratic kappa and the intraclass and Pearson correlations, we find for each row the double inequality κqRr (Schuster 2004). Like quadratic kappa, the value of the Kendall correlation is always between the values of linear kappa and the intraclass correlation. The values of the intraclass and Pearson correlations are almost identical for all 21 rater pairs. The maximum difference is 0.02. Furthermore, the values of the intraclass, Pearson, and Spearman correlations are very similar for all 21 rater pairs. The maximum difference between the three correlations is 0.05.

Table 8 Coefficient values, rater means, and standard deviations for the Holmquist data

We may consider some of the analytical results from the fifth section for these data. Note that the ratio of the standard deviations is smaller than 1.26 for each row of Table 8 (i.e., c < 1.26). It then follows from formula (12) that the maximum difference between the Pearson and intraclass correlations is less than 0.026 (i.e., r − R < 0.026), which is indeed the case for all rows. Furthermore, for these data, the rater variances are very similar. Thus, if we compare the Pearson and intraclass correlations, on the one hand, and quadratic kappa, on the other hand, we see that differences between the coefficients depend to a large extent on the rater means: the larger the difference between the rater means, the larger the difference between the coefficients.

Table 9 gives two additional statistics that we will use to assess the similarity between the coefficients for the data in Table 8. The upper panel gives the Pearson correlations between the coefficient values in Table 8. The lower panel contains the ratios of the numbers of tables in which the values lead to the same conclusion about inter-rater reliability (maximum difference between the values is less than or equal to 0.10) and the total numbers of tables.

Table 9 Correlations and number of times the same decision will be reached for the values of the agreement coefficients in Table 8

Consider the lower panel of Table 9 first. In all cases, we will come to the same conclusion with the four correlation coefficients (21/21). Hence, for these data, it does not really matter which correlation coefficient is used. Furthermore, if quadratic kappa is compared to the four correlation coefficients, we will reach the same conclusion in at least 15 of the 21 cases. These numbers indicate that the values are very similar for these data. In the cases where we found different values for quadratic kappa on the one hand and the four correlation coefficients on the other hand, the rater means tend to be more different.

The differences in the values of unweighted kappa and linear kappa compared to quadratic kappa and the four correlation coefficients are striking. With unweighted kappa, we will never reach an identical conclusion with regard to inter-rater reliability as with any of the other coefficients. With linear kappa, we will reach the same conclusion in only a few cases.

Next, consider the upper panel of Table 9. We may observe very high correlations between the three kappa coefficients. The correlation between unweighted kappa and linear kappa is almost perfect. The unweighted kappa and weighted kappas appear to measure agreement in a similar way (high correlation) but to a different extent (values can be far apart) for these data. The correlations between the four correlation coefficients are almost perfect. Table 9 also shows that linear kappa has correlations of at least 0.90 with the four correlation coefficients. The correlations between quadratic kappa and the correlation coefficients are equal to or greater than 0.93. It seems that quadratic kappa measures agreement in a very similar way as the correlation coefficients, for these data.

7.4 Results for the Van der Scheer Data

Table 10 presents the values of the coefficients for the Van der Scheer et al. (2017) data. Table 11 gives the two statistics that we use to assess the similarity between the coefficients for the data in Table 10. Consider the lower panel of Table 11 first. In contrast to the Holmquist data, the ratios show that, in a few cases, the four correlation coefficients do not lead to the same conclusion about inter-rater reliability for these data (3 pairs with 30/31 instead of 31/31). However, since the numbers are still quite high, we expect similar conclusions from the correlation coefficients.

Table 10 Coefficient values, rater means, and standard deviations for the Van der Scheer data
Table 11 Correlations and number of times the same decision will be reached for the values of the agreement coefficients in Table 10

The lower panel of Table 11 also shows that the values of the three kappa coefficients and the correlation coefficients lead to the same conclusion more often for these data compared to the Holmquist data. In fact, quadratic kappa and the four correlation coefficients almost always led to the same conclusion. Similar to the Holmquist data, the values of quadratic kappa are closer to the values of the four correlation coefficients than the values of unweighted kappa and linear kappa.

Finally, consider the upper panel of Table 11. The correlations between the four correlation coefficients are again very high (≥ 0.98). Furthermore, for these data, the correlations between quadratic kappa and the correlation coefficients, and linear kappa and the correlation coefficients are high as well (≥ 0.94 and ≥ 0.92, respectively).

8 Discussion

8.1 Conclusions

In this study, we compared seven reliability coefficients for ordinal rating scales, using analytic methods and simulated and empirical data. The reliability coefficients are unweighted kappa, linear kappa, quadratic kappa, intraclass correlation ICC(3,1) (Shrout and Fleiss 1979), and the Pearson, Spearman, and Kendall correlations. To approach the first research question, we studied differences between quadratic kappa and the intraclass and Pearson correlations analytically. In general, differences between these coefficients increase if agreement becomes larger. Differences between the three coefficients are generally small if differences between rater means and variances are relatively small. However, if differences between rater means and variances are substantial, differences between the values of the three coefficients are small only if agreement between raters is small.

With regard to the second research question, for the data used in this study, we came to the same conclusion about inter-rater reliability in virtually all cases with any of the correlation coefficients (intraclass, Pearson, Spearman, or Kendall). Hence, it does not really matter which correlation coefficient is used with ordinal data in this study. Furthermore, using quadratic kappa, we may reach a similar conclusion as with any correlation coefficient a great number of times. Hence, for the data in this study, it does not really matter which of these five coefficients is used. Unweighted kappa and linear kappa generally produce different (much lower) values than the other five coefficients. The number of times we reached a similar conclusion with unweighted kappa or linear kappa and any other reliability coefficient was very low, and in some cases even zero. Moreover, if there is medium agreement, the values of the seven coefficients tend to be a bit closer to one another than if agreement is high.

With regard to the third research question, the four correlation coefficients tend to measure agreement in a similar way: their values are very highly correlated for the data used in this study. Furthermore, quadratic kappa is highly correlated with all four correlation coefficients as well for these data. These findings support earlier observations that quadratic kappa tends to behave as a correlation coefficient (Graham and Jackson 1993), although it should be noted that it sometimes gives considerably lower values than the correlation coefficients do.

8.2 Replace Weighted Kappa with a Correlation Coefficient

The application of weighted kappa with ordinal rating scales has been criticized by various authors (e.g., Tinsley and Weiss 2000; Maclure and Willett 1987; Soeken and Prescott 1986). Six reliability coefficients studied in this manuscript (the Kendall correlation not included) can be considered special cases of weighted kappa (Warrens 2014). However, the criticism has been aimed at linear and quadratic kappa in particular, since unweighted kappa is commonly applied to nominal ratings and the correlation coefficients are commonly applied to interval ratings. Of the two, quadratic kappa is by far the more widely applied (Vanbelle 2016; Warrens 2012a; Graham and Jackson 1993).

An advantage of quadratic kappa is that it may be interpreted as a proportion of variance that also takes mean differences between the ratings into account. Despite taking the rater means into account, quadratic kappa empirically behaves much like a correlation coefficient. For the ordinal rating scale data considered in this manuscript, we found that in many cases we reached a similar conclusion about inter-rater reliability with a correlation coefficient as with quadratic kappa. Furthermore, quadratic kappa and the Pearson and intraclass correlations are defined in very similar ways (Eqs. 5, 6, and 7) and turn out to behave very similarly empirically. If quadratic kappa is replaced by a correlation coefficient, it is therefore likely that in many cases a similar conclusion about inter-rater reliability will be reached.

8.3 Practical Recommendations

Based on the findings in the literature and the results of this study, we have the following recommendations for assessing inter-rater reliability. If one is only interested in distinguishing between agreement and disagreement, Cohen’s unweighted kappa (formula (2)) should be used. Furthermore, if one wants to take into account the gravity of the disagreements (e.g., disagreements on adjacent categories are considered less serious than disagreements on categories that are further apart), then the Pearson correlation (formula (5)) should be used. The use of the Pearson correlation is basically unchallenged, something that is not the case for linear and quadratic kappa (e.g., Tinsley and Weiss 2000; Maclure and Willett 1987; Soeken and Prescott 1986). Furthermore, the Pearson correlation is, to the best of our knowledge, available in all statistical software packages. Moreover, with the Pearson correlation, one will in many cases reach the same conclusion about inter-rater reliability as with the intraclass, Spearman, and Kendall correlation coefficients, as well as with quadratic kappa.

8.4 Limitations and Future Research

Rating scales may have various numbers of categories. The analytic results presented in the fifth section hold for any number of categories. However, a possible limitation of the simulation study and the empirical comparison is the use of scales with four and five categories only. Considering scales with smaller and larger numbers of categories is a topic for further study. To some extent, we expect that our results also hold for scales with seven or more categories: the values of the Pearson and Spearman correlations are often very similar (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011), and differences between the values of quadratic kappa and the Pearson and Kendall correlations for seven or more categories are usually not substantial (Parker et al. 2013). For scales with two or three categories, we expect that differences between the reliability coefficients are even smaller (Parker et al. 2013). For example, with two categories, the three kappa coefficients studied in this manuscript are identical (Warrens 2013).

The present study was limited to reliability coefficients for two raters. A topic for further study is a comparison of reliability coefficients for multiple raters. Multi-rater extensions of unweighted kappa are presented in Light (1971), Hubert (1977), Conger (1980), and Davies and Fleiss (1982). An overview of these generalizations is presented in Warrens (2010, 2012b). Multi-rater extensions of linear and quadratic kappa are presented in Abraira and Pérez de Vargas (1999), Mielke et al. (2007, 2008), and Schuster and Smith (2005). An overview of these generalizations is presented in Warrens (2012c). Intraclass correlations are generally defined for multiple raters (Shrout and Fleiss 1979; Warrens 2017). Multi-rater extensions of the Pearson and Spearman correlations are presented in Fagot (1993).

The present study was limited to a selection of reliability coefficients that we believe are commonly used. In future studies, one may want to include other reliability coefficients that are suitable for ordinal rating scales in a comparison. Some alternative coefficients are considered in Parker et al. (2013). Among these alternative coefficients is Scott’s pi (Scott 1955; Krippendorff 1978, 2013), which, like unweighted kappa, is usually applied to nominal ratings. Since unweighted kappa and Scott’s pi produce very similar values in many cases (e.g., Strijbos and Stahl 2007; Parker et al. 2013), we expect that the results presented in this study for unweighted kappa are also applicable to Scott’s pi. Furthermore, we expect that the two coefficients will almost always lead to the same conclusion about inter-rater reliability. An extension of Scott’s pi to multiple raters is presented in Fleiss (1971). Moreover, both Scott’s pi and the coefficient in Fleiss (1971) are special cases of Krippendorff’s alpha (Krippendorff 1978, 2013) that incorporates weighting schemes and can be used when there are three or more raters.

To answer the second research question (to what extent we will reach the same conclusions about inter-rater reliability with different coefficients), we compared the values of the reliability coefficients in an absolute sense: if the differences between the values are small (≤ 0.10), we will conclude that the coefficients lead to the same decision in practice. The present study was limited to one cutoff value (i.e., 0.10). A topic for further study would be to consider other cutoff values. Furthermore, in practical applications, interpretation of specific values of the reliability coefficients may be based on guidelines or rules of thumb (e.g., McHugh 2012; Landis and Koch 1977). Using a particular set of guidelines, researchers may reach substantially different conclusions with one coefficient compared to another coefficient. A topic for further study is considering differences between coefficients in the context of particular sets of guidelines.