BY 4.0 license Open Access Published by De Gruyter August 12, 2020

Derivation of Passing–Bablok regression from Kendall’s tau

  • Florian Dufey

Abstract

It is shown how Passing’s and Bablok’s robust regression method may be derived from the condition that Kendall’s correlation coefficient tau shall vanish upon a scaling and rotation of the data. If the ratio of the standard deviations of the regressands is known, a similar procedure leads to a robust alternative to Deming regression, which is known as the circular median of the doubled slope angle in the field of directional statistics. The derivation of the regression estimates from Kendall’s correlation coefficient makes it possible to give analytical estimates of the variances of the slope, intercept, and of the bias at medical decision point, which have not been available to date. Furthermore, it is shown that using Knight’s algorithm for the calculation of Kendall’s tau makes it possible to calculate the Passing–Bablok estimator in quasi-linear time. This makes it possible to calculate this estimator rapidly even for very large data sets. Examples with data from clinical medicine are also provided.

Acronyms
CI

Confidence interval

cPBR

classical Passing–Bablok regression

ePBR

equivariant Passing–Bablok regression

LPR

least product regression

LSR

least square regression

PB

Passing–Bablok

PBR

Passing–Bablok regression

TSR

Theil–Sen regression

1 Introduction

The comparison of measurements obtained with different instruments or test formats is a biometrical problem familiar to every scientist. While disciplines differ considerably with respect to the preferred methods, regression is of general importance [1]. Since both measurements to be compared have some level of imprecision, least squares regression (LSR) is not the best choice, as it assumes the abscissa values to be error free. This drawback is avoided by techniques like Deming regression (also known as major axis regression) or least product regression (LPR) [1], [2], [3], [4], [5]. However, the results of these methods may be seriously biased by a small number of outlying measurements. To overcome this problem, in the field of clinical chemistry, Passing and Bablok [6] proposed a regression method, to be called classical Passing–Bablok regression (cPBR) in the following, which may be considered a variant of Theil–Sen regression (TSR) [7], [8], adapted to take into account the variation not only of the ordinate but also of the abscissa.

While these authors gave only a heuristic justification for their method, they showed empirically the unbiasedness of the results obtained [9] and the robustness of the method. Although it has become one of the most often used statistical tools in clinical chemistry [5], the method lamentably remains widely unknown outside this field.

Due to the limited analytical understanding of cPBR, formulas for the variance and confidence intervals of the slope and intercept are also not well developed. Furthermore, cPBR is limited to slopes near 1, although, in a follow-up paper [10], a method applicable for any value of the slope was proposed, which in the following shall be called equivariant PBR (ePBR). cPBR may be shown to be an approximation to ePBR. In the following, if both cPBR and ePBR are in scope, the term Passing–Bablok regression (PBR) will be used.

It is the aim of this paper to show that all the aforementioned regression techniques and specifically PBR result from the condition that either Pearson’s or Kendall’s correlation shall diminish to 0 upon a suitable slope-dependent scaling and rotation of the original variables. This extends the ideas of Sen [8] on the derivation of TSR to PBR. Along the same lines, a robust variant of the Deming regression procedure is developed, which is found to be equivalent to the determination of the circular median of the doubled slope angle – a well-known estimator in the theory of directional statistics.

The paper is structured as follows: in Section 2.1, the methods within scope of the paper are shortly recapitulated. The estimators obtained with the different methods are compared using an example from clinical chemistry (Figure 1). Section 2.2 contains the theoretical derivation of the Passing–Bablok estimator and is relevant mainly for the reader more interested in the theory. In Section 2.3, the main equations for the calculation of the PB estimators are laid out, while in the following Section 2.4, expressions for the circular median estimator are given. It also contains a numerical comparison of the different methods. In Section 2.5, it is shown how the PB estimators can be calculated efficiently. Section 3 contains a derivation of the confidence interval for the slope and an efficient algorithm for its numerical calculation. Finally, in Section 4, new equations for the confidence intervals of the intercept and bias are derived. Their coverage is then studied by simulations. Generally, examples or simulations more relevant for the practitioner are included at the end of the sections or in the figures and their captions. Recommendations on the use of the methods are put together in the discussion (Section 5).

Figure 1:

To exemplify the performance of the different regression schemes for measurements with very different ranges, results from a comparison of two clinical chemistry tests, a PCR assay (concentration [IU/ml]) and an immunoassay (cutoff index, COI), are shown. Red lines are for TSR of concentration on cutoff index (dashed) and of cutoff index on concentration (dotted; this line also practically coincides with the Roos regression line with κ = −1), and for ePBR (solid). Blue lines are for LSR of concentration on cutoff index (dashed), of cutoff index on concentration (dotted), and for LPR (solid). The green curve is cPBR. On the left, all curves are forced through the center of mass of concentration and cutoff index. On the right, the intercept for the TSR and PBR curves is obtained as proposed by Passing and Bablok.

2 Derivation of regression from correlation

2.1 Synopsis of the relevant regression methods

In the following, a caret will designate the estimated value, e.g., of the slope $\hat m$, and a bar the mean, e.g., $\bar x = \sum_i x_i / n$. Furthermore, $X := \{x_i\}$ and $Y := \{y_i\}$.

It will be shown how the regression schemes mentioned in the introduction, namely, LSR, LPR, Deming regression, and notably PBR, may be derived from the condition that either Pearson’s or Kendall’s correlation diminishes to 0 for transformed input data.

The following assumptions are made, although for some methods further generalizations are possible. For LSR and Deming regression, both weighted and unweighted variants exist; for ease of presentation, only the unweighted variants will be discussed.

  1. There exists a structural relationship between the expected values of the random variables $x_i$ and $y_i$, denoted $x_i^*$ and $y_i^*$, respectively, of the form $y^* = b + m x^*$. Here, $x_i = x_i^* + \xi_i$ and $y_i = y_i^* + \eta_i$. For LSR and TSR, $\xi_i = 0$.

  2. Error terms for points $i \neq j$ are independent. $\xi_i$ and $\eta_i$ are independent, too.

  3. $E(\xi_i) = E(\eta_i) = 0$ for all i.

  4. The ratio of the variances is independent of i, $\gamma^2 := \sigma_{\eta_i}^2 / \sigma_{\xi_i}^2 = \text{constant}$.

  5. The distribution functions of the error terms may depend on i, but, for fixed i, the distributions of $\xi_i$ and $\eta_i$ are equal up to scaling with the standard deviation[2], i.e., $\forall i, x:\ P(\xi_i/\sigma_{\xi_i} < x) = P(\eta_i/\sigma_{\eta_i} < x)$.
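The measurement-error model in assumptions 1–5 can be made concrete with a small simulation; the sample size, ranges, and error levels below are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=200, m=1.3, b=0.5, gamma=2.0, sd_xi=0.1):
    """Paired measurements following assumptions 1-5: y* = b + m x*,
    x = x* + xi, y = y* + eta, with independent zero-mean Gaussian
    errors and a constant ratio sd(eta)/sd(xi) = gamma."""
    x_star = rng.uniform(1.0, 10.0, n)   # error-free (structural) values
    y_star = b + m * x_star              # assumption 1: linear relationship
    x = x_star + rng.normal(0.0, sd_xi, n)           # assumptions 2-3
    y = y_star + rng.normal(0.0, gamma * sd_xi, n)   # assumptions 4-5
    return x, y

x, y = simulate()
```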

For a detailed description of the estimation procedures to be considered, the reader is referred to the literature [1], [2], [3], [4], [6]. Here, only a very condensed synopsis of the main assumptions and formulas for the estimated slope are given for the methods of interest, as follows:

  1. LSR: In least squares regression of Y on X, the estimate of the slope $\hat m$ is obtained by minimizing the sum of the squared deviations of the $y_i$ from the estimated values, $\sum_i (y_i - \bar y - m(x_i - \bar x))^2$, as

$$\hat m = \frac{\sum_i (y_i - \bar y)(x_i - \bar x)}{\sum_i (x_i - \bar x)^2}.$$

  2. LPR: In least product regression, the estimate $\hat m$ is obtained by minimizing the sum of the products of the deviations of the $y_i$ and $x_i$ from their estimated values, $\sum_i (y_i - \bar y - \hat m(x_i - \bar x))(x_i - \bar x - (y_i - \bar y)/\hat m)$, as

$$\hat m = \pm\left(\frac{\sum_i (y_i - \bar y)^2}{\sum_i (x_i - \bar x)^2}\right)^{1/2},$$

where the sign of the root is chosen equal to the sign of Pearson’s correlation coefficient of x and y. Sometimes, LPR is called geometric mean regression, as the estimate m ^ may also be expressed as the geometric mean of the slope estimate from a LSR of y on x and the reciprocal slope estimate of regression of x on y. Also the terminology “reduced major axis regression” is common.

  3. Deming Reg.: In major axis, or Deming, regression, the slope is determined by minimization of the sum of the squared deviances of x and y from the estimated true values $\hat x_i^*$ and $\hat y_i^* = \bar y + \hat m(\hat x_i^* - \bar x)$,

$$\sum_i \left((y_i - \hat y_i^*)^2 + \gamma^2 (x_i - \hat x_i^*)^2\right),$$

where the sum is minimized with respect to both $\hat m$ and the $\hat x_i^*$. The slope estimate is

$$\hat m = \frac{1}{2 s_{xy}}\left((s_{yy} - s_{xx}\gamma^2) + \sqrt{(s_{yy} - s_{xx}\gamma^2)^2 + 4 s_{xy}^2\gamma^2}\right),$$

where $s_{xy} = \sum_i (x_i - \bar x)(y_i - \bar y)$, $s_{xx} = \sum_i (x_i - \bar x)^2$, and $s_{yy} = \sum_i (y_i - \bar y)^2$. It can be seen that the slope estimators of LSR and LPR arise as special cases of the Deming regression estimate: $\gamma = \infty$ corresponds to regression of y on x, and $\gamma = 0$ to regression of x on y. Assuming that $\gamma = m$, replacement of γ by its estimator $\hat m$ in the last expression yields the LPR expression for $\hat m$.

  4. Roos Reg.: As first discussed by Roos [11] and Xu [12], LSR may be considered a special case of a family of regressions in which the squared distance of the points from the regression line is minimized along lines with slope κ; e.g., LSR is the special case $\kappa = \infty$. The slope estimate is

$$\hat m_\kappa = \frac{s_{yy} - \kappa s_{xy}}{s_{xy} - \kappa s_{xx}}.$$

  5. TSR: According to Sen [8], in Theil–Sen regression the slope can be estimated from the condition that Kendall's correlation coefficient τ between the X and the intercepts $Y - mX$ shall be 0,

$$\tau(X, Y - \hat m X) = N\sum_{i<j}\operatorname{sgn}(x_j - x_i)\operatorname{sgn}\big(y_j - y_i - \hat m(x_j - x_i)\big) = 0.$$

N is a normalization factor; when there are no ties, $N = 2/(n(n-1))$. The function $\operatorname{sgn}(x)$ is the sign of x if $x \neq 0$, and 0 if $x = 0$. If the $x_i$ are ordered so that $x_j > x_i$ if $j > i$, then

$$\hat m = \operatorname{median}\left(\frac{y_j - y_i}{x_j - x_i}\right) =: \operatorname{median}(S_{ij}),$$

where $S_{ij} := (y_j - y_i)/(x_j - x_i)$ is the slope of the line connecting points i and j.

  6. cPBR: In classical Passing–Bablok regression, Passing and Bablok [6] proposed to modify the TSR estimator to take account of the X not being free of error. If K is the number of slopes $S_{ij} < -1$ and $N = n(n-1)/2$ the total number of pairs, then

$$\hat m = F_S^{-1}\left(\frac{1}{2} + \frac{K}{N}\right),$$

where $F_S$ is the empirical cumulative distribution function of the $\{S_{ij}\}$. Furthermore, Passing and Bablok made the assumption that $m \approx 1$.

  7. ePBR: In a follow-up article, Bablok et al. [10] proposed an alternative to cPBR which is applicable for any value of the slope. They gave a recursive definition which can be shown [13] to be equivalent to

$$\hat m = \operatorname{median}(|S_{ij}|).$$

In Figure 1, some of these methods are applied to a data set from clinical chemistry containing paired measurements obtained with both a PCR assay and an immunoassay. Due to the different units, the numerical ranges of the results differ by several orders of magnitude. It can be seen that the LSR, TSR, and also the cPBR are sensitive to the choice of the dependent and independent variables, while the LPR and ePBR yield results which are more centered than those of the other methods.
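For reference, the three rank-based estimators from the synopsis can be written down directly in their brute-force O(n²) form. This is a minimal sketch: tie handling and the special treatment of slopes exactly equal to −1 in [6] are omitted:

```python
import numpy as np
from itertools import combinations

def pairwise_slopes(x, y):
    """All pairwise slopes S_ij, skipping ties in x."""
    return np.array([(y[j] - y[i]) / (x[j] - x[i])
                     for i, j in combinations(range(len(x)), 2)
                     if x[i] != x[j]])

def tsr_slope(x, y):
    """Theil-Sen: plain median of the pairwise slopes."""
    return np.median(pairwise_slopes(x, y))

def cpbr_slope(x, y):
    """Classical Passing-Bablok: median shifted by K = #{S_ij < -1}."""
    s = np.sort(pairwise_slopes(x, y))
    K = np.sum(s < -1.0)
    return np.quantile(s, 0.5 + K / len(s))  # empirical F_S^{-1}(1/2 + K/N)

def epbr_slope(x, y):
    """Equivariant PB: median of |S_ij|, signed by Kendall's correlation."""
    tau_sign = np.sign(sum(np.sign((x[j] - x[i]) * (y[j] - y[i]))
                           for i, j in combinations(range(len(x)), 2)))
    return (tau_sign or 1.0) * np.median(np.abs(pairwise_slopes(x, y)))
```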

2.2 Coordinate transformations

In this section it will be shown that all these estimators may be derived from a common expression. Since LSR and LPR can be derived from Deming regression, the latter will be the focus. The idea is to rotate the coordinate system so that either Pearson’s correlation coefficient r, or Kendall’s correlation coefficient τ, vanishes [13]. However, this cannot be done directly because, upon rotation, the error terms will in general be no longer independent:

Let Σ be the covariance matrix of the $\{(x_i, y_i)^T\}$, $\Sigma^*$ that of the $\{(x_i^*, y_i^*)^T\}$, and $\Sigma_i^\Xi$ that of $(\xi_i, \eta_i)^T$; then

$$\Sigma = \Sigma^* + \sum_i \Sigma_i^\Xi,$$

where $\Sigma_{i,11} = E((x_i - x_i^*)^2)$, $\Sigma_{i,12} = \Sigma_{i,21} = E((x_i - x_i^*)(y_i - y_i^*))$, and $\Sigma_{i,22} = E((y_i - y_i^*)^2)$. Corresponding expressions hold for $\Sigma^*$, e.g., $\Sigma_{11}^* = \sum_i (x_i^* - \bar x^*)^2$, while $\Sigma_{i,11}^\Xi = \sigma_{\xi_i}^2$, $\Sigma_{i,12}^\Xi = \Sigma_{i,21}^\Xi = 0$, and $\Sigma_{i,22}^\Xi = \gamma^2\sigma_{\xi_i}^2$. The rotation angle which diagonalizes Σ is

$$\phi = \frac{1}{2}\arctan\left(\frac{2\Sigma_{12}}{\Sigma_{22} - \Sigma_{11}}\right) = \frac{1}{2}\arctan\left(\frac{2\Sigma_{12}^*}{\Sigma_{22}^* - \Sigma_{11}^* + \sum_i \sigma_{\xi_i}^2(\gamma^2 - 1)}\right),$$

which is only consistent if $\gamma^2 = 1$.

However, if new x-values are introduced by scaling with γ so that $x'_i = \gamma x_i$, the covariance matrix of the new error terms $\xi' = \gamma\xi$ and η will become a multiple of the unit matrix. Upon scaling, the original slope m is transformed into $m' = m/\gamma$. The coordinate system can now be rotated by the angle $\phi' = \arctan(m')$, whence

$$x'' = x'\cos(\phi') + y\sin(\phi') = \left(y + \frac{\gamma^2}{m}x\right)\frac{m}{\sqrt{\gamma^2 + m^2}},$$

and

$$y'' = y\cos(\phi') - x'\sin(\phi') = (y - mx)\frac{\gamma}{\sqrt{\gamma^2 + m^2}}.$$

Dropping irrelevant constant factors, $X' := \{y_i + x_i\gamma^2/m\}$ and $Y' := \{y_i - m x_i\}$ are defined.

The condition that the Pearson correlation $r(X'(m), Y'(m))$ of the transformed points diminishes to 0 for $m = \hat m$ becomes

(1) $$r\big(X'(\hat m), Y'(\hat m)\big) = \frac{\sum_i (x'_i - \bar x')(y'_i - \bar y')}{\sqrt{\sum_i (x'_i - \bar x')^2}\sqrt{\sum_i (y'_i - \bar y')^2}} = 0,$$

which can be simplified to

(2) $$\sum_i \left(y_i - \bar y + \frac{\gamma^2}{\hat m}(x_i - \bar x)\right)\left(y_i - \bar y - \hat m(x_i - \bar x)\right) = 0.$$

It can be shown that the solution of this equation yields exactly the Deming regression estimate m ^ .
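That claim is easy to check numerically: the root of Eq. (2) in $\hat m$ coincides with the closed-form Deming slope from Section 2.1. The synthetic data and the value of γ² below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, 50)
y = 1.5 * x + rng.normal(0.0, 0.5, 50)
gamma2 = 2.0

dx, dy = x - x.mean(), y - y.mean()
sxx, syy, sxy = np.sum(dx * dx), np.sum(dy * dy), np.sum(dx * dy)

def eq2(m):
    """Left-hand side of Eq. (2); vanishes at the Deming slope."""
    return np.sum((dy + gamma2 / m * dx) * (dy - m * dx))

# closed-form Deming estimate for comparison
d = syy - sxx * gamma2
m_deming = (d + np.sqrt(d * d + 4.0 * sxy**2 * gamma2)) / (2.0 * sxy)
```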

When Kendall's τ is substituted for Pearson's r in the preceding derivation, the differences $\Delta x'_{ij} = x'_j - x'_i$ and $\Delta y'_{ij}$ have to be considered. The errors of these differences, $\Delta\eta_{ij}$ and $\Delta\xi_{ij}$, will also be independent with the same ratio γ² of the variances. Hence, the equivalent of Eq. (1) becomes

(3) $$\tau(\hat m) := \tau\big(X'(\hat m), Y'(\hat m)\big) := \left(\frac{n(n-1)}{2}\right)^{-1}\sum_{i<j}\operatorname{sgn}\big(\Delta x'_{ij}\,\Delta y'_{ij}\big) = 0.$$

This can also be simplified to

(4) $$\sum_{i<j}\operatorname{sgn}\left[\left(\Delta y_{ij} + \frac{\gamma^2}{\hat m}\Delta x_{ij}\right)\left(\Delta y_{ij} - \hat m\,\Delta x_{ij}\right)\right] = 0.$$

The solution of this equation yields an estimator[3] which is the equivalent of Deming regression. For this slope estimate to be consistent, it is necessary that the expectation value of Eq. (4) with $\hat m$ replaced by m be 0,

(5) $$E\left(\sum_{i<j}\operatorname{sgn}\left[\left(\Delta y_{ij}^* + \frac{\gamma^2}{m}\Delta x_{ij}^* + \Delta\eta_{ij} + \frac{\gamma^2}{m}\Delta\xi_{ij}\right)\left(m\,\Delta\xi_{ij} - \Delta\eta_{ij}\right)\right]\right) = 0.$$

If the $x_i^*$ are not random variables, generically the expectation value of each term in the sum has to be 0 on its own. This condition is fulfilled if either:

  1. $\gamma = |m|$ and the distributions of the $m\Delta\xi_{ij}$ and $\Delta\eta_{ij}$ are identical. The term whose expectation is taken in Eq. (5) is then anti-symmetric under an interchange of $m\Delta\xi_{ij}$ and $\Delta\eta_{ij}$, whence the expectation must be 0. These are exactly the assumptions made by Passing and Bablok.

  2. Or the $m\Delta\xi_{ij} - \Delta\eta_{ij}$ are statistically independent of the corresponding $\Delta y_{ij}^* + \frac{\gamma^2}{m}\Delta x_{ij}^* + \Delta\eta_{ij} + \frac{\gamma^2}{m}\Delta\xi_{ij}$, and their expectations are 0. For general m and γ, this implies that the distribution of the $\gamma\xi_i$ and $\eta_i$ has to be rotationally invariant, which is only true if they are independent samples from a common Gaussian distribution with mean $E(\eta_i) = E(\xi_i) = 0$. Although requiring a Gaussian distribution may seem quite restrictive, one should note that the two terms also become approximately statistically independent when the errors are small compared to the mean distance between the points, since then $\Delta y_{ij}^* + \frac{\gamma^2}{m}\Delta x_{ij}^* + \Delta\eta_{ij} + \frac{\gamma^2}{m}\Delta\xi_{ij} \approx \Delta y_{ij}^* + \frac{\gamma^2}{m}\Delta x_{ij}^*$ in most of the terms, which may therefore be pulled out of the expectation. The expectation $E(\operatorname{sgn}(m\Delta\xi_{ij} - \Delta\eta_{ij})) = 0$ if the distribution of the $\xi_i$ and $\eta_i$ is symmetric.

The two cases shall be discussed separately in the following subsections. It is remarked that in both cases there are two possible solutions, which differ only in the sign of the resulting slope. As in LPR, the sign may be fixed as that of Kendall's correlation between the original X and Y. In the following descriptions, it will be assumed that this sign is positive, which can always be achieved by changing the sign of the X if necessary.

2.3 Passing–Bablok regression

The first case leads to ePBR:

Setting $\gamma^2 = m^2$, finding the $\hat m$ for which the Pearson correlation becomes 0 yields the LPR estimate. Analogously, the condition that Kendall's correlation shall diminish to 0 leads to an equation for $\hat m$,

(6) $$\sum_{i<j}\operatorname{sgn}\left[(\Delta y_{ij} + \hat m\,\Delta x_{ij})(\Delta y_{ij} - \hat m\,\Delta x_{ij})\right] = 0.$$

With

$$S_{ij} := \frac{\Delta y_{ij}}{\Delta x_{ij}},$$

this can also be written as

$$\tau(\hat m) \propto \sum_{i<j}\operatorname{sgn}\big(\hat m^2 - S_{ij}^2\big) = \sum_{i<j}\operatorname{sgn}\big(\hat m - |S_{ij}|\big) = 0,$$

or,

$$\hat m = \operatorname{median}(|S_{ij}|).$$

Unlike for the cPBR estimator, a scaling of the $S_{ij}$ with a factor will also scale the estimator $\hat m$ by the same factor. Therefore, this estimator will be called the ePBR estimator in the following.

Alternatively, this estimator arises on introduction of the angles $\hat\phi = \arctan(\hat m)$ and $\phi_{ij} = \arctan(S_{ij})$, as

$$\hat\phi = \operatorname{median}(|\phi_{ij}|).$$

Therneau [13] could show that this expression corresponds to an estimator proposed by Bablok et al. [10], who gave a recursive approximation for this estimator. This recursion uses the cPBR as a starting point. It can be seen that if, in the first bracket of Eq. (6), $\hat m$ is replaced by 1, this will yield exactly the cPBR estimator. If Pearson correlation is used instead of Kendall's, the equivalent of the cPBR corresponds to the Roos regression estimate with parameter $\kappa = -1$.

While Eq. (6) yields a consistent estimator, the bias of the ePBR estimator for small sample sizes remains to be determined. It is assumed that the true $m \neq 0$. Multiplying

(7) $$\sum_{i<j}\operatorname{sgn}\left(\Delta y_{ij}^2 - \left(\frac{\hat m}{m}\right)^2 (m\,\Delta x_{ij})^2\right) = 0$$

with $(m/\hat m)^2$ yields

(8) $$\sum_{i<j}\operatorname{sgn}\left((m\,\Delta x_{ij})^2 - \left(\frac{m}{\hat m}\right)^2 \Delta y_{ij}^2\right) = 0.$$

As $m\,\Delta x_{ij}$ and $\Delta y_{ij}$ are distributed alike, it can be concluded that $\hat m/m$ and $m/\hat m$ also follow the same distribution. Therefore, $\ln(\hat m)$ is symmetrically distributed around $\ln(m)$. As the variance of $\hat m$ is $O(1/n)$ (see below), $\ln(\hat m)$ is an unbiased estimator of $\ln(m)$, and $\hat m$ is an asymptotically unbiased estimate of m. An equivalent result holds for the LPR estimator [14].

Considering the robustness properties of the ePBR estimator, it is easy to show that it has the same minimal breakdown point as the TSR estimator. For the TSR estimator, the breakdown point is reached earliest when the slopes between any two points of which at least one is an outlier make up 50% of all slopes and are either all larger or all smaller than all the slopes between any two points which are not outliers ([15], p. 67). Asymptotically, this is the case when the fraction of outliers reaches $1 - 1/\sqrt{2} \approx 29\%$. The calculation remains true if, in the case of the ePBR, the slopes are replaced by their absolute values, whence the breakdown point of the ePBR is also 29%.

2.4 Circular median of the doubled angle

Considering the general solution of Eq. (4) for fixed and known γ, setting $\phi := \arctan(m/\gamma)$, $\phi_{ij} = \arctan(\Delta y_{ij}/(\gamma\,\Delta x_{ij}))$, and making use of the addition theorems for the trigonometric functions, one obtains

$$\sum_{i<j}\operatorname{sgn}\big(\sin(2(\hat\phi - \phi_{ij}))\big) = \sum_{i<j}\operatorname{sgn}\left(2(\hat\phi - \phi_{ij}) - 2\pi\left\lfloor\frac{2(\hat\phi - \phi_{ij}) + \pi}{2\pi}\right\rfloor\right) = 0.$$

Here, $\lfloor x\rfloor$ is the floor function, i.e., the largest integer less than or equal to x. $\hat\phi$ is a standard estimator of circular or, more generally, directional statistics [16], [17], known as the "circular median" of the doubled angle. Doubling of the angle is standard in circular statistics for data showing diametrically bimodal distributions. The circular median has been shown to be a robust and unbiased location estimator of the slope angle [18]. To visualize the circular median, the (doubled) angles are represented as points on a unit circle, and a diameter PQ is drawn which separates the circle into two halves containing the same number of points, cf. Figure 2. The point P to which the majority of points lie closer than to Q is then the circular median. For an odd number of points it coincides with a measured angle, while for an even number of points the mean of the two nearest points is used.
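A brute-force sketch of this estimator follows; it searches candidates among the observed angles only and simplifies the even-n averaging rule, so it is illustrative rather than a reference implementation:

```python
import numpy as np
from itertools import combinations

def circular_median_slope(x, y, gamma=1.0):
    """Slope via the circular median of the doubled pairwise slope angles.
    phi_ij = arctan(dy / (gamma * dx)); the returned slope is gamma * tan(phi_hat)."""
    phi = np.array([np.arctan2(y[j] - y[i], gamma * (x[j] - x[i]))
                    for i, j in combinations(range(len(x)), 2)])

    # the circular median balances the signs of sin(2 (phi_hat - phi_ij)),
    # i.e. it (approximately) zeroes the estimating equation above
    def imbalance(p):
        return abs(np.sum(np.sign(np.sin(2.0 * (p - phi)))))

    return gamma * np.tan(min(phi, key=imbalance))
```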

Figure 2:

A: Histogram of all unique pairwise slope angles for the example from Figure 1 after scaling with the PB slope estimate. Light grey: slopes which contribute +1 to Kendall's correlation function; dark grey: slopes which contribute −1 to the correlation function. B: Histogram of the doubled slope angles (modulo 2π). An equal number of points lie left (light grey) and right (dark grey) of the line passing through the circular median at $\pi/2$ and the origin of the circle.

Estimation of the slope via the circular median seems to be an attractive and robust alternative to the usual Deming regression, especially if the ratio of the standard errors is known and different from the expected slope. Similar to the way the Deming slope estimator interpolates, depending on the ratio γ, between the LSR estimator for the slope of Y versus X and the inverse of that of X versus Y, the circular median can be seen to interpolate between the corresponding TSR estimates. The calculation of the PBR estimate is also possible using the circular median algorithm.

While finding the circular median is as complex as finding the usual median as required in the practical computation of the PBR, the concept offers the advantage, that it can be extended to find a regression line when comparing more than two sets of data [19].

It is of interest to compare the ePBR and the circular median method numerically. To that end, 100 points were sampled 1000 times with $x^* = y^*$ from a uniform distribution, i.e., with $m = 1$, and a) equal coefficients of variation $cv_y = cv_x = 0.3$ and b) unequal coefficients of variation $cv_x = 0.3$ and $cv_y = 0.9$, with Gaussian errors. In case a), both methods yield practically the same results, with mean $\hat m = 1.005$ (sd = 0.067) for the ePBR and mean $\hat m = 1.007$ (sd = 0.084) for the median circular slope. However, in case b), the ePBR is biased, with mean $\hat m = 1.606$ (sd = 0.160), while the mean median circular slope estimate is $\hat m = 1.007$ (sd = 0.084).

2.5 Numerical complexity of the calculation of the PBR and circular median estimators

The question remains how efficiently the TSR, PBR, and circular median estimates can be calculated. In the case of the TSR, this problem has long been solved and represents a classical problem of computational geometry [20]. While the brute-force approach of finding the median of all pairwise slopes between n points has a numerical complexity of $O(n^2)$, better algorithms have a complexity of only $O(n\ln n)$. The problem can be shown to be equivalent to that of computing Kendall's tau, for which an efficient solution was given by Knight in 1966 [21]. It is based on the observation that the evaluation of Kendall's $\tau(X', Y')$ is equivalent to counting the inversions needed to bring the set Y′ into the order of X′, an operation which can be accomplished in $O(n\ln n)$ steps using a sorting algorithm like mergesort [22]. The $x'_i$ and $y'_i$ can be interpreted as intercepts $b_i(m)$ of lines passing through the points $(x_i, y_i)$, whose slope m is taken as an independent variable. This lends itself to an interpretation in terms of a duality transformation which maps points $(x_i, y_i)$ to lines (parametrized by $(m, b_i(m))$) and lines to points. In the field of image analysis, this transformation is known as the Hough transformation [23].

Since Knight's algorithm has been implemented in many statistical packages, and these also take care of the handling of ties, it is an easy task to use these implementations to calculate the PBR and circular median estimators. Even in combination with a simple bisection algorithm for root finding, it thus becomes possible to find the estimators for data sets with $n = 10^7$ on an ordinary laptop in a few minutes.
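A sketch of this approach combines `scipy.stats.kendalltau` (which uses an O(n log n) algorithm and handles ties) with a bisection for the root of τ(m) = τ(Y + mX, Y − mX). The bracket [1e-6, 1e6] and the positive-slope assumption are choices of this sketch, not part of the original method:

```python
import numpy as np
from scipy.stats import kendalltau

def epbr_slope_fast(x, y, lo=1e-6, hi=1e6, iters=60):
    """ePBR slope as the root of tau(m) = tau(Y + mX, Y - mX).
    Assumes the true slope is positive and lies in [lo, hi];
    for negatively correlated data, flip the sign of x first."""
    def tau_m(m):
        t, _ = kendalltau(y + m * x, y - m * x)
        return t

    for _ in range(iters):      # bisection on a logarithmic scale
        mid = np.sqrt(lo * hi)
        if tau_m(mid) > 0:      # tau decreases as m grows
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)
```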

3 Confidence interval of the slope estimate

Passing and Bablok [6] gave arguments to justify a formula for the confidence intervals of the estimated slope. They noted that, in general, the variance of their slope estimator should depend on the distribution of the measured values. Numerical simulations showed that, in practice, their formula led to acceptable coverages. Nevertheless, they concluded that “the empirical derivation of this statement might seem unsatisfactory”. Their estimator can be justified by deriving the confidence intervals for the slope from that of Kendall’s τ.

Writing $\tau(m) := \tau(X'(m), Y'(m))$, where X′ and Y′ may be any of the expressions listed in Table 1, the estimator for the slope $\hat m$ is the solution of the equation $\tau(\hat m) = 0$. Using the delta method,

(9) $$\hat\sigma_{\hat m} \approx \hat\sigma_\tau\frac{dm}{d\tau} \approx \frac{\tau^{-1}(-z_{1-\alpha/2}\sigma_\tau) - \tau^{-1}(z_{1-\alpha/2}\sigma_\tau)}{2 z_{1-\alpha/2}},$$

where, in the last step, the differential quotient was regularized with a McKean–Schrader-type finite difference approximation [24]. $z_{1-\alpha/2}$ is the quantile of level $1-\alpha/2$ of the standard normal distribution.

Table 1:

Comparison of corresponding regression methods obtained by zeroing of either Pearson or Kendall correlation.

| Assumed variance structure | X′ | Y′ | Pearson, $r(X', Y') = 0$ | Kendall, $\tau(X', Y') = 0$ |
|---|---|---|---|---|
| $\sigma_x = 0$ | $X$ | $Y - mX$ | LSR | TSR |
| $\sigma_y/\sigma_x = m$ | $Y + mX$ | $Y - mX$ | LPR | ePBR |
| $\sigma_y/\sigma_x = m$ | $Y - \kappa X$ | $Y - mX$ | Roos | cPBR ($\kappa = -1$) |
| $\sigma_y/\sigma_x = \gamma\ (\neq m)$ | $Y + (\gamma^2/m)X$ | $Y - mX$ | Deming Reg. | circular median of doubled slope angle |

Hence, asymptotic confidence bounds for $\hat m$ are

$$\hat m_{u/l} = \tau^{-1}\left(\mp z_{1-\alpha/2}\,\hat\sigma_\tau\right).$$

If m is the true value of the slope, the variance of $\tau(m)$ is

(10) $$\operatorname{Var}(\tau(m)) = \sigma_\tau^2 = \frac{2(2n+5)}{9n(n-1)},$$

if the $x'_i$ and $y'_i$ are statistically independent and all the $y'_i$ are from the same distribution. With these assumptions, the estimates for the slope's confidence intervals coincide with the result of Passing and Bablok.
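Under the homoscedastic assumption, the bounds $\hat m_{u/l} = \tau^{-1}(\mp z_{1-\alpha/2}\sigma_\tau)$ can be obtained with the same bisection used for the point estimate. This sketch assumes positive slopes and relies on no tie correction beyond what `kendalltau` provides:

```python
import numpy as np
from scipy.stats import kendalltau, norm

def pb_slope_ci(x, y, alpha=0.05, lo=1e-6, hi=1e6, iters=60):
    """(lower, upper) PB confidence bounds for the slope, inverting tau(m)
    at +/- z_{1-alpha/2} * sigma_tau, with sigma_tau from Eq. (10)."""
    n = len(x)
    sigma_tau = np.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))
    z = norm.ppf(1.0 - alpha / 2.0)

    def tau_m(m):
        t, _ = kendalltau(y + m * x, y - m * x)
        return t

    def invert(target, a, b):
        # tau(m) decreases in m, so bisect for tau(m) = target
        for _ in range(iters):
            mid = np.sqrt(a * b)
            if tau_m(mid) > target:
                a = mid
            else:
                b = mid
        return np.sqrt(a * b)

    return invert(+z * sigma_tau, lo, hi), invert(-z * sigma_tau, lo, hi)
```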

However, these assumptions are quite restrictive, since they are only fulfilled if the original $\eta_i$ and $\xi_i$ are homoscedastic and from a Gaussian distribution. If the errors are small compared to the $\Delta x_{ij}^*$, or when $\gamma = m$, a symmetric distribution of the errors will be sufficient, but heteroscedasticity is still an issue. In the Theil–Sen setting, bootstrap and jackknife techniques have been recommended [25] as an alternative to the analytical confidence interval estimators given by Theil [7].

Alternatively, it is possible to derive analytical expressions for the confidence intervals of the PBR estimators, without making restrictive assumptions about the distribution of the X and Y, using an expression for the variance given by Daniels and Kendall [26] and by Cliff and Charlin [27],

(11) $$\widehat{\operatorname{Var}}(\tau(m)) = \frac{n(n-1)}{(n-2)(n-3)}\sum_i\left(\sum_j \tau_{ij}\right)^2,$$

where $\tau_{ij} = (n(n-1)/2)^{-1}\operatorname{sgn}(x'_i - x'_j)\operatorname{sgn}(y'_i - y'_j)$. The estimator depends explicitly on the measured values. To evaluate this estimator efficiently, the order of the i, j is chosen so that, if $x'_i < x'_j$, then $i < j$. Then $\sum_j \tau_{ij}$ can be seen to be proportional to the difference of concordant and discordant pairs $(y'_i, y'_j)$ over all j. This number may be evaluated for all values of i in one sorting step of numerical complexity $O(n\ln n)$:

  1. Initialize a vector q of length n to 0.

  2. The vector Y′, with elements in the order of the i, is then sorted by decreasing size of the $y'_i$ using mergesort [22], where, during the merging of any two sub-lists, a counter is kept of the number of elements already taken from the left list.

  3. When, during merging, an element $y'_i$ from the right list is selected, $q_i$ is increased by the actual value of the counter.

  4. When the complete list is sorted, then $\sum_j \tau_{ij} = 2(n - 2q_i)/(n(n+1))$.

For homoscedastic Gaussian errors, this variance estimator reduces asymptotically to Eq. (10).
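This limit can be checked numerically with a brute-force O(n²) evaluation of the row sums, using independent homoscedastic Gaussian data so that Eq. (10) applies; note that the exact form used for Eq. (11) below is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)   # independent data: tau(m) is centered at 0
y = rng.normal(size=n)

# row[i] = sum_j tau_ij with tau_ij = (n(n-1)/2)^(-1) sgn(x_i-x_j) sgn(y_i-y_j)
s = np.sign(x[:, None] - x[None, :]) * np.sign(y[:, None] - y[None, :])
row = s.sum(axis=1) / (n * (n - 1) / 2.0)

var_eq11 = n * (n - 1) / ((n - 2.0) * (n - 3.0)) * np.sum(row**2)
var_eq10 = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))
```

For this homoscedastic case, the two variance estimates should agree to within sampling noise.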

A numerical comparison (cf. Section 4) shows that the variance calculated with Eq. (11) for heteroscedastic Gaussian errors with constant coefficient of variation is only 9.4% greater than the variance for homoscedastic errors, and the corresponding difference in the standard deviation is 4.6%. For moderate sample sizes, this difference will be below the statistical uncertainty of Eq. (11) and probably justifies the use of the simpler expression used by Passing and Bablok in most practical situations. Nevertheless, Daniels and Kendall [26] derived a tight upper bound of 2/n for the variance from Eq. (11) which, asymptotically for large n, is 4.5 times larger than the variance from Eq. (10).

Finally, in the context of estimating the variance of $\hat m$, the efficiency of the PBR estimation shall be addressed. The additional condition to the general assumptions made in Section 2.1 for the estimate Eq. (10) to hold is that the $x_i^*$, $y_i^*$, $\eta_i$, and $\xi_i$ follow a normal distribution. Under this condition, LPR, the "twin" method of ePBR, which is derived here from the condition of Pearson correlation diminishing to 0, can also be derived from maximization of likelihood [28] and therefore has maximal asymptotic efficiency. The relative asymptotic efficiency of ePBR is then expressible as the ratio $\operatorname{Var}(\hat m_{\mathrm{LPR}})/\operatorname{Var}(\hat m_{\mathrm{ePBR}})$ of the two estimators [8]. The standard deviation of the LPR estimate can be calculated analogously to that of the ePBR,

$$\sigma_{\hat m_{\mathrm{LPR}}} = \sigma_r\frac{dm}{dr},$$

where $r(X'(m), Y'(m))$ is Pearson's coefficient of correlation and $\sigma_r$ its standard deviation.

Asymptotically [26], $\sigma_r = 1/\sqrt{n}$. Also, for normally distributed X and Y,

$$r = \sin(\pi\tau/2),$$

which allows relating the derivatives of $\hat m$ with respect to r and τ. This yields the relative asymptotic efficiency

$$\frac{\operatorname{Var}(\hat m_{\mathrm{LPR}})}{\operatorname{Var}(\hat m_{\mathrm{ePBR}})} \to \frac{9}{\pi^2} \approx 0.91.$$

So, at least for normally distributed measurements and errors, the asymptotic relative efficiency will be nearly optimal. The same relative asymptotic efficiency is also found for Theil–Sen regression [8].
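Spelled out, with $\sigma_\tau^2 \to 4/(9n)$ from Eq. (10) and $dr/d\tau = (\pi/2)\cos(\pi\tau/2) \to \pi/2$ at the root $\tau = 0$, the ratio follows as:

```latex
\frac{\operatorname{Var}(\hat m_{\mathrm{LPR}})}{\operatorname{Var}(\hat m_{\mathrm{ePBR}})}
  = \frac{\sigma_r^2\,(dm/dr)^2}{\sigma_\tau^2\,(dm/d\tau)^2}
  = \frac{\sigma_r^2}{\sigma_\tau^2}\left(\frac{d\tau}{dr}\right)^2
  \longrightarrow \frac{1/n}{4/(9n)}\cdot\left(\frac{2}{\pi}\right)^2
  = \frac{9}{\pi^2}.
```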

4 Confidence intervals for the intercept and bias

The estimators for the CI of the intercept proposed by Passing and Bablok,

$$\hat b_{\mathrm{low/up}} = \operatorname{median}\{y_i - \hat m_{\mathrm{up/low}}\,x_i \mid i = 1,\ldots,n\},$$

do not seem acceptable, since with a shift of the origin of the variable x the two limits can be made to coincide, so that the CI has length 0. One would merely expect the width of the CI to run through a minimum near the center of gravity of the $x_i$, as in simple linear regression.

Passing and Bablok propose the following estimator for the intercept:

b ^ = median { y i m ^ x i | i = 1 , , n } ,

or, with

$$\zeta(b, \hat{m}) = \sum_i \operatorname{sgn}(y_i - \hat{m} x_i - b) = -n\left(2 F_{Y'}(b) - 1\right),$$

as a solution of

(13) $$\zeta(\hat{b}, \hat{m}) = 0, \quad\text{or}\quad F_{Y'}(\hat{b}) = \tfrac{1}{2}.$$

The intercept b is the bias of y at $x = 0$. Here, $F_{Y'}$ is the empirical cumulative distribution function of $Y' = Y - \hat{m} X$. Neglecting a small possible dependence of the $y'_i$ for different i due to the estimation of $\hat{m}$, the variance of $\zeta(b, \hat{m})$ for fixed b is $\widehat{\operatorname{Var}}(\zeta) = \hat{\sigma}_\zeta^2 = n$, and the standard deviation of $F_{Y'}(b, \hat{m})$ is $\hat{\sigma}_{F_{Y'}} = 1/(2\sqrt{n})$.
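A minimal sketch of the intercept estimate and the sign statistic ζ (plain Python; the function names are illustrative, not from any package). By construction, ζ vanishes, up to ties, at $b = \hat{b}$.

```python
import random
import statistics

def sgn(v):
    return (v > 0) - (v < 0)

def pb_intercept(x, y, m_hat):
    """Passing-Bablok intercept: median of the residuals y_i - m_hat * x_i."""
    return statistics.median(yi - m_hat * xi for xi, yi in zip(x, y))

def zeta(x, y, m_hat, b):
    """Sign statistic zeta(b, m_hat) = sum_i sgn(y_i - m_hat * x_i - b)."""
    return sum(sgn(yi - m_hat * xi - b) for xi, yi in zip(x, y))

random.seed(1)
x = [random.uniform(0, 1000) for _ in range(201)]
y = [2.0 * xi + 5.0 + random.gauss(0, 10) for xi in x]

b_hat = pb_intercept(x, y, 2.0)   # slope assumed known here for brevity
# zeta changes sign at b_hat; at b_hat itself it vanishes up to ties
```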

In clinical chemistry, the bias at the medical decision point $x_c$,

$$\hat{B}_c = \hat{y}_c - x_c = \hat{b} + (\hat{m} - 1)\, x_c,$$

is of special interest.

For the calculation of its standard error, the delta method is proposed:

$$\widehat{\operatorname{Var}}(\hat{B}_c) = (1 \;\; x_c)\, Q^{-1} V \left(Q^{-1}\right)^{T} (1 \;\; x_c)^{T},$$

with V being the covariance matrix of $\zeta(b, \hat{m})$ and $\tau(m)$. It has entries $V_{11} = \hat{\sigma}_\zeta^2$, $V_{22} = \hat{\sigma}_\tau^2$ (cf. Eq. (10)) and

$$V_{12} = V_{21} = \widehat{\operatorname{cov}}(\zeta, \tau) = \sum_{i<j} \left(\operatorname{sgn}(y'_i) + \operatorname{sgn}(y'_j)\right) \tau_{ij} =: \widehat{\operatorname{cor}}(\zeta, \tau)\, \hat{\sigma}_\zeta \hat{\sigma}_\tau.$$

This expression for the covariance can be derived using the fact that both ζ and τ are U-statistics [29]. It quantifies the heteroscedasticity of the data and can also be expressed in terms of Kendall correlation coefficients,

$$\widehat{\operatorname{cov}}(\zeta, \tau) = \frac{2\left[(n_+ - 2)\,\tau_+ - (n_- - 2)\,\tau_-\right]}{\binom{n}{2}},$$

where $n_\pm$ are the numbers of positive and negative $y'_i$, respectively, and $\tau_\pm = \sum_{i<j} \tau_{ij}$, with the sum restricted to those i for which $y'_i > 0$ or $y'_i < 0$, respectively.

The matrix Q is the Jacobian

(13) $$Q = \begin{pmatrix} \left.\dfrac{\partial \zeta}{\partial \hat{b}}\right|_{\hat{m}} & \left.\dfrac{\partial \zeta}{\partial \hat{m}}\right|_{\hat{b}} \\ 0 & \dfrac{d\tau}{d\hat{m}} \end{pmatrix}, \quad\text{with inverse}\quad Q^{-1} = \begin{pmatrix} \left.\dfrac{\partial \hat{b}}{\partial \zeta}\right|_{\hat{m}} & -\dfrac{d\hat{m}}{d\tau}\, x_0 \\ 0 & \dfrac{d\hat{m}}{d\tau} \end{pmatrix},$$

where $x_0 := \left.\dfrac{\partial \zeta}{\partial \hat{m}}\right|_{\hat{b}} \left.\dfrac{\partial \hat{b}}{\partial \zeta}\right|_{\hat{m}} = -\left.\dfrac{\partial \hat{b}}{\partial \hat{m}}\right|_{\zeta = 0}$.

For all appearing derivatives, McKean–Schrader type regularized approximations are used:

  1. For $\hat{\sigma}_\tau\, d\hat{m}/d\tau = \hat{\sigma}_{\hat{m}}$: the expression derived in the previous section.

  2. $$\hat{\sigma}_{\hat{b}} := \hat{\sigma}_\zeta \left.\frac{\partial \hat{b}}{\partial \zeta}\right|_{\hat{m}} \approx \frac{F_{Y'}^{-1}\left(\tfrac{1}{2} + z_{1-\alpha/2}\, \hat{\sigma}_{F_{Y'}}\right) - F_{Y'}^{-1}\left(\tfrac{1}{2} - z_{1-\alpha/2}\, \hat{\sigma}_{F_{Y'}}\right)}{2 z_{1-\alpha/2}}.$$

  3. $$x_0 := -\left(\frac{\partial \hat{b}}{\partial \hat{m}}\right)_{\zeta = 0} \approx -\frac{\hat{b}(\hat{m} + z_{1-\alpha/2}\, \hat{\sigma}_{\hat{m}}) - \hat{b}(\hat{m} - z_{1-\alpha/2}\, \hat{\sigma}_{\hat{m}})}{2 z_{1-\alpha/2}\, \hat{\sigma}_{\hat{m}}}.$$
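The regularized derivative estimates above can be sketched as follows (plain Python; the function names are illustrative, the inputs are the residuals and a slope standard error from the previous section, and the quantile convention for the empirical CDF is one of several possible choices).

```python
import math
import statistics

def b_of_m(x, y, m):
    """Intercept as a function of the slope along zeta = 0."""
    return statistics.median(yi - m * xi for xi, yi in zip(x, y))

def sigma_b_hat(resid, alpha=0.05):
    """McKean-Schrader type estimate of the standard error of the
    median intercept from the order statistics of the residuals."""
    n = len(resid)
    r = sorted(resid)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    sigma_F = 1.0 / (2.0 * math.sqrt(n))   # sd of the empirical CDF at the median
    lo = max(0, math.ceil(n * (0.5 - z * sigma_F)) - 1)
    hi = min(n - 1, math.ceil(n * (0.5 + z * sigma_F)) - 1)
    return (r[hi] - r[lo]) / (2.0 * z)

def x0_estimate(x, y, m_hat, sigma_m, alpha=0.05):
    """Finite-difference estimate of x_0 = -(db/dm) along zeta = 0."""
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    h = z * sigma_m
    return -(b_of_m(x, y, m_hat + h) - b_of_m(x, y, m_hat - h)) / (2.0 * h)
```

For exactly collinear data y = x, the finite difference recovers $x_0 = \operatorname{median}(x_i)$; for Gaussian residuals, `sigma_b_hat` approaches the classical $1.2533\,\sigma/\sqrt{n}$ standard error of the median.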

Thus the final expression for the variance of the bias at the medical decision point becomes

$$\widehat{\operatorname{Var}}(\hat{B}_c) = \widehat{\operatorname{Var}}(\hat{B}_c^{\min}) + \hat{\sigma}_{\hat{m}}^2 \left(x_c - x_c^{\min}\right)^2.$$

Here, the variance is minimal for $x_c^{\min} = x_0 - (\hat{\sigma}_{\hat{b}} / \hat{\sigma}_{\hat{m}})\, \widehat{\operatorname{cor}}(\zeta, \tau)$, where it attains the value

$$\widehat{\operatorname{Var}}(\hat{B}_c^{\min}) = \hat{\sigma}_{\hat{b}}^2 \left(1 - \widehat{\operatorname{cor}}(\zeta, \tau)^2\right).$$

Hence, confidence intervals for the bias at the medical decision point can be obtained as

$$\hat{B}_c^{\mathrm{up/low}} = \hat{B}_c \pm z_{1-\alpha/2} \sqrt{\widehat{\operatorname{Var}}(\hat{B}_c)}.$$
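Putting the pieces together, the variance formula and the CI can be sketched as below (plain Python; all inputs are the estimates defined above, the function name is illustrative, and the sign convention in `x_c_min` follows the reconstruction given in the text).

```python
import math

def bias_ci(x_c, b_hat, m_hat, sigma_b, sigma_m, x0, cor_zt, z=1.96):
    """Delta-method CI for the bias B_c = b_hat + (m_hat - 1) * x_c
    at the medical decision point x_c."""
    x_c_min = x0 - (sigma_b / sigma_m) * cor_zt
    var_min = sigma_b ** 2 * (1.0 - cor_zt ** 2)
    var_bc = var_min + sigma_m ** 2 * (x_c - x_c_min) ** 2
    b_c = b_hat + (m_hat - 1.0) * x_c
    half = z * math.sqrt(var_bc)
    return b_c - half, b_c + half

# illustrative numbers of the same order as Table 2 (not fitted to any data)
lo, hi = bias_ci(x_c=500.0, b_hat=0.5, m_hat=1.01,
                 sigma_b=3.4, sigma_m=0.018, x0=150.0, cor_zt=0.5)
```

The interval is narrowest at $x_c = x_c^{\min}$ and widens linearly in $|x_c - x_c^{\min}|$ for distant decision points, as in Figure 3.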

The method has been tested in numerical simulations. Results are shown (Figure 3) for a uniform distribution of 200 “true” $x^* = y^*$ values (i.e., $m = 1$ and $b = 0$) in the range (0, 1000), with Gaussian errors in x and y with constant coefficient of variation $cv_x = cv_y = 0.1$. The simulation was repeated 1000 times to estimate the standard deviations of the estimated slope, intercept, and bias at the medical decision point, as well as the coverage of the true parameter values, cf. Table 2. Furthermore, in Figure 3, the simulated estimates of the bias are shown together with their standard deviation and the theoretical curve for the standard deviation derived from the mean values of $\widehat{\operatorname{Var}}(\hat{m})$, $\widehat{\operatorname{Var}}(\hat{B}_c^{\min})$, and $x_c^{\min}$.

Figure 3:

A: A set of 200 points with x * = y * from a uniform distribution in (0, 1000) with a constant coefficient of variation c v x = c v y = 0.1 . B: grey: 1000 simulated values of the bias B c min as a function of the cutoff at medical decision point x c . Dashed: standard deviation of the simulated bias values. Dotted: standard deviation of the bias calculated from the means of Var ^ ( m ^ ) , Var ^ ( B ^ c min ) , and of x c min from the simulations. The analytical expression not only approximates the width of the confidence intervals from simulation very well, but also reproduces the position where the variance attains its minimum.

Table 2:

Comparison of standard deviations calculated with the formulas derived in the text to simulated values, and of the coverage of the bias $B_c$ for values of the medical decision point $x_c \in \{0, 250, 500, 750, 1000\}$. The coverage of the slope and bias is generally good; only near $x_c^{\min}$ is the coverage rather conservative.

Parameter            Theoretical   Simulation
sd(m̂)                0.0179        0.0178
sd(b̂)                3.44          3.21
x_c^min              123           127
coverage m           95%           94.4%
coverage b = B_0     95%           97.4%
coverage B_250       95%           94.9%
coverage B_500       95%           94.0%
coverage B_750       95%           94.0%
coverage B_1000      95%           94.4%

5 Discussion

Most classical regression methods – including those which allow for imprecision in both variables, such as Deming or geometric mean regression – can be derived from the condition that Pearson's correlation vanishes for a specific scaling and rotation of the data. The slope of interest is thereby a parameter of the scaling or rotation. It has been shown that replacing Pearson's correlation with Kendall's tau yields interesting robust regression estimates. While this is well known for TSR, the argument extends to PBR; the ePBR [10] can be derived this way and has been shown to be the analog of geometric mean regression. It is easier to calculate than the cPBR estimate, since it is merely the median of the absolute values of the pairwise slopes, $\hat{m} = \operatorname{median}(|S_{ij}|)$. It also has the advantage of being equivariant under a scaling of the axes, while the cPBR estimator could only be determined for slopes approximately equal to 1.
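As a concrete illustration, the ePBR slope is just the median of the absolute pairwise slopes; a naive O(n²) sketch in plain Python (the function name is illustrative, not from any package):

```python
import statistics

def epbr_slope(x, y):
    """Equivariant Passing-Bablok slope: median of |S_ij| with
    S_ij = (y_i - y_j) / (x_i - x_j); pairs with equal x are skipped."""
    slopes = [abs((y[i] - y[j]) / (x[i] - x[j]))
              for i in range(len(x))
              for j in range(i + 1, len(x))
              if x[i] != x[j]]
    return statistics.median(slopes)
```

For exactly collinear data such as `x = [1, 2, 3, 4]`, `y = [3, 5, 7, 9]` the estimate is 2.0, and rescaling either axis by a factor c rescales the estimate by c or 1/c, which is the equivariance property mentioned above.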

The cPBR estimator appears as an approximation to the ePBR and has been shown to be the equivalent of Roos regression. Therefore, use of the ePBR seems preferable to that of the cPBR. Furthermore, as the PBR methods are derivable from the Kendall correlation, it would be natural to also report Kendall's τ as a measure of correlation rather than Pearson's r.

The regression method corresponding to Deming regression is shown to be equivalent to a standard problem in circular statistics, namely finding the circular median of the doubled slope angles. While this is a standard method in fields like geology or ecology, with proven robustness and unbiasedness, it has apparently not been applied in clinical chemistry. Consistency of this estimator has here only been proven for errors following a normal distribution. The effect of deviations from normality on the bias of the estimator needs to be evaluated in simulation studies; nevertheless, it is expected to be smaller than for Pearson-based regression. The method may be a viable alternative to PBR especially when the standard deviations are known and their ratio is very different from the expected slope.

Since the PBR estimator is derived from Kendall's tau statistic, which is a member of the class of U-statistics, it has furthermore been possible to derive an analytical expression for the variance of the intercept. This is of special importance in clinical chemistry, where the intercept obtained after shifting the origin of the coordinate axes to the medical decision point is known as the bias at the medical decision point.

Finally, the insight that the PBR estimator can be obtained as a root of Kendall's τ yields a fast algorithm of numerical complexity $O(n \ln n)$, as compared to $O(n^2)$ for the original shifted-median algorithm, so that PBR estimation also becomes possible for larger data sets. It is planned to include the algorithms described here in a future version of the R [30] package “MCR” [31], to be deposited in the CRAN repository.
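The $O(n \ln n)$ idea can be sketched with a merge-sort inversion count, which is the core of Knight's algorithm [21] (a simplified sketch assuming no ties; the published algorithm also handles tied ranks):

```python
def _count_inversions(seq):
    """Merge sort that also counts inversions (discordant pairs)."""
    if len(seq) <= 1:
        return 0, list(seq)
    mid = len(seq) // 2
    inv_l, left = _count_inversions(seq[:mid])
    inv_r, right = _count_inversions(seq[mid:])
    merged, inv, i, j = [], inv_l + inv_r, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left) - i   # right[j] is smaller than all remaining left items
    merged.extend(left[i:])
    merged.extend(right[j:])
    return inv, merged

def kendall_tau_fast(x, y):
    """Kendall's tau in O(n log n): sort y by x, then count inversions,
    which are exactly the discordant pairs (no ties assumed)."""
    n = len(x)
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    discordant, _ = _count_inversions(y_by_x)
    n_pairs = n * (n - 1) // 2
    return (n_pairs - 2 * discordant) / n_pairs
```

Since the PBR slope is a root of $\tau(m)$, each evaluation of $\tau$ inside a bisection over m then costs only $O(n \ln n)$.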


Corresponding author: Florian Dufey, Roche Diagnostics GmbH Standort Penzberg, Biostatistics, DXREBA, Nonnenwald 2, Penzberg, 82377, Germany, E-mail:

Acknowledgment

My sincere thanks are given to Dr. Terry Therneau for a thorough review and many comments which greatly helped to improve the paper. It was an honour to talk to Dr. Heinrich Passing and the family of the late Wolfgang Bablok about the collaboration which led to the invention of the method which now bears their names. I am indebted to my dear colleagues, especially Ge Guo, Kendra Jones, Ekaterina Manuilova, Gabrielle Miles, and Crystal Williams, for valuable discussions and proofreading.

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: None declared.

  3. Conflict of interest statement: The author declares no conflicts of interest regarding this article.

References

1. Ludbrook, J. Linear regression analysis for comparing two measurers or methods of measurement: but which regression? Clin Exp Pharmacol Physiol 2010;37:692–9. DOI: https://doi.org/10.1111/j.1440-1681.2010.05376.x.

2. Linnet, K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat Med 1990;9:1463–73. DOI: https://doi.org/10.1002/sim.4780091210.

3. Linnet, K. Evaluation of regression procedures for methods comparison studies. Clin Chem 1993;39:424–32. DOI: https://doi.org/10.1093/clinchem/39.3.424.

4. Ludbrook, J. Special article comparing methods of measurement. Clin Exp Pharmacol Physiol 1997;24:193–203. DOI: https://doi.org/10.1111/j.1440-1681.1997.tb01807.x.

5. CLSI. Measurement procedure comparison and bias estimation using patient samples. CLSI guideline EP09c. Wayne, PA: Clinical and Laboratory Standards Institute; 2018.

6. Passing, H, Bablok, W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Application of linear regression procedures for method comparison studies in clinical chemistry, part I. Clin Chem Lab Med 1983;21:709–20. DOI: https://doi.org/10.1515/cclm.1983.21.11.709.

7. Theil, H. A rank-invariant method of linear and polynomial regression analysis. In: Henri Theil's contributions to economics and econometrics. Dordrecht: Springer; 1992:345–81 pp. DOI: https://doi.org/10.1007/978-94-011-2546-8_20.

8. Sen, PK. Estimates of the regression coefficient based on Kendall's tau. J Am Stat Assoc 1968;63:1379–89. DOI: https://doi.org/10.1080/01621459.1968.10480934.

9. Passing, H, Bablok, W. Comparison of several regression procedures for method comparison studies and determination of sample sizes. Application of linear regression procedures for method comparison studies in clinical chemistry, part II. Clin Chem Lab Med 1984;22:431–45. DOI: https://doi.org/10.1515/cclm.1984.22.6.431.

10. Bablok, W, Passing, H, Bender, R, Schneider, B. A general regression procedure for method transformation. Application of linear regression procedures for method comparison studies in clinical chemistry, part III. Clin Chem Lab Med 1988;26:783–90. DOI: https://doi.org/10.1515/cclm.1988.26.11.783.

11. Roos, CF. A general invariant criterion of fit for lines and planes where all variates are subject to error. Metron 1937;13:3–20.

12. Xu, S. A property of geometric mean regression. Am Stat 2014;68:277–81. DOI: https://doi.org/10.1080/00031305.2014.962763.

13. Therneau, T. deming: Deming, Theil–Sen and Passing–Bablok and total least squares regression. Rochester, MN: Mayo Clinic; 2018. R package version 1.4. Available from: https://CRAN.R-project.org/package=deming.

14. Clarke, M. The reduced major axis of a bivariate sample. Biometrika 1980;67:441–6. DOI: https://doi.org/10.1093/biomet/67.2.441.

15. Rousseeuw, PJ, Leroy, AM. Robust regression and outlier detection, Vol 516. Hoboken, NJ: John Wiley & Sons; 2003.

16. Jammalamadaka, SR, Sengupta, A. Topics in circular statistics, Vol 5. Singapore: World Scientific; 2001. DOI: https://doi.org/10.1142/4031.

17. Mardia, KV, Jupp, PE. Directional statistics, Vol 494. Hoboken, NJ: John Wiley & Sons; 2009.

18. He, X, Simpson, DG. Robust direction estimation. Ann Stat 1992:351–69. DOI: https://doi.org/10.1214/aos/1176348526.

19. Fisher, N, Lunn, AD, Davies, SJ. Spherical median axes. J Roy Stat Soc Ser B 1993;55:117–24. DOI: https://doi.org/10.1111/j.2517-6161.1993.tb01471.x.

20. Matoušek, J. Randomized optimal algorithm for slope selection. Inf Process Lett 1991;39:183–7. DOI: https://doi.org/10.1016/0020-0190(91)90177-j.

21. Knight, WR. A computer method for calculating Kendall's tau with ungrouped data. J Am Stat Assoc 1966;61:436–9. DOI: https://doi.org/10.1080/01621459.1966.10480879.

22. Goldstine, HH, von Neumann, J. Planning and coding of problems for an electronic computing instrument, part II, Vol II. Technical report. Princeton, NJ: Institute for Advanced Study; 1947.

23. Ballester, P. Hough transform for robust regression and automated detection. Astron Astrophys 1994;286:1011–18. Available from: https://ui.adsabs.harvard.edu/abs/1994A&A...286.1011B.

24. McKean, JW, Schrader, RM. A comparison of methods for studentizing the sample median. Commun Stat Simul Comput 1984;13:751–73. DOI: https://doi.org/10.1080/03610918408812413.

25. Newson, R. Confidence intervals for rank statistics: percentile slopes, differences, and ratios. Stata J 2006;6:497–520. DOI: https://doi.org/10.1177/1536867x0600600404.

26. Daniels, HE, Kendall, MG. The significance of rank correlations where parental correlation exists. Biometrika 1947;34:197–208. DOI: https://doi.org/10.1093/biomet/34.3-4.197.

27. Cliff, N, Charlin, V. Variances and covariances of Kendall's tau and their estimation. Multivar Behav Res 1991;26:693–707. PMID: 26751027. DOI: https://doi.org/10.1207/s15327906mbr2604_6.

28. Fuller, WA. Measurement error models, Vol 305. New York: John Wiley & Sons; 2009.

29. Lee, AJ. U-statistics: theory and practice. Statistics: a series of textbooks and monographs, Vol 110. Oxfordshire, UK: Taylor & Francis; 1990.

30. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2017. Available from: https://www.R-project.org/.

31. Manuilova, E, Schuetzenmeister, A, Model, F. mcr: method comparison regression; 2014. R package version 1.2.1. Available from: https://CRAN.R-project.org/package=mcr.

Received: 2019-12-13
Accepted: 2020-04-20
Published Online: 2020-08-12

© 2020 Florian Dufey, published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
