Article

About the Equivalence of the Latent D-Scoring Model and the Two-Parameter Logistic Item Response Model

by
Alexander Robitzsch
1,2
1
IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstrasse 62, 24118 Kiel, Germany
2
Centre for International Student Assessment (ZIB), Olshausenstrasse 62, 24118 Kiel, Germany
Mathematics 2021, 9(13), 1465; https://doi.org/10.3390/math9131465
Submission received: 25 May 2021 / Revised: 18 June 2021 / Accepted: 21 June 2021 / Published: 22 June 2021

Abstract
This article shows that the recently proposed latent D-scoring model of Dimitrov is statistically equivalent to the two-parameter logistic item response model. An analytical derivation and a numerical illustration are employed to demonstrate this finding. Hence, estimation techniques for the two-parameter logistic model can be used for estimating the latent D-scoring model. In an empirical example using PISA data, differences in country ranks are investigated when different metrics are used for the latent trait. In this example, the choice of the latent trait metric matters for the ranking of countries. Finally, it is argued that an item response model with bounded latent trait values, like the latent D-scoring model, might offer interpretational advantages when reporting results.

1. Introduction

Item response theory (IRT; [1]) is the statistical analysis of test items in education, psychology, and other fields of the social sciences. Typically, a number of test items are administered to test takers, and the goal is to infer their ability (performance or trait). IRT models relate observed item responses to unobserved latent traits. Because the latent trait is unobserved, there are many plausible choices for modeling these relationships. The most popular class of IRT models comprises logistic IRT models [2]. Recently, in a series of papers, Dimitrov proposed an alternative IRT model, the so-called latent D-scoring model [3]. The main goal of this paper is to demonstrate that the newly proposed IRT model is statistically equivalent to the well-established two-parameter logistic IRT model.
The paper is structured as follows. In Section 2, IRT models are introduced in their general form. Afterward, the logistic IRT model and the latent D-scoring model are discussed. In Section 3, we show the statistical equivalence of the latent D-scoring model and the logistic IRT model utilizing an analytical derivation and a numerical illustration. Furthermore, we study the properties of the two models. Section 4 presents an empirical example that compares outcomes of the two different modeling strategies and compares them with two alternative parameterizations of the latent trait. Finally, the article closes with a discussion.

2. Item Response Modeling

In Section 2.1, we discuss the indeterminacy of the latent trait in IRT models. In Section 2.2, we focus on the logistic IRT model and its estimation. As an alternative IRT model, the latent D-scoring model is introduced in Section 2.3.

2.1. Indeterminacy of the Latent Trait in IRT Models

A unidimensional IRT model for dichotomous item responses $X_i \in \{0, 1\}$ is a statistical model [2]
$$P(\boldsymbol{X} = \boldsymbol{x}) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \left[ 1 - P_i(\theta) \right]^{1 - x_i} f(\theta) \, \mathrm{d}\theta, \quad \theta \sim F, \qquad (1)$$
where $f$ denotes the density function of the latent variable $\theta$ (also denoted as the latent trait), and $P_i(x, \theta) = P(X_i = x \mid \theta)$ denotes the item response function (IRF) of item $i$. Note that items $i = 1, \ldots, I$ are conditionally independent given the latent trait $\theta$. The model parameters in Equation (1) are typically not uniquely defined. Assume that one utilizes a monotone function $m: \mathbb{R} \to (0, 1)$ for defining a transformed latent trait $\delta$ by $\delta = m(\theta)$. For example, $m$ could be the logistic function $\Psi(x) = [1 + \exp(-x)]^{-1}$ that maps the real line onto the unit interval $(0, 1)$. Define $P_i^*(\delta) = P_i(m^{-1}(\delta))$, where $m^{-1}$ denotes the inverse function of $m$. Furthermore, denote by $g$ the density function of the transformed latent trait $\delta$. The IRT model in Equation (1) can be equivalently written as
$$P(\boldsymbol{X} = \boldsymbol{x}) = \int_0^1 \prod_{i=1}^{I} P_i^*(\delta)^{x_i} \left[ 1 - P_i^*(\delta) \right]^{1 - x_i} g(\delta) \, \mathrm{d}\delta. \qquad (2)$$
The density $g$ can be obtained from the density $f$ by applying the density transformation theorem
$$g(\delta) = \frac{f(m^{-1}(\delta))}{m'(m^{-1}(\delta))}, \qquad (3)$$
where $m' = \frac{\mathrm{d}m}{\mathrm{d}\theta}$ is the derivative of $m$ with respect to $\theta$.
It could be argued that only ordinal information can be extracted from the latent trait $\theta$ because the general IRT model (1) is only identified up to monotone transformations [4,5,6,7]. The indeterminacy of the latent trait metric implies that a researcher can seek a transformation $m(\theta)$ for the sake of enhancing interpretations of the results. One possible transformation is the true score metric $\tau = \tau(\theta)$ [2] that maps the $\theta$ metric from the real line to the bounded interval $(0, 1)$ by defining
$$\tau(\theta) = \frac{1}{I} \sum_{i=1}^{I} P_i(\theta). \qquad (4)$$
For a fixed value of $\theta$, $\tau = \tau(\theta)$ is the expected value of the proportion of correctly solved items. Another alternative is the rank score metric $\rho = \rho(\theta)$ [7] that is defined by
$$\rho(\theta) = F(\theta), \qquad (5)$$
where $F$ is the distribution function of $\theta$. One can show that $\rho$ follows a uniform distribution (hence, the label “rank score”):
$$P(\rho \leq u) = P(F(\theta) \leq u) = P(\theta \leq F^{-1}(u)) = F(F^{-1}(u)) = u, \quad 0 < u < 1. \qquad (6)$$
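The transformations above can be made concrete in a few lines of code. The following sketch is illustrative only: the 2PL item parameters are invented, and a normal trait distribution is assumed for the rank score.

```python
import math

# Hypothetical 2PL item parameters (a_i, b_i); illustrative only.
ITEMS = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.5), (1.2, 1.0)]

def logistic(x):
    """Logistic function Psi(x) = [1 + exp(-x)]^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))

def true_score(theta, items=ITEMS):
    """True score tau(theta), Eq. (4): expected proportion of solved items."""
    return sum(logistic(a * (theta - b)) for a, b in items) / len(items)

def rank_score(theta, mu=0.0, sigma=1.0):
    """Rank score rho(theta) = F(theta), Eq. (5), for theta ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((theta - mu) / (sigma * math.sqrt(2.0))))
```

Both transformed metrics are monotone in $\theta$ and take values in $(0,1)$, which is the property exploited throughout this section.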

2.2. Logistic Item Response Model

An important class of IRT models is the class of logistic IRT models. Logistic IRT models employ the logistic link function for parameterizing IRFs. The IRFs in the two-parameter logistic (2PL) model [8] are given by
$$P(X_i = 1 \mid \theta) = P_i(\theta) = \frac{1}{1 + \exp(-a_i(\theta - b_i))} = \Psi\!\left( a_i (\theta - b_i) \right), \quad \theta \sim F, \qquad (7)$$
where a i are item discriminations, and b i are item difficulties. The one-parameter logistic (1PL) model (Rasch model; [9]) is obtained by setting all item discriminations equal to one (i.e., a i = 1 for i = 1 , , I ).
In Figure 1, IRFs of seven items of the 2PL model are displayed (see the figure legend for item parameters $a_i$ and $b_i$). It can be seen that items with higher item discriminations $a_i$ have steeper slopes. Additionally, items with larger item difficulties $b_i$ are shifted to the right. A fundamental property of IRFs in the 2PL model is a lower asymptote of zero and an upper asymptote of one. Hence, persons with very low abilities ($\theta \to -\infty$) have a probability of correctly solving any item in the test that approaches zero, while highly able persons ($\theta \to \infty$) correctly solve items with a probability approaching one. Alternative IRT models allow lower and upper asymptotes different from 0 or 1, respectively [10].
In many applications, a normal distribution $N(\mu, \sigma^2)$ for the latent trait $\theta$ is assumed [7]. However, more flexible distributions or semiparametric specifications are possible [11,12]. Identification constraints are required in the 1PL and 2PL models for the estimation of model parameters. In the 1PL model, one can identify the model by setting $\mu = 0$ or by fixing the item difficulty of a reference item to 0 (or to a prespecified value). Alternatively, one can constrain the sum of the item difficulties to be zero. In the 2PL model, identification can be ensured by imposing a standard normal distribution $N(0, 1)$ (i.e., $\mu = 0$ and $\sigma = 1$). Alternatively, a reference item $i_0$ can be chosen for which $a_{i_0} = 1$ and $b_{i_0} = 0$ are used as fixed values in the estimation. Using a reference item has the advantage that the distribution $F$ of $\theta$ can be flexibly estimated without constraints on some parameters of $F$.
The 1PL model or the 2PL model can be estimated using marginal maximum likelihood (MML) or joint maximum likelihood (JML) estimation [2]. It is noteworthy that $\sum_{i=1}^{I} X_i$ is a sufficient statistic for $\theta$ in the 1PL model, while $\sum_{i=1}^{I} a_i X_i$ is the corresponding sufficient statistic in the 2PL model. Hence, the different models imply different interpretations and implications of the trait because the contribution of items to the variable of interest differs considerably [13].
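The practical consequence of the different sufficient statistics can be seen from two hypothetical response patterns that share the same raw sum score but differ in the discrimination-weighted sum; the discriminations below are made up for illustration.

```python
# Hypothetical 2PL discriminations; illustrative only.
a = [0.5, 1.0, 2.0]

x1 = [1, 1, 0]   # correct on the two least discriminating items
x2 = [0, 1, 1]   # correct on the two most discriminating items

raw1, raw2 = sum(x1), sum(x2)                 # 1PL sufficient statistic
w1 = sum(ai * xi for ai, xi in zip(a, x1))    # 2PL sufficient statistic
w2 = sum(ai * xi for ai, xi in zip(a, x2))
```

Under the 1PL model, both patterns carry identical information about $\theta$ (same raw sum), whereas the 2PL model orders the two persons differently.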

2.3. Dimitrov’s Latent D-Scoring Model

Dimitrov proposes an alternative IRT model that has a bounded metric for the latent trait. His latent D-scoring (LDS) model [3,14] includes a latent trait δ that takes values in the interval ( 0 , 1 ) . The IRF in the LDS model is given as [15]
$$P(X_i = 1 \mid \delta) = \frac{1}{1 + \left[ \dfrac{1 - \delta}{\delta} \cdot \dfrac{\beta_i}{1 - \beta_i} \right]^{\alpha_i}}, \quad \delta \sim G, \qquad (8)$$
where $G$ is some distribution on $(0, 1)$. Item discriminations $\alpha_i$ are non-negative and indicate the extent to which item $i$ measures the trait $\delta$. Item difficulties $\beta_i$ range between 0 and 1 and primarily determine the proportion of persons correctly solving item $i$. The IRF in Equation (8) is also referred to as the rational function model with two item parameters [15]. The IRFs of the LDS model for seven items are shown in Figure 2 (see the figure legend for item parameters $\alpha_i$ and $\beta_i$).
The LDS model with one item parameter is obtained by setting $\alpha_i = 1$ [15]:
$$P(X_i = 1 \mid \delta) = \frac{1}{1 + \dfrac{1 - \delta}{\delta} \cdot \dfrac{\beta_i}{1 - \beta_i}}, \quad \delta \sim G. \qquad (9)$$
The LDS model with three item parameters that accommodates guessing effects is defined as [15]
$$P(X_i = 1 \mid \delta) = \gamma_i + (1 - \gamma_i) \frac{1}{1 + \left[ \dfrac{1 - \delta}{\delta} \cdot \dfrac{\beta_i}{1 - \beta_i} \right]^{\alpha_i}}, \quad \delta \sim G. \qquad (10)$$
In the following, we mainly consider the case of the LDS model with two item parameters.
The LDS model can be estimated with MML [3] or JML [16]. In Section 3, we show that identification constraints are needed for the estimation of the model. The latent D-scoring model is applied in psychometric areas of linking and equating [16], differential item functioning [14], and the development of multistage tests [17].
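The LDS IRFs in Equations (8) and (10) are simple enough to sketch directly; the parameter values used below are arbitrary illustrations, not values from the article.

```python
def lds_irf(delta, alpha, beta):
    """Two-parameter LDS IRF, Eq. (8) (rational function model)."""
    odds = ((1.0 - delta) / delta) * (beta / (1.0 - beta))
    return 1.0 / (1.0 + odds ** alpha)

def lds_irf_3p(delta, alpha, beta, gamma):
    """Three-parameter LDS IRF with pseudo-guessing parameter gamma, Eq. (10)."""
    return gamma + (1.0 - gamma) * lds_irf(delta, alpha, beta)
```

A convenient property follows directly from Equation (8): at $\delta = \beta_i$, the bracketed odds ratio equals one, so the success probability is exactly 0.5 regardless of $\alpha_i$.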

3. Relation of the Latent D-Scoring Model and the 2PL Model

In this section, we show the close correspondence of the 2PL model and the LDS model. It is demonstrated that the two models are equivalent using analytical (Section 3.1) and numerical (Section 3.2) arguments. However, the two models imply different consequences regarding measurement precision and interpretations (Section 3.3). Finally, we propose an extension of the LDS model to multiple dimensions in Section 3.4.

3.1. Equivalence of the Latent D-Scoring Model and the 2PL Model

In this subsection, we analytically show that the LDS model is statistically equivalent to the 2PL model. Consequently, the model parameters of the 2PL model can be transformed to obtain model parameters of the LDS model.
The IRF of the LDS model (Equation (8)) can be rewritten as
$$P(X_i = 1 \mid \delta) = \frac{1}{1 + \exp\left( -\alpha_i \left[ \log \dfrac{\delta}{1 - \delta} - \log \dfrac{\beta_i}{1 - \beta_i} \right] \right)}, \quad \delta \sim G, \qquad (11)$$
where $G$ is the distribution function of $\delta$. By defining $\theta = \log \frac{\delta}{1 - \delta}$, $b_i = \log \frac{\beta_i}{1 - \beta_i}$, and $a_i = \alpha_i$, one can rephrase the LDS model in Equation (11) as the 2PL model. Equivalently, we can write $\delta = \Psi(\theta) = [1 + \exp(-\theta)]^{-1}$ as the logistic transform of $\theta$. Note that the logistic transform $\delta = \Psi(\theta)$ was also discussed in [6,7]. Hence, the LDS model is a reparametrization of the 2PL model, and estimation routines for the 2PL model can be used for estimating the latent D-scoring model, with item parameters transformed afterward; that is, $\alpha_i = a_i$ and $\beta_i = \Psi(b_i)$.
Our derivation also implies that the LDS model with one item parameter is equivalent to the 1PL model. Moreover, the LDS model with three parameters is equivalent to the three-parameter logistic IRT model.
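The parameter mapping can be verified numerically. A minimal sketch with arbitrary item parameters (assumed for illustration, not taken from Table 1):

```python
import math

def psi(x):
    """Logistic function Psi(x)."""
    return 1.0 / (1.0 + math.exp(-x))

def irf_2pl(theta, a, b):
    """2PL IRF, Eq. (7)."""
    return psi(a * (theta - b))

def irf_lds(delta, alpha, beta):
    """Two-parameter LDS IRF, Eq. (8)."""
    odds = ((1.0 - delta) / delta) * (beta / (1.0 - beta))
    return 1.0 / (1.0 + odds ** alpha)

# Transformation: alpha_i = a_i and beta_i = Psi(b_i).
a, b = 1.3, 0.7
alpha, beta = a, psi(b)

# The two IRFs agree at delta = Psi(theta) for every theta.
for theta in [-2.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(irf_2pl(theta, a, b) - irf_lds(psi(theta), alpha, beta)) < 1e-12
```

The loop passing for arbitrary $\theta$ values is exactly the statistical equivalence shown analytically above, evaluated pointwise.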
The distribution of $\delta$ can also be derived from the distribution of $\theta$. The density function $g$ of $\delta$ can be obtained from the density function $f$ of $\theta$ by applying Equation (3):
$$g(\delta) = \frac{1}{\delta (1 - \delta)} \, f\!\left( \log \frac{\delta}{1 - \delta} \right). \qquad (12)$$
Conversely, the density function of $\theta$ can also be obtained from the density function of $\delta$ by
$$f(\theta) = \Psi(\theta) \left[ 1 - \Psi(\theta) \right] g\!\left( \Psi(\theta) \right). \qquad (13)$$
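As a sanity check, Equation (12) turns a normal density for $\theta$ into the logit-normal density for $\delta$, which must still integrate to one over $(0, 1)$. The sketch below assumes a standard normal $f$ and checks the integral with a midpoint rule:

```python
import math

def f_normal(theta, mu=0.0, sigma=1.0):
    """Normal density for the theta metric."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def g_logitnormal(delta, mu=0.0, sigma=1.0):
    """Implied density of delta = Psi(theta) via Eq. (12)."""
    theta = math.log(delta / (1.0 - delta))
    return f_normal(theta, mu, sigma) / (delta * (1.0 - delta))

# Midpoint-rule check that g integrates to one on (0, 1).
n = 20000
total = sum(g_logitnormal((k + 0.5) / n) for k in range(n)) / n
```

The Jacobian factor $1 / [\delta(1 - \delta)]$ is what keeps the transformed density proper.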
The estimation of the LDS model using software for the 2PL model requires a correct specification of the distribution for θ . Suppose that a particular distributional assumption is posed on δ with density g. In that case, the estimation procedure must ensure that the assumed distribution for θ aligns with the implied density f for θ (see Equation (13)) to avoid biased item parameter estimates.
The variance–covariance matrix $\boldsymbol{V}$ for the item parameters in the 2PL model can be obtained utilizing the observed information matrix. The transformed item parameters for the LDS model emerge from a nonlinear transformation of the 2PL item parameters. Hence, the delta method can be applied for obtaining standard errors of the item parameters of the LDS model. In more detail, the matrix $\boldsymbol{A}$ of derivatives of the transformed item parameters with respect to the 2PL item parameters is a diagonal matrix, and the variance–covariance matrix for the transformed item parameters is $\boldsymbol{A} \boldsymbol{V} \boldsymbol{A}^\top$.
In Section 2.2, we showed that identification constraints are needed for estimating the 2PL model. Because the latent D-scoring model is equivalent to the 2PL model, the former also needs identification constraints. In the 2PL model, the location (i.e., the mean $\mu$) and the scale (i.e., the standard deviation $\sigma$) of the latent trait $\theta$ can be fixed in the estimation. This would translate into identification constraints for the LDS model. Alternatively, a reference item $i_0$ could be chosen for the LDS model with fixed parameters $\alpha_{i_0} = 1$ and $\beta_{i_0} = 0.5$.

3.2. Numerical Illustration

This subsection demonstrates that the LDS model can be estimated using software for the 2PL model. We used item parameters of I = 7 items of the LDS model that were also used in Figure 2 (see also Table 1). The multivariate distribution of these I = 7 items according to the LDS model can be written as
$$P(\boldsymbol{X} = \boldsymbol{x}) = \int_0^1 \prod_{i=1}^{I} P_i(\delta; \alpha_i, \beta_i)^{x_i} \left[ 1 - P_i(\delta; \alpha_i, \beta_i) \right]^{1 - x_i} g(\delta) \, \mathrm{d}\delta, \qquad (14)$$
where $P_i(\delta; \alpha_i, \beta_i)$ is the IRF for the $i$th item of the LDS model, and $\boldsymbol{x} = (x_1, \ldots, x_I)$. Note that there are $2^I = 128$ different item response patterns. The corresponding marginal probabilities $P(\boldsymbol{X} = \boldsymbol{x})$ are computed using (14) and numerical integration with respect to the latent trait. This numerical illustration aims to compute the multivariate distribution $P(\boldsymbol{X} = \boldsymbol{x})$ of $\boldsymbol{X}$ in Equation (14) for specified item and distribution parameters and to show that the input parameters can be uniquely and correctly identified from $P(\boldsymbol{X} = \boldsymbol{x})$. This probability distribution corresponds to a population, i.e., a sample with an infinite sample size. Maximum likelihood estimation is applied to estimate the item and distribution parameters. Because the analysis relies on the population distribution, sampling variability plays no role in the estimated (i.e., identified) parameters. Hence, the illustration demonstrates the parameter equivalence of the 2PL and the LDS model at the population level. Note that no standard errors must be reported for item parameters because the data are defined at the population level.
In Section 3.1, we derived the transformation of item parameters from the 2PL model to the LDS model when showing statistical equivalence. Notably, for establishing statistical equivalence, the distribution $G$ for $\delta$ is a transformation of the distribution $F$ for $\theta$. When item parameters of the LDS model are obtained as transformed item parameters from the 2PL model, it must be ensured that the distribution $F$ of $\theta$ is correctly specified in the 2PL model; that is, $F$ must correspond to the distribution $G$ for $\delta$ that is used for generating the data. Hence, we investigate whether distributional misspecifications of $\theta$ have consequences for the transformed item parameters of the LDS model. We considered two distributions for $\delta$ in the data-generating model. First, $\delta$ followed a beta distribution Beta(4,2) [18]. Second, $\delta$ followed a logit-normal distribution LogitN$(0.6, 1.2^2)$; that is, $\theta = \log \frac{\delta}{1 - \delta}$ follows a normal distribution with a mean of 0.6 and a standard deviation of 1.2 [19,20,21].
The 2PL model was estimated in the R [22] package sirt [23] using a sample weights option that inputs the item response pattern probabilities $P(\boldsymbol{X} = \boldsymbol{x})$. We used a fixed grid of 61 equidistant $\theta$ values ranging between $-6$ and $6$ and, to avoid a restrictive distributional assumption on $\theta$, estimated the trait distribution with log-linear smoothing. The item parameters of the fourth item were fixed (i.e., $a_4 = 1$ and $b_4 = 0$ in the 2PL model, which corresponds to $\alpha_4 = 1$ and $\beta_4 = 0.5$ in the LDS model). The 2PL model was estimated using MML estimation and an EM algorithm [24]. Sample R code for the estimation is provided in Appendix A.
Results for this numerical illustration are presented in Table 1. It can be seen that the estimated item parameters $\hat{\alpha}_i$ and $\hat{\beta}_i$ for the LDS model almost perfectly recover the true values in the case of the logit-normal distribution. This finding can be expected because the log-linear smoothing approach includes the normal distribution as a particular instance (smoothing up to two moments). Slightly larger deviations were observed if the distribution for $\delta$ was a beta distribution. The logit transform of the beta distribution is not exactly represented by a normal distribution for $\theta$, which explains the slight biases in item parameter estimates. For example, $\hat{\beta}_7 = 0.892$ deviated from the true value $\beta_7 = 0.90$, and $\hat{\alpha}_5 = 1.472$ deviated from $\alpha_5 = 1.50$. However, these numerical differences are probably negligible in practical applications and confirm our analytical reasoning for the equivalence of the 2PL and the LDS model.

3.3. Conditional Standard Errors for the Latent Trait

In this subsection, we study the amount of information about the latent trait that can be extracted with the 2PL model and the LDS model by using the concept of item information. Let $\boldsymbol{x}_p = (x_{p1}, \ldots, x_{pI})$ denote the vector of item responses of person $p$. For IRFs $P_i$ (depending on already estimated item parameters), the maximum likelihood estimate $\hat{\theta}_p$ for the latent trait of person $p$ is given as [1]
$$\hat{\theta}_p = \arg\max_{\theta} \sum_{i=1}^{I} \left[ x_{pi} \log P_i(\theta) + (1 - x_{pi}) \log\left( 1 - P_i(\theta) \right) \right]. \qquad (15)$$
Hence, the standard error associated with the estimate $\hat{\theta}_p$ is related to the information function that is obtained as the negative value of the second derivative of the log-likelihood function evaluated at $\hat{\theta}_p$. The information provided by item $i$ in (15) is then given as
$$- x_{pi} \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \log P_i(\theta) - (1 - x_{pi}) \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \log\left( 1 - P_i(\theta) \right). \qquad (16)$$
This allows defining the (expected) item information $I_i$ for item $i$ [25]
$$I_i(\theta) = - \pi_i \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \log P_i(\theta) - (1 - \pi_i) \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \log\left( 1 - P_i(\theta) \right), \qquad (17)$$
where $\pi_i = E(X_i)$ is the expected value for item $i$. In the literature, the observed item information
$$OI_i(\theta) = - \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \log P_i(\theta) \qquad (18)$$
is often defined as the item information function. However, this function can become negative for some IRT models, and for the LDS model in particular [25], which is why we prefer (17): it ensures positivity of the item information function. For the 2PL model, the expected and observed item information coincide and are given as
$$I_i(\theta) = a_i^2 \, P_i(\theta) \left[ 1 - P_i(\theta) \right]. \qquad (19)$$
Equation (19) implies that the least information is available for extreme θ values (i.e., extremely negative or positive).
The test information $I(\theta)$ is defined as $I(\theta) = \sum_{i=1}^{I} I_i(\theta)$. It quantifies the information that is provided by the test at each latent trait value $\theta$. The conditional standard error for the latent trait $\theta$ is given by $\mathrm{SE}(\theta) = 1 / \sqrt{I(\theta)}$.
One can similarly define the item information function for $\delta$ in the LDS model (see also [15]):
$$I_i(\delta) = - \pi_i \frac{\mathrm{d}^2}{\mathrm{d}\delta^2} \log P_i(\delta) - (1 - \pi_i) \frac{\mathrm{d}^2}{\mathrm{d}\delta^2} \log\left( 1 - P_i(\delta) \right). \qquad (20)$$
Analogously, the test information function $I(\delta) = \sum_{i=1}^{I} I_i(\delta)$ can be defined for the latent trait $\delta$.
Because the latent D-scoring model is equivalent to the 2PL model (see Section 3.1), $\delta = \Psi(\theta)$ is a monotone transformation of $\theta$, and the test information function for $\theta$ can be converted into the test information function for $\delta$. More generally, let $\delta = m(\theta)$ be a monotone differentiable transformation. The test information function for $\delta$ can be computed from the test information function for $\theta$ (see [2]):
$$I(\delta) = I(m(\theta)) = \frac{1}{\left[ m'(\theta) \right]^2} \, I(\theta), \qquad (21)$$
where $m' = \frac{\mathrm{d}m}{\mathrm{d}\theta}$. Equation (21) can be rewritten for conditional standard errors as
$$\mathrm{SE}(\delta) = \mathrm{SE}(m(\theta)) = m'(\theta) \, \mathrm{SE}(\theta). \qquad (22)$$
Hence, the conditional standard error $\mathrm{SE}(\delta)$ for the LDS model is given as
$$\mathrm{SE}(\delta) = \mathrm{SE}(\Psi(\theta)) = \Psi(\theta) \left[ 1 - \Psi(\theta) \right] \mathrm{SE}(\theta). \qquad (23)$$
In Section 3.2, we demonstrated that the LDS model is equivalent to the 2PL model. For the item parameters of the seven items used in the demonstration (see Table 1), the conditional standard errors for $\theta$ and $\delta$ are shown in Figure 3. It can be seen that the 2PL model measures the latent trait $\theta$ less precisely for extremely large negative and extremely large positive values, that is, for low- and high-achieving persons. In line with the results of [3], the converse holds for the LDS model: conditional standard errors are smallest for persons with $\delta$ values near 0 or 1. Hence, statements about measurement precision in different ranges of the latent trait strongly depend on the chosen metric (see also [26]). Interestingly, the transformed latent trait $\xi = m(\theta) = \int_{-\infty}^{\theta} \sqrt{I(u)} \, \mathrm{d}u$ (the so-called arc length metric; see [6]) has homogeneous standard errors across the whole latent trait range:
$$\mathrm{SE}(\xi) = \mathrm{SE}(m(\theta)) = 1. \qquad (24)$$
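The opposite tail behavior of $\mathrm{SE}(\theta)$ and $\mathrm{SE}(\delta)$ can be reproduced in a few lines. The sketch below assumes five identical hypothetical items with $a_i = 1$ and $b_i = 0$ (not the Table 1 parameters):

```python
import math

ITEMS = [(1.0, 0.0)] * 5  # hypothetical, identical items

def psi(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_theta(theta, items=ITEMS):
    """SE(theta) = 1 / sqrt(I(theta)), with I(theta) from Eq. (19)."""
    info = sum(a * a * psi(a * (theta - b)) * (1.0 - psi(a * (theta - b)))
               for a, b in items)
    return 1.0 / math.sqrt(info)

def se_delta(theta, items=ITEMS):
    """SE(delta) = Psi(theta) (1 - Psi(theta)) SE(theta), Eq. (23)."""
    d = psi(theta)
    return d * (1.0 - d) * se_theta(theta, items)
```

For extreme $\theta$ values, $\mathrm{SE}(\theta)$ diverges while $\mathrm{SE}(\delta)$ shrinks toward zero, reproducing the pattern described for Figure 3.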
These observations indicate that it is difficult to state for which subgroups of persons adaptive or multistage testing [27] provides measurement precision gains because such statements depend on the chosen metric.
An anonymous reviewer offered an insightful explanation of the behavior of the conditional standard errors. For persons with a high D-score, a low standard error results because one can be very confident that these persons will answer a new but parallel item correctly. In contrast, for persons with a high $\theta$ score, a high standard error is observed because it is uncertain what the hardest item is that they could still solve. Overall, the different scoring methods imply different interpretations of the trait and, therefore, different interpretations of their respective standard errors.

3.4. A Multidimensional Latent D-Scoring Model

To our knowledge, the LDS model has only been investigated for a unidimensional latent variable $\delta$. However, in applications, multidimensional traits are often of interest [28,29]. We now show that a straightforward extension of the LDS model to multiple dimensions can be obtained by applying the same transformations to the multidimensional variant of the 2PL model. We illustrate the argument for two dimensions $\theta_1$ and $\theta_2$.
The multidimensional logistic IRT model can be written as [29]
$$P(X_i = 1 \mid \theta_1, \theta_2) = \frac{1}{1 + \exp(-a_{i1} \theta_1 - a_{i2} \theta_2 + d_i)}, \quad (\theta_1, \theta_2) \sim F, \qquad (25)$$
where $F$ is a bivariate distribution of $(\theta_1, \theta_2)$, and the $\theta_d$ ($d = 1, 2$) attain values on the real line. Define transformed latent traits $\delta_d = \Psi(\theta_d) = [1 + \exp(-\theta_d)]^{-1}$ ($d = 1, 2$) as the logistic transformations of $\theta_d$. As in the unidimensional LDS model, the $\delta_d$ variables attain values in the interval $(0, 1)$. Note that the inverse transformation is given as $\theta_d = \log \frac{\delta_d}{1 - \delta_d}$. Then, employing the same strategy as in Section 3.1, one can rewrite Equation (25) by using $\beta_i = \Psi(d_i)$ and $\alpha_{id} = a_{id}$ as
$$P(X_i = 1 \mid \delta_1, \delta_2) = \frac{1}{1 + \left( \dfrac{1 - \delta_1}{\delta_1} \right)^{\alpha_{i1}} \left( \dfrac{1 - \delta_2}{\delta_2} \right)^{\alpha_{i2}} \dfrac{\beta_i}{1 - \beta_i}}, \quad (\delta_1, \delta_2) \sim G. \qquad (26)$$
Hence, the multidimensional 2PL model can easily be reparametrized for defining a multidimensional LDS model. The generalization to more than two dimensions is straightforward. Given that multidimensional IRT models are more difficult to estimate than unidimensional IRT models, it is advantageous that existing software implementations of multidimensional logistic IRT models can be used for estimating a multidimensional variant of the LDS model.
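The two parameterizations can again be verified to coincide numerically; the item parameters below are arbitrary illustrations:

```python
import math

def psi(x):
    return 1.0 / (1.0 + math.exp(-x))

def irf_m2pl(th1, th2, a1, a2, d):
    """Two-dimensional logistic IRF, Eq. (25)."""
    return 1.0 / (1.0 + math.exp(-a1 * th1 - a2 * th2 + d))

def irf_mlds(d1, d2, a1, a2, beta):
    """Two-dimensional LDS IRF, Eq. (26)."""
    term = (((1.0 - d1) / d1) ** a1) * (((1.0 - d2) / d2) ** a2) \
           * (beta / (1.0 - beta))
    return 1.0 / (1.0 + term)

# Check Eq. (25) == Eq. (26) under beta_i = Psi(d_i), alpha_id = a_id.
a1, a2, d = 1.2, 0.7, 0.4
beta = psi(d)
for th1 in [-1.5, 0.0, 2.0]:
    for th2 in [-0.5, 1.0]:
        p1 = irf_m2pl(th1, th2, a1, a2, d)
        p2 = irf_mlds(psi(th1), psi(th2), a1, a2, beta)
        assert abs(p1 - p2) < 1e-12
```

Since $(1 - \delta_d)/\delta_d = \exp(-\theta_d)$ and $\beta_i / (1 - \beta_i) = \exp(d_i)$, the denominator term in Equation (26) equals the exponential in Equation (25) exactly.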

4. Empirical Example: PISA 2006 Reading

4.1. Method

In order to illustrate the consequences of the choice of different metrics of the latent trait in multiple-group comparisons, we analyzed the data from the Programme for International Student Assessment (PISA) conducted in 2006 (PISA 2006; [30]). In this situation, groups constitute countries. We included 26 countries (see Table 2) that participated in 2006 and focused on the reading test (see [31,32] for other studies using this dataset).
Items for the reading domain were administered to only a subset of the participating students. We included only those students who received a test booklet with at least one reading item. This resulted in a total sample size of 110,236 students (ranging from 2010 to 12,142 students across countries). In total, 28 reading items nested within eight reading texts were used in PISA 2006. Six of the 28 items were polytomous and were dichotomously recoded, with only the highest category scored as correct.
In all analyses, student weights were taken into account. Within a country, student weights were normalized to a sum of 5000, so that all countries contributed equally to the analyses.
In the first step, the 2PL model was estimated based on the data comprising the students of all 26 countries. Student weights were taken into account, and a normal distribution was assumed for $\theta$ in the estimation. The obtained item parameters $\hat{a}_i$ and $\hat{b}_i$ were fixed in the second step when estimating the trait distribution in each country. More concretely, the 2PL model was fitted with the R [22] package sirt [23] using MML estimation. A discrete grid of $T = 121$ equidistant $\theta$ points ranging between $-6$ and $6$ was used for the numerical integration of the integrals involved in the log-likelihood function of the 2PL model. As in Section 3.2, log-linear smoothing up to four moments of the trait distribution [12] within a country was employed to allow non-normal distributions. Assume that the estimated parametric distribution for $\theta$ in country $g$ is $\pi_{gt} = P(\theta_t; \boldsymbol{\delta}_g)$ for grid values $\theta_t$ ($t = 1, \ldots, T$) and country-specific distribution parameters $\boldsymbol{\delta}_g$. Afterward, individual posterior distributions $h_p(\theta_t \mid \boldsymbol{x}_p)$ ($t = 1, \ldots, T$) were computed as
$$h_p(\theta_t \mid \boldsymbol{x}_p) = \frac{\prod_{i=1}^{I} P_i(\theta_t; \hat{a}_i, \hat{b}_i)^{x_{pi}} \left[ 1 - P_i(\theta_t; \hat{a}_i, \hat{b}_i) \right]^{1 - x_{pi}} \pi_{gt}}{\sum_{u=1}^{T} \prod_{i=1}^{I} P_i(\theta_u; \hat{a}_i, \hat{b}_i)^{x_{pi}} \left[ 1 - P_i(\theta_u; \hat{a}_i, \hat{b}_i) \right]^{1 - x_{pi}} \pi_{gu}}, \qquad (27)$$
where $P_i(\theta_t; \hat{a}_i, \hat{b}_i)$ is the IRF of item $i$ from the 2PL model using the estimated item parameters $\hat{a}_i$ and $\hat{b}_i$ from the total sample. By construction, it holds that $\sum_{t=1}^{T} h_p(\theta_t \mid \boldsymbol{x}_p) = 1$. For the $N_g$ persons in country $g$, the country mean $\hat{\mu}_{\theta,g}$ on the logit metric $\theta$ was estimated by
$$\hat{\mu}_{\theta,g} = \frac{1}{W} \sum_{p=1}^{N_g} w_p \sum_{t=1}^{T} \theta_t \, h_p(\theta_t \mid \boldsymbol{x}_p), \qquad (28)$$
where the person weights $w_p$ sum to $W = 5000$ within a country (i.e., $\sum_{p=1}^{N_g} w_p = W$). Country-specific standard deviations $\hat{\sigma}_{\theta,g}$ can be computed similarly:
$$\hat{\sigma}_{\theta,g} = \sqrt{ \frac{1}{W} \sum_{p=1}^{N_g} w_p \sum_{t=1}^{T} \theta_t^2 \, h_p(\theta_t \mid \boldsymbol{x}_p) - \hat{\mu}_{\theta,g}^2 }. \qquad (29)$$
Besides the logit metric $\theta$, we also investigated the metric $\delta$ based on the LDS model, the true score metric $\tau$ (see Equation (4)), and the rank score metric $\rho$ (see Equation (5)). All three alternative metrics are monotone transformations $m(\theta)$ of $\theta$. The country mean $\hat{\mu}_{m(\theta),g}$ on a transformed metric was calculated as
$$\hat{\mu}_{m(\theta),g} = \frac{1}{W} \sum_{p=1}^{N_g} w_p \sum_{t=1}^{T} m(\theta_t) \, h_p(\theta_t \mid \boldsymbol{x}_p). \qquad (30)$$
Using (30), the standard deviation of $m(\theta)$ can be computed similarly to (29). Furthermore, conditional standard errors for the four latent trait metrics were computed for the whole sample containing all students. The item information is obtained by using the second derivatives of the IRFs with respect to the metrics $\theta$, $\delta$, $\tau$, and $\rho$ (see Equation (17)).
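Equations (28) and (30) amount to a weighted average of per-person posterior means over the grid. A small self-contained sketch (the grid, posteriors, and weights below are toy values, not PISA data):

```python
def country_mean(theta_grid, posteriors, weights, m=lambda t: t):
    """Weighted country mean on the metric m(theta), Eq. (30).

    posteriors[p][t] = h_p(theta_t | x_p); weights[p] = w_p.
    With the identity transformation m, this reduces to Eq. (28).
    """
    W = sum(weights)
    total = 0.0
    for h_p, w_p in zip(posteriors, weights):
        eap = sum(m(t) * h for t, h in zip(theta_grid, h_p))
        total += w_p * eap
    return total / W

# Toy example: three grid points, two persons with degenerate posteriors.
grid = [-1.0, 0.0, 1.0]
post = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = [1.0, 1.0]
```

Passing a transformation such as the logistic function as `m` yields the country mean on the $\delta$ metric from the same posteriors.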

4.2. Results

In Table A1 of Appendix B, the estimated item parameters $\hat{a}_i$ and $\hat{b}_i$ from the 2PL model are shown. These item parameters were transformed into parameters of the equivalent LDS model (see columns $\hat{\alpha}_i$ and $\hat{\beta}_i$ in Table A1). The IRFs of seven selected items are displayed in Figure A1 in Appendix B for the four latent trait metrics $\theta$, $\delta$, $\tau$, and $\rho$. The IRFs for the bounded metrics $\delta$, $\tau$, and $\rho$ look very similar.
In Figure 4, the transformation functions $\delta = \delta(\theta)$, $\tau = \tau(\theta)$, and $\rho = \rho(\theta)$ are depicted. The latent D-score $\delta$ and the true score $\tau$ follow very similar transformation functions. The rank score $\rho$ differs from the former two in the tails of the $\theta$ distribution. Hence, it can be expected that $\delta$ and $\tau$ provide similar country rankings, while using $\rho$ might lead to slightly different country rankings.
In Figure 5, conditional standard errors are displayed. It can be seen that the standard error for $\theta$ has a U-shaped form, while those for the three other metrics are inverted-U-shaped. Interestingly, the standard errors $\mathrm{SE}(\delta)$ and $\mathrm{SE}(\tau)$ approach 0 for $\delta$ or $\tau$ near 0 or 1. This is not the case for the rank score metric $\rho$, for which the standard errors at $\rho = 0$ and $\rho = 1$ are larger than 0. Assume that Country C1 is low-performing (negative $\theta$ value) and Country C2 has average performance ($\theta$ value of about 0). Then, it can be the case that the latent trait is less precisely assessed for Country C1 than for Country C2 on the $\theta$ metric but more precisely assessed for Country C1 than for C2 on one of the three alternative metrics $\delta$, $\tau$, or $\rho$. Such statements rely on the somewhat arbitrary choice of the latent trait metric used to quantify differences between countries.
Table 2 contains detailed results of means, standard deviations, and country ranks based on means for the 26 countries. For the first six high-performing countries, the country ranks are the same for all four trait metrics. However, there are countries for which the ranks differ considerably. Relatively large deviations are observed for Belgium (BEL; maximum rank difference (maxrk) of 4), Estonia (EST; maxrk = 6), and Germany (DEU; maxrk = 7). The most crucial differences occur between the $\tau$ and the $\rho$ metric. For the three mentioned countries, the standard deviation of $\theta$ was relatively low or high compared to all other countries in the sample. This observation explains the differences among ranks because the tails of the $\theta$ distributions are weighted (i.e., transformed) differently for $\tau$ and $\rho$.
Overall, the Spearman rank correlations of country means ranged between 0.949 (between $\tau$ and $\rho$) and 0.992 (between $\theta$ and $\delta$). The average rank difference of country means across different metrics was 2.000 (see column “maxrk” in Table 2; $SD = 1.853$, $Min = 0$, $Max = 7$). The Spearman rank correlations of country standard deviations ranged between 0.973 (between $\tau$ and $\rho$) and 0.999 (between $\delta$ and $\rho$). The average rank difference of country standard deviations across different metrics was 1.000 ($SD = 1.301$, $Min = 0$, $Max = 5$). To sum up, the choice of the ability metric can matter for the reporting of country means for some countries.

5. Discussion

This article shows that the newly proposed LDS model of Dimitrov can be interpreted as a reparametrization of the well-studied 2PL model. Hence, all established statistical techniques for the 2PL model can be used in practical applications of the LDS model. It has been shown that the latent trait score $\delta$ from the LDS model is a monotone (logistic) transformation of the $\theta$ score from the 2PL model. Other psychometric areas such as differential item functioning, equating and linking, or test assembly need not be reinvented for the LDS model because the known techniques for the 2PL model can be used.
Although these findings might be interpreted as somewhat destructive for the research surrounding the LDS model, we do not think that the LDS model is without interest. We wanted to argue that the choice of the latent trait metric is arbitrary in IRT models, and both the $\theta$ and the $\delta$ metric can be useful in applications. The author of this paper tends to prefer bounded trait metrics in applications because the possibility of unbounded negative and positive trait values of $\theta$ seems more challenging to interpret [33]. However, we would prefer the true score metric $\tau$ or the rank score $\rho$ over $\delta$. The latent D-score $\delta$ can be interpreted as a particular true score in which only a reference item with $a_i = 1$ and $b_i = 0$ is used. We believe that using a well-chosen reference test with its item parameters provides a more interpretable latent trait metric in practical applications. The rank score $\rho$ has the advantage that it does not depend on item parameters. For example, in the PISA study, one fixes the $\theta$ metric in the starting study (e.g., in PISA 2000) to a mean of 500 and a standard deviation of 100. Using the rank metric $\rho$ would instead identify the metric by assuming a uniform distribution on $(0, 1)$. Both approaches might be legitimate in practical applications. Notably, linking and equating for bounded metrics are more difficult to conduct than for unbounded metrics. Hence, we would opt for using the unbounded metric from the 2PL model for operational linking but bounded metrics for reporting ability distributions.
In IRT models, items are typically treated as fixed. However, they can alternatively be interpreted as exchangeable. Item sampling models [34,35,36] make fewer assumptions in this respect and could be employed instead in assessment studies.
The LDS model has been motivated as an IRT analog of the so-called manifest D-scoring method [37]. This approach uses the scoring rule ∑_{i=1}^{I} (1 − π_i) X_i, where π_i = P(X_i = 1) is the probability of answering item i correctly. In manifest D-scoring, more difficult items therefore receive larger weights. This property might have appeal in some applications. However, we believe that this scoring rule does not adequately represent all items in a test in typical assessment studies and might lead to country comparisons with reduced validity.
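A minimal Python sketch of this scoring rule on a simulated 0/1 response matrix (all data hypothetical; π_i is estimated by the observed proportion correct):

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical 500 x 6 response matrix; columns range from easy to hard items
X = (rng.random((500, 6)) < [0.9, 0.8, 0.6, 0.5, 0.3, 0.1]).astype(int)

pi = X.mean(axis=0)        # empirical proportion correct per item
w = 1.0 - pi               # harder items (small pi) get larger weights
d_manifest = X @ w         # manifest D-score for each test taker

# the hardest item carries the largest weight in the score
assert w[np.argmin(pi)] == w.max()
```

The weighting makes a correct answer on a hard item count for more than one on an easy item, which is exactly the property discussed above.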

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The PISA 2006 data set is available from https://www.oecd.org/pisa/pisaproducts/database-pisa2006.htm (accessed on 2 May 2021).

Acknowledgments

I would like to thank Dimiter Dimitrov for helpful explanations about motivations of the D-scoring method.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
1PL   One-parameter logistic model
2PL   Two-parameter logistic model
IRF   Item response function
IRT   Item response theory
JML   Joint maximum likelihood
LDS   Latent D-scoring model
MML   Marginal maximum likelihood
PISA  Programme for International Student Assessment

Appendix A. R Code for the Numerical Illustration

In this Appendix, we provide the R code for the estimation of the numerical illustration in Section 3.2.
[The R code listing is rendered as an image in the published article.]

Appendix B. Item Parameter Estimates for the PISA 2006 Reading Data

In Table A1, estimated item parameters from the 2PL model are shown (columns “â_i” and “b̂_i”). In addition, transformed item parameters for the LDS model are displayed in the columns “α̂_i” and “β̂_i”.
In Figure A1, IRFs of the following seven selected items are shown: R067Q01, R104Q02, R104Q05, R111Q02B, R219Q01T, R219Q02, and R220Q01.
Table A1. Estimated item parameters for the PISA 2006 reading dataset.
Item        π_i    a_i (2PL)   b_i (2PL)   α_i (LDS)   β_i (LDS)
R055Q01    0.817     1.395      −1.486       1.395       0.185
R055Q02    0.480     1.379       0.043       1.379       0.511
R055Q03    0.584     1.620      −0.334       1.620       0.417
R055Q05    0.719     2.118      −0.778       2.118       0.315
R067Q01    0.892     1.227      −2.072       1.227       0.112
R067Q04    0.382     0.832       0.723       0.832       0.673
R067Q05    0.582     1.088      −0.307       1.088       0.424
R102Q04A   0.343     1.460       0.669       1.460       0.661
R102Q05    0.457     1.330       0.244       1.330       0.561
R102Q07    0.842     1.417      −1.493       1.417       0.183
R104Q01    0.816     1.627      −1.322       1.627       0.211
R104Q02    0.326     0.584       1.333       0.584       0.791
R104Q05    0.046     1.132       3.131       1.132       0.958
R111Q01    0.643     1.365      −0.604       1.365       0.353
R111Q02B   0.155     1.046       1.912       1.046       0.871
R111Q06B   0.351     1.588       0.542       1.588       0.632
R219Q01E   0.582     1.633      −0.250       1.633       0.438
R219Q01T   0.699     1.860      −0.664       1.860       0.340
R219Q02    0.792     1.534      −1.179       1.534       0.235
R220Q01    0.434     1.762       0.305       1.762       0.576
R220Q02B   0.621     1.520      −0.376       1.520       0.407
R220Q04    0.596     1.302      −0.312       1.302       0.423
R220Q05    0.823     1.977      −1.145       1.977       0.241
R220Q06    0.669     1.167      −0.675       1.167       0.337
R227Q01    0.521     0.778      −0.151       0.778       0.462
R227Q02T   0.337     0.993       0.793       0.993       0.688
R227Q03    0.546     1.664      −0.183       1.664       0.454
R227Q06    0.706     1.766      −0.777       1.766       0.315
Note. 2PL = two-parameter logistic model (parameters a_i, b_i); LDS = latent D-scoring model (parameters α_i, β_i); π_i = proportion correct.
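Consistent with the equivalence result, the two parameterizations in Table A1 appear to be linked by α_i = a_i and β_i = σ(b_i), with σ the standard logistic function. A quick Python check on three rows (values copied from the table, compared up to rounding):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# (item, a_i, b_i, alpha_i, beta_i) for three rows of Table A1
rows = [
    ("R055Q01", 1.395, -1.486, 1.395, 0.185),
    ("R104Q05", 1.132,  3.131, 1.132, 0.958),
    ("R220Q01", 1.762,  0.305, 1.762, 0.576),
]
for item, a, b, alpha, beta in rows:
    assert alpha == a                       # discriminations are identical
    assert abs(sigmoid(b) - beta) < 1e-3    # beta_i = sigmoid(b_i), up to rounding
```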
Figure A1. Item response functions of seven selected items from the PISA 2006 reading test.

References

  1. Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004.
  2. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
  3. Dimitrov, D.M.; Atanasov, D.V. Latent D-scoring modeling: Estimation of item and person parameters. Educ. Psychol. Meas. 2021, 81, 388–404.
  4. Ballou, D. Test scaling and value-added measurement. Educ. Financ. Policy 2009, 4, 351–383.
  5. Ho, A.D. A nonparametric framework for comparing trends and gaps across tests. J. Educ. Behav. Stat. 2009, 34, 201–228.
  6. Ramsay, J.O. A geometrical approach to item response theory. Behaviormetrika 1996, 23, 3–16.
  7. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume One: Models; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30.
  8. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
  9. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
  10. Culpepper, S.A. The prevalence and implications of slipping on low-stakes, large-scale assessments. J. Educ. Behav. Stat. 2017, 42, 706–725.
  11. Formann, A.K. Constrained latent class models: Theory and applications. Br. J. Math. Stat. Psychol. 1985, 38, 87–111.
  12. Xu, X.; von Davier, M. Fitting the Structured General Diagnostic Model to NAEP Data; Research Report No. RR-08-28; Educational Testing Service: Princeton, NJ, USA, 2008.
  13. Brennan, R.L. Misconceptions at the intersection of measurement theory and practice. Educ. Meas. Issues Pract. 1998, 17, 5–9.
  14. Dimitrov, D.M.; Atanasov, D.V. Testing for differential item functioning under the D-scoring method. Educ. Psychol. Meas. 2021.
  15. Dimitrov, D.M. Modeling of item response functions under the D-scoring method. Educ. Psychol. Meas. 2020, 80, 126–144.
  16. Dimitrov, D.M.; Atanasov, D.V. An approach to test equating under the latent D-scoring method. Meas. Interdiscip. Res. Perspect. 2021, in press.
  17. Han, K.C.T.; Dimitrov, D.M.; Al-Mashary, F. Developing multistage tests using D-scoring method. Educ. Psychol. Meas. 2019, 79, 988–1008.
  18. Hoff, P.D. A First Course in Bayesian Statistical Methods; Springer: New York, NY, USA, 2009.
  19. Aitchison, J.; Shen, S.M. Logistic-normal distributions: Some properties and uses. Biometrika 1980, 67, 261–272.
  20. DeCarlo, L.T. A signal detection model for multiple-choice exams. Appl. Psychol. Meas. 2021.
  21. Mead, R. A generalised logit-normal distribution. Biometrics 1965, 21, 721–732.
  22. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020. Available online: https://www.R-project.org/ (accessed on 24 August 2020).
  23. Robitzsch, A. sirt: Supplementary Item Response Theory Models; R Package Version 3.9-4; 2020. Available online: https://CRAN.R-project.org/package=sirt (accessed on 17 February 2020).
  24. Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory, Volume Two: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 217–236.
  25. Magis, D. A note on the equivalence between observed and expected information functions with polytomous IRT models. J. Educ. Behav. Stat. 2015, 40, 96–105.
  26. Brennan, R.L. Perspectives on the evolution and future of educational measurement. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 1–16.
  27. Yamamoto, K.; Shin, H.J.; Khorramdel, L. Multistage adaptive testing design in international large-scale assessments. Educ. Meas. Issues Pract. 2018, 37, 16–27.
  28. Bonifay, W. Multidimensional Item Response Theory; Sage: Thousand Oaks, CA, USA, 2019.
  29. Reckase, M.D. Multidimensional Item Response Theory Models; Springer: New York, NY, USA, 2009.
  30. OECD. PISA 2006 Technical Report; OECD: Paris, France, 2009.
  31. Oliveri, M.E.; von Davier, M. Analyzing invariance of item parameters used to estimate trends in international large-scale assessments. In Test Fairness in the New Generation of Large-Scale Assessment; Jiao, H., Lissitz, R.W., Eds.; Information Age Publishing: New York, NY, USA, 2017; pp. 121–146.
  32. Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 14.
  33. Ramsay, J.O.; Li, J.; Wiberg, M. Better rating scale scores with information-based psychometrics. Psych 2020, 2, 26.
  34. van der Linden, W.J. Binomial test models and item difficulty. Appl. Psychol. Meas. 1979, 3, 401–411.
  35. Wiley, J.A.; Martin, J.L.; Herschkorn, S.J.; Bond, J. A new extension of the binomial error model for responses to items of varying difficulty in educational testing and attitude surveys. PLoS ONE 2015, 10, e0141981.
  36. Hong, H.; Wang, C.; Lim, Y.S.; Douglas, J. Efficient models for cognitive diagnosis with continuous and mixed-type latent variables. Appl. Psychol. Meas. 2015, 39, 31–43.
  37. Dimitrov, D.M. An approach to scoring and equating tests with binary items: Piloting with large-scale assessments. Educ. Psychol. Meas. 2016, 76, 954–975.
Figure 1. Item response functions for seven items of the 2PL model.
Figure 2. Item response functions for seven items of the LDS model.
Figure 3. Conditional standard errors SE ( θ ) for the 2PL model (left) and SE ( δ ) for the LDS model (right).
Figure 4. Transformation functions δ = δ ( θ ) , τ = τ ( θ ) and ρ = ρ ( θ ) of latent ability θ for the PISA 2006 reading test.
Figure 5. Conditional standard error functions for the logit score θ (upper left), the delta score δ (upper right), the true score τ (lower left), and the rank score ρ (lower right) for the PISA 2006 reading test.
Table 1. Estimated item parameters for the numerical illustration assuming a logit-normal distribution and a beta distribution.
Item    α_i (LDS)   β_i (LDS)   a_i (2PL)   b_i (2PL)   α̂_i (LogitN)   β̂_i (LogitN)   α̂_i (Beta)   β̂_i (Beta)
1         0.50        0.10        0.50       −2.20         0.501          0.101          0.505         0.102
2         1.00        0.30        1.00       −0.85         1.002          0.301          0.988         0.296
3         0.50        0.50        0.50        0.00         0.501          0.500          0.513         0.502
4 †       1.00        0.50        1.00        0.00         1.000          0.500          1.000         0.500
5         1.50        0.50        1.50        0.00         1.503          0.500          1.472         0.499
6         1.50        0.70        1.50        0.85         1.504          0.700          1.531         0.701
7         1.00        0.90        1.00        2.20         1.002          0.900          1.071         0.892
Note. LDS = latent D-scoring model; 2PL = two-parameter logistic model; LogitN = logit-normal distribution with parameters (0.6, 1.2²); Beta = beta distribution with parameters (4, 2); † = Item 4 was used as a reference item in estimation by fixing a_4 = 1 and b_4 = 0 (i.e., α_4 = 1 and β_4 = 0.50).
Table 2. Country-level results for PISA 2006 reading for different ability metrics.
                                          M                            SD                    Rank M
cnt  Country            N        θ      δ      τ      ρ      θ      δ      τ      ρ      θ   δ   τ   ρ   maxrk
KOR  South Korea       2790    0.471  0.603  0.663  0.646  0.831  0.176  0.166  0.246    1   1   1   1     0
FIN  Finland           2536    0.327  0.576  0.646  0.614  0.570  0.130  0.124  0.193    2   2   2   2     0
CAN  Canada           12,142   0.234  0.553  0.616  0.577  0.823  0.179  0.176  0.255    3   3   3   3     0
IRL  Ireland           2468    0.170  0.538  0.599  0.554  0.911  0.193  0.192  0.272    4   4   4   4     0
AUS  Australia         7562    0.144  0.534  0.596  0.550  0.876  0.188  0.189  0.267    5   5   5   5     0
SWE  Sweden            2374    0.098  0.523  0.581  0.535  1.015  0.213  0.214  0.295    6   6   6   6     0
NLD  Netherlands       2666    0.084  0.521  0.577  0.531  1.051  0.219  0.221  0.302    7   7   7   8     1
POL  Poland            2968    0.065  0.515  0.573  0.521  0.981  0.209  0.211  0.293    8   9   8   9     1
BEL  Belgium           4840    0.031  0.517  0.567  0.532  1.278  0.250  0.257  0.333    9   8  11   7     4
JPN  Japan             3203    0.015  0.507  0.562  0.512  1.103  0.225  0.229  0.308   10  10  13  10     3
CHE  Switzerland       6578    0.015  0.506  0.569  0.511  0.852  0.186  0.190  0.265   11  11  10  11     1
DNK  Denmark           2431    0.008  0.502  0.566  0.503  0.828  0.181  0.183  0.260   12  12  12  14     2
EST  Estonia           2630    0.002  0.501  0.571  0.501  0.616  0.142  0.143  0.211   13  13   9  15     6
GBR  Great Britain     7061   −0.028  0.498  0.557  0.500  0.989  0.206  0.211  0.286   14  15  15  16     2
FRA  France            2524   −0.039  0.500  0.559  0.508  1.004  0.206  0.215  0.285   15  14  14  13     2
ISL  Iceland           2010   −0.055  0.489  0.556  0.486  0.741  0.165  0.170  0.239   16  18  16  18     2
AUT  Austria           2646   −0.057  0.493  0.547  0.495  1.125  0.230  0.237  0.314   17  17  17  17     0
DEU  Germany           2701   −0.098  0.497  0.539  0.510  1.485  0.280  0.290  0.364   18  16  19  12     7
HUN  Hungary           2399   −0.110  0.477  0.544  0.468  0.694  0.156  0.163  0.229   19  20  18  20     2
NOR  Norway            2504   −0.135  0.479  0.535  0.478  1.079  0.221  0.231  0.303   20  19  21  19     2
ESP  Spain            10,506  −0.168  0.460  0.535  0.440  0.432  0.102  0.108  0.155   21  23  20  23     3
LUX  Luxembourg        2443   −0.210  0.463  0.519  0.456  1.073  0.219  0.231  0.300   22  21  22  22     1
PRT  Portugal          2773   −0.219  0.455  0.517  0.439  0.863  0.185  0.195  0.262   23  24  23  24     1
CZE  Czech Republic    3246   −0.237  0.462  0.506  0.457  1.398  0.270  0.280  0.355   24  22  24  21     3
ITA  Italy            11,629  −0.288  0.443  0.502  0.426  0.966  0.199  0.212  0.276   25  25  25  25     0
GRC  Greece            2606   −0.385  0.419  0.479  0.388  0.868  0.183  0.196  0.256   26  26  26  26     0
Note. cnt = country label; N = sample size per country; M = mean; SD = standard deviation; Rank M = country rank with respect to mean M; θ = logit ability metric from two-parameter logistic (2PL) model; δ = metric of the latent D-scoring (LDS) model; τ = true score metric; ρ = rank score metric; maxrk = maximum rank difference among ability metrics θ , δ , τ , and ρ .
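The maxrk column can be reproduced as the range of a country's four ranks, i.e., the difference between its largest and smallest rank across the metrics. A quick Python check against four rows of Table 2:

```python
# country ranks under the four metrics (theta, delta, tau, rho), from Table 2
ranks = {
    "BEL": (9, 8, 11, 7),
    "DEU": (18, 16, 19, 12),
    "EST": (13, 13, 9, 15),
    "KOR": (1, 1, 1, 1),
}
expected_maxrk = {"BEL": 4, "DEU": 7, "EST": 6, "KOR": 0}

for cnt, r in ranks.items():
    # maxrk: largest rank discrepancy among the ability metrics
    assert max(r) - min(r) == expected_maxrk[cnt]
```

For Germany, for example, the country rank shifts between 12 and 19 depending on the chosen metric, which illustrates that the latent trait metric can matter for country comparisons.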