Abstract
Differential item functioning (DIF) analysis is an important step in establishing the validity of measurements. Most traditional methods for DIF analysis use an item-by-item strategy via anchor items that are assumed DIF-free. If the anchor items are flawed, these methods yield misleading results due to biased scales. In this article, based on the fact that an item's relative change of difficulty difference (RCD) does not depend on the mean abilities of the individual groups, a new DIF detection method (RCD-DIF) is proposed that compares the observed differences against those from simulated data that are known to be DIF-free. The RCD-DIF method consists of a D-QQ (quantile-quantile) plot that permits the identification of internal reference points (similar to anchor items), an RCD-QQ plot that facilitates visual examination of DIF, and an RCD graphical test that synchronizes DIF analysis at the test level with that at the item level via confidence intervals on individual items. The RCD procedure visually reveals the overall pattern of DIF in the test and the size of DIF for each item, and it is expected to work properly even when the majority of the items possess DIF and the DIF pattern is unbalanced. Results of two simulation studies indicate that the RCD graphical test has Type I error rates comparable to those of existing methods but greater power.
Notes
The parenthesized subscript (j) indicates the jth smallest \(d_{j}\), following the standard notation for order statistics.
Note that \(\hat{d}_{(1)}\) does not necessarily correspond to the smallest \(b_{j}^{(2)}-b_{j}^{\left( 1 \right) }\) in the population, nor does \(\hat{d}_{(M)}\) necessarily correspond to the greatest \(b_{j}^{(2)}-b_{j}^{\left( 1 \right) }\) in the population. In addition to the true sizes of \(b_{j}^{(2)}-b_{j}^{\left( 1 \right) }\) in the population, sampling errors also contribute to the order of the estimated differences.
Alternatively, we can use \(\hat{b}_{j}^{{(1)}}\) or \(\hat{b}_{j}^{{(2)}}\) obtained in step 1 as the population values if group 1 or group 2, respectively, is the reference group. We found that these choices yield essentially the same results in both the Monte Carlo simulations (Sect. 4) and the illustrative data analysis (Sect. 3.2), simply because each choice closely matches the underlying population and satisfies the requirement of a DIF-free null hypothesis.
In our initial study, we tried 1000, 2000, and 5000 replications and obtained essentially the same results.
A condition is called balanced if both the number of items and the sizes of DIF that favor group 1 equal those that favor group 2, so that the effect of DIF on the whole test cancels out. Otherwise, the condition is called unbalanced.
Because the distribution of \(\hat{d}_{(j)}\) is approximated by Monte Carlo simulation, slight differences might be observed on a particular dataset between the choices of \(\hat{b}_{j}\), \(\hat{b}_{j}^{(1)}\) or \(\hat{b}_{j}^{(2)}\) as the population parameters or when using different seeds for generating the DIF-free data \(Y_{ij}^{(1)}\) and \(Y_{ij}^{(2)}\) in step 2 of Sect. 2.3. But such differences disappear when averaging across replications (see footnote 3).
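To make the footnote's point concrete, the following is a minimal, hypothetical sketch of how the null distribution of the sorted difficulty differences \(\hat{d}_{(j)}\) can be approximated by Monte Carlo simulation: repeatedly generate DIF-free differences, sort them within each replication, and collect quantiles of each order statistic. Everything here is an illustrative assumption, not the authors' implementation — in particular, a normal draw stands in for the IRT estimation error that a real implementation would obtain by fitting the model to simulated DIF-free response data.

```python
import random
random.seed(1)

M = 30    # number of items (as in the simulation studies)
R = 1000  # Monte Carlo replications (footnote 3 reports 1000-5000 behave alike)

# Under a DIF-free null, each estimated difficulty difference is pure
# sampling noise; we mimic estimation error with a normal draw (assumed
# scale 0.2, purely illustrative).
sorted_reps = []
for _ in range(R):
    diffs = [random.gauss(0.0, 0.2) for _ in range(M)]
    sorted_reps.append(sorted(diffs))  # order statistics d_(1) <= ... <= d_(M)

# Approximate a 95% envelope for each order statistic d_(j):
lo = [sorted(rep[j] for rep in sorted_reps)[int(0.025 * R)] for j in range(M)]
hi = [sorted(rep[j] for rep in sorted_reps)[int(0.975 * R)] for j in range(M)]

# An observed sorted difference falling outside [lo[j], hi[j]] would flag
# the corresponding item as a DIF candidate.
print(all(lo[j] <= hi[j] for j in range(M)))  # True
```

Because the envelope is simulation-based, different seeds produce slightly different bounds on any one dataset, which is exactly the seed sensitivity the footnote describes; it washes out when averaging across replications.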
We used the D-QQ plot to choose the reference points, examined the plots for the first 10 replications, and found that they always pointed us to items that are DIF-free in Study 1 and Study 2. We also explored the effect of different numbers of reference points (10, 4, and 2) in our initial analysis and found that they lead to essentially the same results. Thus, instead of examining the D-QQ plot for every replication (which is impractical), we simply chose the middle four items, those most likely to be DIF-free, as the reference points in simulation Studies 1 and 2.
With the DIPF procedure, the total number of comparisons is \(30\times 29/2=435\); for each item the number of comparisons is 29. The Type I error rates reported in Table 4 and the power values in Table 6 are for each item paired with the other 29 items, and the corresponding significance level is corrected for 29 comparisons. If the Type I error rates and power were calculated based on all 435 comparisons and corrected accordingly using Holm's method, they would be much smaller than those reported in Tables 4 and 6.
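The Holm (1979) correction mentioned in the footnote can be sketched in a few lines: sort the p-values, then test each against a step-down threshold \(\alpha/(m - k)\), stopping at the first failure. The p-values below are hypothetical, purely to illustrate how the 29- or 435-fold correction works.

```python
def holm_reject(p_values, alpha=0.05):
    """Return booleans: True where H0 is rejected by Holm's step-down method."""
    m = len(p_values)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank); rank 0 gives alpha / m.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

print(holm_reject([0.001, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

With m = 435 instead of m = 29, the smallest threshold drops from \(\alpha/29\) to \(\alpha/435\), which is why both Type I error rates and power shrink under the larger correction.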
Without the Holm adjustment, the test-level Type I error rates of the MH, LRT, and Wald methods exceed 60%, 60%, and 40%, respectively.
The 2PL model can be equivalently expressed via discrimination and difficulty parameters (a, b), with \(a_{j}^{\left( g \right) }=\alpha _{j}^{\left( g \right) }\) and \(b_{j}^{(g)}=-\beta _{j}^{\left( g \right) } \big / \alpha _{j}^{\left( g \right) }\). However, presenting the extension using the (a, b) parameterization involves more complicated equations.
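The equivalence of the two parameterizations is easy to verify numerically: with slope-intercept form \(P(\theta) = \mathrm{logistic}(\alpha \theta + \beta)\), the mapping \(a = \alpha\), \(b = -\beta/\alpha\) gives the same response probability as \(\mathrm{logistic}(a(\theta - b))\) at every \(\theta\). The parameter values below are illustrative only.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative slope-intercept parameters for one 2PL item.
alpha, beta = 1.5, -0.75
a, b = alpha, -beta / alpha  # the mapping given in the footnote

# Both forms give the same response probability at any theta,
# since a*(theta - b) = alpha*theta + beta.
theta = 0.3
p_slope_intercept = logistic(alpha * theta + beta)
p_irt = logistic(a * (theta - b))
print(round(p_slope_intercept, 6) == round(p_irt, 6))  # True
```

The algebra is one line: \(a(\theta - b) = \alpha\theta - \alpha(-\beta/\alpha) = \alpha\theta + \beta\), so the two curves coincide exactly.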
References
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67–91.
Barnard, G. A. (1963). Discussion on The spectral analysis of point processes (by M. S. Bartlett). Journal of the Royal Statistical Society B, 25, 294–296.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). Chichester: Wiley.
Bauer, D. J., Belzak, W. C. M., & Cole, V. T. (2019). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling: A Multidisciplinary Journal, 27(1), 43–55.
Bechger, T. M., & Maris, G. (2015). A statistical test for differential item pair functioning. Psychometrika, 80(2), 317–340.
Belzak, W. C. M., & Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods. https://doi.org/10.1037/met0000253.
Cai, L. (2017). flexMIRT version 3.51: Flexible multilevel and multidimensional item response theory analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group LLC.
Cai, L., du Toit, S. H. L., & Thissen, D. (2009). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Chicago: Scientific Software International.
Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12(3), 253–260.
Cao, M., Tay, L., & Liu, Y. (2017). A Monte Carlo study of an iterative Wald test procedure for DIF analysis. Educational and Psychological Measurement, 77(1), 104–118.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(2), 1–30.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differential item functioning test items. Educational Measurement: Issues and Practice, 17(1), 31–44.
Clauser, B., Mazor, K., & Hambleton, R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4), 269–279.
Da Costa, P. D., & Araújo, L. (2012). Differential item functioning (DIF): What functions differently for immigrant students in PISA 2009 reading items (Report EUR 25565 EN). Retrieved from https://core.ac.uk/display/38627538
Davey, A., & Savla, J. (2009). Estimating statistical power with incomplete data. Organizational Research Methods, 12(2), 320–346.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
DeMars, C. E. (2011). An analytic comparison of effect sizes for differential item functioning. Applied Measurement in Education, 24(3), 189–209.
Doebler, A. (2019). Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Applied Psychological Measurement, 43(4), 303–321.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Falk, C. F., & Cai, L. (2016). Maximum marginal likelihood estimation of a monotonic polynomial generalized partial credit model with applications to multiple group analysis. Psychometrika, 81(2), 434–460.
Fidalgo, A. M., Mellenbergh, G. J., & Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research Online, 5(3), 43–53.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278–295.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York, NY: Springer.
Frederickx, S., Tuerlinckx, F., De Boeck, P., & Magis, D. (2010). RIM: A random item mixture model to detect differential item functioning. Journal of Educational Measurement, 47(4), 432–457.
French, B. F., & Maller, S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67(3), 373–393.
Frick, H., Strobl, C., & Zeileis, A. (2015). Rasch mixture models for DIF detection: A comparison of old and new score specifications. Educational and Psychological Measurement, 75(2), 208–234.
Gnanadesikan, R. (1997). Methods for statistical data analysis of multivariate observations (2nd ed.). New York: Wiley.
González-Betanzos, F., & Abad, F. J. (2012). The effects of purification and the evaluation of differential item functioning with the likelihood ratio test. Methodology: European Journal of Research Methods for the Behavioral and Social Science, 8(4), 134–145.
Hall, P., & Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics, 47, 757–762.
Hancock, G. R., Stapleton, L. M., & Arnold-Berkovits, I. (2009). The tenuousness of invariance tests within multi-sample covariance and mean structure models. In T. Teo & M. S. Khine (Eds.), Structural equation modeling: Concepts and applications in educational research (pp. 137–174). Rotterdam: Sense Publishers.
Harman, H. H. (1976). Modern factor analysis (3rd ed.). Chicago: The University of Chicago Press.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Hope, A. C. A. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal Statistical Society: Series B, 30(3), 582–598.
Huang, X., Wilson, M., & Wang, L. (2016). Exploring plausible causes of differential item functioning in the PISA science assessment: Language, curriculum or culture. Educational Psychology, 36(2), 378–390.
Huang, P. H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical & Statistical Psychology, 71(6), 499–522.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Jalal, S., & Bentler, P. (2018). Using Monte Carlo normal distributions to evaluate structural equation models with nonnormal data. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 541–557.
Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70(351a), 631–639.
Kim, J., & Oshima, T. C. (2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73(3), 458–470.
Kopf, J., Zeileis, A., & Strobl, C. (2013). Anchor methods for DIF detection: A comparison of the iterative forward, backward, constant and all-other anchor class (Technical Report 141). Munich: Department of Statistics, LMU Munich.
Kopf, J., Zeileis, A., & Strobl, C. (2015a). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2), 83–103.
Kopf, J., Zeileis, A., & Strobl, C. (2015b). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1), 22–56.
Le, L. T. (2009). Investigating gender differential item functioning across countries and test languages for PISA science items. International Journal of Testing, 9(2), 122–133.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Magis, D., & De Boeck, P. (2012). A robust outlier approach to prevent type I error inflation in differential item functioning. Educational and Psychological Measurement, 72(2), 291–311.
Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff’s delta plot. Educational and Psychological Measurement, 73(2), 293–311.
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862.
Magis, D., Tuerlinckx, F., & De Boeck, P. (2015). Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135.
May, H. (2006). A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education. Journal of Educational and Behavioral Statistics, 31(1), 63–79.
Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16(4), 389–402.
Muthén, B. O. (1985). A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics, 10, 121–132.
Navas-Ara, M. J., & Gómez-Benito, J. (2002). Effects of ability scale purification on the identification of DIF. European Journal of Psychological Assessment, 18(1), 9–15.
Oshima, T. C., Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Windows user's manual: Differential functioning of items and tests. St. Paul, MN: Assessment Systems Corporation.
Özdemir, B. (2015). A comparison of IRT-based methods for examining differential item functioning in TIMSS 2011 mathematics subtest. Procedia-Social and Behavioral Sciences, 174, 2075–2083.
Price, E. A. (2014). Item discrimination, model-data fit, and Type I error rates in DIF detection using Lord's chi-square, the likelihood ratio test, and the Mantel-Haenszel procedure. Ohio University, ProQuest Dissertations Publishing.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105–116.
Roussos, L. A., Schnipke, D. L., & Pashley, P. J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24, 293–322.
Santoso, A. (2018). Equivalence testing for anchor selection in differential item functioning detection (Doctoral dissertation). Retrieved from https://curate.nd.edu/downloads/und:5712m61688h
Schauberger, G., & Mair, P. (2020). A regularization approach for the detection of differential item functioning in generalized partial credit models. Behavior Research Methods, 52(4), 279–294.
Schmetterer, L. (1974). Introduction to mathematical statistics (translated from German to English by Kenneth Wickwire). New York: Springer.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Shih, C. L., & Wang, W. C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33(3), 184–199.
Sinharay, S., Dorans, N. J., Grant, M. C., Blew, E. O., & Knorr, C. M. (2006). Using past data to enhance small-sample DIF estimation: A Bayesian approach (ETS RR-06-09). Princeton, NJ: Educational Testing Service.
Soares, T. M., Gonçalves, F. B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.
Tay, L., Huang, Q., & Vermunt, J. K. (2016). Item response theory with covariates (IRT-C): Assessing item recovery and differential item functioning for the three-parameter logistic model. Educational and Psychological Measurement, 76(1), 22–42.
Tay, L., Meade, A. W., & Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1), 3–46.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group-mean differences: The concept of item bias. Psychological Bulletin, 99(1), 118–128.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Toland, M. (2008). Determining the accuracy of item parameter standard error of estimates in BILOG-MG3.
Tutz, G., & Berger, M. (2016). Item-focused trees for the identification of items in differential item functioning. Psychometrika, 81(3), 727–750.
Tutz, G., & Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80(1), 21–43.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
Wang, W. C., & Su, Y. H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2), 113–144.
Wang, W. C., Shih, C. L., & Sun, G. W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 72(4), 687–708.
Wang, W. C., Shih, C. L., & Yang, C. C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5), 713–731.
Woods, C. M. (2009). Testing for differential item functioning with measures of partial association. Applied Psychological Measurement, 33(7), 538–554.
Woods, C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35(5), 339–361.
Woods, C. M., Cai, L., & Wang, M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3), 532–547.
Yuan, K.-H., & Chan, W. (2008). Structural equation modeling with near singular covariance matrices. Computational Statistics & Data Analysis, 52(10), 4842–4858.
Yuan, K.-H., Hayashi, K., & Bentler, P. M. (2007). Normal theory likelihood ratio statistic for mean and covariance structure analysis under alternative hypotheses. Journal of Multivariate Analysis, 98(6), 1262–1282.
Zhang, G. (2018). Testing process factor analysis models using the parametric bootstrap. Multivariate Behavioral Research, 53(2), 219–230.
Zieky, M. (1993). DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ: Erlbaum.
Zwick, R., & Thayer, D. T. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel DIF analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1), 57–76.
Zwick, R., Thayer, D. T., & Lewis, C. (2000). Using loss functions for DIF detection: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25(2), 225–247.
Acknowledgements
Funding was provided by National Natural Science Foundation of China (Grant Nos. 31971029, 32071091).
Author information
K.-H. Yuan: His research centers on developing better and more valid methods for analyzing messy data and non-standard samples in the social and behavioral sciences. Most of his work is on factor analysis, structural equation modeling, and multilevel modeling.
H. Liu: Her research interests are educational measurement and advanced statistical methods.
Y. Han: Her research interests are psychometrics and educational measurement.
Cite this article
Yuan, K.-H., Liu, H., & Han, Y. (2021). Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika, 86, 345–377. https://doi.org/10.1007/s11336-021-09746-5