Abstract
The bifactor model (BM) and the testlet response model (TRM) are the multidimensional models most commonly applied to testlet-based tests. In practice, these models are typically estimated with different estimation methods (see, e.g., DeMars, 2006). A possible consequence is that previous findings about the implications of fitting a wrong model to the data may be confounded with the estimation procedures employed. With this in mind, the present study uses a single method (maximum marginal likelihood [MML] with dimension reduction) to compare unidimensional and multidimensional modeling strategies for testlet-based tests, and to assess the performance of several relative fit indices. Data were simulated under three different models, namely the BM, the TRM, and the unidimensional model. Recovery of item parameters, reliability estimates, and selection rates of the relative fit indices were documented. The results were essentially consistent with those obtained through different estimation methods (DeMars, 2006), indicating that the effect of the estimation method is negligible. Regarding the fit indices, the Akaike information criterion (AIC) showed the best selection rates, whereas the Bayesian information criterion (BIC) tended to select a model simpler than the true one. The work concludes with recommendations for practitioners and proposals for future research.
References
(2015). Modeling local item dependence due to common test format with a multidimensional Rasch model. International Journal of Testing, 15, 71–87. https://doi.org/10.1080/15305058.2014.941108
(2016). Modeling local item dependence in cloze and reading comprehension test items using testlet response theory. Psicologica: International Journal of Methodology and Experimental Psychology, 37, 85–104.
(1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Oxford, UK: Addison-Wesley.
(1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261–280. https://doi.org/10.1177/014662168801200305
(2003). Testfact (Version 4.0) [Computer software and manual]. Lincolnwood, IL: Scientific Software International.
(1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168. https://doi.org/10.1007/BF02294533
(2015). Scoring and estimating score precision using multidimensional IRT models. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 307–333). New York, NY: Routledge.
(2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75, 581–612. https://doi.org/10.1007/s11336-010-9178-0
(2011). Generalized full-information item bifactor analysis. Psychological Methods, 16, 221–248. https://doi.org/10.1037/a0023350
(2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. https://doi.org/10.18637/jss.v048.i06
(2012). Modeling general and specific variance in multifaceted constructs: A comparison of the bifactor model to other approaches. Journal of Personality, 80, 219–251. https://doi.org/10.1111/j.1467-6494.2011.00739.x
(2006). Application of the Bi-Factor Multidimensional Item Response Theory Model to testlet-based tests. Journal of Educational Measurement, 43, 145–168. https://doi.org/10.1111/j.1745-3984.2006.00010.x
(2012). Confirming testlet effects. Applied Psychological Measurement, 36, 104–121. https://doi.org/10.1177/0146621612437403
(2013). A tutorial on interpreting Bifactor Model scores. International Journal of Testing, 13, 354–378. https://doi.org/10.1080/15305058.2013.799067
(1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. https://doi.org/10.1007/BF02295430
(2013). Estimation methods for one-parameter testlet models. Journal of Educational Measurement, 50, 186–203. https://doi.org/10.1111/jedm.12010
(2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30, 3–21. https://doi.org/10.1177/0146621605275414
(2011). Performance of the S-χ2 statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986–1005. https://doi.org/10.1177/0013164410392031
(2015). Are fit indices biased in favor of bi-factor models in cognitive ability research? A comparison of fit in correlated factors, higher-order, and bi-factor models via Monte Carlo simulations. Journal of Intelligence, 3, 2–20. https://doi.org/10.3390/jintelligence3010002
(1992). A Generalized Partial Credit Model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. https://doi.org/10.1177/014662169201600206
(2014). Evaluating the impact of multidimensionality on unidimensional item response theory model parameters. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 13–40). New York, NY: Routledge.
(2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
(2009). Efficient full information maximum likelihood estimation for multidimensional IRT models. ETS Research Report Series, 2009, i–31. https://doi.org/10.1002/j.2333-8504.2009.tb02160.x
(2010). Formal relations and an empirical comparison among the Bi-Factor, the Testlet, and a Second-Order Multidimensional IRT Model. Journal of Educational Measurement, 47, 361–372. https://doi.org/10.1111/j.1745-3984.2010.00118.x
(2014). A third-order item response theory model for modeling the effects of domains and subdomains in large-scale educational assessment surveys. Journal of Educational and Behavioral Statistics, 39, 235–256. https://doi.org/10.3102/1076998614531045
(1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237–247. https://doi.org/10.1111/j.1745-3984.1991.tb00356.x
(2013). Using the testlet response model as a shortcut to multidimensional item response theory subscore computation. In R. E. Millsap, L. A. van der Ark, D. M. Bolt, & C. M. Woods (Eds.), New developments in quantitative psychology (pp. 29–40). New York, NY: Springer.
(1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247–260. https://doi.org/10.1111/j.1745-3984.1989.tb00331.x
(2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & G. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Amsterdam, The Netherlands: Springer.
(1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201. https://doi.org/10.1111/j.1745-3984.1987.tb00274.x
(2000). Using a new statistical model for Testlets to score TOEFL. Journal of Educational Measurement, 37, 203–220. https://doi.org/10.1111/j.1745-3984.2000.tb01083.x
(2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79. https://doi.org/10.1037/1082-989X.12.1.58
(1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145. https://doi.org/10.1177/014662168400800201
(1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. https://doi.org/10.1111/j.1745-3984.1993.tb00423.x