Abstract
The psychometric process used to establish a relationship between the scores of two (or more) instruments is generically referred to as linking. When two instruments with the same content and statistical test specifications are linked, they are said to be equated. Linking and equating procedures have long been used for practical benefit in educational testing. In recent years, health outcome researchers have increasingly applied linking techniques to patient-reported outcome (PRO) data. These applications, however, serve distinct purposes and raise their own methodological questions. Purposes for linking health outcomes include the harmonization of data across studies or settings (enabling increased power in hypothesis testing), the aggregation of summed score data by means of score crosswalk tables, and score conversion in clinical settings where new instruments are introduced but an interpretable connection to historical data is needed. When two PRO instruments are linked, the assumptions for equating are typically not met, and the extent to which those assumptions are violated becomes a decision point for how (and whether) to proceed with linking. We demonstrate multiple linking procedures (equipercentile, unidimensional item response theory (IRT) calibration, and calibrated projection) with the Patient-Reported Outcomes Measurement Information System (PROMIS) Depression bank and the Patient Health Questionnaire-9 (PHQ-9). We validate this link across two samples and simulate different instrument correlation levels to provide guidance on which linking method is preferred. Finally, we discuss some remaining issues and directions for psychometric research in linking PRO instruments.
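As a rough illustration of the simplest of the linking approaches named in the abstract, the sketch below builds a summed-score crosswalk by equipercentile linking under a single-group design. It is a simplified sketch and not the procedure used in the study: it relies on mid-percentile ranks with linear interpolation and omits the presmoothing or postsmoothing usually applied in practice. The function names and the toy data are hypothetical, and the example assumes that every score in each instrument's range is observed at least once.

```python
import numpy as np


def percentile_ranks(scores, support):
    """Mid-percentile rank (0-100) of each possible score in `support`."""
    scores = np.asarray(scores)
    n = scores.size
    below = np.array([(scores < s).sum() for s in support])
    at = np.array([(scores == s).sum() for s in support])
    # Half of the respondents at a score are counted as falling below it.
    return 100.0 * (below + 0.5 * at) / n


def equipercentile_crosswalk(x_scores, y_scores, x_support, y_support):
    """Map each X score to the Y score with the same percentile rank.

    Assumes both score vectors come from the same respondents
    (single-group design) and that every score in each support occurs,
    so the Y percentile ranks are strictly increasing.
    """
    pr_x = percentile_ranks(x_scores, x_support)
    pr_y = percentile_ranks(y_scores, y_support)
    # Invert the Y percentile-rank function by linear interpolation.
    linked = np.interp(pr_x, pr_y, np.asarray(y_support, dtype=float))
    return {int(s): float(v) for s, v in zip(x_support, linked)}


# Toy data (hypothetical): instrument Y is a rescaled copy of instrument X.
x_support = np.arange(5)            # possible X summed scores 0..4
y_support = np.arange(10, 19, 2)    # possible Y summed scores 10, 12, ..., 18
x_scores = np.repeat(x_support, [1, 2, 4, 2, 1])
y_scores = 2 * x_scores + 10

crosswalk = equipercentile_crosswalk(x_scores, y_scores, x_support, y_support)
for x, y in crosswalk.items():
    print(f"X = {x} -> linked Y = {y:.1f}")
```

Because the toy Y scores are an exact rescaling of the X scores, the resulting crosswalk simply recovers that rescaling. With real PRO data the two score distributions differ in shape, and the linked values generally fall between observed Y scores.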
Acknowledgements
We wish to clarify that Seung W. Choi served as the senior author on this manuscript.
About this article
Cite this article
Schalet, B.D., Lim, S., Cella, D. et al. Linking Scores with Patient-Reported Health Outcome Instruments: A Validation Study and Comparison of Three Linking Methods. Psychometrika 86, 717–746 (2021). https://doi.org/10.1007/s11336-021-09776-z
DOI: https://doi.org/10.1007/s11336-021-09776-z