Challenges in replicating secondary analysis of electronic health records data with multiple computable phenotypes: A case study on methicillin-resistant Staphylococcus aureus bacteremia infections

https://doi.org/10.1016/j.ijmedinf.2021.104531

Highlights

  • Clinical management of methicillin-resistant Staphylococcus aureus bacteremia with reduced vancomycin susceptibility (MRSA-RVS) is challenging given the complex spectrum of risk factors and limited therapeutic options.

  • A number of MRSA-RVS outcome prediction models are available, but replicating them with electronic health records (EHR) to assess generalizability is problematic because computable phenotypes are lacking for many of the predictors, as well as for the study inclusion and outcome criteria.

  • Derivation and evaluation of computable phenotypes for MRSA-RVS based on prior literature using EHR.

  • Replication and external validation of an MRSA-RVS 30-day mortality prediction model using a large multi-county EHR database in the US.

Abstract

Background

Replication of prediction modeling using electronic health records (EHR) is challenging because phenotypes, including the study cohort, outcomes, and covariates, must be computed from the data. However, some phenotypes may not be easily replicated across EHR data sources for a variety of reasons, such as the lack of gold standard definitions and documentation variations across systems, which may lead to measurement error and potential bias. Methicillin-resistant Staphylococcus aureus (MRSA) infections are responsible for high mortality worldwide. With limited treatment options for the infection, the ability to predict MRSA outcomes is of interest. However, replicating these MRSA outcome prediction models using EHR data is problematic due to the lack of well-defined computable phenotypes for many of the predictors as well as for the study inclusion and outcome criteria.

Objective

In this study, we aimed to evaluate a prediction model for 30-day mortality after diagnosis of MRSA bacteremia with reduced vancomycin susceptibility (MRSA-RVS), considering multiple computable phenotypes derived from EHR data.

Methods

We used EHR data from a large academic health center in the United States to replicate the original study conducted in Taiwan. We derived multiple computable phenotypes of risk factors and predictors used in the original study, reported stratified descriptive statistics, and assessed the performance of the prediction model.

Results

In our replication study, it was possible to (re)compute most of the original variables. Nevertheless, for certain variables, especially composite clinical indices such as the Pitt bacteremia score, the computable phenotypes could only be approximated by proxy with structured EHR data items. Even the computable phenotype for the outcome variable was subject to variation depending on the admission/discharge windows. The replicated prediction model exhibited only mild discriminatory ability.
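To make the dependence on admission/discharge windows concrete, the following minimal sketch computes a 30-day mortality flag under two alternative anchor dates. The function names, the choice of anchors, and the inclusive window boundaries are illustrative assumptions; this excerpt does not specify the windows actually used in the study.

```python
from datetime import date, timedelta


def died_within_horizon(anchor, death_date, horizon_days=30):
    """Return True if death occurred on or within `horizon_days` after `anchor`.

    A missing death date (None) is treated as "no death observed".
    """
    if death_date is None:
        return False
    return anchor <= death_date <= anchor + timedelta(days=horizon_days)


def outcome_variants(admission, discharge, death_date):
    """Compute the outcome under two hypothetical anchor choices."""
    return {
        # Window starts at hospital admission (assumption).
        "from_admission": died_within_horizon(admission, death_date),
        # Window starts at hospital discharge (assumption).
        "from_discharge": died_within_horizon(discharge, death_date),
    }


# Hypothetical patient: death 40 days after admission, 21 days after discharge.
variants = outcome_variants(
    admission=date(2015, 1, 1),
    discharge=date(2015, 1, 20),
    death_date=date(2015, 2, 10),
)
```

For this hypothetical patient the two definitions disagree: the death falls outside the admission-anchored window but inside the discharge-anchored one, which is the kind of outcome variation described above.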

Conclusion

Despite the rich information in EHR data, replication of prediction models involving complex predictors is still challenging, often due to the limited availability of validated computable phenotypes. On the other hand, it is often possible to derive proxy computable phenotypes that can be further validated and calibrated.

Introduction

Reproducibility or replicability is considered “the coin of the scientific realm” [1]. The growth of electronic health record (EHR) systems and the increasing availability of EHR data repositories have enabled not only knowledge discovery for studying diseases and outcomes in clinical settings but also replication studies which provide external validation of such knowledge across different EHR systems [2].

Several steps are required to carry out a replication study with EHR data, including automation, harmonization, and standardization of study inclusion criteria to identify suitable study population cohorts [3], [4], [5]. Using diagnostic codes alone for cohort identification in EHR data yields poor specificity and sensitivity [6], [7]. This is why computable phenotypes, i.e., algorithmic combinations of machine-readable representations of health concepts (clinical conditions, characteristics, or sets of clinical features), are necessary to improve the identification of health statuses without the need for interpretation by a clinician [8]. Computable phenotypes save time, reduce misclassification errors, and increase the portability of information between different health care systems, facilitating multisite cohort research [9].

Computable phenotypes are important not only for cohort identification in replication studies using EHR data, but also for feature collation, because measurement error and bias in model covariates can also affect the reproducibility of results. Thus, without reproducible computable phenotypes, researchers replicating a study with EHRs may have to develop multiple new outcome and feature phenotypes specific to the EHR data they have access to, which may introduce measurement errors and, subsequently, additional bias.

In this work, we investigate the challenges of EHR-based replication studies of clinical risk prediction models. As a use case, we re-evaluate the study conducted by Yang et al. [10], who analyzed 30-day mortality due to methicillin-resistant Staphylococcus aureus (MRSA) bacteremia with reduced vancomycin susceptibility (MRSA-RVS). S. aureus is one of the most prevalent pathogens in both health care facilities and the community and is a leading cause of bacteremia in the United States and worldwide [11]. More than 2.8 million antibiotic-resistant infections occur in the United States every year, and more than 35,000 people die as a result [12]. Many of these infections progress to bacteremia, in which bacteria invade the bloodstream. In 2017, 119,247 cases of S. aureus bloodstream infections and 19,832 associated deaths occurred in the United States [13]. Most of the S. aureus bacteremia cases globally are due to MRSA [11]. Vancomycin is one of the few antibiotic therapies available to treat MRSA; however, reports of elevated vancomycin minimum inhibitory concentrations (MIC) are on the rise, jeopardizing the success of antibiotic therapy for these infections [14].

The study by Yang et al. was retrospective and included EHR data from individuals admitted to a large academic healthcare network in Taiwan, the Chang Gung Memorial Hospital, from 2009 to 2012 [10]. Covariates included both clinical features and microbiology laboratory results. The study is particularly challenging to replicate because many variables, e.g., the Pitt bacteremia score [15], required detailed clinical information about the patient and could not be readily recalculated using structured and coded EHR data (e.g., diagnoses and procedures) or matched to specific EHR ontology codes. For this replication study, we used data from the University of Florida (UF) Health system, a large academic health center in north Florida, whose EHR data warehouse includes rich, longitudinal patient data.

Data source and cohort identification

We used data from the UF Integrated Data Repository (IDR) which contains EHRs for patients of the UF Health system since 2011. UF Health provides care to more than 1 million patients with over 3 million inpatient and outpatient visits each year in Florida, USA, with hospitals in Gainesville (Alachua County), Jacksonville (Duval County), and satellite clinics in other Florida counties. The study was approved as exempt by the UF Institutional Review Board no. IRB201900652.

The target population

Results

Out of the 58,461 patients diagnosed with a bacterial infection with an available antibiogram test in the UF IDR data (Fig. 1), 479 patients had the principal ICD code (ICD-9-CM 038.12; ICD-10-CM A41.02) for MRSA bacteremia and 876 patients had both the MRSA and bacteremia ICD codes separately within 7 days. Among the 1,355 patients, 178 patients had reduced vancomycin susceptibility and 140 patients had at least one year of medical history from the first bacteremia diagnosis date. Therefore,
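The diagnosis-code step of this cohort definition can be sketched as a small computable phenotype. This is a minimal illustration, not the authors' implementation: the single MRSA bacteremia codes (038.12, A41.02) come from the text, whereas the separate MRSA and bacteremia code sets, the `(code, date)` input layout, and the use of the bacteremia date as the index date for paired codes are assumptions made for the example. The sketch also ignores whether a code was recorded as the principal diagnosis.

```python
from datetime import date, timedelta

# Codes named in the text for MRSA bacteremia recorded as a single code.
DIRECT_CODES = {"038.12", "A41.02"}

# Illustrative assumptions (not taken from the paper): codes that record
# MRSA and bacteremia separately.
MRSA_CODES = {"041.12", "B95.62"}
BACTEREMIA_CODES = {"790.7", "R78.81"}

WINDOW = timedelta(days=7)  # "within 7 days", as stated in the text


def first_mrsa_bacteremia_date(diagnoses):
    """Return the earliest qualifying index date for one patient, or None.

    `diagnoses` is a list of (icd_code, diagnosis_date) tuples, a
    hypothetical layout chosen for this sketch.
    """
    # Route 1: a single code that already denotes MRSA bacteremia.
    candidates = [d for code, d in diagnoses if code in DIRECT_CODES]

    # Route 2: separate MRSA and bacteremia codes at most 7 days apart.
    mrsa_dates = [d for code, d in diagnoses if code in MRSA_CODES]
    bact_dates = [d for code, d in diagnoses if code in BACTEREMIA_CODES]
    for m in mrsa_dates:
        for b in bact_dates:
            if abs(m - b) <= WINDOW:
                candidates.append(b)  # index on the bacteremia date (assumption)

    return min(candidates) if candidates else None
```

The later filters described above (reduced vancomycin susceptibility and at least one year of medical history) would be applied to the patients for whom this function returns a date; they are omitted here because the excerpt does not give their exact definitions.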

Discussion

The use of EHR data for clinical risk prediction appears promising because EHR data contain detailed patient information such as disease status, treatment, treatment adherence and outcomes, comorbidities, and concurrent treatments that are tracked longitudinally. The use of real-world data like EHRs provides important real-world evidence to inform therapeutic development and outcomes research [27], [28]. However, there exist several challenges in generating computable phenotypes – for defining

Conclusion

In this study, we used EHR data from the UF Health system in Florida, United States to replicate a clinical prediction model for 30-day mortality due to MRSA bacteremia. We generated multiple proxy phenotypes to identify the study cohort, the outcomes, and the risk factors used for prediction. Although the two populations exhibited substantial heterogeneity in risk factor distribution and outcome event size, affecting the model replication and validation process, our results show consistency

Funding

This work was supported in part by: the National Institutes of Health [grant numbers R01AI141810, R01CA246418, R21CA253394, R21CA245858]; University of Florida’s (UF) Office of the Provost, UF Office of Research, UF Health, UF College of Medicine and UF Clinical and Translational Science Institute (CTSI)’s “Creating the Healthiest Generation” Moonshot initiative; UF Health Cancer Center's Cancer Informatics Shared Resource; and the University of Florida Informatics Institute Fellowship Program.

Authors' contributions

All authors made substantial contributions to the research and the writing of the manuscript. IJ, SR, JB, and MP were responsible for the overall design of this study. IJ and SR collected data and planned the statistical analysis. IJ, SR, and ZC wrote the first draft of the manuscript. All authors contributed to writing the final version of the manuscript.

CRediT authorship contribution statement

Inyoung Jun: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Writing - original draft, Writing - review & editing. Shannan N. Rich: Conceptualization, Data curation, Investigation, Methodology, Writing - review & editing. Zhaoyi Chen: Investigation, Writing - review & editing. Jiang Bian: Conceptualization, Funding acquisition, Methodology, Writing - review & editing, Project administration. Mattia Prosperi: Conceptualization, Funding acquisition,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank Dr. Pierangelo Veltri in the Department of Surgical and Clinical Science at the University of Catanzaro, Italy for the useful insights provided when pre-reviewing the manuscript.

References (34)

  • M. Holubar et al. Bacteremia due to Methicillin-Resistant Staphylococcus aureus. Infect. Dis. Clin. North Am. (2016)

  • J.W. Chow et al. Combination antibiotic therapy versus monotherapy for gram-negative bacteraemia: a commentary. Int. J. Antimicrob. Agents (1999)

  • J.S. Garner et al. CDC definitions for nosocomial infections, 1988. Am. J. Infect. Control (1988)

  • J. Loscalzo. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation (2012)

  • V.L. Bartlett et al. Feasibility of Using Real-World Data to Replicate Clinical Trial Evidence. JAMA Netw. Open (2019 Oct 2)

  • R.L. Richesson et al. A comparison of phenotype definitions for diabetes mellitus. J. Am. Med. Inform. Assoc. (2013)

  • K.P. Liao et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE (2015 Aug 24)

  • Z. Afzal et al. Automatic generation of case-detection algorithms to identify children with asthma from large electronic health record databases. Pharmacoepidemiol. Drug Saf. (2013)

  • W.-Q. Wei et al. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. (2015)

  • H. Mo et al. Desiderata for computable representations of electronic health records-driven phenotype algorithms. J. Am. Med. Inform. Assoc. (2015)

  • R.L. Richesson et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J. Am. Med. Inform. Assoc. (2013)

  • D.W. Paul et al. Development and validation of an electronic medical record (EMR)-based computed phenotype of HIV-1 infection. J. Am. Med. Inform. Assoc. (2017)

  • C.-C. Yang et al. Risk factors of treatment failure and 30-day mortality in patients with bacteremia due to MRSA with reduced vancomycin susceptibility. Sci. Rep. (2018)

  • A. Hassoun et al. Incidence, prevalence, and management of MRSA bacteremia across patient populations—a review of recent developments in MRSA management and treatment. Crit. Care (2017)

  • Centers for Disease Control and Prevention (U.S.), National Center for Emerging Zoonotic and Infectious Diseases...

  • A.P. Kourtis et al. Vital Signs: Epidemiology and Recent Trends in Methicillin-Resistant and in Methicillin-Susceptible Staphylococcus aureus Bloodstream Infections — United States. MMWR Morb. Mortal. Wkly Rep. (2019)

  • K. Inagaki et al. Methicillin-susceptible and Methicillin-resistant Staphylococcus aureus Bacteremia: Nationwide Estimates of 30-Day Readmission, In-hospital Mortality, Length of Stay, and Cost in the United States. Clin. Infect. Dis. (2019)