Challenges in replicating secondary analysis of electronic health records data with multiple computable phenotypes: A case study on methicillin-resistant Staphylococcus aureus bacteremia infections
Introduction
Reproducibility or replicability is considered “the coin of the scientific realm” [1]. The growth of electronic health record (EHR) systems and the increasing availability of EHR data repositories have enabled not only knowledge discovery for studying diseases and outcomes in clinical settings but also replication studies which provide external validation of such knowledge across different EHR systems [2].
Several steps are required to carry out a replication study with EHR data, including automation, harmonization, and standardization of study inclusion criteria to identify suitable study population cohorts [3], [4], [5]. Using diagnostic codes alone for cohort identification in EHRs has poor specificity and sensitivity [6], [7]. For this reason, computable phenotypes, i.e., algorithmic combinations of machine-readable representations of health concepts (clinical conditions, characteristics, or sets of clinical features), are necessary to improve the identification of health statuses without the need for interpretation by a clinician [8]. Computable phenotypes save time, reduce misclassification errors, and increase the portability of information between different health care systems, facilitating multisite cohort research [9].
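To make the notion of a computable phenotype concrete, the following is a minimal Python sketch of a rule-based cohort phenotype, loosely modeled on the MRSA bacteremia rule reported in the Results (a principal MRSA bacteremia code, or separate MRSA and bacteremia codes within 7 days). It is an illustration under stated assumptions, not the authors' implementation: the `Diagnosis` record, the function name, the stand-in codes B95.62 and R78.81, and the inclusive reading of "within 7 days" are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Principal MRSA bacteremia codes quoted in the Results (ICD-9-CM, ICD-10-CM).
PRINCIPAL_CODES = {"038.12", "A41.02"}
# ASSUMPTION: stand-in codes for "MRSA" and "bacteremia" recorded separately;
# the article excerpt does not list the codes that were actually used.
MRSA_CODES = {"B95.62"}
BACTEREMIA_CODES = {"R78.81"}
# ASSUMPTION: "within 7 days" is read as an inclusive window of at most 7 days.
WINDOW = timedelta(days=7)


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical flat diagnosis record: one coded diagnosis on one day."""
    patient_id: str
    code: str
    day: date


def mrsa_bacteremia_patients(diagnoses):
    """Return ids of patients matching either rule of the sketch phenotype."""
    mrsa_days, bact_days, cases = {}, {}, set()
    for dx in diagnoses:
        if dx.code in PRINCIPAL_CODES:
            # Rule 1: a single principal MRSA bacteremia code is sufficient.
            cases.add(dx.patient_id)
        elif dx.code in MRSA_CODES:
            mrsa_days.setdefault(dx.patient_id, []).append(dx.day)
        elif dx.code in BACTEREMIA_CODES:
            bact_days.setdefault(dx.patient_id, []).append(dx.day)
    # Rule 2: an MRSA code and a bacteremia code no more than WINDOW apart.
    for pid, m_days in mrsa_days.items():
        for b_day in bact_days.get(pid, []):
            if any(abs(b_day - m_day) <= WINDOW for m_day in m_days):
                cases.add(pid)
                break
    return cases


# Toy records: p1 matches rule 1, p2 matches rule 2, p3 falls outside the window.
records = [
    Diagnosis("p1", "A41.02", date(2015, 3, 1)),
    Diagnosis("p2", "B95.62", date(2015, 3, 1)),
    Diagnosis("p2", "R78.81", date(2015, 3, 5)),
    Diagnosis("p3", "B95.62", date(2015, 3, 1)),
    Diagnosis("p3", "R78.81", date(2015, 4, 1)),
]
print(sorted(mrsa_bacteremia_patients(records)))  # ['p1', 'p2']
```

Because the rule is expressed as code rather than as free-text inclusion criteria, the same definition could in principle be rerun against a different EHR extract, provided the local data are first mapped to the same record shape and code systems.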
Computable phenotypes are important not only for cohort identification in replication studies using EHR data but also for feature collation, because measurement error and bias in model covariates can also affect the reproducibility of results. Thus, without reproducible computable phenotypes, investigators conducting a replication study with EHRs may have to develop multiple new outcome and feature phenotypes specific to the EHR data they can access, which may introduce measurement error and, subsequently, additional bias.
In this work, we investigate the challenges of EHR-based replication studies of clinical risk prediction models. As a use case, we re-evaluate the study conducted by Yang et al. [10], who analyzed the 30-day mortality due to methicillin-resistant Staphylococcus aureus (MRSA) bacteremia infection with reduced vancomycin susceptibility (MRSA-RVS). S. aureus is one of the most prevalent pathogens in both health care facilities and the community and is a leading cause of bacteremia in the United States and worldwide [11]. More than 2.8 million antibiotic-resistant infections occur in the United States every year, and more than 35,000 people die from these infections [12]. Many of these infections progress to bacteremia, in which bacteria invade the bloodstream. In 2017, 119,247 cases of S. aureus bloodstream infections and 19,832 associated deaths occurred in the United States [13]. Most of the S. aureus bacteremia cases globally are due to MRSA [11]. Vancomycin is one of the few antibiotic therapies available to treat MRSA; however, reports of elevated vancomycin minimum inhibitory concentrations (MIC) are on the rise, jeopardizing the success of antibiotic therapy for these infections [14].
The study by Yang et al. was retrospective and included EHR data from individuals admitted to a large academic healthcare network in Taiwan, the Chang Gung Memorial Hospital, from 2009 to 2012 [10]. Covariates included both clinical features and microbiology laboratory results. The study is particularly challenging to replicate because many variables, e.g., the Pitt bacteremia score [15], required detailed clinical information about the patient and could not be readily recalculated using structured and coded EHR data (e.g., diagnoses and procedures) or matched to specific EHR ontology codes. For this replication study, we used data from the University of Florida (UF) Health system, a large academic health center in north Florida, whose EHR data warehouse includes rich, longitudinal patient data.
Data source and cohort identification
We used data from the UF Integrated Data Repository (IDR) which contains EHRs for patients of the UF Health system since 2011. UF Health provides care to more than 1 million patients with over 3 million inpatient and outpatient visits each year in Florida, USA, with hospitals in Gainesville (Alachua County), Jacksonville (Duval County), and satellite clinics in other Florida counties. The study was approved as exempt by the UF Institutional Review Board no. IRB201900652.
The target population
Results
Out of the 58,461 patients diagnosed with a bacterial infection with an available antibiogram test in the UF IDR data (Fig. 1), 479 patients had the principal ICD code (ICD-9-CM 038.12; ICD-10-CM A41.02) for MRSA bacteremia and 876 patients had both the MRSA and bacteremia ICD codes separately within 7 days. Among the 1,355 patients, 178 patients had reduced vancomycin susceptibility and 140 patients had at least one year of medical history from the first bacteremia diagnosis date. Therefore,
Discussion
The use of EHRs for clinical risk prediction appears promising because EHR data contain detailed patient information such as disease status, treatment, treatment adherence and outcomes, comorbidities, and concurrent treatments that are tracked longitudinally. The use of real-world data like EHRs provides important real-world evidence to inform therapeutic development and outcomes research [27], [28]. However, there exist several challenges in generating computable phenotypes – for defining
Conclusion
In this study, we used EHR data from the UF Health system in Florida, United States, to replicate a clinical prediction model for 30-day mortality due to MRSA bacteremia. We generated multiple proxy phenotypes to identify the study cohort, the outcomes, and the risk factors used for prediction. Although the two populations exhibited substantial heterogeneity in risk factor distribution and outcome event size, affecting the model replication and validation process, our results show consistency
Funding
This work was supported in part by: the National Institutes of Health [grant numbers R01AI141810, R01CA246418, R21CA253394, R21CA245858]; University of Florida’s (UF) Office of the Provost, UF Office of Research, UF Health, UF College of Medicine and UF Clinical and Translational Science Institute (CTSI)’s “Creating the Healthiest Generation” Moonshot initiative; UF Health Cancer Center's Cancer Informatics Shared Resource; and the University of Florida Informatics Institute Fellowship Program.
Authors' contributions
All authors made substantial contributions to the research and the writing of the manuscript. IJ, SR, JB, and MP were responsible for the overall design of this study. IJ and SR collected data and planned the statistical analysis. IJ, SR, and ZC wrote the first draft of the manuscript. All authors contributed to writing the final versions of the manuscript.
CRediT authorship contribution statement
Inyoung Jun: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Writing - original draft, Writing - review & editing. Shannan N. Rich: Conceptualization, Data curation, Investigation, Methodology, Writing - review & editing. Zhaoyi Chen: Investigation, Writing - review & editing. Jiang Bian: Conceptualization, Funding acquisition, Methodology, Writing - review & editing, Project administration. Mattia Prosperi: Conceptualization, Funding acquisition,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank Dr. Pierangelo Veltri in the Department of Surgical and Clinical Science at the University of Catanzaro, Italy for the useful insights provided when pre-reviewing the manuscript.
References (34)
- et al. Bacteremia due to Methicillin-Resistant Staphylococcus aureus. Infect. Dis. Clin. North Am. (2016)
- et al. Combination antibiotic therapy versus monotherapy for gram-negative bacteraemia: a commentary. Int. J. Antimicrob. Agents (1999)
- et al. CDC definitions for nosocomial infections, 1988. Am. J. Infect. Control (1988)
- Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation (2012)
- et al. Feasibility of Using Real-World Data to Replicate Clinical Trial Evidence. JAMA Netw. Open (2019 Oct 2)
- et al. A comparison of phenotype definitions for diabetes mellitus. J. Am. Med. Inform. Assoc. (2013)
- et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE (2015 Aug 24)
- et al. Automatic generation of case-detection algorithms to identify children with asthma from large electronic health record databases. Pharmacoepidemiol. Drug Saf. (2013)
- et al. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. (2015)
- et al. Desiderata for computable representations of electronic health records-driven phenotype algorithms. J. Am. Med. Inform. Assoc. (2015)