Augmented intelligence with natural language processing applied to electronic health records for identifying patients with non-alcoholic fatty liver disease at risk for disease progression

https://doi.org/10.1016/j.ijmedinf.2019.06.028

Highlights

  • NAFLD is documented poorly in the EMR. We assess how well we can identify it using an NLP approach versus ICD or text search.

  • NAFLD can progress to NASH and cirrhosis. We examine our ability to measure disease progression within the EMR with NLP.

  • We look at breakdowns in the knowledge chain between doctors, when NAFLD was identified but not mentioned in future notes.

  • We identify cases of these breakdowns where the patient developed NASH/cirrhosis without referencing prior NAFLD diagnosis.

Abstract

Objective

Electronic health record (EHR) systems contain structured data (such as diagnostic codes) and unstructured data (clinical documentation). Clinical insights can be derived from analyzing both. The use of natural language processing (NLP) algorithms to effectively analyze unstructured data has been well demonstrated. Here we examine the utility of NLP for identifying patients with non-alcoholic fatty liver disease (NAFLD), assessing patterns of disease progression, and identifying gaps in care related to breakdowns in communication among providers.

Materials and Methods

All clinical notes available on the 38,575 patients enrolled in the Mount Sinai BioMe cohort were loaded into the NLP system. We compared analysis of structured and unstructured EHR data using NLP, free-text search, and diagnostic codes with validation against expert adjudication. We then used the NLP findings to measure physician impression of progression from early-stage NAFLD to NASH or cirrhosis. Similarly, we used the same NLP findings to identify mentions of NAFLD in radiology reports that did not persist into clinical notes.
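As a rough illustration of the progression measurement described above, the sketch below computes, for each patient, the number of days between the first NAFLD mention and the first later mention of NASH or cirrhosis. The `Mention` record, the concept labels, and the sample dates are hypothetical and are not taken from the study's NLP system or data.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, median


@dataclass(frozen=True)
class Mention:
    patient_id: str
    concept: str      # hypothetical labels: "NAFLD", "NASH", "CIRRHOSIS"
    note_date: date


ADVANCED = {"NASH", "CIRRHOSIS"}


def progression_days(mentions):
    """Map patient_id -> days from first NAFLD mention to first advanced mention.

    Only patients whose first NAFLD mention precedes their first
    NASH/cirrhosis mention are included.
    """
    first_nafld = {}
    first_advanced = {}
    for m in mentions:
        if m.concept == "NAFLD":
            target = first_nafld
        elif m.concept in ADVANCED:
            target = first_advanced
        else:
            continue
        # Keep the earliest note date seen for this patient and concept group.
        if m.patient_id not in target or m.note_date < target[m.patient_id]:
            target[m.patient_id] = m.note_date
    return {
        pid: (first_advanced[pid] - start).days
        for pid, start in first_nafld.items()
        if pid in first_advanced and first_advanced[pid] > start
    }


# Made-up mentions for three patients.
mentions = [
    Mention("p1", "NAFLD", date(2010, 1, 1)),
    Mention("p1", "NASH", date(2011, 1, 1)),
    Mention("p2", "NAFLD", date(2012, 6, 1)),   # no advanced disease noted
    Mention("p3", "NASH", date(2009, 3, 1)),    # NASH noted before NAFLD
    Mention("p3", "NAFLD", date(2009, 5, 1)),
]
intervals = progression_days(mentions)
days = list(intervals.values())
print(intervals, mean(days), median(days))
```

Restricting the result to patients whose NAFLD mention comes first mirrors the abstract's focus on cases "where NAFLD was noted prior to NASH"; how the study itself handled ties or missing dates is not stated here.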

Results

Out of 38,575 patients, we identified 2,281 patients with NAFLD. From the remainder, 10,653 patients with similar data density were selected as a control group. NLP outperformed ICD and text search in both sensitivity (NLP: 0.93, ICD: 0.28, text search: 0.81) and F2 score (NLP: 0.92, ICD: 0.34, text search: 0.81). Of the 2,281 NAFLD patients, 673 (29.5%) were believed to have progressed to NASH or cirrhosis. Among the 176 patients in whom NAFLD was noted prior to NASH, the average progression time was 410 days. In 619 (27.1%) NAFLD patients, the condition was documented only in radiology notes and not acknowledged in other forms of clinical documentation. Of these, 170 (28.4%) were later identified as having likely developed NASH or cirrhosis after a median of 1,057.3 days.
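For readers unfamiliar with the metrics reported above, the following minimal sketch shows how sensitivity and the F2 score are conventionally computed from counts of true positives, false positives, and false negatives. The function names and the example counts are illustrative assumptions and do not reproduce the study's confusion matrices.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (recall): share of true cases that the method found."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Precision: share of flagged patients who are true cases."""
    return tp / (tp + fp)


def f_beta(precision_value: float, recall_value: float, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 weights recall more heavily than precision."""
    b2 = beta * beta
    denominator = b2 * precision_value + recall_value
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision_value * recall_value / denominator


# Hypothetical counts for one identification method.
tp, fp, fn = 90, 10, 10
recall_value = sensitivity(tp, fn)
precision_value = precision(tp, fp)
print(round(recall_value, 2), round(precision_value, 2),
      round(f_beta(precision_value, recall_value), 2))
```

Because beta is 2, a method with high sensitivity and moderate precision scores higher than one with the reverse profile, which suits a screening task where missed cases are the larger concern.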

Discussion

NLP-based approaches were more accurate at identifying NAFLD within the EHR than ICD/text search-based approaches. Suspected NAFLD on imaging is often not acknowledged in subsequent clinical documentation. Many such patients are later found to have more advanced liver disease. Analysis of information flows demonstrated loss of key information that could have been used to help prevent the progression of early NAFLD (NAFL) to NASH or cirrhosis.

Conclusion

For identification of NAFLD, NLP performed better than alternative selection modalities. It then facilitated analysis of knowledge flow between physicians and enabled the identification of breakdowns where key information was lost that could have slowed or prevented later disease progression.

Introduction

Artificial intelligence has shown increasing promise when applied to the identification and prediction of a wide range of medical outcomes. When applied to clinician workflow, it provides a form of augmented intelligence, aiding clinicians with decision support and error avoidance. These applications have predominantly been built on structured data alone because of its availability and ease of interpretation. However, unstructured data (such as dictated notes) contain critical clinical information and thereby offer the potential to greatly enhance clinical insights beyond those that can be derived from structured data alone. Several public and proprietary approaches have been taken to developing natural language processing (NLP) systems to make sense of unstructured clinical data [1,2]. NLP approaches have been used successfully for biomedical research, such as accurate phenotyping of complex diseases, and for clinical tasks, including identification of patients with NAFLD [9,10]. Here we examine the use of one SNOMED-based NLP tool for extracting patient features related to non-alcoholic fatty liver disease (NAFLD).

NAFLD represents a spectrum of liver diseases characterized histologically by macrovesicular fat and ranging in severity from non-alcoholic fatty liver (NAFL) to non-alcoholic steatohepatitis (NASH) [3,4]. A subset of patients with NAFLD progresses to cirrhosis and has an increased risk of hepatocellular carcinoma and liver-related mortality [3]. NAFLD is emerging as one of the most common causes of liver failure in the United States [5]. Multiple professional societies have published guidelines for the diagnosis and management of patients with NAFLD [6]. NAFLD is suspected in patients with metabolic syndrome, hepatomegaly, or mild elevations in aspartate aminotransferase (AST) and alanine aminotransferase (ALT) levels. However, normal levels of AST and ALT do not exclude the presence of NAFLD, and hepatomegaly is found in only approximately 20% of patients [7,8].

A key component in the diagnosis of NAFLD is evidence of hepatic steatosis on imaging or biopsy. Many patients have abdominal or chest imaging performed for unrelated disorders, which may incidentally reveal hepatic steatosis and allow for additional workup for NAFLD, including exclusion of other chronic liver diseases and alcohol consumption. The rapid and accurate identification of NAFLD by NLP from unstructured text such as radiology reports is one potential method to address the gap between incidental findings and patient care.

We first assess the accuracy of NLP against other, simpler approaches to derive insights into the treating clinician's understanding of the patient. We then determine the ability of NLP to track the diagnostic process and identify potential breakdowns. To do so, we first determine the proportion of patients with fatty liver documented in radiology reports in whom the presumptive diagnosis was also documented in a progress note from a healthcare provider. Second, to identify communication breakdowns at the point of care, we examine patients in whom NAFLD was identified in radiology notes but never referenced in a progress note. While we focused our approach on NAFLD, the methodology is broadly applicable to other disease processes.
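The communication-breakdown check described above can be pictured as a simple set operation over note types. The sketch below is a hedged illustration only: the tuple layout, the note-type labels ("radiology", "progress"), and the sample rows are hypothetical, and the per-note NAFLD flag is assumed to come from an upstream NLP step that is not shown.

```python
from collections import defaultdict


def radiology_only_patients(notes):
    """Return patients whose NAFLD mentions appear only in radiology notes.

    `notes` is an iterable of (patient_id, note_type, mentions_nafld) tuples,
    where mentions_nafld is a boolean produced by an upstream NLP step.
    """
    types_with_mention = defaultdict(set)
    for patient_id, note_type, mentions_nafld in notes:
        if mentions_nafld:
            types_with_mention[patient_id].add(note_type)
    return {
        patient_id
        for patient_id, note_types in types_with_mention.items()
        if note_types == {"radiology"}
    }


# Made-up notes for three patients.
sample_notes = [
    ("p1", "radiology", True),
    ("p1", "progress", False),   # finding never carried into a progress note
    ("p2", "radiology", True),
    ("p2", "progress", True),    # finding acknowledged by a provider
    ("p3", "progress", True),    # documented clinically, no radiology mention
]
print(radiology_only_patients(sample_notes))
```

Patients returned by this function correspond to the group in which an imaging finding was never picked up in later documentation; dividing their count by the number of patients with any radiology mention would give the proportion discussed above.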


Study setting and population

The study was conducted at the Icahn School of Medicine at Mount Sinai and used the data resources of the BioMe Biobank at the Charles Bronfman Institute of Personalized Medicine [11]. The BioMe Biobank is a prospective cohort study with over 40,000 ethnically diverse patients recruited from primary care and subspecialty clinics within the Mount Sinai Health System, used for a diverse range of associated studies [12,13,14,15]. BioMe has no inclusion or exclusion criteria beyond that

Results

We included 7,766,654 notes of 38,575 BioMe enrollees from July 8, 2002 through December 31, 2017. Parsing with NLP yielded 428,469,717 post-coordinated SNOMED expressions describing clinical concepts and related context. Fig. 3 shows the queries (or query clusters, in the case of alcohol users) for the identification of case and control patient cohorts.

Discussion

We assessed several informatics approaches to identify NAFLD within the EHR data compared to manual validation by clinicians in a large, multiethnic cohort. Our observations suggest that NLP approaches had the best overall performance compared to ICD and text search-based approaches, though there were numerous patients identified by ICD that were missed by NLP. In addition, the prevalence of NAFLD (~18% in those patients with imaging data) identified by the NLP-based approach was similar to

Conclusions

In summary, we demonstrate that NLP-based approaches have superior accuracy in identifying NAFLD within the EHR compared to ICD/text search-based approaches. There is a lack of acknowledgement in clinical documentation of NAFLD findings in radiology reports, and a significant number of these patients are later reported to have NASH. As medical practice becomes more specialized and patient care is provided by more physicians, the opportunities for information loss at patient handoffs increase. Our

Authors’ contributions

Conception and design: TTVV, LC, GNN

Analysis and interpretation: TTVV, GNN

Data collection: TTVV, LC, GNN

Writing the article: TTVV, LC, GNN

Critical revision of the article: PB, CKC, JLK, SBE, RD, RL

Final approval of the article: TTVV, GNN

Statistical analysis: TTVV, GNN

Obtained funding: JC, SGC, GNN

Overall responsibility: TTVV, LC, GNN

Funding

SGC is supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) (grant no. R01DK096549 to SGC). GNN is supported by a career development award from the National Institutes of Health (NIH) (K23DK107908) and is also supported by R01DK108803, U01HG007278, U01HG009610, and 1U01DK116100-01 grants. SGC and GNN are members and are supported in part by the Chronic Kidney Disease Biomarker Consortium (U01DK106962). SGC is also supported by R01DK106085, R01HL85757,

Declaration of competing interest

TTVV was part of launching Clinithink and retains a financial interest in the company. GNN is cofounder of Renalytix AI and owns equity in that company.

Acknowledgements

The BioMe healthcare delivery cohort at Mount Sinai was established and maintained with a generous gift from the Andrea and Charles Bronfman Philanthropies. This work was supported in part through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai.

References (37)

  • N. Chalasani et al.

    The diagnosis and management of non-alcoholic fatty liver disease: practice guideline by the American Association for the Study of Liver Diseases, American College of Gastroenterology, and the American Gastroenterological Association

    Am. J. Gastroenterol.

    (2012)
  • P. Angulo et al.

    Independent predictors of liver fibrosis in patients with nonalcoholic steatohepatitis

    Hepatology

    (1999)
  • K.P. Liao et al.

    Development of phenotype algorithms using electronic medical records and incorporating natural language processing

    BMJ

    (2015)
  • J.S. Redman et al.

    Accurate identification of fatty liver disease in data warehouse utilizing natural language processing

    Dig. Dis. Sci.

    (2017)
  • BioMe BioBank Program | Icahn School of Medicine. Icahn Sch. Med. Mt. Sinai....
  • G.M. Belbin et al.

    Genetic identification of a common collagen disease in Puerto Ricans via identity-by-descent mapping in a health system

    eLife

    (2017)
  • M.R. Smith et al.

    Loss-of-function of neuroplasticity-related genes confers risk for human neurodevelopmental disorders

    Pac. Symp. Biocomput.

    (2018)
  • A. Tin et al.

    Serum 6-Bromotryptophan levels identified as a risk factor for CKD progression

    J. Am. Soc. Nephrol.

    (2018)

1 Equal Contribution.