Mining incomplete clinical data for the early assessment of Kawasaki disease based on feature clustering and convolutional neural networks

https://doi.org/10.1016/j.artmed.2020.101859

Highlights

  • We investigated data-driven approaches for the early assessment of Kawasaki disease.

  • We developed a method to tackle the incompleteness problem of clinical data associated with group-based missing patterns.

  • We demonstrated the superior performance of the proposed method under incomplete data settings using a real-world dataset.

Abstract

Kawasaki disease (KD) is the leading cause of acquired heart disease in children. Prompt treatment can effectively lower the risk of severe complications, such as coronary aneurysms. However, accurately diagnosing KD at an early stage is difficult given its unknown pathogenesis and lack of pathognomonic features. In this study, we investigated data-driven approaches for early KD assessment by using a cohort of 10,367 patients extracted from electronic health records. The incompleteness of the clinical data presents group-based missing patterns associated with different clinical assessment measures. To address this problem, we developed a method that integrates feature clustering, which enables a matrix-based representation, with convolutional neural networks (CNNs) for feature extraction and fusion, so as to explicitly exploit the multi-source data structure. Combined with missing data imputation, the proposed method achieved superior accuracy (an AUC of 0.97) compared with several benchmark methods. The method shows potential to improve clinical data mining. Our study highlights the feasibility of using matrix-based feature representation and CNN-based feature extraction for incomplete clinical data mining to support medical decision-making.

Introduction

Kawasaki disease (KD) is an acute systemic vasculitis syndrome that occurs in infants and young children and has become the leading cause of acquired heart disease in children worldwide. Without early intervention, patients with KD are at high risk of developing coronary aneurysms, which can lead over time to progressive coronary artery obstruction with angina, myocardial infarction, or sudden death [1]. Timely treatment can effectively lower the risk of coronary aneurysms and other major complications. Therefore, diagnosing KD at an early stage is crucial to achieving favorable clinical outcomes. However, the etiology and pathogenesis of KD remain unknown, and no pathognomonic features or specific diagnostic tests are available to enable accurate and effective diagnosis. KD diagnosis is currently based on clinical signs and supportive non-specific laboratory testing. In clinical practice, KD diagnosis relies on clinical guidelines that define epidemiological criteria, including at least four of the five given principal clinical findings together with fever for 4 or more days [2]. Thus, the early assessment of KD is difficult. In addition, several patients with incomplete KD do not satisfy the criteria and even demonstrate atypical symptoms. The clinical symptoms caused by KD are easily confused with those of other febrile diseases, leading to a high rate of missed KD diagnoses. Consequently, investigating sensitive and specific methods that enable diagnosis at an early stage and permit timely treatment is necessary. To address these challenges and assist clinical decision-making, we investigated data-driven approaches for the accurate and early assessment of KD by using routinely collected structured electronic health records (EHRs) data and developed a novel diagnostic method.

EHRs provide valuable and detailed information on individual health conditions, treatment responses, and clinical outcomes. Diagnostic and predictive models for clinical events using EHRs data show promising results. Perotte et al. incorporated routinely collected EHRs data, including longitudinal laboratory tests and clinical documentation, to develop accurate risk prediction models for chronic kidney disease progression [3]. Teoh et al. utilized EHRs to predict whether a patient will suffer from stroke within 1 year based on examination results and medical diagnoses [4]. Jin et al. employed one-hot encoding and word embeddings to model the diagnosis events extracted from EHRs and predicted heart failure by using a long short-term memory network model [5]. Clinical innovations based on EHRs can overcome some of the limitations of randomized clinical trials, such as limited sample sizes, confounding factors, and unaccounted-for comorbidities [6]. EHRs can serve as a foundation for intelligent systems that discover new clinical evidence, facilitate the translation of knowledge into clinical practice, and improve the delivery of personalized healthcare.

Despite their great potential, EHRs pose various challenges related to completeness, accuracy, complexity, and bias. Data incompleteness is a particularly widespread problem for the secondary analysis of EHRs data, limiting the reliability and significance of the outcomes. Data imputation is typically used to deal with the missing data. The simple replacement of missing values or the discarding of incomplete observations may lead to misleading results. Most existing studies of data imputation have considered low missing rates (e.g., less than 30%) [7]. Missing mechanisms are generally categorized as missing completely at random, missing at random, and missing not at random. However, these mechanisms are not always suited to clinical data, which are incomplete for various reasons and often have high missing rates. Patients may move among institutions; thus, EHRs data from a single center may be incomplete [8]. Under specific health conditions, physicians usually order an optimal set of medical examinations and laboratory tests for a patient. Some medical examinations and laboratory tests may be unnecessary, or the missing information can be inferred on the basis of medical expert knowledge. In this case, missing data may also reveal useful information for assessing the health conditions of patients. In addition to performing missing data imputation, this study aimed to develop methods to improve the performance of clinical data mining under incomplete data settings by representing and fusing multiple feature groups. The motivation is summarized in two aspects. First, the incompleteness of clinical data associated with different clinical assessment measures can be characterized by feature clustering and matrix-based feature representation. Second, convolutional neural networks (CNNs) can capture and extract features from incomplete data and fuse feature groups for effective classification.
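The observation that missingness itself may carry information can be illustrated with a minimal sketch. It assumes simple mean imputation, which is not necessarily the imputation method used in the study, and keeps the observed/missing indicator alongside the filled values so that a downstream model can still see which entries were imputed. The function name `impute_with_indicator` and the sample values are hypothetical.

```python
from statistics import mean


def impute_with_indicator(column):
    """Mean-impute one feature column and keep its observed/missing indicator.

    `column` holds one clinical feature across patients; None marks a missing
    value. Returns (imputed values, indicator, missing rate).
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)  # simple replacement; an assumption, not the study's method
    imputed = [fill if v is None else v for v in column]
    indicator = [0 if v is None else 1 for v in column]  # 1 = observed, 0 = missing
    missing_rate = indicator.count(0) / len(indicator)
    return imputed, indicator, missing_rate


# Hypothetical laboratory values for five patients (not from the study cohort).
lab_value = [4.0, None, 6.0, None, None]
imputed, indicator, missing_rate = impute_with_indicator(lab_value)
print(imputed, indicator, missing_rate)
```

With three of five values missing, the imputed column is dominated by the fill value, which is why the indicator is worth keeping as a separate signal.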

Fig. 1 illustrates the characteristics of the collected clinical data. M is the missing data indicator matrix, where M_ij = 1 if the corresponding entry is observed and M_ij = 0 if it is missing. The x-axis represents patients, whereas the y-axis represents clinical features. Observed (1) and missing (0) elements are shown in different colors. Through hierarchical clustering [9], the rows (features) of the matrix can be clustered into four groups (labelled 1–4 on the right side of Fig. 1) that share similar missing patterns. Thus, incomplete clinical data can be characterized by an unsupervised clustering method. The regrouped indicator matrix demonstrates block-wise missing patterns because a missing medical examination or laboratory test can result in the absence of a whole group of clinical features for an individual. Meanwhile, the missing data of each feature demonstrate group-based patterns across patient subgroups.
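The grouping of features by shared missing patterns can be sketched as follows. This is a minimal illustration in plain Python, assuming Hamming distance between indicator rows and average linkage; the excerpt does not state which distance or linkage the authors used. The toy data, the function names, and the choice of three groups are hypothetical.

```python
from itertools import combinations


def indicator_matrix(data):
    """Return M with M[i][j] = 1 if data[i][j] is observed, 0 if missing (None)."""
    return [[0 if v is None else 1 for v in row] for row in data]


def hamming(u, v):
    """Fraction of patients for which two features disagree on observed/missing."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def cluster_features(M, n_groups):
    """Average-linkage agglomerative clustering of the rows (features) of M."""
    clusters = [[i] for i in range(len(M))]
    while len(clusters) > n_groups:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between the members of the two clusters.
            d = sum(hamming(M[i], M[j]) for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(clusters)


# Rows are features, columns are patients; None marks a missing entry.
data = [
    [37.9, 38.4, 39.1, 38.8, 38.2, 39.0],  # feature 0: always recorded
    [12.1, None, 15.3, None, 11.0, None],  # feature 1: hypothetical test A
    [210, None, 340, None, 275, None],     # feature 2: hypothetical test A
    [None, None, 0.8, 1.1, None, None],    # feature 3: hypothetical test B
    [None, None, 42.0, 39.5, None, None],  # feature 4: hypothetical test B
]
M = indicator_matrix(data)
groups = cluster_features(M, n_groups=3)
print(groups)
```

Features that come from the same examination are missing for the same patients, so their indicator rows are identical and they merge first, which reproduces the block-wise pattern described above on a small scale.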

Given this observation, the clinical features of patients were divided into groups on the basis of the shared missing patterns associated with different data sources of clinical assessment measures. Without loss of generality, we can characterize the feature groups by using feature clustering, as shown in Fig. 1. To further utilize the incomplete clinical data, we proposed a two-stage method to improve the performance of the automated diagnosis of KD using structured EHRs data. In the first stage, clinical features are clustered using the corresponding indicator vectors so that features are divided into multiple groups. Each patient can then be described with multiple distinct feature sets represented in the form of a matrix. The integration and fusion of multiple feature sets can jointly optimize the learning methods and improve the generalization performance [10]. Even when natural feature splitting is lacking for single-view data, the fusion of sampling with replication for a single source system can also provide more accurate observations [11,12]. The proposed method explicitly rearranges features into groups sharing similar missing patterns associated with patient subgroups. Each feature group can be used to create a single view of a patient. The fusion of multi-source raw data can provide more informative and accurate information about patients. In the second stage, CNNs are used to extract and fuse features learned from different feature groups via convolution and pooling layers; CNNs can capture sparse feature interactions by using kernels smaller than the input matrix [13]. We retrospectively evaluated KD diagnosis performance by using a real-world dataset extracted from EHRs. Compared with other representative machine learning methods, the proposed method demonstrated consistently superior performance in exploring incomplete clinical data for the early assessment of KD.
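The two stages can be sketched on a single patient as follows. This is an illustration under stated assumptions, not the authors' architecture, which this excerpt does not describe: each feature group becomes one matrix row, missing values and padding are filled with zero, and one fixed 2x2 kernel stands in for the learned kernels of a CNN. All names (`patient_matrix`, `conv2d_valid`, `global_max_pool`) and values are hypothetical.

```python
def patient_matrix(features, groups, fill=0.0):
    """Arrange one patient's feature vector into a groups-by-width matrix.

    Each row holds one feature group. Missing values (None) and the padding
    added to shorter groups are both replaced by `fill` (an assumption).
    """
    width = max(len(g) for g in groups)
    matrix = []
    for g in groups:
        row = [fill if features[i] is None else float(features[i]) for i in g]
        matrix.append(row + [fill] * (width - len(row)))
    return matrix


def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` without padding (cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(x) - kh + 1):
        out_row = []
        for c in range(len(x[0]) - kw + 1):
            out_row.append(sum(
                kernel[i][j] * x[r + i][c + j]
                for i in range(kh) for j in range(kw)
            ))
        out.append(out_row)
    return out


def global_max_pool(feature_map):
    """Reduce a feature map to its largest activation."""
    return max(max(row) for row in feature_map)


# Hypothetical feature groups (indices into the feature vector) and one patient.
groups = [[0, 1, 2], [3, 4], [5, 6, 7]]
features = [1.0, 2.0, 3.0, None, 4.0, 5.0, None, 6.0]

matrix = patient_matrix(features, groups)
kernel = [[1, 0], [0, 1]]  # fixed here; a CNN would learn these weights
feature_map = conv2d_valid(matrix, kernel)
pooled = global_max_pool(feature_map)
print(matrix, feature_map, pooled)
```

Because the kernel is smaller than the input matrix, each output value depends only on a few neighboring entries drawn from adjacent feature groups, which is the sparse-interaction property the paragraph refers to.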

The remainder of this paper is organized as follows. In the next section, we briefly introduce related works for EHRs-based data-driven studies and the applications of neural networks in healthcare. In Section 3, we present a two-stage method integrating feature clustering and CNN for data mining, which particularly addresses multi-source and incomplete clinical data. In Section 4, a real-world dataset is collected to extensively evaluate the performance of several benchmark methods in KD diagnosis. The last section provides our conclusions.

Section snippets

Related work

Data-driven approaches have attracted increasing interest in utilizing EHRs to investigate disease comorbidities, risk assessment, drug interactions, and clinical outcomes [14]. Ling et al. developed a diagnostic algorithm to distinguish KD from other febrile illnesses [15]. They improved KD diagnosis by integrating molecular findings, which are not widely available in regular clinical practice. Doan et al. built a natural language processing (NLP) tool to identify patients with KD by using

Methods

In this section, we introduce the proposed two-stage method to incorporate structured EHRs data for the early assessment of KD. Most existing methods represent the features of each patient in the form of vectors, providing a single view of a patient by concatenating features from different sources. Features are clustered using corresponding indicator vectors in the first stage to further exploit the intercorrelations of multisource data with missing entries. Through unsupervised clustering

Experiments

To retrospectively evaluate the data-driven approaches for KD diagnosis at its early stage, we identified a cohort of 10,367 patients over a 10-year period between October 2007 and December 2017 from the Children’s Hospital of Chongqing Medical University, a major children’s hospital in China. A total of 5642 patients were diagnosed with KD. The applied criteria for the diagnosis of KD are based on the guidelines of Kawasaki Disease Version 5 [33], which includes at least 5 days of fever

Conclusion

We investigated data-driven approaches for the early assessment of KD by using structured EHRs and presented a two-stage method to exploit incomplete and multisource clinical data. Machine learning methods enable the integration of complex, high-dimensional, and heterogeneous features rather than considering limited high-risk factors to achieve better performance for medical data mining tasks. The proposed method addressed the incompleteness of clinical data by perceiving the multisource

Declaration of Competing Interest

None.

Acknowledgements

This work was supported by the Chongqing Science and Technology Bureau under Grant cstc2019jcyj-bshX0010.

References (35)

  • A.N. Richter et al. A review of statistical and machine learning methods for modeling cancer risk using structured clinical data. Artif Intell Med (2018)
  • L. Yuan et al. Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. NeuroImage (2012)
  • J.T. Willerson et al. Coronary artery disease (2015)
  • J.W. Newburger. Diagnosis, treatment, and long-term management of Kawasaki disease: a statement for health professionals from the Committee on Rheumatic Fever, Endocarditis and Kawasaki Disease, Council on Cardiovascular Disease in the Young, American Heart Association. Circulation (2004)
  • A. Perotte et al. Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. J Am Med Inform Assoc (2015)
  • D. Teoh. Towards stroke prediction using electronic health records. BMC Med Inform Decis Mak (2018)
  • B. Jin et al. Predicting the risk of heart failure with EHR sequential data modeling. IEEE Access (2018)
  • P. Yadav et al. Mining electronic health records (EHRs): a survey. ACM Computing Surveys (CSUR) (2018)
  • W.-C. Lin et al. Missing value imputation: a review and analysis of the literature (2006–2017). Artif Intell Rev (2019)
  • G. Hripcsak et al. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc (2012)
  • D. Müllner. Modern hierarchical, agglomerative clustering algorithms (2011)
  • S. Sun. A survey of multi-view machine learning. Neural Comput Appl (2013)
  • Z. Jing et al. Multi-view learning overview: recent progress and new challenges. Inf Fusion (2017)
  • T. Meng et al. A survey on machine learning for data fusion. Inf Fusion (2019)
  • I. Goodfellow et al. Deep Learning (2016)
  • X.B. Ling. A diagnostic algorithm combining clinical and molecular data distinguishes Kawasaki disease from other febrile illnesses. BMC Med (2011)
  • S. Doan. Building a natural language processing tool to identify patients with high clinical suspicion for Kawasaki disease from emergency department notes. Acad Emerg Med (2016)

This article belongs to the special issue: Medical Analytics for Healthcare Intelligence.