Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
Journal of Big Data (IF 8.6), Pub Date: 2019-12-03, DOI: 10.1186/s40537-019-0261-9
Gautam Pal, Xianbin Hong, Zhuo Wang, Hongyi Wu, Gangmin Li, Katie Atkinson


This paper presents a lifelong learning framework that continually adapts to changing data patterns through incremental learning. In many big data systems, repeatedly retraining on high-dimensional data from scratch is computationally infeasible, because constant stream ingestion on top of a historical data pool increases the training time exponentially. The model therefore needs to retain past learning and update quickly and incrementally as new data arrive. In addition, current machine learning approaches make predictions without providing a comprehensive root cause analysis. To address these limitations, our framework builds on an ensemble of stream data and historical batch data to form an incremental lifelong machine learning (LML) model.

Case description

A cancer patient’s pathological tests, such as blood, DNA, urine, or tissue analysis, provide a unique signature based on the DNA combinations. Our analysis enables personalized, targeted medication aimed at achieving a therapeutic response. The model is evaluated on data from the National Cancer Institute’s Genomic Data Commons unified data repository. The aim is to prescribe personalized medicine based on the thousands of genotype and phenotype parameters available for each patient.

Discussion and evaluation

The model uses a dimension reduction method to reduce training time in an online sliding-window setting. We identify the Gleason score as a determining factor for the likelihood of cancer and substantiate this claim with the Lilliefors and Kolmogorov–Smirnov tests. We present clustering and Random Decision Forest results. The model’s prediction accuracy is compared with that of standard machine learning algorithms for numeric and categorical fields.
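As an illustration of the normality testing mentioned above, the following minimal sketch computes the Lilliefors statistic, that is, the Kolmogorov–Smirnov distance between a sample's empirical distribution and a normal distribution whose mean and standard deviation are estimated from the same sample. It is not the authors' code. The `scores` list is made-up illustrative data, not values from the Genomic Data Commons, and the critical value uses the large-sample approximation 0.886/sqrt(n) at the 5% level, which is assumed to be adequate only for n > 30.

```python
import math
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(sample):
    """KS distance between the empirical CDF and a normal fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def lilliefors_critical_5pct(n):
    """Approximate 5% critical value (assumption: only suitable for n > 30)."""
    return 0.886 / math.sqrt(n)


# Hypothetical, heavily right-skewed Gleason-like scores (illustrative only).
scores = [6] * 50 + [7] * 30 + [8] * 10 + [9] * 7 + [10] * 3

d_stat = lilliefors_statistic(scores)
rejects_normality = d_stat > lilliefors_critical_5pct(len(scores))
print(f"D = {d_stat:.3f}, reject normality at 5%: {rejects_normality}")
```

A plain Kolmogorov–Smirnov test compares against a fully specified distribution. The Lilliefors variant uses the same statistic but a stricter critical value, because the parameters are estimated from the data being tested.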


We propose an ensemble framework of stream and batch data for incremental lifelong learning. The framework first applies a streaming clustering technique and then a Random Decision Forest (RDF) regressor/classifier to isolate anomalous patient data, and it provides reasoning through root cause analysis based on feature correlations, with the aim of improving the overall survival rate. While stream clustering creates groups of patient profiles, RDF drills down into each group for comparison and reasoning, yielding actionable insights. The proposed MALA architecture retains past learned knowledge, transfers it to future learning, and becomes more knowledgeable with each iteration.
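The first stage of this pipeline, grouping records as they arrive and isolating those that fit no group, can be sketched as follows. This is a hedged illustration using sequential (online) k-means, which may differ from the stream clustering technique the authors use. The class name `OnlineKMeans`, the `anomaly_radius` parameter, and the two-dimensional toy records are all hypothetical. The second stage (the Random Decision Forest drill-down) is not shown.

```python
import math


class OnlineKMeans:
    """Sequential k-means: centroids are updated one record at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Each initial centroid is treated as one already-seen observation.
        self.counts = [1] * len(self.centroids)

    def nearest(self, x):
        """Return (index, distance) of the centroid closest to x."""
        dists = [math.dist(x, c) for c in self.centroids]
        idx = min(range(len(dists)), key=dists.__getitem__)
        return idx, dists[idx]

    def update(self, x, anomaly_radius):
        """Assign x to a cluster; return (cluster index, is_anomaly)."""
        idx, dist = self.nearest(x)
        if dist > anomaly_radius:
            # Too far from every group: flag it and leave the centroids alone.
            return idx, True
        self.counts[idx] += 1
        step = 1.0 / self.counts[idx]
        centroid = self.centroids[idx]
        for j, xj in enumerate(x):
            # Running-mean update: past records are never revisited.
            centroid[j] += step * (xj - centroid[j])
        return idx, False


model = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
stream = [[0.2, -0.1], [9.8, 10.3], [0.1, 0.3], [5.0, 30.0], [10.1, 9.9]]
flags = [model.update(x, anomaly_radius=3.0) for x in stream]
print(flags)
```

Because each centroid is a running mean, the model keeps what it learned from earlier records without storing them, which is the property the incremental setting relies on. In a fuller pipeline, the flagged records and the per-cluster groups would be passed to a supervised model for the feature-level comparison described above.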









