Comparison of machine learning methods for estimating case fatality ratios: An Ebola outbreak simulation study. PLOS ONE


PLOS ONE (IF 2.9), Pub Date: 2021-09-15, DOI: 10.1371/journal.pone.0257005
Alpha Forna ₁, Ilaria Dorigatti ₂, Pierre Nouvellet ₂,₃, Christl A Donnelly ₂,₄


BACKGROUND: Machine learning (ML) algorithms are increasingly used in infectious disease epidemiology. Epidemiologists should understand how ML algorithms behave in the context of outbreak data, where missing data are almost ubiquitous.

METHODS: Using simulated data, we use an ML algorithmic framework to evaluate data imputation performance and the resulting case fatality ratio (CFR) estimates, focusing on the scale and type of data missingness: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).

RESULTS: Across ML methods, dataset sizes and proportions of training data used, the area under the receiver operating characteristic curve decreased by 7% (median; range: 1%-16%) when missingness was increased from 10% to 40%. The overall reduction in CFR bias for MAR across methods, proportions of missingness, outbreak sizes and proportions of training data was 0.5% (median; range: 0%-11%).

CONCLUSION: ML methods could reduce bias and increase precision in CFR estimates at low levels of missingness. However, no method is robust to high percentages of missingness. Thus, a data-centric approach is recommended in outbreak settings: patient survival outcome data should be prioritised for collection, and random-sample follow-ups should be implemented to ascertain missing outcomes.
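To make the distinction between missingness types concrete, the following is a minimal Python sketch, not the authors' simulation or ML imputation framework. It assumes a hypothetical true CFR of 0.6 and illustrative missingness probabilities (none of these values come from the paper), and it only compares a naive complete-case CFR under MCAR and MNAR. MAR, where missingness depends on observed covariates, would need covariates that this sketch does not model.

```python
import random

def simulate_outcomes(n, true_cfr, rng):
    """Return n hypothetical case outcomes: 1 = died, 0 = survived."""
    return [1 if rng.random() < true_cfr else 0 for _ in range(n)]

def apply_missingness(outcomes, p_missing_death, p_missing_survivor, rng):
    """Mask outcomes as None with outcome-specific probabilities.

    Equal probabilities give MCAR. Unequal probabilities make missingness
    depend on the unobserved outcome itself, which is MNAR.
    """
    observed = []
    for y in outcomes:
        p = p_missing_death if y == 1 else p_missing_survivor
        observed.append(None if rng.random() < p else y)
    return observed

def complete_case_cfr(observed):
    """CFR computed only from cases whose outcome is known."""
    known = [y for y in observed if y is not None]
    return sum(known) / len(known)

rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed values for illustration only.
outcomes = simulate_outcomes(20000, 0.6, rng)
full_cfr = complete_case_cfr(outcomes)

# MCAR: 40% of outcomes missing regardless of survival status.
mcar_cfr = complete_case_cfr(apply_missingness(outcomes, 0.4, 0.4, rng))

# MNAR: deaths are more likely to go unrecorded than survivals.
mnar_cfr = complete_case_cfr(apply_missingness(outcomes, 0.6, 0.2, rng))

print(f"full data CFR:          {full_cfr:.3f}")
print(f"complete-case CFR MCAR: {mcar_cfr:.3f}")
print(f"complete-case CFR MNAR: {mnar_cfr:.3f}")
```

Under these assumptions, the MCAR estimate stays close to the full-data CFR (it mainly loses precision), while the MNAR estimate is biased downward because deaths are under-recorded. This is the kind of bias that imputation or targeted follow-up of missing outcomes aims to reduce.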


Updated: 2021-09-15
