Building machine learning models without sharing patient data: A simulation-based analysis of distributed learning by ensembling. Journal of Biomedical Informatics



Journal of Biomedical Informatics (IF 4.0), Pub Date: 2020-04-23, DOI: 10.1016/j.jbi.2020.103424
Anup Tuladhar, Sascha Gill, Zahinoor Ismail, Nils D Forkert


The development of machine learning solutions in medicine is often hindered by difficulties associated with sharing patient data. Distributed learning aims to train machine learning models locally without requiring data sharing. However, the utility of distributed learning for rare diseases, with only a few training examples at each contributing local center, has not been investigated. The aim of this work was to simulate distributed learning models by ensembling with artificial neural networks (ANN), support vector machines (SVM), and random forests (RF) and evaluate them using four medical datasets. Distributed learning by ensembling locally trained agents improved performance compared to models trained using the data from a single institution, even in cases where only very few training examples are available per local center. Distributed learning improved when more locally trained models were added to the ensemble. Local class imbalance reduced distributed SVM performance but did not impact distributed RF and ANN classification. Our results suggest that distributed learning by ensembling can be used to train machine learning models without sharing patient data and is suitable for use with small datasets.
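The idea of ensembling locally trained models can be illustrated with a small simulation. The sketch below is not the authors' method: it uses synthetic two-class data instead of medical datasets, a nearest-centroid classifier as a simple stand-in for ANN, SVM, or RF, and majority voting as the aggregation rule, which the abstract does not specify. All function names and parameters (`make_data`, `train_local`, `predict_ensemble`, 15 centers with 6 samples each) are hypothetical choices for illustration.

```python
import random
from collections import Counter


def make_data(n, rng):
    """Synthetic two-class data (assumption): class 0 near (0, 0), class 1 near (2, 2)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        x = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((x, label))
    return data


def train_local(samples):
    """Fit a nearest-centroid model on one center's private data.

    Only the class centroids leave the center, never the raw samples.
    """
    sums, counts = {}, Counter()
    for x, y in samples:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict_local(model, x):
    """Predict the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def predict_ensemble(models, x):
    """Aggregate the local models by majority vote (assumed aggregation rule)."""
    votes = Counter(predict_local(m, x) for m in models)
    return votes.most_common(1)[0][0]


def accuracy(predict, test):
    """Fraction of test samples that the given predict function labels correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


rng = random.Random(0)
# 15 simulated centers with 6 training examples each (hypothetical sizes).
centers = [make_data(6, rng) for _ in range(15)]
test_set = make_data(500, rng)

# Each center trains on its own data only.
models = [train_local(c) for c in centers]

single_acc = sum(
    accuracy(lambda x, m=m: predict_local(m, x), test_set) for m in models
) / len(models)
ensemble_acc = accuracy(lambda x: predict_ensemble(models, x), test_set)

print(f"mean single-center accuracy: {single_acc:.3f}")
print(f"ensemble accuracy:           {ensemble_acc:.3f}")
```

Varying the number of centers or the samples per center in this toy setup gives a rough feel for the questions the study asks, although the numbers it produces say nothing about the paper's actual results.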


Updated: 2020-04-23
