Facing small and biased data dilemma in drug discovery with federated learning, bioRxiv - Pharmacology and Toxicology

bioRxiv - Pharmacology and Toxicology. Pub Date: 2020-09-26. DOI: 10.1101/2020.03.19.998898
Zhaoping Xiong, Ziqiang Cheng, Chi Xu, Xinyuan Lin, Xiaohong Liu, Dingyan Wang, Xiaomin Luo, Yong Zhang, Nan Qiao, Mingyue Zheng, Hualiang Jiang

Artificial intelligence (AI) models usually require large amounts of high-quality training data, in striking contrast to the small and biased datasets that current drug discovery pipelines face. Federated learning has been proposed as a way to use distributed data from different sources without leaking sensitive information about those data. This emerging decentralized machine learning paradigm is expected to dramatically improve the success of AI-powered drug discovery. Here we simulate the federated learning process with 7 aqueous solubility datasets from different sources, which share overlapping molecules whose recorded values carry high or low biases. Beyond the benefit of gaining more data, we also demonstrate that federated training has a regularization effect that makes it superior to centralized training on pooled datasets with high biases. Further, customizing the federated model for each client can effectively help deal with the highly biased data in drug discovery and achieve better generalization performance. Our work not only demonstrates the application of federated learning to predicting drug-related properties but also highlights its promising role in addressing the small data and biased data dilemma in drug discovery.
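
To make the idea of simulated federated training concrete, here is a minimal sketch. It does not reproduce the authors' method, which the abstract does not specify. It assumes plain federated averaging (FedAvg) over a one-feature linear model, synthetic data in place of the solubility datasets, and a constant per-client offset as a stand-in for source-specific bias. All function names and parameters (`make_client`, `local_update`, `federated_average`) are hypothetical.

```python
import random

def make_client(n, offset, seed):
    """Synthetic client data: y = 2x + 1 + offset + noise.
    The offset is an assumed stand-in for a source-specific bias."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + offset + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys

def local_update(w, b, data, lr=0.1, epochs=5):
    """A few epochs of full-batch gradient descent on one client's data.
    Only the updated parameters leave the client, never the raw data."""
    xs, ys = data
    n = len(xs)
    for _ in range(epochs):
        grad_w = 2.0 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2.0 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def federated_average(clients, rounds=50):
    """FedAvg: each round, clients train locally from the shared model and
    the server averages their parameters, weighted by client dataset size."""
    w, b = 0.0, 0.0
    total = sum(len(xs) for xs, _ in clients)
    for _ in range(rounds):
        updates = [local_update(w, b, data) for data in clients]
        w = sum(len(d[0]) * uw for d, (uw, _) in zip(clients, updates)) / total
        b = sum(len(d[0]) * ub for d, (_, ub) in zip(clients, updates)) / total
    return w, b

# Three simulated clients with different sizes and biases (assumed values).
specs = [(20, -0.3), (40, 0.0), (30, 0.3)]
clients = [make_client(n, offset, seed) for seed, (n, offset) in enumerate(specs)]

# Shared global model learned without pooling the raw data.
w, b = federated_average(clients)

# One simple form of per-client customization: fine-tune the global model
# on each client's own data (an assumption, not the paper's procedure).
custom = [local_update(w, b, data, epochs=20) for data in clients]
```

In this toy setup the global intercept settles between the clients' biased intercepts, while the fine-tuned copies in `custom` move back toward each client's own offset. This mirrors, in a simplified way, the distinction the abstract draws between a shared federated model and client-specific customization.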

Updated: 2020-09-28
