Splitting chemical structure data sets for federated privacy-preserving machine learning
Journal of Cheminformatics ( IF 8.6 ) Pub Date : 2021-12-07 , DOI: 10.1186/s13321-021-00576-2
Jaak Simm 1 , Lina Humbeck 2 , Adam Zalewski 3 , Noe Sturm 4 , Wouter Heyndrickx 5 , Yves Moreau 1 , Bernd Beck 2 , Ansgar Schuffenhauer 4

As machine learning methods are applied more widely in drug design and related fields, designing sound test sets becomes an increasingly prominent challenge. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant in its own right, but it becomes even more complex in federated machine learning, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria, each compared to random splitting: bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute cost, sphere exclusion clustering is very expensive in the federated privacy-preserving setting.
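
To illustrate how an LSH-style split can be computed without exchanging structures, here is a minimal sketch. It is not the paper's implementation. It assumes each compound is already encoded as a binary fingerprint, that all partners agree in advance on a list of fingerprint bit positions and on a shared secret string, and that the fold is derived from a keyed hash of the selected bits. The names `lsh_bucket`, `assign_fold`, `split_compounds`, `bit_positions` and `secret` are hypothetical.

```python
import hashlib
from typing import List, Sequence


def lsh_bucket(fingerprint: Sequence[int], bit_positions: Sequence[int]) -> str:
    """Build a bucket key from the fingerprint bits at the agreed positions.

    Compounds whose fingerprints agree on these bits share a bucket.
    """
    return "".join(str(fingerprint[i]) for i in bit_positions)


def assign_fold(bucket_key: str, n_folds: int, secret: str) -> int:
    """Map a bucket key to a fold index with a deterministic keyed hash.

    Assumption: `secret` is a string shared by all partners, so that every
    partner maps the same bucket to the same fold.
    """
    digest = hashlib.sha256((secret + bucket_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_compounds(
    fingerprints: Sequence[Sequence[int]],
    bit_positions: Sequence[int],
    n_folds: int = 5,
    secret: str = "shared-secret",
) -> List[int]:
    """Return one fold index per compound, computed locally by each partner."""
    return [
        assign_fold(lsh_bucket(fp, bit_positions), n_folds, secret)
        for fp in fingerprints
    ]


# Toy fingerprints (hypothetical data): the first two agree on bits 0, 2 and 6.
example_fingerprints = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1],
]
example_folds = split_compounds(example_fingerprints, bit_positions=[0, 2, 6])
```

Because the fold depends only on a compound's own fingerprint and the shared parameters, each partner can run this locally, and compounds that fall into the same bucket end up in the same fold regardless of which partner holds them.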

Updated: 2021-12-07