Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization
Journal of Cheminformatics (IF 8.6), Pub Date: 2021-10-02, DOI: 10.1186/s13321-021-00554-8
Jules Leguy 1, Marta Glavatskikh 1,2, Thomas Cauchy 2, Benoit Da Mota 1

Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, which should be composed meticulously because the calculations are highly time-demanding. We have previously seen that the best-known quantum chemical dataset, QM9, lacks chemical diversity. As a consequence, ML models trained on QM9 showed shortcomings in generalizability. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our innovative approach makes it possible to estimate individually the impact of a solution on the diversity of a set, allowing for effective incremental evaluation. In the first application, we show how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, obtaining a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
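
The idea of estimating each solution's individual impact on the diversity of a set can be illustrated with a minimal sketch. The abstract does not state which descriptor or diversity measure the authors use, so everything below is an assumption made for illustration: diversity is taken to be the number of distinct features seen so far, and the hypothetical `features` helper uses length-3 substrings of a SMILES string as a crude stand-in for a real chemical descriptor. The `DiversityTracker` class and its method names are likewise hypothetical.

```python
from typing import Set


def features(smiles: str, k: int = 3) -> Set[str]:
    """Hypothetical stand-in descriptor (an assumption, not the paper's):
    the set of length-k substrings of a SMILES string."""
    if len(smiles) < k:
        return {smiles}
    return {smiles[i:i + k] for i in range(len(smiles) - k + 1)}


class DiversityTracker:
    """Assumed diversity measure: count of distinct features in the set."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def diversity(self) -> int:
        # Diversity of the whole set is the number of distinct features.
        return len(self.seen)

    def gain(self, smiles: str) -> int:
        # Impact of one candidate: features it would add, without adding it.
        return len(features(smiles) - self.seen)

    def add(self, smiles: str) -> int:
        # Incremental update: only the candidate's own features are examined,
        # so the cost does not grow with the number of molecules already added.
        new = features(smiles) - self.seen
        self.seen |= new
        return len(new)


tracker = DiversityTracker()
for s in ["CCO", "CCN", "CCO"]:
    contributed = tracker.add(s)
    print(s, contributed, tracker.diversity())
```

Under these assumptions, a candidate that brings no unseen feature has a gain of zero, which is the kind of per-solution signal a generator could use as a diversity objective without rescoring the whole set.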

Updated: 2021-10-02