Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration†
Molecular Systems Design & Engineering ( IF 3.6 ) Pub Date : 2019-08-27 , DOI: 10.1039/c9me00078j
Bowen Li 1, 2, 3, 4 , Srinivas Rangarajan 1, 2, 3, 4
In this paper, we consider the problem of designing a compact training set comprising the most informative molecules from a specified library to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method combining rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection within the ε-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on various databases, including QM7, NIST, and a dataset of surface intermediates for calculating thermodynamic properties (heat of atomization and enthalpy of formation). For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) leads to learning with a fraction (sometimes as little as 15%) of the data to achieve similar accuracy to five-fold cross validation on the entire set. On the other hand, our results indicate that kernel methods prefer diversity-maximizing selection.
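The abstract's ε-greedy framework alternates between exploration (diversity-maximizing selection) and exploitation (D-optimality selection). The following is a minimal illustrative sketch of that idea, not the authors' implementation: the feature matrix `X` is a random stand-in for real molecular descriptors, the max-min-distance rule stands in for the cheminformatics diversity criterion, and the greedy D-optimal step uses the standard rank-one determinant update det(A + xxᵀ) = det(A)(1 + xᵀA⁻¹x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for molecular descriptors (e.g. group counts or
# fingerprints); the paper uses cheminformatics-derived features.
X = rng.normal(size=(200, 8))

def epsilon_greedy_select(X, n_train, eps=0.3, ridge=1e-6):
    """Pick n_train rows of X, mixing diversity-maximizing (exploration)
    and greedy D-optimal (exploitation) selection with probability eps."""
    n, d = X.shape
    selected = [int(rng.integers(n))]  # seed with a random molecule
    remaining = set(range(n)) - set(selected)
    # Regularized information matrix of the current design.
    A = ridge * np.eye(d) + np.outer(X[selected[0]], X[selected[0]])
    while len(selected) < n_train:
        cand = np.fromiter(remaining, dtype=int)
        if rng.random() < eps:
            # Exploration: max-min distance to the already-selected set.
            dists = np.linalg.norm(X[cand][:, None] - X[selected][None], axis=2)
            pick = cand[int(np.argmax(dists.min(axis=1)))]
        else:
            # Exploitation: greedy D-optimal pick; adding x scales det(A)
            # by (1 + x^T A^{-1} x), so maximize that gain.
            A_inv = np.linalg.inv(A)
            gains = [x @ A_inv @ x for x in X[cand]]
            pick = cand[int(np.argmax(gains))]
        A += np.outer(X[pick], X[pick])
        selected.append(int(pick))
        remaining.remove(int(pick))
    return selected

subset = epsilon_greedy_select(X, n_train=30)
```

In this sketch `eps` fixes the exploration fraction; the paper's point is that the best balance is model-dependent (sparse group-additive models benefit from mixing both criteria, while kernel methods favor pure diversity-maximizing selection, i.e. large `eps`).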

Updated: 2019-10-07