Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules, Machine Learning: Science and Technology

Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules
Machine Learning: Science and Technology (IF 6.3), Pub Date: 2021-05-17, DOI: 10.1088/2632-2153/abe347
Amit Gupta, Sabyasachi Chakraborty, Raghunathan Ramakrishnan

Accelerated and quantitatively accurate screening of nuclear magnetic resonance spectra across the chemical compound space of small molecules has two requirements: (1) a robust ‘local’ machine learning (ML) strategy that captures the effect of the neighborhood on chemical shielding, a ‘near-sighted’ property of an atom; (2) an accurate reference dataset, generated with a state-of-the-art first-principles method, for training. Herein we report the QM9-NMR dataset, comprising isotropic shielding of over 0.8 million C atoms in the 134k molecules of the QM9 dataset in the gas phase and in five common solvent phases. Using these data for training, we present benchmark results for the prediction transferability of kernel-ridge regression models with popular local descriptors. Our best model, trained on 100k samples, predicts the isotropic shielding of 50k ‘hold-out’ atoms with a mean error of less than 1.9 ppm. To allow rapid prediction for new query molecules, the models were trained on geometries from an inexpensive level of theory. Furthermore, by using a Δ-ML strategy, we reduce the error to below 1.4 ppm. Finally, we test the transferability on non-trivial benchmark sets that include drugs and molecules comprising 10–17 heavy atoms.
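The two modelling ideas named in the abstract, kernel-ridge regression and Δ-ML, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Gaussian kernel, the hyperparameters `sigma` and `lam`, the helper names, and above all the data, which are random synthetic vectors standing in for local atomic descriptors and shieldings rather than QM9-NMR values. The direct model learns the target-level value itself; the Δ-ML model learns only the difference between the target level and a cheaper baseline level, then adds the baseline back.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_query, sigma):
    """Predict as a kernel-weighted sum over the training points."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha


# Synthetic stand-ins (NOT QM9-NMR data): 300 "atoms" with 5-dimensional descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
baseline = 100.0 + 10.0 * X[:, 0]          # hypothetical cheap-level shielding
target = baseline + 2.0 * np.sin(X[:, 1])  # hypothetical target-level shielding

n_train = 200
X_tr, X_te = X[:n_train], X[n_train:]
sigma, lam = 2.0, 1e-3                     # illustrative hyperparameters, not tuned

# Direct learning: regress the (mean-centred) target-level value.
offset = target[:n_train].mean()
alpha_direct = krr_fit(X_tr, target[:n_train] - offset, sigma, lam)
pred_direct = offset + krr_predict(X_tr, alpha_direct, X_te, sigma)

# Delta learning: regress only the correction target - baseline,
# then add the baseline value of each query point back.
delta_tr = target[:n_train] - baseline[:n_train]
delta_offset = delta_tr.mean()
alpha_delta = krr_fit(X_tr, delta_tr - delta_offset, sigma, lam)
pred_delta = baseline[n_train:] + delta_offset + krr_predict(X_tr, alpha_delta, X_te, sigma)

mae_direct = np.mean(np.abs(pred_direct - target[n_train:]))
mae_delta = np.mean(np.abs(pred_delta - target[n_train:]))
print(f"direct MAE: {mae_direct:.3f}  delta MAE: {mae_delta:.3f}")
```

Note the trade-off this sketch makes visible: the Δ-ML prediction needs a baseline value for every query point, so it is only as cheap as the baseline calculation itself.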

Updated: 2021-05-17