Dataset’s chemical diversity limits the generalizability of machine learning predictions
Journal of Cheminformatics (IF 8.6), Pub Date: 2019-11-12, DOI: 10.1186/s13321-019-0391-2
Marta Glavatskikh, Jules Leguy, Gilles Hunault, Thomas Cauchy, Benoit Da Mota

The QM9 dataset has become the gold standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have recently been published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. This article presents PC9, a new QM9-equivalent dataset (only H, C, N, O and F, and up to 9 “heavy” atoms) drawn from the PubChemQC project. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset.
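
The cross-dataset comparison described in the abstract (train on one dataset, predict energies of the other) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic toy points, not QM9 or PC9; the "energy" is an arbitrary smooth function; and the kernel ridge regression is a plain NumPy closed-form version with hand-picked hyperparameters, not the authors' models or settings.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def krr_fit(X, y, gamma=1.0, alpha=1e-3):
    # Closed-form kernel ridge regression: solve (K + alpha * I) c = y.
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def krr_predict(X_train, coef, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ coef

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def toy_energy(X):
    # Hypothetical smooth target standing in for a molecular energy.
    return np.sin(X).sum(axis=1)

rng = np.random.default_rng(0)

# Toy stand-ins: a "narrow" set covering a small region of descriptor space
# and a "broad" set covering a wider region (more diversity).
X_narrow = rng.uniform(-1.0, 1.0, size=(200, 2))
X_broad = rng.uniform(-3.0, 3.0, size=(200, 2))
y_narrow, y_broad = toy_energy(X_narrow), toy_energy(X_broad)

def cross_mae(X_tr, y_tr, X_te, y_te):
    # Train on one dataset, report mean absolute error on the other.
    coef = krr_fit(X_tr, y_tr)
    return mae(y_te, krr_predict(X_tr, coef, X_te))

narrow_to_broad = cross_mae(X_narrow, y_narrow, X_broad, y_broad)
broad_to_narrow = cross_mae(X_broad, y_broad, X_narrow, y_narrow)

print(f"train narrow -> test broad : MAE = {narrow_to_broad:.3f}")
print(f"train broad  -> test narrow: MAE = {broad_to_narrow:.3f}")
```

In this toy setting the model trained on the broader set transfers better to the narrower one than the reverse, which mirrors the qualitative claim of the abstract; it says nothing about the actual QM9 and PC9 error values.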

Updated: 2019-11-12