MoleculeNet: a benchmark for molecular machine learning†, Chemical Science


Chemical Science (IF 8.4), Pub Date: 2017-10-31, DOI: 10.1039/c7sc02664a
Zhenqin Wu 1,2,3,4; Bharath Ramsundar 2,3,4,5; Evan N. Feinberg 3,4,6,7; Joseph Gomes 1,2,3,4; Caleb Geniesse 3,4,6,7; Aneesh S. Pappu 2,3,4,5; Karl Leswing 4,8; Vijay Pande 1,2,3,4


Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited by the lack of a standard benchmark for comparing the efficacy of proposed methods; most new algorithms are benchmarked on different datasets, making it challenging to gauge their quality. This work introduces MoleculeNet, a large-scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high-quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open-source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle with complex tasks under data scarcity and with highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than the choice of a particular learning algorithm.
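The caveat about highly imbalanced classification, and the reason a benchmark needs to fix its evaluation metrics, can be illustrated with a minimal sketch. Nothing here is taken from the paper or from DeepChem: the helper names `accuracy` and `roc_auc` are hypothetical, the toy labels are invented, and ROC-AUC is used only as one commonly used ranking metric, not as a statement of what MoleculeNet reports.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney)
    definition of ROC-AUC, written for clarity rather than speed.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented toy data (an assumption for illustration): 2 actives in 100 compounds.
labels = [1, 1] + [0] * 98

# A "model" that ignores its input and always predicts the majority class.
constant_predictions = [0] * 100
constant_scores = [0.0] * 100

majority_accuracy = accuracy(labels, constant_predictions)
majority_auc = roc_auc(labels, constant_scores)

print(f"accuracy = {majority_accuracy:.2f}, ROC-AUC = {majority_auc:.2f}")
```

In this toy setting the constant predictor looks strong by accuracy yet is no better than chance by ROC-AUC, which is why results on imbalanced datasets are hard to compare unless the metric is agreed in advance.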


Updated: 2017-10-31
