General Approach to Estimate Error Bars for Quantitative Structure–Activity Relationship Predictions of Molecular Activity, Journal of Chemical Information and Modeling

Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2018-06-27, DOI: 10.1021/acs.jcim.8b00114
Ruifeng Liu, Kyle P. Glover, Michael G. Feasel, Anders Wallqvist

Key requirements for quantitative structure–activity relationship (QSAR) models to gain acceptance by regulatory authorities include a defined domain of applicability (DA) and appropriate measures of goodness-of-fit, robustness, and predictivity. Hence, many DA metrics have been developed over the past two decades. The most intuitive are perhaps distance-to-model metrics, which are most commonly defined in terms of the mean distance between a molecule and its k nearest training samples. Detailed evaluations have shown that the variance of predictions by an ensemble of QSAR models may serve as a DA metric and can outperform distance-to-model metrics. Intriguingly, the performance of the ensemble variance metric has led researchers to conclude that the error of predicting a new molecule does not depend on the input descriptors or machine-learning methods but on its distance to the training molecules. This implies that the distance to training samples may serve as the basis for developing a high-performance DA metric. In this article, we introduce a new Tanimoto distance-based DA metric called the sum of distance-weighted contributions (SDC), which takes into account contributions from all molecules in a training set. Using four acute chemical toxicity data sets of varying sizes and four other molecular property data sets, we demonstrate that SDC correlates well with the prediction error for all data sets regardless of the machine-learning methods and molecular descriptors used to build the QSAR models. Using the acute toxicity data sets, we compared the distribution of prediction errors with respect to SDC, the mean distance to the k nearest training samples, and the variance of random forest predictions. The results showed that the correlation with the prediction error was highest for SDC.
We also demonstrate that SDC allows for the development of robust root mean squared error (RMSE) models and makes it possible to not only give a QSAR prediction but also provide an individual RMSE estimate for each molecule. Because SDC does not depend on a specific machine-learning method, it represents a canonical measure that can be widely used to estimate individual molecule prediction errors for any machine-learning method.
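
The metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the functional form of SDC, so the weight `exp(-a * d / (1 - d))`, the default `a = 3.0`, the set-of-on-bits fingerprint representation, and all function names here are assumptions chosen for illustration, not necessarily the authors' implementation. The sketch contrasts the mean distance to the k nearest training samples with a sum over all training molecules, and shows one simple way that held-out errors might be grouped by SDC to give a per-bin RMSE estimate.

```python
import math


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(fp_a & fp_b) / union


def mean_knn_distance(query, training, k=3):
    """Distance-to-model metric: mean distance to the k nearest training samples."""
    nearest = sorted(tanimoto_distance(query, fp) for fp in training)[:k]
    return sum(nearest) / len(nearest)


def sdc(query, training, a=3.0):
    """Sum of distance-weighted contributions over ALL training molecules.

    The exponential weight and the value of `a` are illustrative assumptions.
    """
    total = 0.0
    for fp in training:
        d = tanimoto_distance(query, fp)
        if d < 1.0:  # at d == 1 the weight tends to zero, so skip it
            total += math.exp(-a * d / (1.0 - d))
    return total


def rmse_by_sdc_bin(sdc_values, errors, edges):
    """Group held-out prediction errors by SDC bin; return the RMSE of each bin."""
    squared = {}
    for s, e in zip(sdc_values, errors):
        idx = sum(s >= edge for edge in edges)  # index of the bin containing s
        squared.setdefault(idx, []).append(e * e)
    return {i: math.sqrt(sum(sq) / len(sq)) for i, sq in squared.items()}
```

Under these assumptions, a query molecule with close analogues in the training set receives a large SDC, while one sharing no fingerprint bits with any training molecule receives an SDC of zero. The RMSE of the bin that a new molecule's SDC falls into could then serve as a rough error bar for its prediction.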


Updated: 2018-06-27
