Evaluating parameters for ligand-based modeling with random forest on sparse data sets


Journal of Cheminformatics (IF 8.6), Pub Date: 2018-10-11, DOI: 10.1186/s13321-018-0304-9
Alexander Kensert¹, Jonathan Alvarsson¹, Ulf Norinder²,³, Ola Spjuth¹


Ligand-based predictive modeling is widely used to generate predictive models that aid decision making in, for example, drug discovery projects. Growing data sets and demands for short modeling times make it necessary to analyze data sets efficiently in order to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, using Morgan fingerprints of different radii and hash sizes, and compared them with the molecular signatures descriptor at different heights. We specifically evaluated the effect of these parameters on modeling time, predictive performance, and memory requirements using two implementations of random forest: Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ($p \le 0.05$), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high-dimensional sparse data. Support vector machines and random forest performed equally well, but the results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.
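The difference between hashed and unhashed fingerprints can be illustrated with a minimal sketch. It assumes, for illustration only, that hashing folds integer feature identifiers into a fixed number of bits with a modulo operation; the function names and the feature values are hypothetical and are not taken from the paper or from any fingerprinting library.

```python
from collections import Counter


def fold_features(feature_ids, n_bits):
    """Fold integer feature identifiers into a fixed-length bit vector.

    Returns the set of bit positions that are switched on.
    Assumption: folding is a plain modulo, used here only for illustration.
    """
    return {fid % n_bits for fid in feature_ids}


def count_collisions(feature_ids, n_bits):
    """Count distinct features that lose their own column after folding."""
    per_bit = Counter(fid % n_bits for fid in set(feature_ids))
    return sum(n - 1 for n in per_bit.values() if n > 1)


# Hypothetical substructure identifiers for a single molecule.
features = [3, 19, 35, 40, 1027, 5000]

# Unhashed: every distinct feature keeps its own (sparse) column.
unhashed = set(features)

# Hashed: features are folded into a fixed number of bits.
hashed_small = fold_features(features, 16)
hashed_large = fold_features(features, 2048)

print(len(unhashed), len(hashed_small), len(hashed_large))
print(count_collisions(features, 16), count_collisions(features, 2048))
```

In this toy case a small hash size merges several distinct features into the same bit, while the unhashed representation keeps them apart at the cost of a wider, sparser feature space. This is one plausible reading of why hash size could affect accuracy; the sketch does not reproduce the study's measurements.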


Updated: 2018-10-11
