Open-source QSAR models for pKa prediction using multiple machine learning approaches
Journal of Cheminformatics (IF 7.1), Pub Date: 2019-09-18, DOI: 10.1186/s13321-019-0384-1
Kamel Mansouri 1, Neal F Cariello 1, Alexandru Korotcov 2, Valery Tkachenko 2, Chris M Grulke 3, Catherine S Sprankle 1, David Allen 1, Warren M Casey 4, Nicole C Kleinstreuer 4, Antony J Williams 3

The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and the ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction. The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure–activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN). The three methods delivered comparable performance on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and the performance of our models compared favorably to the commercial products.
This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
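
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how RMSE and R² are conventionally computed from experimental and predicted pKa values. It is not the authors' evaluation code; the function names `rmse` and `r_squared` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between experimental and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical pKa values, not taken from the paper's data set.
    experimental = [4.2, 9.8, 3.1, 7.5, 10.4]
    predicted = [4.9, 8.7, 4.0, 7.1, 11.6]
    print(f"RMSE = {rmse(experimental, predicted):.2f}")
    print(f"R2   = {r_squared(experimental, predicted):.2f}")
```

Under this definition, an RMSE around 1.5 means predictions deviate from experiment by roughly 1.5 pKa units in the root-mean-square sense, and an R² around 0.80 means the models account for about 80% of the variance in the experimental values.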

Updated: 2019-09-18