Improving (Q)SAR predictions by examining bias in the selection of compounds for experimental testing.
SAR and QSAR in Environmental Research (IF 2.3), Pub Date: 2019-09-24, DOI: 10.1080/1062936x.2019.1665580
P V Pogodin 1 , A A Lagunin 1, 2 , D A Filimonov 1 , M C Nicklaus 3 , V V Poroikov 1

Existing data on structures and biological activities are limited and distributed unevenly across distinct molecular targets and chemical compounds. The question arises whether these data represent an unbiased sample of the general population of chemical-biological interactions. To answer this question, we analyzed ChEMBL data for 87,583 molecules tested against 919 protein targets using supervised and unsupervised approaches. Hierarchical clustering of the Murcko frameworks generated using Chemistry Development Toolkit showed that the available data form a big diffuse cloud without apparent structure. In contrast, PASS-based classifiers could predict whether a compound had been tested against a particular molecular target, regardless of whether it was active or not. Thus, one may conclude that the selection of chemical compounds for testing against specific targets is biased, probably due to the influence of prior knowledge. We assessed the possibility of improving (Q)SAR predictions using this fact: PASS prediction of the interaction with a particular target has significantly higher accuracy for compounds predicted as tested against the target than for those predicted as untested (average ROC AUC values are about 0.87 and 0.75, respectively). Thus, accounting for the existing bias in the training set data may increase the performance of virtual screening.
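
The stratified comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' PASS workflow: the function names, the 0.5 cutoff on the "predicted as tested" score, and the toy numbers are all assumptions made for illustration. The idea is to split compounds by a classifier's "tested against this target" score and then compute the ROC AUC of the activity prediction within each group.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, inactive) pairs in which the
    active compound gets the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def stratified_auc(
    labels: Sequence[int],
    activity_scores: Sequence[float],
    tested_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[float, float]:
    """Return (AUC for compounds predicted as tested,
               AUC for compounds predicted as untested)."""
    tested: List[Tuple[int, float]] = []
    untested: List[Tuple[int, float]] = []
    for y, s, t in zip(labels, activity_scores, tested_scores):
        (tested if t >= threshold else untested).append((y, s))
    auc_tested = roc_auc([y for y, _ in tested], [s for _, s in tested])
    auc_untested = roc_auc([y for y, _ in untested], [s for _, s in untested])
    return auc_tested, auc_untested


# Made-up toy values for one target; they are NOT data from the study.
labels = [1, 0, 1, 0, 1, 0, 1, 0]                          # 1 = active
activity_scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]  # activity model
tested_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]    # "tested" model

auc_tested, auc_untested = stratified_auc(labels, activity_scores, tested_scores)
```

In practice the two score vectors would come from separately trained models (one for activity, one for "was this compound tested"), and the per-target AUC pairs would be averaged across targets to obtain figures comparable to those quoted above.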




Updated: 2019-09-24