Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty
Journal of Cheminformatics (IF 7.1), Pub Date: 2021-08-19, DOI: 10.1186/s13321-021-00539-7
Lewis H Mervin 1 , Maria-Anna Trapotsi 2 , Avid M Afzal 3 , Ian P Barrett 3 , Andreas Bender 2 , Ola Engkvist 4, 5

Measurements of protein–ligand interactions have reproducibility limits due to experimental error. Any model built on such assays will consequently inherit these unavoidable errors, which influence its performance and should ideally be factored into modelling and output predictions, for example through the actual standard deviation of experimental measurements (σ) or the comparability of activity values across the heterogeneous activity units aggregated during dataset assimilation (i.e., Ki versus IC50 values). However, experimental error is usually neglected during model generation. To improve upon the current state of the art, we herein present a novel approach to predicting protein–ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated under various scenarios of experimental standard deviation in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. For example, in cases where σ ranged between 0.4 and 0.6 log units and the ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
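The abstract does not spell out how a measured activity and its σ become an "ideal probability estimate". One plausible reading, offered here only as an illustrative sketch, is to assume Gaussian measurement error and take the probability that the true value lies above the activity threshold, i.e. the normal CDF evaluated at (pXC50 − threshold)/σ. The function name `ideal_active_probability` and the threshold of 6.0 log units used below are hypothetical and are not taken from the paper.

```python
import math


def ideal_active_probability(pxc50, threshold, sigma):
    """Probability that the true activity is at or above the threshold.

    Assumption (not stated in the abstract): the measurement error is
    Gaussian with standard deviation `sigma` around the measured pXC50.
    """
    if sigma <= 0:
        # No uncertainty: fall back to the hard binary label used by a plain RF.
        return 1.0 if pxc50 >= threshold else 0.0
    z = (pxc50 - threshold) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    threshold = 6.0  # hypothetical activity cut-off in log units
    sigma = 0.5      # within the 0.4-0.6 range discussed in the abstract
    for value in (5.0, 5.9, 6.0, 6.1, 8.0):
        p = ideal_active_probability(value, threshold, sigma)
        print(f"pXC50={value:.1f}  P(active)={p:.3f}")
```

Under this assumption, a compound measured just above or below the threshold receives a probability near 0.5 rather than a hard 0 or 1, which is the regime where the abstract reports the largest benefit of PRF over RF. A compound far from the threshold receives a probability close to 0 or 1, so the soft label and the hard label nearly coincide.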

Updated: 2021-08-19