Confidence bands and hypothesis tests for hit enrichment curves, Journal of Cheminformatics


Journal of Cheminformatics (IF 7.1), Pub Date: 2022-07-28, DOI: 10.1186/s13321-022-00629-0
Jeremy R Ash (1, 2), Jacqueline M Hughes-Oliver (1)


In virtual screening for drug discovery, hit enrichment curves are widely used to assess the performance of ranking algorithms with regard to their ability to identify early enrichment. Unfortunately, researchers almost never consider the uncertainty associated with estimating such curves before declaring differences between the performance of competing algorithms. Uncertainty is often large because the testing fractions of interest to researchers are small. Appropriate inference is complicated by two sources of correlation that are often overlooked: correlation across different testing fractions within a single algorithm, and correlation between competing algorithms. Additionally, researchers are often interested in making comparisons along the entire curve, not only at a few testing fractions. We develop inferential procedures to address the needs of both those interested in a few testing fractions and those interested in the entire curve. For the former, four approaches to hypothesis testing and (pointwise) confidence intervals are investigated, and a newly developed EmProc approach is found to be most effective. For inference along entire curves, EmProc-based confidence bands are recommended for simultaneous coverage and minimal width. While we focus on the hit enrichment curve, this work is also appropriate for lift curves, which are used throughout the machine learning community. Our inferential procedures also extend trivially to enrichment factors.
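To make the quantities in the abstract concrete, here is a minimal sketch of how a hit enrichment curve and the corresponding enrichment factors might be computed from ranked scores. It assumes the common definition in which the hit enrichment at testing fraction r is the share of all actives recovered among the top r fraction of ranked compounds, and the enrichment factor is that share divided by r. The function names, the rounding of r * n up to a whole number of compounds, and the tie handling are illustrative assumptions, not taken from the paper, and the sketch yields point estimates only: it does not implement EmProc or any of the paper's confidence intervals or bands.

```python
import numpy as np


def hit_enrichment(scores, is_active, fractions):
    """Share of all actives found in the top-scoring fraction of compounds.

    Assumed definition (not the paper's code): at testing fraction r,
    test the ceil(r * n) highest-scoring compounds and report the
    proportion of all actives that fall among them.
    """
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n = scores.size
    n_active = int(is_active.sum())
    if n_active == 0:
        raise ValueError("need at least one active compound")

    # Rank from highest to lowest score; a stable sort keeps tied
    # compounds in input order (an arbitrary tie-breaking choice).
    order = np.argsort(-scores, kind="stable")
    cum_hits = np.cumsum(is_active[order])

    recalls = []
    for r in fractions:
        k = int(np.ceil(r * n))   # compounds tested at fraction r
        k = min(max(k, 1), n)     # keep k within 1..n
        recalls.append(cum_hits[k - 1] / n_active)
    return np.array(recalls)


def enrichment_factor(scores, is_active, fractions):
    """Hit enrichment divided by the testing fraction (assumed definition)."""
    fractions = np.asarray(fractions, dtype=float)
    return hit_enrichment(scores, is_active, fractions) / fractions
```

With small testing fractions, k is small and the estimate rests on only a handful of compounds, which is one way to see why the uncertainty discussed in the abstract can be large.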


Updated: 2022-07-29
