A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment
ACM Transactions on Information Systems (IF 5.6), Pub Date: 2020-12-31, DOI: 10.1145/3433164
Andrea Esuli, Alessio Molinari, Fabrizio Sebastiani

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., a difference between the distribution of the priors in the training documents and in the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and it is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
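The iterative, mutually recursive update that the abstract describes can be sketched as follows. This is a minimal sketch of the EM-style SLD iteration, not the authors' code; the function name `sld`, the convergence tolerance, and the iteration cap are illustrative choices:

```python
import numpy as np

def sld(posteriors, train_priors, tol=1e-6, max_iter=1000):
    """SLD sketch: mutually update prior estimates and posteriors
    on an unlabelled set, starting from the training-set priors.

    posteriors:   (n_docs, n_classes) classifier posteriors P(y|x)
    train_priors: (n_classes,) class priors in the training set
    """
    priors = train_priors.copy()
    adjusted = posteriors
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio of the current
        # prior estimates to the training priors, then renormalize
        # so that each document's posteriors sum to one
        scaled = posteriors * (priors / train_priors)
        adjusted = scaled / scaled.sum(axis=1, keepdims=True)
        # M-step: re-estimate the priors as the mean adjusted
        # posterior over the unlabelled documents
        new_priors = adjusted.mean(axis=0)
        if np.abs(new_priors - priors).sum() < tol:
            priors = new_priors
            break
        priors = new_priors
    return priors, adjusted
```

On a shifted unlabelled set the re-estimated priors drift away from the training priors toward the mean of the adjusted posteriors, which is exactly the behaviour whose reliability the paper examines for larger class counts and non-calibrated classifiers.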
