


Robust supervised topic models under label noise
Machine Learning (IF 4.3), Pub Date: 2021-04-14, DOI: 10.1007/s10994-021-05967-y
Wei Wang, Bing Guo, Yan Shen, Han Yang, Yaosen Chen, Xinhua Suo

Statistical topic modeling approaches have recently been widely applied to supervised document classification. However, little research has examined these approaches under label noise, which is common in real-world applications. For example, many large-scale datasets are collected from websites or annotated by human workers of varying quality, and therefore contain some mislabeled items. In this paper, we propose two robust topic models for document classification: Smoothed Labeled LDA (SL-LDA) and Adaptive Labeled LDA (AL-LDA). SL-LDA extends Labeled LDA (L-LDA), a classical supervised topic model, and uses Dirichlet smoothing to overcome L-LDA's main shortcoming, overfitting to noisy labels. AL-LDA is an iterative optimization framework based on SL-LDA. At each iteration, we update the Dirichlet prior, which incorporates the observed labels, using a concise algorithm based on the principles of maximizing entropy and minimizing cross-entropy. This method avoids having to identify the noisy labels, a common difficulty in label noise cleaning algorithms. Quantitative experiments in the Noisy Completely at Random (NCAR) and Multiple Noisy Sources (MNS) settings demonstrate that our models achieve outstanding performance under noisy labels. In particular, the proposed AL-LDA has significant advantages over state-of-the-art topic modeling approaches under massive label noise.
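To make the idea of Dirichlet smoothing more concrete, the following is a minimal sketch of how a label-informed prior might differ between a hard-constrained model in the style of L-LDA and a smoothed variant. The function names, the pseudo-count parameters `alpha` and `eta`, and the additive form of the smoothing are all assumptions made for illustration; the abstract does not give the paper's actual formulation, so this should not be read as the authors' method.

```python
# Illustrative sketch only: the names and the additive smoothing form are
# assumptions for this example, not the formulation used in the paper.


def hard_label_prior(labels, num_topics, alpha=1.0):
    """Hard-constrained prior: only topics matching observed labels get mass.

    A zero entry means the topic is excluded for this document, so a wrong
    label makes the true topic unreachable.
    """
    observed = set(labels)
    return [alpha if k in observed else 0.0 for k in range(num_topics)]


def smoothed_label_prior(labels, num_topics, alpha=1.0, eta=0.1):
    """Smoothed prior: every topic keeps a small pseudo-count eta (assumed)."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    observed = set(labels)
    return [alpha + eta if k in observed else eta for k in range(num_topics)]


def prior_mean(prior):
    """Normalize pseudo-counts into the mean topic proportions they imply."""
    total = sum(prior)
    return [p / total for p in prior]


if __name__ == "__main__":
    # A document observed with label 2 out of 4 possible labels.
    print(hard_label_prior([2], 4))
    print(prior_mean(smoothed_label_prior([2], 4)))
```

Under these assumptions, the smoothed prior still favors the observed label but leaves non-zero mass on the others, so a mislabeled document is not forced entirely onto the wrong topic.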


Updated: 2021-04-15
