Learning to rank for multi-label text classification: Combining different sources of information
Natural Language Engineering (IF 2.3), Pub Date: 2020-02-18, DOI: 10.1017/s1351324920000029
Hosein Azarbonyad, Mostafa Dehghani, Maarten Marx, Jaap Kamps

Efficiently exploiting all available sources of information, such as labeled instances, class representations, and the relations between them, has a high impact on the performance of Multi-Label Text Classification (MLTC) systems. Most current approaches use labeled documents as the primary source of information for MLTC. We investigate the effectiveness of different sources of information for MLTC, such as the labeled training data, the textual labels of classes, and the taxonomic relations between classes. More specifically, for each document–class pair, different features are first extracted using the different sources of information; these features reflect the similarity of classes and documents. MLTC is then cast as a ranking problem, and a learning to rank (LTR) approach is used to rank classes with respect to each document and select the document's labels. An important characteristic of many MLTC instances is that documents can belong to multiple classes and there are implicit relations between classes. We apply score propagation on top of LTR to incorporate the co-occurrence patterns of classes in labeled documents. Our main findings are the following. First, using an LTR approach that integrates all features, we observe significantly better performance than previous MLTC systems; in particular, we show that simple classification approaches fail when the number of classes is large. Second, the analysis of feature weights reveals the relative importance of the various sources of evidence and gives insight into the underlying classification problem. Interestingly, the results indicate that document titles are more informative than all other sources of information. Third, a lean-and-mean system using only four features reaches 96% of the performance of the large LTR model proposed in this paper. Fourth, using the co-occurrence information of classes helps to classify documents more accurately; our results show that this information is most helpful when the underlying classifier performs poorly.
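The pipeline outlined in the abstract (document–class features from several sources, an LTR scorer, and co-occurrence-based score propagation) can be illustrated with a small, self-contained sketch. The example below is a hypothetical stand-in, not the authors' implementation: it substitutes a pointwise logistic-regression scorer for a full LTR model, uses only two toy similarity features (similarity to a class's textual label and to the centroid of its training documents), and relies on an invented toy corpus and an assumed propagation weight alpha.

```python
# Minimal, hypothetical sketch of ranking-based MLTC with score propagation.
# Toy data, feature choices, and the pointwise scorer are illustrative
# assumptions, not the method described in the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Toy labeled corpus: each document carries one or more class labels.
train_docs = ["tax policy and public finance", "football match report",
              "government budget deficit", "olympic games and athletes"]
train_labels = [{"economy"}, {"sports"}, {"economy", "politics"}, {"sports"}]
class_names = ["economy", "sports", "politics"]
# Textual labels of classes: a second source of information besides labeled docs.
class_texts = {"economy": "economy finance budget tax",
               "sports": "sports football olympics athletes",
               "politics": "politics government policy"}

vec = TfidfVectorizer().fit(train_docs + list(class_texts.values()))

def doc_class_features(doc, cls):
    """Two features for a document-class pair: similarity to the class's
    textual label and to the centroid of the class's training documents."""
    d = vec.transform([doc])
    sim_label = cosine_similarity(d, vec.transform([class_texts[cls]]))[0, 0]
    members = [t for t, ls in zip(train_docs, train_labels) if cls in ls]
    centroid = np.asarray(vec.transform(members).mean(axis=0))
    sim_members = cosine_similarity(d, centroid)[0, 0]
    return [sim_label, sim_members]

# Pointwise stand-in for LTR: a classifier over (document, class) pairs,
# positive when the class is a true label of the document.
X = [doc_class_features(d, c) for d in train_docs for c in class_names]
y = [int(c in ls) for ls in train_labels for c in class_names]
ranker = LogisticRegression().fit(X, y)

def rank_classes(doc):
    """Score every class for one document; higher score = higher rank."""
    return np.array([ranker.predict_proba([doc_class_features(doc, c)])[0, 1]
                     for c in class_names])

# Score propagation: push part of each class's score along the class
# co-occurrence pattern observed in the labeled documents.
C = np.zeros((len(class_names), len(class_names)))
for ls in train_labels:
    for a in ls:
        for b in ls:
            if a != b:
                C[class_names.index(a), class_names.index(b)] += 1
row_sums = C.sum(axis=1, keepdims=True)
P = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

alpha = 0.8  # assumed weight of the original ranking scores
scores = rank_classes("new report on sports funding in the national budget")
propagated = alpha * scores + (1 - alpha) * P.T @ scores
print(sorted(zip(class_names, propagated), key=lambda t: -t[1]))
```

In this sketch, propagation slightly boosts classes that frequently co-occur with already high-scoring classes (here, "politics" receives mass from "economy"), which mirrors the abstract's observation that co-occurrence information compensates for a weak base ranker.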
