On entropy-based term weighting schemes for text categorization
Knowledge and Information Systems ( IF 2.7 ) Pub Date : 2021-07-07 , DOI: 10.1007/s10115-021-01581-5
Tao Wang 1 , Yi Cai 2, 3 , Ho-fung Leung 4 , Raymond Y. K. Lau 5 , Haoran Xie 6 , Qing Li 7

In text categorization, Vector Space Model (VSM) has been widely used for representing documents, in which a document is represented by a vector of terms. Since different terms contribute to a document’s semantics in various degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across different text categorization tasks, while the mechanism underlying variability in a scheme’s performance remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of a term’s occurrences across all categories in a corpus. In this paper, we first systematically examine pros and cons of existing term weighting schemes in text categorization and explore the reasons why some schemes with sound theoretical bases, such as chi-square test and information gain, perform poorly in empirical evaluations. By measuring the concentration that a term distributes across all categories in a corpus, we then propose a series of entropy-based term weighting schemes to measure the distinguishing power of a term in text categorization. Through extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform the state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
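The abstract does not give the proposed formulas, so the following is only a minimal sketch of the general idea of scoring a term by how concentrated its occurrences are across categories. It assumes a term's per-category occurrence counts are available and uses one minus the normalized Shannon entropy as the weight. The function names `category_entropy` and `concentration_weight` and the normalization are illustrative assumptions, not the authors' schemes.

```python
import math


def category_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrences over categories.

    `counts` holds the number of times the term occurs in each category.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def concentration_weight(counts):
    """Hypothetical global weight in [0, 1] (an assumption, not the paper's formula).

    1.0 means every occurrence falls in a single category (most distinguishing);
    0.0 means occurrences are spread evenly over all categories.
    """
    k = len(counts)
    if k < 2 or sum(counts) == 0:
        return 0.0
    return 1.0 - category_entropy(counts) / math.log2(k)


# Made-up counts over four categories, for illustration only.
example_terms = {
    "concentrated": [10, 0, 0, 0],
    "uniform": [5, 5, 5, 5],
    "skewed": [8, 1, 1, 0],
}
for term, counts in example_terms.items():
    print(term, round(concentration_weight(counts), 3))
```

Under this assumption, a term confined to one category gets the highest weight and a term spread evenly gets the lowest. Such a global factor would typically be multiplied by a local factor such as term frequency, which is a design choice the abstract does not specify.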




Updated: 2021-07-07