Using modified term frequency to improve term weighting for text classification
Engineering Applications of Artificial Intelligence (IF 8), Pub Date: 2021-03-01, DOI: 10.1016/j.engappai.2021.104215
Long Chen, Liangxiao Jiang, Chaoqun Li

Text classification (TC) is an essential task in natural language processing (NLP). To improve TC performance, term weighting is often used to obtain an effective text representation by assigning an appropriate weight to each term. A term weighting scheme generally consists of a term frequency factor, a collection frequency factor, and a normalization factor. The normalization factor is optional and is commonly used to offset the influence of document length. Our investigation of existing term weighting schemes shows that most of them focus on finding a more effective collection frequency factor and rarely consider a new term frequency factor. In this paper, we first propose a new term frequency factor called modified term frequency (MTF). Unlike the normalization factor, MTF directly modifies the raw term frequency based on the length information of all training documents. We then propose a new term weighting scheme that combines MTF with an existing collection frequency factor called modified distinguishing feature selector (MDFS). We denote this scheme MTF-MDFS (MDFS-based MTF). Extensive experiments on 19 benchmark text datasets and 6 real-world text datasets show that both MTF and MTF-MDFS perform much better than their state-of-the-art competitors in terms of classification accuracy and weighted average F1 with widely used base classifiers such as MNB, SVM, and LR.
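
As a rough illustration of the three-factor structure described above, the sketch below combines a term frequency factor (raw counts), a collection frequency factor (standard IDF), and a normalization factor (cosine length normalization). This is the classic TF-IDF composition, chosen only as a familiar stand-in; it is not MTF or MDFS, whose formulas are not given in this abstract. The function name `tf_idf_weights` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Weight terms with classic TF-IDF plus cosine normalization.

    docs: list of token lists.
    Returns one {term: weight} dict per document.
    NOTE: illustrative stand-in only; this is not MTF or MDFS.
    """
    n_docs = len(docs)

    # Collection frequency factor: plain IDF, log(N / document frequency).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    weighted = []
    for doc in docs:
        # Term frequency factor: raw occurrence counts in this document.
        tf = Counter(doc)
        raw = {term: count * idf[term] for term, count in tf.items()}

        # Normalization factor: divide by the Euclidean length of the vector
        # so that long documents do not dominate purely through their length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}
        )
    return weighted


if __name__ == "__main__":
    toy_docs = [["a", "b", "b"], ["a", "c"]]
    for weights in tf_idf_weights(toy_docs):
        print(weights)
```

In this framing, a scheme like the one the abstract describes would swap out the raw-count step for a different term frequency factor and the IDF step for a different collection frequency factor, while the overall product structure stays the same.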




Updated: 2021-03-01