A new feature selection metric for text classification: eliminating the need for a separate pruning stage
International Journal of Machine Learning and Cybernetics (IF 3.1), Pub Date: 2021-04-11, DOI: 10.1007/s13042-021-01324-6
Muhammad Asim, Kashif Javed, Abdur Rehman, Haroon A. Babri

Terms that occur too frequently or too rarely across texts are not useful for text classification. Pruning can remove such irrelevant terms, reducing the dimensionality of the feature space and thus making feature selection more efficient and effective. Normally, pruning is done by manually setting threshold values. However, incorrect threshold values can result in the loss of many useful terms or the retention of irrelevant ones. Existing feature ranking metrics can assign higher ranks to these irrelevant terms, thus degrading the performance of a text classifier. In this paper, we propose a new feature ranking metric that can select the most useful terms in the presence of these too frequently and too rarely occurring terms, thus eliminating the need to prune them. To investigate the usefulness of the proposed metric, we compare it against seven well-known feature selection metrics on five data sets, namely Reuters-21578 (re0, re1, r8) and WebACE (k1a, k1b), using multinomial naive Bayes and support vector machine classifiers. Our results, based on a paired t-test, show that the performance of our metric is statistically significantly better than that of the other seven metrics.
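
To make the threshold-based pruning stage described in the abstract concrete, here is a minimal sketch of document-frequency pruning. It is not the authors' code and does not implement their proposed metric. The function name `prune_by_document_frequency`, the parameters `min_df` and `max_df_ratio`, and the toy documents are all assumptions chosen for illustration.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Return the set of terms kept after threshold-based pruning.

    docs is a list of token lists. min_df and max_df_ratio are hand-picked
    thresholds (hypothetical defaults, not values taken from the paper).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }


# Toy corpus, invented purely for illustration.
docs = [
    ["the", "oil", "price", "rose"],
    ["the", "wheat", "price", "fell"],
    ["the", "oil", "export", "grew"],
    ["the", "wheat", "harvest", "grew"],
]

kept = prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.75)
too_strict = prune_by_document_frequency(docs, min_df=3, max_df_ratio=0.75)
```

With these toy thresholds, `kept` drops "the" (present in every document) and the terms seen in only one document. Raising `min_df` to 3 leaves `too_strict` empty, which illustrates how a poorly chosen threshold can discard useful terms along with irrelevant ones.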

Updated: 2021-04-11
