A novel filter feature selection method for text classification: Extensive Feature Selector
Journal of Information Science (IF 2.4), Pub Date: 2021-04-13, DOI: 10.1177/0165551521991037
Bekir Parlak, Alper Kursat Uysal

Because the high dimensionality of textual data limits classification accuracy, feature selection (FS) is an essential dimension-reduction step in the text classification (TC) domain. Most FS methods for TC rely on several class-based probabilities. In this study, we propose a new filter FS method, the Extensive Feature Selector (EFS), which combines corpus-based and class-based probabilities in its score. The performance of EFS is compared with nine well-known FS methods, namely Chi-Squared (CHI2), Class Discriminating Measure (CDM), Discriminative Power Measure (DPM), Odds Ratio (OR), Distinguishing Feature Selector (DFS), Comprehensively Measure Feature Selection (CMFS), Discriminative Feature Selection (DFSS), Normalised Difference Measure (NDM) and Max–Min Ratio (MMR), using Multinomial Naive Bayes (MNB), Support Vector Machine (SVM) and k-Nearest Neighbour (KNN) classifiers on four benchmark data sets: Reuters-21578, 20-Newsgroup, Mini 20-Newsgroup and Polarity. The experiments were carried out for six feature-set sizes: 10, 30, 50, 100, 300 and 500. Experimental results show that EFS outperforms the other nine methods in most cases according to both micro-F1 and macro-F1 scores.
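The filter paradigm summarised above scores every term from document-frequency statistics and keeps only the top-k terms before training a classifier. The abstract does not give the EFS formula, so the sketch below uses a hypothetical class-discrimination score, max over classes c of P(t|c)·(1 − P(t|¬c)), purely to illustrate how corpus- and class-based probabilities feed a filter selector; the function names and toy corpus are invented for this example.

```python
from collections import Counter

def filter_feature_scores(docs, labels):
    """Score each term with a simple class-discrimination filter.

    Hypothetical illustrative score (NOT the EFS formula from the paper):
    for each term t, take max over classes c of P(t|c) * (1 - P(t|not c)),
    where probabilities are class-based document frequencies.
    """
    classes = set(labels)
    n_docs = len(docs)
    class_size = Counter(labels)
    # document frequency of each term within each class
    df = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        for term in set(doc.split()):
            df[c][term] += 1
    vocab = {t for counts in df.values() for t in counts}
    scores = {}
    for t in vocab:
        best = 0.0
        for c in classes:
            p_t_c = df[c][t] / class_size[c]
            n_other = n_docs - class_size[c]
            df_other = sum(df[c2][t] for c2 in classes if c2 != c)
            p_t_notc = df_other / n_other if n_other else 0.0
            best = max(best, p_t_c * (1.0 - p_t_notc))
        scores[t] = best
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (the reduced feature set)."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Toy two-class corpus; real experiments would use, e.g., Reuters-21578.
docs = ["goal match striker win",
        "match referee goal score",
        "election vote senate bill",
        "vote campaign election poll"]
labels = ["sports", "sports", "politics", "politics"]
scores = filter_feature_scores(docs, labels)
top = select_top_k(scores, 4)
```

Terms that occur in every document of one class and never in the other (here "goal", "match", "election", "vote") score 1.0 and survive the cut; the selected vocabulary would then be used to vectorise the documents for an MNB, SVM or KNN classifier.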




Updated: 2021-04-13