A New Big Data Feature Selection Approach for Text Classification, Scientific Programming



Scientific Programming, Pub Date: 2021-04-19, DOI: 10.1155/2021/6645345
Houda Amazal, Mohamed Kissi


Feature selection (FS) is a fundamental task in text classification. Text feature selection aims to represent documents using only the most relevant features, which can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on single machines. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach to feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores term frequency when selecting features. The paper therefore introduces a distributed FS method, Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. The results show that MTF-MI improves classification results compared with four state-of-the-art methods in terms of the macro-F1 and micro-F1 measures.
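The abstract does not give the MTF-MI formula, so the following is only a minimal single-machine sketch of the standard MI score it builds on, not the authors' method. It computes the mutual information between a term's presence in a document and the class label. The function name `mutual_information` and the toy corpus are assumptions made for illustration. Because each document contributes only the set of its terms, a term repeated many times scores the same as a term that appears once, which is the drawback the abstract describes.

```python
import math
from collections import Counter

def mutual_information(docs, labels):
    """Score each term by the MI (in bits) between its presence and the class.

    docs   -- list of token lists, one per document
    labels -- list of class labels, aligned with docs
    """
    n = len(docs)
    class_count = Counter(labels)   # documents per class
    term_count = Counter()          # documents containing each term
    joint_count = Counter()         # (term, class) -> documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # presence only: repeats are discarded
            term_count[term] += 1
            joint_count[term, label] += 1

    scores = {}
    for term, n_t in term_count.items():
        mi = 0.0
        for c, n_c in class_count.items():
            n_tc = joint_count[term, c]
            # Two cells per class: term present, term absent.
            for joint, marginal in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if joint > 0:
                    mi += (joint / n) * math.log2(joint * n / (marginal * n_c))
        scores[term] = mi
    return scores

# Hypothetical toy corpus: "goal" occurs three times in the first document.
docs = [
    ["goal", "goal", "goal", "match"],
    ["goal", "match"],
    ["vote", "match"],
    ["vote", "law"],
]
labels = ["sport", "sport", "politics", "politics"]

scores = mutual_information(docs, labels)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

Replacing the first document with `["goal", "match"]` leaves every score unchanged, since only presence is counted. A frequency-aware method would need the per-document term counts that this sketch throws away.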


Updated: 2021-04-19
