A new feature selection metric for text classification: eliminating the need for a separate pruning stage

Asim, Muhammad; Javed, Kashif; Rehman, Abdur; Babri, Haroon A.

doi:10.1007/s13042-021-01324-6

Original Article
Published: 11 April 2021

Volume 12, pages 2461–2478, (2021)

International Journal of Machine Learning and Cybernetics

Muhammad Asim¹,
Kashif Javed ORCID: orcid.org/0000-0002-0515-2017²,
Abdur Rehman³ &
…
Haroon A. Babri²

Abstract

Terms that occur too frequently or too rarely across documents are not useful for text classification. Pruning removes such irrelevant terms, reducing the dimensionality of the feature space and thus making feature selection more efficient and effective. Pruning is normally done by manually setting threshold values. However, incorrect threshold values can cause many useful terms to be lost or irrelevant ones to be retained. Existing feature ranking metrics can assign high ranks to these irrelevant terms, thus degrading the performance of a text classifier. In this paper, we propose a new feature ranking metric that can select the most useful terms in the presence of too frequently and too rarely occurring terms, thus eliminating the need to prune them. To investigate the usefulness of the proposed metric, we compare it against seven well-known feature selection metrics on five data sets, namely Reuters-21578 (re0, re1, r8) and WebACE (k1a, k1b), using multinomial naive Bayes and support vector machine classifiers. Our results, based on a paired t-test, show that our metric performs statistically significantly better than the other seven metrics.
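To make the threshold-based pruning stage concrete, the following is a minimal sketch of document-frequency pruning. It is not the metric proposed in the paper, which this abstract does not define. The function name, the parameters `min_df` and `max_df_ratio`, and the toy documents are all assumptions introduced for illustration.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Return the sorted vocabulary that survives threshold-based pruning.

    docs: list of tokenized documents (each a list of strings).
    min_df: hypothetical lower threshold; terms appearing in fewer
        documents than this are treated as too rare and dropped.
    max_df_ratio: hypothetical upper threshold; terms appearing in a
        larger fraction of documents than this are treated as too
        frequent and dropped.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


# Toy corpus (made up for illustration only).
docs = [
    ["the", "oil", "price", "rose"],
    ["the", "wheat", "price", "fell"],
    ["the", "oil", "export", "grew"],
    ["the", "wheat", "harvest", "grew"],
]

kept = prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.75)
too_strict = prune_by_document_frequency(docs, min_df=3, max_df_ratio=0.75)
```

With these toy thresholds, `kept` retains the mid-frequency terms and drops "the", while raising `min_df` to 3 leaves `too_strict` empty. This illustrates the sensitivity to manually chosen thresholds that the abstract describes.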

Notes

Throughout the paper, we use features, words and terms interchangeably.
In text classification, category is the same as the class.
http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.
http://ana.cachopo.org/datasets-for-single-label-text-categorization.
https://scikit-learn.org/.

Author information

Authors and Affiliations

Department of Electrical Engineering, Riphah International University, Lahore, Pakistan
Muhammad Asim
Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan
Kashif Javed & Haroon A. Babri
Department of Computer Science, University of Gujrat, Gujrat, Pakistan
Abdur Rehman

Corresponding author

Correspondence to Kashif Javed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Asim, M., Javed, K., Rehman, A. et al. A new feature selection metric for text classification: eliminating the need for a separate pruning stage. Int. J. Mach. Learn. & Cyber. 12, 2461–2478 (2021). https://doi.org/10.1007/s13042-021-01324-6

Received: 12 June 2020
Accepted: 02 April 2021
Published: 11 April 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s13042-021-01324-6
