
A new feature selection metric for text classification: eliminating the need for a separate pruning stage

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

Terms that occur too frequently or too rarely across texts are not useful for text classification. Pruning can remove such irrelevant terms, reducing the dimensionality of the feature space and thus making feature selection more efficient and effective. Normally, pruning is performed by manually setting threshold values; however, incorrect thresholds can discard many useful terms or retain irrelevant ones. Existing feature ranking metrics can assign high ranks to these irrelevant terms, degrading the performance of a text classifier. In this paper, we propose a new feature ranking metric that can select the most useful terms in the presence of too frequently and too rarely occurring terms, thus eliminating the need for a separate pruning stage. To investigate the usefulness of the proposed metric, we compare it against seven well-known feature selection metrics on five data sets, namely Reuters-21578 (re0, re1, r8) and WebACE (k1a, k1b), using multinomial naive Bayes and support vector machine classifiers. Our results, based on a paired t-test, show that the performance of our metric is statistically significantly better than that of the other seven metrics.
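The pruning stage the abstract refers to is typically implemented as a pair of document-frequency thresholds. A minimal sketch of that conventional stage, assuming illustrative function and threshold names (not taken from the paper):

```python
from collections import Counter

def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Drop terms appearing in fewer than `min_df` documents (too rare)
    or in more than `max_df_ratio` of all documents (too frequent) --
    the manually thresholded pruning stage the paper aims to eliminate."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term for term, count in df.items()
            if count >= min_df and count / n_docs <= max_df_ratio}

docs = ["the cat sat", "the dog ran", "the cat ran", "a bird flew"]
vocab = prune_by_document_frequency(docs)
# Only "the", "cat", and "ran" satisfy both thresholds here.
```

As the abstract notes, the outcome depends entirely on `min_df` and `max_df_ratio`: thresholds set too aggressively discard useful terms, while thresholds set too loosely retain irrelevant ones.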



Notes

  1. Throughout the paper, we use features, words and terms interchangeably.

  2. In text classification, a category is the same as a class.

  3. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.

  4. http://ana.cachopo.org/datasets-for-single-label-text-categorization.

  5. https://scikit-learn.org/.
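The evaluation described in the abstract — ranking terms with a feature selection metric, feeding the top-k terms to a classifier, and comparing metrics with a paired t-test over cross-validation folds — can be sketched with scikit-learn (note 5) and SciPy. This is an assumption-laden illustration: the tiny two-class corpus stands in for the five data sets, and the chi-squared and mutual-information rankings stand in for the paper's proposed metric and its competitors, which are not reproduced here.

```python
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny two-class corpus standing in for re0/re1/r8 and k1a/k1b.
texts = ["stocks fell sharply today", "the market rallied on earnings",
         "shares of the bank dropped", "investors sold bonds quickly",
         "the fund posted strong returns", "trading volume hit a record",
         "analysts upgraded the stock", "the index closed lower again",
         "profits rose at the exchange", "currency markets stayed calm",
         "the team won the final match", "a late goal sealed the game",
         "the striker scored twice", "fans cheered the home side",
         "the coach praised his players", "a penalty decided the cup",
         "the keeper saved three shots", "the league title race tightened",
         "injuries hurt the away squad", "the derby ended in a draw"]
labels = [0] * 10 + [1] * 10

def cv_accuracy(ranking_metric, k=10):
    """5-fold accuracy of multinomial naive Bayes trained on the
    top-k terms selected by the given feature ranking metric."""
    pipe = make_pipeline(CountVectorizer(),
                         SelectKBest(ranking_metric, k=k),
                         MultinomialNB())
    return cross_val_score(pipe, texts, labels, cv=5)

acc_chi2 = cv_accuracy(chi2)
acc_mi = cv_accuracy(mutual_info_classif)
# Paired t-test over the per-fold accuracies of the two metrics.
t_stat, p_value = ttest_rel(acc_chi2, acc_mi)
```

Swapping `MultinomialNB()` for `sklearn.svm.LinearSVC()` gives the second classifier used in the paper's comparison.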


Author information

Correspondence to Kashif Javed.


About this article


Cite this article

Asim, M., Javed, K., Rehman, A. et al. A new feature selection metric for text classification: eliminating the need for a separate pruning stage. Int. J. Mach. Learn. & Cyber. 12, 2461–2478 (2021). https://doi.org/10.1007/s13042-021-01324-6
