Abstract
Terms that occur too frequently or rarely in various texts are not useful for text classification. Pruning can be used to remove such irrelevant terms reducing the dimensionality of the feature space and, thus making feature selection more efficient and effective. Normally, pruning is achieved by manually setting threshold values. However, incorrect threshold values can result in the loss of many useful terms or retention of irrelevant ones. Existing feature ranking metrics can assign higher ranks to these irrelevant terms, thus degrading the performance of a text classifier. In this paper, we propose a new feature ranking metric, which can select the most useful terms in the presence of these too frequently and rarely occurring terms, thus eliminating the need for pruning these terms. To investigate the usefulness of the proposed metric, we compare it against seven well-known feature selection metrics on five data sets namely Reuters-21578 (re0, re1, r8) and WebACE (k1a, k1b) using multinomial naive Bayes and support vector machines classifiers. Our results based on a paired t-test show that the performance of our metric is statistically significant than that of the other seven metrics.
Similar content being viewed by others
Notes
Throughout the paper, we use features, words and terms interchangeably.
In text classification, category is the same as the class.
https://scikit-learn.org/.
References
Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. Mining text data. Springer, Berlin, pp 163–222
Agnihotri D, Verma K, Tripathi P (2017) Variable global feature selection scheme for automatic classification of text documents. Expert Syst Appl 81:268–281
Ali MS, Javed K (2020) A novel inherent distinguishing feature selector for highly skewed text document classification. Arab J Sci Eng (In the press)
Asim M, Khan Z (2018) Mobile price class prediction using machine learning techniques. Int J Comput Appl 975:8887
Basu T, Murthy CA (2012) Effective text classification by a supervised feature selection approach. In: Proceedings of 2012 international conference on data mining, pp 918–925
Bolon-Canedo V, Sanchez-Marono N, Alonso-Betanzos A (2015) Feature selection for high-dimensional data. Springer International Publishing, Cham
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth and Brooks, Monterey
Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa
Chen H, Schuffels C, Orwig R (1996) Internet categorization and search: a self-organizing approach. J Vis Commun Image Represent 7(1):88–102
Chen J, Huang H, Tian S, Qu Y (2009) Feature selection for text classification with Naive Bayes. Expert Syst Appl 36(3):5432–5435
Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297
Cunha W, Canuto S, Viegas F, Salles T, Gomes C, Mangaravite V, Resende E, Rosa T, Gonçalves MA, Rocha L (2020) Extended pre-processing pipeline for text classification: on the role of meta-feature representations, sparsification and selective sampling. Inf Process Manag 57(4):102263
Dong T, Shang W, Zhu H (2011) Naive Bayesian classifier based on the improved feature weighting algorithm. Advanced research on computer science and information engineering. Springer, Berlin Heidelberg, pp 142–147
Flach P (2012) Machine learning the art and science of algorithms that make sense of data. Cambridge University Press, Cambridge
Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305
Forman G (2008) Feature selection for text classification. Computational methods of feature selection. Chapman and Hall/CRC, Boca Raton, pp 257–276
Ge S, Zhuang Y, Hu Y, Ai X (2019) Research on enterprise hidden danger association rules based on text analysis. IOP Conf Ser Earth Environ Sci 252:032170
Ghareb AS, Bakar AA, Hamdan AR (2016) Hybrid feature selection based on enhanced genetic algorithm for text categorization. Expert Syst Appl 49:31–47
Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21:267–297
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Guyon I, Gunn S, Nikravesh M, Zadeh L (2006) Feature extraction: foundations and applications. Springer, Berlin
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422
Han EH, Boley D, Gini M, Gross R, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1998) WebACE: a web agent for document categorization and exploration. In: Proceedings of the second international conference on autonomous agents, pp 408–415
James J (2019) Data never sleeps 7.0. https://www.domo.com/learn/data-never-sleeps-7. Accessed: 1 Aug 2019
Javed K, Babri H, Saeed M (2012) Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans Knowl Data Eng 24(3):465–477
Javed K, Babri HA, Saeed M (2014) Impact of a metric of association between two variables on performance of filters for binary data. Neurocomputing 143:248–260
Javed K, Maruf S, Babri HA (2015) A two-stage Markov blanket based feature selection algorithm for text classification. Neurocomputing 157:91–104
Javed K, Saeed M, Babri HA (2014) The correctness problem: evaluating the ordering of binary features in rankings. Knowl Inf Syst 39(3):543–563
Jia X, Sun J (2012) An improved text classification method based on Gini index. J Theor Appl Inf Technol 43:267–273
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of 10th European conference on machine learning (ECML), pp 137–142
Joachims T (2002) Learning to classify text using support vector machines. Kluwer Academic Publishers, Dordrecht
Joshi H, Pareek J, Patel R, Chauhan K (2012) To stop or not to stop experiments on stopword elimination for information retrieval of gujarati text documents. In: Nirma University international conference on engineering (NUiCONE), pp 1–4
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical Report 1996-77, Stanford InfoLab
Kou G, Yang P, Peng Y, Xiao F, Chen Y, Alsaadi FE (2020) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput J 86:105836
Labani M, Moradi P, Ahmadizar F, Jalili M (2018) A novel multivariate filter method for feature selection in text classification problems. Eng Appl Artif Intell 70:25–37
Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
Li X, Xie H, Chen L, Wang J, Deng X (2014) News impact on stock price return via sentiment analysis. Knowl-Based Syst 69(1):14–23
Li Y, Li T, Liu H (2017) Recent advances in feature selection and its applications. Knowl Inf Syst 53(3):551–577
Liu H, Motoda H (2008) Computational methods of feature selection. Taylor & Francis Group, LLC, Oxfordshire
Liu H, Zhou M, Lu XS, Yao C (2018) Weighted Gini index feature selection method for imbalanced data. In: 2018 IEEE 15th international conference on networking, sensing and control (ICNSC), pp 1–6
Maruf S, Javed K, Babri HA (2016) Improving text classification performance with random forests-based feature selection. Arab J Sci Eng 41:951–964
McCallum A, Rosenfeld R, Mitchell TM, Ng AY (1998) Improving text classification by shrinkage in a hierarchy of classes. In: Proceedings of the fifteenth international conference on machine learning, ICML ’98, pp 359–367
Mironczuk M, Protasiewicz J (2018) A recent overview of the state-of-the-art elements of text classification. Expert Syst Appl 106:36–54
Mirończuk MM, Protasiewicz J, Pedrycz W (2019) Empirical evaluation of feature projection algorithms for multi-view text classification. Expert Syst Appl 130:97–112
Navidi W (2015) Statistics for engineers and scientists, 4th edn. McGraw-Hill Education, New York
Ogura H, Amano H, Kondo M (2009) Feature selection with a measure of deviations from Poisson in text categorization. Decis Support Syst 36(3):6826–6832
Ogura H, Amano H, Kondo M (2011) Comparison of metrics for feature selection in imbalanced text classification. Expert Syst Appl 38(5):4978–4989
Park H, Kwon H (2011) Improved Gini-index algorithm to correct feature-selection bias in text classification. IEICE Trans Inf Syst 94–D(4):855–865
Park H, Kwon S, Kwon H (2010) Complete Gini-index text (GIT) feature-selection algorithm for text classification. In: The 2nd international conference on software engineering and data mining, pp 366–371
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Purnomoputra RB, Adiwijaya Wisesty UN (2019) Sentiment analysis of movie review using Naïve Bayes method with Gini index feature selection. J Data Sci Appl 2:85–94
Raileanu L, Stoffel K (2004) Theoretical comparison between the Gini index and information gain criteria. Ann Math Artif Intell 41:77–93
Rao Y, Xie H, Li J, Jin F, Wang FL, Li Q (2016) Social emotion classification of short text via topic-level maximum entropy model. Inf Manag 53(8):978–986
Rehman A, Javed K, Babri HA (2017) Feature selection based on a normalized difference measure for text classification. Inf Process Manag 53(2):473–489
Rehman A, Javed K, Babri HA, Asim N (2018) Selection of the most relevant terms based on a max–min ratio metric for text classification. Expert Syst Appl 114:78–96
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47
Shang S, Shi M, Shang W, Hong Z (2016) Improved feature weight algorithm and its application to text classification. Math Probl Eng 2016:1–12
Shang W, Huang H, Zhu H, Lin Y (2007) A novel feature selection algorithm for text categorization. Expert Syst Appl 33:1–5
Srividhya V, Anitha R (2011) Evaluating preprocessing techniques in text categorization. Int J Comput Sci Appl 47(11):49–51
Stigler SM (1983) Who discovered Bayes’s theorem? Am Stat 37(4a):290–296
Su J, Shirab JS, Matwin S (2011) Large scale text classification using semi-supervised multinomial Naive Bayes. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 97–104
Uysal AK, Gunal S (2012) A novel probabilistic feature selection method for text classification. Knowl-Based Syst 36:226–235
Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature selection approach based on term frequency for text categorization. Pattern Recognit Lett 45:1–10
Wang H, Hong M (2019) Supervised Hebb rule based feature selection for text classification. Inf Process Manag 56(1):167–191
Wang Y, Feng L (2018) A new feature selection method for handling redundant information in text classification. Front Inf Technol Electron Eng 19:221–234
Witte RS, Witte JS (2010) Statistics, 9th edn. Wiley, New York
Wu Y, Zhang A (2004) Feature selection for classifying high-dimensional numerical data. In: Proceedings of IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2
Zhang W, Yoshida T, Tang X (2011) A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst Appl 38:2758–2765
Zheng Z, Wu X, Srihari R (2004) Feature selection for text categorization on imbalanced data. ACM SIGKDD Explor Newsl 6(1):80–89
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Asim, M., Javed, K., Rehman, A. et al. A new feature selection metric for text classification: eliminating the need for a separate pruning stage. Int. J. Mach. Learn. & Cyber. 12, 2461–2478 (2021). https://doi.org/10.1007/s13042-021-01324-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13042-021-01324-6