Several alternative term weighting methods for text representation and classification
Introduction
Automatic text classification (TC) technology can efficiently organize and categorize text collections that are increasing dramatically [1]; it thus eliminates a large amount of human effort [2] and has attracted wide attention in recent years [3], [4]. The goal of the TC task is to categorize unlabeled texts into predefined classes based on their topics [5]; hence, an automatic text classifier can be established in a learning process over a set of prelabeled texts [6], [7]. Before classifiers can be applied, every term in a text must first be assigned a numerical value (weight) by an appropriate term weighting scheme, a step called text representation [8], [9]. The vector space model (VSM) is the most popular way to represent texts [10], [11]; it usually treats a text as a set of terms, the so-called bag-of-words (BoW) model [12], [13]. VSM represents a text collection as a document-term matrix, in which each term weight reflects the importance of a certain term in a certain document [14], [15]; each row denotes one document vector, whereas each column corresponds to one of the distinct terms (i.e., selected features).
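The document-term matrix described above can be sketched in a few lines; the toy corpus and its terms are purely illustrative:

```python
from collections import Counter

# Toy corpus; documents and terms here are illustrative only.
docs = ["wheat corn export", "wheat price rise", "stock price fall"]

# Vocabulary: the distinct terms (the columns of the document-term matrix).
vocab = sorted({t for d in docs for t in d.split()})

# Each row is one document vector; entries are raw term counts,
# which a term weighting scheme would then replace with weights.
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts.get(t, 0) for t in vocab])
```

In practice the raw counts in each row are replaced by the output of a term weighting scheme such as those discussed below.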
Term weighting is critical to the TC task and has a direct and significant effect on classification performance [16]. At present, term weighting approaches are generally grouped into unsupervised and supervised methods according to whether they exploit the class information of the training texts [17], [18]. Unsupervised term weighting (UTW) methods neglect class information, whereas supervised term weighting (STW) methods exploit category information when calculating weights. Among UTW schemes, term frequency (TF) and TF–IDF (term frequency–inverse document frequency) are commonly used. TF is one of the simplest weighting methods, but it is a purely local approach because it only considers how many times a term occurs in a text. To overcome this drawback, the inverse document frequency (IDF), which is concerned with how many texts a term appears in, was combined with TF to form the TF–IDF scheme. Note that TF–IDF was primarily designed for information retrieval (IR) rather than for TC tasks [10], [19].
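A minimal TF–IDF computation over a toy corpus can illustrate the local (TF) and global (IDF) factors; the documents are illustrative, and real systems typically add smoothing and normalization:

```python
import math
from collections import Counter

docs = [["wheat", "corn", "wheat"], ["wheat", "price"], ["stock", "price", "fall"]]
N = len(docs)

def idf(term):
    # IDF: log of the total number of documents over the number
    # of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(doc):
    # Local TF (raw count) scaled by the global IDF factor.
    counts = Counter(doc)
    return {t: tf * idf(t) for t, tf in counts.items()}
```

A term that appears in every document gets IDF log(1) = 0, which is the weakness of TF–IDF that several of the schemes below try to address.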
Unlike the IR task, the TC task aims to discriminate between classes, not texts [20], and thus it should take category factors into account when computing term weights. For that reason, TC can be regarded as a supervised learning task [9], [17], [21], [22]. Most recent STW methods originate from feature selection schemes; they adopt category information in several ways, which can be summarized as follows. First, TF-CHI2, TF-IG and TF-GR were proposed based on the feature selection measures Chi-square statistic (CHI2), information gain (IG) and gain ratio (GR) [18]. Since then, various similar STW methods have been presented, for example, the odds ratio (OR) weighting factor in TF-OR [14], [23], [24], the mutual information (MI) weighting factor in TF-MI [14], [23], the probability-based (PB) weighting factor in TF-PB [24], and the correlation coefficient weighting factor in TF-CC [24].
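As one concrete instance of these feature-selection-based factors, a TF-CHI2 weight can be sketched from a term/class contingency table; the counts used below are hypothetical:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic from a term/class 2x2 contingency table:
    a = docs of the class containing the term, b = other docs containing it,
    c = docs of the class without it,          d = other docs without it."""
    n = a + b + c + d
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

# TF-CHI2: the term's in-document frequency times this global factor.
tf = 3                                     # hypothetical raw frequency
w = tf * chi_square(a=40, b=5, c=10, d=45) # hypothetical counts
```

A term distributed evenly across classes scores zero, so the factor rewards terms concentrated in one class.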
Apart from these schemes, a variety of STW schemes derived from TF–IDF have been proposed. Initially, inspired by the IDF part of the TF–IDF scheme, the inverse class frequency (ICF) was introduced; it reflects the observation that a key term of a specific class usually appears in only a few categories [25]. However, because the number of categories is generally quite small, a given term may occasionally occur in multiple categories or sometimes even in all of them [25], [26]. As a result, ICF fails to capture the importance of a term under such circumstances. To enhance a term's distinguishing power, ICF was incorporated into the TF–IDF–ICF scheme [14], [25]. Meanwhile, in [14], the authors pointed out that the TF–IDF–ICF scheme overemphasizes rare terms. To alleviate this problem, they redesigned ICF and proposed a novel scheme called TF–IDF–ICSDF. Besides these, Lan et al. [17] argued that the distinguishing ability of a term depends only on the relevant texts that contain it, and hence presented a novel STW scheme by replacing the IDF in TF–IDF with relevance frequency (RF). Many studies have demonstrated that TF–RF achieves better performance than most STW and UTW approaches [9], [17], [23]. Moreover, in [22], the authors claimed that TF–IDF is not necessarily suitable for TC tasks, and two novel term weighting methods, TF-IGM and RTF-IGM, were presented. Further, to address the drawbacks of TF-IGM, two improved IGM-based weighting methods have also been developed [27].
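The RF factor of TF–RF, as defined by Lan et al. [17], can be sketched as follows; a and c denote the numbers of positive- and negative-class documents that contain the term, so only documents containing the term influence the weight:

```python
import math

def rf(a, c):
    # Relevance frequency: log2(2 + a / max(1, c)).
    # a = positive-class docs containing the term,
    # c = negative-class docs containing the term.
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    # TF-RF: term frequency scaled by the relevance-frequency factor.
    return tf * rf(a, c)
```

The constant 2 guarantees a nonzero weight even when the term never occurs in the positive class.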
We must note, however, that STW schemes are not always superior to UTW schemes [17], [20], [26], [28]. Moreover, as emphasized in [22], most STW schemes show their superiority only in specific TC tasks; in fact, they cannot consistently yield the best classification performance, yet they require more storage space and running time [14]. For these reasons, this paper focuses on UTW methods. We drew inspiration from the IDF part of TF–IDF, which is defined as a logarithmic transformation and for which many works have sought to build a theoretical basis [29]. Hence, we naturally ask whether there are other nonlinear transformation forms, akin to the logarithmic transformation, that can achieve better classification performance than existing ones. This is the first question we wish to address in this research. In addition, because test texts do not carry any prior category label information [30], [31], how to properly represent test texts is a difficult but significant task [30], especially for STW schemes, which need category information in the weighting process; these include TF–RF [17], TF–IDF–ICSDF [14], TF-IGM [22], RTF-IGM [27] and so on. Nevertheless, most existing schemes [31], including TF–IDF, do not clearly explain how to represent test texts. Naturally, we raise the second question: how should test texts be represented for the TC task? Beyond these two questions, a third question will be raised in Section 3, and its answer will be presented at the end of this article. Our contributions in this work can be summarized as follows. First, several nonlinear transformation methods are introduced and compared with the well-known TF–IDF, which adopts the logarithmic ratio. Second, we clearly explain how to represent test texts in the weighting process, using a classic UTW method and three typical STW methods with different characteristics.
Third, a nonlinear transformation method is designed for TF, which performs better than the square-root-based TF. Finally, we conduct an extensive experimental comparison of our schemes with existing UTW and STW methods. The experimental results demonstrate the effectiveness and superiority of the proposed schemes.
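The square-root-based TF mentioned above, together with the common logarithmic variant, can be illustrated as follows; the paper's own transformation is not reproduced here:

```python
import math

def sqrt_tf(tf):
    # Square-root dampening of the raw term frequency.
    return math.sqrt(tf)

def log_tf(tf):
    # Logarithmic dampening; zero counts stay zero.
    return 1 + math.log(tf) if tf > 0 else 0.0
```

Both transforms compress large raw counts so that a term occurring 100 times is not treated as 100 times more important than a term occurring once.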
The rest of this manuscript is outlined as follows. Section 2 takes several existing term weighting methods as examples to clearly show how to represent test texts. In Section 3, we propose four UTW methods for TC. The experimental settings are explained in Section 4. In Section 5, our findings are presented and discussed. We conclude our work in Section 6.
Section snippets
Analysis of current term weighting schemes
There is no doubt that term weighting is essential for the TC task [32]; it measures the importance of a term (feature) in representing the content of a text [14], [15]. Next, we analyze in depth several term weighting approaches (unsupervised and supervised) that were recently proposed or are closely related to ours, namely TF–IDF, TF–RF, TF–IDF–ICSDF, and RTF-IGM. Importantly, through these methods we clearly explain how to represent test texts.
Proposed unsupervised term weighting schemes
From the above arguments, it should be clear that choosing an appropriate metric function for weighting terms is the key to obtaining high-quality TC performance [7]. Although the TF–IDF weighting scheme is borrowed from the IR field and ignores the available category information of training texts [22], we are convinced that the TF–IDF scheme is reasonable in TC tasks; the rationale can be interpreted via three fundamental assumptions [18], i.e., the TF assumption, IDF assumption, and
Experimental settings
In this study, we use two public text classification datasets to validate the performance of our schemes, namely the Reuters-21578 and 20 Newsgroups datasets [37]. The Reuters-21578 dataset has 8 categories, with 5485 training texts and 2189 test texts. The 20 Newsgroups corpus has 20 categories, with 11,293 training texts and 7528 test texts. Moreover, we omit terms that are shorter than two characters or occur fewer than two times. In addition to that, we use Porter
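The stated term filtering can be sketched as follows, assuming terms shorter than two characters or occurring fewer than two times are both removed; Porter stemming is omitted for brevity:

```python
from collections import Counter

def filter_terms(tokenized_docs):
    # Count each term across the whole collection, then keep only terms
    # with length >= 2 characters and collection frequency >= 2.
    counts = Counter(t for doc in tokenized_docs for t in doc)
    keep = {t for t, n in counts.items() if len(t) >= 2 and n >= 2}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]
```

Applied before weighting, this removes both single-character noise and hapax terms that carry little statistical evidence.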
Experimental results and analysis
In this part, an extensive experimental comparison of our schemes with existing UTW and STW schemes is performed. All the schemes (summarized in Table 2) are normalized except for RTF-IGM; the reasons can be found in [22] and will not be re-explained here.
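The normalization applied to most of the compared schemes is presumably the standard cosine (L2) normalization of document vectors, which can be sketched as:

```python
import math

def cosine_normalize(vec):
    # L2 (cosine) normalization: divide each weight by the vector's
    # Euclidean norm so that document length does not dominate.
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec
```

After normalization every non-zero document vector has unit length, so inner products between vectors reduce to cosine similarities.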
Conclusions
The overall purpose of this work is to resolve the three questions raised in this paper, which has been successfully achieved through three groups of experiments on two benchmark datasets with MNB and SVM classifiers. Concerning the first question, we introduced four nonlinear transformation methods as the global factor in UTW schemes, analogous to the IDF in TF–IDF, which adopts a logarithmic transformation. Comparison of the proposed schemes with the popular TF–IDF indicates that our
CRediT authorship contribution statement
Zhong Tang: Conceptualization, Methodology, Software, Writing - original draft. Wenqiang Li: Funding acquisition, Data curation. Yan Li: Validation, Supervision. Wu Zhao: Funding acquisition, Investigation. Song Li: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Key Research and Development Program, China (No. 2018YFB1700702), the Science & Technology Ministry Innovation Method Program, China (No. 2017IM040100), the Sichuan Major Science and Technology Project, China (No. 2019ZDZX0001), and the Sichuan Applied Foundation Project, China (No. 2018JY0119).
References (46)
- et al., Nonlinear transformation of term frequencies for term weighting in text categorization, Eng. Appl. Artif. Intell. (2012)
- et al., Feature selection via maximizing global information gain for text classification, Knowl.-Based Syst. (2013)
- et al., An automated text categorization framework based on hyperparameter optimization, Knowl.-Based Syst. (2018)
- et al., Fast text categorization using concise semantic analysis, Pattern Recognit. Lett. (2011)
- et al., Comparison of text feature selection policies and using an adaptive framework, Expert Syst. Appl. (2013)
- et al., Learning short-text semantic similarity with word embeddings and external knowledge sources, Knowl.-Based Syst. (2019)
- et al., Class-indexing-based term weighting for automatic text classification, Inform. Sci. (2013)
- et al., An alternative framework for univariate filter based feature selection for text categorization, Pattern Recognit. Lett. (2018)
- et al., Balancing between over-weighting and under-weighting in supervised term weighting, Inf. Process. Manage. (2017)
- et al., Term-weighting learning via genetic programming for text classification, Knowl.-Based Syst. (2015)
- Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl.
- Analytical evaluation of term weighting schemes for text categorization, Pattern Recognit. Lett.
- Imbalanced text classification: A term weighting approach, Expert Syst. Appl.
- Improved inverse gravity moment term weighting for text classification, Expert Syst. Appl.
- Knowledge-enhanced document embeddings for text classification, Knowl.-Based Syst.
- Insurance pricing and increased limits ratemaking by proportional hazards transforms, Insurance Math. Econom.
- A two-stage feature selection method for text categorization, Comput. Math. Appl.
- Subspace learning for unsupervised feature selection via matrix factorization, Pattern Recognit.
- A novel multivariate filter method for feature selection in text classification problems, Eng. Appl. Artif. Intell.
- Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput.
- A new text categorization technique using distributional clustering and learning logic, IEEE Trans. Knowl. Data Eng.
- Machine learning in automated text categorization, ACM Comput. Surv.
- Toward optimal feature selection in Naive Bayes for text categorization, IEEE Trans. Knowl. Data Eng.