Elsevier

Knowledge-Based Systems

Volume 207, 5 November 2020, 106399

Several alternative term weighting methods for text representation and classification

https://doi.org/10.1016/j.knosys.2020.106399

Abstract

Text representation is a hot topic supporting text classification (TC) tasks, and it has a substantial impact on TC performance. Although the well-known TF–IDF was originally designed for information retrieval rather than TC, it remains highly useful in TC as a term weighting method for representing text content. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, we propose in this study several alternative methods for generating unsupervised term weighting schemes that offset the drawback confronting TF–IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential, especially for supervised term weighting approaches (e.g., TF–RF), since these methods use category information when weighting terms. However, most current schemes do not clearly explain how test texts should be represented. To explore this problem and seek a reasonable solution, we analyze a classic unsupervised term weighting method and three typical supervised term weighting methods in depth to illustrate how to represent test texts. To investigate the effectiveness of our work, three sets of experiments are designed to compare their performance. The comparisons show that our proposed methods can indeed enhance TC performance, sometimes even outperforming existing supervised term weighting methods.

Introduction

Automatic text classification (TC) technology can efficiently organize and categorize the dramatically increasing volume of text [1]; it thus eliminates a large amount of human effort [2] and has attracted wide attention in recent years [3], [4]. The goal of a TC task is to categorize unlabeled texts into predefined classes based on their topics [5]; hence, given a set of prelabeled texts, an automatic text classifier can be established in the learning process [6], [7]. Before applying classifiers, every term in a text must first be assigned a numerical value (weight) via an appropriate term weighting scheme; this process is called text representation [8], [9]. The vector space model (VSM) is the most popular way to represent texts [10], [11]; it usually treats a text as a set of terms, namely the bag-of-words (BoW) model [12], [13]. The VSM represents a text collection as a document-term matrix, in which each weight reflects the importance of a certain term tj in a certain document dk [14], [15]; each row denotes one document vector, whereas each column corresponds to one of the distinct terms (i.e., selected features).
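The document-term matrix described above can be sketched as follows; this is an illustrative toy example (the corpus and raw-count weights are placeholders, to be replaced by the weighting schemes discussed later):

```python
from collections import Counter

# Toy corpus (assumption): each document is a whitespace-tokenizable string.
docs = ["the cat sat on the mat", "the dog sat"]

# Distinct terms (selected features) form the columns of the matrix.
vocab = sorted({t for d in docs for t in d.split()})

def to_vector(doc):
    # One row of the document-term matrix: raw term counts per vocabulary term.
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

matrix = [to_vector(d) for d in docs]
```

Each row of `matrix` is one document vector; the weighting schemes below replace the raw counts with TF-, IDF-, or class-based weights.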

Term weighting is critical to the TC task and has a direct and significant effect on classification performance [16]. At present, term weighting approaches are generally grouped into unsupervised and supervised according to whether they embrace the class information of training texts [17], [18]. Unsupervised term weighting (UTW) methods neglect class information, whereas supervised term weighting (STW) methods exploit category information when calculating weights. Among UTW schemes, term frequency (TF) and TF–IDF (term frequency–inverse document frequency) are commonly used. TF is one of the simplest weighting methods, but it is a local weighting approach because it considers only how many times a term occurs in a text. To overcome this drawback, the inverse document frequency (IDF) was designed, yielding the TF–IDF scheme; IDF is concerned with how many texts a term appears in. Note that TF–IDF was primarily designed for information retrieval (IR) rather than TC tasks [10], [19].
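The local TF factor and the global IDF factor can be sketched as follows (a minimal illustration of the classic formulation, TF times log of N over document frequency; the toy corpus is an assumption):

```python
import math

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents
    # containing the term). Terms appearing in fewer texts get larger weights.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Local factor (raw term frequency in this document) times the
    # global IDF factor computed over the whole collection.
    return doc.count(term) * idf(term, docs)

# Toy corpus (assumption): each document is a list of tokens.
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
```

Here "a" occurs in only one of three documents, so it receives a larger global factor than "b", which occurs in two.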

Unlike the IR task, the TC task aims to discriminate between different classes, not texts [20], and thus it should take category factors into account when computing term weights. For that reason, TC can be regarded as a supervised learning task [9], [17], [21], [22]. Most recent STW methods originate from feature selection schemes; these methods adopt category information in several ways, which can be summarized as follows. First, TF-CHI2, TF-IG, and TF-GR were proposed based on feature selection approaches (i.e., the Chi-square statistic (CHI2), information gain (IG), and gain ratio (GR)) [18]. Since then, various STW methods similar to the above schemes have been presented, for example, the odds ratio (OR) weighting factor in TF-OR [14], [23], [24], the mutual information (MI) weighting factor in TF-MI [14], [23], the probability-based (PB) weighting factor in TF-PB [24], and the correlation coefficient weighting factor in TF-CC [24].
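As a concrete illustration of how such feature-selection statistics become weighting factors, the standard chi-square statistic on a (term, category) 2x2 contingency table can serve as the global factor in place of IDF. This is a sketch of the general idea, not the exact formulation of any cited scheme; the variable names are ours:

```python
def chi2(a, b, c, d):
    # 2x2 contingency counts for a (term, category) pair:
    # a = in-category docs containing the term, b = in-category docs without it,
    # c = out-of-category docs containing it, d = out-of-category docs without it.
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def tf_chi2(tf, a, b, c, d):
    # TF-CHI2-style weight: local term frequency times the chi-square
    # global factor measuring term-category dependence.
    return tf * chi2(a, b, c, d)
```

A term occurring only inside one category gets a large chi-square value, while a term spread evenly across categories scores zero.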

Apart from these schemes, a variety of STW schemes derived from TF–IDF have been proposed. Initially, inspired by the IDF in the TF–IDF scheme, the inverse class frequency (ICF) was introduced; it reflects the observation that a key term of a specific class usually appears in only a few categories [25]. However, because the number of categories is generally quite small, a certain term may occasionally exist in multiple categories or sometimes even in all of them [25], [26]. As a result, ICF fails to promote the importance of a term under certain circumstances. To enhance a term’s distinguishing power, ICF was incorporated to generate the TF–IDF–ICF scheme [14], [25]. Meanwhile, in [14], the authors pointed out that the TF–IDF–ICF scheme overemphasizes rare terms. To alleviate this problem, they redesigned ICF and proposed a novel scheme called TF–IDF–ICSDF. Besides these, Lan et al. [17] argued that the distinguishing ability of a term depends only on the relevant texts that contain it, and hence they presented a novel STW scheme by replacing the IDF in TF–IDF with RF (relevance frequency). More importantly, results in many studies have demonstrated that TF–RF achieves better performance than most STW and UTW approaches [9], [17], [23]. Moreover, in [22], the authors claimed that TF–IDF is not necessarily suitable for the TC task; hence, two novel term weighting methods called TF-IGM and RTF-IGM were presented. Further, to deal with the drawback of TF-IGM, two improved weighting methods based on IGM have also been developed [27].
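Taking TF–RF as an example, the commonly cited formulation from Lan et al. multiplies TF by log2(2 + a / max(1, c)), where a and c count the documents containing the term in the positive and negative class, respectively. A minimal sketch:

```python
import math

def rf(a, c):
    # Relevance frequency: a = positive-class documents containing the term,
    # c = negative-class documents containing it. max(1, c) avoids division
    # by zero; the constant 2 keeps the factor at least 1.
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    # TF-RF weight: local term frequency times the class-aware RF factor.
    return tf * rf(a, c)
```

A term concentrated in the positive class (large a, small c) thus gets a boosted weight, while a term absent from the positive class gets the baseline factor of 1.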

We must note, however, that not all STW schemes are always superior to UTW schemes [17], [20], [26], [28]. Moreover, as emphasized in [22], most STW schemes show superiority in some specific TC tasks but cannot consistently yield the best classification performance, while requiring more storage space and running time [14]. For these reasons, this paper focuses on UTW methods. We have drawn inspiration from the IDF part of TF–IDF, which is defined as a logarithmic transformation, and many works have sought to build a theoretical basis for it [29]. Hence, we naturally ask whether there are nonlinear transformation forms other than the logarithm that can achieve better classification performance than existing ones. This is the first question we wish to address in this research. Besides that, because test texts do not have any prior category label information [30], [31], how to properly represent them is a difficult but significant task [30], especially for STW schemes, since these methods use category information in the weighting process; such schemes include TF–RF [17], TF–IDF–ICSDF [14], TF-IGM [22], RTF-IGMimp [27], and so on. Nevertheless, most existing schemes, including TF–IDF, do not clearly explain how to represent test texts [31]. Naturally, we raise the second question: how should test texts be represented for the TC task? Besides these two questions, a third question will be raised in Section 3, and its answer will be presented at the end of this article. Our contributions in this work can be summarized as follows. First, several nonlinear transformation methods are introduced and compared with the famous TF–IDF, which adopts the logarithmic ratio. Second, we clearly explain how to represent test texts in the weighting process through a classic UTW method and three typical STW methods with different characteristics. Third, a nonlinear transformation method is designed and used for TF, which performs better than the square-root-function-based TF. Finally, we conduct an extensive experimental comparison of our schemes with existing UTW and STW methods. Experimental results demonstrate the effectiveness and superiority of our proposed schemes.

The rest of this manuscript is outlined as follows. Section 2 takes several existing term weighting methods as examples to clearly show how to represent test texts. In Section 3, we propose four UTW methods for TC. The experimental settings are explained in Section 4. In Section 5, our findings are presented and discussed. We conclude our work in Section 6.

Section snippets

Analysis of current term weighting schemes

There is no doubt that term weighting is essential for the TC task [32]; it measures the importance of a term (feature) in representing the content of a text [14], [15]. Next, we analyze four term weighting approaches (unsupervised and supervised) in depth, because they were recently proposed or are closely related to ours: TF–IDF, TF–RF, TF–IDF–ICSDF, and RTF-IGMimp. Importantly, through these methods we clearly explain how to represent test texts.

Proposed unsupervised term weighting schemes

From the above arguments, choosing an appropriate metric function for weighting terms is the key to obtaining high-quality TC performance [7]. Although the TF–IDF weighting scheme is borrowed from the IR field and ignores the available category information of training texts [22], we are convinced that it is reasonable for TC tasks. The rationale can be interpreted via three fundamental assumptions [18], i.e., the TF assumption, the IDF assumption, and
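The paper's concrete alternative transforms are not included in this snippet; the sketch below only illustrates the general idea of replacing the logarithm in IDF with other concave nonlinear transforms of the ratio N/df, using hypothetical corpus statistics:

```python
import math

# Hypothetical corpus statistics for illustration only.
N, df = 1000, 10          # total documents, documents containing the term

ratio = N / df
candidates = {
    "log":  math.log(ratio),    # classic IDF transform
    "sqrt": math.sqrt(ratio),   # one possible square-root alternative (assumption)
    "cbrt": ratio ** (1 / 3),   # one possible cube-root alternative (assumption)
}
# All three are concave and increase with N/df, so rarer terms still receive
# larger global weights; only the growth rate differs between transforms.
```

Which growth rate yields the best classification performance is exactly the empirical question the experiments address.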

Experimental settings

In this study, we use two public text classification datasets to validate the performance of our schemes, namely the Reuters-21578 and 20 Newsgroups datasets [37]. The Reuters-21578 dataset has 8 categories, comprising 5485 training texts and 2189 test texts. There are 20 categories in the 20 Newsgroups corpus, comprising 11,293 training texts and 7528 test texts. Moreover, we omit terms shorter than two characters or occurring fewer than two times. In addition, we use the Porter
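The term-filtering step described above can be sketched as follows (stemming is omitted; tokenized documents and the function name are our assumptions):

```python
from collections import Counter

def filter_terms(tokenized_docs, min_len=2, min_count=2):
    # Drop terms shorter than two characters or occurring fewer than two
    # times across the corpus, as in the preprocessing described above.
    counts = Counter(t for d in tokenized_docs for t in d)
    keep = {t for t, c in counts.items() if len(t) >= min_len and c >= min_count}
    return [[t for t in d if t in keep] for d in tokenized_docs]
```

The corpus-level count filter removes hapax terms, and the length filter removes single-character tokens, shrinking the vocabulary before weighting.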

Experimental results and analysis

In this part, an extensive experimental comparison of our schemes with existing UTW and STW schemes is performed. Moreover, all the schemes (summarized in Table 2) are normalized except for RTF-IGMimp; the reasons can be found in [22] and will not be re-explained here.

Conclusions

The overall purpose of this work is to resolve the three questions raised in this paper, and this has been successfully achieved through three groups of experiments on two benchmark datasets with MNB and SVM classifiers. Concerning the first question, we introduced four nonlinear transformation methods as the global factor in UTW schemes, just as the IDF in TF–IDF adopts a logarithmic transformation. Comparison of the proposed schemes with the popular TF–IDF indicates that our

CRediT authorship contribution statement

Zhong Tang: Conceptualization, Methodology, Software, Writing - original draft. Wenqiang Li: Funding acquisition, Data curation. Yan Li: Validation, Supervision. Wu Zhao: Funding acquisition, Investigation. Song Li: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program, China (No. 2018YFB1700702), the Science & Technology Ministry Innovation Method Program, China (No. 2017IM040100), the Sichuan Major Science and Technology Project, China (No. 2019ZDZX0001), and the Sichuan Applied Foundation Project, China (No. 2018JY0119).

References (46)

  • Chen, K.W., et al., Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. (2016)
  • Altınçay, H., et al., Analytical evaluation of term weighting schemes for text categorization, Pattern Recognit. Lett. (2010)
  • Liu, Y., et al., Imbalanced text classification: A term weighting approach, Expert Syst. Appl. (2009)
  • Dogan, T., et al., Improved inverse gravity moment term weighting for text classification, Expert Syst. Appl. (2019)
  • Sinoara, R.A., et al., Knowledge-enhanced document embeddings for text classification, Knowl.-Based Syst. (2019)
  • Wang, S., Insurance pricing and increased limits ratemaking by proportional hazards transforms, Insurance Math. Econom. (1995)
  • Meng, J.N., et al., A two-stage feature selection method for text categorization, Comput. Math. Appl. (2011)
  • Wang, S.P., et al., Subspace learning for unsupervised feature selection via matrix factorization, Pattern Recognit. (2015)
  • Labani, M., et al., A novel multivariate filter method for feature selection in text classification problems, Eng. Appl. Artif. Intell. (2018)
  • Sabbah, T., et al., Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput. (2017)
  • Al-Mubaid, H., et al., A new text categorization technique using distributional clustering and learning logic, IEEE Trans. Knowl. Data Eng. (2006)
  • Sebastiani, F., Machine learning in automated text categorization, ACM Comput. Surv. (2002)
  • Tang, B., et al., Toward optimal feature selection in Naive Bayes for text categorization, IEEE Trans. Knowl. Data Eng. (2016)