Elsevier

Knowledge-Based Systems

Volume 207, 5 November 2020, 106399

Several alternative term weighting methods for text representation and classification

https://doi.org/10.1016/j.knosys.2020.106399

Abstract

Text representation is a hot topic supporting text classification (TC) tasks, and it has a substantial impact on TC performance. Although the well-known TF–IDF was originally designed for information retrieval rather than TC, it remains highly useful in TC as a term weighting method for representing text content. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, we propose in this study several alternative methods for generating unsupervised term weighting schemes that offset the drawback confronting TF–IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential, especially for supervised term weighting approaches (e.g., TF–RF), since these methods use category information when weighting terms. However, most current schemes do not clearly explain how test texts should be represented. To explore this problem and seek a reasonable solution, we analyze a classic unsupervised term weighting method and three typical supervised term weighting methods in depth to illustrate how to represent test texts. To investigate the effectiveness of our work, three sets of experiments are designed to compare their performance. The comparisons show that our proposed methods can indeed enhance TC performance, sometimes even outperforming existing supervised term weighting methods.

Introduction

Automatic text classification (TC) technology can efficiently organize and categorize the dramatically increasing volume of text [1]; it thus eliminates a large amount of human effort [2] and has attracted wide attention in recent years [3], [4]. The goal of a TC task is to categorize unlabeled texts into predefined classes based on their topics [5]; hence, given a set of prelabeled texts, an automatic text classifier can be established in the learning process [6], [7]. Before applying classifiers, every term in a text must first be assigned a numerical value (weight) via an appropriate term weighting scheme; this process is called text representation [8], [9]. The vector space model (VSM) is the most popular way to represent texts [10], [11]; it usually treats a text as a set of terms, namely the bag-of-words (BoW) model [12], [13]. The VSM represents a text collection as a document-term matrix, in which each weight reflects the importance of a certain term tj in a certain document dk [14], [15]; each row denotes one document vector, whereas each column corresponds to one of the distinct terms (i.e., selected features).
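The document-term matrix described above can be sketched as follows; this is an illustrative toy example (the corpus and raw-count weights are placeholders, to be replaced by the weighting schemes discussed later):

```python
from collections import Counter

# Toy corpus (assumption): each document is a whitespace-tokenizable string.
docs = ["the cat sat on the mat", "the dog sat"]

# Distinct terms (selected features) form the columns of the matrix.
vocab = sorted({t for d in docs for t in d.split()})

def to_vector(doc):
    # One row of the document-term matrix: raw term counts per vocabulary term.
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

matrix = [to_vector(d) for d in docs]
```

Each row of `matrix` is one document vector; the weighting schemes below replace the raw counts with TF-, IDF-, or class-based weights.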

Term weighting is critical to the TC task and has a direct and significant effect on classification performance [16]. At present, term weighting approaches are generally grouped into unsupervised and supervised according to whether they embrace the class information of training texts [17], [18]. Unsupervised term weighting (UTW) methods neglect class information, whereas supervised term weighting (STW) methods exploit category information when calculating weights. Among UTW schemes, term frequency (TF) and TF–IDF (term frequency–inverse document frequency) are commonly used. TF is one of the simplest weighting methods, but it is a local weighting approach because it considers only how many times a term occurs in a text. To overcome this drawback, the inverse document frequency (IDF) was designed, yielding the TF–IDF scheme; IDF is concerned with how many texts a term appears in. Note that TF–IDF was primarily designed for information retrieval (IR) rather than TC tasks [10], [19].
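The local TF factor and the global IDF factor can be sketched as follows (a minimal illustration of the classic formulation, TF times log of N over document frequency; the toy corpus is an assumption):

```python
import math

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents
    # containing the term). Terms appearing in fewer texts get larger weights.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Local factor (raw term frequency in this document) times the
    # global IDF factor computed over the whole collection.
    return doc.count(term) * idf(term, docs)

# Toy corpus (assumption): each document is a list of tokens.
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
```

Here "a" occurs in only one of three documents, so it receives a larger global factor than "b", which occurs in two.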

Unlike the IR task, the TC task aims to discriminate between different classes, not texts [20], and thus it should take category factors into account when computing term weights. For that reason, TC can be regarded as a supervised learning task [9], [17], [21], [22]. Most recent STW methods originate from feature selection schemes; these methods adopt category information in several ways, which can be summarized as follows. First, TF-CHI2, TF-IG, and TF-GR were proposed based on feature selection approaches (i.e., the Chi-square statistic (CHI2), information gain (IG), and gain ratio (GR)) [18]. Since then, various STW methods similar to the above schemes have been presented, for example, the odds ratio (OR) weighting factor in TF-OR [14], [23], [24], the mutual information (MI) weighting factor in TF-MI [14], [23], the probability-based (PB) weighting factor in TF-PB [24], and the correlation coefficient weighting factor in TF-CC [24].
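As a concrete illustration of how such feature-selection statistics become weighting factors, the standard chi-square statistic on a (term, category) 2x2 contingency table can serve as the global factor in place of IDF. This is a sketch of the general idea, not the exact formulation of any cited scheme; the variable names are ours:

```python
def chi2(a, b, c, d):
    # 2x2 contingency counts for a (term, category) pair:
    # a = in-category docs containing the term, b = in-category docs without it,
    # c = out-of-category docs containing it, d = out-of-category docs without it.
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def tf_chi2(tf, a, b, c, d):
    # TF-CHI2-style weight: local term frequency times the chi-square
    # global factor measuring term-category dependence.
    return tf * chi2(a, b, c, d)
```

A term occurring only inside one category gets a large chi-square value, while a term spread evenly across categories scores zero.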

Apart from these schemes, a variety of STW schemes derived from TF–IDF have been proposed. Initially, inspired by the IDF in the TF–IDF scheme, the inverse class frequency (ICF) was introduced; it reflects the observation that a key term of a specific class usually appears in only a few categories [25]. However, because the number of categories is generally quite small, a certain term may occasionally exist in multiple categories or sometimes even in all of them [25], [26]. As a result, ICF fails to promote the importance of a term under certain circumstances. To enhance a term’s distinguishing power, ICF was incorporated to generate the TF–IDF–ICF scheme [14], [25]. Meanwhile, in [14], the authors pointed out that the TF–IDF–ICF scheme overemphasizes rare terms. To alleviate this problem, they redesigned ICF and proposed a novel scheme called TF–IDF–ICSDF. Besides these, Lan et al. [17] argued that the distinguishing ability of a term depends only on the relevant texts that contain it, and hence they presented a novel STW scheme by replacing the IDF in TF–IDF with RF (relevance frequency). More importantly, results in many studies have demonstrated that TF–RF achieves better performance than most STW and UTW approaches [9], [17], [23]. Moreover, in [22], the authors claimed that TF–IDF is not necessarily suitable for the TC task; hence, two novel term weighting methods called TF-IGM and RTF-IGM were presented. Further, to deal with the drawback of TF-IGM, two improved weighting methods based on IGM have also been developed [27].
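Taking TF–RF as an example, the commonly cited formulation from Lan et al. multiplies TF by log2(2 + a / max(1, c)), where a and c count the documents containing the term in the positive and negative class, respectively. A minimal sketch:

```python
import math

def rf(a, c):
    # Relevance frequency: a = positive-class documents containing the term,
    # c = negative-class documents containing it. max(1, c) avoids division
    # by zero; the constant 2 keeps the factor at least 1.
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    # TF-RF weight: local term frequency times the class-aware RF factor.
    return tf * rf(a, c)
```

A term concentrated in the positive class (large a, small c) thus gets a boosted weight, while a term absent from the positive class gets the baseline factor of 1.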

We must note, however, that not all STW schemes are always superior to UTW schemes [17], [20], [26], [28]. Moreover, as emphasized in [22], most STW schemes show superiority in some specific TC tasks but cannot consistently yield the best classification performance, while requiring more storage space and running time [14]. For these reasons, this paper focuses on UTW methods. We have drawn inspiration from the IDF part of TF–IDF, which is defined as a logarithmic transformation, and many works have sought to build a theoretical basis for it [29]. Hence, we naturally ask whether there are nonlinear transformation forms other than the logarithm that can achieve better classification performance than existing ones. This is the first question we wish to address in this research. Besides that, because test texts do not have any prior category label information [30], [31], how to properly represent them is a difficult but significant task [30], especially for STW schemes, since these methods use category information in the weighting process; such schemes include TF–RF [17], TF–IDF–ICSDF [14], TF-IGM [22], RTF-IGMimp [27], and so on. Nevertheless, most existing schemes, including TF–IDF, do not clearly explain how to represent test texts [31]. Naturally, we raise the second question: how should test texts be represented for the TC task? Besides these two questions, a third question will be raised in Section 3, and its answer will be presented at the end of this article. Our contributions in this work can be summarized as follows. First, several nonlinear transformation methods are introduced and compared with the famous TF–IDF, which adopts the logarithmic ratio. Second, we clearly explain how to represent test texts in the weighting process through a classic UTW method and three typical STW methods with different characteristics. Third, a nonlinear transformation method is designed and used for TF, which performs better than the square-root-function-based TF. Finally, we conduct an extensive experimental comparison of our schemes with existing UTW and STW methods. Experimental results demonstrate the effectiveness and superiority of our proposed schemes.

The rest of this manuscript is outlined as follows. Section 2 takes several existing term weighting methods as examples to clearly show how to represent test texts. In Section 3, we propose four UTW methods for TC. The experimental settings are explained in Section 4. In Section 5, our findings are presented and discussed. We conclude our work in Section 6.

Section snippets

Analysis of current term weighting schemes

There is no doubt that term weighting is essential for the TC task [32]; it measures the importance of a term (feature) in representing the content of a text [14], [15]. Next, we analyze four term weighting approaches (unsupervised and supervised) in depth, because they were recently proposed or are closely related to ours: TF–IDF, TF–RF, TF–IDF–ICSDF, and RTF-IGMimp. Importantly, through these methods we clearly explain how to represent test texts.

Proposed unsupervised term weighting schemes

From the above arguments, choosing an appropriate metric function for weighting terms is the key to obtaining high-quality TC performance [7]. Although the TF–IDF weighting scheme is borrowed from the IR field and ignores the available category information of training texts [22], we are convinced that it is reasonable for TC tasks. The rationale can be interpreted via three fundamental assumptions [18], i.e., the TF assumption, the IDF assumption, and
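The paper's concrete alternative transforms are not included in this snippet; the sketch below only illustrates the general idea of replacing the logarithm in IDF with other concave nonlinear transforms of the ratio N/df, using hypothetical corpus statistics:

```python
import math

# Hypothetical corpus statistics for illustration only.
N, df = 1000, 10          # total documents, documents containing the term

ratio = N / df
candidates = {
    "log":  math.log(ratio),    # classic IDF transform
    "sqrt": math.sqrt(ratio),   # one possible square-root alternative (assumption)
    "cbrt": ratio ** (1 / 3),   # one possible cube-root alternative (assumption)
}
# All three are concave and increase with N/df, so rarer terms still receive
# larger global weights; only the growth rate differs between transforms.
```

Which growth rate yields the best classification performance is exactly the empirical question the experiments address.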

Experimental settings

In this study, we use two public text classification datasets to validate the performance of our schemes, namely the Reuters-21578 and 20 Newsgroups datasets [37]. The Reuters-21578 dataset has 8 categories, comprising 5485 training texts and 2189 test texts. There are 20 categories in the 20 Newsgroups corpus, comprising 11,293 training texts and 7528 test texts. Moreover, we omit terms shorter than two characters or occurring fewer than two times. In addition, we use the Porter
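The term-filtering step described above can be sketched as follows (stemming is omitted; tokenized documents and the function name are our assumptions):

```python
from collections import Counter

def filter_terms(tokenized_docs, min_len=2, min_count=2):
    # Drop terms shorter than two characters or occurring fewer than two
    # times across the corpus, as in the preprocessing described above.
    counts = Counter(t for d in tokenized_docs for t in d)
    keep = {t for t, c in counts.items() if len(t) >= min_len and c >= min_count}
    return [[t for t in d if t in keep] for d in tokenized_docs]
```

The corpus-level count filter removes hapax terms, and the length filter removes single-character tokens, shrinking the vocabulary before weighting.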

Experimental results and analysis

In this part, an extensive experimental comparison of our schemes with existing UTW and STW schemes is performed. Moreover, all the schemes (summarized in Table 2) are normalized except for RTF-IGMimp; the reasons can be found in [22] and will not be re-explained here.

Conclusions

The overall purpose of this work is to resolve the three questions raised in this paper, and this has been successfully achieved through three groups of experiments on two benchmark datasets with MNB and SVM classifiers. Concerning the first question, we introduced four nonlinear transformation methods as the global factor in UTW schemes, just as the IDF in TF–IDF adopts a logarithmic transformation. Comparison of the proposed schemes with the popular TF–IDF indicates that our

CRediT authorship contribution statement

Zhong Tang: Conceptualization, Methodology, Software, Writing - original draft. Wenqiang Li: Funding acquisition, Data curation. Yan Li: Validation, Supervision. Wu Zhao: Funding acquisition, Investigation. Song Li: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program, China (No. 2018YFB1700702), the Science & Technology Ministry Innovation Method Program, China (No. 2017IM040100), the Sichuan Major Science and Technology Project, China (No. 2019ZDZX0001), and the Sichuan Applied Foundation Project, China (No. 2018JY0119).

References (46)

  • Chen, K.W., et al., Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. (2016)
  • Altınçay, H., et al., Analytical evaluation of term weighting schemes for text categorization, Pattern Recognit. Lett. (2010)
  • Liu, Y., et al., Imbalanced text classification: A term weighting approach, Expert Syst. Appl. (2009)
  • Dogan, T., et al., Improved inverse gravity moment term weighting for text classification, Expert Syst. Appl. (2019)
  • Sinoara, R.A., et al., Knowledge-enhanced document embeddings for text classification, Knowl.-Based Syst. (2019)
  • Wang, S., Insurance pricing and increased limits ratemaking by proportional hazards transforms, Insurance Math. Econom. (1995)
  • Meng, J.N., et al., A two-stage feature selection method for text categorization, Comput. Math. Appl. (2011)
  • Wang, S.P., et al., Subspace learning for unsupervised feature selection via matrix factorization, Pattern Recognit. (2015)
  • Labani, M., et al., A novel multivariate filter method for feature selection in text classification problems, Eng. Appl. Artif. Intell. (2018)
  • Sabbah, T., et al., Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput. (2017)
  • Al-Mubaid, H., et al., A new text categorization technique using distributional clustering and learning logic, IEEE Trans. Knowl. Data Eng. (2006)
  • Sebastiani, F., Machine learning in automated text categorization, ACM Comput. Surv. (2002)
  • Tang, B., et al., Toward optimal feature selection in Naive Bayes for text categorization, IEEE Trans. Knowl. Data Eng. (2016)