A novel term weighting scheme for text classification: TF-MONO
Journal of Informetrics (IF 3.7), Pub Date: 2020-07-24, DOI: 10.1016/j.joi.2020.101076
Turgut Dogan, Alper Kursat Uysal

Effectively representing the relationship between documents and their contents is crucial to classification performance in text classification. Term weighting is a preprocessing step that aims to represent text documents better in the vector space by assigning proper weights to terms. Because the computed weight values directly affect classification performance, term weighting remains an important research sub-area of text classification. In this study, we propose a novel term weighting strategy (MONO) that uses the non-occurrence information of terms more effectively than existing term weighting approaches in the literature. The proposed strategy also performs intra-class document scaling to better represent the distinguishing capabilities of terms that occur in different numbers of documents within the same number of classes. Based on the MONO weighting strategy, two novel supervised term weighting schemes, TF-MONO and SRTF-MONO, are proposed for text classification. The proposed schemes were tested with two classifiers, SVM and KNN, on three datasets: Reuters-21578, 20-Newsgroups, and WebKB. Their classification performance was compared with that of five existing term weighting schemes from the literature: TF-IDF, TF-IDF-ICF, TF-RF, TF-IDF-ICSDF, and TF-IGM. The results obtained with the seven schemes show that SRTF-MONO generally outperformed the other schemes on all three datasets. Moreover, TF-MONO yields promising Micro-F1 and Macro-F1 results compared with the five benchmark term weighting methods, especially on the Reuters-21578 and 20-Newsgroups datasets.
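
For readers unfamiliar with term weighting, the following is a minimal Python sketch of TF-IDF, the first of the five benchmark schemes named above. It is not the MONO strategy, whose formula the abstract does not give. The function name `tf_idf`, the toy corpus, and the choice of raw term frequency with idf = log(N / df) are assumptions made for illustration; other TF-IDF variants exist.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term frequency and idf = log(N / df), one common
    textbook variant. This is the TF-IDF baseline, not MONO.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Hypothetical toy corpus, for illustration only.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "fall"],
    ["oil", "price", "rise", "rise"],
]
weights = tf_idf(docs)
```

Each dictionary in `weights` is a sparse vector for one document. A term that appears in every document would receive weight zero under this variant, which reflects the idea that such a term does not help separate documents. Unlike the supervised schemes in the abstract, this baseline ignores class labels.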



