Several alternative term weighting methods for text representation and classification
Knowledge-Based Systems (IF 8.8) Pub Date: 2020-08-14, DOI: 10.1016/j.knosys.2020.106399
Zhong Tang, Wenqiang Li, Yan Li, Wu Zhao, Song Li

Text representation is a hot topic that underpins text classification (TC) tasks, and it has a substantial impact on TC performance. Although the well-known TF–IDF was designed for information retrieval rather than for TC, it is widely used in TC as a term weighting method for representing text content. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, we propose several alternative methods in this study to generate unsupervised term weighting schemes that can offset the drawback of TF–IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential for TC, especially for supervised term weighting approaches (e.g., TF–RF), mainly because these methods need category information when weighting terms. However, most current schemes do not clearly explain how test texts should be represented under them. To explore this problem and seek a reasonable solution, we analyze a classic unsupervised term weighting method and three typical supervised term weighting methods in depth to illustrate how to represent test texts. To investigate the effectiveness of our work, we design three sets of experiments to compare their performance. The comparisons show that our proposed methods can indeed enhance TC performance and sometimes even outperform existing supervised term weighting methods.
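To make the test-text problem concrete, the following is a minimal sketch, not the authors' proposed schemes (the abstract does not specify them). It uses the standard definitions IDF = log(N / df) and relevance frequency rf = log2(2 + a / max(1, c)), learns both on labeled training texts, and then weights an unlabeled test text with the learned values. The function names, the toy corpus, and the choice to give unseen terms a weight of 0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_weights(train_docs):
    """Classic IDF, log(N / df(t)), computed on training documents only."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc))
    return {t: math.log(n / d) for t, d in df.items()}


def rf_weights(train_docs, labels, positive):
    """Relevance frequency, log2(2 + a / max(1, c)), where a and c count the
    positive- and negative-category training documents that contain t."""
    a, c = Counter(), Counter()
    for doc, label in zip(train_docs, labels):
        (a if label == positive else c).update(set(doc))
    return {t: math.log2(2 + a[t] / max(1, c[t])) for t in set(a) | set(c)}


def represent(doc, weights):
    """Weight a (possibly unlabeled) document as tf * learned weight.
    Terms unseen in training get weight 0 (an assumption of this sketch)."""
    tf = Counter(doc)
    return {t: f * weights.get(t, 0.0) for t, f in tf.items()}


# Hypothetical toy corpus: two "spam" and two "ham" training texts.
train = [["cheap", "pills", "buy"], ["buy", "now"],
         ["meeting", "now"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]

idf = idf_weights(train)
rf = rf_weights(train, labels, positive="spam")

# The test text has no label, so it can only reuse weights learned in training.
test_doc = ["buy", "cheap", "meeting", "buy"]
tfidf_vec = represent(test_doc, idf)
tfrf_vec = represent(test_doc, rf)
```

In this sketch the unsupervised weights need only document frequencies, whereas the supervised rf weights depend on which category is treated as positive. With more than two categories a test text could be weighted in several ways, which is the ambiguity the abstract says most schemes leave unexplained.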




Updated: 2020-08-20