Context-Dependent Feature Values in Text Categorization, International Journal of Software Engineering and Knowledge Engineering

International Journal of Software Engineering and Knowledge Engineering (IF 0.9), Pub Date: 2020-10-21, DOI: 10.1142/s021819402050031x
Edward Kai Fung Dang, Robert Wing Pong Luk, James Allan

Feature engineering is one aspect of knowledge engineering. Besides feature selection, the appropriate assignment of feature values is crucial to the performance of many software applications, such as text categorization (TC) and speech recognition. In this work, we develop a general method to enhance TC performance by using context-dependent feature values (also known as term weights), which are obtained by a novel adaptation of a context-dependent adjustment procedure previously shown to be effective in information retrieval. The motivation for our approach is that the general method can be used with different text representations and in combination with other TC techniques. Experiments on several test collections show that our context-dependent feature values can improve TC over traditional context-independent unigram feature values, even with a strong classifier such as a Support Vector Machine (SVM), which past work has found hard to improve upon. We also show that the relative performance improvement of our method over the context-independent baseline is comparable to the levels attained by recent word embedding methods in the literature, while an advantage of our approach is that it does not require the substantial training needed to learn word embedding representations.
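To make the term "context-independent unigram feature values" concrete, the following is a minimal sketch of one common weighting of that kind, plain TF-IDF. The abstract does not say which baseline weighting the authors used, so the choice of TF-IDF, the function name `tfidf_weights`, and the toy documents are all assumptions made for illustration. This is not the paper's context-dependent procedure.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document using plain TF-IDF.

    Illustrative assumption: weight = raw term count * log(N / df).
    The weight depends only on counts, never on neighbouring words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus: "bank" appears in two different senses.
docs = [
    "the bank approved the loan",
    "the river bank flooded",
    "stocks fell sharply",
]
weights = tfidf_weights(docs)
print(weights[0]["bank"], weights[1]["bank"])
```

In this sketch "bank" receives the same weight in the first two documents even though the surrounding words differ. A context-dependent scheme of the kind the abstract describes would instead let the value of a feature vary with its context.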

Updated: 2020-10-21
