A Methodology Combining Cosine Similarity with Classifier for Text Classification, Applied Artificial Intelligence

A Methodology Combining Cosine Similarity with Classifier for Text Classification
Applied Artificial Intelligence (IF 2.9) Pub Date: 2020-02-08, DOI: 10.1080/08839514.2020.1723868
Kwangil Park₁, June Seok Hong₂, Wooju Kim₁

ABSTRACT Text classification has received significant attention in recent years because of the proliferation of digital documents, and it is widely used in applications such as filtering and recommendation. Consequently, many approaches have been proposed for improving text classification performance, including those based on statistical theory, machine learning, and classifier performance improvement. Among these approaches, the centroid-based classifier, multinomial naïve Bayes (MNB), support vector machines (SVM), and convolutional neural networks (CNN) are commonly used. In this paper, we introduce a cosine similarity-based methodology for improving performance. The methodology combines cosine similarity (between a test document and fixed categories) with conventional classifiers such as MNB, SVM, and CNN to improve their accuracy; we call the resulting classifiers enhanced classifiers. We applied the enhanced classifiers to well-known datasets (20NG, R8, R52, Cade12, and WebKB) and evaluated their performance by the accuracy computed from the confusion matrix; the enhanced classifiers showed significant increases in accuracy. Moreover, through experiments, we identified which of the two knowledge representation techniques considered, word count and term frequency-inverse document frequency (TFIDF), is more suitable in terms of classifier performance.
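The abstract does not state how the cosine similarity and the classifier output are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a word-count representation, represents each category by the summed word counts of its training documents (which gives the same cosine as the class centroid), and multiplies the MNB posterior by the document-to-category cosine similarity. The class name `EnhancedNB`, the multiplication rule, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_counts(text):
    """Word-count representation: token -> raw frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EnhancedNB:
    """Multinomial naive Bayes reweighted by cosine similarity to each category.

    Assumption: combining the two scores by multiplication is illustrative
    only; the abstract does not specify the combination rule.
    """

    def fit(self, docs, labels):
        self.class_counts = defaultdict(Counter)  # summed word counts per class
        self.class_docs = Counter(labels)         # number of documents per class
        for doc, label in zip(docs, labels):
            self.class_counts[label].update(word_counts(doc))
        self.vocab = {w for c in self.class_counts.values() for w in c}
        self.n_docs = len(docs)
        return self

    def nb_posterior(self, doc):
        """Normalised MNB class probabilities with Laplace smoothing."""
        counts = word_counts(doc)
        log_scores = {}
        for label, wc in self.class_counts.items():
            total = sum(wc.values()) + len(self.vocab)
            score = math.log(self.class_docs[label] / self.n_docs)
            for w, n in counts.items():
                if w in self.vocab:  # ignore words never seen in training
                    score += n * math.log((wc[w] + 1) / total)
            log_scores[label] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    def category_similarity(self, doc):
        """Cosine similarity between the document and each category vector."""
        counts = word_counts(doc)
        return {c: cosine(counts, wc) for c, wc in self.class_counts.items()}

    def predict(self, doc):
        post = self.nb_posterior(doc)
        sim = self.category_similarity(doc)
        combined = {c: post[c] * sim[c] for c in post}  # assumed combination rule
        return max(combined, key=combined.get)


# Toy data, invented for illustration only.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project meeting notes",
]
train_labels = ["spam", "spam", "work", "work"]
model = EnhancedNB().fit(train_docs, train_labels)
print(model.predict("buy cheap pills"))
```

Swapping `word_counts` for a TFIDF weighting would be one way to compare the two representations the abstract mentions, and the same reweighting step could in principle be applied to the class scores of an SVM or CNN.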

Updated: 2020-02-08
