Comparison of Supervised Classification Models on Textual Data
Mathematics (IF 2.4) Pub Date: 2020-05-24, DOI: 10.3390/math8050851
Bi-Min Hsu

Text classification is an essential task in many applications, such as spam detection and sentiment analysis. With the growing number of textual documents and datasets generated through social media and news articles, an increasing number of machine learning methods are required for accurate textual classification. For this paper, a comprehensive evaluation of the performance of multiple supervised learning models, such as logistic regression (LR), decision trees (DT), support vector machines (SVM), AdaBoost (AB), random forests (RF), multinomial naive Bayes (NB), multilayer perceptrons (MLP), and gradient boosting (GB), was conducted to assess the efficiency and robustness, as well as the limitations, of these models on the classification of textual data. SVM, LR, and MLP had better performance in general, with SVM being the best, while DT and AB had much lower accuracies amongst all the tested models. Further exploration of different SVM kernels was performed, demonstrating the advantage of linear kernels over polynomial, sigmoid, and radial basis function kernels for text classification. The effects of removing stop words on model performance were also investigated; DT performed better with stop words removed, while all other models were relatively unaffected by the presence or absence of stop words.
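To illustrate one of the compared models, the following is a minimal stdlib-only sketch of multinomial naive Bayes with Laplace smoothing on hypothetical toy spam/ham data. The data, function names, and smoothing parameter are illustrative assumptions, not the paper's actual experimental setup, which presumably used standard library implementations of all eight models.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Train multinomial naive Bayes with Laplace (alpha) smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns (class log-priors, per-class token log-likelihoods, vocabulary).
    """
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # class -> token frequency
    for doc, lab in zip(docs, labels):
        token_counts[lab].update(doc)
    n_docs = len(docs)
    log_priors = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihoods = {}
    for c in class_counts:
        total = sum(token_counts[c].values())
        denom = total + alpha * len(vocab)  # smoothed normalizer
        log_likelihoods[c] = {
            t: math.log((token_counts[c][t] + alpha) / denom) for t in vocab
        }
    return log_priors, log_likelihoods, vocab

def predict_nb(doc, log_priors, log_likelihoods, vocab):
    """Return the class maximizing log P(c) + sum of log P(token|c).

    Tokens outside the training vocabulary are ignored.
    """
    scores = {
        c: log_priors[c]
        + sum(log_likelihoods[c][t] for t in doc if t in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus: two spam and two ham documents.
docs = [
    ["win", "cash", "now"],
    ["free", "cash", "prize"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "today"],
]
labels = ["spam", "spam", "ham", "ham"]
priors, likelihoods, vocab = train_multinomial_nb(docs, labels)
print(predict_nb(["free", "cash"], priors, likelihoods, vocab))       # spam
print(predict_nb(["meeting", "today"], priors, likelihoods, vocab))   # ham
```

The same fit/predict pattern applies to the other models in the comparison; the stop-word experiment described above amounts to filtering each token list against a stop-word set before training.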
