Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports, Mathematical Problems in Engineering


Mathematical Problems in Engineering (IF 1.430), Pub Date: 2021-03-05, DOI: 10.1155/2021/6619088
Zhiying Jiang (1, 2), Bo Gao (1, 2), Yanlin He (1, 2), Yongming Han (1, 2), Paul Doyle (3), Qunxiong Zhu (1, 2)


With the rapid development of internet technology, large amounts of internet text data have become available. Text classification (TC) plays an important role in processing such massive text data, but its accuracy depends directly on the performance of the term weighting scheme. Because term frequency-inverse document frequency (TF-IDF) was originally designed for information retrieval (IR), it is not effective enough for TC, especially for text data with unbalanced distributions such as internet media reports. Therefore, the variance between the document frequency (DF) value of a particular term and the average of all DFs, namely the document frequency variance (ADF), is proposed to improve the handling of text data with unbalanced distributions. The standard TF-IDF is then modified with the proposed ADF in four different ways for unbalanced text collections, namely TF-IADF, TF-IADF⁺, TF-IADF_norm, and TF-IADF⁺_norm. As a result, an effective model can be established for the TC task on internet media reports. A series of simulations was carried out to evaluate the proposed methods. Compared with TF-IDF under state-of-the-art classification algorithms, the simulation results confirm the effectiveness and feasibility of the proposed methods.
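
The abstract does not give the ADF formula, so the following Python sketch only illustrates the two ingredients it names: a baseline TF-IDF weight (raw term count times log(N / DF), one common variant) and, for each term, the squared deviation of its DF from the mean DF. The function names, the toy corpus, and the squared-deviation form are assumptions made for illustration. They are not the paper's definition of ADF or of the TF-IADF variants.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def tf_idf(docs):
    """Baseline TF-IDF: raw term count times log(N / df)."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def df_squared_deviation(docs):
    """Hypothetical stand-in for the DF-variance idea: (df_t - mean_df)^2.

    This is an assumed form, NOT the paper's ADF formula.
    """
    df = document_frequencies(docs)
    mean_df = sum(df.values()) / len(df)
    return {t: (d - mean_df) ** 2 for t, d in df.items()}


# Toy tokenized corpus (made up for this example).
docs = [
    ["market", "stock", "rise"],
    ["market", "policy", "news"],
    ["market", "stock", "fall"],
]

baseline = tf_idf(docs)
deviation = df_squared_deviation(docs)
```

In this toy corpus, "market" occurs in every document, so its baseline TF-IDF weight is zero, while its DF lies furthest from the mean DF. How the paper combines such a deviation with TF-IDF is specified in the full text, not in the abstract.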


Updated: 2021-03-05
