Weighted naïve Bayes text classification algorithm based on improved distance correlation coefficient
Neural Computing and Applications (IF 6), Pub Date: 2021-04-12, DOI: 10.1007/s00521-021-05989-6
Shufen Ruan, Baozhou Chen, Kunfang Song, Hongwei Li

This paper proposes a method for improving attribute weighting in naïve Bayes text classifiers using an improved distance correlation coefficient. The resulting model is called improved distance correlation coefficient attribute weighted multinomial naïve Bayes, denoted IDCWMNB. Unlike traditional correlation statistics, which consider the cumulative distribution function of random vectors, the improved distance correlation coefficient tests the joint correlation of random vectors by describing the distance between the joint characteristic function and the product of the marginal characteristic functions. Specifically, a measure of inverse document frequency is proposed that accounts for how documents are concentrated and scattered. This measure is then combined with the distance correlation coefficient between attributes and categories to quantify the importance of each attribute to each category, so that different terms receive different weights. The learned attribute weights are incorporated into the posterior probability estimates of the multinomial naïve Bayes model, an approach known as deep attribute weighting. The distance correlation measure is more effective than traditional statistical measures when the relationship between two random vectors is nonlinear. Experimental results on benchmark and real-world data indicate that the new attribute weighting method achieves an effective balance between classification accuracy and execution time.
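
The abstract does not give the formula for the improved coefficient or for the proposed inverse document frequency measure, so neither is reproduced here. As a rough illustration of the underlying idea only, the sketch below computes the standard sample distance correlation (in the sense of Székely, Rizzo and Bakirov) between two scalar samples, using double-centered pairwise distance matrices. This is an assumption-laden stand-in, not the paper's IDCWMNB weighting; the function names `_centered_distances` and `distance_correlation` are hypothetical.

```python
import math


def _centered_distances(values):
    """Pairwise |v_i - v_j| matrix, double-centered by row, column and grand mean."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Standard sample distance correlation of two equal-length scalar samples.

    Illustrative only: this is NOT the improved coefficient from the paper,
    whose definition is not given in the abstract.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        # A constant sample carries no dependence information.
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


if __name__ == "__main__":
    xs = [-2, -1, 0, 1, 2]
    ys = [v * v for v in xs]  # nonlinear dependence; Pearson correlation is 0 here
    print(distance_correlation(xs, ys))
```

For the symmetric sample above, the Pearson correlation between `xs` and `ys` is zero, while the distance correlation is clearly positive, which is the kind of nonlinear dependence the abstract refers to. In a term-weighting setting one might take `x` as a term's per-document frequency and `y` as a class indicator, but that mapping is an assumption and is not stated in the abstract.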




Updated: 2021-04-12