Improved Distance Functions for Instance-Based Text Classification
Computational Intelligence and Neuroscience (IF 3.120), Pub Date: 2020-11-23, DOI: 10.1155/2020/4717984
Khalil El Hindi, Bayan Abu Shawar, Reem Aljulaidan, Hussien Alsalamn
Text classification has many applications in text processing and information retrieval. Instance-based learning (IBL) is among the top-performing text classification methods. However, its effectiveness depends on the distance function it uses to determine similar documents. In this study, we evaluate the performance of several popular distance measures and propose new ones that exploit word frequencies and the ordinal relationships between them. In particular, we propose new distance measures based on the value difference metric (VDM) and the inverted specific-class distance measure (ISCDM). The proposed measures are suitable for documents represented as vectors of word frequencies. We compare the performance of these measures with their original counterparts and with powerful Naïve Bayesian-based text classification algorithms. We evaluate the proposed distance measures using the kNN algorithm on 18 benchmark text classification datasets. Our empirical results reveal that distance metrics designed for nominal values yield better text classification results than the Euclidean distance measure for numeric values. Furthermore, our results indicate that ISCDM substantially outperforms VDM, and it is also more amenable to exploiting the ordinal nature of term frequencies than VDM. Thus, we were able to propose more ISCDM-based distance measures for text classification than VDM-based ones. We also compare the proposed distance measures with Naïve Bayesian-based text classifiers, namely, multinomial Naïve Bayes (MNB), complement Naïve Bayes (CNB), and the one-versus-all-but-one (OVA) model. It turned out that when kNN uses some of the proposed measures, it outperforms the NB-based text classifiers on most datasets.
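To make the idea behind VDM-style distances concrete, here is a minimal sketch of the classic value difference metric for nominal feature values, usable inside kNN. This illustrates the family of measures the paper builds on, not the paper's proposed variants; all function and variable names are our own assumptions:

```python
from collections import defaultdict

def vdm_tables(X, y, n_features):
    """Estimate P(class | feature value) statistics from training data.

    counts[f][v][c] = number of training instances with value v for
    feature f that belong to class c; totals[f][v] = number of
    training instances with value v for feature f.
    """
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_features)]
    totals = [defaultdict(int) for _ in range(n_features)]
    for row, c in zip(X, y):
        for f, v in enumerate(row):
            counts[f][v][c] += 1
            totals[f][v] += 1
    return counts, totals, set(y)

def vdm_distance(a, b, counts, totals, classes, q=1):
    """Classic VDM: sum, over features and classes, of the differences
    between the class-conditional probabilities of the two values."""
    d = 0.0
    for f, (va, vb) in enumerate(zip(a, b)):
        for c in classes:
            pa = counts[f][va][c] / totals[f][va] if totals[f][va] else 0.0
            pb = counts[f][vb][c] / totals[f][vb] if totals[f][vb] else 0.0
            d += abs(pa - pb) ** q
    return d
```

Two feature values are close under VDM when they predict a similar class distribution, regardless of their numeric gap; treating term frequencies as such nominal values is what lets these metrics outperform Euclidean distance in the experiments above.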
