Multi-view document clustering based on geometrical similarity measurement
International Journal of Machine Learning and Cybernetics (IF 5.6), Pub Date: 2021-03-22, DOI: 10.1007/s13042-021-01295-8
Bassoma Diallo, Jie Hu, Tianrui Li, Ghufran Ahmad Khan, Ahmed Saad Hussein

Many studies have applied multi-view clustering algorithms to document clustering. A central challenge in document clustering is the choice of similarity metric. Existing multi-view document clustering methods mainly rely on two measures: cosine similarity (CS) and Euclidean distance (ED). CS does not account for the magnitude difference (MD) between two vectors, while ED cannot capture the divergence between pairs of vectors that have a similar ED. In this paper, we propose five new similarity metric models. This approach overcomes the drawbacks of CS and ED by measuring the divergence between documents with the same ED while taking their magnitudes into account. Furthermore, we propose a multi-view document clustering scheme based on the proposed similarity metrics. First, CS, ED, the triangle’s area similarity and sector’s area similarity metric, and our five similarity metrics are applied to every view of a dataset to generate the corresponding similarity matrices. Next, we run clustering algorithms on these similarity matrices to evaluate single-view performance. Finally, we aggregate these similarity matrices into a unified similarity matrix and apply a spectral clustering algorithm to it to generate the final clusters. The experimental results show that the proposed similarity functions measure the similarity between documents more accurately than the existing metrics, and that the proposed clustering scheme considerably outperforms state-of-the-art algorithms.
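
The two limitations named in the abstract, and the overall shape of the pipeline, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not define the five proposed metrics, so plain cosine similarity stands in for the per-view metric. Averaging the per-view matrices is one assumed way to aggregate them, and the two-cluster split on the Fiedler vector is a simplified stand-in for a full spectral clustering algorithm. All function names and the toy data are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; insensitive to vector magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; insensitive to the angle."""
    return float(np.linalg.norm(a - b))

# CS limitation: b is a scaled copy of a, so CS reports a perfect match
# even though the magnitudes differ by a factor of three.
a, b = np.array([1.0, 2.0]), np.array([3.0, 6.0])
cs_scaled = cosine_similarity(a, b)

# ED limitation: both pairs are exactly 1 apart, yet the angle between
# p and q (45 degrees) is much larger than the angle between r and s.
p, q = np.array([1.0, 0.0]), np.array([1.0, 1.0])
r, s = np.array([5.0, 0.0]), np.array([5.0, 1.0])

def similarity_matrix(X):
    """Pairwise cosine similarity between rows of X (one row per document).
    Assumption: stands in for any of the per-view metrics in the abstract."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def spectral_bipartition(S):
    """Two-way split from the sign of the Fiedler vector of the normalized
    Laplacian. Assumption: a simplified stand-in for spectral clustering."""
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt        # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Toy data: four documents seen through two views (hypothetical features).
view1 = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
view2 = np.array([[2.0, 0.0, 0.1], [1.5, 0.1, 0.0],
                  [0.0, 0.1, 2.0], [0.1, 0.0, 1.8]])

# One similarity matrix per view, then an assumed aggregation by averaging.
per_view = [similarity_matrix(v) for v in (view1, view2)]
unified = np.mean(per_view, axis=0)

labels = spectral_bipartition(unified)
print(labels)  # documents 0 and 1 share one label, documents 2 and 3 the other
```

Which cluster receives label 0 depends on the sign of the eigenvector returned by the solver, so only the grouping is meaningful, not the label values themselves.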




Updated: 2021-03-22