A scalable eigenspace-based fuzzy c-means for topic detection
Data Technologies and Applications (IF 1.7), Pub Date: 2021-03-23, DOI: 10.1108/dta-11-2020-0262
Hendri Murfi

Purpose

The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection.

Design/methodology/approach

The eigenspace-based fuzzy c-means (EFCM) method combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is then performed in the eigenspace to identify the centroid of each cluster. The topics are obtained by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability using two approaches: single-pass and online. We call the resulting topic detection methods spEFCM and oEFCM, respectively.
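
The EFCM pipeline described above (truncated SVD, fuzzy c-means in the eigenspace, back-transformation of the centroids) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `fuzzy_c_means` and `efcm_topics`, the fuzzifier `m=2.0`, the random initialization, and the toy matrix are all hypothetical, and clipping negative entries to zero is only one assumed way of mapping centroids into the nonnegative subspace. The single-pass and online variants are not shown.

```python
import numpy as np

def fuzzy_c_means(Z, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on the rows of Z; returns (centroids, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ Z) / W.sum(axis=0)[:, None]   # membership-weighted centroids
        d = np.linalg.norm(Z[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return V, U

def efcm_topics(X, n_topics, n_components, m=2.0, seed=0):
    """Sketch of EFCM: X is a nonnegative documents-by-terms matrix (e.g. tf-idf)."""
    # Truncated SVD: keep the leading right singular vectors as the eigenspace basis.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vt_k = Vt[:n_components]
    Z = X @ Vt_k.T                               # documents in the eigenspace
    V, memberships = fuzzy_c_means(Z, n_topics, m=m, seed=seed)
    # Map centroids back to term space; clipping at zero is an assumption here.
    topics = np.maximum(V @ Vt_k, 0.0)
    return topics, memberships

# Hypothetical toy corpus: documents 0-2 use terms 0-2, documents 3-5 use terms 3-5.
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 3, 1],
    [0, 0, 0, 1, 1, 3],
], dtype=float)

topics, memberships = efcm_topics(X, n_topics=2, n_components=2)
labels = memberships.argmax(axis=1)              # hard assignment, for inspection only
```

Each row of `topics` is a nonnegative weight vector over terms, so the highest-weighted terms in a row can be read as that topic's keywords. For large corpora a sparse truncated SVD routine would normally replace the dense `np.linalg.svd` call used here for brevity.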

Findings

Our simulations show that both oEFCM and spEFCM run faster than EFCM on data sets that do not fit in memory. However, the average coherence score decreases. For data sets that fit in memory as well as those that do not, oEFCM offers a better tradeoff between running time and coherence score than spEFCM.

Originality/value

This research produces a scalable topic detection method. Beyond scalability, the developed method also runs faster on data sets that fit in memory.




Updated: 2021-03-23