Fast Clustering of Short Text Streams Using Efficient Cluster Indexing and Dynamic Similarity Thresholds
arXiv - CS - Information Retrieval. Pub Date: 2021-01-21, DOI: arxiv-2101.08595
Md Rashadul Hasan Rakib, Muhammad Asaduzzaman

Short text stream clustering is an important but challenging task, since massive amounts of text are generated from different sources such as micro-blogging, question-answering, and social news aggregation websites. One of the major challenges is to cluster this volume of text within a reasonable amount of time. Existing state-of-the-art short text stream clustering methods cannot do so, because they compute the similarity between each text and all existing clusters before assigning that text to a cluster. To overcome this challenge, we propose a fast short text stream clustering method (called FastStream) that efficiently indexes the clusters using an inverted index and computes the similarity between a text and only a selected number of clusters when assigning the text to a cluster. In this way, we reduce the running time not only of our proposed method but also of several state-of-the-art short text stream clustering methods. FastStream assigns a text to a cluster (new or existing) using similarity thresholds that are computed dynamically from a statistical measure. Thus our method efficiently deals with the concept drift problem. Experimental results demonstrate that FastStream outperforms the state-of-the-art short text stream clustering methods by a significant margin on several short text datasets. In addition, FastStream runs several orders of magnitude faster than the state-of-the-art methods.
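
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It keeps an inverted index from words to cluster ids, so that an incoming text is compared only with clusters that share at least one word with it, and it recomputes the acceptance threshold for every text. The abstract does not specify the features, the similarity measure, or the statistic, so everything here is an assumption: the class name `IndexedStreamClusterer` is hypothetical, features are lowercase whitespace-separated words, similarity is the count of shared distinct words, and the threshold is the mean plus one standard deviation of the candidate similarities.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


class IndexedStreamClusterer:
    """Toy one-pass clusterer (hypothetical; not the paper's implementation)."""

    def __init__(self):
        self.clusters = []             # cluster id -> Counter of word frequencies
        self.index = defaultdict(set)  # word -> ids of clusters that contain it

    def candidates(self, words):
        """Return ids of clusters sharing at least one word with the text."""
        ids = set()
        for w in words:
            ids |= self.index.get(w, set())
        return ids

    def add(self, text):
        """Assign text to an existing or new cluster and return the cluster id."""
        words = set(text.lower().split())
        # Assumed similarity: number of distinct words shared with the cluster.
        # Only candidate clusters from the inverted index are scored.
        sims = {cid: sum(w in self.clusters[cid] for w in words)
                for cid in self.candidates(words)}
        target = None
        if sims:
            best = max(sims, key=sims.get)
            scores = list(sims.values())
            # Assumed statistic: mean plus one standard deviation of the
            # candidate similarities, recomputed for every incoming text.
            if sims[best] >= mean(scores) + pstdev(scores):
                target = best
        if target is None:
            # No candidate passed the threshold: open a new cluster.
            target = len(self.clusters)
            self.clusters.append(Counter())
        # Update the cluster and register its words in the inverted index.
        self.clusters[target].update(words)
        for w in words:
            self.index[w].add(target)
        return target


clusterer = IndexedStreamClusterer()
stream = [
    "apple unveils new iphone",
    "new iphone camera review",
    "world cup final tonight",
    "cup final ends in penalties",
]
labels = [clusterer.add(t) for t in stream]
print(labels)  # [0, 0, 1, 1]
```

Under these assumptions the rule always accepts the best candidate when there are only one or two candidates, so this sketch illustrates the control flow rather than a well-founded threshold. The saving comes from `candidates`: the fourth text is scored against one cluster instead of both.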

Updated: 2021-01-22