EWNStream+: Effective and Real-time Clustering of Short Text Streams Using Evolutionary Word Relation Network
International Journal of Information Technology & Decision Making (IF 4.9), Pub Date: 2021-01-18, DOI: 10.1142/s0219622021500024
Shuiqiao Yang, Guangyan Huang, Xiangmin Zhou, Vicky Mak, John Yearwood

The real-time clustering of short text streams has various applications, such as event tracking, text summarization and sentiment analysis. However, clustering short text streams accurately and efficiently is challenging for three reasons: the sparsity problem (i.e., a single short text document contains so little information that traditional vector space models represent it as a high-dimensional, sparse vector), topic drift, and the high speed at which text streams are generated. In this paper, we propose EWNStream+, an effective, real-time method for clustering short text streams based on an Evolutionary Word relation Network. EWNStream+ constructs a bi-weighted word relation network from term frequencies and term co-occurrence statistics aggregated at the corpus level, which addresses the sparsity problem and topic drift of short texts. Moreover, as the query window in the stream shifts to newly arriving data, EWNStream+ incrementally updates the word relation network by incorporating new word statistics and decaying the old ones, which naturally captures the underlying topic drift in the data stream and reduces the size of the network. Experimental results on a real-world dataset show that EWNStream+ achieves better clustering accuracy and time efficiency than several competing methods.
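The incremental update described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact weighting or decay formulas, so everything here is an assumption made for illustration: the class name `WordRelationNetwork`, the multiplicative `decay` factor, the `prune_below` threshold, whitespace tokenization, and counting each word at most once per document. It is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


class WordRelationNetwork:
    """Toy bi-weighted word network: node weights and edge weights (illustrative only)."""

    def __init__(self, decay=0.5, prune_below=0.1):
        self.decay = decay              # hypothetical decay factor applied per window shift
        self.prune_below = prune_below  # hypothetical threshold for dropping stale entries
        self.node_weight = defaultdict(float)  # aggregated word occurrence counts
        self.edge_weight = defaultdict(float)  # aggregated word-pair co-occurrence counts

    def update(self, batch):
        """Shift the window: decay old statistics, then add those of the new batch."""
        # Decay every stored weight and prune entries that have faded away,
        # which keeps the network small as old topics drift out.
        for table in (self.node_weight, self.edge_weight):
            for key in list(table):
                table[key] *= self.decay
                if table[key] < self.prune_below:
                    del table[key]
        # Add statistics from the newly arrived short texts.
        for text in batch:
            # Assumption: each distinct word counts once per document.
            words = sorted(set(text.lower().split()))
            for word in words:
                self.node_weight[word] += 1.0
            # Sorted words give each unordered pair a single canonical key.
            for pair in combinations(words, 2):
                self.edge_weight[pair] += 1.0
```

Calling `update` once per window shift keeps recent word pairs heavily weighted while pairs that stop appearing fade and are eventually removed. Clustering over the resulting network is a separate step that this sketch does not attempt.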

Updated: 2021-01-18