Clustering small-sized collections of short texts
Information Retrieval Journal (IF 1.7), Pub Date: 2017-11-30, DOI: 10.1007/s10791-017-9324-8
Lili Kotlerman, Ido Dagan, Oren Kurland

The need to cluster small text corpora composed of a few hundred short texts arises in various applications, e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging because of the vocabulary mismatch between short texts and because the small corpus size yields insufficient corpus-based statistics (e.g., term co-occurrence statistics). We address this challenge with a framework that utilizes a set of external knowledge resources providing information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also use the resources to expand the vocabulary of the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method, the approach for projecting the texts onto the term clusters, and the way the external knowledge resources are applied. Extensive empirical evaluation demonstrates the merits of our approach relative to applying clustering algorithms directly to the text corpus and to using state-of-the-art co-clustering and topic modeling methods.
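The pipeline outlined in the abstract (expand the vocabulary, cluster terms by resource-derived similarity, then project texts onto the term clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the `TERM_SIMILARITY` table is a made-up stand-in for an external knowledge resource, single-link merging above a threshold is just one possible term clustering method, and assigning each text to the cluster with the largest token overlap is just one possible projection. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for an external knowledge resource:
# term pairs with made-up similarity scores.
TERM_SIMILARITY = {
    ("car", "automobile"): 0.9,
    ("car", "vehicle"): 0.7,
    ("refund", "reimbursement"): 0.8,
    ("refund", "money"): 0.6,
}


def expand_vocabulary(texts, similarity):
    """Corpus terms plus any resource term related to a corpus term."""
    corpus_terms = {tok for text in texts for tok in text.lower().split()}
    expanded = set(corpus_terms)
    for a, b in similarity:
        if a in corpus_terms or b in corpus_terms:
            expanded.update((a, b))
    return sorted(expanded)  # sorted for a deterministic cluster order


def cluster_terms(vocabulary, similarity, threshold=0.5):
    """Single-link clustering: merge terms linked by a score >= threshold."""
    parent = {term: term for term in vocabulary}

    def find(term):
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for (a, b), score in similarity.items():
        if score >= threshold and a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = {}
    for term in vocabulary:
        clusters.setdefault(find(term), set()).add(term)
    # Toy simplification: ignore terms the resource relates to nothing.
    return [c for c in clusters.values() if len(c) >= 2]


def project_texts(texts, term_clusters):
    """Label each text with the index of its best-overlapping term cluster."""
    labels = []
    for text in texts:
        tokens = Counter(text.lower().split())
        overlaps = [sum(tokens[t] for t in c) for c in term_clusters]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=None)
        labels.append(best if best is not None and overlaps[best] > 0 else None)
    return labels


texts = [
    "my car broke down",
    "automobile repair cost",
    "where is my refund",
    "reimbursement still pending",
]
vocabulary = expand_vocabulary(texts, TERM_SIMILARITY)
term_clusters = cluster_terms(vocabulary, TERM_SIMILARITY)
labels = project_texts(texts, term_clusters)
print(list(zip(texts, labels)))
```

In this toy corpus the first two texts share no words, so a method relying only on corpus statistics would have no evidence linking them. The resource-derived relation between "car" and "automobile" is what places them in the same cluster.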

Updated: 2017-11-30