Short text similarity measurement using context-aware weighted biterms, Concurrency and Computation: Practice and Experience

Concurrency and Computation: Practice and Experience (IF 1.5), Pub Date: 2020-04-13, DOI: 10.1002/cpe.5765
Shuiqiao Yang ₁, Guangyan Huang ₁, Bahadorreza Ofoghi ₂, John Yearwood ₁

With the development of internet technologies, social media, and mobile devices, short texts have become an increasingly popular medium for users to communicate with friends, search for information, and review products. Measuring the similarity between short texts is a fundamental task because of its importance in many applications, such as text retrieval, topic discovery, and event detection. However, short texts generally contain sparse, noisy, and ambiguous information, so effectively measuring the distance between them is challenging. In this paper, we use corpus-wide word co-occurrence information to enrich document-level features and thereby mitigate the effect of short-text sparseness on distance measurement. We propose a novel context-aware weighted Biterm method for short text Distance Measurement (BDM). In BDM, we extract biterms (i.e., word pairs) from a short text corpus and use a biterm topic model to determine the global weights of biterms in the corpus. We then determine the local importance of a biterm in different contexts (i.e., short texts) based on its corpus-level weight. The distance between two short texts is computed using the context-aware weighted biterms. Experimental results on three real-world datasets demonstrate the improved accuracy and effectiveness of the proposed BDM.
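The pipeline outlined in the abstract (extract biterms, assign corpus-level weights, localise them per text, compare weighted biterm vectors) can be illustrated with a minimal Python sketch. This is not the authors' BDM implementation. The abstract does not give the formulas, so the sketch swaps in a smoothed inverse document frequency for the biterm topic model's global weights, a simple sum-to-one normalisation for the local importance, and cosine distance for the final comparison. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return the unordered word pairs (biterms) co-occurring in one short text."""
    unique = sorted(set(tokens))  # sort so each pair has one canonical form
    return list(combinations(unique, 2))


def global_weights(corpus):
    """Corpus-level biterm weights.

    ASSUMPTION: a smoothed inverse document frequency stands in for the
    biterm topic model that the paper uses for this step.
    """
    doc_freq = Counter(b for doc in corpus for b in extract_biterms(doc))
    n_docs = len(corpus)
    return {b: math.log((1 + n_docs) / (1 + c)) + 1.0 for b, c in doc_freq.items()}


def local_weights(tokens, gweights):
    """Context-aware weights for one text.

    ASSUMPTION: the global weights of the biterms present in the text are
    rescaled to sum to 1; the paper's local weighting may differ.
    """
    biterms = extract_biterms(tokens)
    total = sum(gweights.get(b, 0.0) for b in biterms)
    if total == 0:
        return {}
    return {b: gweights.get(b, 0.0) / total for b in biterms}


def biterm_distance(tokens_a, tokens_b, gweights):
    """Cosine distance between the weighted biterm vectors of two texts."""
    wa = local_weights(tokens_a, gweights)
    wb = local_weights(tokens_b, gweights)
    dot = sum(w * wb.get(b, 0.0) for b, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a text without biterms is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Given a tokenised corpus, `global_weights(corpus)` is computed once and then passed to `biterm_distance` for each pair of texts. In this simplified version, two texts that share no biterm are always at distance 1.0, which is exactly the sparseness problem the paper's feature enrichment is meant to reduce.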

Updated: 2020-04-13
