Coverage-based query subtopic diversification leveraging semantic relevance, Knowledge and Information Systems



Knowledge and Information Systems (IF 2.7), Pub Date: 2020-04-27, DOI: 10.1007/s10115-020-01470-3
Md. Shajalal, Masaki Aono

Users are generally reserved in describing their search intent when submitting queries to a search engine. As a result, many search queries are short, ambiguous, and open to multiple interpretations. Given the enormous size of the web, ignoring the information needs underlying such queries can misguide the search engine. An effective way to mitigate these issues is to diversify the search results by considering query subtopics with diverse intents. The task of identifying the possible subtopics with diverse intents underlying a query is known as subtopic mining. This paper aims at mining and diversifying the subtopics underlying a query. Our method first extracts noun phrases containing the query terms from the top-retrieved web documents. We also extract query suggestions and completions from commercial search engines. The extracted candidates that are highly related to the query are then selected as subtopics. We introduce a new relatedness score function to estimate the degree of relatedness between the query and a candidate. To estimate the relevance between the query and a subtopic, this paper introduces a semantic relevance measure using a locally trained sentence embedding model. Finally, we propose a novel coverage-based diversification technique that ranks the subtopics by combining their relevance with their coverage as estimated from the web documents. Experimental results on two NTCIR English subtopic mining datasets demonstrate that the proposed method achieves new state-of-the-art performance and significantly outperforms some known related methods in terms of the relevance (D-nDCG) and diversity (D#-nDCG) metrics at a cutoff of 10.
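
The abstract does not give the ranking formula, so the following is only a minimal sketch of what a coverage-based ranking of subtopics could look like, not the authors' method. It assumes each candidate subtopic already has a relevance score in [0, 1] and a set of IDs of the web documents it covers. The function name `rank_subtopics`, the `trade_off` parameter, and the linear combination of relevance and newly covered documents are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def rank_subtopics(
    relevance: Dict[str, float],
    covered_docs: Dict[str, Set[int]],
    trade_off: float = 0.5,
) -> List[str]:
    """Greedily order subtopics by relevance plus newly covered documents.

    Illustrative assumption: the score of a candidate is
    trade_off * relevance + (1 - trade_off) * novelty, where novelty is the
    fraction of all documents that only this candidate would newly cover.
    """
    all_docs: Set[int] = set().union(*covered_docs.values())
    remaining = set(relevance)
    seen_docs: Set[int] = set()
    ranking: List[str] = []

    while remaining:
        def gain(subtopic: str) -> float:
            # Documents this subtopic covers that no selected subtopic covers yet.
            new_docs = covered_docs.get(subtopic, set()) - seen_docs
            novelty = len(new_docs) / len(all_docs) if all_docs else 0.0
            return trade_off * relevance[subtopic] + (1 - trade_off) * novelty

        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        ranking.append(best)
        seen_docs |= covered_docs.get(best, set())
        remaining.remove(best)

    return ranking
```

With `trade_off=1.0` the sketch reduces to a plain relevance ordering; lower values push subtopics that cover documents not yet represented higher up the list, which is the intuition behind combining relevance with coverage.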


Last updated: 2020-04-27
