Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation
arXiv - CS - Information Retrieval Pub Date: 2021-05-03, DOI: arXiv-2105.00666
Soyeong Jeong, Jinheon Baek, ChaeHun Park, Jong C. Park

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which arises when the terms in queries and documents are lexically different but semantically similar. While recent work has proposed expanding queries or documents by enriching their representations with additional relevant terms to address this challenge, these approaches usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework built on a pre-trained language model, which generates diverse supplementary sentences for the original document without training on labeled query-document pairs. To make the generated sentences more diverse, we further stochastically perturb the input embeddings during generation. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.
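The core idea — perturbing a document's embedding with noise before decoding so that each draw yields a different supplementary sentence — can be illustrated with a minimal toy sketch. This is not the authors' implementation: the linear decoder, the toy vocabulary, and the function name `expand_document` are stand-ins for a real pre-trained language model, used only to show how stochastic perturbation diversifies the generated text.

```python
import numpy as np

def expand_document(doc_embedding, vocab, decoder_weights, n_sentences=3,
                    noise_std=0.1, sent_len=5, rng=None):
    """Toy sketch of stochastic document expansion.

    Each supplementary 'sentence' is produced by adding Gaussian noise to the
    document embedding and sampling tokens from a linear decoder over a toy
    vocabulary. Different noise draws give different sentences.
    """
    rng = rng or np.random.default_rng(0)
    sentences = []
    for _ in range(n_sentences):
        # Stochastically perturb the embedding (the UDEG idea, in miniature).
        z = doc_embedding + rng.normal(0.0, noise_std, size=doc_embedding.shape)
        tokens = []
        for _ in range(sent_len):
            logits = decoder_weights @ z            # (V,) scores over vocab
            probs = np.exp(logits - logits.max())   # stable softmax
            probs /= probs.sum()
            idx = rng.choice(len(vocab), p=probs)   # sample a token
            tokens.append(vocab[idx])
        sentences.append(" ".join(tokens))
    return sentences

# Usage: expand a toy "document" embedding into supplementary sentences,
# then append them to the original text before indexing.
rng = np.random.default_rng(42)
vocab = ["retrieval", "query", "document", "term", "match", "index"]
W = rng.normal(size=(len(vocab), 8))   # stand-in decoder weights
emb = rng.normal(size=8)               # stand-in document embedding
expansions = expand_document(emb, vocab, W, rng=rng)
for s in expansions:
    print(s)
```

In a real system the decoder would be a pre-trained generative language model and the expansion sentences would be concatenated with the original document so that a standard lexical retriever (e.g. BM25) can match query terms that the document itself never uses.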

Updated: 2021-05-04