Generation-Augmented Retrieval for Open-domain Question Answering
arXiv - CS - Information Retrieval. Pub Date: 2020-09-17, DOI: arxiv-2009.08553
Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, Weizhu Chen

Conventional sparse retrieval methods such as TF-IDF and BM25 are simple and efficient, but solely rely on lexical overlap without semantic matching. Recent dense retrieval methods learn latent representations to tackle the lexical mismatch problem, while being more computationally expensive and insufficient for exact matching as they embed the text sequence into a single vector with limited capacity. In this paper, we present Generation-Augmented Retrieval (GAR), a query expansion method that augments a query with relevant contexts through text generation. We demonstrate on open-domain question answering that the generated contexts significantly enrich the semantics of the queries and thus GAR with sparse representations (BM25) achieves comparable or better performance than the state-of-the-art dense methods such as DPR \cite{karpukhin2020dense}. We show that generating various contexts of a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. Furthermore, GAR achieves the state-of-the-art performance on the Natural Questions and TriviaQA datasets under the extractive setting when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.
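The pipeline described above (expand the query with generated contexts, retrieve with BM25 per expanded query, then fuse the rankings) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy corpus, the `generated_contexts` list (which in GAR comes from a seq2seq model trained to generate answers, sentences containing the answer, and passage titles), and the reciprocal-rank fusion rule are all assumptions made for the sketch.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each document against the query with standard Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                       # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy stand-in for a passage index (e.g. Wikipedia chunks).
corpus = [
    "the capital of france is paris",
    "berlin is the capital of germany",
    "paris hosted the summer olympics",
]
docs = [c.split() for c in corpus]

question = "what is the capital of france"
# Hypothetical generator outputs; GAR would produce these with a
# fine-tuned seq2seq model conditioned on the question alone.
generated_contexts = [
    "paris",
    "paris is the capital and largest city of france",
]

# Query expansion + retrieval + fusion: one BM25 ranking per
# generated context, combined by simple reciprocal-rank fusion.
fused = Counter()
for ctx in generated_contexts:
    expanded = (question + " " + ctx).split()
    scores = bm25_scores(expanded, docs)
    ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
    for rank, i in enumerate(ranking):
        fused[i] += 1.0 / (rank + 1)

best = max(fused, key=fused.get)
print(corpus[best])  # the passage answering the question ranks first
```

Fusing the per-context rankings rather than concatenating all contexts into one giant query mirrors the paper's observation that different generated contexts capture complementary evidence, so combining their result lists is more robust than any single expansion.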

Updated: 2020-10-27