Generation-Augmented Retrieval for Open-domain Question Answering
arXiv - CS - Information Retrieval Pub Date : 2020-09-17 , DOI: arxiv-2009.08553 Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, Weizhu Chen
Conventional sparse retrieval methods such as TF-IDF and BM25 are simple and
efficient, but solely rely on lexical overlap without semantic matching. Recent
dense retrieval methods learn latent representations to tackle the lexical
mismatch problem, while being more computationally expensive and insufficient
for exact matching as they embed the text sequence into a single vector with
limited capacity. In this paper, we present Generation-Augmented Retrieval
(GAR), a query expansion method that augments a query with relevant contexts
through text generation. We demonstrate on open-domain question answering that
the generated contexts significantly enrich the semantics of the queries and
thus GAR with sparse representations (BM25) achieves comparable or better
performance than the state-of-the-art dense methods such as DPR
(Karpukhin et al., 2020). We show that generating various contexts of a query
is beneficial as fusing their results consistently yields better retrieval
accuracy. Moreover, as sparse and dense representations are often
complementary, GAR can be easily combined with DPR to achieve even better
performance. Furthermore, GAR achieves the state-of-the-art performance on the
Natural Questions and TriviaQA datasets under the extractive setting when
equipped with an extractive reader, and consistently outperforms other
retrieval methods when the same generative reader is used.
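The core idea above — expanding the question with model-generated context before sparse retrieval — can be illustrated with a minimal sketch. The BM25 scorer below is a plain stdlib implementation (not the paper's code), and the generated context is a hypothetical stand-in for what a trained seq2seq generator would produce (e.g. a likely answer, passage title, or sentence):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` (lists of tokens) against the query
    with the standard Okapi BM25 formula."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    df = Counter()                       # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)                # term frequency within this doc
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# GAR-style query expansion: append the generated context to the question.
# `generated_context` here is a hand-written placeholder, not real model output.
question = "who wrote the origin of species"
generated_context = "charles darwin on the origin of species 1859"
expanded_query = (question + " " + generated_context).split()

corpus = [
    "on the origin of species was written by charles darwin in 1859".split(),
    "the species of birds in the galapagos islands vary widely".split(),
    "the origin of the universe is studied in cosmology".split(),
]
scores = bm25_scores(expanded_query, corpus)
best = max(range(len(corpus)), key=scores.__getitem__)  # index of top passage
```

The generated terms ("charles darwin", "1859") lexically match the relevant passage even though the original question never mentions them — this is how GAR lets BM25 bridge the lexical gap. Fusing results from several context types (answers, titles, sentences) amounts to running this retrieval once per expanded query and merging the ranked lists.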
Updated: 2020-10-27