Domain-matched Pre-training Tasks for Dense Retrieval
arXiv - CS - Information Retrieval. Pub Date: 2021-07-28. DOI: arxiv-2107.13602
Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, Yashar Mehdad

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a pre-existing dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
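The core setup the abstract describes, a bi-encoder that scores paired texts (question/passage or post/comment) in a shared vector space, is commonly trained with a contrastive objective using in-batch negatives. The sketch below is a minimal illustration of that style of pre-training, assuming a PyTorch/Hugging Face setup; the encoder checkpoint, sequence length, and toy pairs are assumptions for illustration, not the paper's actual configuration or scale.

```python
# Minimal sketch of bi-encoder pre-training with in-batch negatives.
# Checkpoint, max length, and data are illustrative assumptions, not the
# paper's exact setup (which pre-trains much larger models on 65M synthetic
# questions and 200M Reddit post-comment pairs).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed small encoder for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
query_encoder = AutoModel.from_pretrained(MODEL_NAME)
passage_encoder = AutoModel.from_pretrained(MODEL_NAME)


def encode(encoder, texts):
    """Encode a list of texts into dense vectors via the [CLS] token."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, hidden) [CLS] embeddings


def in_batch_contrastive_loss(queries, passages):
    """Each query's positive is its paired passage; every other passage in
    the batch serves as a negative (standard in-batch negatives)."""
    q = encode(query_encoder, queries)     # (B, H)
    p = encode(passage_encoder, passages)  # (B, H)
    scores = q @ p.t()                     # (B, B) dot-product similarities
    labels = torch.arange(q.size(0))       # diagonal entries are positives
    return F.cross_entropy(scores, labels)


# Toy pre-training pairs standing in for synthetic question/passage or
# Reddit post/comment data.
pairs = [
    ("who wrote the origin of species",
     "On the Origin of Species was written by Charles Darwin in 1859 ..."),
    ("what does a bi-encoder do",
     "A bi-encoder maps queries and passages into a shared vector space ..."),
]
queries, passages = zip(*pairs)
loss = in_batch_contrastive_loss(list(queries), list(passages))
loss.backward()  # an optimizer step would follow in a real training loop
```

At retrieval time, passage embeddings from the trained passage encoder would typically be indexed offline (e.g. with a nearest-neighbor index) and queries embedded on the fly, so relevance reduces to a dot-product search.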

Updated: 2021-07-30