WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset
arXiv - CS - Information Retrieval. Pub Date: 2019-12-04. DOI: arXiv:1912.01901
Jibril Frej, Didier Schwab, Jean-Pierre Chevallet

Over the past few years, deep learning methods have achieved new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are instead trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit for automatically building large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k and wikIRS78k: two large-scale publicly available datasets that each contain 78,628 queries and 3,060,191 (query, relevant document) pairs.
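To make the construction concrete, here is a minimal, hypothetical sketch of how a Wikipedia-based collection of (query, relevant document) pairs could be assembled. This is not the WIKIR implementation; it assumes, purely for illustration, that a query is derived from an article's title and that the article's own body counts as a relevant document for that query.

```python
def build_ir_dataset(articles):
    """Build a toy IR dataset from Wikipedia-style articles.

    articles: list of dicts with 'title' and 'body' keys.
    Returns three dicts: queries (query id -> query text),
    documents (document id -> document text), and qrels
    (query id -> list of relevant document ids).
    """
    queries, documents, qrels = {}, {}, {}
    for doc_id, article in enumerate(articles):
        # Hypothetical choice: the lowercased title serves as the query.
        queries[doc_id] = article["title"].lower()
        documents[doc_id] = article["body"]
        # The article's own body is marked relevant to its title-query.
        qrels.setdefault(doc_id, []).append(doc_id)
    return queries, documents, qrels


toy_articles = [
    {"title": "Information retrieval",
     "body": "Information retrieval is the activity of obtaining relevant resources."},
    {"title": "Wikipedia",
     "body": "Wikipedia is a free online encyclopedia."},
]
queries, documents, qrels = build_ir_dataset(toy_articles)
```

A real pipeline would additionally deduplicate, filter short articles, and split the pairs into train/validation/test sets; the sketch only shows the core pairing step.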

Updated: 2020-03-18