


QBSUM: A large-scale query-based document summarization dataset from real-world applications
Computer Speech & Language (IF 3.1), Pub Date: 2020-10-28, DOI: 10.1016/j.csl.2020.101166
Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, Weidong Guo

Query-based document summarization aims to extract or generate a summary of a document that directly answers, or is relevant to, a search query. It is an important technique that can benefit a variety of applications, such as search engines, document-level machine reading comprehension, and chatbots. Currently, datasets designed for query-based summarization are few in number, and those that exist are limited in both scale and quality. Moreover, to the best of our knowledge, there is no publicly available dataset for Chinese query-based document summarization. In this paper, we present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization. We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests. The QBSUM dataset is released to facilitate future advancement of this research field.


Updated: 2020-11-06
