Towards the benchmarking of question generation: introducing the Monserrate corpus
Language Resources and Evaluation (IF 2.7) Pub Date: 2021-06-03, DOI: 10.1007/s10579-021-09545-5
Hugo Rodrigues, Eric Nyberg, Luisa Coheur

Despite the growing interest in Question Generation, evaluating these systems remains notably difficult. Many authors rely on metrics like BLEU or ROUGE instead of manual evaluation, since these metrics are essentially free to compute. However, the corpora generally used as references are very incomplete, containing only a couple of hypotheses per source sentence. In this paper, we propose the Monserrate corpus, a dataset specifically built to evaluate Question Generation systems, with, on average, 26 questions associated with each source sentence, aiming to provide an "exhaustive" reference. With Monserrate we study the impact of reference size on the evaluation of Question Generation systems. Several evaluation metrics are used, from more traditional lexical ones to metrics based on word embeddings, and we conclude that metrics remain a limiting factor in evaluation, as they lead to different outcomes. Finally, with Monserrate, we benchmark three different Question Generation systems, representing different approaches to this task.
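As a rough illustration of why reference-set size matters (this is not the paper's own evaluation code), the sketch below scores a hypothetical generated question against a small and a larger multi-reference set using NLTK's sentence-level BLEU. Because multi-reference BLEU credits each n-gram against the best-matching reference, adding references can only keep or raise the score; the example questions and the smoothing choice are assumptions made purely for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated question (tokenized); invented for this example.
hypothesis = "who created the monserrate corpus".split()

# A sparse reference set, as in typical QG corpora (a couple of questions).
refs_small = [
    "what corpus is introduced in this paper".split(),
]

# A denser reference set, closer in spirit to Monserrate's ~26 questions
# per source sentence (these questions are also invented for illustration).
refs_large = refs_small + [
    "who created the monserrate dataset".split(),
    "which corpus was built to evaluate question generation".split(),
]

# Smoothing avoids zero scores when higher-order n-grams have no match.
smooth = SmoothingFunction().method1

print(sentence_bleu(refs_small, hypothesis, smoothing_function=smooth))
print(sentence_bleu(refs_large, hypothesis, smoothing_function=smooth))
# Clipped n-gram counts are taken against the best reference, so the
# larger, more "exhaustive" reference set yields a score at least as high.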




Updated: 2021-06-04