ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization
Language Resources and Evaluation (IF 2.7), Pub Date: 2020-06-25, DOI: 10.1007/s10579-020-09495-4
Nhi-Thao Tran, Minh-Quoc Nghiem, Nhung T. H. Nguyen, Ngan Luu-Thuy Nguyen, Nam Van Chi, Dien Dinh

Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs, a large and high-quality Vietnamese dataset for abstractive multi-document summarization. To that end, we recruited 29 annotators and enhanced MDSWriter, an open-source annotation tool, to support the annotators in creating gold standard summaries. As a result, ViMs has 600 summaries corresponding to 300 clusters of 1,945 documents. We have verified the reliability of our dataset by using a variety of metrics: conventional Cohen’s \(\kappa\), relaxed Cohen’s \(\kappa\) (a new metric that we propose to make it more suitable for abstractive summarization), and ROUGE scores. A relaxed \(\kappa\) score of 0.55 indicates that ViMs attains moderate agreement between annotators. Meanwhile, the ROUGE scores are 0.729 for ROUGE-1, 0.507 for ROUGE-2 and 0.524 for ROUGE-SU4. We have further evaluated ViMs by using three different summarization systems: TextRank, CFVi and MUSEEC. Their ROUGE-1 scores are 0.628, 0.711 and 0.732, respectively. These results show that the ViMs dataset is suitable for both training and evaluating multi-document summarization systems. We have made the dataset and evaluation results of this work publicly available to the research community. Unlike previous work that only published the final summarization dataset, we also publish intermediate annotation results, which can be used in other NLP problems such as sentence classification.
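For readers unfamiliar with the agreement metric named in the abstract, the following is a minimal Python sketch of conventional Cohen's \(\kappa\) for two annotators. It is not the authors' code. The function name `cohen_kappa`, the binary "sentence selected or not" labels, and the sample data are all assumptions made for illustration. The relaxed variant proposed in the paper is not defined in the abstract, so it is not reproduced here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Conventional Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical data: 1 = sentence selected for the summary, 0 = not selected.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

On this toy input the observed agreement is 0.75 and the chance agreement is 0.5, giving \(\kappa = 0.5\).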



Updated: 2020-07-24