DWIE: An entity-centric dataset for multi-task document-level information extraction
Information Processing & Management (IF 8.6). Pub Date: 2021-03-20. DOI: 10.1016/j.ipm.2021.102563
Klim Zaporojets, Johannes Deleu, Chris Develder, Thomas Demeester

This paper presents DWIE, the ‘Deutsche Welle corpus for Information Extraction’, a newly created multi-task dataset that combines four main Information Extraction (IE) annotation subtasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes the interactions and properties of conceptual entities at the level of the complete document. This contrasts with currently dominant mention-driven approaches, which start from the detection and classification of named entity mentions in individual sentences. Further, DWIE presents two main challenges for building and evaluating IE models. First, using traditional mention-level evaluation metrics for the NER and RE tasks on the entity-centric DWIE dataset can yield measurements dominated by predictions on more frequently mentioned entities. We tackle this issue by proposing a new entity-driven metric that takes into account the number of mentions that compose each of the predicted and ground-truth entities. Second, the document-level multi-task annotations require the models to transfer information between entity mentions located in different parts of the document, as well as between different tasks, in a joint learning setting. To realize this, we propose to use graph-based neural message passing techniques between document-level mention spans. Our experiments show an improvement of up to 5.5 F1 percentage points when incorporating neural graph propagation into our joint model. This demonstrates DWIE’s potential to stimulate further research in graph neural networks for representation learning in multi-task IE. We make DWIE publicly available at https://github.com/klimzaporojets/DWIE.
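To illustrate the evaluation issue the abstract raises, the sketch below contrasts a mention-level F1 (where every mention counts, so frequently mentioned entities dominate) with a simple entity-level F1 that counts each entity once. This is an illustrative simplification, not the paper's actual entity-driven metric (which weights by the mention counts of predicted and ground-truth entities); all function names and the toy data are hypothetical.

```python
from collections import Counter

def _f1(tp, n_pred, n_gold):
    # Standard F1 from true positives and the prediction/gold totals.
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def mention_level_f1(gold, pred):
    """F1 over (entity, type) mention annotations: every mention counts,
    so frequently mentioned entities dominate the score."""
    g, p = Counter(gold), Counter(pred)
    tp = sum(min(g[m], p[m]) for m in g)
    return _f1(tp, sum(p.values()), sum(g.values()))

def entity_level_f1(gold, pred):
    """F1 where each (entity, type) pair counts once, however many
    mentions the entity has in the document."""
    g, p = set(gold), set(pred)
    return _f1(len(g & p), len(p), len(g))

# Toy document mentioning "Berlin" nine times and "DW" once;
# the model mislabels only the rarely mentioned entity.
gold = [("Berlin", "LOC")] * 9 + [("DW", "ORG")]
pred = [("Berlin", "LOC")] * 9 + [("DW", "MISC")]
print(round(mention_level_f1(gold, pred), 2))  # 0.9
print(round(entity_level_f1(gold, pred), 2))   # 0.5
```

The gap between the two scores (0.9 vs. 0.5) shows how a mention-level metric can mask a model's failure on infrequently mentioned entities, which is the motivation for an entity-driven evaluation.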




Updated: 2021-03-21