
A Dataset of German Legal Documents for Named Entity Recognition
arXiv - CS - Information Retrieval Pub Date : 2020-03-29 , DOI: arxiv-2003.13016
Elena Leitner, Georg Rehm, and Julián Moreno-Schneider

We describe a dataset developed for Named Entity Recognition (NER) in German federal court decisions. It consists of approximately 67,000 sentences with over 2 million tokens. The resource contains 54,000 manually annotated entities, mapped to 19 fine-grained semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. The legal documents were, furthermore, automatically annotated with more than 35,000 TimeML-based time expressions. The dataset, which is available under a CC-BY 4.0 license in the CoNLL-2002 format, was developed for training an NER service for German legal documents in the EU project Lynx.
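Since the abstract says the dataset is distributed in the CoNLL-2002 format, a minimal sketch of reading such data may help. It assumes the usual layout of that format: one token per line, the tag in the last whitespace-separated column, BIO-style tags, and a blank line between sentences. The sample text and the tag names (`COURT`, `JUDGE`, `LAW`) are made up for illustration and are not taken from the dataset; the helper names `read_conll` and `count_entities` are likewise hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def count_entities(sentences: List[Sentence]) -> Counter:
    """Count entities per class by counting B- tags (one per entity)."""
    counts: Counter = Counter()
    for sentence in sentences:
        for _, tag in sentence:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Illustrative sample only; the tag names are assumptions, not the dataset's labels.
SAMPLE = """\
Der O
Bundesgerichtshof B-COURT
hat O
entschieden O
. O

Richter O
Müller B-JUDGE
verwies O
auf O
§ B-LAW
242 I-LAW
BGB I-LAW
. O
"""

sentences = read_conll(SAMPLE.splitlines())
entity_counts = count_entities(sentences)
```

To read an actual file, the same function could be passed an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded data was saved.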


Updated: 2020-03-31
