A Dataset of German Legal Documents for Named Entity Recognition
arXiv - CS - Information Retrieval Pub Date : 2020-03-29 , DOI: arxiv-2003.13016
Elena Leitner, Georg Rehm, Julián Moreno-Schneider

We describe a dataset developed for Named Entity Recognition in German federal court decisions. It consists of approx. 67,000 sentences with over 2 million tokens. The resource contains 54,000 manually annotated entities, mapped to 19 fine-grained semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. In addition, the legal documents were automatically annotated with more than 35,000 TimeML-based time expressions. The dataset, which is available under a CC-BY 4.0 license in the CoNLL-2002 format, was developed for training an NER service for German legal documents in the EU project Lynx.
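As an illustration of the CoNLL-2002 layout mentioned above, the following is a minimal Python sketch of a reader for such files. It assumes the common convention of one token per line, with the BIO tag in the last whitespace-separated column and a blank line between sentences. The sample text and the label names `COURT` and `LAW` are hypothetical placeholders, not the dataset's actual tag set, and `read_conll` and `extract_entities` are helper names introduced here for illustration only.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs

# Hypothetical sample in CoNLL-2002 style; labels are placeholders.
SAMPLE = """Der O
Bundesgerichtshof B-COURT
verweist O
auf O
§ B-LAW
823 I-LAW
BGB I-LAW
. O

Das O
Gericht O
entschied O
. O
"""


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group token/tag lines into sentences; blank lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) spans from BIO tags."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if words:
                entities.append((" ".join(words), label))
            words, label = [token], tag[2:]
        elif tag.startswith("I-"):
            words.append(token)
        else:
            # An "O" tag closes any open entity span.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


sentences = read_conll(SAMPLE.splitlines())
entities = extract_entities(sentences[0])
```

To read a real file instead of the sample, the same function can be given an open file handle, since it only iterates over lines. The column layout of the published files should be checked before relying on this sketch.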

Updated: 2020-03-31