Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision
arXiv - CS - Information Retrieval, Pub Date: 2020-03-27, DOI: arxiv-2003.12218
Xuan Wang, Xiangchen Song, Bangzheng Li, Yingjun Guan, Jiawei Han

We created the CORD-NER dataset by running comprehensive named entity recognition (NER) on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). CORD-NER covers 75 fine-grained entity types: in addition to the common biomedical entity types (e.g., genes, chemicals, and diseases), it covers many new entity types related explicitly to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates, and immune responses), which may benefit research on COVID-19-related viruses, spreading mechanisms, and potential vaccines. The CORD-NER annotation combines four sources with different NER methods. Its quality surpasses that of SciSpacy, a fully supervised BioNER tool, by over 10% in F1 score on a sample set of documents. Moreover, CORD-NER supports incrementally adding new documents, as well as adding new entity types when needed by supplying dozens of seeds as input examples. We will continue to update CORD-NER as the CORD-19 corpus is incrementally updated and our system improves.

Updated: 2020-04-17