A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes
arXiv - CS - Artificial Intelligence. Pub Date: 2022-04-13, DOI: arxiv-2204.06584
Dongxu Zhang, Sunil Mohan, Michaela Torkar, Andrew McCallum

We introduce ChemDisGene, a new dataset for training and evaluating multi-class multi-label document-level biomedical relation extraction models. Our dataset contains 80k biomedical research abstracts labeled with mentions of chemicals, diseases, and genes. Human experts labeled a portion of these abstracts with 18 types of biomedical relationships between these entities (intended for evaluation); the remainder (intended for training) has been distantly labeled via the CTD database with approximately 78% accuracy. In comparison to similar preexisting datasets, ours is both substantially larger and cleaner; it also includes annotations linking mentions to their entities. We also provide three baseline deep neural network relation extraction models trained and evaluated on our new dataset.
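To illustrate what distant labeling of document-level, multi-label relations can look like, here is a minimal sketch. It is not the authors' pipeline: the entity identifiers, the relation names, the `KB` dictionary, and the `distant_labels` function are all hypothetical placeholders, and the real CTD data has its own schema. The idea shown is only the general one: every pair of entities that co-occurs in an abstract receives all relations that a knowledge base records for that pair, whether or not the abstract actually states them, which is why such labels are only approximately accurate.

```python
from itertools import product

# Hypothetical knowledge-base entries: (subject_id, object_id) -> relation labels.
# Identifiers and label names are placeholders, not taken from CTD or ChemDisGene.
KB = {
    ("CHEM:1", "GENE:7"): {"increases_expression", "affects_binding"},
    ("CHEM:1", "DIS:3"): {"therapeutic"},
}


def distant_labels(doc_entities, kb):
    """Assign to each ordered entity pair in one document every relation the KB holds for it."""
    labels = {}
    for subj, obj in product(doc_entities, repeat=2):
        if subj != obj and (subj, obj) in kb:
            # Multi-label: a single pair may carry several relation types at once.
            labels[(subj, obj)] = set(kb[(subj, obj)])
    return labels


# Entities linked from the mentions of one (made-up) abstract.
doc = ["CHEM:1", "GENE:7", "DIS:9"]
result = distant_labels(doc, KB)
print(result)
```

In this toy document only the chemical-gene pair is found in the knowledge base, so it is the only pair that receives labels, and it receives two of them.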

Updated: 2022-04-13