A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing
Database: The Journal of Biological Databases and Curation (IF 3.4), Pub Date: 2020-12-01, DOI: 10.1093/database/baaa104
Diana Sousa, Andre Lamurias, Francisco M Couto

Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for fostering the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who the workers on crowdsourcing platforms are, how they work and in what context they engage. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers are asked only to rectify or discard already existing annotations, which makes the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. For Task 2, we also added an extra on-site rater and a domain expert to further assess the quality of the crowdsourcing validation. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with that of the original PGR dataset, as well as with combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
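
The two-task division described in the abstract can be pictured with a minimal sketch. It is not the authors' pipeline (that lives in the linked repository). It only assumes that relations are identified by ids, that the 70/30 split is random, and that the seven Task 2 judgments are combined by simple majority vote. The function names `split_tasks` and `majority_label`, the labels `"correct"` and `"incorrect"`, and the majority rule itself are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def split_tasks(relation_ids, task1_fraction=0.7, seed=0):
    """Shuffle relation ids and split them into Task 1 and Task 2.

    Assumption: the split is random; the abstract only states the
    70%/30% proportions, not how relations were assigned.
    """
    ids = list(relation_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * task1_fraction)
    return ids[:cut], ids[cut:]


def majority_label(votes):
    """Return the most frequent label among the worker votes.

    Assumption: majority voting is an illustrative aggregation rule,
    not necessarily the one used in the paper.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical usage: ten relations, seven worker votes for one of them.
task1, task2 = split_tasks(range(10))
votes = ["correct", "correct", "incorrect", "correct",
         "incorrect", "correct", "incorrect"]
print(len(task1), len(task2), majority_label(votes))
```

With an odd number of workers and two labels, a majority always exists, which is one practical reason to use seven raters in a setup like this sketch.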

Updated: 2020-12-01