Parallel sequence tagging for concept recognition
arXiv - CS - Machine Learning · Pub Date: 2020-03-16 · arXiv:2003.07424
Lenz Furrer (1 and 3), Joseph Cornelius (1), Fabio Rinaldi (1, 2, and 3) ((1) University of Zurich, Switzerland, (2) Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Switzerland, (3) Swiss Institute of Bioinformatics, Switzerland)

Background: Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined serially, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture in which both NER and NEN are modeled as sequence-labeling tasks operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. Results: We test our approach on the recent Version 4 of the CRAFT corpus. On all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task 2019. Conclusions: Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set, which allows a good trade-off between established knowledge (the training set) and novel information (unseen concepts). Availability and Implementation: Source code is freely available for download at https://github.com/OntoGene/craft-st. Supplementary data are available at arXiv online.
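To make the harmonisation step concrete, here is a minimal sketch in Python. It is a hypothetical illustration, not the authors' published merging strategies: it assumes a span tagger (NER) emitting IOB labels and a concept tagger (NEN) emitting per-token ontology IDs, and merges the two tag sequences token by token. The function name, merge policy, and ontology IDs below are all illustrative assumptions.

```python
def harmonise(ner_tags, nen_tags):
    """Merge per-token NER (IOB) and NEN (concept-ID) predictions into a
    single sequence of (iob, concept_id) pairs; "O" means no prediction.
    Illustrative policy only, not the paper's calibrated strategies."""
    merged = []
    for iob, concept in zip(ner_tags, nen_tags):
        if iob != "O" and concept != "O":
            # Both taggers fire on this token: keep span tag and concept ID.
            merged.append((iob, concept))
        elif iob != "O":
            # NER found a span but NEN offers no ID: keep the span,
            # mark the concept as unresolved.
            merged.append((iob, "UNKNOWN"))
        elif concept != "O":
            # NEN alone predicts a concept: trust it and synthesise an
            # IOB tag ("I" if it continues the previous token's concept).
            prev = merged[-1][1] if merged else None
            merged.append(("I" if prev == concept else "B", concept))
        else:
            merged.append(("O", "O"))
    return merged


if __name__ == "__main__":
    tokens = ["Bmp4", "gene", "expression", "increased"]
    ner = ["B", "O", "O", "O"]                       # span tagger output
    nen = ["O", "GO:0010467", "GO:0010467", "O"]     # concept tagger output
    for tok, (iob, cid) in zip(tokens, harmonise(ner, nen)):
        print(f"{tok}\t{iob}\t{cid}")
```

In this toy run, the span predicted by NER alone is kept with an unresolved ID, while the concept predicted by NEN alone is promoted to a full span; per the paper's conclusions, the actual preference between the two classifiers would be calibrated on a development set for each annotation set.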

Updated: 2020-08-11