Constructing Artificial Data for Fine-tuning for Low-Resource Biomedical Text Tagging with Applications in PICO Annotation
arXiv - CS - Computation and Language. Pub Date: 2019-10-21, DOI: arxiv-1910.09255
Gaurav Singh, Zahra Sabet, John Shawe-Taylor, James Thomas

Biomedical text tagging systems are plagued by the dearth of labeled training data. There have been recent attempts at using pre-trained encoders to deal with this issue. A pre-trained encoder provides a representation of the input text, which is then fed to task-specific layers for classification. The entire network is fine-tuned on the labeled data from the target task. Unfortunately, a low-resource biomedical task often has too few labeled instances for satisfactory fine-tuning. Also, if the label space is large, the training data contains few or no labeled instances for the majority of the labels. Most biomedical tagging systems treat labels as indexes, ignoring the fact that these labels are often concepts expressed in natural language, e.g., 'Appearance of lesion on brain imaging'. To address these issues, we propose constructing extra labeled instances using the label-text (i.e., the label's name) as input for the corresponding label-index (i.e., the label's index). In fact, we propose a number of strategies for manufacturing multiple artificial labeled instances from a single label. The network is then fine-tuned on a combination of real and these newly constructed artificial labeled instances. We evaluate the proposed approach on an important low-resource biomedical task called PICO annotation, which requires tagging raw text describing clinical trials with labels corresponding to different aspects of the trial, i.e., the PICO (Population, Intervention/Control, Outcome) characteristics of the trial. Our empirical results show that the proposed method achieves new state-of-the-art performance for PICO annotation, with significant improvements over competitive baselines.
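The core idea, pairing each label's name with its own index to obtain extra training instances, can be illustrated with a minimal sketch. The abstract does not spell out the individual construction strategies, so everything beyond "label text as input for the label index" is an assumption made here for illustration: the function name `build_artificial_instances`, the `max_ngram` parameter, and the word n-gram strategy are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_artificial_instances(
    label_names: List[str], max_ngram: int = 3
) -> List[Tuple[str, int]]:
    """Build (text, label_index) pairs from label names alone.

    Hypothetical illustration only; the paper's actual strategies
    are not described in the abstract.
    """
    instances: List[Tuple[str, int]] = []
    for index, name in enumerate(label_names):
        # Described in the abstract: the label's name is used as the
        # input text for the label's own index.
        instances.append((name, index))

        # Assumed strategy (not from the paper): contiguous word
        # n-grams of the label name, shorter than the full name,
        # also map to the same label index.
        words = name.split()
        longest = min(max_ngram, len(words) - 1)
        for n in range(2, longest + 1):
            for start in range(len(words) - n + 1):
                span = " ".join(words[start:start + n])
                instances.append((span, index))
    return instances


labels = ["Appearance of lesion on brain imaging", "Mortality"]
artificial = build_artificial_instances(labels)

# Real annotated instances (placeholder example) are combined with
# the artificial ones before fine-tuning.
real = [("Patients underwent MRI to assess lesions.", 0)]
training_set = real + artificial
```

Under these assumptions, a label that has no real annotated instances still contributes at least one training pair, which is the situation the abstract highlights for large label spaces.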

Updated: 2020-01-16