Truth Discovery in Sequence Labels from Crowds
arXiv - CS - Human-Computer Interaction Pub Date : 2021-09-09 , DOI: arxiv-2109.04470
Nasim Sabetpour, Adithya Kulkarni, Sihong Xie, Qi Li

Annotation quality and quantity positively affect the performance of sequence labeling, a vital task in Natural Language Processing. Hiring domain experts to annotate a corpus is very costly in terms of money and time. Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to assist with this purpose. However, annotations from these platforms are prone to human error because workers lack expertise; hence, one worker's annotations cannot be used directly to train a model. Existing literature on annotation aggregation focuses mostly on binary or multi-choice problems. Aggregating sequential labels on imbalanced datasets with complex dependencies between tokens remains challenging. To address this challenge, we propose an optimization-based method that infers the best set of aggregated annotations from the labels provided by workers. The proposed Aggregation method for Sequential Labels from Crowds ($AggSLC$) jointly considers the characteristics of sequential labeling tasks, workers' reliabilities, and advanced machine learning techniques. We evaluate $AggSLC$ on different crowdsourced datasets for Named Entity Recognition (NER) and biomedical Information Extraction (PICO), and on a simulated dataset. Our results show that the proposed method outperforms the state-of-the-art aggregation methods. To gain insight into the framework, we study the effectiveness of $AggSLC$'s components through ablation studies, evaluating our model in the absence of the prediction module and the inconsistency loss function. Theoretical analysis of our algorithm's convergence shows that the proposed $AggSLC$ halts after a finite number of iterations.
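
The general idea of weighting workers by estimated reliability can be illustrated with a minimal sketch. This is not $AggSLC$: it is a generic reliability-weighted voting loop, it treats every token independently (so it ignores the token dependencies the abstract highlights), and the function name `aggregate_labels`, the smoothed-accuracy weight, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_labels(annotations, n_iters=10):
    """Reliability-weighted voting over per-token labels (illustrative only).

    annotations: dict mapping worker id -> list of labels, one per token;
                 None marks a token the worker did not label.
    Returns (aggregated_labels, worker_weights).
    """
    workers = sorted(annotations)
    n_tokens = len(annotations[workers[0]])
    weights = {w: 1.0 for w in workers}  # start from plain majority voting
    truths = [None] * n_tokens

    for _ in range(n_iters):
        # Step 1: pick the label with the largest total worker weight per token.
        new_truths = []
        for t in range(n_tokens):
            votes = defaultdict(float)
            for w in workers:
                label = annotations[w][t]
                if label is not None:
                    votes[label] += weights[w]
            # sorted() makes tie-breaking deterministic (alphabetical order).
            new_truths.append(max(sorted(votes), key=votes.get) if votes else None)

        if new_truths == truths:  # aggregated labels stopped changing
            break
        truths = new_truths

        # Step 2: re-estimate each worker's weight as a smoothed accuracy
        # against the current aggregated labels (assumed heuristic).
        for w in workers:
            pairs = [(a, z) for a, z in zip(annotations[w], truths) if a is not None]
            wrong = sum(a != z for a, z in pairs)
            weights[w] = 1.0 - (wrong + 1) / (len(pairs) + 2)

    return truths, weights


# Toy NER-style example with three hypothetical workers.
crowd = {
    "w1": ["B-PER", "I-PER", "O", "O", "B-LOC"],
    "w2": ["B-PER", "I-PER", "O", "O", "O"],
    "w3": ["B-PER", "O", "O", "B-LOC", "B-LOC"],
}
labels, reliability = aggregate_labels(crowd)
```

Workers who disagree more often with the current aggregate receive smaller weights in the next round, which is the basic feedback loop behind truth-discovery methods; a sequence-aware method would additionally need to model dependencies between adjacent labels.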

Updated: 2021-09-13