Multicore based least confidence query sampling strategy to speed up active learning approach for named entity recognition
Computing (IF 3.3), Pub Date: 2021-08-28, DOI: 10.1007/s00607-021-01000-1
Ankit Agrawal 1, Manu Vardhan 1, Sarsij Tripathi 2

Large amounts of new data are now readily available to collect and store from many different sources. One of the main problems is labeling these data correctly for various machine learning applications. Active learning is a form of machine learning that is widely used to address this problem, because it significantly reduces the amount of labeled data required. It selects the most appropriate samples from the unlabeled data, has an oracle label them correctly, and passes them to the active learner for incremental training. Several query sampling strategies exist for selecting these samples. A major drawback of the active learning approach is that it is very time-consuming. This work therefore proposes a new multi-core-based algorithm that speeds up active learning by using all of the computational resources available in the system. The experiments were performed on named entity recognition, the task of labeling sequences of words in unstructured text by classifying them into predefined categories. The proposed algorithm is evaluated in terms of both performance and execution time on three named entity recognition corpora from distinct biomedical domains. The results show a considerable improvement in execution time for the proposed algorithm over the existing active learning approach.
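The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each unlabeled sample already has a list of label probabilities from the current model, scores it with least confidence (1 minus the highest probability), and spreads the scoring over worker processes. The names `least_confidence`, `score_chunk`, `select_queries`, and the `executor_cls` parameter are hypothetical. For named entity recognition the score would more plausibly use the probability of the best label sequence, which is also an assumption here.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import os


def least_confidence(probs):
    """Uncertainty of one sample: 1 minus the probability of its most likely label."""
    return 1.0 - max(probs)


def score_chunk(chunk):
    """Score a chunk of (index, probs) pairs; runs inside a single worker."""
    return [(idx, least_confidence(probs)) for idx, probs in chunk]


def select_queries(pool_probs, batch_size, workers=None,
                   executor_cls=ProcessPoolExecutor):
    """Return the indices of the batch_size least-confident unlabeled samples.

    Illustrative only: pool_probs is assumed to be a list of per-sample
    probability lists produced by the current model.
    """
    workers = workers or os.cpu_count() or 1
    indexed = list(enumerate(pool_probs))
    # Split the pool into one strided chunk per worker and drop empty chunks.
    chunks = [indexed[i::workers] for i in range(workers)]
    chunks = [c for c in chunks if c]
    with executor_cls(max_workers=workers) as executor:
        scored = [pair
                  for part in executor.map(score_chunk, chunks)
                  for pair in part]
    # Most uncertain first; ties are broken by original index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [idx for idx, _ in scored[:batch_size]]
```

With the default process-based executor, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Passing `executor_cls=ThreadPoolExecutor` runs the same logic in threads, which is convenient for quick checks but does not use multiple cores for this pure-Python scoring.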




Updated: 2021-08-29