Active Learning of Regular Expressions for Entity Extraction
IEEE Transactions on Cybernetics (IF 11.8), Pub Date: 2018-03-01, DOI: 10.1109/tcyb.2017.2680466
Alberto Bartoli , Andrea De Lorenzo , Eric Medvet , Fabiano Tarlao

We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this paper, we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system digs into the input text for selecting the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with genetic programming (GP) and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with GP and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters, and execution time over a state-of-the-art baseline.
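The query-selection idea in the abstract (pick the extraction on which the best candidate solutions disagree most) can be illustrated with a minimal sketch. This is not the authors' implementation: the committee here is a hand-written list of regular expressions rather than candidates evolved by GP, and the disagreement measure (how close the vote is to an even split) and the names `extractions` and `select_query` are assumptions made purely for illustration.

```python
import re
from collections import Counter


def extractions(pattern, text):
    """Return the set of non-empty (start, end) spans that `pattern` extracts."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def select_query(committee, text):
    """Pick the candidate span on which the committee disagrees most.

    Assumption for this sketch: disagreement is highest when the number of
    committee members extracting a span is closest to half the committee.
    Ties are broken by taking the earliest span in the text.
    """
    votes = Counter()
    for pattern in committee:
        votes.update(extractions(pattern, text))
    if not votes:
        return None  # no member extracts anything, so there is nothing to ask
    half = len(committee) / 2
    return min(sorted(votes), key=lambda span: abs(votes[span] - half))


# Hypothetical committee of three candidate extractors for "numbers".
committee = [r"\d+", r"\d{4}", r"\d{2,}"]
text = "Born 1987, aged 36, room 5."

span = select_query(committee, text)
if span is not None:
    start, end = span
    print(f"Should {text[start:end]!r} be extracted? (span {span})")
```

In an interactive loop, the user's yes/no answer for the selected span would be added to the annotated examples before the candidate extractors are regenerated; that loop is omitted here.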

Updated: 2018-03-01