On the Construction of Web NER Model Training Tool based on Distant Supervision
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8). Pub Date: 2020-11-25. DOI: 10.1145/3422817
Chien-Lung Chou, Chia-Hui Chang, Yuan-Hao Lin, Kuo-Chun Chien

Named entity recognition (NER) is an important task in natural language understanding, as it extracts the key entities (person, organization, location, date, number, etc.) and objects (product, song, movie, activity name, etc.) mentioned in texts. However, existing natural language processing (NLP) tools (such as Stanford NER) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a tool for NER model training is essential for low-resource language or entity information extraction. In this article, we study the problem of developing a tool that prepares a training corpus from the Web, given known seed entities, for custom NER model training via distant supervision. The major challenges of automatic labeling are the long labeling time caused by large corpora and seed lists, and the need to avoid false positive and false negative examples caused by short and long seeds. To address this, we adopt locality-sensitive hashing (LSH) for seed entities of various lengths. We conduct experiments on five entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements obtained with the proposed Web NER model construction tool. Because the training corpus is obtained by automatically labeling sentences related to the seed entities, one can use either the entire corpus or only the positive sentences for model training. Based on the experimental results, we found that this decision should depend on whether a traditional linear-chain conditional random field (CRF) or a deep neural network–based CRF is used for model training, as well as on the completeness of the provided seed list.
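The automatic labeling step can be pictured with a small sketch. The Python snippet below is not the authors' tool; it only illustrates how sentences containing a known seed entity can be turned into BIO-tagged training examples under distant supervision. The seed list, example sentences, and the FOOD tag are hypothetical, and the LSH indexing that the paper uses to keep matching fast for large corpora and long seed lists is omitted here.

def bio_label(sentence, seeds, tag="FOOD"):
    """Return (chars, labels) with longest-match BIO tags over seed entities."""
    labels = ["O"] * len(sentence)
    sorted_seeds = sorted(seeds, key=len, reverse=True)  # prefer longer seeds at each position
    i = 0
    while i < len(sentence):
        for seed in sorted_seeds:
            if sentence.startswith(seed, i):
                labels[i] = "B-" + tag
                for j in range(i + 1, i + len(seed)):
                    labels[j] = "I-" + tag
                i += len(seed)
                break
        else:
            i += 1  # no seed starts here; move on
    return list(sentence), labels

# Hypothetical seed entities (food names) and Web sentences.
seeds = {"牛肉麵", "珍珠奶茶"}
corpus = ["我昨天吃了牛肉麵", "他喜歡打籃球"]

# Keep only the "positive" sentences (those containing at least one seed),
# one of the two corpus-selection strategies compared in the article.
positive = [bio_label(s, seeds) for s in corpus if any(x in s for x in seeds)]
for chars, labels in positive:
    print(list(zip(chars, labels)))

In the full tool, the candidate seeds for each sentence would come from an LSH index rather than a linear scan of the sorted seed list; per the abstract, that is what keeps labeling time manageable when both the corpus and the seed list are large.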

Updated: 2020-11-25