Adaptive Name Entity Recognition under Highly Unbalanced Data
arXiv - CS - Computation and Language Pub Date : 2020-03-10 , DOI: arxiv-2003.10296
Thong Nguyen, Duy Nguyen, Pramod Rao

Named Entity Recognition (NER) plays an important role in many Natural Language Processing (NLP) applications, such as Information Extraction, Sentiment Analysis, and chatbots, because it identifies entities in text and categorizes them into predefined groups such as the names of persons, locations, quantities, organizations, or percentages. In this report, we present our experiments with a neural architecture composed of a Conditional Random Field (CRF) layer stacked on top of a Bi-directional LSTM (Bi-LSTM) layer for solving NER tasks. In addition, we employ a fused input of embedding vectors (GloVe, BERT), pre-trained on large corpora, to boost the generalization capacity of the model. Unfortunately, because the class distribution across the training data is heavily unbalanced, both approaches performed poorly on classes with few training samples. To overcome this challenge, we introduce an add-on classification model that splits sentences into two different sets, Weak and Strong classes, and then design a pair of Bi-LSTM-CRF models tailored to each set to optimize performance. We evaluated our models on the test set and found that our method significantly improves performance for the Weak classes, even though they correspond to a very small portion of the data (approximately 0.45%) compared to the remaining classes.
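The abstract describes a Bi-LSTM-CRF tagger fed with fused GloVe and BERT embeddings, plus an add-on classifier that routes sentences to either a Weak-class or a Strong-class tagger. Below is a minimal sketch of that setup, assuming PyTorch and the third-party pytorch-crf package; the layer dimensions, the pre-computed per-token embedding inputs, and the `route_and_tag` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Bi-LSTM-CRF tagger with fused GloVe + BERT inputs.
# Assumptions (not from the paper): PyTorch, the `pytorch-crf` package,
# pre-computed token embeddings, and illustrative dimensions.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, glove_dim=300, bert_dim=768, hidden_dim=256, num_tags=9):
        super().__init__()
        # Fused input: concatenation of GloVe and BERT token vectors.
        self.lstm = nn.LSTM(glove_dim + bert_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emission = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, glove_vecs, bert_vecs, tags=None, mask=None):
        # glove_vecs: (batch, seq, glove_dim); bert_vecs: (batch, seq, bert_dim)
        fused = torch.cat([glove_vecs, bert_vecs], dim=-1)
        hidden, _ = self.lstm(fused)
        emissions = self.emission(hidden)
        if tags is not None:
            # Training: negative log-likelihood under the CRF layer.
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        # Inference: Viterbi decoding of the best tag sequence.
        return self.crf.decode(emissions, mask=mask)


def route_and_tag(glove_vecs, bert_vecs, is_weak, weak_tagger, strong_tagger):
    # Hypothetical routing step: an add-on classifier (not shown) decides
    # whether a sentence belongs to the Weak set, and the matching tagger runs.
    tagger = weak_tagger if is_weak else strong_tagger
    return tagger(glove_vecs, bert_vecs)


# Toy usage with random tensors standing in for real embeddings.
model = BiLSTMCRFTagger()
glove = torch.randn(2, 12, 300)
bert = torch.randn(2, 12, 768)
tags = torch.randint(0, 9, (2, 12))
loss = model(glove, bert, tags)   # training loss
pred = model(glove, bert)         # decoded tag sequences
```

The two-model design follows the abstract's idea of optimizing each Bi-LSTM-CRF separately for the Weak and Strong sentence sets; how the routing classifier itself is built is not specified there.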

Updated: 2020-03-24