Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition, Journal of Intelligent & Fuzzy Systems

Journal of Intelligent & Fuzzy Systems (IF 2), Pub Date: 2021-05-29, DOI: 10.3233/jifs-202286
Yingwen Fu, Nankai Lin, Xiaotian Lin, Shengyi Jiang

Named entity recognition (NER) is a fundamental task in natural language processing (NLP). Most state-of-the-art NER research is based on pre-trained language models (PLMs) or classic neural models. However, this research mainly targets high-resource languages such as English, whereas for Indonesian the related resources (both datasets and technology) are not yet well developed. Moreover, affixation is an important part of word formation in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. In this paper, we address the Indonesian NER task by building an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources for Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER that extracts sequence features from different perspectives. Specifically, we use an enhanced convolutional structure and an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.
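
As a rough illustration of why character-level information can matter for an affix-rich language, the sketch below extracts prefix and suffix character n-grams from tokens. This is not the HSA model or any part of the paper's method; the function name `affix_features`, the `max_len` parameter, and the example tokens are assumptions made purely for illustration.

```python
def affix_features(token, max_len=3):
    """Return prefix and suffix character n-grams of a token (hypothetical helper).

    Illustrative hand-crafted features only; the paper instead learns
    character features with convolutional and attention structures.
    """
    word = token.lower()
    feats = {}
    for n in range(1, max_len + 1):
        # Skip n-grams that would cover the whole word.
        if len(word) > n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


# Example tokens (assumed): "membaca" carries the prefix "mem-" on the root
# "baca", and "makanan" carries the suffix "-an" on the root "makan".
tokens = ["Presiden", "membaca", "makanan", "di", "Jakarta"]
for tok in tokens:
    print(tok, affix_features(tok))
```

Surface cues such as the "mem-" prefix or the "-an" suffix are visible only below the token level, which is one plausible reason a model would combine character and token features.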

Updated: 2021-06-03
