An Improved NER Methodology to the Portuguese Language, Mobile Networks and Applications



An Improved NER Methodology to the Portuguese Language
Mobile Networks and Applications (IF 2.3), Pub Date: 2020-09-08, DOI: 10.1007/s11036-020-01644-x
Rogerio de Aquino Silva, Luana da Silva, Moisés Lima Dutra, Gustavo Medeiros de Araujo

The text mining process typically involves applying natural language processing (NLP) techniques to obtain important information and extract insights from texts. This is achieved by detecting patterns that are not explicitly known a priori in an unstructured or semi-structured dataset. One of the most significant NLP tasks is Named Entity Recognition (NER). The NER process seeks to extract the entities mentioned in a natural-language text and classify them into categories. These categories are predefined and can be names of people or organizations, locations, dates, monetary values, specific codes, etc. A wide range of algorithms based on the LSTM (Long Short-Term Memory) architecture has been proposed to enhance NER accuracy. However, a key component of successful information extraction is the corpora used for NER training. Another key issue concerns the language being worked on, since the vast majority of algorithms were designed to work with English. According to the literature, while NER applied to English reaches about 90% accuracy, NER applied to Portuguese reaches a maximum of 83.38%. This paper proposes a methodology to improve Portuguese NER that uses journalistic corpora as the basis for training. We believe journalistic writing adheres most closely to the contemporary usage of any language, since it preserves features such as objectivity, simplicity, and impartiality, and serves as a reference for conveying information without ambiguity. The proposed methodology provides a model to extract entities and assesses the obtained results using Recurrent Neural Network architectures. To the best of our knowledge, the proposed methodology applied to the Portuguese language not only surpasses the average accuracy found in the literature, increasing it from 83.38% to 85.64%, but could also decrease the computational costs of NER processing tasks.
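To make the task described in the abstract concrete, the following is a minimal sketch of what NER output and its scoring might look like. It is not the authors' method: the BIO tagging scheme, the label names (PER, LOC, ORG, DATE), the example sentence, the helper names `bio_to_spans` and `token_accuracy`, and the choice of token-level accuracy as the metric are all assumptions made for illustration, since the abstract does not state how tags are encoded or how accuracy is computed.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) pairs.

    Assumed scheme: "B-X" opens an entity of category X, "I-X" continues it,
    and "O" marks a token outside any entity.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" or a stray "I-" tag: close any open entity and skip the token.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical Portuguese sentence: "Maria Silva visited Lisbon in 2020".
tokens = ["Maria", "Silva", "visitou", "Lisboa", "em", "2020"]
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE"]  # one wrong tag

print(bio_to_spans(tokens, gold))
print(f"token accuracy: {token_accuracy(gold, pred):.2%}")
```

Under these assumptions, a single mislabeled token out of six lowers token accuracy to about 83%. Published NER results are often reported at the entity level instead, so figures computed this way are not directly comparable with the percentages quoted in the abstract.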


Updated: 2020-09-08
