Text normalization for named entity recognition in Vietnamese tweets.
Computational Social Networks Pub Date : 2016-12-01 , DOI: 10.1186/s40649-016-0032-0
Vu H Nguyen 1 , Hien T Nguyen 1 , Vaclav Snasel 2

Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER on tweets is a challenging task. Many approaches have been proposed to deal with this problem in tweets written in English, German, Chinese, etc., but none for Vietnamese tweets. We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine learning algorithm is employed to learn a classifier using six different types of features. We train our method on a training set consisting of more than 40,000 named entities and evaluate it on a testing set consisting of 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
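
To make the normalization step more concrete, the following is a minimal sketch of spelling correction with the standard Sørensen-Dice coefficient over character bigrams. It is not the authors' improved variant, which the abstract does not describe. The function names (`bigrams`, `dice`, `correct`), the toy lexicon, and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(word):
    """Return the multiset of character bigrams of a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(a, b):
    """Standard Dice coefficient: 2 * |A and B| / (|A| + |B|) over bigrams."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both words are too short to have bigrams; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def correct(token, lexicon, threshold=0.5):
    """Return the most similar lexicon entry, or the token unchanged.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    best = max(lexicon, key=lambda w: dice(token, w), default=token)
    return best if dice(token, best) >= threshold else token


# Hypothetical lexicon; a real system would use a Vietnamese vocabulary.
lexicon = ["nguyen", "nguyet", "hanoi"]
print(correct("nguyeen", lexicon))
print(correct("xyz", lexicon))
```

A token that shares most of its bigrams with a lexicon entry is replaced by that entry, while a token with no close match is left as it is, so unknown words such as new named entities are not overwritten.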

Updated: 2016-12-01