Using machine learning to build POS tagger for under-resourced language: the case of Somali, International Journal of Information Technology



Using machine learning to build POS tagger for under-resourced language: the case of Somali
International Journal of Information Technology, Pub Date: 2020-06-03, DOI: 10.1007/s41870-020-00480-2
Siraj Mohammed

Part-of-speech (POS) tagging is a preliminary task for many NLP applications. It is the process of classifying words into their parts of speech (also known as word classes or lexical categories). Somali is a Cushitic language with only a limited number of NLP tools available. An accurate and reliable POS tagger is essential for many NLP tasks, such as shallow parsing, dependency parsing, sentiment analysis, and named entity recognition. In this paper, we present a statistical POS tagger for the Somali language built with different machine learning approaches (i.e., hidden Markov model (HMM) and conditional random field (CRF)) and a neural network model. Our Somali POS tagger outperforms the state-of-the-art POS tagger by 87.51% on tenfold cross-validation. The key contributions of this paper are (1) building a generic POS tagger, (2) comparing its performance with existing state-of-the-art techniques, and (3) exploring the use of word embeddings for Somali POS tagging.
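To make the HMM approach named in the abstract concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the paper's implementation: the toy corpus is a hypothetical English placeholder (not Somali data), add-one smoothing is an assumed choice, and the names `TRAIN`, `train_hmm`, `log_prob`, and `viterbi` are illustrative only.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus (English placeholder, NOT Somali data).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # previous tag -> counts of next tag
    emit = defaultdict(Counter)   # tag -> counts of emitted word
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability (smoothing is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words
    # Each column maps tag -> (best log score, backpointer to previous tag).
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

trans, emit, vocab = train_hmm(TRAIN)
print(viterbi(["the", "dog", "sleeps"], trans, emit, vocab))
# prints ['DET', 'NOUN', 'VERB']
```

In a tenfold cross-validation setup like the one the abstract mentions, a tagger of this kind would be trained on nine folds of the tagged corpus and scored on the held-out fold, with the result averaged over the ten splits.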


Updated: 2020-06-03
