Sentiment Analysis of Sinhala News Comments
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2021-05-26, DOI: 10.1145/3445035
Surangika Ranathunga, Isuru Udara Liyanage

Sinhala is a low-resource language, for which basic language and linguistic tools have not been properly defined. This affects the development of NLP-based end-user applications for Sinhala. Thus, when implementing NLP tools such as sentiment analyzers, we have to rely only on language-independent techniques. This article presents the use of such language-independent techniques in implementing a sentiment analysis system for Sinhala news comments. We demonstrate that for low-resource languages such as Sinhala, the use of recently introduced word embedding models as semantic features can compensate for the lack of well-developed language-specific linguistic or language resources, and text classification with acceptable accuracy is indeed possible using both traditional statistical classifiers and Deep Learning models. The developed classification models, a corpus of 8.9 million tokens extracted from Sinhala news articles and user comments, and Sinhala Word2Vec and fastText word embedding models are now available for public use; 9,048 news comments annotated with POSITIVE/NEGATIVE/NEUTRAL polarities have also been released.
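
The idea of using word embeddings as language-independent features can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding values, the English placeholder tokens, and the helper names (`comment_vector`, `train_centroids`, `predict`) are all hypothetical, and a simple nearest-centroid classifier stands in for the statistical and Deep Learning models the abstract mentions. In practice the vectors would come from a trained Word2Vec or fastText model rather than a hand-written table.

```python
from collections import defaultdict
from math import sqrt

# Toy 3-dimensional embeddings: hypothetical values for illustration only.
# A real system would load vectors from a trained Word2Vec or fastText model.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
    "report": [0.0, 0.9, 0.1],
    "news": [0.0, 0.8, 0.2],
}
DIM = 3


def comment_vector(tokens):
    """Average the embeddings of known tokens; zero vector if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled):
    """Compute one mean feature vector per polarity label."""
    grouped = defaultdict(list)
    for tokens, label in labelled:
        grouped[label].append(comment_vector(tokens))
    return {
        label: [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]
        for label, vecs in grouped.items()
    }


def predict(centroids, tokens):
    """Return the label whose centroid is most similar to the comment vector."""
    vec = comment_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Tiny hand-made training set using the three polarities named in the abstract.
TRAIN = [
    (["good", "news"], "POSITIVE"),
    (["great", "report"], "POSITIVE"),
    (["bad", "news"], "NEGATIVE"),
    (["awful", "report"], "NEGATIVE"),
    (["news", "report"], "NEUTRAL"),
]
centroids = train_centroids(TRAIN)
```

Calling `predict(centroids, ["great"])` on this toy data selects the POSITIVE centroid. Nothing in the pipeline depends on language-specific tools such as a parser or a sentiment lexicon, which is the property the abstract relies on; only the tokenizer and the embedding table would change for Sinhala.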

Updated: 2021-05-26