Low resource language specific pre-processing and features for sentiment analysis task
Language Resources and Evaluation (IF 1.7), Pub Date: 2021-06-02, DOI: 10.1007/s10579-021-09541-9
Loitongbam Sanayai Meetei, Thoudam Doren Singh, Samir Kumar Borgohain, Sivaji Bandyopadhyay

Sentiment analysis is a classification task in which the polarity of textual data is identified, i.e. determining whether a sentence or document expresses a negative, positive or neutral sentiment. Manipuri is a less privileged, highly agglutinative and tonal language. Although it is a scheduled language of the Indian Constitution, it remains a resource-constrained language. In this work, we report on sentiment analysis for Manipuri using different types of machine learning based approaches. The dataset used in our work is collected from a local daily newspaper. The novelty of this work is that we carry out language-specific pre-processing tasks such as transliteration, building a negative morpheme based lexicon and filtering noisy words. Using the outputs of these tasks as additional linguistic features in our models improves the classification results in terms of precision, recall and F-score. Ensemble voting over the best three TF-IDF based classifiers performs better than BM25 based classifiers and other stand-alone classifiers. Based on this result, we attempt to classify the sentiment of news articles over a certain period of time. Further, we report the findings of deep learning based approaches on the same dataset.
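
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of TF-IDF weighting, a count feature derived from a negative-morpheme lexicon, and hard majority voting over classifier outputs. It is an illustration under stated assumptions, not the authors' implementation: the function names (`tfidf_vectors`, `negation_count`, `majority_vote`), the smoothed IDF formula, the suffix-matching rule and the placeholder suffixes are all hypothetical choices, and the abstract does not specify which classifiers or lexicon entries were used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. This is one common variant,
    assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t]
                        for t, c in term_counts.items()})
    return vectors


def negation_count(tokens, negative_suffixes):
    """Count tokens ending in a suffix from a negative-morpheme lexicon.

    Suffix matching is an assumed, simplified stand-in for the
    lexicon-based feature mentioned in the abstract.
    """
    suffixes = tuple(negative_suffixes)
    return sum(1 for tok in tokens if tok.endswith(suffixes))


def majority_vote(predictions):
    """Hard voting: return the label chosen by the most classifiers.

    On a tie, the label that appears first in `predictions` wins.
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Toy English tokens and a placeholder suffix, not real Manipuri data.
    docs = [["good", "news"], ["bad", "news"]]
    print(tfidf_vectors(docs))
    print(negation_count(["walkxx", "runxx", "sit"], ["xx"]))
    print(majority_vote(["positive", "negative", "positive"]))
```

In a fuller pipeline, the negation count would be appended to each document's TF-IDF vector as an extra feature, and `majority_vote` would be applied per document to the labels produced by the three best-performing classifiers.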



