Abstract
Parts-of-speech tagging is the process of labelling each word in a sentence with its corresponding part of speech. It is a common pre-processing step for many natural language processing tasks. Early approaches relied on simple heuristics; later methods reported in the literature incorporated machine learning techniques such as artificial neural networks. More recently, advances in deep learning have made parts-of-speech tagging more accurate, and a reasonable number of taggers are now available for high-resource languages such as English. Low-resource languages such as Malayalam, however, still lack computationally efficient and accurate methods for parts-of-speech tagging. To address this gap, this work proposes a deep learning-based approach to parts-of-speech tagging for the Malayalam language. Experiments conducted on real datasets show that the proposed method outperforms some existing methods in terms of precision and accuracy.
Acknowledgements
The authors would like to thank all the researchers and staff members of the Data Engineering Lab for their constructive comments and feedback, which significantly improved the quality of this paper. The authors also acknowledge the people who provided the tagged Malayalam dataset used in this work.
Cite this article
Akhil, K.K., Rajimol, R. & Anoop, V.S. Parts-of-Speech tagging for Malayalam using deep learning techniques. Int. j. inf. tecnol. 12, 741–748 (2020). https://doi.org/10.1007/s41870-020-00491-z
DOI: https://doi.org/10.1007/s41870-020-00491-z