Abstract
In this paper we discuss the current state of the art in part-of-speech (POS) tagging for Polish. We introduce the problem of POS tagging and point out the key issues in tagging inflected languages, which make the task more difficult for Polish than for, e.g., English. We also discuss the most important language resources connected with POS tagging, as well as the task of morphological analysis, which is commonly used as a preliminary step in tagging. We describe the methods that have been applied to POS tagging for Polish to date and discuss the most recent, neural-network-based methods in more detail. Finally, we conclude with a general view of the field in the context of Polish and discuss possible future research directions.
© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland