
Subwords-Only Alternatives to fastText for Morphologically Rich Languages


Abstract

In this work, we present purely subword-based alternatives to the fastText word embedding algorithm. The alternatives are modifications of the original fastText model that rely on subword information only, eliminating the reliance on word-level vectors and at the same time dramatically reducing the size of the embeddings. The proposed models differ in their subword extraction method: character n-grams, suffixes, and byte-pair encoding (BPE) units. We test the models on morphological analysis and lemmatization for three morphologically rich languages: Finnish, Russian, and German. The results are compared with other recent subword-based models, demonstrating consistently higher accuracy.
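To make the core idea concrete, below is a minimal sketch (not the authors' implementation; the n-gram range, hashing trick, and dimensionality are assumptions borrowed from common fastText practice) of computing a word embedding from subword vectors alone, with no word-level vector:

    import numpy as np

    DIM = 300          # embedding dimensionality (assumed)
    BUCKETS = 200_000  # hashed subword table size (assumed)
    subword_vectors = np.random.rand(BUCKETS, DIM)  # stand-in for trained vectors

    def char_ngrams(word, n_min=3, n_max=6):
        # fastText-style n-grams over the word wrapped in '<' and '>' boundary markers.
        w = f"<{word}>"
        return [w[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(w) - n + 1)]

    def embed(word):
        # Subwords-only embedding: average the vectors of the word's n-grams.
        idx = [hash(g) % BUCKETS for g in char_ngrams(word)]
        return np.mean(subword_vectors[idx], axis=0)

In original fastText the same average additionally includes a dedicated whole-word vector; dropping that word-level table is what shrinks the model.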


Notes

  1. http://morpho.aalto.fi/projects/morpho/

  2. https://github.com/google/sentencepiece (see the usage sketch after these notes)

  3. https://github.com/ispras-texterra/babylondigger
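
For the BPE units, note 2 above points to Google's SentencePiece toolkit. A minimal usage sketch (the corpus path, model prefix, and vocabulary size are placeholders, not the paper's settings):

    import sentencepiece as spm

    # Train a BPE model on a raw-text corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe",
        vocab_size=100_000, model_type="bpe")

    # Segment a word into BPE subword units.
    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    print(sp.encode("epäjärjestelmällisyys", out_type=str))
    # e.g. ['▁epä', 'järjestelmä', 'llisyys'] (the split depends on the training corpus)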

REFERENCES

  1. Pinter, Y., Guthrie, R., and Eisenstein, J., Mimicking word embeddings using subword RNNs, Proc. 2017 Conf. on Empirical Methods in Natural Language Processing, Copenhagen, 2017, pp. 102–112.

  2. Schick, T. and Schütze, H., Attentive mimicking: better word embeddings by attending to informative contexts, Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, vol. 1, pp. 489–494.

  3. Zhao, J., Mudgal, S., and Liang, Y., Generalizing word embeddings using bag of subwords, Proc. 2018 Conf. on Empirical Methods in Natural Language Processing, Brussels, 2018, pp. 601–606.

  4. Sasaki, S., Suzuki, J., and Inui, K., Subword-based compact reconstruction of word embeddings, Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, vol. 1, pp. 3498–3508.

  5. Heinzerling, B. and Strube, M., BPEmb: tokenization-free pre-trained subword embeddings in 275 languages, Proc. 11th Int. Conf. on Language Resources and Evaluation (LREC 2018), Miyazaki, 2018.

  6. Zhu, Y., Vulić, I., and Korhonen, A., A systematic study of leveraging subword information for learning word representations, Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, vol. 1, pp. 912–932.

  7. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T., Enriching word vectors with subword information, Trans. Assoc. Comput. Linguist., 2017, vol. 5, pp. 135–146.

  8. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T., Learning word vectors for 157 languages, Proc. 11th Int. Conf. on Language Resources and Evaluation (LREC 2018), Miyazaki, 2018.

  9. Shibata, Y., et al., Byte pair encoding: a text compression scheme that accelerates pattern matching, Tech. Rep., Kyushu Univ.: Dep. of Informatics, 1999, no. DOI-TR-161.

  10. Pennington, J., Socher, R., and Manning, C.D., GloVe: global vectors for word representation, Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), Doha, 2014, pp. 1532–1543.

  11. Üstün, A., Kurfalı, M., and Can, B., Characters or morphemes: how to represent words?, Proc. 3rd Workshop on Representation Learning for NLP, Melbourne, 2018, pp. 144–153.

  12. Devlin, J., et al., BERT: pre-training of deep bidirectional transformers for language understanding, Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019, vol. 1, pp. 4171–4186.

  13. Mikolov, T., et al., Advances in pre-training distributed word representations, Proc. 11th Int. Conf. on Language Resources and Evaluation (LREC 2018), Miyazaki, 2018.

  14. Zhu, Y., et al., On the importance of subword information for morphological tasks in truly low-resource languages, Proc. 23rd Conf. on Computational Natural Language Learning (CoNLL), Hong Kong, 2019, pp. 216–226.

  15. Zeman, D., et al., CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies, Proc. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, 2018, pp. 1–21.

  16. Rybak, P. and Wróblewska, A., Semi-supervised neural system for tagging, parsing and lemmatization, Proc. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, 2018, pp. 45–54.

  17. Srivastava, R.K., Greff, K., and Schmidhuber, J., Highway networks, 2015, arXiv:1505.00387.

  18. Kingma, D.P. and Ba, J., Adam: a method for stochastic optimization, Proc. 3rd Int. Conf. on Learning Representations (ICLR 2015), San Diego, 2015.

  19. Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., Tyers, F., Badmaeva, E., Gokirmak, M., Nedoluzhko, A., Cinková, S., Hajič, J., Jr., Hlaváčová, J., Kettnerová, V., Urešová, Z., Kanerva, J., Ojala, S., Missilä, A., Manning, C.D., Schuster, S., Reddy, S., Taji, D., Habash, N., Leung, H., de Marneffe, M.-C., Sanguinetti, M., Simi, M., Kanayama, H., de Paiva, V., Droganova, K., Alonso, H.M., Çöltekin, Ç., Sulubacak, U., Uszkoreit, H., Macketanz, V., Burchardt, A., Harris, K., Marheinecke, K., Rehm, G., Kayadelen, T., Attia, M., Elkahky, A., Yu, Z., Pitler, E., Lertpradit, S., Mandl, M., Kirchner, J., Alcalde, H.F., Strnadová, J., Banerjee, E., Manurung, R., Stella, A., Shimada, A., Kwak, S., Mendonça, G., Lando, T., Nitisaroj, R., and Li, J., CoNLL 2017 shared task: multilingual parsing from raw text to universal dependencies, in Proc. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, 2017, pp. 1–19.

  20. Boguslavsky, I., SynTagRus – a deeply annotated corpus of Russian, in Les émotions dans le discours / Emotions in Discourse, 2014, pp. 367–380.

  21. Haverinen, K., et al., Building the essential resources for Finnish: the Turku dependency treebank, Lang. Res. Eval., 2014, vol. 48, no. 3, pp. 493–531.

  22. Foth, K., Köhn, A., Beuck, N., and Menzel, W., Because size does matter: the Hamburg dependency treebank, Proc. 9th Int. Conf. on Language Resources and Evaluation (LREC 2014), Reykjavik, 2014.

  23. Turdakov, D., Astrakhantsev, N., Nedumov, Y., Sysoev, A., Andrianov, I., Mayorov, V., Fedorenko, D., Korshunov, A., and Kuznetsov, S., Texterra: a framework for text analysis, Proc. Inst. Syst. Program. RAS (Proc. ISP RAS), 2014, vol. 26, no. 1, pp. 421–438.

  24. Andrianov, I.A., Mayorov, V.D., and Turdakov, D.Yu., Modern approaches to aspect-based sentiment analysis, Proc. Inst. Syst. Program. RAS, 2015, vol. 27, no. 5, pp. 5–22.

  25. Zeman, D., Nivre, J., Abrams, M., et al., Universal Dependencies 2.5, LINDAT/CLARIAH-CZ Digital Library at the Institute of Formal and Applied Linguistics (UFAL), Charles Univ., Faculty of Mathematics and Physics, 2019. http://hdl.handle.net/11234/1-3105.

ACKNOWLEDGMENTS

We thank Vladimir Mayorov and Sergey Korolyov for assisting with the development and experiments, and Denis Turdakov, Ivan Andrianov, Hrant Khachatryan, Vladislav Trifonov for feedback and insightful discussions.

Author information

Corresponding authors

Correspondence to Tsolak Ghukasyan, Yeva Yeshilbashyan or Karen Avetisyan.

Appendices

APPENDIX A

COMPARISON OF VOCABULARIES FOR ENGLISH AND MORPHOLOGICALLY RICH LANGUAGES

Table 1. Form-Lemma ratios in UD v2.5 datasets [25] for Russian, German and Finnish compared to English
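
The ratios in Table 1 can be recomputed directly from the treebanks; a sketch for counting distinct forms and lemmas in a CoNLL-U file (the path is a placeholder, and lowercasing is one possible normalization, not necessarily the paper's):

    def form_lemma_ratio(conllu_path):
        # Ratio of distinct word forms to distinct lemmas in a UD treebank.
        forms, lemmas = set(), set()
        with open(conllu_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank lines and sentence-level comments
                cols = line.rstrip("\n").split("\t")
                if "-" in cols[0] or "." in cols[0]:
                    continue  # skip multiword-token ranges and empty nodes
                forms.add(cols[1].lower())   # FORM column
                lemmas.add(cols[2].lower())  # LEMMA column
        return len(forms) / len(lemmas)

    print(form_lemma_ratio("ru_syntagrus-ud-train.conllu"))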

APPENDIX B

ILLUSTRATION OF ORIGINAL FASTTEXT AND THE PROPOSED ALTERNATIVES

Fig. 1. Computation of the word embedding in fastText, no-fastText, so-fastText, and BPE-fastText models.
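
Reading so-fastText as the suffix-based variant (an inference from the abstract's list of extraction methods; the n-gram variant is sketched after the abstract above), suffix units for a German example might be extracted like this, with the maximum suffix length an assumption:

    def suffix_subwords(word, max_len=6):
        # Word-final character sequences of increasing length.
        return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]

    print(suffix_subwords("Häusern"))
    # ['n', 'rn', 'ern', 'sern', 'usern', 'äusern']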

APPENDIX C

ILLUSTRATION OF THE NEURAL NETWORK FOR MORPHOLOGICAL ANALYSIS AND LEMMATIZATION

Fig. 2. Architecture of the analyzer’s neural network.

APPENDIX D

COMPARISON OF PROPOSED MODELS AND SIMILAR SUBWORDS-ONLY APPROACHES

Table 2. The accuracy of morphological analysis and lemmatization of proposed fastText alternatives vs morpheme-based approaches

APPENDIX E

COMPARISON OF FASTTEXT AND PROPOSED MODELS

Table 3. The accuracy of morphological analysis and lemmatization of proposed alternatives vs fastText
Table 4. Comparison of model sizes and parameter counts of proposed alternatives vs fastText
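
A back-of-the-envelope illustration of where the size reduction in Table 4 comes from (the numbers below use fastText's default 2,000,000 n-gram buckets, a hypothetical 1M-word vocabulary, and 300 dimensions; they are not the paper's measured figures):

    DIM = 300
    WORD_VOCAB = 1_000_000     # hypothetical word vocabulary
    NGRAM_BUCKETS = 2_000_000  # fastText's default hash-bucket count
    BPE_VOCAB = 100_000        # hypothetical BPE subword vocabulary

    fasttext_params = (WORD_VOCAB + NGRAM_BUCKETS) * DIM  # word + subword vectors
    subword_only_params = BPE_VOCAB * DIM                 # subword vectors only

    print(f"fastText:      {fasttext_params:,}")      # 900,000,000
    print(f"subwords-only: {subword_only_params:,}")  # 30,000,000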

APPENDIX F

NEURAL NETWORK-BASED ANALYZER’S HYPERPARAMETERS

Table 5. Tested BabylonDigger hyperparameters for the analyzer’s neural network (search was performed in greedy fashion). Bolded values were used in the final evaluation*
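
The greedy search mentioned in Table 5 tunes one hyperparameter at a time, fixing each winner before moving to the next. A generic sketch of that procedure (the parameter names, value grids, and toy scoring function are placeholders, not the table's actual grid):

    def greedy_search(search_space, evaluate):
        # Tune hyperparameters one at a time, keeping earlier choices fixed.
        best = {name: values[0] for name, values in search_space.items()}
        for name, values in search_space.items():
            scores = {v: evaluate({**best, name: v}) for v in values}
            best[name] = max(scores, key=scores.get)
        return best

    def evaluate(cfg):
        # Stand-in for a full train-and-evaluate run of the analyzer.
        return cfg["lstm_units"] / 512 - 0.1 * cfg["dropout"]

    space = {"lstm_units": [128, 256, 512], "dropout": [0.3, 0.5]}
    print(greedy_search(space, evaluate))  # {'lstm_units': 512, 'dropout': 0.3}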

APPENDIX G

THE PERFORMANCE OF BPE-BASED MODELS

Table 6. The performance of BPE-fastText embeddings based on subword vocabulary size (VS)
Table 7. The performance of BPE-based models for the Russian language
Table 8. The performance of BPE-based models for the Finnish language
Table 9. The performance of BPE-based models for the German language

APPENDIX H

THE PERFORMANCE OF BPE-BASED MODELS BASED ON VOCABULARY SIZE AND LANGUAGE

Fig. 3. Models’ accuracy of predicting UPOS, FEATS, and LEMMA based on BPE vocabulary size.

Cite this article

Ghukasyan, T., Yeshilbashyan, Y. & Avetisyan, K. Subwords-Only Alternatives to fastText for Morphologically Rich Languages. Program Comput Soft 47, 56–66 (2021). https://doi.org/10.1134/S0361768821010059
