
Incorporating word embeddings in unsupervised morphological segmentation

Published online by Cambridge University Press: 10 July 2020

Ahmet Üstün
Affiliation:
The University of Groningen, Groningen, The Netherlands
Burcu Can*
Affiliation:
Department of Computer Engineering, Hacettepe University, Ankara, Turkey
*Corresponding author. E-mail: burcucan@gmail.com

Abstract

We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised; it requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. The proposed models could also be used for any other low-resource language with concatenative morphology.
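The intuition that a derived word stays semantically close to its stem can be illustrated with a small sketch. This is not the MLE or MAP model of the paper, whose details are not given in the abstract. It only shows how cosine similarity between word vectors could be used to filter candidate stem/suffix splits. The toy vectors, the function names `cosine` and `candidate_splits`, and the threshold of 0.5 are all assumptions made for illustration.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# In practice the vectors would come from pretrained word embeddings.
TOY_VECTORS = {
    "walk":    [0.9, 0.1, 0.0],
    "walking": [0.8, 0.2, 0.1],
    "king":    [0.0, 0.9, 0.4],
    "kin":     [0.7, 0.0, 0.7],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_splits(word, vectors, threshold=0.5):
    """Return (stem, suffix, similarity) for every split point where the
    stem is itself a known word and is semantically close to `word`.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    results = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem in vectors:
            sim = cosine(vectors[word], vectors[stem])
            if sim >= threshold:
                results.append((stem, suffix, sim))
    return results


if __name__ == "__main__":
    # "walk" is close to "walking", so the split walk+ing is kept.
    print(candidate_splits("walking", TOY_VECTORS))
    # "kin" is a string prefix of "king" but not semantically close,
    # so the split kin+g is rejected.
    print(candidate_splits("king", TOY_VECTORS))
```

With these toy vectors, the split of "walking" into "walk" and "ing" passes the similarity check, while the spurious split of "king" into "kin" and "g" does not. A probabilistic model would presumably combine such a similarity signal with other evidence rather than apply a hard threshold.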

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press

