Incorporating word embeddings in unsupervised morphological segmentation,Natural Language Engineering


Incorporating word embeddings in unsupervised morphological segmentation
Natural Language Engineering (IF 2.3), Pub Date: 2020-07-10, DOI: 10.1017/s1351324920000406
Ahmet Üstün, Burcu Can

We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised; it requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps in morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
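The intuition in the abstract, that a derived word stays semantically close to its stem, can be illustrated with a small sketch. This is not the authors' MLE or MAP model; it only shows how cosine similarity between embeddings might be used to filter candidate split points. The toy 3-dimensional vectors, the `candidate_splits` helper, and the `threshold` value are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this example (assumption: real pretrained
# embeddings would have hundreds of dimensions and a large vocabulary).
TOY_EMBEDDINGS = {
    "walk":    [0.9, 0.1, 0.0],
    "walking": [0.8, 0.2, 0.1],
    "pan":     [0.0, 0.1, 0.9],
    "pander":  [0.1, 0.9, 0.1],
}


def candidate_splits(word, embeddings, threshold=0.5):
    """Return (stem, suffix, similarity) for each split point whose stem
    is in the vocabulary and is semantically close to the full word.

    Hypothetical helper: the threshold is an assumed value, not taken
    from the paper.
    """
    word_vec = embeddings[word]
    results = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem in embeddings:
            sim = cosine(word_vec, embeddings[stem])
            if sim >= threshold:
                results.append((stem, suffix, sim))
    return results


if __name__ == "__main__":
    print(candidate_splits("walking", TOY_EMBEDDINGS))
    print(candidate_splits("pander", TOY_EMBEDDINGS))
```

With these toy vectors, "walking" keeps the split "walk" + "ing" because the two vectors point in nearly the same direction, while "pander" is not split into "pan" + "der" because the surface match is not supported by semantic similarity. A probabilistic model such as the one the abstract describes would combine a signal like this with other evidence rather than apply a hard threshold.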


Updated: 2020-07-10
