Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2020-06-22, DOI: 10.1145/3390298
Santwana Chimalamarri, Dinkar Sitaram, Ashritha Jain

Crosslingual word embeddings developed from multiple parallel corpora help in understanding the relationships between languages and in improving the prediction quality of machine translation. However, in low-resource languages with complex, agglutinative morphologies, inducing good-quality crosslingual embeddings is challenging because of complex morphological forms and rare words. This is true even for languages that share a common linguistic structure. We show that performing a simple morphological segmentation of the corpora before generating crosslingual word embeddings for both roots and suffixes greatly improves prediction quality and captures semantic similarities more effectively. To demonstrate this, we chose two related languages of the Dravidian language family, Telugu and Kannada. We also tested our method on Hindi, a widely spoken North Indian language belonging to the Indo-European language family, and observed encouraging results.
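
The abstract does not say how the segmentation is carried out, so the following is only a minimal sketch of one way the preprocessing step could look: each word is split into a root token and separate suffix tokens before any embedding model sees the corpus. The suffix list, the `+` marker, the `min_root_len` threshold, and the sample words are all hypothetical placeholders chosen for illustration; they are not the authors' segmenter and not a validated suffix inventory for Telugu, Kannada, or Hindi.

```python
from typing import List, Sequence

# Hypothetical toy suffix inventory (romanized placeholders, not a validated
# morphological analysis of any of the languages in the paper).
TOY_SUFFIXES = ("lu", "ki", "lo")
SUFFIX_MARK = "+"  # assumed marker that keeps suffix tokens distinct from roots


def segment_word(word: str,
                 suffixes: Sequence[str] = TOY_SUFFIXES,
                 min_root_len: int = 3) -> List[str]:
    """Split a word into a root and zero or more suffix tokens.

    Suffixes are stripped greedily from the right, longest match first,
    as long as the remaining root keeps at least `min_root_len` characters.
    """
    stripped: List[str] = []
    root = word
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so the longest match wins.
        for suf in sorted(suffixes, key=len, reverse=True):
            if root.endswith(suf) and len(root) - len(suf) >= min_root_len:
                stripped.append(SUFFIX_MARK + suf)
                root = root[: -len(suf)]
                changed = True
                break
    # Suffixes were collected right to left; restore surface order.
    return [root] + stripped[::-1]


def segment_corpus(lines: Sequence[str]) -> List[List[str]]:
    """Segment every whitespace-separated word of every line."""
    return [[tok for word in line.split() for tok in segment_word(word)]
            for line in lines]


if __name__ == "__main__":
    # Made-up word forms, used only to show the token layout.
    print(segment_corpus(["padamluki kalo"]))
```

Under these assumptions, the token lists returned by `segment_corpus` would take the place of the raw sentences when training embeddings, so that roots and suffixes each receive their own vectors and a rare inflected form is represented by more frequent pieces.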

Updated: 2020-06-22