Subwords-Only Alternatives to fastText for Morphologically Rich Languages, Programming and Computer Software

Subwords-Only Alternatives to fastText for Morphologically Rich Languages
Programming and Computer Software (IF 0.7), Pub Date: 2021-02-23, DOI: 10.1134/s0361768821010059
Tsolak Ghukasyan, Yeva Yeshilbashyan, Karen Avetisyan

Abstract

In this work, we present purely subword-based alternatives to the fastText word embedding algorithm. The alternatives are modifications of the original fastText model that rely on subword information only, which eliminates the dependence on word-level vectors and at the same time helps to dramatically reduce the size of the embeddings. The proposed models differ in how they extract subword information: character n-grams, suffixes, or byte-pair encoding units. We test the models on morphological analysis and lemmatization for three morphologically rich languages: Finnish, Russian, and German. Compared with other recent subword-based models, the proposed models achieve consistently higher results.
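
The idea of composing a word vector from subword units alone can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name SubwordOnlyEmbedding, the CRC32 hashing, the bucket count, the averaging step, the suffix length cutoff, and the random (untrained) table are all assumptions made for illustration. The boundary markers and the 3 to 6 n-gram range follow fastText's defaults. Byte-pair encoding is omitted because it needs a merge table learned from a corpus.

```python
import zlib
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in fastText-style '<' and '>' markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

def suffixes(word, max_len=4):
    """Word-final substrings of length 1..max_len (max_len is a hypothetical cutoff)."""
    return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]

class SubwordOnlyEmbedding:
    """Hypothetical sketch: a word vector built from subword rows only.

    There is no per-word row, so the table size depends on the number of
    hash buckets rather than on the vocabulary size.
    """

    def __init__(self, extract, dim=8, buckets=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.extract = extract      # function mapping a word to its subword units
        self.buckets = buckets
        # Randomly initialised stand-in for a trained subword table.
        self.table = rng.normal(scale=0.1, size=(buckets, dim))

    def _index(self, unit):
        # CRC32 is a stand-in hash chosen for determinism (an assumption).
        return zlib.crc32(unit.encode("utf-8")) % self.buckets

    def vector(self, word):
        rows = [self._index(u) for u in self.extract(word)]
        if not rows:                # no subword units: fall back to a zero vector
            return np.zeros(self.table.shape[1])
        # Average the subword vectors; no word-level vector is added.
        return self.table[rows].mean(axis=0)

emb = SubwordOnlyEmbedding(char_ngrams)
print(char_ngrams("kissa", 3, 3))
print(suffixes("talossa", 3))
print(emb.vector("talossa").shape)
```

Because every word is resolved through its subword units, an inflected form that never appeared in training still receives a vector, which is the property that matters for morphologically rich languages. Passing suffixes instead of char_ngrams to the constructor swaps the extraction method without changing the rest of the sketch.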

Updated: 2021-02-23
