Subwords-Only Alternatives to fastText for Morphologically Rich Languages
Programming and Computer Software (IF 0.7), Pub Date: 2021-02-23, DOI: 10.1134/s0361768821010059
Tsolak Ghukasyan , Yeva Yeshilbashyan , Karen Avetisyan

Abstract

In this work, we present purely subword-based alternatives to the fastText word embedding algorithm. The alternatives are modifications of the original fastText model but rely on subword information only, eliminating the reliance on word-level vectors and at the same time helping to dramatically reduce the size of the embeddings. The proposed models differ in their subword information extraction method: character n-grams, suffixes, and byte-pair encoding units. We test the models on the tasks of morphological analysis and lemmatization for three morphologically rich languages: Finnish, Russian, and German. The results are compared with those of other recent subword-based models, and the proposed models score consistently higher.
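
To make the subword-only idea concrete, here is a minimal sketch, not the authors' implementation, of composing a word vector from subword units alone, with no word-level vector. Everything in it is an assumption made for illustration: the helper names (`char_ngrams`, `suffixes`, `SubwordOnlyEmbedding`), the n-gram range of 3 to 6, the FNV-1a-style hash into a fixed number of buckets, the averaging step, and the randomly initialized table, which stands in for trained parameters. Byte-pair encoding units are left out because they need a learned merge table.

```python
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def suffixes(word, max_len=4):
    """Suffixes of the word, from length 1 up to max_len characters."""
    return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]


def fnv1a(text, buckets):
    """FNV-1a-style 32-bit hash of the UTF-8 bytes, reduced to a bucket index."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % buckets


class SubwordOnlyEmbedding:
    """Hypothetical embedding table indexed by hashed subword units only."""

    def __init__(self, dim=8, buckets=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.buckets = buckets
        # Random values stand in for parameters that training would learn.
        self.table = rng.normal(scale=0.1, size=(buckets, dim))

    def vector(self, word, extract=char_ngrams):
        """Average the vectors of the word's subword units (no word vector)."""
        units = extract(word)
        if not units:
            return np.zeros(self.dim)
        rows = [fnv1a(unit, self.buckets) for unit in units]
        return self.table[rows].mean(axis=0)


if __name__ == "__main__":
    model = SubwordOnlyEmbedding()
    print(char_ngrams("cat", 3, 3))
    print(suffixes("talossa", 3))
    print(model.vector("talossa").shape)
    print(model.vector("talossa", extract=suffixes).shape)
```

Because the table is indexed by hashed subword units rather than by vocabulary entries, its size in this sketch depends on the bucket count and not on the number of distinct word forms, which is one way to read the size reduction mentioned in the abstract.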




Updated: 2021-02-23