Linguistic knowledge-based vocabularies for Neural Machine Translation
Natural Language Engineering (IF 2.3), Pub Date: 2020-07-02, DOI: 10.1017/s1351324920000364
Noe Casas, Marta R. Costa-jussà, José A. R. Fonollosa, Juan A. Alonso, Ramón Fanlo

Neural networks applied to machine translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically rich areas, prevents some transfer learning approaches (e.g., cross-lingual pretrained embeddings), and reduces their interpretability. In this work, we propose new hybrid, linguistically grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and the word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches in both morphologically rich and morphologically poor languages, showing that, for the former, the translation quality of out-of-domain texts improves with respect to a strong subword baseline.
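
The trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it assumes a hypothetical toy subword vocabulary (SUBWORD_VOCAB), a hypothetical word-keyed embedding table (WORD_EMBEDDINGS), and a simple greedy longest-match segmenter standing in for a statistically learned subword model. It only shows why word-level resources stop matching once words are split into pieces.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of a word into subword pieces.

    A simplified stand-in for a learned subword model (assumption).
    """
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Fall back to a single character so segmentation always finishes.
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


def word_level_hits(tokens, table):
    """Return the tokens that have an entry in a word-keyed table."""
    return [token for token in tokens if token in table]


# Hypothetical toy data, invented for illustration only.
SUBWORD_VOCAB = {"trans", "lat", "ion", "s"}
WORD_EMBEDDINGS = {"translation": [0.1, 0.3], "translations": [0.1, 0.4]}

pieces = segment("translations", SUBWORD_VOCAB)

# The whole word matches the word-level table; its subword pieces do not.
print(pieces)
print(word_level_hits(["translations"], WORD_EMBEDDINGS))
print(word_level_hits(pieces, WORD_EMBEDDINGS))
```

In this toy setting the unsegmented word finds its pretrained entry, while none of its pieces do, which is the loss of word-level association that the hybrid vocabularies in the paper aim to avoid.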

Updated: 2020-07-02