Finding Better Subwords for Tibetan Neural Machine Translation
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8), Pub Date: 2021-03-15, DOI: 10.1145/3448216
Yachao Li, Jing Jiang, Jia Yangji, Ning Ma

Subword segmentation plays an important role in Tibetan neural machine translation (NMT). Tibetan words have a two-level structure: a word is a sequence of syllables, and each syllable is a sequence of characters. Based on this structure, we propose two methods for Tibetan subword segmentation: a syllable-based method, which generates subwords from Tibetan syllables, and a character-based method, which generates them from Tibetan characters. We evaluate both methods on low-resource Tibetan-to-Chinese NMT. The experimental results show that both improve translation performance, with the character-based segmentation achieving the better results. Overall, the proposed character-based subword segmentation is simpler and more effective, and it achieves better results without requiring much attention to the linguistic features of Tibetan.
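
The two levels of word structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the two base units (syllables and characters) from which a subword learner could start. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and that a "character" corresponds to one Unicode code point; the function names `to_syllables` and `to_characters` and the sample string are hypothetical.

```python
# Illustrative sketch only; the paper's actual segmentation pipeline is not
# described in the abstract. Assumption: syllables are separated by the
# Tibetan tsheg mark (U+0F0B), and one character = one Unicode code point.

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark


def to_syllables(word: str) -> list[str]:
    """Split a Tibetan word into syllables at the tsheg delimiter."""
    return [s for s in word.split(TSHEG) if s]


def to_characters(syllable: str) -> list[str]:
    """Split a syllable into its individual Unicode code points."""
    return list(syllable)


# Hypothetical two-syllable sample, written with escapes so the code points
# are unambiguous: (U+0F56 U+0F7C U+0F51) tsheg (U+0F61 U+0F72 U+0F42).
word = "\u0f56\u0f7c\u0f51" + TSHEG + "\u0f61\u0f72\u0f42"

syllables = to_syllables(word)
characters = [to_characters(s) for s in syllables]

print(len(syllables))                # number of syllable-level units
print([len(c) for c in characters])  # number of character-level units per syllable
```

A syllable-based method would take the first kind of unit as its starting sequence, and a character-based method the second; how the units are then merged into subwords is left open here.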

Updated: 2021-03-15