Using Sub-character Level Information for Neural Machine Translation of Logographic Languages, ACM Transactions on Asian and Low-Resource Language Information Processing



ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8), Pub Date: 2021-04-15, DOI: 10.1145/3431727
Longtu Zhang, Mamoru Komachi


Logographic and alphabetic languages (e.g., Chinese vs. English) use linguistically different writing systems. Languages that share a writing system usually share more information, which can be used to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, making better use of the information these characters share when training NMT systems, in both the encoding and decoding processes. Experiments show that the proposed method robustly improves NMT performance for both “logographic” language pairs (JA–ZH) and “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the transformer encoder can help model them. The results also indicate that, in theory, linguistic features can be manipulated to obtain higher shared token rates and further improve the performance of natural language processing systems.
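
To make the idea of decomposition and shared tokens concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the decomposition table `DECOMP` is a tiny hand-written example, the helper names `decompose` and `shared_token_rate` are hypothetical, and the shared token rate is assumed here to be the Jaccard overlap of the two token vocabularies, since the abstract does not define it.

```python
# Toy decomposition table, hand-written for illustration (an assumption;
# a real system would use a full ideograph decomposition resource).
DECOMP = {
    "語": ["言", "吾"],  # form used in Japanese
    "语": ["讠", "吾"],  # simplified Chinese form
    "時": ["日", "寺"],  # form used in Japanese
    "时": ["日", "寸"],  # simplified Chinese form
}


def decompose(text):
    """Replace each character with its sub-character units when known."""
    units = []
    for ch in text:
        # Characters missing from the table are kept unchanged.
        units.extend(DECOMP.get(ch, [ch]))
    return units


def shared_token_rate(tokens_a, tokens_b):
    """Jaccard overlap of two token vocabularies (assumed definition)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


ja, zh = "語時", "语时"
char_rate = shared_token_rate(list(ja), list(zh))
sub_rate = shared_token_rate(decompose(ja), decompose(zh))
print(char_rate, sub_rate)
```

In this toy case the two strings share no tokens at the character level, while after decomposition they share 吾 and 日, that is, 2 of the 6 distinct units. The decomposed sequences are also twice as long as the originals, which is the sequence-length cost the abstract mentions.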

Updated: 2021-04-15


