Addressing Limited Vocabulary and Long Sentences Constraints in English–Arabic Neural Machine Translation
Arabian Journal for Science and Engineering (IF 2.6), Pub Date: 2021-03-02, DOI: 10.1007/s13369-020-05328-2
Safae Berrichi, Azzeddine Mazroui

Neural Machine Translation (NMT) has attracted growing interest in recent years for its promising performance compared to traditional approaches such as Statistical Machine Translation (SMT). However, its performance degrades when it is applied to language pairs with different structures, such as the English–Arabic pair studied in this work. Indeed, the limited vocabulary size required by NMT models decreases the vocabulary coverage rate for Arabic, a language well known for its morphological richness. Likewise, long sentences present an additional challenge because NMT systems perform less well on them than on shorter ones. In this paper, we report a series of experiments to mitigate the effects of these constraints. To address the problem of out-of-vocabulary words, we integrated morphosyntactic features, namely stem, lemma, POS, root, and pattern, into factored NMT models as output factors. We have also developed two techniques for segmenting long sentences into smaller sub-sentences. The first uses a list of lexical markers that we have collected as segmentation points, and the second integrates into the NMT model the parallel phrases extracted by an SMT system. The experiments carried out on the English–Arabic pair show that the proposed approaches considerably improve translation quality compared to the baseline NMT system.
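The first segmentation technique, cutting long sentences at lexical markers, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the collected marker list, so `MARKERS` is a hypothetical stand-in, and the `min_len` threshold and the `segment` function name are likewise invented for the example.

```python
# Hypothetical marker list; the authors' collected list is not given in the abstract.
MARKERS = ("because", "although", "however", "which", "while")


def segment(sentence, min_len=4, markers=MARKERS):
    """Split a sentence into sub-sentences, cutting just before a lexical marker.

    A cut is made only when the current segment already holds at least
    `min_len` tokens, so short sentences are left whole (assumed heuristic).
    """
    segments, current = [], []
    for token in sentence.split():
        # Compare case-insensitively and ignore attached punctuation.
        word = token.lower().strip(",;:.")
        if word in markers and len(current) >= min_len:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    text = ("The system translates short inputs well, however it struggles "
            "when the input grows because context is lost")
    for part in segment(text):
        print(part)
```

In a pipeline of this kind, each sub-sentence could then be passed to the translation model separately; how the outputs are recombined is not described in the abstract and is left out here.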




Updated: 2021-03-03