Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation
Computational Intelligence and Neuroscience, Pub Date: 2021-04-12, DOI: 10.1155/2021/6682385
Michael Adjeisah, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, Jinling Song

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the field with a low-resource English-Twi translation system based on filtered synthetic parallel corpora. It is often difficult to judge what a good-quality corpus looks like under low-resource conditions, especially when the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus, obtained by translating a monolingual corpus from the target language via bootstrapping with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using the squared Mahalanobis distance, a filtering technique that predicts sentence parallelism. Additionally, we apply three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and filtering extensively with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
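The squared-Mahalanobis filtering step is easiest to see in code. Below is a minimal sketch, not the authors' exact pipeline: it assumes each sentence pair is represented by fixed-size embedding vectors (e.g., from a multilingual sentence encoder), that the per-dimension absolute difference of the two embeddings serves as the pair feature, and that the cutoff is set at the 95th percentile of the trusted pairs' own scores. All of these choices are illustrative assumptions. A Gaussian is fitted on features from the trusted parallel corpus, and synthetic pairs lying too far from that distribution are discarded as likely non-parallel.

```python
import numpy as np

def pair_features(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Illustrative pair representation: per-dimension absolute difference
    of source and target sentence embeddings (one row per sentence pair)."""
    return np.abs(src_emb - tgt_emb)

def fit_gaussian(features: np.ndarray):
    """Estimate mean and inverse covariance from trusted parallel pairs."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for invertibility
    return mu, np.linalg.inv(cov)

def squared_mahalanobis(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Squared Mahalanobis distance of one feature vector from the Gaussian."""
    d = x - mu
    return float(d @ cov_inv @ d)

def filter_synthetic(pairs, feats, mu, cov_inv, threshold):
    """Keep only synthetic pairs whose distance stays below the threshold."""
    scores = np.array([squared_mahalanobis(f, mu, cov_inv) for f in feats])
    kept = [p for p, s in zip(pairs, scores) if s <= threshold]
    return kept, scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_trusted, n_synth = 16, 500, 200

    # Stand-ins for real sentence embeddings: trusted (parallel) pairs have
    # near-identical source/target embeddings; synthetic pairs are noisier.
    src_t = rng.normal(size=(n_trusted, dim))
    trusted_feats = pair_features(src_t, src_t + rng.normal(scale=0.1, size=src_t.shape))
    src_s = rng.normal(size=(n_synth, dim))
    synth_feats = pair_features(src_s, src_s + rng.normal(scale=0.5, size=src_s.shape))
    synth_pairs = [(f"en sent {i}", f"tw sent {i}") for i in range(n_synth)]

    mu, cov_inv = fit_gaussian(trusted_feats)
    # Assumed policy: keep synthetic pairs scoring within the 95th percentile
    # of the trusted pairs' own distances.
    trusted_scores = [squared_mahalanobis(f, mu, cov_inv) for f in trusted_feats]
    threshold = float(np.percentile(trusted_scores, 95))

    kept, _ = filter_synthetic(synth_pairs, synth_feats, mu, cov_inv, threshold)
    print(f"kept {len(kept)}/{n_synth} synthetic pairs (threshold={threshold:.2f})")
```

Because every pair receives a score, the same machinery also yields a ranking, so the cleanest-looking synthetic pairs can be injected into training first.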
