Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
arXiv - CS - Computation and Language. Pub Date: 2020-09-20, DOI: arXiv-2009.09359
Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, Rifat Shahriyar

Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in the machine translation literature because of its low-resource status. Most publicly available parallel corpora for Bengali are not large enough and are of rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of the high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation in low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs, more than 2 million of which were not available before. Training neural models on this corpus, we achieve an improvement of more than 9 BLEU points over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs built with extensive quality control. We release the segmenter, parallel corpus, and evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first-ever large-scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as on other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.
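The abstract names the two techniques but does not spell out their mechanics; the paper's exact algorithms are in the full text. As a purely illustrative sketch, assuming "aligner ensembling" means voting over the (source, target) sentence pairs proposed by several independent aligners, and "batch filtering" means scoring candidate pairs batch-by-batch with some quality scorer and discarding low-scoring ones, the two ideas could look like this (all function names and thresholds here are hypothetical, not from the paper):

```python
from collections import Counter

def ensemble_alignments(aligner_outputs, min_votes=2):
    """Majority-vote over sentence alignments from several aligners.

    aligner_outputs: one collection per aligner, each holding
    (source_index, target_index) pairs. A pair is kept only if at
    least `min_votes` aligners independently proposed it.
    """
    votes = Counter()
    for output in aligner_outputs:
        votes.update(set(output))  # de-duplicate within one aligner
    return {pair for pair, count in votes.items() if count >= min_votes}

def batch_filter(pairs, score_fn, threshold=0.5, batch_size=1000):
    """Score sentence pairs in batches and keep those above a threshold.

    score_fn takes a list of pairs and returns one quality score per
    pair; processing in batches lets an expensive scorer (e.g. a
    neural model) run efficiently over a large noisy corpus.
    """
    kept = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        scores = score_fn(batch)
        kept.extend(p for p, s in zip(batch, scores) if s >= threshold)
    return kept
```

For example, a pair proposed by two of three aligners survives the vote, and a pair scored 0.1 by the (hypothetical) quality scorer is dropped by the filter. The actual pipeline in the released repository may differ substantially.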

Updated: 2020-10-08