Improved feature decay algorithms for statistical machine translation
Natural Language Engineering (IF 2.3), Pub Date: 2020-09-22, DOI: 10.1017/s1351324920000467
Alberto Poncelas, Gideon Maillette de Buy Wenniger, Andy Way

In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so that they are the most relevant to the test set. Feature Decay Algorithms (FDA) are a data selection technique that has demonstrated excellent performance in a number of tasks. The method maximizes the diversity of the n-grams in the selected training set by devaluing those that have already been included. We focus on this method to undertake deeper research on how to select better training data instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, we first develop a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we obtain improvements in translation quality by extending FDA with information from the parallel corpus that is generally ignored.
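The selection principle described in the abstract (favor training sentences that share n-grams with the test set, and devalue n-grams that have already been selected) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngrams` and `fda_select`, the decay factor of 0.5, the initial feature value of 1.0, the bigram limit, and the length normalization are all assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams of order 1..max_n as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fda_select(candidates, test_sentences, k, decay=0.5, max_n=2):
    """Greedily pick up to k candidate sentences (illustrative sketch only).

    Assumptions: every test-set n-gram starts with value 1.0, and its value
    is multiplied by `decay` each time it occurs in a selected sentence.
    """
    # Features are the n-grams observed in the test set.
    test_features = set()
    for sentence in test_sentences:
        test_features.update(ngrams(sentence.split(), max_n))

    selected_counts = Counter()  # occurrences of each feature in selected data
    remaining = list(candidates)
    selected = []

    def score(sentence):
        tokens = sentence.split()
        if not tokens:
            return 0.0
        total = sum(
            decay ** selected_counts[f]
            for f in ngrams(tokens, max_n)
            if f in test_features
        )
        return total / len(tokens)  # length normalization (an assumption)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        if score(best) == 0.0:
            break  # no remaining sentence shares an n-gram with the test set
        selected.append(best)
        remaining.remove(best)
        # Decay the features just covered so repeats become less attractive.
        selected_counts.update(
            f for f in ngrams(best.split(), max_n) if f in test_features
        )
    return selected


candidates = [
    "the cat sat",
    "the cat sat",
    "a dog barked",
    "unrelated words here",
]
test_set = ["the cat sat", "a dog barked"]
print(fda_select(candidates, test_set, k=2))
```

In this toy example the duplicate sentence loses to "a dog barked" on the second pick, because the n-grams it would contribute have already been devalued. The sketch rescans all remaining candidates at every step for clarity; it makes no attempt to reproduce the speed improvements the paper proposes.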

Updated: 2020-09-22