Faster Transformer Decoding: N-gram Masked Self-Attention
arXiv - CS - Computation and Language. Pub Date: 2020-01-14, DOI: arxiv-2001.04589
Ciprian Chelba, Mia Chen, Ankur Bapna, and Noam Shazeer

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
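Below is a minimal sketch of the masking idea in plain NumPy, not code from the paper; the helper names `ngram_causal_mask` and `ngram_masked_self_attention` are illustrative assumptions. Each target position is allowed to attend only to itself and the preceding $N-1$ target tokens, combining the usual causal constraint with the $N$-gram truncation described above.

```python
import numpy as np

def ngram_causal_mask(seq_len: int, n: int) -> np.ndarray:
    """Boolean mask where query position i may attend only to key
    positions j with i - (n - 1) <= j <= i, i.e. the last n target
    tokens including itself."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j >= i - (n - 1))

def ngram_masked_self_attention(q, k, v, n: int) -> np.ndarray:
    """Scaled dot-product self-attention over the target sequence,
    with scores outside the N-gram window set to -inf before softmax.
    q, k, v: arrays of shape (seq_len, d_model)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq_len, seq_len)
    mask = ngram_causal_mask(q.shape[0], n)
    scores = np.where(mask, scores, -np.inf)      # block disallowed keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because each target position attends to at most $N$ previous tokens, an incremental decoder only needs to keep the last $N$ keys and values per self-attention layer, which is where the decoding speed-up suggested by the title comes from.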

Last updated: 2020-01-15