Rethinking Positional Encoding in Language Pre-training, arXiv - CS - Computation and Language


Rethinking Positional Encoding in Language Pre-training
arXiv - CS - Computation and Language. Pub Date: 2020-06-28, DOI: arxiv-2006.15595
Guolin Ke, Di He, Tie-Yan Liu

How to explicitly encode positional information into neural networks is an important question in learning representations of natural language with models such as BERT. In the Transformer architecture, positional information is typically encoded either as embedding vectors used in the input layer or as a bias term in the self-attention module. In this work, we investigate the problems in these previous formulations and propose a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE). Unlike all other works, TUPE uses only the word embedding as input. In the self-attention module, the word contextual correlation and the positional correlation are computed separately with different parameterizations and then added together. This design removes the addition over heterogeneous embeddings in the input, which may introduce randomness, and gives more expressiveness in characterizing the relationships between words and positions by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, giving it a more specific role in capturing the global representation of the sentence. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness and efficiency of the proposed method: TUPE outperforms several baselines on almost all tasks by a large margin. In particular, it can achieve a higher score than the baselines while using only 30% of the pre-training computational cost. We release our code at https://github.com/guolinke/TUPE.
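
The central idea of the abstract, computing the word-to-word and position-to-position correlations with separate projection matrices and then summing them, can be illustrated with a minimal single-head sketch. It is not the authors' released implementation. The function and parameter names (`untied_attention`, `Wq`, `Wk`, `Wv`, `Uq`, `Uk`) are hypothetical, and the 1/sqrt(2d) scaling is an assumption chosen so that the sum of the two terms stays on a similar scale to a single term. The special handling of [CLS] is omitted. Plain Python is used so the example has no dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def untied_attention(x, p, Wq, Wk, Wv, Uq, Uk):
    """Single-head attention with separate word and position projections.

    x: word embeddings, one row per token (positions are NOT added to x).
    p: positional embeddings, one row per position.
    Wq, Wk, Wv: projections applied to words (hypothetical names).
    Uq, Uk: projections applied to positions (hypothetical names).
    """
    d = len(Wq[0])
    scale = 1.0 / math.sqrt(2 * d)  # assumed scaling for the sum of two terms
    # Word-to-word correlation uses only word embeddings.
    word = matmul(matmul(x, Wq), transpose(matmul(x, Wk)))
    # Position-to-position correlation uses only positional embeddings.
    pos = matmul(matmul(p, Uq), transpose(matmul(p, Uk)))
    # The two correlations are added before the softmax.
    logits = [[scale * (w + q) for w, q in zip(w_row, p_row)]
              for w_row, p_row in zip(word, pos)]
    weights = [softmax(r) for r in logits]
    # Values are projected from word embeddings only.
    return matmul(weights, matmul(x, Wv)), weights


def random_matrix(rows, cols, rng):
    """Small random matrix for the demonstration below."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
n, d = 4, 6  # toy sequence length and hidden size
x = random_matrix(n, d, rng)
p = random_matrix(n, d, rng)
Wq, Wk, Wv, Uq, Uk = (random_matrix(d, d, rng) for _ in range(5))
out, weights = untied_attention(x, p, Wq, Wk, Wv, Uq, Uk)
print(len(out), len(out[0]))
```

Because the positional embeddings enter only through the `pos` term, the input to the layer is the word embedding alone, which mirrors the abstract's point about avoiding the addition of heterogeneous embeddings in the input.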


Updated: 2020-07-10
