Self-Attention Networks Can Process Bounded Hierarchical Languages
arXiv - CS - Formal Languages and Automata Theory. Pub Date: 2021-05-24, DOI: arxiv-2105.11115
Shunyu Yao, Binghui Peng, Christos Papadimitriou, Karthik Narasimhan

Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as $\mathsf{Dyck}_k$, the language consisting of well-nested parentheses of $k$ types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process $\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_{k}$ with depth bounded by $D$, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with $D+1$ layers and $O(\log k)$ memory size (per token per layer) that recognizes $\mathsf{Dyck}_{k, D}$, and a soft-attention network with two layers and $O(\log k)$ memory size that generates $\mathsf{Dyck}_{k, D}$. Experiments show that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
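As a concrete illustration of the target language (not part of the paper's construction), the following is a minimal Python sketch of a $\mathsf{Dyck}_{k, D}$ recognizer: it tracks the stack of open bracket types and rejects if the nesting depth exceeds $D$ or a closing bracket mismatches. The token encoding and function name are illustrative assumptions.

```python
def is_dyck_k_D(tokens, k, D):
    """Check whether a token sequence is in Dyck_{k,D}: well-nested
    brackets of k types with nesting depth at most D.

    Tokens are pairs ('open', i) or ('close', i) for bracket type i < k.
    This is a plain stack-based recognizer for illustration, not the
    paper's attention-based construction.
    """
    stack = []
    for kind, typ in tokens:
        if not (0 <= typ < k):
            return False
        if kind == 'open':
            stack.append(typ)
            if len(stack) > D:          # depth bound D violated
                return False
        elif kind == 'close':
            if not stack or stack[-1] != typ:
                return False            # mismatched or unbalanced close
            stack.pop()
        else:
            return False
    return not stack                    # every open bracket must be closed


# Example: "( [ ] )" with types 0 and 1 is in Dyck_{2,2} but not Dyck_{2,1}.
seq = [('open', 0), ('open', 1), ('close', 1), ('close', 0)]
assert is_dyck_k_D(seq, k=2, D=2)
assert not is_dyck_k_D(seq, k=2, D=1)
```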

Updated: 2021-05-25