Theoretical Limitations of Self-Attention in Neural Sequence Models
arXiv - CS - Formal Languages and Automata Theory. Pub Date: 2019-06-16, arXiv: 1906.06755
Michael Hahn

Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length. These limitations seem surprising given the practical success of self-attention and the prominent role assigned to hierarchical structure in linguistics, suggesting that natural language can be approximated well with models that are too weak for the formal languages typically assumed in theoretical linguistics.
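As a concrete illustration of the language classes the abstract refers to, the sketch below defines recognizers for PARITY (a periodic finite-state language: bit strings with an even number of 1s) and Dyck-2 (balanced strings over two bracket types, a canonical example of hierarchical structure). These are representative instances chosen here for illustration, not the paper's formal constructions.

```python
def is_parity(s: str) -> bool:
    """Accept bit strings containing an even number of 1s (PARITY)."""
    return s.count("1") % 2 == 0


def is_dyck2(s: str) -> bool:
    """Accept balanced bracket strings over two bracket types (Dyck-2)."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            # Closing bracket must match the most recent open bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False
    return not stack


if __name__ == "__main__":
    print(is_parity("1101"))   # False: three 1s
    print(is_parity("1100"))   # True: two 1s
    print(is_dyck2("([()])"))  # True: properly nested
    print(is_dyck2("([)]"))    # False: crossing brackets
```

The paper's results say, roughly, that fixed-depth, fixed-width self-attention cannot recognize languages of this kind exactly once inputs grow long enough, unless the number of layers or heads scales with input length.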

Updated: 2020-02-14