Perceiver: General Perception with Iterative Attention
arXiv - CS - Machine Learning Pub Date : 2021-03-04 , DOI: arxiv-2103.03206
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, and proprioception. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structure exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but they also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, yet scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to handle very large inputs. We show that this architecture is competitive with, or outperforms, strong specialized models on classification tasks across a variety of modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without using convolutions, by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
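The key idea in the asymmetric attention the abstract describes is that queries come from a small latent array while keys and values come from the large input array, so the attention cost is linear in the input size rather than quadratic. A minimal single-head sketch (random matrices stand in for learned weights; all names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs, rng):
    """One asymmetric attention step: latents (N, D) query inputs (M, D), M >> N."""
    N, D = latents.shape
    # Hypothetical random projections standing in for learned parameters.
    Wq = rng.standard_normal((D, D)) / np.sqrt(D)
    Wk = rng.standard_normal((D, D)) / np.sqrt(D)
    Wv = rng.standard_normal((D, D)) / np.sqrt(D)
    Q = latents @ Wq                        # queries from the small latent array
    K = inputs @ Wk                         # keys/values from the large input array
    V = inputs @ Wv
    attn = softmax(Q @ K.T / np.sqrt(D))    # (N, M): cost O(N*M), not O(M^2)
    return attn @ V                         # distilled result stays latent-sized: (N, D)

rng = np.random.default_rng(0)
latents = rng.standard_normal((32, 64))      # tight latent bottleneck
inputs = rng.standard_normal((50_000, 64))   # e.g. the 50,000 pixels mentioned above
out = cross_attention(latents, inputs, rng)
print(out.shape)  # (32, 64)
```

Because the output has the same small latent shape regardless of the input size, this step can be applied iteratively, repeatedly re-attending to the inputs while keeping compute bounded by the bottleneck.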

Updated: 2021-03-05