Perceiver: General Perception with Iterative Attention
arXiv - CS - Machine Learning. Pub Date: 2021-03-04. DOI: arxiv-2103.03206. Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
Biological systems understand the world by simultaneously processing
high-dimensional inputs from modalities as diverse as vision, audition, touch,
proprioception, etc. The perception models used in deep learning on the other
hand are designed for individual modalities, often relying on domain-specific
assumptions such as the local grid structures exploited by virtually all
existing vision models. These priors introduce helpful inductive biases, but
also lock models to individual modalities. In this paper we introduce the
Perceiver - a model that builds upon Transformers and hence makes few
architectural assumptions about the relationship between its inputs, but that
also scales to hundreds of thousands of inputs, like ConvNets. The model
leverages an asymmetric attention mechanism to iteratively distill inputs into
a tight latent bottleneck, allowing it to scale to handle very large inputs. We
show that this architecture performs competitively or beyond strong,
specialized models on classification tasks across various modalities: images,
point clouds, audio, video and video+audio. The Perceiver obtains performance
comparable to ResNet-50 on ImageNet without convolutions and by directly
attending to 50,000 pixels. It also surpasses state-of-the-art results for all
modalities in AudioSet.
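The asymmetric attention described above can be sketched in a few lines: a small, fixed-size latent array queries the full input array, so each cross-attention step costs O(M·N) rather than the O(M²) of full self-attention over M inputs, and the step can be iterated to progressively distill the input. This is a minimal numpy sketch, not the paper's implementation; the array sizes, random projection weights, and the `cross_attend` helper are illustrative assumptions (the real model learns its projections and interleaves latent self-attention blocks).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, d_qk=64, seed=0):
    """One asymmetric cross-attention step: a small latent array (N x D)
    queries a large input array (M x C). Attention matrix is (N, M),
    so cost scales linearly in the number of inputs M."""
    rng = np.random.default_rng(seed)
    N, D = latents.shape
    M, C = inputs.shape
    # Illustrative random projections (learned parameters in the real model).
    Wq = rng.standard_normal((D, d_qk)) / np.sqrt(D)
    Wk = rng.standard_normal((C, d_qk)) / np.sqrt(C)
    Wv = rng.standard_normal((C, D)) / np.sqrt(C)
    Q, K, V = latents @ Wq, inputs @ Wk, inputs @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_qk))  # (N, M): each latent attends over all inputs
    return latents + attn @ V                # residual update of the latent array

# e.g. a 224x224 image flattened to 50,176 RGB "pixels",
# distilled into 512 latents of width 256, iterated a few times.
inputs = np.random.default_rng(1).standard_normal((50176, 3))
latents = np.random.default_rng(2).standard_normal((512, 256))
for step in range(3):  # iterative distillation into the latent bottleneck
    latents = cross_attend(latents, inputs, seed=step)
print(latents.shape)  # (512, 256): size is independent of the input length
```

The key design point the sketch illustrates is that the latent array, not the input, sets the width of every subsequent layer, which is what lets the model attend directly to tens of thousands of pixels without domain-specific downsampling.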
Updated: 2021-03-05