Cascade multi-head attention networks for action recognition
Computer Vision and Image Understanding (IF 4.5), Pub Date: 2020-01-02, DOI: 10.1016/j.cviu.2019.102898
Jiaze Wang, Xiaojiang Peng, Yu Qiao

Long-term temporal information provides crucial cues for video action understanding. Previous research typically relies on sequential models, such as recurrent networks, memory units, segmental models, and self-attention mechanisms, to integrate local temporal features for long-term temporal modeling. Recurrent and memory networks record temporal patterns (or relations) with memory units, which have proven difficult for capturing long-term information, as observed in machine translation. Self-attention mechanisms directly aggregate all local information with attention weights, which is more straightforward and efficient than the former. However, the attention weights from self-attention ignore the relations between local and global information, which may lead to unreliable attention. To this end, we propose a new attention network architecture, termed the Cascade multi-head ATtention Network (CATNet), which constructs video representations with two levels of attention, namely multi-head local self-attention and relation-based global attention. Starting from the segment features generated by backbone networks, CATNet first learns multiple attention weights for each segment to capture the importance of local features in a self-attention manner. With the local attention weights, CATNet integrates the local features into several global representations, and then learns a second level of attention over the global information in a relation-based manner. Extensive experiments on Kinetics, HMDB51, and UCF101 show that CATNet improves over the baseline network by a large margin. With RGB information only, we achieve 75.8%, 75.2%, and 96.0% on these three datasets respectively, which is comparable or superior to the state of the art.
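
The two-level cascade described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, assuming 2048-d segment features and four heads; the module and parameter names (CascadeAttention, local_score, relation_proj), the per-head pooling scheme, and the similarity-based relation scoring are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadeAttention(nn.Module):
    """Sketch of a two-level cascade attention (assumed design, not the paper's code).

    Level 1: each head scores every segment feature (local self-attention)
    and pools the segments into one global vector per head.
    Level 2: the per-head global vectors are re-weighted by a relation-based
    attention (pairwise similarity across heads) and fused for classification.
    """

    def __init__(self, feat_dim=2048, num_heads=4, num_classes=400):
        super().__init__()
        # One scoring vector per head for the local self-attention.
        self.local_score = nn.Linear(feat_dim, num_heads)
        # Projection used to compare global vectors in the relation attention.
        self.relation_proj = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_segments, feat_dim) segment features from a backbone.
        local_logits = self.local_score(x)              # (B, T, H)
        local_w = F.softmax(local_logits, dim=1)        # attention over segments
        # Weighted sum of segments per head -> H global representations.
        g = torch.einsum('bth,btd->bhd', local_w, x)    # (B, H, D)

        # Relation-based global attention: score each head's global vector
        # by its mean scaled dot-product similarity to the other heads.
        q = self.relation_proj(g)                                       # (B, H, D)
        rel = torch.einsum('bhd,bkd->bhk', q, g) / g.size(-1) ** 0.5    # (B, H, H)
        global_w = F.softmax(rel.mean(dim=2), dim=1)                    # (B, H)
        video_feat = torch.einsum('bh,bhd->bd', global_w, g)            # (B, D)
        return self.classifier(video_feat)


# Usage: 2 videos, 8 segments each, 2048-d features, Kinetics-style 400 classes.
model = CascadeAttention()
logits = model(torch.randn(2, 8, 2048))  # -> (2, 400)
```

In this sketch, each head pools the segments into one global vector, and the relation attention scores each head's vector by its mean similarity to the others; mean-pooling the relation matrix is one simple way to instantiate "relation-based" global attention.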

Updated: 2020-01-04