Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.1). Pub Date: 2020-06-22, DOI: 10.1145/3378026
Yonggang Li, Chunping Liu, Yi Ji, Shengrong Gong, Haibao Xu

Event recognition in surveillance video has gained extensive attention from the computer vision community. It remains highly challenging because inter-class variations are small, owing to factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) for target objects. In the second attention layer, the O-attention further guides the ResNet convolutional feature to yield holistic attention (H-attention), which perceives more details of occluded objects and the global background. In the third attention layer, the attention maps are applied to the deep features to obtain attention-enhanced features. The attention-enhanced features are then fed into a deep residual recurrent network that mines additional event cues from the video. Furthermore, an optimized loss function named softmax-RC is designed; it embeds residual block regularization and a center loss to alleviate vanishing gradients in deep networks and to enlarge inter-class distances. We also build a temporal branch to exploit long- and short-term motion information, and we obtain the final results by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets, CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51, demonstrate that the proposed method performs well and achieves state-of-the-art results.
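
The abstract does not give implementation details, but the third attention layer's idea of weighting deep features by an attention map can be sketched in PyTorch as below. This is a minimal sketch, assuming a guiding feature vector (e.g., a ResNet fully connected feature) scores each spatial location of a convolutional feature map; the module name, the linear projection, and all shapes are our assumptions, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionEnhance(nn.Module):
        """Weight a convolutional feature map by a spatial attention map
        derived from a guiding vector (a sketch, not the paper's layer)."""
        def __init__(self, feat_channels, guide_dim):
            super().__init__()
            # Project the guiding vector so it can score each spatial location.
            self.proj = nn.Linear(guide_dim, feat_channels)

        def forward(self, conv_feat, guide_vec):
            # conv_feat: (B, C, H, W); guide_vec: (B, D)
            q = self.proj(guide_vec)                                  # (B, C)
            scores = torch.einsum('bc,bchw->bhw', q, conv_feat)       # (B, H, W)
            attn = F.softmax(scores.flatten(1), dim=1).view_as(scores)
            enhanced = conv_feat * attn.unsqueeze(1)  # attention-enhanced features
            return enhanced, attn

    # Example: a (2, 2048, 7, 7) ResNet conv map guided by a (2, 2048) FC feature.
    m = AttentionEnhance(feat_channels=2048, guide_dim=2048)
    enhanced, attn = m(torch.randn(2, 2048, 7, 7), torch.randn(2, 2048))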
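The center-loss component of softmax-RC presumably follows the standard formulation L_c = (1/2) Σ_i ||x_i − c_{y_i}||², which pulls each feature toward its class center and thereby enlarges inter-class distances. The following sketch combines cross-entropy with that center loss; the residual block regularization is represented here by a generic L2 penalty over residual-block parameters, an assumption since the abstract does not specify its exact form, and the weights lam and beta are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftmaxCenterLoss(nn.Module):
        """Cross-entropy + center loss + an assumed L2 residual-block penalty."""
        def __init__(self, num_classes, feat_dim, lam=0.003, beta=1e-4):
            super().__init__()
            # One learnable center per class, updated jointly by backprop.
            self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
            self.lam, self.beta = lam, beta

        def forward(self, feats, logits, labels, residual_params=()):
            ce = F.cross_entropy(logits, labels)
            # Center loss: squared distance of each feature to its class center.
            center = 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).mean()
            # Assumed stand-in for the paper's residual block regularization.
            reg = sum(p.pow(2).sum() for p in residual_params)
            return ce + self.lam * center + self.beta * reg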
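The abstract also does not specify the fusion rule for the spatial and temporal streams; a common choice is a weighted average of the per-stream class probabilities, sketched below with a hypothetical fusion weight w.

    import torch.nn.functional as F

    def fuse_streams(spatial_logits, temporal_logits, w=0.5):
        # Late fusion: weighted average of softmax scores from the two streams.
        p_spatial = F.softmax(spatial_logits, dim=1)
        p_temporal = F.softmax(temporal_logits, dim=1)
        return w * p_spatial + (1.0 - w) * p_temporal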
