Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2) Pub Date: 2020-07-07, DOI: 10.1145/3357797
Wen Si 1, Cong Liu 2, Zhongqin Bi 3, Meijing Shan 4

Understanding the temporal dependencies of videos is fundamental for vision problems, but deep learning–based models remain insufficient in this area. In this article, we propose a novel deep multiplicative neural network (DMNN) for learning hierarchical long-term representations from video. The DMNN is built upon a multiplicative block that remembers the pairwise transformations between consecutive frames using multiplicative interactions rather than regular weighted-sum ones. The block is slid over the timesteps to update the network's memory of the frame pairs. A deep architecture can be built by stacking multiple layers of these sliding blocks. The multiplicative interactions lead to exact, rather than approximate, modeling of temporal dependencies, and the memory mechanism can retain these dependencies over arbitrary lengths of time. The stacked layers output multi-level representations that reflect the multi-timescale structure of video. Moreover, to address the difficulty of training DMNNs, we derive a theoretically sound convergent method that yields fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, as well as their effectiveness in capturing complicated temporal dependencies on a variety of synthetic datasets.
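As a rough illustration of the idea described in the abstract, the sketch below implements a multiplicative block in PyTorch that couples consecutive frames and a memory through element-wise products rather than a plain weighted sum, slides it over the timesteps, and stacks two layers. The class names, the factored gating form, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiplicativeBlock(nn.Module):
    """Updates a memory from each consecutive frame pair via a
    multiplicative (factored) interaction instead of a weighted sum.
    The exact gating form here is an assumption for illustration."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.proj_prev = nn.Linear(in_dim, hidden_dim)     # factor for frame t-1
        self.proj_curr = nn.Linear(in_dim, hidden_dim)     # factor for frame t
        self.proj_mem = nn.Linear(hidden_dim, hidden_dim)  # factor for the memory

    def forward(self, x_prev, x_curr, mem):
        # Element-wise product couples the two frames, modeling their
        # pairwise transformation directly rather than summing independent terms.
        gate = self.proj_prev(x_prev) * self.proj_curr(x_curr)
        # The memory is updated multiplicatively by the pairwise gate.
        return torch.tanh(gate + gate * self.proj_mem(mem))


class DMNNLayer(nn.Module):
    """Slides one multiplicative block over the timesteps and returns the
    per-step memories as this layer's representation of the sequence."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.block = MultiplicativeBlock(in_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, frames):  # frames: (batch, time, in_dim)
        b, t, _ = frames.shape
        mem = frames.new_zeros(b, self.hidden_dim)
        outputs = []
        for step in range(1, t):  # one memory update per consecutive frame pair
            mem = self.block(frames[:, step - 1], frames[:, step], mem)
            outputs.append(mem)
        return torch.stack(outputs, dim=1)  # (batch, time-1, hidden_dim)


if __name__ == "__main__":
    # Stacking layers yields multi-level representations over coarser timescales.
    seq = torch.randn(2, 16, 128)        # 2 clips, 16 frames, 128-d frame features
    layer1 = DMNNLayer(128, 256)
    layer2 = DMNNLayer(256, 256)
    reps = layer2(layer1(seq))
    print(reps.shape)                    # torch.Size([2, 14, 256])
```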
