Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection
arXiv - CS - Artificial Intelligence. Pub Date: 2021-07-29. DOI: arxiv-2107.13720
Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, Haifeng Chen

Detecting abnormal activities in real-world surveillance videos is an important yet challenging task, as prior knowledge about video anomalies is usually limited or unavailable. Although many approaches have been developed to address this problem, few can capture normal spatio-temporal patterns both effectively and efficiently. Moreover, existing works seldom explicitly consider the local consistency at the frame level and the global coherence of temporal dynamics in video sequences. To this end, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) for unsupervised video anomaly detection. Specifically, we first present a convolutional transformer to perform future frame prediction. It contains three key components: a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate the spatio-temporal features and predict the future frame. Next, a dual discriminator based adversarial training procedure, which jointly considers an image discriminator that maintains local consistency at the frame level and a video discriminator that enforces global coherence of temporal dynamics, is employed to enhance the future frame prediction. Finally, the prediction error is used to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and Shanghai Tech Campus, demonstrate the effectiveness of the proposed adversarial spatio-temporal modeling framework.
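To make the described pipeline concrete, the sketch below implements the three stages of the abstract in PyTorch: a convolutional-transformer generator (convolutional encoder, temporal self-attention over the encoded frames, convolutional decoder) for future frame prediction, a dual-discriminator adversarial objective, and a prediction-error anomaly score. This is a minimal illustration, not the authors' implementation: all layer sizes, the BCE GAN loss form, the 0.05 adversarial weight, the 3D-convolutional video discriminator, and the PSNR-style score are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Convolutional encoder: captures spatial information per input frame."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 4, 2, 1), nn.ReLU(inplace=True))
    def forward(self, x):                      # x: (B, C, H, W)
        return self.net(x)                     # -> (B, feat, H/4, W/4)

class TemporalSelfAttention(nn.Module):
    """Temporal self-attention: encodes dynamics across the T encoded frames."""
    def __init__(self, feat=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat, heads, batch_first=True)
    def forward(self, f):                      # f: (B, T, feat, H', W')
        B, T, C, H, W = f.shape
        tokens = f.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        out, _ = self.attn(tokens, tokens, tokens)   # attend over time per location
        out = out.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
        return out[:, -1]                      # (B, feat, H', W') for prediction

class ConvDecoder(nn.Module):
    """Convolutional decoder: integrates features and predicts the future frame."""
    def __init__(self, feat=64, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat, out_ch, 4, 2, 1), nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

class ConvTransformerGenerator(nn.Module):
    """Encoder -> temporal self-attention -> decoder, as listed in the abstract."""
    def __init__(self):
        super().__init__()
        self.enc, self.attn, self.dec = ConvEncoder(), TemporalSelfAttention(), ConvDecoder()
    def forward(self, clip):                   # clip: (B, T, C, H, W)
        B, T = clip.shape[:2]
        f = self.enc(clip.flatten(0, 1))       # encode each frame independently
        f = f.view(B, T, *f.shape[1:])
        return self.dec(self.attn(f))          # predicted next frame (B, C, H, W)

class ImageDiscriminator(nn.Module):
    """Judges single frames: frame-level local consistency."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, 2, 1))
    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # one logit per frame

class VideoDiscriminator(nn.Module):
    """Judges clips (3D convs assumed): global coherence of temporal dynamics."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 1, (3, 4, 4), (1, 2, 2), (1, 1, 1)))
    def forward(self, seq):                    # seq: (B, T, C, H, W)
        return self.net(seq.transpose(1, 2)).mean(dim=(1, 2, 3, 4))

def dual_discriminator_losses(img_d, vid_d, clip, real_next, fake_next):
    """Vanilla BCE GAN losses for both discriminators (the loss form is assumed)."""
    bce = F.binary_cross_entropy_with_logits
    real_seq = torch.cat([clip, real_next.unsqueeze(1)], dim=1)
    fake_seq = torch.cat([clip, fake_next.unsqueeze(1)], dim=1)
    def d_term(d, real, fake):                 # detach: do not update the generator
        r, f = d(real), d(fake.detach())
        return bce(r, torch.ones_like(r)) + bce(f, torch.zeros_like(f))
    d_loss = d_term(img_d, real_next, fake_next) + d_term(vid_d, real_seq, fake_seq)
    gi, gv = img_d(fake_next), vid_d(fake_seq)  # no detach: gradients reach generator
    g_adv = bce(gi, torch.ones_like(gi)) + bce(gv, torch.ones_like(gv))
    g_loss = F.mse_loss(fake_next, real_next) + 0.05 * g_adv  # weight is illustrative
    return d_loss, g_loss

def anomaly_score(pred, target, eps=1e-8):
    """Prediction error as anomaly score; the PSNR-style form is an assumption
    (frames in [0, 1]). Larger score = larger error = more likely anomalous."""
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1)
    return -10.0 * torch.log10(1.0 / (mse + eps))

A training step under these assumptions would compute fake_next = ConvTransformerGenerator()(clip), update the two discriminators with d_loss, then update the generator with g_loss; at test time only anomaly_score over the predicted frames is needed to flag abnormal ones.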

Updated: 2021-07-30