Light-weight AI and IoT collaboration for surveillance video pre-processing

https://doi.org/10.1016/j.sysarc.2020.101934

Abstract

As one of the internet of things (IoT) use cases, wireless surveillance systems are rapidly gaining popularity due to their easy deployment and improved performance. Videos captured by surveillance cameras must be uploaded for storage and further analysis, yet the large amount of raw data poses great challenges to transmission through resource-constrained wireless networks. Observing that most consecutive frames are redundant and contain few objects of interest (OoIs), filtering these frames before uploading can dramatically relieve the transmission pressure. Additionally, real-world monitoring environments may introduce shielded or blind areas in videos, which notoriously degrades frame-filtering accuracy. Collaboration between neighbouring cameras can compensate for such accuracy loss.

Under the computational constraints of edge cameras, we present an efficient video pre-processing strategy for wireless surveillance systems using light-weight AI and IoT collaboration. Two main modules are designed for either fixed or rotating cameras: (i) a frame filtering module based on dynamic background modelling and light-weight deep learning analysis; and (ii) a collaborative validation module for error compensation among neighbouring cameras. Evaluations on real-collected videos show the efficiency of this strategy: it achieves 64.4% bandwidth saving in the static scenario and 61.1% in the dynamic scenario, compared with raw video transmission. Remarkably, its relatively high balance ratio between frame-filtering accuracy and latency overhead outperforms state-of-the-art light-weight AI structures and other surveillance video processing methods, demonstrating the feasibility of this strategy.

Introduction

Wireless video surveillance systems nowadays act as guardians in our daily life thanks to their easy installation and flexible infrastructure [1]. They help in traffic monitoring [2], parking management, and public security protection on campuses, in office buildings, and in residential communities [3], [4]. With the tremendous advancement of artificial intelligence (AI), deep learning models are widely adopted in these applications, which require real-time video feeding and analysis. However, according to a global forecast report in 2020 [5], the wireless video surveillance market is expected to grow 10.4% worldwide from 2020 to 2025, on top of the existing USD 45.5 billion market. It is apparent that, with the growing size of monitoring areas and the number of cameras, the increasing volume of video streams will pose great challenges to transmission through resource-constrained wireless networks.

Existing commercial solutions reduce the transmission volume on the camera side through dynamic detection and video coding, which represent temporal and spatial reduction respectively. In the former, cameras only start recording and transmitting video when the difference between two consecutive frames exceeds a threshold [6]. However, many redundant but dynamic frames will still be uploaded to the cloud under this method. For example, public security monitoring in residential communities cares more about frames containing human activities, animals, and vehicles, while frames containing swaying branches or flying garbage bags are dynamic yet redundant, so they are retained. In the latter, video coding standards (e.g., H.264 [7], H.265 [8]) and their variants [9], [10], [11] structurally compress videos without frame filtering. This is compatible with the former solution and is out of the scope of our discussion.
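To make the former solution concrete, the following is a minimal Python sketch of threshold-based dynamic detection, assuming OpenCV on the camera side; the mean-absolute-difference metric and the threshold value are illustrative assumptions, not the exact detector used in commercial cameras.

    # Minimal sketch of threshold-based dynamic detection; the threshold
    # value and the mean-absolute-difference metric are illustrative only.
    import cv2

    DIFF_THRESHOLD = 12.0  # hypothetical trigger level

    def is_dynamic(prev_gray, curr_gray, threshold=DIFF_THRESHOLD):
        """Return True when two consecutive frames differ enough to record."""
        diff = cv2.absdiff(prev_gray, curr_gray)
        return diff.mean() > threshold

    cap = cv2.VideoCapture(0)  # assuming a camera at index 0
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if is_dynamic(prev_gray, gray):
            pass  # record/transmit: note that swaying branches also pass
        prev_gray = gray

Any motion at all passes this test, including swaying branches, which is exactly the redundancy that the content-aware filtering discussed below targets.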

In this paper, we focus on temporal reduction of the transmitted video volume. It is worth noticing that transmitted video streams contain a large number of redundant frames that carry little information useful for surveillance applications. Again taking monitoring in residential communities as an example, we analysed a 168-hour (one-week) video captured by one camera in a community. As shown in Fig. 1(a), we observe that except during rush hours (7:00–9:00 and 17:00–19:00), few OoIs (i.e., humans, cars, and animals) are detected in this video. Filtering out these redundant frames does not affect the quality of the safety monitoring service but dramatically saves wireless bandwidth during transmission.

Accordingly, we aim to design a content-aware frame filtering strategy that enhances the video compression ratio on the camera side. Well-known deep learning models such as SSD [12], YOLO [13], and R-CNN [14] are powerful tools for content-aware frame filtering, but it is not feasible to apply these models directly on edge cameras due to their limited computational capabilities. Methods like DeepMon [15], DeepEye [16], MobileNet [17], and MBBNet [18] propose light-weight optimizations of either the model structure or its computation for relatively powerful portable devices such as mobile phones. As compared in Table 1, surveillance cameras have much lower computational capabilities than mobile phones, which demands a higher degree of compression for deep learning models. In this paper, we discuss model compression both structurally and computationally. Additionally, massive buildings, plants, and facilities in the environment may introduce shielded or even blind monitoring areas in videos (Fig. 1(b)), which reduces the detection accuracy for redundant frames. To deal with this problem, we add a collaborative validation module based on the edge computing framework [19], [20], [21], where neighbouring cameras help validate uncertain frames through peer-to-peer communication.

In a word, we present efficient video pre-processing for wireless surveillance systems using light-weight AI and IoT collaboration. We consider both static and dynamic surveillance cameras, where dynamic cameras rotate continuously at a stable speed. The strategy is composed of two modules: a frame filtering module on edge cameras and a collaborative validation module among neighbouring pairs. In the first module, we model the background, taking the camera angle into account, with a proposed DCS-LBP operator, and select key frames after background pruning. A light-weight SSD model, optimized by channel pruning and convolution acceleration, is then applied to these key frames for OoI detection. Only the groups of pictures (GOPs) corresponding to key frames that contain OoIs are transmitted to the cloud. The second module compensates for the accuracy loss of the first: uncertain frames that contain partly shielded objects are broadcast to neighbouring cameras for validation.
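The DCS-LBP operator itself is the authors' proposed extension; as a rough illustration of the texture-based idea it builds on, the Python sketch below computes a standard centre-symmetric LBP (CS-LBP) code per pixel and flags a frame as a key-frame candidate when its texture histogram diverges from a slowly updated background histogram. The thresholds, the histogram distance, and the update rate are all illustrative assumptions.

    # Illustrative CS-LBP background check (NOT the paper's DCS-LBP).
    import numpy as np

    def cs_lbp(gray, t=3):
        """4-bit centre-symmetric LBP over the 4 opposing-neighbour pairs."""
        g = gray.astype(np.int16)
        pairs = [
            (g[:-2, :-2],  g[2:, 2:]),    # NW vs SE
            (g[:-2, 1:-1], g[2:, 1:-1]),  # N  vs S
            (g[:-2, 2:],   g[2:, :-2]),   # NE vs SW
            (g[1:-1, 2:],  g[1:-1, :-2]), # E  vs W
        ]
        code = np.zeros(pairs[0][0].shape, dtype=np.uint8)
        for bit, (a, b) in enumerate(pairs):
            code |= ((a - b) > t).astype(np.uint8) << bit
        return code

    def texture_hist(code):
        h = np.bincount(code.ravel(), minlength=16).astype(np.float64)
        return h / h.sum()

    def is_key_frame(frame_gray, bg_hist, diverge=0.25, alpha=0.05):
        """Chi-square distance against a running background histogram."""
        h = texture_hist(cs_lbp(frame_gray))
        d = 0.5 * np.sum((h - bg_hist) ** 2 / (h + bg_hist + 1e-9))
        bg_hist *= 1 - alpha
        bg_hist += alpha * h  # slow in-place background update
        return d > diverge

Since the paper models the background with the camera angle taken into account, a rotating camera would in this sketch correspond to keeping one bg_hist per discretized angle.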

A surveillance system prototype is implemented from edge cameras to the central server for evaluation. Three main factors are evaluated on real-collected videos to demonstrate the efficiency of this strategy: compression ratio, accuracy, and latency. Our method achieves a 64.4% compression ratio for videos with a static background and 61.1% for those with a dynamic background, which dramatically relieves the transmission burden. As for accuracy, it successfully filters out 92.6% of redundant frames. As for latency, it incurs at most 0.049 s of computational latency in the deep learning module and a 2.79 s processing delay in the validation module per frame. Its overall balance ratio between accuracy and latency outperforms state-of-the-art surveillance video processing methods, indicating satisfactory accuracy and acceptable latency overhead.

To sum up, this paper makes the following contributions:

  1. Adaptive background modelling: To adapt our strategy to most kinds of surveillance cameras, we design adaptive background modelling for flexible key frame selection.

  2. Joint model optimization: We explore optimizing both the structure and the computation steps to build a light-weight deep learning model, yielding faster speed and lower computational consumption (see the pruning sketch after this list).

  3. Collaborative validation on cameras: We design a novel collaborative scheme among edge cameras for validating uncertain frames, which compensates for the accuracy loss caused by environmental shielding.
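As a sketch of the structural half of the joint optimization in contribution 2, the following prunes a convolution layer's output channels by the L1 norm of their filters in PyTorch. The L1 criterion and the 50% keep-ratio are common choices assumed here for illustration, not necessarily the exact recipe applied to the SSD model in this paper.

    # Illustrative L1-norm channel pruning of one conv layer (PyTorch).
    import torch
    import torch.nn as nn

    def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
        """Keep the output channels whose filters have the largest L1 norms."""
        n_keep = max(1, int(conv.out_channels * keep_ratio))
        l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per filter
        keep = torch.argsort(l1, descending=True)[:n_keep]
        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding,
                           bias=conv.bias is not None)
        pruned.weight.data = conv.weight.data[keep].clone()
        if conv.bias is not None:
            pruned.bias.data = conv.bias.data[keep].clone()
        return pruned

    # Example: halve the channels of a 64-filter layer.
    layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    print(prune_conv_channels(layer).out_channels)  # -> 32

In a full network, the input channels of the next layer must be pruned to match; that bookkeeping is omitted here.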

The remaining part of the paper is organized as follows: Section 2 surveys the related work on the two modules in our strategy. Section 3 describes the problem statement and the designed framework. Section 4 and Section 5 introduce the details of the design and the prototype implementation, respectively. Section 6 reports the evaluation results. Finally, Section 7 concludes the paper and reviews some future work.

Section snippets

Related work

In smart surveillance systems, cameras on the edge have their own computational capabilities and are connected to the cloud by network communication, which forms a typical edge computing scheme [19]. The core idea of this scheme is to bring network functions, contents, and resources closer to end devices (e.g., edge cameras in surveillance systems) [22], which makes pre-processing of videos and collaborative validation on the edge possible before transmission. In this section, we will

Problem statement and strategy overview

The application scenarios of surveillance cameras include residential communities, offices, campuses, and so on. In this paper, we take the residential community as an example, where the challenges faced by a wireless surveillance system are most representative within its limited area. As shown in the left part of Fig. 2, surveillance cameras are deployed at important corners, and each camera can be either static or dynamic. Neighbouring cameras have overlapping monitoring areas at specific

Detailed designs

In this section, we introduce the detailed designs of the two main modules of our pre-processing strategy: the frame filtering module and the collaborative validation module.
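Ahead of those details, here is a minimal Python sketch of what the collaborative validation exchange could look like, assuming UDP broadcast between neighbouring cameras on one LAN. The JSON message format, the port, and the majority-vote rule are our illustrative assumptions, not the paper's exact protocol; a real exchange would also carry the uncertain frame region rather than just an identifier.

    # Hypothetical peer validation round over UDP broadcast.
    import json
    import socket

    PORT = 9999  # assumed validation port

    def request_validation(frame_id: str, timeout: float = 2.0) -> bool:
        """Broadcast an uncertain frame id and majority-vote the replies."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.settimeout(timeout)
        msg = json.dumps({"type": "validate", "frame": frame_id}).encode()
        sock.sendto(msg, ("255.255.255.255", PORT))
        votes = []
        try:
            while True:  # collect neighbour replies until the timeout fires
                data, _ = sock.recvfrom(1024)
                votes.append(bool(json.loads(data).get("contains_ooi")))
        except socket.timeout:
            pass
        # Majority vote; with no replies, conservatively keep the frame.
        return votes.count(True) > len(votes) // 2 if votes else True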

Implementation

As shown in Fig. 7, we implement a prototype of the whole surveillance system on two sides: the edge camera side and the server side. On the camera side, we use Raspberry Pi 3B boards for experiments (labelled in the blue box), each with an ARM Cortex-A53 CPU and no GPU. Equipped with an 802.11n WiFi module and the external camera module v2, they serve as surveillance cameras with video capturing, processing, and wireless communication functions. To simulate both static and dynamic cameras deployed in
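To show where the filtering decision plugs into transmission on such a prototype, the sketch below keeps only the GOPs whose key frames were flagged as containing OoIs. The fixed GOP length and the flag interface are assumptions for illustration.

    # Illustrative GOP-level filtering: transmit a GOP only when its key
    # frame was flagged OoI-positive. GOP_LEN = 30 is an assumed value.
    from typing import List

    GOP_LEN = 30  # assumed frames per group of pictures

    def select_gops(frames: List[bytes], key_has_ooi: List[bool]) -> List[bytes]:
        """Return only the frames belonging to OoI-positive GOPs."""
        kept = []
        for g, has_ooi in enumerate(key_has_ooi):
            if has_ooi:
                kept.extend(frames[g * GOP_LEN:(g + 1) * GOP_LEN])
        return kept

    # Example: 3 GOPs, only the middle one contains an OoI.
    frames = [b"frame" for _ in range(3 * GOP_LEN)]
    assert len(select_gops(frames, [False, True, False])) == GOP_LEN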

Evaluation

To prove the efficiency of the proposed strategy, experiments are conducted on real-collected videos. Two kinds of videos are captured simultaneously by static and dynamic cameras in the same residential community. Three main indicators are evaluated in this section. First, the compression ratios (frame deduction ratios) for both static and dynamic cameras are presented in Section 6.1, which is the main target of this paper. Second, Section 6.2 evaluates the processing latency
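For clarity, we read the reported compression ratio as the fraction of captured frames removed before upload; this formalization is our interpretation of the "frame deduction ratio" above, not a formula stated explicitly in the text.

    % Frame-deduction-based compression ratio (our interpretation):
    \[
      \mathrm{CR} \;=\; 1 - \frac{N_{\text{transmitted}}}{N_{\text{captured}}}
    \]
    % e.g. CR = 64.4% means only 35.6% of captured frames are uploaded.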

Summary and conclusion

This work is based on the observation that surveillance videos contain a large number of redundant frames that carry little information for surveillance applications. Simply reducing the sampling ratio or filtering frames by dynamic detection, as commercial cameras do, still leaves room to improve the compression ratio. To further reduce the transmission burden of surveillance videos over the wireless spectrum, in this paper we present an efficient video pre-processing scheme composed of

Declaration of Competing Interest

One or more of the authors of this paper have disclosed potential or pertinent conflicts of interest, which may include receipt of payment, either direct or indirect, institutional support, or association with an entity in the biomedical field which may be perceived to have potential conflict of interest with this work. For full disclosure statements refer to https://doi.org/10.1016/j.sysarc.2020.101934.


Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2018YFB1004703, by NSFC, China under Grants 61972253, 61672349, U190820096, 61672348, and 61672353, and by the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, China.


References (51)

  • Video Surveillance Market Global Forecast to 2025, ...
  • ccrisan, MotionEyeOS, ...
  • T. Wiegand et al., Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol. (2003).
  • Y. Tsai, M. Liu, D. Sun, M. Yang, J. Kautz, Learning binary residual representations for domain-specific video...
  • A. Serrano et al., Convolutional sparse coding for capturing high-speed video content, Comput. Graph. Forum (2017).
  • L. Kong et al., Efficient video encoding for automatic video analysis in distributed wireless surveillance systems, TOMCCAP (2018).
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S.E. Reed, C. Fu, A.C. Berg, SSD: Single shot multibox detector, in: ECCV, ...
  • J. Redmon et al., Real-time grasp detection using convolutional neural networks.
  • R.B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic...
  • H.N. Loc, Y. Lee, R.K. Balan, DeepMon: Mobile GPU-based deep learning framework for continuous vision applications, in: ...
  • A. Mathur, N.D. Lane, S. Bhattacharya, A. Boran, C. Forlivesi, F. Kawsar, DeepEye: Resource efficient local execution...
  • A.G. Howard et al., MobileNets: Efficient convolutional neural networks for mobile vision applications (2017).
  • G. Jia et al., Edge computing-based intelligent manhole cover management system for smart cities, IEEE Internet Things J. (2018).
  • S. Wang et al., A survey on mobile edge networks: Convergence of computing, caching and communications, IEEE Access (2017).

Yutong Liu is currently a Ph.D. candidate in Computer Science at Shanghai Jiao Tong University. In 2017, she received her B.E. degree in Computer Science and Technology from Ocean University of China. Her research interests include wireless sensing, artificial intelligence, and mobile computing.

Linghe Kong is currently a research professor at Shanghai Jiao Tong University. Before that, he was a postdoctoral fellow at Columbia University and McGill University. He received his Ph.D. degree in Computer Science from Shanghai Jiao Tong University in 2012, his Master's degree in Telecommunications from TELECOM SudParis in 2007, and his B.E. degree in Automation from Xidian University in 2005. His research interests include wireless communications, sensor networks, and mobile computing.

Guihai Chen earned his B.S. degree from Nanjing University in 1984, M.E. degree from Southeast University in 1987, and Ph.D. degree from the University of Hong Kong in 1997. He is a distinguished professor at Shanghai Jiao Tong University, China. He has been invited as a visiting professor by many universities, including Kyushu Institute of Technology, Japan in 1998, the University of Queensland, Australia in 2000, and Wayne State University, USA from September 2001 to August 2003. He has a wide range of research interests, focusing on sensor networks, peer-to-peer computing, high-performance computer architecture, and combinatorics.

Fangqin Xu is a professor at Shanghai Jianqiao University. He graduated in communication and information systems from East China Normal University in 2018 and received his master's degree in computer technology from Huazhong University of Science and Technology. His main research directions are Internet of things technology applications and computer application technology.

Zhanquan Wang is an associate professor in the Department of Computer Science and Engineering at East China University of Science and Technology in Shanghai, China. He received his Ph.D. in computer science from Zhejiang University, China, in 2005. His current research interests include spatial data mining, GIS, and educational technology.

This paper extends our previous work published in the Proceedings of the International Symposium on Quality of Service (IWQoS) 2019 (Liu et al., 2019).
