1 Introduction

Weapons, among other threats, need to be detected as soon as possible to eliminate or mitigate the danger they could cause [1]. Traditionally, the surveillance of public scenarios has been accomplished by the human supervision of the images captured by closed-circuit television (CCTV) systems. However, even an experienced guard may miss a dangerous event due to fatigue or loss of attention [2]. To help with this situation, the creation of automated surveillance systems (AVSs) able to locate potentially threatening objects (or other events) in video has been studied during the last decades [3].

Similarly to other areas, with the introduction of the new deep learning methods these frameworks have obtained promising results and are closer to be used in real scenarios [4, 5]. Nevertheless, although those detectors have high detection (D) and low false positive (FP) rates, when they are used in a different scenario from the one used for training, the false positive ratio increases [6]. This fact represents a major problem since even an increase of a 0.1% of the false positive ratio may cause 90 false alarms per hour with a video input of 25 fps. Therefore, when running the surveillance system in a real scenario, the outcome is usually an unsatisfactory number of false alarms. In most cases, this may lead to the guard switching off the system.

In this context, we propose to include an extra step that models the false alerts that are specific of the new scenario while approximately maintaining the capability of identifying the objects it was trained for. As an specific application, this work focuses on detecting handguns in video surveillance. After running the detector in a new scenario, it is possible to collect all the detector alarms. Practically, all of these alarms are false positives since the incidence of the true event (a handgun in the scene) is very low. Therefore, all of these detections can be stored and used to model the new scenario.

The new step will act as a filter able to recognize typical FPs of the detector in the particular scenario. Therefore, this problem can be seen as an anomaly detection problem where the anomalies are those detections that are not similar to the FPs modeled by the filter [7, 8]. In fact, the anomalies detected on this step will be the real alarms.

To detect abnormal and extreme data, one-class classifiers have been widely used in the literature [9]. More concretely, autoencoders have proven to be the most suitable of the techniques, obtaining good results even where other methods fail [10]. In order to use the autoencoder as a filter and decide whether a detection is an anomaly or not, we have tested different approaches: using the autoencoder reconstruction error as a threshold and using the central vector representation to train a nearest neighbor (NN) and a one-class support vector machine (SVM) classifiers. Although autoencoders have been applied in anomaly detection problems, to the best of our knowledge, this is the first time they have been applied to reduce false positive detections when the detector runs in a new scenario from which it is not possible to obtain labeled data.

For the purpose of testing our idea, we have generated an entirely synthetic dataset from the frames captured from a realistic 3D scenario. The synthetic scenario resembles a school hallway from the point of view of a surveillance camera. This allow us to generate as much data as needed with and without handguns to train and test the autoencoder.

The rest of the paper is organized as follows. Section 2 performs an overview of the advances in handgun detection. Section 3 shows the handgun detector used as base detector of the proposed false positive reduction method. Section 4 describes the datasets used including the synthetic dataset that has been generated. Section 5 provides detailed information about the proposed autoencoder-based filtering step. Finally, Sect. 6 shows the results and Sect. 7 summarizes the main conclusions.

2 Related work

In addition to automatic CCTV video surveillance, several approaches have been proposed to deal with concealed handguns in X-ray or millimetric wave images. These types of image are commonly used in airports, train stations or the entrance of some public buildings. In 2008, Nercessian et al. presented a system for handgun detection using X-ray luggage scan images [11]. The approach was based on the Gaussian mixture expectation maximization (EM) method to perform image segmentation prior to the obtention of the edge-based feature vectors. Gesick et al. compared three different approaches for the detection of handguns inside luggage [12]. The first method employs edge detection combined with pattern matching with reliable results. However, both the computational time and the number of false positives were high. The second method uses Daubechies wavelet transforms with inconclusive results as the authors commented. The third algorithm proposed in that work was based on the scale-invariant feature transform (SIFT). Later, in 2010, Harmer et al. used a completely different approach based on the modeling of the complex natural resonances of handguns and compared them with those of other objects [13]. In addition, the work of Flitton et al. concluded that using simpler 3D feature, descriptors outperform even complex RIFT/SIFT solutions with an accuracy of more than 95% [14]. In [15], Xiao et al. employed an extension of the Haar-like features with an AdaBoost-based cascade classifier to detect handguns in passive millimeter wave (PMMW) images. Following a similar approach, the study of Kundegorski et al. combines bag of visual words (BoVW) based on feature point descriptors and support vector machines (SVMs) and random forest classifiers [16].

While there are numerous methods and devices that can detect concealed weapons, unfortunately, the incidence of mass shootings requires the use of RGB surveillance images. Tiwari and Verma proposed a framework that applies color-based segmentation and k-means clustering to remove irrelevant objects and then uses Harris interest point detector and Fast Retina Keypoint (FREAK) to locate the handguns [17]. This resulted in high robustness when detecting the desired object at different scales and rotations. In addition, Halima and Hosam worked on a detector that combined SIFT features, k-means clustering, a word vocabulary histogram, and SVM [18].

The recent advances in deep learning have also been applied to the handgun detection problem using CCTV images. The first contribution in this area came in the work of Olmos et al. where two different approaches were used [4]. The first one uses a classification CNN to detect handguns with the sliding window method, whereas the second one is based on the Faster R-CNN detection architecture. The latter obtained the best results when tested in a dataset composed of several YouTube videos. On the other hand, Gelana et al. followed a more traditional approach using edge detection and a classification CNN with the sliding window method [19]. In addition, Romero and Salamea trained a YOLO object detection and localization system to detect firearms with the particularity of running the detector only in areas where there are people [20]. Another study of Olmos et al. proposed using a symmetric dual camera system to increase the performance of the detection model in low quality surveillance videos improving both the false positive and the detection rates.

To model outliers, discordant objects or simply data that has a different behavior or pattern, anomaly detection techniques have been used [7]. Anomaly detection has a wide range of applications. For example, it can be used to detect anomalies in stock prices and time series [21, 22], abnormal medical images of findings [23,24,25], abnormal events in video [5, 26, 27], intrusion detection [28], or disaster areas from radar images [29].

A simple method to model anomalies is to use neighbor-based methods such as the k-nearest neighbor [30,31,32] where anomalies are identified as those points in the data space that differ from the surrounding data points. The advantage of these methods is the independence of the data distribution. However, their performance relies on the values of the parameters selected such as the number of neighbors.

An alternative to use neighbor-based methods is to detect anomalies taken into account that they are grouped in a zone of the data space. Thus, the anomaly detection problem is solved as a subspace learning problem [33,34,35,36]. Although this method work well in some cases, finding the number of subspaces in which the anomalies are distributed is not trivial.

As in classification and detection tasks, CNNs have demonstrated to improve the performance in anomaly detection problems [26]. More concretely, convolutional autoencoders have been used to model input data and reduce data space dimensionality [37]. Their use has reduced the need of reprocessing input data and compute handcrafted features from it [10]. Following this approach, Mabu et al. proposed to use a convolutional autoencoder followed by a one-class SVM to model normal areas in satellite images and detect abnormal areas caused by natural disasters in Japan [29]. Lu and Xu demonstrated the potential of using variational autoencoders to detect anomalies in skin disease images [23]. The authors recommend to use them instead of GANs (generative adversarial networks) due to their training stability and interpretable results. Sugimoto et al. use an autoencoder followed by a k-NN classifier to detect myocardial infarction.

Another approach is the one followed by Gutoski et al. in which autoencoders and stacked denoising autoencoders are used for clustering [38]. With the clustering, representation in possible to define whether a new sample is an anomaly or not according to its distance to the clusters. Gutoski et al. also followed this approach for one-class classification [38].

In some cases, there are more than one group of abnormalities as in the work carried out by Mirsky et al. in [28]. The authors proposed to use an ensemble of autoencoders instead of one to detect online network intrusions. The decision of what is an anomaly or not is based on the RMSE (root-mean-squared errors) score output by the autoencoders.

For video input, Singh and Mohan use deep stacked autoencoders to obtain a deep representation of spatiotemporal video volumes to detect road accidents [37]. The anomaly score is obtained with a one-class SVM as in other works.

Finally, non-symmetric autoencoders have also been used to learn space representations. An example of this is the work carried out by Tran and Hogg where the autoencoder representation is used for detecting anomalies in video [39]. In addition, recurrent autoencoders with LSTM (long short-term memory) layers have also been applied for anomaly detection in video in the work carried out by Yan et al. in [40].

3 Handgun detector

Before addressing the false positive rate reduction through the use of the autoencoder, we needed to train and test a handgun detector. As shown in Sect. 2, there are several approaches that can be selected, from the use of classification CNNs with the sliding window approach to the most modern CNN detection architectures. While the former examines every subregion of the image, the latter uses region proposal algorithms to reduce the number of examined windows or process the full image in one pass [41]. The advantage of the new architectures is the ability to detect objects in different locations of the image without being restricted to a certain aspect ratio. Moreover, the number of regions to be examined is drastically reduced in comparison with other methods. The most representative architectures for object detection that follow a region proposals approach are R-CNN, Fast R-CNN, and Faster R-CNN [42].

Two well-known architectures are YOLO (You Only Look Once) and SSD (single-shot detector). YOLO addresses object detection as a regression problem with spatially separated bounding boxes and their corresponding class probabilities [43]. SSD is able to predict, with only one pass over the entire image, the bounding boxes and the class probabilities for them [44].

In addition to all the above, there is a recently developed CNN-based detector called RetinaNet [45]. RetinaNet was designed to solve the problem of having extreme foreground–background class imbalanced problems and has been also applied to X-ray images [46].

For the particular problem of weapon (handgun and knife) detection, [47] reviews recent work and shows that Faster R-CNN has been the prevalent method. For that reason, we have selected the Faster R-CNN architecture to train a handgun detector with a dataset provided by the University of Seville [1]. The dataset is composed of 871 images that contain 177 annotated handguns. Those images were extracted from the video captured by 2 CCTV cameras located in two different college hallways.

4 Datasets

The collection and labeling of the data necessary to train deep learning models are tasks that require significant time and effort. This is even more complicated in detection or segmentation problems in which someone has to select the area of the image in which the object is located, or the exact contour of the object, in addition to the category. A possible solution to this problem is the use of public datasets, but, depending on the problem, it is not always possible to have one available. The use of synthetic images facilitates the work required to obtain large datasets. For this work, a completely synthetic dataset has been generated with Unreal Engine 4 [48], rendering a scenario that represents a high-school hallway where people are walking. There are other popular alternatives such as Unity [49] and Lumberyard/CryEngine [50] that can also be used for the same purpose. While some of the people on the scenario carry everyday objects in their hands, such as mobile phones, others carry guns or nothing (see Fig. 1).

Another advantage of having the data generation fully controlled by the researcher is that it is also possible to automatically generate a mask image with the desired objects for each frame. In this case, each generated image contains the people in white, the background covered in black, and each handgun filled with a different color to help extract the information about its location. Once all masks are obtained, the coordinates of the bounding boxes that contain the weapons are extracted storing the annotations in XML files with the format defined by the Pascal VOC 2012 Challenge [51].

A total of 4000 images were generated with this method with a resolution of 1280\(\times \)720. From these, 3000 frames were used to train and adjust the proposed autoencoder filter, containing 5437 annotated handguns. The remaining 1000 frames were used to evaluate and compare the detector and detector + autoencoder systems.

Fig. 1
figure 1

Synthetic scenario with a zoom on the elements of interest (in this case, a mobile phone and a handgun)

In addition to the synthetic dataset, the Gun Movies Database [52] has also been used to ensure the differences are caused by the proposed method and not by changes in the texture by the origin of the data. This dataset contains images of size 640\(\times \)480 pixels from 7 laboratory-shot movies with a total of 817 frames and 686 annotated handguns (Fig. 2).

For this second dataset, a total of 817 images were used. From these, 571 frames were used to train and adjust the proposed autoencoder filter and the remaining 246 frames were used to evaluate and compare the detector and detector + autoencoder systems.

Fig. 2
figure 2

Sample frame from the Gun Movies dataset

5 Proposed method

As introduced above, when a detector runs in a new scenario, the false positive rate increases due to its particularities that were not seen in the training data. To deal with this problem, we propose to add a filtering step after the detector inference (see Fig. 3). The filter is used to discard the FP detections of the object detector produced by the particularities of the new scenario.

Fig. 3
figure 3

Proposed system

The filter can be considered as a one-class classifier that learns how to identify a certain type of samples. Thus, the rest of the samples can be considered as anomalies. This problem has been addressed in the literature through the use of the one-class versions of the SVM, k-nearest neighbor (k-NN), random forests classifiers, and more recently with deep autoencoders [9, 38, 53]. In our case, an autoencoder is trained to model the class of the typical FP detections.

In order to collect the training samples for the autoencoder, the detector is run in the particular scenario for a certain period of time, storing all the FP detections (Fig. 4). Initially, all detections can be considered as FPs in a real scenario since the incidence of handguns is very low.

Fig. 4
figure 4

Autoencoder training phase

Deep autoencoders learn the input data distribution using an intermediate representation. They are able to compress the data into a small vector and then reconstruct the input from it with accurate results. If new input data come from a different distribution, the reconstruction error will be higher.

Finally, according to the autoencoder structure, we define and compare 3 different methods to check whether the output of the detector is a typical false positive. The simplest one is to establish a threshold for the reconstruction error. Therefore, detections with low reconstruction error will be discarded as typical FPs of the scenario. The other two methods are based on the use of the central vector as a compact representation of the images and then train a one-class classifier with it. For that, SVM and k-NN with \(k=1\) were used, and the thresholds were selected according to the scores and the distance to the closest neighbor, respectively.

5.1 Autoencoder architecture

The structure of an autoencoder consists of an encoder path that ignores the noise and reduces the dimensionality and a decoder path that makes the reconstruction. The compressive path of the autoencoder used consists of a set of 6 convolutional and max-pooling layers (Fig. 5). Similarly, the reconstruction path has also 6 convolutional and up-sampling layers. The input is a 3-channel image of size 64\(\times \)64, and the central vector has 148 elements (conv2d_7 layer). A more detailed description of the architecture can be seen in Table 1.

Fig. 5
figure 5

Autoencoder architecture used

Table 1 Detailed description of the autoencoder architecture used. The output of the conv2d_7 layer (in bold) is used as the FP intermediate representation

6 Results

The Faster R-CNN model trained with the dataset from the University of Seville obtained an mAP of 0.7933. Training took 2 days and executed 62 epochs. An Ubuntu 14.04 LTS machine with 2 nVIDIA Quadro M4000 cards, Keras with TensorFlow backend and CUDA 8.0 were used to perform the training.

After obtaining this base detector, both datasets were divided into 4 parts to (1) train and (2) validate the autoencoder, (3) fit the k-NN and SVM classifiers, and (4) test the system variants (Fig. 6). The detector was then run on each of the subsets, and the FP and TP patches were stored. Although only the FP detections are used to train and validate the autoencoder and fit the classifiers, the correct detections were also generated and stored for the test subset to check that the detection rate is minimally affected.

Overall, for the autoencoder training and validation, two sets with 4913 and 3607 FPs, respectively, were obtained for the synthetic dataset and 586 and 405 FPs for the Gun Movies dataset. Another set composed of 3712 FP regions for the synthetic dataset and 95 FP regions for the Gun Movies dataset was used to fit the classifiers used to perform the decision. Finally, a set of 4832 regions (4632 FPs and 200 TPs) for the synthetic dataset and 499 regions (359 FPs and 40 TPs) for the Gun Movies dataset was reserved for testing. Figure 7 shows some examples of the typical FP detections obtained in the synthetic scenario.

Fig. 6
figure 6

Subsets of the synthetic dataset used. The Gun Movies dataset is similarly split

Fig. 7
figure 7

Typical false positives of the handgun detector in the synthetic scenario (enlarged)

Fig. 8
figure 8

Typical false positives of the handgun detector in the Gun Movies dataset (enlarged)

The autoencoders were then trained and validated with the stored FP detections of the training and validation subsets. This process took only about an hour for each dataset to complete 500 epochs in a Windows 10 PC with an nVIDIA GTX 1060 MaxQ card using Keras with TensorFlow backend and CUDA 9.0. At this point, if the autoencoder is used with some test images from TP and FP detections, the ability to effectively reconstruct FPs is evidenced (see Fig. 9).

Fig. 9
figure 9

Autoencoder reconstruction of: a TP and d) FP of the detector from the synthetic dataset. b and e are the reconstructed images, and c and f are the absolute difference between the reconstructions and their corresponding original images

The stored FP detections from the fit subset of each dataset were used to feed each of the autoencoders and get intermediate vectors to train the SVM and k-NN one-class classifiers. The SVM selected uses a linear kernel. On the other hand, \(k=1\) was selected for the k-NN algorithm.

To illustrate the performance of both the detector and detector + autoencoder approaches on the two datasets, they were tested with the 1000 images from the fourth subset of the synthetic scenario and the 246 frames of the Gun Movies dataset. The histograms of the reconstruction error for the MSE thresholding-based method, the probability score for the SVM one-class classifier, and the distance to the nearest neighbor for the k-NN were obtained (Figs. 10 and 11). Although TPs and FPs are overlapped in all cases, the first part of the MSE and k-NN histograms and the last part of the SVM histogram do not contain TPs. Therefore, the FPs that lie on those parts of the histograms can be potentially filtered selecting the value of the first bin (or the last for the SVM) in the histogram that contains TPs as threshold. For the synthetic dataset, this shows that, without affecting the detection rate, up to 26.4% of all the FPs can be filtered using the MSE reconstruction error, 22.5% using the one-class SVM, and 30% with the distance to the nearest neighbor of the k-NN (2). On the other hand, in the histograms obtained from the Gun Movies dataset TPs and FPs are less overlapped making it possible to remove up to 87.2% of all the FPs using the MSE reconstruction error, 49% using the one-class SVM, and 74.1% with the distance to the nearest neighbor of the k-NN without affecting the detection rate.

Fig. 10
figure 10

Synthetic dataset. Histograms of the MSE reconstruction error, SVM score, and K-NN distance (y-axis uses logarithmic scale)

Fig. 11
figure 11

Gun movies dataset. Histograms of the MSE reconstruction error, SVM score, and K-NN distance (y-axis uses logarithmic scale)

Table 2 Percentage of FPs that are filtered by method and dataset

In addition, the precision–recall curves corresponding to the detector and the autoencoder with the three proposed decision methods were obtained (see Figs. 12 and 13). The experimental results show a reduction in the number of false positives while roughly maintaining the detection capabilities [54]. Since the precision–recall curve is calculated by varying the detector output threshold and the autoencoder has another threshold that can be varied too, each curve was obtained under a specific value for the autoencoder and varying the threshold of the detector (3). For the synthetic dataset, comparing all the curves for a specific autoencoder thresholding method, they show a maximum increase in the precision of 1.46% at the same recall values when the autoencoder and the MSE are used (threshold = 0.0057), of 1.2% using the autoencoder and the SVM classifier (threshold = 14600), and of 1.77% in case of the autoencoder and K-NN (threshold = 0.34). For the Gun Movies dataset, results show a maximum increase in the precision of 47% at the same recall values when the autoencoder and the MSE are used (threshold = 0.047), of 8% using the autoencoder and the SVM classifier (threshold = 175), and of 20% in case of the autoencoder and K-NN (threshold = 1.92).

Fig. 12
figure 12

Precision–recall curves for the synthetic dataset. Best viewed in color (color figure online)

Fig. 13
figure 13

Precision–recall curves for the Gun Movies dataset. Best viewed in color (color figure online)

Table 3 Increase in the detector precision when the autoencoder is applied by method and dataset

Overall, the results show that the autoencoder is able to filter part of the new FPs in all cases without affecting the detection rate of the original system. All thresholding methods are able to reduce the number of FPS to some degree.

7 Conclusions

In this work, a step to filter the false positive detections that appear when a pre-trained handgun detector is deployed in the final surveillance scenario has been proposed. This step consists of training a deep autoencoder with the false positive regions obtained from the particular scenario. Once the autoencoder is trained, it can be used to decide whether a detection is similar to the already known typical false alarms and can be filtered, or otherwise if an alert should be triggered.

The ability of the autoencoder to reduce the number of FPs has been demonstrated with a potential reduction by up to 30% for the synthetic scenario when it is combined with a k-NN classifier trained with the vector representation of the detector FPs regions and up to 78% of the FPs for the Gun Movies dataset when the autoencoder is combined with the MSE error metric. Furthermore, the handgun detection capability of the system is not compromised by the added filtering step under a wide range of threshold levels.

Although the proposed approach has been only applied to two particular scenarios, it can be extended to more than one since having different perspectives, lighting conditions or background objects will generate different false positives. Thus, during the system’s deployment only a generic detector (in this case a handgun detector) is required and one autoencoder will be trained for each camera feed.