Abstract

Pedestrian detection based on visual sensors has made significant progress, in which region proposal generation is a key step. There are two mainstream methods to generate region proposals: anchor-based and anchor-free. However, anchor-based methods need more anchor-related hyperparameters for training than anchor-free methods. In this paper, we propose a novel multiscale anchor-free (MSAF) region proposal network to obtain proposals, especially for small-scale pedestrians. It has several branches to predict proposals and assigns the ground truth according to the height of the pedestrian. Each branch consists of two components: feature extraction and a detection head. Adapted channel feature fusion (ACFF) is proposed to select features at different levels of the backbone for effective feature extraction. The detection head predicts the pedestrian center location, center offsets, and height to obtain bounding boxes. With our classifier, the detection performance can be further improved, especially for small-scale pedestrians. Experiments on Caltech and CityPersons demonstrate that the MSAF significantly boosts pedestrian detection performance: the log-average miss rate (MR) on the reasonable setting is 3.97% and 9.5%, respectively. If the proposals are reclassified with our classifier, the MR improves to 3.38% and 8.4%.

1. Introduction

Pedestrian detection plays an important role in self-driving tasks by helping to judge whether there are pedestrians in front of the driving area. Therefore, pedestrian detection performance directly affects pedestrian safety [14]. In recent years, with the research and development of convolutional neural networks (CNN), pedestrian detection methods based on CNN have shown rapid progress. According to the regression starting status, pedestrian detection can be divided into anchor-based [5–9] and anchor-free [10–16] detection methods.

Based on the number of detection stages, anchor-based methods can be divided into two-stage [9, 17–20] and one-stage [6, 21, 22] detection methods. In two-stage detection, region proposals are generated first, and then the proposals are classified by a classifier. In one-stage detection, the final detection results are obtained in a single step: the classification stage is skipped, and bounding boxes with confidence scores are predicted directly.

The most impressive anchor-based method is the region proposal network (RPN) [7], which was first proposed in Faster R-CNN [7]. The regression starting status of RPN is predefined by a set of anchor boxes with multiple scales and ratios, and then the anchors are transformed into proposals according to the learned parameters. Multiscale anchors can alleviate the scale imbalance problem [23] caused by the wide range of ground truth widths and heights. Although RPN achieves excellent performance, its anchor boxes need to be designed manually, which affects the generalization ability of the model.

Different from anchor-based methods, anchor-free methods start the regression from a point and do not require anchor-related hyperparameters. At the same time, with the help of the focal loss [24], which addresses the imbalance between positive and negative samples during training, many remarkable anchor-free detection methods have been developed, such as CSPNet [12], FCOS [14], and FSAF [25]. Among these methods, CSPNet is a single-scale anchor-free detector that is effective on pedestrian detection datasets. However, as Figure 1 presents the height and area of the pedestrians in the Caltech and CityPersons datasets, we observe that most of the pedestrians in these datasets are small, which leads to scale imbalance [23]. CSPNet is insufficient for handling scale imbalance because it uses a single-scale detection head and relies only on concatenation to fuse the multiscale feature maps from different stages.

Inspired by RPN and CSPNet, we design a multiscale anchor-free detection head (MSAF) on top of adapted channel feature fusion (ACFF) to generate proposals from features at different scales. The deeper the network is, the more difficult it is to detect small pedestrians. As we know, different feature layers have different receptive field sizes; if multiscale regression is trained on the same feature layer, the sizes of the ground truth and the actual receptive fields do not match. Fortunately, within the feature extraction module, the predicted boxes do not need to correspond exactly to the actual receptive fields of each layer. We therefore design the multiscale detection head so that specific feature maps learn to respond to pedestrians of a particular scale.

The main contributions of this work are summarized as follows: first, we propose an effective approach, named adapted channel feature fusion, to extract channel features at different levels so that only useful channel features are kept for fusion. Second, we propose a multiscale anchor-free method to replace the anchor-based method; it reduces the hyperparameters required by anchor-based methods and addresses the scale imbalance problem. Third, an RCNN classifier is proposed to further improve the detection performance, especially for small-scale pedestrians. Fourth, our detection method achieves state-of-the-art performance on the Caltech dataset [26] and competitive performance on the CityPersons [27] pedestrian benchmark.

2. Related Work

In this section, we mainly introduce anchor-based and anchor-free pedestrian detection methods in terms of feature extraction and detection heads. In the pedestrian classification step, how to select the backbone and design the classifier to address various problems is also highlighted.

2.1. Anchor-Based Methods

Anchor-based methods need a set of predefined anchors with different scales and ratios for regression training, and then the anchors are transformed into proposals according to the learned parameters. In two-stage pedestrian detection, generating high-quality proposal boxes is the first key step; the region proposals are then classified by a classifier. The most representative method is RPN, which was first introduced in Faster R-CNN [7]. RPN [19] takes a smooth L1 loss for regression training and is implemented on the final high-level feature layers. MS-CNN [17] sets RPN modules on different level layers of the backbone to pay more attention to small object detection. FPN [9] uses a top-down feature fusion method to build a feature pyramid and generates bounding boxes with RPN on different levels. We can also find that RPN based on feature fusion can significantly improve the detection performance. SDS-RCNN [28] applies semantic segmentation to the RPN and RCNN to boost pedestrian detection accuracy. SSA-CNN [29] proposes a self-attention mechanism connecting the RPN and RCNN stages to improve pedestrian detection performance. AR-Ped [5] utilizes a stackable decoder-encoder module consisting of top-down and bottom-up pathways for feature fusion to improve the precision of the RPN stage. Repulsion [30] and aggregation [31] losses are designed on the RPN to tackle occluded pedestrians in crowded scenes.

In one-stage pedestrian detection, bounding boxes are predicted in a single step. SSD [6] predicts detection results at different levels of features with prior anchors. YOLOv3 [22] and YOLOv4 [21] predict objects on three different scale branches, and an FPN-like feature fusion architecture is used to detect small-scale objects. RetinaNet [24] also takes an FPN-like feature fusion architecture for object detection, and the focal loss is proposed to address the foreground-background class imbalance.

2.2. Anchor-Free Methods

There are two ways to find objects in anchor-free detection. The first way is to use the center point or region of the pedestrian to predict the distances to the bounding box boundary. YOLOv1 [13] predicts objects on the final layer of the backbone and detects an object in a grid cell if the object center falls into it. UnitBox [32] takes the Intersection over Union (IoU) loss as the detection head to predict proposals, avoiding the box-level scale imbalance of the L2 loss optimized in DenseBox [33]. CSPNet [12] extracts multiscale features with concatenation on different stages, and pedestrian detection is simplified as a straightforward center and scale prediction task through convolutions. Wang [15] appends some adaptations to CSPNet to improve the robustness of the method. CSID [16] proposes a pedestrian detector with a novel identity-and-density-aware nonmaximum suppression (NMS) algorithm to refine the detection results. FCOS [14] selects FPN as the feature fusion architecture and defines the points inside a bounding box as positive.

The second way is to predict key points on heatmaps as the detection head. CornerNet [11] takes an hourglass network for multiscale feature fusion and detects a pedestrian as a pair of key points on heatmaps. CenterNet [10] extends CornerNet's detection head to a triplet of key points to predict bounding boxes. ExtremeNet [34] also uses an hourglass network to extract features and predicts four extreme points and one center point for each object.

2.3. Classifier

In the pedestrian classification step, different types of classifiers have been designed to address various problems. To obtain better classification accuracy, different convolutional neural network architectures have been designed for different application scenarios, such as MobileNet [35], VGGNet [36], GoogLeNet [36], ResNet [37], and DenseNet [38]. To ensure that input regions of any size always produce region features of the same size, RoIPooling [19] and RoIAlign [39] were designed, and the shared features are classified directly according to the Region of Interest (RoI).

To detect small objects, RPN+BF [20] uses a cascaded Boosted Forest for pedestrian classification to mine hard negative examples and handle the small number of instances. In another approach [8], scale-aware weights are predicted, and a large-scale subnetwork and a small-scale subnetwork are combined into a unified framework to solve the multiscale pedestrian classification problem, achieving state-of-the-art performance on the Caltech dataset. To improve detection performance with feature fusion, BCN [28] combines semantic segmentation and classification to perform pedestrian classification. SA-RCNN [29] uses self-attention to perform feature extraction for pedestrian classification and achieves good results. A previous study [40] designs HyperLearner, a new type of feature fusion framework, to extract features and uses additional pedestrian features to improve the detection performance. To handle the occlusion problem in pedestrian detection, a partial occlusion-aware pooling unit [31] is used in classification. To address the IoU distribution imbalance, Cascade R-CNN [41] cascades several RCNN networks with different IoU thresholds and continuously optimizes the detection results to improve the detection performance.

3. Baseline Method

CSPNet [12] is a single-scale anchor-free pedestrian detector. It directly obtains the detection results by predicting the center location, the height of the bounding boxes, and the center offsets at a single scale. The architecture consists of two modules: feature extraction and the detection head.

The feature extraction module uses a CNN to extract feature maps for pedestrian detection. In this paper, ResNet-50 is used as the backbone of feature extraction; its convolution layers can be divided into five stages according to the pooling stride. Given an input image of size $H \times W$, the feature resolution of stage $i$ is $H/2^{i} \times W/2^{i}$. The experimental results show that the best detection performance can be obtained by deconvolving the features of stages 3, 4, and 5 to the same resolution before concatenation.
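For concreteness, a minimal Keras sketch of this fusion step is given below, assuming stage3, stage4, and stage5 are the ResNet-50 outputs at strides 8, 16, and 32; the deconvolution kernel sizes and the 256-channel width are illustrative assumptions rather than the exact configuration.

```python
from tensorflow.keras import layers

def fuse_stages(stage3, stage4, stage5, channels=256):
    """Deconvolve stages 3-5 to a common resolution and concatenate them."""
    # stage3/4/5 have strides 8/16/32; bring all of them to stride 4.
    p3 = layers.Conv2DTranspose(channels, 4, strides=2, padding="same")(stage3)
    p4 = layers.Conv2DTranspose(channels, 4, strides=4, padding="same")(stage4)
    p5 = layers.Conv2DTranspose(channels, 8, strides=8, padding="same")(stage5)
    return layers.Concatenate(axis=-1)([p3, p4, p5])
```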

The detection head is used to generate the bounding boxes and contains three branches: the first branch predicts the classification score and determines the center location of the proposal; the second branch predicts the height of the bounding box, from which the width is obtained with a fixed aspect ratio; the third branch predicts the center offsets of the bounding box to refine the center location. The center offset is defined as $\left(\frac{x}{r} - \left\lfloor \frac{x}{r} \right\rfloor, \frac{y}{r} - \left\lfloor \frac{y}{r} \right\rfloor\right)$, where $(x, y)$ is the pedestrian center and $r$ is the downsampling stride of the feature map. The details are provided in Figure 2(b). In the training process, the modified focal loss is used as the loss function for the classification task, and the smooth L1 loss is used as the regression loss for the height and center offset prediction tasks.
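A sketch of such a three-branch head in Keras might look as follows; the 256-channel 3 × 3 convolution and the layer names are assumptions for illustration.

```python
from tensorflow.keras import layers

def csp_head(features):
    """Single-scale detection head with three prediction branches."""
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(features)
    center = layers.Conv2D(1, 1, activation="sigmoid", name="center")(x)  # center heatmap
    height = layers.Conv2D(1, 1, name="height")(x)                        # height regression
    offset = layers.Conv2D(2, 1, name="offset")(x)                        # (dx, dy) center offsets
    return center, height, offset
```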

4. Our Approach

In this section, we introduce an adapted channel feature fusion method to extract channel features at different levels and propose a multiscale anchor-free region proposal network to generate proposals. The proposed network has fewer hyperparameters than anchor-based methods and can significantly boost detection performance, especially for small objects.

4.1. Adapted Channel Feature Fusion for Feature Extraction

In CSPNet, only concatenation is used to fuse the features of different levels. Currently, the common feature fusion methods are element-wise sum (SUM) and concatenation. As we know, different feature layers have distinct abilities: low-level feature maps provide more precise localization information, while high-level maps contain more semantic information. Therefore, we introduce ACFF, which can not only adaptively select channel features on different scales for fusion but also boost the feature discrimination. The details of ACFF can be found in Figure 3.

Two steps are needed to implement the ACFF. In the first step, the features of different levels are scaled to the same resolution and then concatenated along the channel dimension. Because the features at the three levels of the detection backbone have different resolutions as well as different numbers of channels, to obtain the fused feature of the $l$-th level, the features of the two adjacent levels must first be scaled to the same resolution as the $l$-th level before the feature maps are concatenated. The concatenated feature maps can be presented as

$$X = \mathrm{Concat}\left(\mathcal{R}\left(F^{l-1}, s\right), F^{l}, \mathcal{R}\left(F^{l+1}, 1/s\right)\right),$$

where $F^{l}$ is defined as the feature of the $l$-th level with $l \in \{3, 4, 5\}$, $\mathcal{R}(\cdot, s)$ rescales a feature map by stride $s$, and $s = 2$.

For example, let us assume that channel feature fusion is performed at the fourth level. If the resolution of a feature to be fused is smaller than that of the target feature, deconvolution is used to enlarge the feature, and then a 1 × 1 convolution layer is used to compress the channels to 256. If the resolution of a feature to be fused is greater than that of the target feature, we use a convolution layer with a stride of 2 to reduce the feature resolution and channel dimension.
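The rescaling logic can be sketched as follows; the deconvolution and stride-2 convolution kernel sizes are assumptions.

```python
from tensorflow.keras import layers

def rescale_to_level(feature, coarser, channels=256):
    """Rescale an adjacent-level feature to the target level's resolution.

    coarser=True: the feature is lower-resolution, so enlarge it by
    deconvolution and compress channels with a 1x1 convolution.
    coarser=False: the feature is higher-resolution, so reduce resolution
    and channels with a stride-2 convolution.
    """
    if coarser:
        x = layers.Conv2DTranspose(channels, 4, strides=2, padding="same")(feature)
        return layers.Conv2D(channels, 1, padding="same")(x)
    return layers.Conv2D(channels, 3, strides=2, padding="same")(feature)
```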

In the second step, global average pooling (GAP) is used to generate the channel feature vector $s$. The $c$-th channel element from GAP is calculated by the following formula:

$$s_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c}(i, j),$$

where $X_{c}$ is the $c$-th channel of the concatenated feature maps of spatial size $H \times W$.

Then, a new compact feature $z$ is created to adaptively learn the fusion weights of the different level features. This is achieved by a fully connected (FC) layer with a lower dimension:

$$z = \delta\left(W s\right),$$

where $\delta$ is the ReLU function, $W \in \mathbb{R}^{d \times 3C}$, $d = 3C / r$, and the reduction ratio $r = 16$ is a typical setting in our experiment.

Further, softmax is used for normalization, and the learned weights $a$, $b$, and $c$ are used to select the corresponding level features for the final fusion $V$. Note that $a_i$, $b_i$, and $c_i$ are simply scalar values at channel $i$, and $a_i + b_i + c_i = 1$:

$$a_{i} = \frac{e^{A_{i} z}}{e^{A_{i} z} + e^{B_{i} z} + e^{C_{i} z}}, \qquad V_{i} = a_{i} \cdot \tilde{F}^{l-1}_{i} + b_{i} \cdot F^{l}_{i} + c_{i} \cdot \tilde{F}^{l+1}_{i},$$

where $A, B, C \in \mathbb{R}^{C \times d}$, $A_{i}$ denotes the $i$-th row of $A$, and $\tilde{F}^{l \pm 1}$ are the rescaled adjacent-level features. With this method, the features at all the levels are adaptively aggregated at each scale. The output of ACFF can be used as the input of the MSAF and the RCNN.
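Putting the two steps together, a minimal Keras sketch of ACFF is given below; the channel width, reduction ratio, and the floor of 32 on the compact dimension are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ACFF(layers.Layer):
    """Sketch of adapted channel feature fusion for three rescaled inputs."""

    def __init__(self, channels=256, reduction=16, **kwargs):
        super().__init__(**kwargs)
        d = max(3 * channels // reduction, 32)
        self.fc = layers.Dense(d, activation="relu")              # compact feature z
        self.heads = [layers.Dense(channels) for _ in range(3)]   # projections A, B, C

    def call(self, feats):
        # feats: three (B, H, W, C) tensors already at the same resolution.
        x = tf.concat(feats, axis=-1)                      # concatenate on channels
        s = tf.reduce_mean(x, axis=[1, 2])                 # GAP -> (B, 3C)
        z = self.fc(s)                                     # compact descriptor -> (B, d)
        logits = tf.stack([h(z) for h in self.heads], 1)   # (B, 3, C)
        w = tf.nn.softmax(logits, axis=1)                  # per-channel weights a, b, c
        w = w[:, :, tf.newaxis, tf.newaxis, :]             # broadcast to (B, 3, 1, 1, C)
        return tf.reduce_sum(tf.stack(feats, 1) * w, axis=1)
```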

4.2. Multiscale Anchor-Free Detection Head

Scale imbalance occurs in pedestrian datasets because certain sizes of objects or input bounding boxes are overrepresented [23]. Taking the Caltech [26] and CityPersons [27] datasets as examples, the heights of the pedestrians range from 30 px to 350 px and from 35 px to 965 px, respectively, and the distribution of pedestrian heights across scales is imbalanced. Approximately 80% of the pedestrians in Caltech and 64% of the pedestrians in CityPersons are less than 112 px in height. The detailed statistics are provided in Figure 1. The scale imbalance problem suggests that a single scale of visual processing is not sufficient for detecting objects at different scales. If single-scale regression is used to predict the height, the constraint range is too large and may cause a deviation in the prediction results, as with CSPNet in Figure 2(d). Two popular approaches have been used for multiscale prediction. The first is prediction on the same layer: for example, in RPN [19], as shown in Figure 2(a), the detection head is at the end of the backbone and is trained with different aspect ratios and scales. The second is prediction on different level layers at multiple scales, as shown in Figures 2(b) and 2(c), where the detection head differs for each feature layer; for example, in MS-CNN [17], FPN [9], and FCOS [14], a detection head is attached to each level of the feature pyramid to obtain proposals.

To address the scale imbalance problem, we propose MSAF to assign the ground truth to different scale-spaces for the forward and backward passes, as shown in Figure 2(e). The MSAF is based on a convolutional network that produces bounding boxes with scores, followed by an NMS step to produce the final detections. Three branches are attached to the final feature maps to predict the pedestrian location, height, and center offsets at each scale. The width can be calculated from the height and an aspect ratio; according to the statistics of pedestrian detection bounding box annotations [26, 27], the aspect ratio is generally set to 0.41. In this detection head, we attach a 3 × 3 convolution layer on the fusion feature $V$, and then three head map layers are appended to predict the location, height, and offset with 1 × 1 convolution kernels.

During training, we must determine how to assign the ground truth to the corresponding scale. To handle different object scales, we refer to the assignment formula in FPN [9]. If the ground truth height is $h$, we assign it to the scale $k$ according to

$$k = \min\left(\max\left(\left\lfloor \log_{2}\left(h / h_{0}\right) \right\rfloor + 1, 1\right), K\right),$$

where $h_{0}$ is the base height of the smallest scale and $K$ is the number of scales.

Taking the Caltech as an example ($h_{0} = 56$, $K = 3$), the ground truth bounding boxes are assigned to three scales: 56~112 px, 112~224 px, and above 224 px. An example of assigning pedestrians to different levels according to their scales is depicted in Figure 4. The red ground truth is assigned to the low-level features, and the green ground truth is assigned to the high-level features for prediction.
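The assignment rule above can be sketched as a small helper; clamping to the valid scale range for boxes outside the listed intervals is an assumption.

```python
import math

def assign_scale(height_px, base=56.0, num_scales=3):
    """Assign a ground truth box to a scale branch by its pixel height.

    With base=56 and three scales (the Caltech setting above), heights in
    [56, 112) go to scale 1, [112, 224) to scale 2, and 224+ to scale 3.
    """
    k = math.floor(math.log2(max(height_px, base) / base)) + 1
    return min(max(int(k), 1), num_scales)
```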

To predict the center location, we define the positive region, ignore region, and negative region for the center ground truth. If a point falls into the center region of a pedestrian, it is assigned to the positive samples for training. The center region is the area centered on a pedestrian center within a radius of 2. The ignore region consists of the locations covered by ground truth bounding boxes not assigned to this scale, together with the assigned ground truth bounding boxes excluding the positive samples. Excluding the positive and ignore regions, the rest of the image is the negative region. The whole illustration can be found in Figure 4. The modified focal loss is used as the loss function to predict the center location as follows:

$$L_{\mathrm{center}} = -\frac{1}{K} \sum_{i,j} \alpha_{ij} \left(1 - \hat{p}_{ij}\right)^{\gamma} \log\left(\hat{p}_{ij}\right).$$

In the above, $K$ is the number of positive samples. $\gamma$ and $\beta$ are the focusing hyperparameters, and we experimentally set $\gamma = 2$ and $\beta = 4$ as suggested in [12]. $M$ is a 2D Gaussian mask centered at each positive location, and the mask is proportional to the height and width of the individual objects. If $y_{ij}$ is equal to 1, $\hat{p}_{ij}$ is set to $p_{ij}$ and $\alpha_{ij}$ to 1; otherwise, $\hat{p}_{ij} = 1 - p_{ij}$ and $\alpha_{ij} = \left(1 - M_{ij}\right)^{\beta}$.
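A sketch of this loss in TensorFlow is shown below, assuming the ignore regions have already been masked out of the input tensors.

```python
import tensorflow as tf

def center_loss(y_true, gauss, y_pred, gamma=2.0, beta=4.0, eps=1e-6):
    """Modified focal loss for the center heatmap (CSP-style sketch).

    y_true is the binary positive mask, gauss is the 2D Gaussian mask M,
    and y_pred holds predicted center probabilities; all share one shape.
    """
    p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    pos = y_true * tf.pow(1.0 - p, gamma) * tf.math.log(p)
    neg = (1.0 - y_true) * tf.pow(1.0 - gauss, beta) \
          * tf.pow(p, gamma) * tf.math.log(1.0 - p)
    k = tf.maximum(tf.reduce_sum(y_true), 1.0)  # number of positive samples
    return -(tf.reduce_sum(pos) + tf.reduce_sum(neg)) / k
```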

To predict the pedestrian height and center offsets, the pedestrian height in scale $s$ is redesigned as $\tilde{h} = \log\left(h / \left(h_{0} \cdot 2^{s-1}\right)\right)$ so that each branch regresses a target normalized by the base height of its scale. We select the smooth L1 loss for the height and offset prediction:

$$L_{\mathrm{height}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{SmoothL1}\left(\tilde{h}_{k}, \hat{h}_{k}\right), \qquad L_{\mathrm{offset}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{SmoothL1}\left(o_{k}, \hat{o}_{k}\right),$$

where $(o_{k}, \tilde{h}_{k})$ and $(\hat{o}_{k}, \hat{h}_{k})$ represent the offset and height of the ground truth and the prediction for the positive samples, respectively.
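A minimal TensorFlow sketch of these regression pieces follows; the normalized log-height target is written under the form assumed above.

```python
import math
import tensorflow as tf

def height_target(h, scale, base=56.0):
    """Normalized log-height target for a box assigned to `scale` (assumed form)."""
    return math.log(h / (base * 2 ** (scale - 1)))

def smooth_l1(target, pred):
    """Smooth L1 loss averaged over the supplied (positive) locations."""
    diff = tf.abs(target - pred)
    return tf.reduce_mean(tf.where(diff < 1.0, 0.5 * diff * diff, diff - 0.5))
```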

To sum up, the optimization objective in scale $s$ is

$$L^{(s)} = \lambda_{c} L_{\mathrm{center}}^{(s)} + \lambda_{h} L_{\mathrm{height}}^{(s)} + \lambda_{o} L_{\mathrm{offset}}^{(s)},$$

where $\lambda_{c}$, $\lambda_{h}$, and $\lambda_{o}$ weight the three tasks.

The final objective loss function is a multitask loss over the different scales to be optimized as follows:

$$L = \sum_{s=1}^{S} L^{(s)},$$

where $S$ corresponds to the maximum scale of the pedestrian height given by the scale assignment formula above.

4.3. RCNN Classifier

The RCNN classifier is used to classify the proposals generated by the MSAF as pedestrian or nonpedestrian. We take the object classifiers from [5, 20, 28, 29] as references to construct the RCNN classifier. We resize each object to a fixed resolution and then use it as the input of the classifier to determine whether it is a pedestrian based on the final score. As shown in Figure 1, the height of most pedestrians is less than 112 px. If the image is resized to 224 × 224 px as the input, the information of the image will be distorted, degrading the classification performance. To alleviate this problem, we crop the object from the image, add 25% padding, and resize it to 112 × 112 px as the input. VGG-16 [36] without the pool5 layer is chosen as the backbone since the size of the receptive field of VGG-16 then matches that of the pedestrian. ACFF is used to extract features to improve the discrimination ability of the model; the details can be found in Section 4.1.
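A sketch of this preprocessing step is given below, assuming OpenCV and an (x1, y1, x2, y2) box format; regions falling outside the image are zero-filled.

```python
import cv2
import numpy as np

def crop_proposal(image, box, pad_ratio=0.25, out_size=112):
    """Crop a proposal with 25% context padding and resize it to 112x112."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    x1, x2 = int(x1 - pad_ratio * w), int(x2 + pad_ratio * w)
    y1, y2 = int(y1 - pad_ratio * h), int(y2 + pad_ratio * h)
    H, W = image.shape[:2]
    # Paste the in-image part of the padded crop onto a zero canvas.
    patch = np.zeros((y2 - y1, x2 - x1, 3), dtype=image.dtype)
    sx1, sy1 = max(x1, 0), max(y1, 0)
    sx2, sy2 = min(x2, W), min(y2, H)
    patch[sy1 - y1:sy2 - y1, sx1 - x1:sx2 - x1] = image[sy1:sy2, sx1:sx2]
    return cv2.resize(patch, (out_size, out_size))
```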

5. Experiments

In this section, we first introduce the implementation details, evaluation metrics, and dataset information. Then, the ablation studies on the MSAF and ACFF are reported. Finally, we give a detailed description of the benchmark comparison experiments.

5.1. Training, Inference, and Implementation Details

The whole detection framework is implemented in Keras. In the MSAF stage, ResNet-50 is used as the backbone to predict the bounding boxes. Specifically, our MSAF is trained using the adaptive moment estimation (Adam) algorithm for 100 K iterations with an initial learning rate of 0.0001 and a step learning rate policy. Each minibatch is constructed from one image. The network is trained with multiscale input images scaled between 0.6 and 1.5. The whole image is taken as input to predict the heights, offsets, and locations. We first select bounding boxes with scores above 0.01 and then use NMS with a threshold of 0.5 for final processing. In the RCNN stage, VGG-16 is selected as the backbone and is trained with the stochastic gradient descent (SGD) algorithm for 120 K iterations with an initial learning rate of 0.001 and a step learning rate policy. After 60 K iterations, the learning rate is set to 0.0001. The weight decay and momentum are set to 0.0005 and 0.9, respectively. No more than 20 proposals are selected from the MSAF for each SGD minibatch, and these are selected according to their scores in descending order. To carry out the experiments, an Intel Xeon E5-2620 @ 2.1 GHz CPU server with 48 GB of memory and two TITAN RTX (24 GB) GPUs are used.
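The inference-time filtering described above, score thresholding at 0.01 followed by NMS at an IoU of 0.5, can be sketched with TensorFlow's built-in NMS; the [y1, x1, y2, x2] box format and the output cap of 100 are assumptions.

```python
import tensorflow as tf

def postprocess(boxes, scores, score_thresh=0.01, iou_thresh=0.5, max_out=100):
    """Score filtering followed by NMS, as in the MSAF inference step.

    boxes: (N, 4) tensor in [y1, x1, y2, x2] order; scores: (N,) tensor.
    """
    keep = tf.where(scores > score_thresh)[:, 0]
    boxes, scores = tf.gather(boxes, keep), tf.gather(scores, keep)
    idx = tf.image.non_max_suppression(boxes, scores, max_out,
                                       iou_threshold=iou_thresh)
    return tf.gather(boxes, idx), tf.gather(scores, idx)
```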

5.2. Evaluation Metrics

To evaluate the MSAF, two benchmark datasets, Caltech [26] and CityPersons [27], are selected for the experiments and comparisons. The log-average miss rate over False Positives Per Image (FPPI) ranging in $[10^{-2}, 10^{0}]$ (denoted as MR-2) [26] is used to evaluate the pedestrian detection performance. A lower miss rate indicates better detection performance. The evaluation settings are from Caltech and CityPersons, respectively. Generally, the reasonable setting focuses on pedestrians taller than 50 pixels, and the all setting on pedestrians taller than 20 pixels.
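For reference, a sketch of the MR-2 computation is given below; it assumes the miss rate/FPPI curve has been traced over descending score thresholds so that FPPI is increasing.

```python
import numpy as np

def log_average_miss_rate(miss_rates, fppi):
    """MR-2 sketch: log-average miss rate over 9 FPPI points in [1e-2, 1e0].

    miss_rates and fppi are parallel arrays sorted by increasing FPPI.
    """
    refs = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for ref in refs:
        # Miss rate at the largest FPPI not exceeding the reference point;
        # if the curve never reaches this FPPI, count the full miss rate.
        idx = np.where(fppi <= ref)[0]
        samples.append(miss_rates[idx[-1]] if len(idx) else 1.0)
    # Geometric mean of the sampled miss rates.
    return np.exp(np.mean(np.log(np.maximum(samples, 1e-10))))
```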

5.2.1. Caltech

The Caltech pedestrian dataset [26] consists of approximately 10 hours of 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. Approximately 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated. We extract one out of every 4 frames from the raw videos (42782 images in total) to form the training set and one out of every 30 frames (4024 images in total) to form the test set. The new annotations [42] of the Caltech dataset are used in the experiments.

5.2.2. CityPersons

The CityPersons dataset [27] is built upon the Cityscapes dataset, a large and diverse set of stereo video sequences collected in urban street scenes. The dataset contains a total of 5000 images at a resolution of 2048 × 1024, with more than 35 k person annotations and 13 k ignore regions. The split of the train, validation, and test subsets is the same as that of Cityscapes. The training subset has 2975 images and was recorded across 18 different cities in three different seasons and various weather conditions. The validation subset was created from 3 different cities and has 500 images. The test subset was collected from 6 different cities and has 1575 images.

5.3. Ablation Study

In this section, we conduct ablation experiments on Caltech to evaluate the performance of each component of the proposed method. For the proposals, we focus on analyzing the impact of the MSAF and ACFF. For the classifier, we evaluate the impact of classification on the overall performance.

5.3.1. ACFF for Region Proposals

To assess the importance of ACFF, we compare it with the other feature fusion methods: SUM and concatenation [43]. ResNet-50 is taken as the backbone in these experiments. The three feature fusion methods share the same detection head from CSPNet [12].

It can be observed from Table 1 that ACFF achieves the best performance when used for feature fusion, while the performance of SUM and Concat fusion is similar. ACFF can adaptively select different levels of channel features for fusion, which improves the pedestrian detection performance. With ResNet-50 as the backbone, the MR-2 of ACFF on the Caltech is 0.41% lower than that of concatenation and 0.53% lower than that of SUM when the IoU is 0.5.

5.3.2. Importance of MSAF

To highlight the performance of MSAF, it is compared with RPN [19], SDS-RPN [28], SSA-RPN [29], AR-RPN [5], and CSPNet [12] on the Caltech dataset under the reasonable and all settings. The detailed results are given in Table 2.

As shown in Table 2, MSAF achieves state-of-the-art performance compared with the other methods: the MR-2 is 3.97% under the reasonable setting and 55.93% under the all setting when the backbone is ResNet-50. MSAF also obtains the best performance when the backbone is VGG-16, with an MR-2 of 4.91% under the reasonable setting and 58.86% under the all setting. Through the experimental comparison on the two backbones, we find that the region proposal methods on ResNet-50 are better than those on VGG-16. Comparing CSPNet, FPN (with the same detection head as MSAF), and MSAF, it can be observed that multiscale prediction outperforms single-scale prediction; multiscale regression can effectively improve the detection performance.

5.3.3. Importance of RCNN Classifier

To evaluate the influence of our classifier on the detection performance, MSAF is used to extract the proposals as inputs, and our RCNN classifier is compared with the classifiers from RPN+BF [20], SDS-RCNN [28], and AR-Ped [5]. The comparison experiments are performed on the Caltech dataset with different input resolutions, and the results are given in Table 3.

From Table 3, compared with the other methods, our method has the highest detection accuracy and better robustness. We observe that the MR-2 of our RCNN classifier is 3.38% when the input resolution is 112 × 112 px. With VGG-16 as the backbone on the Caltech, using 112 × 112 px inputs performs better than using 224 × 224 px inputs. Comparing the RCNN and RCNN+ACFF methods, using ACFF for feature fusion further improves the classification performance. In addition, we find that two-stage detection with our classifier significantly improves the performance compared with one-stage detection.

5.3.4. Small Object Detection

To further illustrate the effectiveness of our MSAF in small object detection, we compare it with CSPNet, CSPNet-RCNN, and MSAF-RCNN at different pedestrian heights in Figure 5. As shown in the figure, small-scale pedestrian detection is difficult: the taller the pedestrian, the better the detection performance. The improvement from CSPNet to MSAF is about 5% when the height is between 30 and 50 pixels and only about 0.5% when the height is between 70 and 90 pixels, where the improvement is not obvious. The improvement on small object detection brought by our RCNN classifier is large when the height is between 30 and 50 pixels: about 11% over CSPNet-RCNN and 7% over MSAF.

5.4. Benchmark Comparison
5.4.1. Caltech

The performance of MSAF was evaluated on the Caltech [26] and CityPersons [27] pedestrian benchmarks. As depicted in Figure 6(a), our MSAF-RCNN achieves the state-of-the-art result under the reasonable setting with an MR-2 of 3.38%. Without the RCNN classifier, the MR-2 of the MSAF is 0.45% lower than that of CSPNet. We also find that, with the RCNN classifier, the MR-2 of CSPNet-RCNN decreases from 4.54% to 3.97%. Figure 6(b) shows that our MSAF-RCNN obtains the best result, with an MR-2 of 51.58% under the all setting. Compared with other region proposal methods, the gap between MSAF and CSPNet in MR-2 is about 1%. All of these results show that MSAF achieves better performance on the Caltech dataset: the anchor-free MSAF can replace anchor-based methods to generate proposals while alleviating the scale imbalance problem.

5.4.2. CityPersons

The experimental results in Table 4 show that our MSAF achieves an overall performance improvement over CSPNet, with the MR decreasing to 9.5% on the reasonable set. They also show that MSAF as a one-stage detector does not outperform ACSP [15] and CSID [16]. However, our MSAF-RCNN achieves state-of-the-art performance on the reasonable setting and the second best performance on the heavy and partial sets, which shows that our RCNN classifier significantly improves the performance based on the proposals obtained from MSAF. As shown in the small column of Table 4, our MSAF can effectively detect small objects.

6. Conclusions

To improve pedestrian detection performance, a multiscale anchor-free region proposal network is proposed in this paper. ACFF is first used to extract features, and then the MSAF detection head is trained according to the height of the pedestrians. Through experimental comparisons, we find that multiscale detection detects small-scale pedestrians more effectively than single-scale detection. In addition, the RCNN classifier is adopted for further improvement. Compared with other detection methods, we find that the performance of two-stage detection is significantly better than that of one-stage detection. Overall, our detection method achieves state-of-the-art performance on Caltech with the new annotations and competitive performance on CityPersons.

Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time, as the data also form part of an ongoing study. Requests for data should be sent by email to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the National Key R&D Program (Grant No. 2018AAA0102600), the National Natural Science Foundation of China (Grant Nos. 62002082, 61866009, and 61906050), and Guangxi Natural Science Foundation (Grant Nos. 2019GXNSFAA245014 and 2020GXNSFBA238014).