Abstract

For unmanned aerial vehicles (UAVs), detecting objects at different scales is an important component of visual recognition. Recent advances in convolutional neural networks (CNNs) have demonstrated that attention mechanisms remarkably enhance the multiscale representation ability of CNNs. However, most existing multiscale feature representation methods simply employ several attention blocks to adaptively recalibrate feature responses, which overlooks context information at the multiscale level. To solve this problem, a multiscale feature filtering network (MFFNet) is proposed in this paper for the image recognition system of a UAV. A novel building block, namely, the multiscale feature filtering (MFF) module, is proposed for ResNet-like backbones; it allows feature-selective learning of multiscale context information across multiple parallel branches. These branches employ atrous convolutions at different scales and further adaptively generate channel-wise feature responses by emphasizing channel-wise dependencies. Experimental results on the CIFAR-100 and Tiny ImageNet datasets show that MFFNet achieves very competitive results in comparison with previous baseline models. Further ablation experiments verify that MFFNet achieves consistent performance gains in image classification and object detection tasks.

1. Introduction

To understand the environment, unmanned aerial vehicles (UAVs) need to integrate information from various sensors such as cameras, lidar, radar, and GPS. The information from the camera provides a straightforward means of visual perception, which supports higher-level reasoning for the UAV. Image recognition [1, 2], one of the important tasks in UAV visual perception, has long been a research hotspot. Convolutional neural networks (CNNs) have been widely used in solving visual cognition tasks, such as image classification [3, 4], object detection [5], and salient object detection [6]. Unlike traditional hand-crafted features (e.g., HOG [7]), features learned by CNNs from data require minimal human involvement during training. Thus, most recent research on visual recognition is based on network engineering, and it is becoming increasingly important to design better CNN architectures for visual recognition tasks.

Generally, three factors are central to the design of convolutional networks: depth, width, and cardinality. In 2015, Simonyan and Zisserman [8] designed an effective and very deep network by stacking blocks of the same shape, which achieved state-of-the-art performance. However, as CNNs become increasingly deep, gradient propagation becomes more difficult. To alleviate the vanishing-gradient problem caused by increasing network depth, He et al. [9] proposed a deep residual learning approach, in which each layer learns a residual function with reference to its input. Experiments showed that this residual learning method can be easily optimized and gains accuracy from increased depth. Szegedy et al. [10] showed that width is another important factor for improving the performance of CNNs. Compared with shallower and narrower networks, the main advantage of this method was that it can significantly improve accuracy with a moderate increase in computational demand. ResNeXt [11] exploited the potential of grouped convolutions and empirically showed that increasing cardinality is more effective than going deeper or wider as capacity increases. In 2016, Zagoruyko and Komodakis [12] demonstrated that using more channels and wider convolutions can improve accuracy. Then, Huang et al. [13] proposed a dense convolutional network, which utilizes direct connections between any two layers with the same feature map size to strengthen feature propagation. Ding et al. [14] designed a novel convolutional network, which uses asymmetric convolutions to strengthen the square convolution filters.

Other network studies [15–18] exploited the potential of networks through attention mechanisms. For example, Hu et al. [15] designed a novel squeeze-and-excitation (SE) block that adaptively recalibrates channel-wise feature responses by emphasizing interdependent channel maps. After that, Woo et al. [16] introduced a simple attention module called CBAM, which exploits both spatial and channel-wise attention to emphasize meaningful features along the channel and spatial axes. Li et al. [17] proposed selective kernel networks (SKNets), which realize adaptive receptive field sizes of neurons in a nonlinear manner. Furthermore, previous work [18] captured multiscale features through the additive effects of feature-selective and spatial attention. Targets appear at different scales in the image frame and are often occluded by clutter, which is a major challenge for image recognition algorithms in UAV applications. Therefore, multiscale feature representation is particularly critical for the image recognition system of a UAV. However, most existing multiscale feature representation methods based on attention mechanisms simply employ several attention blocks to adaptively recalibrate feature responses, which overlooks context information at the multiscale level.

Based on this analysis, a multiscale feature filtering network (MFFNet) is proposed in this paper for the image recognition system of a UAV. In MFFNet, we propose a novel building block, called the multiscale feature filtering (MFF) module. Our key idea is to retain important information about smaller and less conspicuous objects by allowing feature-selective learning of multiscale context information across multiple parallel branches. These branches employ atrous convolutions at different scales and further adaptively generate channel-wise feature responses by emphasizing channel-wise dependencies.

It is possible to construct an MFF network (MFFNet) by simply replacing the standard 3 × 3 filters in ResNet-like backbones with MFF modules. Besides, while the template for the MFF module is generic, the role it performs varies at different depths throughout the MFFNet. To compare the difference between the MFF module and standard 3 × 3 filter, we visualize the class activation mapping using Grad-CAM [19] and observe that the MFFNet-based CAM results tend to focus on the whole object more than other baseline networks. Experimental results on CIFAR [20], Tiny ImageNet [21], PASCAL VOC 2007 [22], MS COCO [23], and UAV123 [24] datasets show that our proposed method can achieve consistent performance gains in image recognition tasks.

The rest of the paper is organized as follows: Section 2 introduces our proposed MFFNet and presents the details of the multiscale feature filtering (MFF) module. Section 3 describes the experimental settings and analyzes the experimental results. Section 4 concludes this study and outlines future work.

2. Method

In this section, MFFNet, a novel backbone network for the image recognition system of a UAV, is introduced. An overview of MFFNet is depicted in Figure 1. An MFFNet contains four stages, and each stage contains multiple MFF units. Each MFF unit consists of a 1 × 1 convolution, an MFF module, a 1 × 1 convolution, and a skip connection. Figure 2 shows the schema of an MFF unit.

Furthermore, we present the details of multiscale feature filtering (MFF) module. The MFF module consists of three submodules: split module (SM), multiscale branch module (MBM), and fusion module (FM).

2.1. MFFNet Architecture

MFF modules can be integrated into a standard architecture, such as ResNet [9], by replacing every 3 × 3 convolutional layer with an MFF module. Here, MFF modules are used within MFF units; by making this change to each such module, an MFFNet can be constructed. Further variants that integrate MFF modules with ResNeXt [11], DenseNet [13], ShuffleNetV2 [25], and MobileNetV2 [26] can be constructed by following similar schemes. Like ResNet-50 and ResNeXt-50, MFFNet-50 and MFFNeXt-50 can be constructed by simply stacking a set of MFF units; deeper variants can be obtained by changing the number of MFF units per stage. MFFNeXt-50 can be obtained from MFFNet-50 by changing the bottleneck width [12] and cardinality [11] of the MFF units. The cardinality, c, is the number of groups within a filter, whereas the bottleneck width, d, is the number of channels in each group.
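As an illustration of this drop-in replacement, the following minimal PyTorch sketch (our own naming, not the authors' released code) builds one MFF unit as in Figure 2: a 1 × 1 reduction convolution, an MFF module in place of the standard 3 × 3 convolution of a ResNet bottleneck, a 1 × 1 expansion convolution, and a skip connection. It assumes an MFFModule class like the one sketched at the end of Section 2.2.

```python
import torch.nn as nn
import torch.nn.functional as F

class MFFUnit(nn.Module):
    """One MFF unit: 1x1 conv -> MFF module (replacing the 3x3 conv) -> 1x1 conv,
    wrapped by a residual skip connection (sketch, not the authors' code)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.mff = MFFModule(mid_ch)          # assumed class, sketched in Section 2.2
        self.expand = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))

    def forward(self, x):
        return F.relu(self.expand(self.mff(self.reduce(x))) + self.shortcut(x))
```

Stacking such units stage by stage, with the per-stage unit counts of Table 1, yields the MFFNet-50 and MFFNeXt-50 variants described above.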

Table 1 shows the MFFNet-50 and MFFNeXt-50 architectures, each with four stages built from MFF units. The filter sizes and feature dimensionalities of a residual block are shown inside the brackets, and the number of stacked blocks per stage is shown outside the brackets. “B = 3” denotes an MFF module with three branches, and “c = 32” denotes grouped convolutions with 32 groups.

2.2. Multiscale Feature Filtering Module

The structure of the MFF module is illustrated in Figure 3. First, given an input feature map, the SM divides it into multiple feature map subsets to obtain fine-grained multiscale information. Second, to capture objects at different scales, the MBM employs multiple atrous convolutions with different rates; these branches use atrous convolutions instead of standard convolutions with larger kernels, which reduces the model’s parameters for the same receptive field. The MBM further selectively generates channel-wise feature responses by emphasizing channel-wise dependencies, and once channel-wise feature responses at different scales are captured, the transformed features are connected by skip structures to enhance feature propagation. Third, a channel concatenation operator fuses the information captured by the different branches.

2.2.1. Split Module

As shown in Figure 1, in the split module (SM), for any given input feature map $X \in \mathbb{R}^{H \times W \times C}$, to obtain fine-grained multiscale information, the SM first equally splits $X$ along the channel dimension into feature map subsets, such as the three feature map subsets shown in Figure 1, namely, $X_A$, $X_B$, and $X_C$, where $X_A, X_B, X_C \in \mathbb{R}^{H \times W \times C/3}$. Here, $H$, $W$, and $C$ denote the height, width, and number of channels of the feature map, respectively.
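As a minimal sketch (assuming the subsets are formed by an equal channel split, with tensor shapes of our choosing), the SM reduces to a single channel-wise chunk operation in PyTorch:

```python
import torch

x = torch.randn(1, 96, 56, 56)                   # example input with C = 96 channels
x_a, x_b, x_c = torch.chunk(x, chunks=3, dim=1)  # three subsets with C/3 = 32 channels each
```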

2.2.2. Multiscale Branch Module

The multiscale branch module (MBM) consists of three branches, namely, the A-branch, the B-branch, and the C-branch. Moreover, each branch contains a feature filtering module (FFM). The structure of the feature filtering module (FFM) is depicted in Figure 4.

In the FFM, we selectively generate channel-wise feature responses by emphasizing channel-wise dependencies. Specifically, for the preprocessed feature map $U = [u_1, u_2, \dots, u_{C'}] \in \mathbb{R}^{H \times W \times C'}$, the FFM first uses global average pooling and global max pooling to generate two different channel-wise statistics $z^{avg} \in \mathbb{R}^{C'}$ and $z^{max} \in \mathbb{R}^{C'}$. The global average pooling and global max pooling operations are denoted as $F_{avg}$ and $F_{max}$. Specifically, the $c$-th elements of $z^{avg}$ and $z^{max}$ are calculated as

$z^{avg}_c = F_{avg}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$, (1)

$z^{max}_c = F_{max}(u_c) = \max_{1 \le i \le H,\, 1 \le j \le W} u_c(i, j)$, (2)

where $c \in \{1, 2, \dots, C'\}$, $u_c$ denotes the $c$-th channel of the feature map $U$, and $u_c(i, j)$ refers to the pixel at position $(i, j)$ in $u_c$.

Then, to fuse the transformed feature information from global average pooling and global max pooling, an element-wise summation is used to obtain a finer global channel-wise statistic $z$:

$z = z^{avg} \oplus z^{max}$, (3)

where $\oplus$ indicates the element-wise summation operation between the channel-wise statistics $z^{avg}$ and $z^{max}$. Furthermore, in order to make use of the previously fused feature information, the global channel-wise statistic $z$ is forwarded to a gating function, which is composed of a dimensionality-reduction layer with parameters $W_1$ and reduction ratio $l$, a dimensionality-increasing layer with parameters $W_2$, a sigmoid activation function, and a ReLU activation function. The final output $s$ of the FFM is computed as

$s = \sigma\left(W_2\, \delta(W_1 z)\right)$, (4)

where $\sigma$ and $\delta$ are the sigmoid and ReLU activation functions, respectively, $W_1 \in \mathbb{R}^{(C'/l) \times C'}$, $W_2 \in \mathbb{R}^{C' \times (C'/l)}$, and $l = 16$.
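The following PyTorch sketch of the FFM follows the description above (layer and variable names are ours, not the authors'; the reduction ratio l = 16 matches the text): average- and max-pooled channel statistics are summed, squeezed and expanded by two fully connected layers, and turned into channel-wise weights by a sigmoid.

```python
import torch
import torch.nn as nn

class FeatureFilteringModule(nn.Module):
    """Channel-wise gating as described for the FFM (sketch, not the released code)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)  # W1: reduce
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)  # W2: expand

    def forward(self, u):                      # u: (N, C', H, W)
        z_avg = u.mean(dim=(2, 3))             # global average pooling -> (N, C')
        z_max = u.amax(dim=(2, 3))             # global max pooling     -> (N, C')
        z = z_avg + z_max                      # element-wise summation
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))
        return s.view(u.size(0), -1, 1, 1)     # channel weights s, broadcastable over H x W
```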

Employing a large atrous rate enlarges the model’s receptive field, so that objects can be encoded at multiple scales. As shown in Figure 5, the A-branch, B-branch, and C-branch employ three atrous convolutions with different atrous rates $r$, where $r = 1, 2, 3$, respectively. In addition, these branches use atrous convolutions with different rates instead of standard convolutions with larger kernels, which reduces the model’s parameters for the same receptive field.

For any atrous convolution layer, let $V = [v_1, v_2, \dots, v_{C'}]$ denote the learned set of convolution filters, where $v_c$ refers to the parameters of the corresponding $c$-th convolution filter. Let $X'$ denote the input of the atrous convolution layer and $U = [u_1, u_2, \dots, u_{C'}]$ its output. For the $c$-th filter at such a layer, the corresponding output feature map channel is

$u_c = v_c \ast_r X'$, (5)

where $\ast_r$ denotes an atrous convolution with filter size $K \times K$ and atrous rate $r$.
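In PyTorch, such a layer corresponds to a dilated convolution; setting the padding equal to the rate keeps the spatial size unchanged for a 3 × 3 kernel. A small helper under that assumption (our own naming):

```python
import torch.nn as nn

def atrous_conv3x3(in_ch, out_ch, rate):
    # 3x3 atrous (dilated) convolution with rate r; padding = r preserves H x W
    return nn.Conv2d(in_ch, out_ch, kernel_size=3,
                     padding=rate, dilation=rate, bias=False)
```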

In the A-branch, for the input feature map subset $X_A$ obtained from the split module, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 1$ is conducted to generate the output feature map $U_A$ of a specific scale. For the $c$-th filter at such a layer, $K = 3$ and $r = 1$ are put into equation (5) to obtain the $c$-th output feature map channel:

$u_{A,c} = v_{A,c} \ast_1 X_A$, (6)

where $\ast_1$ denotes an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 1$, $V_A = [v_{A,1}, \dots, v_{A,C'}]$ denotes the learned set of filters, $X_A \in \mathbb{R}^{H \times W \times C/3}$, and $U_A = [u_{A,1}, \dots, u_{A,C'}]$.

Further, in order to take advantage of the information aggregated in the feature filtering module (FFM), the feature map $U_A$ is sent to the FFM. The output of the FFM in the A-branch is denoted as $s_A$. The final output $Y_A$ of the A-branch is obtained by rescaling the feature map $U_A$ with an element-wise multiplication operation:

$Y_A = s_A \otimes U_A$, (7)

where $\otimes$ indicates the element-wise multiplication operation.

In the B-branch, to enhance feature propagation, we first fuse the output $Y_A$ of the A-branch and the feature map subset $X_B$ obtained from the split module by using an element-wise summation operation. Thus, the fused feature map of $Y_A$ and $X_B$ is computed as

$\tilde{X}_B = Y_A \oplus X_B$. (8)

Then, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 2$ is conducted to generate the output feature map $U_B$. For the $c$-th filter at such a layer, $K = 3$ and $r = 2$ are put into equation (5) to obtain the $c$-th output feature map channel:

$u_{B,c} = v_{B,c} \ast_2 \tilde{X}_B$, (9)

where $\ast_2$ denotes an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 2$, $V_B = [v_{B,1}, \dots, v_{B,C'}]$ denotes the learned set of filters, $\tilde{X}_B \in \mathbb{R}^{H \times W \times C/3}$, and $U_B = [u_{B,1}, \dots, u_{B,C'}]$.

Similar to the A-branch, in order to take advantage of the information aggregated in the feature filtering module (FFM), the feature map $U_B$ is sent to the FFM. The output of the FFM in the B-branch is denoted as $s_B$. The final output of the B-branch is obtained by rescaling the feature map $U_B$ with an element-wise multiplication operation, $Y_B = s_B \otimes U_B$.

For the C-branch, similar to the B-branch, we first fuse the output $Y_B$ of the B-branch and the feature map subset $X_C$ obtained from the split module by using an element-wise summation operation. Thus, the fused feature map of $Y_B$ and $X_C$ is computed as

$\tilde{X}_C = Y_B \oplus X_C$. (10)

Then, an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 3$ is conducted to generate the output feature map $U_C$. For the $c$-th filter at such a layer, $K = 3$ and $r = 3$ are put into equation (5) to obtain the $c$-th output feature map channel:

$u_{C,c} = v_{C,c} \ast_3 \tilde{X}_C$, (11)

where $\ast_3$ denotes an atrous convolution layer with filter size 3 × 3 and atrous rate $r = 3$, $V_C = [v_{C,1}, \dots, v_{C,C'}]$ denotes the learned set of filters, $\tilde{X}_C \in \mathbb{R}^{H \times W \times C/3}$, and $U_C = [u_{C,1}, \dots, u_{C,C'}]$.

The final output $Y_C$ of the C-branch is obtained by rescaling the feature map $U_C$ with an element-wise multiplication operation:

$Y_C = s_C \otimes U_C$, (12)

where $s_C$ is the output of the FFM in the C-branch.

2.3. Fusion Module

As shown in Figure 1, in order to take advantage of the feature information aggregated in the multiscale branch module (MBM), the outputs $Y_A$, $Y_B$, and $Y_C$ of the A-branch, B-branch, and C-branch are forwarded to the fusion module (FM), which is implemented by a concatenation function. The output of the FM is $Y$, which is calculated as

$Y = \mathrm{Concat}(Y_A, Y_B, Y_C)$, (13)

where $\mathrm{Concat}(\cdot)$ denotes the concatenation operation between feature maps along the channel dimension.
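Putting the pieces together, a minimal sketch of the complete MFF module is given below. It reuses the atrous_conv3x3 helper and FeatureFilteringModule sketched above; all class and attribute names are ours, and the channel count is assumed divisible by three.

```python
import torch
import torch.nn as nn

class MFFModule(nn.Module):
    """Split -> three atrous branches with FFM gating and cascaded skips -> concat."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 3                       # channels per branch (C assumed divisible by 3)
        self.conv_a = atrous_conv3x3(c, c, rate=1)
        self.conv_b = atrous_conv3x3(c, c, rate=2)
        self.conv_c = atrous_conv3x3(c, c, rate=3)
        self.ffm_a = FeatureFilteringModule(c)
        self.ffm_b = FeatureFilteringModule(c)
        self.ffm_c = FeatureFilteringModule(c)

    def forward(self, x):
        x_a, x_b, x_c = torch.chunk(x, 3, dim=1)   # split module (SM)
        u_a = self.conv_a(x_a)                     # A-branch, r = 1
        y_a = self.ffm_a(u_a) * u_a                # rescale by FFM channel weights
        u_b = self.conv_b(y_a + x_b)               # B-branch, r = 2, fused with A output
        y_b = self.ffm_b(u_b) * u_b
        u_c = self.conv_c(y_b + x_c)               # C-branch, r = 3, fused with B output
        y_c = self.ffm_c(u_c) * u_c
        return torch.cat([y_a, y_b, y_c], dim=1)   # fusion module (FM): concatenation
```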

3. Experimental Results and Analysis

In this section, we describe experiments that study the effectiveness of MFF modules for a range of tasks, datasets, and model architectures. All models are implemented using the PyTorch framework.

For image classification tasks, we evaluate all models on the CIFAR-100 and Tiny ImageNet datasets. The objects in these datasets appear at different scales, which makes them suitable for verifying the effectiveness of our proposed MFFNet for the UAV. For benchmarking, we evaluate the single-crop top-1 error rate and adopt the same data augmentation scheme used in [9, 27]. Moreover, we train the networks using stochastic gradient descent with momentum 0.9, weight decay 0.0001, and a mini-batch size of 32 on one RTX 2080Ti GPU. For the CIFAR-100 and Tiny ImageNet datasets, every model is trained for 200 epochs. We start with a learning rate of 0.1, which is divided by 10 at epochs 60, 120, and 160.
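For reference, the schedule above maps onto a standard PyTorch training loop roughly as follows; this is a sketch, and model and train_loader are placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160], gamma=0.1)
for epoch in range(200):
    for images, labels in train_loader:            # mini-batch size 32
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # divide LR by 10 at epochs 60, 120, 160
```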

For object detection tasks, all models are trained on the PASCAL VOC 2007 and MS COCO datasets with one RTX 2080Ti GPU and a mini-batch size of 2 images. We use a weight decay of 0.0001 and a momentum of 0.9. In addition, all models are trained for 80k iterations with a learning rate of 0.002 and then for 30k iterations with a learning rate of 0.0001. Other implementation details are as in [28]. Besides, in order to verify the effectiveness of our proposed method, we further test MFFNet on the UAV123 dataset, which is captured from a low-altitude aerial perspective.

3.1. Experiments on Tiny ImageNet

We evaluate our method on the Tiny ImageNet dataset, which contains 100k training images, 10k validation images, and 10k test images in 200 classes. Each class has 500 training images, 50 validation images, and 50 test images. Each input image is a 224 × 224 pixel crop randomly taken from a resized image. We use ResNet-50, ResNet-101, and ResNeXt-50 as representatives of the residual model architecture. In addition, we compare the results with those from the SENet and SKNet model architectures, which are based on attention mechanisms. We compare the single-crop top-1 error rate of each baseline and its MFFNet counterpart on the Tiny ImageNet dataset. As shown in Table 2, MFFNeXt-50 achieves significant performance gains over ResNeXt-50, with a reduction of 3.82% in the error rate. Compared with ResNet-50, MFFNet-50 is better by 1.04%. Meanwhile, SENet-29 (16c × 32d) achieves 33.67% error and MFFNeXt-50 (32c × 4d) achieves 32.76% error. MFFNeXt-50 is better than SKNet-29 (16c × 32d) by 2.27%. Besides, MFFNeXt-50 (32c × 4d) achieves a top-1 error rate of 32.59%, although SENet-29 (16c × 32d) needs 26.88% more parameters.

The top-1 testing error rate versus number of epochs for the different architectures is shown in Figure 6. SKNet-29 (16c × 32d) needs 27.45 M parameters, whereas MFFNeXt-50 (32c × 4d) needs only 25.43 M trainable parameters and achieves a higher accuracy. The results show that MFF modules consistently improve the performance of state-of-the-art CNNs.

3.2. Experiments on CIFAR-100

To further evaluate the performance of MFFNet, we conduct experiments on CIFAR-100. This dataset consists of 60k 32 × 32 color images drawn from 100 classes, with 50k training images and 10k testing images. The 100 classes in CIFAR-100 are grouped into 20 superclasses, and each image has a fine label and a coarse label. We use implementations of ShuffleNetV2 1×, MobileNetV2 1×, ResNet-50, ResNeXt-50, ResNet-101, and DenseNet-BC-121 (k = 12) as the representative models. Following the notation of ShuffleNet [3], “ShuffleNetV2 1×” and “MobileNetV2 1×” denote models whose numbers of filters are scaled by a factor of 1.

Table 3 shows more results of single-crop testing on CIFAR-100. Note that while ResNet-50 achieves a 21.55% error rate, MFFNet-50 achieves a 21.06% error rate. Moreover, MFFNeXt-50 (32c × 4d) outperforms ResNeXt-50 (32c × 4d), achieving a 20.03% top-1 error. For lightweight models, we compare ShuffleNetV2 1× with MFF and MobileNetV2 1× with MFF against the original ShuffleNetV2 1× and MobileNetV2 1×; the MFF variants outperform the original models by 0.96% and 0.94%, respectively.

In addition, for densely connected models, we choose a DenseNet-BC network with 121 layers. DenseNet-BC-121 (k = 12) with MFF achieves a performance gain of 1.29% over DenseNet-BC-121 (k = 12). The top-1 testing error rate versus number of epochs for the different architectures is shown in Figure 7. We can clearly see that MFFNeXt-29 (2c × 64d) outperforms ResNet-101 by achieving 19.78% top-1 error, although ResNet-101 needs more parameters.

3.3. Ablation Studies on CIFAR-100

To further validate the effectiveness of MFFNet, we undertake ablation studies on the CIFAR-100 dataset. We first evaluate the trade-off between cardinality c and bottleneck width d. Next, within the MFF module, we investigate the impact of changes in complexity on performance by combining different atrous rates r.

3.3.1. Cardinality versus Width

To study the effects of the cardinality c and the bottleneck width d, we start from the three-branch case and fix the atrous rates to r = 1, 2, and 3. We first evaluate the trade-off between cardinality c and bottleneck width d; Table 4 shows the results. MFFNeXt-29 (2c × 64d) has a top-1 error of 19.78%, which is 2.78% lower than that of MFFNeXt-29 (1c × 64d). We can see that as the cardinality c increases from 1 to 4 at constant bottleneck width, the error rate falls. In addition, as the bottleneck width d increases from 24 to 64 at constant cardinality c, the error rate again decreases.

We also note that increasing cardinality c can achieve much better results than going wider. For instance, MFFNeXt-29 (2c × 40d) performs better than MFFNeXt-29 (1c × 64d), even though it has 66.33% fewer parameters. MFFNeXt-29 (2c × 64d) needs 9.37 M parameters, whereas MFFNeXt-29 (4c × 40d) needs only 9.21 M trainable parameters and achieves a higher accuracy.
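As a concrete illustration of this notation (ours, for clarity), cardinality c maps onto the groups argument of a grouped convolution and the bottleneck width d onto the channels per group, so a "2c × 64d" setting uses 2 × 64 = 128 channels split into 2 groups:

```python
import torch.nn as nn

c, d = 2, 64                                    # cardinality and bottleneck width
grouped = nn.Conv2d(c * d, c * d, kernel_size=3,
                    padding=1, groups=c, bias=False)
# The parameter count of this layer is (c*d)^2 * 9 / c, so raising c at fixed d
# reduces parameters, which is why higher cardinality can outperform wider layers
# at a similar parameter budget.
```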

3.3.2. Combinations of Different Atrous Rates

Next, we investigate combinations of different atrous rates. The atrous rate r controls the receptive field size. MFFNet uses 3 × 3 filters with different atrous rates r. To limit the search space, we use only four different atrous rates, r = 1, 2, 3, and 4. To study their effects, we vary the rates of the other branches while keeping a 3 × 3 filter with r = 1 in the first branch of the MFF modules. Tables 5 and 6 show the top-1 error rates for MFFNeXt-29 (2c × 64d) and MobileNetV2 1× with MFF. We make three major observations:
(1) First, when the number of branches in an MFF module is b = 2, the top-1 error rate of MFFNeXt-29 (2c × 64d) gradually decreases as the atrous rate in the second branch increases, so it achieves its lowest top-1 error with the largest second-branch rate. In contrast, the top-1 error rate of MobileNetV2 1× with MFF gradually increases as the atrous rate in the second branch increases.
(2) Second, when the number of branches in an MFF module is b = 3, MFFNeXt-29 (2c × 64d) and MobileNetV2 1× with MFF achieve their lowest top-1 error rates with different rate combinations (Tables 5 and 6).
(3) Third, when the number of branches in an MFF module is b = 4, neither MFFNeXt-29 (2c × 64d) nor MobileNetV2 1× with MFF achieves its lowest top-1 error rate. For example, MFFNeXt-29 (2c × 64d) with b = 3 achieves higher accuracy, although MFFNeXt-29 (2c × 64d) with b = 4 needs 22.63% more parameters. Likewise, the best three-branch setting of MobileNetV2 1× with MFF outperforms its four-branch counterpart by more than 0.91% in accuracy.

3.3.3. Class Activation Mapping

To intuitively understand the multiscale representation ability of MFFNet, we visualize the class activation mapping (CAM) using Grad-CAM for different networks. Grad-CAM uses gradients to calculate the importance of the spatial locations in convolutional layers.
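A compact Grad-CAM sketch is given below (our own hook-based implementation, not the visualization code used for Figure 8); it weights a target layer's activations by the spatially averaged gradients of the class score.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return a (1, 1, H, W) heat map for `image` of shape (1, 3, H, W)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads['v'].mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * acts['v']).sum(dim=1, keepdim=True))  # weighted activation map
    cam = cam / (cam.max() + 1e-8)                                # normalize to [0, 1]
    return F.interpolate(cam, size=image.shape[-2:],
                         mode='bilinear', align_corners=False)
```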

Figure 8 compares the CAM for representative backbone networks. The areas that have a larger impact on the classification are covered with lighter colors. We can clearly see that the MFFNet-based CAM results tend to focus on the whole object more than ResNet.

3.3.4. Object Detection

The PASCAL VOC 2007 and MS COCO datasets contain 20 and 80 object categories, respectively. The PASCAL VOC 2007 dataset has about 5k trainval images and 5k test images; we use the trainval images for training and the test images for validation. The MS COCO dataset has 80k images for training, 40k for validation, and 20k for testing; we use the 80k training set plus a 35k validation subset for training and a 5k validation subset for validation. We adopt Faster R-CNN [28] as our detection method and evaluate the average precision (AP) on PASCAL VOC 2007 and MS COCO. Moreover, ResNet-101 and MFFNet-101 are used as the backbone networks.
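The sketch below shows, under our assumptions (a hypothetical build_mffnet101_features() that returns a single feature map and exposes out_channels), how a custom backbone can be plugged into torchvision's Faster R-CNN; it is illustrative rather than the exact training setup used here.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

backbone = build_mffnet101_features()           # hypothetical MFFNet-101 feature extractor
backbone.out_channels = 2048                    # FasterRCNN requires this attribute

anchor_gen = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                             aspect_ratios=((0.5, 1.0, 2.0),))
roi_pool = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                              output_size=7, sampling_ratio=2)
detector = FasterRCNN(backbone, num_classes=21,  # 20 VOC categories + background
                      rpn_anchor_generator=anchor_gen,
                      box_roi_pool=roi_pool)
```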

On the PASCAL VOC 2007 dataset, MFFNet-101 outperforms ResNet-101 by 1.4% in AP. On the MS COCO dataset, we improve on ResNet-101 by 1.3%. Table 7 shows that MFFNet-101 has slightly longer inference latency than ResNet-101 but is more accurate. For instance, on the PASCAL VOC 2007 dataset, we improve the ResNet-101 baseline by 1.4% AP at the cost of only 1.6 ms of additional inference latency. On the MS COCO dataset, MFFNet-101 has an AP of 26.9%, which is 1.3% higher than the ResNet-101 baseline of 25.6%, for only 3.2 ms of additional inference latency. These results demonstrate the general performance improvement obtained by using MFF modules in object detection. Figure 9 shows detection examples generated with our proposed MFFNet-101 as the backbone network on the UAV123 dataset. It can be seen that our method is able to detect target objects successfully regardless of their shapes, sizes, orientations, and appearances.

4. Conclusions

To address the multiscale recognition problem in UAV visual perception, this paper establishes a new convolutional network architecture, MFFNet. In MFFNet, the MFF module is designed by employing multiple atrous convolutions at different rates with feature-selective learning ability. The MFF module is implemented via three submodules: the split module (SM), the multiscale branch module (MBM), and the fusion module (FM). In addition, the MFF module can selectively generate channel-wise feature responses by emphasizing channel-wise dependencies. We further explore the effect of the atrous rate on the multiscale representation ability of CNNs. Image classification results on the CIFAR-100 and Tiny ImageNet datasets demonstrate that our proposed method achieves very competitive results on various benchmarks. Grad-CAM visualization results show that the MFFNet-based CAM results tend to focus on the whole object more than other baseline networks; that is, MFFNet has a stronger multiscale representation ability, which yields better recognition accuracy for the UAV. Experimental results on the PASCAL VOC 2007, MS COCO, and UAV123 datasets show that our proposed method achieves consistent performance gains in object detection, which is beneficial for expanding the applications of UAVs. We will further explore the effect of multiscale representation on image recognition results in future work.

Data Availability

The detailed mechanism model and model parameters of MFFNet are given in the article. The results are computed on the PyCharm software with the model and given parameters, while the relevant results are also given in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Major Science and Technology Projects of China (Grant no. 2019ZX04026001).