Abstract
Crowd counting plays a significant role in crowd monitoring and management, yet it suffers from various challenges, especially crowd-scale variations and background interference. We therefore propose a method named depth and edge auxiliary learning for still image crowd density estimation, which copes with crowd-scale variations and background interference simultaneously. The proposed multi-task framework contains three sub-tasks: crowd head edge regression, crowd density map regression and relative depth map regression. The crowd head edge regression task outputs distinctive crowd head edge features that distinguish the crowd from complex backgrounds. The relative depth map regression task perceives crowd-scale variations and outputs multi-scale crowd features. Moreover, we design an efficient fusion strategy that fuses this information so that the crowd density map regression task generates high-quality crowd density maps. Experiments were conducted on four mainstream datasets to verify the effectiveness and portability of our method. The results indicate that it achieves competitive performance compared with other leading approaches and improves the counting accuracy of the baseline network by 15.6%.
References
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Liu J, Gao C, Meng D, et al. (2018) DecideNet: counting varying density crowds through attention guided detection and density estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 5197–5206
Idrees H, Tayyab M, Athrey K, et al. (2018) Composition loss for counting, density map estimation and localization in dense crowds. In: European Conference on Computer Vision, pp 532–546
Zhang A, Shen J, Xiao Z, et al. (2019) Relational attention network for crowd counting. In: IEEE International Conference on Computer Vision, pp 6788–6797
Liu W, Salzmann M, Fua P. (2019) Context-aware crowd counting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 5099–5108
Zhang Y, Zhou D, Chen S, et al. (2016) Single-image crowd counting via multi-column convolutional neural network. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 589–597
Li Y, Zhang X, Chen D (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1091–1100
Wang L, Yin B, Tang X, et al. (2019) Removing background interference for crowd counting via de-background detail convolutional network. Neurocomputing 332:360–371
Idrees H, Saleemi I, Seibert C, et al. (2013) Multi-source multi-scale counting in extremely dense crowd images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2547–2554
Zhao M, Zhang J, Zhang C, et al. (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 12736–12745
Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations
Shi M, Yang Z, Xu C, et al. (2019) Revisiting perspective information for efficient crowd counting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10
Dalal N, Triggs B. (2005) Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 886–893
Leibe B, Seemann E, Schiele B (2005) Pedestrian detection in crowded scenes. IEEE Conf Comput Vision Pattern Recognit 1:878–885
Tuzel O, Porikli F, Meer P (2008) Pedestrian detection via classification on riemannian manifolds. IEEE Trans Pattern Anal Mach Intell 30(10):1713–1727
Viola P, Jones M (2004) Robust real-time face detection. Int J Comput Vision 57(2):137–154
Wu B, Nevatia R (2005) Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In: International Conference on Computer Vision, pp 90–97
Sabzmeydani P, Mori G (2007) Detecting pedestrians by learning shapelet features. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–8
Davies AC, Yin JH, Velastin SA, et al. (1995) Crowd monitoring using image processing. Electron Commun Eng J, pp 37–47
Lempitsky V, Zisserman A (2010) Learning to count objects in images. In: Advances in Neural Information Processing Systems, pp 1324–1332
Rabaud V, Belongie S. (2006) Counting crowded moving objects. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 705–711
Brostow G J, Cipolla R. (2006) Unsupervised Bayesian detection of independent motion in crowds. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 594–601
Zhang L, Shi M, Chen Q, et al. (2018) Crowd counting via scale-adaptive convolutional neural network. In: Workshop on Applications of Computer Vision, pp 1113–1121
Cao X, Wang Z, Zhao Y, et al. (2018) Scale aggregation network for accurate and efficient crowd counting. In: European Conference on Computer Vision, pp 757–773
Zhang Q, Chan AB (2019) Wide-area crowd counting via ground-plane density maps and multi-view fusion CNNs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 8297–8306
Xu C, Qiu K, Fu J, et al. (2019) Learn to scale: generating multipolar normalized density maps for crowd counting. In: International Conference on Computer Vision, pp 8382–8390
Liu N, Long Y, Zou C, et al. (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3225–3234
Gao J, Wang Q, Yuan Y, et al. (2019) SCAR: spatial-/channel-wise attention regression networks for crowd counting. Neurocomputing, pp 1–8
Jiang X, Zhang L, Zhang T, et al. (2020) Density-aware multi-task learning for crowd counting. IEEE Transactions on Multimedia, pp 1–1
Sandwell DT (1987) Biharmonic spline interpolation of Geos-3 and Seasat altimeter data. Geophys Res Lett 14(2):139–142
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Sindagi V A, Patel V M. (2017) CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In: Advanced Video and Signal Based Surveillance, pp 1–6
Shi Z, Zhang L, Liu Y, et al. (2018) Crowd counting with deep negative correlation learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 5382–5390
Sam D B, Sajjan N N, Babu R V, et al. (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing CNN. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3618–3626
Wang Q, Gao J, Lin W, et al. (2019) Learning from synthetic data for crowd counting in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 8198–8207
Shi Z, Mettes P, Snoek C G, et al. (2019) Counting with focus for free. In: International Conference on Computer Vision, pp 4200–4209
Oh M, Olsen P A, Ramamurthy K N, et al. (2020) Crowd counting with decomposed uncertainty. In: National Conference on Artificial Intelligence
Sam DB, Peri SV, Sundararaman MN, et al. (2020) Locate, size and count: accurately resolving people in dense crowds via detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp 1–1
Yan Z, Yuan Y, Zuo W, et al. (2019) Perspective-guided convolution networks for crowd counting. In: International Conference on Computer Vision, pp 952–961
Chen X, Bin Y, Sang N, et al. (2019) Scale pyramid network for crowd counting. In: Workshop on Applications of Computer Vision, pp 1941–1950
Sindagi VA, Yasarla R, Babu DS, et al. (2020) Learning to count in the crowd from limited labeled data. In: European Conference on Computer Vision
Liu Y, Liu L, Wang P, et al. (2020) Semi-supervised crowd counting via self-training on surrogate tasks. In: European Conference on Computer Vision
Sam D B, Surya S, Babu R V, et al. (2017) Switching convolutional neural network for crowd counting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 4031–4039
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp 1097–1105
Khan N, Ullah A, Haq I U, et al. (2020) SD-Net: understanding overcrowded scenes in real-time via an efficient dilated convolutional neural network. J Real-Time Image Process, pp 1–15
Acknowledgements
This work is supported by the Equipment Pre-Research Foundation of China under grant No.61403120201.
Supplementary Material
1.1 Network architecture
To validate the difference from transfer-learned models, we design a new CNN model from scratch, named FENet, as shown in Fig. 15. FENet contains multiple FEM modules that extract multi-scale crowd features; each FEM module consists of several parallel columns of ordinary convolution and dilated convolution with different dilation rates. We adopt a pyramid fusion method to separately integrate the features output by each FEM module. The proposed relative depth map regression task and the crowd head edge regression task then supervise the network to learn multi-scale crowd features and crowd head edge features, respectively. Finally, we concatenate the crowd head edge features and the multi-scale crowd features to generate a high-quality crowd density map.
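To illustrate the idea behind an FEM-style module, the following is a minimal pure-Python sketch: several parallel "columns" apply the same kernel at different dilation rates (widening the receptive field without adding parameters), and their outputs are concatenated as a stand-in for channel-wise fusion. All names (`dilated_conv1d`, `fem_module`) and the 1-D setting are illustrative assumptions; the paper's FENet operates on 2-D feature maps with learned weights.

```python
# Hypothetical 1-D sketch of a multi-column dilated-convolution module.
# Not the paper's implementation; for intuition only.

def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D convolution with the given dilation rate."""
    span = (len(kernel) - 1) * dilation  # receptive field minus one
    out = []
    for i in range(len(signal) - span):
        acc = sum(kernel[j] * signal[i + j * dilation]
                  for j in range(len(kernel)))
        out.append(acc)
    return out

def fem_module(signal, kernel, dilations=(1, 2, 3)):
    """Run parallel columns with different dilation rates, then concatenate."""
    columns = [dilated_conv1d(signal, kernel, d) for d in dilations]
    fused = []
    for col in columns:
        fused.extend(col)  # crude analogue of channel-wise concatenation
    return columns, fused

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.5]  # simple averaging kernel
columns, fused = fem_module(signal, kernel)
print(columns[0])  # dilation 1 averages adjacent samples: [1.5, 2.5, 3.5, 4.5, 5.5]
print(columns[1])  # dilation 2 averages samples two apart: [2.0, 3.0, 4.0, 5.0]
```

Larger dilation rates see a wider context from the same two-tap kernel, which is why mixing dilation rates across columns captures multiple crowd scales.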
1.2 Experiment results
We conducted various experiments to evaluate the performance of the designed FENet on four public datasets: ShanghaiTech Part_A (Part_A), ShanghaiTech Part_B (Part_B), UCF_CC_50 and UCF-QNRF. In Table 9, “FENet(DEAL)” denotes FENet with our multi-task method applied, and “FENet(W/O)” denotes FENet with the multi-task method removed. Compared with “FENet(W/O)”, the MAE of “FENet(DEAL)” is reduced by 8.3, 1.3, 23.32 and 16.9 on Part_A, Part_B, UCF_CC_50 and UCF-QNRF, respectively. We also report the performance of the VGG-based models, including “Ours(DEAL)” as shown in Fig. 4 and “Ours(W/O)”, which removes our multi-task method. As shown in Table 9, the MAE of “Ours(DEAL)” is lower than that of “Ours(W/O)” on all four datasets. Moreover, a comparison of computational complexity shows that “FENet(DEAL)” requires fewer FLOPs than “Ours(DEAL)”.
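For reference, the MAE figures above can be computed from per-image counts as follows. This is a minimal sketch of the standard crowd-counting metrics (MAE and root mean squared error); the counts in the example are made up for illustration and are not taken from Table 9.

```python
# Standard counting metrics over per-image predicted and ground-truth counts.

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

def rmse(pred_counts, gt_counts):
    """Root mean squared error between predicted and ground-truth counts."""
    return (sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
            / len(gt_counts)) ** 0.5

# Hypothetical counts for three test images.
gt = [120.0, 45.0, 300.0]
pred = [110.0, 50.0, 320.0]
print(mae(pred, gt))   # (10 + 5 + 20) / 3 ≈ 11.67
print(rmse(pred, gt))  # sqrt((100 + 25 + 400) / 3) ≈ 13.23
```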
In summary, the proposed multi-task method is effective on both FENet and the VGG-based model [12]. FENet has lower computational complexity than the VGG-based model [12], but its counting performance is not as good. To achieve accurate crowd counting results, we therefore use the higher-accuracy VGG-based model in this paper, as shown in Fig. 4.
Cite this article
Peng, S., Yin, B., Hao, X. et al. Depth and edge auxiliary learning for still image crowd density estimation. Pattern Anal Applic 24, 1777–1792 (2021). https://doi.org/10.1007/s10044-021-01017-4