Robust facial landmark detection by cross-order cross-semantic deep network
Introduction
Facial landmark detection, also known as face alignment, is the task of locating fiducial facial landmarks (eye corners, nose tip, etc.) in a face image, which supports geometric image normalization and feature extraction. It has become an indispensable part of facial analysis tasks such as face recognition (Moghadam & Seyyedsalehi, 2018), face verification (Xiong et al., 2020) and human–computer interaction (Liu et al., 2020, Zhang et al., 2020, Zheng et al., 2020). Recently, CNN-based methods have become one of the mainstream approaches to facial landmark detection and achieve considerable performance on frontal faces. However, under large pose variations, heavy occlusions and complicated illumination, CNN-based methods still cannot detect landmarks accurately.
The convolutional units in various layers of CNNs can attend to parts of interest, i.e., behave as object detectors and landmark region detectors without any label information. Thus, CNN-based facial landmark detection methods (Chandran et al., 2020, Dong et al., 2018, Kumar et al., 2020, Liu et al., 2019, Wan et al., 2018, Wu et al., 2018, Zhang et al., 2014, Zhu, Shi, Zheng and Sadiq, 2019) are more robust to variations in facial pose, expression and occlusion. However, most CNN-based facial landmark detection methods do not attempt to activate multiple correlated facial parts and learn different semantic features from them. As a result, they cannot accurately model the differences between these correlated facial parts or the relationships among the local details within them, i.e., they cannot fully explore more discriminative and fine semantic features, and their performance degrades under extremely large poses and heavy occlusions. For instance, the coordinate regression methods (Wu et al., 2018, Zhang et al., 2014, Zhu, Shi, Zheng and Sadiq, 2019) learn features from the whole face image and then regress the landmark coordinates, which drives the models to learn holistic facial features in a generic way that cannot accurately model the differences of local details and the relationships among them. Similarly, the heatmap regression methods (Chandran et al., 2020, Dong et al., 2018, Kumar et al., 2020, Liu et al., 2019) generate a heatmap for each landmark and then predict landmarks by traversing the corresponding landmark heatmaps.
The region (for example, the mouth area) near the landmark largely determines the location of the predicted landmark, and the information in other areas (eyes, eyebrows and forehead) is not effectively encoded, even though deeper network structures are used to learn features with larger receptive fields and capture global facial constraints. Hence, as shown in Fig. 1, the above methods are not robust enough against large poses and partial occlusions. Furthermore, recent works have shown that second-order statistics (Dai et al., 2019, Gao et al., 2018, Wang et al., 2018), part-specific semantic features (Cai et al., 2016, Luo et al., 2019) and feature selection methods (Gao et al., 2020, Li and Tang, 2015, Li et al., 2020) can help obtain more discriminative and robust features and are beneficial to many computer vision tasks. However, how to introduce second-order statistics to activate parts of interest and then learn multiple more discriminative and fine attention-specific (part-based) semantic features for robust facial landmark detection remains an open question.
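To make the heatmap regression pipeline discussed above concrete, the following minimal sketch decodes landmark coordinates by scanning each heatmap for its peak. It is an illustration only: the helper names `decode_heatmaps` and `gaussian_heatmap`, the 64×64 resolution and the Gaussian targets are assumptions, not details taken from the cited methods.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return a (K, 2) array of (x, y) peak locations from (K, H, W) heatmaps."""
    k, h, w = heatmaps.shape
    # Flatten each heatmap and take the index of its maximum response.
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)

def gaussian_heatmap(h, w, cx, cy, sigma=1.5):
    """Hypothetical training target: a 2D Gaussian centred on the landmark."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Two synthetic landmarks at (x=20, y=30) and (x=45, y=12).
heatmaps = np.stack([
    gaussian_heatmap(64, 64, 20, 30),
    gaussian_heatmap(64, 64, 45, 12),
])
landmarks = decode_heatmaps(heatmaps)
```

Because each coordinate depends only on the peak of its own heatmap, the decoded position is driven by the response near the landmark, which is the locality limitation the text describes.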
To address the above problems, we propose a cross-order cross-semantic deep network (CCDN) to activate more correlated facial parts and learn more discriminative and fine cross-order cross-semantic features from them for more robust facial landmark detection. The overall architecture of the proposed CCDN is shown in Fig. 2. Specifically, we first propose a cross-order two-squeeze multi-excitation (CTM) module to generate multiple more discriminative attention-specific feature maps for activating more correlated facial parts. In the proposed CTM module, cross-order channel correlations are introduced to selectively emphasize informative features and suppress less useful ones by considering both first-order and second-order statistics, thereby performing more effective feature re-calibration and generating more effective attention-specific feature maps. Then, a cross-order cross-semantic (COCS) regularizer is developed to guide the feature maps from different excitation blocks to represent different semantic meanings (i.e., activate different correlated facial parts) by maximizing the correlations between features from the same excitation block while de-correlating those from different excitation blocks. Finally, by integrating the CTM module and the COCS regularizer via the proposed CCDN, finer and complementary cross-order cross-semantic features can be learned for more robust facial landmark detection. Experimental results on benchmark datasets show that our approach obtains better robustness and higher accuracy than other state-of-the-art facial landmark detection methods.
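The idea of a two-squeeze, multi-excitation channel re-calibration can be sketched as follows. This is a hedged illustration under stated assumptions, not the CTM module as defined in the paper: the second-order squeeze is taken to be the row-wise mean of the channel covariance matrix, each excitation block is a hypothetical two-layer gate with random weights, and the names `two_squeeze` and `multi_excitation` are introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_squeeze(x):
    """Squeeze a (C, H, W) feature map into a (2C,) cross-order descriptor."""
    c = x.shape[0]
    flat = x.reshape(c, -1)                        # (C, H*W)
    first = flat.mean(axis=1)                      # first-order: global average pooling
    centered = flat - first[:, None]
    cov = centered @ centered.T / flat.shape[1]    # (C, C) channel covariance
    second = cov.mean(axis=1)                      # assumed second-order summary per channel
    return np.concatenate([first, second])

def multi_excitation(x, weights):
    """Apply one gating branch per (w1, w2) pair; returns a list of re-calibrated maps."""
    s = two_squeeze(x)
    outputs = []
    for w1, w2 in weights:
        hidden = np.maximum(w1 @ s, 0.0)           # ReLU bottleneck
        gate = sigmoid(w2 @ hidden)                # (C,) channel weights in (0, 1)
        outputs.append(x * gate[:, None, None])    # channel-wise re-calibration
    return outputs

rng = np.random.default_rng(0)
C, H, W, P, R = 8, 16, 16, 3, 4                    # channels, size, branches, bottleneck width
x = rng.standard_normal((C, H, W))
weights = [(rng.standard_normal((R, 2 * C)), rng.standard_normal((C, R))) for _ in range(P)]
maps = multi_excitation(x, weights)                # P attention-specific feature maps
```

Each branch sees the same cross-order descriptor but applies its own gate, so the branches can emphasize different channels; in a trained network the weights would be learned rather than random.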
The main contributions of this work are summarized as follows:
(1) With the well-designed CTM module, cross-order channel correlations can be introduced to perform more effective feature re-calibration and generate multiple more discriminative attention-specific feature maps, which helps learn more powerful cross-order cross-semantic features for robust facial landmark detection.
(2) A COCS regularizer is designed to drive the network to learn the cross-order cross-semantic features from different excitation blocks. By exploring finer and complementary semantic features, our method is able to enhance the robustness of facial landmark detection when facing large poses and heavy occlusions.
(3) To the best of our knowledge, this is the first study to explore cross-order cross-semantic features for handling facial landmark detection under challenging scenarios. By integrating the CTM module and the COCS regularizer via a novel CCDN with a seamless formulation, our algorithm outperforms state-of-the-art methods on benchmark datasets such as COFW (Burgosartizzu, Perona, & Dollar, 2013), 300W (Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, & Pantic, 2016), AFLW (Zhu, Li, Loy, & Tang, 2016b) and WFLW (Wu et al., 2018).
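The behavior asked of the regularizer in contribution (2), high correlation within an excitation block and low correlation across blocks, can be illustrated with a small penalty function. This is a sketch under assumptions and not the paper's COCS formulation: correlation is measured here as the mean squared cosine similarity between flattened channels, and the names `block_correlation` and `cocs_like_penalty` are hypothetical.

```python
import numpy as np

def block_correlation(blocks):
    """blocks: list of P arrays of shape (C, H, W). Returns a (P, P) matrix whose
    (i, j) entry is the mean squared cosine similarity between channels of i and j."""
    feats = []
    for b in blocks:
        f = b.reshape(b.shape[0], -1)
        f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)  # l2-normalize channels
        feats.append(f)
    p = len(feats)
    s = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            s[i, j] = np.mean((feats[i] @ feats[j].T) ** 2)
    return s

def cocs_like_penalty(blocks):
    """Lower is better: rewards within-block correlation, penalizes cross-block correlation."""
    s = block_correlation(blocks)
    within = np.trace(s)
    across = s.sum() - within
    return across - within

# Two 2x2 activation patterns that respond at different spatial positions.
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0]])
diverse = cocs_like_penalty([np.stack([a, a]), np.stack([b, b])])    # blocks differ
collapsed = cocs_like_penalty([np.stack([a, a]), np.stack([a, a])])  # blocks identical
```

In this toy case `diverse` is lower than `collapsed`, so minimizing the penalty pushes different blocks toward different facial regions while keeping each block internally consistent.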
The rest of the paper is organized as follows: Section II gives an overview of the related work. Section III shows the proposed method, including the CTM module and the COCS regularizer. A series of experiments are conducted to evaluate the performance of the proposed CCDN in Section IV. Finally, Section V concludes the paper. The symbols and their meanings are listed in Table 1.
Related work
During the past decades, rapid development has been made on facial landmark detection. Generally, most existing facial landmark detection methods can be divided into three groups: model-based methods, coordinate regression methods and heatmap regression methods.
Model-based methods. Model-based methods learn parametric models (shape model (Cootes, Taylor, Cooper, & Graham, 1995), appearance model (Cootes, Edwards, & Taylor, 2001) or part model (Cristinacce & Cootes, 2006)) from labeled datasets
Robust facial landmark detection by cross-order cross-semantic deep network
In this section, we first elaborate on the proposed CTM module and then present the COCS regularizer. Finally, we illustrate the proposed CCDN.
Experiments
In this section, we first introduce the evaluation settings, including the datasets and the methods for comparison. Then, we compare our algorithm with state-of-the-art facial landmark detection methods on challenging benchmark datasets such as COFW (Burgosartizzu et al., 2013), 300W (Sagonas et al., 2016), AFLW (Zhu et al., 2016b) and WFLW (Wu et al., 2018).
Conclusion
Unconstrained facial landmark detection is still a very challenging topic due to large poses and partial occlusions. In this work, we present a cross-order cross-semantic deep network to address facial landmark detection under extremely large poses and heavy occlusions. By fusing the CTM module and the COCS regularizer with a seamless formulation, our CCDN is able to achieve more robust facial landmark detection. It is shown that the CTM module can effectively activate parts of interest and
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant Nos. 62002233, 62076164, 61802267, 61976145, 61732011 and 61806127), the Natural Science Foundation of Guangdong Province, China (Grant Nos. 2019A1515111121, 2018A030310451 and 2018A030310450), the Shenzhen Municipal Science and Technology Innovation Council, China (Grant Nos. JCYJ20180305124834854 and JCYJ20190813100801664) and the China Postdoctoral Science Foundation (Grant No. 2020M672802).
References (59)
- et al. (1995). Active shape models - their training and application. Computer Vision and Image Understanding.
- et al. (2018). Nonlinear analysis and synthesis of video images using deep dynamic bottleneck neural networks for face recognition. Neural Networks: The Official Journal of the International Neural Network Society.
- et al. (2016). 300 faces in-the-wild challenge: database and results. Image and Vision Computing.
- et al. (2019). Face alignment by component adaptive mechanism. Neurocomputing.
- et al. (2020). Robust face alignment by cascaded regression and de-occlusion. Neural Networks.
- et al. (2020). Discriminative deep multi-task learning for facial expression recognition. Information Sciences.
- et al. (2019). Branched convolutional neural networks incorporated with Jacobian deep regression for facial landmark detection. Neural Networks: The Official Journal of the International Neural Network Society.
- Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2011). Localizing parts of faces using a consensus of...
- Burgosartizzu, X. P., Perona, P., & Dollar, P. (2013). Robust face landmark estimation under occlusion. In IEEE...
- et al. A unified multi-scale deep convolutional neural network for fast object detection.
- Face alignment by explicit shape regression. International Journal of Computer Vision.
- Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Three-way decision with co-training for partially labeled data. Information Sciences.
- LUVLi face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood.
- Interactive facial feature localization.
- Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Transactions on Image Processing.
- Weakly-supervised semantic guided hashing for social image retrieval. International Journal of Computer Vision.
- Region-based context enhanced network for robust multiple face alignment. IEEE Transactions on Multimedia.
- Facial expression recognition via deep action units graph network based on psychological mechanism. IEEE Transactions on Cognitive and Developmental Systems.