Robust facial landmark detection by cross-order cross-semantic deep network

doi:10.1016/j.neunet.2020.11.001

Neural Networks

Volume 136, April 2021, Pages 233-243

https://doi.org/10.1016/j.neunet.2020.11.001

Highlights

• A CTM module is designed to generate attention-specific feature maps.
• A COCS regularizer is presented to learn cross-order cross-semantic features.
• A CCDN is proposed to handle facial landmark detection in challenging scenes.

Abstract

Recently, convolutional neural network (CNN)-based facial landmark detection methods have achieved great success. However, most existing CNN-based methods do not attempt to activate multiple correlated facial parts and learn different semantic features from them. As a result, they cannot accurately model the relationships among local details or fully exploit more discriminative, fine-grained semantic features, and they therefore suffer under partial occlusions and large pose variations. To address these problems, we propose a cross-order cross-semantic deep network (CCDN) to boost semantic feature learning for robust facial landmark detection. Specifically, a cross-order two-squeeze multi-excitation (CTM) module is proposed to introduce cross-order channel correlations for learning more discriminative representations and activating multiple attention-specific parts. Moreover, a novel cross-order cross-semantic (COCS) regularizer is designed to drive the network to learn cross-order cross-semantic features from different activations for facial landmark detection. By integrating the CTM module and the COCS regularizer, the proposed CCDN can effectively activate and learn finer, complementary cross-order cross-semantic features, improving the accuracy of facial landmark detection under extremely challenging scenarios. Experimental results on challenging benchmark datasets demonstrate the superiority of our CCDN over state-of-the-art facial landmark detection methods.

Introduction

Facial landmark detection, also known as face alignment, is the task of locating fiducial facial landmarks (eye corners, nose tip, etc.) in a face image, which helps achieve geometric image normalization and feature extraction. It has become an indispensable part of facial analysis tasks such as face recognition (Moghadam & Seyyedsalehi, 2018), face verification (Xiong et al., 2020) and human–computer interaction (Liu et al., 2020, Zhang et al., 2020, Zheng et al., 2020). Recently, CNN-based methods have become one of the mainstream approaches to facial landmark detection and achieve considerable performance on frontal faces. However, under large pose variations, heavy occlusions and complicated illumination, CNN-based methods still cannot detect landmarks accurately.

The convolutional units in various layers of CNNs can attend to parts of interest, i.e., behave as object detectors and landmark region detectors, without any label information. Thus, CNN-based facial landmark detection methods (Chandran et al., 2020, Dong et al., 2018, Kumar et al., 2020, Liu et al., 2019, Wan et al., 2018, Wu et al., 2018, Zhang et al., 2014, Zhu, Shi, Zheng and Sadiq, 2019) are more robust to variations in facial pose, expression and occlusion. However, most of these methods do not attempt to activate multiple correlated facial parts and learn different semantic features from them. Consequently, they cannot accurately model the differences between these correlated facial parts or the relationships among the local details within them, i.e., they cannot fully exploit more discriminative and fine-grained semantic features, and their performance degrades under extremely large poses and heavy occlusions. For instance, coordinate regression methods (Wu et al., 2018, Zhang et al., 2014, Zhu, Shi, Zheng and Sadiq, 2019) learn features from the whole face image and then regress the landmark coordinates, which drives the models to learn whole-face features in a generic way that cannot accurately model the differences and relationships among local details. Likewise, heatmap regression methods (Chandran et al., 2020, Dong et al., 2018, Kumar et al., 2020, Liu et al., 2019) generate a heatmap for each landmark and then predict the landmarks by traversing the corresponding heatmaps. The region near a landmark (for example, the mouth area) largely determines the predicted location, and the information from other areas (eyes, eyebrows and forehead) is not effectively encoded, even though deeper network structures are used to learn features with larger receptive fields and capture global facial constraints. Hence, as shown in Fig. 1, the above methods are not robust enough against large poses and partial occlusions. Furthermore, recent works have shown that second-order statistics (Dai et al., 2019, Gao et al., 2018, Wang et al., 2018), part-specific semantic features (Cai et al., 2016, Luo et al., 2019) and feature selection methods (Gao et al., 2020, Li and Tang, 2015, Li et al., 2020) can help obtain more discriminative and robust features and benefit many computer vision tasks. However, how to introduce second-order statistics to activate parts of interest and then learn multiple, more discriminative and fine-grained attention-specific (part-based) semantic features for robust facial landmark detection remains an open question.
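As an illustration of the heatmap regression decoding step described above, the following minimal sketch reads each landmark off as the peak of its heatmap. It is only an assumption about the simplest possible decoder: the function name `decode_heatmaps`, the array layout `(K, H, W)` and the toy values are hypothetical, and the cited methods may use more refined decoding.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return a (K, 2) array of (x, y) peak locations from (K, H, W) heatmaps.

    Hypothetical helper: one heatmap per landmark, landmark = argmax location.
    """
    k, h, w = heatmaps.shape
    # Index of the maximum response in each flattened heatmap.
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    # Convert flat indices back to (row, column) coordinates.
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)

# Toy example: two landmarks on a 4 x 5 grid, each with a single peak.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 1.0  # landmark 0 peaks at row 1, column 3
heatmaps[1, 2, 0] = 1.0  # landmark 1 peaks at row 2, column 0
landmarks = decode_heatmaps(heatmaps)
```

Because each prediction depends only on the peak of its own heatmap, the sketch also makes visible why the region around a landmark dominates its predicted location.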

To address the above problems, we propose a cross-order cross-semantic deep network (CCDN) that activates more correlated facial parts and learns more discriminative, fine-grained cross-order cross-semantic features from them for more robust facial landmark detection. The overall architecture of the proposed CCDN is shown in Fig. 2. Specifically, we first propose a cross-order two-squeeze multi-excitation (CTM) module to generate multiple, more discriminative attention-specific feature maps for activating more correlated facial parts. In the CTM module, cross-order channel correlations are introduced to selectively emphasize informative features and suppress less useful ones by considering both first-order and second-order statistics, thereby performing more effective feature re-calibration and generating more effective attention-specific feature maps. Then, a cross-order cross-semantic (COCS) regularizer is developed to guide the feature maps from different excitation blocks to represent different semantic meanings (i.e., activate different correlated facial parts) by maximizing the correlations between features from the same excitation block while de-correlating those from different excitation blocks. Finally, by integrating the CTM module and the COCS regularizer in the proposed CCDN, finer and complementary cross-order cross-semantic features can be learned for more robust facial landmark detection. Experimental results on benchmark datasets show that our approach obtains better robustness and higher accuracy than other state-of-the-art facial landmark detection methods.
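The idea of a two-squeeze, multi-excitation re-calibration can be sketched roughly as follows. This is not the CTM module as specified in the paper, whose details are not given in this excerpt. It is a hedged illustration that assumes the first-order squeeze is global average pooling, the second-order squeeze is a row-wise summary of the channel covariance matrix, and each excitation block is a small two-layer gate. All names (`two_squeeze`, `multi_excitation`), sizes and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_squeeze(x):
    """x: (C, H, W). Returns a (2C,) descriptor: first-order, then second-order."""
    c = x.shape[0]
    flat = x.reshape(c, -1)                      # (C, H*W)
    first = flat.mean(axis=1)                    # first-order: global average pooling
    centered = flat - first[:, None]
    cov = centered @ centered.T / flat.shape[1]  # (C, C) channel covariance
    second = cov.mean(axis=1)                    # second-order: row-wise summary (assumed)
    return np.concatenate([first, second])

def multi_excitation(x, params):
    """Apply every excitation block (w1, w2) to the shared squeeze descriptor."""
    s = two_squeeze(x)
    outputs = []
    for w1, w2 in params:
        hidden = np.maximum(w1 @ s, 0.0)         # ReLU bottleneck
        gate = sigmoid(w2 @ hidden)              # (C,) channel weights in (0, 1)
        outputs.append(x * gate[:, None, None])  # re-calibrated feature maps
    return outputs

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6  # hypothetical feature-map size
P, R = 3, 4        # hypothetical: P excitation blocks, R hidden units each
x = rng.standard_normal((C, H, W))
params = [(rng.standard_normal((R, 2 * C)), rng.standard_normal((C, R)))
          for _ in range(P)]
maps = multi_excitation(x, params)  # P attention-specific copies of x
```

Each excitation block sees the same first- and second-order channel statistics but applies its own gate, so the P outputs can emphasize different channels of the same input.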

The main contributions of this work are summarized as follows:

(1) With the well-designed CTM module, cross-order channel correlations are introduced to perform more effective feature re-calibration and generate multiple, more discriminative attention-specific feature maps, which helps learn more powerful cross-order cross-semantic features for robust facial landmark detection.

(2) A COCS regularizer is designed to drive the network to learn cross-order cross-semantic features from different excitation blocks. By exploring finer and complementary semantic features, our method enhances the robustness of facial landmark detection under large poses and heavy occlusions.

(3) To the best of our knowledge, this is the first study to explore cross-order cross-semantic features for handling facial landmark detection under challenging scenarios. By integrating the CTM module and the COCS regularizer via a novel CCDN with a seamless formulation, our algorithm outperforms state-of-the-art methods on benchmark datasets such as COFW (Burgosartizzu, Perona, & Dollar, 2013), 300W (Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, & Pantic, 2016), AFLW (Zhu, Li, Loy, & Tang, 2016b) and WFLW (Wu et al., 2018).
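The objective behind the COCS regularizer, high correlation within an excitation block and low correlation across blocks, can be illustrated with a toy penalty. The exact formulation used in the paper is not given in this excerpt, so the function below is only an assumed form: it uses Pearson correlations between flattened feature maps, and the name `cocs_penalty`, the `(P, N, D)` layout and the toy vectors are hypothetical.

```python
import numpy as np

def cocs_penalty(features):
    """features: (P, N, D) = P excitation blocks, N feature maps each, D dims.

    Assumed form: mean |cross-block correlation| minus mean same-block
    correlation, so a lower value means better separated blocks.
    """
    p, n, d = features.shape
    flat = features.reshape(p * n, d)
    flat = flat - flat.mean(axis=1, keepdims=True)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    corr = flat @ flat.T                        # (P*N, P*N) Pearson correlations
    block = np.repeat(np.arange(p), n)          # block id of every feature map
    same = block[:, None] == block[None, :]
    off_diag = ~np.eye(p * n, dtype=bool)       # ignore self-correlations
    within = corr[same & off_diag].mean()       # should be large
    across = np.abs(corr[~same]).mean()         # should be small
    return across - within

a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])            # uncorrelated with a
separated = np.array([[a, a], [b, b]])          # blocks carry different patterns
collapsed = np.array([[a, a], [a, a]])          # both blocks carry the same pattern
penalty_separated = cocs_penalty(separated)
penalty_collapsed = cocs_penalty(collapsed)
```

In this toy setting the penalty is lower when the two blocks encode different patterns than when they collapse onto the same one, which is the behaviour the regularizer is described as encouraging.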

The rest of the paper is organized as follows: Section II gives an overview of the related work. Section III shows the proposed method, including the CTM module and the COCS regularizer. A series of experiments are conducted to evaluate the performance of the proposed CCDN in Section IV. Finally, Section V concludes the paper. The symbols and their meanings are listed in Table 1.

Section snippets

Related work

Over the past decades, rapid progress has been made in facial landmark detection. Generally, most existing facial landmark detection methods can be divided into three groups: model-based methods, coordinate regression methods and heatmap regression methods.

Model-based methods. Model-based methods learn parametric models (shape model (Cootes, Taylor, Cooper, & Graham, 1995), appearance model (Cootes, Edwards, & Taylor, 2001) or part model (Cristinacce & Cootes, 2006)) from labeled datasets

Robust facial landmark detection by cross-order cross-semantic deep network

In this section, we first elaborate on the proposed CTM module and then present the COCS regularizer. Finally, we describe the proposed CCDN.

Experiments

In this section, we first introduce the evaluation settings, including the datasets and the methods for comparison. Then, we compare our algorithm with state-of-the-art facial landmark detection methods on challenging benchmark datasets such as COFW (Burgosartizzu et al., 2013), 300W (Sagonas et al., 2016), AFLW (Zhu et al., 2016b) and WFLW (Wu et al., 2018).

Conclusion

Unconstrained facial landmark detection is still a very challenging topic due to large poses and partial occlusions. In this work, we present a cross-order cross-semantic deep network to address facial landmark detection under extremely large poses and heavy occlusions. By fusing the CTM module and the COCS regularizer with a seamless formulation, our CCDN is able to achieve more robust facial landmark detection. It is shown that the CTM module can effectively activate parts of interest and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 62002233, 62076164, 61802267, 61976145, 61732011 and 61806127), the Natural Science Foundation of Guangdong Province, China (Grant Nos. 2019A1515111121, 2018A030310451 and 2018A030310450), the Shenzhen Municipal Science and Technology Innovation Council, China (Grant Nos. JCYJ20180305124834854 and JCYJ20190813100801664) and the China Postdoctoral Science Foundation (Grant No. 2020M672802).

References (59)

Cootes, T. F., et al. (1995). Active shape models-their training and application. Computer Vision and Image Understanding.

Moghadam, S. M., et al. (2018). Nonlinear analysis and synthesis of video images using deep dynamic bottleneck neural networks for face recognition. Neural Networks.

Sagonas, C., et al. (2016). 300 faces in-the-wild challenge: database and results. Image and Vision Computing.

Wan, J., et al. (2019). Face alignment by component adaptive mechanism. Neurocomputing.

Wan, J., et al. (2020). Robust face alignment by cascaded regression and de-occlusion. Neural Networks.

Zheng, H., et al. (2020). Discriminative deep multi-task learning for facial expression recognition. Information Sciences.

Zhu, M., et al. (2019). Branched convolutional neural networks incorporated with Jacobian deep regression for facial landmark detection. Neural Networks.

Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2011). Localizing parts of faces using a consensus of...

Burgosartizzu, X. P., Perona, P., & Dollar, P. (2013). Robust face landmark estimation under occlusion. In IEEE...

Cai, Z., et al. A unified multi-scale deep convolutional neural network for fast object detection.

Cao, X., et al. (2014). Face alignment by explicit shape regression. International Journal of Computer Vision.

Chandran, P., Bradley, D., Gross, M., & Beeler, T. (2020). Attention-driven cropping for very high resolution facial...

Cootes, T. F., et al. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Cristinacce, D., & Cootes, T. F. (2006). Feature detection and tracking with constrained local models. In British...

Dai, T., Cai, J., Zhang, Y., Xia, S.-T., & Zhang, X. P. (2019). Second-order attention network for single image...

Dong, X., Yan, Y., Ouyang, W., & Yang, Y. (2018). Style aggregated network for facial landmark detection. In IEEE...

Feng, Z.-H., Kittler, J., Awais, M., Huber, P., & Wu, X. (2017). Wing loss for robust facial landmark localisation with...

Feng, Z.-H., Kittler, J., Christmas, W. J., Huber, P., & Wu, X. (2017). Dynamic attention-controlled cascaded shape...

Gao, Z., Xie, J., Wang, Q., & Li, P. (2018). Global second-order pooling convolutional networks. In IEEE conference on...

Gao, C., et al. (2020). Three-way decision with co-training for partially labeled data. Information Sciences.

Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer...

Kumar, A., & Chellappa, R. (2018). Disentangling 3D pose in a dendritic CNN for unconstrained 2D face alignment. In...

Kumar, A., et al. (2020). LUVLI face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood.

Le, V., et al. Interactive facial feature localization.

Li, Z., et al. (2015). Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Transactions on Image Processing.

Li, Z., et al. (2020). Weakly-supervised semantic guided hashing for social image retrieval. International Journal of Computer Vision.

Li, P., Xie, J., Wang, Q., & Gao, Z. (2018). Towards faster training of global covariance pooling networks by iterative...

Lin, X., et al. (2019). Region-based context enhanced network for robust multiple face alignment. IEEE Transactions on Multimedia.

Liu, Y., et al. (2020). Facial expression recognition via deep action units graph network based on psychological mechanism. IEEE Transactions on Cognitive and Developmental Systems.

