1 Introduction

Facial expression is one of the most important signals through which people exchange emotional information [1]. Automatic facial expression recognition (FER) is widely used in many fields, such as social robots, medical treatment, intelligent driving, and public safety. Consequently, many researchers focus on FER methods [2,3,4].

In recent years, with the success of deep learning in various fields [5,6,7], more and more researchers have applied convolutional neural networks to FER. Since deep learning requires large amounts of training data, researchers have collected and processed many facial expression datasets, which can be divided into lab-controlled datasets and in-the-wild datasets. Popular lab-controlled datasets include CK+ [8], MMI [9], and OULU-CASIA [10]. Their images are captured under controlled laboratory conditions: the faces are frontal and non-occluded, and the illumination is constant. Convolutional neural networks therefore achieve good recognition results on these datasets [11, 12]. In-the-wild datasets, by contrast, are collected from real environments; commonly used ones include RAF-DB [13], SFEW [14], and AffectNet [15]. They better reflect complex real-world conditions, such as extreme face poses, large occlusions, and illumination changes, as shown in Fig. 1. These factors make recognition accuracy on in-the-wild datasets much lower than on lab-controlled datasets, so in-the-wild FER remains a great challenge.

Fig. 1 Interference factors in in-the-wild facial expression datasets. From left to right: side face, grayscale, low resolution, and occlusion

To improve recognition accuracy on in-the-wild datasets, this paper proposes FERGCN, a network based on graph convolution. The proposed network consists of three parts: a feature extraction module, a graph convolutional network (GCN), and a graph-matching module.

For the feature extraction module, many studies have shown that facial emotion changes are related to specific areas of the face (such as the eyes, mouth, and cheeks) [16,17,18]. Therefore, the proposed feature extraction module consists of two branches, a key point-guided attention branch and a CNN branch, whose feature maps are fused to obtain one global and 18 local feature representations. In the CNN branch, we use triplet attention [19] to refine the feature map.

When recognizing occluded facial expressions, humans usually combine local and global information. Therefore, in the GCN, we take the 18 local feature vectors as the nodes of a topology graph and propose a novel graph convolution layer to learn the expression information of the non-occluded parts. This layer promotes the transfer of semantic features and suppresses that of meaningless and noisy features, so the learned nodes contain both semantic and relational information.

Differences between facial expressions can be subtle, and some negative expressions are highly similar to each other (for example, disgust and sadness, or fear and surprise), so more discriminative features are needed to distinguish them. We adopt the hard-sample triplet loss strategy [20]: for each anchor sample, we select the positive sample of the same class with the largest distance and the negative sample of a different class with the smallest distance, and then perform two groups of graph matching, respectively. In the graph-matching module, the correspondence between two graphs is first learned by graph matching, and the learned correspondence is used as an adjacency matrix to transfer messages. Finally, the similarity between the two graphs is used to compute the verification loss.

The main contributions of this paper are as follows: (1) we propose FERGCN for recognizing facial expressions in the wild, introducing triplet attention and graph matching to the field of expression recognition for the first time. (2) For the first time, a graph convolutional network operating at the feature-graph level is introduced to learn facial expression information; the experimental results show the effectiveness of this module. (3) Our network achieves competitive results on RAF-DB, AffectNet, SFEW, and the occlusion test subsets of RAF-DB.

2 Related work

2.1 Facial expression recognition

FER has long been an important research topic. Most traditional methods use hand-crafted features or shallow learning, for example, local binary patterns (LBP) [21], LBP on three orthogonal planes (LBP-TOP) [22], nonnegative matrix factorization (NMF) [23], and sparse learning [24]. The development of deep learning methods mainly focuses on data and models.

In recent years, great progress has been made in FER based on deep learning [25,26,27]. Considering pose variation, Zhang et al. [28] designed a model that can generate faces with arbitrary poses and expressions using a generative adversarial network [29]; the generated data augment the training set and improve expression recognition accuracy. To better capture expression information, Yang et al. [30] proposed a generative adversarial network that generates neutral faces from faces with any expression and extracts expression information from the intermediate layers of the generator. To reduce the influence of subject appearance on expression recognition, Cai et al. [31] used a generative adversarial network to generate an average face image and kept the generated face consistent with the expression of the original face through supervised learning. Liu et al. [32] designed a Point Adversarial Self-Mining (PASM) model, which adversarially mines facial points to augment the data and uses a teacher–student scheme to train the recognition network. Jiang et al. [33] applied Gabor convolutional networks [34] to expression recognition and obtained an efficient and fast model.

To recognize occluded facial expressions, Li et al. [16] proposed a convolutional neural network with an attention mechanism, which divides the feature map into 24 blocks and emphasizes the information of non-occluded parts by learning a weight for each block. In-the-wild datasets also contain some low-quality images that affect model training. To suppress this uncertainty, Wang et al. [4] introduced a learned image-quality coefficient into the loss function and relabeled low-quality samples. Chen et al. [11] trained the model using soft labels and the label consistency of similar images to suppress label inconsistency in large-scale datasets. These methods have achieved good results, but due to the various interference factors in in-the-wild datasets, expression recognition still faces great challenges.

2.2 Graph convolutional network

Graph structures are irregular and lack translation invariance, so traditional CNNs and RNNs face great challenges in processing graph-structured data. Bruna et al. [35] therefore proposed the graph convolutional network for such data. Defferrard et al. [36] used Chebyshev polynomials to enhance the spatial locality of GCN and reduce its computational complexity. Kipf et al. [37] used an effective layer-wise propagation rule to further simplify graph convolution, forming the now-standard GCN structure. Like a CNN, a GCN is a feature extractor, but it operates on graph data; the extracted features can be used for node classification, graph classification, link prediction, and graph embedding. Recently, GCNs have received increasing attention in computer vision. Yan et al. [38] applied GCNs to video with spatial-temporal graph convolutional networks, achieving good results in skeleton-based action recognition. To apply GCNs to regression tasks, Zhao et al. [39] proposed semantic graph convolutional networks and verified them on 3D human pose regression. Wang et al. [40] used a GCN to learn human body topology, which greatly improved the accuracy of person re-identification under occlusion. The key points of a facial expression naturally form a graph structure, so we design a graph convolution module for expression recognition.

3 Methodology

In this section, we first introduce our overall network framework and then introduce the structure of the three modules and their corresponding loss functions.

3.1 Overview

As shown in Fig. 2, the proposed FERGCN framework includes three parts: a feature extraction module, a GCN, and a graph-matching module. Specifically, the input size is set to 224 × 224, and the feature map and key point heat maps are obtained through the CNN branch and the key point branch, respectively. The key point-guided attention branch predicts 68 landmarks with a facial landmark detection method; we design a mechanism to obtain 18 key points associated with expression, which are then used to generate attention heat maps. Meanwhile, the CNN branch uses ResNet18 [41] to extract feature maps from the input face. To improve recognition accuracy, we add triplet attention [19] at the end of ResNet18. The output features of the CNN branch are multiplied with the 18 heat maps and pooled to obtain 18 local feature vectors, which serve as the semantic information of the key points.

Fig. 2 The FERGCN neural network framework. FERGCN includes the feature extraction module, GCN, and graph-matching module. \(\otimes\) represents the multiplication of corresponding elements

We apply a Z-pool operation, which combines global max pooling and global average pooling, to the output features of the CNN branch to obtain a global feature vector. In addition, inspired by [17], we divide the output features of the CNN branch into blocks to learn contextual information of the face. Since the mouth in the lower part of the face forms a single unit, we divide the features into three non-overlapping parts.

Then, according to the location information of the key points, the local feature vectors are treated as nodes to form a topological graph. Based on this graph, the feature vectors are processed by the GCN to obtain optimized feature vectors. Finally, the outputs of the GCN are matched to obtain the similarity between two images, and the relationship between face images is used for supervised learning.

3.2 Feature extraction module

3.2.1 CNN branch

As shown in Fig. 2, the feature extraction module is composed of the key point-guided attention branch and the CNN branch. In the CNN branch, we use ResNet18 without the average pooling and fully connected layers as the backbone network to extract the global feature map from the given image. We set the stride of conv4_1 to 1, which yields a larger feature map with richer local information.

In addition, we use the triplet attention [19] module to process the feature maps and obtain more expression information. The structure of triplet attention is shown in Fig. 3. It has three branches: a CW branch, a CH branch, and a channel attention branch. The CW branch captures the interaction between the channel dimension C and the spatial dimension W, and the CH branch captures the interaction between C and the spatial dimension H.

Fig. 3 Triplet attention network structure

In the CW branch, the feature map is first transposed to \(H \times C \times W\), and a Z-pool operation along the H dimension yields a tensor of size \(2 \times C \times W\). An attention matrix of size \(1 \times C \times W\) is then obtained with a convolution layer and a sigmoid. Finally, the attention weights are multiplied with the input feature map along the corresponding dimensions.

The CH branch and the channel attention branch are similar to the CW branch: the CH branch pools along the W dimension, while the channel attention branch pools along the C dimension. The final feature map is the average of the outputs of the three branches.

The global feature map F is obtained from the input image X after ResNet18 and triplet attention. The formula is as follows:

$$ F = T(f(X)), $$
(1)

where \(f( \cdot )\) represents the adjusted ResNet18 and \(T( \cdot )\) represents triplet attention.
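For reference, a minimal PyTorch sketch of such a triplet attention module is given below. It follows the rotate, Z-pool, convolve, and sigmoid scheme described above; the 7 × 7 convolution kernel and the omission of batch normalization are assumptions of this sketch rather than details taken from the text.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and average-pooling along dim 1 of the (rotated) tensor."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> convolution -> sigmoid, giving a single-channel attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Average of the CW, CH, and channel attention branches (Fig. 3)."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # C-W interaction: pool over H
        self.ch = AttentionGate()  # C-H interaction: pool over W
        self.hw = AttentionGate()  # channel attention branch: pool over C

    def forward(self, x):  # x: (B, C, H, W)
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # (B, H, C, W) view
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # (B, W, H, C) view
        x_hw = self.hw(x)                                          # plain (B, C, H, W)
        return (x_cw + x_ch + x_hw) / 3.0
```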

3.2.2 Key point-guided attention branch

In the key point-guided attention branch, the SAN [42] method is used to detect 68 face landmarks and their corresponding confidence levels from the input face image. Then, 16 key points (the red points in Fig. 4c) are selected from the 68 landmarks to represent the eyebrows, eyes, mouth, and nose.

Fig. 4 Face key point acquisition process. a The original face image, b the 68 detected landmarks, c the selected key points, where the blue points are calculated from neighboring landmarks

The indexes of these 16 points are 19, 26, 38, 45, 21, 24, 37, 46, 41, 48, 49, 55, 52, 58, 28, and 31. Since the cheeks also carry rich expression information, we propose two extra key points representing the cheeks, calculated from neighboring landmarks and shown as blue points in Fig. 4c. We use the landmarks with indexes 3, 4, and 32 to form a triangle over the left cheek and the landmarks with indexes 36, 14, and 15 to form a triangle over the right cheek, and take the centroid of each triangle as the key point of the corresponding cheek. The confidence of each cheek key point is the average confidence of the three vertices of its triangle.
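As an illustration, the two cheek key points can be computed as follows, assuming `landmarks` is a (68, 2) array of detected coordinates and `conf` a length-68 confidence vector; the indices are converted from the 1-based numbering used above to 0-based array indices.

```python
import numpy as np

def cheek_keypoints(landmarks, conf):
    """Two extra cheek key points: centroids of the triangles formed by
    landmarks 3, 4, 32 (left cheek) and 36, 14, 15 (right cheek)."""
    left_idx = np.array([3, 4, 32]) - 1     # convert 1-based indices to 0-based
    right_idx = np.array([36, 14, 15]) - 1
    left_pt = landmarks[left_idx].mean(axis=0)    # centroid of the left triangle
    right_pt = landmarks[right_idx].mean(axis=0)  # centroid of the right triangle
    left_conf = conf[left_idx].mean()             # average vertex confidence
    right_conf = conf[right_idx].mean()
    return (left_pt, left_conf), (right_pt, right_conf)
```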

As shown in Fig. 4c, the 18 key points and their corresponding confidence levels are obtained. Each key point is then taken as the center of a Gaussian attention heat map, giving 18 heat maps \(A_{i} (i = 1,2, \ldots ,18)\). Finally, a set of local feature vectors is obtained by the following formula:

$$ v_{{{\text{local}}}}^{i} = g(F \otimes A_{i} ), $$
(2)

where \(\otimes\) represents multiplication of corresponding elements and \(g( \cdot )\) represents global average pooling.
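A minimal sketch of Eq. (2) is shown below, assuming the key points are given in feature-map coordinates; the Gaussian standard deviation `sigma` is a hypothetical hyperparameter, since its value is not specified in the text.

```python
import torch

def gaussian_heatmap(center, height, width, sigma=2.0):
    """2-D Gaussian attention map A_i centered on one key point."""
    ys = torch.arange(height, dtype=torch.float32).unsqueeze(1)  # (h, 1)
    xs = torch.arange(width, dtype=torch.float32).unsqueeze(0)   # (1, w)
    cy, cx = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def local_feature_vectors(feat, keypoints, sigma=2.0):
    """v_local^i = GAP(F * A_i), Eq. (2). feat: (C, h, w); keypoints: list of (y, x)."""
    _, h, w = feat.shape
    vectors = []
    for cy, cx in keypoints:
        A = gaussian_heatmap((cy, cx), h, w, sigma)   # (h, w)
        v = (feat * A.unsqueeze(0)).mean(dim=(1, 2))  # element-wise product, then GAP
        vectors.append(v)
    return torch.stack(vectors)                       # (18, C) for 18 key points
```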

The global average pooling operation is carried out on F to obtain the vector \(v_{{{\text{global}}}}\) with global information. Finally, the set of vectors output by the feature extraction module is denoted \(V = (v_{{{\text{local}}}}^{1} ,v_{{{\text{local}}}}^{2} , \ldots ,v_{{{\text{local}}}}^{18} ,v_{{{\text{global}}}} )\).

We use the global feature vector \(v_{{{\text{global}}}}\) to calculate the hard-sample triplet loss [16] to better distinguish similar expressions. Specifically, for each target (anchor) image a, we select the farthest positive sample p and the nearest negative sample n to calculate the triplet loss:

$$ L_{{{\text{triple}}}} = \max (d(a,p) - d(a,n) + \gamma ,0), $$
(3)

where \(d( \cdot )\) is the distance between two feature vectors, and \(\gamma\) is a hyperparameter set to \(0.3\).
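The batch-hard mining described above can be sketched as follows, assuming Euclidean distances between feature vectors and mining within each mini-batch.

```python
import torch

def hard_triplet_loss(features, labels, margin=0.3):
    """Eq. (3): for each anchor, take the farthest positive and nearest negative
    within the mini-batch (batch-hard mining)."""
    dist = torch.cdist(features, features)              # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values  # hardest positive
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values          # hardest negative
    return torch.clamp(pos - neg + margin, min=0).mean()
```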

Facial expression is conveyed by multiple parts of the face, and a single local feature cannot represent the whole expression, so we design a classification module for the feature vector group \(V\) that integrates the local feature information. The proposed classification unit is shown in Fig. 5. Since each local region of a face in in-the-wild datasets may be disturbed to varying degrees, we multiply the confidence \(\alpha_{i}\) of each key point by the corresponding feature vector \(v_{{{\text{local}}}}^{i}\) to suppress possible interference. The key-point features are then fused by average pooling, and the fused features are passed through a fully connected layer to obtain the local classification vector \(v_{{\text{class,fuse}}}\).

Fig. 5 The proposed classification unit. C is the number of expression classes

The global feature vector \(v_{{{\text{global}}}}\) is directly processed by a fully connected layer to obtain the global classification vector \(v_{{\text{class,global}}}\). The loss function of the feature extraction module is as follows:

$$ L_{F} = k \times L_{{{\text{class}}}} (v_{{\text{class,fuse}}} ) + L_{{{\text{class}}}} (v_{{\text{class,global}}} ) + L_{{{\text{triple}}}} (v_{{{\text{global}}}} ), $$
(4)

where \(k\) is a hyperparameter; since we have 18 key points, we set \(k = 18\). \(L_{{{\text{class}}}} ( \cdot )\) denotes the cross-entropy loss function.
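A minimal PyTorch sketch of the classification unit of Fig. 5 is given below; the 512-dimensional features follow the text, while packaging the global classifier into the same module is a presentational choice of this sketch.

```python
import torch
import torch.nn as nn

class ClassificationUnit(nn.Module):
    """Confidence-weighted fusion of the 18 local vectors plus a global classifier (Fig. 5)."""
    def __init__(self, dim=512, num_classes=7):
        super().__init__()
        self.fc_fuse = nn.Linear(dim, num_classes)    # produces v_class,fuse
        self.fc_global = nn.Linear(dim, num_classes)  # produces v_class,global

    def forward(self, v_local, alpha, v_global):
        # v_local: (B, 18, dim), alpha: (B, 18) key-point confidences, v_global: (B, dim)
        weighted = v_local * alpha.unsqueeze(-1)   # suppress disturbed key points
        fused = weighted.mean(dim=1)               # average pooling over the 18 key points
        return self.fc_fuse(fused), self.fc_global(v_global)
```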

To enhance the robustness of the network and learn face context information, we divide the output features of the CNN branch into three parts, as shown in Fig. 2. The three features are then processed by Z-pool and a fully connected layer to obtain the classification vectors \(v_{{\text{class,block}}}^{i} (i = 1,2,3)\). The loss \(L_{{\text{B}}}\) of this part is as follows:

$$ L_{{\text{B}}} = \sum\limits_{i = 1}^{3} {L_{{{\text{class}}}} (v_{{\text{class,block}}}^{i} )} . $$
(5)

3.3 GCN

Although the feature extraction module provides feature information for the key points, recognizing facial expressions in real environments with occlusions and side faces remains a great challenge. Studies have shown that humans can effectively use local regions and the whole face to perceive the semantics of incomplete faces [43]. Therefore, we propose a graph convolutional network: we take the local feature vectors output by the first module as nodes and use the relationships between key facial parts, and between the whole and the parts, to obtain deeper semantic information.

In in-the-wild facial expression datasets, many faces suffer from interference factors such as occlusion, side faces, and shadows, and the 18 key points we obtain may also be affected by these factors. To suppress the interference and emphasize undisturbed local information, we propose the graph convolutional network shown in Fig. 6, which uses the relationship between the whole and the parts to determine which local information should be emphasized.

Fig. 6 The architecture of the graph convolution module. \(\odot\) denotes matrix multiplication, \(\otimes\) denotes multiplication of corresponding elements, and A is the designed adjacency matrix

As shown in Fig. 6, the feature vector group V is divided into two parts: the local feature vectors form \(V_{{{\text{local}}}} \in R^{18\times 512}\), and the global feature vector is copied 18 times to form \(V_{{{\text{global}}}} \in R^{18\times 512}\). To obtain the enhanced feature information \(V_{{\text{d}}} \in R^{18\times 512}\), we propose the following formula:

$$ V_{{\text{d}}} = \{ [{\text{ReLU}}(V_{{{\text{local}}}} \odot W \odot V_{{{\text{global}}}}^{T} )] \otimes A\} \odot V_{{{\text{local}}}} , $$
(6)

where \(\odot\) denotes matrix multiplication, \(\otimes\) denotes multiplication of corresponding elements, and W is a learnable 512 × 512 parameter matrix. \(A\) is the adjacency matrix of the 18 key points. To obtain this adjacency matrix, we design the topological map shown in Fig. 7 according to the structure of the face. The adjacency matrix is defined in Eq. (7), where \(A[i,j]\) denotes the elements of the adjacency matrix and \(V_{i}\) denotes the key points in Fig. 7:

$$ A[i,j] = \begin{cases} 1, & (V_{i} ,V_{j} ) \text{ is an edge in Fig. 7} \\ 0, & (V_{i} ,V_{j} ) \text{ is not an edge in Fig. 7} \end{cases} $$
(7)
Fig. 7 The topological map between the key points
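The construction of A from Eq. (7) can be sketched as below. The exact edge set of Fig. 7 is not listed in the text, so the edge list here is hypothetical; only the construction of the symmetric binary matrix is illustrated.

```python
import torch

def build_adjacency(edges, num_nodes=18):
    """Symmetric binary adjacency matrix A of Eq. (7) from an edge list."""
    A = torch.zeros(num_nodes, num_nodes)
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # the topology in Fig. 7 is undirected
    return A

# Hypothetical edge list (0-based node indices); the real edges are those drawn in Fig. 7.
example_edges = [(0, 2), (2, 4), (1, 3), (3, 5), (8, 10), (10, 12)]
A = build_adjacency(example_edges)
```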

The output feature \(V^{{\text{G}}}\) of the graph convolution module is given by the following formula:

$$ V^{{\text{G}}} = {\text{ReLU}}\{ f_{1} [{\text{concat}}(V_{{{\text{local}}}} + V_{{\text{d}}} ,v_{{{\text{global}}}} )]\} , $$
(8)

where \(f_{1} ( \cdot )\) denotes a fully connected layer and \({\text{concat}}( \cdot , \cdot )\) denotes concatenation, which integrates the optimized local information with the global information.
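For clarity, a minimal single-image sketch of Eqs. (6)–(8) is given below (no batch dimension); the output size of \(f_{1}\) and the weight initialization are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ExpressionGCNLayer(nn.Module):
    """Graph convolution layer of Eqs. (6) and (8), written for a single image."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)  # learnable 512 x 512 matrix
        self.f1 = nn.Linear(2 * dim, dim)                    # fully connected layer f_1

    def forward(self, V_local, v_global, A):
        # V_local: (18, dim), v_global: (dim,), A: (18, 18) adjacency of Eq. (7)
        V_global = v_global.unsqueeze(0).expand_as(V_local)   # copy the global vector 18 times
        attn = torch.relu(V_local @ self.W @ V_global.t())    # whole-part relation, (18, 18)
        V_d = (attn * A) @ V_local                            # mask by A, then aggregate (Eq. 6)
        fused = torch.cat([V_local + V_d, V_global], dim=1)   # concat with global information
        return torch.relu(self.f1(fused))                     # V^G of Eq. (8)
```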

The final \(V^{{\text{G}}}\) plays the same role as the \(V\) of the first module: it is processed by the classification unit to obtain \(v_{{\text{class,fuse}}}^{{\text{G}}}\) and \(v_{{\text{class,global}}}^{{\text{G}}}\). Analogous to formula (4), the loss function of the GCN is as follows:

$$ L_{{\text{G}}} = k \times L_{{{\text{class}}}} (v_{{\text{class,fuse}}}^{{\text{G}}} ) + L_{{{\text{class}}}} (v_{{\text{class,global}}}^{{\text{G}}} ) + L_{{{\text{triple}}}} (v_{{{\text{global}}}}^{{\text{G}}} ). $$
(9)

3.4 Graph-matching module

To obtain higher-order expression information and enhance the discrimination between similar expressions, we apply graph matching for supervised learning on the feature vectors of the second module. General graph-matching methods [44, 45] match corresponding points directly, but they are very sensitive to outliers and thus unsuitable for in-the-wild datasets with strong interference. Here, we use the Cross-Graph Embedded-Alignment Layer (CGEA) from [40] to optimize the result of graph matching and finally obtain the similarity S between two images:

$$ (V_{1}^{{\text{H}}} ,V_{2}^{{\text{H}}} ) = F_{{\text{H}}} (V_{1}^{{\text{G}}} ,V_{2}^{{\text{G}}} ), $$
(10)
$$ S_{1,2}^{{\text{H}}} = \sigma [f_{2} ( - |V_{1}^{{\text{H}}} - V_{2}^{{\text{H}}} |)], $$
(11)

where \(F_{{\text{H}}}\) denotes the CGEA method, \(f_{2}\) is a fully connected layer, and \(\sigma\) is the sigmoid function. The loss function of this module is the verification loss in Eq. (12):

$$ L_{{\text{H}}} = - [y \times \log S_{1,2}^{{\text{H}}} + (1 - y) \times \log (1 - S_{1,2}^{{\text{H}}} )], $$
(12)

where y denotes the ground truth: if the two images have the same expression, \(y = 1\); otherwise, \(y = 0\).

The training strategy of this module is as follows: according to the ground-truth labels and the \(V^{G}\) of the second module, we find the positive sample \(x^{ + }\) farthest from the target image \(x\) and the negative sample \(x^{ - }\) nearest to it. The verification losses of \((x,x^{ + } )\) and \((x,x^{ - } )\) are then calculated using Eqs. (10), (11), and (12), respectively.
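A sketch of Eqs. (11) and (12) for one pair of aligned graph features is given below; the CGEA alignment of Eq. (10) comes from [40] and is assumed to have been applied beforehand, and the node count of 19 (18 local vectors plus one global vector) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class GraphVerificationHead(nn.Module):
    """Similarity of Eq. (11) and verification loss of Eq. (12) for an aligned graph pair."""
    def __init__(self, dim=512, num_nodes=19):
        super().__init__()
        self.f2 = nn.Linear(num_nodes * dim, 1)  # fully connected layer f_2

    def forward(self, V1_H, V2_H):
        # V1_H, V2_H: (B, num_nodes, dim) graph features after CGEA alignment
        diff = -(V1_H - V2_H).abs().flatten(1)   # -|V1^H - V2^H|, flattened per sample
        return torch.sigmoid(self.f2(diff))      # similarity S in (0, 1)

def verification_loss(S, y):
    """Eq. (12): binary cross-entropy; y = 1 if the two images share the same expression."""
    return nn.functional.binary_cross_entropy(S.squeeze(-1), y.float())
```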

3.5 Train and inference

In the training phase, the total loss function is as follows:

$$ L = L_{{\text{F}}} + L_{{\text{G}}} + L_{{\text{H}}} + L_{{\text{B}}} . $$
(13)

We train the network by minimizing \(L\). Note that the graph-matching module does not participate in training during the first 20 epochs. The whole training process is summarized in the algorithm below.

Algorithm 1 The training procedure of FERGCN

In the inference stage, the graph-matching module is not used, and the average of \(v_{{\text{class,fuse}}}^{{\text{G}}}\) and \(v_{{\text{class,global}}}^{{\text{G}}}\) is taken as the final classification output.
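A high-level sketch of one training epoch under the total loss of Eq. (13) is shown below; `model`, `train_loader`, and `graph_matching_loss` are placeholders for the components described above, and the only point illustrated is that the graph-matching loss joins after 20 epochs.

```python
def train_one_epoch(model, train_loader, optimizer, graph_matching_loss, epoch):
    """One epoch under the total loss of Eq. (13); L_H joins after 20 epochs."""
    model.train()
    for images, landmarks, labels in train_loader:
        L_F, L_G, L_B, V_G = model(images, landmarks, labels)  # losses and graph features
        loss = L_F + L_G + L_B
        if epoch >= 20:                                        # delayed graph-matching loss
            loss = loss + graph_matching_loss(V_G, labels)     # L_H of Eq. (12)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```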

4 Experiments

4.1 Datasets

RAF-DB [13] is a real-world database containing 29,672 highly diverse facial images downloaded from the Internet. The images are resized to 100 × 100 pixels. In our experiments, we use only the six basic expressions (happiness, surprise, sadness, anger, disgust, fear) and the neutral expression, giving 12,271 training images and 3068 testing images.

SFEW [14] was created by selecting static frames from the AFEW database [46]. It covers the six basic expressions and the neutral expression, with images of size 143 × 181. The dataset contains 958 training images, 436 validation images, and 372 test images; in our experiments, we use only the training and validation images.

AffectNet [15] contains more than one million images collected from the Internet and is currently the largest facial expression database; about 450,000 of its images are manually annotated. As with RAF-DB, we choose the six basic expressions and the neutral expression for our experiments, giving 283,901 training images and 3500 validation images.

Occlusion-RAF-DB and Pose-RAF-DB [47] are occlusion and pose test subsets extracted from RAF-DB. Occlusion-RAF-DB contains 735 occluded facial images. Pose-RAF-DB has two subsets: 1248 images with a head pose angle greater than 30 degrees and 558 images with a head pose angle greater than 45 degrees. These two subsets are not used for training, only for testing.

4.2 Implementation details

Image preprocessing Before the experiments, we resize all images to 224 × 224 pixels. Because of the severe class imbalance in these datasets, we use several online data augmentation methods to balance the training data. These augmentations include random rotation between −10° and 10°, random horizontal flipping with 50% probability, and random erasing.
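Expressed as a torchvision pipeline, the preprocessing and augmentation could look as follows; the random-erasing parameters are library defaults and an assumption here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),           # resize every image to 224 x 224
    transforms.RandomRotation(10),           # random rotation within [-10, 10] degrees
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip with 50% probability
    transforms.ToTensor(),
    transforms.RandomErasing(),              # random erasing on the resulting tensor
])
```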

Training details The network is trained on a single 1080Ti GPU with 12 GB of memory, using the PyTorch [48] framework. During training, the batch size is set to 64 and we train for 80 epochs. The initial learning rate is 3.5e−4 and is decayed by a factor of 0.1 at epochs 40 and 60. The network is optimized with Adam [49] with betas of (0.9, 0.999).
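The optimizer and learning-rate schedule translate to the following PyTorch sketch, where `model` stands in for the FERGCN network.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 7)  # placeholder; in practice this is the FERGCN network
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, betas=(0.9, 0.999))
# decay the learning rate by a factor of 0.1 at epochs 40 and 60
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 60], gamma=0.1)
```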

4.3 Comparison to the state-of-the-art

We compare the proposed method with recent facial expression recognition methods on the RAF-DB and AffectNet datasets. gACNN [16], RAN [47], and OADN [17] all propose attention mechanisms for occlusion in face images, combining local and global information. LDL-ALSG [11] exploits the correlation between images and introduces soft labels to supervise model training. OADN [17] introduces a landmark-guided attention branch to guide the network to learn the information of non-occluded areas.

The experimental results are shown in Table 1. On both RAF-DB and AffectNet, our results are better than those of the other models. Our method achieves 88.23% accuracy on RAF-DB, which is 1.07% higher than OADN [17], and 62.03% accuracy on AffectNet, which is 0.14% higher than OADN [17]. These results show that our method effectively extracts expression information from face images.

Table 1 Comparison to the state-of-the-art results

We plot the confusion matrix on RAF-DB in Fig. 8. The method performs particularly well on happiness and sadness. Fear is similar to surprise, and people usually restrain their disgust toward others, so FERGCN does not recognize fear and disgust as well.

Fig. 8 Confusion matrix on RAF-DB

We also test our method on SFEW. Since SFEW contains relatively few images, we first pre-train on RAF-DB and then train and test on SFEW. The experimental results are shown in Table 2. ICID [50] uses an intra-category common feature representation (IC) channel and an inter-category distinction feature representation (ID) channel for facial expression recognition. LBF-NN [51] uses pixel-difference features, ensembles of decision trees, and a shallow neural network. Our method achieves 56.15% accuracy, which is 1.96% higher than RAN [47].

Table 2 Comparison to the state-of-the-art results in SFEW

4.4 Performance evaluation on occlusion datasets

To evaluate the robustness of the proposed method to occlusion and pose changes, we test it on Occlusion-RAF-DB and Pose-RAF-DB. Following RAN [47], we first train the network on RAF-DB and then test on Occlusion-RAF-DB and Pose-RAF-DB, comparing against recent methods. RAN [47] proposed a region attention network together with a region-biased loss to emphasize regional information. SCN [4] proposed an image quality coefficient and label correction to handle dataset uncertainty. As shown in Table 3, our framework outperforms RAN [47] by 0.68%, 1.15%, and 1.54% in accuracy on the three test sets. Compared with SCN [4], our method has advantages on Occlusion-RAF-DB and Pose-RAF-DB (pose > 30°). These results underline the effectiveness of the designed face key point graph and graph convolution.

Table 3 Accuracy on Occlusion-RAF-DB and Pose-RAF-DB dataset

4.5 Ablation experiment

To verify the effect of each module on expression recognition, we design ablation experiments on RAF-DB for the GCN, triplet attention, and graph matching. For the GCN ablation, we remove the GCN module and connect the feature extraction module directly to the graph-matching module. The results are shown in Table 4. Removing the GCN reduces accuracy by 0.97%, which shows that the GCN learns the information of the undisturbed parts of the image well, which is important for in-the-wild datasets. Removing triplet attention reduces accuracy by 1.15%, indicating that triplet attention attends to expression-related information in both space and channels. Without the graph-matching module, accuracy drops by 0.92%, indicating that graph matching guides the network to better distinguish similar expressions.

Table 4 Ablation test results on RAF-DB

4.6 Visualization

We visually compare our method with SCN [4] on RAF-DB; the results are shown in Fig. 9. They show that our method recognizes facial expressions better under occlusion. However, our network fails on the last two images in Fig. 9, because their expressions are both inapparent and seriously occluded.

Fig. 9 Comparison of the SCN method and our method on RAF-DB

5 Conclusion

This paper proposes FERGCN, a deep neural network for recognizing facial expressions in in-the-wild datasets. The proposed network consists of three modules: a feature extraction module, a GCN, and a graph-matching module. In the feature extraction module, we use key point information and triplet attention to guide the network to learn local and global facial features. In the GCN, we refine the expression information to suppress the influence of complex environments. In the graph-matching module, we enhance the discrimination ability of the network by reducing inter-class similarity. In addition, we adopt the hard-sample triplet loss to optimize the network. Extensive experiments on FER datasets show that the proposed network performs well: on AffectNet, RAF-DB, and SFEW, our method achieves 62.03%, 88.23%, and 56.15% recognition accuracy, respectively. It also achieves good results on the occlusion test subsets of RAF-DB.