Pattern Recognition Letters

Volume 160, August 2022, Pages 128-134

Kernel-based convolution expansion for facial expression recognition

https://doi.org/10.1016/j.patrec.2022.06.013

Highlights

  • We propose a new architecture that outperforms the linear convolution.

  • It expands convolution to a higher degree kernel function without additional weights.

  • We opt for Taylor series kernel which maps data in both implicit and explicit ways.

  • This architecture is able to learn more complex patterns and is thus more discriminative.

  • The experiments demonstrate that our architecture outperforms the convolution layer.

Abstract

The ever-growing depth and width of Convolutional Neural Networks (CNNs) drastically increase the number of their parameters and require more powerful devices for training and deployment. In this paper, we propose a new architecture that outperforms the classical linear convolution function by expanding the latter to a higher-degree kernel function without additional weights. We opt for the Taylor Series Kernel, which maps input data to a higher-dimensional Reproducing Kernel Hilbert Space (RKHS). Mapping features to a higher-order RKHS is performed in both implicit and explicit ways: for the former, we compute several polynomial kernels of different degrees, leveraging the kernel trick; the latter is achieved by concatenating the results of these polynomial kernels. The proposed Taylor Series Kernelized Convolution (TSKC) is able to learn more complex patterns than the linear convolution kernel and is thus more discriminative. Experiments conducted on Facial Expression Recognition (FER) datasets demonstrate that TSKC outperforms the ordinary convolution layer without additional parameters.

Introduction

Convolutional Neural Networks (CNNs) have been extremely successful in computer vision applications. Convolution layers are the core building block of a CNN. They leverage the fact that an input image is composed of small details, or features, and create a mechanism for analyzing each feature in isolation before making a decision about the image as a whole. This mechanism has allowed CNNs to achieve very good results in various fields, and this success has encouraged the computer vision community to tackle more challenging tasks. One such task is fine-grained recognition, which consists of discriminating categories that were previously considered a single category and differ only in small, subtle visual details. Among fine-grained recognition tasks, we are interested in Facial Expression Recognition (FER): the task of classifying face images into expression categories such as anger, fear, surprise, sadness, happiness, disgust and neutral. These categories were previously treated as a single category (i.e., the human face), and the goal of FER is to distinguish them in a single face image. A first challenge of FER is the small inter-class difference caused by the similarity in expressing some emotions, for instance anger and disgust, or surprise and fear. A second issue is the large intra-class difference: one emotion can be expressed differently from one person to another. The learning of one emotion can thus interfere with the learning of another, which severely hampers the learning process. Therefore, we must be able to detect small, subtle visual distortions in small areas of the face such as the eyebrows, the nose and the mouth.

Convolution layers are ruled by a linear kernel function, which makes their implementation relatively simple and computationally inexpensive. Yet, in some cases convolution fails to learn features that are not linearly separable [16]. Researchers try to overcome this issue either by increasing the network size or by employing more complex functions. In the first case, researchers continuously enhance CNNs by increasing their depth (number of layers) or width (size of the output of each layer). Even though doing so effectively improves the performance of the network, it cannot be a long-term solution: these methods dramatically increase the number of weights and the network complexity, so the resulting models can only be used on powerful devices. In the second case, the focus is on computation: many researchers have incorporated more complex functions into CNNs, instead of simple linear functions, at different levels (e.g., Jayasumana et al. [16], Mahmoudi et al. [26], Mahmoudi et al. [27], Wang et al. [34]). These methods have the benefit of being less memory-consuming than the first approach, even though they are harder to train. We build upon these methods and propose a new convolution-like layer based on a high-degree kernel function.

We propose to improve the performance of CNNs by specifically designing a new high-degree kernel-based convolution-like layer. This new layer outperforms the convolution layer without increasing the number of parameters. It consists of expanding the linear kernel function used in convolution layers into the form of a Taylor Series kernel, by applying the convolution operation simultaneously with higher-degree polynomial kernels (Fig. 1). These polynomial kernels are computed in a manner similar to Kervolution [34], with different degrees (Eq. (2)). The results of all kernel functions are concatenated over the channel axis. This kernel function is more sensitive to subtle details than the linear one and is able to better fit the input data. Sensitivity to subtle visual details is a key factor for FER. Furthermore, this method uses the same number of parameters as a convolution layer. It is worth noting that the proposed layer can be used in the same manner as the usual convolution layer; it can therefore be used on its own or jointly with convolution layers. This flexibility makes it usable in any architecture, or even pluggable at any level of a pre-trained CNN model.
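As a minimal sketch of this idea (not the authors' implementation), the NumPy snippet below computes, for each sliding window, the ordinary linear convolution response together with higher-degree polynomial kernel responses that reuse the same single weight tensor, and stacks all resulting maps over the channel axis. The function name `tskc_2d`, the degree set `(2, 3)` and the bias `cp` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def tskc_2d(image, weight, degrees=(2, 3), cp=1.0):
    """Sketch of a Taylor-Series-style kernelized convolution
    (single channel, stride 1, no padding, hypothetical hyperparameters)."""
    kh, kw = weight.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # One output map per kernel: the linear term plus one per polynomial degree.
    out = np.zeros((1 + len(degrees), oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            ip = np.sum(patch * weight)         # inner product, computed once
            out[0, i, j] = ip                   # degree 1: ordinary convolution
            for k, d in enumerate(degrees, 1):  # kernel trick: (<x, w> + c)^d
                out[k, i, j] = (ip + cp) ** d
    return out  # maps concatenated over the channel axis
```

Note that all polynomial responses are derived from the single inner product `ip`, which is why the layer adds no weights beyond those of an ordinary convolution.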

The remainder of this paper is organized as follows: Section 2 reviews similar works that have been proposed for the improvement of both FER and CNNs using kernel functions. Section 3 introduces the proposed expansion method for convolution layers. Section 4 presents our experimental settings, the datasets we used and the corresponding results. Finally, Section 5 concludes the paper.

Section snippets

Related work

In this section, we describe work similar to our method. First, we review the latest deep learning approaches in the facial expression recognition field. Then, we point out research works that implemented kernel functions to enhance CNN performance.

Taylor series kernelized convolution

A CNN is mainly a stack of three different types of layers, namely convolution, pooling and fully connected layers. These layers use linear kernel functions in order to extract, down-sample and classify features. Let $x \in \mathbb{R}^n$ be an input vector fed into a layer $L$ and $W \in \mathbb{R}^n$ its related weight vector. A linear kernel is represented as in Eq. (1):
$$K_{\mathrm{linear}}(x, W) = \langle x, W \rangle, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ is the inner product.
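The expansion of Eq. (1) to higher degrees rests on the standard identity that a polynomial kernel evaluated implicitly via the kernel trick equals an ordinary inner product taken in an explicit, higher-dimensional feature space. The snippet below illustrates this for the homogeneous degree-2 case (the paper's kernels may additionally include a bias term); the vectors `x` and `w` are arbitrary example values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])

# Implicit: the kernel trick evaluates the degree-2 polynomial kernel
# directly from the inner product, without building the feature map.
implicit = np.dot(x, w) ** 2

# Explicit: map both vectors into the RKHS via phi(v) = v ⊗ v (flattened),
# then take an ordinary inner product there.
phi = lambda v: np.outer(v, v).ravel()
explicit = np.dot(phi(x), phi(w))

assert np.isclose(implicit, explicit)  # both equal (<x, w>)^2
```

The implicit route costs $O(n)$ per evaluation regardless of degree, while the explicit feature map grows as $O(n^d)$; this asymmetry is what makes kernel-trick evaluation attractive inside a convolution-like layer.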

Linear functions are efficient, particularly when the original data is linearly separable. These data

Experiments

In this section, we give more details about the experiments performed to evaluate the approach described above. First, we briefly describe the datasets we used. Then, we describe our model architectures and training process. Finally, we discuss the obtained results and compare them to the state of the art.

Conclusion

In this paper, we proposed a new architecture that outperforms the classical linear convolution function by expanding the latter to a higher-degree kernel function without additional weights. We opted for the Taylor Series Kernel, which maps input data to a higher-order RKHS in both implicit and explicit ways. For the former, we compute several polynomial kernels of different degrees, leveraging the kernel trick; the latter is achieved by concatenating the results of these polynomial kernels.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (36)

  • M.A. Mahmoudi et al.

    Learnable pooling weights for facial expression recognition

    Pattern Recognit. Lett.

    (2020)
  • C. Wang et al.

    Kervolutional neural networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • D. Acharya et al.

    Covariance pooling for facial expression recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2018)
  • M. Bishay et al.

    SchiNet: automatic estimation of symptoms of schizophrenia from facial behaviour analysis

    IEEE Trans. Affect. Comput.

    (2019)
  • W. Chen et al.

    STCAM: spatial-temporal and channel attention module for dynamic facial expression recognition

    IEEE Trans. Affect. Comput.

    (2020)
  • Y. Chen et al.

    Dynamic convolution: attention over convolution kernels

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)
  • Y. Cui et al.

    Kernel pooling for convolutional neural networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Fan et al.

    Facial expression recognition with deeply-supervised attention network

    IEEE Trans. Affect. Comput.

    (2020)
  • Y. Gao et al.

    Compact bilinear pooling

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Z. Gao et al.

    Global second-order pooling convolutional networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • I.J. Goodfellow et al.

    Challenges in representation learning: a report on three machine learning contests

    International Conference on Neural Information Processing

    (2013)
  • Y. Guo et al.

    Deep neural networks with relativity learning for facial expression recognition

    2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

    (2016)
  • D. Haase et al.

    Rethinking depthwise separable convolutions: how intra-kernel correlations lead to improved mobilenets

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)
  • K. He et al.

    Delving deep into rectifiers: surpassing human-level performance on imagenet classification

    Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    (2015)
  • J. Hyun, H. Seong, E. Kim, Universal pooling—A new pooling method for convolutional neural networks, arXiv preprint...
  • M.T.B. Iqbal et al.

    Facial expression recognition with neighborhood-aware edge directional pattern (NEDP)

    IEEE Trans. Affect. Comput.

    (2018)
  • M.T.B. Iqbal et al.

    Facial expression recognition with active local shape pattern and learned-size block representations

    IEEE Trans. Affect. Comput.

    (2020)
  • S. Jayasumana, S. Ramalingam, S. Kumar, Kernelized classification in deep networks, arXiv preprint...