Pattern Recognition Letters

Volume 160, August 2022, Pages 128-134

Kernel-based convolution expansion for facial expression recognition

https://doi.org/10.1016/j.patrec.2022.06.013

Highlights

  • We propose a new architecture that outperforms the linear convolution.

  • It expands convolution to a higher degree kernel function without additional weights.

  • We opt for Taylor series kernel which maps data in both implicit and explicit ways.

  • This architecture is able to learn more complex patterns and is thus more discriminative.

  • The experiments demonstrate that our architecture outperforms the convolution layer.

Abstract

The ever-growing depth and width of Convolutional Neural Networks (CNNs) drastically increase the number of their parameters and require more powerful devices for training and deployment. In this paper, we propose a new architecture that outperforms the classical linear convolution function by expanding the latter to a higher-degree kernel function without additional weights. We opt for the Taylor Series Kernel, which maps input data to a higher-dimensional Reproducing Kernel Hilbert Space (RKHS). Mapping features to a higher-order RKHS is performed in both implicit and explicit ways: for the former, we compute several polynomial kernels of different degrees, leveraging the kernel trick; the latter is achieved by concatenating the results of these polynomial kernels. The proposed Taylor Series Kernelized Convolution (TSKC) is able to learn more complex patterns than the linear convolution kernel and is thus more discriminative. Experiments conducted on Facial Expression Recognition (FER) datasets demonstrate that TSKC outperforms the ordinary convolution layer without additional parameters.

Introduction

Convolutional Neural Networks (CNNs) have been extremely successful in computer vision applications. Convolution layers are the core building block of a CNN. They leverage the fact that an input image is composed of small details, or features, and create a mechanism for analyzing each feature in isolation before making a decision about the image as a whole. This mechanism has allowed CNNs to achieve very good results in various fields, and this success has encouraged the computer vision community to tackle more challenging tasks. One such task is fine-grained recognition, which consists of discriminating categories that were previously considered a single category and differ only in small, subtle visual details. Among fine-grained recognition tasks, we are interested in Facial Expression Recognition (FER): the task of classifying face images into expression categories such as anger, fear, surprise, sadness, happiness, disgust and neutral. These categories were previously treated as a single category (i.e., the human face), and the goal of FER is to distinguish them in a single face image. A first challenge of FER is the small inter-class difference caused by the similarity in expressing some emotions, for instance anger and disgust, or surprise and fear. A second issue is the large intra-class difference: one emotion can be expressed differently from one person to another. The learning of one emotion can thus interfere with the learning of another, which severely hampers the learning process. Therefore, we must be able to detect small, subtle visual distortions in small areas of the face such as the eyebrows, the nose and the mouth.

Convolution layers are ruled by a linear kernel function, which makes their implementation relatively simple and computationally inexpensive. Yet, in some cases convolution fails to learn features that are not linearly separable [16]. Researchers try to overcome this issue either by increasing the network size or by employing more complex functions. In the first case, researchers continuously enhance CNNs by increasing their depth (number of layers) or width (size of the output of each layer). Even though doing so effectively improves the performance of the network, it cannot be a long-term solution: these methods dramatically increase the number of weights and the network complexity, so the resulting models can only be used on powerful devices. In the second case, the focus is on computation: many researchers have incorporated more complex functions into CNNs, instead of simple linear functions, at different levels (e.g., Jayasumana et al. [16], Mahmoudi et al. [26], Mahmoudi et al. [27], Wang et al. [34]). These methods have the benefit of being less memory-consuming than the first approach, even though they are harder to train. We build upon these methods and propose a new convolution-like layer based on a high-degree kernel function.

We propose to improve the performance of CNNs by specifically designing a new high-degree kernel-based convolution-like layer. This new layer outperforms the convolution layer without increasing the number of parameters. It consists of expanding the linear kernel function used in convolution layers into the form of a Taylor Series kernel, by applying the convolution operation simultaneously with higher-degree polynomial kernels (Fig. 1). These polynomial kernels are computed in a manner similar to Kervolution [34], with different degrees (Eq. (2)). The results of all kernel functions are concatenated over the channel axis. This kernel function is more sensitive to subtle details than the linear one and is able to better fit the input data. Sensitivity to subtle visual details is a key factor for FER. Furthermore, this method uses the same number of parameters as a convolution layer. It is worth noting that the proposed layer can be used in the same manner as the usual convolution layer; it can therefore be used on its own or jointly with convolution layers. This flexibility makes it usable in any architecture, or even pluggable at any level of a pre-trained CNN model.
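As a minimal sketch of this idea (not the authors' implementation), the NumPy snippet below computes, for each sliding window, the ordinary linear convolution response together with higher-degree polynomial kernel responses that reuse the same single weight tensor, and stacks all resulting maps over the channel axis. The function name `tskc_2d`, the degree set `(2, 3)` and the bias `cp` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def tskc_2d(image, weight, degrees=(2, 3), cp=1.0):
    """Sketch of a Taylor-Series-style kernelized convolution
    (single channel, stride 1, no padding, hypothetical hyperparameters)."""
    kh, kw = weight.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # One output map per kernel: the linear term plus one per polynomial degree.
    out = np.zeros((1 + len(degrees), oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            ip = np.sum(patch * weight)         # inner product, computed once
            out[0, i, j] = ip                   # degree 1: ordinary convolution
            for k, d in enumerate(degrees, 1):  # kernel trick: (<x, w> + c)^d
                out[k, i, j] = (ip + cp) ** d
    return out  # maps concatenated over the channel axis
```

Note that all polynomial responses are derived from the single inner product `ip`, which is why the layer adds no weights beyond those of an ordinary convolution.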

The remainder of this paper is organized as follows: Section 2 reviews similar works that have been proposed for the improvement of both FER and CNNs using kernel functions. Section 3 introduces the proposed expansion method for convolution layers. Section 4 presents our experimental settings, the datasets we used and the corresponding results. Finally, Section 5 concludes the paper.

Section snippets

Related work

In this section, we describe work similar to our method. First, we review the latest deep learning approaches in the facial expression recognition field. Then, we point out research works that implemented kernel functions to enhance CNN performance.

Taylor series kernelized convolution

A CNN is mainly a stack of three different types of layers, namely convolution, pooling and fully connected layers. These layers use linear kernel functions in order to extract, down-sample and classify features. Let $x \in \mathbb{R}^n$ be an input vector fed into a layer $L$ and $W \in \mathbb{R}^n$ its related weight vector. A linear kernel is represented as in Eq. (1):
$$K_{\mathrm{linear}}(x, W) = \langle x, W \rangle, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ is the inner product.
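The expansion of Eq. (1) to higher degrees rests on the standard identity that a polynomial kernel evaluated implicitly via the kernel trick equals an ordinary inner product taken in an explicit, higher-dimensional feature space. The snippet below illustrates this for the homogeneous degree-2 case (the paper's kernels may additionally include a bias term); the vectors `x` and `w` are arbitrary example values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])

# Implicit: the kernel trick evaluates the degree-2 polynomial kernel
# directly from the inner product, without building the feature map.
implicit = np.dot(x, w) ** 2

# Explicit: map both vectors into the RKHS via phi(v) = v ⊗ v (flattened),
# then take an ordinary inner product there.
phi = lambda v: np.outer(v, v).ravel()
explicit = np.dot(phi(x), phi(w))

assert np.isclose(implicit, explicit)  # both equal (<x, w>)^2
```

The implicit route costs $O(n)$ per evaluation regardless of degree, while the explicit feature map grows as $O(n^d)$; this asymmetry is what makes kernel-trick evaluation attractive inside a convolution-like layer.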

Linear functions are efficient, particularly when the original data is linearly separable. These data

Experiments

In this section, we give more details about the experiments performed to evaluate the approach described above. First, we briefly describe the datasets we used. Then, we describe our model architectures and training process. Finally, we discuss the obtained results and compare them to the state of the art.

Conclusion

In this paper, we proposed a new architecture that outperforms the classical linear convolution function by expanding the latter to a higher-degree kernel function without additional weights. We opted for the Taylor Series Kernel, which maps input data to a higher-order RKHS in both implicit and explicit ways. For the former, we compute several polynomial kernels of different degrees, leveraging the kernel trick; the latter is achieved by concatenating the results of these polynomial kernels.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (36)

  • M.A. Mahmoudi et al.

    Learnable pooling weights for facial expression recognition

    Pattern Recognit. Lett.

    (2020)
  • C. Wang et al.

    Kervolutional neural networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • D. Acharya et al.

    Covariance pooling for facial expression recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2018)
  • M. Bishay et al.

    SchiNet: automatic estimation of symptoms of schizophrenia from facial behaviour analysis

    IEEE Trans. Affect. Comput.

    (2019)
  • W. Chen et al.

    STCAM: spatial-temporal and channel attention module for dynamic facial expression recognition

    IEEE Trans. Affect. Comput.

    (2020)
  • Y. Chen et al.

    Dynamic convolution: attention over convolution kernels

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)
  • Y. Cui et al.

    Kernel pooling for convolutional neural networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Fan et al.

    Facial expression recognition with deeply-supervised attention network

    IEEE Trans. Affect. Comput.

    (2020)
  • Y. Gao et al.

    Compact bilinear pooling

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Z. Gao et al.

    Global second-order pooling convolutional networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • I.J. Goodfellow et al.

    Challenges in representation learning: a report on three machine learning contests

    International Conference on Neural Information Processing

    (2013)
  • Y. Guo et al.

    Deep neural networks with relativity learning for facial expression recognition

    2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

    (2016)
  • D. Haase et al.

    Rethinking depthwise separable convolutions: how intra-kernel correlations lead to improved mobilenets

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)
  • K. He et al.

    Delving deep into rectifiers: surpassing human-level performance on imagenet classification

    Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    (2015)
  • J. Hyun, H. Seong, E. Kim, Universal pooling—A new pooling method for convolutional neural networks, arXiv preprint...
  • M.T.B. Iqbal et al.

    Facial expression recognition with neighborhood-aware edge directional pattern (NEDP)

    IEEE Trans. Affect. Comput.

    (2018)
  • M.T.B. Iqbal et al.

    Facial expression recognition with active local shape pattern and learned-size block representations

    IEEE Trans. Affect. Comput.

    (2020)
  • S. Jayasumana, S. Ramalingam, S. Kumar, Kernelized classification in deep networks, arXiv preprint...