Full Length Article
Facial expression recognition using frequency multiplication network with uniform rectangular features

https://doi.org/10.1016/j.jvcir.2020.103018

Abstract

Facial expression recognition (FER) is a popular research field in cognitive interaction systems and artificial intelligence. Many deep learning methods achieve outstanding performance at the expense of an enormous computation workload, limiting their application in small devices or offline scenarios. To cope with this drawback, this paper proposes the Frequency Multiplication Network (FMN), a deep learning method operating in the frequency domain that significantly reduces network capacity and computation workload. By taking advantage of the frequency-domain conversion, this novel deep learning method utilizes multiplication layers for effective feature extraction. In conjunction with the Uniform Rectangular Features (URF), our method further improves performance and reduces the training effort. On three publicly available datasets (CK+, Oulu, and MMI), our method achieves substantial improvements over popular approaches.

Introduction

Recently, there has been growing interest in developing emotion-aware Human–Robot Interaction (HRI) systems [1]. Driven by the need to be understood, the emphasis of HRI systems is shifting from task variety to emotional intelligence. The key to developing such an HRI system lies in affective computing. Since facial expressions are the most common nonverbal signals reflecting people's affective state, recognizing a person's emotional state and responding appropriately could significantly improve HRI systems.

Facial Expression Recognition (FER) aims to classify basic expressions of the facial area according to the emotion they convey. It involves three major stages: (1) acquisition of the facial area, (2) feature extraction to turn the facial image into feature vectors, and (3) expression recognition to classify the feature vectors into different categories [2], [3]. At present, the construction of highly discriminative facial features is of utmost importance to the FER research community and HRI systems. Feature extraction techniques for facial expression analysis fall mainly into two categories: geometry-based and appearance-based.

Geometry-based techniques mainly use facial landmarks to partition a facial image into multiple regions and then distinguish expressions by the geometric changes of these regions; examples include the Differential Geometric Fusion Network (DGFN) [4] and the Deep Action Units Graph Network (DAUGN) [5]. In particular, DGFN combines facial landmarks to generate features corresponding to Action Units (AUs), while DAUGN converts facial images into facial graphs using facial landmarks and the Voronoi Diagram (VD) algorithm. These methods originate from the attention people pay to facial muscle movements when recognizing an expression. Appearance-based techniques, on the other hand, use textural information such as Local Binary Patterns (LBP) [6], Gabor wavelets, and the Histogram of Oriented Gradients (HOG) [7] to capture the variation of facial textures and wrinkles across expressions. Both kinds of features have weaknesses. Geometry-based features rely heavily on landmark detection methods and cannot capture facial movements that do not cause landmark displacement; appearance-based features can be affected by background noise or deformations of facial organs. Mixing them and feeding the hybrid feature into a deep learning network could be a feasible way to maximize the discriminative information in the resulting deep feature, but most deep learning methods are themselves computationally heavy [2].
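To make the appearance-based side concrete, the sketch below computes LBP and HOG descriptors with scikit-image and concatenates them into a single vector. The parameter choices (LBP radius, HOG cell size) are illustrative, not the settings used in this paper.

```python
# Minimal sketch of appearance-based feature extraction with LBP and HOG.
import numpy as np
from skimage import io
from skimage.feature import local_binary_pattern, hog

def appearance_features(image_path):
    gray = io.imread(image_path, as_gray=True)

    # Uniform LBP: histogram of local texture patterns (P=8 -> 10 bins).
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)

    # HOG: distribution of local gradient orientations.
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))

    # Concatenate into one hybrid appearance descriptor.
    return np.concatenate([lbp_hist, hog_vec])
```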

To move beyond the limitation of a heavy computation workload while keeping the flexibility and adaptability of deep learning architectures, we exploit the potential of frequency-domain filtering in this work. The use of frequency-domain conversion is not new in computer vision. In visual tracking studies, applying a correlation filter to visual features after a frequency-domain transform is an efficient and robust solution [8]. In particular, typical Correlation Filter-based Trackers (CFTs) extract visual features from the target area, transform them into the frequency domain, and apply a correlation filter via element-wise multiplication to generate the final result. The merits of this framework are significant acceleration and resistance to variations. Unfortunately, feature extraction in the frequency domain has not been widely used in visual recognition, mostly because of the difficulty of designing the filters and the lack of practical methods. However, recent research in neuroscience shows that the amygdala system in the human brain can utilize information from specific spatial frequencies during the perception of an expression [9], providing another reason, rooted in the nature of emotion perception, to explore frequency-domain data for visual expression recognition.
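The speed-up that CFTs enjoy follows from the correlation theorem: spatial cross-correlation becomes element-wise multiplication after the FFT. A minimal NumPy sketch of that step, with placeholder feature and filter arrays:

```python
# Frequency-domain correlation as used by typical CFTs.
import numpy as np

def correlate_fft(feature, filt):
    """Cross-correlate a 2-D feature map with a same-sized filter."""
    F = np.fft.fft2(feature)
    H = np.fft.fft2(filt)
    # Element-wise product with the conjugate spectrum is equivalent
    # to spatial cross-correlation (correlation theorem).
    response = np.fft.ifft2(F * np.conj(H))
    return np.real(response)

feature = np.random.rand(64, 64)  # e.g. one feature channel of the target area
filt = np.random.rand(64, 64)     # stand-in for a learned correlation filter
response_map = correlate_fft(feature, filt)
peak = np.unravel_index(response_map.argmax(), response_map.shape)
```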

Despite the above motivations, selecting distinctive patterns in the frequency domain is not easy: it requires prior knowledge to design the filtering functions and constant online training to achieve high performance. The goal of this paper is to exploit the benefits of frequency-domain filtering through a novel feature learning framework operating in the frequency domain. To simplify the design of frequency-domain filters, we propose the multiplication layer as a novel neural network component that automatically learns discriminative features in the frequency domain. Using multiplication layers, we build the Frequency Multiplication Network (FMN) for deep frequency-domain feature extraction and expression-specific feature learning. Furthermore, a hybrid descriptor called the Uniform Rectangular Feature (URF) is used as the visual feature for the expression. In conjunction with URF, the proposed FMN achieves effective expression recognition with significantly fewer parameters and less computation than mainstream CNNs. The main highlights of this paper are summarized as follows:

  • The URF is proposed as a simple and effective visual descriptor for facial expressions.

  • The multiplication layer is proposed as a novel network component for automatic filtering in the frequency domain. With multiplication layers, the FMN is designed for deep frequency-domain feature extraction at a light training cost (a hedged sketch of such a layer follows this list).

  • On three popular datasets, the proposed FER method achieves substantial improvements and better generalization capacity compared to previous methods.
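The preview here does not spell out the multiplication layer, so the following is only a plausible sketch under our own assumptions: a tensor of learnable weights applied element-wise to a frequency-domain (e.g. DCT) input, so that training learns a frequency-selective filter automatically.

```python
# Hypothetical multiplication layer: learnable element-wise scaling
# of frequency-domain inputs. Shapes and usage are our assumptions.
import torch
import torch.nn as nn

class MultiplicationLayer(nn.Module):
    def __init__(self, shape):
        super().__init__()
        # One learnable coefficient per frequency bin.
        self.weight = nn.Parameter(torch.ones(shape))

    def forward(self, x):
        # Element-wise product = filtering in the frequency domain.
        return x * self.weight

layer = MultiplicationLayer((64, 64))
dct_input = torch.randn(8, 64, 64)  # batch of DCT-domain feature maps
filtered = layer(dct_input)         # same shape, frequency-selectively scaled
```

Compared with a convolution of the same spatial support, such a layer needs only one multiply per frequency bin, which is consistent with the reduced computation workload the paper claims.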


Hybrid features

Feature representation methods greatly influence the performance of frequency-domain filtering approaches. Although raw pixels can be used directly to obtain a recognition result, frequency-domain methods are then easily affected by noise such as illumination variation and motion blur [8]. More powerful features, such as hybrid features, help increase the robustness of the method and yield higher accuracy in these tasks [10].

In the literature, Paisitkriangkrai et al. [11]

Proposed method

This section illustrates the proposed FER method. As shown in Fig. 1, after preprocessing, the URF is obtained from the facial image and then converted to the frequency domain using the DCT. To enjoy the benefits brought by the frequency-domain conversion, we design the multiplication layer, with which the FMN is built to integrate frequency-domain features for facial expression recognition.
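As a minimal illustration of the conversion step, the sketch below applies a 2-D type-II DCT with SciPy. The `urf` array is a stand-in for the actual URF descriptor, whose exact layout is not given in this preview.

```python
# 2-D DCT conversion of a feature patch (and its inverse, as a sanity check).
import numpy as np
from scipy.fft import dctn, idctn

urf = np.random.rand(64, 64)               # stand-in for a URF patch
freq = dctn(urf, norm="ortho")             # type-II 2-D DCT
reconstructed = idctn(freq, norm="ortho")  # inverse DCT recovers the input
assert np.allclose(urf, reconstructed)
```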

Experiments

To test the ability of the proposed method, the FMN is evaluated on three public datasets: the Extended Cohn–Kanade (CK+) [26], Oulu-CASIA (Oulu) [48], and MMI [49] datasets, for ablation studies as well as comparison with several competitive methods. In addition, we adopt several convolution-based networks to compare their learning efficiency with the proposed FMN. The experiments are implemented in Python 3.6 on Windows 10. Dlib 19.17 is introduced for
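The sentence above is cut off in this preview, but Dlib is commonly used at this stage of FER pipelines for face and landmark detection. The sketch below shows that typical usage; it should not be read as the paper's exact configuration, and the 68-landmark model file is dlib's standard predictor, downloaded separately.

```python
# Typical Dlib usage for facial-area acquisition and landmark extraction.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in detector(gray):                 # detected face rectangles
    shape = predictor(gray, rect)           # 68 facial landmarks
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```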

Conclusion and future work

In this paper, we demonstrate how the FMN combined with the URF can be an effective frequency-domain approach to FER. First, we extract the URF from key facial regions so that unrepresentative regions and interferences are removed. Then we apply the DCT to convert the URF to the frequency domain while removing its correlation. To extract expression-related features with the least manual effort, we design the multiplication layer for automatic filtering in the frequency domain, with which we

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (63)

  • Tang, Y., et al., Geometric-convolutional feature fusion based on learning propagation for facial expression recognition, IEEE Access (2018)
  • Huang, D., et al., Local binary patterns and its application to facial image analysis: A survey, IEEE Trans. Syst. Man Cybern. C (2011)
  • Hu, M., et al., Facial expression recognition based on fusion features of center-symmetric local signal magnitude pattern, IEEE Access (2019)
  • Chen, Z., et al., An experimental survey on correlation filter-based tracking (2015)
  • Vuilleumier, P., et al., Distinct spatial frequency sensitivities for processing faces and emotional expressions, Nature Neurosci. (2003)
  • Mita, T., et al., Discriminative feature co-occurrence selection for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • Paisitkriangkrai, S., et al., Face detection with effective feature extraction
  • Fu, Y., et al., Multiple feature fusion by subspace learning
  • Xie, Z., et al., Fusion of LBP and HOG using multiple kernel learning for infrared face recognition
  • Zeng, C., et al., Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people counting
  • Hassan, M.M., et al., Human activity recognition from body sensor data using deep learning, J. Med. Syst. (2018)
  • Li, H., et al., Multimodal 2D+3D facial expression recognition with deep fusion convolutional neural network, IEEE Trans. Multimed. (2017)
  • Lin, X., et al., Audio recapture detection with convolutional neural networks, IEEE Trans. Multimed. (2016)
  • Kiranyaz, S., et al., Real-time patient-specific ECG classification by 1-D convolutional neural networks, IEEE Trans. Biomed. Eng. (2016)
  • Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M., Visual object tracking using adaptive correlation filters, in: 2010...
  • Amiriparian, S., et al., Snore sound classification using image-based deep spectrum features
  • Song, S., et al., Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features
  • Pal, M., et al., A bacterial foraging optimization and learning automata based feature selection for motor imagery EEG classification
  • Ekman, P., Facial action coding system (FACS) (2002)
  • Tian, Y., et al., Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
  • Lucey, P., et al., The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression

This paper has been recommended for acceptance by Zicheng Liu.