1 Introduction

Head pose estimation (HPE) is a computer vision technique for determining the orientation of a human head. Head movements are an informative cue about a subject, revealing characteristics such as the individual's intentions and focus of attention. In any context where image acquisition cannot be controlled by an operator, automated HPE of an unknown subject makes face recognition much more accurate and efficient [33]. In the last decade, many application systems have been developed based on the estimation of head directions and movements, finding applicability in several contexts, such as video surveillance and driver monitoring systems. In the literature, head rotation movements can be represented in different forms; the most commonly chosen representation uses the Euler angles, i.e., a 3D vector containing the yaw, pitch and roll angles. Figure 1 shows the head pose along the three axes x, y and z, respectively. Estimating head movements from 2D images is still an open and challenging problem for the many applications that require knowledge of the head rotation. In this paper, we consider a classification method based on the fractal self-similarity of images, called \(\hbox {HP}^2\)IFS [5], and apply four different regression models in order to improve its head pose estimation performance. We performed an experimental evaluation of this novel method over two well-known datasets: Biwi [11] and AFLW2000 [19]. The article is structured as follows. Section 2 introduces a literature review of 2D and 3D methods for head pose estimation; Sect. 3 illustrates the \(\hbox {HP}^2\)IFS method; Sects. 4 and 5 analyze, respectively, the regression methods applied to \(\hbox {HP}^2\)IFS and the datasets adopted in the experimental phase; Sect. 6 describes the experimental results. Finally, conclusions are drawn in Sect. 7.

Fig. 1 The head rotation movements represented as yaw, pitch and roll angles

2 Related work

Many HPE algorithms have been proposed over the years. We divide the existing approaches into those working on 2D (intensity) images and those working on 3D (depth) data. 3D data require special sensors and cameras capable of capturing the subject's depth; furthermore, for this type of acquisition the operating distance between the camera and the subject is limited. For these reasons, the use of 3D methods in real contexts is very limited, and the methods that use 3D images often also rely on 2D images.

2.1 2D image methods

In the category of approaches working on 2D images, many methods involve machine learning techniques, in particular DNNs and CNNs. The method presented in [28] estimates the head pose through a neural network on the Pointing'04 dataset, which contains pitch and yaw information only. FSA-Net [32] is another neural-network-based method, built on regression and feature aggregation. In [31], the authors propose a coarse-to-fine strategy using a deep learning approach, jointly training two subnets to first classify the frame into four classes and then estimate the pose via fine regression. The method in [26] combines two trained CNNs to identify both the head and the body pose; similarly, the HPE approach in [7] exploits information from video sequences in order to estimate the head orientation through the analysis of an individual's movement direction. QuatNet [17] applies a multi-regression loss function and estimates head rotations with a CNN, using RGB frames without depth information. The work in [21] proposes a whole-body estimation method composed of three steps: (1) the first step extracts the person's appearance characteristics using the HOG technique; (2) the second step updates a classifier with the person's tracking and direction information; (3) based on the walking direction and the information of the first module, the third step estimates the body orientation, merging the characteristics collected in the previous steps. The authors in [22] analyze the nose region and evaluate the pose of the face based on its orientation; their experiments show that this information has a high discriminatory power for determining the head orientation compared to techniques based on the analysis of the entire facial region. In [25, 27], two well-known neural networks are used through transfer learning, respectively Multi-Loss ResNet50 and Hyperface. ResNet50 is used to predict the three face degrees of freedom (yaw, pitch and roll angles, respectively) directly from the image; Hyperface trains a CNN to identify the face region, locate the facial reference points and estimate the subject's pose. In [20], the face alignment problem is addressed with KEPLER, which uses efficient H-CNN regressors to iteratively obtain keypoint estimation and pose prediction of unconstrained faces. In [1], the QuadTree Pitch Yaw and Roll (QT-PYR) method is discussed. This approach extracts the 68 facial landmark points and adopts a QuadTree model to encode the pose into a vector, which is then compared to the ground truth to estimate the pose; this method does not make use of neural networks. The papers [2, 3] obtain a face pose coding by building a web-shaped model through the reference points of the face. In hGLLiM [10], the authors experiment with different classifiers and regression methods, proposing a mixture of linear regressions that learns to map high-dimensional feature vectors (extracted from the face bounding boxes) to the head pose angles and the bounding box displacements, so that these are predicted robustly in the presence of unobservable phenomena. In the method presented in [9], HPE is formulated as a mixture of linear regression problems that maps the HOG-based descriptors extracted from the face bounding boxes to the corresponding head poses.
Finally, the authors in [15] address the head pose estimation challenge by analyzing low-resolution frames covering a large angular range and using chrominance-based features. These images constitute the input for a linear auto-associative memory, which is computed for each head pose using a Widrow–Hoff learning rule.

2.2 3D image methods

Although the majority of the existing solutions operate on 2D images, 3D imaging has also been exploited. For example, [12] estimates the orientation of a human head using depth information: starting from a statistical model of the face, the authors train on a large amount of synthesized and annotated data. The experimental evaluation demonstrates that the method can handle real-world data with non-cooperative subjects, partial occlusions of facial regions and facial expression changes, even though it is trained on synthetic facial data only. In [34], 3DDFA (3D Dense Face Alignment) is proposed, which fits a dense 3D Morphable Model (3DMM) of a face to an image via a cascaded CNN. In [6], FAN is presented, in which a very large 2D dataset is synthetically expanded by converting the 2D landmark annotations into 3D and unifying all the existing datasets, leading to the creation of LS3D-W. The method presented in [8] is robust to variable lighting and rotation: the head pose is estimated from 2D key points detected in two consecutive frames in the head region and their 3D projection onto a simple geometric model. In the automotive field, [24] presents a solution for monitoring the driver's head. By combining 2D and 3D information, the head position is estimated and regions of interest are identified, in order to detect driver-related events such as drowsiness or inattention.

3 \(\hbox {HP}^2\)IFS: partitioned iterated function systems for head pose estimation

The method adopted to estimate an individual's head pose was proposed in [5]. This approach is closely related to fractal image compression and, consequently, to the concept of partitioned iterated function systems (PIFS) [14]. Fractal compression is rooted in self-similar structures, which exhibit almost the same features at any level of magnification. Thus, it is possible to describe and generate fractals using extremely simple recursive deterministic algorithms that gradually produce copies of themselves, or of portions of themselves, at various scaling factors. Fractal compression essentially consists in searching, for the whole image or part of it, for the fractal object that best approximates its information content and in encoding the description of the object associated with the image. Although fractal coding was originally used as a lossy image compression algorithm, the \(\hbox {HP}^2\)IFS approach [5] exploits it to analyze the self-similarity of two images representing a similar head rotation.

The main steps characterizing the fractal encoding algorithm are the following (a minimal code sketch is given after the list):

1. Partition the input image into non-overlapping blocks \(R_{i}\) of size N \(\times \) N (namely Range Blocks).

2. Partition the input image into overlapping blocks \(D_{j}\) of size 2N \(\times \) 2N (Domain Blocks).

3. Determine the self-similar parts within the image, expressing every possible area in terms of the image itself through contractive transformations, i.e., applying various combinations of geometrical transformations and luminance factors.
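For concreteness, the following minimal Python sketch illustrates steps 1–3 under stated assumptions: the block size N, the set of eight isometries and the least-squares luminance fit are illustrative choices, not necessarily the exact parameters of \(\hbox {HP}^2\)IFS [5].

import numpy as np

N = 4  # range block side; domain blocks are 2N x 2N

def isometries(block):
    """Yield the 8 rotations/flips used as contractive transform candidates."""
    for k in range(4):
        r = np.rot90(block, k)
        yield r
        yield np.fliplr(r)

def fractal_encode(img, step=2 * N):
    """Return one (dy, dx, isometry, contrast, brightness) tuple per range block."""
    h, w = img.shape
    code = []
    for ry in range(0, h - N + 1, N):
        for rx in range(0, w - N + 1, N):
            R = img[ry:ry + N, rx:rx + N].astype(float)
            best = None
            for dy in range(0, h - 2 * N + 1, step):
                for dx in range(0, w - 2 * N + 1, step):
                    # down-sample the 2N x 2N domain block to N x N (contraction)
                    D = img[dy:dy + 2 * N, dx:dx + 2 * N].astype(float)
                    D = D.reshape(N, 2, N, 2).mean(axis=(1, 3))
                    for t, Dt in enumerate(isometries(D)):
                        # least-squares luminance fit: R ~ s * Dt + o
                        dm, rm = Dt.mean(), R.mean()
                        s = ((Dt - dm) * (R - rm)).sum() / (((Dt - dm) ** 2).sum() + 1e-9)
                        o = rm - s * dm
                        err = ((s * Dt + o - R) ** 2).sum()
                        if best is None or err < best[0]:
                            best = (err, dy, dx, t, s, o)
            code.append(best[1:])
    return code  # PIFS parameters, later flattened into a pose feature vector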

Fig. 2 Fractal encoding process: domain and range blocks

Fig. 3 Framework of the proposed method: A \(\hbox {HP}^2\)IFS approach; B classification; C regression

Therefore, by iterating a series of affine transformations \(f_i\), the goal of the fractal compression algorithm is to find, for each range block R, the best-matching domain block, i.e., the one satisfying the minimum distortion error. These transformations constitute the fractal encoding result. In the \(\hbox {HP}^2\)IFS method, given an input image, the face is first located with the Viola–Jones algorithm [30]. Then, using a pre-trained regression method [18], the 68 facial landmark points are identified in order to create a facial mask. The resulting mask is encoded using the fractal compression algorithm (see Fig. 2). As mentioned above, the matrix created by the fractal encoding is converted into a pose feature vector, which is then compared with the built reference model.
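A hedged sketch of this pre-processing pipeline follows, assuming OpenCV's Haar-cascade implementation of Viola–Jones and dlib's pre-trained 68-landmark shape predictor; the convex-hull masking strategy is an illustrative assumption, not necessarily the exact mask construction of [5].

import cv2
import dlib
import numpy as np

# Viola-Jones detector shipped with OpenCV and dlib's pre-trained
# 68-landmark shape predictor (file name as distributed by dlib).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_mask(gray):
    """Return the masked face region of a grayscale image, or None."""
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = predictor(gray, dlib.rectangle(int(x), int(y), int(x + w), int(y + h)))
    pts = np.array([(p.x, p.y) for p in shape.parts()], dtype=np.int32)
    # illustrative masking: keep only the convex hull of the landmarks
    mask = np.zeros_like(gray)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    return cv2.bitwise_and(gray, mask)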

4 \(\hbox {HP}^2\)IFS: regression models

To estimate the pose of an individual, we start from the classification method shown in Fig. 3B [5] and subsequently compare it with the regression approach. In particular, the array resulting from the fractal encoding is compared with a reference model using the Hamming distance [16]. The reference model is built from a portion of the dataset used in the experiments.
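The following minimal sketch illustrates this nearest-neighbor classification step; the binarization of the fractal-code vectors is assumed to follow [5], and reference_model is a hypothetical structure pairing each reference vector with its pose label.

import numpy as np

def hamming(a, b):
    """Hamming distance between two equal-length code vectors."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

def nearest_pose(probe, reference_model):
    """reference_model: list of (feature_vector, (yaw, pitch, roll)) pairs."""
    return min(reference_model, key=lambda entry: hamming(probe, entry[0]))[1]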

Regression analysis is a predictive modeling technique in which the target variable to be estimated is continuous. By definition, regression is the process of learning a target function f that maps each attribute vector x to a continuous output [13]. The goal is therefore to find the target function that fits the input with the minimum error. In this work, starting from the \(\hbox {HP}^2\)IFS approach to identify the pose, we adopt four different regression models to predict the yaw, pitch and roll angles. This procedure is illustrated in Fig. 3C. Further details are given in the following subsections.

4.1 Linear regression

Linear regression (LR) represents the simplest form of regression [4]: the relationship between the dependent and independent variables is assumed to be linear. In Eq. 1, y is the dependent variable to be estimated, x and \(\epsilon \) are, respectively, the independent variable and the error term, and \(\beta \) is the regression coefficient.

$$\begin{aligned} y = \beta x + \epsilon \end{aligned}$$
(1)

A relationship between the variables of interest does not necessarily imply that one variable causes the other, only that there is a significant association between the two.
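As a minimal worked example of Eq. 1, the sketch below fits \(\beta \) by least squares on toy data; in our setting, the inputs would be the fractal pose features and the ground-truth angles.

import numpy as np

# toy 1-D data standing in for (feature, angle) pairs
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

# least-squares estimate of beta in y = beta * x + eps
beta = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print(beta)  # slope, close to 1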

4.2 Bayesian ridge regression

Ridge regression, also known as Tikhonov regularization, is a classical regularization technique for linear regression [29]. Its estimate has a Bayesian interpretation: adopting a fully probabilistic model in which the prior over the coefficients is given by a spherical Gaussian, Ridge regression is obtained from a Bayesian viewpoint (see Eq. 2).

$$\begin{aligned} p(w\mid \lambda ) = {\mathcal {N}}(w\mid 0,\lambda ^{-1}\mathbf{I }_{p}) \end{aligned}$$
(2)
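A minimal scikit-learn sketch of this model is shown below; the toy data stand in for the fractal feature vectors, and the learned lambda_ attribute corresponds to the prior precision \(\lambda \) of Eq. 2.

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.random((100, 10))                            # stand-in feature vectors
y = X @ np.arange(10) + 0.1 * rng.standard_normal(100)

model = BayesianRidge().fit(X, y)
print(model.lambda_)  # learned precision of the Gaussian weight prior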

4.3 Logistic regression

Logistic regression (LgR), also called the Logit model, is a nonlinear regression model used when the dependent variable is dichotomous. Through statistical methods, LgR generates a result representing the probability that a given input value belongs to a specific class; the goal is to establish the probability with which an observation generates one or the other value of the dependent variable [23]. Eq. 3 defines the Logit model:

$$\begin{aligned} y = \frac{e^{\alpha +\beta x}}{1 + e^{\alpha + \beta x}} \end{aligned}$$
(3)
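The sketch below evaluates Eq. 3 directly; the values of \(\alpha \) and \(\beta \) are illustrative.

import numpy as np

def logit_model(x, alpha=-1.0, beta=2.0):
    """P(y = 1 | x) under Eq. 3; alpha and beta are illustrative values."""
    z = alpha + beta * x
    return np.exp(z) / (1.0 + np.exp(z))

print(logit_model(0.5))  # 0.5: with these parameters the decision boundary is x = 0.5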

4.4 Lasso regression

Lasso regression, an acronym for Least Absolute Shrinkage and Selection Operator, is a regularized version of linear regression [13]. It adopts the L1 penalty in the objective function, so the optimization objective is expressed by Eq. 4:

$$\begin{aligned} \min _{w} \frac{1}{2n_{\mathrm{samples}}}||y-Xw||_{2}^{2} + \alpha ||w||_{1} \end{aligned}$$
(4)

Lasso regression performs a selection of the independent variables, driving the weights of the non-selected ones exactly to zero, and thus generates a sparse model.
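The following minimal scikit-learn sketch illustrates this sparsity effect on synthetic data with only three informative features; the penalty weight \(\alpha \) is an illustrative choice.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]                        # only 3 informative features
y = X @ w_true + 0.1 * rng.standard_normal(200)

model = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(model.coef_))  # most weights are driven exactly to zero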

5 Datasets

The following datasets were used for the experiments and the comparison with the state of the art: the Biwi dataset [11] and the AFLW2000 dataset [19].

5.1 Biwi dataset

The Biwi Kinect Head Pose Database [11] contains RGB-D images of 20 different people (6 females and 14 males), with a total of over 15,000 frames. For each subject, it includes a .obj file with the three-dimensional model of the subject's head. Figure 4 shows five video frames of subject 01 (top) and the corresponding depth frames of the same subject (bottom). For 10 subjects, the 3D head models were processed with the Blender graphics engine to obtain 2223 different poses per subject, covering all the possible pose combinations in terms of pitch, yaw and roll (13 variations in pitch, 19 in yaw and 9 in roll, i.e., 13 × 19 × 9 = 2223) with steps of 5\(^\circ \). Through this procedure, each frame could be annotated with pitch, yaw and roll values, to be used as ground truth in the experiments.

Fig. 4 Some RGB and depth frames from the Biwi dataset with different head poses

5.2 AFLW2000 dataset

The AFLW2000 dataset [19] provides the first 2000 images of the Annotated Facial Landmarks in the Wild (AFLW) dataset, extracted from the Flickr social network. In AFLW, the depicted faces are annotated with the face pose in degrees of yaw, pitch and roll. These faces exhibit random poses and different ages, facial expressions, environmental conditions, etc. Figure 5 shows some images extracted from the AFLW2000 dataset.

Fig. 5 Samples from the AFLW2000 dataset with different head poses

6 Experimental results

As introduced in Sect. 5, the experiments were performed on the Biwi and AFLW2000 datasets. The Biwi dataset provides 10 identities with several images each; consequently, we built our model applying a leave-one-out technique. In particular, we performed 10 different experiments, using in turn one subject for testing and the remaining nine for the model (a sketch of the protocol follows). Each individual has a wide range of poses that covers the angular variation over the three degrees of freedom. Table 1 shows the results obtained for each tested subject in terms of MAE, applying the regression models described in Sect. 4. The large variations in error between one experimental subset and another reflect the geometrical differences between the faces of the various subjects.
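The sketch below outlines this leave-one-out protocol; build_model and estimate_pose are hypothetical placeholders for the \(\hbox {HP}^2\)IFS steps described in Sects. 3 and 4.

import numpy as np

def leave_one_out(subjects, build_model, estimate_pose):
    """subjects: dict mapping subject id -> list of (image, (yaw, pitch, roll))."""
    per_subject_mae = {}
    for test_id in subjects:
        train_ids = [s for s in subjects if s != test_id]
        model = build_model(train_ids)            # reference model from 9 subjects
        errs = [np.abs(np.asarray(estimate_pose(model, img)) - np.asarray(gt))
                for img, gt in subjects[test_id]]
        per_subject_mae[test_id] = np.mean(errs, axis=0)  # per-angle MAE
    return per_subject_mae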

Table 1 Results on the subsets of Biwi applying the \(\hbox {HP}^2\)IFS-BRR model

For the AFLW2000 database, 70% of the frames, randomly selected, were used to create the reference model, and the remaining 30% were used for testing. The results obtained by combining the \(\hbox {HP}^2\)IFS method with the regression models were analyzed through the Mean Absolute Error (MAE), a performance index commonly used in HPE evaluation. The MAE measures the average of the absolute differences between the predicted values (in this case, the predicted poses) and the actual observations (i.e., the ground-truth poses), as indicated in Eq. 5:

$$\begin{aligned} \mathrm{MAE}=\frac{1}{n}\sum _{j=1}^{n} |y_j-\hat{y_j}| \end{aligned}$$
(5)

where \(y_j\) is the angular value of the true pose and \(\hat{y_j}\) is the angular value of the predicted pose. The comparison with the existing literature, reviewed in Sect. 2, is reported in the following tables. The results of our \(\hbox {HP}^2\)IFS regression methods on the Biwi and AFLW2000 datasets are shown, respectively, in Tables 2 and 3. The proposed fusion approach includes four types of regression: Linear (\(\hbox {HP}^2\)IFS-LR), Bayesian Ridge (\(\hbox {HP}^2\)IFS-BRR), Logistic (\(\hbox {HP}^2\)IFS-LgR) and Lasso (\(\hbox {HP}^2\)IFS-LsR).
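Eq. 5 reduces to a one-line NumPy function, shown here with a small worked example:

import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted angles (Eq. 5)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

print(mae([10, -5, 30], [12, -4, 27]))  # (2 + 1 + 3) / 3 = 2.0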

All the values reported in the tables represent the MAE for each of the three angular poses, together with an overall MAE along the three axes.

Table 2 MAE (degrees) of yaw, pitch and roll on Biwi database

Table 2 shows the results obtained on the Biwi dataset, compared with other state-of-the-art approaches. With the Bayesian Ridge regression model, the roll angular error and the overall MAE are similar to those of the \(\hbox {HP}^2\)IFS classification method, and the pitch angular error is better than that of some other deep learning-based approaches; the \(\hbox {HP}^2\)IFS yaw angular error represents the only exception.

Table 3 MAE (degrees) of yaw, pitch and roll on AFLW2000 database

Table 3 reports the comparison results on the AFLW2000 database. The Lasso regression model provides the lowest MAE values with respect to almost all the other state-of-the-art methods, including the pitch and roll angular errors. The very few exceptions, as for the Biwi dataset in Table 2, are methods based on neural networks. It can also be noted that the \(\hbox {HP}^2\)IFS-LsR yaw angular error is very close to the \(\hbox {HP}^2\)IFS yaw error. Finally, Figs. 6 and 7 illustrate, respectively, the error distribution in terms of percentage of tested images using the Bayesian Ridge regression model (BRR) on Biwi and the Lasso regression model (LsR) on AFLW2000, showing a trend consistent with the results in Tables 2 and 3. In particular, for the Biwi dataset 90% of the poses have an error lower than 20\(^\circ \), with a maximum error of 60\(^\circ \) for yaw (see Fig. 6). For the AFLW2000 benchmark, 90% of the images have an error lower than 15\(^\circ \), with a maximum error of 35\(^\circ \) for yaw, as can be seen in Fig. 7. Since the x-axis shows the pose estimation error in degrees, we can see that yaw exhibits both a higher percentage of poses and a higher error; this is because the images have pose variations with a wider range in yaw.
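As an illustrative sketch, the curves of Figs. 6 and 7 can be derived by computing, for each error threshold, the percentage of tested images whose angular error does not exceed it; the 5-degree step and 60-degree cap are illustrative defaults.

import numpy as np

def error_distribution(errors, step=5, max_deg=60):
    """Percentage of tested images with angular error <= each threshold."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.arange(step, max_deg + step, step)
    return {int(t): 100.0 * np.mean(errors <= t) for t in thresholds}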

The total time needed to perform the pre-processing phase on an image of size \(256 \times 256\) is 0.06 s, including face detection and localization, landmark prediction and, finally, the mask creation process. All the experiments were performed on a MacBook Pro with a 2.6 GHz 6-core Intel Core i7, 16 GB of 2667 MHz DDR4 RAM and Intel UHD Graphics 630 (1536 MB), using Python 3.6.8.

Fig. 6 Errors on the Biwi dataset with respect to the tested images (%) in \(\hbox {HP}^2\)IFS-BRR

Fig. 7 Errors on the AFLW2000 dataset with respect to the tested images (%) in \(\hbox {HP}^2\)IFS-LsR

7 Conclusions

In this work, four different regression methods combined with the \(\hbox {HP}^2\)IFS approach were analyzed to estimate an individual's head pose. In particular, the \(\hbox {HP}^2\)IFS regression method merges the self-similarity properties of fractal image compression with the predictions of regression models, thus identifying similar head rotations. The experiments carried out on the widely used benchmark datasets Biwi and AFLW2000 were compared with many state-of-the-art approaches, demonstrating excellent performance and accurate angular values along the three axes, i.e., for yaw, pitch and roll. The proposed fusion methodology is superior to other deep-learning-based methods, and it also requires no training phase.