1 Introduction

With the advent of advanced digital technologies such as ICT, IoT, WoT, mass storage devices, and fast computational methods, biometric systems have evolved tremendously over the last two decades. Nowadays, almost all organizations, irrespective of size and installation capability, use biometric-driven security systems. In biometric systems, especially public-security-driven ones, face recognition plays a vital role in identifying criminals and preventing crime and related activities in public. It has also received considerable attention from the computer vision and pattern recognition research communities.

Specifically, during and after the global Covid-19 pandemic, facial biometric systems have drawn the attention of biometric technology vendors because users worldwide are avoiding fingerprint and other contact-based biometric systems. Organisations that had implemented contact-based biometric systems have therefore turned to non-touch facial authentication platforms as a result of Covid-19, and a spike in demand for non-contact biometric systems is expected. This paper believes that this shift will be reflected in research organisations involved in developing biometric technology-oriented systems and will induce the biometric research community to concentrate more on contactless modalities such as face, iris, gait, and voice recognition (Carlaw 2020; News 2020a).

In face recognition, features and feature extraction methods play a significant role. They have been broadly categorized into global and local methods. Local methods are more effective than global methods for face recognition, even for minute features (Singh and Chabra 2018). Several local feature extraction methods, such as the local binary pattern (LBP), local ternary pattern (LTP), local derivative pattern (LDP), Weber's local descriptor (WLD), and histogram of oriented gradients (HOG), have been proposed for face recognition (Shi et al. 2010; Tan and Triggs 2010; Chai et al. 2014; Singh and Chabra 2018; He et al. 2018; Muqeet and Holambe 2019), and these studies report that local feature extraction methods yield better results than global methods. In recent years, ordinal measures have drawn the attention of the facial recognition research community (Chai et al. 2014; He et al. 2018). Binary-code-driven ordinal features are robust to intra-class variations during binarization (He et al. 2018). He et al. suggest that, in biometric systems, an ordinal measure represents the average brightness of two adjacent regions or the relative ordering of two colour channels within the same region. The colour feature plays a noteworthy role in facial image representation and recognition. Many researchers have used colour features at the local level for pattern recognition, especially for image retrieval (Theoharatos et al. 2005; Bindemann and Burton 2009; Liu and Liu 2010; Seetharaman 2015; Pavithra and Sharmila 2017) and face recognition (Liu and Liu 2008; Thakur et al. 2011; Uçar 2014; Pujol et al. 2017; Sharifara et al. 2017), and they report that colour features give good results. The literature also reveals that spatial orientation and texture features contribute significantly to face recognition (Liu and Liu 2010).

2 Related works

Liu and Liu (2008) have presented a method based on a combination of colour and frequency features for face identification. They combine the R component of the RGB colour model with the I and Q components of the YIQ colour model and report that the proposed hybrid method gives good results. Thakur et al. (2011) have introduced a skin colour model, RGB–HS–CbCr, for face identification and report that the HS, Cb, and Cr components yield better results than the RGB colour model; they use a set of bounding rules, based on the colour distribution, to detect the boundary of the human face. Terrillon et al. (2000) have performed a comparative study of nine different colour models and report that the HSV and YCbCr models result in better segmentation than the others. Wang et al. (2018) have adopted the YCbCr model because it segregates skin tone from background scenes well, and they report that it demands a lower computational cost than the HSV colour model.

The LBP, LDP, LTP, and HOG encode the image pixels based on the positional number system. Positional-number-based encoding can lead to incorrect encoding and lower precision. For example, the LBP and LTP methods assign the weights 1 and 128 to the pixels at locations (1, 1) and (3, 3) of a 3 × 3 sub-image, respectively. This is not appropriate because, spatially, the pixels at locations (1, 1) and (3, 3) contribute equally to the centre pixel. The descriptors derived in both methods from the positional number system are not rational because they discriminate greatly between the weights given to the pixels at locations (1, 1) and (3, 3), which can lead to a wrong or partial feature description of the small image region.
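To make the criticised asymmetry concrete, the following sketch (not part of the proposed method) computes a plain LBP code for a single 3 × 3 window using the row-major weighting described above, so that two spatially symmetric corner pixels receive the weights 1 and 128.

```python
import numpy as np

# Minimal sketch, not the paper's method: a plain LBP code for one 3x3 window,
# using the row-major weighting the paper describes (weight 1 at (1,1),
# weight 128 at (3,3)).
def lbp_code(window):
    """window: 3x3 array of grey values; returns the LBP code of its centre pixel."""
    centre = window[1, 1]
    bits = (window.flatten() >= centre).astype(np.uint8)
    bits = np.delete(bits, 4)                  # drop the centre itself
    weights = 2 ** np.arange(8)                # 1, 2, 4, ..., 128
    return int(np.dot(bits, weights))

w = np.array([[200, 10, 10],
              [ 10, 50, 10],
              [ 10, 10, 200]])
# The two "bright" corners are spatially symmetric about the centre,
# yet they contribute 1 and 128 to the code, respectively.
print(lbp_code(w))   # -> 1 + 128 = 129
```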

In order to overcome this problem, this paper presents an autocorrelation (AC)-based feature extraction method, since autocorrelation gives a faithful representation of the spatial information about the pixels. The proposed method builds a hybrid colour model, called HS–YCbCr, by combining the H and S components of the HSV model with the YCbCr model, and extracts AC coefficients (ACCs) from it. H and S represent hue and saturation; Y represents luminance; Cb and Cr denote blue and red chrominance, respectively. The ACC is computed from each submodel of the HS–YCbCr model and subjected to a significance test at a specific level of significance to examine the degree of correlation of the pixels in a small image region, called a window. The literature (Hatamikia et al. 2014) reveals that the ACC is a good representation of the spatial information about the pixels; hence, the proposed method deploys an ACC-driven feature descriptor for face recognition.

The literature reveals that many researchers have applied statistical models to face detection (Wang et al. 2009; Mil et al. 2012; Yan et al. 2013; Hatamikia et al. 2014; Wu et al. 2017). Wang et al. (2009) have proposed a Bayesian method, based on a Markov random field model, for face recognition. They use Gabor wavelet features and the relationships between the Gabor features at different pixel locations, which provide higher-order contextual constraints. Hatamikia et al. (2014) have investigated the performance of autoregressive (AR) features for the classification of emotional states, based on the AR model, sequential forward feature selection, and a k-nearest neighbour classifier, using EEG signals recorded during emotional audio-visual inductions; they report that the AR features outperform the existing methods. Arashloo (2016) has constructed a deep multilayer architecture and undirected graphical models based on Markov random fields. Wu et al. (2017) have introduced a coupled hidden Markov random field model for face clustering, face tracking, and their interactions. Liu et al. (2019) have presented a method based on conditional random regression forests that performs real-time smiling face detection. They claim that the proposed method is not sensitive to head poses, as the relationship between the image patches and smile intensity is modelled conditionally on head pose. The techniques of regression forests, multiple-label dataset augmentation, and non-informative patch removal are deployed to achieve high performance.

Paul et al. (2020) have proposed a method based on geometrical distance and statistical pattern matching of binary facial components: the eye, nose, and mouth regions. It employs the following statistical pattern matching tools for recognition: the standard deviation of Chi-square values with a probability of white pixels (PWPs), the standard deviation of HuMIs with Hu's seven moment invariants, Abs-DifPWPs, and geometrical distance values (GDVs). GDVs are computed between two similar facial corner points (FCPs), and nine FCPs are extracted from the whole binary face and the binary facial components (BFCs). Pixel intensity values (PIVs) are determined using L2 norms from the greyscale values of the whole face and of the face components. In this method, the corner points of the eye, nose, and mouth regions are taken as geometrical points, and GDVs are computed from these corner points. The GDV-based similarity could lead to wrong results or less precision because the component regions are segmented using Otsu's algorithm, which only approximately separates the object or region of interest from the background; such regions cannot be relied upon for geometrical point detection. Also, the statistical features extracted from greyscale and binary pattern images are not sufficient or appropriate for face recognition, because facial images rely heavily on skin colour chromaticity. Thus, this paper believes that features extracted from greyscale and binary patterns could not lead to good results, or would result in a lower recognition rate.

On the other hand, in recent years, deep neural networks (DNNs) have attracted many researchers. However, as reported in Goswami et al. (2019), DNN-based methods are mostly black boxes, since it is not easy to mathematically formulate the functions learned within their many layers of representation. Goswami et al. examined the robustness of DNNs for face recognition and reported that the performance of deep-learning-based face recognition algorithms suffers significantly in the presence of vulnerabilities and adversarial inputs. Wu et al. (2019) reported that CNN methods apply identical operations to all local regions of each face for feature aggregation and give equal weight to all local features in the face detection process. They also suggest that these methods not only make the local features suboptimal by ignoring region-wise distinctions, but also render the overall face representations semantically inconsistent. Realizing this, many researchers have examined the drawbacks of deep-learning-based algorithms, questioning their robustness and singularities.

This paper also believes, as suggested by Goswami et al., that most DNN architectures do not transparently discuss the methods involved in feature selection, feature matching, and other kinds of facial image analysis. Although there is no doubt about the accuracy of the results attained by DNNs, recognizing human faces without a clear mathematical conception is a challenging and time-consuming process, because face recognition relies significantly on mathematical conception. The literature also reveals that DNN methods incur a higher computation cost because they comprise many layers, which is another disadvantage of the DNN. Hence, to avoid these problems, mainly the computation cost, the proposed method is designed with a Bayes backpropagation neural network (BBPNN), which is deployed only for feature selection, not for the entire face recognition process. Conventional procedures are used for feature matching and face identification. The experimental results show that the proposed method achieves better results with less computation cost than the state-of-the-art methods.

Although many statistical models have been developed for face recognition, the literature reveals that a holistic model has yet to be developed because the existing models are complementary to each other. This motivated us to develop a holistic method for face recognition; thus, this paper attempts to build such a comprehensive model with low computation cost. The proposed method addresses the drawbacks mentioned above as follows:

  1. The proposed method uses the HS–YCbCr colour model instead of greyscale and binary images and extracts features from each colour submodel, namely H, S, Y, Cb, and Cr.

  2. The proposed AR model with Bayes backpropagation neural network (ARBBPNN) method learns deep shape and deep texture features from each colour submodel. It is observed from the literature that facial images rely mainly on colour chromaticity and skin texture patterns, and that the ACC is a good representation of the spatial orientation of the pixels and of textural patterns.

2.1 Outline of the proposed method

The proposed ARBBPNN method performs the face recognition process in three stages: first, it pre-processes the given input image; secondly, it extracts the features using the BBPNN method; finally, it searches for and matches the target face images.

Pre-process First, the given input key image is denoised; then, if it is a colour image, it is converted to the HSV and YCbCr colour models. If the key image is greyscale, it is taken as-is for further analysis. The proposed method builds a hybrid colour model, called HS–YCbCr, by combining the H and S submodels of the HSV model with the YCbCr colour model. Each colour submodel (H, S, Y, Cb, and Cr) is segregated into texture and shape components. For example, the denoised image and the texture and shape components are presented in Figs. 6 and 7, respectively.
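As an illustration of this step, the following sketch builds the hybrid HS–YCbCr representation with OpenCV, assuming a BGR input image; note that OpenCV returns the luminance–chrominance channels in Y, Cr, Cb order, so they are reordered to Y, Cb, Cr here. The file name in the usage comment is hypothetical.

```python
import cv2
import numpy as np

# Minimal sketch, assuming a BGR image read with OpenCV: take H and S from HSV
# and Y, Cb, Cr from OpenCV's YCrCb conversion to form the hybrid HS-YCbCr model.
def hs_ycbcr(bgr):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    h, s = hsv[:, :, 0], hsv[:, :, 1]
    y, cr, cb = ycrcb[:, :, 0], ycrcb[:, :, 1], ycrcb[:, :, 2]
    # Stack the five submodels (H, S, Y, Cb, Cr) for later per-submodel feature extraction.
    return np.stack([h, s, y, cb, cr], axis=-1)

# Example (hypothetical file name):
# model = hs_ycbcr(cv2.imread("key_face.jpg"))
```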

Feature extraction The ACC-based feature descriptor is computed from each colour submodel, based on the AR model coefficients. The coefficients are computed using the model parameters, and the BBPNN model is employed to estimate the parameters of the AR model. Hidden layer 4 of the BBPNN computes the model coefficients, and the ACCs are calculated from these coefficients. Hidden layer 5 tests the significance of the ACC, based on the AC test. If the degree of AC is high, it is inferred that the small image region contributes to texture; otherwise, it is regarded as shape or structure. The ACCs with a high degree of correlation are formulated into a feature vector (FV), called the texture feature vector \( \left( {{\text{FV}}^{t} } \right) \), and the ACCs with a low degree are formulated into a separate FV, termed the shape feature vector \( \left( {{\text{FV}}^{s} } \right) \). Each FV consists of ten feature elements. The shape-driven and texture-driven features are merged into a single FV, as shown in Eq. (35), which has twenty elements and is denoted by \( {\text{FV}}^{k} \).

Searching and matching The target face image identification is performed in two stages: (1) a test of the similarity of the FVs of the key and target face images in terms of variation; (2) a test of the distance between the mean vectors of the FVs of the key and target images, provided the images pass the test conducted at stage (1). To perform these two tests, a hypothesis is formulated: the null hypothesis (H0) states that the key and target face images are the same, and the alternative hypothesis (Ha) states that they differ. At the first stage, the test for equality of covariance of the FVs of the key and target images is performed; at the second stage, the test for equality of the mean vectors of the FVs of the key and target images is conducted. If H0 is accepted, it is inferred that the key and target face images are the same; otherwise, they differ. The overall functions of the proposed method are illustrated in Fig. 1.

Fig. 1

Overall functions of the proposed method

The rest of the paper is organized as follows. Section 3 discusses the pre-processing technique. Section 4 describes the proposed feature selection methods based on the BBPNN method and the AR model, while Sect. 5 discusses the test statistics for face recognition. In Sect. 6, the experimental setups and the performance evaluation measures are briefed, and the experimental results are demonstrated with suitable diagrams. Section 7 deals with computation cost. Section 8 concludes the paper and provides future directions for the proposed method.

3 Pre-process

There are many ways in which different kinds of noise can be incorporated into images while they are captured or manipulated for further analysis. So, the proposed method employs a pre-processing technique, called soft-SVDD (Liu et al. 2014), to remove or mitigate the effects of noise (abnormality) in the image. The main objective of the SVDD is to identify the noised pixels in the feature space, i.e. the transformed domain of the image. The identified noised pixels are replaced by the median value of the normal (non-abnormal) pixels of the image. The denoising rule is expressed in Eq. (1), which strengthens the proposed face recognition method even if the image is profoundly affected by different noises. In order to identify the abnormal pixels in the image, the given input image is divided into sliding windows of size 3 × 3.

$$ \left\| {f_{k,l} - C_{k,l} } \right\|^{2} \le r^{2} $$
(1)

where \( \left\| \cdot \right\| \) is the Euclidean norm; \( C_{k,l} \) is the mean intensity value of the pixels in the window, which is computed using the expression in Eq. (2).

$$ C_{k,l} = \sum\limits_{u = 1}^{p + n} {\lambda_{u} \theta \left( {f_{k,l} } \right)} $$
(2)

\( \lambda_{u} \) denotes the Lagrange multiplier; \( \theta \left( {f_{k,l} } \right) \) is a vector form of the pixel intensity values; \( f_{k,l} \) is the intensity value of the pixel to be classified, whether it is noised or not.

To classify the pixel \( f_{k,l} \), the distance between \( f_{k,l} \) and \( C_{k,l} \) is computed. If it is less than or equal to the radius, r, that is, if it satisfies the condition in Eq. (1), then \( f_{k,l} \) is accepted as a normal pixel; otherwise, it is identified as abnormal.

In order to derive the function in Eq. (1), let \( \left\{ {f_{k,l} \in R^{n} ;\;k,l = 1,2,3} \right\} \) be the set of target pixels in the window. The functions \( m^{\text{nor}} \left( {f_{k, \, l}^{\text{nor}} } \right) \) and \( m^{\text{neg}} \left( {f_{k, \, l}^{\text{neg}} } \right) \) represent the degree of membership of a pixel, \( f_{k,l} \), in the normal class and the negative class, respectively. The sets \( f_{k,l}^{\text{nor}} \) and \( f_{k,l}^{\text{neg}} \) contain the normal and abnormal pixels, respectively. The solution to the soft-SVDD can be achieved by solving the following optimization problem.

$$ \begin{aligned} & \hbox{min} \;G = r^{2} + \omega_{1} \sum {m^{\text{nor}} } \left( {f_{k, l} } \right)\xi_{i} + \omega_{2} m^{\text{neg}} \left( {g_{k, l} } \right)\xi_{j} \\ & {\text{s}}.\;{\text{t}}.\;\left\| {f_{k, l} - C_{k, l} } \right\| \le r^{2} + \xi_{i} ,\;\;f_{k, l} \in S_{\text{nor}} \\ & \quad \;\;\left\| {g_{k, l} - C_{k, l} } \right\| \le r^{2} + \xi_{j} , g_{k, l} \in S_{\text{neg}} \\ & \quad \;\;\xi_{i} \ge 0,\xi_{j} \ge 0, \\ \end{aligned} $$
(3)

where \( \omega_{1} \) and \( \omega_{2} \) control the trade-off between the pixels in the window and the error; \( \xi_{i} \) and \( \xi_{j} \) are defined as measures of error. The terms \( m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right)\xi_{i} \) and \( m^{\text{neg}} \left( {f_{k,l}^{\text{neg}} } \right)\xi_{j} \) can be regarded as measures of error with different weighting factors. It is noted that a smaller value of \( m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right) \) reduces the effect of the parameter \( \xi_{i} \) in Eq. (3), so that the corresponding pixel, \( f_{k,l}^{\text{nor}} \), becomes less significant.

The decision boundary between the normal and abnormal pixels is constructed by deploying the conditions of the following Lagrange-multiplier theorem and the Karush–Kuhn–Tucker complementarity conditions (Vapnik 1998); one can then obtain the radius, r, of the hypersphere.

Theorem

The solution to the problem expressed in Eq. (3) can be obtained by optimizing the problem in Eq. (4):

$$ \begin{aligned} & \hbox{max} \sum\limits_{i = 1}^{m + n} {\lambda_{i} K\left( {m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right), m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right)} \right)} \\ & \quad - \sum\limits_{i = 1}^{m + n} {\sum\limits_{j = 1}^{m + n} {\lambda_{i} \lambda_{j} K\left( {m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right), m^{\text{neg}} \left( {f_{k,l}^{\text{neg}} } \right)} \right)} } \\ & {\text{s}}.\;{\text{t}}.\;0 \le \lambda_{i} \le \omega_{i}^{p} ;\;\;i = 1,2, \ldots ,m + n, \\ & \quad \sum\limits_{i = 1}^{m + n} {\lambda_{i} = 1} \\ \end{aligned} $$
(4)

where \( \lambda_{i} ,\;\lambda_{i} \ge 0 \) are the Lagrange multipliers; \( \left\{ {\omega_{i}^{p} = \omega_{1} m^{\text{nor}} \left( {f_{k,l}^{\text{nor}} } \right),(i = 1,2, \ldots ,l)} \right\} \) and \( \left\{ {\omega_{i}^{p} = \omega_{2} m^{\text{neg}} \left( {f_{k,l}^{\text{neg}} } \right),(i = l + 1,l + 2, \ldots ,l + n)} \right\} \).

Let us assume \( f_{k,l}^{u} \) is one of the pixel patterns in the window; then r can be measured as follows:

$$ \begin{aligned} r^{2} & = \left\| {f_{k,l}^{u} - C_{k,l} } \right\|^{2} \\ & = \left( {f_{k,l}^{u} - C_{k,l} } \right) \cdot \left( {f_{k,l}^{u} - C_{k,l} } \right) \\ r^{2} & = \left( {f_{k,l}^{u} } \right)^{2} + \left( {C_{k,l} } \right)^{2} - 2\left( {f_{k,l}^{u} } \right) \cdot \left( {C_{k,l} } \right) \\ \end{aligned} $$
(5)

Based on Mercer’s kernel function, Eq. (5) can be written as in Eq. (6):

$$ r^{2} = K\left( {f_{k,l}^{u} ,f_{k,l}^{u} } \right) + K\left( {C_{k,l} ,C_{k,l} } \right) - 2K\left( {f_{k,l}^{u} ,C_{k,l} } \right) $$
(6)

By applying the Karush–Kuhn–Tucker conditions (Vapnik 1998), one can derive the distance, r, from the decision boundary of the pixels in the window to the mean intensity value of the pixels in the window.

$$ r^{2} = K\left( {f_{k,l}^{u} , \, f_{k,l}^{u} } \right) + \sum\limits_{i = 1}^{m + n} {\sum\limits_{j = 1}^{m + n} {L_{i} L_{j} K\left( {f_{k,l}^{\text{nor}} ,f_{k,l}^{\text{neg}} } \right)} } - \sum\limits_{i = 1}^{m + n} {L_{i} K\left( {f_{k,l}^{\text{neg}} ,f_{k,l}^{u} } \right)} $$
(7)

To identify whether a pixel in the window is noised or not, it is enough to compute the distance from the pixel to the mean value of the pixels in the window and compare it with the radius, r, of the decision boundary. If the distance is less than or equal to r, that is,

$$ \left\| {f_{k,l} - C_{k,l} } \right\|^{2} \le r^{2} , $$
(8)

then it is inferred that the pixel, \( f_{k,l} \), is normal. Otherwise, it is treated as abnormal.

For more details about the Lagrange multipliers and the Karush–Kuhn–Tucker complementarity conditions, readers can refer to Liu et al. (2014) and Vapnik (1998).
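The following is a much-simplified sketch of the screening rule in Eqs. (1) and (8): the full soft-SVDD of Liu et al. (2014) learns the centre and radius through the kernelised optimization above, whereas here the radius is crudely approximated from the window statistics (an assumption made only for illustration), and only the replace-with-median step is shown.

```python
import numpy as np

# Much-simplified sketch of the screening rule in Eqs. (1) and (8).
# The radius r^2 is approximated from the window variance (an assumption);
# the paper learns it via the kernelised soft-SVDD optimisation.
def denoise(img, scale=1.5):
    out = img.astype(np.float64)
    padded = np.pad(out, 1, mode="edge")
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + 3, j:j + 3]
            c = win.mean()                      # C_{k,l}: mean of the 3x3 window
            r2 = scale * win.var()              # stand-in for the learned radius r^2
            if (img[i, j] - c) ** 2 > r2:       # Eq. (8) violated -> abnormal pixel
                neighbours = np.delete(win.flatten(), 4)
                out[i, j] = np.median(neighbours)   # replace with median of normal pixels
    return out.astype(img.dtype)
```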

Subsequently, the denoised key face image was converted to the HS–YCbCr colour model. The ACC was computed from each colour submodel (H, S, Y, Cb, and Cr), based on a sliding window of size 3 × 3. The computed ACC was tested for high correlation using the significance test discussed in Sect. 4.2.1. If the ACC is highly correlated, it is counted among the texture features; otherwise, it is treated as a shape feature. For example, the shape and texture components of an image are shown in Fig. 7.

4 Proposed feature descriptor

Guo et al. (2010) have reported that, although the LBP performs well for pattern recognition, it still requires more investigation for better results. They have presented a completed modelling of the local binary pattern operator (CLBP), employed it for texture classification, and claimed that it addresses the above problems. However, it still needs further investigation because the CLBP method again assigns binary values based on the difference between the centre and the neighbouring pixels. Seetharamana and Palanivel (2013) suggested that the assignment of binary and ternary codes based on the difference between the centre and the adjacent pixels is not rational. This paper believes that the same problem arises in CLBP as in LBP. In order to overcome this problem, this paper computes the AC, and the AC coefficient is tested for significant correlation. The computation of the ACC and the significance test are detailed in the following sections.

4.1 Spatial-driven descriptor

Let the pixel intensity, f, at location (k, l) of an image be a random process at equal intervals of discrete-time space, t. The pixel at location (k, l) is denoted by \( f_{k,l} \). Let us assume that the pixel, \( f_{k,l} \), is a value generated by a given run of the process at time t. Suppose the process has mean, μ, and variance, σ2, at discrete-time space, t, for each t. In the context of image analysis, the dataset is considered two-dimensional (2-D) data, as depicted in Fig. 2. In this paper, the full-range autoregressive model proposed in Seetharaman (2012) is adopted and reconstructed into an AR model of order two, i.e. AR(2), as in Eq. (9), which is employed to compute the ACC for a small image region of size 3 × 3, as depicted in Fig. 2.

$$ \begin{aligned} f_{k,l} & = \sum\limits_{p = - 1}^{1} {\sum\limits_{q = - 1}^{1} {\lambda_{\rho } f_{k + p,l + q} } } + \varepsilon_{p,q} ; \\ {\text{s}}.\;{\text{t}}.\;\rho & = \left| {p + q} \right|; p = q \ne 0; k,l = 0. \\ \end{aligned} $$
(9)
$$ \lambda_{\rho } = \frac{{K\left( {\sin \left( {\rho \theta } \right)\cos \left( {\rho \varphi } \right)} \right)}}{{\alpha^{\rho } }},\quad {\text{s}}.\;t.,\;\rho = 1,2. $$
(10)

where \( k,\;l \) represent the row and column indices of the centre pixel of the small image region, which is treated as the origin of the region.

Fig. 2

Spatial relationships of the pixels

The model coefficients \( \lambda_{\rho } \) are computed as discussed below. According to the stationarity theorem stated below, the following conditions are imposed on the AR model coefficients.

$$ {\text{Stationarity}} \Leftrightarrow \lambda_{1} + \lambda_{2} < 1,\; - \lambda_{1} + \lambda_{2} < 1,\;\left| {\lambda_{1} } \right| < 2,\quad {\text{and}}\quad \left| {\lambda_{2} } \right| < 1. $$
(11)

Stationarity theorem A necessary and sufficient condition for the AR(p) model to be stationary is that all of the roots of the polynomial in Eq. (12) lie outside the unit circle.

$$ \lambda (\beta ) = \prod\limits_{i = 1}^{p} {\left( {1 - G_{i} \beta } \right)} = 0 . $$
(12)

The ACC is computed, based on the model coefficients. The computation of the model coefficients is discussed in the next section.

4.1.1 Parameters learning

The proposed BBPNN model is used to estimate the parameters of the AR(2) model effectively. It consists of six layers: one input layer, one output layer, and four hidden layers. Its structure is illustrated in Fig. 3. The parameters are learned as follows.

Fig. 3

Structure of the BBPNN with three-layer perceptron

According to Bayes' rule, the point estimates of the parameters \( \alpha ,\;\theta ,\;\phi \), and K are taken as the means of their respective marginal posterior distributions, i.e. the posterior means. With a view to minimizing the computational cost, the posterior mean of \( \alpha \) is computed first. Then, \( \alpha \) is fixed at its posterior mean, and the conditional means of \( \theta \) and \( \phi \) are evaluated. Next, \( \mathop \alpha \limits^{ \wedge } ,\;\mathop \theta \limits^{ \wedge } , \) and \( \mathop \phi \limits^{ \wedge } \) are set at their posterior means, and the conditional mean of K is estimated. Thus, the estimates are

$$ \begin{aligned} & \mathop \alpha \limits^{ \wedge } = E\left( \alpha \right) \\ & \left( {\mathop \theta \limits^{ \wedge } ,\mathop \varphi \limits^{ \wedge } } \right) = E\left( {\left. {\theta ,\varphi } \right|\mathop \alpha \limits^{ \wedge } } \right) \\ & \mathop K\limits^{ \wedge } = E\left( {\left. K \right|\mathop \alpha \limits^{ \wedge } ,\mathop \theta \limits^{ \wedge } ,\mathop \varphi \limits^{ \wedge } } \right) \\ \end{aligned} $$
(13)

First, the image is divided into sliding windows \( w\left( {k,l} \right) \) of size 3 × 3, where k and l range from − 1 to 1, and the parameters are estimated as follows.

Estimation of α First, hidden layer 1 estimates the posterior mean as follows, based on the seed values received from the input layer, which are treated as prior information. Setting the seed values of the parameters is not difficult because their ranges are known.

$$ \mathop \alpha \limits^{ \wedge }_{{_{ij} }} = E(\alpha_{ij} ) = \frac{{\sum\nolimits_{i} {\alpha \times f(i)} }}{{\sum\nolimits_{i} {f(i)} }} $$
(14)

where \( f(i) = \sum\nolimits_{r} {\sum\nolimits_{s} {T(r,s)} } \) with \( T(r,s) = e^{ - \beta (\alpha - 1)} C^{ - d} \alpha^{ - 1/2} \)

$$ d = \frac{\text{TP}}{2};\quad \Delta = \frac{\pi }{\text{TP}};\quad \alpha = \frac{\beta }{2};\quad \beta = \frac{{{\text{Var}}\left( {f_{i} } \right)}}{{\overline{f} }} $$

where T(r, s) is a matrix of order p × q; TP is the number of pixels in the window, i.e. TP = 9; \( \alpha \) is a weight factor; f(i) is the likelihood of the observations; \( \sum\nolimits_{i} {f(i)} \) is a normalizing factor.
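Structurally, the Bayes point estimate in Eq. (14) is a likelihood-weighted average over candidate parameter values. The sketch below illustrates only this structure; the likelihood used in the example call is a hypothetical placeholder, not the paper's f(i).

```python
import numpy as np

# Minimal sketch of the Bayes point estimate in Eq. (14): the posterior mean of a
# parameter is the likelihood-weighted average over a grid of candidate values.
def posterior_mean(candidates, likelihood):
    """candidates: 1-D array of candidate parameter values;
    likelihood: callable giving an (unnormalised) likelihood for each candidate."""
    w = np.array([likelihood(a) for a in candidates])
    return float(np.sum(candidates * w) / np.sum(w))

# Example with a hypothetical likelihood shape (assumption, for illustration only):
alphas = np.linspace(0.1, 5.0, 200)
alpha_hat = posterior_mean(alphas, lambda a: np.exp(-0.8 * (a - 1.0)) * a ** -0.5)
```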

The \( \lambda_{\rho } \) values \( \left( {\rho = 1, 2} \right) \) are computed by substituting \( \mathop \alpha \limits^{ \wedge }_{ij} \) and \( \theta ,\;\phi ,\;{\rm K} \) into Eq. (10), as exemplified in Eq. (18); \( \mathop \alpha \limits^{ \wedge }_{ij} \) is estimated in hidden layer 1, and the parameters \( \theta ,\;\phi ,\;{\rm K} \) are received from the input layer. The computed \( \lambda_{\rho } \) values are stored in \( \lambda \left( {k,l} \right) \), as illustrated in Eq. (15). In order to learn \( \lambda_{\rho } \), \( \lambda \left( {k,l} \right) \) and \( w\left( {k,l} \right) \) are convolved as depicted in Eq. (16). The ratio R is computed using Eq. (17). If R falls in the interval given in Eq. (17), the iteration process is terminated, and the estimated parameters are sent to hidden layer 4. Otherwise, \( \mathop \alpha \limits^{ \wedge }_{ij} \) is sent to hidden layer 2.

$$ \lambda \left( {k,l} \right) = \left[ {\begin{array}{*{20}c} {\lambda_{2} } & {\lambda_{1} } & {\lambda_{2} } \\ {\lambda_{1} } & {} & {\lambda_{1} } \\ {\lambda_{2} } & {\lambda_{1} } & {\lambda_{2} } \\ \end{array} } \right] $$
(15)
$$ g = \sum\limits_{i = - 1}^{1} {\sum\limits_{j = - 1}^{1} {\lambda \left( {k,l} \right)*w\left( {k,l} \right);} } \quad {\text{s}}.\;{\text{t}}.\;\;k = l \ne 0 $$
(16)
$$ R = \frac{w(k,l)}{g} \in \left[ {0.95, 1.05} \right] $$
(17)
$$ \lambda_{1} = \frac{{K\left( {\sin \left( {1\theta } \right)\cos \left( {1\phi } \right)} \right)}}{{\alpha^{1} }}\quad {\text{and}}\quad \lambda_{2} = \frac{{K\left( {\sin \left( {2\theta } \right)\cos \left( {2\phi } \right)} \right)}}{{\alpha^{2} }} $$
(18)

When R falls in the interval given in Eq. (17), the iteration process is considered saturated, meaning that the predicted pixel has become equal, or close, to the actual pixel.
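The sketch below illustrates Eqs. (10) and (15)–(18) for one window: it builds the coefficient mask λ(k, l) from the current parameter estimates, combines it with the window, and checks the ratio R against the saturation band. Reading R as the centre pixel divided by its AR prediction is one interpretation of Eq. (17), and any parameter values used in an example call would be hypothetical.

```python
import numpy as np

# Minimal sketch of Eqs. (10), (15)-(18) for a single 3x3 window.
def lambda_rho(rho, K, theta, phi, alpha):
    return K * np.sin(rho * theta) * np.cos(rho * phi) / alpha ** rho   # Eq. (10)

def saturated(window, K, theta, phi, alpha, band=(0.95, 1.05)):
    l1 = lambda_rho(1, K, theta, phi, alpha)
    l2 = lambda_rho(2, K, theta, phi, alpha)
    mask = np.array([[l2, l1, l2],
                     [l1, 0.0, l1],
                     [l2, l1, l2]])                 # Eq. (15), centre excluded
    g = float(np.sum(mask * window))                # Eq. (16)
    R = window[1, 1] / g                            # Eq. (17): centre pixel over its prediction (one reading)
    return band[0] <= R <= band[1], R
```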

Estimation of θ and ϕ (hidden layer 2) The parameter α is fixed at its posterior mean, \( \mathop \alpha \limits^{ \wedge } \), as in Eq. (13); then θ and ϕ are learned, based on the expressions in Eq. (19), using the seed values of θ and ϕ received from the input layer, which are regarded as prior information.

$$ \mathop \theta \limits^{ \wedge } = E(\theta ) = \frac{{\sum\nolimits_{i} {\theta \times f(i)} }}{{\sum\nolimits_{i} {f(i)} }};\quad \mathop \phi \limits^{ \wedge } = E(\phi ) = \frac{{\sum\nolimits_{i} {\phi \times f(i)} }}{{\sum\nolimits_{i} {f(i)} }} $$
(19)

Hidden layer 2 computes \( \lambda_{\rho } \), based on \( \mathop \alpha \limits^{ \wedge } ,\;\mathop \theta \limits^{ \wedge } ,\;\mathop \phi \limits^{ \wedge } ,\;K, \) and stores it as in Eq. (15); this layer learns the parameters \( \mathop \theta \limits^{ \wedge } \) and \( \mathop \phi \limits^{ \wedge } \), while \( \mathop \alpha \limits^{ \wedge } \) and K are received from hidden layer 1 and the input layer, respectively. The value of g is computed as discussed above and applied in Eq. (16), and the layer examines whether R falls in the interval \( \left[ {0.95, 1.05} \right] \). If \( R \in \left[ {0.95, 1.05} \right] \), the iteration process is terminated, and the estimated parameters are sent to hidden layer 4. Otherwise, \( \mathop \alpha \limits^{ \wedge } \), \( \mathop \theta \limits^{ \wedge } \), and \( \mathop \phi \limits^{ \wedge } \) are fed as input to hidden layer 3.

Estimation of K (hidden layer 3) The parameters \( \alpha ,\;\theta , \) and \( \phi \) are fixed at their respective posterior means, \( \mathop \alpha \limits^{ \wedge } ,\;\mathop \theta \limits^{ \wedge } \), and \( \mathop \phi \limits^{ \wedge } \); then K is learned, based on the expression in Eq. (20), using the seed value of K received from the input layer, which is regarded as prior information.

$$ \mathop K\limits^{ \wedge } = {b \mathord{\left/ {\vphantom {b a}} \right. \kern-0pt} a} $$
(20)

Layer 3 computes R, based on \( \mathop \alpha \limits^{ \wedge } ,\;\mathop \theta \limits^{ \wedge } ,\;\mathop \phi \limits^{ \wedge } ,\;\mathop K\limits^{ \wedge } \), and examines whether R falls in the interval \( \left[ {0.95, 1.05} \right] \). If R falls in the interval, the iteration process is terminated, and the estimated parameters are sent to hidden layer 4. Otherwise, the sliding window moves to the (k + 1)th position with half of the parameter values estimated at the kth position, and the weight factor of each parameter is updated as in Eq. (21) until the condition in Eq. (17) is satisfied. The above process is repeated for the entire image. Ultimately, the BBPNN produces a tensor F(I, J, K) of ACC values, which is the output of the BBPNN model.

$$\left.\begin{array}{ll}{\mathop\alpha\limits^{ \wedge }}_{i} & ={\mathop\alpha\limits^{ \wedge }}_{i - 1} + \Delta_{i} \\ {\mathop\theta\limits^{\wedge }}_{i} & = {\mathop\theta\limits^{ \wedge }}_{i - 1} + \Delta_{i}\\ {\mathop\phi\limits^{ \wedge }}_{i} & = {\mathop\phi\limits^{ \wedge}}_{i - 1} + \Delta_{i} \\ {\mathop K\limits^{ \wedge }}_{i} & = {\mathop K\limits^{ \wedge }}_{i - 1} + \Delta_{i} \end{array}\right\} $$
(21)

where i denotes the ith iteration of updating the parameters; \( \Delta_{i} \) is a very small quantity.

Hidden layer 4 Based on the estimated parameters, this layer computes the model coefficients \( \lambda_{1} \) and \( \lambda_{2} \), as exemplified in Eq. (18). The ACCs, \( \rho_{1} \) and \( \rho_{2} \), are computed by substituting the model coefficients into Eqs. (29) and (30), and the ACCs are sent to hidden layer 5.

Hidden layer 5 This layer performs a significance test on the computed ACCs, based on the test statistic expressed in Eq. (31). It computes the test statistic, \( {\text{AC}}_{m} , \) from the computed AC and tests whether it falls within the confidence limits, \( \pm \alpha (\sigma /\sqrt n ) \). If \( {\text{AC}}_{m} \) falls within the limits, it is inferred that the ACCs are not highly significant, i.e. the small image region of the window contains a shape feature, and it is formulated as an FV as depicted in Eq. (33); otherwise, the region is assumed to contain a texture feature, which is formulated as an FV as demonstrated in Eq. (34). The formulated FVs are sent to the output layer, which outputs the FVs.

4.2 Basis for computation of ACC

In order to compute the ACC, we have to compute the autocovariance of the AR(2) process. A general form of the autocovariance function can be written as in Eq. (22):

$$ \varpi_{{\left| {p + q} \right|}} = E\left( {f_{k,l} f_{k + p,l + q} } \right) $$
(22)

where p and q are the lag variables in row and column directions of the 2D image pixel intensity values, respectively. In the case of the proposed study, Eq. (22) can be written as follows:

$$ \begin{aligned} \varpi_{\rho } & = E\left\{ {\sum\limits_{p} {\sum\limits_{q} {\lambda_{\rho } \left( {f_{k,l} f_{k + p,l + q} } \right)} } } \right\} + E\left( {\varepsilon_{k,l} \varepsilon_{k + p,l + q} } \right); \\ {\text{s}}.\;{\text{t}}.\;\;p & = - 1, 0, 1;\quad q = - 1, 0, 1. \\ \end{aligned} $$
(23)
$$ \begin{aligned} \varpi_{\rho } & = E\left\{ {\lambda_{\rho } \left[ {f_{0,0} \left( {f_{ - 1,0} + f_{0,1} + f_{1,0} + f_{0, - 1} } \right)} \right]} \right\} \\ & \quad + E\left\{ {\lambda_{\rho } \left[ {f_{0,0} \left( {f_{ - 1, - 1} + f_{ - 1,1} + f_{1,1} + f_{1, - 1} } \right)} \right]} \right\} + E\left( {\varepsilon_{0,0} \varepsilon_{p,q} } \right) \\ {\text{s}}.\;t.\;\;p & = - 1\;{\text{to}}\;1;\quad q = - 1\;{\text{to}}\;1 \\ \end{aligned} $$
(24)
$$ \varpi_{r} = \lambda_{1} \varpi_{r - 1} + \lambda_{2} \varpi_{r - 2} $$
(25)

where \( \rho = \left| {p + q} \right| \), ρ represents order of the autocovariance function.

The AC function expressed in Eq. (26) can be obtained by dividing Eq. (24) by the variance of the pixel values of the subimage, \( f_{k,l} \):

\( {\text{Var}}\left( {f_{i,j} } \right) = \frac{{\sum\nolimits_{i = 1}^{3} {\sum\nolimits_{j = 1}^{3} {\left( {f_{i,j} - \bar{f}} \right)^{2} } } }}{N - 1} \), where N is the number of pixels in the small image region of size 3 × 3 and \( \bar{f} \) is their mean.

$$ \rho_{r} = \lambda_{1} \rho_{r - 1} + \lambda_{2} \rho_{r - 2} $$
(26)

Since \( \rho_{1} = \rho_{ - 1} \), Eq. (26) can be written as:

$$ \rho_{1} = \lambda_{1} + \lambda_{2} \rho_{ - 1} $$
(27)
$$ \rho_{1} = \lambda_{1} + \lambda_{2} \rho_{1} $$
(28)
$$ \rho_{1} = \frac{{\lambda_{1} }}{{1 - \lambda_{2} }} $$
(29)

Similarly, the second-order ACC can be computed as follows:

$$ \rho_{2} = \frac{{\lambda_{1}^{2} + \lambda_{2} - \lambda_{2}^{2} }}{{1 - \lambda_{2} }} $$
(30)

where \( \rho_{1} \) and \( \rho_{2} \) represent the first- and second-order ACC, respectively.
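For example, the first- and second-order ACCs follow directly from the AR(2) coefficients via Eqs. (26) and (29):

```python
# Minimal sketch of Eqs. (29)-(30): first- and second-order autocorrelation
# coefficients of the AR(2) model from its coefficients lambda1, lambda2,
# using the Yule-Walker relation rho_r = l1*rho_{r-1} + l2*rho_{r-2}.
def acc_pair(l1, l2):
    rho1 = l1 / (1.0 - l2)          # Eq. (29)
    rho2 = l1 * rho1 + l2           # Eq. (26) with r = 2 and rho_0 = 1
    return rho1, rho2

# Example with hypothetical coefficients satisfying the stationarity bounds in Eq. (11):
print(acc_pair(0.5, 0.2))           # -> (0.625, 0.5125)
```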

4.2.1 Significance test for ACC

In order to test the significance of the ACC in a small image region of size 3 × 3, the proposed method adopts the test statistic expressed in Eq. (31), which is presented by Pena and Rodriguez (2002).

$$ {\text{AC}}_{m} = \left[ {1 - \left| {\tilde{\rho }_{m} } \right|^{{\frac{1}{m}}} } \right] $$
(31)

where \( \tilde{\rho }_{m} \) is the correlation matrix, which is formulated using the standardized ACCs, \( \tilde{r}_{m} \); m is the order of the AC. The correlation matrix is expressed in Eq. (32):

$$ \tilde{\rho }_{m} = \left[ {\begin{array}{*{20}c} 1 & {\tilde{r}_{1} } & \ldots & {\tilde{r}_{m} } \\ {\tilde{r}_{1} } & 1 & \ldots & {\tilde{r}_{m - 1} } \\ \vdots & \vdots & \ddots & \vdots \\ {\tilde{r}_{m} } & {\tilde{r}_{m - 1} } & \ldots & 1 \\ \end{array} } \right];\quad \tilde{r}_{m}^{2} = \frac{(n + 2)}{(n + k)}r_{m}^{2} $$
(32)

where n is the number of samples and m is the lag variable.

The significance test on \( {\text{AC}}_{m} \) is performed based on the confidence measure, \( \pm \alpha (\sigma /\sqrt n ) \), where α represents the significance level and σ denotes the standard deviation. To perform the test at a given significance level α, under H0 that there is no AC at lag m, the computed \( {\text{AC}}_{m} \) is compared with the interval \( \left( { - \alpha (\sigma /\sqrt n ),\; + \alpha (\sigma /\sqrt n )} \right) \). If \( {\text{AC}}_{m} \) falls outside the interval, H0 is rejected at significance level α, i.e. the image pixels are correlated; otherwise, H0 is accepted, i.e. the image pixels are not correlated. If the AC is high, it is assumed that the small image region represents a texture feature; otherwise, the region represents a shape feature. If the AC test accepts H0 at significance level α, the ACC value is stored in a vector as depicted in Eq. (33) and treated as a shape feature; otherwise, it is treated as a texture feature and stored in a vector as depicted in Eq. (34).
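A minimal sketch of this decision rule is given below, assuming the correlation matrix of Eq. (32) has already been built following Pena and Rodriguez (2002); only the statistic of Eq. (31) and the comparison against the band \( \pm \alpha (\sigma /\sqrt n ) \) are shown.

```python
import numpy as np

# Minimal sketch of Eq. (31) and the texture/shape decision described above.
# rho_tilde: the (m+1) x (m+1) standardized correlation matrix of Eq. (32).
def is_texture(rho_tilde, sigma, n, alpha=0.05):
    m = rho_tilde.shape[0] - 1                                 # order of the AC
    ac_m = 1.0 - abs(np.linalg.det(rho_tilde)) ** (1.0 / m)    # Eq. (31)
    limit = alpha * sigma / np.sqrt(n)
    # Outside the band -> H0 (no autocorrelation) rejected -> pixels correlated -> texture.
    return abs(ac_m) > limit, ac_m
```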

The above process is repeated for all submodels (H, S, Y, Cb, and Cr), and finally, the \( {\text{FV}}^{ks} \) and \( {\text{FV}}^{kt} \) are merged into a single holistic FV of the key face image, which is denoted as \( {\text{FV}}^{k} \).

$$ \overrightarrow {\text{FV}}^{ ks} = \left\{ { \overrightarrow {\text{H}}^{ s} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{S}}^{ s} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Y}} {}^{s}\left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Cb}}^{ s} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Cr}}^{ s} \left( {\rho_{1i} ,\rho_{2i} } \right) } \right\} $$
(33)
$$ \overrightarrow {\text{FV}}^{kt} = \left\{ { \overrightarrow {\text{H}}^{ t} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{S}}^{ t} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Y}}^{ t} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Cb}}^{ t} \left( {\rho_{1i} ,\rho_{2i} } \right),\overrightarrow {\text{Cr}}^{ t} \left( {\rho_{1i} ,\rho_{2i} } \right) } \right\} $$
(34)

where \( {\text{FV}}^{ks} \) and \( {\text{FV}}^{kt} \) denote the shape and texture FVs of the key face image, respectively; i indexes the sliding windows (the number of feature samples) of the face image.

Similarly, the FVs of the target face images in the face image database are computed, denoted by \( {\text{FV}}^{tr} \), and stored in the FV database. The formulation of \( {\text{FV}}^{tr} \) is discussed in Sect. 6.

$$ \begin{aligned} \overrightarrow {\text{FV}}^{ \, k} & = \left( {\rho_{1i}^{\text{Hs}} ,\rho_{2i}^{\text{Hs}} ,\rho_{1i}^{\text{Ss}} ,\rho_{2i}^{\text{Ss}} ,\rho_{1i}^{\text{Ys}} ,\rho_{2i}^{\text{Ys}} ,\rho_{1i}^{\text{Cbs}} ,\rho_{2i}^{\text{Cbs}} ,\rho_{1i}^{\text{Crs}} ,} \right. \\ & \quad \left. {\rho_{2i}^{\text{Crs}} ,\rho_{1i}^{\text{Ht}} ,\rho_{2i}^{\text{Ht}} ,\rho_{1i}^{\text{St}} ,\rho_{2i}^{\text{St}} ,\rho_{1i}^{\text{Yt}} ,\rho_{2i}^{\text{Yt}} ,\rho_{1i}^{\text{Cbt}} ,\rho_{2i}^{\text{Cbt}} ,\rho_{1i}^{\text{Crt}} ,\rho_{2i}^{\text{Crt}} } \right) \\ \end{aligned} $$
(35)
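The assembly of the twenty-element descriptor can be sketched as follows; how the window-level ACCs are pooled into one \( (\rho_{1} ,\rho_{2} ) \) pair per submodel is an assumption made here for illustration.

```python
import numpy as np

# Minimal sketch of Eqs. (33)-(35): per colour submodel, pool the ACCs of the
# shape windows and of the texture windows, then concatenate into the
# 20-element descriptor FV^k. Averaging is an illustrative pooling choice.
CHANNELS = ["H", "S", "Y", "Cb", "Cr"]

def build_fv(shape_accs, texture_accs):
    """shape_accs / texture_accs: dicts mapping channel name -> list of (rho1, rho2)."""
    fv = []
    for accs in (shape_accs, texture_accs):          # shape part first, then texture
        for ch in CHANNELS:
            pairs = np.asarray(accs[ch], dtype=float)
            rho1, rho2 = pairs.mean(axis=0)          # pool the window-level ACCs
            fv.extend([rho1, rho2])
    return np.array(fv)                              # length 20, as in Eq. (35)
```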

5 Proposed face recognition test statistics

The combined \( {\text{FV}}^{k} \) in Eq. (35) is treated as a measurable feature space. Let us assume that the FV space is a sigma field (Ω). The shape and texture features are regarded as subfields of Ω; the shape and texture subspaces are denoted by \( S \) and \( T \), respectively, such that \( S \subset\Omega \) and \( T \subset\Omega \), with \( S \cup T \in\Omega \) and \( S \cap T \in\Omega \). The subspaces \( S \) and \( T \) are assumed to be enfolded Gaussian random fields (Carbó-Dorca and Besalú 2011; Kukush 2019). Let \( F^{{k_{i} }} \) and \( F^{{tr_{i} }} \) (i = 1, 2, …, 20) be multidimensional Gaussian distribution functions, denoted \( F^{{k_{i} }} \sim N\left( {M_{k} ,W_{k} } \right) \) and \( F^{{tr_{i} }} \sim N\left( {M_{t} ,W_{t} } \right) \), with mean vector \( M_{( \cdot )} \) and covariance matrix \( W_{( \cdot )} \). Each feature space, shape and texture, has five submodels extracted from the HS–YCbCr colour model, and each submodel contributes two feature elements, \( \rho_{1} \) and \( \rho_{2} \), as illustrated in Eq. (35). As a result, the FV consists of twenty elements.

In order to identify the target face image, the multivariate parametric tests, such as the test for equality of covariance matrices and the test for equality of mean vectors, are employed. The test for equality of covariance between the \( {\text{FV}}^{k} \) and \( {\text{FV}}^{tr} \) is expressed in Eq. (36).

Let \( f_{\alpha }^{(g)} ;\;\alpha = k,\;tr;\;g = 1,2, \ldots ,10 \) be an observation from the gth FV of the key (k) face image or the target (tr) image, which follows a Gaussian distribution with mean vector \( M_{( \cdot )} \) and covariance matrix \( W_{( \cdot )} \); the corresponding laws of distribution are denoted \( F_{{k_{i} }} \sim N\left( {M_{k} ,W_{k} } \right) \) and \( F_{{tr_{i} }} \sim N\left( {M_{tr} ,W_{tr} } \right) \).

5.1 Test for equality of variation between key and target face images

To test whether the covariance matrices of the key and target images are the same or not, the hypothesis is formulated as follows:

$$ \begin{array}{*{20}l} {{\mathbf{H}}_{0} :\;\varSigma_{k} = \varSigma_{tr} } \hfill & {{\text{Null}}\;{\text{hypothesis}}} \hfill \\ {{\mathbf{H}}_{\text{a}} :\;\varSigma_{k} \ne \varSigma_{tr} } \hfill & {{\text{Alternative}}\;{\text{hypothesis}}} \hfill \\ \end{array} $$

The equality of the covariance matrices of the key and target face images, each with 20 feature components from multivariate normal populations, can be examined against the alternative of general positive definite matrices. The test statistic given in Eq. (36) quantifies the similarity (distance) between the key and target face images and is approximately distributed as Chi-square with \( {1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}\left( {q - 1} \right)p\left( {p + 1} \right) \) degrees of freedom (Anderson 2003).

$$ D_{\text{Cov}} = \left( {\sum\limits_{g = 1}^{q} {n_{g} \ln \left| W \right|} - \sum\limits_{g = 1}^{q} {n_{g} \ln \left| {W_{g} } \right|} } \right)C^{ - 1} $$
(36)

where

$$ \begin{aligned} W_{g} & = \frac{1}{{N_{g} }}S_{g} , \\ S_{g} & = \sum\limits_{g } {\left( {f_{\alpha }^{g} - \overline{f}^{(g)} } \right)}^{\text{T}} \left( {f_{\alpha }^{g} - \overline{f}^{(g)} } \right);\quad \alpha = k,tr \\ \overline{f}^{(g)} & = \frac{1}{{N_{\alpha } }}\sum\limits_{g}^{{N_{g} }} {f_{\alpha }^{(g)} } ;\quad \alpha = k,tr \\ C^{ - 1} & = 1 - \frac{{2g^{2} + 3g - 1}}{{6\left( {g + 1} \right)\left( {\alpha - 1} \right)}}\left( {\sum\limits_{\alpha = 1}^{2} {\frac{1}{{n_{\alpha } }} - \frac{1}{{\varSigma n_{\alpha } }}} } \right) \\ \end{aligned} $$
(37)

where \( S_{g} \) is the sum of squares and cross-products of variation about the mean of the components in a sample drawn from the multivariate normal population; \( N_{g} \) is the number of observations of the gth component of the αth population. The covariance matrix \( W_{g} \) is positive definite in the parameter space, \( \Omega \).

Let \( S_{g} \) be the unbiased estimate of \( W_{g} \), based on \( n_{g} \) degrees of freedom, where \( n_{g} = N_{g} - 1 \) for the usual case of a random sample of \( N_{g} \) observation vectors from the gth population. While \( {\mathbf{H}}_{0} \) is true,

$$ S = \frac{1}{{\sum n_{\alpha } }}\sum\limits_{\alpha = 1}^{2} {n_{\alpha } S_{\alpha } } $$
(38)

is the pooled estimate of the covariance matrices.

If all the \( n_{i} \) (the numbers of pixels in the key and target images) are equal, then \( C^{ - 1} \) can be written as follows:

$$ C^{ - 1} = 1 - \frac{{\left( {2g^{2} + 3g - 1} \right)\left( {\alpha + 1} \right)}}{{6\left( {g + 1} \right)\alpha \left( {N - 1} \right)}} $$
(39)

In the proposed work, there are only two populations, the key face image and the target face image (α = 2), and each population consists of twenty FV elements (g = 20), as depicted in Eqs. (35) and (45). The objective of this study is to test the null hypothesis \( {\mathbf{H}}_{0} :\;W_{k} = W_{tr} \), where \( W_{k} \) and \( W_{tr} \) represent the covariance matrices of the key and target face images. The pooled estimate of the common covariance matrix, S, can be calculated as follows:

$$ S = \left( {\frac{1}{{\left( {N_{k} - 1} \right) + \left( {N_{tr} - 1} \right)}}} \right) \left[ {\left( {N_{k} - 1} \right)S_{k} + \left( {N_{tr} - 1} \right)S_{tr} } \right] $$
(40)

The test statistic for the case of two populations is given by

$$ D_{\text{Cov}} = \left\{ {\left( { \left[ {\left( {N_{k} - 1} \right) + \left( {N_{tr} - 1} \right)} \right] \ln \left| S \right|} \right) - \left( {\left[ {\left( {N_{k} - 1} \right)\ln \left| {S_{k} } \right| + \left( {N_{tr} - 1} \right)\ln \left| {S_{tr} } \right|} \right] } \right)} \right\}C^{ - 1} $$
(41)

Critical region If \( D_{\text{Cov}} \le \chi_{\nu }^{2} (\alpha ) \), then H0 is accepted, meaning that the covariance matrices of the key and target face images are closely related, i.e. the key and target face images are the same or similar with respect to covariance. Otherwise, H0 is rejected, i.e. the distance between the covariance matrices of the key and target face images is greater than the critical value of the Chi-square distribution. Rejection of the hypothesis means that the key face image does not match the target face image in terms of covariance, where \( \nu = {1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}\left( {\alpha - 1} \right)g\left( {g + 1} \right) \) and \( \alpha \) is the level of significance. If the hypothesis is rejected, the testing process is dropped, and another facial image is taken from the database.
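A minimal sketch of this covariance test for two samples of 20-element FVs is given below; the scale factor \( C^{ - 1} \) of Eq. (39) is omitted for brevity, so the sketch is an approximation of Eqs. (40) and (41) rather than the exact statistic.

```python
import numpy as np
from scipy.stats import chi2

# Approximate sketch of Eqs. (40)-(41) for two samples of 20-element FVs
# (rows = windows, columns = features). C^{-1} of Eq. (39) is omitted.
def covariance_test(fv_key, fv_tr, alpha=0.15):
    nk, ntr = len(fv_key), len(fv_tr)
    g = fv_key.shape[1]
    Sk = np.cov(fv_key, rowvar=False)
    Str = np.cov(fv_tr, rowvar=False)
    S = ((nk - 1) * Sk + (ntr - 1) * Str) / ((nk - 1) + (ntr - 1))        # Eq. (40)
    _, ld_S = np.linalg.slogdet(S)
    _, ld_Sk = np.linalg.slogdet(Sk)
    _, ld_Str = np.linalg.slogdet(Str)
    d_cov = ((nk - 1) + (ntr - 1)) * ld_S - ((nk - 1) * ld_Sk + (ntr - 1) * ld_Str)
    dof = 0.5 * (2 - 1) * g * (g + 1)          # nu with two populations, g features
    return d_cov <= chi2.ppf(1 - alpha, dof), d_cov
```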

5.2 Test for equality of mean vectors of key and target face images

If the key and target face images pass the above covariance test, one can proceed to test the mean vectors of the two images, i.e. the test for equality of mean vectors; otherwise, the matching process is stopped and the next image in the database is considered. The test statistic for testing the mean vectors of the features of the key and target images is given in Eq. (42). The hypothesis is formulated as follows:

Tests of Hypothesis

$$ \begin{array}{*{20}l} {{\mathbf{H}}_{0} :\;M_{k} = M_{tr} } \hfill & {{\text{Null}}\;{\text{hypothesis}}} \hfill \\ {{\mathbf{H}}_{\text{a}} :\;M_{k} \ne M_{tr} } \hfill & {{\text{Alternative}}\;{\text{hypothesis}}} \hfill \\ \end{array} $$

The test statistic for testing the mean vectors of the two populations, i.e. the key and target face images, is

$$ {\text{MV}}_{\text{Mean}} = \frac{{N_{k} N_{tr} }}{{N_{k} + N_{tr} }}\left( {\overline{M}_{{_{k} }} - \overline{M}_{{_{tr} }} } \right)^{\prime } S^{ - 1} \left( {\overline{M}_{{_{k} }} - \overline{M}_{{_{tr} }} } \right) $$
(42)

where

$$ \overline{M}_{{_{k} }} = \frac{1}{{N_{k} }}\sum\limits_{g = 1}^{{N_{g} }} {f_{k}^{g} } ;\quad {\text{and}}\quad \overline{M}_{{_{tr} }} = \frac{1}{{N_{tr} }}\sum\limits_{g = 1}^{{N_{g} }} {f_{tr}^{g} } ; $$
(43)

which is distributed as Chi-square with \( \nu = {1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-0pt} 2}\left( {\alpha - 1} \right)g\left( {g + 1} \right) \) degrees of freedom (Shi et al. 2010).

Critical region If \( {\text{MV}}_{\text{Mean}} \le \chi_{\nu }^{2} \left( \alpha \right) \), then H0 may be accepted, that is, the key and target face images are the same. If \( {\text{MV}}_{\text{Mean}} > \chi_{\nu }^{2} \left( \alpha \right) \), then the H0 may be rejected, i.e. the key and target face images are different.
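A corresponding sketch of the mean-vector test in Eqs. (42) and (43), reusing the pooled covariance of Eq. (40), is given below.

```python
import numpy as np
from scipy.stats import chi2

# Minimal sketch of the mean-vector test in Eqs. (42)-(43).
def mean_vector_test(fv_key, fv_tr, alpha=0.15):
    nk, ntr = len(fv_key), len(fv_tr)
    g = fv_key.shape[1]
    mk, mtr = fv_key.mean(axis=0), fv_tr.mean(axis=0)                    # Eq. (43)
    S = ((nk - 1) * np.cov(fv_key, rowvar=False)
         + (ntr - 1) * np.cov(fv_tr, rowvar=False)) / (nk + ntr - 2)     # Eq. (40)
    diff = mk - mtr
    mv = (nk * ntr / (nk + ntr)) * diff @ np.linalg.solve(S, diff)       # Eq. (42)
    dof = 0.5 * (2 - 1) * g * (g + 1)
    return mv <= chi2.ppf(1 - alpha, dof), mv
```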

If the key and target face images pass both tests, i.e. the test for equality of covariance matrices and the test for equality of mean vectors, at the 15% significance level, the target face image is marked and indexed. The indexed face images are displayed to the user. The 15% significance level was fixed after conducting rigorous experiments.

The overall functions of the proposed face recognition method are given in algorithmic form as follows.

figure a

6 Experiments and results

This section deals with the database construction, the performance measure, and the experimental results and discussion.

6.1 Experimental setups

6.1.1 Face image datasets and feature databases construction

Vintra, a Californian analytics specialist and IT service provider, reported that most Western face recognition algorithms are racially biased because they have been formulated on publicly accessible databases containing a "supermajority" of white faces. It also reported that a tweak in the algorithmic formulation alone cannot address this problem (News 2020b).

To validate the proposed face recognition method, and taking the above report into account, we have constructed a face image database of our own; its details are given below. We have also considered the GT dataset (Mahmoodi and Sayedi 2015; Muqeet and Holambe 2019), the LFW dataset (Mahmoodi and Sayedi 2015), the Pointing'04 dataset (Dantcheva et al. 2018), and the BioID dataset (BioID) for the experiments.

GT dataset The Georgia Tech face dataset is composed of 750 colour face images of 50 subjects. Each subject is represented by 15 colour JPEG images with a cluttered background, taken at a resolution of 640 × 480 pixels. The average size of the faces in these images is 150 × 150 pixels. The pictures show frontal and tilted faces with different facial expressions, lighting conditions, and scales. Each image is manually labelled to determine the position of the face in the image.

LFW dataset The LFW dataset is one of the most popular datasets in the face recognition domain. It comprises face images with the variations of real-world situations, including significant variations in head pose, lighting, and skin colour. LFW consists of 13,233 face images of 5749 different subjects of different ages and genders.

Pointing’04 dataset This dataset contains 15 subjects with two series of 93 images of each subject at different head poses. The face images of the 15 subjects are with and without glasses, and the skin colour varies from one subject to another. The head pose orientation is determined by two directions (horizontal and vertical) that range from − 90° to + 90°.

BioID face dataset (BioID, http://www.bioid.com/downloads/facedb/index.php) It is composed of 1521 greyscale images of 23 persons with various illuminations, expressions, pose variations, and partial occlusions. The face area is located in different positions within a complex background of size 384 × 286 pixels.

Our dataset It is composed of 1503 face images of celebrities gathered from Google over the Internet, along with many different face images captured by a digital camera.

At first, the face objects were cropped from the face datasets and pre-processed, i.e. the noise was removed. The noise was removed or alleviated by employing the method discussed in Sect. 3. The noise removal process is repeated until it saturates; saturation means that the threshold (T) falls within the band given in Eq. (44).

$$ T = \frac{{{\text{CoV}}_{{t_{n - 1} }} }}{{{\text{CoV}}_{{t_{n} }} }} \in \left[ {0.95, 1.05} \right] $$
(44)

where \( {\text{CoV}}_{\left( \cdot \right)} = \left( {{\sigma \mathord{\left/ {\vphantom {\sigma \mu }} \right. \kern-0pt} \mu }} \right) \times 100 \), and \( {\text{CoV}}_{{t_{n - 1} }} \) and \( {\text{CoV}}_{{t_{n} }} \) denote the coefficients of variation for the (n − 1)th and nth passes over the input face image.

The pre-processed face images were converted to the HS–YCbCr colour model, and the features were extracted by deploying the techniques discussed in Sect. 4, as was done for the key face image. A face FV database was constructed as discussed in Seetharaman and Jeyakarthic (2014), and the extracted face features were stored in the feature database. The structure of the FVs in the FV database is shown in Eq. (45). The face features were clustered into various homogeneous groups using the fuzzy c-means clustering algorithm (Bezdek 1981). The median value was computed for each cluster, and the FVs were indexed based on the median value. A link was established in one-to-one correspondence between the cluster index of the FV database and the corresponding actual face images in the image database, which facilitates quick recognition for users.
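A minimal sketch of this indexing step is given below; the paper uses fuzzy c-means (Bezdek 1981), and plain k-means from scikit-learn is used here purely as an illustrative stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch of the FV-database indexing step. Plain k-means is an
# illustrative stand-in for the cited fuzzy c-means (Bezdek 1981).
# fvs: array of shape (num_images, 20) holding one FV^tr per target face image.
def build_index(fvs, n_clusters=10):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fvs)
    index = {}
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        median_fv = np.median(fvs[members], axis=0)   # cluster median used as the index key
        index[c] = {"median": median_fv, "image_ids": members.tolist()}
    return index
```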

$$ \begin{aligned} \overrightarrow {\text{FV}}^{ tr} & = \left( {\rho_{1i}^{\text{Hs}} ,\rho_{2i}^{\text{Hs}} ,\rho_{1i}^{\text{Ss}} ,\rho_{2i}^{\text{Ss}} ,\rho_{1i}^{\text{Ys}} ,\rho_{2i}^{\text{Ys}} ,\rho_{1i}^{\text{Cbs}} ,\rho_{2i}^{\text{Cbs}} ,\rho_{1i}^{\text{Crs}} ,} \right. \\ & \quad \left. {\rho_{2i}^{\text{Crs}} ,\rho_{1i}^{\text{Ht}} ,\rho_{2i}^{\text{Ht}} ,\rho_{1i}^{\text{St}} ,\rho_{2i}^{\text{St}} ,\rho_{1i}^{\text{Yt}} ,\rho_{2i}^{\text{Yt}} ,\rho_{1i}^{\text{Cbt}} ,\rho_{2i}^{\text{Cbt}} ,\rho_{1i}^{\text{Crt}} ,\rho_{2i}^{\text{Crt}} } \right) \\ \end{aligned} $$
(45)

In order to validate that the proposed ARBBPNN method is robust to head pose, a number of images with different head poses were collected from the database proposed in Gourier et al. (2004): the head pose ranges from − 90° to + 90° (downward to upward) at intervals of 15°, as well as from − 90° to + 90° (left to right at each 15° step of the downward-to-upward direction). Several face images with different facial expressions of the subjects have also been included in the face image database. A few of them are presented as samples in Figs. 4 and 5. These face images have also been incorporated into the newly constructed image database, which contains 2173 images. Only a few of them are presented in Fig. 4 owing to space constraints.

Fig. 4

Pointing’04 dataset. Face images with different angles

Fig. 5 Face images with different expressions of the same person and other similar images of different persons (celebrity face images collected from Google)

6.1.2 System specification

The proposed method was implemented in Python with OpenCV (cv2) on a PC with the following specification: Intel Core i5 processor at 3.5 GHz, 16 GB DDR4 RAM, Asus D97 motherboard, and a 4 GB video card.

6.2 Performance measure

To validate the performance of the ARBBPNN method, Precision@α (P@α) was deployed, where α denotes the significance level. P@α is computed over the entire dataset as follows:

$$ P@\alpha = \frac{1}{m}\sum\limits_{C = 1}^{m} {\left( {\frac{1}{n}\sum\limits_{g = 1}^{n} {P_{g} } } \right)} $$
(46)

where

$$ P = \frac{\text{Number of similar images among the identified images}}{\text{Total number of images identified}} $$

g indexes the head poses and facial expressions of a person (n in total); C indexes the persons in a dataset (m in total); α ranges from 1 to 25% at intervals of 5%. As precision, recall, and F1 score are the most popular performance measures in the domains of pattern recognition and information retrieval, their definitions are not repeated in this paper.
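For clarity, Eq. (46) can be computed as in the following sketch, where identified_ids and relevant_ids are illustrative placeholders for the images returned at a given α and the ground-truth matches, respectively.

```python
def precision_for_query(identified_ids, relevant_ids):
    """P of Eq. (46): fraction of identified images that are truly similar."""
    if not identified_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for i in identified_ids if i in relevant)
    return hits / len(identified_ids)

def precision_at_alpha(results):
    """results[c][g] = (identified_ids, relevant_ids) for person c and
    pose/expression g. Averages P over the n queries of each person,
    then over the m persons, as in Eq. (46)."""
    per_person = []
    for queries in results:                      # loop over the m persons (index C)
        per_query = [precision_for_query(ident, rel) for ident, rel in queries]
        per_person.append(sum(per_query) / len(per_query))
    return sum(per_person) / len(per_person)     # P@alpha for the whole dataset
```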

6.3 Experimental results

The constructed face image database was subjected to experiments to validate the proposed ARBBPNN method. The face image in Fig. 6a was submitted to the proposed method. At first, the input colour key face image was denoised based on the technique briefed in Sect. 3, as discussed earlier; the pre-processed image is given in Fig. 6. Subsequently, the given key image was converted to the HS–YCbCr colour model. The ACC was computed from each colour component—H, S, Y, Cb, Cr—based on a sliding window of size 3 × 3. The computed ACC was tested for high correlation using the significance test discussed in Sect. 4: if the ACC is highly correlated, it is counted among the texture features; otherwise, it is treated as a shape feature. For example, the shape and texture parts of an image are shown in Fig. 7. The process was repeated over the entire input image. The extracted shape and texture features were stored in a two-dimensional array, which is treated as an n-dimensional Euclidean space with Gaussian enfoldment. The shape and texture features derived from each component of the HS–YCbCr colour model were formulated as an FV as described in Eqs. (33) and (34). The shape and texture FVs were then combined, so that there are twenty feature attributes for each key and target face image, as presented in Eqs. (35) and (45).

The test statistics discussed in Sect. 5 were deployed at different significance levels. The image in Fig. 8a was submitted as an input key to the proposed system, and the identified same or similar images are shown in Fig. 8b. Images in row 1 of Fig. 8b were recognized at the 5% level of significance; images in rows 1 and 2 were identified at the 10% level; at the 15% level, the images in rows 1, 2, and 3 were recognized; while fixing α at 20%, the images in rows 1–4 were recognized. All images in Fig. 8b were recognized while fixing α at 25%.
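The texture/shape split described above can be sketched roughly as follows. This is a simplified illustration under two assumptions that are not the authors' exact formulation: the ACC is taken to be a first-order autocorrelation coefficient over each 3 × 3 window, and "highly correlated" is decided by comparing it with a critical value r_crit standing in for the significance test of Sect. 4.

```python
import numpy as np

def window_acc(win):
    """First-order (horizontal-lag) autocorrelation coefficient of a 3x3 window."""
    x = win[:, :-1].ravel().astype(float)   # pixels
    y = win[:, 1:].ravel().astype(float)    # right-hand neighbours
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return np.corrcoef(x, y)[0, 1]

def split_shape_texture(channel, r_crit=0.6):
    """Label each pixel of one colour component (H, S, Y, Cb, or Cr) as
    texture (ACC highly correlated) or shape (otherwise)."""
    h, w = channel.shape
    texture_mask = np.zeros((h, w), dtype=bool)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            acc = window_acc(channel[i - 1:i + 2, j - 1:j + 2])
            texture_mask[i, j] = abs(acc) >= r_crit   # highly correlated -> texture
    return texture_mask                                # ~texture_mask -> shape
```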

Fig. 6 Pre-processed images: a noised input image; b denoised image at threshold t1 (first pass); c denoised image at threshold t2 (second pass); d denoised image at threshold t3 (third pass)

Fig. 7 a Actual colour image; b greyscale image; c shape pattern; d texture pattern

Fig. 8 a Input key image; b identified images

We also examined the robustness of the proposed method to rotation and scale. The face images with different head poses of the fifteen persons of the Pointing'04 dataset were merged into a grand dataset and subjected to the experiments. More than fifty per cent of the head-pose images were randomly selected from each subject, and the experiment was conducted at various significance levels; however, only a few of the results are presented here owing to space constraints. The image in Fig. 9a was submitted to the ARBBPNN system, and the corresponding identified same or similar images are presented in Fig. 9b. The images in row 1 were detected at the 5% level of significance; at 10%, the system returned the images in rows 1 and 2; at 15%, the images in rows 1, 2, and 3 were identified; at 20%, the images in rows 1–4 were identified. All images in Fig. 9b were recognized while fixing α at 25%.

Fig. 9 Robustness to head pose, rotation, and scale: a input key image; b identified output images

Furthermore, the image dataset that contains face images with different expressions of South Asian people was subjected to the experiment, to examine the robustness of the proposed system to different facial expressions of a person. The image in Fig. 10a was submitted, and the same or similar images recognized by the system are given in Fig. 10b. The images in row 1 of Fig. 10b were detected at the 5% level of significance; the images in rows 1 and 2 were identified at the 10% level; the system returned no additional images at the 15% level; while fixing the significance level at 20%, it detected the images in rows 1, 2, and 3; all images in Fig. 10b were detected at the 25% level. The experimental results show that the proposed system addresses the racial bias highlighted in News (2020b), i.e. it performs well for all types of datasets (white faces, black faces, and skin tones in between, such as South Asian faces).

Fig. 10 Robustness of the proposed system to different expressions of a person: a input key image; b detected images

The average recognition rate was computed from the results obtained at a particular significance level for a subject with various combinations of the face images within the database, where "various combinations" means combinations of different head poses and facial expressions of the subject. The recognition rate obtained for each dataset is tabulated in Table 1, and a line graph drawn from it is illustrated in Fig. 11. The results in Table 1 reveal that the proposed method achieves a 100% average recognition rate for all the datasets at the 1% level of significance, except the BioID dataset. At 5%, it yields 98.7% and 80.1% for the LFW and BioID datasets, respectively, whereas it yields almost 100% for the other datasets. At 10%, the ARBBPNN method achieves about 100% for the GT, Pointing'04, and Our datasets, while showing 97.4% and 73.81% for the LFW and BioID datasets, respectively. While fixing α at 15%, the ARBBPNN method yields 99%, 97.1%, and 97% recognition rates for the Pointing'04, GT, and Our datasets, respectively, and 96% and 64.3% for the LFW and BioID datasets, respectively. At the 20% level of significance, it returns approximately 97%, 96%, 94.4%, and 94% recognition rates for the Pointing'04, GT, Our, and LFW datasets, respectively, while it achieves about 58.24% for the BioID dataset. At the 25% level, the proposed ARBBPNN method reports 96.5%, 94%, 93%, 91%, and 56% for the Pointing'04, GT, Our, LFW, and BioID datasets, respectively.

Table 1 Dataset-wise P@α

Fig. 11 Line graph for the different datasets: significance level versus retrieval rate

Furthermore, precision, recall, and F1 score were calculated dataset-wise. The average precision score for the LFW and BioID datasets is 95.8% and 68.5%, respectively, while it is around 97% for the other datasets. The ARBBPNN method returns a recall score of approximately 99% for the Pointing'04 dataset and 98% for the GT and Our datasets, and returns 97.2% and 74.4% for the LFW and BioID datasets, respectively. It yields F1 scores of about 96.5% and 71.4% for the LFW and BioID datasets, respectively, while it scores about 98% for the other datasets. The dataset-wise comparison shows that the ARBBPNN method performs better for the GT, Pointing'04, and Our datasets than for the LFW and BioID datasets. The following could be the main reasons:

Reasons for the LFW dataset

1. The LFW dataset is considerably larger than the others.

2. Several face images suffer from poor lighting, extreme pose, strong occlusion, and low resolution.

3. The LFW dataset is not designed to support statistically strong conclusions about its subgroups.

4. About 1680 of the subjects have two or more different facial images in the dataset, which increases the false-negative cases and, in turn, lowers the F1 score.

The above complications are not prevalent in the GT and Pointing'04 datasets; in particular, the Pointing'04 dataset is systematically organised per person. The BioID dataset also presents a problem, different from that of the LFW dataset, as given below. Nevertheless, although the proposed ARBBPNN method is not inferior to the state-of-the-art methods and requires minimal time compared to the existing methods, it has to be improved in the future to tackle the above problems. The obtained results are presented in Table 2, and a bar chart drawn from them is depicted in Fig. 12.

Table 2 Dataset-wise performance measure

Fig. 12 Dataset-wise performance measures

Reason for the BioID dataset

1. The shape and texture features are extracted from the greyscale image only. Hence, only four of the twenty features are extracted for the BioID dataset, which is minimal compared to the other datasets. This evidences that the HS–YCbCr colour model, and the features extracted from it, are more appropriate and sufficient for achieving high face-recognition accuracy.

Moreover, to validate the proposed ARBBPNN method, the obtained results were compared with state-of-the-art methods in terms of F1 score. The F1 score obtained by the ARBBPNN method and the F1 scores available in Mahmoodi and Sayedi (2015), Muqeet and Holambe (2019), and Liu et al. (2019) were taken into account for the comparison, which is given in Table 3. In Mahmoodi and Sayedi (2015), the F1 score is available only for the GT and LFW datasets; for the CRRF method (Liu et al. 2019), the F1 score was calculated for the LFW dataset; and we calculated the F1 score for the DIWT method using the rank-one score and the other results available in Muqeet and Holambe (2019). The results presented in Table 3 show that the proposed method gives better results than the state-of-the-art methods in terms of F1 score. Figure 13 illustrates the results given in Table 3.

Table 3 Comparison of different methods in terms of F1 score

Fig. 13 Comparison of the performance of different methods in terms of F1 score

7 Computational cost

In the literature, most articles discuss computational cost qualitatively rather than quantitatively (Muqeet and Holambe 2019; Paul et al. 2020). Therefore, this paper presents the computational cost for the proposed ARBBPNN method only. The computational cost was measured from the submission of the input key face image until the recognized face images were output. It was calculated phase-wise and is given in Table 4.
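As an illustration of the phase-wise measurement, the sketch below times each stage of the pipeline separately; the phase functions (preprocess, extract_features, match) are placeholders for the corresponding stages of the ARBBPNN pipeline, not the authors' code.

```python
import time

def timed(label, fn, *args, timings=None, **kwargs):
    """Run one phase and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[label] = time.perf_counter() - start
    return result

def recognize_with_timings(key_image, preprocess, extract_features, match):
    """Phase-wise cost from key-image submission to recognized output."""
    timings = {}
    clean = timed("pre-processing", preprocess, key_image, timings=timings)
    fv = timed("feature extraction", extract_features, clean, timings=timings)
    hits = timed("matching", match, fv, timings=timings)
    timings["total"] = sum(timings.values())
    return hits, timings
```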

Table 4 Dataset-wise computational cost for the proposed ARBBPNN method

The obtained results show that the overall time consumed for the BioID dataset was lower than for the others because it has only one component, viz. the greyscale (Y submodel), while the others have five submodels. This is the main reason why only four feature attributes, namely \( Y^{trt}(\rho_{1}, \rho_{2}) \) and \( Y^{trs}(\rho_{1}, \rho_{2}) \), are extracted from the BioID dataset, so it consumes less time for feature extraction. It also shows the insufficiency of these features, which is reflected in the recognition accuracy; the obtained recognition results are presented in Table 3. The other datasets, except LFW, demand almost the same computational cost. Because the LFW dataset is composed of 13,233 face images, which is larger than the other datasets, it demands more time than the others.

8 Discussion and conclusion

8.1 Discussion

The proposed ARBBPNN method achieves good performance for biometric-based face recognition: it outperforms methods developed with hand-crafted features and is comparable to systems developed with deep neural networks. The ARBBPNN method acts as a trade-off between deep learning methods and hand-crafted methods, offering low computational cost and high accuracy. Because it employs the BBPNN only for feature extraction and adopts hand-crafted techniques for feature matching, it inherits properties of both deep learning and hand-crafted methods. At the same time, it maintains high accuracy, as the feature matching is performed in two stages: (1) test the variation between the key and target face images with respect to covariance; (2) test the equality of the spectra of the FVs of the key and target face images, provided the images pass the test at stage (1). Though the performance of the ARBBPNN method is comparable to the deep neural network-based methods, we plan to improve it in the future in the light of end-to-end deep learning methods.
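A highly simplified sketch of the two-stage gate described above is given below. The specific statistical tests belong to Sect. 5, so the variance-ratio check and the spectrum comparison used here (an F-type ratio and a normalized distance between FV magnitude spectra) are stand-ins chosen for illustration only, and the critical values f_low, f_high, and d_crit are hypothetical.

```python
import numpy as np

def two_stage_match(fv_key, fv_target, f_low=0.8, f_high=1.25, d_crit=0.1):
    """Stage 1: compare the variability (covariance proxy) of the two FVs.
    Stage 2: only if stage 1 passes, compare their magnitude spectra."""
    # Stage 1: variance-ratio gate (stand-in for the covariance test of Sect. 5)
    ratio = np.var(fv_key) / (np.var(fv_target) + 1e-12)
    if not (f_low <= ratio <= f_high):
        return False
    # Stage 2: spectrum-equality gate (stand-in for the spectral test of Sect. 5)
    spec_key = np.abs(np.fft.rfft(fv_key))
    spec_tgt = np.abs(np.fft.rfft(fv_target))
    dist = np.linalg.norm(spec_key - spec_tgt) / (np.linalg.norm(spec_key) + 1e-12)
    return dist <= d_crit
```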

The dataset-wise precision and F1 scores evidence that the HS–YCbCr colour model is sufficient for face characterization and feature extraction, and more effective than a single-channel model such as greyscale.

8.2 Conclusion

The proposed ARBBPNN method was developed for biometric-based face recognition in anticipation of the sudden demand for non-touch biometric systems, such as face recognition, due to the impact of the COVID-19 pandemic. Both public and private organizations that maintain biometric-based attendance systems will begin to shun touch-based methods such as fingerprint. The proposed system is a hybrid of hand-crafted and DNN features, so it achieves reasonable face recognition accuracy with minimal computational cost. Four different publicly available benchmark datasets—GT, LFW, BioID, and Pointing'04—were subjected to the experiments. In addition, we constructed a dataset of celebrity face images, which was also subjected to the experiments. The proposed ARBBPNN-based method achieves an average recognition rate of around 97.5% for the GT and Our datasets, and 98.1%, 96.5%, and 71.4% for the Pointing'04, LFW, and BioID datasets, respectively. The obtained average recognition results are comparable to state-of-the-art methods.

8.3 Future direction

The recognition accuracy of the ARBBPNN method can be improved in the future, even for datasets with a complicated nature or structure such as LFW and other datasets with massive numbers of facial images, in the light of end-to-end DNNs and higher-order AR models. The proposed ARBBPNN method could also be deployed for iris recognition and in the domain of big-data analytics.