Prolonged wakefulness induces adverse changes in cognitive performance (Alhola & Polo-Kantola, 2007; Killgore et al., 2008; Lim & Dinges, 2008). These changes are well established in the literature and include increased likelihood of errors, increased time to complete tasks or react to stimuli, and impaired attention, memory, and decision-making, among others (Åkerstedt & Philip, 2015; Dinges, 1995; Griffith & Mahadevan, 2015; Hafner, Stepanek, Taylor, Troxel, & van Stolk, 2017). Many studies have reported that prolonged wakefulness increases human-error-related accidents such as traffic accidents and chemical safety accidents (Åkerstedt & Philip, 2015; Dinges, 1995; Griffith & Mahadevan, 2015; Hafner et al., 2017; Philip et al., 2014). Workers such as transportation drivers, medical providers, and military personnel may be required to work long hours or at night, leaving them susceptible to prolonged wakefulness and the resulting negative impacts on their job performance. Decreased productivity and critical mistakes that result from prolonged wakefulness can be costly at both individual and societal levels (Akrout & Mahdi, 2013; Philip et al., 2014; Werner, Al-Hamadi, Limbrecht-Ecklundt, Walter, & Traue, 2018). To address these issues, researchers have been working to develop effective methods to detect and/or predict decrements in performance due to prolonged wakefulness before costly mistakes are made. These efforts have drawn on vocal features, electrodermal activity signals, and heart rate variability (McGlinchey et al., 2011; Posada-Quintero, Bolkhovsky, Qin, & Chon, 2018; Sloboda et al., 2018). However, vocal features are difficult to capture in noisy environments, and the other biosignal methods are limited in that they are sensitive to motion artifacts, expensive, and require electrodes to be attached to the skin or body.

One promising solution to detecting and predicting cognitive performance impairments during prolonged wakefulness is through tracking facial features and head movements. Humans express their social and emotional states through moving the facial muscles and the head, both consciously and unconsciously (Damasio, 1998; Dimberg, Thunberg, & Elmehed, 2000; Magai & McFadden, 1996; Sauter, Eisner, Ekman, & Scott, 2010). These emotional states include Ekman’s six basic emotions (Ekman, 1999): anger, disgust, fear, happiness, sadness, and surprise. In addition to the six basic emotions, composite emotions such as positive and negative valence (Adolph & Alpers, 2010; Cordaro et al., 2018) can also be estimated using movements of facial muscles or expressions (R. Ekman, 1997; Friesen & Ekman, 1983). Fatigue, often caused by prolonged wakefulness (National Academies of Sciences, Engineering, and Medicine; Division of Behavioral and Social Sciences and Education; Transportation Research Board; Committee on National Statistics; Board on Human-Systems Integration; & Panel on Research Methodologies and Statistical Approaches to Understanding Driver Fatigue Factors in Motor Carrier Safety and Driver Health, 2016), has also been suggested as an emotional state that may lead to adjustment of homeostatic balance and peripheral physiological changes (Gibson et al., 2003; LeDoux, 1998; Noakes, 2012). Noakes (2012) suggested that fatigue can be considered as an emotion regulating exercise behavior to protect the homeostasis of the body.

Therefore, facial expressions, facial emotions, and head movements can be considered in assessing prolonged wakefulness. Sundelin et al. (2013) reported that prolonged wakefulness affects facial features and may in certain cases make a face look sad. They found that prolonged wakefulness affects the eyelids, exacerbates wrinkle formation and lines around the eyes, and makes the corners of the mouth droop. Building on this finding, Peng, Luo, Glenn, Zhan, and Liu (2017) used machine-learning techniques with facial features (including eyelids, wrinkles around the eyes, and droopy corners of the mouth) to measure the degree of fatigue using self-taken photos, or selfies, from social media. Facial information has also often been exploited to detect fatigue while driving (B. Lee & Chung, 2012; Y. Zhang & Hua, 2015). B. Lee and Chung (2012) used the eye-closure feature, along with photoplethysmography and eye-blinking features, to detect fatigue while driving. Y. Zhang and Hua (2015) trained a support vector machine (SVM) classifier with facial features extracted around the mouth and eye regions to detect drivers’ fatigue. Also, several studies have used head movements to detect drivers’ drowsiness using a camera (Akrout & Mahdi, 2013; Friedrichs & Yang, 2010; Mittal, Kumar, Dhamija, & Kaur, 2016). These studies aimed to analyze and detect fatigue caused by prolonged wakefulness using facial features; however, it is still unclear whether prolonged wakefulness detected through facial features correlates with performance deterioration. Therefore, facial features need to be quantitatively evaluated against working or cognitive performance during prolonged wakefulness.

To summarize, prolonged wakefulness affects both cognitive performance and facial features (facial expressions, facial emotions, and head movements). As mentioned earlier, many methods have exploited facial features and head movements to detect prolonged wakefulness. However, no study has examined facial features in parallel with working or cognitive performance during prolonged wakefulness. If facial features obtained with an easily accessible and relatively low-cost device like a webcam can provide information that can be used to predict the deterioration of performance produced by prolonged wakefulness, the technology could be used to alleviate the harmful or even fatal consequences of performance impairment. Hence, we aimed to investigate and compare the changes in facial features (facial expressions, facial emotions, and head movements) and the deterioration of working and cognitive performance during prolonged wakefulness. We analyzed facial features obtained using a webcam while measuring working performance using the psychomotor vigilance task (PVT) for 25 hours. The PVT has often been used to study overall performance during prolonged wakefulness due to its reliability and limited confounding effects of aptitude and learning (Basner & Dinges, 2011; Basner, Mollicone, & Dinges, 2011; Kripke, Marler, Calle, Marler, & Calle, 2004; Lim & Dinges, 2008). The PVT enables a researcher to obtain neurobehavioral changes in vigilant attention, state stability, and impulsivity by measuring the time required to press a button in reaction to a visual stimulus (Goel, 2017). In addition, PVT indices have shown strong correlation with the duration of awake time during prolonged wakefulness (Lim & Dinges, 2008). We then applied machine-learning methods to test the feasibility of classifying PVT-defined performance deterioration using facial features, and to examine the predictive validity of a set of facial features correlated with PVT performance.

Methods

Participants

A total of 20 healthy participants were recruited (13 males and seven females, 19–32 years of age). Participants were paid hourly and offered extra compensation for completing the study, to motivate them to finish the experiment. Signed consent forms were collected before the experiments. Our choice of 20 subjects provides greater than 95% power to observe a significant effect (p < .05; Faul, Erdfelder, Lang, & Buchner, 2007; Sundelin et al., 2013). Furthermore, this sample size is sufficient to detect a correlation of at least 0.8 between PVT and facial indices at the .05 level of significance (Dorrian, Rogers, & Dinges, 2005; Posada-Quintero et al., 2018; Zar, 1999).

Stimuli and materials

Psychomotor vigilance task

The 10-minute PVT was performed using PC-PVT (a MATLAB-based tool) on a desktop computer (Khitrov et al., 2014). Participants were asked to click the left mouse button as fast as they could when a number indicating elapsed time appeared on a black background screen. Four PVT indices were calculated: average reaction time (AvRT), the number of major lapses (MaL; RT > 1 s), the number of minor lapses (MiL; 0.5 s < RT ≤ 1 s), and the number of false starts (FS), counted when participants clicked the mouse button before the number appeared. Reaction time is defined as the interval between the time the stimulus appeared on the screen and the time the participant clicked the mouse. Many studies that conducted the PVT during prolonged wakefulness have reported increases in AvRT, MiL, and FS (Basner & Dinges, 2011; Basner et al., 2011; Doran, Van Dongen, & Dinges, 2001; Posada-Quintero et al., 2018). MiL is significantly associated with physical fatigue (I.-S. Lee, Bardwell, Ancoli-Israel, & Dimsdale, 2010), and many studies have shown increases in MaL during prolonged wakefulness (Anderson, Wales, & Home, 2010; Posada-Quintero et al., 2018; C. Zhang et al., 2012).
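The four indices can be sketched as follows (a minimal Python illustration of the definitions above, not the PC-PVT implementation; function and variable names are ours):

```python
import numpy as np

def pvt_indices(reaction_times_s, false_start_count):
    """Summarize one 10-minute PVT run. `reaction_times_s` holds the valid
    reaction times in seconds; false starts are counted separately because
    no reaction time is recorded when the click precedes the stimulus."""
    rt = np.asarray(reaction_times_s, dtype=float)
    return {
        "AvRT": rt.mean(),                             # average reaction time
        "MaL": int((rt > 1.0).sum()),                  # major lapses: RT > 1 s
        "MiL": int(((rt > 0.5) & (rt <= 1.0)).sum()),  # minor lapses: 0.5 s < RT <= 1 s
        "FS": false_start_count,                       # clicks before the stimulus
    }
```

For example, a run with reaction times of 0.3 s, 0.6 s, and 1.2 s and two false starts would yield one major lapse, one minor lapse, and an AvRT of 0.7 s.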

Facial indices

Facial video recordings were obtained using a Logitech C920 HD webcam, placed in front of the participants on top of the screen. Facial indices were estimated using iMotions with Affectiva, as shown in Table 1 (McDuff et al., 2016). iMotions with Affectiva is commercially available software, based on the Affdex software development kit (SDK), that extracts perceived facial emotions, facial expressions, and head movements. Affectiva claims accuracy of key emotion detection in the high 90th percentile, validated using 6 million facial videos from more than 87 countries (Affectiva, 2017). A total of 34 facial features were collected within four categories: head movement, facial expression, perceived facial emotion, and composite indices. The head movement indices consist of three head orientations (Yaw, Pitch, and Roll), determined by estimating the head position in 3D space as Euler angles, as shown in Fig. 1, and the interocular distance between the two outer eye corners, essentially indicating a combination of the yaw and the movement between the face and the screen. Twenty facial expression and seven perceived emotion indices were collected, all ranging between 0 and 100. Perceived facial emotion was based on the emotional facial action coding system (EMFACS; Friesen & Ekman, 1983; McDuff et al., 2016). EMFACS determines the likelihood of perceived emotions (not actual felt emotions) based on facial expression changes (called “action units”) without bias from the investigators or techniques (Wolf, 2015). Moreover, valence, engagement, and attention were calculated by the iMotions with Affectiva software based on the perceived emotion indices and head orientations (Yaw, Pitch, and Roll), as shown in Table 2. Engagement is a measure of facial muscle activation ranging between 0 and 100, and valence indicates the intrinsic positivity or negativity of emotions, ranging between −100 and +100 (Frijda, 1986).

Table 1 Facial indices obtained using iMotions with Affectiva. Index names are from Affectiva
Fig. 1 Head orientations

Table 2 Relation between indices (Friesen & Ekman, 1983; Affectiva, 2017)

Design and procedure

All participants were asked to maintain a consistent sleep schedule prior to the day of the experiment and to record sleep diaries for the preceding week. They were asked to avoid any stimulating or caffeine-containing drinks or foods starting 48 hours before their experiment day. They were also asked to bring their own food on the experiment day, which was checked by the experimenters to ensure compliance. Participants completed a medical screening questionnaire to screen for medical issues, prevent unexpected accidents, and minimize confounding factors, such as medications that can affect prolonged wakefulness. Participants received a 30-minute training session within the two days before the start of the experiment.

Participants were asked to wake up at 6 AM and arrive at the building within 2 hours of waking. The experiments were conducted in a 3 × 3-meter lab in the Engineering and Science building on the Storrs campus of the University of Connecticut. The room temperature was adjusted to the preference of each participant. The participants stayed with the experimenters inside the building throughout the experiment. A total of 13 sessions per participant were performed, one every 2 hours for 25 hours. In every session, the PVT was performed after a 4-minute baseline recording without any test (Khitrov et al., 2014; Loh, Lamond, Dorrian, Roach, & Dawson, 2004). Facial indices were obtained in real time during each session. Experimenters monitored the participants to ensure they were awake throughout the study. This research complied with the tenets of the Declaration of Helsinki and was approved by the Institutional Review Board at the University of Connecticut.

Statistics

From the experiments, the facial indices and working performance during the PVT were obtained for the 13 sessions during the 25 hours of prolonged wakefulness, for all 20 subjects. Each participant’s facial index was divided by the Euclidean norm of that participant’s 13-session vector, in order to accommodate differences among subjects (Horn & Johnson, 1990). The Kolmogorov–Smirnov test was used to check the normality of each PVT index and each facial index for each participant. Significant differences were assessed using one-way analysis of variance (ANOVA) for normally distributed variables, while for nonnormally distributed variables we used Dunn’s test because of missing data. For these analyses, the Bonferroni method was used for multiple-comparison correction.
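The per-subject normalization can be sketched as follows (an illustrative Python snippet of the Euclidean-norm scaling described above, not the authors' code):

```python
import numpy as np

def normalize_by_session_norm(X):
    """X: array of shape (n_subjects, n_sessions) for a single facial
    index. Each subject's session vector is divided by its Euclidean
    (L2) norm, making the index comparable across subjects."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms
```

After this scaling, every subject's 13-session vector has unit Euclidean norm, so between-subject differences in overall magnitude are removed while the within-subject trend over sessions is preserved.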

For correlation analysis, the intersubject average value for each session was obtained for each facial index and measure of performance. The correlation coefficients were then calculated between each PVT index (AvRT, MaL, MiL, and FS) and each facial index. All facial indices and three PVT indices, AvRT, MiL, and FS were normally distributed, while MaL was nonnormally distributed. Thus, we calculated the Pearson and Spearman correlation coefficients for three PVT indices (AvRT, MiL, and FS) and MaL, respectively.
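The normality-dependent choice of correlation coefficient can be sketched as follows (a hedged Python illustration using SciPy; the function name is ours):

```python
from scipy import stats

def pvt_facial_correlation(pvt_session_means, facial_session_means, pvt_normal):
    """Correlate intersubject session means (one value per session).
    Pearson is used when the PVT index is normally distributed
    (AvRT, MiL, FS here); Spearman otherwise (MaL here)."""
    if pvt_normal:
        r, p = stats.pearsonr(pvt_session_means, facial_session_means)
    else:
        r, p = stats.spearmanr(pvt_session_means, facial_session_means)
    return r, p
```

Spearman's rank correlation makes no normality assumption, which is why it is the appropriate fallback for the nonnormally distributed MaL index.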

Each PVT value larger than the mean + standard deviation was set as a deterioration (Class 1), while the rest were set as normal (Class 0). Given the normality of the data, this threshold falls at approximately the 84.1st percentile, so the normal class comprises roughly 84.1% of the data, as follows:

$$ \begin{cases} \text{class } 1, & PVT_f > \mu_f + \sigma_f \\ \text{class } 0, & \text{otherwise} \end{cases}, \quad f \in \left\{ AvRT,\ MaL,\ MiL,\ FS \right\}. $$
(1)
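Equation 1 amounts to a simple thresholding rule, sketched below in Python (an illustration under our naming, not the authors' code):

```python
import numpy as np

def deterioration_labels(pvt_values):
    """Label sessions whose PVT value exceeds mean + SD as deteriorated
    (Class 1); all other sessions are labeled normal (Class 0)."""
    v = np.asarray(pvt_values, dtype=float)
    threshold = v.mean() + v.std()
    return (v > threshold).astype(int)
```

For example, a sequence of mostly similar reaction times with one extreme value yields Class 1 only for that extreme session.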

Receiver operating characteristic (ROC) curves were obtained for each facial index and the PVT indices of each participant in order to evaluate the feasibility of facial indices as detectors of performance deterioration during prolonged wakefulness (Fan, Upadhye, & Worster, 2006). An ROC curve plots the true positive rate against the false positive rate of a classification model at all classification thresholds separating the two classes (Class 1 and Class 0). The thresholds can be either predicted probabilities or, when a single feature is evaluated, the feature values themselves. For evaluating each feature, we used all possible values of each normalized facial index as thresholds. Table 3 shows an example of the ROC calculation using a feature threshold of 0.2. To evaluate the ROC curves, the area under the curve (AUC, ranging between 0 and 1) was computed for each, which summarizes performance across all possible classification thresholds. A higher AUC indicates that the facial feature is more sensitive for detecting performance deterioration caused by prolonged wakefulness. Figure 2 shows an example of the ROC curve and AUC for Eye Closure.
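Treating each facial-index value as a classification score, the per-feature ROC/AUC computation can be sketched as follows (a minimal illustration using scikit-learn; the authors' tooling is not specified):

```python
from sklearn.metrics import auc, roc_curve

def feature_auc(feature_values, labels):
    """Sweep every observed value of a single facial index as a
    classification threshold and integrate the resulting ROC curve.
    `labels` are the deterioration classes (1) vs. normal (0)."""
    fpr, tpr, _ = roc_curve(labels, feature_values)
    return auc(fpr, tpr)
```

A feature that perfectly separates deteriorated from normal sessions yields an AUC of 1; a feature carrying no information yields roughly 0.5.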

Table 3 An example of ROC curve calculation when a threshold (facial index) is 0.2
Fig. 2 An example of ROC curve (eye closure)

Finally, we performed machine learning to detect performance deterioration from facial features. We tested two data sets, consisting of (1) all facial features and (2) only the facial features highly correlated with the PVT indices. We used six different machine-learning models: support vector machines (SVM) with a linear kernel, with a radial basis function (RBF) kernel, and with a third-order polynomial kernel; logistic regression; random forest; and k-nearest neighbors (KNN). Each classification method has its strengths and weaknesses. SVM is one of the most popular machine-learning methods; it classifies or regresses linearly separable data sets by maximizing the margin between classes, and nonlinear problems can be solved using different types of kernels (e.g., RBF or polynomial kernels; Cortes & Vapnik, 1995). SVM has been used in many applications, in part due to its robustness to high-dimensional data, but it may not perform well on large or noisy data sets. Logistic regression is a generalized linear classifier that estimates the probability of a class (such as deterioration) using a sigmoid function (McCullagh, 2019). Logistic regression is simpler and requires fewer parameters to tune, but it cannot solve nonlinear problems. The random forest classifier is an ensemble learning method based on voting among multiple decision trees generated with different criteria (e.g., numbers of samples and features; Ho, 1995). It is suitable for high-dimensional and nonlinear data, and the voting strategy provides low bias and moderate variance; however, several parameters have to be tuned to avoid overfitting. In KNN classification, each sample’s class is determined by majority vote among its K nearest neighbors in the training data set (Altman, 1992). It is robust to noisy data sets but sensitive to irrelevant features (i.e., features have to be properly selected).

Data were standardized to zero mean and unit variance. Class weights were applied in all methods due to the imbalance of the data set (210 samples in the normal class and 40 samples in the deterioration class). The SVM parameters C and gamma were set to 1 and 0.5, respectively, for all kernels. Logistic regression was performed with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimizer (Fletcher, 2013). Random forest was run with 10 estimators. Finally, KNN was performed with K = 5. All models were trained both with all indices and with the set of indices that was highly correlated with the PVT outcomes. We evaluated the models using the leave-one-subject-out (LOSO) cross-validation approach (Koul, Becchio, & Cavallo, 2018), which holds out all samples from one subject as the test data set, uses the samples from all other subjects as the training data set, and repeats the procedure until every subject has been tested. This helps prevent overfitting, avoids subject bias, and maximizes the size of the training set (Dietterich, 1995; Ng, 1997). We then calculated the geometric mean score (the square root of the product of the sensitivity and specificity) of each method, again due to the imbalance of the data set. The geometric mean score measures the balance between the classification performance on the major and minor classes, weighting the accuracy of both classes equally through sensitivity and specificity (Akosa, 2017). We also used SHapley Additive exPlanations (SHAP) to evaluate the importance of each feature in terms of the degree of its contribution (Lundberg et al., 2019; Lundberg & Lee, 2017); SHAP evaluates the feature importance of machine-learning models using game theory and related statistical methods.
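The LOSO evaluation with a class-weighted linear model and geometric mean scoring can be sketched as follows (an illustrative Python pipeline using scikit-learn; implementation details beyond those stated in the text are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def loso_geometric_mean(X, y, subject_ids):
    """Leave-one-subject-out evaluation of a class-weighted linear
    classifier, scored with the geometric mean of sensitivity and
    specificity to handle the class imbalance."""
    model = make_pipeline(
        StandardScaler(),  # zero mean, unit variance
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    y_true, y_pred = [], []
    # Each fold holds out every sample from one subject.
    for train, test in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model.fit(X[train], y[train])
        y_true.extend(y[test])
        y_pred.extend(model.predict(X[test]))
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float(np.sqrt(sensitivity * specificity))
```

Because scaling and fitting happen inside each fold, no information from the held-out subject leaks into training, which is the point of the LOSO design.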

Results

Overall, all PVT indices showed similar trends, with significantly higher values for the last two to four sessions when compared with the rest of the sessions. Figures 3, 4, 5, 6, 7, and 8 display the mean ± standard error of the mean (SEM) of each PVT index and of each facial index highly correlated with the PVT indices. Significant differences between sessions, obtained using multiple-comparison tests, are indicated with numbers in the figures. In Fig. 3, all PVT indices showed increasing trends through all sessions, with the highest values in the last session. For AvRT and MaL, the last three sessions were significantly higher than the rest of the sessions, and both indices showed noticeable performance drops over those sessions. For MiL and FS, the last four and last two sessions, respectively, were significantly higher than the other sessions. Also, MiL increased monotonically until the ninth session, and FS showed an increasing trend with noticeable performance drops during the first nine sessions.

Fig. 3 Indices of performance on PVT. Mean ± SEM. The column numbers indicate their significant differences to each corresponding vertical x-axis session (p < .05)

Fig. 4 Pitch of the head movement indices. Mean ± SEM

Fig. 5 The upper facial expression. The column numbers indicate their significant differences to each corresponding vertical x-axis session (p < .05)

Fig. 6 The lower facial expression indices strongly correlated with the PVT indices. Mean ± SEM. The column numbers indicate their significant differences to each corresponding vertical x-axis session (p < .05)

Fig. 7 Perceived emotions. Mean ± SEM

Fig. 8 Composite indices. Mean ± SEM

Tables 4 and 5 show the correlation coefficients and the AUCs derived from the ROC curves, respectively, between each PVT index and each facial index. Twenty-one facial indices from all four categories showed high correlation coefficients with the PVT indices (AvRT, MaL, MiL, and FS). Pitch, from the head movement indices, showed a significant correlation with the PVT indices. Among the facial expressions, Brow Furrow, Brow Raise, Inner Brow Raise, Eye Closure, Lid Tighten, Lip Corner Depress, Upper Lip Raise, Mouth Open, Lip Pucker, Dimpler, Jaw Drop, Nose Wrinkle, and Chin Raise were highly correlated with the PVT indices. Finally, Anger, Surprise, Sadness, and Disgust among the perceived facial emotion indices, as well as all composite indices, were highly correlated with the PVT indices.

Table 4 Correlation coefficients between PVT and facial indices
Table 5 AUCs of facial indices with PVT indices

Among the head movement indices, Pitch showed significant correlations with the PVT indices AvRT, MaL, and FS (0.77, 0.64, and 0.73, respectively, p < .05) and MiL (0.84, p < .001). Pitch showed an increasing trend, except for slight decreases between the fifth and eighth sessions, as shown in Fig. 4. As shown in Table 5, it also had the highest AUCs with AvRT, MaL, MiL, and FS (0.62, 0.62, 0.61, and 0.62) among the head movement indices. Yaw and Roll correlated negatively with all PVT indices.

Figure 5 shows the facial expression indices for the upper part of the face that strongly correlated with the PVT indices. Although it showed strong correlations with AvRT and FS (0.56 and 0.65, p < .05), Brow Furrow exhibited the lowest AUC values with all PVT indices among these indices, as its last two values are similar to those of the second and sixth sessions in Fig. 5. Brow Raise showed noticeably higher values in the last three sessions, aside from a particularly high value in the fifth session, though the differences were not significant (possibly due to the high SEM). Inner Brow Raise, Eye Closure, and Lid Tighten showed higher values in the last three sessions than in the other sessions. Inner Brow Raise and Eye Closure showed a noticeable drop in the 12th session, while Lid Tighten showed a slightly higher value in the 12th session than in the two adjacent sessions. The 11th and 13th sessions of Eye Closure were significantly different from the first, third, fifth, seventh, and eighth sessions, and the 12th session of Lid Tighten was significantly different from the second, third, fifth, and seventh through ninth sessions. Lid Tighten had the highest correlation coefficients with AvRT, MaL, and FS (0.89, 0.83, and 0.87, p < .001) among the facial expression indices (MiL also showed a significant correlation of 0.86, p < .001) and the highest AUC with AvRT (0.79).

Figure 6 shows the lower facial expression indices (near the mouth) that strongly correlated with the PVT indices. These indices exhibited a similar pattern: values fluctuated at lower levels over the first six or seven sessions, remained stable over the next two to four sessions, and then increased noticeably in the final one to three sessions. The values in the last three sessions of Lip Corner Depress were higher than those of the other sessions, with a dip in the 12th session, and its values in the 11th and 13th sessions were significantly different from those in the eighth session. Nose Wrinkle and Upper Lip Raise exhibited similar changes throughout all sessions, with noticeably higher values in the last session. Mouth Open showed higher values in the last two sessions, possibly due to yawning. Although it showed a high SEM, Chin Raise showed higher values in the 11th and 12th sessions, recovering in the 13th session, possibly due to nodding off. Both Lip Pucker and Dimpler (dimpling) showed a particularly high value in the fourth session. Lip Pucker showed higher values in the last two sessions, and its value in the last session was significantly higher than in the sixth session. Dimpler showed higher values in the last three sessions. Jaw Drop showed noticeably higher values in the last two sessions, and its value in the last session was significantly higher than in the fifth and sixth sessions.

Figure 7 shows the four perceived emotions (Anger, Surprise, Sadness, and Disgust) that were highly correlated with the PVT indices except MaL. The perceived emotion indices showed positive correlation coefficients with the PVT indices, except for Joy. In particular, Surprise, Sadness, and Disgust correlated highly with all PVT indices except MaL (0.82, 0.85, and 0.89, respectively, p < .001). Anger, Sadness, and Disgust exhibited noticeable drops in the last two sessions, while the values in the last three sessions of Surprise were higher than in the other sessions.

Figure 8 shows the composite indices (Valence, Engagement, and Attention), which were highly correlated with all PVT indices except MaL. None of the composite indices showed significant differences between sessions. Valence and Attention showed negative correlation coefficients with the PVT indices, while Engagement showed positive correlation coefficients. Valence showed a noticeable decrease not only in the last three sessions but also in the sixth session. Engagement exhibited higher values in the last three sessions than in the other sessions, with a slight drop in the last session. Attention did not show significant correlation coefficients with AvRT and FS (−0.39 and −0.36).

Table 6 shows the results of the machine learning models evaluated with LOSO cross-validation. Note that geometric mean scores are in the range of 0–1, and higher values indicate more accurate predictive power for both classes (performance deterioration vs. normal). Linear classifiers (SVM with a linear kernel and logistic regression) performed better than the other machine learning models across all PVT indices. The best geometric mean scores for the PVT indices AvRT, MaL, MiL, and FS were obtained by logistic regression with all facial indices (0.7302), linear SVM with correlation-based facial indices (0.8062), logistic regression with correlation-based facial indices (0.7159), and linear SVM with all facial indices (0.6330), respectively. SVM with nonlinear kernels (RBF and polynomial) showed lower geometric mean scores than the linear classifiers for all PVT indices, for both feature sets (all indices and correlation-based indices). Random forest and KNN performed worse than the other classifiers for all PVT indices, achieving geometric mean scores below 0.5 for AvRT, MiL, and FS with both feature sets, and a best value of 0.5389 for MaL with the correlation-based facial indices.

Table 6 Geometric mean score of machine learning models. Boldface fonts represent the highest score of each PVT index

Moreover, most models trained using the correlation-based indices showed higher performance than those trained using all indices. SVM with a linear kernel showed higher geometric mean scores for the PVT indices AvRT, MaL, and MiL with the correlation-based feature set (0.7058, 0.8062, and 0.7001, respectively) than with the all-indices feature set (0.6850, 0.6696, and 0.6294, respectively); the geometric mean scores for FS were 0.6330 and 0.6124 with all indices and correlation-based indices, respectively. Logistic regression exhibited higher geometric mean scores with the correlation-based indices for the PVT indices MaL and MiL (0.7702 and 0.7159, respectively) than with all indices (0.6556 and 0.6317). The geometric mean scores of logistic regression for AvRT were 0.7302 and 0.8241 with all indices and correlation-based indices, respectively, and 0.6036 and 0.5747 were obtained for FS. SVM with RBF and polynomial kernels showed higher geometric mean scores with correlation-based indices than with all indices. For random forest and KNN, classifiers trained with the correlation-based indices showed higher geometric mean scores for AvRT and MaL (and KNN also for MiL).

To evaluate the best models for each index, we calculated SHapley Additive exPlanations (SHAP) values of the models showing the best performance for each PVT index, as shown in Fig. 9. The features are sorted by the mean of the absolute SHAP values in descending order, indicating the importance of each feature. Eye Closure was the most important feature for MaL and was among the top four most important features for the other indices. Lip Suck, Engagement, and Lip Pucker were the most important features for AvRT, MiL, and FS, respectively. Although Lid Tighten was among the five most important features for AvRT, MaL, and MiL, it ranked below ten other features for FS. Lip Pucker appeared among the top five important features for all PVT indices.

Fig. 9 SHAP summary plots for the best model for each PVT index, sorted in descending order by the average of the absolute SHAP values. a Logistic regression with all facial indices for AvRT. b SVM with a linear kernel and correlation-based facial indices for MaL. c Logistic regression with correlation-based facial indices for MiL. d SVM with a linear kernel and all facial indices for FS

Discussion and conclusions

In this paper, we analyzed the correlation between facial features obtained using a webcam and PVT indices of performance. We examined 34 facial indices, including head movements, facial expressions, perceived facial emotions, and composite indices. A total of 21 out of the 34 indices were highly correlated with at least one PVT index, including eye-related and mouth-related expressions, Pitch from the head movements, and Anger, Surprise, Sadness, and Disgust from the perceived emotion features. Similar to other studies, our work also showed deterioration of PVT indices during prolonged wakefulness (Basner & Dinges, 2011; Basner et al., 2011; Posada-Quintero et al., 2018). Our PVT results exhibit stable performance during the first nine to 11 sessions (0 to 17–21 hours awake), followed by significant performance deterioration in the last two to four sessions (up to 25 hours awake). Our results show that facial indices are effective measures for assessing individuals' deterioration of performance and cognition during prolonged wakefulness.

To date, only a few publications have compared facial features and fatigue (Knoll, Attkiss, & Persing, 2008; Sundelin et al., 2013). Sundelin et al. (2013) showed that hanging eyelids and droopier corners of the mouth correlated with prolonged wakefulness. Similarly, our results show that Lid Tighten and Lip Corner Depress (corresponding to the features hanging eyelids and droopy corners of the mouth) strongly correlated with the PVT indices AvRT and FS. Sundelin et al. (2013) also found that sadness was significantly associated with fatigue ratings, which agrees with our finding of high correlations between Sadness and the PVT indices. However, their study examined only one perceived emotion, Sadness, and did not measure cognitive performance. Knoll et al. (2008) modified photographs of an upper face using digital imaging software to examine the influence of eyebrow position and shape, eyelid position, and facial rhytids on the perception of tiredness. They observed significant differences in tiredness scores for two modifications of the face: lowering the upper eyelid and depressing the lateral brow. These two features correspond to our Lid Tighten and Brow Raise features, which were highly correlated with the PVT indices. However, their study investigated only eye-related indices and did not measure cognitive performance at all.

In our study, we found that many eye-related features (Brow Furrow, Brow Raise, Inner Brow Raise, Eye Closure, Lid Tighten) and mouth-related features (Lip Corner Depressor, Upper Lip Raise, Mouth Open, Lip Pucker, Dimpler, Jaw Drop) were significantly correlated with PVT indices. Chin Raise and Nose Wrinkle were also significantly correlated with the PVT indices AvRT, MiL, and FS. For five facial expression indices (Eye Closure, Lid Tighten, Lip Corner Depressor, Lip Pucker, Jaw Drop), values in the last three sessions differed significantly from values in some of the first nine sessions. Notably, for Lid Tighten and Eye Closure, more than five of the stable sessions differed significantly from the last three sessions. With regard to using head movement as a detector of prolonged wakefulness-induced performance degradation, Pitch is the only index we tested that shows practical promise. Not surprisingly, the correlation between Pitch and the PVT indices is significant (r = .84 for MiL, p < .001; r = .77, .64, and .72 for AvRT, MaL, and FS, respectively, p < .05), since Pitch is affected by nodding off. However, no significant between-session difference was observed for Pitch.

Four perceived facial emotion indices (Anger, Surprise, Sadness, and Disgust) showed high correlation coefficients with the PVT indices. Note that this does not mean that participants genuinely felt these emotions: the emotion indices indicate the likelihood of perceived emotions based on the emotional facial action coding system (Friesen & Ekman, 1983). According to Robert Plutchik's Wheel of Emotions (Plutchik, 2001), these four perceived emotions can be interpreted as intensified forms of annoyance, distraction, pensiveness, and boredom, respectively. Interestingly, these related emotions are also known to be affected by sleepiness (Anderson & Horne, 2006; Bodin, Björk, Ardö, & Albin, 2015; Li et al., 2017; Weinger, 1999). The composite indices Valence and Attention showed strong negative correlations with the PVT indices, and Engagement also correlated with them. No significant between-session differences were observed for any of the perceived emotion or composite indices.

Our machine learning results demonstrated the feasibility of classifying performance deterioration during prolonged wakefulness. We tested six machine-learning classifiers and found that linear classifiers (SVM with a linear kernel and logistic regression) outperformed the others for all PVT indices, with geometric mean scores of 73.02%, 80.62%, 71.59%, and 63.30% for AvRT, MaL, MiL, and FS, respectively. The linear classifiers achieved higher geometric mean scores for MaL and MiL with the correlated features than with all features. Although the correlated features yielded lower geometric mean scores for AvRT and FS, these scores were comparable to those obtained with all features, differing by less than 3%. We then calculated the feature importance of the linear classifiers using SHAP. The importance rankings differed from the rankings of the correlation coefficients. For example, the correlation coefficients of the MaL index with Eye Closure and Lid Tighten were 0.75 and 0.83, respectively; yet Fig. 9b shows that, for MaL, Eye Closure was more important than Lid Tighten. Likewise, Interocular Distance, Eye Widen, Smirk, and Lip Suck from the facial expression indices showed no significant correlation with the PVT indices but were among the top 10 most important features in the classifiers. This is possibly because machine learning models weigh features by the complementary information they contribute in combination, rather than by their individual correlations with the outcome.
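As a hedged illustration of this classification setup, the sketch below trains a linear-kernel SVM on synthetic stand-in data (not the study's recordings) and scores it with the geometric mean of sensitivity and specificity; coefficient magnitudes serve here as a simple stand-in for the SHAP importances used in the paper:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples x 21 facial indices, labeled
# alert (0) vs. impaired (1). Only the first two features carry signal.
X = rng.normal(size=(200, 21))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Geometric mean of sensitivity and specificity, a balanced score
# that is robust to unequal alert/impaired class sizes.
sens = recall_score(y_te, y_pred, pos_label=1)
spec = recall_score(y_te, y_pred, pos_label=0)
g_mean = np.sqrt(sens * spec)
print(f"G-mean = {g_mean:.2f}")

# Coefficient magnitudes as a crude importance proxy (the study used SHAP,
# which can rank features differently from univariate correlations).
importance_rank = np.argsort(-np.abs(clf.coef_).ravel())
```

Because a linear model's weights reflect each feature's contribution conditional on the others, a feature that is weakly correlated with the label on its own can still receive a large weight, which is consistent with the importance-versus-correlation discrepancy noted above.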

We found that some facial indices correlate with the deterioration of working and cognitive performance during prolonged wakefulness, as has also been found with other measures such as electrodermal activity (EDA), electrocardiogram (ECG; Posada-Quintero et al., 2018), and voice (McGlinchey et al., 2011). Facial features are practical for real-world applications because they can be collected noninvasively and easily with a webcam. Many facial indices were highly correlated with working and cognitive performance on the PVT. In practice, however, feature selection requires more care than relying on correlation alone. For instance, using Pitch from the head movement indices to detect and predict performance deterioration while driving may not be a good option, as the frequent pitch changes (nodding off) that accompany prolonged wakefulness may cause an accident before detection occurs. The external validity of our classifiers may also be limited, since some real-life tasks may engage specific facial features that the PVT does not invoke. For example, some tasks performed by medical providers (e.g., surgeons) may involve frequent head movements. Moreover, some facial features cannot be observed in certain settings (e.g., mouth-related features of surgeons wearing masks). These constraints must be properly considered in the feature selection criteria as well. Because our indices were compared directly against working performance on the PVT, future work can use the highly correlated facial indices identified here to detect and predict the deterioration of working and cognitive performance in practical operations (e.g., driving), preventing irrevocable consequences before they occur.