1 Introduction

Engagement is a concept widely investigated in human–robot interaction (HRI) and yet still elusive [52]. Commonly adopted definitions include that of Sidner et al. [68], who define engagement as “the process by which two (or more) participants establish, maintain and end their perceived connection during interactions they jointly undertake”, and that of Poggi et al. [58], who define engagement as “the value that a participant in an interaction attributes to the goal of being together with the other participant(s) and continuing interaction”. Castellano et al., investigating predictors and components of engagement, regard engagement as characterised by both an affect and an attention component [17]. Conversely, Salam et al. postulate that “engagement is not restricted to one or two mental or emotional states (enjoyment or attention). During the interaction, as the objective of the current sub-interaction differs, the different concepts or cues related to engagement would differ” [63]. Similarly, O’Brien et al. define “user engagement as a multidimensional construct comprising the interaction between cognitive (e.g., attention), affective (e.g., emotion, interest), and behavioural (e.g., propensity to re-engage with a technology) characteristics of users, and system features (e.g., usability)” [48, 49].

Fig. 1

Children engaged in the JUSThink educational activity: which of these two teams, apparently performing similarly, will end up actually learning? Can we tell from their behavior? And if so, can we equip a robot with this knowledge, so that it drives the robot’s behavior in a way that is helpful for learning?

Studying HRI engagement in educational applications is particularly challenging (and therefore interesting) because the robot and the interaction with it are a means to an end, which is learning. In [7], Baxter et al. show “that students who interacted with a robot that simultaneously demonstrated three types of personalization (nonverbal behavior, verbal behavior, and adaptive content progression) showed increased learning gains and sustained engagement when compared with students interacting with a non-personalized robot”. Szafir et al. found that “adaptive robotic agent employing behavioral techniques (i.e. the use of verbal and non-verbal cues: increased spoken volume, gaze, head nodding, and gestures) to regain attention during drops in engagement (detected using EEG) improved student recall abilities 43% over the baseline” [69]. In [13], 24 students engage with the robot during a computer-based math test; the results demonstrate increased test performance with various forms of behavioral strategies, while combining them with verbal cues results in a slightly better outcome. These studies show that changing the robot’s behavior does have an impact on learning, while making the linear assumption that increasing users’ engagement leads to increased learning. Hence, the standard approaches in the literature look to maximize engagement itself. But is it enough to assume that maximizing engagement, as currently defined, maximizes learning?

Inspired by the behaviour and pedagogical principles of human teachers, we propose a paradigm shift in which, at a given point in time, an engaging robot for education is one capable of choosing an action that directly serves the educational goals. We postulate that to maximize learning, engagement need not be maximized; rather, it needs to be optimized. This postulation draws some inspiration from the idea of Productive Failure proposed by Manu Kapur [40], who states that “Engaging students in solving complex, ill-structured problems without the provision of support structures can be a productive exercise in failure”. We believe that, more often than not, there are learners who repeatedly fail in a constructivist design and apparently score low on perceived engagement, which can be biased by performance, yet end up with higher learning. The same can be true of learners who seem to be succeeding but achieve lower learning. An example can be observed in [30], where the authors design a tangible tabletop environment for logistics apprentices for warehouse manipulation. They observe that while task performance is high compared to learners using the traditional paper-and-pencil method, there is no increase in learning outcomes. This is due to a phenomenon they term Manipulation Temptation, where there is over-engagement with the task but no high-level reflection. Hence, interventions are incorporated to disengage learners so that they reflect more and eventually increase their learning gains. Going back to the idea of an engaging robot for education, as pointed out by Belpaeme et al. [9], designing such a robot is not an easy feat, because even experienced human instructors struggle to always make the best choice. We believe that the inability to distinguish actual engagement, which will potentially lead to higher learning, from apparent engagement, which has no effect or even a detrimental effect on learning, plays a role in the struggle to find the appropriate choice.

If optimal engagement does exist, higher learning should then be reflected in certain behavioral patterns of the users. These patterns can then be leveraged to inform the behavior of the robot that is useful for learning. Briefly, this paper makes the following contributions:

  • Validate the existence of “a hidden hypothesis that links multi-modal behaviors of the users to learning and performance”, which we term Productive Engagement (see Fig. 2).

  • The existence of the hidden hypothesis paves the way for machine-learning engagement models whose labels do not come from human annotators but can instead emerge from the data itself.

Moreover, we define our human–human–robot setting, in which a learning task is present, as a social-task engagement scenario, as seen in Fig. 1. The definition by Corrigan et al. [22] seems to be in line with the social/task distinction in the HRI engagement literature with regard to the nature of the HRI scenario/context. They define engagement in terms of three contexts as follows: “task engagement where there is a task and the participant starts to enjoy the task he is doing, social engagement which considers being engaged with another party of which there is no task included and social-task engagement which includes interaction with another (e.g., robot) where both cooperate with each other to perform some task”. That said, in a vast amount of the literature the distinction remains blurry when the scenario is defined, since most interactions involve both task and social components, intertwined with each other and possibly co-dependent.

Lastly, we chose to have two users in our setting, introducing social engagement with a human, because we want to grasp all facets of engagement, since we do not yet know which ones will better relate to learning. Social engagement with a human is supported by the idea that collaboration only produces learning if peers engage in rich verbal interactions such as argumentation, explanation, mutual regulation [12, 27], or conflict resolution [33, 66]. Furthermore, we want the interaction of the user to be as rich as possible and, therefore, the counterpart has to be another human. However, since engagement itself is still rather ambiguous, as explained at the start of the section, having two participants adds the variable of “group engagement”, for which, too, multiple definitions exist. Salam et al. define group engagement as “the joint engagement state of two participants interacting with each other and a humanoid robot” [62]. Oertel et al. define group engagement as “a group variable which is calculated as the average of the degree to which individual people in a group are engaged in spontaneous, non-task-directed conversations” [51], whereas Gatica et al. define group interest as “the perceived degree of interest or involvement of the majority of the group” [32]. In our human–human–robot setup, we adapt the definition by [51] to our multi-modal data, with the engagement with the robot depending on the role of the robot (active, e.g. a team member; or passive, e.g. an instructor). Briefly, for the purpose of analyzing the hidden hypothesis highlighted in the contributions, we want to consider multiple facets of engagement and to have two human users in the setting for richer interactions.

In the remainder of the paper, Sect. 2 presents the related work while Productive Engagement (PE) is introduced in Sect. 3. The research questions are highlighted in Sect. 4 followed by the description of the learning activity, and the setup in Sect. 5. Section 6 includes an in-depth analysis, results and discussion. Lastly, concluding remarks follow in Sect. 7.

2 Related Work

The paradigm shift we propose puts us at the crossroads of two fields, social robotics and education. This leads us to look at the engagement literature from the perspectives of both HRI and Multi-modal Learning Analytics (MLA).

It should be noted that in MLA, several studies target “motivation” and its link to learning. This is inspired by the positive relationship established in educational psychology between motivation and success at learning [24, 71]. For example, the authors of [59] “demonstrate that motivation in young learners corresponds to observable behaviors when interacting with a robot tutoring system, which, in turn, impact learning outcomes”. They observe a correlation between “academic motivation stemming from one’s own values or goals as assessed by the Academic Self-Regulation Questionnaire (SRQ-A)” and observable suboptimal help-seeking behavior. The authors then go on to show that an interactive robot that responds intelligently to the observed behaviors positively impacts students’ learning outcomes. While motivation is not equivalent to engagement, it could be the cause of engagement: if one is motivated to learn, intrinsically or extrinsically, one will engage more, which is also in line with Maslow’s theory of human motivation [44]. These MLA studies are thus sometimes also viewed as relevant in the context of understanding engagement in educational settings.

In the literature coming from HRI and MLA, engagement is conventionally described as multi-faceted, meaning that various aspects of the user can be used to model it. Some of the forms found in the literature, following the nomenclature proposed by [26], include affective, behavioral, cognitive, academic, and psychological engagement.

Various methods to measure engagement along these facets can then be found in the HRI and MLA literature. In [26], Dewan et al. categorize these methods (for online learning) into manual, semi-automatic, and automatic, and then divide the methods in each category into sub-categories depending upon the modality(ies) of the data used. Adapting the classification mainly from [26], we focus on the manual and automatic categories:

2.1 Manual

Two of the most popular manual methods found in both the HRI and MLA engagement literature are: 1) Self-Reporting, where “the learners report their own levels of engagement, attention, distraction, motivation, excitement, etc.” [50, 70]; 2) Observational Checklist, where external observers complete questionnaires on learners’ engagement or annotate video or speech data [39, 55]. While self-reporting is easy to administer and useful for “self-perception and other less observable engagement indicators” [70], its validity depends on several factors such as learners’ honesty, willingness, and self-perception accuracy [29]. On the other hand, disadvantages of the second type of method include the large amount of time and effort required from the observers, as well as the risk that observational metrics are affected by confounding factors. For instance, as Whitehill et al. point out in [70], “sitting quietly, good behavior, and no tardy cards appear to measure compliance and willingness to adhere to rules and regulations rather than engagement”. Furthermore, while studies with a single observer might suffer from subjectivity, studies with multiple observers might lead to low inter-rater agreement, as engagement is a highly subjective construct.

2.2 Automatic

Some of the most widely used methods in MLA and HRI engagement modelling fall under this category. They can further be sub-divided into: 1) Log-file Analysis, and 2) Sensor Data Analysis methods. In Log-file Analysis, interaction traces are analyzed to extract users’ engagement or even performance (in educational settings) via behavioral indicators such as the frequency of a particular behavior or the time taken on a particular action [1, 16, 20]. Various learning analytics and data mining approaches are used to perform log-file analysis in educational settings [4], including prediction methods, structure discovery, and relationship mining. While interaction data is relatively easy to log and hence results in a considerable amount of data, it lacks information that can be crucial to learning, such as where the user is looking or how the user feels. In the second method, a number of cues are investigated, most commonly through video and audio data: gaze, mutual gaze, joint attention, speech, posture, gestures, facial expressions, proxemics, personality, etc. [3, 11, 15, 31, 37, 38, 41, 64, 65]. A number of works complement video and audio data with physiological and neurological sensors that provide information such as EEG, heart rate, and perspiration rate [18, 43]. The main advantage of relying on video and audio data only is that the setup can be made relatively unobtrusive and as close as possible to the real settings of a classroom. On the other hand, while physiological and neurological sensors may provide more accurate information about some of the internal states of a learner (namely arousal, alertness, anxiety, etc.), they are specialized sensors that are not very practical in daily classroom settings.

Due to the multi-modality and diversity of the data collected, Sensor Data Analysis approaches can differ significantly in terms of the chosen analysis methods. Commonly found solutions include: 1) methods that look to detect the presence of specific engagement cues/events such as directed gaze, back-channels, valence, and smiles [34, 60], 2) supervised classifiers where the labels come from human annotators [15, 41, 64], and 3) deep-learning [47] and deep reinforcement learning [53, 61] approaches for engagement estimation. The deep-learning methods are relatively new in HRI, motivated by the idea that traditional machine learning methods are not equipped to deal with high-dimensional feature spaces, require expert engineering, and always rely on data annotation. While the first kind of methods are relatively straightforward to implement, they are limited to the detectable cues, which are few and possibly affected by confounding factors. Supervised classifiers are among the most widely used methods; however, since engagement is a highly subjective construct, such models face problems of generalization and accuracy, because they are built in a specific context and their labels are provided by multiple annotators. We must also note that not many studies actually report the annotation protocol. Lastly, the latest deep learning approaches suffer from a lack of interpretability/explainability of results and require an abundance of data.

The state-of-the-art review reported above emphasizes the benefits of multi-modal approaches, which are better suited to capture the nuances of engagement and less severely affected by confounding factors, and also emphasizes the disadvantages of relying on human observers/annotators, who introduce a hard-to-control-for subjectivity. Hence, in the proposed work, we try to steer away from dependency on human annotators and from the lack of interpretability introduced by deep learning approaches, while still making use of multi-modal data as in [57]. We put forward an automatic machine learning method, which relies on both log-files and video/audio data, analysed with clustering techniques. This method can generate labels for engagement, which can then be used to train a supervised classifier.
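The pipeline described above (cluster first, then reuse the cluster indices as labels) can be illustrated with a minimal sketch. It is not the implementation used in this article: the synthetic feature matrix, the choice of three clusters, the scaling step and the decision-tree classifier are all assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 30 teams x 4 behavioral features, drawn around three
# synthetic group centers (these numbers are NOT from the study).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5]], dtype=float)
features = np.vstack([c + rng.normal(scale=0.5, size=(10, 4)) for c in centers])

# Step 1: scale the features so that no single modality dominates the distances.
scaled = StandardScaler().fit_transform(features)

# Step 2: cluster the teams without any human annotation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
cluster_labels = kmeans.labels_

# Step 3: reuse the cluster indices as labels for an interpretable supervised model.
classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(scaled, cluster_labels)
train_accuracy = classifier.score(scaled, cluster_labels)
```

A shallow decision tree is used here only because its splits can be read directly, which fits the stated wish for interpretability; any other supervised classifier could be substituted.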

While engagement in HRI is usually studied as the standalone goal of an experiment and, to the best of our knowledge, no study exists that tries to explicitly link it to learning, a large number of contributions within MLA (and specifically from the field of Intelligent Tutoring Systems, ITS) aim at capturing the knowledge state or skill level of the students through their interactions with the system [4, 6, 21, 25, 54], in addition to modelling meta-cognitive behaviors, affective states, engagement, and motivation [5, 8, 14, 23, 25]. We want to explore the relation between engagement and learning. The reported MLA literature supports our hypothesis that it is possible to “unveil” learning and performance in the way learners engage with each other and the task at hand. The article investigates this intuition, without forgetting the ultimate goal of turning what we find into something that a robot can use online to drive its behavior to best support learning.

3 Productive Engagement

Our research is motivated by the following conceptions:

  1. Maximizing engagement does not necessarily lead to increased learning outcomes, as first noted in Sect. 1, where engagement here refers to its apparent representation through logs, video and audio streams that are annotated by humans.

  2. As first discussed in Sect. 2, evaluating engagement in light of domain-specific measures like learning outcomes and performance metrics, which are more objective constructs, and relying upon multi-modal data, can be more effective in educational settings than using classifiers with labels from human observers.

We define Productive Engagement as the level of engagement that maximizes learning. Unproductive engagement can occur due to either over-engagement (which can happen especially when interacting with gamified educational setups or setups with a robot) or under-engagement, either socially or with the task. We make a distinction between the social and task aspects of an interaction in an educational setting, adapted from the work of [22]. Productive Engagement would then have the following components:

  1. Social Engagement that we define as the quality and quantity of the verbal and non-verbal social interaction with other entities (learners and robots).

  2. Task Engagement that we define as the quality and quantity of the interaction with the task.

Fig. 2

Overview-productive engagement

As seen in Fig. 2, learning and performance can be positively or negatively affected by behavioral patterns pertaining to social and/or task engagement, and vice versa. Furthermore, we argue that the other popularly used distinction (cognitive and affective), as seen in the review by [9], comes under the umbrella of both the task and social engagement aspects of an interaction. To shed more light on the motivation for using this distinction, we include the outcomes classification from the aforementioned review [9]. The authors showed that in most of the studies carried out with robots in educational settings, the outcomes (what the robot intervention targets and what the learning activity is designed for) can be classified into cognitive and affective outcomes [9]. “Cognitive outcomes focus on one or more of the following competencies: knowledge, comprehension, application, analysis, synthesis, and evaluation” while the “Affective outcomes refer to qualities that are not learning outcomes per se, for example, the learner being attentive, receptive, responsive, reflective, or inquisitive”. Both of these outcomes have been reported to affect learning; however, a positive affective outcome does not imply a positive cognitive outcome, or vice versa [9, 36]. The use of these two outcomes is also in line with the study of [28], who propose a model to explain the dynamics of affective states that emerge during deep learning and that are ultimately also linked with cognitive engagement. Based on the definitions in the engagement literature [19, 35, 48, 49, 70], we define them as follows:

  1. Cognitive engagement refers to the effort that is put into understanding and analyzing the learning concept including meta-cognitive behaviors like reflection.

  2. Affective engagement encompasses feelings, enjoyment, attitude and the mood of the learners, etc.

The above categorization of engagement facets is presented to ground our definition of productive engagement in the context of existing engagement literature and to illustrate our rationale for selecting engagement-related features. Furthermore, we are aware that separating the cognitive and affective dimensions of interactions is a gross simplification. We nonetheless use this distinction as a convenient way to design the robot behavior as well as to analyse data. Concretely, we propose that a feature can be labelled based on the type of engagement (cognitive or affective in task or/and social space) we are using it to measure.

4 Research Questions

We consider our definition of Productive Engagement described above as a hidden hypothesis that “links multimodal behaviors of the users to learning and performance”. Briefly, this paper investigates the following research questions:

  • RQ1: Given the behavioral patterns, whether cognitive or affective, social or task, can we reveal a quantitative relationship that links them to learning and performance? i.e., do people that differ in their behavior also differ in their learning and performance?

  • RQ2: To feed a machine-learning model of engagement with labelled data, can we replace human annotated labels by measures extracted from learning outcomes?

The link between the stated contributions of the paper, Productive Engagement and the research questions is analogous to a cosco ladder. Previous work in educational HRI and MLA, as mentioned above, agrees in suggesting that there is a link between learner engagement and learning. Beyond that, the two fields differ: while the educational HRI side has mostly focused on investigating the relationship between the robot’s behavior and the learner’s engagement, a subset of the MLA literature has investigated the relation between learners’ behaviors (indicative of constructs like engagement, motivation and effortful behavior, which have been used comparably [67]) and learning. In this article, we postulate that it is time to reunite the two sides of the equation: robot behavior to user engagement to user learning. We propose to do so via the concept of Productive Engagement, which emerges by investigating these domains in parallel. Productive Engagement is the type of engagement that the robot seeks to raise in the user, because “it is the one that is expected to put the user in conditions likely to trigger learning mechanisms, although there is no guarantee that the expected conditions would occur” (Footnote 1). This is the first half of the ladder, where we climb from the literature to Productive Engagement. On the second half, we descend from Productive Engagement to experiments and implementation. For the full link to work: (1) the robot needs to be able to autonomously infer the user’s Productive Engagement in real time (RQ2), and (2) there must exist a link between said engagement and learning (RQ1), so that the robot can verify whether the current user engagement is conducive to learning and plan its actions accordingly.

Fig. 3

The contents of the screens of the participants during the JUSThink learning activity, where one participant is in the figurative view as seen in (a) and the other participant is in the abstract view given by (b). The figures show a set of tracks forming a minimum spanning tree for the network of gold mines: finding it and building it collaboratively is the goal of the activity.

5 User Study

To evaluate the hidden hypothesis, we make use of the data from a user study conducted with a first version of a robot-mediated human–human collaborative learning activity called JUSThink [46]. The JUSThink learning activity aims to (1) improve the computational skills of children by imparting intuitive knowledge about minimum-spanning-tree problems and (2) promote collaboration within the team via its design. As an experimental setup for HRI studies, it also serves as a platform for designing and evaluating robot behaviors that are effective for the pedagogical goals. The minimum-spanning-tree problem is introduced through a gold mining scenario based on a map of Switzerland, where mountains represent gold mines labelled with the names of Swiss cities (see Fig. 3).

5.1 Learning Activity

The activity, designed for two children playing as a team, consists of several stages spanning approximately 50 minutes. It starts with the robot welcoming the children and introducing the goal of the task, which is followed by a pre-test. After the pre-test, the robot gives a demo explaining the two game views (see Fig. 3) and their functionalities, which is then followed by the learning task lasting around 25 minutes. After the task, children are asked to fill in a post-test and a self-assessment questionnaire before the robot bids them goodbye. Both the pre-test and post-test are defined in a context other than Swiss gold mines and are based on variants of the graphics in the muddy city (Footnote 2) problem. Both tests are composed of 10 multiple-choice questions assessing three concepts: (1) whether a spanning tree exists, i.e. whether the graph is connected, (2) whether the given subgraph spans the graph, and (3) whether the given subgraph that spans the graph has a minimum cost.

The learning task lies at the heart of the activity and requires the children to interact with maps such as those shown in Fig. 3 via touch-screens, as shown in Fig. 1. A small humanoid robot, acting as the CEO of a gold-mining company reiterates the problem by asking the participants to help it collect the gold by connecting the gold mines with railway tracks, while spending as little money as possible. The participants collaboratively construct a solution by drawing and erasing tracks that connect pairs of goldmines, and submit it to the robot for evaluation (one of the two optimal solutions is shown in Fig. 3).

The learning task design is scaffolded towards collaboration through precise design choices:

  1. The task relies on two different views, called figurative and abstract, each of which gives only partially observable information to the user. The nodes and edges of the graph are shown as mountains and railway tracks in the figurative view, while in the abstract view they are denoted by circles and solid lines, respectively. Additionally, in the abstract view, deleted railway tracks are shown with dashed lines and the cost of each edge is indicated as a number.

  2. The two views provide complementary functionality and, therefore, in order to make informed decisions, the team members need to communicate. While in the figurative view one can build and erase tracks, in the abstract view one can view the cost of every track ever added, access previous solutions and their costs, and bring back a previous solution.

  3. Every two edits, the views are swapped between participants, thus allowing each team member to experience the thought process that comes with a view. It also eliminates permanent roles in the game.

  4. The cost of each track is initially hidden and only revealed after it is drawn, thus instigating reasoning about an edge in terms of a connection between two entities with an associated cost.

  5. The team can submit their solution only if it spans the whole graph and only when both participants press the submit button. This scaffolds for team agreement before submission.
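To make the graph terminology concrete, the following minimal sketch checks whether a set of tracks spans a map and computes the cost of a minimum spanning tree with Kruskal's algorithm. It is not the JUSThink implementation: the four-node map, the track costs and all function names are hypothetical, and the map is assumed to be connected.

```python
def find(parent, node):
    """Return the representative of node's component (with path compression)."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def spans_graph(nodes, edges):
    """True if the chosen edges connect every node of the map."""
    parent = {n: n for n in nodes}
    for u, v, _cost in edges:
        parent[find(parent, u)] = find(parent, v)
    return len({find(parent, n) for n in nodes}) == 1


def minimum_spanning_cost(nodes, edges):
    """Cost of a minimum spanning tree (Kruskal); assumes a connected graph."""
    parent = {n: n for n in nodes}
    total = 0
    for u, v, cost in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            total += cost
    return total


# Hypothetical toy map: city names and costs are invented for illustration.
nodes = ["Bern", "Zurich", "Basel", "Luzern"]
all_tracks = [
    ("Bern", "Zurich", 4), ("Bern", "Basel", 3), ("Basel", "Zurich", 2),
    ("Zurich", "Luzern", 1), ("Bern", "Luzern", 5),
]
optimal_cost = minimum_spanning_cost(nodes, all_tracks)

# A team's candidate solution: it spans the map but is not the cheapest one.
submission = [("Bern", "Zurich", 4), ("Zurich", "Luzern", 1), ("Bern", "Basel", 3)]
submission_cost = sum(cost for _u, _v, cost in submission)
error = submission_cost - optimal_cost

# An incomplete drawing that leaves the map split in two parts.
partial = [("Bern", "Basel", 3), ("Zurich", "Luzern", 1)]
```

In this toy case the submission would be accepted for evaluation because it spans the map, while the partial drawing would not.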

The robot’s role in the current activity is twofold: 1) to mediate and automate the entire activity by giving instructions at every stage and moving the activity from one stage to the next as required, and 2) to intervene sparsely during the learning task to provide feedback on the progress, give hints and lend support through minimal verbal and non-verbal behaviors [46].

Fig. 4

The layout of the hardware setup for JUSThink

5.2 Setup and Participants

The setup for the experiment is shown in Fig. 1, where the two children in the team sit across from each other, separated by a barrier. Each of them has a touch screen in front of them to interact with the application. The humanoid robot (QTrobot) is placed sideways with respect to the participants, so as to be visible to both. As depicted in Fig. 4, there are two RGB-D cameras that record the facial streams and one environment camera that films the entire scene. Two lavalier microphones, clipped on the participants, are used to record audio. We use two computers, connected to the screens and the robot, to manage the activity and the synchronous recording of the sensors. On the software side, each participant interacts with an instance of the JUSThink application while a separate robot application is used to manage the robot. All of the applications communicate via the Robot Operating System (ROS). Rosbags are used to record all of the participants’ actions (the logs) as well as the robot actions. For more details on the hardware and the software setup, see [46].

The study was conducted in two international schools in Switzerland over two weeks (Footnote 3). Although the experimenters were always present in the room, the activity was autonomous, with little to no intervention required. A total of 96 students, ranging from 9 to 12 years old, participated. However, to ensure that the data used for the study is complete and non-faulty across all sensing modalities (i.e., video, audio and action logs) as well as homogeneous (e.g., we excluded a team in which participants were speaking French instead of English to communicate with each other), we omitted 28 students, resulting in a corpus of 68 participants (i.e., 34 teams) used for the analysis reported in this article. The dataset, which we term PE-HRI, is made freely and publicly available [45].

6 Evaluating the Hidden Hypothesis

RQ2 assumes that learning and performance data, respectively extracted from the pre- and post-tests and the learning task itself, can provide labels to be used as a reference for the analysis of the engagement features. Concretely, this means that learning and performance data should allow for a separation of teams into different groups, with different learning outcomes and performance. This analysis, which we call “backward” since it allows for moving from learning to engagement (from learning outcomes back to the learning process), is reported in Sect. 6.1. In Sect. 6.2, we first discuss the engagement-related features extracted from video, audio and log data (see Table 1), and then investigate the existence of the link that we postulate between behavior and learning and performance, by verifying whether correspondences exist between the clustering of teams based on their behavior patterns and the learning labels. This is what we call the “forward” approach, since it moves from engagement features to learning outcomes and performance metric. We must point out that by performance, we mean how the teams perform, i.e., fail or succeed at the activity, and by learning outcomes, we refer to how the learners score in their pre- and post-tests. For our analysis, we make use of the sklearn machine learning library [56].
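As an illustration of what checking for correspondences between two groupings can look like, the sketch below cross-tabulates behavior-based clusters against learning-based labels and computes cluster purity. This is one simple option and not necessarily the procedure used in this article; the cluster assignments, the label names and the helper functions are all hypothetical.

```python
from collections import Counter


def contingency(behavior_clusters, learning_labels):
    """Count how many teams fall into each (behavior cluster, learning label) pair."""
    return Counter(zip(behavior_clusters, learning_labels))


def purity(behavior_clusters, learning_labels):
    """Fraction of teams whose learning label is the majority label of their cluster."""
    best = {}
    for (cluster, _label), count in contingency(behavior_clusters, learning_labels).items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(behavior_clusters)


# Hypothetical assignments for eight teams (not data from the study).
behavior_clusters = [0, 0, 0, 1, 1, 1, 1, 0]
learning_labels = ["gain", "gain", "gain", "no_gain",
                   "no_gain", "no_gain", "gain", "no_gain"]

table = contingency(behavior_clusters, learning_labels)
score = purity(behavior_clusters, learning_labels)
```

A purity close to 1 would indicate that teams grouped together by behavior also tend to share a learning label, while a value near the share of the largest label would suggest little correspondence.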

Fig. 5

Clustering of teams based on their learning and performance

Fig. 6

Pair plots of the clusters obtained through the backward approach. According to their relative placement w.r.t. learning and performance (and in line with terms and concepts used in Education), we can label the clusters as: non-Productive Success (cluster non-PS), Productive Failure (cluster PF), non-Productive Failure (cluster non-PF) and Productive Success (cluster PS)

Table 1 Multi-modal features for the analysis of the participants’ engagement in the Forward Approach

6.1 Backward Analysis

We make use of the following learning outcomes and performance metric (first outlined in [46]), defined as follows:

  • Last error: It is a performance metric, denoted by last_error, defined as the error of the last solution submitted by a team. It is computed as the difference between the total cost of the submitted solution and the cost of the optimal solution. Note that the game stops as soon as a team finds an optimal solution, in which case last_error = 0.

  • Relative learning gain: It is a learning outcome, calculated individually and not as a team, defined as the difference between a participant’s post-test and pre-test score, divided by the difference between the maximum score that can be achieved and the pre-test score. This grasps how much the participant learned of the knowledge that he/she didn’t possess before the activity. At team level, denoted by T_LG_relative, we take the average of the two individual relative learning gains of the team members.

  • Joint learning gain: It is a learning outcome, denoted by T_LG_joint_abs, defined as the difference between the number of questions that both of the team members answer correctly in the post-test and in the pre-test, which grasps the amount of knowledge acquired together by the team members during the activity.
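The three definitions above can be written out as a minimal sketch. The function names, the representation of test answers as lists of booleans, and the handling of a perfect pre-test score (returning 0) are assumptions made for illustration; the text does not specify them.

```python
def last_error(submitted_cost, optimal_cost):
    """last_error: cost of the last submitted solution minus the optimal cost."""
    return submitted_cost - optimal_cost


def relative_learning_gain(pre, post, max_score):
    """(post - pre) / (max_score - pre) for one participant."""
    if pre == max_score:
        # Assumption: nothing was left to learn, so the gain is reported as 0.
        return 0.0
    return (post - pre) / (max_score - pre)


def team_relative_gain(member_a, member_b, max_score):
    """T_LG_relative: mean of the two members' relative gains.

    Each member is a (pre_score, post_score) pair.
    """
    gains = [
        relative_learning_gain(pre, post, max_score)
        for pre, post in (member_a, member_b)
    ]
    return sum(gains) / len(gains)


def joint_learning_gain(pre_a, pre_b, post_a, post_b):
    """T_LG_joint_abs: questions both members answer correctly, post minus pre.

    Each argument is a per-question list of booleans for one member and one test.
    """
    both_pre = sum(1 for a, b in zip(pre_a, pre_b) if a and b)
    both_post = sum(1 for a, b in zip(post_a, post_b) if a and b)
    return both_post - both_pre
```

For instance, with a maximum score of 10, a participant moving from 4 to 7 has a relative learning gain of 3/6 = 0.5.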

We calculate these measures for each team, normalize them to have unit variance, and then perform K-means clustering on the metrics, as shown in Fig. 5. The number of clusters k is estimated using inertia, a commonly used measure of how well a clustering fits the data. For a better understanding of the resulting clusters, we also generate pair plots for the three metrics in Fig. 6. As the pair plots show, we have four clusters that we can label, in accordance with terminology and concepts commonly adopted in the field of learning and education (more specifically the terms productive/non-productive inspired by the terminology of Productive Failure [40]), as:

  • Non-Productive Success, i.e. teams that performed well in the task but did not end up learning; hence, with lower last errors and lower learning gains (BA cluster = non-PS in blue in Fig. 6).

  • Productive Failure, i.e. teams that did not perform well but did end up learning; hence, with higher last errors and higher learning gains (BA cluster = PF in orange).

  • Non-Productive Failure, i.e. teams that neither performed well in the task nor ended up learning; hence, with higher last errors and lower learning gains (BA cluster = non-PF in green).

  • Productive Success, i.e. teams that performed well and also ended up learning; hence, with lower last errors and higher learning gains (BA cluster = PS in red).

In terms of the pedagogical goal as well as the apparent success in the activity, it is quite interesting to see these four types of teams. However, the next question is whether behavioral patterns of teams would cluster in a similar manner or not. In other words, would the different behavioral patterns also indicate such a division among teams?
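The backward clustering procedure of this section can be sketched with sklearn as follows. The data are synthetic placeholders rather than PE-HRI values, and the helper name `inertia_curve`, the candidate range of k, the random seed and `n_init=10` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def inertia_curve(X, k_values, seed=0):
    """Fit K-means once per candidate k and return {k: inertia}."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


# Synthetic stand-in for the per-team metrics (made-up numbers).
# Columns: last_error, T_LG_relative, T_LG_joint_abs.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.1, 0.0],   # low error, low gains
    [6.0, 0.6, 3.0],   # high error, high gains
    [6.0, 0.1, 0.0],   # high error, low gains
    [0.0, 0.6, 3.0],   # low error, high gains
])
metrics = np.vstack([
    c + rng.normal(scale=[0.5, 0.05, 0.3], size=(8, 3)) for c in centers
])

# StandardScaler gives each metric unit variance; it also centers the metric,
# which does not change the K-means assignments.
X = StandardScaler().fit_transform(metrics)

curve = inertia_curve(X, range(1, 8))   # inspect this curve for an elbow
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

Plotting `curve` against k and looking for the point where the decrease in inertia flattens is one common way to choose k; the text does not state the exact criterion used.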

6.2 Forward Analysis

6.2.1 Joint Analysis of Video, Audio and Log Features

As explained in Sect. 2, in this work we focus on video, audio and log features as some of the most commonly used features for engagement detection, such as speech, affective states, and gaze come from such data. Table 1 lists and details the multi-modal features that we use to analyze participants’ behavior in the forward approach. We also mark the feature type as task/social and cognitive/affective, in line with the definitions and rationale outlined in Sect. 3. As a first step, we make sure that the logs, videos, and audios used for generating all the features for a team are aligned and cut for the task duration only and not the entire pipeline given in Sect. 5.

Log features are extracted from the recorded rosbags. The features related to both affective states and gaze are computed through the open-source library OpenFace [2]. A common way of calculating affective states, such as valence and arousal, is via the facial action units generated by OpenFace. For positive and negative valence, we build on action units (AUs) that correspond to positive and negative emotions, respectively, based on the findings from IMotions (Footnote 4), which uses Affectiva (Footnote 5) for emotion recognition. These findings are also similar to the ones in EmotioNet [10]. An exponential moving average is applied to smooth the data for each AU, after which the AUs belonging to positive and negative emotions are averaged to obtain positive and negative valence, respectively. We calculate arousal by taking the average of all the AUs above a certain intensity at a given point in time. Regardless of the valence, the absolute value of arousal is calculated to measure the expressivity of a user. The smile extraction from AUs is based on the findings of a smile authenticity study [42]. OpenFace also generates gaze angles that can be used to determine the eye gaze direction in radians in world coordinates. These angles are averaged over both eyes and come in a format that is easier to use than the gaze vectors. Using these gaze angles, one can approximate whether a person is looking straight ahead, left or right. Lastly, voice activity detection (VAD) on the audio stream is performed with the Python wrapper for the open-source Google WebRTC Voice Activity Detection. All the audio features listed in Table 1 are computed on the output of the Google WebRTC VAD.
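As a rough illustration of the valence and arousal computation described above, the sketch below operates on per-frame dictionaries of AU intensities. The smoothing factor, the intensity threshold, the choice to compute arousal on unsmoothed intensities, and the AU names used in the example are all assumptions; the actual AU sets are those taken from the sources cited in the text and are not reproduced here.

```python
def ema(values, alpha=0.3):
    """Exponential moving average: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def valence(frames, au_names, alpha=0.3):
    """Per-frame mean of the EMA-smoothed intensities of the given AUs."""
    smoothed = [ema([frame[name] for frame in frames], alpha) for name in au_names]
    return [sum(column) / len(column) for column in zip(*smoothed)]


def arousal(frames, threshold=1.0):
    """Per-frame mean of all AU intensities above `threshold` (0.0 if none)."""
    result = []
    for frame in frames:
        active = [value for value in frame.values() if value > threshold]
        result.append(sum(active) / len(active) if active else 0.0)
    return result


# Illustrative frames; the AU keys are placeholders, not the paper's AU sets.
frames = [
    {"AU06_r": 2.0, "AU12_r": 0.0, "AU04_r": 0.5},
    {"AU06_r": 0.0, "AU12_r": 2.0, "AU04_r": 1.5},
]
positive_valence = valence(frames, ["AU06_r", "AU12_r"], alpha=0.5)
expressivity = arousal(frames, threshold=1.0)
```

Passing the AUs associated with negative emotions to `valence` would give negative valence in the same way.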

Assessing Forward Clusters: To cluster teams based on their behavior pattern, as captured by the 28 features listed in Table 1, we first apply Principal Component Analysis (PCA) on the normalized features (we use a min-max scaler to scale each feature to the range between 0 and 1), which returns three principal components (PCs). The three principal components identified by the PCA account for 50% of the variance within the features dataset, with the fourth component only contributing 8%. Then, by applying K-means clustering on the three PCs (with K=4 chosen in accordance with the inertia score), we end up with four clusters, as shown in Fig. 7, where each cluster represents a different behavioral pattern.
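The forward pipeline (min-max scaling, PCA, K-means) can be sketched with sklearn as below. The feature matrix is random placeholder data with the same shape as the corpus (34 teams, 28 features), and the function name `forward_clusters`, the seed and `n_init=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


def forward_clusters(features, n_components=3, k=4, seed=0):
    """Scale features to [0, 1], project onto the first PCs, then run K-means."""
    scaled = MinMaxScaler().fit_transform(features)
    pca = PCA(n_components=n_components).fit(scaled)
    pcs = pca.transform(scaled)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(pcs)
    return labels, pcs, pca.explained_variance_ratio_


# Placeholder data: 34 teams x 28 behavioral features (random, not PE-HRI).
rng = np.random.default_rng(0)
features = rng.normal(size=(34, 28))

labels, pcs, ratios = forward_clusters(features)
explained = ratios.sum()   # share of variance kept by the retained PCs
```

On real data, `ratios` is the quantity behind statements such as "the three principal components account for 50% of the variance".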

Fig. 7

Clustering of teams based on their behavioural pattern (extracted from video, audio and log features)

Fig. 8

Learning outcomes and performance metric (averaged within cluster) for the clusters computed with the forward approach. Stars denote statistically significant differences (\({p < 0.05}\)). Dashed horizontal lines indicate the metrics’ global averages

As outlined in the opening of this section, to investigate RQ1, we compute the average performance metric and learning outcomes for the teams in the clusters obtained from the analysis of behavioral features, as shown in Fig. 8. In the rest of the analysis, we disregard cluster \(F^{2}_{all}\) since it is composed of only 2 data points. As the figure shows, while the three clusters \(F^{0}_{all}\), \(F^{1}_{all}\) and \(F^{3}_{all}\) have similar average performance, they significantly differ in terms of learning outcomes, with clusters \(F^{0}_{all}\) and \(F^{3}_{all}\) having higher averages than cluster \(F^{1}_{all}\) (i.e., \(F^{0}_{all}\) and \(F^{3}_{all}\) include teams that ended up with higher learning, while cluster \(F^{1}_{all}\) includes teams that ended up with low learning). To validate these differences statistically, we perform a Kruskal-Wallis (KW) test on these metrics for each pair of clusters. In addition to the learning outcomes first defined in Sect. 6.1, we also include “absolute learning gain” to further validate the results. It is calculated individually and is defined as the difference between a participant’s post-test and pre-test score, divided by the maximum score that can be achieved (10), which captures how much the participant learned of all the knowledge available. At team level, denoted by T_LG_absolute, we take the average of the two individual absolute learning gains of the team members. Coming back to the KW test, for the pair \((F^{1}_{all},F^{3}_{all})\), there is a significant difference in absolute, relative, and joint learning gain (mean_LG_abs: \({p = 0.025}\), mean_LG_rel: \({p = 0.016}\), mean_LG_joint: \({p = 0.026}\)). For the pair \((F^{0}_{all},F^{1}_{all})\), albeit not statistically significant (for \({p < 0.05}\)), there is a difference in absolute and relative learning gain (mean_LG_abs: \({p = 0.073}\), mean_LG_rel: \({p = 0.067}\)). These results seem to indicate that the teams that end up having significantly higher learning gains behave differently from the teams ending up with lower learning gains. In other words, this suggests that participants’ behavior is indicative of the separation of teams into high- and low-learners. This, in turn, supports our hypothesis of the existence of a link between engagement and learning (RQ1) and its representability with features that do not require human annotation (RQ2).
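A pairwise Kruskal-Wallis comparison of this kind can be sketched with `scipy.stats.kruskal`. The learning-gain values below are invented for illustration and do not reproduce the reported p-values; the helper name `pairwise_kruskal` and the cluster keys are likewise assumptions.

```python
from itertools import combinations

from scipy.stats import kruskal


def pairwise_kruskal(groups):
    """Run a Kruskal-Wallis test for every pair of clusters.

    `groups` maps a cluster name to the list of per-team values of one metric.
    Returns {(name_a, name_b): p_value}.
    """
    return {
        (a, b): kruskal(groups[a], groups[b]).pvalue
        for a, b in combinations(sorted(groups), 2)
    }


# Made-up relative learning gains per forward cluster.
relative_gain = {
    "F0": [0.45, 0.52, 0.57, 0.61, 0.66, 0.72],
    "F1": [0.05, 0.08, 0.10, 0.12, 0.15, 0.20],
    "F3": [0.50, 0.55, 0.58, 0.60, 0.62, 0.70],
}
p_values = pairwise_kruskal(relative_gain)
significant = {pair for pair, p in p_values.items() if p < 0.05}
```

Running the same function once per metric (absolute, relative and joint learning gain) yields one p-value per pair and metric. No correction for multiple comparisons is applied in this sketch.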

Comparing Forward and Backward Clusters: In an effort to further assess our hypothesis, we compare the clusters formed by the backward approach with those obtained in the forward approach. For this, we compute a similarity score \(S^F_B\) for each backward cluster B with each forward cluster F as:

$$\begin{aligned} S_{B}^{F}=\frac{\textit{common teams in both clusters}}{\textit{total teams in both clusters}} \end{aligned}$$
(1)

which generates the Similarity Matrix shown in Fig. 9. It must be noted here that in Fig. 9, the order of naming of clusters on each axis is unrelated, i.e., we don’t expect learners in horizontal cluster non-PS to also be in vertical cluster \(F^0_{all}\), or more specifically we do not expect the diagonal to be filled.
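A minimal sketch of how such a matrix could be computed is given below. It assumes that "total teams in both clusters" means the number of distinct teams appearing in either cluster (i.e., an intersection-over-union reading of Eq. 1); this interpretation, the function names and the toy team identifiers are assumptions and may differ from the computation actually used.

```python
def similarity(backward_teams, forward_teams):
    """Teams common to both clusters divided by distinct teams in either cluster."""
    b, f = set(backward_teams), set(forward_teams)
    union = b | f
    return len(b & f) / len(union) if union else 0.0


def similarity_matrix(backward, forward):
    """Nested dict: rows are backward clusters, columns are forward clusters."""
    return {
        b_name: {
            f_name: similarity(b_teams, f_teams)
            for f_name, f_teams in forward.items()
        }
        for b_name, b_teams in backward.items()
    }


# Toy example with hypothetical team identifiers.
backward = {"PS": [1, 2, 3], "PF": [4, 5]}
forward = {"F0": [1, 2, 4], "F1": [3, 5]}
matrix = similarity_matrix(backward, forward)
```

Because the score is computed for every backward-forward pair independently, the cluster names on the two axes need not correspond, which is why no filled diagonal is expected.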

Fig. 9

Similarity matrix between the clusters computed on the learning outcomes and performance metric (backward analysis - rows) and those computed on the engagement features listed in Table 1 (forward analysis-columns)

In order to interpret the matrix, let us look at Figs. 6 and 8, along with Fig. 9. Starting from the backward clusters, we can observe that the majority of the teams belonging to low-learning clusters (i.e., cluster non-PS - non-Productive Success and cluster non-PF - non-Productive Failure in Fig. 6) fall in the forward cluster \(F^{1}_{all}\) (\(S^1_{non{\text {-}}PS} = 0.37\), \(S^1_{non{\text {-}}PF} = 0.52\)), which in fact is the one with the lowest average learning gain values (see Figs. 8 and 9). Similarly, the majority of the teams belonging to the high-learning clusters (i.e., cluster PF - Productive Failure and cluster PS - Productive Success in Fig. 6) fall in the forward clusters \(F^{0}_{all}\) (\(S^0_{PF} = 0.40\), \(S^0_{PS} = 0.37\)) and \(F^{3}_{all}\) (\(S^3_{PF} = 0.46\), \(S^3_{PS} = 0.41\)), which have significantly higher learning gain values (see Figs. 8 and 9).

The aforementioned analyses show that there are similarities in the composition of clusters generated by evaluating the teams’ learning and performance and those generated by considering their behavior, captured by features extracted from logs, video and audio data. Concretely, in both cases, teams with low learning are grouped together and separated from high-learning teams. This indicates that, irrespective of performance during the task, teams that end up with higher learning exhibit behavioral patterns that can be clearly distinguished from those of teams that do not end up learning. In accordance with the definition put forth in Sect. 3, we deem the teams displaying behavioural patterns conducive to learning as Productively Engaged, as opposed to those whose behaviour, albeit possibly appearing engaged and even leading to good performance in the task, is not conducive to learning (non-Productive Engagement). We conclude that the reported analysis supports our hypothesis of the existence of a link between behavioral patterns and learning. Moreover, it paves the way for the design of robot behaviours, via the definition of Productive Engagement, which aim at putting learners in the best conditions for learning, by optimizing their engagement to that end.

6.2.2 Type-Specific Forward Analysis

The forward analysis presented in the previous section relies on features extracted from action logs, video and audio data. In an effort to verify the robustness of our findings, as well as restrict the feature set, we decided to replicate the forward analysis by first considering only the features extracted from the logs and then only the features extracted from the video and audio data. This separation is based on the idea that log-features are task-specific and, as captured by Table 1, mostly cognitive, while the other two data sources provide mostly social features (both cognitive and affective). Hence, an additional motivation for the analysis is therefore to check whether features of one type contribute more than the other to explaining the results seen in Sect. 6.2.1.

Performing PCA and K-means clustering on the log features (first section of Table 1) returns 3 clusters along 2 significant PCs (accounting for 55% of the variance within the features dataset, with the fourth component only contributing 10%), as shown in Fig. 10. The similarity matrix given in Fig. 12 between the backward (on learning outcomes and performance metric) and forward (on behavioral features) clusters shows results similar to those obtained when considering all features. The low-learning backward clusters (i.e., cluster non-PS - non-Productive Success and cluster non-PF - non-Productive Failure in Fig. 6) fall more in the forward cluster \(F^{1}_{logs}\) (\(S^1_{non{\text {-}}PS} = 0.44\), \(S^1_{non{\text {-}}PF} = 0.59\)) while the high-learning backward clusters (i.e., cluster PF - Productive Failure and cluster PS - Productive Success in Fig. 6) fall more in the other two forward clusters \(F^{0}_{logs}\) (\(S^0_{PF} = 0.68\), \(S^0_{PS} = 0.40\)) and \(F^{2}_{logs}\) (\(S^2_{PS} = 0.41\)) (see Figs. 11 and 12). However, a Kruskal-Wallis test run pairwise for the forward clusters over the learning outcomes shown in Fig. 11 reports no statistically significant difference; the only near-significant results are for the pair \((F^{0}_{logs},F^{2}_{logs})\) (mean_LG_abs: \({p = 0.060}\), mean_LG_rel: \({p = 0.065}\), mean_LG_joint: \({p = 0.096}\)).

Fig. 10

Clustering of teams based on their behavioural pattern (extracted from log features only)

Fig. 11

Learning outcomes and performance metric (averaged within cluster) for the clusters computed with the forward approach using log features only. Dashed horizontal lines indicate the metrics’ global averages. No statistically significant difference is found

Fig. 12

Similarity Matrix between the clusters computed on the learning outcomes and performance metric (backward analysis - rows) and those computed on the log features listed in the top section of Table 1 (forward analysis - columns)

Similarly, following the backward and forward approach when using only the video and audio features (see Figs. 13, 14, and 15), we reach the same conclusion as before. The low-learning backward clusters (i.e., cluster non-PS - non-Productive Success and cluster non-PF - non-Productive Failure in Fig. 6) fall more in the forward cluster \(F^{2}_{v\_a}\) (\(S^2_{non{\text {-}}PS} = 0.54\), \(S^2_{non{\text {-}}PF} = 0.44\)), which in fact is the one with the lowest average learning gain values. On the other hand, the high-learning backward clusters (i.e., cluster PF - Productive Failure and cluster PS - Productive Success in Fig. 6) fall more in the other two forward clusters \(F^{0}_{v\_a}\) (\(S^0_{PF} = 0.42\), \(S^0_{PS} = 0.47\)) and \(F^{1}_{v\_a}\) (\(S^1_{PF} = 0.36\)) (see Figs. 14 and 15), which have higher learning gain values. However, a Kruskal-Wallis test run pairwise for the forward clusters over the learning outcomes shown in Fig. 14 reports no statistically significant difference.

Fig. 13

Clustering of teams based on their behavioural pattern (extracted from video and audio features only)

Fig. 14

Learning outcomes and performance metric (averaged within cluster) for the clusters computed with the forward approach using video and audio features only. Dashed horizontal lines indicate the metrics’ global averages. No statistically significant difference is found

Fig. 15

Similarity Matrix between the clusters computed on the learning outcomes and performance metric (backward analysis - rows) and those computed on the video and audio features listed in the middle and bottom sections of Table 1 (forward analysis - columns)

The results of the type-specific analyses suggest that (1) the results obtained in the global analysis of Sect. 6.2.1 are robust (since type-specific analyses are in line with them, either isolating high-learners or low-learners), and (2) the results obtained in the global analysis are produced by the combined effect of all types of features (since type-specific analyses fail to produce statistically significant results). The latter conclusion is an indirect indication of the multi-dimensional, multi-faceted nature of human engagement, which makes it such a challenging and fascinating research topic.

7 Conclusion and Future Work

As outlined in Sect. 3, our goal is to pave the way for a new way of designing social robots for learning. The behavior of these robots is driven by the effects it will ultimately have on the user’s learning, via the effect it has on the user’s engagement, inspired by the findings in the fields of Educational HRI and Multi-modal Learning Analytics about the existence of a link between engagement and learning. Fundamental pre-requisites for achieving that goal are that (1) it is possible to compute an approximation of user engagement which is devoid of human intervention, to allow for its automatic online extraction (RQ2); (2) the operationalization of engagement obtained in step 1 preserves the link with user learning (RQ1). The results we have obtained, reported in Sect. 6, support both hypotheses. Briefly, this paper explores the link between engagement and learning and, thus, proposes the concept of Productive Engagement, its validation in an HRI data set, and considerations on its consequences.

Firstly, we conclude that there are behavioral features, pertaining to task and/or social engagement, that predict learning outcomes and that these features are sometimes disconnected from performance in the task. To elaborate, in light of the results in Sect. 6, we observe that the teams that end up achieving a higher learning gain (i.e., cluster PF - Productive Failure and cluster PS - Productive Success in Fig. 6) in the JUSThink activity may or may not appear to perform well in the task itself. However, irrespective of their performance, the way those teams interact with the task and express themselves through speech, facial expressions and gaze is distinct from the behavior of the teams who achieve lower learning gains (i.e., cluster non-PS - non-Productive Success and cluster non-PF - non-Productive Failure in Fig. 6). Hence, these patterns of observable behaviors support the hidden hypothesis of Productive Engagement.

Secondly, we conclude that this hidden hypothesis paves the way for the design of machine-learning engagement detection models where the labelling of the state of engagement would not need a human annotator but would rather come from the data itself. Specifically, the link between the behavioral patterns and the learning outcomes and the performance metric, in the form of statistically significant differences found with KW and the similarity matrix shown in Sect. 6.2.1, allows us to label the teams in forward clusters \(F^{0}_{all}\) and \(F^{3}_{all}\) as Productively Engaged and the teams in forward cluster \(F^{1}_{all}\) as Non-productively Engaged. At the same time, the results show that the proposed procedure seems better at isolating high-learners than low-learners (see results in Sect. 6.2.1 based on the similarity matrix). This finding seems to suggest that while the behavior of people closer to the pedagogical goal of understanding the concept tends to be more distinctive and identifiable, the behavior of people who are (and will end up) not learning is more varied and harder to characterize. Intuitively, this finding is reminiscent of Thomas Edison’s famous quote about the many ways in which something can go wrong, and the only (or few) ways in which it can go right.

With this said, performance is usually a biasing factor for humans when annotating a subjective construct like engagement in such activities. A robot equipped with the aforementioned knowledge of Productive Engagement would instead base its interventions not on whether a team is failing in the task, but on more sophisticated patterns of interaction of a team with the task and with the social environment, including the partner and the robot itself.

Furthermore, the analysis presented in this paper considers features computed at global level, i.e., at the end of the interaction. The next logical step along the path that we aim to walk is to transform the features of interest into time-series and verify whether the correlation with learning that we found at a global level still holds in the progression. To further investigate in this direction, as a second step, we plan to design a supervised time-series model with labels adapted through the hidden hypothesis established in the baseline JUSThink scenario, i.e., where the robot’s interventions are minimal in order to reduce the confounding effects. The idea is then, as a third step forward, to put the model to test in a real-time scenario where the robot will adapt its behaviors according to the concept of productive engagement. The model will, thus, help the robot to answer the question of when to intervene effectively. However, to determine what behavior to induce in the user while designing for effective robot interventions, the next logical step we envision for this research is the characterization of the forward clusters obtained in Sect. 6.2.1 in terms of the contributions of the single features, and emerging differences between high- and low-learners. The aim is to acquire a deeper understanding of the link between engagement and learning, and therefore reach a refined and more solid definition for Productive Engagement.