1 Introduction

Social robots can take many different forms [1, 2], and those that interact with humans as companions or helpers are in increasing demand [3]. For example, socially interactive robots [4] have moved from research laboratories into shopping malls [5] and consumer markets (e.g., Cozmo and RoBoHoN) to provide service and entertainment. According to the International Federation of Robotics, 1.7 million social and entertainment robots were sold in 2015, and sales are projected to exceed 7.4 million units in 2025 [6]. In another example, socially assistive robots [5] have been gradually adopted to assist in the health care of children [7], adults [8], and the elderly [9].

For successful long-term interactions with humans, it is recommended that social robots identify returning users, remember past interactions with them, and get to know them in order to offer personalized responses [10, 11]. For example, a user may be annoyed by the same, non-personalized greeting question from a robot: “Do you mind telling me more about yourself?” As another example, the personal spatial zone a user prefers during human–robot interaction is influenced by that user’s personality [12]. It is therefore important for a social robot to learn the personal information and idiosyncrasies of a user so as to foresee and meet the social, emotional, and cognitive needs of that user [13].

Despite its importance, understanding or profiling a user is a challenging task, especially when it concerns hidden psychological variables (e.g., traits, states, values, and preferences). Although not directly observable, a user’s psychological variables (e.g., personality) have been inferred indirectly from behavioral manifestations (e.g., linguistic or prosodic patterns [14]) or psychological tests (e.g., a personality questionnaire [15]) to facilitate human–robot interaction (HRI). Note, however, that inferences from behavioral manifestations usually rely on observations from a third-person perspective, and hence the true values of the inferred psychological variables remain unknown to the observer, be it a social robot or a real person. By contrast, psychological tests, particularly those with sound psychometric properties, can reliably estimate the values of an individual’s psychological variables based on the individual’s first-person responses.

Although psychological tests have been utilized in previous HRI studies, overall they have not been heavily employed as user-profiling tools or used to their full potential for improving HRI. In most HRI studies, psychological tests were administered by researchers using paper questionnaires to understand a user before or after HRI [15] rather than during HRI, where a robot could dynamically adapt its behavior to individual users. Only a few recent studies have pioneered the possibility of having a robot administer psychological tests during HRI, and all of these robot-administered psychometric evaluations followed the standard testing procedures of human-administered tests (e.g., [16]), which may sometimes appear uninteresting or even tedious to users.

To circumvent this formality problem, here we propose using asynchronous test questions (ATQs) as a user-friendly way of administering a psychological test during HRI. Specifically, our proposed testing procedure consists of three steps: (1) obtaining the items or questions of a psychological test; (2) embedding these questions, one or a few at a time, into contextually relevant periods of a conversation during HRI as asynchronous mini-tests; and (3) aggregating all the answers to these ATQs to score the psychological test. In contrast to a traditional psychological test, which usually assesses one psychological domain with items temporally grouped together, ATQs from a psychological test are not given to a testee all at once. As a result, ATQs that assess a psychological domain as a whole are less susceptible to issues of sustained attention and cognitive demand than their original test. Furthermore, because ATQs are presented casually during contextually relevant periods of a conversation rather than formally in a test setting, a testee is less likely to be self-conscious about being tested and to modify his/her responses accordingly [17].
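To make the three-step procedure concrete, the following Python sketch distributes the items of a test across predefined conversational contexts and aggregates the collected answers for scoring. It is a minimal illustration only; all names and the context-matching rule are our own, not part of any deployed system.

```python
# A minimal sketch of the three-step ATQ procedure (hypothetical names).
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ATQ:
    item_id: int      # item number in the original psychological test
    question: str     # conversational rewording of the test item
    context: str      # HRI event the question is tied to, e.g., "greeting"

@dataclass
class ATQSchedule:
    pending: List[ATQ]                                    # step 1: adapted items
    answers: Dict[int, int] = field(default_factory=dict)

    def next_question(self, current_context: str) -> Optional[ATQ]:
        """Step 2: return an item whose intended context matches the
        conversation's current context; None means keep chatting."""
        for atq in self.pending:
            if atq.context == current_context:
                self.pending.remove(atq)
                return atq
        return None

    def record_answer(self, atq: ATQ, likert_score: int) -> None:
        self.answers[atq.item_id] = likert_score          # 1-5 Likert code

    def score(self, scoring_fn):
        """Step 3: aggregate all answers once every item has been asked."""
        assert not self.pending, "some ATQs were never asked"
        return scoring_fn(self.answers)
```

In such a setup, a remote operator or dialog manager would call next_question whenever the conversation enters one of the prearranged contexts, so the test items surface only where they fit the ongoing interaction.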

As has been reported previously, different formats of the same psychological assessment—such as computer, questionnaire, or interview—can lead to differences in evaluations [18]. To examine whether a temporally fragmented psychological test can yield results comparable to those of the original test, and thus substitute for it, the present study used a social robot to administer a big-five personality test via the ATQ procedure amidst a broader HRI session for each research participant. We chose a big-five test because big-five personality measures are commonly used predictors of human behavior [19, 20] and important dimensions in HRI [21]. Each participant’s verbal responses to the ATQs were then compared with his/her written responses to the same personality test to validate the ATQ procedure. The ideal result would be a perfect positive correlation between verbal and written responses to the same test, although in theory the upper bound of this correlation is the test-retest reliability of the test. The procedure and results of our ATQ experiment are detailed in the following sections.

2 Methods

The ATQ experiment was part of a larger human–robot interaction study on both young and older adults. Approved by the Research Ethics Office of National Taiwan University (REC 201803HS017), the larger 1-h study consisted of three main HRI events: robot-administered cognitive testing, followed by robot-accompanied toy-playing, and then robot-assisted tablet-using. All participants gave written informed consent for their participation in the study.

2.1 Human Participants

Relative to young adults, older adults tend to have stronger negative attitudes toward robots [22], which may, in turn, affect their human–robot interactions in general and our robot-administered psychological testing in particular. Therefore, we experimented with our ATQ procedure on two age groups of participants to examine its suitability for use with both young and older adults, which cannot be taken for granted [23].

There were 26 participants in the young group (13 males and 13 females; mean age: 21.42 years; age range: 18–29) and 20 participants in the older group (11 males and 9 females; mean age: 72.25 years; age range: 67–81). None of the participants had auditory impairments. Each experimental session consisted of a participant interacting with a social robot in a well-lit room with a one-way mirror, and the whole HRI session was recorded by a hidden camera.

2.2 Social Robot

We used a programmable humanoid robot—RoBoHoN (Sharp Co., Ltd.)—in our study. RoBoHoN is 19.5 cm tall when standing. So that research participants could maintain eye contact with RoBoHoN during HRI, the RoBoHoN unit used in our experiment was placed on a table to converse with the seated research participant (Fig. 1). RoBoHoN has built-in speech-to-text and text-to-speech engines for speech recognition and production, respectively. Although it could run fully autonomously, our RoBoHoN was remotely controlled by a human operator in a dark observation room so as to accurately detect sentence endpoints in participant speech and to manage conversational contingencies, such as speech pauses, repetitions, and queries, that arise during memory recall and decision-making.

Fig. 1 A side view of the HRI environment

2.3 Psychological Test

We used the Ten Item Personality Inventory (TIPI), in which each of the five personality dimensions—Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience—is measured by two items [24]. The inventory was administered twice to each participant to validate whether the original test can be substituted by its ATQs. The first administration was verbal, by RoBoHoN, using ATQs spread across each 1-h HRI session (Fig. 2); the second was on paper, by an experimenter, right after the whole HRI session as a post-study questionnaire in TIPI’s original form. The order of the two administrations was not counterbalanced, to prevent participants from experiencing ATQs as repeated “test” questions, which would defeat the purpose of using ATQs in non-testing contexts. Both administrations imposed a 5-point response format on each item: Strongly Agree = 5, Slightly Agree = 4, Neutral = 3, Slightly Disagree = 2, and Strongly Disagree = 1. Whenever a participant’s response to an ATQ could not be clearly mapped onto this 5-point Likert scale, RoBoHoN would verbally describe the five response options with the follow-up question: “Do you strongly or slightly agree, strongly or slightly disagree, or are you neutral?”
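For concreteness, the sketch below shows how the ten 5-point answers can be aggregated into the five dimension scores. It assumes the standard published TIPI scoring key [24], in which the second item of each pair (items 2, 4, 6, 8, and 10) is reverse-scored; on this 5-point format, a reverse-scored rating r becomes 6 - r.

```python
# Sketch of TIPI scoring under the study's 5-point format, assuming the
# standard TIPI key [24]: items 2, 4, 6, 8, and 10 are reverse-scored.
DIMENSIONS = {
    "Extraversion": (1, 6),
    "Agreeableness": (7, 2),
    "Conscientiousness": (3, 8),
    "Emotional Stability": (9, 4),
    "Openness to Experience": (5, 10),
}
REVERSED = {2, 4, 6, 8, 10}

def score_tipi(answers):
    """answers: dict mapping item number (1-10) to a rating from 1 to 5."""
    def keyed(item):
        # Reverse-score on the 1-5 scale where applicable.
        return 6 - answers[item] if item in REVERSED else answers[item]
    return {dim: (keyed(a) + keyed(b)) / 2 for dim, (a, b) in DIMENSIONS.items()}
```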

Fig. 2 Normalized temporal positions of the 10 ATQs in each 1-h HRI session. Each color represents a different ATQ to show the variability across sessions. Each number associated with a color corresponds to the item number listed in Tables 1 and 2, and zero indicates other HRI events outside the scope of the present study

The ATQs used in this study were not a verbatim copy of the original TIPI items [24]; they were adapted to sound natural in human–robot conversations. The ten pairs of personality-describing adjectives, shown in Tables 1 and 2, were embedded in sentences such as “Are you, in general, an [anxious, easily upset] person?”, “Do you consider yourself a [conventional, uncreative] person in general?”, or “I’m curious whether you are a [dependable, self-disciplined] person in general.” By comparison, the original instruction of TIPI is: “Here are a number of personality traits that may or may not apply to you. Please write a number next to each statement to indicate the extent to which you agree or disagree with that statement. You should rate the extent to which the pair of traits applies to you, even if one characteristic applies more strongly than the other.”

It should be noted that the human–robot conversational contexts were prearranged such that the 10 pre-programmed ATQs were asked by RoBoHoN as parts of small talk before or after an interaction event, such as toy-playing or cognitive testing. As a result, the presentation order of the 10 personality items was fixed across participants and not identical to the order in the original test. For example, when RoBoHoN greeted a participant at the beginning of the study, it expressed interest in learning more about the participant and asked whether he/she was a conventional person. In another example, right before a toy-playing event, RoBoHoN asked a participant whether he/she was a person open to new experiences.

3 Results

To compare each participant’s verbal and written responses to the same psychological test, the participants’ verbal responses to ATQs were first coded from the video recordings by two independent coders from our research team who knew the purpose of the present study but were not given the participants’ written responses at the time of behavioral coding. The coding instructions were as follows: “Please help label the degree of agreement expressed in each participant’s verbal response by a number from one to five, with one being a strong disagreement, two being a slight disagreement, three being neutral, four being a slight agreement, and five being a strong agreement. If a particular response of a participant is difficult to judge, please make your best guess based on the response patterns of that participant, if any.”

There were indeed difficult cases and minor coding differences (ordinal Krippendorff’s α = 0.995 and 0.87 for the younger and older groups, respectively). For example, an older participant shook her head without giving a verbal response to an ATQ but then verbally answered “Agree” when RoBoHoN reminded her of the five response options. The two coders discussed such difficult cases in person to resolve their coding differences and reach a consensus on each item score of each verbal response.
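For readers who wish to reproduce this agreement statistic, ordinal Krippendorff’s α for two coders can be computed, for example, with the open-source krippendorff Python package. The sketch below uses made-up codes, not our actual data.

```python
# Sketch: inter-coder agreement on ordinal Likert codes with two coders,
# using the open-source `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

# Rows = coders, columns = coded responses; np.nan marks a missing code.
coder_a = [5, 4, 3, 2, 1, 4, np.nan]
coder_b = [5, 4, 3, 1, 1, 4, 3]

alpha = krippendorff.alpha(
    reliability_data=np.array([coder_a, coder_b], dtype=float),
    level_of_measurement="ordinal",
)
print(f"ordinal Krippendorff's alpha = {alpha:.3f}")
```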

The final item scores were then compared with the scores obtained from the written responses of the same participants. Below we report the results of Pearson correlations, paired t-tests, and their associated equivalence tests [25]. For each test item or personality dimension, the ideal result would be a perfect positive correlation and no difference between verbal and written responses, which translates to a correlation that is significantly larger than 0 and not statistically equivalent to 0, together with a difference that is not significantly different from 0 and is statistically equivalent to 0.
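To make this analysis logic explicit, the sketch below computes the three statistics for one item. The equivalence bound delta is illustrative only; the study’s actual equivalence-testing choices follow [25].

```python
# Sketch of the per-item analysis: Pearson r, paired t-test, and a paired
# TOST equivalence test on the differences. `delta` is illustrative only.
import numpy as np
from scipy import stats

def analyze_item(verbal, written, delta=0.5):
    verbal = np.asarray(verbal, dtype=float)
    written = np.asarray(written, dtype=float)
    r, p_r = stats.pearsonr(verbal, written)    # H0: correlation = 0
    t, p_t = stats.ttest_rel(verbal, written)   # H0: mean difference = 0
    # TOST: reject both one-sided nulls (diff <= -delta and diff >= +delta)
    # to conclude that verbal and written scores are statistically equivalent.
    d = verbal - written
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)
    p_lower = stats.t.sf((d.mean() + delta) / se, df=n - 1)
    p_upper = stats.t.cdf((d.mean() - delta) / se, df=n - 1)
    return {"r": r, "p_r": p_r, "t": t, "p_t": p_t,
            "p_tost": max(p_lower, p_upper)}
```

The equivalence of a correlation to 0 can be tested analogously against a smallest correlation of interest; we omit that step here for brevity.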

3.1 The Young Adult Group

The results of the younger participants are summarized in Table 1. For almost all of the ten test items and five personality dimensions, both null-hypothesis and equivalence tests suggest very strong positive correlations and little difference between the participants’ verbal and written responses. In other words, the young participants responded similarly to the same questions regardless of whether the questions were presented synchronously or asynchronously, suggesting good validity of our proposed ATQ testing procedure for the young adult population.

Table 1 The correlations and differences in means between verbal and written responses of the young group (N = 26)

3.2 The Older Adult Group

The analysis results of the older participants are summarized in Table 2. Overall, the results were not as ideal as those of the young participants. Several correlations between the two response formats were weak or even statistically indistinguishable from 0. Moreover, the distributions of verbal and written responses were not statistically equivalent for some test items and personality dimensions. In particular, the ratings of the Openness to Experience trait were not very consistent between the two testing procedures, suggesting potential problems in the use of the ATQ procedure with the older adult population.

Table 2 The correlations and differences in means between verbal and written responses of the older group (N = 20)

4 Discussion

We used a brief personality test as an example to explore the possibility of using temporally fragmented psychological tests for user-friendly psychological assessment in human–robot interaction. As a first attempt at such an endeavor, the present study examined whether a psychological test, when administered informally in the form of asynchronous test questions, could still yield assessment results comparable to those of a formal psychological test. Our results showed that the proposed ATQ testing procedure was quite successful with young adults but only partially successful with older adults.

The discrepancies between participants’ verbal and written responses, especially those observed in older adults, could result from several sources. First, participants, particularly those low in behavioral consistency, might not respond consistently to the same test question even with the same administration method. Second, participants, when receiving the same question through different perceptual modalities (audition vs. vision), might engage distinct attentive/memory processes [26, 27] and thus make different choice decisions. Third, participants might mishear some questions unnaturally pronounced by a robot, a situation that has been observed during robot-administered cognitive assessments [16]. Fourth, participants might be less aware of being tested, and hence less defensive, when answering informal ATQs than formal test questions. Fifth, when asked the same question, participants might be more willing to disclose themselves to a robot than to a researcher; previous studies have found people to be more willing to share personal information with non-judgmental computer agents [18, 28].

While all the aforementioned factors might have affected the experimental results, and the current study design cannot distinguish their respective contributions, the proposed ATQ procedure, when applied to the young adults, still yielded results on par with those of a conventional paper-based test. By contrast, some of these factors might affect older adults more markedly than young adults. Revisiting the video recording of each HRI session allowed us to offer some speculations as to why the ATQ procedure did not work as well for the older adults and how this might be overcome in the future. Below we elaborate on these findings together with other implications of this exploratory study.

4.1 Validity, Reliability, and Applicability of ATQs

The validity and reliability of ATQs are limited by those of their original psychological test. For example, in our young group, the Pearson correlations between participants’ verbal and written responses ranged from 0.77 (Openness to Experience) to 0.88 (Extraversion), which are comparable to the six-week test-retest reliabilities, ranging from 0.62 (Openness to Experience) to 0.77 (Extraversion), reported in the original TIPI study on young adults [24]. Additionally, as evidenced by the marked differences between the results of the two age groups in our study, the psychometric properties of ATQs may vary across tested populations in a way similar to their original test, such as the overall reduced test-retest reliability of TIPI in older adults relative to young adults [29].

The applicability of ATQs is critically constrained by the cognitive capabilities of testees. It is important to note that questions asynchronously presented in a conversation and questions simultaneously presented on paper are processed respectively through audition and vision, two fundamentally different perceptual modalities. Thus, unlike the questions of a paper-based test, which can be seen all at once, ATQs can only be heard sequentially, and a testee cannot voluntarily reinspect words that have already been presented or modify an answer to an earlier ATQ. Consequently, if a testee cannot maintain auditory attention or short-term memory throughout the entire presentation of an ATQ, he/she may not process the question as fully as in a paper-based test.

As a case in point, the older participants in our study might not have processed ATQs thoroughly. For example, when responding to the test item “Do you consider yourself a conventional, uncreative person in general?”, many of our elderly participants paused, pondered, and then responded, “Yes, I’m conventional.” Such a narrow focus on parts of a heard sentence could result from an age-associated reduction in the capacity of attention and short-term memory [30] or a developed habit of selective attention [31]. This can explain why older participants agreed more with the compound question “conventional, uncreative” in their verbal than in their written responses: sequential auditory processing might have accentuated the earlier “conventional” part [32], with which they agreed more than with the later “uncreative” part.

Moreover, a cognitively demanding task that precedes particular ATQs may set up a stressful or even frustrating context and induce age-based stereotype threats [33, 34] in older but not young adults when they answer those ATQs. In our study, the ATQ “anxious, easily upset” was asked right after a series of RoBoHoN-administered cognitive tests on which the older participants performed much worse (not shown here) than the young participants, presumably because of cognitive decline [35, 36, 37]. These tests might therefore have been rather challenging to our older participants and induced stress responses, including negative affect [38, 39]. By contrast, the paper version of our personality test was given right after a task that was relatively easy for both young and older participants, which could explain why our older participants agreed less with being “anxious, easily upset” in their written than in their verbal responses.

4.2 Guidelines for Using ATQs

Based on the above discussions, we recommend the following for effective use of ATQs:

  (1) Basing questions on a psychological test that has good validity and reliability, to avoid invalid measures such as the unreliably measured Openness to Experience in the present study;

  (2) Adapting questions for the target population in a way that is unambiguous to them, to avoid unintended responses such as our older participants’ partial answers to the compound questions;

  (3) Asking the same question on different occasions if its answer may lack cross-situational consistency, such as our older participants’ responses to the “anxious, easily upset” question.

Importantly, although the validity and reliability of ATQs may vary across populations, it is unnecessary and impractical to administer a paper version of the same test in addition to ATQs, especially when ATQs are applied in commercial products. This is because the test-retest reliability of ATQs can be evaluated directly using our third recommendation, and arguably the ultimate validity of ATQs is whether they help estimate the internal variables of a person and improve predictions of that person’s behavior [40, 41].

Last but not least, because estimated personal characteristics may be exploited for malicious purposes [42], such sensitive information should be securely protected [43], and users have the right to decline any form of psychological testing in the first place [44], including ATQs. One way to address users’ privacy concerns is to ask for their informed consent [45] so that they are aware of being profiled. It also helps to resolve the personalization–privacy paradox if personal information can be stored locally inside a conversational agent rather than in the cloud; this approach has proved effective in reducing smartphone users’ perception of privacy violations [46].

4.3 Possible Future Directions

The present study has several limitations and can be extended in the following directions.

  • Exploring the longest temporal window within which the ATQ procedure can remain valid;

  • Using a psychological test with more items to improve test reliability [47, 48];

  • Embedding a question into a contextually relevant conversation by an automatic information retrieval mechanism [49];

  • Taking non-Likert, natural language responses from participants as they are and scoring them by fuzzy methods [50] (see the sketch after this list).
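As a rough illustration of the last direction, a natural-language answer could first be mapped to fuzzy membership degrees over the five Likert levels and then defuzzified into a single score. The sketch below is purely hypothetical: the lexicon and membership weights are invented for illustration and are not taken from [50].

```python
# Hypothetical sketch of fuzzy scoring for free-text answers: map an answer
# to membership degrees over the five Likert levels, then defuzzify with a
# weighted (centroid-style) mean. Lexicon and weights are invented.
LEXICON = {
    "definitely":     {5: 0.9, 4: 0.1},
    "sort of":        {4: 0.6, 3: 0.4},
    "not really":     {2: 0.7, 3: 0.3},
    "absolutely not": {1: 0.9, 2: 0.1},
}

def fuzzy_score(answer: str):
    memberships = {level: 0.0 for level in range(1, 6)}
    for phrase, levels in LEXICON.items():
        if phrase in answer.lower():
            for level, weight in levels.items():
                memberships[level] = max(memberships[level], weight)
    total = sum(memberships.values())
    if total == 0:
        return None  # no match; fall back to stating the five options explicitly
    return sum(level * m for level, m in memberships.items()) / total

print(fuzzy_score("Hmm, I'm sort of an anxious person."))  # -> 3.6
```

A production system would need a far richer lexicon (or a learned classifier), with the explicit five-option follow-up question described in Sect. 2.3 as a fallback.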

These possible extensions are important for real-world applications: they would either clarify the boundary conditions of the ATQ testing procedure or automate its question-distributing and answer-scoring components. Future feasibility studies are needed to address these issues and put ATQs to use in real-life HRI.

5 Conclusion

The present study put forward and experimented with the possibility of administering a psychological test asynchronously by embedding items from the test into human–robot conversations. The proposed ATQ procedure was validated successfully in a young adult group but less so in an older group, and from these results we derived guidelines for the future effective use of this approach.

The ATQ procedure is designed as a user-friendly method of psychological testing during human–robot interaction. Social robots can leverage such a non-strenuous procedure to support fragile individuals in completing psychological tests or to profile general users for response personalization. Overall, the asynchronous testing procedure holds great promise for improving user understanding and thereby human–robot connections.

As a concluding remark, it should be pointed out that the ATQ testing procedure can, in theory, be generalized to other psychological tests and applied to other populations once the test questions are properly adapted for the target population. The proposed method can also be implemented beyond social robots designed for companionship or assistance, for example in text-based chatbots [51] and embodied conversational agents [52]. All in all, we hope that the ATQ testing procedure can help import the long-accumulated knowledge of psychology—in the crystallized form of psychological tests—into robotics for improving machine cognition and service.