Elsevier

Speech Communication

Volume 125, December 2020, Pages 24-40
A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool

https://doi.org/10.1016/j.specom.2020.09.004

Highlights

  • We analyze 3 corpora of task-oriented conversations in English, Slovak and Spanish.

  • Acoustic/prosodic features are studied with finer temporal granularity.

  • A novel metric captures the amount of information contained in each turn-taking cue.

  • All three languages are shown to share similar resources to signal turn transitions.

  • Speech rate, final-word lengthening and the pitch track carry most of the information.

Abstract

In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages (Slovak, American English and Argentine Spanish). For this, we use and refine a set of machine learning techniques that enable a finer-grained temporal analysis of such cues, as well as a comparison of their relative explanatory power. Our results suggest that the three languages, despite belonging to distinct linguistic families, share the general usage of a handful of acoustic/prosodic features to signal turn transitions. We conclude that exploiting features such as speech rate, final-word lengthening, the pitch track over the final 200 ms, the intensity track over the final 1000 ms, and noise-to-harmonics ratio (a voice-quality feature) might prove useful for further improving the accuracy of the turn-taking modules found in modern spoken dialogue systems.

Introduction

Corpus-based computational linguistics studies have opened new opportunities for answering questions about how human dialogue flows. In addition to more recent state-of-the-art prediction techniques, data-based studies have allowed us not only to study complex speech phenomena, but also to use the resulting knowledge in the creation of more natural spoken dialogue systems. Turn-taking management is a very good area for exploring these issues in depth.

Research over the last decades has shown that information about the upcoming turn transition seems to be present not only in what we say, but also in how we say it. In addition to textual cues (lexical, syntactic, pragmatic), prosodic cues also play a role in perceiving how the dialogue will unfold. In particular, Gravano and Hirschberg (2011) identify a group of seven turn-yielding cues and six backchannel-preceding cues in American English. They compute these cues from statistics over acoustic/prosodic features that are automatically extracted from the speech signal over the several hundred milliseconds preceding pauses in conversations. They also provide supporting evidence for Duncan’s theory, which holds that the number of turn-taking cues displayed affects listeners’ agreement on how the dialogue will unfold (Starkey and Fiske, 1977).

Nevertheless, more research is needed to better understand the amount of information these cues contain and the dynamics over time of these acoustic/prosodic cues – i.e., how informative each cue is and how its informativeness varies over time. Additionally, we know little about the cross-linguistic validity of these findings. Revealing aspects of how turn-taking cues affect human-human conversations in different languages may not only help the community understand the underlying process of communication, but also allow improvements in human-computer interfaces in languages in which annotated data may not be available.

In the present article, we study the similarities and differences of how acoustic/prosodic turn-taking cues are produced in three typologically different Indo-European languages: Slovak (Slavic), American English (Germanic), and Argentine Spanish (Romance). We address this task in a data-driven approach in which we use machine learning techniques to model turn transitions based on hours of labeled, naturally-spoken dialogue from the Objects Games Corpus collection – a series of conversations with no visual contact in which 38 pairs of subjects collaborate in simple object-positioning games. We expect to validate our findings by modeling data taken from the same experimental setup, and by using the exact same methodology, in all three languages. Therefore, the contribution of this work is twofold: 1) it brings novel evidence regarding the variation of turn-taking cues over time and their comparison across different languages, and 2) it provides insights about the use of machine learning as a tool for describing speech corpora.

Studies in turn-taking have traditionally been interested in the way interlocutors engage in dialogue and the dynamics of speaker change. In a seminal work, Sacks et al. (1974) propose that turn allocation is controlled by a set of fixed but flexible rules that allow an indeterminate number of participants to take part in a conversation with no interruptions or overlaps. Starkey and Fiske (1977) suggest that participants produce a number of prosodic, syntactic and even gestural cues that, in combination, contribute to the flow and naturalness of turn-taking in conversations. While some studies argue that acoustic/prosodic cues are not relevant and claim that lexical and syntactic information is sufficient for turn management, others (Duncan, 1974; Ford and Thompson, 2010; Ferrer et al., 2002; Wennerstrom and Siegel, 2003; Gravano and Hirschberg, 2011; Hjalmarsson, 2011; Bögels and Torreira, 2015; Ward, 2019, inter alia), including the aforementioned studies, show evidence suggesting that acoustic/prosodic cues based on pitch and duration, and syntactic features such as the position of a word in an utterance, play a key role in the turn-allocation mechanism.

While many studies, especially earlier ones, analyze English, turn-taking management has also been explored in other languages. Hjalmarsson (2011) performs a series of experiments to understand how turn-taking cues affect the perception of the interlocutor in Swedish conversational dialogues. She reinforces the importance of the additive effect of cues on the perception of upcoming turn transitions. Even though some studies claim that turn-taking systems differ strongly across cultures (Watson-Gegeo et al., 1976), others argue that some kind of ‘universals’ exist (Schegloff, 2006). For instance, Stivers et al. (2009) analyze ten different languages and explore the variability in response offsets in turn transitions. They arrive at a series of cross-culturally valid observations; for example, in all of the analyzed languages, speakers provide answer responses to questions significantly faster than non-answer responses, and confirmation answers are delivered faster than non-confirmation ones. In a cross-linguistic study on the perception of prosodic cues in Slovak and Argentine Spanish, Gravano et al. (2016) test the subjects’ predictions regarding turn-taking transitions in the two languages and show that some prosodic cues provide similar information in both languages, thus contributing to the aforementioned turn-taking ‘universals’. Closely following them, the present study contributes to this line of work and helps fill the knowledge gap in turn-taking behavior by comparing Germanic and non-Germanic languages.

From a modeling perspective, recent research has shown that acoustic/prosodic features can be used for the construction of turn-transition predictive models (Skantze, 2018; Maier et al., 2017; Hara et al., 2018; Roddy et al., 2018, inter alia). For instance, Skantze (2018) predicts future speech activity in dialogues using LSTMs – recurrent neural network models especially designed to learn contextual representations from time series. The model uses both interlocutors’ pitch, intensity, and spectral stability tracks together with a voicing mask every 50 ms; the author also presents a system for detecting turn-taking transitions (in particular turn continuations and switches) that relies on heuristics for automatically labeling turn transitions from speech activity labels. However, what these models learn is still not easy to analyze, which keeps the underlying knowledge about these aspects of human-human turn-taking management inaccessible for the moment. In contrast to these approaches, we intend to use machine learning as a descriptive tool to obtain information about a complex phenomenon through the exploration of models built from data. In this way, we intend to facilitate further advances in the research community in discovering and validating new findings.

The three chosen languages provide a good testing ground for studying which prosodic features (and their development over time) might be cross-linguistically valid and which might act as language-specific cues in turn-taking management. On the one hand, the prosodic systems of the three languages differ. For example, according to Hualde (2013), the most important difference between the intonation system of Spanish (and of other Romance languages) and that of English (and of other Germanic languages) is the flexibility found in this second group of languages in the placement of the nuclear accent. In English, the position of the nuclear accent can move to indicate focus on various constituents. In Spanish, on the contrary, the position of the nuclear accent is practically fixed, and, except in cases of narrow focus, it falls on the last syllable with a lexical accent. Slovak is a prototypical example of a hybrid system, which is characteristic of the Slavic languages (e.g. Jasinskaja (2016)) and combines the Romance and Germanic ones above: the information structure is expressed jointly by intonation and a movable nuclear accent (as in Germanic) as well as by a flexible word order and the tendency to move the focused element to the end of utterances (as in Romance). Since the location and type of pitch accents influence prosodic contours to a great extent, these differences might also shape the role that the contours play in turn-taking management.

On the other hand, the three languages share many characteristics in how prosody participates in information structure, signalling the intentions and mutual beliefs of the speakers. Graham (1978) argues that Spanish and English share certain intonation patterns such as rising pitch at the end of (polar) questions, which is certainly common in Slovak as well. However, some narrow-focus Slovak polar questions might also be realized with a plateau, or gradual fall, following the nuclear rising pitch. Additionally, the notion of the ‘continuation (rise)’ in the literature on English intonation is related to the notion of ‘incompleteness’ in the Romance and Slavic traditions of intonation descriptions. Hence, rising pitch followed by a pause should be interpreted in a forward-looking fashion: either the speaker wishes to continue or a response from the interlocutor is expected. While continuations in English are typically related to pitch rises, incompleteness in Slovak/Spanish has been linked to more variability in contours (e.g. Quilis, 1993; Král, 1988).

Typically, in the literature of turn-taking, studies that use confirmatory data analysis require a hypothesis to be specified before the design of the dataset and also need assumptions about the generation of the data by a given stochastic data model. Nevertheless, as described in Lin et al. (2007, p. 243), interpreting the results of methods outputting p-values or R-scores in high dimensional data can be difficult and misleading. For example, in the case of time series, researchers typically study specific pre-selected time frames or collapse the data dimension by averaging over time. In this way, they gain statistical power and avoid multiple comparisons problems at the expense of losing temporal detail and other information.
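To make the cost of averaging over time concrete, the following minimal sketch compares two pitch tracks whose values are invented for illustration (they are not taken from the corpora). The 50 ms sampling step, the semitone scale and the helper name slope are assumptions of the example. Both tracks have the same mean, so a time-averaged feature cannot tell them apart, while a simple slope estimate recovers the opposite final movements.

```python
from statistics import mean

# Hypothetical pitch tracks (semitones relative to the speaker mean),
# sampled every 50 ms over the final 300 ms before a pause.
# These values are invented for illustration only.
rising = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
falling = [2.0, 1.0, 0.0, 0.0, -1.0, -2.0]


def slope(track):
    """Least-squares slope of the track, in units per frame."""
    xs = range(len(track))
    x_bar = mean(xs)
    y_bar = mean(track)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, track))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


mean_rising, mean_falling = mean(rising), mean(falling)
slope_rising, slope_falling = slope(rising), slope(falling)

print(mean_rising, mean_falling)    # the two means are identical
print(slope_rising, slope_falling)  # the two slopes have opposite signs
```

Collapsing each track to its mean discards exactly the temporal information that distinguishes a final rise from a final fall.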

In contrast with classical statistics, machine-learning algorithmic models are built assuming data generation mechanisms are unknown. Methods such as random forests, support vector machines, and neural networks, among others, are known to produce powerful predictive models, are designed to handle variable interactions, and generally capture non-linearities in high-dimensional data.

Nevertheless, the vast majority of machine learning-based models are currently used as a powerful tool for achieving state-of-the-art results in predicting turn exchanges. To our knowledge, only a small number of studies analyze the models in ways that allow the scientific community to explain why a prediction is made. In this work, we focus our attention on an approach in which models are created to provide new evidence that may help reject or reinforce linguistic hypotheses, sometimes to the detriment of the models’ predictive power.

In the past, when model interpretability was needed, simple and transparent models such as linear regression or decision trees have been used to understand complex phenomena. Yet, it is essential to clarify that up to this day, interpretable models tend to be less powerful than fully black-box models, such as neural networks, especially when enough labeled data is available. Simple models usually suffer from high bias or high variance problems. That is, models underfit or overfit the data due to their design characteristics, which leads to low predictive power or low stability in the obtained results.

However, as explained in Breiman (2001b), transparency is not the only way of getting information from machine learning models. Methods such as Trees Impurity Importance (Breiman, 2001a), Permutation Feature Importance (Breiman, 2001a), Partial Dependence Plots (Friedman, 2001), LIME (Ribeiro et al., 2016), and Shapley Additive Explanations (Lundberg and Lee, 2017) have allowed the exploration of fully black-box models to a certain level. The goal of these methods is exploring not the internal structure of the model but how the model generates predictions. Unfortunately, especially in the area of speech processing — in which models predict based on processing and combining high-dimensional temporal series — the interpretation of complex models still remains an open problem.
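As an illustration of how a model can be probed from the outside, the sketch below implements the plain permutation feature importance idea in standard-library Python: shuffle one feature column at a time and record how much the accuracy drops. It is a simplified sketch, not the authors' procedure or any modified variant. The toy data, the stand-in predict function and the helper names accuracy and permutation_importance are all assumptions made for the example.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (invented): feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = ["switch" if row[0] > 0.5 else "hold" for row in X]


def predict(row):
    # Stand-in "black box" that only looks at feature 0.
    return "switch" if row[0] > 0.5 else "hold"


importances = permutation_importance(predict, X, y)
print(importances)  # large drop for feature 0, no drop for feature 1
```

The method never inspects the internals of predict, which is why it applies equally to transparent and black-box models.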

In this work we explore the use of a random forest classifier, a robust and competitive supervised classifier, along with a modified version of the permutation feature importance method. First proposed by Breiman (2001a), the random forest algorithm builds an ensemble of decision-tree classifiers whose predictions are produced individually and then combined into a final decision. This algorithm has shown competitive predictive power with almost no tuning effort. It handles complex and non-linear relations between inputs and outputs while remaining resistant to overfitting. Random forests are not as simple and transparent as linear models or decision trees; therefore, some effort must be put into how they are explored. See Biau and Scornet (2016) for a full description of the method.
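The ensemble idea can be sketched in a few lines of standard-library Python. This is a deliberately reduced illustration, not the authors' implementation: depth-one trees (stumps) stand in for full decision trees, features are subsampled once per tree rather than at every split, and the data are synthetic. The names fit_stump, fit_forest and predict_forest are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Depth-one tree: the threshold split with the fewest training errors."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not right:
                continue  # a threshold at the maximum value splits nothing
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:  # every candidate feature is constant
        return (features[0], 0.0, majority(y), majority(y))
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # number of features offered to each tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        features = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(left_label if row[j] <= t else right_label
                    for j, t, left_label, right_label in forest)
    return votes.most_common(1)[0][0]


# Synthetic data: four noisy features that each partly separate the classes.
data_rng = random.Random(42)
y = ["switch" if i % 2 else "hold" for i in range(150)]
X = [[(1.0 if lab == "switch" else 0.0) + data_rng.gauss(0, 0.3) for _ in range(4)]
     for lab in y]

forest = fit_forest(X, y)
train_accuracy = sum(predict_forest(forest, row) == lab
                     for row, lab in zip(X, y)) / len(y)
print(train_accuracy)
```

Bootstrap resampling and feature subsampling make the individual trees differ from one another, and the vote averages out their individual errors.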

The article is divided into two experiments. The first one addresses question Q1, how do the acoustic/prosodic features of speech compare in Argentine Spanish, Slovak, and American English just before a turn transition? The second one addresses question Q2, how much information do acoustic/prosodic features carry and what is their relative contribution when preceding a turn-taking transition?

Section 2 introduces the speech corpora on which we based the experiments, with particular detail on the annotations we created. In Section 3, we analyze the corpora by visualizing different acoustic/prosodic features over time and across languages; we compare the results with previous works and show the difficulties of working in high dimensional data. In Section 4, we address the problem of automatically classifying turn-taking events, paying special attention to revealing which features contain the most relevant information over time. We also test the stability of our results by varying the way features are extracted. Finally, in Section 5 we discuss the research results and present the outlines of future work.

Section snippets

Materials: the object games corpora

We used three versions of the Objects Games Corpus (first described in Gravano and Hirschberg (2011)), in American English, in Argentine Spanish, and in Slovak. In each, a collection of spontaneous task-oriented dyadic conversations elicited from native speakers playing Objects Games was gathered. Subjects were paid to play a series of collaborative computer games requiring verbal communication. Experiments took place in soundproof booths, each participant using a different laptop computer, and

Study 1: visualization of acoustic/prosodic features

In this first study, we address the question of how the acoustic/prosodic features of speech compare in Argentine Spanish, Slovak, and American English just before a turn transition.

To this end, we perform for each language a series of exploratory analyses with descriptive visualizations on how a number of acoustic/prosodic features behave on IPUs immediately preceding each turn-transition condition – namely, a turn exchange or switch (S), a turn continuation or hold (H), or a backchannel (BC)

Study 2: learning turn-transitions

We know from Starkey and Fiske (1977), and from quantitative measures in Gravano and Hirschberg (2011), that turn-transition cues have an additive effect. Turn-yielding and turn-holding cues do not occur in isolation: the more cues signaling turn-hold or turn-yield, the higher the agreement among the listeners in identifying the subsequent turn type. Analyzing these cues separately does not give us the full picture. Additionally, it is not clear to what degree these patterns are

Conclusions

We conducted a number of experiments to explore similarities and differences between American English, Slovak and Argentine Spanish in the production of acoustic/prosodic cues before turn exchanges. We analyzed the speech in three corpora of spontaneous dyadic conversations, first through a series of visual explorations, and second by using machine learning techniques to predict turn transitions based on features from pause-preceding units.

In the first study, we saw in detail how IPU features

CRediT authorship contribution statement

Pablo Brusco: Conceptualization, Investigation, Software, Methodology, Visualization, Formal analysis, Data curation, Writing - original draft, Writing - review & editing. Jazmín Vidal: Methodology, Validation, Data curation, Writing - original draft, Writing - review & editing. Štefan Beňuš: Resources, Data curation, Investigation, Writing - original draft. Agustín Gravano: Resources, Conceptualization, Investigation, Methodology, Data curation, Writing - original draft, Writing - review &

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by Argentina’s National Agency for Scientific and Technological Promotion (ANPCYT), PICT-PRH 2009-0026 and PICT 2014-1561; the University of Buenos Aires, UBACYT 20020090300087 and 20020120200025BA; the Bilateral Cooperation Program between Argentina’s National Scientific and Technical Research Council (CONICET) and the Slovak Academy of Sciences (SAS); the VEGA 2/0161/18 grant from the Slovak Scientific Granting Agency, and the Air Force Office of Scientific Research,

References (49)

  • J. Cohen. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960)
  • S. Duncan. On the structure of speaker-auditor interaction during speaking turns. Lang. Soc. (1974)
  • F. Eyben et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. (2016)
  • F. Eyben et al. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. Proceedings of the 2013 ACM Multimedia Conference (2013)
  • L. Ferrer et al. Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody. 7th International Conference on Spoken Language Processing, ICSLP 2002 (2002)
  • C.E. Ford et al. Interactional units in conversation: syntactic, intonational, and pragmatic resources for the management of turns. Interact. Grammar (2010)
  • J.H. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Stat. (2001)
  • R. Graham. Intonation and emphasis in Spanish and English. Hispania (1978)
  • A. Gravano et al. Classification of discourse functions of affirmative words in spoken dialogue. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (2007)
  • A. Gravano et al. Who do you think will speak next? Perception of turn-taking cues in Slovak and Argentine Spanish. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (2016)
  • K. Hara et al. Prediction of turn-taking using multitask learning with prediction of backchannels and fillers. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (2018)
  • M. Heldner et al. Prosodic features in the vicinity of silences and overlaps. Proc. 10th Nordic Conference on Prosody (2008)
  • M. Heldner et al. Voice quality as a turn-taking cue. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (2019)
  • J.I. Hualde. Los sonidos del español: Spanish Language Edition (2013)