1 Introduction

Throughout the course of a conversation, each conversational partner, the ‘speaker’ and the ‘interlocutor’, changes a number of parameters of speech production. Convergence phenomena refer to the tendency of conversational partners to co-adjust their speaking styles. Convergence between conversational partners has been shown to occur at various levels, including syntactic and lexical levels (Pickering and Garrod 2004; Bock 1986; Gries 2005; Brennan and Clark 1996) and acoustic levels [intensity: Natale (1975), Levitan and Hirschberg (2011); fundamental frequency: Godfrey et al. (2014), Giles and Powesland (1997); speech rate: Street (1984)]. Most of these studies use carefully controlled datasets in which all parameters except the scrutinized variable have been neutralized. This study sought to reproduce and expand the research of Cohen Priva et al. (2017), which was grounded on an existing corpus rather than experimentally controlled material. Cohen Priva et al. showed evidence of convergence in speech rate production using the Switchboard corpus (Godfrey et al. 1992). The goal of our study was firstly to show that it was possible to reproduce the results of Cohen Priva et al. following the same procedures and using the same statistical tools and then to check the robustness of their findings. Replicability and reproduction have become a major focus, as can be judged by the proliferation of special issues and conferences on these subjects in various fields, including psychology (Pashler and Wagenmakers 2012), economics (Camerer et al. 2016), and medicine (Shekelle et al. 1998). The difference between replicability and reproducibility has been explored in Goodman et al. (2016), Plesser (2018) and, more specifically for language issues, in Branco et al. (2017).
Reproducibility is the calculation of quantitative scientific results by independent scientists using the original datasets, while replication is the practice of independently implementing scientific experiments to validate specific findings. Reproducibility is beginning to receive well-deserved attention from the Natural Language Processing (NLP) community. In language sciences and in particular in NLP, reproducing a result may involve many detailed steps from the raw data to actual results. Our reproduction adopted the original authors’ choices in data selection and pre-processing and attempted to follow the exact procedure of the different steps in the analysis. Interestingly, while the main lines and results of the reproduced study were confirmed, specific results differed despite our having taken care not to alter the original experimental setup. Moreover, based on our reproduction we were able to explore the robustness of the results by varying some of the parameters of the original study. We believe this constitutes another interest in reproducing a study.

Our reproduction study includes two parts: (i) the first part is related to the effects of gender and age on speech rate; (ii) the second part deals with the convergence of a speaker’s speech rate to their baseline and their interlocutor’s speech rate baseline. The latter part will show further analysis that we carried out on the corpus using the model from the reproduced study. First, we used different subsets of the main corpus, changing the number of minimum conversations per speaker. We then tested another approach to computing a crucial ingredient of the reproduced study, the expected word duration, and finally validated the model with a k-fold cross-validation technique. In this last part, we also demonstrated the benefit of using a different approach that took into account the temporal dynamic of speech rate, showing an example of the complex nature of convergence phenomena.

The paper is organized as follows: after describing the general interest of the research question (Sect. 2), we present our reproduction (Sect. 3) of the different experiments. We then present our additions to the initial study in Sect. 4, in particular with regard to dataset selection and the underlying model, and we call attention to the issue of speech rate dynamics.

2 Related work and motivation

Speech rate is a feature that has been explored extensively in the sphere of inter-speaker convergence. Studies in experimental settings using confederates (Schultz et al. 2016; Jungers and Hupp 2009) have shown that speakers modify their speech rate in response to confederates’ variation. The study conducted by Freud et al. (2018) using quasi-natural conversations established that speakers tend to adjust their speech rate to each other. These speech rate variations are related to intended communicative and social goals. For example, in Smith et al. (1975, 1980), Street (1984) conversants increased their speech rate to fit the impression that speakers with higher speech rates are considered to be more competent. In Buller and Aune (1992) speech rate accommodation is linked to intimacy and sociability. Finally, Manson et al. (2013) showed that convergence in speech rate predicts cooperation.

The gender and age of participants can also affect speech rate and its convergence, as shown by Hannah and Murachver (1999), Kendall (2009). Specifically, women tend to converge more than men (Bilous and Krauss 1988; Gallois and Callan 1988; Willemyns et al. 1997); mixed-gender pairs tend to converge the most (Levitan et al. 2012; Namy et al. 2002), while in same-gender interactions, Pardo (2006) found that male-male pairs showed the greatest degree of convergence. Kendall (2009) found that speech rates were more strongly affected by the interlocutor’s gender than by the speaker’s gender. More precisely, both male and female speakers spoke at a similar, slow rate when interviewed by a woman, and faster when the interviewer was a man. Another trend is to evaluate convergence using third-party judgment (human judgment), such as in Namy et al. (2002), Goldinger (1989), which compared speech rates within the same conversation or with those of various shadow participants (Street 1984; Levitan and Hirschberg 2011; Pardo 2006; Sanker 2015). In the study reproduced here, Cohen Priva et al. compared the speech rate of both participants with the average value of their speech rates, or baseline, taken from other conversations. In the second part of their study, the conversants’ baselines, along with their gender and age, were investigated. It was shown that a speaker may increase their usual speech rate, or baseline, in response to a fast-speaking interlocutor, or vice versa. Computing the baseline speech rate using more than one conversation makes it possible to compare speech rate robustly. Another benefit of this approach is to smooth out other external factors that could affect speech rate, such as the topic of the conversation. Cohen Priva et al.’s study is well suited for reproducibility studies due to its precise baseline model and the general availability of the dataset, the Switchboard corpus (Godfrey et al. 1992).
This corpus is composed of about 2400 conversations and 543 speakers, which meant that we could also carry out additional analyses by varying and altering the shape of the original dataset.

3 Reproduction of the original study

To ease comparison with the study conducted by Cohen Priva et al., we will use the same definitions. The speaker’s speech rate while speaking with the interlocutor I is indicated as \(S_{I}\), while the interlocutor’s speech rate with the speaker S is \(I_{S}\). The speech rate baseline of the speaker in other conversations with anyone except I is indicated as \(S_B\) (speaker baseline). Similarly, \(I_B\) (interlocutor baseline) is the speech rate baseline of the interlocutor while speaking with anyone except S.

The data used in the reproduction are the same as in the original paper, the Switchboard corpus (Godfrey et al. 1992), in which participants took part in multiple telephone conversations. The 543 speakers in the corpus, with about 2400 transcribed conversations, were set up in both mixed and same gender and age dyads. The speakers were strangers to each other, and each speaker was paired randomly by a computer operator with various other speakers; for each conversation, a topic (from a list of 70 topics) was assigned randomly. In the pure reproduction stage we only took into account conversations in which both participants had participated in at least one additional conversation with a different speaker/interlocutor, as in the original study. After filtering the data by excluding speakers who only took part in one conversation, we were left with 4788 conversation sides and 483 speakers.

3.1 Speech rate

In their study, Cohen Priva et al. computed the Pointwise Speech Rate (PSR) for an utterance as the ratio between the utterance duration and expected utterance duration.

$$\begin{aligned} \text {PSR} = \frac{\text {utterance real duration} }{ \text {utterance expected duration} } = \frac{\sum _{w=1}^{N} t_w^{real}}{\sum _{w=1}^{N} t_w^ {expected}} \end{aligned}$$
(1)

In Eq. (1), \(t_w^{real}\) is the time used by the speaker to pronounce the word w in that utterance while \(t_w^{expected}\) is the expected time necessary to pronounce the word. N is the number of words in the utterance. Note that a value of PSR \(>1\) means that the speaker rate is slower than expected. Conversely, a value of \(<1\) means that the speaker rate is faster than the expected rate.
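As an illustration, Eq. (1) can be sketched in Python as follows. This is a minimal sketch that follows the equation literally; the function name and the word durations are hypothetical and are not taken from the original code or corpus.

```python
def pointwise_speech_rate(real_durations, expected_durations):
    """PSR of one utterance: summed real word durations over summed expected ones."""
    if len(real_durations) != len(expected_durations):
        raise ValueError("one expected duration is needed per word")
    total_expected = sum(expected_durations)
    if total_expected <= 0:
        raise ValueError("expected utterance duration must be positive")
    return sum(real_durations) / total_expected


# Hypothetical word durations (seconds) for a four-word utterance.
real = [0.30, 0.20, 0.25, 0.45]
expected = [0.25, 0.20, 0.25, 0.30]
psr = pointwise_speech_rate(real, expected)
# Here PSR > 1: the utterance took longer than expected, i.e. slower speech.
```

Because the ratio is duration-based, larger values correspond to slower speech, which is the reading used throughout the rest of the paper.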

To calculate each word’s expected duration, Cohen Priva et al. used a linear regression model in which the median duration of the word across the entire Switchboard corpus, the length of the utterance, and the distance to the end of the utterance (in words) are the predictors of the word’s duration. Medians were used because the distribution of word durations is not symmetric. The authors also included the length of the utterance and the distance to the end of the utterance because it has been shown that these factors can affect speech rate (Jiahong et al. 1980; Quené 2008; Jacewicz et al. 2009).
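The regression described above can be sketched as follows. This is an illustrative sketch only: it assumes ordinary least squares (the text only says "linear regression"), uses a hypothetical toy set of tokens in place of the Switchboard corpus, and the helper `fit_ols` is introduced here for illustration.

```python
from statistics import median


def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical tokens: (word, duration_s, utterance_length, distance_to_end).
tokens = [
    ("i", 0.10, 3, 2), ("think", 0.30, 3, 1), ("so", 0.35, 3, 0),
    ("so", 0.20, 5, 4), ("i", 0.08, 5, 3), ("think", 0.25, 5, 2),
    ("it", 0.12, 5, 1), ("works", 0.40, 5, 0),
    ("it", 0.15, 2, 1), ("works", 0.45, 2, 0),
]

# Median duration of each word type over the (toy) corpus.
durations_by_word = {}
for word, duration, _, _ in tokens:
    durations_by_word.setdefault(word, []).append(duration)
word_median = {w: median(d) for w, d in durations_by_word.items()}

# Predictors: word median, utterance length, distance to utterance end.
rows = [(word_median[w], n, dist) for w, _, n, dist in tokens]
observed = [duration for _, duration, _, _ in tokens]
coefs = fit_ols(rows, observed)
expected = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in rows]
```

With an intercept in the model, a least-squares fit reproduces the mean of the observed durations exactly, so actual and expected durations share the same mean by construction while their medians may differ.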

We found that the mean word duration was 246 ms for both actual and expected scenarios; the median word duration was 205 ms for actual and 208 ms for expected scenarios.

Expected utterance duration is defined as the sum of the expected duration of all words in the utterance, excluding silences and filled pauses (uh, um and oh). Real utterance duration is defined as the time from the beginning of the first word in an utterance, excluding silences and filled pauses, to the end of the last word in that utterance, excluding silences and filled pauses, but including intermediate silences and filled pauses. [noise], [vocalized-noise], [laughter] were excluded from the computation of both utterance duration and expected utterance duration.
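One possible reading of these definitions is sketched below. Several points are assumptions made for illustration: the `[silence]` label, the token format, and the choice to subtract noise events that fall between the first and last word from the real duration. The original implementation may handle these cases differently.

```python
FILLED_PAUSES = {"uh", "um", "oh"}
NOISE_TAGS = {"[noise]", "[vocalized-noise]", "[laughter]"}
SILENCE = "[silence]"  # assumed label for silent intervals


def is_word(label):
    """True for tokens that count as words (not silence, filler, or noise)."""
    return label != SILENCE and label not in FILLED_PAUSES and label not in NOISE_TAGS


def real_utterance_duration(tokens):
    """tokens: list of (label, start_s, end_s) in temporal order."""
    words = [t for t in tokens if is_word(t[0])]
    if not words:
        return 0.0
    start, end = words[0][1], words[-1][2]
    # Assumption: noise events inside the word span are subtracted, while
    # intermediate silences and filled pauses stay in the duration.
    noise = sum(e - s for lab, s, e in tokens
                if lab in NOISE_TAGS and s >= start and e <= end)
    return (end - start) - noise


def expected_utterance_duration(tokens, expected_word_duration):
    """Sum of expected durations over the words of the utterance only."""
    return sum(expected_word_duration[lab] for lab, _, _ in tokens if is_word(lab))


# Hypothetical time-aligned utterance.
utterance = [
    ("um", 0.0, 0.3), ("i", 0.3, 0.4), ("[silence]", 0.4, 0.6),
    ("think", 0.6, 0.9), ("[laughter]", 0.9, 1.2), ("so", 1.2, 1.5),
    ("[silence]", 1.5, 2.0),
]
real = real_utterance_duration(utterance)
expected = expected_utterance_duration(utterance, {"i": 0.09, "think": 0.28, "so": 0.27})
```

In this toy utterance the leading filled pause and the trailing silence fall outside the word span, so only the intermediate silence contributes to the real duration.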

Figure 1 shows an example of how time-aligned transcripts were used to compute speech rate.

Fig. 1 Example of Speech Rate calculation for a given utterance

In Eq. (2), we calculated the speaker’s speech rate as the mean of the logarithm of the Pointwise Speech Rate (Eq. 1) of all utterances with four or more words. Shorter utterances were not included because many of them were back-channels (Yngve 1970), such as isolated ‘yeah’ or ‘uhuh’, which may exhibit different phenomena in terms of speech rate; n is the number of utterances.

$$\begin{aligned} \text {Speech rate} = \frac{1}{n} \sum _{\substack{j=1 \\ N \ge 4}}^{n} \log (\text {PSR}_{j}) \end{aligned}$$
(2)
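Equation (2) can be sketched as below. The natural logarithm is an assumption, since the text does not state the base, and the function name and example values are hypothetical.

```python
import math


def conversation_speech_rate(utterances, min_words=4):
    """Mean log PSR over utterances with at least `min_words` words.

    utterances: list of (n_words, psr) pairs for one speaker in one conversation.
    Returns None when no utterance is long enough.
    """
    logs = [math.log(psr) for n_words, psr in utterances if n_words >= min_words]
    if not logs:
        return None
    return sum(logs) / len(logs)


# Hypothetical utterances; the two-word one is skipped as a likely back-channel.
rate = conversation_speech_rate([(5, 1.2), (2, 0.5), (7, 0.8), (4, 1.0)])
```

On the log scale, a PSR of 1 maps to 0, so negative values indicate faster-than-expected speech and positive values slower-than-expected speech.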

Finally, both the speaker’s and interlocutor’s baseline speech rates were calculated using their mean speech rate from other conversations (\(S_B\) and \(I_B\), respectively).
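The baseline computation can be illustrated with the sketch below. The tuple layout and the speaker labels are hypothetical; the only logic taken from the text is that a baseline averages a person's speech rate over conversations with anyone except the current partner.

```python
def baseline(sides, speaker, excluded_partner):
    """Mean speech rate of `speaker` over conversations not held with `excluded_partner`.

    sides: list of (conversation_id, speaker, interlocutor, speech_rate).
    Returns None when the speaker has no other conversation.
    """
    rates = [rate for _, spk, partner, rate in sides
             if spk == speaker and partner != excluded_partner]
    return sum(rates) / len(rates) if rates else None


# Hypothetical conversation sides.
sides = [
    ("c1", "A", "B", 0.10), ("c1", "B", "A", -0.20),
    ("c2", "A", "C", 0.30), ("c2", "C", "A", 0.00),
    ("c3", "B", "C", -0.10), ("c3", "C", "B", 0.05),
    ("c4", "A", "D", 0.10), ("c4", "D", "A", 0.20),
]

# For the conversation between speaker A and interlocutor B:
s_b = baseline(sides, "A", "B")  # speaker baseline, from A's talks with C and D
i_b = baseline(sides, "B", "A")  # interlocutor baseline, from B's talk with C
```

A speaker with no other conversation (D in this toy data) has no baseline, which is why such speakers are filtered out before modeling.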

3.2 Statistical models

The statistical model used in the original study was a linear mixed regression model with speech rate as the predicted value. The slope of the linear regression gives information about the effect of the fixed effect scrutinized. In Study 1 (Table 1), the model captures the differences between male and female populations, also illustrated in Fig. 2. In this example, the negative slope indicates that the male population has a lower speech rate value, and therefore speaks faster, than the female population.

Fig. 2 Gender effect on speech rate; in this corpus the female population has a higher speech rate value compared to that of the male population

The lme4 library (Bates et al. 2014) in R version 3.4.3 was used to fit the models and provide t-values. The lmerTest package (Kuznetsova et al. 2014), which encapsulates lme4, was used to estimate degrees of freedom (Satterthwaite approximation) and calculate p-values. All numerical predictors were standardized. All models used the interlocutor id, conversation id, and topic identity as random intercepts. The original Study 1 also used speaker id as a random intercept. Following the original study, we used the R p.adjust function to adjust p-values for multiple comparisons using the FDR (false discovery rate) method described by Benjamini and Hochberg (1995), which controls the expected proportion of false discoveries among the rejected hypotheses.
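For readers working outside R, the Benjamini and Hochberg adjustment can be sketched in Python as below. This is an illustrative re-implementation of the step-up procedure, not the code used in the study; the function name and the example p-values are hypothetical.

```python
def fdr_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Visit p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four predictors.
adjusted = fdr_adjust([0.01, 0.04, 0.03, 0.005])
```

The running minimum enforces monotonicity: a predictor with a smaller raw p-value never ends up with a larger adjusted p-value than one with a larger raw p-value.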

3.3 Study 1: gender and age effects on speech rate

This part of our study sought to validate previous studies establishing that age and gender affect speech rate. Studies have found younger speakers to have faster speech rates than older speakers (Duchin and Mysak 1987; Harnsberger et al. 2008; Horton et al. 2010) and male speakers to have slightly faster rates than female speakers (Jacewicz et al. 2009; Jiahong et al. 1980; Kendall 2009). Gender, age, and their interaction were used as fixed effects.

Table 1 Results—comparison between our reproduction and the original study 1

Results Similarly to Cohen Priva et al., we confirmed that older speakers are more likely to have a slower rate of speech (\(\beta \) = 0.2151, standard error (SE) = 0.0532, \(p < 10^{-5}\), FDR-adjusted \(p < 10^{-6}\)). Male speakers are generally more likely to have a faster rate of speech (\(\beta \) = −0.4089, SE = 0.0744, \(p < 10^{-7}\), FDR-adjusted \(p < 10^{-6}\)). Age did not affect male and female speakers differently (\(\beta \) = −0.0716, SE = 0.0748, unadjusted \(p = 0.3389\), FDR-adjusted \(p > 0.05\)). A summary of these results is shown in Table 1 and compared with the results of Cohen Priva et al. As shown, our study revealed the same tendencies as Cohen Priva et al.; in other words, both the age and gender of speakers affect speech rate.

3.4 Study 2: converging to the baseline

The second part of the original study attempted to determine to what extent speakers converge with their interlocutor’s baseline rate and verify the influence of other features like gender and age on convergence. The method used was the same as that explained in Sect. 3.3, with several predictors added. First, two predictors were used for speech rate: speaker baseline speech rate, estimated from the speaker’s conversations with other interlocutors (\(S_B\)), and interlocutor baseline speech rate, estimated from the interlocutor’s conversations with others (\(I_B\)).

Other predictors were included, as described by Cohen Priva et al., to take into account the identity of the speaker, and speaker and interlocutor properties like gender and age that could affect speech rate. To summarize, the predictors were:

  • Age (standardized) of the interlocutor, and its interaction with the (standardized) age of the speaker: \(Interlocutor\,Age\); \(Interlocutor\,Age\) \(\cdot Speaker\,Age\)

  • Gender of the interlocutor, and its interaction with the gender of the speaker: \(Interlocutor\,Gender\); \(Interlocutor\,Gender\cdot Speaker\,Gender\)

  • Interactions between the interlocutor’s baseline speech rate and all other variables:

    • \(Interlocutor\,Baseline \cdot Speaker\,Baseline\);

    • \(Interlocutor\,Baseline \cdot Speaker\,Age\);

    • \(Interlocutor\,Baseline \cdot Interlocutor\,Age\);

    • \(Interlocutor\,Baseline \cdot Interlocutor\,Age \cdot Speaker\,Age\);

    • \(Interlocutor\,Baseline \cdot Speaker\,Gender\);

    • \(Interlocutor\,Baseline \cdot Interlocutor\,Gender\);

    • \(Interlocutor\,Baseline \cdot Interlocutor\,Gender \cdot Speaker\,Gender\).

Table 2 Results—comparison between our reproduction and the original study 2

Results As shown in Table 2, our reproduction is in agreement with the results of Cohen Priva et al.; a speaker’s baseline speech rate has the most significant effect on their own speech rate in a conversation (\(\beta \) = 0.7777, standard error (SE) = 0.0929, \(p < 10^{-16}\), FDR-adjusted \(p < 2\times 10^{-16}\)). Interlocutor baseline rate has a smaller significant effect on speaker speech rate (\(\beta \) = 0.0464, standard error (SE) = 0.0094, \(p < 8\times 10^{-8}\), FDR-adjusted \(p < 0.05\)). The positive coefficient indicates convergence: when speaking with an interlocutor who speaks slower or faster, the speaker’s speech rate changes in the same direction. The difference in the effects of speaker baseline rate and interlocutor baseline rate on speaker speech rate suggests that speakers are more consistent than they are convergent, and that they rely much more on their own baseline. Interlocutor age also has a significant effect on speaker speech rate (\(\beta \) = 0.0231, SE = 0.0089, \(p < 0.05\), FDR-adjusted \(p < 0.05\)). The positive coefficient of this variable indicates that speakers are categorically slower while speaking with older speakers, regardless of the interlocutor baseline speech rate.

Finally, contrary to the results of Cohen Priva et al., the gender combination of the speakers and interlocutors was not found to be significant in affecting speech rate.

4 Additional analyses

In this section, we will describe additional analyses that we carried out on the Switchboard corpus to test the model proposed by Cohen Priva et al. (2017). We extended three aspects of the study in particular: (i) we used a subset of the corpus in order to only include speakers involved in more than two conversations; (ii) we applied a different model to compute expected word duration, and (iii) we tested the model on different data subsets following a k-fold approach.

4.1 Taking a more conservative stance on baseline estimates

As seen above, external factors like the topic of a conversation can affect speech rate. A speaker might vary their speech rate depending on how immersed they are in the discussion or according to how important they consider the topic to be. We mitigated this effect by applying the same model to subsets of the Switchboard corpus which only included speakers who were involved in at least N = 2, 3, 4, 5, or 6 conversations. We preferred to use a greater number of conversations per speaker to compute \(S_B\) and \(I_B\), even if this meant that the analysis was then based on a smaller number of total speakers. In this way, we obtained five different datasets with 483, 442, 406, 385, and 357 different speakers, respectively, and 4788, 4630, 4418, 4264, and 4018 conversation sides. The decision to use these datasets was also due to other factors. For example, emotion can affect a speaker’s manner of speaking and subsequently their speech rate. Previous studies such as Ververidis and Kotropoulos (2006) compared the effect of emotions by recognizing them through speech analysis using several databases, while Siegman and Boyle (1993) demonstrated that people who feel sad may speak more slowly and softly. Using a greater number of conversations per speaker made it possible to smooth out these effects when computing the baseline. As for Study 2, we only took into account the predictors which were significant in the previous study. Table 3 shows the magnitude of the estimates (for Study 1) for each subset. The magnitude of the effect of gender on speech rate increased with the number of conversations, while the effect of age decreased. Moreover, both variables preserved significance with an adjusted p-value which in the worst case (corresponding to the dataset with six conversations per speaker) was \(p = 0.009\) for speaker age and \(p \sim 10^{-8}\) for speaker gender. The estimates thus remained significant even when a smaller amount of data was used.
These results demonstrate the model’s robustness.
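The construction of such subsets can be sketched as follows. The sketch assumes that only the speaker of each conversation side has to meet the threshold; whether the interlocutor must meet it as well is not specified here, and the tuple layout and labels are hypothetical.

```python
def filter_min_conversations(sides, min_conversations):
    """Keep the sides whose speaker appears in at least `min_conversations` conversations.

    sides: list of (conversation_id, speaker, interlocutor).
    """
    conversations = {}
    for conv_id, speaker, _ in sides:
        conversations.setdefault(speaker, set()).add(conv_id)
    return [s for s in sides if len(conversations[s[1]]) >= min_conversations]


# Hypothetical corpus: A has three conversations, B and C two each, D one.
sides = [
    ("c1", "A", "B"), ("c1", "B", "A"),
    ("c2", "A", "C"), ("c2", "C", "A"),
    ("c3", "A", "D"), ("c3", "D", "A"),
    ("c4", "B", "C"), ("c4", "C", "B"),
]

# One subset per threshold, mirroring N = 2, 3, ... in the text.
subsets = {n: filter_min_conversations(sides, n) for n in (2, 3)}
```

Raising the threshold trades speakers for steadier baselines: each remaining baseline averages over more conversations, but fewer speakers and conversation sides remain.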

Table 3 Estimate, standard deviation and adjusted p-value for the gender, age and \(gender\cdot age\) for different subsets of the Switchboard corpus

In our extension of Study 2, we only took into account significant predictors. The results in Table 4 show that the magnitude of the speaker baseline, the interlocutor baseline and the interlocutor age all increased, but age lost significance as the minimum number of conversations increased. The speech rate results were mainly affected by the speaker baseline and interlocutor baseline. Moreover, the fact that interlocutor age did not seem to affect speech rate convergence implies that the results would not be reproduced if we reduced the size of the dataset. These results suggest reviewing the threshold of the p-value, as discussed in Benjamin et al. (2017).

Table 4 Estimate, standard deviation and adjusted p-value for the speaker baseline, interlocutor baseline and interlocutor age for different subsets of the Switchboard corpus

4.2 Variation on expected duration computation

The definition of speech rate at the utterance level is taken to be the ratio between utterance duration and expected utterance duration. Speech rate is therefore influenced by the way the expected duration of each word is computed. Assuming that the duration of a word depends on the length of the utterance, the position of the word in the utterance and the median duration of that word in the entire corpus, we fitted the expected duration using an artificial neural network regression with one hidden layer of 10 neurons and an adaptive learning method. The model was implemented using the Scikit-Learn package in Python (Pedregosa et al. 2011). In this case, we found that the median of the expected word duration was \(\sim 205\) ms, just like the median word duration in the corpus. Applying the same procedure as described in the previous section, we obtained the results in Table 5. The direction of the estimates and SD results remained similar to what was found in Sect. 4.1, thus reinforcing the hypothesis that both speaker baseline and interlocutor baseline affect speech rate.

Table 5 Results obtained using the method described in Sect. 4.2 to compute the expected word duration

4.3 Validation of the model on smaller datasets

Finally, to further validate the model, we applied a cross-validation (k-fold) approach to determine if the results were still significant in smaller datasets. We used \(k = 5\) to obtain each subset from the main corpus. We filtered the data to create non-independent subsets (the subsets could contain overlapping data), each with a size representing 80% of the total duration of the corpus used in Sect. 3. In this way, each dataset contained 3830 conversation sides with the condition that each speaker participated in at least two conversations. We compared the results of Study 2 (Sect. 3.4) with the mean and standard deviation of the results computed on the subsets as detailed in Table 6. We found that although interlocutor baseline and interlocutor age (estimates and standard deviation values) were consistent with the values in Sect. 3 and showed the same direction of effect, they were no longer statistically significant. Moreover, the estimate for the speaker baseline appeared to be slightly lower compared to the result of the whole dataset but was still significant. The lack of significance cannot be attributed to the smaller number of speakers in the datasets. The minimum number of speakers involved in the subsets was 452, which is about 95% of the total number used in Sect. 3. The difference in the results could be attributed to the use of fewer conversation sides per speaker in the k-fold subsets (after the filtering process), which reinforces our proposal to take into account more than two conversations per speaker. These results suggest that speech rate is mainly affected by the speaker baseline when both the number of conversations and the number of speakers decrease.
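A subset construction of this kind can be sketched as below. This is one plausible scheme, assumed for illustration: the data are split into \(k = 5\) folds and each subset leaves one fold out, which yields overlapping subsets of 80% of the items. The unit being split, the seed, and the function name are hypothetical and may differ from the procedure actually used.

```python
import random


def leave_one_fold_out_subsets(items, k=5, seed=0):
    """Return k overlapping subsets, each made of all folds except one."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Interleaved assignment gives folds of (almost) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    return [
        [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        for held_out in range(k)
    ]


# Hypothetical identifiers standing in for conversation sides.
items = list(range(100))
subsets = leave_one_fold_out_subsets(items, k=5)
```

After building each subset, the minimum-conversation filter would need to be applied again, since dropping a fold can leave some speakers with a single conversation.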

Table 6 Estimate, standard deviation and adjusted p-value for the speaker baseline, interlocutor baseline and interlocutor age averaged on the 5 different subsets and compared with the value computed in Sect. 3.4

4.4 Beyond averages

The reproduction we carried out, including additional analyses to test the robustness of the model, uses speech rate as the mean value of all the utterances produced by the speaker in the whole conversation. Even if this approach captures the general properties and behavior of the speakers and their interlocutors while conversing, it cannot account for the complex dynamics of speech rate precisely over the course of the conversation. To get a closer view of what speech rate variation looks like in conversation, we produced a series of speech rate plots in actual conversations, as shown in Fig. 3.

Fig. 3 Blue (upper part) and red (bottom part) indicate the speaker and interlocutor variables, respectively. (Color figure online)

First of all, we note that Study 2 focused on comparing baselines and average speech rates (straight lines). To illustrate the variability and complexity of speech rate in a conversation, we plotted the speech rate for each utterance for both the speaker and the interlocutor. We smoothed the data using a moving average with a window of \(n=6\) utterances. We then applied a polynomial fit p(x) of order \(k=8\) to the filtered data to obtain the trend of the speech rate as a smoothed function. As we can see, the difference between the average speech rate of the speaker and the interlocutor (respectively in light blue and pink) is \(\sim 0.4\). These averaged values are in accordance with the pointwise speech rate (blue for speaker and red for interlocutor) at the utterance level for the first part of the conversation (up to \(300\) s), which shows a considerable difference between the conversants. However, this hides the fact that the difference is less than 0.05 in the temporal interval of \(300\)–\(400\) s. In this interval of the conversation, the speaker and interlocutor have a similar trend in their speech rates, each converging toward their respective interlocutor. A model that uses the average speech rate over the course of the whole conversation ignores the complex dynamic of the speaker’s behavior that can alternate between attitudes of convergence, divergence or ignorance during the conversation. Moreover, average speech rate is sensitive to outliers. This issue could affect the analysis of speech rate in conversations, leading to an erroneous description of the conversants’ behavior. The variation we found in speech rate over the course of a conversation points to the need for new analytical approaches that take conversational dynamics into account.
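The smoothing step can be sketched as follows. The sketch covers only the moving average; it assumes a plain window over consecutive utterances without padding, since the alignment of the window is not specified, and the example values are hypothetical. The subsequent polynomial fit is not shown and would typically rely on a numerical library.

```python
def moving_average(values, window=6):
    """Average each run of `window` consecutive values (no padding at the edges)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical per-utterance speech rates for one speaker.
rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
smoothed = moving_average(rates, window=6)
```

Without padding, the smoothed series is shorter than the input by `window - 1` points, which matters when aligning the speaker and interlocutor curves on a common time axis.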

5 GitHub repository

In order to facilitate further reproductions and replications, we created a Jupyter (Kluyver et al. 2016) notebook with the code developed to reproduce the study of Cohen Priva et al. (2017) as well as the additional analyses described in this paper in Sects. 4, 4.1, and 4.2. The notebook contains Python scripts and can be used to perform the following tasks:

  1. Pre-processing the transcripts of the Switchboard corpus

  2. Computing the speech rate as described in detail in Sect. 3.1

  3. Computing the baseline and standardizing the data

In addition, we added R scripts that perform the statistical analysis described in Sects. 3.2, 3.3, and 3.4.

The code is accessible at https://github.com/simonefu/Converging_to_baseline

6 Conclusion

The results of our reproduction of the study of Cohen Priva et al. (2017) confirmed that the gender and age of speakers affect speech rate production (Study 1), as stated in the original work. In Study 2, our reproduction confirmed that both speaker baseline and interlocutor baseline affect speech rate, supporting the theory that speakers’ speech rates tend to converge, as explained in the original paper. In particular, the speaker’s baseline has a stronger effect on their own speech rate than the interlocutor’s baseline. Conversely, the interaction of interlocutor baseline and speaker gender did not have a significant effect on convergence. Moreover, our verification of the robustness of the model revealed that only the speaker baseline effect retained significance when we reduced the number of speakers.

More generally, despite their key importance, replication/reproduction studies in language sciences of the kind presented here have been too rare. They constitute a crucial ingredient needed to make scientific results more reliable and more credible inside and outside the community. Furthermore, replicated studies constitute the perfect ground for extending previous work. We hope that the benefits exhibited in this paper can convince more NLP and language science researchers to initiate replications and present them in dedicated papers.

Finally, the visual exploration of speech rate we have presented here allowed us to grasp the distances between the study we focused on, our replication, and the actual complexity of the phenomena. Our results add to the interest of the reproduced study and reveal how much we still have left to understand about conversational dynamics.