Learning a new language involves discovering how sound sequences combine to make words. For a novice learner, this can be challenging, as words are not typically produced in isolation (Brent & Siskind, 2001), and unlike written language, spoken language does not have reliable pauses between words. Fortunately, there are a number of probabilistic cues, including transitional probability (TP) and phonotactic probability (PP), that may work together to help learners overcome the segmentation problem (Johnson, 2016; Jusczyk, 2002; Onnis et al., 2005). The current study further investigates the combined effects of TP and PP on speech segmentation and highlights the methodological implications of overlooking PPs when studying statistical speech segmentation, which is often the case in both classical and more recent studies.

Transitional probabilities (TPs) are a type of sequential statistic found between words’ segments (e.g., syllables) that can be used to predict the occurrence of the next or the previous segment (Aslin et al., 1998; Hay et al., 2011). For instance, given sufficient experience with English, one might learn that in the sequence pretty#baby, the chance of pre being followed by ty is greater than the chance of ty being followed by ba, signaling a word boundary. Infants and adults are able to track differences in TPs to find word boundaries in continuous speech (Saffran, Aslin, et al., 1996a; Saffran et al., 1997). Phonotactic probabilities (PPs), on the other hand, are positional statistics based on the frequency of phonological segments in given positions within words of a language (Vitevitch & Luce, 2004). For example, in the same sequence pretty#baby, the English PP of the word pretty (≈ 0.0440) is higher than the PP of the word baby (≈ 0.0050), which, in turn, is higher than the PP of the part-word ty#ba (≈ 0.0022). Differences in PPs between words and/or part-words have also been shown to signal word boundaries and promote speech segmentation for both infants and adults (Jusczyk et al., 1994; Mattys & Jusczyk, 2001; Mattys et al., 1999).
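As an illustration of the forward TP described above, the following minimal sketch estimates P(next syllable | current syllable) from a syllable sequence. The function name and the toy stream (including the filler syllables "do" and "ggy") are hypothetical and are not taken from any corpus.

```python
from collections import Counter

def forward_tps(syllables):
    """Return {(a, b): P(b | a)} estimated from a syllable sequence."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor (all but the last).
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical toy stream: "pre" is always followed by "ty",
# but what follows "ty" varies from token to token.
stream = "pre ty ba by pre ty do ggy pre ty ba by".split()
tps = forward_tps(stream)
print(tps[("pre", "ty")])  # within-word transition
print(tps[("ty", "ba")])   # transition across a word boundary
```

In this toy stream the within-word transition is more predictable than the transition across the boundary, which is the contrast learners are assumed to exploit.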

To study statistical speech segmentation, researchers usually combine made-up words to create continuous speech streams. The very nature of made-up words implies that they have TPs = 0 in the participants’ native language. For instance, the syllables from the sequence pretty#baby could be rearranged to create the made-up sequence tyba#preby. Native TP knowledge will not aid segmentation of this made-up sequence, but given sufficient exposure, participants can track the new TPs and use them to find word boundaries. Artificial languages have been successful in precisely controlling TP information in their made-up words; however, this level of strict control is not automatically extended to PP. For instance, in the same made-up sequence (tyba#preby), the PP of the word preby is much higher (≈ 0.0391) than the PP of the part-word ba#pre (≈ 0.0110), which, in turn, is higher than the PP of the word tyba (≈ 0.0022). If an artificial language has systematic PP differences among its words and/or part-words, participants’ native PP knowledge may affect segmentation, regardless of TP information (Finn & Hudson Kam, 2008; Mersad & Nazzi, 2011). Made-up words in statistical learning studies usually have legal phonotactics (e.g., Karaman & Hay, 2018; Mirman et al., 2008; Pelucchi et al., 2009). However, variations in phonotactic probabilities are seldom controlled. We expand on this point and discuss its implications, with data, in the Discussion.

At least two studies have focused on the combined effects of TP and PP on speech segmentation. Finn and Hudson Kam (2008) familiarized English-speaking adults to a continuous stream of speech formed by disyllabic words that always had strong TPs, but varying PPs. In one condition, words began with legal consonant clusters (e.g., /kr/, /pl/, as in “kraft,” “plural”). In the other condition, words began with illegal clusters (e.g., /tf/, /km/, /bt/). Despite TPs being stronger within words (ranging from 0.25 to 1.0) than part-words (ranging from 0.035 to 0.143) in both conditions, only participants who heard the speech with legal clusters preferred words more than part-words at test. These results show that adults combine TP information learned in the task with native PP knowledge when segmenting novel speech. They also show that segmentation may be impaired if these statistics collide (i.e., strong TP, but illegal PP).

In a more graded approach, Mersad and Nazzi (2011) showed that probabilistic differences between words’ PPs may serve as cues to word segmentation. French-speaking adults were familiarized to either a uniform speech stream, with all words having medium-low PPs, or to a non-uniform speech stream, with some words having high and others having medium-low PPs. For both speech streams, TP was always stronger within words (1.0) than within part-words (0.5). Interestingly, only participants exposed to the non-uniform speech stream preferred words to part-words during test. These results suggest that PP contrasts can be combined with TPs to promote segmentation of continuous speech. One explanation offered by the authors is that words with stronger PPs could have functioned as anchors that highlighted word boundaries.

Support for this idea comes from research showing that infants as young as 6 months old can use their knowledge of highly familiar words, such as their own name and caregivers’ name, as anchors to find unfamiliar word boundaries in continuous speech streams (Bortfeld et al., 2005). In addition, by taking advantage of anchors, infants can segment speech streams with more challenging structures, such as ones with varying word length (Mersad & Nazzi, 2012). In the adult literature, the presence of recently learned words in the speech stream facilitates speech segmentation (Cunillera et al., 2010) and creates an anticipatory expectation for the words surrounding the anchor (Cunillera et al., 2016).

The findings from Finn and Hudson Kam (2008) and Mersad and Nazzi (2011) provide compelling evidence that transitional and phonotactic statistics can affect speech segmentation. However, both studies used very distinct phonotactic values (i.e., legal vs. illegal; high vs. medium-low). On one hand, given the implicit nature of PP knowledge, large PP variations may be necessary for the anchoring effect to be observed. On the other hand, PP learning starts early in life. Even before segmenting speech into words, 5-month-olds are already sensitive to PP information from their native language (Sundara & Breiss, 2020). In addition, both infants (Chambers et al., 2003) and adults (Adriaans & Kager, 2017; Onishi et al., 2002) can learn PPs from brief exposures to continuous speech in the lab. Further, PPs have been shown to affect a range of psychological phenomena, from memory to speech production (Apel et al., 2006; Gathercole et al., 1999; Graf Estes et al., 2016; MacKenzie et al., 2012; Zamuner, 2009). Together, this evidence suggests that PP knowledge may be robustly encoded in a manner that impacts speech segmentation (Onnis et al., 2005).

The present study aims to provide a more fine-grained test of the combined effects of PP and TP on speech segmentation. We conducted a conceptual replication of Mersad and Nazzi (2011), but using smaller differences in words’ PPs. Again, adults heard either a stream of speech with PP anchors or a uniform stream of speech without anchors. If participants rely on TP alone, we expect both speech streams to be segmented. If participants rely on the combination of TP and PP anchors, we expect those in the Anchor condition to demonstrate better segmentation than those in the Uniform condition. By investigating the integration of transitional and phonotactic statistics, our results will provide a more fine-grained understanding of statistical language learning (Saffran, 2020).

Method

Participants

Eighty-one students (53 females, Mage = 22.59 ± 5.28 SD) from Universidade Federal de São Carlos were recruited to participate through online postings. All participants were native speakers of Brazilian-Portuguese and had typical hearing, vision, and motor control, according to self-report. The research was approved by the Institutional Review Board of the Universidade Federal de São Carlos (#1.484.847). Participants signed the informed consent form and were randomly assigned to a condition. No compensation was provided for participation.

Stimuli

To ensure that PPs were tightly controlled, we focused on PP information for initial stimulus selection. To do so, we used a PP calculator especially designed for this study (Estivalet & Dal Ben, n.d.). The calculator had two parts: the database and the search engine (algorithm based on Vitevitch & Luce, 2004). The database was built from a comprehensive Brazilian-Portuguese corpus (Estivalet & Meunier, 2015) in five steps. First, grapheme-phone transcriptions were performed automatically following the conventions set by Barbosa et al. (2003). Second, all words from the corpus were split into biphones. Third, the log (base 10) value of the frequencies of each biphone by position was summed (e.g., sum of /mæ/ frequency in the first position, in the second, and so forth). Fourth, the log value (base 10) of total word frequencies by number of biphones was summed (i.e., the total frequency of words with one, two, three biphones, and so forth). Finally, the biphones’ sums, which were calculated in the third step, were divided by the words’ sums, calculated in the fourth step. The quotient was the log-transformed PP of biphones in Brazilian-Portuguese. Log transformations were used because they better correlate with performance in linguistic experiments than raw frequency (Vitevitch & Luce, 2004). The search engine worked in four steps, the reverse of Vitevitch and Luce (2004). First, it took as input user-specified log-transformed PPs (e.g., exactly 0.002 in the first position, 0.003 in the second, and so forth), or PP ranges (e.g., 0.002 or higher in the first position), for each desired biphone position. Second, it searched the database for those PPs and created a list of biphones by position. Third, it recombined overlapping biphones by position (e.g., /pe/ in the first position would be recombined with /em/ in the second position, and with /mi/ in the third position to make /pemi/). Finally, it returned the combinations that matched the user-specified statistics.
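The database steps above can be sketched roughly as follows. This is one possible reading of the five steps, not the calculator's actual code: it assumes one character per phone, takes the denominator for a position to be the summed log frequency of every word that has a biphone in that position, and assumes that a word's PP is the sum of its positional biphone probabilities. The function names and the toy lexicon frequencies are hypothetical.

```python
import math
from collections import defaultdict

def biphones(phones):
    """Split a phone string into overlapping biphones, in order."""
    return [phones[i] + phones[i + 1] for i in range(len(phones) - 1)]

def build_pp_table(lexicon):
    """lexicon: {phone string: corpus frequency}, frequencies assumed > 1.
    Returns {(position, biphone): positional probability}."""
    numer = defaultdict(float)  # log-frequency mass per (position, biphone)
    denom = defaultdict(float)  # log-frequency mass of all words reaching a position
    for word, freq in lexicon.items():
        weight = math.log10(freq)
        for pos, bp in enumerate(biphones(word), start=1):
            numer[(pos, bp)] += weight
            denom[pos] += weight
    return {key: numer[key] / denom[key[0]] for key in numer}

def word_pp(word, table):
    """Sum the positional biphone probabilities; unseen biphones count as 0."""
    return sum(table.get((pos, bp), 0.0)
               for pos, bp in enumerate(biphones(word), start=1))

# Hypothetical three-word lexicon with made-up frequencies.
table = build_pp_table({"pato": 100, "pata": 10, "bola": 1000})
print(table[(1, "pa")], word_pp("pato", table))
```

With such a tiny lexicon the probabilities are far larger than the values reported for real corpora; the sketch only illustrates how position-specific sums and quotients fit together.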

Using the calculator, we selected six disyllabic pseudo-words with consonant–vowel structure (CVCV) and with the highest possible PPs before becoming words in Brazilian-Portuguese. We labeled these six pseudo-words PP+. Then, we recombined the biphones from PP+ pseudo-words to create a second set of six pseudo-words that had slightly less probable, but still high, PPs. We labeled these PP− (see Table 1).

Table 1 Phonetic transcription (IPA), Phonotactic Probabilities (PP), and Phonetic Levenshtein Distances (PLD20) of the pseudowords with highest possible phonotactic probabilities (PP+) and their recombinations (PP−)

To ensure that both sets were very similar to actual words from Brazilian-Portuguese, we calculated the mean number of insertions, deletions, and substitutions needed to transform PP+ and PP− words into their closest 20 phonetic neighbors (i.e., Levenshtein Distance; Yarkoni et al., 2008). We used the package vwr (Keuleers, 2013) for R (R Core Team, 2017) and the same Brazilian-Portuguese corpus in these calculations (Estivalet & Meunier, 2015; see Table 1). The analysis indicated that both sets were very close to actual Brazilian-Portuguese words (1.26 for PP+, 1.52 for PP−; the smaller the number of operations, the closer the words were to Brazilian-Portuguese) and to each other—difference of 0.26 mean operations.
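For readers unfamiliar with the measure, the sketch below shows a plain-Python Levenshtein distance and the mean distance to the k closest lexicon entries. It is a stand-in for illustration and does not reproduce the vwr package interface; it assumes one character per phone, and the function names and toy lexicon are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def mean_nearest_distance(target, lexicon, k=20):
    """Mean Levenshtein distance from target to its k closest lexicon entries."""
    distances = sorted(levenshtein(target, w) for w in lexicon if w != target)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical toy lexicon; k is reduced to 2 because the lexicon is tiny.
toy_lexicon = ["pato", "pata", "bola", "gato"]
print(mean_nearest_distance("pito", toy_lexicon, k=2))
```

Smaller values mean that fewer edit operations separate the pseudo-word from its nearest real-word neighbors.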

We used disyllabic words, instead of trisyllabic ones, because it avoids possible confounding variations that arise from recombining final + initial + middle syllables or middle + final + initial syllables of trisyllabic words (Mersad & Nazzi, 2011). We chose to use a CVCV structure because it is one of the most common word structures in Brazilian-Portuguese (Estivalet, 2018). Finally, we decided to use PPs over raw biphone positional frequency (Mersad & Nazzi, 2011) because it provides a comparable measure across all biphones for each stimulus (i.e., CV, VC, CV, in the first, second and third position, respectively), it is a standard measure in psycholinguistic research (Vitevitch & Luce, 2004), and it allows for cross-linguistic comparisons (e.g., see Table 3).

When recombining these words to create continuous speech streams, we aimed to mimic, as closely as possible, the type of artificial language used in many of the classic speech segmentation experiments (Aslin et al., 1998; Saffran, Aslin, et al., 1996a; Saffran et al., 1997). To rule out frequency of repetition of test items as a confound, we created four frequency-balanced languages, two for each condition (see Aslin et al., 1998 for advantages of using a frequency-balanced language). All languages had six words (TP = 1.0). Half of the words were presented 300 times (i.e., high-frequency words) and half of the words were presented 150 times (i.e., low-frequency words). Words were concatenated in a quasi-random order following two constraints. First, a given word was never repeated in succession (e.g., nipenipe…). Second, high-frequency words were arranged so that their part-words (TP = 0.5) were matched in frequency with low-frequency words (i.e., they occurred 150 times each). During both familiarization and test, each word lasted for 696 ms, had a mean F0 of 220 Hz, and a mean intensity of 77 dB. Words were synthesized and concatenated, with co-articulation, using the MBROLA synthesizer (Dutoit et al., 1996) with the female Brazilian-Portuguese database br4. Each language had a total duration of approximately 15 min 40 s (stimuli are available at: https://osf.io/s9thk/).
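The design figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the paragraph; the variable names are illustrative, and the part-word TP assumes that the final syllable of a high-frequency word occurs only in that word.

```python
HIGH_FREQ_WORDS, LOW_FREQ_WORDS = 3, 3   # six words, half per frequency class
HIGH_REPS, LOW_REPS = 300, 150           # presentations per word
WORD_DURATION_S = 0.696                  # 696 ms per word

total_tokens = HIGH_FREQ_WORDS * HIGH_REPS + LOW_FREQ_WORDS * LOW_REPS
total_seconds = total_tokens * WORD_DURATION_S
minutes, seconds = divmod(total_seconds, 60)

# A part-word spanning two high-frequency words that occurs 150 times,
# out of 300 tokens of the first word, has a boundary TP of 150 / 300.
part_word_tp = LOW_REPS / HIGH_REPS

print(total_tokens, int(minutes), round(seconds, 1), part_word_tp)
```

The resulting stream length is consistent with the reported duration of approximately 15 min 40 s, and the boundary TP matches the reported part-word TP of 0.5.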

In the Anchor condition, both languages had words with varying PPs in order to generate anchor effects. High-frequency words from one of the languages (Language 2) came from the PP+ set and had the highest possible PPs (M = 0.0085, range from 0.0090 to 0.0080). In the other language (Language 1), the high-frequency words came from the PP− set and had slightly lower PPs (M = 0.0072, range from 0.0075 to 0.0066; see Table 2). The opposite was true for low-frequency words. Low-frequency words in Language 2 came from the PP− set and had slightly lower PPs (M = 0.0072) than they did in Language 1 (M = 0.0085; PP+). Thus, the PP difference between the highest and the lowest PP was 0.0024 during familiarization, for both languages. Furthermore, the low-frequency words and the part-words used at test had similar PPs within each language. These PPs were slightly higher for Language 1 (M = 0.0085 and 0.0086, respectively) and slightly lower for Language 2 (M = 0.0072 and 0.0075, respectively).

Table 2 Phonotactic Probabilities (PP), Frequency (Freq), and Transitional Probabilities (TP) of familiarization and test stimuli of Languages (L) from Anchor and Uniform conditions (Cond)

In the Uniform condition, PPs were flattened for both languages during familiarization (i.e., no anchors were present). In one of the languages (Language 3), all words (high and low frequency) had slightly lower (but still high) PPs (M = 0.0072, range from 0.0066 to 0.0078; PP− set). In Language 4, all words had the highest possible PP values (M = 0.0085, range from 0.0080 to 0.0090; PP+ set; see Table 2). Thus, the PP difference between the highest and the lowest PP during familiarization was 0.0012 for Language 3 and 0.0010 for Language 4. As a consequence of the flattened PP distribution during familiarization, part-words used at test had the highest possible PPs in Language 3 (M = 0.0085), and slightly lower in Language 4 (M = 0.0072).

During warm-up trials, before the test phase, four existing words in Brazilian-Portuguese were contrasted with four novel words (not used in the experiment)—that is, bola (ball), tela (screen), cabo (cable), pato (duck) versus sibu, bafi, guvi, tibo, respectively.

Procedure

The experiment consisted of three phases: familiarization, test, and self-evaluation. The task was computer-administered using PsychoPy2 (Peirce et al., 2019) and stimuli were played on high-definition neutral headphones (AKG 740 powered by Fiio E10K dac/amp) in a sound-attenuated room. At the beginning of the experiment, before the familiarization phase, music was played through the headphones at the same intensity as the experimental stimuli (≈77 dB) and participants were instructed to adjust the volume to a comfortable level. The familiarization language was then played at the individually selected volume. In the familiarization phase, participants were told that they would hear a new language and were provided with puzzles to play with while they were listening (e.g., wood puzzles, slide puzzles, dexterity puzzles; Saffran et al., 1997). They were not instructed about any aspect of the language, nor were they instructed to pay attention to it.

After familiarization, participants were presented with four warm-up trials during which they were instructed to select the existing word in Brazilian-Portuguese compared with a novel word, by pressing either 1 or 2 on an adapted keyboard. There was a 500-ms pause between the two stimuli. In the following test phase, participants were presented with 18 two-alternative forced-choice trials that had the same structure as the warm-up trials. There were six trials for each of the three words. During each trial, participants heard one word (TP = 1) paired with a part-word (TP = 0.5)—order of presentation was counterbalanced across trials—and were instructed to select the stimulus that sounded more like the language they just heard.
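One way to assemble a test list with this structure is sketched below. The pairing scheme is an assumption made for illustration: each of the three words is paired with each of three part-words twice, once in each presentation order, which yields 18 trials with six per word. The mapping of key 1 to the first stimulus, the function name, and the placeholder labels are also hypothetical.

```python
import itertools
import random

def build_test_trials(words, part_words, seed=0):
    """Pair every word with every part-word twice, once in each order."""
    trials = []
    for word, part_word in itertools.product(words, part_words):
        # Assumed key mapping: "1" selects the first stimulus, "2" the second.
        trials.append({"first": word, "second": part_word, "correct_key": "1"})
        trials.append({"first": part_word, "second": word, "correct_key": "2"})
    random.Random(seed).shuffle(trials)
    return trials

# Placeholder labels, not the actual stimuli.
trials = build_test_trials(["w1", "w2", "w3"], ["pw1", "pw2", "pw3"])
print(len(trials))
```

Under this scheme the word is heard first on exactly half of the trials, so a bias toward either response key cannot by itself produce above-chance word selection.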

Finally, during the self-evaluation phase, participants were asked to estimate the percentage of the words they were able to detect (between 0–25%, 25–50%, 50–75%, 75–100%), and to report how focused they were on the puzzles while listening to the new language (very focused, focused, poorly focused, not focused). After the experimental task, participants completed a questionnaire including questions about their age, native language, and whether they had any experience with foreign languages.

Data analysis

The main dependent variable was the proportion of words (versus part-words) selected during the test phase. To test whether participants relied on TP alone or on a combination of TP and PP anchors to segment the languages, the proportion of words selected in each condition was tested against chance (0.5) using classic and Bayesian t tests. All tests against chance were one-sample and two-tailed, and Bayesian tests had a default Cauchy prior (0, 0.707; Wagenmakers et al., 2018). To test whether performance differed between conditions, we used an independent-samples t test. We also tested correlations between the proportion of words selected, self-evaluations of performance and focus on the task, and experience with foreign languages. All analyses were performed in JASP (JASP Team, 2018) and are available at OSF (https://osf.io/s9thk/). A post hoc power analysis of the Anchor condition, the one with results significantly different from chance, revealed 99% power to detect a true effect (Faul et al., 2007).

Results

We found a significant difference in word selection across conditions, t(79) = 2.613, p = .011, Cohen’s d = 0.581, 95% CI [0.035, 0.262], suggesting that information from TP and PP anchors was combined during word segmentation (see Fig. 1a). In the Anchor condition, words were selected (M = 0.70, SD = 0.22) significantly above chance (0.5), t(39) = 5.661, p < .001, d = 0.895 [0.523, 1.259], with extreme evidence in favor of a combined effect of TP and PP on word selection, BFalternative = 11,446. Furthermore, there was no significant difference between performance across language versions (1 and 2), t(38) = 0.378, p = .707, d = −0.120 [−0.176, 0.121]. Participants were equally successful when tested on target words with the highest phonotactics (PP+; Language 1) or with slightly lower phonotactics (PP−; Language 2). In addition, no significant correlations were found between the proportion of words selected, self-evaluation of performance and focus, or experience with foreign languages.

Fig. 1 (a) Proportion of word selection across conditions (Anchor and Uniform). (b) Proportion of word selection across Languages 3 and 4, from the Uniform condition. Empty dots represent individual performance, and filled dots represent the mean performance for each Condition (a) or Language (b). Point ranges indicate 95% confidence intervals. Dashed lines indicate chance level (0.5)

In contrast, in the Uniform condition, words were not selected (M = 0.55, SD = 0.28) above chance, t(40) = 1.305, p = .199, d = 0.204 [−0.107, 0.512], and no evidence was found for the effects of TP alone on word segmentation, BFalternative = 0.370. Nonetheless, a significant difference was found between language versions (3 and 4; see Fig. 1b), t(39) = −3.038, p = .004, d = −0.949 [−1.519, −0.296]. Participants who heard Language 3 did not select words above chance (M = 0.43, SD = 0.23), t(20) = −1.197, p = .245, d = −0.261 [−0.693, 0.177], BFalternative = 0.427. In contrast, participants who listened to Language 4 selected words above chance (M = 0.68, SD = 0.27), t(19) = 2.936, p = .008, d = 0.656 [0.165, 1.134], BFalternative = 5.875. In addition, for Language 4 only, a positive correlation was found between the proportion of words selected and self-evaluation of performance (rs = 0.634, p = .003).
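As a rough plausibility check, the classic test statistics can be recomputed from the reported summary statistics. The sketch below is not the JASP analysis: it uses the rounded means and standard deviations given in the text, infers sample sizes from the degrees of freedom (n = df + 1), and assumes a pooled-variance independent-samples test, so its output only approximates the reported values (t = 5.661, 1.305, and 2.613). The function names are illustrative, and the Bayesian tests are not covered.

```python
import math

def one_sample_t(mean, sd, n, mu=0.5):
    """Return (t, Cohen's d) for a one-sample test against mu."""
    d = (mean - mu) / sd
    return d * math.sqrt(n), d

def independent_t(m1, sd1, n1, m2, sd2, n2):
    """Return (t, Cohen's d) for a pooled-variance independent-samples test."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    d = (m1 - m2) / math.sqrt(pooled_var)
    return d / math.sqrt(1 / n1 + 1 / n2), d

anchor = (0.70, 0.22, 40)    # M, SD, n inferred from t(39)
uniform = (0.55, 0.28, 41)   # M, SD, n inferred from t(40)

print(one_sample_t(*anchor))
print(one_sample_t(*uniform))
print(independent_t(*anchor, *uniform))
```

The recomputed statistics fall close to the reported ones, with the remaining gaps attributable to rounding of the means and standard deviations.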

The overall successful segmentation in the Anchor condition but not in the Uniform condition suggests that adults combined information from TP and PP anchors to segment the continuous speech. It is noteworthy that the flat PP distribution in the Uniform condition created test items with unbalanced PPs (see Table 2). In this condition, phonotactic information present during test may have led to word preferences in one language version (Language 4), but not in the other (Language 3). We return to these points in the Discussion.

Discussion

The present study investigated the combined effects of transitional probabilities and phonotactic probabilities on speech segmentation. Across two conditions, participants were exposed either to artificial languages in which all statistically defined words (TP = 1) had similar phonotactic probabilities (i.e., Uniform condition) or to artificial languages in which words had varying degrees of phonotactic probabilities (i.e., Anchor condition). If participants relied on TP alone, we expected similar segmentation performance for both conditions, given that TP was always stronger in words (TP = 1) in comparison with part-words (TP = 0.5). On the other hand, if participants relied on a combination of TP and PP, we expected segmentation to happen only in the Anchor condition.

The successful segmentation found in the Anchor condition, but not in the Uniform condition, suggests a combined effect of TP and PP anchors on speech segmentation. These results were found using very small PP differences between words in the Anchor condition. Thus, we replicated Mersad and Nazzi’s (2011) findings under more challenging circumstances and these results add support to the argument that anchors play an important role in speech segmentation (Bortfeld et al., 2005; Cunillera et al., 2010; Cunillera et al., 2016; Mersad & Nazzi, 2011, 2012). Nonetheless, the separate analysis of each speech stream from the Uniform condition indicates that PP effects may have extended to the test phase. In this condition, PP distributions were flat during familiarization for both versions of the familiarization language. Thus, participants could not have relied on PP anchors to find word boundaries. However, as a side effect of having flat PPs during the familiarization, words and part-words used during test had unbalanced PPs. For one language (Language 3), part-words had higher PPs than did words; for the other language, words had higher PPs (Language 4). Thus, even if participants did not segment the speech stream during familiarization, unbalanced PP information during test could still affect performance. Based solely on the PP information during test, participants exposed to Language 4 should prefer words (higher PP on words), but participants exposed to Language 3 should not (lower PP on words), which is in accordance with our findings. Furthermore, the positive correlation between performance and self-evaluation found only in Language 4 might be a further indication that participants were using PP information recently presented in the test phase—a recency effect. 
We did not expect these results because we assumed that the small PP differences in our stimuli would only drive learning after repeated exposure during the familiarization phase (at least 150 repetitions of each word). However, it seems that the limited experience with only six repetitions of each word and part-word during the test was sufficient to drive participants’ preferences. Although unexpected, this is in line with evidence showing that abrupt distinctions between legal and illegal phonotactics at test affect preference, regardless of TP information presented during familiarization (Finn & Hudson Kam, 2008). Future research can prevent PP effects during test by using the same set of test items across conditions (Adriaans & Kager, 2017; Reber & Perruchet, 2003). It is worth noting, however, that the higher the stimuli’s PPs, as in the current study, the harder it is to balance PPs at both familiarization and test, especially when PP differences as small as 0.0024 might affect performance, a point we return to in the following paragraphs.

Our findings also provide a note of caution for speech segmentation researchers, who may overlook the impact that subtle phonotactic differences in their experimental stimuli might have on segmentation. Typically, statistical speech segmentation researchers use counterbalanced speech streams to control for any “general preferences for certain syllable strings” (Saffran, Aslin, et al., 1996a, p. 1928) across participants. Thus, it is assumed that all observed learning reflects listeners’ sensitivity to TP information presented in the exposure stream rather than participants’ previous knowledge of their native language (such as phonotactics). However, counterbalancing does not necessarily control for differences in PP that, combined with TP, could transform words into anchors.

The results from the present study suggest that PP differences as small as 0.0024 (highest PP minus the lowest [0.0090 − 0.0066], Anchor condition) between stimuli can be combined with TP information to create anchors that promote segmentation. Using this difference as a threshold, we conducted a brief analysis of PPs from words, part-words, and nonwords used in artificial languages from a number of well-known statistical segmentation studies. Based on our knowledge of the literature, we analyzed highly cited seminal studies with infants and adults (Aslin et al., 1998; Saffran, Aslin, et al., 1996a; Saffran, Newport, et al., 1996b; 1209, 5590, and 1389 citations, respectively); one study that used disyllabic stimuli and a combination of TP and stress cues to segmentation (Thiessen & Saffran, 2003; 607 citations); one study that combined speech segmentation and word learning (Estes et al., 2007; 491 citations); and one study published by one of us (Hay & Saffran, 2012; 62 citations). For each study, we first calculated PPs for each reported stimulus using Vitevitch and Luce’s (2004) phonotactic calculator. Then, we subtracted the lowest PP from the highest PP for each word from the speech stream and between test items (words, part-words, nonwords). All phonetic transcriptions, PPs, and calculations are openly available at OSF (https://osf.io/s9thk/).

Overall, we found an average difference of 0.0082 across studies’ speech streams (range from 0.0042 to 0.0199; see Table 3), which is 3.4 times higher than the 0.0024 difference from our stimuli (Anchor condition). Even the smallest difference, 0.0042 (Estes et al., 2007; Experiment 2), is still almost twice as high as the difference in our stimuli. This shows that, regardless of counterbalancing, PP differences persist in speech segmentation tasks. Such differences, combined with TP information, may have created PP anchors that highlighted word boundaries and facilitated speech segmentation during familiarization, as in our Anchor condition. Furthermore, with the exception of the first experiment from Saffran, Aslin, et al. (1996a), all test items also had systematic PP differences. The overall mean difference was 0.0055 across studies’ test phases (range from 0.0020 to 0.0090), 2.3 times higher than the PP difference in the test phase of the Uniform condition of our study (0.0024). Such differences could have also driven word preferences during test.
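The comparisons in this paragraph amount to dividing each reported difference by the 0.0024 threshold. The short sketch below reproduces that arithmetic using only figures quoted in the text; the dictionary labels and variable names are illustrative.

```python
# Highest minus lowest PP in the Anchor condition, rounded to four decimals
# to avoid floating-point noise.
THRESHOLD = round(0.0090 - 0.0066, 4)

reported_differences = {
    "mean across speech streams": 0.0082,
    "smallest across speech streams": 0.0042,
    "mean across test phases": 0.0055,
}

ratios = {label: diff / THRESHOLD for label, diff in reported_differences.items()}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f} times the threshold")
```

The ratios agree with the rounded multiples given in the text (about 3.4, almost 2, and about 2.3).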

Table 3 Differences between highest and lowest phonotactic probabilities and their range for each experiment of the analyzed studies

Moreover, the effects of PP anchors in segmentation may not be limited to artificial languages. With this in mind, we also calculated PPs for one of the Italian languages used in a number of statistical learning studies by Hay et al. (2011, Corpus 2A; e.g., Karaman & Hay, 2018; Pelucchi et al., 2009; Shoaib et al., 2018). This natural language controls for critical TP information, but it is composed of meaningful and grammatically correct phrases in Italian and is naturally produced by a native Italian speaker (Hay et al., 2011). Again, we used the PP calculator by Vitevitch and Luce (2004) to calculate the English PPs for the Italian words, as participants were native English speakers. All phonetic transcriptions, PPs, and calculations are openly available at OSF (https://osf.io/s9thk/).

We found an overall difference between PPs of 0.0120 (range from 0.0001 to 0.0617), which is 5 times higher than the 0.0024 difference from our own stimuli. In addition, the PPs follow a skewed distribution similar to word frequency distributions in natural languages (Fig. 2a; Kurumada et al., 2013; Zipf, 1965). Most of the words had low PPs and few words had very high PPs. When combined, they generate phrases with highly variable PP information (see Fig. 2b), which, in turn, may generate PP anchors that are likely to support speech segmentation. Future corpus analyses could be very informative on this matter.

Fig. 2 (a) Distribution of phonotactic probabilities (PP) for words of Italian speech from Hay et al. (2011, Language 2A). (b) Phonotactic probabilities (PP) of words from one sentence (8th) of the same speech

The presence of PP differences in all the analyzed languages (artificial and natural) raises the possibility that differences in words’ PPs may have created anchors that helped to promote speech segmentation in these studies. Thus, it is possible that learning that has been previously assumed to depend solely on TPs may be a result, instead, of a combination of TP and PP information. However, at least two important counterpoints should be noted.

First, our data, as in Mersad and Nazzi (2011) and Finn and Hudson Kam (2008), come from adult participants and may not reflect linguistic sensitivities in infants. On one hand, adults’ extensive experience with their native language most likely generates a richer phonotactic knowledge when compared with infants’. On the other hand, phonotactic learning begins very early in life, as early as 5 months of age (Sundara & Breiss, 2020), and adults and infants may rely on similar strategies to segment speech (Cunillera et al., 2010). Future studies should extend this investigation to infants, which could provide more insights into the role of PP anchors during language development.

Second, PP differences may only function as anchors when combined with strong TPs. For example, after being familiarized with the Italian languages, infants readily segment words with high TPs (1.0), but not words with low TPs (0.33; Hay et al., 2011; Pelucchi et al., 2009), despite the fact that anchors could have highlighted both types of words. In natural languages, TP values are usually much smaller than the absolute 1.0 used in most segmentation studies (Yang, 2004). Future corpus analyses, experimental studies, and neurophysiological measures (Cunillera et al., 2016) may provide better resolution on the role of varying degrees of TP and PP combination on word segmentation.

Furthermore, although we have focused on PP differences in previously published studies, it may be even more informative for future analyses to explore whether differences in PPs can account for the fact that numerous studies have failed to replicate the original statistical learning findings and thus were not published (the file drawer problem; Black & Bergmann, 2017; Rosenthal, 1979).

In sum, we believe that our findings, together with those from Mersad and Nazzi (2011), and the aforementioned observations about PPs in previous segmentation studies, make a reasonable argument for the effects of PP anchors on speech segmentation. In addition, the present study is, to our knowledge, the first to investigate statistical learning with Brazilian-Portuguese speakers, thus providing additional support for the generality of statistical learning research.

Language learning is a complex process. Several statistical cues have now been shown to play a role in speech segmentation. Here, we present evidence that small variations in phonotactic probability can be combined with transitional probability information to impact speech segmentation. Our findings provide a more nuanced understanding of the role PP anchors play in speech segmentation and suggest that, in future studies, we need to consider even subtle differences in PP when selecting stimuli for speech segmentation research.

Open practices statement

The data and materials for all experiments are available at https://osf.io/s9thk/. None of the experiments was preregistered.

CRediT statement

Rodrigo Dal Ben: Conceptualization, Methodology, Software, Investigation, Writing–Original draft preparation. Débora de Hollanda Souza and Jessica F. Hay: Conceptualization, Resources, Writing–Reviewing and Editing, Supervision.