Abstract
Modulations in both amplitude and frequency are prevalent in natural sounds and are critical in defining their properties. Humans are exquisitely sensitive to frequency modulation (FM) at the slow modulation rates and low carrier frequencies that are common in speech and music. This enhanced sensitivity to slow-rate and low-frequency FM has been widely believed to reflect precise, stimulus-driven phase locking to temporal fine structure in the auditory nerve. At faster modulation rates and/or higher carrier frequencies, FM is instead thought to be coded by coarser frequency-to-place mapping, where FM is converted to amplitude modulation (AM) via cochlear filtering. Here, we show that patterns of human FM perception that have classically been explained by limits in peripheral temporal coding are instead better accounted for by constraints in the central processing of fundamental frequency (F0) or pitch. We measured FM detection in male and female humans using harmonic complex tones with an F0 within the range of musical pitch but with resolved harmonic components that were all above the putative limits of temporal phase locking (>8 kHz). Listeners were more sensitive to slow than fast FM rates, even though all components were beyond the limits of phase locking. In contrast, AM sensitivity remained better at faster than slower rates, regardless of carrier frequency. These findings demonstrate that classic trends in human FM sensitivity, previously attributed to auditory nerve phase locking, may instead reflect the constraints of a unitary code that operates at a more central level of processing.
SIGNIFICANCE STATEMENT Natural sounds involve dynamic frequency and amplitude fluctuations. Humans are particularly sensitive to frequency modulation (FM) at slow rates and low carrier frequencies, which are prevalent in speech and music. This sensitivity has been ascribed to encoding of stimulus temporal fine structure (TFS) via phase-locked auditory nerve activity. To test this long-standing theory, we measured FM sensitivity using complex tones with a low F0 but only high-frequency harmonics beyond the limits of phase locking. Dissociating the F0 from TFS showed that FM sensitivity is limited not by peripheral encoding of TFS but rather by central processing of F0, or pitch. The results suggest a unitary code for FM detection limited by more central constraints.
Introduction
Natural sounds are composed of fluctuations, or modulations, in amplitude and frequency. Amplitude modulations (AM) convey important speech information (Shannon et al., 1995; Smith et al., 2002) and frequency modulations (FM), particularly at slow rates (fm less than ∼5–10 Hz), are important for the perception of music and speech prosody (Ding et al., 2017). Our sensitivity to FM worsens with age and sensorineural hearing loss (Wallaert et al., 2016; Whiteford et al., 2017). This degradation may underlie some of the communication challenges faced by older and hard-of-hearing adults (Strelcyk and Dau, 2009; Parthasarathy et al., 2020), underscoring the importance of understanding how FM is encoded in the auditory system.
Slow FM is thought to be coded via auditory nerve phase locking to stimulus temporal fine structure (TFS; Carlyon et al., 2000; Hoover et al., 2019). This theory has been used to explain why sensitivity to slow FM is greatest at low carrier frequencies, where phase-locking cues are available, and deteriorates at higher frequencies (>4 kHz), where phase locking is thought to degrade (Demany and Semal, 1986; Moore and Sek, 1995, 1996; Sek and Moore, 1995). At faster FM rates, timing cues may be unavailable, even at low frequencies, because of a postulated sluggishness in the ability of the auditory system to evaluate phase-locked TFS information (Moore and Sek, 1995; Sek and Moore, 1995). Therefore, for fast FM rates and/or high carrier frequencies, FM detection may rely on a tonotopic (place) code, based on conversion of FM to AM through cochlear filtering (Zwicker, 1956; Moore and Sek, 1995; Saberi and Hafter, 1995; Whiteford and Oxenham, 2015, 2017; Whiteford et al., 2017).
Doubt has been cast recently on the importance of auditory nerve phase locking for frequency perception (Lau et al., 2017; Mehta and Oxenham, 2022) and FM detection (Whiteford et al., 2020). An alternative explanation is that the tonotopic distribution of average firing-rate information can provide a unitary code for fine-grained changes in both intensity and frequency across all combinations of carrier and modulator frequencies (Micheyl et al., 2013). This alternative approach postulates a more central limitation to perceptual performance, based on the responses of cortical units that have somewhat correlated responses (Cohen and Kohn, 2011). The postulated correlations would favor frequency over amplitude encoding (Micheyl et al., 2013) but only for static or slowly varying stimuli, thereby providing a natural explanation for why sensitivity is greater for slow-rate than for fast-rate FM. The pattern of reduced slow-rate benefits in FM sensitivity at high frequencies (>4 kHz) can also be explained by reduced central sensitivity to certain patterns of AM at high frequencies (Whiteford et al., 2020), perhaps reflecting the reduced prevalence and/or perceptual relevance of slower fine-grained (e.g., harmonic) spectrotemporal modulation patterns at high frequencies (Oxenham et al., 2011; Saddler et al., 2021). If the limitations are indeed central, it is possible that the relevant neural code is based not on frequency, as represented in the cochlea, but instead on higher-level features, such as fundamental frequency (F0), or pitch, as represented cortically (Bendor and Wang, 2005; Allen et al., 2022).
This study tested whether the enhanced sensitivity to slow-rate FM is determined by the constituent frequencies in a stimulus or by higher-level F0 representations. Sensitivity to FM was measured in pure and complex tones with harmonics either within or outside (>8 kHz) the putative limits of auditory nerve phase locking. If peripheral phase-locking limitations determine the pattern of FM detection (Fig. 1A), with a slow-rate advantage limited to carriers within the limits of phase locking, then multiple high-frequency tones should exhibit the same pattern of results as single high-frequency tones because performance will still be limited by lack of phase locking. On the other hand, if more central representations of F0 or pitch determine FM sensitivity (Fig. 1B), then it should be F0, not the harmonic frequencies in the stimulus, that determines the pattern of FM detection thresholds.
Materials and Methods
Participants
Eighteen participants completed each experiment: experiment 1 (3 male, 15 female; mean age, 22.1 years; range, 18–28 years), experiment 2 (3 male, 15 female; mean age, 23.4 years; range, 19–31 years), and experiment 3 (2 male, 16 female; mean age, 21.7 years; range, 18–32 years). Three participants completed both experiments 1 and 2, and one completed all three experiments. To qualify for the study, participants had to have (1) audiometric thresholds ≤20 dB hearing level (HL) at octave-spaced frequencies between 250 and 8000 Hz and (2) sufficient audibility in the frequency regions tested in the experiment, up to 12 kHz, as described below (see Audibility screening). Most participants in experiments 1 and 2 had extensive psychophysical training from participating in previous related studies in the lab, whereas most participants in experiment 3 were naive. To avoid the need for extensive training before participants reached asymptotic performance, an additional inclusion criterion was used for experiment 3: participants had to exhibit average FM detection thresholds of no more than 1.3% for the easiest condition that was tested in all three experiments [F0 = 1.4 kHz, slow rate, with harmonics (H) 6–9; H6–H9]. This cutoff corresponds to the average FM threshold of the poorest-performing participant in experiments 1 and 2. Six participants were excluded from experiment 1, 13 from experiment 2, and 2 from experiment 3 because they did not meet the audibility screening criterion. Three additional participants met all screening criteria but dropped out of experiment 1; one qualifying participant dropped out of experiment 2. Six additional participants were excluded from experiment 3 because their FM thresholds exceeded the study criterion. All participants provided written informed consent and received monetary compensation or course credit for their time.
The experimental protocols were approved by the Institutional Review Board of the University of Minnesota.
Sound presentation and calibration
Stimuli were generated digitally in MATLAB 2016b and converted to analog at a sampling rate of 48 kHz using a Lynx E22 sound card with 24-bit resolution. The tones were presented diotically via open-ear headphones (Sennheiser HD 650) in a sound-attenuating booth. Experiments were conducted using the AFC software package (Ewert, 2013), which is available for download (http://medi.uni-oldenburg.de/afc/).
General procedures
All tasks used a three-interval, three-alternative forced-choice paradigm with a two-down, one-up adaptive procedure that tracks the 70.7% correct point of the psychometric function (Levitt, 1971). The target was randomly presented in either the first, second, or third interval with equal a priori probability. Virtual buttons on the computer screen (labeled 1, 2, and 3) marked the presentation of each stimulus interval. Participants were asked to select the interval that contained the tone (audibility screening test) or contained the modulated tone (experiments 1, 2, and 3). Participants received visual feedback (the words Correct or Incorrect) after each trial and were instructed to look at the screen so they could read the feedback.
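The 70.7% figure follows from the equilibrium condition of the two-down, one-up rule (Levitt, 1971). The track steps down only after two consecutive correct responses and steps up after any incorrect response, so it converges on the stimulus value at which the two outcomes are equally likely. A short worked derivation, with p denoting the probability of a correct response at the tracked value:

```latex
% p: probability of a correct response at the stimulus value being tracked
\begin{align}
P(\text{down}) &= p^{2}, \qquad P(\text{up}) = (1-p) + p(1-p) = 1 - p^{2},\\
p^{2} &= 1 - p^{2} \;\Rightarrow\; p = \sqrt{0.5} \approx 0.707 .
\end{align}
```

With three alternatives, chance performance is 33.3%, so the tracked point lies well above chance.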
Audibility screening
To ensure that the experimental stimuli were sufficiently audible, all participants first underwent an audibility screening. Tone-in-noise thresholds were assessed at low (experiment 1, 200 Hz; experiment 2, 500 Hz), medium (experiments 1 and 2, 1.4 kHz; experiment 3, 2.8 kHz), and high (all three experiments, 12.6 kHz) frequencies. The target intervals contained a tone of 500 ms in duration, including 10 ms raised-cosine onset and offset ramps, while the reference intervals contained 500 ms of silence. The three intervals were separated from each other by 100 ms interstimulus intervals (ISI). Broadband threshold-equalizing noise (TEN; Moore et al., 2000) was presented continuously within each trial, beginning 300 ms before the onset of the first interval and ending 200 ms after the offset of the third interval. The TEN was generated independently for each ear at a level of 35 dB SPL per estimated equivalent rectangular bandwidth (ERB) of the auditory filter at 1 kHz (Glasberg and Moore, 1990) for experiments 1 and 2. For experiment 3, the TEN was generated in the same manner, except that the same TEN was presented to each ear (diotic presentation). On the first trial, the target was presented at 40 dB SPL. The initial step size of the adaptive procedure was 8 dB; this was reduced to 4 dB after the first two reversals, and then to 2 dB after the next two reversals, where it remained for the last six reversals. The threshold for each run was calculated as the mean level at the final six reversal points. Participants completed two runs per target frequency condition, and the average of the two runs was used to calculate each participant's threshold at that frequency. The order of the frequency conditions was randomized on each repetition. 
To pass the audibility screening, listeners needed to obtain masked thresholds <30 dB SPL for experiments 1 and 2 and masked thresholds <39 dB SPL for experiment 3 to ensure that each stimulus component in the experiment was at least 6 dB above its masked or absolute threshold.
Experiment 1
FM difference limens (FMDLs) were measured for slow (fm = 2 Hz) and fast (fm = 20 Hz) modulation rates for both pure-tone and complex-tone carriers. Pure-tone FMDLs in TEN were assessed for carrier frequencies of 0.2, 0.5, 1.4, 4, 8, and 12 kHz. Pure-tone FMDLs in quiet were also measured for fc = 500 Hz and fc = 8000 Hz to provide a more direct comparison with earlier studies. Complex-tone FMDLs were assessed at two F0s (200 and 1400 Hz) for both low (H2–H5) and high (H6–H9) harmonic conditions. The TEN was always present with complex tones to limit the audibility of distortion products (Smoorenburg, 1972; Oxenham et al., 2011). The target and reference stimuli were always 1 s in duration, including 50 ms raised-cosine onset and offset ramps, and were presented at 45 dB SPL per component. The target interval was an FM tone, whereas the reference intervals were the same tone without modulation. All intervals within a trial were separated by 200 ms ISIs. The modulator starting phase was randomized for each stimulus presentation but was held constant across harmonics within each complex tone. The TEN level was 35 dB SPL per ERB at 1 kHz and was generated independently in each ear. The TEN was presented continuously within each trial, beginning 300 ms before the onset of the first interval and ending 200 ms after the offset of the last interval.
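As an illustration of the complex-tone FM stimulus described above, the following is a minimal Python sketch, not the MATLAB code used in the study. It assumes sinusoidal FM with the excursion expressed as a percentage of each component's frequency, so that all harmonics (and hence the F0) move by the same proportion. The function name `fm_complex_tone` and its parameters are hypothetical, and level calibration and the TEN background are omitted.

```python
import math
import random


def fm_complex_tone(f0, harmonics, fm_rate, excursion_pct,
                    dur=1.0, ramp=0.05, fs=48000, mod_phase=None):
    """Harmonic complex whose components share one sinusoidal frequency modulator.

    excursion_pct is the peak-to-peak excursion (2*delta_f) as a percentage of
    each component's carrier frequency (an assumption of this sketch).
    """
    if mod_phase is None:
        # Modulator starting phase: random per stimulus, shared across harmonics.
        mod_phase = random.uniform(0.0, 2.0 * math.pi)
    beta = excursion_pct / 100.0 / 2.0  # peak deviation as a fraction of the carrier
    n = int(round(dur * fs))
    n_ramp = int(round(ramp * fs))
    samples = []
    for i in range(n):
        t = i / fs
        mod = math.cos(2.0 * math.pi * fm_rate * t + mod_phase)
        value = 0.0
        for h in harmonics:
            fc = h * f0
            # Phase is the integral of 2*pi*fc*(1 + beta*sin(2*pi*fm*t + phase)).
            value += math.sin(2.0 * math.pi * fc * t - (fc * beta / fm_rate) * mod)
        # Raised-cosine onset and offset ramps.
        edge = min(i, n - 1 - i)
        gain = 0.5 * (1.0 - math.cos(math.pi * edge / n_ramp)) if edge < n_ramp else 1.0
        samples.append(gain * value / len(harmonics))
    return samples
```

For example, `fm_complex_tone(1400.0, [6, 7, 8, 9], 2.0, 4.0)` would give a slow-rate target for the high-harmonic condition, and the same call with `excursion_pct=0.0` gives the unmodulated reference.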
The adaptive tracking procedure began with a peak-to-peak frequency excursion (2Δf, where Δf is the frequency excursion from the carrier frequency) of 4%. The value of 2Δf initially increased or decreased by a factor of two, based on the rules of the adaptive procedure. The step size was decreased to a factor of 1.41 after two reversals and then to a factor of 1.19 after another two reversals, where it remained for the last six reversals. The threshold was calculated as the mean of the log-transformed peak-to-peak frequency excursion [log10(2Δf)] at the final six reversal points. Participants completed four runs per condition. The conditions were presented in a different randomized order for each of the four repetitions.
Experiment 2
Pure-tone FMDLs and AM difference limens (AMDLs) in TEN were measured for carrier frequencies of 0.5, 1.4, 8, and 12.6 kHz, whereas complex FMDLs and complex AMDLs embedded in TEN were assessed at the high F0 (1.4 kHz) and high-harmonic condition (H6–H9) only. Pure-tone FMDLs and AMDLs in quiet were measured at 1.4 and 12.6 kHz. All other aspects of the stimuli were the same as in experiment 1.
Participants completed four runs of modulation detection per condition. To control for order effects, half of the participants completed FMDLs first, and the other half completed AMDLs first. The FMDL adaptive procedure was assessed in the same manner as in experiment 1. The AMDL adaptive tracking procedure began with a modulation depth [in units of 20log10(m)] of −8 dB. The initial step size was 6 dB for the first two reversals, which was decreased to 2 dB for the next two reversals and then to 1 dB for the last six reversals. The threshold was calculated as the mean depth [in 20log10(m)] at the final six reversal points. All other aspects of the AM task were identical to the procedures for the FM task.
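For readers less familiar with the dB scale for AM depth, the short Python sketch below converts 20log10(m) to the modulation index m and evaluates a sinusoidal envelope. It assumes the conventional envelope 1 + m·sin(2πfmt + φ), which the text does not state explicitly; the function names are hypothetical.

```python
import math


def depth_db_to_m(depth_db):
    """Convert an AM depth in dB, 20*log10(m), to the modulation index m."""
    return 10.0 ** (depth_db / 20.0)


def am_envelope(depth_db, fm_rate, t, mod_phase=0.0):
    """Envelope 1 + m*sin(2*pi*fm*t + phase) at time t (in seconds)."""
    m = depth_db_to_m(depth_db)
    return 1.0 + m * math.sin(2.0 * math.pi * fm_rate * t + mod_phase)
```

Under this convention, the starting depth of −8 dB corresponds to m ≈ 0.40, and the initial 6 dB step changes m by a factor of about two.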
Experiment 3
Harmonic and inharmonic complex-tone FMDLs were measured for F0 = 1.4 kHz with either lower (H2–H5) or upper (H6–H9) harmonics present. The carriers were made inharmonic by independently jittering each component by ±30% of the F0 across trials while maintaining the same jittered values between intervals within each trial. To reduce the potential for audible beats between components in an inharmonic complex, an additional constraint was imposed, namely, that adjacent components had to be at least F0/2 Hz apart. By chance, some of the jitters will create more perceptually harmonic tones than others. One method for estimating the amount of inharmonicity is to calculate the aperiodicity of the signal using autocorrelation, where tones with lower peak autocorrelation values are judged to be more inharmonic (Popham et al., 2018). Autocorrelations were conducted on the steady (unmodulated) complex tones for every jitter combination and normalized so that lags of zero had an autocorrelation value of one. Only jitter combinations that produced tones with a peak of <0.6 at the F0 lag were used. Because the peakiness of the autocorrelation could be high at periods other than the F0, 10,000 jitter combinations were randomly sampled for each condition. Jitter combinations were sorted based on their maximum peak value, and only the 1000 combinations with the lowest peak values were used to generate the inharmonic stimuli for the experiment. The jitter values were randomly sampled with replacement for each trial. All stimuli were presented in TEN, with the TEN level at 35 dB SPL per ERB at 1 kHz and the same sample of TEN presented to each ear. All other aspects of the experimental design were identical to experiment 1.
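The jitter-selection procedure can be sketched in Python as below. This is an illustration under several assumptions that are not in the original text: the normalized autocorrelation is computed analytically for steady, equal-amplitude sinusoids (the mean of cos(2πfτ) across components), the value at a lag of exactly 1/F0 stands in for the "peak at the F0 lag", and the lag window searched for the maximum peak (0.5/F0 to 1.5/F0) is hypothetical. All function names are illustrative.

```python
import math
import random


def sample_jitter(f0, harmonics, rng, max_jitter=0.3):
    """Draw jittered component frequencies with adjacent components >= f0/2 apart."""
    while True:
        freqs = [(h + rng.uniform(-max_jitter, max_jitter)) * f0 for h in harmonics]
        if all(b - a >= f0 / 2.0 for a, b in zip(freqs, freqs[1:])):
            return freqs


def norm_autocorr(freqs, lag):
    """Normalized autocorrelation of equal-amplitude steady sinusoids at one lag."""
    return sum(math.cos(2.0 * math.pi * f * lag) for f in freqs) / len(freqs)


def peak_autocorr(freqs, f0, n_lags=200):
    """Largest autocorrelation over a hypothetical lag window of 0.5/f0 to 1.5/f0."""
    lo, hi = 0.5 / f0, 1.5 / f0
    lags = [lo + (hi - lo) * k / (n_lags - 1) for k in range(n_lags)]
    return max(norm_autocorr(freqs, lag) for lag in lags)


def select_inharmonic(f0, harmonics, n_draw=10000, n_keep=1000,
                      f0_lag_limit=0.6, seed=0):
    """Keep the n_keep least periodic jitter combinations out of n_draw draws."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_draw):
        freqs = sample_jitter(f0, harmonics, rng)
        # Discard combinations that remain too periodic at the F0 lag.
        if norm_autocorr(freqs, 1.0 / f0) < f0_lag_limit:
            scored.append((peak_autocorr(freqs, f0), freqs))
    scored.sort(key=lambda item: item[0])
    return [freqs for _, freqs in scored[:n_keep]]
```

On each trial, one combination would then be drawn with replacement from the returned pool (e.g., with `random.choice`) and held fixed across the three intervals of that trial.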
Experimental design and statistical analyses
The log-transformed FM and AM thresholds [FM, 10log10(2Δf(%)); AM, 20log10(m)] were analyzed using paired-samples t tests (for single comparisons; experiment 2 complex-tone conditions) or repeated-measures ANOVAs (all other conditions). Greenhouse–Geisser corrections were used in cases where Mauchly's test indicated a violation of the assumption of sphericity. Effect size was quantified as partial eta squared (ηp2). For all analyses, significant interactions were further interpreted using simple effects tests. All paired comparisons were two tailed. Bonferroni correction was used to correct for multiple comparisons, and the corrected alpha values are reported in the results, along with the obtained (uncorrected) p values. Statistical analyses were conducted using IBM SPSS Statistics 25 and MATLAB 2016b software.
Our goal was to have a sample size for experiment 1 that was large enough to detect a significant difference between slow- and fast-rate FMDLs with a power (1-β) of 0.8 after Bonferroni correction for six paired comparisons (α = 0.0083), as Bonferroni-corrected paired comparisons would be run to interpret a significant modulation rate × carrier frequency interaction. The power analysis, conducted with G*Power version 3.1.9.4 software (Faul et al., 2007) using FM thresholds at 500 Hz from Whiteford and Oxenham (2015), indicated that a sample size of 18 was required. The sample size from experiments 2 and 3 was chosen to match the sample size from experiment 1. The data are available via the Open Science Framework (https://osf.io/q9mwa/).
Modeling
If human FM detection is limited by independent peripheral noise in each frequency channel (e.g., via phase-locked auditory nerve spike time coding), then FMDLs for complex FM should be predictable from the pure-tone FMDLs for each component within the complex tone, using a model rooted in signal detection theory (Green and Swets, 1966). The individual thresholds for each harmonic were estimated by fitting a spline function through individual FMDLs for pure tones in TEN; thresholds at 12 kHz were substituted for estimated thresholds of the highest harmonic (12.6 kHz) as pure-tone FMDLs were not assessed at or above 12.6 kHz. Sensitivity was estimated from pure-tone thresholds by assuming that (1) d′ is proportional to percentage change in frequency (Dai and Micheyl, 2011; Gockel et al., 2020), (2) information is integrated optimally across independent frequency channels, and (3) performance is limited by peripheral coding (Green and Swets, 1966; Dai and Micheyl, 2011; Oxenham, 2016). Based on methods described in Lau et al. (2017), complex-tone FMDLs (
Results
Experiment 1: patterns of FM sensitivity reflect F0 not frequency
Pure-tone FMDLs
Listeners' pure-tone FMDLs depended on the modulation rate and carrier frequency, consistent with past results that have typically been attributed to temporal coding of TFS at low carriers and slow rates (Sek and Moore, 1995); performance was better at the slow than the fast modulation rate, but only for carrier frequencies ≤4 kHz (Fig. 2). The effect of carrier frequency and rate on FM sensitivity in TEN was examined using a repeated-measures ANOVA with log-transformed FMDLs as the dependent variable and within-subjects factors of carrier frequency and modulation rate (Table 1). There were significant main effects of carrier frequency and rate, as well as a significant interaction (p < 0.0001 in all cases). Post hoc simple effects tests were Bonferroni corrected for six multiple comparisons (α = 0.008) and confirmed that sensitivity for slow (2 Hz) FM was significantly better than fast (20 Hz) FM at carrier frequencies ≤4 kHz (all p values < 0.008) but not at 8 kHz (p = 0.038) or 12 kHz (p = 0.178).
This classic interaction between modulation rate and carrier frequency was present both in quiet and in TEN when analyses were limited to the two carrier frequencies common in both background conditions (0.5 and 8 kHz; Table 1), although the magnitude was dependent on the background (rate × carrier × background, F(1,17) = 20.3, p = 0.0003, ηp2 = 0.545; Table 1). Simple effects tests were conducted to further interpret the interaction, with Bonferroni correction applied for four multiple comparisons (α = 0.0125). For FM in quiet, there was no significant difference between slow and fast FMDLs when fc = 500 Hz (p = 0.147), but fast FM was significantly better than slow FM at 8000 Hz (p < 0.0001). For FM in TEN, sensitivity was significantly better at the slow rate for the 500 Hz carrier (p < 0.0001) but not the 8000 Hz carrier (p = 0.038). Overall, performance was worse in TEN than in quiet, but the magnitude of the impairment was greatest for the fast rate, especially when the carrier frequency was low. The significant slow-rate advantage only for the low carrier frequency, even in the presence of TEN, is consistent with the phase-locked timing theory for coding low-carrier, slow-rate FM.
The results replicate the classic finding that sluggishness in FM sensitivity (i.e., greater sensitivity to slow than fast FM rates) is found for low, but not high, carrier frequencies. In the next section, these FMDLs are taken from the individual pure-tone carrier frequencies and used to predict FMDLs for the complex tones, each comprising four harmonics, and to compare them with the observed FMDLs for complex tones.
Predicted and observed complex-tone FMDLs
If the FMDLs for the pure tones are limited by peripheral processes (e.g., auditory nerve phase locking), then the optimal performance for FMDLs of complex tones (assuming independent information from each component) is given by Equation 1. Inserting the interpolated FMDL values obtained for the pure tones into this equation at both low and high F0s (200 and 1400 Hz) and harmonic numbers (H2–H5 and H6–H9) produces the predictions shown in Figure 3A (200 Hz F0) and 3B (1400 Hz F0). As expected, based on the pure-tone FMDLs in Figure 2, a large benefit of slow-rate (filled symbols) versus fast-rate (open symbols) FM detection is predicted when the stimuli include low frequencies, including the 1400 Hz F0 with lower harmonic numbers (H2–H5). However, that benefit is not predicted for the 1400 Hz F0 condition with higher harmonic numbers (H6–H9).
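The logic of this prediction can be made concrete with a short sketch. Under the stated assumptions (d′ proportional to the percentage change in frequency, and optimal combination of independent channels so that the d′ for the complex is the square root of the sum of squared component d′ values), the predicted complex-tone threshold is θ_complex = (Σ θ_i^−2)^(−1/2), where θ_i is the pure-tone FMDL at the frequency of component i. This form is reconstructed here from those assumptions and may differ in notation or detail from Equation 1. The function name and the example values are hypothetical.

```python
import math


def predict_complex_fmdl(component_fmdls):
    """Predicted complex-tone FMDL from pure-tone FMDLs (same units, e.g., percent).

    Assumes d' is proportional to the percentage frequency change and that
    independent channels combine optimally: d'_complex = sqrt(sum(d'_i ** 2)).
    """
    return 1.0 / math.sqrt(sum(1.0 / (theta ** 2) for theta in component_fmdls))


# Hypothetical pure-tone FMDLs (percent) for four components; not data from the study.
example = [1.0, 1.0, 1.0, 1.0]
print(predict_complex_fmdl(example))  # four equal components halve the threshold
```

Because the sum is dominated by the most sensitive component, a complex that contains any low-frequency component with a good slow-rate threshold inherits a predicted slow-rate benefit, whereas a complex whose components all show no rate difference yields none.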
The actual data from participants are shown in Figure 3C for the 200 Hz F0 and Figure 3D for the 1400 Hz F0. The obtained thresholds were higher overall than the predicted thresholds, which is expected given that the predictions were derived from thresholds measured for each component alone, without the context of the other harmonic components (Moore et al., 1984; Gockel et al., 2007). In contrast to the predicted FMDLs, complex FM sensitivity was greater for slow-rate (filled symbols; Figure 3) than fast-rate (open symbols) FM, even for the high-harmonic condition (F0 = 1.4 kHz; H6–H9), where the lowest harmonic present (H6 = 8.4 kHz) is well above the putative limits of human phase locking (Verschooten et al., 2018). This conclusion is supported by a repeated-measures ANOVA on the log-transformed complex-tone FMDLs, which showed a highly significant main effect of modulation rate but no significant two- or three-way interactions between modulation rate and harmonic condition (Table 2). These results suggest that the dependence of FMDLs on rate may be determined by F0 (pitch) rather than phase-locked encoding of the individual harmonics. The elevated (poorer) overall thresholds for F0 = 200 Hz are consistent with those seen for very low-frequency pure-tone carriers (Sek and Moore, 1995).
Interestingly, there was a small but significant harmonic condition × F0 interaction (p = 0.039; Table 2). The FMDLs were elevated in the high harmonic condition relative to the low, but only when the F0 was 1400 Hz (200 Hz F0, high vs low, p = 0.988; 1400 Hz F0, high vs low, p = 0.024; α = 0.025). Notably, however, FMDLs were better overall for the 1400 Hz F0 than for the 200 Hz F0, even in the high harmonic condition (low, 200 Hz vs 1400 Hz, p < 0.001; high, 200 Hz vs 1400 Hz, p < 0.001; α = 0.025), despite the lowest harmonic present in this condition, 8400 Hz, being well above the limits of phase locking. This outcome casts further doubt on the idea that phase locking underlies accurate FM detection.
To summarize, contrary to predictions from the pure-tone FMDLs, the complex-tone FMDLs continued to show a slow-rate benefit with an F0 of 1400 Hz, even when all the components of the complex were well above the putative limits of phase locking. This outcome is consistent with a central pitch- or F0-based representation limiting FM detection rather than one based on peripheral auditory nerve phase locking.
Experiment 2: different sensitivity patterns observed for FM and AM
To provide a direct comparison with patterns of AM sensitivity under comparable conditions, both FM and AM sensitivity were measured within the same participants. We predicted that slow-rate benefits in modulation perception should only be observed when detecting fluctuations in pitch (i.e., FM perception) but not when detecting fluctuations in amplitude or loudness (AM perception).
Figure 4 shows results consistent with the hypothesis that slow-rate benefits are linked to F0 processing and are specific to FM (Table 3). Consistent with experiment 1, the slow-rate benefit in pure-tone FM sensitivity diminished as the carrier frequency increased (Fig. 4A; Table 3). This effect seems to be driven by F0, rather than absolute frequency, as the slow-rate benefit was again present in the high-frequency complex-tone condition (F0 = 1400 Hz; H6–H9; t(17) = −4.29, p = 0.0005; Fig. 4B). This trend is not observed for AM encoding at high frequencies; sensitivity for fast-rate AM always exceeded slow-rate AM [based on post hoc simple effects tests for all pairwise comparisons of slow vs fast rates for pure tones at each carrier frequency in TEN and in quiet, all p values ≤ 0.005; Bonferroni-corrected α = 0.0083 (0.05/6)], even for complex tones with an F0 of 1400 Hz and high-frequency harmonics (H6–H9; t(17) = 10.2, p < 0.0001). There was also a significant three-way interaction in pure-tone AM sensitivity (Table 3). An ANOVA on the difference thresholds between TEN and quiet revealed a significant rate × carrier interaction (F(1,17) = 76, p < 0.0001, ηp2 = 0.655). Post hoc simple effects tests demonstrated no difference in impairment from TEN on slow and fast AM thresholds when the carrier frequency was high (mean difference between slow/fast rates, 0.072, p = 0.888) but significantly greater impairment on the fast rate when the carrier was low (mean difference, 4.18, p < 0.0001). A separate ANOVA on pure-tone AM sensitivity in TEN confirmed that AM sensitivity was better at the fast rate for all carrier frequencies (Table 3; all post hoc comparisons for slow vs fast AMDLs at each carrier resulted in p ≤ 0.005). In summary, the slow-rate benefits in FM sensitivity were maintained at an F0 of 1400 Hz, despite the absence of TFS (timing) cues, as found in experiment 1. 
In contrast, AM detection was generally independent of the carrier frequency or F0, with greater sensitivity at the faster than slower modulation rate across all conditions.
Experiment 3: pitch salience and the slow-rate advantage
If the slow-rate advantage in FM sensitivity is driven by pitch (F0), then weakening the pitch should lead to a reduced slow-rate advantage. We reduced the pitch salience of the harmonic tones by jittering the components by ±30% of the F0 and measured FMDLs for F0 = 1400 Hz with low (H2–H5) and high harmonic components (H6–H9) at slow (2 Hz) and fast (20 Hz) rates. Harmonic FM sensitivity was assessed in the same participants for direct comparison. Weakening the pitch by jittering the harmonics resulted in poorer FMDLs (main effect of harmonicity, Table 4; Fig. 5), in line with previous studies that found pitch perception for steady complex tones worsens somewhat when the components are not integer multiples of the F0 (Carlyon and Stubbs, 1989; Micheyl et al., 2010; McPherson and McDermott, 2018; Mehta and Oxenham, 2022). Importantly, the degree of impairment was greatest at the slow rate, consistent with the hypothesis that the slow-rate advantage is related to pitch encoding (harmonicity × modulation rate interaction; Table 4). FM sensitivity was greater at the slow rate than at the fast rate for both harmonic (p < 0.0001) and inharmonic complex tones (p < 0.0001), but the magnitude of the slow-rate advantage was larger in the harmonic condition. The effect of harmonicity was significant for slow-rate FM (p < 0.0001), with higher sensitivity for harmonic compared with inharmonic tones, but this trend did not reach significance for fast-rate FM (p = 0.024, α = 0.0125). Unlike experiment 1, there was a significant modulation rate × harmonic condition interaction (Table 4), with thresholds poorer for the upper harmonics, particularly for the slow rate (slow rate, low vs upper harmonics, p = 0.0002; fast rate, low vs upper harmonics, p = 0.021). Visual inspection (Fig. 5) suggests that this interaction is mostly driven by the inharmonic condition, but the three-way interaction was not significant (Table 4).
Overall, the results show that weakening the pitch by introducing inharmonicity significantly decreases the slow-rate advantage in FM detection, consistent with the hypothesis that enhanced slow-rate FM sensitivity is based on central processing of pitch rather than phase locking to the TFS of the individual frequency components.
Discussion
Summary of results
In three separate experiments we found enhanced slow-rate FM sensitivity for complex harmonic tones with an F0 within the range of musical pitch (1400 Hz), even with spectrally resolved harmonic components all above 8 kHz, well beyond the putative limits of auditory nerve phase locking. Weakening the pitch by randomly jittering the component frequencies significantly reduced FM sensitivity for stimuli in the same extended high-frequency range. Our results render unlikely the previously accepted explanation for slow-rate FM sensitivity in terms of the phase-locked activity that is observed in the auditory nerve and brainstem (Paraouty et al., 2018) and instead support more central coding limitations at the level of F0 or pitch representations.
Considering alternative explanations
One alternative explanation is that there may be residual phase-locking cues available even above 8 kHz (Heinz et al., 2001; Verschooten et al., 2019), and the combined residual information across the four harmonics may be enough to drive the slow-rate benefit. This explanation was ruled out based on two findings. First, performance at both slow and fast FM rates was better in the high-harmonic condition for the higher F0 than the lower F0, despite the fact that the harmonic components of the lower F0 ranged from 1200 to 1800 Hz—well within the range of accurate phase locking—whereas the components of the higher F0 were all above 8 kHz. The better performance with the 1400 than the 200 Hz F0 may be because of sharper human cochlear tuning at higher frequencies (Shera et al., 2002; Sumner et al., 2018), resulting in more salient (place based) effects of continuous changes in the frequencies and F0s for 1400 Hz than 200 Hz. Second, the predicted complex-tone FMDLs from the pure-tone FMDLs, based on the assumption of optimal integration of independent information from each carrier within the complex tone, did not match the pattern of observed results. As illustrated by the predictions, if residual phase locking had limited performance, then the slow- and fast-rate FMDLs should have converged for the high-frequency complex tones as predicted from the pure-tone FMDLs. Instead, the complex-tone FMDLs continued to show a significant slow-rate advantage.
Another alternative explanation is that the instantaneous F0 of the complex tones could be represented via temporal-envelope cues produced by interactions between neighboring harmonics. In that case, the slow-rate benefit would be driven by phase-locked responses to the envelope rather than by representations of individual harmonics, as found by Carlyon et al. (2000) for harmonic complexes with higher-numbered (>H15) unresolved harmonics. However, there are several reasons the results cannot be explained by temporal-envelope coding of the F0. First, F0 envelope coding begins to degrade at ∼150 Hz, nearly an order of magnitude lower than the F0s used in the present study (Kohlrausch et al., 2000). This degradation beyond 150 Hz likely explains why envelope cues have not been found to provide pitch information at F0s beyond 850–1000 Hz (Burns and Viemeister, 1976; Macherey and Carlyon, 2014). Second, studies using very similar stimuli for pitch discrimination (Oxenham et al., 2011; Lau et al., 2017) have found that performance is not affected by presenting alternating harmonics to opposite ears (which would disrupt temporal-envelope cues) but is affected by shifting harmonics by a fixed number of Hz (which would not disrupt temporal-envelope cues but would disrupt place-based harmonicity cues). In summary, F0-based temporal-envelope cues are unlikely to explain the results.
A unified framework for FM based on place coding and F0-dependent rate processing
The central finding of our study is that the slow-rate benefit observed in FM detection seems to be mediated by the centrally derived F0 and not the peripherally represented individual components. This observation fills an important gap in our understanding of how dynamic changes in frequency are coded and perceived. Earlier work in young normal-hearing individuals, which found substantial multicollinearity between FM and AM detection thresholds at different rates (Ochi et al., 2014; Whiteford and Oxenham, 2015; Otsuka et al., 2016; Whiteford et al., 2017), has already provided indirect support for the hypothesis that central rather than peripheral constraints determine performance limits. One exception involves people with sensorineural hearing loss who have poorer cochlear tuning, leading to degraded frequency selectivity; in this population, FM detection thresholds at both slow and fast rates seem to be mediated in part by cochlear filter tuning (Whiteford et al., 2020).
The pattern of results from these earlier studies can be reconciled within a single theoretical framework, whereby FM at all rates and carrier frequencies is detected via the FM-to-AM conversion, initially produced by cochlear filtering (Zwicker, 1956; Saberi and Hafter, 1995) but then propagated via the tonotopic mapping that characterizes the auditory pathways up to and including auditory cortex. At a central (cortical) level, the tonotopic representation is carried by neurons that have some degree of correlation in their responses (Cohen and Kohn, 2011). According to Micheyl et al. (2013), even a small (and physiologically plausible) degree of such correlation can predict human perceptual sensitivity to changes in both amplitude and frequency (i.e., AM and FM) within the same neural coding framework. Relatively long time windows are required to extract correlation information, which would limit the influence of correlation to slow modulation rates, where the period of modulation is longer than the duration of the analysis time window. The present results suggest that FM encoding may be limited by a central representation of F0, not just frequency. A recent fMRI study in humans has revealed evidence for tuning to F0 in the areas surrounding Heschl's gyri (HG; Allen et al., 2022). Superior temporal gyri and medial HG were specifically responsive to lower F0s, which could be a potential locus for slow-rate, low-carrier FM coding, although further physiological studies specific to FM are needed.
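The FM-to-AM conversion invoked in this framework can be illustrated with a small numerical sketch. It is not a model from the study. It assumes a rounded-exponential (roex) filter shape with an equivalent rectangular bandwidth from the Glasberg and Moore formula, and it uses a quasi-static approximation in which the filter output envelope simply follows the filter gain at the instantaneous frequency, which is only reasonable for slow FM. All function names and parameter values are hypothetical.

```python
import math


def erb_hz(cf):
    """Equivalent rectangular bandwidth (Hz) at center frequency cf (Hz),
    using the Glasberg and Moore formula (an assumed filter bandwidth)."""
    return 24.7 * (4.37 * cf / 1000.0 + 1.0)


def roex_gain(f, cf):
    """Amplitude gain of an assumed roex(p) filter centered at cf."""
    p = 4.0 * cf / erb_hz(cf)
    g = abs(f - cf) / cf                      # normalized frequency offset
    power = (1.0 + p * g) * math.exp(-p * g)  # roex power weighting
    return math.sqrt(power)


def induced_am_depth(carrier, peak_dev, cf, n=1000):
    """AM depth at the output of one filter for a sinusoidally
    frequency-modulated tone (quasi-static approximation).

    carrier  : carrier frequency in Hz
    peak_dev : peak frequency deviation in Hz
    cf       : center frequency of the (hypothetical) filter in Hz
    """
    envelope = [
        roex_gain(carrier + peak_dev * math.sin(2.0 * math.pi * k / n), cf)
        for k in range(n)
    ]
    hi, lo = max(envelope), min(envelope)
    return (hi - lo) / (hi + lo)


if __name__ == "__main__":
    # An 8400-Hz carrier observed through a filter centered above it.
    for dev in (0.0, 50.0, 100.0):
        print(dev, induced_am_depth(8400.0, dev, cf=8800.0))
```

In this sketch, a filter centered away from the carrier converts the frequency excursion into a fluctuation in output level, and the induced AM depth grows with the frequency deviation. A filter centered exactly on the carrier produces much less AM, because the carrier then sweeps across the flat top of the filter.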
Why would the slow-rate advantage for pure-tone FM worsen at high frequencies if not for degraded peripheral phase locking? Fine-grained spectrotemporal patterns are less perceptually relevant and probably less prevalent in our natural environment at high than low frequencies (Oxenham et al., 2011; Saddler et al., 2021); therefore, their representations may be associated with less accurate neural coding and potentially decreased neuronal correlation, consistent with the detection and discrimination of AM patterns at high frequencies (Whiteford et al., 2020).
The present results with FM can be compared with results from frequency- and F0-discrimination tasks with similar combinations of frequency components and F0s (Oxenham et al., 2011; Lau et al., 2017). In both cases, good performance can be achieved with harmonics that are all well above the putative limit of auditory nerve phase locking. Lau et al. (2017) compared frequency discrimination (FDLs) with F0 discrimination (F0DLs) and found that the F0DLs were better than could be predicted by an optimal integration of information, assuming independent (peripheral) noise sources. We did not find better-than-predicted integration here. This difference between the studies seems to stem from the much lower (better) thresholds found here for pure-tone FM detection than for pure-tone frequency discrimination at the same frequencies observed by Lau et al. (2017). The poorer sensitivity to discrete frequency changes than to continuous FM may be partly because the levels of the tones were roved between intervals by Lau et al. (2017) to avoid potential loudness cues. More recently, Gockel et al. (2020) replicated the basic patterns of results found by Lau et al. (2017) but used participants with more extensive exposure to very high-frequency tones. They found considerably lower pure-tone FDLs (closer to the FMDLs observed here), so the improved performance between the pure-tone and F0 discrimination was also not better than predicted by optimal integration, as found here.
One difference between the present results and previous studies pertains to the especially fine-grained sensitivity for F0 changes at extended high frequencies (F0 = 1400 Hz, H6–H9), relative to F0 discrimination in the same frequency range (Lau et al., 2017; Gockel et al., 2020; Mehta and Oxenham, 2022). High-acuity FM detection for this F0 is expected because FM sensitivity for the individual component frequencies is also quite fine grained, with pure-tone thresholds comparable to previous studies (Sek and Moore, 1995). Another possible reason for the difference is that F0-based pitch may be more fragile at extended high frequencies and not as robust to the additional task demands required for discrimination (i.e., memory/attention demands of comparing the pitch across time intervals and labeling of high/low contours) compared with detection of instantaneous changes in any dimension, which is all that is required for FM detection.
Implications
The results from our study provide new insights into the neural representation of sounds that vary dynamically in frequency, a characteristic shared by most natural sounds. Along with the results from earlier studies (Oxenham et al., 2011; Lau et al., 2017; Whiteford et al., 2020), the outcomes can be explained in terms of a unitary code for FM that is mediated by cochlear filtering but constrained by central representations of F0, rather than the peripheral phase-locked representations of the individual components. One important implication of a place-based code for pitch and FM coding is that future innovations to restore pitch and FM in dysfunctional hearing should focus on improving frequency-to-place representations, as efforts to emphasize temporal representations of TFS (Dillon et al., 2016) may not succeed in improving either pitch or FM perception. Current assistive listening devices do not adequately restore FM cues to normal (Chen and Zeng, 2004; Ives et al., 2013), but the present findings suggest that emphasizing place-based cues to extract pitch and F0 may lead to more robust perception.
Footnotes
This work was supported by National Institutes of Health Grants R01 DC005216 (A.J.O.) and R21 DC019409 (K.L.W.) and an Eva O. Miller Fellowship (K.L.W.). We thank Penny Corbett, Angela Sim, Kara Stevens, and Linlu Sun for assistance with data collection and Daniel Guest for assistance with creating Figure 1.
The authors declare no competing financial interests.
Correspondence should be addressed to Kelly L. Whiteford at whit1945@umn.edu