Significance of spectral cues in automatic speech segmentation for Indian language speech synthesizers
Speech Communication ( IF 2.4 ) Pub Date : 2020-06-27 , DOI: 10.1016/j.specom.2020.06.002
Arun Baby , Jeena J. Prakash , Aswin Shanmugam Subramanian , Hema A. Murthy

Building speech synthesis systems for Indian languages is challenging because digital resources for these languages are scarce. Vocabulary-independent speech synthesis requires that a given text be split at the level of the smallest sound unit, namely, the phone. The waveforms or models of phones are concatenated to produce speech. When digital resources are scarce, the waveforms corresponding to phones are obtained manually, by listening and marking. Manual labeling of speech data (also known as speech segmentation) can, however, lead to inconsistencies, as the duration of a phone can be as short as 10 ms.

The most common approach to automatic segmentation of speech is to perform forced alignment using monophone hidden Markov models (HMMs) obtained by embedded re-estimation after flat-start initialization. The resulting alignments are then used in neural network frameworks to build better acoustic models for speech synthesis/recognition. Segmentation with this approach requires large amounts of data and does not work well for low-resource languages. To address this paucity of data, signal processing cues such as short-term energy (STE) and sub-band spectral flux (SBSF) are used in tandem with HMM-based forced alignment for automatic speech segmentation.
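The heart of HMM forced alignment is a Viterbi pass that finds the best monotonic mapping from frames to the known left-to-right state sequence. The sketch below is a minimal, illustrative version (not the paper's implementation): it assumes frame-level log-likelihoods have already been computed by the acoustic model, and the function name and matrix layout are assumptions for illustration.

```python
import numpy as np

def forced_align(log_likes):
    """Viterbi forced alignment. log_likes[t, s] is the log-likelihood of
    frame t under state s of the known left-to-right state sequence.
    Returns, per frame, the index of the state it is aligned to; each
    frame either stays in the current state or advances by one."""
    T, S = log_likes.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_likes[0, 0]          # alignment must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                score[t, s] = move + log_likes[t, s]
                back[t, s] = s - 1
            else:
                score[t, s] = stay + log_likes[t, s]
                back[t, s] = s
    # alignment must end in the last state; trace the path backwards
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

Phone boundaries are then read off wherever the state index changes between consecutive frames.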

STE and SBSF are computed on the speech waveforms. STE yields syllable boundaries, while SBSF provides locations of significant change in spectral flux, which are indicative of fricatives, affricates, and nasals. STE and SBSF cannot be used directly to segment an utterance: minimum-phase group delay based smoothing is applied to preserve these landmarks while reducing local fluctuations. The boundaries obtained with HMMs are corrected at the syllable level wherever the syllable boundaries are known to be correct. Embedded re-estimation of the monophone HMMs is then repeated using the corrected alignment. Thus, by using signal processing cues and HMM re-estimation in tandem, robust monophone HMMs are built. These models are then used in Gaussian mixture model (GMM), deep neural network (DNN), and convolutional neural network (CNN) frameworks to obtain state-level frame posteriors, and the boundaries are again iteratively corrected and re-estimated.
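The two signal processing cues and the boundary-correction step can be sketched as follows. This is an illustrative simplification, not the paper's code: frame sizes, the high-frequency band used for SBSF, and the snapping tolerance are assumed values, and the group delay smoothing step is omitted.

```python
import numpy as np

def short_term_energy(x, frame_len=400, hop=160):
    """Short-term energy per frame (assumed 25 ms window / 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2) for i in range(n)])

def sub_band_spectral_flux(x, frame_len=400, hop=160, n_fft=512,
                           band=(2000, 8000), sr=16000):
    """Frame-to-frame spectral flux restricted to a high-frequency band,
    where fricatives and affricates concentrate energy. Band edges are
    illustrative assumptions."""
    lo = int(band[0] * n_fft / sr)
    hi = int(band[1] * n_fft / sr)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    mags = np.array([np.abs(np.fft.rfft(x[i * hop:i * hop + frame_len], n_fft))[lo:hi]
                     for i in range(n)])
    flux = np.sum(np.diff(mags, axis=0) ** 2, axis=1)
    return np.concatenate([[0.0], flux])       # pad so output matches frame count

def snap_boundaries(hmm_boundaries, landmarks, tol=3):
    """Move each HMM boundary (in frames) to the nearest signal-processing
    landmark if one lies within tol frames; otherwise leave it unchanged."""
    out = []
    for b in hmm_boundaries:
        nearest = min(landmarks, key=lambda l: abs(l - b))
        out.append(nearest if abs(nearest - b) <= tol else b)
    return out
```

In the iterative scheme the abstract describes, peaks in the smoothed STE/SBSF curves would serve as the landmarks passed to a correction step like `snap_boundaries`, and the corrected alignment would seed the next round of HMM re-estimation.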

Text-to-speech (TTS) systems are built for different Indian languages using phone alignments obtained with and without signal processing based boundary correction. Both unit selection and statistical parametric TTS systems are built. Listening tests showed a significant improvement in synthesis quality when signal processing based boundary correction was used.

