Unsupervised Adaptation of Categorical Prosody Models for Prosody Labeling and Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing


IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1), Pub Date: 2009-01-01, DOI: 10.1109/tasl.2008.2005347
Sankaranarayanan Ananthakrishnan, Shrikanth Narayanan

Automatic speech recognition (ASR) systems rely almost exclusively on short-term, segment-level features (MFCCs), ignoring the higher-level suprasegmental cues that are characteristic of human speech. Recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. However, categorical prosody models are severely limited in scope and coverage because few large corpora are annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR system with symbolic prosody. We then discuss two novel unsupervised adaptation techniques that improve the quality of the linguistic and acoustic components, respectively, of our categorical prosody models. Finally, we implement the augmented ASR system by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models: on the Boston University Radio News Corpus, the adapted prosodic language and acoustic models reduce the binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively, relative to the seed models, while the prosody-enriched ASR system exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.
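The reported gains are relative reductions, that is, fractions of the baseline error rate rather than absolute percentage-point drops. The following minimal sketch illustrates the arithmetic. The abstract does not state the baseline WER, so the value `hypothetical_baseline_wer` below is an assumption chosen purely for illustration, and both function names are hypothetical helpers rather than anything from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction in error rate as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


def apply_relative_reduction(baseline: float, reduction: float) -> float:
    """Return the error rate left after a given relative reduction."""
    return baseline * (1.0 - reduction)


# ASSUMPTION: a 20% baseline WER, invented for illustration only.
# The abstract reports only the 3.1% relative reduction, not the baseline.
hypothetical_baseline_wer = 0.20

improved_wer = apply_relative_reduction(hypothetical_baseline_wer, 0.031)
absolute_drop = hypothetical_baseline_wer - improved_wer

print(f"improved WER:  {improved_wer:.4f}")
print(f"absolute drop: {absolute_drop:.4f}")
```

Under this assumed baseline, a 3.1% relative reduction corresponds to well under one absolute percentage point, which is why the distinction between relative and absolute figures matters when comparing systems.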

Updated: 2019-11-01
