Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
arXiv - CS - Sound. Pub Date: 2020-07-01. DOI: arxiv-2007.00183
Bowen Shi, Shane Settle, Karen Livescu

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the segment feature vectors defined using acoustic word embeddings. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over comparable A2W models.
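To make the decoding step concrete, here is a minimal sketch of segmental Viterbi decoding over whole-word hypotheses. It is illustrative only, not the authors' GPU implementation: the mean-pooled segment embedding and dot-product scorer are stand-in assumptions for the paper's AWE-based scoring function, and all names (segmental_viterbi, max_len) are hypothetical. The O(T · max_len · V) inner loop also makes the abstract's complexity point concrete: cost grows linearly with the vocabulary size V, which is why whole-word units are so much more expensive than phone-like subword units.

import numpy as np

def segmental_viterbi(frames, word_emb, max_len=8):
    """Viterbi decoding over whole-word segments.

    frames:   (T, d) array of frame-level acoustic features.
    word_emb: (V, d) matrix of word embeddings (rows = vocabulary).
    A segment [s, t) is scored by the dot product between its
    mean-pooled acoustic embedding and each word embedding -- a
    stand-in for the AWE-based scoring function in the paper.
    Returns the best score and the (start, end, word) sequence.
    """
    T = frames.shape[0]
    best = np.full(T + 1, -np.inf)  # best[t]: best score over segmentations of frames[:t]
    best[0] = 0.0
    back = [None] * (T + 1)         # back[t]: (start, word_id) of the last segment
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            seg = frames[s:t].mean(axis=0)  # pooled segment embedding
            scores = word_emb @ seg         # one score per vocabulary word
            w = int(np.argmax(scores))
            cand = best[s] + scores[w]
            if cand > best[t]:
                best[t] = cand
                back[t] = (s, w)
    # Trace back the best segmentation.
    path, t = [], T
    while t > 0:
        s, w = back[t]
        path.append((s, t, w))
        t = s
    return best[T], path[::-1]

# Toy usage: 10 random frames, vocabulary of 5 words, 4-dim embeddings.
rng = np.random.default_rng(0)
score, path = segmental_viterbi(rng.normal(size=(10, 4)),
                                rng.normal(size=(5, 4)))
print(score, path)

Note also how a factorized scorer of this kind (pool the segment once, then one matrix-vector product against the word embedding table) avoids materializing a separate score tensor per segment-word pair, which is in the spirit of the space-complexity reduction the abstract mentions.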

Updated: 2020-07-02