Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition
arXiv - CS - Sound. Pub Date: 2020-06-30, DOI: arXiv-2007.00131
Maarten Van Segbroeck, Harish Mallidi, Brian King, I-Fan Chen, Gurpreet Chadha, Roland Maas

Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input representation for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, tuned window size and stride, referred to as the 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture allows modeling a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% across different speaker and acoustic environment scenarios over an optimized single-view FLSTM model, while retaining a similar computational footprint.
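The core mechanism can be sketched in a few lines: each view is a (window, stride) pair that slices the frequency axis of every frame into overlapping chunks, each view's chunks would feed its own FLSTM stack, and the per-view outputs are concatenated and linearly projected down before entering the time LSTM. The sketch below illustrates only the windowing and combination step with numpy; the view configurations, feature dimensions, and the random projection standing in for a learned linear layer are illustrative assumptions, not values from the paper.

```python
import numpy as np

def frequency_views(frames, window, stride):
    """Slice the frequency axis of (T, F) features into overlapping
    chunks of size `window` taken every `stride` bins.
    Returns an array of shape (T, n_chunks, window)."""
    T, F = frames.shape
    starts = range(0, F - window + 1, stride)
    return np.stack([frames[:, s:s + window] for s in starts], axis=1)

# hypothetical view configurations (window, stride); the paper tunes these
views = [(24, 12), (32, 16), (48, 24)]

frames = np.random.randn(10, 96)  # 10 frames of 96 mel bins (illustrative)

# in the real model each view's chunk sequence is processed by its own
# FLSTM stack; here we flatten per-view chunks to show the combination step
per_view = [frequency_views(frames, w, s).reshape(10, -1) for w, s in views]
combined = np.concatenate(per_view, axis=1)   # (T, sum of per-view dims)

# dimensionality-reducing projection (random weights stand in for the
# learned linear layer whose output feeds the time-LSTM stack)
proj = np.random.randn(combined.shape[1], 64)
reduced = combined @ proj                     # (10, 64)
```

Because each view covers the same frequency range at a different granularity, the concatenated representation exposes both narrow- and wide-band correlations to the projection layer, while the fixed output dimension keeps the time-LSTM cost unchanged.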

Updated: 2020-07-02