A Robust Speaking Rate Estimator Using a CNN-BLSTM Network, Circuits, Systems, and Signal Processing



Circuits, Systems, and Signal Processing (IF 1.8), Pub Date: 2021-06-11, DOI: 10.1007/s00034-021-01754-1
Aparna Srinivasan, Diviya Singh, Chiranjeevi Yarra, Aravind Illa, Prasanta Kumar Ghosh

Speaking rate estimation directly from acoustic features is useful in applications including pronunciation assessment, dysarthria detection and automatic speech recognition. Most existing methods for speaking rate estimation rely on heuristically designed steps. In contrast, this work proposes a data-driven approach that uses a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network to jointly optimize all steps of speaking rate estimation within a single framework. Also, unlike existing deep learning-based methods, the proposed approach estimates the speaking rate for an entire speech utterance in one pass instead of operating on segments of a fixed duration. The input to the proposed CNN-BLSTM network consists of the traditional 19 sub-band energy (SBE) contours, used as low-level features; the state-of-the-art direct acoustic feature-based speaking rate estimation techniques are also built on these 19 SBEs. Experiments are performed separately on three native English speech corpora (Switchboard, TIMIT and CTIMIT) and one non-native English speech corpus (ISLE). Of these, TIMIT and Switchboard are used for training the network. Testing is carried out on all four corpora, and additionally on TIMIT and Switchboard with five types of additive noise (white, car, high-frequency-channel, cockpit and babble) at signal-to-noise ratios of 20, 10 and 0 dB. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in both clean and noisy conditions on all four corpora.
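The noisy test conditions mentioned in the abstract (additive noise at 20, 10 and 0 dB SNR) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `mean_power`, `mix_at_snr` and `measured_snr_db` are hypothetical, the signals are toy stand-ins (a sine tone and Gaussian noise rather than corpus recordings), and the scaling follows the standard definition SNR(dB) = 10 log10(P_speech / P_noise).

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR in dB.

    Assumes `noise` is at least as long as `speech` and has non-zero power.
    """
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise), solved for the noise gain.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins (assumptions): a 200 Hz tone at 8 kHz as "speech", white Gaussian noise.
    speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
    white = [rng.gauss(0.0, 1.0) for _ in range(8000)]
    for snr in (20, 10, 0):
        noisy = mix_at_snr(speech, white, snr)
        print(snr, round(measured_snr_db(speech, noisy), 3))
```

Scaling the noise rather than the speech keeps the clean signal level fixed across conditions, so the same reference utterance can be compared at every SNR. Whether the paper computes power over the whole utterance or over speech-active regions only is not stated in the abstract.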


Updated: 2021-06-11
