Improving phoneme recognition of throat microphone speech recordings using transfer learning


Speech Communication (IF 2.4), Pub Date: 2021-03-10, DOI: 10.1016/j.specom.2021.02.004
M.A. Tuğtekin Turan, Engin Erzin

Throat microphones (TM) are skin-attached non-acoustic sensors that are robust to environmental noise but have a narrower signal bandwidth than traditional close-talk microphones (CM). Attaining high-performance phoneme recognition is challenging when the training data from a degrading channel, such as TM, is limited. In this paper, we address this challenge for TM speech recordings using a transfer learning approach based on stacked denoising auto-encoders (SDA). The proposed approach defines an SDA-based domain adaptation framework that maps the source-domain CM representations and the target-domain TM representations into a common latent space, where the mismatch between TM and CM is eliminated, in order to better train an acoustic model and improve TM phoneme recognition. For the phoneme recognition task, we use a CNN/HMM hybrid system, which combines a convolutional neural network (CNN) with a hidden Markov model (HMM) and delivers better acoustic modeling performance than conventional Gaussian mixture model (GMM) based models. In the experimental evaluations, we observed a relative phoneme error rate (PER) improvement of more than 12% for the TM recordings with the proposed transfer learning approach compared to the baseline performances.
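
The abstract reports its result as a relative PER improvement, which is easy to confuse with an absolute difference in percentage points. The following minimal sketch shows how PER is commonly computed (edit distance between the reference and recognized phoneme sequences, divided by the reference length) and how a relative improvement follows from two PER values. The function names, the toy phoneme sequences, and the 40% and 35% figures are illustrative assumptions and are not taken from the paper, which may score PER with a different toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference phonemes with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline_per, new_per):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_per - new_per) / baseline_per


# Toy example (hypothetical sequences): one substitution and one deletion.
reference = ["s", "p", "iy", "ch"]
hypothesis = ["s", "b", "iy"]
per = phoneme_error_rate(reference, hypothesis)

# Hypothetical PER values, chosen only to illustrate the arithmetic.
gain = relative_improvement(0.40, 0.35)
```

With these made-up numbers, a drop from 40% to 35% PER is 5 percentage points in absolute terms but a 12.5% relative improvement, which is the kind of figure the abstract quotes.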


Updated: 2021-03-19
