Speaker-Independent Silent Speech Recognition from Flesh-Point Articulatory Movements Using an LSTM Neural Network.
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 5.4), Pub Date: 2018-10-03, DOI: 10.1109/taslp.2017.2758999
Myungjong Kim, Beiming Cao, Ted Mau, Jun Wang

Silent speech recognition (SSR) converts non-audio information, such as articulatory movements, into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models, and the high degree of variability in articulatory patterns across speakers has been a barrier to developing effective speaker-independent approaches. Speaker-independent SSR, however, is critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on the tongue and lips, using articulatory normalization methods that reduce inter-speaker variation. To minimize physiological differences of the articulators across speakers, we propose a Procrustes-matching-based articulatory normalization that removes locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression (fMLLR) and i-vectors. We adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as the articulatory model to effectively capture long-range articulatory history. A silent speech data set with flesh-point movements was collected with an electromagnetic articulograph (EMA) from twelve healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on both healthy and laryngectomized speakers. In addition, the BLSTM outperformed a standard deep neural network; the best performance was obtained by the BLSTM with all three normalization approaches combined.
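To make the Procrustes-matching step concrete, the sketch below aligns one speaker's flesh-point shape to a reference shape by removing translation, scale, and rotation differences. This is a minimal NumPy illustration of orthogonal Procrustes analysis, not the paper's exact procedure; the function name and the choice of a reference speaker shape are assumptions for illustration.

```python
import numpy as np

def procrustes_normalize(points, reference):
    """Align one speaker's flesh-point shape (an N x 2 array of EMA
    sensor coordinates) to a reference shape, removing locational,
    scaling, and rotational differences."""
    X = np.asarray(points, dtype=float)
    Y = np.asarray(reference, dtype=float)

    # Remove locational difference: center both shapes at the origin.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Remove scaling difference: normalize each shape to unit size.
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)

    # Remove rotational difference: the orthogonal matrix R that
    # minimizes ||X @ R - Y|| is U @ Vt, where U and Vt come from the
    # SVD of X.T @ Y (the orthogonal Procrustes problem).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    return X @ R
```

In a speaker-independent setting, these translation, scale, and rotation parameters would typically be estimated once per speaker (e.g., from calibration shapes) and then applied to every frame of that speaker's EMA data.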
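The articulatory model described in the abstract is a BLSTM over sequences of articulatory feature vectors. Below is a minimal PyTorch sketch of such a model producing per-frame phone logits; the class name, layer sizes, and the 40-phone output inventory are hypothetical, and the paper's actual architecture and training setup may differ.

```python
import torch
import torch.nn as nn

class ArticulatoryBLSTM(nn.Module):
    """Bidirectional LSTM mapping a sequence of articulatory feature
    vectors (e.g., normalized flesh-point coordinates, optionally
    concatenated with a speaker i-vector) to per-frame phone logits."""
    def __init__(self, input_dim, hidden_dim, num_layers, num_phones):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                             batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated.
        self.output = nn.Linear(2 * hidden_dim, num_phones)

    def forward(self, x):
        h, _ = self.blstm(x)   # (batch, time, 2 * hidden_dim)
        return self.output(h)  # (batch, time, num_phones)

# Hypothetical dimensions for illustration only.
model = ArticulatoryBLSTM(input_dim=24, hidden_dim=256,
                          num_layers=3, num_phones=40)
frames = torch.randn(1, 200, 24)  # one 200-frame utterance
logits = model(frames)
```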

Updated: 2019-11-01