Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI
arXiv - CS - Human-Computer Interaction. Pub Date: 2021-06-16, DOI: arxiv-2106.08706
Laxmi Pandey, Ahmed Sabbir Arif

Speech sounds of spoken language are produced by varying the configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanisms of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequences of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translates it into text. The proposed framework comprises spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a sentence-level phoneme error rate (PER) of 40.6%, much better than existing models. To the best of our knowledge, this is the first study to demonstrate the recognition of entire spoken sentences from an individual's articulatory motions captured in rtMRI video. We also analyzed variations in the geometry of articulation in each sub-region of the vocal tract (i.e., the pharyngeal, velar and dorsal, hard palate, and labial constriction regions) with respect to different emotions and genders. The results suggest that the distortion of each sub-region is affected by both emotion and gender.
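
To make the described pipeline concrete, below is a minimal PyTorch sketch of this kind of end-to-end model: 3D (spatiotemporal) convolutions over rtMRI video frames, a bidirectional recurrent encoder, and a CTC loss. The layer widths, kernel sizes, two-layer depth, 84x84 frame size, and 39-phone output set are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of an end-to-end rtMRI-to-text recognizer: spatiotemporal convs,
# a recurrent network, and CTC. Hyperparameters here are assumptions.
import torch
import torch.nn as nn

class RtMRIRecognizer(nn.Module):
    def __init__(self, num_phonemes: int, hidden: int = 256):
        super().__init__()
        # 3D convolutions capture articulator motion across space and time.
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # pool space, preserve the time axis
        )
        self.rnn = nn.GRU(64 * 4 * 4, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, time, height, width)
        feats = self.conv(video)                        # (B, C, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.rnn(feats)                        # (B, T, 2*hidden)
        return self.proj(out).log_softmax(-1)           # CTC expects log-probs

model = RtMRIRecognizer(num_phonemes=39)                # e.g. a TIMIT-style phone set
video = torch.randn(2, 1, 75, 84, 84)                   # 2 clips, 75 frames each
log_probs = model(video).transpose(0, 1)                # CTCLoss wants (T, B, C)
targets = torch.randint(1, 40, (2, 20))                 # dummy phone label sequences
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 75, dtype=torch.long),
    target_lengths=torch.full((2,), 20, dtype=torch.long),
)
loss.backward()
```

The CTC loss is what permits end-to-end training here: it marginalizes over all alignments between the frame-level output sequence and the (shorter, variable-length) phone transcript, so no frame-by-frame articulatory labels are required.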

Updated: 2021-06-17