Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System
IEEE Transactions on Neural Systems and Rehabilitation Engineering ( IF 4.9 ) Pub Date : 2021-04-30 , DOI: 10.1109/tnsre.2021.3076778
Seyed Reza Shahamiri

Dysarthria is a disorder that affects an individual’s speech intelligibility due to the paralysis of muscles and organs involved in the articulation process. As the condition is often associated with physically debilitating disabilities, not only do such individuals face communication problems, but interactions with digital devices can also become a burden. For these individuals, automatic speech recognition (ASR) technologies can make a significant difference in their lives, as computing and portable digital devices can become an interaction medium, enabling them to communicate with others and with computers. However, ASR technologies have performed poorly in recognizing dysarthric speech, especially severe dysarthria, due to multiple challenges facing dysarthric ASR systems. We identified that these challenges stem from the alternation and inaccuracy of dysarthric phonemes, the scarcity of dysarthric speech data, and phoneme-labeling imprecision. This paper reports on our second dysarthric-specific ASR system, called Speech Vision (SV), which tackles these challenges by adopting a novel approach towards dysarthric ASR in which speech features are extracted visually, so that SV learns to see the shape of the words pronounced by dysarthric individuals. This visual acoustic modeling feature of SV eliminates phoneme-related challenges. To address the data scarcity problem, SV adopts visual data augmentation techniques, generates synthetic dysarthric acoustic visuals, and leverages transfer learning. Benchmarked against the other state-of-the-art dysarthric ASR systems considered in this study, SV improved recognition accuracy for 67% of UA-Speech speakers, with the largest gains achieved for severe dysarthria.
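The abstract's central idea is visual acoustic modeling: rather than decoding phoneme sequences, the recognizer is shown the "shape" of an utterance as an image. A minimal sketch of that first step, turning a waveform into a log-magnitude spectrogram image, is below. It uses only NumPy and a synthetic tone as a stand-in for recorded speech; the paper's actual feature-extraction pipeline, frame sizes, and windowing are not specified here, so all parameters are illustrative assumptions.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Slice the waveform into overlapping windowed frames and take the
    log-magnitude FFT of each frame, yielding a 2-D 'acoustic visual'
    that a vision-style network could consume as an image."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mag)  # shape: (time frames, frequency bins)

# Synthetic 1-second, 16 kHz 440 Hz tone as a placeholder utterance.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
img = log_spectrogram(wave)
print(img.shape)  # (124, 129): 124 time frames x 129 frequency bins
```

Once speech is represented this way, image-domain tools apply directly, which is what makes the paper's visual data augmentation and transfer learning from vision models possible.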

Updated: 2021-05-11