End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
arXiv - CS - Robotics | Pub Date: 2020-09-22 | DOI: arxiv-2009.10283
Mohsen Jafarzadeh, Yonas Tadesse

Speech is one of the most common forms of human communication, and speech commands are an essential part of multimodal control of prosthetic hands. Over the past decades, researchers have controlled prosthetic hands with speech commands via automatic speech recognition systems, which learn to map human speech to text; natural language processing or a look-up table then maps the estimated text to a trajectory. However, the performance of conventional speech-controlled prosthetic hands remains unsatisfactory. Recent advances in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real time, so architectures of intelligent systems have rapidly shifted from the paradigm of optimizing composite subsystems to the paradigm of end-to-end optimization. In this paper, we propose an end-to-end convolutional neural network (CNN) that maps speech 2D features directly to trajectories for prosthetic hands, omitting the speech-to-text step entirely. The proposed CNN is lightweight and therefore runs in real time on an embedded GPGPU. The method can use any type of speech 2D feature that has local correlations in each dimension, such as a spectrogram, MFCC, or PNCC. The network is written in Python using the Keras library with a TensorFlow backend, and we optimized the CNN for the NVIDIA Jetson TX2 developer kit. Our experiments with this CNN demonstrate a root-mean-square error of 0.119 and a 20 ms running time to produce trajectory outputs from voice input data. To achieve a lower error in real time, a similar CNN could be optimized for a more powerful embedded GPGPU such as the NVIDIA AGX Xavier.
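The abstract describes the design only at a high level, so the following is a minimal Keras sketch of what such an end-to-end mapping could look like. The input shape (40 MFCC coefficients over 98 frames), the 5-dimensional trajectory output, and all layer sizes are illustrative assumptions, not the authors' architecture:

```python
# Sketch of an end-to-end CNN mapping a 2D speech feature to a trajectory
# vector. All shapes and layer sizes below are hypothetical placeholders;
# the paper does not specify them in this abstract.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC, N_FRAMES = 40, 98   # hypothetical 2D speech-feature shape (MFCC x frames)
TRAJECTORY_DIM = 5          # hypothetical: e.g., one target angle per finger

def build_model():
    """Lightweight CNN: 2D speech feature -> trajectory vector (regression)."""
    return models.Sequential([
        layers.Input(shape=(N_MFCC, N_FRAMES, 1)),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(TRAJECTORY_DIM),  # linear regression head, no activation
    ])

model = build_model()
# MSE training objective; RMSE metric matches the error reported above.
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Dummy batch to show the input/output contract of the network.
x = np.random.randn(8, N_MFCC, N_FRAMES, 1).astype("float32")
y_pred = model(x)  # shape: (8, TRAJECTORY_DIM)
```

The key structural point the sketch illustrates is the regression head: instead of a softmax over text tokens (as in a speech-to-text pipeline), the final dense layer emits a continuous trajectory vector directly, which is what allows the speech-to-text stage to be dropped. A network this small is also the kind that can plausibly be deployed on an embedded GPGPU such as the Jetson TX2.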

Updated: 2020-09-23