A Low-Power, High-Performance Speech Recognition Accelerator
IEEE Transactions on Computers (IF 3.6), Pub Date: 2019-12-01, DOI: 10.1109/tc.2019.2937075
Reza Yazdani, Jose-Maria Arnau, Antonio Gonzalez

Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that the tight power budgets of mobile devices cannot afford. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which is the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing energy by 123-454x.
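
For readers unfamiliar with the Viterbi search the abstract names as the bottleneck, the following is a minimal software sketch of a frame-by-frame Viterbi search with beam pruning. It is not the accelerator's design: the graph layout (`arcs`), the per-frame score dictionaries, the function name `viterbi_beam`, and the `beam` width are all assumptions chosen for illustration. It may still help show why the search is memory-bound, since every frame walks the outgoing arcs of all active states.

```python
import math


def viterbi_beam(arcs, start, frames, beam=10.0):
    """Toy Viterbi search with beam pruning (illustrative assumptions only).

    arcs:   dict mapping state -> list of (next_state, symbol, log_weight)
    frames: list of dicts mapping symbol -> acoustic log-likelihood
    Returns (best_log_score, best_state_path).
    """
    # Active tokens: state -> (accumulated log score, state path so far).
    tokens = {start: (0.0, [start])}
    for scores in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            # Each active state fetches its outgoing arcs; in a large graph
            # these lookups are scattered across memory.
            for nxt, sym, weight in arcs.get(state, []):
                cand = score + weight + scores.get(sym, -math.inf)
                # Keep only the best-scoring token per destination state.
                if cand > new_tokens.get(nxt, (-math.inf, None))[0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        if not new_tokens:
            raise ValueError("no surviving hypotheses for this frame")
        # Beam pruning: drop tokens far below the best score of the frame.
        best = max(s for s, _ in new_tokens.values())
        tokens = {st: tok for st, tok in new_tokens.items()
                  if tok[0] >= best - beam}
    return max(tokens.values(), key=lambda tok: tok[0])


# Hypothetical two-state graph and three frames of made-up scores.
ARCS = {
    0: [(0, "a", -1.0), (1, "b", -1.0)],
    1: [(1, "b", -0.5), (0, "a", -2.0)],
}
FRAMES = [
    {"a": -0.1, "b": -3.0},
    {"a": -2.0, "b": -0.2},
    {"a": -2.5, "b": -0.1},
]

if __name__ == "__main__":
    score, path = viterbi_beam(ARCS, 0, FRAMES)
    print(score, path)
```

A narrower `beam` keeps fewer active states per frame, trading search accuracy for fewer arc fetches.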

Updated: 2019-12-01