Timestamp-aligning and keyword-biasing end-to-end ASR front-end for a KWS system, EURASIP Journal on Audio, Speech, and Music Processing



EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-07-08, DOI: 10.1186/s13636-021-00212-9
Gui-Xin Shi, Wei-Qiang Zhang, Guan-Bo Wang, Jing Zhao, Shu-Zhou Chai, Ze-Yu Zhao


Many end-to-end approaches have been proposed to detect predefined keywords. In multi-keyword scenarios, two bottlenecks remain: (1) the important data that contains keywords is sparsely distributed, and (2) the timestamps of the detected keywords are inaccurate. In this paper, to alleviate the first issue and further improve the performance of the end-to-end ASR front-end, we propose a biased loss function that guides the recognizer to pay more attention to speech segments containing the predefined keywords. We address the second issue by modifying the forced alignment applied to the end-to-end ASR front-end: to obtain frame-level alignments, we use a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic model (AM) as an auxiliary aligner. The proposed system was evaluated in OpenSAT20, held by the National Institute of Standards and Technology (NIST). The performance of our end-to-end KWS system is comparable to that of the conventional hybrid KWS system, and sometimes slightly better. By fusing the results of the end-to-end and conventional KWS systems, we won first prize in the KWS track. On the dev dataset (a part of the SAFE-T corpus), the system outperforms the baseline by a large margin: our system with the GMM-HMM aligner achieves lower segmentation-aware word error rates (a relative decrease of 7.9–19.2%) and higher overall actual term-weighted values (a relative increase of 3.6–11.0%), which demonstrates the effectiveness of the proposed method. For more precise alignments, a DNN-based AM can be used as the aligner at the cost of more computation.
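The abstract does not state the form of the biased loss, so the following is only a rough sketch of the general idea, not the authors' formulation. It assumes that each utterance already has a scalar training loss, and it up-weights utterances whose reference transcript contains a predefined keyword. The names `contains_keyword`, `biased_loss`, and the `bias` weight are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def contains_keyword(transcript: Sequence[str],
                     keywords: Sequence[Sequence[str]]) -> bool:
    """Return True if any keyword (a word sequence) occurs contiguously
    in the transcript. Hypothetical helper, not from the paper."""
    words = list(transcript)
    for keyword in keywords:
        kw = list(keyword)
        n = len(kw)
        if n == 0:
            continue  # ignore empty keywords
        for start in range(len(words) - n + 1):
            if words[start:start + n] == kw:
                return True
    return False


def biased_loss(losses: Sequence[float],
                transcripts: Sequence[Sequence[str]],
                keywords: Sequence[Sequence[str]],
                bias: float = 2.0) -> float:
    """Weighted mean of per-utterance losses.

    Utterances containing a keyword get weight `bias` (assumed > 1),
    all others get weight 1. This is one simple way to make the
    optimizer attend more to keyword-bearing speech; the weighting
    actually used in the paper may differ.
    """
    if len(losses) != len(transcripts):
        raise ValueError("losses and transcripts must have the same length")
    if not losses:
        raise ValueError("at least one utterance is required")
    weights: List[float] = [
        bias if contains_keyword(t, keywords) else 1.0 for t in transcripts
    ]
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Toy batch: the second utterance contains the keyword "the doctor".
    batch_losses = [1.0, 3.0]
    batch_transcripts = [["hello", "world"], ["call", "the", "doctor"]]
    keyword_list = [["the", "doctor"]]
    print(biased_loss(batch_losses, batch_transcripts, keyword_list, bias=3.0))
```

With `bias=1.0` the function reduces to the ordinary mean loss, so the bias weight can be treated as a tunable hyperparameter that trades overall recognition accuracy against accuracy on keyword-bearing segments.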


Updated: 2021-07-09
