End-to-End Speaker-Dependent Voice Activity Detection
arXiv - CS - Sound. Pub Date: 2020-09-21, DOI: arxiv-2009.09906
Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu

Voice activity detection (VAD) is an essential pre-processing step for tasks such as automatic speech recognition (ASR) and speaker recognition. A basic goal is to remove silent segments from an audio recording, while a more general VAD system could remove all irrelevant segments, such as noise and even unwanted speech from non-target speakers. We define the task of detecting only the speech of the target speaker as speaker-dependent voice activity detection (SDVAD). This task is quite common in real applications and is usually implemented by performing speaker verification (SV) on the audio segments produced by VAD. In this paper, we propose an end-to-end neural network based approach to this problem that explicitly incorporates the speaker identity into the modeling process. Moreover, inference can be performed in an online fashion, which leads to low system latency. Experiments are carried out on a conversational telephone dataset generated from the Switchboard corpus. Results show that our proposed online approach achieves significantly better performance than the usual VAD/SV system in terms of both frame accuracy and F-score. We also use our previously proposed segment-level metric for a more comprehensive analysis.
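
The abstract reports results as frame accuracy and F-score but does not spell out how they are computed. The following is a minimal sketch of one common reading, assuming one binary label per frame (1 = speech from the target speaker, 0 = everything else) and taking the F-score to be F1 on the positive class. The function name `frame_metrics` and the label encoding are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple


def frame_metrics(reference: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (frame accuracy, F1) for binary per-frame labels.

    Assumption: 1 marks target-speaker speech, 0 marks silence, noise,
    or non-target speech. This encoding is hypothetical.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # target speech found
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false alarm
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed target speech
    correct = sum(1 for r, p in pairs if r == p)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is undefined when there are no positives at all; report 0.0 here.
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    ref = [0, 0, 1, 1, 1, 0, 0, 1]
    hyp = [0, 1, 1, 1, 0, 0, 0, 1]
    print(frame_metrics(ref, hyp))
```

Frame accuracy alone can look high when target speech is rare, which is one reason an F-score on the target class is usually reported alongside it.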

Updated: 2020-09-22