Voice activity detection in the wild via weakly supervised sound event detection

arXiv - CS - Sound. Pub Date: 2020-03-27, DOI: arxiv-2003.12222
Heinrich Dinkel, Yefei Chen, Mengyue Wu and Kai Yu

Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, but their performance degrades severely in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, so frame-level label prediction, which traditional supervised VAD training requires, is difficult. In contrast, we propose a general-purpose VAD (GPVAD) framework that can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We propose two GPVAD models: one full (GPV-F), trained on 527 Audioset sound events, and one binary (GPV-B), which only distinguishes speech from noise. We evaluate the two GPV models against a CRNN-based standard VAD model (VAD-C) under three evaluation protocols (clean, synthetic noise, real data). Results show that GPV-F is competitive with the traditional VAD-C in the clean and synthetic scenarios. Further, in real-world evaluation, GPV-F largely outperforms VAD-C on both frame-level and segment-level evaluation metrics. Despite its much lower requirement for frame-labeled data, the naive binary clip-level GPV-B model still achieves performance comparable to VAD-C in real-world scenarios.
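
To make the idea of training from clip-level labels concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a model that already emits one speech probability per frame. The pooling function (linear softmax), the 0.5 decision threshold, and all function names are illustrative assumptions that the abstract does not specify.

```python
import math
from typing import List, Tuple


def linear_softmax_pool(frame_probs: List[float]) -> float:
    """Pool frame-level probabilities into a single clip-level probability.

    Each frame is weighted by its own probability: sum(p^2) / sum(p).
    The choice of pooling function is an assumption of this sketch.
    """
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total


def clip_bce(frame_probs: List[float], clip_label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy between the pooled probability and a clip-level label.

    Only the clip label is needed, so no frame-level annotation is used.
    """
    p = min(max(linear_softmax_pool(frame_probs), eps), 1.0 - eps)
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))


def speech_segments(frame_probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose probability is >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for t, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = t  # a speech segment opens here
        elif p < threshold and start is not None:
            segments.append((start, t))  # the segment closes before frame t
            start = None
    if start is not None:
        segments.append((start, len(frame_probs)))
    return segments


# Hypothetical per-frame speech probabilities for one clip.
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
loss_if_speech = clip_bce(probs, clip_label=1)
loss_if_noise = clip_bce(probs, clip_label=0)
segments = speech_segments(probs)
```

During training, only `clip_bce` would see a label. At inference time the frame-level probabilities are thresholded directly, which is how a model trained on clip-level labels can still produce frame-level VAD decisions.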

Updated: 2020-08-18
