

Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers
arXiv - CS - Neural and Evolutionary Computing, Pub Date: 2021-07-28, DOI: arxiv-2107.13616
Piper Wolters, Chris Daw, Brian Hutchinson, Lauren Phillips

Detecting and localizing specific sound events within long, untrimmed documents has many important applications, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state of the art for these tasks. However, for some types of events, there is insufficient labeled data to train deep learning models. In this paper, we propose novel approaches to few-shot sound event detection that use region proposals and the Perceiver architecture to accurately localize sound events with very few examples of each class of interest. Motivated by a lack of suitable benchmark datasets for few-shot audio event detection, we generate and evaluate on two novel episodic rare sound event datasets: one using clips of celebrity speech as the sound event, and the other using environmental sounds. Our highest performing proposed few-shot approaches achieve F1-scores of 0.575 and 0.672, respectively, on 5-shot 5-way tasks on these two datasets. These represent absolute improvements of 0.200 and 0.234 over strong proposal-free few-shot sound event detection baselines.
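
Because the improvements are stated as absolute differences in F1-score, the baseline scores they imply can be recovered by subtraction. The sketch below works through that arithmetic using only the four numbers in the abstract. The dictionary keys and the helper name `implied_baseline` are illustrative, and it assumes each improvement is measured against the best proposal-free baseline on the same dataset.

```python
# Reported 5-shot 5-way F1-scores and absolute improvements, taken from the abstract.
# The dataset labels below are descriptive placeholders, not official dataset names.
reported = {
    "celebrity speech": {"f1": 0.575, "gain": 0.200},
    "environmental sounds": {"f1": 0.672, "gain": 0.234},
}


def implied_baseline(f1: float, gain: float) -> float:
    """Baseline F1 implied by an absolute (not relative) improvement."""
    # Round to three decimals to match the precision reported in the abstract.
    return round(f1 - gain, 3)


baselines = {
    name: implied_baseline(scores["f1"], scores["gain"])
    for name, scores in reported.items()
}

for name, baseline in baselines.items():
    print(f"{name}: implied baseline F1 = {baseline:.3f}")
```

Under that assumption, the implied baseline F1-scores are 0.375 for the speech dataset and 0.438 for the environmental sound dataset.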


Updated: 2021-07-30
