Frequency-dependent auto-pooling function for weakly supervised sound event detection, EURASIP Journal on Audio, Speech, and Music Processing

Frequency-dependent auto-pooling function for weakly supervised sound event detection
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-05-17, DOI: 10.1186/s13636-021-00206-7
Sichen Liu, Feiran Yang, Yin Cao, Jun Yang

Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events present and their temporal information. It requires estimating the onset and offset of sound events at the frame level. However, many available sound event datasets contain only audio tags without precise temporal information; such datasets are called weakly labeled. In this paper, we propose a novel source separation-based method, trained on weakly labeled data, to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate the time-frequency (T-F) mask of each sound event from a T-F representation of an audio clip. Experiments show that the DDC-block is more effective and computationally lighter than a "VGG-like" block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
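
The abstract does not give the FAP formula, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that auto-pooling means a softmax-weighted average over time with a sharpness parameter, and that "frequency-dependent" means one such parameter per class and frequency bin. The function names, the array shapes, and the final averaging across frequency are all assumptions made for this example.

```python
import numpy as np


def auto_pool(x, alpha, axis):
    """Softmax-weighted average of x along `axis`.

    alpha = 0 gives the plain mean; large alpha approaches the max.
    """
    z = alpha * x
    # Subtract the maximum before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    w = np.exp(z)
    w = w / w.sum(axis=axis, keepdims=True)
    return (w * x).sum(axis=axis)


def frequency_dependent_auto_pool(mask, alpha):
    """Hypothetical FAP-style pooling (assumed form, not from the paper).

    mask:  array of shape (classes, freq, time) with values in [0, 1]
    alpha: array of shape (classes, freq), one sharpness value per
           class and frequency bin (learnable in a real model)
    Returns an array of shape (classes,) with clip-level scores.
    """
    # Pool over time separately in each frequency bin.
    per_freq = auto_pool(mask, alpha[:, :, None], axis=2)
    # Assumption: combine the frequency bins with a plain mean.
    return per_freq.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = rng.random((3, 8, 10))   # 3 classes, 8 frequency bins, 10 frames
    alpha = np.ones((3, 8))         # placeholder values, not trained
    print(frequency_dependent_auto_pool(mask, alpha))
```

In a trained model the `alpha` values would be learned jointly with the mask estimator, which lets each frequency bin settle somewhere between mean-like and max-like pooling.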

Updated: 2021-05-17
