Multi-Scale and Single-Scale Fully Convolutional Networks for Sound Event Detection
Neurocomputing (IF 6), Pub Date: 2021-01-01, DOI: 10.1016/j.neucom.2020.09.038
Yingbin Wang, Guanghui Zhao, Kai Xiong, Guangming Shi, Yumeng Zhang

Abstract Many Sound Event Detection (SED) systems use Recurrent Neural Networks (RNNs), such as long short-term memory and gated recurrent units, to capture temporal dependencies. However, the length of the temporal dependencies an RNN can capture is limited, so it fails to model sound events of long duration. Moreover, RNNs cannot process data in parallel, which leads to low efficiency and low industrial value. Given these shortcomings, we propose to use dilated convolution (and causal dilated convolution) to capture temporal dependencies, because it preserves high time resolution and obtains longer temporal dependencies while the filter size and network depth remain unchanged. In addition, dilated convolution can be parallelized, so it offers higher efficiency and industrial value. Based on this, we propose Single-Scale Fully Convolutional Networks (SS-FCN), composed of convolutional neural networks and dilated convolutional networks, with the former providing frequency invariance and the latter capturing temporal dependencies. Using dilated convolution to control the length of temporal dependencies, we observe that SS-FCN, which models a single length of temporal dependencies, achieves superior detection performance for a limited set of event types. For better performance, we propose Multi-Scale Fully Convolutional Networks (MS-FCN), in which a feature fusion module is introduced to capture both long- and short-term dependencies by fusing features with different lengths of temporal dependencies. The proposed methods achieve competitive performance on three main datasets with higher efficiency. The results show that SED systems based on Fully Convolutional Networks have further research value and potential.
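
The abstract's point that dilated convolution reaches longer temporal dependencies with the same filter size and network depth can be illustrated with a small sketch. This is not the authors' SS-FCN or MS-FCN: the kernel size of 3, the four-layer depth, and the dilation schedule 1, 2, 4, 8 are assumptions chosen only for illustration, and the function names are hypothetical. The second function shows what "causal" means here: each output frame depends only on the current and earlier input frames.

```python
def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of a stack of stride-1
    dilated 1-D convolutions (one entry in `dilations` per layer)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv1d(x, weights, dilation):
    """Causal dilated 1-D convolution over a list of floats.

    Output frame t combines x[t], x[t - dilation], x[t - 2*dilation], ...
    Frames before the start of the signal are treated as zeros.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:          # never look into the future or before frame 0
                acc += w * x[idx]
        out.append(acc)
    return out


# Same depth (4 layers) and same filter size (3), illustrative values only.
plain = receptive_field(3, [1, 1, 1, 1])    # ordinary convolutions
dilated = receptive_field(3, [1, 2, 4, 8])  # assumed doubling dilation schedule
print(plain, dilated)
```

With these assumed settings the dilated stack covers 31 frames against 9 for the undilated one, and because no pooling or striding is involved, the output keeps one value per input frame.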

Updated: 2021-01-01