Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification
IEEE Signal Processing Letters (IF 3.9), Pub Date: 2020-01-01, DOI: 10.1109/lsp.2020.2996085
Liwen Zhang, Jiqing Han, Ziqiang Shi

Convolutional networks have achieved state-of-the-art performance on Acoustic Scene Classification (ASC). Given the log-Mel spectrogram of an audio sample, a network can extract useful semantic content within a receptive field of limited range by stacking local convolutional operations. However, the temporal relations between different receptive fields are not captured explicitly. In this letter, we propose an end-to-end 3D Convolutional Neural Network (CNN) for ASC, named SeNoT-Net, which generates effective audio representations by capturing temporal relations from semantic neighbors of different receptive fields over time. SeNoT-Net treats the log-Mel spectrogram as an ordered segment-level sequence. For each segment, a residual block produces semantic feature maps; the semantic neighbors over time (SeNoT) module is then applied to capture the relations between each feature point in the feature maps and its top-$k$ semantic neighbors. The proposed SeNoT-Net outperforms most state-of-the-art CNN models on both the DCASE 2018 and 2019 ASC datasets.
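
The abstract does not say how the top-$k$ semantic neighbors are selected or how the relations are computed. As a rough illustration of the general idea only, the sketch below assumes cosine similarity between feature points of adjacent segments and a plain mean over the selected neighbors. The function name `topk_semantic_neighbors`, the `(T, N, C)` array layout, and the restriction to the next segment are all hypothetical choices made for this example, not the authors' implementation.

```python
import numpy as np


def topk_semantic_neighbors(features, k):
    """Illustrative top-k neighbor lookup across adjacent segments.

    ASSUMPTION: this is not the SeNoT module from the paper; it only shows
    one plausible way to pick "semantic neighbors" by cosine similarity.

    features: array of shape (T, N, C) with T ordered segments, N feature
              points per segment, and C channels per feature point.
    Returns (idx, aggregated):
      idx        (T-1, N, k) indices of the k most similar points in
                 segment t+1 for every point in segment t
      aggregated (T-1, N, C) mean feature of those k neighbors
    """
    T, N, C = features.shape

    # L2-normalise each feature point so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)

    # Pairwise similarity between segment t and segment t+1: (T-1, N, N).
    sim = np.einsum("tnc,tmc->tnm", unit[:-1], unit[1:])

    # Indices of the k highest similarities for every feature point.
    idx = np.argsort(-sim, axis=-1)[..., :k]

    # Gather the neighbor features from the following segment: (T-1, N, k, C).
    nxt = features[1:]
    neighbors = nxt[np.arange(T - 1)[:, None, None], idx]

    # Summarise the neighbors with a simple mean (an assumed aggregation).
    aggregated = neighbors.mean(axis=2)
    return idx, aggregated
```

A call such as `topk_semantic_neighbors(feats, k=4)` on a `(T, N, C)` array returns, for each feature point, the indices of its nearest neighbors in the following segment together with their averaged features. A learned model would more likely replace the mean with trainable relation weights.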

Updated: 2020-01-01