Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection
arXiv - CS - Sound. Pub Date: 2021-07-20, DOI: arxiv-2107.09388
Parthasaarathy Sudarsanam, Archontis Politis, Konstantinos Drossos

Joint sound event localization and detection (SELD) is an emerging audio signal processing task that adds spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is the convolutional recurrent neural network (CRNN), in which CNNs learn high-level features from multi-channel audio input and RNNs learn temporal relationships from these high-level features. However, RNNs have drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies have used multi-head self-attention (MHSA), among other innovations, in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains; they can model long temporal dependencies and can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examine the effect of replacing the RNN blocks with self-attention layers, and we study the influence of stacking multiple self-attention blocks, of using multiple attention heads in each self-attention block, and of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (Task 3) development dataset shows a significant improvement in all employed metrics compared to the baseline CRNN that accompanies the task.
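To make the described architecture concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a CRNN-style model whose RNN block is replaced by stacked MHSA blocks with learned position embeddings and layer normalization, as the abstract outlines. All layer sizes, channel counts (7 input channels, as in FOA features with intensity vectors), the class count (12, as in DCASE 2021 Task 3), and the output head format are illustrative assumptions.

```python
# Minimal sketch, assuming a PyTorch setup: CNN front-end -> stacked MHSA
# blocks (replacing the RNN) -> per-frame SELD output. Illustrative only.
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One MHSA block: self-attention over time + residual + layer norm."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, time, dim)
        attn_out, _ = self.attn(x, x, x)      # attend across time frames
        return self.norm(x + attn_out)        # residual + layer norm

class SELDWithMHSA(nn.Module):
    def __init__(self, in_channels=7, dim=128, num_heads=8,
                 num_blocks=2, max_frames=600, num_classes=12):
        super().__init__()
        # CNN front-end: learns high-level features from multi-channel input.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                 # pool frequency only
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),      # collapse frequency axis
        )
        # Learned position embedding over time frames (one of the design
        # choices the paper studies; its exact form is an assumption here).
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, dim))
        self.blocks = nn.ModuleList(
            [SelfAttentionBlock(dim, num_heads) for _ in range(num_blocks)])
        # Hypothetical output head: 3 Cartesian DOA values per class.
        self.head = nn.Linear(dim, 3 * num_classes)

    def forward(self, x):                     # x: (batch, ch, time, freq)
        h = self.cnn(x).squeeze(-1).transpose(1, 2)  # (batch, time, dim)
        h = h + self.pos_emb[:, : h.size(1)]
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)                   # per-frame SELD output

# Usage example: a batch of 7-channel, 600-frame, 256-bin input features.
model = SELDWithMHSA()
out = model(torch.randn(2, 7, 600, 256))
print(out.shape)                              # torch.Size([2, 600, 36])
```

Unlike an RNN, the MHSA blocks process all time frames in parallel and can relate any pair of frames directly, at the cost of attention that scales quadratically with sequence length; this reflects the trade-off the abstract mentions, though the specific block layout here is only a sketch.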

Updated: 2021-07-21