Audio Captioning Transformer
arXiv - CS - Sound. Pub Date: 2021-07-21. arXiv: 2107.09817
Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free. The proposed method is better able to model global information within an audio signal and to capture temporal relationships between audio events. We evaluate our model on AudioCaps, the largest publicly available audio captioning dataset. Our model shows competitive performance compared to other state-of-the-art approaches.
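The abstract describes a convolution-free encoder-decoder Transformer that encodes an audio signal and autoregressively decodes a caption. Below is a minimal PyTorch sketch of that general architecture, assuming log-mel spectrogram frames are linearly embedded as encoder tokens; the class name, dimensions, and hyperparameters are illustrative assumptions, not the authors' exact ACT configuration.

```python
import torch
import torch.nn as nn

class ACTSketch(nn.Module):
    """Hypothetical sketch of a convolution-free audio captioning Transformer."""

    def __init__(self, n_mels=64, d_model=256, nhead=4,
                 num_layers=4, vocab_size=5000, max_tokens=2048):
        super().__init__()
        # Convolution-free front end: each spectrogram frame is a token,
        # embedded with a single linear projection (an assumption; the paper
        # may use a different tokenisation of the audio signal).
        self.frame_embed = nn.Linear(n_mels, d_model)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) caption word ids.
        src = self.frame_embed(mel) + self.pos_embed[:, :mel.size(1)]
        tgt = self.word_embed(tokens) + self.pos_embed[:, :tokens.size(1)]
        # Causal mask: each position attends only to earlier caption words,
        # while encoder self-attention sees the whole clip, giving the
        # global view over time frames that the abstract emphasises.
        mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(hidden)  # (batch, seq, vocab_size) logits

model = ACTSketch()
mel = torch.randn(2, 500, 64)             # two ~5 s clips at 100 frames/s
tokens = torch.randint(0, 5000, (2, 20))  # shifted-right caption tokens
print(model(mel, tokens).shape)           # torch.Size([2, 20, 5000])
```

At inference time one would decode greedily or with beam search, feeding each predicted word back into the decoder; the learned positional embedding above stands in for whatever positional encoding the paper actually uses.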

Updated: 2021-07-22