Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
arXiv - CS - Sound | Pub Date: 2020-07-06 | DOI: arxiv-2007.02676
Khoa Nguyen, Konstantinos Drossos, and Tuomas Virtanen

Audio captioning is the task of automatically creating a textual description of the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the textual description is considerably shorter than the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that explicitly takes advantage of this difference in sequence lengths, by applying temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method whose encoder outputs a fixed-length vector, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach on the freely available dataset Clotho, and we evaluate the impact of different factors of temporal sub-sampling. Our results show improvements on all considered metrics.
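The abstract only states that sub-sampling is applied between the encoder RNNs; the sketch below is one plausible reading of that idea, not the authors' code. It assumes a PyTorch encoder of stacked GRUs and a hypothetical sub-sampling factor `hop`: after each RNN layer (except the last), every `hop`-th time step is kept, so the sequence handed to the next layer, and eventually summarized into the fixed-length encoder output, is progressively shorter.

```python
# Minimal sketch of temporal sub-sampling between stacked encoder RNNs.
# Assumptions (not from the paper): GRU layers, sub-sampling by striding
# with factor `hop`, and the last time step used as the fixed-length output.
import torch
import torch.nn as nn


class SubsamplingEncoder(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_layers: int = 3, hop: int = 2):
        super().__init__()
        self.hop = hop
        dims = [in_dim] + [hidden_dim] * num_layers
        self.rnns = nn.ModuleList(
            nn.GRU(dims[i], dims[i + 1], batch_first=True) for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features), e.g. a log mel-band energy sequence.
        for i, rnn in enumerate(self.rnns):
            x, _ = rnn(x)
            if i < len(self.rnns) - 1:       # sub-sample between RNNs only
                x = x[:, ::self.hop, :]      # keep every `hop`-th time step
        # Fixed-length vector handed to the decoder: last time step of the top RNN.
        return x[:, -1, :]


if __name__ == "__main__":
    enc = SubsamplingEncoder(in_dim=64, hidden_dim=256, num_layers=3, hop=2)
    feats = torch.randn(4, 2000, 64)         # ~2000 feature vectors per clip
    z = enc(feats)
    print(z.shape)                           # torch.Size([4, 256])
```

With `hop=2` and three layers, a 2000-step input is reduced to roughly 500 steps before the final summary vector, which is the kind of length reduction the abstract motivates (thousands of feature vectors mapping to a caption of about 10 words).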

Updated: 2020-07-08