CL4AC: A Contrastive Loss for Audio Captioning
arXiv - CS - Sound. Pub Date: 2021-07-21, DOI: arXiv:2107.09990
Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

Automated audio captioning (AAC) is a cross-modal translation task that aims to describe the content of an audio clip in natural language. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder is then used to generate the caption. However, training an AAC system often encounters the problem of data scarcity, which may lead to inaccurate latent representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which improves the quality of the latent representation and the audio-text alignment even when training with limited data. Experiments performed on the Clotho dataset demonstrate the effectiveness of the proposed approach.
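The core idea of contrasting matched audio-text pairs against mismatched ones drawn from the same batch can be sketched with a generic InfoNCE-style objective. The PyTorch snippet below is only an illustrative sketch, not the exact CL4AC loss defined in the paper; the function name, the temperature parameter, and the symmetric audio-to-text / text-to-audio formulation are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Illustrative InfoNCE-style contrastive loss over a batch of paired
    audio/text embeddings: matched pairs (the diagonal of the similarity
    matrix) are pulled together, while mismatched pairs within the batch
    act as negatives. Not the paper's exact objective.

    audio_emb, text_emb: (batch, dim) latent representations produced by
    the audio encoder and the text side of the model, respectively.
    """
    # L2-normalise so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares
    # audio clip i with caption j.
    logits = audio_emb @ text_emb.t() / temperature

    # The i-th audio clip should match the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both matching directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```

In a full captioning system, a term of this kind would typically be added to the standard caption cross-entropy loss with a weighting coefficient, so that the alignment signal regularizes the encoder-decoder training rather than replacing the generation objective.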

Updated: 2021-07-22