Query-graph with Cross-gating Attention Model for Text-to-Audio Grounding
arXiv - CS - Multimedia Pub Date : 2021-06-27 , DOI: arxiv-2106.14136
Haoyu Tang, Jihua Zhu, Qinghai Zheng, Zhiyong Cheng

In this paper, we address the text-to-audio grounding task: locating, in an untrimmed audio recording, the segments of the sound event described by a natural language query. This is a newly proposed but challenging audio-language task, since it requires not only precisely localizing all the onsets and offsets of the desired segments in the audio, but also performing comprehensive acoustic and linguistic understanding and reasoning about the multimodal interactions between the audio and the query. The existing method treats the query holistically as a single unit via a global query representation, which fails to highlight the keywords that carry rich semantics. It also does not fully exploit the interactions between the query and the audio. Moreover, since the audio and the queries are of arbitrary, variable length, their many meaningless parts are not filtered out by this method, which hinders grounding of the desired segments. To this end, we propose a novel Query Graph with Cross-gating Attention (QGCA) model, which models the comprehensive relations between the words in the query through a novel query graph. To capture the fine-grained interactions between audio and query, a cross-modal attention module that assigns higher weights to the keywords is introduced to generate snippet-specific query representations. Finally, we also design a cross-gating module to emphasize the crucial parts of the audio and the query and to weaken the irrelevant ones. We extensively evaluate the proposed QGCA model on the public Audiogrounding dataset, achieving significant improvements over several state-of-the-art methods. Further ablation studies show the consistent effectiveness of the different modules in the proposed QGCA model.
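The two interaction mechanisms the abstract describes — cross-modal attention that yields a snippet-specific query representation, and cross-gating that lets each modality suppress the other's irrelevant parts — can be sketched generically. The shapes, the dot-product scoring, the mean pooling, and the elementwise sigmoid gate below are illustrative assumptions for exposition, not the authors' exact QGCA architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def snippet_specific_query(audio, query):
    """For each audio snippet, attend over the query words and pool them.

    audio: (T, d) snippet embeddings; query: (N, d) word embeddings.
    Returns (T, d): one query representation per snippet, so words that
    match a given snippet (the "keywords") receive higher attention weight.
    """
    scores = audio @ query.T            # (T, N) dot-product similarity
    weights = softmax(scores, axis=-1)  # attention distribution over words
    return weights @ query              # (T, d) weighted word pooling

def cross_gate(a, b):
    """Gate features `a` elementwise by a sigmoid of paired features `b`
    (same shape), so one stream can emphasize the other's crucial parts
    and weaken the irrelevant ones."""
    return a * sigmoid(b)

# Toy example: 8 audio snippets, a 5-word query, 16-dim embeddings.
T, N, d = 8, 5, 16
audio = rng.standard_normal((T, d))
query = rng.standard_normal((N, d))

q_per_snippet = snippet_specific_query(audio, query)      # (T, d)
gated_audio = cross_gate(audio, q_per_snippet)            # audio gated by query
gated_query = cross_gate(q_per_snippet, audio)            # query gated by audio
```

In a real model, the gate would be a learned projection of the other modality rather than a raw sigmoid, and the gated features would feed a segment-boundary predictor over the T snippets; this sketch only shows the data flow.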

Updated: 2021-06-29