Fast-Slow Transformer for Visually Grounding Speech
arXiv - CS - Computation and Language Pub Date : 2021-09-16 , DOI: arxiv-2109.08186
Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
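The two-stage idea described above — a dual encoder that embeds each modality independently for fast dot-product retrieval, followed by a cross-attention model that re-ranks only the shortlisted candidates — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins, and `slow_score` is a toy bilinear scorer standing in for the full cross-modal Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in precomputed embeddings for the "fast" dual-encoder branch:
# speech and images are encoded independently, so the image index can be
# built offline and retrieval reduces to a dot product.
num_images, dim = 1000, 64
image_emb = rng.standard_normal((num_images, dim))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

speech_emb = rng.standard_normal(dim)
speech_emb /= np.linalg.norm(speech_emb)

# Fast stage: cosine similarity against the whole index, keep the top-k.
k = 32
fast_scores = image_emb @ speech_emb
topk = np.argsort(fast_scores)[-k:][::-1]

# Slow stage: re-rank only the k candidates with a joint scorer. Here a
# toy bilinear form stands in for the cross-attention Transformer, which
# is too expensive to run over the entire collection.
W = rng.standard_normal((dim, dim)) * 0.01

def slow_score(img_vec, sp_vec):
    return float(img_vec @ W @ sp_vec)

reranked = sorted(topk, key=lambda i: slow_score(image_emb[i], speech_emb),
                  reverse=True)
print(reranked[:5])
```

The design point is that the slow stage's cost scales with `k`, not with the collection size, which is how the model keeps dual-encoder retrieval speed while recovering cross-attention accuracy on the shortlist.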

Updated: 2021-09-20