Fast-Slow Transformer for Visually Grounding Speech
arXiv - CS - Computation and Language. Pub Date: 2021-09-16. DOI: arXiv:2109.08186. Authors: Puyuan Peng, David Harwath
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
FaST-VGS is a Transformer-based model for learning the associations between raw
speech waveforms and visual images. The model unifies dual-encoder and
cross-attention architectures into a single model, reaping the superior
retrieval speed of the former along with the accuracy of the latter. FaST-VGS
achieves state-of-the-art speech-image retrieval accuracy on benchmark
datasets, and its learned representations exhibit strong performance on the
ZeroSpeech 2021 phonetic and semantic tasks.
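The coarse-to-fine retrieval scheme the abstract describes (a fast dual-encoder stage that prunes the gallery, followed by a slow cross-attention stage that re-ranks the survivors) can be sketched as a toy in numpy. This is an illustrative sketch only, not the paper's actual architecture: the function names, pooling choice, and scoring rule are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    # Normalize along the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fast_scores(speech_emb, image_emb):
    # Dual-encoder ("fast") stage: one pooled vector per item, so scoring
    # the whole gallery is a single matrix multiply.
    return l2norm(speech_emb) @ l2norm(image_emb).T

def slow_score(speech_tokens, image_tokens):
    # Cross-attention ("slow") stage: speech tokens attend over one image's
    # tokens; the re-rank score is the mean cosine similarity between each
    # speech token and its attended image feature.
    d = speech_tokens.shape[-1]
    logits = speech_tokens @ image_tokens.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)        # softmax, stably
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    attended = attn @ image_tokens                      # (T_speech, d)
    return float((l2norm(speech_tokens) * l2norm(attended)).sum(axis=-1).mean())

def retrieve(speech_emb, image_emb, speech_tok, image_tok, k=2):
    # Fast stage prunes the gallery to k candidates; the slow stage only
    # re-ranks those k, keeping the dual encoder's retrieval speed while
    # using cross-attention accuracy for the final decision.
    coarse = fast_scores(speech_emb, image_emb)[0]
    candidates = np.argsort(coarse)[::-1][:k]
    fine = [slow_score(speech_tok, image_tok[i]) for i in candidates]
    return int(candidates[int(np.argmax(fine))])

# Toy gallery: 5 images, each represented by 7 token features of dim 16.
d, n_images, n_tokens = 16, 5, 7
image_tok = rng.normal(size=(n_images, n_tokens, d))
image_emb = image_tok.mean(axis=1)                      # pooled vectors, fast stage
# The query "utterance" is a noisy copy of image 3's tokens.
speech_tok = image_tok[3] + 0.05 * rng.normal(size=(n_tokens, d))
speech_emb = speech_tok.mean(axis=0, keepdims=True)

print(retrieve(speech_emb, image_emb, speech_tok, image_tok))  # → 3
```

In the real model both stages share Transformer trunks and are trained jointly; the point of the sketch is only the two-pass scoring pattern, where the expensive pairwise scorer touches a small candidate set rather than the full gallery.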
Updated: 2021-09-20