Fast-Slow Transformer for Visually Grounding Speech
arXiv - CS - Computation and Language. Pub Date: 2021-09-16. DOI: arXiv:2109.08186. Authors: Puyuan Peng, David Harwath
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
FaST-VGS is a Transformer-based model for learning the associations between raw
speech waveforms and visual images. The model unifies dual-encoder and
cross-attention architectures into a single model, reaping the superior
retrieval speed of the former along with the accuracy of the latter. FaST-VGS
achieves state-of-the-art speech-image retrieval accuracy on benchmark
datasets, and its learned representations exhibit strong performance on the
ZeroSpeech 2021 phonetic and semantic tasks.
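The coarse-to-fine retrieval scheme the abstract describes (a fast dual-encoder stage that prunes the gallery, followed by a slow cross-attention stage that re-ranks the survivors) can be sketched as a toy in numpy. This is an illustrative sketch only, not the paper's actual architecture: the function names, pooling choice, and scoring rule are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    # Normalize along the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fast_scores(speech_emb, image_emb):
    # Dual-encoder ("fast") stage: one pooled vector per item, so scoring
    # the whole gallery is a single matrix multiply.
    return l2norm(speech_emb) @ l2norm(image_emb).T

def slow_score(speech_tokens, image_tokens):
    # Cross-attention ("slow") stage: speech tokens attend over one image's
    # tokens; the re-rank score is the mean cosine similarity between each
    # speech token and its attended image feature.
    d = speech_tokens.shape[-1]
    logits = speech_tokens @ image_tokens.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)        # softmax, stably
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    attended = attn @ image_tokens                      # (T_speech, d)
    return float((l2norm(speech_tokens) * l2norm(attended)).sum(axis=-1).mean())

def retrieve(speech_emb, image_emb, speech_tok, image_tok, k=2):
    # Fast stage prunes the gallery to k candidates; the slow stage only
    # re-ranks those k, keeping the dual encoder's retrieval speed while
    # using cross-attention accuracy for the final decision.
    coarse = fast_scores(speech_emb, image_emb)[0]
    candidates = np.argsort(coarse)[::-1][:k]
    fine = [slow_score(speech_tok, image_tok[i]) for i in candidates]
    return int(candidates[int(np.argmax(fine))])

# Toy gallery: 5 images, each represented by 7 token features of dim 16.
d, n_images, n_tokens = 16, 5, 7
image_tok = rng.normal(size=(n_images, n_tokens, d))
image_emb = image_tok.mean(axis=1)                      # pooled vectors, fast stage
# The query "utterance" is a noisy copy of image 3's tokens.
speech_tok = image_tok[3] + 0.05 * rng.normal(size=(n_tokens, d))
speech_emb = speech_tok.mean(axis=0, keepdims=True)

print(retrieve(speech_emb, image_emb, speech_tok, image_tok))  # → 3
```

In the real model both stages share Transformer trunks and are trained jointly; the point of the sketch is only the two-pass scoring pattern, where the expensive pairwise scorer touches a small candidate set rather than the full gallery.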
Updated: 2021-09-20