Direct Speech-to-image Translation
arXiv - CS - Multimedia Pub Date : 2020-04-07 , DOI: arxiv-2004.03413
Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao

Direct speech-to-image translation without text is an interesting and useful topic because of its potential applications in human-computer interaction, art creation, computer-aided design, and so on, not to mention that many languages have no written form. However, as far as we know, how to translate speech signals into images directly, and how well they can be translated, has not been well studied. In this paper, we attempt to translate speech signals into image signals without a transcription stage. Specifically, a speech encoder is designed to represent the input speech signal as an embedding feature, and it is trained against a pretrained image encoder using teacher-student learning to obtain better generalization to new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that our proposed method is effective at translating raw speech signals into images without an intermediate text representation. An ablation study gives more insight into our method.
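
The teacher-student step described in the abstract can be illustrated with a toy sketch. The abstract does not state the loss function, the encoder architectures, or the feature dimensions, so everything below is an assumption made for illustration: the student is a plain linear map, the objective is a mean squared distance to the teacher's embedding, and the "teacher embeddings" are synthetic vectors standing in for the output of a frozen, pretrained image encoder. The stacked GAN stage is not covered.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def distill_loss(W, pairs):
    """Mean squared distance between student and teacher embeddings."""
    total = 0.0
    for speech, target in pairs:
        pred = matvec(W, speech)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)


def init_student(emb_dim, in_dim, seed=0):
    """Small random weights for the (hypothetical) linear speech encoder."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(emb_dim)]


def train_student(W, pairs, lr=0.2, steps=300):
    """Plain gradient descent on distill_loss; the teacher is never updated."""
    emb_dim, in_dim = len(W), len(W[0])
    W = [row[:] for row in W]  # work on a copy
    for _ in range(steps):
        grad = [[0.0] * in_dim for _ in range(emb_dim)]
        for speech, target in pairs:
            pred = matvec(W, speech)
            for i in range(emb_dim):
                err = 2.0 * (pred[i] - target[i]) / len(pairs)
                for j in range(in_dim):
                    grad[i][j] += err * speech[j]
        for i in range(emb_dim):
            for j in range(in_dim):
                W[i][j] -= lr * grad[i][j]
    return W


def make_toy_pairs(n=20, in_dim=4, emb_dim=3, seed=1):
    """Synthetic (speech feature, teacher embedding) pairs.

    A hidden linear map stands in for the frozen image encoder applied
    to the image paired with each utterance (an assumption, not real data).
    """
    rng = random.Random(seed)
    A = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(emb_dim)]
    pairs = []
    for _ in range(n):
        speech = [rng.uniform(-1, 1) for _ in range(in_dim)]
        pairs.append((speech, matvec(A, speech)))
    return pairs


if __name__ == "__main__":
    pairs = make_toy_pairs()
    W0 = init_student(emb_dim=3, in_dim=4)
    W1 = train_student(W0, pairs)
    print("loss before:", distill_loss(W0, pairs))
    print("loss after: ", distill_loss(W1, pairs))
```

In this sketch only the student's weights change, which is the defining property of teacher-student training: the speech embedding is pulled toward a fixed image embedding space, so a generator conditioned on that space could in principle accept either kind of embedding.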

Updated: 2020-07-15