Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity
arXiv - CS - Graphics. Pub Date: 2020-09-04, DOI: arxiv-2009.02119
Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human--agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match with speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
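
The abstract describes the architecture only at a high level: text, audio, and speaker-identity features form a trimodal context, a style-embedding space learned from many speakers controls gesture style, and an adversarial scheme encourages human-like motion. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the released implementation; all module names, dimensions, and alignment assumptions (e.g. TrimodalGestureGenerator, pose_dim, pre-aligned frames) are illustrative, and the repository linked above is authoritative.

```python
# Hypothetical sketch: fuse per-frame text, audio, and speaker-style features,
# decode a pose sequence, and score it with a small sequence discriminator
# standing in for the adversarial training scheme mentioned in the abstract.
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    def __init__(self, vocab_size, n_speakers, text_dim=128, audio_dim=128,
                 style_dim=16, pose_dim=27, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True)
        self.audio_encoder = nn.GRU(audio_dim, audio_dim, batch_first=True)
        # Speaker identity -> point in a learned style-embedding space; using a
        # different speaker id at inference time changes the gesture style
        # produced for the same speech.
        self.style_emb = nn.Embedding(n_speakers, style_dim)
        self.decoder = nn.GRU(text_dim + audio_dim + style_dim, hidden,
                              batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, word_ids, audio_feats, speaker_ids):
        # word_ids: (B, T)  audio_feats: (B, T, audio_dim)  speaker_ids: (B,)
        # For simplicity, text and audio are assumed pre-aligned to T frames.
        text_h, _ = self.text_encoder(self.word_emb(word_ids))
        audio_h, _ = self.audio_encoder(audio_feats)
        style = self.style_emb(speaker_ids).unsqueeze(1)       # (B, 1, style_dim)
        style = style.expand(-1, word_ids.size(1), -1)         # (B, T, style_dim)
        fused = torch.cat([text_h, audio_h, style], dim=-1)    # trimodal context
        out, _ = self.decoder(fused)
        return self.to_pose(out)                               # (B, T, pose_dim)

class PoseSequenceDiscriminator(nn.Module):
    """Scores whether a pose sequence looks like real human motion."""
    def __init__(self, pose_dim=27, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, poses):                 # poses: (B, T, pose_dim)
        _, h = self.rnn(poses)
        return self.out(h[-1])                # one real/fake logit per sequence
```

Training such a sketch would combine a reconstruction loss against reference motion with the discriminator's adversarial loss, in line with the abstract's description; the exact losses, feature extractors, and the evaluation metric are defined in the released repository and may differ from these assumptions.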

Updated: 2020-09-07