Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
Computer Graphics Forum (IF 2.7), Pub Date: 2020-05-01, DOI: 10.1111/cgf.13946
Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for state-of-the-art, realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just as in human motion, this yields rich and natural variation. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well matched to the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
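To make the core mechanism concrete, below is a minimal sketch (not the authors' released code) of a conditional normalising flow of the kind the abstract describes: an invertible network maps Gaussian noise to gesture poses, conditioned on speech features concatenated with a style-control vector. All class names, dimensions and the single-frame setup are illustrative assumptions; MoGlow itself additionally conditions autoregressively on pose history via an LSTM.

```python
# Hypothetical sketch of a conditional affine-coupling flow (Glow-style),
# illustrating how sampling yields varied gestures and how a style value
# enters as extra conditioning. Names and sizes are assumptions.
import math
import torch
import torch.nn as nn


class ConditionalAffineCoupling(nn.Module):
    """Affine coupling layer: the second half of the pose vector is
    scaled/shifted by a network that sees the first half plus the
    conditioning (speech features + style controls)."""

    def __init__(self, pose_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        out = pose_dim - self.half
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out),
        )

    def forward(self, x, cond):  # x -> z; also returns log|det J|
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)  # keep scales numerically tame
        return torch.cat([xa, xb * log_s.exp() + t], dim=-1), log_s.sum(-1)

    def inverse(self, z, cond):  # z -> x; used when sampling gestures
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([za, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([za, (zb - t) * (-log_s).exp()], dim=-1)


class GestureFlow(nn.Module):
    """Stack of coupling layers with feature flips in between so that
    every pose dimension is eventually transformed."""

    def __init__(self, pose_dim: int = 45, cond_dim: int = 32, n_layers: int = 8):
        super().__init__()
        self.pose_dim = pose_dim
        self.layers = nn.ModuleList(
            ConditionalAffineCoupling(pose_dim, cond_dim) for _ in range(n_layers)
        )

    def log_prob(self, x, cond):  # maximum-likelihood training objective
        log_det = x.new_zeros(x.shape[0])
        for layer in self.layers:
            x, ld = layer(x, cond)
            x = x.flip(-1)  # self-inverse permutation between layers
            log_det = log_det + ld
        base = -0.5 * (x.pow(2).sum(-1) + self.pose_dim * math.log(2 * math.pi))
        return base + log_det

    @torch.no_grad()
    def sample(self, cond):
        """One plausible pose per conditioning row: push Gaussian noise
        backwards through the inverted flow."""
        z = torch.randn(cond.shape[0], self.pose_dim)
        for layer in reversed(self.layers):
            z = z.flip(-1)  # undo the flip applied after this layer
            z = layer.inverse(z, cond)
        return z


# Same speech, different noise draws -> different but plausible gestures.
# The style scalar (e.g. a desired spatial extent) is simply appended to
# the conditioning, so no style labels are needed during training.
flow = GestureFlow()
speech = torch.randn(1, 31)        # hypothetical per-frame speech features
style = torch.tensor([[0.8]])      # hypothetical style-control value
cond = torch.cat([speech, style], dim=-1).repeat(4, 1)
poses = flow.sample(cond)          # four distinct pose samples
print(poses.shape)                 # torch.Size([4, 45])
```

Because the flow is exactly invertible, the same network trained by maximum likelihood (`log_prob`) is reused for sampling; drawing several noise vectors for one speech input is what produces the "battery of different, yet plausible, gestures" described above.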
