Moving Fast and Slow: Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation, International Journal of Human-Computer Interaction


International Journal of Human-Computer Interaction (IF 3.4), Pub Date: 2021-02-17, DOI: 10.1080/10447318.2021.1883883
Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, Hedvig Kjellström

ABSTRACT

This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
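The two post-processing steps named in the abstract, hip-centering and smoothing, together with the motion-speed measure, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the smoothing filter, the skeleton layout, or the frame rate, so the moving-average filter, the hip joint index `HIP_JOINT`, the `window` size, and the `fps` value below are all assumptions chosen for illustration. Motion is assumed to be an array of shape (frames, joints, 3).

```python
import numpy as np

HIP_JOINT = 0  # assumption: index of the hip joint in the skeleton


def hip_center(motion, hip_joint=HIP_JOINT):
    """Express every joint relative to the hip position in the same frame."""
    return motion - motion[:, hip_joint:hip_joint + 1, :]


def smooth(motion, window=5):
    """Moving average over time (assumed filter; the abstract names none).

    The sequence is edge-padded so the number of frames is preserved.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    padded = np.pad(motion, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # Filter each coordinate of each joint independently along the time axis.
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode="valid"), 0, padded
    )


def joint_speeds(motion, fps=20.0):
    """Speed of each joint between consecutive frames (fps is hypothetical)."""
    velocity = np.diff(motion, axis=0) * fps
    return np.linalg.norm(velocity, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.normal(size=(100, 8, 3))  # synthetic stand-in for generated motion
    processed = smooth(hip_center(raw), window=5)
    # Histogram of speeds, usable for comparing against a reference distribution.
    hist, edges = np.histogram(joint_speeds(processed), bins=10)
    print(processed.shape, hist.sum())
```

Comparing speed histograms of generated and recorded motion is one way to check how well a method matches the motion-speed distribution; the bin count here is arbitrary.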


Updated: 2021-02-17
