Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform
Language Resources and Evaluation ( IF 1.7 ) Pub Date : 2020-07-26 , DOI: 10.1007/s10579-020-09500-w
Sara Dahmani , Vincent Colotte , Slim Ouni

In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. The setup is mainly aimed at recording an expressive audiovisual corpus in the context of audiovisual speech synthesis. For speech recording, standard optical motion-capture systems fail to track the articulators finely, especially in the inner-mouth region, because certain markers disappear during articulation. Moreover, some systems have limited frame rates and are not suitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of different tracking systems. Within the scope of this work, we recorded a prototypical corpus for a single subject using our combined system. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a large corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech and the second focuses on the perception aspect (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context, and it showed that the expressive content of the recorded data is globally in line with what is commonly reported in the literature. The perceptual evaluation, conducted as a human emotion recognition task with different types of stimuli, confirmed that the recorded emotions were well perceived.
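Combining tracking systems with different frame rates, as the abstract describes, requires aligning their streams onto a common timeline. The following is a minimal illustrative sketch (not the authors' code, and the stream shapes are assumptions) of resampling a lower-rate marker trajectory onto the timeline of a higher-rate reference stream via linear interpolation:

```python
# Hypothetical sketch: aligning two motion-capture streams with different
# frame rates by linearly interpolating the low-rate stream onto the
# high-rate timeline. Assumes each stream is a pair of arrays
# (timestamps, values) with monotonically increasing timestamps.
import numpy as np

def resample_stream(src_times, src_values, target_times):
    """Interpolate one marker coordinate onto the target timestamps."""
    return np.interp(target_times, src_times, src_values)

# Example: a 25 fps stream resampled onto a 100 fps reference timeline.
times_25 = np.arange(0.0, 1.0, 1 / 25)     # low-rate capture timestamps (s)
values_25 = np.sin(2 * np.pi * times_25)   # one marker coordinate over time
times_100 = np.arange(0.0, 1.0, 1 / 100)   # high-rate reference timeline (s)
aligned = resample_stream(times_25, values_25, times_100)
```

After alignment, all streams share one sample grid, so per-frame fusion of the heterogeneous trackers becomes straightforward; real systems would also need clock synchronization between devices, which this sketch does not address.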




Updated: 2020-07-26