Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-09-23, DOI: arxiv-2009.11204
Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo and Radu Horaud

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
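The abstract describes the annotation pipeline only at a high level: audio VAD provides speak/silence labels, and face detection plus tracking localizes the visible person. The sketch below illustrates one plausible reading of that idea; the helpers run_audio_vad() and detect_and_track_faces() are hypothetical placeholders, not the paper's released code, and the single-face heuristic for positive samples is an assumption.

```python
# Hedged sketch of automatic V-VAD annotation: combine audio VAD with
# face detection/tracking, as the abstract describes. Helper functions
# are placeholders (assumptions), not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FaceTrack:
    start: float  # track start time (seconds)
    end: float    # track end time (seconds)
    boxes: list   # per-frame bounding boxes


def run_audio_vad(video_path: str) -> List[Tuple[float, float, bool]]:
    """Placeholder A-VAD: returns (start, end, is_speech) segments."""
    raise NotImplementedError


def detect_and_track_faces(video_path: str) -> List[FaceTrack]:
    """Placeholder face detector + tracker: returns face tracks."""
    raise NotImplementedError


def auto_annotate(video_path: str) -> List[Tuple[FaceTrack, str]]:
    """Label face tracks from audio alone, with no manual annotation.

    Assumed heuristic: a track that spans a speech segment and is the
    only face on screen is taken to be the speaker (positive sample);
    tracks spanning silent segments become negatives.
    """
    segments = run_audio_vad(video_path)
    tracks = detect_and_track_faces(video_path)
    samples = []
    for seg_start, seg_end, is_speech in segments:
        covering = [t for t in tracks
                    if t.start <= seg_start and t.end >= seg_end]
        if is_speech and len(covering) == 1:
            samples.append((covering[0], "speaking"))
        elif not is_speech:
            samples.extend((t, "not_speaking") for t in covering)
    return samples
```

Applied at scale to in-the-wild videos, a pipeline of this shape could produce the large, content-diverse labeled dataset (WildVVAD) that the abstract argues existing V-VAD benchmarks lack.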

Updated: 2020-10-19