Spot the conversation: speaker diarisation in the wild
arXiv - CS - Sound. Pub Date: 2020-07-02, DOI: arXiv:2007.01216
Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset contains overlapping speech, a large and diverse speaker pool, and challenging background conditions.
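The verification step with self-enrolled speaker models can be pictured, purely illustratively, as a greedy assignment: each speech segment's embedding is compared against the running models of speakers seen so far, and is either matched to the closest model or enrolled as a new speaker. The sketch below is not the paper's implementation; the threshold value, the cosine-similarity scoring, and the running-mean speaker model are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diarise(segment_embeddings, threshold=0.7):
    """Greedy diarisation sketch: assign each segment to the most similar
    self-enrolled speaker model, or enrol a new speaker if none matches.
    Returns one integer speaker label per segment."""
    centroids = []  # one running-mean embedding per enrolled speaker
    counts = []     # number of segments behind each centroid
    labels = []
    for emb in segment_embeddings:
        if centroids:
            sims = [cosine(emb, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
        else:
            sims, best = [], -1
        if centroids and sims[best] >= threshold:
            # Match: fold the segment into that speaker's running mean.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
        else:
            # No match: self-enrol a new speaker model from this segment.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

With toy 2-D embeddings, two nearby segments share a label while an orthogonal one enrols a new speaker: `diarise([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])` returns `[0, 0, 1]`.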

Updated: 2020-11-05