End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings


arXiv - CS - Sound. Pub Date: 2021-05-05, DOI: arxiv-2105.02096
Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey

We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
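The permutation-invariant cross-entropy idea mentioned in the abstract can be illustrated with a minimal sketch. This is a generic fixed-speaker-count version that searches all speaker orderings by brute force; it is not the authors' variable-number loss, whose details are not given here. The function names `binary_cross_entropy` and `permutation_invariant_bce` and the list-of-lists data layout are assumptions made for illustration only.

```python
import itertools
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over the frames of one activity track."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def permutation_invariant_bce(pred, ref):
    """Return (loss, best_perm), minimising BCE over speaker orderings.

    pred, ref: lists of per-speaker activity tracks (hypothetical layout),
    each track a list with one value per frame.
    best_perm[i] is the reference speaker matched to output track i.
    """
    n = len(pred)
    best_loss, best_perm = float("inf"), None
    # Brute-force search: fine for a handful of speakers, O(n!) in general.
    for perm in itertools.permutations(range(n)):
        loss = sum(
            binary_cross_entropy(pred[i], ref[perm[i]]) for i in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two speakers, four frames; the model's output tracks are in swapped order.
ref = [[1, 1, 0, 0], [0, 1, 1, 0]]
pred = [[0.1, 0.9, 0.8, 0.1], [0.9, 0.8, 0.2, 0.1]]
loss, perm = permutation_invariant_bce(pred, ref)
```

Because the loss is taken at the best matching ordering, the model is not penalised for emitting speakers in a different order than the reference labels. Both tracks may be active in the same frame, which is how this kind of loss accommodates overlapping speech.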


Updated: 2021-05-06
