Speaker Diarization Using Stereo Audio Channels: Preliminary Study on Utterance Clustering


arXiv - CS - Sound, Pub Date: 2020-09-10, DOI: arxiv-2009.05076
Yingjun Dong, Neil G. MacLaren, Yiding Cao, Francis J. Yammarino, Shelley D. Dionne, Michael D. Mumford, Shane Connelly, Hiroki Sayama, and Gregory A. Ruark

Speaker diarization is an actively researched topic in audio signal processing and machine learning, and utterance clustering is a critical part of the task. In this study, we aim to improve utterance clustering performance by processing multichannel (stereo) audio signals. We generated processed audio signals by combining the left- and right-channel signals in a few different ways and then extracted embedded features (also called d-vectors) from those processed signals. We applied a Gaussian mixture model (GMM) for supervised utterance clustering. In the training phase, we used a parameter-sharing GMM to train a model for each speaker. In the testing phase, we selected the speaker with the maximum likelihood as the detected speaker. Experiments with real audio recordings of multi-person discussion sessions showed that the proposed method, which used multichannel audio signals, achieved significantly better performance than a conventional method using mono audio signals.
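The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not say which channel combinations were used, so the sum and difference mixes are hypothetical; the d-vector extractor is not reproduced, so hand-made two-dimensional vectors stand in for embeddings; and each speaker is modeled with a single diagonal Gaussian (the one-component special case of a GMM) with no attempt to reproduce the parameter-sharing scheme. The function names `combine_channels`, `fit_speaker_models`, `log_likelihood`, and `predict_speaker` are likewise made up for this example.

```python
import math


def combine_channels(left, right):
    """Return a few hypothetical mixes of the two stereo channels."""
    return {
        "left": list(left),
        "right": list(right),
        "sum": [(l + r) / 2 for l, r in zip(left, right)],
        "diff": [(l - r) / 2 for l, r in zip(left, right)],
    }


def fit_speaker_models(features_by_speaker, var_floor=1e-6):
    """Fit one diagonal Gaussian (per-dimension mean and variance) per speaker."""
    models = {}
    for speaker, vectors in features_by_speaker.items():
        n = len(vectors)
        dims = len(vectors[0])
        mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
        # Floor the variance so a constant dimension cannot cause division by zero.
        var = [
            max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
            for d in range(dims)
        ]
        models[speaker] = (mean, var)
    return models


def log_likelihood(vector, model):
    """Log-density of `vector` under a diagonal Gaussian (mean, var)."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(vector, mean, var)
    )


def predict_speaker(vector, models):
    """Pick the speaker whose model gives the highest likelihood."""
    return max(models, key=lambda spk: log_likelihood(vector, models[spk]))


# Toy stand-ins for embeddings of labeled training utterances (not real d-vectors).
train = {
    "speaker_a": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "speaker_b": [[-1.0, 0.2], [-0.8, 0.0], [-1.2, -0.2]],
}
models = fit_speaker_models(train)
print(predict_speaker([0.95, 0.05], models))  # prints "speaker_a"
```

In a fuller version, each mix returned by `combine_channels` would be passed through an embedding network, and the resulting vectors would take the place of the toy values in `train`.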


Updated: 2020-09-14
