Graph attention-based deep embedded clustering for speaker diarization
Speech Communication (IF 3.2), Pub Date: 2023-10-05, DOI: 10.1016/j.specom.2023.102991
Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang

Deep speaker embedding extraction models have recently become the cornerstone of modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (i.e., speaker features) do not effectively exploit their intrinsic relationships and are not tailored to the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method based on graph attention-based deep embedded clustering (GADEC) to address these two issues. First, because speech is temporal, when a speech signal is split into short segments, the current segment and its neighboring segments are likely to belong to the same speaker. This suggests that embeddings extracted from neighboring segments can help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and exploit the local structural information among their embeddings, we construct a graph over the pre-extracted speaker embeddings of a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module that integrates information from neighboring nodes (i.e., neighboring segments) in the graph and learns latent speaker embeddings. Moreover, we jointly optimize the latent speaker embeddings and the clustering results within a unified framework, which yields more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that the proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems in terms of diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.
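
The pipeline described in the abstract (graph over neighboring segments, attention-based aggregation, DEC-style clustering objective) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the abstract does not say how edges are chosen, so `temporal_adjacency` simply links segments within a hypothetical `window`; `attention_aggregate` is a parameter-free cosine-similarity attention standing in for the learned GAE; and the soft assignment, target distribution, and KL loss follow the standard DEC formulation rather than anything stated in this abstract. The toy embeddings and the cluster-center initialization are made up.

```python
import numpy as np

def temporal_adjacency(n_segments, window=2):
    """Link each segment to itself and to segments at most `window` steps away.
    The windowed rule is an assumption; the abstract does not give the edge rule."""
    idx = np.arange(n_segments)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)

def attention_aggregate(x, adj):
    """Parameter-free stand-in for a graph attention layer:
    softmax over cosine similarities, restricted to graph neighbors."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    scores = xn @ xn.T
    scores = np.where(adj > 0, scores, -np.inf)    # mask non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                       # masked entries become 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                             # neighbor-weighted embeddings

def dec_soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment q_ij used in standard DEC."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target(q):
    """Sharpened DEC target: p_ij proportional to q_ij^2 / sum_i q_ij."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    """KL(P || Q), the clustering term minimized in DEC-style training."""
    return float(np.sum(p * np.log(p / q)))

# Toy data: 12 segments, the first 6 from one "speaker", the last 6 from another.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.3, (6, 4)) + 2.0,
               rng.normal(0, 0.3, (6, 4)) - 2.0])

adj = temporal_adjacency(len(x), window=2)
z = attention_aggregate(x, adj)
# Centers are initialized from the known halves purely to keep the toy short;
# a real system would use something like k-means on the latent embeddings.
centers = np.stack([z[:6].mean(axis=0), z[6:].mean(axis=0)])
q = dec_soft_assign(z, centers)
p = dec_target(q)
loss = kl_clustering_loss(p, q)
print("predicted speaker per segment:", q.argmax(axis=1))
print("KL clustering loss:", loss)
```

In a trained system the attention weights and the cluster centers would both be learned by gradient descent on this kind of loss, which is what "jointly optimize" refers to; the sketch only evaluates the quantities once.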




Updated: 2023-10-05