Analysis of transition cost and model parameters in speaker diarization for meetings, EURASIP Journal on Audio, Speech, and Music Processing



EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2021-02-24, DOI: 10.1186/s13636-021-00196-6
Beatriz Martínez-González, José M. Pardo, José A. Vallejo-Pinto, Rubén San-Segundo, Javier Ferreiros

There has been little work in the literature on the speaker diarization of meetings with multiple distant microphones since the publications in 2012 related to the last National Institute of Standards and Technology (NIST) Rich Transcription Evaluation Campaign in 2009 (RT09). More recently, the Second DIHARD Challenge Evaluation has also covered diarization of dinner party meetings recorded with multiple distant microphones. Dinner party meetings are somewhat harder than office meetings because their participants can move freely around the room. In this paper, we study some of the algorithms for speaker diarization of meetings with multiple distant microphones from the NIST Rich Transcription Evaluation Campaign in 2007 (RT07) and RT09, and provide clear improvements. On the one hand, little or no attention has been paid to the problem of penalizing or favoring transitions between speakers, other than proposing a minimum duration of a speaker turn or calculating the speakers' probabilities using Variational Bayes (VB). We have studied this issue and determined that a transition penalty term is needed, and that it should be independent of both the number of active speakers and the minimum duration of speaker turns. On the other hand, a method to automatically select the right number of parameters is crucial in developing good speaker models. Previous studies have proposed selecting the number of parameters dynamically based on the duration of the speaker's speech, with mixed results when tested on single-distant-microphone or multiple-distant-microphone meetings. In this paper, we propose a new method that uses the duration of the speaker's speech to determine a minimum number of parameters and the risk of overfitting to determine a maximum number, while also reducing computation time.
We have carried out experiments to support our findings, and we have been able to improve on our baseline speaker error rate for multiple-distant-microphone meetings. Both methods achieve improved performance over the baseline. The first method obtains a 21.6% relative decrease in speaker error for the development set and a 4.6% relative decrease for the test set (RT09). The second method obtains a 46.47% relative decrease in speaker error for the development set and a 17.54% relative decrease for the test set. The two methods complement each other: applied in combination, they yield a 47.2% relative decrease in speaker error for the development set and a 22.02% relative decrease for the test set. With these simple modifications, performance is outstanding on some subsets of the development set, such as NIST RT07, and among the best for RT09. Furthermore, our algorithm reduces computation time without jeopardizing performance. Results on a different publicly available database, Augmented Multi-party Interaction (AMI), show a 28.44% relative decrease in speaker error, confirming the validity of our methods. Preliminary experiments with a single stream (MFCC) support our findings. In comparisons with an x-vector system, our system performs better on unseen test data.
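
The abstract does not give the form of the transition penalty, so the following is only an illustrative sketch, not the authors' method. It assumes frame-level log-likelihoods per speaker and a plain Viterbi search in which a constant `penalty` is subtracted whenever the decoded speaker changes. The penalty is a single scalar, so it does not depend on the number of speakers. The minimum-duration constraint mentioned in the abstract is deliberately left out, and the function name, parameter names, and toy scores are all hypothetical.

```python
from typing import List, Sequence


def decode_with_transition_penalty(
    log_likes: Sequence[Sequence[float]],
    penalty: float,
) -> List[int]:
    """Viterbi decoding over speakers with a constant switch penalty.

    log_likes[t][s] is the log-likelihood of frame t under speaker s
    (hypothetical input; any per-speaker model could produce it).
    penalty (>= 0) is subtracted from the path score at every speaker
    change and is the same for any number of speakers.
    """
    if not log_likes:
        return []
    n_spk = len(log_likes[0])
    score = list(log_likes[0])      # best path score ending in each speaker
    backptr: List[List[int]] = []   # backptr[t-1][s] = previous speaker

    for frame in log_likes[1:]:
        # The best path to switch from is the same for every target speaker,
        # because the penalty is constant.
        best_prev = max(range(n_spk), key=lambda s: score[s])
        new_score, ptrs = [], []
        for s in range(n_spk):
            stay = score[s]
            switch = score[best_prev] - penalty
            if stay >= switch:
                ptrs.append(s)
                new_score.append(stay + frame[s])
            else:
                ptrs.append(best_prev)
                new_score.append(switch + frame[s])
        backptr.append(ptrs)
        score = new_score

    # Trace back from the best final speaker.
    state = max(range(n_spk), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path


# Toy scores for two speakers over three frames (made-up values).
frames = [[0.0, -1.0], [-1.0, -0.5], [0.0, -1.0]]
print(decode_with_transition_penalty(frames, penalty=0.0))  # [0, 1, 0]
print(decode_with_transition_penalty(frames, penalty=2.0))  # [0, 0, 0]
```

With no penalty, the decoder follows the per-frame best speaker and produces a one-frame turn. With a penalty of 2.0, that short switch is no longer worth its cost and the path stays with speaker 0.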


Updated: 2021-02-24
