Speaker age and gender classification using GMM supervector and NAP channel compensation method
Journal of Ambient Intelligence and Humanized Computing, Pub Date: 2020-05-13, DOI: 10.1007/s12652-020-02045-4
Ergün Yücesoy

One of the most important factors affecting the performance of speech-based recognition systems is the mismatch between training and test conditions. Nuisance attribute projection (NAP) is an effective method for eliminating this mismatch, commonly referred to as channel effects. This study investigates the effect of NAP on the classification of speakers into age and gender groups. Mel-frequency cepstral coefficients and their delta coefficients are used as features, and Gaussian mixture models (GMMs) adapted from a universal background model by the maximum a posteriori (MAP) method are used to model the age and gender classes. The GMM of each utterance is converted into a mean supervector, the supervectors are fed to a support vector machine (SVM), and the utterances are classified according to the age and gender group of their speakers. A linear GMM kernel based on Kullback–Leibler divergence is used instead of standard SVM kernels. To determine the optimum values of the parameters, the NAP channel subspace size is varied between 20 and 200 and the number of GMM components between 32 and 512. In tests on the aGender database, the optimum number of components is found to be 128 and the optimum NAP channel subspace size 45. With these parameters, the use of NAP increases the joint age and gender classification accuracy of the system from 60.52% to 62.03%. In addition, age classification accuracy increases from 60.23% to 61.82% and gender classification accuracy from 91.71% to 92.30%.
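The pipeline described in the abstract (MAP adaptation of UBM means, KL-scaled mean supervectors, NAP projection) can be sketched roughly as follows. This is a minimal NumPy illustration, not the author's implementation: the function names, the diagonal-covariance assumption, the relevance factor of 16, and the choice to estimate the nuisance subspace from within-group scatter are all assumptions made for the example and are not stated in the abstract.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, frames, relevance=16.0):
    """MAP-adapt the means of a diagonal-covariance UBM to one utterance.

    ubm_means, ubm_vars: (C, D); ubm_weights: (C,); frames: (T, D).
    The relevance factor (16) is an assumed, commonly used default.
    """
    diff = frames[:, None, :] - ubm_means[None, :, :]            # (T, C, D)
    log_prob = -0.5 * np.sum(diff ** 2 / ubm_vars
                             + np.log(2.0 * np.pi * ubm_vars), axis=2)
    log_post = np.log(ubm_weights) + log_prob                    # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)              # stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities
    n = post.sum(axis=0)                                         # soft counts (C,)
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]       # data means (C, D)
    alpha = (n / (n + relevance))[:, None]
    return alpha * ex + (1.0 - alpha) * ubm_means

def kl_supervector(means, ubm_weights, ubm_vars):
    """Stack means scaled by sqrt(w_c) / sigma_c, so that a plain dot
    product between two supervectors equals the linear KL-based GMM kernel."""
    return (np.sqrt(ubm_weights)[:, None] * means / np.sqrt(ubm_vars)).ravel()

def nap_basis(supervectors, groups, k):
    """Estimate a k-dimensional nuisance subspace from within-group scatter.

    `groups` is a hypothetical label array (for example speaker or class ids);
    which grouping the paper uses is not stated in the abstract.
    """
    centered = np.array(supervectors, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        centered[idx] -= centered[idx].mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T                                              # (dim, k)

def nap_project(x, u):
    """Apply P = I - U U^T to remove the nuisance directions from x."""
    return x - (x @ u) @ u.T
```

In a setup like the one reported, `k` would correspond to the NAP channel subspace size (45 at the reported optimum) and the number of UBM components `C` to 128; the projected supervectors would then be passed to a linear SVM.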




Updated: 2020-05-13