Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales
Digital Signal Processing (IF 2.9) Pub Date: 2020-05-11, DOI: 10.1016/j.dsp.2020.102763
Sugan Nagarajan , Satya Sai Srinivas Nettimi , Lakshmi Sutha Kumar , Malaya Kumar Nath , Aniruddha Kanhe

Speech emotion recognition (SER) refers to the process of recognizing the emotional state of a speaker from a speech utterance. Earlier studies have proposed a wide variety of cepstral features for developing SER systems. The mel-frequency cepstral coefficients (MFCC) and human-factor cepstral coefficients (HFCC) are two popular variants. MFCC and HFCC features are extracted from speech signals using the mel and human-factor filter banks, respectively. Because the magnitude response of each filter in these banks is triangular, they are referred to as triangular filter banks (TFB), and the corresponding cepstral coefficients can be denoted TFBCC-M (for MFCC) and TFBCC-HF (for HFCC). The mel filter bank (TFB-M) is constructed using the mel scale, while the human-factor filter bank (TFB-HF) is constructed using the human-factor scale, a combination of the mel and equivalent rectangular bandwidth (ERB) scales. By the same principle, other frequency scales can be used to realize other TFBs and thereby extract further types of TFBCC features. In this direction, this paper proposes two new TFBs, denoted TFB-B and TFB-E, realized using the Bark and ERB scales to extract new cepstral features referred to as TFBCC-B and TFBCC-E, respectively. The mathematical background for constructing the proposed TFB-B and TFB-E is presented. The proposed filter banks are used alongside the conventional TFB-M and TFB-HF to extract four different types of TFBCC features. These features are extracted from the emotional speech signals of two databases, the Berlin database of emotional speech (Emo-DB) and the Surrey audio-visual expressed emotion database (SAVEE). The extracted features are used to develop speaker-dependent (SD) and speaker-independent (SI) SER systems using support vector machines, and the performance of each feature set is analyzed in isolated and combined usage.
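The TFB construction described above can be sketched in code: center frequencies are spaced uniformly on the warped scale (here the Bark scale, via the Zwicker–Terhardt approximation), mapped back to Hz, and used to place triangular filters over the FFT bins; cepstral coefficients are then the DCT of the log filter-bank energies, as in standard MFCC extraction. This is a minimal illustrative sketch, not the authors' implementation; the filter count, FFT size, and sampling rate are assumed values.

```python
import numpy as np
from scipy.fft import dct

def hz_to_bark(f):
    # Zwicker & Terhardt approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def triangular_filter_bank(n_filters=24, n_fft=512, sr=16000):
    """Triangular filter bank with center frequencies uniform on the Bark scale."""
    f_min, f_max = 0.0, sr / 2.0
    # n_filters + 2 boundary points, equally spaced on the warped (Bark) axis
    b_pts = np.linspace(hz_to_bark(f_min), hz_to_bark(f_max), n_filters + 2)
    # invert the (monotonic) warping numerically on a dense grid
    grid = np.linspace(f_min, f_max, 10000)
    hz_pts = np.interp(b_pts, hz_to_bark(grid), grid)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                 # rising edge of the triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of the triangle
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def tfbcc(power_spectrum, fb, n_ceps=13):
    # log filter-bank energies followed by a DCT, as in MFCC extraction
    energies = np.log(fb @ power_spectrum + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_ceps]
```

Swapping `hz_to_bark` for an ERB-scale conversion (e.g. the Glasberg–Moore formula) would yield the TFB-E variant by the same construction.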
The experimental results show that the cepstral features extracted using the proposed TFBs characterize and recognize emotions as effectively as the conventional MFCC and HFCC features. Moreover, the combined use of different cepstral features improves the overall recognition performance of the SER systems. For the Emo-DB database, isolated use of the proposed TFBCC-B and TFBCC-E features achieves recognition accuracies of 83.23% and 81.99% in the SD scenario, and 75% and 60.94% in the SI scenario, respectively. Similarly, for the SAVEE database, recognition accuracies of 75% and 66.67% in the SD scenario, and 44.17% and 55% in the SI scenario, are achieved. For Emo-DB, the maximum recognition accuracies are 86.96% in the SD scenario (for several combinations of conventional and proposed features, namely TFBCC-{(M+E), (M+B+E), (HF+B+E), (M+HF+B+E)}) and 77.08% in the SI scenario (for the combination TFBCC-(M+B+E)). Similarly, for SAVEE, the maximum recognition accuracies are 77.08% in the SD scenario (for the combination TFBCC-(M+HF+E)) and 55.83% in the SI scenario (for the combination TFBCC-(B+E)).
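The "combined usage" evaluated above amounts to concatenating per-utterance feature vectors from different TFBs before classification with a support vector machine. A minimal sketch of that pipeline, using scikit-learn and random stand-in data in place of real TFBCC features (the array names, sizes, and seven-class labels are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
n_utts = 120
# Stand-ins for real per-utterance TFBCC feature matrices (one row per utterance)
tfbcc_m = rng.randn(n_utts, 13)      # would be TFBCC-M in practice
tfbcc_b = rng.randn(n_utts, 13)      # would be TFBCC-B in practice
tfbcc_e = rng.randn(n_utts, 13)      # would be TFBCC-E in practice
labels = rng.randint(0, 7, n_utts)   # e.g. the seven Emo-DB emotion classes

# Combined usage: concatenate feature sets along the feature axis -> TFBCC-(M+B+E)
X = np.hstack([tfbcc_m, tfbcc_b, tfbcc_e])

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X, labels)
```

In an SD setup the train/test split shares speakers; in an SI setup all utterances of a held-out speaker are reserved for testing.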



