Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition
Applied Acoustics (IF 3.4) Pub Date: 2021-01-01, DOI: 10.1016/j.apacoust.2020.107645
Surekha Reddy Bandela , T. Kishore Kumar

Abstract Speech feature fusion is the most commonly used technique for improving accuracy in Speech Emotion Recognition (SER). However, it increases the complexity of the SER system in terms of processing time. Moreover, some of the fused features may be redundant, contribute nothing to SER, and lead to incorrect emotion prediction and reduced accuracy. To overcome this problem, this paper applies unsupervised feature selection to a feature set combining the INTERSPEECH 2010 paralinguistic features, Gammatone Cepstral Coefficients (GTCC), and Power Normalized Cepstral Coefficients (PNCC). Feature Selection with Adaptive Structure Learning (FSASL), Unsupervised Feature Selection with Ordinal Locality (UFSOL), and a novel Subset Feature Selection (SuFS) algorithm are the dimensionality reduction techniques used to achieve better SER performance in this work. The proposed SER system is analyzed in both clean and noisy environments. The EMO-DB and IEMOCAP emotion databases are used to evaluate the proposed system. For the noise analysis, clean speech is corrupted with different noises from the Aurora noise database and with white Gaussian noise at Signal-to-Noise Ratio (SNR) levels from −5 dB to 20 dB. A Support Vector Machine (SVM) classifier with linear and Radial Basis Function (RBF) kernels is used, with 10-fold cross-validation and hold-out validation, and classification accuracy and computation time serve as the performance metrics. The results show that the proposed SER system outperforms the baseline SER system as well as many existing works in the literature, in both clean and noisy conditions. For SNR levels above 15 dB, the proposed system in the presence of different noises performs on par with the SER system in clean environments, whereas for lower SNRs
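For readers who want to reproduce a comparable back end, the sketch below shows the two generic pieces of such a pipeline: mixing noise into clean speech at a target SNR, and scoring an SVM (linear or RBF kernel) with 10-fold cross-validation. It is a minimal sketch, assuming pre-extracted feature vectors `X` and labels `y` are already available; the variance-threshold selector is only a placeholder, since the paper's FSASL/UFSOL/SuFS selectors and its NMF de-noising front end are not part of scikit-learn.

```python
# Minimal sketch of the noise-mixing and SVM evaluation steps (not the
# authors' implementation). Assumes features X (n_samples x n_features)
# and labels y have already been extracted from the speech corpus.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` speech at the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)          # tile/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise


def evaluate(X: np.ndarray, y: np.ndarray, kernel: str = "rbf") -> float:
    """10-fold cross-validated accuracy of an SVM on the selected features."""
    clf = make_pipeline(
        StandardScaler(),
        VarianceThreshold(threshold=1e-3),         # placeholder for FSASL/UFSOL/SuFS
        SVC(kernel=kernel, C=1.0, gamma="scale"),
    )
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
```

Running `evaluate(X, y, kernel="linear")` and `evaluate(X, y, kernel="rbf")` reproduces the kernel comparison described in the abstract; the noise study would loop `add_noise_at_snr` over SNRs from −5 dB to 20 dB before feature extraction.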

Updated: 2021-01-01