Feedback-Driven Sensory Mapping Adaptation for Robust Speech Activity Detection.
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1). Pub Date: 2017-07-25. DOI: 10.1109/taslp.2016.2639322
Ashwin Bellur, Mounya Elhilali

Parsing natural acoustic scenes using computational methodologies poses many challenges. Given the rich and complex nature of the acoustic environment, data mismatch between training and test conditions is a major hurdle for data-driven audio processing systems. In contrast, the brain exhibits a remarkable ability to segment acoustic scenes with relative ease. When tackling the challenging listening conditions often faced in everyday life, the biological system relies on a number of principles that allow it to effortlessly parse its rich soundscape. In the current study, we leverage a key principle employed by the auditory system: its ability to adapt the neural representation of its sensory input in a high-dimensional space. We propose a framework that mimics this process in a computational model for robust speech activity detection. The system employs a 2-D Gabor filter bank whose parameters are retuned offline to improve the separability between the feature representations of speech and nonspeech sounds. This retuning process, driven by feedback from statistical models of the speech and nonspeech classes, attempts to minimize the misclassification risk of mismatched data with respect to the original statistical models. We hypothesize that this risk minimization procedure emphasizes modulations unique to speech and nonspeech in the high-dimensional space. We show that such an adapted system is indeed robust to other novel conditions, with a marked reduction in equal error rates on a variety of databases with additive and convolutive noise distortions. We discuss the lessons learned from biology with regard to adapting to an ever-changing acoustic environment and the impact on building truly intelligent audio processing systems.
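The feedback-driven retuning described above can be sketched in miniature: extract spectro-temporal features with a 2-D Gabor kernel, then keep the kernel parameters that best separate a speech-like patch from a nonspeech patch. This is a minimal illustrative sketch, not the paper's actual system; the kernel parameterization, candidate rates, toy patches, and the simple response-gap criterion (standing in for the statistical-model-based misclassification risk) are all assumptions made here for illustration.

```python
import numpy as np

def gabor_kernel_2d(size, omega_f, omega_t, sigma=3.0):
    """Zero-mean 2-D Gabor kernel tuned to a spectral modulation rate
    omega_f (cycles per frequency channel) and a temporal modulation
    rate omega_t (cycles per time frame). Illustrative parameterization."""
    r = np.arange(size) - size // 2
    F, T = np.meshgrid(r, r, indexing="ij")
    envelope = np.exp(-(F**2 + T**2) / (2.0 * sigma**2))
    kernel = envelope * np.cos(2.0 * np.pi * (omega_f * F + omega_t * T))
    return kernel - kernel.mean()  # zero mean: flat (DC) patches give no response

def mean_response(patch, kernel):
    """Mean magnitude of the valid 2-D correlation of a spectrogram patch
    with one kernel -- one coordinate of the feature representation."""
    kh, kw = kernel.shape
    ph, pw = patch.shape
    responses = [
        abs(np.sum(patch[i:i + kh, j:j + kw] * kernel))
        for i in range(ph - kh + 1)
        for j in range(pw - kw + 1)
    ]
    return float(np.mean(responses))

# Toy stand-ins for the two classes: a "speech-like" patch carrying a
# clear spectral ripple, and a featureless "nonspeech" patch.
size, true_rate = 32, 0.125
F = np.tile(np.arange(size)[:, None], (1, size))
speech_patch = np.cos(2.0 * np.pi * true_rate * F)
nonspeech_patch = np.ones((size, size))

def separability(rate):
    """Gap between class responses for a kernel tuned to `rate` -- a crude
    stand-in for the paper's model-driven misclassification-risk criterion."""
    k = gabor_kernel_2d(16, omega_f=rate, omega_t=0.0)
    return mean_response(speech_patch, k) - mean_response(nonspeech_patch, k)

# "Retune" offline by keeping the modulation rate that best separates
# the two classes in feature space.
candidate_rates = [0.05, 0.125, 0.25]
best_rate = max(candidate_rates, key=separability)
```

In this toy setting the retuning loop recovers the ripple rate actually present in the speech-like patch, mirroring the paper's idea that feedback from class models should emphasize modulations where speech and nonspeech differ most.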

Updated: 2019-11-01