Intoxicated Speech Detection: A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors.
Computer Speech & Language (IF 3.1), Pub Date: 2014-03-01, DOI: 10.1016/j.csl.2012.09.004
Daniel Bone, Ming Li, Matthew P Black, Shrikanth S Narayanan

Segmental and suprasegmental speech signal modulations offer information about paralinguistic content such as affect, age and gender, pathology, and speaker state. Speaker state encompasses medium-term, temporary physiological phenomena influenced by internal or external biochemical actions (e.g., sleepiness, alcohol intoxication). Perceptual and computational research indicates that detecting speaker state from speech is a challenging task. In this paper, we present a system constructed with multiple representations of prosodic and spectral features that provided the best result at the Intoxication Subchallenge of Interspeech 2011 on the Alcohol Language Corpus. We discuss the details of each classifier and show that fusion improves performance. We additionally address the question of how best to construct a speaker state detection system in terms of robust and practical marginalization of associated variability, such as through modeling speakers, utterance type, gender, and utterance length. As is the case in human perception, speaker normalization provides significant improvements to our system. We show that a held-out set of baseline (sober) data can be used to achieve gains comparable to those of other speaker normalization techniques. Our fused frame-level statistic-functional systems, fused GMM systems, and final combined system achieve unweighted average recalls (UARs) of 69.7%, 65.1%, and 68.8%, respectively, on the test set. Numbers more consistent with the development set results occur with matched-prompt training, where the UARs are 70.4%, 66.2%, and 71.4%, respectively. The combined system improves over the Challenge baseline by 5.5% absolute (8.4% relative), also improving upon our previous best result.
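
The abstract reports all results as unweighted average recall (UAR). As a minimal sketch of that metric (not the authors' evaluation code), UAR can be computed as the mean of the per-class recalls, so each class counts equally however many samples it has. The function name and the toy labels below are hypothetical and are not drawn from the Alcohol Language Corpus.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # total samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: 8 sober and 2 intoxicated utterances.
y_true = ["sober"] * 8 + ["intoxicated"] * 2
y_pred = ["sober"] * 6 + ["intoxicated"] * 2 + ["intoxicated", "sober"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Sober recall is 6/8 and intoxicated recall is 1/2, so UAR is their mean,
# while plain accuracy is dominated by the larger sober class.
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the UAR (0.625) is lower than the accuracy (0.700) because the minority class is recognized less reliably, which illustrates why a class-balanced measure suits a two-class task with uneven class sizes.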

Updated: 2019-11-01