Detecting paralinguistic events in audio stream using context in features and probabilistic decisions.
Computer Speech & Language (IF 3.1), Pub Date: 2015-09-11, DOI: 10.1016/j.csl.2015.08.003
Rahul Gupta, Kartik Audhkhasi, Sungbok Lee, Shrikanth Narayanan
Non-verbal communication involves the encoding, transmission, and decoding of non-lexical cues, and is realized through vocal (e.g. prosody) or visual (e.g. gaze, body language) channels during conversation. These cues serve to maintain conversational flow, express emotions, and mark personality and interpersonal attitude. In particular, non-verbal cues in speech such as paralanguage and non-verbal vocal events (e.g. laughter, sighs, cries) are used to nuance meaning and convey emotions, mood, and attitude. For instance, laughter is associated with affective expression, while fillers (e.g. um, ah) are used to hold the floor during a conversation. In this paper we present an automatic non-verbal vocal event detection system focusing on the detection of laughter and fillers. We extend our system presented at the Interspeech 2013 Social Signals Sub-challenge (the winning entry in that challenge) for frame-wise event detection, and test several schemes for incorporating local context during detection. Specifically, we incorporate context at two separate levels in our system: (i) the raw frame-wise features and (ii) the output decisions. Furthermore, our system post-processes the output probabilities using a few heuristic rules in order to reduce erroneous frame-based predictions. Our overall system achieves an Area Under the Receiver Operating Characteristic curve (AUC) of 95.3% for detecting laughter and 90.4% for fillers on the test set drawn from the data specifications of the Interspeech 2013 Social Signals Sub-challenge. We perform further analysis to understand the interrelation between the features and the obtained results. Specifically, we conduct a feature sensitivity analysis and correlate it with each feature's standalone performance. The observations suggest that the trained system is more sensitive to features carrying higher discriminability, with implications for better system design.
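
The two levels of context described in the abstract can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names stack_context and smooth_decisions are hypothetical, and the context width and median-filter window are arbitrary choices standing in for the paper's tuned parameters and heuristic post-processing rules.

    import numpy as np
    from scipy.ndimage import median_filter

    def stack_context(features, width=5):
        # Feature-level context: concatenate each frame with its
        # +/- `width` neighbors; edge frames are padded by repetition.
        # features: (num_frames, num_dims) array of raw frame-wise features.
        num_frames = features.shape[0]
        padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
        return np.stack(
            [padded[i:i + 2 * width + 1].ravel() for i in range(num_frames)]
        )

    def smooth_decisions(probs, window=11):
        # Decision-level context: median-filter the frame-wise event
        # probabilities so that isolated spurious frame predictions are
        # suppressed, a simple stand-in for heuristic probability rules.
        return median_filter(probs, size=window)

Frame-wise probabilities from any classifier can be passed through smooth_decisions before scoring; the reported metric corresponds to the area under the ROC curve, computable for example with sklearn.metrics.roc_auc_score on the smoothed frame probabilities and frame labels.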

Updated: 2019-11-01