Temporal Relation Inference Network for Multimodal Speech Emotion Recognition
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.4), Pub Date: 2022-03-30, DOI: 10.1109/tcsvt.2022.3163445
Guan-Nan Dong, Chi-Man Pun, Zheng Zhang

Speech emotion recognition (SER) is a non-trivial task even for humans, and it remains challenging for automatic systems due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems have treated multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and the temporal relations of speech are both meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) for multi-modal SER, which fully considers the underlying hierarchy of phonetic structure and its associations across modalities under sequential temporal guidance. Specifically, we design a temporal reasoning calibration module to imitate real, rich contextual conditions. Unlike previous works, which assume that all modalities are related, it infers the dependency relationships among the semantic information at the temporal level and learns to handle the multi-modal interaction sequence in a flexible order. To enhance the feature representation, an innovative temporal attentive fusion unit is developed to magnify the details embedded in a single modality at the semantic level. Meanwhile, an adaptive feature fusion mechanism aggregates representations from both the temporal and semantic levels, selectively collecting implicit complementary information to strengthen the dependencies between different information subspaces and maximize the integrity of the feature representation. Extensive experiments on two benchmark datasets demonstrate the superiority of TRIN over several state-of-the-art SER methods.
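The general idea behind an attentive multi-modal fusion unit of this kind can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `temporal_attentive_fusion`, the cross-modal attention form, and the sigmoid gating used for adaptive fusion are all assumptions standing in for the paper's actual modules.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attentive_fusion(audio, text, W_q, W_k, gate_w):
    """Hypothetical sketch of attentive cross-modal fusion.

    audio: (T_a, d) frame-level acoustic features for one utterance
    text:  (T_t, d) token-level lexical features for the same utterance
    """
    # Cross-modal attention: audio frames query the text tokens,
    # aligning lexical content to the acoustic timeline.
    scores = (audio @ W_q) @ (text @ W_k).T / np.sqrt(W_q.shape[1])
    attn = softmax(scores, axis=-1)          # (T_a, T_t), rows sum to 1
    aligned_text = attn @ text               # (T_a, d)
    # Adaptive (gated) fusion: a learned per-dimension weighting decides
    # how much each stream contributes at every time step.
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([audio, aligned_text],
                                             axis=-1) @ gate_w)))
    return g * audio + (1.0 - g) * aligned_text

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
d = 16
audio = rng.standard_normal((20, d))         # 20 audio frames
text = rng.standard_normal((8, d))           # 8 word embeddings
W_q = rng.standard_normal((d, d)) * 0.1
W_k = rng.standard_normal((d, d)) * 0.1
gate_w = rng.standard_normal((2 * d, d)) * 0.1
fused = temporal_attentive_fusion(audio, text, W_q, W_k, gate_w)
```

Because the gate is a sigmoid, the fused output at each dimension is a convex combination of the acoustic feature and the attention-aligned lexical feature, which is one simple way to "selectively collect" complementary information across modalities.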

Updated: 2022-03-30