Temporal Relation Inference Network for Multimodal Speech Emotion Recognition
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.3) Pub Date: 2022-03-30, DOI: 10.1109/tcsvt.2022.3163445
Guan-Nan Dong, Chi-Man Pun, Zheng Zhang

Speech emotion recognition (SER) is a non-trivial task even for humans, and automatic SER remains challenging due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems have typically treated multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and temporal relations of speech both constitute meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) for multi-modal SER, which fully considers the underlying hierarchy of phonetic structure and its associations across modalities under sequential temporal guidance. Specifically, we design a temporal reasoning calibration module to imitate realistic and rich contextual conditions. Unlike previous works, which assume that all modalities are related, it infers dependency relationships among semantic information at the temporal level and learns to handle the multi-modal interaction sequence in a flexible order. To enhance feature representation, a temporal attentive fusion unit is developed to magnify the details embedded in a single modality at the semantic level. Meanwhile, it aggregates feature representations from both the temporal and semantic levels through an adaptive feature fusion mechanism, selectively collecting implicit complementary information to strengthen the dependencies between different information subspaces and maximize the integrity of the representation. Extensive experiments on two benchmark datasets demonstrate the superiority of our TRIN method over several state-of-the-art SER methods.
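The abstract describes two fusion ideas at a high level: per-modality temporal attention to magnify salient details, and an adaptive mechanism that selectively combines complementary information across modalities. The PyTorch sketch below illustrates one minimal way such a unit could look; it is an assumption-laden simplification, not the authors' TRIN implementation, and all names (TemporalAttentiveFusion, the sigmoid gate, the 64-dimensional features) are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalAttentiveFusion(nn.Module):
    """Illustrative sketch only: temporal self-attention within each
    modality, followed by an adaptive gate that mixes the two streams.
    This is a hypothetical simplification of the fusion idea in the
    abstract, not the paper's actual architecture."""
    def __init__(self, dim):
        super().__init__()
        # Per-modality temporal self-attention (single head for brevity)
        self.attn_audio = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        # Adaptive gate: per-feature weights deciding how much of each
        # modality's information to keep at every time step
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, text):
        # audio, text: (batch, time, dim) sequences, assumed time-aligned
        a, _ = self.attn_audio(audio, audio, audio)  # magnify salient frames
        t, _ = self.attn_text(text, text, text)      # magnify salient tokens
        g = self.gate(torch.cat([a, t], dim=-1))     # (batch, time, dim) in [0, 1]
        return g * a + (1 - g) * t                   # gated complementary fusion

# Usage with random tensors standing in for real acoustic/linguistic features
fusion = TemporalAttentiveFusion(dim=64)
audio_feats = torch.randn(2, 50, 64)
text_feats = torch.randn(2, 50, 64)
fused = fusion(audio_feats, text_feats)
print(fused.shape)  # torch.Size([2, 50, 64])
```

The sigmoid gate is one common way to realize "selectively collect implicit complementary information": it learns per-feature mixing weights rather than fixing the contribution of each modality in advance.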

Updated: 2022-03-30