Multimodal Video Sentiment Analysis Using Deep Learning Approaches, a Survey
Information Fusion (IF 18.6) Pub Date: 2021-06-12, DOI: 10.1016/j.inffus.2021.06.003
Sarah A. Abdu, Ahmed H. Yousef, Ashraf Salem

Deep learning has emerged as a powerful machine learning technique for multimodal sentiment analysis tasks. In recent years, many deep learning models and algorithms have been proposed in the field of multimodal sentiment analysis, creating a need for survey papers that summarize recent research trends and directions. This survey paper provides a comprehensive overview of the latest developments in the field. We present a categorization of thirty-five state-of-the-art models recently proposed for video sentiment analysis into eight categories based on the architecture used in each model. The effectiveness and efficiency of these models have been evaluated on the two most widely used datasets in the field, CMU-MOSI and CMU-MOSEI. After an intensive analysis of the results, we conclude that the most powerful architecture for the multimodal sentiment analysis task is the Multi-Modal Multi-Utterance based architecture, which exploits both the information from all modalities and the contextual information from the neighbouring utterances in a video in order to classify the target utterance. This architecture mainly consists of two modules whose order may vary from one model to another. The first is a Context Extraction Module that models the contextual relationships among neighbouring utterances in the video and highlights which of the relevant contextual utterances are most important for predicting the sentiment of the target utterance; in most recent models, this module is based on a bidirectional recurrent neural network. The second is an Attention-Based Module that is responsible for fusing the three modalities (text, audio and video) and prioritizing only the important ones. Furthermore, this paper provides a brief summary of the most popular approaches used to extract features from multimodal videos, together with a comparative analysis of the most popular benchmark datasets in the field. We expect these findings to give newcomers a panoramic view of the entire field and to offer helpful insights that guide them toward developing more effective models.
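As an illustration of the two-module design the abstract describes, below is a minimal PyTorch sketch of a Multi-Modal Multi-Utterance model: an attention-based module fuses per-utterance text, audio and video features, and a bidirectional-GRU context extraction module then runs over the utterance sequence. The module ordering (fusion before context), the hidden size, and the feature dimensionalities are assumptions made for this example, not the specification of any particular surveyed model.

```python
# Minimal sketch of a Multi-Modal Multi-Utterance architecture.
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Attention-Based Module: fuses text/audio/video utterance features,
    weighting the more informative modalities higher."""

    def __init__(self, d_text, d_audio, d_video, d_model):
        super().__init__()
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, x_t, x_a, x_v):
        # Each input: (batch, num_utterances, d_modality)
        h = torch.stack(
            [self.proj_t(x_t), self.proj_a(x_a), self.proj_v(x_v)], dim=2
        )  # (batch, utterances, 3 modalities, d_model)
        weights = torch.softmax(self.score(torch.tanh(h)), dim=2)
        return (weights * h).sum(dim=2)  # (batch, utterances, d_model)


class MultiUtteranceSentiment(nn.Module):
    """Context Extraction Module (bidirectional GRU over the utterance
    sequence) on top of the fused representations, followed by a
    per-utterance sentiment classifier."""

    def __init__(self, d_text, d_audio, d_video, d_model=128, num_classes=2):
        super().__init__()
        self.fusion = AttentionFusion(d_text, d_audio, d_video, d_model)
        self.context = nn.GRU(
            d_model, d_model, batch_first=True, bidirectional=True
        )
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, x_t, x_a, x_v):
        fused = self.fusion(x_t, x_a, x_v)  # (batch, utt, d_model)
        ctx, _ = self.context(fused)        # (batch, utt, 2 * d_model)
        return self.classifier(ctx)         # (batch, utt, num_classes)


if __name__ == "__main__":
    # Hypothetical per-modality feature sizes; 4 videos, 10 utterances each.
    model = MultiUtteranceSentiment(d_text=300, d_audio=74, d_video=35)
    x_t = torch.randn(4, 10, 300)
    x_a = torch.randn(4, 10, 74)
    x_v = torch.randn(4, 10, 35)
    print(model(x_t, x_a, x_v).shape)  # torch.Size([4, 10, 2])
```

Running the script prints torch.Size([4, 10, 2]): one logit vector per utterance, so each target utterance is classified using both the fused modalities and the context carried in from its neighbouring utterances by the bidirectional GRU.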



Updated: 2021-06-19