Deep Multimodal Fusion: A Hybrid Approach
International Journal of Computer Vision (IF 11.6) | Pub Date: 2017-02-20 | DOI: 10.1007/s11263-017-0997-7
Mohamed R. Amer, Timothy Shields, Behjat Siddiquie, Amir Tamrakar, Ajay Divakaran, Sek Chai

We propose a novel hybrid model that exploits the strengths of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences and on generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. Generative models, on the other hand, learn a rich, informative space that allows for data generation and joint feature representation, which discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a Restricted Boltzmann Machine (RBM)-based model to learn a shared representation across multiple modalities with time-varying data. The Conditional RBM (CRBM) is an extension of the RBM that accounts for short-term temporal phenomena. Our hybrid model augments CRBMs with a discriminative component for classification. To this end, we propose a novel Multimodal Discriminative CRBM (MMDCRBM) model. First, we train the MMDCRBM on labeled data, training each modality separately and then training a fusion layer. Second, we exploit the generative capability of the MMDCRBM, activating the trained model with a specific label to generate the lower-level data that closely matches the actual input. We evaluate our approach on the ChaLearn dataset (audio-mocap), the Tower Game dataset (mocap-mocap), and three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy, and demonstrate superiority over state-of-the-art methods.
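For context, the model family named in the abstract builds on standard energy-based formulations. A minimal sketch, using the textbook RBM energy and the Taylor–Hinton CRBM extension (the notation below is the standard one from the literature, not taken from this paper):

E(v, h) = -a^\top v - b^\top h - v^\top W h

The CRBM conditions the biases on a short history v_{<t} of past visible frames:

E(v_t, h_t \mid v_{<t}) = -\hat{a}^\top v_t - \hat{b}^\top h_t - v_t^\top W h_t, \qquad \hat{a} = a + A v_{<t}, \quad \hat{b} = b + B v_{<t}

where W couples visible and hidden units and A, B are autoregressive weight matrices. The hybrid energy described in the abstract presumably augments such a term with a discriminative label component, in the spirit of classification RBMs (e.g., adding -h^\top U y for a one-hot label y); the exact form is defined in the paper itself.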

Updated: 2017-02-20