Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data
IEEE Transactions on Cybernetics (IF 11.8), Pub Date: 2021-04-28, DOI: 10.1109/tcyb.2021.3070881
Chaojie Wang, Bo Chen, Sucheng Xiao, Zhengjue Wang, Hao Zhang, Penghui Wang, Ning Han, Mingyuan Zhou

For multimodal representation learning, traditional black-box approaches often fall short of extracting interpretable multilayer hidden structures, which would help visualize the connections between different modalities at multiple semantic levels. To extract interpretable multimodal latent representations and visualize the hierarchical semantic relationships between different modalities, we build on deep topic models to develop a novel multimodal Poisson gamma belief network (mPGBN) that tightly couples the observations of different modalities by imposing sparse connections between their modality-specific hidden layers. To avoid the time-consuming Gibbs sampler that traditional topic models require at test time, we construct a Weibull-based variational inference network (encoder) that directly maps observations to their latent representations, and combine it with the mPGBN (decoder), resulting in a novel multimodal Weibull variational autoencoder (MWVAE) that is fast in out-of-sample prediction and can handle large-scale multimodal datasets. Qualitative evaluations on bimodal data consisting of image-text pairs show that the developed MWVAE can successfully extract expressive multimodal latent representations for downstream tasks such as missing-modality imputation and multimodal retrieval. Further extensive quantitative results demonstrate that both MWVAE and its supervised extension sMWVAE achieve state-of-the-art performance on various multimodal benchmarks.
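The Weibull-based encoder described above is attractive because the Weibull distribution admits a simple inverse-CDF reparameterization, so latent samples can be drawn as a differentiable function of the variational parameters. The sketch below illustrates that standard trick (it is an assumption based on common Weibull VAE formulations, not code from the paper; the parameter names `k` for shape and `lam` for scale are illustrative):

```python
import math

import numpy as np


def sample_weibull(k, lam, size, rng):
    """Reparameterized Weibull sampling via the inverse CDF.

    If u ~ Uniform(0, 1), then lam * (-log(1 - u))**(1/k) follows a
    Weibull(k, lam) distribution, and gradients can flow through k and lam.
    """
    u = rng.uniform(size=size)
    return lam * (-np.log1p(-u)) ** (1.0 / k)


rng = np.random.default_rng(0)
x = sample_weibull(k=2.0, lam=1.5, size=100_000, rng=rng)

# Sanity check: the Weibull(k, lam) mean is lam * Gamma(1 + 1/k).
expected_mean = 1.5 * math.gamma(1.0 + 1.0 / 2.0)
print(abs(x.mean() - expected_mean) < 0.02)
```

Because the sample is an explicit deterministic transform of a parameter-free uniform draw, the same expression can be used inside an autodiff framework to backpropagate the evidence lower bound through the encoder, which is what makes the MWVAE fast at test time compared with Gibbs sampling.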

Updated: 2021-04-28