Mix and Match Networks: Cross-Modal Alignment for Zero-Pair Image-to-Image Translation
International Journal of Computer Vision (IF 11.6). Pub Date: 2020-06-15. DOI: 10.1007/s11263-020-01340-z
Yaxing Wang, Luis Herranz, Joost van de Weijer

This paper addresses the problem of inferring unseen cross-modal image-to-image translations between multiple modalities. We assume that only some of the pairwise translations have been seen (i.e., trained) and infer the remaining unseen translations, for which no training pairs are available. We propose mix and match networks, an approach in which multiple encoders and decoders are aligned such that the desired translation can be obtained by simply cascading the source encoder and the target decoder, even when they have not interacted during the training stage (i.e., the pair is unseen). The main challenge lies in aligning the latent representations at the bottlenecks of encoder–decoder pairs. We propose an architecture with several tools to encourage alignment, including autoencoders, robust side information, and latent consistency losses. We show the benefits of our approach in terms of effectiveness and scalability compared with other pairwise image-to-image translation approaches. We also propose zero-pair cross-modal image translation, a challenging setting in which the objective is to infer semantic segmentation from depth (and vice versa) without explicit segmentation–depth pairs, using only two disjoint segmentation–RGB and depth–RGB training sets. We observe that part of the information shared between the unseen modalities may not be reachable, so we further propose a variant that leverages pseudo-pairs, which allows us to exploit this shared information between the unseen modalities.
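As a concrete illustration of the mix-and-match idea, below is a minimal PyTorch sketch (a hypothetical toy, not the authors' implementation; all names and layer choices are assumptions): each modality gets its own encoder and decoder, seen pairs are trained with reconstruction and a latent consistency loss to align the bottlenecks, and an unseen translation is then obtained by simply cascading the source encoder with the target decoder.

# Minimal sketch of the mix-and-match idea (hypothetical, not the paper's code).
# Each modality ('rgb', 'depth', 'segm') gets its own encoder and decoder;
# translation for ANY pair is just: source encoder -> target decoder.
import torch
import torch.nn as nn

LATENT = 128

def make_encoder(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, LATENT, 4, stride=2, padding=1), nn.ReLU(),
    )

def make_decoder(out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(LATENT, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
    )

channels = {"rgb": 3, "depth": 1, "segm": 13}  # segm: one channel per class (illustrative)
enc = nn.ModuleDict({m: make_encoder(c) for m, c in channels.items()})
dec = nn.ModuleDict({m: make_decoder(c) for m, c in channels.items()})

def translate(x, src, tgt):
    """Cascade source encoder and target decoder (works even for unseen pairs)."""
    return dec[tgt](enc[src](x))

def latent_consistency_loss(x, src, tgt):
    """Encourage aligned bottlenecks: re-encoding a translation should
    reproduce the original latent (one of several alignment tools)."""
    z = enc[src](x)
    z_back = enc[tgt](dec[tgt](z))
    return nn.functional.l1_loss(z_back, z)

# Training sees only rgb<->depth and rgb<->segm; at test time the unseen
# depth->segm direction is served by mixing encoders and decoders:
x_depth = torch.randn(2, 1, 64, 64)
pred_segm = translate(x_depth, "depth", "segm")  # zero-pair translation

The appeal of this design is scalability: because any encoder can in principle be paired with any decoder, the number of trained components grows linearly with the number of modalities, rather than quadratically as in fully pairwise approaches.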

Updated: 2020-06-15