Mix and Match Networks: Cross-Modal Alignment for Zero-Pair Image-to-Image Translation

Abstract

This paper addresses the problem of inferring unseen cross-modal image-to-image translations between multiple modalities. We assume that only some of the pairwise translations have been seen (i.e. trained) and infer the remaining unseen translations (where training pairs are not available). We propose mix and match networks, an approach where multiple encoders and decoders are aligned in such a way that the desired translation can be obtained by simply cascading the source encoder and the target decoder, even when they have not interacted during the training stage (i.e. unseen). The main challenge lies in the alignment of the latent representations at the bottlenecks of encoder–decoder pairs. We propose an architecture with several tools to encourage alignment, including autoencoders and robust side information and latent consistency losses. We show the benefits of our approach in terms of effectiveness and scalability compared with other pairwise image-to-image translation approaches. We also propose zero-pair cross-modal image translation, a challenging setting where the objective is inferring semantic segmentation from depth (and vice versa) without explicit segmentation-depth pairs, and only from two (disjoint) segmentation-RGB and depth-RGB training sets. We observe that a certain part of the shared information between unseen modalities might not be reachable, so we further propose a variant that leverages pseudo-pairs, which allows us to exploit this shared information between the unseen modalities.
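
To make the cascading idea concrete, the following is a minimal PyTorch-style sketch of a zero-pair translation at test time, assuming toy encoder and decoder definitions with a shared latent size. All module names, channel sizes and the number of segmentation classes are placeholders; this is not the released implementation (see Note 1 for the actual code).

```python
import torch
import torch.nn as nn

LATENT_CH = 256  # size of the shared latent representation (placeholder)

class Encoder(nn.Module):
    """Maps one modality (e.g. depth) into the shared latent space."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, LATENT_CH, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Maps the shared latent space back to one modality (e.g. segmentation)."""
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(LATENT_CH, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

# Training only ever uses the seen pairs (depth<->RGB, segmentation<->RGB), plus
# autoencoders, side information and latent consistency losses to align the
# bottlenecks. An unseen pair is then obtained by simply cascading two of the
# trained networks:
enc_depth = Encoder(in_ch=1)    # depth encoder, trained only against RGB
dec_seg = Decoder(out_ch=14)    # segmentation decoder, trained only against RGB
depth = torch.randn(1, 1, 256, 256)
seg_logits = dec_seg(enc_depth(depth))  # zero-pair depth -> segmentation
```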

Notes

  1. The code is available online at http://github.com/yaxingwang/Mix-and-match-networks.

  2. Note that Johnson et al. (2016) refers to this as zero-shot translation. In this paper we refer to this setting as zero-pair to emphasize that what is unseen is paired data, and to avoid ambiguity with traditional zero-shot recognition, which typically refers to unseen samples.

  3. For simplicity, we will refer to the output semantic segmentation maps and depth as modalities rather than tasks, as done in some works.

  4. The RGB decoder does not use pooling indices, since in our experiments we observed undesired grid-like artifacts in the RGB output when we use them.

  5. We choose the opponent channels because they are less correlated than the R, G and B channels (Geusebroek et al. 2001); one common form of this transform is sketched below.
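
The following is a minimal sketch of one common RGB-to-opponent transform (a red-green, a yellow-blue and an intensity channel, following Geusebroek et al. 2001). The exact normalization is an assumption about the variant used here; it is included only to make the decorrelation argument concrete.

```python
import numpy as np

def rgb_to_opponent(img):
    """img: float array of shape (H, W, 3) holding the R, G, B channels."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    o1 = (r - g) / np.sqrt(2.0)             # red-green opponent channel
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)   # yellow-blue opponent channel
    o3 = (r + g + b) / np.sqrt(3.0)         # intensity channel
    return np.stack([o1, o2, o3], axis=-1)
```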

References

  • Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2016). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425–1438.

  • Alharbi, Y., Smith, N., & Wonka, P. (2019). Latent filter scaling for multimodal unsupervised image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1458–1466).

  • Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., & Courville, A. (2018). Augmented cyclegan: Learning many-to-many mappings from unpaired data. International Conference on Machine Learning.

  • Amodio, M., & Krishnaswamy, S. (2019). Travelgan: Image-to-image translation by transformation vector learning. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Anoosheh, A., Agustsson, E., Timofte, R., & Van Gool, L. (2018). Combogan: Unrestrained scalability for image domain translation. In 2018 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), http://dx.doi.org/10.1109/CVPRW.2018.00122.

  • Badrinarayanan, V., Handa, A., & Cipolla, R. (2015). Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Cadena, C., Dick, A. R., & Reid, I. D. (2016). Multi-modal auto-encoders as joint estimators for robotics scene understanding. In Robotics: Science and systems.

  • Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Learning aligned cross-modal representations from weakly aligned data. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2940–2949).

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks.

  • Chen, Y., Liu, Y., Cheng, Y., & Li, V. O. (2017). A teacher–student framework for zero-resource neural machine translation. Preprint arXiv:1705.00753.

  • Chen, Y. C., Xu, X., Tian, Z., & Jia, J. (2019). Homomorphic latent space interpolation for unpaired image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2408–2416).

  • Cheng, Y., Zhao, X., Cai, R., Li, Z., Huang, K., Rui, Y., et al. (2016). Semi-supervised multimodal deep learning for RGB-D object recognition. In Proceedings of the international joint conference on artificial intelligence.

  • Cho, W., Choi, S., Park, D. K., Shin, I., & Choo, J. (2019). Image-to-image translation via group-wise deep whitening-and-coloring transformation. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the international conference on computer vision (pp. 2650–2658).

  • Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust rgb-d object recognition. In Proceedings of the IEEE/RSJ conference on intelligent robots and systems (pp. 681–687), IEEE.

  • Fergus, R., Bernal, H., Weiss, Y., & Torralba, A. (2010). Semantic label sharing for learning with many categories. In Proceedings of the European conference on computer vision (pp. 762–775).

  • Firat, O., Cho, K., & Bengio, Y. (2016). Multi-way, multilingual neural machine translation with a shared attention mechanism. Preprint arXiv:1601.01073.

  • Fu, Y., Xiang, T., Jiang, Y. G., Xue, X., Sigal, L., & Gong, S. (2017). Recent advances in zero-shot recognition. Preprint arXiv:1710.04837.

  • Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning (pp. 1180–1189).

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414–2423).

  • Geusebroek, J. M., Van den Boomgaard, R., Smeulders, A. W. M., & Geerts, H. (2001). Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12), 1338–1350.

  • Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2066–2073), IEEE.

  • Gonzalez-Garcia, A., van de Weijer, J., & Bengio, Y. (2018). Image-to-image translation for cross-domain disentanglement. In Advances in neural information processing systems (pp. 1294–1305).

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).

  • Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Hoffman, J., Gupta, S., & Darrell, T. (2016a). Learning with side information through modality hallucination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 826–834).

  • Hoffman, J., Gupta, S., Leong, J., Guadarrama, S., & Darrell, T. (2016b). Cross-modal adaptation for rgb-d detection. In 2016 IEEE international conference on robotics and automation (ICRA) (pp. 5032–5039), IEEE.

  • Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (pp. 172–189).

  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Jayaraman, D., & Grauman, K. (2014). Zero-shot recognition with unreliable attributes. In Advances in neural information processing systems (pp. 3464–3472).

  • Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., et al. (2016). Google’s multilingual neural machine translation system: Enabling zero-shot translation. Preprint arXiv:1611.04558.

  • Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Kim, S., Park, K., Sohn, K., & Lin, S. (2016). Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In Proceedings of the European conference on computer vision (pp. 143–159), Springer.

  • Kim, T., Cha, M., Kim, H., Lee, J., & Kim, J. (2017). Learning to discover cross-domain relations with generative adversarial networks.

  • Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. In International conference on learning representations.

  • Kuga, R., Kanezaki, A., Samejima, M., Sugano, Y., & Matsushita, Y. (2017). Multi-task learning using multi-modal encoder–decoder networks with shared skip connections. In Proceedings of the international conference on computer vision.

  • Kuznietsov, Y., Stückler, J., Leibe, B. (2017). Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6647–6655).

  • Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of IEEE international conference on robotics and automation (pp. 1817–1824), IEEE.

  • Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In 2016 fourth international conference on 3D vision (3DV) (pp. 239–248), IEEE.

  • Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), 453–465.

  • Lee, H. Y., Tseng, H. Y., Huang, J. B., Singh, M., & Yang, M. H. (2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (pp. 35–51).

  • Li, Y., Liu, M. Y., Li, X., Yang, M. H., & Kautz, J. (2018). A closed-form solution to photorealistic image stylization. In Proceedings of the European conference on computer vision (pp. 453–468).

  • Lin, J., Xia, Y., Qin, T., Chen, Z., & Liu, T. Y. (2018). Conditional image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5524–5532).

  • Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), 2024–2039.

  • Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In Advances in neural information processing systems.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

  • Mao, X., Li, Q., Xie, H., Lau, R. Y., & Wang, Z. (2016). Multi-class generative adversarial networks with the l2 loss function. Preprint arXiv:1611.04076.

  • Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., & LeCun, Y. (2016). Disentangling factors of variation in deep representation using adversarial training. In Advances in neural information processing systems (pp. 5040–5048).

  • McCormac, J., Handa, A., Leutenegger, S., & Davison, A. J. (2017). Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In Proceedings of the international conference on computer vision.

  • Mejjati, Y. A., Richardt, C., Tompkin, J., Cosker, D., & Kim, K. I. (2018). Unsupervised attention-guided image-to-image translation. In Advances in neural information processing systems (pp. 3697–3707).

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. Preprint arXiv:1411.1784.

  • Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International conference on machine learning (pp. 689–696).

  • Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In Sixth Indian conference on computer vision, graphics & image processing, 2008. ICVGIP’08 (pp. 722–729), IEEE.

  • Perarnau, G., Van De Weijer, J., Raducanu, B., & Álvarez, J. M. (2016). Invertible conditional gans for image editing. Preprint arXiv:1611.06355.

  • Reed, S., Akata, Z., Lee, H., & Schiele, B. (2016). Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 49–58).

  • Rohrbach, M., Stark, M., & Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1641–1648), IEEE.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241), Springer.

  • Roy, A., & Todorovic, S. (2016). Monocular depth estimation using neural regression forest. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5506–5514).

  • Saito, K., Ushiku, Y., & Harada, T. (2017). Asymmetric tri-training for unsupervised domain adaptation.

  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In Proceedings of the European conference on computer vision (pp. 746–760), Springer.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition.

  • Song, S., Lichtenberg, S. P., & Xiao, J. (2015). Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567–576).

  • Song, X., Herranz, L., & Jiang, S. (2017). Depth CNNs for RGB-D scene recognition: Learning from scratch better than transferring from rgb-cnns. In Proceedings of the AAAI conference on artificial intelligence.

  • Taigman, Y., Polyak, A., & Wolf, L. (2017). Unsupervised cross-domain image generation.

  • Tsai, Y. H., Hung, W. C., Schulter, S., Sohn, K., Yang, M. H., & Chandraker, M. (2018). Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Valada, A., Oliveira, G. L., Brox, T., & Burgard, W. (2016). Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In International symposium on experimental robotics (pp. 465–477), Springer.

  • Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., & Yuille, A. L. (2015). Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2800–2809).

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018a). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8798–8807).

  • Wang, W., & Neumann, U. (2018). Depth-aware CNN for RGB-D segmentation. In Proceedings of the European conference on computer vision (pp. 135–150).

  • Wang, Y., Gonzalez-Garcia, A., van de Weijer, J., & Herranz, L. (2019). Sdit: Scalable and diverse cross-domain image translation. Preprint arXiv:1908.06881.

  • Wang, Y., van de Weijer, J., & Herranz, L. (2018b). Mix and match networks: Encoder–decoder alignment for zero-pair image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5467–5476).

  • Wu, W., Cao, K., Li, C., Qian, C., & Loy, C. C. (2019). Transgaga: Geometry-aware unsupervised image-to-image translation. In The IEEE conference on computer vision and pattern recognition (CVPR).

  • Wu, Z., Han, X., Lin, Y. L., Uzunbas, M. G., Goldstein, T., Lim, S. N., & Davis, L. S. (2018). Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European conference on computer vision.

  • Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2018a). Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Xian, Y., Lorenz, T., Schiele, B., & Akata, Z. (2018b). Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5542–5551).

  • Xu, D., Ouyang, W., Ricci, E., Wang, X., & Sebe, N. (2017). Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5363–5371).

  • Yi, Z., Zhang, H., Tan, P., & Gong, M. (2017). Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the international conference on computer vision.

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions.

  • Yu, L., Zhang, L., van de Weijer, J., Khan, F. S., Cheng, Y., & Parraga, C. A. (2018). Beyond eleven color names for image understanding. Machine Vision and Applications, 29(2), 361–373.

  • Zhang, L., Gonzalez-Garcia, A., van de Weijer, J., Danelljan, M., & Khan, F. S. (2019). Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing, 28(4), 1837–1850.

  • Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In Proceedings of the European conference on computer vision.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881–2890).

  • Zheng, H., Cheng, Y., & Liu, Y. (2017). Maximum expected likelihood estimation for zero-resource neural machine translation. In Proceedings of the international joint conference on artificial intelligence.

  • Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017a). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the international conference on computer vision.

  • Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017b). Toward multimodal image-to-image translation. In Advances in neural information processing systems (pp. 465–476).

  • Zou, Y., Yu, Z., Vijaya Kumar, B., & Wang, J. (2018). Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision.

Acknowledgements

The Titan Xp used for this research was donated by the NVIDIA Corporation. We acknowledge the Spanish Projects TIN2016-79717-R and RTI2018-102285-A-I00, the CHISTERA Project M2CR (PCIN-2015-251) and the CERCA Programme / Generalitat de Catalunya. Herranz also acknowledges the European Union’s H2020 research under Marie Sklodowska-Curie Grant No. 665919. Yaxing Wang acknowledges the Chinese Scholarship Council (CSC) Grant No. 201507040048.

Author information

Corresponding author

Correspondence to Yaxing Wang.

Additional information

Communicated by Chen Change Loy.

Appendices

A Appendix: Network Architecture on RGB-D or RGB-D-NIR Dataset

Table 9 shows the architecture (convolutional and pooling layers) of the encoders used in the cross-modal experiment. Tables 10 and 11 show the corresponding decoders, and Table 12 shows the discriminator used for RGB. Every convolutional layer of the encoders, decoders and the discriminator is followed by a batch normalization layer and a ReLU layer (LeakyReLU for the discriminator). The only exception is the RGB encoder, which is initialized with weights from the VGG16 model pretrained on ImageNet (Simonyan and Zisserman 2015) and does not use batch normalization. The abbreviations used are listed in Table 16.

Table 9 The architecture of the encoder of RGB, depth, NIR and semantic segmentation
Table 10 The architecture of the decoder of depth, NIR and semantic segmentation
Table 11 The architecture of the decoder of RGB
Table 12 RGB discriminator
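
The block structure described above can be summarized with the following minimal PyTorch-style sketch. Channel and kernel sizes are placeholders rather than the exact values of Tables 9, 10, 11 and 12, and the module names are illustrative only.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, leaky=False):
    """Conv + batch norm + (Leaky)ReLU, the repeating unit of the encoders,
    decoders and discriminator described above."""
    act = nn.LeakyReLU(0.2) if leaky else nn.ReLU()
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        act,
    )

class EncoderStage(nn.Module):
    """One encoder stage: conv block followed by max pooling that keeps its indices."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = conv_block(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        return self.pool(self.conv(x))  # returns (features, pooling indices)

class DecoderStage(nn.Module):
    """Depth/NIR/segmentation decoder stage: unpool with the encoder's indices.
    The RGB decoder instead upsamples without pooling indices (see Note 4)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.conv = conv_block(in_ch, out_ch)

    def forward(self, x, indices):
        return self.conv(self.unpool(x, indices))
```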

B Appendix: Network Architecture on the Color Dataset and the Artworks Dataset

We use several datasets to verify the generality of our method, including objects (Color) and scenes (Artworks).

Color dataset (Yu et al. 2018). We use the object color dataset collected by Yu et al. (2018), which includes 11 color labels with 1000 images per category. We resize all images to \(128 \times 128\).

Artworks (Zhu et al. 2017a). We also illustrate M&MNet in an artwork setting, with real images (photo) and four artistic styles (Monet, van Gogh, Ukiyo-e and Cezanne). The set contains 3000 (photo), 800 (Ukiyo-e), 500 (van Gogh), 600 (Cezanne) and 1200 (Monet) images. All images are resized to \(256 \times 256\).

We use Adam (Kingma and Ba 2014) with a batch size of 4 and a learning rate of 0.0002. The network weights are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.5. We train this model with only the adversarial loss.
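
A minimal sketch of this training setup is given below. The generator and discriminator modules are placeholders, and the Adam betas are left at their defaults as an assumption; only the learning rate, batch size and the 0.5 standard deviation come from the text above.

```python
import torch
import torch.nn as nn

def gaussian_init(m):
    """Zero-mean Gaussian initialization with standard deviation 0.5 (as stated above)."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.5)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

generator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))     # placeholder network
discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))  # placeholder network
generator.apply(gaussian_init)
discriminator.apply(gaussian_init)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
batch_size = 4
```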

Tables 13, 14 and 15 show the architectures of the encoder, image decoder and discriminator used in these experiments. The tables are given for \(128 \times 128\) inputs; the Artworks dataset uses the same architecture, adapted to its \(256 \times 256\) resolution. The abbreviations used are listed in Table 16.

Table 13 The architecture of the encoder for \(128\times 128\) input
Table 14 The architecture of the decoder for \(128\times 128\) output
Table 15 The architecture of the discriminator for \(128\times 128\) input
Table 16 Abbreviations used in other tables

C Appendix: Network Architecture for the Flower Dataset

Flower dataset (Nilsback and Zisserman 2008). The Flower dataset consists of 102 categories, of which we consider 10 (passionflower, petunia, rose, wallflower, watercress, waterlily, cyclamen, foxglove, frangipani, hibiscus). Each category includes between 100 and 258 images. We resize the images to \(128 \times 128\). As before, we optimize our model with Adam (Kingma and Ba 2014), a batch size of 4 and a learning rate of 0.0002. We initialize the weights from a zero-mean Gaussian distribution with a standard deviation of 0.5. We use an adversarial loss and an L2 loss to train \(\Theta _3\), and only the L2 loss for \(\Theta _1\) and \(\Theta _2\).
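
The loss assignment above can be sketched as follows. Only the split (adversarial plus L2 for \(\Theta _3\), L2 only for \(\Theta _1\) and \(\Theta _2\)) comes from the text; the least-squares form of the adversarial term and the equal weighting are assumptions.

```python
import torch
import torch.nn as nn

l2 = nn.MSELoss()

def generator_loss(pred, target, disc_out=None, adv_weight=1.0):
    """L2 reconstruction loss, plus an adversarial term when a discriminator
    output is supplied (i.e. for Theta_3 only; Theta_1 and Theta_2 use pure L2)."""
    loss = l2(pred, target)
    if disc_out is not None:
        # Least-squares GAN objective: push outputs for generated images towards the real label (1).
        loss = loss + adv_weight * l2(disc_out, torch.ones_like(disc_out))
    return loss
```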

Tables 17 and 18 detail the architecture of the encoder and decoder, respectively, of the two single-channel modalities \(\Theta _1\) and \(\Theta _2\). The encoder and decoder for the third modality \(\Theta _3\) are analogous, adapted to three input and output channels. For \(\Theta _3\) we also use the discriminator detailed in Table 15.

Table 17 The architecture of the encoder of \(\Theta _1\) and \( \Theta _2\)
Table 18 The architecture of the decoder for \(\Theta _1\) and \( \Theta _2\)

Cite this article

Wang, Y., Herranz, L. & van de Weijer, J. Mix and Match Networks: Cross-Modal Alignment for Zero-Pair Image-to-Image Translation. Int J Comput Vis 128, 2849–2872 (2020). https://doi.org/10.1007/s11263-020-01340-z
