Multi-scale generative adversarial inpainting network based on cross-layer attention transfer mechanism

https://doi.org/10.1016/j.knosys.2020.105778

Abstract

Deep learning-based methods have recently shown promising results in image inpainting. These methods generate patches with visually plausible image structures and textures that are semantically coherent with the context of surrounding regions. However, existing methods tend to generate artifacts inconsistent with the surrounding regions, especially when dealing with complex images. To address the limitations of current deep learning-based methods, this paper proposes a multi-scale generative adversarial network model based on a cross-layer attention transfer mechanism. A Cross-Layer Attention Transfer Module (CL-ATM) is presented to guide the filling of the corresponding low-level semantic feature map using the high-level semantic feature map, so as to ensure the visual and semantic consistency of inpainting. Meanwhile, a multi-scale generator and multi-scale discriminators are added to the network structure. Discriminators at different scales have different receptive fields, which enables the generator to produce images with better global consistency and more details. Qualitative and quantitative experiments show that our method achieves superior performance against state-of-the-art inpainting models.

Introduction

With the rapid development of deep learning in the field of computer vision, research on image editing and image generation has achieved remarkable results. Image inpainting is an active research topic involving both image editing and generation [1], [2], [3], and can also be extended to tasks including image un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization, and many other computer vision problems [4], [5], [6], [7], [8], [9].

Image inpainting refers to filling in the missing pixels of a damaged image [10] given corresponding masks, as shown in Fig. 1. Humans can easily restore an image based on the surrounding content, but the task is especially difficult for computers, since there is no unique result and no universally applicable method. Currently, the key issues in inpainting are how to help the model make full use of the available information and how to judge whether the results are sufficiently authentic and plausible.
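In practice, the model sees only the undamaged pixels and the mask. Below is a minimal sketch of how the masked input and final composite are typically formed in such pipelines; the function and variable names are our illustrative assumptions, not the paper's code:

```python
import torch

def make_damaged_input(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (N, 3, H, W) ground truth; mask: (N, 1, H, W) float, 1 = missing pixel."""
    # Zero out the missing region; the inpainting network must predict it back.
    return image * (1.0 - mask)

def composite(generated: torch.Tensor, image: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    # Keep known pixels untouched and take generated content only inside
    # the hole, as is standard in inpainting pipelines.
    return generated * mask + image * (1.0 - mask)
```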

Existing image inpainting works can be roughly divided into two categories: traditional methods [1], [11], [12] and deep learning-based methods [3], [13], [14], [15], as shown in Table 1. Traditional methods either match and replicate background patches from low resolution to high resolution, or synthesize textures by propagating the appearance of neighboring regions into the target holes. These methods can synthesize detailed results and have been widely used in practice. However, owing to their lack of high-level understanding of image semantics, traditional methods cannot produce semantically reasonable results. In recent years, the emergence of deep neural networks such as the Convolutional Neural Network (CNN) [16], [17] and the Generative Adversarial Network (GAN) [18] has greatly improved the performance of tasks such as classification, object detection, semantic segmentation, and feature selection [19], [20], [21], [22], [23], [24]. Deep learning-based methods first encode the semantic context of an image into a latent feature space through a deep neural network, and then generate semantically relevant patches through a generative model. Meanwhile, consistency between generated pixels and existing pixels is encouraged by adversarial training. However, methods based on deep neural networks such as the context encoder (CE) [3] and the global-local model (GL) [13] tend to produce boundary artifacts and distorted structures, mainly because CNNs are inefficient at explicitly borrowing or reproducing information from distant spatial locations. Some recent studies attempt to solve this problem by using information from known regions to estimate the missing pixels. For example, the contextual attention layer [25] and the patch-swap layer [26], proposed by Yu et al. and Song et al. respectively, fill in missing pixels with similar patches from undamaged regions in high-level feature maps. Nevertheless, the visual consistency of the repaired image remains a major challenge.

To ensure the consistency of visual effect and semantics, and to obtain high-quality results, we propose a novel multi-scale GAN with a cross-layer attention transfer mechanism. Our model consists of two stages. The first-stage network, which uses dilated convolutions, coarsely reconstructs the missing areas under a reconstruction loss. The cross-layer attention transfer mechanism is added to the second-stage network. To begin with, a cross-layer attention transfer module (CL-ATM) is proposed. We calculate the cosine similarity between patches of the missing areas and the known areas in a high-level semantic feature map, and then obtain the attention score map through a softmax. The attention score map is used to guide the transfer of related features from known areas to missing areas in the lower-level feature map with higher resolution. Furthermore, the second-stage generator uses deep feature maps to generate repaired images of different sizes; meanwhile, multi-scale GAN discriminators are added on top of the global and local discriminators. The multi-scale discriminators are trained on images with different resolutions, which guides the generator to produce images with more details and better visual effects. The entire network is trained with two stages of reconstruction losses and four WGAN-GP losses.
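To make the CL-ATM computation concrete, the following is a minimal sketch of one cross-layer attention transfer step, assuming 3×3 patches, a batch of one, a non-empty known region, and a low-level map at twice the high-level resolution. The function name, tensor shapes, and unfold/fold patch handling are our assumptions, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def cl_atm(high_feat, low_feat, mask, patch=3):
    """Sketch of one cross-layer attention transfer step.

    high_feat: (1, C_h, H, W)   high-level semantic feature map
    low_feat:  (1, C_l, 2H, 2W) lower-level feature map (higher resolution)
    mask:      (1, 1, H, W)     float, 1 inside the missing region, 0 elsewhere
    """
    H, W = high_feat.shape[2:]
    pad = patch // 2

    # 1. Cosine similarity between all patch pairs of the high-level map:
    #    l2-normalised patch vectors make inner products cosine similarities.
    cols = F.unfold(high_feat, kernel_size=patch, padding=pad)   # (1, C_h*p*p, H*W)
    cols = F.normalize(cols, dim=1)
    sim = torch.bmm(cols.transpose(1, 2), cols)                  # (1, H*W, H*W)

    # 2. Attention scores: each location may attend only to known-region patches.
    known = (1.0 - mask).flatten(1)                              # (1, H*W), 1 = known
    sim = sim.masked_fill(known.unsqueeze(1) < 0.5, float("-inf"))
    attn = torch.softmax(sim, dim=2)                             # attention score map

    # 3. Transfer: reuse the scores to recombine patches of the low-level map.
    low = F.interpolate(low_feat, size=(H, W), mode="bilinear",
                        align_corners=False)
    low_cols = F.unfold(low, kernel_size=patch, padding=pad)     # (1, C_l*p*p, H*W)
    filled = torch.bmm(low_cols, attn.transpose(1, 2))           # weighted patch sum
    filled = F.fold(filled, output_size=(H, W), kernel_size=patch, padding=pad)
    # fold() sums overlapping patches, so divide by the overlap count.
    overlap = F.fold(torch.ones_like(low_cols), output_size=(H, W),
                     kernel_size=patch, padding=pad)
    filled = filled / overlap

    # 4. Upsample and composite: keep known features, fill only missing ones.
    filled = F.interpolate(filled, size=low_feat.shape[2:], mode="bilinear",
                           align_corners=False)
    up_mask = F.interpolate(mask, size=low_feat.shape[2:])
    return low_feat * (1.0 - up_mask) + filled * up_mask
```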

In short, the main contributions of our work include:

1. We propose a multi-scale GAN consisting of one generator and four adversarial discriminators, which employs reconstruction and adversarial losses to synthesize the missing content of damaged images. The WGAN-GP loss and dilated convolutions are used to improve training stability and speed (a sketch of this loss is given after this list).

2. We introduce CL-ATM to utilize the attention scores learned from the high-level semantic feature map. These scores guide the feature transfer of the low-level semantic feature map in the corresponding layer (filling missing regions with features from known regions), ensuring the visual and semantic consistency of the generated image.

3. Experiments on the Places2, CelebA, ImageNet, DTD, COCO, and Paris StreetView datasets validate the effectiveness of our method. Compared with existing state-of-the-art methods, our method can generate higher-quality images.
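As referenced in contribution 1, the discriminators are trained with the WGAN-GP objective. Below is a hedged sketch of the standard WGAN-GP critic loss for a single discriminator (the generic Gulrajani et al. formulation; the paper uses four such discriminators, and the function and variable names here are our illustrative assumptions):

```python
import torch

def wgan_gp_critic_loss(discriminator, real, fake, lambda_gp=10.0):
    """Standard WGAN-GP critic objective for one discriminator.
    real, fake: (N, 3, H, W) image batches."""
    # Wasserstein term: push real scores up and fake scores down.
    loss = discriminator(fake.detach()).mean() - discriminator(real).mean()

    # Gradient penalty on random interpolates between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    grads = torch.autograd.grad(outputs=discriminator(x_hat).sum(),
                                inputs=x_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lambda_gp * penalty
```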

Section snippets

Traditional methods

Traditional inpainting methods such as diffusion-based methods spread adjacent information into missing areas. For instance, Bertalmio et al. [10] used a diffusion equation to iteratively propagate low-level features from known areas into missing areas along the mask boundaries. Ballester et al. [27] proposed to solve a variational problem via gradient descent flow to compute the interpolation, automatically and smoothly diffusing pixels into missing regions. Although these …
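For intuition, a toy isotropic-diffusion inpainter is sketched below. It is not Bertalmio et al.'s method (which propagates along isophotes rather than averaging isotropically) but illustrates the basic idea of iteratively spreading neighboring pixels into the hole:

```python
import numpy as np

def diffusion_inpaint(image: np.ndarray, mask: np.ndarray,
                      iters: int = 500) -> np.ndarray:
    """Toy diffusion inpainting. image: (H, W) float array;
    mask: (H, W) bool, True where pixels are missing."""
    out = image.copy()
    out[mask] = 0.0
    for _ in range(iters):
        # Jacobi step: replace each missing pixel by the mean of its
        # 4-neighbourhood (np.roll wraps at the border; fine for a toy).
        up, down = np.roll(out, 1, axis=0), np.roll(out, -1, axis=0)
        left, right = np.roll(out, 1, axis=1), np.roll(out, -1, axis=1)
        out[mask] = 0.25 * (up + down + left + right)[mask]
    return out
```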

Multi-scale GAN model

In this section, we describe our method component by component. We first introduce the two-stage generator, the CL-ATM, and the discriminators; subsequently, the learning objective of our method is presented. We build our inpainting network by refactoring and improving the state-of-the-art inpainting model CA [25], a two-stage generative network for inpainting. The proposed network architecture is shown in Fig. 2.
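To illustrate the multi-scale discriminator design, the following is a simplified sketch in which the same PatchGAN-style critic is instantiated at several image scales, so coarser critics judge global structure while finer ones judge local texture. The class, layer depths, and channel widths are our assumptions, not the paper's exact architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminators(nn.Module):
    def __init__(self, num_scales=3, base=64):
        super().__init__()
        def critic():
            # A small PatchGAN-style critic; no sigmoid, since WGAN-GP
            # operates on unbounded critic scores.
            return nn.Sequential(
                nn.Conv2d(3, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 2, 1, 4, padding=1),
            )
        self.critics = nn.ModuleList([critic() for _ in range(num_scales)])

    def forward(self, x):
        scores = []
        for critic in self.critics:
            scores.append(critic(x))   # patch-level realism scores at this scale
            x = F.avg_pool2d(x, 2)     # halve resolution for the next critic
        return scores
```

Because each critic sees the image at a different resolution, its effective receptive field relative to the original image differs, which matches the paper's motivation of combining global consistency with local detail.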

Experiments

In this section, we evaluate the inpainting performance of the proposed multi-scale GAN model based on CL-ATM both quantitatively and qualitatively. Section 4.1 provides details of the experimental setup, and Section 4.2 describes the experimental results and validates the model.

Conclusion

In recent years, with the rapid development of deep learning, image inpainting has become one of the research hotspots in computer vision. To tackle the shortcomings of deep learning-based methods and to improve inpainting quality, we propose a novel multi-scale GAN with a cross-layer attention transfer mechanism. CL-ATM is added to use the attention score map learned from the high-level semantic feature map to guide the feature transfer of the low-level semantic feature map in …

CRediT authorship contribution statement

Mingwen Shao: Conceptualization, Supervision, Validation. Wentao Zhang: Investigation, Writing - original draft. Wangmeng Zuo: Methodology, Data curation. Deyu Meng: Formal analysis, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by the grants from the National Natural Science Foundation of China (Nos. 61673396, U19A2073, 61976245).

References

  • F. Pérez-Hernández et al., Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance, Knowl.-Based Syst. (2020)
  • H. Shi et al., Automated heartbeat classification based on deep neural network with multiple input layers, Knowl.-Based Syst. (2020)
  • A. Criminisi et al., Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process. (2004)
  • A. Levin, A. Zomet, S. Peleg, et al., Seamless image stitching in the gradient domain, in: Proceedings of the European...
  • D. Pathak, P. Krahenbuhl, J. Donahue, et al., Context encoders: Feature learning by inpainting, in: Proceedings of the...
  • Y. He, C. Zhu, J. Wang, et al., Bounding box regression with uncertainty for accurate object detection, in: Proceedings...
  • R. Zhu et al., ScratchDet: Exploring to train single-shot object detectors from scratch (2018)
  • X. Jia, X. Wei, X. Cao, et al., ComDefend: An efficient image compression model to defend adversarial examples, in:...
  • W. Tao, F. Jiang, S. Zhang, et al., An end-to-end compression framework based on convolutional neural networks, in:...
  • W. Zhang, Y. Liu, C. Dong, et al., RankSRGAN: Generative adversarial networks with ranker for image super-resolution,...
  • Z. Li, J. Yang, Z. Liu, et al., Feedback network for image super-resolution, in: Proceedings of the IEEE Conference on...
  • M. Bertalmio, G. Sapiro, V. Caselles, et al., Image inpainting, in: Proceedings of the 27th Annual Conference on...
  • C. Barnes et al., PatchMatch: A randomized correspondence algorithm for structural image editing, ACM Trans. Graph. (2009)
  • J. Sun et al., Image completion with structure propagation, ACM Trans. Graph. (2005)
  • S. Iizuka et al., Globally and locally consistent image completion, ACM Trans. Graph. (2017)
  • Y. Li, S. Liu, J. Yang, et al., Generative face completion, in: Proceedings of the IEEE Conference on Computer Vision...
  • R.A. Yeh et al., Semantic image inpainting with perceptual and contextual losses (2016)
  • K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybernet. (1980)
  • Y. LeCun et al., Backpropagation applied to handwritten zip code recognition, Neural Comput. (1989)
  • I. Goodfellow et al., Generative adversarial nets
  • C. Zhao et al., Multi-source domain adaptation with joint learning for cross-domain sentiment classification, Knowl.-Based Syst. (2019)
  • X. Wang, A. Shrivastava, A. Gupta, A-Fast-RCNN: Hard positive generation via adversary for object detection, in:...
  • J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE...
  • N. Souly, C. Spampinato, M. Shah, Semi-supervised semantic segmentation using generative adversarial network, in:...
  • Y. Yang et al., Feature selection for multimedia analysis by sharing information among multiple tasks, IEEE Trans. Multimedia (2013)
  • Y. Song, C. Yang, Z. Lin, et al., Contextual-based image inpainting: Infer, match, and translate, in: Proceedings of the...