Multi-scale generative adversarial inpainting network based on cross-layer attention transfer mechanism

https://doi.org/10.1016/j.knosys.2020.105778

Abstract

Deep learning-based methods have recently shown promising results in image inpainting. These methods generate patches with visually plausible image structures and textures that are semantically coherent with the context of surrounding regions. However, existing methods tend to generate artifacts inconsistent with the surrounding regions, especially when dealing with complex images. To address the limitations of current deep learning-based methods, this paper proposes a multi-scale generative adversarial network model based on a cross-layer attention transfer mechanism. A Cross-Layer Attention Transfer Module (CL-ATM) is presented to guide the filling of the corresponding low-level semantic feature map using the high-level semantic feature map, so as to ensure the visual and semantic consistency of inpainting. Meanwhile, a multi-scale generator and multi-scale discriminators are added to the network structure. Discriminators at different scales have different receptive fields, which enables the generator to produce images with better global consistency and more details. Qualitative and quantitative experiments show that our method achieves superior performance against state-of-the-art inpainting models.

Introduction

With the rapid development of deep learning in the field of computer vision, research on image editing and image generation has achieved remarkable results. Image inpainting is an active research topic involving both image editing and generation [1], [2], [3], and can also be extended to tasks including image un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization, and many other computer vision problems [4], [5], [6], [7], [8], [9].

Image inpainting refers to filling in the missing pixels of a damaged image [10] given corresponding masks, as shown in Fig. 1. Humans can easily restore an image based on the surrounding content, but the task is especially difficult for computers, since there is no unique result and no universally applicable method. Currently, the key issues in inpainting are how to help the model make full use of the available information and how to judge whether the results are sufficiently authentic and plausible.
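In practice, the model sees only the undamaged pixels and the mask. Below is a minimal sketch of how the masked input and final composite are typically formed in such pipelines; the function and variable names are our illustrative assumptions, not the paper's code:

```python
import torch

def make_damaged_input(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (N, 3, H, W) ground truth; mask: (N, 1, H, W) float, 1 = missing pixel."""
    # Zero out the missing region; the inpainting network must predict it back.
    return image * (1.0 - mask)

def composite(generated: torch.Tensor, image: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    # Keep known pixels untouched and take generated content only inside
    # the hole, as is standard in inpainting pipelines.
    return generated * mask + image * (1.0 - mask)
```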

Existing image inpainting works can be roughly divided into two categories: traditional methods [1], [11], [12] and deep learning-based methods [3], [13], [14], [15], as shown in Table 1. Traditional methods either match and replicate background patches from low resolution to high resolution, or synthesize textures by propagating the appearance of neighboring regions into the target holes. These methods can synthesize detailed results and have been widely used in practice. However, owing to their lack of high-level understanding of image semantics, traditional methods cannot produce semantically reasonable results. In recent years, the emergence of deep neural networks such as the Convolutional Neural Network (CNN) [16], [17] and the Generative Adversarial Network (GAN) [18] has greatly improved the performance of tasks such as classification, object detection, semantic segmentation, and feature selection [19], [20], [21], [22], [23], [24]. Deep learning-based methods first encode the semantic context of an image into a latent feature space through a deep neural network, and then generate semantically relevant patches through a generative model. Meanwhile, consistency between generated pixels and existing pixels is encouraged by adversarial training. However, methods based on deep neural networks such as the context encoder (CE) [3] and the global-local model (GL) [13] tend to produce boundary artifacts and distorted structures, mainly because CNNs are inefficient at explicitly borrowing or reproducing information from distant spatial locations. Some recent studies attempt to solve this problem by using information from known regions to estimate the missing pixels. For example, the contextual attention layer [25] and the patch-swap layer [26], proposed by Yu et al. and Song et al. respectively, fill in missing pixels with similar patches from undamaged regions in high-level feature maps. Nevertheless, the visual consistency of the repaired image remains a major challenge.

To ensure the consistency of visual effect and semantics, and to obtain high-quality results, we propose a novel multi-scale GAN with a cross-layer attention transfer mechanism. Our model consists of two stages. The first-stage network, which uses dilated convolutions, coarsely reconstructs the missing areas under a reconstruction loss. The cross-layer attention transfer mechanism is added to the second-stage network. To begin with, a cross-layer attention transfer module (CL-ATM) is proposed. We calculate the cosine similarity between patches of the missing areas and the known areas in a high-level semantic feature map, and then obtain the attention score map through a softmax. The attention score map is used to guide the transfer of related features from known areas to missing areas in the lower-level feature map with higher resolution. Furthermore, the second-stage generator uses deep feature maps to generate repaired images of different sizes; meanwhile, multi-scale GAN discriminators are added on top of the global and local discriminators. The multi-scale discriminators are trained on images with different resolutions, which guides the generator to produce images with more details and better visual effects. The entire network is trained with two stages of reconstruction losses and four WGAN-GP losses.
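To make the CL-ATM computation concrete, the following is a minimal sketch of one cross-layer attention transfer step, assuming 3×3 patches, a batch of one, a non-empty known region, and a low-level map at twice the high-level resolution. The function name, tensor shapes, and unfold/fold patch handling are our assumptions, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def cl_atm(high_feat, low_feat, mask, patch=3):
    """Sketch of one cross-layer attention transfer step.

    high_feat: (1, C_h, H, W)   high-level semantic feature map
    low_feat:  (1, C_l, 2H, 2W) lower-level feature map (higher resolution)
    mask:      (1, 1, H, W)     float, 1 inside the missing region, 0 elsewhere
    """
    H, W = high_feat.shape[2:]
    pad = patch // 2

    # 1. Cosine similarity between all patch pairs of the high-level map:
    #    l2-normalised patch vectors make inner products cosine similarities.
    cols = F.unfold(high_feat, kernel_size=patch, padding=pad)   # (1, C_h*p*p, H*W)
    cols = F.normalize(cols, dim=1)
    sim = torch.bmm(cols.transpose(1, 2), cols)                  # (1, H*W, H*W)

    # 2. Attention scores: each location may attend only to known-region patches.
    known = (1.0 - mask).flatten(1)                              # (1, H*W), 1 = known
    sim = sim.masked_fill(known.unsqueeze(1) < 0.5, float("-inf"))
    attn = torch.softmax(sim, dim=2)                             # attention score map

    # 3. Transfer: reuse the scores to recombine patches of the low-level map.
    low = F.interpolate(low_feat, size=(H, W), mode="bilinear",
                        align_corners=False)
    low_cols = F.unfold(low, kernel_size=patch, padding=pad)     # (1, C_l*p*p, H*W)
    filled = torch.bmm(low_cols, attn.transpose(1, 2))           # weighted patch sum
    filled = F.fold(filled, output_size=(H, W), kernel_size=patch, padding=pad)
    # fold() sums overlapping patches, so divide by the overlap count.
    overlap = F.fold(torch.ones_like(low_cols), output_size=(H, W),
                     kernel_size=patch, padding=pad)
    filled = filled / overlap

    # 4. Upsample and composite: keep known features, fill only missing ones.
    filled = F.interpolate(filled, size=low_feat.shape[2:], mode="bilinear",
                           align_corners=False)
    up_mask = F.interpolate(mask, size=low_feat.shape[2:])
    return low_feat * (1.0 - up_mask) + filled * up_mask
```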

In short, the main contributions of our work include:

1. We propose a multi-scale GAN consisting of one generator and four adversarial discriminators, which employs reconstruction and adversarial losses to synthesize the missing content of damaged images. The WGAN-GP loss and dilated convolutions are used to improve training stability and speed (a sketch of this loss is given after this list).

2. We introduce CL-ATM to utilize the attention scores learned from the high-level semantic feature map. These scores guide the feature transfer of the low-level semantic feature map in the corresponding layer (filling missing regions with features from known regions), ensuring the visual and semantic consistency of the generated image.

3. Experiments on the Places2, CelebA, ImageNet, DTD, COCO, and Paris StreetView datasets validate the effectiveness of our method. Compared with existing state-of-the-art methods, our method can generate higher-quality images.
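As referenced in contribution 1, the discriminators are trained with the WGAN-GP objective. Below is a hedged sketch of the standard WGAN-GP critic loss for a single discriminator (the generic Gulrajani et al. formulation; the paper uses four such discriminators, and the function and variable names here are our illustrative assumptions):

```python
import torch

def wgan_gp_critic_loss(discriminator, real, fake, lambda_gp=10.0):
    """Standard WGAN-GP critic objective for one discriminator.
    real, fake: (N, 3, H, W) image batches."""
    # Wasserstein term: push real scores up and fake scores down.
    loss = discriminator(fake.detach()).mean() - discriminator(real).mean()

    # Gradient penalty on random interpolates between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    grads = torch.autograd.grad(outputs=discriminator(x_hat).sum(),
                                inputs=x_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lambda_gp * penalty
```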

Section snippets

Traditional methods

Traditional inpainting methods such as diffusion-based methods spread adjacent information into missing areas. For instance, Bertalmio et al. [10] used a diffusion equation to iteratively propagate low-level features from known areas into missing areas along the mask boundaries. Ballester et al. [27] proposed to solve a variational problem via gradient descent flow to compute the interpolation, automatically and smoothly diffusing pixels into missing regions. Although these …
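For intuition, a toy isotropic-diffusion inpainter is sketched below. It is not Bertalmio et al.'s method (which propagates along isophotes rather than averaging isotropically) but illustrates the basic idea of iteratively spreading neighboring pixels into the hole:

```python
import numpy as np

def diffusion_inpaint(image: np.ndarray, mask: np.ndarray,
                      iters: int = 500) -> np.ndarray:
    """Toy diffusion inpainting. image: (H, W) float array;
    mask: (H, W) bool, True where pixels are missing."""
    out = image.copy()
    out[mask] = 0.0
    for _ in range(iters):
        # Jacobi step: replace each missing pixel by the mean of its
        # 4-neighbourhood (np.roll wraps at the border; fine for a toy).
        up, down = np.roll(out, 1, axis=0), np.roll(out, -1, axis=0)
        left, right = np.roll(out, 1, axis=1), np.roll(out, -1, axis=1)
        out[mask] = 0.25 * (up + down + left + right)[mask]
    return out
```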

Multi-scale GAN model

In this section, we describe our method component by component. We first introduce the two-stage generator, the CL-ATM, and the discriminators; subsequently, the learning objective of our method is presented. We build our inpainting network by refactoring and improving the state-of-the-art inpainting model CA [25], a two-stage generative network for inpainting. The proposed network architecture is shown in Fig. 2.
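To illustrate the multi-scale discriminator design, the following is a simplified sketch in which the same PatchGAN-style critic is instantiated at several image scales, so coarser critics judge global structure while finer ones judge local texture. The class, layer depths, and channel widths are our assumptions, not the paper's exact architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminators(nn.Module):
    def __init__(self, num_scales=3, base=64):
        super().__init__()
        def critic():
            # A small PatchGAN-style critic; no sigmoid, since WGAN-GP
            # operates on unbounded critic scores.
            return nn.Sequential(
                nn.Conv2d(3, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 2, 1, 4, padding=1),
            )
        self.critics = nn.ModuleList([critic() for _ in range(num_scales)])

    def forward(self, x):
        scores = []
        for critic in self.critics:
            scores.append(critic(x))   # patch-level realism scores at this scale
            x = F.avg_pool2d(x, 2)     # halve resolution for the next critic
        return scores
```

Because each critic sees the image at a different resolution, its effective receptive field relative to the original image differs, which matches the paper's motivation of combining global consistency with local detail.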

Experiments

In this section, we evaluate the inpainting performance of the proposed multi-scale GAN model based on CL-ATM both quantitatively and qualitatively. Section 4.1 provides details of the experimental setup, and Section 4.2 describes the experimental results and validates the model.

Conclusion

In recent years, with the rapid development of deep learning, image inpainting has become one of the research hotspots in computer vision. To tackle the shortcomings of deep learning-based methods and to improve inpainting quality, we propose a novel multi-scale GAN with a cross-layer attention transfer mechanism. CL-ATM is added to use the attention score map learned from the high-level semantic feature map to guide the feature transfer of the low-level semantic feature map in …

CRediT authorship contribution statement

Mingwen Shao: Conceptualization, Supervision, Validation. Wentao Zhang: Investigation, Writing - original draft. Wangmeng Zuo: Methodology, Data curation. Deyu Meng: Formal analysis, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by the grants from the National Natural Science Foundation of China (Nos. 61673396, U19A2073, 61976245).

References

  • F. Pérez-Hernández et al., Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance, Knowl.-Based Syst. (2020)
  • H. Shi et al., Automated heartbeat classification based on deep neural network with multiple input layers, Knowl.-Based Syst. (2020)
  • A. Criminisi et al., Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process. (2004)
  • A. Levin, A. Zomet, S. Peleg, et al., Seamless image stitching in the gradient domain, in: Proceedings of the European...
  • D. Pathak, P. Krahenbuhl, J. Donahue, et al., Context encoders: Feature learning by inpainting, in: Proceedings of the...
  • Y. He, C. Zhu, J. Wang, et al., Bounding box regression with uncertainty for accurate object detection, in: Proceedings...
  • R. Zhu et al., ScratchDet: Exploring to train single-shot object detectors from scratch (2018)
  • X. Jia, X. Wei, X. Cao, et al., ComDefend: An efficient image compression model to defend adversarial examples, in:...
  • W. Tao, F. Jiang, S. Zhang, et al., An end-to-end compression framework based on convolutional neural networks, in:...
  • W. Zhang, Y. Liu, C. Dong, et al., RankSRGAN: Generative adversarial networks with ranker for image super-resolution,...
  • Z. Li, J. Yang, Z. Liu, et al., Feedback network for image super-resolution, in: Proceedings of the IEEE Conference on...
  • M. Bertalmio, G. Sapiro, V. Caselles, et al., Image inpainting, in: Proceedings of the 27th Annual Conference on...
  • C. Barnes et al., PatchMatch: A randomized correspondence algorithm for structural image editing, ACM Trans. Graph. (2009)
  • J. Sun et al., Image completion with structure propagation, ACM Trans. Graph. (2005)
  • S. Iizuka et al., Globally and locally consistent image completion, ACM Trans. Graph. (2017)
  • Y. Li, S. Liu, J. Yang, et al., Generative face completion, in: Proceedings of the IEEE Conference on Computer Vision...
  • R.A. Yeh et al., Semantic image inpainting with perceptual and contextual losses (2016)
  • K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybernet. (1980)
  • Y. LeCun et al., Backpropagation applied to handwritten zip code recognition, Neural Comput. (1989)
  • I. Goodfellow et al., Generative adversarial nets
  • C. Zhao et al., Multi-source domain adaptation with joint learning for cross-domain sentiment classification, Knowl.-Based Syst. (2019)
  • X. Wang, A. Shrivastava, A. Gupta, A-Fast-RCNN: Hard positive generation via adversary for object detection, in:...
  • J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE...
  • N. Souly, C. Spampinato, M. Shah, Semi-supervised semantic segmentation using generative adversarial network, in:...
  • Y. Yang et al., Feature selection for multimedia analysis by sharing information among multiple tasks, IEEE Trans. Multimedia (2013)
  • Y. Song, C. Yang, Z. Lin, et al., Contextual-based image inpainting: Infer, match, and translate, in: Proceedings of the...