Attentive deep network for blind motion deblurring on dynamic scenes

https://doi.org/10.1016/j.cviu.2021.103169

Highlights

  • A joint spatial-channel attention-based DNN for image deblurring.

  • Joint attention improves generalizability and performance.

  • We show the different roles spatial attention plays in the encoder and the decoder.

  • State-of-the-art deblurring performance is achieved.

Abstract

Non-uniform blind motion deblurring is a challenging yet important problem in image processing that has received sustained attention over the last decade. The non-uniform nature of motion blurring leads to great variations in blurring effects across image regions and over different images, which makes it very difficult to train an end-to-end deblurring neural network (NN) with good generalization performance. This paper introduces an attention mechanism for a blind deblurring NN, including both spatial and channel attention, so as to effectively handle significant spatial variations in blurring effects. In this mechanism, spatial attention is introduced in both the encoder, for discriminative exploitation of image edges and smooth regions, and the decoder, for discriminative treatment of regions with different blurring effects. Channel attention is introduced for better generalization performance, as it allows adaptive weighting of intermediate features for a particular image. Building such an attention mechanism into a multi-scale encoder–decoder framework, an attentive NN is developed for practical non-uniform blind image deblurring. Experiments on several benchmark datasets show that the proposed NN can effectively restore images degraded by spatially-varying blurring, with state-of-the-art performance.

Introduction

Image blurring is a common type of image degradation that causes the loss of image details. In addition to yielding the poor picture quality unwanted in digital photography, image blurring also has negative impacts on many vision tasks, e.g. autonomous driving, object tracking, and visual surveillance. Image deblurring is then about recovering a clear image with sharp details from an input blurred image. The blurring effect may come from multiple sources, e.g. defocus and motion. In practice, the motion blurring effect is often non-uniform (spatially-varying), i.e., different image regions exhibit different blurring effects, when the camera motion is not a translation along the image plane, when there are large variations in scene depth, or when there exist independently moving objects. This paper focuses on how to remove spatially-varying motion blurring effects from images of dynamic scenes.

This paper concerns the motion blurring process that can be formulated as: g = Kf + n, (1) where g denotes the input blurred image, f denotes the latent image with sharp details, n denotes the measurement noise, and K denotes some linear operator that models the blurring process such that ∑_i K(i, j) = 1 for all j, and K(i, j) ≥ 0 for all i, j. In other words, the value of each blurred pixel is a weighted average of the values of its neighboring sharp pixels (Denis et al., 2015). Note that the linear model (1) is not applicable to images where occlusions occur during the shutter time. In the case of uniform blurring, as most existing deblurring methods (e.g. Cai et al., 2009, Danielyan et al., 2011, Shan et al., 2008, Xu and Jia, 2010) assume, all blurred pixels follow the same weighted-averaging scheme; thus, the operator K can be expressed as a convolution with a smoothing kernel. When the blurring is not uniform over the image, different blurred pixels follow different weighting schemes. Clearly, as both the operator K and the image f are unknown, blind deblurring is a very challenging ill-posed problem to solve.
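Under these constraints, model (1) says each blurred pixel is a normalized, non-negative combination of nearby sharp pixels. The following minimal 1-D sketch (an illustration only; the helper name and toy kernels are not from the paper) shows how a spatially-varying K reduces to per-pixel kernels, with uniform blur as the special case where every pixel uses the same kernel:

```python
import numpy as np

def spatially_varying_blur(f, kernels):
    """Blur a 1-D signal f where each output sample i has its own kernel
    kernels[i]; each row is non-negative and sums to 1, matching the
    constraints on the operator K in model (1)."""
    n = f.shape[0]
    r = kernels.shape[1] // 2          # kernel radius
    fp = np.pad(f, r, mode="edge")     # replicate borders
    g = np.empty(n)
    for i in range(n):
        g[i] = kernels[i] @ fp[i:i + 2 * r + 1]
    return g

# Uniform blur is the special case where every row of `kernels` is identical.
f = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # a single sharp impulse
k_sharp = np.array([0.0, 1.0, 0.0])        # identity kernel: region kept sharp
k_blur = np.array([1/3, 1/3, 1/3])         # box kernel: region blurred
kernels = np.stack([k_sharp, k_sharp, k_blur, k_blur, k_blur])
g = spatially_varying_blur(f, kernels)     # -> [0, 0, 1/3, 1/3, 0]
```

The impulse stays sharp under the identity rows and spreads under the box rows, which is exactly the spatially-varying behavior that makes K impossible to express as a single convolution.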

For images with complex spatially-varying blurring effects (Nah et al., 2017, Caglioti and Giusti, 2009, Seibold et al., 2017), the operator K has a very high degree of freedom, which makes it very difficult to resolve the ambiguity of the solution in blind image deblurring. In the past, there have been several approaches that impose certain structural prior on the blurring operator K, e.g. two-layer-based model for defocus blurring (Chan and Nguyen, 2011), patch-based model for non-uniform motion blurring (Ji and Wang, 2012), and non-uniform motion blurring model parameterized by 3D camera intrinsic motion (Whyte et al., 2012). Nevertheless, the applicability of these models is limited. For instance, they are not applicable to the blurring effects in dynamic scenes with moving objects of different speeds or for the blurring effects caused by very complex scene depths.

In order to have a blind deblurring method that covers a wide range of spatially-varying blurring effects, an alternative approach is to directly recover each blurred image pixel without explicitly modeling its associated blurring operator. Deep learning provides a powerful tool to learn such a direct recovery process. In recent years, many deep-learning-based approaches (e.g. Nah et al., 2017, Nimisha et al., 2017, Sharma et al., 2018, Su et al., 2017, Xin et al., 2018) have been proposed for blind image deblurring. Most of these methods train a convolutional neural network (CNN) that models the mapping from a blurred image to its clear version, using many pairs of blurred images and their clear counterparts. The NN models trained by these approaches have shown promising performance in removing spatially-varying blur from input blurred images.

To train an NN that models the mapping between blurred/clear image pairs with good generalization performance, a great amount of training data is needed to comprehensively cover different image contents and different blurring effects. In comparison to uniform blur, the variations of blurring effects in non-uniform blur are much more significant, as the spatial configurations of blurring effects can differ greatly across image regions and over different images. Thus, it is impractical to build a training dataset comprehensive enough to avoid overfitting when training the NN. As a result, the performance gain of existing deep-learning-based approaches over traditional ones is limited, and increasing their model size does not help much for further improvement; see e.g. the studies in Xin et al. (2018) and Zhang et al. (2019).

It is well known in human visual perception that blur directly participates in visual experience, especially in space perception. It is shown in Khan et al. (2011) that blurring has an important influence on visual attention, and that there is a deep connection between blur and the extraction of salient regions. Indeed, the human visual system can directly estimate local blur effects from salient structures (e.g. edges and corner points) and generalize them to more global salient regions. This motivated us to introduce a spatial attention mechanism into the NN so that it can learn to effectively exploit salient image features to deal with spatially-varying blurring effects. Another concern is how to restore image regions with different blurring effects within one NN. As different blur effects require different restoration processes, an NN for processing non-uniformly blurred images needs to be spatially-varying as well. Clearly, spatial attention is one way to introduce such a spatially-varying nature into an NN, especially a CNN.

Channel attention is another widely-used attention mechanism in deep NNs for image classification and processing (Hu et al., 2018). Channel attention allows the intermediate features of the CNN to have varying weights over different images, and thus improves the adaptivity of the CNN's features to different image contents. Such adaptivity is certainly appealing when the CNN needs to handle a wide range of image contents as well as blurring effects.
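The channel attention referred to here can be sketched in the squeeze-and-excitation style of Hu et al. (2018): global average pooling produces one descriptor per channel, a small two-layer network maps it to per-channel weights in (0, 1), and the feature maps are rescaled. The shapes and random weights below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel attention on features x of
    shape (C, H, W); w1: (C//r, C) and w2: (C, C//r) are the two FC layers."""
    s = x.mean(axis=(1, 2))        # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)    # excitation hidden layer with ReLU
    a = sigmoid(w2 @ z)            # per-channel weights in (0, 1)
    return x * a[:, None, None]    # rescale each feature map

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
y = channel_attention(x, w1, w2)   # same shape, each channel rescaled
```

Because the weights depend on the pooled statistics of the input itself, the same network emphasizes different channels for different images, which is the adaptivity the paragraph above describes.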

In summary, the potential benefits of spatial and channel attention in handling non-uniform blur inspired us to investigate attention mechanisms for deep-learning-based non-uniform blind deblurring.

In this paper, using a multi-scale encoder–decoder CNN as the backbone, we propose a deep attentive NN with a built-in attention mechanism for non-uniform blind image deblurring. The attention mechanism we use for the deblurring NN includes both spatial attention and channel attention.

In the proposed approach, spatial attention is introduced in both the encoder and the decoder, where it serves different functions. In an encoder–decoder CNN, the encoder acts as a feature extractor that captures image features providing essential information for image recovery while being robust to blurring. Edge-selection-based uniform blind motion deblurring methods (e.g. Cho and Lee, 2009, Xu et al., 2013, Yang and Ji, 2019) show that focusing on strong image edges of different orientations for kernel estimation yields very robust estimates of blur kernels, which in turn greatly improves deblurring performance. For instance, strong horizontal/vertical edges are not erased by blurring, and they carry all the information about the blurring effect along the vertical/horizontal direction.

Following the success of edge-selection techniques in uniform blind deblurring, different stages of an effective encoder should treat image edges and smooth regions discriminatively; that is, some stages emphasize edges while others emphasize smooth regions. Therefore, we introduce spatial attention into the encoder. Such an attention mechanism allows the encoder to assign different weights to different spatial image features. As a result, the features extracted by the attentive encoder focus on those image features encoding more information about the blurring effect, e.g. strong image edges of various orientations. See Fig. 1 (middle row) for an illustration of the spatial attention in the encoder, which makes the NN focus more on strong image edges of various orientations.

In our approach, spatial attention is also introduced into the decoder part of the encoder–decoder CNN, but with a different function from its counterpart in the encoder. Recall that the decoder can be interpreted as an image recovery process that maps the features extracted by the encoder to a clear image. In non-uniform blind deblurring, different image regions have different blurring effects and should therefore be treated by different reconstruction processes. For example, a region with more severe blurring should receive more attention, as more details need to be recovered. A plain deblurring CNN without spatial attention is not effective at modeling such highly location-dependent mappings; introducing spatial attention in the decoder enables efficient modeling of location-dependent operations. See Fig. 1 (bottom row) for an illustration of the spatial attention in the decoder, where it distinguishes well among image regions with different degrees of blur, e.g. focusing more on fast-moving persons.
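As a rough illustration of how a spatial attention map reweights locations (so, for example, heavily blurred regions can receive more processing), here is a minimal CBAM-like sketch; the pooling choices and the tiny two-weight mixer are assumptions for illustration, not the paper's actual module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(x, w):
    """Compute an (H, W) attention map from features x of shape (C, H, W)
    by mixing channel-average and channel-max descriptors (weights w),
    then reweight every spatial location of x by that map."""
    avg = x.mean(axis=0)                    # (H, W) average over channels
    mx = x.max(axis=0)                      # (H, W) max over channels
    att = sigmoid(w[0] * avg + w[1] * mx)   # map in (0, 1); 1x1-conv stand-in
    return x * att[None, :, :], att

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 4))
y, att = spatial_attention(x, np.array([0.5, 0.5]))
# att assigns each location its own weight, so different regions of the
# same image are treated by effectively different reconstruction strengths
```

In contrast to channel attention, which produces one weight per feature map, this map varies per pixel, which is what makes the decoder's processing location-dependent.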

In addition to spatial attention, channel attention is employed to further improve the generalizability of the CNN in non-uniform blind deblurring. As the intermediate features of the CNN are supposed to cover a wide range of images with different contents, many features are not closely related to any one particular image. Such redundancy in features causes severe issues in the presence of blurring, as ambiguities arise among different images when they are severely blurred. Channel attention allows the CNN to impose different weights on different channels (i.e. different feature maps), which makes the CNN more adaptive to the input image.

Similar to other existing works (Xin et al., 2018, Zhang et al., 2019), we implement a multi-scale version of the encoder–decoder CNN as the backbone and build the aforementioned attention mechanisms into the NN. The multi-scale architecture provides better guidance for the encoder to extract blur-invariant representations and for the decoder to recover image details.

The main contribution of this paper is the introduction of an attention mechanism into the NN for non-uniform blind image deblurring. One main limitation of existing deep learning methods for non-uniform blind deblurring lies in their unsatisfactory generalization performance, owing to the significant variations among spatially-varying blurring effects. Introducing the attention mechanism has several benefits for generalization: (i) the spatial attention in the encoder makes the NN focus more on image features closely related to blur estimation; (ii) the spatial attention in the decoder enables spatially-varying treatment of different image regions; and (iii) the channel attention allows image-adaptive deblurring procedures.

Based on the spatial and channel attention mechanisms, this paper presents an encoder–decoder CNN with light-weight concurrent spatial and channel attention modules. The proposed CNN can effectively restore images degraded by complex spatially-varying blurring with a relatively small model size. Experiments on standard benchmark datasets show that the proposed model achieves state-of-the-art performance, which justifies the value of the attention mechanism in deep-learning-based non-uniform blind image deblurring.

Section snippets

Related work

In the last decade, many approaches have been proposed for single-image blind motion deblurring. Depending on the setting of motion blurring effects, most existing approaches can be classified into three categories: non-blind motion deblurring, which assumes the parameters of the blurring process are known; blind uniform motion deblurring, which assumes the blur is generated by convolution with an unknown kernel; and blind non-uniform motion deblurring, which considers complex

Network architecture

The proposed CNN for image deblurring is outlined in Fig. 2; its backbone is an encoder–decoder NN repeated in a multi-scale fashion. Such a backbone is inspired by the work of Xin et al. (2018) and Zhang et al. (2019). Concretely, there are T modules, denoted by M1(⋅; Θ), …, MT(⋅; Θ), in the proposed CNN, each of which is an encoder–decoder network whose weights are shared with the other modules. Given a blurry image f as input, we first generate its multi-scale representations, denoted by f
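The multi-scale, weight-shared scheme described above (T modules applied coarse-to-fine with shared Θ) can be caricatured as follows; the `module` function is a toy residual stand-in for the real encoder–decoder M(⋅; Θ), and `theta` is a hypothetical scalar standing in for the shared weights:

```python
import numpy as np

def downsample(img, factor):
    return img[::factor, ::factor]           # naive strided downsampling

def upsample(img, shape):
    """Nearest-neighbor resize of a 2-D array to `shape`."""
    rows = np.linspace(0, img.shape[0] - 1, shape[0]).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, shape[1]).round().astype(int)
    return img[rows][:, cols]

def module(blurry, estimate, theta):
    """Toy stand-in for one shared-weight encoder-decoder module M(.; Theta):
    blends the current-scale input with the upsampled previous estimate."""
    return theta * blurry + (1.0 - theta) * estimate

def multiscale_deblur(g, T=3, theta=0.7):
    """Run the SAME module (shared weights) at T scales, coarse to fine,
    feeding each scale's estimate into the next finer scale."""
    scales = [downsample(g, 2 ** s) for s in range(T - 1, -1, -1)]
    est = scales[0]                          # start at the coarsest scale
    for gs in scales:
        est = upsample(est, gs.shape)        # lift the previous estimate
        est = module(gs, est, theta)         # shared weights at every scale
    return est

g = np.arange(64, dtype=float).reshape(8, 8)
out = multiscale_deblur(g)                   # same resolution as the input
```

The design point this sketch captures is that weight sharing lets one set of parameters be supervised at every scale, while the coarse-to-fine recursion passes a partially restored estimate to the finer scales.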

Datasets and configurations

The proposed approach is evaluated on three public benchmark datasets for blind image deblurring, including the GoPro dataset (Nah et al., 2017), the VideoDeblurring dataset (Su et al., 2017) and the Köhler dataset (Köhler et al., 2012). The details of these three datasets are as follows:

  • The GoPro dataset (Nah et al., 2017) contains 3214 blurry/sharp image pairs of resolution 720 × 1280, which are extracted from 33 videos captured by the GoPro Hero 4 Black Camera. The blurred images are

Summary

In this paper, we tackle the challenging single-image blind deblurring problem using a multi-scale residual CNN with spatial attention and channel attention. One big challenge is the large variation of spatially-varying blurring effects across image regions and over different images, which makes a deep NN hard to generalize well. This paper demonstrated that introducing the spatial and channel attention mechanisms can improve the generalizability and performance of a deep neural

CRediT authorship contribution statement

Yong Xu: Funding acquisition, Resources, Project administration, Supervision. Ye Zhu: Software, Validation, Investigation, Visualization. Yuhui Quan: Conceptualization, Methodology, Supervision, Writing - original draft. Hui Ji: Formal analysis, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by National Natural Science Foundation of China under Grants 61872151 and 62072188, in part by Natural Science Foundation of Guangdong Province under Grants 2017A030313376 and 2020A1515011128, in part by Science and Technology Program of Guangdong Province under Grant 2019A050510010, in part by Science and Technology Program of Guangzhou under Grant 201802010055, and in part by Singapore MOE AcRF under Grant MOE2017-T2-2-156.

References (51)

  • Delbracio, M., et al., 2015. Hand-held video deblurring via efficient Fourier aggregation. IEEE Trans. Comput. Imaging.
  • Denis, L., et al., 2015. Fast approximations of shift-variant blur. Int. J. Comput. Vision.
  • Gao, H., Tao, X., Shen, X., Jia, J., 2019. Dynamic scene deblurring with parameter selective sharing and nested skip...
  • Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proc. Int....
  • Gong, D., Yang, J., Liu, L., Zhang, Y., Reid, I., Shen, C., Van Den Hengel, A., Shi, Q., 2017. From motion blur to...
  • Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proc. IEEE Conf. Comput. Vision Pattern...
  • Ji, H., et al. A two-stage approach to blind spatially-varying motion deblurring.
  • Khan, R.A., et al. Visual attention: effects of blur.
  • Kim, T.H., Ahn, B., Lee, K.M., 2013. Dynamic scene deblurring. In: Proc. IEEE Int. Conf. Comput....
  • Kingma, D.P., et al., 2014. Adam: A method for stochastic optimization.
  • Köhler, R., et al. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database.
  • Kruse, J., et al. Learning to push the limits of efficient FFT-based image deconvolution.
  • Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., Matas, J., 2018. DeblurGAN: Blind motion deblurring using...
  • Levin, A., 2007. Blind motion deblurring using image statistics. In: Proc. Advances Neural Info. Process. Syst., pp....
  • Levin, A., et al. Understanding and evaluating blind deconvolution algorithms.
