Full length article
Depth-aware blending of smoothed images for Bokeh effect generation☆☆

https://doi.org/10.1016/j.jvcir.2021.103089

Highlights

  • The Bokeh effect is generally captured using Single Lens Reflex cameras.

  • In this paper, an end-to-end network is presented to synthesize bokeh effect.

  • Bokeh images are rendered by blending differently smoothed images using a network.

  • The proposed method ranked 2nd in AIM 2019 Challenge on Bokeh Effect Synthesis.

Abstract

The Bokeh effect is used in photography to capture images in which closer objects look sharp while everything else stays out of focus. Bokeh photos are generally captured using Single Lens Reflex cameras with a shallow depth-of-field. Most modern smartphones can take bokeh images by leveraging dual rear cameras or good auto-focus hardware. However, smartphones with a single rear camera and without good auto-focus hardware have to rely on software to generate bokeh images. Such a system is also useful for generating the bokeh effect in already captured images. In this paper, an end-to-end deep learning framework is proposed to generate a high-quality bokeh effect from images. The original image and differently smoothed versions of it are blended to generate the Bokeh effect with the help of a monocular depth estimation network. The model is trained in three phases to generate a visually pleasing bokeh effect. The proposed approach is compared against a saliency-detection-based baseline and a number of approaches proposed in the AIM 2019 Challenge on Bokeh Effect Synthesis. Extensive experiments are presented to analyze the different parts of the proposed algorithm. The network is lightweight and can process an HD image in 0.03 s. This approach ranked second in the AIM 2019 Bokeh Effect Challenge, Perceptual Track.

Introduction

The depth-of-field effect, or Bokeh effect, is often used in photography to produce aesthetically pleasing pictures. Bokeh images keep a certain subject in focus while out-of-focus regions are blurred. Bokeh images can be captured with Single Lens Reflex cameras using a wide aperture. In contrast, most smartphone cameras have small fixed-size apertures that cannot capture bokeh images. Many smartphones with dual rear cameras can synthesize the bokeh effect: two images are captured from the cameras, stereo matching algorithms are used to compute a depth map, and the depth-of-field effect is generated using this depth map. Some smartphones (e.g. iPhone 7+, Google Pixel 2) with good auto-focus hardware (dual lens or Phase-Detect Auto-Focus) can generate depth maps that help in rendering Bokeh images. However, smartphones with a single camera and no good auto-focus sensor have to rely on software to synthesize the bokeh effect.

Such software can also post-process already captured images to add the Bokeh effect. That is why synthetic depth-of-field, or Bokeh effect, generation is an important problem in Computer Vision and has gained attention recently. Most existing approaches [1], [2], [3] work on human portraits by leveraging image segmentation and depth estimation. However, few approaches have been proposed for bokeh effect generation on images in the wild. Recently, [4] proposed an end-to-end network to generate the bokeh effect on arbitrary images by leveraging monocular depth estimation and saliency detection.

In this paper, one such algorithm is proposed that can generate the Bokeh effect from diverse images. The proposed approach relies on a depth-estimation network to generate weight maps that blend the input image with differently smoothed versions of it. The bokeh images generated by this algorithm are visually pleasing. The proposed approach ranked 2nd in the AIM 2019 Challenge on Bokeh Effect Synthesis, Perceptual Track [5].
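As a rough illustration of the blending idea (not the paper's trained network), the sketch below smooths a grayscale image at a few Gaussian scales and blends the resulting layers with hand-crafted, depth-derived weight maps. In the actual method these per-pixel weight maps are predicted by the depth-estimation-based network; the weight formula and blur scales here are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(img, sigma):
    # separable Gaussian blur of a 2-D (grayscale) array with reflect padding
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    out = img.astype(float)
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, pad, mode="reflect"), k, mode="valid"),
            axis, out)
    return out

def render_bokeh(img, depth, sigmas=(2.0, 4.0)):
    # candidate layers: the sharp input plus progressively smoothed copies
    layers = np.stack([img.astype(float)] + [smooth(img, s) for s in sigmas])
    # toy weight maps derived directly from normalized depth: near pixels
    # favor the sharp layer, far pixels favor stronger blur (the paper
    # instead predicts these maps with a depth-estimation-driven network)
    d = depth.astype(float)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    centers = np.linspace(0.0, 1.0, len(layers))
    w = np.stack([np.exp(-(d - c) ** 2 / 0.05) for c in centers])
    w /= w.sum(axis=0, keepdims=True)  # per-pixel weights sum to 1
    return (w * layers).sum(axis=0)    # weighted sum of all layers
```

A quick sanity check: an impulse in a near region should survive almost unchanged, while an impulse in a far region should be spread out by the stronger blur layers.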

Section snippets

Monocular depth estimation

Depth estimation from a single RGB image is a significant problem in Computer Vision with a wide range of applications, including robotics, augmented reality and autonomous driving. Recent advances in deep learning have driven progress in monocular depth estimation algorithms. Supervised algorithms rely on ground-truth depth data captured with depth sensors. [6] formulated monocular depth estimation as a combination of two sub-problems: view synthesis and stereo matching. View synthesis

Depth estimation network

Depth maps are important for Bokeh effect rendering. Since the goal is to design a system that is independent of the camera hardware, the depth map has to be computed from the input RGB image alone. Thus, a monocular depth estimation network is an important element of the proposed algorithm. MegaDepth [11] is used as the monocular depth estimation network in this work. The authors use an hourglass architecture originally proposed in [7]. The architecture is shown in Fig. 1. The encoder part of this

System configuration

The code was written in Python, and PyTorch [25] was used as the deep learning framework. The models were trained on a machine with an Intel Xeon 2.40 GHz processor, 64 GB RAM and an NVIDIA GeForce TITAN X GPU with approximately 12 GB of GPU memory.

Dataset description

We use the ETH Zurich Bokeh dataset [5] (also known as the EBB! dataset [20]), which was used in the AIM 2019 Bokeh Effect Synthesis Challenge. This dataset contains 4893 pairs of bokeh and bokeh-free images. The training set contains 4493 pairs, whereas the validation and

Testing strategy

During testing, the input image is first resized to the resolution at which the network was trained (384 × 512 for Phase-1 and 768 × 1024 for Phase-2 and Phase-3) and then passed to the network. The synthesized image is then scaled back to the input image resolution using bilinear interpolation.
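The resize-infer-resize pipeline above can be sketched with a plain NumPy bilinear resampler. This is a stand-in for the framework's interpolation op, and the network call is mocked as an arbitrary callable; the helper names are ours, not the paper's.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    # bilinear resampling of a 2-D array, sampling at output pixel centers
    h, w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def bokeh_inference(img, network, net_h=768, net_w=1024):
    # resize to the training resolution, run the network, then scale the
    # result back to the input resolution with bilinear interpolation
    h, w = img.shape
    small = resize_bilinear(img, net_h, net_w)
    return resize_bilinear(network(small), h, w)
```

With an identity "network", a constant image should pass through unchanged, which exercises both resampling steps.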

Evaluation metrics

Both fidelity and perceptual metrics are used to evaluate the model’s performance. Fidelity measures include Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [26]. Learned
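For reference, the PSNR fidelity metric mentioned above can be computed as follows. This is the standard definition rather than code from the paper; `peak` is the maximum possible pixel value (1.0 for images normalized to [0, 1]).

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    # Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE).
    # Higher is better; identical images give +inf.
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and hence a PSNR of exactly 20 dB.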

Changing the type of blur kernel

Although the proposed network produces a perceptually good bokeh effect, the type of background blur in the rendered images differs from that of the ground-truth image. Instead of using Gaussian blur kernels to obtain the differently smoothed images, one can also use disk blur kernels. Table 7 shows that the PSNR and SSIM scores decrease slightly when disk blur is used instead of Gaussian blur, while the LPIPS score improves slightly in the disk blur setting. However, disk blur also
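A disk blur kernel, i.e. a uniform circular aperture closer to the point spread function of a real lens, can be constructed as below; the radius here is illustrative, not the paper's setting.

```python
import numpy as np

def disk_kernel(radius):
    # binary circular aperture of the given pixel radius, normalized so the
    # kernel sums to 1 (uniform averaging inside the disk, zero outside)
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return k / k.sum()
```

Unlike a Gaussian kernel, the disk's weight is flat inside its support, which spreads bright highlights into the hard-edged circles characteristic of optical bokeh.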

Conclusion

In this paper, an end-to-end deep learning approach for Bokeh effect synthesis is proposed. The synthesized bokeh image is rendered as a weighted sum of the input image and a number of differently smoothed images, where the corresponding weight maps are predicted by a depth-estimation network. The proposed system is trained in three phases to synthesize realistic bokeh images. It is shown through experiments that using a larger number of blur kernels and bigger blur kernels produces better quality

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

I would like to thank the Computer Vision Lab, IIT Madras for providing the GPU resources used in this work.

References (36)

  • Shen, X., et al., Automatic portrait segmentation for image stylization, Comput. Graph. Forum (2016)
  • Wadhwa, N., et al., Synthetic depth-of-field with a single-camera mobile phone, ACM Trans. Graph. (2018)
  • Xu, X., Sun, D., Liu, S., Ren, W., Zhang, Y.-J., Yang, M.-H., Sun, J., Rendering portraitures from monocular camera and...
  • Purohit, K., et al., Depth-guided dense dynamic filtering network for bokeh effect rendering
  • Ignatov, A., et al., AIM 2019 challenge on bokeh effect synthesis: Methods and results
  • Luo, Y., Ren, J., Lin, M., Pang, J., Sun, W., Li, H., Lin, L., Single view stereo matching, in: Proceedings of the IEEE...
  • Chen, W., et al., Single-image depth perception in the wild
  • Atapour-Abarghouei, A., Breckon, T.P., Real-time monocular depth estimation using synthetic data with domain adaptation...
  • Godard, C., Mac Aodha, O., Brostow, G.J., Unsupervised monocular depth estimation with left-right consistency, in:...
  • Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J., Digging into self-supervised monocular depth estimation, in:...
  • Li, Z., Snavely, N., MegaDepth: Learning single-view depth prediction from internet photos, in: Proceedings of the IEEE...
  • Wu, J., et al., Realistic rendering of bokeh effect based on optical aberrations, Vis. Comput. (2010)
  • Lee, S., et al., Real-time lens blur effects and focus control, ACM Trans. Graph. (2010)
  • Yu, X., et al., Real-time depth of field rendering via dynamic light field generation and filtering, Comput. Graph. Forum (2010)
  • Soler, C., et al., Fourier depth of field, ACM Trans. Graph. (2009)
  • Long, J., Shelhamer, E., Darrell, T., Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE...
  • Liu, S., et al., Learning affinity via spatial propagation networks
  • Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H., Conditional random fields as...

    EDICS: 1.5, 5.13.

    ☆☆

    This paper has been recommended for acceptance by Zicheng Liu.
