
ORIGINAL RESEARCH article

Front. Mar. Sci., 17 September 2021
Sec. Ocean Observation
Volume 8 - 2021 | https://doi.org/10.3389/fmars.2021.690962

The Synthesis of Unpaired Underwater Images for Monocular Underwater Depth Prediction

Qi Zhao1 Ziqiang Zheng2 Huimin Zeng1 Zhibin Yu1,3* Haiyong Zheng1 Bing Zheng1,3
  • 1College of Electronic Engineering, Ocean University of China, Qingdao, China
  • 2Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, SAR China
  • 3Sanya Oceanographic Institution, Ocean University of China, Sanya, China

Underwater depth prediction plays an important role in underwater vision research. Because of the complex underwater environment, it is extremely difficult and expensive to obtain underwater datasets with reliable depth annotation. Thus, underwater depth map estimation in a data-driven manner is still a challenging task. To tackle this problem, we propose an end-to-end system including two different modules for underwater image synthesis and underwater depth map estimation, respectively. The former module aims to translate hazy in-air RGB-D images to multi-style realistic synthetic underwater images while retaining the objects and the structural information of the input images. Then we construct a semi-real RGB-D underwater dataset using the synthesized underwater images and the original corresponding depth maps. We conduct supervised learning to perform depth estimation through the pseudo paired underwater RGB-D images. Comprehensive experiments have demonstrated that the proposed method can generate multiple realistic underwater images with high fidelity, which can be applied to enhance the performance of monocular underwater image depth estimation. Furthermore, the trained depth estimation model can be applied to real underwater image depth map estimation. We will release our code and experimental settings at https://github.com/ZHAOQIII/UW_depth.

1. Introduction

As an important part of underwater robotics and 3D reconstruction, underwater depth prediction is crucial for underwater vision research. However, the quality of collected images is restricted by light refraction and absorption, suspended particles in the water, and color distortion, making it difficult to obtain reliable underwater depth maps. Due to the influence of strong absorption and scattering, some widely used devices designed to obtain in-air depth maps, such as Kinect units (Dancu et al., 2014), lidar (Churnside et al., 2017), and binocular stereo cameras (Deris et al., 2017), exhibit limited performance in underwater environments (Massot-Campos and Oliver-Codina, 2015; Pérez et al., 2020). As only a few underwater RGB-D datasets (Akkaynak and Treibitz, 2019; Gomez Chavez et al., 2019; Berman et al., 2020) are currently available, many researchers have sought to adopt image processing methods to estimate the depth from a single monocular underwater image or a consecutive underwater image sequence. To perform single monocular underwater depth prediction, several restoration-based methods have been developed (e.g., UDCP; Drews et al., 2016; Ueda et al., 2019). The transmission map is regarded as an intermediate step for obtaining depth maps and restoring underwater images. In theory, the physical process is highly dependent on the calibrated intrinsic parameters and the well-described structural information of the scene. However, it is extremely laborious to select and measure the parameters relevant to the physical process (Abas et al., 2019), and such approaches are limited to specific tasks.

Recently, deep learning methods have shown great potential in image processing (Li et al., 2018) applications, such as image-to-image translation (Isola et al., 2017; Zhu et al., 2017a; Choi et al., 2018; Wang et al., 2018b; Zheng et al., 2020), image restoration (Peng et al., 2015), and depth estimation (Gupta and Mitra, 2019). Due to the lack of underwater depth ground truth for full supervision, supervised learning models cannot be directly adopted for underwater depth estimation. With the introduction of the cycle-consistency loss designed for unpaired image-to-image translation, many researchers aim to translate in-air images to the desired underwater images while preserving the original depth annotation (Li et al., 2017, 2018; Gupta and Mitra, 2019). With the synthetic underwater images from the original in-air images paired with the corresponding depth annotation, we can obtain pseudo underwater and depth image pairs. Previous methods such as WaterGAN (Li et al., 2017) and UMGAN (Li et al., 2018) adopted a two-stage optimization framework for underwater depth estimation. The underwater image synthesis and the downstream vision task (such as depth prediction or underwater image restoration) are optimized separately, so the two models have no direct connection at the training stage. UW-Net (Gupta and Mitra, 2019) has addressed this problem and aims to perform underwater image synthesis and underwater depth estimation in parallel. However, two competitive tasks with cycle-consistent learning lead to low training efficiency and inaccurate depth estimation outputs. The leakage of texture is another challenge. The depth values across a fish should be approximately equal; however, the bright colors and textures of a fish may lead to incorrect depth estimation results (Figures 1B–E).

Figure 1. Examples of texture leakage during the underwater depth map estimation process using different methods. (A) Real underwater images. (B) DCP (He et al., 2010), (C) UDCP (Drews et al., 2016), (D) Berman et al. (2017), (E) UW-Net (Gupta and Mitra, 2019), (F) ours.

To address these problems, we propose a novel joint-training generative adversarial network for both multi-style underwater image synthesis and depth estimation performed in an end-to-end manner. For the image synthesis task, we aim to transfer hazy in-air RGB-D images to multi-style underwater images while retaining the objects and the structural information of the in-air images and controlling the underwater style through a conditional input label. To take advantage of multi-task learning (Zhang and Yang, 2017) between the underwater image synthesis and depth estimation tasks, we design a joint-training generator to estimate the depth from the synthesized underwater images through full supervision. Overall, our system includes two consecutive generators (responsible for underwater image synthesis and underwater depth estimation, respectively), which are trained simultaneously. To ensure that the generated underwater images retain the objects and the structural information of the in-air images, we consider a perceptual loss (Johnson et al., 2016) computed at selected layers as a structural loss along with the adversarial loss to optimize the whole network. Furthermore, we develop a depth loss to alleviate the texture leakage phenomenon shown in Figure 1. Finally, we evaluate the effectiveness of our proposed method in synthesizing underwater images and estimating the depth maps of real underwater images, and the comprehensive experimental results demonstrate the superiority of the proposed method. The main contributions of this paper are summarized as follows:

• We propose a novel joint-training generative adversarial network, which can simultaneously handle the controllable translation from the hazy RGB-D images to the multi-style realistic underwater images by combining one additional label, and the depth prediction from both the synthetic and real underwater images.

• To construct a semi-real underwater RGB-D dataset, we take the hazy in-air RGB-D image pairs and conditional labels as inputs to synthesize multi-style underwater images. During the training process, we introduce perceptual loss to preserve the objects and structural information of the in-air images during the image-to-image translation process.

• To improve the results of underwater depth estimation, we design the depth loss to make better use of high-level and low-level information. We verify the effectiveness of our proposed method on a real underwater dataset.

2. Related Work

2.1. Image-to-Image Translation

In the past several years, a series of image-to-image translation methods based on generative adversarial networks (GANs) (Mirza and Osindero, 2014; Odena et al., 2017) have been proposed. These approaches can mainly be divided into two categories: paired training and unpaired training methods. Pix2pix (Isola et al., 2017) is a typical and powerful paired model that builds on cGAN (Mirza and Osindero, 2014) to learn a one-sided mapping function from input images to target images. To achieve image-to-image translation on unpaired datasets, CycleGAN (Zhu et al., 2017a) translates images between two domains using two generators and two discriminators and proposes the cycle-consistency loss to tackle the mode collapse of unpaired image translation. To address the multimodal problem, methods including BicycleGAN (Zhu et al., 2017b), MUNIT (Huang et al., 2018), DRIT (Lee et al., 2018), StarGAN (Choi et al., 2018), etc. have been proposed. BicycleGAN (Zhu et al., 2017b) learns to transfer the given input with a low-dimensional latent code to more diverse results; it takes advantage of the bijective consistency between the latent and target spaces to avoid the mode collapse problem. MUNIT (Huang et al., 2018) achieves multidomain translation by assuming two latent representations that represent style and content, respectively, and combining different representations of content and style. StarGAN (Choi et al., 2018) learns multiple mapping functions between multiple domains. It only uses a single generator and a single discriminator to transfer source images to the target domain. Then, to avoid mode collapse, the generator takes the generated images and the original labels as input and transfers them back to the original domain. The subsequently developed image-to-image translation methods, such as pix2pixHD (Wang et al., 2018b), GauGAN (Park et al., 2019), vid2vid (Wang et al., 2018a), FUNIT (Liu et al., 2019), NICE-GAN (Chen et al., 2020), and StarGAN v2 (Choi et al., 2020), pay more attention to generating outputs with higher visual quality and greater diversity, and have been applied to video and small-sample studies.

To synthesize underwater images, due to the lack of a large paired underwater image dataset, studies have mainly focused on unsupervised learning. In a pioneering approach to underwater image synthesis, WaterGAN (Li et al., 2017) synthesized underwater images from in-air images and the paired depth maps for real-time color correction of monocular underwater images. To achieve multidomain translation, UMGAN (Li et al., 2018) proposes an unsupervised method that combines CycleGAN (Zhu et al., 2017a) and cGAN (Mirza and Osindero, 2014) with an additional style classifier to synthesize multi-style underwater images. UW-Net, developed by Gupta and Mitra (2019), learns the mapping functions between unpaired hazy RGB-D images and arbitrary underwater images to synthesize underwater images and estimate the underwater depth map. This method translates hazy RGB-D images to underwater images while also learning to convert underwater images back to hazy RGB-D images. However, WaterGAN (Li et al., 2017) and UW-Net (Gupta and Mitra, 2019) only provide a solution for single-domain underwater image generation, and UMGAN (Li et al., 2018) does not consider the transmission map as an extra clue to generate underwater images. Moreover, the underwater images synthesized by these methods still lack the characteristics of real underwater images and clear structural information.

2.2. Underwater Depth Map Estimation

Underwater depth map estimation has mainly been studied in the field of traditional image processing. Since He et al. (2010) first proposed the dark channel prior (DCP) for dehazing, many methods based on DCP (He et al., 2010) have been proposed for underwater depth map estimation in recent years. Drews et al. (2016) proposed a method based on a physical model of light propagation and the statistical priors of the scene to obtain the medium transmission and scene depth in typical underwater scenarios. Peng et al. (2015) proposed a three-step approach consisting of pixel blurriness estimation, rough depth map generation, and depth map refinement for depth map estimation. Berman et al. (2017) took different optical underwater types into account and proposed a more comprehensive physical image formation model to recover the distance maps and object colors. These methods mainly consider transmission map estimation as an intermediate step to obtain a depth map. Due to the unknown scattering parameters and multiple possible solutions, the results of these methods are often incorrect (Gupta and Mitra, 2019).

Recently, many deep learning-based methods have been proposed for depth estimation. However, most of these approaches focus on depth estimation from in-air RGB images with full supervision, which is not suitable for underwater depth map estimation due to the lack of paired RGB-D data. The above-mentioned UW-Net (Gupta and Mitra, 2019) is an unsupervised method for learning depth map estimation. It considers an in-air transmission map as a cue to synthesize underwater images and obtains the required depth map from the synthesized underwater images. However, this method cannot estimate the depth map from underwater images of multiple water types. Because two competitive tasks (hazy in-air image reconstruction and depth estimation) are assigned to one generator, the depth prediction results of UW-Net lack sharp outlines. Ye et al. (2019) proposed another unsupervised adaptation network, developing a joint learning framework that can handle underwater depth estimation and color correction tasks simultaneously. Unlike their work, in which the two networks (style adaptation network and task network) have to be trained separately, our model is simpler and its two modules can be trained simultaneously. The depth loss and a fine-tuning strategy make our model more efficient in practice for underwater depth map prediction.

3. Materials and Methods

3.1. Overall Framework

In this paper, we aim to estimate the depth map from real underwater images. Because there are no paired underwater RGB-D images, we cannot perform supervised learning directly. Therefore, we choose to translate the original in-air images with corresponding depth maps to underwater images and obtain pseudo-paired images. To perform this task, we design an end-to-end system with two joint-training modules: multi-style underwater image synthesis and underwater depth estimation based on the synthetic paired samples. The former module is trained through unpaired training, while the latter adopts supervised training to achieve precise underwater depth estimation. The overall framework is shown in Figure 2 and consists of two generators, namely, Gs: x → ỹ and Gd: ỹ → d̃, where x and ỹ are the original in-air image and the synthesized underwater image with a specific underwater style, and d̃ is the estimated depth output. For discrimination, we also design two discriminators, Ds and Dd, to perform adversarial training to boost the underwater image synthesis and depth estimation, respectively. Ds aims to distinguish between real and fake images and identify the domains from which both the real images and the generated images originate. The discriminator Dd only learns to distinguish between real and fake depth maps.

Figure 2. The network framework of our proposed model is designed to synthesize multi-style underwater images and estimate underwater depth maps. The generator Gs and the discriminator Ds are used to synthesize multi-style underwater images, and the generator Gd and discriminator Dd learn to estimate underwater depth map based on the synthesized underwater RGB-D dataset.

3.1.1. Multi-Style Underwater Image Synthesis

As shown in Figure 2, we follow the training scheme of StarGAN (Choi et al., 2018) to generate multi-style underwater images. To synthesize underwater images of a specified style, we adopt an additional one-hot vector c to represent domain attributes. To make the generator Gs depth-aware and preserve the original depth representation after translation, we concatenate the three inputs, namely, the in-air image (x), the target underwater style (cy), and the corresponding in-air depth (d), to synthesize an underwater image ỹ = Gs[C(x, d, cy)] with the required style (cy), where C denotes depthwise concatenation. To guarantee that the synthetic image ỹ has the target underwater style, we include an adversarial domain classifier Ds with two branches (one for domain classification and another for real/fake discrimination). The classification branch with the domain classification loss Lcls aims to recognize the underwater style (cy) of both the synthesized image ỹ and the real underwater image y. Note that y does not have a corresponding depth annotation due to the lack of underwater ground truth. The adversarial loss Ladvs is computed to promote the naturalness of the synthetic images. The generator Gs, following CycleGAN (Zhu et al., 2017a) and StarGAN (Choi et al., 2018), is a symmetric encoder-decoder architecture with six residual blocks.
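To make the conditioning concrete, the following minimal PyTorch sketch shows one way to build the depthwise-concatenated input C(x, d, cy) for Gs; the function name, tensor shapes, and the StarGAN-style spatial broadcasting of the one-hot label are illustrative assumptions, not the released implementation.

```python
import torch

def build_gs_input(x, d, c_y):
    """Depthwise concatenation C(x, d, c_y) for the generator G_s.

    x   : (N, 3, H, W) hazy in-air RGB image
    d   : (N, 1, H, W) corresponding in-air depth map
    c_y : (N, num_domains) one-hot target underwater style
    """
    n, _, h, w = x.shape
    # Broadcast the one-hot label to a spatial map (StarGAN-style conditioning).
    c_map = c_y.view(n, -1, 1, 1).expand(-1, c_y.size(1), h, w)
    return torch.cat([x, d, c_map], dim=1)  # (N, 3 + 1 + num_domains, H, W)
```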

3.1.2. Underwater Depth Estimation

In the training stage, we perform underwater depth estimation on the above-mentioned synthetic underwater images ỹ by adopting a generator Gd with dense-block architectures. The output ỹ of generator Gs is fed to generator Gd to estimate its depth map Gd(ỹ). Considering that we have the depth annotation d of the in-air images, we can obtain pseudo pairs to compute Ldepth between d and d̃. The discriminator Dd is also designed and has only one discrimination output. Furthermore, the adversarial loss Ladvd in the depth space is computed. For underwater depth map estimation, we use DenseNet (Jégou et al., 2017) as the generator. In UW-Net (Gupta and Mitra, 2019), the authors proved the importance of using hazy above-water images and compared the results of underwater depth map estimation with different generator networks, including ResNet (He et al., 2016), Unet (Ronneberger et al., 2015), DenseNet (Jégou et al., 2017), and so on. In their work, DenseNet proved to be the best choice.

3.2. Loss Functions

3.2.1. Multi-Style Underwater Image Synthesis

3.2.1.1. Adversarial Loss

Regular GANs use a sigmoid activation output and the cross-entropy loss function (Goodfellow et al., 2014), which may cause vanishing gradients during the learning process. To stabilize the training process and generate underwater images with higher quality, we adopt the least-squares loss (Mao et al., 2017) in our method. Ladvs can be expressed as follows:

\mathcal{L}_{adv}^{s} = \min_{G_s}\max_{D_s}\Big\{\mathbb{E}_{y \sim P_{data}(y)}\big[(D_s(y)-1)^2\big] + \mathbb{E}_{x \sim P_{data}(x)}\big[(D_s(\tilde{y}))^2\big]\Big\}, \quad \text{where } \tilde{y} = G_s\big(C(x, d, c_y)\big),    (1)

where Gs aims to transfer a hazy in-air RGB-D image x, concatenated with an underwater condition label cy, into the synthesized image ỹ = Gs[C(x, d, cy)]. The discriminator Ds attempts to distinguish between the real underwater image y and the synthesized underwater image ỹ.
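As a reference point, here is a hedged PyTorch sketch of the two sides of the least-squares game in Eq. (1); we assume Ds returns a (real/fake, classification) output pair as described in section 3.1.1.

```python
import torch

def d_adv_loss_s(D_s, y_real, y_fake):
    """Discriminator side of Eq. (1): real underwater images toward 1, fakes toward 0."""
    real_src, _ = D_s(y_real)            # assumed (src, cls) output pair
    fake_src, _ = D_s(y_fake.detach())   # detach so only D_s is updated here
    return torch.mean((real_src - 1) ** 2) + torch.mean(fake_src ** 2)

def g_adv_loss_s(D_s, y_fake):
    """Generator side of Eq. (1): G_s tries to make D_s output 1 on synthesized images."""
    fake_src, _ = D_s(y_fake)
    return torch.mean((fake_src - 1) ** 2)
```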

3.2.1.2. Domain Classification Loss

For the given hazy in-air image x and an underwater domain style cy, Gs translates x into an underwater image ỹ, which can be properly classified to the desired target domain by Ds. To achieve this goal, the classification branch of Ds imposes the domain classification. For the real underwater image y, the domain classification loss Lclsr is computed as:

\mathcal{L}_{cls}^{r} = \mathbb{E}_{y, c_y}\big[-\log D_s(c_y \mid y)\big],    (2)

where the term Ds(cy|y) denotes a probability distribution over the underwater domain labels (cy) computed by Ds. By minimizing this objective, Ds learns to classify an underwater image y to its original domain cy. We assume that the underwater image and domain label pair (y, cy) is given by the training data. For generator Gs, the loss function for the domain classification of synthetic underwater images is defined as:

\mathcal{L}_{cls}^{f} = \mathbb{E}_{\tilde{y}, c_y}\big[-\log D_s(c_y \mid \tilde{y})\big].    (3)

During training, Gs tries to synthesize underwater images ỹ that can fool the classification branch of Ds.
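A minimal sketch of the two classification terms (Eqs. 2 and 3), assuming the domain labels are stored as integer class indices and Ds exposes a logits branch as above.

```python
import torch.nn.functional as F

def cls_loss_real(D_s, y_real, label_real):
    """L_cls^r (Eq. 2): teach the classification branch of D_s to recognize
    the underwater style of real images."""
    _, cls_logits = D_s(y_real)
    return F.cross_entropy(cls_logits, label_real)

def cls_loss_fake(D_s, y_fake, label_target):
    """L_cls^f (Eq. 3): push G_s to synthesize images classified as the target style."""
    _, cls_logits = D_s(y_fake)
    return F.cross_entropy(cls_logits, label_target)
```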

3.2.1.3. Feature-Level Loss

Beyond the pixel-level loss, we design a feature-level loss between the feature representations extracted from a pre-trained VGG19 network. This feature-level loss can effectively preserve the object-level similarity between the hazy in-air images and the synthesized underwater images. For multi-style underwater image synthesis, we introduce a perceptual loss, namely, Lsyn. Lsyn is designed to preserve the object content while loosening the restrictions on color and texture changes after translation. Lsyn is expressed as follows:

\mathcal{L}_{syn} = \mathbb{E}\big[\lVert \Phi^{(i)}(x) - \Phi^{(i)}(G_s(x \mid c_y)) \rVert_1\big],    (4)

where Φ(i) denotes the features extracted at the i-th layer of a pre-trained VGG19 network. Following the work by Kupyn et al. (2019), we compute the 1-norm distance at the selected layer i = 14 of the VGG19 network between the hazy in-air images and the synthesized underwater images.
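The perceptual term could be implemented as follows; interpreting "layer 14" as an index into torchvision's vgg19().features sequence is our assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """L_syn (Eq. 4): L1 distance between VGG19 features of the in-air image
    and the synthesized underwater image at a single selected layer."""

    def __init__(self, layer=14):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features[: layer + 1]
        for p in vgg.parameters():
            p.requires_grad = False     # the VGG19 extractor stays frozen
        self.vgg = vgg.eval()

    def forward(self, x, y_fake):
        return torch.mean(torch.abs(self.vgg(x) - self.vgg(y_fake)))
```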

3.2.1.4. Reconstruction Loss

To perform unpaired training between in-air and underwater images, we include the cycle consistency loss (Zhu et al., 2017a) in our framework. The reconstruction loss Lrec between x^ and x is defined as follows:

\mathcal{L}_{rec} = \mathbb{E}_{x, c_y, c_x}\big[\lVert x - \hat{x} \rVert_1\big], \quad \hat{x} = G_s\big(C(G_s(C(x, d, c_y)), d, c_x)\big),    (5)

where cx and cy indicate the original hazy in-air domain label and the target underwater domain style, respectively. Gs takes the counterpart Gs(x|cy), its corresponding depth, and the original domain label cx as input and tries to reconstruct the original hazy in-air image. We adopt the L1 loss as our reconstruction loss. Note that we use the generator Gs twice: first to translate the hazy in-air RGB-D images into an underwater image in the target domain and then to reconstruct the hazy in-air RGB images from the translated images.
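Continuing the sketches above (and reusing the hypothetical build_gs_input helper), the cycle reconstruction of Eq. (5) could look like this.

```python
import torch

def reconstruction_loss(G_s, x, d, c_y, c_x):
    """L_rec (Eq. 5): translate in-air -> underwater with the target style c_y,
    then translate back with the original in-air label c_x and compare in L1."""
    y_fake = G_s(build_gs_input(x, d, c_y))       # first pass: to the underwater domain
    x_rec = G_s(build_gs_input(y_fake, d, c_x))   # second pass: back to the in-air domain
    return torch.mean(torch.abs(x - x_rec))
```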

3.2.2. Underwater Depth Estimation

3.2.2.1. Adversarial Loss

For the second underwater depth estimation procedure, the adversarial loss Ladvd is described as:

\mathcal{L}_{adv}^{d} = \min_{G_d}\max_{D_d}\Big\{\mathbb{E}_{(\tilde{y}, d) \sim P_{data}(\tilde{y}, d)}\big[(D_d(d)-1)^2\big] + \mathbb{E}_{\tilde{y} \sim P_{data}(\tilde{y})}\big[(D_d(\tilde{d}))^2\big]\Big\}, \quad \text{where } \tilde{d} = G_d\big(G_s(C(x, d, c_y))\big),    (6)

where Gd learns the mapping function from the synthesized underwater images ỹ to the in-air depth d as Gd(ỹ) → d. Dd is responsible for recognizing the fake components in the synthesized depth output d̃.

3.2.2.2. Depth Loss

For underwater depth estimation, a pixel-level distance between the estimated value and the ground truth, such as the 1-norm or 2-norm, is generally adopted to favor less blurring. However, we find that using only the pixel-level loss between the predicted depth map and the ground truth often leads to poor performance due to the influences of noise, water with various turbidity, etc. (please refer to section 4.3 for more details). To force the model to pay more attention to the objects, we make use of feature representations extracted from a pre-trained VGG19 network for multi-level information. We also introduce a pixel-level distance for low-level details. Finally, to obtain improved results, we combine the 1-norm loss and the multi-layer feature constraint between d̃ and d and define the depth loss, namely Ldepth:

\mathcal{L}_{depth} = \mathbb{E}\big[\lVert d - G_d(G_s(x \mid c_y)) \rVert_1\big] + \sum_{i=0}^{N} \mathbb{E}\big[\lVert \Phi^{(i)}(d) - \Phi^{(i)}(G_d(G_s(x \mid c_y))) \rVert_1\big],    (7)

Similarly, Φ(i) represents the features extracted at the i-th layer of the pre-trained network. Here, following the work of Wang et al. (2018b) and Wang C. et al. (2018), we compute the L1 distance at the selected layers i = 1, 6, 11, 20, and 29.
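A possible implementation of Ldepth is sketched below; treating the layer indices as positions in torchvision's vgg19().features and repeating single-channel depth maps to three channels for the VGG input are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class DepthLoss(nn.Module):
    """L_depth (Eq. 7): pixel-level L1 plus L1 between multi-layer VGG19 features
    of the predicted depth map and the ground truth."""

    def __init__(self, layers=(1, 6, 11, 20, 29)):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()
        self.layers = set(layers)

    def _features(self, depth):
        # Depth maps are single-channel; repeat to 3 channels for the VGG input.
        h = depth.repeat(1, 3, 1, 1) if depth.size(1) == 1 else depth
        feats = []
        for i, block in enumerate(self.vgg):
            h = block(h)
            if i in self.layers:
                feats.append(h)
        return feats

    def forward(self, d_pred, d_gt):
        loss = torch.mean(torch.abs(d_pred - d_gt))                 # pixel-level L1
        for f_p, f_g in zip(self._features(d_pred), self._features(d_gt)):
            loss = loss + torch.mean(torch.abs(f_p - f_g))          # feature-level L1
        return loss
```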

3.3. Full Objective

Finally, the objective functions for the two discriminators and the two generators can be written as:

\mathcal{L}_{D_s} = \mathcal{L}_{adv}^{s} + \alpha \mathcal{L}_{cls}^{r},    (8)
\mathcal{L}_{G_s} = \mathcal{L}_{adv}^{s} + \gamma \mathcal{L}_{rec} + \alpha \mathcal{L}_{cls}^{f} + \lambda \mathcal{L}_{syn},    (9)
\mathcal{L}_{D_d} = \mathcal{L}_{adv}^{d},    (10)
\mathcal{L}_{G_d} = \mathcal{L}_{adv}^{d} + \eta \mathcal{L}_{depth},    (11)

where α, γ, λ, and η are the hyperparameters that control the effect of each loss in the final objective function. We set α = 5, γ = 10, λ = 0.1, and η = 50 in all of our experiments, and we optimize the objective function with the Adam optimizer (Kingma and Ba, 2015). To choose appropriate weights, we design ablation studies for each hyperparameter except for γ; we follow StarGAN (Choi et al., 2018) and set γ = 10. For the choice of the remaining hyperparameters, please refer to section 4.3 for more details.
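For clarity, the four objectives could be assembled as below; the dictionary keys are illustrative names for the individual terms sketched earlier.

```python
def full_objectives(losses, alpha=5.0, gamma=10.0, lam=0.1, eta=50.0):
    """Combine the individual terms into Eqs. (8)-(11).
    `losses` maps illustrative keys to already-computed scalar loss tensors."""
    L_Ds = losses["adv_s_D"] + alpha * losses["cls_r"]                   # Eq. (8)
    L_Gs = (losses["adv_s_G"] + gamma * losses["rec"]
            + alpha * losses["cls_f"] + lam * losses["syn"])             # Eq. (9)
    L_Dd = losses["adv_d_D"]                                             # Eq. (10)
    L_Gd = losses["adv_d_G"] + eta * losses["depth"]                     # Eq. (11)
    return L_Ds, L_Gs, L_Dd, L_Gd
```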

4. Results

4.1. Datasets and Implementation Details

In our experiments, we translate the hazy in-air images to two underwater domains (green and blue). We choose the hazy D-Hazy dataset (Ancuti et al., 2016), which contains indoor scenes, as the source of in-air input images. For the two underwater domains, we adopt real underwater images from the SUN (Xiao et al., 2010), URPC,1 EUVP (Islam et al., 2020), UIEB (Li et al., 2019), and Fish datasets.2 We collect 1,031 blue and 1,004 green underwater images from these datasets and the Google website. The D-Hazy dataset (Ancuti et al., 2016) includes 1,449 images. We randomly choose 1,300 images as the in-air images x to train the model, and the remaining 149 images are selected for evaluation. We use random cropping to obtain 128 × 128 patches for training. For the evaluation stage, we use complete images of 256 × 256. The entire network is trained on one Nvidia GeForce GTX 1070 using the PyTorch framework. To avoid the mode collapse problem, we apply spectral normalization (Miyato et al., 2018) in both the discriminators and the generators. Because of the introduction of spectral normalization (Miyato et al., 2018), we use a two-timescale update rule (TTUR) following BigGAN (Brock et al., 2019) and SAGAN (Zhang et al., 2018). The Adam algorithm is applied with a learning rate of 0.0002 for the discriminators and 0.00005 for the generators. Because of the limited computing resources, we set the batch size to 10 and perform 100,000 training iterations in our experiments.
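A hedged sketch of the optimizer setup follows: spectral normalization wrapped around the convolutional layers and a two-timescale Adam configuration. Which layers are wrapped and the Adam betas are not stated in the paper, so they are assumptions here.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def apply_spectral_norm(module):
    """Recursively wrap conv/linear layers with spectral normalization."""
    for name, child in module.named_children():
        if isinstance(child, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
            setattr(module, name, spectral_norm(child))
        else:
            apply_spectral_norm(child)
    return module

def build_optimizers(G_s, G_d, D_s, D_d):
    """TTUR: a larger learning rate for the discriminators than for the generators."""
    opt_G = torch.optim.Adam(list(G_s.parameters()) + list(G_d.parameters()),
                             lr=5e-5, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(list(D_s.parameters()) + list(D_d.parameters()),
                             lr=2e-4, betas=(0.5, 0.999))
    return opt_G, opt_D
```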

4.2. Comparison Methods

Our method achieves underwater depth map estimation using multi-style synthesized underwater images. In this section, we first evaluate the performance of WaterGAN (Li et al., 2017), CycleGAN (Zhu et al., 2017a), StarGAN (Choi et al., 2018), UW-Net (Gupta and Mitra, 2019), StarGAN v2 (Choi et al., 2020), and our method on multiple synthetic underwater images. Additionally, to evaluate the effectiveness of underwater depth map estimation, we compare the results obtained using DCP (He et al., 2010), UDCP (Drews et al., 2016), Berman et al. (2017), Gupta and Mitra (2019), and our method.

4.2.1. Qualitative Evaluation

To evaluate the effectiveness of the proposed method, we perform underwater image synthesis on the NYUv2 (Silberman et al., 2012) and D-Hazy (Ancuti et al., 2016) datasets. Figure 3 shows a visual comparison of the synthesized underwater images generated by different methods. WaterGAN (Li et al., 2017) takes advantage of in-air RGB-D images to synthesize underwater images. As shown in Figure 3B, its results are somewhat single-hued and lack water characteristics. Although WaterGAN supports multi-style image generation, the two styles (blue and green) obtained by WaterGAN in Figure 3B are difficult to distinguish. The results of CycleGAN (Zhu et al., 2017a) retain most of the contents and structures of the original images, and compared with those of WaterGAN, they are more similar to natural underwater scenes, as shown in Figure 3C. However, the outputs of CycleGAN (Zhu et al., 2017a) include serious distortions of image details and incorrect depth information. StarGAN (Choi et al., 2018) can simultaneously translate in-air images into multiple underwater styles. However, its results lack the characteristics of real underwater images, such as depth information and clear structural information of the objects. Besides, many artifacts are observed in Figure 3D. UW-Net (Gupta and Mitra, 2019) also takes hazy in-air RGB-D images as input; its results, presented in Figure 3E, show fuzzy structures for the objects. The results of StarGAN v2 (Choi et al., 2020) are shown in Figure 3F. There is no denying that StarGAN v2 (Choi et al., 2020) possesses a powerful style network to extract style codes from reference images. However, the underwater images provided by StarGAN v2 fail to help the depth estimation task. As shown in Figure 3F, StarGAN v2 removes some objects and structural information during the image synthesis process, which makes the synthetic underwater images and their corresponding in-air depth maps mismatched. The quantitative results in section 4.2.2 further confirm this point.

Figure 3. Comparison of the visual quality of synthesized underwater images obtained by different methods. From left to right, (A) are original in-air images, (B–G) are the results of the WaterGAN (Li et al., 2017), CycleGAN (Zhu et al., 2017a), StarGAN (Choi et al., 2018), UW-Net (Gupta and Mitra, 2019), StarGAN v2 (Choi et al., 2020), and our method.

Our model is optimized to synthesize underwater images with multiple styles based on the unpaired datasets. The results of our method (Figure 3G), in which the structural information is well preserved, are better than those obtained from other methods in terms of visual quality.

For underwater depth map estimation, Figure 4 shows the results of our method and the methods of He et al. (2010) (DCP), Drews et al. (2016) (UDCP), Berman et al. (2017), and Gupta and Mitra (2019) on the underwater images from Berman et al. (2017). In Figures 4B–D, these methods fail to capture the relative depth of the scene with respect to the camera. Moreover, these methods mainly obtain the transmission maps of the scene, and their results show excessive texture leakage. Gupta and Mitra (2019) used an unsupervised method to estimate the depth map, obtaining the results shown in Figure 4E; this method appears to be better than the other methods, whose results are presented in Figures 4B–D. However, it still suffers from excessive texture leakage and only estimates the depth map for single-domain underwater images. Our results have a much more reasonable appearance with a linear depth variation. Moreover, we observe that our network successfully captures the depth information from multi-style underwater images. More results for real underwater images with different underwater characteristics are shown in Figure 5. Furthermore, both UW-Net (Gupta and Mitra, 2019) and our method use underwater images from the dataset provided by Berman et al. (2017) to fine-tune their depth map estimation models. We fine-tune our model for 10,000 iterations on Berman et al.'s dataset for better depth map estimation.

Figure 4. Comparison of our method with other underwater depth estimation methods. From left to right, (A) are real underwater images from the dataset of Berman et al. (2017), (B–F) are the results of DCP (He et al., 2010), UDCP (Drews et al., 2016), Berman et al. (2017), Gupta and Mitra (2019), and our method, and (G) are the ground truths.

Figure 5. The results of our model for depth map estimation. Every two rows from top to bottom are real underwater images with different illumination and scattering conditions and the results of our model for depth map estimation.

4.2.2. Quantitative Evaluation

The dataset of Berman et al. (2017) consists of 114 paired underwater RGB-D images from Katzaa, Michmoret, Nachsholim, and Satil. We use the 71 images belonging to the three regions Katzaa, Nachsholim, and Satil, because the Michmoret region has very few natural objects and consists of a single scene. Following UW-Net (Gupta and Mitra, 2019), we use two metrics for comparison, namely, the log scale-invariant mean squared error (SI-MSE) (Eigen et al., 2014) and the Pearson correlation coefficient (ρ). Because the depth map provided by the stereo camera is not complete (e.g., the ground truth of the white regions in Figure 7H is not provided), we only evaluate the pixels with a defined depth value in the ground truth (GT).
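The two metrics can be computed on the valid pixels as follows; the mask construction (e.g., gt > 0) is an assumption about how undefined depth values are marked.

```python
import numpy as np

def si_mse(pred, gt, mask):
    """Log-space scale-invariant MSE (Eigen et al., 2014), restricted to pixels
    with a defined ground-truth depth."""
    diff = np.log(pred[mask]) - np.log(gt[mask])
    n = diff.size
    return np.mean(diff ** 2) - (np.sum(diff) ** 2) / (n ** 2)

def pearson_rho(pred, gt, mask):
    """Pearson correlation coefficient between prediction and ground truth."""
    return np.corrcoef(pred[mask].ravel(), gt[mask].ravel())[0, 1]

# Illustrative usage: mask out pixels without a stereo ground truth.
# mask = gt > 0
# print(si_mse(pred, gt, mask), pearson_rho(pred, gt, mask))
```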

The underwater image synthesis assists in estimating depth maps from real underwater images. Thus, the extent to which the synthetic underwater images can boost the performance of underwater depth prediction is the key evaluation criterion. We evaluate the depth prediction performance against a series of state-of-the-art methods, consisting of WaterGAN (Li et al., 2017), CycleGAN (Zhu et al., 2017a), StarGAN (Choi et al., 2018), UW-Net (Gupta and Mitra, 2019), and StarGAN v2 (Choi et al., 2020). We aim to calculate the depth map estimation results on a semi-real underwater RGB-D dataset. UW-Net suggests that fine-tuning the models with a few unlabeled images from the target underwater environment could further boost the depth prediction performance. During the fine-tuning process, we only use the RGB underwater images, without the depth ground truth of the data from Berman et al., to show that our model can adapt well to a new environment. For a fair comparison, we fine-tune all models to generate an underwater style similar to that of the dataset of Berman et al.

Although our model already provides a solution for the depth estimation task, we choose a typical independent supervised image-to-image model, pix2pix (Isola et al., 2017), to fairly evaluate the potential of synthetic underwater images for depth prediction. We use identical pix2pix models to learn the mapping function between the underwater images generated by the different underwater image synthesis methods and their corresponding in-air depth maps. Finally, we test and evaluate all models on the dataset of Berman et al. Table 1 shows the results; our model obtains higher ρ values and a lower SI-MSE.

Table 1. Quantitative comparison of our method and other methods for underwater image synthesis.

For the underwater depth estimation task, Table 2 shows the quantitative results. Our method obtains the least scale-invariant error (SI-MSE) (Eigen et al., 2014) and the highest Pearson correlation coefficient (ρ).

Table 2. Quantitative comparison of our method and other methods on the dataset of Berman et al. (2017).

We also compare the number of parameters and the Floating Point Operations (FLOPs) (Tan and Le, 2019) of the different generators in Table 3. In the case of CycleGAN, we only count the FLOPs and parameters of a single generator. We find that the proposed method achieves better performance with fewer network parameters and lower computational cost. Benefiting from the dense blocks, the Gd of our model has fewer parameters and FLOPs than Gs. Please note that Gs is only used in the training stage; in the testing phase, we only need Gd to estimate the depth map.
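Parameter counts can be obtained directly in PyTorch; FLOPs are usually measured with a separate profiling tool on a dummy 256 × 256 input, which we leave out here.

```python
def count_parameters_m(model):
    """Total number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Illustrative usage:
# print(count_parameters_m(G_s), count_parameters_m(G_d))
```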

Table 3. Comparison of Floating Point Operations (FLOPs) and total number of parameters among different generators with a size of 256 × 256.

4.3. Ablation Study

4.3.1. Loss Selection of Underwater Image Synthesis

To preserve clear structural information, we consider the perceptual loss Lsyn, the structural similarity index (SSIM) loss Lssim, and the multiscale structural similarity index (MS-SSIM) loss Lmsssim as candidate structural losses. We evaluate the effectiveness of each loss, including Lsyn, Lssim, and Lmsssim, and based on the visual quality of the synthesized underwater images and the results of depth map estimation, we choose the perceptual loss. To verify the effectiveness of the extra losses in our network, we design ablation experiments and perform a comparison on D-Hazy (Ancuti et al., 2016), which consists of 1,449 images. Figure 6 shows that each loss affects the quality of the generated underwater images. It can be observed from Figure 6B that the underwater images generated using ResNet without any extra loss have more color blocks and artifacts. Additionally, training in this setting is extremely unstable and tends to produce color inversions and serious distortions. In Figures 6C,D, many artifacts are still retained for ResNet with Lssim or Lmsssim. Table 4 shows the results of depth map estimation based on different synthetic underwater image datasets, which are generated by ResNet and ResNet with extra losses, respectively. Using Lsyn, we obtain the best underwater depth map estimation results. Based on the experiments mentioned above, we introduce the perceptual loss Lsyn to preserve the details and restrain the artifacts, as shown in Figure 6E. To minimize the negative effects of the synthesized images, we design experiments to determine the proper weights α and λ. In Table 5, we show the results for different values of α and λ. We note that both UW-Net and our model can be fine-tuned on the dataset of Berman et al. to obtain better underwater depth map estimation results. Fine-tuning provides a flexible approach for adapting our model and estimating depth maps of unexplored underwater regions within a relatively short period.

Figure 6. Sample results of our method for synthesizing underwater images using different losses. Lssim, Lmsssim and Lsyn, respectively represent SSIM loss, MS-SSIM loss, and perceptual loss. (A) Are in-air images, (B) are the results without any structural loss (Baseline), (C–E) are the results with Lssim, Lmsssim, and Lsyn, respectively.

Table 4. Comparison of our method for the synthesis of underwater images with different combinations.

Table 5. Comparison of weights used in the objective function of our model, including α and λ.

4.3.2. The Design of Underwater Depth Map Estimation

With the support of synthetic paired RGB-D data, we consider the L1 loss, L2 loss, Lssim loss, or Lmsssim loss to learn the mapping functions for supervised depth map prediction. During training, we observe that all of the above-mentioned losses are insufficient to generate accurate depth maps. The results in Figures 7B–E show that depth prediction based on these losses is easily affected by shape, noise, etc. As mentioned in section 3.2.2, we design the depth loss Ldepth to make better use of low-level and high-level feature information and avoid the risk of texture leakage. We take advantage of a pre-trained VGG19 network to extract feature maps from the generated depth maps and the ground truths, and we assume that the feature maps of the generated depth map and its corresponding ground truth at each selected layer should be equal. The loss Ldepth makes our model pay more attention to the objects and the relative distances in the underwater images. Inspired by Wang et al.'s work (Wang C. et al., 2018), we also attempt to extract feature maps from the discriminator Dd, yielding a loss we denote Lpan, rather than from a pre-trained VGG19 network. In Figure 7F, we can see that our model with Lpan often produces incorrect boundary predictions because, compared with Ldepth, the discriminator Dd has too few layers to extract high-level feature maps. Furthermore, we investigate the optimal setting of η with a greedy search strategy (Table 6) and find that η = 50 is the best choice among the tested values.

Figure 7. Effectiveness evaluation of the L1, L2, Lssim, Lmsssim, and Ldepth. From left to right, respectively, (A) are real underwater images, (B–H) are the results of depth map estimation with L1 loss, L2 loss, Lssim, Lmsssim, Lpan, Ldepth, and their corresponding ground truths.

Table 6. Results with different η values.

Based on Figure 7 and Table 7, we can conclude that the depth maps estimated using the Ldepth loss are more accurate and continuous. The results show sharper outlines, and the relative distances and the objects can be clearly distinguished.

Table 7. Quantitative comparison of our method with different losses on the dataset of Berman et al. (2017).

5. Discussions and Conclusion

To further explore the potential of our model for depth prediction, we considered the work by Li et al. (2018) and prepared a more complex underwater image dataset including four different styles. In this experiment, we still consider the depth map as a conditional input to synthesize a corresponding underwater image, but we do not utilize physical parameters (e.g., the water turbidity or any optical parameters) for the unpaired image-to-image translation. Instead, we roughly divide the images with different water turbidity into four groups and follow the manner of StarGAN (Choi et al., 2018) to perform conditional image translation. Some synthetic examples of the four different styles are shown in Figure 8. Due to the lack of ground-truth depth maps, we cannot quantitatively evaluate the effectiveness of our model for multi-style underwater depth map estimation. Instead, we prepared several qualitative evaluation results, as shown in Figure 9. Intuitively, we find that the depth estimation of a side-view underwater image is better than that of a vertical-view image. This result is caused by the lack of vertical-view in-air images in the D-Hazy dataset, which are required to produce sufficient synthetic vertical-view underwater images. We plan to improve the performance on this point through data augmentation in the future.

Figure 8. Sample results for the synthesis of underwater images. (A) Show in-air images. (B–E) Represent blue style, green style, white style and yellow style, respectively.

Figure 9. Multi-style underwater depth map estimation. Every two rows from top to bottom show real underwater images of four different water types and the corresponding depth maps predicted by our model.

In this paper, we proposed an end-to-end system that can synthesize multi-style underwater images using one-hot encoding and estimate underwater depth maps. The system can convert in-air RGB-D images into more realistic underwater images with multiple water-color styles. We then use the synthesized underwater RGB images to construct a semi-real underwater RGB-D dataset. With the synthetic underwater RGB-D dataset, our model can learn to estimate underwater depth maps using supervised learning. Finally, we compare our method with existing state-of-the-art methods for synthesizing underwater images and estimating underwater depth maps, and we verify that our method outperforms these methods both qualitatively and quantitatively. Furthermore, our model can be fine-tuned on untrained datasets to synthesize a similar underwater style, which effectively allows our model to be applied to depth map estimation on new underwater datasets.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author Contributions

QZ performed the experiments and wrote the manuscript. ZZ and HZe revised the manuscript. ZY provided the ideas and revised the article. HZh and BZ provided advice and GPU devices for parallel computing. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the finance science and technology project of Hainan province of China under Grant Number ZDKJ202017 and the National Natural Science Foundation of China under Grant Numbers 62171419 and 61701463.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

References

Abas, P. E., and De Silva, L. C. (2019). Review of underwater image restoration algorithms. IET Image Process. 13, 1587–1596. doi: 10.1049/iet-ipr.2019.0117

Akkaynak, D., and Treibitz, T. (2019). “Sea-thru: a method for removing water from underwater images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA: Computer Vision Foundation; IEEE), 1682–1691. doi: 10.1109/CVPR.2019.00178

Ancuti, C., Ancuti, C. O., and De Vleeschouwer, C. (2016). “D-hazy: a dataset to evaluate quantitatively dehazing algorithms,” in IEEE International Conference on Image Processing (IEEE), 2226–2230. doi: 10.1109/ICIP.2016.7532754

Berman, D., Levy, D., Avidan, S., and Treibitz, T. (2020). Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Trans. Pattern Anal. Mach. Intell. 43, 2822–2837. doi: 10.1109/TPAMI.2020.2977624

Berman, D., Treibitz, T., and Avidan, S. (2017). “Diving into haze-lines: color restoration of underwater images,” in Proceedings of the British Machine Vision Conference (BMVA Press).

Brock, A., Donahue, J., and Simonyan, K. (2019). “Large scale GAN training for high fidelity natural image synthesis,” in Proceedings of the 7th International Conference on Learning Representations, ICLR 2019 (New Orleans, LA).

Chen, R., Huang, W., Huang, B., Sun, F., and Fang, B. (2020). “Reusing discriminators for encoding: towards unsupervised image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Seattle, WA: IEEE), 8168–8177. doi: 10.1109/CVPR42600.2020.00819

Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. (2018). “Stargan: unified generative adversarial networks for multi-domain image-to-image translation,” in IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT: IEEE Computer Society), 8789–8797. doi: 10.1109/CVPR.2018.00916

Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. (2020). “Stargan v2: diverse image synthesis for multiple domains,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT: IEEE), 8188–8197. doi: 10.1109/CVPR42600.2020.00821

Churnside, J. H., Marchbanks, R. D., Lembke, C., and Beckler, J. (2017). Optical backscattering measured by airborne lidar and underwater glider. Remote Sens. 9:379. doi: 10.3390/rs9040379

Dancu, A., Fourgeaud, M., Franjcic, Z., and Avetisyan, R. (2014). “Underwater reconstruction using depth sensors,” in Special Interest Group Graph. Interact. Techn (Association for Computing Machinery), 1–4. doi: 10.1145/2669024.2669042

Deris, A., Trigonis, I., Aravanis, A., and Stathopoulou, E. (2017). Depth cameras on UAVs: a first approach. Int. Arch. Photogr. Remote Sens. Spat. Inform. Sci. 42:231. doi: 10.5194/isprs-archives-XLII-2-W3-231-2017

Drews, P. L., Nascimento, E. R., Botelho, S. S., and Campos, M. F. M. (2016). Underwater depth estimation and image restoration based on single images. IEEE Comput. Graph. Appl. 36, 24–35. doi: 10.1109/MCG.2016.26

Eigen, D., Puhrsch, C., and Fergus, R. (2014). “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems (Montreal, QC), 2366–2374.

Gomez Chavez, A., Ranieri, A., Chiarella, D., Zereik, E., Babić, A., and Birk, A. (2019). Caddy underwater stereo-vision dataset for human-robot interaction (HRI) in the context of diver activities. J. Mar. Sci. Eng. 7:16. doi: 10.3390/jmse7010016

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). “Generative adversarial nets,” in Advances in Neural Information Processing Systems, eds Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Montreal, QC), 2672–2680.

Gupta, H., and Mitra, K. (2019). “Unsupervised single image underwater depth estimation,” in IEEE International Conference on Image Processing (IEEE), 624–628. doi: 10.1109/ICIP.2019.8804200

He, K., Sun, J., and Tang, X. (2010). “Single image haze removal using dark channel prior”, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009) (Miami, FL: IEEE Computer Society), 1956–1963.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 770–778. doi: 10.1109/CVPR.2016.90

Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. (2018). “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision (ECCV), 172–189. doi: 10.1007/978-3-030-01219-9_11

Islam, M. J., Xia, Y., and Sattar, J. (2020). Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 5, 3227–3234. doi: 10.1109/LRA.2020.2974710

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). “Image-to-image translation with conditional adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI: IEEE Computer Society), 1125–1134. doi: 10.1109/CVPR.2017.632

Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., and Bengio, Y. (2017). “The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11–19. doi: 10.1109/CVPRW.2017.156

Johnson, J., Alahi, A., and Fei-Fei, L. (2016). “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 694–711. doi: 10.1007/978-3-319-46475-6_43

Kingma, D. P., and Ba, J. (2015). “Adam: a method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations, eds. Y. Bengio and Y. LeCun (San Diego, CA).

Kupyn, O., Martyniuk, T., Wu, J., and Wang, Z. (2019). “Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better,” in Proceedings of the IEEE International Conference on Computer Vision (Seoul: IEEE), 8878–8887. doi: 10.1109/ICCV.2019.00897

Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. (2018). “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision (ECCV), 35–51. doi: 10.1007/978-3-030-01246-5_3

Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., et al. (2019). An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 29, 4376–4389. doi: 10.1109/TIP.2019.2955241

Li, J., Skinner, K. A., Eustice, R., and Johnson-Roberson, M. (2017). WaterGAN: unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 3, 387–394. doi: 10.1109/LRA.2017.2730363

Li, N., Zheng, Z., Zhang, S., Yu, Z., Zheng, H., and Zheng, B. (2018). The synthesis of unpaired underwater images using a multistyle generative adversarial network. IEEE Access 6, 54241–54257. doi: 10.1109/ACCESS.2018.2870854

Liu, M.-Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., et al. (2019). “Few-shot unsupervised image-to-image translation,” in Proceedings of the IEEE International Conference on Computer Vision (Seoul: IEEE), 10551–10560. doi: 10.1109/ICCV.2019.01065

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. (2017). “Least squares generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (Venice: IEEE Computer Society), 2813–2821. doi: 10.1109/ICCV.2017.304

Massot-Campos, M., and Oliver-Codina, G. (2015). Optical sensors and methods for underwater 3d reconstruction. Sensors 15, 31525–31557. doi: 10.3390/s151229864

Mirza, M., and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

Odena, A., Olah, C., and Shlens, J. (2017). “Conditional image synthesis with auxiliary classifier GANs,” in International Conference on Machine Learning, eds D. Precup and Y. W. Teh (Sydney, NSW: PMLR), 2642–2651.

Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. (2019). “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA: Computer Vision Foundation; IEEE), 2337–2346. doi: 10.1109/CVPR.2019.00244

Peng, Y.-T., Zhao, X., and Cosman, P. C. (2015). “Single underwater image enhancement using depth estimation based on blurriness,” in IEEE International Conference on Image Processing (IEEE), 4952–4956. doi: 10.1109/ICIP.2015.7351749

Pérez, J., Bryson, M., Williams, S. B., and Sanz, P. J. (2020). Recovering depth from still images for underwater dehazing using deep learning. Sensors 20:4580. doi: 10.3390/s20164580

Ronneberger, O., Fischer, P., and Brox, T. (2015). “U-net: convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer Assisted Intervention (Springer), 234–241. doi: 10.1007/978-3-319-24574-4_28

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012). “Indoor segmentation and support inference from RGBD images,” in European Conference on Computer Vision (Springer), 746–760. doi: 10.1007/978-3-642-33715-4_54

Tan, M., and Le, Q. (2019). “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning (PMLR), 6105–6114.

Ueda, T., Yamada, K., and Tanaka, Y. (2019). “Underwater image synthesis from rgb-d images and its application to deep underwater image restoration,” in 2019 IEEE International Conference on Image Processing (ICIP) (IEEE), 2115–2119. doi: 10.1109/ICIP.2019.8803195

Wang, C., Xu, C., Wang, C., and Tao, D. (2018). Perceptual adversarial networks for image-to-image transformation. IEEE Trans. Image Process. 27, 4066–4079. doi: 10.1109/TIP.2018.2836316

Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., et al. (2018a). “Video-to-video synthesis,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, eds S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Montreal, QC), 1152–1164.

Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. (2018b). “High-resolution image synthesis and semantic manipulation with conditional GANs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT: IEEE Computer Society), 8798–8807. doi: 10.1109/CVPR.2018.00917

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. (2010). “Sun database: Large-scale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition (San Francisco, CA: IEEE Computer Society), 3485–3492. doi: 10.1109/CVPR.2010.5539970

Ye, X., Li, Z., Sun, B., Wang, Z., Xu, R., Li, H., et al. (2019). Deep joint depth estimation and color correction from monocular underwater images based on unsupervised adaptation networks. IEEE Trans. Circ. Syst. Video Technol. 30, 3995–4008. doi: 10.1109/TCSVT.2019.2958950

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2018). “Self-attention generative adversarial networks,” in Proceedings of the 36th International Conference on Machine Learning, eds K. Chaudhuri and R. Salakhutdinov (Long Beach, CA: PMLR), 7354–7363.

Zhang, Y., and Yang, Q. (2017). A survey on multi-task learning. arXiv preprint arXiv:1707.08114.

Zheng, Z., Wu, Y., Han, X., and Shi, J. (2020). “ForkGAN: seeing into the rainy night,” in Computer Vision-ECCV 2020: 16th European Conference (Glasgow: Springer), 155-170. doi: 10.1007/978-3-030-58580-8_10

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017a). “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (Venice: IEEE Computer Society), 2223–2232. doi: 10.1109/ICCV.2017.244

Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., et al. (2017b). “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, eds I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Long Beach, CA), 465–476.

Appendix

Generator Architectures

In our experiments, the generator Gs, following CycleGAN (Zhu et al., 2017a) and StarGAN (Choi et al., 2018), is described in Figure A1. Here, Convolution denotes a 7 × 7 Convolution-InstanceNorm-ReLU layer with 64 filters and stride 1. Convolution/down denotes a 4 × 4 Convolution-InstanceNorm-ReLU layer with stride 2. Residual block denotes a residual block that contains two 3 × 3 Convolution-InstanceNorm-ReLU layers with the same number of filters in both layers. Deconvolution denotes a 4 × 4 fractional-strided Convolution-InstanceNorm-ReLU layer with stride 2.

Figure A1. The network architecture of the generator Gs. It is a general ResNet (He et al., 2016) network for image-to-image translation.
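The following PyTorch sketch mirrors the description of Gs above and Figure A1; the channel progression (64 → 128 → 256), the number of down/up-sampling stages, and the final 7 × 7 tanh output layer follow the StarGAN reference design and are our assumptions where the text does not state them.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 Convolution-InstanceNorm-ReLU layers with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch, affine=True), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch, affine=True))

    def forward(self, x):
        return x + self.block(x)

class GeneratorS(nn.Module):
    """StarGAN/CycleGAN-style encoder-decoder with six residual blocks (Figure A1)."""
    def __init__(self, in_ch, out_ch=3):
        super().__init__()
        layers = [nn.Conv2d(in_ch, 64, 7, 1, 3), nn.InstanceNorm2d(64, affine=True), nn.ReLU(inplace=True)]
        ch = 64
        for _ in range(2):                                   # two 4x4 stride-2 downsamplings
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2, affine=True), nn.ReLU(inplace=True)]
            ch *= 2
        layers += [ResidualBlock(ch) for _ in range(6)]      # six residual blocks
        for _ in range(2):                                   # two 4x4 stride-2 upsamplings
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, 2, 1), nn.InstanceNorm2d(ch // 2, affine=True), nn.ReLU(inplace=True)]
            ch //= 2
        layers += [nn.Conv2d(ch, out_ch, 7, 1, 3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```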

The generator Gd, following Jégou et al. (2017), is based on dense blocks (DB), as shown in Figure A2. Convolution denotes a 3 × 3 Convolution-BatchNorm-ReLU layer with 32 filters and stride 1. Transition down consists of a maxpool2d operation and a 1 × 1 Convolution-BatchNorm-ReLU layer with the same number of filters and stride 1. Transition up denotes a 4 × 4 deconvolution layer with the same number of filters and stride 2. Dense block denotes four 3 × 3 BatchNorm-ReLU-Convolution layers with 12 filters and stride 1; the output of the dense block is the concatenation of the outputs of the four layers and the input. The encoder and the decoder are connected with skip connections.

Figure A2. The network architecture of the generator Gd. Following the work of UW-Net (Gupta and Mitra, 2019), we choose DenseNet (Jégou et al., 2017) as the generator Gd.
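A dense block matching the description above could be sketched as follows; the surrounding transition-down/up wiring and the encoder-decoder skip connections are omitted for brevity.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Four 3x3 BatchNorm-ReLU-Convolution layers with 12 filters each; the block
    output concatenates the input with the four layer outputs (channels grow by 4 x 12)."""
    def __init__(self, in_ch, growth=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, 3, 1, 1)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```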

Discriminator Architectures

For the discriminator networks, we use 70 × 70 PatchGANs (Isola et al., 2017; Zhu et al., 2017a). We do not use InstanceNorm or BatchNorm in any layer and use leaky ReLUs with a slope of 0.2. The discriminator Ds has two outputs, from the discrimination branch and the classification branch. In contrast, the discriminator Dd has only one discrimination output.
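A hedged sketch of Ds with the two output branches follows; the layer widths and the global pooling used for the classification branch are our assumptions, while the leaky ReLU slope of 0.2 and the absence of normalization layers follow the description above.

```python
import torch.nn as nn

class DiscriminatorS(nn.Module):
    """70x70 PatchGAN-style D_s with a real/fake patch map and a domain-classification output."""
    def __init__(self, in_ch=3, num_domains=2, base=64):
        super().__init__()
        layers, ch = [nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)], base
        for _ in range(2):                                   # two more stride-2 stages
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, ch * 2, 4, 1, 1), nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
        self.features = nn.Sequential(*layers)
        self.src = nn.Conv2d(ch, 1, 4, 1, 1)                       # real/fake patch map
        self.cls = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(ch, num_domains, 1))    # style classification logits

    def forward(self, x):
        h = self.features(x)
        return self.src(h), self.cls(h).flatten(1)
```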

Keywords: underwater vision, underwater depth map estimation, underwater image translation, generative adversarial network, image-to-image translation

Citation: Zhao Q, Zheng Z, Zeng H, Yu Z, Zheng H and Zheng B (2021) The Synthesis of Unpaired Underwater Images for Monocular Underwater Depth Prediction. Front. Mar. Sci. 8:690962. doi: 10.3389/fmars.2021.690962

Received: 05 April 2021; Accepted: 23 August 2021;
Published: 17 September 2021.

Edited by:

Jun Li, University of Technology Sydney, Australia

Reviewed by:

Boguslaw Cyganek, AGH University of Science and Technology, Poland
Ancheng Lin, University of Technology Sydney, Australia

Copyright © 2021 Zhao, Zheng, Zeng, Yu, Zheng and Zheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhibin Yu, yuzhibin@ouc.edu.cn

These authors have contributed equally to this work
