1 Introduction

The year 2010 is regarded as the breakthrough year of 3D video and the 3D industry [1]. Spectacular 3D films have been delivered to the market, beginning with the first widely known masterpiece, Avatar, and many excellent works have followed. By providing two slightly different perspectives, 3D film technology lets audiences wearing stereoscopic glasses in the cinema feel as if they were in the real scene. The prosperity of the 3D industry has driven the development of 3DTV, making 3DTV the next generation after high-definition TV (HDTV). Free viewpoint video (FVV) has attracted wide attention because of the greater immersion and freedom it offers users.

Generating all required viewpoint videos from a small set of known viewpoints is a crucial technology for FVV. A common representation of a 3D scene is video plus depth (V+D). In general, the texture information comes from the video stream, while the geometric information of the scene is provided by the depth stream, which records the distance between each object and the camera. The depth-image-based rendering (DIBR) method uses the texture and depth information of the known viewpoints to generate videos of other viewpoints, greatly reducing the bandwidth burden. DIBR has therefore become an important method in FVV systems.

Although DIBR can generate virtual images at arbitrary viewpoints, the quality of these images is often unsatisfactory because of imprecise depth maps and occlusion between objects. In the spatial domain, the virtual image contains artifacts and holes. Artifacts are mainly caused by misalignment between the color map and the depth map, especially at object edges. In addition, slight illumination differences between reference viewpoints cause pixel values warped from different reference views to disagree. Apart from artifacts, holes are another challenging issue: some pixel positions in the virtual image receive no warped value at all. Holes can be classified into two types according to their causes: cracks and disocclusions.

Cracks are only one to two pixels wide and arise from integer rounding errors during warping from the reference views. In contrast, disocclusions are larger holes caused by unavoidable occlusion between objects, which leaves information missing in the virtual view.

In addition to these spatial issues, the virtual video generated by DIBR suffers from temporal discontinuity. Most virtual view synthesis algorithms process each frame individually and ignore the correlation between consecutive frames, which leads to frequent flickering in the virtual video, especially at object edges.

2 Related work

For artifacts, the main concerns are the color difference between reference viewpoints and the discontinuity of depth values at object edges in the depth map. The works in [2, 3] performed color correction by converting the color distribution between viewpoints based on estimated camera characteristics. Fezza et al. [4] used an improved histogram matching algorithm for color correction in the common area of the views and applied temporal sliding windows to maintain temporal correlation. Loghman et al. [5] used a multi-threshold segmentation method to distinguish the foreground from the background and mapped the segments separately. Yao et al. [6] combined depth-based image fusion with direct image fusion to reduce the ghosting effect. Luo et al. [7] extracted and separated the foreground objects and filled the holes with a relatively stable background obtained from a Gaussian mixture model. Although this method can effectively avoid artifacts, constructing the background is complicated and the model parameters are difficult to choose.

For holes, Do et al. [8] filled holes with distance-weighted sums of non-hole background pixels in eight directions around the holes. Exemplar-based image inpainting [9] can restore the texture inside holes well. Daribo et al. [10] improved Criminisi’s inpainting algorithm by adding depth information. Rahaman et al. [11] used the number of models in a Gaussian mixture model (GMM) to separate background and foreground pixels, and then recovered the missing pixels from an adaptively weighted average of the pixel intensities of the corresponding GMM model(s) and the warped image. Li et al. [12] located useful pixels in the complementary views to reduce the holes.

Temporal continuity of the image sequence has also received attention. Chen et al. [13] used motion vectors to obtain texture information from different frames to fill holes. The works in [14,15,16] separated depth layers by probability analysis, a Gaussian mixture model, and the structural similarity index, respectively, and extracted the static background information of the scene. Hsu et al. [17] filled holes with a global optimization method and used image inpainting to restore the spatial discontinuities of the background structure. Choi et al. [18] combined the current frame and previous frames to find the best matching block for each hole. Muddala et al. [19] identified occlusions through layered depth images and repaired them using temporal frame information and motion estimation. Xi et al. [16] maintained temporal continuity by extracting the static background image of the scene and measured it using the peak signal-to-noise ratio (PSNR). Schmeing et al. [20] pointed out the importance of temporal continuity and introduced a quantitative indicator for counting flicker. Liu et al. [21] proposed a full-reference objective video quality assessment (VQA) method for temporal flicker distortion and changes in spatio-temporal activity in synthesized video. Although these methods have made many attempts at temporal continuity and can reduce flicker in the video sequence to some degree, they do not address the spatio-temporal continuity of the image sequence jointly, so the quality of the virtual viewpoint images remains unsatisfactory in both subjective and objective evaluation.

In this paper, we propose a virtual view synthesis method based on spatio-temporal continuity. To maintain temporal continuity, we make full use of the temporal information in the video sequence and exploit the relationship between adjacent frames to extract a static background image, which helps to remove artifacts and fill holes. We also propose a weighted-fusion hole-filling method based on the static background image, which not only fills the holes but also maintains temporal continuity and largely avoids flicker.

3 Proposed view synthesis method

We propose a virtual view synthesis method based on spatio-temporal continuity; its framework is shown in Fig. 1. The left and right reference viewpoints are combined to synthesize the intermediate virtual viewpoint, with the left reference view serving as the primary view. The static background of the scene is extracted from adjacent frames of the reference viewpoint image sequence to maintain temporal continuity. A weighted-fusion hole-filling method is then used to fill the remaining holes. The algorithm operates in RGB color space and greatly improves the image quality of the virtual viewpoint.

Fig. 1 The diagram of the proposed view synthesis method

3.1 Depth map processing

The inaccuracy of the depth map affects the rendering quality of the virtual viewpoint, especially in the transitional area between foreground and background. These transitional regions are smooth in the color map but sharp in the depth map, as shown in Fig. 2. This asymmetry between the foreground-background boundaries in the color and depth maps produces artifacts in the virtual view. To avoid artifacts as much as possible, the depth map must be preprocessed.

Fig. 2 The asymmetry in the transition area between foreground and background boundaries in color and depth maps

First, the edge contour of the object in the horizontal and vertical directions in the depth map is detected. The detection method is as follows:

$$ dir\left(i,j\right)=\begin{cases}1, & D\left(i,j\right)-D\left(i+1,j\right)\ge T\ \text{ or }\ D\left(i,j\right)-D\left(i,j+1\right)\ge T\\ -1, & D\left(i,j\right)-D\left(i-1,j\right)\ge T\ \text{ or }\ D\left(i,j\right)-D\left(i,j-1\right)\ge T\end{cases} $$
(1)

where dir(i, j) represents the direction in which (i, j) pixel needs to expand, D(i, j) is the depth value of the (i, j) pixel in the depth map, and T is a fixed threshold.
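A minimal sketch of this detection step is given below, assuming the depth map is stored as a 2D NumPy array; the function name and the default threshold value are illustrative and not taken from the paper.

```python
import numpy as np

def detect_expansion_direction(depth, T=15):
    """Compute the dir map of Eq. (1): +1 where the right/lower neighbor is
    much farther away, -1 where the left/upper neighbor is, 0 elsewhere.
    The threshold T=15 is an assumed value."""
    D = depth.astype(np.int32)
    dir_map = np.zeros(D.shape, dtype=np.int8)

    d_down = np.zeros_like(D)
    d_down[:-1, :] = D[:-1, :] - D[1:, :]      # D(i,j) - D(i+1,j)
    d_right = np.zeros_like(D)
    d_right[:, :-1] = D[:, :-1] - D[:, 1:]     # D(i,j) - D(i,j+1)
    d_up = np.zeros_like(D)
    d_up[1:, :] = D[1:, :] - D[:-1, :]         # D(i,j) - D(i-1,j)
    d_left = np.zeros_like(D)
    d_left[:, 1:] = D[:, 1:] - D[:, :-1]       # D(i,j) - D(i,j-1)

    dir_map[(d_down >= T) | (d_right >= T)] = 1
    dir_map[(d_up >= T) | (d_left >= T)] = -1  # -1 wins if both conditions hold
    return dir_map
```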

Next, the detected object edges are expanded toward the background. The positions of the expanded pixels are given by:

$$ \begin{cases}{i}^{\prime }=i+ dir\left(i,j\right)\cdot x\\ {j}^{\prime }=j+ dir\left(i,j\right)\cdot x\end{cases} $$
(2)

where i′ and j′ are the horizontal and vertical coordinates of the expanded pixel position, and x is the number of expanded pixels. The expanded edge pixel values are as follows:

$$ D\left({i}^{\prime },j\right)=D\left(i,{j}^{\prime}\right)=D\left(i,j\right) $$
(3)

After expansion, Gaussian smoothing is applied to pixels with a large depth difference between foreground and background. The depth image is traversed in the horizontal direction, the difference between each pair of adjacent pixels is compared with a preset threshold, and the resulting filter area is smoothed with a Gaussian filter. The filter area is determined as follows:

$$ \mathrm{filter}\left(i,j\right)=\begin{cases}\mathrm{True}, & D\left(i,j\right)-D\left(i,j-1\right)\ge T\ \text{ or }\ D\left(i,j\right)-D\left(i,j+1\right)\ge T\\ \mathrm{False}, & \text{otherwise}\end{cases} $$
(4)

where filter(i,j) indicates whether the pixel needs Gaussian filtering.
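The following sketch illustrates the expansion and smoothing steps of Eqs. (2)-(4), assuming the dir map from the previous sketch; the expansion width x, the kernel size, and sigma are assumed parameter values, not ones stated in the paper.

```python
import cv2
import numpy as np

def preprocess_depth(depth, dir_map, T=15, x=3, ksize=5, sigma=1.5):
    """Sketch of Eqs. (2)-(4): push edge depth values x pixels toward the
    background, then Gaussian-smooth pixels with a large horizontal depth
    jump.  T, x, ksize, and sigma are assumed parameter values."""
    D = depth.copy()
    h, w = D.shape

    # Eqs. (2)-(3): copy each detected edge pixel's depth to the expanded
    # positions along its row and column.
    for i, j in zip(*np.nonzero(dir_map)):
        d = int(dir_map[i, j])
        for s in range(1, x + 1):
            if 0 <= i + d * s < h:
                D[i + d * s, j] = depth[i, j]
            if 0 <= j + d * s < w:
                D[i, j + d * s] = depth[i, j]

    # Eq. (4): mark pixels whose horizontal neighbors differ strongly in depth.
    Di = D.astype(np.int32)
    to_filter = np.zeros(D.shape, dtype=bool)
    to_filter[:, 1:] |= (Di[:, 1:] - Di[:, :-1]) >= T    # vs. left neighbor
    to_filter[:, :-1] |= (Di[:, :-1] - Di[:, 1:]) >= T   # vs. right neighbor

    blurred = cv2.GaussianBlur(D, (ksize, ksize), sigma)
    D[to_filter] = blurred[to_filter]
    return D
```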

Figure 3 compares the depth map before and after preprocessing. Before preprocessing, the edge transition regions are sharp; after filtering, they become smooth and correspond better to the color map.

Fig. 3 Contrast before and after preprocessing in depth map. a Before preprocessing. b After preprocessing

3.2 Static scene extraction

The relationship between frames is used to maintain temporal continuity. The static portions between every two consecutive frames are extracted and accumulated along the time direction.

The structural similarity (SSIM) measure can extract most of the static scene by comparing the similarity between corresponding image blocks of two frames. Nevertheless, due to occlusion between objects, some static background regions cannot be distinguished this way, and depth information must also be used.

The global static background consists of a static background color image and a static background depth image, initialized respectively with the color map and depth map of the first frame of the reference viewpoint. The initialization is as follows:

$$ \begin{cases}{C}_g(p)={C}_t(p)\\ {D}_{g}(p)={D}_t(p)\end{cases}\kern0.5em t=0 $$
(5)

where Cg is the global static background color map, Dg is the global static background depth map, Ct and Dt represent the color map and depth map of the reference viewpoint in frame t, and p is the pixel on the image.

After initialization, the color map TCg and depth map TDg of the background at the current frame are extracted by comparing the current frame with the previous frame of the video sequence. For a pixel that is static across adjacent frames, the SSIM value is close to 1. The SSIM value between two pixels, denoted PSSIM, is calculated as follows:

$$ P_{\mathrm{SSIM}}=\frac{\left(2{\mu}_{\varphi_t}{\mu}_{\varphi_{t-1}}+{K}_1\right)\left(2{\sigma}_{\varphi_{t\left(t-1\right)}}+{K}_2\right)}{\left({\mu_{\varphi_t}}^2+{\mu_{\varphi_{t-1}}}^2+{K}_1\right)\left({\sigma_{\varphi_t}}^2+{\sigma_{\varphi_{t-1}}}^2+{K}_2\right)} $$
(6)

where \( {\varphi}_t \) and \( {\varphi}_{t-1} \) are the corresponding matching blocks in Ct and Ct−1, respectively; each matching block is centered on the pixel p and has side length M. \( {\mu}_{\varphi_t} \) and \( {\mu}_{\varphi_{t-1}} \) are the mean luminance values of the matching blocks, \( {\sigma}_{\varphi_t} \) and \( {\sigma}_{\varphi_{t-1}} \) are the luminance variances of the matching blocks, \( {\sigma}_{\varphi_{t\left(t-1\right)}} \) is the luminance correlation coefficient between the matching blocks, and K1 and K2 are constants.

When the PSSIM value is greater than a certain threshold T1, the pixel is classified as a static pixel, denoted Cs; otherwise, it is an undetermined pixel, denoted Cr.
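A sketch of this block-SSIM classification follows; the block size M, the constants K1 and K2, the threshold T1, and the simple channel-average luminance are all assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def classify_static_pixels(C_t, C_prev, M=8, K1=6.5, K2=58.5, T1=0.9):
    """Sketch of Eq. (6): compute a per-pixel block SSIM (P_SSIM) between the
    luminance of frames t and t-1, then threshold it with T1.  M, K1, K2 and
    T1 are assumed values; the paper only states they are constants."""
    Y_t = C_t.astype(np.float64).mean(axis=2)    # rough luminance of frame t
    Y_p = C_prev.astype(np.float64).mean(axis=2)

    mu_t = uniform_filter(Y_t, size=M)           # local means over M x M blocks
    mu_p = uniform_filter(Y_p, size=M)
    var_t = uniform_filter(Y_t * Y_t, size=M) - mu_t ** 2
    var_p = uniform_filter(Y_p * Y_p, size=M) - mu_p ** 2
    cov = uniform_filter(Y_t * Y_p, size=M) - mu_t * mu_p

    p_ssim = ((2 * mu_t * mu_p + K1) * (2 * cov + K2)) / \
             ((mu_t ** 2 + mu_p ** 2 + K1) * (var_t + var_p + K2))

    static_mask = p_ssim > T1     # C_s: static pixels
    undetermined = ~static_mask   # C_r: to be split further with depth (Eq. 7)
    return static_mask, undetermined
```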

Static pixels Cs can be used directly to update TCg and TDg, while the undetermined pixels must be further divided using depth information. The undetermined pixels fall into three types:

1. The same part of an object shows a difference in brightness at different viewpoints. Such pixels have a similar structural texture, but their PSSIM value is not very high; they are denoted p1.

2. Due to the movement of a foreground object, background information that was occluded in the previous frame appears in the current frame, and these background pixels need to be extracted from the current frame; they are denoted p2.

3. Background information visible in the previous frame is occluded by the foreground in the current frame, so the background must be taken from the previous frame; these pixels are denoted p3.

The specific methods to divide the undetermined pixels are as follows:

$$ \begin{cases}p\in {p}_1\subset {C}_r, & \mid {\mu}_t^D-{\mu}_{t-1}^D\mid \le {T}_2\\ p\in {p}_2\subset {C}_r, & {\mu}_t^D-{\mu}_{t-1}^D< -{T}_2\\ p\in {p}_3\subset {C}_r, & {\mu}_t^D-{\mu}_{t-1}^D> {T}_2\end{cases} $$
(7)

where \( {\mu}_t^D \) and \( {\mu}_{t-1}^{D} \) are the average depth values of the matching blocks, and T2 is the depth difference threshold. The first two types, p1 and p2, are still background pixels and can therefore update TCg and TDg, while the third type p3 is discarded in the current frame and the pixel at the same position in the previous frame is used instead. TCg and TDg are thus extracted as follows:

$$ T{C}_g(p)=\begin{cases}{C}_t(p), & p\in {C}_s\cup {p}_1\cup {p}_2\\ {C}_{t-1}(p), & p\in {p}_3\end{cases} $$
(8)
$$ T{D}_g(p)=\begin{cases}{D}_t(p), & p\in {C}_s\cup {p}_1\cup {p}_2\\ {D}_{t-1}(p), & p\in {p}_3\end{cases} $$
(9)

The background TCg and TDg extracted from each adjacent frame may be used to update the global static background Cg and Dg as follows:

$$ {C}_g(p)=\begin{cases}T{C}_g(p), & {\mu}_{TD}^p-{\mu}_D^p\le {T}_2\\ {C}_g(p), & \text{otherwise}\end{cases} $$
(10)
$$ {D}_g(p)=\begin{cases}T{D}_g(p), & {\mu}_{TD}^p-{\mu}_D^p\le {T}_2\\ {D}_g(p), & \text{otherwise}\end{cases} $$
(11)

where \( {\mu}_{TD}^p \) and \( {\mu}_D^p \) are average depth values of matching blocks centered on pixel p in the depth maps TDg and Dg, respectively.
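A sketch of the classification and update steps of Eqs. (7)-(11) follows; the block size M and the threshold T2 are assumed values, and the masks come from the SSIM sketch above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def update_static_background(C_t, D_t, C_prev, D_prev, C_g, D_g,
                             static_mask, undetermined, M=8, T2=10):
    """Sketch of Eqs. (7)-(11): split the undetermined pixels by the local
    depth difference, build the per-frame background (TC_g, TD_g), and merge
    it into the global background (C_g, D_g).  M and T2 are assumed values."""
    mu_t = uniform_filter(D_t.astype(np.float64), size=M)
    mu_p = uniform_filter(D_prev.astype(np.float64), size=M)
    diff = mu_t - mu_p

    p1 = undetermined & (np.abs(diff) <= T2)   # same surface, brightness change
    p2 = undetermined & (diff < -T2)           # newly revealed background
    # p3 (diff > T2) is the remainder: taken from frame t-1 below.

    take_curr = static_mask | p1 | p2          # Eqs. (8)-(9)
    TC_g = np.where(take_curr[..., None], C_t, C_prev)
    TD_g = np.where(take_curr, D_t, D_prev)

    # Eqs. (10)-(11): accept the per-frame background where its local depth is
    # not more than T2 in front of the stored global background.
    mu_TD = uniform_filter(TD_g.astype(np.float64), size=M)
    mu_D = uniform_filter(D_g.astype(np.float64), size=M)
    accept = (mu_TD - mu_D) <= T2
    C_g = np.where(accept[..., None], TC_g, C_g)
    D_g = np.where(accept, TD_g, D_g)
    return C_g, D_g
```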

By comparing adjacent frames, the static background image of the current frame is extracted, and the global static background image is then updated with the help of the depth information. The final global static background image provides useful information for the subsequent hole filling. Figure 4 shows the static background extraction results at different frames.

Fig. 4 Static scene extraction result of the “Ballet” sequence. a The 3rd frame. b The 30th frame. c The 60th frame. d The final static scene

3.3 Forward warping fusion

The process of warping from the reference viewpoint to the virtual viewpoint is called forward warping.

First, the two global static background images extracted in the previous step are forward warped onto the virtual viewpoint imaging plane. Similarly, the left and right reference images are forward warped to the same virtual viewpoint.
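The following is a deliberately simplified forward-warping sketch; it assumes horizontally rectified cameras and an inverse-depth quantization, whereas the actual test sequences provide full projection matrices. The function name, parameters, and depth-to-disparity model are all assumptions for illustration.

```python
import numpy as np

def forward_warp(color, depth, baseline, focal, z_near, z_far):
    """A simplified forward-warping sketch for horizontally rectified cameras:
    each reference pixel is shifted by a disparity derived from its quantized
    depth value, with a z-buffer to resolve collisions.  All parameters are
    assumptions; a real setup would use the camera projection matrices."""
    h, w = depth.shape
    # Convert 8-bit quantized depth to metric depth (assumed quantization).
    z = 1.0 / (depth / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
    disparity = np.round(baseline * focal / z).astype(np.int32)

    warped_c = np.zeros_like(color)
    warped_d = np.zeros_like(depth)
    for i in range(h):
        for j in range(w):
            jj = j - disparity[i, j]
            # Keep the closer pixel (larger quantized depth value = closer).
            if 0 <= jj < w and depth[i, j] >= warped_d[i, jj]:
                warped_d[i, jj] = depth[i, j]
                warped_c[i, jj] = color[i, j]
    return warped_c, warped_d
```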

After forward warping, two virtual images of the same viewpoint are generated, and they need to be merged. During fusion, the depth values of the two candidate pixels are compared, with the left-viewpoint mapping taken as the basis. The fusion is initialized as follows:

$$ \begin{cases}{C}_V\left(i,j\right)={C}_{VL}\left(i,j\right)\\ {D}_V\left(i,j\right)={D}_{VL}\left(i,j\right)\end{cases} $$
(12)

where CV and DV are the color map and depth map of the merged virtual viewpoint respectively, and CVL and DVL are the color map and depth map warped by the left reference viewpoint, respectively. The fusion effect is shown in Fig. 5. It can be seen that the image quality based on the depth value fusion method is higher than that generated by the direct fusion method.

Fig. 5 Comparison of two fusion methods. a Direct fusion. b Fusion based on depth values

The warping result of the right reference viewpoint is then used to correct the initial virtual viewpoint image as follows:

$$ {C}_V\left(i,j\right)=\begin{cases}{C}_{VR}\left(i,j\right), & {D}_V\left(i,j\right)=0\ \text{ or }\ {D}_V\left(i,j\right)< {D}_{VR}\left(i,j\right)\\ {C}_V\left(i,j\right), & \text{otherwise}\end{cases} $$
(13)
$$ {D}_V\left(i,j\right)=\begin{cases}{D}_{VR}\left(i,j\right), & {D}_V\left(i,j\right)=0\ \text{ or }\ {D}_V\left(i,j\right)< {D}_{VR}\left(i,j\right)\\ {D}_V\left(i,j\right), & \text{otherwise}\end{cases} $$
(14)

where CVR and DVR are the color map and the depth map warped by the right reference viewpoint, respectively.
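A compact sketch of Eqs. (12)-(14), assuming all four warped maps are NumPy arrays of the same size and that a larger depth value means a closer object:

```python
import numpy as np

def fuse_left_right(C_VL, D_VL, C_VR, D_VR):
    """Sketch of Eqs. (12)-(14): initialize the virtual view from the left
    warp and take the right-warp pixel wherever the left warp is empty
    (depth 0) or the right warp is closer."""
    C_V, D_V = C_VL.copy(), D_VL.copy()          # Eq. (12)

    use_right = (D_V == 0) | (D_V < D_VR)        # condition of Eqs. (13)-(14)
    C_V[use_right] = C_VR[use_right]
    D_V[use_right] = D_VR[use_right]
    return C_V, D_V
```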

3.4 Artifacts elimination

Since the forward warping fusion is based on the result of the left-view warping, the hole edges where artifacts may appear are detected in the depth map of the virtual image warped from the left reference view. The detection is done as follows:

$$ \mathrm{Boundary}\left(i,j\right)=\begin{cases}255, & {D}_{VL}\left(i,j\right)=0\ \&\&\ {D}_{VL}\left(i,j-1\right)\ne 0,\ \text{ or }\ {D}_{VL}\left(i,j\right)=0\ \&\&\ {D}_{VL}\left(i-1,j\right)\ne 0,\ \text{ or }\ {D}_{VL}\left(i,j\right)=0\ \&\&\ {D}_{VL}\left(i+1,j\right)\ne 0\\ 0, & \text{otherwise}\end{cases} $$
(15)

where Boundary marks the detected hole edges: a pixel value of 0 (black) indicates a non-edge region and 255 (white) indicates an edge region. Not all detected hole-edge pixels are artifacts. To identify the artifacts, the foreground edge is extracted from the depth map of the left reference viewpoint and forward warped to the virtual viewpoint imaging plane, yielding the warped foreground edge map Boundary_Fore. The true artifact edge, Boundary_Artifact, is then obtained as follows:

$$ \mathrm{Boundary}\_\mathrm{Artifact}\left(i,j\right)=\begin{cases}255, & \mathrm{Boundary}\left(i,j\right)=255\ \&\&\ \mathrm{Boundary}\_\mathrm{Fore}\left(i,j\right)\ne \left(0,255,0\right)\\ 0, & \text{otherwise}\end{cases} $$
(16)

The artifacts can be eliminated effectively by replacing these pixels with the corresponding pixels of the global static background image:

$$ {C}_V\left(i,j\right)=\begin{cases}{C}_g\left(i,j\right), & \mathrm{if}\ \mathrm{Boundary}\_\mathrm{Artifact}\left(i,j\right)=255\\ {C}_V\left(i,j\right), & \mathrm{if}\ \mathrm{Boundary}\_\mathrm{Artifact}\left(i,j\right)=0\end{cases} $$
(17)
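A sketch of this artifact elimination step follows; fore_edge_mask is an assumed boolean version of the warped foreground edge map Boundary_Fore (True where the pixel lies on the foreground edge).

```python
import numpy as np

def eliminate_artifacts(C_V, D_VL, C_g, fore_edge_mask):
    """Sketch of Eqs. (15)-(17): detect hole-boundary pixels in the left-view
    warped depth map, keep those not covered by the warped foreground edge,
    and replace them with pixels from the global static background C_g."""
    empty = (D_VL == 0)

    boundary = np.zeros(empty.shape, dtype=bool)              # Eq. (15)
    boundary[:, 1:] |= empty[:, 1:] & ~empty[:, :-1]          # left neighbor filled
    boundary[1:, :] |= empty[1:, :] & ~empty[:-1, :]          # upper neighbor filled
    boundary[:-1, :] |= empty[:-1, :] & ~empty[1:, :]         # lower neighbor filled

    artifact = boundary & ~fore_edge_mask                     # Eq. (16)
    out = C_V.copy()
    out[artifact] = C_g[artifact]                             # Eq. (17)
    return out
```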

Figure 6 shows the virtual image after artifact elimination. The artifact elimination method based on the global static background clearly improves the quality of the virtual image.

Fig. 6 The virtual view after artifacts elimination

3.5 Weighted-fusion hole-filling method

The generated virtual viewpoint image still contains some holes. These holes mainly appear in background areas, so the previously extracted global background image is used to fill them. To maintain temporal continuity at the same time, we propose a weighted-fusion hole-filling method based on the static background. The specific steps are as follows:

1. The holes in the first frame, the (1+L)th frame, the (1+2L)th frame, and so on are filled directly with the extracted global static background image.

2. For the middle frames between every two frames filled in step 1, the holes are filled by a dynamic weighted fusion, computed as follows:

$$ {C}_V^{t+k}\left(i,j\right)=\left(1-\mathrm{weight}\right)\cdot {C}_V^t\left(i,j\right)+\mathrm{weight}\cdot {C}_V^{t+L}\left(i,j\right) $$
(18)
$$ \mathrm{weight}=\frac{k\ \operatorname{mod}\ L}{L} $$
(19)

where \( {C}_V^t\left(i,j\right) \) and \( {C}_V^{t+L}\left(i,j\right) \) are the virtual viewpoint images filled directly with the global static background image in step 1, \( {C}_V^{t+k}\left(i,j\right) \) is the virtual viewpoint image whose holes are filled by weighted fusion in step 2, and weight is the blending weight.

Pixels taken from the images of step 1 to fill other frames must be background pixels. Therefore, the average depth value DA of the global background depth image is calculated, and the filling is restricted as follows:

$$ {C}_V\left(i,j\right)=\begin{cases}{C}_V\left(i,j\right), & {C}_V\left(i,j\right)>0\\ {C}_V^{t+k}\left(i,j\right), & {C}_V\left(i,j\right)=0\ \&\&\ {D}_V^t\left(i,j\right)<{D}_A\ \&\&\ {D}_V^{t+L}\left(i,j\right)<{D}_A\\ 0, & \text{otherwise}\end{cases} $$
(20)
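A sketch of the weighted-fusion filling of Eqs. (18)-(20); the argument names mirror the notation in the text, and the hole test on the color sum is an assumed convention for "no warped value".

```python
import numpy as np

def fill_holes_weighted(C_V, C_V_t, C_V_tL, D_V_t, D_V_tL, k, L, D_A):
    """Sketch of Eqs. (18)-(20): blend the two key frames filled in step 1
    (C_V^t and C_V^{t+L}) with weight (k mod L)/L and copy the blend into the
    remaining hole pixels whose key-frame depths are below the background
    average D_A."""
    weight = (k % L) / float(L)                              # Eq. (19)
    blend = ((1.0 - weight) * C_V_t.astype(np.float64) +
             weight * C_V_tL.astype(np.float64)).astype(C_V.dtype)  # Eq. (18)

    hole = (C_V.sum(axis=2) == 0)                            # unfilled pixels
    background = (D_V_t < D_A) & (D_V_tL < D_A)
    fill = hole & background                                 # Eq. (20)

    out = C_V.copy()
    out[fill] = blend[fill]
    return out
```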

The final virtual viewpoint image is shown in Fig. 7b; compared with the image in Fig. 7a, the holes are filled well.

Fig. 7 Contrast of virtual images before and after hole filling. a The virtual view before hole filling. b The virtual view after hole filling

4 Results and discussion

We use the “BreakDancers” and “Ballet” sequences [22] to evaluate the performance of the proposed method; their characteristics are listed in Table 1. The left and right reference viewpoints are used to synthesize the middle viewpoint, and the final virtual viewpoint image is compared with the real captured image at that viewpoint. We evaluate the experimental results with both subjective and objective indicators.

Table 1 Characteristics of the test sequences

4.1 Subjective comparison

To compare subjective quality, reference viewpoints 3 and 5 are used in both datasets to generate the intermediate viewpoint 4.

One frame from each of the two sequences is selected to display the results, as shown in Figs. 8 and 9. The first row shows the virtual images generated by Do et al. [8], the second row those generated by Yao et al. [6], the third row shows our results, and the last row shows the ground truth. From the enlarged details in Figs. 8 and 9, it can be seen that the proposed method eliminates artifacts and fills holes best.

Fig. 8 The results for “Ballet” sequence. a The results of Do et al. b The results of Yao et al. c The results of the proposed method. d The ground truth. From the enlarged detail images, we can see that (a) contains some errors around the arm of the ballet dancer, (b) contains some artifacts on the wall, and (c) contains almost no errors or artifacts

Fig. 9 The results for “BreakDancers” sequence. a The results of Do et al. b The results of Yao et al. c The results of the proposed method. d The ground truth. Since the “BreakDancers” sequence is darker, the differences between the methods may be hard to see. When comparing with the ground truth, however, the white square and the white line behind the dancer’s arm in the proposed method are closest to the ground truth

4.2 Objective comparison

We compare the proposed method with other methods using PSNR and SSIM [23] over 100 frames to evaluate the spatial continuity. The proposed method achieves the best PSNR and SSIM values (Tables 2 and 3).
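The objective evaluation can be computed as in the sketch below, assuming aligned uint8 color frames and a recent scikit-image version; the function and variable names are illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(virtual_frames, real_frames):
    """Sketch of the objective evaluation: average PSNR and SSIM of the
    synthesized frames against the real captured view over the sequence."""
    psnr_vals, ssim_vals = [], []
    for v, r in zip(virtual_frames, real_frames):
        psnr_vals.append(peak_signal_noise_ratio(r, v, data_range=255))
        ssim_vals.append(structural_similarity(r, v, channel_axis=2,
                                               data_range=255))
    return float(np.mean(psnr_vals)), float(np.mean(ssim_vals))
```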

Table 2 Result contrast for “Ballet” sequence
Table 3 Result contrast for “BreakDancers” sequence

The F-score [20] is used to assess temporal continuity. It is calculated as follows:

$$ F\text{-}\mathrm{score}=\mid {\tau}_I^t-{\tau}_V^t\mid, \kern0.5em \forall t=2,\dots,{N}_f $$
(21)
$$ {\tau}_D^t=\frac{1}{\mid {\varphi}^t\mid}\sum_{p\in {\varphi}^t}\mid {I}^t(p)-{I}^{t-1}(p)\mid, \kern0.5em \forall t=2,\dots,{N}_f $$
(22)

where \( {\tau}_I^t \) and \( {\tau}_V^t \) are the average absolute inter-frame differences of the real image and the virtual view image at frame t, respectively, ∣φt∣ is the number of hole pixels in frame t, It and It−1 are frames t and t−1, and Nf is the number of frames. Table 4 shows the comparison results; the smaller the F-score, the better the temporal continuity is maintained, and the proposed method performs best. As shown in Fig. 10, the F-score of the proposed method is relatively uniform over the 100 frames and the overall transition is smooth, which shows that the proposed method maintains temporal continuity throughout the video sequence. The virtual viewpoint images generated by Do et al. [8] and Yao et al. [6] preserve the temporal behavior of the image sequence poorly: their F-score values are relatively large, which appears as multiple peak regions in the graph. This indicates that the pixel values of those virtual images jump sharply between frames and lack continuity in the time direction.
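A sketch of this temporal metric follows, assuming per-frame boolean hole masks are available; the names below are illustrative.

```python
import numpy as np

def temporal_f_score(real_frames, virtual_frames, hole_masks):
    """Sketch of Eqs. (21)-(22): for each frame t >= 2, average the absolute
    inter-frame difference over the hole region for both the real and the
    virtual sequence and report their absolute gap.  hole_masks[t] is an
    assumed boolean map of the holes in frame t."""
    scores = []
    for t in range(1, len(real_frames)):
        m = hole_masks[t]
        diff_r = np.abs(real_frames[t].astype(np.float64) -
                        real_frames[t - 1].astype(np.float64))
        diff_v = np.abs(virtual_frames[t].astype(np.float64) -
                        virtual_frames[t - 1].astype(np.float64))
        tau_I = diff_r[m].mean() if m.any() else 0.0   # Eq. (22), D = I
        tau_V = diff_v[m].mean() if m.any() else 0.0   # Eq. (22), D = V
        scores.append(abs(tau_I - tau_V))              # Eq. (21)
    return scores
```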

Table 4 Time continuity comparison for “Ballet” sequence
Fig. 10 The time continuity comparison

The proposed method is applicable only to the case of a fixed camera. Experiments show that it can effectively solve the problems of artifacts, holes, and temporal continuity. During the experiments, we found that if the reference-view depth map is inaccurate, pixels with the same depth value can be mapped to different positions in the virtual view, resulting in flicker in the virtual image, especially at foreground object edges. Addressing this problem requires making full use of the correlation between depth maps. In addition, extracting the static background and filling holes take considerable time, leaving room for improvement toward real-time performance.

5 Conclusion and future work

This paper proposes a virtual view synthesis method based on spatio-temporal continuity. A static background image of the entire scene is constructed, and a weighted-fusion hole-filling method based on the static background is proposed to fill holes and maintain temporal continuity. Our future work will focus on reducing the time cost: because static background extraction and hole filling are amenable to parallel processing, CUDA can be used to further accelerate the algorithm.