STC-Flow: Spatio-temporal context-aware optical flow estimation

https://doi.org/10.1016/j.image.2021.116441

Highlights

  • A general contextual attention framework over multi-level features with comprehensive fusion.

  • STC-Flow captures rich spatio-temporal long-range dependencies for flow estimation.

  • PSC, TCC, and RRCU modules improve feature extraction, correlation, and flow reconstruction.

  • Promising performance on Sintel and KITTI for two-frame optical flow estimation.

Abstract

In this paper, we propose a spatio-temporal contextual network, STC-Flow, for optical flow estimation. Unlike previous optical flow estimation approaches with local pyramid feature extraction and multi-level correlation, we propose a contextual relation exploration architecture that captures rich long-range dependencies in the spatial and temporal dimensions. Specifically, STC-Flow contains three key context modules: a pyramidal spatial context module, a temporal context correlation module, and a recurrent residual contextual upsampling module, which serve feature extraction, correlation, and flow reconstruction, respectively. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance among two-frame methods on the Sintel and KITTI datasets.

Introduction

Optical flow estimation is an important yet challenging problem in the field of video analytics. Recently, deep learning based approaches have been extensively exploited to estimate optical flow via convolutional neural networks (CNNs). Despite great efforts and rapid development, the advances are not as significant as those in single-image computer vision tasks. The main reason is that optical flow is not directly measurable in the wild, and it is challenging to model motion dynamics with pixel-wise correspondence between two consecutive frames, where motion displacements vary widely; optical flow estimation therefore requires efficient feature representations to match objects or scenes under different motions.

Conventional methods, such as EpicFlow [1], estimate optical flow with hand-crafted algorithms that match features between two frames. Most of these methods, however, involve heavy computational complexity and usually fail for motions with large displacements. CNN-based methods, which typically adopt encoder–decoder architectures with pyramidal feature extraction and flow reconstruction, e.g., FlowNet [2], SpyNet [3], and PWC-Net [4], have boosted the state of the art in optical flow estimation and outperform conventional methods. However, stacks of convolutional layers have inherent limitations: low-level features contain rich details, but the receptive field of a single convolutional layer is small and therefore ineffective at capturing large motion displacements; high-level features capture the overall outlines or shapes of objects and can handle larger displacements, but with fewer details, which may cause misalignment for objects with complex shapes or non-rigid motions. It is thus essential to capture context information with a large receptive field and long-range dependencies, and to build global relationships at each level of the CNN, so as to both capture larger displacements and retain more details.

In this paper, as shown in Fig. 1, we propose an end-to-end architecture that jointly explores spatio-temporal context for optical flow estimation. The network contains three key context modules. (a) The pyramidal spatial context module enhances the discriminative ability of feature representations in the spatial dimension. (b) The temporal context correlation module models the global spatio-temporal relationships of the cost volume produced by the correlation operation, which measures the quality of correspondences between the two frames. (c) The recurrent residual contextual upsampling module leverages the underlying content of the predicted flow field between adjacent feature levels to learn high-frequency features and preserve edges within a large receptive field.
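For concreteness, the correlation operation that produces the cost volume in (b) can be sketched as follows. This is a PWC-Net-style local correlation written for illustration only; the search radius max_disp and the feature shapes are our assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def cost_volume(feat1, feat2, max_disp=4):
    """Local cost volume between two feature maps of shape (B, C, H, W).

    For each pixel in feat1, correlates its feature vector with the
    features of feat2 inside a (2*max_disp+1)^2 search window.
    """
    B, C, H, W = feat1.shape
    # Zero-pad feat2 so every shifted view stays within bounds.
    padded = F.pad(feat2, [max_disp] * 4)
    volumes = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + H, dx:dx + W]
            # Dot product over channels, normalized by channel count.
            volumes.append((feat1 * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(volumes, dim=1)  # (B, (2*max_disp+1)^2, H, W)
```

Each channel of the output encodes the matching score for one candidate displacement; the temporal context correlation module then reasons globally over this volume rather than treating each displacement independently.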

In summary, the main contributions of this work are:

  • We propose a general contextual attention framework for efficient feature representation learning, which explores multi-level features and comprehensive feature fusion.

  • We propose three corresponding context modules within this framework, for feature extraction, correlation, and optical flow reconstruction, aiming to improve overall performance via better feature representation and correlation, and to enhance high-frequency details with context information (a minimal sketch of such an attention block follows this list).
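To make the contextual attention idea concrete, the snippet below sketches a generic non-local-style attention block over a single feature level, in which every position attends to every other position and the result is fused back residually. It is an illustrative sketch under our own assumptions (the channel-reduction ratio and the learned fusion weight gamma are not from the paper), not the paper's exact module:

```python
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """Non-local-style attention over one feature level (B, C, H, W).

    Every position attends to all positions, so the output aggregates
    long-range context regardless of the local convolutional receptive
    field; the attended features are fused back via a residual path.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        inter = max(channels // reduction, 1)
        self.query = nn.Conv2d(channels, inter, 1)
        self.key = nn.Conv2d(channels, inter, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned fusion weight

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.key(x).flatten(2)                    # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return x + self.gamma * out                   # residual fusion
```

In a framework of this kind, one such block per pyramid level lets the fused output carry global relationships that purely local convolutional features cannot capture.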

Related work

Optical flow estimation. Horn and Schunck [5] pioneer the study of optical flow estimation with a variational method based on the brightness constancy assumption and an iterative implementation. Brox et al. [6] propose a warping-based optical flow prediction method. Brox et al. [7] use rich descriptors for feature matching to estimate a dense optical flow field with large displacement. Zimmer et al. [8] propose anisotropic smoothness as a constraint term for data regularization. Weinzaepfel

STC-Flow

Given a pair of video frames, scenes or objects are diverse in movement velocity and direction; they change in scale, view, and luminance. Convolutional operations in CNNs are in general performed only in a local neighborhood. Pixels of the same non-rigid object may have similar textures and features, even though they may have different motions. For instance, (1) objects may be non-rigid, with obviously different motion in each part, such as the girl walking in the street and the

Experiments

In this section, we introduce the implementation details, and evaluate our method on public optical flow benchmarks, including MPI Sintel [45], KITTI 2012 [46] and KITTI 2015 [47], and compare it with state-of-the-art methods.
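For reference, the standard metric on these benchmarks is the average end-point error (EPE), the mean Euclidean distance between predicted and ground-truth flow vectors. The minimal sketch below is our own illustration of this standard metric, not the authors' evaluation code:

```python
import torch

def epe(flow_pred, flow_gt):
    """Average end-point error between flow fields of shape (B, 2, H, W).

    Computes the Euclidean distance between the predicted and
    ground-truth flow vector at every pixel, then averages.
    """
    return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()
```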

Conclusion

To explore motion context information for accurate optical flow estimation, we propose the spatio-temporal context-aware network STC-Flow. We propose three context modules for feature extraction, correlation, and optical flow reconstruction: the pyramidal spatial context (PSC) module, the temporal context correlation (TCC) module, and the recurrent residual contextual upsampling (RRCU) module, respectively. These three modules utilize contextual information to deal with

CRediT authorship contribution statement

Xiaolin Song: Conceptualization, Methodology, Software, Writing – original draft, Visualization. Yuyang Zhao: Data curation, Software, Writing – original draft, Investigation. Jingyu Yang: Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771339.

References (48)

  • T. Brox, A. Bruhn, N. Papenberg, et al., High accuracy optical flow estimation based on a theory for warping, in: ...
  • T. Brox, et al., Large displacement optical flow: Descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • H. Zimmer, et al., Optic flow in harmony, Int. J. Comput. Vis. (2011)
  • P. Weinzaepfel, J. Revaud, Z. Harchaoui, et al., DeepFlow: Large displacement optical flow with deep matching, in: IEEE ...
  • Z. Tu, et al., Weighted local intensity fusion method for variational optical flow estimation, Pattern Recognit. (2016)
  • E. Ilg, N. Mayer, T. Saikia, et al., FlowNet 2.0: Evolution of optical flow estimation with deep networks, in: IEEE ...
  • J. Thewlis, S. Zheng, P.H.S. Torr, et al., Fully-trainable deep matching, in: British Machine Vision Conference, ...
  • D. Gadot, L. Wolf, PatchBatch: A batch augmented loss for optical flow, in: IEEE Conference on Computer Vision and ...
  • C. Bailer, K. Varanasi, D. Stricker, CNN-based patch matching for optical flow with thresholded hinge embedding loss, ...
  • F. Güney, A. Geiger, Deep discrete flow, in: Asian Conference on Computer Vision, ...
  • J. Chen, et al., Efficient segmentation-based PatchMatch for large displacement optical flow estimation, IEEE Trans. Circuits Syst. Video Technol. (2019)
  • Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, W. Xu, UnOS: Unified unsupervised optical-flow and stereo-depth estimation ...
  • P. Liu, I. King, M.R. Lyu, J. Xu, DDFlow: Learning optical flow with unlabeled data distillation, in: Proceedings of ...
  • T.-W. Hui, X. Tang, C. Change Loy, LiteFlowNet: A lightweight convolutional neural network for optical flow estimation, ...