3D Pedestrian tracking using local structure constraints

https://doi.org/10.1016/j.isprsjprs.2020.05.002

Abstract

Tracking pedestrians based on visual sensors has many diverse applications, among them autonomous driving. Besides obtaining high recall, maintaining the consistency of tracked trajectories during data association is one of the most crucial issues of any tracker. This issue has been tackled in the literature for some time, taking advantage of geometry cues for improving the pairwise matching of detections across consecutive frames. However, this idea has only been employed in a simple way and not thoroughly leveraged in existing studies, i.e., only 2D information is utilized, which is insufficient to fully capture the real-world geometry in 3D space. Motivated by this observation, in this paper, we present a new method called 3D-TLSR (3D pedestrian tracking using local structure refinement). We use stereo images and expand the idea of geometry cues into 3D space to improve the association of existing trajectories and new detections. We divide the assignment optimization into two steps: (1) determining trajectories whose assignments are strongly believed to be correct, which we call anchors, and (2) employing geometry constraints between the anchors and their nearby trajectories in 3D space to improve the matching of less reliable assignments of the first step. In addition, we suggest a simple approach to compute and correct the velocity of a tracked person so that we can better recover missed detections. Experimental results on the well-known KITTI tracking benchmark, the ETHMS data set, as well as a self-generated dataset show that our tracker yields comparable results to other state-of-the-art methods with (for KITTI) multi-object tracking accuracy (MOTA) of 54.00, which is the best online result among all investigated approaches, multi-object tracking precision (MOTP) of 73.03, which is the best of all reported values, and mostly tracked (MT) of 29.55, being the second-best result.
On the ETHMS dataset, our approach obtains the best results by large margins for recall, precision, and MT, while maintaining a reasonably low number of ID switches (IDs) and fragmentations (FG). These findings confirm the effectiveness of our proposed association method and velocity estimation approach.

Introduction

Pedestrians are among the most involved traffic participants in metropolitan areas. Therefore, tracking and predicting their behaviours in 3D object space are crucial for applications related to autonomous driving and traffic safety. Pedestrian tracking based on images in particular, and object tracking in general, is an active field in computer vision. Although a substantial number of studies have been carried out to tackle this problem, tracking pedestrians correctly and robustly still requires extensive improvements to deal with difficulties like mutual occlusions and noisy detection results. These challenges usually lead to two main problems in tracking: missed detections and identity switches (Leal-Taixé et al., 2017).

Tracking-by-detection is a well-known and widely used remedy in the state-of-the-art tracking literature, in which the tracking task is decomposed into two separate stages: detection and data association (Choi, 2015, Dimitrievski et al., 2019, Hong Yoon et al., 2016, Ošep et al., 2017, Xiang et al., 2015). Most studies following this approach concentrate on concatenating detections across image frames to form consistent trajectories for the objects of interest. This task can become extremely complicated in crowded groups. Previous studies usually tried to handle the association task either by developing better optimization methods (Berclaz et al., 2011, Dehghan et al., 2015, Lenz et al., 2015) or by improving the appearance feature extractors (Bae and Yoon, 2018, Leal-Taixé et al., 2016, Tang et al., 2017). In contrast, exploring geometry cues does not seem to have attracted the attention of many researchers. This may be because most of the existing works typically focus on tracking pedestrians in 2D image space, where the ego-motion of camera sensors is difficult to distinguish from object movements and the background can also change significantly (Breitenstein et al., 2011, Fagot-Bouquet et al., 2016, Kieritz et al., 2016, Leal-Taixé et al., 2017). Hence, 2D geometry cues cannot fully support the task at hand, which evolves in 3D.

In this study, we solve the pedestrian tracking problem in 3D space using stereo images acquired from moving cameras. We aim at not only obtaining high recall values but also maintaining accurate identities for tracked trajectories. For that purpose, we take advantage of 3D geometry constraints among pedestrians in a group to enhance association results. We call our new method 3D-TLSR (3D pedestrian tracking using local structure refinement). To do so, we design a two-step association method. First, we determine a set of anchors, defined as assignment pairs with strong evidence of correctness. Since the image sequences we deal with typically contain some clearly visible pedestrians, anchor determination is in general not a very difficult task. Then, in a so-called local structure refinement (LSR) step, we employ the geometry changes of the detected anchors as prior information to find the correct assignments for the remaining detections. In addition, we consider targets with a close distance to each other as neighbours, and we assume that the related trajectories have a similar geometric shape, at least over a short period of time. By combining those constraints with appearance cues, the accuracy of association and tracking is expected to improve.
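The two-step idea described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the mutual-best-match anchor test, the thresholds `max_cost`, `radius` and `weight`, the displacement-based geometry term, and the function names `find_anchors` and `refine` are all assumptions made for illustration. Positions are hypothetical ground-plane coordinates in metres, and the brute-force search over permutations is only suitable for a handful of targets.

```python
import math
from itertools import permutations


def find_anchors(app_cost, max_cost=0.2):
    """Step 1 (assumed rule): a pair is an anchor if it is a mutual best
    match and its appearance cost is below max_cost."""
    anchors = {}
    for t, row in enumerate(app_cost):
        d = min(range(len(row)), key=row.__getitem__)
        best_t = min(range(len(app_cost)), key=lambda k: app_cost[k][d])
        if best_t == t and row[d] < max_cost:
            anchors[t] = d
    return anchors


def refine(tracks, dets, app_cost, anchors, radius=3.0, weight=1.0):
    """Step 2 (assumed rule): match the remaining tracks while penalising
    displacements that disagree with those of nearby anchors."""
    free_t = [t for t in range(len(tracks)) if t not in anchors]
    free_d = [d for d in range(len(dets)) if d not in anchors.values()]

    def cost(t, d):
        near = [a for a in anchors if math.dist(tracks[a], tracks[t]) <= radius]
        if not near:
            return app_cost[t][d]
        # Mean displacement of the neighbouring anchors between two frames.
        mean = [sum(dets[anchors[a]][k] - tracks[a][k] for a in near) / len(near)
                for k in range(2)]
        move = [dets[d][k] - tracks[t][k] for k in range(2)]
        return app_cost[t][d] + weight * math.dist(move, mean)

    # Exhaustive search; assumes at least as many free detections as free tracks.
    best = min(permutations(free_d, len(free_t)),
               key=lambda p: sum(cost(t, d) for t, d in zip(free_t, p)))
    return dict(zip(free_t, best))


# Three pedestrians walking side by side, all shifted by 0.5 m in x.
tracks = [(0.0, 10.0), (1.0, 10.0), (2.0, 10.0)]
dets = [(0.5, 10.0), (1.5, 10.0), (2.5, 10.0)]
# Appearance is clear for track 0 but ambiguous (and misleading) for tracks 1 and 2.
app_cost = [[0.10, 0.80, 0.90],
            [0.70, 0.50, 0.45],
            [0.90, 0.45, 0.50]]

anchors = find_anchors(app_cost)
matches = {**anchors, **refine(tracks, dets, app_cost, anchors)}
print(anchors, matches)
```

In this toy example, appearance alone would swap the identities of tracks 1 and 2, whereas the displacement of the single anchor resolves the ambiguity. The sketch deliberately omits gating, unmatched detections, and track creation.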

To enhance the recall value, we retrieve missed detections of a target through a prediction step using its previous states. However, while helping to increase the number of true positives (TPs), the prediction can also produce more false positives (FPs) if inferred positions drift away from the true ones. For that reason, we explore a method to calculate and evaluate the accuracy of pedestrian velocity in 3D object space. Moreover, we define a so-called friend relationship among the neighbours, whose trajectories are supposed to move together for a longer time interval. In this way, the velocity and movement of a tracked pedestrian can be corrected and updated by its friends, especially when it is not assigned to any detection for some time.
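The role of the friend relationship in prediction can be sketched as follows. This is an illustrative simplification under stated assumptions, not the velocity estimation procedure of this paper: a constant-velocity motion model, a plain average over the friends' velocities, and the hypothetical parameter `max_missed` (the number of frames a target may go undetected before its own velocity is no longer trusted).

```python
def mean_velocity(velocities):
    """Component-wise mean of a non-empty list of 3D velocity vectors."""
    n = len(velocities)
    return tuple(sum(v[k] for v in velocities) / n for k in range(3))


def corrected_velocity(own_velocity, friend_velocities, missed_frames, max_missed=2):
    """Keep the target's own velocity while it was observed recently;
    otherwise fall back to the mean velocity of its friends (assumed rule)."""
    if missed_frames > max_missed and friend_velocities:
        return mean_velocity(friend_velocities)
    return own_velocity


def predict(position, velocity, dt):
    """Constant-velocity prediction of the next 3D position."""
    return tuple(p + v * dt for p, v in zip(position, velocity))


# A target that has been undetected for 3 frames and has a stale velocity estimate.
last_position = (2.0, 0.0, 10.0)          # metres
stale_velocity = (0.0, 0.0, 0.0)          # metres per second
friends = [(0.5, 0.0, 1.0), (1.5, 0.0, 1.0)]

velocity = corrected_velocity(stale_velocity, friends, missed_frames=3)
predicted = predict(last_position, velocity, dt=0.5)
print(velocity, predicted)
```

Falling back to the friends only after several missed frames keeps the sketch conservative: while a target is still being detected, its own measurements remain the primary source for its velocity.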

Our main contributions in this paper can be summarized as follows:

  • We propose a framework to track pedestrians in 3D object space by employing advantages of both 2D and 3D information.

  • We suggest a two-step data association method to leverage 3D geometry cues, in which a set of strong association pairs, called anchors, is detected first. Then, the trajectories of the anchors are utilized to constrain matching of other detections in the local structure refinement (LSR) step.

  • We introduce an approach to reliably estimate and assess the motion of pedestrians. In addition, we define a so-called friend relationship to correct pedestrian velocity and improve trajectory prediction.

The rest of this paper is organized as follows: previous studies related to our research are discussed in Section 2. The details of our tracking framework are described in Section 3. The performance of our tracker is presented in Section 4, followed by the conclusions in Section 5.

Section snippets

Multi-pedestrians tracking

Tracking-by-detection is used by most studies dealing with pedestrian tracking and achieves state-of-the-art results. This method consists of two sub-tasks: detection and data association (Choi, 2015, Dimitrievski et al., 2019, Hong Yoon et al., 2016, Lee et al., 2016, Nguyen et al., 2019, Ošep et al., 2017, Yoon et al., 2015). While detection is considered as an independent problem in computer vision and can be solved by employing state-of-the-art detectors (He et al., 2017, He et al., 2016,

3D Pedestrian tracking with 3D-TLSR

In this section, we present our new method called 3D-TLSR (3D pedestrian tracking using local structure refinement). To track pedestrians in 3D object space, we take calibrated and normalized stereo image pairs as input, i.e., the disparity of conjugate points is horizontal, and provide 3D trajectories of pedestrians as output. The disparity map w.r.t. the stereo rig is estimated using the state-of-the-art dense matching method proposed in Yamaguchi et al. (2014). Following the

Experiments and results

In order to evaluate the performance of our approach, we used three different datasets, namely the KITTI object tracking benchmark (Geiger et al., 2012), the ETH mobile scene (ETHMS) dataset (Ess et al., 2008) and our own dataset. In case of KITTI, we examined the effectiveness of each component in our tracking pipeline using two sequences of the training dataset, namely 15 and 19, for which ground truth (GT) data is provided. We also compared the performance of our method on the KITTI testing

Conclusion

In this paper, we present a new approach, called 3D-TLSR, to track pedestrians in 3D space using stereo images. We introduce a way to estimate the uncertainty of objects of interest in 3D, which enables us to localize pedestrians in 3D object space with high precision. We propose a two-step association method so that the local structure constraints between nearby trajectories are leveraged to reduce the number of identity switches in the tracked trajectories even when pedestrians move in crowded

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the German Research Foundation (DFG) as part of the Research Training Group i.c.sens [RTG2159].

References (57)

  • T. Klinger et al. Probabilistic multi-person localisation and tracking in image sequences. ISPRS J. Photogramm. Remote Sens. (2017)
  • F. Poiesi et al. Multi-target tracking on confidence maps: An application to people tracking. Comput. Vis. Image Underst. (2013)
  • K. Schindler et al. Automatic detection and tracking of pedestrians from a moving stereo rig. ISPRS J. Photogramm. Remote Sens. (2010)
  • S. Bae et al. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • J. Berclaz et al. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • K. Bernardin et al. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. (2008)
  • M.D. Breitenstein et al. Robust tracking-by-detection using a detector confidence particle filter
  • M.D. Breitenstein et al. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • W. Choi. Near-online multi-target tracking with aggregated local flow descriptor
  • M. Conforti et al. (2014)
  • Dehghan, A., Modiri Assari, S., Shah, M., 2015. GMMCP tracker: Globally optimal generalized maximum multi clique...
  • M. Dimitrievski et al. Behavioral pedestrian tracking using a camera and lidar sensors on a moving vehicle. Sensors (2019)
  • A. Ess et al. A mobile vision system for robust multi-person tracking
  • L. Fagot-Bouquet et al. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking
  • Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In:...
  • Gelb, A., 1974. Applied Optimal...
  • K. He et al. Deep residual learning for image recognition
  • He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: IEEE International Conference on Computer Vision...
  • Hermans, A., Beyer, L., Leibe, B., 2017. In defense of the triplet loss for person re-identification, arXiv preprint...
  • Hong Yoon, J., Lee, C.-R., Yang, M.-H., Yoon, K.-J., 2016. Online multi-object tracking via structural constraint event...
  • O.H. Jafari et al. Real-time RGB-D based people detection and tracking for mobile robots and head-worn cameras
  • H. Kieritz et al. Online multi-person tracking using integral channel features
  • S. Kim et al. Online multi-target tracking by large margin structured learning
  • Leal-Taixé, L., Pons-Moll, G., Rosenhahn, B., 2011. Everybody needs somebody: Modeling social and grouping behavior on...
  • Leal-Taixé, L., Fenzi, M.M., Kuznetsova, A., Rosenhahn, B., Savarese, S., 2014. Learning an image-based motion context...
  • L. Leal-Taixé et al. Learning by tracking: Siamese CNN for robust target association
  • Leal-Taixé, L., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., 2017. Tracking the trackers: An analysis of...
  • B. Lee et al. Multi-class multi-object tracking using changing point detection