3D Pedestrian tracking using local structure constraints
Introduction
Pedestrians are among the most involved traffic participants in metropolitan areas. Therefore, tracking and predicting their behaviour in 3D object space is crucial for applications related to autonomous driving and traffic safety. Image-based pedestrian tracking in particular, and object tracking in general, is an active field in computer vision. Although a substantial number of studies have tackled this problem, tracking pedestrians correctly and robustly still requires considerable improvement to cope with difficulties such as mutual occlusions and noisy detection results. These challenges usually lead to two main problems in tracking: missed detections and identity switches (Leal-Taixé et al., 2017).
Tracking-by-detection is a well-known and widely used remedy in the state-of-the-art tracking literature, in which the tracking task is decomposed into two separate stages: detection and data association (Choi, 2015, Dimitrievski et al., 2019, Hong Yoon et al., 2016, Ošep et al., 2017, Xiang et al., 2015). Most studies following this approach concentrate on linking detections across image frames to form consistent trajectories for the objects of interest. This task can become extremely complicated in crowded groups. Previous studies usually tried to handle the association task either by developing better optimization methods (Berclaz et al., 2011, Dehghan et al., 2015, Lenz et al., 2015) or by improving the appearance feature extractors (Bae and Yoon, 2018, Leal-Taixé et al., 2016, Tang et al., 2017). In contrast, exploring geometry cues does not seem to have attracted the attention of many researchers. This may be because most existing works focus on tracking pedestrians in 2D image space, where the ego-motion of the camera is difficult to distinguish from object movements and the background can also change significantly (Breitenstein et al., 2011, Fagot-Bouquet et al., 2016, Kieritz et al., 2016, Leal-Taixé et al., 2017). Hence, 2D geometry cues cannot fully support the task at hand, which evolves in 3D.
In this study, we solve the pedestrian tracking problem in 3D space using stereo images acquired from moving cameras. We aim not only at obtaining high recall values but also at maintaining accurate identities for the tracked trajectories. For that purpose, we take advantage of 3D geometry constraints among pedestrians in a group to enhance association results. We call our new method 3D-TLSR (3D pedestrian tracking using local structure refinement). It is built around a two-step association method. First, we determine a set of anchors, defined as assignment pairs with strong evidence of correctness. Since the image sequences we deal with typically contain some clearly visible pedestrians, anchor determination is in general not a very difficult task. Then, in a so-called local structure refinement (LSR) step, we employ the geometry changes of the detected anchors as prior information to find the correct assignments for the remaining detections. In addition, we consider targets that are close to each other as neighbours, and we assume that the related trajectories have a similar geometric shape, at least over a short period of time. Combining those constraints with appearance cues is expected to improve the accuracy of association and tracking.
To enhance the recall value, we retrieve missed detections of a target through a prediction step using its previous states. However, while helping to increase the number of true positives (TPs), the prediction can also produce more false positives (FPs) if inferred positions drift away from the true ones. For that reason, we explore a method to calculate and evaluate the accuracy of pedestrian velocity in 3D object space. Moreover, we define a so-called friend relationship among the neighbours, whose trajectories are supposed to move together for a longer time interval. In this way, the velocity and movement of a tracked pedestrian can be corrected and updated by its friends, especially when it is not assigned to any detection for some time.
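A minimal sketch of the prediction step with a friend-based velocity correction is given below. The constant-velocity model and the blending rule (trust in the friends' mean velocity grows linearly with the number of frames without an assigned detection) are assumptions made for illustration; the text above does not specify how the correction is computed, and the names `predict_position`, `friend_corrected_velocity` and `max_trust` are hypothetical.

```python
def predict_position(pos, vel, dt):
    """Constant-velocity prediction of a ground-plane position."""
    return tuple(p + v * dt for p, v in zip(pos, vel))


def friend_corrected_velocity(own_vel, friend_vels, frames_missed, max_trust=5):
    """Blend a target's own velocity with the mean velocity of its friends.

    Assumed rule: the weight given to the friends rises linearly from 0
    (target just observed) to 1 (target unobserved for max_trust frames).
    """
    if not friend_vels:
        return tuple(own_vel)
    n = len(friend_vels)
    mean = tuple(sum(v[i] for v in friend_vels) / n for i in range(len(own_vel)))
    w = min(frames_missed, max_trust) / max_trust
    return tuple((1.0 - w) * o + w * m for o, m in zip(own_vel, mean))


# A target last seen walking along x, whose friends have since turned to walk along z.
own_vel = (1.0, 0.0)                      # m/s, stale estimate
friend_vels = [(0.0, 1.0), (0.0, 1.0)]    # m/s, currently observed friends

vel_fresh = friend_corrected_velocity(own_vel, friend_vels, frames_missed=0)
vel_stale = friend_corrected_velocity(own_vel, friend_vels, frames_missed=5)
predicted = predict_position((2.0, 10.0), vel_stale, dt=0.5)
```

With these values, a target that has just been observed keeps its own velocity, while one that has been unobserved for five frames fully adopts the friends' velocity before its position is extrapolated, which limits drift of the predicted positions.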
In summary, the main contributions of this paper are as follows:
- We propose a framework to track pedestrians in 3D object space by employing advantages of both 2D and 3D information.
- We suggest a two-step data association method to leverage 3D geometry cues, in which a set of strong association pairs, called anchors, is detected first. Then, the trajectories of the anchors are utilized to constrain matching of other detections in the local structure refinement (LSR) step.
- We introduce an approach to reliably estimate and assess the motion of pedestrians. In addition, we define a so-called friend relationship to correct pedestrian velocity and improve trajectory prediction.
The rest of this paper is organized as follows: previous studies related to our research are discussed in Section 2. The details of our tracking framework are described in Section 3. The performance of our tracker is presented in Section 4, followed by the conclusions in Section 5.
Section snippets
Multi-pedestrians tracking
Tracking-by-detection is used by most studies dealing with pedestrian tracking and achieves state-of-the-art results. This method consists of two sub-tasks: detection and data association (Choi, 2015, Dimitrievski et al., 2019, Hong Yoon et al., 2016, Lee et al., 2016, Nguyen et al., 2019, Ošep et al., 2017, Yoon et al., 2015). While detection is considered as an independent problem in computer vision and can be solved by employing state-of-the-art detectors (He et al., 2017, He et al., 2016,
3D Pedestrian tracking with 3D-TLSR
In this section we present our new method called 3D-TLSR (3D pedestrian tracking using local structure refinement). To track pedestrians in 3D object space, we take calibrated and normalized stereo image pairs as input, i.e. the disparity of conjugate points is horizontal, and provide 3D trajectories of pedestrians as output. The disparity map w.r.t the stereo rig is estimated using the state-of-the-art dense matching method proposed in Yamaguchi et al. (2014). Following the
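For normalized (rectified) stereo pairs, a pixel with known disparity can be converted to a 3D point with the standard pinhole relations. The sketch below shows this conversion together with a first-order propagation of disparity noise to depth. It is a generic illustration, not the uncertainty model of the paper; the parameter names (`f`, `cx`, `cy`, `baseline`, `sigma_d`) and the numeric values are assumptions chosen for the example.

```python
def disparity_to_point(u, v, d, f, cx, cy, baseline):
    """Back-project pixel (u, v) with disparity d (pixels) into the left
    camera frame. f is the focal length in pixels, (cx, cy) the principal
    point, baseline the stereo baseline in metres."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    z = f * baseline / d          # depth along the optical axis
    x = (u - cx) * z / f          # lateral offset
    y = (v - cy) * z / f          # vertical offset
    return (x, y, z)


def depth_sigma(z, f, baseline, sigma_d):
    """First-order standard deviation of depth for a disparity noise of
    sigma_d pixels: sigma_z = z^2 / (f * baseline) * sigma_d."""
    return z * z / (f * baseline) * sigma_d


# Hypothetical camera: f = 700 px, principal point (600, 180), baseline 0.5 m.
point = disparity_to_point(u=670.0, v=180.0, d=35.0,
                           f=700.0, cx=600.0, cy=180.0, baseline=0.5)
sigma_z = depth_sigma(point[2], f=700.0, baseline=0.5, sigma_d=0.5)
```

Because the depth error grows with the square of the distance, a disparity noise of half a pixel corresponds to roughly 0.14 m at 10 m for this hypothetical rig, and about four times that at 20 m.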
Experiments and results
To evaluate the performance of our approach, we used three datasets: the KITTI object tracking benchmark (Geiger et al., 2012), the ETH mobile scene (ETHMS) dataset (Ess et al., 2008) and our own dataset. For KITTI, we examined the effectiveness of each component in our tracking pipeline using two sequences of the training dataset, namely 15 and 19, for which ground truth (GT) data is provided. We also compared the performance of our method on the KITTI testing
Conclusion
In this paper, we present 3D-TLSR, a new approach to track pedestrians in 3D space using stereo images. We introduce a way to estimate the uncertainty of objects of interest in 3D, which enables us to localize pedestrians in 3D object space with high precision. We propose a two-step association method so that the local structure constraints between nearby trajectories are leveraged to reduce the number of identity switches in the tracked trajectories even when pedestrians move in crowded
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the German Research Foundation (DFG) as part of the Research Training Group i.c.sens [RTG2159].
References (57)
- et al. Probabilistic multi-person localisation and tracking in image sequences. ISPRS J. Photogramm. Remote Sens. (2017)
- et al. Multi-target tracking on confidence maps: An application to people tracking. Comput. Vis. Image Underst. (2013)
- et al. Automatic detection and tracking of pedestrians from a moving stereo rig. ISPRS J. Photogramm. Remote Sens. (2010)
- et al. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- et al. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. (2011)
- et al. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. (2008)
- et al. Gool, Robust tracking-by-detection using a detector confidence particle filter
- et al. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Trans. Pattern Anal. Mach. Intell. (2011)
- Near-online multi-target tracking with aggregated local flow descriptor
- et al. (2014) Behavioral pedestrian tracking using a camera and lidar sensors on a moving vehicle. Sensors
- Gool, A mobile vision system for robust multi-person tracking
- Improving multi-frame data association with sparse representations for robust near-online multi-object tracking
- Deep residual learning for image recognition
- Real-time RGB-D based people detection and tracking for mobile robots and head-worn cameras
- Online multi-person tracking using integral channel features
- Online multi-target tracking by large margin structured learning
- Learning by tracking: Siamese CNN for robust target association
- Multi-class multi-object tracking using changing point detection
Cited by (9)
- Aleatoric uncertainty estimation for dense stereo matching via CNN-based cost volume analysis. 2021, ISPRS Journal of Photogrammetry and Remote Sensing
  Citation excerpt: If it is, however, necessary to quantify the uncertainty magnitude, residual learning or a probabilistic model may be more suitable. This could, for example, be beneficial to fuse multiple disparity maps for multi-view 3D reconstruction or if the depth uncertainty is incorporated in subsequent applications that build on depth information from stereo images, such as pedestrian tracking (Nguyen and Heipke, 2020) or 3D object reconstruction (Coenen and Rottensteiner, 2019). In this second set of experiments, we compare our approach (CVA-Net) against other state-of-the-art methods for estimating the aleatoric uncertainty of disparity maps.
- Integrating Motion Priors for End-to-End Attention-Based Multi-Object Tracking. 2023, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences - ISPRS Archives
- Guiding Deep Learning with Expert Knowledge for Dense Stereo Matching. 2023, PFG - Journal of Photogrammetry, Remote Sensing and Geoinformation Science
- Joint Estimation of Depth and Its Uncertainty from Stereo Images Using Bayesian Deep Learning. 2022, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
- PolarMOT: How Far Can Geometric Relations Take us in 3D Multi-object Tracking? 2022, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)