1 Introduction

Obstacle detection in autonomous and driver-assisted vehicles concerns the detection of any other objects, static or moving, on or near the road. In an intelligent autonomous vehicle navigation scenario it is, along with path detection, one of the most important tasks, because it involves not only the safety of the vehicle where obstacle detection and recognition are performed, but also the safety of other participants in this scenario, such as other vehicles, pedestrians, cyclists and animals. Based upon the information continuously gathered by obstacle detection, an autonomous vehicle must adjust its behavior or, in the case of an advanced driver assistance system (ADAS), generate alerts that allow drivers to adapt their driving to potential threats.

The state of the art for obstacle detection is already quite robust and, with the recent advances in convolutional neural network (CNN)-based deep learning approaches, has been obtaining excellent results. Prior to this work, we performed a systematic literature review [28, 29] based on the procedures described in [19]. This review showed that the state of the art in road obstacle detection focused on vehicular navigation comprises many examples and widely different approaches.

The approaches we identified vary widely. Some examples are: using only stereo vision (e.g., [16]), only optical flow (e.g., [2]), image segmentation (e.g., [26]) and, more recently, convolutional neural networks (e.g., [27]). There are also several approaches that combine different methods, such as [13], which uses methods based on neural networks, stereo vision and image segmentation, and [12], which uses methods based on stereo vision, optical flow and image segmentation.

Other approaches focus on detecting obstacles moving in the scene, as in [34], where the authors used statistical background subtraction. The background subtraction is applied only to the components of the wavelet-transformed frame. To detect moving obstacles, an adaptive threshold based on different statistical parameters is used, and a post-processing step with morphological operations is applied to improve accuracy. In [33] the authors used an adaptive thresholding-based optical flow method for detecting moving objects. Optical flow is also used in [31], which first estimates the bi-directional optical flow fields between the current frame and the previous frame, as well as between the current frame and the next frame. The bi-directional optical flow is normalized, followed by an analysis of the histogram of optical flow values for each block; at the end, the obstacles are detected as binary blobs. In [32], the authors present an approach for detecting moving objects that uses frame differencing, combining a histogram-based frame differencing technique with the W4 algorithm and applying a morphological operation at the end.

In moving obstacle detection, however, it is not only important to identify the kind of obstacle: a vehicle, a cyclist, a horse rider, a pedestrian crossing the road or a stray animal. It is also important to determine the potential path of these obstacles and to estimate whether a collision is possible, i.e., whether each detected object has the potential to become a threat to the vehicle. For this purpose, it is necessary to estimate each obstacle's distance, velocity and direction of movement. In this context, we can understand autonomous vehicle threat assessment (AVTA) as the continuous active inspection of its sensorial data by a vehicle in order to identify road objects and traffic participants that could pose a threat to the vehicle's navigation.

In the work we present here, we move on from obstacle detection to the next step, which is to extract, from the detected obstacles, features that are relevant to threat assessment in a vehicular navigation context, such as the distance, velocity and direction of movement of the detected objects.

1.1 Objectives

The objective of this work is to investigate the feasibility of developing a passive vision (PV)-based integrated moving obstacle detection and description approach that fulfills the following requirements:

  • detects and classifies obstacles pertaining to a set of predefined classes;

  • provides depth information about each obstacle, relative to the vehicle;

  • provides information about the trajectory and speed of each obstacle, relative to the vehicle;

  • is capable of determining this information employing only data gained from passive vision, without relying on additional data from LIDAR (light detection and ranging) or other active sensors.

Furthermore, our work did not concentrate on developing new image processing algorithms, but instead investigated whether already developed and mature technologies exist that could be combined to achieve the objective above.

1.2 Approach outline

In the obstacle detection step of our approach, we employ stereo images and a state-of-the-art CNN architecture, the mask R-CNN [14], which, in addition to detecting and recognizing objects, also determines their position and shape, providing, as a second layer of results, a semantic segmentation (SS) of the recognized objects. From the original images, obtained from datasets that provide stereo data captured by two cameras, we also generate the disparity map (DM) of the scene (a depth map) for each pair of stereo frames. We apply this DM to the objects recognized and segmented by the mask R-CNN, extracting the average depth for each segmented object and thereby allowing the spatial localization of these objects. In addition, we apply the optical flow calculation to these objects, filtering the average movement flow (motion direction and intensity) separately for each detected object.

1.3 Research rationale

Different sensors can be used for the obstacle detection task. Some vehicles employ an ensemble of diverse sensors, not only cameras [9, 38]. One of these sensors, present in many autonomous vehicle navigation projects, is the active sensor known as LIDAR, a laser source used for active sensing of reflected light in order to measure distances between the sensor and the target object [21]. In vehicular projects, the LIDAR employed is normally a Class 1 laser, which is the category considered to present the least danger. It employs light in the infrared (IR) spectrum, at wavelengths on the order of 905 nm.

Based on the studies of [5, 36], a single Class 1 laser source poses no danger to the retina as long as it does not remain in direct contact with the human eye for a prolonged time. A categorization of lasers and the possible damage caused by excessive exposure at different levels is presented in [5]. Lasers that emit at wavelengths between 780 nm and 1400 nm can cause cataracts and burn the retina. Considering a scenario where autonomous vehicles are used on a large scale, dense traffic could produce a many-LIDAR-originated “lasersmog” and become a risk to nearby humans, who would simultaneously be targeted by the signals of many laser sources.

Even if there is no conclusive study of the impact of many-car-generated “lasersmog” on pedestrians yet, we understand that stereo camera-based PV may be a better alternative in a future autonomous vehicle scenario. For this reason our work focuses on data acquired through passive stereo vision only, without any supplementation through LIDAR data.

We show that this combination of methods, using passive vision only, presents good results for the analysis of characteristics that are important in a vehicle navigation scenario. In the systematic literature review ([28, 29]) in which we evaluated several methods and method combinations, this particular combination did not appear in other works. We believe this is the key contribution of this work.

The remainder of this paper is organized as follows: In Sect. 2 we present the related works and their respective approaches. In Sect. 3 we present the datasets used in our experiments and also present the methods we apply in our approach. Our approach is presented in Sect. 4, followed by the results obtained in Sect. 5. Finally, in Sect. 6 we conclude this paper with a discussion about the results and the next steps in future work.

2 Related work

Other authors have already tackled the PV-based extraction of relevant features from objects in the scene. The authors of [25] perform pedestrian detection with a focus on tracking multiple pedestrians, using stereo vision techniques for the detection step and the RANSAC framework for pedestrian motion estimation and tracking.

In [3] the authors present an approach for tracking detected vehicles in the scene with a focus on identifying overtaking situations. The lane markings on the road are also detected in order to know when a vehicle may be cutting in front, allowing an alert to be generated. For the vehicle detection steps, a histogram of oriented gradients (HOG) and a support vector machine (SVM) classifier are used, while a Kalman filter is used for the vehicle tracking steps.

An obstacle detection approach that also employs stereo vision techniques is presented in [15]. The object contours are found in the image generated from the DM, and based on these contours the authors use the objects' geometric information, such as area and height, to classify the objects (e.g., people, vehicles and others).

Other works also use geometric information from the detected obstacles in order to classify them by type. In [20], the authors present an approach that, besides the geometric information of the detected obstacles (height and width), uses fuzzy logic to classify these obstacles. In [23] the authors applied a segmentation to the disparity map and also used width and height features from the detected obstacles to perform the classification.

A stereo disparity map is also used in [4], together with a histogram of oriented gradients (HOG), to extract the obstacle features. Finally, the classification of obstacles is performed with a support vector machine.

To predict future vehicle localization, the authors of [39] use a recurrent neural network (RNN) incorporating dense optical flow. In [6] the authors showed how to predict other vehicles' actions using hidden Markov models (HMM), interacting multiple models (IMM) and variational Gaussian mixture models (VGMM). Also aiming to predict the trajectories of other vehicles, [18] presents an approach that uses a convolutional neural network (CNN).

These works use different combinations of methods and techniques and focus on classifying the types of obstacles or, at most, on tracking some of them; they do not focus on extracting behavioral features of the obstacles in relation to the moving vehicle.

3 Material and methods

We employed two different datasets in our experiments, both presenting stereo images from urban vehicle navigation scenarios, but in different contexts (Germany and Brazil). Both datasets are presented in Sect. 3.1. In Sects. 3.2, 3.3 and 3.4 we provide brief descriptions about each method we used in our model.

3.1 Datasets

The two datasets used in our experiments were the KITTI dataset [11] and the CaRINA dataset [35]. Both provide high-quality stereo images in vehicle navigation scenarios. KITTI uses PointGray Flea2 cameras, and CaRINA uses a Bumblebee XB3 camera.

Created by the Mobile Robot Laboratory group (ICMC/USP - São Carlos) and filmed in Brazil, more specifically in the city of São Carlos in São Paulo state, the CaRINA dataset aims to provide images for experiments on visual perception for autonomous navigation in emerging-country scenarios containing low-quality roads. It contains few pedestrian situations (almost none), but does contain other vehicles in the scene (e.g., cars, motorbikes, trucks).

In contrast, the dataset provided by KITTI contains a considerable number of pedestrian and cyclist situations in the scene, in addition to other vehicles. KITTI was created by the Karlsruhe Institute of Technology in the city of Karlsruhe, Germany. It is probably one of the most commonly used datasets in visual perception works for vehicle navigation tasks, including path detection and obstacle detection.

3.2 Mask R-CNN

In [14] the authors present a framework for object instance segmentation. The mask R-CNN, in addition to detecting and classifying objects in the scene, also applies a segmentation mask to each detected object (e.g., Fig. 1). According to the authors, mask R-CNN is an extension of faster R-CNN [30].

Fig. 1 Mask R-CNN example on KITTI dataset

We use this framework with models pre-trained on the Inception backbone architecture [37], which has good classification accuracy and is faster than many other architectures. Our experiments run on a model trained with the MS COCO dataset [22], a dataset specific to object detection and segmentation.
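As an illustration, the sketch below shows one way such a pre-trained model could be loaded and run, assuming the TensorFlow 1.x Object Detection API and its publicly released mask_rcnn_inception_v2_coco frozen graph; the file path, tensor names and score threshold follow that API's conventions and are assumptions, not necessarily our exact setup.

```python
# Minimal sketch: running a pre-trained Mask R-CNN (Inception backbone, MS COCO
# weights) exported as a TensorFlow 1.x Object Detection API frozen graph.
import cv2
import tensorflow as tf

GRAPH_PB = "mask_rcnn_inception_v2_coco/frozen_inference_graph.pb"  # hypothetical path

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(GRAPH_PB, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def detect_objects(image_bgr, score_threshold=0.5):
    """Return boxes, class ids, scores and instance masks for one frame."""
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    with tf.Session(graph=graph) as sess:
        boxes, classes, scores, masks, num = sess.run(
            ["detection_boxes:0", "detection_classes:0", "detection_scores:0",
             "detection_masks:0", "num_detections:0"],
            feed_dict={"image_tensor:0": image_rgb[None, ...]})
    n = int(num[0])
    keep = scores[0, :n] >= score_threshold
    # Note: the masks returned by this export are box-relative and must be
    # resized into image coordinates before being used as pixel-level regions.
    return boxes[0, :n][keep], classes[0, :n][keep], scores[0, :n][keep], masks[0, :n][keep]
```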

3.3 Disparity map

The disparity is the difference in position that the same pixel has between two images. It is common to use disparity as a synonym of depth [1]. The ideal in stereo vision work is that the images are perfectly rectified on the y-axis, so that the scan for corresponding pixels and their respective differences needs to occur only along the x-axis:

$$\begin{aligned} D = x_{l} - x_{r} \end{aligned}$$
(1)

where \(x_{l}\) is the specific pixel coordinate in the left image, \(x_{r}\) is the coordinate of the same specific pixel in the right image and D is the disparity value between these points. Both datasets used in our experiments have perfectly rectified images.

The disparity map is the image that represents the pixel disparity values as intensities, where high intensity values represent high disparities and low intensity values represent lower disparities [1]. Normally the disparity map is displayed as a grayscale image; we applied a simple color conversion for better visualization, but the intensity information is the same (e.g., Fig. 2).
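For reference, the following sketch shows one common way of computing such a disparity map from a rectified stereo pair with OpenCV's semi-global block matching; the matcher parameters and file names are illustrative, not necessarily the configuration used in our experiments.

```python
# Sketch: disparity map from a rectified stereo pair with OpenCV's StereoSGBM.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,
    P2=32 * 5 * 5,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Rescale to 0-255 and apply a color map purely for visualization,
# as done for the figures in this paper.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("disparity_color.png", cv2.applyColorMap(vis, cv2.COLORMAP_JET))
```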

Fig. 2 Disparity map example on KITTI dataset. Original left image on top, disparity map on bottom

3.4 Optical flow

The goal of optical flow (OF) is to identify the displacement of intensity patterns in the image along sequential frames. This movement information can be very useful in computer vision because it also allows the identification of certain patterns in the scene [10].

In the literature there are examples of OF obtained through neural networks [7, 17] and also through traditional numeric methods [8, 24]. Neural OF methods may be a more recent trend, but they also require more computational resources. In our work, we had already performed the detection and segmentation of the obstacles with a CNN and only needed to apply the OF calculation to the detected objects. For this reason we opted to perform a post-processing step employing a traditional OF approach. In addition, this approach provides us with explicit vector data that could later be used by a vehicle for threat assessment, which is not possible with the present CNN-based OF approaches.

In our approach we used the Gunnar Farneback algorithm [8], which produces a dense OF over a grid of points. In this algorithm, the movement vector is extracted from information obtained from two consecutive frames [8]. As this algorithm calculates the OF for each pixel in the image, it produces a good motion estimate for the regions encompassing the detected objects. An example with flow vectors is shown in Fig. 3, where a pedestrian is crossing the street in front of a waiting vehicle.
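A minimal sketch of a dense Farneback flow computation with OpenCV is shown below; the parameter values are commonly used defaults from the OpenCV documentation and are assumptions, not necessarily those used in our experiments.

```python
# Sketch: dense optical flow between two consecutive frames with the
# Gunnar Farneback algorithm in OpenCV.
import cv2

prev = cv2.cvtColor(cv2.imread("frame_t0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t1.png"), cv2.COLOR_BGR2GRAY)

# flow has shape (H, W, 2): the per-pixel displacement (dx, dy) between frames.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

dx, dy = flow[..., 0], flow[..., 1]
magnitude, angle = cv2.cartToPolar(dx, dy)  # explicit vector data per pixel
```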

Fig. 3 Optical flow example on KITTI dataset

4 Our approach

In Fig. 4 we present a schematic overview of our approach, which consists in combining the techniques described in Sect. 3. The mask R-CNN produces a bounding box and a segmentation mask for each detected object, which we use as a region of interest to calculate the average movement of each object, using the values obtained by the OF, and also to calculate its average depth, using the disparity values in each detected region. This allows us to generate a detailed analysis of depth and movement for each object in the scene; that is, we calculate the averages of movement and depth distinctly for each object detected by the mask R-CNN.

Both the KITTI and the CaRINA datasets provide stereo images. For the OF calculation and the CNN object detection we employ only the left-captured (driver-side) images from the datasets. For the calculation of the disparity maps we employ the whole stereo data.
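The per-object aggregation step can be summarized by the following sketch, which averages the disparity and the optical-flow components over the pixels of each instance mask; the array names and shapes are illustrative, and all arrays are assumed to be aligned to the left image of the stereo pair.

```python
# Sketch: per-object averaging of disparity and optical flow over a mask R-CNN
# instance mask.
import numpy as np

def describe_object(mask, disparity, flow):
    """mask: (H, W) boolean instance mask; disparity: (H, W) float disparity map;
    flow: (H, W, 2) per-pixel (dx, dy) optical-flow displacements."""
    pixels = mask > 0
    avg_disparity = float(disparity[pixels].mean())  # used for the depth label
    avg_dx = float(flow[..., 0][pixels].mean())      # lateral motion (x-axis)
    avg_dy = float(flow[..., 1][pixels].mean())      # approaching / receding (y-axis)
    return avg_disparity, avg_dx, avg_dy

# One call per object detected in the current left frame:
# for mask in instance_masks:
#     avg_d, avg_dx, avg_dy = describe_object(mask, disparity_map, farneback_flow)
```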

Fig. 4 The steps in our model

With the disparity values obtained from each object's segmentation, it is possible to generate an average disparity value for each segmented object. Thus, in the final analysis, we defined depth labels with pre-set thresholds. We defined four depth labels: very close when the average disparity value is equal to or greater than 185; close when the average disparity value is equal to or greater than 115 and less than 185; far when the average disparity value is equal to or greater than 45 and less than 115; and, finally, very far when the average disparity value is less than 45.
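Expressed as code, the depth labeling reduces to a simple threshold cascade on the average disparity; a sketch using the thresholds above:

```python
# Depth label from an object's average disparity, using the thresholds above
# (the values refer to the intensity units of the disparity map).
def depth_label(avg_disparity):
    if avg_disparity >= 185:
        return "very close"
    if avg_disparity >= 115:
        return "close"
    if avg_disparity >= 45:
        return "far"
    return "very far"
```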

Table 1 Accuracy results
Table 2 Depth confusion matrix
Table 3 Depth performance

In the same way, we used the optical flow values from each detected object to generate average movement values, computing their means as the resulting direction and intensity of the motion vector. We use the direction value of the OF vector on the x-axis to label whether a detected object is laterally stationary, moving from right to left, or moving from left to right. We defined the direction labels on the x-axis as: left to right when the value is equal to or greater than 1; right to left when the value is equal to or less than \(-1\); and, finally, stable direction when the value is less than 1 and greater than \(-1\).

The direction value of the OF vector on the y-axis indicates whether the detected object is approaching, moving away or maintaining a stable distance. We defined three labels for the y-axis: approaching when the value is equal to or greater than 1; moving away when the value is equal to or less than \(-1\); and, lastly, stable distance when the value is less than 1 and greater than \(-1\).
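The two direction labelings can be sketched as follows, using the \(\pm 1\) thresholds defined above on the average flow components of each object:

```python
# Direction labels from the average optical-flow components of an object,
# using the +/-1 pixel thresholds described in the text.
def x_direction_label(avg_dx):
    if avg_dx >= 1:
        return "left to right"
    if avg_dx <= -1:
        return "right to left"
    return "stable direction"

def y_direction_label(avg_dy):
    if avg_dy >= 1:
        return "approaching"
    if avg_dy <= -1:
        return "moving away"
    return "stable distance"
```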

Table 4 x-axis direction confusion matrix
Table 5 x-axis direction performance
Table 6 y-axis direction confusion matrix
Table 7 y-axis direction performance
Table 8 Movement intensity confusion matrix

The greater the displacement of a pixel between two frames, the greater the vector representing that displacement will be. This value gives us a sense of whether the detected object is moving fast or slowly. We obtained the displacement value for each object by multiplying the average values of the x-axis and y-axis flow of that object:

$$\begin{aligned} xM={\frac{1}{n}}\sum _{i=1}^{n}x_{i}\end{aligned}$$
(2)
$$\begin{aligned} yM={\frac{1}{n}}\sum _{i=1}^{n}y_{i}\end{aligned}$$
(3)
$$\begin{aligned} VL = xM * yM \end{aligned}$$
(4)

where xM is the x-axis average value of an object, yM is the y-axis average value of the same object, and VL is the vector intensity value of that object. We defined five labels to represent the movement intensity: stopped when the value is less than 0.1; slow when the value is equal to or greater than 0.1 and less than 0.5; average speed when the value is equal to or greater than 0.5 and less than 5; fast when the value is equal to or greater than 5 and less than 100; and very fast when the value is equal to or greater than 100. It is important to emphasize that the movement thresholds were not normalized in relation to the distance in this approach.
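A sketch of the intensity labeling, following Eqs. (2)-(4) and the thresholds above, is given below; taking the absolute value of VL is our assumption for illustration, since the sign of the product is not discussed in the text.

```python
# Movement-intensity label following Eqs. (2)-(4) and the thresholds above.
# Using abs(VL) is an assumption; the labels in the text do not use the sign.
def movement_intensity_label(avg_dx, avg_dy):
    vl = abs(avg_dx * avg_dy)
    if vl < 0.1:
        return "stopped"
    if vl < 0.5:
        return "slow"
    if vl < 5:
        return "average speed"
    if vl < 100:
        return "fast"
    return "very fast"
```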

5 Results

We compared the results obtained with manual annotations made on a total of 415 obstacles over 100 frames, part from the CaRINA dataset and part from the KITTI dataset. Twenty sequences of frames were selected, each containing 5 frames. Table 1 presents the overall accuracy for each task in the extraction and analysis of the obstacles' position and movement. A more detailed analysis is possible through Tables 2, 4, 6 and 8, where we present, through confusion matrices, the detailed results of each task and its labels. We also present the performance of each task in Tables 3, 5, 7 and 9, containing the true positive rate (TPR, recall), the false positive rate (FPR) and the precision values for each class of each task.
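For reproducibility, the per-class rates reported in these tables can be derived from a confusion matrix as in the following sketch; the use of scikit-learn here is our own illustration, not necessarily the tooling used to produce the tables.

```python
# Sketch: per-class TPR (recall), FPR and precision from a confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix

def per_class_rates(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    total = cm.sum()
    stats = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp          # missed instances of this class
        fp = cm[:, i].sum() - tp          # other classes predicted as this class
        tn = total - tp - fn - fp
        stats[label] = {
            "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
            "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        }
    return accuracy_score(y_true, y_pred), cm, stats

# Example: depth labels predicted by the pipeline vs. manual annotations.
# acc, cm, stats = per_class_rates(manual_depth, predicted_depth,
#                                  ["very close", "close", "far", "very far"])
```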

Table 9 Movement intensity performance
Fig. 5 Examples of results in the KITTI dataset

Fig. 6 Examples of results in the CaRINA dataset

Table 2 shows the confusion matrix of the distance (depth) analysis of the obstacles in the scene; it can be noticed that the worst results were obtained with the labels “very close” and “far.” However, it is also possible to verify that most errors in both classes fell into neighboring labels. Still, the “very close” label featured a considerable number of errors classified as “far.” This occurs in situations where objects are close to each other in the scene, making it difficult to analyze them as individual objects. The same occurs with the “approaching” label in Table 6, of which 8.45% was labeled as “moving away.” In Table 8 the worst result was for the label “slow,” yet in all classes the errors occur mostly in neighboring labels (Tables 2, 4, 6 and 8).

In Figs. 5 and 6 we present some of the results obtained by our approach. In the left column we show the combined results obtained with the CNN, together with the disparity map and OF patterns. In the right column we present the labels assigned to the objects based on the analysis of the depth and movement patterns of each object.

In the first row of Fig. 5 the vehicle responsible for capturing the images (capture source, CS) is stationary, and four cars are passing in the right lane. Here, using the patterns obtained from the disparity map and OF, it is possible to verify their behavior and how distant these four vehicles are, as highlighted in Fig. 7. In comparison, traffic lights are identified as static. The second row of Fig. 5 presents the continuation of the first row, with the CS still stationary and vehicles passing in the right lane.

The vehicles passing in the right lane are labeled with the direction right to left because, even though they are not crossing abruptly in front of the CS, their movement is not exactly parallel owing to the image perspective: as they overtake the CS, they appear to move along the x-axis from right to left, like lines converging toward the vanishing point.

The third, fourth and fifth rows of Fig. 5 form a sequence and present vehicles with a trajectory that will generate an actual direct crossing. These vehicles are further away, while the CS is standing behind another, nearby vehicle, which is also stationary (Fig. 8a).

The figures in the fourth and fifth rows of Fig. 5 show a truck crossing in front of the vehicle from right to left. In the sixth row, we show the extraction and analysis of the patterns for a pedestrian very close to the CS and another vehicle further away, both crossing in front of the CS in opposite directions (Fig. 8b). Different objects at different distances are shown in the seventh row.

In the last row of Fig. 5 we present a sequence where the CS is slowing down, almost stopping, while several pedestrians begin to cross with similar but not synchronized behavior, resulting in some variation in the data.

In the five rows of Fig. 6, showing results from the CaRINA dataset, the CS is in motion and, although the images contain less movement than those of the KITTI dataset, it is still possible to observe the movement and distance patterns of the detected objects, mainly from the first to the third row, which correspond to a sequence.

6 Conclusion and discussion

Obstacle detection and recognition focused on ADAS and/or on autonomous vehicle navigation has made major breakthroughs in recent years, especially considering the advances in CNNs. The approach we present in this paper focuses on the next step after the detection and recognition of obstacles: the extraction of the depth and movement patterns of the detected objects.

We understand that identifying these patterns will allow smarter and safer decision making in an ADAS or in an autonomous vehicle, helping to identify potential threats, both by providing more precise alerts for a human driver and by passing more data to an intelligent agent module responsible for making decisions in an autonomous vehicle.

In our approach, we combine CNN-based detection and object recognition results with depth patterns obtained from a disparity map and movement patterns (direction and velocity) obtained from optical flow. The results obtained are promising and motivate the continuation of this research.

Fig. 7 Highlighted pattern analysis results

Fig. 8 Highlighted pattern analysis results

6.1 Future work

One of our next steps with this approach consists of applying this same model, with the same combination of methods, on an NVIDIA Jetson board, provided for our project through the NVIDIA GPU Grant Program, in a vehicle with real-time image capture and a specially developed stereo rig, thus improving the performance of the currently proposed pipeline.

In addition, one of the possibilities opened by the extraction of the distance, trajectory and movement patterns we are performing is to try to predict the actions of the participants in the scene, such as other vehicles, cyclists, pedestrians and animals, performing threat assessment, a project that is already underway in our group. For this step it may be very relevant to normalize the movement intensity values in relation to the distance.

In the context of these possible next steps, we are also investigating the possibilities associated with the analysis of the obtained patterns, studying the creation of potential new behavior labels. During these next phases we plan to perform more experiments related to the analysis of the obtained patterns, for example refining the direction label in order to differentiate situations where a crossing in front of the capture source will in fact occur from those where a lateral overtaking is likely.