Abstract

Accurate prediction and reliable identification of the significant factors of incident clearance time are two main objectives of a traffic incident management (TIM) system, as they help relieve the traffic congestion caused by incidents. This study applies the extreme gradient boosting (XGBoost) algorithm to predict incident clearance time on freeways and to analyze the significant factors of clearance time. XGBoost combines the advantages of statistical and machine learning methods: it can flexibly handle nonlinear data in high-dimensional space and quantify the relative importance of the explanatory variables. The data collected from the Washington Incident Tracking System in 2011 are used in this research. To uncover the patterns hidden in the data, K-means is used to cluster the data into two clusters, and an XGBoost model is built for each cluster. Bayesian optimization is used to tune the parameters of XGBoost, and the mean absolute percentage error (MAPE) is adopted as the indicator to evaluate prediction performance. A comparative study confirms that XGBoost outperforms other models. In addition, response time, AADT (annual average daily traffic), incident type, and lane closure type are identified as the significant explanatory variables for clearance time.

1. Introduction

According to Lindley [1], traffic incidents cause about 60% of nonrecurrent traffic congestion. Such congestion has many adverse effects, including reduced roadway capacity, an increased likelihood of secondary incidents [2], and unfavorable social and economic consequences [3]. When a traffic incident occurs, timely and reliable incident duration prediction helps traffic authorities design strategies for traffic guidance. According to the Highway Capacity Manual, traffic incident duration consists of four phases [4]: detection time (from incident occurrence to detection), response time (from incident detection to verification), clearance time (from incident verification to clearance), and recovery time (from incident clearance to the return of normal traffic conditions). Severe incidents that are not cleared in time may double or even triple the incident duration [5]. Compared with the other phases, clearance time is the most important and time-consuming phase of the incident process. Thus, the aims of this paper are to effectively predict the clearance time and to investigate its significant influencing factors.

Over the past few decades, a large number of studies have been undertaken to predict incident duration. These approaches fall mainly into statistical approaches and machine learning approaches. Statistical methods rest on model assumptions and predefined underlying relationships between the dependent and independent variables [6], which is what makes them interpretable. The widely used statistical methods include probabilistic distribution analysis [7, 8], regression [9–13], discrete choice models [14], structural equation models [15], hazard-based duration models [16], Cox proportional hazards regression [17–19], and accelerated failure time models [20–23]. Unlike statistical methods, machine learning methods are based on a more flexible mapping process that requires little or no prior hypothesis. This flexibility allows machine learning methods to handle nonlinear data in high-dimensional space, but it prevents them from exposing the underlying relationship between the dependent and independent variables. The widely used machine learning methods include K-nearest neighbors [24–27], support vector machines [26–28], Bayesian networks [29–34], artificial neural networks [2, 35–37], genetic algorithms [37, 38], tree-based methods [25, 39–41], and hybrid methods [42].

In summary, conventional incident clearance time prediction studies rely either on statistical models with prior assumptions or on machine learning models with poor interpretability [43]. To address these issues, we apply the extreme gradient boosting (XGBoost) method to predict the clearance time and then investigate the significant influencing factors of traffic incident clearance time. XGBoost inherits the advantages of both statistical and machine learning models: it can handle nonlinear high-dimensional data while computing the relative importance of the variables.

In this study, the prediction performance of XGBoost is examined using data from the Washington Incident Tracking System in 2011. To better explore the patterns hidden in the original data, we cluster the data according to their inherent properties, and an XGBoost model is then built for each cluster. The framework of the proposed method is detailed in Section 3.5.

The remainder of this paper is organized as follows. The data source is described in Section 2. Section 3 presents the K-means algorithm, the XGBoost algorithm, the Bayesian optimization algorithm, the evaluation indicator, and the framework of the proposed method. The model results and discussion are given in Section 4. The last section concludes the paper.

2. Data Description

Traffic incident data were collected from the Washington Incident Tracking System (WITS) for incidents that occurred on the section from Boeing Access Road (Milepost 157) to the Seattle Central Business District (Milepost 165). This segment is not only a high incident-occurrence area but also carries heavy traffic demand [44]; it was therefore chosen as the research object. The annual average daily traffic (AADT) comes from the Highway Safety Information System (HSIS) database, and the historical weather data were obtained from the National Oceanic and Atmospheric Administration (NOAA) weather stations in the region. The components of the data are detailed in Table 1. The dataset contains 14 discrete explanatory variables and 2 continuous explanatory variables. According to their properties, they are divided into six categories: incident, temporal, geographical, environmental, traffic, and operational. The detailed value sets of the variables are presented in the third column of Table 1. To equalize the variability of the independent variables, both the response time and AADT variables are normalized [41, 43–46].

In total, 2565 incident records were retrieved from the WITS database for the period from 1 January to 31 December 2011. The mean and standard deviation of clearance time are 13.10 minutes and 14.63 minutes, respectively. The large standard deviation (14.63 min) means that most clearance time values deviate considerably from the mean. That is, the original data should be processed so that they are better organized.

3. Methodology

3.1. K-Means Algorithm

K-means, developed by MacQueen [47], is one of the most widely used clustering methods. Samples with similar characteristics can be grouped into the same cluster [48]. The data used in this research are expressed as $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^{m}$, where $n$ represents the number of incidents, $m$ is the number of explanatory variables, and $y_i$ denotes the actual clearance time. The detailed steps of the K-means algorithm are as follows:

Step 1: assume the number of clusters ($K$ clusters) and choose the initial cluster centers randomly from the dataset.

Step 2: assign each remaining sample to a cluster according to the distance function

$$x_i \in C_a \quad \text{if } \|x_i - \mu_a\|^2 \le \|x_i - \mu_b\|^2, \quad \forall b \neq a, \tag{1}$$

where $\mu_a$ and $\mu_b$ are the centers of cluster $a$ and cluster $b$, and $C_a$ denotes cluster $a$.

Step 3: after all samples have been assigned, recompute the center of each cluster as

$$\mu_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i, \tag{2}$$

where $n_j$ is the number of samples in cluster $j$.

Step 4: repeat Step 2 and Step 3 until the change in the cluster centers falls within a tolerance.

Accordingly, the value of $K$ and the cluster centers are important to the clustering performance, as K-means depends strongly on the selection of the initial cluster centers and the number of clusters. To obtain a reasonable $K$, we use the silhouette coefficient, proposed by Rousseeuw [49], as the evaluation index:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \tag{3}$$

where $a(i)$ is the average distance between sample $i$ and the other samples within the same cluster, and $b(i)$ is the lowest average distance from sample $i$ to the samples of any other cluster.
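For concreteness, the following minimal sketch shows how this selection of $K$ might be carried out, assuming scikit-learn is available; the data array is a synthetic stand-in for the WITS features, and all names are illustrative rather than the authors' code.

```python
# A minimal sketch of K selection by silhouette coefficient (equation (3)),
# assuming scikit-learn; the data below are a synthetic stand-in.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_range=range(2, 11), seed=0):
    """Return the K with the highest average silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

X = np.random.rand(300, 16)   # stand-in: 16 explanatory variables, as in Table 1
best_k, scores = choose_k(X)
```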

3.2. Extreme Gradient Boosting Machine Algorithm

Chen and Guestrin [50] proposed the extreme gradient boosting (XGBoost) algorithm. It can be regarded as an advanced implementation of the gradient boosting machine (GBDT) and adopts decision trees as base learners for classification and regression. Boosting is an ensemble approach that corrects the prediction error of the current model by adding new models to it [41]. The prediction of a boosting model is the sum of the scores of all models. Accordingly, the prediction of XGBoost is the sum of the scores of $K$ boosted trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \tag{4}$$

where $x_i$ is the $i$th sample, $f_k(x_i)$ is the score of $x_i$ at the $k$th boosted tree, and $\mathcal{F}$ is the space composed of boosted trees. To decrease the fitting error, XGBoost improves on GBDT by adding regularization to the objective:

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \tag{5}$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted values of the $i$th sample, the former term is the loss function, which needs to be a differentiable convex function, and the latter term is the penalty on model complexity for avoiding overfitting. The second term of equation (5) is defined as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2, \tag{6}$$

where both $\gamma$ and $\lambda$ are constants, $T$ denotes the number of leaves, and $w$ is the vector of leaf scores. When equation (6) equals zero, the objective reduces to the conventional formula of GBDT.

According to equations (5) and (6), the training error and the model complexity are the two main components of the XGBoost objective. Once the previous trees have been trained, the current tree can be trained with an additive training method: when the $t$th boosted tree is trained, the parameters of the previous trees (from the first tree to the $(t-1)$th tree) are fixed and their corresponding terms are constant. Taking the $t$th boosted tree as an example, the loss can be expressed as

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k). \tag{7}$$

The two terms of equation (7) decompose as

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \tag{8}$$

$$\sum_{k=1}^{t} \Omega(f_k) = \sum_{k=1}^{t-1} \Omega(f_k) + \Omega(f_t). \tag{9}$$

The first terms of equations (8) and (9) are the accumulated score and accumulated regularization of the former $t-1$ trees, and the second terms are the score and regularization of the $t$th boosted tree; $\hat{y}_i^{(t-1)}$ is the predicted value at the $(t-1)$th iteration, and $\Omega(f_t)$ is the regularization of the $t$th iteration.

Equations (8) and (9) are substituted into equation (7), and then equation (7) is expanded with the second-order Taylor formula

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2. \tag{10}$$

Here $\hat{y}_i^{(t-1)}$ is considered as $x$ and $f_t(x_i)$ is regarded as $\Delta x$. Then, equation (7) is transformed into

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}, \tag{11}$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ are the first-order and second-order gradient statistics.

As Chen and Guestrin [50] suggested, $f_t(x)$ can also be written as $f_t(x) = w_{q(x)}$, where $q(x)$ denotes the leaf node that $x$ falls into, $w_{q(x)}$ indicates the weight of that leaf, which can be considered as the predicted value of the $t$th iteration, and $T$ is the number of leaf nodes. Then, equation (11) can be expressed as

$$\mathcal{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T, \tag{12}$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples assigned to leaf $j$. When the tree structure $q$ is fixed, the optimal leaf weight $w_j^{\ast}$ and the metric function that measures the quality of the tree structure can be calculated as

$$w_j^{\ast} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \tag{13}$$

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \tag{14}$$
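As a concrete illustration of how the penalties in equation (6) surface in practice, the sketch below uses the xgboost Python package, whose reg_lambda and gamma arguments correspond to $\lambda$ and $\gamma$; the data are synthetic stand-ins, not the WITS dataset.

```python
# Sketch: the gamma and reg_lambda arguments of the xgboost package
# correspond to the gamma and lambda penalties of equation (6).
# Synthetic stand-in data; not the WITS dataset.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((500, 16))                      # 16 explanatory variables
y = 10 + 5 * X[:, 0] + rng.normal(0, 1, 500)   # stand-in clearance time

model = xgb.XGBRegressor(
    n_estimators=100,   # number of boosted trees K in equation (4)
    max_depth=5,
    learning_rate=0.05,
    gamma=0.0,          # per-leaf penalty: gamma * T in equation (6)
    reg_lambda=1.0,     # L2 penalty on leaf weights: lambda in equation (6)
)
model.fit(X, y)
preds = model.predict(X[:5])
```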

3.3. Bayesian Optimization Algorithm

The Bayesian optimization algorithm (BOA), one of the best-known extensions of the Bayesian network, is based on the construction of a probabilistic model. This model places a distribution over the objective function that maps inputs to outputs. In the Bayesian optimization process, global statistical characteristics are obtained from the optimal solutions and modeled with the Bayesian network [51]. This is why BOA is advantageous for machine learning models, which need well-chosen parameters to flexibly handle nonlinear high-dimensional data [52]. In this study, BOA is applied to optimize the parameters of XGBoost with the aim of accurately predicting traffic incident clearance time.

Bayesian optimization has two core components: the prior function (PF) and the acquisition function (AC), the latter also called the utility function [51]. A Gaussian process (GP) is generally used as the PF, while the AC balances exploration and exploitation. The framework of Bayesian optimization is presented in Figure 1, and the main steps are as follows: (1) The data are split into training data and validation data by k-fold cross-validation, and the initial parameters of the target model are defined as $\lambda_0$. (2) The target model with the current parameters is evaluated on the validation data, and the resulting validation error is recorded; the goal of the optimization is to minimize this validation error. (3) A Gaussian process is fitted to the recorded evaluations. (4) The parameters of the target model are updated according to the GP: the maximizer of the AC is selected as the next point to evaluate, which is how the optimization determines the next evaluation. Probability of improvement, expected improvement, and information gain are the three widely used acquisition functions [51]. In this study, expected improvement is chosen as the AC and is written as

$$\mathrm{EI}(\lambda) = \int \max\left(v^{\ast} - v,\, 0\right)\, p(v \mid \lambda)\, dv, \tag{15}$$

where $v$ is the validation error, $v^{\ast}$ is the best value observed so far, and $p(v \mid \lambda)$ is the probability of $v$ given the parameters $\lambda$, computed by the GP.
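The following compact sketch illustrates steps (3) and (4) with a GP prior and the expected-improvement criterion of equation (15); it uses scikit-learn's GaussianProcessRegressor on a one-dimensional toy objective, so everything here is illustrative rather than the tuning code used in this study.

```python
# A compact sketch of the Bayesian optimization loop: GP prior plus
# expected improvement (equation (15)). Toy 1-D objective; illustrative only.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(lam):
    """Stand-in for the validation error of a model trained with parameter lam."""
    return np.sin(3 * lam) + 0.1 * lam**2

def expected_improvement(cands, gp, best):
    mu, sigma = gp.predict(cands, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma                  # improvement over best observed error
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

X = np.array([[0.0], [1.5], [3.0]])          # initial parameter samples
y = np.array([objective(x[0]) for x in X])
cands = np.linspace(-1, 4, 200).reshape(-1, 1)

for _ in range(10):                          # steps (3)-(4), repeated
    gp = GaussianProcessRegressor().fit(X, y)
    nxt = cands[np.argmax(expected_improvement(cands, gp, y.min()))]
    X = np.vstack([X, nxt])
    y = np.append(y, objective(nxt[0]))

best_lam = X[np.argmin(y)][0]
```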

3.4. Evaluation Indicator

In general, the mean absolute percentage error (MAPE) is a commonly used indicator to evaluate the prediction performance of a regression model. As mentioned above, the data are described as $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^{m}$, which can be considered as a matrix of size $n \times m$. Specifically, $n$ is the number of incidents and $y_i$ represents the actual value of the $i$th incident. Let $\hat{y}_i$ be the predicted value of the $i$th incident. Then, the MAPE can be expressed as

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|. \tag{16}$$

According to this formula, the MAPE is a relative indicator that measures the prediction performance of a model from its actual and predicted values.
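For reference, equation (16) translates directly into a few lines of Python (a straightforward helper, not taken from the authors' code):

```python
# Direct implementation of equation (16).
import numpy as np

def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

mape([10.0, 20.0], [12.0, 18.0])   # -> 0.15
```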

3.5. Framework of the Proposed Method

As introduced in Section 2, the original dataset needs to be organized well so that the patterns hidden in the data can be explored more easily. To this end, we use the K-means algorithm to cluster the original dataset into several categories of highly similar data. An XGBoost model is then built for each category to perform prediction. The main steps of the proposed method, sketched in code after this list, are as follows:

Step 1: cluster the original data into several categories using the K-means algorithm, with the number of clusters determined by the optimal silhouette coefficient (see Section 3.1).

Step 2: split the clustered data into training data and testing data for each category, and use the training data to construct the XGBoost model.

Step 3: use BOA to optimize the parameters of each constructed XGBoost model.

Step 4: input the testing data into the trained XGBoost, and record the predicted clearance time.

Step 5: calculate the predictive indicator (MAPE) and the relative importance of the explanatory factors.
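A condensed sketch of Steps 1-5 is given below, assuming scikit-learn and xgboost; the 7 : 3 split of Section 4 is shown, while the data loading and the tuned parameter values (which would come from BOA) are illustrative placeholders.

```python
# Condensed sketch of Steps 1-5 (scikit-learn + xgboost assumed);
# tuned parameter values would come from the BOA of Section 3.3.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import xgboost as xgb

def mape(y, p):
    return np.mean(np.abs((y - p) / y))

def run_pipeline(X, y, k=2, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)  # Step 1
    results = {}
    for c in range(k):
        Xc, yc = X[labels == c], y[labels == c]
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xc, yc, test_size=0.3, random_state=seed)         # Step 2 (7 : 3 split)
        model = xgb.XGBRegressor()   # Step 3: parameters tuned by BOA in practice
        model.fit(X_tr, y_tr)
        results[c] = mape(y_te, model.predict(X_te))          # Steps 4-5
    return results
```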

Note that as the number of traffic incidents grows, the dataset will be updated continuously, and thus the XGBoost model should be retrained.

4. Prediction Result and Discussion

This study has two objectives: (a) examining the performance of the XGBoost model in predicting clearance time and (b) investigating the significant factors of clearance time. We first process the original data, including data clustering and clustering evaluation. Next, the data are split into training data and testing data with a ratio of 7 : 3. XGBoost is trained on the training data, and the testing data are used for model evaluation. A comparative study then examines the prediction performance of XGBoost, with MAPE as the predictive measure. Finally, the relative importance of all explanatory variables is calculated, and the significant explanatory variables of incident clearance time are analyzed. The proposed model is implemented in Python.

4.1. Data Preprocessing

Before modeling, the original dataset is processed with the K-means algorithm. As described in Section 3.1, the number of clusters (K) is the key parameter of K-means. To find the best K, values of K from 2 to 10 are used to calculate the corresponding silhouette coefficients, with the results shown in Table 2. The search is assumed to stop when the silhouette coefficient does not improve for 5 consecutive iterations; it therefore stops at K = 7, as the silhouette coefficients of the preceding 5 iterations keep decreasing. According to equation (3), a higher silhouette coefficient indicates better clustering performance. As Table 2 shows, the silhouette coefficient reaches its maximum (0.613) at K = 2, so K is set to 2 and the original data are clustered into two clusters. To present each cluster clearly, we draw scatter plots of the target variable against one randomly chosen explanatory variable, shown in Figure 2. The x-axis is clearance time, and the y-axis denotes response time. Figure 2(a) shows the scatter plot of the two variables in the original data, while Figure 2(b) shows the scatter plot of the clustered data. As shown in Figure 2(b), cluster 1 (marked in purple) represents relatively shorter clearance times, and cluster 2 (marked in yellow) represents longer clearance times.

To understand the characteristics of the two clusters, several essential indexes are calculated and presented in Table 3. In total, there are 2246 incidents in cluster 1 and 319 incidents in cluster 2. For cluster 1, the mean, standard deviation, median, and range of clearance time are 9 minutes, 5.44 minutes, 7.00 minutes, and 22 minutes; for cluster 2, they are 39.25 minutes, 15.25 minutes, 35 minutes, and 75 minutes, respectively. Comparing the median with the mean within each cluster shows that the median is smaller than the mean in both clusters, indicating that the clearance time distributions are skewed rather than normal. The skew values of the two clearance time distributions are 0.92 in cluster 1 and 1.59 in cluster 2; both distributions are right-skewed, consistent with previous studies [26, 39, 41]. The distributions of clearance time in the two clusters are shown in Figures 3(a) and 3(b).

Both Figures 3(a) and 3(b) show long-tail distributions, with ranges of 22 and 75 minutes. Data with such wide value ranges are difficult to handle [53]. Therefore, to bring the distribution of clearance time closer to normal, we apply data transformations to the clearance time in each cluster. For cluster 1, the skew value of clearance time is 0.92, which lies between 0.5 and 1 and indicates moderate skew; following the empirical rule, we apply the square root transformation to the clearance time in cluster 1. For cluster 2, the skew value is 1.59, which is larger than 1 and indicates high skew, so the log transformation is used to convert the clearance time in cluster 2. The distributions of the transformed clearance time are presented in Figures 3(c) and 3(d). In Figure 3, the blue line is the fitted curve of the clustered data, and the black line denotes the normal distribution curve fitted with the calculated mean and standard deviation. As shown in Figures 3(c) and 3(d), the distributions of the transformed data are closer to normal.
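The transformation rule described above can be expressed as a small helper, assuming scipy; the thresholds mirror the empirical guideline in the text, and the function name is hypothetical.

```python
# Skew-based transformation rule used above: square root for moderate
# right skew (0.5 < skew <= 1), log for high skew (> 1). Assumes scipy.
import numpy as np
from scipy.stats import skew

def transform_clearance(t):
    t = np.asarray(t, dtype=float)
    s = skew(t)
    if s > 1:
        return np.log(t), "log"       # cluster 2 case (skew 1.59)
    if s > 0.5:
        return np.sqrt(t), "sqrt"     # cluster 1 case (skew 0.92)
    return t, "none"
```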

4.2. Parameter Optimization

In general, there are three approaches to parameter optimization: systematic grid search, random search, and Bayesian optimization. Grid search works well because it systematically covers the entire search space, but it is time-consuming. In contrast, random search runs fast but may miss the best value because it samples the search space randomly. Bayesian optimization is a process of continuously sampling, evaluating, and updating the model. We therefore apply Bayesian optimization to find the optimal parameters of XGBoost. These parameters include the maximum depth of a tree (max_depth), the number of trees (n_estimators), the learning rate (learning_rate), the fraction of samples drawn for each tree (subsample), the minimum sum of leaf node sample weights (min_child_weight), and the fraction of features sampled for each tree (colsample_bytree). Increasing n_estimators may improve the accuracy of XGBoost but also increases the computing time. The max_depth is limited to avoid overfitting, whereas a larger min_child_weight pushes toward underfitting. The subsample and colsample_bytree parameters control row and column sampling, respectively. The learning rate is set to avoid overfitting and to increase the robustness of the model [54]. Therefore, all these parameters should be optimized to achieve the best model performance.

The Bayesian optimization is implemented through a Python package called Hyperopt [55]. The objective function, the search space (space), the search algorithm (algo), and the maximum number of evaluations (max_evals) are the four main objects passed to Hyperopt's fmin routine, which carries out the BOA. In this research, the XGBoost validation error is the objective to minimize, the tree-structured Parzen estimator is the default algo, and max_evals is generally set as 4. For the search space, we set n_estimators ∈ [50, 500], learning_rate ∈ [0.05, 0.1], max_depth ∈ [2, 10], subsample ∈ [0.1, 0.9], colsample_bytree ∈ [0.1, 0.9], and min_child_weight ∈ [2, 12]. In addition, we use 5-fold cross-validation during parameter tuning; the results are shown in Table 4.
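The sketch below shows how this Hyperopt setup might look; the search ranges and the 5-fold cross-validation mirror the text, while the scoring string and the surrounding scaffolding are assumptions (standard scikit-learn/xgboost/hyperopt usage), not the authors' code. X_train and y_train are hypothetical arrays holding one cluster's training data.

```python
# Sketch of the Hyperopt setup: search ranges mirror the text, and the
# objective is the 5-fold cross-validated MAPE of XGBoost.
import xgboost as xgb
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score

def make_objective(X, y):
    def objective(params):
        model = xgb.XGBRegressor(
            n_estimators=int(params["n_estimators"]),
            learning_rate=params["learning_rate"],
            max_depth=int(params["max_depth"]),
            subsample=params["subsample"],
            colsample_bytree=params["colsample_bytree"],
            min_child_weight=int(params["min_child_weight"]),
        )
        score = cross_val_score(model, X, y, cv=5,
                                scoring="neg_mean_absolute_percentage_error")
        return -score.mean()   # Hyperopt minimizes, so return positive MAPE
    return objective

space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 1),
    "learning_rate": hp.uniform("learning_rate", 0.05, 0.1),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "subsample": hp.uniform("subsample", 0.1, 0.9),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.1, 0.9),
    "min_child_weight": hp.quniform("min_child_weight", 2, 12, 1),
}

# best = fmin(make_objective(X_train, y_train), space,
#             algo=tpe.suggest, max_evals=4)   # max_evals per the text
```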

For cluster 1, n_estimators, learning_rate, max_depth, subsample, colsample_bytree, and min_child_weight are set as 140, 0.09, 6, 0.5, 0.7, and 3, respectively. For cluster 2, the best prediction performance of XGBoost is obtained when n_estimators = 100, learning_rate = 0.05, max_depth = 5, subsample = 0.5, colsample_bytree = 0.3, and min_child_weight = 5. With these optimal parameters, the MAPE values of the optimized XGBoost for the two clusters are 0.348 and 0.221, respectively.

4.3. Comparison Analysis

To examine the prediction performance of XGBoost in clearance time prediction, we select several commonly used models for comparison: support vector regression (SVR), random forest (RF), and Adaboost. To ensure a fair comparison, all models share the same testing data and the same parameter-tuning method (BOA). For the SVR model, we select the radial basis function (RBF) as the kernel; its two key parameters, gamma and the penalty C, are set as (0.1, 64) and (0.15, 32) for the two clusters, respectively. For the RF model, the number of trees (n_estimators), the maximum depth of a tree (max_depth), the minimum number of samples for splitting an internal node (min_samples_split), and the minimum number of samples at a leaf node (min_samples_leaf) are the four key parameters; they are set as 195, 8, 11, and 23 in cluster 1 and 100, 13, 18, and 12 in cluster 2. For the Adaboost model, as with the RF model, n_estimators, max_depth, and min_samples_split should be identified; in addition, the learning_rate and the maximum number of features considered in splitting (max_features) also need to be optimized. These parameters of Adaboost are set as 470, 6, 25, 0.05, 7 for cluster 1 and 425, 9, 30, 0.11 for cluster 2. The MAPE values of the four candidates are shown in Table 5, with the smallest value for each cluster marked in bold.
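To make the comparison setup concrete, the sketch below instantiates the four candidates with the cluster 1 parameters listed above, assuming scikit-learn and xgboost; the fit/evaluate loop at the end refers to hypothetical split arrays and the mape helper defined earlier.

```python
# The four candidate models with the cluster 1 parameters from the text
# (scikit-learn + xgboost assumed; fit/evaluate loop uses hypothetical splits).
import xgboost as xgb
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "XGBoost": xgb.XGBRegressor(n_estimators=140, learning_rate=0.09, max_depth=6,
                                subsample=0.5, colsample_bytree=0.7,
                                min_child_weight=3),
    "SVR": SVR(kernel="rbf", gamma=0.1, C=64),
    "RF": RandomForestRegressor(n_estimators=195, max_depth=8,
                                min_samples_split=11, min_samples_leaf=23),
    "Adaboost": AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=6, min_samples_split=25, max_features=7),
        n_estimators=470, learning_rate=0.05),
}

# for name, m in models.items():
#     m.fit(X_train, y_train)
#     print(name, mape(y_test, m.predict(X_test)))
```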

As shown in Table 5, for cluster 1, the MAPE values of XGBoost, SVR, RF, and Adaboost are 0.348, 0.363, 0.357, and 0.383; XGBoost yields the smallest MAPE, showing its superiority in clearance time prediction for cluster 1. For cluster 2, the MAPE values of XGBoost, SVR, RF, and Adaboost are 0.221, 0.253, 0.228, and 0.231; again, XGBoost yields the smallest MAPE (0.221). XGBoost thus outperforms SVR, RF, and Adaboost in both clusters, confirming its superiority in clearance time prediction.

4.4. Importance Evaluation for Explanatory Factors

Different explanatory variables have different effects on the target variable [56, 57]. To investigate the significant factors of clearance time, the relative importance of each explanatory factor is calculated using the XGBoost model with optimal parameters for each cluster. An explanatory factor with higher relative importance has a stronger effect on clearance time [41]. In this study, factors with relative importance greater than 8.0% are defined as significant explanatory factors, factors with relative importance from 2.5% to 8.0% as general factors, and the remaining factors as insignificant factors. The explanatory factors and their importance are shown in Table 6.
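A short sketch of this calculation is given below, assuming the xgboost scikit-learn wrapper; model is a tuned XGBRegressor for one cluster and feature_names lists the Table 1 variables, both hypothetical placeholders here.

```python
# Relative importance as percentages, from a fitted XGBRegressor.
# `model` and `feature_names` are hypothetical placeholders.
import pandas as pd

def relative_importance(model, feature_names):
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return (100 * imp / imp.sum()).sort_values(ascending=False)

# relative_importance(model, feature_names)  # compare against Table 6
```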

For cluster 1, AADT (17.70%), incident type (17.30%), response time (15.10%), and lane closure type (8.00%) are categorized as significant explanatory factors of clearance time. The general factors comprise six explanatory factors: WSP involved (7.60%), month of year (6.10%), traffic control (5.00%), weather (4.70%), day of week (4.60%), and peak hours (3.10%). The remaining factors, HOV (2.50%), time of day (2.10%), heavy truck involved (1.70%), injury involved (1.70%), and work zone involved (0.30%), are regarded as insignificant explanatory variables in cluster 1. For cluster 2, four explanatory factors are significant for clearance time: response time (22.30%), AADT (14.00%), incident type (12.80%), and lane closure type (8.40%). Fire involved (8.40%), weather (6.10%), month of year (6.10%), traffic control (6.10%), injury involved (5.00%), and HOV (2.80%) are the general explanatory factors, while peak hours (2.20%), heavy truck involved (2.20%), WSP involved (1.70%), day of week (1.10%), time of day (0.60%), and work zone involved (0.20%) are categorized as insignificant explanatory factors for incident clearance time.

That is, for both clusters, AADT, incident type, response time, and lane closure type are the significant explanatory factors of clearance time, although the same factor may have different impacts on clearance time in different datasets [58]. In detail, AADT is the greatest contributor to the shorter clearance times of cluster 1 and the second greatest for the longer clearance times of cluster 2, with relative importance of 17.70% and 14.00%, respectively. Generally speaking, AADT characterizes the prevailing traffic [59, 60]; congestion under high AADT may make an incident difficult to clear, leading to longer clearance time. Incident type contributes 17.30% and 12.80% to short and long clearance times, ranking second in cluster 1 and third in cluster 2. As shown in Table 1, the incident type factor covers disabled vehicles, debris, abandoned vehicles, collisions, and others. These incidents may block normal traffic [61, 62], and transportation authorities may adopt a series of strategies to deal with the resulting problems [63, 64]. Interestingly, longer clearance times seem less sensitive to incident type than shorter ones, perhaps because a long clearance time reflects a high-severity crash. With relative importance of 15.10% and 22.30%, response time is the third contributor to the shorter clearance times in cluster 1 and the biggest contributor to the longer clearance times in cluster 2. This shows that longer clearance times are more sensitive to response time than shorter ones, consistent with previous studies [18, 19], which report that each additional minute of response time increases clearance time by about one percent. The lane closure type factor ranks fourth in both clusters; it indicates the severity of an incident by restricting vehicles from entering the incident site [41].

5. Conclusions

In this study, XGBoost is applied to predict the clearance time of freeway incidents and to investigate the significant factors of clearance time, using data collected from the Washington Incident Tracking System in 2011. We first introduce the original data and the proposed method briefly. The original data are clustered with the K-means algorithm to better expose the underlying relationships, and an XGBoost model is built for each cluster. Each cluster is divided into 70% training data and 30% testing data. The training data are used to build XGBoost and to optimize its parameters through BOA with 5-fold cross-validation, and the testing data are used to measure prediction performance, with MAPE as the predictive indicator. To examine the performance of XGBoost, support vector regression (SVR), random forest (RF), and Adaboost are also applied to predict the clearance time. The comparative study shows that XGBoost outperforms the other three models, with the lowest MAPE in both clusters. To identify the significant factors of clearance time, we calculate the relative importance of each explanatory factor and define quantitative thresholds for significant, general, and insignificant explanatory factors. The result is that response time, AADT, incident type, and lane closure type are the significant explanatory factors of clearance time.

It is worth noting that a traffic incident is a time-sequential process [65], and most incident information is acquired gradually during that process [66]. Modeling based only on the information already acquired is a limitation of the proposed method: during the initial stage of an incident, the prediction may be inaccurate because the acquired information is incomplete. Multistage updating of information is therefore a promising direction for future research. In addition, strategies for dealing with the unobserved heterogeneity of dependent variables, especially in the traffic incident field, may be a hot topic, because omitted variables (e.g., driving behavior) may have latent impacts on the target variable.

Data Availability

The traffic incident data used to support the findings of this study are available from the corresponding author and first author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (71701215), Innovation-Driven Project of Central South University (no. 2020CX041), Foundation of Central South University (no. 502045002), Science and Innovation Foundation of the Transportation Department in Hunan Province (no. 201725), and Postdoctoral Science Foundation of China (nos. 2018M630914 and 2019T120716).