Abstract

Short-term traffic prediction is a key component of Intelligent Transportation Systems. It uses historical data to construct models for reliably predicting traffic state at specific locations in road networks in the near future. Despite being a mature field, short-term traffic prediction still poses some open problems related to the choice of optimal data resolution, prediction of nonrecurring congestion, and the modelling of relevant spatiotemporal dependencies. As a step towards addressing these problems, this paper investigates the ability of Artificial Neural Networks, Random Forests, and Support Vector Regression algorithms to reliably model traffic flow at different data resolutions and respond to unexpected traffic incidents. We also explore different feature selection methods to identify and better understand the spatiotemporal attributes that most influence the reliability of these models. Experimental results indicate that data aggregation does not necessarily achieve good performance for multivariate spatiotemporal machine learning models. The models learned using high-resolution 30-second input data outperformed the corresponding baseline ARIMA models by . Furthermore, feature selection based on Recursive Feature Elimination resulted in models that outperformed those based on linear correlation-based feature selection.

1. Introduction

Traffic congestion results in significant monetary losses in countries around the world, with the cost of traffic congestion in 2014 estimated to be billion in the US alone [1]. A significant amount of effort has been put into reducing congestion in cities. In many cities, it is becoming impractical to build new roads or to expand existing roads, and it is becoming all the more important to make the best use of the available resources. Intelligent Transportation Systems, Advanced Traffic Management Systems, and route guidance systems use real-time data of traffic flow gathered from various sensors. In such systems, short-term traffic prediction, which helps make decisions based on predictions of traffic in the near future, is more useful than just using the real-time data of traffic conditions. The field of short-term traffic prediction is over 30 years old with early work utilizing Box-Jenkins ARIMA methods [2]. Recent approaches still use variations of the original ARIMA models, for example, seasonal ARIMA [3, 4], but there has been a shift towards using machine learning algorithms to address the traffic prediction challenges [5]. Although such models based on machine learning algorithms have been shown to be more reliable than the traditional ARIMA models, there are still many open problems [6]. These include building responsive algorithms that are able to predict nonrecurring congestion, determining the optimum data resolution, and identifying and modelling the important spatiotemporal dependencies in traffic data. The study described in this paper is a step towards addressing these challenges.
We make the following key contributions:
(i) Explore the effect of the resolution of multivariate spatiotemporal input data on the accuracy of short-term traffic prediction models; we specifically consider models built using Artificial Neural Networks, Support Vector Regression, and Random Forests.
(ii) Evaluate the responsiveness of these predictive models to nonrecurring congestion events. Specifically, we study the reliability of the predictions provided by these models in the presence of unexpected events such as accidents.
(iii) Identify the spatiotemporal traffic attributes that most influence the performance of these models and their ability to model the complex dependencies in traffic data.

We illustrate these contributions using historical data of volume and occupancy measurements on a highway in Auckland (New Zealand). We first motivate the need for the proposed study by discussing related work in Section 2. Next, Section 3 describes the dataset and methodology used to build and evaluate the predictive models, and Section 4 describes the machine learning algorithms used to build these models. Section 5 describes the hypotheses and measures used for experimental evaluation, and Section 6 analyzes the corresponding experimental results. Finally, Section 7 discusses the conclusions and directions for future work.

2. Background

Many algorithms have been developed for short-term traffic prediction, which is a complex problem influenced by a variety of factors such as the resolution (i.e., the aggregation level) of the input and output data, and spatiotemporal dynamics. We review some of the related work in this section.

Although studies in the existing literature predominantly use data aggregated over 5 min and 15 min intervals, some prior studies have investigated the effect of data resolution on the reliability of the predictions provided by the corresponding models; the results have, however, been inconclusive. For instance, Park et al. [7] investigated the effect of aggregation on travel time prediction and considered aggregation levels from 2 min to 60 min in the context of an ARIMA model. They concluded that higher levels of aggregation were required to forecast route travel time than when forecasting link travel times. Dougherty and Cobbett [8] constructed a neural network model for making predictions and found that data aggregated over 5 min intervals gives better results than data aggregated over 1 min intervals. Vlahogianni and Karlaftis [9] looked at aggregation levels and although they found that temporal aggregation may distort critical traffic flow information, they also concluded that further research was necessary to determine the optimum aggregation level(s).

The use of high-resolution data is challenging for multiple reasons. First, for some statistical models used for short-term traffic state prediction, it is necessary to ensure that the input data and the output data have the same aggregation level, but this constraint can be relaxed when machine learning algorithms are used to build predictive models. Second, although high-resolution data (as expected) includes more accurate measurements (for example, Martin et al. [10] state that inductive loops are “one of the most accurate count and presence detectors”), it also makes the noise in sensor measurements more distinct. Although data from these inductive loops can represent individual vehicles in the network, computational models developed to capture the flow of vehicles between segments or links in the network need to be robust to such noise and be able to capture spatiotemporal dynamics in order to exploit the information encoded in high-resolution data. Studies based on univariate time-series methods often perform aggregation to smooth out the variability in higher-resolution data [9]; however, these data smoothing techniques result in loss of information (and sensitivity) and make it difficult for the corresponding models to capture the spatiotemporal dynamics of traffic flow. In the study reported in this paper, we fixed the resolution of the output data (i.e., for the predictions being made) and examined the effect of different input data aggregation levels on the prediction accuracy.

There has been considerable research on analyzing the effects of spatiotemporal dynamics. For instance, Kamarianakis and Prastacos [11] used a Spatiotemporal Autoregressive Moving Average (STARIMA) model to incorporate data from links upstream to the link of interest in their prediction model, and Chandra and Al-Deek [12] found that vector autoregressive models that incorporate data from links neighbouring the link of interest perform better than ARIMA models that do not consider the data from the neighbouring links. Yang et al. [13] found that a sparse selection of neighbours chosen based on the level of correlation with the link of interest improves performance. Min and Wynter [14] showed that a multivariate spatiotemporal model with templates was able to provide very good prediction accuracy. However, these models depend on fixed correlation matrices that are modified infrequently. As a result, it is difficult for these models to track changes or to capture sudden (or significant) changes between congested and free-flowing traffic conditions.

In addition to the approaches that build on the ARIMA models [2–4, 11, 14], models based on machine learning and probabilistic estimation algorithms have also been explored because they are well-suited to model the complex spatiotemporal relationships in data. Popular approaches include Artificial Neural Networks (ANN) [15–19], Support Vector Machines (SVM) [20–24], k-Nearest Neighbours (kNN) [25–29], Kalman Filters [30–32], Bayesian Networks [33–35], and Random Forests [36, 37]. For instance, existing work has explored various ANN configurations. Wang et al. [19] developed a space-time delay neural network (STDNN) that included 22 links in central London and showed that this model outperforms a STARIMA model. Hodge et al. [38] used a binary neural network that incorporates spatiotemporal data for traffic prediction. Vlahogianni et al. [18] used a neural network model optimized with genetic algorithms and found that incorporating spatial and temporal data was helpful for multistep predictions. More recently, there have been efforts to use deep neural network architectures, including deep belief networks [39, 40] and stacked autoencoders [41].

There is no agreement in the literature regarding the number of upstream and downstream links (neighbouring any link of interest) that should be considered while building the predictive models. While some algorithms consider just one upstream or downstream link [24, 29], others consider a variable number of upstream and downstream links [38]. For an extensive review of spatiotemporal forecasting, please see Ermagun and Levinson [42]. As noted in Vlahogianni et al. [6], capturing spatial attributes in traffic data from a freeway is still an open problem.

Most existing work on short-term traffic prediction focuses on typical conditions [21]. Traffic is (on average) inherently periodic with daily or weekly patterns, and many studies exploit this periodicity in their algorithms. However, accurate predictions are arguably more useful in situations of nonrecurring congestion such as accidents where periodic patterns do not hold. Of the studies that do not leave out nonrecurring congestion in their input data, a common approach is to create multiple models to deal with different conditions. For example, Dunne and Ghosh [43] used a model with nonlinear preprocessing in cases of congestion. Fusco et al. [44] reported good performance during nonrecurring congestion with a SARMA model, while a Bayesian Network performed better during recurring congestion. An online-SVR-based model was found to predict nonrecurring congestion accurately by Castro-Neto et al. [21]. Pan et al. [45] also highlight some of the challenges in capturing moving bottlenecks and nonrecurring congestion. See Vlahogianni et al. [6], Ermagun and Levinson [42], Oh et al. [46], and Oh et al. [47] for a more comprehensive overview of the existing literature in short-term traffic prediction.

In this study, we explore three machine learning algorithms that have demonstrated the ability to incorporate spatiotemporal data in predictive models built for intelligent transportation and other applications. Specifically, we explore (1) Artificial Neural Networks (ANN), (2) Support Vector Regression (SVR), and (3) Random Forests (RF). We chose ANN and SVR because they are the machine learning algorithms most widely used in the literature to build predictive models. We chose Random Forests because it is an ensemble learning algorithm that requires only a small number of parameters to be tuned. Please note that the primary objective of our study was not to introduce new algorithms. Instead, we make three key contributions. First, we examine how the predictive accuracy of models based on these algorithms changes as a function of the aggregation level of the input data. Second, we explore the ability of these models to respond accurately to nonrecurring congestion conditions. Third, we identify the spatiotemporal attributes that most influence the predictive accuracy of these models and their ability to model the complex dependencies in traffic data.

3. Methodology

This section introduces the study area and data and provides a mathematical formulation of the short-term traffic prediction problem (Section 3.1). This is followed by a description of the data preprocessing steps used in the proposed study (Section 3.2).

3.1. Study Area and Mathematical Formulation

This study was carried out in a section of State Highway 1 (SH1) in Auckland, New Zealand. We considered data from 45 segments along SH1 from the suburb of Papakura towards Auckland City (see Figure 1). On average, there are three lanes of roadway in each direction, and we only considered lanes going northbound in this study. The average length of a segment was , with the length varying between and .

Traffic can be measured in different ways. The most common sensor used to collect traffic data is the Inductive Loop Detector, which comes in different forms. Dual loop detectors, which have two inductive loops placed a short distance apart, are able to accurately capture the speed of a vehicle going over them, the volume (i.e., count of vehicles passing the detector), and occupancy (i.e., the amount of time a vehicle was over the detector). However, most of the loops in many cities (including Auckland) are single loop detectors, which can measure volume and occupancy but can only estimate vehicle speed as a function of these measured values and the average effective vehicle length. Research shows that measuring speed with a constant effective vehicle length can lead to errors of up to [48]. Using these derived speed estimates for making decisions can lead to misleading results—we thus did not use speed data in this study.

The fundamental model of traffic flow established by traffic engineers considers the relationship between three key traffic variables: (1) flow (volume), (2) density, and (3) speed. Since density is difficult to measure directly, occupancy is frequently used as a substitute [49]. It is not possible to accurately and comprehensively describe the current state of traffic using only information about flow. For example, if 200 vehicles pass over a detector during a interval, this could correspond to free-flow conditions during early mornings and evenings, but it could also correspond to highly congested conditions due to an accident during peak hours. The combination of both volume and occupancy uniquely defines the current state of traffic. Unlike many existing studies that have only considered flow when making predictions, which does not define the traffic state uniquely, we consider both volume and occupancy because they each provide useful information. Together they help eliminate ambiguities, such as those described above.

For each predictive model, the input vector is of the form:

$$\mathbf{x}_t = \left[ v_{1,t}, o_{1,t}, \ldots, v_{S,t}, o_{S,t}, \; \ldots, \; v_{1,t-T+1}, o_{1,t-T+1}, \ldots, v_{S,t-T+1}, o_{S,t-T+1} \right],$$

where $v_{s,t}$ and $o_{s,t}$ denote volume and occupancy (respectively) of segment $s$ at time-step $t$, $S$ is the total number of segments, and $T$ is the total number of historical time-steps considered. The output of each such model is the volume or occupancy aggregated over the subsequent five-minute interval for each specific segment of interest. This output is a function of the input vector; for example, if traffic volume is to be predicted for segment $s$, the output of the models is $\hat{v}_s = f(\mathbf{x}_t)$. The goal of each machine learning algorithm is to build a model of this functional relationship between the inputs and outputs. The learned model can then be used to predict the output for any given input.
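As a rough illustration of how such an input vector could be assembled, the sketch below flattens the most recent readings of every segment into a single list. The function name, the nested-list data layout, and the ordering of attributes (oldest time-step first, volume before occupancy) are assumptions made for this example and are not taken from the study.

```python
def build_input_vector(volume, occupancy, t, history):
    """Flatten the last `history` readings of every segment into one vector.

    volume[s][k] and occupancy[s][k] hold the reading of segment s at
    time-step k (hypothetical layout chosen for this sketch).
    """
    features = []
    for k in range(t - history + 1, t + 1):   # oldest to newest time-step
        for s in range(len(volume)):          # every segment on the road
            features.append(volume[s][k])
            features.append(occupancy[s][k])
    return features


# Toy data: 45 segments and 25 time-steps of made-up readings.
num_segments, num_steps = 45, 25
volume = [[float(s + k) for k in range(num_steps)] for s in range(num_segments)]
occupancy = [[0.01 * k for k in range(num_steps)] for s in range(num_segments)]

x = build_input_vector(volume, occupancy, t=24, history=20)
print(len(x))  # 45 segments * 2 variables * 20 time-steps
```

With 45 segments, two variables, and 20 historical time-steps, the vector has 1800 entries, which matches the attribute count reported in Section 3.2.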

3.2. Data Processing

Data from 30 days of April 2016 was collected for 45 segments on the motorway. In order to get segment level data from loop detectors, individual values were aggregated across the lanes (volume data was summed, and occupancy was averaged) for each segment and at each point in time. We use the volume and occupancy values of all 45 segments in the past 20 time-steps, resulting in an input vector with 45 × 2 × 20 = 1800 attributes. To ensure that each segment has data from a reasonable number of upstream and downstream segments, predictions are only made for segments on the motorway (see Figure 1). Recall that volume and occupancy readings were reported every 30 seconds, which corresponds to 86400 time-steps over the 30 days. A naive aggregation would have resulted in smaller datasets of 8640 samples and 2880 samples for 5 min and 15 min aggregation, respectively. To minimize the imbalance in the size of the datasets, a sliding window approach was used, resulting in a new sample being generated every 30 seconds for all the aggregation levels. The final size of the input dataset, with 20 time-steps included in each input sample, was thus 86370 samples for 30 s resolution, 86190 for 5 min, and 85790 for 15 min aggregation. Also, to ensure a fair comparison, the output is aggregated over the same time period for each model for all input time resolutions; that is, the amount of time represented in the input depends on the resolution of the data, whereas in the output, all models consider the aggregated values over the interval from when the final input reading was taken to five minutes past this time.
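The dataset sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that each sample spans 20 input time-steps at the chosen aggregation level plus a five-minute output window (ten 30-second readings), and that the sample count equals the number of raw readings minus this span. The function and variable names are illustrative only.

```python
def num_sliding_samples(total_steps, steps_per_bin, history_bins=20, output_steps=10):
    """Samples produced by a sliding window that advances one raw reading at a time.

    steps_per_bin: raw 30-second readings per aggregated input time-step.
    output_steps:  raw readings in the five-minute output window.
    """
    span = history_bins * steps_per_bin + output_steps
    return total_steps - span


total_steps = 30 * 24 * 60 * 2          # 30 days of 30-second readings
print(total_steps)                      # 86400

# Naive (non-overlapping) aggregation shrinks the dataset considerably.
print(total_steps // 10, total_steps // 30)   # 5 min and 15 min bins

# Sliding windows keep the three datasets close in size.
for label, steps_per_bin in [("30 s", 1), ("5 min", 10), ("15 min", 30)]:
    print(label, num_sliding_samples(total_steps, steps_per_bin))
```

Under these assumptions the three counts come out as 86370, 86190, and 85790, in agreement with the figures given in the text.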

The dataset was preprocessed to remove some extreme values that were highly unlikely. First, we used winsorization [50] to set the upper bound of the values in the dataset. Winsorization, a common approach for dealing with outliers, replaces all values above and below a certain percentile with the value of that percentile. In this paper, we set the upper percentile to so that all values above this percentile are replaced by the value of this percentile. If a standard normal distribution is assumed, this choice of upper bound corresponds to clipping values that are standard deviations from the mean. Figure 2 shows volume values from segment 23 before and after winsorization.
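A minimal sketch of upper-bound winsorization is given below. It uses a nearest-rank percentile, which is one of several possible percentile definitions; the function name and the 90th-percentile setting in the example are placeholders chosen for illustration and are not the values used in the study.

```python
import math


def winsorize_upper(values, upper_percentile):
    """Replace every value above the given percentile with that percentile's value."""
    ordered = sorted(values)
    # Nearest-rank percentile: smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(upper_percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Made-up volume readings with one implausible spike.
readings = [12, 15, 14, 13, 16, 15, 14, 13, 12, 250]
clipped = winsorize_upper(readings, upper_percentile=90)
print(clipped)  # the spike of 250 is replaced by the cap value
```

Only the upper tail is clipped here, mirroring the description above; a two-sided variant would apply the same idea with `max` and a lower cap.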

Second, we scaled each attribute in the input data to lie ; this scaling was especially crucial for producing stable results with Support Vector Regression and Artificial Neural Networks. Scaling was performed using the training data, and the corresponding scaling constants were applied to the test data. The occupancy values always stayed between and in the input and output, and no additional processing was needed to constrain the data to this range. Nonstationary time-series data is typically transformed into stationary data before applying time-series models. However, traffic data is considered to be cyclostationary and we model short-term traffic prediction as a multivariate pattern recognition problem with all data assumed to arise from the same underlying distribution. Thus, we did not perform any transformations to make the data stationary. Also, although the periodic nature of traffic can be exploited to improve the prediction accuracy of the learned models, doing so will make it difficult to reliably and efficiently identify and respond to nonrecurring congestion conditions (also see Section 4.2).
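The scaling step, with constants estimated on the training data and reused on the test data, could look like the following sketch. Min-max scaling to the unit interval is an assumption made for this example, as are the function names and the sample values.

```python
def fit_min_max(train_column):
    """Estimate the scaling constants from the training data only."""
    return min(train_column), max(train_column)


def apply_min_max(column, low, high):
    """Scale a column with previously fitted constants."""
    width = high - low
    if width == 0:
        return [0.0 for _ in column]   # constant attribute: nothing to scale
    return [(v - low) / width for v in column]


train_volume = [10.0, 20.0, 30.0, 50.0]   # made-up training readings
test_volume = [20.0, 60.0]                # made-up test readings

low, high = fit_min_max(train_volume)
scaled_train = apply_min_max(train_volume, low, high)
scaled_test = apply_min_max(test_volume, low, high)
print(scaled_train)
print(scaled_test)
```

Because the constants come from the training data alone, a test value larger than anything seen during training maps slightly outside the training range, which is the expected behaviour when no information from the test set is allowed to leak into preprocessing.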

Training of the models was accomplished using data from the first 20 days (57,600 samples), and data corresponding to the remaining ten days was used for testing. The parameters of each model were tuned using the training dataset. Next, we briefly discuss the algorithms that we used to build the models for short-term traffic prediction.

4. Machine Learning Algorithms

In this section, we describe the three machine learning algorithms used to build the predictive models explored in this paper: Artificial Neural Networks (Section 4.1), Support Vector Regression (Section 4.2), and Random Forests (Section 4.3).

4.1. Artificial Neural Network

Feedforward neural networks or multilayer perceptrons are the most common Artificial Neural Network (ANN) models. A neural network is composed of neurons arranged in layers with each layer containing one or more neurons. Each neuron is connected to all the neurons in its adjacent layers, and neurons within a layer are not connected. Each neuron takes a linear weighted sum of all its inputs $x_i$ (from the layer before it), with weights $w_i$, and passes it through a nonlinear activation function $\phi$ to produce the output $y$:

$$y = \phi\left(\sum_{i} w_i x_i\right).$$

Each such output is then used as an input to the next layer of neurons until the final (i.e., output) layer is reached. The weights associated with each neuron may be initialized randomly to enable each neuron to potentially learn a different function of its inputs.

The weights associated with each neuron are the parameters defining the neural network model, and these parameters are estimated by minimizing a loss function that measures the difference between the output values estimated by the network and the ground-truth values included in the training data. For regression problems, the squared error between the estimated and ground-truth output values is generally used as the loss function. The backpropagation algorithm is then used to calculate the gradient of this error and to propagate this gradient back through the network (towards the input layer) to update the weights of each neuron by gradient descent. Stochastic gradient descent algorithms are used widely to update the weights, and we used a stochastic gradient-based optimizer called Adam that is computationally efficient and is known to scale well to larger datasets [51]. All parameters of this optimizer were set to their default values.

Although the nonlinear activation function in a neural network has traditionally been the sigmoid function, empirical results have indicated that the rectified linear unit (ReLU) activation function improves the ability to model complex relationships and reduces the time taken to train the model [52]. We thus used the ReLU activation function in a network with three hidden layers, each with 150 neurons. We performed 400 iterations of learning with minibatches of data with 200 samples (each).
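To make the layer structure concrete, the sketch below runs a forward pass through a network with three hidden layers of 150 ReLU neurons and a single linear output. It is only an illustration: the weights are random and untrained, a bias term is included by assumption, the input has 8 attributes rather than 1800 to keep the example fast, and the training procedure (backpropagation with Adam) is not shown.

```python
import random


def relu(z):
    """Rectified linear unit activation."""
    return z if z > 0.0 else 0.0


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: weighted sum per neuron, then the activation."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random initial weights so that neurons can learn different functions."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """ReLU on the hidden layers, linear output for regression."""
    h = x
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(h, weights, biases)[0]


rng = random.Random(0)
sizes = [8, 150, 150, 150, 1]   # toy input size; hidden layout follows the text
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
y_hat = forward([0.5] * 8, layers)
print(y_hat)   # an untrained, and therefore meaningless, prediction
```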

4.2. Support Vector Regression

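For classification problems, a Support Vector Machine computes a decision boundary that maximizes the margin between this boundary and the closest data sample. Support Vector Regression (SVR) uses a similar approach for regression problems: errors corresponding to estimated values within an $\epsilon$ distance from the ground-truth values are ignored. More specifically, given a set of training data $\{(x_i, y_i)\}_{i=1}^{n}$, the objective is to find a function $f(x)$ that produces at most $\epsilon$ deviation from the actual target values $y_i$ for the training data and is as flat as possible [53]. For instance, a linear function $f(x) = \langle w, x \rangle + b$ is flat if it has a small $w$; this can be accomplished by minimizing $\lVert w \rVert^2$. Since a function that satisfies all the required constraints may not exist, some slack variables $\xi_i, \xi_i^*$ are introduced to allow for some errors. We then obtain the following formulation for SVR:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \\ \text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \epsilon + \xi_i, \\ & \langle w, x_i \rangle + b - y_i \le \epsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \end{aligned}$$

where the constant $C > 0$ determines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\epsilon$ are tolerated.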

We can also incorporate nonlinear kernel functions to extend SVR to nonlinear problems. Popular kernels include linear kernel and the Radial Basis Function (RBF) kernel, which transform the input sample into a higher dimensional space that results in better separation (for classification) or estimation of values (for regression). We experimentally chose to use a linear kernel for SVR because it provided better results.
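The idea that errors inside a tolerance band are ignored can be expressed as a per-sample loss, often called the epsilon-insensitive loss. The short sketch below evaluates it on made-up numbers; the function name and the values are illustrative only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the tolerance band, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


y_true = [10.0, 10.0, 10.0]
y_pred = [10.5, 12.0, 7.0]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.0)
print(losses)  # the first prediction lies inside the band and costs nothing
```

The nonzero entries correspond to the slack that a fitted SVR model would need for those samples.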

4.3. Random Forest

Random Forest (RF) [54] is an ensemble method for building classification or regression models. Ensemble methods combine predictions from multiple models to improve accuracy. In an RF, the ensemble is a set of decision trees trained on subsets of the full dataset. Each subset is selected by a technique known as bagging or bootstrap aggregation. If the training set is defined as input vectors $X = \{x_1, \ldots, x_N\}$ and the corresponding (target) output values $Y = \{y_1, \ldots, y_N\}$, $B$ decision trees will be created as follows:

for $b$ in 1…$B$ do
    Pick $N$ training samples randomly with replacement; call this subset $(X_b, Y_b)$
    Train a decision tree $f_b$ using $(X_b, Y_b)$, where each split in a decision tree is based on a random subset of the attributes
end for

In other words, each subset created by sampling from the training set with replacement results in a decision tree. The prediction for any test input $x'$ is then the average of the predictions from each decision tree:

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x').$$

This approach ensures that individual trees are not highly correlated because of a small number of strong predictors. RF methods are popular because they provide some robustness to noisy data with outliers. They are also able to focus on attributes most useful to the regression or classification task under consideration and ignore attributes that are less relevant. In our study, we used an RF with 100 trees.
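The bagging and averaging steps can be sketched in plain Python. To keep the example short, each "tree" below is a depth-1 regression stump rather than a full decision tree, so this is a simplified stand-in for a Random Forest and not the implementation used in the study. All function names and the toy data are assumptions made for illustration.

```python
import random


def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree, searching only the given attributes."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:                       # no valid split: predict the mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_forest(X, y, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]                 # bootstrap sample
        feature_ids = rng.sample(range(len(X[0])), n_features)   # random attribute subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature_ids))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Toy data: attribute 0 determines the target, attribute 1 is noise.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
forest = fit_forest(X, y, n_trees=100, n_features=1)
low = predict_forest(forest, [2, 0])
high = predict_forest(forest, [17, 0])
print(low, high)
```

Stumps that happen to draw the noisy attribute contribute little, while those that draw the informative attribute separate the two groups, so the averaged prediction is lower for the first query than for the second.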

5. Hypotheses and Measures

We experimentally evaluated the following hypotheses regarding the predictive models learned using the machine learning algorithms:
(1) The learned models are able to disregard the amplification of noise and variations in high-resolution data and provide higher accuracy than models that do not use high-resolution data.
(2) The learned models are responsive to nonrecurring congestion events such as accidents, and this ability improves with the increase in the resolution of data.
(3) The learned models are able to capture the complex spatiotemporal evolution of traffic by assigning higher importance to volume and occupancy attributes extracted from segments near the segment of interest.

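As baselines for comparison, wherever appropriate, we used two established methods for volume prediction in existing literature (ARIMA, historical average). To experimentally evaluate the hypotheses, we used three measures: accuracy (expressed as a percentage), root mean square error (RMSE), and mean absolute error (MAE). RMSE and MAE are defined as follows:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}, \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the ground-truth value of the $i$th data sample, and $n$ is the number of samples.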
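For reference, the two error measures can be computed as in the sketch below, which uses the standard definitions of RMSE and MAE. The function names and the sample values are illustrative and do not come from the study's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between ground-truth and predicted values."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and predicted values."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 200.0, 300.0, 400.0]   # made-up five-minute volumes
y_pred = [103.0, 196.0, 300.0, 405.0]   # made-up predictions
print(mae(y_true, y_pred))
print(rmse(y_true, y_pred))
```

RMSE weights large errors more heavily than MAE, so the two values diverge when a few predictions are far off.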

To quantify responsiveness to nonrecurring conditions, we computed these measures over samples that were representative of nonrecurring conditions. Specifically, a sample was considered if its output value $y_i$ was more than two standard deviations away from the weekly seasonal mean of the predicted variable:

$$\left| y_i - \mu \right| > 2\sigma,$$

where $\sigma$ is the standard deviation and $\mu$ is the mean of the values of the predicted variable during the corresponding time period for that day of the week.
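The selection rule for nonrecurring samples can be sketched as a simple threshold test. In the example below, the seasonal mean and standard deviation are estimated from a handful of made-up readings for one time slot on one day of the week; the use of the population standard deviation and all names are assumptions made for this illustration.

```python
import statistics


def is_nonrecurring(value, seasonal_mean, seasonal_std, k=2.0):
    """True if the value lies more than k standard deviations from the seasonal mean."""
    return abs(value - seasonal_mean) > k * seasonal_std


# Made-up volumes observed in the same time slot on four earlier Thursdays.
history = [400.0, 420.0, 410.0, 390.0]
mu = statistics.mean(history)
sigma = statistics.pstdev(history)

print(is_nonrecurring(415.0, mu, sigma))  # an ordinary reading
print(is_nonrecurring(150.0, mu, sigma))  # a sharp drop, e.g. during an incident
```

Only samples for which this test returns true would be kept when computing the error measures for nonrecurring conditions.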

6. Experimental Results

This section discusses the results of experimentally evaluating the three hypotheses listed in Section 5. We summarize the results in Sections 6.1, 6.2, and 6.4 and examine the computational efficiency of the proposed models in Section 6.3. Unlike results reported in many papers, our predictive models considered different traffic conditions such as peak and off-peak traffic at different times of the week, including weekends and public holidays. Recall that we explore different aggregation levels ranging from 30 s to 15 min for the input data, but the output of each model is the volume or occupancy of vehicles (in a particular segment in the highway) aggregated over a period of five minutes; see Section 3.1 for more details.

6.1. Using High-Resolution Data

As stated in Section 3.1, the predictive models were constructed using the training set and evaluated on the test set. We repeated the trials to check that the performance of the models was stable using different random initializations.

The results summarized in Table 1 show that all three machine learning algorithms performed better with the 30 s aggregation level for input data in comparison with the 5 min and 15 min aggregation levels. While the increase in prediction accuracy with resolution may not be surprising, it is important to note that the increase in resolution also amplifies the noise and minor variations in the data. As baselines for comparison, we considered two established methods for volume prediction in the existing literature (ARIMA, historical average). For the ARIMA models, we applied a square-root transformation in addition to the first-order difference and verified their stationarity. To compare the outputs from these methods with the outputs from the learned models, we evaluated all models at the same output resolution of 5 min. For instance, for the 30 s aggregation level, the aggregated output value was obtained by iterating and aggregating the output over ten one-step-ahead predictions. Also, results for the 15 min input aggregation level were obtained by first applying the Stram-Wei temporal disaggregation [55] to extract 5 min aggregated values from the 15 min aggregated data. ARIMA (2, 1, 2) models were used for predicting volume at the and input aggregation levels, ARIMA (2, 1, 1) models were used for predicting occupancy at the and aggregation levels, and ARIMA (4, 1, 0) models were used for the input aggregation level. These models were selected experimentally using the Box-Jenkins method.

The results in Table 1 indicate that the models corresponding to the input aggregation level provide an average accuracy improvement of over the ARIMA approach and an average improvement over the historical average baseline. Note that these results include both recurring and nonrecurring congestion events; we examine the nonrecurring events in more detail in Section 6.2. To confirm the significance of these results, we conducted Diebold–Mariano (DM) tests for predictive accuracy [56]. The DM test compares the forecast accuracy of a pair of forecast methods. The test’s null hypothesis is that the two forecasts have the same accuracy. The null hypothesis will be rejected if the computed DM statistic falls outside the required significance level under a standard normal distribution; for example, for a significance of , the null hypothesis is rejected if the DM statistic . We used MSE as the error metric. Table 2 shows the DM test statistic for each pair of models. Except for the SVR and RF models, all other models have significantly different levels of accuracy.
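A simplified version of the DM statistic with squared-error loss is sketched below. It assumes one-step-ahead forecasts, so that the variance of the loss differential can be estimated without autocovariance terms; forecasts over longer horizons would need those additional terms. The function name and the error values are made up for illustration.

```python
import math


def dm_statistic(errors_a, errors_b):
    """Diebold-Mariano statistic for two forecast error series (squared-error loss).

    Simplification: no autocovariance correction, i.e. one-step-ahead forecasts.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]  # loss differential
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    if var_d == 0:
        raise ValueError("loss differential has zero variance")
    return mean_d / math.sqrt(var_d / n)


errors_a = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]   # made-up errors of model A
errors_b = [2.0, 3.0, 3.0, 4.0, 2.0, 3.0]   # made-up errors of model B
dm = dm_statistic(errors_a, errors_b)
print(dm)  # negative: model A has the smaller squared errors
```

The statistic is compared against standard normal critical values; a large negative value favours the first model and a large positive value favours the second.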

Table 3, which summarizes the results of predicting occupancy, indicates similar trends. Although all three predictive models based on machine learning algorithms performed well, the model based on the Random Forest algorithm (Section 4.3) provided the highest accuracy.

Next, the average accuracy and MAE at different times of the day, for the three different data aggregation levels, are shown in Figure 3. For each algorithm, the accuracy increases with the resolution. Overall, we observe that the performance of the learned predictive models improves significantly with the increase in resolution despite the associated amplification of noise and minor variations in data.

The results discussed so far support the first hypothesis that predictive models based on machine learning algorithms are able to disregard the amplification of noise in high-resolution data and provide higher accuracy than models that do not use the high-resolution data. The lower accuracy values during overnight hours can be explained by the accuracy being represented as a percentage of vehicles and the average number of vehicles overnight being significantly lower; this is confirmed by the lower MAE values for the same period.

6.2. Nonrecurring Congestion

Next, we evaluated the second hypothesis by examining the responsiveness of the predictive models to nonrecurring congestion events. We did so by only evaluating the trained predictive models on a subset of the test set comprising samples that were significantly different from historical average values. The results are summarized in Tables 4 and 5. We observe that the models built using input data at the 30 s aggregation level outperform the models that use input data at the 5 min and 15 min aggregation levels. Among the learned models, the model based on the ANN algorithm provides marginally better performance than that based on the RF algorithm for volume predictions while the converse is true for occupancy predictions. Furthermore, we observe that the learned predictive models provide better performance than the models based on historical average and ARIMA, which are established methods for short-term traffic prediction.

To further explore the responsiveness of the learned models, we examined a known (i.e., reported) breakdown along the motorway in more detail. Figure 4(a) compares the average volume of traffic on segment 23 of SH1 on Thursdays with the traffic volume on a specific Thursday, April 21, 2016. The data corresponding to this date was in the test dataset, that is, it was not used to train the predictive models. Figure 4(a) shows that there was a significant deviation from the average traffic around 6:40 am on April 21, 2016. As reported on the social media site Twitter, there was a breakdown near SH1 on that day (see Figure 4(b)). More specifically, the Ellerslie on-ramp mentioned in the tweet is near segment 27 of SH1, which is a few segments away from segment 23 on SH1.

Figures 5(a)–5(c) show how the learned predictive models are able to track the traffic volume corresponding to this event, with each of the three different input data aggregation levels. For comparison, the figures also include the performance of the ARIMA approach. We observe in Figure 5(a) that using the high-resolution input data aggregation level enabled the learned models to predict the change in traffic volume at almost the same time-step when the nonrecurring event occurred, whereas there is a lag when the other two aggregation levels are used; the performance is significantly worse with the baseline ARIMA model.

For additional examples of how the models predicted during nonrecurring congestion, see Figure 6. These plots indicate that the ANN model at the highest-resolution input aggregation level responds very quickly to nonrecurring congestion. The SVR-based models and the coarser-resolution models tend to smooth out shocks to traffic and are better at smoothing out the noise in typical congestion conditions. The RF-based learned models tend to provide good overall performance that lies between that provided by the ANN-based and SVR-based models.

Figure 7 shows that an ANN-based learned model accurately predicts traffic volume on a public holiday. Recall that this model had no information about the day of the week and the seasonal mean. Overall, these results support the second hypothesis that the models based on machine learning algorithms and high-resolution data are more responsive to nonrecurring congestion.

6.3. Computational Efficiency and Practical Scalability

Table 6 summarizes the training time and testing time of the proposed models when they are built and evaluated on an Intel Core desktop computer. The time taken to generate a forecast was under 0.1 seconds for all models. The training time, even in the most extreme case, was under 20 minutes. Since the training process can easily be parallelized to create models for all segments on a network, and this can be done in an initial offline phase, we believe these methods can be easily implemented for forecasts over the entire traffic network.

We did not optimize our algorithms. Performance could have been improved by using fewer training samples or by tuning the algorithms' parameters, for example, by using a smaller number of trees in the Random Forest or a smaller neural network. The different algorithms take different amounts of time for training and testing. For example, models based on the (linear) SVR algorithm have the lowest training time and testing time. The nonlinear SVR models have a much longer training time (on the order of one hour for one model), but they did not perform as well as the linear model. The ANN-based models take longer to train but are fast during testing, whereas the RF-based ensemble models take longer to both train and test.

Overall, we believe that models based on these machine learning methods will scale to large road networks. The retraining of the models can be undertaken as new data comes in over several weeks or months, enabling the system to adapt to changes in the road network.

6.4. Attribute Selection

Next, we evaluate the third hypothesis regarding the ability to model the complex spatiotemporal evolution of traffic. To do so, we first identify the attributes that most influence the performance of the learned predictive models.

One common approach for identifying informative attributes is to compute the Pearson correlation coefficient between the target variable and each of the input attributes [42]. However, the Pearson correlation coefficient is not able to capture nonlinear relationships that may exist between the input attributes and the target variable. We therefore used the Recursive Feature Elimination (RFE) approach to select the most relevant (i.e., informative) attributes [58, 59]. RFE works by iteratively considering an increasingly smaller subset of attributes, dropping (in each iteration) the attributes considered to be the least relevant. In each iteration, we removed the 10 attributes ranked the lowest in terms of importance.
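The elimination loop can be sketched as follows. This is a simplified illustration rather than the implementation used in the study. It assumes NumPy is available, uses synthetic data, and ranks attributes by the absolute coefficients of a least-squares fit as a stand-in for the importance measures of the RF, ANN, and SVR models. The function names `rfe_ranking` and `linear_importance` and the parameter `n_keep` are hypothetical.

```python
import numpy as np


def rfe_ranking(X, y, importance_fn, step=10, n_keep=10):
    """Recursive Feature Elimination.

    Repeatedly scores the surviving attributes with importance_fn and drops
    the `step` lowest-scoring ones until only n_keep attributes remain.
    Returns the surviving attribute indices and the elimination order.
    """
    remaining = np.arange(X.shape[1])
    eliminated = []
    while remaining.size > n_keep:
        scores = importance_fn(X[:, remaining], y)
        n_drop = min(step, remaining.size - n_keep)
        drop = np.argsort(scores)[:n_drop]  # positions of the lowest scores
        eliminated.extend(remaining[drop].tolist())
        remaining = np.delete(remaining, drop)
    return remaining, eliminated


def linear_importance(X, y):
    """Stand-in importance: |coefficient| of a least-squares fit on
    standardized attributes (an assumption made for this sketch)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
    return np.abs(coef)


# Synthetic data: 60 attributes, of which only 3 and 17 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.1 * rng.normal(size=500)

kept, eliminated = rfe_ranking(X, y, linear_importance, step=10, n_keep=10)
```

Because the model is refitted after every round of elimination, the importance of an attribute is always judged relative to the attributes that are still present, which is what distinguishes RFE from a single-pass ranking.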

There are different ways to characterize the importance of attributes in RF-based models. Since an RF is a collection of decision trees, the Gini importance of each attribute in all decision trees can be averaged, for instance, to arrive at the importance of the attribute. In the case of an ANN, the weights of the first layer of the ANN-based model can provide insight into the attributes that contributed significantly to making the predictions. In a similar manner, the weights assigned to each attribute of a linear SVM can be used to identify the relative importance of the attributes [60].

Figures 8(a), 9, and 10 visualize the relative ranking of each of the 1800 input attributes considered by the models for traffic prediction at a particular segment (segment 23 in these figures). The darker shades represent the more informative attributes. For each figure, the plot on the left visualizes the volume attributes and the plot on the right visualizes the occupancy attributes. In each of these plots, the columns going from left to right along the x-axis represent the segments in spatial order along the motorway from the south to the north. Along the y-axis, the first row is the most recent time-step, and the top row is the oldest time-step. For example, at the 30-second input data aggregation level, row 20 corresponds to the data from 10 minutes before the current time-step. Overall, we observed that all three models provide a higher rank to neighbouring segments over a few time-steps.

A more careful examination of the results indicated that the predictive models based on SVR and RF assign higher importance to volume attributes than to occupancy attributes when making decisions. Also, the same set of attributes does not contribute significantly to the performance of all three models. For all three models, the attributes that are considered important change when the resolution of the input data changes. For instance, for the models based on the 30-second aggregation level (i.e., the highest resolution), the set of attributes considered to be important for decision-making mostly included values (of volume and occupancy) from nearby spatial locations and time-steps. The number of attributes corresponding to nearby downstream segments is high for the higher-resolution models, especially when predicting nonrecurring congestion events. For the models based on the two coarser aggregation levels, on the other hand, the set of attributes considered to be important also included values from more distant segments. These results add to the current knowledge about representing information for short-term traffic prediction. For instance, some recent research found that having more than one time-step of data from neighbouring locations only provides minor improvements in performance [13]. Our results, on the other hand, indicate that volume and occupancy values from multiple neighbouring locations and time-steps may be important for accurate prediction of traffic, depending on the resolution of the input data.

To further analyze the importance of the attributes, we considered the relative importance of different subsets of these ranked attributes. We observed that the performance, specifically accuracy, flattens out once a sufficient number of the top-ranked attributes are included. Figure 11 shows the performance of the three models at one aggregation level as a function of the number of attributes considered, with the attributes in decreasing order of importance. A similar result was observed for the other two aggregation levels.

Finally, we compared the performance of the RFE approach for ranking attributes with the more common correlation-based approach and an approach that chose important attributes randomly; we considered the performance of the corresponding models under normal conditions and in the presence of nonrecurring congestion events. Tables 7 and 8 as well as Figures 12 and 13 indicate that the RFE approach outperforms the other two approaches for ranking attributes. In fact, in the case of nonrecurring congestion, the prediction accuracy using correlation-based attribute selection is similar to that with a random selection of the important attributes. One explanation for the poor performance of correlation-based feature selection is that the features most likely to be highly correlated with the output correspond to the road segments closest to the segment under consideration. However, in most cases, these features give redundant information. Segments further away may contain information about situations, such as queues building up or a spike in traffic, that are not necessarily correlated with the output but are quite informative for predictions. RFE provides an opportunity to identify these dependencies, and the experimental results show that it is a much better choice for accurate traffic prediction, especially with nonrecurring congestion events. The experimental results also support the hypothesis that the predictive models based on the machine learning algorithms capture the complex spatiotemporal evolution of traffic by assigning higher importance to the attributes that are more relevant to the prediction task.
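A small worked example shows why a low linear correlation does not imply that an attribute is uninformative. The sketch below uses synthetic values rather than traffic data: the second target is fully determined by the attribute, yet its Pearson correlation with the attribute is essentially zero, so a correlation-based ranking would discard it.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Synthetic attribute with values symmetric about zero.
x = [i / 10 for i in range(-20, 21)]
y_linear = [2 * v + 1 for v in x]   # linear dependence on x
y_nonlinear = [v * v for v in x]    # determined by x, but not linearly

r_linear = pearson(x, y_linear)        # close to 1
r_nonlinear = pearson(x, y_nonlinear)  # close to 0
```

A model-based ranking such as RFE can still retain the second kind of attribute, provided the underlying model is able to represent the nonlinear relationship.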

7. Conclusions

Traffic congestion results in significant monetary losses in countries around the world. Short-term traffic prediction helps make decisions based on predictions of traffic in the near future and is more useful than just using the real-time data of traffic conditions. Despite being a mature field, short-term traffic prediction poses many open problems such as (a) the choice of the optimal input data resolution; (b) reliable prediction and efficient tracking of nonrecurring congestion events; and (c) accurate modelling of the complex spatiotemporal dependencies influencing traffic estimation. We have explored the construction and use of predictive models based on three established machine learning algorithms for addressing the aforementioned problems. Specifically, we investigated the use of Artificial Neural Network (ANN), Support Vector Regression (SVR), and Random Forest (RF) algorithms and evaluated the predictive performance of these models for three different input data aggregation levels. For each learned model, the output was a prediction (of volume or occupancy) over a single interval, although the same methodology can be used to provide predictions over other intervals as well. Our experiments indicate the following.

(i) Aggregation of high-resolution data to a lower resolution is not required for accurate forecasting with machine learning algorithms. Aggregation may actually have a negative effect on accuracy for these multivariate models. Our results indicate that machine learning algorithms are able to extract useful information from high-resolution data despite the corresponding amplification of noise and variability in the sensor measurements.

(ii) By not explicitly exploiting the periodic characteristics in traffic, the machine learning models studied here perform equally well under both recurring and nonrecurring congestion without requiring any special changes to the models. The corresponding experimental results also indicate that these learned models are able to capture the underlying complex, spatiotemporal evolution of traffic.

(iii) Recursive Feature Elimination provides a good ranking of attributes for short-term traffic prediction. The more commonly used linear Pearson correlation coefficient-based feature selection [42] provides poor prediction accuracy, similar to that with a random selection of features, in the presence of nonrecurring congestion. Furthermore, feature selection enables us to visualize and better understand the spatiotemporal patterns modeled by the machine learning models.

These results open up multiple directions for further research. First, we will incorporate these findings in more sophisticated machine learning algorithms for short-term traffic prediction. For instance, the complex, nonlinear relationships influencing traffic flow may be modeled well using deep network architectures, especially when high-resolution input data is considered. We will also consider other datasets in order to generalize the findings reported in this paper based on data from a single highway. Second, we will build on the indicated ability to track nonrecurring congestion events in order to consider both accidents and weather conditions. This will require the underlying algorithms to model additional variables and their effect on traffic flow. Furthermore, we will explore network-wide traffic predictions towards the long-term objective of effective use of resources for the smooth flow of traffic under a wide range of circumstances.

Data Availability

The terms of use of the data used in this study do not allow the authors to distribute or publish the data directly. However, these data can be obtained directly from NZTA through APIs on the following web page: https://www.nzta.govt.nz/traffic-and-travel-information/infoconnect-section-page/.

Conflicts of Interest

Mr. Rivindu Weerasekera (BE (Hons)) is a doctoral candidate at the University of Auckland, New Zealand. He holds a first class honors degree in Electrical and Electronics Engineering from the University of Auckland. His research interests focus on the intersection of Intelligent Transportation Systems and Machine Learning.

Dr. Mohan Sridharan (Ph.D.) is a senior lecturer in the School of Computer Science at the University of Birmingham (UK). He was previously a senior lecturer in the Department of Electrical and Computer Engineering at The University of Auckland (NZ), and a faculty member at Texas Tech University (USA), where he is currently an Adjunct Associate Professor of Mathematics and Statistics. He received his Ph.D. in Electrical and Computer Engineering from The University of Texas at Austin (USA). Dr. Sridharan's primary research interests include knowledge representation and reasoning, interactive machine learning, cognitive systems, and computational vision, in the context of adaptive robots and agents.

Dr. Prakash Ranjitkar (Ph.D., MEng, BEng (Civil)) is a senior lecturer in Transportation Engineering in the Department of Civil and Environmental Engineering and a founding member of the Transportation Research Centre (TRC) at the University of Auckland, New Zealand. He has over 19 years of academic, research, and consulting work experience in a range of transport and other infrastructure engineering projects. He has a strong research interest in modelling and simulation of traffic, Intelligent Transportation Systems, traffic operations and management, traffic safety, human factors, and applications of advanced technologies in transportation. Prior to joining the University of Auckland in 2007, Prakash worked for the University of Delaware in the USA (2006–2007) and before that at Hokkaido University in Japan (2001–2006). He is a member of the IPENZ Transportation Group and the Institute of Transportation Engineers (USA). He is an Editorial Board Member for the Open Transportation Journal and a reviewer for the Journal of the Transportation Research Board, the Journal of the Eastern Asia Society for Transportation Studies, the Journal of Intelligent Systems, and IEEE Transactions on Intelligent Transportation Systems.

Acknowledgments

The authors would like to thank Mike Duke from Auckland’s Joint Transport Operations Centre (JTOC) for helping them obtain access to the data used for experimental evaluation in this paper.