Modeling taxi cruising time based on multi-source data: a case study in Shanghai

Transportation

Abstract

Vacant cruising is an inevitable part of taxi services caused by spontaneous demand, and the efficiency of cruising strategies has a direct impact on the profit of individual drivers. Extensive studies have been conducted to analyze taxi cruising patterns and propose effective cruising strategies. However, existing studies mainly focused on the collective behavior of certain driver groups and failed to capture cruising behavior patterns at the individual driver or trip level. Also, prior studies considered different types of factors affecting taxi cruising, but we still lack an integrated model to compare their relative importance. In this study, we analyze trip-level cruising time and the associated external and internal factors using a taxi trajectory dataset in Shanghai, China. A trajectory annotation technique is introduced to segment taxi trajectories into different phases. Various external (supply and demand, traffic condition and built environment) and internal (cruising strategies and historical driver performance) factors are derived from taxi trajectories and other data sources. A spatiotemporal embedding method is devised to capture unobserved effects over time and space. The impacts of external and internal factors on taxi cruising time are examined using regression and XGBoost, a machine learning model. The results show that external and internal factors are both important in determining taxi cruising time. Cruising strategies contribute 49.0% to taxi cruising time, which implies that effective cruising strategies can greatly reduce vacant cruising time. Additionally, nonlinear associations of some variables (e.g., supply–demand patterns, traffic speed) with taxi cruising time are discussed.

Notes

  1. https://www.amap.com/search?query=shanghai.

References

  • Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)

  • Cheng, L., De Vos, J., Zhao, P., Yang, M., Witlox, F.: Examining non-linear built environment effects on elderly’s walking: a random forest approach. Transp. Res. Part D: Transp. Environ. 88, 102552 (2020)

  • Cheng, L., Yang, X., Tang, L., Duan, Q., Kan, Z., Zhang, X., Ye, X.: Spatiotemporal analysis of taxi-driver shifts using big trace data. ISPRS Int. J. Geo-Inf. 9, 281 (2020). https://doi.org/10.3390/ijgi9040281

  • Gao, Y., Xu, P., Lu, L., Liu, H., Liu, S., Qu, H.: Visualization of taxi drivers’ income and mobility intelligence. In: International Symposium on Visual Computing, pp. 275–284. Springer (2012)

  • Hu, Y., Miller, H.J., Li, X.: Detecting and analyzing mobility hotspots using surface networks. Trans. GIS 18, 911–935 (2014)

  • Kang, C., Qin, K.: Understanding operation behaviors of taxicabs in cities by matrix factorization. Comput. Environ. Urban Syst. 60, 79–88 (2016). https://doi.org/10.1016/j.compenvurbsys.2016.08.002

  • Kong, H., Zhang, X., Zhao, J.: Is ridesourcing more efficient than taxis? Appl. Geogr. 125, 102301 (2020)

  • Li, B., Zhang, D., Sun, L., Chen, C., Li, S., Qi, G., Yang, Q.: Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset. In: 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pp. 63–68. IEEE (2011)

  • Liu, C., Guo, C.: Mining top-n high-utility operation patterns for taxi drivers. Expert Syst. Appl. 170, 114546 (2021)

  • Liu, L., Andris, C., Ratti, C.: Uncovering cabdrivers’ behavior patterns from their digital traces. Comput. Environ. Urban Syst. 34, 541–548 (2010)

  • Liu, Y., Kang, C., Gao, S., Xiao, Y., Tian, Y.: Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 14, 463–483 (2012)

  • Liu, Y., Wang, F., Xiao, Y., Gao, S.: Urban land uses and traffic ‘source-sink areas’: evidence from GPS-enabled taxi data in Shanghai. Landsc. Urban Plan. 106, 73–87 (2012)

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Adv. Neural. Inf. Process. Syst. 26, 3111–3119 (2013)

  • Parrott, J., Reich, M.: An Earnings Standard for New York City’s App-based Drivers: Economic Analysis and Policy Assessment. https://grist.org/wp-content/uploads/2020/06/787dd-parrott-reichnycappdriverstlcjul2018jul1.pdf (2018)

  • Powell, J.W., Huang, Y., Bastani, F., Ji, M.: Towards reducing taxicab cruising time using spatio-temporal profitability maps. In: International Symposium on spatial and temporal Databases, pp. 242–260. Springer (2011)

  • Qin, G., Li, T., Yu, B., Wang, Y., Huang, Z., Sun, J.: Mining factors affecting taxi drivers’ incomes using GPS trajectories. Transp. Res. Part C: Emerg. Technol. 79, 103–118 (2017)

  • Qu, M., Zhu, H., Liu, J., Liu, G., Xiong, H.: A cost-effective recommender system for taxi drivers. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 45–54 (2014)

  • Songchitruksa, P., Zeng, X.: Getis-Ord spatial statistics to identify hot spots by using incident management data. Transp. Res. Rec. 2165, 42–51 (2010)

  • Szeto, W.Y., Wong, R.C.P., Wong, S.C., Yang, H.: A time-dependent logit-based taxi customer-search model. Int. J. Urban Sci. 17, 184–198 (2013)

  • Tu, M., Li, W., Orfila, O., Li, Y., Gruyer, D.: Exploring nonlinear effects of the built environment on ridesplitting: evidence from Chengdu. Transp. Res. Part D: Transp. Environ. 93, 102776 (2021)

  • Urata, J., Xu, Z., Ke, J., Yin, Y., Wu, G., Yang, H., Ye, J.: Learning ride-sourcing drivers’ customer-searching behavior: a dynamic discrete choice approach. Transp. Res. Part C: Emerg. Technol. 130, 103293 (2021)

  • Wang, T., Shen, Z., Cao, Y., Xu, X., Gong, H.: Taxi-cruising recommendation via real-time information and historical trajectory data. IEEE Trans. Intell. Transp. Syst. (2021)

  • Wong, R.C.P., Szeto, W.Y., Wong, S.C.: A cell-based logit-opportunity taxi customer-search model. Transp. Res. Part C: Emerg. Technol. 48, 84–96 (2014)

  • Wong, R.C.P., Szeto, W.Y., Wong, S.C.: A two-stage approach to modeling vacant taxi movements. Transp. Res. Part C: Emerg. Technol. 59, 147–163 (2015)

  • Wong, R.C.P., Szeto, W.Y., Wong, S.C., Yang, H.: Modelling multi-period customer-searching behaviour of taxi drivers. Transportm. B: Transp. Dyn. 2, 40–59 (2014)

  • Yu, X., Gao, S., Hu, X., Park, H.: A Markov decision process approach to vacant taxi routing with e-hailing. Transp. Res. Part B: Methodol. 121, 114–134 (2019)

  • Yuan, J., Zheng, Y., Zhang, L., Xie, X., Sun, G.: Where to find my next passenger. In: Proceedings of the 13th international conference on Ubiquitous computing, pp. 109–118 (2011)

  • Zhang, D., Sun, L., Li, B., Chen, C., Pan, G., Li, S., Wu, Z.: Understanding taxi service strategies from taxi GPS traces. IEEE Trans. Intell. Transp. Syst. 16, 123–135 (2014)

  • Zhang, H., Shi, B., Zhuge, C., Wang, W.: Detecting taxi travel patterns using GPS trajectory data: a case study of Beijing. KSCE J. Civ. Eng. 23, 1797–1805 (2019)

  • Zhang, K., Chen, Y., Nie, Y.M.: Hunting image: taxi search strategy recognition using sparse subspace clustering. Transp. Res. Part C: Emerg. Technol. 109, 250–266 (2019)

  • Zhao, P., Xu, Y., Liu, X., Kwan, M.-P.: Space-time dynamics of cab drivers’ stay behaviors and their relationships with built environment characteristics. Cities 101, 102689 (2020). https://doi.org/10.1016/j.cities.2020.102689

  • Zheng, Y.: Trajectory data mining: an overview. ACM Trans. Intell. Syst. Technol. (TIST) 6, 1–41 (2015)

  • Zhu, M., Chen, W., Xia, J., Ma, Y., Zhang, Y., Luo, Y., Huang, Z., Liu, L.: Location2vec: a situation-aware representation for visual exploration of urban locations. IEEE Trans. Intell. Transp. Syst. 20, 3981–3990 (2019)

  • Zong, F., Wu, T., Jia, H.: Taxi drivers’ cruising patterns-insights from taxi GPS traces. IEEE Trans. Intell. Transp. Syst. 20, 571–582 (2018)

Acknowledgements

The data processing computations were performed using research computing facilities offered by Information Technology Services, the University of Hong Kong. We thank the anonymous reviewers for their helpful comments and suggestions.

Author information

Contributions

YL compiled different data sources, performed the data analysis, and wrote the first draft of the manuscript. ZZ conceptualized the study, designed the methodology, and provided the GPS data. XZ contributed to the revision of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhan Zhao.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Detecting stay points

We use Algorithm 1 to identify stay points from GPS trajectories. If a taxi has not moved more than 100 m within a 5-min window, the location is flagged as a candidate stay point. To mitigate false-positive candidates caused by traffic jams, we further assess the difference in heading direction before and after a candidate stay point, given as:

$$\begin{aligned} \Delta h_{i:j}=\frac{1}{K}\left| \sum _{k=1}^{K} {h_{i-k}}-\sum _{k=1}^{K} {h_{j+k}}\right| \end{aligned}$$
(6)

where i and j are the start and end points of the stay candidate, respectively, \({h_i}\) is the heading direction of point i, and K is the size of the sliding window used for the heading direction calculation. A stay point candidate with \(\Delta h_{i:j}\) smaller than a threshold \(\delta _h\) is classified as a traffic jam. Based on experiments, we set K to 5 and \(\delta _h\) to 30 degrees. Through qualitative evaluation, we find that this setting can effectively discriminate between stay behavior and traffic congestion. A few examples of detected traffic jam points are shown in Fig. 9.

Fig. 9 A few examples of traffic jams detected using our proposed algorithm
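
The stay point detection described above can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 1: the function names, the input layout (time in seconds, projected coordinates in metres, heading in degrees), and the choice to anchor the 100 m test at the first point of each candidate are all assumptions. The heading difference follows Eq. (6) literally, so it does not handle wrap-around at 0/360 degrees.

```python
import math

def heading_diff(headings, i, j, K=5):
    """Eq. (6): (1/K) * |sum of K headings before i - sum of K headings after j|.

    Returns None when fewer than K points exist on either side.
    """
    if i - K < 0 or j + K >= len(headings):
        return None
    before = sum(headings[i - k] for k in range(1, K + 1))
    after = sum(headings[j + k] for k in range(1, K + 1))
    return abs(before - after) / K

def detect_stays(points, dist_m=100.0, dur_s=300.0, K=5, delta_h=30.0):
    """Label low-movement segments as 'stay' or 'jam'.

    points: list of (t_seconds, x_m, y_m, heading_deg), sorted by time.
    Returns a list of (start_index, end_index, label).
    """
    headings = [p[3] for p in points]
    stays, i, n = [], 0, len(points)
    while i < n:
        # Extend the segment while the taxi stays within dist_m of point i.
        j = i
        while j + 1 < n and math.hypot(points[j + 1][1] - points[i][1],
                                       points[j + 1][2] - points[i][2]) <= dist_m:
            j += 1
        if points[j][0] - points[i][0] >= dur_s:
            dh = heading_diff(headings, i, j, K)
            # Small heading change before/after suggests the taxi simply
            # resumed its route, i.e. a traffic jam rather than a stay.
            # Assumption: candidates without enough context remain stays.
            label = "jam" if dh is not None and dh < delta_h else "stay"
            stays.append((i, j, label))
            i = j + 1
        else:
            i += 1
    return stays
```

With K = 5 and a 30-degree threshold, a taxi that stops for six minutes and then continues in the same direction is labelled a jam, while one that leaves in a perpendicular direction is kept as a stay.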

Appendix B: Classifying stay points

Algorithm 2 is used to classify stay points based on the potential behavioral states. It is relatively easy to identify the start and end of driver shifts because they exhibit a high level of spatiotemporal regularity (Cheng et al. 2020). The regularity of a stay point is measured by the number of active days on which the taxi stays at the corresponding location (i.e., a 200 m\(\times \)200 m grid cell), where an active day is a day with at least one occupied trip. The location s with the highest stopping frequency is labeled as the off-shift location (where a taxi starts and ends its shifts), and stay points associated with s are labeled as off-shift points.

Previous research has introduced different methods to detect waiting behavior from taxi trajectories. One of the first works is Li et al. (2011), who proposed to identify waiting behavior as the stay point before a pickup event. However, in reality, taxi drivers may wait at a location for some time, fail to find customers, and then move to another place to get passengers. To identify waiting points in the middle of a cruising trip, Zhang et al. (2014) introduced a method to detect initial intended locations. Specifically, they considered the longest "normal" and non-waiting trajectory starting from the last drop-off event as the "initial intended path", and detected whether the path ends with a waiting event. Although this method takes stay points in the middle of a cruising trip into account, it fails to differentiate between waiting and breaks when drivers need to have meals or visit the washroom. To address this issue, in this study, we use two criteria for waiting behavior identification: (1) whether the stay is immediately followed by a pickup; (2) whether the stay location is a hotspot for taxi pickups. A hotspot is a place with an unusually high density of pick-ups, and our method for hotspot identification is introduced in Appendix C. If a stay meets either criterion, it likely corresponds to waiting for passengers; the remaining stay points are classified as on-break points.

The distributions of the waiting locations detected using our proposed method and the two aforementioned methods are displayed in Fig. 10. We find that the number of waiting points detected by Li et al. (2011) is only about a quarter of that of the other two methods. This is reasonable as it only considers waiting behavior at the end of cruising trips. The distribution of waiting points detected by Zhang et al. (2014) is much more scattered, because that method does not differentiate between waiting and break points. This aligns with our intuition that waiting points should be more clustered around hotspots while break points are more randomly distributed. In comparison, the waiting points identified by our method have a spatial distribution similar to that in Li et al. (2011) but are generally denser, since our method also includes waiting instances that do not immediately lead to passenger pick-ups.

Fig. 10 The results of waiting behavior analysis using different methods
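
The classification rules described in this appendix can be sketched as below. This is an illustrative reading rather than the authors' Algorithm 2: the record layout (`cell`, `day`, `followed_by_pickup`) and the rule that the off-shift check takes precedence over the waiting criteria are assumptions, and the input is assumed to contain only stays from active days.

```python
from collections import defaultdict

def classify_stays(stays, hotspots):
    """Label each stay as 'off-shift', 'waiting' or 'on-break'.

    stays: list of dicts with keys
        'cell'               - grid cell id of the stay location
        'day'                - active day on which the stay occurred
        'followed_by_pickup' - True if a pickup immediately follows
    hotspots: set of grid cell ids identified as pick-up hotspots.
    """
    # Regularity: number of distinct active days with a stay in each cell.
    days_per_cell = defaultdict(set)
    for s in stays:
        days_per_cell[s["cell"]].add(s["day"])
    off_shift_cell = None
    if days_per_cell:
        off_shift_cell = max(days_per_cell, key=lambda c: len(days_per_cell[c]))

    labels = []
    for s in stays:
        if s["cell"] == off_shift_cell:
            labels.append("off-shift")
        elif s["followed_by_pickup"] or s["cell"] in hotspots:
            # Either criterion is enough to treat the stay as waiting.
            labels.append("waiting")
        else:
            labels.append("on-break")
    return labels
```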

Appendix C: Pick-up hotspot identification

Pick-up hotspot identification is an important criterion to select waiting points in our data pre-processing procedures. Specifically, we aggregate all the pick-up events to a 200 m \(\times \) 200 m grid system and the top 20% grid cells with the most pick-up events are defined as hotspots. Since there exist different ways to detect pick-up hotspots, we compare our method with two commonly used algorithms:

  • Getis-Ord Gi* statistic (Songchitruksa and Zeng 2010): this method calculates the Getis-Ord Gi* statistic for each grid cell in the study area. The resultant z-scores and p-values represent where grid cells with either high or low values cluster spatially. In this research, we keep detected hotspots with 95% confidence for further analysis.

  • Kernel Density Estimation (Hu et al. 2014): this method generates a surface summarizing the spatial distribution of a point pattern, and high points on the surface are identified as hotspots. In this research, we define the top 20% grid cells with the highest density as hotspots.
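
For comparison with these baselines, the grid-based rule used in this study (aggregate pick-ups to a grid and keep the top 20% of cells) can be sketched as follows. The function name and input layout are hypothetical, coordinates are assumed to be projected and in metres, and the ranking is assumed to cover only cells with at least one pick-up, which the text does not specify.

```python
import math
from collections import Counter

def pickup_hotspots(pickups, cell_m=200.0, top_frac=0.2):
    """Return the set of grid cells holding the most pick-up events.

    pickups: iterable of (x_m, y_m) pick-up locations.
    cell_m: grid cell size in metres.
    top_frac: share of (non-empty) cells kept as hotspots.
    """
    counts = Counter(
        (math.floor(x / cell_m), math.floor(y / cell_m)) for x, y in pickups
    )
    if not counts:
        return set()
    n_top = max(1, round(top_frac * len(counts)))
    return {cell for cell, _ in counts.most_common(n_top)}
```

Because each cell is ranked by its own count, isolated high-demand cells can qualify as hotspots even when their neighbours have few pick-ups.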

The results of different methods are displayed in Fig. 11. The hotspots detected by the two baselines form several large clusters in the city center or around several popular POIs, including Pudong and Hongqiao Airports, Expo Park and more. Similar to the baseline models, our method places most hotspots in the city center. However, we detect more small and scattered hotspots in suburban areas than the other two methods. This is reasonable as both the Getis-Ord Gi* statistic and Kernel Density Estimation look for clustered areas with similarly high demand, while our method focuses on the demand of each grid cell itself and does not consider spatial neighborhoods. We argue that, for our specific application, such a method might be more suitable because it is more capable of detecting small and scattered POIs, which can also be locations where taxi drivers wait for customers. This can be verified from the distribution of waiting points detected by Li et al. (2011) (Fig. 10a). Recall that Li et al. (2011) only consider waiting points that successfully lead to a pickup event; these points can thus be regarded as a sample of waiting behavior that is independent of the hotspot detection result. Our detected hotspots have a distribution similar to that of the waiting points in Fig. 10a, suggesting that our method can better match the distribution of actual waiting behavior.

Fig. 11 The results of hotspot detection using different methods

Previous research has shown that hotspots might be scale dependent (Hu et al. 2014). Therefore, we compare the hotspot detection results using different grid cell sizes, including 100 m \(\times \) 100 m, 200 m \(\times \) 200 m, 300 m \(\times \) 300 m and 400 m \(\times \) 400 m. For each grid system, we choose the top 20% of grid cells with the most pick-up events as hotspots. The results are displayed in Fig. 12. The distributions of detected hotspots using different grid sizes are quite similar. As the grid size increases, the detected hotspots look smoother, which is reasonable because, with larger grid sizes, hotspots are more likely to be connected with each other. We further compare the results of waiting behavior detection using different hotspots in Fig. 13. The results based on different grid systems are more than 90% identical. These results verify that our hotspot detection method is not much affected by spatial scale.

Fig. 12 The results of hotspot detection using different grid sizes

Fig. 13 The results of waiting behavior analysis using different grid sizes
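
One plausible way to quantify how "identical" two sets of results are is the share of stay points that receive the same label under two grid systems. The sketch below assumes this interpretation, which the text does not spell out; the function name is hypothetical.

```python
def label_agreement(labels_a, labels_b):
    """Share of stay points with the same label in both runs (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same stay points")
    if not labels_a:
        raise ValueError("no stay points to compare")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)
```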

Appendix D: Computation of several external factors

In this section, we clarify the definition of several external factors listed in Table 1:

  • Population density: as we only obtained the population data of each residential community zone (RCZ), the approximate population of a grid cell is estimated using:

    $$\begin{aligned} pop_j^{(grid)}=\sum _{i=1}^{N} {pop_i^{(RCZ)}\frac{area({RCZ}_i\cap {grid}_j)}{area({RCZ}_i)}} \end{aligned}$$
    (7)

    where N is the number of RCZs, \(pop_i^{(RCZ)}\) is the population of the ith RCZ and \(pop_j^{(grid)}\) is the estimated population of the jth grid cell.

  • Congestion level: the congestion level of grid j at time interval t is defined as:

    $$\begin{aligned} congestion_{tj}=1 - {\overline{v}}_{tj}/{\overline{v}}_{max j} \end{aligned}$$
    (8)

    where \({\bar{v}}_{tj}\) is the average traffic speed of grid j at time interval t and \({\bar{v}}_{max j}\) is the traffic speed of grid j with no congestion. In our experiment, \({\bar{v}}_{max j}\) is defined as the 95th percentile of \(\{{\bar{v}}_{tj}|t\in T\}\), where T is the set of 7 days × 24 h = 168 time intervals defined in our study.
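
Both definitions can be illustrated with a short sketch. For brevity it makes several assumptions that are not in the text: zones and grid cells are represented as axis-aligned rectangles `(x0, y0, x1, y1)` (real RCZs are polygons and would need a GIS library for the intersection), the 95th percentile uses the nearest-rank definition, and congestion values are not clipped, so intervals faster than the 95th percentile yield negative values. All function names are hypothetical.

```python
import math

def rect_area(r):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1); 0 if degenerate."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def rect_intersection_area(a, b):
    """Area of the overlap between two axis-aligned rectangles."""
    return rect_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))

def grid_population(grid_cell, rczs):
    """Eq. (7): area-weighted share of each RCZ population in one grid cell.

    rczs: list of (rectangle, population) pairs.
    """
    return sum(pop * rect_intersection_area(rect, grid_cell) / rect_area(rect)
               for rect, pop in rczs if rect_area(rect) > 0)

def free_flow_speed(speeds, pct=95):
    """Nearest-rank percentile of the interval speeds of one grid cell."""
    if not speeds:
        raise ValueError("no speed observations")
    ordered = sorted(speeds)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def congestion_levels(speeds):
    """Eq. (8): 1 - v_t / v_max for every time interval of one grid cell."""
    v_max = free_flow_speed(speeds)
    return [1 - v / v_max for v in speeds]
```

For example, a zone of 1000 residents that overlaps a grid cell with half of its area contributes 500 residents to that cell.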

Appendix E: Prediction performance of different machine learning models for taxi cruising time

In this section, we compare the performance of several commonly used machine learning models for the problem of taxi cruising time prediction. Specifically, in addition to XGBoost, the following two methods are considered:

  • Random Forests (RF): RF is a bagging ensemble learning method that builds multiple decision trees in parallel; the prediction is the average of the outputs of all decision trees.

  • Artificial Neural Network (ANN): ANN consists of an input layer, an output layer and several hidden layers in between. Each node other than the input nodes is a neuron with an activation function.

The results of different models on the test set are reported in Table 5, where the highlighted values indicate the best performing model for each metric. XGBoost performs best regarding RMSE and \(R^2\), followed by RF and finally ANN. Although ANN performs best regarding MAE, its difference from XGBoost is marginal, and it is more difficult to interpret than tree-based ensemble methods. Therefore, we choose XGBoost for further analysis due to its competitive prediction performance and relatively good interpretability.

Table 5 Prediction Performance of different models
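
For reference, the three metrics used in this comparison follow their standard definitions, sketched below in plain Python. The function name is hypothetical and the snippet does not reproduce the values in Table 5.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for paired observations and predictions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when the observations have zero variance.
        "R2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }
```

RMSE penalizes large errors more heavily than MAE, which is one way a model can rank first on MAE but not on RMSE.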

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liang, Y., Zhao, Z. & Zhang, X. Modeling taxi cruising time based on multi-source data: a case study in Shanghai. Transportation (2022). https://doi.org/10.1007/s11116-022-10348-y

  • DOI: https://doi.org/10.1007/s11116-022-10348-y
