Abstract
The main contribution of this paper is a new decision tree algorithm that lets users guide the data partitioning process. We believe this feature has many applications, but in this paper we demonstrate how to use it to analyse data sets containing missing values. We tested the algorithm on simulated data sets with various missing-data structures and on a real data set. The results show that this new classification procedure handles missing values efficiently, without any imputation or pre-processing, and produces results that are slightly more accurate and more interpretable than those of most common procedures.
Acknowledgements
We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with the anonymised student grade data. The authors also gratefully acknowledge the financial support of NSERC of Canada.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Beaulac, C., Rosenthal, J.S. BEST: a decision tree algorithm that handles missing values. Comput Stat 35, 1001–1026 (2020). https://doi.org/10.1007/s00180-020-00987-z
DOI: https://doi.org/10.1007/s00180-020-00987-z