BEST: a decision tree algorithm that handles missing values

Beaulac, Cédric; Rosenthal, Jeffrey S.

doi:10.1007/s00180-020-00987-z

Original paper
Published: 18 April 2020

Volume 35, pages 1001–1026, (2020)
Computational Statistics

Abstract

The main contribution of this paper is a new decision tree algorithm that allows users to guide the data partitioning process. We believe this feature has many applications, but in this paper we demonstrate how to use the algorithm to analyse data sets containing missing values. We tested our algorithm on simulated data sets with various missing data structures and on a real data set. The results demonstrate that this new classification procedure handles missing values efficiently and, without any imputation or pre-processing, produces results that are slightly more accurate and more interpretable than those of most common procedures.

Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with the anonymised student grade data. The authors also gratefully acknowledge the financial support of NSERC of Canada.

Author information

Authors and Affiliations

Department of Statistical Sciences, University of Toronto, Toronto, Canada
Cédric Beaulac & Jeffrey S. Rosenthal

Corresponding author

Correspondence to Cédric Beaulac.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Beaulac, C., Rosenthal, J.S. BEST: a decision tree algorithm that handles missing values. Comput Stat 35, 1001–1026 (2020). https://doi.org/10.1007/s00180-020-00987-z

Received: 01 November 2019
Accepted: 08 April 2020
Published: 18 April 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00180-020-00987-z
