
Random forest with acceptance–rejection trees

  • Original Paper
  • Computational Statistics

Abstract

In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants, including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance, and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed best, with ER second best. For regression problems, RF and SSS performed best, followed by AR and then ER. However, each algorithm was the most accurate on at least one dataset. We investigate scenarios where the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
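The exact acceptance rule used by the AR algorithm is given in the full paper; as a rough illustration of the idea, the R sketch below draws a completely random split and accepts it only if it achieves a minimal impurity reduction. The threshold form (a fraction frac of the parent node's Gini impurity) and the names gini, ar_split, frac, and max_tries are assumptions for illustration, not the authors' implementation.

    ## Illustrative acceptance-rejection split for a classification node.
    ## NOTE: the acceptance criterion below is an assumed, simplified stand-in
    ## for the paper's quality-control rule.
    gini <- function(y) {
      p <- prop.table(table(y))
      1 - sum(p^2)
    }

    ar_split <- function(x, y, frac = 0.05, max_tries = 25) {
      for (i in seq_len(max_tries)) {
        j   <- sample(ncol(x), 1)                  # variable chosen at random
        cut <- runif(1, min(x[, j]), max(x[, j]))  # cutpoint chosen at random
        left <- x[, j] <= cut
        if (!any(left) || all(left)) next          # degenerate split: reject
        n <- length(y)
        reduction <- gini(y) -
          (sum(left)  / n) * gini(y[left]) -
          (sum(!left) / n) * gini(y[!left])
        if (reduction >= frac * gini(y))           # acceptance check (assumed)
          return(list(var = j, cutpoint = cut))
      }
      NULL  # no split accepted: the node would become terminal
    }

For example, ar_split(as.matrix(iris[, 1:4]), iris$Species) returns a randomly drawn split on the iris data that passes the check, or NULL if none of the proposals do.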


References

  • Allwein E, Schapire R, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141

  • Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588

  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140

  • Breiman L (2001) Random forests. Mach Learn 45:5–32

  • Breiman L (2004) Consistency for a simple model of random forests. Technical report, University of California at Berkeley

  • Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning, pp 161–168

  • Caruana R, Karampatziakis N, Yessenalina A (2008) An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th international conference on machine learning, pp 96–103

  • Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical methods for data analysis. Wadsworth, Belmont

  • Cutler D, Edwards T Jr, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792

  • Davis R, Anderson Z (1989) Exponential survival trees. Stat Med 8:947–962

  • Derrig R, Francis L (2008) Distinguishing the forest from the trees: a comparison of tree-based data mining methods. Variance 2:184–208

  • Dietterich T, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286

  • Fan J, Su X, Levine R, Nunn M, LeBlanc M (2006) Trees for correlated survival data by goodness of split, with applications to tooth prognosis. J Am Stat Assoc 101:959–967

  • Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232

  • Genuer R, Poggi JM, Tuleau C (2008) Random forests: some methodological insights. arXiv preprint

  • Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42

  • Gordon L, Olshen R (1985) Tree-structured survival analysis. Cancer Treat Rep 69:1065–1069

  • Hajjem A, Bellavance F, Larocque D (2014) Mixed effects random forest for clustered data. J Stat Comput Simul 84:1313–1328

  • Hanley J, McNeil B (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36

  • Ho T (1995) Random decision forests. In: Proceedings of the 3rd international conference on document analysis and recognition, vol 1, pp 278–282

  • Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844

  • Hosmer D, Lemeshow S (1989) Applied logistic regression. Wiley, New York

  • Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments. J Comput Graph Stat 14:675–699

  • Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75–118

  • Ishwaran H, Kogalur UB (2016) Random forests for survival, regression, and classification (RF-SRC). R package version 2.2.0

  • Ishwaran H, Kogalur U, Gorodeski E, Minn A, Lauer M (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105:205–217

  • König I, Malley J, Weimar C, Diener HC, Ziegler A (2007) Practical experiences on the necessity of external validation. Stat Med 26:5499–5511

  • Leisch F, Dimitriadou E (2010) mlbench: machine learning benchmark problems. R package version 2.1-1

  • Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22

  • Malley J, Kruppa J, Dasgupta A, Malley K, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74–81

  • Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed May 2018

  • Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1:80–87

  • Sela R, Simonoff J (2012) RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86:169–207

  • Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H (2014) Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol 179:764–774

  • Strobl C, Boulesteix A, Zeileis A, Augustin T (2007a) Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal 52:483–501

  • Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007b) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8:25–46

  • Su X, Kang J, Liu L, Yang Q, Fan J, Levine R (2016) Smooth sigmoid surrogate (SSS): an alternative to greedy search in recursive partitioning. Comput Stat Data Anal (under review)

  • Su X, Pena A, Liu L, Levine R (2018) Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med 37:2547–2560

  • Torgo L (1999) Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto

  • Yoo W, Ference B, Cote M, Schwartz A (2012) A comparison of logistic regression, logic regression, classification tree, and random forests to identify effective gene–gene and gene–environment interactions. Int J Appl Sci Technol 2:268


Acknowledgements

This research was supported in part by NSF Grant 163310.

Author information


Corresponding author

Correspondence to Juanjuan Fan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 123 KB)

Appendix A: datasets

Table 8 Dataset Summaries

The twenty datasets analyzed to assess prediction accuracy are summarized in Table 8. The Ailerons and Elevators data were taken from Torgo (1999), the Birthwt data were taken from Hosmer and Lemeshow (1989), the Airquality data were taken from Chambers et al. (1983), and the remaining datasets were taken from the UCI Machine Learning Repository (Newman et al. 1998). Most of the datasets are also available in the mlbench R package (Leisch and Dimitriadou 2010). The Breast Cancer dataset had 16 missing values, the Imports85 dataset had 12 missing values, and the Airquality dataset had 42 missing values. Missing values were handled with list-wise deletion for this paper.
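For readers who want to reproduce this preprocessing, the snippet below shows list-wise deletion in R on the mlbench copy of the Breast Cancer data, which carries the 16 missing values noted above. na.omit() is a standard way to drop incomplete rows, though the paper does not specify which function was used.

    ## List-wise deletion of incomplete rows, shown on the Breast Cancer data
    ## from the mlbench package.
    library(mlbench)

    data(BreastCancer)
    bc <- na.omit(BreastCancer)      # drop every row containing an NA
    nrow(BreastCancer) - nrow(bc)    # number of rows removed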

Among the ten binary classification problems, three are synthetic and seven are real. The Twonorm, Threenorm, and Ringnorm datasets simulate a binary response from continuous inputs, with each input having the same effect on the response. Four of the real datasets (Breast Cancer, Ionosphere, German Credit, and Birthwt) included nominal or ordinal categorical variables. The binary response datasets ranged from 189 to 4601 observations and from 6 to 60 input variables. For the ten regression problems, half are synthetic with only continuous inputs. The Friedman datasets have nonlinear relationships with interacting variables, and the Ailerons and Elevators datasets are simulated data related to the control actions of an F16 aircraft. Four of the real datasets (Housing, Servo, Abalone, and Imports85) included nominal or ordinal categorical variables. The regression datasets ranged from 111 to 16,599 observations and from 4 to 40 input variables.
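The synthetic generators named above are all available in the mlbench package, so the simulated benchmarks can be redrawn with a few calls. The sample sizes and seed below are arbitrary placeholders, not the ones used in the paper, and only the first Friedman problem is shown.

    ## Redrawing the synthetic benchmark problems with mlbench generators.
    library(mlbench)

    set.seed(1)                                # arbitrary seed, for reproducibility
    two   <- mlbench.twonorm(300, d = 20)      # binary response, 20 Gaussian inputs
    three <- mlbench.threenorm(300, d = 20)
    ring  <- mlbench.ringnorm(300, d = 20)
    fr1   <- mlbench.friedman1(300, sd = 1)    # regression: nonlinear, interacting inputs

    table(two$classes)   # class balance of one simulated classification set
    summary(fr1$y)       # response of the Friedman #1 regression problem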


About this article


Cite this article

Calhoun, P., Hallett, M.J., Su, X. et al. Random forest with acceptance–rejection trees. Comput Stat 35, 983–999 (2020). https://doi.org/10.1007/s00180-019-00929-4

