Abstract
Random forest (RF) classifiers excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite these advantages, RF models have been shown to perform poorly when facing noisy data, which is commonly found in textual data, for instance. Several RF variants have been proposed to provide better generalization under such challenging scenarios, including lazy, boosted and randomized forests, all of which exhibit significant reductions in error rate when compared to traditional RFs. In this work, we analyze the behavior of these variants under the bias-variance decomposition of the error rate. Such an analysis is of utmost importance to uncover the main causes of the improvements in classification effectiveness enjoyed by those variants. As we shall see, significant reductions in variance, along with stability in bias, explain a large portion of the improvements for the lazy and boosted RF variants. The analysis also sheds light on promising directions for further enhancements of RF-based learners, such as the introduction of new randomization sources in both the lazy and boosted variants.
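The decomposition referenced above can be estimated empirically in the spirit of the zero-one-loss formulation of Domingos (2000): train several models on resampled data, take the majority ("main") prediction per test point, and attribute error to bias (the main prediction being wrong) and variance (individual predictions deviating from the main one). The sketch below is a minimal illustration with invented toy predictions, not the paper's actual experimental procedure or data:

```python
from collections import Counter

def bias_variance_01(y_true, preds_per_model):
    """Estimate bias and variance under zero-one loss (Domingos-style).

    preds_per_model: one list of predictions per model, where each
    model was trained on a different resample of the training set.
    """
    n = len(y_true)
    bias = variance = 0.0
    for i in range(n):
        votes = [preds[i] for preds in preds_per_model]
        main = Counter(votes).most_common(1)[0][0]   # majority ("main") prediction
        bias += int(main != y_true[i])               # loss of the main prediction
        variance += sum(v != main for v in votes) / len(votes)
    return bias / n, variance / n

# Invented toy run: 3 resampled models, 4 test points.
y = [0, 1, 1, 0]
preds = [[0, 1, 0, 0],
         [0, 1, 1, 1],
         [0, 0, 1, 0]]
bias, var = bias_variance_01(y, preds)  # -> bias 0.0, variance ≈ 0.25
```

In this toy run the main prediction is always correct, so the whole (expected) error is attributable to variance; a variance-reducing ensemble variant would shrink exactly this term, which is the kind of effect the analysis above measures.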
Notes
The term effectiveness is used here to refer to the “quality” of the classification process, as captured, for instance, by accuracy or error rate metrics.
\(\mathbb {A} {\setminus } \mathbb {B}\) denotes the set difference between \(\mathbb {A}\) and \(\mathbb {B}\), i.e., the set of elements in \(\mathbb {A}\) that are not in \(\mathbb {B}\).
Although not restricted to sampling with replacement, we adhere here to the original definition of the out-of-bag technique in Valentini and Dietterich (2003), which bootstraps samples from the original dataset with replacement.
As an example, considering the 20NG dataset (see Table 4), we see that both LazyNN_RF and BROOF classifiers obtain top performing results on \(\hbox {MicroF}_1\) and \(\hbox {MacroF}_1\).
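The out-of-bag set mentioned in the notes above can be computed in a few lines for one bootstrap round; the sketch below (pure Python, illustrative only, not code from the paper) draws n indices with replacement and returns the indices never drawn, whose expected fraction approaches 1/e ≈ 0.368 for large n:

```python
import random

def oob_indices(n, seed=0):
    """Bootstrap-sample n indices with replacement and return the
    out-of-bag indices, i.e. those never drawn in this round."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return sorted(set(range(n)) - in_bag)

oob = oob_indices(10_000)
frac = len(oob) / 10_000  # expected fraction (1 - 1/n)^n -> 1/e ≈ 0.368
```

The examples held out in each round serve as an unbiased validation set for the tree grown on that round's bootstrap sample, which is what BROOF exploits to weight its weak learners.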
References
Breiman L (2001) Random forests. Mach Learn 45(1):5
Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(1):3
Domingos PM (2000) A unified bias-variance decomposition for zero-one and squared loss. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, 30 July–3 Aug, Austin, Texas, USA. AAAI Press/The MIT Press, pp 564–569
Friedman JH (1997) On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Min Knowl Discov 1(1):55
Garber FD, Djouadi A (1988) Bounds on the Bayes classification error based on pairwise risk functions. IEEE Trans Pattern Anal Mach Intell 10(2):281. https://doi.org/10.1109/34.3891
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3
Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning. Springer, Berlin
Hutto CJ, Gilbert E (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the eighth international conference on weblogs and social media, ICWSM 2014, Ann Arbor, MI, USA, 1–4 June 2014. The AAAI Press
Jain R (1991) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley, New York
James GM (2003) Variance and bias for general loss functions. Mach Learn 51(2):115
Kohavi R, Wolpert D (1996) Bias plus variance decomposition for zero-one loss functions. In: Proceedings of the thirteenth international conference on machine learning (ICML-96)
Kong EB, Dietterich TG (1995) Error-correcting output coding corrects bias and variance. In: Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, pp 313–321
Leshem G, Ritov Y (2007) Traffic flow prediction using AdaBoost algorithm with random forests as a weak learner. Int J Intell Technol 2:1305
Li HB, Wang W, Ding HW, Dong J (2010) Trees weighting random forest method for classifying high-dimensional noisy data. In: 2010 IEEE 7th international conference on e-business engineering, pp 160–163. https://doi.org/10.1109/ICEBE.2010.99
Liu FT (2005) The utility of randomness in decision tree ensembles. Master’s thesis, Faculty of Information Technology, Monash University. https://pdfs.semanticscholar.org/109d/88da6e41b33d043449913827ba5a4ccca1e0.pdf
Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, vol Part II, pp 453–469
Murthy SK, Kasif S, Salzberg S (1994) A system for induction of oblique decision trees. J Artif Int Res 2(1):1–32
Salles T, Gonçalves M, Rodrigues V, Rocha L (2018) Improving random forests by neighborhood projection for effective text classification. Inf Syst 77:1. https://doi.org/10.1016/j.is.2018.05.006
Salles T, Gonçalves M, Rodrigues V, Rocha L (2015) BROOF: exploiting out-of-bag errors, boosting and random forests for effective automated classification. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, New York, NY, USA, SIGIR ’15, pp 353–362. https://doi.org/10.1145/2766462.2767747
Segal MR (2004) Machine learning benchmarks and random forest regression. Technical report, eScholarship Repository, University of California
Thelwall M (2013) Heart and soul: sentiment strength detection in the social web with SentiStrength. In: Cyberemotions, pp 1–14
Tumer K, Ghosh J (1996) Estimating the Bayes error rate through classifier combining. In: Proceedings of 13th international conference on pattern recognition, vol 2, pp 695–699. https://doi.org/10.1109/ICPR.1996.546912
Valentini G, Dietterich TG (2003) Low bias bagged support vector machines. In: Machine learning, proceedings of the twentieth international conference (ICML 2003), 21–24 Aug 2003, Washington, DC, USA. AAAI Press, pp 752–759
Webb GI, Conilione P (2003) Estimating bias and variance from data. Technical report, School of Computer Science and Software Engineering, Monash University
Webb GI (2000) MultiBoosting: a technique for combining boosting and wagging. Mach Learn 40(2):159
Zhang Z, Xie X (2010) Research on AdaBoost.M1 with random forest. In: Proceedings of the international conference on computer engineering and technology (ICCET), pp 647–652
Acknowledgements
This work was partially supported by CAPES, CNPq, Finep, Fapemig, MasWeb and InWeb.
Cite this article
Salles, T., Rocha, L. & Gonçalves, M. A bias-variance analysis of state-of-the-art random forest text classifiers. Adv Data Anal Classif 15, 379–405 (2021). https://doi.org/10.1007/s11634-020-00409-4