A bias-variance analysis of state-of-the-art random forest text classifiers

  • Regular Article
  • Published in: Advances in Data Analysis and Classification

Abstract

Random forest (RF) classifiers excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, which is commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization capabilities under such challenging scenarios, including lazy, boosted and randomized forests, all of which exhibit significant reductions in error rate when compared to traditional RFs. In this work, we analyze the behavior of these variants under the bias-variance decomposition of the error rate. Such an analysis is of utmost importance to uncover the main causes of the improvements in classification effectiveness observed for those variants. As we shall see, significant reductions in variance, along with stability in bias, explain a large portion of the improvements for the lazy and boosted RF variants. The analysis also sheds light on promising new directions for further enhancements of RF-based learners, such as the introduction of new randomization sources in both the lazy and boosted variants.
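To make the decomposition concrete, the sketch below estimates the zero-one-loss bias and variance of a random forest by retraining it on repeated bootstrap samples and comparing each trial's predictions with the majority ("main") prediction, in the spirit of Domingos (2000). It is a minimal illustration only: it assumes scikit-learn and NumPy are available and uses synthetic data and a simple mode-based estimator, not the text collections or the exact estimation protocol employed in the paper.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic data stand in for a text collection (illustrative only).
  X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                             random_state=0)
  X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

  n_trials = 30
  rng = np.random.default_rng(0)
  preds = np.empty((n_trials, len(y_test)), dtype=int)

  for t in range(n_trials):
      # Each trial trains on a bootstrap sample of the training pool, so the
      # spread of predictions across trials reflects the learner's variance.
      idx = rng.integers(0, len(y_pool), size=len(y_pool))
      rf = RandomForestClassifier(n_estimators=100, random_state=t)
      rf.fit(X_pool[idx], y_pool[idx])
      preds[t] = rf.predict(X_test)

  # Main prediction: the most frequent label over all trials for each test point.
  main_pred = np.array([np.bincount(col).argmax() for col in preds.T])

  # Bias term (zero-one loss): disagreement between main prediction and truth.
  bias = np.mean(main_pred != y_test)
  # Variance term: how often individual trials deviate from the main prediction.
  variance = np.mean(preds != main_pred[None, :])

  print(f"estimated bias = {bias:.3f}, estimated variance = {variance:.3f}")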

Notes

  1. The term effectiveness is used here to refer to the “quality” of the classification process, as captured, for instance, by accuracy or error rate metrics.

  2. \(\mathbb {A} {\setminus } \mathbb {B}\) denotes the set difference between \(\mathbb {A}\) and \(\mathbb {B}\), i.e., the set of elements in \(\mathbb {A}\) that are not in \(\mathbb {B}\).

  3. Although the technique is not restricted to sampling with replacement, we follow the original definition of the out-of-bag technique in Valentini and Dietterich (2003), which draws bootstrap samples from the original dataset with replacement (a short sketch of this sampling procedure follows these notes).

  4. As an example, considering the 20NG dataset (see Table 4), we see that both the LazyNN_RF and BROOF classifiers obtain top-performing results on \(\hbox {MicroF}_1\) and \(\hbox {MacroF}_1\).
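The out-of-bag construction referred to in Notes 2 and 3 can be illustrated with a short NumPy sketch: a bootstrap sample \(\mathbb {B}\) is drawn from the dataset \(\mathbb {A}\) with replacement, and the out-of-bag set is the difference \(\mathbb {A} {\setminus } \mathbb {B}\). The dataset size and indices below are purely illustrative.

  import numpy as np

  rng = np.random.default_rng(42)
  n = 10                              # size of the original dataset A (illustrative)
  dataset = np.arange(n)              # indices of the instances in A

  # Bootstrap sample B: draw n indices uniformly at random, with replacement.
  bootstrap = rng.integers(0, n, size=n)

  # Out-of-bag set A \ B: the instances that were never drawn into B.
  oob = np.setdiff1d(dataset, bootstrap)

  print("bootstrap sample (sorted):", np.sort(bootstrap))
  print("out-of-bag set:           ", oob)
  # On average roughly 36.8% of A ends up out-of-bag, since the probability of
  # an instance never being drawn is (1 - 1/n)^n, which approaches e^{-1}.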

References

  • Breiman L (2001) Random forests. Mach Learn 45(1):5

  • Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(1):3

  • Domingos PM (2000) A unified bias-variance decomposition for zero-one and squared loss. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, 30 July–3 Aug, Austin, Texas, USA. AAAI Press/The MIT Press, 2000, pp 564–569

  • Friedman JH (1997) On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Min Knowl Discov 1(1):55

  • Garber FD, Djouadi A (1988) Bounds on the Bayes classification error based on pairwise risk functions. IEEE Trans Pattern Anal Mach Intell 10(2):281. https://doi.org/10.1109/34.3891

  • Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3

  • Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning. Springer, Berlin

  • Hutto CJ, Gilbert E (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the eighth international conference on weblogs and social media, ICWSM 2014, Ann Arbor, MI, USA, 1–4 June 2014. The AAAI Press

  • Jain R (1991) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley, New York

  • James GM (2003) Variance and bias for general loss functions. Mach Learn 51(2):115

  • Kohavi R, Wolpert D (1996) Bias plus variance decomposition for zero-one loss functions. In: Proceedings of the thirteenth international conference on machine learning (ICML 1996)

  • Kong EB, Dietterich TG (1995) Error-correcting output coding corrects bias and variance. In: Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, pp 313–321

  • Leshem YRG (2007) Traffic flow prediction using AdaBoost algorithm with random forests as a weak learner. Int J Intell Technol 2:1305

  • Li HB, Wang W, Ding HW, Dong J (2010) Trees weighting random forest method for classifying high-dimensional noisy data. In: 2010 IEEE 7th international conference on e-business engineering, pp 160–163. https://doi.org/10.1109/ICEBE.2010.99

  • Liu FT (2005) The utility of randomness in decision tree ensembles. Master’s thesis, Faculty of Information Technology, Monash University. https://pdfs.semanticscholar.org/109d/88da6e41b33d043449913827ba5a4ccca1e0.pdf

  • Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, vol Part II, pp 453–469

  • Murthy SK, Kasif S, Salzberg S (1994) A system for induction of oblique decision trees. J Artif Int Res 2(1):1–32

  • Salles T, Gonçalves M, Rodrigues V, Rocha L (2018) Improving random forests by neighborhood projection for effective text classification. Inf Syst 77:1. https://doi.org/10.1016/j.is.2018.05.006

  • Salles T, Gonçalves M, Rodrigues V, Rocha L (2015) BROOF: exploiting out-of-bag errors, boosting and random forests for effective automated classification. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, New York, NY, USA, SIGIR ’15, pp 353–362. https://doi.org/10.1145/2766462.2767747

  • Segal MR (2004) Machine learning benchmarks and random forest regression. Technical report, eScholarship Repository, University of California

  • Thelwall M (2013) Heart and soul: sentiment strength detection in the social web with SentiStrength. Cyberemotions, pp 1–14

  • Tumer K, Ghosh J (1996) Estimating the Bayes error rate through classifier combining. In: Proceedings of 13th international conference on pattern recognition, vol 2, pp 695–699. https://doi.org/10.1109/ICPR.1996.546912

  • Valentini G, Dietterich TG (2003) Low bias bagged support vector machines. In: Machine learning, proceedings of the twentieth international conference (ICML 2003), 21–24 Aug 2003, Washington, DC, USA. AAAI Press, pp 752–759

  • Webb GI, Conilione P (2003) Estimating bias and variance from data. Technical report, School of Computer Science and Software Engineering, Monash University

  • Webb GI (2000) MultiBoosting: a technique for combining boosting and wagging. Mach Learn 40(2):159–196

  • Zhang Z, Xie X (2010) Research on AdaBoost.M1 with random forest. In: Proceedings ICCET, pp 647–652

Acknowledgements

This work was partially supported by CAPES, CNPq, Finep, Fapemig, MasWeb and InWeb.

Author information

Corresponding author

Correspondence to Thiago Salles.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 134 KB)

About this article

Cite this article

Salles, T., Rocha, L. & Gonçalves, M. A bias-variance analysis of state-of-the-art random forest text classifiers. Adv Data Anal Classif 15, 379–405 (2021). https://doi.org/10.1007/s11634-020-00409-4
