A bias-variance analysis of state-of-the-art random forest text classifiers

  • Regular Article
  • Published in: Advances in Data Analysis and Classification

Abstract

Random forest (RF) classifiers excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, which is commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization capabilities under such challenging scenarios, including lazy, boosted and randomized forests, all of which exhibit significant reductions in error rate when compared to traditional RFs. In this work, we analyze the behavior of these variants under the bias-variance decomposition of the error rate. Such an analysis is of utmost importance to uncover the main causes of the improvements in classification effectiveness observed for those variants. As we shall see, significant reductions in variance, along with stability in bias, explain a large portion of the improvements for the lazy and boosted RF variants. The analysis also sheds light on promising new directions for further enhancements of RF-based learners, such as the introduction of new randomization sources in both the lazy and boosted variants.
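To make the decomposition concrete, the sketch below estimates the zero-one-loss bias and variance of a random forest by retraining it on repeated bootstrap samples and comparing each trial's predictions with the majority ("main") prediction, in the spirit of Domingos (2000). It is a minimal illustration only: it assumes scikit-learn and NumPy are available and uses synthetic data and a simple mode-based estimator, not the text collections or the exact estimation protocol employed in the paper.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic data stand in for a text collection (illustrative only).
  X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                             random_state=0)
  X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

  n_trials = 30
  rng = np.random.default_rng(0)
  preds = np.empty((n_trials, len(y_test)), dtype=int)

  for t in range(n_trials):
      # Each trial trains on a bootstrap sample of the training pool, so the
      # spread of predictions across trials reflects the learner's variance.
      idx = rng.integers(0, len(y_pool), size=len(y_pool))
      rf = RandomForestClassifier(n_estimators=100, random_state=t)
      rf.fit(X_pool[idx], y_pool[idx])
      preds[t] = rf.predict(X_test)

  # Main prediction: the most frequent label over all trials for each test point.
  main_pred = np.array([np.bincount(col).argmax() for col in preds.T])

  # Bias term (zero-one loss): disagreement between main prediction and truth.
  bias = np.mean(main_pred != y_test)
  # Variance term: how often individual trials deviate from the main prediction.
  variance = np.mean(preds != main_pred[None, :])

  print(f"estimated bias = {bias:.3f}, estimated variance = {variance:.3f}")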

Notes

  1. The term effectiveness is used here to refer to the “quality” of the classification process, as captured, for instance, by accuracy or error rate metrics.

  2. \(\mathbb {A} {\setminus } \mathbb {B}\) denotes the set difference between \(\mathbb {A}\) and \(\mathbb {B}\), i.e., the set of elements in \(\mathbb {A}\) that are not in \(\mathbb {B}\).

  3. Although the technique is not restricted to sampling with replacement, we follow the original definition of the out-of-bag technique in Valentini and Dietterich (2003), which draws bootstrap samples from the original dataset with replacement (a short sketch of this sampling procedure follows these notes).

  4. As an example, considering the 20NG dataset (see Table 4), we see that both the LazyNN_RF and BROOF classifiers obtain top-performing results on \(\hbox {MicroF}_1\) and \(\hbox {MacroF}_1\).
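The out-of-bag construction referred to in Notes 2 and 3 can be illustrated with a short NumPy sketch: a bootstrap sample \(\mathbb {B}\) is drawn from the dataset \(\mathbb {A}\) with replacement, and the out-of-bag set is the difference \(\mathbb {A} {\setminus } \mathbb {B}\). The dataset size and indices below are purely illustrative.

  import numpy as np

  rng = np.random.default_rng(42)
  n = 10                              # size of the original dataset A (illustrative)
  dataset = np.arange(n)              # indices of the instances in A

  # Bootstrap sample B: draw n indices uniformly at random, with replacement.
  bootstrap = rng.integers(0, n, size=n)

  # Out-of-bag set A \ B: the instances that were never drawn into B.
  oob = np.setdiff1d(dataset, bootstrap)

  print("bootstrap sample (sorted):", np.sort(bootstrap))
  print("out-of-bag set:           ", oob)
  # On average roughly 36.8% of A ends up out-of-bag, since the probability of
  # an instance never being drawn is (1 - 1/n)^n, which approaches e^{-1}.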

References

  • Breiman L (2001) Random forests. Mach Learn 45(1):5

  • Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(1):3

  • Domingos PM (2000) A unified bias-variance decomposition for zero-one and squared loss. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, 30 July–3 Aug, Austin, Texas, USA. AAAI Press/The MIT Press, 2000, pp 564–569

  • Friedman JH (1997) On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Min Knowl Discov 1(1):55

  • Garber FD, Djouadi A (1988) Bounds on the Bayes classification error based on pairwise risk functions. IEEE Trans Pattern Anal Mach Intell 10(2):281. https://doi.org/10.1109/34.3891

  • Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3

  • Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning. Springer, Berlin

  • Hutto CJ, Gilbert E (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the eighth international conference on weblogs and social media, ICWSM 2014, Ann Arbor, MI, USA, 1–4 June 2014. The AAAI Press

  • Jain R (1991) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley, New York

  • James GM (2003) Variance and bias for general loss functions. Mach Learn 51(2):115

  • Kohavi R, Wolpert D (1996) Bias plus variance decomposition for zero-one loss functions. In: Proceedings of the thirteenth international conference on machine learning (ICML 1996)

  • Kong EB, Dietterich TG (1995) Error-correcting output coding corrects bias and variance. In: Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, pp 313–321

  • Leshem YRG (2007) Traffic flow prediction using AdaBoost algorithm with random forests as a weak learner. Int J Intell Technol 2:1305

  • Li HB, Wang W, Ding HW, Dong J (2010) Trees weighting random forest method for classifying high-dimensional noisy data. In: 2010 IEEE 7th international conference on e-business engineering, pp 160–163. https://doi.org/10.1109/ICEBE.2010.99

  • Liu FT (2005) The utility of randomness in decision tree ensembles. Master’s thesis, Faculty of Information Technology, Monash University. https://pdfs.semanticscholar.org/109d/88da6e41b33d043449913827ba5a4ccca1e0.pdf

  • Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, vol Part II, pp 453–469

  • Murthy SK, Kasif S, Salzberg S (1994) A system for induction of oblique decision trees. J Artif Int Res 2(1):1–32

  • Salles T, Gonçalves M, Rodrigues V, Rocha L (2018) Improving random forests by neighborhood projection for effective text classification. Inf Syst 77:1. https://doi.org/10.1016/j.is.2018.05.006

  • Salles T, Gonçalves M, Rodrigues V, Rocha L (2015) BROOF: exploiting out-of-bag errors, boosting and random forests for effective automated classification. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, New York, NY, USA, SIGIR ’15, pp 353–362. https://doi.org/10.1145/2766462.2767747

  • Segal MR (2004) Machine learning benchmarks and random forest regression. Technical report, eScholarship Repository, University of California

  • Thelwall M (2013) Heart and soul: sentiment strength detection in the social web with SentiStrength. Cyberemotions, pp 1–14

  • Tumer K, Ghosh J (1996) Estimating the Bayes error rate through classifier combining. In: Proceedings of 13th international conference on pattern recognition, vol 2, pp 695–699. https://doi.org/10.1109/ICPR.1996.546912

  • Valentini G, Dietterich TG (2003) Low bias bagged support vector machines. In: Machine learning, proceedings of the twentieth international conference (ICML 2003), 21–24 Aug 2003, Washington, DC, USA. AAAI Press, pp 752–759

  • Webb GI, Conilione P (2003) Estimating bias and variance from data. Technical report, School of Computer Science and Software Engineering, Monash University

  • Webb GI (2000) MultiBoosting: a technique for combining boosting and wagging. Mach Learn 40(2):159–196

  • Zhang Z, Xie X (2010) Research on AdaBoost.M1 with random forest. In: Proceedings ICCET, pp 647–652

Acknowledgements

This work was partially supported by CAPES, CNPq, Finep, Fapemig, MasWeb and InWeb.

Author information

Corresponding author

Correspondence to Thiago Salles.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 134 KB)

About this article

Cite this article

Salles, T., Rocha, L. & Gonçalves, M. A bias-variance analysis of state-of-the-art random forest text classifiers. Adv Data Anal Classif 15, 379–405 (2021). https://doi.org/10.1007/s11634-020-00409-4
