Online random forests regression with memories

https://doi.org/10.1016/j.knosys.2020.106058

Abstract

In recent years, online variants of conventional Random Forests (RFs) have attracted much attention because of their ability to handle sequential data, or data whose distribution changes during prediction. However, most research on online RFs focuses on structural modification during the training stage, overlooking critical properties of sequential datasets, such as autocorrelation. In this paper, we demonstrate how to improve the predictive accuracy of a regression model by exploiting data correlation. Instead of modifying the structure of off-line trained RFs, we endow RFs with memory during regression prediction through an online weight learning approach, called Online Weight Learning Random Forest Regression (OWL-RFR). Specifically, the weights of leaves are updated by a novel adaptive stochastic gradient descent method whose learning rate, unlike a static rate, accounts for the current and historical prediction bias ratios. A leaf-level weight thus stores information learned from past data points for future correlated predictions. Compared with tree-level weights, which hold only immediate memory for the current prediction, leaf-level weights provide long-term memory. Numerical experiments show that OWL-RFR achieves remarkable improvements in predictive accuracy on several common machine learning datasets, compared with traditional RFs and other online approaches. Moreover, our results verify that weighting based on the long-term memory of leaf-level weights is more effective than the immediate dependency of tree-level weights. We also show that the proposed adaptive learning rate outperforms a static rate on most datasets, and we demonstrate the convergence and stability of our method.

Introduction

Random Forests (RFs) [1] are ensembles of randomized decision trees that use bagging for classification or regression tasks. RFs are widely used in machine learning applications, including tasks demanding high real-time performance, such as computer vision [2], [3]. Their main advantages are computational efficiency and strong predictive performance.

RFs are typically off-line methods [4]. Nevertheless, many applications require real-time processing of massive streaming data, such as predicting network workload, traffic conditions during peak hours, or users' evaluations of products. Such data are difficult to store in their entirety. Furthermore, the distribution of test samples may differ from that of the training samples, so evaluating future test instances directly with a model trained off-line may incur large prediction bias.

To address these issues, some efforts have been dedicated to online training [5], [6], [7]. However, to the best of our knowledge, most of the existing approaches suffer from frequent updating of the model, which generally leads to inefficient computation. Moreover, they often ignore the continuing dependencies between samples in the data stream.

In this paper, we propose a novel approach that updates the leaf-level weight of off-line trained base learners in an online manner and endows RFs with long-term memory to avoid the need for frequent model updating. More precisely, as testing data arrive sequentially, our method updates the leaf-level weight of RFs simultaneously, while keeping the structure of the trained RFs’ model unchanged. In this way, the proposed approach achieves better prediction by such online weight learning, in addition to exploiting the superior efficiency of the off-line training approach. Moreover, our method avoids the low efficiency and poor accuracy of a pure online growth strategy during the beginning stages. Our contribution is three-fold:

  • (1) We propose an online weight learning approach that endows RFs with memory to improve regression prediction when the distribution of streaming data changes.

  • (2) We propose an adaptive learning rate for stochastic gradient descent (SGD) based on current and historical prediction bias ratios.

  • (3) We validate the proposed method with extensive experiments and show its convergence and stability.
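To make contributions (1) and (2) concrete, the sketch below pairs an off-line trained forest with per-leaf weights that are adjusted by SGD as labelled stream samples arrive. The class name, the linear weighting of tree predictions, and the bias-ratio form of the adaptive learning rate are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class LeafWeightedRF:
    """Off-line trained RF whose leaf-level weights are updated online (sketch)."""

    def __init__(self, n_trees=25, base_lr=0.05, random_state=0):
        self.rf = RandomForestRegressor(n_estimators=n_trees, random_state=random_state)
        self.base_lr = base_lr
        self.bias_history = []          # past |prediction - target| values

    def fit(self, X, y):
        self.rf.fit(X, y)               # tree structure is fixed after this point
        # one weight per tree node, initialised to 1 so we start as a plain RF
        self.w = [np.ones(t.tree_.node_count) for t in self.rf.estimators_]
        return self

    def _adaptive_lr(self, bias, eps=1e-8):
        # hypothetical rule: scale the base rate by the ratio of the current
        # bias to the running mean of past biases (larger error -> larger step)
        hist = np.mean(self.bias_history) if self.bias_history else bias
        return self.base_lr * bias / (hist + eps)

    def predict_update(self, x, y_true=None):
        x = np.asarray(x).reshape(1, -1)
        leaves = self.rf.apply(x)[0]    # leaf index reached in each tree
        preds = np.array([t.predict(x)[0] for t in self.rf.estimators_])
        weights = np.array([self.w[i][l] for i, l in enumerate(leaves)])
        y_hat = np.mean(weights * preds)
        if y_true is not None:          # SGD on squared error w.r.t. the leaf weights
            bias = abs(y_hat - y_true)
            lr = self._adaptive_lr(bias)
            grad = (y_hat - y_true) * preds / len(preds)
            for i, l in enumerate(leaves):
                self.w[i][l] -= lr * grad[i]
            self.bias_history.append(bias)
        return y_hat
```

Only the weights touched by the current sample are updated, so each leaf retains what it learned from earlier correlated samples while the forest structure stays untouched.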

The rest of the paper is organized as follows. In Section 2 we discuss previous work in related areas. In Section 3 we introduce the main principles of our method and present the online weight learning algorithm for RFs. Section 4 details a series of experiments with our approach on regression tasks. Finally, the conclusion and future work are given in Section 5.


Related work

Conventional RFs are ensembles of unpruned (classification or regression) decision trees, such as C4.5 [8] and CART [9], and are an extension of bagging [10]. Conventional RFs combine random feature selection in the tree induction procedure with bootstrap sampling of the training data. RFs have many advantages, such as low computational cost, ease of implementation, and strong performance. RFs' predictions are generated by aggregating the predictions of the trees, in the manner of either
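For regression, this aggregation is simply the mean of the per-tree predictions, which can be checked directly against scikit-learn's RandomForestRegressor (the dataset here is a synthetic stand-in):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# A regression forest's prediction is the average of its trees' predictions.
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)
per_tree = np.stack([t.predict(X) for t in rf.estimators_])   # (5, 100)
assert np.allclose(per_tree.mean(axis=0), rf.predict(X))
```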

Preliminaries

Online learning is a form of online optimization, which attempts to overcome the shortcomings of statistical learning theory when addressing the dynamic aspects of large datasets. Online regression learning can be described formally as follows [38]. At every round t = 1, 2, …, n:

  • receive an instance x_t ∈ X, where X ⊆ ℝ^d corresponds to a set of features;

  • predict p_t ∈ D by h_t ∈ H, where H is the hypothesis set H = {x_t ↦ Σ_{i=1}^{d} w_t[i]·x_t[i] : ∀i, w_t[i] ∈ ℝ}, and w_t[i] is
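One round of this protocol with the linear hypothesis class and a plain SGD update can be sketched as follows; the target vector, learning rate, and squared loss are arbitrary demo choices rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 3, 200, 0.05
w = np.zeros(d)                      # w_t, updated every round
w_true = np.array([1.0, -2.0, 0.5])  # assumed target weights for the demo

losses = []
for t in range(n):
    x_t = rng.normal(size=d)         # receive instance x_t in R^d
    p_t = w @ x_t                    # predict with the current hypothesis h_t
    y_t = w_true @ x_t               # the true answer is revealed after predicting
    losses.append((p_t - y_t) ** 2)  # suffer squared loss
    w -= lr * 2 * (p_t - y_t) * x_t  # SGD update producing w_{t+1}

# the average loss shrinks as w_t approaches w_true
assert np.mean(losses[-20:]) < np.mean(losses[:20])
```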

Benchmark datasets

The algorithms proposed in this paper are designed for scenarios in which data arrive as a continuous stream. For comparative evaluation, we selected eight benchmark datasets of different sizes from the UCI Machine Learning Repository [39] and the Kaggle data science community. These datasets are Boston, California Housing (California), Concrete Compressive Strength (Concrete) [40], Airfoil Self-Noise (Airfoil), Parkinsons Telemonitoring (Parkinsons) [41], Wine Quality (WineQuality) [42],

Conclusion

We introduce a novel online weight learning algorithm (OWL-RFR), which builds on off-line trained RFs for regression prediction. We endow the RFs with memory through weights applied during prediction. We analyze the difference between the long-term memory of leaf weights and the immediate memory of tree weights in streaming data, to demonstrate the effectiveness of the former. Experiments on commonly used machine learning datasets show that OWL-RFR significantly outperforms

CRediT authorship contribution statement

Yuan Zhong: Conceptualization, Methodology, Software, Writing - review & editing. Hongyu Yang: Supervision. Yanci Zhang: Validation, Investigation. Ping Li: Formal analysis, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge the support from the National Natural Science Foundation of China (61472261 and 61873218), the Applied Basic Research Foundation of Sichuan Provincial Science and Technology Department, China (No. 18YYJC1147), the Nanchong Science and Technology Department, China (18SXHZ0009), and the Southwest Petroleum University innovation base, China (No. 642).

References (43)

  • Leistner, C., et al. Semi-supervised random forests.

  • Saffari, A., et al. On-line random forests.

  • Domingos, P., Hulten, G. Mining high-speed data streams, in: Proceedings of the Sixth ACM SIGKDD International Conference...

  • Bifet, A., et al. New ensemble methods for evolving data streams.

  • Ikonomovska, E., et al. Learning model trees from evolving data streams. Data Mining Knowl. Discov. (2011).

  • Quinlan, J.R. C4.5: Programs for Machine Learning (2014).

  • Breiman, L.I., et al. Classification and regression trees (CART). Encycl. Ecol. (1984).

  • Breiman, L. Bagging predictors. Mach. Learn. (1996).

  • Perrone, M.P., et al. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. Tech. Rep. (1992).

  • Chen, C., et al. Using random forest to learn imbalanced data. Univ. Calif. Publ. Bot. (2004).

  • Amaratunga, D., et al. Enriched random forests. Bioinformatics (2008).