Online random forests regression with memories
Introduction
Random Forests (RFs) [1] are ensembles of randomized decision trees that use bagging for classification or regression tasks. RFs are widely used in machine learning applications, as well as in tasks demanding high real-time performance, such as computer vision [2], [3]. Their main advantages are computational efficiency and strong predictive performance.
RFs are typically off-line methods [4]. Nevertheless, many applications require real-time processing of massive streaming data, such as predicting network workload, traffic conditions during peak hours, and users’ evaluations of products. Such extensive data are difficult to store in their entirety. Furthermore, the distribution of test samples may differ from that of training samples, so evaluating future test instances directly with models trained off-line may incur large prediction bias.
To address these issues, some efforts have been dedicated to online training [5], [6], [7]. However, to the best of our knowledge, most existing approaches suffer from frequent model updating, which generally leads to inefficient computation. Moreover, they often ignore the sequential dependencies between samples in the data stream.
In this paper, we propose a novel approach that updates the leaf-level weights of off-line-trained base learners in an online manner and endows RFs with long-term memory, avoiding the need for frequent model updating. More precisely, as testing data arrive sequentially, our method simultaneously updates the leaf-level weights of the RFs while keeping the structure of the trained RF model unchanged. In this way, the proposed approach achieves better prediction through online weight learning while retaining the superior efficiency of off-line training. Moreover, our method avoids the low efficiency and poor accuracy that a purely online growth strategy suffers during its early stages. Our contribution is three-fold:
- (1)
We propose an online weight learning approach that endows RFs with memory to improve regression prediction when the distribution of the streaming data changes.
- (2)
We propose an adaptive learning rate for stochastic gradient descent (SGD) based on current and historical prediction bias ratios.
- (3)
We validate the proposed method with extensive experiments and show the convergence and stability of our method.
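The core idea behind contributions (1) and (2) can be sketched in code. The sketch below is illustrative only: the class and method names are hypothetical, and the exact form of the bias-ratio learning rate is an assumption rather than the paper's actual rule, which is given in Section 3.

```python
import numpy as np

class OnlineWeightedForest:
    """Sketch of leaf-level online weight learning over a fixed,
    pre-trained forest. leaf_values[m][l] is the prediction stored at
    leaf l of tree m; tree structures never change, only leaf weights."""

    def __init__(self, leaf_values, base_lr=0.05):
        self.leaf_values = leaf_values                      # per-tree: leaf id -> value
        self.weights = [{l: 1.0 for l in lv} for lv in leaf_values]
        self.base_lr = base_lr
        self.hist_bias = 1e-8                               # running mean of |error|

    def predict(self, leaf_ids):
        """leaf_ids[m] = leaf reached in tree m for the current sample."""
        preds = [self.weights[m][l] * self.leaf_values[m][l]
                 for m, l in enumerate(leaf_ids)]
        return sum(preds) / len(preds)

    def update(self, leaf_ids, y_true):
        """SGD step on the active leaves; the step size is scaled by the
        ratio of the current bias to the historical bias (assumed form)."""
        err = self.predict(leaf_ids) - y_true
        lr = self.base_lr * abs(err) / (self.hist_bias + abs(err))
        for m, l in enumerate(leaf_ids):
            # d(prediction)/d(weight) = leaf value / number of trees
            grad = err * self.leaf_values[m][l] / len(leaf_ids)
            self.weights[m][l] -= lr * grad
        # exponential moving average acts as long-term memory of the bias
        self.hist_bias = 0.9 * self.hist_bias + 0.1 * abs(err)
```

Only the leaves actually visited by the current sample are updated, so each round costs O(number of trees) regardless of forest size, which is what allows the structure to stay frozen while the weights adapt to a drifting distribution.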
The rest of the paper is organized as follows. In Section 2 we discuss previous work in related areas. In Section 3 we introduce the main principles of our method and present the online weight learning algorithm for RFs. Section 4 details a series of experiments with our approach on regression tasks. Finally, the conclusion and future work are given in Section 5.
Section snippets
Related work
The conventional RFs are an ensemble of unpruned (classification or regression) decision trees, such as C4.5 [8] and CART [9], and are a type of extension of bagging [10]. Conventional RFs use random feature selection in tree induction procedures, through bootstrap sampling of training data. RFs have many advantages, such as low cost of computation, ease of implementation, and powerful performance. RFs’ predictions are generated through aggregating predictions of trees, in the manner of either
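The two ingredients named above, bootstrap sampling of the training data and aggregation of per-tree predictions, can be sketched as follows. This is a generic illustration of the conventional RF recipe, not code from the paper; function names are our own.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a bootstrap replicate: sample n rows with replacement,
    as bagging does before inducing each tree."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def aggregate(tree_predictions, task="regression"):
    """Combine per-tree predictions: averaging for regression,
    majority vote for classification."""
    P = np.asarray(tree_predictions)        # shape (n_trees, n_samples)
    if task == "regression":
        return P.mean(axis=0)
    # majority vote down the tree axis for each sample
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, P.astype(int))
```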
Preliminaries
Online learning belongs to the class of online optimization problems, which attempts to overcome the shortcomings of statistical learning theory when addressing the dynamic aspects of large datasets. Online regression learning can be described formally as follows [38]. At every round t = 1, 2, …, T:
- •
receive a question x_t ∈ X, where X ⊆ R^d corresponds to a set of features;
- •
predict ŷ_t by h ∈ H, where H is the hypothesis set, H = {h : X → Y}, where Y is
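The protocol above is a predict-then-update loop: the learner commits to a prediction before the true label is revealed, suffers a loss, and may then revise its hypothesis. A minimal sketch, where the squared loss and the SGD update are illustrative choices and `stream`, `model`, and `update` are hypothetical names:

```python
def online_regression(stream, model, update):
    """Online regression protocol: at each round t, see x_t, commit to a
    prediction, then observe y_t, suffer the loss, and update."""
    cumulative_loss = 0.0
    for x_t, y_t in stream:
        y_hat = model(x_t)                # predict before seeing the label
        loss = (y_hat - y_t) ** 2         # squared loss for regression
        cumulative_loss += loss
        update(x_t, y_t, y_hat)           # learner adjusts its hypothesis
    return cumulative_loss
```

The quantity of interest in this setting is the cumulative loss over all rounds (or the regret relative to the best fixed hypothesis in H), rather than generalization error on a held-out set.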
Benchmark datasets
The algorithms proposed in this paper are designed for scenarios in which data are delivered through continuous streaming. For comparative evaluation, we selected eight benchmark datasets of different sizes from the UCI Machine Learning Repository [39] and the Kaggle data science community. These datasets are Boston, California Housing (California), Concrete Compressive Strength (Concrete) [40], Airfoil Self-Noise (Airfoil), Parkinsons Telemonitoring (Parkinsons) [41], Wine Quality (WineQuality) [42],
Conclusion
We introduce a novel online weight learning algorithm (OWL-RFR), which builds on the more typical off-line training of RFs for regression prediction. We endow the RFs with memory through the weights used at prediction time. We analyze the difference between each leaf weight’s long-term memory and the tree weight’s immediate memory in streaming data, to demonstrate the effectiveness of the former. Experiments on commonly used machine learning datasets show that OWL-RFR significantly outperforms
CRediT authorship contribution statement
Yuan Zhong: Conceptualization, Methodology, Software, Writing - review & editing. Hongyu Yang: Supervision. Yanci Zhang: Validation, Investigation. Ping Li: Formal analysis, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the support from the National Natural Science Foundation of China (Nos. 61472261 and 61873218), the Applied Basic Research Foundation of the Sichuan Provincial Science and Technology Department, China (No. 18YYJC1147), the Nanchong Science and Technology Department, China (No. 18SXHZ0009), and the Southwest Petroleum University innovation base, China (No. 642).
References (43)
- et al., Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests, Neurocomputing (2016)
- et al., Random feature weights for decision tree ensemble construction, Inform. Fusion (2012)
- et al., Forestexter: An efficient random forest algorithm for imbalanced text categorization, Knowl.-Based Syst. (2014)
- et al., A weight-adjusted voting algorithm for ensembles of classifiers, J. Korean Statist. Soc. (2011)
- et al., Online adaptive decision trees based on concentration inequalities, Knowl.-Based Syst. (2016)
- et al., An on-line weighted ensemble of regressor models to handle concept drifts, Eng. Appl. Artif. Intell. (2015)
- Modeling of strength of high-performance concrete using artificial neural networks, Cement Concr. Res. (1998)
- et al., Modeling wine preferences by data mining from physicochemical properties, Decis. Support Syst. (2009)
- Random forests, Mach. Learn. (2001)
- et al., Image classification using random forests and ferns
- Semi-supervised random forests
- On-line random forests
- New ensemble methods for evolving data streams
- Learning model trees from evolving data streams, Data Mining Knowl. Discov.
- C4.5: Programs for Machine Learning
- Classification and regression trees (CART), Encycl. Ecol.
- Bagging predictors, Mach. Learn.
- When Networks Disagree: Ensemble Methods for Hybrid Neural Networks, Tech. Rep.
- Using random forest to learn imbalanced data, Univ. Calif. Publ. Bot.
- Enriched random forests, Bioinformatics