1 Introduction

The ensemble method combines multiple models to achieve better prediction accuracy for classification and regression (Dietterich 2000). Ensemble methods generally perform better than single fitted models (Hansen and Salamon 1990). Because of this good performance, ensemble methods are widely used in the machine learning and statistics communities (Breiman 1996; Freund and Schapire 1996; Bauer and Kohavi 1999; Amaratunga et al. 2008; Wolf et al. 2010). Classification ensemble methods consist of several classifiers, typically decision trees. Well-known classification ensemble methodologies include Boosting (Freund and Schapire 1996; Schapire 1990; Freund and Schapire 1997), Bagging (Breiman 1996), Random Forest (Breiman 2001), Gradient Boosting (Mason et al. 1999; Hastie et al. 2009), and XGBoost (Chen and Guestrin 2016).

Random forest (RF) is widely used in various fields because it has many advantages over other classification ensemble methods. RF is fast in both training and prediction, is resistant to label noise and outliers, handles multi-class problems directly, is well suited to parallel processing, and performs well on high-dimensional data (Hastie et al. 2013; Friedman et al. 2001).

Choosing appropriate hyper-parameters is critical for improving RF performance. Typical parameters are the number of trees (ntree), the number of candidate features (mtry), the sample size (samplesize), and the tree size (nodesize). Various studies have examined the effect of these parameters on RF performance. Probst and Boulesteix (2018), Freeman et al. (2015) and Huang and Paul (2016) examined the sensitivity of the parameters. Banfield et al. (2007), Hernandez-Lobato et al. (2013) and Oshiro et al. (2012) tested the effect of the number of trees. Boulesteix et al. (2012) and Han and Kim (2019) proposed methods for selecting an appropriate number of candidate features. Martínez-Muñoz and Suárez (2010) studied the estimation of the sample size for Bagging. Lin and Jeon (2012) developed the idea of combining RF with adaptive nearest neighbors to estimate the tree size.

This paper develops a new ensemble method that surpasses RF. The idea stems from the question of whether trees of sufficiently large size are generated during the RF process. The classification ensemble method proposed in this paper is called DRF and is explained in detail in the subsequent sections. Section 2 reviews the original RF and provides the motivation for this research. The algorithm of the new classification ensemble method is described in Sect. 3. In Sect. 4, we extend the earlier motivational experiments to compare the performance of DRF and RF in terms of accuracy and tree size. We compare the prediction accuracy of DRF and other classification ensemble methods in Sect. 5. Section 6 concludes this study.

2 Motivation

2.1 Random forest

Random forest (RF) (Breiman 2001) builds the ensemble by creating trees on bootstrap samples of the data. When constructing the trees, the RF algorithm chooses the optimal split among a randomly selected set of features at every intermediate node. It is well known that RF improves prediction accuracy because it produces more diverse and less correlated trees (Breiman 2001; Banfield et al. 2007). Algorithm 1 describes the RF algorithm.

[Algorithm 1]

2.2 The tree size effect on random forest

Trees created with RF are usually grown to full size without pruning (Breiman 2001). This is because lower bias is usually obtained when each tree grows to its maximum size. Along with the variance reduction effect of the ensemble, this bias reduction is expected to ultimately improve classification accuracy.

In the randomForest package of R, the nodesize and maxnodes parameters control the tree size of RF, where nodesize is the minimum number of cases a terminal node should hold and maxnodes is the maximum number of terminal nodes allowed in a tree. Thus, if nodesize is small and maxnodes is large, the tree will grow large. The default value for nodesize is 1 in classification and 5 in regression, and maxnodes defaults to NULL, meaning that there is no limit on the number of terminal nodes. It is empirically known that using these defaults yields good RF performance (Huang and Paul 2016).
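For concreteness, the following minimal R sketch (ours, not the authors' code; the iris data stand in for ‘mnist’) illustrates how nodesize controls the tree size in the randomForest package:

library(randomForest)

data(iris)
set.seed(1)

# Deep trees: package defaults (nodesize = 1, maxnodes = NULL, i.e., unlimited)
rf_deep <- randomForest(Species ~ ., data = iris, ntree = 200)

# Shallow trees: each terminal node must hold at least 10% of the instances
rf_shallow <- randomForest(Species ~ ., data = iris, ntree = 200,
                           nodesize = ceiling(0.1 * nrow(iris)))

# Average number of terminal nodes per tree in each ensemble
mean(treesize(rf_deep, terminal = TRUE))
mean(treesize(rf_shallow, terminal = TRUE))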

As a motivational example, an experiment was conducted to investigate the effect of tree size on RF performance using the ‘mnist’ data, which are often used for classification problems in machine learning. The ‘mnist’ data consist of images of handwritten digits 0–9, each represented by a grid of 28 \(\times \) 28 pixels. From the modeling point of view, this dataset has 10 classes and 784 variables. The training and test sets have 60,000 and 10,000 instances, respectively. Figure 1 shows examples from the ‘mnist’ dataset.

Fig. 1: Examples of ‘mnist’ data

The design of the experiment is as follows. Following the approach of Larochelle et al. (2012), we fit and evaluate the RF model: the training data are divided into six pieces, each consisting of 10,000 instances. Five pieces are combined to construct a learning sample for fitting, and the original test set is used for evaluation. This process is repeated six times so that all six combinations of five pieces are formed.

To investigate the effect of tree size on RF performance, various nodesize values were used, namely nodesize = \(0.1n, 0.09n, \ldots , 0.01n\), \(0.005n\), \(0.001n\), and the default, where n is the number of instances in the training set and the default is 1. The number of trees in RF was set to 200 to allow a sufficient model fit. Other parameters in the randomForest R package were left at their default values.
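A hedged R sketch of this sweep is shown below; the loading of each ‘mnist’ split is assumed, and run_sweep is an illustrative name rather than the authors' code:

library(randomForest)

# x_train, y_train, x_test, y_test are assumed to hold one 'mnist' split,
# with the labels given as factors
run_sweep <- function(x_train, y_train, x_test, y_test) {
  n     <- nrow(x_train)
  sizes <- c(round(c(seq(0.1, 0.01, by = -0.01), 0.005, 0.001) * n), 1)
  sapply(sizes, function(ns) {
    fit <- randomForest(x_train, y_train, ntree = 200, nodesize = ns)
    mean(predict(fit, x_test) == y_test)   # test accuracy
  })
}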

Fig. 2: Box-plot of the test accuracy of random forest for each nodesize using ‘mnist’ data

The results of the experiment are summarized in Fig. 2. The horizontal axis represents the nodesize value and the vertical axis represents the accuracy on the test data. The boxplots in the figure are drawn using the test accuracy obtained from the six repeated experiments.

First, we notice that RF performance is heavily influenced by the nodesize option. Second, as the value of nodesize decreases, that is, as the tree size increases, the test accuracy gradually improves. Thus, in the case of the ‘mnist’ dataset, the best RF performance was achieved when the tree size was maximized. If RF were able to construct even bigger trees, could the prediction accuracy be further improved? If so, the current RF algorithm has a limitation in that its trees are not large enough for the ‘mnist’ dataset. In other words, even the largest fit provided by RF under-fits the ‘mnist’ dataset. This question motivated us to develop a new classification ensemble method to overcome the limitation of the RF algorithm in terms of tree size.

3 Double random forest (DRF)

We propose a new classification ensemble method called “Double Random Forest” (DRF) to improve the RF performance when RF under-fits the data.

First, DRF uses the whole training data to grow a decision tree at each stage of the ensemble. This contrasts with RF, which constructs individual trees using random bootstrap samples from the training set. This means that in DRF, all trees are created with the same data from the beginning, whereas in RF, the trees are created with different data having fewer unique instances. Note that the number of unique instances in a bootstrap sample is about \(1-e^{-1} \approx 0.632\) times that of the training data, because a given instance is absent from a bootstrap sample of size n with probability \((1-1/n)^n \rightarrow e^{-1}\). Clearly, the more unique instances available, the easier it is to create larger trees. Therefore, the DRF ensemble tends to consist of larger trees than RF.
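As a quick empirical check (an illustration, not part of the original study), this fraction can be verified in R:

set.seed(1)
n <- 100000
length(unique(sample(n, n, replace = TRUE))) / n   # approximately 0.632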

Second, DRF uses bootstrap sampling temporarily at each intermediate node, including the root node, only to find the partitioning rule. In addition, a random subset of the features is selected, as in RF. In effect, DRF searches for the best partitioning rule using a random bootstrap of instances and a random subset of features at each node of the tree. Once the partitioning rule is found, it recovers the original data in the node and sends the instances down to the child nodes.

In summary, DRF uses more instances to create trees, which increases the size of the trees and thereby reduces the bias of the classification predictions. It also adds randomness to the partitioning rules through the bootstrap and the feature subset, allowing a more diverse ensemble of trees.

Using the bootstrap near the terminal nodes may not be appropriate because there are considerably fewer unique instances there. Therefore, we chose not to use the bootstrap on nodes containing fewer than 10% of the total number of instances. Algorithm 2 describes the DRF algorithm.

[Algorithm 2]
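Since the algorithm box is not reproduced here, the following R fragment sketches the core DRF split step as we read the description above; find_best_split is a hypothetical helper standing in for a CART-style exhaustive split search, and the fragment is an illustration rather than the authors' implementation:

drf_split <- function(node_data, target, mtry, n_total) {
  n_node <- nrow(node_data)

  # Skip the bootstrap near the terminal nodes
  # (fewer than 10% of all instances)
  split_set <- if (n_node >= 0.1 * n_total) {
    node_data[sample(n_node, n_node, replace = TRUE), ]  # temporary bootstrap
  } else {
    node_data
  }

  # Random feature subset, as in RF
  feats <- sample(setdiff(names(node_data), target), mtry)

  # Choose the best rule on the bootstrap copy (hypothetical helper)
  rule <- find_best_split(split_set, target, feats)

  # Partition the ORIGINAL node instances with that rule
  # (a numeric split variable is assumed)
  left  <- node_data[node_data[[rule$var]] <  rule$cut, ]
  right <- node_data[node_data[[rule$var]] >= rule$cut, ]
  list(rule = rule, left = left, right = right)
}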

4 Case study (‘mnist’ data)

The motivational experiments discussed in Sect. 2.2 show that even if the RF algorithm uses hyper-parameters that allow maximum-sized trees, it may under-fit the data. We extend the experiment to show that the proposed DRF algorithm alleviates this under-fitting problem. The design of the experiment is the same as described in Sect. 2.2, except that nodesize was set to the default value 1 to compare both methods at the maximum tree size.

Fig. 3: Box-plot of the test accuracy of RF and DRF using ‘mnist’ data

Table 1 Tree size comparison between RF and DRF

The results of the experiment are summarized in Fig. 3 and Table 1. In Fig. 3, we compare the test accuracy of both methods. The boxplots in the figure are drawn using the test accuracy obtained from the six repeated experiments following Larochelle et al. (2012). DRF appears significantly more accurate than RF, because the boxplot of DRF does not overlap with, and lies well above, that of RF.

In Table 1, we compare the tree sizes of both methods. The values in the table are the mean and standard deviation (in parentheses) of the number of terminal nodes over the 200 trees of the ensemble. As expected, the trees in DRF are on average larger than those in RF. Note also that the DRF trees have a higher standard deviation in the number of terminal nodes than the RF trees. This is due to the randomness introduced by the bootstraps and feature subsets during the split-rule search, and it eventually leads to more diverse trees.

In conclusion, for the ‘mnist’ data, the DRF ensemble method seems to overcome the under-fitting problem of the RF ensemble method. In addition, DRF produces more diverse trees, which is one of the factors that make an ensemble method successful.

5 Experimental study

In this section, more datasets are used to compare the prediction accuracy of the proposed DRF method with that of other ensemble methods. First, we divide 58 datasets into two groups, one for which RF may under-fit and the other for which it does not. The comparison then proceeds only on the group of datasets for which RF may under-fit. More details are given below.

5.1 Effect of the tree size

This section expands the discussion of the tree size effect in Sect. 2.2 to a larger number of datasets. The experiment was conducted on 58 real and artificial datasets that are suitable for classification problems. They mostly come from the UCI data repository (Asuncion and Newman 2007) and the mlbench R package (Dimitriadou and Leisch 2010). The datasets are listed in Table 2.

Table 2 Descriptions of the 58 datasets

For the experiment, we used a random 50% of each dataset as the training set for fitting and the remaining 50% as the test set for evaluation. For a reliable comparison, the entire experimental process was repeated 100 times to reduce the impact of any specific training or test set. This also reduces the sampling bias that can occur with an unbalanced number of class instances. All other design settings are the same as in Sect. 2.2.

Fig. 4: Relative accuracy for each nodesize (RF may under-fit these datasets)

Fig. 5: Relative accuracy for each nodesize (RF may under-fit these datasets)

Fig. 6: Relative accuracy for each nodesize (RF does not under-fit these datasets)

Fig. 7: Relative accuracy for each nodesize (RF does not under-fit these datasets)

The results of the experiment are summarized in Figs. 4, 5, 6, and 7, where the horizontal axis represents nodesize values and the vertical axis represents the relative test accuracy. Relative test accuracy is defined as (the accuracy of RF under the given nodesize) / (the accuracy of RF when nodesize is set to its default of 1). Therefore, if this value is greater than 1, the RF with smaller trees is more accurate than the RF with the largest trees, which means that RF does not under-fit. Similarly, if the value is less than 1, the RF with the largest trees is the best, which means that RF may under-fit.

Of the 58 datasets in total, the 34 datasets shown in Figs. 4 and 5 form one group and the 24 datasets shown in Figs. 6 and 7 form another. Figures 4 and 5 show that the relative test accuracies are lower than 1 for all nodesize values. For such datasets, the tree size of RF is likely not large enough to provide the best performance, and there is room for further improvement. In contrast, the relative test accuracy exceeds 1 in Figs. 6 and 7. In this case, the tree size does not need to be maximal for the best RF performance, and a certain level of pruning can be beneficial to RF when building the trees.

5.2 Comparison between DRF and other ensemble methods

This section compares the proposed DRF method with other classification ensemble methods, namely Bagging (Breiman 1996), Samme (a multi-class modification of AdaBoost) (Zhu et al. 2009) and RF, using the 34 datasets shown in Figs. 4 and 5. A single tree model (Therneau and Atkinson 2019) is also included in the comparison as a baseline.

The design of the experiment is as follows. As in Sect. 5.1, we used a random 50% of each dataset as the training set for fitting and the remaining 50% as the test set for evaluation. The maximum tree size without pruning was used for all ensemble methods except Samme. For Samme, tree growth was limited by setting the maximum tree depth to the number of classes, as in Kim et al. (2010). The number of trees generated in each ensemble was set to 200. The single tree model uses pruning for better accuracy. Other parameters were left at their default values. The whole experimental process was also repeated 100 times for reliable results.

Table 3 Pairwise comparison of prediction accuracies. In each cell, results are summarized as “a(b)”, where a is the number of datasets for which the method in the column outperforms the method in the row and b is the number of datasets for which the difference is statistically significant in the one-sided paired t-test

Table 3 compares the test accuracy of DRF with that of the other methods. The values in the table are the number of times the column method is more accurate than the row method. For example, the value 32(32) in the first row and fifth column means that DRF is more accurate than a single tree for 32 of the 34 datasets, and the number in parentheses is the number of times the difference is statistically significant. Conversely, the value 2(2) in the fifth row and first column indicates that there were 2 datasets for which the single tree performed better than DRF, and both differences were statistically significant.

Table 4 Dominance ranks of the methods using the significant differences from the results in Table 3

Table 4 shows the ranking of the methods based on the results in Table 3. The dominance rank is defined as the number of significant wins minus the number of significant losses. For example, DRF has a dominance rank of 87 because it had 104 significant wins and 17 significant losses. The number of significant wins for DRF is the sum of the values in parentheses in the DRF column; similarly, the number of significant losses for DRF is the sum of the values in parentheses in the DRF row.

Tables 3 and 4 show that the DRF method significantly outperforms the other methods on the 34 datasets for which the RF tree size is not large enough. Since the RF method is the closest competitor, we conducted a more detailed comparison between DRF and RF.

Fig. 8: Confidence intervals of accuracy differences between DRF and RF. If the confidence interval does not contain zero, the difference is statistically significant

Figure 8 shows the confidence intervals for the difference in accuracy between DRF and RF. Of the 34 confidence intervals, 19 lie above zero, which means that for 19 of the 34 datasets DRF is significantly better than RF. Only 6 confidence intervals lie below zero.

Table 5 ANOVA table of the randomized complete block design

Table 5 shows the results of the randomized complete block design (RCBD) analysis performed to test the overall difference in accuracy between DRF and RF. The p-value for the difference between the two methods is very low, 0.0009, which means that DRF is significantly more accurate than RF.

The results so far compare the methods using the prediction accuracy on test data. A comparison was also performed using the F1-score, an index that combines precision and recall (sensitivity), defined as \(2\times \frac{\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\). The comparison with the F1-score showed little difference from the previous results with test accuracy. For those interested, the comparison results using the F1-score are given in the Appendix.
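For reference, a minimal R sketch of the F1-score is given below; since the paper does not state which multi-class averaging was used, macro-averaging over classes is assumed here:

f1_macro <- function(truth, pred) {
  classes <- levels(factor(truth))
  f1 <- sapply(classes, function(k) {
    tp        <- sum(pred == k & truth == k)
    precision <- if (sum(pred == k) > 0) tp / sum(pred == k) else 0
    recall    <- if (sum(truth == k) > 0) tp / sum(truth == k) else 0
    if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  })
  mean(f1)   # macro-average over classes (an assumed convention)
}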

5.3 Comparison of computing times

The DRF algorithm requires more computation time because the bootstrap must be performed at each node of every tree in the ensemble. It also takes longer to complete because DRF trees are larger than RF trees, which further increases the number of times the bootstrap must be performed.

Figure 9 shows the runtime of the two methods for each of the 34 datasets. The runtimes did not differ much on small datasets, but, as expected, the gap widens on large datasets.

Fig. 9: Computing times of DRF and RF

5.4 Comparison of DRF and tuned RF (instance size)

Is there any other way to make the trees bigger in the RF method? If so, it might exhibit accuracy similar to DRF. One possible way is to use instances sampled without replacement, instead of a bootstrap (sampling with replacement), for each tree. Bootstrapping draws n instances, but because it samples with replacement, the number of unique instances is about \(0.632\times n\). If \(0.9 \times n\) instances are sampled without replacement, the trees of RF will be larger because there are more unique instances. However, as the number of unique instances increases, the pools of data used to construct the individual trees become more similar, which may weaken the diversity of the RF trees. In a related study, Martínez-Muñoz and Suárez (2010) examined the effect of instance size on Bagging and showed that sampling without replacement, rather than the bootstrap, is more effective on some datasets.
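In the randomForest package, this variant can be expressed as in the sketch below (our illustration of the setting described in this section, not the authors' exact code):

library(randomForest)

# Sample 90% of the instances without replacement instead of bootstrapping
fit_subsample_rf <- function(x, y, frac = 0.9, ntree = 200) {
  randomForest(x, y, ntree = ntree,
               replace  = FALSE,                  # no bootstrap
               sampsize = floor(frac * nrow(x)))  # 0.9 * n unique instances
}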

In the experiment of this section, when fitting the RF method, bootstrapping was disabled and the sample-size parameter (sampsize in randomForest) was set to \(0.9 \times n\) to make the RF trees bigger. As a result, the trees were about 1.26 times larger than those of the typical RF. However, this method was more accurate on only 18 of the 34 datasets; on the remaining datasets, the typical RF was more accurate. In short, RF accuracy did not improve merely because the generated trees were larger. This is because the diversity of the classifiers is weakened when the trees are built from more similar instances throughout the ensemble (Martínez-Muñoz and Suárez 2010).

It is also of interest to compare the accuracy of DRF and the tuned RF (instance size), where the tuned RF uses the sample size that gives the more accurate result for each dataset. As shown in Table 6, if RF is tuned for instance size, the difference in accuracy from DRF is reduced. However, DRF is still superior to the tuned RF (instance size). The reason may be that in DRF the diversity of the classifiers is not weakened, because bootstrapping is performed at every node of each tree.

Table 6 Pairwise comparison of tuned RF (instance size) and DRF

5.5 Comparison of tuned RF and tuned DRF

It is known that the number of candidate features considered at each node of a tree affects the accuracy of the RF ensemble. The number of candidate features, known as mtry in the randomForest package of R, is a hyper-parameter that users can choose. The most commonly used value for mtry is \(\sqrt{p}\), where p is the number of features (Han and Kim 2019). The results shown in Sects. 5.2 and 5.4 were obtained using the default value \(\sqrt{p}\). However, the default does not always guarantee the highest accuracy, and depending on the data, more favorable values may exist.

In this section, we compare the performance of tuned RF and tuned DRF in the setting where their performance is maximized by appropriately selecting the mtry value for each dataset. A grid search was used to select the mtry values. In addition, for RF, the instance size was also tuned as in Sect. 5.4.
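The exact grid is not reported in the paper; a plausible R sketch of the tuning step, with an assumed grid around the default \(\sqrt{p}\), is:

library(randomForest)

tune_mtry <- function(x_train, y_train, x_val, y_val, ntree = 200) {
  p    <- ncol(x_train)
  # Assumed grid: multiples of sqrt(p), clipped to [1, p]
  grid <- unique(pmin(p, pmax(1, round(sqrt(p) * c(0.25, 0.5, 1, 2, 4)))))
  acc  <- sapply(grid, function(m) {
    fit <- randomForest(x_train, y_train, ntree = ntree, mtry = m)
    mean(predict(fit, x_val) == y_val)   # validation accuracy
  })
  grid[which.max(acc)]   # mtry value with the highest validation accuracy
}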

The pairwise comparison of prediction accuracy is given in Table 7. When comparing RF and tuned RF, datasets with identical accuracy were excluded from the counts; the same applies to DRF. According to Table 7, tuned DRF exceeds the performance of RF, tuned RF, and DRF.

Table 7 Pairwise comparison of tuned RF and tuned DRF

6 Conclusion

Random forest (RF) is known to predict better when larger trees are used without pruning. Therefore, it is common to create trees of the maximum possible size. In this paper, we found that the maximum-sized trees produced by RF may not be large enough for some datasets. This indicates that RF may under-fit even with the most favorable options. To cope with this, we developed a method called double random forest (DRF) that creates trees larger than those of RF.

The DRF ensemble method has two distinctions from RF. First, DRF uses the whole training data to grow each decision tree. Since the whole training data contain more unique instances than a bootstrap sample, the trees tend to be bigger than those in RF. For data where RF under-fits, this approach can remedy the problem by reducing the prediction bias. Second, DRF uses bootstrap sampling only to find the partitioning rule at each intermediate node. That is, it searches for the best partitioning rule using a random bootstrap of instances and a random subset of features at each node of the tree. Once the partitioning rule is found, the bootstrap data are replaced by the original data in the node and the instances are sent down to the child nodes. We argue that DRF creates a more diverse tree ensemble by adding randomness to the partitioning rules through the bootstrap and the feature subset.

Experimental studies using 34 real or artificial datasets on which RF under-fits show that the proposed DRF method is far superior to the other classification ensembles. Moreover, we found that DRF is statistically better than RF in terms of test accuracy through a randomized complete block design (RCBD) analysis. This trend remained the same when both RF and DRF were tuned for their mtry values. However, the DRF algorithm is more complex and takes longer to fit than RF.

For the datasets where the ideal nodesize is not 1, the accuracy of the DRF and RF methods was similar and no statistical difference was found. That is, when RF does not under-fit, DRF is not needed because the trees do not need to be larger.