Abstract

Over the last decades, molecular signatures have become increasingly important in oncology and are opening up a new area of personalized medicine. Nevertheless, biological relevance and statistical tools necessary for the development of these signatures have been called into question in the literature. Here, we investigate six typical selection methods for high-dimensional settings and survival endpoints, including LASSO and some of its extensions, component-wise boosting, and random survival forests (RSF). A resampling algorithm based on data splitting was used on nine high-dimensional simulated datasets to assess selection stability on training sets and the intersection between selection methods. Prognostic performances were evaluated on respective validation sets. Finally, one application on a real breast cancer dataset has been proposed. The false discovery rate (FDR) was high for each selection method, and the intersection between lists of predictors was very poor. RSF selects many more variables than the other methods and thus becomes less efficient on validation sets. Due to the complex correlation structure in genomic data, stability in the selection procedure is generally poor for selected predictors, but can be improved with a higher training sample size. In a very high-dimensional setting, we recommend the LASSO-pcvl method since it outperforms other methods by reducing the number of selected genes and minimizing FDR in most scenarios. Nevertheless, this method still gives a high rate of false positives. Further work is thus necessary to propose new methods to overcome this issue where numerous predictors are present. Pluridisciplinary discussion between clinicians and statisticians is necessary to ensure both statistical and biological relevance of the predictors included in molecular signatures.

1. Introduction

With the advent of genomic technologies, personalized medicine is becoming a major concern in oncology [1]. An important issue is improving the management of cancer patients by identifying profiles of patients at risk of relapse. To assist clinicians in prognosis assessment and therapeutic decision-making, multigene signatures have been developed to stratify cancer patients into different risk groups. Since the publication of the first gene signature, many prognostic multigene signatures have been extensively studied, but only some of them have been successfully implemented in clinical practice [24]. Among the latter, the most promising are gene signatures commercially available for early breast cancer that predict the risk of metastatic relapse [57]. Nevertheless, reproducibility is lacking for many published gene signatures currently not implemented in clinical practice. Much debate is ongoing in both medical and statistical literature to explain this high rate of failure for prognostic signatures. Biological relevance appears questionable since random signatures or signatures unrelated to cancer (e.g., signatures pertaining to the effect of postprandial laughter or mouse social defeat) have been shown to be significantly associated to overall survival. It is even more surprising that many random signatures can outperform most breast cancer signatures [8]. Several authors have suggested that the selected sets of genes are not unique and are strongly influenced by the subset of patients included in the training cohort [9, 10] and by the variable selection procedures [1114]. For low-dimensional data, the reference method to study associations with time-to-event endpoints is the Cox proportional hazards model. In the context of high-dimensional data (), the Cox model may be nonidentifiable. Extensions, based on boosting or penalized regression, are proposed in the literature to overcome these hurdles [1518], as they shrink the regression coefficients towards zero. Alternatively to the Cox extensions, methods based on random forests have been adapted for survival analysis [19]. This nonparametric method—random survival forest (RSF)—combines multiple decision trees built on randomly selected subsets of variables. Since feature selection methods are questioned, it seems important to thoroughly assess and compare existing strategies that are significant components in prognostic signature development. Many studies were interested in false discovery rates or prognostic performances achieved by multiple variable selection methods and compared them on simulated or real datasets [2023]. However, the impact of the training set on the stability of the results was only assessed by Michiels et al. [9] on a binary endpoint with a selection based on Pearson’s correlation and did not evaluate most recent approaches.

The main objective of this publication is to compare six typical different feature selection methods which are commonly used for high-dimensional data in the context of survival analysis. For this purpose and as recommended in the literature [24], a simulation study is performed, with special focus on variable selection and prediction performance according to multiple data configurations (sample size of the training set, number of genes associated with survival). Feature selection methods are then applied on published data to explore stability and prognostic performances in a real breast cancer dataset.

2. Material and Methods

2.1. Feature Selection Methods
2.1.1. Limits of the Cox Model in High-Dimensional Data

In low-dimensional data, the semiparametric Cox proportional hazards model is commonly used to study the relationship between covariates and time-to-event endpoints. The regression coefficients related to the Z genes are estimated by maximizing the partial log-likelihood without it being necessary to model the baseline hazard: where is the number of observations, is the event indicator for patient , is the regression parameter vector, is the vector of covariates, and denotes the set of patients at risk before time . However, in the case of high-dimensional data, this model fails to be identifiable. Several methods have been proposed to handle such a case with a number of predictors . In this study, three kinds of computational methods were used to train the models: penalized Cox regression models, component-wise boosting for the Cox model, and RSF.

2.1.2. Penalized Approaches

Penalized Cox regression models make it possible to simultaneously perform coefficient estimation and variable selection. The Least Absolute Shrinkage and Selection Operator (LASSO) is an L1-norm regularization method [17]. Coefficients are estimated by maximizing a penalized log partial-likelihood: with a tuning parameter . The choice of this shrinkage parameter is challenging and is generally obtained by maximizing the cross-validated log-likelihood function (LASSO-cvl). Certain recent publications have highlighted the fact that the cvl method for choosing results in selecting a high number of false positives. Ternès et al. [16] proposed an extension of the cross-validation approach, denoting penalized cross-validated log-likelihood (pcvl), and compared its performances to other existing extensions (adaptive LASSO, percentile LASSO, etc.). This approach, trading off between goodness-of-fit and parsimony of the mode, leads to the selection of fewer genes by applying a more stringent selection criterion. LASSO-pcvl results in the best compromise between a decline in the false discovery rate and no large increase in the false-negative rate and thus was included in our comparison study. On the other hand, LASSO suffers, however, from some limitations. The number of features selected is bounded to the number of patients, and in the case of highly correlated predictors, LASSO tends to select only one of these features, resulting in a random selection in this group of features. This may not be desirable given that genes operating in the same biological pathway may be highly correlated, so taking this combination into account may be relevant.

To alleviate these limitations, Zou and Hastie [18] proposed the Elastic Net method—a penalized regression with the combination of the L1-norm and the L2-norm penalties. The additional L2-regularization term makes it possible to promote a grouping effect, thus removing the limitation of the number of selected variables. Coefficients are estimated by maximizing the partial log-likelihood subject to the penalty:

The regularization parameter and mixing parameter are estimated by cross-validation. For a more flexible alternative to LASSO without having to estimate two parameters, it was proposed to set the default value at 0.5 for the mixing parameter and to estimate only the tuning parameter by cross-validation [11]. A stability selection approach based on subsampling in combination with the Elastic Net algorithm may be performed (BSS Enet) [25, 26], applying the algorithm to subsamples obtained by bootstrapping. The proportion of subsamples in which the biomarker is selected in the model corresponds to the selection probability for this particular biomarker. Only genes selected with an occurrence frequency equal to or larger than (with ) are included in the final model.

We used the routine implemented in the glmnet package to determine the optimal penalty for LASSO-cvl and Elastic Net [27], and the biospear package for LASSO-pcvl. The penalty parameter was chosen based on the 10-fold cross-validation, and the mixing parameter was set at 0.5 for Elastic Net regression and BSS Enet. The threshold selection probability was arbitrarily set at 0.5 for BSS Enet.

The Ridge regression was not included in this study since, unlike L1 regression or a mixture of L1 and L2, L2 penalty () tends to minimize the parameters without providing variable selection.

2.1.3. Boosting Algorithm

Like forward stepwise regression, boosting is an iterative method which starts from a null model and then adapts only one coefficient at a time, the one that maximizes a penalized partial log-likelihood, including a boosting penalty [15]. Previous boosting steps were incorporated in the penalized partial log-likelihood as an offset for the next step. This variable selection method was implemented in the R package CoxBoost. A 10-fold cross-validation was performed to find the optimal number of boosting steps via cv. CoxBoost, with a boosting penalty chosen via the optimCoxBoostPenalty function.

2.1.4. Random Survival Forests

Random forests are nonparametric variable selection methods that have been extended for survival data [19]. The general algorithm consists in drawing bootstrap samples from the dataset, growing a tree for each of them and finally averaging predictions. For each node, a subset of predictor variables was selected as candidates to split data into two daughter nodes, according to the log-rank splitting rule. Then, the best split feature from this subset is used to split the node. Variable hunting was then used for variable selection [28]. Survival trees were built according to the parameters recommended by the authors in the case of high-dimensional data, using the R randomForestSRC package.

2.2. Simulated Datasets

Datasets were simulated with different sample sizes (, 750, or 1000) and a predetermined number of 1500 normally distributed covariates. For each sample size, we generated survival times following an exponential regression model (with baseline equal to 1). Three scenarios with different numbers of truly prognostic biomarkers (, 12, or 50) were investigated. Regression parameters were fixed at -0.11 or -0.22 (resulting in more or less important effects of the true predictors), with hazard ratios of 0.9 or 0.8, respectively. To simulate a biologically relevant gene correlation structure, predictors were divided into subgroups of correlated covariates, with an autoregressive correlation structure [29]: (). The parameter was set to 0.6 after a review of the literature [16, 30, 31] and analysis of several published datasets like METABRIC and TCGA cohorts. Censoring times were, respectively, simulated using the uniform distribution (U [3, 5]), leading to censoring rates from 10 to 30%. Simulation parameters are summarized in Table 1.

2.3. Application on a Published Dataset

A public breast cancer dataset with available survival and gene expression data for 614 patients was used as an application. This dataset was extracted from GitHub (http://github.com/Oncostat/biospear/) to obtain the same data version as Ternès et al. [16]. The endpoint of interest was the distant recurrence-free survival, with a censoring rate of 78.2%. Probes were previously prefiltered according to an interquartile range greater than 1 in order to reduce the number of predictors with low variance across the samples ().

2.4. Comparison of Variable Selection Methods

A resampling strategy was performed to evaluate the six selection methods: LASSO-cvl, LASSO-pcvl, Elastic Net, BSS Enet, CoxBoost, and RSF. Following the strategy used by Michiels et al., both simulated and real datasets were randomly split into 100 training and validation sets with different sample sizes for the training set (1/2 and 2/3 of the overall dataset) [9, 12]. For each training set, the different selection methods were applied to select significant genes and create a risk score for prediction. For the penalized and boosting approaches, the risk score was based on the linear predictor given by the Cox model. For RSF, the risk score was based on the average over the trees of the cumulative hazard estimations computed from the bootstrap samples which exclude this patient in order to reduce optimism bias. Models previously developed were then applied on the respective validation sets.

Based on the feature selection during the training stage, variable selection methods were compared in terms of number of selected genes and stability of the signatures, measured by the frequency of selection of each gene among the 100 training sets. For simulated datasets, both the FDR and the false-negative rate (FNR) could be computed. They corresponded, respectively, to the rates of inactive genes selected and the rates of true biomarkers missed by variable selection methods. Prognostic performances were then evaluated on the validation set by the Brier and C-index scores for risk scores, both implemented in the R pec package [32]. The integrated Brier score (IBS) is the area under the prediction error curve (i.e., the quadratic difference between the observed response and the predicted probability over time). IBS is tending to 0 for a perfect model. The C-index measures the discriminant ability of the model; its interpretation is similar to the classical area under the Receiver Operating Characteristic (ROC) curve. All analyses were implemented with R-3.2.4.

3. Results

3.1. Simulated Datasets
3.1.1. Variable Selection

The number of genes selected by selection methods for each sample size and scenario is presented in Table 2. For the null scenario (, i.e., no active biomarker), BSS Enet and RSF selected genes in all simulated cases, whereas the other methods tended to select no predictors. LASSO-cvl and Elastic Net performed the best with the smallest FDR (respectively, ). Obviously, an FDR of 1 for BSS Enet and RSF was observed (Table 3). For alternative scenarios (), the number of selected genes by boosting and penalized approaches increased when the sample size of the training set increased. On the contrary, for RSF, the number of genes was inversely correlated to the training set sample size. The FDR and FNR were, respectively, minimized by the LASSO-pcvl and Elastic Net approaches. The FDR of BSS Enet was reduced compared to that of Elastic Net (25% decrease in median over different scenarios), but sometimes involved a small FNR increase. Stability results (Table 4) show that the latter increased with the sample size of the training set and decreased when the number of true predictors increased. When the sample size of the training set was maximal (), the occurrence frequency was approximately 100% for more than 50% of the selected predictors, except for RSF with the worst selection stability (; ). If the training sample size is not sufficient to select a large number of predictors, stability tends to decrease in the presence of more true predictors. Some true predictors had poor occurrence frequency, and false predictors may have been more stable in our in silico study. A poor intersection was observed between predictors selected by each approach (Table 5). Among concordant selected predictors, there were few true positives. This proportion can be far less than 50%, particularly when the number of true predictors is moderate () and where large training sample sizes are used.

3.1.2. Prognostic Performances

For a low number of predictors (), the penalized and boosting approaches obtained better prognostic performances with higher C-index and lower IBS than RSF. For this scenario, prognostic performances tended to decrease when sample size increased. In contrast for , all approaches except RSF presented similar performances, and prognostic performances increased with sample size (Figures 1 and 2).

3.2. Application to the Published Dataset

The six selection methods were applied on a breast cancer dataset, with a fraction of a training data equal to 2/3. The number of selected probes was 54 for LASSO-cvl, 13 for LASSO-pcvl, 78 for Elastic Net, 42 for BSS Enet, 58 for CoxBoost, and 83 for RSF. The lists of probes selected by each method are given in Supp Table 1 and the intersections between them in Table 6.

RSF showed poor intersection with other methods. Among the probes selected by multiple methods, interesting targets in carcinogenesis could be identified that play a role in cell proliferation and apoptosis (CCDN1, MET, KIT, FGFR3, …), cell metabolism (BTG1, LPGAT1, CKB, ASNS, …), cell mobility and invasion (CDC42, PXDN, PFN2, …), and immunity (CD55, CXCL13, XBP1, HLA-DQB1, …). The same strategy of resampling was then applied to assess the impact of the training set on selection and performance. The number of selected genes increased when the training sample size increased. For a given sample size of the training set, more genes were selected by RSF than the other selection methods, and LASSO-pcvl was the most stringent method (Table 7). As per stability, 50% of the selected predictors had an occurrence frequency less than 5%. Discriminant capacities (e.g., C-index) were poor for all approaches, and BSS Enet tended to have the lowest performances (Figure 3).

4. Discussion

The main objective of this study was to provide an exhaustive comparison of popular variable selection methods and to offer recommendations for developing prognostic signatures. To our knowledge, this is the first comparison assessing the effect of the constitution of the training set on multiple criteria like stability, false discovery rates, and prognostic performances between classical extensions of the Cox proportional hazards model and random forest, the most commonly used nonparametric method in this context. Using simulated datasets and one real application, we investigated both variable selection and prognostic performances of each approach. Our main conclusion is that while there is great variability in the selection process, all variable selection methods provide good prognostic performance with a satisfactory C-index in most scenarios.

Using simulated datasets based on biologically plausible parameters, we evaluated the influence of both selection methods and sample sizes on selection and prognostic performances. Considering the different comparison criteria, LASSO-pcvl is the most advisable selection method since FDR is minimized whatever the sample size, with a moderate increase in FNR and prognostic performances similar to other selection methods. Conversely, RSF is the worst method in terms of performances with too many false positives selected in most scenarios and a lack of common genes with other approaches. This may be explained by the lack of a prefilter step in the “variable hunting” algorithm recommended for high-dimensional settings which does not limit the number of selected genes. A cross-validation to optimize this number, as for other selection methods, might improve RSF performances. On the other hand, RSF has been proposed to deal with complex relationships between covariates and survival and may perform worse with linear effects. Compared to classical Elastic Net, BSS Enet gives more satisfactory results in alternative scenarios with a lower FDR, despite a slightly higher FNR. The less variables are correlated with true predictors, the better BSS Enet performs, but this is an unlikely assumption in case of omics data. However, the systematic selection of false-positive genes under the null scenarios reduces the attractiveness of this approach. A sensitivity analysis examining the effects of a more stringent stability threshold on the risk score (70 and 80%) did not address this issue. Also, BSS does not improve stability in the selection process or prognostic performances compared to a single step in the present study. As expected, the discriminant capacity of the risk score obtained by the different methods increases when the number of true predictors associated with survival increases and when the number of false-positive genes selected is low. Due to overfitting, there is a decrease in discrimination performances on the validation dataset when the number of false positives increases. In the public dataset, we observed poor stability compared to simulation results and poor prognosis performance. Several reasons may explain these findings, for example, the small sample size. The instability may be explained by the fact that there is no unique solution due to the multidimensionality of the data [10]. Indeed, interactions and correlations between predictors in omics studies are often more complex than in our simulation study. Contrary to popular belief, it is important to note in our study that LASSO was able to simultaneously select correlated biomarkers simultaneously. Other simulations are necessary to determine the impact of the correlation structure on the selection of predictors in this approach.

This simulation study had various limitations. Firstly, other simulation strategies for time-to-event data could have been tested to check whether survival forests are not at too great a disadvantage in a Cox proportional hazards scenario. Secondly, in order to make a decision in clinical practice, it is often helpful to have threshold values that make it possible to determine different risk groups. Various techniques can be employed, like the minimum value approach, cutting the continuous risk score into equal groups (according to the median or other quantiles), or strategies based on ROC curves. Further work is necessary to provide help in selecting the most adequate threshold depending on the clinical question. Then, the number of covariates in our study could be much larger that is often true for genomic data unless a prefilter step is applied; but from this initial work with “moderate” high-dimensional settings, it is clear that variable selection methods achieve even worse stability and performances when . Finally, only one value was considered for the parameter of correlation in the autoregressive correlation structure, Nevertheless, we guess that a greater correlation may not be biologically relevant and that 0.6 is a reasonable choice for maximum correlation in our simulation study and a good compromise to evaluate statistical performance for biologically plausible scenarios.

In the context of precision medicine, gene signatures are usually developed independently from clinical factors, i.e., omics selection, development of a risk score and risk groups followed by adjustment on prognostic factors. This strategy does not make it possible to select genes with a prognostic value independent from a clinical value. Moreover, most of the gene signatures lose their significance after adjustment. To our knowledge, only one in silico study was interested in taking clinical factors into account during the gene selection step [33]. It could be of interest to evaluate the proposed approaches in terms of prognostic performance, but especially in terms of selected genes and associated signaling pathways. In breast cancer research, as over half of the genome is correlated with proliferation [8], it is quite easy to find several significant combinations of genes associated with clinical outcome. Nevertheless, identifying genes involved in signaling pathways other than cell cycles could highlight new therapeutic targets and improve prognostic models with both clinical and genomic data. Another important aspect is that selection methods investigated in this publication do not take into account knowledge of biomarker biological pathways [34]. The application on the breast dataset suggests that, despite some relevant selected genes previously described in the literature, many predictors are selected because of their correlation with true prognostic genes. Thus, an important issue in biomarker discovery is the true functional significance of the variables selected. To properly determine this, biologists, clinicians, and statisticians must work closely together to propose relevant gene signatures.

Data Availability

The breast cancer dataset was extracted from GitHub (http://github.com/Oncostat/biospear/). Simulated datasets are available from the corresponding authors on reasonable request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

We would like to thank “La Ligue Nationale Contre le Cancer, France,” région Midi Pyrénées (Comité de l’Aude, Comité de l’Ariège, Comité des Hautes Pyrénées) for their financial support and Dr. G. Taillefer, native speaker, for English language support. This work was supported by la Ligue Contre le Cancer.

Supplementary Materials

Supp Table 1: list of selected probes and associated genes in breast cancer dataset (sample results). (Supplementary Materials)