Introduction

Glioblastoma is the most common and aggressive primary brain tumor in adults [1,2,3]. It has one of the highest mortality rates among human tumors, with a median survival of 12 to 15 months after diagnosis despite an improved standard of care consisting of maximal safe resection followed by radiotherapy plus concomitant and adjuvant temozolomide [1, 4].

Survival statistics are well described at the population level, and many factors that affect survival have been identified, including age, Karnofsky performance status (KPS), O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation status, isocitrate dehydrogenase (IDH) mutation status, neurological deficit, extent of resection, and tumor multifocality and location, among others [5,6,7]. Yet, predicting the survival of an individual patient remains challenging [7].

In recent years, prognostic models have increasingly been developed to predict the survival of the individual glioblastoma patient [8, 9]. These prognostic models use a wide range of statistical and machine learning algorithms to analyze heterogeneous data sources and predict individual patient survival. This systematic review aims to synthesize current trends, provide an outlook on the prospects for clinical use of prognostic glioblastoma models, and discuss the future direction of survival modeling in glioblastoma patients.

Methods

Search strategy

A search was performed in the Embase, Medline Ovid (PubMed), Web of Science, Cochrane CENTRAL, and Google Scholar databases according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (S1). A professional librarian was consulted to construct the search syntax, using keywords for glioblastoma, prognostic modeling, and overall survival as well as their synonyms (S1). All prognostic models concerning survival in glioblastoma patients were covered by the search syntax. Prediction model studies on glioblastoma patients with overall survival as the primary outcome were included. Predictor finding studies were excluded: these studies focus on characterizing the association between individual variables and the outcome at the cohort level (e.g., identifying risk factors for survival within a population), whereas prediction model studies seek to develop a model that predicts survival as accurately as possible for the individual patient using the optimal combination of variables. No restrictions were applied with regard to participant characteristics, format of the input data, type of algorithm, or validation or testing procedures. Case reports and articles written in languages other than Dutch or English were excluded. No restrictions on the date of publication were applied. The systematic search was complemented by screening the references of included articles to identify additional publications. Titles and abstracts of retrieved articles were screened by two independent authors. Two authors (IRT, SK) independently read the full texts of the potentially eligible articles. Discrepancies were resolved by discussion with a third reviewer (JS).

Data extraction

From all included studies, we extracted the year of publication, name of the first author, title, abstract, source of data, selection criteria, events per variable, number of events, sample size, type of input, hyperparameter tuning, number of predictors in the best performing model, definition of overall survival, algorithm type, validation and testing procedure, performance metric, and model performance. To ensure a systematic approach to assessing the validation of prognostic modeling in glioblastoma, all extracted variables were based on the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) [10]. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was used to assess the risk of bias in all included studies [11].

Results

The search identified 595 unique studies. After screening by title and abstract, 112 studies were selected for full-text review (S2). Of these, 27 articles met our inclusion criteria and were included in the qualitative synthesis [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]. These studies presented a total of 59 models, of which the best performing model per study was included in this review. Two included models used the same database [14, 36]; both were nevertheless included in this systematic review because different predictors and algorithms were used to develop them. The included prognostic models were developed between 2010 and 2019. General and model characteristics of each included study are presented in Supplementary S3. An overview of observations across all identified glioblastoma prognostic models is visualized in Figs. 1, 2, and 3.

Fig. 1 Data source

Fig. 2 Type of input

Fig. 3 Model characteristics: algorithm type and validation strategy across all identified glioblastoma models

Participants

The data that were used to develop glioblastoma survival prediction models were retrieved from clinical trials (n = 2) [26, 30], institutional data (n = 13) [13,14,15, 19, 20, 22, 25, 28, 29, 32,33,34, 36], registry data (n = 9) [12, 16, 18, 21, 23, 24, 27, 31, 38], combined institutional and database data (n = 1) [35], and unspecified data sources (n = 2) [17, 37]. Twelve models used data from consecutive patients [12, 13, 15, 23,24,25,26,27,28,29, 35, 38]. Fifteen models did not explicitly specify the selection criteria or procedures [12, 14, 16, 17, 21,22,23,24, 26, 27, 31, 33, 35,36,37] (Fig. 1).

Type of input

The utilized data sources were clinical parameters (n = 2) [19, 32]; genomics (n = 2) [12, 21]; MRI imaging (n = 4) [22, 23, 34, 37]; combined clinical and genomics (n = 4) [13, 16, 27, 28]; combined clinical and MRI imaging (n = 10) [14, 18, 20, 24, 25, 29,30,31, 35, 38]; combined clinical, MRI imaging, and genomics (n = 3) [15, 26, 36]; histopathology (n = 1) [33]; and combined clinical and pharmacokinetics (n = 1) [17]. Up to 2017, only two studies analyzed high-dimensional data sources (i.e., MRI or genomic information) in addition to clinical information [12, 13]. From 2017 onwards, there was a substantial increase in the use of genomic [15, 16, 21, 26,27,28, 36] and imaging [15, 26, 36, 38] data for survival modeling in glioblastoma patients (Fig. 2).

Algorithm type

Various statistical and machine learning algorithms were used to predict survival in glioblastoma patients, including Cox proportional hazards regression (n = 16) [11,12,13, 16, 19, 20, 24,25,26,27,28, 30, 33, 35, 36, 38], support vector machines (n = 4) [22, 31, 37, 38], random forests (n = 3) [14, 29, 38], convolutional neural networks (n = 2) [18, 23], an adaptive neuro-fuzzy inference system (n = 1) [17], and an unspecified mathematical machine learning model (n = 1) [34]. To date, only two deep learning models (i.e., convolutional neural networks) have been developed for predicting survival in glioblastoma patients [18, 23]. Although classical statistical algorithms are still being used for model development, a rapid increase in the use of machine learning is evident from 2016 onwards (Fig. 3).
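As a rough illustration of how two of these algorithm families are typically instantiated, the sketch below (our own, using synthetic data and scikit-learn rather than the cohort or code of any included study) fits a random forest and a support vector machine to a dichotomized survival label.

```python
# Illustrative sketch only: synthetic data standing in for clinical predictors,
# with survival dichotomized into short vs. long survivors. Assumes scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # e.g., age, KPS, tumor features
y = (rng.uniform(size=200) > 0.5).astype(int)    # 1 = long survivor, 0 = short survivor

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

# Both models output a probability of long survival for new patients
print(rf.predict_proba(X[:3])[:, 1])
print(svm.predict_proba(X[:3])[:, 1])
```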

Outcome definition

Overall survival was modeled as a continuous (n = 7) [15, 17, 26, 28, 30, 35, 38], binary (n = 11) [16, 19, 22, 24, 25, 27, 29, 31, 34, 37, 38], or time-to-event outcome (n = 11) [12,13,14, 18, 20, 21, 23, 32, 33, 36, 38]. In studies that defined survival as a binary outcome, survival was dichotomized into short and long survival at cutoffs of 6 [19], 12 [27, 38], or 18 months [37], 400 days [31], or the median [29] or mean [25] overall survival of the training cohort.
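To make the three outcome formats concrete, the minimal sketch below (a hypothetical example with pandas, not data from any included study) encodes the same follow-up information as a continuous, a binary, and a time-to-event outcome; note that a patient censored before the chosen cutoff could not be labeled in the binary format, which is one argument for time-to-event modeling.

```python
# Hypothetical example: three patients with known follow-up, one right-censored.
import pandas as pd

df = pd.DataFrame({
    "followup_months": [14.2, 7.5, 25.0],   # continuous outcome (observed survival / follow-up)
    "death_observed":  [1,    1,   0],      # 0 = alive at last follow-up (censored)
})

# Binary outcome: dichotomize at 12 months, one of the cutoffs used above.
df["long_survivor"] = (df["followup_months"] >= 12).astype(int)

# Time-to-event outcome: keep the (time, event) pair so censoring is preserved.
time_to_event = df[["followup_months", "death_observed"]]
print(df)
```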

Validation and testing procedures

Hyperparameter settings were optimized using various validation strategies. Thirteen models divided the original dataset into separate training and validation sets [12, 14,15,16, 19, 20, 23, 27,28,29, 32, 35, 36]. Twelve studies applied a cross-validation strategy, including leave-one-out (n = 4) [22, 24, 25, 34], 5-fold (n = 3) [13, 31, 38], 10-fold (n = 2) [18, 26], 33-fold (n = 1) [17], leave-3-out (n = 1) [37], and unspecified cross-validation (n = 1) [21]. Bootstrap validation was mentioned in three studies [26, 30, 36]. Two studies used the same subset for validation and testing [27, 33]. Most studies used a separate, prespecified subset of the original data as the hold-out test set to avoid overfitting on the validation set. Seven studies evaluated model performance on patients from a different data source, developing the model on patients from one institution and testing it on patients from another [14,15,16, 18, 23, 26, 32].
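A typical version of this workflow, sketched below with scikit-learn on synthetic data (our own example, not a reconstruction of any included study), tunes hyperparameters by 5-fold cross-validation on the training portion and reports performance once on a prespecified hold-out test set.

```python
# Sketch of hold-out testing plus k-fold cross-validation for hyperparameter tuning.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (rng.uniform(size=300) > 0.5).astype(int)

# Prespecified hold-out test set, left untouched until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training data to choose hyperparameters
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Performance is reported once, on the hold-out test set
print(search.best_params_, round(search.score(X_test, y_test), 3))
```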

Performance metrics and performance

The performance of a prognostic model can be expressed with various statistics depending on the outcome format used in the model. In prognostic models, discrimination and calibration are among the most commonly used metrics of model performance [39, 40]. Discrimination is the ability to distinguish cases from non-cases and is often expressed as the area under the receiver operating characteristic curve (AUC) [41] or Harrell's C-index [42, 43]. Harrell's C-index is an extension of the AUC that considers the occurrence of the event as well as the length of follow-up, making it particularly well-suited for right-censored survival data [42]. Calibration constitutes the agreement between the observed and predicted outcomes and is often expressed graphically as a calibration plot or numerically as the calibration slope and intercept [41]. Definitions of other performance metrics can be found in Table 1 [41, 42].

Ten models [16, 19, 22, 24, 25, 27, 29, 31, 34, 37] with a binary outcome format expressed performance as the AUC, ranging between 0.58 [19] and 0.98 [16] (n = 7) (Fig. 4) [16, 17, 19, 24, 27, 30, 37]; accuracy, ranging from 0.69 [29] to 0.98 [31] (n = 5) [22, 25, 29, 31, 37]; and a calibration curve (n = 1) [25]. Studies modeling survival as a continuous outcome (n = 7) [15, 17, 26, 28, 30, 35, 38] measured prediction performance using Harrell's C-statistic, which ranged between 0.66 [26] and 0.70 [15, 28] (n = 3) [15, 26, 28], and a calibration plot (n = 2) [26, 38]. The eleven time-to-event models [12,13,14, 18, 20, 21, 23, 32, 33, 36, 38] applied the C-index, ranging between 0.70 [12, 14, 32, 38] and 0.82 [13] (Fig. 4) (n = 10) [12,13,14, 18, 20, 21, 23, 32, 36, 38], and one model [33] did not describe the model performance [41,42,43] (Table 1).
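The sketch below (our own, on synthetic predictions; scikit-learn and lifelines are assumed to be available) shows how the three metrics discussed above are typically computed: AUC for a binary outcome, Harrell's C-index for right-censored survival times, and a simple calibration curve.

```python
# Illustrative computation of discrimination and calibration metrics on synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)

# Discrimination for a binary outcome: area under the ROC curve (AUC)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.3, size=200), 0, 1)
print("AUC:", round(roc_auc_score(y_true, y_prob), 3))

# Discrimination for right-censored survival: Harrell's C-index
# (a higher predicted score should correspond to longer observed survival)
observed_months = rng.exponential(15, size=200)
event_observed = rng.integers(0, 2, size=200)          # 0 = censored
predicted_survival = observed_months + rng.normal(0, 5, size=200)
print("C-index:", round(concordance_index(observed_months, predicted_survival, event_observed), 3))

# Calibration: agreement between predicted probabilities and observed fractions
frac_observed, mean_predicted = calibration_curve(y_true, y_prob, n_bins=5)
print(list(zip(mean_predicted.round(2), frac_observed.round(2))))
```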

Table 1 Definition of performance metrics

Fig. 4 Performance score

Online prediction tools

Three studies have translated their model into an online prediction tool, making the models more actionable and useful for individual prognostication in glioblastoma patients [26, 30, 38]. Although these three models included radiographic features, such as tumor size and extension, these features have to be interpreted or measured manually by a human expert and entered into the model. Therefore, none of the online prediction tools uses raw MRI data; all exclusively use structured clinical parameters. Although studies have developed deep learning models utilizing unstructured data, e.g., genomics [26] or MRI imaging [38], none of these models has yet been translated into an actionable clinical prediction tool.

Risk of bias

Risk of bias was assessed using the PROBAST tool [11] and CHARMS guidelines [10]. Three models had a potentially high risk of bias [14, 16, 33], seven models had an intermediate risk [13, 21, 22, 26, 29, 31, 35], and 17 models had a low risk. The assessment identified a risk of bias in the following domains: participants (n = 4) [17, 26, 31, 37], predictors (n = 10) [12,13,14, 16, 20,21,22, 26, 29, 33], outcomes (n = 7) [13, 14, 16, 21, 22, 29, 33], and analysis (n = 7) [14, 16, 25, 31, 33, 35, 36]. When imaging parameters are measured manually, ambiguity in assessment can occur; there was no form of double-blinded assessment of the MRI images, which limits objectivity. Additionally, models did not define how some parameters, such as extent of resection, were measured. Some models also appeared prone to overfitting in the analysis, for example through a skewed division of the training and testing datasets, a low number of events per variable, or a failure to account for missing data. Fourteen models did not include all initially included patients in the analysis of the best performing model [12,13,14,15, 19, 21, 25, 27,28,29,30, 32, 34, 36]. Of these 14 models, 7 did not describe the type and/or frequency of missing data [12, 15, 19, 21, 25, 29, 34] (Table 2) (S4).

Table 2 ROB assessment included models

Discussion

This systematic review demonstrates the lack of widespread validation and clinical use of the existing glioblastoma models. Despite the increasing development of survival prediction models for glioblastoma patients, only seven models have been validated retrospectively in an external patient cohort [14,15,16, 18, 23, 26, 32], and none has been validated prospectively. Furthermore, three models, all of which were developed from a statistical algorithm, have been deployed as publicly available prediction tools [26, 30, 38], but none has been implemented as a standardized tool to guide clinical decision-making. Lastly, no trend in performance was seen over time despite the increasing use of machine learning methods, and to date no prognostic glioblastoma study has had consequences for clinical decision-making.

Prognostic models have the potential to help tailor clinical management to the needs of the individual glioblastoma patient by providing a personalized risk-benefit analysis. The rise of machine learning algorithms, including deep learning, enables the use of high-dimensional data, such as free text and imaging, to improve the accuracy and performance of prediction models. The increasing use of machine learning for the analysis of unstructured, high-dimensional data parallels current trends in predictive modeling in medicine [44, 45]. Neurosurgical examples include machine learning algorithms for glioblastoma, deep brain stimulation, traumatic brain injury, stroke, and spine surgery [38, 44,45,46,47,48,49,50,51,52,53]. Deep learning algorithms are also increasingly being used to refine the WHO 2016 classification of high-grade gliomas using histological and biomolecular variables for more precise diagnosis and classification of gliomas [54,55,56].

Furthermore, deep learning algorithms are frequently used in radiotherapy research for automated skull stripping, automated segmentation, or delineation of resection cavities for stereotactic radiosurgery [57,58,59,60]. Despite the ubiquity of high-performing models in clinical research, none has been translated to the clinical realm and integrated into clinical decision-making. Few prognostic models are put into practice across medical specialties. For other diseases, such as breast cancer, in which radiological and genetic factors are better characterized than in glioblastoma, more than sixty models to prognosticate survival have been identified [61]. Nevertheless, this has had little consequence for clinical care [62]. Seemingly, clinical implementation is not a matter of robustness of evidence or notoriety of the disease. Moons et al. similarly found that robust validation studies are missing for most prognostic models and that most validation studies include a relatively small patient cohort, which limits the models' generalizability [62]. This raises the question: what needs to happen before prognostic models can be used in the clinic?

Computational challenges

First, machine learning algorithms are accompanied by unique computational challenges that could limit the clinical implementation of these models. Due to the high dimensionality of the input data, machine learning models are inherently limited in their generalizability. Computer vision models developed on single-institution data perform poorly on data from other institutions when different scanning parameters, image features, and other formatting methods are used [63]. In contrast, prediction models that exclusively use structured clinical parameters, such as age, gender, and the presence of comorbidity, are more generalizable and implementable across institutions, as this information is less subject to institution-specific data acquisition methods [26, 30, 38]. This could be one of the reasons that the three currently available prediction models all exclusively use structured clinical information. Therefore, if unstructured data and machine learning algorithms are to be incorporated into clinical prognostic models at a multicenter level, harmonization and standardization procedures between institutions are required.
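As one elementary example of such harmonization, the hedged sketch below (our own, with numpy; real pipelines typically add resampling, bias-field correction, and shared feature definitions on top of this) applies per-volume z-score intensity normalization so that scans acquired with different scanner settings end up on a comparable intensity scale.

```python
# Minimal harmonization sketch: per-volume z-score intensity normalization.
import numpy as np

def zscore_normalize(volume, mask=None):
    """Rescale voxel intensities to zero mean and unit variance.

    If a boolean brain mask is given, the statistics are computed inside
    the mask only, so background voxels do not dominate the rescaling.
    """
    voxels = volume[mask] if mask is not None else volume
    return (volume - voxels.mean()) / voxels.std()

# Toy stand-in for a T1-weighted scan with a scanner-specific intensity scale
scan = np.random.default_rng(0).normal(loc=300.0, scale=40.0, size=(64, 64, 64))
normalized = zscore_normalize(scan)
print(round(float(normalized.mean()), 3), round(float(normalized.std()), 3))
```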

Furthermore, most machine learning algorithms primarily accommodate binary or continuous outcomes, whereas survival data are typically right-censored, meaning the value of an observation is only partially known (i.e., the patient is known to have survived at least beyond a specific follow-up time). Approaches that have been considered for handling such data, such as discarding censored observations or including them twice in the model (once as an event and once as event-free), introduce bias into the risk estimates. In addition, novel approaches have been introduced, such as modifying specific machine learning algorithms to handle censoring or weighting observations according to the amount of censoring in the sample [64,65,66]. This highlights the need to translate existing machine learning algorithms into alternatives that can accommodate time-to-event survival data as well.
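For illustration, the sketch below (synthetic data; lifelines is assumed as the library, and this is not the implementation used by any included study) fits a Cox proportional hazards model in which every patient contributes a follow-up time and an event indicator, so right-censored patients are used directly rather than discarded or duplicated.

```python
# Sketch of a survival model that natively handles right-censoring.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(60, 10, size=n),
    "kps": rng.choice([70, 80, 90, 100], size=n),
    "followup_months": rng.exponential(15, size=n),
    "death_observed": rng.integers(0, 2, size=n),   # 0 = right-censored
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_months", event_col="death_observed")
cph.print_summary()

# Relative risk (partial hazard) for new patients, usable for ranking
print(cph.predict_partial_hazard(df[["age", "kps"]].head(3)))
```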

Clinical challenges

Machine learning algorithms are also accompanied by unique clinical challenges that could limit their clinical implementation. As medical computational science progresses, critical unanswered questions arise: to what extent should a medical professional rely on technology, and how should an inevitable predictive miscalculation of the algorithm be intercepted? First, the black-box nature of many algorithms, e.g., the hidden layers of neural networks, substantially reduces the interpretability of a potentially high-performing prediction model and thereby limits its clinical deployment. However, this is not a new phenomenon: therapeutic measures can be implemented in clinical care based on studies demonstrating their safety and efficacy, yet without the underlying mechanisms being fully clarified. In addition, algorithms learn from real-world data, and real-world disparities could therefore propagate into the developed models. This could sustain or even amplify existing healthcare disparities, such as ethnic or racial biases [67]. If survival prediction models were to be clinically deployed, should these algorithms be used as a directive or a supportive application in clinical decision-making? There is insufficient information and experience to date to fully answer this question. Liability can become an issue if a misprediction, e.g., a false positive, is made and clinical decisions are influenced by it: is the clinician responsible for the algorithm's error? Therefore, medical professionals should be cautious when relying on technology and should strive to understand predictive algorithms and their inevitable limitations.

Standardizing model evaluation prior to clinical implementation

These computational and clinical challenges have led to the development of standardized methods for assessing predictive models prior to clinical implementation, in both the diagnostic and prognostic realms. The use of methylome data in neuro-oncology [68] holds promise for clinical use, as recommended by the iMPACT-now guidelines [69]. This study performed a prospective external validation in five different centers to test the accuracy of the model; in addition, the model was tested in two different laboratories for technical robustness. This could be the most important step towards clinical deployment of prognostic glioblastoma models as well. Moreover, prognostic radiomics models for glioblastoma patients demonstrate significant potential for noninvasive pathological diagnosis and prognostic stratification [18, 23]. Yet, technical challenges need to be overcome before these models can be implemented, namely access to larger imaging datasets, common criteria for feature definitions and image acquisition, and wide-scale validation of a single radiomics model [70]. Another neuro-oncological model that is widely used for research purposes is the ds-GPA, a diagnosis-specific graded prognostic assessment for brain metastases [71]. Sperduto et al. described the shortcomings of the ds-GPA but, more importantly, reported possible consequences for clinical decision-making per ds-GPA prognostic score [71]. This offers clinicians insight into the utility of the ds-GPA. A caveat of prognostic glioblastoma models is the lack of appraisal of the collected clinical data. Moreover, the FDA considers appraisal of the data and subsequent analysis of all gathered data a prerequisite for clinical deployment of software as a medical device. In short, datasets that could be considered "pivotal" for demonstrating superior performance, safety, and specific risk definitions should be identified before clinical deployment or introduction into guidelines [72].

Limitations

There are several limitations to the current systematic review. First, a quantitative meta-analysis, which would have been preferred, was not possible due to the methodological heterogeneity across studies, and differences in model performance should therefore be interpreted with caution. Furthermore, publication bias may influence the results: high-performing models are more likely to be published, and because low-performing models remain unpublished, common bottlenecks may remain unexposed, resulting in a duplication of futile efforts. To the best of our knowledge, there are no previous systematic reviews presenting a general overview of the emerging field of survival modeling in glioblastoma patients.

Future directions

As the field of survival modeling in glioblastoma patients gravitates towards high-dimensional models, future research efforts should focus on harmonization and standardization to increase the volume of available training data, the accuracy of developed models, and the generalizability of their associated prediction tools. At present, studies predominantly report prediction performance, yet many secondary characteristics determine whether a model can be implemented in clinical practice. Therefore, future studies should concentrate not only on model performance but also on secondary metrics, such as interpretability and ease of use, that are relevant for clinical deployment [38]. Moreover, future research should focus on clinical utility, i.e., explaining how and when clinicians should alter the treatment plan of the glioblastoma patient. Lastly, considering the ethical and clinical implications in parallel with model development could ensure a safe and sound implementation of this rapidly emerging technology.

Conclusion

The use of machine learning algorithms in prognostic survival models for glioblastoma has increased progressively in recent years. Yet, no machine learning model has led to an actionable prediction tool to date. For successful translation of a tool to the clinic, multicenter standardization and harmonization of data are needed. Future studies should focus not only on model performance but also on secondary model characteristics, such as interpretability and ease of use, and on the accompanying ethical challenges, to ensure a safe and effective implementation in clinical care.