Introduction

Allogeneic hematopoietic stem cell transplantation (HSCT) remains an indispensable curative therapy for several malignant and nonmalignant hematologic conditions. Outcomes of HSCT have improved over the years with advances in supportive care and therapeutic modalities. Additionally, our ability to identify patients at higher risk of adverse outcomes related to disease and/or transplant characteristics, and thereby to individualize treatment, continues to be refined. Traditional outcome predictors in HSCT include patient age, comorbidity burden, disease status, HLA- and ABO-matching disparities, and other host- and disease-related factors [1,2,3,4,5].

Several tools have been published to inform critical decisions in HSCT, including the risk of relapse post-HSCT, nonrelapse mortality (NRM), and overall survival (OS). These tools also help stratify patients according to the relative risks conferred by independent disease-related and patient characteristics [6, 7]. Additionally, they guide patient counseling and help physicians individualize transplant management.

The most widely used prognostic tools are the Hematopoietic Cell Transplantation-specific Comorbidity Index (HCT-CI) and HCT-CI/age, which were adapted from the Charlson Comorbidity Index (CCI) for assessment of HSCT patients and have been validated in a large dataset [6, 7]. These indices are primarily used to objectively assess organ function and predict NRM and OS. The Disease Risk Index (DRI), or refined DRI (rDRI), predicts OS primarily based on the type and status of disease prior to HSCT [8, 9]. Other multivariable tools in use are the European Group for Bone Marrow Transplantation (EBMT) risk score [10], the pretransplant assessment of mortality (PAM) score [11], and, more recently, the acute leukemia-EBMT (AL-EBMT) model [12] and the hematopoietic cell transplantation composite risk (HCT-CR) model [13]. These prediction tools differ from one another with respect to the variables included, the disease groups studied, the endpoints, and the model-building, validation, and calibration methodologies [14]. The c-statistic, which quantifies how well a prognostic tool discriminates between patients, also varies between tools, at least partly depending on the variables incorporated. Additionally, many advances have occurred over the years with respect to identification of important variables, therapeutic modalities, and treatment selection that were not accounted for in most of the older existing models. Therefore, there is a constant effort to improve and develop holistic prognostic scoring systems as newer significant variables and statistical methods are identified.

In this study, we hypothesized that integration of more contemporarily used recipient, donor, and transplant characteristics would improve prediction of post-transplant survival outcomes compared with currently published tools. After model building, we validated our tool in an external patient dataset; the results are presented here.

Patients and methods

This study includes two cohorts of patients from the University of Iowa Health Care (UIHC) and Mayo Clinic (MC). Patients ≥18 years of age who received a first HSCT from a peripheral blood stem cell (PBSC) source for any malignant hematologic indication between 2010 and 2016, from HLA-matched related (MRD), HLA-matched unrelated (MUD), HLA-mismatched unrelated (MMUD), or HLA-mismatched related (MMRD/haploidentical) donors, were included. Matching at the HLA-A, -B, -C, and -DRB1 loci defined matched status.

Patients who underwent HSCT from a bone marrow source and those with incomplete or missing data were excluded. After obtaining IRB approval from the respective institutions, we collected demographic, clinical, and outcome data.

Endpoints and definitions

The primary endpoints for the models were two-year disease-free survival (DFS), defined as the time from the initial allogeneic transplant to relapse or death due to HCT-related causes, and two-year overall survival (OS), defined as the time from the initial allogeneic transplant to death due to any cause. Patients alive and without relapse at two years were censored.
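The endpoint definitions above translate directly into survival objects with administrative censoring at two years. The following is a minimal sketch (not the study code) assuming a data frame `df` with per-patient follow-up in years; all column names are illustrative.

```r
## Minimal sketch of encoding the two endpoints, assuming follow-up times in years.
## Column names (time_to_relapse_or_death, relapse_or_death, time_to_death_or_last_fu,
## dead) are hypothetical, not from the study database.
library(survival)

## DFS: event = relapse or death due to HCT-related causes; censored at 2 years
dfs_time   <- pmin(df$time_to_relapse_or_death, 2)
dfs_status <- ifelse(df$relapse_or_death == 1 & df$time_to_relapse_or_death <= 2, 1, 0)

## OS: event = death from any cause; censored at 2 years
os_time   <- pmin(df$time_to_death_or_last_fu, 2)
os_status <- ifelse(df$dead == 1 & df$time_to_death_or_last_fu <= 2, 1, 0)

dfs <- Surv(dfs_time, dfs_status)
os  <- Surv(os_time,  os_status)
```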

The intensity of conditioning regimens was defined per Bacigalupo et al. [15]. HCT-CI and rDRI were defined as previously described by Sorror et al. and Armand et al., respectively [6, 9].

Statistical analysis

The training dataset included 273 patients treated at UIHC, and the external testing dataset included 348 patients treated at MC.

Using the training dataset, a lasso-penalized Cox regression model was applied to identify prognostic predictors of two-year DFS and OS. Predictors under consideration included recipient (age <55 vs ≥55, sex, KPS <90 vs ≥90, HCT-CI, ABO type, and CMV status), disease (type, rDRI), donor (age <30 vs ≥30, sex, ABO type, and CMV status), and transplant (preparative regimen, year of transplant, related/unrelated donor, and match/mismatch) characteristics. The lasso penalty parameter was derived as the mean of 1000 iterations of 10-fold cross-validation. The median and IQR of the time-dependent area under the curve (AUC), obtained from 1000 bootstrap samples using the method proposed by Uno et al. [16], were used to assess internal model validation. To assess internal model calibration, a risk score was computed from the regression coefficients, and patients were stratified at the median risk score.
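The workflow described above can be sketched with the glmnet package cited in this paper. The code below is an illustrative outline only (not the authors' code, and the bootstrap loop for the AUC is omitted); the data frame `train`, its `time`/`status` columns (in years), and the predictor names are assumptions, and survAUC is one available implementation of Uno's time-dependent AUC.

```r
## Sketch: lasso-penalized Cox regression with a cross-validation-derived penalty,
## linear-predictor risk score, median split, and Uno time-dependent AUC.
library(glmnet)
library(survival)
library(survAUC)

## Candidate predictors (names are illustrative placeholders)
x <- model.matrix(~ age_ge55 + sex + kps_lt90 + hct_ci + rdri + recipient_cmv +
                    donor_age_ge30 + donor_cmv + regimen + donor_type,
                  data = train)[, -1]
y_mat  <- cbind(time = train$time, status = train$status)  # format expected by glmnet's Cox family
y_surv <- Surv(train$time, train$status)

## Penalty parameter: mean of lambda.min over 1000 repeats of 10-fold cross-validation
set.seed(2021)
lambdas     <- replicate(1000, cv.glmnet(x, y_mat, family = "cox", nfolds = 10)$lambda.min)
lambda_star <- mean(lambdas)

## Final lasso Cox fit; risk score = linear predictor from the selected coefficients
fit <- glmnet(x, y_mat, family = "cox", lambda = lambda_star)
train$risk <- as.numeric(predict(fit, newx = x, type = "link"))

## Stratify patients at the median risk score (low vs high risk)
train$risk_group <- ifelse(train$risk <= median(train$risk), "low", "high")

## Time-dependent AUC (Uno et al.) evaluated at several horizons up to 2 years
auc <- AUC.uno(Surv.rsp = y_surv, Surv.rsp.new = y_surv, lpnew = train$risk,
               times = seq(0.5, 2, by = 0.25))
auc$auc
```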

We used Harrell’s concordance index (c-index), for which a value of 1.0 indicates perfect discrimination and a value of 0.50 indicates discrimination no better than chance alone.
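For illustration, the c-index for the risk score from the sketch above can be obtained with the survival package; this is a hypothetical example, not the authors' code.

```r
## Harrell's c-index for the linear-predictor risk score; reverse = TRUE because
## a larger risk score predicts shorter survival.
library(survival)
cidx <- concordance(Surv(time, status) ~ risk, data = train, reverse = TRUE)
cidx$concordance   # 1.0 = perfect discrimination, 0.5 = chance
```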

Differences in two-year DFS and OS between risk strata were evaluated using the log-rank test. Optimism-corrected (1000 bootstrap samples) predicted survival probabilities were compared with observed survival probabilities at two years. Median predicted survival probabilities were plotted against the median observed survival probabilities, along with 95% confidence intervals estimated by the Kaplan–Meier method, for each risk stratum.
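A minimal sketch of these internal checks is shown below, assuming the `train` data frame and `risk_group` from the earlier sketch; the optimism-correction bootstrap is omitted and this is not the authors' code.

```r
## Log-rank test between risk strata and observed two-year survival per stratum.
library(survival)

## Log-rank test for the endpoint of interest (DFS or OS)
survdiff(Surv(time, status) ~ risk_group, data = train)

## Observed two-year survival (Kaplan-Meier) with 95% CI per stratum,
## to be compared against the model-predicted two-year probabilities
km <- survfit(Surv(time, status) ~ risk_group, data = train)
summary(km, times = 2)

## Predicted two-year survival per patient follows S(2 | x) = S0(2)^exp(lp),
## with the baseline survival S0 estimated from the fitted Cox model; the median
## predicted probability per stratum is then plotted against the KM estimate above.
```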

The model derived in the building phase was then applied to the testing dataset. External model validation was assessed by constructing a time-dependent ROC curve. To assess external model calibration, patients were stratified by risk score, and two-year DFS and OS differences between risk strata were evaluated using the log-rank test. Additionally, median predicted survival probabilities were plotted against the median observed survival probabilities at two years, along with 95% confidence intervals estimated by the Kaplan–Meier method, for each risk stratum.
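The key point of external validation is that the coefficients fixed in the training phase are applied unchanged to the testing cohort. The sketch below illustrates this, assuming a data frame `test` with the same predictor columns; the choice of the training-derived median as the cut point is an assumption of this sketch, not a statement of the authors' exact procedure.

```r
## Sketch: apply the frozen training-phase model to the external testing cohort.
x_test <- model.matrix(~ age_ge55 + sex + kps_lt90 + hct_ci + rdri + recipient_cmv +
                         donor_age_ge30 + donor_cmv + regimen + donor_type,
                       data = test)[, -1]
test$risk <- as.numeric(predict(fit, newx = x_test, type = "link"))

## Time-dependent AUC on the test set (Uno's estimator)
auc_test <- AUC.uno(Surv.rsp     = Surv(train$time, train$status),
                    Surv.rsp.new = Surv(test$time, test$status),
                    lpnew = test$risk, times = seq(0.5, 2, by = 0.25))

## Risk stratification (training-derived cut point assumed), log-rank test,
## and observed two-year survival per stratum in the test cohort
test$risk_group <- ifelse(test$risk <= median(train$risk), "low", "high")
survdiff(Surv(time, status) ~ risk_group, data = test)
summary(survfit(Surv(time, status) ~ risk_group, data = test), times = 2)
```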

All analyses were conducted using SAS v9.4 (SAS Institute, Cary, NC) or R (www.r-project.org) with the glmnet [17] and hdnom [18] packages.

Results

Comparison of patient cohorts

The baseline and transplant clinical characteristics are summarized in Table 1. The UIHC cohort included 273 patients who received their first HSCT between 2010 and 2015, and the MC cohort included 348 patients who received their first HSCT between 2010 and 2016.

Table 1 Patient demographic and transplant characteristics of the two cohorts.

Disease, DRI, HCT-CI, conditioning regimen, year of transplant, recipient CMV status, transplant type and match, and donor age and CMV status differed significantly between cohorts. Notably, acute myelogenous leukemia (AML) was the most common indication (43.6% and 49.4%), followed by myelodysplastic syndrome (MDS) and myeloproliferative neoplasm (MPN) (22.0% vs 23.0%) and acute lymphoblastic leukemia (ALL) (16.8% vs 13.8%) in the UIHC and MC cohorts, respectively. The “other diagnosis” category comprised, in the UIHC cohort, 18 chronic myeloid leukemia (CML), 1 Hodgkin lymphoma (HL), 10 other leukemias (OL), and 1 plasma cell disorder (PCD), and, in the MC cohort, 2 chronic lymphocytic leukemia (CLL), 11 CML, and 31 OL.

The UIHC cohort included more patients with high/very-high DRI (35.5% vs 20.4%), high HCT-CI scores (70.3% vs 27.0%), and myeloablative conditioning (72.9% vs 64.1%) than the MC cohort. Conversely, more patients in the MC cohort received grafts from a related donor (57.5% vs 39.2%), a matched (MRD or MUD) donor (96.3% vs 83.9%), and a donor older than 30 years (71.8% vs 51.3%) compared with UIHC.

Outcomes

Two-year DFS was 58% and 59%, and two-year OS was 61% and 66%, for the UIHC and Mayo cohorts, respectively (Fig. 1).

Fig. 1: Two-year disease-free survival and overall survival for the training (UIHC) and testing (Mayo Clinic) cohorts.

Figure 1 shows two-year DFS (A) and two-year OS (B) for the UIHC and Mayo Clinic cohorts, which were 58% and 59%, and 61% and 66%, respectively. Disease-free survival (DFS) is defined as the time from the initial allogeneic transplant to relapse or death due to HCT-related causes; overall survival (OS) is defined as the time from the initial allogeneic transplant to death due to any cause. Patients alive and without relapse at two years were censored.

Two-year disease-free survival

After application of the lasso-penalized Cox regression model, the final model included the following variables: performance status, disease risk index, comorbidity index, patient CMV status, donor CMV status, and donor age (Table 2). The median AUC for the prediction of two-year DFS in the training set was 0.71 (IQR 0.70–0.72), demonstrating good internal discrimination (Fig. 2A). Additionally, the AUC was relatively consistent between 1 and 2 years post-transplant. Internal model calibration showed good agreement between observed and predicted survival probabilities (Supplementary Fig. 1A), further supported by a significant difference in DFS between risk groups (p < 0.01). Two-year DFS was 76% and 40% for the low- and high-risk groups, respectively (Fig. 3A).

Table 2 Variables selected for the model.
Fig. 2: Time-dependent AUC for disease-free and overall survival.

A Panels show time-dependent AUC values for disease-free survival in the training and testing cohorts, at 0.71 and 0.61, respectively. B Panels show time-dependent AUC values for overall survival in the training and testing cohorts, at 0.70 and 0.61, respectively.

Fig. 3: Risk-stratified 2-year disease-free survival and overall survival.

A Panels show a significant difference in two-year DFS between low- and high-risk patients stratified by the model, at 76% and 40% in the training set and 69% and 50% in the testing set, respectively. B Panels show a significant difference in two-year OS between low- and high-risk patients stratified by the model, at 76% and 47% in the training set and 75% and 56% in the testing set, respectively.

After applying the final model to the testing dataset, the AUC was 0.61 at two years and remained consistent between 1 and 2 years post-transplant (Fig. 2A). External model calibration showed good agreement (Supplementary Fig. 1A), further supported by a significant difference in DFS between risk groups (p < 0.01). Two-year DFS was 69% and 50% for the low- and high-risk groups, respectively (Fig. 3A).

Two-year overall survival

After application of the lasso-penalized Cox regression model, the final model included the following variables: performance status, disease risk index, comorbidity index, donor CMV status, and donor age (Table 2). The median AUC for the prediction of two-year OS in the training set was 0.70 (IQR 0.69–0.71), demonstrating good internal discrimination (Fig. 2B). Additionally, the AUC was relatively consistent between 1 and 2 years post-transplant. Internal model calibration showed good agreement between observed and predicted survival probabilities (Supplementary Fig. 1B), further supported by a significant difference in OS between risk groups (p < 0.01). Two-year OS was 76% and 47% for the low- and high-risk groups, respectively (Fig. 3B).

After applying the final model to the testing dataset, the AUC was 0.61 at two years and remained consistent between 1 and 2 years post-transplant (Fig. 2B). External model calibration showed good agreement (Supplementary Fig. 1B), further supported by a significant difference in OS between risk groups (p < 0.01). Two-year OS was 75% and 56% for the low- and high-risk groups, respectively (Fig. 3B).

Discussion

In this study, we present a new, externally validated composite prognostic tool for hematologic malignancies to predict two-year DFS and OS following HSCT. For two-year DFS, the patient’s performance status, HCT-CI, and rDRI, the CMV status of the patient and donor, and the age of the donor had a significant impact. Model discrimination assessed by the c-statistic was 0.71 and 0.62 in the training and testing datasets, respectively, for two-year DFS. The results for two-year OS were similar, except that patient CMV status was not included in the final model. Additionally, model calibration showed good agreement between predicted and observed outcomes in the training and testing cohorts, demonstrating consistent performance of the tool. Finally, using this model, we could discriminate patient cohorts into two distinct risk groups with significantly different two-year DFS and OS rates; the high-risk group had a significantly lower two-year OS of 56% compared with 75% in the lower-risk group. An important feature of our model is that it captures the most crucial pretransplant recipient, disease, and donor characteristics known to influence transplant outcomes, with a c-statistic of 0.62.

Of the numerous variables conventionally used to assess risk, patient age, performance status, and comorbidity burden remain among the most powerful prognostic factors in oncology, including HSCT [19].

Various models have evaluated several of these important variables in combination to predict outcomes. The important differences between existing tools and our model, including the endpoints used for prediction, are summarized in Table 3.

Table 3 Comparison of allogeneic HSCT prognostic scoring systems.

CMV serostatus of the donor and recipient remains a significant determinant of important HSCT outcomes such as DFS and OS, beyond its direct impact on CMV reactivation-associated morbidity and mortality [20]. Additionally, a few studies have suggested a possible favorable effect of positive CMV serostatus of the donor and/or recipient on early immune reconstitution [21] and on reduced relapse [22].

ABO matching between the recipient and donor is another critical variable considered during donor selection. There have been conflicting reports about the impact of ABO mismatching on HCT outcomes. While some major registry and single-institution studies showed an adverse impact in the form of increased GVHD, NRM, or worse OS [23,24,25,26], a few other studies, including a recent analysis, did not find any major impact on outcomes [27]. In our model, the ABO status of the recipient and donor was not a significant variable for DFS or OS prediction.

These variables have been studied, either independently or in combination, in multiple predictive tools.

The HCT-CI and the comorbidity-age index (HCT-CI/age, which additionally accounts for patient age) were the first prognostic tools developed to estimate NRM and OS [6, 28]. Despite many attempts to augment their predictability [29], no major improvements in the c-statistic were noted [28], and the original HCT-CI still remains one of the most widely used tools for NRM prognostication in HSCT. Alternatively, the rDRI was developed to estimate OS based primarily on the risk of relapse of the disease, independent of conditioning intensity, recipient age, and donor type, and discriminates four distinct risk groups [9]. The discrimination of the HCT-CI and rDRI for OS was reported as 0.63 and 0.66, respectively, in the original publications [6, 9]. Composite models have been developed in an attempt to improve discrimination and predictability by integrating various recipient, donor, and transplant characteristics. Of those, the EBMT and PAM scores are prominent validated tools that evaluated OS as a primary endpoint across various hematologic malignancies, with c-statistics for OS of 0.62 and 0.69, respectively [10, 11]. However, both omit important characteristics such as performance status and CMV serostatus. Although disease stage was included in both, the staging criteria were neither uniform nor validated, as the rDRI was not available when these tools were developed.

The more recently developed HCT-CR showed a relatively better c-statistic of 0.69 [30]. The HCT-CR is a composite model that combined the rDRI and HCT-CI/age, with a reported superior ability to estimate NRM and OS and to stratify four risk groups with significantly different three-year OS [13, 31]. While the original model was restricted to AML and MDS patients, the validation study was performed on an independent internal dataset and expanded to multiple disease groups and other outcomes, such as GVHD-free, relapse-free survival (GRFS) [31]; however, this tool still needs to be externally validated.

Most tools predict OS and NRM, whereas our model and the HCT-CR also evaluate DFS. The AL-EBMT model, in contrast, was derived from a machine-learning (ML) algorithm, is restricted to AML and ALL patients, and used 100-day mortality as its primary endpoint, with a c-statistic of 0.70 [32].

Another distinction of our tool is the inclusion of donor age. In recent years, donor age has been reported as one of the most influential factors affecting post-transplant outcomes. In large registry studies, younger donor age correlated with improved outcomes, including overall survival, across unrelated and haploidentical donor groups [33,34,35]. Although similar trends were reported by a few other studies [36, 37], one institutional study did not show a differential impact of donor age when dichotomized at 60 years [38]. Our tool is the first validated multivariable model that incorporates donor age, and it provides further evidence for younger donor age as an emerging predictive variable for DFS and OS after HSCT for various hematologic malignancies.

External validation is important to assess the generalizability of any prognostic tool. In this regard, an important strength of our study is that, in compliance with TRIPOD guidelines [39, 40], model calibration showed agreement between observed and predicted outcomes, and validation was performed in an independent, external dataset, showing only a minimal decline in discrimination relative to the internally validated values.

Even validated models can have limitations to generalizability, which are particularly highlighted in external validation studies. For example, in one single-center report, the rDRI could not accurately predict OS and PFS in a cohort with shorter follow-up [41], while another single-center analysis found diminished prediction accuracy of the HCT-CI when applied to different donor groups [42]. Similar inconsistencies have been noted for other tools in subsequent external validation studies [43].

In a recent study, Shouval et al. externally validated and compared the performance of various prediction tools in HSCT and appropriately pointed out that most models in the field have, at best, modest discrimination, likely owing to unpredictable complications and to our inability to account for all factors that influence outcomes [44].

Last, using our model, we were able to discriminate patient cohorts into two distinct risk groups with significantly different two-year DFS and OS rates; the high-risk group had a significantly lower two-year OS of 56% compared with 75% in the lower-risk group. This information would be helpful for pretransplant estimation of OS and may aid preemptive management after HSCT.

In our study, there were some differences between the two datasets. The training dataset (UIHC cohort) included more patients with higher-risk rDRI and HCT-CI scores and more patients with ALL, while the testing dataset (Mayo cohort) included significantly more patients with older donors. Similarly, differences in center practices relating to transplantation methodologies and donor composition may have influenced generalizability, as demonstrated by the decline in the c-statistic between the training and testing datasets. Differences in timeframe, cohort size, and follow-up duration between the training and testing cohorts could also have affected the results [45].

An important strength of our study is that it allows physicians to predict two-year OS and DFS after HSCT with a c-index of 0.62 by combining the most commonly used and validated variables and risk scores representing patient (age, CMV status, KPS, and HCT-CI), disease (rDRI), and donor (age and CMV status) characteristics. Incorporation of donor age, which is believed to be a formidable contributor to transplant outcome, is an added strength of this tool. Furthermore, TRIPOD guidelines were followed for external validation and calibration, attesting to the integrity of the model. The model is easy to use, and a web-based nomogram can be accessed here: https://allohsctsurvivalcalc.iowa.uiowa.edu/.

Notable limitations of this study include model building from retrospectively collected data, restriction to a PBSC graft source, small numbers of patients in some disease groups such as multiple myeloma, and few haploidentical and alternative donor transplants.

The endpoints of interest, target diseases, and risk factors used in the original model building must be considered when applying any prediction tool to a local dataset.

Validation of this tool in other external datasets and continuous refinement with the incorporation of validated global prognostic variables, such as a frailty index, cognitive assessment of patients, and biomarker correlates, are expected to further improve its prognostic value.