The importance of being external. Methodological insights for the external validation of machine learning models in medicine

https://doi.org/10.1016/j.cmpb.2021.106288

Highlights

  • External validation (EV) is necessary in medical ML to properly evaluate ML models.

  • We propose a lean meta-validation method to assess EV procedures.

  • We propose two diagrams to assess EV studies in light of data size and similarity.

  • We illustrate our methodology through the validation of a COVID-19 diagnostic model.

  • We provide guidelines and public scripts to aid the interpretation of EV results.

Abstract

Background and Objective

Medical machine learning (ML) models tend to perform better on data from the cohort on which they were developed than on new data, often due to overfitting or covariate shift. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML. However, there is still a gap in the literature on how to interpret EV results and, hence, how to assess the robustness of ML models.

Methods

We fill this gap by proposing a meta-validation method to assess the soundness of EV procedures. In doing so, we complement the usual way of assessing EV by considering both dataset cardinality and the similarity of the EV dataset to the training set. We then investigate how the notions of cardinality and similarity can be used to inform the assessment of the reliability of a validation procedure, by integrating them into two summative data visualizations.

Results

We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets, collected across 3 different continents. The model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p < 0.001). In the EV, the validated model achieved good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of dataset cardinality and similarity, thus suggesting the soundness of the results. We also provide a qualitative guideline to evaluate the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results.

Conclusions

In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of a ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.

Introduction

The validation of machine learning (ML) classification models represents one of the most important, and yet most critical, steps in the development of this class of decision support tools [62]. In this respect, to “validate” means to provide evidence that the model is valid, that is, that it will work properly on new data that it has never examined or processed before.

In the specialist literature, validation of ML models is often intended and performed as internal validation [70]: this refers to validation protocols, including, e.g., hold-out, bootstrap or cross-validation, that attempt to estimate the performance of ML models by partitioning the whole training dataset into multiple smaller datasets, and by testing the model, trained on one part of the original dataset, on a different, usually smaller, part [34], [62]. This class of approaches has prompted researchers to focus on an important aspect in assessing the soundness of the validation procedure, namely the size of the dataset used, or its cardinality [50]. We argue, however, that sample cardinality alone is not sufficient for understanding the reliability of a validation procedure, and that it must be complemented with an equally important aspect, which is often completely overlooked: dataset similarity.
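As a minimal illustration of the difference between the two protocols (using synthetic data and a generic scikit-learn classifier as stand-ins, not the model studied in this paper), the following Python sketch contrasts an internal cross-validation estimate with the performance measured on a shifted, "external" cohort:

```python
# Minimal sketch of internal vs. external validation, using synthetic data and
# a generic scikit-learn classifier as stand-ins (not the model studied here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=900, n_features=10, n_informative=6, random_state=0)
X_dev, y_dev = X[:600], y[:600]    # development cohort
X_ext, y_ext = X[600:], y[600:]    # held-back cases acting as an "external" cohort
X_ext = X_ext + rng.normal(0.0, 0.75, X_ext.shape)  # simulate a covariate shift at the external site

model = RandomForestClassifier(random_state=0)

# Internal validation: 5-fold cross-validation within the development cohort.
internal_auc = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean()

# External validation: train on the whole development cohort, test on the shifted cohort.
model.fit(X_dev, y_dev)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"internal AUC: {internal_auc:.3f}, external AUC: {external_auc:.3f}")
```

In this toy setting, the gap between the two estimates grows with the magnitude of the simulated shift, which is precisely why dataset similarity matters alongside cardinality.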

While internal validation procedures are widely used, especially for their convenience, they are not considered sufficiently conservative in so-called critical settings, like the medical one [5], [52], [59], [62]. In these settings, ML models must be robust, that is, capable of working reliably also in contexts that may be more or less subtly different from the one from which the training data was obtained [31], [60], [67]. This is sometimes called the requirement for cross-site transportability [58]. This requirement arises either because the model must be deployed in a different setting, as is the case for medical ML models that are to be deployed in multiple hospitals or countries [26]; or because the distribution of the underlying phenomenon of interest and of the predictive variables may change over time (a phenomenon known as concept drift [32]), making the original setting of model deployment a new setting for any practical purpose.
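A simple, generic way to probe for this kind of shift is to compare the marginal distribution of each predictor across sites; the sketch below uses a per-feature two-sample Kolmogorov-Smirnov test, which is only an illustrative check and not the dataset-similarity metric proposed later in this paper.

```python
# Generic probe for covariate shift: per-feature two-sample Kolmogorov-Smirnov test.
# Illustrative check only, not the dataset-similarity metric used in the paper.
import numpy as np
from scipy.stats import ks_2samp

def shifted_features(X_train, X_new, alpha=0.01):
    """Indices (and statistics) of features whose marginal distribution differs
    significantly between the development cohort and the new-site data."""
    flagged = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_new[:, j])
        if p < alpha:
            flagged.append((j, round(stat, 3), p))
    return flagged

# Toy example: the first feature is shifted at the new site.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 5))
X_new = rng.normal(size=(400, 5))
X_new[:, 0] += 1.0
print(shifted_features(X_old, X_new))
```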

Furthermore, the results of internal validation procedures are sometimes incorporated into the development of ML models, for example as a means to perform model selection [16], [59]. As a consequence, ML models are often not capable of generalizing well beyond their training distribution and may be at risk of data leakage and overfitting, thus leading to highly inflated estimates of their prospective accuracy [16], [39]. A related issue is the underspecification [22] of ML models and training procedures: different ML models can perform equally well on internal validation sets that are sufficiently similar to the training data, and yet some of them may still fail to generalize to the deployment distribution.
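One common safeguard against this selection-induced optimism is nested cross-validation, in which hyper-parameter tuning is confined to an inner loop and performance is estimated only on outer folds never used for model selection. A minimal sketch, with synthetic data and an RBF-kernel SVM purely as stand-ins, follows.

```python
# Minimal sketch of nested cross-validation: the inner loop selects hyper-parameters,
# the outer loop estimates performance on folds never used for model selection.
# Synthetic data and an RBF-kernel SVM are used purely as stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    cv=3, scoring="roc_auc",
)
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"nested-CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Note that even a correctly nested internal estimate remains an estimate on the development distribution; it does not replace external validation.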

Therefore, in critical settings, external validation has been advocated as necessary [5], [19], [36], [60]. External data, in this case, refers to a set of new data points that come from cohorts, facilities, or repositories other than those used for model creation. Most of the time, the performance observed on external datasets is significantly poorer than the performance appraised on the original datasets (e.g., [40]), so that the following question should be addressed any time researchers develop a ML model: will its performance be reproduced consistently across different sites [55]?

In what follows, we will share a general method to assess the soundness of an external validation procedure, grounded in the two notions mentioned above: dataset cardinality and dataset similarity.

To illustrate this method, we will apply it to the case of COVID-19 diagnosis [12]. In particular, we will report on the challenges that we met in validating a state-of-the-art ML model on external datasets coming from three continents, as well as the lessons that we learned in interpreting the results. Finally, we will share a set of practical recommendations for the meta-validation of external validation procedures (that is, their validation), so as to meet the requirements of generalizability and reproducibility that diagnostic and prognostic ML models must guarantee in daily (clinical) practice [4].

Section snippets

Methods

In this section, we describe our methodological contribution to the assessment of the soundness of external validation procedures. This contribution combines recent metrics and formulas, integrating them into a tool for the qualitative (also visual) assessment of the validity of an external validation procedure. For this reason, we see our proposal as a lean method for meta-validation.

As said above, this method integrates two different sets of metrics: one set is aimed at assessing the adequacy of the validation dataset in terms of its cardinality, while the other quantifies its similarity to the training set.
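As a rough illustration of these two ingredients (and only as a sketch: the formulas below are generic stand-ins, not the minimum-sample-size and agreement/representativeness criteria adopted in the paper), a per-feature Gaussian approximation of the Kullback-Leibler divergence can be used to score similarity, and simple count thresholds can be used to screen cardinality:

```python
# Illustrative stand-ins for the two ingredients: a KL-based similarity score
# (under per-feature Gaussian approximations) and a crude cardinality screen.
# These are NOT the exact MSS / DAC / DRC formulations adopted in the paper.
import numpy as np

def gaussian_kl(mu_p, sd_p, mu_q, sd_q):
    """Kullback-Leibler divergence KL(P || Q) between two univariate Gaussians."""
    return np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

def mean_feature_divergence(X_train, X_ext, eps=1e-9):
    """Average per-feature divergence of the external set from the training set;
    lower values indicate an external cohort closer to the training distribution."""
    kls = [
        gaussian_kl(X_ext[:, j].mean(), X_ext[:, j].std() + eps,
                    X_train[:, j].mean(), X_train[:, j].std() + eps)
        for j in range(X_train.shape[1])
    ]
    return float(np.mean(kls))

def cardinality_ok(n_cases, n_events, min_cases=100, min_events=30):
    """Crude screen for dataset size: enough cases overall and enough positive events
    (the thresholds are placeholders, not the paper's minimum-sample-size criterion)."""
    return n_cases >= min_cases and n_events >= min_events
```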

Use case: external validation of models supporting COVID-19 diagnosis

In this section, we describe the multiple empirical settings in which we applied our methodological approach and the characteristics of the training and validation datasets, and we briefly discuss the results of the validation of a ML model to diagnose COVID-19.
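For reference, the three performance dimensions considered in this validation (discrimination, calibration and clinical utility) can be summarized, for an external set of labels and predicted probabilities, along the lines of the following sketch; the specific metrics shown (AUC, Brier score, net benefit at a fixed threshold) are common choices used here only for illustration.

```python
# Illustrative summary of the three performance dimensions on an external set:
# discrimination (AUC), calibration (here the Brier score) and clinical utility
# (net benefit at a chosen decision threshold). Metric choices are examples only.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def net_benefit(y_true, y_prob, threshold=0.5):
    """Net benefit of acting on predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    act = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    return tp / n - (fp / n) * (threshold / (1 - threshold))

def external_report(y_true, y_prob, threshold=0.5):
    return {
        "discrimination_auc": roc_auc_score(y_true, y_prob),
        "calibration_brier": brier_score_loss(y_true, y_prob),
        "utility_net_benefit": net_benefit(y_true, y_prob, threshold),
    }
```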

Discussion

The external validation of ML models is increasingly being proposed as the main (and only) means to certify the supposed validity of the model on (virtually any) unseen data [19], [60]. However, as the findings shown above also illustrate, the results of an external validation cannot guarantee reliability per se [26]: if the external validation had been performed only on the Italy-2 dataset, where the diagnostic model performed even better than on the original test set and exhibited very high

Conclusions and final recommendations

The external validation of a medical machine learning model is very important for a number of reasons [19], [60]. This is also because an external validation corroborates the reputation of the model, and hence the users' trust and performance expectancy, which are known to be positively correlated with the behavioral intention to adopt and use the system in actual practice [23], [35]. However, for external validation to provide a sounder basis for more reliable estimates (than

Acronyms and abbreviations

  • BA: Basophils

  • CBC: Complete Blood Count

  • DAC: Data Agreement Criterion

  • DRC: Data Representativeness Criterion

  • ED: Emergency Department

  • EO: Eosinophils

  • HCT: Hematocrit

  • HGB: Hemoglobin

  • HSR: Hospital San Raffaele

  • IOG: Istituto Ortopedico Galeazzi

  • LY: Lymphocytes

  • MCH: Mean Corpuscular Hemoglobin

  • MCHC: Mean Corpuscular Hemoglobin Concentration

  • MCV: Mean Corpuscular Volume

  • ML: Machine Learning

  • MSS: Minimum Sample Size

  • NE: Neutrophils

  • PLT: Platelets

  • RBC: Red Blood Cells

  • RBF: Radial Basis Function

  • WBC: White Blood Cells

Declaration of Competing Interest

The authors report no competing interest.

Acknowledgments and Declarations

Ethics Approval: Research involving human subjects complied with all relevant national and international regulations and institutional policies, was conducted in accordance with the tenets of the Helsinki Declaration (as revised in 2013), and was approved by the authors' Institutional Review Board (70/INT/2020).

References (70)

  • I. Redko et al.

    Advances in domain adaptation theory

    (2019)
  • E.W. Steyerberg et al.

    Internal and external validation of predictive models: a simulation study of bias and precision in small samples

    J Clin Epidemiol

    (2003)
  • A. Vabalas et al.

    Machine learning algorithm validation with a limited sample size

    PLoS ONE

    (2019)
  • B. Van Calster et al.

    A calibration hierarchy for risk models was defined: from utopia to empirical data

    J Clin Epidemiol

    (2016)
  • D. Veen et al.

    Using the data agreement criterion to rank experts' beliefs

    Entropy

    (2018)
  • M. Vidali et al.

    Standardization and harmonization in hematology: instrument alignment, quality control materials, and commutability issue

    Int J Lab Hematol

    (2020)
  • L. Wynants et al.

    Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal

    BMJ

    (2020)
  • K. Ahuja

    Estimating Kullback-Leibler divergence using kernel machines

    2019 53rd Asilomar Conference on Signals, Systems, and Computers

    (2019)
  • L. Archer et al.

    Minimum sample size for external validation of a clinical prediction model with a continuous outcome

    Stat Med

    (2021)
  • A.L. Beam et al.

    Challenges to the reproducibility of machine learning models in health care

    JAMA

    (2020)
  • S. Boltz et al.

    kNN-based high-dimensional Kullback-Leibler distance for tracking

    Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS’07)

    (2007)
  • N. Bousquet

    Diagnostics of prior-data agreement in applied Bayesian analysis

    J Appl Stat

    (2008)
  • A.A. Bradley et al.

    Sampling uncertainty and confidence intervals for the Brier score and Brier skill score

    Weather Forecasting

    (2008)
  • K.H. Brodersen et al.

    The balanced accuracy and its posterior distribution

    Proceedings of ICPR 2010

    (2010)
  • L. Brunese et al.

    Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays

    Comput Methods Programs Biomed

    (2020)
  • F. Cabitza et al.

    Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2021)
  • F. Cabitza et al.

    As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI

    BMC Med Inform Decis Mak

    (2020)
  • F. Cabitza et al.

    The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence

    Ann Transl Med

    (2019)
  • A. Carobene et al.

    A very uncommon haemoglobin value resulting from a severe acute malnutrition in a 16-month-old child in Ethiopia

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2020)
  • G.C. Cawley et al.

    On over-fitting in model selection and subsequent selection bias in performance evaluation

    The Journal of Machine Learning Research

    (2010)
  • D. Chicco et al.

    The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation

    BioData Min

    (2021)
  • J. Cohen

    Statistical power analysis for the behavioral sciences

    (2013)
  • G.S. Collins et al.

    External validation of multivariable prediction models: a systematic review of methodological conduct and reporting

    BMC Med Res Methodol

    (2014)
  • G.S. Collins et al.

    Sample size considerations for the external validation of a multivariable prognostic model: a resampling study

    Stat Med

    (2016)
  • A. Coskun et al.

    Systematic review and meta-analysis of within-subject and between-subject biological variation estimates of 20 haematological parameters

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2020)

    1. These authors contributed equally.
