The importance of being external. Methodological insights for the external validation of machine learning models in medicine

https://doi.org/10.1016/j.cmpb.2021.106288

Highlights

  • External validation (EV) is necessary in medical ML to properly evaluate ML models.

  • We propose a lean meta-validation method to assess EV procedures.

  • We propose two diagrams to assess EV studies in light of data size and similarity.

  • We illustrate our methodology through the validation of a COVID-19 diagnostic model.

  • We provide guidelines and public scripts to aid the interpretation of EV results.

Abstract

Background and Objective

Medical machine learning (ML) models tend to perform better on data from the cohort on which they were developed than on new data, often due to overfitting or covariate shift. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML. However, there is still a gap in the literature on how to interpret EV results and, hence, how to assess the robustness of ML models.

Methods

We fill this gap by proposing a meta-validation method to assess the soundness of EV procedures. In doing so, we complement the usual way of assessing EV by considering both dataset cardinality and the similarity of the EV dataset to the training set. We then investigate how the notions of cardinality and similarity can be used to inform the assessment of the reliability of a validation procedure, by integrating them into two summative data visualizations.

Results

We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets, collected across 3 different continents. The model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p < 0.001). In the EV, the validated model achieved good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of dataset cardinality and similarity, thus suggesting the soundness of the results. We also provide a qualitative guideline to evaluate the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results.

Conclusions

In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of a ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.

Introduction

The validation of machine learning (ML) classification models represents one of the most important, and yet most critical, steps in the development of this class of decision support tools [62]. In this respect, to “validate” means to provide evidence that the model is valid, that is, that it will work properly on new data that it has never examined or processed before.

In the specialist literature, validation of ML models is often intended and performed as internal validation [70]: this refers to validation protocols, including, e.g., hold-out, bootstrap or cross-validation, that attempt to estimate the performance of ML models by partitioning the whole training dataset into multiple smaller datasets, and by testing the model, trained on one part of the original dataset, on a different, usually smaller, part [34], [62]. This class of approaches has prompted researchers to focus on an important aspect in assessing the soundness of the validation procedure, namely the size of the dataset used, or its cardinality [50]. We argue, however, that sample cardinality alone is not sufficient for understanding the reliability of a validation procedure, and that it must be complemented with an equally important aspect, which is often completely overlooked: dataset similarity.
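As a minimal illustration of the difference between the two protocols (using synthetic data and a generic scikit-learn classifier as stand-ins, not the model studied in this paper), the following Python sketch contrasts an internal cross-validation estimate with the performance measured on a shifted, "external" cohort:

```python
# Minimal sketch of internal vs. external validation, using synthetic data and
# a generic scikit-learn classifier as stand-ins (not the model studied here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=900, n_features=10, n_informative=6, random_state=0)
X_dev, y_dev = X[:600], y[:600]    # development cohort
X_ext, y_ext = X[600:], y[600:]    # held-back cases acting as an "external" cohort
X_ext = X_ext + rng.normal(0.0, 0.75, X_ext.shape)  # simulate a covariate shift at the external site

model = RandomForestClassifier(random_state=0)

# Internal validation: 5-fold cross-validation within the development cohort.
internal_auc = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean()

# External validation: train on the whole development cohort, test on the shifted cohort.
model.fit(X_dev, y_dev)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"internal AUC: {internal_auc:.3f}, external AUC: {external_auc:.3f}")
```

In this toy setting, the gap between the two estimates grows with the magnitude of the simulated shift, which is precisely why dataset similarity matters alongside cardinality.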

While internal validation procedures are widely used, especially for their convenience, they are not considered sufficiently conservative in so-called critical settings, like the medical one [5], [52], [59], [62]. In these settings, ML models must be robust, that is, capable of working reliably also in contexts that may be more or less subtly different from the one from which the training data was obtained [31], [60], [67]. This is sometimes called the requirement for cross-site transportability [58]. This requirement arises either because the model must be deployed in a different setting, as is the case for medical ML models that are to be deployed in multiple hospitals or countries [26]; or because the distribution of the underlying phenomenon of interest and of the predictive variables may change over time (a phenomenon known as concept drift [32]), making the original setting of model deployment a new setting for any practical purpose.
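A simple, generic way to probe for this kind of shift is to compare the marginal distribution of each predictor across sites; the sketch below uses a per-feature two-sample Kolmogorov-Smirnov test, which is only an illustrative check and not the dataset-similarity metric proposed later in this paper.

```python
# Generic probe for covariate shift: per-feature two-sample Kolmogorov-Smirnov test.
# Illustrative check only, not the dataset-similarity metric used in the paper.
import numpy as np
from scipy.stats import ks_2samp

def shifted_features(X_train, X_new, alpha=0.01):
    """Indices (and statistics) of features whose marginal distribution differs
    significantly between the development cohort and the new-site data."""
    flagged = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_new[:, j])
        if p < alpha:
            flagged.append((j, round(stat, 3), p))
    return flagged

# Toy example: the first feature is shifted at the new site.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 5))
X_new = rng.normal(size=(400, 5))
X_new[:, 0] += 1.0
print(shifted_features(X_old, X_new))
```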

Furthermore, the results of internal validation procedures are sometimes incorporated into the development of ML models, for example as a means to perform model selection [16], [59]. As a consequence, ML models are often not capable of generalizing well beyond their training distribution and may be at risk of data leakage and overfitting, thus leading to highly inflated estimates of their prospective accuracy [16], [39]. A related issue is the underspecification [22] of ML models and training procedures: different ML models can perform equally well on internal validation sets that are sufficiently similar to the training data, and yet some of them may still fail to generalize to the deployment distribution.
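One common safeguard against this selection-induced optimism is nested cross-validation, in which hyper-parameter tuning is confined to an inner loop and performance is estimated only on outer folds never used for model selection. A minimal sketch, with synthetic data and an RBF-kernel SVM purely as stand-ins, follows.

```python
# Minimal sketch of nested cross-validation: the inner loop selects hyper-parameters,
# the outer loop estimates performance on folds never used for model selection.
# Synthetic data and an RBF-kernel SVM are used purely as stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    cv=3, scoring="roc_auc",
)
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"nested-CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Note that even a correctly nested internal estimate remains an estimate on the development distribution; it does not replace external validation.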

Therefore, in critical settings, external validation has been advocated as necessary [5], [19], [36], [60]. External data, in this case, refers to a set of new data points that come from cohorts, facilities, or repositories other than those used for model creation. Most of the time, the performance observed on external datasets is significantly poorer than the performance appraised on the original datasets (e.g., [40]), so that the following question should be addressed any time researchers develop a ML model: will its performance be reproduced consistently across different sites [55]?

In what follows, we will share a general method to assess the soundness of an external validation procedure, grounded in the two notions mentioned above: dataset cardinality and dataset similarity.

To illustrate this method, we will apply it to the case of COVID-19 diagnosis [12]. In particular, we will report on the challenges that we met in validating a state-of-the-art ML model on external datasets coming from three continents, as well as the lessons that we learned in interpreting the results. Finally, we will share a set of practical recommendations for the meta-validation of external validation procedures (that is, their validation), so as to meet the requirements of generalizability and reproducibility that diagnostic and prognostic ML models must guarantee in daily (clinical) practice [4].

Section snippets

Methods

In this section, we describe our methodological contribution to the assessment of the soundness of external validation procedures. This contribution combines recent metrics and formulas, integrating them into a tool for the qualitative (also visual) assessment of the validity of an external validation procedure. For this reason, we see our proposal as a lean method for meta-validation.

As said above, this method integrates two different sets of metrics: one set is aimed at assessing the adequacy of the validation dataset in terms of its cardinality, while the other quantifies its similarity to the training set.
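As a rough illustration of these two ingredients (and only as a sketch: the formulas below are generic stand-ins, not the minimum-sample-size and agreement/representativeness criteria adopted in the paper), a per-feature Gaussian approximation of the Kullback-Leibler divergence can be used to score similarity, and simple count thresholds can be used to screen cardinality:

```python
# Illustrative stand-ins for the two ingredients: a KL-based similarity score
# (under per-feature Gaussian approximations) and a crude cardinality screen.
# These are NOT the exact MSS / DAC / DRC formulations adopted in the paper.
import numpy as np

def gaussian_kl(mu_p, sd_p, mu_q, sd_q):
    """Kullback-Leibler divergence KL(P || Q) between two univariate Gaussians."""
    return np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

def mean_feature_divergence(X_train, X_ext, eps=1e-9):
    """Average per-feature divergence of the external set from the training set;
    lower values indicate an external cohort closer to the training distribution."""
    kls = [
        gaussian_kl(X_ext[:, j].mean(), X_ext[:, j].std() + eps,
                    X_train[:, j].mean(), X_train[:, j].std() + eps)
        for j in range(X_train.shape[1])
    ]
    return float(np.mean(kls))

def cardinality_ok(n_cases, n_events, min_cases=100, min_events=30):
    """Crude screen for dataset size: enough cases overall and enough positive events
    (the thresholds are placeholders, not the paper's minimum-sample-size criterion)."""
    return n_cases >= min_cases and n_events >= min_events
```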

Use case: external validation of models supporting COVID-19 diagnosis

In this section, we describe the multiple empirical settings in which we applied our methodological approach and the characteristics of the training and validation datasets, and we briefly discuss the results of the validation of a ML model to diagnose COVID-19.
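For reference, the three performance dimensions considered in this validation (discrimination, calibration and clinical utility) can be summarized, for an external set of labels and predicted probabilities, along the lines of the following sketch; the specific metrics shown (AUC, Brier score, net benefit at a fixed threshold) are common choices used here only for illustration.

```python
# Illustrative summary of the three performance dimensions on an external set:
# discrimination (AUC), calibration (here the Brier score) and clinical utility
# (net benefit at a chosen decision threshold). Metric choices are examples only.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def net_benefit(y_true, y_prob, threshold=0.5):
    """Net benefit of acting on predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    act = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    return tp / n - (fp / n) * (threshold / (1 - threshold))

def external_report(y_true, y_prob, threshold=0.5):
    return {
        "discrimination_auc": roc_auc_score(y_true, y_prob),
        "calibration_brier": brier_score_loss(y_true, y_prob),
        "utility_net_benefit": net_benefit(y_true, y_prob, threshold),
    }
```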

Discussion

The external validation of ML models is increasingly being proposed as the main (and only) means to certify the supposed validity of the model on (virtually any) unseen data [19], [60]. However, as the findings shown above also illustrate, the results of an external validation cannot guarantee reliability per se [26]: if the external validation had been performed only on the Italy-2 dataset, where the diagnostic model performed even better than on the original test set and exhibited very high

Conclusions and final recommendations

The external validation of a medical machine learning model is very important for a number of reasons [19], [60]. This is also because an external validation corroborates the reputation of the model, and hence the users' trust and performance expectancy, which are known to be positively correlated with the behavioral intention to adopt and use the system in actual practice [23], [35]. However, for external validation to provide a sounder basis for more reliable estimates (than

Acronyms and abbreviations

  • BA: Basophils

  • CBC: Complete Blood Count

  • DAC: Data Agreement Criterion

  • DRC: Data Representativeness Criterion

  • ED: Emergency Department

  • EO: Eosinophils

  • HCT: Hematocrit

  • HGB: Hemoglobin

  • HSR: Hospital San Raffaele

  • IOG: Istituto Ortopedico Galeazzi

  • LY: Lymphocytes

  • MCH: Mean Corpuscular Hemoglobin

  • MCHC: Mean Corpuscular Hemoglobin Concentration

  • MCV: Mean Corpuscular Volume

  • ML: Machine Learning

  • MSS: Minimum Sample Size

  • NE: Neutrophils

  • PLT: Platelets

  • RBC: Red Blood Cells

  • RBF: Radial Basis Function

  • WBC: White Blood Cells

Declaration of Competing Interest

The authors report no competing interest.

Acknowledgments and Declarations

Ethics Approval: Research involving human subjects complied with all relevant national and international regulations and institutional policies, was conducted in accordance with the tenets of the Helsinki Declaration (as revised in 2013), and was approved by the authors' Institutional Review Board (70/INT/2020).

References (70)

  • I. Redko et al.

    Advances in domain adaptation theory

    (2019)
  • E.W. Steyerberg et al.

    Internal and external validation of predictive models: a simulation study of bias and precision in small samples

    J Clin Epidemiol

    (2003)
  • A. Vabalas et al.

    Machine learning algorithm validation with a limited sample size

    PLoS ONE

    (2019)
  • B. Van Calster et al.

    A calibration hierarchy for risk models was defined: from utopia to empirical data

    J Clin Epidemiol

    (2016)
  • D. Veen et al.

    Using the data agreement criterion to rank experts' beliefs

    Entropy

    (2018)
  • M. Vidali et al.

    Standardization and harmonization in hematology: instrument alignment, quality control materials, and commutability issue

    Int J Lab Hematol

    (2020)
  • L. Wynants et al.

    Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal

    BMJ

    (2020)
  • K. Ahuja

    Estimating Kullback-Leibler divergence using kernel machines

    2019 53rd Asilomar Conference on Signals, Systems, and Computers

    (2019)
  • L. Archer et al.

    Minimum sample size for external validation of a clinical prediction model with a continuous outcome

    Stat Med

    (2021)
  • A.L. Beam et al.

    Challenges to the reproducibility of machine learning models in health care

    JAMA

    (2020)
  • S. Boltz et al.

    kNN-based high-dimensional Kullback-Leibler distance for tracking

    Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS’07)

    (2007)
  • N. Bousquet

    Diagnostics of prior-data agreement in applied Bayesian analysis

    J Appl Stat

    (2008)
  • A.A. Bradley et al.

    Sampling uncertainty and confidence intervals for the Brier score and Brier skill score

    Weather Forecasting

    (2008)
  • K.H. Brodersen et al.

    The balanced accuracy and its posterior distribution

    Proceedings of ICPR 2010

    (2010)
  • L. Brunese et al.

    Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays

    Comput Methods Programs Biomed

    (2020)
  • F. Cabitza et al.

    Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2021)
  • F. Cabitza et al.

    As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI

    BMC Med Inform Decis Mak

    (2020)
  • F. Cabitza et al.

    The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence

    Ann Transl Med

    (2019)
  • A. Carobene et al.

    A very uncommon haemoglobin value resulting from a severe acute malnutrition in a 16-month-old child in Ethiopia

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2020)
  • G.C. Cawley et al.

    On over-fitting in model selection and subsequent selection bias in performance evaluation

    The Journal of Machine Learning Research

    (2010)
  • D. Chicco et al.

    The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation

    BioData Min

    (2021)
  • J. Cohen

    Statistical power analysis for the behavioral sciences

    (2013)
  • G.S. Collins et al.

    External validation of multivariable prediction models: a systematic review of methodological conduct and reporting

    BMC Med Res Methodol

    (2014)
  • G.S. Collins et al.

    Sample size considerations for the external validation of a multivariable prognostic model: a resampling study

    Stat Med

    (2016)
  • A. Coskun et al.

    Systematic review and meta-analysis of within-subject and between-subject biological variation estimates of 20 haematological parameters

    Clinical Chemistry and Laboratory Medicine (CCLM)

    (2020)

    1. These authors contributed equally.
