The importance of being external. Methodological insights for the external validation of machine learning models in medicine
Introduction
The validation of machine learning (ML) classification models is one of the most important, and yet most critical, steps in the development of this class of decision support tools [62]. In this respect, to “validate” means to provide evidence that the model is valid, that is, that it will work properly on new data that it has never examined or processed before.
In the specialist literature, validation of ML models is often intended and performed as internal validation [70]: this refers to validation protocols, including, e.g., hold-out, bootstrap, or cross-validation, that estimate the performance of an ML model by partitioning the whole training dataset into multiple smaller datasets and testing the model, trained on one part of the original dataset, on a different, usually smaller, part [34], [62]. This class of approaches has prompted researchers to focus on one important aspect of the soundness of a validation procedure, namely the size of the dataset used, or its cardinality [50]. We argue, however, that sample cardinality alone is not sufficient to understand the reliability of a validation procedure, and must be complemented with an equally important aspect, which is often completely overlooked: dataset similarity.
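As a concrete (toy) illustration of these protocols, the sketch below estimates a classifier's performance via hold-out and 5-fold cross-validation; the synthetic dataset and the logistic regression model are placeholders of our own choosing, not the pipeline used in this study.

```python
# A minimal sketch of two common internal validation protocols
# (hold-out and k-fold cross-validation) using scikit-learn.
# Dataset and classifier are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold-out: train on one partition, test on a disjoint (usually smaller) one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
holdout_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# k-fold cross-validation: repeat the partitioning k times and average.
cv_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"hold-out accuracy: {holdout_acc:.3f}, 5-fold CV accuracy: {cv_acc:.3f}")
```

Both estimates reuse the same training dataset, which is precisely why their soundness depends on that dataset's cardinality and on how well it represents the deployment distribution.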
While internal validation procedures are widely used, especially for their convenience, they are not considered sufficiently conservative in so-called critical settings, like the medical one [5], [52], [59], [62]. In these settings, ML models must be robust, that is, capable of working reliably also in contexts that may be more or less subtly different from the one from which the training data have been obtained [31], [60], [67]. This is sometimes called the requirement for cross-site transportability [58]. This requirement arises either because the model must be deployed in a different setting, as is the case for medical ML models that are to be deployed in multiple hospitals or countries [26], or because the distribution of the underlying phenomenon of interest and of the predictive variables may change over time (a phenomenon known as concept drift [32]), making the original setting of model deployment a new setting for any practical purpose.
Furthermore, the results of internal validation procedures are sometimes incorporated into the development of ML models, for example as a means to perform model selection [16], [59]. As a consequence, ML models are often unable to generalize well beyond their training distribution and may be at risk of data leakage and overfitting, leading to highly inflated estimates of their prospective accuracy [16], [39]. A related issue is the underspecification [22] of ML models and training procedures: this refers to the extent to which different ML models can perform equally well on internal validation sets that are sufficiently similar to the training data, while some of them still fail to generalize to the deployment distribution.
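One standard safeguard against this form of leakage, shown in the hedged sketch below, is nested cross-validation: model selection runs in an inner loop, while an outer loop estimates the performance of the whole selection procedure. The SVM-with-RBF-kernel learner and the hyperparameter grid are illustrative assumptions, not the configuration of the study's model.

```python
# A minimal sketch of nested cross-validation: the inner loop selects
# hyperparameters, the outer loop scores the entire selection procedure,
# so selection results never leak into the performance estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Inner loop: choose the SVM regularization strength by 3-fold CV.
inner = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: estimate the performance of "fit + select" as a whole.
nested_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```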
Therefore, in critical settings, external validation has been advocated as necessary [5], [19], [36], [60]. External data, in this case, refers to a set of new data points that come from cohorts, facilities, or repositories other than those from which the data used for model creation were drawn. Most of the time, the performance observed on external datasets is significantly poorer than the performance appraised on the original datasets (e.g., [40]), so that the following question should be addressed any time researchers develop an ML model: will its performance be reproduced consistently across different sites [55]?
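In code, external validation is conceptually simple: the model is frozen after development and re-scored, unchanged, on data from another site. The sketch below is a minimal illustration under our own assumptions; the synthetic "external" cohort and the metric choices (balanced accuracy and the Matthews correlation coefficient, both discussed in the references) are placeholders, not the original study's protocol.

```python
# A minimal sketch of an external validation step: the model is trained once
# on development data and then re-scored, unchanged, on an external cohort.
# All data are synthetic placeholders; the additive offset mimics a site effect.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

X_dev, y_dev = make_classification(n_samples=600, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# Simulate an "external" cohort as fresh data with a crude distribution shift.
X_ext, y_ext = make_classification(n_samples=200, n_features=20, random_state=1)
X_ext = X_ext + 0.5

y_pred = model.predict(X_ext)
print("external balanced accuracy:", round(balanced_accuracy_score(y_ext, y_pred), 3))
print("external MCC:", round(matthews_corrcoef(y_ext, y_pred), 3))
```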
In what follows, we share a general method to assess the soundness of an external validation procedure, grounded in the two notions mentioned above: dataset cardinality and dataset similarity.
To illustrate this method, we apply it to the case of COVID-19 diagnosis [12]. In particular, we report the challenges that we met in the validation of a state-of-the-art ML model with external datasets coming from three continents, as well as the lessons that we learned in the interpretation of the results. Finally, we share a set of practical recommendations for the meta-validation of external validation procedures (that is, their validation), so as to meet the requirements of generalizability and reproducibility that diagnostic and prognostic ML models must guarantee in daily (clinical) practice [4].
Methods
In this section, we describe our methodological contribution for assessing the soundness of external validation procedures. This contribution combines recent metrics and formulas, integrating them into a tool for the qualitative (also visual) assessment of the validity of an external validation procedure. For this reason, we see our proposal as a lean method for meta-validation.
As said above, this method integrates two different sets of metrics: one set is aimed at assessing the adequacy of the cardinality of the validation dataset (e.g., through a minimum sample size, MSS), while the other is aimed at quantifying the similarity between the training and external datasets (e.g., through the Data Agreement Criterion, DAC, and the Data Representativeness Criterion, DRC).
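By way of illustration only, the sketch below computes one plausible metric from each set: a minimum sample size derived from the normal approximation of a binomial confidence interval (an assumption of ours), and the biased squared maximum mean discrepancy (MMD) with an RBF kernel, in the spirit of the kernel two-sample test cited in the references. The actual formulas used in the full text may differ.

```python
# A hedged sketch of the two ingredients of the meta-validation method:
# (i) a minimum sample size (MSS), here from the normal approximation of a
#     binomial confidence interval (our assumption, not the paper's formula);
# (ii) a dataset-similarity score, here the biased squared MMD with an RBF
#      kernel, in the spirit of the kernel two-sample test.
import numpy as np
from scipy.stats import norm

def minimum_sample_size(expected_acc: float, half_width: float, conf: float = 0.95) -> int:
    """Smallest n so that a conf-level CI for an accuracy-like proportion
    has at most the requested half-width (normal approximation)."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return int(np.ceil(z**2 * expected_acc * (1 - expected_acc) / half_width**2))

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Example: is an external set of 150 cases large enough, and how far is it
# from the training distribution? (synthetic data with a mild mean shift)
rng = np.random.default_rng(0)
X_train, X_ext = rng.normal(0, 1, (500, 5)), rng.normal(0.3, 1, (150, 5))
print("MSS for acc~0.9, +/-0.05 at 95%:", minimum_sample_size(0.9, 0.05))  # 139
print("squared MMD(train, ext):", round(rbf_mmd2(X_train, X_ext), 4))
```

Reading the two numbers together is the point of the method: a large external set drawn from a near-identical distribution says little about transportability, while a small but markedly dissimilar one says little about anything.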
Use case: external validation of models supporting COVID-19 diagnosis
In this section, we describe the multiple empirical settings in which we applied our methodological approach and the characteristics of the training and validation datasets, and we briefly discuss the results of the validation of an ML model to diagnose COVID-19.
Discussion
The external validation of ML models is increasingly being proposed as the main (and only) means to certify the supposed validity of a model on (virtually any) unseen data [19], [60]. However, as the findings shown above also illustrate, the result of an external validation cannot guarantee reliability per se [26]: if the external validation had been performed only on the Italy-2 dataset, where the diagnostic model performed even better than on the original test set and exhibited very high discrimination, we would have drawn misleadingly optimistic conclusions about its generalizability.
Conclusions and final recommendations
The external validation of a medical machine learning model is very important for a number of reasons [19], [60]. This is also because an external validation corroborates the reputation of the model, and hence the users' trust and performance expectancy, which are known to be positively correlated with the behavioral intention to adopt and use the system in actual practice [23], [35]. However, for external validation to provide a sounder basis for more reliable estimates (than internal validation alone), the external validation procedure itself must be assessed, that is, meta-validated.
Acronyms and abbreviations
- BA: Basophils
- CBC: Complete Blood Count
- DAC: Data Agreement Criterion
- DRC: Data Representativeness Criterion
- ED: Emergency Department
- EO: Eosinophils
- HCT: Hematocrit
- HGB: Hemoglobin
- HSR: Hospital San Raffaele
- IOG: Istituto Ortopedico Galeazzi
- LY: Lymphocytes
- MCH: Mean Corpuscular Hemoglobin
- MCHC: Mean Corpuscular Hemoglobin Concentration
- MCV: Mean Corpuscular Volume
- ML: Machine Learning
- MSS: Minimum Sample Size
- NE: Neutrophils
- PLT1: Platelets
- RBC: Red Blood Cells
- RBF: Radial Basis Function
- WBC: White Blood Cells
Declaration of Competing Interest
The authors report no competing interests.
Acknowledgments and Declarations
Ethics Approval: Research involving human subjects complied with all relevant national and international regulations and institutional policies, is in accordance with the tenets of the Helsinki Declaration (as revised in 2013), and was approved by the authors' Institutional Review Board (70/INT/2020).
References (70)
- et al., Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: results of 10 convolutional neural networks, Comput. Biol. Med. (2020)
- et al., External validation is necessary in prediction research: a clinical example, J. Clin. Epidemiol. (2003)
- et al., The need to separate the wheat from the chaff in medical informatics: introducing a comprehensive checklist for the (self)-assessment of medical AI studies, Int. J. Med. Inform. (2021)
- et al., Theoretical analysis of a performance measure for imbalanced data, 2010 20th International Conference on Pattern Recognition (2010)
- et al., A kernel method for the two-sample-problem, Adv. Neural Inf. Process. Syst. (2006)
- Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap, Comput. Stat. Data Anal. (2009)
- et al., Analysis of the factors influencing healthcare professionals' adoption of mobile electronic medical record (EMR) using the unified theory of acceptance and use of technology (UTAUT) in a tertiary hospital, BMC Med. Inform. Decis. Mak. (2015)
- et al., Race-specific WBC and neutrophil count reference intervals, Int. J. Lab. Hematol. (2010)
- et al., Loss of smell and taste in combination with other symptoms is a strong predictor of COVID-19 infection, medRxiv (2020)
- et al., Estimation of required sample size for external validation of risk models for binary outcomes, Stat. Methods Med. Res. (2021)
- Advances in domain adaptation theory
- Internal and external validation of predictive models: a simulation study of bias and precision in small samples, J. Clin. Epidemiol.
- Machine learning algorithm validation with a limited sample size, PLoS ONE
- A calibration hierarchy for risk models was defined: from utopia to empirical data, J. Clin. Epidemiol.
- Using the Data Agreement Criterion to rank experts' beliefs, Entropy
- Standardization and harmonization in hematology: instrument alignment, quality control materials, and commutability issue, Int. J. Lab. Hematol.
- Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal, BMJ
- Estimating Kullback-Leibler divergence using kernel machines, 2019 53rd Asilomar Conference on Signals, Systems, and Computers
- Minimum sample size for external validation of a clinical prediction model with a continuous outcome, Stat. Med.
- Challenges to the reproducibility of machine learning models in health care, JAMA
- kNN-based high-dimensional Kullback-Leibler distance for tracking, Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07)
- Diagnostics of prior-data agreement in applied Bayesian analysis, J. Appl. Stat.
- Sampling uncertainty and confidence intervals for the Brier score and Brier skill score, Weather and Forecasting
- The balanced accuracy and its posterior distribution, Proceedings of ICPR 2010
- Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays, Comput. Methods Programs Biomed.
- Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests, Clinical Chemistry and Laboratory Medicine (CCLM)
- As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI, BMC Med. Inform. Decis. Mak.
- The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence, Ann. Transl. Med.
- A very uncommon haemoglobin value resulting from a severe acute malnutrition in a 16-month-old child in Ethiopia, Clinical Chemistry and Laboratory Medicine (CCLM)
- On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res.
- The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Min.
- Statistical power analysis for the behavioral sciences
- External validation of multivariable prediction models: a systematic review of methodological conduct and reporting, BMC Med. Res. Methodol.
- Sample size considerations for the external validation of a multivariable prognostic model: a resampling study, Stat. Med.
- Systematic review and meta-analysis of within-subject and between-subject biological variation estimates of 20 haematological parameters, Clinical Chemistry and Laboratory Medicine (CCLM)
1. These authors contributed equally.