Main

Advances in computing power, deep learning architectures and expert-labelled datasets have spurred the development of medical imaging artificial intelligence (AI) systems that rival clinical experts1,2,3,4,5,6,7,8. Yet it remains remarkably challenging to deploy AI systems that assist with even simple clinical tasks6,8. Machine learning algorithms designed to shorten the time to clinically actionable inferences have, when deployed in clinics, inadvertently left patients facing even greater delays9. When taken out of siloed and controlled laboratory environments, end users of AI systems must contend with input quality control and network latency, and must devise ways to integrate these systems within established clinical practice. Some of these early forays into translatable clinical machine learning have shown that designing systems to work seamlessly within established clinical workflows requires substantial integrative effort at the inception of algorithm development, given the drastically limited opportunities for iteration at the time of prospective deployment10. Extensive open-source machine learning software libraries and advances in computer performance have made it easier for researchers to develop increasingly complex AI systems tailored to specific clinical problems11,12. Beyond detecting features diagnostic of disease, the next generation of AI systems must account for systemic biases in training data, intuitively alert end users to the uncertainty inherent in predictions and allow for opportunities to explore and explain the mechanisms by which predictions are made. This Perspective builds on these key priority areas for the acceleration of foundational AI research in medicine. We present an overview of dataset curation nuances and architectural considerations specific to machine learning for high-dimensional medical imaging, along with a discussion of explainability, uncertainty and bias in these systems. In the process, we provide a template for researchers interested in navigating some of the issues and challenges that come with building clinically translatable AI systems13.

High-dimensional medical imaging data

We anticipate that the availability of high-quality ‘AI-ready’ annotated medical datasets will continue to lag behind demand for the foreseeable future. Retrospectively assigning clinical ground truth labels requires extensive investment of time from clinical experts, and there are substantial barriers to aggregating multi-institutional data for public release13. In addition to ‘diagnostic AI’ characterized by models trained on hard radiological ground truth labels, there will be demand for ‘disease prediction AI’ trained on potentially noisier clinical composite outcome targets8,14,15,16. Prospective data collection with standardized protocols for image acquisition and adjudication of clinical ground truth are essential steps towards building massive multicentre imaging datasets with paired clinical outcomes.

Large multicentre imaging datasets engender a multitude of privacy and liability concerns associated with potentially sensitive data embedded in the files. The Digital Imaging and Communications in Medicine (DICOM) standard was designed to capture, store and provide workflow management for medical images, and is nearly universally adopted17. Imaging files (stored either as .dcm files or within a nested folder structure) contain both pixel data and associated metadata. A multitude of open-source and proprietary tools can assist with de-identification of DICOM files13,18. Back-end hospital informatics frameworks such as the Google Healthcare API also support DICOM de-identification via ‘safe lists’, a method that scrubs out metadata fields that may contain sensitive information. On the user-facing side, the MIRC Clinical Trials Processor anonymizer is a popular alternative, although it requires working with certain legacy software18. Well-documented Python packages (such as pydicom) may also be used to process DICOM files before use or transfer to collaborating institutions19. Imaging data can then be extracted and stored in a variety of machine-readable formats20. These datasets can quickly become large and unwieldy, and while a discussion of specific data storage formats is beyond the scope of this Perspective, a key consideration for medical imaging AI is the preservation of image resolution.
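
As a concrete illustration of the ‘safe list’ approach, the following minimal sketch uses pydicom to strip all top-level metadata elements that are not explicitly retained. The list of keywords shown is purely hypothetical; a production pipeline should follow the DICOM de-identification profiles and institutional policy, and should also handle elements nested within sequences.

```python
import pydicom

# Hypothetical 'safe list' of metadata keywords to retain; a real policy
# should follow the DICOM de-identification profiles and institutional rules.
SAFE_KEYWORDS = {"Modality", "Rows", "Columns", "PixelSpacing",
                 "CineRate", "NumberOfFrames", "StudyDate"}

def deidentify(in_path: str, out_path: str) -> None:
    """Remove every top-level metadata element not on the safe list,
    keeping the pixel data itself intact."""
    ds = pydicom.dcmread(in_path)
    unsafe_tags = [elem.tag for elem in ds
                   if elem.keyword not in SAFE_KEYWORDS
                   and elem.keyword != "PixelData"]
    for tag in unsafe_tags:
        del ds[tag]
    ds.remove_private_tags()  # drop vendor-specific private tags as well
    ds.save_as(out_path)

deidentify("raw_scan.dcm", "deidentified_scan.dcm")
```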

An oft-cited drawback of automated de-identification methods or scripts is the potential for ‘burned in’ protected health information to remain on the imaging files. Despite the DICOM standard, manufacturer-specific differences make it difficult to generate simple rules via tools such as the MIRC Clinical Trials Processor to mask out regions where protected health information may be located. We suggest using a simple machine learning system for masking ‘burned in’ protected health information. In the case of echocardiograms, for example, there is a pre-defined scanning sector where the heart is visualized. Other potential options are machine learning-based optical character recognition tools to identify and mask out regions with printed text. The DICOM tags themselves can be useful for the extraction of both scan-level information and modality-specific tags. In the cases of echocardiography and cardiac magnetic resonance imaging (MRI), for example, important scan-level information such as acquisition frame rates and date, or MRI sequence (T1/T2), can readily be extracted from the DICOM metadata (Fig. 1).
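
The sketch below illustrates both ideas under the assumption that the scan sector coordinates are present in the DICOM metadata (the SequenceOfUltrasoundRegions attribute, which not all vendors populate): scan-level tags such as frame rate are read directly, and pixels outside the scanning sector, where burned-in text typically appears, are blanked.

```python
import numpy as np
import pydicom

def mask_outside_scan_sector(ds: pydicom.dataset.Dataset) -> np.ndarray:
    """Blank all pixels outside the ultrasound scanning sector, where
    burned-in protected health information is typically rendered."""
    frames = ds.pixel_array  # cine loop: (frames, rows, cols) or (frames, rows, cols, 3)
    region = ds.SequenceOfUltrasoundRegions[0]
    x0, y0 = region.RegionLocationMinX0, region.RegionLocationMinY0
    x1, y1 = region.RegionLocationMaxX1, region.RegionLocationMaxY1
    masked = np.zeros_like(frames)
    masked[:, y0:y1, x0:x1, ...] = frames[:, y0:y1, x0:x1, ...]
    return masked

ds = pydicom.dcmread("echo_cine.dcm")
# Scan-level information is read directly from the metadata.
print("Frame rate:", ds.get("CineRate"), "fps, acquired on", ds.get("StudyDate"))
pixels = mask_outside_scan_sector(ds)
```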

Fig. 1: Cloud-based collaborative annotation workflows.
figure 1

Cloud-based tools such as MD.ai can be used to generate expert-annotated datasets and evaluate them against clinical experts via a secure connection. An implementation of MD.ai in which clinical experts make a variety of 2D measurements to quantify cardiac function is shown. Credit: MD.ai Inc, NY.

For research endeavours that involve head-to-head benchmarking of AI systems against clinicians, or for curating large datasets with the help of clinical annotators, we recommend that a copy of the scans be stored in the DICOM format. This allows for deployment over scalable and easy-to-use cloud-based annotation tools. Several solutions exist for assigning scans for assessment by clinical experts, with requirements ranging from simple scan-level labels to detailed domain-specific anatomical segmentation masks. At our institution, we deployed MD.ai (New York, New York), a cloud-based annotation system that natively works with DICOM files stored on institutionally approved cloud storage providers (Google Cloud Storage or Amazon AWS). Alternatives such as ePadLite (Stanford, California) offer similar functionality and are available free of cost21. An additional advantage of the cloud-based annotation approach is that the scans are kept at native resolution and quality. Real-time collaboration simulates ‘team-based’ clinical decision-making, and annotations and labels can easily be exported for downstream analyses. Most importantly, many of these tools are accessible remotely from any modern web browser and are extremely easy to use, drastically improving user experience and reducing the technical burden on clinical collaborators.

Finally, newer machine learning training paradigms such as federated learning may help circumvent many of the barriers associated with data sharing. Kaissis et al. reviewed the principles, security risks and implementation challenges of federated learning22. The key feature of this method is that local copies of algorithms are trained at each institution, and the only information shared is what the neural network learns during training. At predetermined intervals, the information learned (trained weights) by each institutional algorithm is pooled and redistributed, effectively learning from a large multicentre dataset without the need to transmit or share any of the medical imaging data23,24. This approach has been instrumental in rapidly training algorithms to detect features of COVID-19 from computed tomography scans of the chest25. Although there have been successful demonstrations of federated learning in medical imaging, substantial technical challenges remain in implementing these methods for routine clinical use25. Specifically in the context of high-dimensional imaging machine learning systems, the network latency introduced by the need to transmit and update trained weights from multiple participating centres becomes a fundamental rate-limiting step in training larger neural networks. Researchers must also ensure that the transmission of the trained weights between participating institutions is secure and encrypted, which further increases network latency26. Furthermore, curating datasets for quality and consistency while designing a study can be extremely challenging without access to the source data, and many conceptually similar federated learning frameworks still assume a degree of access to the source data27.
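
A minimal sketch of the pooling step (federated averaging) is shown below, assuming each participating institution returns a locally trained PyTorch model together with the number of local training examples; secure transmission, encryption and the scheduling of communication rounds are intentionally omitted.

```python
import copy
import torch

def federated_average(site_models, site_sizes):
    """Pool trained weights from each institution by taking a weighted average
    (FedAvg); only the weights are shared, never the underlying imaging data."""
    total = float(sum(site_sizes))
    averaged = copy.deepcopy(site_models[0].state_dict())
    for key in averaged:
        averaged[key] = sum(model.state_dict()[key].float() * (n / total)
                            for model, n in zip(site_models, site_sizes))
    return averaged

# One communication round: the pooled weights are redistributed to every site.
# global_model.load_state_dict(federated_average(site_models, site_sizes))
```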

Computational architectures

Neural network architectures used in modern clinical machine learning are largely derived from those optimized for large photo or video recognition tasks28. These architectures are remarkably robust even in the otherwise challenging setting of fine-grained classification, where the differences between classes are subtle (breeds of dogs) rather than obvious (airplanes versus dogs). With adequate pre-training on large datasets (for example, ImageNet), these ‘off the shelf’ architectures outperform their tailor-made fine-grained classifier counterparts29. Many of these architectures are available for use in popular machine learning frameworks such as TensorFlow and PyTorch30,31,32,33,34. Most importantly, these frameworks often provide ImageNet pre-trained weights for a variety of different neural network architectures, allowing researchers to rapidly repurpose them for specialized medical imaging tasks35.
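
As an illustration, the short sketch below repurposes an ImageNet-pretrained ResNet-50 from torchvision (version 0.13 or later assumed) for a hypothetical two-class medical imaging task by replacing its final classification layer.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and swap in a new task-specific head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # hypothetical disease vs. normal task

# Optionally freeze the pretrained backbone and train only the new head at first.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False
```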

Unfortunately, the vast majority of clinical imaging modalities are not simply static ‘images’. An echocardiogram, for example, is a two-dimensional (2D) ultrasonographic video of the heart. These ‘videos’ can be taken from multiple different view planes, allowing for a more complete assessment of the heart. Computed tomography (CT) and MRI scans can be thought of as a stack of 2D images that must be analysed in sequence, or practitioners run the risk of missing valuable relationships between organs along one axis or another. These imaging modalities are thus more similar to videos, where unstacking them into independent images may lead to the loss of spatial or temporal context: processing a video by analysing each frame as a separate, independent image, for example, discards the temporal information between frames4,36,37. In a variety of tasks utilizing echocardiography and CT and MRI scans, video-based neural network algorithms have shown considerable improvements over their 2D counterparts, yet integrating multiple different view planes brings an additional layer of dimensionality that is challenging to incorporate into current frameworks2,4,38. Unlike the extensive libraries of pre-trained image-based networks, support for video algorithms remains limited. Researchers interested in deploying newer architectures will probably need to perform pre-training on large publicly available video datasets (such as Kinetics and UCF101 (University of Central Florida 101 – Action Recognition Data Set)) themselves39. Furthermore, video networks can be orders of magnitude more computationally expensive to train. While pre-training on large natural scenery datasets is an accepted strategy in developing clinical imaging machine learning systems, performance gains are not guaranteed40. Reports of performance improvements are common with pre-training, especially when working with smaller datasets, but the benefits taper off with larger training datasets2.
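
For illustration, the sketch below loads one of the few video architectures with readily available pre-trained weights, a 3D ResNet pre-trained on Kinetics-400 via torchvision (version 0.13 or later assumed), and repurposes it for a hypothetical binary task on echocardiogram clips.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# A 3D ResNet pre-trained on the Kinetics-400 action-recognition dataset.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # hypothetical disease vs. normal task

# Video inputs are five-dimensional: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 32, 112, 112)
logits = model(clip)
```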

The lack of medical imaging-specific architectures was raised as a key challenge in the 2018 National Institutes of Health roadmap13. We extend this further by proposing that how we train these architectures has a large role to play in how well these systems translate to the real world. We believe that the next generation of high-dimensional medical imaging AI will require training on richer, contextually more meaningful targets, rather than simple categorical labels. Most medical imaging AI systems today focus on diagnosing a handful of diseases against a normal background. The typical approach is to assign a numeric label (disease: 1; normal: 0) when training these algorithms. This is quite different from how clinical trainees learn to diagnose diseases from imaging scans. In an effort to provide more ‘medical knowledge’ than pre-training on natural images or videos alone, Taleb et al.37 proposed a series of novel self-supervised pre-training techniques that use large unlabelled medical imaging datasets to assist the development of 3D medical imaging-based AI systems. Neural networks learn to ‘describe’ the imaging scans provided as inputs by first performing a set of ‘proxy tasks’37. For example, by tasking networks to ‘reassemble’ scrambled input scans as one would a jigsaw puzzle, they can be trained to ‘understand’ which anatomical structures line up with one another in various pathological and physiological states. Pairing imaging scans with their radiology reports is another interesting strategy that has seen considerable success with chest X-ray-based AI systems41. In the spirit of providing more nuanced clinical context and embedding more ‘knowledge’ into neural networks, the text in the reports is processed by state-of-the-art natural language machine learning algorithms that subsequently train the vision network to better distinguish what makes various diseases appear ‘different’. Most importantly, the authors show that such approaches can reduce the amount of labelled data needed for specific downstream classification tasks by up to two orders of magnitude41. Unlabelled imaging studies, either alone or in combination with paired text reports, can therefore serve as the groundwork for effective pre-training, followed by fine-tuning on a smaller sample of high-quality ground truth data for a specific supervised learning task.
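
The sketch below illustrates the general idea behind pairing scans with their reports: a symmetric contrastive loss pulls together the embeddings of a scan and its own report while pushing apart mismatched pairs within a batch. It is a simplified stand-in for the approach described in ref. 41, and the image and text encoders that produce the embeddings are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def scan_report_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: each scan embedding should be most similar
    to the embedding of its own radiology report, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(len(image_emb), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```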

Although these steps help adapt existing neural network architectures for medical imaging, designing new architectures for specific tasks requires rare expertise. A model architecture is analogous to the brain, and the trained weights (the mathematical functions optimized during training) are analogous to the mind. Advances in evolutionary search algorithms make use of machine learning methods to discover new architectures tailored to a specific task, resulting in architectures that are more efficient and perform better than those constructed by humans42,43. These methods offer a unique opportunity in the development of imaging-modality-specific architectures. Training deep learning algorithms relies on graphics processing units (GPUs) to perform massively parallel matrix multiplication operations. The availability of ‘pay as you go’ cloud computing GPU resources and of consumer-grade GPUs with high memory capacities has helped reduce the barrier to entry for researchers interested in developing machine learning systems for medical imaging. Despite these advances, training complex modern network architectures on large video datasets requires multiple GPUs running for weeks33. Clinical research groups should note that while training a single model might be feasible on a relatively inexpensive computer, finding the right combination of settings for the best performance almost always requires specialized hardware and computing clusters to return results within a reasonable timeframe. Powerful abstraction layers (PyTorch Lightning, for example) also allow research groups to establish internal standards for structuring their code in a modular format. Adopting such modular approaches, in which neural network architectures and datasets can be swapped out easily, helps to rapidly repurpose systems originally designed for one clinical imaging modality to newer use cases, and to extend their capabilities by integrating subcomponents in novel ways.
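
The sketch below shows what such a modular structure might look like with PyTorch Lightning: the backbone architecture is passed in as an argument, so the training logic can be reused unchanged when a 2D image network is swapped for a video network or a different dataset is plugged in.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class ImagingClassifier(pl.LightningModule):
    """A modular training wrapper: any backbone network can be plugged in."""

    def __init__(self, backbone: torch.nn.Module, learning_rate: float = 1e-4):
        super().__init__()
        self.backbone = backbone
        self.learning_rate = learning_rate

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.cross_entropy(self.backbone(images), labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)

# Swapping architectures or datasets requires no change to the training logic:
# pl.Trainer(max_epochs=10).fit(ImagingClassifier(my_video_network), my_dataloader)
```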

Time-to-event analyses and uncertainty quantification

As medical AI systems shift from ‘diagnostic’ to more ‘prognostic’ applications, time-to-event predictions (rather than simple binary predictions) will find more relevance in the clinical setting. Time-to-event analyses are characterized by the ability to predict event probabilities as a function of time, whereas binary classifiers can provide predictions for only one predetermined duration. Unlike binary classifiers, time-to-event analyses account for censoring of data, allowing for individuals who either were lost to follow-up or did not experience the event of interest within the observation timeframe. Survival analyses are commonplace in clinical research and are central to the development of evidence-based practice guidelines. Extending traditional survival models with image- and video-based machine learning may provide powerful insight into the prognostic value of features within histological sections or medical imaging scans. For example, integrating extensions of Cox proportional-hazards loss functions into traditional neural network architectures has made it possible to predict cancer outcomes from histopathology slides alone44,45. We do not advocate using such vision networks to dictate how care should be administered, but instead advocate their use as a method to flag cases where features of advanced malignancy were missed by clinicians. Incorporating time-to-event analyses will be increasingly relevant in clinical situations where indolent and early stages of disease have detectable features that may progress rapidly after a certain amount of time. Retinal features diagnostic of macular degeneration, for example, often take years to manifest8. Patients with incipient features of disease may be labelled as ‘normal’, muddying the waters for neural networks attempting to make predictions about the future risk of developing complications of macular degeneration. Incorporating concepts of survival and censoring may help train systems to better separate normal individuals from those with mild, moderate and rapidly advancing disease. Similarly, training vision networks for time-to-event analyses may find use in screening for lung cancer, helping with risk stratification based on the expected potential for aggressive spread. Critical for such translational efforts is the availability of robust and well-validated deep learning extensions of the Cox regression. Over the past several years, a number of deep learning implementations of the Cox model have been described. Kvamme et al. proposed a series of proportional and non-proportional extensions of the Cox model, alongside implementations of previously described survival methods such as DeepSurv and DeepHit46 (Fig. 2).
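
As an illustration of how survival concepts enter the training loop, the sketch below implements a negative Cox partial log-likelihood that could serve as the loss for a vision network outputting a single risk score per scan. It handles right-censored follow-up but, for simplicity, ignores tied event times; established implementations such as those described in ref. 46 should be preferred in practice.

```python
import torch

def neg_cox_partial_log_likelihood(risk_scores, event_times, event_observed):
    """Negative Cox partial log-likelihood for right-censored data.
    risk_scores: (N,) predicted log-hazards; event_times: (N,) follow-up times;
    event_observed: (N,) 1 if the event occurred, 0 if censored."""
    order = torch.argsort(event_times, descending=True)  # longest follow-up first
    risk = risk_scores[order]
    events = event_observed[order].float()
    # log of the summed exp(risk) over each subject's risk set
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    partial_ll = (risk - log_risk_set) * events
    return -partial_ll.sum() / events.sum().clamp(min=1.0)
```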

Fig. 2: Quantifying uncertainty in machine learning outputs.
figure 2

Machine learning models trained with standard methods can be extremely confident even when incorrect, as described by Sensoy et al.47. Left: as a handwritten digit ‘1’ is rotated 180°, the system confidently changes its assigned label from ‘1’ to ‘7’. Right: with methods that account for classification uncertainty, however, the system assigns an uncertainty score that can help alert clinicians to potentially erroneous predictions. Figure courtesy of the authors of ref. 47.

Time-to-event predictions can, however, prove problematic from an actionable standpoint. In the hypothetical example of lung cancer screening, a suspicious nodule on a computed tomography scan of the chest might yield a prediction for median survival with and without appropriate therapeutic interventions. It would be valuable for the clinician to know how certain the machine learning system is about its prediction for an individual patient. Humans tend to err on the side of caution when unsure about a task. This is mirrored by machine learning systems whose output is a ‘class probability’ or ‘likelihood of being correct’ on a scale of 0 to 1. Most medical imaging machine learning systems described in the literature today, however, lack the implicit ability to say ‘I don’t know’ when provided with input data that are out of distribution for the model. A classifier trained to predict pneumonia from computed tomography scans, for example, is by design coerced to provide an output (either pneumonia or no pneumonia) even if the input image is that of a cat. In their paper on uncertainty quantification in deep learning, Sensoy et al. addressed these issues with a series of loss functions that assign an ‘uncertainty score’ as a way to avoid erroneous, but confident, predictions47. The benefits of uncertainty quantification arise later in the translational phase of a project, when AI systems are deployed in environments where they work alongside human users. Confidence measures were a key element of AlphaFold2, the protein-folding machine learning system that achieved unparalleled levels of accuracy in the 14th Critical Assessment of Protein Structure Prediction (CASP14) challenge, giving the DeepMind research team a way to gauge how much trust to place in the predictions being generated48,49. Numerous implementations of uncertainty quantification methods are available under permissive licences and are compatible with commonly used machine learning frameworks50. The incorporation of uncertainty quantification may help increase both the interpretability and the reliability of high-stakes medical imaging machine learning systems, and reduce the likelihood of automation bias, a phenomenon whereby clinicians over-rely on automated outputs51.
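
The sketch below follows the evidential formulation described by Sensoy et al.47: the network outputs non-negative ‘evidence’ for each class, which parameterizes a Dirichlet distribution, and overall uncertainty is high whenever little evidence supports any class. Training with their full evidential loss is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def evidential_prediction(logits):
    """Convert raw network outputs into class probabilities plus an
    uncertainty score in (0, 1], following the Dirichlet-based formulation."""
    evidence = F.relu(logits)                    # non-negative evidence per class
    alpha = evidence + 1.0                       # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)   # total evidence S
    probs = alpha / strength                     # expected class probabilities
    uncertainty = logits.shape[-1] / strength    # u = K / S: high when evidence is scarce
    return probs, uncertainty
```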

Explainable AI and risk of harm

Aside from quantifying how certain machine learning systems are of their predictions, understanding how these systems arrive at their conclusions is of considerable interest to both the engineers building them and the clinicians using them. Saliency maps and class activation maps remain the de facto standard for explaining how machine learning algorithms make their predictions52,53. Adebayo et al. recently showed that relying solely on the visual appearance of saliency maps can be misleading, even if at first glance they appear contextually relevant. In a series of extensive tests, they found that instead of deriving true meaning from model weights, many popular methods for generating post-hoc saliency maps are in fact no different from ‘edge detectors’ (algorithms that simply map areas of sharp transition in pixel intensity)54. Furthermore, even when these visualization methods work, little can be deciphered beyond ‘where’ the machine learning algorithms are looking, and there are numerous examples in which saliency maps look nearly identical for both correct and incorrect predictions55. These drawbacks are more pronounced when the difference between a ‘diseased’ state and a ‘normal’ state requires attention to the same region of an image or video56 (Fig. 3).

Fig. 3: Misleading nature of post-hoc model explanations.
figure 3

a, Experiments conducted by Adebayo et al.54 with models trained on true labels from the MNIST dataset (top) and models trained on random noise (bottom). Models trained on random noise still yield the circular shape of the digit zero when evaluated by the majority of visualization methods. These offer little in terms of true saliency maps, functioning more as class-invariant edge detectors. b, Detection of echocardiographic view planes: both incorrect classifications (top left) and correct classifications (top right) yield similar saliency maps (bottom). Figure courtesy of the authors of ref. 54.

Clinicians should note that heatmaps alone are insufficient for explaining how AI systems function, and care must be taken when attempting to identify failure modes using visualizations such as those shown above. A more granular approach may involve serial occlusion tests, in which performance is assessed on images after intentional masking of regions that clinicians would otherwise use to make diagnoses or predictions57. The idea is quite intuitive: if the algorithm is run on images in which areas known to be important for diagnosing a certain condition have been masked off (for example, masking out the left ventricle when attempting to diagnose heart failure), a precipitous decline in performance should be seen. This helps to confirm that the AI system is attending to relevant areas. Specifically in the context of high-dimensional medical imaging studies, activation maps may offer unique insights into the relative importance of certain temporal phases of video-like imaging studies. Certain diseases may show pathognomonic features while the heart is contracting, for example, whereas other conditions may require attention to the phase when the heart is relaxing. Often such experiments may show that machine learning systems identify potentially informative features in regions of images that clinicians would not traditionally use6. In addition to revealing how these machine learning systems generate their outputs, rigorous visualization experiments may therefore offer a unique opportunity to learn biological insights from the machine learning systems being evaluated. On the other hand, deviations of activation from clinically known areas of importance may signal that networks are learning non-specific features, making them unlikely to generalize well to other datasets58.
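
A minimal sketch of such a serial occlusion test is given below: a patch of the image is masked at each position in turn and the drop in the target-class score is recorded, so that steep drops over clinically relevant regions (the left ventricle in the heart-failure example above) support the claim that the model attends to them. The model, input format and patch size are illustrative assumptions.

```python
import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=32):
    """Slide a masking patch across the image and record how much the
    target-class score drops when each region is hidden."""
    model.eval()
    drops = []
    with torch.no_grad():
        baseline = model(image.unsqueeze(0))[0, target_class].item()
        _, height, width = image.shape
        for top in range(0, height - patch + 1, stride):
            for left in range(0, width - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = image.mean()
                score = model(occluded.unsqueeze(0))[0, target_class].item()
                drops.append((top, left, baseline - score))
    return drops  # large drops mark regions the prediction depends on
```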

The features learned by a machine learning system can depend on architectural design choices. More importantly, machine learning systems will learn and perpetuate systemic inequities present in the training data and targets provided to them59,60. As healthcare AI systems move towards the prediction of future disease, greater care must be taken to account for the extensive disparities in access to healthcare and in outcomes across demographic groups. In a recent review, Chen et al. gave an in-depth overview of potential sources of bias from problem selection to the post-deployment phase61. Here we focus on potential solutions early in the development of machine learning systems. There have been demands for methods to explain the otherwise ‘black box’ predictions of modern machine learning systems, while others have advocated restricting ourselves to more explainable models to begin with55. An intermediate approach is to retain black box medical imaging neural networks while also incorporating inputs for structured data when training the overall AI system. This can be achieved by building ‘fusion networks’, in which tabular data are incorporated into image- or video-based neural networks, or by other more advanced methods with the same fundamental goal (such as autoencoders that generate a low-dimensional representation of the combined data)14,62,63. Even without the incorporation of demographic inputs into high-dimensional vision networks, it is critical that research groups audit their models by comparing performance across genders, ethnicities, geographies and income groups. Machine learning systems may inadvertently learn to perpetuate discrimination against minorities and people of colour, and it is essential to identify this kind of bias early in the model development process59,61. Trust in machine learning systems is critical for wider adoption, as is exploring how and why specific features or variables lead to predictions via a combination of saliency maps and model-agnostic approaches to estimating feature importance64,65,66. An alternative approach is to constrain the machine learning algorithm within the training logic itself, ensuring that optimization controls for demographic variables of interest. This is analogous to a multivariable regression model, in which the effect of risk factors of interest can be studied independently of baseline demographic variables. From a technical standpoint, this would involve inserting an additional penalty loss in the training loop, keeping in mind the potential trade-off of slightly lower model performance67. Fairlearn, for example, is a popular toolkit for assessing fairness in traditional machine learning models, and constrained optimizations based on the Fairlearn algorithms (FairTorch) represent a promising exploratory foray into incorporating bias adjustments within the training process68. Numerous open-source toolkits also exist to help researchers determine the relative importance of different variables and input streams (image predictions, and variables such as gender and race). These techniques may allow the development of more equitable machine learning systems, and may even uncover hidden biases where none are anticipated69.
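
As a starting point for such audits, the sketch below compares discriminative performance across demographic subgroups after training; the variable names are illustrative, and similar stratified checks can be run for any sensitive attribute or participating site.

```python
from sklearn.metrics import roc_auc_score

def audit_by_subgroup(y_true, y_score, groups):
    """Compute the area under the ROC curve separately for each subgroup
    (for example, sex, ethnicity or site); large gaps flag potential bias."""
    results = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        results[group] = roc_auc_score([y_true[i] for i in idx],
                                       [y_score[i] for i in idx])
    return results

# Example: audit_by_subgroup(labels, model_scores, patient_sex)
```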

Conclusion

Although computational architectures and access to high-quality data are key to building good models, developing translatable machine learning systems for high-dimensional imaging modalities requires proactive efforts to better represent the ‘video-like’ nature of the data, in addition to building in features that address bias, uncertainty and explainability at the earliest stages of model development. The scepticism surrounding medical imaging and AI is healthy and, for the most part, warranted. Meaningful steps towards improving the delivery of AI will be made possible by building in features that allow researchers to assess clinical performance, integration within hospital workflows, interactions with clinicians and the downstream risk of socio-demographic harm. We hope that researchers will find this Perspective useful, both for its overview of the challenges that await them at the point of clinical deployment and for its guidance on how some of these issues may be addressed.