Medical Image Analysis

Volume 81, October 2022, 102527

Object recognition in medical images via anatomy-guided deep learning

https://doi.org/10.1016/j.media.2022.102527

Highlights

  • A conceptual framework to synergistically marry the unmatched strengths of high-level human knowledge (natural intelligence) and artificial intelligence to arrive at a robust, accurate, and general object recognition method for medical image analysis.

  • The AAR-DL approach combines an advanced anatomy-modeling strategy (AAR), model-based object recognition (AAR-R), and deep learning object detection networks.

  • AAR-DL consists of 4 key modules wherein prior knowledge is made use of judiciously at every stage.

  • AAR-DL has demonstrated high accuracy and robustness to image artifacts and deviations.

  • AAR-DL performs like an expert human operator in object recognition with localization accuracy within 1–2 voxels and remarkable robustness.

Abstract

Purpose

Despite advances in deep learning, robust medical image segmentation in the presence of artifacts, pathology, and other imaging shortcomings has remained a challenge. In this paper, we demonstrate that by synergistically marrying the unmatched strengths of high-level human knowledge (i.e., natural intelligence (NI)) with the capabilities of deep learning (DL) networks (i.e., artificial intelligence (AI)) in garnering intricate details, these challenges can be significantly overcome. Focusing on the object recognition task, we formulate an anatomy-guided deep learning object recognition approach named AAR-DL which combines an advanced anatomy-modeling strategy, model-based non-deep-learning object recognition, and deep learning object detection networks to achieve expert human-like performance.

Methods

The AAR-DL approach consists of 4 key modules wherein prior knowledge (NI) is made use of judiciously at every stage. In the first module AAR-R, objects are recognized based on a previously created fuzzy anatomy model of the body region with all its organs following the automatic anatomy recognition (AAR) approach wherein high-level human anatomic knowledge is precisely codified. This module is purely model-based with no DL involvement. Although the AAR-R operation lacks accuracy, it is robust to artifacts and deviations (much like NI), and provides the much-needed anatomic guidance in the form of rough regions-of-interest (ROIs) for the following DL modules. The 2nd module DL-R makes use of the ROI information to limit the search region to just where each object is most likely to reside and performs DL-based detection of the 2D bounding boxes (BBs) in slices. The 2D BBs hug the shape of the 3D object much better than 3D BBs and their detection is feasible only due to anatomy guidance from AAR-R. In the 3rd module, the AAR model is deformed via the found 2D BBs providing refined model information which now embodies both NI and AI decisions. The refined AAR model more actively guides the 4th refined DL-R module to perform final object detection via DL. Anatomy knowledge is made use of in designing the DL networks wherein spatially sparse objects and non-sparse objects are handled differently to provide the required level of attention for each.

Results

Utilizing 150 thoracic and 225 head and neck (H&N) computed tomography (CT) data sets of cancer patients undergoing routine radiation therapy planning, the recognition performance of the AAR-DL approach is evaluated on 10 thoracic and 16 H&N organs in comparison to the pure model-based approach (AAR-R) and a pure DL approach without anatomy guidance. Recognition accuracy is assessed via location error/centroid distance error, scale or size error, and wall distance error. The results demonstrate how the errors are gradually and systematically reduced from the 1st module to the 4th module as high-level knowledge is infused via NI at various stages into the processing pipeline. This improvement is especially dramatic for sparse and artifact-prone challenging objects, achieving a location error over all objects of 4.4 mm and 4.3 mm for the two body regions, respectively. The pure DL approach failed on several very challenging sparse objects while AAR-DL achieved accurate recognition, almost matching human performance, showing the importance of anatomy guidance for robust operation. Anatomy guidance also reduces the time required for training DL networks considerably.

Conclusions

(i) High-level anatomy guidance improves recognition performance of DL methods. (ii) This improvement is especially noteworthy for spatially sparse, low-contrast, inconspicuous, and artifact-prone objects. (iii) Once anatomy guidance is provided, 3D objects can be detected much more accurately via 2D BBs than 3D BBs and the 2D BBs represent object containment with much more specificity. (iv) Anatomy guidance brings stability and robustness to DL approaches for object localization. (v) The training time can be greatly reduced by making use of anatomy guidance.

Introduction

Image segmentation is the process of delineating the region occupied by the objects of interest in a given image. This operation is a fundamental first step in numerous applications of medical imagery. In the medical imaging field, this activity has a rich literature spanning over 45 years. In spite of numerous advances, including via deep learning (DL) networks (DLNs) in recent years, the problem has defied a robust, fail-safe, and satisfactory solution, especially for objects that manifest with low contrast, are spatially sparse, have variable shape among individuals, or are sites of imaging artifacts, pathology, or surgical manipulation in the body. Although image processing techniques, notably DLNs, are uncanny in their ability to harness low-level intensity pattern information on objects, they fall short in the high-level task of identifying and localizing an entire object as a gestalt. This dilemma has been a fundamental unmet challenge in medical image segmentation.

In this paper, we consider segmentation as consisting of two dichotomous processes, as first suggested in (Falcao et al., 1998) – recognition and delineation. Recognition is the process of finding the whereabouts of the object in the image, in other words, localizing the object. Delineation is the process of precisely specifying the region occupied by the object. Recognition is a high-level process. It is trivial for knowledgeable humans to recognize objects, especially anatomic organs, in images. Delineation, on the other hand, is a meticulous operation that requires low-level (pixel-level) detailed quantitative information. For knowledgeable humans, it calls for toilsome effort, which makes manual object delineation impractical as a routine approach. In contrast, computer algorithms, particularly DLNs, can outperform humans in reproducibility and efficiency of delineation once accurate recognition help is offered to them. Object recognition is the key first step for segmentation, especially in medical images. Robust and accurate object recognition can facilitate many different delineation methods in achieving satisfactory performance. Recognition alone, i.e., delineation-less segmentation, is also helpful in object quantification (Tong et al., 2019; Wang et al., 2014; Han et al., 2004). Therefore, robust and accurate object recognition is an important operation in medical image analysis.

In this paper, we focus on designing an object recognition system that is able to match knowledgeable humans in performance. We propose to synergistically marry the strengths of human knowledge, or natural intelligence (NI), with the unmatched capabilities of DLNs, or artificial intelligence (AI), to arrive at a robust, accurate, and general object recognition approach for medical image segmentation and analysis.

Object recognition has been investigated for decades, and it has long been found that achieving satisfactory performance based on information from the image alone is difficult. By incorporating prior information derived from human knowledge into the recognition algorithm, recognition performance can be improved considerably. Object recognition methods in the literature can be divided roughly into four categories: those based on object shape and geographic models, atlas-based, machine-learning-based, and deep-learning-based. Object-model-based methods (e.g., Cootes et al., 1995; Staib and Duncan, 1992; Pizer et al., 2003; Shen et al., 2011; Jin et al., 2016; Udupa et al., 2014; Niethammer et al., 2015) capture prior anatomic information in the form of population models wherein object shape, appearance, and geographic layout and their variation are codified (learned) and encoded into the model. To recognize an object in a given image, the models are placed in the image using different initialization strategies and then adjusted to fit the underlying image intensity optimally. Methods differ as to how models are defined, built, deployed, and optimally fit to the image. In atlas-based approaches (e.g., Gee et al., 1993; Christensen et al., 1994; Shattuck et al., 2008; Ashburner and Friston, 2009), a set of registered images (atlases) from different subjects along with the matching segmentations of objects is first created. The given image is then registered to the atlas images, and the appropriately transformed object masks from the atlases are fused to localize the object in the given image. Again, methods differ as to how atlases are selected, registered, optimally matched to the given image, and fused. Machine-learning-based methods have been developed to locate objects by finding bounding boxes (BBs) that enclose objects (e.g., Pauly et al., 2011; Gauriau et al., 2015; Samarakoon et al., 2017; Criminisi et al., 2013; Zhou et al., 2013). Classical machine learning algorithms rely on human feature engineering; their accuracy and generalizability are limited, and they cannot be effectively improved by continually expanding the training set. In recent years, DL-based methods have outperformed traditional methods in many different fields. For object detection in natural images, almost all state-of-the-art methods utilize DL technology (Duan et al., 2019; Liu et al., 2020; Tan et al., 2020; Du et al., 2020; Song et al., 2020). These methods predict BBs and objectness scores for target objects at every pixel in the whole image.
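
The atlas-based methods cited above differ in how they select, register, and fuse atlases; purely as an illustration of the final fusion step, the following minimal Python sketch applies the simplest rule, a majority vote over object masks that have already been propagated (registered) into the target image space. The function name and threshold are illustrative assumptions, and registration is assumed to have been performed elsewhere.

```python
import numpy as np

def fuse_atlas_masks(transformed_masks, vote_fraction=0.5):
    """Majority-vote fusion of object masks that have already been propagated
    (registered) from the atlases into the target image space."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in transformed_masks], axis=0)
    # Keep a voxel if at least `vote_fraction` of the atlases label it as object.
    return stacked.mean(axis=0) >= vote_fraction
```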

Recently, DL-based detection methods have been introduced into medical imaging (e.g., Roth et al., 2018; Oda et al., 2017; De et al., 2016; Mamani et al., 2017, 2018; Zhou et al., 2019; Hussain et al., 2017; de Vos et al., 2017; Liu et al., 2019; Xu et al., 2019a, 2019b) toward the goal of object localization. In some instances, DL is employed as a powerful feature extractor for subsequent machine learning modules (Roth et al., 2018; Oda et al., 2017) to detect challenging organs. To make 2D networks from computer vision applications operate effectively on 3D volume data, one practical solution is to train three independent networks to recognize target organs in 2D slices via BBs extracted separately along axial, sagittal, and coronal planes, and then to combine the output of the three networks to determine the location and size of the target organs (De et al., 2016; Mamani et al., 2017, 2018; Zhou et al., 2019). Slices from the three orthogonal planes can also be fed to a single network as a set of inputs (Hussain et al., 2017; de Vos et al., 2017), which allows the network to combine information from different slice orientations when making decisions. When images from multiple modalities (such as positron emission tomography (PET) and computed tomography (CT)) are available, they can help improve organ recognition performance (Liu et al., 2019). With the rise of 3D convolutional networks (ConvNets), these too have been exploited for organ recognition (Xu et al., 2019a, 2019b); such methods directly predict the geometric parameters of the target organs as 3D BBs. Compared with 2D ConvNets, 3D ConvNets can take full advantage of the 3D spatial information in volumetric images within one forward propagation. However, due to limits on the number of training volumes, difficulty in ground truth generation for 3D ConvNets, and increased non-specificity for objects (see below), implementing 3D ConvNets is more challenging than implementing their 2D counterparts.
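
The works cited above combine the outputs of the three per-plane networks in different ways; as one hedged illustration only, the sketch below derives a single 3D BB by averaging, for each face of the box, the extent estimates contributed by the axial, coronal, and sagittal detections. The coordinate conventions in the docstring are assumptions made for this sketch and do not correspond to any particular cited method.

```python
import numpy as np

def fuse_orthogonal_detections(axial, coronal, sagittal):
    """One possible fusion of per-slice 2D detections from three planes into a 3D BB.
    Assumed (hypothetical) box formats, each dict indexed by slice position:
        axial[z]    = (ymin, xmin, ymax, xmax)
        coronal[y]  = (zmin, xmin, zmax, xmax)
        sagittal[x] = (zmin, ymin, zmax, ymax)
    Returns (zmin, ymin, xmin, zmax, ymax, xmax)."""
    # Each face of the 3D box gets one estimate from each plane; average them.
    z_min = np.mean([min(axial), min(b[0] for b in coronal.values()),
                     min(b[0] for b in sagittal.values())])
    z_max = np.mean([max(axial), max(b[2] for b in coronal.values()),
                     max(b[2] for b in sagittal.values())])
    y_min = np.mean([min(coronal), min(b[0] for b in axial.values()),
                     min(b[1] for b in sagittal.values())])
    y_max = np.mean([max(coronal), max(b[2] for b in axial.values()),
                     max(b[3] for b in sagittal.values())])
    x_min = np.mean([min(sagittal), min(b[1] for b in axial.values()),
                     min(b[1] for b in coronal.values())])
    x_max = np.mean([max(sagittal), max(b[3] for b in axial.values()),
                     max(b[3] for b in coronal.values())])
    return (z_min, y_min, x_min, z_max, y_max, x_max)
```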

The current state-of-the-art in object recognition in medical imagery presents several hurdles which the present study attempts to overcome: (i) Training efficiency and specificity. DL methods face two challenges – appetite for data and loss of specificity due to a large problem domain. The two issues are inter-related. Recognizing multiple organs in an entire body region under the assumption that every object can appear anywhere in the whole image is vastly different from recognizing each object within an object-shaped container region where the object is anatomically most likely to reside. Our anatomic model-based automatic anatomy recognition (AAR) strategy (Udupa et al., 2014) (not a DL method) provides the container information needed to train the DL networks, so training becomes efficient and specific to each object. (ii) 2D vs. 3D issues. 3D BBs, when found properly, can encapsulate the whole 3D object. However, the price to pay is a large number of labeled 3D object sets. Depending on the shape and orientation of the object, a 3D BB may also encapsulate other nearby objects, minimizing specificity. While 2D BBs do not have these issues, they lack 3D context. With the anatomy-guided AAR container idea, once an object-shaped region of interest is determined via AAR, a tight-fitting stack of 2D BBs can be found within this region via appropriately trained DL networks, overcoming the above issues with both 3D and 2D BBs and considerably improving recognition accuracy beyond AAR's recognition. See Fig. 1. (iii) Generalizability across objects. This anatomy-guided DL recognition strategy is general and works for any object with modellable anatomy. (iv) Handling pathological objects. Lacking anatomic context, BB methods run into difficulty when object shape is distorted due to pathology. This issue is overcome considerably by the proposed approach.
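
The container idea in (ii) can be pictured as follows: each slice is cropped to the in-plane extent of the AAR-provided ROI (with a small margin), a 2D detector is run only on that crop, and the detected box is mapped back to full-image coordinates. The minimal sketch below assumes a generic per-slice detector callable (detect_2d) and a simple margin parameter; it is an illustration of the idea, not the authors' DL-R network.

```python
import numpy as np

def detect_in_roi(volume, roi_mask, detect_2d, margin=10):
    """Per-slice 2D BB detection restricted to an object-shaped ROI.
    `detect_2d` is a hypothetical single-slice detector returning
    (ymin, xmin, ymax, xmax) in crop coordinates, or None if nothing is found."""
    boxes = {}
    for z in range(volume.shape[0]):
        ys, xs = np.nonzero(roi_mask[z])
        if ys.size == 0:                       # the ROI says the object cannot be here
            continue
        y0 = max(int(ys.min()) - margin, 0)
        y1 = min(int(ys.max()) + margin + 1, volume.shape[1])
        x0 = max(int(xs.min()) - margin, 0)
        x1 = min(int(xs.max()) + margin + 1, volume.shape[2])
        box = detect_2d(volume[z, y0:y1, x0:x1])
        if box is not None:                    # map the box back to full-slice coordinates
            boxes[z] = (box[0] + y0, box[1] + x0, box[2] + y0, box[3] + x0)
    return boxes
```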

A schematic representation of the proposed approach, which we will refer to as AAR-DL, is shown in Fig. 2. The AAR-DL approach, described in Section 2, consists of 4 stages which are shown by 4 modules in the figure – AAR-R, DL-R, rAAR-R, and rDL-R.
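
To make the data flow between the four modules concrete, the following minimal Python sketch chains them as interchangeable callables, one per module. The function name aar_dl_recognize and the callable arguments are hypothetical placeholders for illustration; they do not correspond to the authors' released code or APIs.

```python
def aar_dl_recognize(image, objects, aar_r, dl_r, raar_r, rdl_r):
    """Sketch of the AAR-DL pipeline; each callable stands in for one module."""
    results = {}
    for obj in objects:
        roi = aar_r(image, obj)          # AAR-R: model-based fuzzy region of interest
        bb = dl_r(image, roi, obj)       # DL-R: stack of 2D bounding boxes inside the ROI
        refined_roi = raar_r(roi, bb)    # rAAR-R: fuzzy model deformed to fit the boxes
        refined_bb = rdl_r(image, refined_roi, obj)  # rDL-R: refined stack of 2D boxes
        results[obj] = refined_bb
    return results
```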

The first module AAR-R performs recognition of objects in a given body region B via the AAR approach (Udupa et al., 2014) by making use of the already created anatomy model of the objects in B, as briefly outlined in Section 2.1. It outputs a fuzzy model mask FMt(O) for each object O included in the AAR anatomy model which indicates where object O is likely to be in input image I. FMt(O) provides the needed region-of-interest (ROI) information in the form of a container, shaped like O, albeit fuzzy, for the second module DL-R, as described in Section 2.2. DL-R outputs a stack BB(O) of 2D BBs for each O by detecting the BBs in each slice within the ROI for O. In the 3rd module rAAR-R (for refined AAR-R), described in Section 2.3, the AAR fuzzy model is refined by making use of the more precise localization information in BB(O) provided by DL-R. rAAR-R deforms the fuzzy model and outputs a refined recognition result (model) rFMt(O) for each object O. Finally, rDL-R (for refined DL-R), described in Section 2.4, performs refined DL-based recognition by utilizing the refined fuzzy model rFMt(O) as an additional input channel and outputs a refined stack of 2D BBs, rBB(O), for each object O. This module has its own pre-trained network which is denoted rDL-R model. We present the implementation details and the method of evaluation of AAR-DL in Section 2.5.
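
The rAAR-R deformation itself is described in Section 2.3; as a rough, assumption-laden stand-in, the sketch below simply scales and translates the fuzzy model so that its thresholded support matches the 3D extent implied by the detected stack of 2D BBs. The function name, the box format (z mapped to (ymin, xmin, ymax, xmax)), and the use of a plain affine warp are illustrative assumptions, not the paper's actual refinement procedure.

```python
import numpy as np
from scipy.ndimage import affine_transform

def refine_fuzzy_model(fuzzy_model, bb_stack, threshold=0.5):
    """Scale and translate a fuzzy model so that its thresholded support matches
    the 3D extent of a detected stack of 2D BBs (z -> (ymin, xmin, ymax, xmax))."""
    zs = sorted(bb_stack)
    # Target 3D extent implied by the stack of 2D boxes, in (z, y, x) order
    tgt_min = np.array([zs[0],
                        min(b[0] for b in bb_stack.values()),
                        min(b[1] for b in bb_stack.values())], dtype=float)
    tgt_max = np.array([zs[-1],
                        max(b[2] for b in bb_stack.values()),
                        max(b[3] for b in bb_stack.values())], dtype=float)
    # Current 3D extent of the fuzzy model's support
    idx = np.array(np.nonzero(fuzzy_model >= threshold), dtype=float)
    mdl_min, mdl_max = idx.min(axis=1), idx.max(axis=1)
    # Output voxel o is sampled from input voxel (scale * o + offset)
    scale = (mdl_max - mdl_min) / np.maximum(tgt_max - tgt_min, 1.0)
    offset = mdl_min - scale * tgt_min
    return affine_transform(fuzzy_model, np.diag(scale), offset=offset, order=1)
```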

In Section 3, we perform recognition accuracy analyses of the entire AAR-DL system and assess the improvements achieved at each stage based on 150 and 225 CT data sets of the thorax and head and neck body regions, respectively. We also perform a comprehensive comparison of AAR-DL's performance with object localization accuracy reported in the literature for the two body regions. We summarize our conclusions in Section 4.
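
For reference while reading Section 3, the sketch below shows one plausible way to compute a location (centroid distance) error in millimeters and a simple size ratio from a predicted and a reference binary localization mask. The paper's exact definitions of location, size, and wall distance errors are given in Section 2.5; this illustration, including the function name and the (z, y, x) spacing convention, is an assumption and is not claimed to match them.

```python
import numpy as np

def location_and_size_error(pred_mask, true_mask, spacing):
    """Centroid distance (mm) and volume ratio between a predicted and a
    reference binary localization mask; `spacing` is voxel size in mm (z, y, x)."""
    spacing = np.asarray(spacing, dtype=float)
    c_pred = np.array(np.nonzero(pred_mask)).mean(axis=1) * spacing
    c_true = np.array(np.nonzero(true_mask)).mean(axis=1) * spacing
    location_error = float(np.linalg.norm(c_pred - c_true))   # mm
    size_ratio = pred_mask.sum() / max(int(true_mask.sum()), 1)
    return location_error, size_ratio
```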

The key novelties/contributions of this paper are five-fold: (i) A DL approach for recognizing anatomic objects that is guided at different stages explicitly by anatomic knowledge encoded via geometric and geographic population anatomy models. (ii) Differential treatment of spatially sparse, inconspicuous objects that are challenging to recognize versus large, easy-to-recognize objects. (iii) An accurate and efficient object recognition DL network for effectively utilizing prior information. (iv) Demonstration that for anatomy-guided DL organ recognition, 2D BBs are more effective than 3D BBs. (v) Demonstration of the robustness of the method across even the most challenging objects in the presence of deviations due to artifacts, pathology, and surgical manipulation.

Section snippets

Automatic anatomy recognition (AAR-R module)

Since the AAR framework has been previously published (Udupa et al., 2014), here we will give a brief summary to the extent needed for understanding the AAR-R module (Fig. 2) in the proposed AAR-DL approach. AAR is a general approach, developed before the advent of DL techniques, based on fuzzy anatomy modeling for recognizing and delineating all objects in a body region. It consists of three stages – model building, object recognition, and object delineation. Since the goal of this paper is

Experiments, results, and discussion

We describe the image data sets in Section 3.1, and illustrate and evaluate recognition results based on two body regions – Thorax and Head & Neck (H&N) – in Section 3.2. We also discuss computational considerations in Section 3.3. In addition, we compare AAR-DL with closely related methods from the literature.

Conclusion

The main theme of this paper is synergistically marrying the unmatched strengths of high-level human knowledge (i.e., natural intelligence (NI)) with the unmatched capabilities of deep learning networks (i.e., artificial intelligence (AI)) to arrive at a robust, accurate, and general object recognition method for medical image analysis. We proposed an anatomy-model-guided deep learning object recognition approach named AAR-DL which combines an advanced anatomy-modeling strategy (AAR), model-based object recognition (AAR-R), and deep learning object detection networks.

Main contributions

  • A deep learning (DL) approach for recognizing anatomic objects that is guided at different stages explicitly by anatomic knowledge encoded via geometric and geographic population anatomy models.

  • Differential treatment of spatially sparse, inconspicuous objects that are challenging to recognize versus large, easy-to-recognize objects.

  • An accurate and efficient object recognition DL network for effectively utilizing prior information.

  • Demonstration that for anatomy-guided DL organ recognition, 2D BBs are more effective than 3D BBs.

  • Demonstration of the robustness of the method across even the most challenging objects in the presence of deviations due to artifacts, pathology, and surgical manipulation.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

Acknowledgements

The research reported in this paper is supported by a grant from the National Cancer Institute R42 CA199735. Data sets utilized in this research were generated from a prior grant R41 CA199735.

References (53)

  • Tong, Y., et al., 2019. Disease quantification in PET/CT images without explicit object delineation. Med. Image Anal.

  • Udupa, J.K., et al., 2014. Body-wide hierarchical fuzzy modeling, recognition, and delineation of anatomy in medical images. Med. Image Anal.

  • Wu, X., et al., 2019. AAR-RT – A system for auto-contouring organs at risk on CT images for radiation therapy planning: principles, design, and large-scale evaluation on head-and-neck and thoracic cancer cases. Med. Image Anal.

  • Abadi, M., et al. TensorFlow: a system for large-scale machine learning.

  • Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. YOLOv4: optimal speed and accuracy of object detection. arXiv.

  • Christensen, G., et al., 1994. 3-D brain mapping using a deformable neuroanatomy. Phys. Med. Biol.

  • De Vos, B.D., et al., 2016. 2D image classification for 3D anatomy localization: employing deep convolutional neural networks.

  • de Vos, B.D., et al., 2017. ConvNet-based localization of anatomical structures in 3-D medical images. IEEE Trans. Med. Imaging.

  • Du, X., et al. SpineNet: learning scale-permuted backbone for recognition and localization.

  • Duan, K., et al. CenterNet: keypoint triplets for object detection.

  • Fu, J., et al. Dual attention network for scene segmentation.

  • Gee, J., et al., 1993. Elastically deforming 3D atlas to match anatomical brain images. J. Comput. Assist. Tomogr.

  • Ghiasi, G., et al. NAS-FPN: learning scalable feature pyramid architecture for object detection.

  • Hussain, M.A., et al. Segmentation-free kidney localization and volume estimation using aggregated orthogonal decision CNNs.

  • Jin, C., et al., 2016. 3D fast automatic segmentation of kidney based on modified AAM and random forest. IEEE Trans. Med. Imaging.

  • Lambert, Z., et al. SegTHOR: segmentation of thoracic organs at risk in CT images.
