Abstract

More and more research on left ventricle quantification skips segmentation because segmentation requires large amounts of pixel-level labels. In this study, a framework is developed to directly quantify multiple left ventricle indices without segmentation. First, DenseNet is used to extract spatial features from each cardiac frame. Then, to take advantage of the temporal information, the features of consecutive frames are encoded with a gated recurrent unit (GRU). After that, an attention mechanism is integrated into the decoder to effectively establish the mapping between the input sequence and the corresponding output sequence. A regression layer on top of the decoder output predicts the multiple indices of the left ventricle. Different weights are set for the different types of indices based on experience, and the l2-norm is used to avoid overfitting. Compared with the state of the art (SOTA), our method not only produces competitive results but is also more flexible: it can output predictions for each frame online, whereas the SOTA can only output results after all frames have been analyzed.

1. Introduction

Great importance has been attached to left ventricle quantification for the identification and diagnosis of cardiac disease in clinical routine. It can be used to evaluate the current condition of a patient and then to make appropriate judgments on the prognosis and outcome of the disease. In recent years, Cardiac Magnetic Resonance (CMR) has become a crucial imaging modality in clinical cardiology practice due to its high signal-to-noise ratio, noninvasive imaging of the cardiac chambers and great vessels, and freedom from geometric assumptions [1, 2]. Although much effort has been devoted to left ventricle quantification over the last several decades [2–8], it remains in the research stage, and the reported algorithms are still not robust and flexible enough to support clinical practice due to the complexity of medical imaging. Therefore, left ventricle quantification is still acknowledged as a challenge with much room for improvement in robustness, flexibility, and accuracy.

In general, left ventricle quantification can be implemented according to Simpson’s rule [9] after the epicardial and endocardial contours are delineated on the short-axis slices [5, 8]. However, due to the characteristics of cardiac MR images and their great variability among patients, left ventricle segmentation in MRI remains a challenge. On the one hand, segmentation complexity depends on the slice level of the image; for example, apical and basal slices are more difficult to segment than midventricular ones. On the other hand, segmentation-based methods usually introduce two phases to quantify the left ventricle, which makes it difficult to optimize the framework as a whole. At present, end-to-end frameworks are increasingly popular with the development of deep learning technology [3, 10].

In this paper, to achieve a reliable and accurate solution for quantifying the left ventricle from short-axis cardiac MR images, an elegant framework is built based on DenseNet [11] and a GRU [12]-based encoder-decoder with attention. Specifically, structural parameters such as myocardium and cavity areas, regional wall thicknesses, and cavity dimensions (Figure 1) are evaluated for each frame. The main contributions of this work are as follows.

Firstly, a shared, end-to-end framework is proposed to directly quantify left ventricle indices without segmentation. All frames share the same DenseNet to extract their spatial features, which makes the model more concise and helps avoid overfitting.

Secondly, to capture the temporal information across consecutive frames, the extracted features of consecutive frames are encoded by GRU blocks. GRU has a great ability to retain long-term dependencies in sequential data and, in comparison with its sibling LSTM, achieves similar accuracy with fewer parameters and faster training.

Finally, the encoded temporal features of consecutive frames and the features of each frame are mapped to final representations via a decoder constructed from GRU blocks with attention, which allows the decoder to focus on local or global features. The final representations are input to the regression layer to quantify the indices of the frame.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed deep encoder-decoder recurrent neural network with attention. Section 4 introduces the experimental settings and the dataset. Section 5 presents the results and detailed discussion. Section 6 concludes the paper.

2. Related Work
The final goal of left ventricle research is to obtain structural and functional parameters for accurate identification and diagnosis of cardiac diseases. These parameters may be low-level, such as myocardium and cavity areas, regional wall thicknesses, and cavity dimensions, or high-level, such as End-Systolic Volume (ESV), End-Diastolic Volume (EDV), and Ejection Fraction (EF). Plenty of left ventricle quantification methods have been reported in the past few decades, which can mainly be divided into two categories: segmentation-based methods and regression-based methods.

2.1. Segmentation-Based Methods

In practical clinical medicine, doctors generally obtain reliable quantification by manually contouring the borders of the myocardium, which is time-consuming and unfavourable for diagnostic automation [2]. In addition, manual contouring is prone to intra- and interobserver variability. Therefore, much research on automated segmentation has been carried out, most of which focuses on segmenting the left ventricle using mathematical calculations or deep learning methods. This is because every frame of all slices can be quantified according to Simpson’s rule after segmentation.

2.1.1. Segmentation Based on Mathematical Calculations

Segmentation based on mathematical calculations delineates the endo- and/or epicardial boundaries in all frames of a cardiac sequence according to the shape characteristics of the left ventricle. However, to increase robustness and accuracy, most mathematical segmentation methods require prior knowledge, problem-specific assumptions, or user interaction [13–15].

2.1.2. Deep Learning-Based Segmentation

A lot of research has been dedicated to deep learning-based segmentation since the AlexNet model achieved remarkable classification accuracy in the ImageNet Large Scale Visual Recognition Challenge in 2012 [16]. Deep learning-based segmentation of the left ventricle keeps pace with the development of deep learning technology for segmentation in general. Representative methods include CNNs [4, 5, 17], sliding windows [18–20], Fully Convolutional Networks [6, 21–23], U-Net [7, 8, 24], and 3D convolutions [25–27].

Once the segmentation of all frames in a cardiac sequence is accomplished, the structural and functional parameters can be calculated from the MRI parameters with fixed methods such as Simpson’s rule. However, most mathematics-based segmentation requires prior knowledge or user intervention, while deep learning-based segmentation requires a large number of training samples, each of which needs to be labeled pixel by pixel to obtain robust, good results.

2.2. Regression-Based Methods

To reduce the large amount of tedious labeling work and user intervention, much research skips the segmentation process and directly obtains the cardiac parameters by regression on features extracted from cardiac MRI. These regression methods can be divided into two kinds, based on hand-crafted features and on automatically learned deep features. In earlier years, almost all direct regression methods extracted hand-crafted features for the regression of cardiac parameters due to the limitation of computing power. These methods usually follow a common two-phase framework: cardiac image representation and indices prediction. First, cardiac images are represented by hand-crafted features or features obtained by unsupervised learning, such as Bhattacharyya coefficients between image distributions [28, 29], appearance features [30], multiple low-level image features [31], features from multiscale convolutional deep belief networks [32], and supervised descriptor learning [33]. Then, cardiac indices are estimated from these features with a separate regression model.

Since 2012, with the development of deep learning technology, regression based on automatically learned deep features has sparked an impressive research effort, with promising performance and a breadth of techniques [4, 6, 10, 30, 32, 33]. With its unique capabilities of end-to-end learning, pipeline simplicity, and no loss of information, deep learning-based regression is increasingly favored by researchers. In this paper, we follow this trend and develop an end-to-end framework to quantify left ventricle indices directly, without segmentation.

3. Methods

The proposed model consists of three tightly integrated modules. First, the deep convolutional neural network DenseNet is used for cardiac image (frame) representation; then, an encoder-decoder structure built from GRU blocks is deployed for temporal dynamic modeling of the cardiac sequence. To emphasize the contribution of the current input frame, an attention model is added to the decoder network. Finally, the outputs of the encoder-decoder module are fed into a multitask learning module for left ventricle indices estimation.

3.1. Framework and Problem Formulation

The overview of the proposed framework is shown in Figure 2. To describe the problem, we use X = {X^(s,f)} to represent the entire dataset, and the left ventricle indices are denoted as Y = {y_area^(s,f), y_dim^(s,f), y_rwt^(s,f)}, where s indicates the subject and f indicates the frame. Here, y_area is a two-dimensional vector describing the areas of the myocardium and the left ventricle cavity, y_dim is a three-dimensional vector describing three left ventricle cavity dimensions, and y_rwt is a six-dimensional vector describing the regional wall thickness in six directions. Then, the proposed framework can be modelled as a regression problem with input {X^(s,f−n+1), …, X^(s,f)} and output Y^(s,f). That is, the n consecutive frames up to and including the current frame f are used to predict the indices of frame f. For the input, the frames are regarded as a circular sequence, since the heartbeat is a cyclic process; accordingly, X^(s,f) is substituted by X^(s,f+F) whenever f ≤ 0, where F equals 20 on the DIG-Cardiac database [4]. The algorithm description is presented in Algorithm 1 to state the framework more clearly.
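As a concrete illustration of this circular-sequence convention, the following minimal helper (the function name is ours, for illustration only) maps any frame index back into a 20-frame cardiac cycle:

```python
def circular_frame_index(f, F=20):
    """Map a possibly non-positive frame index f onto the cardiac
    cycle of F frames, treating the sequence as circular.

    Frames are numbered 1..F; an index f <= 0 wraps around to f + F,
    mirroring the substitution X^f -> X^(f+F) described in the text.
    """
    while f <= 0:
        f += F
    return f

# e.g. with n = 3 input frames, predicting frame 1 uses frames -1, 0, 1,
# which wrap to 19, 20, 1 in a 20-frame cycle
window = [circular_frame_index(f) for f in range(-1, 2)]
```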

Model: DenseNet, Encoder, and Decoder, as shown in Figure 2
Input: n consecutive frames X^(f−n+1), …, X^(f)
Output: indices Y^(f) at frame f
Algorithm: predict the indices Y^(f) for frame f
(1) S ← [ ] // sequence vector
(2) for i = f − n + 1 to f do
(3)  v_i = DenseNet(X^(i)) // feature vector
(4)  S ← S ⊕ v_i // ⊕ is vector concatenation
(5) end for
(6) E = Encoder(S) // encoded vector
(7) d ← E // initialize the decoder state with the encoded vector
(8) for i = f − n + 1 to f do
(9)  d ← Decoder(d, v_i) // decode with attention
(10) end for
(11) return Y^(f) // the last output of the Decoder
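Algorithm 1 can be sketched end to end in PyTorch as follows. This is an illustrative sketch only: a tiny CNN stands in for DenseNet, the attention step of Section 3.3 is omitted, and all layer sizes and names are our own choices, not the configuration of Table 1.

```python
import torch
import torch.nn as nn

class LVQuantNet(nn.Module):
    """Sketch of the pipeline: a shared CNN extracts a spatial feature
    per frame, a 2-layer GRU encoder summarizes the sequence, a 2-layer
    GRU decoder re-reads the per-frame features, and a regression layer
    maps the last decoder output to 11 indices (2 areas + 3 cavity
    dimensions + 6 regional wall thicknesses)."""
    def __init__(self, feat_dim=64, hidden=128, n_indices=11):
        super().__init__()
        self.cnn = nn.Sequential(                    # stand-in for DenseNet
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim),
        )
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.regress = nn.Linear(hidden, n_indices)

    def forward(self, frames):                       # frames: (B, n, 1, 80, 80)
        B, n = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, n, -1)  # shared CNN
        _, h = self.encoder(feats)                   # encoded sequence state
        out, _ = self.decoder(feats, h)              # re-read per-frame features
        return self.regress(out[:, -1])              # indices of current frame

model = LVQuantNet()
pred = model(torch.zeros(2, 3, 1, 80, 80))           # 2 samples, 3 frames each
```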

In the proposed framework, both the encoder and the decoder apply a two-layer GRU to the corresponding input sequence. For each element in the input sequence, each layer computes the following functions:

r_t = σ(W_ir x_t + b_ir + W_hr h_(t−1) + b_hr),
z_t = σ(W_iz x_t + b_iz + W_hz h_(t−1) + b_hz),
n_t = tanh(W_in x_t + b_in + r_t ∘ (W_hn h_(t−1) + b_hn)),
h_t = (1 − z_t) ∘ n_t + z_t ∘ h_(t−1),

where h_t is the hidden state at time t, x_t is the input at time t, h_(t−1) is the hidden state of the layer at time t−1 or the initial hidden state at time 0, and r_t, z_t, and n_t are the reset, update, and new gates, respectively. σ is the sigmoid function, and ∘ is the Hadamard product.
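Written out as code, one GRU step following these gate equations looks like the following numpy sketch (weights are randomly initialized toy values; sizes and names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_ir, W_iz, W_in, W_hr, W_hz, W_hn,
             b_ir, b_iz, b_in, b_hr, b_hz, b_hn):
    """One GRU step, term by term the update equations above."""
    r_t = sigmoid(W_ir @ x_t + b_ir + W_hr @ h_prev + b_hr)          # reset gate
    z_t = sigmoid(W_iz @ x_t + b_iz + W_hz @ h_prev + b_hz)          # update gate
    n_t = np.tanh(W_in @ x_t + b_in + r_t * (W_hn @ h_prev + b_hn))  # new gate
    return (1.0 - z_t) * n_t + z_t * h_prev                          # h_t

# toy sizes: input dimension 3, hidden dimension 2
rng = np.random.default_rng(0)
Wi = [rng.standard_normal((2, 3)) for _ in range(3)]   # W_ir, W_iz, W_in
Wh = [rng.standard_normal((2, 2)) for _ in range(3)]   # W_hr, W_hz, W_hn
b = [np.zeros(2) for _ in range(6)]                    # all biases zero
h = gru_step(rng.standard_normal(3), np.zeros(2), *Wi, *Wh, *b)
```

Because n_t is a tanh output and z_t lies in (0, 1), the new hidden state is a convex combination of the candidate state and the previous state, which is what lets the GRU retain long-term dependencies.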

3.2. Spatial Feature Extraction Using DenseNet

DenseNet, with compelling advantages such as alleviating the vanishing-gradient problem, strengthening feature propagation, encouraging feature reuse, and substantially reducing the number of parameters, has an excellent capability for capturing spatial features. In contrast to ResNet, the ℓ-th layer receives the feature maps of all preceding layers x_0, …, x_(ℓ−1) as input, which further improves the information flow between layers:

x_ℓ = H_ℓ([x_0, x_1, …, x_(ℓ−1)]), (1)

where [x_0, x_1, …, x_(ℓ−1)] refers to the concatenation of the feature maps produced in layers 0, …, ℓ − 1. This encourages feature reuse throughout the network and leads to more compact, parameter-efficient models. In this study, DenseNet is used to extract spatial features for each frame. Our DenseNet in Figure 2 contains a convolutional layer (conv1) capturing low-level features from the images, three DenseBlocks (each DenseBlock includes one Bottleneck (0) and one Bottleneck (1)), and two Transition layers between adjacent dense blocks. Table 1 lists more details of the DenseNet network.
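The concatenation pattern of this dense connectivity can be sketched as a minimal dense block in PyTorch. This is a toy illustration of the growth of input channels, not the exact architecture of Table 1; all sizes are our own choices.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Illustrative dense block: layer l receives the concatenation
    [x_0, x_1, ..., x_{l-1}] of all preceding feature maps, so each
    3x3 convolution sees in_ch + l * growth input channels."""
    def __init__(self, in_ch=4, growth=4, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + l * growth), nn.ReLU(),
                nn.Conv2d(in_ch + l * growth, growth, 3, padding=1),
            )
            for l in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # H_l([x_0..x_{l-1}])
        return torch.cat(feats, dim=1)

block = TinyDenseBlock()
y = block(torch.zeros(1, 4, 16, 16))   # output channels: 4 + 3 * 4 = 16
```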

To better interpret what DenseNet learns from the frames, the feature maps of DenseNet at different layers are visualized (Figure 3); the activation feature map of the second convolution layer in the first Bottleneck of each DenseBlock is presented. From the visualization, it can be seen that the shallow layers preserve detailed information such as contours and texture, while the deeper layers extract more abstract features, so that their feature maps can hardly be interpreted with the naked eye. For convenience, the convolution layers from the first, second, and third DenseBlocks are named DB1Conv, DB2Conv, and DB3Conv. The filters of DB1Conv capture low-level visual features from the cardiac frames, such as the low-frequency appearance and textures of the ventricles. Taking the low-level features of DB1Conv as input, the subsequent block (DB2Conv) extracts more complex cardiac information by combining low-level features; high-frequency noise and textures are gradually discarded in this procedure. Lastly, the detailed spatial information of the feature maps in DB3Conv becomes increasingly blurred. Instead, they capture more abstract, higher-level features in each local area from the preceding feature maps during training.

3.3. Temporal Information Encoding and Decoding Using GRUs with Attention

The heartbeat is a continuous, cyclic process from systole to diastole, so the preceding frame sequence is beneficial to evaluating the parameters of the current frame. For sequence modelling, compared with LSTM, GRU is easier to modify and does not need memory units, which results in less training time with similar performance. In this paper, GRUs are used to model the input sequence of several continuous frames, and the modelled sequence is then decoded to predict the parameters of the current frame. Human beings can quickly grasp the trends and relationships between the input sequence and the output with the naked eye; a neural network, however, cannot detect these relationships so directly. Thus, the attention mechanism is integrated into the decoder to learn these relationships through gradient descent and backpropagation. Figure 4 presents the decoding process with the attention mechanism.
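One common way to realize the attention step in the decoder is additive (Bahdanau-style) attention over the encoder outputs; the paper does not pin down the exact form, so the sketch below is a generic assumption, with sizes and names chosen by us:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score each encoder output against the current decoder state,
    normalize the scores with softmax, and return the weighted context
    vector. A generic additive-attention sketch, not necessarily the
    exact mechanism of Figure 4."""
    def __init__(self, enc_dim, dec_dim, attn_dim=32):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, dec_dim); enc_outputs: (B, T, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                                 # (B, T)
        weights = torch.softmax(scores, dim=1)         # attention weights
        context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=1)
        return context, weights

attn = AdditiveAttention(enc_dim=128, dec_dim=128)
ctx, w = attn(torch.zeros(2, 128), torch.zeros(2, 3, 128))
```

The softmax weights let the decoder focus on the most relevant frames of the input sequence, which is the "local or global features" behavior described above.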

4. Experiment and Configuration

4.1. Dataset

Our approach is evaluated on the DIG-Cardiac database [4]. It consists of 145 cardiac MR image sequences collected from 3 hospitals affiliated with two health care centers (London Healthcare Center and St. Josephs HealthCare). The subjects’ ages range from 16 to 97 years, with an average of 58.9 years. The pixel spacings of the MR images range from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mode of 1.5625 mm/pixel. Diverse pathologies are present, including regional wall motion abnormalities, myocardial hypertrophy, mildly enlarged left ventricle, atrial septal defect, and left ventricle dysfunction. Each subject contains 20 frames throughout a cardiac cycle. In each frame, the left ventricle is divided into equal thirds (basal, midcavity, and apical) perpendicular to the long axis of the heart following the standard AHA prescription, and a representative midcavity slice is selected for this database. All cardiac images undergo several preprocessing steps, including landmark labeling, rotation, ROI cropping, and resizing. The resulting images are approximately aligned, with a dimension of 80 × 80. These cardiac images are then manually contoured to obtain the epicardial and endocardial borders, which are double-checked by two experienced cardiac radiologists (A. Islam and M. Bhaduri). The ground-truth values of the left ventricle indices and cardiac phase can be easily obtained from the two borders. The values of RWT and cavity dimensions are normalized by the image dimension, while the areas are normalized by the pixel number (6400). During evaluation, the obtained results need to be converted into physical thickness (in mm) and area (in mm²) by reversing the resizing procedure and multiplying by the pixel spacing for each subject.

4.2. Implementation and Configuration

Our model is built, trained, and tested with PyTorch [34]. The neural network is trained by minimizing L1_Loss as the objective function, optimized with stochastic gradient descent. The model is initialized with the default methods of PyTorch; that is, the convolution layers are initialized with the Xavier method, and the linear layers with random values sampled from a uniform distribution. The initial learning rate is gradually reduced during training; the momentum is 0.9, and weight decay is applied. In our experiments, 5-fold cross-validation is employed for performance evaluation and comparison. The dataset is divided into 5 groups according to subject number from 1 to 145, each containing 29 subjects. In each experiment, four groups are employed to train the prediction model, and the remaining group is used for testing. This procedure is repeated five times until the indices of all subjects are obtained. The model is trained on a server with the following configuration: CPU, Intel(R) Core(TM) i7-9700K @ 3.60 GHz (8 CPUs); memory, 16384 MB RAM; GPU, NVIDIA GeForce RTX 2080 Ti with 11048 MB dedicated memory. With a batch size of 8 and 3 input frames, it takes more than 20 hours to complete a training run of 200 epochs.
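The optimization setup and the subject-wise 5-fold split can be sketched as follows. The learning-rate value and schedule parameters below are placeholders (the exact values are not reproduced here), as is the weight-decay coefficient; only the momentum (0.9), the L1 objective, the subject range (1 to 145), and the group size (29) come from the text.

```python
import torch

# SGD with momentum 0.9; lr, step_size, gamma, and weight_decay are
# placeholder values for illustration, not the paper's exact settings.
model = torch.nn.Linear(10, 11)            # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
criterion = torch.nn.L1Loss()              # L1_Loss objective

# 5-fold split by subject number: subjects 1..145, 29 subjects per group
subjects = list(range(1, 146))
folds = [subjects[i * 29:(i + 1) * 29] for i in range(5)]
```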

4.3. Evaluation Criteria and Data Conversion
4.3.1. Evaluation Criteria

The proposed framework is evaluated in terms of the mean absolute error (MAE) and the correlation coefficient between the ground truth and the prediction. That is, over all frames in the cardiac cycle, the MAE and correlation coefficient are computed for each left ventricle index according to equations (2) and (3):

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|, (2)

ρ = Σ_{i=1}^{N} (y_i − μ_y)(ŷ_i − μ_ŷ) / sqrt(Σ_{i=1}^{N} (y_i − μ_y)² · Σ_{i=1}^{N} (ŷ_i − μ_ŷ)²), (3)

where y_i and ŷ_i are the ground truth and the prediction, N is the number of frames, and μ_y = (1/N) Σ_{i=1}^{N} y_i and μ_ŷ = (1/N) Σ_{i=1}^{N} ŷ_i. The final MAE and correlation coefficient are averaged over all frames of the testing subjects.
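The two criteria can be computed as in the following numpy sketch, where the correlation is the standard Pearson coefficient (function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between ground truth and prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def corr(y_true, y_pred):
    """Pearson correlation coefficient between ground truth and prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    dt, dp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return np.sum(dt * dp) / np.sqrt(np.sum(dt ** 2) * np.sum(dp ** 2))

m = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])   # mean of |errors| = 1/3
r = corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # linearly related, so r = 1.0
```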

4.3.2. Data Conversion

The values of Rwts, Dims, and Areas provided in the DIG-Cardiac database are normalized versions of the actual measurements. These Original Values (OV) are used to train and test the model. Afterwards, the output values are converted to Physical Measurements (PM) to evaluate the proposed method according to the database specification. For Areas, the formula is as follows:

PM_Area = OV_Area × 6400 × ratio_resize_inverse² × pix_spacing².

For Dims and Rwts, the formula is as follows:

PM = OV × 80 × ratio_resize_inverse × pix_spacing,

where pix_spacing is the original pixel spacing (in mm) of the MR images and ratio_resize_inverse is the ratio to reverse the resizing procedure, provided with the DIG-Cardiac database; 6400 and 80 are the normalizing parameters used in the database (the pixel number 80 × 80 and the image dimension 80, respectively).
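The conversion can be sketched as follows, assuming, per the normalization described in Section 4.1, that areas are normalized by the pixel count (6400) and lengths by the image dimension (80); function names are ours, and the spacing factor is squared for areas since they are two-dimensional:

```python
def area_to_mm2(ov, pix_spacing, ratio_resize_inverse):
    """Convert a normalized Area back to mm^2: undo the pixel-count
    normalization (6400 = 80 * 80), reverse the resize, and apply the
    pixel spacing, both squared because area is two-dimensional."""
    return ov * 6400 * (ratio_resize_inverse ** 2) * (pix_spacing ** 2)

def length_to_mm(ov, pix_spacing, ratio_resize_inverse):
    """Convert a normalized Dim/Rwt back to mm: undo the image-dimension
    normalization (80), reverse the resize, and apply the pixel spacing."""
    return ov * 80 * ratio_resize_inverse * pix_spacing

# example with the modal pixel spacing of the database (1.5625 mm/pixel)
a = area_to_mm2(0.25, 1.5625, 1.0)   # 0.25 * 6400 * 1.5625^2
d = length_to_mm(0.5, 1.5625, 1.0)   # 0.5 * 80 * 1.5625
```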

5. Results and Analysis

5.1. Performance Evaluation of the Proposed Method

After the 5-fold cross-validation experiments, the prediction results for the multiple indices are shown in Table 2. In the experiments, the parameter SeqLen is set to 3, a small value chosen for computational efficiency. The learned model predicts all the indices well for each frame: the average MAEs for Area, Dim, and Rwt are 209.74 mm², 2.48 mm, and 1.57 mm, while the maxima of these indices in the DIG-Cardiac database reach 4936 mm², 81.0 mm, and 24.4 mm, respectively. The good performance is attributed to the GRU, which extracts features and preserves the temporal variation trend; both the temporal features from the frame sequence and the spatial features from the current frame contribute to the prediction improvement.

5.2. Comparison with SOTA and Other Baselines

Our framework provides competitive results compared to the SOTA (see details in Table 3); the MAE of Dim is slightly lower than that of the SOTA. In addition, our approach is more flexible than the SOTA: the former can predict the indices frame by frame and supports any number of input frames, while the latter can only predict the indices of all frames at once by inputting the whole frame sequence, which requires adjusting the framework whenever an MRI slice has a different number of frames. To verify the effectiveness of the temporal modelling and attention mechanism, a baseline is developed and compared with the proposed framework. In the baseline, temporal modelling is discarded completely, and only DenseNet is used to extract spatial features for each frame; the extracted features are used to predict the left ventricle indices of that frame. For ease of description, the baseline is denoted SpaNet, meaning that only spatial features are extracted for each frame. Table 3 presents the detailed experimental results for the baseline model, the SOTA, and the proposed framework. The table shows that the proposed method greatly outperforms the baseline model. The reason may be that temporal modelling provides information complementary to the spatial features, and the attention mechanism makes the model focus on the most discriminative part of the input.

6. Conclusion

In our study, a framework is developed for quantifying left ventricle indices. The framework not only preserves the spatial information of each frame using DenseNet but also keeps the temporal information by encoding the extracted features of consecutive frames using a GRU. More importantly, to effectively map the input to the output, a decoder based on GRU with attention is designed; the final representation is obtained by feeding it the encoded temporal features of the consecutive frames together with those of each frame. Competitive results are obtained on the DIG-Cardiac database using the same 5-fold cross-validation as the SOTA. Moreover, our approach predicts the left ventricle indices frame by frame, while the SOTA predicts the indices only after all frames are analyzed; our method is therefore more flexible, as it supports any number of input frames. We believe that longer frame sequences are favorable for left ventricle quantification, and more experiments will be done in future work to examine the influence of frame-sequence length on the performance of index prediction.

Data Availability

The DIG-Cardiac database used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Chongqing, China (Grant nos. cstc2019jcyj-msxmX0487 and cstc2019jcyj-msxmX0544) and the National Science Foundation of China (Grant no. 61702063). The authors also acknowledge the support of China Scholarship Council (Grant no. 201808505123).