Paper Reading Notes (49): 3D Consistent & Robust Segmentation of Cardiac Images by Deep Learning with Spatial Pr..

Abstract—We propose a method based on deep learning to perform cardiac segmentation on short axis MRI image stacks iteratively from the top slice (around the base) to the bottom slice (around the apex). At each iteration, a novel variant of U-net is applied to propagate the segmentation of a slice to the adjacent slice below it. In other words, the prediction of a segmentation of a slice is dependent upon the already existing segmentation of an adjacent slice. 3D-consistency is hence explicitly enforced. The method is trained on a large database of 3078 cases from UK Biobank. It is then tested on 756 different cases from UK Biobank and three other state-of-the-art cohorts (ACDC with 100 cases, Sunnybrook with 30 cases, RVSC with 16 cases). Results comparable or even better than the state-of-the-art in terms of distance measures are achieved. They also emphasize the assets of our method, namely enhanced spatial consistency (currently neither considered nor achieved by the state-of-the-art), and the generalization ability to unseen cases even from other databases.

Index Terms—Cardiac segmentation, deep learning, neural network, 3D consistency, spatial propagation.

Fig. 1. (Left) Intra- and inter-dataset inconsistencies of the basal slice ground-truth (RVSC contains no basal slice and is therefore not shown). (Right) Ground-truth adaptation proposed for UK Biobank. The basal slice is first identified (blue), then the RVC labels are removed in this slice, and the labels are removed from the slices above (pink).

I. INTRODUCTION
The manual segmentation of cardiac images is tedious and time-consuming, which is even more critical given the new availability of huge databases (e.g. UK Biobank [1]). Magnetic resonance imaging (MRI) is widely used by cardiologists. Yet MRI is challenging to segment due to its anisotropic resolution with somewhat distant 2D slices which might be misaligned. There is hence a great need for automated and accurate cardiac MRI segmentation methods.

In recent years, many state-of-the-art cardiac segmentation methods have been based on deep learning and substantially outperform previous methods. Currently, they dominate various cardiac segmentation challenges. For instance, in the Automatic Cardiac Diagnosis Challenge (ACDC) of MICCAI 2017, 9 out of the 10 cardiac segmentation methods were based on deep learning. In particular, the 8 best-ranked methods were all deep learning ones. Deep learning methods can be roughly divided into 2 classes: 2D methods, which segment each slice independently (e.g. [2], [3], [4]), and 3D methods, which segment multiple slices together as a volume (e.g. [5], [4]). 2D methods are popular because they are lightweight and require less data for training. But as no 3D context is taken into consideration, they can hardly maintain the 3D-consistency between the segmentations of different slices, and may even fail on “difficult” slices. For example, the 2D method used in [3] achieves state-of-the-art segmentation on several widely used datasets but makes its most prominent errors in apical slices and sometimes even fails to detect the presence of the heart.

On the other hand, 3D methods should theoretically be robust to these issues. But in [4], with experiments on the ACDC dataset, the authors found that all the 2D approaches they proposed consistently outperformed the 3D method being considered. In fact, 3D methods have some significant shortcomings. First, using 3D data drastically reduces the number of training images. Second, 3D methods mostly rely on 3D convolution. Yet border effects from 3D convolution may compromise the information in intermediate representations of the neural networks. Third, 3D methods require far more GPU memory. Therefore, substantial downsampling of data is often necessary for training and prediction, which causes loss of information.

One possible way to combine the strengths of 2D and 3D methods is to use recurrent neural networks. In [6], the authors merge U-Net [7] and a recurrent unit into a neural network to process all slices in the same stack, arranging the slices from the base to the apex. Information from the slices already segmented in the stack is preserved in the recurrent unit and used as context while segmenting the current slice. Comparisons in [6] prove that this contextual information is helpful to achieve better segmentation.

However, the approaches based on recurrent neural networks are still limited. On the one hand, as the slice thickness (usually 5 to 10mm) is often very large compared to the in-plane resolution (usually 1 to 2mm), the correlation between slices is low except for adjacent slices. Thus, considering all slices at once may not be optimal. On the other hand, the prediction on each slice made by a recurrent neural network does not depend on an existing prediction. With this setting, the automatic segmentation is remarkably different from the procedure of human experts. As presented in [8], human experts are very consistent in the sense that the intra-observer variability is low; yet the inter-observer variability is high, as segmentation bias varies remarkably between human experts. Hence in general, for a given slice, there is no unique correct segmentation. But each human operator still maintains consistency across his or her own predictions. Inspired by these facts, we adopt a novel perspective: we train our networks to explicitly maintain the consistency between the current segmentation and the already predicted segmentation on an adjacent slice. We do not assume that there is a unique correct segmentation. Instead, the prediction for the current slice explicitly depends on another previously predicted segmentation.

Another possible method to improve segmentation consistency is to incorporate anatomical prior knowledge into neural networks. In [9], the segmentation models are trained to follow the cardiac anatomical properties via a learned representation of the 3D shape. While adopting a novel training procedure, this method is based on 3D convolutional neural networks for segmentation, so the issues of 3D methods discussed above still exist.

In this paper, we propose a novel method based on deep learning to perform cardiac segmentation. Our main contribution is threefold:

• The spatial consistency in cardiac segmentation is barely addressed in general, while this is a remarkable aspect of human expertise. Our method explicitly provides spatially consistent results by propagating the segmentations across slices. This is a novel perspective, as we do not assume the existence of a unique correct segmentation, and the prediction of the current segmentation depends on the segmentation of the previous slice.

• After training our method with a large dataset, we demonstrate its robustness and generalization ability on a large number of unseen cases from the same cohort as well as from other reference databases. These aspects are crucial for the application of a segmentation model in general, yet have not been explored before.

• Most segmentation methods proceed in a 2D manner to benefit from more training samples and higher training speed in comparison with 3D methods. In contrast, we propose an original approach that keeps the computational assets of 2D methods but still addresses key 3D issues. We hence believe in its potential impact on the community.

II. DATA

A. Datasets
The proposed method was trained and evaluated using four datasets: the very large UK Biobank [1] dataset through our access application, the ACDC challenge training dataset, the Sunnybrook dataset [10] (made available for the MICCAI 2009 challenge on automated left ventricle (LV) segmentation), and the Right Ventricle Segmentation Challenge (RVSC) dataset [11] (provided for the MICCAI 2012 challenge on automated right ventricle (RV) segmentation). Depending on the dataset, expert manual segmentation for different cardiac structures (e.g. the left and right ventricular cavity (LVC, RVC), the left ventricular myocardium (LVM)) is provided as ground-truth for all slices at end-diastole (ED) and/or end-systole (ES) phases. All other structures in the image are considered as background (BG). Training involved a subset (80%) of the UK Biobank dataset. Testing used the remaining 20% from the same dataset, as well as the whole of the three other datasets. Details about these datasets are provided in Appendix A. We mainly adopt the metrics used in the three challenges above to measure segmentation performance. The exact definitions of the metrics used in this paper (e.g. Dice index, Hausdorff distance, presence rate) are provided in Appendix B.

B. Notation and Terminology
In this paper, slices in image stacks are indexed in spatial order from the basal to the apical part of the heart. Given an image stack S, we denote N the number of its slices. Given two values a and b between 0 and N, we note S[a,b] the sub-stack consisting of slices of indexes in the interval [round(a),round(b)[ (round(a) is included while round(b) is excluded) with round the function rounding to nearest integer. For instance, if S is a stack of N=10 slices of indexes from 0 to 9, then S[0.2N,0.6N] is the stack consisting of slices number 2 to 5. Similarly, if the basal slice is defined in S, we denote base its index. Then S[base] and S[base+1] are the basal slice and the first slice below the base.
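The sub-stack notation can be sketched in a few lines of Python. This is only an illustration: the helper names `round_half_up` and `sub_stack` are hypothetical, and sending ties upward when rounding is an assumption, since the text only says "rounding to nearest integer".

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer; ties go up (an assumption, see above)."""
    return math.floor(x + 0.5)


def sub_stack(stack, a: float, b: float):
    """Return S[a, b]: the slices whose index lies in [round(a), round(b))."""
    return stack[round_half_up(a):round_half_up(b)]


slices = list(range(10))          # stand-in for a stack of N = 10 slices
n = len(slices)
print(sub_stack(slices, 0.2 * n, 0.6 * n))   # [2, 3, 4, 5]
```

Python's built-in `round` is avoided here because it rounds ties to the nearest even integer, which may or may not match the convention intended in the paper.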

Segmentation of slices above and below the base of the heart can be quite different. For convenience, in a stack with known base slice, we call the slices located above it the AB (above-the-base) slices, and the ones located below it the BB (below-the-base) slices. In the remainder of this paper, we propose methods to determine the base slice for image stacks of UK Biobank using the provided ground-truth.

Finally, given a segmentation mask M, edge(LVC, LVM) is the number of pairs of neighboring pixels (two pixels sharing an edge, defined using the 4-connectivity) on M such that one is labeled as LVM while the other is labeled as LVC. Similarly we define edge(LVC, BG) and edge(LVC, RVC).
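As a rough illustration, the edge count can be computed on a label map with NumPy by comparing each pixel with its right and lower neighbors. The integer label values (BG=0, LVC=1, LVM=2, RVC=3) and the function name `edge` taking the mask as first argument are assumptions made for this sketch.

```python
import numpy as np

BG, LVC, LVM, RVC = 0, 1, 2, 3   # label encoding assumed for this sketch


def edge(mask: np.ndarray, a: int, b: int) -> int:
    """Count 4-connected neighboring pixel pairs labeled (a, b) in either order."""
    count = 0
    # Horizontal neighbor pairs first, then vertical neighbor pairs.
    for p, q in ((mask[:, :-1], mask[:, 1:]), (mask[:-1, :], mask[1:, :])):
        count += np.count_nonzero(((p == a) & (q == b)) | ((p == b) & (q == a)))
    return int(count)


# Toy mask: one LVC pixel fully enclosed by a 3x3 LVM block.
toy = np.zeros((5, 5), dtype=int)
toy[1:4, 1:4] = LVM
toy[2, 2] = LVC
print(edge(toy, LVC, LVM), edge(toy, LVC, BG))   # 4 0
```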

C. Adaptation of the UK Biobank Ground-Truth
Let’s first compare the segmentation conventions followed by the ground-truth between datasets. For BB slices, the convention is roughly the same: if LV is segmented, LVC is well enclosed in LVM; if RVC is segmented, it is identified as the whole cardiac cavity zone next to the LV. But for AB slices, the variability of segmentation conventions within and between datasets can be significant. On the left of Fig.1, we show examples of (base slice, ground-truth) pairs from UK Biobank (row-1 and row-2, two different cases), ACDC (row-3) and Sunnybrook (row-4). For better visualization, we crop out the heart regions from the original MRI images and ground-truths accordingly. The segmentation ground-truths on these similar images are significantly different. In particular, we notice the intra- and inter-dataset inconsistencies in the segmentation of (1) the RVC at the outflow tract level, (2) the LVM and LVC at the mitral valve level (some datasets seem to be segmented in such a way that the LVC mask is always fully surrounded by the LVM mask). In contrast, the convention seems roughly the same for the BB slices.

Hence we decided to adapt the ground-truth of UK Biobank to improve both consistency and generality. As presented in the right part of Fig.1, we i) set all pixels in all the slices above the base to BG; ii) relabel all the pixels in the basal slice originally labeled as RVC to BG while keeping the LVC and LVM pixels unchanged; iii) keep the ground-truth of all slices below the base unchanged.
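The three adaptation rules can be sketched as follows on a stack of label maps ordered from base to apex. The function name `adapt_ground_truth`, the label encoding, and the handling of `base = -1` (no basal slice found, so nothing is changed) are assumptions of this sketch, not details given in the text.

```python
import numpy as np

BG, LVC, LVM, RVC = 0, 1, 2, 3   # label encoding assumed for this sketch


def adapt_ground_truth(masks: np.ndarray, base: int) -> np.ndarray:
    """Adapt a (N, H, W) ground-truth stack given the basal slice index.

    base = -1 is taken to mean that no basal slice was found, in which case
    the stack is returned unchanged (assumption).
    """
    adapted = masks.copy()
    if base >= 0:
        adapted[:base] = BG              # i) slices above the base become BG
        basal = adapted[base]            # view into the copied stack
        basal[basal == RVC] = BG         # ii) drop RVC labels in the basal slice
    return adapted                       # iii) slices below the base are untouched


stack = np.full((4, 2, 2), RVC)
stack[:, 0, 0] = LVC
print(adapt_ground_truth(stack, base=1)[:2])
```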

Moreover, we propose a method to determine the basal slice automatically in the stacks of UK Biobank. While checking the ground-truth of the slices starting from the apex part, the basal slice is determined as the first one such that:
the LVC mask is not fully surrounded by the LVM mask:

\mathrm{edge}(\mathrm{LVC}, \mathrm{BG}) + \mathrm{edge}(\mathrm{LVC}, \mathrm{RVC}) > 0

or the area of the RVC mask shrinks substantially compared to that of the slice below:

\begin{cases} \mathrm{overlap}(\mathrm{RVC}_{1}, \mathrm{RVC}_{2}) / \mathrm{RVC}_{2} \leq T_{1} \\ \mathrm{RVC}_{1} / \mathrm{RVC}_{2} \leq T_{2} \end{cases}

with $\mathrm{RVC}_{1}$ and $\mathrm{RVC}_{2}$ the RVC masks of the slice and the slice below it respectively (in the ratios, a mask stands for its area), and $T_{1} = 0.75$ and $T_{2} = 0.8$ the thresholds. If the basal slice is not determined after examining all slices in the stack, we define the index of the base slice as −1 (so S[base+1] is the first slice in the stack).
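A possible reading of this rule is sketched below. Several points are assumptions rather than facts from the text: the label encoding, the interpretation that both inequalities in the brace must hold together, measuring overlap as the pixel count of the intersection, and skipping the RVC test when the slice below has no RVC pixels. The names `edge` and `find_base` are hypothetical.

```python
import numpy as np

BG, LVC, LVM, RVC = 0, 1, 2, 3   # label encoding assumed for this sketch
T1, T2 = 0.75, 0.8               # thresholds quoted in the text


def edge(mask, a, b):
    """Count 4-connected neighboring pixel pairs labeled (a, b) in either order."""
    pairs = ((mask[:, :-1], mask[:, 1:]), (mask[:-1, :], mask[1:, :]))
    return int(sum(np.count_nonzero(((p == a) & (q == b)) | ((p == b) & (q == a)))
                   for p, q in pairs))


def find_base(masks) -> int:
    """Scan from the apex (last index) upward; return the basal index or -1."""
    n = len(masks)
    for i in range(n - 1, -1, -1):
        m = masks[i]
        # Criterion 1: LVC is not fully surrounded by LVM.
        if edge(m, LVC, BG) + edge(m, LVC, RVC) > 0:
            return i
        # Criterion 2: RVC shrinks substantially relative to the slice below.
        if i + 1 < n:
            rvc1, rvc2 = (m == RVC), (masks[i + 1] == RVC)
            area2 = rvc2.sum()
            if area2 > 0:
                overlap = (rvc1 & rvc2).sum()
                if overlap / area2 <= T1 and rvc1.sum() / area2 <= T2:
                    return i
    return -1
```

Returning −1 when no slice qualifies matches the convention stated above, so that `base + 1` indexes the first slice of the stack.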

According to the current international guidelines in [12], the “standard” basal slice is the topmost short-axis view slice that has more than 50% myocardium around the blood cavity. To test whether the UK Biobank basal slices determined above are close to the standard basal slices, we randomly picked 50 cases (50 ED stacks + 50 ES stacks) and estimated their standard basal slices at ED and ES visually according to the guidelines. Among the 100 pairs of standard basal slice and ground-truth deduced basal slice, 59 pairs are exactly the same, 40 pairs differ by one slice, and only 1 pair differs by two slices. The “adapted” ground-truth will stand as the ground-truth for the rest of this paper.


Fig. 2. (Left) ROI-net: for ROI determination over image stack. A sigmoid function is applied to the output channel to generate pixel-wise probabilities. (Right) LVRV-net and LV-net: for cardiac segmentation on ROIs. S[i] is the slice to be segmented and M[i] is the predicted mask. A softmax function is applied to the output channels to generate pixel-wise 4- or 3-class probabilities.

III. METHODS
Our method mainly consists of two steps: region of interest (ROI) determination and segmentation with propagation. The first step is either based on a trained neural network (the ROI-net) or on center cropping, depending on the dataset. The second step is based on either the LVRV-net or the LV-net (originally designed by us and inspired by U-net [7]), depending on whether the RVC must be segmented. This section will also present the image preprocessing methods and the loss functions we used.

A. Region of Interest (ROI) Determination: ROI-net
On cardiac MRI images, defining an ROI is useful to save memory usage and to increase the speed of segmentation methods. There are many different ROI determination methods available in the community. But for most of them, the robustness remains a question, as the training and the evaluation are done with cases from the same cohort of limited size. We propose a robust approach as follows. With a large number of available cases from UK Biobank, a deep learning based method becomes a natural choice. In particular, we design the ROI-net to perform heart segmentation on MRI images.

Notice that for some datasets (Sunnybrook and RVSC), the images are already centered around the heart. Similar to what was done in [3], in such cases, images are simply cropped. However this is not valid for most datasets (here UK Biobank and ACDC), and an ROI needs to be determined specifically for each stack based on the predictions of ROI-net, as explained in the following. ROI-net is a variant of U-net with a combination of convolutions, batch normalizations (BN) and leaky ReLUs (LReLU) [13] as building blocks. In leaky ReLU the gradient parameter when the unit is not active (the negative slope) is set to 0.1. Furthermore, we implement deep supervision as in [14] to generate low resolution (of size 32 and 64) segmentation outputs, and then upsample and add them to the final segmentation. A sigmoid function is applied to the output channel of ROI-net to generate pixel-wise probabilities.

In brief, ROI-net takes one original MRI image as input and predicts pixel-wise probabilities as a way of heart/background binary segmentation (0 for background, 1 for the heart, with a threshold of 0.5 at inference). The heart to be segmented is defined as the union of LVC, LVM, and RVC. The ROI determination takes only the ED stack slices into account. In practice, an ROI containing the heart with some margin at ED also contains the heart well at other instants including ES. More specifically:

1) Training: The network is trained with slices in S[(base+1),(base+1)+0.4N] (the 40% of slices just below the base) of the ED stack S from the UK Biobank training cases. The purpose of using only slices in S[(base+1),(base+1)+0.4N] is to avoid the top slices around the base on which RVC ground-truth shrinks (Fig.1), and the bottom slices around the apex on which the heart is small and almost does not affect the ROI determination.

2) Prediction: To confirm the robustness of ROI-net for inference, we apply it to the sub-stacks roughly covering the largest cross-section of the hearts in a dataset (the position of the base is supposed to be unknown for individual cases). The slice indexes of these sub-stacks are determined based on visual observation for a given dataset. More specifically, the trained ROI-net is used to segment slices in S[0.2N,0.6N] of the ED stack S of all the UK Biobank cases, and slices in S[0.1N,0.5N] of the ED stack S of all the ACDC cases. For noise reduction, as post-processing for ROI-net, only the largest connected component of the output heart mask is kept for each image.

3) ROI Determination: For each ED stack, the union of all predicted heart masks, as well as the minimum square M covering their union, is determined. We add to it a padding of width 0.3 times the size of M to generate a larger square bounding box, which is defined as the ROI for the case.
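The ROI determination step can be sketched as below. The text does not say how the minimum square is positioned or whether the padding is applied on each side, so this sketch assumes a square centered on the bounding box of the union and a padding of 0.3 times its side on every side. The function name `roi_from_masks` is hypothetical, and the returned box may extend beyond the image, which the caller would have to clip or pad.

```python
import numpy as np


def roi_from_masks(heart_masks: np.ndarray, pad_ratio: float = 0.3):
    """Square ROI (row0, row1, col0, col1), half-open, from (K, H, W) boolean masks."""
    union = heart_masks.any(axis=0)            # union of all predicted heart masks
    rows, cols = np.nonzero(union)
    if rows.size == 0:
        raise ValueError("no heart pixel predicted in this stack")
    r0, r1 = int(rows.min()), int(rows.max()) + 1
    c0, c1 = int(cols.min()), int(cols.max()) + 1
    side = max(r1 - r0, c1 - c0)               # side of the minimum covering square
    total = int(round(side * (1 + 2 * pad_ratio)))   # padded on both sides (assumed)
    center_r, center_c = (r0 + r1) / 2, (c0 + c1) / 2
    row0 = int(round(center_r - total / 2))
    col0 = int(round(center_c - total / 2))
    return row0, row0 + total, col0, col0 + total


masks = np.zeros((2, 100, 100), dtype=bool)
masks[0, 40:50, 30:50] = True
masks[1, 45:60, 35:45] = True
print(roi_from_masks(masks))   # (34, 66, 24, 56)
```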

After ROI determination on an ED stack, the same ROI applies to both the ED and ES stacks of the same case. Then the ED and ES stacks are cropped out according to this ROI and used as inputs for the LVRV-net and the LV-net in the second step. Hence in the remainder of this paper, we refer to the cropped version of the images, slices or stacks.

B. Segmentation with Propagation: LVRV-net and LV-net
The second step is segmenting the cropped images (the ROIs). Depending on whether we segment RVC or not, we propose two networks: LVRV-net and LV-net. They share the same structure template as depicted on the right of Fig.2. Both perform slice segmentation of S[i] taking S[i-1], the adjacent slice above, and M[i-1], its segmentation mask, as contextual input. In the contextual input, there are five channels in total: S[i-1] takes one, while M[i-1], being converted to pixel-wise one-hot channels (BG, LVC, LVM, RVC), takes four. In case S[i] is the first slice to be segmented in a stack, M[i-1] does not exist and is set to a null mask; in case S[i] is the top slice in a stack, S[i-1] does not exist and is set to a null image.

The main body of LVRV-net and LV-net is also a variant of U-net with convolution, BN, LReLU and deep supervision, very similar to that of ROI-net. In addition to the main body, an extra encoding branch encodes the contextual input. Information extracted by the main body encoding branch and the extra encoding branch are combined at the bottom of the network, before being decoded in the decoding branch. Finally, a softmax function is applied to the output channels to generate pixel-wise 4 or 3-class probabilities. For inference, each pixel is labeled to the class with the highest probability.

1) Training: LVRV-net and LV-net are trained to segment slices S[i] in S[(base+1),N] (the BB slices, the green column in Fig.1) and S[base,N] (the basal slice and the BB slices, the blue column and the green columns in Fig.1) respectively of the stack S at ED and ES of the UK Biobank training set. Regarding the contextual input, S[i-1] and M[i-1] are set to a null image or a null mask if they are not available as described above; otherwise M[i-1] is set to the corresponding ground-truth mask.

2) Testing: The trained LVRV-net and LV-net are used to segment the cases in the UK Biobank testing set and the other datasets (ACDC, Sunnybrook, RVSC). Let us denote by S′ the sub-stack to be segmented and by M′ the corresponding predicted mask stack. Notice that for UK Biobank, S′ is S[(base+1),N] for LVRV-net, and S[base,N] for LV-net; for the other datasets, S′ is the whole stack. LVRV-net or LV-net iteratively segments S′[i] by predicting M′[i], taking S′[i-1] and M′[i-1] as contextual input, for i = 0, 1, 2, etc. In other words, the segmentation prediction of a slice is used as contextual information while segmenting the slice immediately below it in the next iteration. The segmentation prediction is iteratively “propagated” from top to bottom (or roughly speaking from base to apex) slices in S′.

3) Post-processing: We post-process the predictions at each iteration while segmenting a stack (hence the post-processed mask will be used as the contextual mask in the next iteration if it exists). A predicted mask is considered as successful if the two conditions below are satisfied:

- LVM is present on the mask;
- LVC is mostly surrounded by LVM:

\mathrm{edge}(\mathrm{LVC}, \mathrm{BG}) + \mathrm{edge}(\mathrm{LVC}, \mathrm{RVC}) \leq 0.5 \times \mathrm{edge}(\mathrm{LVC}, \mathrm{LVM})

The parameter 0.5 above is determined empirically. If the predicted mask is successful, for LVRV-net only, we further process the mask by preserving only the largest connected component of RVC and turning all the other RVC connected components (if they exist) to background; otherwise, the predicted mask is reset to a null mask.
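The testing loop with its success check can be sketched as follows. The `predict` callable is a hypothetical stand-in for a trained LVRV-net or LV-net (its signature is invented for this sketch), the null mask and null image are assumed to be all-zero arrays, and the RVC largest-connected-component cleanup is left out for brevity.

```python
import numpy as np

BG, LVC, LVM, RVC = 0, 1, 2, 3   # label encoding assumed for this sketch


def edge(mask, a, b):
    """Count 4-connected neighboring pixel pairs labeled (a, b) in either order."""
    pairs = ((mask[:, :-1], mask[:, 1:]), (mask[:-1, :], mask[1:, :]))
    return int(sum(np.count_nonzero(((p == a) & (q == b)) | ((p == b) & (q == a)))
                   for p, q in pairs))


def is_successful(mask) -> bool:
    """LVM must be present and LVC must be mostly surrounded by LVM."""
    if not (mask == LVM).any():
        return False
    return edge(mask, LVC, BG) + edge(mask, LVC, RVC) <= 0.5 * edge(mask, LVC, LVM)


def propagate(stack, predict):
    """Segment a (N, H, W) stack from top to bottom.

    predict(image, prev_image, prev_mask) -> (H, W) label map is a placeholder
    for the network; it is not an API from the paper.
    """
    prev_image = np.zeros_like(stack[0])             # null image for the top slice
    prev_mask = np.zeros(stack[0].shape, dtype=int)  # null mask for the first step
    results = []
    for image in stack:
        mask = predict(image, prev_image, prev_mask)
        if not is_successful(mask):
            mask = np.zeros_like(mask)               # failed prediction: reset to null
        results.append(mask)
        prev_image, prev_mask = image, mask          # post-processed mask is the context
    return np.stack(results)
```

Because the post-processed mask is fed forward, a failed slice resets the context rather than propagating a doubtful segmentation to the slices below it.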

C. Image Preprocessing
Each input image or mask of the three networks in this paper (ROI-net, LVRV-net, and LV-net) is preprocessed as follows:

1) Extreme Pixel Value Cutting and Contrast Limited Adaptive Histogram Equalization (CLAHE) for ROI-net only: Input images to ROI-net are clipped to the 5th and 95th percentiles of gray levels. Then we apply CLAHE as implemented in OpenCV to perform histogram equalization and improve the contrast of the image with the parameters clipLimit = 3 and tileGridSize = (8, 8).

2) Padding to Square and Resize: The input image or mask is zero-padded to a square if needed. Then it is resampled using nearest-neighbor interpolation to 128 × 128 for ROI-net or 192 × 192 for LVRV-net and LV-net.

3) Normalization: Finally, for each input image of all networks, the mean and standard deviation of the slice intensity histogram cropped between the 5th and 95th percentiles are computed. The image is then normalized by subtracting this mean and dividing by this standard deviation.
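A minimal NumPy sketch of the padding and normalization steps is given below; CLAHE and the nearest-neighbor resize are left out. The text does not specify where the zero padding goes or how degenerate images are handled, so centering the image in the square and falling back to a unit divisor when the standard deviation is zero are assumptions. The function names are hypothetical.

```python
import numpy as np


def pad_to_square(img: np.ndarray) -> np.ndarray:
    """Zero-pad a 2D image or mask to a square, keeping it centered (assumed)."""
    h, w = img.shape
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    out = np.zeros((side, side), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out


def normalize(img: np.ndarray) -> np.ndarray:
    """Normalize with the mean/std of pixels between the 5th and 95th percentiles."""
    lo, hi = np.percentile(img, [5, 95])
    core = img[(img >= lo) & (img <= hi)]      # intensities inside the cropped range
    mean, std = core.mean(), core.std()
    return (img - mean) / (std if std > 0 else 1.0)   # guard against constant images


image = np.arange(100, dtype=float).reshape(10, 10)
print(pad_to_square(np.ones((2, 4))).shape)    # (4, 4)
print(normalize(image)[0, :3])
```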

D. Loss Functions
We use the three Dice loss (DL) functions below to train the three neural networks mentioned above. As suggested in [15], loss functions based on the Dice index help overcome difficulties in training caused by class imbalance.

1) $D L_{1}$ for ROI-net Training: Given an input image I of N pixels, let’s note $p_{n}$ the pixel-wise probability predicted by ROI-net and $g_{n}$ the pixel-wise ground-truth value ( $g_{n}$ is either 0 or 1). $D L_{1}$ is defined as

D L_{1} = - \frac{2 \sum_{n = 1}^{N} p_{n} g_{n} + ε}{\sum_{n = 1}^{N} p_{n} + \sum_{n = 1}^{N} g_{n} + ε}

where ε is used to improve the training stability by avoiding division by 0, i.e. when $p_{n}$ and $g_{n}$ are 0 for each pixel n. Empirically we take ε = 1. The value of $D L_{1}$ varies between 0 and -1. Good performance of ROI-net corresponds to $D L_{1}$ close to -1.

2) $D L_{2}$ for LVRV-net Training: For the segmentation of an N-pixel input image, the outputs are four probabilities $p_{n,c}$ with c = 0,1,2,3 (BG, LVC, LVM and RVC) such that $\sum_{c} p_{n,c} = 1$ for each pixel. Let’s note $g_{n,c}$ the corresponding one-hot ground-truth ($g_{n,c}$ is 1 if the pixel is labeled with the class corresponding to c; otherwise $g_{n,c}$ is 0). Then we define

D L_{2} = - \frac{1}{4} \sum_{c = 0}^{3} (\frac{2 \sum_{n = 1}^{N} p_{n, c} g_{n, c} + ε}{\sum_{n = 1}^{N} p_{n, c} + \sum_{n = 1}^{N} g_{n, c} + ε})

The role of ε here is similar to that in $D L_{1}$ . Empirically we use ε=1.

3) $D L_{3}$ for LV-net Training: Its formula is very similar to that of $D L_{2}$. The only difference is that, instead of averaging the 4 Dice index terms with c ranging from 0 to 3, $D L_{3}$ averages the 3 Dice index terms with c ranging from 0 to 2 (BG, LVC, LVM).
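The three losses can be written compactly as forward computations in NumPy. This is only a numerical sketch of the formulas above: actual training would use an automatic-differentiation framework, and the function names and the (C, H, W) array layout are assumptions.

```python
import numpy as np


def dice_loss_binary(p: np.ndarray, g: np.ndarray, eps: float = 1.0) -> float:
    """DL1: negated Dice index between probabilities p and binary ground-truth g."""
    p, g = p.ravel(), g.ravel()
    return float(-(2 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps))


def dice_loss_multiclass(p: np.ndarray, g: np.ndarray, eps: float = 1.0) -> float:
    """DL2 (C = 4) or DL3 (C = 3): negated mean of the per-class Dice terms.

    p and g have shape (C, H, W); g is one-hot along the first axis.
    """
    c = p.shape[0]
    p, g = p.reshape(c, -1), g.reshape(c, -1)
    dice = (2 * (p * g).sum(axis=1) + eps) / (p.sum(axis=1) + g.sum(axis=1) + eps)
    return float(-dice.mean())


truth = np.zeros((4, 4))
truth[1:3, 1:3] = 1                                # 4 foreground pixels
print(dice_loss_binary(truth, truth))              # -1.0
print(dice_loss_binary(np.zeros((4, 4)), truth))   # -0.2
```

With ε = 1, an empty prediction against an empty ground-truth also yields -1, which is the stabilizing behavior described for $D L_{1}$.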

V. CONCLUSION AND DISCUSSION
We propose a method of segmentation with spatial propagation that is based on originally designed neural networks. By taking the contextual input into account, the spatial consistency of segmentation is enforced. Also, we conduct thorough and unprecedented testing to evaluate the generalization ability of our model and achieve performance better than or comparable to the state-of-the-art. Furthermore, an exceptionally large dataset (UK Biobank) collected from the general population is used for training and evaluation.

Given the experiments in this paper, we notice that our method is very robust in terms of distance measures (e.g. Hausdorff distance) but less precise than the state-of-the-art in terms of Dice index. The variability of ground-truth in the UK Biobank training set is one important reason for that. For instance, the high ground-truth variability on the basal slice, which is included in the testing sub-stacks for LV-net but not for LVRV-net, explains the slightly lower performance measures of LV-net in Table I. Yet this kind of variability commonly exists in large datasets, so we have to accept and cope with it. Furthermore, inconsistency problems may occur in segmentation (as illustrated and discussed), to which the Dice index might not be sensitive. We believe that on this problem more attention should be paid to the Hausdorff distance, according to which our proposed method performs better. For instance, in the third example shown in Fig.6, a small spot of false positive LVC segmentation is predicted by LVRV-no-propagation-net. This is a very typical case of inconsistency: the false positive part is quite small compared to the ground-truth LVC, and therefore only causes a slight reduction of the Dice index. But it causes a sharp increase of the Hausdorff distance.

We did not directly measure the human performance in terms of 3D metrics on UK Biobank to compare with our method. However, the authors of [16] did conduct experiments on UK Biobank to measure human performance in terms of 2D metrics. Taking the inter-observer variability of 3 human experts into account, the reported human expert levels are about 0.93 (LVC), 0.88 (LVM), and 0.88 (RVC) in terms of 2D Dice index, and about 3.1 mm (LVC), 3.8 mm (LVM) and 7.4 mm (RVC) in terms of 2D Hausdorff distance. Though these results are not directly comparable to ours, they may still give a rough idea of human performance. We roughly estimate that our method, while mainly focusing on consistency, still performs slightly below human experts in terms of accuracy.

Most of the existing segmentation methods do not explicitly take spatial consistency into account. In particular, they do not accurately segment the “difficult” slices around the apex. Our method, segmenting in a spatially consistent manner, is notably more robust on these slices. The importance of correctly segmenting these slices is often underestimated. In many cutting-edge research projects (e.g. cardiac motion simulation and image synthesis), as a primary step, 3D meshes need to be built based on segmentation. Without spatial consistency and successful segmentation of the apical slices, the generated meshes would be problematic.

Finally, we wonder whether our method, with better performance on distance measures than many state-of-the-art methods, would be a great tool for cardiac motion analysis. Intuitively, the smaller the Hausdorff distance between the predicted and the ground-truth contours at each instant is, the more precisely the trajectory of the corresponding structure (e.g. LVC, LVM, RVC) can be tracked across time, and hence the better the motion can be characterized. We expect to carry out research on this in the future.
