arxiv: v1 [cs.cv] 21 Sep 2017



H-DenseUNet: Hybrid Densely Connected UNet for Liver and Liver Tumor Segmentation from CT Volumes

Xiaomeng Li 1, Hao Chen 1,2, Xiaojuan Qi 1, Qi Dou 1, Chi-Wing Fu 1, Pheng Ann Heng 1
1 Department of Computer Science and Engineering, The Chinese University of Hong Kong. 2 Imsight Medical Technology, Inc.

Abstract

Liver and liver tumor segmentation plays an important role in hepatocellular carcinoma diagnosis and treatment planning. Recently, fully convolutional neural networks (FCNs), both 2D and 3D, have served as the backbone in many volumetric medical image segmentation tasks. However, 2D convolutions cannot fully leverage the spatial information along the z-axis direction, while 3D convolutions suffer from high computational cost and GPU memory consumption. To address these issues, we propose a novel hybrid densely connected UNet (H-DenseUNet), which consists of a 2D DenseUNet for efficiently extracting intra-slice features and a 3D counterpart for hierarchically aggregating volumetric contexts under the spirit of the auto-context algorithm. In this way, the H-DenseUNet can harness the hybrid deep features effectively for volumetric segmentation. We extensively evaluated our method on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge. Our method outperformed other state-of-the-art methods on the overall score, with Dice per case on liver and tumor of [...] and 0.686, as well as global Dice score on liver and tumor of [...] and 0.829, respectively.

Introduction

Liver cancer is one of the most common cancers in the world and causes a large number of deaths every year (Lu, Marziliano, and Thng, 2006). Accurate measurements from CT, including tumor volume, shape, location and further functional liver volume, can assist doctors in making accurate hepatocellular carcinoma evaluation and treatment planning.
Traditionally, the liver and liver lesions are delineated by radiologists on a slice-by-slice basis, which is time-consuming and prone to inter- and intra-rater variations. Therefore, automatic liver and liver tumor segmentation methods are in high demand in clinical practice.

Automatic liver segmentation from contrast-enhanced CT volumes is a very challenging task due to the low intensity contrast between the liver and neighboring organs (see the first row in Figure 1). Moreover, radiologists usually enhance CT scans by an injection protocol for clearly observing tumors, which may increase the noise inside the images on the liver region. Compared with liver segmentation, liver tumor segmentation is considered to be a more challenging task. First, liver tumors vary in size, shape, location and number within one patient, which hinders automatic segmentation, as shown in Figure 1. Second, some lesions do not have clear boundaries, limiting the performance of purely edge-based segmentation methods (see the lesions in the third row of Figure 1). Third, many CT scans have anisotropic dimensions with high variations along the z-axis direction (the voxel spacing ranges from 0.45 mm to 6.0 mm), which poses further challenges for automatic segmentation methods.

Figure 1: Examples of contrast-enhanced CT scans showing the large variations of shape, size and location of liver lesions (columns: raw image, ground truth, 3D display). Each row shows a CT scan acquired from an individual patient. The red regions denote the liver while the green ones denote the lesions (see the black arrows above).

Copyright © 2018

To tackle these difficulties, a strong ability to extract visual features is required. Recently, fully convolutional neural networks (FCNs) have achieved remarkable success on a broad array of recognition problems.
Many researchers have advanced this stream using deep learning methods for the liver tumor segmentation problem, and the literature can be broadly classified into two categories. (1) 2D FCNs, such as the UNet architecture (Christ et al., 2016), the multi-channel FCN (Sun et al., 2017), and the FCN based on VGG-16 (Ben-Cohen et al., 2016). (2) 3D FCNs, where 2D convolutions are replaced by 3D convolutions with volumetric data input (Lu et al., 2017; Dou et al., 2016).

However, 2D FCN based methods ignore the contexts on the z-axis, which leads to limited segmentation accuracy. To be specific, single or adjacent slices cropped from volumetric images are fed into 2D FCNs (Sun et al., 2017; Ben-Cohen et al., 2016) and the 3D segmentation volume is generated by simply stacking the 2D segmentation maps. Although adjacent slices are employed, this is still not enough to probe the spatial correlation along the third dimension, which may degrade the segmentation performance. To solve this problem, some researchers proposed tri-planar schemes (Prasoon et al., 2013), in which three 2D FCNs are applied on orthogonal planes (the xy, yz, and xz planes) and the voxel predictions are generated by averaging their probabilities. However, this approach still cannot probe sufficient volumetric contexts, leading to limited representation ability.

Compared to 2D FCNs, 3D FCNs require a large amount of computation (e.g., high memory consumption and long training time) (Chen et al., 2016). The high memory consumption limits the depth of the network as well as the filter's field-of-view, which are the two key factors for performance gains (Simonyan and Zisserman, 2014; Chen et al., 2017). The heavy computation of 3D convolutions also impedes their application to training on large-scale datasets. Moreover, many researchers have demonstrated the effectiveness of knowledge transfer (knowledge learnt in one source domain being efficiently transferred to another domain) for boosting performance (Chen et al., 2015; Tajbakhsh et al., 2016). Unfortunately, very few 3D pre-trained models exist, which restricts both the performance and the adoption of 3D FCNs.
To address the above problems, we propose a novel hybrid densely connected UNet (H-DenseUNet) that efficiently probes intra-slice features and extracts 3D contexts for liver and liver lesion segmentation. Our H-DenseUNet pushes the limit further than other works with technical contributions that improve the following two key factors.

Increased network depth. First, to fully extract hierarchical intra-slice features, we design a deep network based on 2D convolutions that is efficient to train, called 2D DenseUNet, which inherits the advantages of both the densely connected path (Huang et al., 2016) and UNet connections (Ronneberger, Fischer, and Brox, 2015). The densely connected path is derived from the densely connected network (DenseNet), where the improved information flow and parameter efficiency make it easy to train a deep network. Different from DenseNet (Huang et al., 2016), we add UNet connections, i.e., long-range skip connections, between the encoding part and the decoding part of our architecture; hence, we preserve low-level spatial features for better intra-slice context exploration.

Hybrid feature exploration. Second, to further explore the volumetric feature representation, we design a 3D counterpart to hierarchically aggregate inter-slice correlations. To do so, we adopt the auto-context mechanism (Tu, 2008) by integrating high-level visual features obtained from the 2D DenseUNet. With the guidance of semantic probabilities from the 2D DenseUNet, the optimization burden in the 3D counterpart is alleviated, which contributes to training efficiency. Moreover, the volumetric features and the highly representative intra-slice contexts can be automatically optimized together for better liver and tumor recognition.
In summary, this work has the following achievements:

(1) We design a DenseUNet to effectively probe hierarchical intra-slice features for liver and tumor segmentation, where the densely connected path and UNet connections are carefully integrated to maximize the performance. To our knowledge, this is the first work that pushes the depth of 2D networks for liver and tumor segmentation to 167 layers.

(2) We propose an H-DenseUNet framework to explore hybrid (intra-slice and inter-slice) features for liver and tumor segmentation, which tackles the problems that occur in 2D convolutions (e.g., neglect of contexts along the z-axis) and 3D convolutions (e.g., long training time), and can serve as a new paradigm for effectively exploiting 3D contexts.

(3) We adopt the auto-context mechanism for effectively and efficiently training the 3D DenseUNet by integrating hybrid features.

Compared with state-of-the-art methods, this work achieved the best overall performance in the recent MICCAI 2017 LiTS Challenge.

Related Work

Hand-crafted feature based methods

In the past decades, many algorithms, including thresholding (Soler et al., 2001; Moltz et al., 2008), region growing and deformable model based methods (Wong et al., 2008; Jimenez-Carretero et al., 2011) and machine learning based methods (Huang et al., 2014; Vorontsov, Abi-Jaoudeh, and Kadoury, 2014), have been proposed to segment the liver and liver tumors. Threshold-based methods classify foreground and background according to whether the intensity value is above a threshold. Variations of region growing algorithms were also popular in the liver and lesion segmentation task. For example, Wong et al. (2008) segmented tumors by a 2D region growing method with knowledge-based constraints. Level set methods also attracted attention from researchers because of their advantages in numerical computations involving curves and surfaces. For example, Jimenez-Carretero et al.
(2011) proposed to classify tumors by a multi-resolution 3D level set method coupled with an adaptive curvature technique. A large variety of machine learning based methods have also been proposed for liver tumor segmentation. For example, Huang et al. (2014) proposed to employ a random feature subspace ensemble-based extreme learning machine for liver lesion segmentation. Vorontsov, Abi-Jaoudeh, and Kadoury (2014) proposed to segment tumors with a support vector machine (SVM) classifier and then refined the results with an omnidirectional deformable surface model.

Deep learning based methods

Convolutional neural networks (CNNs) have achieved great success in image recognition problems in the computer vision community. Similarly, many researchers proposed to utilize various FCNs for learning feature representations in liver and lesion segmentation. For example, Ben-Cohen et al. (2016) proposed to use an FCN for liver segmentation and liver-metastasis detection in CT examinations. Christ et al. (2016, 2017) proposed a cascaded FCN architecture and dense 3D conditional random fields (CRFs) to automatically segment the liver and liver lesions. Meanwhile, Sun et al. (2017) designed a multi-channel FCN to segment liver tumors from CT images, where the probability map was generated by fusing features from multiple channels.

Recently, in the ISBI 2017 challenge on liver and lesion segmentation, Han (2017) proposed a 2.5D 24-layer FCN model and achieved first place; the model employed residual blocks as the building blocks and UNet connections across the encoding part and decoding part. Here 2.5D refers to using 2D convolutions with adjacent slices from the 3D volume data as input. Both Vorontsov et al. (2017) and Chlebus et al. (2017) achieved second place in the ISBI challenge. Vorontsov et al. (2017) also employed ResNet-like residual blocks and UNet connections with 21 convolution layers, a network that is a bit shallower and has fewer parameters than that proposed by Han (2017). Chlebus et al. (2017) trained two individual models for liver and tumor based on a 28-layer UNet architecture and subsequently filtered the false positives of the tumor segmentation results by a random forest classifier. Instead of using 3D FCNs, all of the top results used 2D FCNs with different network depths, showing the efficacy of 2D FCNs for the underlying volumetric segmentation problem. However, these shallow networks have limited representation ability and ignore 3D contexts, which restricts the recognition performance.
Method

Figure 2 shows the pipeline of our proposed method for liver and tumor segmentation. To reduce the overall computation time, a simple ResNet architecture (Han, 2017) is trained to get a quick but coarse segmentation of the liver. With the input of patches within the coarse liver region, our proposed H-DenseUNet efficiently probes intra-slice and inter-slice features through a 2D DenseUNet and a 3D counterpart for accurate segmentation of the liver and liver lesions.

2D DenseUNet

One of the important principles in deep learning architectures is "the deeper, the better" (Simonyan and Zisserman, 2014; He et al., 2016). However, as the FCN goes deeper, the vanishing gradient problem may occur and impede the convergence of the network. Recently, Huang et al. (2016) achieved excellent performance on object recognition benchmark tasks with a deep and efficient network, called DenseNet. The success is attributed to the introduction of dense connectivity, which refers to the direct connections from any layer to all subsequent layers (see Figure 2). The $l$-th layer receives the feature maps of all preceding layers, $x_0, \ldots, x_{l-1}$, as input:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]), \qquad (1)$$

where $[x_0, x_1, \ldots, x_{l-1}]$ is the concatenation of the feature maps generated in layers $0$ to $l-1$. Each function $H_l$ produces $k$ feature maps, and $k$ is called the growth rate. One advantage of the dense connectivity between layers is that it has fewer parameters than traditional networks because it avoids learning redundant features. Moreover, the densely connected path ensures the maximum information flow between layers, which improves the gradient flow and thus alleviates the burden of searching for the optimal solution when building deep neural networks. However, a deep FCN for segmentation tasks contains many max-pooling and upsampling operations, which may lead to the loss of high-resolution spatial features.
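The channel bookkeeping implied by Eq. (1) can be made concrete with a small sketch. It is only an illustration: the function name `dense_block_channels` and the example block (96 incoming feature maps, 4 layers) are assumptions chosen for readability, not values taken from the paper's architecture table; only the growth rate k = 48 comes from the text.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channel counts inside one dense block (hypothetical helper).

    Returns the number of input feature maps seen by each layer and the
    number of feature maps leaving the block.
    """
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        # Layer l receives the concatenation [x_0, ..., x_{l-1}].
        inputs_per_layer.append(channels)
        # H_l contributes k new feature maps to the running concatenation.
        channels += growth_rate
    return inputs_per_layer, channels


# Assumed example: 96 incoming maps, growth rate k = 48, 4 layers.
per_layer, block_out = dense_block_channels(96, 48, 4)
print(per_layer, block_out)  # [96, 144, 192, 240] 288
```

The input width grows linearly with depth while each layer only adds k maps, which is the source of the parameter efficiency discussed above.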
With this consideration, we develop a 2D DenseUNet, which inherits the advantages of both the densely connected path and UNet connections (Ronneberger, Fischer, and Brox, 2015). Specifically, the dense connection between layers is employed within each micro-block to ensure the maximum information flow, while the UNet long-range connection links the encoding part and the decoding part to preserve low-level features. The high parameter efficiency and improved gradient flow throughout the network make the network easy to train. Furthermore, our architecture is well suited to segmenting small targets, such as lesions, because it preserves high-resolution features through the UNet long-range connections.

The illustration and the detailed structure of the 2D DenseUNet are shown in Figure 2(c) and Table 1, respectively. The input is a stack of three adjacent slices and the output is a 2D score map corresponding to the center slice of the input. The depth of the 2D DenseUNet is extended to 167 layers, referred to as 2D DenseUNet-167, which consists of convolution layers, pooling layers, dense blocks, transition layers and upsampling layers. A dense block denotes the cascade of several micro-blocks, in which all layers are directly connected (see Figure 2(c)). To change the size of the feature maps, the transition layer is employed, which consists of a batch normalization layer and one 1×1 convolution layer followed by an average pooling layer. A compression factor is included in the transition layer to reduce the number of feature maps and prevent them from expanding (we set it to 0.5 in our experiments). The upsampling layer is implemented by bilinear interpolation, followed by the summation with low-level features (i.e., UNet connections) and a 3×3 convolutional layer. Before each convolution layer in the architecture, a batch normalization and a Rectified Linear Unit (ReLU) are employed.
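As an illustration of the 2.5D input described above, the following sketch lists which three slices would be stacked for each center slice of a volume. The helper name `three_slice_windows` is hypothetical, and clamping the neighbor indices at the first and last slice is an assumption of this sketch; the text does not say how boundary slices are handled.

```python
def three_slice_windows(num_slices):
    """Slice indices (z-1, z, z+1) stacked as the 2D network input.

    Boundary handling is an assumption: neighbor indices are clamped,
    so the first and last slices are repeated at the volume edges.
    """
    last = num_slices - 1
    return [(max(z - 1, 0), z, min(z + 1, last)) for z in range(num_slices)]


# For a 5-slice volume, the score map for slice z is predicted
# from the stack of slices listed in windows[z].
windows = three_slice_windows(5)
print(windows)  # [(0, 0, 1), (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 4)]
```

Each window yields one 2D score map for its center slice, so a volume with n slices produces n predictions that are later stacked back into a volume.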
H-DenseUNet

The 2D DenseUNet with deep convolutions can produce hierarchical features but neglects the spatial information along the z dimension, while the 3D DenseUNet has a large GPU computational cost and is limited in both the kernel's field-of-view and the network depth. To address these issues, we propose H-DenseUNet, where the 2D DenseUNet and its 3D counterpart are integrated.

Figure 2: Illustration of the pipeline for liver and lesion segmentation. (a) First stage: a 2D ResNet is employed for coarse liver segmentation to reduce computation time. (b) Second stage: the H-DenseUNet consists of the 2D DenseUNet and the 3D counterpart, fed with the input within the coarse liver region, which are responsible for hierarchical feature extraction from intra-slice and inter-slice respectively. (c) The illustration of the 2D DenseUNet. The structure in the orange block is a micro-block and k denotes the growth rate. Panel labels: 2D input 224 × 224 × 3; 3D input 160 × 160 × 20; the probability maps of the 2D DenseUNet are concatenated with the input of the 3D DenseUNet, and each network has its own loss.

This design is inspired by the auto-context algorithm (Tu, 2008). The concatenation of the volumetric images (original data) and the contextual information (probability maps generated by the 2D DenseUNet) is fed into the 3D counterpart to distill the visual features with 3D contexts. Specifically, the detectors in the 3D counterpart are trained based not only on the features probed from the original images, but also on the probabilities of a large number of context pixels. The learning algorithm automatically selects and fuses important supporting context pixels together with features about 3D shape appearance, contributing to the effectiveness of 3D context extraction. With the guidance from the supporting context pixels, the burden of searching for the optimal solution in the 3D counterpart is also alleviated, which improves the learning efficiency of the 3D network. In our experiments, the 3D counterpart of H-DenseUNet takes only 9 hours to converge, which is significantly faster than training the 3D counterpart with the original data alone (63 hours).
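The concatenation step of the auto-context design can be sketched with plain nested lists. This is a toy illustration under stated assumptions: the helper `concat_context` is hypothetical, the data layout (slice, row, column) and the channel order (intensity first, then the three class probabilities) are choices made for this sketch, and a real implementation would operate on tensors inside the network framework.

```python
def concat_context(volume, prob_maps):
    """Attach 2D-network class probabilities to each voxel as extra channels.

    volume[z][y][x]    -> raw intensity (float)
    prob_maps[z][y][x] -> (p_background, p_liver, p_lesion)
    Returns hybrid[z][y][x] -> (intensity, p_background, p_liver, p_lesion).
    """
    hybrid = []
    for vol_slice, prob_slice in zip(volume, prob_maps):
        hybrid_slice = []
        for vol_row, prob_row in zip(vol_slice, prob_slice):
            hybrid_slice.append(
                [(intensity, *probs) for intensity, probs in zip(vol_row, prob_row)]
            )
        hybrid.append(hybrid_slice)
    return hybrid


# Toy volume: 1 slice, 1 row, 2 voxels (values are made up).
volume = [[[-50.0, 120.0]]]
prob_maps = [[[(0.9, 0.1, 0.0), (0.1, 0.6, 0.3)]]]
hybrid = concat_context(volume, prob_maps)
print(hybrid[0][0][1])  # (120.0, 0.1, 0.6, 0.3)
```

Under these assumptions the 3D network sees four channels per voxel instead of one, so its first convolution can already condition on the 2D network's belief about each voxel.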
The detailed structure of the 3D counterpart, called 3D DenseUNet-65, is shown in Table 1; it consists of 65 convolutions and its growth rate is 32. The input of the 3D model is the cropped volumetric images and the output of the model is a 3D segmentation map. Compared with the 2D DenseUNet, the number of micro-blocks in each dense block is decreased due to the high memory consumption of 3D convolutions and the limited GPU memory. Another difference is that the feature maps are downsampled and upsampled only twice along the third dimension, since we use a small input size along the z dimension and it is unnecessary to use a larger size at the expense of the kernel's field-of-view on the xy plane and the depth of the network. The rest of the network setting is the same as in the 2D counterpart. We employ the weighted cross-entropy function described in Eq. (2) in the network and then refine the network by the hard pixel mining strategy.

Table 1: Architectures of the proposed H-DenseUNet, consisting of the 2D DenseUNet and the 3D counterpart. The symbol denotes the long range UNet summation connections. The second and fourth columns indicate the output size of the current stage in the two architectures, respectively. Note that each "conv" corresponds to the sequence BN-ReLU-Conv.

Layer | Feature size | 2D DenseUNet-167 (k=48) | Feature size | 3D DenseUNet-65 (k=32)
input | [...] | | [...] |
convolution 1 | [...] | [...], 96, stride [...] | [...] | [...], 96, stride [...]
pooling | [...] | [...] max pool, stride [...] | [...] | [...] max pool, stride 2
dense block 1 | [...] | 1×1, 192 conv; [...], 48 conv | [...] | 1×1×1, 128 conv; 3×3×3, 32 conv
transition layer 1 | [...] | [...] conv; [...] average pool | [...] | [...] conv; [...] average pool
dense block 2 | [...] | 1×1, 192 conv; [...], 48 conv | [...] | 1×1×1, 128 conv; 3×3×3, 32 conv
transition layer 2 | [...] | [...] conv; [...] average pool | [...] | [...] conv; [...] average pool
dense block 3 | [...] | 1×1, 192 conv; [...], 48 conv | [...] | 1×1×1, 128 conv; 3×3×3, 32 conv
transition layer 3 | [...] | [...] conv; [...] average pool | [...] | [...] conv; [...] average pool
dense block 4 | [...] | 1×1, 192 conv; [...], 48 conv | [...] | 1×1×1, 128 conv; 3×3×3, 32 conv
upsampling layer 1 | [...] | upsampling, dense block 3, 768, conv | [...] | upsampling, dense block 3, 224, conv
upsampling layer 2 | [...] | upsampling, dense block 2, 384, conv | [...] | upsampling, dense block 2, 192, conv
upsampling layer 3 | [...] | upsampling, dense block 1, 96, conv | [...] | upsampling, dense block 1, 96, conv
upsampling layer 4 | [...] | upsampling, convolution 1, 96, conv | [...] | upsampling, convolution 1, 96, conv
upsampling layer 5 | [...] | upsampling, 64, conv | [...] | upsampling, 64, conv
convolution 2 | [...] | [...], 3 | [...] | [...]

Loss Functions and Test Procedures

To train the H-DenseUNet, we first employed the weighted cross-entropy function as the loss function and then refined the network using the hard pixel mining strategy in both stages. We elaborate the details as follows.

Weighted cross-entropy function

The final predicted score map has three classes, i.e., background, liver and lesion. The weighted cross-entropy function is:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{3} w_i^c \, y_i^c \log p_i^c, \qquad (2)$$

where $p_i^c$ denotes the probability that voxel $i$ belongs to class $c$ (background, liver or lesion), $w_i^c$ denotes the weight and $y_i^c$ indicates the ground truth label for voxel $i$.

Hard pixel mining strategy

The voxel distribution is extremely imbalanced, hence a vast number of easy negatives may overwhelm the classifier during training. To solve this issue, we employed the hard pixel mining strategy to down-weight the loss assigned to well-classified examples.
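Before the details of hard pixel mining, the weighted cross-entropy of Eq. (2) can be made concrete with a minimal sketch. Two simplifications are assumptions of this sketch rather than statements about the paper's implementation: the weight depends only on the class (so $w_i^c$ is the same for every voxel $i$), and the labels are given as class indices, so the one-hot $y_i^c$ selects a single term per voxel. The function name and the numbers are made up for illustration.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted cross-entropy over N voxels and 3 classes.

    probs[i]         -> (p_background, p_liver, p_lesion) for voxel i
    labels[i]        -> ground-truth class index of voxel i (0, 1 or 2)
    class_weights[c] -> weight of class c (assumed identical for all voxels)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # With one-hot labels only the true-class term survives the sum over c.
        total += class_weights[y] * math.log(p[y])
    return -total / len(labels)


# Two voxels: one background, one lesion; lesions are weighted twice as much.
probs = [(0.5, 0.25, 0.25), (0.25, 0.25, 0.5)]
labels = [0, 2]
loss = weighted_cross_entropy(probs, labels, class_weights=(1.0, 1.0, 2.0))
print(loss)
```

Raising the weight of a rare class such as the lesion increases its share of the loss, which is one common way to counter the class imbalance mentioned above.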
To be specific, when the network approaches convergence, we sort all the probability values for each class and only include the voxels with a high misclassification rate in the loss function, as follows:

$$L = -\frac{\sum_{i=1}^{N} \sum_{c=1}^{3} \mathbf{1}\{y_i = c \text{ and } p_i \in \Omega_r\} \log p_i^c}{\sum_{i=1}^{N} \sum_{c=1}^{3} \mathbf{1}\{y_i = c \text{ and } p_i \in \Omega_r\}}, \qquad (3)$$

where $\mathbf{1}\{\cdot\}$ is an indicator function and $\Omega_r$ is the set of probabilities within the top $r$ misclassification error ($r$ = 10% in our experiments). With this procedure, the "hard learning" pixels (e.g., small tumors with blurred edges) are given more importance and the classifier can focus on learning hard cases.

Test Procedures

In the test stage, we first obtain the coarse liver segmentation result, and then the 2D DenseUNet generates more accurate background, liver and lesion probability maps on the slices within the detected coarse liver region. The probability maps are stacked into a volume and concatenated with the corresponding original volumetric images, from which the 3D counterpart generates the accurate liver and tumor segmentation results. Then the liver and lesion are merged, and a largest connected component labeling is performed to generate the final liver. After that, the final lesion segmentation result is generated by removing lesions outside the final liver region.

Experiments and Results

Dataset and Pre-processing

We tested our method on the dataset of the MICCAI 2017 LiTS Challenge, which contains 131 and 70 contrast-enhanced 3D abdominal CT scans for training and testing, respectively. The dataset was acquired with different scanners and protocols at six different clinical sites, with a largely varying in-plane resolution from 0.55 mm to 1.0 mm and slice spacing from 0.45 mm to 6.0 mm. For image preprocessing, we truncated the image intensity values of all scans to the range of [-200, 250] HU to remove irrelevant details. For coarse liver segmentation, all the training images were resampled to a fixed resolution of [...] mm^3.
Since the liver lesions are extremely small in some training cases, we trained with the original images in the following stages to avoid possible artifacts from image resampling.

Evaluation Metrics

Following the evaluation of the MICCAI LiTS challenge, we employed two metrics for evaluating the segmentation performance on liver and tumor respectively: the Dice per case score and the Dice global score. The Dice per case score refers to the average of the Dice scores computed per volume, while the Dice global score is the Dice score evaluated by combining all datasets into one.
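The difference between the two metrics is easiest to see on a toy example. The sketch below represents each segmentation as a set of foreground voxel indices; the function names and the convention that two empty masks score 1.0 are assumptions of this sketch, not details taken from the challenge's evaluation code.

```python
def dice(pred, truth):
    """Dice overlap between two sets of foreground voxel indices."""
    if not pred and not truth:
        return 1.0  # assumed convention for empty-vs-empty
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def dice_per_case(cases):
    """Average of the Dice scores computed separately for each volume."""
    return sum(dice(pred, truth) for pred, truth in cases) / len(cases)


def dice_global(cases):
    """Dice score after pooling the voxels of all volumes into one."""
    intersection = sum(len(pred & truth) for pred, truth in cases)
    size = sum(len(pred) + len(truth) for pred, truth in cases)
    return 2.0 * intersection / size if size else 1.0


# Case 1: a large lesion segmented perfectly. Case 2: a tiny lesion missed.
cases = [({1, 2, 3, 4}, {1, 2, 3, 4}), ({1}, {2})]
print(dice_per_case(cases))  # 0.5
print(dice_global(cases))    # 0.8
```

In this example the missed small lesion halves the per-case score but barely affects the global score, which is why the two metrics are reported side by side.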


To compare different methods comprehensively, we define the overall score by averaging the four segmentation results, shown in the last column of Table 3.

Figure 3: Training loss (vertical axis) against epoch (horizontal axis) of the 2D DenseUNet with and without pre-trained model, the 3D DenseUNet without pre-trained model, and the H-DenseUNet.

Implementation Details

The model was implemented using the Keras package (Chollet and others, 2015). The parameters of the encoder part of the 2D DenseUNet were initialized using the pre-trained model for DenseUNet-167, while the 3D counterpart was randomly initialized. The initial learning rate was 0.01 and decayed according to the equation $lr = lr \times (1 - iterations/total\_iterations)^{0.9}$. We used stochastic gradient descent with momentum. For data augmentation, we adopted random mirroring and scaling between 0.8 and 1.2 for all training data to alleviate overfitting. The training of the 2D DenseUNet model took about 21 hours using two NVIDIA Titan Xp GPUs with 12 GB memory, while the 3D counterpart took about 9 hours under the same settings. In the test phase, the total processing time of one subject depends on the number of slices, ranging from 30 seconds to 200 seconds.

Ablation Analysis of H-DenseUNet

In this section, we compare the results of the 2D DenseUNet-167, the 3D DenseUNet-65 and our proposed H-DenseUNet, each trained in an end-to-end manner, to evaluate the effectiveness of each component. We first analyze the learning behavior of the 2D DenseUNet with and without the pre-trained model. Both experiments were conducted under the same settings. The training loss curves are shown in Figure 3: the 2D DenseUNet with the pre-trained model converged faster and achieved a lower loss than the one without, which shows the importance of utilizing a pre-trained model with transfer learning.
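The learning-rate decay given in the implementation details can be sketched as a small function. This reads the equation as the usual polynomial schedule applied to the initial rate of 0.01; that reading, the function name `poly_lr`, the clamp for iterations beyond the total, and the total of 1000 iterations in the example are assumptions of this sketch.

```python
def poly_lr(base_lr, iteration, total_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / total_iterations) ** power."""
    # Clamp at zero so an iteration past the end cannot yield a negative base.
    remaining = max(0.0, 1.0 - iteration / total_iterations)
    return base_lr * remaining ** power


# Learning rate at the start, midpoint and end of a hypothetical 1000-iteration run.
schedule = [poly_lr(0.01, it, 1000) for it in (0, 500, 1000)]
print(schedule)
```

The rate starts at the initial value, decreases monotonically and reaches zero at the final iteration; an exponent below 1 keeps it relatively high for most of training before dropping off near the end.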
The testing results in Table 2 demonstrate that the pre-trained model consistently helps the network achieve better performance.

Even without the pre-trained model, the 2D DenseUNet still outperformed the 3D DenseUNet, with faster convergence and lower loss, which highlights the effectiveness and efficiency of 2D convolutions in deep neural networks. In our experiments, the 3D DenseUNet took much more training time to converge (63 hours) than the 2D DenseUNet (21 hours). From Table 2, compared with the results generated by the 3D DenseUNet, the 2D DenseUNet with the pre-trained model achieved 8.7% and 3% improvement on the lesion segmentation results as measured by Dice per case and Dice global score, respectively.

Compared with the 2D DenseUNet, our proposed H-DenseUNet consistently advanced the segmentation results on both measurements for liver and tumor. The performance gains show that contextual information along the z dimension does contribute to the recognition of lesion and liver, especially for lesions that have much more blurred boundaries and are considered to be hard cases. Compared with the 3D DenseUNet (see the loss curve in Figure 3), our H-DenseUNet took much less training time to converge (around 30 hours), contributing to the efficiency in exploring 3D contexts. Figure 4 presents some examples of liver and tumor segmentation results of our H-DenseUNet on the testing dataset. We can observe that most small targets as well as large objects are well segmented.

Figure 4: Examples of liver and tumor segmentation results of H-DenseUNet from the testing dataset. The red regions denote the liver and the green ones denote the tumors.

Comparison with Other Methods

There were 49 submissions in total across the ISBI^1 and MICCAI^2 liver and tumor segmentation challenges. Both challenges employed the same training and testing datasets for fair performance comparison.
Different from the ISBI challenge, more evaluation metrics were added in the MICCAI challenge for comprehensive comparison. The detailed results of the top 15 teams across both the ISBI and MICCAI challenges are listed in Table 3.

1 ISBI challenge result: codalab.org/competitions/17094#learn_the_details-isbi
2 MICCAI challenge result: codalab.org/competitions/17094#results

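Table 2: Segmentation results by ablation study of our methods on the test dataset.

Model | Lesion: Dice global | Lesion: Dice per case | Liver: Dice global | Liver: Dice per case | Overall score
3D DenseUNet without pre-trained model | [...] | [...] | [...] | [...] | [...]
2D DenseUNet without pre-trained model | [...] | [...] | [...] | [...] | [...]
UNet (Chlebus et al., 2017) | [...] | [...] | [...] | [...] | [...]
ResNet (Han, 2017) | [...] | [...] | [...] | [...] | [...]
2D DenseUNet with pre-trained model | [...] | [...] | [...] | [...] | [...]
H-DenseUNet | [...] | [...] | [...] | [...] | [...]

Table 3: Results of the ISBI and MICCAI 2017 liver tumor segmentation challenges.

Team | Lesion: Dice global | Lesion: Dice per case | Liver: Dice global | Liver: Dice per case | Overall score
our | [...] | [...] | [...] | [...] | [...]
lehealth | [...] | [...] | [...] | [...] | [...]
hans.meine | [...] | [...] | [...] | [...] | [...]
Elehanx (Han, 2017) | [...] | [...] | [...] | [...] | [...]
medical | [...] | [...] | [...] | [...] | [...]
deepx | [...] | [...] | [...] | [...] | [...]
Njust | [...] | [...] | [...] | [...] | [...]
Medical (Vorontsov et al., 2017) | [...] | [...] | [...] | [...] | [...]
Gchlebus (Chlebus et al., 2017) | [...] | [...] | [...] | [...] | [...]
predible | [...] | [...] | [...] | [...] | [...]
Lei (Bi et al., 2017) | [...] | [...] | [...] | [...] | [...]
ed10b | [...] | [...] | [...] | [...] | [...]
chunliang | [...] | [...] | [...] | [...] | [...]
yaya | [...] | [...] | [...] | [...] | [...]
mabc | [...] | [...] | [...] | [...] | [...]

Notice: "-" denotes that the team participated in the ISBI competition and the measurement was not evaluated.

All the top teams in the challenges employed deep learning based methods, demonstrating the effectiveness of CNN based methods in medical image analysis. For example, Han (2017), Vorontsov et al. (2017) and Bi et al. (2017) all adopted 2D deep FCNs, where ResNet-like residual blocks were employed as the building blocks. In addition, Chlebus et al. (2017) trained two individual models for liver and tumor with the UNet architecture, followed by a random forest classifier. In comparison, our method with a 167-layer network consistently outperformed these methods, which highlights the efficacy of the 2D DenseUNet with the pre-trained model. Furthermore, our proposed H-DenseUNet surpassed the other methods on the overall performance, which validates the effectiveness of incorporating volumetric and auto-context information.

Conclusion

We present H-DenseUNet for liver and tumor segmentation from CT volumes, a new paradigm for effectively and efficiently exploiting 3D contexts under the spirit of the auto-context mechanism.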
The architecture addresses the respective weaknesses of 2D and 3D convolutions: 2D convolutions ignore volumetric context, while 3D convolutions suffer from long training time. Our method outperformed the others in the MICCAI 2017 LiTS Challenge on a single-model basis, demonstrating the strong performance of the proposed H-DenseUNet. In addition, our network architecture is inherently general and can be easily extended to other applications. In the future, we plan to conduct an extensive study of our method on a much larger liver dataset.

References
Ben-Cohen, A.; Diamant, I.; Klang, E.; Amitai, M.; and Greenspan, H. Fully convolutional network for liver segmentation and lesions detection. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Springer.
Bi, L.; Kim, J.; Kumar, A.; and Feng, D. Automatic liver lesion detection using cascaded deep residual networks. arXiv preprint.
Chen, H.; Ni, D.; Qin, J.; Li, S.; Yang, X.; Wang, T.; and Heng, P. A. Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE journal of biomedical and health informatics 19(5).
Chen, H.; Dou, Q.; Wang, X.; Qin, J.; Cheng, J. C.; and Heng, P.-A. 3d fully convolutional networks for intervertebral disc localization and segmentation. In International Conference on Medical Imaging and Virtual Reality, Springer.
Chen, H.; Dou, Q.; Yu, L.; Qin, J.; and Heng, P.-A. Voxresnet: Deep voxelwise residual networks for brain segmentation from 3d mr images. NeuroImage.
Chlebus, G.; Meine, H.; Moltz, J. H.; and Schenk, A. Neural network-based automatic liver tumor segmentation with random forest-based candidate filtering. arXiv preprint.
Chollet, F., et al. Keras. fchollet/keras.
Christ, P. F.; Elshaer, M. E. A.; Ettlinger, F.; Tatavarty, S.; Bickel, M.; Bilic, P.; Rempfler, M.; Armbruster, M.; Hofmann, F.; D'Anastasi, M.; et al. Automatic liver and lesion segmentation in ct using cascaded fully convolutional neural networks and 3d conditional random fields. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.
Christ, P. F.; Ettlinger, F.; Grün, F.; Elshaera, M. E. A.; Lipkova, J.; Schlecht, S.; Ahmaddy, F.; Tatavarty, S.; Bickel, M.; Bilic, P.; et al. Automatic liver and tumor segmentation of ct and mri volumes using cascaded fully convolutional neural networks. arXiv preprint.
Dou, Q.; Chen, H.; Jin, Y.; Yu, L.; Qin, J.; and Heng, P.-A. 3d deeply supervised network for automatic liver segmentation from ct volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.
Han, X. Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv preprint.
He, K.; Zhang, X.; Ren, S.; and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Huang, W.; Yang, Y.; Lin, Z.; Huang, G.-B.; Zhou, J.; Duan, Y.; and Xiong, W. Random feature subspace ensemble based extreme learning machine for liver tumor detection and segmentation. In Engineering in Medicine and Biology Society (EMBC), Annual International Conference of the IEEE, IEEE.
Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. Densely connected convolutional networks. arXiv preprint.
Jimenez-Carretero, D.; Fernandez-de Manuel, L.; Pascau, J.; Tellado, J. M.; Ramon, E.; Desco, M.; Santos, A.; and Ledesma-Carbayo, M. J. Optimal multiresolution 3d level-set method for liver segmentation incorporating local curvature constraints. In Engineering in medicine and biology society, EMBC, 2011 annual international conference of the IEEE, IEEE.
Lu, F.; Wu, F.; Hu, P.; Peng, Z.; and Kong, D. Automatic 3d liver location and segmentation via convolutional neural network and graph cut. International journal of computer assisted radiology and surgery 12(2).
Lu, R.; Marziliano, P.; and Thng, C. H. Liver tumor volume estimation by semi-automatic segmentation method. In Engineering in Medicine and Biology Society, IEEE-EMBS Annual International Conference of the, IEEE.
Moltz, J. H.; Bornemann, L.; Dicken, V.; and Peitgen, H. Segmentation of liver metastases in ct scans by adaptive thresholding and morphological processing. In MICCAI workshop, volume 41, 195.
Prasoon, A.; Petersen, K.; Igel, C.; Lauze, F.; Dam, E.; and Nielsen, M. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In International conference on medical image computing and computer-assisted intervention, Springer.
Ronneberger, O.; Fischer, P.; and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.
Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
Soler, L.; Delingette, H.; Malandain, G.; Montagnat, J.; Ayache, N.; Koehl, C.; Dourthe, O.; Malassagne, B.; Smith, M.; Mutter, D.; et al. Fully automatic anatomical, pathological, and functional segmentation from ct scans for hepatic surgery. Computer Aided Surgery 6(3).
Sun, C.; Guo, S.; Zhang, H.; Li, J.; Chen, M.; Ma, S.; Jin, L.; Liu, X.; Li, X.; and Qian, X. Automatic segmentation of liver tumors from multiphase contrast-enhanced ct images based on fcns. Artificial Intelligence in Medicine.
Tajbakhsh, N.; Shin, J. Y.; Gurudu, S. R.; Hurst, R. T.; Kendall, C. B.; Gotway, M. B.; and Liang, J. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging 35(5).
Tu, Z. Auto-context and its application to high-level vision tasks. In Computer Vision and Pattern Recognition, CVPR IEEE Conference on, 1-8. IEEE.
Vorontsov, E.; Abi-Jaoudeh, N.; and Kadoury, S. Metastatic liver tumor segmentation using texture-based omni-directional deformable surface models. In International MICCAI Workshop on Computational and Clinical Challenges in Abdominal Imaging, Springer.
Vorontsov, E.; Chartrand, G.; Tang, A.; Pal, C.; and Kadoury, S. Liver lesion segmentation informed by joint liver segmentation. arXiv preprint.

Wong, D.; Liu, J.; Fengshou, Y.; Tian, Q.; Xiong, W.; Zhou, J.; Qi, Y.; Han, T.; Venkatesh, S.; and Wang, S.-c. A semi-automated method for liver tumor segmentation based on 2d region growing with knowledge-based constraints. In MICCAI workshop, volume 41, 159.
