Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks

CS189 Final Project
Naofumi Tomita

Overview

Automated diagnosis of osteoporosis-related vertebral fractures is a useful application for both clinicians and potential patients. We aim to build a new automated diagnosis system based on CT images by applying recent advances in deep learning. In the proposal, we suggested three approaches that exploit the sequential nature of CT data, and pledged to implement two of them by the milestone in order to compare the validity of different models. However, we have implemented only one of them so far, and we have also replaced one of the original approaches. The results from the model implemented so far are reasonable and promising. We plan to conduct rigorous experiments on the first model, including measuring the effectiveness of each data augmentation technique, and to implement another model that uses 3D convolution.

Approaches

We have changed the list of approaches by dropping a variant of the C3D-network-based model and adding a weakly-supervised model. The weakly-supervised model would share a similar architecture with the ResNet-based model, but it would take all images in a volume into account instead of looking at a subset of images to make a prediction. This approach is ambitious but reasonable, because our data does not have slice-level annotations.

Updated list of approaches
1. ResNet-based model
2. C3D-network-based model
3. Weakly-supervised ResNet-based model

Data

A CT scan produces a volume of data for each examination. We collected 717 positive and 719 negative samples of Abdomen-Chest-Pelvis CT data, which cover the body from the top of the chest to the bottom of the pelvis, by courtesy of Dartmouth-Hitchcock Medical Center and students in the Geisel School of Medicine at Dartmouth. Each sample has a single label, positive or negative. Each volume contains between 80 and 300 images, depending on the examination. Each image is 512 by 512 pixels with a single grayscale channel. As prior knowledge, we know that vertebral fractures occur in the spine and that the spine usually appears in the middle 5% of the CT slices. We therefore extract those slices, which are likely to contain the spine and any fracture, to train the model in the first approach.

Model

The first approach is decomposed into two sub-networks: a ResNet-based classifier that extracts features and makes a prediction on a single image, and a classifier network that makes a sample-level prediction by aggregating the results of the first network. The sample-level classifier is trained on top of the slice-level classifier. The ResNet-based slice-level classifier has 6 residual blocks and 5 residual downsampling blocks. Each block has two convolutions, each followed by batch normalization and ReLU. The input size is 512x512, and the network produces a scalar value by applying a fully connected layer to 1028 features at the end.

We have implemented two different sample-level classifiers using LSTMs.
1. An LSTM with 3 layers of 1 hidden unit each, which takes a sequence of slice-level predictions. We call it LSTM1.
2. An LSTM with 1 layer of 128 hidden units, which takes the features extracted from the last layer before the fully connected layer of the slice-level classifier. We call it LSTM2.

We train the slice-level classifier on the subset of extracted images, under the assumption that the extracted images of a positive sample contain some clue or evidence of vertebral fracture. Those images inherit their labels from the sample. Although this assumption introduces some label noise, we expect that a sample-level classifier can absorb the noise when aggregating results from the slice-level classifier and make a better prediction.

FIGURE 1. OVERVIEW OF THE ARCHITECTURE FOR THE FIRST APPROACH
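The slice selection and label inheritance described in the Data and Model sections can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function names `middle_slice_indices` and `inherit_labels` are hypothetical, and the centering and rounding rules are assumptions, since the text only says that the middle 5% of slices is used.

```python
def middle_slice_indices(num_slices, fraction=0.05):
    """Return the indices of a centered window covering `fraction` of the slices.

    Assumption: the window is centered in the volume and its size is rounded
    to the nearest integer, with at least one slice always kept.
    """
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, round(num_slices * fraction))
    start = (num_slices - count) // 2
    return list(range(start, start + count))


def inherit_labels(sample_label, slice_indices):
    """Give every extracted slice the label of the sample it came from."""
    return [(index, sample_label) for index in slice_indices]
```

Under these assumptions, a volume of 80 slices keeps 4 slices and a volume of 300 slices keeps 15, so the slice-level training set stays small relative to the full volumes.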

Training

We constructed our dataset by splitting the samples 80:10:10 into training, validation, and testing sets. We apply the following data augmentation techniques to compensate for the small size of our dataset.
1. Horizontal translation, by zero-padding each side of an image by 12 pixels and randomly cropping a 512 by 512 image
2. Random rotation in the range of -3 to 3 degrees
3. Elastic deformation as proposed in [1], with α chosen randomly from the range 3 to 6 and σ chosen randomly from the range α to 2α

The slice-level classifier is optimized with mini-batch stochastic gradient descent, with a batch size of 48 and a momentum of 0.9. The initial learning rate is set to 0.000002 and halved every 40 epochs, and training stops at 100 epochs. All parameters are initialized with the method proposed in [2].

The first sample-level classifier, LSTM1, is optimized with stochastic gradient descent. Its input sequence consists of the prediction scores generated by the slice-level classifier on all extracted images from one sample. Because the number of extracted images varies between samples, the batch size also varies from sample to sample. The initial learning rate is set to 0.01 and decreased by a factor of 10 every 40 epochs, and training runs for 100 epochs. While training LSTM1, the last fc layer of the slice-level classifier is fine-tuned with a learning rate of 0.000001.

The other sample-level classifier, LSTM2, is also optimized with stochastic gradient descent. Its input is a sequence of vectors, where each vector holds the features extracted from the last layer before the fc layer of the slice-level classifier for one extracted image of a sample. In contrast to LSTM1, the slice-level classifier is used only as a feature extractor and is not trained further. The initial learning rate is set to 0.00001 and decreased by a factor of 10 every 40 epochs, and training stops at 100 epochs.

Experiments

We implement our model using PyTorch. We evaluate our model on the testing set. To measure performance in the context of medical research, we calculate the prediction accuracy, TPR (true positive rate, sensitivity, or recall), PPV (positive predictive value, or precision), and F-1 score for each class (fractured class and normal class). We evaluate the slice-level classifier and the sample-level classifiers separately, which shows the effectiveness of the sample-level classifiers.
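The per-class metrics listed above follow directly from counts of true positives, false positives, and false negatives. The sketch below shows one way to compute them in plain Python. The function names and the label encoding (1 for fractured, 0 for normal) are assumptions for illustration, and returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the report.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_metrics(y_true, y_pred, target_class):
    """Compute TPR, PPV, and F-1, treating `target_class` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if (tpr + ppv) else 0.0
    return {"tpr": tpr, "ppv": ppv, "f1": f1}
```

Calling `per_class_metrics` once with the fractured label and once with the normal label gives the two groups of columns reported per class. The TPR of the normal class is the specificity with respect to the fractured class.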

Single-image Classification

On our testing set, the slice-level classifier achieves 78% accuracy, and the other scores are also sound. Because the sample-level classification is built on top of this classifier, we plan to improve its accuracy by tuning hyperparameters.

            Fractured Class         Normal Class
Accuracy    TPR     PPV     F-1     TPR     PPV     F-1
0.78        0.80    0.79    0.80    0.76    0.76    0.76

TPR: sensitivity, recall. PPV: positive predictive value, precision.
TABLE 1. SLICE-LEVEL CLASSIFIER RESULTS ON TESTING SET

Sample-level Classification

We have tested our models as well as some simple non-parametric classifiers as baselines. The first classifier uses the confidence scores obtained from the slice-level classifier and takes a vote to decide the final prediction. Each confidence score votes positive if it is larger than 0.5 and negative otherwise. If over 70% of the scores vote positive, the classifier makes a positive prediction. The second classifier takes the maximum of the confidence scores and predicts positive if it is over 0.5. The third classifier works on the same principle but uses the average instead of the maximum.

                            Fractured Class         Normal Class
                Accuracy    TPR     PPV     F-1     TPR     PPV     F-1
1 Voting (70%)  0.79        0.81    0.86    0.84    0.76    0.69    0.72
2 MaxPooling    0.74        0.93    0.74    0.83    0.41    0.76    0.54
3 AvgPooling    0.81        0.86    0.85    0.85    0.71    0.74    0.73
4 LSTM1         0.90        0.92    0.92    0.92    0.86    0.86    0.86
5 LSTM2         0.85        0.92    0.86    0.89    0.73    0.84    0.78

TPR: sensitivity, recall. PPV: positive predictive value, precision.
TABLE 2. SAMPLE-LEVEL CLASSIFIERS RESULTS ON TESTING SET

The results for classifiers 1, 2, and 3 are reasonable with respect to the slice-level classification results. The third model achieves 81% accuracy, which is better than the other two simple classifiers. The LSTM-based models, however, clearly outperform these baselines, with accuracies of 0.90 and 0.85 against at most 0.81. LSTM1 in particular achieves 90% accuracy, which is remarkably higher than the other models. Our expectation was that LSTM2 would perform better than LSTM1, as it exploits extracted features instead of the confidence scores generated by the slice-level classifier. We conjecture that fine-tuning the fc layer of the slice-level classifier while training LSTM1 contributes to its performance, which would mean that the slice-level classifier has not been optimized well. The hyperparameters of LSTM2 also appear to need further investigation.

Future Work

As our results with the first model suggest, exploiting depth information in CT data is effective for achieving higher accuracy. We will continue to conduct rigorous experiments on the first model by fine-tuning hyperparameters, using another dataset to pretrain the model, and examining the effectiveness of each data augmentation technique. We will also implement another model that uses 3D convolution so that we can compare different models by the final date of the project. We will keep investigating the weakly-supervised approach as well.

References

[1] Simard, Patrice Y., David Steinkraus, and John C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis." ICDAR. Vol. 3. 2003.
[2] He, Kaiming, et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
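Supplementary sketch: the three non-parametric baselines from the Sample-level Classification section (voting, maximum, and average over slice-level confidence scores) can be written in a few lines. The function names are hypothetical, and the strict "greater than" comparisons are an assumption based on the wording "larger than 0.5" and "over 70%".

```python
def _check_scores(scores):
    """Reject empty inputs, since no decision rule is defined for them."""
    if not scores:
        raise ValueError("scores must contain at least one confidence score")


def vote_predict(scores, score_threshold=0.5, vote_fraction=0.7):
    """Predict positive if more than `vote_fraction` of the scores exceed the threshold."""
    _check_scores(scores)
    positive_votes = sum(1 for s in scores if s > score_threshold)
    return positive_votes / len(scores) > vote_fraction


def max_predict(scores, score_threshold=0.5):
    """Predict positive if the largest confidence score exceeds the threshold."""
    _check_scores(scores)
    return max(scores) > score_threshold


def avg_predict(scores, score_threshold=0.5):
    """Predict positive if the mean confidence score exceeds the threshold."""
    _check_scores(scores)
    return sum(scores) / len(scores) > score_threshold
```

A single confident slice is enough for `max_predict` to return positive. This is consistent with the MaxPooling row of Table 2, which has the highest TPR for the fractured class and the lowest TPR for the normal class.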