Skin Lesion Classification and Segmentation for Imbalanced Classes using Deep Learning

Mohammed K. Amro, Baljit Singh, and Avez Rizvi
mamro@sidra.org, bsingh@sidra.org, arizvi@sidra.org

Abstract - This paper summarizes our work for the ISIC 2018 challenge, Skin Lesion Analysis Towards Melanoma Detection [1], covering both Task 1: Lesion Boundary Segmentation and Task 3: Disease Classification. For Task 1 we used a modified version of U-Net called dilated U-Net, trained with a 4-fold cross-validation method and using test-time augmentation during the prediction phase, reaching a final 0.77 threshold Jaccard score on the ISIC 2018 validation set. For Task 3 we built two approaches. The first approach is a one-step classifier over the seven lesion types, while the second approach works in two steps: a binary classifier first labels the lesion as either Melanocytic nevus (the majority class in the dataset) or non-nevus (the remaining classes), and a second classifier then distinguishes the remaining six lesion types. The average accuracy was 88.8% for approach one and 89.8% for approach two, and reached 91.8% when utilizing both approaches at the same time.

INTRODUCTION

Melanoma is the deadliest form of skin cancer. Even though it is among the least common, the disease is responsible for around 91,000 deaths this year to date [2]. Early detection of skin lesions can help in treatment and improve survival. Dermoscopy refers to the examination of the skin under a microscope [3] to point out skin abnormalities and then classify them. Dermoscopic images are free from skin surface reflections; these enhanced images help dermatologists diagnose melanoma accurately. Deep CNNs are known for producing state-of-the-art results in medical diagnosis. We participated in the ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection challenge to train models that could act as a screening aid for dermatologists by segmenting and classifying skin lesions.
ISIC 2018 is based on the HAM10000 (Human Against Machine with 10000 training images) dataset [4], with 10015 images in the training set, 193 images in the validation set, and 1000 images in the testing set. The dataset contains lesions from seven classes (Melanoma, Melanocytic nevus, Basal cell carcinoma, Actinic keratosis, Benign keratosis, Dermatofibroma, and Vascular lesion).

TASK 1: LESION BOUNDARY SEGMENTATION

TASK GOAL

Submit automated predictions of lesion segmentation boundaries within dermoscopic images.

EVALUATION METRIC

Predictions are scored with the thresholded Jaccard index metric, which compares the pixel-wise agreement between a predicted segmentation and its corresponding ground truth: the per-image score is zero if the Jaccard index is below 0.65, and the Jaccard index value otherwise. The mean of all per-image scores is then taken as the final metric value for the entire set.
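The thresholded Jaccard metric described above can be sketched as follows. This is a minimal illustration assuming binary masks stored as NumPy arrays; the function name is ours, not from the challenge toolkit.

```python
import numpy as np

def thresholded_jaccard(pred, truth, threshold=0.65):
    """Per-image score: the Jaccard index if it reaches the threshold, else 0."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    jaccard = np.logical_and(pred, truth).sum() / union
    return float(jaccard) if jaccard >= threshold else 0.0
```

Averaging this score over all images in a set yields the final challenge metric.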
DATASET PROCESSING

The ISIC 2018 Task 1 training set contains 2594 lesion images with their 2594 corresponding masks. As a first step, we reviewed the dataset and discovered issues in some of the ground truth masks, as shown in figure 1. To clean the dataset and train our network on accurate data, we trained a sample network on the whole training set without validation, allowing it to overfit, then inspected only the images with a low score and compared them against their masks. As an output of this step, we removed 106 images from the dataset and developed the model using the remaining 2488 images.

Figure 1 Sample of some issues in ground truth masks

NETWORK ARCHITECTURE

For Task 1 segmentation, we used a modified version of the U-Net architecture called dilated U-Net. The traditional U-Net design consists of three parts: an encoder network, bottleneck layers, and a decoder network; in our solution we used 6 CONV layers in the bottleneck part.

TRAINING STRATEGY

We trained our model using 4-fold cross-validation, with 1866 images for training and 622 images for validation in each fold, using the RMSprop optimizer with a 0.0001 learning rate. K-fold cross-validation allows the model to train on the whole dataset, which reduces the effect of any remaining errors in the provided masks by producing four different models, each validated on a different set of images.

PREDICTION STRATEGY

During the prediction phase, for each of our final 4-fold models we used Test Time Augmentation (TTA), conducting four predictions per image (the original image, the horizontally flipped image, the vertically flipped image, and the horizontally and vertically flipped image), as shown in figure 2. We then calculated the average, minimum, and maximum for each fold, ending with twelve different predictions per lesion. In our final submission, we used the three best-performing predictions on the challenge validation set, which scored a 0.77 threshold Jaccard index, for the final testing predictions.
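The TTA scheme described above can be sketched as follows. This is an illustrative sketch, not the paper's actual code: `model_predict` stands in for a trained segmentation model, and each flipped prediction is mapped back to the original orientation before aggregating.

```python
import numpy as np

def tta_predict(model_predict, image):
    """Predict on four flip variants, undo each flip on the output mask,
    and return the average, minimum, and maximum of the aligned masks."""
    variants = [
        (image,                       lambda m: m),
        (np.fliplr(image),            np.fliplr),
        (np.flipud(image),            np.flipud),
        (np.flipud(np.fliplr(image)), lambda m: np.fliplr(np.flipud(m))),
    ]
    masks = np.stack([undo(model_predict(img)) for img, undo in variants])
    return masks.mean(axis=0), masks.min(axis=0), masks.max(axis=0)
```

Running this for each of the four folds gives the twelve candidate predictions per lesion mentioned above.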
Figure 2 Prediction using Test Time Augmentation

TASK 1 RESULTS

Table 1 shows the best score for each fold during model training, evaluated against 622 validation images without applying the challenge threshold of 0.65.

Table 1 Task 1: Segmentation Performance
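The 4-fold training scheme behind these per-fold scores can be sketched as a simple partition of the 2488 cleaned images. The interleaved split below is our own illustration; the paper does not specify how images were assigned to folds.

```python
def four_fold_splits(images, k=4):
    """Partition the image list into k folds; each model trains on k-1 folds
    (1866 images here) and validates on the held-out fold (622 images)."""
    folds = [images[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [img for j, fold in enumerate(folds) if j != i for img in fold]
        splits.append((train, folds[i]))
    return splits
```

Each of the four (train, validation) pairs yields one model, and every image serves as a validation example exactly once.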
TASK 3: LESION DIAGNOSIS

Figure 3 Different Types of Lesions

TASK GOAL

The primary objective of this task is to classify dermoscopic images into the following seven lesion types: Melanoma (MEL), Melanocytic nevus (NV), Basal cell carcinoma (BCC), Actinic keratosis (AKIEC), Benign keratosis (BKL), Dermatofibroma (DF), and Vascular lesion (VASC).

EVALUATION METRIC

Final predicted results are scored using a multi-class accuracy metric (balanced across categories).

DATASET PROCESSING

According to the HAM10000 paper [4], the training dataset contains multiple images of the same skin lesion taken from different camera angles. The percentage data distribution for each class is shown in table 2.
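The balanced multi-class accuracy used for scoring can be sketched as the mean of per-class recalls; this is a minimal NumPy illustration, not the official challenge scorer.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, num_classes=7):
    """Mean per-class recall, so rare classes such as DF and VASC
    count as much as the dominant NV class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        in_class = (y_true == c)
        if in_class.any():
            recalls.append((y_pred[in_class] == c).mean())
    return float(np.mean(recalls))
```

Because every class contributes equally, always predicting the majority class scores poorly under this metric, which motivates the balancing steps described below.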
Table 2 HAM10000 Data Distribution

The imbalance between classes is apparent, especially for Dermatofibroma and Vascular lesion; most of the lesions belong to the Melanocytic nevus class. In total, there were 5515 unique images and 4500 images with variants.

NETWORK ARCHITECTURE

For our classification work, we used two models, Xception [5] and DenseNet121 [6], pre-trained on the ImageNet dataset. On top of the pre-trained weights we added three CONV layers with 2048, 512, and 512 neurons and two Dropout layers with a drop factor of 0.5.

TRAINING STRATEGY

We implemented two approaches. The first builds a one-step classifier over all seven classes at the same time (the 7C model). The second builds a two-step classifier: a binary classifier between Melanocytic nevus (the dominant class) and the other six classes together (the 2C model), whose result is passed to another classifier that distinguishes the remaining six classes (the 6C model).

We split the data into a 70% training set and a 30% testing set. Because lesions have different numbers of image variants, the data was split manually: with careful consideration, all the unique skin lesion images were put in the testing split, and the rest, which includes lesions with two or more image variants, were put in the training split, to prevent the models from overfitting during training. We trained with the Adam optimizer at a 0.0001 learning rate and with different image sizes of 224, 256, 299, and 512.

As can be observed from table 2, the data is largely imbalanced. The majority class, NV, makes up 66% of the entire dataset, while some classes make up only 1-3% of it. To deal with this issue, we employed upsampling: the images in each class were duplicated until their number equaled the number of images in the majority class (NV for the 7C and 2C models, MEL for the 6C model). Both the training and validation splits were upsampled to obtain a balanced set of images, allowing the model to train on balanced classes and optimize the average accuracy.

PREDICTION STRATEGY

During prediction, we used ensemble predictions from all models generated with the two networks (Xception and DenseNet121) and the different image sizes (224, 256, 299, and 512).

TASK 3 RESULTS

We tested our models using the selected testing set of 3005 images. The 7C classifier achieved an average accuracy of 84.2% as per figure 4, the 2C classifier achieved 95.16% as per figure 5, and the 6C classifier scored 87.69% as per figure 6.

Figure 4 7C Model Confusion Matrix for Local Test Dataset
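The upsampling by duplication described in the training strategy can be sketched as follows. The mapping from class label to image list is our own illustrative structure, not the paper's actual pipeline.

```python
import random

def upsample(class_to_images, seed=0):
    """Duplicate images in every minority class until each class matches
    the size of the majority class (NV for the 7C/2C models, MEL for 6C)."""
    rng = random.Random(seed)
    target = max(len(imgs) for imgs in class_to_images.values())
    return {
        label: imgs + [rng.choice(imgs) for _ in range(target - len(imgs))]
        for label, imgs in class_to_images.items()
    }
```

After this step every class contributes the same number of samples per epoch, so the loss no longer favors the NV class.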
Figure 5 2C Model Confusion Matrix for Local Test Dataset Figure 6 6C Model Confusion Matrix for Local Test Dataset
Figure 7 2C + 6C Confusion Matrix for Test Dataset

Figure 8 2,6C + 7C Confusion Matrix for Test Dataset

For the final submission to the challenge, we submitted the result of the 7C model as approach one, the output of the 2C and 6C models as approach two, and the average of both predictions as approach three. On the ISIC 2018 validation set, which consists of 193 images, we scored 88.8% using the 7C model, 89.8% using the 2C+6C models, and 91.8% with the average of the 7C and 2,6C predictions.
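One plausible way to realize the 2C+6C combination and the final averaging is sketched below. The paper does not specify how the two stages are merged, so both the probability-product cascade and the class order ([NV, MEL, BCC, AKIEC, BKL, DF, VASC]) are our assumptions for illustration.

```python
def cascade_probs(p_nv, p_rest):
    """Combine the 2C binary output (probability of NV) with the 6C
    distribution over the remaining six classes into seven class scores.
    Class order is hypothetical: [NV, MEL, BCC, AKIEC, BKL, DF, VASC]."""
    return [p_nv] + [(1.0 - p_nv) * p for p in p_rest]

def average_ensemble(probs_7c, probs_cascade):
    """Approach three: average the 7C output with the 2C+6C cascade output."""
    return [(a + b) / 2.0 for a, b in zip(probs_7c, probs_cascade)]
```

If `p_rest` sums to one, the cascade output is itself a valid seven-class distribution, so the two approaches can be averaged element-wise.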
RESULTS AND CONCLUSION

For Task 1: Segmentation, our model scored a 0.77 threshold Jaccard index on the ISIC 2018 validation set, while for Task 3: Classification, our ensemble models scored 88.8%, 89.8%, and 91.8%, respectively. The primary challenge in the segmentation task was the noise in the ground truth masks, which we addressed by cleaning the dataset before training, while the challenge in the classification task was the imbalance between classes, which we addressed by upsampling.

REFERENCES

[1] "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection," 2018. [Online]. Available: https://challenge2018.isic-archive.com/.
[2] "Melanoma Stats, Facts, and Figures," 2018. [Online]. Available: https://www.aimatmelanoma.org/about-melanoma/melanoma-stats-facts-and-figures.
[3] "Dermoscopy & Mole Scans in Perth and Regional WA," [Online]. Available: https://myskincentre.com.au/service/dermoscopy/.
[4] P. Tschandl, C. Rosendahl, and H. Kittler, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions," Sci. Data, vol. 5, p. 180161, 2018.
[5] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," CoRR, vol. abs/1610.02357, 2016.
[6] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," CoRR, vol. abs/1608.06993, 2016.