Assessing the Local Credibility of a Medical Image Segmentation

Joshua H. Levy, Robert E. Broadhurst, Surajit Ray, Edward L. Chaney, and Stephen M. Pizer

Medical Image Display and Analysis Group (MIDAG), University of North Carolina at Chapel Hill

Abstract. When a clinician uses an automatic method to segment a medical image, she must either accept the computer's segmentation or manually evaluate its quality and correct it as needed. This paper introduces another option: a methodology for identifying regions where the segmentation is not credible. Our methodology identifies regions where a local geometry-to-image match function returns a value that is improbably poor when compared to the distribution of values returned in that region for a set of training images with acceptable segmentations. We validate our methodology with experiments performed on CT images of the kidney.

1 Introduction

Errors in medical image segmentation are inevitable. Niessen [1] has proposed that an important step towards understanding the causes of such errors is to visualize the distribution of voxels where the segmentation is incorrect. One implementation of this idea is Gerig's [2] VALMET tool. VALMET can render the surface of a segmentation so that the coloring of each point indicates the distance to the nearest point of a ground truth segmentation for that patient. This is an effective method for understanding the performance of a segmentation algorithm, but it relies on the availability of ground truth data for the image.

We present here a statistical test to measure the credibility of a segmentation at various points along its boundary. This test does not require that gold-standard truth be known for the image, but it does require that a local geometry-to-image match function ("image match") can be evaluated at the surface of the segmentation. The image match in [3] operates at global scale, taking into account all voxels from a shape-normalized image.
The image match functions defined in [4] and [5] operate at a more local scale; these functions compare the intensity profile in an object-relative region with a template.

The statistical test that we propose is based on the local distribution of values of an image match function. Let $I$ denote an image with segmentation $m$. Suppose that $I$ can be decomposed into model-relative regions $\{x\}$ with regard to $m$. Let $f_x(m, I)$ denote the local geometry-to-image match of the region $x$ for the specified model and image. Without loss of generality, let us assume that $f_x(m, I)$ returns non-negative values, and that a smaller value indicates a better match. Our method requires that we estimate the probability distribution of $f_x(m, I)$ independently for each region $x$ over a set of images $\{I\}$ with segmentations $\{m\}$ that are known not to be grossly erroneous. Then, for a new image $\hat{I}$ with a segmentation $\hat{m}$, we identify the set of regions where the local image match value is significantly high:

$$X = \{x : p[f_x(m, I) \ge f_x(\hat{m}, \hat{I})] < \rho\} \qquad (1)$$

for some critical value $\rho$. The segmentation of $\hat{I}$ has tested positively as noncredible for the regions in $X$.

In the work presented here the image match is due to Broadhurst [6], and we can make a principled approximation of the probability distribution used in (1). This method begins with a set of training images, each of which has a clinically acceptable segmentation. Image intensity histogram information, in the form of the tuple of intensities forming the means of all histogram quantiles, is recorded for each of a set of object-relative regions over the set of training cases. The image match function then measures the sum of the squared distances between the image intensity quantile tuples seen for each region relative to a candidate model and the corresponding tuples for the set of training cases. Because the Euclidean distance between quantile sets is equivalent to the Mallows (Earth Mover's) distance between histograms, Broadhurst uses an estimated squared Mahalanobis distance based on PCA as the geometry-to-image match. Because squared Mahalanobis distance is the sum of the squares of independent standard normals, it follows the $\chi^2$ distribution, with one degree of freedom per retained principal mode. This allows for a simple implementation of (1) because critical values of the distribution are well known.

2 Method & Materials

In this section we describe the data on which we tested our methods. In Sect.
2.1 we describe the tests we performed applying our method to assess the local credibility of these segmentations. Then, in Sect. 2.2 we discuss the validation of these tests.

In this experiment, we have a set of 39 CT images of the abdomen, and we are interested in assessing the credibility of segmentations of the left kidney. The segmentations were produced using deformable shape models (DSM) [4, 7]. We chose to use m-reps as our DSM because the correspondences implied by the m-rep coordinate system make the m-rep representation particularly well suited for this application. In order to train the image intensity quantile statistics for use in the image match, and so that we have a reference point for evaluating the quality of these segmentations, we have an m-rep instance fit to an expert segmentation for each of the images.

We then define a set of regions in object-relative coordinates for use in the image match procedure. Specifically, we define an interior and an exterior region for a neighborhood around each of the sampled spoke ends of the m-rep model.

We partition the set of images multiple times, so that each image belongs to a testing set. For each of these testing sets, there is a disjoint set from which intensity quantile statistics can be trained. The PCA on the intensity quantiles retains 2 principal modes for each interior region of the kidney and 3 principal modes for each exterior region of the kidney. The image match functions $f^{\mathrm{int}}_x(m, I)$ and $f^{\mathrm{ext}}_x(m, I)$ are the squared Mahalanobis distances for the interior and exterior of the object, respectively, in the neighborhood of $x$. We use the following approximations:

$$f^{\mathrm{int}}_x(m, I) \sim \chi^2_2 \qquad (2)$$

$$f^{\mathrm{ext}}_x(m, I) \sim \chi^2_3 \qquad (3)$$

$$f^{\mathrm{sum}}_x(m, I) = f^{\mathrm{int}}_x(m, I) + f^{\mathrm{ext}}_x(m, I) \sim \chi^2_5 \qquad (4)$$

We segment each of the kidney images using two strategies. First, following the procedure described in [7], we produce automated segmentations for each image. Let $M^+$ denote this set of segmentations. The second set of segmentations, which we will label $M^-$, is intended to test the ability of our method to predict gross errors. The segmentations in this set are clinically unacceptable and are the results of random large-scale transformations applied to the reference models. Some of these models undergo a global transformation consisting of a rotation of up to 30° about an arbitrary axis, along with uniform scaling by up to 20%, as in Fig. 1.a. The remainder of the models undergo a local deformation produced by the translation of randomly selected medial atoms.

Fig. 1. (a) A well segmented kidney (inside) from $M^+$ and the corresponding poor segmentation (outside) from $M^-$. (b-d) A kidney from $M^+$. (b) Regions that test positive ($\rho = 0.05$) are yellow; regions that test negative are red. (c) Noncredible regions (maximum shortest distance to the reference model above 0.75 cm) are in yellow. Acceptably segmented regions are in blue. (d) Intersection of the segmentation, colored according to test result, with an axial slice of the CT image. There is clearly a gross error in the region the test identifies as noncredible.
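To illustrate how the quantities in (1)-(4) fit together, the following Python sketch computes intensity quantile means for a region, a squared Mahalanobis image match over retained PCA modes, and the upper-tail $\chi^2$ probability that is compared with $\rho$. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, the PCA mean, modes, and variances are assumed to be already trained, and the tail probability is computed with a standard recurrence for integer degrees of freedom. Note that approximation (4) treats the interior and exterior matches as independent, so that their degrees of freedom add.

```python
import math


def quantile_means(intensities, n_quantiles):
    """Mean intensity in each of n equal-mass quantile bins of a region.

    Assumes len(intensities) >= n_quantiles so that no bin is empty.
    """
    values = sorted(intensities)
    n = len(values)
    means = []
    for k in range(n_quantiles):
        lo, hi = k * n // n_quantiles, (k + 1) * n // n_quantiles
        chunk = values[lo:hi]
        means.append(sum(chunk) / len(chunk))
    return means


def squared_mahalanobis(q, mean, modes, variances):
    """Sum over retained PCA modes of (projection onto mode)^2 / variance.

    `mean`, `modes` (unit vectors), and `variances` are assumed to come
    from a PCA trained on quantile tuples of acceptable segmentations.
    """
    centered = [qi - mi for qi, mi in zip(q, mean)]
    total = 0.0
    for mode, var in zip(modes, variances):
        proj = sum(c * v for c, v in zip(centered, mode))
        total += proj * proj / var
    return total


def chi2_sf(x, dof):
    """Upper-tail probability P[X >= x] for X ~ chi-square with integer dof."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Base cases: dof = 2 gives exp(-z); dof = 1 gives erfc(sqrt(z)).
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a e^{-z} / Gamma(a + 1).
    while a < dof / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def is_noncredible(f_value, dof, rho=0.05):
    """Test (1): flag the region when the tail probability falls below rho."""
    return chi2_sf(f_value, dof) < rho


# Hypothetical match values for one region: interior (2 modes), exterior (3 modes).
f_int, f_ext = 3.2, 9.5
print(is_noncredible(f_int, 2), is_noncredible(f_ext, 3), is_noncredible(f_int + f_ext, 5))
```

In this sketch a region is tested three times, once for each of (2), (3), and (4); which of these to report is the choice examined in Sect. 3.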

2.1 Assessing Credibility

It is our hypothesis that a noncredible region will, with high probability, have an abnormally poor image match relative to the same region in acceptable segmentations. We evaluate the local image match functions (2-4) for each region of each segmentation. We compare each image match score with the CDF of the appropriate $\chi^2$ distribution. After choosing a value for the parameter $\rho$, we identify the noncredible regions on each segmentation surface. We present the noncredible regions to the clinician and ask that she either verify or correct the segmentation in these regions.

2.2 Validation

We assume that the reference segmentations are clinically acceptable and treat them as the truth. This allows us to calculate an error metric for each region of each segmentation surface and to measure how effectively the test (1) identifies the noncredible regions. We take a dense sampling of points on the segmentation surface in a given region, and we measure the minimum distance from each of these points to the boundary of the reference object. For validation purposes, we say a region is truly noncredible if the maximum of this set of minimum distances is above a threshold.

The threshold at which we consider a region to be incorrectly segmented is a free parameter of our validation. Rao [8] presents measures of intra-rater variability for humans performing kidney segmentation. It is tempting to use such a number as our threshold. An alternative strategy is to consider the application for which the segmentations are being produced. If this application has an inherent tolerance, it can be used as the threshold here. In our analysis we studied the performance of our algorithm against a variety of error thresholds and then empirically chose results to present here.

The threshold p-value $\rho$ in (1) is also a free parameter of our system.
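The region-level error measure of Sect. 2.2, the maximum over sampled surface points of the minimum distance to the reference boundary, is a directed Hausdorff distance from the region to the reference. The Python sketch below shows one way it could be computed. It assumes, for illustration only, that the reference boundary is also available as a dense set of sampled points given in centimeters; the function names are hypothetical and the brute-force search is chosen for clarity rather than speed.

```python
import math


def region_error(region_points, reference_points):
    """Maximum over region points of the distance to the nearest reference point."""
    return max(
        min(math.dist(p, r) for r in reference_points)
        for p in region_points
    )


def truly_noncredible(region_points, reference_points, threshold_cm):
    """Validation label: the region error is above the chosen threshold."""
    return region_error(region_points, reference_points) > threshold_cm


# Hypothetical samples: two points on the segmentation, two on the reference.
region = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)]
print(region_error(region, reference), truly_noncredible(region, reference, 0.75))
```

For dense samplings, a spatial index or a distance transform of the reference object would avoid the quadratic cost of the nested loops.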
For a fixed error definition, the choice of $\rho$ determines both the percentage of noncredible regions that are identified by the test (sensitivity) and the percentage of clinically acceptable regions that the test correctly ignores (specificity). We produce a receiver operating characteristic (ROC) curve for the method by allowing $\rho$ to vary while the error threshold is held constant.

3 Results

We produce renderings of segmentation surfaces that highlight the noncredible regions, as in Fig. 1.b and Fig. 1.d. These are precisely the views that we provide to a clinician. From these views, one can see that in this particular case the segmentation algorithm was confused by tissue adjacent to the kidney. Figure 1.c requires knowledge of ground truth, and together with the previous images it illustrates both true positives and false negatives of our test at this $\rho$ value.

We validate our method by measuring the sensitivity and specificity given a fixed definition of error. We produce the ROC by varying the parameter $\rho$.
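The construction of an ROC curve by sweeping $\rho$ at a fixed error threshold can be sketched as follows. This Python example is an illustration under assumptions rather than the evaluation code behind the reported numbers: it takes one tail probability and one validation label per region, uses hypothetical function names, assumes that at least one region of each class is present, and estimates the area under the curve with the trapezoid rule.

```python
def roc_points(p_values, truth, rhos):
    """Return (false positive rate, true positive rate) for each rho.

    p_values[i] is the tail probability of region i under test (1);
    truth[i] is True when region i is truly noncredible.
    """
    positives = sum(truth)
    negatives = len(truth) - positives
    points = []
    for rho in rhos:
        flagged = [p < rho for p in p_values]
        tp = sum(f and t for f, t in zip(flagged, truth))
        fp = sum(f and not t for f, t in zip(flagged, truth))
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoid-rule area under the ROC, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Hypothetical data: four regions, two of them truly noncredible.
p_values = [0.001, 0.2, 0.03, 0.6]
truth = [True, False, True, False]
curve = roc_points(p_values, truth, rhos=[0.01, 0.05, 0.5])
print(curve, auc(curve))
```

Each pair returned by `roc_points` corresponds to one column of sensitivity and false positive rate for a given $\rho$.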

Figures 2 and 3 give examples of such curves. Tables 1 and 2 summarize the ROCs by listing the area under the curve (AUC), listing the true and false positive rates for given values of $\rho$, and identifying the $\rho$ value that produces a true positive rate of 0.9 with the minimal false positive rate.

Fig. 2. ROC characterizing the performance of our algorithm, using $f^{\mathrm{ext}}_x$ as the image match, (a) on the segmentations in $M^+$ (b) on the erroneous segmentations in $M^-$.

Fig. 3. ROC characterizing the performance of our algorithm, using $f^{\mathrm{sum}}_x$ as the image match, (a) on segmentations in $M^+$ (b) on segmentations in $M^-$.

We can identify noncredible regions in the segmentations of $M^+$ with statistics on image match (3). The area under the curve in Fig. 2.a is above 0.8 for a range of reasonable error definitions. These statistics were less effective for predicting noncredible regions in the segmentations of $M^-$. In these cases AUC was in the range 0.6–0.7. Image match (4) performed consistently on both data sets. It produced AUC above 0.7 for the same error thresholds.

There is an intuitive explanation for the difference in AUC between Fig. 2.a and Fig. 3.a. Consider an image $I$ with reference segmentation $m$ and a new segmentation $\hat{m}$ that is produced by eroding $m$ in a neighborhood $x$. Because the tissue in the kidney is approximately homogeneous, $f^{\mathrm{int}}_x(m, I) \approx f^{\mathrm{int}}_x(\hat{m}, I)$. The exterior of $\hat{m}$ in this neighborhood contains more interior intensity than usual, so we expect that $f^{\mathrm{ext}}_x(\hat{m}, I)$ will have an unusually large value. This region is likely to be a true positive of our test using (3) and a false negative using (2). We cannot say for certain whether this region will be a true positive or a false negative of the test using (4). It is such cases that explain why $f^{\mathrm{ext}}_x$ is a more effective test than $f^{\mathrm{sum}}_x$ for identifying noncredible regions of the segmentations in $M^+$.

Error 0.75 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.786                            0.665
ρ                  0.05    0.01    0.001   0.172    0.05    0.01    0.001   0.104
Sensitivity        0.606   0.379   0.197   0.909    0.845   0.682   0.459   0.902
1 − Specificity    0.235   0.114   0.042   0.430    0.601   0.456   0.294   0.656

Error 1 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.827                            0.681
ρ                  0.05    0.01    0.001   0.095    0.05    0.01    0.001   0.093
Sensitivity        0.682   0.409   0.227   0.909    0.875   0.769   0.512   0.900
1 − Specificity    0.245   0.121   0.046   0.333    0.634   0.476   0.310   0.685

Error 1.25 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.814                            0.618
ρ                  0.05    0.01    0.001   0.224    0.05    0.01    0.001   0.151
Sensitivity        0.714   0.429   0.143   1.000    0.792   0.694   0.431   0.903
1 − Specificity    0.249   0.124   0.048   0.503    0.663   0.510   0.335   0.754

Table 1. Summary of Fig. 2. Performance of our method using $f^{\mathrm{ext}}_x(m, I)$ as the image match. In each group, the last column gives the $\rho$ value that produces a true positive rate of 0.9 with the minimal false positive rate.

It is worth noting that not all false positives of this test are undesirable. For example, Fig. 4 shows the segmentation of an image with structured noise in it.
A miscalibrated detector in the scanner led to reconstruction errors that produced the concentric circles in the image. The regions on the segmentation surface that intersect these imaging errors are labeled as noncredible. Although the segmentation in this area is acceptable, there is value in alerting the clinician to the errors in the image. Yet such cases are counted as false positives in our validation.

Error 0.75 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.725                            0.713
ρ                  0.05    0.01    0.001   0.236    0.05    0.01    0.001   0.002
Sensitivity        0.545   0.303   0.182   0.909    0.990   0.949   0.858   0.902
1 − Specificity    0.292   0.131   0.051   0.558    0.767   0.708   0.612   0.650

Error 1 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.736                            0.745
ρ                  0.05    0.01    0.001   0.152    0.05    0.01    0.001   0.000
Sensitivity        0.500   0.273   0.227   0.909    1.000   0.981   0.912   0.900
1 − Specificity    0.300   0.137   0.054   0.478    0.801   0.741   0.642   0.583

Error 1.25 cm
                   Segmentations from $M^+$           Segmentations from $M^-$
AUC                0.699                            0.784
ρ                  0.05    0.01    0.001   0.333    0.05    0.01    0.001   0.000
Sensitivity        0.429   0.286   0.143   1.000    1.000   1.000   0.986   0.903
1 − Specificity    0.302   0.138   0.056   0.666    0.819   0.762   0.661   0.478

Table 2. Summary of Fig. 3. Performance of our method using $f^{\mathrm{sum}}_x(m, I)$ as the image match. In each group, the last column gives the $\rho$ value that produces a true positive rate of 0.9 with the minimal false positive rate.

4 Discussion

We have presented a method for identifying noncredible regions on a segmentation surface. Our method is based on probability distributions of a local geometry-to-image match function. As such it does not require knowledge of ground truth. We have implemented this method using an image match function based on intensity histogram statistics, and we have applied it to identify noncredible regions in DSM segmentations of the left kidney in CT images.

In the future, we plan on dropping our requirement that the local distributions of image match are independent of each other. We will allow a covariance structure between regions, and we will study the local distribution of image match conditioned on the image match value for neighboring regions.

We expect to see higher specificity from the current method by merely changing implementation details. In this work, we used a single clinically accepted segmentation as a surrogate for ground truth. A better truth estimate, such as one we could produce by using STAPLE [9] to combine expert segmentations, would improve both the quality of our image match training and our ability to validate the results.
While we are satisfied by the correspondences implied by the m-rep coordinate system, it is possible that we could produce tighter correspondences using an information theoretic method such as MDL [10]. In that case we would expect an improvement in the intensity histogram statistics that would lead to both better segmentations and a better ability to identify noncredible regions in segmentations.

Fig. 4. A segmentation with false positives on the noncredibility test. (a) Positive test results ($\rho = 0.01$) are yellow. (b) Noncredible regions (maximum shortest distance above 1 cm) are yellow. (c) Intersection of the surface with an axial slice of the CT image. Note the presence of imaging artifacts in the region of false positives of our test.

References

1. Niessen, W.J., Bouma, C.J., Vincken, K.L., Viergever, M.A.: Error metrics for quantitative evaluation of medical image segmentation. In Klette, R., Stiehl, H.S., Viergever, M.A., Vincken, K.L., eds.: Theoretical Foundations of Computer Vision, Kluwer (1998) 275–284
2. Gerig, G., Jomier, M., Chakos, M.: Valmet: A new validation tool for assessing and improving 3D object segmentation. In Niessen, W.J., Viergever, M.A., eds.: MICCAI. Volume 2208 of Lecture Notes in Computer Science, Springer (2001) 516–523
3. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In Burkhardt, H., Neumann, B., eds.: ECCV (2). Volume 1407 of Lecture Notes in Computer Science, Springer (1998) 484–498
4. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: their training and application. Computer Vision and Image Understanding 61 (1995) 38–59
5. Stough, J., Pizer, S.M., Chaney, E.L., Rao, M.: Clustering on image boundary regions for deformable model segmentation. In: ISBI, IEEE (2004) 436–439
6. Broadhurst, R., Stough, J., Pizer, S., Chaney, E.: A statistical appearance model based on intensity quantile histograms. ISBI (2006)
7. Pizer, S., Fletcher, T., Fridman, Y., Fritsch, D., Gash, A., Glotzer, J., Joshi, S., Thall, A., Tracton, G., Yushkevich, P., Chaney, E.: Deformable m-reps for 3D medical image segmentation. International Journal of Computer Vision, Special UNC-MIDAG issue 55(2) (2003) 85–106
8. Rao, M., Stough, J., Chi, Y.Y., Muller, K., Tracton, G., Pizer, S.M., Chaney, E.L.: Comparison of human and automatic segmentations of kidneys from CT images. International Journal of Radiation Oncology*Biology*Physics 61 (2005) 954–960
9. Warfield, S.K., Zou, K.H., Wells III, W.M.: Validation of image segmentation and expert quality with an expectation-maximization algorithm. In Dohi, T., Kikinis, R., eds.: MICCAI (1). Volume 2488 of Lecture Notes in Computer Science, Springer (2002) 298–306
10. Davies, R.H., Twining, C.J., Cootes, T.F., Waterton, J.C., Taylor, C.J.: A minimum description length approach to statistical shape modeling. IEEE Trans. Med. Imaging 21(5) (2002) 525–537