IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 18, NO. 2, FEBRUARY 1999 137

A Task-Specific Evaluation of Three-Dimensional Image Interpolation Techniques

George J. Grevera, Jayaram K. Udupa,* Senior Member, IEEE, and Yukio Miki

Abstract: Image interpolation is an important operation that is widely used in medical imaging, image processing, and computer graphics. A variety of interpolation methods are available in the literature. However, their systematic evaluation is lacking. In a previous paper, we presented a framework for the task-independent comparison of interpolation methods based on certain image-derived figures of merit using a variety of medical image data pertaining to different parts of the human body taken from different modalities. In this work, we present an objective task-specific framework for evaluating interpolation techniques. The task considered is how the interpolation methods influence the accuracy of quantification of the total volume of lesions in the brain of multiple sclerosis (MS) patients. Sixty lesion-detection experiments coming from ten patient studies, two subsampling techniques and the original data, and three interpolation methods are carried out, along with a statistical analysis of the results.

Index Terms: Computer-aided diagnosis, image interpolation, multidimensional image processing, three-dimensional imaging.

I. INTRODUCTION

INTERPOLATION is a commonly used operation in image processing, computer graphics, and medical imaging. Multidimensional image interpolation is needed in a variety of situations including 1) representing images at a desired or isotropic level of discretization; 2) changing the orientation of the discretization grid; 3) combining image information about the same object from multiple modalities; and 4) changing grid systems, for example, from polar to rectangular. Interpolation methods may be broadly classified into two groups: scene based and object based.
In scene-based techniques [13], [14], [16], the intensity values of the resulting interpolated images (scenes) are derived directly from the intensity values of the given scene. In object-based methods [3]–[8], [15], interpolation is not guided systematically by the grid system as in scene-based approaches, but is directed by some object information derived from the scene. There has been repeated evidence in the literature [4]–[8], [15] of the superior performance of object-based over scene-based approaches. However, a systematic objective comparison of methods has been lacking, especially in medical three-dimensional (3-D) imaging. In a recent paper [6], we compared the performance of eight interpolation methods in an application-independent fashion based purely on image-derived figures of merit, using data sets from different medical applications, different body parts, different modalities, and different patients. The study indicated strong evidence of the superior performance of a family of methods known as shape-based techniques [6].

Manuscript received October 6, 1998; revised February 1, 1999. This work was supported by the NIH under Grants P01-CA53141 and NS 37172. The Associate Editor responsible for coordinating the review of this paper and recommending its publication was M. Vannier. Asterisk indicates corresponding author.
G. J. Grevera is with Medical Informatics Research, Department of Radiology, University of Pennsylvania Health System, Philadelphia, PA 19104 USA.
*J. K. Udupa is with the Medical Image Processing Group, Department of Radiology, 423 Guardian Drive, University of Pennsylvania Health System, Philadelphia, PA 19104 USA (e-mail: jay@mipg.upenn.edu).
Y. Miki is with the Department of Nuclear Medicine and Diagnostic Imaging, Kyoto University Hospital, Sakyo-ku, Kyoto 606, Japan.
Publisher Item Identifier S 0278-0062(99)03149-3.

To see how far this evidence carries over in particular applications, we decided to conduct a task-specific comparison, which is the topic of this paper. The specific application we chose is the detection and quantification of multiple sclerosis (MS) lesions of the brain via MRI. Although this task is somewhat artificial from the point of view of interpolation (that is, interpolation is not needed for segmenting and quantifying lesions), we chose this application because of the resources (image data and technical and clinical expertise) available within our department related to MS research.

Our experimental approach is as follows. We randomly selected ten MS patient studies from a database of about 1000 studies. (All studies pertain to MS patients who have shown positive indications on traditional clinical motor and cognitive tests.) These ten studies are then subsampled by two subsampling techniques. One technique requires the interpolation method to yield missing slices and the other requires estimating missing information within the slice. The various combinations of original, subsampled, and interpolated data from the original ten studies yield 60 studies. These 60 studies are then used as input to the lesion-detection system. This system [17] and the large image database were developed and evaluated, and have been used extensively [1], [2], [9]–[12] in MS research, independently of this study; they are therefore not the focus of this study and will not be described in great detail. The output of this system is a set of lesions. The interpolation methods are compared by determining how the lesions detected from the interpolated data deviate from those detected from the original images. The task-specific evaluation will thus determine how well different methods estimate missing lesion information via interpolation.

0278-0062/99$10.00 © 1999 IEEE

II. MATERIALS AND METHODS

A. MR Image Data

For further reference, we will call a volume image a scene $\mathcal{V} = (V, f)$, where $V$ is a finite array of voxels and, for any
voxel $v$ in $V$, $f(v)$ denotes its scene intensity value. The imaging modality employed is dual-echo fast-spin-echo MRI, which yields a T2-weighted and a PD-weighted scene for each patient study. Thus, we begin with ten pairs of 3-D T2 and PD scenes. The studies have been selected such that each study has a sufficient number of lesions that span three or more slices. These studies also contain smaller lesions that may span one or two slices.

We conducted a set of six lesion-detection experiments for each patient study, as illustrated in Fig. 1. We utilized two subsampling schemes, referred to here as scheme 1 and scheme 2 (see Fig. 1), to create scenes of lower discretization, which were subsequently converted to scenes of the original level of discretization by interpolation. In scheme 1, alternate slices, and in scheme 2, alternate rows and columns, were left out. For scheme 1, linear grey-level interpolation (ln1) [13], a method from the shape-based family [5] that was determined to be the best as per task-independent evaluation [6] (sba1), and the method of Goshtasby et al. [4] (go1) were utilized for interpolation. For scheme 2, only linear (ln2) and the shape-based method (sba2) were used, since the method of Goshtasby et al. is not applicable to this situation. The lesion-detection method was then applied to each of these five interpolated studies. The results were compared to the result obtained by applying the lesion-detection method to the original study (denoted by "or" in Fig. 1). Note that, altogether, 100 interpolated 3-D scenes (five methods × two scenes × ten studies) were created.

Fig. 1. A schematic representation of the methods of producing input data for lesion-detection experiments.

Fig. 2. A schematic representation of the experimental setup for comparing interpolation methods.

B. An Overview of the Lesion-Detection System

The basic idea behind this system is that the segmentation process consists of two very different tasks, namely, recognition and delineation.
The human expert is particularly good at the recognition task (determining roughly the whereabouts of the object), whereas the computer algorithm is particularly adept at the delineation task (determining the spatial extent of the objects). The method of [17] employs both aspects as follows. Initially, the user picks a few points on a single slice in each study, about halfway through the 3-D scene, to indicate white matter (WM), grey matter (GM), and cerebrospinal fluid (CSF) objects to the system. Then, a fuzzy connectedness object detection algorithm [17] detects WM, GM, and CSF each as a 3-D fuzzy connected object. Subsequently, all potential lesions are detected as fuzzy connected objects. The expert is then presented with the results and accepts only the true lesions with the click of a button. Finally, the volume of each accepted lesion and the total volume are computed. This process is illustrated in Fig. 2, which also indicates the manner in which interpolation is incorporated into the experiments. The method has been tested extensively and used in clinical investigation on about 1000 studies to date [9], [17].

Fig. 3 shows some sample slices of a data set with T2 (top row) and PD (bottom row) acquisitions. Fig. 4 shows lesions defined in these slices prior to (top row) and after (bottom row) expert verification. Fig. 5 shows lesions defined by the different methods, before expert verification, in the interpolated slice corresponding to the slice in the middle column in Fig. 3.

C. Methods of Comparison

The lesion-detection system eventually outputs a 3-D binary scene for each study by automatically thresholding the lesion connectivity scene [11]. The 1-voxels in this scene correspond to points in the lesions. Information pertaining to the set of 1-voxels that constitute each individual fuzzy connected 3-D lesion is also available from the system. Given these items of information, one possible way of comparing
Fig. 3. An example of a few slices from a 3-D T2 (top) and PD (bottom) scene used as input to the lesion-detection process.

Fig. 4. Part of 3-D scenes which illustrate the result of the automatic lesion-detection software prior to expert verification (top) and after expert verification (bottom).

Fig. 5. An example slice of the result of lesion detection (prior to expert verification) for the various interpolation methods. From top left to bottom right: or, ln1, ln2, sba1, sba2, and go1.

two methods, say $m_1$ and $m_2$, is to determine how much the binary scenes $B_{m_1}$ and $B_{m_2}$ resulting from $m_1$ and $m_2$ for a given input T2-PD scene pair deviate from the binary scene $B_{or}$ resulting from the original data (method "or"). Let $B_{or}$ and $B_m$, $m \in M$, also denote the sets of 1-voxels of these scenes, where $M$ is the set of interpolation methods as described above. Then we may define the set of false negative voxels as $B_{or} \setminus B_m$, the set of false positive voxels as $B_m \setminus B_{or}$, and the set of true positive voxels as $B_{or} \cap B_m$. Using these definitions, for each patient study and interpolation method $m$ we arrive at the following four measures, referred to as false negative volume fraction ($\mathit{fnvf}_m$), false positive volume fraction ($\mathit{fpvf}_m$), true positive volume fraction ($\mathit{tpvf}_m$), and similarity index ($\mathit{si}_m$):

$$\mathit{fnvf}_m = \frac{|B_{or} \setminus B_m|}{|B_{or}|} \qquad (1)$$

$$\mathit{fpvf}_m = \frac{|B_m \setminus B_{or}|}{|B_{or}|} \qquad (2)$$

$$\mathit{tpvf}_m = \frac{|B_{or} \cap B_m|}{|B_{or}|} \qquad (3)$$

$$\mathit{si}_m = \frac{2\,|B_{or} \cap B_m|}{|B_{or}| + |B_m|} \qquad (4)$$

These measures are computed for all studies, and the methods are then compared using the paired Student's t-test to determine whether their performance differed with statistical significance and to ascertain which method is superior.

TABLE I. THE MEAN OF THE MEASURES OVER TEN STUDIES BEFORE EXPERT VERIFICATION AND INCLUDING ALL LESIONS. (THE NUMERICAL VALUES ARE THE MEAN OF THE DIFFERENT MEASURES OVER TEN STUDIES FOR EACH METHOD. THE BEST RESULT FOR EACH OF THE TWO SAMPLING SCHEMES IS UNDERLINED. THE RIGHT PART LISTS THOSE AMONG ALL POSSIBLE VALID PAIRS OF METHODS FOR A GIVEN SUBSAMPLING SCHEME FOR WHICH $p \le 0.05$.)

TABLE II. THE MEAN OF THE MEASURES OVER TEN STUDIES AFTER EXPERT VERIFICATION AND INCLUDING LESIONS OF ALL SIZES.

TABLE III. THE MEAN OF THE MEASURES OVER TEN STUDIES BEFORE EXPERT VERIFICATION AND INCLUDING ONLY LARGE LESIONS.

A variety of interesting pairwise comparisons can be made among the methods that share a subsampling scheme. Since we also have information about each 3-D lesion, it is possible to analyze performance for large lesions (spanning four or more slices and being at least 200 voxels in size) and for small lesions separately.
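The four overlap measures and the paired comparison lend themselves to a short illustration. The following is only a minimal sketch, not the software used in the study: it assumes that each binary scene is represented as a Python set of voxel coordinates, that every volume fraction is normalized by the lesion volume found in the original data, and that the similarity index is the Dice-style overlap. The function names `volume_measures` and `paired_t_statistic` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def volume_measures(b_or, b_m):
    """Return (fnvf, fpvf, tpvf, si) for two sets of lesion voxels.

    b_or: voxels labeled as lesion in the result from the original data.
    b_m:  voxels labeled as lesion in the result from interpolated data.
    Assumptions: fractions are normalized by |b_or|; si is a Dice overlap.
    """
    if not b_or:
        raise ValueError("reference lesion set must be nonempty")
    false_neg = b_or - b_m   # lesion voxels missed after interpolation
    false_pos = b_m - b_or   # voxels wrongly labeled as lesion
    true_pos = b_or & b_m    # voxels labeled as lesion in both results
    n_ref = len(b_or)
    fnvf = len(false_neg) / n_ref
    fpvf = len(false_pos) / n_ref
    tpvf = len(true_pos) / n_ref
    si = 2 * len(true_pos) / (len(b_or) + len(b_m))
    return fnvf, fpvf, tpvf, si


def paired_t_statistic(xs, ys):
    """Paired t statistic and degrees of freedom for per-study values.

    xs and ys hold one measure (e.g., tpvf) for two methods, in the same
    study order. The differences must not all be equal, otherwise the
    sample standard deviation is zero and the statistic is undefined.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

For ten studies the statistic has nine degrees of freedom; turning it into a p-value requires the Student t distribution, which the standard library does not provide, so that step is left out of the sketch.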
For scheme 2 (in-plane subsampling), it is perhaps appropriate to analyze both groups of lesions, while small lesions may be lost in the subsampling process of scheme 1 (slice subsampling).

III. RESULTS

We first calculate the false negative, false positive, and true positive volume fractions and the similarity index on the direct output of the automatic lesion-detection program. Performing this analysis prior to expert verification allows us to present results that are devoid of observer (expert) variability. These results are summarized in Table I. Analogously, Table II presents results after expert verification. Tables III and IV are analogous to Tables I and II, except that only large lesions are considered for comparison. Note that in Tables I–IV the numeric values are the mean of the different measures over ten studies for each method. The best result for each of the two sampling schemes is underlined. The right part lists those among all possible valid pairs of methods for a given subsampling scheme for which $p \le 0.05$.

The idea underlying the false positive volume fraction can be extended to compare lesions before and after expert verification. This would give us an idea as to how the interpolation methods influence the user effort required in eliminating false positives. For any interpolation
method $m \in M$, or for the original data, let $B_m^{b}$ and $B_m^{a}$ denote the sets of voxels constituting all lesions in a given patient study before and after expert verification, respectively. By analogy with the false positive volume fraction, we define the user effort required for the interpolation method $m$ for a given study as

$$\mathit{ue}_m = \frac{|B_m^{b} \setminus B_m^{a}|}{|B_m^{a}|} \qquad (5)$$

The greater the $\mathit{ue}_m$, the larger is the effort required by the user to eliminate false positives. Table V lists the mean $\mathit{ue}_m$ for each $m$ over all data sets and those pairs for which $p \le 0.05$ in a paired t-test.

TABLE IV. THE MEAN OF THE MEASURES OVER TEN STUDIES AFTER EXPERT VERIFICATION AND INCLUDING ONLY LARGE LESIONS.

TABLE V. MEAN USER EFFORT REQUIRED OVER TEN STUDIES.

Fig. 6. An example of determining all of the unions of the various binary scenes (left) prior to expert verification in Fig. 5 and using this as a mask on a grey data set (middle) to obtain only that part of the scene that may correspond to lesions (right).

As a final analysis, consider an approach which combines aspects of the task-independent approach published previously [6] and the task-specific approach described thus far. Specifically, we may assess how well the various methods perform when intensities are compared in the original and interpolated scenes restricted to areas of possible lesions, as illustrated in Fig. 6. First, we note that the true location and extent of the lesions are not known. Furthermore, the delineated lesion sites for a given data set differ among interpolation methods. To accommodate these situations, we consider the union of all detected lesions for the original data and for the interpolated data for each study. We restrict our comparison to the slice subsampling scheme (scheme 1) only, because it is applicable to all three of the selected interpolation methods, and use the mean-squared difference ($\mathit{msd}$) criterion used in task-independent comparison [6]. Let $\mathcal{V} = (V, f)$ and $\mathcal{V}' = (V, f')$ be a given original scene and its interpolated version, respectively, and let $X$ be any nonempty subset of $V$.
Then, we define

$$\mathit{msd}_X(\mathcal{V}, \mathcal{V}') = \frac{1}{|X|} \sum_{v \in X} \left[ f(v) - f'(v) \right]^2 \qquad (6)$$

where $f$ and $f'$ are the intensities of the original and interpolated scenes and $|X|$ denotes the number of elements in $X$. We are interested in the cases when $X = V$ and $X = L$, where $L$ is the union of all lesions found in $\mathcal{V}$ and $\mathcal{V}'$ for all interpolation methods. Table VI lists the mean $\mathit{msd}_V$ obtained over ten studies for T2 and PD scenes for different interpolation methods. It also lists pairs of methods that showed statistically significant differences ($p \le 0.05$). Analogously, Table VII shows the mean $\mathit{msd}_L$ obtained over ten studies for T2 and PD scenes for the different interpolation methods.

IV. DISCUSSION

First of all, from Tables I–IV it is clear that all interpolation methods produce results that are statistically significantly different from the original data for all four measures. This is
true for both situations, before and after expert verification. If a method were to show no significant difference in all measures, it would certainly have been a preferred method. However, such is not the case. We note that, overall, sba exhibits the best lesion-definition accuracy, although its superior performance is not always statistically significant. Perhaps the most important result is that, for lesions that are not very small, sba gave the best accuracy for both subsampling schemes, as illustrated in Table III. These results are statistically significant for the slice subsampling scheme, which is, practically, the more relevant situation, as most data are acquired with greater slice separation than the size of the pixel. Furthermore, from the point of view of the accuracy of lesion definition, the measures on which sba performs best are the most crucial. Since, in our system, false positives are eliminated by the user, the higher false positive volume fraction exhibited occasionally (not always statistically significantly) by sba over ln and go does not really matter. In summary, our experiments indicate that sba is generally more accurate than ln and go in lesion definition, especially for lesions that are not very small, but it may require more effort to eliminate false positives, as reflected by the user-effort measure (although not statistically significantly). These observations are consistent with the results obtained in task-independent comparisons [6].

TABLE VI. MEAN $\mathit{msd}_V$ OVER TEN T2 AND PD STUDIES.

TABLE VII. MEAN $\mathit{msd}_L$ (FOR LESIONS OF ALL SIZES) OVER TEN T2 AND PD STUDIES.

TABLE VIII. MEAN $\mathit{msd}_L$ (FOR LARGE LESIONS ONLY) OVER TEN T2 AND PD STUDIES.

As evidenced by the results in Table VI and consistent with our task-independent performance results [6], sba1 is statistically significantly better than the other methods when estimated intensities are compared over the whole scene for both T2 and PD data. When these comparisons are confined only to the lesion areas, the superiority of sba1 seems to be not as strong, as seen from the results in Table VII.
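The difference between comparing intensities over the whole scene and over the lesion areas only can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: scenes are represented as Python dictionaries mapping voxel coordinates to intensities, lesion regions as sets of coordinates, and the helper names `msd` and `lesion_union` are hypothetical.

```python
def msd(original, interpolated, subset=None):
    """Mean-squared intensity difference over a nonempty set of voxels.

    original, interpolated: dicts mapping voxel coordinates to intensity
        (an assumed representation of a scene and its interpolated version).
    subset: voxels to compare; None means every voxel of the original scene.
    """
    voxels = list(original) if subset is None else list(subset)
    if not voxels:
        raise ValueError("the comparison region must be nonempty")
    total = sum((original[v] - interpolated[v]) ** 2 for v in voxels)
    return total / len(voxels)


def lesion_union(*lesion_sets):
    """Union of the lesion voxels detected in several binary results."""
    region = set()
    for lesions in lesion_sets:
        region |= lesions
    return region
```

Calling `msd(scene, interp)` compares the whole scene, whereas `msd(scene, interp, lesion_union(b_or, b_ln1, b_sba1, b_go1))` restricts the comparison to voxels that any result marked as lesion; the set names in this call are placeholders.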
In PD scenes, the performance of one method in estimating lesion intensities is statistically significantly better than that of the other methods, whereas, in T2 scenes, one method outperforms another statistically significantly (Table VII). Table VIII shows $\mathit{msd}_L$ values for T2 and PD scenes when the calculation is confined to large lesions ($\ge 200$ voxels) only. Although the statistical significance is lost, the behavior indicated by this table is similar to that in Tables VI and VII. Finally, we observe that the various comparisons indicate a similar behavior between ln1 and go1. This is understandable because go1 essentially employs linear interpolation along the line connecting each pair of matching pixels from two adjacent slices.

V. CONCLUDING REMARKS

Interpolation is an important operation widely utilized in many disciplines, including image processing, tomographic image reconstruction, computer graphics, and 3-D imaging. Although a variety of interpolation methods have been reported in the literature, their objective comparison has not been made, especially in 3-D imaging, the discipline to which the subject matter of this paper belongs. We have previously described an application-independent framework [6] and compared eight slice interpolation methods utilizing 3-D image data sets from 20 patients, from different modalities, applications, and body regions. The analysis indicated strong evidence of the superior performance of a particular family of methods known as shape-based techniques [6]. The comparison method was based purely on image-derived intensity similarity criteria and was independent of the applications.

In this paper we have presented a framework based on a specific application to evaluate three of the methods that previously indicated superior performance. Specifically, we chose the linear method, the method of Goshtasby et al. [4], and the top performer in the shape-based family [6]. The task considered was to determine how the methods influence the detection
and quantification of subtle MS lesions of the brain via dual-echo MRI. We have utilized a previously published [17] and routinely clinically utilized method for lesion definition. The results indicate that the shape-based method is more accurate than the others in lesion definition. However, it may produce more false positives and therefore may require more operator effort in excluding them. The observed performance differences for small lesions are generally not statistically significant, which implies that we may need larger sample sizes than the ten patient studies used in this investigation.

ACKNOWLEDGMENT

The authors express their gratitude to R. Grossman for making the MRI data available for this study.

REFERENCES

[1] M. A. van Buchem, J. K. Udupa, F. H. Heyning, M. P. Boncoeur-Martel, Y. Miki, J. C. McGowan, D. L. Kolson, M. Polansky, and R. I. Grossman, "Quantitation of macroscopic and microscopic cerebral disease burden in multiple sclerosis," Am. J. Neuroradiology, vol. 18, pp. 1287–1290, 1997.
[2] M. A. van Buchem, R. I. Grossman, Y. Miki, J. K. Udupa, M. Polansky, and J. C. McGowan, "Correlation of volumetric magnetization transfer imaging with clinical data in MS," Neurology, vol. 50, pp. 1609–1617, 1998.
[3] D. Eberly, R. B. Gardner, B. S. Morse, S. M. Pizer, and C. Scharlach, "Ridges for image analysis," J. Math. Imag. Vision, vol. 4, pp. 351–371, 1994.
[4] A. Goshtasby, D. A. Turner, and L. V. Ackerman, "Matching tomographic slices for interpolation," IEEE Trans. Med. Imag., vol. 11, pp. 507–516, Dec. 1992.
[5] G. J. Grevera and J. K. Udupa, "Shape-based interpolation of multidimensional grey-level images," IEEE Trans. Med. Imag., vol. 15, pp. 881–892, Dec. 1996.
[6] G. J. Grevera and J. K. Udupa, "An objective comparison of 3-D image interpolation methods," IEEE Trans. Med. Imag., vol. 17, pp. 642–652, Aug. 1998.
[7] G. T. Herman, J. Zheng, and C. A. Bucholtz, "Shape-based interpolation," IEEE Comput. Graphics Appl., vol. 12, pp. 69–79, May 1992.
[8] W. E. Higgins, C. Morice, and E. L. Ritman, "Shape-based interpolation of tree-like structures in three-dimensional images," IEEE Trans. Med. Imag., vol. 12, no. 3, pp. 439–450, Sept. 1993.
[9] Y. Miki, R. I. Grossman, S. Samarasekera, J. K. Udupa, M. A. van Buchem, B. S. Cooney, S. N. Pollack, D. K. Kolson, M. Polansky, and L. J. Mannon, "Clinical correlation of computer assisted enhancing lesion quantification in multiple sclerosis," Am. J. Neuroradiology, vol. 18, pp. 705–710, 1997.
[10] Y. Miki, R. I. Grossman, J. K. Udupa, L. Wei, D. L. Kolson, and L. J. Mannon, "Isolated U-fiber involvement in multiple sclerosis: Preliminary observations," Neurology, vol. 50, pp. 1301–1306, 1998.
[11] Y. Miki, R. I. Grossman, J. K. Udupa, M. A. van Buchem, L. Wei, M. D. Phillips, U. Patel, J. C. McGowan, and D. L. Kolson, "T2-lesion volume, enhancing lesion volume, whole brain magnetization ratio histogram peak height, % brain volume, and expanded disability status scale: Difference in cross-sectional correlations between relapsing-remitting and chronic progressive multiple sclerosis," Radiology, to be published.
[12] M. Phillips, R. I. Grossman, Y. Miki, L. Wei, D. L. Kolson, M. A. van Buchem, M. Polansky, and J. K. Udupa, "Correlation of T2 lesion volume and MTR histogram analysis and atrophy and measures of lesion burden in patients with multiple sclerosis," Am. J. Neuroradiology, vol. 19, pp. 1055–1060, 1998.
[13] W. K. Pratt, Digital Image Processing. New York: Wiley, 1991.
[14] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C. New York: Cambridge Univ. Press, 1992.
[15] S. P. Raya and J. K. Udupa, "Shape-based interpolation of multidimensional objects," IEEE Trans. Med. Imag., vol. 9, pp. 32–42, Mar. 1990.
[16] M. R. Stytz and R. W. Parrott, "Using Kriging for 3D medical imaging," Computerized Med. Imag. Graphics, vol. 17, no. 6, pp. 421–442, 1993.
[17] J. K. Udupa, L. Wei, S. Samarasekera, Y. Miki, M. A. van Buchem, and R. I. Grossman, "Multiple sclerosis lesion quantification using fuzzy connectedness principles," IEEE Trans. Med. Imag., vol. 16, pp. 598–609, Oct. 1997.