Learning in Medical Image Databases. Cristian Sminchisescu. Department of Computer Science. Rutgers University, NJ
December 1998

Abstract

In this paper we present several results obtained by experimenting with Bayesian minimum Mahalanobis distance and k-nearest neighbor classification methods in a medical domain. The training set consists of four classes of different pathologies. Individual pathologies are represented in terms of their contour description, by a high-dimensional vector (the modal vector) abstracting their shape. We employ a divergence criterion to identify features with high discriminative power. The performance measurements suggest that the Bayes classifier outperforms the weighted k-nearest neighbor classifier, a result which is not surprising considering the particularly noisy structure of the training data set.

Keywords: Machine learning, Bayesian learning, Nearest neighbor classification, Feature selection.

1 Introduction

The problem we investigate in this work concerns classifying dental pathologies based upon the shape of the pathology. Specifically, we employ a clinical radiograph image database containing 64 dental pathologies, grouped into 4 different classes of dental disease, each class consisting of 16 elements. The 4 classes represent the progressive evolution of the disease: class C1 represents incipient forms of the disease, while class C4 consists of the most advanced stage of dental disease (figure 6). Each pathology contour has been identified (in the dental radiograph) and the pathology has been labeled into one of the 4 classes of the disease by an expert physician. However, dealing simply with the pathology contour (consisting of a set of coordinates) does not give us any powerful description or abstraction of the shape of the pathology.
In order to abstract the shape of a pathology, starting from the raw contour points, we use modal analysis, a computer vision technique for obtaining descriptions of objects in terms of a vector of deformation modes [5]. In particular, we employ a prototype shape (in this case, an ellipse) and compute the modal displacement vector associated with the process of deforming (or aligning) the prototype shape with the underlying lesion shape (figure 7). The individual components of such a vector represent the deformation modes (for instance, tapering, bending, pitching, and so on) of a shape at finer and finer levels of detail (the number of modes can be quite large, of order 10^3). The modal description of a shape resembles a Fourier description of a signal, where the low-order frequencies convey the signal's coarse characteristics while the high frequencies convey very fine signal information. Consequently, we expect the higher-order modes in the shape description to be more sensitive to noise and less useful in terms of discriminative power (or classification ability). This domain knowledge allows us to cut the dimensionality of the modal vector (and the subsequent analysis) down to the first 30 modes.

2 Bayesian Learning

The problem we are studying can be formulated in a Bayesian framework as follows: given a set of classes c_i, i = 1..c (c = 4 in our case), we want to design a classifier which, given a feature vector x, maximizes the probability of correctly classifying it.

2.1 Bayes Classifier

We use Bayes' rule:

    P(c_i | x) = p(x | c_i) P(c_i) / p(x)    (1)
where

    p(x) = \sum_{j=1}^{c} p(x | c_j) P(c_j)    (2)

In the above equations, P(c_j) are the a priori probabilities for class c_j, p(x | c_j) are the class-conditional probability densities (essentially the probability of observing x, given the class c_j), and P(c_j | x) are the a posteriori probabilities (all j's within the range 1..c). Essentially, Bayes' rule gives a quantitative account of how observing the values of x changes the a priori probability P(c_j) into the a posteriori probability P(c_j | x).

The pattern classification problem can be formulated in terms of a set of discriminant functions g_i(x), i = 1..c, each associated with a different class. Given a feature vector x, the classifier computes the c discriminant functions and selects the class corresponding to the largest one. Consequently, the vector x is assigned to class c_i if:

    g_i(x) > g_j(x), for all j != i    (3)

For the particular case of a Bayesian classifier, we can take g_i(x) = P(c_i | x), so the maximum discriminant function naturally corresponds to the maximum a posteriori probability. However, the choice of a particular discriminant function is not unique: it could, for instance, be multiplied by a positive constant or biased by an additive constant. Moreover, replacing every g_i(x) by f(g_i(x)), where f is a monotonically increasing function, preserves the results of the classification [2]. This observation leads to both analytical and computational simplifications:

    g_i(x) = P(c_i | x) = p(x | c_i) P(c_i) / \sum_{j=1}^{c} p(x | c_j) P(c_j)    (4)

gives the same classification results as:

    g_i(x) = \log p(x | c_i) + \log P(c_i)    (5)

(note that the term \sum_{j=1}^{c} p(x | c_j) P(c_j) is a sum having the same value for all classes, so one can treat it as a constant). For the experiments, we assume equally likely a priori class probabilities, that is, P(c_i) = P(c_j) for all i, j in 1..c.
Consequently, equation 5 becomes:

    g_i(x) = \log p(x | c_i)    (6)

and the central problem transfers to estimating the conditional densities p(x | c_j), j = 1..c.

2.2 Multivariate Normal Density

The assumption used with many Bayesian classifiers, which we follow in the experiments as well, is that the conditional density is multivariate normal (defined precisely below). This is an appropriate model for the case when the feature vectors x corresponding to a class c_i are continuous-valued, moderately corrupted versions of a prototypical vector \mu_i. The multivariate normal density can be written as:

    p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)]    (7)

where x is a d-component column vector, \mu is the d-component mean vector, and \Sigma is the d-by-d covariance matrix. This density is commonly abbreviated as p(x) ~ N(\mu, \Sigma). By using equations 6 and 7, one can devise the discriminant functions corresponding to minimum-error-rate classification (leaving aside the constant and scale factors):

    g_i(x) = -(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \log |\Sigma_i|    (8)

Under the particular assumption that the covariance matrices for all classes are equal, that is, \Sigma_i = \Sigma for all i in 1..c, the discriminant functions become:

    g_i(x) = -(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)    (9)

In the above equation, the quantity:

    r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)    (10)

is the squared Mahalanobis distance. To classify a feature vector x, one measures the squared Mahalanobis distance from x to each of the c mean vectors and assigns x to the class corresponding to the nearest mean. In the experiments we use a minimum Mahalanobis distance classifier, based on the discriminant function presented in equation 9.

2.3 Parameter Estimation: The Mean and Covariance Matrix

Under the assumption of a multivariate normal distribution, the Maximum Likelihood (ML) estimate for the mean vector is:

    \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k    (11)
and the covariance matrix is estimated as:

    \hat{\Sigma} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T    (12)

where the x_k are the sample data. One can notice that the ML estimate for the unknown population mean is the sample mean, or the centroid of the sample cloud. The estimate for the covariance matrix is the sample covariance which, with the 1/(n-1) normalization above, can be shown to be unbiased. For the experiments we use the above formulas to obtain estimates of the mean vectors and covariance matrices corresponding to each of the classes c_i, i in {1..4}, involved in the learning process.

3 Nearest-Neighbor Classification

Considering feature vectors of the form x = (a_1(x), a_2(x), ..., a_n(x)), the distance between two instances x_i and x_j is given by:

    d(x_i, x_j) = \sqrt{ \sum_{k=1}^{n} (a_k(x_i) - a_k(x_j))^2 }    (13)

Given a new instance x_q to be classified, the k-nearest neighbor algorithm performs a vote among the k instances nearest to x_q. More precisely, given the set of possible classification decisions (classes) C = {c_1, c_2, c_3, c_4}, the algorithm decides according to:

    \hat{f}(x_q) = \arg\max_{c \in C} \sum_{i=1}^{k} \delta(c, f(x_i))    (14)

where \delta(a, b) = 1 if a = b and 0 otherwise. In the experiments, we employ a distance-weighted k-nearest neighbor method:

    \hat{f}(x_q) = \arg\max_{c \in C} \sum_{i=1}^{k} w_i \delta(c, f(x_i))    (15)

where w_i = 1 / d(x_q, x_i)^2.

4 Discriminant Feature Selection

In this section, we are interested in identifying those features (vector components) which are highly relevant in terms of classification performance (providing good separation between the groups to be classified). A brute-force approach to such a problem is to consider the power set of the feature set, run the classification process for each particular combination of features, and select the feature set leading to the best classification performance. However, this solution quickly proves intractable: for the 30-feature vectors we use, we would need to generate the power set of their features, that is, 2^30 possible subsets.
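As a concrete illustration, the two classifiers described above (the minimum Mahalanobis distance classifier of Section 2 and the distance-weighted k-nearest neighbor rule of Section 3) can be sketched in a few lines of numpy. This is a minimal sketch on synthetic data, not the paper's implementation; the class layout (4 classes of 16 samples, 30-dimensional feature vectors) mirrors the setup described in the Introduction, but the data itself is randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_per_class, d = 4, 16, 30

# Synthetic training set: one well-separated Gaussian cloud per class.
means = [rng.normal(loc=3 * c, scale=1.0, size=d) for c in range(n_classes)]
X = np.vstack([rng.normal(m, 1.0, size=(n_per_class, d)) for m in means])
y = np.repeat(np.arange(n_classes), n_per_class)

# --- Minimum Mahalanobis distance classifier (equations 9-12) ---
# Per-class mean and sample covariance (np.cov uses the 1/(n-1) form of eq. 12).
mu = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
cov = [np.cov(X[y == c], rowvar=False) for c in range(n_classes)]
# With 16 samples in 30 dimensions the covariance is singular, so we use the
# pseudo-inverse rather than a plain inverse.
cov_inv = [np.linalg.pinv(S) for S in cov]

def classify_mahalanobis(x):
    # Assign x to the class whose mean is nearest in squared Mahalanobis distance.
    r2 = [(x - mu[c]) @ cov_inv[c] @ (x - mu[c]) for c in range(n_classes)]
    return int(np.argmin(r2))

# --- Distance-weighted k-nearest neighbor (equations 13-15) ---
def classify_knn(x, k=5, eps=1e-12):
    dist = np.linalg.norm(X - x, axis=1)        # Euclidean distance, eq. 13
    nearest = np.argsort(dist)[:k]
    votes = np.zeros(n_classes)
    for i in nearest:
        votes[y[i]] += 1.0 / (dist[i] ** 2 + eps)  # w_i = 1 / d(x_q, x_i)^2
    return int(np.argmax(votes))

x_query = X[17]  # a training sample drawn from class 1
print(classify_mahalanobis(x_query), classify_knn(x_query))
```

With the clouds this well separated, both rules recover the query's true class; on the real modal vectors the classes overlap, which is exactly what the divergence-based feature selection below tries to mitigate.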
An alternative approach is to use a probabilistic measure to check the impact of each vector component on the separability between different groups. We employ divergence as a quantitative measure for this purpose. Intuitively, the divergence is a measure of the separability of two probability distributions, and measures how well the features used can represent the statistics conveyed in the raw data. For testing a binary hypothesis, H_1 vs. H_2, in which the probability distribution of the features is P_1 under H_1 and P_2 under H_2, we define P_e as the average error probability, P_{12} as the probability of choosing H_1 when class H_2 is actually true, and vice versa for P_{21}:

    P_e = P_{21} P(H_1) + P_{12} P(H_2)

Furthermore, the divergence is related to the error exponent of P_e as follows: as the divergence increases, the error represented by P_e decreases. A closed-form expression for the divergence of multivariate Gaussian data can be derived as in [4]:

    D = \frac{1}{2} (m_1 - m_2)^T (K_1^{-1} + K_2^{-1}) (m_1 - m_2) + \frac{1}{2} tr(K_1^{-1} K_2 + K_2^{-1} K_1 - 2I)    (16)

where m_1 and m_2 are the mean vectors and K_1 and K_2 the covariance matrices corresponding to the feature vectors in classes c_1 and c_2, and tr is the trace: a function operating on a matrix argument and computing the sum of its diagonal elements.

Now we need to link the inter-group divergence, which gives a measure of the similarity between two groups, with our particular application, where several groups are present. Furthermore, we want to construct a criterion function, based on divergence, with which we can evaluate the separability contribution of each feature with respect to the groups involved in the classification process. More precisely, for any feature f \in G, G = {1..30}, corresponding to a modal vector, we compute a divergence-based global criterion value given by:

    c_f = \sum_{i=1}^{3} \sum_{j=i+1}^{4} D_{ij}^{G \setminus \{f\}}    (17)
where D_{ij}^{S} represents the inter-group divergence between groups i and j, computed over the features in the set S (S a subset of G). Selecting good discriminative features according to this criterion reduces to sorting the features by the value obtained using formula 17: the better the feature, the lower the value of its corresponding criterion (if the feature provides good discriminative ability, removing it results in a decrease in the separation ability as quantified by the divergence measure). The results of applying the criterion to each feature are plotted in figure 3.

The criterion 17 imposes a partial ordering on the feature set. When testing the performance of the different classifiers, we incrementally construct the classifiers corresponding to all the ordered feature sets (that is, the sets consisting of features {1}, {1,2}, ..., {1,2,...,30}; note that, for instance, "1" now means the first feature according to the feature ordering criterion, and not the first component of the vector). In this way, the complexity of experimenting with each particular classifier becomes linear in the number of features (in our case 30).

To provide further insight into the relevance of this selection criterion, we also perform a standard statistical analysis, computing the mean and standard deviation of each feature in each individual group (as well as over the global data set). The results are depicted in figures 1 and 2. Intuitively, divergence balances the mean inter-group separability of individual vector components against the corresponding standard deviation, and "assembles" these into a formula characterizing the separability of two groups involved in the classification.

The criterion presented in 17 performs a greedy selection (by summing over the divergences corresponding to a set of features, for all pairs of groups involved in the classification process).
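The Gaussian divergence of equation 16 can be computed directly from the class means and covariances. A minimal numpy sketch follows; the 2-D inputs here are toy values for illustration, not the paper's 30-dimensional modal data.

```python
import numpy as np

def divergence(m1, K1, m2, K2):
    """Divergence between N(m1, K1) and N(m2, K2), as in equation 16."""
    K1i, K2i = np.linalg.inv(K1), np.linalg.inv(K2)
    dm = m1 - m2
    d = len(m1)
    # 1/2 (m1-m2)^T (K1^-1 + K2^-1)(m1-m2) + 1/2 tr(K1^-1 K2 + K2^-1 K1 - 2I)
    return 0.5 * dm @ (K1i + K2i) @ dm \
         + 0.5 * np.trace(K1i @ K2 + K2i @ K1 - 2 * np.eye(d))

# When the two covariances are identical, the trace term vanishes and D
# reduces to the squared Mahalanobis distance between the two means.
m1, K1 = np.zeros(2), np.eye(2)
m2, K2 = np.array([3.0, 0.0]), np.eye(2)
print(divergence(m1, K1, m2, K2))  # -> 9.0
```

Summing this quantity over all pairs of groups, with one feature left out at a time, yields the criterion value c_f of equation 17.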
It is greedy in the sense that it might assign a good merit value to features which do not necessarily provide good separability between all the groups, but sometimes only between pairs of groups (some groups might be very well separated while others are not, and the criterion may not be able to "sense" this, as it computes only an overall sum).

Figure 1: Medians (feature means for groups A, B, C, D and globally)

Figure 2: Standard Deviations (feature standard deviations for groups A, B, C, D and globally)

5 Experiments and Results

We experiment with the classification methods presented above by running a cross-validated procedure for obtaining more accurate results. We employ a leave-one-out method for maximizing the utilization
of the data. For the case of the nearest-neighbor classification we use a leave-4-out method (keeping out one member of each class), in order to avoid bias in the experiments.

The confusion matrix corresponding to the minimum Mahalanobis distance classifier is given in table 1. The minimum classification error estimate is obtained when selecting the first 5 best features (according to the divergence criterion described), and its value is 22.25%. The error estimates corresponding to selecting the first k criterion features (k = 1..5) are plotted in figure 5.

The confusion matrix corresponding to the weighted k-nearest neighbor classifier is given in table 2. The minimum classification error estimate is obtained for 5-nearest neighbors with the first 7 best features, and its value is 32.8%. The error estimates corresponding to the different k-nearest neighbor classifiers (k = 1..10) are shown in figure 4.

The plot presented in figure 4 represents a compressed version of the runs actually performed. We build separate k-nearest neighbor classifiers for each k in {1..10} and, for each classifier, we consider different feature sets in the order of their criterion selection, i.e. we first evaluate the classifier corresponding only to the first best feature, then the classifier corresponding to the first and second best features, and so on, up to the classifier containing all 30 features in the set. Note that the sets are not created by randomly choosing features; they are ordered in the sense that we gradually add features according to the ordering criterion 17 on the set of features. Subsequently, for each k-nearest neighbor classifier, we pick (and plot) the value corresponding to the minimum error among all ordered feature sets (that is, {1}, {1,2}, {1,2,3}, ..., {1,2,3,...,30}). Consequently, the error values we plot are not necessarily obtained for the same ordered feature set, but might correspond to different ordered feature sets. This may make plot 4 nonuniform, but we felt that the important thing to analyze is the real minimum error estimate, and not the error estimate resulting from the rigid imposition of a particular subset of features (which might not provide a minimum error estimate for a classifier corresponding to a particular value of k).

Figure 3: Feature Criterion Values

            gr. G1    gr. G2    gr. G3    gr. G4
  G1        86.50%     9.50%     4.00%     0.00%
  G2         2.75%    75.25%    16.50%     5.50%
  G3         7.50%     5.00%    78.50%     9.00%
  G4         4.50%     9.50%    14.75%    71.25%

Table 1: Minimum Mahalanobis Distance Classifier Confusion Matrix

            gr. G1    gr. G2    gr. G3    gr. G4
  G1        68.75%    31.25%     0.00%     0.00%
  G2        25.00%    62.50%    12.50%     0.00%
  G3         0.00%     0.00%   100.00%     0.00%
  G4        37.50%     0.00%    25.00%    37.50%

Table 2: Nearest Neighbor Classifier Confusion Matrix

6 Discussion

We observe that the results obtained using the Bayes classifier are better in terms of classification error (22.25% versus 32.8%). The Bayes classifier also uses a smaller feature set to obtain its smallest classification error: the first 5 criterion features, versus the first 7 for the 5-nearest neighbor classifier. The ordered set of the first 7 criterion features is {2, 26, 6, 30, 6, 29, 3}. The fact that the Bayes classifier gives better results can be considered a reasonable, practically validated outcome, as it is known that the
k-nearest neighbor classifiers provide expected suboptimal classification performance [2]. Furthermore, it is only when the number of samples becomes very large (approaching infinity, according to the theory) that k-nearest neighbor classification starts exhibiting nearly optimal behavior. The voting process among the k nearest neighbors can be understood as a trade-off between reliability (choosing many neighbors in order to obtain a reliable estimate) and accuracy (choosing only those neighbors that are really very close to the point to be classified), forcing a compromise value for k, namely a small percentage of the number of examples. From the results, we can verify that this is indeed the case.

Figure 4: k-nearest neighbor error

There is a particular issue related to the very noisy state of the training examples (the descriptions of the lesions' shapes) at the present time, which negatively impacts the k-nearest neighbor classification; it is known that these classification methods are particularly sensitive to noise [3]. We also expect the minimum Mahalanobis distance classifier to be more robust to noise, as the distance it is based upon is normalized by covariance, that is, it accounts for the deviations within the groups involved in the classification process.

In terms of the confusion matrices associated with the two classifiers, we generally notice a higher probability of misclassification between "adjacent" groups, which is again expected behavior, as the classes C1 and C2 are more likely to be similar than C1 and C4. However, this is not always the case.
We notice that, for group 4, the lowest classification performance is obtained with both classifiers, and the probability of misclassification into group 1 is significant, although one may have expected these groups to be the most separated (since they represent the disease in its initial and final stages of evolution). This might be due to the particularly noisy descriptions in this group: by analyzing figure 2 one can notice that this is indeed the case, as the standard deviation for group 4 dominates those of the other groups almost everywhere in the feature domain.

Figure 5: Mahalanobis minimum distance error

Conclusions and Further Work

In this work, we have implemented and analyzed the performance of two classifiers on samples derived from four classes of pathologies encountered in a medical image database. In order to obtain better classification performance, we performed a discriminant feature analysis based on a divergence criterion. The features were ordered according to their
discriminative ability and we subsequently tested the classifiers on incrementally constructed sets (that is, sets in which features are incrementally added according to the order generated by the divergence criterion). This methodology attempts to deal with the intractable problem of generating all 2^30 possible subsets of the feature set, running the classifiers, and obtaining error estimates for every such feature set.

The classification results we obtained are quite promising, but further extensions are possible in several directions. First, operating on a larger database, translating into a larger training set, could certainly provide more accurate error estimates as well as further insight into this classification problem. Second, a more realistic feature selection method based on divergence might be used. At present, the selection is based on the order generated by summing all inter-group divergence values corresponding to a set of features, leaving one feature out each time (basically, computing a form of divergence gain for each feature). While we are certainly looking for features with high contributions to the overall divergence, this does not necessarily mean that they provide good separation between every two groups, and this might lead to poor classification performance. Using criteria that take the relative inter-group divergences into account (not only their sum) might result in better discriminant feature selection. Ultimately, devising better, less noisy descriptions or extractions of the shape vectors using computer vision techniques should certainly improve classification performance.

Acknowledgments: I would like to thank Sachin Lodha and Wen Li, who kindly accepted to review the paper.

Figure 6: A pathology in its four progressive evolution stages

Figure 7: Deformation of an ellipse into a pathology contour

References

[1] A. Blum. Empirical Support for Winnow and Weighted-Majority Algorithms: Results on a Calendar Scheduling Domain. CMU-TR, 1998.

[2] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley Interscience, 1973.

[3] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[4] W. Therrien. Decision Estimation and Classification. John Wiley and Sons, 1989.

[5] W. Zhang, S. Dickinson, S. Sclaroff, J. Feldman, S. Dunn. Shape Indexing in a Medical Image Database. Workshop on Biomedical Image Analysis, June 26-27, 1998, Santa Barbara, California.
More informationMachine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017
Machine Learning in the Wild Dealing with Messy Data Rajmonda S. Caceres SDS 293 Smith College October 30, 2017 Analytical Chain: From Data to Actions Data Collection Data Cleaning/ Preparation Analysis
More informationAPPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES
APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department
More informationEstimation of Item Response Models
Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:
More informationMarkov Random Fields and Gibbs Sampling for Image Denoising
Markov Random Fields and Gibbs Sampling for Image Denoising Chang Yue Electrical Engineering Stanford University changyue@stanfoed.edu Abstract This project applies Gibbs Sampling based on different Markov
More informationLocal qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet:
Local qualitative shape from stereo without detailed correspondence Extended Abstract Shimon Edelman Center for Biological Information Processing MIT E25-201, Cambridge MA 02139 Internet: edelman@ai.mit.edu
More informationCS 664 Segmentation. Daniel Huttenlocher
CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationInternational Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA
International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI
More informationCSE 446 Bias-Variance & Naïve Bayes
CSE 446 Bias-Variance & Naïve Bayes Administrative Homework 1 due next week on Friday Good to finish early Homework 2 is out on Monday Check the course calendar Start early (midterm is right before Homework
More informationMachine Learning / Jan 27, 2010
Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationSupport Vector Machines
Support Vector Machines Chapter 9 Chapter 9 1 / 50 1 91 Maximal margin classifier 2 92 Support vector classifiers 3 93 Support vector machines 4 94 SVMs with more than two classes 5 95 Relationshiop to
More informationABSTRACT 1. INTRODUCTION 2. METHODS
Finding Seeds for Segmentation Using Statistical Fusion Fangxu Xing *a, Andrew J. Asman b, Jerry L. Prince a,c, Bennett A. Landman b,c,d a Department of Electrical and Computer Engineering, Johns Hopkins
More informationA Hierarchical Statistical Framework for the Segmentation of Deformable Objects in Image Sequences Charles Kervrann and Fabrice Heitz IRISA / INRIA -
A hierarchical statistical framework for the segmentation of deformable objects in image sequences Charles Kervrann and Fabrice Heitz IRISA/INRIA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex,
More informationGaussian Processes for Robotics. McGill COMP 765 Oct 24 th, 2017
Gaussian Processes for Robotics McGill COMP 765 Oct 24 th, 2017 A robot must learn Modeling the environment is sometimes an end goal: Space exploration Disaster recovery Environmental monitoring Other
More informationStatistical image models
Chapter 4 Statistical image models 4. Introduction 4.. Visual worlds Figure 4. shows images that belong to different visual worlds. The first world (fig. 4..a) is the world of white noise. It is the world
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationCluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008
Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a
More informationLearning to Learn: additional notes
MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2008 Recitation October 23 Learning to Learn: additional notes Bob Berwick
More informationMetaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini
Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution
More informationAll images are degraded
Lecture 7 Image Relaxation: Restoration and Feature Extraction ch. 6 of Machine Vision by Wesley E. Snyder & Hairong Qi Spring 2018 16-725 (CMU RI) : BioE 2630 (Pitt) Dr. John Galeotti The content of these
More informationNonparametric Classification. Prof. Richard Zanibbi
Nonparametric Classification Prof. Richard Zanibbi What to do when feature distributions (likelihoods) are not normal Don t Panic! While they may be suboptimal, LDC and QDC may still be applied, even though
More informationColor-Based Classification of Natural Rock Images Using Classifier Combinations
Color-Based Classification of Natural Rock Images Using Classifier Combinations Leena Lepistö, Iivari Kunttu, and Ari Visa Tampere University of Technology, Institute of Signal Processing, P.O. Box 553,
More informationWhat is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.
What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem
More informationInstance-Based Learning: A Survey
Chapter 6 Instance-Based Learning: A Survey Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com 6.1 Introduction... 157 6.2 Instance-Based Learning Framework... 159
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More information3.1. Solution for white Gaussian noise
Low complexity M-hypotheses detection: M vectors case Mohammed Nae and Ahmed H. Tewk Dept. of Electrical Engineering University of Minnesota, Minneapolis, MN 55455 mnae,tewk@ece.umn.edu Abstract Low complexity
More informationsize, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a
Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More information2. On classification and related tasks
2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.
More informationSolution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013
Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation
More informationDigital Image Processing Laboratory: MAP Image Restoration
Purdue University: Digital Image Processing Laboratories 1 Digital Image Processing Laboratory: MAP Image Restoration October, 015 1 Introduction This laboratory explores the use of maximum a posteriori
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2015 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows K-Nearest
More informationPrototype Selection for Handwritten Connected Digits Classification
2009 0th International Conference on Document Analysis and Recognition Prototype Selection for Handwritten Connected Digits Classification Cristiano de Santana Pereira and George D. C. Cavalcanti 2 Federal
More informationConditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,
Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationMTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen
MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen Lecture 2: Feature selection Feature Selection feature selection (also called variable selection): choosing k < d important
More information3. Cluster analysis Overview
Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as
More informationImage Processing. Filtering. Slide 1
Image Processing Filtering Slide 1 Preliminary Image generation Original Noise Image restoration Result Slide 2 Preliminary Classic application: denoising However: Denoising is much more than a simple
More information9.1. K-means Clustering
424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific
More informationMachine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016
Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised
More informationDocument Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.
Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips
More informationUsing Local Trajectory Optimizers To Speed Up Global. Christopher G. Atkeson. Department of Brain and Cognitive Sciences and
Using Local Trajectory Optimizers To Speed Up Global Optimization In Dynamic Programming Christopher G. Atkeson Department of Brain and Cognitive Sciences and the Articial Intelligence Laboratory Massachusetts
More informationTHE preceding chapters were all devoted to the analysis of images and signals which
Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to
More informationDensity estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate
Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationSparse & Redundant Representations and Their Applications in Signal and Image Processing
Sparse & Redundant Representations and Their Applications in Signal and Image Processing Sparseland: An Estimation Point of View Michael Elad The Computer Science Department The Technion Israel Institute
More informationA Topography-Preserving Latent Variable Model with Learning Metrics
A Topography-Preserving Latent Variable Model with Learning Metrics Samuel Kaski and Janne Sinkkonen Helsinki University of Technology Neural Networks Research Centre P.O. Box 5400, FIN-02015 HUT, Finland
More informationAn Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst
An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst
More informationREPORTED DECISION INTEGRATION MODULE UNIFIED DECISION REFINEMENT BAYESIAN BAYESIAN BAYESIAN BAYESIAN BAYESIAN BAYESIAN
Statistical Decision Integration Using Fisher Criterion S. Shah J. K. Aggarwal Laboratory for Visual Computing Computer & Vision Res. Ctr. Wayne State University The University of Texas at Austin Dept.
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More information6.867 Machine Learning
6.867 Machine Learning Problem set - solutions Thursday, October What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove. Do not
More informationDecision Making. final results. Input. Update Utility
Active Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition Department of Computer Science and Engineering State University of New York
More informationBayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis
Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical
More information