
Computers and Mathematics with Applications 60 (2010) 511-519
doi:10.1016/j.camwa.2010.04.048

Polychotomous kernel Fisher discriminant via top-down induction of binary tree

Zhao Lu (a,*), Lily Rui Liang (b), Gangbing Song (c), Shufang Wang (d)

a Department of Electrical Engineering, Tuskegee University, Tuskegee, AL 36088, USA
b Department of Computer Science and Information Technology, University of the District of Columbia, Washington, DC, USA
c Department of Mechanical Engineering, University of Houston, Houston, TX, USA
d Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
* Corresponding author. E-mail address: zlu@ieee.org (Z. Lu).

Article history: Received 17 September 2009; received in revised form 29 April 2010; accepted 29 April 2010.

Keywords: Kernel Fisher discriminant; Binary tree; Kernel-induced distance; Kernelized group clustering; Posterior probability

Abstract

In spite of the popularity of Fisher discriminant analysis in the realm of feature extraction and pattern classification, it is beyond its capability to extract nonlinear structure from the data; that is where the kernel Fisher discriminant enters the scenario of supervised learning. In this article, an innovative and effective algorithm is developed for the polychotomous kernel Fisher discriminant with the capability of estimating posterior probabilities, which is exceedingly necessary and significant in solving complex nonlinear pattern recognition problems arising from the real world. Different from the conventional divide-and-combine approaches to polychotomous classification problems, such as pairwise and one-versus-others, the method proposed herein synthesizes the multi-category classifier via top-down induction of a binary tree by means of a kernelized group clustering algorithm. The deficiencies inherent in the conventional multi-category kernel Fisher discriminant are surmounted, and simulation on a benchmark image dataset demonstrates the superiority of the proposed approach. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

As an important technique for feature extraction and pattern classification, Fisher discriminant analysis (FDA) has been widely used to build dichotomic linear classifiers that are able to discriminate between two classes. By taking the label information of the data into account, the idea of FDA is to seek a linear transformation that maximizes the between-class scatter while minimizing the within-class scatter, in order to separate one class from the other. However, because the features extracted by FDA are limited to linear combinations of the input features, FDA cannot capture more complex nonlinear correlations and therefore cannot solve many modern learning problems appropriately.

On the other hand, as a turning point in the history of machine learning methods, the development of kernel-based learning systems in the mid-1990s brought a new level of generalization performance, theoretical rigor, and computational efficiency [1]. The fundamental step of the kernel approach is to embed the data into a Euclidean feature space in which the patterns can be discovered as linear relations. This capitalizes on the fact that, over the past 50 years, statisticians and computer scientists have become very good at detecting linear relations within sets of vectors.
This step therefore reduces many complex problems to a class of well-understood problems. As a general framework for representing data, the kernel method can be used whenever the interactions between the elements of the domain occur only through inner products [1,2].
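As a brief illustration of this point (our own toy example, not taken from the paper), the homogeneous quadratic kernel evaluates the inner product of an explicit degree-2 feature map without ever constructing that map:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """Homogeneous quadratic kernel: equals <phi(x), phi(y)> while staying in the input space."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))   # 16.0, computed through the explicit feature map
print(k(x, y))                  # 16.0, computed through the kernel alone
```

Any algorithm that touches the data only through such inner products can therefore operate implicitly in the feature space, which is the property exploited throughout this paper.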

In an attempt to generalize FDA into a technique capable of extracting nonlinear features and building nonlinear classifiers, the kernel Fisher discriminant (KFD) was developed [1,3], in which the kernel trick is brought into Fisher discriminant analysis so that complicated nonlinear relations in the input data can be represented efficiently. Analogous to the rationale behind the support vector machine (SVM) [1,4,5] and kernel principal component analysis (KPCA) [1,6], the original inputs are first mapped nonlinearly into a high-dimensional feature space induced by a kernel function, and the linear FDA is then carried out there, i.e., one finds a direction w in the feature space that separates the class means of the mapped data well (when projected onto that direction) while achieving a small variance around these means. The quantity measuring the difference between the means is called the between-class variance, and the quantity measuring the spread around the class means is called the within-class variance. Hence, the objective of KFD is to find a direction that maximizes the between-class variance while simultaneously minimizing the within-class variance of the mapped data in the feature space, thereby yielding a nonlinear discriminant in the input space.

A crucial advantage of the Fisher discriminant algorithm over standard support vector learning is that its outputs can easily be transformed into posterior probabilities; in other words, the output values indicate not only whether a given test pattern belongs to a certain class, but also the probability of this event [1]. Owing to the empirical observation that, in the high-dimensional feature space, the histogram of each class of training examples projected onto the discriminant can be closely approximated by a Gaussian, the posterior probabilities can be found via Bayes' rule by estimating two one-dimensional Gaussian class-conditional probability densities for the projections of the training points onto the direction of discrimination. In practice, being able to estimate the posterior probabilities can be very useful, for instance, in applications where the output of a classifier needs to be merged with further sources of information [1].
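The following minimal Python sketch (our own illustration, not code from the paper; the toy projection values are made up) makes this Gaussian-plus-Bayes construction concrete: one one-dimensional Gaussian is fitted per class to the projected training outputs, and Bayes' rule converts the densities into a posterior for a new projection.

```python
import numpy as np
from scipy.stats import norm

def fit_class_conditionals(proj, labels):
    """Fit a 1-D Gaussian to the projections of each class and record the class priors."""
    params = {}
    for c in np.unique(labels):
        z = proj[labels == c]
        params[c] = (z.mean(), z.std(ddof=1), len(z) / len(proj))
    return params

def posterior(z, params):
    """Bayes' rule on the scalar projection z: p(c | z) is proportional to p(z | c) p(c)."""
    scores = {c: norm.pdf(z, mu, sd) * prior for c, (mu, sd, prior) in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# toy usage with made-up projections of two classes
proj = np.array([-2.1, -1.8, -2.4, 1.9, 2.2, 2.0])
labels = np.array([0, 0, 0, 1, 1, 1])
print(posterior(0.5, fit_class_conditionals(proj, labels)))
```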
However, most classification problems encountered in the real world involve multiple categories, i.e., they are polychotomous classification problems, such as handwritten character recognition, face detection, and so on. Although the Fisher discriminant has been naturally generalized to the n-class feature extraction and dimension reduction problem with n > 2 by projecting the data onto an (n - 1)-dimensional space [7], this direct method is restricted to the case where the dimensionality of the input space is greater than the number of classes. In particular, as pointed out in Ref. [8], for the multi-category problem the Fisher criterion is in general not optimal with respect to minimizing the classification error rate in the lower-dimensional space, because the corresponding eigenvalue decomposition is dominated by outlier classes, which over-weights the influence of classes that are already well separated.

In this article, aimed at removing the limitation which precludes the application of the Fisher discriminant to polychotomous classification problems, an innovative binary-tree based KFD algorithm is developed, in which the n-class polychotomous problem is decomposed into n - 1 dichotomy problems by invoking the kernelized group clustering method developed in [9]. To validate the effectiveness of the developed classification algorithm for nonlinear multi-category problems, a simulation study on the benchmark satellite image dataset is conducted, which demonstrates the superiority of the proposed approach over other prevalent pattern classification methods.

The rest of this paper is organized as follows. In the next section, a brief review of the kernel Fisher discriminant is given for the completeness of this article. Following that, kernel-induced distances between datasets are investigated in Section 3. In Section 4, the algorithm for designing the binary tree via kernelized group clustering is developed. The simulation study on satellite image data is presented in Section 5, with concluding remarks in Section 6.

The following generic notation is used throughout this paper: lower-case symbols such as x, y, α refer to scalar-valued objects, lower-case boldface symbols such as x, y, β refer to vector-valued objects, and capital boldface symbols such as N, K, M are used for matrices.

2. Kernel Fisher discriminant for nonlinear feature extraction

Fisher discriminant analysis (FDA) aims at finding a linear projection such that the classes are well separated, where the separability is measured by the between-class scatter and the within-class scatter. Given a set of m-dimensional input vectors x_j, j = 1, ..., l, with l_1 of them in the subset D_1 labeled ω_1 and l_2 of them in the subset D_2 labeled ω_2, projecting x_j onto the one-dimensional subspace in the direction of w yields a corresponding set of samples y_j = w^T x_j, j = 1, ..., l, divided into the subsets Υ_1 and Υ_2. Define m_1 and m_2 to be the empirical class means, i.e.

m_i = \frac{1}{l_i} \sum_{x_j \in D_i} x_j.    (1)

Similarly, the means of the data projected onto the direction of w can be computed as

\mu_i = \frac{1}{l_i} \sum_{y_j \in \Upsilon_i} y_j = \frac{1}{l_i} \sum_{x_j \in D_i} w^T x_j = w^T m_i,    (2)

i.e. the means μ_i of the projections are the projected means m_i. The variances σ_1, σ_2 of the projected data can be expressed as

\sigma_i = \sum_{x_j \in D_i} (w^T x_j - \mu_i)^2.    (3)

Then maximizing the between-class variance while minimizing the within-class variance can be achieved by maximizing

J(w) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1 + \sigma_2},    (4)

which yields a direction w for which the ratio of between-class variance to within-class variance is maximal. Substituting Eq. (2) for the means and Eq. (3) for the variances into (4) yields

J(w) = \frac{w^T S_B w}{w^T S_W w},    (5)

where the between-class scatter matrix S_B and the within-class scatter matrix S_W are defined as

S_B = (m_2 - m_1)(m_2 - m_1)^T,    (6)

S_W = \sum_{i=1}^{2} \sum_{x_j \in D_i} (x_j - m_i)(x_j - m_i)^T.    (7)

The quantity in Eq. (5) is often referred to as the Rayleigh coefficient or generalized Rayleigh quotient. It is well known that the w maximizing the Rayleigh quotient is the leading eigenvector of the generalized eigenproblem

S_B w = \lambda S_W w.    (8)

As in KPCA, in order to generalize Fisher discriminant analysis to nonlinear feature extraction, the original input vectors x_j are mapped into a high-dimensional feature space by a nonlinear mapping φ, and the problem is reformulated in the feature space, where the inner product φ^T(x_r)φ(x_j) defines the kernel function, i.e., k(x_r, x_j) = φ^T(x_r)φ(x_j). Firstly, we rewrite S_W as

S_W = \sum_{i=1}^{2} \sum_{x_j \in D_i} (x_j x_j^T - m_i m_i^T)    (9)

and make the key postulation that

w = \sum_{j=1}^{l} \beta_j \phi(x_j).    (10)

The numerator of the generalized Rayleigh quotient (5), with m_i now denoting the class means of the mapped data φ(x_j), can be rewritten as

w^T S_B w = (w^T m_2 - w^T m_1)^2,    (11)

where

w^T m_i = \sum_{r=1}^{l} \beta_r \phi^T(x_r) \frac{1}{l_i} \sum_{x_j \in D_i} \phi(x_j) = \frac{1}{l_i} \sum_{r=1}^{l} \sum_{x_j \in D_i} \beta_r k(x_r, x_j).    (12)

By defining the rth component of the l-dimensional column vector ϑ_i as (\vartheta_i)_r = \frac{1}{l_i} \sum_{x_j \in D_i} k(x_r, x_j), it follows from Eq. (12) that

w^T m_i = \beta^T \vartheta_i,    (13)

where β = [β_1, β_2, ..., β_l]^T; substituting (13) into (11) then yields

w^T S_B w = \beta^T (\vartheta_2 - \vartheta_1)(\vartheta_2 - \vartheta_1)^T \beta = \beta^T N \beta,    (14)

where N = (\vartheta_2 - \vartheta_1)(\vartheta_2 - \vartheta_1)^T.

On the other hand, it follows from

w^T \phi(x_j) = \sum_{r=1}^{l} \beta_r \phi^T(x_r)\phi(x_j) = \sum_{r=1}^{l} \beta_r k(x_r, x_j)    (15)

and Eq. (13) that

w^T S_W w = \sum_{i=1}^{2} w^T \Big[ \sum_{x_j \in D_i} \phi(x_j)\phi^T(x_j) - l_i m_i m_i^T \Big] w
          = \sum_{i=1}^{2} \Big[ \sum_{x_j \in D_i} w^T \phi(x_j)\phi^T(x_j) w - l_i w^T m_i m_i^T w \Big]
          = \sum_{i=1}^{2} \Big[ \sum_{x_j \in D_i} \Big( \sum_{r=1}^{l} \beta_r k(x_r, x_j) \Big) \Big( \sum_{s=1}^{l} \beta_s k(x_s, x_j) \Big) - l_i \beta^T \vartheta_i \vartheta_i^T \beta \Big]
          = \sum_{i=1}^{2} \beta^T \big[ K_i K_i^T - l_i \vartheta_i \vartheta_i^T \big] \beta
          = \beta^T M \beta,    (16)

where (K_i)_{rj} = k(x_r, x_j) with x_j ∈ D_i, and M = \sum_{i=1}^{2} [K_i K_i^T - l_i \vartheta_i \vartheta_i^T]. Substituting Eqs. (14) and (16) back into the generalized Rayleigh quotient (5) yields

J(\beta) = \frac{\beta^T N \beta}{\beta^T M \beta}.    (17)

Hence, maximizing the Rayleigh coefficient (5) with respect to w in the nonlinear feature space is equivalent to maximizing J(β) in Eq. (17) with respect to β, and the projections of the mapped data onto the direction w in the feature space can be calculated by Eq. (15).
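To make the construction concrete, the following Python sketch (our own illustration, not the authors' implementation; the Gaussian kernel width, the small ridge term added to M for numerical stability, and the toy data are all assumptions) forms the vectors ϑ_i and the matrix M of Eqs. (13)-(16), exploits the rank-one structure of N in Eq. (14) so that the maximizer of Eq. (17) is β proportional to M^{-1}(ϑ_2 - ϑ_1), and projects new points with Eq. (15):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfd_fit(X1, X2, sigma, reg=1e-3):
    """Two-class KFD: return the expansion coefficients beta and the training inputs."""
    X = np.vstack([X1, X2])
    l1, l2 = len(X1), len(X2)
    K = rbf_kernel(X, X, sigma)                     # K[r, j] = k(x_r, x_j)
    K1, K2 = K[:, :l1], K[:, l1:]                   # columns restricted to each class
    th1, th2 = K1.mean(axis=1), K2.mean(axis=1)     # vectors theta_i of Eq. (13)
    M = (K1 @ K1.T - l1 * np.outer(th1, th1)
         + K2 @ K2.T - l2 * np.outer(th2, th2))     # within-class matrix of Eq. (16)
    # For the rank-one N of Eq. (14), the maximizer of Eq. (17) is beta ~ M^{-1}(th2 - th1);
    # the small ridge term reg * I is our own addition for numerical stability.
    beta = np.linalg.solve(M + reg * np.eye(len(X)), th2 - th1)
    return beta, X

def kfd_project(Xnew, beta, X, sigma):
    """Eq. (15): projection of new points onto the discriminant direction."""
    return rbf_kernel(Xnew, X, sigma) @ beta

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (30, 2))
X2 = rng.normal(3.0, 1.0, (30, 2))
beta, X = kfd_fit(X1, X2, sigma=1.0)
print(kfd_project(np.array([[0.0, 0.0], [3.0, 3.0]]), beta, X, sigma=1.0))
```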

3. Distance function for measuring the dissimilarity between datasets

In clustering algorithms, the definition of the distance measure plays a crucial role and has a great impact on the clustering performance. Hence, in order to develop the kernelized group clustering algorithm, the distance measure characterizing the dissimilarity between classes needs to be defined beforehand. As an abstract concept, a metric d is a distance measure satisfying the following conditions:

reflectivity, i.e. d(x, x) = 0;
positivity, i.e. d(x, y) > 0 if x is distinct from y;
symmetry, i.e. d(x, y) = d(y, x);
triangle inequality, i.e. d(x, y) <= d(x, z) + d(z, y) for every z.

Basically, reflectivity and positivity are fundamental to defining an appropriate dissimilarity measure [10]. The function d is a distance function if it satisfies reflectivity, positivity and symmetry.

Given that the KFD extracts the nonlinear structure from the data implicitly in the feature space, it is necessary and natural to define the measure of dissimilarity in the feature space. For points x and y in the input space, the Euclidean distance between them in the feature space is defined as

d(x, y) = \| \Phi(x) - \Phi(y) \|,    (18)

where \| \cdot \| is the Euclidean norm and Φ is the implicit nonlinear map from the data space to the feature space. As we have seen in Section 2, by using the kernel k, all computations can be carried out implicitly in the feature space that Φ maps into, which can have a very high (possibly infinite) dimensionality. Several commonly used kernel functions in the literature are:

Gaussian radial basis function (GRBF) kernel:
k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right).    (19)

Polynomial kernel:
k(x, y) = (1 + \langle x, y \rangle)^q.    (20)

Sigmoid kernel:
k(x, y) = \tanh(\alpha \langle x, y \rangle + \beta).    (21)

Inverse multi-quadric kernel:
k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}},    (22)

where σ, q, α, β, c are the adjustable parameters of the above kernel functions. The GRBF kernel and the inverse multi-quadric kernel belong to the class of translation-invariant kernels, while the polynomial kernel and the sigmoid kernel are examples of rotation-invariant kernels.

The kernel function provides an elegant way of working in the feature space while avoiding the troubles and difficulties inherent in high dimensions, and this method is applicable whenever an algorithm can be cast in terms of dot products. In light of this, the distance (18) can be expressed in terms of the entries of the kernel [10,11]:

\| \Phi(x) - \Phi(y) \|^2 = (\Phi(x) - \Phi(y))^T (\Phi(x) - \Phi(y)) = \Phi(x)^T \Phi(x) - \Phi(y)^T \Phi(x) - \Phi(x)^T \Phi(y) + \Phi(y)^T \Phi(y) = k(x, x) + k(y, y) - 2k(x, y).    (23)

Consequently, the distance (18) can be computed without explicitly using or even knowing the nonlinear mapping Φ, and it can be regarded as a kernel-induced distance in the input space. Below we confine ourselves to the Gaussian RBF kernel, so that k(x, x) = 1. Thus, we arrive at

\| \Phi(x) - \Phi(y) \|^2 = 2 - 2k(x, y).    (24)

Further, for the sake of measuring the dissimilarity between classes in the feature space, a kernel-induced distance between datasets in the input space needs to be defined. The best-known metric between subsets of a metric space is the Hausdorff metric, defined as the maximum distance between any point in one set and the point that is closest to it in the other set. That is, for point sets A = {a_i | i = 1, 2, ..., p} and B = {b_j | j = 1, 2, ..., q},

d_h(A, B) = \max \Big\{ \max_{a_i \in A} \min_{b_j \in B} \|a_i - b_j\|, \ \max_{b_j \in B} \min_{a_i \in A} \|a_i - b_j\| \Big\}.    (25)

This metric is trivially computable in polynomial time, and it has some quite appealing properties. However, it may be problematic to employ the Hausdorff metric in some classification applications, because the Hausdorff distance does not take the overall structure of the point sets into account. To overcome this drawback, we adopt the sum of minimum distances function d_md [12]:

d_{md}(A, B) = \frac{1}{2} \Big( \sum_{a_i \in A} \min_{b_j \in B} \|a_i - b_j\| + \sum_{b_j \in B} \min_{a_i \in A} \|a_i - b_j\| \Big).    (26)

For measuring the degree of dissimilarity between two datasets in the feature space, we consider the kernel-induced sum of minimum distances function \tilde{d}_{md}:

\tilde{d}_{md}(A, B) = \frac{1}{2} \Big( \sum_{a_i \in A} \min_{b_j \in B} \|\Phi(a_i) - \Phi(b_j)\| + \sum_{b_j \in B} \min_{a_i \in A} \|\Phi(a_i) - \Phi(b_j)\| \Big).    (27)

Obviously, by the formulation of (23), the distance (27) can also be expressed solely in terms of the entries of the kernel. If the Gaussian RBF kernel is chosen as the kernel function, then by using Eq. (24), \tilde{d}_{md} can be recast as

\tilde{d}_{md}(A, B) = \frac{1}{2} \Big( \sum_{a_i \in A} \min_{b_j \in B} \sqrt{2 - 2k(a_i, b_j)} + \sum_{b_j \in B} \min_{a_i \in A} \sqrt{2 - 2k(a_i, b_j)} \Big),    (28)

which implies that it can be calculated without knowing the nonlinear mapping Φ.
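A minimal Python sketch of Eq. (28) (our own illustration; the kernel width and the toy point sets are assumptions) computes the kernel-induced sum of minimum distances between two datasets directly from kernel evaluations:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), Eq. (19)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_sum_min_distance(A, B, sigma):
    """Kernel-induced sum of minimum distances between datasets A and B, Eq. (28)."""
    # pairwise feature-space distances via Eq. (24): ||Phi(a) - Phi(b)||^2 = 2 - 2 k(a, b)
    D = np.sqrt(np.maximum(2.0 - 2.0 * rbf_kernel(A, B, sigma), 0.0))
    return 0.5 * (D.min(axis=1).sum() + D.min(axis=0).sum())

# toy usage: two small point sets in the plane
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])
print(kernel_sum_min_distance(A, B, sigma=1.0))
```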
4. Top-down induction of binary tree via kernelized group clustering

In contrast to the direct method discussed in Section 1, the divide-and-combine methodology decomposes the multi-category problem into several subproblems that can be solved by binary classifiers. Two widely used divide-and-combine methods are pairwise and one-versus-others. In the pairwise approach, an n-class problem is converted into n(n - 1)/2 dichotomic problems covering all pairs of classes; binary classifiers are then trained for each pair, and the classification decision for a test pattern is made by aggregating the output magnitudes.

Apparently, in pairwise methods the number of binary classifiers to be built increases rapidly with the number of classes, which easily leads to an expensive computational effort. This problem is alleviated in the one-versus-others method, where only n binary classifiers are needed for an n-class problem and each of them is trained to separate one class of samples from all others.

However, all training data have to be involved in constructing each binary classifier, and the one-versus-others method is not capable of yielding the optimal decision boundaries. In particular, both methods can result in the existence of unclassified regions.

In our approach to n-class polychotomous problems, the topology of a binary tree is leveraged to facilitate the implementation of the polychotomous KFD via n - 1 binary classifiers, which differs from the heuristics mentioned above. The synthesis of the classifier starts from the root node, where all classes are first divided into two groups of classes belonging to the left node and the right node, respectively. Successively, from the top down, at every non-leaf node the multiple classes in one group are further partitioned into two groups for its child nodes, where each group may again consist of multiple classes. By treating all classes in one group as a single class, a binary KFD classifier can be trained at each non-leaf node. This procedure is iterated until every leaf node contains only one individual class, so the number of leaf nodes equals the number of classes. The procedure of constructing a polychotomous KFD classifier via top-down induction of a binary tree is visualized in Fig. 1, where a binary tree is induced for a classification problem with 11 categories.

Fig. 1. Top-down induction of binary tree.

Generally, there exist many possibilities for splitting the multiple classes into two partitions at each non-leaf node. Hence, how to partition the multiple classes at each non-leaf node, which directly determines the training datasets used for constructing the binary KFD classifiers, is critical to the overall classification performance of the algorithm. Given the hierarchical architecture of the binary tree, it is obvious that if the classification performance degrades at an upper node of the binary tree, the overall classification performance deteriorates as a consequence. Therefore, the more separable classes should be partitioned at the upper nodes of the binary tree, i.e., the separability should be maximized while partitioning the multiple classes into two groups from the top down. Although the separability between two classes can be quantified by the distance function defined in Section 3 for measuring the dissimilarity between datasets, finding the two groups of classes with maximal separability is quite challenging. In our approach, taking into account the fact that the KFD calculates the decision function implicitly in the feature space by the kernel trick, this intractable problem is tackled by developing a kernelized group clustering algorithm for multiple classes.

Firstly, choose the kernel functions to be used for the dichotomic KFD and for computing the kernel-induced distance between classes at each node. This step also provides the opportunity to use different kernels in different nodes, which enhances the flexibility of the multi-category classifier built. Then, starting from the root node and successively for every non-leaf node, the kernelized group clustering algorithm is used to partition the classes of that node: compute the kernel-induced sum of minimum distances \tilde{d}_{md} between all pairs of classes in the non-leaf node, and partition the pair of classes between which the distance is maximal into the left node and the right node as the prototype classes of the child nodes, respectively.
Subsequently, assign each remaining class in the non-leaf node to the child node whose prototype class is closest to it in the sense of the kernel-induced distance function (27). This procedure is executed for every non-leaf node from the top down until the leaf nodes, which contain only a prototype class, are reached. Thus, the overall structure of the binary tree is determined, which also provides insight into the data structure of the classes.

Based on the structure of the binary tree, at each non-leaf node the samples from the datasets in its left child node and its right child node can be relabeled as +1 and -1, respectively. Then, the polychotomous KFD classifier is obtained by training a binary KFD at every non-leaf node, which implements a decision rule separating the samples belonging to the datasets in its left node from those belonging to the datasets in its right node. In our approach, the number of binary classifiers needed for an n-class polychotomous problem is n - 1, which is less than in the pairwise and one-versus-others methods. Also, as learning proceeds from the top down, the number of data involved in the training processes decreases rapidly.
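The partitioning procedure described above can be sketched in a few lines of Python (our own illustration, not the authors' implementation; the distance helper re-states Eq. (28) from the Section 3 sketch, and the choice of data structures is an assumption):

```python
import numpy as np
from itertools import combinations

def kernel_sum_min_distance(A, B, sigma):
    """Kernel-induced sum of minimum distances, Eq. (28) (same helper as in Section 3)."""
    K = np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2.0 * sigma ** 2))
    D = np.sqrt(np.maximum(2.0 - 2.0 * K, 0.0))
    return 0.5 * (D.min(axis=1).sum() + D.min(axis=0).sum())

def induce_tree(class_data, sigma):
    """Top-down induction of the binary tree by kernelized group clustering.

    class_data maps a class label to the array of samples of that class. The return value
    is a nested tuple (left_subtree, right_subtree), with a bare class label at each leaf.
    """
    labels = list(class_data)
    if len(labels) == 1:
        return labels[0]
    # 1. the pair of classes with maximal kernel-induced distance become the prototypes
    dist = {(a, b): kernel_sum_min_distance(class_data[a], class_data[b], sigma)
            for a, b in combinations(labels, 2)}
    left_proto, right_proto = max(dist, key=dist.get)
    left = {left_proto: class_data[left_proto]}
    right = {right_proto: class_data[right_proto]}
    # 2. every remaining class joins the child whose prototype class is closest to it
    for c in labels:
        if c in (left_proto, right_proto):
            continue
        d_left = kernel_sum_min_distance(class_data[c], class_data[left_proto], sigma)
        d_right = kernel_sum_min_distance(class_data[c], class_data[right_proto], sigma)
        (left if d_left <= d_right else right)[c] = class_data[c]
    # 3. recurse until every leaf contains a single class
    return induce_tree(left, sigma), induce_tree(right, sigma)
```

A binary KFD (for example the kfd_fit sketch in Section 2, with the left group relabeled +1 and the right group -1) would then be trained at every internal node of the returned tree.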

For classifying an unlabeled pattern, the evaluation starts from the root node of the binary tree, and the dichotomic classifiers trained at the non-leaf nodes determine which child node the input pattern is assigned to. This procedure is iterated until the unlabeled pattern is finally classified into the class associated with one of the leaf nodes; thereby a trace from the root to one of the leaf nodes is determined for classifying each unlabeled pattern. Contrary to the conventional divide-and-combine methods, where all the dichotomic decision functions need to be evaluated for an unlabeled pattern, only the dichotomic decision functions on the determined trace need to be evaluated in the proposed method.

In particular, on the strength of the dichotomous KFD in producing posterior probabilities [1], the proposed binary-tree based polychotomous KFD can readily be extended to generate posterior probabilistic outputs in the multi-category case. For the trace along which an unlabeled pattern is classified from the root to one of the leaf nodes, each dichotomous KFD associated with a non-leaf node on the trace is capable of producing the probability of assigning the unlabeled pattern to the child node on the trace. Given that the trace is determined by a series of dichotomous KFDs, the product of the probabilities produced by each dichotomous KFD on the trace gives the probability of classifying the unlabeled pattern into one of the multiple classes, i.e., the posterior probability. These attractive features of our approach are of great significance in enhancing its flexibility and improving its computational efficiency.
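A compact sketch of this evaluation rule follows (our own illustration; the node representation and the toy probabilities are assumptions, and the per-node probability function would come from the Gaussian-plus-Bayes construction recalled in the Introduction):

```python
def classify_with_posterior(x, node):
    """Walk the binary tree from the root and multiply the per-node probabilities on the trace.

    Each internal node is a tuple (p_left_fn, left, right), where p_left_fn(x) is the
    probability, produced by that node's dichotomous KFD, of sending x to the left child;
    every leaf is simply a class label.
    """
    prob = 1.0
    while isinstance(node, tuple):                 # descend until a leaf label is reached
        p_left_fn, left, right = node
        p = p_left_fn(x)
        if p >= 0.5:
            node, prob = left, prob * p
        else:
            node, prob = right, prob * (1.0 - p)
    return node, prob                              # predicted class and its posterior probability

# toy usage with hard-coded node probabilities
tree = (lambda x: 0.9, 'A', (lambda x: 0.3, 'B', 'C'))
print(classify_with_posterior(None, tree))         # ('A', 0.9)
```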
5. Landsat satellite image data classification

In this section, the proposed algorithm is applied to the classification of satellite image data, a benchmark problem that has been intensively studied using many popular pattern recognition methods [13]. The algorithm was implemented using the Statistical Pattern Recognition Toolbox [14]. For the sake of comparison, we use the same training and validation datasets as those used in Ref. [13].

The satellite image database was generated by taking a small section from the original Landsat Multi-Spectral Scanner (MSS) image data of a part of Western Australia. The interpretation of a scene by integrating spatial data of diverse types and resolutions, including multi-spectral and radar data and maps indicating topography, land use, etc., is expected to assume significant importance with the onset of an era characterized by integrative approaches to remote sensing. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to the green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80 m x 80 m. Each image contains 2340 x 3380 such pixels. The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3 x 3 square neighborhood of pixels completely contained within the 82 x 100 sub-area, and contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3 x 3 neighborhood, together with a number indicating the classification label of the central pixel.

Hence, each sample is described by 36 attributes, which are numerical in the range 0 to 255; that is, the input space has 36 dimensions. In total, 4435 samples are included in the training dataset and 2000 samples in the testing dataset. There are six categories of soil conditions to be classified, and their distributions in the training and testing datasets are listed in Table 1.

Table 1
Distribution of training samples in the Landsat satellite image dataset.

N  Description                    Train            Test
1  Red soil                       1072 (24.17%)    461 (23.05%)
2  Cotton crop                     479 (10.80%)    224 (11.20%)
3  Grey soil                       961 (21.67%)    397 (19.85%)
4  Damp grey soil                  415 (09.36%)    211 (10.55%)
5  Soil with vegetation stubble    470 (10.60%)    237 (11.85%)
6  Very damp grey soil            1038 (23.40%)    470 (23.50%)

To synthesize the polychotomous KFD classifier, the first step is to induce the topology of the binary tree in order to convert the multi-category problem into several binary classification problems. To calculate the kernel-induced distance between datasets in the kernelized group clustering algorithm, the Gaussian radial basis function kernel with parameter σ = 28 was chosen as the kernel function. For the satellite image database, the topological structure of the binary tree obtained via top-down induction is visualized in Fig. 2.

Fig. 2. Binary tree constructed by using the Landsat satellite image datasets.

Once the structure of the binary tree is determined, the second step is to train a binary classifier at every non-leaf node by using the KFD algorithm described in Section 2. In our experiment, after the nonlinear projections of the data onto the optimal direction for each dichotomic problem were calculated by using the KFD, the optimal threshold on the one-dimensional extracted features was estimated by using a soft-margin linear support vector machine. The regularization constant for the linear SVM was set to C = 1.

To confirm the generalization capability of the proposed polychotomous KFD algorithm, the testing error was calculated on the testing dataset and then compared with those obtained from other popular classification strategies [13], such as logistic regression, RBF neural networks, K-nearest-neighbor and the multi-category SVM direct method [15]. Details about the parameter settings and algorithmic implementation can be found in Refs. [13,15].
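The thresholding step described above could look as follows in Python (our own sketch, assuming scikit-learn and made-up projection values; the paper used the Statistical Pattern Recognition Toolbox, so this is only illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def fit_threshold(train_proj, train_labels):
    """Fit a soft-margin linear SVM (C = 1) on the one-dimensional KFD projections.

    train_proj: 1-D array of projections w^T phi(x) for the node's training samples.
    train_labels: +1 / -1 labels of the left and right child groups.
    Returns the decision threshold on the projection axis.
    """
    svm = SVC(kernel='linear', C=1.0)
    svm.fit(train_proj.reshape(-1, 1), train_labels)
    # for a 1-D linear SVM the decision boundary w*z + b = 0 is the single threshold -b/w
    return -svm.intercept_[0] / svm.coef_[0, 0]

# toy usage with made-up projections
proj = np.array([-2.0, -1.5, -1.0, 1.2, 1.8, 2.5])
labels = np.array([-1, -1, -1, 1, 1, 1])
print(fit_threshold(proj, labels))
```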

Table 2
Comparison of the testing error rates of different pattern classification algorithms on the Landsat satellite image testing dataset.

Pattern classification algorithm    Testing error rate (%)
Logistic discrimination             16.9
Quadratic discrimination            15.3
RBF neural networks                 12.1
K-nearest-neighbor                   9.4
Multi-class SVM direct method        9.15
Polychotomous KFD (proposed)         8.9

From the testing error rates listed in Table 2, it can be seen that the polychotomous KFD method proposed in this paper outperforms the other state-of-the-art pattern classification strategies in terms of generalization capability and classification accuracy. In particular, the binary-tree based polychotomous KFD offers a natural framework for calculating the conditional probabilities of the classes, which can be inferred from the product of the conditional probabilities of each dichotomic classifier along the path from the root to the leaf nodes.

6. Conclusion and future works

Focusing on the issue of how to extend the dichotomous KFD to solve the multi-category classification problem effectively, this article capitalizes on the topology of the binary tree and develops a sophisticated distance function to convert the n-class problem into n - 1 dichotomy problems. Besides its excellent generalization capability, the proposed polychotomous KFD algorithm has the advantage of converting the output of the Fisher discriminant algorithm into posterior probabilities. Future research may concentrate on developing innovative multi-category classification algorithms in the presence of noisy class labels by using the probabilistic outputs [16], and on their applications in angular-diversity radar target recognition [17].

References

[1] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[2] C. Campbell, Kernel methods: a survey of current techniques, Neurocomputing 48 (2002) 63-84.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, in: Proc. IEEE Int'l Workshop on Neural Networks for Signal Processing IX, August 1999, pp. 41-48.
[4] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[5] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, MIT Press, 2001.
[6] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (5) (1998) 1299-1319.
[7] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[8] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criteria, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (7) (2001) 762-766.
[9] Z. Lu, F. Lin, H. Ying, Design of decision tree via kernelized hierarchical clustering for multiclass support vector machines, Cybernetics and Systems 38 (2) (2007) 187-202.
[10] E. Pekalska, P. Paclik, R.P.W. Duin, A generalized kernel approach to dissimilarity-based classification, Journal of Machine Learning Research 2 (2001) 175-211.
[11] B. Schölkopf, The kernel trick for distances, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2001, pp. 301-307.
[12] T. Eiter, H. Mannila, Distance measures for point sets and their computation, Acta Informatica 34 (1997) 109-133.

[13] R. King, C. Feng, A. Sutherland, Statlog: comparison of classification algorithms on large real-world problems, Applied Artificial Intelligence 9 (1995) 289-333.
[14] V. Hlavac, V. Franc, Statistical Pattern Recognition Toolbox for MATLAB, Center for Machine Perception, Czech Technical University, Prague, Czech Republic.
[15] J. Weston, C. Watkins, Support vector machines for multi-class pattern recognition, in: Proceedings of the 7th European Symposium on Artificial Neural Networks, Bruges, Belgium, 1999.
[16] N. Lawrence, B. Schölkopf, Estimating a kernel Fisher discriminant in the presence of label noise, in: Proceedings of the 18th International Conference on Machine Learning, San Francisco, 2001.
[17] K.C. Lee, J.S. Ou, Radar target recognition by using linear discriminant algorithm on angular-diversity RCS, Journal of Electromagnetic Waves and Applications 21 (14) (2007) 2033-2048.