Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013

Maximal Margin Learning Vector Quantisation

Trung Le, Dat Tran, Van Nguyen, and Wanli Ma

Trung Le and Van Nguyen are with the Faculty of Information Technology, HCMC University of Pedagogy, Ho Chi Minh City, Vietnam (email: {trunglm, vannk}@hcmup.edu.vn). Dat Tran and Wanli Ma are with the Faculty of Education, Science, Technology and Mathematics, University of Canberra, Australia (email: {dat.tran, wanli.ma}@canberra.edu.au).

Abstract—Kernel Generalised Learning Vector Quantisation (KGLVQ) extends Generalised Learning Vector Quantisation into the kernel feature space to deal with complex class boundaries, and has yielded promising performance for complex classification tasks in pattern recognition. However, KGLVQ does not follow the maximal margin principle, which is crucial for kernel-based learning methods. In this paper we propose a maximal margin approach (MLVQ) to the KGLVQ algorithm. MLVQ inherits the merits of KGLVQ and also follows the maximal margin principle to improve the generalisation capability. Experiments performed on well-known data sets from the UCI repository show promising classification results for the proposed method.

I. INTRODUCTION

Self-organizing methods such as the Self-Organizing Map (SOM) or Learning Vector Quantisation (LVQ) introduced by Kohonen [8] provide a successful and intuitive way of processing data for easy access [6]. LVQ aims at generating prototypes, or reference vectors, that represent the data of each class [7]. Although LVQ is a fast and simple learning algorithm, its prototypes sometimes diverge and, as a result, its recognition ability degrades [12]. To address this problem, Generalised Learning Vector Quantisation (GLVQ) [12] was proposed. It is a generalisation of Kohonen's original model in which the prototypes are updated by steepest descent to minimise a cost function. GLVQ has been widely applied and has shown good performance in many applications [9], [11], [12]. However, its performance deteriorates on complex data sets, since pattern classes with nonlinear class boundaries usually need a large number of prototypes, and it becomes very difficult to determine a reasonable number of prototypes and their positions to achieve good generalisation performance [10].

To overcome this drawback, Kernel Generalised Learning Vector Quantisation (KGLVQ) [10] was proposed, which learns the prototypes in the feature space. Like LVQ and GLVQ, KGLVQ can be used for two-class and multi-class classification problems. In the two-class case, the entire feature space is divided into subspaces, each induced by two core prototypes, and in each subspace the mid-perpendicular hyperplane of the two core prototypes is employed to classify the data. However, the hyperplanes of KGLVQ do not guarantee maximal margins, which is crucial for kernel methods [13], [14], [15].

In this paper, we propose a maximal margin approach to KGLVQ, named MLVQ. It takes advantage of maximising margins to improve the generalisation capability, as in Support Vector Machines [3], [1]. MLVQ differs from the approach in [4], which maximises the hypothesis margin rather than the real margin. In our approach, finite numbers of prototypes, $m$ and $n$, are used to represent the positive and negative classes, respectively, of a binary data set.
The entire feature space is divided into $m \times n$ subspaces, induced by the pairs of prototypes, one from each class. In each subspace, the mid-perpendicular hyperplane of the two corresponding prototypes is employed to classify the data. The cost function in our approach takes into account maximising the margins of these hyperplanes in order to boost the generalisation capability. Experiments performed on 9 data sets from the UCI repository show promising performance of the proposed method.

II. MAXIMAL MARGIN KERNEL GENERALISED LEARNING VECTOR QUANTISATION

A. Introduction

Consider a binary training set $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, where $x_1, x_2, \ldots, x_l \in \mathbb{R}^d$ are data points and $y_1, y_2, \ldots, y_l \in \{-1, 1\}$ are labels. This training set is mapped into a high-dimensional space, namely the feature space, through a function $\phi(\cdot)$. Based on the idea of Vector Quantisation (VQ), $m$ prototypes $A_1, A_2, \ldots, A_m$ of the positive class and $n$ prototypes $B_1, B_2, \ldots, B_n$ of the negative class will be discovered in the feature space. The classification is based on the minimum distance to the prototypes in each class. More precisely, given a new vector $x$, the decision function is

$$f(x) = \operatorname{sign}\left(\|\phi(x) - b_{j_0}\|^2 - \|\phi(x) - a_{i_0}\|^2\right) \quad (1)$$

where $i_0 = \arg\min_{1 \le i \le m} \|\phi(x) - a_i\|^2$, $j_0 = \arg\min_{1 \le j \le n} \|\phi(x) - b_j\|^2$, and $a_i$, $b_j$ are the coordinates of $A_i$, $B_j$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, respectively.
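To make the decision rule (1) concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the function and variable names are ours) for the special case of an explicit feature map, here simply $\phi(x) = x$. The kernel-based version of this computation is given later in Section II-E.

```python
import numpy as np

def nearest_prototype_decision(x, A, B):
    """Decision function (1) with an explicit feature map (here phi(x) = x).

    A: array of shape (m, d), the positive-class prototypes a_1..a_m.
    B: array of shape (n, d), the negative-class prototypes b_1..b_n.
    Returns +1 if the closest prototype belongs to the positive class, -1 otherwise.
    """
    d_pos = np.min(np.sum((A - x) ** 2, axis=1))  # ||phi(x) - a_{i0}||^2
    d_neg = np.min(np.sum((B - x) ** 2, axis=1))  # ||phi(x) - b_{j0}||^2
    return 1 if d_neg >= d_pos else -1            # ties resolved to +1 for simplicity
```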

B. Optimisation Problem

Given a labeled training vector $(x, y)$, let $a$ and $b$ denote the prototypes of the positive class and the negative class, respectively, that are closest to $\phi(x)$. Let $\mu(x, a, b)$ be a function satisfying the following criterion: if $x$ is correctly classified then $\mu(x, a, b) < 0$; otherwise $\mu(x, a, b) \ge 0$. Let $g$ be a monotonically increasing function. To improve the error rate, $\mu(x, a, b)$ should decrease for all training vectors. Therefore, the criterion is formulated as minimising the following function:

$$\min_{\{A\},\{B\}} \sum_{i=1}^{l} g\left(\mu\left(x_i, a^{(i)}, b^{(i)}\right)\right) \quad (2)$$

where $\{A\}$ and $\{B\}$ are the sequences $\{A_1, A_2, \ldots, A_m\}$ and $\{B_1, B_2, \ldots, B_n\}$, respectively, and $a^{(i)}$ and $b^{(i)}$ are the two prototypes, one from each class, that are closest to $\phi(x_i)$.

C. Solution

Assuming that the prototypes are linear expansions of the vectors $\phi(x_1), \phi(x_2), \ldots, \phi(x_l)$, let us denote by $a_i$, $i = 1, \ldots, m$, and $b_j$, $j = 1, \ldots, n$, the coordinates of the prototypes:

$$a_i = \sum_{k=1}^{l} u_{ik}\,\phi(x_k), \quad i = 1, \ldots, m, \qquad b_j = \sum_{k=1}^{l} v_{jk}\,\phi(x_k), \quad j = 1, \ldots, n \quad (3)$$

For convenience, if $c = \sum_{i=1}^{l} u_i\,\phi(x_i)$, we rewrite $c$ as $c = [u_1, u_2, \ldots, u_l] = [u_i]_{i=1,\ldots,l}$.

Given a labeled training vector $(x, y)$, we first determine the two closest prototypes $A$ and $B$ of the two classes with respect to $x$, and then use gradient descent to update the coordinates $a$ and $b$ of $A$ and $B$, respectively, as follows:

$$a = a - \alpha \frac{\partial g}{\partial a}, \qquad b = b - \alpha \frac{\partial g}{\partial b} \quad (4)$$

We now introduce the algorithm for Vector Quantisation Support Vector Machine.

ALGORITHM FOR VECTOR QUANTISATION SUPPORT VECTOR MACHINE
  Initialise: use C-Means or Fuzzy C-Means clustering to find $m$ prototypes for the positive class and $n$ prototypes for the negative class in the input space
  Set $t = 0$ and $i = 0$
  Repeat
    $t = t + 1$
    $i = (i + 1) \bmod l$
    $A_t = A_{i_0}$ where $i_0 = \arg\min_{1 \le k \le m} \|\phi(x_i) - a_k\|^2$
    $B_t = B_{j_0}$ where $j_0 = \arg\min_{1 \le k \le n} \|\phi(x_i) - b_k\|^2$
    Update $a_{i_0} = a_{i_0} - \alpha\,\partial g/\partial a_{i_0}$
    Update $b_{j_0} = b_{j_0} - \alpha\,\partial g/\partial b_{j_0}$
  Until convergence is reached

Here the function $g = g(\mu, t)$ depends on the learning time $t$. The sigmoid function $g(\mu, t) = 1/(1 + e^{-\mu t})$ is a good candidate for $g$. If this sigmoid function is applied then $\partial g/\partial \mu = t\,g(\mu, t)\,(1 - g(\mu, t))$.
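The adaptation loop above can be sketched in a few lines of Python. The following is our own illustrative sketch, not the authors' implementation: it works in the input space ($\phi(x) = x$), assumes the prototype matrices have already been initialised (e.g. by k-means), uses the sigmoid $g(\mu, t) = 1/(1 + e^{-\mu t})$, and takes the $\mu$-function and its gradients as a callable, so any of the candidates introduced in the next subsection can be plugged in.

```python
import numpy as np

def g_sigmoid(mu, t):
    """g(mu, t) = 1 / (1 + exp(-mu * t)) and its derivative dg/dmu = t * g * (1 - g)."""
    z = np.clip(mu * t, -500.0, 500.0)   # avoid overflow in exp for this crude sketch
    g = 1.0 / (1.0 + np.exp(-z))
    return g, t * g * (1.0 - g)

def train_prototypes(X, y, A, B, mu_fn, alpha=0.05, epochs=50):
    """Prototype adaptation loop of Section II-C in the input space (phi = identity).

    X: (l, d) training points, y: (l,) labels in {-1, +1}.
    A: (m, d) positive prototypes, B: (n, d) negative prototypes (pre-initialised).
    mu_fn(x, y, a, b) must return (mu, dmu_da, dmu_db).
    """
    A, B = A.copy(), B.copy()
    t = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            t += 1
            i0 = int(np.argmin(np.sum((A - x_i) ** 2, axis=1)))  # closest positive prototype
            j0 = int(np.argmin(np.sum((B - x_i) ** 2, axis=1)))  # closest negative prototype
            mu, dmu_da, dmu_db = mu_fn(x_i, y_i, A[i0], B[j0])
            _, dg_dmu = g_sigmoid(mu, t)
            # eq. (4): gradient descent on g(mu) via the chain rule
            A[i0] -= alpha * dg_dmu * dmu_da
            B[j0] -= alpha * dg_dmu * dmu_db
    return A, B
```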
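As a small illustration of Candidate 4 (again our own sketch with $\phi(x) = x$ and our own function names, not the authors' code), the $\mu$-function (12) and its gradients (13) can be written directly and plugged into the adaptation loop sketched in Section II-C:

```python
import numpy as np

def mlvq_mu(x, y, a, b):
    """Candidate 4 (MLVQ): mu-function (12) and its gradients (13), with phi(x) = x."""
    d1 = np.sum((x - a) ** 2)        # ||phi(x) - a||^2
    d2 = np.sum((x - b) ** 2)        # ||phi(x) - b||^2
    nab = np.linalg.norm(a - b)      # ||a - b||
    mu = y * (d1 - d2) / nab
    dmu_da = -(2.0 * y / nab) * (x - a) - (mu / nab ** 2) * (a - b)   # eq. (13), first line
    dmu_db = (2.0 * y / nab) * (x - b) + (mu / nab ** 2) * (a - b)    # eq. (13), second line
    return mu, dmu_da, dmu_db
```

For example, train_prototypes(X, y, A0, B0, mlvq_mu) would adapt k-means-initialised prototypes A0 and B0 under the MLVQ criterion (names as in the sketches above).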
D. Selection of the µ-function

We now introduce some candidates for the $\mu$-function. Let $(x, y)$ be a labeled training vector, and let $a$ and $b$ be the two prototypes, one from each class, that are closest to it.

CANDIDATE 1 FOR THE µ-FUNCTION [8] (LVQ):

$$\mu(x, a, b) = y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right) = y(d_1 - d_2) = \eta(d_1, d_2) \quad (5)$$

CANDIDATE 2 FOR THE µ-FUNCTION [12] (GLVQ):

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|\phi(x) - a\|^2 + \|\phi(x) - b\|^2} = \frac{y(d_1 - d_2)}{d_1 + d_2} = \eta(d_1, d_2) \quad (6)$$

where $d_1 = \|\phi(x) - a\|^2$ and $d_2 = \|\phi(x) - b\|^2$ in (5) and (6) are the squared distances from $\phi(x)$ to the two prototypes $a$ and $b$, respectively. These functions depend on $d_1$ and $d_2$ only, so the adaptation formula (4) can be rewritten as

$$a = a - 2\alpha\,\frac{\partial g}{\partial \eta}\,\frac{\partial \eta}{\partial d_1}\,(a - \phi(x)), \qquad b = b - 2\alpha\,\frac{\partial g}{\partial \eta}\,\frac{\partial \eta}{\partial d_2}\,(b - \phi(x)) \quad (7)$$

If $\mu(x, a, b) = \eta(d_1, d_2) = y(d_1 - d_2)$, the equations in (7) become

$$a = a - 2\alpha\,\frac{\partial g}{\partial \eta}\,y\,(a - \phi(x)), \qquad b = b + 2\alpha\,\frac{\partial g}{\partial \eta}\,y\,(b - \phi(x)) \quad (8)$$

If $\mu(x, a, b) = \eta(d_1, d_2) = \frac{y(d_1 - d_2)}{d_1 + d_2}$, the equations in (7) become

$$a = a - \alpha\,\frac{\partial g}{\partial \eta}\,\frac{4 y d_2}{(d_1 + d_2)^2}\,(a - \phi(x)), \qquad b = b + \alpha\,\frac{\partial g}{\partial \eta}\,\frac{4 y d_1}{(d_1 + d_2)^2}\,(b - \phi(x)) \quad (9)$$

CANDIDATE 3 FOR THE µ-FUNCTION [4] (HMLVQ):

$$\mu(x, a, b) = \frac{1}{2}\,y\left(\|\phi(x) - a\| - \|\phi(x) - b\|\right) \quad (10)$$

This $\mu$-function corresponds to the hypothesis margin of [4] and is used in AdaBoost [5]. The hypothesis margin measures how much the hypothesis can travel before it hits an instance, as shown in Figure 1. The partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{y}{2\,\|\phi(x) - a\|}\,(\phi(x) - a), \qquad \frac{\partial \mu}{\partial b} = \frac{y}{2\,\|\phi(x) - b\|}\,(\phi(x) - b) \quad (11)$$

Fig. 1. (a) Hypothesis margin; (b) sample margin.

CANDIDATE 4 FOR THE µ-FUNCTION (MLVQ):

This is our proposed maximal margin approach. The $\mu$-function is of the form

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|} \quad (12)$$

By Theorem 1 in Appendix A, the absolute value of this $\mu$-function is the sample margin at $\phi(x)$ shown in Figure 1, i.e. the distance from $\phi(x)$ to the mid-perpendicular hyperplane of the prototypes $a$ and $b$. When $x$ is correctly classified, $\mu(x, a, b)$ equals the negative of the sample margin at $x$, so minimising $\mu(x, a, b)$ encourages maximising the sample margin at $x$. The partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b)$$
$$\frac{\partial \mu}{\partial b} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (13)$$
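As noted in Section II-E, equations (14) and (15) require only kernel evaluations, since $\|a_i\|^2$ expands to $\sum_{p,q} u_{ip} u_{iq} K(x_p, x_q)$ (and similarly for $\|b_j\|^2$). The following NumPy sketch is our own illustration of that computation; the function and parameter names are ours, not the authors'.

```python
import numpy as np

def kernel_decision(x, X_train, U, V, kernel):
    """Decision function (15) using the kernel expansion (3) of the prototypes.

    X_train: (l, d) training points x_1..x_l spanning the prototypes.
    U: (m, l) coefficients u_ik of the positive prototypes; V: (n, l) coefficients v_jk.
    kernel: callable k(x, z) -> float, e.g. an RBF kernel.
    """
    k_xx = kernel(x, x)
    k_vec = np.array([kernel(xp, x) for xp in X_train])                        # K(x_p, x)
    K_train = np.array([[kernel(xp, xq) for xq in X_train] for xp in X_train])

    # ||a_i||^2 = sum_{p,q} u_ip u_iq K(x_p, x_q); same form for ||b_j||^2
    norm_a = np.einsum('ip,pq,iq->i', U, K_train, U)
    norm_b = np.einsum('jp,pq,jq->j', V, K_train, V)

    d_a = k_xx - 2.0 * U @ k_vec + norm_a          # eq. (14), positive prototypes
    d_b = k_xx - 2.0 * V @ k_vec + norm_b          # eq. (14), negative prototypes
    return 1 if d_b.min() >= d_a.min() else -1     # eq. (15), ties resolved to +1

# Example kernel: RBF, K(x, z) = exp(-gamma * ||x - z||^2)
rbf = lambda x, z, gamma=0.5: float(np.exp(-gamma * np.sum((x - z) ** 2)))
```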

III. EXPERIMENTAL RESULTS

A. Data Sets

We conducted experiments on 9 data sets from the UCI repository; details of the data sets are shown in Table I. The LVQ algorithms with the different $\mu$-functions described above were evaluated in both the input space and the kernel feature space: we compare LVQ, GLVQ and HMLVQ with the proposed MLVQ in the input space, and kernel LVQ, kernel GLVQ and kernel HMLVQ with kernel MLVQ in the feature space. We also compare MLVQ with SVM.

TABLE I
NUMBER OF SAMPLES IN THE 9 DATA SETS. #POSITIVE: NUMBER OF POSITIVE SAMPLES; #NEGATIVE: NUMBER OF NEGATIVE SAMPLES; d: DIMENSION.

Data set         #positive  #negative    d
Astroparticle         2000       1089    4
Australian             383        307   14
Breast Cancer          444        239   10
Fourclass              307        555    2
Ionosphere             255        126   34
Liver Disorders        200        145    6
SvmGuide3              296        947   22
USPS                  1194       6097  256
Wine                    59         71   13

B. Parameter Settings

In our experiments we did not use the sigmoid function $g(\mu, t) = 1/(1 + e^{-\mu t})$, whose derivative is $\partial g/\partial \mu = t\,g(1 - g)$. This derivative decreases rapidly to 0 as the time $t$ approaches $+\infty$; for example, when $t = 100$ it is nearly 0 whenever $|\mu| > 0.1$. Instead, we applied $g(\mu, t) = 1/(1 + e^{-\mu/t})$, whose derivative is $\partial g/\partial \mu = g(1 - g)/t$. As seen in Figures 4 and 5, this function has two desirable features: 1) its derivative approaches 0 more slowly than that of the sigmoid function; 2) given $t$, if the $\mu$ value of a vector exceeds a predefined threshold, then the derivative, i.e. the adaptation rate at this vector, is very small and the adaptation is minor. (A short sketch comparing the two derivatives is given after Table III.)

Fig. 4. The graph of the derivative of the sigmoid function.

Fig. 5. The graph of the derivative of the new sigmoid function.

To evaluate accuracies, 5-fold cross validation was used. The learning rate $\alpha$ was set to 0.05. The numbers of positive and negative prototypes were both searched over the grid $\{1, 2, 3\}$. For the kernel LVQ models and SVM, the popular RBF kernel $K(x, x') = e^{-\gamma\|x - x'\|^2}$ was used. The parameter $\gamma$ was searched over the grid $\{2^k : k = 2l + 1,\ l = -8, -6, \ldots, 1\}$, and for SVM the trade-off parameter $C$ was searched over the grid $\{2^k : k = 2l + 1,\ l = -8, -6, \ldots, 2\}$.

Experimental results are displayed in Tables II and III and in Figures 6 and 7. They show that our MLVQ method performs very well in the input space. The kernel models always outperform the corresponding input-space models, which is reasonable since the data tend to be more compact in the feature space, so a few prototypes are sufficient to represent each class. The experiments also show that MLVQ in the kernel feature space and SVM are comparable; however, MLVQ is preferable because it is simpler and does not require searching over as large a range of parameters as SVM.

TABLE II
CLASSIFICATION RESULTS (IN %) ON THE 9 DATA SETS FOR THE 4 INPUT SPACE MODELS LVQ, GLVQ, HMLVQ AND MLVQ.

Data set         LVQ  GLVQ  HMLVQ  MLVQ
Astroparticle     66    68     70    84
Australian        82    82     83    85
Breast Cancer     94    95     95    96
Fourclass         90    90     93    88
Ionosphere        70    69     71    84
Liver Disorders   60    61     62    64
SvmGuide3         74    76     74    65
USPS              74    74     73    95
Wine              91    94     90    93

TABLE III
CLASSIFICATION RESULTS (IN %) ON THE 9 DATA SETS FOR THE 4 KERNEL FEATURE SPACE MODELS LVQ, GLVQ, HMLVQ AND MLVQ, AND FOR SVM.

Data set         LVQ  GLVQ  HMLVQ  MLVQ  SVM
Astroparticle     86    89     89    95   96
Australian        83    82     84    88   86
Breast Cancer     96    97     97    97   96
Fourclass         98    99    100    99   98
Ionosphere        92    93     92    95   93
Liver Disorders   62    62     64    66   60
SvmGuide3         75    77     76    79   76
USPS              82    83     82    98   98
Wine              96    98     95    99   99
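The following short NumPy sketch (our own illustration of the point made in Section III-B and Figures 4 and 5, not the authors' code) compares the two derivatives at $t = 100$: the standard sigmoid's derivative collapses to essentially 0 once $|\mu|$ exceeds about 0.1, while the derivative of the modified $g$ stays nearly constant over the same range of $\mu$.

```python
import numpy as np

t = 100.0
mu = np.linspace(-1.0, 1.0, 9)

g_std = 1.0 / (1.0 + np.exp(-mu * t))
dg_std = t * g_std * (1.0 - g_std)        # ~25 at mu = 0, ~0 for |mu| > 0.1

g_new = 1.0 / (1.0 + np.exp(-mu / t))
dg_new = g_new * (1.0 - g_new) / t        # ~0.0025 across the whole range

print(np.round(dg_std, 4))
print(np.round(dg_new, 4))
```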
IV. CONCLUSION

In this paper we have introduced MLVQ, a new maximal margin approach to Kernel Generalised Learning Vector Quantisation. MLVQ maximises the real margin, which is crucial for kernel methods, and can be applied in both the input space and the feature space. The experiments conducted on 9 data sets from the UCI repository demonstrate the good performance of MLVQ in both spaces.

Fig. 6. Classification results (in %) on the 9 data sets for the 4 input space models LVQ, GLVQ, HMLVQ (HM) and MLVQ (SM).

Fig. 7. Classification results (in %) on the 9 data sets for the kernel feature space models kernel LVQ (KLVQ), kernel GLVQ (KGLVQ), kernel HMLVQ (KHM), kernel MLVQ (KSM) and SVM.

APPENDIX

Theorem 1. Let $M$, $A$ and $B$ be points in the affine space $\mathbb{R}^d$, and let $(H): w^T x + b = 0$ be the mid-perpendicular hyperplane of the segment $AB$. The following equality holds:

$$\operatorname{Margin}(M, H) = \frac{\left|\,\|MA\|^2 - \|MB\|^2\,\right|}{2\,\|AB\|}$$

where the sample margin $\operatorname{Margin}(M, H)$ is the distance from the point $M$ to the hyperplane $(H)$.

PROOF.

$$\|MA\|^2 - \|MB\|^2 = \left(\vec{MA} - \vec{MB}\right)\cdot\left(\vec{MA} + \vec{MB}\right) = 2\,\vec{BA}\cdot\vec{MI} = 2\,\vec{BA}\cdot\left(\vec{MH} + \vec{HI}\right) \quad (17)$$

where $I$ is the midpoint of the segment $AB$ and $H$ is the projection of $M$ onto the hyperplane $(H)$, as shown in Figure 8. Since $\vec{HI}$ is orthogonal to $\vec{BA}$ and $\vec{MH}$ is parallel to $\vec{BA}$, we have

$$\|MA\|^2 - \|MB\|^2 = 2\,\vec{BA}\cdot\vec{MH} = \pm 2\,\|AB\|\,\|MH\| = \pm 2\,\|AB\|\,\operatorname{Margin}(M, H),$$

which gives the stated equality.

Fig. 8. The geometric setting used to evaluate the margin.

Corollary 1. If

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|} \quad (18)$$

then the partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b)$$
$$\frac{\partial \mu}{\partial b} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (19)$$

PROOF.

$$\frac{\partial \mu}{\partial a} = \frac{2y\,(a - \phi(x))}{\|a - b\|} - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^2}\cdot\frac{2(a - b)}{2\,\|a - b\|} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (20)$$

$$\frac{\partial \mu}{\partial b} = \frac{-2y\,(b - \phi(x))}{\|a - b\|} - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^2}\cdot\frac{2(b - a)}{2\,\|a - b\|} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (21)$$
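As a quick numerical sanity check of Theorem 1 (our own illustration, not part of the paper), one can compare the usual point-to-hyperplane distance with the right-hand side of the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, M = rng.normal(size=(3, 5))          # three random points in R^5

# H: w^T x + c = 0 with w = A - B, passing through the midpoint of segment AB
w = A - B
c = -w @ ((A + B) / 2.0)
dist_hyperplane = abs(w @ M + c) / np.linalg.norm(w)

# Right-hand side of Theorem 1: | ||MA||^2 - ||MB||^2 | / (2 ||AB||)
dist_theorem = abs(np.sum((M - A) ** 2) - np.sum((M - B) ** 2)) / (2.0 * np.linalg.norm(A - B))

print(np.isclose(dist_hyperplane, dist_theorem))  # True
```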

REFERENCES

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
[2] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection, 2001.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, pages 273-297, 1995.
[4] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems, pages 462-469, 2002.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[6] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15:1059-1068, 2002.
[7] T. Kohonen. Self-Organization and Associative Memory, 3rd edition. Springer-Verlag, 1989.
[8] T. Kohonen. Learning vector quantization. The Handbook of Brain Theory and Neural Networks, pages 537-540, 1995.
[9] C.-L. Liu and M. Nakagawa. Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition, 34(3):601-615, 2001.
[10] A. K. Qin and P. N. Suganthan. A novel kernel prototype-based learning algorithm. In ICPR, pages 621-624, 2004.
[11] A. Sato. Discriminative dimensionality reduction based on generalized LVQ. In ICANN, pages 65-72, 2001.
[12] A. Sato and K. Yamada. Generalized learning vector quantization. In NIPS, pages 423-429, 1995.
[13] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2nd edition, 2002.
[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[15] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 1999.