Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013

Maximal Margin Learning Vector Quantisation

Trung Le, Dat Tran, Van Nguyen, and Wanli Ma

Trung Le and Van Nguyen are with the Faculty of Information Technology, HCMC University of Pedagogy, Ho Chi Minh City, Vietnam (email: {trunglm, vannk}@hcmup.edu.vn). Dat Tran and Wanli Ma are with the Faculty of Education, Science, Technology and Mathematics, University of Canberra, Australia (email: {dat.tran, wanli.ma}@canberra.edu.au).

Abstract—Kernel Generalised Learning Vector Quantisation (KGLVQ) extends Generalised Learning Vector Quantisation into the kernel feature space to deal with complex class boundaries, and has yielded promising performance for complex classification tasks in pattern recognition. However, KGLVQ does not follow the maximal margin principle, which is crucial for kernel-based learning methods. In this paper we propose a maximal margin approach (MLVQ) to the KGLVQ algorithm. MLVQ inherits the merits of KGLVQ and also follows the maximal margin principle to improve the generalisation capability. Experiments performed on well-known data sets from the UCI repository show promising classification results for the proposed method.

I. INTRODUCTION

Self-organizing methods such as the Self-Organizing Map (SOM) or Learning Vector Quantisation (LVQ) introduced by Kohonen [8] provide a successful and intuitive way of processing data for easy access [6]. LVQ aims at generating prototypes, or reference vectors, that represent the data of each class [7]. Although LVQ is a fast and simple learning algorithm, its prototypes sometimes diverge and, as a result, its recognition ability degrades [12]. To address this problem, Generalised Learning Vector Quantisation (GLVQ) [12] was proposed. It is a generalisation of Kohonen's original model in which the prototypes are updated by steepest descent to minimise a cost function. GLVQ has been widely applied and has shown good performance in many applications [9], [11], [12]. However, its performance deteriorates on complex data sets, since pattern classes with nonlinear class boundaries usually need a large number of prototypes, and it becomes very difficult to determine a reasonable number of prototypes and their positions to achieve good generalisation performance [10].

To overcome this drawback, Kernel Generalised Learning Vector Quantisation (KGLVQ) [10] was proposed, which learns the prototypes in the feature space. Like LVQ and GLVQ, KGLVQ can be used for two-class and multi-class classification problems. In the two-class case, the entire feature space is divided into subspaces, each induced by two core prototypes, and in each subspace the mid-perpendicular hyperplane of the two core prototypes is employed to classify the data. However, the hyperplanes of KGLVQ do not guarantee maximal margins, which is crucial for kernel methods [13], [14], [15].

In this paper, we propose a maximal margin approach to KGLVQ, named MLVQ. It takes advantage of maximising margins to improve the generalisation capability, as in Support Vector Machines [3], [1]. MLVQ differs from the approach in [4], which maximises the hypothesis margin rather than the real margin. In our approach, finite numbers of prototypes, $m$ and $n$, are used to represent the positive and negative classes, respectively, of a binary data set.
The entire feature space is divided into $m \times n$ subspaces, induced by the pairs of prototypes, one from each class. In each subspace, the mid-perpendicular hyperplane of the two corresponding prototypes is employed to classify the data. The cost function in our approach takes into account maximising the margins of these hyperplanes in order to boost the generalisation capability. Experiments performed on 9 data sets from the UCI repository show promising performance of the proposed method.

II. MAXIMAL MARGIN KERNEL GENERALISED LEARNING VECTOR QUANTISATION

A. Introduction

Consider a binary training set $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, where $x_1, x_2, \ldots, x_l \in \mathbb{R}^d$ are data points and $y_1, y_2, \ldots, y_l \in \{-1, 1\}$ are labels. This training set is mapped into a high-dimensional space, namely the feature space, through a function $\phi(\cdot)$. Based on the idea of Vector Quantisation (VQ), $m$ prototypes $A_1, A_2, \ldots, A_m$ of the positive class and $n$ prototypes $B_1, B_2, \ldots, B_n$ of the negative class will be discovered in the feature space. The classification is based on the minimum distance to the prototypes in each class. More precisely, given a new vector $x$, the decision function is

$$f(x) = \operatorname{sign}\left(\|\phi(x) - b_{j_0}\|^2 - \|\phi(x) - a_{i_0}\|^2\right) \quad (1)$$

where $i_0 = \arg\min_{1 \le i \le m} \|\phi(x) - a_i\|^2$, $j_0 = \arg\min_{1 \le j \le n} \|\phi(x) - b_j\|^2$, and $a_i$, $b_j$ are the coordinates of $A_i$, $B_j$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, respectively.
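To make the decision rule (1) concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the function and variable names are ours) for the special case of an explicit feature map, here simply $\phi(x) = x$. The kernel-based version of this computation is given later in Section II-E.

```python
import numpy as np

def nearest_prototype_decision(x, A, B):
    """Decision function (1) with an explicit feature map (here phi(x) = x).

    A: array of shape (m, d), the positive-class prototypes a_1..a_m.
    B: array of shape (n, d), the negative-class prototypes b_1..b_n.
    Returns +1 if the closest prototype belongs to the positive class, -1 otherwise.
    """
    d_pos = np.min(np.sum((A - x) ** 2, axis=1))  # ||phi(x) - a_{i0}||^2
    d_neg = np.min(np.sum((B - x) ** 2, axis=1))  # ||phi(x) - b_{j0}||^2
    return 1 if d_neg >= d_pos else -1            # ties resolved to +1 for simplicity
```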

B. Optimisation Problem

Given a labeled training vector $(x, y)$, let $a$ and $b$ denote the prototypes of the positive class and the negative class, respectively, that are closest to $\phi(x)$. Let $\mu(x, a, b)$ be a function satisfying the following criterion: if $x$ is correctly classified then $\mu(x, a, b) < 0$; otherwise $\mu(x, a, b) \ge 0$. Let $g$ be a monotonically increasing function. To improve the error rate, $\mu(x, a, b)$ should decrease for all training vectors. Therefore, the criterion is formulated as minimising the following function:

$$\min_{\{A\},\{B\}} \sum_{i=1}^{l} g\left(\mu\left(x_i, a^{(i)}, b^{(i)}\right)\right) \quad (2)$$

where $\{A\}$ and $\{B\}$ are the sequences $\{A_1, A_2, \ldots, A_m\}$ and $\{B_1, B_2, \ldots, B_n\}$, respectively, and $a^{(i)}$ and $b^{(i)}$ are the two prototypes, one from each class, that are closest to $\phi(x_i)$.

C. Solution

Assuming that the prototypes are linear expansions of the vectors $\phi(x_1), \phi(x_2), \ldots, \phi(x_l)$, let us denote by $a_i$, $i = 1, \ldots, m$, and $b_j$, $j = 1, \ldots, n$, the coordinates of the prototypes:

$$a_i = \sum_{k=1}^{l} u_{ik}\,\phi(x_k), \quad i = 1, \ldots, m, \qquad b_j = \sum_{k=1}^{l} v_{jk}\,\phi(x_k), \quad j = 1, \ldots, n \quad (3)$$

For convenience, if $c = \sum_{i=1}^{l} u_i\,\phi(x_i)$, we rewrite $c$ as $c = [u_1, u_2, \ldots, u_l] = [u_i]_{i=1,\ldots,l}$.

Given a labeled training vector $(x, y)$, we first determine the two closest prototypes $A$ and $B$ of the two classes with respect to $x$, and then use gradient descent to update the coordinates $a$ and $b$ of $A$ and $B$, respectively, as follows:

$$a = a - \alpha \frac{\partial g}{\partial a}, \qquad b = b - \alpha \frac{\partial g}{\partial b} \quad (4)$$

We now introduce the algorithm for Vector Quantisation Support Vector Machine.

ALGORITHM FOR VECTOR QUANTISATION SUPPORT VECTOR MACHINE
  Initialise: use C-Means or Fuzzy C-Means clustering to find $m$ prototypes for the positive class and $n$ prototypes for the negative class in the input space
  Set $t = 0$ and $i = 0$
  Repeat
    $t = t + 1$
    $i = (i + 1) \bmod l$
    $A_t = A_{i_0}$ where $i_0 = \arg\min_{1 \le k \le m} \|\phi(x_i) - a_k\|^2$
    $B_t = B_{j_0}$ where $j_0 = \arg\min_{1 \le k \le n} \|\phi(x_i) - b_k\|^2$
    Update $a_{i_0} = a_{i_0} - \alpha\,\partial g/\partial a_{i_0}$
    Update $b_{j_0} = b_{j_0} - \alpha\,\partial g/\partial b_{j_0}$
  Until convergence is reached

Here the function $g = g(\mu, t)$ depends on the learning time $t$. The sigmoid function $g(\mu, t) = 1/(1 + e^{-\mu t})$ is a good candidate for $g$. If this sigmoid function is applied then $\partial g/\partial \mu = t\,g(\mu, t)\,(1 - g(\mu, t))$.
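The adaptation loop above can be sketched in a few lines of Python. The following is our own illustrative sketch, not the authors' implementation: it works in the input space ($\phi(x) = x$), assumes the prototype matrices have already been initialised (e.g. by k-means), uses the sigmoid $g(\mu, t) = 1/(1 + e^{-\mu t})$, and takes the $\mu$-function and its gradients as a callable, so any of the candidates introduced in the next subsection can be plugged in.

```python
import numpy as np

def g_sigmoid(mu, t):
    """g(mu, t) = 1 / (1 + exp(-mu * t)) and its derivative dg/dmu = t * g * (1 - g)."""
    z = np.clip(mu * t, -500.0, 500.0)   # avoid overflow in exp for this crude sketch
    g = 1.0 / (1.0 + np.exp(-z))
    return g, t * g * (1.0 - g)

def train_prototypes(X, y, A, B, mu_fn, alpha=0.05, epochs=50):
    """Prototype adaptation loop of Section II-C in the input space (phi = identity).

    X: (l, d) training points, y: (l,) labels in {-1, +1}.
    A: (m, d) positive prototypes, B: (n, d) negative prototypes (pre-initialised).
    mu_fn(x, y, a, b) must return (mu, dmu_da, dmu_db).
    """
    A, B = A.copy(), B.copy()
    t = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            t += 1
            i0 = int(np.argmin(np.sum((A - x_i) ** 2, axis=1)))  # closest positive prototype
            j0 = int(np.argmin(np.sum((B - x_i) ** 2, axis=1)))  # closest negative prototype
            mu, dmu_da, dmu_db = mu_fn(x_i, y_i, A[i0], B[j0])
            _, dg_dmu = g_sigmoid(mu, t)
            # eq. (4): gradient descent on g(mu) via the chain rule
            A[i0] -= alpha * dg_dmu * dmu_da
            B[j0] -= alpha * dg_dmu * dmu_db
    return A, B
```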
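As a small illustration of Candidate 4 (again our own sketch with $\phi(x) = x$ and our own function names, not the authors' code), the $\mu$-function (12) and its gradients (13) can be written directly and plugged into the adaptation loop sketched in Section II-C:

```python
import numpy as np

def mlvq_mu(x, y, a, b):
    """Candidate 4 (MLVQ): mu-function (12) and its gradients (13), with phi(x) = x."""
    d1 = np.sum((x - a) ** 2)        # ||phi(x) - a||^2
    d2 = np.sum((x - b) ** 2)        # ||phi(x) - b||^2
    nab = np.linalg.norm(a - b)      # ||a - b||
    mu = y * (d1 - d2) / nab
    dmu_da = -(2.0 * y / nab) * (x - a) - (mu / nab ** 2) * (a - b)   # eq. (13), first line
    dmu_db = (2.0 * y / nab) * (x - b) + (mu / nab ** 2) * (a - b)    # eq. (13), second line
    return mu, dmu_da, dmu_db
```

For example, train_prototypes(X, y, A0, B0, mlvq_mu) would adapt k-means-initialised prototypes A0 and B0 under the MLVQ criterion (names as in the sketches above).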
D. Selection of the µ-function

We now introduce some candidates for the $\mu$-function. Let $(x, y)$ be a labeled training vector, and let $a$ and $b$ be the two prototypes, one from each class, that are closest to it.

CANDIDATE 1 FOR THE µ-FUNCTION [8] (LVQ):

$$\mu(x, a, b) = y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right) = y(d_1 - d_2) = \eta(d_1, d_2) \quad (5)$$

CANDIDATE 2 FOR THE µ-FUNCTION [12] (GLVQ):

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|\phi(x) - a\|^2 + \|\phi(x) - b\|^2} = \frac{y(d_1 - d_2)}{d_1 + d_2} = \eta(d_1, d_2) \quad (6)$$

where $d_1 = \|\phi(x) - a\|^2$ and $d_2 = \|\phi(x) - b\|^2$ in (5) and (6) are the squared distances from $\phi(x)$ to the two prototypes $a$ and $b$, respectively. These functions depend on $d_1$ and $d_2$ only, so the adaptation formula (4) can be rewritten as

$$a = a - 2\alpha\,\frac{\partial g}{\partial \eta}\,\frac{\partial \eta}{\partial d_1}\,(a - \phi(x)), \qquad b = b - 2\alpha\,\frac{\partial g}{\partial \eta}\,\frac{\partial \eta}{\partial d_2}\,(b - \phi(x)) \quad (7)$$

If $\mu(x, a, b) = \eta(d_1, d_2) = y(d_1 - d_2)$, the equations in (7) become

$$a = a - 2\alpha\,\frac{\partial g}{\partial \eta}\,y\,(a - \phi(x)), \qquad b = b + 2\alpha\,\frac{\partial g}{\partial \eta}\,y\,(b - \phi(x)) \quad (8)$$

If $\mu(x, a, b) = \eta(d_1, d_2) = \frac{y(d_1 - d_2)}{d_1 + d_2}$, the equations in (7) become

$$a = a - \alpha\,\frac{\partial g}{\partial \eta}\,\frac{4 y d_2}{(d_1 + d_2)^2}\,(a - \phi(x)), \qquad b = b + \alpha\,\frac{\partial g}{\partial \eta}\,\frac{4 y d_1}{(d_1 + d_2)^2}\,(b - \phi(x)) \quad (9)$$

CANDIDATE 3 FOR THE µ-FUNCTION [4] (HMLVQ):

$$\mu(x, a, b) = \frac{1}{2}\,y\left(\|\phi(x) - a\| - \|\phi(x) - b\|\right) \quad (10)$$

This $\mu$-function corresponds to the hypothesis margin of [4] and is used in AdaBoost [5]. The hypothesis margin measures how much the hypothesis can travel before it hits an instance, as shown in Figure 1. The partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{y}{2\,\|\phi(x) - a\|}\,(\phi(x) - a), \qquad \frac{\partial \mu}{\partial b} = \frac{y}{2\,\|\phi(x) - b\|}\,(\phi(x) - b) \quad (11)$$

Fig. 1. (a) Hypothesis margin; (b) sample margin.

CANDIDATE 4 FOR THE µ-FUNCTION (MLVQ):

This is our proposed maximal margin approach. The $\mu$-function is of the form

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|} \quad (12)$$

By Theorem 1 in Appendix A, the absolute value of this $\mu$-function is the sample margin at $\phi(x)$ shown in Figure 1, i.e. the distance from $\phi(x)$ to the mid-perpendicular hyperplane of the prototypes $a$ and $b$. When $x$ is correctly classified, $\mu(x, a, b)$ equals the negative of the sample margin at $x$, so minimising $\mu(x, a, b)$ encourages maximising the sample margin at $x$. The partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b)$$
$$\frac{\partial \mu}{\partial b} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (13)$$
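As noted in Section II-E, equations (14) and (15) require only kernel evaluations, since $\|a_i\|^2$ expands to $\sum_{p,q} u_{ip} u_{iq} K(x_p, x_q)$ (and similarly for $\|b_j\|^2$). The following NumPy sketch is our own illustration of that computation; the function and parameter names are ours, not the authors'.

```python
import numpy as np

def kernel_decision(x, X_train, U, V, kernel):
    """Decision function (15) using the kernel expansion (3) of the prototypes.

    X_train: (l, d) training points x_1..x_l spanning the prototypes.
    U: (m, l) coefficients u_ik of the positive prototypes; V: (n, l) coefficients v_jk.
    kernel: callable k(x, z) -> float, e.g. an RBF kernel.
    """
    k_xx = kernel(x, x)
    k_vec = np.array([kernel(xp, x) for xp in X_train])                        # K(x_p, x)
    K_train = np.array([[kernel(xp, xq) for xq in X_train] for xp in X_train])

    # ||a_i||^2 = sum_{p,q} u_ip u_iq K(x_p, x_q); same form for ||b_j||^2
    norm_a = np.einsum('ip,pq,iq->i', U, K_train, U)
    norm_b = np.einsum('jp,pq,jq->j', V, K_train, V)

    d_a = k_xx - 2.0 * U @ k_vec + norm_a          # eq. (14), positive prototypes
    d_b = k_xx - 2.0 * V @ k_vec + norm_b          # eq. (14), negative prototypes
    return 1 if d_b.min() >= d_a.min() else -1     # eq. (15), ties resolved to +1

# Example kernel: RBF, K(x, z) = exp(-gamma * ||x - z||^2)
rbf = lambda x, z, gamma=0.5: float(np.exp(-gamma * np.sum((x - z) ** 2)))
```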

III. EXPERIMENTAL RESULTS

A. Data Sets

We conducted experiments on 9 data sets from the UCI repository; details of the data sets are shown in Table I. The LVQ algorithms with the different $\mu$-functions described above were evaluated in both the input space and the kernel feature space: we compare LVQ, GLVQ and HMLVQ with the proposed MLVQ in the input space, and kernel LVQ, kernel GLVQ and kernel HMLVQ with kernel MLVQ in the feature space. We also compare MLVQ with SVM.

TABLE I
NUMBER OF SAMPLES IN THE 9 DATA SETS. #POSITIVE: NUMBER OF POSITIVE SAMPLES; #NEGATIVE: NUMBER OF NEGATIVE SAMPLES; d: DIMENSION.

Data set         #positive  #negative    d
Astroparticle         2000       1089    4
Australian             383        307   14
Breast Cancer          444        239   10
Fourclass              307        555    2
Ionosphere             255        126   34
Liver Disorders        200        145    6
SvmGuide3              296        947   22
USPS                  1194       6097  256
Wine                    59         71   13

B. Parameter Settings

In our experiments we did not use the sigmoid function $g(\mu, t) = 1/(1 + e^{-\mu t})$, whose derivative is $\partial g/\partial \mu = t\,g(1 - g)$. This derivative decreases rapidly to 0 as the time $t$ approaches $+\infty$; for example, when $t = 100$ it is nearly 0 whenever $|\mu| > 0.1$. Instead, we applied $g(\mu, t) = 1/(1 + e^{-\mu/t})$, whose derivative is $\partial g/\partial \mu = g(1 - g)/t$. As seen in Figures 4 and 5, this function has two desirable features: 1) its derivative approaches 0 more slowly than that of the sigmoid function; 2) given $t$, if the $\mu$ value of a vector exceeds a predefined threshold, then the derivative, i.e. the adaptation rate at this vector, is very small and the adaptation is minor. (A short sketch comparing the two derivatives is given after Table III.)

Fig. 4. The graph of the derivative of the sigmoid function.

Fig. 5. The graph of the derivative of the new sigmoid function.

To evaluate accuracies, 5-fold cross validation was used. The learning rate $\alpha$ was set to 0.05. The numbers of positive and negative prototypes were both searched over the grid $\{1, 2, 3\}$. For the kernel LVQ models and SVM, the popular RBF kernel $K(x, x') = e^{-\gamma\|x - x'\|^2}$ was used. The parameter $\gamma$ was searched over the grid $\{2^k : k = 2l + 1,\ l = -8, -6, \ldots, 1\}$, and for SVM the trade-off parameter $C$ was searched over the grid $\{2^k : k = 2l + 1,\ l = -8, -6, \ldots, 2\}$.

Experimental results are displayed in Tables II and III and in Figures 6 and 7. They show that our MLVQ method performs very well in the input space. The kernel models always outperform the corresponding input-space models, which is reasonable since the data tend to be more compact in the feature space, so a few prototypes are sufficient to represent each class. The experiments also show that MLVQ in the kernel feature space and SVM are comparable; however, MLVQ is preferable because it is simpler and does not require searching over as large a range of parameters as SVM.

TABLE II
CLASSIFICATION RESULTS (IN %) ON THE 9 DATA SETS FOR THE 4 INPUT SPACE MODELS LVQ, GLVQ, HMLVQ AND MLVQ.

Data set         LVQ  GLVQ  HMLVQ  MLVQ
Astroparticle     66    68     70    84
Australian        82    82     83    85
Breast Cancer     94    95     95    96
Fourclass         90    90     93    88
Ionosphere        70    69     71    84
Liver Disorders   60    61     62    64
SvmGuide3         74    76     74    65
USPS              74    74     73    95
Wine              91    94     90    93

TABLE III
CLASSIFICATION RESULTS (IN %) ON THE 9 DATA SETS FOR THE 4 KERNEL FEATURE SPACE MODELS LVQ, GLVQ, HMLVQ AND MLVQ, AND FOR SVM.

Data set         LVQ  GLVQ  HMLVQ  MLVQ  SVM
Astroparticle     86    89     89    95   96
Australian        83    82     84    88   86
Breast Cancer     96    97     97    97   96
Fourclass         98    99    100    99   98
Ionosphere        92    93     92    95   93
Liver Disorders   62    62     64    66   60
SvmGuide3         75    77     76    79   76
USPS              82    83     82    98   98
Wine              96    98     95    99   99
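The following short NumPy sketch (our own illustration of the point made in Section III-B and Figures 4 and 5, not the authors' code) compares the two derivatives at $t = 100$: the standard sigmoid's derivative collapses to essentially 0 once $|\mu|$ exceeds about 0.1, while the derivative of the modified $g$ stays nearly constant over the same range of $\mu$.

```python
import numpy as np

t = 100.0
mu = np.linspace(-1.0, 1.0, 9)

g_std = 1.0 / (1.0 + np.exp(-mu * t))
dg_std = t * g_std * (1.0 - g_std)        # ~25 at mu = 0, ~0 for |mu| > 0.1

g_new = 1.0 / (1.0 + np.exp(-mu / t))
dg_new = g_new * (1.0 - g_new) / t        # ~0.0025 across the whole range

print(np.round(dg_std, 4))
print(np.round(dg_new, 4))
```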
IV. CONCLUSION

In this paper we have introduced MLVQ, a new maximal margin approach to Kernel Generalised Learning Vector Quantisation. MLVQ maximises the real margin, which is crucial for kernel methods, and can be applied in both the input space and the feature space. The experiments conducted on 9 data sets from the UCI repository demonstrate the good performance of MLVQ in both spaces.

Fig. 6. Classification results (in %) on the 9 data sets for the 4 input space models LVQ, GLVQ, HMLVQ (HM) and MLVQ (SM).

Fig. 7. Classification results (in %) on the 9 data sets for the kernel feature space models kernel LVQ (KLVQ), kernel GLVQ (KGLVQ), kernel HMLVQ (KHM), kernel MLVQ (KSM) and SVM.

APPENDIX

Theorem 1. Let $M$, $A$ and $B$ be points in the affine space $\mathbb{R}^d$, and let $(H): w^T x + b = 0$ be the mid-perpendicular hyperplane of the segment $AB$. The following equality holds:

$$\operatorname{Margin}(M, H) = \frac{\left|\,\|MA\|^2 - \|MB\|^2\,\right|}{2\,\|AB\|}$$

where the sample margin $\operatorname{Margin}(M, H)$ is the distance from the point $M$ to the hyperplane $(H)$.

PROOF.

$$\|MA\|^2 - \|MB\|^2 = \left(\vec{MA} - \vec{MB}\right)\cdot\left(\vec{MA} + \vec{MB}\right) = 2\,\vec{BA}\cdot\vec{MI} = 2\,\vec{BA}\cdot\left(\vec{MH} + \vec{HI}\right) \quad (17)$$

where $I$ is the midpoint of the segment $AB$ and $H$ is the projection of $M$ onto the hyperplane $(H)$, as shown in Figure 8. Since $\vec{HI}$ is orthogonal to $\vec{BA}$ and $\vec{MH}$ is parallel to $\vec{BA}$, we have

$$\|MA\|^2 - \|MB\|^2 = 2\,\vec{BA}\cdot\vec{MH} = \pm 2\,\|AB\|\,\|MH\| = \pm 2\,\|AB\|\,\operatorname{Margin}(M, H),$$

which gives the stated equality.

Fig. 8. The geometric setting used to evaluate the margin.

Corollary 1. If

$$\mu(x, a, b) = \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|} \quad (18)$$

then the partial derivatives of $\mu$ with respect to $a$ and $b$ are

$$\frac{\partial \mu}{\partial a} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b)$$
$$\frac{\partial \mu}{\partial b} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (19)$$

PROOF.

$$\frac{\partial \mu}{\partial a} = \frac{2y\,(a - \phi(x))}{\|a - b\|} - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^2}\cdot\frac{2(a - b)}{2\,\|a - b\|} = -\frac{2y}{\|a - b\|}\,(\phi(x) - a) - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (20)$$

$$\frac{\partial \mu}{\partial b} = \frac{-2y\,(b - \phi(x))}{\|a - b\|} - \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^2}\cdot\frac{2(b - a)}{2\,\|a - b\|} = \frac{2y}{\|a - b\|}\,(\phi(x) - b) + \frac{y\left(\|\phi(x) - a\|^2 - \|\phi(x) - b\|^2\right)}{\|a - b\|^3}\,(a - b) \quad (21)$$
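As a quick numerical sanity check of Theorem 1 (our own illustration, not part of the paper), one can compare the usual point-to-hyperplane distance with the right-hand side of the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, M = rng.normal(size=(3, 5))          # three random points in R^5

# H: w^T x + c = 0 with w = A - B, passing through the midpoint of segment AB
w = A - B
c = -w @ ((A + B) / 2.0)
dist_hyperplane = abs(w @ M + c) / np.linalg.norm(w)

# Right-hand side of Theorem 1: | ||MA||^2 - ||MB||^2 | / (2 ||AB||)
dist_theorem = abs(np.sum((M - A) ** 2) - np.sum((M - B) ** 2)) / (2.0 * np.linalg.norm(A - B))

print(np.isclose(dist_hyperplane, dist_theorem))  # True
```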

REFERENCES

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
[2] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection, 2001.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, pages 273-297, 1995.
[4] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems, pages 462-469, 2002.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[6] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15:1059-1068, 2002.
[7] T. Kohonen. Self-Organization and Associative Memory, 3rd edition. Springer-Verlag, 1989.
[8] T. Kohonen. Learning vector quantization. The Handbook of Brain Theory and Neural Networks, pages 537-540, 1995.
[9] C.-L. Liu and M. Nakagawa. Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition, 34(3):601-615, 2001.
[10] A. K. Qin and P. N. Suganthan. A novel kernel prototype-based learning algorithm. In ICPR, pages 621-624, 2004.
[11] A. Sato. Discriminative dimensionality reduction based on generalized LVQ. In ICANN, pages 65-72, 2001.
[12] A. Sato and K. Yamada. Generalized learning vector quantization. In NIPS, pages 423-429, 1995.
[13] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2nd edition, 2002.
[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[15] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 1999.