Linearly and Quadratically Separable Classifiers Using Adaptive Approach


Soliman MAMA, Abo-Bakr RM. Linearly and quadratically separable classifiers using adaptive approach. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 26(5), Sept. 2011.

Linearly and Quadratically Separable Classifiers Using Adaptive Approach

Mohamed Abdel-Kawy Mohamed Ali Soliman 1 and Rasha M. Abo-Bakr 2

1 Department of Computer and Systems Engineering, Faculty of Engineering, Zagazig University, Zagazig, Egypt
2 Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt

mamas2000@hotmail.com; rasha abobakr@hotmail.com

Received October 3, 2009; revised May 14.

Abstract. This paper presents a fast adaptive iterative algorithm for solving linearly separable classification problems in R^n. In each iteration, a subset of the sampling data (n points, where n is the number of features) is adaptively chosen and a hyperplane is constructed such that it separates the chosen n points at a margin ε and best classifies the remaining points. The classification problem is formulated and the details of the algorithm are presented. Further, the algorithm is extended to solving quadratically separable classification problems. The basic idea is to map the physical space to a larger one in which the problem becomes linearly separable. Numerical illustrations show that few iteration steps are sufficient for convergence when the classes are linearly separable. For nonlinearly separable data, given a specified maximum number of iteration steps, the algorithm returns the best hyperplane that minimizes the number of misclassified points encountered through these steps. Comparisons with other machine learning algorithms on practical and benchmark datasets are also presented, showing the performance of the proposed algorithm.

Keywords: linear classification, quadratic classification, iterative approach, adaptive technique

1 Introduction

Pattern recognition [1-2] is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on the application, these objects can be images, signal waveforms, or any type of measurements that need to be classified. Linear separability is an important topic in the domains of artificial intelligence and machine learning. There are many real-life problems in which there is a linear separation. A linear model is also robust against noise, whereas a nonlinear model may fit the noisy samples in the training data, perform extra calculations to do so, and still be less accurate on test data. Multilayer nonlinear (NL) neural networks, such as those trained by the back-propagation algorithm, work well for nonlinear classification problems. However, using back-propagation for a linearly separable problem is overkill: thousands of iterations may be needed to reach a solution that a linear separation method can reach quickly. Linear separability methods are also used for training Support Vector Machines (SVMs) [3-4] for pattern recognition. Support Vector Machines are linear learning machines on linearly or nonlinearly separable data. They are trained by finding a hyperplane that linearly separates the data. In the case of nonlinearly separable data, the data are mapped into some other Euclidean space; thus, SVM still performs a linear separation, but in a different space. In this paper, a novel and efficient method of finding a hyperplane which separates two linearly separable (LS) sets in R^n is proposed. It is an adaptive iterative linear classifier (AILC) approach.
The main idea of our approach is to detect the boundary region between the two classes, where points of different classes are close to each other. From this region, n points belonging to the two different classes are chosen and a hyperplane is constructed such that each of the n points lies at a prescribed distance ε from it, with the points of each class on opposite sides. There exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, calculations are terminated. Otherwise, another n points are chosen to start the next iteration. These n points are chosen adaptively from the misclassified ones as those furthest from the hyperplane constructed in the current iteration, because such points most probably lie in the critical region between the two classes. Compared with other iterative linear classifiers, this approach is adaptive, and numerical results show that very few iteration steps are sufficient for convergence even for large sampling data.

The concept of a hyperplane is extended to perform quadratic, not just linear, classification. Analogous to the separating hyperplane that is represented by a linear (first-degree) equation, in quadratic classification a second-degree hypersurface is constructed to separate the two classes.

This paper is divided into seven sections. In Section 2, a brief survey of methods that classify LS classes is given, showing the theoretical basis of those most related to the proposed classifier. In Section 3, the main idea, geometric interpretation, and mathematical formulation of the proposed AILC are presented. Illustrative examples are given in Section 4. The quadratically separable classifier is discussed and demonstrated by some examples in Section 5. Comparisons with other known algorithms on linearly and nonlinearly separable benchmark datasets are presented in Section 6. Finally, in Section 7, conclusions and future work are discussed.

2 Comparison with Existing Algorithms

Numerous techniques exist in the literature for solving the linear separability classification problem. These include methods based on solving linear constraints (the Fourier-Kuhn elimination algorithm [5] or linear programming [6]), methods based on the perceptron algorithm [7], and methods based on computational geometry (convex hull) techniques [8]. In addition, statistical approaches are characterized by an explicit underlying probability model, which provides the probability that an instance belongs to a specific class, rather than simply a classification. The algorithms most related to the one proposed in this work are the perceptron and SVM algorithms.

The perceptron algorithm was proposed by Rosenblatt [5] for computing a hyperplane that linearly separates two finite and disjoint sets of points. Starting with an arbitrary hyperplane, the dataset is tested sequentially, point after point, to check whether each point is correctly classified. If a point is misclassified, the current hyperplane is updated to correctly classify this point. This process is repeated until a hyperplane is found that classifies the full dataset. If the two classes are linearly separable, the perceptron algorithm provides, in a finite number of steps, a hyperplane that linearly separates them. However, it is not known ahead of time how many iteration steps are needed for the algorithm to converge (a minimal sketch of this update rule is given at the end of this section). SVM [3], as a linear learning method, is trained by finding an optimal hyperplane that separates the dataset (with the largest possible margin) by solving a constrained convex quadratic programming optimization problem, which is time consuming.

In the proposed AILC, starting with an arbitrary hyperplane, the full dataset is tested and the information about the relative locations of the misclassified points with respect to the hyperplane is used to predict the critical region between the two classes where a better hyperplane can exist. This adaptive nature of the iteration speeds up convergence to a hyperplane that successfully separates the two classes. In Section 3, the classification problem is reformulated to produce the required information at low cost. In addition, the theoretical basis and implementation of AILC are provided.
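For concreteness, here is a minimal sketch of the classical perceptron update just described. It is not part of the original paper; it is written in Python/NumPy and the function name perceptron_train is ours.

```python
import numpy as np

def perceptron_train(X, d, lr=1.0, max_epochs=1000):
    """Classical perceptron: X is (N, n) data, d holds class labels +1/-1.
    Returns a weight vector w and bias t such that sign(x.w + t) predicts d."""
    N, n = X.shape
    w, t = np.zeros(n), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, d_i in zip(X, d):          # test the points sequentially
            if d_i * (x_i @ w + t) <= 0:    # misclassified (or on the plane)
                w += lr * d_i * x_i         # move the hyperplane toward the point
                t += lr * d_i
                errors += 1
        if errors == 0:                     # whole dataset classified: stop
            break
    return w, t
```

If the classes are linearly separable, the loop eventually terminates with zero errors; otherwise it stops after max_epochs passes over the data.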
3 Adaptive Iterative Linear Classifier (AILC)

In this section we present the adaptive iterative linear classifier (AILC). The main idea is to imitate how one would predict a line in R^2 that separates points belonging to two linearly separable classes. First, one detects the boundary region between the two classes, where points of different classes are close to each other. From this region of interest, one can choose two points (one from each class) that seem to be the most difficult (nearest) and predict a line that not only separates these two points but also, as far as possible, correctly separates the two classes; that is, the line should have one of the points, together with the remaining points of its class, on one side, and the second point, with the rest of its class, on the other side. If such a line exists, the task is done. Otherwise, another two points are chosen to start the next iteration. These new points are chosen adaptively as those which, according to the line constructed in the current iteration, are expected to lie in the border region between the two classes. Construction of a separating line in our approach is characterized by the requirement that the two points lie at a prescribed distance ε from it (but on opposite sides). In fact, there exist precisely two such lines, from which we choose the one that correctly separates more points.

The generalization to R^n is straightforward. Starting with n points in R^n belonging to the two different classes, we construct a hyperplane such that each of the n points lies at a prescribed distance ε from it (with the points of each class on opposite sides). Again, there exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, we terminate the calculations. Otherwise, a new iteration is started by choosing another n points from the misclassified ones (see Subsection 3.1 for more details).

This approach is more efficient than other related methods proposed in the literature. For example, the CLS [9-11] examines each possible hyperplane passing through every set of n points to check whether it can successfully classify the remaining points. When such a hyperplane is reached, the required hyperplane is constructed such that, in addition, it properly separates the n points according to their classes.

3.1 Geometric Interpretation and Theoretical Basis for AILC

The classification problem considered in this work consists of finding a hyperplane P that linearly separates N points in R^n. Each of these points belongs to one of two disjoint classes A or B, which lie in the positive or negative half space of P, respectively. If the training data are linearly separable, then a hyperplane

P(w; t): x^T w + t = 0   (1)

exists such that

x_i^T w + t > 0, for all x_i in A,
x_i^T w + t < 0, for all x_i in B,   (2)

where x_i in R^n is the feature vector (the coordinates) of point i, w in R^n is termed the weight vector, and t in R the bias (or threshold) of the hyperplane. Defining the class identifier variables

d_i = { 1, if x_i in class A; -1, if x_i in class B },   (3)

(2) reduces to the single form

d_i (x_i^T w + t) > 0, i = 1, 2, ..., N.   (4)

Dividing (4) by |t| yields

d_i (x_i^T W + c) = e_i, e_i > 0, i = 1, 2, ..., N,   (5)

where W = w/|t| is a weight vector having the same direction as w (normal to the hyperplane P(W; c): x^T W + c = 0) and pointing into its positive half space, and c = 1 or -1 according to whether the sign of t is positive or negative, respectively.

In (5), we have introduced the variables e_i, i = 1, 2, ..., N, for the first time. These variables will be the source of information in our approach. According to (5), a hyperplane P(W; c) correctly separates the two classes if e_i > 0 for i = 1, 2, ..., N. However, for a trial hyperplane P(W', c'), if substituting W' and c' in (5) produces a negative value of e_i, then point i is misclassified by P. Another important property of these variables is that each e_i is a measure of the distance between point x_i and P. This can easily be proven as follows. Recall that the distance between any point x_i in R^n and the hyperplane P(W; c) is given by

δ(x_i, P) = |x_i^T W + c| / ||W|| = e_i / ||W|| > 0,   (6)

where ||W|| is the L2 norm (length) of W; then

e_i = ||W|| δ(x_i, P).   (7)

In our approach, since W = (w_1, w_2, ..., w_n)^T consists of n unknown components, we choose n points and assume that they all lie at a constant distance from a trial hyperplane P such that each point lies in the proper half space according to its class. Substituting x_i^T, d_i and e_i = ε > 0, i = 1, 2, ..., n, into (5), and noting that c = 1 or -1, produces two linear systems of equations in the n unknowns w_1, w_2, ..., w_n. Solving these systems (assuming linear independence of the equations) produces two hyperplanes: P_1 = P(W_1; 1) and P_2 = P(W_2; -1). The first adaptive feature of the proposed algorithm is to select from P_1 and P_2 the one that is more efficient in classifying the remaining N - n points.

Fig.1. Choice of the better hyperplane. The arrow of each hyperplane points into its positive half space.

In Fig.1, an illustration in R^2 is presented with N = 16 (8 points of each class), where a black circle denotes the class with identifier d = 1 and a triangle the other class with d = -1. The starting 2 points are enclosed in squares.
Both P_1 and P_2 successfully separate the chosen points into the two classes. However, it is not guaranteed that both P_1 and P_2 correctly classify the full set of N points. P_2 succeeded in classifying 12 points (5 circles in its positive half space and 7 triangles on the other side) but failed on the remaining 4 points, whereas P_1 succeeded in classifying only 6 points (4 circles in its positive half space and 2 triangles on the other side) and failed on the remaining 10 points. Thus the algorithm chooses P_2.
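The signed quantities e_i defined in (5) are cheap to compute in bulk. The following minimal NumPy sketch (ours, not from the paper; the function name signed_margins is hypothetical) shows how the sign of e_i flags misclassified points and how its magnitude ranks them by distance from the hyperplane, which is exactly the information used later by the adaptive reordering.

```python
import numpy as np

def signed_margins(X, d, W, c):
    """e_i = d_i * (x_i . W + c); e_i < 0 marks a point misclassified by P(W; c),
    and |e_i| is proportional to its distance from the hyperplane (eq. (7))."""
    return d * (X @ W + c)

# toy usage: 2 features, 4 points, labels +1/-1, and some trial hyperplane (W, c)
X = np.array([[2.0, 1.0], [0.2, 3.0], [-1.0, -2.0], [-2.0, 0.5]])
d = np.array([1, 1, -1, -1])
W, c = np.array([1.0, 0.0]), -0.5
e = signed_margins(X, d, W, c)              # [ 1.5, -0.3, 1.5, 2.5]
misclassified = np.where(e < 0)[0]          # indices with negative e_i -> [1]
worst_first = np.argsort(e)                 # most negative (furthest wrong) first
distances = np.abs(e) / np.linalg.norm(W)   # actual distances, from eq. (6)
```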

3.2 Mathematical Formulation

Let x_i^T = [x_i1, x_i2, ..., x_in] be the row representation of the components of an input data point x_i in R^n that has n features, and let N be the number of data points belonging to the two disjoint classes A and B. Then, applying (5) to all N points yields the system

D(X^T W + C) = E,   (8)

where

D = diag(d_1, d_2, ..., d_N)   (N x N),
X^T = [x_ij]   (N x n), with row i equal to x_i^T,
W = (w_1, w_2, ..., w_n)^T   (n x 1),
C = (c, c, ..., c)^T = c J_N   (N x 1),
J_N = (1, 1, ..., 1)^T   (N x 1),
E = (e_1, e_2, ..., e_N)^T   (N x 1).   (9)

Thus, the classification problem is formulated as

D(X^T W + c J_N) = E.   (10)

Note that the matrices X^T and D represent the input data: for each point i, X^T contains the feature vector x_i^T in row i, and D is a diagonal matrix whose diagonal elements are the elements of the vector d = [d_1 d_2 ... d_N]^T. Thus, interchanging the rows of both X^T and D corresponds to reordering the N points. In (10), J_N is an N-vector whose entries are all unity and c = ±1. Also, referring to (5), for a separating hyperplane all the entries of the vector E must be positive. Hence the classification problem reads: find a hyperplane, or equivalently find W and c, such that

E > 0.   (11)

The proposed solution partitions the N-equation system (10) into two subsystems: the first consists of the first n equations, and the second consists of the remaining N - n equations. Let X^T be partitioned as

X^T = [ a ; b ],

where a is n x n and b is (N - n) x n; then (10) is rewritten as

[ D_1  0 ; 0  D_2 ] ( [ a ; b ] W + c [ J_1 ; J_2 ] ) = [ E_1 ; E_2 ],   (12)

where a is a nonsingular square matrix of dimension n, b is in general a rectangular matrix of dimension (N - n) x n, J_1 and J_2 are vectors of ones with n and N - n components, respectively, and D_1 and D_2 are diagonal square matrices of dimensions n and N - n, respectively. (12) can then be written as

D_1(a W + c J_1) = E_1,   (13)
D_2(b W + c J_2) = E_2,   (14)

and the classification problem becomes: find W and c such that E_1 > 0 and E_2 > 0.

3.3 Adaptive Iterative Linear Classifier (AILC)

To simplify the solution of (13) and (14), choose a small positive number ε and assume

e_1 = e_2 = ... = e_n = ε > 0;   (15)

then E_1 = ε J_1 > 0 and hence, upon substitution into (13), using D_1^{-1} = D_1 and solving for W as a function of c, (13) reduces to

W = a^{-1} Q.   (16)

Here Q = ε D_1 J_1 - c J_1 is a vector of length n and is computed easily because its i-th entry is ε d_i - c, 1 ≤ i ≤ n. Substituting (16) into (14),

E_2 = D_2 b W + c D_2 J_2.   (17)

To compute E_2, note that its i-th entry is

e_i = d_i (b_i^T W + c), n + 1 ≤ i ≤ N.   (18)

Clearly, since the vector Q depends on the value of c, so do both W and E_2.

3.4 Adaptive Procedure

In the proposed AILC, we try to speed up the convergence rate by making full use of all available information within and after each iteration. Two adaptive choices are performed as follows. First, within iteration r, the algorithm chooses the value of c as +1 or -1 such that the constructed hyperplane correctly classifies more points, as described in Subsection 3.1.
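As an informal illustration of (15)-(18) and of this adaptive choice of c (formalized in Algorithm 1 below), a minimal NumPy sketch of one iteration might look as follows. This is our own reading, not the authors' code, and the function name ailc_iteration is hypothetical.

```python
import numpy as np

def ailc_iteration(a, b, d1, d2, eps):
    """One AILC iteration. a: (n, n) chosen points, b: (N-n, n) remaining points,
    d1, d2: the corresponding +1/-1 class identifiers, eps: prescribed margin.
    Returns (c, W, E2, m), where m is the number of misclassified remaining points."""
    a_inv = np.linalg.inv(a)                 # assumes the chosen rows are independent
    best = None
    for c in (1.0, -1.0):
        Q = eps * d1 - c                     # i-th entry: eps*d_i - c, as in eq. (16)
        W = a_inv @ Q
        E2 = d2 * (b @ W + c)                # eqs. (17)-(18)
        m = int(np.sum(E2 < 0))              # negative entries = misclassified points
        if best is None or m < best[3]:
            best = (c, W, E2, m)
        if m == 0:                           # perfect separation with this c
            break
    return best
```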

The implementation of this adaptive choice is presented in Algorithm 1.

Algorithm 1. Iteration r (a^{-1}, b, D_1, D_2; c, W, E_2, m)
1. Set c = 1.
2. Compute the vectors W(c) and E_2(c) using (16), (17).
3. Compute m(c) as the number of negative entries of E_2(c).
4. if m(c) = 0, then E_2(c) > 0; go to step 8.
5. else if c = 1, set c = -1 and repeat steps 2-4.
6. if m(1) < m(-1), then c = 1 produces the accepted hyperplane P_r; set c = 1 and go to step 9.
7. else c = -1 produces the accepted hyperplane P_r; set c = -1 and go to step 9.
8. The separating hyperplane P is defined by c, W(c). end iteration r.
9. The best hyperplane P_r is defined by c, W = W(c); return also E_2(c), m(c). end iteration r.

Second, after an iteration r, the vector E_r = [E_1; E_2]_r is computed. E_r is constructed as the augmentation of E_1, all of whose n entries equal ε, and E_2, whose entries are computed by (18). E_r contains important information about the fitness of the constructed hyperplane P_r as a separator. First, recall that a negative sign of an entry e_i of E_r means that point i is misclassified by the hyperplane. Second, by (7), the absolute value of e_i provides a measure of the distance of point i from the hyperplane. Thus, if the entries of E_r are all positive, then P_r is an acceptable classifier; otherwise, the entries with the lowest values in E_r correspond to the misclassified points furthest from P_r, and such points most probably lie in the critical region between the two classes where an objective classifier P has to be constructed. Accordingly, we choose n of these points (which, in addition, must be linearly independent and belong to both classes) to determine the hyperplane in the next iteration. So, the matrix a in (12) is chosen by adaptively reordering the input matrix X^T after each iteration such that the first n rows of X^T and D correspond to the data of the chosen n points. An illustration in R^2 is shown in Fig.2, where black circles and triangles denote the classes that must lie in the positive and negative half space, respectively. The misclassified points lie in the shaded regions, and the 2 points chosen for the next iteration are shown in rectangles.

Fig.2. Illustration of the adaptive choice of the next iteration of the classifier (AILC) in R^2.

3.5 Implementation of AILC

Algorithm 1 describes a typical iteration r that returns either a separating hyperplane P or a hyperplane P_r that, although it does not successfully classify all the points, minimizes the number m of misclassified points through the adaptive choice of c. The Adaptive Reordering algorithm (Algorithm 2) rearranges X^T and d such that the first n points in X^T (forming a in the next iteration) satisfy the conditions: 1) they correspond to rows that have the lowest values in E, 2) a is nonsingular, and 3) they belong to the two classes. The details are presented in Algorithm 2, and a sketch of this reordering step follows the listing. The complete AILC algorithm is presented in Algorithm 3.

Algorithm 2. Adaptive Reordering (n, N, ε, X^T, d, E_2)
1. Form the vector E as the augmentation of E_1 (all its n entries equal ε) and E_2.
2. Form the vector F such that its entries are the row numbers of E when it is sorted in ascending order.
3. Set a(n, n) = zero matrix, da(n) = zero vector, flag(N) = zero vector.
4. Set i = 1, j = 1.
5. while i < n
6.   while j ≤ N
       I. k = F(j), a_i^T = x_k^T.
       II. if rank(first i rows of a) = i, then set da(i) = d(k); flag(k) = i; break; end.
       III. j = j + 1; go to step 6.
7.   i = i + 1; go to step 5.
8. i = n.
9. while j ≤ N
     I. k = F(j), a_i^T = x_k^T.
     II. if d(k) ≠ da(n - 1) and rank(first i rows of a) = i, then set da(i) = d(k); flag(k) = i; break; end.
     III. j = j + 1; go to step 9.
10. for each 1 ≤ k ≤ N, if flag(k) = i ≠ 0, set x_k^T = x_i^T, d(k) = d(i).
11. for i = 1 to n, set x_i^T = a_i^T, d(i) = da(i).
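The following NumPy sketch mirrors the intent of Algorithm 2 in a simplified form (our reading, not the authors' code): rank all points by their e_i values and pick the n lowest-ranked ones that are linearly independent and cover both classes.

```python
import numpy as np

def adaptive_reorder(X, d, E, n):
    """Return indices of n points with the smallest e_i whose rows are linearly
    independent and which include both classes (simplified view of Algorithm 2)."""
    order = np.argsort(E)                     # lowest (most misclassified) first
    chosen = []
    for k in order:
        trial = chosen + [int(k)]
        if np.linalg.matrix_rank(X[trial]) < len(trial):
            continue                          # keep the matrix a nonsingular
        if len(trial) == n and len(set(d[trial])) < 2:
            continue                          # last point must bring in the other class
        chosen = trial
        if len(chosen) == n:
            break
    return np.array(chosen)
```

The selected rows then form the matrix a (and the corresponding entries of d form D_1) for the next call to the iteration step.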
4 Numerical Illustration

In this section, the use of the AILC algorithm is demonstrated by three linearly separable (LS) examples.

The first is a 2D-classification problem for which successive iterations are visualized to illustrate the adaptive feature and convergence behavior of the algorithm; the influence of the value of ε and of the ordering of the input data on convergence is also discussed numerically. The second example is a 3D-classification problem in R^3, while the third is a 4D-classification problem in R^4 in which the standard benchmark classification dataset IRIS [12] is arranged as two LS classes.

Algorithm 3. AILC (N, n, X^T, d, ε, rmax; c, W, r, m)
Input: N, n, the N x n array X^T, the N x 1 class identifier array d, the maximum number of iterations rmax, and the parameter ε.
Output: a hyperplane (c, W), the iteration count r, and the number of misclassified points m.
1. Arrange X^T, d such that the first n rows of X^T form a nonsingular n x n matrix a.
2. Set m_0 = N, r = 1.
3. while r ≤ rmax
   a) Form the partitioned matrices a, b, D_1, D_2, then compute a^{-1} (see (12)).
   b) Call Iteration r (a^{-1}, b, D_1, D_2; c, W, E_2, m).
   c) if m = 0 (successful separation), return c, W, r, m; break; end.
   d) else if m < m_0: set m_0 = m, c_opt = c, W_opt = W, r_opt = r.
   e) Call Adaptive Reordering (n, N, ε, X^T, d, E_2).
   f) r = r + 1.
   g) go to step 3.
4. return the data of the hyperplane with the minimum number of misclassified points: c = c_opt, W = W_opt; also return m = m_0, r = r_opt.
end.

Example 1. A 2D-classification problem consisting of two classes A (black circles) and B (triangles):
A = {(4, 3), (0, 4), (2, 1.6), (7, 3), (3, 4), (4, 3), (3, 2)},
B = {(4, 4), ( 3, 0), ( 6, 1), (1, 0), (1, 0.5), (0, 7), (6, 2)}.
The points of the two classes A and B are plotted in Fig.3(a), showing the difficulty of classifying these data. Circles around the starting two points ( 4, 3), (4, 4) are also shown. Figs. 3(b)-3(d) show the application of our algorithm to this problem with ε = . The weight vector and threshold computed after each iteration are shown in Table 1.

Fig.3. 2D plot of the two-class classification problem (class A: black circles, class B: triangles). Squares indicate the worst points after iterations 1 and 2. (a) Original dataset. (b) After iteration 1. (c) After iteration 2. (d) After iteration 3.

Table 1. Weight vectors and threshold values obtained by executing the algorithm
  i (iteration)   W (weight vector)    c (threshold)
  1               (1.6375, 2)          1
  2               ( , )                1
  3               ( 0.3, 1.175)        1

To discuss the dependence of the proposed algorithm on the starting n points and on the parameter ε, we solved the previous example again, starting with another two points ( 3, 0), (4, 4) and selecting ε = 0.4. The number of iterations changes: two iterations were required to classify these difficult data, although the starting points belong to the same class (B). The results are presented in Table 2 and Fig.4. It is worth mentioning that no more than 4 iterations were needed to solve this classification problem, irrespective of the starting points, for 0 < ε < 0.5.

Fig.4. Classification of the same classes (black circles = class A, triangles = class B) represented in Fig.3(a) when starting with ( 3, 0), (4, 4) and ε = 0.4. The worst points after the first iteration are enclosed in squares.

Table 2. Weight vectors and threshold values obtained by executing the algorithm
  i (iteration)   W (weight vector)    c (threshold)
  1               (0.4667, )           1
  2               ( , )                1
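Before moving to the 3D and 4D examples, here is a compact sketch of how the overall loop of Algorithm 3 might combine the per-iteration solve and the adaptive reordering. It is our own illustration, not the authors' code, and it reuses the hypothetical helpers ailc_iteration and adaptive_reorder sketched earlier.

```python
import numpy as np

def ailc(X, d, eps=0.3, rmax=100):
    """Adaptive iterative linear classifier (sketch of Algorithm 3).
    X: (N, n) data, d: +1/-1 labels. Returns (c, W, iterations, misclassified).
    Assumes the first n rows of the current ordering are linearly independent."""
    N, n = X.shape
    idx = np.arange(N)                        # current ordering of the points
    best = (None, None, 0, N)                 # (c, W, r, m) with fewest errors so far
    for r in range(1, rmax + 1):
        a, b = X[idx[:n]], X[idx[n:]]
        d1, d2 = d[idx[:n]], d[idx[n:]]
        c, W, E2, m = ailc_iteration(a, b, d1, d2, eps)
        if m < best[3]:
            best = (c, W, r, m)
        if m == 0:                            # separating hyperplane found
            break
        E = np.concatenate([np.full(n, eps), E2])
        chosen = adaptive_reorder(X[idx], d[idx], E, n)
        idx = np.concatenate([idx[chosen], np.delete(idx, chosen)])
    return best
```

Exact iteration counts of such a sketch depend on the initial ordering and on the chosen ε, as the examples in this section illustrate.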

Example 2. The algorithm presented in Algorithm 3 was tested by applying it to an LS 3D-classification problem consisting of two classes A and B:
A = {(1, 4.5, 1), (2, 4, 3), (6, 5, 4), (4, 6, 5), (4, 5, 6), (1, 3, 1)},
B = {(0, 4, 0), (2, 4, 3), ( 4, 4, 2), ( 3, 4, 4), ( 2, 3, 3), ( 4, 4, 1)}.
Starting with the points (0, 4, 0), (1, 4.5, 1), (2, 4, 3) and choosing ε = 0.3, the algorithm was applied to classify these data. Two iterations were sufficient to solve this classification problem, as shown in Fig.5. The situations after the first and second iterations are shown in Figs. 5(a) and 5(b), respectively; in each case, the graph was rotated so that the view is perpendicular to the separating plane. After the first iteration, the points (2, 4, 3), (1, 3, 1), (0, 4, 0) were found to be the worst. The results of the iterations are summarized in Table 3.

Fig.5. Original dataset and the constructed hyperplanes for the 3D problem of Example 2. (a) After the first iteration. (b) After the second iteration.

Table 3. Weight vectors and threshold values obtained by executing the algorithm
  i (iteration)   W (weight vector)         c (threshold)
  1               (0.9375, 0.175, 0.425)    1
  2               (0.465, 0.175, 0.31)      1

Example 3. The IRIS dataset [12] classifies a plant as Iris Setosa, Iris Versicolour, or Iris Virginica. The dataset describes every iris plant by four input parameters (sepal length, sepal width, petal length, and petal width) and contains a total of 150 samples, 50 for each of the three classes. Publications that used only the samples belonging to the Iris Versicolour and Iris Virginica classes include Fisher [13] (1936), Dasarathy (1980), Elizondo (1997), and Gates (1972). Although the IRIS dataset is nonlinearly separable, it is known that all the samples of the Iris Setosa class are linearly separable from the rest of the samples (Iris Versicolour and Iris Virginica). Therefore, in this example, a linearly separable dataset was constructed from the IRIS dataset such that the samples belonging to the Iris Versicolour and Iris Virginica classes were grouped into one class and the Iris Setosa samples were taken as the other class. Thus, a linearly separable 4D-classification problem with 100 points in class A and 50 points in class B was considered. Using the proposed algorithm with ε = 0.5, the data were completely classified after two iterations; the results are collected in Table 4.

Table 4. Weight vectors and threshold values for the IRIS classification problem
  i (iteration)   W (weight vector)     c (threshold)
  1               (0, 0, 0, 2.5)        1
  2               (0.3763, , , )        1

5 Classification of Quadratically Separable Sets

Two classes A and B are said to be quadratically separable if there exists a quadratic polynomial P_2(y) = 0, y in R^m, such that P_2(y) > 0 if y in A and P_2(y) < 0 if y in B. In R^2, a general quadratic polynomial can be put in the form

w_1 y_1^2 + w_2 y_2^2 + w_3 y_1 y_2 + w_4 y_1 + w_5 y_2 + c = 0,   (18)

which represents a conic section (parabola, ellipse, or hyperbola, depending on the values of the coefficients w_i). Now, consider a mapping φ: R^2 -> R^5 such that a point y = (y_1, y_2) in R^2 is mapped into a point x in R^5 with components

x_1 = y_1^2, x_2 = y_2^2, x_3 = y_1 y_2, x_4 = y_1, x_5 = y_2.   (19)

Using this mapping, P_2(y) = 0 is transformed into a hyperplane x^T w + c = 0 in R^5. The transformed linear classification problem can be solved by the AILC algorithm to obtain w and c, and hence the quadratic polynomial P_2(y) = 0 is determined. Generally, a quadratic polynomial in R^m can be transformed into a hyperplane in R^n with n = m + m(m + 1)/2.
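As an illustration of the mapping φ in (19), the short sketch below (ours, not from the paper; the helper name quadratic_features is hypothetical) lifts 2D points into R^5 so that any linear classifier, such as the AILC sketched above, can separate quadratically separable classes.

```python
import numpy as np

def quadratic_features(Y):
    """Map 2D points y = (y1, y2) to (y1^2, y2^2, y1*y2, y1, y2) as in eq. (19)."""
    y1, y2 = Y[:, 0], Y[:, 1]
    return np.column_stack([y1**2, y2**2, y1 * y2, y1, y2])

def classify_conic(Y, W, c):
    """A hyperplane (W, c) found in the mapped R^5 space defines the conic
    W[0]*y1^2 + W[1]*y2^2 + W[2]*y1*y2 + W[3]*y1 + W[4]*y2 + c = 0 in the plane;
    points are classified by the sign of that expression."""
    return np.sign(quadratic_features(Y) @ W + c)
```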

Quadratic polynomials in R^3 represent surfaces such as ellipsoids, paraboloids, hyperboloids, and cones. Although the algorithm is applicable in higher dimensions, we present an example in R^2 for convenience of visualization. A set of points belonging to two classes (black and red +) is presented in Fig.6. The mapping φ defined by (19) is used to generate the R^5 coordinates corresponding to the input data points, and the AILC algorithm is used to solve the transformed linearly separable problem with two different values, ε = 0.4 and ε = 0.5. For each of these values, the resulting quadratic curve is plotted in blue. Although the algorithm successfully classified the points in both cases, it shows sensitivity to the value of ε: for ε = 0.4, five iterations were required to converge to a parabola (see Fig.6(a)), while eleven iterations were needed for ε = 0.5 to converge to the hyperbola shown in Fig.6(b). Moreover, the algorithm may diverge for other ranges of values, in contrast to the linearly separable classification problems, where very few iterations (1-3) were sufficient for convergence for 0 < ε < 0.5.

Fig.6. Classification by a conic section using different values of ε. (a) ε = 0.4. (b) ε = 0.5.

For the difficult dataset presented in Fig.7, the application of the algorithm produces the separating ellipse shown there.

Fig.7. Application of the algorithm produces an ellipse for the quadratically separable data.

6 Numerical Results

In this section we discuss the performance of the AILC algorithm compared with other learning algorithms on linearly and nonlinearly separable practical and benchmark datasets.

6.1 Classification of Linearly Separable Datasets

For the evaluation of the AILC algorithm, the following linearly separable datasets were chosen, including the benchmark dataset IRIS [12] and some randomly generated datasets.
1) IRIS: a full description of the IRIS dataset is given in Section 4 (Example 3). Here, we consider two classes: Iris Setosa (50 samples) versus non-Setosa (the remaining 100 samples belonging to Iris Versicolour and Iris Virginica).
2) G ; 3) G ; 4) G ; 5) G ; 6) G : randomly generated datasets. The following procedure describes the automatic generation of these data (see the sketch after Table 5). Generate a random array consisting of N rows and n columns as the input matrix X^T. To define the class identifier d, first generate a random vector of length n + 1 for the weight W and c; then, for 1 ≤ i ≤ N, compute b_i = x_i^T W + c and define d_i as +1 or -1 according to whether b_i > δ or b_i < -δ, where δ is a small positive number that preserves a margin between the two generated sets. The generated data consist of X^T and d in the form of an N x (n + 1) array. Table 5 gives a summary of the datasets used.

Table 5. Description of the benchmark and randomly generated linearly separable datasets
          Samples N    Features n
  IRIS
  G
  G
  G
  G
  G
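A minimal sketch of the data-generation procedure just described is given below. It is our own illustration (the function name generate_ls_dataset is hypothetical), under the assumption that candidate points falling inside the margin band |b_i| ≤ δ are simply redrawn.

```python
import numpy as np

def generate_ls_dataset(N, n, delta=0.1, rng=None):
    """Generate a random linearly separable dataset: N points with n features,
    labeled by a random hyperplane (W, c) with a margin of at least delta."""
    rng = np.random.default_rng(rng)
    W, c = rng.uniform(-1, 1, n), rng.uniform(-1, 1)
    X, d = [], []
    while len(X) < N:
        x = rng.uniform(-1, 1, n)
        b = x @ W + c
        if b > delta:
            X.append(x); d.append(1)
        elif b < -delta:
            X.append(x); d.append(-1)       # points inside the margin are redrawn
    return np.array(X), np.array(d)
```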

In the next experiment, these linearly separable datasets are used to evaluate the performance of our proposed algorithm AILC and of other machine learning algorithms, including a decision tree, a support vector machine, and a radial basis function network. A summary of these algorithms is given in Table 6. We compared our results with the implementations in WEKA [14-15].

Table 6. Summary of machine learning algorithms used to produce the results of Tables 7 and 9
  J48       Decision tree learner
  RBF       Radial basis function network
  MLP (L)   Multilayer perceptron (back-propagation neural network) with L hidden layers
  SMO (d)   Sequential minimal optimization algorithm for support vector classification with a polynomial kernel of degree d
  AILC (d)  Proposed adaptive iterative linear (d = 1) and quadratic (d = 2) classifier

For each dataset, the full data were used to train the different algorithms and obtain the best separating hyperplane. The number of misclassified samples, if any, is reported in Table 7. In addition, for AILC, the number of iterations required to obtain the separating hyperplane is given in parentheses.

Table 7. Results for the empirical comparison showing the number of misclassified instances (for AILC, the number of iterations is given in parentheses)
          J48    SMO (1)    RBF    AILC (1)
  IRIS                              0 (2)
  G                                 0 (3)
  G                                 0 (4)
  G                                 0 (2)
  G                                 0 (42)
  G                                 0 (255)

One can easily conclude (from Table 7 and many other experiments not reported here) that, although the number of required iterations increases significantly with the number of features n, it is nearly independent of the number of samples N. Being independent of N shows the strength of the adaptive technique, while the significant dependence on n is a weakness of the proposed technique; it results from the assumption that the chosen n points must lie at an equal, prescribed distance from the hyperplane. However, the proposed algorithm succeeded in separating all these datasets, whereas the other algorithms did not.

6.2 Behavior of Algorithm AILC on Nonlinearly Separable Datasets

In this subsection, we discuss the behavior of the proposed adaptive iterative linear classifier when the dataset is nonlinearly separable, and we present a comparison among this algorithm, a decision tree, a back-propagation neural network, and support vector machines.

6.2.1 Datasets Used for Empirical Evaluation

For an empirical evaluation of the AILC algorithm on nonlinearly separable data, we chose five datasets from the UCI machine learning repository [12] for binary classification tasks.
1) Breast-Cancer (BC). We used the original Wisconsin breast cancer dataset, which consists of 699 samples of breast-cancer medical data in two classes. Sixteen examples containing missing values were removed. 65.5% of the samples come from the majority class.
2) Pima Indian Diabetes (DI). This dataset contains 768 samples with eight attributes (features) each, plus a binary class label.
3) Ionosphere (IO). This database contains 351 samples of radar return signals from the ionosphere. Each sample consists of 34 real-valued attributes plus binary class information.
4) IRIS. A full description of the IRIS dataset is given in Section 4 (Example 3). Here, only the 100 samples belonging to the Iris Versicolour and Iris Virginica classes are considered.
5) Sonar (SN). The sonar database is a high-dimensional dataset describing sonar signals with 60 real-valued attributes. The dataset contains 208 samples.

Table 8 gives an overview of the datasets used.
The numbers in brackets show the original size of a dataset before the examples containing missing values were removed.

Table 8. Numerical description of the benchmark datasets used for empirical evaluation
          Samples (Instances)   Majority Class (%)   Features (Attributes)
  BC      683 (699)             65.5                 9
  DI      768                                        8
  IO      351                                        34
  IRIS    100                                        4
  SN      208                                        60

There exist many different techniques for evaluating the performance of learning techniques on data with a limited number of samples. The stratified ten-fold cross-validation technique is gaining ascendancy and is probably the evaluation method of choice in most practical limited-data situations. In this technique, the data are divided randomly into ten parts, in each of which the classes are represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. Thus the learning procedure is executed a total of ten times on different training sets (which have a lot in common), and the ten error estimates are finally averaged to yield an overall error estimate.
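The evaluation protocol just described can be expressed compactly. The sketch below is ours (not tied to the WEKA implementations used in the paper); it uses scikit-learn's StratifiedKFold, with placeholder train/predict callables standing in for any of the compared classifiers.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_error(X, d, train, predict, folds=10, seed=0):
    """Average error rate over stratified k-fold cross validation.
    train(X, d) -> model; predict(model, X) -> array of +1/-1 labels."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in skf.split(X, d):
        model = train(X[train_idx], d[train_idx])     # fit on nine-tenths
        pred = predict(model, X[test_idx])            # test on the held-out tenth
        errors.append(np.mean(pred != d[test_idx]))
    return float(np.mean(errors))                     # overall error estimate
```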

In this study, the cross-validation technique was applied to the benchmark datasets of Table 8 to compare the performance of our proposed algorithm AILC with that of other machine learning algorithms, including a decision tree, a back-propagation neural network, and support vector machines (see Table 6). We compared our results with the implementations in WEKA [14-15]. The results of the comparison are summarized in Table 9, where the number of misclassified instances and, in parentheses, the classification accuracy are given.

Table 9. Results for the empirical comparison showing the number of misclassified instances and accuracy on the test set using 10-fold cross validation
            BC            DI             IO            IRIS       SN
  J48       32 (95.31%)   196 (74.48%)   34 (90.31%)   6 (94%)    60 (71.15%)
  MLP (3)   36 (94.73%)   181 (76.43%)   31 (91.17%)   7 (93%)    41 (80.28%)
  SMO (1)   21 (96.93%)   179 (76.69%)   44 (87.46%)   6 (94%)    50 (75.96%)
  SMO (2)   24 (96.49%)   171 (77.73%)   33 (90.60%)   7 (93%)    37 (82.21%)
  AILC (1)  37 (94.58%)   199 (74.09%)   69 (80.34%)   6 (94%)    71 (65.87%)
  AILC (2)                                             4 (96%)

Although AILC is a linear classifier, it produces reasonable results even on nonlinearly separable datasets. Again, as in the linearly separable case (Subsection 6.1), one can conclude that the performance of AILC is independent of the number of samples N but degrades as the feature dimension n increases. Note that for the IRIS dataset, where n = 4, AILC is as accurate as the SVM with a polynomial kernel of degree 1, and AILC (2) outperforms the SVM with a polynomial kernel of degree 2. For the datasets BC (n = 9) and DI (n = 8), comparable results are obtained even though N is large (see Table 8). On the other hand, less satisfactory results are obtained for IO (n = 34) and SN (n = 60).

7 Conclusions

A fast adaptive iterative algorithm, AILC, for classifying linearly separable data was presented. In a binary classification problem containing N samples with n features, the main idea of the algorithm is to adaptively choose a subset of n samples and construct a hyperplane that separates the n samples at a margin ε and best classifies the remaining points. This process is repeated until a separating hyperplane is obtained. If such a hyperplane is not obtained within the prescribed number of iterations, the algorithm returns the hyperplane that misclassifies the fewest samples. Further, a quadratically separable classification problem can be mapped from its physical space to a larger space in which the problem becomes linearly separable.
From the various numerical illustrations and the comparisons with other classification algorithms on benchmark datasets, one can conclude that: 1) the algorithm is fast due to its adaptive feature; 2) the complexity of the algorithm is C_1 N + C_2 n^2, where C_1 and C_2 are independent of N, which ensures excellent performance especially when n is small; 3) the assumption that the n samples must lie at a prescribed margin from the hyperplane is restrictive and makes the convergence rate dependent on n; moreover, the user must provide the prescribed parameter ε, which is problem dependent; 4) the convergence rate of AILC is measured either by the number of iterations required to obtain the separating hyperplane or by the number of misclassified samples after the prescribed number of iterations; theoretical and numerical results show that convergence rates are nearly independent of N but degrade as n increases, and usually very few iterations are sufficient for convergence when n is small.

Although reasonable results were obtained, convergence depends strongly on the value of n and on the prescribed parameter ε. Other algorithms are under development to predict the value of ε that ensures a maximum margin for the n points. Moreover, the classification problem as formulated in Section 3 may be developed into a linear programming algorithm that determines ε as an n-valued vector, rather than a scalar, and produces the hyperplane with maximum margin.

References
[1] Duda R O, Hart P E, Stork D G. Pattern Classification. New York: Wiley-Interscience.
[2] Theodoridis S, Koutroumbas K. Pattern Recognition. Academic Press, an imprint of Elsevier.
[3] Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. Cambridge University Press.
[4] Atiya A. Learning with kernels: Support vector machines, regularization, optimization, and beyond. IEEE Transactions on Neural Networks, 2005, 16(3): 781.

[5] Rosenblatt F. Principles of Neurodynamics. Spartan Books.
[6] Taha H A. Operations Research: An Introduction. Macmillan Publishing Co., Inc.
[7] Zurada J M. Introduction to Artificial Neural Systems. Boston: PWS Publishing Co., USA.
[8] Barber C B, Dobkin D P, Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 1996, 22(4).
[9] Tajine M, Elizondo D. New methods for testing linear separability. Neurocomputing, 2002, 47(1-4).
[10] Elizondo D. Searching for linearly separable subsets using the class of linear separability method. In Proc. IEEE-IJCNN, Budapest, Hungary, Jul. 2004.
[11] Elizondo D. The linear separability problem: Some testing methods. IEEE Transactions on Neural Networks, 2006, 17(2).
[12] UCI Machine Learning Repository. Accessed Mar. 31.
[13] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7.
[14] WEKA, http://www.cs.waikato.ac.nz/ml/weka/. Accessed May 1.
[15] Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Rasha M. Abo-Bakr was born in 1976 in Egypt. She received her Bachelor's degree from the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University, Egypt, and her Master's degree in computer science in 2003, with a thesis titled "Computer Algorithms for System Identification". Since 2003 she has been an assistant lecturer at the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University. She received her Ph.D. degree in mathematics and computer science from Zagazig University in 2011, with a dissertation titled "Symbolic Modeling of Dynamical Systems Using Soft Computing Techniques". Her research interests are artificial intelligence, soft computing technologies, and astronomy.

Mohamed Abdel-Kawy Mohamed Ali Soliman received the B.S. degree in electrical and electronic engineering from the Military Technical College (M.T.C.), Cairo, Egypt, with grade Excellent, in 1974, the M.S. degree in electronic and communications engineering from the Faculty of Engineering, Cairo University, Egypt, in 1985, with research on observers in modern control systems theory, and the Ph.D. degree in aeronautical engineering, with the thesis "Intelligent Management for Aircraft and Spacecraft Sensors Systems". He is currently head of the Department of Computer and Systems Engineering, Faculty of Engineering, Zagazig University. His research interests lie at the intersection of the general fields of computer science and engineering, brain science, and cognitive science.


KBSVM: KMeans-based SVM for Business Intelligence Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence

More information

Week 3: Perceptron and Multi-layer Perceptron

Week 3: Perceptron and Multi-layer Perceptron Week 3: Perceptron and Multi-layer Perceptron Phong Le, Willem Zuidema November 12, 2013 Last week we studied two famous biological neuron models, Fitzhugh-Nagumo model and Izhikevich model. This week,

More information

Efficient Pairwise Classification

Efficient Pairwise Classification Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization

More information

A Lazy Approach for Machine Learning Algorithms

A Lazy Approach for Machine Learning Algorithms A Lazy Approach for Machine Learning Algorithms Inés M. Galván, José M. Valls, Nicolas Lecomte and Pedro Isasi Abstract Most machine learning algorithms are eager methods in the sense that a model is generated

More information

Argha Roy* Dept. of CSE Netaji Subhash Engg. College West Bengal, India.

Argha Roy* Dept. of CSE Netaji Subhash Engg. College West Bengal, India. Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Training Artificial

More information

Support Vector Machines and their Applications

Support Vector Machines and their Applications Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology

More information

Machine Learning with MATLAB --classification

Machine Learning with MATLAB --classification Machine Learning with MATLAB --classification Stanley Liang, PhD York University Classification the definition In machine learning and statistics, classification is the problem of identifying to which

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

Machine Learning: Algorithms and Applications Mockup Examination

Machine Learning: Algorithms and Applications Mockup Examination Machine Learning: Algorithms and Applications Mockup Examination 14 May 2012 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students Write First Name, Last Name, Student Number and Signature

More information

The Generalisation of the Recursive Deterministic Perceptron

The Generalisation of the Recursive Deterministic Perceptron 006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-1, 006 The Generalisation of the Recursive Deterministic Perceptron David Elizondo,

More information

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. Machine Learning: Support Vector Machines: Linear Kernel Support Vector Machines Extending Perceptron Classifiers. There are two ways to

More information

Basis Functions. Volker Tresp Summer 2016

Basis Functions. Volker Tresp Summer 2016 Basis Functions Volker Tresp Summer 2016 1 I am an AI optimist. We ve got a lot of work in machine learning, which is sort of the polite term for AI nowadays because it got so broad that it s not that

More information

Lecture 7: Support Vector Machine

Lecture 7: Support Vector Machine Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each

More information

Machine Learning and Pervasive Computing

Machine Learning and Pervasive Computing Stephan Sigg Georg-August-University Goettingen, Computer Networks 17.12.2014 Overview and Structure 22.10.2014 Organisation 22.10.3014 Introduction (Def.: Machine learning, Supervised/Unsupervised, Examples)

More information

Nelder-Mead Enhanced Extreme Learning Machine

Nelder-Mead Enhanced Extreme Learning Machine Philip Reiner, Bogdan M. Wilamowski, "Nelder-Mead Enhanced Extreme Learning Machine", 7-th IEEE Intelligent Engineering Systems Conference, INES 23, Costa Rica, June 9-2., 29, pp. 225-23 Nelder-Mead Enhanced

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Robust 1-Norm Soft Margin Smooth Support Vector Machine

Robust 1-Norm Soft Margin Smooth Support Vector Machine Robust -Norm Soft Margin Smooth Support Vector Machine Li-Jen Chien, Yuh-Jye Lee, Zhi-Peng Kao, and Chih-Cheng Chang Department of Computer Science and Information Engineering National Taiwan University

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007,

More information

Practice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

Practice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE Practice EXAM: SPRING 0 CS 6375 INSTRUCTOR: VIBHAV GOGATE The exam is closed book. You are allowed four pages of double sided cheat sheets. Answer the questions in the spaces provided on the question sheets.

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

5. GENERALIZED INVERSE SOLUTIONS

5. GENERALIZED INVERSE SOLUTIONS 5. GENERALIZED INVERSE SOLUTIONS The Geometry of Generalized Inverse Solutions The generalized inverse solution to the control allocation problem involves constructing a matrix which satisfies the equation

More information

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane

More information

Classification using Weka (Brain, Computation, and Neural Learning)

Classification using Weka (Brain, Computation, and Neural Learning) LOGO Classification using Weka (Brain, Computation, and Neural Learning) Jung-Woo Ha Agenda Classification General Concept Terminology Introduction to Weka Classification practice with Weka Problems: Pima

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Wrapper Feature Selection using Discrete Cuckoo Optimization Algorithm Abstract S.J. Mousavirad and H. Ebrahimpour-Komleh* 1 Department of Computer and Electrical Engineering, University of Kashan, Kashan,

More information

Support Vector Machines (a brief introduction) Adrian Bevan.

Support Vector Machines (a brief introduction) Adrian Bevan. Support Vector Machines (a brief introduction) Adrian Bevan email: a.j.bevan@qmul.ac.uk Outline! Overview:! Introduce the problem and review the various aspects that underpin the SVM concept.! Hard margin

More information

Well Analysis: Program psvm_welllogs

Well Analysis: Program psvm_welllogs Proximal Support Vector Machine Classification on Well Logs Overview Support vector machine (SVM) is a recent supervised machine learning technique that is widely used in text detection, image recognition

More information

12 Classification using Support Vector Machines

12 Classification using Support Vector Machines 160 Bioinformatics I, WS 14/15, D. Huson, January 28, 2015 12 Classification using Support Vector Machines This lecture is based on the following sources, which are all recommended reading: F. Markowetz.

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Classification and Regression using Linear Networks, Multilayer Perceptrons and Radial Basis Functions

Classification and Regression using Linear Networks, Multilayer Perceptrons and Radial Basis Functions ENEE 739Q SPRING 2002 COURSE ASSIGNMENT 2 REPORT 1 Classification and Regression using Linear Networks, Multilayer Perceptrons and Radial Basis Functions Vikas Chandrakant Raykar Abstract The aim of the

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Version Space Support Vector Machines: An Extended Paper

Version Space Support Vector Machines: An Extended Paper Version Space Support Vector Machines: An Extended Paper E.N. Smirnov, I.G. Sprinkhuizen-Kuyper, G.I. Nalbantov 2, and S. Vanderlooy Abstract. We argue to use version spaces as an approach to reliable

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Maximum Margin Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

More information

Rule Based Learning Systems from SVM and RBFNN

Rule Based Learning Systems from SVM and RBFNN Rule Based Learning Systems from SVM and RBFNN Haydemar Núñez 1, Cecilio Angulo 2 and Andreu Català 2 1 Laboratorio de Inteligencia Artificial, Universidad Central de Venezuela. Caracas, Venezuela hnunez@strix.ciens.ucv.ve

More information

A Comparative Study of SVM Kernel Functions Based on Polynomial Coefficients and V-Transform Coefficients

A Comparative Study of SVM Kernel Functions Based on Polynomial Coefficients and V-Transform Coefficients www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 6 Issue 3 March 2017, Page No. 20765-20769 Index Copernicus value (2015): 58.10 DOI: 18535/ijecs/v6i3.65 A Comparative

More information

A Support Vector Method for Hierarchical Clustering

A Support Vector Method for Hierarchical Clustering A Support Vector Method for Hierarchical Clustering Asa Ben-Hur Faculty of IE and Management Technion, Haifa 32, Israel David Horn School of Physics and Astronomy Tel Aviv University, Tel Aviv 69978, Israel

More information

Using Decision Boundary to Analyze Classifiers

Using Decision Boundary to Analyze Classifiers Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision

More information

Morphological Image Processing

Morphological Image Processing Morphological Image Processing Binary image processing In binary images, we conventionally take background as black (0) and foreground objects as white (1 or 255) Morphology Figure 4.1 objects on a conveyor

More information

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER 1999 1271 Inverting Feedforward Neural Networks Using Linear and Nonlinear Programming Bao-Liang Lu, Member, IEEE, Hajime Kita, and Yoshikazu

More information

THE discrete multi-valued neuron was presented by N.

THE discrete multi-valued neuron was presented by N. Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 Multi-Valued Neuron with New Learning Schemes Shin-Fu Wu and Shie-Jue Lee Department of Electrical

More information

Reihe Informatik 10/2001. Efficient Feature Subset Selection for Support Vector Machines. Matthias Heiler, Daniel Cremers, Christoph Schnörr

Reihe Informatik 10/2001. Efficient Feature Subset Selection for Support Vector Machines. Matthias Heiler, Daniel Cremers, Christoph Schnörr Computer Vision, Graphics, and Pattern Recognition Group Department of Mathematics and Computer Science University of Mannheim D-68131 Mannheim, Germany Reihe Informatik 10/2001 Efficient Feature Subset

More information

Comparative Study of Instance Based Learning and Back Propagation for Classification Problems

Comparative Study of Instance Based Learning and Back Propagation for Classification Problems Comparative Study of Instance Based Learning and Back Propagation for Classification Problems 1 Nadia Kanwal, 2 Erkan Bostanci 1 Department of Computer Science, Lahore College for Women University, Lahore,

More information

Efficient Pruning Method for Ensemble Self-Generating Neural Networks

Efficient Pruning Method for Ensemble Self-Generating Neural Networks Efficient Pruning Method for Ensemble Self-Generating Neural Networks Hirotaka INOUE Department of Electrical Engineering & Information Science, Kure National College of Technology -- Agaminami, Kure-shi,

More information