
The Pseudo Gradient Search and a Penalty Technique Used in Classifications

Janyl Jumadinova
Advisor: Zhenyuan Wang
Department of Mathematics, University of Nebraska at Omaha, Omaha, NE 68182, USA

Abstract

The aim of this work is to use the pseudo gradient search to solve classification problems. In most classifiers, the goal is to reduce the misclassification rate, which is discrete. Since the pseudo gradient search is a local search, the objective function must be real-valued in order to apply it to a classification problem. A penalty technique is used for this purpose.

1. Introduction

Classification is an optimization problem whose objective is to minimize the number of misclassified data. Given a set of training data of size l with n predictive attributes, each new record needs to be classified. Various classification methods have been proposed, and good results have been achieved using nonlinear integrals, such as the Choquet integral, as the aggregation tool [8]. The use of weighted Choquet integrals with respect to fuzzy measures in classification was first proposed by Xu et al. [8]. There, the Choquet integral was used to project the data onto an optimal line so that the classification becomes one-dimensional. Since the projection is generally nonlinear, the classification is also nonlinear.

Yan et al. proposed nonlinear classification methods using linear programming and signed fuzzy measures [10] to account for linearly inseparable data. Other classification methods based on statistics and machine learning have also been proposed, such as decision tree [2], [3] and support vector machine [9] algorithms.

Classification plays an important role in fields such as medicine and manufacturing. For example, a disease must be correctly diagnosed before proper treatment can be given. In addition to making classification faster and cheaper, automating classification tasks can help eliminate human error.

A gradient search algorithm can be used to solve nonlinear optimization problems such as classification. It uses the partial derivatives of the function to pick the best direction for the search. When the objective function is not differentiable, a pseudo gradient search can be applied, in which differences are used instead to obtain the best direction.

The paper is organized as follows. In Section 2, background information on the classification problem is given. In Section 3, the pseudo gradient search and the penalty technique are discussed. In Section 4, the pseudo gradient search algorithm for the classification problem is presented. Section 5 describes some testing examples.

2. Classification

The goal of classification is to build a model of the classifying attribute based on the predictive attributes. This model can then be used to determine the class of a new observation. This paper uses the idea of the classification method based on nonlinear integrals discussed in [8]: project the points in the feature space onto the real axis through a nonlinear integral, and then classify these points optimally according to a certain criterion. Each point in the feature space becomes a value of the virtual variable ŷ_j, j = 1, ..., l. In this way, each classification boundary is just a point on the real axis.

Next, a few mathematical concepts are introduced. Let x_1, x_2, ..., x_n be the predictive attributes, so that X = {x_1, x_2, ..., x_n} is the feature space, and let P(X) denote the power set of X. Let (X, P(X)) be a measurable space and µ : P(X) → [0, ∞) a fuzzy measure satisfying the following conditions:

1) µ(∅) = 0 (vanishing at the empty set);
2) µ(A) ≤ µ(B) if A ⊆ B, for A, B ∈ P(X) (monotonicity).

In general µ is nonadditive, and it is called regular if µ(X) = 1. Its nonadditivity represents the interaction among the predictive attributes towards a certain objective attribute. An observation of the predictive attributes can be regarded as a function f : X → (−∞, ∞); the jth observation of attribute x_i is then f_{ji} = f_j(x_i), i = 1, 2, ..., n and j = 1, 2, ..., l.

The Choquet integral of a nonnegative function f is defined as

$$(c)\int f \, d\mu = \int_0^{\infty} \mu(F_\alpha) \, d\alpha,$$

where $F_\alpha = \{x \mid f(x) \geq \alpha\}$, for $\alpha \in [0, \infty)$, is a level set of f.

To calculate the Choquet integral, the following procedure is used:

$$(c)\int f \, d\mu = \sum_{j=1}^{2^n - 1} z_j \mu_j,$$

where

$$z_j = \begin{cases} \displaystyle\min_{i:\, \mathrm{frc}(j/2^i) \in [\frac{1}{2}, 1)} f(x_i) \;-\; \max_{i:\, \mathrm{frc}(j/2^i) \in [0, \frac{1}{2})} f(x_i), & \text{if this difference is } > 0 \text{ or } j = 2^n - 1; \\[4pt] 0, & \text{otherwise,} \end{cases}$$

for j = 1, 2, ..., 2^n − 1, where frc(j/2^i) is the fractional part of j/2^i, with the convention that the maximum over the empty set is zero. If we express j in binary form as j_n j_{n−1} ... j_1, then {i : frc(j/2^i) ∈ [1/2, 1)} = {i : j_i = 1} and {i : frc(j/2^i) ∈ [0, 1/2)} = {i : j_i = 0}; correspondingly, µ_j denotes µ({x_i : j_i = 1}).
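For concreteness, the decomposition above can be evaluated by looping over the 2^n − 1 nonempty subsets of X and reading off the bits of j. The following is a minimal illustrative sketch in Java, not the implementation reported later in this paper; the class name, the indexing of the array mu by the binary form of j (so that mu[j] = µ({x_i : j_i = 1})), and the toy values in main are assumptions made only for this example.

public final class ChoquetSketch {

    // Computes (c)∫ f dµ = Σ_{j=1}^{2^n−1} z_j µ_j for a nonnegative f on n attributes.
    // mu is indexed by subset: mu[j] = µ({x_i : bit i of j is 1}), with mu[0] = µ(∅) = 0.
    public static double choquet(double[] f, double[] mu) {
        int n = f.length;
        double sum = 0.0;
        for (int j = 1; j < (1 << n); j++) {
            double minIn = Double.POSITIVE_INFINITY; // min of f over {i : j_i = 1}
            double maxOut = 0.0;                     // max of f over {i : j_i = 0}; empty set -> 0
            for (int i = 0; i < n; i++) {
                if (((j >> i) & 1) == 1) {
                    minIn = Math.min(minIn, f[i]);
                } else {
                    maxOut = Math.max(maxOut, f[i]);
                }
            }
            double z = minIn - maxOut;
            if (z > 0 || j == (1 << n) - 1) {
                sum += z * mu[j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy check with n = 2: f = (0.4, 0.7), µ({x1}) = 0.3, µ({x2}) = 0.5, µ(X) = 1.
        double[] f  = {0.4, 0.7};
        double[] mu = {0.0, 0.3, 0.5, 1.0};
        System.out.println(choquet(f, mu)); // expected 0.4*1 + (0.7-0.4)*0.5 = 0.55
    }
}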

3. Pseudo Gradient Search and Penalty Technique

One way of solving a classification problem is to use a gradient search. We start at a point and then move in the direction that gives the largest increase in the value of the objective function f, i.e., the direction in which the directional derivative is largest. The gradient of the objective function is the vector of its first partial derivatives, and its norm is the magnitude of the gradient vector. When the objective function is not differentiable, the traditional gradient search fails. In such a case, we can replace the gradient with a pseudo gradient to determine the best search direction. The advantages of the pseudo gradient search are its fast convergence and the fact that the objective function does not have to be differentiable. Its disadvantage, as with any other local search, is that it may get trapped in a local minimum or maximum and fail to find the global one.

The goal of the pseudo gradient search in this paper is to reduce the number of misclassified observations, or ideally, to obtain no misclassified observations. The objective function in classification problems is usually the misclassification rate, which is discrete. However, to use a pseudo gradient search, the objective function has to be real-valued. A penalty technique can be applied to the classification problem to make the objective function real-valued. Penalty techniques are generally used to turn a constrained problem into an unconstrained one by penalizing infeasible solutions. There is no general guideline on how to design penalty functions; the design is usually problem-dependent. For the classification problem, it is convenient to express the penalty function in terms of the sum of the distances of the misclassified points from the boundary.

The pseudo gradient search works as follows. From the initial value we take a small step in the positive direction and a step of the same length in the negative direction, and we calculate the value of the penalized objective function associated with each step. The step that gives the smaller value of the penalized objective function determines the direction in which to move. Once the best search direction is determined, the step length is iteratively doubled in that direction. When the value of the penalized objective function starts increasing between steps, the direction is reversed and the step length is iteratively halved until the value of the penalized objective function increases again between iterations.
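As an illustration of this probe-and-double procedure, the sketch below applies one pseudo gradient step to a generic real-valued penalty, together with a penalty of the kind described above (a scaled sum of distances of the misclassified projections from the boundary). It is a simplified, assumed sketch rather than the exact algorithm of Section 4: the method names, the per-coordinate probing, and the two-class thresholding rule are illustrative choices.

import java.util.function.Function;

public final class PseudoGradientSketch {

    // One pseudo-gradient step on a generic real-valued penalty: probe ±delta in each
    // coordinate, build a direction from the observed differences, then double the step
    // along that direction while the penalty keeps decreasing.
    static double[] pseudoGradientStep(double[] x, Function<double[], Double> penalty, double delta) {
        int n = x.length;
        double p0 = penalty.apply(x);
        double[] dir = new double[n];
        for (int i = 0; i < n; i++) {
            double[] plus = x.clone();
            double[] minus = x.clone();
            plus[i] += delta;
            minus[i] -= delta;
            double pPlus = penalty.apply(plus);
            double pMinus = penalty.apply(minus);
            if (pPlus >= p0 && pMinus >= p0) {
                dir[i] = 0.0;                       // no improvement in this coordinate
            } else {
                dir[i] = (pPlus < pMinus) ? (p0 - pPlus) : -(p0 - pMinus);
            }
        }
        double[] best = x.clone();
        double bestPenalty = p0;
        double step = 1.0;
        while (true) {
            double[] candidate = new double[n];
            for (int i = 0; i < n; i++) candidate[i] = x[i] + step * dir[i];
            double p = penalty.apply(candidate);
            if (p >= bestPenalty) break;            // penalty started increasing: stop doubling
            best = candidate;
            bestPenalty = p;
            step *= 2.0;
        }
        return best;
    }

    // A penalty in the spirit of the description above: (m / l) * Σ |boundary − ŷ_i| over the
    // m misclassified points, assuming the two-class rule "class 1 if ŷ ≥ boundary".
    static double classificationPenalty(double[] yHat, int[] labels, double boundary) {
        int l = yHat.length;
        int m = 0;
        double sum = 0.0;
        for (int j = 0; j < l; j++) {
            int predicted = (yHat[j] >= boundary) ? 1 : 0;
            if (predicted != labels[j]) {
                m++;
                sum += Math.abs(boundary - yHat[j]);
            }
        }
        return (double) m / l * sum;
    }

    public static void main(String[] args) {
        // Toy penalty: distance from the point (1, -2); the step should move toward it.
        Function<double[], Double> toyPenalty = v -> Math.hypot(v[0] - 1.0, v[1] + 2.0);
        double[] next = pseudoGradientStep(new double[] {0.0, 0.0}, toyPenalty, 1e-6);
        System.out.println(next[0] + ", " + next[1]);
    }
}

In the algorithm of Section 4, this probing is applied to the vectors a and b with δ = 10^{-6}, and the doubling phase is followed by a halving phase once the penalty starts to increase.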

4. Pseudo Gradient Search Algorithm for the Classification Problem

Summary of the variables:

n: the number of attributes
l: the number of observations
m: the number of misclassified observations
δ: a small number, 10^{-6} in our case, used as the step length when testing for the best direction
ŷ_j: the virtual variable, used to project the points onto the real line
q: the vector of the objective attribute
a and b: vectors used in the multiregression
b*: the best boundary, i.e., the boundary that minimizes the number of misclassifications
p: the value of the penalty function
µ_k: denotes µ(A), where A = {x_i : k_i = 1} and k has the binary expression k = k_n k_{n−1} ... k_1
t: the running time in seconds

Algorithm:

1. Input: the number of attributes, n; the number of observations, l; and the data.

2. Initialize the vector q, where q_j = y_j, j = 1, ..., l, and y_j is the value of the objective attribute. Initialize vectors a and b by picking 2n standard uniform random numbers; the first n numbers are for vector a and the second n numbers are for vector b. These 2n numbers form the vector g.

3. Calculate a_i and b_i as follows:

$$a_i = \frac{g_i - \min_{1 \le k \le n} g_k}{(1 - g_i)\left(1 - \min_{1 \le k \le n} g_k\right)}, \qquad b_i = \frac{2g_{n+i} - 1}{\max_{1 \le k \le n} \left|2g_{n+k} - 1\right|}$$

for i = 1, ..., n, where a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) are n-dimensional vectors used to balance the various phases and scales of the predictive attributes. They should satisfy a_i ≥ 0 for i = 1, 2, ..., n with min_{1≤i≤n} a_i = 0, and −1 ≤ b_i ≤ 1 for i = 1, 2, ..., n with max_{1≤i≤n} |b_i| = 1.

4. Construct the matrix Z of dimension l × 2^n as follows: z_{j0} = 1 and

$$z_{jk} = \begin{cases} \displaystyle\min_{i:\,k_i = 1}(a_i + b_i f_{ji}) \;-\; \max_{i:\,k_i = 0}(a_i + b_i f_{ji}), & \text{if this difference is } > 0 \text{ or } k = 2^n - 1; \\[4pt] 0, & \text{otherwise,} \end{cases}$$

where j = 1, ..., l and k = 1, ..., 2^n − 1.

5. Apply the QR decomposition theorem to find the least squares solution of the system of linear equations Zv = q, where the unknown variables c, µ_1, µ_2, ..., µ_{2^n − 1} are the elements of v.

6. Calculate

$$\hat{y}_j = c + (c)\int (a + b f_j) \, d\mu = c + \sum_{k=1}^{2^n - 1} z_{jk} \mu_k, \qquad j = 1, \ldots, l,$$

where ŷ_j is the current estimate of the virtual variable of the objective attribute for the jth observation.

7. Find the best boundary, b*. Once the estimated ŷ values have been computed, we need to classify them. This is done by searching for the boundary that minimizes the number of misclassified points. We simplify this case by allowing only two values for the classifying attribute. For local search algorithms, it is important to pick a good starting point; therefore, the initial boundary is obtained from the ratio of the observations in class 1 and class 2. Then, for each candidate boundary, we classify the computed ŷ_j, and out of all candidate boundaries we pick the one that minimizes the number of misclassified points.

8. Calculate the initial penalty as

$$p_0 = \frac{m}{l} \sum_{i=1}^{m} \left| b^* - \hat{y}_i \right|,$$

where the sum runs over the m misclassified points and b* is the best boundary.

9. a. Take a step in the positive direction by adding δ to a_i, i = 1, ..., n, where δ = 10^{-6}.
   b. Repeat steps 3-7.
   c. Take a step in the negative direction by subtracting 2δ from a_i.
   d. Repeat steps 3-7.
   e. Compare the penalties obtained by stepping in the positive direction, p_{δ+}, and in the negative direction, p_{δ−}, to the initial penalty. If both are greater than or equal to p_0, then no step is taken. If p_{δ+} < p_{δ−}, then (p_0 − p_{δ+}) → ∆a_i; otherwise (−1)(p_0 − p_{δ−}) → ∆a_i, where ∆a_i monitors the change in each dimension of vector a.

10. Repeat the previous step for vector b; ∆b keeps track of the changes in each dimension of vector b.

11. Reset a and b to their original values.

12. Start doubling:
    a. Double the change vectors: 2∆a_i → ∆a_i for all i = 1, ..., n and 2∆b_i → ∆b_i for all i = n + 1, ..., 2n.
    b. If a_i < 0, set a_i = 0, since the values of vector a must be nonnegative.
    c. If b_i > 1, set b_i = 1; if b_i < −1, set b_i = −1, since the values of b must be between −1 and 1.
    d. Let p_l be the latest penalty; calculate p_l as in step 8.
    e. Repeat this step until p_l > p_0.

13. Reverse the directions of the change vectors (the doubling went too far) by making them negative.

14. Start halving:
    a. (1/2)∆a_i → ∆a_i for all i = 1, ..., n, and (1/2)∆b_i → ∆b_i for all i = n + 1, ..., 2n.
    b. If a_i < 0, set a_i = 0, since the values of vector a must be nonnegative.
    c. If b_i > 1, set b_i = 1; if b_i < −1, set b_i = −1, since the values of b must be between −1 and 1.
    d. Calculate the latest penalty p_l as in step 8.
    e. Repeat this step until p_l > p_0.

15. Reverse the directions of the change vectors (the halving went too far) by making them negative.

16. Repeat steps 3-7 to obtain new values for c, µ, ŷ_j, and p.

17. Let Max(∆a, ∆b) = max_{1≤i≤n} {|∆a_i|, |∆b_i|}. We need Max(∆a, ∆b) to check whether the largest change in any dimension of the change vectors was greater than δ.

18. If p_0 > 0, p_l > 0, and Max(∆a, ∆b) ≥ δ, then go on to the next step; otherwise go to the last step.

19. If p_l > p_0, go on to the next step; otherwise skip the next step.

20. The changes ∆a and ∆b have to be iteratively reduced until the penalty is smaller than the initial penalty.
    a. a_i − ∆a_i → â_i and b_i − ∆b_i → b̂_i for all i = 1, ..., n.
    b. If |â_i| > δ, then â_i / 2 → â_i and a_i + â_i → a_i for all i = 1, ..., n.
    c. If |b̂_i| > δ, then b̂_i / 2 → b̂_i and b_i + b̂_i → b_i for all i = 1, ..., n.
    d. Repeat steps 5, 6, 7, and 9.
    g. Let M = max_{1≤i≤n} {|â_i|, |b̂_i|}. If p > p_0 and M > δ, go to step 20b; otherwise continue with the next step.

21. Output p_l, m, and the running time t.

5. Simulation Results

The algorithm was coded in Java and run on a Pentium M 1.73 GHz computer. We ran the algorithm on the whole data set for each database (reclassification). The data sets used in our simulations are listed in Table 1 and described in more detail below.

Data name            Number of attributes    Size of the data set
Data from [8]        3                       200
Leptograpsus crabs   5                       200
Synthetic data       2                       1250
PIMA                 8                       768
Heart                13                      270
Credit Card          24                      1000

Table 1: Databases used in the simulations.

The classification results are summarized in Table 2. The best results are recorded based on 10 runs on each data set, unless perfect classification was obtained earlier.

Data name        Latest penalty   Accuracy   Running time (sec)
Data from [8]    0.54846          99%        2.015
Crabs            0                100%       8.820
Synthetic data   1.27107          93%        10.812
Pima Diabetes    10.24656         81%        11.515
Heart            7.58831          84%        40.016
Credit card      13.72654         77%        8.422

Table 2: Classification results.

Data from [8]: First, the artificial data presented in Tables IV and V of [8] were used for comparison purposes. We obtain nearly perfect classification.

Leptograpsus crabs data: This is data on the morphology of rock crabs of the genus Leptograpsus [6], [12]. There are 100 specimens of each of the two color forms (the two classes), blue or orange, evenly distributed between males and females. The attributes are:

1. FL - frontal lip of carapace (mm)
2. RW - rear width of carapace (mm)
3. CL - length along the midline of carapace (mm)
4. CW - maximum width of carapace (mm)
5. BD - body depth (mm)

After 5 runs, we are able to obtain perfect classification.

Synthetic data: This is the synthetic data from Ripley [6], available at [12]. It has two real-valued attributes, the x and y coordinates, and a class label, 0 or 1. The data set can be plotted in two-dimensional space, with attribute 1 and attribute 2 on the corresponding axes. The 1250 data points are evenly split between the two classes (625 in each). Red points (crosses) represent class 1 and blue points (circles) represent class 2.

Figure 1: The synthetic data set, plotted as attribute 1 versus attribute 2.

Figure 2 shows the results of the classification. For this data set, the best classification accuracy we obtain is around 93%; that is, 88 out of the 1250 points are misclassified.

Figure 2: Classification of the synthetic data set (attribute 1 versus attribute 2).

The bold lines in Figure 2 approximately show the classification obtained by the algorithm. Figure 3 shows the progress made by the algorithm over 10 runs.

Figure 3: Classification accuracy (%) over the 10 runs (ranging from 83.04% to 92.96%).

Pima Indians diabetes data: This data set concerns females at least 21 years old of Pima Indian heritage living near Phoenix, Arizona, tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases (Smith et al., 1988). Class 1 means the patient tested positive for diabetes and class 0 means negative. The attributes are:

1. npreg - number of pregnancies
2. glu - plasma glucose concentration in an oral glucose tolerance test
3. bp - diastolic blood pressure (mm Hg)
4. skin - triceps skin fold thickness (mm)
5. ins - serum insulin (micro U/ml)
6. bmi - body mass index (weight in kg/(height in m)^2)
7. ped - diabetes pedigree function
8. age - age in years

More information about this data set can be found in [11]. Our algorithm compares favorably with the algorithms in [5] and [2].

Heart data: This data set is from the well-known StatLog project [5] and is available at [11]. The database contains 13 attributes:

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The classes are absence or presence of heart disease. Our algorithm performs better than some other proposed algorithms, such as Genetic Programming and Decision Tree algorithms [1], [2], [3].

German credit card data: This is another data set taken from the StatLog project [5] and available at [11]. The classes are good and bad credit. We used the version of the data set with numerical attributes, which was edited by Strathclyde University, with several indicator variables added, to make it suitable for algorithms that take numerical values. The original attributes are:

1. Status of the existing checking account
2. Duration in months
3. Credit history
4. Purpose
5. Credit amount
6. Savings account/bonds

7. Length of present employment
8. Installment rate as a percentage of disposable income
9. Personal status and sex
10. Other debtors / guarantors
11. Number of years living in present residence
12. Property
13. Age
14. Other installment plans
15. Housing
16. Number of existing credits at this bank
17. Job
18. Number of people liable to provide maintenance for
19. Telephone
20. Foreign worker

The same data set was used in [5]. Our algorithm does slightly better than some of the existing algorithms [1], [5] for this database.

6. Conclusion

Our algorithm provides a fast way to classify data using a local, pseudo gradient search, by converting the objective function into a real-valued function with the help of the penalty. We notice premature convergence in some runs, that is, the algorithm converges before the misclassification rate is minimized. Overall, the algorithm produces good results: the misclassification rate is low and the convergence is fast.

7. References

[1] J. Eggermont, J. Kok, W. Kosters, Genetic Programming for Data Classification: Partitioning the Search Space, Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 1001-1005, 2004.
[2] J. R. Quinlan, Induction of Decision Trees, Machine Learning, pp. 81-106, 1986.
[3] M. Last, O. Maimon, A Compact and Accurate Model for Classification, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, pp. 203-215, 2004.

[4] M. Liu, Z. Wang, Classification Using Generalized Choquet Integral Projections, Proc. IFSA, pp. 421-426, 2005.
[5] D. Michie, D. Spiegelhalter, and C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.
[6] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[7] M. Spilde, Z. Wang, Solving Nonlinear Optimization Problems Based on Choquet Integrals by Using a Soft Computing Technique, Proc. IFSA, pp. 450-454, 2005.
[8] K. Xu, Z. Wang, P. A. Heng, K. S. Leung, Classification by Nonlinear Integral Projections, IEEE Transactions on Fuzzy Systems, Vol. 11, No. 2, pp. 187-201, 2003.
[9] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[10] N. Yan, Z. Wang, Y. Shi, Z. Chen, Nonlinear Classification by Linear Programming with Signed Fuzzy Measures, Proc. FUZZ-IEEE, pp. 1484-1489, 2006.
[11] http://www.ics.uci.edu/~mlearn/mlsummary.html
[12] http://www.stats.ox.ac.uk/pub/prnn