
ORG - Oblique Rules Generator

Marcin Michalak(1), Marek Sikora(1,2), and Patryk Ziarnik(1)

(1) Silesian University of Technology, ul. Akademicka 16, 44-100 Gliwice, Poland
{Marcin.Michalak,Marek.Sikora,Patryk.Ziarnik}@polsl.pl
(2) Institute of Innovative Technologies EMAG, ul. Leopolda 31, 40-189 Katowice, Poland

Abstract. In this paper a new approach to generating oblique decision rules is presented. On the basis of limitations on every parameter of an oblique decision rule, a grid of parameter values is created; for every node of this grid an oblique condition is generated and its quality is calculated. The best oblique conditions build the oblique decision rule. Conditions are added as long as there are non-covered objects and the limit on the length of the rule is not exceeded. All rules are generated with the idea of sequential covering.

Keywords: machine learning, decision rules, oblique decision rules, rule induction.

1 Introduction

Example-based rule induction is, apart from decision tree induction, one of the most popular techniques of knowledge discovery in databases. So-called decision rules are a special kind of rules. Sets of decision rules built by induction algorithms are usually designed for two basic aims. One is developing a classification system that exploits the determined rules. The other aim is describing patterns in an analyzed dataset. Beyond the number of algorithms that generate hyper-cuboidal decision rules, it is worth raising the question: aren't oblique decision rules more flexible in describing the nature of the data? On the one hand, every simple condition like "parameter less/greater than value" may be interpreted in an intuitive way; on the other hand, a linear combination of the parameters, "a_1 * parameter_1 ± a_2 * parameter_2 ± ... less/greater than a_0", may substitute several non-oblique decision rules at the cost of being a little less interpretable.
In this article we describe a method of generating oblique decision rules (Oblique Rules Generator, ORG), which is a kind of exhaustive search for oblique conditions in the space of oblique decision rule parameters. As oblique decision rules may be treated as a generalization of standard decision rules, the next part of the paper presents some achievements in the area of rule generalization. Then some basic notions that deal with oblique decision rules are presented. Afterwards the algorithm that generates oblique decision rules (ORG) is defined. The paper ends with a comparison of results obtained on several synthetic datasets of ours and some well-known datasets.

L. Rutkowski et al. (Eds.): ICAISC 2012, Part II, LNCS 7268, pp. 152-159, 2012. (c) Springer-Verlag Berlin Heidelberg 2012
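As a toy illustration of the point made in the introduction (this sketch is ours, not from the paper): for points labelled by the diagonal rule x + y >= 1, a single oblique condition reproduces the class exactly, while any single axis-parallel condition of the form x >= t cannot.

```python
import itertools

# Points on a 5x5 grid in [0,1]^2, labelled by the oblique rule x + y >= 1.
pts = [(i / 4, j / 4) for i, j in itertools.product(range(5), repeat=2)]
labels = [x + y >= 1 for x, y in pts]

# One oblique condition reproduces the labels exactly.
oblique = [x + y >= 1 for x, y in pts]

def axis_errors(t):
    """Errors of a single axis-parallel condition x >= t (y >= t behaves alike)."""
    return sum((x >= t) != lab for (x, y), lab in zip(pts, labels))

# Even the best threshold leaves some points misclassified.
best_axis = min(axis_errors(t) for t in [0.0, 0.25, 0.5, 0.75, 1.0])
```

Approximating the diagonal boundary with axis-parallel conditions would require several rules; the oblique condition does it in one.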

2 Related Works

The simplest method of generalization, used by all induction algorithms, is rule shortening, which consists in removing elementary conditions. Heuristic strategies (for example hill climbing) or exhaustive searching are applied here. Rules are shortened until the quality (e.g. precision) of the shortened rule drops below some fixed threshold. Such a solution was applied, inter alia, in the RSES system [2], where rules are shortened as long as the rule precision does not decrease. In the case of unbalanced data, introducing various threshold values for the quality of shortened rules leads to keeping better sensitivity and specificity of the obtained classifier. The other approach to rule generalization is concerned with decision rule joining algorithms, which consist in merging two or more similar rules [11,16]. In [16] an iterative joining algorithm relying on merging ranges occurring in corresponding elementary conditions of input rules is presented. The merging ends when a new rule covers all positive examples covered by the joined rules. Rule quality measures [1] are used for assessing the quality of the output rules. Paper [11] presents a similar approach, where rules are grouped before joining [10] or the similarity between rules is calculated, and rules belonging to the same group or sufficiently similar are joined. A special case of a rule joining algorithm is the algorithm proposed in [13], in which the authors introduce complex elementary conditions in rule premises. The complex conditions are linear combinations of the attributes occurring in simple elementary conditions of rule premises. The algorithm applies only to the special kind of rules obtained in the so-called dominance-based rough set model [8], and is not fit for aggregation of classic decision rules, in which ranges of elementary conditions can be bounded above and below simultaneously.
Finally, algorithms that make it possible to generate oblique elementary conditions during model construction are also worth mentioning. One deals here with algorithms of oblique decision tree induction [5,9,12]. A special case of getting a tree with oblique elementary conditions is the application of a linear SVM in the construction of the tree nodes [3]. For decision rules, an algorithm that enables oblique elementary conditions to appear during rule induction is ADReD [14]. Considering the obtained rules in terms of their descriptive power, we can say that, even though the number of elementary conditions in rule premises is usually smaller than in rules allowing no oblique conditions, an unquestionable disadvantage of these algorithms is the very complicated form of the elementary conditions, in which all conditional attributes are frequently used. Another approach introducing oblique elementary conditions in rule premises consists in applying constructive induction (especially data-driven constructive induction): new attributes depending on linear combinations of existing features are introduced, and rules are then determined by a standard induction algorithm [4,17] based on the attribute set extended this way.

3 Oblique Decision Rules

Fundamentals. Decision rules with oblique conditions assume a more complex form of descriptors than standard decision rules. The oblique condition is a

condition in which the plane separating decision classes is a linear combination of the conditional attributes a_i in A (elementary conditions), on the assumption that all of them are of numerical type: sum_{i=1..|A|} c_i a_i + c_0, where a_i in A and c_i, c_0 in R. The oblique condition can then be defined as:

sum_{i=1..|A|} c_i a_i + c_0 >= 0   or   sum_{i=1..|A|} c_i a_i + c_0 < 0

The oblique condition describes a hyperplane in the conditional attribute space. The condition of the rule determines which elements from the decision class are covered by the given rule. Each oblique decision rule is defined by an intersection of oblique conditions.

Parameters of the Descriptor and Their Ranges - the Analysis. Let us define the space of all hyperplanes which are single oblique conditions. An n-dimensional hyperplane can be described by a linear equation of the following general form:

A_1 x_1 + A_2 x_2 + ... + A_n x_n + C = 0

where A_i, C in R and at least one A_i is non-zero. In the proposed solution, instead of the general form, we can use the normal form of the hyperplane equation:

alpha_1 x_1 + alpha_2 x_2 + ... + alpha_n x_n - rho = 0

where the alpha_i are the direction cosines (alpha_1^2 + alpha_2^2 + ... + alpha_n^2 = 1) and rho is the distance of the hyperplane from the origin of the coordinate system. This notation makes it possible to limit the range of every parameter. To explain how to find the real value ranges of the descriptor parameters, we can consider a straight line in the plane defined by the following normal form:

x cos(theta) + y sin(theta) - rho = 0

where theta is the angle of depression to the x axis and rho is the distance between the line and the origin, as illustrated in Fig. 1. Every line in the plane corresponds to a point in the parameter space. Determination of a straight line in (theta, rho)-space can be realized by searching a chosen subset of that space with a grid method. The angle theta is naturally bounded, so it can be defined as theta in [0, 2*pi); it is enough to determine a grid step for this variable. It is also possible to bound the values of the parameter rho.

Fig. 1. The normal parameters for a line
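The normal-form parametrisation and the grid over (theta, rho) can be sketched in a few lines. The sketch below is ours, not the authors' implementation: the function names and the use of NumPy are assumptions, and rho_upper_bound anticipates the bound for rho derived below in the text.

```python
import numpy as np

def condition_mask(points, theta, rho):
    """Points satisfying the normal-form condition x*cos(theta) + y*sin(theta) - rho >= 0."""
    x, y = points[:, 0], points[:, 1]
    return x * np.cos(theta) + y * np.sin(theta) - rho >= 0

def rho_upper_bound(x_max, y_max):
    """Largest distance from the origin to a line through the extreme point
    (x_max, y_max); geometrically this equals sqrt(x_max^2 + y_max^2)."""
    theta_opt = np.arctan2(y_max, x_max)
    return x_max * np.cos(theta_opt) + y_max * np.sin(theta_opt)

def parameter_grid(theta_step, rho_step, rho_max):
    """Nodes of the (theta, rho) search grid: theta in [0, 2*pi), rho in [0, rho_max)."""
    return [(t, r)
            for t in np.arange(0.0, 2 * np.pi, theta_step)
            for r in np.arange(0.0, rho_max, rho_step)]
```

Each grid node corresponds to one candidate line; evaluating condition_mask at a node tells which training objects the candidate condition covers.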

The lower bound of rho is 0 and the upper bound can be calculated as follows. The set of points is finite, so we can determine the maximal value of each coordinate. If some values of the variables are negative, the data can be translated into a coordinate system in which all coordinates are positive.

Fig. 2. The idea of the maximal value of the parameter rho

The idea is to find a straight line which passes through the extreme point and whose distance from the origin is the longest (Fig. 2). This problem can be solved by searching for the global maximum of the function giving the distance between the line and the origin, depending on the value of the angle theta:

rho_max(theta_opt) = x_max cos(arctan(y_max / x_max)) + y_max sin(arctan(y_max / x_max))

Having set boundary values for all parameters of the condition, we only have to determine the resolution of the search of the parameter space, i.e. a step for each parameter of the grid method: theta in [0, 2*pi), rho in [0, rho_max). The solution can be used for any hyperplane, using the dependency on the sum of the squares of the direction cosines, for example for planes in 3-dimensional and in any n-dimensional space.

Correct Side of the Condition. Each oblique condition requires its correct side to be defined. To determine it we can use a normal vector to the hyperplane containing the considered condition: in an n-dimensional space, each hyperplane can be described by its normal vector n = [A_1, A_2, ..., A_n]. We have to calculate one more vector to find the correct side of the considered condition for a given point T. The initial point P of such a vector can be any point lying on the hyperplane, and its final point should be the point T.
According to this, the second vector v is defined as follows. For P = (x_P1, x_P2, ..., x_Pn) and T = (x_T1, x_T2, ..., x_Tn):

v = PT = (x_T1 - x_P1, x_T2 - x_P2, ..., x_Tn - x_Pn)

The next step is to calculate the dot product of the two vectors n and v:

n . v = |n| |v| cos(alpha)

To decide whether the point T lies on the correct side of the condition, we should consider the value of the dot product in the following way:

1. If the value is greater than 0, the point T is considered to be on the correct side of the condition.
2. If the value is equal to 0, the point T is assumed to be on the correct side of the condition.
3. If the value is less than 0, the point T is not on the correct side of the condition.

At this moment we can limit the bound for the angle theta to theta in [0, pi) and, for each theta, also consider the second case, in which the correct side is the opposite one.

4 Description of the Algorithm

The purpose of the algorithm is to find the best oblique decision rules for each decision class of the input data, taking into account several defined constraints. In general, there are two basic steps of the algorithm:

1. Create a parameter grid using a determined step for each parameter.
2. Grow new rules by checking all conditions defined by the grid nodes.

It is possible to constrain the number of rules by defining the maximal number of rules which describe each class. Successive rules should be generated as long as there are still training objects which do not support any rule and the constraint is not yet reached. For each decision rule, successive oblique conditions are obtained using a hill climbing method. Below, the procedure of generating a single oblique decision rule is shown:

1. For each node of the parameter grid, create a condition and calculate its quality for the given training set using one of the possible quality measures.
2. Save only the first best condition (the one with the highest quality).
3. Reduce the training set (just for the time of generating the next condition) by rejecting all training objects which are not covered by the previously found conditions.
4. Find a successive condition with the first highest quality using the reduced training set.
5. A new condition should be added to the rule only if the extended rule is better than the rule generated in the previous iteration and the constraint (maximal number of descriptors for each rule) is not exceeded. Otherwise, the new condition must be rejected and the search for further conditions for this rule is stopped.
6. Continue searching for successive conditions after reducing the training set by rejecting all objects which are not covered by the current rule.

The addition of conditions is stopped when the rule consists of the determined maximal number of conditions or the quality of the oblique decision rule with the added condition does not improve (such a found condition is excluded). After the rule is generated, we remove all covered positive objects from the training set and, if the maximal number of rules per decision class is not reached, we start to generate a new rule.
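The two levels of the procedure above (greedy growth of a single rule over the grid, and sequential covering over rules) might be sketched as follows. This is a hedged reconstruction for the two-dimensional case, not the authors' code: grow_rule, org_rules, the (theta, rho, flip) condition encoding and the pluggable quality function are our own assumptions.

```python
import numpy as np

def grow_rule(X, y, target, grid, max_conditions, quality):
    """Grow one oblique rule: greedily add the first best grid condition
    (theta, rho, flip) while quality improves and the length limit holds."""
    conditions, covered = [], np.ones(len(X), dtype=bool)
    best_q = -np.inf
    while len(conditions) < max_conditions:
        best = None
        for theta, rho, flip in grid:
            mask = X[:, 0] * np.cos(theta) + X[:, 1] * np.sin(theta) - rho >= 0
            if flip:                # theta in [0, pi) with the opposite correct side
                mask = ~mask
            q = quality(y[covered], mask[covered], target)
            if best is None or q > best[0]:
                best = (q, (theta, rho, flip), mask)
        if best is None or best[0] <= best_q:
            break                   # no improvement: reject the condition and stop
        best_q, cond, mask = best
        conditions.append(cond)
        covered &= mask             # reduce the training set for the next condition
    return conditions, covered

def org_rules(X, y, target, grid, max_rules, max_conditions, quality):
    """Sequential covering: generate rules until all positive objects are
    covered or the per-class rule limit is reached."""
    rules, remaining = [], np.ones(len(X), dtype=bool)
    while len(rules) < max_rules and np.any(remaining & (y == target)):
        conds, covered = grow_rule(X[remaining], y[remaining], target,
                                   grid, max_conditions, quality)
        if not conds:
            break
        rules.append(conds)
        # remove covered positive objects from the training set
        keep = np.ones(len(X), dtype=bool)
        idx = np.flatnonzero(remaining)
        keep[idx[covered & (y[remaining] == target)]] = False
        remaining &= keep
    return rules
```

The quality function is left as a parameter so that any rule quality measure can be plugged in.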

5 Experiments and Results

First experiments were done on three synthetic datasets, prepared exactly for the task of searching for oblique decision rules: two two-dimensional (2D and double2D) and one three-dimensional (3D). A simple visualisation of these datasets is shown in Fig. 3. Each dataset contains 1000 objects that belong to two classes. The two-dimensional datasets are almost balanced (562:438 and 534:466) but the third dataset has the class size proportion 835:165. The first two-dimensional dataset looks like a square divided into two classes by its diagonal. The second two-dimensional dataset may be described as follows: one class occupies two opposite corners and the second class is the rest. The three-dimensional dataset is unbalanced because only one corner belongs to the smaller class. For these datasets, the limits on the maximal number of rules per decision class and the maximal number of conditions per decision rule are given in the table with the results. As the quality measure, the average of the rule precision and coverage was used.

Fig. 3. Visualisation of the synthetic datasets: 2D (left); double2D (center); 3D (right)

For the further experiments several datasets from the UCI repository were taken into consideration: iris, balance scale, ecoli, breast wisconsin [6]. Also Ripley's synth.tr data were used [15]. For every experiment, the limits on the number of rules per decision class and the number of conditions per single rule for the ORG algorithm were the same: at most two rules built from at most two conditions. The quality measure remained the same as for the previous datasets. Results of ORG are compared with the PART algorithm [7] as implemented in the WEKA software. The WEKA implementation of the PART algorithm does not give information about the standard deviation of the error in the 10-CV model, so it cannot be compared with the ORG results.
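The quality measure used in the experiments, the average of rule precision and coverage, is simple enough to state in code; the function name and argument conventions below are ours.

```python
def rule_quality(covered, positive):
    """Average of precision (covered positives / covered objects) and
    coverage (covered positives / all positive objects) for one rule."""
    covered_pos = sum(1 for c, p in zip(covered, positive) if c and p)
    n_cov, n_pos = sum(covered), sum(positive)
    precision = covered_pos / n_cov if n_cov else 0.0
    coverage = covered_pos / n_pos if n_pos else 0.0
    return (precision + coverage) / 2.0
```

A rule covering half of one class and nothing else scores 0.75; a rule covering everything scores close to the class frequency plus one half.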
6 Conclusions and Further Works

In this short article an intuitive and somewhat exhaustive way of generating oblique decision rules was presented. The algorithm, called ORG, is based on limiting the parameters of the oblique condition. In this approach it is possible to constrain the number of obtained rules (per single decision class) and also the shape of the rules (with the definition of the maximal number of oblique conditions).

Table 1. Results on synthetic datasets
Columns: dataset; avg. accuracy (PART, ORG); std. dev. (PART, ORG); avg. rules number (PART, ORG); avg. elem. cond. number (PART, ORG); ORG params per class (max number of rules, conditions).
2D          95.5 96. .5 2 8 3 2 2
double 2D   93.8 84.3 3. 4 3 23 6 2 2
3D          94.8 98.2 .2 3 2 22 2

Table 2. Results on popular benchmark datasets
Columns: dataset; avg. accuracy (PART, ORG); std. dev. (PART, ORG); avg. rules number (PART, ORG); avg. elem. cond. number (PART, ORG).
iris              94 94 4.6 2 3. 3 5.2
balance scale     84 92 2.4 46 6 26 2
Ripley            85 8 8.4 4 2 6 4
breast wisconsin  94 97.7 3 8 6.
ecoli             84 76 8. 2 33 9

On the basis of the results for the synthetic datasets we may see that ORG can be successfully applied to datasets that contain various oblique dependencies. In comparison with the PART results, this may be observed in the decrease (on average: five times) of the average number of decision rules for every decision class. In the case of the popular benchmark datasets the decrease of the number of rules per decision class may also be observed. On the basis of these observations, our further works will focus on finding the best conditions in a strategy that also takes into consideration the length of the condition. It is also worth examining whether the limits on the oblique condition parameters should be recalculated more often than only at the beginning of the dataset analysis.

Acknowledgements. This work was supported by the European Community from the European Social Fund. The research and the participation of the second author is supported by the National Science Centre (decision DEC-2//D/ST6/77).

References

1. An, A., Cercone, N.: Rule quality measures for rule induction systems - description and evaluation. Computational Intelligence 17, 409-424 (2001)
2. Bazan, J., Szczuka, M., Wróblewski, J.: A New Version of Rough Set Exploration System. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 397-404. Springer, Heidelberg (2002)
3. Bennett, K.P., Blue, J.A.: A support vector machine approach to decision trees. In: Proceedings of the IJCNN 1998, pp. 2396-2401 (1998)
4. Bloedorn, E., Michalski, R.S.: Data-Driven Constructive Induction. IEEE Intell. Syst. 13(2), 30-37 (1998)

5. Cantu-Paz, E., Kamath, C.: Using evolutionary algorithms to induce oblique decision trees. In: Proc. of Genet. and Evol. Comput. Conf., pp. 1053-1060 (2000)
6. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010), http://archive.ics.uci.edu/ml
7. Frank, E., Witten, I.H.: Generating Accurate Rule Sets Without Global Optimization. In: Proc. of the 15th Int. Conf. on Mach. Learn., pp. 144-151 (1998)
8. Greco, S., Matarazzo, B., Słowiński, R.: Rough sets theory for multi-criteria decision analysis. Eur. J. of Oper. Res. 129(1), 1-47 (2001)
9. Kim, H., Loh, W.-Y.: Classification trees with bivariate linear discriminant node models. J. of Comput. and Graph. Stat. 12, 512-530 (2003)
10. Latkowski, R., Mikołajczyk, M.: Data decomposition and decision rule joining for classification of data with missing values. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 299-320. Springer, Heidelberg (2004)
11. Mikołajczyk, M.: Reducing Number of Decision Rules by Joining. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 425-432. Springer, Heidelberg (2002)
12. Murthy, S.K., Kasif, S., Salzberg, S.: A system for induction of oblique decision trees. J. of Artif. Intell. Res. 2, 1-32 (1994)
13. Pindur, R., Susmaga, R., Stefanowski, J.: Hyperplane Aggregation of Dominance Decision Rules. Fundam. Inform. 61(2), 117-137 (2004)
14. Raś, Z.W., Daradzińska, A., Liu, X.: System ADReD for discovering rules based on hyperplanes. Eng. App. of Artif. Intell. 17(4), 401-406 (2004)
15. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press (1996)
16. Sikora, M.: An algorithm for generalization of decision rules by joining. Found. on Comp. and Decis. Sci. 30(3), 227-239 (2005)
17. Ślęzak, D., Wróblewski, J.: Classification Algorithms Based on Linear Combinations of Features. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 548-553. Springer, Heidelberg (1999)