Implementation of Classification Rules using Oracle PL/SQL


David Taniar 1, Gillian D'Cruz 1, J. Wenny Rahayu 2

1 School of Business Systems, Monash University, Australia
David.Taniar@infotech.monash.edu.au

2 Department of Computer Science and Computer Engineering, La Trobe University, Australia
wenny@cs.latrobe.edu.au

Abstract

Data Mining is a means of uncovering hidden patterns and relationships in databases. Classification is a technique that categorises data based on the values that contribute to the outcome of that data. C4.5 is a very effective algorithm for building decision trees that assist in classifying data. Oracle is very popular amongst organisations that store data, and such organisations require data mining techniques that can be performed on their databases. This paper describes the C4.5 algorithm, and the example provided demonstrates that Classification Rules can be implemented in the Oracle 8i environment.

1. Introduction

Data Mining can be defined as a technique by which hidden patterns and relationships can be uncovered in large amounts of data. It is especially effective where organisations collect large amounts of data but are then unable to extract the information they originally wanted from it. There are many different Data Mining techniques; the more important ones include Association Rules, Sequential Patterns, Classification and Clustering [3]. This paper deals with Classification in depth, so the basics first need to be understood. Classification Rules are a Data Mining technique used to determine the category a value will fall into, based on certain properties of that value. Classification can be performed by many methods, such as Neural Networks, Statistical Algorithms, Genetic Algorithms, Rule Induction, the Nearest Neighbour method and Data Visualisation [2].

The intention of this paper is not only to explain Classification Rules but also to illustrate that these rules can be implemented in an environment where they are most needed. Many organisations run on an Oracle platform, as it is one of the most robust database platforms available. Since a large majority of organisations host their databases on this platform, it makes sense to provide them with the tools they require to make full use of the data they store, and data mining techniques that run directly on their databases are among those tools. Thus the main aim of this paper is not only to explain the concepts of the renowned classification algorithm C4.5 but also to show that Classification Rules can be implemented within the Oracle environment.

In this paper Classification Rules are implemented in the Oracle 8i environment with the help of decision trees. Decision trees are graphical representations of rules, and they can be constructed with a variety of available algorithms. The most commonly used is the C4.5 algorithm, introduced by Ross Quinlan as an enhanced version of his earlier ID3 algorithm.

2. The C4.5 Algorithm

The C4.5 algorithm by Ross Quinlan is based on decision trees [see Fig 1], so decision trees are a good starting point for explaining the algorithm [3]. A decision tree is formed from Attributes, Values and Outcomes; the values of an attribute are what determine the outcome. For example, Attribute = Outlook, Value = Sunny and Outcome = Play: the attribute Outlook can take many different values, one of which is Sunny, and the outcome could be that a person decides to play a game of golf when the weather is sunny. The decision tree is built from the root to the leaves, the root being the starting point of the tree. The algorithm then decides which nodes are to be included in the tree, based on the values that pertain to them. If a node can be further subdivided it is branched into further nodes; otherwise it is left as a leaf. The leaves are the Outcomes of the Attributes [see Fig 13 for an example of a decision tree].

/* node generation */
FormTree ( data ) {
    /* evaluation value calculation */
    EvalAttNode ( data ) ;
    /* data division */
    DivNodeData ( data ) ;
    for each subdata[i]
        FormTree ( subdata[i] ) ;
}

Fig 1 C4.5 Tree Generation Algorithm

The C4.5 algorithm begins by determining the Information provided by each of the attributes in the database [5]. The Information provided by an attribute is also termed its Entropy: the more even the probability distribution, the greater the Information. The Information of an attribute is computed from the probability distribution of its outcomes:

If P = (p1, p2, ..., pn)
then Information(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

The other value to be determined is the Gain of an attribute/value [5]. The Gain of a value specifies how much that value contributes to the information about the data as a whole. If the Information of the table data is represented by Information(T), then

Gain(Attribute, T) = Information(T) - Information(Attribute, T)

The attribute with the highest gain is taken to be the root of the tree, and the values of this attribute become the next level of nodes/leaves. The sequence in which the nodes on this level of the tree are formed is decided by the amount of information and gain that the values provide: the value with the highest gain is considered first for the next level [4, 5]. At the next level it has to be decided whether a node can be further subdivided. If, for all records with that value of the attribute, there is only one result, then there can be no further subdivision; this branch ends here and the final leaf is the outcome. If, on the other hand, there are several results/outcomes for that attribute and value, then the node can be further subdivided, leading to the next level of the decision tree. At this level the information of all attributes is determined again, keeping the attributes and values of the previous levels constant; the attribute with the highest gain is again chosen, and the entire process repeats.
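To make the Information formula concrete, the following is a minimal PL/SQL sketch of an entropy function for a two-class outcome such as Play/Don't Play. It is not taken from the paper's source code; the function name and signature are illustrative. Oracle's built-in LOG(m, n) returns the base-m logarithm of n.

CREATE OR REPLACE FUNCTION info_two_class (
    p_pos IN NUMBER,   -- e.g. number of 'Play' outcomes
    p_neg IN NUMBER    -- e.g. number of 'Don''t Play' outcomes
) RETURN NUMBER IS
    v_total NUMBER := p_pos + p_neg;
    v_p     NUMBER;
    v_info  NUMBER := 0;
BEGIN
    IF v_total = 0 THEN
        RETURN 0;
    END IF;
    -- Information(P) = -SUM(pi * log2(pi)); 0 * log2(0) is taken as 0
    v_p := p_pos / v_total;
    IF v_p > 0 THEN
        v_info := v_info - v_p * LOG(2, v_p);
    END IF;
    v_p := p_neg / v_total;
    IF v_p > 0 THEN
        v_info := v_info - v_p * LOG(2, v_p);
    END IF;
    RETURN v_info;
END info_two_class;
/

For instance, SELECT info_two_class(9, 5) FROM dual returns 0.940, the Information of the Result column in the example that follows, and info_two_class(2, 3) returns 0.971.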

The C4.5 algorithm is very useful and appropriate as it deals with real-life concerns such as the size of the tree and the data types of the values. C4.5 allows pruning of a decision tree so that only necessary values are retained, thereby eliminating freak cases. C4.5 also allows continuous values to be used, which was not possible in its predecessor, the ID3 algorithm, and in a number of other algorithms [4].

3. Example using the C4.5 Algorithm

In order to understand the implementation of the C4.5 algorithm in the PL/SQL environment, the reader needs to grasp the exact workings of, and the assumptions made in, this program. The algorithm is best explained with the help of the following example, in which weather conditions are used to forecast whether a person should or should not play golf. Fig 2 shows a table of training records. Each record gives values for four weather conditions, i.e. Outlook, Temperature, Humidity and Windy, and an end result specifying whether the person Played or Didn't Play golf. The weather conditions are the Attributes. It should be noted that two of the attributes, Temperature and Humidity, have continuous values [5].

OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  RESULT
Sunny     85           85        FALSE  Don't Play
Sunny     80           90        TRUE   Don't Play
Overcast  83           78        FALSE  Play
Rain      70           96        FALSE  Play
Rain      68           80        FALSE  Play
Rain      65           70        TRUE   Don't Play
Overcast  64           65        TRUE   Play
Sunny     72           95        FALSE  Don't Play
Sunny     69           70        FALSE  Play
Rain      75           80        FALSE  Play
Sunny     75           70        TRUE   Play
Overcast  72           90        TRUE   Play
Overcast  81           75        FALSE  Play
Rain      71           80        TRUE   Don't Play

Fig 2 Table showing Attributes and Values for the Golfing Example

The records in Fig 2 are used as training data to determine the rules from which a decision tree is built. In order to determine the first rule, the amount of Information provided by each attribute needs to be established; whether a person should or should not play golf can then be decided from it. The amount of information from each attribute depends on the probability distribution of the values of that attribute. For example, for the attribute Outlook, the probabilities of the results when the value of Outlook is Sunny are

Probability(Play, Sunny) = 2/5
Probability(Don't Play, Sunny) = 3/5
P(Sunny) = (2/5, 3/5) [see Fig 3]

OUTLOOK  TEMPERATURE  HUMIDITY  WINDY  RESULT
Sunny    85           85        FALSE  Don't Play
Sunny    80           90        TRUE   Don't Play
Sunny    72           95        FALSE  Don't Play
Sunny    69           70        FALSE  Play
Sunny    75           70        TRUE   Play

Fig 3 Table showing values when Outlook = 'Sunny'

Therefore the Information of the attribute Outlook when Outlook = Sunny can be calculated by applying the following formula.

If P = (p1, p2, ..., pn)
then I(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

I(Sunny) = I(2/5, 3/5)
         = -(0.4 * log2(0.4) + 0.6 * log2(0.6))
         = -(-0.5288 - 0.4422)
         = 0.9709

The overall Information of the attribute Outlook for the entire training data can then be calculated by applying the formula

Information(Attribute, T) = SUM over all values of (X / Y) * I

where X is the number of times the value of the attribute appears in the training data, Y is the total number of records in the training data, and I is the information calculated for that specific value of the attribute [5].

Information(Outlook, T)
  = 5/14 * Information(Sunny)    = 5/14 * (0.9709) = 0.3468
  + 5/14 * Information(Rain)     = 5/14 * (0.9709) = 0.3468
  + 4/14 * Information(Overcast) = 4/14 * (0)      = 0

Thus, Information(Outlook, T) = 0.3468 + 0.3468 + 0 = 0.694.

The same method is used to find the information provided by all other attributes. While counting the number of times an item appears in the training data, one should consider only values that appear twice or more in the entire training data; C4.5 adopts this as a means of eliminating freak cases. The Information of the Result column itself is Information(Result) = 0.94, and the Information of Windy, Temperature and Humidity is calculated in the same way as for Outlook.

Assumption 1: It is assumed that attributes with non-continuous values have greater priority than attributes with continuous values.
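Assuming the training data of Fig 2 is stored in a table named GOLF (a hypothetical name; the paper does not list its table definitions), the overall Information of the Outlook attribute can be computed in a single query using the info_two_class function sketched earlier. DECODE rather than a CASE expression keeps the query compatible with early Oracle 8i releases; note that this plain aggregation does not apply the paper's extra rule of ignoring values that appear fewer than twice.

SELECT SUM( (g.cnt / t.tot) * info_two_class(g.plays, g.cnt - g.plays) )
       AS info_outlook
FROM ( SELECT outlook,
              COUNT(*)                          AS cnt,
              SUM(DECODE(result, 'Play', 1, 0)) AS plays
       FROM   golf
       GROUP  BY outlook ) g,
     ( SELECT COUNT(*) AS tot FROM golf ) t;

For the data above, the query returns 5/14 * 0.971 + 5/14 * 0.971 + 4/14 * 0 = 0.694, matching the hand calculation.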

The next step is to find the Gain of each attribute. The Gain of an attribute specifies the gain of information due to that particular attribute and is found by applying the following formula [5].

Level 1:

Gain = Info(T) - Info(Attribute X, T)
Info(T) = Info(Result) = Info(9/14, 5/14) = 0.94

The first node, or level, of the decision tree will be the attribute with the highest gain [see Fig 4]. The attribute Outlook has the highest gain, 0.94 - 0.694 = 0.246 (keeping in mind Assumption 1, under which attributes with continuous values have lower priority). Thus the first level of the decision tree, the root node, is based on the attribute Outlook.

Fig 4 Information and Gain of the four attributes (Outlook, with Information 0.694 and Gain 0.246, has the highest gain)

Level 2:

The attribute Outlook has three different values, namely Sunny, Overcast and Rain. Level 1 of the tree can therefore be divided into three further nodes if they can be taken to a third level, or they remain as leaves if they cannot be subdivided. The next step is to find which of the three values of the attribute Outlook should be divided next, in other words which value has the highest gain. Overcast has the highest gain and hence becomes the first node in Level 2 [see Fig 5].

Fig 5 Information and Gain for the values of Outlook (Overcast has the highest gain)

Probability(Overcast) = (1, 0), i.e. always Play

Since there is only one possibility (in all circumstances where Outlook = Overcast the end result is always Play), this remains the final leaf for this branch of the decision tree. The next two candidates for the following node are Sunny and Rain. From Fig 6 it can be seen that Rain has a larger positive result than Sunny, i.e. 3 Plays, so Rain is considered as the next node.

        Play  Don't Play
Sunny   2     3
Rain    3     2

Fig 6 Table showing results for the values Sunny and Rain

Since there is more than one possibility [see Fig 7], i.e. there are cases where the result is Play and other cases where the result is Don't Play, it is necessary to determine whether any other attributes contribute to the final result.

OUTLOOK  TEMPERATURE  HUMIDITY  WINDY  RESULT
Rain     70           96        FALSE  Play
Rain     68           80        FALSE  Play
Rain     65           70        TRUE   Don't Play
Rain     75           80        FALSE  Play
Rain     71           80        TRUE   Don't Play

Fig 7 Outlook = 'Rain'

To find this out, the information of all the other attributes has to be determined for the cases where Outlook = Rain. Here again, values appearing fewer than twice in the table are eliminated. Since no value of the attribute Temperature appears twice where Outlook = Rain, this attribute can be eliminated at this stage. Windy has the highest gain among the non-continuous attributes, so the next node is Windy [see Fig 8].

Fig 8 Information and Gain when Outlook = 'Rain' (Windy, a non-continuous attribute, has a higher gain than Humidity, a continuous one)

By grouping the records by Rain and Windy, it can be seen that there are exactly two possible outcomes when Outlook = Rain, depending on whether Windy = TRUE or FALSE [see Fig 9].

OUTLOOK  WINDY  RESULT
Rain     TRUE   Don't Play
Rain     FALSE  Play

Fig 9 Results based on Outlook = 'Rain' and the Windy conditions

This ends this branch of the tree. The next node/leaf to be considered is the value of the attribute Outlook with the next highest information; the value Sunny is considered next, as it is the only remaining value of the attribute Outlook.

Assumption 2: Attributes are not repeated.

From Fig 10 it can be seen that when Outlook = Sunny there is more than one possible result. Here again the information of all attributes needs to be determined, and, as was done earlier, values with fewer than two cases are eliminated in order to minimise freak cases.

OUTLOOK  TEMPERATURE  HUMIDITY  WINDY  RESULT
Sunny    85           85        FALSE  Don't Play
Sunny    80           90        TRUE   Don't Play
Sunny    72           95        FALSE  Don't Play
Sunny    69           70        FALSE  Play
Sunny    75           70        TRUE   Play

Fig 10 Outlook = 'Sunny'

When the Information is determined [see Fig 11], Temperature is eliminated because none of its values appears more than once. The attribute Humidity has the highest gain, 0.94, and thus Humidity is taken to be the attribute that classifies Sunny into definite results.

Fig 11 Information and Gain for Outlook = 'Sunny' (Humidity has the highest gain, 0.94)

By determining the average value and extending it to the closest actual value that appears in the training data, a centre point can be found, and this point acts as the deciding factor in building this branch of the decision tree [4].

HUMIDITY  RESULT
70        Play
70        Play
85        Don't Play
90        Don't Play
95        Don't Play

Fig 12 Determining the Centre Point

Determining the centre point [see Fig 12]:

Highest value where Result = Play: 70
Lowest value where Result = Don't Play: 85
Average value: (70 + 85) / 2 = 77.5
Decreased to the closest lower value in the training data where Result = Play: 75

Thus it can be said that if Outlook is Sunny and Humidity is less than or equal to 75 the result will always be Play, and if Outlook is Sunny and Humidity is greater than 75 the result will always be Don't Play. Since there are no more values for the attribute Outlook, the decision tree ends here.
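The centre-point rule can also be expressed directly in SQL. The sketch below, again over the hypothetical GOLF table, averages the highest Play humidity and the lowest Don't Play humidity within the Sunny subset, and then steps the result down to the closest humidity value that actually occurs anywhere in the training data:

SELECT MAX(humidity) AS centre_point
FROM   golf
WHERE  humidity <= ( SELECT ( MAX(DECODE(result, 'Play', humidity, NULL))
                            + MIN(DECODE(result, 'Play', NULL, humidity)) ) / 2
                     FROM   golf
                     WHERE  outlook = 'Sunny' );

For the Sunny subset this evaluates to (70 + 85) / 2 = 77.5, which steps down to 75, the threshold used in the tree of Fig 13.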

Fig 13 below illustrates the decision tree created in the above example.

Outlook
  Overcast -> Play
  Rain     -> Windy
                FALSE -> Play
                TRUE  -> Don't Play
  Sunny    -> Humidity
                <= 75 -> Play
                >  75 -> Don't Play

Fig 13 Decision tree

4. Implementation and Results of the Program using Oracle 8i PL/SQL

The Classification Rules program was implemented using the C4.5 algorithm as its base. The same process as explained in the earlier sections of this paper was used to arrive at the results shown in Fig 14. The program was built on the data of the weather conditions example discussed previously [see Fig 2]. This data was stored in a table on the Oracle platform, and procedures and functions were called on this table to form the decision tree. A number of temporary tables also had to be created in order to store intermediate values. The PL/SQL procedures called functions that determined the state of the decision tree based on the Information and Gain values, which were arrived at through calculations performed within a number of procedures. It must be noted that the procedures and functions illustrated in this paper were written only to prove that Classification Rules can be implemented over an Oracle platform; all of the procedures and functions listed could be improved significantly in terms of performance, speed and accuracy.
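The paper does not reproduce its table definitions, but a minimal sketch of the training table and of a temporary table holding information/gain values might look as follows (all object names and column types here are assumptions):

CREATE TABLE golf (
    outlook     VARCHAR2(10),
    temperature NUMBER,
    humidity    NUMBER,
    windy       VARCHAR2(5),
    result      VARCHAR2(12)
);

-- temporary table holding the information/gain value of each attribute
CREATE TABLE information_tbl (
    attribute VARCHAR2(30),
    info      NUMBER,
    gain      NUMBER
);

-- first two training records from Fig 2
INSERT INTO golf VALUES ('Sunny', 85, 85, 'FALSE', 'Don''t Play');
INSERT INTO golf VALUES ('Sunny', 80, 90, 'TRUE',  'Don''t Play');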

Fig 14 Result of the Classification Rules program

Fig 15 illustrates how the algorithm was implemented: procedures populate temporary tables, which in turn are used to determine the information/gain values. These information/gain values were stored in a temporary table called Information, and from this table the highest gain values were sought in order to establish which attribute would be considered for the next level of the tree.

Fig 15 Source code illustrating the method of calculation of the Information/Gain values
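A minimal sketch of such a procedure, built on the hypothetical GOLF and INFORMATION_TBL tables above, could compute Info(T) and the per-attribute Information and Gain and record them for later lookup. This illustrates the approach only; it is not the paper's actual Fig 15 code, and like the paper's code it handles one attribute, Outlook, explicitly.

CREATE OR REPLACE PROCEDURE proc_calc_information IS
    v_info_t NUMBER;  -- Info(T), the information of the Result column
    v_info_a NUMBER;  -- Info(Outlook, T)
BEGIN
    SELECT info_two_class(SUM(DECODE(result, 'Play', 1, 0)),
                          SUM(DECODE(result, 'Play', 0, 1)))
    INTO   v_info_t
    FROM   golf;

    SELECT SUM((g.cnt / t.tot) * info_two_class(g.plays, g.cnt - g.plays))
    INTO   v_info_a
    FROM   ( SELECT outlook, COUNT(*) AS cnt,
                    SUM(DECODE(result, 'Play', 1, 0)) AS plays
             FROM   golf
             GROUP  BY outlook ) g,
           ( SELECT COUNT(*) AS tot FROM golf ) t;

    -- the remaining attributes would be handled by analogous blocks,
    -- mirroring the hard-coded style the Discussion section describes
    INSERT INTO information_tbl (attribute, info, gain)
    VALUES ('OUTLOOK', v_info_a, v_info_t - v_info_a);
    COMMIT;
END proc_calc_information;
/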

Once the attribute was determined, a procedure called Proc_Build_tree was called to actually build the tree [see Fig 16]. This procedure called several other functions in order to build the different branches of the decision tree.

Fig 16 Source code illustrating the building of the Decision Tree

The Classification Rules program presented accurate results, allowing users to determine whether a person should Play or Not Play a game of golf based on the weather conditions. Using the training data, the program was able to produce a decision tree that could be used to determine the end result for a given input.

5. Discussion

The C4.5 algorithm can be extremely intensive and comprises a large amount of mathematical calculation that needs to be completed in order to arrive at definite conclusions. The algorithm can be time consuming to understand (especially the calculations, for non-mathematical readers), but hopefully it has been explained accessibly in this paper with the help of the example used. A few assumptions have been made in the earlier sections of this paper, but these assumptions can be eliminated if other techniques of the C4.5 algorithm are brought into play, namely Split-Information, Gain Ratio etc. [4, 5]. For a basic understanding of Classification Rules using the C4.5 algorithm, however, the terms and processes described in this paper are sufficient for deriving accurate results.

From the programming point of view, the program works effectively only for this example, as values such as Outlook, Windy, TRUE, 70 etc. have all been hard-coded into it. The procedures have been coded in such a way that the program runs for the training data listed throughout this paper. Small changes to the code will be necessary if the user wishes to add records to the training data. For example, Fig 17 shows the program code for Attribute = Outlook; this code is sufficient when the training data is identical to the records shown in Fig 2. If a record is added to this training database, it may be necessary to include in the same procedure code for Attribute = Temperature, Windy and Humidity. The code to be inserted will be identical to the code that already exists, except that the attribute name and table name may have to be changed accordingly.
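As an illustration of the hard-coded style just described (and not the actual Proc_Build_tree of Fig 16), a minimal sketch of a tree-building step over the hypothetical tables above might look like this:

CREATE OR REPLACE PROCEDURE proc_build_tree IS
    v_attr     information_tbl.attribute%TYPE;
    v_outcomes NUMBER;
    v_result   golf.result%TYPE;
BEGIN
    -- the root of the tree is the attribute with the highest gain
    SELECT attribute
    INTO   v_attr
    FROM   information_tbl
    WHERE  gain = ( SELECT MAX(gain) FROM information_tbl )
    AND    ROWNUM = 1;
    DBMS_OUTPUT.PUT_LINE('Root node: ' || v_attr);

    -- branch on each value of the root attribute; Outlook is
    -- hard-coded here, which is exactly the limitation noted above
    FOR v IN ( SELECT DISTINCT outlook FROM golf ) LOOP
        SELECT COUNT(DISTINCT result)
        INTO   v_outcomes
        FROM   golf
        WHERE  outlook = v.outlook;

        IF v_outcomes = 1 THEN
            -- a single outcome: this branch ends in a leaf
            SELECT DISTINCT result
            INTO   v_result
            FROM   golf
            WHERE  outlook = v.outlook;
            DBMS_OUTPUT.PUT_LINE(v.outlook || ' -> leaf: ' || v_result);
        ELSE
            -- mixed outcomes: this subset must be split again on
            -- another attribute, as in Section 3
            DBMS_OUTPUT.PUT_LINE(v.outlook || ' -> split further');
        END IF;
    END LOOP;
END proc_build_tree;
/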

Again, this Classification Rules program was implemented only to prove that Classification Rules can be executed within the Oracle 8i environment, not for the purpose of performing data mining on arbitrary tables that the user provides.

Fig 17 Program source code showing that the code needs to be modified for accurate results if the training data is updated/modified

The training data is read by the system once at the start, and the tree is then built. Continuous attributes are more complicated to work with, as they receive lower priority than the non-continuous attributes. Table space is required to run such a program, since a large amount of data arising from the calculations has to be inserted into temporary tables. The decision tree can be pruned to eliminate cases considered to be freak cases and also simply to downsize the tree [4].

6. Conclusion

Considering the different types of Data Mining that a business can carry out on the data it collects and the information it requires from its database, Classification Rules would probably stand at the top of the list as the most popular data mining technique. From all that was discussed in this paper, one can say that Classification Rules can help determine set outcomes and results and put ordinary data into categories based on some pre-established variables. The C4.5 algorithm by Ross Quinlan is one of the best classification algorithms ever established and proved to be an extremely solid foundation on which to build the Classification Rules program for this paper. From the processes set out in the previous sections of this paper, it can be said that it is highly possible to implement Classification Rules within the Oracle environment with the help of PL/SQL procedures and functions. Given more time and resources, the program could be further improved so that it runs successfully even if the original data in the training data table is modified, changed or extended.

In spite of the problems and drawbacks identified in the Classification Rules program, it must be noted that, given the time constraints, the program still produces successful results and thus proves that Classification Rules can be implemented within the Oracle 8i environment.

References

1. Gasser, Michael, "Extending Decision Tree Learning".
2. Joshi, Karuna Pande, "Analysis of Data Mining Algorithms".
3. Kubota, Kazuto, Akihiko Nakase, Hiroshi Sakai and Shigeru Oyanagi, "Parallelization of Decision Tree Algorithm and its Performance Evaluation".
4. Quinlan, J. Ross, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
5. Quinlan, J. Ross, "Building Classification Models: ID3 and C4.5".
6. Quinlan, J. Ross, "Improved Use of Continuous Attributes in C4.5", Journal of Artificial Intelligence Research 4, 77-90, 1996.
