IMPLEMENTATION OF DATA MINING TECHNIQUES USING ORACLE 8i PL/SQL

DAVID TANIAR, GILLIAN D'CRUZ, CHENG LEE
School of Business Systems, Monash University, PO Box 63B, Clayton 3800, Australia

ABSTRACT

This paper discusses the concepts involved in creating and implementing data mining programs using PL/SQL in the Oracle 8i environment. The programs are explained in a way that helps the reader understand the principles of writing programs that find association and sequential patterns in data. The paper describes the achievements, limitations and problems encountered at each stage of implementation. The programs illustrated here are based on the Apriori algorithm, which is well known for both its simplicity and its accuracy.

1. INTRODUCTION

Numerous tools are in use today for data mining, and programs for this purpose have been written in a variety of languages. This paper began as an attempt to (1) develop programs that would help small establishments incorporate data mining into their businesses, (2) examine the advantages and disadvantages of developing data mining programs in PL/SQL, and (3) establish the limitations of PL/SQL with regard to data mining, for further study and research.

This paper discusses all three issues. The programs listed and explained here emerged from this study. They illustrate the working of two important data mining techniques: association rules and sequential patterns. The paper shows and explains how PL/SQL can be used to obtain valid results for both techniques.

2. DATA MINING TECHNIQUES

Data mining techniques are used for data analysis, prediction, or both. The two techniques discussed in this paper are association rules and sequential patterns [3]. An association algorithm creates rules that describe how often events occur together.
These rules are derived from every transaction that has occurred, with the transaction number as the main deciding factor [1,4]. A commonly used algorithm for finding association rules is the Apriori algorithm. To find correlations between subjects, items or events, a frequent item set must first be established [4]. A frequent item set is a list of one or more items that appear regularly within transactions. For an item to be considered frequent, its count has to exceed a set minimum level. A frequent item set is established at every stage of the algorithm.

Two thresholds govern the Apriori algorithm: support and confidence. The support level is the minimum number of times an item must appear in the list of transactions [2,5]; an item must meet this minimum support to be considered frequent. The confidence level is the number of times a combination of more than one item from the previous frequent item set appears in the database [2,5]. The confidence level thus denotes the strength of the correlation between items.

Sequential pattern mining, as the name suggests, predicts or analyses the chain of events that takes place after a given event has occurred. The algorithm commonly used to write programs for sequential patterns is the Sequential Apriori algorithm. Like the Apriori algorithm for association rules, it uses support and confidence levels, which work in the same way.

3. IMPLEMENTATION USING ORACLE 8i PL/SQL

The programs explained below were written on the Oracle 8i PL/SQL platform, with the aim of accomplishing data mining with programs written in PL/SQL.

3.1 Association Rules Using PL/SQL

The program used to explain association rules concerns a fictional retail store. The store manager is assumed to want to know which products to stock together on the shelves so that customers buy more products on each visit.

Example: If it were known that 80% of customers who came into the store bought shampoo and conditioner together in one transaction, it would be wise for the manager to place shampoos and conditioners near each other, so as to entice customers to buy both products instead of just one.

Two cases are considered in the program: one with relatively low support and confidence levels, and one with reasonably high levels. The program is discussed in this way to emphasise how the results change with the support and confidence levels.
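The support and confidence ideas above can be illustrated with a small worked example. The following is a minimal Python sketch, not the paper's PL/SQL program. The transaction data and the function names `support_count` and `confidence` are hypothetical, and the confidence formula used is the common textbook ratio count(A and B) / count(A), which is an assumption here, since the paper treats confidence as a count threshold.

```python
# Toy transactions (hypothetical data, not taken from the paper).
transactions = {
    1: {"shampoo", "conditioner"},
    2: {"shampoo", "conditioner", "soap"},
    3: {"shampoo", "conditioner"},
    4: {"shampoo", "conditioner", "soap"},
    5: {"shampoo"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for basket in transactions.values() if items <= basket)


def confidence(antecedent, consequent, transactions):
    """Textbook rule confidence (an assumption): count(A and B) / count(A)."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    base = support_count(antecedent, transactions)
    return both / base if base else 0.0


pair_count = support_count({"shampoo", "conditioner"}, transactions)
rule_conf = confidence({"shampoo"}, {"conditioner"}, transactions)
print(pair_count, rule_conf)
```

With this toy data, shampoo and conditioner appear together in four of the five transactions, so a rule "shampoo implies conditioner" would hold for 80% of the shampoo purchases, which mirrors the shelf-placement example above.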
Level 1: To form the first level, items are counted according to how many times they appear in the database. Figure 1 shows the cursor used to group and count items in the database by number of transactions. To establish which items qualify for the frequent item set, the algorithm shown in Figure 2 was implemented.

    -- Count the transactions in which each item appears, most frequent first.
    CURSOR PurchasesCursor IS
        SELECT item, COUNT(transid) AS NoOfOccurances
        FROM purchases
        GROUP BY item
        ORDER BY NoOfOccurances DESC;
    PurchasesRow PurchasesCursor%ROWTYPE;

Figure 1 Cursor used for Level 1

    Input: DB containing transactions.
    Output: Frequent item set
    1. Count number of times an item appears in the transaction list.
    2. Group by transaction.
    3. Input Support Level.
    4. Frequent item set = items with count > Support Level.

Figure 2 Algorithm to generate frequent items at Level 1

Level 2: Items that qualified at Level 1 are used in the second level. Here two items are combined, and a count is made of the number of times the two items appear together in a transaction. From this stage on, a confidence level has to be set at every level. Figure 3 shows the selection of items for a candidate item set, i.e. a combination of two items. Note that the items need not be in any particular order. The algorithm shown in Figure 4 can be used to obtain results for this level.

    -- Pair every two distinct frequent items that occur in the same transaction.
    CURSOR LevelTwoCursor IS
        SELECT LEVELONE.ITEMONE, LEVELONE_A1.ITEMONE AS Itemtwo, PURCHASES.TRANSID
        FROM LEVELONE, LEVELONE LEVELONE_A1, PURCHASES, PURCHASES PURCHASES_A1
        WHERE ((LEVELONE.ITEMONE = PURCHASES.ITEM)
           AND (LEVELONE_A1.ITEMONE = PURCHASES_A1.ITEM)
           AND (PURCHASES.TRANSID = PURCHASES_A1.TRANSID)
           AND (LEVELONE.ITEMONE < LEVELONE_A1.ITEMONE))
        ORDER BY LEVELONE.ITEMONE ASC, LEVELONE_A1.ITEMONE ASC;
    LevelTwoRow LevelTwoCursor%ROWTYPE;

Figure 3 Cursor used at Level 2

    Input: Frequent item set from previous level
    Output: Frequent item set of (number of items at previous level + 1) items.
    1. Count the number of times each frequent item set from the previous level appears along with a new item in the list of transactions.
    2. Group by transaction.
    3. Input Confidence Level.
    4. Frequent item set for current level = item sets with count > Confidence Level.

Figure 4 Algorithm to generate frequent item sets

Level 3: At this stage, the pairs that qualified at the previous level are combined with one further item to create three-item combinations. Again, a count is made of the number of times each new combination appears in a single transaction. The same algorithm shown in Figure 4 is used at this level.

After this level no further combinations can be made from the frequent item sets, so the program ends with this step. Figure 5 shows the actual output of the program using the support and confidence values as specified above.

Figure 5 Program Results from the Association Rules Generation Program
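The first two levels described in this section can be tried out without an Oracle installation. The sketch below uses Python's built-in sqlite3 module, so the SQL is in SQLite's dialect rather than Oracle 8i PL/SQL. The table and column names follow Figures 1 and 3, while the sample rows, the threshold values, and the use of HAVING and COUNT(DISTINCT ...) to apply the thresholds inside the query are assumptions made for illustration.

```python
import sqlite3

# Hypothetical purchases rows (transid, item); not the paper's data.
rows = [
    (1, "bread"), (1, "milk"), (1, "butter"),
    (2, "bread"), (2, "milk"), (2, "jam"),
    (3, "bread"), (3, "butter"),
    (4, "milk"), (4, "butter"),
    (5, "bread"), (5, "milk"), (5, "butter"),
]
min_support = 2      # assumed count threshold for Level 1
min_confidence = 2   # assumed count threshold for Level 2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (transid INTEGER, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Level 1: count transactions per item (as in Figure 1) and keep only
# items whose count exceeds the support level (step 4 of Figure 2).
conn.execute("CREATE TABLE levelone (itemone TEXT)")
conn.execute(
    """INSERT INTO levelone
       SELECT item FROM purchases
       GROUP BY item
       HAVING COUNT(transid) > ?""",
    (min_support,),
)
level_one = sorted(r[0] for r in conn.execute("SELECT itemone FROM levelone"))

# Level 2: the self-join of Figure 3, followed by a per-pair count of
# the transactions in which both items occur.
level_two = conn.execute(
    """SELECT a.itemone, b.itemone, COUNT(DISTINCT p1.transid) AS n
       FROM levelone a, levelone b, purchases p1, purchases p2
       WHERE a.itemone = p1.item
         AND b.itemone = p2.item
         AND p1.transid = p2.transid
         AND a.itemone < b.itemone
       GROUP BY a.itemone, b.itemone
       HAVING COUNT(DISTINCT p1.transid) > ?
       ORDER BY a.itemone, b.itemone""",
    (min_confidence,),
).fetchall()

print(level_one)
print(level_two)
```

The levelone table plays the role of the temporary table that stores the Level 1 results for the Level 2 cursor. Applying the threshold with HAVING is one possible design; the figures only specify the cursor and the comparison, not where the filtering happens.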

3.2 Sequential Patterns using PL/SQL

The program used to explain sequential patterns works on a database for a fictional video store. The database contains a list of children's videos hired out over a three-month period. These transactions can be used to establish a sequential rule, in which the program finds the videos a customer is most likely to hire based on the correlation or pattern that exists between videos. Here the transaction number does not play the important role it did in the association rule program. Instead, item sets are found by counting the number of customers who rent a particular video. The result indicates how likely a video is to be hired, given that it is normally hired in conjunction with a certain list of other videos.

Level 1: At the first level, items are counted by how many customers rent each video. The algorithm used to find sequential patterns at this level is shown in Figure 6. The cursor used to select items for the first level is similar to the one used for association rules. At this level, support is set at 50%.

    Input: Database containing transactions.
    Output: Frequent item set for Level 1
    1. Count number of times an item appears in the transaction list.
    2. Group by customer.
    3. Input Support Level.
    4. Frequent item set = items with count > Support Level.

Figure 6 Algorithm used to generate frequent items at Level 1

Level 2: At this level, a confidence level has to be set along with the support level. A count is again made, this time of combinations of two items, and it includes only those items that qualified for the frequent item set. The algorithm for this level is shown in Figure 7. Figure 8 shows the cursor used to select a combination of two items.

    Input: Frequent item set from previous level
    Output: Frequent item set of (number of items at previous level + 1) items.
    1. Count the number of times each frequent item set from the previous level appears along with a new item in the list of transactions.
    2. Group by customer.
    3. Input Confidence Level.
    4. Frequent item set for current level = item sets with count > Confidence Level.

Figure 7 Algorithm used to generate frequent item sets

Level 3: At Level 3, three items are combined and another count is made of how many customers hired these videos over the three-month period. Here again the confidence level is set to 75%. The same algorithm shown in Figure 7 is used at this stage. The cursor is similar to the one used at Level 2 of the association program; the only difference is that the count is now based on the customer instead of the transaction number.
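The customer-based counting of Figures 6 and 7 can be sketched as a loop over levels. The following Python sketch is an illustration only: the rental data, the function name `frequent_levels`, and the treatment of the thresholds as fractions of the number of customers are all assumptions. It keeps the strict "count > level" comparison stated in the figures and stops when a level yields no frequent item set.

```python
def frequent_levels(rentals, min_support, min_confidence):
    """Return a list of levels; each maps an item tuple to its customer count."""
    n_customers = len(rentals)

    def count(itemset):
        # Count customers (not transactions) who hired every item in the set.
        return sum(1 for items in rentals.values() if set(itemset) <= items)

    # Level 1: single items whose customer count exceeds the support level.
    all_items = sorted(set().union(*rentals.values()))
    level = {(i,): count((i,)) for i in all_items}
    level = {k: c for k, c in level.items() if c > min_support * n_customers}
    frequent_items = [k[0] for k in level]
    levels = []

    # Later levels: extend each frequent set by one frequent item and recount.
    while level:
        levels.append(level)
        candidates = {
            tuple(sorted(set(itemset) | {item}))
            for itemset in level
            for item in frequent_items
            if item not in itemset
        }
        level = {c: count(c) for c in sorted(candidates)}
        level = {k: c for k, c in level.items()
                 if c > min_confidence * n_customers}
    return levels


# Hypothetical rentals: customer id -> set of video ids hired.
rentals = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3},
    "c3": {1, 2, 3, 5},
    "c4": {1, 2, 3, 4},
    "c5": {2, 4, 5},
}
levels = frequent_levels(rentals, min_support=0.5, min_confidence=0.75)
for depth, found in enumerate(levels, start=1):
    print(depth, found)
```

Because the loop runs until a level comes back empty, the number of levels does not have to be fixed in advance. That is a design choice of this sketch, not a description of how the PL/SQL programs were built.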

    -- Pair every two distinct frequent items hired by the same customer.
    CURSOR LevelTwoCursor IS
        SELECT DISTINCT Sales.CustID, LevelOne.Item1ID, LevelOneCopy.Item2ID
        FROM LevelOne, LevelOneCopy, Sales, Sales Sales_A1
        WHERE (Sales.ItemID = LevelOne.Item1ID)
          AND (Sales_A1.ItemID = LevelOneCopy.Item2ID)
          AND (Sales.CustID = Sales_A1.CustID)
          AND (Sales.ItemID <> Sales_A1.ItemID)
          AND (LevelOne.Item1ID < LevelOneCopy.Item2ID);
    LevelTwoRow LevelTwoCursor%ROWTYPE;

Figure 8 Cursor used to generate combinations of items for Level 2

Level 4: At Level 4 the confidence level is again set at 75%. Four items are grouped together and a count is made of the number of customers who hired these videos. Item sets with a count greater than the confidence level form the frequent item set. The algorithm shown in Figure 7 is used again to determine the new frequent item set for this level. A cursor similar to those used at Levels 2 and 3 selects the item combinations. The results in Figure 9 show that item IDs 5, 7, 8 and 12 were hired out by 3 of the 4 customers who visited the store in those three months. After this level no further combinations of items can be made, and the program ends at this stage.

Figure 9 Results from the Sequential Patterns Generation Program

4. DISCUSSION

The initial aim of both programs was to show that algorithms for association rules and sequential patterns can be turned into programs that run in the Oracle 8i PL/SQL environment. The results of running both programs show that they give valid results and demonstrate the principles of data mining effectively. Both programs would continue to run effectively if the database were larger. If the database for the association rules program were changed to include another item in the transaction list, the program would still work effectively. Both programs demonstrate that data mining can be done on Oracle databases without much difficulty.

Although both programs produced valid and expected results, there are some limitations. The programs work effectively only as long as the data require no more levels than the number built into the programs. If the database needs fewer levels than are built in, there is no problem; if it needs more than the maximum built into the program, there is a serious problem. Neither the number of items in the database nor any other factor can predict the number of levels needed to run the program effectively. Even if the number of levels were estimated by some means, the next problem would be building temporary tables. For every level of the programs, temporary tables had to be prepared to store results and data from previous levels, and the cursors used at later stages draw their data from these temporary tables.

5. CONCLUSION

This paper has shown that it is possible to create data mining programs within the Oracle 8i PL/SQL environment. The programs created for this paper have demonstrated the validity and accuracy that data mining programs can achieve. Although the programs have visible limitations, as almost every program does, they show that with continued study and experimentation it is possible to create complete and flexible data mining programs using PL/SQL. The programs built for this paper are not meant for real-time data mining (as they have some limitations); they were designed to explain and demonstrate that data mining using PL/SQL is possible, which they have done to the maximum extent possible.
If these programs were developed further in an environment where memory and storage space were not an issue, they could be developed into programs usable in almost any situation to fulfil data mining goals and objectives. In conclusion, the programs developed for this paper have achieved their goals by explaining and demonstrating that valid data mining programs can be created in the Oracle 8i PL/SQL environment.

ACKNOWLEDGMENTS

This paper is partially funded by the Victorian Partnership for Advanced Computing (VPAC) Education Program Grant Round 2, EDPNMO006/2001.

REFERENCES

[1] H. Wang, AXL Version 1.2 Documentation, Data Mining Applications in AXL, Association Rules, Apriori Algorithm.
[2] Institute of Knowledge Processing and Language Engineering, Faculty of Computer Science, University of Magdeburg, Germany, Apriori, Find Association Rules/Hyperedges with Apriori Algorithm, Support and Confidence.
[3] J. Fong, Technologies for Mining Frequent Patterns in Large Databases, Dept of Comp. Sc., City University of Hong Kong.
[4] J.S. Park, M.S. Chen and P.S. Yu, Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Mining Association Rules, paperpszszdmtkdetbf.pdf/park97using.pdf
[5] Postech Strategic Management of Information Systems Laboratory, Data Mining.
