SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining 1 Somboon Anekritmongkol, 2 Kulthon Kasemsan 1, Faculty of information Technology, Rangsit University, Pathumtani 12000, Thailand, somboon_a@hotmail.com 2 Faculty of information Technology, Rangsit University, Pathumtani 12000, Thailand, kasemsan@rangsit.rsu.ac.th Abstract Discovery of association rules is an important for Data mining. One of the most famous association rule learning algorithms is Apriori rule. Apriori algorithm is one of algorithms for generation of association rules. The drawback of Apriori Rule algorithm is the number of time to read data in the database equal number of each candidate is generate. Many research papers have been published trying to reduce the amount of time needed to read data from the database. In this paper, we propose a new algorithm that will work rapidly and without frequency tree or temporary candidate itemsets in RAM or Hard disk. SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM). This algorithm will generate candidates are greater than minimum support on-the-fly by SQL. This algorithm is based on two major ideas. Firstly, compress data. Secondly, generation of candidate itemsets on-the-fly by SQL. Based on the experimental results, an increase in the number of transactions or the number of items did not affect the speed at which candidates were generated by this algorithm. The construction method of SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM) technique has twenty times higher mining efficiency in execution time than Apriori Rule. 1. Introduction Keywords: Data Mining, Association Rule, Apriori Rule, Frequent Itemset Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. Association rule mining is searches for recurring relationships in a database. One of the most popular technique in association rule mining is Apriori rule [1][2]. Association rule mining is usually associated with huge information. Association rules exhaustively look for hidden patterns, making them suitable for discovering predictive rules involving subsets of data set attributes. Association rule learners are used to discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions), and to discover rules. In my point of view, Firstly, most of information in data set is same pattern. Secondly, amount of time to read the whole database. Thirdly, the pruning candidate in each step of process. This paper proposes the development of algorithm to discover association rules from large amounts of information that is faster than Apriori rule by using SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining(SMILE-ARM). The improvement focuses on compress data and improve performance without generate temporary frequency of Itemsets and reducing the number of times to read data from the database 2. Basic in Association Rule Let D = {T 1, T 2,...,T n } [2] be a set of n transactions and let I be a set of items, I = {I 1, I 2,..., I m }. Each transaction is a set of items, i.e. Ti I. An association rule is an implication of the form X Y, where X, Y I, and X Y = ; X is called the antecedent and Y is called the consequent of the rule. In general, a set of items, such as X or Y, is called an itemset. In this work, a transaction record transformed into a binary format where only positive binary values are included as items. This is done for efficiency purposes because transactions represent sparse binary vectors. Let P(X) be the probability of appearance of itemset X in D and let P(Y X) be International Journal of Information Processing and Management(IJIPM) Volume 4, Number 1, January 2013 doi:10.4156/ijipm.vol4.issue1.9 65
the conditional probability of appearance of itemset Y given itemset X appears. For an itemset X I, support(x) is defined as the fraction of transactions Ti D such that X Ti. That is, P(X) = support(x). The support of a rule X Y is defined as support(x Y) = P(X Y). An association rule X Y has a measure of reliability called confidence (X Y) defined as P(Y X) = P(X Y)/P(X) = support(x Y)/support(X). The standard problem of mining association rules [1] is to find all rules whose metrics are equal to or greater than some specified minimum support and minimum confidence thresholds. A k-itemset with support above the minimum threshold is called frequent. We use a third significance metric for association rules called lift, lift(x Y) = P(Y X)/P(Y) = confidence (X Y)/support(Y). Lift quantifies the predictive power of X Y; we are interested in rules such that lift(x Y) > 1. 3. Apriori rule Apriori is an algorithm proposed by R. Agrwal and R. Srikant in 1994. Apriori rule employs an iterative approach know as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequency 1-itemsets is found by scanning the database to accumulate the count of each time and collecting those items satisfy minimum support. The resulting set is L 1. Next L 1 used to find the set of frequency 2-itemsets, which is used to find, and so on, until no more frequency k-itemsets can be found. The finding of each L k requires one full scan of database. Algorithm: Apriori rule. Find frequent itemsets using an iterative level-wide approach based on candidate generation. Input: 1. D, a database of transaction; 2. min_sup, The minimum support count threshold. Output: L, frequent itemsets in D Method: (1) L1 = Find_Frequent_1-Itemset(D); (2) for (k=2;l k-1 ¹ 0;k++){ (3) C k = apriori_gen(l K -1); (4) for each transaction TÎD { // scan D for count (5) C t = subset(c k,t);// Get subset of t that are candidate (6) for each candidate cî C t (7) c.count++; (8) } (9) L k = {cîc k c.count ³ min_sup} (10) } (11) Return L = È K L K ; Procedure apriori_gen(l k-1 frequent (k-1)-itemsets) (1) for each itemset l 1 Î L k-1 (2) for each itemset l 2 ÎL k-1 (3) if(l 1 [1] = l 2 [1]^(l 1 [2]=l 2 [2]^ ^(l 1 [k-2]= l 2 [k-2])^( l 1 [k-1])< l 2 [k-1] then { (4) C= l 1 >< l 2 ; // join step : generate candidates (5) if has_infrequent_subset(c,l k-1 ) the (6) delete c; // prune step : remove unfruitful candidate (7) else add c to C k ; (8) } (9) return C k ; Procedure has_infrequent_subset(c:candidate k-itemset; L k-1 : frequent(k-1)-itemsets); // use prior knowledge (1) for each (k-1)-subset s of c (2) if s Ï L k-1 then (3) return True; (4) return False; (5) 66
4. Boolean Algebra Boolean algebra, developed in 1854 by George Boole in his book An Investigation of the Laws of Thought. Some operations of ordinary algebra, in particular multiplication xy, addition x+y, and negation -x, have their counterparts in Boolean algebra, respectively the Boolean Operations AND, OR, and NOT also called conjunction x y, disjunction x y, and negation or complement x sometime!x. Some authors use instead the same arithmetic operations as ordinary algebra reinterpreted for Boolean algebra, treating xy as synonymous with x y and x+y with x y. Figure 1. Logic Gate 5. Structured Query Language(SQL) SQL was developed at IBM by Donald D. Chamberlin and Raymond F.Boyce in the early 1970. SQL was designed to manipulate and retrieve data stored in IBM's original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970. SQL is a database computer language designed for managing data in relational database management systems (RDBMS). The Structured Query Language (SQL) defines the methods used to create and manipulate relational databases on all major platforms. SQL Command SELECT INSERT UPDATE DELETE CREATE object ALTER object DROP object Table 1. Basic SQL Command Description Retrieves data from a table. Add new Data to a table. Modifies existing data in a table. Removes existing data from a table. Create a new database object Modifies the structure of an object. Removes an existing database object. SQL query syntax Select [Distinct] <column-name(s), arithmetic expression> From <table-name(s)> [Where <condition>] [Group by <column-name(s)>] [Having <condition>] [Order by <column-name(s)> [ASC/DESC]] SQL Insert syntax Insert into <table-name> [(column-name-1, column-name-2, )] Value (<value-1,value-2, >) 6. SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining(SMILE-ARM) An Algorithm of SMILE-ARM, The step of algorithm is 1. Find frequent itemsets minimum support. 2. Compress data by create a new structure are same pattern of transactions. 3. Generate Candidate by SQL from candidate 2-itemsets until k-itemsets by generate candidate based on actual data in pattern transactions. SMILE_ARM is able to generate any candidate itemsets without previous 67
candidate such as create candidate 4-itemsets do not need to create candidate 3-itemsets or candidate 2- itemsets. SMILE-ARM is able to apply to do classification. Algorithm: Pseudo-code of SMILE-ARM. Tid_cl, a table of transaction Tid_Pattern, a table contain Pattern Data Final_Candidate, the result of candidates Min_sup Tid_Item, Itemset Tid, transaction ID Min_sup, Minimum support count all transactions Procedure Find L1 (1) for each itemset count Tk Tid_cl; (2) Delete Tk < Min_sup; (3) add to Final_Candidate; Procedure Compress Structure (1) for each Tid, Tidk L1 { (2) if Tid <> Tid_Patternk then (3) Create Tid_Patternk;\\ Create Pattern; (4) else (5) Tid_Patternk++; (6) } Procedure SQL_Command(k) (1) Insert into Final_Candidate (Candidate, Count_Items, xdim) (2) Select Tk.Tid_Item+Tk-1.Tid_Item+ +T1.Tid_Item, sum(t1.feq),k (3) From Tid_Pattern as T1, Tid_Pattern as T2,,Tid_Pattern as Tk (4) Where (T1.Tid_Item < T2.Tid_Item and T2.Tid_Item < T3.Tid_Item and Tk-1.Tid_Item < Tk.Tid_Item ) and (T1.Tid=T2.Tid and T2.Tid=T3.Tid and Tk-1.Tid =Tk.Tid) (5) Group by Tk.Tid_item, Tk-1.Tid_item,, T1.Tid_item (6) Having sum(t1.feq) >= min_sup Procedure Generate Associate Data (1) Find L1; (2) Compress Structure; (3) for each (k=2; k = Max_Itemk; k++){ (4) SQL_Model(k); (5) } Example: Minimum support 10 percent = 1.5 Table 2. Transaction Data Trans# Item Trans# Item Trans# Item T001 T001 T001 T002 T002 T003 T003 T004 T004 T004 T004 T005 T005 I5 I4 I4 T006 T006 T007 T007 T008 T008 T008 T009 T009 T009 T010 T010 T010 I4 T011 T011 T011 T012 T012 T013 T013 T013 T014 T014 T014 T015 T015 I4 68
Step 1: Find L1 The algorithm scans all of transactions in order to find frequency item of each itemsets and remove frequency item less than minimum support. Eliminate {I5} lower Minimum Support Table 3. Support Count Itemsets Support count {} 9 {} 14 {} 11 {I4} 4 {I5} 1 Table 4. Result of candidate-1 Itemset Itemsets Support count {} 9 {} 14 {} 11 {I4} 4 Step 2: Compress Data Count transactions are same items such as T003 = {, }, T006 = {, }, T007 = {, }, T012 = {, } and T015 = {, } or {, } = {(T003), (T006), (T007), (T012), (T015)} = 5. Create pattern is same structure such as T001 create Pattern-1, T002 Pattern-2, {T003, T006, T007, T012, T015} Pattern-3, T004 Pattern-4, {T010, T013 } Pattern-5, T005 Pattern-6 and {T008, T009, T011, T014} Pattern-7. Table 5. Data Compression I4 Count Pattern-1 X X 1 Pattern-2 X X 1 Pattern-3 X X 5 Pattern-4 X X X X 1 Pattern-5 X X X 2 Pattern-6 X X 1 Pattern-7 X X X 4 Step 3: Find Candidate 2-Itemsets to k-itemsets To discover the set of frequency 2-itemsets to k-itemsets by SQL Model, Generate candidate from Pattern table. SQL Command for candidate 2-Itemsets. Insert into CL_Final (Candidate,Count_Items,xdim) Select T2.Tid_Item+T1.Tid_Item, sum(t1.feq),2 From Tid_Pattern as T1, Tid_Pattern as T2 Where (T1.Tid_Item < T2.Tid_Item) and (T1.Tid=T2.Tid) Group by T2.Tid_item, T1.Tid_item Having sum (T1.Feq) >= Min_Support 69
Figure 2. Generate Candidate K-Itemsets Table 6. Result of candidate=2 Itemsets Pattern Data Count Final Candidate Candiates xdim 1 {,} 1 {,} 8 2 2 {,I4} 1 {,} 6 2 3 {,} 5 {,I4} 3 2 4 {,,,I4} 1 {,} 10 2 5 {,,I4} 2 {,I4} 4 2 6 {,} 1 7 {,,} 4 SQL Command for candidate 3-itemsets. Insert into Final_Candidate (Candidate,Count_Items,3) Select T3.Tid_Item +T2.Tid_Item+T1.Tid_Item, sum(t1.feq),2 From Tid_Pattern as T1, Tid_Pattern as T2, Tid_Pattern as T3 Where (T1.Tid_Item < T2.Tid_Item and T2.Tid_Item < T3.Tid_Item) and (T1.Tid=T2.Tid and T2.Tid=T3.Tid) Group by T 3.Tid_item, T 2.Tid_item, T 1.Tid_item Having sum (T1.Feq) >= Min_Support Attribute of <xdim> is for control to generate candidates on each step. Table 7. Generate from Candidate-3 Itemsets Pattern Data Count Final Candidate Candidate xdim 1 {,} 1 {,,} 5 3 2 {,I4} 1 {,,I4} 3 3 3 {,} 5 4 {,,,I4} 1 5 {,,I4} 2 6 {,} 1 7 {,,} 4 Final Result: Candidate-Itemsets ³ Min_Support 70
Itemsets {} {} {} {I4} {,} {,} {,I4} {,} {,I4} {,,} {,,I4} Table 8. Final Result Final Candidate 9 14 11 4 8 6 3 10 4 5 3 7. Experimental results In this section, we performed a set of experiments to evaluate the effectiveness of SMILE-ARM. The experiment dataset consists of two kinds of data. First, data from Phranakorn Yontrakarn Co., Ltd. This company sales and offer car services to discover association data. Second, generate sampling data. The experiment of four criteria, Firstly, Increase amount of records from 10,000 to 50,000 records and fixed 10 items (Figure3, Figure4). Secondly, increase of item and fixed amount of records = 50,000 (Figure5, Figure6). Thirdly, increase amount of records from 10,000 to 190,000 records (Figure7). Fourthly, decrease of minimum support and fixed items and amount of records (Figure8). Experiment 1: Increase number of records. Step 10,000 records. Fixed 10 items. Compare Apriori Rule with SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM). Apriori rule, with increasing amount of record will take longer time. SMILE- ARM has 35 times higher mining efficiency in execution time than Apriori Rule (10 Itemsets, 50,000 records). Figure 3. Data from Phranakorn Yontrakarn Co., Ltd. Itemsets=10, Increase number of records 71
Figure 4. Sampling Data, Itemsets =10, Increase number of records Experiment 2: Increase items. Fixed number of records 50,000 records. Compare Apriori Rule with SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM). Apriori rule, with increasing itemsets will take longer time. SMILE-ARM has 30 times higher mining efficiency in execution time than Apriori Rule (30 Itemsets, 50,000 records). That it mean amount of records will be slightly affected SMILE-ARM but Apriori rule performance takes a lot of time. Figure 5. Data from Phranakorn Yontrakarn, Increase Itemsets, Fixed number of records = 50,000 records Figure 6. Sampling Data, Number of records=50,000 records, Increase Itemsets 72
Experiment 3: Increase number of records from 10,000 to 190,000 records. Compare Apriori Rule with SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM). Real-life data from Phranakorn Yontrakarn Co., Ltd. shows effective performance between Apriori rule and SMILE-ARM as 190,000 records, Apriori rule takes 238 seconds but SMILE-ARM takes only 6 seconds on processing time. SMILE-ARM has 39 times higher mining efficiency in execution time than Apriori Rule (10 Itemsets, 190,000 records). Figure 7. Data from Phranakorn Yontrakarn Co., Ltd. Increase number of recoords and Increase Itemsets Experiment 4: Decrease of minimum support from 60% to 10%. The step to change minimum support, Apriori rule low minimum support takes time to process 464 sec. but SMILE-ARM takes time to process 3 sec. SMILE-ARM has 154 times higher mining efficiency in execution time than Apriori Rule (10 Itemsets, 20,000 records, density 80%). SMILE-ARM slightly affects the performance because SMILE-ARM compresses data, SQL is generate candidate direct to database without temporary candidates. Figure 8. Performance of Apriori Rule and SMILE-ARM Table 9. Number of candidates Minimum Support (%) Number of Candidates 10 1023 20 1012 30 847 40 509 50 175 60 56 The result of experiments, SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM) discovers an association is faster than 73
Apriori rule. If increasing the number of records, Apriori rule will take time to read the whole data. If increasing the number of items, Apriori rule will create more candidates depending on the number of items but SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM) takes shorter time because it will compress data and SQL is generate candidate on-the-fly direct to database without temporary candidates. 8. Conclusion The paper proposes a new association rule mining theoretic models and designs a new algorithm based on established theories. SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM) is compresses data and processes only the actual data by SQL command. SQL is very effective in terms of performance. SMILE-ARM is able to discover data more than twenty times faster than Apriori rule. From the experiments, we found that the number of records and number of Items would not affect the performance of SQL Model in Language Encapsulation and Compression Technique for Association Rules Mining (SMILE-ARM). The results is a especially amplified in itemsets from experiment 4 where we used data from 10,000 records, 10 Items and density 80%. SMILE-ARM has proven to have the best performance. We believe that SMILE-ARM is an important way to find Association data or Data mining and the best of algorithm for classification. 9. References [1] Agrawal R., Imielinski T. and Swami A., Mining association rules between sets of items in large databases, In Proceedings of ACM SIGMOD International Conference on Management of Data Washington D.C., USA, pp. 207 216, 1993. [2] Ramakrishnan Srikant and Rakesh Agawal, Mining Generalized Association Rules, In Proceedings of 21th VLDB Conference Zurich, Swizerland, pp. 407-419, 1995. [3] Fukuda, T., Morimoto, Y. Morishita, S. and Tokuyama T., Data Mining using Two- Dimensional Optimized Association Rules Scheme, Algorithms, and Visualization, In Proceedings of ACM - SIGMOD International Conference on the Management of Data, pp. 13-23, 1996. [4] M. Zaki, S. Parthasarathy, M. Ogihara and W. Li. "New Algorithms for Fast Discovery of Association Rules". 3rd International Conference on Knowledge Discovery and Data, Mining KDD 97, Newport Beach, CA, 283 296, 1997. [5] Christian Borgelt, Simple Algorithms for Frequent Item Set Mining, Springer-Verlag, Berlin, Germany, pp.351-369, 2010. [6] Zuling Chen and Guoqing Chen, Building an Association Classifier Based on Fuzzy Association Rules, International Journal of Computational Intelligence Systems, Vol. 1-3, pp. 262-272, 2008. [7] HUANG Liusheng, CHEN Huaping, WANG Xun, CHEN Guoliang, A Fast Algorithm for Mining Association Rules. Journal of Computer Science and Technology, Vol.15, pp. 619-624, 2000. [8] DU XiaoPing, TANG SgiWei and Akifumi Makinouchi, Maintaining Discovered Frequent Itemsets: Cases for Changeable Database and Support, Journal of Computer Science and Technology, Vol.18, pp. 648-658, 2003. [9] S.Prakash and R.M.S.Parvathi, An Enhanced Scaling Apriori for Association Rule Mining Efficiency. European Journal of Scientific Research, Vol.39, pp.257-264, 2010. [10] LI Qingzhong, WANG Haiyang, YAN Zhongmin and MA Shaohan, Efficient Mining of Association Rules by Reducing the Number of Passes over the Database, Journal of Computer Science and Technology, Vol.16, pp. 182-188, 2001. [11] Somboon Anekritmongkol and M.L. Kulthon Kasemsan, The Comparative of Boolean Algebra Compress and Apriori Rule Techniques for New Theoretic Association Rule Mining Model, International Journal of Advancements in Computing Technology, Vol. 3 No. 1, pp. 58-67, 2011. 74
[12] Shui Wang, Le Wang, "An Implementation of FP-Growth Algorithm Based on High Level Data Structures of Weka-JUNG Framework", Journal of JCIT, Vol. 5, No. 9, pp. 287-294, 2010. [13] S. Roy, D. K. Bhattacharyya, "OPAM: An Efficient One Pass Association Mining Technique without Candidate Generation", Journal of JCIT, Vol. 3, No. 3, pp. 32-38, 2008. 75