A binary based approach for generating association rules


Med El Hadi Benelhadj, Khedija Arour, Mahmoud Boufaida and Yahya Slimani

LIRE Laboratory, Computer Science Department, Mentouri University, Constantine, Algeria
National Institute of Applied Science and Technology, Tunis, Tunisia
Computer Science Department, Faculty of Sciences, Tunis, Tunisia

Abstract - Advanced database application areas, such as computer-aided design, office automation, digital libraries, data mining, hypertext and multimedia systems, need to handle complex data structures with set-valued attributes. These information systems contain implicit data that must be extracted and exploited using data mining techniques. To exploit the data from these systems, the choice of appropriate storage structures becomes essential. In this paper, we propose a new compact structure to represent a transaction database, called a signature tree, which speeds up the scanning of the signature file. The construction of this tree requires only a single access to the transaction database. The tree is then used to compute the maximum support, extract frequent itemsets and generate association rules.

Keywords: Data Mining, Frequent itemset, Signature file, Signature tree.

1 Introduction

Knowledge Discovery in Databases (KDD) is the extraction of implicit, previously unknown and potentially useful information stored in large databases. The amount of data collected keeps growing, and its analysis becomes more tedious. Data mining is an essential step in a KDD process. Searching large databases efficiently to extract knowledge that contributes to a decision is vital for any expert. Several methods and techniques are used in the KDD process to extract knowledge from large databases. Mining association rules, which aims to find interesting associations or correlations among large amounts of data, is one of these techniques.
An association rule R is defined as an implication of the form $R: S \rightarrow T$ such that $S \subseteq I$, $T \subseteq I$ and $S \cap T = \emptyset$, I being a set of items. The generation of association rules involves two steps:

1. The extraction of frequent itemsets (with support $\geq$ Minsup),
2. The generation of association rules (with confidence $\geq$ Minconf).

Using this technique, we can generate, from a set of transactions, the frequent itemsets (itemsets with support above a minimum fixed by the user) and then the association rules from these itemsets. The first step is the most expensive, with high demands for computation and data access [] [3]. For this reason, we focus in this paper on counting frequent itemsets. We propose a binary approach for generating frequent itemsets. The transaction database is represented using a new compact data structure based on a signature tree. The use of signatures provides a low storage cost and fast binary operations.

Several algorithms that generate association rules, such as Apriori [1], are based on two sub-steps, "Generate" and "Verify". However, this phase is the most expensive because of the multiple accesses to the transaction database. Other algorithms have tried to improve on Apriori. The Partition algorithm [10] divides the transaction database into partitions, which increases the number of locally frequent itemsets that are globally rare and thus wastes time on redundant computation [3]. The Dynamic Itemset Counting algorithm [4] is a generalization of the Apriori algorithm. FP-Growth [9] extracts the frequent itemsets without generating candidate itemsets. It is based on an FP-Tree structure, which must be completely rebuilt for each update. DFPMT-A [12], which mines frequent itemsets, is based on the Apriori algorithm and uses a dynamic approach like Longest Common Subsequence. In order to compute the support of a collection of itemsets, it is necessary to access the transaction database.
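To make the cost of support counting concrete, the following minimal sketch counts the support of an itemset by scanning every transaction, which is the repeated access that compact structures try to avoid. The toy transactions and the function name `support` are illustrative assumptions and do not come from the paper.

```python
from typing import Iterable, Set


def support(itemset: Set[int], transactions: Iterable[Set[int]]) -> int:
    """Count the transactions that contain every item of `itemset`.

    Each call performs one full scan of the transactions.
    """
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy database, for illustration only.
transactions = [{1, 2, 5}, {2, 4}, {1, 2, 4}, {2, 5}]

print(support({1, 2}, transactions))  # 2
print(support({2}, transactions))     # 4
```

Every candidate itemset triggers a new pass over the data in this naive form, so the cost grows with both the number of candidates and the size of the database.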
As the transaction database is generally large, a solution for avoiding repetitive and costly access is to represent it using compact structures. Examples of such compact structures include BitMap [8], FP-Tree [9], Patricia tree [11], Transposed Form [12] and so on. Standard data structures cannot provide scalability in terms of data size and performance for large databases; we must therefore adopt a binary and compact structure to improve performance and reduce the search space. In this paper, we propose an approach using a binary tree structure to represent the transaction database. Each transaction is represented by a binary signature. The set of signatures is a signature file, which is represented as a signature tree. In the process of generating frequent itemsets, a signature $S_I$ is associated with each itemset I and is constructed in the same way as the transaction signatures. Each transaction signature is associated with the identifier (Tid) of the transaction that generated it. This process constructs the signature transaction tree, a compact structure, finds all the frequent itemsets based on a maximum support, and uses only one access to the transaction database.

The remainder of the paper is organized as follows: Section 2 gives an overview of the concept of the signature tree. Section 3 presents our proposed structure, called STT (Signature Transaction Tree), and the tree construction process. Section 4 gives some basic concepts. In Section 5, we discuss the search process of a signature in STT. Section 6 is devoted to the process of generating frequent itemsets based on STT. In Section 7, we analyze the theoretical complexity of our proposal. In Section 8, we report the experimental results. Finally, Section 9 concludes the paper.

2 Specification of the Signature tree

A signature file can be considered as a set of bit strings, which are called signatures. Several approaches have been proposed to represent the signature file: sequential signature file, bit-slice file and several other variants. The signature file method involves a high processing cost. This problem is resolved by partitioning the signature file or by introducing an auxiliary data structure. Recently, Chen proposed an approach to represent the signature file as a signature tree [5].

Definition 1 [6]: A signature is a binary vector of length m obtained by applying one (or several) hash function(s).

Table I shows an example of signature extraction from a block ("full text scan"). The signatures of the words of the block are combined by superimposing (OR-ing) the word signatures.

TABLE I. Signature Generation (rows: the words "Full", "Text", "Scan" and the resulting block signature; the bit values are missing from the source)

Definition 2 [5]: A signature tree Ts represents a set of signatures $S = \{S_1, \ldots, S_n\}$ where $S_i \neq S_j$ for all $i \neq j$ and $|S_k| = m$ for $k = 1, \ldots, n$. Ts is a binary tree such that: For each internal node of Ts, the left edge leaving it is always labeled with "0" and the right edge is always labeled with "1". Ts has n leaves labeled 1, 2, ..., n, used as pointers to n different signatures $S_1, \ldots, S_n$ in S. Let nf be a leaf node; p(nf) denotes the pointer to the corresponding signature. Each internal node v is associated with a positive number, denoted position(v), which tells which bit will be checked.

3 Structure of STT

Improving the performance of algorithms that discover association rules requires optimizing the extraction phase of frequent itemsets. To reach this objective, we propose the STT structure, which represents the transaction signatures. Each transaction is represented by a signature of size m. STT has the advantage of being both compact (binary representation) and dynamic (it handles updates). A signature tree contains two types of nodes: internal nodes and leaf nodes. For each internal node of STT, the left child corresponds to the value "0" and the right one to the value "1". Each leaf node contains three pieces of information: a signature, the number of transactions generating this signature and the transaction identifiers (Tids). The number of leaf nodes in STT is equal to the number of signatures.

3.1 Example

The construction of a signature tree requires two phases. First, the hash function H(item) = Integer (for example, we use the modulo function) is applied to obtain the signature of each item in a transaction; the composition of these signatures gives the transaction signature. All transactions and their signatures are represented in Table II. We note that the transactions T3 and T6 generate the same signature (the phenomenon of collision). It is represented only once in STT (S3 in our example). Second, each transaction signature is inserted in the STT. Each leaf of this tree contains the signature, the Tids and the number of transactions generating this signature.

TABLE II. Transaction signatures and corresponding STT (columns: Tid, Transactions, Signatures; the cell values and the tree drawing are garbled in the source)

3.2 Construction process of STT

At the beginning, the tree contains an initial node: a node containing the first transaction signature, its Tid and the number "1". Then, we take a new transaction signature, a composition of item signatures, and insert it into the STT. Let s be the signature we wish to insert. We traverse the tree from the root.
Let v be the node encountered and assume that v is an internal node with position(v) = i. Then, s[i] is checked. If s[i] = 0, we go left; otherwise, we go right. If v is a leaf node, we compare s with the signature s' stored in v. If s = s', we only add the Tid to the leaf v and increment the number of transactions nt. Otherwise, s is a new signature. We assume that the first k bits of s agree with s', but s differs from s' in the (k+1)th position. We construct a new node u with position(u) = k+1 and replace v with u; that is, the position of v in the tree is occupied by u and v becomes one of u's children. If s[k+1] = 1, we make v the left child and s the right child of u. If s[k+1] = 0, we make v the right child and s the left child of u.

1) Steps to generate STT

The following steps (a) to (f) show how to create the STT. Step (a) builds a root node r such that r is a leaf node and contains the signature S1, the number "1" and the identifier of the first transaction T1. Steps (b) to (e) each insert a new signature $S_i$ at the corresponding leaf node in STT, using the value of the signature bit at the position stored in each internal node. Step (f) inserts an existing signature. We proceed in the same way as in steps (b) to (e). When we arrive at the corresponding leaf node, we observe that S6 = S3. We increment the number and add the identifier of the transaction T6.

(a) Insert T1 (S1): the root is the leaf {S1, 1, Tid1}.
(b) Insert T2 (S2): S2 and S1 differ at some bit; create an internal node v whose position is that bit and a leaf node {S2, 1, Tid2}.
(c) Insert T3 (S3): the bit of S3 checked on the path is 0, and S3 differs from the signature of the leaf reached; create an internal node v at the first differing bit and a leaf node {S3, 1, Tid3}.
(d) Insert T4 (S4): the bit of S4 checked on the path is 1; create an internal node v at the first bit where S4 differs from the signature of the leaf reached and a leaf node {S4, 1, Tid4}.
(e) Insert T5 (S5): the bits of S5 checked on the path are 0 and then 1; at the first differing bit, the stored signature has 0 and S5 has 1; create an internal node v at that position and a leaf node {S5, 1, Tid5}.
(f) Insert T6 (S6): the two bits of S6 checked on the path are both 0 and S6 = S3; increment the number and add Tid6 to the leaf node, which becomes {S3, 2, Tid3, Tid6}.

Figure 1. Steps of STT Construction (the tree drawings and the bit positions of each step are not recoverable from the source).
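The two phases described above (signature generation, then insertion) can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' C++ implementation: the signature length `M = 8`, the hash `item mod M`, the class names `Leaf` and `Internal`, the helper `leaves`, and the toy database are all hypothetical. Bit positions are 0-based here, whereas the paper counts from 1.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Union

M = 8  # assumed signature length (hypothetical value for m)


def gen_sig(items: Iterable[int], m: int = M) -> str:
    """Superimpose (OR) the item signatures: item x sets bit (x mod m)."""
    bits = ["0"] * m
    for x in items:
        bits[x % m] = "1"
    return "".join(bits)


@dataclass
class Leaf:
    sig: str
    tids: List[int] = field(default_factory=list)  # nt == len(tids)


@dataclass
class Internal:
    position: int   # 0-based index of the bit checked at this node
    left: "Node"    # subtree of signatures with bit 0 at `position`
    right: "Node"   # subtree of signatures with bit 1 at `position`


Node = Union[Leaf, Internal]


def insert(root: Optional[Node], sig: str, tid: int) -> Node:
    """Insert one transaction signature and return the (possibly new) root."""
    if root is None:
        return Leaf(sig, [tid])
    parent, side, v = None, "", root
    while isinstance(v, Internal):          # follow the checked bits of sig
        parent = v
        side = "right" if sig[v.position] == "1" else "left"
        v = getattr(v, side)
    if v.sig == sig:                        # existing signature: record the Tid
        v.tids.append(tid)
        return root
    k = next(i for i in range(len(sig)) if sig[i] != v.sig[i])
    new_leaf = Leaf(sig, [tid])
    # The new internal node u takes the place of leaf v.
    u = Internal(k, v, new_leaf) if sig[k] == "1" else Internal(k, new_leaf, v)
    if parent is None:
        return u
    setattr(parent, side, u)
    return root


def leaves(node: Node) -> List[Leaf]:
    """Collect the leaves from left to right."""
    if isinstance(node, Leaf):
        return [node]
    return leaves(node.left) + leaves(node.right)


# Hypothetical toy database: Tid -> items.
db = {1: [1, 3], 2: [2, 5], 3: [0, 7], 4: [3, 9], 5: [8, 15]}
root = None
for tid, items in db.items():
    root = insert(root, gen_sig(items), tid)

for leaf in leaves(root):
    print(leaf.sig, len(leaf.tids), leaf.tids)
```

In this toy run, transactions 1 and 4 hash to the same signature, as do 3 and 5, so the five transactions end up in three leaves, mirroring the collision case of step (f).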
2) Algorithm to construct STT

The algorithm Cons_STT to construct the tree is presented below. It is composed of two parts:
Signature generation: a hash function Gen_Sig(Ti) provides the signature of the transaction Ti.
Insertion: a call to the Insert(Si) procedure inserts the signature Si in STT.

TABLE III. STT Construction Algorithm

Algorithm Cons_STT /* STT construction */
/* Input: Set of Transactions */
/* Output: STT */
/* v is the current internal node; it contains 3 fields: bit position to check, left child pointer and right child pointer */
/* f is the current leaf node, containing 3 fields: the signature S, Tids and number */
Begin
    S1 = Gen_Sig(T1)
    Build a root node r such that r is a leaf node /* it contains the first transaction signature S1 and the corresponding Tid */
    For i = 2 to n Do
        Si = Gen_Sig(Ti)
        Call Insert(Si)
    End Do
End

The following procedure inserts a given transaction signature into STT:

TABLE IV. Insert algorithm in STT

Procedure Insert(s, STT)
Begin
    Stack ← root
    While Stack not empty Do
        v ← pop(Stack)
        If v is an internal node Then
            j ← position(v)
            If s[j] = 1 Then push(Stack, right_child(v))
            Else push(Stack, left_child(v))
        Else /* v is a leaf node holding s' and nt */
            If s = s' Then /* Old signature */
                nt ← nt + 1 /* the Tid is also added to the leaf */
            Else /* New signature */
                Assume that the first k bits of s agree with s' and that s differs from s' in the (k+1)th position.
                Generate a new internal node u with position(u) = k+1.
                Generate a new leaf node v' ← {s, Tid, 1}
                If s[k+1] = 1 Then v' will be the right child of u and v the left child
                Else v' will be the left child of u and v the right child
    End Do
End

The insertion procedure presents two possible cases:
s = s'. In this case, we add the Tid to the leaf node and increment the number of transactions nt.
s ≠ s'. A new internal node u is created, containing the position of the first differing bit between s and s'. We also create a new leaf node v' containing {s, Tid, 1}. If s[k+1] = 1, we make v' the right child and v the left child of u. If s[k+1] = 0, we make v' the left child and v the right child of u.

4 Basic Concepts

Definition 3: An item $I_i$ is any object, attribute or literal belonging to a finite set of distinct elements $D = \{I_1, I_2, \ldots, I_n\}$.
Definition 4: An itemset I is a subset of D. A k-itemset is an itemset of cardinality k.
Definition 5: A transaction $T_i$ is an itemset which is associated with an identifier: the Transaction Identifier (Tid).
Definition 6: The support of an itemset I, denoted Support(I), is the number of transactions containing I.
Definition 7: A minimum support Minsup is a threshold fixed by the user.
Definition 8: An itemset I is said to be frequent if Support(I) $\geq$ Minsup.
Definition 9: The maximum support Maxsup of an itemset I is equal to the sum of the sizes of the selected Tid sets.
If $S_I$ is the signature of the itemset I and $L_1, L_2, \ldots, L_p$ are the sets of Tids selected during the search for $S_I$, the maximum support is:

$$\mathrm{Maxsup}(I) = |L_1| + |L_2| + \cdots + |L_p|$$

where $|L_i|$ is the number of Tids in the leaf i.

Definition 10: An itemset I is said to be Mfrequent if Maxsup(I) $\geq$ Minsup.

5 Search Process in STT

Now, we discuss how to search for the signature $S_I$ of an itemset I in STT. During the traversal of STT, the inexact matching is done as follows:
1. Let v be the node encountered and position(v) be the position to be checked.
2. If $S_I$[position(v)] = 1, we move to the right child of v.
3. If $S_I$[position(v)] = 0, both the right and the left child of v are explored.

In fact, this process corresponds to the signature matching criterion: for a bit position i in $S_I$, if it is set to "1", the corresponding bit position in s must be set to "1"; if it is set to "0", the corresponding bit position in s can be "1" or "0". This means that only the signatures s such that $s \wedge S_I = S_I$ are selected.

Example 1. The itemset I is composed of three items (their values are garbled in the source). Applying the hash function gives the signature $S_I$. The procedure Search($S_I$) selects all the signatures that contain $S_I$ (in our example, three signatures, one of which is S5). Figure 2 shows in bold the path followed in the STT when searching for $S_I$.

Figure 2. Result of the search for $S_I$
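The matching criterion and the Maxsup sum above can be sketched as follows. This is an illustrative sketch under assumptions, not the paper's implementation: the class names, the helper `contains`, the 8-bit signatures and the hand-built tree are hypothetical, and bit positions are 0-based.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    sig: str
    tids: List[int]


@dataclass
class Internal:
    position: int   # 0-based index of the bit checked at this node
    left: "Node"    # signatures with bit 0 at `position`
    right: "Node"   # signatures with bit 1 at `position`


Node = Union[Leaf, Internal]


def contains(s: str, query: str) -> bool:
    """True if every 1-bit of `query` is also set in `s` (s AND query == query)."""
    return all(b == "1" for b, q in zip(s, query) if q == "1")


def search(root: Node, query: str) -> List[Leaf]:
    """Return the leaves whose signature contains the query signature."""
    selected, stack = [], [root]
    while stack:
        v = stack.pop()
        if isinstance(v, Internal):
            if query[v.position] == "1":
                stack.append(v.right)       # only the "1" branch can match
            else:
                stack.append(v.left)        # a "0" bit constrains nothing,
                stack.append(v.right)       # so both branches are explored
        elif contains(v.sig, query):
            selected.append(v)
    return selected


def maxsup(root: Node, query: str) -> int:
    """Sum of the Tid counts of the selected leaves."""
    return sum(len(leaf.tids) for leaf in search(root, query))


# Hypothetical tree for transactions 1:{1,3}, 2:{2,5}, 3:{0,7}, 4:{3,9}, 5:{8,15}
# hashed with item mod 8.
root = Internal(
    1,
    Internal(0, Leaf("00100100", [2]), Leaf("10000001", [3, 5])),
    Leaf("01010000", [1, 4]),
)

print(maxsup(root, "00000001"))  # 2
print(maxsup(root, "01000000"))  # 2
```

In this toy tree, the query for item 1 (`"01000000"`) selects the leaf shared by transactions 1 and 4, although only transaction 1 actually contains item 1 (item 9 hashes to the same bit). This suggests why the value should be read as an upper bound on the true support.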

5.1 Algorithm to search a signature in STT

This algorithm searches for a signature and computes the maximum support of an itemset using the STT structure.

TABLE V. Search Algorithm in STT

Algorithm Search
/* Input: an itemset I */
/* Output: Maxsup(I) */
Begin
    S_I = Gen_Sig(I)
    ST ← Ø
    push(Stack, root)
    While Stack not empty Do
        v ← pop(Stack)
        If v is an internal node Then
            i ← position(v) /* bit position to check */
            If S_I[i] = 1 Then push(Stack, right_child(v))
            Else
                push(Stack, left_child(v))
                push(Stack, right_child(v))
        Else /* v is a leaf node */
            Compare S_I with the signature S
            If S contains S_I Then ST ← ST ∪ {Tids}
    End Do
    Use ST to compute Maxsup(I)
    Return Maxsup(I)
End

6 Frequent Itemsets Generation

The generation of frequent itemsets computes, for each candidate itemset I, the maximum support Maxsup(I) and compares it to a minimum Minsup defined by the user. An itemset I is said to be frequent if Maxsup(I) is greater than Minsup.

TABLE VI. Frequent Itemset Generation Algorithm

Algorithm Generation_FI
/* Input: Set of itemsets I = {I1, I2, ..., In} */
/* Output: Set of frequent itemsets FI */
Begin
    /* Initially, FI is empty */
    For i = 1 to n Do
        Search(Ii, Maxsup(Ii))
        If Maxsup(Ii) > Minsup Then FI = FI ∪ {Ii} /* Union */
    End Do
    Return FI
End

Example 2. If we consider the transaction database of Table II, a three-item itemset I and the chosen Minsup (both values are garbled in the source), we can select 3 signatures, one of which is S5. Then, Maxsup(I) = 3, which is greater than Minsup. We conclude that the itemset I is frequent.

7 Complexity Study

The algorithm Gen_STT to build the signature tree has a complexity of O(n*m), where n is the number of transaction signatures and m is the size of a signature. On the other hand, the insertion procedure requires one tree traversal for the first signature, two for the second, and so on. The number of paths traversed is: $1 + 2 + \cdots + n = n(n+1)/2 = (n^2 + n)/2$. The complexity of the insertion algorithm is therefore about $O(n^2)$.
The search procedure of a signature in the STT has been studied by Chen [5] and is of order $O(n/2^l)$, where n is the number of signatures and l is the number of bits set to "1" in $S_I$. The Generation_FI algorithm for generating frequent itemsets contains a loop that is run n times (n being the number of candidate itemsets). The cost of handling the candidate itemsets is therefore n times that of the search procedure, which is of order $O(n \cdot (n/2^l)) = O(n^2/2^l)$. Finally, the complexity of generating frequent itemsets is: $O(nm) + O(n^2) + O(n^2/2^l) \sim O(n^2)$.

8 Experimental Study

We have implemented our proposal in C++ on an Intel Core Duo machine (the clock speed, ",80 GHZ", and the RAM size are truncated in the source). In a first experiment, we performed the same test on two transaction databases, one dense and one sparse, varying the minimum support to measure its influence on the total number of frequent itemsets extracted, and compared our results with those obtained by Apriori []. Figure 3 and Figure 4 show the results obtained with the Mushroom and T10I4D100K transaction databases, respectively. The number of frequent itemsets extracted by our approach, based on Maxsup, is about equal to the number of frequent itemsets obtained by Apriori.

Figure 3. Experimentation with a dense database (Mushroom)

Figure 4. Experimentation with a sparse database (T10I4D100K)

In the second experiment, we consider several transaction databases and compute the time required for different values of Minsup.

Figure 5. Time vs Minsup for Mushroom

Figure 5 shows the time vs Minsup for Mushroom obtained by our approach and by Apriori []. Our approach gives a better result below a threshold support and the same result above it (the threshold is printed as "0%" in the source; its leading digit is lost).

Figure 6. Time vs Minsup for Retail

Figure 6 gives the time vs Minsup for Retail obtained by both approaches. Our result is also better for all supports.

Figure 7. Time vs Minsup for Accident

In Figure 7, both approaches take the same time for supports between 50% and 90%, and ours is faster for supports below 50%.

Figure 8. Time vs Minsup for Kosarak

Figure 8 shows that our result is better only for small supports (less than ,5%; the leading digit is lost in the source).

Figure 9. Time vs Minsup for T10I4D100K

Figure 9 shows a linear time for both approaches. Our approach provides a better result than Apriori.

Figure 10. Time vs Minsup for T40I10D100K

In Figure 10, for T40I10D100K, the result is approximately the same for our approach and Apriori.

9 Conclusion

In this paper, we proposed a new data structure to represent transaction databases. The main characteristic of this structure is the use of a binary signature, which can be associated with each transaction. These binary signatures are then organised in a tree, in which each edge is labeled with "0" or "1", and each internal node is associated with a number indicating which bit in a signature to check. Thus, searching for a signature uses only a binary signature tree and needs only one access to the transaction database.

The complexity of our proposal is linear. In order to show the efficiency of our approach, we conducted a series of experiments to compare our proposal with the Apriori algorithm, using different transaction databases and different supports. The results of these experiments show that the signature tree algorithm significantly outperforms Apriori.

10 References

[1] Agrawal R., Srikant R., "Fast Algorithms for Mining Association Rules", In Proceedings of the 20th VLDB Conference, September -5, Santiago, Chile, 1994.
[2] Bodon F., "A Trie-based APRIORI Implementation for Mining Frequent Item Sequences", OSDM05, Proc. of the 1st Int. Workshop on Open Source Data Mining, pp. 5-5, August, Chicago, Illinois, USA, 2005.
[3] Bodon F., "A Fast APRIORI Implementation", Proceedings of FIMI'03, 9th Workshop on Frequent Itemset Mining Implementations, in conjunction with the 3rd IEEE International Conference on Data Mining, pp. -5, November 9, Melbourne, Florida, USA, 2003.
[4] Brin S., Motwani R., Ullman J.D., "Dynamic itemset counting and implication rules for market basket data", In Proceedings of the ACM SIGMOD, pp. 55-, May - 5, Tucson, Arizona, USA, 1997.
[5] Chen Yangjun, Chen Yibin, "On the Signature Tree Construction and Analysis", IEEE Transactions on Knowledge and Data Engineering, vol. 18, Issue 9, pp. 07-, September 2006.
[6] Faloutsos C., "Signature Files: Design and Performance Comparison of Some Signature Extraction Methods", ACM Sigmod Record, pp. 3 8, May 1985.
[7] FIMI repository.
[8] Gardarin G., Pucheral Ph. and Wu F., "Bitmap based algorithms for mining association rules", In Proceedings of the Int. Conf. Bases de Données Avancées, October -30, Hammamet, Tunisia, 1998.
[9] Han J., Pei J. and Yin Y., "Mining frequent patterns without candidate generation", In Proceedings of the 2000 ACM SIGMOD Int. Conf. on Management of Data, Dallas, May -9, 2000.
[10] Savasere A., Omiecinski E., Navathe S., "An efficient algorithm for mining association rules in large databases", In Proceedings of the 21st VLDB Conference, pp 3-, September -5, Zurich, Switzerland, 1995.
[11] Zandolin D. and Pietracaprina A., "Mining Frequent Itemsets using Patricia Tries", In Proceedings of the Workshop on Frequent Itemset Mining Implementations, FIMI03, vol. 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 2003.
[12] Joshi S. and Jain R.C., "A Dynamic Approach for Frequent Pattern Mining Using Transposition of Database", In Proceedings of the International Conference on Communication Software and Networks (ICCSN'10), Feb -8, pp 98-50, Singapore, 2010.
[13] Zaki M.J., Parthasarathy S., Ogihara M. and Li W., "New Algorithms for Fast Discovery of Association Rules", In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), pp 83-8, August 7, Newport Beach, California, USA, 1997.
