Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees
Shashikiran V 1, Murali S 2


Volume 117 No. 7 2017, 39-46 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu

1 shashikiran@vit.ac.in, 2 murali.s@vit.ac.in
1 School of Information Technology and Engineering, 2 School of Computer Science and Engineering
1,2 VIT University, Vellore, Tamilnadu, India

Abstract

The majority of existing data mining algorithms mine frequent itemsets from precise data. A well-known algorithm is FP-growth, which builds a compact FP-tree structure to capture the important contents of precise data and mines frequent itemsets from the FP-tree. However, there are situations in which data are uncertain. To capture the important contents (e.g., existential probabilities) of uncertain data for mining frequent itemsets, the UF-growth algorithm uses a UF-tree structure. However, the UF-tree can be large. Other tree structures for handling uncertain data may achieve compactness at the expense of looser upper bounds on expected supports. To solve this problem, we propose upper bound tighter item caps (UBTIC) for faster mining of frequent itemsets from uncertain data. Our comparative study shows that UBTIC gives, on average, a 10% better upper bound than the existing formulations in the literature.

1. Introduction

The task of data mining is to discover interesting, unexpected and useful patterns in a large database. Frequent patterns are the building blocks for finding association rules, which reveal interesting relationships between frequent items in a database and have important applications in market basket analysis. Remarkable progress has been made in this field, and a number of effective algorithms have been designed to mine the frequently occurring patterns in a database. Frequent patterns are the sequences or substructures whose number of occurrences in the transactional database is equal to or greater than a user-defined threshold.
Pattern mining algorithms can be applied to various types of data, such as transaction databases, stream databases and sequence databases. In a traditional (certain) transaction database, we simply perform a database scan and count the transactions that include the itemset. This does not work in an uncertain transaction database.

1.1 What is uncertain data?

In databases of precise data, users know definitely whether an item (or an event) is present in, or absent from, a transaction. However, there are situations in which users are uncertain about the presence or absence of some items or events [4,5,10]. For example, a physician may highly suspect (but cannot guarantee) that a patient suffers from flu. The uncertainty of such a suspicion can be expressed in terms of an existential probability. So, in this uncertain database of patient records, each transaction $t_i$ represents a visit to the physician's office. Each item within $t_i$ represents a potential disease and is associated with an existential probability expressing the likelihood of the patient having that disease in $t_i$. For instance, in $t_i$ the patient has an 80% likelihood of having the flu and, regardless of having the flu or not, a 60% likelihood of having a cold. With this notion, each item in a transaction $t_i$ in a traditional database containing precise data can be viewed as an item with a 100% likelihood of being present in $t_i$.

2. Related work and Background

Let (i) Item be the set of $m$ domain items in the transactional database, (ii) $X = \{x_1, x_2, \ldots, x_k\}$ be a $k$-itemset pattern, where $X \subseteq \mathrm{Item}$ and $1 \le k \le m$, and (iii) the database consist of $n$ transactions, where each transaction $t_j \subseteq \mathrm{Item}$. Every item $x_i$ in a transaction $t_j = \{x_1, x_2, \ldots, x_h\}$ is attached with a probability value, known as the existential probability $P(x_i, t_j)$, which represents the probability with which $x_i$ is present in $t_j$. This probability lies between 0 and 1, as shown in (1).

$$ 0 < P(x_i, t_j) \le 1 \qquad (1) $$

The existential probability $P(X, t_j)$ of a pattern $X$ in a transaction $t_j$ is the product of the probability values associated with the items $x$ within $X$, when the items are independent, as shown in (2).

$$ P(X, t_j) = \prod_{x \in X} P(x, t_j) \qquad (2) $$

Here $P(x, t_j)$ is the probability with which $x$ exists in $t_j$. The expected support $\mathrm{expSup}(X)$ of pattern $X$ in the whole database is the sum of the existential probabilities $P(X, t_j)$ of $X$ over all $n$ transactions in the probabilistic database, as shown in (3).

$$ \mathrm{expSup}(X) = \sum_{j=1}^{n} P(X, t_j) = \sum_{j=1}^{n} \prod_{x \in X} P(x, t_j) \qquad (3) $$

The items $x \in X$ in each transaction $t_j$ are assumed to be independent.
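As a concrete illustration of the definitions in (2) and (3), the following minimal Python sketch computes the existential probability of a pattern in one transaction and its expected support over a small database. The toy database, the item names and the function names `pattern_probability` and `expected_support` are assumptions made for this illustration and do not come from the paper; items are assumed independent, as stated above.

```python
from math import prod

# Hypothetical toy database: each transaction maps an item to its
# existential probability P(x, t_j), with 0 < P(x, t_j) <= 1.
DATABASE = [
    {"flu": 0.8, "cold": 0.6},
    {"flu": 0.5, "cold": 0.4, "cough": 0.9},
    {"cold": 1.0},
]


def pattern_probability(pattern, transaction):
    """P(X, t_j): product of the item probabilities, or 0 if an item is absent."""
    if not all(item in transaction for item in pattern):
        return 0.0
    return prod(transaction[item] for item in pattern)


def expected_support(pattern, database):
    """expSup(X): sum of P(X, t_j) over all transactions in the database."""
    return sum(pattern_probability(pattern, t) for t in database)


if __name__ == "__main__":
    # Roughly 0.8 * 0.6 + 0.5 * 0.4 + 0 for the pattern {flu, cold}.
    print(expected_support({"flu", "cold"}, DATABASE))
```

A pattern would then be reported as frequent when the returned value reaches the user-defined minsup threshold.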
A pattern $X$ is said to be a frequent pattern in a probabilistic database if and only if $\mathrm{expSup}(X) \ge \mathrm{minsup}$, where minsup is the user-defined minimum support threshold. Given a probabilistic database and a user-defined threshold, the problem of frequent pattern mining is to extract the complete and accurate set of frequent patterns from the probabilistic database that satisfy $\mathrm{expSup}(X) \ge \mathrm{minsup}$.

2.1 Concept of Item Cap

To mine frequent itemsets, the PUF-growth algorithm [12] uses the concept of an item cap $IC(X, t_j)$ to estimate $\mathrm{expSup}(X, t_j)$ in an ordered transaction $t_j = \{y_1, \ldots, y_{r-1}, y_r, \ldots, y_h\}$, where $X = \{x_1, \ldots, x_k\} \subseteq t_j$ and $x_k = y_r$:

$$ IC(X, t_j) = \begin{cases} P(y_1, t_j) & \text{if } h = 1 \\ P(y_r, t_j) \times M_1(y_r, t_j) & \text{if } h \ge 2 \end{cases} \qquad (4) $$

where $M_1(y_r, t_j) = \max_{1 \le q \le r-1} P(y_q, t_j)$ is the highest existential probability value among all $r-1$ items in the proper prefix $\{y_1, \ldots, y_{r-1}\} \subseteq t_j$.

I. U-Apriori Algorithm

U-Apriori [1] is a modification of the Apriori algorithm [2]. The only difference is that Apriori increments the support count of a candidate pattern by its true support, whereas U-Apriori increments the expected support of a pattern by the product of the probability values associated with the items in the pattern.

II. UF-growth Algorithm

To represent uncertain data effectively, Leung et al. proposed the UF-growth algorithm [6], a variant of the FP-growth algorithm [5] for precise data. The algorithm first scans the probabilistic database to compute the expected support of every singleton item. It then finds all the frequent items, i.e., those whose expected support meets the minimum support threshold, and sorts them in descending order of expected support. In a second scan, the algorithm inserts every transaction into the UF-tree, as is done for the FP-tree. The advantage of UF-growth is that it needs only two database scans to collect the information about the database and avoids the candidate generation step. However, UF-trees are quite large compared with FP-trees, because a UF-tree merges two nodes only if they share the same item name and the same expected support.

III. UFP-growth Algorithm

On the basis of the UFP-tree structure [7], the algorithm iteratively constructs conditional sub-trees and discovers the frequent itemsets based on expected support. In the UFP-growth algorithm, nodes with the same item name and similar existential probability values are grouped together to form a cluster. Depending on the clustering parameter, the UFP-tree may be as large as the UF-tree. Each node in the UFP-tree stores (i) an item name $x$, (ii) the highest existential probability value among all nodes in the cluster, and (iii) an occurrence count.
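Returning to the item cap of Section 2.1, the sketch below computes the cap of the item at position r in an ordered transaction and checks, on a hypothetical transaction, that the cap is never smaller than the true existential probability of any pattern of two or more items ending at that item. The function names `item_cap` and `cap_is_upper_bound`, the list-of-pairs representation and the sample values are assumptions for illustration only. When the prefix is empty (r = 1), this sketch simply returns the item's own probability.

```python
from itertools import combinations
from math import prod


def item_cap(transaction, r):
    """Item cap of the r-th item (1-based) of an ordered transaction.

    `transaction` is a list of (item, probability) pairs in a fixed order.
    With an empty prefix (r == 1) this sketch returns the item's own
    probability; otherwise it returns P(y_r) times the largest
    probability found in the proper prefix y_1 .. y_{r-1}.
    """
    prob_r = transaction[r - 1][1]
    if r == 1:
        return prob_r
    prefix_max = max(p for _, p in transaction[: r - 1])
    return prob_r * prefix_max


# Hypothetical ordered transaction (values invented for illustration).
T = [("a", 0.9), ("b", 0.7), ("c", 0.4), ("d", 0.6)]


def cap_is_upper_bound(transaction):
    """Check cap >= P(X, t_j) for every pattern X with |X| >= 2 ending at y_r."""
    for r in range(2, len(transaction) + 1):
        cap = item_cap(transaction, r)
        prob_r = transaction[r - 1][1]
        prefix = [p for _, p in transaction[: r - 1]]
        for size in range(1, len(prefix) + 1):
            for chosen in combinations(prefix, size):
                # Small tolerance guards against floating-point rounding.
                if prod(chosen) * prob_r > cap + 1e-12:
                    return False
    return True


if __name__ == "__main__":
    print([round(item_cap(T, r), 4) for r in range(1, len(T) + 1)])
    print(cap_is_upper_bound(T))
```

Because every probability is at most 1, multiplying by only the largest prefix probability can never underestimate the product over the pattern's prefix items, which is why the cap is a safe upper bound for itemsets of two or more items.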
IV. UH-Mine Algorithm

The UH-Mine algorithm [8], proposed by Aggarwal, extends the H-Mine algorithm [9] to uncertain data and uses a dynamic linked list to maintain a hyperlinked array structure known as the UH-struct. Like the tree-based approaches, UH-Mine follows the divide-and-conquer approach. Besides the link for each item, the UH-struct stores the probability of the item. First, the UH-Mine algorithm scans the probabilistic database to discover all frequent items based on expected support. Then the algorithm constructs a head table that stores all frequent items based on their expected support. For every item, the head table contains (i) the item name, (ii) the item's expected support and (iii) a pointer. All transactions are inserted into the UH-struct data

structure after the construction of the head table. The algorithm then repeatedly constructs head tables in which distinct itemsets are the prefixes and outputs the itemsets that are frequent.

V. PUF-growth

To further reduce the size of the UFP-tree and to tighten the upper bound on expected support, Leung [11] proposed the prefix-capped uncertain frequent pattern tree (PUF-tree) structure. The tree captures the essential data of the probabilistic database so that frequent patterns can be mined from it. The tree is built by considering an upper bound on the existential probability of every item when generating a k-itemset (where k > 1). Every node in the PUF-tree stores (i) an item name $x$ and (ii) a prefixed item cap (PIC).

VI. TPC-growth

Like all other pattern-growth algorithms, the TPC-growth algorithm [12] scans the database to construct a tree, here a TPC-tree. The first scan finds all distinct frequent items and stores them in a consistent order so that the tree can be built from those frequent items. In the second database scan, the TPC-tree is constructed in a similar manner to the PUF-tree. Then, from the TPC-tree, a projected database is constructed for each potential frequent item, and the frequent itemsets are mined recursively.

3. Improvements in the upper bounds (Our Approach)

3.1 Upper Bound Tightened Item Cap (UBTIC): Design alternatives

Consider the following database, which we use to compare our approach with the existing approaches.

[Example database: transactions (rows) over items A, B, C, D and E, one existential probability per cell; the values are not recoverable.]

Here $t_j$ denotes the current transaction, {A, B, C, D, E} are the items in a transaction, and the value for each item is its existential probability in the database.

Table 1. Item cap value comparison (columns A, B, C, D, E; the cell values are not recoverable). Rows:

  CUF Tree [14]
  PUF Tree
  Tube S/P [13]
  UBTIC-01 (mean of $t_j$ × maximum of $t_j$)
  UBTIC-02 (mean of $t_j$ × product of the two maxima in $t_j$)
  UBTIC-03 (existential probability of the item × maximum probability of $t_j$)

As Table 1 shows, our UBTIC-01, UBTIC-02 and UBTIC-03 caps give lower (tighter) upper bound values than the CUF-tree, PUF-tree, Tube-S and Tube-P caps. UBTIC-03 in particular is tighter than all of the existing item caps compared.

3.2 Frequent itemset mining from uncertain data based on splay trees

Our proposed system uses splay trees to construct the tree. Splay trees have an average access time of O(log n) and, compared with 2-3 trees, give faster access to recently accessed data.

Reading and assigning values for splay trees:

    Input: transaction database
    For each item i
        If (key > 0 && key <= 1)
            assign key to itemset from file (a[i], b[i], ...)
            calculate exp_sup = exp_sup + item(a[i])
            max[i] = a[i] * b[i]
        end
    end

The insertion, deletion and search procedures for splay trees are self-explanatory.

4. Conclusion

In this paper, we proposed upper bound tighter item caps (UBTIC) for faster frequent itemset mining from uncertain data. The algorithms capture important information from uncertain data (e.g., items and their existential probabilities) in tree structures, from which potential frequent itemsets can be mined using tightened upper bounds. The bounds provided by our UBTIC are better than those of the CUF-tree and PUF-tree based algorithms: they are 10%, 12% and 11% better than the upper bounds of the CUF-tree, PUF-tree and Tube [12] based approaches, respectively.

5. References

[1] C. K. Chui, B. Kao, E. Hung, "Mining Frequent Itemsets from Uncertain Data," in Advances in Knowledge Discovery and Data Mining, Springer, 2007.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th VLDB Conference, Santiago, Chile.
[3] T. Calders, C. Garboni, and B. Goethals, "Approximation of frequentness probability of itemsets in uncertain data," in Proc. IEEE ICDM, Sydney, NSW.
[4] C. C. Aggarwal, "On Unifying Privacy and Uncertain Data Models," in IEEE 24th International Conference on Data Engineering, Cancun.
[5] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in ACM SIGMOD International Conference on Management of Data, New York.
[6] C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk, "A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data," in Advances in Knowledge Discovery and Data Mining, Japan, Springer, 2008.
[7] C. C. Aggarwal, Y. Li, J. Wang, and J. Wang, "Frequent pattern mining with uncertain data," in Knowledge Discovery and Data Mining, Paris, France.
[8] Y. Tong, L. Chen, Y. Cheng, P. S. Yu, "Mining frequent itemsets over uncertain databases," Proceedings of the VLDB Endowment, vol. 5, no.
11.
[9] J. Pei, J. Han, H. Lu, et al., "H-mine: hyper-structure mining of frequent patterns in large databases," in Proceedings IEEE International Conference on Data Mining, San Jose, CA.
[10] G. Grahne, J. Zhu, "Fast algorithms for frequent itemset mining using FP-trees," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 10.
[11] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data," in Advances in Knowledge Discovery and Data Mining, Australia, Springer, 2013.
[12] C. K. Leung, R. K. MacKinnon, S. K. Tanbeer, "Tightening upper bounds to the expected support for uncertain frequent pattern mining," in 18th Annual Conference KES, Poland.
[13] C. K.-S. Leung, R. K. MacKinnon, S. K. Tanbeer, "Fast Algorithms for Frequent Itemset Mining from Uncertain Data," IEEE International Conference on Data Mining, 2014.
[14] C. K.-S. Leung and S. K. Tanbeer, "Fast tree-based mining of frequent itemsets from uncertain data," in Proc. DASFAA 2012.
Rajesh, M., and J. M. Gnanasekar. "Congestion control in heterogeneous WANET using FRCC." Journal of Chemical and Pharmaceutical Sciences ISSN 974 (2015).
Rajesh, M., and J. M. Gnanasekar. "A systematic review of congestion control in ad hoc network." International Journal of Engineering Inventions 3.11 (2014).
Rajesh, M., and J. M. Gnanasekar. "Annoyed Realm Outlook Taxonomy Using Twin Transfer Learning." International Journal of Pure and Applied Mathematics (2017).
Rajesh, M., and J. M. Gnanasekar. "Get-Up-And-Go Efficientmemetic Algorithm Based Amalgam Routing Protocol." International Journal of Pure and Applied Mathematics (2017).
Rajesh, M., and J. M. Gnanasekar. "Congestion Control Scheme for Heterogeneous Wireless Ad Hoc Networks Using Self-Adjust Hybrid Model." International Journal of Pure and Applied Mathematics (2017).
