Mining Frequent Itemsets from Uncertain Databases Using Probabilistic Support

Similar documents

Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees. Shashikiran V 1, Murali S 2

Vertical Mining of Frequent Patterns from Uncertain Data

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UDS-FIM: An Efficient Algorithm of Frequent Itemsets Mining over Uncertain Transaction Data Streams

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

Frequent Pattern Mining with Uncertain Data

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

A Survey on Infrequent Weighted Itemset Mining Approaches

Improved Frequent Pattern Mining Algorithm with Indexing

Efficient Pattern Mining of Uncertain Data with Sampling

Generation of Potential High Utility Itemsets from Transactional Databases

An Improved Apriori Algorithm for Association Rules

Review of Algorithm for Mining Frequent Patterns from Uncertain Data

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

An Efficient Algorithm for finding high utility itemsets from online sell

2. Discovery of Association Rules

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Temporal Weighted Association Rule Mining for Classification

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

FP-Growth algorithm in Data Compression frequent patterns

Maintenance of the Prelarge Trees for Record Deletion

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Data Mining Part 3. Associations Rules

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Keywords: Frequent itemset, closed high utility itemset, utility mining, data mining, traverse path. I. INTRODUCTION

Tutorial on Association Rule Mining

Mining High Average-Utility Itemsets

Infrequent Weighted Itemset Mining Using Frequent Pattern Growth

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

Probabilistic Frequent Pattern Growth for Itemset Mining in Uncertain Databases (Technical Report)

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

Efficient Tree Based Structure for Mining Frequent Pattern from Transactional Databases

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

A Review on Mining Top-K High Utility Itemsets without Generating Candidates

Monotone Constraints in Frequent Tree Mining

Item Set Extraction of Mining Association Rule

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Association Rule Mining. Entscheidungsunterstützungssysteme

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

Mining of Web Server Logs using Extended Apriori Algorithm

Nesnelerin İnternetinde Veri Analizi

Association mining rules

ETP-Mine: An Efficient Method for Mining Transitional Patterns

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Performance Based Study of Association Rule Algorithms On Voter DB

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

CSCI6405 Project - Association rules mining

An Algorithm for Frequent Pattern Mining Based On Apriori

Parallel Popular Crime Pattern Mining in Multidimensional Databases

Appropriate Item Partition for Improving the Mining Performance

Mining Frequent Patterns without Candidate Generation

Frequent Itemsets Melange

Efficient Data Mining With The Help Of Fuzzy Set Operations

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

Role of Association Rule Mining in DNA Microarray Data - A Research

This paper proposes: Mining Frequent Patterns without Candidate Generation

A Quantified Approach for large Dataset Compression in Association Mining

Data Mining Query Scheduling for Apriori Common Counting

Association Rule Mining. Introduction 46. Study core 46

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

An Efficient Generation of Potential High Utility Itemsets from Transactional Databases

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Data Mining for Knowledge Management. Association Rules

Web page recommendation using a stochastic process model

Minig Top-K High Utility Itemsets - Report

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

Study on Mining Weighted Infrequent Itemsets Using FP Growth

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Association Rule Mining

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Integration of Candidate Hash Trees in Concurrent Processing of Frequent Itemset Queries Using Apriori

Efficient Frequent Itemset Mining Mechanism Using Support Count

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree

Materialized Data Mining Views *

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

doi: /transinf.E97.D.779

Enhanced SWASP Algorithm for Mining Associated Patterns from Wireless Sensor Networks Dataset

SS-FIM: Single Scan for Frequent Itemsets Mining in Transactional Databases

An Algorithm for Mining Frequent Itemsets from Library Big Data

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Chapter 4: Mining Frequent Patterns, Associations and Correlations

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

Association Rule Discovery

Efficiently Mining Positive Correlation Rules


Mining Frequent Itemsets from Uncertain Databases Using Probabilistic Support

Radhika Ramesh Naik 1, Prof. J. R. Mankar 2
1 K. K. Wagh Institute of Engineering Education and Research, Nasik

Abstract: Mining frequent itemsets is one of the most popular knowledge discovery and data mining tasks. Classical frequent itemset mining algorithms find itemsets in traditional transaction databases, in which the content of each transaction (i.e., its items) is definitely known and precise. However, in many real-life applications, such as location-based services and sensor monitoring systems, the content of transactions is uncertain; this motivates the need for uncertain data mining. Frequent itemset mining in uncertain transaction databases differs semantically and computationally from the traditional techniques applied to standard, certain transaction databases. The existential uncertainty of an itemset, i.e., the probability that the itemset occurs in a transaction, makes the traditional techniques inapplicable. Hence mining methods such as Apriori and tree-based mining need to be modified to handle uncertain data, which exhibits both attribute and tuple uncertainty. This paper introduces techniques for mining frequent itemsets from uncertain databases that make use of the probabilistic support concept, which accounts for the probabilistic aspects of uncertain data completely.

KEYWORDS: Frequent itemsets, uncertain databases, existential probability, Apriori algorithm, FP-tree, incremental mining.

1. Introduction

The goal of knowledge discovery in databases (KDD) is to identify efficient and helpful information in large databases, and many techniques have been proposed for it. Among them, finding association rules from transaction databases is the most common. An important step in the mining process is the extraction of frequent itemsets, i.e., sets of items that co-occur in a major fraction of the transactions.
Apart from market-basket analysis, frequent itemset mining is also a core component of association-rule mining and sequential-pattern mining. Many databases used in important and novel applications are uncertain. For example, data on the locations of users obtained through RFID and GPS systems are not precise due to measurement errors, and data collected from sensors in habitat monitoring systems (e.g., temperature and humidity) may be inaccurate. Customers' purchase behaviours, as observed in supermarket transaction databases, contain statistical information that predicts what a customer may purchase in the future. Such a dataset can thus be considered a collection of tuples/transactions, each containing a set of items associated with existential probabilities of being present. An itemset is considered frequent if it appears in a large number of the transactions, and its occurrence frequency is expressed as a support count. For uncertain databases, however, due to their probabilistic nature, the occurrence frequency of an itemset is expressed by an expected support. Uncertain databases are interpreted using the Possible World Semantics (PWS): a database is viewed conceptually as a set of deterministic instances, called possible worlds, each of which contains a set of tuples. In an uncertain database D consisting of d transactions, a transaction t contains a number of items. Each item x in t is associated with a non-zero probability Pt(x), which indicates the likelihood that item x is present in transaction t. This gives rise to two possible worlds, W1 and W2: in one, item x is present in transaction t; in the other, it is not. From the dataset, the probability of each world being the true world is known.
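The possible-world expansion described here can be sketched in code: an uncertain transaction with independent item probabilities expands into one world per subset of its items, and the world probabilities sum to 1. The item names and probabilities below are illustrative only.

```python
from itertools import combinations

def possible_worlds(transaction):
    """Enumerate the possible worlds of one uncertain transaction.

    `transaction` maps each item to its existential probability P_t(x).
    Assuming independent items, a world's probability is the product of
    P_t(x) for items present and 1 - P_t(x) for items absent.
    """
    items = list(transaction)
    worlds = []
    for k in range(len(items) + 1):
        for present in combinations(items, k):
            p = 1.0
            for x in items:
                p *= transaction[x] if x in present else 1.0 - transaction[x]
            worlds.append((frozenset(present), p))
    return worlds

# Two items x and y, as in the text: 2^2 = 4 possible worlds.
worlds = possible_worlds({"x": 0.8, "y": 0.5})
assert len(worlds) == 4
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9  # probabilities sum to 1
```

With Pt(x) = 0.8 and Pt(y) = 0.5, the world containing both items has probability 0.8 · 0.5 = 0.4, matching the product rule in the text.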
If P(Wi) is the probability that world Wi is the true world, then P(W1) = Pt(x) and P(W2) = 1 - Pt(x). This concept extends to the case in which transaction t contains other items. For example, let y be another item in t with probability Pt(y). If items x and y are observed independently, there are four possible worlds, and the probability of the world in which t contains both x and y is Pt(x) · Pt(y). The concept extends further to datasets containing more than one transaction [7]; thus the number of possible worlds for an uncertain dataset can be exponential, and discovering knowledge from this type of data is a challenging task. Algorithms for precise data are not directly applicable to uncertain data because of its probabilistic nature. Uncertain data has two types of uncertainty associated with it: attribute uncertainty and tuple uncertainty [1]. Mining frequent itemsets from databases follows two basic approaches: Apriori-based mining algorithms [2][3] and tree-structure-based mining algorithms [5][6]. These traditional algorithms are modified to

Volume 2, Issue 2 March April 2013 Page 432

handle the uncertain data. There are two broad types of frequent itemset mining algorithms: the Apriori-based algorithms and the tree-based algorithms. Both make use of the support count; the major difference between them is the mining procedure itself. The Apriori-based algorithms use candidate generation, candidate pruning, and candidate testing phases, whereas the tree-based methods do not involve the candidate generation and pruning phases. In this paper we emphasize the need for the probabilistic support concept in frequent itemset mining from uncertain databases, and we propose its application in both the Apriori-based and the tree-based mining techniques.

The rest of the paper is organized as follows: Section 2 describes the existing techniques used in frequent itemset mining for uncertain databases, along with their limitations and the need for the proposed system. Section 3 describes the proposed system structure. Section 4 presents the analysis, followed by the conclusion in Section 5.

2. Related Work

2.1 UApriori Algorithm

The first expected-support-based frequent itemset mining algorithm was proposed by Chui et al. in 2007. It extends the well-known Apriori algorithm to the uncertain environment and uses the generate-and-test framework to find all expected-support-based frequent itemsets [2][10]. The algorithm first finds all expected-support-based frequent 1-itemsets. Then it repeatedly joins the expected-support-based frequent i-itemsets to produce (i+1)-itemset candidates and tests those candidates to obtain the expected-support-based frequent (i+1)-itemsets. It terminates when no expected-support-based frequent (i+1)-itemsets are generated. The well-known downward closure property also holds in uncertain databases.
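By linearity of expectation, the expected support that U-Apriori works with is the sum, over all transactions, of the product of the probabilities of the itemset's items (under the usual item-independence assumption). A minimal sketch with an illustrative toy database:

```python
def expected_support(db, itemset):
    """Expected support of `itemset` in an uncertain database.

    `db` is a list of transactions, each mapping item -> existential
    probability. Assuming item independence, the probability that an
    itemset occurs in a transaction is the product of its items'
    probabilities, and the expected support is the sum over transactions.
    """
    total = 0.0
    for t in db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)  # item absent => probability 0
        total += p
    return total

db = [{"A": 0.8, "B": 0.2}, {"A": 0.8, "B": 0.7}, {"B": 0.5}]
print(round(expected_support(db, {"A", "B"}), 6))  # 0.8*0.2 + 0.8*0.7 + 0 = 0.72
```

An itemset is expected-support-based frequent when this value is at least minsup, which is exactly the test applied in U-Apriori's candidate-testing phase.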
So the traditional Apriori pruning can be used when checking whether an itemset is an expected-support-based frequent itemset. Thus U-Apriori finds the frequent itemsets using the same candidate generation and candidate pruning steps as Apriori, applied to uncertain databases whose items carry probability values. However, it does not scale well on large datasets: because each item is associated with a probability value, itemsets must be processed together with these values, and efficiency degrades. The problem becomes more serious on uncertain datasets in which most of the existential probabilities are low.

2.2 UApriori Algorithm with Data Trimming

To improve the efficiency of the U-Apriori algorithm, a data trimming technique was proposed [8]. Its basic idea is to trim away items with low existential probabilities from the original dataset and to mine the trimmed dataset instead, so the computational cost of those insignificant candidate increments is reduced. In addition, the I/O cost can be greatly reduced, since the trimmed dataset is much smaller than the original one. The basic Apriori framework must be changed to apply the trimming process. The mining process starts by passing an uncertain dataset D into the trimming module, which first obtains the frequent items by scanning D once. A trimmed dataset DT is constructed by removing the items with existential probabilities smaller than a trimming threshold; DT is then mined using U-Apriori. If an itemset is frequent in the trimmed dataset DT, it must also be frequent in the original dataset D. On the other hand, if an itemset is infrequent in DT, we cannot conclude that it is infrequent in D.
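The trimming module described above amounts to a single pass that drops low-probability items before U-Apriori runs; a minimal sketch, where the threshold value and toy data are illustrative assumptions:

```python
def trim(db, threshold):
    """Build the trimmed dataset DT by removing items whose existential
    probability is below `threshold`.

    Itemsets frequent in DT are guaranteed frequent in D; itemsets
    infrequent in DT must still be verified against D by the pruning
    and patch-up modules (not shown here).
    """
    return [{x: p for x, p in t.items() if p >= threshold} for t in db]

db = [{"A": 0.8, "B": 0.05, "C": 0.9}, {"A": 0.02, "C": 0.9}]
dt = trim(db, 0.1)
print(dt)  # [{'A': 0.8, 'C': 0.9}, {'C': 0.9}]
```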
The role of the pruning module is to estimate an upper bound on the mining error from statistics gathered by the trimming module and to prune the itemsets that cannot be frequent in D. After mining DT, the expected supports of the frequent and potentially frequent itemsets are verified against the original dataset D by the patch-up module.

Table 1: An Uncertain Database

TID | Transaction
T1  | A (0.8), B (0.2), C (0.9), D (0.7), F (0.8)
T2  | A (0.8), B (0.7), C (0.9), E (0.5)
T3  | A (0.5), C (0.8), E (0.8), F (0.3)
T4  | B (0.5), D (0.5), F (0.7)

The approach that considers an itemset frequent if its expected support exceeds minsup has a major drawback. Uncertain transaction databases naturally involve uncertainty concerning the support of an itemset, and considering this uncertainty is important when evaluating whether an itemset is frequent or not; this information is forfeited by the expected-support approach. The confidence with which an itemset is frequent is very important for interpreting uncertain itemsets. Therefore, concepts that evaluate uncertain data in a probabilistic way are required.

2.3 Tree-Based Approaches

The tree-based approaches differ from the Apriori-based ones in that they do not involve the candidate

generation and candidate pruning phases for finding the frequent itemsets; instead they make use of a tree structure to store the data [7]. From the tree structure, the frequent itemsets can be mined using algorithms such as FP-growth. These algorithms are likewise modified for uncertain data.

2.3.1 UF-Growth

A tree-based algorithm, called UF-growth, for mining uncertain data to find the frequent itemsets was proposed by Leung et al. [11]. The algorithm consists of two main operations: (i) the construction of UF-trees and (ii) the mining of frequent patterns from UF-trees. As with many tree-based mining algorithms, a key challenge is how to represent and store the uncertain data in a tree. For uncertain data, each item is explicitly associated with an existential probability ranging from a positive value close to 0 (indicating that the item has an insignificantly low chance of being present in the transaction database) to a value of 1 (indicating that the item is definitely present). Moreover, the existential probability of an item can vary from one transaction to another, and different items may have the same existential probability. To represent uncertain data effectively, a UF-tree, which is a variant of the FP-tree, can be used [4]. Each node in the UF-tree stores: (i) an item, (ii) its expected support, and (iii) the number of occurrences of that expected support for that item. The UF-growth algorithm constructs the UF-tree as follows. It scans the database once and accumulates the expected support of each item; hence it finds all frequent items (i.e., items having expected support >= minsup). It sorts these frequent items in descending order of accumulated expected support.
The algorithm then scans the database a second time and inserts each transaction into the UF-tree in a fashion similar to the construction of an FP-tree, except that the new transaction is merged with a child (or descendant) node of the root of the UF-tree (at the highest support level) only if the same item and the same expected support exist in both the transaction and that child (or descendant) node.

Figure 1: Example of FP-tree

After the UF-tree is constructed, the UF-growth algorithm recursively mines frequent patterns from this tree in a fashion similar to FP-growth, except for the following:
-- When forming a UF-tree for the projected database of a pattern X, it is necessary to keep track of the expected support (in addition to the occurrence count) of X.
-- When computing the expected support of an extension of a pattern X (say, X U {y}), the expected support of y in a tree path must be multiplied by the expected support of X.

Thus the UF-growth algorithm can be used for finding the frequent itemsets. It requires large memory for storing the transactions in tree form, and for uncertain data this may lead to an exponential rise in the space requirement. The algorithms mentioned above also lack an incremental approach. Databases are of an evolving nature, and the frequent itemsets change when new transactions are added; frequent itemset mining algorithms therefore need to be modified for evolving databases. The fast update algorithm computes the frequent itemsets for evolving uncertain databases without rescanning the entire original database [9]. It uses the previously mined frequent itemsets to compute the frequent itemsets on the updated database.
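The UF-tree insertion rule described above (a node is shared only when both the item and the expected support match) can be sketched as follows. This is a simplified structure: a real UF-growth implementation also maintains a header table and performs the recursive conditional-tree mining, which are omitted here.

```python
class UFNode:
    """UF-tree node: an item, its expected support, and an occurrence
    count for that (item, expected support) pair."""
    def __init__(self, item=None, expsup=None):
        self.item, self.expsup = item, expsup
        self.count = 0
        self.children = {}  # keyed by (item, expected support)

    def insert(self, transaction):
        """Insert a transaction given as [(item, expected_support), ...],
        already sorted by the global item order. A child is merged only
        if both the item AND the expected support match; otherwise a
        new branch is created."""
        node = self
        for item, expsup in transaction:
            key = (item, expsup)
            if key not in node.children:
                node.children[key] = UFNode(item, expsup)
            node = node.children[key]
            node.count += 1

root = UFNode()
root.insert([("a", 0.9), ("b", 0.7)])
root.insert([("a", 0.9), ("b", 0.6)])  # same item "b", different expsup -> new branch
```

After these two insertions the node ("a", 0.9) is shared with count 2, while "b" splits into two branches; this is exactly why UF-trees tend to be larger than FP-trees, as the text notes.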
3. Proposed Work

3.1 Apriori Using Probabilistic Support

In uncertain transaction databases, the support of an item or itemset cannot be represented by a single value; rather, it must be represented by a discrete probability distribution. Given an uncertain (transaction) database T and the set W of possible worlds (instantiations) of T, the support probability Pi(X) of an itemset X is the probability that X has support i. The support probabilities of an itemset X for the different support values form the support probability distribution of the support of X. The probabilistic support of an itemset X in an uncertain transaction database T is thus defined by the support probabilities Pi(X) for all possible support values i. We are interested in the probability that an itemset is frequent, i.e., the probability that it occurs in at least minsup transactions. Let T be an uncertain transaction database and X an itemset. P>=i(X) denotes the probability that the support of X is at least i. For a given minimal support minsup, the probability P>=minsup(X), called the frequentness probability of X, denotes the probability that the support of X is at least minsup. Traditional frequent itemset mining is based on support pruning by exploiting the anti-monotonic

property of support: S(X) <= S(Y), where S(X) is the support of X and Y is a subset of X. In uncertain transaction databases, the support is defined by a probability distribution and the itemsets are determined according to their frequentness probability; the frequentness probability is likewise anti-monotonic. Hence the Apriori algorithm can be modified based on the probabilistic frequent itemset mining approach. Like Apriori, this algorithm iteratively generates the probabilistic frequent itemsets using a bottom-up strategy. Each iteration is performed in two steps: a join step for generating new candidates, and a pruning step for calculating the frequentness probabilities and extracting the probabilistic frequent itemsets from the candidates. The pruned candidates are, in turn, used to generate candidates in the next iteration. The basic principle that all subsets of a probabilistic frequent itemset are also probabilistic frequent itemsets is exploited in the join step to limit the candidates generated, and in the pruning step to remove itemsets that need not be expanded.

The Apriori algorithm is limited to static databases, whereas databases undergo a number of update operations. Considering the insertion of tuples as the update operation, the algorithm must be modified to handle such evolving databases. The fast update algorithm for the incremental mining of frequent itemsets can be applied with the concept of probabilistic support based on the Apriori approach. This is essential because databases have an evolving nature, i.e., they change with time, so the frequent itemsets vary as the databases change. The key challenge in incremental mining is maintaining the frequent itemsets as the database is updated; recomputing the frequent itemsets for the updated database from scratch is impractical if the changes are frequent.
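The frequentness probability P>=minsup(X) can be computed exactly from the per-transaction occurrence probabilities of X: with independent transactions, the support of X follows a Poisson binomial distribution, whose pmf is obtained by a simple dynamic program rather than by enumerating the exponentially many possible worlds. A sketch with illustrative probabilities:

```python
def support_pmf(probs):
    """Support pmf of an itemset whose occurrence probability in
    transaction t is probs[t] (independent transactions): a Poisson
    binomial distribution, computed by dynamic programming."""
    pmf = [1.0]  # before any transaction: support 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            nxt[i] += q * (1.0 - p)   # itemset absent from this transaction
            nxt[i + 1] += q * p       # itemset present in this transaction
        pmf = nxt
    return pmf

def frequentness_probability(probs, minsup):
    """P(support >= minsup), the frequentness probability."""
    return sum(support_pmf(probs)[minsup:])

# Itemset occurring with probability 0.9, 0.8, 0.5 in three transactions:
print(round(frequentness_probability([0.9, 0.8, 0.5], 2), 6))  # 0.85
```

The DP runs in O(n · minsup) time if the pmf is truncated at minsup, which is what makes the probabilistic approach practical compared with possible-world enumeration.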
This paper proposes the use of probabilistic support for mining frequent itemsets from uncertain databases using a tree-based approach. It focuses on incremental mining for evolving uncertain databases using both the Apriori-based and the tree-based approach; incremental mining over uncertain databases with a tree-based algorithm is an untouched area, which the paper aims to explore. The type of evolution considered is appending, or inserting, a set of tuples into the database. A PFI (probabilistic frequent itemset) is a set of attribute values that occurs frequently with a sufficiently high probability. The support probability mass function (s-pmf) of a PFI can be modelled as a Poisson binomial distribution for attribute- and tuple-uncertain data; hence the support of an attribute is computed as this pmf. The minimum support is an input from the user, and the minimal support count of database D is computed as:

Msc(D) = minsup × n

where n is the size of D. The frequentness probability of an itemset I is computed from the cumulative distribution function of its support. Using the frequentness probability in PFI testing and applying the uncertainty model to modify Apriori for mining uncertain databases, the PFIs can be computed. For evolving databases, insertions are handled by the incremental mining algorithm. The fast update algorithm for uncertain data extracts the frequent itemsets in an Apriori fashion. It involves three phases: candidate generation, candidate pruning, and PFI testing, and applies a bottom-up approach, so the (k+1)-PFIs are generated from the k-PFIs.

Figure 2: Incremental mining.

The process of incremental mining is:
1. Candidate generation: In the first iteration, the size-1 itemsets that can be 1-PFIs are obtained, using the PFIs discovered from D as well as the delta database d.
In subsequent iterations, this phase produces size-(k+1) candidate itemsets based on the k-PFIs found in the previous iteration. If no candidates are found, the process halts.
2. Candidate pruning: With the aid of d and the PFIs found from D, this phase filters out the candidate itemsets that cannot be PFIs.
3. PFI testing: The itemsets that cannot be pruned are tested to determine whether they are true PFIs. This involves the use of the updated database as well as the s-pmfs of the PFIs on D.

Incremental mining can also be applied using the tree-based approach. An FP-tree is used to store the transaction data. Whenever an increment is applied to the original database, the tree is updated while checking for frequent itemsets in the new transactions. The process makes use of the initially computed frequent itemsets of the original database and compares them with the frequent itemsets of the additional database. It uses the updated FP-tree and applies PFI testing in the mining process. There are four possible combinations to be considered when computing the frequent itemsets of the updated database, as shown in the table.
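The three phases above can be sketched end-to-end as a toy miner over the updated database. This is a simplification: a real fast-update implementation reuses the PFIs and s-pmfs already computed on D instead of re-testing every candidate on db + delta, and the function names, threshold tau, and toy data are all illustrative assumptions.

```python
from itertools import combinations
from math import prod

def frequentness(probs, minsup):
    """P(support >= minsup) for independent per-transaction probabilities
    (Poisson binomial pmf via dynamic programming)."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            nxt[i] += q * (1.0 - p)
            nxt[i + 1] += q * p
        pmf = nxt
    return sum(pmf[minsup:])

def mine_pfis(db, delta, minsup_ratio, tau):
    """Toy PFI miner on the updated database db + delta: bottom-up
    candidate generation, pruning by anti-monotonicity of the
    frequentness probability, and PFI testing against threshold tau."""
    full = db + delta
    msc = max(1, round(minsup_ratio * len(full)))   # Msc(D) = minsup x n
    items = sorted({x for t in full for x in t})

    def is_pfi(itemset):
        probs = [prod(t.get(x, 0.0) for x in itemset) for t in full]
        return frequentness(probs, msc) >= tau

    pfis, level = [], [(x,) for x in items if is_pfi((x,))]
    while level:
        pfis.extend(level)
        prev = set(level)
        # Join step: merge k-itemsets sharing a (k-1)-prefix.
        cands = {tuple(sorted(set(a) | set(b)))
                 for a, b in combinations(level, 2) if a[:-1] == b[:-1]}
        # Prune step: all k-subsets must be PFIs; then PFI-test survivors.
        level = [c for c in sorted(cands)
                 if all(s in prev for s in combinations(c, len(c) - 1))
                 and is_pfi(c)]
    return pfis

db = [{"A": 0.9, "B": 0.8}, {"A": 0.9, "B": 0.4}]
delta = [{"A": 0.8, "B": 0.9}]
print(mine_pfis(db, delta, minsup_ratio=0.5, tau=0.9))  # [('A',)]
```

Here {A} has frequentness probability 0.954 for a support count of 2, while {B} reaches only 0.824 and fails the tau = 0.9 test, illustrating how probabilistic support can reject an itemset that expected support alone might admit.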

Table 2: Cases in incremental mining.

Cases  | Original | New   | Result
Case 1 | Large    | Large | Always large
Case 2 | Large    | Small | Determined from existing information
Case 3 | Small    | Large | Determined by rescanning the original database
Case 4 | Small    | Small | Always small

The support of the inserted items is computed. The FP-tree generated initially is updated with the records/tuples to be inserted, on the basis of the four possible cases listed in the table. The updated FP-tree is generated and the support of the items is updated. The PFI mining process is then applied to the updated tree to mine the frequent itemsets of the updated database.

4. Analysis

The outcome of the Apriori-based algorithm, i.e., the set of probabilistic frequent itemsets (PFIs), is the same as the set of PFIs generated by the tree-based algorithm. The computational complexity is comparatively lower for the tree-based algorithm because of the phases involved in PFI mining: the Apriori-based algorithm involves the three phases of candidate generation, candidate pruning, and testing, whereas the tree-based algorithm involves only the testing phase, so it is computationally more efficient. However, the space requirement of the tree-based algorithm is larger than that of the Apriori-based algorithm.

5. Conclusion

Mining frequent itemsets from uncertain databases with the concept of probabilistic support is more effective than mining with expected support. Apriori modified with this concept computes the PFIs efficiently. The tree-based mining techniques require fewer computations, but their space requirement is high. The paper proposes a novel approach for mining frequent itemsets from evolving databases using the tree structure: an incremental mining method that avoids computing the PFIs from scratch by using the PFIs of the previous database to compute the PFIs of the updated database. It is therefore efficient for databases that undergo small but frequent insertions.
A direction for future research is the reduction of the storage space for the tree by making the transaction data more compact.

REFERENCES
[1] L. Wang, D. W.-L. Cheung, R. Cheng, S. D. Lee, and X. S. Yang, "Efficient Mining of Frequent Item Sets on Large Uncertain Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 12, December 2012.
[2] C.-K. Chui, B. Kao, and E. Hung, "Mining Frequent Itemsets from Uncertain Data," Proc. 11th Pacific-Asia Conf. Advances in Knowledge Discovery and Data Mining (PAKDD), 2007.
[3] T. Bernecker, H. Kriegel, M. Renz, F. Verhein, and A. Zuefle, "Probabilistic Frequent Itemset Mining in Uncertain Databases," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2009.
[4] C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk, "A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data," PAKDD 2008, LNAI 5012, pp. 653-661, 2008.
[5] C. C. Aggarwal and P. S. Yu, "A Survey of Uncertain Data Algorithms and Applications," IEEE Trans. Knowledge and Data Eng., 21(5):609-623, 2009.
[6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. SIGMOD, pp. 1-12, 2000.
[7] Q. Zhang, F. Li, and K. Yi, "Finding Frequent Items in Probabilistic Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2008.
[8] L. Wang, R. Cheng, S. D. Lee, and D. Cheung, "Accelerating Probabilistic Frequent Itemset Mining: A Model-Based Approach," Proc. 19th ACM Int'l Conf. Information and Knowledge Management (CIKM), 2010.
[9] D. Cheung, J. Han, V. Ng, and C. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th Int'l Conf. Data Eng. (ICDE), 1996.
[10] C. Aggarwal, Y. Li, J. Wang, and J. Wang, "Frequent Pattern Mining with Uncertain Data," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2009.
[11] C. K.-S. Leung and D. A. Brajczuk, "Efficient Algorithms for the Mining of Constrained Frequent Patterns from Uncertain Data," ACM SIGKDD Explorations, vol. 11, issue 2.