Adaptive Load Shedding for Mining Frequent Patterns from Data Streams


Xuan Hong Dang 1, Wee-Keong Ng 1, and Kok-Leong Ong 2

1 School of Computer Engineering, Nanyang Technological University, Singapore
{dang0008, awkng}@ntu.edu.sg
2 School of Engineering & IT, Deakin University, Australia
leong@deakin.edu.au

Abstract. Most algorithms for discovering frequent patterns from data streams assume that the machinery can handle all incoming transactions without any delay and without the need to drop transactions. However, this assumption is often impractical due to the inherent characteristics of data stream environments. Especially under high load conditions, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that, in turn, reduce the applicability of the resulting data mining models, which often have only a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested the algorithm on both synthetic and real-life datasets to verify its feasibility.

1 Introduction

Recently, data streams have emerged as a new data type that has attracted much attention from the data mining community. They arise naturally in a number of applications, including financial services (e.g., stock tickers, financial monitoring), sensor networks (e.g., earth sensing satellites, astronomical observatories), and web tracking and personalization (e.g., web log entries or web-click streams) [2].
These stream applications share three distinguishing characteristics that limit the applicability of most traditional algorithms: (i) data arrive continuously at a high and unpredictable rate; (ii) the volume of data is unbounded, making it impractical to store the entire data stream; (iii) decisions are reached on the basis of the data and acted upon in close to real time. Consequently, the main challenge in mining data streams is to develop adaptive algorithms that process stream data in a single pass under constraints on system resources.

Finding frequent item(set)s (or patterns) plays an important role in analyzing data streams [11]. Given a stream of transactions, the goal is to compute all itemsets that occur in at least a given fraction of the stream. Many algorithms addressing this problem have been reported in the literature [11,5,13,8,15,10]. A common characteristic among them, however, is the focus on memory management while assuming that the machinery itself is fast enough to handle all incoming transactions without incurring any unwanted latencies.

A Min Tjoa and J. Trujillo (Eds.): DaWaK 2006, LNCS 4081. © Springer-Verlag Berlin Heidelberg 2006

In practice, this assumption often does not hold, e.g., for data streams generated by a large number of bio-sensors embedded in soldiers' uniforms [12] or by large-scale multi-player online games [4]. These applications are characterized by a large number of push-based data sources and, more importantly, by data rates that can be very high and unpredictable. For instance, the arrival rate of data in a network game is hard to predict because of volatility in the number of players as well as in the game state of each player [4]. In [12,3], the authors have shown that the arrival rate of data streams usually exceeds the system capacity despite all efforts to scale up the processing algorithms. For the problem of mining frequent itemsets, this issue is even more serious because of the huge number of itemsets that need frequency updates. Given a transaction of length $m$, the number of frequent patterns can grow exponentially, up to $2^m$. Completely processing all incoming transactions is therefore generally impractical under high load conditions, and algorithms that mine data streams must cope with system overload.

In this paper, we study the problem of mining frequent patterns over data streams under the assumption that the CPU is a limited resource. When the CPU capacity is exceeded, the system cannot keep up with newly arrived data, so load shedding (discarding some fraction of the unprocessed data) becomes necessary. We propose an algorithm that detects overload situations and selectively drops a fraction of the transactions from the data stream. Specifically, we address and provide solutions to the following questions: (i) How to determine overload situations? (ii) How much load to shed? (iii) How to approximate frequent patterns once load shedding has been introduced?
We adopt an adaptive and self-regulating approach to these questions. The current statistics of the data stream are periodically evaluated to detect overload conditions. An adaptive dropping strategy, based on the Hoeffding bound, is then applied to discard transactions from the data stream. We have conducted a set of experiments on both synthetic and real-life datasets to evaluate our approach. The results were very encouraging, even when the data rate exceeded the CPU capacity by an order of magnitude and the underlying distribution was constantly changing.

In Section 2, we formally formulate the problem of load shedding in mining frequent patterns from data streams. The proposed approach is described in Section 3. The experimental results are reported in Section 4. In Section 5, we review related work. Finally, our conclusion is given in Section 6.

2 Problem Formulation

Let $I = \{a_1, a_2, \ldots, a_m\}$ be a set of literals called items. Let $DS = \{t_1, t_2, \ldots, t_N, \ldots\}$ be a data stream where each transaction $t_i$ contains a set of items ($t_i \subseteq I$); $t_N$ is the current transaction and thus $N$ is the current length (or timestamp) of the stream. We denote the frequency of an itemset $X$ by $freq(X)$, which is the occurrence count of $X$ in $DS$ up to the $N$th transaction, and the support of $X$, denoted $supp(X)$, is the ratio of $freq(X)$ to $N$; i.e., $supp(X) = freq(X) \times N^{-1}$. $X$ is called a frequent itemset at the output timestamp $N$ if $supp(X)$ is no less than $\sigma \in (0, 1]$, the minimum support; it is called a maximal frequent itemset (MFI) if, in addition, none of its immediate supersets is frequent.

We formulate the load shedding problem for finding frequent patterns over data streams as follows. We are given a processing capacity (CPU) $C$ of a mining system and a data stream $DS$ with arbitrarily high arrival rates. Let $Load(DS)$ indicate the workload of the system. Load shedding is invoked when $Load(DS) > C$. The objective is to construct an adaptive algorithm that can detect overload and drop a fraction of the transactions so as to guarantee $Load(DS) \le C$, and yet discover a set of patterns that closely approximates the set of actual frequent itemsets in the data stream.

3 Adaptive Load Shedding in Data Streams

3.1 Overload Detection

The system workload depends on the time needed to process each transaction, which in turn depends mainly on the number of itemsets contained in the transaction whose frequencies must be updated. Unfortunately, we cannot know how much time is needed to process a transaction without knowing exactly how many frequent itemsets it contains, and under a high-speed data stream we are not able to process all transactions to find out. Thus, to quickly estimate the system workload, we propose an approximate method that relies on maximal frequent itemsets (MFIs). There are three main reasons for using MFIs for this task. First, the set of MFIs implicitly contains all frequent itemsets, since every frequent itemset is a subset of some MFI. Therefore, updating an MFI also updates all its subsets, which are also frequent. Second, the number of MFIs is significantly smaller than the number of frequent itemsets [14].
Indeed, the set of MFIs provides the most compact representation of all frequent patterns. Third, by definition, the support of MFIs is always closest to $\sigma$. Consequently, the set of MFIs essentially reflects the current content of the data stream.

Let $k$ be the number of MFIs in a transaction and let $X_i$, $1 \le i \le k$, be an MFI. We derive the estimated time (i.e., the load coefficient) to process one transaction:

$$L = \sum_{i=1}^{k} 2^{|X_i|} - \sum_{i,j=1}^{k} 2^{|X_i \cap X_j|} \qquad (1)$$

The first summation in the equation estimates the number of frequent itemsets within each MFI. The second one estimates the itemsets shared among MFIs, which would otherwise be counted more than once. In practice, we can ignore all MFIs whose length is only 1 or 2, because the number of itemsets in each of these short MFIs is negligible compared to that in a longer one. For example, if a transaction contains an MFI of length 10, the total number of itemsets whose frequencies need updating is on the order of $2^{10}$, whereas if the transaction contains only MFIs of length 1, this number is at most equal to the transaction length. Therefore, we can quickly estimate the processing time of a transaction by comparing it with a small set of MFIs.

Suppose we measure the above statistics for $n$ transactions over one time unit. Let $r$ be the current rate of the data stream (i.e., the number of transactions arriving in one time unit). We introduce the following inequality:

$$r \times \frac{\sum_{i=1}^{n} L_i}{n} \le C \qquad (2)$$

The left-hand side gives the estimated workload during one time unit. $L_i$ is calculated from Equation 1. $C$, as formulated above, is the processing capacity of the mining system. We assume that the mining system is overloaded whenever this inequality does not hold.

3.2 Load Shedding by Sampling Transactions

To estimate how much load to shed, we rely on Inequality 2. Let $P$ be a parameter expressing the fraction of transactions that should be retained, i.e., the sampling rate (a fraction $1 - P$ is discarded). Then $P$ must satisfy:

$$P \times r \times \frac{\sum_{i=1}^{n} L_i}{n} \le C \qquad (3)$$

If $P = 1$, there is no need to shed load. Otherwise, the maximal value of $P$ for which the inequality still holds is chosen. Suppose $P < 1$; we may then use the following approach to discard transactions and to approximate frequent patterns.

We apply a classical statistical result, the Hoeffding bound [9]. Consider the situation in which we randomly draw $n$ transactions from a dataset and estimate the true support $p$ of itemset $X$ in this dataset (i.e., $supp(X) = p$). We assume that the occurrence of $X$ in a transaction is a Bernoulli trial and define a random variable $A_i = 1$ if $X$ occurs in the $i$th transaction and $A_i = 0$ if not. Obviously, $\Pr(A_i = 1) = p$ and $\Pr(A_i = 0) = 1 - p$.¹ Hence, the $n$ randomly drawn transactions are regarded as $n$ independent Bernoulli trials. Let $r$ be the number of times that $A_i = 1$ occurs in these $n$ transactions (here $r$ denotes this count, not the stream rate of Inequality 2); $r$ is a binomial random variable and thus its expectation is $np$.
Then, the Hoeffding bound states that for any $\epsilon$, $0 < \epsilon < 1$:

$$\Pr\{|r - np| \ge n\epsilon\} \le 2e^{-2n\epsilon^2} \qquad (4)$$

Let $supp_E(X) = r/n$ be the estimated support of $X$ computed from the $n$ sampled transactions. Equation 4 bounds the probability that the true support $supp(X)$ of $X$ deviates from its estimated support $supp_E(X)$ by $\pm\epsilon$ or more. If we want this probability to be no more than $\delta$, then the required number of sampled transactions is at least (by setting $\delta = 2e^{-2n_0\epsilon^2}$):

$$n_0 = \frac{1}{2\epsilon^2} \ln\frac{2}{\delta} \qquad (5)$$

¹ We use $\Pr(\cdot)$ to denote the probability of a condition being met.
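As a worked illustration of Equation 5, the following minimal Python sketch computes the sample batch size for a given error tolerance and failure probability. The function names `hoeffding_sample_size` and `hoeffding_failure_bound` are hypothetical and chosen only for this example; rounding up to the next integer is an assumption, since the paper states the bound only as a real-valued quantity.

```python
import math


def hoeffding_failure_bound(n: int, epsilon: float) -> float:
    """Right-hand side of Equation 4: 2 * exp(-2 * n * epsilon^2)."""
    return 2.0 * math.exp(-2.0 * n * epsilon ** 2)


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer n0 satisfying Equation 5 (rounded up; an assumption)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # A tight error tolerance makes the batch very large (millions of
    # transactions), which is the concern raised later in Section 3.2.
    print(hoeffding_sample_size(0.001, 0.01))
    # Relaxing epsilon by a factor of 10 shrinks n0 by a factor of 100.
    print(hoeffding_sample_size(0.01, 0.01))
```

Because $n_0$ grows with $1/\epsilon^2$, halving the tolerated support error quadruples the batch size, whereas tightening $\delta$ only adds a logarithmic factor.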

If the data stream is uniformly distributed, then $n_0$ transactions reflect the same statistical information as the entire data stream. Hence, processing these $n_0$ transactions gives us a set of patterns that closely approximates (within $(\epsilon, \delta)$) the set of actual frequent itemsets over the entire data stream. Unfortunately, this assumption is often unrealistic in stream environments. Rather, when the data rate varies significantly, we expect the underlying distribution to change as well.

When the workload changes, the corresponding value of $P$ must be determined. Then, each incoming transaction is chosen with probability $P$ until we have sampled $n_0$ transactions, which together are called a sample batch. All frequent itemsets in this sample batch are then discovered. We call them local patterns because they are found only within part of the stream. This procedure is repeated until the system workload changes to another level. By the Hoeffding bound, we are guaranteed that the true support of each local pattern is close to its estimated support computed from these $n_0$ transactions.

We now address the problem of how to report the global frequent itemsets of the entire data stream. For ease of explanation, we ignore the error introduced by applying the Hoeffding bound (i.e., $\epsilon = \delta = 0$). It is easy to prove that a global pattern must be locally frequent in at least one part of the data stream. Therefore, we can safely report all local frequent itemsets as an approximate set of all global patterns. However, because the stream is not uniformly distributed, this approximation will clearly produce many false global patterns that are frequent only locally.
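The load estimate of Equation 1, the overload test of Inequality 2, the choice of $P$ from Inequality 3, and the batch sampling step described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the set of MFIs is assumed to be supplied by the caller, the second sum of Equation 1 is taken over unordered pairs $i < j$, and $P$ is treated as the probability of keeping a transaction (consistent with "$P = 1$ means no shedding").

```python
import random
from itertools import combinations


def load_coefficient(transaction, mfis, min_len=3):
    """Equation 1 estimate for one transaction.

    Only MFIs fully contained in the transaction are counted; MFIs shorter
    than min_len are ignored, following the remark that MFIs of length 1 or 2
    contribute negligibly. Pairs i < j in the second sum are an assumption.
    """
    items = frozenset(transaction)
    contained = [frozenset(m) for m in mfis
                 if len(m) >= min_len and frozenset(m) <= items]
    own = sum(2 ** len(m) for m in contained)
    shared = sum(2 ** len(a & b) for a, b in combinations(contained, 2))
    return own - shared


def sampling_rate(coefficients, rate, capacity):
    """Largest P in (0, 1] with P * rate * mean(L_i) <= capacity (Inequality 3)."""
    if not coefficients:
        return 1.0
    workload = rate * sum(coefficients) / len(coefficients)  # Inequality 2, left side
    if workload <= capacity:
        return 1.0  # no overload, keep every transaction
    return capacity / workload


def sample_batch(stream, p, n0, rng=None):
    """Keep each incoming transaction with probability p until n0 are collected."""
    rng = rng or random.Random()
    batch = []
    for transaction in stream:
        if rng.random() < p:
            batch.append(transaction)
            if len(batch) == n0:
                break
    return batch
```

For instance, with measured coefficients averaging 10, a rate of 100 transactions per time unit, and a capacity of 500, the estimated workload is 1000 and the sketch returns $P = 0.5$, so roughly every second transaction is kept.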
One way to reduce the number of false global patterns is to keep the maximal support error of each pattern within a threshold $\sigma_0$ ($< \sigma$), called the significant support, and to classify an itemset $X$ as frequent if $supp(X) \ge \sigma$, infrequent if $supp(X) \le \sigma_0$, and sub-frequent otherwise. Collectively, frequent and sub-frequent itemsets are called significant patterns.

With the introduction of $\sigma_0$, we need to revisit the choice of $n_0$. So far, $n_0$ is computed from Equation 5. However, this value cannot be chosen arbitrarily: if $\epsilon$ is too small, $n_0$ will be very large. For instance, with $\epsilon = 0.001$ and $\delta = 0.01$, $n_0 \approx 2{,}600{,}000$ transactions, which is too many to buffer in main memory. On the other hand, $n_0$ cannot be too small either, since together with $\sigma_0$ it controls the number of significant patterns. For instance, if every itemset appearing in more than 0.01% of the sample batch is significant and $n_0 = 10{,}000$, then the threshold $\sigma_0 n_0$ is a single occurrence and practically every itemset will be chosen. This number is extremely large because of the exponential explosion of itemsets. In view of this, we select $n_0 = \max\{\frac{\beta}{\sigma_0};\ \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}\}$, where $\beta$ is a sufficiently large integer ($\beta = 4$ in the experiments of Section 4.1).

3.3 The Algorithm

With the above analysis, this section presents our algorithm, named Load Shedding for mining Frequent Itemsets (LSFI). We use a prefix tree $S$ to maintain significant itemsets. Initially, $S$ is empty. Each node in $S$ corresponds to an itemset $X$ and has the following fields: (1) Item: the last item of $X$, so that $X$ is represented by the set of items on the path from the root to the node; (2) Acnt: the accumulated frequency of $X$ seen so far in the data stream; (3) Bid: the index of the sample batch at which $X$ was inserted into $S$.

The algorithm receives the following parameters: the processing capacity $C$; the data stream $DS$; the minimum support threshold $\sigma$; and the significant support threshold $\sigma_0 \in (0, \sigma]$. Load shedding is invoked when the system workload exceeds $C$. On demand, the algorithm returns an approximate set of the frequent itemsets seen so far in the data stream. LSFI includes the following steps:

1. Before processing the data stream, the algorithm initializes the sample batch index $b_{curr} = 0$ and computes the sampling size $n_0 = \max\{\frac{\beta}{\sigma_0};\ \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}\}$.

2. Periodically, it estimates the workload of the system and identifies an appropriate sampling rate $P$. If the workload is no more than $C$, $P$ is set to 1. Otherwise, $P$ is chosen such that Inequality 3 is satisfied.

3. Each time a transaction $t_N$ arrives, LSFI samples it with probability $P$.

4. When $n_0$ transactions have been sampled, LSFI first increases the sample batch index by 1. Then, all significant itemsets in this sample batch are mined. Any $X$ whose frequency in this batch, denoted $Ccnt(X)$, is greater than $\sigma_0 \times n_0$ is viewed as a significant itemset. For each such itemset $X$: if $X$ is already maintained in $S$, update $Acnt(X)$ by adding $1/P \times Ccnt(X)$. Note that, to compensate for the transactions dropped because of $P$, the frequency of each itemset must be scaled up by $1/P$ to approximate its true frequency in the current part of the stream. Otherwise, if $X$ is not in $S$, create a new node for $X$ with $Acnt(X) = 1/P \times Ccnt(X)$ and $X.Bid = b_{curr}$. After that, LSFI traverses $S$ to prune all infrequent itemsets, i.e., those with $Acnt(X) \le (b_{curr} - X.Bid) \times \sigma_0 \times n_0$. This condition deserves some explanation. First, the frequency needed to make $X$ significant in each sample batch of size $n_0$ is more than $\sigma_0 \times n_0$.
Since $X$ was inserted at batch $X.Bid$, the frequency it must accumulate to remain significant up to the current sample batch must be more than $(b_{curr} - X.Bid) \times \sigma_0 \times n_0$. On the other hand, $Acnt(X)$ is its frequency accumulated since it was inserted at $X.Bid$. Therefore, if this value is no more than $(b_{curr} - X.Bid) \times \sigma_0 \times n_0$, $X$ is no longer significant. When $X$ is removed, all its supersets are also removed. For the next sample batch, LSFI updates the index by $b_{curr} = N/n_0$.

5. When a user requests the results, LSFI scans $S$ and outputs all itemsets $X$ with $Acnt(X) + X.Bid \times \sigma_0 \times n_0 \ge \sigma \times N$. Note that $X.Bid \times \sigma_0 \times n_0$ is the maximal frequency that $X$ may have lost through the pruning step described above.

4 Performance Results

We implemented LSFI in C++ and performed experiments on a 1.9GHz Pentium machine with 1GB of main memory running Windows XP. To verify the feasibility of LSFI, both synthetic and real-life datasets are used. Using the method described in [1], we generate two datasets of 1 million transactions each over 10,000 unique items. The first one has an average transaction size of 5 items and an average pattern size of 3. The second one has an average transaction size of 8 and an average pattern size of 4. We denote the two datasets by T5.I3.D0K and T8.I4.D0K, respectively. As a real-life dataset, we use the KDD Cup 2000 BMS-POS dataset, which contains 515,597 transactions with 1,657 distinct items and an average of 6.5 items per transaction. Since our algorithm is probabilistic, both recall and precision are measured. For the same reason, each experiment is repeated 10 times for each parameter combination and the averages are reported.

4.1 Accuracy Measurements

In our experiments, we fix $\epsilon = 0.01$ and $\delta = 0.01$; accordingly, $n_0 \approx 25$K. We also select $\beta = 4$ and $\sigma_0 = 0.1\sigma$. Thus, the final value of $n_0$ is $\max\{\frac{\beta}{0.1\sigma};\ 25\text{K}\}$. For each experiment, $C$ is fixed but the system workload, expressed as a multiple of the CPU capacity, is varied from 2 to 10. For example, a workload of 2 corresponds to a stream rate twice as high as the highest rate the CPU can sustain without load shedding.

Figure 1 shows our experimental results on the three selected datasets, where $\sigma$ is varied from 0.1% to 0.8%. For the two synthetic datasets, no (or very few) frequent itemsets are found at $\sigma > 0.4\%$; hence, only smaller values of $\sigma$ are considered for them. As expected, at lower levels of workload, LSFI finds a larger share of the true frequent patterns, indicated by the high recall, and reports fewer false frequent patterns, indicated by the high precision.
Nevertheless, the interesting point is that, at all levels of $\sigma$, the algorithm still finds more than …% of all the true frequent itemsets and keeps the percentage of false frequent itemsets below 10%, even when the system workload is 10 times higher than $C$.

For the same range of $\sigma$, more detailed results are reported in Figure 2, where we plot precision and recall for each itemset length. Due to space constraints, only the results on the real-life dataset (with pattern length 4) are reported (see [7] for results on the other datasets). The figure shows that precision decreases as the length of the itemsets increases. This happens because the support of longer itemsets tends to be closer to $\sigma$. According to our approximation, all itemsets whose support is greater than $(\sigma - \sigma_0)$ are reported as frequent patterns. Therefore, precision is often lower for longer patterns, since many of them have a true support in the range $(\sigma - \sigma_0, \sigma)$. This effect is more visible when $\sigma$ is smaller (indicated by the steeper slope), because lower supports yield more long frequent patterns. Note that the recall is not affected by this approximation, as the true frequency of the patterns is guaranteed by the Hoeffding bound, which depends only on the number of sampled transactions. When the workload level is 2, the recall is always higher than 97% for every itemset length.
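For readers who want to reproduce the two accuracy measures used in this section, the following minimal Python sketch computes precision and recall of a reported set of itemsets against the set of truly frequent itemsets. The function name `precision_recall` and the convention of returning 1.0 for an empty denominator are assumptions made for this example; the paper does not spell out its evaluation code.

```python
def precision_recall(reported, actual):
    """Return (precision, recall) of reported itemsets against the true ones.

    precision = correctly reported itemsets / all reported itemsets
    recall    = correctly reported itemsets / all truly frequent itemsets
    Itemsets are compared as sets, so item order does not matter.
    """
    reported_set = {frozenset(x) for x in reported}
    actual_set = {frozenset(x) for x in actual}
    hits = len(reported_set & actual_set)
    # Returning 1.0 for an empty denominator is a convention assumed here.
    precision = hits / len(reported_set) if reported_set else 1.0
    recall = hits / len(actual_set) if actual_set else 1.0
    return precision, recall


if __name__ == "__main__":
    reported = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
    actual = [{"a"}, {"b"}, {"b", "a"}, {"d"}, {"e"}]
    print(precision_recall(reported, actual))
```

In this toy example three of the four reported itemsets are truly frequent and three of the five truly frequent itemsets are found, so a false frequent pattern lowers precision while a missed pattern lowers recall.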

[Fig. 1. Accuracy of our algorithm on both synthetic and real-life datasets. Panels: Precision (Percentage) and Recall (Percentage) for T5.I3.D0K, T8.I4.D0K, and BMS-POS.]

4.2 Adaptability

To test the ability of LSFI to adapt to changes, we generate the dataset T5T8.D0K, whose first part consists of 200K transactions taken from T5.I3.D0K and whose second part consists of 800K transactions from T8.I4.D0K. We send the dataset to the system at a rate just below the CPU capacity. When the algorithm progresses to the second part of the dataset, the set of MFIs changes significantly. For example, at $\sigma = 0.1\%$, the number of MFIs of length 5 increased from 14 to 40 and the number of MFIs of length 6 increased from 5 to 9. Furthermore, we also found 1 MFI of length 7 and 2 MFIs of length 8 that did not appear in the first part of the dataset. This is because increasing the average transaction length from 5 to 8 also increases the number of longer MFIs. Consequently, the system needs more time to process these transactions. Once this was detected, $P$ was adjusted correspondingly. Figure 3 shows the accuracy of LSFI in this experiment. We observe that the recall at all $\sigma$ thresholds is still very high (around 95%) and nearly constant. Nevertheless, the precision is slightly lower than on the uniform dataset T8.I4.D0K (which makes up 80% of our new dataset). To explain this, we note that when the size of the transactions increases, LSFI finds more frequent patterns in the second part of the dataset. Because our approximation takes $\sigma_0$ as the maximum support error of any frequent pattern, a small fraction of the itemsets discovered in the second part have an over-estimated frequency. That is, they were (locally) frequent in the second part but not in the first part of the dataset. To reduce this number of false frequent patterns, we can choose a smaller $\sigma_0$. However, even with $\sigma_0 = 0.1\sigma$, the precision is still above …% in all cases; that is, the percentage of false frequent patterns is no more than 10%.

[Fig. 2. Accuracy vs. Itemset Length for the BMS-POS dataset at Load Factor 2. Panels: Precision (Percentage) and Recall (Percentage) against Itemset Length, for min_supp = 0.10%, 0.20%, 0.40%, and 0.80%.]

[Fig. 3. Accuracy on dataset T5T8.D0K. Accuracy (Percentage), shown as Recall and Precision, against Minimum Support (Percentage).]

5 Related Works

In querying data streams, the problem of load shedding is defined as the process of finding an optimal plan for inserting drop operations along existing arcs of a query network. Aurora [12] is one of the first projects to address this issue; it uses different QoS graphs representing the importance levels of the queried objects. Based on these, transactions are dropped progressively, starting from those that carry information about the least important objects. STREAM [3] is another project, in which a load shedding scheme based on sampling is proposed for aggregate queries.
It modifies the query network by inserting load shedder operators, together with their sampling rates, in such a way that the overall sampling drops a sufficient amount of data. This work is similar to ours in that random sampling is used as the means of load shedding. In stream mining, Loadstar [6] is the first work to address the load shedding problem for classification, using a set of Quality of Decision metrics.

Recent work on mining frequent patterns over data streams can be classified into three models: (i) the landmark model, where patterns are discovered between a particular point in time and the current time; Lossy Counting [11] and FDPM [15] are typical algorithms; (ii) the time-fading model, where transactions are weighted according to their arrival time; works on this model include estDec [5] and FP-Stream [8]; (iii) the sliding-window model, which additionally considers the elimination of transactions: as a new transaction arrives, the oldest one in the window is retired; FTP-DS [13] and TSSW [10] are examples. All these works address only the problem of memory limitation in data streams. Our work (on the landmark model) further addresses the load shedding problem.

6 Conclusions

In this paper, we address the problem of finding frequent patterns in data streams where the mining system may not keep up with the arrival rate of the stream. We have proposed an approach that detects overload based on a small set of maximal frequent itemsets. By adopting the Hoeffding bound, we have developed an algorithm that sheds load adaptively by discarding a fraction of the incoming transactions under overload. Experiments on both real-life and synthetic datasets have been conducted to evaluate the proposed algorithm. The results show that both precision and recall remain very high, even when the arrival rate of the data stream is much higher than the capacity of the mining system and skew in the data stream is simulated.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference.
2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS Conference, pages 1-16.
3. B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In ICDE Conference.
4. C. Chambers, W. Feng, S. Sahu, and D. Saha. Measurement-based characterization of a collection of on-line games. In IMC Conference, pages 1-14.
5. J.H. Chang and W.S. Lee. Finding recent frequent itemsets adaptively over online data streams. In ACM SIGKDD Conference, pages 487-4…
6. Y. Chi, P.S. Yu, H. Wang, and R.R. Muntz. Loadstar: A load shedding scheme for classifying data streams. In SIAM Conference.
7. X.H. Dang, W.K. Ng, and K.L. Ong. Adaptive load shedding for mining frequent patterns from data streams. Technical Report, Nanyang Technological University.
8. C. Giannella, J. Han, J. Pei, X. Yan, and P.S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. AAAI/MIT.
9. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30.
10. C.H. Lin, D.Y. Chiu, Y.H. Wu, and A.L.P. Chen. Mining frequent itemsets from data streams with a time-sensitive sliding window. In SIAM Conference.
11. G.S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB Conference.
12. N. Tatbul, U. Çetintemel, S.B. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In VLDB Conference.
13. W.G. Teng, M.S. Chen, and P.S. Yu. A regression-based temporal pattern mining scheme for data streams. In VLDB Conference.
14. G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In ACM SIGKDD Conference.
15. J.X. Yu, Z.C., H. Lu, and A. Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In VLDB Conference, 2004.

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu Department of Computer Science National Tsing Hua University Arbee L.P. Chen

More information

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams * An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams * Jia-Ling Koh and Shu-Ning Shin Department of Computer Science and Information Engineering National Taiwan Normal University

More information

Incremental updates of closed frequent itemsets over continuous data streams

Incremental updates of closed frequent itemsets over continuous data streams Available online at www.sciencedirect.com Expert Systems with Applications Expert Systems with Applications 36 (29) 2451 2458 www.elsevier.com/locate/eswa Incremental updates of closed frequent itemsets

More information

Mining Recent Frequent Itemsets in Data Streams with Optimistic Pruning
