Adaptive Load Shedding for Mining Frequent Patterns from Data Streams

Xuan Hong Dang ¹, Wee-Keong Ng ¹, and Kok-Leong Ong ²
¹ School of Computer Engineering, Nanyang Technological University, Singapore, {dang0008, awkng}@ntu.edu.sg
² School of Engineering & IT, Deakin University, Australia, leong@deakin.edu.au

Abstract. Most algorithms that discover frequent patterns from data streams assume that the machinery can handle all incoming transactions without any delay and without the need to drop transactions. This assumption is often impractical because of the inherent characteristics of data stream environments. Under high load conditions in particular, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that, in turn, reduce the applicability of the data mining models produced, which often have only a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested the algorithm on both synthetic and real-life datasets to verify its feasibility.

1 Introduction

Recently, data streams have emerged as a new data type that has attracted much attention from the data mining community. They arise naturally in a number of applications, including financial services (e.g., stock tickers, financial monitoring), sensor networks (e.g., earth sensing satellites, astronomical observatories), and web tracking and personalization (e.g., web log entries or web-click streams) [2].
These stream applications share three distinguishing characteristics that limit the applicability of most traditional algorithms: (i) data arrive continuously at a high and unpredictable rate; (ii) the volume of data is unbounded, making it impractical to store the entire data stream; (iii) on the basis of the data, decisions are arrived at and acted upon in close to real time. Consequently, the main challenge in mining data streams is to develop adaptive algorithms that process stream data in a one-pass manner under constraints on system resources.

Finding frequent item(set)s (or patterns) plays an important role in analyzing data streams [11]. Given a stream of transactions, the goal is to compute all itemsets that occur in at least a given fraction of the stream. Many algorithms addressing this problem have been reported in the literature [11,5,13,8,15,10]. A common characteristic among them, however, is the focus on memory management while assuming that the machinery itself is fast enough to handle all incoming transactions without incurring any unwanted latencies.

(A Min Tjoa and J. Trujillo (Eds.): DaWaK 2006, LNCS 4081, © Springer-Verlag Berlin Heidelberg 2006)

In practice, this assumption often does not hold. Examples include data streams generated from a large number of bio-sensors embedded in soldiers' uniforms [12] and data streams generated by large-scale multi-player online games [4]. These applications are characterized by a large number of push-based data sources and, more importantly, by data rates that can be very high and unpredictable. For instance, the arrival rate of data in a network game is hard to predict because of the volatility in the number of players as well as in the game state of each player [4]. In [12,3], the authors have shown that the arrival rate of data streams usually exceeds the system capacity despite all efforts to scale up processing algorithms. For the problem of mining frequent itemsets, this issue is even more serious because of the huge number of itemsets that need frequency updates. Given a transaction of length $m$, the number of frequent patterns can grow exponentially, up to $2^m$. Completely processing all incoming transactions is therefore generally impractical under high load conditions, and algorithms that mine data streams must cope with system overload situations.

In this paper, we study the problem of mining frequent patterns over data streams under the assumption that the CPU is a limited resource. When the CPU capacity is exceeded, the system cannot keep up with newly arrived data, so load shedding (discarding some fraction of unprocessed data) becomes necessary. We propose an algorithm that detects overload situations and selectively drops a fraction of the transactions from the data stream. Specifically, we address and provide solutions to the following questions: (i) How to determine overload situations? (ii) How much load to shed? (iii) How to approximate frequent patterns once load shedding is introduced?
We adopt an adaptive and self-regulating approach to these questions. The current statistics of the data stream are periodically evaluated to detect overload conditions. An adaptive dropping strategy, based on the Hoeffding bound, is then applied to discard transactions from the data stream. We have conducted a set of experiments on both synthetic and real-life datasets to evaluate the efficiency of our approach. The results were very encouraging even when the data rate exceeds the CPU capacity by an order of magnitude and the underlying distribution is constantly changing.

In Section 2, we formally formulate the problem of load shedding in mining frequent patterns from data streams. The proposed approach is described in Section 3. The experimental results are reported in Section 4. In Section 5, we review related work. Finally, our conclusion is given in Section 6.

2 Problem Formulation

Let $I = \{a_1, a_2, \ldots, a_m\}$ be a set of literals called items. Let $DS = \{t_1, t_2, \ldots, t_N, \ldots\}$ be a data stream where each transaction $t_i$ contains a set of items ($t_i \subseteq I$); $t_N$ is the current transaction and thus $N$ is the current length (or timestamp) of the stream. We denote the frequency of an itemset $X$ by $freq(X)$, which is the occurrence count of $X$ in $DS$ up to the $N$-th transaction, and the support of $X$, denoted $supp(X)$, is the ratio of $freq(X)$ to $N$; i.e., $supp(X) = freq(X) \cdot N^{-1}$. $X$ is called a frequent itemset at the output timestamp $N$ if $supp(X)$ is no less than $\sigma \in (0, 1]$, the minimum support; it is called a maximal frequent itemset (MFI) if none of its immediate supersets is frequent.

We formulate the load shedding problem for finding frequent patterns over data streams as follows. We are given the processing capacity (CPU) $C$ of a mining system and a data stream $DS$ with arbitrarily high arrival rates. Let $Load(DS)$ indicate the workload of the system. Load shedding is invoked when $Load(DS) > C$. The objective is to construct an adaptive algorithm that can detect overload and drop a fraction of transactions to guarantee $Load(DS) \le C$, and yet discover a set of patterns that closely approximates the set of actual frequent itemsets in the data stream.

3 Adaptive Load Shedding in Data Streams

3.1 Overload Detection

The system workload depends on the time needed to process each transaction, which in turn depends mainly on the number of itemsets contained in the transaction whose frequencies must be updated. Unfortunately, we cannot know how much time is needed to process a transaction without knowing exactly the number of frequent itemsets it contains, and under a high-speed data stream we are not able to process all transactions to find out. Thus, to estimate the system workload quickly, we propose an approximate method that relies on maximal frequent itemsets (MFIs). There are three main reasons for using MFIs for this task. First, it is known that the set of MFIs also contains all frequent itemsets. Therefore, updating an MFI also updates all its subsets, which are also frequent. Second, the number of MFIs is significantly smaller than the number of frequent itemsets [14].
In fact, the set of MFIs provides the most compact representation of all frequent patterns. Third, by definition, the support of MFIs is always closest to $\sigma$. Consequently, the set of MFIs essentially reflects the current content of the data stream.

Let $k$ be the number of MFIs in a transaction and let $X_i$, $1 \le i \le k$, be an MFI. We derive the estimated time (i.e., load coefficient) to process one transaction:

\[ L = \sum_{i=1}^{k} 2^{|X_i|} - \sum_{i,j=1}^{k} 2^{|X_i \cap X_j|} \qquad (1) \]

The first summation estimates the number of frequent itemsets within each MFI. The second one estimates the itemsets shared among MFIs. In practice, we can ignore all MFIs whose length is only 1 or 2, because the number of itemsets in each of these short MFIs is negligible compared with that in a longer one. For example, if a transaction contains an MFI of length 10, the total number of itemsets whose frequencies need updating is at least about $2^{10}$, whereas if the transaction contains only MFIs of length 1, this number is at most equal to the transaction length. Therefore, we can quickly estimate the processing time of a transaction by comparing it with a small set of MFIs.

Suppose we measure the above statistics for $n$ transactions over one time unit. Let $r$ be the current rate of the data stream (i.e., the number of transactions arriving in one time unit). We introduce the following inequality:

\[ r \cdot \frac{\sum_{i=1}^{n} L_i}{n} \le C \qquad (2) \]

The left-hand side gives the estimated workload during one time unit. $L_i$ is calculated from Equation 1. $C$, as formulated above, is the processing capacity of the mining system. We assume that the mining system is overloaded when this inequality does not hold.

3.2 Load Shedding by Sampling Transactions

To estimate how much load to shed, we rely on Inequality 2. Let $P$ be a parameter expressing the fraction of transactions that should be kept (the sampling rate); the remaining fraction $1 - P$ is discarded. Then $P$ must satisfy:

\[ P \cdot r \cdot \frac{\sum_{i=1}^{n} L_i}{n} \le C \qquad (3) \]

If $P = 1$, there is no need to shed load. Otherwise, the maximal value of $P$ for which the inequality still holds is identified. Suppose $P < 1$; then we may use the following approach to discard transactions and to approximate frequent patterns.

We apply a well-known statistical result, the Hoeffding bound [9]. Consider the situation in which we randomly draw $n$ transactions from a dataset and estimate the true support $p$ of itemset $X$ in this dataset (i.e., $supp(X) = p$). We assume that the occurrence of $X$ in a transaction is a Bernoulli trial and denote by $A_i$ a random variable with $A_i = 1$ if $X$ occurs in the $i$-th transaction and $A_i = 0$ if not. Obviously, $\Pr(A_i = 1) = p$ and $\Pr(A_i = 0) = 1 - p$, where $\Pr(\cdot)$ denotes the probability of a condition being met. Hence, the $n$ randomly drawn transactions are regarded as $n$ independent Bernoulli trials. Let $r$ (here a count, not the stream rate used above) be the number of times that $A_i = 1$ occurs in these $n$ transactions; $r$ is a binomial random variable and thus its expectation is $np$.
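The overload test and the shedding rate of Equation 1 and Inequalities 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_len` cutoff parameter, and the choice to evaluate the second summation of Equation 1 over unordered pairs $i < j$ are assumptions made for illustration only.

```python
from itertools import combinations


def load_coefficient(transaction, mfis, min_len=3):
    """Estimate the load coefficient L of Equation 1 for one transaction.

    transaction: iterable of items.
    mfis: iterable of frozensets, the currently known MFIs.
    min_len: MFIs shorter than this are ignored (lengths 1 and 2 by default).
    Assumption: the second summation runs over unordered pairs i < j.
    """
    items = set(transaction)
    contained = [m for m in mfis if len(m) >= min_len and m <= items]
    within = sum(2 ** len(m) for m in contained)
    shared = sum(2 ** len(a & b) for a, b in combinations(contained, 2))
    return within - shared


def estimated_workload(rate, coefficients):
    """Left-hand side of Inequality 2: rate times the mean load coefficient."""
    if not coefficients:
        return 0.0
    return rate * sum(coefficients) / len(coefficients)


def sampling_rate(rate, coefficients, capacity):
    """Largest P in (0, 1] satisfying Inequality 3."""
    load = estimated_workload(rate, coefficients)
    if load <= capacity:
        return 1.0  # no overload, keep every transaction
    return capacity / load


if __name__ == "__main__":
    mfis = [frozenset("abc"), frozenset("bcd"), frozenset("xy")]
    coeffs = [load_coefficient(t, mfis) for t in ("abcd", "abcx")]
    print(coeffs, sampling_rate(100, coeffs, 500))
```

With two overlapping MFIs of length 3 that share two items, the estimate is $2^3 + 2^3 - 2^2 = 12$, which matches the number of distinct subsets covered by the two MFIs.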
Then, the Hoeffding bound states that for any $\epsilon$, $0 < \epsilon < 1$:

\[ \Pr\{ |r - np| \ge n\epsilon \} \le 2e^{-2n\epsilon^2} \qquad (4) \]

Let $supp_E(X) = r/n$ be the estimated support of $X$ computed from the $n$ sampled transactions. Equation 4 gives the probability that the true support $supp(X)$ of $X$ deviates from its estimated support $supp_E(X)$ by at least $\pm\epsilon$. If we want this probability to be no more than $\delta$, then the required number of sampled transactions is at least (by setting $\delta = 2e^{-2 n_0 \epsilon^2}$):

\[ n_0 = \frac{1}{2\epsilon^2} \ln\frac{2}{\delta} \qquad (5) \]
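As a quick numerical check of Equation 5, the following Python sketch computes the sample size for given $\epsilon$ and $\delta$. The function name and the rounding up to a whole number of transactions are assumptions added for illustration.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Smallest integer n0 with 2 * exp(-2 * n0 * epsilon**2) <= delta (Equation 5)."""
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for eps in (0.01, 0.001):
        print(eps, hoeffding_sample_size(eps, 0.01))
```

Because $n_0$ grows with $1/\epsilon^2$, tightening $\epsilon$ by a factor of 10 multiplies the required sample size by 100; for $\epsilon = 0.001$ and $\delta = 0.01$ the result is roughly 2.65 million transactions.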

If the data stream is uniformly distributed, then $n_0$ transactions carry the same statistical information as the entire data stream. Hence, processing these $n_0$ transactions gives us a set of patterns that closely approximates (within $(\epsilon, \delta)$) the set of actual frequent itemsets over the entire data stream. Unfortunately, this assumption is often unrealistic in stream environments. Rather, when the data rate varies significantly, we often expect the underlying distribution to change as well.

When the workload changes, the corresponding value of $P$ must be determined. Then, each incoming transaction is chosen with probability $P$ until we have sampled $n_0$ transactions, which together are called a sample batch. All frequent itemsets in this sample batch are then discovered. We call them local patterns because they are found only within part of the stream. This procedure is repeated until the system workload changes to another level. By the Hoeffding bound, we are guaranteed that the true support of each local pattern is close to its estimated support computed from these $n_0$ transactions.

We now address the problem of how to report the global frequent itemsets in the entire data stream. For ease of explanation, we ignore the error produced by applying the Hoeffding bound (i.e., $\epsilon = \delta = 0$). It is easy to prove that a global pattern must be locally frequent in at least one part of the data stream: otherwise its count in every part would fall below $\sigma$ times the length of that part, and so would its total count. Therefore, we can safely report all locally frequent itemsets as an approximate set of all global patterns. However, because the stream is not uniformly distributed, this approximation will clearly produce many false global patterns that are only locally frequent.
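The sampling step described above can be sketched in Python as follows. This is a simplified illustration, not the paper's code: the generator name is hypothetical, and $P$ is held constant here, whereas the approach described in the text re-estimates it whenever the workload changes.

```python
import random


def sample_batches(stream, p, n0, rng=None):
    """Yield sample batches of n0 transactions.

    Each incoming transaction is kept independently with probability p
    (a Bernoulli trial); the rest are shed. A batch is emitted as soon as
    n0 transactions have been kept. Assumption: p stays fixed for the
    whole stream.
    """
    rng = rng or random.Random()
    batch = []
    for transaction in stream:
        if rng.random() < p:
            batch.append(transaction)
            if len(batch) == n0:
                yield batch
                batch = []


if __name__ == "__main__":
    stream = ({"a", "b"} if i % 2 else {"b", "c"} for i in range(1000))
    for batch in sample_batches(stream, p=0.5, n0=100, rng=random.Random(7)):
        print(len(batch))
```

Each emitted batch would then be handed to a frequent-itemset miner to obtain the local patterns; a partially filled batch at the end of the stream is simply left pending in this sketch.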
One way to reduce this number is to keep the maximal support error of each pattern within a threshold $\sigma_0$ ($< \sigma$), called the significant support, and to classify an itemset $X$ as frequent if $supp(X) \ge \sigma$, infrequent if $supp(X) \le \sigma_0$, and sub-frequent otherwise. Collectively, frequent and sub-frequent itemsets are called significant patterns.

With the introduction of $\sigma_0$, we need to revisit the problem of identifying $n_0$. So far, $n_0$ is computed from Equation 5. However, its value cannot be chosen arbitrarily, because if $\epsilon$ is too small, $n_0$ will be very large. For instance, with $\epsilon = 0.001$ and $\delta = 0.01$, $n_0 \approx 2{,}600{,}000$ transactions, which is too many to buffer in main memory. On the other hand, $n_0$ cannot be too small, since it interacts with $\sigma_0$, which is used to control the number of significant patterns. For instance, if each itemset appearing in more than 0.01% of the sample batch is considered significant and $n_0 = 10{,}000$, then every itemset will be chosen. This number is extremely large because of the exponential explosion of itemsets. In view of this, we select

\[ n_0 = \max\left\{ \frac{\beta}{\sigma_0} ;\ \frac{1}{2\epsilon^2} \ln\frac{2}{\delta} \right\}, \]

where $\beta$ is an integer that must be greater than [value missing in source].

3.3 The Algorithm

With the above analysis, this section presents our algorithm, named Load Shedding for mining Frequent Itemsets (LSFI). We use a prefix tree $S$ to maintain significant itemsets. Initially, $S$ is empty. Each node in $S$ corresponds to an itemset $X$ and has the following fields: (1) Item: the last item of $X$; thus $X$ is represented by the set of items on the path from the root to the node; (2) Acnt: the accumulated frequency of $X$ seen so far in the data stream; (3) Bid: the index of the sample batch at which $X$ was inserted into $S$.

The algorithm receives the following parameters: processing capacity $C$; data stream $DS$; minimum support threshold $\sigma$; and significant support threshold $\sigma_0 \in (0, \sigma]$. Load shedding is invoked when the system workload exceeds $C$. On demand, the algorithm returns an approximate set of the frequent itemsets seen so far in the data stream. LSFI includes the following steps:

1. Before processing the data stream, the algorithm initializes the sample batch index $b_{crr} = 0$ and computes the sampling size $n_0 = \max\{ \beta/\sigma_0 ;\ \frac{1}{2\epsilon^2} \ln\frac{2}{\delta} \}$.

2. Periodically, it estimates the workload of the system and identifies an appropriate sampling rate $P$. If the workload is no more than $C$, $P$ is set to 1. Otherwise, $P$ is chosen such that Inequality 3 is satisfied.

3. Each time a transaction $t_N$ arrives, LSFI samples it with probability $P$.

4. When $n_0$ transactions have been sampled, LSFI first increases the sample batch index by 1. Then, all significant itemsets in this sample batch are mined. Any $X$ whose frequency in this batch, denoted $Ccnt(X)$, is greater than $\sigma_0 \times n_0$ is regarded as a significant itemset. For each such itemset $X$: if $X$ is already maintained in $S$, $Acnt(X)$ is updated by adding $(1/P) \times Ccnt(X)$. Note that, to compensate for the transactions dropped because of $P$, the frequency of each itemset must be scaled up by $1/P$ to approximate its true frequency in the current part of the stream. Otherwise, if $X$ is not in $S$, a new node is created for $X$ with $Acnt(X) = (1/P) \times Ccnt(X)$ and $X.Bid = b_{crr}$. After that, LSFI traverses $S$ to prune all infrequent itemsets, i.e., those with $Acnt(X) \le (b_{crr} - X.Bid) \times \sigma_0 \times n_0$. This condition requires some clarification. First, the frequency needed to make $X$ significant in each sample batch of size $n_0$ is more than $\sigma_0 \times n_0$.
Since $X$ was inserted at batch $X.Bid$, the frequency it must accumulate to remain significant up to the current sample batch must be more than $(b_{crr} - X.Bid) \times \sigma_0 \times n_0$. On the other hand, $Acnt(X)$ is its true frequency since its insertion at $X.Bid$. Therefore, if this value is no more than $(b_{crr} - X.Bid) \times \sigma_0 \times n_0$, $X$ is no longer significant. When $X$ is removed, all its supersets are removed as well. For the next sample batch, LSFI updates the index by $b_{crr} = N/n_0$.

5. When a user requests the results, LSFI scans $S$ and outputs all itemsets $X$ with $Acnt(X) + X.Bid \times \sigma_0 \times n_0 \ge \sigma \times N$. Note that $X.Bid \times \sigma_0 \times n_0$ is the maximal frequency that $X$ may have lost through the pruning step described above.

4 Performance Results

We implemented LSFI in C++ and performed experiments on a 1.9GHz Pentium machine with 1GB of main memory running Windows XP. To verify the feasibility of LSFI, both synthetic and real-life datasets are used. Using the method described in [1], we generate two datasets of 1 million transactions each over 10,000 unique items. The first has an average transaction size of 5 items and an average pattern size of 3. The second has an average transaction size of 8 and an average pattern size of 4. We denote the two datasets by T5.I3.D0K and T8.I4.D0K, respectively. As a real-life dataset, we use the KDD Cup 2000 BMS-POS dataset, which contains 515,597 transactions with 1,657 distinct items and an average of 6.5 items per transaction. Since our algorithm is probabilistic, both recall and precision are measured. For the same reason, each experiment is repeated 10 times for each parameter combination and the results are reported as averages.

4.1 Accuracy Measurements

In our experiments, we fix $\epsilon = 0.01$ and $\delta = 0.01$; accordingly, $n_0 \approx 25$K. We also select $\beta = 4$ and $\sigma_0 = 0.1\sigma$. Thus, the final value of $n_0$ is $\max\{ \beta/(0.1\sigma) ;\ 25\text{K} \}$. For each experiment, $C$ is fixed but the system workload, expressed as a multiple of the CPU capacity, is varied from 2 to 10. For example, a workload of 2 corresponds to a stream rate twice as high as the rate the CPU can handle without load shedding. Figure 1 shows our experimental results on the three selected datasets, where $\sigma$ is varied from 0.1% to 0.8%. For the two synthetic datasets, no (or only a few) frequent itemsets are found at $\sigma > 0.4\%$; hence, only values of $\sigma$ up to 0.4% are considered for them. As expected, at lower levels of workload, LSFI finds a higher number of true frequent patterns, indicated by the high recall, and a smaller number of false frequent patterns, shown by the high precision.
Nevertheless, the interesting point is that, at all levels of $\sigma$, the algorithm still finds more than [value missing in source]% of all the true frequent itemsets and keeps the percentage of false frequent itemsets below 10%, even when the system workload is 10 times higher than $C$.

For the same range of $\sigma$, more detailed results are reported in Figure 2, where we plot precision and recall for each itemset length. Due to space constraints, only results on the real-life dataset (with pattern length 4) are reported (see [7] for results on the other datasets). The figure shows that precision decreases as the length of itemsets increases. This happens because the support of longer itemsets tends to be closer to $\sigma$. According to our approximation, all itemsets whose support is greater than $(\sigma - \sigma_0)$ are reported as frequent patterns. Therefore, the precision is often lower for longer patterns, because many of them have true support in the range $(\sigma - \sigma_0, \sigma)$. This effect is more pronounced when $\sigma$ is set smaller (indicated by the steeper slope), since lower supports yield more long frequent patterns. Note that the recall is not affected by this approximation, as the true frequency of patterns is guaranteed by the Hoeffding bound, which depends on the number of sampled transactions. When the workload level is 2, the recall is always higher than 97% for every itemset length.
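For reference, the two accuracy measures used in this section can be computed as in the short Python sketch below. The standard set-based definitions of precision and recall are assumed; the function name and the convention of returning 1.0 for an empty set are illustrative choices, not taken from the paper.

```python
def precision_recall(reported, actual):
    """Precision and recall of a reported set of frequent itemsets.

    reported: itemsets output by the approximate algorithm.
    actual: itemsets that are truly frequent in the full stream.
    Itemsets must be hashable, e.g. frozensets.
    """
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


if __name__ == "__main__":
    reported = {frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")}
    actual = {frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("d")}
    print(precision_recall(reported, actual))
```

A false frequent itemset (reported but not truly frequent) lowers precision, while a missed true frequent itemset lowers recall, so the percentage of false frequent itemsets quoted above corresponds to one minus the precision.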

[Figure 1: panels for T5.I3.D0K, T8.I4.D0K, and BMS-POS, each showing Precision (Percentage) and Recall (Percentage).]
Fig. 1. Accuracy of our algorithm on both synthetic and real-life datasets

4.2 Adaptability

To test the ability of LSFI to adapt to changes, we generate the dataset T5T8.D0K, whose first part consists of 200K transactions taken from T5.I3.D0K and whose second part consists of 800K transactions from T8.I4.D0K. We send the dataset to the system at a rate just below the CPU capacity. When the algorithm progresses to the second part of the dataset, the set of MFIs changes significantly. For example, at $\sigma = 0.1\%$, the number of MFIs of length 5 increased from 14 to 40 and the number of MFIs of length 6 increased from 5 to 9. Furthermore, we also found 1 MFI of length 7 and 2 MFIs of length 8 that did not appear in the first part of the dataset. This is because the increase in transaction length from 5 to 8 also increases the number of longer MFIs. Consequently, the system needs more time to process these transactions. Once this was detected, $P$ was adjusted correspondingly. Figure 3 shows the accuracy of LSFI in this experiment. We observe that the recall remains very high (in the region of 95%) and is nearly the same at all $\sigma$ thresholds. Nevertheless, the precision is slightly lower than that on the uniform dataset T8.I4.D0K (which makes up 80% of the new dataset). To explain this, we note that when the size of transactions increases, LSFI finds more frequent patterns in the second part of the dataset. Under our approximation, where $\sigma_0$ is taken as the maximum support error of any frequent pattern, a small fraction of the itemsets discovered in the second part have over-estimated frequencies. This means that they were (locally) frequent in the second part, but not in the first part of the dataset. To reduce this number of false frequent patterns, we can set $\sigma_0$ smaller. However, even with $\sigma_0 = 0.1\sigma$, the precision is still above 90% in all cases, which means that the percentage of false frequent patterns is no more than 10%.

[Figure 2: BMS-POS, Precision (Percentage) and Recall (Percentage) against Itemset Length, for min_supp = 0.10%, 0.20%, 0.40%, 0.80%.]
Fig. 2. Accuracy vs. Itemset Length for BMS-POS dataset at Load Factor 2

[Figure 3: T5T8.N10K.D0K, Accuracy (Percentage), showing Recall and Precision against Minimum Support (Percentage).]
Fig. 3. Accuracy on dataset T5T8.D0K

5 Related Works

In querying data streams, the problem of load shedding is defined as the process of finding an optimal plan for inserting drop operations along existing arcs of a query network. Aurora [12] is one of the first projects addressing this issue; it uses different QoS graphs that represent the importance levels of queried objects. Based on these, transactions are dropped progressively, starting from those that contain information about the least important objects. STREAM [3] is another project, in which a load shedding scheme based on sampling is proposed for aggregate queries.
It modifies the query network by inserting load shedder operators together with sampling rates in such a way that the total sampling rate drops a sufficient amount of data. This work is similar to ours in that random sampling is used as the means of load shedding. In stream mining, Loadstar [6] is the first work addressing the load shedding problem for classification, using a set of Quality of Decision metrics.

Recent work on mining frequent patterns over data streams can be classified into three models: (i) the landmark model, where patterns are discovered between a particular point in time and the current time; Lossy Counting [11] and FDPM [15] are typical algorithms; (ii) the time-fading model, where transactions are weighted by their arrival time; works on this model include estDec [5] and FP-Stream [8]; (iii) the sliding-window model, which additionally considers the elimination of transactions: as a new transaction arrives, the oldest one in the window is retired; FTP-DS [13] and TSSW [10] are examples. All these works address only the problem of memory limitation in data streams. Our work (on the landmark model) further addresses the load shedding problem.

6 Conclusions

In this paper, we address the problem of finding frequent patterns in data streams when the mining system may not keep up with the arrival rate of the stream. We have proposed an approach that detects overload situations based on a small set of maximal frequent itemsets. By adopting the Hoeffding bound, we have developed an algorithm that sheds load adaptively by discarding a fraction of incoming transactions under overload. Experiments on both real-life and synthetic datasets have been conducted to evaluate the proposed algorithm. The results show that both precision and recall remain very high even when the arrival rate of the data stream is much higher than the capacity of the mining system and skew in the data stream is simulated.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference.
2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS Conference, pages 1-16.
3. B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In ICDE Conference.
4. C. Chambers, W. Feng, S. Sahu, and D. Saha. Measurement-based characterization of a collection of on-line games. In IMC Conference, pages 1-14.
5. J.H. Chang and W.S. Lee. Finding recent frequent itemsets adaptively over online data streams. In ACM SIGKDD Conference, starting at page 487.
6. Y. Chi, P.S. Yu, H. Wang, and R.R. Muntz. Loadstar: A load shedding scheme for classifying data streams. In SIAM Conference.
7. X.H. Dang, W.K. Ng, and K.L. Ong. Adaptive load shedding for mining frequent patterns from data streams. Technical Report, Nanyang Technological University.
8. C. Giannella, J. Han, J. Pei, X. Yan, and P.S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. AAAI/MIT.
9. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30.
10. C.H. Lin, D.Y. Chiu, Y.H. Wu, and A.L.P. Chen. Mining frequent itemsets from data streams with a time-sensitive sliding window. In SIAM Conference.
11. G.S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB Conference.
12. N. Tatbul, U. Çetintemel, S.B. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In VLDB Conference.
13. W.G. Teng, M.S. Chen, and P.S. Yu. A regression-based temporal pattern mining scheme for data streams. In VLDB Conference.
14. G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In ACM SIGKDD Conference.
15. J.X. Yu, Z.C., H. Lu, and A. Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In VLDB Conference, 2004.
