Mining Data Streams [Garofalakis, Gehrke & Rastogi 2002]

Outline
- Introduction
- Summarization Methods
- Clustering Data Streams
- Data Stream Classification
- Temporal Models

CMPT 843, SFU, Martin Ester, 1-06
Introduction: Data Streams

Applications that generate streams of data:
- Network monitoring
- Call records in telecommunications
- Web server logs
- Sensor networks

Application characteristics:
- Massive volumes of data
- Records arrive at a rapid rate

A data stream is a sequence of records r_1, ..., r_n.
Introduction: Computation Model

A stream processing engine reads the data streams, maintains a synopsis in main memory, and answers user requests with approximate answers.

Requirements:
- Single pass: each record is examined at most once
- Limited storage: main memory is limited to M
- Real-time processing: incremental maintenance of the synopsis in real time
Summarization Methods: Sampling

A small random sample may represent the data stream well enough.

                                           AVG    MAX
  Data stream: 4 8 1 7 9 3 2 5 1 4 8 6   4.833    9
  Sample:      4 7 2 4                    4.25     7

How close is the approximate answer to the actual answer?
Use tail inequalities to provide probabilistic guarantees.
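The slide does not name a sampling method; one standard way to maintain a uniform random sample of a stream in a single pass is reservoir sampling, sketched here on the slide's example stream (the sample size 4 and the seed are arbitrary choices for illustration):

```python
import random

def reservoir_sample(stream, size, rng=random.Random(42)):
    """Maintain a uniform random sample of `size` records in one pass
    (reservoir sampling; one standard choice, not named on the slide)."""
    sample = []
    for i, record in enumerate(stream):
        if i < size:
            sample.append(record)
        else:
            # keep the new record with probability size / (i + 1),
            # replacing a uniformly chosen current sample element
            j = rng.randint(0, i)
            if j < size:
                sample[j] = record
    return sample

stream = [4, 8, 1, 7, 9, 3, 2, 5, 1, 4, 8, 6]
sample = reservoir_sample(stream, 4)
print(sum(sample) / len(sample), max(sample))   # approximate AVG and MAX
```

Aggregates such as AVG and MAX computed over the sample then serve as approximate answers for the whole stream.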
Summarization Methods: Sampling

Tail probability: the probability that a random variable X with expectation μ deviates far from μ.

  Markov:    Pr(X ≥ ε) ≤ μ / ε
  Chebyshev: Pr(|X − μ| ≥ εμ) ≤ Var(X) / (ε²μ²)
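The two inequalities can be checked empirically. The following sketch (my own toy setup: X is the number of heads in 20 fair coin flips, so μ = 10 and Var(X) = 5) compares empirical tail frequencies against both bounds:

```python
import random

random.seed(1)
# X: number of heads in 20 fair coin flips; mu = 10, Var(X) = 5
mu, var, eps = 10.0, 5.0, 0.5
trials = [sum(random.randint(0, 1) for _ in range(20)) for _ in range(10000)]

# empirical tail probabilities
markov_emp = sum(x >= 15 for x in trials) / len(trials)
cheb_emp = sum(abs(x - mu) >= eps * mu for x in trials) / len(trials)

markov_bound = mu / 15                    # Pr(X >= a) <= mu / a
cheb_bound = var / (eps**2 * mu**2)       # Pr(|X - mu| >= eps*mu) <= Var(X)/(eps*mu)^2

print(markov_emp, "<=", markov_bound)
print(cheb_emp, "<=", cheb_bound)
```

Both bounds are loose but always valid, which is exactly what makes them usable as guarantees for sample-based approximate answers.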
Summarization Methods: Histograms

Partition the attribute domain into b buckets; count C_B per bucket B.

Equi-depth histograms:
- Given the number b of buckets
- Partition such that the counts are equal

V-optimal histograms [Jagadish et al 1998]:
- Given the number b of buckets
- Partition such that the frequency variance within buckets is minimized:
    min Σ_B Σ_{v ∈ B} (f_v − C_B / |B|)²   where f_v is the frequency of value v
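For a fixed frequency vector, the V-optimal partition can be found by dynamic programming over bucket boundaries. This is a direct O(n²·b) sketch of that idea for tiny inputs, not the streaming construction the section is about (the frequency vector below is invented for illustration):

```python
def sse(freqs, i, j):
    # squared error of one bucket covering freqs[i:j] around its average
    vals = freqs[i:j]
    avg = sum(vals) / len(vals)
    return sum((v - avg) ** 2 for v in vals)

def v_optimal(freqs, b):
    """Minimal within-bucket squared error of a b-bucket histogram
    over freqs, via dynamic programming on bucket boundaries."""
    n = len(freqs)
    INF = float("inf")
    # cost[k][j]: best cost of covering the first j values with k buckets
    cost = [[INF] * (n + 1) for _ in range(b + 1)]
    cost[0][0] = 0.0
    for k in range(1, b + 1):
        for j in range(k, n + 1):
            cost[k][j] = min(cost[k - 1][i] + sse(freqs, i, j)
                             for i in range(k - 1, j))
    return cost[b][n]

freqs = [1, 1, 8, 9, 2, 2]
print(v_optimal(freqs, 3))   # best 3 buckets: {1,1}, {8,9}, {2,2}
```

With 3 buckets the optimum groups the two high frequencies together, leaving a total squared error of 0.5.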
Summarization Methods: Answering Queries Using Histograms [Ioannidis & Poosala 1999]

- Determine all buckets matching the query condition
- Assume an equi-depth histogram and a uniform distribution of the count over each bucket interval

Example (domain 1 ... 10):
  select count(*) from R where 4 <= R.A <= 8
  return (2/3) C_B + C_B + (1/2) C_B = (13/6) C_B
where each fraction is the fraction of the bucket's interval that overlaps the query range.

For equi-depth histograms, the maximum error is ± 2 C_B.
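The uniform-distribution estimate can be written as one small function. The bucket boundaries below are hypothetical (chosen so that an equi-depth histogram with C_B = 6 reproduces fractions like those above); the estimator itself is the general rule:

```python
def estimate_count(buckets, lo, hi):
    """Estimate count(*) for lo <= A <= hi from a histogram given as
    [(left, right, count), ...] over an integer domain, assuming the
    count is uniformly distributed over each bucket's interval."""
    total = 0.0
    for left, right, count in buckets:
        width = right - left + 1                       # values in the bucket
        overlap = min(hi, right) - max(lo, left) + 1   # values also in query
        if overlap > 0:
            total += count * overlap / width
    return total

# three buckets overlapping the query, each with count C_B = 6
buckets = [(3, 5, 6), (6, 7, 6), (8, 9, 6)]
print(estimate_count(buckets, 4, 8))   # (2/3 + 1 + 1/2) * 6 = 13.0
```

Only the two boundary buckets contribute error, which is where the ± 2 C_B worst case for equi-depth histograms comes from.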
Summarization Methods: Equi-Depth Histogram Construction

Determine the records (attribute values) with ranks n/b, 2n/b, ..., (b−1)n/b.
r-quantile: the record with rank (index) r.

One-pass computation of quantiles [Manku, Rajagopalan & Lindsay 1998]:
- Split memory M into b buffers of size k
- For each consecutive subsequence of k stream elements:
  - If there is a free buffer B, insert the subsequence into B and set the level of B to 0
  - Else merge two buffers B' and B'' of the same level l, insert the result of the merge into B' and set the level of B' to l + 1; insert the subsequence into B'' and set the level of B'' to 0
- Output the record with index r after making 2^l copies of each element in the final buffer, where l is that buffer's level
Summarization Methods: One-Pass Quantile Computation — Example

M = 9, b = 3, k = 3, r = 10, n = 12
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

  Level 0: [9 3 5] [2 7 1] [6 5 8] [4 9 1]
  Level 1: [1 3 7] [1 5 8]
  Level 2: [1 3 7]

Output: 1 1 1 1 3 3 3 3 7 7 7 7 → the record with index r = 10 is 7
(8 is the exact result!)
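The example can be reproduced with a simplified sketch of the buffer scheme. Two same-level buffers are merged by sorting their union and keeping every other element (this reproduces the level-1 and level-2 buffers shown above); for simplicity the code always merges the two lowest-level buffers, which in this example are always of equal level:

```python
def one_pass_quantile(stream, b, k, r):
    """Simplified one-pass quantile sketch in the spirit of
    [Manku, Rajagopalan & Lindsay 1998]: b buffers of size k."""
    buffers = []                        # list of (level, contents)

    def collapse():
        # merge the two lowest-level buffers: sort their union, keep
        # every other element (positions 0, 2, 4, ...), raise the level
        buffers.sort(key=lambda lb: lb[0])
        (l1, b1), (l2, b2) = buffers.pop(0), buffers.pop(0)
        buffers.append((max(l1, l2) + 1, sorted(b1 + b2)[::2]))

    for i in range(0, len(stream), k):
        if len(buffers) == b:           # no free buffer
            collapse()
        buffers.append((0, list(stream[i:i + k])))
    while len(buffers) > 1:             # end of stream: reduce to one buffer
        collapse()

    level, final = buffers[0]
    # each element of the final buffer stands for 2^level stream elements
    expanded = sorted(x for x in final for _ in range(2 ** level))
    return expanded[r - 1]              # record with rank r (1-indexed)

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(one_pass_quantile(stream, b=3, k=3, r=10))   # 7 (exact answer: 8)
```

Running it on the slide's stream yields the approximate 10-quantile 7, matching the worked example.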
Summarization Methods: Discussion

Sampling:
- Can be done efficiently in one pass
- Does not preserve correlations between different attributes

One-dimensional histograms:
- Can be done efficiently in one pass
- Do not preserve correlations between different attributes

Multi-dimensional histograms:
- Preserve correlations between different attributes
- Require multiple passes (equi-depth or V-optimal)
Summarization Methods: Randomized Sketch Synopses [Thaper et al. 2002]

Synopsis: a random linear mapping of the data stream, A: R^N → R^d.
Matrix A: each entry chosen independently from a certain distribution.

Johnson-Lindenstrauss theorem:
If d = O(log(1/δ) / ε²), then for any x ∈ R^N
  (1 − ε) ‖x‖₂ ≤ ‖Ax‖₂ ≤ (1 + ε) ‖x‖₂
with probability at least 1 − δ.

If d is large enough, then the approximation error is at most ε.
Summarization Methods: Randomized Sketch Synopses — Example

Data stream (points over a 2 × 2 domain):
  p1 = (1,1), p2 = (1,2), p3 = (1,1), p4 = (1,2), p5 = (2,2)

Representation as N-dimensional unit vectors (N = 4):
  p1 = (1 0 0 0), p2 = (0 1 0 0), ...
Frequency distribution D = (2 2 0 1)

Matrix A:
   0.61  0.13  0.67 −0.39
   0.86  0.24 −0.38 −0.21
   0.91 −0.17  0.33 −0.16

Incremental sketch maintenance, S_i = sketch of p_1, ..., p_i:
  S_0 = 0,  S_i = S_{i−1} + A p_i
  S_1 = (0.61 0.86 0.91), ..., S_5 = (−1.35 1.99 1.32)
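Because each point is a unit vector, the incremental update S_i = S_{i−1} + A·p_i just adds one column of A. A minimal sketch of this maintenance, using a seeded Gaussian matrix of my own (one common choice; the slide leaves the distribution open) rather than the matrix printed above:

```python
import random

random.seed(0)
N, d = 4, 3                          # domain size N, sketch dimension d (toy)
# each entry of A drawn independently, here Gaussian scaled by 1/sqrt(d)
A = [[random.gauss(0, 1) / d**0.5 for _ in range(N)] for _ in range(d)]

def add_point(sketch, cell):
    # p_i is the unit vector e_cell, so A p_i is simply column `cell` of A
    return [s + A[j][cell] for j, s in enumerate(sketch)]

# the example stream over a 2x2 domain, encoded as cell indices 0..3
stream = [0, 1, 0, 1, 3]             # p1, ..., p5
S = [0.0] * d                        # S_0 = 0
for cell in stream:
    S = add_point(S, cell)           # S_i = S_{i-1} + A p_i

# S_5 now equals A D for the frequency vector D = (2, 2, 0, 1)
print(S)
```

The invariant is that the current sketch always equals A times the frequency vector of the stream seen so far, which is what makes it a linear, incrementally maintainable synopsis.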
Summarization Methods: Building Histograms from Sketches

D: data distribution of the data stream
H: corresponding histogram (a sequence of hyperrectangles (S_i, v_i))
Both are represented as N-dimensional vectors (concatenating all dimensions).

Idea:
- We maintain the sketch AD of the data stream
- Johnson-Lindenstrauss theorem:
    ‖H − D‖₂ ≤ ‖AH − AD‖₂ ≤ (1 + ε) ‖H − D‖₂ with probability at least 1 − δ
- We determine a histogram H such that ‖AH − AD‖₂ is minimized

Assumption: the domain of each attribute is known in advance.
Summarization Methods: Building Histograms from Sketches

Input: AD (sketch of the data stream)
Output: H (histogram of the data stream)

Algorithm:
  H = empty
  Iterate B times:
    For each possible histogram hyperrectangle S do
      Consider the histogram H_S = H ∪ S
      Compute the sketch AH_S of the histogram
      Determine the value corresponding to S that minimizes ‖AH_S − AD‖₂
      Record this value
    Add the best S to H
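A simplified 1-D sketch of this greedy loop (my own toy sizes and matrix, not the paper's exact algorithm): for each candidate interval, the value minimizing ‖AH_S − AD‖₂ is a one-dimensional least-squares problem with the closed-form solution ⟨As, AD − AH⟩ / ‖As‖², where s is the interval's indicator vector:

```python
import random

random.seed(0)
N, d, B = 8, 6, 2          # domain size, sketch size, number of buckets (toy)
A = [[random.gauss(0, 1) / d**0.5 for _ in range(N)] for _ in range(d)]

def matvec(A, x):
    return [sum(A[j][v] * x[v] for v in range(len(x))) for j in range(len(A))]

def err(u, w):
    return sum((a - b) ** 2 for a, b in zip(u, w)) ** 0.5

def greedy_histogram(AD, iterations=B):
    """Greedy sketch-space fit: repeatedly add the 1-D bucket (interval
    plus constant value) that most reduces ||A H - A D||_2."""
    H = [0.0] * N
    for _ in range(iterations):
        best = None
        for a in range(N):
            for b in range(a, N):
                s = [1.0 if a <= v <= b else 0.0 for v in range(N)]
                As, AH = matvec(A, s), matvec(A, H)
                # closed-form value minimizing ||AH + val*As - AD||_2
                val = (sum(x * (y - z) for x, y, z in zip(As, AD, AH))
                       / sum(x * x for x in As))
                cand = [h + val * s[v] for v, h in enumerate(H)]
                e = err(matvec(A, cand), AD)
                if best is None or e < best[0]:
                    best = (e, cand)
        H = best[1]
    return H

D = [5, 5, 5, 1, 1, 1, 1, 1]     # a hypothetical two-level data distribution
H = greedy_histogram(matvec(A, D))
print(H)
```

Each iteration can only decrease the sketch-space error, so the fitted histogram's sketch ends up strictly closer to AD than the empty histogram's.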
Clustering Data Streams: k-Medoid Clustering [Guha, Mishra, Motwani & O'Callaghan 2000]

Challenge: k-medoid clustering algorithms require random access to the data.

Approach:
- Cluster as many records (a chunk) as fit into memory.
- The resulting (intermediate) medoids summarize their chunk.
- Cluster the set of all intermediate medoids to obtain the final medoids for the entire data stream.
Clustering Data Streams: k-Medoid Clustering

Two-phase method:
1) For each (non-overlapping) subsequence S_i of M records, find O(k) medoids in S_i and assign the other records to the closest medoid.
2) Let S be the set of all medoids from the n/M subsequences, each medoid weighted by the number of records assigned to it. Determine k medoids for S and return them.
Clustering Data Streams: Example

M = 3, k = 1, n = 5; data stream of points p1, ..., p5.
Result of the first phase: medoid p1 for chunk S_1 = {p1, p2, p3}, medoid p5 for chunk S_2 = {p4, p5}.
Clustering Data Streams: Example

S = {p1 (w = 3), p5 (w = 2)}
Result of the second phase (final result): the k = 1 final medoid determined from the weighted set S.
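The two-phase method can be sketched end to end for tiny inputs. The medoid search here is brute force (illustration only, not the constant-factor approximation algorithm the analysis assumes), and the 2-D coordinates for p1, ..., p5 are hypothetical:

```python
from itertools import combinations

def k_medoid(points, weights, k):
    """Brute-force weighted k-medoid for tiny inputs: choose the k points
    minimizing total weighted Manhattan distance to the nearest medoid."""
    def cost(medoids):
        return sum(w * min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
                   for p, w in zip(points, weights))
    return min(combinations(points, k), key=cost)

def stream_k_medoid(stream, M, k):
    # phase 1: cluster each chunk of M records, keep weighted medoids
    medoids, weights = [], []
    for i in range(0, len(stream), M):
        chunk = stream[i:i + M]
        ms = k_medoid(chunk, [1] * len(chunk), k)
        for m in ms:
            # weight = number of chunk records closest to this medoid
            w = sum(1 for p in chunk
                    if min(ms, key=lambda q: abs(p[0] - q[0]) + abs(p[1] - q[1])) == m)
            medoids.append(m)
            weights.append(w)
    # phase 2: cluster the weighted intermediate medoids
    return k_medoid(medoids, weights, k)

# five 2-D points: three near the origin, two far away (hypothetical p1..p5)
stream = [(0, 0), (1, 0), (0, 1), (9, 9), (9, 8)]
print(stream_k_medoid(stream, M=3, k=1))
```

With M = 3 and k = 1, the first phase produces one weighted medoid per chunk, and the second phase picks the medoid of the heavier group near the origin.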
Clustering Data Streams: Analysis

Property 1:
Given a dataset D and a k-medoid clustering with cost C whose medoids do not belong to D, there is a clustering with k medoids from D with cost at most 2C.

Argument:
Consider a record p and let m be the closest medoid in the cost-C clustering. Let m' be the record of D closest to m (since p ∈ D, dist(m, m') ≤ dist(m, p)). If m = m', then we are done. Otherwise, applying the triangle inequality:
  dist(p, m') ≤ dist(p, m) + dist(m, m') ≤ 2 dist(p, m)
Clustering Data Streams: Analysis

Using Property 1 and two similar properties, one can prove the following:
The cost of the k-medoid clustering obtained from the data stream (in one pass) is at most eight times the cost of the k-medoid clustering of a static database consisting of the same records.
This assumes that a constant-factor approximation algorithm is used for clustering the subsequences S_i.
The algorithm can be extended to cluster in more than two passes.
Clustering Data Streams: Discussion

- k-medoid clustering in one pass
- The runtime of k-medoid clustering is at least O(n²), i.e. the runtime per record is rather high (not constant)
- Guarantee that the clustering quality is within a constant factor of the quality of conventional (database) clustering
- But the factor is pretty high
- k-medoid clustering can also be used to generate a synopsis of the data stream
Data Stream Classification: Decision Tree Classification [Domingos & Hulten 2000]

Observation: for determining the best split attribute, it may be sufficient to consider only a small subset of the training examples belonging to the current node.

Idea: instead of repeated reads of the database, continue reading further portions of the data stream; each node N1, N2, N3, N4, ... of the growing tree is split using a fresh portion of the stream.
Data Stream Classification: Hoeffding Bounds

Challenge: How many examples are necessary at each node? How much of the data stream should be used for the next choice of a split attribute / split point?

Approach: use Hoeffding bounds.
- r: real-valued random variable with range R (and any probability distribution)
- n: number of observations of r
- r̄: the observed mean of r

With probability 1 − δ, the true mean of r is at least r̄ − ε, where
  ε = sqrt( R² ln(1/δ) / (2n) )
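The bound is easy to evaluate numerically; the parameter values below (R = 1 for a two-class information gain, δ = 10⁻⁶) are illustrative choices:

```python
from math import log, sqrt

def hoeffding_epsilon(R, delta, n):
    """epsilon such that, with probability 1 - delta, the true mean of a
    range-R variable is within epsilon of the mean observed over n examples."""
    return sqrt(R * R * log(1 / delta) / (2 * n))

# e.g. information gain with 2 classes has range R = log2(2) = 1
for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(1.0, 1e-6, n), 4))
```

Note that ε shrinks as 1/sqrt(n): a hundredfold increase in examples tightens the bound by a factor of ten, independent of the data distribution.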
Data Stream Classification: Decision Tree Classification

G(X_i): measure of the potential split attribute X_i, to be maximized.

Goal: with high probability, the attribute chosen using n examples is the same one that would have been chosen using infinitely many examples; n should be as small as possible.

- X_a: attribute with the highest observed G after seeing n examples
- X_b: attribute with the second-highest observed G
- ΔG = G(X_a) − G(X_b) ≥ 0
Data Stream Classification: Decision Tree Classification

X_a is the correct choice with probability 1 − δ if n examples have been seen at this node and ΔG > ε.

Rationale: assume that the G value can be viewed as an average of the G values of the examples belonging to that node. If ΔG > ε, then the Hoeffding bound guarantees for the true ΔG
  ΔG_true ≥ ΔG − ε > 0 with probability 1 − δ
Data Stream Classification: Hoeffding Tree Algorithm

- Read examples from the data stream until ΔG > ε (note that ε decreases monotonically with n)
- Split the node n using the currently best attribute, obtaining the child nodes n_1, ..., n_k
- Apply the same procedure to n_1, ..., n_k, using the subsequent portions of the data stream as training examples
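The split decision at a single node can be sketched as follows. The toy stream is my own (x_a predicts the class 90% of the time, x_b is pure noise), G is information gain, and only the root-node loop is shown, not the full tree-growing recursion:

```python
import random
from math import log, sqrt

random.seed(3)

def entropy(counts):
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c:
            p = c / total
            h -= p * log(p, 2)
    return h

def info_gain(per_value, class_counts):
    # per_value[v] = [count of class 0, count of class 1] for attribute value v
    n = sum(class_counts)
    remainder = sum(sum(cc) / n * entropy(cc) for cc in per_value.values())
    return entropy(class_counts) - remainder

def example():
    y = random.randint(0, 1)
    x_a = y if random.random() < 0.9 else 1 - y   # informative attribute
    x_b = random.randint(0, 1)                    # noise attribute
    return {"x_a": x_a, "x_b": x_b}, y

stats = {a: {0: [0, 0], 1: [0, 0]} for a in ("x_a", "x_b")}
class_counts = [0, 0]
R, delta = 1.0, 1e-6        # range of info gain for 2 classes; confidence
n = 0
while True:                 # read examples until Delta-G > epsilon
    x, y = example()
    n += 1
    class_counts[y] += 1
    for a in stats:
        stats[a][x[a]][y] += 1
    gains = sorted((info_gain(stats[a], class_counts), a) for a in stats)
    delta_g = gains[-1][0] - gains[-2][0]
    eps = sqrt(R * R * log(1 / delta) / (2 * n))
    if delta_g > eps:
        break

print(f"split on {gains[-1][1]} after {n} examples")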
Data Stream Classification: Experimental Evaluation

- VFDT: a system implementing different variants of the Hoeffding tree algorithm
- VFDT is more accurate than C4.5 for large numbers of examples
- VFDT produces much smaller decision trees (less overfitting)
Data Stream Classification: Discussion

- The Hoeffding tree algorithm builds a decision tree in a single pass with constant time per example
- Guarantees for similarity to a conventional decision tree built from a database
- For large sets of examples, Hoeffding trees are much smaller and more accurate than conventional decision trees
- Assumption: the third-best split attribute is significantly worse than the best two (may not be realistic)
- Can the same approach be applied to other hierarchical data mining methods, e.g. hierarchical clustering?
Temporal Models: Introduction

So far, the time stamps of records have been ignored; we have summarized over the entire stream. But decisions are often based on recently observed data (e.g. stock data, sensor networks).

Data stream with time stamps:
  ..., r_{i,1}, ..., r_{i,2}, ..., r_{i,k}, ...   with time stamps 1, 2, ..., k

Decay the weight of older records, e.g. in the sliding window model.
Temporal Models: Decaying Data Stream Records

Data stream with time stamps: records r_1, r_2, r_3, ..., r_t, ... are assigned weights w_1, w_2, w_3, ..., w_t, ...
Special time stamp NOW.

Exponential decay:
  w_t = 2^−(NOW − t)
  i.e. w_NOW = 1, w_{NOW−1} = 1/2, w_{NOW−2} = 1/4, ...

Sliding window model:
  w_t = 1 if NOW − t < WINDOW, 0 otherwise
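Both weighting schemes are one-liners; here they are side by side (NOW = 10 and WINDOW = 4 are arbitrary illustration values):

```python
NOW = 10

def exponential_weight(t, now=NOW):
    # w_t = 2^-(NOW - t): the weight halves with every time step of age
    return 2.0 ** -(now - t)

def sliding_window_weight(t, window=4, now=NOW):
    # w_t = 1 inside the window of the last `window` time stamps, else 0
    return 1 if now - t < window else 0

print([exponential_weight(t) for t in (10, 9, 8)])     # [1.0, 0.5, 0.25]
print([sliding_window_weight(t) for t in (10, 7, 6)])  # [1, 1, 0]
```

Exponential decay down-weights old records smoothly, while the sliding window drops them abruptly once they fall out of the window.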
Temporal Models: Clustering Evolving Data Streams [Aggarwal et al 2003]

Synopsis: micro-clusters (CF-values), organized into a CF-tree maintained online
- CF-values extended by a temporal dimension
- Micro-clusters stored at snapshots in time following a pyramidal time frame

Offline cluster analysis:
- Using different clustering algorithms / different parameter values, based on the CF-tree
Temporal Models: CF-Values

Clustering feature of a set C of points X_i: CF = (N, LS, SS)
- N = |C|: number of points in C
- LS = Σ_{i=1..N} X_i: linear sum of the N points
- SS = Σ_{i=1..N} X_i²: square sum of the N points

CFs are sufficient to calculate
- centroids
- measures of compactness
- distance functions for clusters
Temporal Models: CF-Values

Additivity theorem:
The CFs of two disjoint point sets C_1 and C_2 are additive:
  CF(C_1 ∪ C_2) = CF(C_1) + CF(C_2) = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
i.e. CFs can be calculated incrementally, which is crucial for the synopsis of a data stream.

CF-tree: a height-balanced tree for the storage of CFs.
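CF computation, additivity, and centroid derivation can be sketched in a few lines (taking SS as the scalar sum of squared coordinates, one common convention; the sample points are invented):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a set of d-dimensional points."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(d)]
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_add(cf1, cf2):
    # additivity: CF(C1 u C2) = CF(C1) + CF(C2), componentwise
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

def centroid(cf_value):
    # the centroid is derivable from the CF alone: LS / N
    n, ls, _ = cf_value
    return [x / n for x in ls]

c1 = [(1.0, 2.0), (3.0, 4.0)]
c2 = [(5.0, 6.0)]
print(cf_add(cf(c1), cf(c2)))   # equals cf(c1 + c2)
print(centroid(cf(c1 + c2)))    # [3.0, 4.0]
```

Because the union's CF equals the sum of the parts' CFs, a new stream record can be absorbed into a micro-cluster in O(d) time without revisiting old points.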
Temporal Models: CF-Tree

Example (B = 7, L = 5):
- Root: entries CF_1, ..., CF_6 with child pointers; CF_1 = CF_7 + ... + CF_12
- Inner nodes: entries CF_7, ..., CF_12 with child pointers; CF_7 = CF_90 + ... + CF_94
- Leaf nodes: entries CF_90, CF_91, ..., CF_94 and CF_95, CF_96, ..., CF_99, linked by prev/next pointers
Temporal Models: Pyramidal Time Frame

- Snapshots (micro-clusters) are stored at different levels of granularity, depending on their recency
- Snapshots of order i are taken at time intervals of α^i, where α is an integer, α ≥ 1
- At any time, only the last α + 1 snapshots of order i are stored
- For a data stream r_1, ..., r_n, the maximum order of snapshots is log_α n and the maximum number of stored snapshots is (α + 1) · log_α n
- For any user-specified time window w, there is at least one stored snapshot between NOW − 2w and NOW
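The retention rule can be simulated to see which snapshot times survive. This sketch assumes α ≥ 2 (so that log_α is defined) and order-i snapshots taken at multiples of α^i:

```python
import math

def stored_snapshots(now, alpha):
    """Times of the snapshots retained at time `now` under the pyramidal
    scheme: order-i snapshots are taken at multiples of alpha^i and only
    the last alpha + 1 snapshots per order are kept (assumes alpha >= 2)."""
    max_order = int(math.log(now, alpha))
    stored = set()
    for i in range(max_order + 1):
        step = alpha ** i
        times = list(range(step, now + 1, step))
        stored.update(times[-(alpha + 1):])   # keep the last alpha+1 per order
    return sorted(stored)

snaps = stored_snapshots(now=100, alpha=2)
print(snaps)   # dense near NOW = 100, exponentially sparse further back
```

The retained set is dense in the recent past and exponentially sparse further back, which is why any window w is covered by a stored snapshot no older than 2w.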
References

Aggarwal C. C., Han J., Wang J., Yu P.: A Framework for Clustering Evolving Data Streams, Proc. VLDB 2003.
Domingos P., Hulten G.: Mining High-Speed Data Streams, Proc. ACM SIGKDD 2000.
Garofalakis M., Gehrke J., Rastogi R.: Querying and Mining Data Streams: You Only Get One Look, Tutorial VLDB 2002.
Guha S., Mishra N., Motwani R., O'Callaghan L.: Clustering Data Streams, Proc. IEEE FOCS 2000.
Ioannidis Y. E., Poosala V.: Histogram-Based Approximation of Set-Valued Query Answers, Proc. VLDB 1999.
Jagadish H. V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms With Quality Guarantees, Proc. VLDB 1998.
Manku G. S., Rajagopalan S., Lindsay B. G.: Approximate Median and Other Quantiles in One Pass and with Limited Memory, Proc. ACM SIGMOD 1998.
Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms, Proc. ACM SIGMOD 2002.