Mining Data Streams
CMPT 843, SFU, Martin Ester

Outline [Garofalakis, Gehrke & Rastogi 2002]
- Introduction
- Summarization Methods
- Clustering Data Streams
- Data Stream Classification
- Temporal Models

Introduction: Data Streams
Applications that generate streams of data:
- Network monitoring
- Call records in telecommunications
- Web server logs
- Sensor networks
Application characteristics:
- Massive volumes of data
- Records arrive at a rapid rate
A data stream is a sequence of records r_1, ..., r_n.

Introduction: Computation Model
(diagram: data streams feed a stream processing engine, which maintains a synopsis in main memory and returns approximate answers to user requests)
Requirements:
- Single pass: each record is examined at most once
- Limited storage: main memory is limited to M
- Real-time processing: incremental maintenance of the synopsis in real time

Summarization Methods: Sampling
A small random sample may represent the data stream well enough.
Example:
- Data stream: 4 8 1 7 9 3 2 5 1 4 8 6 (AVG = 4.833, MAX = 9)
- Sample: 4 7 2 4 (AVG = 4.25, MAX = 7)
How close is the approximate answer to the actual answer? Use tail inequalities to provide probabilistic guarantees.
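The slides do not spell out how to maintain such a sample online; a standard one-pass technique is reservoir sampling, sketched here in Python under that assumption (sample size s, each arriving record kept with probability s/n):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over a single pass."""
    sample = []
    for n, record in enumerate(stream, start=1):
        if n <= s:
            sample.append(record)        # fill the reservoir first
        else:
            j = random.randrange(n)      # uniform in [0, n)
            if j < s:                    # happens with probability s/n
                sample[j] = record
    return sample

stream = [4, 8, 1, 7, 9, 3, 2, 5, 1, 4, 8, 6]
sample = reservoir_sample(stream, 4)
print(sample, sum(sample) / len(sample), max(sample))   # approximate AVG and MAX
```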

Summarization Methods: Tail Inequalities
Tail probability: the probability that a random variable deviates far from its expectation.
(figure: distribution with the tails below µ - ε·µ and above µ + ε·µ shaded)
- Markov (for non-negative X): Pr(X ≥ ε·µ) ≤ 1/ε
- Chebyshev: Pr(|X - µ| ≥ ε·µ) ≤ Var(X) / (ε²·µ²)
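As a quick sanity check (not part of the slides), the snippet below measures how often the average of a random size-4 sample of the example stream deviates from the true mean by more than ε·µ, and compares this with Chebyshev's bound for the sample mean (Var of the mean is Var(X)/n for independent draws; sampling without replacement only makes the bound conservative):

```python
import random

stream = [4, 8, 1, 7, 9, 3, 2, 5, 1, 4, 8, 6]
mu = sum(stream) / len(stream)
var = sum((x - mu) ** 2 for x in stream) / len(stream)

n, eps, trials = 4, 0.5, 100_000
bad = sum(abs(sum(random.sample(stream, n)) / n - mu) >= eps * mu
          for _ in range(trials))
print(f"empirical tail: {bad / trials:.3f}  "
      f"Chebyshev bound: {(var / n) / (eps * mu) ** 2:.3f}")
```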

Summarization Methods: Histograms
Partition the attribute domain into b buckets and maintain a count C_B for each bucket B.
Equi-depth histograms:
- Given the number of buckets, partition such that the counts are equal.
V-optimal histograms [Jagadish et al 1998]:
- Given the number of buckets, partition such that the frequency variance within the buckets is minimized:
  min Σ_B Σ_{v ∈ B} (f_v - C_B / |B|)², where f_v is the frequency of value v.
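For intuition, here is a minimal offline equi-depth construction; it sorts the data, so it is not yet a one-pass method (the streaming construction via quantiles follows below):

```python
def equi_depth_histogram(values, b):
    """Offline equi-depth histogram: b buckets with (nearly) equal counts.
    Returns one (min_value, max_value, count) triple per bucket."""
    xs = sorted(values)
    n = len(xs)
    buckets = []
    for i in range(b):
        chunk = xs[i * n // b:(i + 1) * n // b]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

print(equi_depth_histogram([4, 8, 1, 7, 9, 3, 2, 5, 1, 4, 8, 6], 3))
# [(1, 3, 4), (4, 6, 4), (7, 9, 4)]
```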

Summarization Methods: Answering Queries Using Histograms [Ioannidis & Poosala 1999]
- Determine all buckets matching the query condition.
- Assume an equi-depth histogram and a uniform distribution of the count over each bucket interval.
Example: select count(*) from R where 4 <= R.A <= 8, over the domain 1, ..., 10.
The estimate sums the matching fraction of each overlapping bucket, here (2/3)·C_B + C_B + (1/3)·C_B = 2·C_B.
For equi-depth histograms, the maximum error is ± 2·C_B, since at most the two boundary buckets are only partially covered.
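A minimal estimator under the uniform-spread assumption; the bucket boundaries below are invented for illustration, since the slide's exact boundaries are not recoverable:

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(lo <= A <= hi) from (left, right, count) buckets over an
    integer domain, assuming counts are spread uniformly within each bucket."""
    total = 0.0
    for left, right, count in buckets:
        overlap = min(hi, right) - max(lo, left) + 1
        if overlap > 0:
            total += count * overlap / (right - left + 1)
    return total

# Equi-depth buckets with C_B = 6 each (assumed boundaries):
hist = [(3, 5, 6), (6, 7, 6), (8, 10, 6)]
print(estimate_range_count(hist, 4, 8))   # (2/3 + 1 + 1/3) * 6 = 12.0
```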

Summarization Methods: Equi-Depth Histogram Construction
Determine the records (attribute values) with ranks n/b, 2n/b, ..., (b-1)·n/b.
r-quantile: the record with rank (index) r.
One-pass computation of quantiles [Manku, Rajagopalan & Lindsay 1998]:
- Split memory M into b buffers of size k.
- For each consecutive subsequence of k stream elements:
  - If there is a free buffer B, insert the subsequence into B and set the level of B to 0.
  - Else merge two buffers B and B' of the same level l into a single buffer of level l + 1, and insert the subsequence into the freed buffer at level 0.
- Output the record with index r after making 2^l copies of each element of the final level-l buffer.
A runnable sketch follows the example below.

Summarization Methods: One-Pass Quantile Computation, Example
M = 9, b = 3, k = 3, r = 10, n = 12; data stream: 9 3 5 2 7 1 6 5 8 4 9 1
- Level 0: buffers (9 3 5), (2 7 1), (6 5 8), (4 9 1)
- Level 1: (1 3 7) from merging the first two level-0 buffers, (1 5 8) from merging the other two
- Level 2: (1 3 7)
Output: 1 1 1 1 3 3 3 3 7 7 7 7 (each element of the level-2 buffer copied 2² = 4 times); the record with rank r = 10 is 7 (8 is the exact result!).
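A runnable sketch of the buffer scheme that reproduces the example. The MRL paper's collapse operation alternates offsets and handles mixed levels more carefully; this simplified version merges the two lowest-level buffers and keeps every other element of the sorted merge:

```python
def collapse(e1, e2):
    """Merge two buffers: sort the combined elements, keep every other one."""
    return sorted(e1 + e2)[::2]

def one_pass_quantile(stream, b, k, r):
    buffers = []                                   # list of (level, elements)
    for i in range(0, len(stream), k):
        if len(buffers) == b:                      # no free buffer: collapse two
            buffers.sort(key=lambda le: le[0])     # the two lowest-level buffers
            (l1, e1), (l2, e2) = buffers.pop(0), buffers.pop(0)
            buffers.append((max(l1, l2) + 1, collapse(e1, e2)))
        buffers.append((0, stream[i:i + k]))       # new subsequence at level 0
    # Output: every element of a level-l buffer stands for 2^l copies.
    weighted = sorted((x, 2 ** l) for l, es in buffers for x in es)
    rank = 0
    for x, w in weighted:
        rank += w
        if rank >= r:
            return x

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(one_pass_quantile(stream, b=3, k=3, r=10))   # -> 7 (exact answer: 8)
```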

Summarization Methods: Discussion
Sampling:
- Can be done efficiently in one pass
- Does not preserve correlations between different attributes
One-dimensional histograms:
- Can be computed efficiently in one pass
- Do not preserve correlations between different attributes
Multi-dimensional histograms:
- Preserve correlations between different attributes
- Require multiple passes (equi-depth or V-optimal)

Summarization Methods: Randomized Sketch Synopses [Thaper et al 2002]
Synopsis: a random linear mapping A: R^N -> R^d of the data stream.
Matrix A: each entry chosen independently from a suitable distribution.
Johnson-Lindenstrauss theorem: if d = O(log(1/δ) / ε²), then for any x ∈ R^N,
(1 - ε)·‖x‖_2 ≤ ‖Ax‖_2 ≤ (1 + ε)·‖x‖_2 with probability at least 1 - δ.
If d is large enough, the approximation error is at most ε.

Summarization Methods: Randomized Sketch Synopses, Example
Data stream: p1 = (1,1), p2 = (1,2), p3 = (1,1), p4 = (1,2), p5 = (2,2)
Data distribution D over the 2 x 2 domain: count 2 for (1,1), 2 for (1,2), 0 for (2,1), 1 for (2,2).
Representation of points as N-dimensional vectors: p1 = (1 0 0 0), p2 = (0 1 0 0), ...
Matrix A:
  0.61  0.13  0.67 -0.39
  0.86  0.24 -0.38 -0.21
  0.91 -0.17  0.33 -0.16
Sketch maintenance is incremental: S_i is the sketch of p_1, ..., p_i, with S_0 = 0 and S_i = S_{i-1} + A·p_i.
S_1 = (0.61, 0.86, 0.91), ..., S_5 = (-1.35, 1.99, 1.32)
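A small numpy sketch of the incremental update S_i = S_{i-1} + A·p_i, with a freshly drawn random matrix A rather than the slide's values:

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 4, 3                               # domain size, sketch size
A = rng.normal(size=(d, N))               # random projection matrix, kept fixed

domain = {(1, 1): 0, (1, 2): 1, (2, 1): 2, (2, 2): 3}
S = np.zeros(d)                           # S_0 = 0
for p in [(1, 1), (1, 2), (1, 1), (1, 2), (2, 2)]:
    e = np.zeros(N)
    e[domain[p]] = 1.0                    # unit vector for the arriving point
    S += A @ e                            # S_i = S_{i-1} + A * p_i

# The sketch equals A @ D for the full distribution D = (2, 2, 0, 1).
assert np.allclose(S, A @ np.array([2.0, 2.0, 0.0, 1.0]))
print(S)
```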

Summarization Methods: Building Histograms from Sketches
D: data distribution of the data stream; H: corresponding histogram, a sequence of hyperrectangles (S_i, v_i); both are represented as N-dimensional vectors (concatenating all dimensions).
Idea:
- Maintain the sketch AD of the data stream.
- By the Johnson-Lindenstrauss theorem, (1 - ε)·‖H - D‖_2 ≤ ‖AH - AD‖_2 ≤ (1 + ε)·‖H - D‖_2 with probability at least 1 - δ.
- Therefore, determine a histogram H such that ‖AH - AD‖_2 is minimized.
Assumption: the domain of each attribute is known in advance.

Summarization Methods: Building Histograms from Sketches
Input: AD (the sketch of the data stream)
Output: H (a histogram approximating the data stream)
Algorithm:
- H = empty
- Iterate B times:
  - For each possible histogram hyperrectangle S:
    - Consider the histogram H_S = H ∪ S.
    - Determine the value for S that minimizes ‖AH_S - AD‖_2 (computing the sketch AH_S of the candidate histogram) and record it.
  - Add the best S to H.
A simplified sketch of this greedy loop follows.
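A simplified one-dimensional sketch of the greedy loop. It assumes buckets are intervals whose values add on top of the current H and uses the closed-form least-squares value for each candidate; Thaper et al. consider multidimensional rectangles and a more refined search, so this is an illustration, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 16, 8, 4                         # domain size, sketch size, # buckets
A = rng.normal(size=(d, N)) / np.sqrt(d)   # JL-style random projection

D = rng.integers(0, 5, size=N).astype(float)   # toy data distribution
AD = A @ D                                     # all the algorithm gets to see

H = np.zeros(N)
for _ in range(B):
    best = None
    for i in range(N):                     # all candidate intervals [i, j]
        for j in range(i, N):
            chi = np.zeros(N)
            chi[i:j + 1] = 1.0             # indicator vector of the interval
            Achi = A @ chi
            v = Achi @ (AD - A @ H) / (Achi @ Achi)   # least-squares value
            err = np.linalg.norm(A @ H + v * Achi - AD)
            if best is None or err < best[0]:
                best = (err, i, j, v)
    _, i, j, v = best
    H[i:j + 1] += v                        # add the best bucket to H

print(np.round(H, 2), np.linalg.norm(A @ H - AD))
```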

Clustering Data Streams: k-Medoid Clustering [Guha, Mishra, Motwani & O'Callaghan 2000]
Challenge: k-medoid clustering algorithms require random access to the data.
Approach:
- Cluster as many records as fit into memory (a chunk).
- The resulting (intermediate) medoids summarize their chunk.
- Cluster the set of all intermediate medoids to obtain the final medoids for the entire data stream.

Clustering Data Streams: k-Medoid Clustering
Two-phase method (see the sketch below):
1) For each (non-overlapping) subsequence S_i of M records, find O(k) medoids in S_i and assign the other records to the closest medoid.
2) Let S be the set of all medoids from the n/M subsequences, each medoid weighted by the number of records assigned to it. Determine k medoids for S and return them.
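A toy sketch of the two-phase method on 1-D points. It uses brute-force weighted k-medoid for the tiny chunks, where the paper uses a constant-factor approximation algorithm finding O(k) medoids:

```python
from itertools import combinations

def k_medoids(points, weights, k):
    """Brute-force weighted k-medoid over 1-D points: tiny inputs only."""
    best, best_cost = None, float("inf")
    for med in combinations(sorted(set(points)), k):
        cost = sum(w * min(abs(p - m) for m in med)
                   for p, w in zip(points, weights))
        if cost < best_cost:
            best, best_cost = med, cost
    return best

def stream_k_medoids(stream, M, k):
    intermediate, weights = [], []
    for i in range(0, len(stream), M):                 # phase 1: chunk by chunk
        chunk = stream[i:i + M]
        meds = k_medoids(chunk, [1] * len(chunk), k)
        counts = {m: 0 for m in meds}
        for p in chunk:                                # weight = #records assigned
            counts[min(meds, key=lambda m: abs(p - m))] += 1
        for m in meds:
            intermediate.append(m)
            weights.append(counts[m])
    return k_medoids(intermediate, weights, k)         # phase 2: weighted medoids

print(stream_k_medoids([4, 8, 1, 7, 9, 3, 2, 5, 1, 4, 8, 6], M=4, k=2))  # -> (1, 6)
```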

Clustering Data Streams: Example
M = 3, k = 1, n = 5; data stream: records 1, ..., 5 (diagram: points in the plane).
Result of the first phase: the records of subsequence S_1 are summarized by medoid 1, those of S_2 by medoid 5.

Clustering Data Streams: Example (continued)
S = {medoid 1 with weight w = 3, medoid 5 with weight w = 2}.
Result of the second phase (final result): a single medoid (here medoid 1) representing the entire stream.

Clustering Data Streams: Analysis
Property 1: Given a dataset D and a k-medoid clustering with cost C whose medoids do not belong to D, there is a clustering with k medoids from D with cost at most 2·C.
Argument: Consider a record p and let m be its closest medoid in the cost-C clustering. Let m' be the point of D closest to m. If m = m', we are done. Otherwise, the triangle inequality yields
dist(p, m') ≤ dist(p, m) + dist(m, m') ≤ 2·dist(p, m),
where the second step uses dist(m, m') ≤ dist(m, p), since m' is the point of D closest to m and p ∈ D.

Clustering Data Streams: Analysis
Using Property 1 and two similar properties, the following can be proved: the cost of the k-medoid clustering obtained from the data stream (in one pass) is at most eight times the cost of the k-medoid clustering of a static database consisting of the same records. This assumes that a constant-factor approximation algorithm is used for clustering the subsequences S_i.
The algorithm can be extended to more than two clustering phases.

Clustering Data Streams: Discussion
- k-medoid clustering in one pass.
- The runtime of k-medoid clustering is at least quadratic in the number of records, i.e. the runtime per record is rather high (not constant).
- The clustering quality is guaranteed to be within a constant factor of the quality of conventional clustering of a database, but the factor is pretty high.
- k-medoid clustering can also be used to generate a synopsis of the data stream.

Data Stream Classification: Decision Tree Classification [Domingos & Hulten 2000]
Observation: for determining the best split attribute, it may be sufficient to consider only a small subset of the training examples belonging to the current node.
Idea: instead of repeatedly reading the database, continue reading further portions of the data stream.
(diagram: tree nodes N1, N2, N3, N4, each grown from successive portions of the stream)

Data Stream Classification: Hoeffding Bounds
Challenge: how many examples are necessary at each node? How much of the data stream should be used for the next choice of a split attribute / split point?
Approach: use Hoeffding bounds.
- r: real-valued random variable with range R (and any probability distribution)
- n: number of observations of r
- r̄: observed mean of r
With probability 1 - δ, the true mean of r is at least r̄ - ε, where ε = sqrt(R²·ln(1/δ) / (2n)).
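The bound in code, with illustrative parameters (for information gain over two classes, the range is R = 1 bit):

```python
from math import log, sqrt

def hoeffding_epsilon(R, delta, n):
    """Half-width of the Hoeffding confidence interval after n observations."""
    return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

# With delta = 1e-7 and n = 1000 examples, the observed mean is within
# ~0.09 of the true mean with probability 1 - delta.
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=1000))   # ~0.0898
```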

Data Stream Classification: Decision Tree Classification
G(X_i): quality measure for a potential split attribute X_i, to be maximized (e.g. information gain).
Goal: with high probability, the attribute chosen using n examples is the same one that would have been chosen using infinitely many examples, with n as small as possible.
- X_a: attribute with the highest observed G after seeing n examples
- X_b: attribute with the second-highest observed G
- ΔG = G(X_a) - G(X_b) ≥ 0

Data Stream Classification: Decision Tree Classification
X_a is the correct choice with probability 1 - δ if n examples have been seen at this node and ΔG > ε.
Rationale: assume that the observed ΔG can be viewed as an average of the ΔG values of the examples belonging to that node. If ΔG > ε, then the Hoeffding bound guarantees for the true mean ΔḠ that
ΔḠ ≥ ΔG - ε > 0 with probability 1 - δ.

Data Stream Classification: Hoeffding Tree Algorithm
- Read examples from the data stream until ΔG > ε (note that ε decreases monotonically with n).
- Split the node using the currently best attribute, obtaining the child nodes n_1, ..., n_k.
- Apply the same procedure to n_1, ..., n_k, using the subsequent portions of the data stream as training examples.
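A minimal sketch of the split decision at a single node, assuming binary class labels, categorical attributes, and information gain as G; the full VFDT algorithm adds tie-breaking, grace periods, and memory management:

```python
import random
from collections import defaultdict
from math import log, sqrt

def entropy(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log(c / total, 2) for c in counts if c)

def info_gain(stats, n):
    """stats[value] = [count_class0, count_class1] for one attribute."""
    parent = entropy([sum(c[0] for c in stats.values()),
                      sum(c[1] for c in stats.values())])
    children = sum(sum(c) / n * entropy(c) for c in stats.values())
    return parent - children

def choose_split(stream, attributes, delta=1e-6, R=1.0):
    """Read (x, y) examples until the best attribute beats the runner-up by eps."""
    stats = {a: defaultdict(lambda: [0, 0]) for a in attributes}
    n = 0
    for x, y in stream:
        n += 1
        for a in attributes:
            stats[a][x[a]][y] += 1
        if n % 100 == 0:                   # check periodically, not on every record
            gains = sorted((info_gain(stats[a], n), a) for a in attributes)
            eps = sqrt(R * R * log(1.0 / delta) / (2 * n))
            if gains[-1][0] - gains[-2][0] > eps:
                return gains[-1][1], n     # confident split decision

def synthetic():
    while True:
        x = {"a": random.randint(0, 1), "b": random.randint(0, 1)}
        yield x, x["a"]                    # the label is determined by attribute "a"

print(choose_split(synthetic(), ["a", "b"]))   # -> ('a', n) after a few hundred examples
```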

Data Stream Classification: Experimental Evaluation
- VFDT: an implementation of the Hoeffding tree algorithm (with several refinements)
- VFDT is more accurate than C4.5 for large numbers of examples
- VFDT produces much smaller decision trees (less overfitting)

Data Stream Classification: Discussion
- The Hoeffding tree algorithm builds a decision tree in a single pass with constant time per example.
- Guarantees for the similarity to the conventional decision tree built from a database.
- For large sets of examples, Hoeffding trees are much smaller and more accurate than conventional decision trees.
- Assumption: the third-best split attribute is significantly worse than the best two (may not be realistic).
- Can the same approach be applied to other hierarchical data mining methods, e.g. hierarchical clustering?

Temporal Models: Introduction
So far, the time stamps of records have been ignored: we have summarized over the entire stream. But decisions are often based on recently observed data, e.g. stock data or sensor networks.
Records ..., r_{i,1}, ..., r_{i,2}, ..., r_{i,k}, ... arrive with time stamps 1, 2, ..., k.
Decay the weight of older records, e.g. using a sliding window model.

Temporal Models: Decaying Data Stream Records
Data stream with time stamps: records r_1, r_2, r_3, ..., r_t, ... are assigned weights w_1, w_2, w_3, ..., w_t, ...; the special time stamp NOW denotes the current time.
Exponential decay: w_t = 2^-(NOW - t), i.e. w_NOW = 1, w_{NOW-1} = 1/2, w_{NOW-2} = 1/4, ...
Sliding window model: w_t = 1 if NOW - t < WINDOW, and 0 otherwise.
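Both weighting schemes in code (the base 2 and the sharp window cutoff are the slide's choices; other decay bases work the same way):

```python
def exponential_weight(t, now):
    """Exponential decay: w_t = 2^-(now - t)."""
    return 2.0 ** -(now - t)

def sliding_window_weight(t, now, window):
    """Sliding window: full weight inside the window, zero outside."""
    return 1.0 if now - t < window else 0.0

now = 10
print([exponential_weight(t, now) for t in (10, 9, 8)])        # [1.0, 0.5, 0.25]
print([sliding_window_weight(t, now, 3) for t in (10, 9, 7)])  # [1.0, 1.0, 0.0]
```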

Temporal Models: Clustering Evolving Data Streams [Aggarwal et al 2003]
Online component:
- Synopsis of micro-clusters (CF-values) organized into a CF-tree and maintained online; the CF-values are extended by a temporal dimension.
- Micro-clusters are stored at snapshots in time following a pyramidal time frame.
Offline component:
- Cluster analysis using different clustering algorithms / different parameter values, based on the CF-tree.

Temporal Models: CF-Values
Clustering feature of a set C of points X_i: CF = (N, LS, SS)
- N = |C|: number of points in C
- LS = Σ_{i=1..N} X_i: linear sum of the N points
- SS = Σ_{i=1..N} X_i²: square sum of the N points
CFs are sufficient to calculate the centroid, measures of compactness, and distance functions for clusters.

Temporal Models: CF-Values
Additivity theorem: the CFs of two disjoint point sets C_1 and C_2 are additive:
CF(C_1 ∪ C_2) = CF(C_1) + CF(C_2) = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
i.e. CFs can be calculated incrementally, which is crucial for the synopsis of a data stream.
CF-tree: a height-balanced tree for the storage of CFs.
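A minimal clustering-feature implementation showing incremental insertion and the additivity theorem, using 1-D points for brevity (for d dimensions, LS and SS are kept per dimension):

```python
from math import sqrt

class CF:
    """Clustering feature (N, LS, SS) of a set of 1-D points."""
    def __init__(self, N=0, LS=0.0, SS=0.0):
        self.N, self.LS, self.SS = N, LS, SS

    def insert(self, x):                   # incremental maintenance
        self.N += 1
        self.LS += x
        self.SS += x * x

    def __add__(self, other):              # additivity theorem
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

    def radius(self):                      # avg. squared distance from centroid
        return sqrt(max(self.SS / self.N - (self.LS / self.N) ** 2, 0.0))

c1, c2 = CF(), CF()
for x in [1.0, 2.0, 3.0]:
    c1.insert(x)
for x in [10.0, 12.0]:
    c2.insert(x)
merged = c1 + c2                           # same CF as inserting all five points
print(merged.N, merged.centroid(), round(merged.radius(), 3))
```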

Temporal Models: CF-Tree
(diagram of a CF-tree with branching factor B = 7 and leaf capacity L = 5)
- Root: entries CF_1, ..., CF_6 with child pointers; CF_1 = CF_7 + ... + CF_12
- Inner nodes: e.g. entries CF_7, ..., CF_12 with child pointers; CF_7 = CF_90 + ... + CF_94
- Leaf nodes: chained by prev/next pointers, e.g. one leaf holding CF_90, CF_91, ..., CF_94 and the next holding CF_95, ..., CF_99

Temporal Models: Pyramidal Time Frame
- Snapshots (micro-clusters) are stored at different levels of granularity, depending on their recency.
- Snapshots of order i are taken at time intervals α^i, for an integer α ≥ 1.
- At any time, only the last α + 1 snapshots of order i are stored.
- For a data stream r_1, ..., r_n, the maximum order of snapshots is log_α n, and the maximum number of stored snapshots is (α + 1)·log_α n.
- For any user-specified time window w, at least one stored snapshot lies within [NOW - 2w, NOW].
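A sketch of the storage rule under the assumption (consistent with the slide's bounds) that a snapshot taken at time t has order i if t is divisible by α^i but not by α^(i+1), and that only the last α + 1 snapshots of each order are retained:

```python
def stored_snapshots(now, alpha):
    """Times of the snapshots retained at time `now` under the pyramidal frame."""
    kept = set()
    i = 0
    while alpha ** i <= now:
        # Snapshot times of order i: multiples of alpha^i, not of alpha^(i+1).
        times = [t for t in range(alpha ** i, now + 1, alpha ** i)
                 if t % alpha ** (i + 1) != 0]
        kept.update(times[-(alpha + 1):])      # keep only the last alpha+1
        i += 1
    return sorted(kept)

print(stored_snapshots(now=55, alpha=2))
# Recent times are densely covered, older times only at coarse granularity.
```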

References
- Aggarwal C. C., Han J., Wang J., Yu P. S.: A Framework for Clustering Evolving Data Streams. Proc. VLDB 2003.
- Domingos P., Hulten G.: Mining High-Speed Data Streams. Proc. ACM SIGKDD 2000.
- Garofalakis M., Gehrke J., Rastogi R.: Querying and Mining Data Streams: You Only Get One Look. Tutorial, VLDB 2002.
- Guha S., Mishra N., Motwani R., O'Callaghan L.: Clustering Data Streams. Proc. IEEE FOCS 2000.
- Ioannidis Y. E., Poosala V.: Histogram-Based Approximation of Set-Valued Query Answers. Proc. VLDB 1999.
- Jagadish H. V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms with Quality Guarantees. Proc. VLDB 1998.
- Manku G. S., Rajagopalan S., Lindsay B. G.: Approximate Medians and Other Quantiles in One Pass and with Limited Memory. Proc. ACM SIGMOD 1998.
- Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms. Proc. ACM SIGMOD 2002.