Mining Data Streams

1 Mining Data Streams
Outline [Garofalakis, Gehrke & Rastogi 2002]
  Introduction
  Summarization Methods
  Clustering Data Streams
  Data Stream Classification
  Temporal Models
CMPT 843, SFU, Martin Ester

2 Introduction: Data Streams
Applications that generate streams of data
  Network monitoring
  Call records in telecommunications
  Web server logs
  Sensor networks
Application characteristics
  Massive volumes of data
  Records arrive at a rapid rate
A data stream is a sequence of records $r_1, \ldots, r_n$

3 Introduction: Computation Model
[Figure: data streams enter a stream processing engine that keeps a synopsis in main memory; a user request is answered with an approximate answer]
Data stream requirements
  Single pass: each record is examined at most once
  Limited storage: main memory is limited to M
  Real-time processing: incremental maintenance of the synopsis in real time

4 Summarization Methods: Sampling
A small random sample may represent the data stream well enough
  [Figure: aggregates such as AVG and MAX computed on a sample of the data stream]
How close is the approximate answer to the actual answer?
  Use tail inequalities to provide probabilistic guarantees

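5 Summarization Methods: Sampling
Tail probability: probability that a random variable $X$ deviates far from its expectation $\mu$
  [Figure: probability distribution with the tail regions beyond $\mu - \mu\varepsilon$ and $\mu + \mu\varepsilon$]
Markov: $\Pr(X \ge \varepsilon) \le \mu / \varepsilon$
Chebyshev: $\Pr(|X - \mu| \ge \mu\varepsilon) \le \mathrm{Var}(X) / (\mu^2 \varepsilon^2)$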
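The two inequalities can be checked numerically. The following is a minimal sketch, not part of the original slides: it treats a small list of non-negative values as the whole distribution (so both bounds hold exactly) and compares the observed tail frequencies with the bounds. The function name `tail_bounds` and its parameters `threshold` and `eps` are illustrative assumptions.

```python
def tail_bounds(values, threshold, eps):
    """Compare observed tail frequencies with the Markov and Chebyshev bounds.

    `values` is treated as the complete distribution of a non-negative
    random variable X with mean mu > 0 (an assumption of this sketch).
    """
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # population variance

    # Markov: Pr(X >= threshold) <= mu / threshold
    markov_tail = sum(v >= threshold for v in values) / n
    markov_bound = min(1.0, mu / threshold)

    # Chebyshev: Pr(|X - mu| >= mu * eps) <= Var(X) / (mu^2 * eps^2)
    cheb_tail = sum(abs(v - mu) >= mu * eps for v in values) / n
    cheb_bound = min(1.0, var / (mu ** 2 * eps ** 2))

    return (markov_tail, markov_bound), (cheb_tail, cheb_bound)


if __name__ == "__main__":
    data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 13]
    print(tail_bounds(data, threshold=8, eps=1.5))
```

Both bounds are typically loose; their value is that they need only the mean (Markov) or the mean and variance (Chebyshev) of the quantity being estimated.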

6 Summarization Methods: Histograms
Partition the attribute domain into b buckets
Count $C_B$ per bucket $B$
Equi-depth histograms
  Given the number of partitions
  Partition such that the counts are equal
V-optimal histograms [Jagadish et al 1998]
  Given the number of partitions
  Partition such that the frequency variance within buckets is minimized:
  $\min \sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{|B|} \right)^2$
  where $f_v$ is the frequency of value $v$ and $|B|$ is the number of values in bucket $B$

7 Summarization Methods: Answering Queries Using Histograms [Ioannidis & Poosala 1999]
Determine all buckets matching the query condition
Assume an equi-depth histogram and a uniform distribution of the count over each bucket interval
  select count(*) from R where 4 <= R.A <= …
  The contributions of the three matching buckets add up to $2\,C_B$
For equi-depth histograms, the maximum error is $\pm 2\,C_B$
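A range count can be estimated from an equi-depth histogram roughly as follows. This is a sketch under stated assumptions, not the slides' own code: buckets are built by sorting all values (which a real stream would not allow in one pass), each bucket is stored as a hypothetical `(low, high, count)` triple, and values are assumed to be spread uniformly and continuously inside a bucket.

```python
def equi_depth_histogram(values, b):
    """Return up to b buckets (low, high, count) with roughly equal counts."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for j in range(b):
        start, end = j * n // b, (j + 1) * n // b
        if start < end:
            buckets.append((data[start], data[end - 1], end - start))
    return buckets


def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= A <= hi under the uniformity assumption."""
    total = 0.0
    for low, high, count in buckets:
        overlap_lo, overlap_hi = max(lo, low), min(hi, high)
        if overlap_lo > overlap_hi:
            continue  # bucket does not match the query condition
        if high == low:
            total += count  # single-valued bucket is fully covered
        else:
            # Fraction of the bucket interval covered by the query range.
            total += count * (overlap_hi - overlap_lo) / (high - low)
    return total


if __name__ == "__main__":
    hist = equi_depth_histogram(range(1, 13), b=3)
    print(hist)
    print(estimate_range_count(hist, 3, 10))
```

Only the two buckets that are cut by the query boundaries contribute error, which is why the error is bounded by twice the bucket count.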

8 Summarization Methods: Equi-Depth Histogram Construction
Determine the records (attribute values) with ranks $n/b, 2n/b, \ldots, (b-1)n/b$
  r-quantile: record with rank / index r
One-pass computation of quantiles [Manku, Rajagopalan & Lindsay 1998]
  Split memory M into b buffers of size k
  For each consecutive subsequence of k stream elements
    If there is a free buffer B, then insert the subsequence into B and set the level of B to 0
    Else merge two buffers B and B' of the same level l; insert the result of the merge into B and set the level of B to l + 1; insert the subsequence into B' and set the level of B' to 0
  Output the record with index r after making $2^l$ copies of each element of every level-l buffer in the final buffers
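The buffer scheme above can be sketched in a few lines. This is an illustrative simplification, not the published algorithm in full: each buffer carries a weight (equal to $2^l$ when only same-level buffers are merged), a merge keeps every second element of the weighted sorted sequence, and when no two buffers share a level the two lightest buffers are merged instead. That fallback rule and the names `pick_pair`, `collapse`, and `one_pass_quantile` are assumptions of this sketch.

```python
def pick_pair(buffers):
    """Indices of two buffers to merge: equal weight if possible, else the two lightest."""
    order = sorted(range(len(buffers)), key=lambda i: buffers[i][0])
    for i, j in zip(order, order[1:]):
        if buffers[i][0] == buffers[j][0]:
            return i, j
    return order[0], order[1]


def collapse(buf1, buf2):
    """Merge two full buffers into one buffer whose weight is the sum of both."""
    (w1, e1), (w2, e2) = buf1, buf2
    w = w1 + w2
    expanded = sorted([x for x in e1 for _ in range(w1)] +
                      [x for x in e2 for _ in range(w2)])
    k = len(e1)
    # Keep k evenly spaced elements of the weighted sorted sequence.
    return (w, [expanded[j * w + w // 2] for j in range(k)])


def one_pass_quantile(stream, b, k, r):
    """Approximate the record of rank r (1-based) with b >= 2 buffers of size k."""
    buffers = []  # full buffers as (weight, elements)
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == k:
            if len(buffers) == b:  # no free buffer: merge two of them
                i, j = pick_pair(buffers)
                merged = collapse(buffers[i], buffers[j])
                buffers = [buf for t, buf in enumerate(buffers) if t not in (i, j)]
                buffers.append(merged)
            buffers.append((1, chunk))
            chunk = []
    # Final output: copy each element `weight` times and read off rank r.
    expanded = sorted(x for w, elems in buffers + [(1, chunk)]
                      for x in elems for _ in range(w))
    return expanded[r - 1]


if __name__ == "__main__":
    print(one_pass_quantile(range(1, 13), b=3, k=3, r=10))
```

With b = 3, k = 3 and a stream of n = 12 records, exactly one merge happens, so the final expansion again contains 12 elements and rank r = 10 can be read off directly.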

9 Summarization Methods: One-Pass Quantile Computation, Example
M = 9, b = 3, k = 3, r = 10, n = 12
Data stream = …
Output = … (8 is the exact result!)
[Figure: buffer contents at each level]

10 Summarization Methods: Discussion
Sampling
  Can be done efficiently in one pass
  Does not preserve correlations between different attributes
One-dimensional histograms
  Can be done efficiently in one pass
  Do not preserve correlations between different attributes
Multi-dimensional histograms
  Preserve correlations between different attributes
  Require multiple passes (equi-depth or V-optimal)

11 Summarization Methods: Randomized Sketch Synopses [Thaper et al. 2002]
Synopsis: random linear mapping of the data stream, $A: \mathbb{R}^N \to \mathbb{R}^d$
Matrix A: each entry is chosen independently from a certain distribution
Johnson-Lindenstrauss theorem
  If $d = O(\log(1/\delta) / \varepsilon^2)$, then for any $x \in \mathbb{R}^N$
  $\|x\|_2 \le \|Ax\|_2 \le (1 + \varepsilon)\,\|x\|_2$ with probability at least $1 - \delta$
If d is large enough, then the approximation error is at most $\varepsilon$

12 Summarization Methods: Randomized Sketch Synopses, Example
Matrix A = …
Data stream D: p1 = (1 1), p2 = (1 2), p3 = (1 1), p4 = (1 2), p5 = (2 2)
Representation as N-dimensional vectors: p1 = (…), p2 = (…), ...
Incremental sketch maintenance
  $S_i$: sketch of the data stream $p_1, \ldots, p_i$
  $S_0 = 0$, $S_i = S_{i-1} + A\,p_i$
  S1 = (…), ..., S5 = (…)
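The incremental update $S_i = S_{i-1} + A\,p_i$ can be sketched as follows. Several details are assumptions made for illustration only: the entries of A are drawn from a scaled Gaussian distribution (the slides only say "a certain distribution"), each two-dimensional point is encoded as a one-hot vector of length N = 4 over the domain {1, 2} x {1, 2}, and the helper names are hypothetical.

```python
import random


def make_sketch_matrix(d, N, seed=0):
    """d x N matrix with independent Gaussian entries, scaled by 1/sqrt(d)."""
    rng = random.Random(seed)
    scale = 1.0 / d ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(N)] for _ in range(d)]


def one_hot(point, side):
    """Encode a point with coordinates in 1..side as a vector of length side*side."""
    x, y = point
    vec = [0.0] * (side * side)
    vec[(x - 1) * side + (y - 1)] = 1.0
    return vec


def update_sketch(sketch, A, p):
    """One maintenance step: S_i = S_{i-1} + A p_i."""
    return [s + sum(a * v for a, v in zip(row, p)) for s, row in zip(sketch, A)]


if __name__ == "__main__":
    stream = [(1, 1), (1, 2), (1, 1), (1, 2), (2, 2)]  # p1..p5 from the slide
    A = make_sketch_matrix(d=3, N=4)
    sketch = [0.0] * 3  # S_0 = 0
    for p in stream:
        sketch = update_sketch(sketch, A, one_hot(p, side=2))
    print(sketch)
```

Because the mapping is linear, the final sketch equals A applied to the frequency vector of the whole stream, which is what makes one-pass maintenance possible.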

13 Summarization Methods: Building Histograms from Sketches
D: data distribution of the data stream
H: corresponding histogram (sequence of hyperrectangles $(S_i, v_i)$)
Both are represented as N-dimensional vectors (concatenate all dimensions), i.e. as mappings $\{1, \ldots, n\}^l \to \{1, \ldots, M\}$
Idea
  We maintain the sketch AD of the data stream
  Johnson-Lindenstrauss theorem: $\|H - D\|_2 \le \|AH - AD\|_2 \le (1 + \varepsilon)\,\|H - D\|_2$ with probability at least $1 - \delta$
  We determine a histogram H such that $\|AH - AD\|_2$ is minimized
Assumption: the domain of each attribute is known in advance

14 Summarization Methods: Building Histograms from Sketches
Input: AD (sketch of the data stream)
Output: H (histogram of the data stream)
Algorithm
  H = empty
  Iterate B times
    For each possible histogram hyperrectangle S do
      Consider the histogram $H_S = H \cup S$
      Compute the sketch $AH_S$ of the histogram
      Determine the value corresponding to S that minimizes $\|AH_S - AD\|_2$
      Record this value of $\|AH_S - AD\|_2$
    Add the best S to H

15 Clustering Data Streams: k-medoid Clustering [Guha, Mishra, Motwani & O'Callaghan 2000]
Challenge: the k-medoid clustering algorithm requires random access to the data.
Approach:
  Cluster as many records (a chunk) as fit into memory.
  The resulting (intermediate) medoids summarize their chunk.
  Cluster the set of all intermediate medoids to obtain the final medoids for the entire data stream.

16 Clustering Data Streams: k-medoid Clustering
Two-phase method
  1) For each (non-overlapping) subsequence $S_i$ of M records, find O(k) medoids in $S_i$ and assign the other records to the closest medoid.
  2) Let S be the set of all medoids from the n/M subsequences, each medoid weighted by the number of corresponding records. Determine k medoids for S and return them.
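The two phases can be sketched for one-dimensional records as follows. This is a toy illustration under several assumptions: the distance is the absolute difference, each chunk is clustered by exhaustive search over all candidate medoid sets (standing in for the constant-factor approximation algorithm the method actually relies on), and exactly k rather than O(k) medoids are kept per chunk. The function names are hypothetical.

```python
from itertools import combinations


def weighted_k_medoids(points, weights, k):
    """Exhaustive weighted k-medoid search; returns (medoid, assigned weight) pairs."""
    k = min(k, len(points))
    best, best_cost = None, float("inf")
    for cand in combinations(range(len(points)), k):
        cost = sum(w * min(abs(p - points[c]) for c in cand)
                   for p, w in zip(points, weights))
        if cost < best_cost:
            best, best_cost = cand, cost
    # Assign every point to its closest medoid and accumulate the weights.
    load = {c: 0 for c in best}
    for p, w in zip(points, weights):
        closest = min(best, key=lambda c: abs(p - points[c]))
        load[closest] += w
    return [(points[c], load[c]) for c in best]


def stream_k_medoids(stream, M, k):
    """Phase 1: cluster chunks of M records. Phase 2: cluster the weighted medoids."""
    summary = []  # (medoid, weight) pairs produced by phase 1
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == M:
            summary.extend(weighted_k_medoids(chunk, [1] * M, k))
            chunk = []
    if chunk:  # last, possibly shorter, subsequence
        summary.extend(weighted_k_medoids(chunk, [1] * len(chunk), k))
    points = [m for m, _ in summary]
    weights = [w for _, w in summary]
    return [m for m, _ in weighted_k_medoids(points, weights, k)]


if __name__ == "__main__":
    print(stream_k_medoids([1, 2, 3, 10, 11], M=3, k=1))
```

Only the current chunk and the list of weighted medoids are held in memory, so storage is bounded by M plus k medoids per chunk.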

17 Clustering Data Streams: Example
M = 3, k = 1, n = 5
[Figure: data stream of five records split into the subsequences $S_1$ and $S_2$; result of the first phase: one medoid per subsequence]

18 Clustering Data Streams: Example
[Figure: set S of intermediate medoids with weights w = 3 and w = 2; result of the second phase (final result): the medoid chosen from S]

19 Clustering Data Streams: Analysis
Property 1
  Given a dataset D and a k-medoid clustering with cost C whose medoids do not belong to D, there is a clustering with k medoids from D with cost at most 2C.
Argument
  [Figure: record p, medoid m, and record m']
  Consider a record p and let m be the closest medoid in the cost-C clustering. Let m' be the record in D closest to m. If m' = p, then we are done. Otherwise, since p itself belongs to D, dist(m, m') ≤ dist(p, m), and applying the triangle inequality:
  dist(p, m') ≤ dist(p, m) + dist(m, m') ≤ 2 dist(p, m)

20 Clustering Data Streams: Analysis
Using property 1 and two similar properties, we can prove the following property:
  The cost of the k-medoid clustering obtained from the data stream (in one pass) is at most eight times the cost of the k-medoid clustering of a static database consisting of the same records.
This assumes that we use a constant-factor approximation algorithm for clustering the subsequences $S_i$.
The algorithm can be extended to cluster in more than two phases.

21 Clustering Data Streams: Discussion
k-medoid clustering in one pass
Runtime of k-medoid clustering is at least $O(n^2)$, i.e. the runtime per record is rather high (not constant)
Guarantee for clustering quality within a constant factor of the quality of conventional clustering (database)
  But the factor is pretty high
k-medoid clustering can also be used for generating a synopsis of the data stream

22 Data Stream Classification: Decision Tree Classification [Domingos & Hulten 2000]
Observation: for determining the best split attribute, it may be sufficient to consider only a small subset of the training examples belonging to the current node
Idea: instead of repeated reads of the database, continue reading further portions of the data stream
[Figure: decision tree nodes N1, N2, N3, ... built from successive portions of the data stream]

23 Data Stream Classification: Hoeffding Bounds
Challenge: How many examples are necessary at each node? How much of the data stream should be used for the next choice of a split attribute / split point?
Approach: using Hoeffding bounds
  r: real-valued random variable with range R (and any probability distribution)
  n: number of observations of r
  $\bar{r}$: the observed mean of r
  With probability $1 - \delta$, the true mean of r is at least $\bar{r} - \varepsilon$, where
  $\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
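The bound is easy to evaluate directly. The sketch below computes $\varepsilon$ for a given number of observations and, by solving the same formula for n, the number of observations needed to reach a target $\varepsilon$. The function names are illustrative, not from the slides.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding epsilon is at most `epsilon`."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, hoeffding_epsilon(1.0, 1e-6, n))
    print(examples_needed(1.0, 1e-6, 0.05))
```

Note that $\varepsilon$ shrinks only with the square root of n: halving $\varepsilon$ requires four times as many observations.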

24 Data Stream Classification: Decision Tree Classification
$G(X_i)$: measure of potential split attribute $X_i$, to be maximized
Goal:
  With high probability, the attribute chosen using n examples is the same that would have been chosen using infinitely many examples
  n should be as small as possible
$X_a$: attribute with the highest observed G after seeing n examples
$X_b$: attribute with the second highest G
$\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) \ge 0$

25 Data Stream Classification: Decision Tree Classification
$X_a$ is the correct choice with probability $1 - \delta$ if n examples have been seen at this node and $\Delta\bar{G} > \varepsilon$
Rationale
  Assume that the G value can be viewed as an average of the G values of the examples belonging to that node
  If $\Delta\bar{G} > \varepsilon$, then the Hoeffding bound guarantees for the true $\Delta G$ that $\Delta G \ge \Delta\bar{G} - \varepsilon > 0$ with probability $1 - \delta$

26 Data Stream Classification: Hoeffding Tree Algorithm
Algorithm
  Read examples from the data stream until $\Delta\bar{G} > \varepsilon$ ($\varepsilon$ decreases monotonically with the number of examples)
  Split the node n using the currently best attribute, obtaining the child nodes $n_1, \ldots, n_k$
  Apply the same procedure to $n_1, \ldots, n_k$, using the subsequent portions of the data stream as training examples
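The stopping test at a single node can be sketched as below. This is only the split decision, not a full tree learner: the observed G values are assumed to be given as a dictionary from attribute name to value, and the function name `try_split` is hypothetical.

```python
import math


def try_split(observed_gain, value_range, delta, n):
    """Return the attribute to split on, or None if more examples are needed.

    observed_gain: dict mapping attribute name -> observed G after n examples.
    """
    if len(observed_gain) < 2:
        return None
    ranked = sorted(observed_gain, key=observed_gain.get, reverse=True)
    best, second = ranked[0], ranked[1]
    delta_g = observed_gain[best] - observed_gain[second]
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    # Split only when the observed advantage exceeds the Hoeffding epsilon.
    return best if delta_g > epsilon else None


if __name__ == "__main__":
    gains = {"a": 0.30, "b": 0.22, "c": 0.05}
    print(try_split(gains, value_range=1.0, delta=1e-6, n=100))
    print(try_split(gains, value_range=1.0, delta=1e-6, n=5000))
```

In the first call the gap of 0.08 between the two best attributes is still smaller than $\varepsilon$, so the node keeps reading examples; with n = 5000 the gap exceeds $\varepsilon$ and the node is split on the best attribute.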

27 Data Stream Classification: Experimental Evaluation
VFDT: different implementations of the Hoeffding tree algorithm
VFDT is more accurate than C4.5 for large numbers of examples
VFDT produces much smaller decision trees (less overfitting)

28 Data Stream Classification: Discussion
The Hoeffding tree algorithm builds a decision tree in a single pass with constant time per example
Guarantees for similarity to a conventional decision tree built from a database
For large sets of examples, Hoeffding trees are much smaller and more accurate than conventional decision trees
Assumption: the third best split attribute is significantly worse than the best two (may not be realistic)
Can the same approach be applied to other hierarchical data mining methods, e.g. hierarchical clustering?

29 Temporal Models: Introduction
So far: time stamps of records have been ignored; we have summarized over the entire stream
But: decisions are often based on recently observed data
  Ex.: stock data, sensor networks
  [Figure: data stream of records with their timestamps]
Decay the weight of older records, e.g. sliding window model

30 Temporal Models: Decaying Data Stream Records
Data stream with time stamps; records are assigned weights; special time stamp NOW
  Records: $r_1, r_2, r_3, \ldots, r_t, \ldots$
  Weights: $w_1, w_2, w_3, \ldots, w_t, \ldots$
Exponential decay
  $w_t = 2^{-(NOW - t)}$, i.e., $w_{NOW} = 1$, $w_{NOW-1} = 1/2$, $w_{NOW-2} = 1/4$, ...
Sliding window model
  $w_t = 1$ if $NOW - t < WINDOW$, and $w_t = 0$ otherwise
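Both weighting schemes are one-liners, and the exponentially decayed sum can be maintained incrementally. The sketch below assumes integer time stamps and one record per time step; the function names are illustrative.

```python
def exponential_weight(t, now):
    """w_t = 2^-(NOW - t)."""
    return 2.0 ** -(now - t)


def sliding_window_weight(t, now, window):
    """w_t = 1 if NOW - t < WINDOW, else 0."""
    return 1 if now - t < window else 0


def decayed_sum_incremental(values):
    """Exponentially decayed sum, where values[t] arrives at time t = 0, 1, ...

    When NOW advances by one step, every old weight is halved, so the
    running sum is halved before the new value is added.
    """
    total = 0.0
    for v in values:
        total = total / 2.0 + v
    return total


if __name__ == "__main__":
    values = [4, 2, 8]
    now = len(values) - 1
    direct = sum(v * exponential_weight(t, now) for t, v in enumerate(values))
    print(direct, decayed_sum_incremental(values))
```

The incremental form needs constant memory, whereas the sliding window model has to remember (or summarize) the records inside the window so that expired ones can be removed.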

31 Temporal Models: Clustering Evolving Data Streams [Aggarwal et al 2003]
Synopsis: micro-clusters (CF-values), organized into a CF-tree maintained online
  CF-values extended by a temporal dimension
  Micro-clusters stored at snapshots in time following a pyramidal time frame
Offline cluster analysis
  Using different clustering algorithms / different parameter values based on the CF-tree

32 Temporal Models: CF-Values
Clustering Feature of a set C of points $X_i$: CF = (N, LS, SS)
  $N = |C|$: number of points in C
  $LS = \sum_{i=1}^{N} X_i$: linear sum of the N points
  $SS = \sum_{i=1}^{N} X_i^2$: square sum of the N points
CFs are sufficient to calculate
  centroids
  measures of compactness
  distance functions for clusters

33 Temporal Models: CF-Values
Additivity Theorem
  CFs of two disjoint point sets $C_1$ and $C_2$ are additive:
  $CF(C_1 \cup C_2) = CF(C_1) + CF(C_2) = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$
  i.e. CFs can be calculated incrementally, which is crucial for the synopsis of a data stream
CF-Tree
  A CF-tree is a height-balanced tree for the storage of CFs.
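The additivity property can be sketched with a small class. The representation is an assumption of this sketch: LS is kept as a tuple with one entry per dimension and SS as a single scalar (the sum of squared norms); the class and method names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int      # number of points
    ls: tuple   # linear sum, one entry per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_points(cls, points):
        dims = len(points[0])
        ls = tuple(sum(p[d] for p in points) for d in range(dims))
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def __add__(self, other):
        """Additivity theorem: CF(C1 u C2) = CF(C1) + CF(C2)."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root mean squared distance of the points to the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(0.0, self.ss / self.n - c2))


if __name__ == "__main__":
    merged = CF.from_points([(1, 2), (3, 4)]) + CF.from_points([(5, 6)])
    print(merged, merged.centroid(), merged.radius())
```

Adding a single new record is the special case of merging with the CF of a one-point set, so a micro-cluster can absorb stream records in constant time.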

34 Temporal Models: CF-Tree
[Figure: CF-tree with B = 7 and L = 5. Root: entries CF1, CF2, CF3, ..., CF6, each with a child pointer, where CF1 = CF7 + ... + CF12. Inner nodes: entries CF7, CF8, CF9, ..., CF12, each with a child pointer, where CF7 = CF90 + ... + CF94. Leaf nodes: entries CF90, CF91, ..., CF94 and CF95, CF96, ..., CF99, linked by prev / next pointers]

35 Temporal Models: Pyramidal Time Frame
Snapshots (micro-clusters) are stored at different levels of granularity, depending on their recency
Snapshots of order i are taken at time intervals $\alpha^i$, with $\alpha$ an integer and $\alpha \ge 1$
At any time, only the last $\alpha + 1$ snapshots of order i are stored
For a data stream $r_1, \ldots, r_n$, the maximum order of snapshots is $\log_\alpha n$ and the maximum number of stored snapshots is $(\alpha + 1) \log_\alpha n$
For any user-specified time window w, there is at least one stored snapshot between NOW - 2w and NOW
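Which snapshot times survive at a given clock time can be computed directly from the two rules above. The sketch below makes some assumptions: clock times are positive integers starting at 1, a snapshot of order i is taken whenever the clock time is divisible by $\alpha^i$, and $\alpha \ge 2$ is required so that the orders actually differ. The function name is hypothetical.

```python
def stored_snapshots(now, alpha=2):
    """Map each order i to the snapshot times still stored at clock time `now`."""
    if alpha < 2:
        raise ValueError("this sketch requires alpha >= 2")
    kept = {}
    order = 0
    while alpha ** order <= now:
        step = alpha ** order
        latest = (now // step) * step  # most recent time divisible by alpha^order
        times = [latest - j * step for j in range(alpha + 1)]
        kept[order] = [t for t in times if t > 0]  # only the last alpha + 1
        order += 1
    return kept


if __name__ == "__main__":
    kept = stored_snapshots(55, alpha=2)
    for order, times in kept.items():
        print(order, times)
    print(sorted({t for times in kept.values() for t in times}))
```

Recent history is covered densely (orders 0 and 1) and older history sparsely (higher orders); a time that appears under several orders needs to be stored only once.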

36 References
Aggarwal C. C., Han J., Wang J., Yu P.: A Framework for Clustering Evolving Data Streams, Proc. VLDB 2003
Domingos P., Hulten G.: Mining High-Speed Data Streams, Proc. ACM SIGKDD 2000
Garofalakis M., Gehrke J., Rastogi R.: Querying and Mining Data Streams: You Only Get One Look, Tutorial VLDB 2002
Guha S., Mishra N., Motwani R., O'Callaghan L.: Clustering Data Streams, Proc. IEEE FOCS 2000
Ioannidis Y.E., Poosala V.: Histogram-Based Approximation of Set-Valued Query Answers, Proc. VLDB 1999
Jagadish H.V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms With Quality Guarantees, Proc. VLDB 1998
Manku G.S., Rajagopalan S., Lindsay B.G.: Approximate Median and Other Quantiles in One Pass and with Limited Memory, Proc. ACM SIGMOD 1998
Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms, Proc. ACM SIGMOD 2002


More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Using Natural Clusters Information to Build Fuzzy Indexing Structure Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

DATA STREAMS: MODELS AND ALGORITHMS

DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS: MODELS AND ALGORITHMS Edited by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 Kluwer Academic Publishers Boston/Dordrecht/London

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

One-Pass Streaming Algorithms

One-Pass Streaming Algorithms One-Pass Streaming Algorithms Theory and Practice Complaints and Grievances about theory in practice Disclaimer Experiences with Gigascope. A practitioner s perspective. Will be using my own implementations,

More information

High-Dimensional Incremental Divisive Clustering under Population Drift

High-Dimensional Incremental Divisive Clustering under Population Drift High-Dimensional Incremental Divisive Clustering under Population Drift Nicos Pavlidis Inference for Change-Point and Related Processes joint work with David Hofmeyr and Idris Eckley Clustering Clustering:

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Hierarchical Clustering Lecture 9

Hierarchical Clustering Lecture 9 Hierarchical Clustering Lecture 9 Marina Santini Acknowledgements Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall 1 Lecture 9: Required Reading Witten et al. (2011:

More information

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Random Sampling over Data Streams for Sequential Pattern Mining

Random Sampling over Data Streams for Sequential Pattern Mining Random Sampling over Data Streams for Sequential Pattern Mining Chedy Raïssi LIRMM, EMA-LGI2P/Site EERIE 161 rue Ada 34392 Montpellier Cedex 5, France France raissi@lirmm.fr Pascal Poncelet EMA-LGI2P/Site

More information

Stream Sequential Pattern Mining with Precise Error Bounds

Stream Sequential Pattern Mining with Precise Error Bounds Stream Sequential Pattern Mining with Precise Error Bounds Luiz F. Mendes,2 Bolin Ding Jiawei Han University of Illinois at Urbana-Champaign 2 Google Inc. lmendes@google.com {bding3, hanj}@uiuc.edu Abstract

More information

Robust Clustering of Data Streams using Incremental Optimization

Robust Clustering of Data Streams using Incremental Optimization Robust Clustering of Data Streams using Incremental Optimization Basheer Hawwash and Olfa Nasraoui Knowledge Discovery and Web Mining Lab Computer Engineering and Computer Science Department University

More information

Knowledge Discovery in Databases II. Lecture 4: Stream clustering

Knowledge Discovery in Databases II. Lecture 4: Stream clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases II Winter Semester 2012/2013 Lecture 4: Stream

More information

Balanced Trees Part Two

Balanced Trees Part Two Balanced Trees Part Two Outline for Today Recap from Last Time Review of B-trees, 2-3-4 trees, and red/black trees. Order Statistic Trees BSTs with indexing. Augmented Binary Search Trees Building new

More information

Lecture 5: Data Streaming Algorithms

Lecture 5: Data Streaming Algorithms Great Ideas in Theoretical Computer Science Summer 2013 Lecture 5: Data Streaming Algorithms Lecturer: Kurt Mehlhorn & He Sun In the data stream scenario, the input arrive rapidly in an arbitrary order,

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

A Framework for Clustering Uncertain Data Streams

A Framework for Clustering Uncertain Data Streams A Framework for Clustering Uncertain Data Streams Charu C. Aggarwal, Philip S. Yu IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532, USA { charu, psyu }@us.ibm.com Abstract In recent

More information

Knowledge Discovery in Databases II Summer Semester 2018

Knowledge Discovery in Databases II Summer Semester 2018 Ludwig Maximilians Universität München Institut für Informatik Lehr und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases II Summer Semester 2018 Lecture 3: Data Streams Lectures

More information

Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree

Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree JMLR: Workshop and Conference Proceedings 41:19 32, 2015 BIGMINE 2015 Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree Zhinoos Razavi Hesabi zhinoos.razavi@rmit.edu.au Timos Sellis

More information

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis 1 Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis 1 Storing data on disk The traditional storage hierarchy for DBMSs is: 1. main memory (primary storage) for data currently

More information

Data mining, 4 cu Lecture 6:

Data mining, 4 cu Lecture 6: 582364 Data mining, 4 cu Lecture 6: Quantitative association rules Multi-level association rules Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Data mining, Spring 2010 (Slides adapted

More information

An improved data stream summary: the count-min sketch and its applications

An improved data stream summary: the count-min sketch and its applications Journal of Algorithms 55 (2005) 58 75 www.elsevier.com/locate/jalgor An improved data stream summary: the count-min sketch and its applications Graham Cormode a,,1, S. Muthukrishnan b,2 a Center for Discrete

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Efficient Approximation of Correlated Sums on Data Streams

Efficient Approximation of Correlated Sums on Data Streams Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Introduction to Indexing R-trees. Hong Kong University of Science and Technology Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records

More information

On Biased Reservoir Sampling in the presence of Stream Evolution

On Biased Reservoir Sampling in the presence of Stream Evolution On Biased Reservoir Sampling in the presence of Stream Evolution Charu C. Aggarwal IBM T. J. Watson Research Center 9 Skyline Drive Hawhorne, NY 532, USA charu@us.ibm.com ABSTRACT The method of reservoir

More information

Locality- Sensitive Hashing Random Projections for NN Search

Locality- Sensitive Hashing Random Projections for NN Search Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade

More information

A Co-Clustering approach for Sum-Product Network Structure Learning

A Co-Clustering approach for Sum-Product Network Structure Learning Università degli Studi di Bari Dipartimento di Informatica LACAM Machine Learning Group A Co-Clustering approach for Sum-Product Network Antonio Vergari Nicola Di Mauro Floriana Esposito December 8, 2014

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Tight results for clustering and summarizing data streams

Tight results for clustering and summarizing data streams Tight results for clustering and summarizing data streams Sudipto Guha Abstract In this paper we investigate algorithms and lower bounds for summarization problems over a single pass data stream. In particular

More information

Sampling for Sequential Pattern Mining: From Static Databases to Data Streams

Sampling for Sequential Pattern Mining: From Static Databases to Data Streams Sampling for Sequential Pattern Mining: From Static Databases to Data Streams Chedy Raïssi LIRMM, EMA-LGI2P/Site EERIE 161 rue Ada 34392 Montpellier Cedex 5, France raissi@lirmm.fr Pascal Poncelet EMA-LGI2P/Site

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Deakin Research Online

Deakin Research Online Deakin Research Online This is the published version: Saha, Budhaditya, Lazarescu, Mihai and Venkatesh, Svetha 27, Infrequent item mining in multiple data streams, in Data Mining Workshops, 27. ICDM Workshops

More information

Lecture 7. Data Stream Mining. Building decision trees

Lecture 7. Data Stream Mining. Building decision trees 1 / 26 Lecture 7. Data Stream Mining. Building decision trees Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 26 1 Data Stream Mining 2 Decision Tree Learning Data Stream Mining 3

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information