Mining Data Streams
1 Mining Data Streams
Outline [Garofalakis, Gehrke & Rastogi 2002]
  Introduction
  Summarization Methods
  Clustering Data Streams
  Data Stream Classification
  Temporal Models
CMPT 843, SFU, Martin Ester
2 Introduction: Data Streams
Applications that generate streams of data
  Network monitoring
  Call records in telecommunications
  Web server logs
  Sensor networks
Application characteristics
  Massive volumes of data
  Records arrive at a rapid rate
A data stream is a sequence of records $r_1, \ldots, r_n$
3 Introduction: Computation Model
[Figure: data streams enter a stream processing engine that keeps a synopsis in main memory; a user request is answered with an approximate answer]
Data stream requirements
  Single pass: each record is examined at most once
  Limited storage: main memory is limited to M
  Real-time processing: incremental maintenance of the synopsis in real time
4 Summarization Methods: Sampling
A small random sample may represent the data stream well enough
[Figure: aggregates such as AVG and MAX computed on a sample of the data stream]
How close is the approximate answer to the actual answer?
Use tail inequalities to provide probabilistic guarantees
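5 Summarization Methods: Sampling
Tail probability: probability that a random variable X deviates far from its expectation $\mu$
[Figure: distribution of X with the tail probability outside the interval from $\mu - \mu\varepsilon$ to $\mu + \mu\varepsilon$]
Markov: $\Pr(X \ge \varepsilon) \le \mu / \varepsilon$ (for non-negative X)
Chebyshev: $\Pr(|X - \mu| \ge \mu\varepsilon) \le \mathrm{Var}(X) / (\mu^2 \varepsilon^2)$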
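The two tail inequalities can be evaluated numerically. The following is a minimal Python sketch; the function names and the extra helper for the mean of n independent sampled values (whose variance is Var(X)/n) are illustrative assumptions, not part of the slides.

```python
def markov_bound(mean, eps):
    """Upper bound on Pr(X >= eps) for a non-negative X with the given mean."""
    return min(1.0, mean / eps)


def chebyshev_bound(mean, variance, eps):
    """Upper bound on Pr(|X - mean| >= mean * eps)."""
    return min(1.0, variance / (mean ** 2 * eps ** 2))


def chebyshev_bound_sample_mean(mean, variance, n, eps):
    """Same bound for the average of n independent draws (assumption: i.i.d.).

    The variance of the average is variance / n, so the bound shrinks with n.
    """
    return chebyshev_bound(mean, variance / n, eps)


# Hypothetical numbers: mean 10, variance 4.
print(markov_bound(10.0, 40.0))
print(chebyshev_bound(10.0, 4.0, 0.5))
print(chebyshev_bound_sample_mean(10.0, 4.0, 100, 0.1))
```

The last call illustrates why a sample can be enough: with 100 sampled records, the relative error of the estimated average exceeds 10% with probability at most the printed bound.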
6 Summarization Methods: Histograms
Partition the attribute domain into b buckets
Count $C_B$ per bucket B
Equi-depth histograms
  Given the number of partitions
  Partition such that the counts are equal
V-optimal histograms [Jagadish et al 1998]
  Given the number of partitions
  Partition such that the frequency variance within buckets is minimized:
  $\min \sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{|B|} \right)^2$, where $f_v$ is the frequency of value v and $|B|$ is the number of values in bucket B
7 Summarization Methods: Answering Queries Using Histograms [Ioannidis & Poosala 1999]
Determine all buckets matching the query condition
Assume an equi-depth histogram and a uniform distribution of the count over the bucket interval
  select count(*) from R where 4 <= R.A <= ...
Return the sum of the estimated contributions of the matching buckets: a fully covered bucket contributes $C_B$, a partially covered bucket a proportional share of $C_B$; in the example the total is $2\,C_B$
For equi-depth histograms, maximum error $\pm 2\,C_B$
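The range-count estimate under the uniform-distribution assumption can be sketched as follows. This is an illustrative Python sketch: the bucket boundaries, the per-bucket count of 100, and the query upper bound 15 are hypothetical values (the slide's own upper bound is not preserved), and the attribute is treated as continuous.

```python
def estimate_range_count(boundaries, bucket_count, lo, hi):
    """Estimate count(*) for lo <= A <= hi from an equi-depth histogram.

    boundaries: sorted bucket edges [e0, e1, ..., eb]
    bucket_count: the common count C_B of every bucket
    Each bucket contributes C_B times the fraction of its interval
    that overlaps the query range (uniform-distribution assumption).
    """
    total = 0.0
    for left, right in zip(boundaries, boundaries[1:]):
        overlap = min(hi, right) - max(lo, left)
        if overlap > 0:
            total += bucket_count * overlap / (right - left)
    return total


# Hypothetical histogram: three buckets [0,8), [8,10), [10,20), C_B = 100.
# The query covers half of the first bucket, all of the second,
# and half of the third, so the estimate is 2 * C_B.
print(estimate_range_count([0, 8, 10, 20], 100, 4, 15))
```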
8 Summarization Methods: Equi-Depth Histogram Construction
Determine the records (attribute values) with ranks $r = n/b,\ 2n/b,\ \ldots,\ (b-1)\,n/b$
r-quantile: record with rank / index r
One-pass computation of quantiles [Manku, Rajagopalan & Lindsay 1998]
  Split memory M into b buffers of size k
  For each consecutive subsequence of k stream elements:
    If there is a free buffer B, then insert the subsequence into B and set the level of B to 0
    Else merge two buffers B and B' of the same level l; insert the result of the merge into B and set the level of B to l + 1; insert the subsequence into B' and set the level of B' to 0
  Output the record with index r after making $2^l$ copies of each element in the final buffer
9 Summarization Methods: One-Pass Quantile Computation
Example
M = 9, b = 3, k = 3, r = 10, n = 12
Data stream = ...
Output = ... (8 is the exact result!)
[Figure: buffer contents and their levels]
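The buffer scheme can be sketched in Python as follows. This is a simplified illustration and not the full algorithm of [Manku, Rajagopalan & Lindsay 1998]: each buffer carries a weight (2^l for a level-l buffer) instead of a level, a merge keeps evenly spaced elements of the weighted sorted union, and when no two buffers share a weight the two lightest ones are merged, which is an assumption beyond the slide. The function names and the sample stream are hypothetical; the slide's own example stream is not preserved.

```python
def collapse(a, wa, b, wb):
    """Merge two sorted buffers of k elements into one buffer of k elements.

    Each element is conceptually repeated 'weight' times; from the sorted
    union, k evenly spaced elements are kept. The new weight is wa + wb.
    """
    expanded = sorted([x for x in a for _ in range(wa)] +
                      [x for x in b for _ in range(wb)])
    w, k = wa + wb, len(a)
    return [expanded[j * w + w // 2] for j in range(k)], w


def one_pass_quantile(stream, num_buffers, k, r):
    """Approximate the record of rank r (1-based) in a single pass."""
    if num_buffers < 2:
        raise ValueError("need at least two buffers")
    buffers, chunk = [], []          # buffers: list of (weight, sorted elements)
    for x in stream:
        chunk.append(x)
        if len(chunk) < k:
            continue
        if len(buffers) == num_buffers:
            # No free buffer: merge the two lightest buffers to free one.
            buffers.sort(key=lambda wb: wb[0])
            (wa, a), (wb_, b) = buffers[0], buffers[1]
            merged, w = collapse(a, wa, b, wb_)
            buffers = buffers[2:] + [(w, merged)]
        buffers.append((1, sorted(chunk)))
        chunk = []
    if chunk:                        # leftover records that do not fill a buffer
        buffers.append((1, sorted(chunk)))
    # Final step: make 'weight' copies of each element and read off rank r.
    expanded = sorted(x for w, buf in buffers for x in buf for _ in range(w))
    return expanded[min(r, len(expanded)) - 1]


# Hypothetical stream with the slide's parameters b = 3, k = 3, r = 10, n = 12.
print(one_pass_quantile(range(1, 13), num_buffers=3, k=3, r=10))
```

Expanding the weighted copies explicitly keeps the sketch short; a memory-conscious implementation would walk the buffers with counters instead.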
10 Summarization Methods: Discussion
Sampling
  Can be done efficiently in one pass
  Does not preserve correlations between different attributes
One-dimensional histograms
  Can be done efficiently in one pass
  Do not preserve correlations between different attributes
Multi-dimensional histograms
  Preserve correlations between different attributes
  Require multiple passes (equi-depth or V-optimal)
11 Summarization Methods: Randomized Sketch Synopses [Thaper et al. 2002]
Synopsis: random linear mapping of the data stream, $A: \mathbb{R}^N \to \mathbb{R}^d$
Matrix A: each entry is chosen independently from a certain distribution
Johnson-Lindenstrauss theorem
  If $d = O\!\left(\frac{\log(1/\delta)}{\varepsilon^2}\right)$, then for any $x \in \mathbb{R}^N$: $\|x\|_2 \le \|Ax\|_2 \le (1 + \varepsilon)\,\|x\|_2$ with probability at least $1 - \delta$
If d is large enough, then the approximation error is at most $\varepsilon$
12 Summarization Methods: Randomized Sketch Synopses
Example
[Figure: example matrix A and data distribution D]
Data stream: p1 = (1 1), p2 = (1 2), p3 = (1 1), p4 = (1 2), p5 = (2 2)
Representation as N-dimensional vectors: p1 = ( ), p2 = ( ), ...
Incremental sketch maintenance
  $S_i$: sketch of the data stream $p_1, \ldots, p_i$
  $S_0 = 0$
  $S_i = S_{i-1} + A\,p_i$
  S_1 = ( ), ..., S_5 = ( )
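The incremental update $S_i = S_{i-1} + A\,p_i$ can be sketched in Python as follows. Several choices here are assumptions made for illustration only: the entries of A are drawn from a standard Gaussian, the attribute domain is {1, 2} x {1, 2} (so N = 4), each point is represented as a unit vector over the domain cells, and the sketch dimension is d = 3. The slide's own matrix and vectors are not preserved.

```python
import random


def make_sketch_matrix(d, n, seed=0):
    """d x n matrix with independent standard Gaussian entries (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]


def cell_index(point, side):
    """Position of a 2-D point (coordinates 1..side) in the N-dimensional vector."""
    x, y = point
    return (x - 1) * side + (y - 1)


def update_sketch(sketch, matrix, index):
    """S_i = S_{i-1} + A p_i, where p_i is the unit vector of the given cell.

    Multiplying A by a unit vector selects one column, so the update costs O(d).
    """
    for row, matrix_row in enumerate(matrix):
        sketch[row] += matrix_row[index]


side = 2                      # attribute domain {1, 2} per dimension
A = make_sketch_matrix(3, side * side)
stream = [(1, 1), (1, 2), (1, 1), (1, 2), (2, 2)]

sketch = [0.0] * 3            # S_0 = 0
for p in stream:
    update_sketch(sketch, A, cell_index(p, side))
print(sketch)
```

Because the mapping is linear, the final sketch equals A applied to the frequency vector D = (2, 2, 0, 1) of the stream, which is what allows the sketch to be maintained one record at a time.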
13 Summarization Methods: Building Histograms from Sketches
D: data distribution of the data stream, $D: \{1, \ldots, n\}^l \to \{1, \ldots, M\}$
H: corresponding histogram (sequence of hyperrectangles $(S_i, v_i)$)
Both are represented as N-dimensional vectors (concatenate all dimensions)
Idea
  We maintain the sketch AD of the data stream
  Johnson-Lindenstrauss theorem: $\|H - D\|_2 \le \|AH - AD\|_2 \le (1 + \varepsilon)\,\|H - D\|_2$ with probability at least $1 - \delta$
  We determine a histogram H such that $\|AH - AD\|_2$ is minimized
Assumption: the domain of each attribute is known in advance
14 Summarization Methods: Building Histograms from Sketches
Input: AD (sketch of the data stream)
Output: H (histogram of the data stream)
Algorithm
  H = empty
  Iterate B times:
    For each possible histogram hyperrectangle S do
      Consider the histogram $H_S = H \cup S$
      Compute the sketch $AH_S$ of the histogram
      Determine the value corresponding to S that minimizes $\|AH_S - AD\|_2$
      Record this value of $\|AH_S - AD\|_2$
    Add the best S to H
15 Clustering Data Streams: k-medoid Clustering [Guha, Mishra, Motwani & O'Callaghan 2000]
Challenge: the k-medoid clustering algorithm requires random access to the data.
Approach:
  Cluster as many records (a chunk) as fit into memory.
  The resulting (intermediate) medoids summarize their chunk.
  Cluster the set of all intermediate medoids to obtain the final medoids for the entire data stream.
16 Clustering Data Streams: k-medoid Clustering
Two-phase method
  1) For each (non-overlapping) subsequence $S_i$ of M records, find O(k) medoids in $S_i$ and assign the other records to the closest medoid.
  2) Let S be the set of all medoids from the n/M subsequences, each medoid weighted by the number of corresponding records. Determine k medoids for S and return them.
17 Clustering Data Streams: Example
M = 3, k = 1, n = 5
[Figure: data stream of five records split into the subsequences $S_1$ and $S_2$; result of the first phase: one medoid per subsequence]
18 Clustering Data Streams: Example
[Figure: set S of the intermediate medoids with weights w = 3 and w = 2; result of the second phase (final result): one medoid chosen from S]
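The two-phase method can be sketched in Python as follows. This is a minimal illustration under several assumptions: records are one-dimensional numbers with absolute difference as distance, each chunk is summarized by exactly k medoids (the slides allow O(k)), and the medoids are found by exhaustive search, which is only feasible for tiny inputs. The function names and the sample stream are hypothetical; the records of the slide's example are not preserved.

```python
from itertools import combinations


def weighted_k_medoids(points, weights, k):
    """Exhaustive search for the k points minimizing the weighted distance sum."""
    best_cost, best = None, None
    for candidate in combinations(range(len(points)), k):
        cost = sum(w * min(abs(p - points[c]) for c in candidate)
                   for p, w in zip(points, weights))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, candidate
    return [points[c] for c in best]


def summarize_chunk(chunk, k):
    """Phase 1 for one chunk: medoids plus the number of records assigned to each."""
    medoids = weighted_k_medoids(chunk, [1] * len(chunk), min(k, len(chunk)))
    counts = [0] * len(medoids)
    for p in chunk:
        nearest = min(range(len(medoids)), key=lambda i: abs(p - medoids[i]))
        counts[nearest] += 1
    return medoids, counts


def stream_k_medoids(stream, M, k):
    """Cluster chunks of M records, then cluster the weighted medoids."""
    all_medoids, all_weights, chunk = [], [], []
    for record in stream:
        chunk.append(record)
        if len(chunk) == M:
            medoids, counts = summarize_chunk(chunk, k)
            all_medoids += medoids
            all_weights += counts
            chunk = []
    if chunk:                                   # last, shorter chunk
        medoids, counts = summarize_chunk(chunk, k)
        all_medoids += medoids
        all_weights += counts
    # Phase 2: each intermediate medoid counts as many times as its weight.
    return weighted_k_medoids(all_medoids, all_weights, min(k, len(all_medoids)))


# Hypothetical stream with the slide's parameters M = 3, k = 1, n = 5.
print(stream_k_medoids([1, 2, 3, 10, 11], M=3, k=1))
```

With this stream the first phase yields two intermediate medoids with weights 3 and 2, mirroring the weights shown in the example.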
19 Clustering Data Streams: Analysis
Property 1
  Given a dataset D and a k-medoid clustering with cost C, where the medoids do not belong to D, there is a clustering with k medoids from D with cost at most 2C.
[Figure: record p, medoid m, and replacement medoid m']
Argument
  Consider a record p and let m be the closest medoid in the cost-C clustering.
  Let m' be the record in D that is closest to m.
  If m = m', then we are done.
  Otherwise, since p belongs to D, $dist(m, m') \le dist(p, m)$, and applying the triangle inequality:
  $dist(p, m') \le dist(p, m) + dist(m, m') \le 2\, dist(p, m)$
20 Clustering Data Streams: Analysis
Using Property 1 and two similar properties, we can prove the following property:
  The cost of the k-medoid clustering obtained from the data stream (in one pass) is at most eight times the cost of the k-medoid clustering of a static database consisting of the same records.
This assumes that we use a constant-factor approximation algorithm for clustering the subsequences $S_i$.
The algorithm can be extended to cluster in more than two passes.
21 Clustering Data Streams: Discussion
k-medoid clustering in one pass
The runtime of k-medoid clustering is at least quadratic, $O(n^2)$, i.e. the runtime per record is rather high (not constant)
Guarantee: clustering quality within a constant factor of the quality of conventional clustering (database)
But the factor is pretty high
k-medoid clustering can also be used for generating a synopsis of the data stream
22 Data Stream Classification: Decision Tree Classification [Domingos & Hulten 2000]
Observation: For determining the best split attribute, it may be sufficient to consider only a small subset of the training examples belonging to the current node
Idea: Instead of repeated reads of the database, continue reading further portions of the data stream
[Figure: decision tree nodes N1, N2, N3, ... each trained on a successive portion of the data stream]
23 Data Stream Classification: Hoeffding Bounds
Challenge: How many examples are necessary at each node? How much of the data stream should be used for the next choice of a split attribute / split point?
Approach: using Hoeffding bounds
  r: real-valued random variable with range R (and any probability distribution)
  n: number of observations of r
  $\bar{r}$: the observed mean of r
  With probability $1 - \delta$, the true mean of r is at least $\bar{r} - \varepsilon$, where $\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
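The Hoeffding bound can be evaluated directly. The following Python sketch computes $\varepsilon$ for given R, $\delta$, and n, and inverts the formula to obtain the number of observations needed for a target $\varepsilon$; the function names and the numbers in the usage lines are illustrative assumptions.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n with hoeffding_epsilon(R, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# Hypothetical setting: R = 1, delta = 0.05.
print(hoeffding_epsilon(1.0, 0.05, 1000))
print(examples_needed(1.0, 0.05, 0.1))
```

Note that $\varepsilon$ shrinks only with the square root of n: halving $\varepsilon$ requires four times as many observations.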
24 Data Stream Classification: Decision Tree Classification
$G(X_i)$: measure of the potential split attribute $X_i$, to be maximized
Goal: With high probability, the attribute chosen using n examples is the same that would have been chosen using infinite examples; n should be as small as possible
$X_a$: attribute with the highest observed G after seeing n examples
$X_b$: attribute with the second highest observed G
$\Delta\overline{G} = \overline{G}(X_a) - \overline{G}(X_b) \ge 0$
25 Data Stream Classification: Decision Tree Classification
$X_a$ is the correct choice with probability $1 - \delta$ if n examples have been seen at this node and $\Delta\overline{G} > \varepsilon$
Rationale
  Assume that the G value can be viewed as an average of the G values of the examples belonging to that node
  If $\Delta\overline{G} > \varepsilon$, then the Hoeffding bound guarantees for the true $\Delta G$: $\Delta G \ge \Delta\overline{G} - \varepsilon > 0$ with probability $1 - \delta$
26 Data Stream Classification: Hoeffding Tree Algorithm
Algorithm
  Read examples from the data stream until $\Delta\overline{G} > \varepsilon$ ($\varepsilon$ decreases monotonically with n)
  Split the node n using the currently best attribute, obtaining the children nodes $n_1, \ldots, n_k$
  Apply the same procedure to $n_1, \ldots, n_k$, using the subsequent portions of the data stream as training examples
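The split test at a single node can be sketched in Python as follows. This is an illustrative sketch of the decision rule only, not the VFDT implementation: the observed G values are passed in as a dictionary, and the attribute names, the range R = 1, and $\delta$ = 0.05 are hypothetical.

```python
import math


def choose_split(observed_gain, value_range, delta, n):
    """Return the best attribute if the Hoeffding test passes, else None.

    observed_gain: dict mapping attribute name -> observed G after n examples
    The node is split on X_a once G(X_a) - G(X_b) > epsilon, where X_b is
    the runner-up and epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    """
    if len(observed_gain) < 2:
        raise ValueError("need at least two candidate attributes")
    ranked = sorted(observed_gain, key=observed_gain.get, reverse=True)
    best, second = ranked[0], ranked[1]
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    if observed_gain[best] - observed_gain[second] > epsilon:
        return best
    return None          # keep reading examples from the stream


gains = {"x1": 0.30, "x2": 0.25, "x3": 0.05}
print(choose_split(gains, 1.0, 0.05, n=100))    # too few examples: no split yet
print(choose_split(gains, 1.0, 0.05, n=1000))   # gap now exceeds epsilon
```

With the same observed gap of 0.05, the test fails after 100 examples and passes after 1000, because $\varepsilon$ decreases monotonically with n.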
27 Data Stream Classification: Experimental Evaluation
VFDT: different implementations of the Hoeffding tree algorithm
VFDT is more accurate than C4.5 for a large number of examples
VFDT produces much smaller decision trees (less overfitting)
28 Data Stream Classification: Discussion
The Hoeffding tree algorithm builds a decision tree in a single pass with constant time per example
Guarantees for similarity to a conventional decision tree built from a database
For large sets of examples, Hoeffding trees are much smaller and more accurate than conventional decision trees
Assumption: the third best split attribute is significantly worse than the best two (may not be realistic)
Can the same approach be applied to other hierarchical data mining methods, e.g. hierarchical clustering?
29 Temporal Models: Introduction
So far: the time stamps of records have been ignored; we have summarized over the entire stream
But: decisions are often based on recently observed data
Ex.: stock data, sensor networks
[Figure: records of the data stream with their timestamps]
Decay the weight of older records, e.g. sliding window model
30 Temporal Models: Decaying Data Stream Records
Data stream with time stamps
Records are assigned weights
Special time stamp NOW
  Records: $r_1, r_2, r_3, \ldots, r_t, \ldots$
  Weights: $w_1, w_2, w_3, \ldots, w_t, \ldots$
Exponential decay
  $w_t = 2^{-(\mathrm{NOW} - t)}$, i.e., $w_{\mathrm{NOW}} = 1$, $w_{\mathrm{NOW}-1} = 1/2$, $w_{\mathrm{NOW}-2} = 1/4$, ...
Sliding window model
  $w_t = 1$ if $\mathrm{NOW} - t < \mathrm{WINDOW}$, and $w_t = 0$ otherwise
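Both weighting schemes can be written down directly. The following Python sketch applies them to a weighted average over timestamped records; the `weighted_average` helper and the sample records are illustrative assumptions.

```python
def exponential_weight(t, now):
    """w_t = 2^-(NOW - t): the weight halves with every time step of age."""
    return 2.0 ** (-(now - t))


def sliding_window_weight(t, now, window):
    """w_t = 1 if NOW - t < WINDOW, else 0."""
    return 1.0 if now - t < window else 0.0


def weighted_average(records, now, weight_fn):
    """Average of the values in (timestamp, value) pairs under a decay model."""
    total_weight = sum(weight_fn(t, now) for t, _ in records)
    if total_weight == 0:
        return None
    return sum(weight_fn(t, now) * v for t, v in records) / total_weight


records = [(1, 10.0), (2, 20.0), (3, 30.0)]     # hypothetical (timestamp, value)
print(weighted_average(records, 3, exponential_weight))
print(weighted_average(records, 3, lambda t, now: sliding_window_weight(t, now, 2)))
```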
31 Temporal Models: Clustering Evolving Data Streams [Aggarwal et al 2003]
Synopsis: micro-clusters (CF-values), organized into a CF-tree
  Maintained online
  CF-values extended by a temporal dimension
Micro-clusters are stored at snapshots in time following a pyramidal time frame
Offline cluster analysis
  Using different clustering algorithms / different parameter values
  Based on the CF-tree
32 Temporal Models: CF-Values
Clustering Feature of a set C of points $X_i$: CF = (N, LS, SS)
  $N = |C|$: number of points in C
  $LS = \sum_{i=1}^{N} X_i$: linear sum of the N points
  $SS = \sum_{i=1}^{N} X_i^2$: square sum of the N points
CFs are sufficient to calculate centroids, measures of compactness, and distance functions for clusters
33 Temporal Models: CF-Values
Additivity Theorem
  The CFs of two disjoint point sets $C_1$ and $C_2$ are additive:
  $CF(C_1 \cup C_2) = CF(C_1) + CF(C_2) = (N_1 + N_2,\ LS_1 + LS_2,\ SS_1 + SS_2)$
  i.e. CFs can be calculated incrementally, which is crucial for the synopsis of a data stream
CF-Tree
  A CF-tree is a height-balanced tree for the storage of CFs.
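The additivity of CFs can be illustrated with a small Python class. This is a sketch under stated assumptions: points are tuples of numbers, SS is kept as a single scalar (the sum of squared coordinates over all points), and the radius is taken as the root of the mean squared distance to the centroid; the class and method names are hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class CF:
    n: int        # N: number of points
    ls: tuple     # LS: per-dimension linear sum
    ss: float     # SS: sum of squared coordinates over all points

    @staticmethod
    def from_points(points):
        dim = len(points[0])
        ls = tuple(sum(p[i] for p in points) for i in range(dim))
        ss = sum(x * x for p in points for x in p)
        return CF(len(points), ls, ss)

    def __add__(self, other):
        """Additivity theorem: component-wise sum of two CFs."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root mean squared distance of the points to the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(0.0, self.ss / self.n - c2))


a = CF.from_points([(0, 0), (2, 0)])
b = CF.from_points([(4, 0)])
merged = a + b                      # no access to the original points needed
print(merged, merged.centroid(), merged.radius())
```

Adding a single new record to a micro-cluster is the special case where `b` is the CF of one point, which is why the synopsis can be maintained in one pass.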
34 Temporal Models: CF-Tree
B = 7, L = 5
[Figure: CF-tree. The root holds the entries $CF_1, \ldots, CF_6$ with pointers child_1, ..., child_6, where $CF_1 = CF_7 + \ldots + CF_{12}$. An inner node holds $CF_7, \ldots, CF_{12}$ with pointers child_7, ..., child_12, where $CF_7 = CF_{90} + \ldots + CF_{94}$. The leaf nodes hold $CF_{90}, \ldots, CF_{94}$ and $CF_{95}, \ldots, CF_{99}$ and are chained by prev / next pointers.]
35 Temporal Models: Pyramidal Time Frame
Snapshots (micro-clusters) are stored at different levels of granularity, depending upon the recency
A snapshot of order i is taken at time intervals of $\alpha^i$, where $\alpha$ is an integer and $\alpha \ge 1$
At any time, only the last $\alpha + 1$ snapshots of order i are stored
For a data stream $r_1, \ldots, r_n$, the maximum order of snapshots is $\log_\alpha n$ and the maximum number of stored snapshots is $(\alpha + 1) \log_\alpha n$
For any user-specified time window w, at least one stored snapshot lies between $\mathrm{NOW} - 2w$ and $\mathrm{NOW}$
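The set of snapshots kept at a given time can be enumerated with a short Python sketch. It follows only the two rules stated above (order-i snapshots at multiples of $\alpha^i$, last $\alpha + 1$ per order) and assumes integer timestamps starting at 1 and $\alpha \ge 2$, since the loop over orders would not terminate for $\alpha = 1$. The function name is hypothetical, and refinements of the published framework, such as avoiding duplicates across orders, are not modeled.

```python
def stored_snapshots(now, alpha):
    """Map each order i to the snapshot times still stored at time 'now'."""
    if alpha < 2:
        raise ValueError("this sketch assumes alpha >= 2")
    stored = {}
    order = 0
    while alpha ** order <= now:
        step = alpha ** order
        latest = (now // step) * step          # most recent multiple of alpha^i
        # Keep the last alpha + 1 snapshots of this order (timestamps start at 1).
        stored[order] = [latest - j * step
                         for j in range(alpha + 1) if latest - j * step > 0]
        order += 1
    return stored


snapshots = stored_snapshots(10, 2)
print(snapshots)
print(sorted({t for times in snapshots.values() for t in times}))
```

Recent history is covered densely by the low orders and older history sparsely by the high orders, which is what keeps the number of stored snapshots logarithmic in the stream length.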
36 References
Aggarwal C. C., Han J., Wang J., Yu P.: A Framework for Clustering Evolving Data Streams, Proc. VLDB 2003.
Domingos P., Hulten G.: Mining High-Speed Data Streams, Proc. ACM SIGKDD 2000.
Garofalakis M., Gehrke J., Rastogi R.: Querying and Mining Data Streams: You Only Get One Look, Tutorial VLDB 2002.
Guha S., Mishra N., Motwani R., O'Callaghan L.: Clustering Data Streams, Proc. IEEE FOCS 2000.
Ioannidis Y.E., Poosala V.: Histogram-Based Approximation of Set-Valued Query Answers, Proc. VLDB 1999.
Jagadish H.V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms With Quality Guarantees, Proc. VLDB 1998.
Manku G.S., Rajagopalan S., Lindsay B.G.: Approximate Median and Other Quantiles in One Pass and with Limited Memory, Proc. ACM SIGMOD 1998.
Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms, Proc. ACM SIGMOD 2002.
Random Sampling over Data Streams for Sequential Pattern Mining Chedy Raïssi LIRMM, EMA-LGI2P/Site EERIE 161 rue Ada 34392 Montpellier Cedex 5, France France raissi@lirmm.fr Pascal Poncelet EMA-LGI2P/Site
More informationStream Sequential Pattern Mining with Precise Error Bounds
Stream Sequential Pattern Mining with Precise Error Bounds Luiz F. Mendes,2 Bolin Ding Jiawei Han University of Illinois at Urbana-Champaign 2 Google Inc. lmendes@google.com {bding3, hanj}@uiuc.edu Abstract
More informationRobust Clustering of Data Streams using Incremental Optimization
Robust Clustering of Data Streams using Incremental Optimization Basheer Hawwash and Olfa Nasraoui Knowledge Discovery and Web Mining Lab Computer Engineering and Computer Science Department University
More informationKnowledge Discovery in Databases II. Lecture 4: Stream clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases II Winter Semester 2012/2013 Lecture 4: Stream
More informationBalanced Trees Part Two
Balanced Trees Part Two Outline for Today Recap from Last Time Review of B-trees, 2-3-4 trees, and red/black trees. Order Statistic Trees BSTs with indexing. Augmented Binary Search Trees Building new
More informationLecture 5: Data Streaming Algorithms
Great Ideas in Theoretical Computer Science Summer 2013 Lecture 5: Data Streaming Algorithms Lecturer: Kurt Mehlhorn & He Sun In the data stream scenario, the input arrive rapidly in an arbitrary order,
More informationCS 229 Midterm Review
CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask
More informationClustering. (Part 2)
Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works
More informationA Framework for Clustering Uncertain Data Streams
A Framework for Clustering Uncertain Data Streams Charu C. Aggarwal, Philip S. Yu IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532, USA { charu, psyu }@us.ibm.com Abstract In recent
More informationKnowledge Discovery in Databases II Summer Semester 2018
Ludwig Maximilians Universität München Institut für Informatik Lehr und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases II Summer Semester 2018 Lecture 3: Data Streams Lectures
More informationAnytime Concurrent Clustering of Multiple Streams with an Indexing Tree
JMLR: Workshop and Conference Proceedings 41:19 32, 2015 BIGMINE 2015 Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree Zhinoos Razavi Hesabi zhinoos.razavi@rmit.edu.au Timos Sellis
More informationAdvances in Data Management Principles of Database Systems - 2 A.Poulovassilis
1 Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis 1 Storing data on disk The traditional storage hierarchy for DBMSs is: 1. main memory (primary storage) for data currently
More informationData mining, 4 cu Lecture 6:
582364 Data mining, 4 cu Lecture 6: Quantitative association rules Multi-level association rules Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Data mining, Spring 2010 (Slides adapted
More informationAn improved data stream summary: the count-min sketch and its applications
Journal of Algorithms 55 (2005) 58 75 www.elsevier.com/locate/jalgor An improved data stream summary: the count-min sketch and its applications Graham Cormode a,,1, S. Muthukrishnan b,2 a Center for Discrete
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationEfficient Approximation of Correlated Sums on Data Streams
Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationA Review on Cluster Based Approach in Data Mining
A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationIntroduction to Indexing R-trees. Hong Kong University of Science and Technology
Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records
More informationOn Biased Reservoir Sampling in the presence of Stream Evolution
On Biased Reservoir Sampling in the presence of Stream Evolution Charu C. Aggarwal IBM T. J. Watson Research Center 9 Skyline Drive Hawhorne, NY 532, USA charu@us.ibm.com ABSTRACT The method of reservoir
More informationLocality- Sensitive Hashing Random Projections for NN Search
Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade
More informationA Co-Clustering approach for Sum-Product Network Structure Learning
Università degli Studi di Bari Dipartimento di Informatica LACAM Machine Learning Group A Co-Clustering approach for Sum-Product Network Antonio Vergari Nicola Di Mauro Floriana Esposito December 8, 2014
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationStatistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte
Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,
More informationTight results for clustering and summarizing data streams
Tight results for clustering and summarizing data streams Sudipto Guha Abstract In this paper we investigate algorithms and lower bounds for summarization problems over a single pass data stream. In particular
More informationSampling for Sequential Pattern Mining: From Static Databases to Data Streams
Sampling for Sequential Pattern Mining: From Static Databases to Data Streams Chedy Raïssi LIRMM, EMA-LGI2P/Site EERIE 161 rue Ada 34392 Montpellier Cedex 5, France raissi@lirmm.fr Pascal Poncelet EMA-LGI2P/Site
More informationPreprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
More informationDeakin Research Online
Deakin Research Online This is the published version: Saha, Budhaditya, Lazarescu, Mihai and Venkatesh, Svetha 27, Infrequent item mining in multiple data streams, in Data Mining Workshops, 27. ICDM Workshops
More informationLecture 7. Data Stream Mining. Building decision trees
1 / 26 Lecture 7. Data Stream Mining. Building decision trees Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 26 1 Data Stream Mining 2 Decision Tree Learning Data Stream Mining 3
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More information