Optimal Tracking of Distributed Heavy Hitters and Quantiles

Size: px
Start display at page:

Download "Optimal Tracking of Distributed Heavy Hitters and Quantiles"

Transcription

1 Optimal Tracking of Distributed Heavy Hitters and Quantiles Qin Zhang Hong Kong University of Science & Technology Joint work with Ke Yi PODS 2009 June 30,

2 2- Models and Problems

3 Distributed streaming model [Babcock, Olston SIGMOD 2003] Alice observes A(t) by time t 3 2 Carole tries to track f(a(t) B(t)) for all t (an ɛ-approximation is usually allowed) Bob observes B(t) by time t A(t), B(t) : multisets 3-

4 Distributed streaming model [Babcock, Olston SIGMOD 2003] Alice observes A(t) by time t 3 2 Carole tries to track f(a(t) B(t)) for all t (an ɛ-approximation is usually allowed) Bob observes B(t) by time t Find an (approximate) tracking protocol while minimizing communication A(t), B(t) : multisets 3-2

5 Distributed streaming model [Babcock, Olston SIGMOD 2003] Extend to k sites A (t) 3 2 Coordinator f(a (t) A 2 (t)... A k (t)) Site Site A 2 (t) Site Find a protocol to minimize communication. 3-3

6 Combination of two models X Y (One-shot) communication model f(x Y ); minimize communication 4-

7 Combination of two models X Y (One-shot) communication model f(x Y ); minimize communication 4-2

8 Combination of two models X Y A(t) Streaming model f(a(t)); minimize space (One-shot) communication model f(x Y ); minimize communication 4-3

9 Combination of two models X Y A(t) Streaming model f(a(t)); minimize space (One-shot) communication model f(x Y ); minimize communication Distributed streaming model a.k.a. Continuous commun. model Only interested in communication cost 4-4

10 Applied motivation: distributed monitoring C Large-scale distributed holistic querying/monitoring. Picture from the tutorial Streaming in a connected world: Querying and tracking distributed data streams at VLDB 06 and SIGMOD 07 5-

11 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network 6-

12 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network Single Quantile: Median, /4-quantile. Application: The median size of packets over a network. 6-2

13 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network Single Quantile: Median, /4-quantile. Application: The median size of packets over a network. All Quantiles: A is a set of distinct elements from a totally ordered universe. Quantile of x = y < x, y A / A, usually allows a ±ɛ error. Need a data structure to extract ɛ-approximate quantile for any x. 6-3

14 7- Previous and our results

15 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error 8-

16 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error Distinct count F 0, Second frequency moment F 2 F 0 : O(k/ɛ 2 log n log(n/δ)) [Cormode, Muthukrishnan, Yi, SODA 08] F 2 : O((k 2 /ɛ 2 + k 3/2 /ɛ 4 ) log n log(n/δ) [same paper] 8-2

17 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error Distinct count F 0, Second frequency moment F 2 F 0 : O(k/ɛ 2 log n log(n/δ)) [Cormode, Muthukrishnan, Yi, SODA 08] F 2 : O((k 2 /ɛ 2 + k 3/2 /ɛ 4 ) log n log(n/δ) [same paper] Entropy of the stream Shannon entropy and related entropies [Arackaparambil, Brody, Chakrabarti, ICALP 09] 8-3

18 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] 9-

19 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] One-shot communication model O(k/ɛ) - easy 9-2

20 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] One-shot communication model O(k/ɛ) - easy Distributed streaming model Heuristics [e.g. Babcock, Olston, SIGMOD 03] 9-3

21 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] 0-

22 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] One-shot communication model O(k/ɛ) - easy 0-2

23 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] One-shot communication model O(k/ɛ) - easy Distributed streaming model O(k/ɛ 2 log n) [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] 0-3

24 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) -

25 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) Tracking one quantile (in particular: Median) Complexity: Θ(k/ɛ log n) -2

26 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) Tracking one quantile (in particular: Median) Complexity: Θ(k/ɛ log n) Tracking all quantiles Upper bound: O(k/ɛ log 2 (/ɛ) log n) -3

27 2- Technical details

28 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted 3-

29 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received 3-2

30 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received Track the count of each element with error at most ɛm 3-3

31 Tracking Heavy Hitters Total messages in one round: O(k) # rounds: O(log +ɛ n) = O(log n/ɛ) Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received Track the count of each element with error at most ɛm 3-4

32 Tracking Heavy Hitters: lower bound Construct an input where the set of heavy hitters undergoes Ω(log n/ɛ) updates In each update, one item changes from nonheavy to heavy Argue that Ω(k) sites need to be contacted in order to correctly identify this item at the right time 4-

33 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round each leaf contains ɛm/8 ɛm/2 elements, maintain an approximate count pivot element 5-

34 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round After Θ(ɛm) items, recalculate the median from the old median by using the dynamic set of intervals. # messages O(k) per recalculation, O(/ɛ) recalculations in each round 5-2

35 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ Total messages in one round: O(k/ɛ) # rounds: O(log n) 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round After Θ(ɛm) items, recalculate the median from the old median by using the dynamic set of intervals. # messages O(k) per recalculation, O(/ɛ) recalculations in each round 5-3

36 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements structure can be used to extract the rank of any x with absolute error < ɛm quantile with error < ɛ 6-

37 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round 6-2

38 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. 6-3

39 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. Each site tracks each interval, sends a message for every Θ(ɛm/(k log(/ɛ))) new elements arrive. Total: O(k/ɛ log 2 (/ɛ)) 6-4

40 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. Each site tracks each interval, sends a message for every Θ(ɛm/(k log(/ɛ))) new elements arrive. Total: O(k/ɛ log 2 (/ɛ)) Maintain the balance of the tree. Total: O(k/ɛ log(/ɛ)) 6-5

41 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles 7-

42 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) 7-2

43 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) How about deletions? One site, arbitrary function in Z d, competitive analysis. (Yi and Zhang SODA 2009) 7-3

44 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) How about deletions? One site, arbitrary function in Z d, competitive analysis. (Yi and Zhang SODA 2009) Other tracking problems

45 The End T HAN K YOU Q and A 8-

Optimal Sampling from Distributed Streams

Optimal Sampling from Distributed Streams Optimal Sampling from Distributed Streams Graham Cormode AT&T Labs-Research Joint work with S. Muthukrishnan (Rutgers) Ke Yi (HKUST) Qin Zhang (HKUST) 1-1 Reservoir sampling [Waterman??; Vitter 85] Maintain

More information

arxiv: v1 [cs.ds] 1 Dec 2008

arxiv: v1 [cs.ds] 1 Dec 2008 Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi Qin Zhang arxiv:0812.0209v1 [cs.ds] 1 Dec 2008 Department of Computer Science and Engineering Hong Kong University of Science and Technology

More information

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1 B561 Advanced Database Concepts 6. Streaming Algorithms Qin Zhang 1-1 The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store everything.

More information

Misra-Gries Summaries

Misra-Gries Summaries Title: Misra-Gries Summaries Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Coventry, UK Keywords: streaming algorithms; frequent items; approximate counting

More information

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1 B561 Advanced Database Concepts 2.2. Streaming Model Qin Zhang 1-1 Data Streams Continuous streams of data elements (massive possibly unbounded, rapid, time-varying) Some examples: 1. network monitoring

More information

Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Presented by Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Irina Rozenbaum rozenbau@paul.rutgers.edu

More information

Dynamic Range Majority Data Structures

Dynamic Range Majority Data Structures Dynamic Range Majority Data Structures Amr Elmasry 1, Meng He 2, J. Ian Munro 3, and Patrick K. Nicholson 3 1 Department of Computer Science, University of Copenhagen, Denmark 2 Faculty of Computer Science,

More information

Efficient Approximation of Correlated Sums on Data Streams

Efficient Approximation of Correlated Sums on Data Streams Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University

More information

Distinct random sampling from a distributed stream

Distinct random sampling from a distributed stream Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2013 Distinct random sampling from a distributed stream Yung-Yu Chung Iowa State University Follow this and additional

More information

One-Pass Streaming Algorithms

One-Pass Streaming Algorithms One-Pass Streaming Algorithms Theory and Practice Complaints and Grievances about theory in practice Disclaimer Experiences with Gigascope. A practitioner s perspective. Will be using my own implementations,

More information

Course : Data mining

Course : Data mining Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter

More information

Simply Top Talkers Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin

Simply Top Talkers Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin IBM Research - Zurich Simply Top Talkers Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin 2009 IBM Corporation Motivation and Outline Need to understand and correctly handle dominant aspects within the

More information

Robust Lower Bounds for Communication and Stream Computation

Robust Lower Bounds for Communication and Stream Computation Robust Lower Bounds for Communication and Stream Computation! Amit Chakrabarti! Dartmouth College! Graham Cormode! AT&T Labs! Andrew McGregor! UC San Diego Communication Complexity Communication Complexity

More information

Dynamic Indexability and the Optimality of B-trees and Hash Tables

Dynamic Indexability and the Optimality of B-trees and Hash Tables Dynamic Indexability and the Optimality of B-trees and Hash Tables Ke Yi Hong Kong University of Science and Technology Dynamic Indexability and Lower Bounds for Dynamic One-Dimensional Range Query Indexes,

More information

Multi-Dimensional Online Tracking

Multi-Dimensional Online Tracking Multi-Dimensional Online Tracking Ke Yi Qin Zhang Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong, China {yike,qinzhang}@cse.ust.hk Abstract We propose

More information

UnivMon: Software-defined Monitoring with Universal Sketch

UnivMon: Software-defined Monitoring with Universal Sketch UnivMon: Software-defined Monitoring with Universal Sketch Zaoxing (Alan) Liu Joint work with Antonis Manousis (CMU), Greg Vorsanger(JHU), Vyas Sekar (CMU), and Vladimir Braverman(JHU) Network Management:

More information

Methods for Finding Frequent Items in Data Streams

Methods for Finding Frequent Items in Data Streams The VLDB Journal manuscript No. (will be inserted by the editor) Methods for Finding Frequent Items in Data Streams Graham Cormode Marios Hadjieleftheriou Received: date / Accepted: date Abstract The frequent

More information

Sublinear Models for Streaming and/or Distributed Data

Sublinear Models for Streaming and/or Distributed Data Sublinear Models for Streaming and/or Distributed Data Qin Zhang Guest lecture in B649 Feb. 3, 2015 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index

More information

Merging Frequent Summaries

Merging Frequent Summaries Merging Frequent Summaries M. Cafaro, M. Pulimeno University of Salento, Italy {massimo.cafaro, marco.pulimeno}@unisalento.it Abstract. Recently, an algorithm for merging counter-based data summaries which

More information

Efficient Strategies for Continuous Distributed Tracking Tasks

Efficient Strategies for Continuous Distributed Tracking Tasks Efficient Strategies for Continuous Distributed Tracking Tasks Graham Cormode Bell Labs, Lucent Technologies cormode@bell-labs.com Minos Garofalakis Bell Labs, Lucent Technologies minos@bell-labs.com Abstract

More information

On the Cell Probe Complexity of Dynamic Membership or

On the Cell Probe Complexity of Dynamic Membership or On the Cell Probe Complexity of Dynamic Membership or Can We Batch Up Updates in External Memory? Ke Yi and Qin Zhang Hong Kong University of Science & Technology SODA 2010 Jan. 17, 2010 1-1 The power

More information

Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003

Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003 Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003 Graham Cormode graham@dimacs.rutgers.edu dimacs.rutgers.edu/~graham S. Muthukrishnan muthu@cs.rutgers.edu Everyday

More information

Finding Frequent Items in Data Streams

Finding Frequent Items in Data Streams Finding Frequent Items in Data Streams Graham Cormode Marios Hadjieleftheriou AT&T Labs Research, Florham Park, NJ {graham,marioh}@research.att.com ABSTRACT The frequent items problem is to process a stream

More information

Sketching Asynchronous Streams Over a Sliding Window

Sketching Asynchronous Streams Over a Sliding Window Sketching Asynchronous Streams Over a Sliding Window Srikanta Tirthapura (Iowa State University) Bojian Xu (Iowa State University) Costas Busch (Rensselaer Polytechnic Institute) 1/32 Data Stream Processing

More information

University of Waterloo Waterloo, Ontario, Canada N2L 3G1 Cambridge, MA, USA Introduction. Abstract

University of Waterloo Waterloo, Ontario, Canada N2L 3G1 Cambridge, MA, USA Introduction. Abstract Finding Frequent Items in Sliding Windows with Multinomially-Distributed Item Frequencies (University of Waterloo Technical Report CS-2004-06, January 2004) Lukasz Golab, David DeHaan, Alejandro López-Ortiz

More information

Streaming Algorithms for Matching Size in Sparse Graphs

Streaming Algorithms for Matching Size in Sparse Graphs Streaming Algorithms for Matching Size in Sparse Graphs Graham Cormode g.cormode@warwick.ac.uk Joint work with S. Muthukrishnan (Rutgers), Morteza Monemizadeh (Rutgers Amazon) Hossein Jowhari (Warwick

More information

CAM Conscious Integrated Answering of Frequent Elements and Top-k Queries over Data Streams

CAM Conscious Integrated Answering of Frequent Elements and Top-k Queries over Data Streams CAM Conscious Integrated Answering of Frequent Elements and Top-k Queries over Data Streams ABSTRACT Sudipto Das Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California,

More information

Algorithms for data streams

Algorithms for data streams Intro Puzzles Sampling Sketches Graphs Lower bounds Summing up Thanks! Algorithms for data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ May 9, 2012 1 / 99 Irene

More information

An improved data stream summary: the count-min sketch and its applications

An improved data stream summary: the count-min sketch and its applications Journal of Algorithms 55 (2005) 58 75 www.elsevier.com/locate/jalgor An improved data stream summary: the count-min sketch and its applications Graham Cormode a,,1, S. Muthukrishnan b,2 a Center for Discrete

More information

An Efficient Transformation for Klee s Measure Problem in the Streaming Model

An Efficient Transformation for Klee s Measure Problem in the Streaming Model An Efficient Transformation for Klee s Measure Problem in the Streaming Model Gokarna Sharma, Costas Busch, Ramachandran Vaidyanathan, Suresh Rai, and Jerry Trahan Louisiana State University Outline of

More information

Sketching Probabilistic Data Streams

Sketching Probabilistic Data Streams Sketching Probabilistic Data Streams Graham Cormode AT&T Labs - Research graham@research.att.com Minos Garofalakis Yahoo! Research minos@acm.org Challenge of Uncertain Data Many applications generate data

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Computing Frequent Elements Using Gossip

Computing Frequent Elements Using Gossip Computing Frequent Elements Using Gossip Bibudh Lahiri and Srikanta Tirthapura Iowa State University, Ames, IA, 50011, USA {bibudh,snt}@iastate.edu Abstract. We present algorithms for identifying frequently

More information

Duplicate Detection in Click Streams

Duplicate Detection in Click Streams Duplicate Detection in Click Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara {metwally, agrawal, amr}@cs.ucsb.edu Abstract

More information

Tracking Most Frequent Items Dynamically

Tracking Most Frequent Items Dynamically What s Hot and What s Not: Tracking Most Frequent Items Dynamically GRAHAM CORMODE DIMACS Center, Rutgers University and S. MUTHUKRISHNAN Department of Computer Science, Rutgers University Most database

More information

Efficiently Summarizing Data Streams over Sliding Windows

Efficiently Summarizing Data Streams over Sliding Windows Efficiently Summarizing Data Streams over Sliding Windows Nicoló Rivetti, Yann Busnel, Achour Mostefaoui To cite this version: Nicoló Rivetti, Yann Busnel, Achour Mostefaoui. Efficiently Summarizing Data

More information

Heavy-Hitter Detection Entirely in the Data Plane

Heavy-Hitter Detection Entirely in the Data Plane Heavy-Hitter Detection Entirely in the Data Plane VIBHAALAKSHMI SIVARAMAN SRINIVAS NARAYANA, ORI ROTTENSTREICH, MUTHU MUTHUKRSISHNAN, JENNIFER REXFORD 1 Heavy Hitter Flows Flows above a certain threshold

More information

Finding the Frequent Items in Streams of Data By Graham Cormode and Marios Hadjieleftheriou

Finding the Frequent Items in Streams of Data By Graham Cormode and Marios Hadjieleftheriou Finding the Frequent Items in Streams of Data By Graham Cormode and Marios Hadjieleftheriou doi:1.1145/1562764.1562789 Abstract The frequent items problem is to process a stream otems and find all those

More information

Streaming algorithms for embedding and computing edit distance

Streaming algorithms for embedding and computing edit distance Streaming algorithms for embedding and computing edit distance in the low distance regime Diptarka Chakraborty joint work with Elazar Goldenberg and Michal Koucký 7 November, 2015 Outline 1 Introduction

More information

Sketching and streaming algorithms for processing massive data

Sketching and streaming algorithms for processing massive data Sketching and streaming algorithms for processing massive data The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Nelson,

More information

Efficiently Summarizing Distributed Data Streams over Sliding Windows

Efficiently Summarizing Distributed Data Streams over Sliding Windows Efficiently Summarizing Distributed Data Streams over Sliding Windows Nicolò Rivetti, Yann Busnel, Achour Mostefaoui To cite this version: Nicolò Rivetti, Yann Busnel, Achour Mostefaoui. Efficiently Summarizing

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Efficient Quantile Retrieval on Multi-Dimensional Data

Efficient Quantile Retrieval on Multi-Dimensional Data Efficient Quantile Retrieval on Multi-Dimensional Data Man Lung Yiu 1, Nikos Mamoulis 1, and Yufei Tao 2 1 Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, {mlyiu2,nikos}@cs.hku.hk

More information

Mergeable Summaries. Pankaj K. Agarwal. Zengfeng Huang. Graham Cormode. Ke Yi. Jeff M. Phillips. Zhewei Wei ABSTRACT

Mergeable Summaries. Pankaj K. Agarwal. Zengfeng Huang. Graham Cormode. Ke Yi. Jeff M. Phillips. Zhewei Wei ABSTRACT Mergeable Summaries Pankaj K. Agarwal Duke University Jeff M. Phillips University of Utah Graham Cormode AT&T Labs Research Zhewei Wei HKUST Zengfeng Huang HKUST Ke Yi HKUST ABSTRACT We study the mergeability

More information

Querying Distributed Data Streams (Invited Keynote Talk)

Querying Distributed Data Streams (Invited Keynote Talk) Querying Distributed Data Streams (Invited Keynote Talk) Minos Garofalakis School of Electronic and Computer Engineering Technical University of Crete minos@softnet.tuc.gr Abstract. Effective Big Data

More information

Sublinear Algorithms for Big Data Analysis

Sublinear Algorithms for Big Data Analysis Sublinear Algorithms for Big Data Analysis Michael Kapralov Theory of Computation Lab 4 EPFL 7 September 2017 The age of big data: massive amounts of data collected in various areas of science and technology

More information

Set-Difference Range Queries

Set-Difference Range Queries CCCG 2013, Waterloo, Ontario, August 8 10, 2013 Set-Difference Range Queries David Eppstein Michael T. Goodrich Joseph A. Simons Abstract We introduce the problem of performing set-difference range queries,

More information

Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers

Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers Timothy M. Chan 1, J. Ian Munro 1, and Venkatesh Raman 2 1 Cheriton School of Computer Science, University of Waterloo, Waterloo,

More information

Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling

Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling Graham Cormode Bell Laboratories cormode@lucent.com S. Muthukrishnan Rutgers University muthu@cs.rutgers.edu Irina

More information

Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling

Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling Graham Cormode Bell Laboratories, Murray Hill, NJ cormode@lucent.com S. Muthukrishnan Rutgers University, Piscataway,

More information

Querying Distributed Data Streams

Querying Distributed Data Streams Querying Distributed Data Streams Minos Garofalakis Technical University of Crete Software Technology and Network Applications Lab LIFT Cast: Antonios Deligiannakis, Vasilis Samoladas, Odysseas Papapetrou,

More information

No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams

No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams Jin ~i', David ~aier', Kristin Tuftel, Vassilis papadimosl, Peter A. Tucke? 1 Portland State University 2~hitworth

More information

Lecture 5: Data Streaming Algorithms

Lecture 5: Data Streaming Algorithms Great Ideas in Theoretical Computer Science Summer 2013 Lecture 5: Data Streaming Algorithms Lecturer: Kurt Mehlhorn & He Sun In the data stream scenario, the input arrive rapidly in an arbitrary order,

More information

CS168: The Modern Algorithmic Toolbox Lecture #2: Approximate Heavy Hitters and the Count-Min Sketch

CS168: The Modern Algorithmic Toolbox Lecture #2: Approximate Heavy Hitters and the Count-Min Sketch CS168: The Modern Algorithmic Toolbox Lecture #2: Approximate Heavy Hitters and the Count-Min Sketch Tim Roughgarden & Gregory Valiant March 30, 2016 1 The Heavy Hitters Problem 1.1 Finding the Majority

More information

Optimisation While Streaming

Optimisation While Streaming Optimisation While Streaming Amit Chakrabarti Dartmouth College Joint work with S. Kale, A. Wirth DIMACS Workshop on Big Data Through the Lens of Sublinear Algorithms, Aug 2015 Combinatorial Optimisation

More information

Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors

Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors Naga K. Govindaraju Nikunj Raghuvanshi Dinesh Manocha University of North Carolina at Chapel Hill {naga,nikunj,dm}@cs.unc.edu

More information

arxiv: v2 [cs.ds] 30 Sep 2016

arxiv: v2 [cs.ds] 30 Sep 2016 Synergistic Sorting, MultiSelection and Deferred Data Structures on MultiSets Jérémy Barbay 1, Carlos Ochoa 1, and Srinivasa Rao Satti 2 1 Departamento de Ciencias de la Computación, Universidad de Chile,

More information

Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles

Mining data streams. Irene Finocchi.   finocchi/ Intro Puzzles Intro Puzzles Mining data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ 1 / 33 Irene Finocchi Mining data streams Intro Puzzles Stream sources Data stream model Storing

More information

Finding Frequent Items in Time Decayed Data Streams

Finding Frequent Items in Time Decayed Data Streams Finding Frequent Items in Time Decayed Data Streams Shanshan Wu 1, Huaizhong Lin 1(B), Leong Hou U 2, Yunjun Gao 1, and Dongming Lu 1 1 College of Computer Science and Technology, Zhejiang University,

More information

Data Streaming Algorithms for Geometric Problems

Data Streaming Algorithms for Geometric Problems Data Streaming Algorithms for Geometric roblems R.Sharathkumar Duke University 1 Introduction A data stream is an ordered sequence of points that can be read only once or a small number of times. Formally,

More information

Annotations For Sparse Data Streams

Annotations For Sparse Data Streams Annotations For Sparse Data Streams Justin Thaler, Simons Institute, UC Berkeley Joint Work with: Amit Chakrabarti, Dartmouth Graham Cormode, University of Warwick Navin Goyal, Microsoft Research India

More information

Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors

Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors Naga K. Govindaraju Nikunj Raghuvanshi Dinesh Manocha University of North Carolina at Chapel Hill {naga,nikunj,dm}@cs.unc.edu

More information

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania Mining for Patterns and Anomalies in Data Streams Sampath Kannan University of Pennsylvania The Problem Data sizes too large to fit in primary memory Devices with small memory Access times to secondary

More information

Parameterized Approximation Schemes Using Graph Widths. Michael Lampis Research Institute for Mathematical Sciences Kyoto University

Parameterized Approximation Schemes Using Graph Widths. Michael Lampis Research Institute for Mathematical Sciences Kyoto University Parameterized Approximation Schemes Using Graph Widths Michael Lampis Research Institute for Mathematical Sciences Kyoto University May 13, 2014 Overview Topic of this talk: Randomized Parameterized Approximation

More information

Approximate Continuous Querying over Distributed Streams

Approximate Continuous Querying over Distributed Streams Approximate Continuous Querying over Distributed Streams GRAHAM CORMODE AT&T Labs Research and MINOS GAROFALAKIS Yahoo! Research & University of California, Berkeley While traditional database systems

More information

What s Hot and What s Not: Tracking Most Frequent Items Dynamically

What s Hot and What s Not: Tracking Most Frequent Items Dynamically What s Hot and What s Not: Tracking Most Frequent Items Dynamically Graham Cormode Center for Discrete Mathematics and Computer Science (DIMACS) Rutgers University, Piscataway NJ graham@dimacs.rutgers.edu

More information

Sensors. Outline. Sensor networks. Tributaries and Deltas: Efficient and Robust Aggregation in Sensor Network Streams

Sensors. Outline. Sensor networks. Tributaries and Deltas: Efficient and Robust Aggregation in Sensor Network Streams ributaries and s: Efficient and Robust Aggregation in Sensor Network Streams Acknowledgements for slides: Amit Manjhi, SIGMOD 5 Amit Manjhi, Suman Nath, Phillip B. Gibbons Carnegie Mellon University Intel

More information

Data Streams Algorithms

Data Streams Algorithms Data Streams Algorithms Phillip B. Gibbons Intel Research Pittsburgh Guest Lecture in 15-853 Algorithms in the Real World Phil Gibbons, 15-853, 12/4/06 # 1 Outline Data Streams in the Real World Formal

More information

An Efficient Transformation for Klee s Measure Problem in the Streaming Model Abstract Given a stream of rectangles over a discrete space, we consider the problem of computing the total number of distinct

More information

arxiv: v1 [cs.db] 30 Jun 2012

arxiv: v1 [cs.db] 30 Jun 2012 Sketch-based Querying of Distributed Sliding-Window Data Streams Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis Technical University of Crete {papapetrou,minos,adeli}@softnet.tuc.gr arxiv:1207.0139v1

More information

Lecture Online Algorithms and the k-server problem June 14, 2011

Lecture Online Algorithms and the k-server problem June 14, 2011 Approximation Algorithms Workshop June 13-17, 2011, Princeton Lecture Online Algorithms and the k-server problem June 14, 2011 Joseph (Seffi) Naor Scribe: Mohammad Moharrami 1 Overview In this lecture,

More information

Parameterized Approximation Schemes Using Graph Widths. Michael Lampis Research Institute for Mathematical Sciences Kyoto University

Parameterized Approximation Schemes Using Graph Widths. Michael Lampis Research Institute for Mathematical Sciences Kyoto University Parameterized Approximation Schemes Using Graph Widths Michael Lampis Research Institute for Mathematical Sciences Kyoto University July 11th, 2014 Visit Kyoto for ICALP 15! Parameterized Approximation

More information

arxiv: v2 [cs.ds] 22 May 2017

arxiv: v2 [cs.ds] 22 May 2017 A High-Performance Algorithm for Identifying Frequent Items in Data Streams Daniel Anderson Pryce Bevin Kevin Lang Edo Liberty Lee Rhodes Justin Thaler arxiv:1705.07001v2 [cs.ds] 22 May 2017 Abstract Estimating

More information

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints Gonzalo Navarro University of Chile, Chile gnavarro@dcc.uchile.cl Sharma V. Thankachan Georgia Institute of Technology, USA sthankachan@gatech.edu

More information

Buffered Count-Min Sketch on SSD: Theory and Experiments

Buffered Count-Min Sketch on SSD: Theory and Experiments Buffered Count-Min Sketch on SSD: Theory and Experiments Mayank Goswami, Dzejla Medjedovic, Emina Mekic, and Prashant Pandey ESA 2018, Helsinki, Finland The heavy hitters problem (HH(k)) Given stream of

More information

Kernelization via Sampling with Applications to Finding Matchings and Related Problems in Dynamic Graph Streams

Kernelization via Sampling with Applications to Finding Matchings and Related Problems in Dynamic Graph Streams Kernelization via Sampling with Applications to Finding Matchings and Related Problems in Dynamic Graph Streams Sofya Vorotnikova University of Massachusetts Amherst Joint work with Rajesh Chitnis, Graham

More information

Topics in Data Stream Sampling and Insider Threat Detection

Topics in Data Stream Sampling and Insider Threat Detection Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2016 Topics in Data Stream Sampling and Insider Threat Detection Yung-Yu Chung Iowa State University Follow this

More information

A Simple Augmentation Method for Matchings with Applications to Streaming Algorithms

A Simple Augmentation Method for Matchings with Applications to Streaming Algorithms A Simple Augmentation Method for Matchings with Applications to Streaming Algorithms MFCS 2018 Christian Konrad 27.08.2018 Streaming Algorithms sequential access random access Streaming (1996 -) Objective:

More information

Deterministic Communication

Deterministic Communication University of California, Los Angeles CS 289A Communication Complexity Instructor: Alexander Sherstov Scribe: Rebecca Hicks Date: January 9, 2012 LECTURE 1 Deterministic Communication This lecture is a

More information

Approximating the Caro-Wei Bound for Independent Sets in Graph Streams

Approximating the Caro-Wei Bound for Independent Sets in Graph Streams Approximating the Caro-Wei Bound for Independent Sets in Graph Streams Graham Cormode, Jacques Dark, and Christian Konrad Department of Computer Science, Centre for Discrete Mathematics and its Applications

More information

Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005)

Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005) Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005) Foto N. Afrati Computer Science Division NTUA, Athens, Greece afrati@softlab.ece.ntua.gr

More information

Improved Algorithm for Weighted Flow Time

Improved Algorithm for Weighted Flow Time Joint work with Yossi Azar Denition Previous Work and Our Results Overview Single machine receives jobs over time. Each job has an arrival time, a weight and processing time (volume). Allow preemption.

More information

Stream Sequential Pattern Mining with Precise Error Bounds

Stream Sequential Pattern Mining with Precise Error Bounds Stream Sequential Pattern Mining with Precise Error Bounds Luiz F. Mendes,2 Bolin Ding Jiawei Han University of Illinois at Urbana-Champaign 2 Google Inc. lmendes@google.com {bding3, hanj}@uiuc.edu Abstract

More information

Frequent Itemsets in Streams

Frequent Itemsets in Streams Frequent Itemsets in Streams Eiríkur Fannar Torfason teirikur@student.ethz.ch 31.5.2013 1 Introduction This report is written for a seminar at ETH Zürich, titled Algorithms for Database Systems. This year

More information

Linear-Space Data Structures for Range Minority Query in Arrays

Linear-Space Data Structures for Range Minority Query in Arrays Linear-Space Data Structures for Range Minority Query in Arrays Timothy M. Chan 1, Stephane Durocher 2, Matthew Skala 2, and Bryan T. Wilkinson 1 1 Cheriton School of Computer Science, University of Waterloo,

More information

Question 7.11 Show how heapsort processes the input:

Question 7.11 Show how heapsort processes the input: Question 7.11 Show how heapsort processes the input: 142, 543, 123, 65, 453, 879, 572, 434, 111, 242, 811, 102. Solution. Step 1 Build the heap. 1.1 Place all the data into a complete binary tree in the

More information

Checking and Spot-Checking the Correctness of Priority Queues

Checking and Spot-Checking the Correctness of Priority Queues Checking and Spot-Checking the Correctness of Priority Queues Matthew Chu 1, Sampath Kannan 1, and Andrew McGregor 2 1 Dept. of Computer and Information Science, University of Pennsylvania {mdchu,kannan}@cis.upenn.edu

More information

Mining top-k frequent itemsets from data streams

Mining top-k frequent itemsets from data streams Data Min Knowl Disc (2006) 13:193 217 DOI 10.1007/s10618-006-0042-x Mining top-k frequent itemsets from data streams Raymond Chi-Wing Wong Ada Wai-Chee Fu Received: 29 April 2005 / Accepted: 1 February

More information

Processing Data-Stream Join Aggregates Using Skimmed Sketches

Processing Data-Stream Join Aggregates Using Skimmed Sketches Processing Data-Stream Join Aggregates Using Skimmed Sketches Sumit Ganguly, Minos Garofalakis, and Rajeev Rastogi Bell Laboratories, Lucent Technologies, Murray Hill NJ, USA. sganguly,minos,rastogi @research.bell-labs.com

More information

Department of Computer Science

Department of Computer Science Yale University Department of Computer Science Pass-Efficient Algorithms for Facility Location Kevin L. Chang Department of Computer Science Yale University kchang@cs.yale.edu YALEU/DCS/TR-1337 Supported

More information

Online Algorithms. Lecture 11

Online Algorithms. Lecture 11 Online Algorithms Lecture 11 Today DC on trees DC on arbitrary metrics DC on circle Scheduling K-server on trees Theorem The DC Algorithm is k-competitive for the k server problem on arbitrary tree metrics.

More information

Implementing Hierarchical Heavy Hitters in RapidMiner: Solutions and Open Questions

Implementing Hierarchical Heavy Hitters in RapidMiner: Solutions and Open Questions Implementing Hierarchical Heavy Hitters in RapidMiner: Solutions and Open Questions Marco Stolpe and Peter Fricke firstname.surname@tu-dortmund.de Technical University of Dortmund, Artificial Intelligence

More information

Approximate Quantile Computation over Sensor Networks

Approximate Quantile Computation over Sensor Networks Approximate Quantile Computation over Sensor Networks Xifeng Yan, Jiong Yang, Wei Wang, Jiawei Han University of Illinois at Urbana-Champaign, {xyan, jioyang, hanj}@cs.uiuc.edu University of North Carolina

More information

Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window Downloaded 02/06/18 to 37.44.198.214. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

More information

Mining Approximate Frequent Closed Flows over Packet Streams

Mining Approximate Frequent Closed Flows over Packet Streams Mining Approximate Frequent Closed Flows over Packet Streams Imen Brahmi 1, Sadok Ben Yahia 1, and Pascal Poncelet 2 1 Faculty of Sciences of Tunis, Tunisia imen.brahmi@gmail.com, sadok.benyahia@fst.rnu.tn

More information

Cardinality Estimation: An Experimental Survey

Cardinality Estimation: An Experimental Survey : An Experimental Survey and Felix Naumann VLDB 2018 Estimation and Approximation Session Rio de Janeiro-Brazil 29 th August 2018 Information System Group Hasso Plattner Institut University of Potsdam

More information

Distributed Data Streaming Algorithms for Network Anomaly Detection

Distributed Data Streaming Algorithms for Network Anomaly Detection Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2017 Distributed Data Streaming Algorithms for Network Anomaly Detection Wenji Chen Iowa State University Follow

More information

Join Sizes, Frequency Moments, and Applications

Join Sizes, Frequency Moments, and Applications Join Sizes, Frequency Moments, and Applications Graham Cormode 1 and Minos Garofalakis 2 1 University of Warwick, Department of Computer Science, Coventry CV4 7AL, UK. Email: G.Cormode@warwick.ac.uk 2

More information

Online Mining Changes of Items over Continuous Append-only and Dynamic Data Streams

Online Mining Changes of Items over Continuous Append-only and Dynamic Data Streams Journal of Universal Computer Science, vol., no. 8 (2005), 4-425 submitted: 0/3/05, accepted: 5/5/05, appeared: 28/8/05 J.UCS Online Mining Changes of Items over Continuous Append-only and Dynamic Data

More information

Latest on Linear Sketches for Large Graphs: Lots of Problems, Little Space, and Loads of Handwaving. Andrew McGregor University of Massachusetts

Latest on Linear Sketches for Large Graphs: Lots of Problems, Little Space, and Loads of Handwaving. Andrew McGregor University of Massachusetts Latest on Linear Sketches for Large Graphs: Lots of Problems, Little Space, and Loads of Handwaving Andrew McGregor University of Massachusetts Latest on Linear Sketches for Large Graphs: Lots of Problems,

More information