Optimal Tracking of Distributed Heavy Hitters and Quantiles

Size: px

Start display at page:

Download "Optimal Tracking of Distributed Heavy Hitters and Quantiles"

Rafe Eaton
5 years ago
Views:

1 Optimal Tracking of Distributed Heavy Hitters and Quantiles Qin Zhang Hong Kong University of Science & Technology Joint work with Ke Yi PODS 2009 June 30,

2 2- Models and Problems

3 Distributed streaming model [Babcock, Olston SIGMOD 2003] Alice observes A(t) by time t 3 2 Carole tries to track f(a(t) B(t)) for all t (an ɛ-approximation is usually allowed) Bob observes B(t) by time t A(t), B(t) : multisets 3-

4 Distributed streaming model [Babcock, Olston SIGMOD 2003] Alice observes A(t) by time t 3 2 Carole tries to track f(a(t) B(t)) for all t (an ɛ-approximation is usually allowed) Bob observes B(t) by time t Find an (approximate) tracking protocol while minimizing communication A(t), B(t) : multisets 3-2

5 Distributed streaming model [Babcock, Olston SIGMOD 2003] Extend to k sites A (t) 3 2 Coordinator f(a (t) A 2 (t)... A k (t)) Site Site A 2 (t) Site Find a protocol to minimize communication. 3-3

6 Combination of two models X Y (One-shot) communication model f(x Y ); minimize communication 4-

7 Combination of two models X Y (One-shot) communication model f(x Y ); minimize communication 4-2

8 Combination of two models X Y A(t) Streaming model f(a(t)); minimize space (One-shot) communication model f(x Y ); minimize communication 4-3

Combination of two models X Y 2 3 2 2 3 2 + A(t) 2 3 4 2 Streaming model f(a(t)); minimize space (One-shot) communication model

9 Combination of two models X Y A(t) Streaming model f(a(t)); minimize space (One-shot) communication model f(x Y ); minimize communication Distributed streaming model a.k.a. Continuous commun. model Only interested in communication cost 4-4

10 Applied motivation: distributed monitoring C Large-scale distributed holistic querying/monitoring. Picture from the tutorial Streaming in a connected world: Querying and tracking distributed data streams at VLDB 06 and SIGMOD 07 5-

11 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network 6-

12 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network Single Quantile: Median, /4-quantile. Application: The median size of packets over a network. 6-2

13 Three different fs Heavy hitters: For a multiset A, item i is a heavy hitter if (count of i) > φ A non-heavy hitter if (count of i) < (φ ɛ) A Application: Find out IP addresses that appear % over a network Single Quantile: Median, /4-quantile. Application: The median size of packets over a network. All Quantiles: A is a set of distinct elements from a totally ordered universe. Quantile of x = y < x, y A / A, usually allows a ±ɛ error. Need a data structure to extract ɛ-approximate quantile for any x. 6-3

14 7- Previous and our results

15 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error 8-

16 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error Distinct count F 0, Second frequency moment F 2 F 0 : O(k/ɛ 2 log n log(n/δ)) [Cormode, Muthukrishnan, Yi, SODA 08] F 2 : O((k 2 /ɛ 2 + k 3/2 /ɛ 4 ) log n log(n/δ) [same paper] 8-2

17 Previous results in distributed streaming model Frequent moments (F p = i (count of i)p ) Total count F Simple: each site sends an update every time its local count has increased by a ( + ɛ) factor Complexity O(k/ɛ log n) n: total # items received at all sites, ɛ: relative error Distinct count F 0, Second frequency moment F 2 F 0 : O(k/ɛ 2 log n log(n/δ)) [Cormode, Muthukrishnan, Yi, SODA 08] F 2 : O((k 2 /ɛ 2 + k 3/2 /ɛ 4 ) log n log(n/δ) [same paper] Entropy of the stream Shannon entropy and related entropies [Arackaparambil, Brody, Chakrabarti, ICALP 09] 8-3

18 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] 9-

19 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] One-shot communication model O(k/ɛ) - easy 9-2

20 Previous results: Heavy Hitters Streaming model (space complexity) O(/ɛ), [Karp, Shenker, Papadimitriou TODS 03], [Demaine, Munro, Lopez-Ortiz, ESA 02], [Metwally, Agrawal, Abbadi TODS 06], [Misra, Gries 982] One-shot communication model O(k/ɛ) - easy Distributed streaming model Heuristics [e.g. Babcock, Olston, SIGMOD 03] 9-3

21 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] 0-

22 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] One-shot communication model O(k/ɛ) - easy 0-2

23 Previous results: All Quantiles Streaming model (space complexity) O(/ɛ log(ɛn)) [Greenwald, Khanna, SIGMOD 0] O(/ɛ log(/δ)), with failure probability δ [Cormode, Muthukrishnan, J. Alg 05] One-shot communication model O(k/ɛ) - easy Distributed streaming model O(k/ɛ 2 log n) [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] 0-3

24 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) -

25 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) Tracking one quantile (in particular: Median) Complexity: Θ(k/ɛ log n) -2

26 Our results Tracking heavy hitters Complexity: Θ(k/ɛ log n) Tracking one quantile (in particular: Median) Complexity: Θ(k/ɛ log n) Tracking all quantiles Upper bound: O(k/ɛ log 2 (/ɛ) log n) -3

27 2- Technical details

28 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted 3-

29 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received 3-2

30 Tracking Heavy Hitters Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received Track the count of each element with error at most ɛm 3-3

31 Tracking Heavy Hitters Total messages in one round: O(k) # rounds: O(log +ɛ n) = O(log n/ɛ) Divide into multiple rounds: Every time m increase by a factor of + ɛ. m: total # elements inserted In each round Compute total count m: O(k) messages Broadcast m: O(k) messages Each site: For each i, sends out a message every time count(i) increases by ɛm/(2k) Sends out a message every time total local count increases by ɛm/(2k) Coordinator: Terminate the round when O(k) messages have been received Track the count of each element with error at most ɛm 3-4

32 Tracking Heavy Hitters: lower bound Construct an input where the set of heavy hitters undergoes Ω(log n/ɛ) updates In each update, one item changes from nonheavy to heavy Argue that Ω(k) sites need to be contacted in order to correctly identify this item at the right time 4-

33 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round each leaf contains ɛm/8 ɛm/2 elements, maintain an approximate count pivot element 5-

34 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round After Θ(ɛm) items, recalculate the median from the old median by using the dynamic set of intervals. # messages O(k) per recalculation, O(/ɛ) recalculations in each round 5-2

35 Tracking the Median Divide tracking period into rounds: m (roughly) doubles in each round In each round Maintain a dynamic set of intervals each has ɛ Total messages in one round: O(k/ɛ) # rounds: O(log n) 8 m to ɛ m items: # messages O(k/ɛ) in each 2 round After Θ(ɛm) items, recalculate the median from the old median by using the dynamic set of intervals. # messages O(k) per recalculation, O(/ɛ) recalculations in each round 5-3

36 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements structure can be used to extract the rank of any x with absolute error < ɛm quantile with error < ɛ 6-

37 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round 6-2

38 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. 6-3

39 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. Each site tracks each interval, sends a message for every Θ(ɛm/(k log(/ɛ))) new elements arrive. Total: O(k/ɛ log 2 (/ɛ)) 6-4

40 Tracking all Quantiles approximate count with absolute error < ɛm/ log(/ɛ) A balanced tree approximate median: either half contains at least /4 of the elements each leaf contains Θ(ɛm) elements Divide tracking period into rounds: m (roughly) doubles in each round At the beginning of each round, initialize the structure with O(k/ɛ) communication, then broadcast. Each site tracks each interval, sends a message for every Θ(ɛm/(k log(/ɛ))) new elements arrive. Total: O(k/ɛ log 2 (/ɛ)) Maintain the balance of the tree. Total: O(k/ɛ log(/ɛ)) 6-5

41 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles 7-

42 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) 7-2

43 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) How about deletions? One site, arbitrary function in Z d, competitive analysis. (Yi and Zhang SODA 2009) 7-3

44 Remarks and open problems Focused on communication only, but the algorithms can be implemented with small space O(/ɛ) each site for heavy hitter O(/ɛ log(ɛn)) each site for quantiles Randomized algorithms? There is an upperbound O ( (k + /ɛ 2 ) poly log(n, k, /ɛ) ) How about deletions? One site, arbitrary function in Z d, competitive analysis. (Yi and Zhang SODA 2009) Other tracking problems

45 The End T HAN K YOU Q and A 8-

Optimal Sampling from Distributed Streams

Optimal Sampling from Distributed Streams Graham Cormode AT&T Labs-Research Joint work with S. Muthukrishnan (Rutgers) Ke Yi (HKUST) Qin Zhang (HKUST) 1-1 Reservoir sampling [Waterman??; Vitter 85] Maintain