Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics

Size: px

Start display at page:

Download "Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics"

Ophelia Morris
5 years ago
Views:

1 Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics Benjamin Heintz, Abhishek Chandra University of Minnesota Ramesh K. Sitaraman UMass Amherst & Akamai Technologies

2 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Heintz, Chandra, Sitaraman HPDC 5

Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy

3 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5

4 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption WindowedGrouped Aggregation Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5

5 Windowed Grouped Aggregation Example For each minute Window ß windowed For each content_provider_id ß grouped Sum bytes_transferred ß aggregation Widely used General aggregation sum, count average, standard deviation (approximate) sets, counts, Heintz, Chandra, Sitaraman HPDC 5 5

6 System Model Hub- and- Spoke Infrastructure Compute at edges and center Queries served from center Server Central Data Warehouse Server Server Server Heintz, Chandra, Sitaraman HPDC 5 6

7 Metrics WAN System- level measure of cost Major OPEX component Staleness User- level measure of quality Delay in getting final results Often an SLA component window begins t window ends t+w staleness results available t+w+s time Heintz, Chandra, Sitaraman HPDC 5 7

8 Our Goal Algorithms to jointly minimize staleness & traffic Algorithm Input: sequence of arrivals Output: sequence of updates : total number of updates Staleness: delay until last update reaches center Heintz, Chandra, Sitaraman HPDC 5 8

9 Naïve Algorithms Pure streaming immediately flushes upon arrival Excessive traffic Pure batching flushes only after end of window Excessive staleness Heintz, Chandra, Sitaraman HPDC 5 9

10 A Running Example Count words in the stream fast data is fast Network: aggregate / second Window length: 8 seconds Heintz, Chandra, Sitaraman HPDC 5

11 Pure Streaming : 6 8 Heintz, Chandra, Sitaraman HPDC 5

12 Pure Streaming : 6 8 Flush immediately Heintz, Chandra, Sitaraman HPDC 5

13 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

14 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

15 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 5

16 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6

17 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 7

18 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

19 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 9

Pure Streaming : 8 6 8 Staleness: : fast

20 Pure Streaming : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5

21 Pure Batching : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

22 Pure Batching : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5

23 Pure Batching : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5

24 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5

25 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5 5

26 Pure Batching : fast data is ` fast Heintz, Chandra, Sitaraman HPDC 5 6

27 Pure Batching : data ` fast data is Heintz, Chandra, Sitaraman HPDC 5 7

28 Pure Batching : 6 8 ` fast data is is Heintz, Chandra, Sitaraman HPDC 5 8

29 Pure Batching : is 6 8 Staleness: : ` fast data is Heintz, Chandra, Sitaraman HPDC 5 9

30 Pure Batching : is 6 8 Staleness: fast Naïve Algorithms: `Staleness vs. data : is Heintz, Chandra, Sitaraman HPDC 5

31 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5

32 Eager Offline Optimal Idea: flush immediately after last arrival Properties: Flushes exactly once per key (traffic optimal) Flushes at the earliest possible time (staleness optimal) Heintz, Chandra, Sitaraman HPDC 5

33 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

34 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

35 Eager Offline Optimal : fast 6 8 Flush immediately upon last arrival Heintz, Chandra, Sitaraman HPDC 5 5

36 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6

37 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 7

38 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

39 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 9

40 Eager Offline Optimal : fast ` fast data is Heintz, Chandra, Sitaraman HPDC 5

41 Eager Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5

42 Lazy Offline Optimal Idea: flush as late as possible after last arrival Properties Still one flush per key (traffic optimal) Staleness optimal by construction Heintz, Chandra, Sitaraman HPDC 5

43 Lazy Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

44 Lazy Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5

45 Lazy Offline Optimal : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5 5

46 Lazy Offline Optimal : 5 fast data is 6 8 Flush as late as ` possible after last arrival Heintz, Chandra, Sitaraman HPDC 5 6

47 Lazy Offline Optimal : fast data ` data is Heintz, Chandra, Sitaraman HPDC 5 7

48 Lazy Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

49 Lazy Offline Optimal : fast ` data is is Heintz, Chandra, Sitaraman HPDC 5 9

50 Lazy Offline Optimal : fast fast data is Heintz, Chandra, Sitaraman HPDC 5 5

51 Lazy Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5 5

52 Family of Optimal Algorithms Eager Lazy 6 8 Heintz, Chandra, Sitaraman HPDC 5 5

53 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 5

54 Caching Analogy Insight: Treat the edge as a cache For staleness When to flush? Dynamic sizing policy For traffic What to flush? Eviction policy fast data is Heintz, Chandra, Sitaraman HPDC 5 5

55 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Heintz, Chandra, Sitaraman HPDC 5 55

predict how late we can defer flushing Eviction Don t reinvent the

56 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Eviction Don t reinvent the wheel Leverage existing work (LRU, LFU, ) Heintz, Chandra, Sitaraman HPDC 5 56

57 Hybrid Online Algorithm Eager: better to overestimate size Many early evictions Risk: incorrect eviction decisions Lazy: better to underestimate size Risk: bandwidth misprediction Be pessimistic Hybrid: linear combination of eager, lazy cache sizes Heintz, Chandra, Sitaraman HPDC 5 57

58 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 58

59 Experiments: Data Set Month- long Akamai download analytics beacon log Name Query Size small (cpid, bw) bytes downloaded (sum) medium (cpid, bw, country_code) bytes downloaded (moments) large (cpid, bw, url) client ip (HyperLogLog) O( ) groups O( ) groups O( 6 ) groups Heintz, Chandra, Sitaraman HPDC 5 59

60 Experiments: Implementation Implemented algorithms in Apache Storm One Storm cluster at center, each edge Seven total PlanetLab sites (washington.edu) (wisc.edu) (csuohio.edu) (uwaterloo.ca) (yale.edu) Replayer Logs Summer Summer Summer Aggregator Reorderer SocketSender (princeton.edu) (ucla.edu) Summer SocketReceiver Reorderer Summer Summer StatsCollector Aggregator Heintz, Chandra, Sitaraman HPDC 5 6

61 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Normalized 95th- %ile Staleness Staleness Batching Hybrid(.5) Streaming Optimal Heintz, Chandra, Sitaraman HPDC 5 6

62 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Near- optimal traffic Normalized 95th- %ile Staleness Batching Hybrid(.5) Streaming Optimal Staleness 65%. Significant staleness reduction Heintz, Chandra, Sitaraman HPDC 5 6

63 Experimental Evaluation: Multi- ( edges) Staleness Batching Hybrid(.5) Streaming Optimal Batching Hybrid(.5) Streaming Optimal Normalized Mean Normalized 95th- %ile Staleness... Heintz, Chandra, Sitaraman HPDC 5 6

64 Experimental Evaluation: Multi- (6 edges) Batching Hybrid(.5) Streaming Optimal Staleness Batching Hybrid(.5) Streaming Optimal Normalized Mean Low High Normalized 95th- %ile Staleness Low High Heintz, Chandra, Sitaraman HPDC 5 6

65 Related Work Systems Twitter Heron (SIGMOD 5) Apache Spark Streaming (SOSP ) Google MillWheel (VLDB ) Muppet (VLDB ) Wide- area Computing WANalytics (CIDR 5) JetStream (NSDI ) Pietzuch et al. (ICDE 6) Optimization Trade- offs BlinkDB (VLDB ) LazyBase (EuroSys ) Heintz, Chandra, Sitaraman HPDC 5 65

66 Conclusion Geo- distributed windowed, grouped aggregation Naïve algorithms: low staleness or low traffic Optimal offline algorithms to jointly minimize both Practical algorithms via caching Dynamic sizing policies Reuse existing eviction policies Heintz, Chandra, Sitaraman HPDC 5 66

Complex Interactions in Content Distribution Ecosystem and QoE

Complex Interactions in Content Distribution Ecosystem and QoE Zhi-Li Zhang Qwest Chair Professor & Distinguished McKnight University Professor Dept. of Computer Science & Eng., University of Minnesota