HotCloud 17. Lube: Mitigating Bottlenecks in Wide Area Data Analytics. Hao Wang* Baochun Li

1 HotCloud 17. Lube: Mitigating Bottlenecks in Wide Area Data Analytics. Hao Wang*, Baochun Li

2 Wide Area Data Analytics
[Diagram: a single datacenter (DC) with a master and namenode, plus workers and datanodes.]

3 Wide Area Data Analytics
Why wide area data analytics? Data volume, user distribution, regulation policy.
Problems: widely shared resources, fluctuating available provision, distributed runtime environment, heterogeneous utilizations.
[Diagram: DC #1 hosts the master and namenode; DC #2 through DC #n host workers and datanodes.]

4 Fluctuating WAN Bandwidths
[Figure: bandwidth (Mbps) over time, from 0:00 on Jan 1 to 12:00 on Jan 2; series labels include VC, CT, TR and WT.]
Measured by iperf on the SAVI testbed.

5 Heterogeneous Memory Utilization
Nodes in different DCs may have different resource utilizations.
[Figure: memory utilization over time (s), one series per node.]
Running the Berkeley Big Data Benchmark on AWS EC2, 4 nodes across 4 regions. Collected by jvmtop.

6 Runtime Bottlenecks
Fluctuation and heterogeneity: bottlenecks emerge at runtime, at any time, on any nodes, in any resources.
Effect on data analytics performance: long completion times, low resource utilization, invalid optimization.

7 Optimization of Data Analytics
Existing optimization methods do not consider runtime bottlenecks:
Clarinet [OSDI 16] considers the heterogeneity of available WAN bandwidth.
Iridium [SIGCOMM 15] trades off between time and WAN bandwidth usage.
Geode [NSDI 15] saves WAN usage via data placement and query plan selection.
SWAG [SoCC 15] reorders jobs across datacenters.
Much of this performance work has been motivated by three widely-accepted mantras about the performance of data analytics: network, disk and straggler. (Making Sense of Performance in Data Analytics Frameworks, Kay Ousterhout, NSDI 15)

8 Mitigating Bottlenecks at Runtime
How to detect bottlenecks? How to overcome the scheduling delay? How to enforce the bottleneck mitigation?
[Diagram: a resource queue and a task queue in bottleneck.]

9 Architecture of Lube
Three major components: performance monitors, bottleneck detecting module, bottleneck-aware scheduler.
Lube Client: lightweight performance monitors (network I/O, disk I/O, JVM, more metrics); online bottleneck detector with a training pool and model update.
Lube Master: bottleneck info. cache; available worker pool of (worker, intensity) entries; Lube scheduler applying bottleneck-aware scheduling to the submitted task queue.

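10 Detecting Bottlenecks: ARIMA
$$y_t = \theta_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$$
$y_t$: current state. $\varepsilon$: random error. $\theta$, $\phi$: coefficients.
The model combines an autoregressive (AR) part and a moving average (MA) part.
Input: historical states (time_1, mem_util), (time_2, mem_util), ..., (time_t-1, mem_util). Model: ARIMA(p, d, q). Output: current state (time_t, mem_util).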
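To make the ARIMA equation concrete, here is a minimal sketch of a one-step forecast computed directly from it. It assumes no differencing (d = 0) and coefficients that have already been fitted elsewhere. The function name `arma_one_step` and all numeric values are illustrative and do not come from the slides.

```python
def arma_one_step(theta0, phi, theta, y_hist, eps_hist):
    """One-step forecast of y_t from the ARMA equation.

    The unknown current error eps_t is replaced by its expected value, zero.
    phi[i] multiplies y_{t-1-i}; theta[j] multiplies eps_{t-1-j}.
    y_hist and eps_hist are ordered oldest to newest.
    """
    if len(y_hist) < len(phi) or len(eps_hist) < len(theta):
        raise ValueError("not enough history for the chosen p and q")
    # Autoregressive part: weighted sum of the p most recent states.
    ar = sum(c * y_hist[-1 - i] for i, c in enumerate(phi))
    # Moving average part: weighted sum of the q most recent errors.
    ma = sum(c * eps_hist[-1 - j] for j, c in enumerate(theta))
    return theta0 + ar - ma


# Hypothetical memory utilization history and past forecast errors.
mem_util = [0.6, 0.7]
past_errors = [0.05]
forecast = arma_one_step(0.1, [0.5, 0.2], [0.3], mem_util, past_errors)
```

With p = 2 and q = 1 as above, the forecast is 0.1 + 0.5 * 0.7 + 0.2 * 0.6 - 0.3 * 0.05.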

11 Detecting Bottlenecks: HMM
Hidden Markov Model
Hidden states: Q = {q_1, q_2, ..., q_i, ..., q_j}. Observation states: O = {O_1, O_2, ..., O_d, ..., O_k}.
Transition probability: A = (a_ij). Emission probability: B = (b_j(k)).
To make the HMM online, observations arrive as {time_stamp: mem, net, cpu, disk}.
Sliding Hidden Markov Model: a sliding window for new observations, and a moving average approximation for outdated observations.
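The sliding-window idea can be sketched as follows. This shows only the bookkeeping: recent observations are kept verbatim, and each evicted observation is folded into a running mean. How Lube feeds these into HMM training is not stated on the slide, so the class name `SlidingObservations`, the window size, and the use of a plain arithmetic mean are all assumptions.

```python
from collections import deque


class SlidingObservations:
    """Keep the newest `size` observations; summarize older ones by a mean."""

    def __init__(self, size):
        self.size = size
        self.window = deque()      # recent observations, oldest first
        self.outdated_count = 0    # how many observations were evicted
        self.outdated_mean = {}    # per-metric running mean of evicted ones

    def add(self, obs):
        """obs is a dict such as {"mem": 0.4, "net": 0.1, "cpu": 0.7, "disk": 0.2}."""
        self.window.append(obs)
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.outdated_count += 1
            n = self.outdated_count
            for metric, value in old.items():
                prev = self.outdated_mean.get(metric, 0.0)
                # Incremental mean update: mean_n = mean_{n-1} + (x - mean_{n-1}) / n
                self.outdated_mean[metric] = prev + (value - prev) / n


obs_store = SlidingObservations(size=2)
for mem in (0.2, 0.4, 0.6, 0.8):
    obs_store.add({"mem": mem})
```

After the four insertions above, the window holds the observations with mem 0.6 and 0.8, and the two evicted ones are represented by their mean.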

12 Bottleneck-Aware Scheduling
Built-in task schedulers: data-locality.
Bottleneck-aware scheduler: data-locality and bottlenecks at runtime.
A single worker node is bottlenecked continuously, while all nodes are rarely bottlenecked at the same time.
[Figure: four panels over time (s): memory utilization of executor processes, network utilization of datanode processes, CPU utilization of executor processes, disk (SSD) utilization of datanode processes.]
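A greedy policy that weighs data-locality against runtime bottlenecks might look like the sketch below. It is not Lube's actual scheduler: the function name `pick_worker`, the intensity scale of 0 to 1, and the threshold of 0.8 are assumptions made for illustration.

```python
def pick_worker(local_workers, intensity, threshold=0.8):
    """Choose a worker for one task.

    local_workers: workers that hold the task's input data.
    intensity: dict mapping every available worker to its predicted
               bottleneck intensity (assumed to lie in [0, 1]).
    """
    if not intensity:
        raise ValueError("no available workers")
    # Prefer a data-local worker as long as it is not predicted to be bottlenecked.
    local_ok = [w for w in local_workers
                if w in intensity and intensity[w] < threshold]
    if local_ok:
        return min(local_ok, key=lambda w: intensity[w])
    # Otherwise give up locality and take the least bottlenecked worker.
    return min(intensity, key=lambda w: intensity[w])


pool = {"w1": 0.9, "w2": 0.3, "w3": 0.1}
choice_local = pick_worker(["w2"], pool)    # local worker is healthy
choice_remote = pick_worker(["w1"], pool)   # local worker is bottlenecked
```

This relies on the slide's observation that all nodes are rarely bottlenecked at once, so a less loaded worker is usually available.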

13 Implementation & Deployment
Implementation: Spark (scheduler); redis database (cache); Python scikit-learn, Keras (ML).
Deployment: 37 EC2 m4.2xlarge instances in 9 regions; Berkeley Big Data Benchmark; a 1.1 TB dataset.
Master node: Lube scheduler, master Redis server.
Worker nodes: bottleneck detection module, worker Redis server, nethogs, jvmtop, iotop.
APIs:
HGET worker_id time
HSET worker_id {time: {metric: val_ob, val_inf}}
SUBSCRIBE metric_1 metric_2
PUBLISH + HSET metric {time: val} (e.g., iotop {time: I/O})

14 Evaluation: Accuracy
Calculation:
$$\text{hit rate} = \frac{\#\big((\text{time}, \text{detection}) \cap (\text{time}, \text{observation})\big)}{\#(\text{time}, \text{detection})}$$
ARIMA ignores nonlinear patterns.
[Figure: hit rate (%) of ARIMA and SlidHMM, one panel per query (Query-1 to Query-4), with x-axis categories a, b, c.]
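The hit-rate calculation can be read as a set intersection: the fraction of detected bottlenecks that were also observed at the same time. The sketch below assumes each detection and observation is a hashable pair. The function name `hit_rate` and the sample data are hypothetical.

```python
def hit_rate(detections, observations):
    """Fraction of (time, detection) pairs that also occur as observations."""
    detections = set(detections)
    if not detections:
        return 0.0  # nothing was detected, so there is nothing to hit
    hits = detections & set(observations)
    return len(hits) / len(detections)


# Hypothetical (time, bottlenecked worker) pairs.
detected = [(1, "w1"), (2, "w1"), (3, "w2"), (4, "w2")]
observed = [(1, "w1"), (3, "w2"), (5, "w3")]
rate = hit_rate(detected, observed)
```

Multiplying the result by 100 gives the percentage plotted on the slide.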

15 Evaluation: Completion Times
Task completion times.
[Figure: task completion time (ms), one panel per query, comparing Pure Spark, Lube-ARIMA and Lube-SlidHMM; y-axis up to 1.0.]
[Table: average and 75th columns for Lube-ARIMA and Lube-SlidHMM, in seconds; values not recoverable.]

16 Evaluation: Completion Times
Query completion times: Lube-ARIMA and Lube-SlidHMM reduce median query response time by up to 33%.
Control groups for overhead: ARIMA + Spark and SlidHMM + Spark. Negligible overhead.
[Figure: query completion time (s), one panel per query, comparing Pure Spark, Lube-ARIMA, ARIMA + Spark, Lube-SlidHMM and SlidHMM + Spark.]

17 Conclusion
Runtime performance bottleneck detection: ARIMA, HMM.
A simple greedy bottleneck-aware task scheduler that jointly considers data-locality and bottlenecks.
Lube, a closed-loop framework mitigating bottlenecks at runtime.

18 The End Thank You

19 Discussion
Bottleneck detection models: more performance metrics could be explored; more efficient models for time series prediction, e.g., Reinforcement Learning, LSTM.
Bottleneck-aware scheduling: fine-grained scheduling with specific resource awareness.
WAN conditions: we measure pair-wise WAN bandwidths by a cron job running iperf locally; try to exploit support from SDN interfaces.