The Hardware & Software Implications of Microservices and How Big Data Can Help

Size: px

Start display at page:

Download "The Hardware & Software Implications of Microservices and How Big Data Can Help"

Ralf Atkinson
5 years ago
Views:

1 The Hardware & Software Implications of Microservices and How Big Data Can Help Christina Delimitrou Cornell University with Yu Gan, Yanqi Zhang, Shuang Chen, Neeraj Kulkarni, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Ankitha Shetty, Nayan Katarki, Brett Clancy, Chris Colen, Dailun Cheng, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Meghna Pancholi, Siyuan Hu ASBD Workshop June 2 nd 2018

2 Executive Summary Shift from monoliths to microservices: Modularity, specialization, simplicity, accelerated development Change assumptions about datacenter server design Complicate scheduling and resource management Amplify effects Revisit architectural design decisions for microservices Highlight management challenges of microservices Motivate the need for data-driven approaches for systems whose scale & complexity keeps increasing 2

3 From Monoliths to Microservices 3

4 Motivation Advantages of microservices: Ease & speed of code development & deployment Security, error isolation PL/framework heterogeneity Challenges of microservices: Change server design assumptions Complicate resource management à dependencies Amplify tail-at-scale effects More sensitive to performance unpredictability No representative end-to-end apps with microservices 4

5 An End-to-End Suite for Cloud & IoT Microservices 4 end-to-end applications using popular open-source microservices à ~30-40 microservices per app Social Network Movie Reviewing/Renting/Streaming E-commerce Drone control service Programming languages and frameworks: node.js, Python, C/C++, Java/Javascript, Scala, PHP, and Go Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian Apache Thrift RPC, RESTful APIs Docker containers Lightweight RPC-level distributed tracing 5

are Thrift RPCs UpdateUser ReviewStorage UserReviewDB MovieReviewDB UpdateMovie

6 Movie Streaming Client http Load Balancer http nginx fastcgi Frontend php-fpm movieid text uniqueid AssignRating Compose Phase Login ComposeReview All arrows are Thrift RPCs UpdateUser ReviewStorage UserReviewDB MovieReviewDB UpdateMovie Store Phase memcached memcached memcached memcached mongodb mongodb mongodb mongodb 6

7 Movie Streaming Browse movie info (movie plot, photos, videos, cast, stats, etc.) ML widgets: n Recommender for movies to watch n Recommender for ads User authentication/payment Search: n Xapian: search movie DB Analytics: n Mahout: user analytics based on input stored in HDFS n Spark MLlib: in-memory ML analytics 7

8 Architectural Implications [CAL 18] Big vs. small servers: Power management using RAPL More pressure on single-thread performance, low tail latency 8

9 Architectural Implications Tail Latency (msec) Movie Streaming QPS Tail Latency (msec) Social Network QPS Tail Latency (sec) E-commerce QPS Tail Latency (sec) IoT - Cloud Xeon ThunderX QPS Big vs. small servers: Power management using RAPL More pressure on single-thread performance, low tail latency Low-power SoCs, e.g., Cavium ThunderX2 Similar latency, but earlier saturation 9

10 Architectural Implications Tail Latency (msec) nginx AssignR MovieID ProcText ReviewID Movie Streaming Application proc TCP proc (RPCs) RevStorage UserReview MovReview memcached mongodb End-to-End Monolith Compose Computation:Communication ratio: Monolithic service à high load Microservices à high load 10

11 Architectural Implications DRAM DRAM DRAM CPU QPI CPU PCIe Gen3 Virtex7 QSFP 10Gbps NIC PCIe Gen3 QSFP QSFP 10Gbps Computation:Communication ratio: Monolithic service à high load Microservices à high load RPC/REST acceleration à NIC offloads, FPGAs 11

Architectural Implications L1i MPKI L1-i cache pressure: 40 35 30 25 20 15 10 5 0 nginx text Monoliths à Large code footprints à L1i thrashing Microservices à Small footprint/microservice n Assuming

12 Architectural Implications L1i MPKI L1-i cache pressure: nginx text Monoliths à Large code footprints à L1i thrashing Microservices à Small footprint/microservice n Assuming dedicated cores Social Network image msgid taguser urlshorten video compose msgstore wrtimeline wrgraph mem$ mongodb L1i MPKI frontend login E-Commerce orders search cart wishlist catalogue recommend shipping payment invoice qmaster mem$ mongodb 12

13 End-to-End Latency Breakdown Post-rightsizing (resource ratios to avoid glaring bottlenecks) Bottlenecks shift with load Need online, dynamic decisions 13

14 Resource Management Implications Netflix Twitter Amazon Movie Streaming Challenges of microservices: Change server design assumptions Dependencies complicate resource management 14

15 Dependencies & Backpressure read <k,v> nginx mem$ http1 15

16 Dependencies & Backpressure read <k,v> nginx mem$ http1 RX nginx TX RX mem$ http2 Traditional techniques like autoscale may help/penalize the wrong microservice Dependencies change at runtime à difficult to infer impact 16

17 Determine Per-Tier QoS Queueing models nginx mem$ QoS 2 Queueing network simulation QoS 1 Complex microservices graphs, blocking, cyclic dependencies, etc. 17

18 Power Management for Microservices Two types of latency slack: Microservices off the critical path Microservices on the critical path with relaxed QoS Frequency Frequency Frequency.2GHz End-to-end Latency End-to-end Latency End-to-end Latency QoS 100 Utilization Utilization Utilization 0 time time time 18

19 Scalability Challenges Determine per-tier QoS for 1000s of microservices à intractable Put visceral graph here 19

20 Tail at Scale Effects Microservices add an extra dimension to tail at scale effects A single slow microservice affects end-to-end latency Much more pressure on performance predictability & availability Monitoring at the edge Determining per-tier QoS for 10000s of microservices is intractable Scalable data-driven approach Need for online performance debugging 20

21 Proactive Performance Debugging Dependencies between microservices à propagate & amplify QoS violations Finding the culprit of a QoS violation is difficult Post-QoS violation, returning to nominal operation is hard Anticipating QoS violations & identifying culprits Seer: Data-driven Performance Debugging for Microservices Combines lightweight RPC-level distributed tracing with hardware monitoring Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action 21

22 Queue CPU Mem Net Disk Performance Implications 22

23 Queue CPU Mem Performance Implications Net Disk 23

24 Seer: Data-Driven Performance Debugging [HotCloud 18] Leverage the massive amount of traces collected over time 1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation 2. Use per-server hardware monitoring to determine the cause of the QoS violation 3. Take corrective action to prevent the QoS violation from occurring Need to predict 100s of msec a few sec in the future 24

2% in tail latency Client http WebUI QueryEngine Cassandra Tracing Collector microservices ztracer uservice K

25 Tracing Framework RPC level tracing Based on Apache Thrift Timestamp start-end for each microservice Store in centralized DB (Cassandra) Record all requests à No sampling Overhead: <0.1% in throughput and <0.2% in tail latency Client http WebUI QueryEngine Cassandra Tracing Collector microservices ztracer uservice K ztracer uservice K+1 Gantt charts latency [ ] TCP Proc TCP TCP Proc TCP [ ] TCP proc RX App proc TCP proc TX RPC time TX RPC time RX 25

26 Deep Learning to the Rescue Why? Architecture-agnostic Adjusts to changes in dependencies over time High accuracy, good scalability Inference within the required window 26

27 DNN Configuration Input signal Container utilization Latency Queue depth Output signal Which microservice will cause a QoS violation in the near future? 27

28 DNN Configuration Input signal Container utilization Latency Queue depth Output signal Which microservice will cause a QoS violation in the near future? 28

DNN Configuration Training once: slow (hours - days) Across load levels, load distributions, request types Distributed queue traces, annotated with QoS violations Weight/bias inference with

29 DNN Configuration Training once: slow (hours - days) Across load levels, load distributions, request types Distributed queue traces, annotated with QoS violations Weight/bias inference with SGD Retraining in the background Inference continuously: streaming trace data 93% accuracy in signaling upcoming QoS violations 91% accuracy in attributing QoS violation to correct microservice 29

DNN Configuration Challenges: Accuracy stable or increasing with cluster size In large clusters inference too slow to prevent QoS violations Offload on TPUs, 10-100x

30 DNN Configuration Challenges: Accuracy stable or increasing with cluster size In large clusters inference too slow to prevent QoS violations Offload on TPUs, x improvement; 10ms for 90 th %ile inference Fast enough for most corrective actions to take effect (net bw partitioning, RAPL, cache partitioning, scale-up/out, etc.) 30

31 Experimental Setup 40 dedicated servers ~1000 single-concerned containers Machine utilization 80-85% Inject interference to cause QoS violation Using microbenchmarks (CPU, cache, memory, network, disk I/O) 31

32 Restoring QoS Identify cause of QoS violation Private cluster: performance counters & utilization monitors Public cluster: contentious microbenchmarks Adjust resource allocation RAPL (fine-grain DVFS) & scale-up for CPU contention Cache partitioning (CAT) for cache contention Memory capacity partitioning for memory contention Network bandwidth partitioning (HTB) for net contention Storage bandwidth partitioning for I/O contention 32

33 Demo Queue CPU Mem Net Disk 33

34 34

35 A New Cloud Stack Applications Programming frameworks Cluster management CAL 18a in submission CAL 18b, HotCloud 18, in submission Hardware design in submission 35

36 A New Cloud Stack Applications Programming frameworks Cluster management CAL 18a in submission CAL 18b, HotCloud 18, in submission Hardware design Thank you! in submission 36

SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES

SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou Cornell