No Tradeoff Low Latency + High Efficiency

Size: px

Start display at page:

Download "No Tradeoff Low Latency + High Efficiency"

Scott Little
5 years ago
Views:

1 No Tradeoff Low Latency + High Efficiency Christos Kozyrakis

2 Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS), Occupy 1,000s of servers in datacenters 2

3 Example: Web-search Front end Q Web Servers Back end R Root Q Parent QQ Parent Q Parent Q R R R Leaf R R Leaf Leaf R Leaf R R Leaf Leaf R Leaf R R Leaf Leaf R Metrics: Massive queries distributed per sec datasets, (QPS) multiple a 99 th -% tiers, latency high threshold fan-out 3

4 Characteristics High throughput + low latency (100μsec 10msec) Focus on tail latency (e.g., 99 th percentile SLO) Distributed and (often) stateful Multi-tier with high fan-out Diurnal patterns but load is never 0 4

5 Trends [Adrian Cockcroft 14] Apps as collection of micro-services Loosely-coupled services each with bounded context Even lower tail latency requirements From 100s to just a few μsec 5

6 Conventional Wisdom for LC Apps 6

7 #1 LC apps cannot be power managed 7

8 LC Apps & Power Management Power (DVFS off) [Lo et al 14] Search load Tail latency (DVFS on) 8

9 #2 LC apps must use dedicated servers 9

114% 115% 105% 117% 120% 136% >300% B A D CPU power 124% 107% 116% 109% 115% 105% 101% 100% 100% Network

10 LC Apps & Server Sharing Impact of interference on websearch s latency LLC >300% >300% >300% >300% >300% >300% >300% 264% 123% 300% DRAM >300% >300% >300% >300% >300% >300% >300% 270% 122% HyperThread 110% 107% 114% 115% 105% 117% 120% 136% >300% B A D CPU power 124% 107% 116% 109% 115% 105% 101% 100% 100% Network 36% 36% 37% 37% 39% 42% 48% 55% 64% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0% O K Load [Lo et al 15] 10

Both meet QoS 1ms RPC fails Memcached fails Both fail to meet QoS QoS requirement

11 LC Apps & Core Sharing Achieved QoS w/ two low-latency workloads Memcached QPS (% Peak) Bad QoS with minor interference! Both meet QoS 1ms RPC fails Memcached fails Both fail to meet QoS QoS requirement = 95 th < 5x low-load 95 th ms RPC-service QPS (% Peak) [Leverich 14] 11

12 #3 LC apps must use local Flash 12

13 Local Vs Remote NVMe Access p95 read latency (us) x latency 75% throughput drop Local Flash iscsi (1 core) libaio+libevent (1core) IOPS (Thousands) 13

Sharing NVMe Flash 2000 1800 1600 The curve you paid for p95 read latency

75%read 50%read 0 250 500 750 1000 1250 Total IOPS (Thousands) The curve

14 Sharing NVMe Flash The curve you paid for p95 read latency (us) %read 99%read 95%read 90%read 75%read 50%read Total IOPS (Thousands) The curve you get with sharing Flash performance varies a lot with read/write ratio 14

15 #4 LC apps must bypass the kernel for I/O cannot use TCP cannot use Ethernet must use specialized HW 15

LC Apps & Linux/TCP/Ethernet 64-byte TCP echo 60 10 HW Limit Linux 50 40 30 20 10 4.

16 LC Apps & Linux/TCP/Ethernet 64-byte TCP echo HW Limit Linux x Gap Millions x Gap 0 Microseconds 0 Requests per Second [Belay et al 15] 16

17 Conventional Wisdom for LC Apps 17

18 Low Latency + High Efficiency Understand the problem Understand the opportunity 18

19 Understanding the Latency Problem Latency (usec) MM1 memcached N=1 MM4 memcached N=4 [Delimitrou 15] Load Latency-critical apps as queuing systems Tail latency affected by Service time, queuing delay, scheduling time, load imbalance 19

Understanding the Latency Problem Latency (usecs) 1000 900 800 700 600 500 400 300 95th-% Provisioned QPS 200 100 0

20 Understanding the Latency Problem Latency (usecs) th-% Provisioned QPS % 8% 14% 19% 24% 30% 35% 41% 46% 51% 57% 62% 68% 73% 78% 84% 89% 95% 100% Memcached QPS (% of peak) [Leverich 15] 20

21 Understanding the Opportunity Significant latency slack! Tail latency [Lo et al 14] 0% 20% 40% 60% 80% 100% % of maximum cluster load Iso-latency: maintain constant tail latency at all loads Enables power management, HW sharing, adaptive batching, Key point: use tail latency as control signal 21

22 Low Latency + High Efficiency Understand the problem Understand the opportunity End-to-end & latency-aware management 22

23 Pegasus: Iso-latency + Fine-grain DVFS L P L L L [Lo et al 14] Measures latency slack end-to-end Sets uniform latency goal across all servers RAPL as knob for power Power is set by workload specific policy 23

24 Pegasus: Iso-latency + Fine-grain DVFS Achieves dynamic energy proportionality [Lo et al 14] 24

25 Heracles: Iso-latency + QoS Isolation Latency readings Controller Can BE grow? LC workload DRAM BW CPU + Memory LLC CPU CPU power DVFS CPU Power Network HTB Net. BW Internal feedback loops [Lo et al 15] Goal: meet SLO while using idle resources for batch workloads End-to-end latency measurements control isolation mechanisms 25

26 Heracles: Iso-latency + QoS Isolation Cores LLC Core freq. Network BW LC BE LC BE BE Max LC BE L Latency readings LC workload DRAM BW CPU + Memory LLC CPU Controller DVFS CPU power CPU Power Network HTB Can BE grow? Net. BW ü Internal feedback loops [Lo et al 15] 26

27 Heracles: Iso-latency + QoS Isolation Cores LLC Core freq. Network BW LC BE LC BE BE Max LC BE L Latency readings LC workload DRAM BW CPU + Memory LLC CPU Controller DVFS CPU power CPU Power Network HTB Can BE grow? Net. BW L Internal feedback loops [Lo et al 15] 27

28 Iso-latency + QoS Isolation >90% HW utilization No tail latency problems [Lo et al 15] LC apps + best-effort tasks on the same servers HW & SW isolation mechanisms (cores, caches, network, power) Iso-latency based control for resource allocation 28

29 Low Latency + High Efficiency Understand the problem Understand the opportunity End-to-end & latency-aware management Optimize for modern multi-core hardware Specialize the SW 29

30 System SW for Low Latency + High BW App 1 App 2 Userspace Kernelspace OS Kernel TCP/IP C C C C C RX/TX 30

31 System SW for Low Latency + High BW Control plane App 1 App 2 Userspace Kernelspace OS Kernel TCP/IP RX TX RX TX RX TX RX TX C C C C C RX/TX 31

32 System SW for Low Latency + High BW Ring 3 Control plane App 1 App 2 Guest Ring 0 Host Ring 0 OS kernel Dune RX TX RX TX RX TX RX TX C C C C C 32

33 System SW for Low Latency + High BW Ring 3 Control plane Memcached libix HTTPd libix Guest Ring 0 IX TCP/IP Custom Transport Host Ring 0 OS Kernel Dune RX TX RX TX RX TX RX TX C C C C C [Belay et al 14] 33

34 Run-to-Completion Ring 3 event-driven app 3 Event Conditions libix Batched Syscalls Guest Ring 0 TCP/IP 2 4 TCP/IP RX FIFO 1 RX TX 6 5 Timer Improves data cache locality Removes scheduling unpredictably 34

35 Adaptive Batching Ring 3 event-driven app 3 Event Conditions libix Batched Syscalls Guest Ring 0 TCP/IP 2 4 TCP/IP RX FIFO 1 Adaptive Batch Calculation RX TX 6 5 Timer Enabled by ISO-latency Improves instruction cache locality and prefetching 35

IX: Low Latency + High Bandwidth Latency (µs) 750 500 250 IX tail lower

Less Tail Latency + TCP + commodity HW + protection + 1M connections 5x

36 IX: Low Latency + High Bandwidth Latency (µs) IX tail lower than Linux avg 0 Linux (avg) Linux (99 th pct) IX (avg) IX (99 th pct) 2x Less Tail Latency + TCP + commodity HW + protection + 1M connections 5x More RPS SLA USR: Throughput (RPS x 10 6 ) [Belay et al 14] 36

37 Low Latency + High Bandwidth + High Efficiency MC load MC load MC latency MC latency SLO dynamic Pareto Power Time (seconds) Time (seconds) Time (seconds) BE Performance cached running Figure on IX 7: Consolidation for three load patterns. of memcached with a background batch task Time (seconds) 37

38 ReFlex: IX + QoS Scheduling Ring 3 Control Plane App libix App libix Guest Ring 0 IX ReFlex Host Ring 0 Linux kernel Dune RX TX RX TX SQ CQ Core Core Core 38

39 ReFlex: IX + QoS Scheduling Ring 3 ReFlex Server 3 Event Conditions libix Batched Syscalls Guest Ring 0 NVMe TCP/IP 2 TCP/IP Scheduler NVMe CQ 1 RX TX SQ 4 [Klimovic et al 17] 39

40 ReFlex: IX + QoS Scheduling Ring 3 Event Conditions ReFlex Server libix 7 Batched Syscalls Guest Ring 0 NVMe 6 TCP/IP TCP/IP Scheduler NVMe 8 CQ RX TX SQ 5 [Klimovic et al 17] 40

41 ReFlex: Local Remote Latency p95 Read Latency (us) Linux: 75K IOPS/core ReFlex: 850K IOPS/core Local-1T ReFlex-1T Libaio-1T IOPS (Thousands) 41

ReFlex: Local Remote Latency p95 Read Latency (us) 1000

42 ReFlex: Local Remote Latency p95 Read Latency (us) Unloaded latency Local Flash 78 µs ReFlex 99 µs Libaio 121 µs Local-1T ReFlex-1T Libaio-1T IOPS (Thousands) 42

43 ReFlex: Local Remote Latency p95 Read Latency (us) ReFlex saturates Flash device Local-1T Local-2T ReFlex-1T ReFlex-2T Libaio-1T Libaio-2T IOPS (Thousands) 43

44 ReFlex: Private Shared QoS Read p95 latency (us) I/O sched disabled I/O sched enabled Latency SLO IOPS (Thousands) I/O sched disabled Tenant A IOPS SLO I/O sched enabled Tenant B IOPS SLO 0 Tenant A Tenant B Tenant C Tenant D 0 Tenant A Tenant B Tenant C Tenant D Tenants A & B: latency-critical; Tenant C + D: best effort Without scheduler: latency and bandwidth QoS for A/B are violated Scheduler rate limits best-effort tenants to enforce SLOs 44

45 ReFlex: Private Shared QoS Read p95 latency (us) I/O sched disabled I/O sched enabled Latency SLO IOPS (Thousands) I/O sched disabled Tenant A IOPS SLO I/O sched enabled Tenant B IOPS SLO 0 Tenant A Tenant B Tenant C Tenant D 0 Tenant A Tenant B Tenant C Tenant D Tenants A & B: latency-critical; Tenant C + D: best effort Without scheduler: latency and bandwidth QoS for A/B are violated Scheduler rate limits best-effort tenants to enforce SLOs 45

46 ReFlex: Private Shared QoS Read p95 latency (us) I/O sched disabled I/O sched enabled Latency SLO IOPS (Thousands) I/O sched disabled Tenant A IOPS SLO I/O sched enabled Tenant B IOPS SLO 0 Tenant A Tenant B Tenant C Tenant D 0 Tenant A Tenant B Tenant C Tenant D Tenants A & B: latency-critical; Tenant C + D: best effort Without scheduler: latency and bandwidth QoS for A/B are violated Scheduler rate limits best-effort tenants to enforce SLOs 46

47 Looking Forward Programming models for low tail latency Especially for multi-tier applications Variability within and across applications Short vs long requests Cluster-level issues Provisioning, scheduling, load-balancing, throttling, μsecond-scale tail latency Can we still get efficiency at ultra low latency? Hardware for latency-critical applications Big vs little cores, isolation mechanisms, accelerators 47

48 Conclusions Conventional wisdom is wrong Low latency is compatible with high efficiency Efficiency: energy proportionality, resource usage, throughput Encouraging initial results Room for improvements throughout the stack [Source:ThinkStock] 48

ReFlex: Remote Flash Local Flash

ReFlex: Remote Flash Local Flash Ana Klimovic Heiner Litz Christos Kozyrakis October 28, 216 IAP Cloud Workshop Flash in Datacenters Flash provides 1 higher throughput and 2 lower latency than disk PCIe