WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES ON BIG AND SMALL SERVER PLATFORMS

1 WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES ON BIG AND SMALL SERVER PLATFORMS
Shuang Chen*, Shay Galon**, Christina Delimitrou*, Srilatha Manne**, and José Martínez*
*Cornell University
**Cavium Inc.

2 EXECUTIVE SUMMARY
How to achieve low tail latency for interactive cloud services?
  Tail latency is more important and more challenging
  The entire stack from SW to HW is involved
Understand how tail latency reacts to application and system changes
  See how current designs work
  Get insights on future designs

3 MOTIVATION

4 LOW LATENCY
Tail latency
  e.g., QoS defined as 99th-percentile latency within 500 usec
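A tail-latency QoS target of this kind can be checked with a few lines of code. The sketch below uses the nearest-rank percentile definition, which is one common convention and an assumption here; the helper names and the sample latencies are made up for illustration and are not measurements from the study.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_qos(latencies_us, target_us=500, pct=99):
    """True if the pct-th percentile latency is within the target."""
    return percentile(latencies_us, pct) <= target_us


# Hypothetical sample: 98 fast requests, one slow, one very slow.
latencies_us = [120] * 98 + [450, 2000]
p99 = percentile(latencies_us, 99)
mean = sum(latencies_us) / len(latencies_us)
print(f"mean = {mean:.1f} us, p99 = {p99} us, QoS met: {meets_qos(latencies_us)}")
```

In this made-up sample the mean stays close to the common case while the 99th percentile is set by the slow requests, which is why the target is stated on the tail rather than on the average.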

5 LOW TAIL LATENCY REQUIREMENTS
The entire stack from SW to HW is involved
Stack layers: Application, Resource Manager, Virtualization, OS, Hardware
Factors:
  Application bottleneck
  Different use cases
  Scalability
  Overhead of virtualization
  SW isolation mechanisms
  Overhead of context switching
  HW isolation mechanisms
  Hyperthreading

6 CATEGORIZE LC APPLICATIONS
By requirement of tail latency
  us: memcached
  ms: web server, in-memory database
  s: persistent database
By statefulness
  Stateful: memcached
  Stateless: web server

7 SELECTED LC WORKLOADS
NGINX: web server, stateless, 99th percentile in tens of ms
Memcached: key-value store, stateful, 99th percentile in hundreds of us
[Figure: NGINX and Memcached positioned by QoS strictness and statefulness]

8 SERVER ARCHITECTURE
                Intel Xeon E v4     Cavium ThunderX
Cores           22                  48
Threads/core    2                   1
L1 I/D          32/32KB             78/32KB
L2              256KB               -
LLC             55MB, 20 ways       16MB, 16 ways
Process         14nm                28nm
Memory          128G DDR4           128G DDR4
NIC             10Gbps              10Gbps
Price           $4,115              $785
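The core counts and prices on this slide allow a quick cost-per-core comparison. The sketch below is plain arithmetic on the slide's numbers; it assumes each price refers to a single processor with the listed core count, which the slide does not state explicitly, and the dictionary layout and function names are illustrative.

```python
# Figures copied from the slide; the structure and names are illustrative.
platforms = {
    "Xeon": {"cores": 22, "threads_per_core": 2, "price_usd": 4115},
    "ThunderX": {"cores": 48, "threads_per_core": 1, "price_usd": 785},
}


def cost_per_core(name):
    """Price divided by the number of physical cores."""
    p = platforms[name]
    return p["price_usd"] / p["cores"]


def cost_per_hw_thread(name):
    """Price divided by the number of hardware threads (cores x threads/core)."""
    p = platforms[name]
    return p["price_usd"] / (p["cores"] * p["threads_per_core"])


for name in platforms:
    print(f"{name}: ${cost_per_core(name):.2f} per core, "
          f"${cost_per_hw_thread(name):.2f} per hardware thread")
```

Cost per core says nothing about per-core performance, which is what the rest of the deck measures.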

9 STUDIED PARAMETERS
Stack layers: Application, Resource Manager, Virtualization, OS, Hardware
Parameters:
  Application bottleneck
  Different use cases
  Scalability
  Overhead of virtualization
  SW isolation mechanisms
  Overhead of context switching
  HW isolation mechanisms
  Hyperthreading

10 INPUT LOAD
[Figure: Memcached and NGINX under varying input load on Xeon and ThunderX; annotations: 5.2x, 5x]

11 MEMCACHED LATENCY DECOMPOSITION
[Figure: per-request latency on Xeon and ThunderX, split into receive and send paths across NIC RX, NIC IRQ, Kernel, Syscall, and User stages]
At 10% of max throughput
  Little user-space processing
  Network delay 2x slower than Xeon
At 90% of max throughput
  Queuing delay
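Why queuing delay shows up at 90% of max throughput but not at 10% can be illustrated with a textbook M/M/1 queue. This is only a sketch: the M/M/1 model, the function name, and the 10 usec service time are assumptions for illustration and do not come from the slides or from the measured systems.

```python
def mm1_mean_queue_wait(utilization, service_time_us):
    """Mean time a request waits in queue (excluding service) in an M/M/1
    system: Wq = rho / (1 - rho) * S, with rho the utilization and S the
    mean service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization) * service_time_us


SERVICE_TIME_US = 10.0  # hypothetical mean service time

for rho in (0.1, 0.9):
    wait = mm1_mean_queue_wait(rho, SERVICE_TIME_US)
    print(f"utilization {rho:.0%}: mean queue wait = {wait:.2f} us")
```

Under these assumptions the mean wait grows with rho / (1 - rho), so it is far larger at 90% utilization than at 10% even though the service time is unchanged.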

13 MEMCACHED VALUE SIZE
[Figure: Xeon vs. ThunderX]
Memory copy
Network processing and transmission
ThunderX is more sensitive

14 NUMBER OF MEMCACHED ITEMS
[Figure: Xeon vs. ThunderX]
Cache capacity
ThunderX is more sensitive

16 SCALABILITY
[Figure: Memcached and NGINX]
Interrupt handling
Load imbalance
Lock contention

18 CONTEXT SWITCHING
[Figures: Memcached on Xeon; Memcached on ThunderX]
Statically spawned threads vs. dynamically allocated cores
ThunderX is more sensitive
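A fixed thread count combined with a varying core allocation can be emulated on Linux by shrinking the CPU affinity mask of an already running process, so that its threads must share fewer cores and context-switch more. The sketch below is one possible setup, not necessarily how the authors ran their experiment; `restrict_to_cores` is a hypothetical helper, and it relies on `os.sched_getaffinity` and `os.sched_setaffinity`, which Python exposes only on some platforms such as Linux.

```python
import os


def restrict_to_cores(n_cores):
    """Limit the caller to the first n_cores CPUs it is currently allowed
    to run on. Threads created afterwards inherit the mask. Returns the
    chosen CPU set, or None where the affinity API is unavailable."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    allowed = sorted(os.sched_getaffinity(0))
    chosen = set(allowed[:max(1, n_cores)])
    os.sched_setaffinity(0, chosen)
    return chosen


chosen = restrict_to_cores(1)
if chosen is None:
    print("CPU affinity control is not available on this platform")
else:
    print(f"now restricted to CPUs: {sorted(chosen)}")
```

With more runnable threads than allowed cores, the OS scheduler has to time-slice them, which is the source of the context-switching overhead this slide refers to.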

20 HYPERTHREADING
Reduce the overhead of context switching
  Allocate two threads on two hyperthreads
Make better use of execution units
  Co-locate different applications
[Figures: Memcached & Nginx on the same hyperthreads; Memcached & Nginx on different hyperthreads]

21 IMPLICATIONS OF THESE STUDIES
Stack layers: Application, Resource Manager, Virtualization, OS, Hardware
Implications:
  Reduce queuing delays
  Improve elasticity
  Lock alternatives
  Load balance
  Reduce the overhead of virtualization
  Avoid context switching
  Make best use of SW isolation mechanisms
  Big vs. small cores
  Make best use of HW features
QUESTIONS?
