Farewell to Servers: Resource Disaggregation

1 Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang

3 Monolithic Computer: OS / Hypervisor

4 Can monolithic servers continue to meet datacenter needs?
- Heterogeneity
- Flexibility
- Perf / $
(Diagram labels: Application, Hardware)

5 TPU, GPU, FPGA, HBM, ASIC, NVM, NVMe, DNA Storage

6 Making new hardware work with existing servers is like fitting puzzle pieces

7 Can monolithic servers continue to meet datacenter needs?
- Heterogeneity
- Flexibility
- Perf / $
(Diagram labels: Application, Hardware)

8 Poor Hardware Elasticity
- Hard to change hardware components
  - Add (hotplug), remove, reconfigure, restart
- No fine-grained failure handling
  - The failure of one device can crash a whole machine

9 Can monolithic servers continue to meet datacenter needs?
- Heterogeneity
- Flexibility
- Perf / $
(Diagram labels: Application, Hardware)

10 Poor Resource Utilization
- Whole VM/container has to run on one physical machine
  - Move current applications to make room for new ones
(Figure: CPU and memory on Server 1 and Server 2, comparing available space with the space a job requires; the leftover resources are wasted)
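The packing problem on this slide can be illustrated with a small sketch. The server sizes and the job's demands below are made-up numbers, not taken from the talk, and the model ignores placement details such as which processor component supplies the cores. It only shows that a job can fit in the cluster's aggregate free resources while fitting on no single machine.

```python
# Hypothetical free capacity per server: (CPU cores, memory in GB).
# Illustrative numbers only; they do not come from the slides.
free = {"server1": (6, 2), "server2": (1, 12)}
job = (4, 8)  # the job needs 4 cores and 8 GB

def fits_on_one_server(job, free):
    """Monolithic servers: CPU and memory must come from the same machine."""
    return any(cpu >= job[0] and mem >= job[1] for cpu, mem in free.values())

def fits_when_disaggregated(job, free):
    """Disaggregated pools: CPU and memory are drawn from cluster-wide totals."""
    total_cpu = sum(cpu for cpu, _ in free.values())
    total_mem = sum(mem for _, mem in free.values())
    return total_cpu >= job[0] and total_mem >= job[1]

print(fits_on_one_server(job, free))       # False
print(fits_when_disaggregated(job, free))  # True
```

With monolithic servers the job waits or another job is moved, even though 7 cores and 14 GB are free in total.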

11 Resource Utilization in Production Clusters
- Unused Resource + Waiting/Killed Jobs Because of Physical-Node Constraints
* Google Production Cluster Trace Data.
* Alibaba Production Cluster Trace Data.

12 Can monolithic servers continue to meet datacenter needs?
- Heterogeneity
- Flexibility
- Perf / $
(Diagram labels: Application, Hardware)

13 How to achieve better heterogeneity, flexibility, and perf/$?
- Go beyond physical node boundary

14 Resource Disaggregation: Breaking monolithic servers into network-attached, independent hardware components

16 (Diagram: Application, Hardware, and Network, annotated with Flexibility, Heterogeneity, and Perf / $)

17 Why Possible Now?
- Network is faster
  - InfiniBand (200Gbps, 600ns)
  - Optical Fabric (400Gbps, 100ns)
- More processing power at device
  - SmartNIC, SmartSSD, PIM
- Network interface closer to device
  - Omni-Path, Innova-2
- Berkeley Firebox, HP The Machine, Intel Rack-Scale System, IBM Composable System

18 Disaggregated Datacenter: End-to-End Solution
- Stack: Unmodified Application, Dist Sys, OS, Network, Hardware
- Goals: Performance, Heterogeneity, Flexibility, Reliability, $ Cost

19 Disaggregated Datacenter: End-to-End Solution
- Physically Disaggregated Resources
  - Disaggregated Operating System (OSDI '18)
  - New Processor and Memory Architecture
- Networking for Disaggregated Resources
  - Kernel-Level RDMA Virtualization (SOSP '17)
  - RDMA Network

21 Can Existing Kernels Fit?
- Monolithic/Micro-kernel (e.g., Linux, L4): a monolithic kernel or microkernel on a monolithic server with CPU, shared main memory, disk, and NIC
- Multikernel (e.g., Barrelfish, Helios, fos): separate kernels on the core, GPU, and P-NIC within a server, with network across servers

22 Existing Kernels Don't Fit
- Access remote resources over the network
- Distributed resource mgmt
- Fine-grained failure handling

23 When hardware is disaggregated, the OS should be also

24 OS: Process Mgmt, Virtual Memory System, File & Storage System, Network

25 (Diagram: Process Mgmt, Virtual Memory System, and two File & Storage System instances, each paired with its own Network)

26 The Splitkernel Architecture
- Split OS functions into monitors
- Run each monitor at h/w device
- Network messaging across non-coherent components
- Distributed resource mgmt and failure handling
(Diagram: Process Monitor at Processor (CPU), GPU Monitor at Processor (GPU), XPU Manager at New h/w (XPU), Memory Monitor at Memory, NVM Monitor at NVM, HDD Monitor at Hard Disk, SSD Monitor at SSD)
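As a rough illustration of "monitors that share nothing and talk only through messages", here is a minimal sketch. The class names, the message format, and the use of an in-process queue in place of the network are all assumptions made for this example; they are not LegoOS code.

```python
import queue

class Monitor:
    """Hypothetical monitor that manages one hardware component's local state."""
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()  # stands in for the network; no shared memory

    def send(self, other, msg):
        other.inbox.put((self.name, msg))

class MemoryMonitor(Monitor):
    """Owns the free-space accounting of one memory component."""
    def __init__(self, name, capacity_bytes):
        super().__init__(name)
        self.free = capacity_bytes

    def handle_one(self, monitors):
        # Process a single request and reply to the sender by message.
        sender, (op, size) = self.inbox.get()
        if op == "alloc":
            ok = size <= self.free
            if ok:
                self.free -= size
            self.send(monitors[sender], ("alloc_reply", ok))

proc = Monitor("proc0")
mem = MemoryMonitor("mem0", capacity_bytes=1 << 20)
monitors = {"proc0": proc, "mem0": mem}

proc.send(mem, ("alloc", 4096))   # processor side asks for memory
mem.handle_one(monitors)          # memory monitor decides using only local state
reply = proc.inbox.get()
print(reply)                      # ('mem0', ('alloc_reply', True))
```

Each monitor keeps its own state, so a failure or replacement of one component does not require touching the others' data structures.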

27 LegoOS: The First Disaggregated OS
(Diagram: Processor, Memory, Storage, and NVM components)

28 How Should LegoOS Appear to Users?
- As a set of hardware devices?
- As a giant machine?
- Our answer: as a set of virtual Nodes (vnodes)
  - Similar semantics to virtual machines
  - Unique vid, vip, storage mount point
  - Can run on multiple processor, memory, and storage components

29 Abstraction - vnode
- One vnode can run on multiple hardware components
- One hardware component can run multiple vnodes
(Diagram: vnode1 and vnode2 mapped onto the Process, GPU, XPU, Memory, NVM, HDD, and SSD monitors and their devices, connected by network messaging across non-coherent components)

30 Abstraction
- Appear as vnodes to users
- Linux ABI compatible
  - Support unmodified Linux system call interface (common ones)
  - A level of indirection to translate Linux interface to LegoOS interface

31 LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitor with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication

32 Separate Processor and Memory
(Diagram: Processor with CPUs, caches ($), last-level cache, TLB, and MMU; DRAM holding the page table (PT))

33 Separate Processor and Memory
- Separate and move hardware units to memory component
(Diagram: Processor with CPUs, caches ($), and last-level cache, connected over the Network to a Memory component with TLB, MMU, DRAM, and PT)

34 Separate Processor and Memory
- Separate and move hardware units to memory component
(Diagram: Virtual Memory, Processor with CPUs, caches ($), and last-level cache, Network, and a Memory component with TLB, MMU, DRAM, and PT)

35 Separate Processor and Memory
- Separate and move virtual memory system to memory component
(Diagram: Processor with CPUs, caches ($), and last-level cache; Network; Memory component with TLB, MMU, Virtual Memory, DRAM, and PT)

36 Separate Processor and Memory
- Processor components only see virtual memory addresses
  - All levels of cache are virtual cache
- Memory components manage virtual and physical memory
(Diagram: virtual addresses pass from the CPUs through the caches and last-level cache, over the Network, to a Memory component with TLB, MMU, Virtual Memory, DRAM, and PT)

37 Challenge: network is 2x-4x slower than memory bus

38 Add Extended Cache at Processor
(Diagram: Processor with CPUs, caches ($), and last-level cache; Network; Memory component with TLB, MMU, Virtual Memory, DRAM, and PT)

39 Add Extended Cache at Processor
- Add small DRAM/HBM at processor
- Use it as Extended Cache, or ExCache
  - Software and hardware co-managed
  - Inclusive
  - Virtual cache
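The ExCache idea can be sketched as a virtually indexed cache in front of a remote memory component. This is a simplified model under stated assumptions: the 4KB line size comes from the implementation slide later in the deck, while the class names, the LRU eviction policy, and the read-only behavior (no write-back, no inclusiveness with the CPU caches) are choices made only for this example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # the implementation slide uses a 4KB page as the cache line

class MemoryComponent:
    """Stand-in for a remote memory component; it alone knows physical placement."""
    def __init__(self):
        self.pages = {}    # (pid, virtual page number) -> bytearray
        self.fetches = 0   # counts simulated network round trips

    def fetch(self, pid, vpn):
        self.fetches += 1
        page = self.pages.setdefault((pid, vpn), bytearray(PAGE_SIZE))
        return bytes(page)  # copy, as if sent over the network

class ExCache:
    """Virtually indexed cache at the processor. LRU eviction is an assumption."""
    def __init__(self, memory, num_lines):
        self.memory = memory
        self.num_lines = num_lines
        self.lines = OrderedDict()  # (pid, vpn) -> cached page contents

    def read(self, pid, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        key = (pid, vpn)
        if key in self.lines:                    # hit path (hardware in the talk)
            self.lines.move_to_end(key)
        else:                                    # miss path (software-managed)
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)   # evict least recently used line
            self.lines[key] = self.memory.fetch(pid, vpn)
        return self.lines[key][offset]

memory = MemoryComponent()
cache = ExCache(memory, num_lines=2)
cache.read(pid=1, vaddr=0)      # miss: page 0 fetched from the memory component
cache.read(pid=1, vaddr=8)      # hit: same 4KB line
cache.read(pid=1, vaddr=4096)   # miss: page 1
cache.read(pid=1, vaddr=8192)   # miss: evicts page 0
print(memory.fetches)           # 3
```

Note that the processor side never translates addresses: lookups use only the process id and virtual address, which matches the slide's point that processor components see virtual addresses only.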

40 LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitor with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication

41 Distributed Resource Management
- Global Resource Mgmt
  - Global Process Manager (GPM)
  - Global Memory Manager (GMM)
  - Global Storage Manager (GSM)
1. Coarse-grain allocation
2. Load-balancing
3. Failure handling
(Diagram: Process Monitor at Processor (CPU) and GPU Monitor at Processor (GPU); Memory, NVM, HDD, and SSD Monitors at Memory, NVM, Hard Disk, and SSD; network messaging across non-coherent components)
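A minimal sketch of two-level management, assuming the global manager hands out large regions and each memory monitor allocates within the regions it was given. The region size, the "least loaded first" policy, and all class and method names are hypothetical; the slides state only that the global level does coarse-grain allocation and load balancing.

```python
REGION_SIZE = 1 << 30  # hypothetical coarse-grain unit (1 GiB); not from the talk

class MemoryMonitor:
    """Second level: fine-grained allocation inside regions this component owns."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.assigned = 0  # bytes handed to this monitor by the global manager
        self.used = 0      # bytes actually allocated to applications

    def alloc(self, size):
        if self.used + size > self.assigned:
            return False
        self.used += size
        return True

class GlobalMemoryManager:
    """First level: coarse-grain region assignment with simple load balancing."""
    def __init__(self, monitors):
        self.monitors = monitors

    def assign_region(self):
        candidates = [m for m in self.monitors
                      if m.assigned + REGION_SIZE <= m.capacity]
        if not candidates:
            return None
        # Pick the component with the lowest assigned fraction of its capacity.
        target = min(candidates, key=lambda m: m.assigned / m.capacity)
        target.assigned += REGION_SIZE
        return target

m0 = MemoryMonitor("mem0", capacity=4 * REGION_SIZE)
m1 = MemoryMonitor("mem1", capacity=2 * REGION_SIZE)
gmm = GlobalMemoryManager([m0, m1])
order = [gmm.assign_region().name for _ in range(3)]
print(order)  # ['mem0', 'mem1', 'mem0']
```

The global manager is consulted only once per region, so the common fine-grained allocations stay local to a monitor.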

42 Implementation and Emulation
- Processor
  - Reserve DRAM as ExCache (4KB page as cache line)
  - h/w only on hit path, s/w managed miss path
  - Indirection layer to store states for 113 Linux syscalls
- Memory
  - Limit number of cores, kernel-space only
- Storage/Global Resource Monitors
  - Implemented as kernel module on Linux
- Network
  - RDMA RPC stack based on LITE [SOSP '17]
(Diagram: Process Monitor on a processor with CPUs, LLC, DRAM used as ExCache, and disk; Memory Monitor and a Linux kernel module on machines with CPUs, LLC, DRAM, and disk acting as memory and storage; all connected by an RDMA network)

43 Performance Evaluation
- Unmodified TensorFlow, running CIFAR-10
  - Working set: 0.9G
  - 4 threads
- Systems in comparison
  - Baseline: Linux with unlimited memory
  - Linux swap to SSD, and ramdisk
  - InfiniSwap [NSDI '17]
- LegoOS Config: 1P, 1M, 1S
- Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS
- To gain better resource packing, elasticity, and fault tolerance!
(Figure: Slowdown vs. ExCache/Memory Size (MB) for Linux swap SSD, Linux swap ramdisk, InfiniSwap, and LegoOS)

44 LegoOS Summary
- Resource disaggregation calls for new system
- LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
- Split OS into distributed micro-OS services, running at device
- Many challenges and many potentials

45 Disaggregated Datacenter: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
- Physically Disaggregated Resources
  - Disaggregated Operating System (OSDI '18)
  - New Processor and Memory Architecture
- Networking for Disaggregated Resources
  - Kernel-Level RDMA Virtualization (SOSP '17)
  - RDMA Network

46 Network Requirements for Resource Disaggregation
- Low latency
- High bandwidth
- Scale
- Reliable
(Diagram label: RDMA)

47 RDMA (Remote Direct Memory Access)
- Directly read/write remote memory
- Bypass kernel
- Memory zero copy
- Benefits:
  - Low latency
  - High throughput
  - Low CPU utilization
(Diagram: two nodes, each with CPU, user space, kernel, memory, and NIC, comparing Socket over Ethernet with RDMA)

48 Things have worked well in HPC
- Special hardware
- Few applications
- Cheaper developer

49 RDMA-Based Datacenter Applications
Pilaf [ATC '13], HERD-RPC [ATC '16], Cell [ATC '16], FaRM [NSDI '14], Wukong [OSDI '16], FaSST [OSDI '16], HERD [SIGCOMM '14], Hotpot [SoCC '17], NAM-DB [VLDB '17], RSI [VLDB '16], DrTM [SOSP '15], APUS [SoCC '17], Octopus [ATC '17], DrTM+R [EuroSys '16], FaRM+Xact [SOSP '15], Mojim [ASPLOS '15]

50 Things have worked well in HPC
- Special hardware
- Few applications
- Cheaper developer
What about datacenters?
- Commodity, cheaper hardware
- Many (changing) applications
- Resource sharing and isolation

51 Native RDMA
- User Space: user-level RDMA app (send, recv; node, lkey, rkey, addr) with a library doing Conn Mgmt (connections, queues, keys) and Mem Mgmt (memory space)
- Kernel Space: OS kernel bypassing
- Hardware (RNIC): permission check, address mapping, cached PTEs, lkey 1 to lkey n, rkey 1 to rkey n

52 Abstraction Mismatch
- RDMA: low-level, difficult to use, difficult to share
- Socket: developers want high-level, easy to use, resource share, isolation
- Userspace: fat applications, no resource sharing
(Diagram label: Hardware)

53 Things have worked well in HPC
- Special hardware
- Few applications
- Cheaper developer
What about datacenters?
- Commodity, cheaper hardware
- Many (changing) applications
- Resource sharing and isolation

54 Native RDMA: Userspace and Hardware
- User Space: user-level RDMA app (send, recv; node, lkey, rkey, addr) with a library doing Conn Mgmt (connections, queues, keys) and Mem Mgmt (memory space)
- Kernel Space: OS kernel bypassing
- Hardware (RNIC): permission check, address mapping, cached PTEs, lkey 1 to lkey n, rkey 1 to rkey n

55 On-NIC SRAM stores and caches metadata
- Hardware: expensive, unscalable hardware
(Figure: Requests/us vs. Total Size (MB) for Write-64B and Write-1K)

56 Things have worked well in HPC
- Special hardware
- Few applications
- Cheaper developer
What about datacenters?
- Commodity, cheaper hardware
- Many (changing) applications
- Resource sharing and isolation

57 Are we removing too much from kernel?
- Fat applications
- No resource sharing
- Expensive, unscalable hardware

58 LITE - Local Indirection TiEr
- High-level abstraction
- Resource sharing
- Performance isolation
- Protection
(Diagram label: Without Kernel)

59 (Diagram)
- User Space: user-level RDMA app (send, recv; node, lkey, rkey, addr) with a library doing Conn Mgmt (connections, queues, keys) and Mem Mgmt (memory space)
- Hardware (RNIC): permission check, address mapping, cached PTEs, lkey 1 to lkey n, rkey 1 to rkey n

60 Simpler applications
- User Space: user-level RDMA app; Conn Mgmt, Mem Mgmt, send, recv, and node, lkey, rkey, addr sit behind LITE APIs (Memory APIs, RPC/Msg APIs, Sync APIs)
- Kernel Space: LITE holds connections, queues, keys, and memory space
- Hardware (RNIC): permission check, address mapping, cached PTEs, lkey 1 to lkey n, rkey 1 to rkey n

61 Simpler applications
- User Space: user-level RDMA app; Conn Mgmt, Mem Mgmt, send, recv, and node, lkey, rkey, addr sit behind LITE APIs (Memory APIs, RPC/Msg APIs, Sync APIs)
- Kernel Space: LITE holds connections, queues, keys, and memory space, and performs permission check and address mapping with a global lkey and global rkey
- Hardware (RNIC): global lkey, global rkey
- Cheaper hardware
- Scalable performance

62 Implementing Remote memset: Native RDMA vs. LITE

63 Main Challenge: How to preserve the performance benefit of RDMA?

64 LITE Design Principles
1. Indirection only at local node
2. Avoid hardware-level indirection
3. Hide kernel-space crossing cost
"... except for the problem of too many layers of indirection" (David Wheeler)
Great Performance and Scalability
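The first two principles can be sketched as a per-node software table: the application holds an opaque handle, and the permission check and address mapping happen locally in software rather than through per-region keys on the NIC. Everything in this sketch is hypothetical, including the class name, the fields, and the method names; it is not LITE's actual API.

```python
class LocalIndirectionTable:
    """Hypothetical per-node table; names and fields are illustrative only."""
    def __init__(self):
        self._entries = {}       # handle -> (node, base_addr, length, writable)
        self._next_handle = 1

    def register(self, node, base_addr, length, writable):
        """Record a remote region and return an opaque handle for it."""
        handle = self._next_handle
        self._next_handle += 1
        self._entries[handle] = (node, base_addr, length, writable)
        return handle

    def translate_write(self, handle, offset, size):
        """Check permission and bounds locally, then return (node, remote address)."""
        if handle not in self._entries:
            raise KeyError("unknown handle")
        node, base_addr, length, writable = self._entries[handle]
        if not writable:
            raise PermissionError("region is read-only")
        if offset < 0 or size < 0 or offset + size > length:
            raise ValueError("access out of bounds")
        return node, base_addr + offset

table = LocalIndirectionTable()
h = table.register(node=3, base_addr=0x1000, length=4096, writable=True)
target = table.translate_write(h, offset=64, size=128)
print(target)  # (3, 4160)
```

Because the check happens on the requesting node, the remote side does not need to keep or look up per-region metadata for each request in this model.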

65 LITE RDMA: Size of MR Scalability
- LITE scales much better than native RDMA wrt MR size and numbers
(Figure: Requests/us vs. Total Size (MB), from 1 to 1024, for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K)

66 LITE Application Effort
- Table columns: Application, LOC, LOC using LITE, Student Days
- Applications: LITE-Log, LITE-MapReduce, LITE-Graph, LITE-Kernel-DSM
- LITE-MapReduce: 600* LOC, 49 LOC using LITE, 4 student days
- Simple to use
- Needs no expert knowledge
- Flexible, powerful abstraction
- Easy to achieve optimized performance
* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition

67 MapReduce Results
- LITE-MapReduce adapted from Phoenix [1]
- LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x
(Figure: Runtime (sec) of Hadoop, Phoenix, and LITE for Phoenix, 2-node, 4-node, and 8-node)
[1]: Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA '07)

68 LITE Summary
- Virtualizes RDMA into flexible, easy-to-use abstraction
- Preserves RDMA's performance benefits
- Indirection does not always degrade performance!
- Division across user space, kernel, and hardware

69 Disaggregated Datacenter: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
- Physically Disaggregated Resources
  - Disaggregated Operating System (OSDI '18)
  - New Processor and Memory Architecture
- Networking for Disaggregated Resources
  - Kernel-Level RDMA Virtualization (SOSP '17)
  - RDMA Network

70 Disaggregated Datacenter: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
- Physically Disaggregated Resources
  - Disaggregated OS (OSDI '18)
  - New Processor and Memory Architecture
- Virtually Disaggregated Resources
- Disaggregated Persistent Memory
  - Network-Attached NVM
  - Distributed Shared Persistent Memory (SoCC '17)
  - Distributed Non-Volatile Memory
- Networking for Disaggregated Resources
  - Kernel-Level RDMA Virtualization (SOSP '17)
  - RDMA Network
  - New Network Topology, Routing, Congestion-Ctrl
  - InfiniBand

71 Conclusion
- New hardware and software trends point to resource disaggregation
- My research pioneers building an end-to-end solution for the disaggregated datacenter
- Opens up new research opportunities in hardware, software, networking, security, and programming languages

72 Thank you Questions? wuklab.io
