Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation


Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang

Datacenter

3

Monolithic Computer OS / Hypervisor 4

Can monolithic servers continue to meet datacenter needs? (Application, Hardware; Heterogeneity, Flexibility, Perf / $)

TPU GPU FPGA HBM ASIC NVM NVMe DNA Storage 6

Making new hardware work with existing servers is like fitting puzzle pieces 7

Tons of Issues Fitting bus standards Slots for new devices? Power for new devices? Changing bus? Software for new devices? 8

6 months of my life went to Firmware v0 SATA Firmware v1 Firmware v2 9

Can monolithic servers continue to meet datacenter needs? (Application, Hardware; Heterogeneity, Flexibility, Perf / $)

Application Elasticity and Perf / $ A whole VM/container has to run on one physical machine - Move current applications to make room for new ones [Figure: VM1 (App1, App2) and Container1 placed across cores and 8GB / 3GB / 4GB main memory on Node 1 and Node 2] 11

CPU/Memory Usage across Machines and across Jobs Source: Gu et al., Efficient Memory Disaggregation with Infiniswap, NSDI 17 Memory usage in datacenters is highly imbalanced 12

Resource Utilization in Production Clusters * Google Production Cluster Trace Data. https://github.com/google/cluster-data Unused Resource + Waiting/Killed Jobs Because of Physical-Node Constraints 13

No fine-grained failure handling 14

Can monolithic servers continue to meet datacenter needs? (Application, Hardware; Heterogeneity, Flexibility, Perf / $)

How to achieve better heterogeneity, flexibility, and perf/$? Go beyond the physical node boundary 16

Resource Disaggregation: Breaking monolithic servers into network-attached, independent hardware components 17

18

Application Flexibility Heterogeneity Hardware Hardware Perf / $ Network Distributed Memory 19

Disaggregated Datacenter End-to-End Solution Unmodified Application Performance Dist Sys Heterogeneity OS Flexibility Network Reliability Hardware $ Cost

Disaggregated Datacenter flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use Physically Disaggregated Resources Virtually Disaggregated Resources Disaggregated Operating System New Processor and Memory Architecture Dumb Remote Memory + Smart Control One-Sided Remote Memory / NVM Distributed Shared Persistent Memory (SoCC 17) Distributed Non-Volatile Memory Networking for Disaggregated Resources Kernel-Level RDMA Virtualization (SOSP 17) RDMA Network New Network Topology, Routing, Congestion-Ctrl InfiniBand

Key Challenge of Resource Disaggregation: Cost of Crossing Network
                  Bandwidth      Latency
Mem Bus           50-100 GB/s    ~50ns
PCIe 3.0 (x16)    16 GB/s        ~700ns
InfiniBand (EDR)  12.5 GB/s      500ns
InfiniBand (HDR)  25 GB/s        <500ns
Gen-Z             32-400 GB/s    <100ns
Network hardware is much faster than before, but the current network is still slower than the local memory bus 22

Network Requirements for Resource Disaggregation Low latency High bandwidth Scale RDMA Reliable 23

RDMA (Remote Direct Memory Access) Directly read/write remote memory Bypass kernel Memory zero copy Benefits: Low latency High throughput Low CPU utilization 24

Things have worked well in HPC Special hardware Few applications Cheaper developer 25

RDMA-Based Datacenter Applications Pilaf [ATC 13] HERD-RPC [ATC 16] Cell [ATC 16] FaRM [NSDI 14] Wukong [OSDI 16] FaSST [OSDI 16] HERD [SIGCOMM 14] Hotpot [SoCC 17] NAM-DB [VLDB 17] RSI [VLDB 16] DrTM [SOSP 15] APUS [SoCC 17] Octopus [ATC 17] DrTM+R [EuroSys 16] FaRM+Xact Mojim [SOSP 15] [ASPLOS 15] 26

Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 27

Native RDMA Kernel Bypassing 28

Abstraction Mismatch: RDMA is low-level, difficult to use, difficult to share, leading to fat applications and no resource sharing. Developers want the socket level: high-level, easy to use, resource sharing, isolation 29

Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 30

Native RDMA Kernel Bypassing 31

On-NIC SRAM stores and caches metadata: expensive, unscalable hardware. [Plot: requests/µs for Write-64B and Write-1K drop as total registered size grows from 1 MB to 1 GB] 32

Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 33

Fat applications No resource sharing Expensive, unscalable hardware Are we removing too much from kernel? 34

LITE: Local Indirection TiEr. High-level abstraction, resource sharing, protection, performance isolation 35

All problems in computer science can be solved by another level of indirection Butler Lampson 36

User Space Hardware RNIC 37

Simpler applications User Space LITE APIs Memory APIs RPC/Msg APIs Sync APIs Kernel Space LITE Hardware RNIC 38

Simpler applications User Space LITE APIs Memory APIs RPC/Msg APIs Sync APIs Kernel Space LITE Global lkey Global rkey Hardware RNIC Global lkey Global rkey Cheaper hardware Scalable performance 39

Implementing Remote memset Native RDMA LITE 40

All problems in computer science can be solved by another level of indirection... except for the problem of too many layers of indirection David Wheeler 41

Main Challenge: How to preserve the performance benefit of RDMA? 42

LITE Design Principles 1. Indirection only at local node 2. Avoid hardware-level indirection 3. Hide kernel-space crossing cost Great Performance and Scalability 43

LITE RDMA Scalability vs. Size of MR. [Plot: requests/µs for Write-64B, LITE_write-64B, Write-1K, LITE_write-1K over total size 1 MB to 1 GB] LITE scales much better than native RDMA wrt MR size and numbers 44

LITE Application Effort
Application        LOC    LOC using LITE   Student Days
LITE-Log           330    36               1
LITE-MapReduce     600*   49               4
LITE-Graph         1400   20               7
LITE-Kernel-DSM    3000   45               26
Simple to use Needs no expert knowledge Flexible, powerful abstraction Easy to achieve optimized performance * LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition 45

MapReduce Results LITE-MapReduce adapted from Phoenix [1] [Plot: runtime (sec) of Hadoop, Phoenix, and LITE on 2-node, 4-node, and 8-node clusters] LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x [1]: Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA 07) 46

LITE Summary Virtualizes RDMA into a flexible, easy-to-use abstraction Preserves RDMA's performance benefits Indirection does not always degrade performance! Division across user space, kernel, and hardware 47

Disaggregated Datacenter flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use Physically Disaggregated Resources Virtually Disaggregated Resources Disaggregated Operating System New Processor and Memory Architecture Dumb Remote Memory + Smart Control One-Sided Remote Memory / NVM Distributed Shared Persistent Memory (SoCC 17) Distributed Non-Volatile Memory Networking for Disaggregated Resources Kernel-Level RDMA Virtualization (SOSP 17) RDMA Network New Network Topology, Routing, Congestion-Ctrl InfiniBand

Traditional OSes Manage a single node and all hardware resources in it Bad for hardware heterogeneity and hotplug Do not handle component failure 49

50

Lego: the First Disaggregated OS 51

When hardware is disaggregated, the OS should be also! 52

OS 53

54

Micro-OS Service Manages the hardware resources of a component Virtualizes the hardware Communicates with other micro-OS services Runs in the hardware controller (kernel space) Only processors have user space 55

Lego Heterogeneity Architecture Flexibility Elasticity Resource util Failure isolation 56

Many Challenges Handling component failure Managing distributed, heterogeneous resources Fitting micro-OS services in hardware controllers Implementing Lego on current servers 57

Lego Implementation Built from scratch, >200K LOC and growing Supports Linux ABIs and runs unmodified binaries Three micro-OS services: processor, memory, storage Global resource manager Emulates disaggregated hardware with regular servers 58

Lego Summary Resource disaggregation calls for a new system Lego: a new OS designed and built from scratch for datacenter resource disaggregation Splits the OS into distributed micro-OS services running at each device Many challenges and many potentials 59

Disaggregated Datacenter flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use Physically Disaggregated Resources Virtually Disaggregated Resources Disaggregated Operating System New Processor and Memory Architecture Dumb Remote Memory + Smart Control One-Sided Remote Memory / NVM Distributed Shared Persistent Memory (SoCC 17) Distributed Non-Volatile Memory Networking for Disaggregated Resources Kernel-Level RDMA Virtualization (SOSP 17) RDMA Network New Network Topology, Routing, Congestion-Ctrl InfiniBand

Physical Resource Disaggregation Great support for heterogeneity Very flexible in resource management But needs hardware, network, and OS changes 61

Is there any less disruptive way to achieve better resource utilization and elasticity?

Virtually Disaggregated Datacenter Use resources on remote (distributed) machines Monolithic Server Monolithic Server Monolithic Server CPU CPU CPU CPU CPU CPU CPU CPU CPU Main Memory Main Memory Main Memory Storage Storage Storage 63

Using Remote/Distributed Resources A popular idea in the '90s: Remote memory/paging/swap Network block device Distributed shared memory (DSM) No production-scale adoption: Cost of network communication Coherence traffic 64

65

Remote/Distributed Memory in Modern Times New and heterogeneous applications Large parallelism New computation and memory requirements New programming models Network is 10x-100x faster InfiniBand: 200Gbps, <500ns GenZ: 32-400GB/s, <100ns New types of memory NVM, HBM 66

Recent New Attempts Distributed Shared Memory Grappa Network swapping InfiniSwap Non-coherent distributed memory VMware 67

Our View Design new remote/distributed memory systems that leverage new hardware and network properties Design for modern datacenter applications Trade off performance and programmability First project: distributed Non-Volatile Memory (Persistent Memory) 68

DSM → Distributed Shared Persistent Memory (DSPM): a significant step towards using PM in datacenters 69

DSPM Native memory load/store interface Local or remote (transparent) Pointers and in-memory data structures Supports memory read/write sharing 70

DSM → Distributed Shared Persistent Memory (DSPM): a significant step towards using PM in datacenters 71

DSPM: a (Distributed) One-Layer Memory Approach Memory load/store interface Local or remote (transparent) Pointers and in-memory data structures Supports memory read/write sharing Benefits of both memory and (distributed) storage: No redundant layers Persistent naming No data marshaling/unmarshaling Data durability and reliability 72

Hotpot: A Kernel-Level RDMA-Based DSPM System Easy to use Native memory interface Fast, scalable Flexible consistency levels Data durability & reliability 73

Hotpot Architecture 74-79

MongoDB Results Modified MongoDB with ~120 LOC, using MRMW mode Compared with tmpfs, PMFS, Mojim, and Octopus using YCSB 10/9/17 80

Disaggregated Datacenter flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use Physically Disaggregated Resources Virtually Disaggregated Resources Disaggregated Operating System New Processor and Memory Architecture Dumb Remote Memory + Smart Control One-Sided Remote Memory / NVM Distributed Shared Persistent Memory (SoCC 17) Distributed Non-Volatile Memory Networking for Disaggregated Resources Kernel-Level RDMA Virtualization (SOSP 17) RDMA Network New Network Topology, Routing, Congestion-Ctrl InfiniBand

Conclusion New hardware and software trends point to resource disaggregation WukLab is building an end-to-end solution for disaggregated datacenter Opens up new research opportunities in hardware, software, networking, security, and programming language 82

Thank you Questions? wuklab.io