Accelerator-centric operating systems


Accelerator-centric operating systems: Rethinking the role of CPUs in modern computers. Mark Silberstein, EE, Technion

System design challenge: Programmability and Performance 2

System design challenge: Programmability and Performance. Hardware architectures, systems software, developers 3

Computer hardware circa ~2000: CPU, network adapter, Graphics Processing Unit (GPU), storage controller. (Box size = transistor count.) 4

Systems software stack circa ~2000 Applications OS I/O devices 5

Computer hardware: circa ~2015 Network I/O accelerator GPU parallel accelerator Storage I/O accelerator Accelerators for encryption, media, signal processing... 6

Central Processing Units (CPUs) are no longer central: network I/O accelerator, GPU parallel accelerator, storage I/O accelerator, accelerators for encryption, media, signal processing... [Figure axes: Power, Performance, Programmability] 7

Systems software stack circa ~2015: Accelerated applications, OS, I/O accelerators, manycore processors, FPGAs, DSPs, hybrid CPU-GPUs, GPUs 8

Software-hardware gap is widening: accelerated applications sit on inadequate OS abstractions and management mechanisms for I/O accelerators, manycore processors, FPGAs, DSPs, and GPUs 9

THE problem: CPU-centric software architecture (network, storage, and GPU all attach through the CPU) 10

Breaking the CPU-centric system design: OS services run on the GPU, the network, and storage, alongside the CPU operating system. The hardware is here; we need the OS support 11

Accelerator-centric OS architecture: Applications; Accelerator I/O services (network, files); OS with accelerator abstractions and mechanisms; Accelerator applications; Accelerator OS support (interprocessor I/O, file system, networking APIs, memory management); Hardware support for OS: I/O accelerators, manycore processors, FPGAs, DSPs, GPUs 12

This talk: Accelerator I/O services (network, files); OS accelerator abstractions and mechanisms; Accelerator applications; Accelerator OS support (interprocessor I/O, file system, networking APIs) [OSDI'14, CACM'14, ASPLOS'13, TOCS'14]; Hardware support for OS: storage, network, I/O accelerators, manycore processors, FPGAs, DSPs, GPUs 13

GPU 101 and motivation GPUnet: Network Stack for GPUs GPUfs: File access support for GPUs Recap: Accelerator-centric OS architecture 14

GPU 101: Hybrid CPU-GPU architecture 15

Co-processor model: the CPU runs the computation, offloads a GPU kernel, then resumes the computation when the kernel completes (a minimal CUDA sketch follows) 16-19
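
For concreteness, a minimal CUDA sketch of this co-processor model; the kernel body and sizes are illustrative, not from the talk:

    #include <cuda_runtime.h>

    __global__ void compute(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
    }

    void run_on_gpu(const float *host_in, float *host_out, int n) {
        float *dev_in, *dev_out;
        cudaMalloc(&dev_in,  n * sizeof(float));
        cudaMalloc(&dev_out, n * sizeof(float));
        // CPU computation stops here while data and control move to the GPU...
        cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
        compute<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
        // ...and resumes only once the results are copied back.
        cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_in);
        cudaFree(dev_out);
    }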

GPUs make a difference... The top-10 fastest supercomputers use GPUs; GPUs enable order-of-magnitude speedups in physics, vision, HCI, meteorology, graph algorithms, deep nets, bioinformatics, linear algebra, finance 20

GPUs make a difference, but why only in HPC? Beyond physics, vision, HCI, meteorology, graph algorithms, deep nets, bioinformatics, linear algebra, and finance... what about web servers, network services, antivirus, file search? 21

Programming complexity exposed Example: GPU-accelerated server 22

CPU server: NIC, recv(), compute(), send() 23
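
In plain C over POSIX sockets, this "theory" server is just a loop; compute() stands in for the application logic:

    #include <sys/socket.h>
    #include <sys/types.h>

    void compute(char *buf, ssize_t n);  // application logic

    void serve(int conn_fd) {
        char buf[4096];
        ssize_t n;
        while ((n = recv(conn_fd, buf, sizeof(buf), 0)) > 0) {
            compute(buf, n);             // process the request in place
            send(conn_fd, buf, n, 0);    // reply on the same connection
        }
    }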

Inside a GPU-accelerated server: NIC, CPU, and GPU on the PCIe bus. In theory: recv(), GPU_compute(), send() 24

Inside a GPU-accelerated server, in practice the CPU must run: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); 25-29

Aggressive pipelining inside a GPU-accelerated server: buffering, asynchrony, multithreading. The entire recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); sequence is replicated across overlapping pipeline stages 30

This code is unnecessary: it exists only so the CPU can manage the GPU (a sketch of what it typically looks like follows) 31
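
A hedged sketch of that CPU-side management code using CUDA streams; batch_requests() and dispatch_replies() are hypothetical helpers bundling the slide's recv()/batch() and dispatch()/send() steps:

    #include <cuda_runtime.h>

    __global__ void GPU_compute(char *in, char *out, int n);  // application kernel
    int  batch_requests(char *buf);           // hypothetical: recv() + batch() + optimize()
    void dispatch_replies(char *buf, int n);  // hypothetical: dispatch() + send()

    void serve(char *host_in, char *host_out, char *dev_in, char *dev_out) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        for (;;) {
            int n = batch_requests(host_in);
            cudaMemcpyAsync(dev_in, host_in, n, cudaMemcpyHostToDevice, s);    // transfer();
            GPU_compute<<<28, 256, 0, s>>>(dev_in, dev_out, n);                // balance(); GPU_compute();
            cudaMemcpyAsync(host_out, dev_out, n, cudaMemcpyDeviceToHost, s);  // transfer();
            cudaStreamSynchronize(s);                                          // cleanup();
            dispatch_replies(host_out, n);
        }
    }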

GPUs are not co-processors GPUs are peer-processors They need I/O abstractions 32

GPUnet: socket API for GPUs. Application view: on node0.technion.ac.il, a GPU-native server calls socket(AF_INET, SOCK_STREAM); listen(":2340"). Across the network, a CPU-native client and a GPU-native client each call socket(AF_INET, SOCK_STREAM); connect("node0:2340"); 33
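
A sketch of this application view as GPU-side code. The call names follow the slide; the prototypes and constants below are hypothetical stand-ins for GPUnet's actual GPU-side API, whose exact signatures differ:

    // Hypothetical GPU-side prototypes standing in for GPUnet's API.
    __device__ int socket(int domain, int type);
    __device__ int listen(int sock, int port);
    __device__ int accept(int sock);
    __device__ int connect(int sock, const char *addr);
    #define AF_INET     2   // illustrative constants
    #define SOCK_STREAM 1

    // GPU-native server on node0.technion.ac.il
    __global__ void gpu_server(void) {
        int s = socket(AF_INET, SOCK_STREAM);
        listen(s, 2340);           // listen on :2340
        int c = accept(s);         // then recv()/send() on c from inside the kernel
        (void)c;
    }

    // GPU-native client connecting straight from the GPU
    __global__ void gpu_client(void) {
        int s = socket(AF_INET, SOCK_STREAM);
        connect(s, "node0:2340");
        (void)s;
    }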

GPU-accelerated server with GPUnet: CPU not involved. NIC and GPU on the PCIe bus: recv(), GPU_compute(), send() 34

GPU-accelerated server with GPUnet: the GPU and NIC talk directly over the PCIe bus, running recv(), GPU_compute(), send() 35

GPU-accelerated server with GPUnet: no request batching; many concurrent recv(), GPU_compute(), send() loops run on the GPU, giving transparent pipelining 36

GPU-accelerated server with GPUnet: the same concurrent recv(), GPU_compute(), send() loops with seamless buffer management (see the per-threadblock sketch below) 37
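
One plausible reason no batching is needed: each GPU threadblock can run its own service loop on its own connection, so concurrency and pipelining fall out of the GPU's execution model. A sketch, with grecv()/gsend()/gpu_compute() as hypothetical stand-ins for the GPU-side socket calls and application logic:

    __device__ int  grecv(int sock, char *buf, int len);
    __device__ int  gsend(int sock, const char *buf, int len);
    __device__ void gpu_compute(char *buf, int len);  // application logic

    // One connection per threadblock: each block serves requests
    // independently, so many requests are in flight at once.
    __global__ void gpu_server_loop(const int *conns, char *bufs, int buf_sz) {
        int   conn = conns[blockIdx.x];
        char *buf  = bufs + blockIdx.x * buf_sz;
        for (;;) {
            int n = grecv(conn, buf, buf_sz);  // blocks only this threadblock
            gpu_compute(buf, n);               // threads in the block cooperate
            gsend(conn, buf, n);
        }
    }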

GPUnet design, top (simplicity) to bottom (performance): GPU socket API with reliable in-order streaming, over a GPU reliable channel, over RDMA transports (InfiniBand) and non-RDMA transports (UNIX domain sockets, TCP/IP), down to the NIC 38

GPUfs: file access for GPUs. Application view: GPU1, GPU2, GPU3 call open(shared_file), mmap(), write() through a POSIX-like API; GPUfs provides a system-wide shared namespace backed by the host file system and persistent storage 39
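
A sketch of this application view from GPU code. GPUfs's published API uses g-prefixed calls (gopen, gwrite, gclose, ...), but the exact prototypes below are assumptions:

    #include <stddef.h>

    // Hypothetical prototypes standing in for GPUfs's g-prefixed API.
    __device__ int gopen(const char *path, int flags);
    __device__ int gwrite(int fd, size_t offset, size_t size, char *buf);
    __device__ int gclose(int fd);

    // GPU code persisting results through the system-wide shared namespace.
    __global__ void log_results(float *results, int n) {
        int fd = gopen("shared_file", /* POSIX-like flags, omitted */ 0);
        gwrite(fd, 0, n * sizeof(float), (char *)results);
        gclose(fd);
    }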

Face verification: a client (unmodified, via rsocket) sends queries to a GPU server (GPUnet), which consults memcached (unmodified, via rsocket), all over InfiniBand. Server pipeline: recv(), GPU_features(), query_db(), GPU_compare(), send() 40
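
The "unmodified client via rsocket" path works because librdmacm's rsockets API mirrors BSD sockets call-for-call; a minimal client sketch (address setup and payload handling are illustrative):

    #include <rdma/rsocket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int connect_to_server(const char *ip, int port) {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);  // rsocket() instead of socket()
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);
        rconnect(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;  // then rsend() the face image and rrecv() the verdict
    }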

[Plot: face verification, latency (μsec, up to 2500) vs. throughput (23/34/54 KReq/sec) for three implementations: 6 CPU cores, 1 GPU without GPUnet, 1 GPU with GPUnet. The GPUnet version achieves 1.9x throughput and 1/3 the latency (~500 μsec) with half the lines of code.] 41

Recap: Accelerator-centric OS design 42

Why OS layer on accelerators? To abstract away... Hardware interaction overhead Programming model gap I/O and memory performance gap I/O topology 43

Challenges. Hardware: consistency, NUMA, limitations, no OS hardware support. Systems software: physical device sharing, state sharing. Applications: data layout reorganization, resource management 44


Coming up next... Distributed accelerator applications, high-concurrency servers, multi-accelerator OS support (interprocessor I/O, file system, networking APIs, VM, memory consistency, isolation, security) across I/O accelerators, manycore processors, FPGAs, DSPs, GPUs 48

Team: Accelerated Systems Group, Technion: Amir Wated, Sagi Shachar, Feras Daud, Pavel Lifshitz. Collaborators: Operating System Architecture Group, UT Austin: Sangman Kim, Yige Hu, Emmett Witchel 49

Accelerator-centric OS design: GPUfs and GPUnet on the GPU. Looking for a graduate degree in systems? We're hiring! mark@ee.technion.ac.il 50