Accelerator-centric operating systems: Rethinking the role of CPUs in modern computers. Mark Silberstein, EE, Technion
System design challenge: Programmability and Performance. Hardware architectures, systems developers, systems software.
Computer hardware, circa 2000: CPU, network adapter, Graphics Processing Units (GPUs), storage controller. (Size = transistor count.)
Systems software stack, circa 2000: applications, OS, I/O devices.
Computer hardware, circa 2015: network I/O accelerator, GPU parallel accelerator, storage I/O accelerator, plus accelerators for encryption, media, signal processing...
Central Processing Units (CPUs) are no longer central. Network I/O accelerator, GPU parallel accelerator, storage I/O accelerator, accelerators for encryption, media, signal processing... (figure axes: Power, Performance, Programmability)
Systems software stack, circa 2015: accelerated applications, OS, I/O accelerators: manycore processors, FPGAs, DSPs, hybrid CPU-GPU, GPUs.
The software-hardware gap is widening: accelerated applications sit on inadequate OS abstractions and management mechanisms for I/O accelerators, manycore processors, FPGAs, DSPs, and GPUs.
THE problem: CPU-centric software architecture. CPU, network, storage, GPU.
Breaking the CPU-centric system design: OS services alongside the GPU, the network, and storage, not only inside the operating system. The hardware is here; we need OS support.
Accelerator-centric OS architecture. Applications: accelerator I/O services (network, files). OS: accelerator abstractions and mechanisms. Accelerator applications: accelerator OS support (interprocessor I/O, file system, networking APIs, memory management), hardware support for the OS. I/O accelerators: manycore processors, FPGAs, DSPs, GPUs.
This talk: accelerator I/O services (network, files) and accelerator OS support (interprocessor I/O, file system, networking APIs). Publications: OSDI'14, CACM'14; ASPLOS'13, TOCS'14.
Outline: GPU 101 and motivation; GPUnet: network stack for GPUs; GPUfs: file access support for GPUs; recap: accelerator-centric OS architecture.
GPU 101: hybrid GPU-CPU architecture.
Co-processor model: the CPU runs the computation, launches a GPU kernel for the offloaded part, and resumes its own computation when the kernel completes.
GPUs make a difference, but why only in HPC? The top 10 fastest supercomputers use GPUs, and GPUs enable order-of-magnitude speedups in physics, vision, HCI, meteorology, graph algorithms, deep nets, bioinformatics, linear algebra, and finance. Yet what about web servers, network services, antivirus, file search?
Programming complexity exposed. Example: a GPU-accelerated server.
A CPU-only server: NIC; recv(); compute(); send().
Inside a GPU-accelerated server (NIC, GPU, PCIe bus). In theory: recv(); GPU_compute(); send(). In practice the CPU must run recv(); batch(); optimize(); transfer(); invoke(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send();
Aggressive pipelining inside a GPU-accelerated server: buffering, asynchrony, multithreading. The entire recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); sequence is replicated across many in-flight requests.
All of this code is unnecessary: it exists only for the CPU to manage the GPU.
GPUs are not co-processors; GPUs are peer-processors. They need I/O abstractions.
GPUnet: socket API for GPUs. Application view: on node0.technion.ac.il, a GPU-native server runs socket(AF_INET, SOCK_STREAM); listen(:2340); GPU-native and CPU-native clients run socket(AF_INET, SOCK_STREAM); connect(node0:2340); GPUnet carries the traffic over the network.
A GPU-accelerated server with GPUnet: the CPU is not involved; the NIC reaches the GPU directly over the PCIe bus, and the GPU itself runs recv(); GPU_compute(); send() per request, with no request batching, transparent pipelining, and seamless buffer management.
GPUnet design: a GPU socket API (for simplicity) on top of a reliable in-order streaming channel, layered over RDMA transports (InfiniBand) and non-RDMA transports (UNIX domain sockets, TCP/IP) down to the NIC (for performance).
GPUfs: file access for GPUs. Application view: GPU1, GPU2, and GPU3 call open(shared_file), read(), write(), and mmap() through a POSIX-like API; GPUfs provides a system-wide shared namespace on top of the host file system and persistent storage.
Face verification server: an unmodified client connects via rsocket over InfiniBand to a GPU server built on GPUnet, which queries an unmodified memcached (also via rsocket). The CPU pipeline recv(); features(); query_db(); compare(); send() becomes recv(); GPU_features(); query_db(); GPU_compare(); send().
Face verification, different implementations (latency in μsec vs. throughput in KReq/sec): 6 CPU cores, 1 GPU without GPUnet, and 1 GPU with GPUnet. GPUnet delivers 1.9x the throughput at one third the latency (about 500 μsec) with half the lines of code.
Recap: accelerator-centric OS design.
Why an OS layer on accelerators? To abstract away hardware interaction overheads, the programming model gap, the I/O and memory performance gap, and the I/O topology.
Challenges. Hardware: no OS hardware support, memory consistency, NUMA, physical limitations. Systems software: device sharing, state sharing. Applications: data layout reorganization, resource management.
Coming up next: distributed accelerator applications, high-concurrency servers, and multi-accelerator OS support (interprocessor I/O, file system, networking APIs, VM, memory consistency, isolation, security) across manycore processors, I/O accelerators, FPGAs, DSPs, and GPUs.
Team: Accelerated Systems Group, Technion: Amir Wated, Sagi Shachar, Feras Daud, Pavel Lifshitz. Collaborators: Operating System Architecture Group, UT Austin: Sangman Kim, Yige Hu, Emmett Witchel.
Accelerator-centric OS design: GPUfs and GPUnet on the GPU. Looking for a graduate degree in systems? We're hiring! mark@ee.technion.ac.il