Accelerator-centric operating systems


1 Accelerator-centric operating systems: Rethinking the role of CPUs in modern computers. Mark Silberstein, EE, Technion

2 System design challenge: Programmability and Performance

3 System design challenge: Programmability and Performance. Hardware architectures, Systems, Developers, Systems Software

4 Computer hardware, circa 2000: Network Adapter, Graphics Processing Units (GPUs), Storage controller. (Size = transistor count)

5 Systems software stack, circa 2000: Applications, OS, I/O devices

6 Computer hardware, circa 2015: Network I/O accelerator, GPU parallel accelerator, Storage I/O accelerator, Accelerators for encryption, media, signal processing...

7 Central Processing Units (CPUs) are no longer Central. Network I/O accelerator, GPU parallel accelerator, Storage I/O accelerator, Accelerators for encryption, media, signal processing... Power, Performance, Programmability

8 Systems software stack, circa 2015: Accelerated applications; OS. Manycore processors, I/O accelerators, FPGAs, DSPs, Hybrid CPU-GPU, GPUs

9 Software-hardware gap is widening. Accelerated applications; Inadequate abstractions and management mechanisms; OS. Manycore processors, I/O accelerators, FPGAs, DSPs, GPUs

10 THE problem: CPU-centric software architecture. Network, Storage, GPU

11 Breaking the CPU-centric system design. Operating system; OS Services at GPU, Network, and Storage. Hardware is here. We need OS support

12 Accelerator-centric OS architecture. Applications; Accelerator I/O services (network, files); OS; Accelerator abstractions and mechanisms; Accelerator applications; Accelerator OS support (interprocessor I/O, file system, networking APIs, memory management); Hardware support for OS. Manycore processors, I/O accelerators, FPGAs, DSPs, GPUs

13 This talk. Applications; Accelerator I/O services (network, files); OS; Accelerator abstractions and mechanisms; Accelerator applications; Accelerator OS support (interprocessor I/O, file system, networking APIs): OSDI14, CACM14, ASPLOS13, TOCS14; Hardware support for OS. Storage, Network, Manycore processors, I/O accelerators, FPGAs, DSPs, GPUs

14 GPU 101 and motivation. GPUnet: Network Stack for GPUs. GPUfs: File access support for GPUs. Recap: Accelerator-centric OS architecture

15 Hybrid GPU-CPU 101. Architecture: CPU, GPU

16 Co-processor model. GPU, Computation

17 Co-processor model. GPU, Computation

18 Co-processor model. GPU kernel, Computation, GPU

19 Co-processor model. GPU, Computation

20 GPUs make a difference... Top 10 fastest supercomputers use GPUs. GPUs enable order-of-magnitude speedups in... Physics, Vision, HCI, Meteorology, Graph Algorithms, Deep Nets, Bioinformatics, Linear Algebra, Finance

21 GPUs make a difference, but why only in HPC? Top 10 fastest supercomputers use GPUs. GPUs enable order-of-magnitude speedups in... Physics, Vision, HCI, Meteorology, Graph Algorithms, Deep Nets, Bioinformatics, Linear Algebra, Finance... Web servers, Network services, Antivirus, File search????

22 Programming complexity exposed. Example: GPU-accelerated server

23 CPU server. NIC; recv(), compute(), send()

24 Inside a GPU-accelerated server. NIC, GPU, PCIe bus. Theory: recv(), GPU_compute(), send()

25 Inside a GPU-accelerated server (step: recv();). NIC, GPU. Theory: recv(), GPU_compute(), send(). Code: recv(); batch();

26 Inside a GPU-accelerated server (step: transfer();). NIC, GPU. Theory: recv(), GPU_compute(), send(). Code: recv(); batch(); optimize(); transfer();

27 Inside a GPU-accelerated server (step: invoke();). NIC. Theory: recv(), GPU_compute(), send(). Code: recv(); batch(); optimize(); transfer(); balance(); GPU_compute();

28 Inside a GPU-accelerated server (step: transfer();). NIC, GPU. Theory: recv(), GPU_compute(), send(). Code: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup();

29 Inside a GPU-accelerated server (step: send();). NIC, GPU. Theory: recv(), GPU_compute(), send(). Code: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send();
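
The steps that accumulate across slides 24 to 29 can be sketched as one sequential handler. This is a minimal Python sketch, not the talk's code. Every function is a hypothetical stand-in named after a slide label, requests are assumed to be (client, value) pairs, and the "GPU" is simulated by squaring numbers on the host.

```python
# Hypothetical stand-ins for the host-side steps named on the slides.
# No real NIC or GPU is involved; "device memory" is a plain Python list.

def batch(requests, size):
    """Group requests so the accelerator gets enough parallel work."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

def optimize(group):
    """Stand-in for data layout reorganization (here: sort by value)."""
    return sorted(group, key=lambda request: request[1])

def transfer(buffer):
    """Stand-in for a copy across the PCIe bus, in either direction."""
    return list(buffer)

def gpu_compute(device_buffer):
    """Stand-in for the GPU kernel (here: square every element)."""
    return [value * value for value in device_buffer]

def dispatch(clients, results):
    """Route each result back to the client that sent the request."""
    return dict(zip(clients, results))

def serve(requests, batch_size=4):
    """recv -> batch -> optimize -> transfer -> compute -> transfer -> dispatch."""
    responses = {}
    for group in batch(requests, batch_size):
        ordered = optimize(group)
        clients = [client for client, _ in ordered]
        staged = transfer([value for _, value in ordered])
        results = transfer(gpu_compute(staged))
        responses.update(dispatch(clients, results))  # send() would follow
    return responses

if __name__ == "__main__":
    print(serve([("a", 3), ("b", 1), ("c", 2)], batch_size=2))
```

Even in this toy form, only `gpu_compute` does application work; everything else exists to feed and drain the device.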

30 Aggressive pipelining inside a GPU-accelerated server: buffering, asynchrony, multithreading. NIC; recv(), GPU_compute(), send(). Several overlapping instances of the same sequence: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send();
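
A minimal sketch of the buffering, asynchrony, and multithreading this slide refers to, assuming a three-stage pipeline whose stages run in separate threads and hand work over through bounded queues. The stage bodies (`prepare`, `gpu_compute`, `finish`) are hypothetical placeholders, and the queue size of 8 is arbitrary.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down

def prepare(raw):
    """Placeholder for batch/optimize/transfer: parse the request."""
    return int(raw)

def gpu_compute(value):
    """Placeholder for the GPU kernel."""
    return value * value

def finish(result):
    """Placeholder for transfer/cleanup/dispatch: format the reply."""
    return f"result={result}"

def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        outbox.put(work(item))

def run_pipeline(requests):
    """Overlap the three stages so different requests are in flight at once."""
    q_in, q_mid, q_out, q_done = (queue.Queue(maxsize=8) for _ in range(4))

    def feed():
        for request in requests:
            q_in.put(request)  # blocks when the buffer is full
        q_in.put(STOP)

    workers = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(prepare, q_in, q_mid)),
        threading.Thread(target=stage, args=(gpu_compute, q_mid, q_out)),
        threading.Thread(target=stage, args=(finish, q_out, q_done)),
    ]
    for worker in workers:
        worker.start()

    replies = []
    while True:
        item = q_done.get()
        if item is STOP:
            break
        replies.append(item)
    for worker in workers:
        worker.join()
    return replies
```

Each stage is a single thread reading from a FIFO queue, so replies come out in request order. A real server would also need error handling and back-pressure policy, which is part of the complexity the slide is pointing at.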

31 Unnecessary: this code is for a CPU to manage a GPU. Several overlapping instances of the same sequence: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send();

32 GPUs are not co-processors. GPUs are peer-processors. They need I/O abstractions

33 GPUnet: socket API for GPUs. Application view. GPU native server on node0.technion.ac.il: socket(AF_INET, SOCK_STREAM); listen(:2340). GPU native client: socket(AF_INET, SOCK_STREAM); connect(node0:2340). CPU client: socket(AF_INET, SOCK_STREAM); connect(node0:2340);. Labels: GPUnet, Network
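
The call sequence on this slide is the familiar BSD socket sequence. As a point of reference only, here is that sequence on the host using Python's standard `socket` module. It is not GPUnet code and does not show GPUnet's actual function names. The loopback address, the OS-chosen port, and the upper-casing "service" are assumptions made so the sketch is self-contained (the slide uses node0:2340).

```python
import socket
import threading

def upper_case_server(listener):
    """Accept one connection and reply with the upper-cased request."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024).upper())

def round_trip(message):
    """Send `message` to a one-shot local server and return the reply."""
    # Server side: socket(AF_INET, SOCK_STREAM), then bind and listen.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=upper_case_server, args=(listener,))
    server.start()

    # Client side: socket(AF_INET, SOCK_STREAM), then connect.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.connect(("127.0.0.1", port))
        client.sendall(message)
        reply = client.recv(1024)

    server.join()
    listener.close()
    return reply
```

The point of the slide is that GPU code gets to use this same programming model directly, instead of relying on host code to move data on its behalf.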

34 GPU-accelerated server with GPUnet: CPU not involved. NIC, GPU, PCIe bus; recv(), GPU_compute(), send()

35 GPU-accelerated server with GPUnet. GPU, NIC, PCIe bus; recv(), GPU_compute(), send()

36 GPU-accelerated server with GPUnet. No request batching. Transparent pipelining. NIC; several recv(), GPU_compute(), send() sequences shown side by side

37 GPU-accelerated server with GPUnet. Seamless buffer management. NIC; several recv(), GPU_compute(), send() sequences shown side by side

38 GPUnet design. Simplicity. GPU Socket API; Reliable in-order streaming; GPU Reliable channel; RDMA Transports: Infiniband; Non-RDMA Transports: UNIX Domain Sockets, TCP/IP; NIC. Performance

39 GPUfs: file access for GPUs. Application view. GPU1, GPU2, GPU3; calls shown: open(shared_file), mmap(), open(shared_file), write(). System-wide shared namespace. POSIX (CPU)-like API. GPUfs, Host File System, Persistent storage
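
For readers less familiar with the POSIX calls this slide names (open, mmap, write), the following shows their ordinary host-side use through Python's `os` and `mmap` modules. It is a reference analogy on a temporary file, not GPUfs code; GPUfs's own function names and semantics are not shown in this text. The helper name `write_then_map` is hypothetical.

```python
import mmap
import os
import tempfile

def write_then_map(payload):
    """Write bytes with write(), then modify them in place through mmap().

    Returns the mapped view's contents and the file's contents on disk.
    `payload` must be non-empty because an empty file cannot be mapped.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)                      # write()
    finally:
        os.close(fd)
    try:
        fd = os.open(path, os.O_RDWR)              # open()
        try:
            with mmap.mmap(fd, len(payload)) as view:  # mmap()
                view[0:1] = view[0:1].upper()      # store through the mapping
                view.flush()                       # push the change to the file
                mapped = view[:]
        finally:
            os.close(fd)
        with open(path, "rb") as handle:
            on_disk = handle.read()
    finally:
        os.remove(path)
    return mapped, on_disk
```

The store through the mapping is visible to a later plain read of the same file, which is the kind of shared view of data that a system-wide namespace is meant to give several processors.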

40 Face verification server. Client (unmodified) via rsocket; GPU server (GPUnet); memcached (unmodified) via rsocket; Infiniband. Steps: recv(), features(), GPU_features(), query_db(), compare(), GPU_compare(), send()
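
The request flow on this slide can be sketched as a single handler. All specifics below are assumptions made for illustration: images are short lists of numbers, `features` is a toy normalization rather than a real face descriptor, a dict stands in for memcached, and the 0.1 threshold is arbitrary.

```python
import math

def features(image):
    """Toy feature extractor: scale the pixel vector to unit length."""
    norm = math.sqrt(sum(pixel * pixel for pixel in image)) or 1.0
    return [pixel / norm for pixel in image]

def query_db(db, person_id):
    """Stand-in for the memcached lookup of stored reference features."""
    return db[person_id]

def compare(probe, reference, threshold=0.1):
    """Accept when the Euclidean distance between feature vectors is small."""
    return math.dist(probe, reference) < threshold

def handle_request(db, person_id, image):
    """recv() -> features() -> query_db() -> compare() -> send()."""
    return compare(features(image), query_db(db, person_id))

if __name__ == "__main__":
    database = {"alice": features([1, 2, 3])}
    print(handle_request(database, "alice", [2, 4, 6]))
    print(handle_request(database, "alice", [3, 2, 1]))
```

In the server the slide describes, the feature extraction and comparison steps are the ones with GPU variants (GPU_features(), GPU_compare()), while the database lookup goes over the network.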

41 Face verification: different implementations. [Plot: Latency (μsec) vs. Throughput (KReq/sec); series: GPU (no GPUnet), cores, GPU GPUnet.] Callouts: 1.9x throughput, 1/3x latency (500 μsec), ½ LOC

42 Recap: Accelerator-centric OS design

43 Why OS layer on accelerators? To abstract away... Hardware interaction overhead; Programming model gap; I/O and memory performance gap; I/O topology

44 Challenges. Hardware: consistency, NUMA, limitations. Systems software: no OS hardware support, physical device sharing, state sharing. Applications: data layout reorganization, resource management


48 Coming up next... Distributed accelerator applications; High concurrency servers; Multi-accelerator OS support: interprocessor I/O, file system, networking APIs, VM, memory consistency, isolation, security. Manycore processors, I/O accelerators, FPGAs, DSPs, GPUs

49 Team. Accelerated systems group, Technion: Amir Wated, Sagi Shachar, Feras Daud, Pavel Lifshitz. Collaborators, Operating System Architecture group, UT Austin: Sangman Kim, Yige Hu, Emmett Witchel

50 Accelerator-centric OS design: GPUfs, GPUnet. Looking for a graduate degree in systems? We're hiring! mark@ee.technion.ac.il
