GPUnet: networking abstractions for GPU programs

1 GPUnet: networking abstractions for GPU programs. Mark Silberstein (Technion, Israel Institute of Technology); Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel (University of Texas at Austin); Amir Wated (Technion)

2 What: A socket API for programs running on GPU. Why: GPU-accelerated servers are hard to build. Results: GPU vs. CPU: 50% throughput, 60% latency, ½ LOC

3 Motivation: GPU-accelerated networking applications. Data processing server, MapReduce

4 Recent GPU-accelerated networking applications: SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014)...

5 Recent GPU-accelerated networking applications: SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014)... required heroic efforts

6 GPU-accelerated networking apps: Recurring themes. Pipelining and buffer management; NIC-GPU interaction; request batching

7 GPU-accelerated networking apps: Recurring themes. CPU-GPU-NIC pipelining; NIC-GPU interaction; request batching. We will sidestep these problems

8 The real problem: CPU is the only boss. [Diagram: CPU, GPU, NIC, storage]

9 Example: server. [Diagram: NIC] compute(); send()

10 Inside a GPU-accelerated server. [Diagram: NIC, PCIe bus] Theory: GPU_compute(); send()

11 Inside a GPU-accelerated server. Theory: GPU_compute(); send(). Steps so far: batch();

12 Inside a GPU-accelerated server. Theory: GPU_compute(); send(). Steps so far: batch(); optimize();

13 Inside a GPU-accelerated server. invoke(); Theory: GPU_compute(); send(). Steps so far: batch(); optimize(); balance(); GPU_compute();

14 Inside a GPU-accelerated server. Theory: GPU_compute(); send(). Steps so far: batch(); optimize(); balance(); GPU_compute(); cleanup();

15 Inside a GPU-accelerated server. Theory: GPU_compute(); send(). Steps so far: batch(); optimize(); balance(); GPU_compute(); cleanup(); dispatch(); send();

16 Inside a GPU-accelerated server: aggressive pipelining. Double-buffering, asynchrony, multithreading. [Diagram: several overlapping copies of the sequence recv(); batch(); optimize(); balance(); GPU_compute(); cleanup(); dispatch(); send();]

17 This code is for a CPU to manage a GPU. [Diagram: several overlapping copies of the sequence recv(); batch(); optimize(); balance(); GPU_compute(); cleanup(); dispatch(); send();]
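
The CPU-side sequence named on the slides above (receive, batch, compute on the GPU, dispatch, send) can be sketched as a plain control loop. This is a minimal illustration, not code from the talk: the function bodies, the `serve` helper, the request tuples, and the squaring "kernel" are all hypothetical stand-ins, and the optimize, balance, and cleanup steps are left out for brevity.

```python
# Hypothetical CPU-side control loop for a GPU-accelerated server.
# Every stage is a stand-in chosen for illustration; none of it is from the talk.

def batch(requests, batch_size):
    """Group individual requests so one GPU launch can serve several of them."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

def gpu_compute(group):
    """Stand-in for a GPU kernel launch; here it just squares each payload on the CPU."""
    return [payload * payload for _, payload in group]

def dispatch(group, results):
    """Match each result back to the client that sent the request."""
    return [(client, result) for (client, _), result in zip(group, results)]

def serve(requests, batch_size=2):
    """Run batch -> compute -> dispatch for already-received requests; returns replies to send."""
    replies = []
    for group in batch(requests, batch_size):
        results = gpu_compute(group)
        replies.extend(dispatch(group, results))
    return replies

# Requests as (client, payload) pairs, as if they had just been received.
received = [("client-a", 2), ("client-b", 3), ("client-c", 4)]
replies = serve(received)
print(replies)
```

Even in this toy form, the server logic is mostly bookkeeping around one compute call, which is the point the slides make about CPU-managed GPU servers.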

18 GPUs are not co-processors. GPUs are peer-processors. They need I/O abstractions. File system I/O: GPUfs [ASPLOS13]. Network I/O: this work

19 GPUnet: socket API for GPUs. Application view. GPU native server (GPUnet) on node0.technion.ac.il: socket(AF_INET, SOCK_STREAM); listen(:2340). CPU client and GPU native client (GPUnet), over the network: socket(AF_INET, SOCK_STREAM); connect("node0:2340");
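
For readers unfamiliar with the call sequence on this slide, the same socket(), listen(), connect() pattern can be sketched with Python's standard `socket` module. This is only an analogy running on the CPU over loopback: it does not use GPUnet, the `run_server` helper and the upper-casing echo are invented for the example, and the port is picked by the OS rather than fixed at 2340.

```python
import socket
import threading

def run_server(listener):
    """Accept one connection and send back what the client sent, upper-cased."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# Server side: socket(AF_INET, SOCK_STREAM), then bind and listen.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port (assumption)
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=run_server, args=(listener,))
server.start()

# Client side: socket(AF_INET, SOCK_STREAM), then connect to the server address.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello")
reply = client.recv(1024)
client.close()

server.join()
listener.close()
print(reply)
```

The slide's claim is that a GPU program can issue this same sequence of calls directly, with a CPU client connecting to it unchanged.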

20 GPU-accelerated server with GPUnet. CPU not involved. [Diagram: NIC, PCIe bus] GPU_compute(); send()

21 GPU-accelerated server with GPUnet. [Diagram: NIC, PCIe bus] GPU_compute(); send()

22 GPU-accelerated server with GPUnet. No request batching. [Diagram: NIC; several parallel GPU_compute(); send() sequences]

23 GPU-accelerated server with GPUnet. Automatic request pipelining. Automatic buffer management. [Diagram: NIC; several parallel GPU_compute(); send() sequences]

24 Building a socket abstraction for GPUs

25 Goals. Simplicity: reliable streaming abstraction for GPUs. Performance: NIC data path optimizations. [Diagram: NIC, PCIe bus]

26 Design option 1: Transport layer processing on CPU. [Diagram labels: transport processing, network buffers, controls the flow of data, NIC]

27 Design option 1: Transport layer processing on CPU. Extra CPU-GPU memory transfers. [Diagram labels: transport processing, network buffers, NIC]

28 Design option 2: Transport layer processing on GPU. [Diagram labels: transport processing, network buffers, P2P DMA, NIC]

29 Design option 2: Transport layer processing on GPU. CPU applications access network through GPU? TCP/IP on GPU? [Diagram labels: transport processing, network buffers, P2P DMA, NIC]

30 Not CPU, not GPU. We need help from NIC hardware

31 RDMA: offloading transport layer processing to NIC. [Diagram labels: streaming and message buffers (shown twice), reliable RDMA, NIC]

32 GPUnet layers (top to bottom): Socket API; reliable in-order streaming; reliable channel; transports. RDMA transports: Infiniband. Non-RDMA transports: UNIX Domain Socket, TCP/IP

33 GPUnet layers (top to bottom): Socket API; reliable in-order streaming; reliable channel; transports. RDMA transports: Infiniband. Non-RDMA transports: UNIX Domain Socket, TCP/IP. Annotations: Simplicity; NIC; Performance
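
The layering on these slides puts a reliable in-order byte stream on top of a reliable message channel. A minimal sketch of that idea follows, under loud assumptions: `MessageChannel` and `ByteStream` are hypothetical names, the channel is an in-memory queue rather than RDMA, and the 4-byte message size is arbitrary. It is not GPUnet's implementation.

```python
from collections import deque

class MessageChannel:
    """Stand-in for a reliable channel: delivers whole messages in the order sent."""

    def __init__(self):
        self._queue = deque()

    def send_message(self, payload: bytes) -> None:
        self._queue.append(payload)

    def recv_message(self) -> bytes:
        # Returns an empty bytes object when no message is waiting.
        return self._queue.popleft() if self._queue else b""

class ByteStream:
    """In-order byte stream layered on top of a message channel."""

    def __init__(self, channel: MessageChannel, max_message: int = 4):
        self._channel = channel
        self._max_message = max_message
        self._pending = b""  # bytes received from the channel but not yet read

    def send(self, data: bytes) -> None:
        # Split the stream into fixed-size messages for the channel below.
        for start in range(0, len(data), self._max_message):
            self._channel.send_message(data[start:start + self._max_message])

    def recv(self, nbytes: int) -> bytes:
        # Pull messages until enough bytes are buffered or the channel is empty.
        while len(self._pending) < nbytes:
            message = self._channel.recv_message()
            if not message:
                break
            self._pending += message
        out, self._pending = self._pending[:nbytes], self._pending[nbytes:]
        return out

channel = MessageChannel()
stream = ByteStream(channel)
stream.send(b"GPUnet stream")
first = stream.recv(6)
rest = stream.recv(100)
print(first, rest)
```

The reader sees byte boundaries of its own choosing (6 bytes, then the remainder), regardless of how the data was cut into messages underneath, which is what lets a socket API sit on a message-based transport.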

34 See the paper for: coalesced API calls; latency-optimized CPU-GPU flow control management; bounce buffers; non-RDMA support; performance optimizations

35 Implementation. Standard API calls, blocking/nonblocking. libGPUnet.a: AF_INET, streaming over Infiniband RDMA; fully compatible with rsocket library. libunixnet.a: AF_LOCAL, Unix Domain Sockets support for inter-GPU/CPU-GPU

36 Implementation. [Diagram labels: application; GPUnet socket library; bounce buffers; GPUnet proxy; memory fallback; flow control; NIC; network buffers; memory]

37 Evaluation. Analysis of GPU-native server design: matrix product server; in-GPU-memory MapReduce; face verification server. Setup: 2x6 Intel E5-2620, NVIDIA Tesla K20Xm, Mellanox Connect-IB HCA, Switch-X bridge

38 In-GPU-memory MapReduce. [Diagram labels: GPUfs; Map; Map; Receiver; Sort; Reduce; GPUnet; Receiver; Sort; Reduce]

39 In-GPU-memory MapReduce: Scalability. K-means: 5.6 sec on 1 GPU (no network), 1.6 sec on 4 GPUs with GPUnet (3.5x). Word-count: 29.6 sec on 1 GPU (no network), 10 sec on 4 GPUs with GPUnet (2.9x). GPUnet enables scale-out for GPU-accelerated systems
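
The speedups on this slide follow directly from the listed runtimes. The short check below recomputes them; the parallel-efficiency figure (speedup divided by the number of GPUs) is an extra derived quantity that does not appear on the slide, and the variable names are illustrative.

```python
# Runtimes in seconds as listed on the slide: (1 GPU, 4 GPUs with GPUnet).
runtimes = {"K-means": (5.6, 1.6), "Word-count": (29.6, 10.0)}
gpus = 4

speedups = {}
for name, (single_gpu, four_gpus) in runtimes.items():
    speedup = single_gpu / four_gpus
    efficiency = speedup / gpus  # derived here, not reported on the slide
    speedups[name] = speedup
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```

The Word-count ratio works out to 2.96, which the slide lists as 2.9x.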

40 Face verification server. CPU client (unmodified) via rsocket; GPU server (GPUnet); memcached (unmodified) via rsocket; connected over Infiniband. Server steps: features(); GPU_features(); query_db(); compare(); GPU_compare(); send()

41 Face verification: Different implementations. [Plot: latency (μsec) vs. throughput (KReq/sec), 54; series: 6 cores, 1 GPU (no GPUnet), 1 GPUnet; markers: median, 25th-75th%, 99th%]

42 Face verification: Different implementations. 1.9x throughput, 1/3x latency, ½ LOC. [Plot: latency (μsec) vs. throughput (KReq/sec), 54; series: 6 cores, 1 GPU (no GPUnet), 1 GPUnet; markers: median, 25th-75th%, 99th%]

43 Face verification: Different implementations. Large variability in latency. [Plot: latency (μsec) vs. throughput (KReq/sec), 54; series: 6 cores, 1 GPU (no GPUnet), 1 GPUnet; markers: median, 25th-75th%, 99th%]

44 Face verification on all processors: 2x GPU + 10x CPU cores. Similar latency, 4.5x throughput. [Plot: latency (μsec) vs. throughput (KReq/sec), 186; series: CPU cores, GPUnet, 2x GPUnet + 10x CPU cores; throughput optimized and latency optimized]

45 Set GPUs free! GPUnet is a library providing networking abstractions for GPUs. mark@ee.technion.ac.il
