I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more

1 I/O, today, is Remote (block) Load/Store, and must not be slower than Compute, any more
Manolis Katevenis, FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete)

2 Acknowledgements
- Vassilis Papaefstathiou
- Manolis Marazakis
- Nikolaos Chrysos
- Iakovos Mavroidis
- John Goodacre (ARM, KALEAO)
- ... and many, many more!

3 The curse of being perceived as Slow
- Tell a computer engineer that something is slow, and s/he will design in a myriad of other reasons to ensure that it always remains slow!
  - Optimize for the critical path and the common case
- Input/Output (I/O) hardware was slow
- Architecture (virtualization) and Software (OS, Languages, Apps) adjusted, assuming that I/O is slow
- I/O hardware is no longer slow, but Architecture & Software inertia ensure that it remains slow
- Shared-Memory Parallel Programs work over Memory
- Message-Passing Parallel Programs work over I/O

4 I/O, nowadays, is NVMs and fast Networks
- Slow peripherals are only indirectly connected, and to few processors
  - Most high-speed processors are in datacenters and HPC clusters
- Storage becomes Non-Volatile Memories like DRAM
  - magnetic disks are reached indirectly, through network & controllers
- High-speed network throughput approaches that of DRAM
- Short-range (e.g. on-board) network latency has no reason to be longer than DRAM latency

5 Memory vs. I/O: once upon a time vs. now
- Memory is fast
  - Virtualized in hardware, at ns scale
  - user process believes it owns the entire address space
  - ld/st instructions, caches, TLB ensure ns scale
- I/O is slow
  - don't bother to optimize
  - rarely virtualized at user level (= process owns I/O devices)
- L1-miss, L2-hit (on-chip) latency: on the order of 10 ns
- On-chip PL ("I/O") register read in Zynq UltraScale+ FPGA: on the order of 100 ns (~500 ns for PCIe)
- Why?? Because people didn't think it was worth optimizing

6 Make I/O (interprocessor communication) a first-class citizen!
- Parallel (& distributed) systems are ubiquitous nowadays
- Interprocessor Communication is as important as Computation
- Shared-Memory communication is limited to ~100s of threads
- We need communication across hardware-coherence islands ("I/O" to nearby chips) to be as fast as DRAM access (which is also to "nearby chips")
- There is no reason for a (16 to 64 byte) message send-receive to be any slower than a cache-line invalidate to a nearby socket followed by a cache-line miss from that socket

7 Inefficiencies in Cluster Communication, to be overcome
[Figure: conventional data path: sender user memory, copy (cp) to a kernel or library buffer, DMA to the sender NIC, network, receiver NIC, DMA to a kernel or library buffer, copy (cp) to receiver user memory. Ideal (zero-copy) path: RDMA directly from sender user memory to receiver user memory.]
- Five (5) copyings of the data, instead of just one (1)!
- Data copying consumes time and energy
- Start by eliminating (large) NIC buffers: DMA across the network = RDMA
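The copy count on this slide can be tallied with a minimal Python sketch. The stage names are illustrative labels read off the figure, not an API, and `bytes_moved` is a hypothetical helper that ignores everything except how often the payload is moved.

```python
# Each tuple is one movement of the payload (hypothetical stage labels).
CONVENTIONAL_PATH = [
    ("sender user buffer", "sender kernel/lib buffer"),      # cp
    ("sender kernel/lib buffer", "sender NIC buffer"),       # DMA
    ("sender NIC buffer", "receiver NIC buffer"),            # network
    ("receiver NIC buffer", "receiver kernel/lib buffer"),   # DMA
    ("receiver kernel/lib buffer", "receiver user buffer"),  # cp
]

# Ideal zero-copy path: one RDMA transfer, user memory to user memory.
ZERO_COPY_PATH = [("sender user buffer", "receiver user buffer")]


def bytes_moved(path, payload_bytes):
    """Total bytes moved end to end: one full payload per hop of the path."""
    return len(path) * payload_bytes
```

Under this simple model a 4 KiB payload is moved five times on the conventional path and once with RDMA.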

8 DMA Initiation: System Call, or directly from User-Level?
[Figure: sender user process (virtual addresses) and user memory; copy (cp) into a pinned kernel or library buffer; kernel-initiated DMA read with physical addresses; virtualized DMA (vdma) engine; network interfaces (NI) and network; DMA write and ack at the receiver; copy (cp) from the pinned kernel or library buffer to receiver user memory. Ideal (zero-copy) path: RDMA from user memory to user memory.]
- System Call to initiate a DMA (with physical addresses): 3 to 4 μs on 64-bit ARM in Xilinx Zynq UltraScale+ under Linux
- For the virtualized DMA engine (8 channels), which accepts virtual address arguments, Initiation (writing 3-4 registers): <0.5 μs
- Next Questions:
  - Who translates addresses & where?
  - Why copy User to/from Kernel/Library?
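To see why the initiation cost matters, the following sketch computes what share of a transfer is spent just starting the DMA. The initiation times come from the slide; the link bandwidth and payload size used in the example are assumptions chosen only for illustration, and `initiation_share` is a hypothetical helper.

```python
def initiation_share(init_us, payload_bytes, bandwidth_gbytes_per_s):
    """Fraction of (initiation + wire time) that is spent on initiation.

    Assumes a simple model: total time = initiation + payload / bandwidth.
    1 GByte/s equals 1000 bytes per microsecond.
    """
    transfer_us = payload_bytes / (bandwidth_gbytes_per_s * 1e3)
    return init_us / (init_us + transfer_us)


# Slide figures: ~3.5 us via system call, <0.5 us for user-level initiation.
# Assumed for illustration: 4 KiB payload over a 1 GByte/s link.
syscall_share = initiation_share(3.5, 4096, 1.0)
user_level_share = initiation_share(0.5, 4096, 1.0)
```

With these assumed numbers the system call accounts for a large part of the total time, while user-level initiation is a small fraction; for shorter messages the gap widens further.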

9 Scalability: Global Virtual Addresses & Progressive Translation
[Figure: source memory hierarchy, read through Translate (MMU) into the Copy (RDMA) Engine, which takes 64-bit virtual addresses (src addr, dst addr) and a PID; data travel over the network with Global Virtual Address Space routing; Translate (MMU) and write into the destination memory hierarchy.]
- For scalability, not all nodes can be required to know all other nodes' translation tables
  - see "Progressive Address Translation" in the M. Katevenis paper in: Stamatis Vassiliadis Symposium 2007
- Destination Addresses in network packets must be Virtual, for scalability: 64-bit GVA + (global) protection domain identifier
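As an illustration of a packet that carries a virtual destination, here is a minimal sketch of a header holding a 64-bit GVA plus a protection domain identifier. Only the 64-bit GVA width comes from the slide; the 32-bit identifier width, the field order, and the big-endian encoding are assumptions, and `pack_header` / `unpack_header` are hypothetical names.

```python
import struct

# Assumed layout (not from the slides): 32-bit protection domain ID,
# then the 64-bit global virtual address, big-endian, no padding.
HEADER = struct.Struct(">IQ")


def pack_header(protection_domain_id, dst_gva):
    """Build the header bytes placed in front of the packet payload."""
    return HEADER.pack(protection_domain_id, dst_gva)


def unpack_header(raw):
    """Return (protection_domain_id, dst_gva) from the first header bytes."""
    return HEADER.unpack(raw[:HEADER.size])
```

The point of the sketch is that the sender never needs the destination's physical address: translation is left to the nodes along the way.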

10 Current Address Translation Architectures not ready for GVAS
[Figure: Address translation architecture of A53 in Xilinx Zynq UltraScale+. Processing System (PS): processor cores, caches, local DRAM, vdma engine, SMMU, other PS peripheral devices. Programmable Logic (PL): FPGA. Labels: read or write addresses, virtual or physical, but truncated to 44 bits / 40 bits; translate if virtual; physical-only addresses; route based on physical address; 448 GBy window.]
- DMA engine is virtualized, but truncates virtual addresses to <64 bits
- No mechanism to send the untranslated VA to the network
- (Remote) load/store instructions suffer compulsory address translation
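The effect of truncation can be shown with a short sketch: once the upper bits are dropped, distinct 64-bit global addresses become indistinguishable. The 44-bit and 40-bit widths are the ones on the slide; the two example addresses and the `truncate` helper are made up for illustration.

```python
def truncate(addr, bits):
    """Keep only the low `bits` bits of an address, as a narrower bus would."""
    return addr & ((1 << bits) - 1)


# Two hypothetical 64-bit global virtual addresses that differ only in bit 48.
gva_a = 0x0000_0234_5678_9000
gva_b = 0x0001_0234_5678_9000

# After truncation to 44 bits the two addresses alias each other.
aliased = truncate(gva_a, 44) == truncate(gva_b, 44)
```

Here `gva_a` happens to fit in 44 bits, so it survives a 44-bit path unchanged but is altered by a 40-bit one; `gva_b` is corrupted by both.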

11 Mbox Remote Enqueue: Asynchronous Event Multiplexing
[Figure: multiple senders P1 and P2 enqueue messages m1 and m2 into a Remote Queue ("Mbox", shared space) in the memory of the single receiver P3, which waits on empty.]
- Single-word messages are sent via a single store instruction
- Longer messages: first composed locally... then sent
- The basic synchronization mechanism for parallel programs, I believe
- Event Multiplexor: requests, notifications (like the Select system call)
- Atomic enqueue into shared space, unlike dedicated space with RDMA
- Space reserved only for the number of actual senders, not potential senders
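The multiple-sender, single-receiver behaviour can be imitated in software with a minimal sketch. This is only an analogy: threads stand in for processors, and Python's `queue.Queue` stands in for the hardware remote-enqueue mechanism; the `Mbox` class and `run_demo` function are hypothetical names.

```python
import queue
import threading


class Mbox:
    """Shared queue: many senders enqueue, one receiver dequeues."""

    def __init__(self):
        self._q = queue.Queue()

    def enqueue(self, sender_id, words):
        # The whole message goes in as one item, so messages from
        # different senders never interleave word by word.
        self._q.put((sender_id, tuple(words)))

    def dequeue(self):
        # Blocks while the queue is empty ("wait on empty").
        return self._q.get()


def run_demo(n_senders=3, msgs_each=100):
    """Run the senders concurrently; return messages in arrival order."""
    mbox = Mbox()

    def sender(sid):
        for i in range(msgs_each):
            mbox.enqueue(sid, (i, i + 1))

    threads = [threading.Thread(target=sender, args=(s,))
               for s in range(n_senders)]
    for t in threads:
        t.start()
    received = [mbox.dequeue() for _ in range(n_senders * msgs_each)]
    for t in threads:
        t.join()
    return received
```

The receiver sees one multiplexed stream of events: messages from different senders arrive in an arbitrary interleaving, but each message is intact and each sender's own messages stay in order.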

12 Store-multiple instruction for atomic-packet send (remenq)?
[Figure: multiple senders P1 and P2 and single receiver P3, which waits on empty; messages m1 and m2 are enqueued into the Remote Queue ("Mbox", shared space) in P3's memory. Single-word messages are sent via a single store instruction; store multiple: atomic packet??; alternative for longer messages: first composed locally... then sent.]
- Need a fast way to send a single, guaranteed-atomic network packet
- A store-multiple (registers) instruction exists on ARMv8, but its current implementation does not guarantee that all data words will be transferred in a single AXI burst to the I/O hardware

13 Per-Task Mbox Qs & Scheduler as Interrupt Generalization
[Figure: event queues (Q21, Q01, Q11) ordered by priority from high to low, each feeding a process/task (T21, T11, T01) on one processor core; an arriving event raises "Interrupt! Run Scheduler"; the core goes to low-power sleep when all queues are empty and switches task per time quantum.]
- Tasks block and wait when synchronously reading from an empty queue
- Scheduler selects a non-empty queue of highest priority and runs its task
- Time quantum switches Qs and Tasks inside the top non-empty priority class
- Enqueues into higher-priority Qs interrupt lower-priority running tasks
- RDMA is the Datapath, Queues are the Control
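The selection rule on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: priority 0 is taken to be the highest (the slide does not fix a numbering), tasks are identified by name only, and the `Scheduler` class is hypothetical.

```python
from collections import deque


class Scheduler:
    """Pick the next task: highest non-empty priority class, round robin inside it."""

    def __init__(self, n_priorities):
        # classes[p] holds (task_name, event_queue) pairs; 0 = highest (assumed).
        self.classes = [deque() for _ in range(n_priorities)]

    def add_task(self, priority, name):
        """Register a task and return its private event queue."""
        events = deque()
        self.classes[priority].append((name, events))
        return events

    def pick(self):
        """Return the task to run for the next quantum, or None if all queues are empty."""
        for cls in self.classes:
            for _ in range(len(cls)):
                name, events = cls[0]
                cls.rotate(-1)  # move the inspected task to the back
                if events:
                    return name
        return None  # nothing pending: the core may go to low-power sleep
```

Calling `pick()` at every time quantum, and again whenever an enqueue arrives, gives the behaviour described above: a new event in a higher-priority queue preempts a lower-priority task at the next call.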

14 Why copy data between User and Kernel/Library buffers?
[Figure: same data path as on slide 8: user process with virtual addresses, copies (cp) to and from pinned kernel or library buffers, kernel-initiated DMA read with physical addresses, vdma, network interfaces (NI), DMA write and ack; ideal (zero-copy) RDMA from user memory to user memory.]
- Pinned buffers: traditional DMA does not tolerate page faults
- Registered buffer addresses, in the absence of a System (I/O) MMU
- Non-cacheable buffers, in the absence of cache-coherent DMA
- Send buffer reuse immediately after send initiation
- Transfer occurs before the receive buffer is ready, and the receive Application is unwilling to accept reception at a library-specified address

15 MPI: Send-Rcv Match at Receiver, User-Specified Rcv Addr
- Producer (user process): write into sb, then send: "ready to send, at which address?"
- Consumer (user process): (early) receive: "ready to receive, but in my buffer, please!"
- Match at the receiver; pointer to rb is returned to the sender
- Data transfer sb to rb; done
- Consumer (user process): read from rb (done)
- Cost = one extra network round-trip time for the sender to learn the address

16 Snd-Rcv Match at Sender, User-Specified Receive Address
- Consumer (user process): (early) receive: "ready to receive, but in my buffer, please!"; pointer to rb is sent to the sender
- Producer (user process): write into sb, then send; match at the sender
- Data transfer sb to rb; done
- Consumer (user process): read from rb (done)
- Minimal latency achieved; early receive (rb pre-allocated by user) still needed
- Does not work with MPI_ANY_SOURCE

17 Receiver willing to accept Data at any Address (and not copy)
- Producer (user process): write into sb, then send: "allocate buffer, and... here come the data", or "use rb[i] that had been preallocated for me"
- Consumer (user process): rcv: "tell me when and where my data will be ready"
- Match at the receiver; done
- Consumer (user process): read from rb (done)
- Eager delivery at a preallocated, library-managed buffer, which is then given to the user
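The three matching schemes of slides 15 to 17 can be compared with a rough latency model that counts only one-way network traversals on the critical path. This is a sketch under simplifying assumptions that are not in the slides: the receive is posted early, both directions have the same latency, and data transfer time and software overhead are ignored. The scheme names and the `delivery_latency` function are hypothetical.

```python
# One-way network traversals between the send call and data arrival.
TRAVERSALS = {
    "match_at_receiver": 3,  # "ready to send", pointer to rb back, then data
    "match_at_sender": 1,    # pointer to rb already at the sender: data only
    "eager": 1,              # data sent at once into a library-managed buffer
}


def delivery_latency(scheme, one_way_us):
    """Latency from send call to data arrival, in microseconds."""
    return TRAVERSALS[scheme] * one_way_us
```

In this model, matching at the receiver costs two extra one-way traversals, which is the one extra round-trip time noted on slide 15, while the other two schemes reach the minimum of a single traversal.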

18 Conclusions
- I/O is Interprocessor Communication, and it should be made as fast as Compute
- Short messages: (remote) load/store-multiple instructions
- Long messages: virtualized DMA engine, no system calls
- Global Virtual Address Space (GVAS)
- RDMA is the Datapath
- Queues (mbox) are the Control

19 Back-up Slides

20 Two Slides from my Stamatis Vassiliadis 2007 Symposium talk: Network Routing as Generalization of Address Decoding
[Figure (a): Physical Address Decoding in a uniprocessor. The 2 MS bits of the physical address from processor P select the target: 00: SRAM 0, 01: SRAM 1, 1x: I/O Bridge.]
[Figure (b): Geographical Address Routing in a multiprocessor. Packets (header + body) from P are routed among Node 0, Node 1, Node 2, ... by the 2 MS bits of the physical destination address in the header; routing labels shown: 00, 01, 1x, 10, 11.]
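The uniprocessor case (a) amounts to a lookup on the two most-significant address bits, which a short sketch makes explicit. The 32-bit address width is an assumption made for the example; the mapping of bit patterns to targets is the one shown in the figure, and `decode` is a hypothetical name.

```python
def decode(addr, addr_bits=32):
    """Select the target device from the 2 most-significant address bits."""
    top2 = addr >> (addr_bits - 2)
    if top2 == 0b00:
        return "SRAM 0"
    if top2 == 0b01:
        return "SRAM 1"
    return "I/O bridge"  # pattern 1x: the second bit is a don't-care
```

Routing in the multiprocessor case (b) generalizes the same idea: each switch inspects a few address bits of the packet header and forwards the packet toward the matching node instead of enabling a local device.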

21 Progressive Address Translation: Localize Migration Updates
[Figure: processor P and translation tables Tbl_A, Tbl_B, Tbl_C, Tbl_D holding prefix entries (such as 1x, 01, 00, 0x, 10, 11, xx); inside Domain D, page 9 migrates to a new location (pg9').]
- Packets carry global virtual addresses
- Tables provide the physical route (address) for the next few steps
- When page 9 migrates within D, only tables in that domain need updating
- Variable-size-page translation tables look like internet routing tables (longest-prefix matches, if we want small-page-within-big-region migration)
- Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls
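The idea that each table only resolves the next few steps, using longest-prefix matching, can be sketched as follows. The node names, table contents, and address bit strings are invented for illustration and do not reproduce the entries in the figure; `longest_prefix_match` and `route` are hypothetical helpers.

```python
def longest_prefix_match(table, addr_bits):
    """Return the next hop of the longest prefix in `table` that matches, else None."""
    best_prefix, best_hop = None, None
    for prefix, hop in table.items():
        if addr_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_hop = prefix, hop
    return best_hop


def route(tables, start, addr_bits, max_hops=16):
    """Follow per-node tables from `start` until a node has no matching entry."""
    node, path = start, [start]
    for _ in range(max_hops):
        hop = longest_prefix_match(tables.get(node, {}), addr_bits)
        if hop is None:
            return path
        path.append(hop)
        node = hop
    raise RuntimeError("routing loop or path too long")


# Hypothetical tables: A only knows that addresses starting with 1 live in
# domain D; D knows where the individual pages are.
tables = {
    "A": {"1": "D"},
    "D": {"1": "D1", "1001": "D3"},  # page 1001 (page 9) currently at D3
}
before = route(tables, "A", "1001")
tables["D"]["1001"] = "D2"           # page 9 migrates within D: only D's table changes
after = route(tables, "A", "1001")
```

Table A is untouched by the migration, yet packets from A still reach the page at its new location, which is the localization property the slide describes.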

Cluster Communication Latency: towards approaching its Minimum Hardware Limits, on Low-Power Platforms
Manolis Katevenis, FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete)
http://www.ics.forth.gr/carv/