Cluster Communication Latency:


1 Cluster Communication Latency: towards approaching its Minimum Hardware Limits, on Low-Power Platforms
Manolis Katevenis, FORTH, Heraklion, Crete, Greece (in collaboration with Univ. of Crete)

2 Acknowledgements
Iakovos Mavroidis, John Goodacre (ARM, KALEAO), Vassilis Papaefstathiou, Manolis Marazakis, Nikolaos Chrysos, and many, many more!

3 The Essence of Communication
[Figure, left: a producer node issues a remote write through the interconnect to its expected consumers among nodes 1..N. Right: a consumer node sends a remote read request through the interconnect to the producer, which replies with the data.]
Push style, like [...]:
  if you know the consumers (else it is like spam)
  if space already exists
  zero latency when pre-sent
  one-way latency when sent just-in-time
Pull style, like a web browser:
  when the reader decides it needs the data and allocates space
  two-way latency, starting when the data is needed, unless able to prefetch
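The latency cases on this slide can be restated as a small model. The Python sketch below is only illustrative: the function name `delivery_latency`, the single `one_way` parameter (one network traversal, in arbitrary time units), and the simplification that a successful prefetch hides the whole pull latency are assumptions, not part of the talk.

```python
def delivery_latency(style, one_way, pre_sent=False, prefetched=False):
    """Latency seen by the consumer, counted from the moment it needs the data."""
    if style == "push":
        # Data pushed ahead of time are already in place; just-in-time costs one traversal.
        return 0.0 if pre_sent else one_way
    if style == "pull":
        # Read request plus reply: two traversals (assumed fully hidden by a prefetch).
        return 0.0 if prefetched else 2 * one_way
    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    for args in [("push", 1.0, True), ("push", 1.0), ("pull", 1.0)]:
        print(args, delivery_latency(*args))
```

With `one_way = 1.0` the three calls give 0, 1 and 2 time units, matching the zero, one-way and two-way cases listed above.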

4 Remote DMA versus Cache Coherence
[Figure: two producer/consumer diagrams. Cache coherence: processors (P) with caches (C) on a NoC above a lower-level cache and directory (multiple banks); a store leads to invalidate, request, forwarded-request and data packets. Remote DMA: processors (P) with local memories (LM) on a NoC; the producer issues a remote store (word) or a remote write DMA (block).]
Example of push style: explicitly sending the data takes 5x fewer packets (hence less time and less energy)

5 Can we approach this ideal in Cluster Communication?
Cluster Computing / Datacenters / HPC: many thousands of nodes
  node = island of (hardware) cache coherence
Can we get cluster communication to be comparably fast to store & load instructions inside a coherence island?
  plus a small delay due to distance, but not much more
Clusters originate from the mass, commodity market, where customers did not care about a few extra μs
Work done in Energy-Efficient Datacenter/HPC projects
  64-bit ARM processors

6 Inefficiencies in Cluster Communication, to be overcome
[Figure: conventional path: sender user memory, copy (cp) into a kernel or library buffer, DMA to the sender NIC, network, receiver NIC, DMA into a kernel or library buffer, copy (cp) into receiver user memory. Ideal path (zero copy): RDMA directly from sender user memory to receiver user memory.]
Five (5) copies of the data, instead of just one (1)!
Data copying consumes time and energy
Start by eliminating the (large) NIC buffers: DMA across the network, i.e. RDMA

7 DMA Initiation: System Call, or directly from User-Level?
[Figure: sender user process (virtual addresses) and receiver user memory, with pinned kernel or library buffers and copies (cp) on both sides; the kernel issues a DMA read with physical addresses; a virtualized DMA (vdma) path; the network interfaces (NI) perform the DMA write across the network and return acks. Ideal path (zero copy): RDMA.]
System call to initiate a DMA (with physical addresses): 3 to 4 μs on 64-bit ARM in Xilinx Zynq UltraScale+ under Linux
Virtualized DMA engine (8 channels) that accepts virtual-address arguments: initiation (writing 3-4 registers) takes <0.5 μs
Next questions:
  Who translates addresses, and where?
  Why copy between user and kernel/library buffers?
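To make "initiation by writing 3-4 registers" concrete, here is a toy Python model of one virtualized DMA channel. Everything in it is hypothetical: the class `ToyVdmaChannel`, the register names `SRC`, `DST` and `LEN`, and the rule that writing the length starts the copy are illustration choices, not the register layout of the actual engine. A `bytearray` stands in for the user's address space.

```python
class ToyVdmaChannel:
    """Toy stand-in for one channel of a virtualized DMA engine (hypothetical layout)."""

    def __init__(self, memory):
        self.memory = memory                       # bytearray standing in for user memory
        self.regs = {"SRC": 0, "DST": 0, "LEN": 0}
        self.register_writes = 0                   # how many stores the initiation cost

    def write_reg(self, name, value):
        if name not in self.regs:
            raise KeyError(name)
        self.regs[name] = value
        self.register_writes += 1
        if name == "LEN":                          # assumption: the last write starts the copy
            self._copy()

    def _copy(self):
        src, dst, n = self.regs["SRC"], self.regs["DST"], self.regs["LEN"]
        if max(src, dst) + n > len(self.memory):
            raise ValueError("transfer outside the toy memory")
        self.memory[dst:dst + n] = self.memory[src:src + n]


def user_level_send(channel, src, dst, length):
    """Initiate a transfer with three plain stores and no system call."""
    channel.write_reg("SRC", src)
    channel.write_reg("DST", dst)
    channel.write_reg("LEN", length)


if __name__ == "__main__":
    mem = bytearray(b"hello---" + bytes(8))
    ch = ToyVdmaChannel(mem)
    user_level_send(ch, src=0, dst=8, length=5)
    print(bytes(mem[8:13]), ch.register_writes)
```

The point of the sketch is only the shape of the interface: the user process hands over addresses from its own address space and pays a handful of stores, instead of a kernel entry per transfer.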

8 Scalability: Global Virtual Addresses & Progressive Translation
[Figure: the copy (RDMA) engine takes 64-bit virtual addresses (src addr, dst addr) and a PID; it reads the source memory hierarchy through a translation step (MMU); packets cross the network in the global virtual address space and are routed; a second translation step (MMU) precedes the write of the data into the destination memory hierarchy.]
For scalability, not all nodes can be required to know all other nodes' translation tables
  see "Progressive Address Translation" in the M. Katevenis paper in the Stamatis Vassiliadis Symposium 2007
Destination addresses in network packets must be virtual, for scalability: 64-bit GVA + (global) protection domain identifier

9 Two Slides from my Stamatis Vassiliadis 2007 Symposium talk: Network Routing as Generalization of Address Decoding
Physical address decoding in a uniprocessor
Geographical address routing in a multiprocessor
http://

10 Progressive Address Translation: Localize Migration Updates
Packets carry global virtual addresses
Tables provide the physical route (address) for the next few steps
When page 9 migrates within D, only the tables in that domain need updating
Variable-size-page translation tables look like internet routing tables (longest-prefix matches, if we want small-page-within-big-region migration)
Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls
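The analogy with internet routing tables can be sketched as a longest-prefix-match lookup over 64-bit global virtual addresses. This is a minimal Python illustration, not the design from the cited paper: the class `ProgressiveTable`, the next-step labels and the example addresses are all made up for the example.

```python
ADDR_BITS = 64


class ProgressiveTable:
    """One domain's table: address prefix -> next step, longest matching prefix wins."""

    def __init__(self):
        self.entries = {}  # (prefix_value, prefix_len) -> next-step label

    @staticmethod
    def _mask(prefix_len):
        # Keep the top prefix_len bits of a 64-bit address.
        return ((1 << prefix_len) - 1) << (ADDR_BITS - prefix_len)

    def add(self, prefix, prefix_len, next_step):
        self.entries[(prefix & self._mask(prefix_len), prefix_len)] = next_step

    def lookup(self, gva):
        # Try the most specific (longest) prefix first, as an IP router would.
        for plen in range(ADDR_BITS, -1, -1):
            step = self.entries.get((gva & self._mask(plen), plen))
            if step is not None:
                return step
        return None


if __name__ == "__main__":
    table = ProgressiveTable()
    # A big region (16-bit prefix) lives on one node of the domain...
    table.add(0x00AB_0000_0000_0000, 16, "node 1")
    # ...and one 4 KiB page inside it (52-bit prefix) has migrated to another node.
    table.add(0x00AB_0000_0000_9000, 52, "node 2")
    print(table.lookup(0x00AB_0000_0000_9ABC))  # inside the migrated page
    print(table.lookup(0x00AB_0000_0000_8ABC))  # elsewhere in the region
```

Migrating the page only required adding one more-specific entry to this domain's table; tables elsewhere, which route the whole region towards this domain, stay untouched.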

11 Scalability: Global Virtual Addresses & Progressive Translation
(same content as slide 8)

12 Current Address Translation Arch. not ready for GVAS
[Figure: Processing System (PS) with processor cores, caches, local DRAM, vdma engine and SMMU, next to the Programmable Logic (PL, FPGA). Labels: read or write addresses, virtual or physical, but truncated to 44 bits / 40 bits; translate if virtual; physical-only addresses; route based on physical address; 448 GBy window; to other PS peripheral devices.]
Address translation architecture of the A53 in Xilinx Zynq UltraScale+:
  the DMA engine is virtualized, but truncates virtual addresses to <64 b
  there is no mechanism to send the untranslated VA to the network
  (remote) load/store instructions suffer compulsory address translation

13 Why copy data between User and Kernel/Library buffers?
[Figure: sender user process (virtual addresses) and receiver user memory, with pinned kernel or library buffers and copies (cp) on both sides; the kernel issues a DMA read with physical addresses; a virtualized DMA (vdma) path; the network interfaces (NI) perform the DMA write across the network and return acks. Ideal path (zero copy): RDMA.]
Pinned buffers: traditional DMA does not tolerate page faults
Registered buffer addresses, in the absence of a System (I/O) MMU
Non-cacheable buffers, in the absence of cache-coherent DMA
Send buffer reuse immediately after send initiation
Transfer occurs before the receive buffer is ready, and the receiving application is unwilling to accept reception at a library-specified address

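14 MPI: Send-Rcv Match at Receiver, User-Specified Rcv Addr
[Figure, message sequence: the producer (user process) writes into sb and sends "ready to send; at which address?"; the consumer has posted an (early) receive, "ready to receive, but in my buffer, please!", with a pointer to rb; the match happens at the receiver and the pointer to rb goes back to the sender; the data move from sb to rb; done; the consumer (user process) reads from rb.]
Cost = one extra network round-trip time, for the sender to learn the address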

15 Snd-Rcv Match at Sender, User-Specified Receive Address
[Figure, message sequence: the consumer posts an (early) receive, "ready to receive, but in my buffer, please!", and its pointer to rb is sent to the producer; the producer (user process) writes into sb and sends; the match happens at the sender; the data move from sb to rb; done; the consumer (user process) reads from rb.]
Minimal latency achieved; an early receive (rb pre-allocated by the user) is still needed
Does not work with MPI_ANY_SOURCE

16 Receiver willing to accept Data at any Address (and not copy)
[Figure, message sequence: the producer (user process) writes into sb and sends, saying either "allocate buffer, and... here come the data" or "use rb[i] that had been preallocated for me"; the data move from sb to rb; the consumer's receive says "tell me when and where my data will be ready"; match; done; the consumer (user process) reads from rb.]
Eager delivery into a preallocated, library-managed buffer, which is then given to the user
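The three matching schemes of slides 14 to 16 differ in how many messages sit on the critical path between "the sender has its data in sb" and "the data are in rb". The Python sketch below counts them. The protocol labels, the message names, and the assumption that in the sender-side scheme the early-receive pointer has already reached the sender are simplifications introduced for this example.

```python
def critical_path(protocol):
    """Messages on the critical path, from 'data ready in sb' to 'data in rb'."""
    if protocol == "match_at_receiver":
        # Sender asks, receiver answers with the rb pointer, then the data travel.
        return ["ready-to-send", "pointer-to-rb", "data"]
    if protocol in ("match_at_sender", "eager"):
        # The destination is already known (early receive, or library buffer).
        return ["data"]
    raise ValueError(f"unknown protocol: {protocol!r}")


def latency(protocol, one_way):
    """Critical-path latency if every message costs one network traversal."""
    return len(critical_path(protocol)) * one_way


if __name__ == "__main__":
    for p in ("match_at_receiver", "match_at_sender", "eager"):
        print(p, latency(p, one_way=1.0))
```

The difference between matching at the receiver and the other two schemes is two traversals, which is the "one extra network round-trip" named on slide 14.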

17 Mbox Remote Enqueue: Asynchronous Event Multiplexing
[Figure: multiple senders (P1, P2) enqueue messages m1, m2 into a remote queue ("Mbox") in shared space; a single receiver (P3) waits on empty. Single-word messages are sent via a single store instruction; longer messages are first composed locally (memory 2, memory 3), then sent.]
Atomic enqueue into shared space, unlike dedicated space with RDMA
Space reserved only for the number of actual senders, not potential senders
Event multiplexor: requests, notifications (like the select system call)
RDMA completion should enqueue a notification to receiver and sender
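A software analogue of the Mbox behaviour can be sketched in Python with threads: several senders enqueue atomically into one shared queue and a single receiver blocks while it is empty. This only mimics the semantics; in the talk the enqueue is a hardware operation triggered by a remote store. The class `Mbox`, the helper `demo` and the `(sender, word)` message format are assumptions made for the example, and space reservation per sender is not modelled.

```python
import threading
from collections import deque


class Mbox:
    """Multi-sender, single-receiver queue with atomic enqueue."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def enqueue(self, sender, word):
        with self._cond:                 # the lock makes each enqueue atomic
            self._items.append((sender, word))
            self._cond.notify()          # wake the receiver if it is waiting

    def dequeue(self):
        with self._cond:
            while not self._items:       # "wait on empty"
                self._cond.wait()
            return self._items.popleft()


def demo(num_senders=3, per_sender=100):
    """Run several sender threads against one receiver and return what it received."""
    mbox = Mbox()

    def send(sender_id):
        for i in range(per_sender):
            mbox.enqueue(sender_id, i)

    threads = [threading.Thread(target=send, args=(s,)) for s in range(num_senders)]
    for t in threads:
        t.start()
    received = [mbox.dequeue() for _ in range(num_senders * per_sender)]
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    messages = demo()
    print(len(messages), "messages received")
```

Messages from different senders interleave in arrival order, while each sender's own messages stay in order, so the one queue multiplexes events from all senders for the receiver.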

18 Per-Task Mbox Qs & Scheduler as Interrupt Generalization
[Figure: event queues (Q21, Q11, Q01) ordered from high to low priority, each feeding a process/task (T21, T11, T01) on a processor core (P); an event now arriving raises "Interrupt! Run Scheduler"; switch task per time quantum; low-power sleep when all queues are empty.]
Tasks block and wait when synchronously reading from an empty queue
Scheduler selects a non-empty queue of highest priority and runs its task
Time quantum switches Qs and tasks inside the top non-empty priority class
Enqueues into higher-priority Qs interrupt lower-priority running tasks
RDMA is the datapath; queues are the control
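The scheduling rules on this slide can be sketched as follows. This Python model is a simplification under stated assumptions: a larger number means higher priority, each task owns exactly one event queue, and the names `QueueScheduler`, `add_task`, `pick` and `should_preempt` are invented for the example.

```python
from collections import deque


class QueueScheduler:
    """Pick the next task from per-task event queues grouped by priority."""

    def __init__(self):
        self.classes = {}  # priority -> ring (deque) of (task_name, event_queue)

    def add_task(self, priority, name):
        queue = deque()
        self.classes.setdefault(priority, deque()).append((name, queue))
        return queue

    def pick(self):
        """Return (priority, task) to run for the next quantum, or None to sleep."""
        for prio in sorted(self.classes, reverse=True):
            ring = self.classes[prio]
            for _ in range(len(ring)):
                name, queue = ring[0]
                ring.rotate(-1)          # round-robin inside one priority class
                if queue:                # only tasks with pending events are runnable
                    return prio, name
        return None                      # all queues empty: low-power sleep


def should_preempt(running_prio, arriving_prio):
    """An enqueue into a higher-priority queue interrupts the running task."""
    return arriving_prio > running_prio


if __name__ == "__main__":
    sched = QueueScheduler()
    q_low = sched.add_task(0, "T01")
    q_a = sched.add_task(2, "T21")
    q_b = sched.add_task(2, "T22")
    q_low.append("event")
    q_a.append("event")
    q_b.append("event")
    print([sched.pick() for _ in range(3)])
```

While both high-priority queues hold events, successive calls to `pick` alternate between their tasks and the low-priority task is not selected; once every queue is empty, `pick` returns `None`, the point at which the core could sleep.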

19 Conclusions
Interprocessor communication analogous to load/store instructions
No system calls: virtualized DMA engines
Need a new Address Translation Architecture
  Global Virtual Address Space (GVAS)
Notifications, Mboxes, Scheduler: generalized interrupts
Adapt libraries & APIs to user-level, zero-copy communication

20 Backup Slides

21 RDMA/wr-allocate? Software-Guided Prefetch & Coherence
[Figure: two cores with caches connected through the NoC to memory; a fetch command travels from one cache to the other and the data return; each cache line holds v, d, tag and data fields; on arrival, a hit or miss decides between "keep?" and "update?".]
Cache coherence through software: simpler HW, energy efficient
  Vassilis Papaefstathiou: ICS'13, and PhD dissertation
SW fetches data from the cache of the last producer before task execution
  prefetch while executing another task
  wait for completion
FETCH (read-only), for read-only task arguments
  override old copies at the destination
FETCH-O (with ownership), for read-write task arguments
  migrate cache-line ownership
  ensure that each cache line is dirty in at most one cache (else, write-back order is undefined)

22 Epoch Numbers: allow SW to control Cache Replacements
[Figure: a 4-way cache; each way (0 to 3) holds tags, data and a per-line epoch field (W0_epoch to W3_epoch); the destination address of an incoming RDMA packet hashes to set_i; the cache controller holds a configurable epochs policy (current epoch, epoch(s) to preserve) and performs COMPARE & DECIDE: whom to evict, if at all, versus keeping what is already in the cache.]
Use a few (e.g. 3) tag bits per cache line to record which group of cache lines each line belongs to; configure the replacement policy by groups

23 Epoch Numbers Identifying Task Lifetimes
[Figure: a core and a cache whose lines carry epoch, tags and data; current epoch 3, next epoch 4; the active epochs (prev, curr, next) form a sliding window over the epoch numbers (0, 1, ...), which wrap around; epochs behind the window are past, the window is the present, epochs ahead are future. Label: EBP.]
Scratchpad-like properties with the convenience of caches
DMA-prefetched lines go to the next epoch
The runtime advances the current epoch after tasks finish
Evict past lines first, then evict from epoch(s) with # of lines per set > their quota
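The eviction order stated above ("past lines first, then epochs over quota") can be sketched for one cache set. This is a hedged Python illustration, not the evaluated hardware policy: 3 epoch bits, an active window of previous/current/next epoch, quotas keyed by the offset from the current epoch, and the function name `choose_victim` are all assumptions.

```python
from collections import Counter

EPOCH_BITS = 3
NUM_EPOCHS = 1 << EPOCH_BITS  # epoch numbers wrap around modulo 8


def choose_victim(set_epochs, current, quota):
    """Return the way to evict from one set, or None if nothing may be evicted.

    set_epochs: epoch tag of the line in each way of the set.
    current:    current epoch number.
    quota:      max lines per set for each active epoch, keyed by offset (-1, 0, +1).
    """
    # Map each active epoch number (with wrap-around) to its offset from 'current'.
    active = {(current + off) % NUM_EPOCHS: off for off in (-1, 0, 1)}

    # 1. Lines from past epochs (outside the sliding window) go first.
    for way, epoch in enumerate(set_epochs):
        if epoch not in active:
            return way

    # 2. Otherwise evict from an active epoch that holds more lines than its quota.
    counts = Counter(set_epochs)
    for way, epoch in enumerate(set_epochs):
        if counts[epoch] > quota[active[epoch]]:
            return way

    return None  # every epoch is within quota: keep what is in the cache


if __name__ == "__main__":
    quota = {-1: 1, 0: 2, 1: 1}
    print(choose_victim([3, 3, 1, 4], current=3, quota=quota))  # a past line exists
    print(choose_victim([3, 3, 3, 4], current=3, quota=quota))  # current epoch over quota
    print(choose_victim([2, 3, 3, 4], current=3, quota=quota))  # nothing to evict
```

Returning `None` corresponds to the "whom to evict, if at all" decision on slide 22: an incoming line may simply not displace anything that the policy wants preserved.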

24 Evaluation Summary: Coher. (HW/SW), Prefetch (gradual/bulk), Epochs
Baseline: HW-coherent caches, no prefetching
Conventional HW prefetcher: perf. -14% to +20%
Explicit bulk prefetch (RDMA, SW-initiated): perf. +6% to +66%
HW coherent + Epochs:
  perf. improvement: +3% to +44% (+20%)
  L2 misses: -17% to -67% (-32%)
  NoC traffic: -2% to -56% (-25%)
  energy: -8% to -53% (-28%)
No HW coherence + Epochs:
  perf. improvement: +3% to +33% (+14%)
  L2 misses: -22% to -71% (-38%)
  NoC traffic: -24% to -72% (-41%)
  energy: -18% to -65% (-44%)

Cluster Communication Latency: towards approaching its Minimum Hardware Limits, on Low-Power Platforms. Manolis Katevenis, FORTH, Heraklion, Crete, Greece (in collaboration with Univ. of Crete). http://www.ics.forth.gr/carv/