The ExaNeSt Project: Interconnects, Storage, and Packaging for Exascale Systems

1 The ExaNeSt Project: Interconnects, Storage, and Packaging for Exascale Systems
M. Katevenis, Nikolaos Chrysos, et al.
Foundation for Research & Technology - Hellas (FORTH), on behalf of the ExaNeSt Consortium
Euromicro DSD 2016, Limassol, Aug. 31

2 The ExaNeSt Consortium
[Figure: partner logos grouped by area (Storage & data, Applications, Technology, Interconnects). Countries shown: Germany, Netherlands, Italy, UK, Greece (coordinator), Spain (UPV).]

3 What ExaNeSt is about
- ARMv8, UNIMEM Partitioned Global Address Space (PGAS): low-energy compute, low-overhead communication
- Heterogeneous: FPGA accelerators
- Working closely with ExaNoDe, EcoScale (& EuroServer)
- Network: unified compute & storage, low latency
- Storage: distributed, in-node non-volatile memories
- Extreme compute density: totally-liquid cooling
- Prototype: 1K cores, 4 TBytes DRAM, 40 TBytes SSD, 0.5 M DSP slices
- Real applications: scientific, engineering, data analytics

4 The ExaNeSt Prototype (2016-17)
- Using Xilinx Zynq UltraScale+ FPGAs
- Four 64-bit ARM cores per FPGA
- Quad FPGA Daughter Boards (QFDB): four FPGAs per QFDB
- 8 QFDBs per blade
- System: a dozen blades

5 The ExaNeSt Prototype (2016-17)
- Using Xilinx Zynq UltraScale+ FPGAs
- Four 64-bit ARM cores per FPGA
- Electronics immersed in 3M Novec liquid
- Quad FPGA Daughter Boards (QFDB): four FPGAs per QFDB
- 8 QFDBs per blade
- System: a dozen blades
- Rack-level water circulation
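The fan-out figures on this slide can be multiplied out as a quick sanity check. The sketch below is only arithmetic on the quoted numbers; it assumes "a dozen blades" means exactly 12 and that every blade is fully populated. The overview slide quotes roughly 1K cores, so the per-blade and blade-count figures may describe different build-out stages.

```python
# Per-level fan-out figures as quoted on the slide.
CORES_PER_FPGA = 4    # four 64-bit ARM cores per FPGA
FPGAS_PER_QFDB = 4    # four FPGAs per QFDB
QFDBS_PER_BLADE = 8   # 8 QFDBs per blade
BLADES = 12           # assumption: "a dozen blades" read as exactly 12


def prototype_counts(blades=BLADES):
    """Return (qfdbs, fpgas, cores) for a fully populated system."""
    qfdbs = blades * QFDBS_PER_BLADE
    fpgas = qfdbs * FPGAS_PER_QFDB
    cores = fpgas * CORES_PER_FPGA
    return qfdbs, fpgas, cores


if __name__ == "__main__":
    qfdbs, fpgas, cores = prototype_counts()
    print(f"{qfdbs} QFDBs, {fpgas} FPGAs, {cores} ARM cores")
```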

6 ExaNeSt: Unimem PGAS Memory Model
- Enables remote loads/stores to a global address space
- System-wide coherent memories without expensive hardware: only one node may cache data
- Global virtual address space
  - Resiliency: pages can move seamlessly upon node failures
  - Difficult to maintain a global page table

7 ExaNeSt Unimem Implementation
- Enables remote loads/stores to a global address space
- System-wide coherent memories without expensive hardware: only one node may cache data
- Global virtual address space
  - Resiliency: pages can move seamlessly upon node failures
  - Difficult to maintain a global page table
- ExaNeSt: pages stay within a coherence island (node)
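The "only one node may cache data" rule can be illustrated with a toy software model. This is a minimal sketch, not the Unimem hardware design: the 32-bit offset split, the `Fabric` and `Node` classes, and the write-through policy are all assumptions made for illustration. The point it shows is that when every remote load or store is served by the owning node, no other node ever holds a cached copy, so no inter-node coherence protocol is needed.

```python
from dataclasses import dataclass, field

OFFSET_BITS = 32  # assumption: low 32 bits are the node-local offset


def make_global(node_id, offset):
    """Build a global address from an owner node id and a local offset."""
    return (node_id << OFFSET_BITS) | offset


def split_global(addr):
    """Split a global address into (owner node id, node-local offset)."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)


@dataclass
class Node:
    node_id: int
    memory: dict = field(default_factory=dict)  # local DRAM: offset -> value
    cache: dict = field(default_factory=dict)   # caches this node's own data only


class Fabric:
    """Toy interconnect: routes every access to the page's owner node."""

    def __init__(self, n_nodes):
        self.nodes = [Node(i) for i in range(n_nodes)]

    # requester_id is unused by the model; it only documents who issued the access.
    def store(self, requester_id, addr, value):
        owner_id, offset = split_global(addr)
        owner = self.nodes[owner_id]
        owner.cache[offset] = value    # only the owner caches the line
        owner.memory[offset] = value   # write-through, for simplicity

    def load(self, requester_id, addr):
        owner_id, offset = split_global(addr)
        owner = self.nodes[owner_id]
        if offset not in owner.cache:
            owner.cache[offset] = owner.memory.get(offset, 0)
        return owner.cache[offset]     # the requester keeps no cached copy
```

For example, a store issued by node 1 to an address owned by node 0 is visible to a later load from node 2, while the caches of nodes 1 and 2 stay empty.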

8 ExaNeSt Package (Coherence Island): Xilinx Zynq UltraScale+
- Trenz board with a Xilinx Zynq UltraScale+ FPGA
- ExaNeSt prototype: among the first to use 64-bit ARM FPGAs
- Processing System @ 1.2 GHz: 4 Cortex-A53 ARM cores, 4.8 GFLOPS
- Plus: real-time processors (Cortex-R5), IOMMU, virtualized DMA engine
- Programmable Logic: 2.5K DSP add-mul @ 300 MHz, 250-1K GFLOPS
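The peak figures on this slide follow the usual "units x clock x operations per cycle" estimate. The sketch below reproduces the 4.8 GFLOPS processor figure under the assumption of one floating-point operation per cycle per core (the value that matches the slide, not a number stated on it), and computes the raw DSP add-mul rate. How that raw rate maps to floating-point GFLOPS depends on operand width and on how many slices each operation uses, which the slide does not state.

```python
def peak_rate_giga(units, freq_ghz, ops_per_cycle):
    """Peak rate in giga-operations per second: units x clock x ops per cycle."""
    return units * freq_ghz * ops_per_cycle


# Processing system: 4 Cortex-A53 cores at 1.2 GHz.
# Assumption: 1 FLOP per cycle per core, chosen because it reproduces 4.8 GFLOPS.
cpu_gflops = peak_rate_giga(units=4, freq_ghz=1.2, ops_per_cycle=1)

# Programmable logic: 2.5K DSP slices at 300 MHz, one add-mul per slice per cycle.
dsp_giga_addmul = peak_rate_giga(units=2500, freq_ghz=0.3, ops_per_cycle=1)

if __name__ == "__main__":
    print(f"CPU peak: {cpu_gflops:.1f} GFLOPS")
    print(f"DSP raw rate: {dsp_giga_addmul:.0f} G add-mul/s")
```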

9 ExaNeSt Node: Quad-FPGA Daughter Board (QFDB)
- 4 UltraScale+ FPGAs, all-to-all connectivity: 2 x HSS (GTH) + 16 x LVDS
- 64 GBytes DDR4: 16 GB/FPGA @ 160 Gb/s
- 512 GBytes SSD/NVMe: 4x PCIe v2 (8 GBytes/s)
- 10 HSS links to remote: 10 Gb/s per link, 16 Gb/s best case
- 120 x 130 mm2
- Currently in layout + fabrication

10 ExaNeSt Blade: Packaging and Cooling Unit
- Initially: 4 QFDBs + 2 KALEAO + 2 thermal-only DBs for tests
- Later: 16 QFDB-compatible slots
- Passive interconnect among local QFDBs: custom mesh-like network
- 32 SFP+ (cable slots) for system interconnect: 500+ Gb/s per blade
[Figure labels: Blade, Mezzanine board, QFDB, PCB HSS links, SFP+ cables]

11 Flexible System-Interconnect Topologies
- Tier 1: Blade; Tier 2: System
- Multi-level Dragonfly: QFDB, blade, system
  - Small diameter
  - High bisection
  - Few global wires
- Hybrid direct + indirect networks: Dragonfly + central routing boards
  - Segregate throughput-sensitive from latency-sensitive traffic
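The "small diameter, few global wires" property can be checked on a small example. The sketch below builds a textbook single-level balanced Dragonfly (fully connected routers inside each group, exactly one global link between every pair of groups) and measures its diameter with breadth-first search. It is not the ExaNeSt multi-level topology; the parameters `a` (routers per group) and `h` (global links per router) and the port-assignment rule are assumptions chosen for illustration.

```python
from collections import deque
from itertools import combinations


def build_dragonfly(a, h):
    """Adjacency sets for a balanced Dragonfly with a routers/group, h global links/router."""
    g = a * h + 1  # number of groups: one global link per pair of groups
    adj = {(grp, r): set() for grp in range(g) for r in range(a)}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    # Local links: routers inside a group are fully connected.
    for grp in range(g):
        for r1, r2 in combinations(range(a), 2):
            link((grp, r1), (grp, r2))

    # Global links: the k-th peer group of a group attaches to router k // h.
    def port_router(grp, peer):
        k = peer if peer < grp else peer - 1
        return k // h

    for g1, g2 in combinations(range(g), 2):
        link((g1, port_router(g1, g2)), (g2, port_router(g2, g1)))
    return adj


def diameter(adj):
    """Longest shortest path over all router pairs, by BFS from every router."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    net = build_dragonfly(a=4, h=2)
    print(len(net), "routers, diameter", diameter(net))
```

With `a=4` and `h=2` this gives 9 groups and 36 routers, each with 3 local and 2 global links; any router is reachable in at most a local, a global, and a local hop.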

12 ExaNeSt: Interconnection Network
Design goals:
- Low latency
- RDMA: true zero copy
- Flow prioritization: short (compute) vs. bulky (storage)
- Throttle congestive flows at the network edges, at DMA sources
- Resiliency: error detection/correction, link monitoring, multipath routing
- All-optical proof-of-concept switch using 2x2 / 4x4 building blocks
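One simple way to picture the "short vs. bulky" prioritization goal is a two-class strict-priority arbiter. The sketch below is a generic illustration, not the ExaNeSt mechanism: the class name `TwoClassArbiter` and the 256-byte classification threshold are hypothetical, and the slide does not say how flows are classified or scheduled.

```python
from collections import deque


class TwoClassArbiter:
    """Strict priority between short (latency-sensitive) and bulk messages."""

    def __init__(self, short_threshold_bytes=256):  # hypothetical threshold
        self.short_threshold_bytes = short_threshold_bytes
        self.short_queue = deque()
        self.bulk_queue = deque()

    def enqueue(self, msg_id, size_bytes):
        """Classify a message by size and queue it in its class."""
        if size_bytes <= self.short_threshold_bytes:
            self.short_queue.append((msg_id, size_bytes))
        else:
            self.bulk_queue.append((msg_id, size_bytes))

    def dequeue(self):
        """Return the next message to send, or None if both queues are empty."""
        if self.short_queue:
            return self.short_queue.popleft()  # short traffic always goes first
        if self.bulk_queue:
            return self.bulk_queue.popleft()
        return None
```

Strict priority keeps short messages from waiting behind storage transfers, at the cost of possibly starving bulk traffic under sustained short-message load; a weighted scheme would trade some latency for fairness.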

13 ExaNeSt Interconnect Hierarchy
[Table, flattened in extraction; entries are listed per column in source order]
- Hierarchy: Tier 4, Tier 3, Tier 2, Tier 1, Tier 0
- Levels: System; Rack/Cabinet; Backplane; Chassis; Blade/Mezzanine; Node; Unit; Package; QFDB; UltraScale+ FPGA; A53 cores
- Tech: Optical; Ethernet; APEnet; AXI Xbar; chip-to-chip AXI load/store, weak order
- Switching: Optical; Optical; AXI Xbar; Ethernet; Ethernet; Ethernet; AXI Xbar; APEnet; APEnet; APEnet; APEnet
- Fanout: T1-T2; >200 racks; 5-15 chassis; 6-24 nodes/1U; 4-16 nodes; 4 FPGAs; 4-6 SMP cores
- Bandwidth: LVDS 12x (14.4 Gbps); HSS 2x (32 Gbps); 40 Gbps
- Latency: low; 20 ns; 200 ns
- Address scheme: custom; MAC address; GAS^r; MAC to GAS partition; GAS^r partition
- Reliability: X; X; EDC; LO; FA; MO; ACK; bit corruption; OK; FAST; OK

14 ExaNeSt Storage Architecture

15 ExaNeSt Storage Architecture
- QFDBs with SSD storage
- Bring data closer to compute, inside QFDB-level SSDs

16 ExaNeSt: Per-Job On-Demand SSD Caches
- File payload: the SSDs act as a cache; on a miss, the request goes to the storage server
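The "cache on SSD, fall back to the storage server on a miss" behaviour is a read-through cache. The sketch below is a minimal in-memory model under stated assumptions: the class `JobSSDCache`, the `backend_read` callback standing in for the storage server, and the LRU eviction policy are all hypothetical, since the slide does not describe the replacement policy or the interface.

```python
from collections import OrderedDict


class JobSSDCache:
    """Read-through cache for one job: serve hits locally, fetch misses remotely."""

    def __init__(self, backend_read, capacity_blocks):
        self._backend_read = backend_read  # stands in for the storage server
        self._capacity = capacity_blocks
        self._blocks = OrderedDict()       # block key -> data, in LRU order
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        data = self._backend_read(key)     # miss: go to the storage server
        self.misses += 1
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data


if __name__ == "__main__":
    server = {("file.dat", 0): b"aaaa", ("file.dat", 1): b"bbbb"}
    cache = JobSSDCache(server.__getitem__, capacity_blocks=1)
    cache.read(("file.dat", 0))
    cache.read(("file.dat", 0))
    print(cache.hits, cache.misses)
```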

17 Applications, Traces
Main applications:
- Material science: LAMMPS
- Climate change: RegCM
- Engineering: OpenFOAM, SailFish
- Astrophysics: Gadget, Pinocchio, ChaNGa, Swift
- Neuroscience: DPSNN
- High-energy physics: LQCD
- Data analytics: MonetDB
Traces generated:
- Scalasca profiling tool: MPI calls instrumented
- Several GBytes per trace, filtered down to tens of MBytes by keeping what our network simulators will need
- Generally, to be made publicly available
Next, application porting & tuning:
- Currently porting selected applications to ARM, on the EuroServer prototype

18 Conclusions: ExaNeSt TODOs
- Optimize, integrate & evaluate core system-level components:
  - ARM/Unimem
  - Packaging & cooling
  - Interconnects
  - Distributed NVM / storage
  - Fine-tuned applications
- Large-scale optimized 64-bit ARM prototype, also leveraged by other FET-HPC projects: ExaNoDe and EcoScale ( )

19 European Exascale System Interconnect & Storage
- Interconnection network
- In-node storage
- Advanced cooling
- Real applications
Stay

20 Backup

21 ExaNeSt RDMA Operation Overview
- From user space to user space: no kernel, no copies
- No page pinning, to avoid OS overheads: destination page faults
- Source DMA channels implement rate (congestion) control
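Rate control at a source DMA channel can be pictured with a token bucket. This is a generic rate limiter swapped in for illustration; the slide does not say which mechanism the ExaNeSt DMA engines use, and the class name and parameters below are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow a transfer only if enough byte credit has accumulated."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the last refill, in seconds

    def try_send(self, size_bytes, now):
        """Refill credit for the elapsed time, then send if the credit covers it."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # caller should retry later; the flow is being throttled


if __name__ == "__main__":
    channel = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
    print(channel.try_send(1500, now=0.0))
    print(channel.try_send(1000, now=0.5))
    print(channel.try_send(1000, now=1.0))
```

A congestion signal from the network could lower `rate` for the offending channel, which throttles that flow at its source without affecting others.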

22 Potential for International Cooperations
Application Programming Interfaces (APIs) needed for taking advantage of new technologies:
- NVMs / distributed storage
- Zero-copy, user-level communication (RDMA, mailboxes)
- Congestion mitigation & resilience in networks

23 Relations with cPPP, SRA, other Projects
- cPPP: we feel part of it and a contributor to it; cPPP is necessary for our goals
- SRA: very useful for planning; some of our partners are already contributors, more to come
- Relations with other FETHPC/CoE:
  - Already within a group with ExaNoDe & EcoScale; minimal collaboration axis: low-energy (ARM), UNIMEM
  - Looking forward to widening the group on this axis
  - Looking forward to follow-up projects & EsD on the same axis
  - Application CoEs are essential for HW-SW co-design
  - Also need a CoE on HP Computing Systems Architecture & System SW

24 ExaNeSt RDMA Receive Context Table

25 ExaNeSt System Hierarchy
Columns: Hierarchy | Scale | Performance | DRAM | Storage | Max power
- Chiplet (DoA), heterogeneous CPU/GPU compute unit: 8 CPUs; up to 6x 8 GB virtualized; 15 W
- Interposer (3D-IC), 4 x Chiplet (DoA): 32 CPUs; (16 GB) 64 GB virtualized; 70 W
- Compute Node (shared I/O & accel.), 2 x Interposer plus I/O + OpenCL FPGA: 1 package; 3.5 TFLOPS, 64 CPUs; 128 GB; 140 W + 20 W for I/O
- Package ('16-17), Xilinx Zynq UltraScale+ FPGA XCZU9EG: 1 package; 250 GFLOPS, 4 ARM-53 CPUs (GPU + 2.5K DSPs); 18 GBytes DDR4; host SSD; 20 W
- CPU/GPU/DSP Package (2020+), 2+ FPGAs on MCM, new technology: 1 package; 1.5 TFLOPS, 32 CPUs; 64 GBytes HMC; 50 W
- Compute Element (DB PCB), 2 x Node: 2 packages; 7 TFLOPS, 128 CPUs; 256 GB; 6.8 TB; 320 W
- Daughter Board (new tech): 4 packages; 6 TFLOPS, 128 CPUs; 256 GB; SSD 4 TB; 200 W

26 ExaNeSt System Hierarchy
Columns: Hierarchy | Scale | Performance | DRAM | Storage | Max power
- Daughter Board (new tech): 4 packages; 6 TFLOPS, 128 CPUs; 256 GB; SSD 4 TB; 200 W
- Mezzanine (motherboard for Elements), 4 x Element: 8 packages; 28 TFLOPS, 512 CPUs; 1 TB; 27 TB; 1.28 kW plus interconnect
- Blade (deployment unit / hot-swap), 3 x Mezzanine: 24 packages; 84 TFLOPS, 1536 CPUs; 3 TB; 81 TB; 4.2 kW plus cooling
- Blade, 16 x Daughter Boards: 64 packages; 96 TFLOPS, 2K CPUs; 4 TB; 64 TB; 3.2 kW
- Chassis, 6 x Blade + 2 NetBlades: 384 packages; 576 TFLOPS, 12.2K CPUs; 24 TB; 384 TB; 25.6 kW + 5 kW cooling

27 ExaNeSt System Hierarchy
Columns: Hierarchy | Scale | Performance | DRAM | Storage | Max power
- Chassis, 6 x Blade + 2 NetBlades: 384 packages; 576 TFLOPS, 12.2K CPUs; 24 TB; 384 TB; 25.6 kW + 5 kW cooling
- Rack (metal frame), 72 Blade: 1728 packages; 6 PFLOPS, 110K CPUs; 221 TB; 5.8 PB; 324 kW + 1 kW TOR
- Rack (metal frame), 12 x Chassis: 4608 packages; 6.9 PFLOPS, 147K CPUs; 288 TB; 4.5 PB; 367 kW
- Example HPC System, 100 x Rack: 173K packages; 500 PFLOPS, 11M CPUs; 22 PB; 58 PB; 32.5 MW
- ExaScale Level, 167 x Rack: 288K packages; 1 ExaFLOPS, 18.5M CPUs; 37 PB; 1 ExaByte; 54 MW
- Example HPC System, 100 Rack: 460K packages; 690 PFLOPS, 14.7M CPUs; 28.8 PB; 450 PB; 37 MW
- Exascale, 144 x Rack: 663K packages; 1 ExaFLOPS, 21M CPUs; 41 PB; 684 PB; 53 MW
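The package counts and peak-performance figures for the chassis-based rack rows can be re-derived by multiplying up from the chassis. The sketch below uses only the chassis figures and fan-outs quoted on the slide (12 chassis per rack, 100 or 144 racks per system); the `scale` helper is a hypothetical name, and the calculation ignores power, DRAM, and storage.

```python
# Chassis figures as quoted on the slide.
CHASSIS = {"packages": 384, "tflops": 576}


def scale(unit, count):
    """Multiply every per-unit figure by the number of units."""
    return {name: value * count for name, value in unit.items()}


rack = scale(CHASSIS, 12)          # rack built from 12 chassis
example_system = scale(rack, 100)  # "Example HPC System", 100 racks
exascale = scale(rack, 144)        # "Exascale", 144 racks

if __name__ == "__main__":
    for label, level in [("rack", rack), ("100 racks", example_system),
                         ("144 racks", exascale)]:
        pflops = level["tflops"] / 1000
        print(f"{label}: {level['packages']} packages, {pflops:.1f} PFLOPS")
```

The results (4608 packages and about 6.9 PFLOPS per rack; about 460K packages and 691 PFLOPS for 100 racks; about 663K packages and 995 PFLOPS for 144 racks) agree with the rounded values in the corresponding rows.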
