Dynamical Exascale Entry Platform


1 DEEP: Dynamical Exascale Entry Platform
2nd IS-ENES Workshop on High Performance Computing for Climate Models, Toulouse, France
Estela Suarez
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/ ) under Grant Agreement n°

2 The DEEP project
- DEEP: Dynamical Exascale Entry Platform
- EU project funded by the FP7 programme: 2011 call for Exascale
- 3 projects selected: DEEP, Mont-Blanc, CRESTA
- DEEP EU funding: 8.03 M€
- Duration: 3 years, Dec 2011 to Nov 2014
- Partners; coordinator: JSC

3 Goals
- Build a prototype of an Exascale architecture with accelerators that can react autonomously (the "Booster")
- Hardware development: build a Booster with Intel Xeon Phi and the EXTOLL network; energy efficient with hot-water cooling
- Software development: resource-management system, programming environment, libraries and performance analysis tools, porting of scientific applications
(Figures: DEEP hardware architecture; DEEP software architecture)

4 DEEP Concept
- Low-to-medium scalable architectures: constellation systems, cluster systems, graphics-card-accelerated clusters (InfiniBand fat tree); suited to applications with complex communication patterns
- Highly scalable architectures: IBM Blue Gene/L, the IBM Blue Gene family, the K computer (3D/5D/6D torus); suited to applications with regular communication patterns

5 DEEP Concept (continued)
The same spectrum as slide 4: low-to-medium scalable cluster architectures versus highly scalable architectures such as the IBM Blue Gene family and the K computer.

6 Hardware Architecture
- 128 Cluster Nodes (CNs) with Intel Sandy Bridge processors
- 512 Booster Nodes (BNs) with Intel Xeon Phi
- Highly scalable code parts (HSCP) with regular communication patterns are offloaded to the Booster

7 Hot-water cooling
- Dry coolers operating at water temperatures of 40 °C and 45 °C
- DEEP Cluster: 128 CNs with 2x Sandy Bridge each
- DEEP Booster: BNs with KNC, about 0.5 PF/s
(Figure: DEEP cold plate)

8 Software Architecture

9 DEEP Programming Model
- Combine task-based and MPI programming
- Introduce highly parallel tasks implemented using MPI
- Extend OmpSs to handle these tasks and the data transfer between Cluster and Booster
- Rely on MPI for scalability
(Diagram: Cluster Nodes (CN) run MPI processes under a global OmpSs layer; a task is started on the Cluster, runs as MPI processes on a group of Booster Nodes (BN) reached through Booster Interfaces (BI) over InfiniBand, and is completed back on the Cluster; see the sketch below.)
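To make the programming model concrete, the following is a minimal sketch, written in plain C with OmpSs-style task pragmas, of how a highly scalable code part could be marked as a task whose input and output data the runtime must move between Cluster and Booster. The kernel name, the data sizes and the omission of the DEEP-specific offload clauses are assumptions for illustration only; this is not the project's actual API.

/* Minimal sketch (not DEEP's actual API): a highly scalable code part (HSCP)
 * expressed as an OmpSs-style task. The in/out clauses tell the runtime which
 * data has to travel between Cluster and Booster; the DEEP-specific offload
 * clauses are omitted. With a plain C compiler the pragmas are ignored and
 * the program simply runs serially. */
#include <stdio.h>

#define N 1024

/* Hypothetical HSCP kernel: in the DEEP model the task body would itself be
 * an MPI program spread over several Booster Nodes. */
#pragma omp task in(in_field[0;n]) out(out_field[0;n])
void hscp_kernel(const double *in_field, double *out_field, int n)
{
    for (int i = 0; i < n; ++i)
        out_field[i] = 2.0 * in_field[i];   /* stand-in for the real computation */
}

int main(void)
{
    static double a[N], b[N];

    for (int i = 0; i < N; ++i)
        a[i] = (double)i;

    hscp_kernel(a, b, N);   /* creates the task; the runtime may off-load it */
    #pragma omp taskwait    /* wait until the task and its output data are back */

    printf("b[10] = %f\n", b[10]);
    return 0;
}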

10 Advantages of DEEP with respect to standard accelerated architectures
- Booster Nodes do not need a host CPU
- Direct communication between BNs through EXTOLL
- Flexible assignment of groups of Cluster Nodes and Booster Nodes possible (also dynamically)
- Large blocks of code can be sent to the Booster, minimising the communication between CPU and accelerator
- Booster Nodes run standard Linux
- I/O is done through the Cluster

11 DEEP Pilot Applications
Scientific applications:
- Brain simulation (EPFL)
- Space weather simulation (KULeuven)
- Climate simulation (CYI)
- Computational fluid engineering (CERFACS)
- High temperature superconductivity (CINECA)
- Seismic imaging (CGGVS)
Goals:
- Evaluate the DEEP concept and its programmability
- Compare its performance with standard architectures
- Create best-practice guidelines
- Propose improvements to the DEEP system

12 Climate Simulation (CYI): Structure
Two coupled models: ECHAM (atmosphere) and MESSy (physicochemical interactions); EMAC = ECHAM/MESSy Atmospheric Chemistry.
Processing chain: 3D transpositions, inverse Legendre transforms, 3D transpositions, inverse Fourier transforms, 3D transpositions, transport core calculations with 4D transpositions, grid-point calculations (local: physics, chemistry, where MESSy and its chemistry submodel MECCA are called), direct Fourier transforms, 3D transpositions, direct Legendre transforms, 3D transpositions, spectral integration (nonlocal: meteorology).

13 Climate Simulation (CYI): Analysis
(Chart: percentage of total execution time spent in computation versus communication for the application phases Total, MECCA, MESSy, global end and Others.)

14 Climate Simulation (CYI): Division
(Diagram: the EMAC processing chain, consisting of 3D transpositions, inverse and direct Legendre and Fourier transforms, transport core calculations with 4D transpositions, grid-point calculations (local: physics, chemistry, MESSy, MECCA) and spectral integration (nonlocal: meteorology), partitioned between Cluster and Booster.)

15 DEEP Status (I)
Project is now in month 14.
Hardware status:
- DEEP Cluster installed at JSC (Jülich)
- 3x Intel Xeon Phi workstations installed at JSC
- Booster backplane design finished
- BNC prototype ready and under test
(Figures: DEEP Cluster; BNC (Booster Node Card) prototype)

16 DEEP Status (II)
Software status:
- Programming model definition completed: Global MPI + OmpSs
- OmpSs runtime (Nanos++) ported to MIC
- ParaStation MPI port to MIC and EXTOLL ongoing
- Low-level Cluster-Booster protocol implemented
- Mathematical libraries under evaluation
(Figure: DEEP software architecture)
Scientific applications:
- Structure of the applications analysed
- First approach to the application division (between Cluster and Booster) already defined
- First experiences with Intel Xeon Phi (KNC) gathered with the space weather application

17 Thank you for your attention
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/ ) under Grant Agreement n°

18 Backup slides

19 Concurrency Levels
Code consists of a scalar part $s$ and a parallel part $p$ (with $s + p = 1$).
- Strong scaling (Amdahl): $S_N = \dfrac{1}{s + p/N}$
- Weak scaling (Gustafson): $W = s + pN$
- Two concurrency levels, $K$ and $N$, with an acceleration factor $f$: $S_{K,N} = \dfrac{1}{s + \dfrac{p}{K f N}}$
The slide evaluates these for parallel fractions $p = 0.50$ (with $f = 1$) and $p = 0.95$ (with $f = 4$) at large values of $N$ and $K$; a small worked example follows below.
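As a worked example of these formulas, the following sketch evaluates the three expressions in C. The parameter values (p, N, K, f) are illustrative assumptions, not the partly unreadable numbers from the original slide.

/* Sketch: evaluate the scaling formulas from the slide for assumed values. */
#include <stdio.h>

static double strong_scaling(double p, double n)            /* Amdahl */
{
    return 1.0 / ((1.0 - p) + p / n);
}

static double weak_scaling(double p, double n)              /* Gustafson */
{
    return (1.0 - p) + p * n;
}

static double two_level_scaling(double p, double k, double f, double n)
{
    return 1.0 / ((1.0 - p) + p / (k * f * n));
}

int main(void)
{
    double p = 0.95, n = 10000.0, k = 100.0, f = 4.0;       /* assumed values */

    printf("strong scaling  S_N     = %.1f\n", strong_scaling(p, n));
    printf("weak scaling    W       = %.1f\n", weak_scaling(p, n));
    printf("two levels      S_{K,N} = %.1f\n", two_level_scaling(p, k, f, n));
    return 0;
}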

20 EXTOLL performance comparison
Columns: technology; implementation; internal clock frequency; link bandwidth; latency; effective maximum bandwidth; message rate.
- EXTOLL VENTOUX: Xilinx Virtex6 FPGA; 200 MHz; 16 Gb/s; 1.2 µs; up to 1.4 GB/s; ~25 million msg/s
- EXTOLL TOURMALET: 65 nm ASIC; 750 MHz; 120 Gb/s; 0.6 µs; up to 10 GB/s; ~100 million msg/s
- IB FDR*: 45 nm ASIC; clock unknown; 56 Gb/s; 0.8 µs; not measured; not measured
- Typical 1GE: ASIC based; e.g. 125 MHz; 1 Gb/s; e.g. 40 µs; 0.11 GB/s; ~0.5 million msg/s
- 10GE: ASIC based; ~125 to 312 MHz; 12.5 Gb/s; 12.5 µs; 1.2 GB/s; < 2.5 million msg/s
- Cray Gemini: 90 nm ASIC; 650/800 MHz; 75 Gb/s; 1.5 µs; up to 5.9 GB/s; ~2 million msg/s
- Tianhe-1A: ASIC based; clock unknown; 80 Gb/s; 2.5 µs; up to 6.34 GB/s; ~1-3 million msg/s
- TOFU (K computer): 65 nm ASIC; 50 Gb/s; 1.5 µs; up to 4.76 GB/s; > 8 million msg/s
* Mellanox literature data

21 InfiniBand vs. EXTOLL
(Diagram: Booster Interfaces (BI) connecting groups of Booster Nodes (BN) via either fabric.)
InfiniBand:
- Low latency (1-2 µs)
- Large bandwidth (32 Gbit/s)
- Clos topology (fat tree), 36-port switches
- Very flexible, but not so scalable
EXTOLL:
- Very low latency (< 1 µs)
- Large bandwidth (32 Gbit/s)
- 3D torus topology, 10-port switches
- Highly scalable, but low flexibility
A rough cost-model comparison based on these figures is sketched below.
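To show how the quoted latency and bandwidth figures combine for different message sizes, here is a sketch of a simple linear (Hockney-style) cost model. The per-fabric latency values are assumptions picked from the ranges on the slide, and real interconnects deviate from such a linear model, so the output is only a rough illustration.

/* Sketch: linear latency/bandwidth cost model with the slide's figures as
 * illustrative inputs; real fabrics deviate from this simple model. */
#include <stdio.h>

static double transfer_time_us(double latency_us, double bandwidth_gbit_s,
                               double msg_bytes)
{
    double bytes_per_us = bandwidth_gbit_s * 1e9 / 8.0 / 1e6;   /* bytes per µs */
    return latency_us + msg_bytes / bytes_per_us;
}

int main(void)
{
    double sizes[] = { 64.0, 4096.0, 1048576.0 };   /* message sizes in bytes */

    for (int i = 0; i < 3; ++i) {
        /* 32 Gbit/s links for both fabrics; ~1.5 µs for InfiniBand and
         * ~0.8 µs for EXTOLL are assumed mid-range values. */
        printf("%8.0f B: InfiniBand %8.2f us, EXTOLL %8.2f us\n", sizes[i],
               transfer_time_us(1.5, 32.0, sizes[i]),
               transfer_time_us(0.8, 32.0, sizes[i]));
    }
    return 0;
}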

22 Application startup in detail
- The application's main() part runs on Cluster Nodes (CN) only
- Resources are managed statically or dynamically
- OmpSs acts as an abstraction layer
- The actual spawn is done via Global MPI
- The spawn is a collective operation of the Cluster processes
- Highly scalable code parts (HSCP) utilise multiple Booster Nodes (BN)
(Diagram: application main() part, highly scalable code part, OmpSs, Global MPI, resource management.)
A plain-MPI sketch of such a collective spawn follows below.
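Since the slide states that the spawn is a collective operation of the Cluster processes carried out via Global MPI, the following plain-MPI sketch illustrates the underlying idea with MPI_Comm_spawn. The executable name and the number of Booster processes are hypothetical, and DEEP's Global MPI and OmpSs layers hide this step from the application.

/* Illustrative sketch only: collective spawn of Booster processes from
 * Cluster processes, expressed with plain MPI_Comm_spawn. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm booster_comm;   /* intercommunicator to the spawned HSCP processes */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective over all Cluster processes: launch 512 Booster processes
     * running the highly scalable code part ("hscp_binary" is hypothetical). */
    MPI_Comm_spawn("hscp_binary", MPI_ARGV_NULL, 512, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &booster_comm, MPI_ERRCODES_IGNORE);

    if (rank == 0)
        printf("Booster part spawned; communicate via booster_comm\n");

    /* ... exchange data with the Booster side over the intercommunicator ... */

    MPI_Comm_disconnect(&booster_comm);
    MPI_Finalize();
    return 0;
}

On the spawned side, the processes obtain the matching intercommunicator via MPI_Comm_get_parent() and can then exchange data with the Cluster processes.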
