Systems Architectures towards Exascale

1 Systems Architectures towards Exascale D. Pleiter German-Indian Workshop on HPC Architectures and Applications Pune 29 November 2016

2 Outline Introduction Exascale computing Technology trends Architectures today and tomorrow Power challenges Balanced system challenge Summary and Conclusions 2/38

3 Introduction 3/38

4 Forschungszentrum Jülich JSC One of Europe's largest interdisciplinary research centres; 5,000 employees Special expertise in physics, materials science, nanotechnology, neuroscience and medicine, and information technology Leader in various European HPC projects, including PRACE 4

5 HPC Infrastructure at JSC: Dual track concept
General-Purpose Cluster track: IBM Power 4+ JUMP, 9 TFlop/s; IBM Power 6 JUMP, 9 TFlop/s; Intel Nehalem JUROPA, 300 TFlop/s; JURECA ~ 2 PFlop/s + Booster ~ 10 PFlop/s
Highly Scalable System track: IBM Blue Gene/L JUBL, 45 TFlop/s; IBM Blue Gene/P JUGENE, 1 PFlop/s; IBM Blue Gene/Q JUQUEEN, 5.9 PFlop/s; JUQUEEN successor ~ 50 PFlop/s
File Server: Lustre, GPFS
Timeline: petascale, pre-exascale, exascale

6 Jülich Supercomputing Centre
Supercomputer operation for
  Centre: Forschungszentrum Jülich
  Regional: Jülich-Aachen Research Alliance (JARA)
  Helmholtz & National: NIC, GCS
  Europe: PRACE, EU projects
Education and Training
Application support
User support
Peer review support and coordination
Research and development
  Algorithms and tools
  Exascale architectures and technologies

7 HPC Research at JSC Communities Exascale Labs Simulation Labs Algorithms and tools 7/38

8 Computational Science Research Simulation Laboratories Teams of 3-10 people 9 Labs established Simulation Laboratories concept Research in given science domain Computational science expertise in house Provider of services and support to science domain Provide advanced service Interface to science domains Contribute to enabling co-design Examples for support activities Code parallelisation or porting Code re-factorisation Community code maintenance 8/38

9 Exascale Computing 9/38

10 What is exascale computing?
Answer 1: Running HPL at 1 EFlop/s = $10^{18}$ Flop/s
  HPL = High-Performance LINPACK
HPL benchmark
  Solve a dense $N \times N$ system of linear equations $A x = b$
  Rules:
    Algorithm: LU factorization with partial pivoting
    Double precision
    Problem size can be freely chosen
    Freedom to look for your sweet spot
Top500 listing
  Ranking of systems according to floating-point operation throughput
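The HPL rules can be illustrated with a minimal pure-Python sketch. It is not the HPL code itself: it runs Gaussian elimination with partial pivoting (the LU factorization applied directly to the right-hand side) on a tiny system, and estimates the run time of a large problem from the leading-order operation count of about (2/3) N^3. The function names and the problem size in the example are illustrative assumptions.

```python
def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest |a[i][k]| (i >= k) to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def hpl_flops(n):
    """Leading-order operation count of the factorization (approximation)."""
    return 2.0 / 3.0 * n ** 3


def hpl_runtime_seconds(n, flops_per_second):
    """Idealised run time if the machine sustained flops_per_second."""
    return hpl_flops(n) / flops_per_second


solution = lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
# Hypothetical problem size N = 10 million on a machine sustaining 1 EFlop/s.
runtime = hpl_runtime_seconds(1.0e7, 1.0e18)
```

Because the cost grows with N^3 while the memory footprint grows with N^2, the free choice of problem size lets a site pick the N that fills memory and amortises communication best.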

11 Top500 performance trends Performance doubles every 12.9 months Exascale at 2020? 11/38
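The question "Exascale at 2020?" follows from a simple extrapolation of the doubling time. The sketch below assumes a starting point of about 93 PFlop/s HPL performance for the top system in November 2016; that figure is not taken from the slides and is labelled as an assumption in the code.

```python
import math


def months_to_reach(current, target, doubling_months=12.9):
    """Months until `target` is reached if performance doubles every
    `doubling_months` months, starting from `current` (same units)."""
    return doubling_months * math.log2(target / current)


# Assumption (not from the slides): top HPL result in Nov 2016 ~ 93 PFlop/s.
months = months_to_reach(93.0e15, 1.0e18)
years = months / 12.0
```

With these inputs the extrapolation gives between 44 and 45 months, which lands in 2020 when counted from late 2016.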

12 What is exascale computing? (cont.) Answer 2: Future generation supercomputers enabling new science Exascale challenges [PRACE Scientific Case, 2012] Weather, climatology and solid Earth Sciences Astrophysics, high-energy physics and plasma physics Materials science, chemistry and nano-science Engineering sciences and industrial applications Life sciences and medicine 12/38

13 What is exascale computing? (cont.) Exascale computer = x of today's petascale supercomputers Flop/s metric will become less relevant Other capabilities (application view) Support for less regular computational tasks Significantly larger memory footprint Extreme data processing capabilities Improved/optimized data transport capabilities Scalable visualisation capabilities Management of complex work-flows... 13/38

14 Exascale application highlights
Structure of matter
  Goal: Computing first-principles results from the theory of strong interaction: Quantum Chromodynamics (QCD)
  Numerical simulations enabled through formulation on the lattice
Materials science
  Calculations based on the Density Functional Theory method
  Major tool for exploring properties of materials
    Phase-change materials
    Dilute magnetic semiconductors
  Challenge: Need many atoms but application complexity typically scales as $O(N_\mathrm{atom}^3)$
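The consequence of cubic scaling can be made concrete with a few lines of arithmetic. This is a sketch under the assumption that the cost is exactly proportional to the cube of the atom count; the function names are illustrative.

```python
def cost_factor(atom_ratio, exponent=3):
    """Relative cost when the number of atoms grows by `atom_ratio`."""
    return atom_ratio ** exponent


def atoms_gain(compute_ratio, exponent=3):
    """Factor by which the atom count can grow for `compute_ratio` more compute."""
    return compute_ratio ** (1.0 / exponent)


doubling_cost = cost_factor(2)          # twice the atoms costs 8 times more
exa_over_peta = atoms_gain(1000.0)      # 1 EFlop/s vs 1 PFlop/s: about 10x atoms
```

A thousandfold increase in compute therefore buys only about ten times as many atoms, which is why the slide lists system size as a challenge.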

15 Exascale application highlights (cont.)
Radio astronomy
  Very large radio telescope project
  Sites: Australia and South Africa
  Increasing relevance of HPC for radio-astronomy
    Needed to analyse data coming from many sources
    But: relatively small power envelope
  Key HPC systems for SKA
    Central Signal Processor
    Science Data Processor
  [Image: By SKA Project Development Office and Swinburne Astronomy Productions - Swinburne Astronomy Productions for SKA Project Development Office, CC BY 3.0]
Brain research
  Goals:
    Generate high-resolution brain atlases
    Improve the understanding of the human brain by means of models
  Challenges:
    Extreme scale data volumes
    Memory footprint limitations
  [M. Diesmann, 2013] [S. Lefranc et al., 2016]

16 Technology trends 16/38

17 Technology trends: Moore's Law Observation Time evolution of optimal manufacturing costs for integrated circuits results in exponential increase of number of components per circuit Term tends to be abused for any exponential growth [G. Moore, 1965] [M. Bohr, 2007] 17/38

18 Technology trends: Dennard scaling
Dennard scaling made it possible to change the following parameters at constant power:
  Increase transistor density (Moore's law)
  Increase clock frequency
  Reduce supply voltage
Increase performance = increase transistor density
  Makes it possible to implement, e.g., more cores
  Trend towards increasing parallelism
[L. Chang et al., 2010]

19 Technology trends: Rent's rule
Rent's rule: $T = k\,G^p$
  $G$: Number of logic elements (gates)
  $T$: Number of edge connections (terminals)
  $k$: Rent's coefficient
  $p$: Rent's exponent, where typically $p < 1$
Difficult to balance communication and compute
Strategy for mitigating the problem: Memory hierarchy
  Multiple fast but small on-chip memories (cache)
  Slower but larger off-chip memory
Trend towards deeper memory hierarchies
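Why Rent's rule makes it hard to balance communication and compute can be seen by evaluating it for growing gate counts. The sketch below uses illustrative values k = 2.5 and p = 0.6; these are assumptions chosen only to show the trend, not values from the slides.

```python
def rent_terminals(gates, k, p):
    """Number of edge connections T = k * G**p for a block of `gates` gates."""
    return k * gates ** p


def terminals_per_gate(gates, k, p):
    """External connections available per logic element."""
    return rent_terminals(gates, k, p) / gates


# Illustrative (hypothetical) parameters: k = 2.5, p = 0.6.
small_chip = terminals_per_gate(1.0e6, 2.5, 0.6)
large_chip = terminals_per_gate(1.0e9, 2.5, 0.6)
```

For any exponent below one, the number of terminals per gate shrinks as the gate count grows, so off-chip bandwidth per unit of logic falls and data has to be kept closer to the compute units.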

20 Architectures today and tomorrow 20/38

21 Highly-Scalable Blue Gene/Q: JUQUEEN
Highly parallel architecture by design
  Micro-architecture level (SIMD, SMT)
  Processor level
  System level
JUQUEEN = 28 Blue Gene/Q racks
  PowerPC ISA + extensions
  GFlop/s per node
  28,672 nodes
  5.9 PFlop/s peak
Tight integration through on-chip network
Performance ranking (November 2016)
  Top500: rank #19
  Green500: rank #85
  Graph500: rank #5

22 Pilot Systems for Human Brain Project
IBM+NVIDIA pilot system JURON
  18 Minsky servers, each with
    2 IBM POWER8 processors
    4 NVIDIA P100 GPUs
    1 NVMe card
  Full fat tree IB EDR
Cray pilot system JULIA
  60 compute nodes, each with 1 Intel Knights Landing processor
  Pruned OPA network
  2 DataWarp nodes with 2 NVMe cards each

23 Exascale Swim Lane: Xeon Phi
KNL processor features
  64-72 simple cores operating at moderate clock speed
  Dual wide SIMD pipelines (VPU)
  Bfp 3 TFlop/s
Hierarchical memory architecture
  High-bandwidth MCDRAM (Bmem 450 GByte/s, Cmem = 16 GByte)
  Large capacity DDR4 (Bmem 115 GByte/s, Cmem 384 GByte)
Integration approach
  Single-socket nodes
  16,000 nodes
  Bfp 50 PFlop/s
Pruned network topologies
  Dragonfly or similar

24 Exascale Swim Lane: POWER + GPU
Heterogeneous architecture
Features
  Most arithmetic operations on GPU
    2x POWER8: ~0.5 TFlop/s
    4x P100: 19 TFlop/s
  Integration of multiple memory tiers
    HBM + DDR (+ NVRAM)
  Efficient data transport channels
Integration
  Fat nodes
  2,500 nodes
  Bfp 50 PFlop/s
  Full fat-tree topologies
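The two approaches reach a similar system-level throughput with very different node counts. The sketch below multiplies the per-node figures quoted on the two slides by the quoted node counts; it ignores everything except peak floating-point throughput, and the function name is illustrative.

```python
def system_peak_pflops(nodes, node_tflops):
    """Aggregate peak in PFlop/s from node count and per-node TFlop/s."""
    return nodes * node_tflops / 1000.0


# Xeon Phi approach: 16,000 single-socket nodes at 3 TFlop/s each.
knl_system = system_peak_pflops(16_000, 3.0)
# POWER + GPU approach: 2,500 fat nodes, 2x POWER8 (~0.5) + 4x P100 (19) TFlop/s.
gpu_system = system_peak_pflops(2_500, 0.5 + 19.0)
```

Both come out close to the 50 PFlop/s quoted on the slides (48 and 48.75 PFlop/s), while the node counts differ by a factor of more than six. That difference is what drives the choice between pruned and full fat-tree network topologies.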

25 Node Level Comparison
                                                Blue Gene/Q   KNL     2*POWER8 + 4*P100
Floating-point throughput (DP) Bfp [GFlop/s]    205           2,662   19,038
Bfp [Flop/cycle]                                128           2,048   14,336
Further rows (values missing): Memory bandwidth Bmem [GByte/s]; Memory capacity Cmem [GByte]; Network node bi-section bandwidth Bnet [Gbit/s]: 28 ~
KNL: Assuming a Xeon Phi 7230 with single OPA port
2*POWER8 + 4*P100: Considering only floating-point operations of P100
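The two floating-point rows of the node-level comparison are linked by the clock frequency: throughput in GFlop/s divided by Flop/cycle gives the implied clock in GHz. The sketch below does this division for the three node types, using only the numbers from the comparison; the dictionary layout is an illustrative choice.

```python
# (GFlop/s, Flop/cycle) per node, as given in the node-level comparison.
NODES = {
    "Blue Gene/Q": (205.0, 128.0),
    "KNL": (2662.0, 2048.0),
    "2*POWER8 + 4*P100": (19038.0, 14336.0),
}


def implied_clock_ghz(gflops, flop_per_cycle):
    """Clock frequency in GHz implied by throughput and Flop/cycle."""
    return gflops / flop_per_cycle


clocks = {name: implied_clock_ghz(*values) for name, values in NODES.items()}
```

All three nodes run at moderate clock speeds between roughly 1.3 and 1.6 GHz, so the two orders of magnitude in throughput come almost entirely from more operations per cycle, that is, from parallelism.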

26 Power Challenges 26/38

27 Power challenge
Green500: Rank supercomputers according to power efficiency
  Supercomputer = System listed in TOP500
  Metric = Floating-point performance vs. power consumption [Flop/s/W]
  Performance = HPL performance (as for TOP500)
Current number #1: MFlop/s/W
  This is equivalent to 106 pJ/Flop
Exascale goal: keep below 20 MW
  This is equivalent to 20 pJ/Flop
Caveat: naïve approach
  HPL load is not representative
  Green500 does not cover full system
  Need improvement
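The conversions between power efficiency and energy per operation are plain unit arithmetic: 1 W = 1 J/s, so Flop/s/W is the inverse of J/Flop. A minimal sketch, with illustrative function names:

```python
def pj_per_flop_from_budget(power_watt, flops_per_second):
    """Energy per operation in pJ for a given power budget and throughput."""
    return power_watt / flops_per_second * 1.0e12


def pj_per_flop_from_efficiency(mflops_per_watt):
    """Energy per operation in pJ from an efficiency in MFlop/s/W."""
    return 1.0e12 / (mflops_per_watt * 1.0e6)


def efficiency_from_pj_per_flop(pj_per_flop):
    """Efficiency in MFlop/s/W from an energy per operation in pJ."""
    return 1.0e6 / pj_per_flop


exascale_target = pj_per_flop_from_budget(20.0e6, 1.0e18)   # 20 MW at 1 EFlop/s
current_efficiency = efficiency_from_pj_per_flop(106.0)     # from 106 pJ/Flop
required_efficiency = efficiency_from_pj_per_flop(20.0)
```

The 20 MW goal corresponds to 20 pJ/Flop, or 50,000 MFlop/s/W. Measured against 106 pJ/Flop, that requires an efficiency gain of about a factor of five.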

28 Power challenge: Green500 28/38

29 Power challenge: Data movement costs
Comparison [S.W. Keckler et al., 2011]
                                 NVIDIA GPU as of 2010   Projection NVIDIA GPU
Process technology               40 nm                   10 nm
Voltage (VDD)                    0.9 V                   0.65 V
Frequency                        1.6 GHz                 2 GHz
DP-FMA energy                    50 pJ                   6.5 pJ
Wire energy (256 bits, 10 mm)    310 pJ                  150 pJ
Data movement costs will dominate even more
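The conclusion about data movement follows from the ratio of wire energy to arithmetic energy in the two columns of the comparison. The sketch below computes that ratio from the quoted numbers; the function name is illustrative.

```python
def movement_to_compute_ratio(wire_pj, fma_pj):
    """Energy to move 256 bits over 10 mm relative to one DP-FMA."""
    return wire_pj / fma_pj


ratio_2010 = movement_to_compute_ratio(310.0, 50.0)       # 40 nm GPU
ratio_projected = movement_to_compute_ratio(150.0, 6.5)   # 10 nm projection
```

The arithmetic energy shrinks by almost a factor of eight while the wire energy only halves, so the ratio grows from about 6 to about 23.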

30 Balanced System Challenge 30/38

31 Challenge: Keeping system balanced
Simple machine model comprising
  Storage devices
  Data transport/processing devices
Information exchange $I_k^{x,y}(W)$
  Information exchanged between storage device $x$ and $y$
[Figure: arithmetic unit, register file R, memory bus, memory M]
Simple performance model
  Latency-bandwidth model
  $\Delta t_\mathrm{fp}(W) = \lambda_\mathrm{fp} + I_\mathrm{fp}(W) / \beta_\mathrm{fp}$
  $\Delta t_\mathrm{mem}(W) = \lambda_\mathrm{mem} + I_\mathrm{mem}(W) / \beta_\mathrm{mem}$

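32 Keeping system balanced (cont.)
Balance condition
  $\Delta t_\mathrm{mem}(W) \approx \Delta t_\mathrm{fp}(W)$
  $\lambda_\mathrm{mem} + I_\mathrm{mem}(W) / \beta_\mathrm{mem} \approx \lambda_\mathrm{fp} + I_\mathrm{fp}(W) / \beta_\mathrm{fp}$
Balanced architecture
  $I_\mathrm{fp} / I_\mathrm{mem} \approx \beta_\mathrm{fp} / \beta_\mathrm{mem}$
  $I_\mathrm{fp} / I_\mathrm{mem}$: Arithmetic Intensity
  $\beta_\mathrm{fp}$: Cheap to increase
  $\beta_\mathrm{mem}$: Expensive to increase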
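The latency-bandwidth model and the balance condition can be tried out on a concrete kernel. The sketch below assumes a DAXPY-like update y = a*x + y in double precision (2 floating-point operations and 24 bytes of memory traffic per element) and uses the KNL figures quoted earlier in the talk (3 TFlop/s, 450 GByte/s for MCDRAM). The kernel choice, the function names and the zero default latencies are assumptions made for illustration.

```python
def transfer_time(latency, amount, bandwidth):
    """Latency-bandwidth model: delta_t = lambda + I / beta."""
    return latency + amount / bandwidth


def arithmetic_intensity(i_fp, i_mem):
    """Operations performed per byte moved, I_fp / I_mem."""
    return i_fp / i_mem


def machine_balance(beta_fp, beta_mem):
    """Operations the hardware can perform per byte it can move."""
    return beta_fp / beta_mem


def limiting_device(i_fp, i_mem, beta_fp, beta_mem, lat_fp=0.0, lat_mem=0.0):
    """Return which device takes longer under the model."""
    t_fp = transfer_time(lat_fp, i_fp, beta_fp)
    t_mem = transfer_time(lat_mem, i_mem, beta_mem)
    return "memory" if t_mem > t_fp else "floating-point"


n = 1_000_000                    # vector length (hypothetical)
i_fp, i_mem = 2.0 * n, 24.0 * n  # Flop and Byte for y = a*x + y
kernel_ai = arithmetic_intensity(i_fp, i_mem)
knl_balance = machine_balance(3.0e12, 450.0e9)
bottleneck = limiting_device(i_fp, i_mem, 3.0e12, 450.0e9)
```

The kernel offers about 0.08 Flop/Byte while the hardware would need about 6.7 Flop/Byte to be balanced, so in this model the kernel is limited by memory bandwidth.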

33 Technology trend: memory DDR-SDRAM Density scales according to Moore's law Very moderate increase in bandwidth High-bandwidth memory technologies Examples: GDDR5, HBM, HMC High bandwidth, low capacity Example: NVIDIA Pascal GPU Non-volatile memory technologies Examples: NAND flash, PCM Low bandwidth, high capacity 33/38

34 Memory capacity vs. bandwidth
Consider $R = C_\mathrm{mem} / B_\mathrm{mem}$
Typical values of $R$
  High-bandwidth memory: O(10 ms)
  DDR: O(1 s)
  NVMe card: O(1000 s)
No established balance conditions
But: deeper memory hierarchies unavoidable
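The ratio of capacity to bandwidth is the time needed to read or write the whole memory once. The sketch below evaluates it for the two KNL memory tiers quoted earlier in the talk (MCDRAM: 16 GByte at 450 GByte/s; DDR4: 384 GByte at 115 GByte/s); the function name is illustrative.

```python
def capacity_bandwidth_ratio(capacity_gbyte, bandwidth_gbyte_per_s):
    """R = C_mem / B_mem in seconds: time to stream through the full capacity."""
    return capacity_gbyte / bandwidth_gbyte_per_s


r_mcdram = capacity_bandwidth_ratio(16.0, 450.0)   # high-bandwidth memory
r_ddr4 = capacity_bandwidth_ratio(384.0, 115.0)    # large-capacity DDR4
```

The results, a few tens of milliseconds for MCDRAM and a few seconds for DDR4, match the orders of magnitude listed for high-bandwidth memory and DDR.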

35 Hierarchical storage architectures European FETHPC project 10 academic + commercial partners Goal: Enable hierarchical storage architecture Co-design required to address usability challenge Target 5 tiers Tier 1: volatile memory Tier 2-3: non-volatile memory Tier 4-5: spinning disks Explore in-storage compute 35/38

36 Thinking hierarchical: An example
Retention time analysis [El Sayed, 2016]
  Use case analysis based on classification of data objects produced/consumed during HPC work-flow
Retention time classes
  Transient (Temporary): Data discarded on simulation completion or when later processing steps are concluded
  Short-term (Campaign): Data used throughout the execution of the scientific work-flow
  Permanent (Forever): Data outliving the machine used to generate it
Natural mapping onto storage hierarchy
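One way to picture the mapping of retention classes onto a storage hierarchy is a simple lookup from class to tier. The class names below come from the slide; the assignment of classes to tiers is a hypothetical illustration and is not taken from the talk or from [El Sayed, 2016].

```python
from enum import Enum


class Retention(Enum):
    TRANSIENT = "Temporary"
    SHORT_TERM = "Campaign"
    PERMANENT = "Forever"


# Hypothetical assignment, for illustration only: shorter-lived data is
# placed on faster, smaller tiers; longer-lived data on larger, slower ones.
TIERS_FOR = {
    Retention.TRANSIENT: (1,),      # volatile memory
    Retention.SHORT_TERM: (2, 3),   # non-volatile memory
    Retention.PERMANENT: (4, 5),    # spinning disks
}


def candidate_tiers(retention):
    """Return the tiers a data object of this class would be placed on."""
    return TIERS_FOR[retention]


checkpoint_tiers = candidate_tiers(Retention.TRANSIENT)
```

In a real system the placement would also depend on object size and access pattern, which this sketch leaves out.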

37 Summary and Conclusions 37/38

38 Summary and conclusions Significant efforts required towards next levels of supercomputing Strong technology constraints Several swim lanes towards exascale emerging Application demands for exascale computing Examples: materials sciences, brain research Need for application driven co-design approach Helps to make right design trade-off decisions Guides applications towards exascale enablement 38/38
