Exascale: challenges and opportunities in a power-constrained world
Carlo Cavazzoni, c.cavazzoni@cineca.it
SuperComputing Applications and Innovation Department, CINECA
CINECA
CINECA is a non-profit consortium made up of 70 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
PRACE
The mission of PRACE (Partnership for Advanced Computing in Europe) is to enable high-impact scientific discovery and engineering research and development across all disciplines, enhancing European competitiveness for the benefit of society. PRACE seeks to realize this mission by offering world-class computing and data management resources and services through a peer-review process. PRACE also seeks to strengthen European industrial users of HPC through various initiatives. PRACE has a strong interest in improving the energy efficiency of computing systems and reducing their environmental impact.
http://www.prace-ri.eu/call-announcements/
http://www.prace-ri.eu/prace-resources/
SuperComputing Applications and Innovation
Accelerate scientific discovery by providing high-performance computing resources, data management and storage systems, tools, HPC services and expertise at large.
Develop and promote technical and scientific services related to high-performance computing for the Italian and European research community.
Enable world-class scientific research by operating and supporting leading-edge supercomputing technologies and by managing a state-of-the-art and effective environment for the different scientific communities.
Provide support and consultancy in HPC tools and techniques and in several scientific domains, such as physics, particle physics, material sciences, chemistry and fluid dynamics.
FERMI
Name: Fermi
Architecture: BlueGene/Q (10 racks)
Processor type: IBM PowerA2 @ 1.6 GHz
Computing nodes: 10,240, each with 16 cores and 16 GB of RAM
Computing cores: 163,840
RAM: 1 GByte/core (163 TByte total)
Internal network: 5D Torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
Power consumption: 820 kW
High-end system, only for extremely scalable applications. No. 7 in the Top500 list (June 2012). National and PRACE Tier-0 calls.
GALILEO
Name: Galileo
Model: IBM/Lenovo NeXtScale, x86-based system for production runs of medium-scalability applications
Processor type: Intel Xeon Haswell @ 2.4 GHz
Computing nodes: 516, each with 16 cores and 128 GB of RAM
Computing cores: 8,256
RAM: 66 TByte
Internal network: InfiniBand 4x QDR switches (40 Gb/s)
Accelerators: 768 Intel Phi 7120P (2 per node on 384 nodes) + 80 Nvidia K80
Peak performance: 1.5 PFlop/s
National and PRACE Tier-1 calls.
PICO
Storage and processing of large volumes of data.
Model: IBM NeXtScale InfiniBand Linux cluster
Processor type: Intel Xeon E5-2670 v2 @ 2.5 GHz
Computing nodes: 80, each with 20 cores and 128 GB of RAM; plus 2 visualization nodes, 2 big-memory nodes, 4 data mover nodes
Storage: 50 TByte of SSD; 5 PByte on-line repository (same fabric as the cluster); 16 PByte of tape
Services: Hadoop & PBS, OpenStack cloud, NGS pipelines, workflows (weather/sea forecast), analytics, high-throughput workloads
Cineca road-map
Today: Tier-0: Fermi (BGQ); Tier-1: Galileo; BigData: Pico
Q1 2016: Tier-0: new system (procurement ongoing, HPC Top10); BigData: Galileo/Pico
Q1 2019: Tier-0 + BigData: 50 PFlop/s, 50 PByte
Dennard scaling law (downscaling)
Old VLSI generation -> new VLSI generation:
L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P
This no longer holds! The core frequency and performance no longer grow following Moore's law. Instead:
L' = L/2, V' ~ V, F' ~ F, D' = 1/L'^2 = 4D, P' = 4P
The number of cores is increased to keep the evolution of the architectures on Moore's law. The power crisis! The programming crisis!
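A minimal sketch of where these factors come from, using the usual dynamic-power relation (the switched capacitance C does not appear on the slide; how it scales changes the post-Dennard factor between roughly 2x and 4x, but not the conclusion that power per chip now rises at every node):

```latex
P_{\mathrm{chip}} \;\propto\;
  \underbrace{\frac{1}{L^{2}}}_{\text{device density } D}
  \cdot
  \underbrace{C\,V^{2}F}_{\text{power per device}},
  \qquad C \propto L .
```

In the Dennard era (L -> L/2, V -> V/2, F -> 2F) the per-device power drops by the same factor of 4 by which the density rises, so the chip power stays constant; with V and F now roughly flat, only the capacitance term shrinks and the chip power grows at each generation (the slide's factor 4 corresponds to taking the per-device power as unchanged).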
Moore's Law: the number of transistors per chip doubles every 18 months. In truth, it doubles every 24 months. Oh-oh, Houston!
The silicon lattice: 0.54 nm Si lattice constant, 50 atoms!
There are still 4-6 cycles (or technology generations) left until we reach the 11-5.5 nm technologies, at which point we reach the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT 2008).
Amdahl's law
The upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable (serial) operations: the maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.
Example: 1,000,000 cores, P = 0.999999, serial fraction = 0.000001 (see the sketch below).
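The slide's numbers are easy to reproduce; a minimal sketch in C, with the parallel fraction and core count taken from the slide (the function name is mine):

```c
#include <stdio.h>

/* Amdahl's law: speedup on n cores when a fraction p of the work is
 * perfectly parallel and the remaining (1 - p) is strictly serial.  */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.999999;   /* parallel fraction from the slide */
    double cores = 1.0e6;  /* one million cores                */

    printf("speedup on %.0f cores: %.0f\n", cores, amdahl_speedup(p, cores));
    printf("asymptotic limit:      %.0f\n", 1.0 / (1.0 - p));
    return 0;
}
```

Even with a serial fraction of only 10^-6, one million cores deliver a speedup of about 500,000, half of the nominal value; this is the sense in which Amdahl's law becomes the exascale challenge.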
HPC trends (constrained by the three laws)
Peak performance: exaflops (Moore's law) -> the opportunity
FPU performance: gigaflops (Dennard's law)
Number of FPUs: 10^9 (Moore + Dennard)
Application parallelism: serial fraction 1/10^9 (Amdahl's law) -> the challenge
Energy trends
Traditional RISC and CISC chips are designed for maximum performance on all possible workloads: a lot of silicon is spent to maximize single-thread performance.
[Diagram: energy vs. datacenter capacity vs. compute power]
Change of paradigm
New chips are designed for maximum performance on a small set of workloads: simple functional units, poor single-thread performance, but maximum throughput.
[Diagram: energy vs. datacenter capacity vs. compute power]
Architecture toward exascale
- CPU + accelerator (GPU/MIC/FPGA): single-thread performance on the CPU, throughput on the accelerator
- CPU + accelerator, OpenPower + Nvidia GPU: the CPU-accelerator link is the bottleneck
- CPU + accelerator on the same die: AMD APU
- Photonics -> platform flexibility; TSV -> stacking
- SoC (KNL, ARM): 3D stacking, active memory
Exascale architecture: two models
- Hybrid: CPU + accelerator (Nvidia GPU, AMD APU)
- Homogeneous: SoC (ARM, Intel)
Accelerator/GPGPU: example, the sum of a 1D array (see the sketch below).
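The slide only names the example; here is a minimal sketch of it using OpenACC, one of the accelerator models listed later in the talk (function and variable names are mine; without an accelerator-aware compiler the pragma is simply ignored and the loop runs on the host):

```c
#include <stdio.h>

/* Sum of a 1D array offloaded to an accelerator: the loop is spread
 * over the device's many simple cores and the reduction clause
 * combines the per-thread partial sums.                             */
static double sum_1d(const double *a, long n)
{
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (long i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    enum { N = 1 << 20 };
    static double a[N];
    for (long i = 0; i < N; ++i) a[i] = 1.0;
    printf("sum = %.1f\n", sum_1d(a, N));   /* expected: 1048576.0 */
    return 0;
}
```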
Intel vector units. Next to come: AVX-512, up to 16 multiply-adds per clock (see the sketch below).
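A minimal sketch of what those multiply-adds look like from C, using AVX-512 intrinsics (the AXPY-style kernel and its name are mine; it needs a compiler and CPU with AVX-512F support, e.g. gcc -mavx512f):

```c
#include <immintrin.h>

/* y = a*x + y with AVX-512: each _mm512_fmadd_ps issues 16
 * single-precision fused multiply-adds, which is where the
 * "up to 16 multiply-adds per clock" figure comes from.        */
void fma_axpy(float *y, const float *x, float a, int n)
{
    __m512 va = _mm512_set1_ps(a);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    for (; i < n; ++i)              /* scalar remainder */
        y[i] = a * x[i] + y[i];
}
```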
Application challenges
- Programming model
- Scalability
- I/O, resiliency/fault tolerance
- Numerical stability
- Algorithms
- Energy awareness/efficiency
www.quantum-espresso.org, H2020 MaX Center of Excellence
Scalability: the case of Quantum ESPRESSO. QE parallelization hierarchy.
OK for petascale, not enough for exascale. Ab-initio simulations -> numerical solution of the quantum mechanical equations.
[Chart: CNT10POR8 CP benchmark on BGQ; seconds/step vs. virtual cores (4,096-65,536), real cores (2,048-32,768) and band groups (1-16), broken down into the calphi, dforce, rhoofr, updatc and ortho routines.]
QE evolution
- High-throughput / ensemble simulations
- Communication avoiding
- New algorithm: CG vs. Davidson
- Coupled applications: DSL, LAMMPS
- Task-level parallelism, double buffering
- Reliability, completeness, robustness, standard interface
Multi-level parallelism
- Workload management: system level, high throughput
- Python: ensemble simulations, workflows
- MPI: domain partition
- OpenMP: node-level shared memory
- CUDA/OpenCL/OpenACC/OpenMP 4: floating-point accelerators
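A minimal sketch of how the two middle levels combine in practice, assuming nothing beyond standard MPI and OpenMP (the workflow and accelerator levels are only hinted at in the comments; all names and the toy workload are mine):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 1L << 24;             /* global problem size          */
    long chunk = n / nranks;             /* MPI level: domain partition  */
    long lo = rank * chunk;
    long hi = (rank == nranks - 1) ? n : lo + chunk;

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)  /* OpenMP: node level   */
    for (long i = lo; i < hi; ++i)
        local += 1.0;                    /* stand-in for the real kernel,
                                            which could itself be offloaded
                                            with CUDA/OpenACC/OpenMP 4   */

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f (expected %ld)\n", global, n);

    MPI_Finalize();
    return 0;
}
```

Run, for example, with one MPI rank per node and one OpenMP thread per core (e.g. mpirun -np 4 with OMP_NUM_THREADS=16), mirroring the domain/node split described on the slide.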
QE (Al2O3 small benchmark): energy to solution as a function of the clock frequency.
Conclusions
Exascale systems will be there. Power is the main architectural constraint. Exascale applications? Yes, but: concurrency, fault tolerance, I/O, energy awareness.