PORTABLE AND SCALABLE SOLUTIONS FOR CFD ON MODERN SUPERCOMPUTERS
Ricard Borrell Pol
Heat and Mass Transfer Technological Center (cttc.upc.edu)
Termo Fluids S.L (termofluids.co)
Barcelona Supercomputing Center (BSC.es)
Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
Moore's Law
Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years (Wikipedia).
https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png
Moore's Law
Equivalent in HPC: the number of FLOP/s doubles approximately every two years (for the LINPACK benchmark). top500.org
LINPACK vs HPCG
Fraction of peak achieved by top supercomputers: LINPACK 74%, 64%, 65%, 85%, 93% of peak vs HPCG 1.1%, 4.4%, 0.3%!!, 1.2%, 1.6% of peak.
Top supercomputers run the LINPACK benchmark very well, but for solving PDEs they are very inefficient.
Top500.org & hpcg-benchmark.org
Memory wall
The arithmetic intensity needed to achieve the peak performance of computing devices keeps growing.
The dominant kernel in CFD is the SpMV or equivalent stencil operations.
Flops/byte: BLAS1 ~ 1/8, SpMV ~ 1.4/8*, BLAS2 ~ 2/8, BLAS3 ~ (2/3 n)/8 (double precision).
The performance of a CFD code lies between BLAS1 and BLAS2.
Memory bandwidth is the limiting factor: performance relies on counting bytes, not flops.
* Laplacian discretization on a tetrahedral mesh
Sources: Karl Rupp (www.karlrupp.net), J. Dongarra (ATPESC 2015)
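The BLAS1 and SpMV ratios above follow from a quick count per element (a sketch assuming 8-byte double-precision values, 4-byte column indices, and counting only memory reads):

```latex
% DOT (BLAS1): s \mathrel{+}= x_i y_i
% 2 flops (mul + add), 16 bytes read (x_i, y_i)
\left.\frac{\text{flops}}{\text{byte}}\right|_{\mathrm{BLAS1}}
  = \frac{2}{16} = \frac{1}{8}

% SpMV: per nonzero a_{ij}, 2 flops (multiply--add),
% 12 bytes read (8-byte value + 4-byte index), accesses to x and y neglected
\left.\frac{\text{flops}}{\text{byte}}\right|_{\mathrm{SpMV}}
  \approx \frac{2}{12} = \frac{1.33}{8} \approx \frac{1.4}{8}
```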
Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
TermoFluids CODE [1/6]
General purpose unstructured CFD code.
Based on a finite-volume symmetry-preserving discretization on unstructured meshes.
Includes several LES and regularization models for incompressible turbulent flows.
Expansion to multi-physics simulations: multi-phase flows, particle propagation, reactive flows, fluid-structure interaction, multi-fluid flows, dynamic meshes...
TermoFluids CODE [2/6]
HPC at TermoFluids: C++ object-oriented code.
Parallelization based on the distributed-memory model (pure MPI); recently developed hybrid model with GPU co-processors (MPI+CUDA).
Performance barriers:
Synchronism: inter-CPU communications (point-to-point, all-reduce).
Flops: low arithmetic intensity (memory wall).
Random memory accesses.
Systems used: Curie (TGCC), MareNostrum (BSC), JFF (CTTC), Lomonosov (MSU), Mira (ALCF), MinoTauro (BSC)
TermoFluids CODE [3/6]
Largest scalability tests*: performed on the Mira supercomputer (BG/Q) of the Argonne Leadership Computing Facility (ALCF).
Scalability tests up to 131K CPU-cores (76% and 67% parallel efficiency).
All phases of the simulation analyzed at the largest scale: pre-processing, check-pointing (IO)...
Test case: differentially heated cavity.
* For the last points, only 15K and 7K cells/core respectively.
R. Borrell, J. Chiva, O. Lehmkuhl, I. Rodriguez and A. Oliva. Evolving TermoFluids CFD code towards peta-scale simulations. International Journal of Computational Fluid Dynamics. In press.
TermoFluids CODE [4/6]
Largest production simulations performed in the context of PRACE Tier-0 projects.
6th PRACE CALL: DRAGON - Understanding the DRAG crisis: ON the flow past a circular cylinder from critical to transcritical Reynolds numbers. 23M hours (largest simulation: 4096 CPU-cores).
10th PRACE CALL: Direct Numerical Simulation of Gravity Driven Bubbly Flows. 22M hours (largest simulation: 3072 CPU-cores).
TermoFluids CODE [5/6]
TermoFluids CODE [6/6]
Industrial applications: the same software libraries are used for leading-edge computational projects and for industrial applications.
ENAIR: 3D simulation of wind turbine blades.
CLEAN SKY: EFFAN - optimization of the electrical ram air fan used in all-electric aircraft.
HP: simulation of 3D printers.
Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
Generic algebraic approach [1/5] Applied to the LES simulation of turbulent flow of incompressible Newtonian fluids Finite Volume second order symmetry-preserving discretization Temporal discretization based on a second order explicit Adams-Bashforth scheme Pressure-velocity coupling: fractional step projection method
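In its standard form (a sketch of the textbook scheme; the exact operators and coefficients in TermoFluids may differ), the AB2 fractional-step projection method reads:

```latex
% R(u) collects the convective and diffusive terms
R(u^{n}) = -(u^{n} \cdot \nabla)\,u^{n} + \nu \nabla^{2} u^{n}

% 1) predictor velocity (second-order Adams--Bashforth)
u^{p} = u^{n} + \Delta t \left( \tfrac{3}{2} R(u^{n}) - \tfrac{1}{2} R(u^{n-1}) \right)

% 2) Poisson equation for the pressure (projection step)
\nabla^{2} p^{\,n+1} = \frac{1}{\Delta t}\, \nabla \cdot u^{p}

% 3) correction: the resulting u^{n+1} is divergence-free
u^{n+1} = u^{p} - \Delta t \, \nabla p^{\,n+1}
```

Step 2 is where the iterative Poisson solver (and hence most SpMV/AXPY/DOT calls) enters the time integration.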
Generic algebraic approach [2/5]
We are at a disruptive moment where different HPC solutions compete: portability across many architectures is a must.
We are developing an algebraic Generic Integration Platform (GIP) to perform time integrations:
TIME INTEGRATION based on STENCIL OPERATIONS -> TIME INTEGRATION based on ALGEBRAIC KERNELS.
Code portability. Code modularity.
G. Oyarzun, R. Borrell, A. Gorobets and A. Oliva. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers. SIAM Journal on Scientific Computing, 2015. Under review.
Generic algebraic approach [3/5]
Algebraic kernels:
Vector-vector operations: AXPY (y = a*x + y), DOT product.
Sparse matrix-vector product (SpMV).
Non-linear operators (convective term): the convective term is decomposed into two SpMVs.
A similar process modifies the diffusive term according to the turbulent viscosity.
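A minimal serial sketch of the three kernels (CSR storage for the SpMV; names and layout are illustrative, not TermoFluids' actual API):

```cpp
#include <cstddef>
#include <vector>

// AXPY: y = a*x + y
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// DOT product: sum_i x_i * y_i
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// SpMV in CSR format: y = A*x
// rowPtr has n+1 entries; row i owns entries [rowPtr[i], rowPtr[i+1])
void spmv_csr(const std::vector<int>& rowPtr, const std::vector<int>& col,
              const std::vector<double>& val, const std::vector<double>& x,
              std::vector<double>& y) {
    for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i) {
        double s = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) s += val[k] * x[col[k]];
        y[i] = s;
    }
}
```

On the GPU, AXPY and DOT map directly onto CUBLAS calls, and the SpMV onto a format-specific kernel (see the implementation slide).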
Generic algebraic approach [4/5]
Generic integration platform. Generic approach: the CFD time integration depends on the specific implementation of 4 abstract classes.
The algebraic operators are imported from external codes: TermoFluids, OpenFOAM, Saturne, etc.
The GIP can be used to port the time integration of other simulation codes to new architectures.
Diagram of the implementation strategy.
Generic algebraic approach [5/5]
98% of the time integration is spent in only three algebraic kernels.
This situation favors the portability of the code across different computing platforms.
Kernel calls outside the Poisson solver: SpMV 30, AXPY 10, DOT 2. Per PCG iteration: SpMV 2, AXPY 3, DOT 2.
Time share (LES simulation of the flow around the ASMO car, 5.5M mesh, 32 GPUs): SpMV 80.77%, AXPY and DOT ~9% each (9.12% and 8.79%), others 1.32%.
Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
Accelerators in HPC [1/2]
Accelerators are becoming increasingly popular in leading-edge supercomputers: potential to significantly reduce space, power consumption, and cooling demands.
Context: constrained power-consumption target (~25 MW for the entire system) - the power wall.
Top500.org list, June 2014:
13% of the Top500 systems are based on hybrid nodes.
Considering the first 15 positions of the Top500 list, 8 (53%) are based on hybrid nodes.
100% of the first 15 positions in the Green500 list are hybrid nodes with accelerators (NVIDIA).
Accelerators in HPC [2/2]
Design goals for CPUs: make a single thread very fast; reduce latency through large caches; predict, speculate.
Design goals for GPUs: throughput matters and single threads do not; more transistors dedicated to computation; hide memory latency through concurrency; remove modules to make simple instructions fast (out-of-order control logic, branch-predictor logic, memory pre-fetch unit); share the cost of the instruction stream across many ALUs (SIMD model); multiple contexts per stream multiprocessor (SM) hide latency.
Source: Tim Warburton, ATPESC 2014
MinoTauro Supercomputer MinoTauro (BSC) was used in the present work Nodes: 2 Intel E5649 (6-Core) processors at 2.53 GHz (Westmere) 12 GB RAM per CPU 2 M2090 NVIDIA GPU Cards (Tesla) 6 GB RAM per GPU Network: Infiniband QDR (40 Gbit each) to a non-blocking network
Implementation
Algebraic kernels:
Vector-vector operations: CUBLAS 5.0.
Sparse matrix-vector product: sliced ELLPACK format: (1) group rows by number of entries, (2) use the ELLPACK format on each subgroup.

GFLOPS by device and SpMV format vs mesh size (thousands of cells):

Device, SpMV format    |   50 |  100 |  200 |  400 |  800 | 1600 | Avg. speedup
CPU, CSR (MKL)         | 2.45 | 2.18 | 1.49 | 1.37 | 1.30 | 1.18 | 1.7x
CPU, ELLPACK           | 3.44 | 3.02 | 2.89 | 2.76 | 2.41 | 2.06 |
GPU, CSR (cuSPARSE)    | 3.64 | 4.10 | 4.40 | 4.58 | 4.79 | 4.70 | 3.3x
GPU, HYB (cuSPARSE)    | 8.74 | 11.2 | 13.4 | 14.9 | 15.6 | 15.9 | 1.1x
GPU, sliced ELLPACK    | 10.9 | 12.8 | 14.9 | 15.9 | 16.2 | 16.4 |
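A host-side sketch of the sliced ELLPACK construction described above: rows with the same number of entries share a slice, so the ELLPACK padding inside each slice is zero. The data layout here is illustrative (a real GPU kernel would additionally store each slice column-major for coalesced accesses):

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct EllSlice {
    int width = 0;                 // entries per row in this slice
    std::vector<int> rows;         // global row ids grouped in this slice
    std::vector<int> col;          // rows.size() * width column indices
    std::vector<double> val;       // rows.size() * width values
};

// Build the slices from a CSR matrix, grouping rows by their entry count.
std::vector<EllSlice> build_sliced_ell(const std::vector<int>& rowPtr,
                                       const std::vector<int>& colCsr,
                                       const std::vector<double>& valCsr) {
    std::map<int, EllSlice> byWidth;           // nnz-per-row -> slice
    for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i) {
        int w = rowPtr[i + 1] - rowPtr[i];
        EllSlice& s = byWidth[w];
        s.width = w;
        s.rows.push_back(static_cast<int>(i));
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            s.col.push_back(colCsr[k]);
            s.val.push_back(valCsr[k]);
        }
    }
    std::vector<EllSlice> slices;
    for (auto& kv : byWidth) slices.push_back(std::move(kv.second));
    return slices;
}

// SpMV over the slices: y = A*x
void spmv_sliced_ell(const std::vector<EllSlice>& slices,
                     const std::vector<double>& x, std::vector<double>& y) {
    for (const EllSlice& s : slices)
        for (std::size_t r = 0; r < s.rows.size(); ++r) {
            double sum = 0.0;
            for (int j = 0; j < s.width; ++j)
                sum += s.val[r * s.width + j] * x[s.col[r * s.width + j]];
            y[s.rows[r]] = sum;
        }
}
```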
SpMV kernel [1/4]
Sparsity pattern: no ordering vs Cuthill-McKee ordering.
Theoretically achievable performance (perfect locality and alignment assumed):
Arithmetic intensity (Ax = b for a uniform tetrahedral mesh):
A bytes: (8*5*N) + (4*5*N) = 60N; b bytes: 8N; x bytes: 8N (max. cache reuse).
SpMV bytes: 76N; SpMV flops: 9N.
Flop/byte ratio: 9/76 = 0.12
SpMV kernel [2/4]
Theoretically achievable performance (perfect locality and alignment assumed):
Performance on Intel Xeon E5640 (6-core, turbo freq. 2.93 GHz, bandwidth 25.6 GB/s):
Peak performance: 24 flops/cycle x 2.93 Gcycles/s = 70.32 Gflop/s (flops: (2 FMA + 2 SIMD) x 6 cores).
Time computations: 10N flop / 70.32 Gflop/s = 0.14N ns.
Time data transfers: 76N bytes / 32 GB/s = 2.33N ns.
Ratio: time transfers / time computations ~ 17!!
Achievable performance: 9/76 x 32 = 3.8 Gflop/s (~5% of peak).
Performance on NVIDIA M2090 (Tesla): peak performance 666.1 Gflop/s, bandwidth 141.6 GB/s (ECC on).
Achievable performance: 9/76 x 141.6 = 16.8 Gflop/s (~2.5% of peak).
Performance ratio equals bandwidth ratio: 16.8/3.8 = 4.4 = 141.6/32.
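The bandwidth-bound estimates above are one line of roofline arithmetic, achievable GFLOP/s = (flop/byte) x bandwidth; a sketch reproducing the slide's numbers:

```cpp
// Roofline estimate for a memory-bound kernel:
// achievable GFLOP/s = arithmetic intensity (flop/byte) * bandwidth (GB/s)
double achievable_gflops(double flops_per_byte, double bandwidth_gbs) {
    return flops_per_byte * bandwidth_gbs;
}

// With the SpMV intensity of 9/76 flop/byte:
//   CPU:  achievable_gflops(9.0/76.0, 32.0)   ~ 3.8 GFLOP/s
//   GPU:  achievable_gflops(9.0/76.0, 141.6)  ~ 16.8 GFLOP/s
```

Since the arithmetic intensity is fixed by the kernel, the GPU/CPU performance ratio is simply the bandwidth ratio, 141.6/32 ~ 4.4.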
SpMV kernel [3/4]
Net performance on a single 6-core CPU (left) and on a single GPU (right).
SpMV kernel [4/4]
Speedup of GPU vs CPU.
For normal workloads per device, the bandwidth of the GPUs is better exploited!
Remember: CPU RAM 12 GB (6 cores), GPU RAM 6 GB.
Multi-GPU SpMV kernel [1/4]
MPI + CUDA implementation.
Parallelization based on a domain decomposition: one MPI process per subdomain and one GPU per MPI process.
Local data partition: separate inner parts (do not require data from other subdomains) from interface parts (require external elements).
Local data partition + two-stream model -> overlapping computations on the GPU with communications.
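The inner/interface split can be sketched on the host as a row classification over the local CSR matrix. This is a simplified model (names are illustrative): columns >= nOwned are assumed to reference halo elements received from other subdomains.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split local rows into "inner" (touch only owned columns, can be computed
// while the halo data is in flight) and "interface" (need external elements,
// computed only after the MPI exchange completes).
std::pair<std::vector<int>, std::vector<int>>
split_rows(const std::vector<int>& rowPtr, const std::vector<int>& col,
           int nOwned) {
    std::vector<int> inner, interface_;
    for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i) {
        bool needsHalo = false;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            if (col[k] >= nOwned) { needsHalo = true; break; }
        (needsHalo ? interface_ : inner).push_back(static_cast<int>(i));
    }
    return {inner, interface_};
}

// Two-stream timeline (pseudo-steps of the overlap strategy):
//   stream 0: launch SpMV over the 'inner' rows
//   host:     MPI_Isend/MPI_Irecv the halo values, copy them to the device
//   stream 1: launch SpMV over the 'interface_' rows once the halo arrived
```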
Multi-GPU SPMV kernel [2/4] (left): Weak speedup test up to 128 GPUs, (right): overlapping effect on the executions with 128 GPUs
Multi-GPU SpMV kernel [3/4]
Strong scalability test: (left) speedup, (right) parallel efficiency.
GPU load at the last points: 80% at 400K cells, 55% at 200K, 35% at 100K.
Note: in the CPU executions, 1 device is 6 cores.
Multi-GPU SpMV kernel [4/4]
(left): net performance of the computing part for the strong speedup test.
(right): estimated speedup for a hypothetical constant performance (canceling cache and occupancy effects)... but the GPU is 4 times faster!
LES test: flow around ASMO car [1/4]
Flow around the ASMO car, Re = 7e5.
5.5 million cell unstructured mesh with prismatic boundary layer.
Sub-grid scale model: wall-adapting local-eddy viscosity (WALE).
Poisson solver: CG with Jacobi diagonal scaling.
Flow and turbulent structures around simplified car models. D.E. Aljure, O. Lehmkuhl, I. Rodríguez, A. Oliva. Computers & Fluids 96 (2014) 122-135.
LES test: flow around ASMO car [2/4]
(left): relative weight of the main operations for different numbers of CPUs and GPUs. (right): average relative weight over all tests.
LES test: flow around ASMO car [3/4]
Note: in the CPU executions, 1 device is 6 cores.
The performance of the overall CFD code on any system can be estimated by testing only three algebraic kernels.
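Given the kernel counts from the "Generic algebraic approach [5/5]" slide (SpMV 30, AXPY 10, DOT 2 outside the Poisson solver; SpMV 2, AXPY 3, DOT 2 per PCG iteration), the cost of a full time step can be estimated from three measured kernel times alone. A sketch of that cost model (function name is illustrative):

```cpp
// Estimate the cost of one time step from measured per-call kernel times
// (in seconds) and the number of PCG iterations of the Poisson solver.
// Kernel counts as reported in the "Generic algebraic approach [5/5]" slide.
double timestep_estimate(double tSpmv, double tAxpy, double tDot, int pcgIters) {
    const double outsideSolver = 30 * tSpmv + 10 * tAxpy + 2 * tDot;
    const double perIteration  =  2 * tSpmv +  3 * tAxpy + 2 * tDot;
    return outsideSolver + pcgIters * perIteration;
}
```

Benchmarking the three kernels on a new architecture and plugging the times into this model gives the time-step estimate without porting the whole code first.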
LES test: flow around ASMO car [4/4]
Speedup: multi-GPU vs multi-CPU.
Note: in the CPU executions, 1 device is 6 cores.
Tests on Mont Blanc ARM [1/2]
Mont Blanc: European project focused on developing a new type of computer architecture capable of setting future global HPC standards, built from the energy-efficient solutions used in embedded and mobile devices.
Termo Fluids S.L is part of the Industrial User Group (IUG).
We have run parallel LES simulations on Mont Blanc nodes using the GIP platform.
Node specifics: CPU: Cortex-A15 1.7 GHz dual core; GPU: Mali T-604 (OpenCL 1.1 capable); Network: 10 Gbit/s Ethernet.
An OpenCL + OpenMP + MPI model is required to engage all components of the nodes.
Shared memory between CPU and GPU requires an accurate load distribution.
Figure: load distribution for the SpMV kernel, 100K rows.
Tests on Mont Blanc ARM [2/2]
The similar performance of CPU and GPU makes hybridization meaningful.
Languages: CPU OpenMP, GPU OpenCL.
Synchronization points (clFinish()) are required to maintain main-memory coherence.
Figure: load distribution shares of 16% / 16% / 40% / 16% / 16%.
Tests on Mont Blanc ARM [2/2]
(left): weak speedup, (right): strong speedup.
Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
CONCLUDING REMARKS
Exascale comes with disruptive changes in HPC technology.
We developed a portable version of our CFD code based on an algebraic operational approach.
~98% of the computing time is spent on three kernels: SpMV, AXPY, DOT.
The three kernels are clearly memory bound: performance depends exclusively on the bandwidth achieved (not on flops).
Bandwidth is better exploited with the throughput-oriented (latency-hiding) approach of GPUs.
Overall time-step performance can be accurately estimated on any system by testing the three basic kernels.
The speedup of the multi-GPU vs the multi-CPU implementation for the LES simulation of the flow around the ASMO car ranges from 4x to 8x on the MinoTauro supercomputer.