PORTABLE AND SCALABLE SOLUTIONS FOR CFD ON MODERN SUPERCOMPUTERS


PORTABLE AND SCALABLE SOLUTIONS FOR CFD ON MODERN SUPERCOMPUTERS. Ricard Borrell Pol. Heat and Mass Transfer Technological Center (cttc.upc.edu), Termo Fluids S.L. (termofluids.co), Barcelona Supercomputing Center (BSC.es)

Outline: Some notions of HPC, TermoFluids CFD code, Portable implementation model, Application for hybrid clusters, Concluding remarks


Moore's Law. Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years (Wikipedia). https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png

Moore's Law. Equivalent in HPC: the number of FLOP/s doubles approximately every two years (for the LINPACK benchmark). top500.org

LINPACK vs HPCG. Top supercomputers run the LINPACK benchmark very well (64% to 93% of peak) but are very inefficient when solving PDEs (0.3% to 4.4% of peak on HPCG). Top500.org & hpcg-benchmark.org

Memory wall. The arithmetic intensity needed to achieve the peak performance of computing devices keeps growing (Karl Rupp: www.karlrupp.net; J. Dongarra: ATPESC 2015). The dominant kernel in CFD is the SpMV or equivalent stencil operations. Flops/byte ratios (double precision): BLAS1 ~ 1/8, SpMV ~ 1.4/8*, BLAS2 ~ 2/8, BLAS3 ~ (2/3 n)/8. The performance of a CFD code therefore sits between BLAS1 and BLAS2. Memory bandwidth is the limiting factor: performance relies on counting bytes, not flops. *Laplacian discretization on a tetrahedral mesh.
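This limit can be written compactly as the bandwidth ceiling of the roofline model (not spelled out on the slide; the notation here is illustrative):

$$ P_{\text{attainable}} \;\le\; \min\!\left(P_{\text{peak}},\; \frac{\text{flops}}{\text{byte}} \times BW\right) $$

so a kernel at BLAS1-like intensity (~1/8 flop/byte) can never exceed BW/8 flop/s, no matter how high the peak of the device is.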

Outline: Some notions of HPC, TermoFluids CFD code, Portable implementation model, Application for hybrid clusters, Concluding remarks

TermoFluids CODE [1/6] General-purpose unstructured CFD code. Based on a finite-volume symmetry-preserving discretization on unstructured meshes. Includes several LES and regularization models for incompressible turbulent flows. Expansion to multi-physics simulations: multi-phase flows, particle propagation, reactive flows, fluid-structure interaction, multi-fluid flows, dynamic meshes...

TermoFluids CODE [2/6] HPC at TermoFluids: C++ object-oriented code. Parallelization based on the distributed-memory model (pure MPI); a hybrid model with GPU co-processors (MPI+CUDA) was developed recently. Performance barriers: synchronism (inter-CPU communications, point-to-point and all-reduce); flops (low arithmetic intensity, memory wall); random memory accesses. Systems used: Curie (TGCC), MareNostrum (BSC), JFF (CTTC), Lomonosov (MSU), Mira (ALCF), MinoTauro (BSC).

TermoFluids CODE [3/6] Largest scalability tests*: performed on the Mira supercomputer (BG/Q) of the Argonne Leadership Computing Facility (ALCF). Scalability tests up to 131K CPU cores, with parallel efficiencies of 76% and 67% at the largest scale*. All phases of the simulation were analyzed at the largest scale: pre-processing, check-pointing (IO)... Test case: differentially heated cavity. *For the last points, only 15K and 7K cells/core respectively. Reference: R. Borrell, J. Chiva, O. Lehmkuhl, I. Rodriguez and A. Oliva. Evolving TermoFluids CFD code towards peta-scale simulations. International Journal of Computational Fluid Dynamics. In press.

TermoFluids CODE [4/6] Largest production simulations performed in the context of PRACE Tier-0 projects. 6th PRACE call: DRAGON - Understanding the DRAG crisis: ON the flow past a circular cylinder from critical to transcritical Reynolds numbers; 23M hours (largest simulation on 4096 CPU cores). 10th PRACE call: Direct Numerical Simulation of Gravity Driven Bubbly Flows; 22M hours (largest simulation on 3072 CPU cores).

TermoFluids CODE [5/6]

TermoFluids CODE [6/6] Industrial applications: the same software libraries are used for leading-edge computational projects and for industrial applications. ENAIR: 3D simulation of wind turbine blades. CLEAN SKY: EFFAN - optimization of the electrical ram air fan used in all-electric aircraft. HP: simulation of 3D printers.

Outline: Some notions of HPC, TermoFluids CFD code, Portable implementation model, Application for hybrid clusters, Concluding remarks

Generic algebraic approach [1/5] Applied to LES simulations of turbulent flows of incompressible Newtonian fluids. Finite-volume second-order symmetry-preserving discretization. Temporal discretization based on a second-order explicit Adams-Bashforth scheme. Pressure-velocity coupling: fractional-step projection method.
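For reference, the fractional-step projection with AB2 time integration takes the usual textbook form (generic continuous notation; the actual discrete, symmetry-preserving operators of TermoFluids may differ in detail):

$$ \mathbf{u}^{p} = \mathbf{u}^{n} + \Delta t\left[\tfrac{3}{2}R(\mathbf{u}^{n}) - \tfrac{1}{2}R(\mathbf{u}^{n-1})\right], \qquad \nabla^{2}p^{n+1} = \frac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{p}, \qquad \mathbf{u}^{n+1} = \mathbf{u}^{p} - \Delta t\,\nabla p^{n+1}, $$

where $R(\mathbf{u})$ gathers the convective and diffusive terms; the discrete Poisson equation for $p^{n+1}$ is the linear system handed to the CG solver discussed later.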

Generic algebraic approach [2/5] We are in a disruptive moment where different HPC solutions compete; portability across many architectures is a must. We are developing an algebraic Generic Integration Platform (GIP) to perform time integrations: from TIME INTEGRATION based on STENCIL OPERATIONS to TIME INTEGRATION based on ALGEBRAIC KERNELS. Benefits: code portability, code modularity. Reference: G. Oyarzun, R. Borrell, A. Gorobets and A. Oliva. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers. SIAM Journal on Scientific Computing, 2015. Under review.

Generic algebraic approach [3/5] Algebraic kernels: vector-vector operations, namely AXPY (y = a*x + y) and the DOT product; the sparse matrix-vector product (SpMV). Non-linear operators (convective term): the convective term is decomposed into two SpMVs; a similar process modifies the diffusive term according to the turbulent viscosity.

Generic algebraic approach [4/5] Generic integration platform. Generic approach: the CFD time integration depends on the specific implementation of 4 abstract classes. The algebraic operators are imported from external codes: TermoFluids, OpenFOAM, Code Saturne, etc. The GIP can be used to port the time integration of other simulation codes to new architectures. (Diagram of the implementation strategy.)

Generic algebraic approach [5/5] 98% of the time integration is spent in only three algebraic kernels. This situation favors the portability of the code across different computing platforms.
Kernel calls per time step, outside the Poisson solver: SpMV 30, AXPY 10, DOT 2. Per PCG iteration: SpMV 2, AXPY 3, DOT 2.
Time breakdown (LES simulation of the flow around the ASMO car, 5.5M-cell mesh, 32 GPUs): SpMV 80.77%, AXPY 9.12%, DOT 8.79%, others 1.32%.
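These per-iteration counts match a Jacobi-preconditioned CG in which the preconditioner apply is itself an SpMV with a diagonal matrix (the ASMO test case below uses "CG with Jacobi diagonal scaling"). A minimal self-contained sketch in plain C++, with an illustrative CSR layout and a toy test matrix rather than the TermoFluids data structures:

```cpp
// Jacobi-preconditioned CG built only from the three algebraic kernels.
// Per iteration: 2 SpMV (A*p and the diagonal preconditioner), 3 AXPY-type updates, 2 DOT.
#include <vector>
#include <cmath>
#include <cstdio>

using Vec = std::vector<double>;
struct Csr { int n; std::vector<int> ptr, col; Vec val; };

void spmv(const Csr& A, const Vec& x, Vec& y) {                 // y = A*x
    for (int i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}
void axpy(double a, const Vec& x, Vec& y) {                     // y = a*x + y
    for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}
double dot(const Vec& x, const Vec& y) {
    double s = 0.0;
    for (size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Solve A*x = b; Minv holds the inverted diagonal of A stored as a sparse matrix,
// so applying the Jacobi preconditioner is just another SpMV.
int pcg(const Csr& A, const Csr& Minv, const Vec& b, Vec& x, double tol, int maxit) {
    int n = A.n;
    Vec r(n), z(n), p(n), q(n);
    spmv(A, x, r);                                   // r = A*x
    for (int i = 0; i < n; ++i) r[i] = b[i] - r[i];  // r = b - A*x
    spmv(Minv, r, z);                                // z = M^-1 r
    p = z;
    double rho = dot(r, z);
    for (int it = 0; it < maxit; ++it) {
        spmv(A, p, q);                               // SpMV 1/2
        double alpha = rho / dot(p, q);              // DOT 1/2
        axpy(alpha, p, x);                           // AXPY 1/3: x += alpha*p
        axpy(-alpha, q, r);                          // AXPY 2/3: r -= alpha*q
        spmv(Minv, r, z);                            // SpMV 2/2: preconditioner apply
        double rho_new = dot(r, z);                  // DOT 2/2
        if (std::sqrt(rho_new) < tol) return it + 1;
        double beta = rho_new / rho;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];   // AXPY-type 3/3
        rho = rho_new;
    }
    return maxit;
}

int main() {
    // 1D Laplacian with 5 unknowns: a tiny stand-in for the pressure Poisson system.
    Csr A{5, {0,2,5,8,11,13}, {0,1, 0,1,2, 1,2,3, 2,3,4, 3,4},
          {2,-1, -1,2,-1, -1,2,-1, -1,2,-1, -1,2}};
    Csr Minv{5, {0,1,2,3,4,5}, {0,1,2,3,4}, {0.5,0.5,0.5,0.5,0.5}};
    Vec b(5, 1.0), x(5, 0.0);
    int iters = pcg(A, Minv, b, x, 1e-10, 100);
    std::printf("converged in %d iterations, x[2] = %f\n", iters, x[2]);
    return 0;
}
```

Porting such a solver to a new architecture only requires device implementations of spmv, axpy and dot, which is exactly the portability argument made on this slide.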

Outline: Some notions of HPC, TermoFluids CFD code, Portable implementation model, Application for hybrid clusters, Concluding remarks

Accelerators in HPC [1/2] Accelerators are becoming increasingly popular in leading-edge supercomputers, with the potential to significantly reduce space, power consumption, and cooling demands. Context: constrained power consumption target (~25 MW for the entire system), the power wall. In the Top500 list of June 2014: 13% of the systems are based on hybrid nodes; of the first 15 positions of the Top500 list, 8 (53%) are based on hybrid nodes; 100% of the first 15 positions in the Green500 list are hybrid nodes with accelerators (NVIDIA).

Accelerators in HPC [2/2] Design goals for CPUs: make a single thread very fast; reduce latency through large caches; predict, speculate. Design goals for GPUs: throughput matters and single threads do not; more transistors dedicated to computation; hide memory latency through concurrency; remove the modules that make a single instruction stream fast (out-of-order control logic, branch predictor logic, memory prefetch unit); share the cost of the instruction stream across many ALUs (SIMD model); multiple contexts per stream multiprocessor (SM) hide latency. Source: Tim Warburton, ATPESC 2014

MinoTauro Supercomputer. MinoTauro (BSC) was used in the present work. Nodes: 2 Intel E5649 (6-core) processors at 2.53 GHz (Westmere); 12 GB RAM per CPU; 2 NVIDIA M2090 (Tesla) GPU cards; 6 GB RAM per GPU. Network: InfiniBand QDR (40 Gbit/s each) to a non-blocking network.

Implementation. Algebraic kernels: vector-vector operations implemented with CUBLAS 5.0; sparse matrix-vector product implemented with the sliced ELLPACK format: (1) group rows by number of entries, (2) use the ELLPACK format on each subgroup.
SpMV performance in GFLOPS for different mesh sizes (thousands of cells):
Device, SpMV format   |   50 |  100 |  200 |  400 |  800 | 1600 | Avg. speedup
CPU CSR MKL           | 2.45 | 2.18 | 1.49 | 1.37 | 1.30 | 1.18 | 1.7x
CPU ELLPACK           | 3.44 | 3.02 | 2.89 | 2.76 | 2.41 | 2.06 |
GPU CSR cusparse      | 3.64 | 4.10 | 4.40 | 4.58 | 4.79 | 4.70 | 3.3x
GPU HYB cusparse      | 8.74 | 11.2 | 13.4 | 14.9 | 15.6 | 15.9 | 1.1x
GPU sliced ELLPACK    | 10.9 | 12.8 | 14.9 | 15.9 | 16.2 | 16.4 |
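A minimal CUDA sketch of an SpMV kernel for a sliced-ELLPACK-style layout. The layout details (contiguous slices of rows with equal entry count, column-major storage inside each slice, one slice per thread block) are illustrative assumptions, not necessarily the exact format used here:

```cuda
// spmv_sliced_ell.cu - illustrative sliced-ELLPACK SpMV. Assumed layout: rows are
// grouped into contiguous slices; inside a slice, values and column indices are
// stored column-major and padded to the slice width.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spmv_sliced_ell(int n_slices,
                                const int* slice_row,    // first row of each slice (size n_slices+1)
                                const int* slice_off,    // data offset of each slice
                                const int* slice_width,  // entries per row within the slice
                                const int* col, const double* val,
                                const double* x, double* y) {
    for (int s = blockIdx.x; s < n_slices; s += gridDim.x) {
        int r0 = slice_row[s];
        int nrows = slice_row[s + 1] - r0;
        int w = slice_width[s];
        int base = slice_off[s];
        for (int r = threadIdx.x; r < nrows; r += blockDim.x) {
            double sum = 0.0;
            for (int j = 0; j < w; ++j) {
                int k = base + j * nrows + r;      // entries of row r are strided by nrows
                sum += val[k] * x[col[k]];
            }
            y[r0 + r] = sum;
        }
    }
}

int main() {
    // Tiny 4x4 example: rows 0-1 have 2 entries (slice 0), rows 2-3 have 3 entries (slice 1).
    const int n = 4, n_slices = 2;
    int h_slice_row[] = {0, 2, 4}, h_slice_off[] = {0, 4, 10}, h_slice_width[] = {2, 3};
    int h_col[] = {0, 1, 1, 2,   0, 1, 2, 2, 3, 3};
    double h_val[] = {1, 3, 2, 4,   5, 8, 6, 9, 7, 10};
    double h_x[] = {1, 1, 1, 1}, h_y[n];

    int *d_slice_row, *d_slice_off, *d_slice_width, *d_col;
    double *d_val, *d_x, *d_y;
    cudaMalloc(&d_slice_row, sizeof(h_slice_row));     cudaMalloc(&d_slice_off, sizeof(h_slice_off));
    cudaMalloc(&d_slice_width, sizeof(h_slice_width)); cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_val, sizeof(h_val)); cudaMalloc(&d_x, sizeof(h_x)); cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_slice_row, h_slice_row, sizeof(h_slice_row), cudaMemcpyHostToDevice);
    cudaMemcpy(d_slice_off, h_slice_off, sizeof(h_slice_off), cudaMemcpyHostToDevice);
    cudaMemcpy(d_slice_width, h_slice_width, sizeof(h_slice_width), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmv_sliced_ell<<<n_slices, 64>>>(n_slices, d_slice_row, d_slice_off,
                                      d_slice_width, d_col, d_val, d_x, d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, h_y[i]);   // expected: 3 7 18 27
    return 0;
}
```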

SPMV kernel [1/4] (Sparsity patterns: no ordering vs. Cuthill-McKee ordering.) Theoretically achievable performance (perfect locality and alignment assumed). Arithmetic intensity (Ax = b for a uniform tetrahedral mesh): bytes of A: (8 bytes x 5 entries/row x N rows for the values) + (4 x 5 x N for the column indices) = 60N; bytes of b: 8N; bytes of x: 8N (maximum cache reuse); total SpMV bytes: 76N; SpMV flops: 9N; flop/byte ratio: 9/76 = 0.12.

SPMV kernel [2/4] Theoretically achievable performance (perfect locality and alignment assumed):
Performance on Intel Xeon E5640 (6 cores, turbo freq. 2.93 GHz, bandwidth 25.6 GB/s). Peak performance: 24 flops per cycle x 2.93 Gcycles/s = 70.32 Gflop/s (flops: (2 FMA + 2 SIMD) x 6 cores). Time for computations: 10N flop / 70.32 Gflop/s = 0.14N ns. Time for data transfers: 76N bytes / 32 GB/s = 2.33N ns. Ratio time comm. / time comp. ~17!! Achievable performance: 9/76 x 32 = 3.8 Gflop/s (~5% of peak).
Performance on NVIDIA M2090 (Tesla). Peak performance: 666.1 Gflop/s. Bandwidth: 141.6 GB/s (ECC on). Achievable performance: 9/76 x 141.6 = 16.8 Gflop/s (~2.5% of peak).
16.8 / 3.8 = 4.4 = 141.6 / 32: the performance ratio equals the bandwidth ratio.
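The same arithmetic in a few lines of code, using the bandwidth and peak figures exactly as quoted on the slide and the 9/76 flop/byte ratio derived above:

```cpp
// Bandwidth-bound SpMV performance estimate for the two devices on the slide.
#include <cstdio>

int main() {
    const double ai = 9.0 / 76.0;                  // SpMV flop/byte ratio (tetrahedral Laplacian)
    struct Dev { const char* name; double bw_gbs, peak_gflops; };
    const Dev devs[] = { {"Intel Xeon E5640", 32.0,  70.32},
                         {"NVIDIA M2090",     141.6, 666.1} };
    for (const Dev& d : devs) {
        double bound = ai * d.bw_gbs;              // achievable Gflop/s, memory bound
        printf("%-17s: %.1f Gflop/s (%.1f%% of peak)\n",
               d.name, bound, 100.0 * bound / d.peak_gflops);
    }
    // Prints ~3.8 and ~16.8 Gflop/s: their ratio, ~4.4x, equals the bandwidth ratio.
    return 0;
}
```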

SPMV kernel [3/4] Net performance on a single 6-core CPU (left) and on a single GPU (right)

SPMV kernel [4/4] Speedup GPU vs CPU. For normal workloads per device, the bandwidth of GPUs is better exploited! Remember: CPU RAM 12 GB (6 cores), GPU RAM 6 GB.

Multi-GPU SPMV kernel [1/4] MPI + CUDA implementation. Parallelization based on a domain decomposition: one MPI process per subdomain and one GPU per MPI process. Local data partition: separate the inner parts (which do not require data from other subdomains) from the interface parts (which require external elements). Local data partition + a two-stream model -> overlapping of the computations on the GPU with the communications.
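A minimal sketch of this overlap pattern: one GPU per MPI rank, one compute stream and one communication stream. The kernels are trivial placeholders standing in for the inner-rows and interface-rows SpMV, and the halo exchange is reduced to a simple ring exchange; it shows the structure, not the actual TermoFluids communication scheme:

```cuda
// spmv_overlap.cu - two-stream structure of the multi-GPU SpMV (illustrative).
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Placeholder for the SpMV over the inner rows (needs no remote data).
__global__ void inner_spmv(double* y, const double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0 * x[i];
}
// Placeholder for the SpMV over the interface rows (consumes halo values).
__global__ void interface_spmv(double* y, const double* x, const double* halo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0 * x[i] + halo[i];
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_inner = 1 << 20, n_iface = 1 << 10, halo = n_iface;
    double *d_x, *d_y, *d_halo;
    cudaMalloc(&d_x, (n_inner + n_iface) * sizeof(double));
    cudaMalloc(&d_y, (n_inner + n_iface) * sizeof(double));
    cudaMalloc(&d_halo, halo * sizeof(double));
    cudaMemset(d_x, 0, (n_inner + n_iface) * sizeof(double));
    std::vector<double> send(halo, 1.0 * rank), recv(halo, 0.0);

    cudaStream_t s_comp, s_comm;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_comm);

    // 1) Launch the inner part: it needs no remote data, so it runs while we communicate.
    inner_spmv<<<(n_inner + 255) / 256, 256, 0, s_comp>>>(d_y, d_x, n_inner);

    // 2) Exchange halo values with the neighbour rank (in the real code the local interface
    //    values are first packed and copied device-to-host on s_comm).
    int nb = (rank + 1) % size;
    MPI_Sendrecv(send.data(), halo, MPI_DOUBLE, nb, 0,
                 recv.data(), halo, MPI_DOUBLE, nb, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_halo, recv.data(), halo * sizeof(double),
                    cudaMemcpyHostToDevice, s_comm);
    cudaStreamSynchronize(s_comm);            // halo is now on the device

    // 3) Finish with the interface rows; queued on s_comp, so it follows the inner part.
    interface_spmv<<<(n_iface + 255) / 256, 256, 0, s_comp>>>(
        d_y + n_inner, d_x + n_inner, d_halo, n_iface);
    cudaStreamSynchronize(s_comp);

    if (rank == 0) printf("overlapped SpMV done on %d rank(s)\n", size);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_halo);
    cudaStreamDestroy(s_comp); cudaStreamDestroy(s_comm);
    MPI_Finalize();
    return 0;
}
```

Because the inner part covers the bulk of the rows, the MPI exchange and the host-device copies are hidden behind the inner kernel whenever the subdomain is large enough, which is the overlapping effect measured on the next slide.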

Multi-GPU SPMV kernel [2/4] (left): Weak speedup test up to 128 GPUs, (right): overlapping effect on the executions with 128 GPUs

Multi-GPU SPMV kernel [3/4] Strong scalability test, (left): speedup, (right): parallel efficiency. Plot annotations: parallel efficiency ~80% at a GPU load of 400K rows, ~55% at 200K rows, ~35% at 100K rows. Note: in CPU executions one device is 6 cores.

Multi-GPU SPMV kernel [4/4] (left): net performance of the SpMV computing part for the strong speedup test; (right): estimated speedup if performance remained constant (canceling cache and occupancy effects) - but the GPU is 4 times faster!

LES test: flow around ASMO car [1/4] Flow around the ASMO car, Re = 7e5. 5.5-million-cell unstructured mesh with a prismatic boundary layer. Sub-grid scale model: wall-adapting local-eddy viscosity (WALE). Poisson solver: CG with Jacobi diagonal scaling. Reference: Flow and turbulent structures around simplified car models. D.E. Aljure, O. Lehmkuhl, I. Rodríguez, A. Oliva. Computers & Fluids 96 (2014) 122-135.

LES test: flow around ASMO car [2/4] (left): relative weight of the main operations for different numbers of CPUs and GPUs; (right): average relative weight over all tests.

LES test: flow around ASMO car [3/4] Note: in CPU executions one device is 6 cores. The performance of the overall CFD code on any system can be estimated by testing only three algebraic kernels.

LES test: flow around ASMO car [4/4] Speedup of multi-GPU vs multi-CPU. Note: in CPU executions one device is 6 cores.

Tests on Mont Blanc ARM [1/2] Mont-Blanc: a European project focused on the development of a new type of computer architecture, capable of setting future global HPC standards, built from the energy-efficient solutions used in embedded and mobile devices. Termo Fluids S.L. is part of the Industrial User Group (IUG). We have run parallel LES simulations on Mont Blanc nodes using the GIP platform. Node specifics: CPU: Cortex-A15 1.7 GHz dual core; GPU: Mali T-604 (OpenCL 1.1 capable); Network: 10 Gbit/s Ethernet. An OpenCL + OpenMP + MPI model is required to engage all components of the nodes. Since the memory is shared between CPU and GPU, an accurate load distribution is required. (Figure: load distribution for the SpMV kernel, 100K rows.)

Tests on Mont Blanc ARM [2/2] The similar performance of CPU and GPU makes hybridization meaningful. Languages: CPU OpenMP, GPU OpenCL. Synchronization points (clFinish()) are required to maintain main-memory coherence.
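One simple way to realize such a load distribution is a proportional split of the matrix rows based on the measured throughput of each compute unit; this is a generic illustration with hypothetical numbers and function names, not necessarily the scheme used on the prototype:

```cpp
// Split n matrix rows among compute units proportionally to their measured SpMV
// throughput, so that each unit finishes its share at roughly the same time.
#include <vector>
#include <numeric>
#include <cstdio>

std::vector<int> split_rows(int n, const std::vector<double>& gflops) {
    double total = std::accumulate(gflops.begin(), gflops.end(), 0.0);
    std::vector<int> first(gflops.size() + 1, 0);      // row ranges are [first[u], first[u+1])
    double acc = 0.0;
    for (size_t u = 0; u < gflops.size(); ++u) {
        acc += gflops[u];
        first[u + 1] = static_cast<int>(n * acc / total + 0.5);
    }
    first.back() = n;                                   // make sure all rows are covered
    return first;
}

int main() {
    // Example: a CPU (OpenMP) part and a GPU (OpenCL) part with similar throughput,
    // as on the Mont Blanc nodes; the numbers are made up for illustration.
    std::vector<double> measured = {1.0, 1.3};          // Gflop/s per unit (hypothetical)
    std::vector<int> first = split_rows(100000, measured);
    for (size_t u = 0; u + 1 < first.size(); ++u)
        printf("unit %zu: rows [%d, %d)\n", u, first[u], first[u + 1]);
    return 0;
}
```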

Tests on Mont Blanc ARM [2/2] (Plots: weak speedup and strong speedup.)

Outline: Some notions of HPC, TermoFluids CFD code, Portable implementation model, Application for hybrid clusters, Concluding remarks

CONCLUDING REMARKS Exaflops will come with disruptive changes in HPC technology. We developed a portable version of our CFD code based on an algebraic operational approach. ~98% of the computing time is spent on three kernels: SpMV, AXPY, DOT. The three kernels are clearly memory bound: performance depends exclusively on the bandwidth achieved (not on flops). Bandwidth is exploited more profitably with the throughput-oriented (latency-hiding) approach of GPUs. The overall time-step performance can be accurately estimated on any system by testing the three basic kernels. The speedup of the multi-GPU vs the multi-CPU implementation on the LES simulation of the flow around the ASMO car ranges from 4x to 8x on the MinoTauro supercomputer.