Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf

1 PADC Annual Workshop 20
Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture
Alexander Berreth, RECOM Services GmbH, Stuttgart
Markus Bühler, Benedikt Anlauf, IBM Deutschland Research & Development GmbH
Pascal Vezolle, IBM Systems France

2 Coal Fired Power Station
[Image: coal fired power station]

3 Simulation of the Combustion Process with RECOM-AIOLOS
[Figures: 3D boiler model; visualisation of computational results]
Simulation using the in-house 3D-combustion simulation code RECOM-AIOLOS on HPC hardware.
Improved understanding of the process leads to:
- Reduced emissions (NOx, CO, ...)
- Higher efficiency (CO2 reduction)
- Higher availability of the boiler

4 3D-CFD Simulation Software RECOM-AIOLOS
RECOM-AIOLOS contains models for the description of:
- Multi-phase continuum mechanics for solid, liquid and gaseous fuels
- Combustion chemistry, pollutant formation and radiation/heat transfer
0-0 Mio cells
Solves 00 non-linear coupled equations with - Billion unknowns
Intense computational demand: e.g. up to 2 h on 6 Haswell nodes
Validated against measured data from real, full-scale power plants.

5 Performance evaluation of RECOM-AIOLOS
2-socket Intel Xeon E5-2680v3 (Haswell) node of Cray XC0:
- 2 x 12 cores, 2.5 GHz
- 68 GB/s memory bandwidth
- Cray Fortran Compiler V
2-socket IBM S82L POWER8 node:
- 2 x 10 cores, 3. GHz
- 230 GB/s memory bandwidth
- IBM xlf FORTRAN compiler V
[Symbolic pictures. Source: https://stacresearch.]
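The bandwidth figures above can be put on a per-core basis with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides. It assumes that both bandwidth figures refer to the same scope (both per socket or both per node) and that the nodes have 24 and 20 cores respectively; the names `bandwidth_per_core`, `HASWELL_CORES` and `POWER8_CORES` are introduced here for illustration only.

```python
# Bandwidth figures quoted on the slide (GB/s). The slide does not state
# whether they apply to one socket or to the whole node; this sketch assumes
# both figures use the same scope.
HASWELL_BANDWIDTH_GB_S = 68.0
POWER8_BANDWIDTH_GB_S = 230.0

# Assumed core counts per 2-socket node (2 x 12 and 2 x 10).
HASWELL_CORES = 24
POWER8_CORES = 20


def bandwidth_per_core(bandwidth_gb_s, cores):
    """Return the memory bandwidth share of one core in GB/s."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return bandwidth_gb_s / cores


bandwidth_ratio = POWER8_BANDWIDTH_GB_S / HASWELL_BANDWIDTH_GB_S
haswell_per_core = bandwidth_per_core(HASWELL_BANDWIDTH_GB_S, HASWELL_CORES)
power8_per_core = bandwidth_per_core(POWER8_BANDWIDTH_GB_S, POWER8_CORES)

print(f"POWER8 / Haswell bandwidth ratio: {bandwidth_ratio:.2f}")
print(f"Haswell bandwidth per core: {haswell_per_core:.2f} GB/s")
print(f"POWER8 bandwidth per core:  {power8_per_core:.2f} GB/s")
```

Under these assumptions the bandwidth per core is noticeably higher on POWER8, which is consistent with the later observation that memory-intense routines run faster there.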

6 2-Socket node POWER8 Memory Organization
[Diagram: two POWER8 DCM sockets, each with two P8 chips; each chip is one NUMA node]
- Socket 0 (CPU 0), POWER8 DCM, cores 0-9: P8 chip NUMA node 0, P8 chip NUMA node 1
- Socket 1 (CPU 1), POWER8 DCM, cores 0-9: P8 chip NUMA node 2, P8 chip NUMA node 3
Memory access classes:
- Local memory access: within a POWER8 chip (NUMA node); direct attachment of cores to memory controllers
- Near memory access: between the POWER8 chips; via intra-node communication paths
- Far memory access: when connecting multiple nodes (not shown here)
Local memory access is much faster than near or far memory access!
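The three access classes can be expressed as a small decision rule. This is a minimal sketch for illustration, not code from RECOM-AIOLOS; the function name `classify_access` and the `(cluster_node, numa_node)` tuple encoding are assumptions made here.

```python
def classify_access(core_location, memory_location):
    """Classify a memory access as 'local', 'near' or 'far'.

    Both arguments are hypothetical (cluster_node, numa_node) tuples:
    cluster_node identifies the 2-socket server, numa_node the P8 chip.
    """
    core_host, core_numa = core_location
    memory_host, memory_numa = memory_location
    if core_host != memory_host:
        return "far"    # memory lives on another server
    if core_numa != memory_numa:
        return "near"   # same server, different P8 chip
    return "local"      # same P8 chip (NUMA node)


# A core on NUMA node 0 of server 0 touching memory in three places.
print(classify_access((0, 0), (0, 0)))
print(classify_access((0, 0), (0, 3)))
print(classify_access((0, 0), (1, 0)))
```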

7 2-Socket node POWER8 Memory Organization
[Diagram: two POWER8 DCM sockets with cores 0-9 each and four P8 chip NUMA nodes 0-3]
What does it mean:
- OpenMP parallelization:
  - First-touch policy
  - Core pinning to avoid thread migration
- Hybrid parallelization (OpenMP + MPI):
  - MPI ranks across NUMA nodes
  - OpenMP within NUMA nodes
  - Vectorize loops
  - Loop lengths 75k 8.M
  - Core pinning
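One way to reason about "MPI ranks across NUMA nodes, OpenMP within NUMA nodes" is to compute, for each rank, the logical CPUs its threads should be pinned to. The sketch below is an illustration under stated assumptions, not the setup used by the authors: it assumes 5 cores per NUMA node (20 cores over 4 NUMA nodes) and assumes that logical CPUs are numbered `core * 8 + hardware_thread`. The names `rank_cpu_lists` and `numa_node_of` are hypothetical.

```python
CORES_PER_NUMA_NODE = 5   # assumption: 20 cores spread over 4 NUMA nodes
NUMA_NODES = 4
HW_THREADS_PER_CORE = 8   # SMT8 hardware; assumed numbering: core * 8 + thread


def numa_node_of(cpu):
    """Return the NUMA node that owns a logical CPU id."""
    return (cpu // HW_THREADS_PER_CORE) // CORES_PER_NUMA_NODE


def rank_cpu_lists(num_ranks, smt):
    """Split the active logical CPUs into one contiguous list per MPI rank.

    smt is the number of hardware threads used per core (1, 2, 4 or 8).
    """
    if smt not in (1, 2, 4, 8):
        raise ValueError("smt must be 1, 2, 4 or 8")
    total_cores = CORES_PER_NUMA_NODE * NUMA_NODES
    active = [core * HW_THREADS_PER_CORE + thread
              for core in range(total_cores)
              for thread in range(smt)]
    if len(active) % num_ranks != 0:
        raise ValueError("active CPUs cannot be split evenly over the ranks")
    per_rank = len(active) // num_ranks
    return [active[r * per_rank:(r + 1) * per_rank] for r in range(num_ranks)]


# 4 ranks without SMT: one rank per NUMA node, 5 OpenMP threads each.
for rank, cpus in enumerate(rank_cpu_lists(4, 1)):
    nodes = sorted({numa_node_of(c) for c in cpus})
    print(f"rank {rank}: cpus {cpus} on NUMA node(s) {nodes}")
```

Each list could then be handed to whatever pinning mechanism the MPI launcher or OpenMP runtime provides. Combined with first-touch initialisation by the pinned threads, this keeps a rank's data in the memory of its own NUMA node.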

8 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Plot: speedup vs. number of threads * MPI processes; curves: 8 processes, 4 processes, 2 processes, OpenMP only, linear]
Highlighted: 2*10 cores (2 sockets); 1 MPI process (rank) on the node with 20 OpenMP threads (SMT1)

9 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Same plot, with the 20-core mark and the SMT1, SMT2, SMT4 and SMT8 ranges indicated]
Highlighted: 1 MPI process on the node

10 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Same plot]
Highlighted: 1 MPI process on each socket

11 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Same plot]
Highlighted: 1 MPI process on each NUMA node

12 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Plot: speedup vs. number of threads * MPI processes, with the 20-core mark and the SMT1, SMT2, SMT4 and SMT8 ranges; curves: 8 processes, 4 processes, 2 processes, OpenMP only, linear]
Highlighted: 2 MPI processes on each NUMA node
- Best speedup 22.7 with SMT
- Speedup 5. without SMT
- Significant performance increase with SMT; no further performance increase at higher SMT levels
- Open question: is the difference between the curves caused by short loops?
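Speedup and its relation to the core count can be made explicit with a short calculation. This is a minimal sketch; the helper names `speedup` and `speedup_per_core` are introduced here, and the timings in the first call are hypothetical. Only the speedup of 22.7 and the 20 cores come from the slide.

```python
def speedup(serial_time, parallel_time):
    """Speedup = run time of the serial run / run time of the parallel run."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


def speedup_per_core(speedup_value, cores):
    """Speedup divided by the number of physical cores used."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup_value / cores


# Hypothetical timings, only to show how a speedup value is obtained.
print(speedup(100.0, 5.0))

# Best speedup from the slide, reached with SMT on 20 cores.
best = speedup_per_core(22.7, 20)
print(f"speedup per core with SMT: {best:.3f}")
```

A value above 1 per physical core is possible here because SMT runs several hardware threads on each core, so the thread count exceeds the core count.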

13 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup using OpenMP with idealized test case
- 3D model of an idealized test case
- Numerical grid (structured) with 0 mio. cells in total
- No domain decomposition; only 1 domain
- Loop length 0 mio elements
- Parallelization (OpenMP) and vectorization of the loops

14 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup: hybrid vs. idealized OpenMP
[Plot: speedup vs. number of threads * MPI processes, with the 20-core mark and the SMT1, SMT2, SMT4 and SMT8 ranges; curves: real test case with 8 MPI processes, idealized test case with OpenMP only, linear]

15 Single Node Performance of RECOM-AIOLOS on POWER8
RECOM-AIOLOS speedup: hybrid vs. idealized OpenMP
[Same plot]
- Excellent scaling with pure OpenMP parallelization when using long loops
- NUMA placement seems to be OK
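Why long loops scale better than short ones can be illustrated with a simple cost model: every parallel loop pays a fixed overhead for starting and synchronising threads, and that overhead weighs more when each thread receives only a few iterations. The sketch below is an assumption-laden illustration, not a measurement. The function names, the time per iteration, the overhead value and both loop lengths are hypothetical, and the even split is only one common way a static schedule can divide iterations.

```python
def static_chunk_sizes(n_iterations, n_threads):
    """Split n_iterations as evenly as possible over n_threads."""
    base, extra = divmod(n_iterations, n_threads)
    return [base + 1 if t < extra else base for t in range(n_threads)]


def modeled_speedup(n_iterations, n_threads, time_per_iteration,
                    overhead_per_region):
    """Speedup predicted by: parallel time = largest chunk + fixed overhead."""
    serial_time = n_iterations * time_per_iteration
    largest_chunk = max(static_chunk_sizes(n_iterations, n_threads))
    parallel_time = largest_chunk * time_per_iteration + overhead_per_region
    return serial_time / parallel_time


# Hypothetical parameters: 1 ns per iteration, 5 microseconds of overhead.
TIME_PER_ITERATION = 1e-9
OVERHEAD = 5e-6
THREADS = 20

short_loop = modeled_speedup(75_000, THREADS, TIME_PER_ITERATION, OVERHEAD)
long_loop = modeled_speedup(10_000_000, THREADS, TIME_PER_ITERATION, OVERHEAD)
print(f"short loop: modeled speedup {short_loop:.1f} on {THREADS} threads")
print(f"long loop:  modeled speedup {long_loop:.1f} on {THREADS} threads")
```

In this model the long loop approaches linear speedup while the short loop stays well below it, which matches the qualitative difference between the idealized and the real test case.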

16 Single Node Performance of RECOM-AIOLOS on Haswell
RECOM-AIOLOS speedup using a hybrid parallelization approach
[Plot: speedup vs. number of threads * MPI processes, with marks for all cores and for all cores + HT; curves: 8 processes, 4 processes, 2 processes, OpenMP only, linear; best speedup marked on the plot]
- Speedup significantly lower compared to POWER8, but less dependent on the parallelization setting (only 2 NUMA nodes on Haswell)
- No performance gain with HT

17 Single Node Performance of RECOM-AIOLOS
Computing time (wall-clock time) on POWER8 vs. Haswell
[Bar chart: average computing time per iteration [s], IBM POWER8 vs. Intel Haswell, compared core-to-core and node-to-node]

18 Single Node Performance of RECOM-AIOLOS
Routine-based computing time on POWER8 vs. Haswell
[Bar chart: average computing time per iteration [s] for individual routines, IBM POWER8 vs. Intel Haswell]
- POWER8 slower: routines with high computational load
- POWER8 faster: routines with high memory transfer
- Further analysis (e.g. compiler output) revealed that major loops are unvectorized

19 Summary
- The 3D-combustion simulation software RECOM-AIOLOS was successfully ported to the POWER8 hardware
- With proper NUMA memory allocation, an excellent speedup was achieved when using OpenMP or a hybrid (OpenMP + MPI) approach
- Significant performance gain was observed when using SMT
- Similar node-to-node performance of POWER8 and Haswell
- Memory-intense routines were faster on POWER8
- Compute-intense routines were slower on POWER8, which could be attributed to a lack of vectorization on POWER8
