Mixed OpenMP/MPI approaches on Blue Gene for CFD applications (EDF R&D and IBM Research collaboration)

1 IBM eServer pSeries - Sciomp 15, May, Barcelona
Mixed OpenMP/MPI approaches on Blue Gene for CFD applications (EDF R&D and IBM Research collaboration)
Pascal Vezolle, IBM Deep Computing, vezolle@fr.ibm.com
Yvan Fournier, EDF R&D, yvan.fournier@edf.fr
2004 IBM Corporation

2 Acknowledgements
EDF: Y. Fournier (Code_Saturne), R. Issa (Spartacus), E. Razafindrakoto (Telemac 3D)
STFC Daresbury Laboratory: C. Moulinec (Spartacus)

3 EDF group overview
Permanent objectives: guarantee safety, improve performance/costs, maintain assets.
Changing operating conditions: face unexpected events, ageing issues and maintenance; improve performance through new technologies, new operating modes and system-wide optimization; adapt to an evolving set of rules (safety, environment, regulatory).
In-house technical backing expertise: strong Engineering and R&D divisions; physical testing and simulation are key tools from the outset.
First electricity producer and provider in Europe. Employees in the world:
R&D: 2,000 staff, 30% women; 300 PhDs and 200 doctoral students; 150 researchers teach at universities and engineering schools.

4 EDF R&D / IBM history (> 3 years of technical collaboration)
Several domains of interest are covered by IBM technical support: material science, financing & project management, CFD.
July 2006: EDF bought 4 Blue Gene/L racks.
December 2007: EDF ordered 8 Blue Gene/P racks, installed in 2008 and hosted by IBM.
Since 2008: Grand Challenge collaboration between EDF R&D and IBM Research around high-precision thermal-hydraulic calculations based on the Code_Saturne CFD code. Objective: 3D RANS, full vessel, steady approach, 5-10 billion cells.
IBM activities: parallel and serial tuning, mixed OpenMP/MPI approach, meshing, ...

5 CFD Grand Challenge: High Precision Thermal-Hydraulic Calculations Within the Core
Roadmap:
>2012(?): 3D LES, full vessel, unsteady approach, >50 billion cells
3D LES, 1 or 2 rods, unsteady approach, 5 billion cells
3D RANS, full vessel, steady approach, 5-10 billion cells
3D RANS, 5x5 rods, 100 million cells, 2 M CPU hours

6 Ongoing CFD codes on Blue Gene
Two general, complex and complete integrated CFD families, each including several codes and a workflow environment.
Nuclear reactor simulation: Finite Volume approach, RANS. With its co-located Finite Volume approach, it deals with any type of mesh cell and grid structure; incompressible and expandable flows with or without heat transfer and turbulence. Dedicated modules are available (radiative heat transfer, combustion, magneto-hydrodynamics, compressible flows, Euler-Lagrange approach for two-phase flows, capabilities for parallel code coupling). Two main codes: Code_Saturne (single-phase flows) and NEPTUNE_CFD (two-phase flows). The Code_Saturne CFD package is open source.
Environment simulations: underground flows, water quality, sedimentation, dam breaking, etc. Finite Element approach, Eulerian (Telemac code), available through license, more than 100 users around the world; Lagrangian SPH (Smoothed Particle Hydrodynamics), strong deformations, e.g. dam-breaking simulation (Spartacus code).
Two codes are being implemented in mixed OpenMP/MPI mode: Code_Saturne and Spartacus.

7 SPARTACUS-3D
SPH 3D Lagrangian modelling of complex free-surface flows.
Marine and coastal field: design of protection works, water intake and release works for thermal power plants; design of structures (piles, foundations) for wind farms and marine current turbines.
River field: dam spillway crest design; optimization of fish migration devices.
Multiphase, complex geometry, fast-dynamics water flows and complex free surfaces.

8 OpenMP/MPI implementation for SPH (Blue Gene/P): 13 M particles
Execution modes (1 thread per core on BG/P): VN = 1 OpenMP thread per MPI task, DUAL = 2 OpenMP threads per MPI task, SMP = 4 OpenMP threads per MPI task.
[Figure: time per step (sec) versus number of nodes for the SMP, DUAL and VN modes.]
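
A minimal hybrid MPI/OpenMP sketch in C (not from the slides; names and layout are illustrative) showing what the VN/DUAL/SMP modes change at runtime: the number of OpenMP threads each MPI task runs.

    /* Each MPI task reports how many OpenMP threads it runs; on a BG/P node
     * this is 1 (VN), 2 (DUAL) or 4 (SMP) threads per task. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks, nthreads = 1;

        /* FUNNELED is enough when only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            #pragma omp master
            nthreads = omp_get_num_threads();
        }

        printf("MPI task %d of %d running %d OpenMP thread(s)\n",
               rank, nranks, nthreads);

        MPI_Finalize();
        return 0;
    }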

9 Code_Saturne
Simulation of incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, RSM, LES, ...).

10 Code_Saturne Communication Features (1/2)
Distributed-memory parallelism using domain partitioning.
Classical ghost-cell method for both parallelism and periodicity => MPI point-to-point communications. Most operations require only ghost cells sharing faces; extended neighborhoods for gradients also require ghost cells sharing vertices.
MPI collective communications: global reductions (scalar products) are also used, especially by the preconditioned conjugate gradient algorithm; MPI_Alltoall for I/O.
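
A minimal sketch (an assumption about structure, not Code_Saturne's actual API or data layout) of the two communication patterns listed above: a ghost-cell exchange with non-blocking point-to-point messages, and the global scalar product used by the conjugate gradient.

    /* Ghost-cell (halo) exchange for one variable, assuming each rank knows
     * its neighbor ranks and has pre-built send/receive buffers. */
    #include <mpi.h>

    void halo_exchange(int n_neighbors, const int *neighbor_rank,
                       const int *send_count, const int *recv_count,
                       double **send_buf, double **recv_buf, MPI_Comm comm)
    {
        MPI_Request req[2 * n_neighbors];
        int n_req = 0;

        /* Post receives for ghost values, then send owned boundary values. */
        for (int i = 0; i < n_neighbors; i++)
            MPI_Irecv(recv_buf[i], recv_count[i], MPI_DOUBLE,
                      neighbor_rank[i], 0, comm, &req[n_req++]);

        for (int i = 0; i < n_neighbors; i++)
            MPI_Isend(send_buf[i], send_count[i], MPI_DOUBLE,
                      neighbor_rank[i], 0, comm, &req[n_req++]);

        MPI_Waitall(n_req, req, MPI_STATUSES_IGNORE);
    }

    /* Global reduction used by the conjugate gradient: a scalar product. */
    double dot_global(const double *x, const double *y, int n, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }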

11 Current MPI Code_Saturne performance (1/3): benchmark test cases
Number of cells in the mesh ranges from industrial studies to exploratory studies:
FATHER test case: 1 M cells, 200 iterations, LES turbulence
HYPI test case: 10 M cells, 200 iterations, LES turbulence
GRILLE test case: 100 M cells, 50 iterations, k-ε turbulence

12 MPI Code_Saturne performance (2/3)
Two 3D LES test cases (most I/O factored out): FATHER, 1 M hexahedra; HYPI, 10 M hexahedra. Cells per MPI task: 1 M cells => 880 at 1024 cores; 10 M cells => 9345 at 1024 cores.
RANS grid test case: 100 M tetrahedra + polyhedra (most I/O factored out); 96286/... min/max cells per core at 1024 cores; 11344/12781 min/max cells per core at 8192 cores.
[Figures: elapsed time per iteration versus number of cores for the RANS grid case on NovaScale, Blue Gene/L (CO) and Blue Gene/L (VN); elapsed time versus number of cores for the FATHER and HYPI LES cases on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L.]

13 Finite Volume Code_Saturne features (2/2) (general features shared by many implicit CFD codes)
Linear equation solvers usually amount to 80% of the CPU cost (dominated by pressure), gradient reconstruction to about 20%. The larger the mesh, the higher the relative cost of the pressure step.
Segregated solver: all variables are solved independently, coupling terms are explicit. Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGStab) for the other variables.
More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra. Indirect addressing + no dense blocks means fewer opportunities for matrix-vector optimization, as memory bandwidth is as important as peak flops:
- very difficult to optimize on modern processors and to take advantage of the vector floating-point units (SIMD on Blue Gene, SSE on x86)
- no real performance improvement per core can be expected in the coming years
- an aggregated solver with sparse block matrices can help
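
As a rough illustration of where the 80% goes, the sketch below outlines a diagonally preconditioned CG in C with CSR storage. It is a generic textbook form, not Code_Saturne's solver, and all names are illustrative; the dominant kernels per iteration are the sparse matrix-vector product, dot products and vector updates, all memory-bandwidth bound.

    /* Minimal diagonal (Jacobi) preconditioned CG sketch, serial, CSR storage. */
    #include <math.h>
    #include <stdlib.h>

    static void spmv_csr(int n, const int *ia, const int *ja,
                         const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double tmp = 0.0;
            for (int k = ia[i]; k < ia[i+1]; k++)
                tmp += a[k] * x[ja[k]];
            y[i] = tmp;
        }
    }

    static double dot(int n, const double *x, const double *y)
    {
        double s = 0.0;                 /* in parallel this becomes an MPI_Allreduce */
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    /* Solve A x = b; diag[] holds the matrix diagonal used as preconditioner. */
    void pcg_diag(int n, const int *ia, const int *ja, const double *a,
                  const double *diag, const double *b, double *x,
                  int max_iter, double tol)
    {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

        spmv_csr(n, ia, ja, a, x, q);                      /* r = b - A x */
        for (int i = 0; i < n; i++) r[i] = b[i] - q[i];
        for (int i = 0; i < n; i++) z[i] = r[i] / diag[i]; /* z = D^{-1} r */
        for (int i = 0; i < n; i++) p[i] = z[i];
        double rz = dot(n, r, z);

        for (int it = 0; it < max_iter && sqrt(dot(n, r, r)) > tol; it++) {
            spmv_csr(n, ia, ja, a, p, q);                  /* dominant cost per iteration */
            double alpha = rz / dot(n, p, q);
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];
            for (int i = 0; i < n; i++) r[i] -= alpha * q[i];
            for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];  /* diagonal preconditioner */
            double rz_new = dot(n, r, z);
            for (int i = 0; i < n; i++) p[i] = z[i] + (rz_new / rz) * p[i];
            rz = rz_new;
        }
        free(r); free(z); free(p); free(q);
    }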

14 WHY a mixed OpenMP/MPI approach is mandatory for very large meshes
Processor trends for the 5 coming years:
1. more cores per chip at the same frequency
2. more threads per core (SMT)
3. thread performance improvement mainly due to SIMD vector units (up to 4-8 operations/cycle/FPU in the coming years, 2-4 today) => no real improvement for CFD codes due to sparse matrices and indirect accesses, especially with a segregated solver
4. accelerators (GPGPU): very difficult to use for general-purpose codes, requiring a large amount of memory and many modifications
Larger meshes (> 1 billion cells) for a better resolution of the same geometry, or more complex geometries: more than ...K cores are required. A pure MPI version cannot scale well enough to run on more than ...K cores (especially in an industrial environment; 85% efficiency from 32K to 64K cores => 10K useless cores).
In the middle term, the mixed OpenMP (multi-thread) / MPI approach can take advantage of the steady increase of cores per chip, as well as of the SMT features, while keeping the number of MPI tasks (and therefore the scalability) constant => first objective is a ~3.5 overall speed-up with 4 threads.

15 OpenMP enablement
Two potential approaches for OpenMP enablement:
1. High-level parallelization similar to the MPI implementation (multiple-thread approach). Advantages: introduces a new level of parallelization, allowing better scalability; can be completed by a second level of parallelization over the loops (very efficient on SMT processors). Disadvantages: requires a new partitioning method and may add load imbalance; a lot of source modifications (almost a new code).
2. Loop parallelization. Advantage: minimizes the source modifications. Disadvantages: loops with memory contention and dependencies are not naturally parallel, and only certain loops can be parallelized, limiting the speed-up.
For Code_Saturne: loop parallelization, with the constraint of minimizing the source modifications while optimizing scalability and performance. Face/element renumbering methods and new data structures are used to parallelize the loops and remove the memory contention.

16 Performance optimization and OpenMP enablement depending on sparse matrix data structures
Two types of loops, depending on the sparse matrix data structure (3D mesh):
1. over the cells (or volumes), with the 3-array CSR (compressed sparse row) format
2. over the faces, the native data structure for the gradient and flux calculations
The sparse matrix-vector multiplication step can be naturally multi-threaded using certain data structures, without the need to reorder faces and/or elements.

1. Face loop:

    for (face = 0; face < nfaces; face++) {
        ii = element_list[2*face];
        jj = element_list[2*face+1];
        y[ii] += A[face] * x[jj];   /* MEMORY CONTENTION: concurrent stores to    */
        y[jj] += A[face] * x[ii];   /* y[ii] and y[jj] limit the threading options */
    }

2. Element loop with the 3-array CSR (compressed sparse row) format, which has no store conflicts:

    #pragma omp parallel for private(tmp, jj, k)
    for (i = 0; i < nelements; i++) {
        tmp = 0.0;
        for (k = ia[i]; k < ia[i+1]; k++) {
            jj = ja[k];
            tmp += Apacked[k] * x[jj];
        }
        y[i] += tmp;   /* stores for each element are independent;       */
                       /* work balance between threads is hard to control */
                       /* (depends on the study case)                     */
    }
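
For contrast, a hedged sketch of an alternative that is not the approach taken in these slides: making the face loop thread-safe with OpenMP atomic updates. It avoids any renumbering, but the atomics usually serialize the updates of this memory-bound kernel, which is what motivates the face-grouping strategy of the following slides.

    /* Face-based SpMV with atomic updates (for comparison only). */
    void face_spmv_atomic(int nfaces, const int *element_list,
                          const double *A, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int face = 0; face < nfaces; face++) {
            int ii = element_list[2*face];
            int jj = element_list[2*face + 1];
            #pragma omp atomic
            y[ii] += A[face] * x[jj];   /* atomics remove the store conflict... */
            #pragma omp atomic
            y[jj] += A[face] * x[ii];   /* ...at the price of update overhead   */
        }
    }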

17 Performance experiments with sparse matrix data structures (for the matrix-vector product)
Used the face/element indices from the FATHER test case with 32 MPI ranks: nfaces = 88554, nelements = 32782, conformal hexahedral mesh.
Data structures:
(1) native face loop with the original face/element ordering
(2) three-array CSR format, where the rows correspond to elements
(3) two-dimensional array for the matrix, A2d[nelements][6]
Matrix-vector multiplication times:

    Threads      face/element   CSR         A2d
    1 thread     5.07 msec      5.11 msec   4.24 msec
    2 threads    -              2.75 msec   2.24 msec
    4 threads    -              1.43 msec   1.18 msec

The CSR format is general and can easily be threaded for the MATVEC, but building the matrix generates an overhead (~5 matvecs) compared to the native format (which is only applicable to the MATVEC). The CSR format therefore looks promising. Some additional performance can be obtained with a 2D data structure, potentially useful if the mesh is known to be conformal, for example when at most 6 faces contribute per element.
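
A sketch of what the A2d variant might look like, assuming a conformal hexahedral mesh with exactly 6 face contributions per element and a neighbor index array; the actual layout used in the tests above is not given in the slides.

    /* SpMV over A2d[nelements][6]; neigh[i][k] gives the element adjacent to
     * element i across its k-th face (an assumed layout; boundary entries
     * could carry a zero coefficient and a self-index, for instance). */
    void spmv_a2d(int nelements, const double (*A2d)[6], const int (*neigh)[6],
                  const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < nelements; i++) {
            double tmp = 0.0;
            for (int k = 0; k < 6; k++)       /* fixed trip count: easy to unroll */
                tmp += A2d[i][k] * x[neigh[i][k]];
            y[i] += tmp;                      /* independent stores, as with CSR */
        }
    }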

18 OpenMP enablement for the face loops
The solution is simply to generate groups of faces without any dependency. Several methods based on heuristics, coloring algorithms, space-filling curves, partitioning, etc. can be applied, together with several loop transformations. All these techniques require a face renumbering.
We propose a general transformation based on GRP groups of THR subsets of faces with no memory contention, while minimizing the number of groups (to reduce the thread overhead). The original face loop

    for (i = 0; i < nfac; i++) { ... }

becomes

    for (ng = 0; ng < GRP; ng++) {
        #pragma omp parallel for
        for (nt = 0; nt < THR; nt++) {
            for (i = LI[ng][2*nt]; i <= LI[ng][2*nt+1]; i++) {
                ...
            }
        }
    }

The variables GRP, THR and LI are computed within each MPI task and optimized depending on the size of the partition. If the partition is not large enough for the active threads, either THR or the LI elements are set to zero.
For a small number of threads (~up to 8), we have introduced a new optimal face renumbering method based on a matrix bandwidth reduction algorithm. For a large number of threads, methods based on partitioning algorithms are efficient.
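
A hedged sketch of one way (greedy edge coloring, not the bandwidth-reduction method described above) to build face subsets with no memory contention; the subset ids could then be mapped onto the GRP/THR/LI structure.

    /* Greedy "edge coloring" of the faces: no two faces in the same subset
     * touch the same element, so each subset can be processed by independent
     * threads without store conflicts. Assumes fewer than 64 subsets are
     * needed, which holds when each element has few faces (e.g. 6 for
     * hexahedra: greedy needs at most 2*max_faces_per_element - 1 colors). */
    #include <stdlib.h>

    int build_face_subsets(int nfaces, int nelements,
                           const int *element_list, int *subset_of_face)
    {
        int n_subsets = 0;
        unsigned long long *used = calloc(nelements, sizeof *used);

        for (int face = 0; face < nfaces; face++) {
            int ii = element_list[2*face];
            int jj = element_list[2*face + 1];
            unsigned long long busy = used[ii] | used[jj];

            int s = 0;                  /* smallest subset free for both elements */
            while (busy & (1ULL << s))
                s++;

            subset_of_face[face] = s;
            used[ii] |= 1ULL << s;
            used[jj] |= 1ULL << s;
            if (s + 1 > n_subsets)
                n_subsets = s + 1;
        }
        free(used);
        return n_subsets;               /* number of contention-free subsets */
    }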

19 Face numbering impact on the MATVEC performance (1)
256 MPI tasks; 4796 elements and ... faces per task. 4 tests:
1. original face/element numberings
2. face renumbering with a bandwidth reduction algorithm based on standard Cuthill-McKee
3. element renumbering to improve geometric locality
4. element renumbering based on the faces (simple algorithm ordering the elements in order of increasing face number)

    Test        1        2        3        4
    Time (sec)  0.0686   0.0812   0.0963   0.0637
    diff (%)    0.0      +18.5    +40.4    -7.2

The original face/element numberings are fairly efficient and allow cache-line reuse. Renumbering the faces or the elements independently can strongly degrade the performance. A simple reordering of the elements based on the face numbering improves the performance by more than 7% => potentially significant improvements by studying the dependencies between face and element numberings (+ store/load reduction techniques).
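
One possible reading of test 4's "simple algorithm" (an assumption, not the exact code used): renumber the elements in the order in which they are first touched when sweeping the faces in increasing order, so that the face loop streams through x[] and y[] with better cache-line reuse.

    /* Order elements by their first (lowest-numbered) incident face. */
    void renumber_elements_by_face(int nfaces, int nelements,
                                   const int *element_list, int *new_id_of_elem)
    {
        for (int e = 0; e < nelements; e++)
            new_id_of_elem[e] = -1;                    /* not yet renumbered */

        int next_id = 0;
        for (int face = 0; face < nfaces; face++) {    /* faces in increasing order */
            for (int side = 0; side < 2; side++) {
                int e = element_list[2*face + side];
                if (new_id_of_elem[e] < 0)
                    new_id_of_elem[e] = next_id++;     /* first time this element is seen */
            }
        }
        for (int e = 0; e < nelements; e++)            /* untouched elements go last */
            if (new_id_of_elem[e] < 0)
                new_id_of_elem[e] = next_id++;
    }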

20 Face numbering impact on the performance (2)
[Figure: cache-line access patterns for the matrix-vector product using a face loop, 256 MPI tasks, 4796 cells, ... faces, for tests 1 to 4.]

21 OpenMP enablement for the face loops (internal tests) (1)
OpenMP performance of the matrix-vector products on BG/P using a face loop: 256 MPI tasks, 4796 cells, ... faces. 3 tests:
1. original face/element numberings
2. face renumbering with a bandwidth reduction algorithm based on standard Cuthill-McKee
3. element renumbering based on the faces (simple algorithm ordering the elements in order of increasing face number)
Creation of the groups with the bandwidth reduction method => GRP = 2, with #threads face subsets per group.

    Test        1            2            3
    serial      0.0686 sec   0.0812 sec   0.0637 sec
    1 thread    0.0688 sec   0.0815 sec   0.0640 sec
    4 threads   -            0.0244 sec   0.0187 sec
    speedup     -            3.34         3.42

22 First tests with Code_Saturne with the native storage
FATHER model (1 M cells): OpenMP directives in gradrc.f (gradient calculation, complex face loops); speedup with 256 MPI tasks and 4 threads: 3.42.
HYPI model (10 M cells): OpenMP directives in gradrc.f and cs_matrix.c (matrix-vector product); speedup with 256 MPI tasks and 4 threads: cs_matrix_alpha_a_x_p_beta_y ~4, gradrc 3.59.
First results are very encouraging. More than 95% of the computing routines can now be parallelized with OpenMP (ongoing work) with good efficiency (objective: ~3.5 speed-up with 4 OpenMP threads).

23 Thank you for your attention. QUESTIONS?
