OpenFOAM on BG/Q porting and performance
1 OpenFOAM on BG/Q: porting and performance
Paride Dagna, SCAI Department, CINECA
2 SYSTEM OVERVIEW
OpenFOAM: selected application inside the PRACE project
Fermi: PRACE Tier-0 system
Model: IBM BlueGene/Q
Architecture: 10 BG/Q frames with 2 midplanes each
Front-end nodes OS: Red Hat EL 6.2
Compute node kernel: lightweight Linux-like kernel
Processor type: IBM PowerA2, 16 cores, 1.6 GHz
Computing nodes: 10,240
Computing cores: 163,840
RAM: 16 GB/node
Internal network: network interface with 11 links -> 5D torus
Disk space: more than 2 PB of scratch space
Peak performance: 2.1 PFlop/s
3 SYSTEM OVERVIEW
Compute node (back-end): each compute node comprises 17 cores on a single chip with 16 GB of dedicated physical memory. Applications run on 16 of the cores, with the 17th core reserved for system software; nearly the full 16 GB of physical memory is available to the application. On each core it is possible to run up to 4 processes/threads, for a total of 64 processes/threads per node.
Compute card: one single-chip module, 16 GB DDR3 memory.
Applications: applications are submitted to the compute nodes by the batch scheduler; to run on the compute nodes (back-end), applications must be cross-compiled.
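As a rough illustration of how a cross-compiled solver is launched on the BG/Q back-end through the batch scheduler, the sketch below shows a minimal LoadLeveler script using runjob. The partition size, rank count, solver and case directory are placeholder values, not settings taken from these benchmarks.

  #!/bin/bash
  # Minimal LoadLeveler + runjob sketch for a BG/Q machine such as Fermi.
  # bg_size, rank counts and the case directory are illustrative only.
  # @ job_type         = bluegene
  # @ bg_size          = 64
  # @ wall_clock_limit = 01:00:00
  # @ queue

  cd /path/to/cavity3D            # placeholder case directory

  # 16 MPI ranks per node on 64 nodes = 1024 ranks; up to 64 ranks/threads
  # per node are possible by using the 4 hardware threads of each core.
  runjob --np 1024 --ranks-per-node 16 : icoFoam -parallel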
4 Porting of OpenFOAM on BG/Q
Compiling OpenFOAM for the back-end nodes on BG/Q requires some system-specific changes to the configuration scripts of OpenFOAM and of the ThirdParty package. It is not possible to use the ThirdParty MPI: rules for the BG/Q MPI must be inserted.
Environment configuration: configure the environment with compilers and zlib using modules
  module load bgq-gnu
  module load zlib
OpenFOAM configuration scripts and rules: the files bashrc and settings.sh must be changed to insert the rules for the BG/Q MPI, and the c/c++ files in the wmake/rules folders must be modified for dynamic linking.
Scotch library build: before running Allwmake in the OpenFOAM main folder, some changes are needed to the compile and dynamic-linking rules in the file Makefile.inc of the scotch library. Cross-compile and execute on the back-end the scotch dummysizes utility so that the header files scotch.h and scotchf.h are built properly.
Compile: go into $WM_PROJECT/$WM_PROJECT_VERSION and compile with ./Allwmake
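A condensed shell sketch of these porting steps is given below, assuming a Fermi-like environment; module names, paths and the exact files to edit are indicative and would have to be adapted to the specific BG/Q installation and OpenFOAM version.

  # Environment for cross-compilation (module names as on Fermi)
  module load bgq-gnu
  module load zlib

  # 1) MPI rules: edit etc/bashrc and etc/settings.sh to select the BG/Q MPI
  #    (the ThirdParty MPI cannot be used), and adapt the c/c++ files under
  #    wmake/rules so that dynamic linking works on the back-end.

  # 2) Scotch: adjust the compile and dynamic-linking rules in Makefile.inc,
  #    then cross-compile the dummysizes utility and run it on a compute node
  #    so that the generated scotch.h and scotchf.h match the back-end sizes.
  cd $WM_THIRD_PARTY_DIR/scotch*/src   # exact path depends on the scotch version
  vi Makefile.inc

  # 3) Build OpenFOAM itself
  cd $WM_PROJECT_DIR
  ./Allwmake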
5 Performance of OpenFOAM on BG/Q - Test cases
Cavity 3D: isothermal, incompressible flow - solver: icoFoam
BoxTurb 3D: homogeneous isotropic turbulence, compressible flow - solver: sonicFoam
Airfoil wing section: external aerodynamics - solver: simpleFoam
DTMB hull: marine hydrodynamics - solver: interFoam
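Each case is run in parallel following the standard OpenFOAM workflow; a minimal sketch is shown below, using the Cavity 3D case and a 1,024-way run purely as an example (on BG/Q the solver is launched with runjob through the batch scheduler rather than with mpirun).

  cd cavity3D                       # example case directory
  blockMesh                         # generate the structured mesh
  decomposePar                      # partition the mesh (system/decomposeParDict)
  mpirun -np 1024 icoFoam -parallel > log.icoFoam 2>&1
  reconstructPar                    # reassemble the solution for post-processing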
6 Performance of OpenFOAM on BG/Q - Systems
Model: IBM BlueGene/Q (Fermi)
Processor type: IBM PowerA2, 1.6 GHz
Computing node: 16 cores
RAM: 16 GB/node; 1 GB/core
Internal network: network interface with 11 links -> 5D torus
Model: Hewlett Packard C7000 (Lagrange)
Processor type: Intel Xeon Westmere, 2.8 GHz
Computing node: 12 cores
RAM: 24 GB/node; 2 GB/core
Internal network: InfiniBand QDR/DDR Voltaire, fat tree
7 Cavity 3D
Flow: laminar, isothermal, incompressible
Mesh: fully structured 3D; mesh elements: cubes
Test matrix (decomposition settings sketched below):
  1,000,000 elements - decomposition: simple, scotch - solver: icoFoam
  2,000,000 elements - decomposition: simple, scotch - solver: icoFoam
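The simple and scotch decompositions used in the tests are selected through system/decomposeParDict; a minimal sketch with illustrative values (FoamFile header omitted) is shown below.

  numberOfSubdomains  1024;

  // geometric decomposition: split the domain into nx x ny x nz blocks
  method              simple;
  simpleCoeffs
  {
      n               (16 8 8);   // product must equal numberOfSubdomains
      delta           0.001;
  }

  // graph-based decomposition: switch to "method scotch;" (no mandatory coefficients)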
8 Cavity 3D - Speed-up and efficiency - Mesh: 1,000,000 elements
Solution saved at the final time step
[Charts: speed-up and efficiency vs. number of cores - Fermi, Lagrange, Ideal]
9 Cavity 3D - Speed-up and efficiency - Mesh: 1,000,000 elements
Solution saved every 10 time steps
[Chart: speed-up vs. number of cores - Fermi, Lagrange, Ideal]
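The two I/O regimes compared here, writing only the final solution versus writing every few time steps, correspond to the write settings in system/controlDict; a minimal excerpt with illustrative values follows.

  writeControl     timeStep;    // write every writeInterval time steps
  writeInterval    10;          // frequent writes stress the I/O subsystem;
                                // a very large value writes only the last step
  purgeWrite       0;           // keep all written time directories
  writeFormat      ascii;
  writeCompression off;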
10 Cavity 3D - Profiling
Number of iterations: 1
Files per core: 3
MPI_Allreduce average message size per core: 8 B at 1024 cores
Average message size sent and received per core: 4.6 KB at 1024 cores
[Table: cumulative I/O (GB) and file size per core (MB) at 64 to 1024 cores]
[Charts: MPI and I/O profiling at 512 and 1024 cores; I/O overhead on simulation time (% increment), Fermi vs. Lagrange]
11 Cavity 3D - Speed-up and efficiency - Mesh: 2,000,000 elements
Solution saved at the final time step
[Charts: speed-up and efficiency vs. number of cores - Fermi, Lagrange, Ideal]
12 Cavity 3D - Speed-up and efficiency - Mesh: 2,000,000 elements
Solution saved every 10 time steps
[Chart: speed-up vs. number of cores - Fermi, Lagrange, Ideal]
13 Cavity 3D - Profiling
Number of iterations: 1
Files per core: 3
MPI_Allreduce average message size per core: 8 B at 1024 cores
Average message size sent and received per core: 6.4 KB at 1024 cores
[Table: cumulative I/O (GB) and file size per core (MB) at 64 to 1024 cores]
[Charts: MPI and I/O profiling at 512 and 1024 cores; I/O overhead on simulation time (% increment), Fermi vs. Lagrange]
14 BoxTurb 3D
Flow: compressible
Case study: homogeneous, isotropic turbulence
Mesh: uniform 3D
Number of cells: 17,000,000
Solver: sonicFoam
Partition method: simple
Courtesy of: Matteo Cerminara (INGV), Pisa
15 BoxTurb 3D - Speed-up and efficiency
Solution saved at the final time step
Partition method: simple
[Charts: speed-up and efficiency vs. number of cores - Fermi, Lagrange, Ideal]
16 BoxTurb 3D - Speed-up and efficiency
Solution saved every 10 time steps
Partition method: simple
[Charts: speed-up and efficiency vs. number of cores - Fermi, Lagrange, Ideal]
17 BoxTurb 3D - Profiling
Number of iterations: 18
Files per core: 4
MPI_Allreduce average message size per core: 8 B at 1024 cores
Average message size sent and received per core: 9.3 KB at 1024 cores
[Table: cumulative I/O (GB) and file size per core (MB) at 64 to 1024 cores]
[Charts: MPI and I/O profiling at 512 and 1024 cores; I/O overhead on simulation time (% increment), Fermi vs. Lagrange]
18 Airfoil wing section
Flow: turbulent, incompressible
Case study: steady state, extruded NACA airfoil
Mesh: fully structured 3D
Number of cells: 9,000,000
Solver: simpleFoam
Decomposition method: simple, scotch
19 Airfoil wing section - Speed-up and efficiency
Solution saved at the final time step
20 Airfoil wing section - Profiling
[Charts: MPI profiling with the simple and scotch decompositions, each at two core counts]
21 Airfoil wing section - Speed-up and efficiency
Solution saved every 10 time steps
[Chart: speed-up vs. number of cores - Fermi, Lagrange, Ideal]
22 Airfoil wing section - Profiling
Decomposition method: scotch
Number of iterations: 1
Files per core: 6
MPI_Allreduce average message size per core: 8 B at 512 cores
Average message size sent and received per core: 4.2 KB at 512 cores
[Table: cumulative I/O (GB) and file size per core (MB) at 64 to 1024 cores]
[Charts: MPI and I/O profiling at 512 and 1024 cores; I/O overhead on simulation time (% increment), Fermi vs. Lagrange]
23 Free surface - DTMB hull 3D
Flow: turbulent, incompressible
Case study: unsteady, multiphase
Mesh: unstructured 3D
Number of cells: 5,500,000
Solver: interFoam
Decomposition method: simple, scotch
24 Free surface - DTMB hull 3D - Speed-up and efficiency
Solution saved at the final time step
[Charts: speed-up and efficiency vs. number of cores - Fermi, Lagrange, Ideal]
25 Free surface - DTMB hull 3D - Speed-up and efficiency
Solution saved every 10 time steps
[Charts: speed-up and efficiency vs. number of cores]
26 Free surface - DTMB hull 3D - Profiling
Number of iterations: 1
Files per core: 8
MPI_Allreduce average message size per core: 8 B at 512 cores
Average message size sent and received per core: 29.4 KB at 512 cores
[Table: cumulative I/O (GB) and file size per core (MB) at 64 to 512 cores]
[Charts: MPI and I/O profiling at 256 and 512 cores; I/O overhead on simulation time (% increment), Fermi vs. Lagrange]
27 Conclusions
OpenFOAM scaling and efficiency on Fermi and on classic HPC systems are comparable; for well-suited case studies, with a good balance between computation, I/O and MPI communication, the larger number of cores available on Fermi can be exploited.
OpenFOAM efficiency and scaling are constrained by its poor I/O design and by inter-process communication.
A new I/O scheme based on MPI parallel I/O routines, or on existing parallel I/O libraries able to exploit the parallel file system efficiently, should dramatically reduce the I/O overhead.
A hybrid multi-threaded MPI/OpenMP version of the solvers would mitigate the time spent in MPI routines as the number of cores increases.
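To illustrate the kind of collective I/O scheme the conclusions point to, the sketch below shows every rank writing its partition of a field to a single shared file with MPI-IO. This is a generic illustration, not OpenFOAM code; the field size and file name are made up.

  // Minimal MPI-IO sketch: each rank writes its local part of a field into one
  // shared file at a disjoint offset, using a collective call so the MPI library
  // can aggregate requests for the parallel file system. Illustrative only.
  #include <mpi.h>
  #include <vector>

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      const MPI_Offset cellsPerRank = 100000;           // hypothetical partition size
      std::vector<double> p(cellsPerRank, 1.0 * rank);  // local part of the pressure field

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "p.bin",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      MPI_Offset offset = rank * cellsPerRank * sizeof(double);
      MPI_File_write_at_all(fh, offset, p.data(), cellsPerRank,
                            MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }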
28 Acknowledgements
Bob Danani - VLSCI, Carlton, Melbourne
Matteo Cerminara - INGV
Massimiliano Culpo - CINECA
Piero Lanucara - CINECA
Andrea Penza - CINECA
Francesco Salvadore - CINECA
Ivan Spisso - CINECA
29 Questions?