Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Michael Lange (1), Gerard Gorman (1), Michele Weiland (2), Lawrence Mitchell (2), Xiaohu Guo (3), James Southern (4)

(1) AMCG, Imperial College London
(2) EPCC, University of Edinburgh
(3) STFC, Daresbury Laboratory
(4) Fujitsu Laboratories of Europe Ltd.

9 July 2013
Motivation

Fluidity
- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

PETSc
- Linear solver engine
- Hybrid MPI/OpenMP version
Programming for Exascale

Three levels of parallelism in modern HPC architectures [1]:
- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within the core: SIMD

Hybrid MPI/OpenMP parallelism:
- Memory argument: MPI memory footprint not scalable; replication of halo data
- Speed argument: message-passing overhead; improved load balance with fewer MPI ranks

[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1-5:8, New York, NY, USA, 2010. ACM.
PETSc Overview

- Matrix and Vector classes are used by all other components
- Added OpenMP threading to the low-level implementations:
  - Vector operations
  - CSR matrices
  - Block-CSR matrices
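To make the threading concrete, below is a minimal sketch of what an OpenMP-threaded vector kernel looks like, using an AXPY-style update; the function name and signature are illustrative, not the actual PETSc internals.

/* Illustrative OpenMP-threaded AXPY kernel: y <- y + alpha * x.
 * Static scheduling keeps each thread on a contiguous chunk of the
 * arrays, which matches first-touch page placement on NUMA nodes. */
#include <stddef.h>

void vec_axpy_omp(size_t n, double alpha, const double *x, double *y)
{
    size_t i;
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

The matrix kernels follow the same pattern, threading over rows of the CSR structure.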
Sparse Matrix-Vector Multiplication

- The matrix-vector multiply is the most expensive component of the solve

[Figure: row-wise partitioning of the matrix and input vector across processes P1-P8]

Parallel matrix-vector multiply:
1. Multiply the diagonal submatrix
2. Scatter/gather remote vector elements
3. Multiply-add the off-diagonal submatrices
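The per-process structure can be sketched in plain C as below; the array names are illustrative, and the scatter/gather that fills the ghost values is assumed to happen between the two loops.

/* Simplified per-process sketch of the split product
 *   y = A_diag * x_local + A_offd * x_ghost
 * with both blocks stored in CSR format. */
void spmv_split(int nrows,
                const int *ai, const int *aj, const double *aa,  /* diagonal block */
                const int *bi, const int *bj, const double *ba,  /* off-diagonal block */
                const double *x_local, const double *x_ghost, double *y)
{
    /* 1. Multiply the diagonal block with locally owned vector entries. */
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = ai[i]; k < ai[i+1]; k++) sum += aa[k] * x_local[aj[k]];
        y[i] = sum;
    }
    /* 2. The scatter/gather fills x_ghost with remote vector entries. */
    /* 3. Multiply-add the off-diagonal block with the gathered values. */
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = bi[i]; k < bi[i+1]; k++) sum += ba[k] * x_ghost[bj[k]];
        y[i] += sum;
    }
}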
Sparse Matrix-Vector Multiplication

- Input vector elements require MPI communication
- MPI latency can be hidden by overlapping it with local computation
- Not all MPI implementations progress communication asynchronously [2]

Task-based matrix-vector multiply:
- A dedicated thread handles the MPI communication: it advances the communication protocol and copies data to/from the message buffers
- In contrast to vector-based threading, the parallel section is lifted to include the scatter/gather operation, so the 'parallel for' pragma cannot be used
- N - 1 compute threads are enough to saturate the memory bandwidth

[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339-358, 2011.
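A minimal sketch of the task-based scheme is shown below, reusing the CSR arrays from the previous sketch and assuming at least two OpenMP threads; halo_begin/halo_end stand in for the scatter/gather and are hypothetical names, not PETSc API.

#include <omp.h>

void halo_begin(const double *x_local, double *x_ghost);  /* hypothetical: post sends/receives */
void halo_end(double *x_ghost);                           /* hypothetical: wait for completion */

void spmv_task_based(int nrows,
                     const int *ai, const int *aj, const double *aa,
                     const int *bi, const int *bj, const double *ba,
                     const double *x_local, double *x_ghost, double *y)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        if (tid == 0) {
            /* Dedicated communication thread: drives the MPI halo
             * exchange and copies data to/from the message buffers. */
            halo_begin(x_local, x_ghost);
            halo_end(x_ghost);
        } else {
            /* The remaining nth-1 threads multiply the diagonal block
             * over explicit row ranges; a 'parallel for' cannot be used
             * because thread 0 is occupied with communication. */
            int lo = (tid - 1) * nrows / (nth - 1);
            int hi = tid * nrows / (nth - 1);
            for (int i = lo; i < hi; i++) {
                double sum = 0.0;
                for (int k = ai[i]; k < ai[i+1]; k++) sum += aa[k] * x_local[aj[k]];
                y[i] = sum;
            }
        }
        #pragma omp barrier
        /* Ghost values have arrived: all threads multiply-add the
         * off-diagonal block. */
        int lo = tid * nrows / nth;
        int hi = (tid + 1) * nrows / nth;
        for (int i = lo; i < hi; i++) {
            double sum = 0.0;
            for (int k = bi[i]; k < bi[i+1]; k++) sum += ba[k] * x_ghost[bj[k]];
            y[i] += sum;
        }
    }
}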
Sparse Matrix-Vector Multiplication: Thread-Level Load Balance

- Matrix rows are partitioned into blocks, one per thread
- The partitioning is created from the number of non-zero elements per row [3]
- The partitioning is cached with the matrix object
- Explicit thread-balancing scheme: an initial greedy allocation, refined by a local diffusion algorithm

[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178-194, 2009.
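A simplified version of the non-zero-based partitioning is sketched below: cut points are chosen so that each thread's contiguous block of rows holds roughly an equal share of the non-zeros (the local diffusion refinement is omitted, and the function name is illustrative).

/* Compute per-thread row ranges from the CSR row pointer 'ai', where
 * ai[nrows] is the total number of non-zeros. Thread t is assigned
 * rows [row_start[t], row_start[t+1]). */
void partition_rows_by_nnz(int nrows, const int *ai, int nthreads, int *row_start)
{
    long long total = ai[nrows];
    int row = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long long target = total * t / nthreads;   /* cumulative nnz target */
        while (row < nrows && ai[row] < target) row++;
        row_start[t] = row;
    }
    row_start[nthreads] = nrows;
}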
Benchmark

Global baroclinic ocean simulation [4]
- Mesh based on extruded bathymetry data
- Pressure matrix: 371,102,769 non-zero elements, 13,491,933 degrees of freedom

Solver options:
- Conjugate Gradient method
- Jacobi preconditioner
- 10,000 iterations

[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003-1015, 2008.
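For reference, the stated solver configuration corresponds to the following standard PETSc calls (a hedged sketch; the benchmark itself is driven from Fluidity, and error checking is omitted for brevity):

#include <petscksp.h>

static void configure_solver(KSP ksp)
{
    PC pc;
    KSPSetType(ksp, KSPCG);                  /* Conjugate Gradient */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);                 /* Jacobi preconditioner */
    KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                     PETSC_DEFAULT, 10000);  /* cap at 10,000 iterations */
}

The same configuration can be selected at run time with the options -ksp_type cg -pc_type jacobi -ksp_max_it 10000.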
Architecture Overview

Cray XE6 (HECToR)
- NUMA architecture
- 32 cores per node: 4 NUMA domains of 8 cores each

Fujitsu PRIMEHPC FX10
- UMA architecture
- 16 cores per node

IBM BlueGene/Q
- UMA architecture
- 16 cores per node
- 4-way hardware threading (SMT)
Hardware Utilisation: 128 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) on the XE6 and FX10, for the vector-based, task-based and NZ-balanced task-based variants]

- On the XE6, performance degrades once a process spans multiple NUMA domains
- Performance is bound by memory latency
Hardware Utilisation: 1024 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) on the XE6 and FX10, for the vector-based, task-based and NZ-balanced task-based variants]

- Both task-based algorithms improve performance
- The NZ-based load balancing is now faster
Hardware Utilisation: 4096 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) on the XE6, for the vector-based, task-based and NZ-balanced task-based variants]

- The vector-based approach is bound by MPI communication
- Explicit thread balancing improves memory bandwidth utilisation, but worsens latency effects
Strong Scaling: Cray XE6

[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (32-8,192) for the vector-based, task-based, NZ-balanced task-based and pure-MPI versions]
Strong Scaling: Cray XE6

[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (256-32,768) for the vector-based, task-based, NZ-balanced task-based and pure-MPI versions]
Strong Scaling: BlueGene/Q

[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (128-8,192) for pure MPI and the vector-based, task-based and NZ-balanced task-based versions with SMT=4]
Conclusion

OpenMP-threaded PETSc version
- Threaded vector and matrix operators
- Task-based sparse matrix-vector multiplication
- Non-zero-based thread partitioning

Strong-scaling optimisation
- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong-scaling limit (bandwidth-bound)

Marshalling load imbalance
- Inter-process balance is improved with fewer MPI ranks
- Load imbalance among threads is handled explicitly
Acknowledgements

The threaded PETSc version is available from the Open Petascale Libraries: http://www.openpetascale.org/

The work presented here was funded by:
- Fujitsu Laboratories of Europe Ltd.
- The European Commission in FP7, as part of the APOS-EU project

Many thanks to:
- EPCC
- The Hartree Centre
- The PETSc development team