Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation


1 Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Michael Lange (1), Gerard Gorman (1), Michele Weiland (2), Lawrence Mitchell (2), Xiaohu Guo (3), James Southern (4)

(1) AMCG, Imperial College London
(2) EPCC, University of Edinburgh
(3) STFC, Daresbury Laboratory
(4) Fujitsu Laboratories of Europe Ltd.

9 July 2013

2 Motivation

Fluidity:
- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

PETSc:
- Linear solver engine
- Hybrid MPI/OpenMP version

3 Programming for Exascale

Three levels of parallelism in modern HPC architectures [1]:
- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within a core: SIMD

Hybrid MPI/OpenMP parallelism (a minimal set-up sketch follows below):
- Memory argument: MPI memory footprint not scalable; replication of halo data
- Speed argument: message passing overhead; improved load balance with fewer MPI ranks

[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1-5:8, New York, NY, USA, 2010. ACM.
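
The structure such a hybrid code takes is a small amount of boilerplate: request a threading level from MPI at start-up and open OpenMP parallel regions inside each rank. The following is a minimal illustrative sketch (not code from the talk); the MPI_THREAD_FUNNELED level assumes only the master thread makes MPI calls.

    /* Minimal hybrid MPI/OpenMP skeleton (illustrative sketch).
     * One MPI rank per node or NUMA domain; OpenMP threads fill the cores. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* MPI_THREAD_FUNNELED: only the master thread will call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Shared-memory parallel work within the rank goes here. */
            #pragma omp master
            printf("rank %d running %d threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }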

4 PETSc Overview

- Matrix and Vector classes are used by all other PETSc components
- Added OpenMP threading to the low-level implementations (illustrated below):
  - Vector operations
  - CSR matrices
  - Block-CSR matrices
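
To illustrate the kind of loop-level threading involved, a plain-C sketch of an OpenMP-parallel CSR sparse matrix-vector product is shown below. This illustrates the pattern only; it is not PETSc's actual source.

    /* Illustrative OpenMP-threaded CSR SpMV: y = A*x.
     * rowptr/colidx/vals hold a standard CSR matrix. */
    void csr_spmv_omp(int nrows, const int *rowptr, const int *colidx,
                      const double *vals, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += vals[j] * x[colidx[j]];
            y[i] = sum;
        }
    }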

5 Sparse Matrix-Vector Multiplication

The matrix-vector multiply is the most expensive component of the solve.

[Diagram: matrix and vector partitioned row-wise across processes P1-P8]

Parallel matrix-vector multiply (structure sketched below):
- Multiply the diagonal submatrix
- Scatter/gather remote vector elements
- Multiply-add the off-diagonal submatrices
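
In outline, each rank multiplies its local diagonal block while the remote (ghost) vector elements are being exchanged, then applies the off-diagonal block to the gathered ghost values. The sketch below shows only this structure; the CSR struct and the two ghost-exchange helpers are hypothetical placeholders, not PETSc's API.

    /* Minimal CSR container (illustrative). */
    typedef struct { int nrows; const int *rowptr, *colidx; const double *vals; } CSR;

    /* Hypothetical helpers wrapping the MPI scatter/gather of ghost vector
     * entries: post the non-blocking communication / wait for completion. */
    void start_ghost_exchange(const double *x_local, double *x_ghost);
    void finish_ghost_exchange(double *x_ghost);

    /* y = A_diag * x_local, overlapped with the ghost exchange,
     * then y += A_offdiag * x_ghost.  Structure only, not PETSc source. */
    void dist_spmv(const CSR *Ad, const CSR *Ao,
                   const double *x_local, double *x_ghost, double *y)
    {
        start_ghost_exchange(x_local, x_ghost);      /* communication in flight */
        for (int i = 0; i < Ad->nrows; i++) {        /* local diagonal block    */
            double sum = 0.0;
            for (int j = Ad->rowptr[i]; j < Ad->rowptr[i + 1]; j++)
                sum += Ad->vals[j] * x_local[Ad->colidx[j]];
            y[i] = sum;
        }
        finish_ghost_exchange(x_ghost);              /* ghost entries now valid */
        for (int i = 0; i < Ao->nrows; i++)          /* off-diagonal block      */
            for (int j = Ao->rowptr[i]; j < Ao->rowptr[i + 1]; j++)
                y[i] += Ao->vals[j] * x_ghost[Ao->colidx[j]];
    }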

6 Sparse Matrix-Vector Multiplication

- Input vector elements require MPI communication
- Hide MPI latency by overlapping it with local computation
- Not all MPI implementations progress communication asynchronously [2]

Task-based matrix-vector multiply (skeleton below):
- Dedicated thread for MPI communication: advances the communication protocol and copies data to/from buffers
- In contrast to vector-based threading, the parallel section is lifted to include the scatter/gather operation, so a parallel-for pragma cannot be used
- N-1 threads are enough to saturate memory bandwidth

[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3), 2011.
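
A hedged skeleton of the task-based scheme is shown below, reusing the CSR type and the hypothetical ghost-exchange helpers from the previous sketch: inside a single parallel region, thread 0 drives the MPI exchange while the remaining threads multiply static row blocks of the diagonal submatrix. It assumes at least two threads and is not the talk's implementation.

    #include <omp.h>

    void task_based_spmv(const CSR *Ad, const CSR *Ao,
                         const double *x_local, double *x_ghost, double *y)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nt  = omp_get_num_threads();   /* assumed >= 2 */

            if (tid == 0) {
                /* Communication thread: progress MPI and fill the ghost buffer. */
                start_ghost_exchange(x_local, x_ghost);
                finish_ghost_exchange(x_ghost);
            } else {
                /* Worker threads take static row blocks of the diagonal block.
                 * No parallel-for pragma: thread 0 is occupied with MPI. */
                int chunk = (Ad->nrows + nt - 2) / (nt - 1);
                int lo = (tid - 1) * chunk;
                int hi = lo + chunk < Ad->nrows ? lo + chunk : Ad->nrows;
                for (int i = lo; i < hi; i++) {
                    double sum = 0.0;
                    for (int j = Ad->rowptr[i]; j < Ad->rowptr[i + 1]; j++)
                        sum += Ad->vals[j] * x_local[Ad->colidx[j]];
                    y[i] = sum;
                }
            }
            #pragma omp barrier
            /* All threads now add the off-diagonal contribution. */
            #pragma omp for
            for (int i = 0; i < Ao->nrows; i++)
                for (int j = Ao->rowptr[i]; j < Ao->rowptr[i + 1]; j++)
                    y[i] += Ao->vals[j] * x_ghost[Ao->colidx[j]];
        }
    }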

7 Sparse Matrix-Vector Multiplication: Thread-Level Load Balance

- Matrix rows are partitioned into blocks of rows per thread
- The partitioning is created from the number of non-zero elements per row [3] (see the sketch below)
- The partitioning is cached with the matrix object
- Explicit thread-balancing scheme: an initial greedy allocation, refined by a local diffusion algorithm

[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3), 2009.
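
The greedy allocation step can be pictured as walking the CSR row pointer and cutting a new thread block whenever the accumulated non-zero count crosses the next multiple of nnz/nthreads; the diffusion refinement that shifts rows between neighbouring blocks is omitted here. This is an illustrative sketch, not the implementation from the talk.

    /* Greedy non-zero-based row partitioning (illustrative sketch):
     * thread t gets rows bounds[t] .. bounds[t+1]-1, each block holding
     * roughly nnz_total / nthreads non-zeros.  bounds has nthreads+1 entries. */
    void partition_rows_by_nnz(int nrows, const int *rowptr,
                               int nthreads, int *bounds)
    {
        long long nnz_total = rowptr[nrows];
        long long target = (nnz_total + nthreads - 1) / nthreads;
        long long acc = 0;
        int t = 0;
        bounds[0] = 0;
        for (int i = 0; i < nrows && t < nthreads - 1; i++) {
            acc += rowptr[i + 1] - rowptr[i];
            if (acc >= (long long)(t + 1) * target)
                bounds[++t] = i + 1;         /* close block t at this row */
        }
        while (t < nthreads)
            bounds[++t] = nrows;             /* remaining blocks end at nrows */
    }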

8 Benchmark

- Global baroclinic ocean simulation [4]
- Mesh based on extruded bathymetry data
- Pressure matrix: 371,102,769 non-zero elements; 13,491,933 degrees of freedom
- Solver options: Conjugate Gradient method, Jacobi preconditioner, 10,000 iterations (set up as sketched below)

[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8), 2008.
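
In PETSc this solver configuration can be selected at run time (e.g. -ksp_type cg -pc_type jacobi -ksp_max_it 10000) or programmatically. Below is a minimal sketch against a recent PETSc release; the function name solve_pressure is hypothetical, and Mat A and Vec b, x are assumed to be created and assembled elsewhere.

    #include <petscksp.h>

    /* Conjugate Gradient with a Jacobi preconditioner, capped at 10,000
     * iterations, matching the benchmark's solver options (sketch only). */
    PetscErrorCode solve_pressure(Mat A, Vec b, Vec x)
    {
        KSP            ksp;
        PC             pc;
        PetscErrorCode ierr;

        ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
        ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
        ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);           /* Conjugate Gradient    */
        ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
        ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);          /* Jacobi preconditioner */
        ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                                PETSC_DEFAULT, 10000);CHKERRQ(ierr);
        ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);           /* allow -ksp_* overrides */
        ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
        ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
        return 0;
    }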

9 Architecture Overview

Cray XE6 (HECToR):
- NUMA architecture
- 32 cores per node: 4 NUMA domains, 8 cores each

Fujitsu PRIMEHPC FX10:
- UMA architecture
- 16 cores per node

IBM BlueGene/Q:
- UMA architecture
- 16 cores per node
- 4-way hardware threading (SMT)

10 Hardware Utilisation: 128 cores

[Plot: runtime (s) vs. number of threads per MPI process on XE6 and FX10, each with vector-based, task-based, and task-based NZ-balanced variants]

- On the XE6: slowdown when threads span multiple NUMA domains
- Performance is bound by memory latency

11 Hardware Utilisation: 1024 cores

[Plot: runtime (s) vs. number of threads per MPI process on XE6 and FX10, each with vector-based, task-based, and task-based NZ-balanced variants]

- Both task-based algorithms improve performance
- NZ-based load balancing is now faster

12 Hardware Utilisation: 4096 cores

[Plot: runtime (s) vs. number of threads per MPI process on the XE6 with vector-based, task-based, and task-based NZ-balanced variants]

- The vector-based approach is bound by MPI communication
- Explicit thread balancing improves memory bandwidth utilisation, but worsens latency effects!

13 Strong Scaling: Cray XE6

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores on the XE6 for vector-based, task-based, task-based NZ-balanced, and pure-MPI variants]

14 Strong Scaling: Cray XE6

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores on the XE6 for vector-based, task-based, task-based NZ-balanced, and pure-MPI variants]

15 Strong Scaling: BlueGene/Q

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores on BlueGene/Q for pure-MPI and SMT=4 vector-based, task-based, and task-based NZ-balanced variants]

16 Conclusion

OpenMP-threaded PETSc version:
- Threaded vector and matrix operators
- Task-based sparse matrix-vector multiplication
- Non-zero-based thread partitioning

Strong-scaling optimisation:
- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong-scaling limit (bandwidth-bound)

Marshalling load imbalance:
- Inter-process balance improved with fewer MPI ranks
- Load imbalance among threads handled explicitly

17 Acknowledgements

The threaded PETSc version is available via the Open Petascale Libraries.

The work presented here was funded by:
- Fujitsu Laboratories of Europe Ltd.
- The European Commission in FP7, as part of the APOS-EU project

Many thanks to:
- EPCC
- Hartree Centre
- The PETSc development team
