Parallelism Inherent in the Wavefront Algorithm. Gavin J. Pringle

1 Parallelism Inherent in the Wavefront Algorithm Gavin J. Pringle

2 The Benchmark code
Particle transport code using the wavefront algorithm
Primarily used for benchmarking
Coded in Fortran 90 and MPI
Scales to thousands of cores for large problems
Over 90% of time in one kernel at the heart of the computation

3 Serial Algorithm Outline
Outer iteration
  Loop over energy groups
    Inner iteration
      Loop over sweeps
        Loop over cells in z direction
          Loop over cells in y direction
            Loop over cells in x direction
              Loop over angles (the only independent loop!)
                work (90% of time spent here)
              End loop over angles
            End loop over cells in x direction
          End loop over cells in y direction
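
A minimal Fortran 90 sketch of this loop nest is given below; all names (n_groups, n_sweeps, nx, ny, nz, nang, work_on_cell) are illustrative placeholders rather than identifiers from the benchmark:

! Minimal serial sketch of the loop nest above.  All names
! (n_groups, n_sweeps, nx, ny, nz, nang, work_on_cell) are
! illustrative placeholders, not identifiers from the benchmark.
program serial_sweep
  implicit none
  integer, parameter :: n_groups = 8, n_sweeps = 8
  integer, parameter :: nx = 16, ny = 16, nz = 16, nang = 8
  integer :: g, s, i, j, k, a

  do g = 1, n_groups                    ! outer iteration: energy groups
    do s = 1, n_sweeps                  ! inner iteration: sweeps
      do k = 1, nz                      ! cells in z direction
        do j = 1, ny                    ! cells in y direction
          do i = 1, nx                  ! cells in x direction
            do a = 1, nang              ! angles: the only independent loop
              call work_on_cell(g, s, i, j, k, a)   ! ~90% of the run time
            end do
          end do
        end do
      end do
    end do
  end do

contains

  subroutine work_on_cell(g, s, i, j, k, a)
    integer, intent(in) :: g, s, i, j, k, a
    ! placeholder for the transport kernel
  end subroutine work_on_cell

end program serial_sweep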

4 Close up of parallelised loops over cells
Loop over cells in z direction
  Possible MPI_Recv communications
  Loop over cells in y direction
    Loop over cells in x direction
      Loop over angles (number of angles too small for MPI)
        work
      End loop over angles
    End loop over cells in x direction
  End loop over cells in y direction
  Possible MPI_Ssend communications
End loop over cells in z direction
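
A hedged sketch of this per-z-plane pattern in Fortran 90 with MPI follows; the neighbour ranks, halo arrays and message sizes are assumptions for illustration, not the benchmark's actual variables:

! Sketch of the per-z-plane pattern: receive upstream data, compute the
! local x-y tile, then send downstream.  Neighbour ranks, halo arrays
! and message sizes are illustrative assumptions only.
subroutine sweep_plane_mpi(from_top, from_right, to_left, to_bottom, &
                           nx, ny, nz, nang, comm)
  use mpi
  implicit none
  integer, intent(in) :: from_top, from_right, to_left, to_bottom
  integer, intent(in) :: nx, ny, nz, nang, comm
  double precision :: halo_x(nx), halo_y(ny)
  integer :: i, j, k, a, ierr, status(MPI_STATUS_SIZE)

  halo_x = 0.0d0
  halo_y = 0.0d0

  do k = 1, nz                               ! loop over cells in z direction
    ! possible MPI_Recv communications from upstream neighbours
    if (from_top /= MPI_PROC_NULL) then
      call MPI_Recv(halo_x, nx, MPI_DOUBLE_PRECISION, from_top, 1, comm, status, ierr)
    end if
    if (from_right /= MPI_PROC_NULL) then
      call MPI_Recv(halo_y, ny, MPI_DOUBLE_PRECISION, from_right, 2, comm, status, ierr)
    end if

    do j = 1, ny                             ! cells in y direction
      do i = 1, nx                           ! cells in x direction
        do a = 1, nang                       ! too few angles to use MPI here
          ! work on cell (i, j, k) for angle a
        end do
      end do
    end do

    ! possible MPI_Ssend communications to downstream neighbours
    if (to_bottom /= MPI_PROC_NULL) then
      call MPI_Ssend(halo_x, nx, MPI_DOUBLE_PRECISION, to_bottom, 1, comm, ierr)
    end if
    if (to_left /= MPI_PROC_NULL) then
      call MPI_Ssend(halo_y, ny, MPI_DOUBLE_PRECISION, to_left, 2, comm, ierr)
    end if
  end do
end subroutine sweep_plane_mpi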

5 MPI decomposition
The MPI decomposition is a 2D decomposition of the front x-y face.
[Figure: the front x-y face divided among 4 MPI tasks.]
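
One plausible way to build such a 2D decomposition and obtain the four neighbour ranks is sketched below; this is a guess at a typical setup, not the benchmark's code. Boundary neighbours come back as MPI_PROC_NULL, so the "possible" sends and receives simply do not occur at the domain edge:

! Possible setup for the 2D decomposition of the x-y face (sketch only).
! With disp = +1 in dimension 0 and -1 in dimension 1, the sweep shown
! here runs from top to bottom and from right to left.
program cart2d
  use mpi
  implicit none
  integer :: ierr, nprocs, comm2d
  integer :: dims(2), from_top, to_bottom, from_right, to_left
  logical :: periods(2)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  dims = 0
  call MPI_Dims_create(nprocs, 2, dims, ierr)        ! e.g. 4 tasks -> 2x2 grid
  periods = .false.                                  ! non-periodic domain
  call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)

  call MPI_Cart_shift(comm2d, 0,  1, from_top,   to_bottom, ierr)
  call MPI_Cart_shift(comm2d, 1, -1, from_right, to_left,   ierr)

  call MPI_Finalize(ierr)
end program cart2d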

6 Diagram of dependencies
This diagram shows the domain of one MPI task.
A cell cannot be processed until all cells upstream have been processed.
[Figure: MPI data arrives from the top and the right, and is sent to the left and the bottom.]

7 Sweep order: 3D diagonal slices
Cells of the same colour are independent and may be processed in parallel once preceding slices are complete.
[Figure: the same domain coloured by diagonal slice, again with MPI data from the top and right and to the left and bottom.]

8 Slice shapes (6x6x6)
The slices start as increasing triangles, transform into hexagons, then shrink again as decreasing (flipped) triangles.
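
Cells on the same diagonal slice share a constant i + j + k, so a 6x6x6 domain has 3*6 - 2 = 16 slices. The small program below (illustrative only) counts the cells per slice and reproduces the triangle, hexagon, triangle pattern:

! Cells in slice s satisfy i + j + k = s + 2, so a 6x6x6 domain has
! 3*6 - 2 = 16 slices.  The counts grow as triangles (1, 3, 6, 10, 15, 21),
! pass through hexagons (25, 27, 27, 25) and shrink again.
program slice_sizes
  implicit none
  integer, parameter :: n = 6
  integer :: i, j, k, s, ncells

  do s = 1, 3*n - 2
    ncells = 0
    do k = 1, n
      do j = 1, n
        do i = 1, n
          if (i + j + k == s + 2) ncells = ncells + 1
        end do
      end do
    end do
    print '(a, i2, a, i3, a)', 'slice ', s, ': ', ncells, ' independent cells'
  end do
end program slice_sizes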

9-24 Slices 1 to 16
[Figures: one slide per slice, sweeping from slice 1 at the cell nearest the viewer, moving down and away from the viewer, to slice 16 at the point furthest from the viewer.]

25 Close up of parallelised loops over cells using MPI
Loop over cells in z direction
  Possible MPI_Recv communications
  Loop over cells in y direction
    Loop over cells in x direction
      Loop over angles (number of angles too small for MPI)
        work
      End loop over angles
    End loop over cells in x direction
  End loop over cells in y direction
  Possible MPI_Ssend communications
End loop over cells in z direction

26 Close up of parallelised loops over cells using MPI and OpenMP
Loop over slices
  Possible MPI_Recv communications
  OMP PARALLEL DO
  Loop over cells in each slice
    OMP PARALLEL DO
    Loop over angles
      work
    End loop over angles
    OMP END PARALLEL DO
  End loop over cells in each slice
  OMP END PARALLEL DO
  Possible MPI_Ssend communications
End loop over slices
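
A minimal sketch of the threaded inner loops, assuming the cells of the current slice have been gathered into index lists; cell_i, cell_j, cell_k and the work placeholder are illustrative names, not the benchmark's:

! Threaded loop over the cells of one diagonal slice; the cells are
! independent, so the OpenMP parallel do is safe.  Index lists and the
! work placeholder are illustrative assumptions.
subroutine process_slice(cell_i, cell_j, cell_k, ncells, nang)
  implicit none
  integer, intent(in) :: ncells, nang
  integer, intent(in) :: cell_i(ncells), cell_j(ncells), cell_k(ncells)
  integer :: c, a

  !$omp parallel do private(a)
  do c = 1, ncells                 ! cells in this slice are independent
    do a = 1, nang                 ! the angle loop could also be threaded
      ! work on cell (cell_i(c), cell_j(c), cell_k(c)) for angle a
    end do
  end do
  !$omp end parallel do
end subroutine process_slice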

27 Parallel Algorithm Outline
Outer iteration
  Loop over energy groups
    Inner iteration
      Loop over sweeps
        Loop over slices
          Possible MPI_Recv communications
          OMP PARALLEL DO
          Loop over cells in each slice
            OMP PARALLEL DO
            Loop over angles
              work
            End loop over angles
            Etc.

28 Decoupling interdependent energy group calculations
Initially, each energy group calculation used a previous energy group's results as input.
Decoupling the energy groups has two outcomes:
  Execution time is greatly increased
  Energy groups are now independent and can be parallelised
This situation is often seen in HPC:
  Modern algorithms can be inherently serial
  An older version may be parallelisable

29 Task Farm Summary
If all the tasks take the same time to compute
  Block distribution of tasks, or
  Cyclic distribution of tasks ("dealing cards")
Else if the tasks have different execution times
  If the length of each task is unknown in advance
    Cyclic distribution of tasks
  Else
    Order tasks: longest first, shortest last
    Cyclic distribution of tasks
  Endif
Endif
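
As a concrete illustration of the cyclic ("dealing cards") distribution, the sketch below assigns energy groups to MPI tasks in round-robin order; the group count and the print statement are arbitrary assumptions:

! Illustrative cyclic ("dealing cards") distribution of energy groups
! over MPI tasks; the group count is a placeholder.
program cyclic_farm
  use mpi
  implicit none
  integer, parameter :: n_groups = 32
  integer :: rank, nprocs, ierr, g

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  do g = 1, n_groups
    if (mod(g - 1, nprocs) == rank) then
      print '(a, i3, a, i3)', 'rank ', rank, ' computes energy group ', g
    end if
  end do

  call MPI_Finalize(ierr)
end program cyclic_farm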

30 Final Parallel Algorithm Outline
Outer iteration
  MPI task farm of energy groups
    Inner iteration
      Loop over sweeps
        Loop over slices
          Possible MPI_Recv communications
          OMP PARALLEL DO
          Loop over cells in each slice
            OMP PARALLEL DO
            Loop over angles
              work
            End loop over angles
            Etc.

31 Conclusion
Other wavefront codes have the loops in a different order:
  The loop over energy groups can occur within the loops over cells, and might then be parallelised with OpenMP
  It must first be decoupled

32 Thank you Any questions?
