Performance optimization of Numerical Simulation of Waves Propagation in Time Domain and in Harmonic Domain using OpenMP Tasks

1 Performance optimization of Numerical Simulation of Waves Propagation in Time Domain and in Harmonic Domain using OpenMP Tasks
Giannis Ashiotis, Julien Diaz, François Broquedis, Jean-Francois Mehaut
INRIA GRENOBLE UJF & LJK & LIG

2 Outline
Hou10ni: Overview, Implementation details
OpenMP: PARALLEL DO, TASK, KAAPI
Changes to the program: Data locality, Array structure, Data reordering, Tasks acoustic, Tasks elasto-acoustic
Performance evaluation: Overview, Testing setup, Acoustic, Elasto-acoustic
Epilogue

3 Hou10ni Overview
Developed by Julien Diaz.
Simulates the propagation of seismic waves.
Implemented using the Interior Penalty Discontinuous Galerkin Method (IPDGM), which
  uses unstructured meshes of arbitrary-shaped elements
  facilitates the description of complex heterogeneous media

4 Hou10ni Implementation details
The application is written in Fortran 90 and spans around [...] lines (including some functionality that I will not be using).
The input data reside in 3 separate files and contain all the information needed to define the mesh of elements that makes up the system.
The mesh itself is generated by another tool provided together with the main program.
There are 2 versions of the code:
  Acoustique, partially parallelized with OpenMP
  Elasto-acoustique
The main workhorse is a double loop:
  the first level goes over a predefined number of time-steps
  the second level goes over the physical space of the problem

6 OpenMP
A popular, easy-to-use API for parallelizing applications in C/C++ and Fortran on shared-memory systems.
Components:
  compiler directives
  library routines
  environment variables
Task parallelism was added in OpenMP 3.0.

8 OpenMP PARALLEL DO
Probably one of the easiest ways to parallelize the execution of loops.
When a PARALLEL DO directive is encountered, a team of threads is spawned and a chunk of the iterations is assigned to each of them.
In Fortran, PARALLEL DO is used like this:
    !$OMP PARALLEL DO (clauses)
    (do-loop)
    !$OMP END PARALLEL DO
(the END statement is optional)
Commonly used clauses:
    PRIVATE(list of variables)
    SHARED(list of variables)
    SCHEDULE(kind, chunk)
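
To make the SCHEDULE(kind, chunk) clause more concrete, here is a small Python sketch of how SCHEDULE(static, chunk) hands out iterations: chunks of the given size are dealt to the threads round-robin, in thread order. Python is used only to illustrate the arithmetic; the function name `static_schedule` and the example sizes are assumptions made for this sketch, not part of the program.

```python
def static_schedule(n_iterations, n_threads, chunk):
    """Map thread id -> list of 1-based iterations under SCHEDULE(static, chunk).

    Hypothetical helper: it only mimics the round-robin assignment rule,
    it does not run anything in parallel.
    """
    assignment = {thread: [] for thread in range(n_threads)}
    starts = range(1, n_iterations + 1, chunk)
    for chunk_index, start in enumerate(starts):
        thread = chunk_index % n_threads           # chunks are dealt round-robin
        stop = min(start + chunk, n_iterations + 1)  # last chunk may be shorter
        assignment[thread].extend(range(start, stop))
    return assignment


if __name__ == "__main__":
    # 10 iterations, 2 threads, chunks of 3 iterations
    print(static_schedule(10, 2, 3))
```

With a small chunk size neighboring iterations end up on different threads, which matters later when data locality is discussed.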

10 OpenMP TASK
Whenever a TASK directive is encountered, a task is spawned and is either
  executed immediately, or
  passed to a conceptual task pool
A task can be executed by any thread of the team associated with the parallel region where the task was created.
In Fortran, tasks are used like this:
    !$OMP PARALLEL
    !$OMP TASK (clauses)
    (structured block)
    !$OMP END TASK
    !$OMP END PARALLEL
Clauses:
    PRIVATE(list of variables)
    SHARED(list of variables)
    FIRSTPRIVATE(list of variables)
    DEFAULT(private | shared | none | firstprivate)

11 OpenMP KAAPI
A C++ library for multithreaded computation with data-flow synchronization between threads.
Developed by the MOAIS team.
High-level interface, fully compatible with the OpenMP notation.
No changes to the code are needed.

14 Changes to the program Data locality
Data locality greatly affects the performance of the program on NUMA (Non-Uniform Memory Access) systems.
Remote accesses are costly!
[Figure: two NUMA nodes, each with one 4-core CPU, a memory controller and local memory]
Relying on the first-touch memory allocation policy, initialize the arrays in a PARALLEL DO loop, so that each page is allocated on the NUMA node of the thread that first writes to it.

15 Changes to the program Array structure
At each iteration of the inner loop 4 arrays come into play:
  U: 1D array holding the displacements
  U old: 1D array holding U from the previous time-step, saved with the use of an intermediate array
  A: 3D array
  AU: 1D array containing the product of a chunk of U (its size depends on the order of the discretization) with a slice of A
The copying that takes place at each time-step can be avoided by using a 2-column array for U and U old, together with pointers that specify which column is U and which is U old.
Better cache behavior is achieved if consecutive iterations work on data that is contiguous in memory.

16 Changes to the program Array structure
Fortran stores multidimensional arrays in column-major order, so the first index varies fastest in memory.
Us(i,j) holds the data of both U and U old, with j selecting the column.
[Figure: index layout of Us(i,j) and A(i,j,k)]
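
The 2-column idea can be sketched in a few lines of Python. The pointer formulas are the ones used in the task version of the main loop (ptr_u = 1+iand(N,1), ptr_uold = 1+iand(N+1,1)); everything else here (the helper name `column_pointers`, the dictionary standing in for Us, the "+ 1.0" placeholder update) is an assumption made only for illustration.

```python
def column_pointers(n):
    """Return (ptr_u, ptr_uold) for time step n, 1-based as in the Fortran code."""
    ptr_u = 1 + (n & 1)            # Fortran: 1 + iand(N, 1)
    ptr_uold = 1 + ((n + 1) & 1)   # Fortran: 1 + iand(N + 1, 1)
    return ptr_u, ptr_uold


# Hypothetical stand-in for Us(i, j): one Python list per column j = 1, 2.
size = 4
us = {1: [0.0] * size, 2: [0.0] * size}

history = []
for n in range(4):
    ptr_u, ptr_uold = column_pointers(n)
    # Placeholder update: write the new column from the old one, in place.
    # No copy of U into U old is needed, the two columns simply swap roles.
    for i in range(size):
        us[ptr_u][i] = us[ptr_uold][i] + 1.0
    history.append((ptr_u, ptr_uold))

if __name__ == "__main__":
    print(history)
```

The two pointers alternate between 1 and 2 at every time-step, so the column that was written last is read as U old in the next step.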

18 Changes to the program Data reordering
Furthermore, at each iteration of the inner loop, data from neighboring elements (which correspond to other parts of U) is required.
The elements of U are arranged in a random fashion, which leads to remote memory accesses!

19 Changes to the program Data reordering
Using a space-filling curve we can reorder the elements.
My reordering tool was written in C++, as the standard container classes make this much easier.
For the acoustic version this can be done prior to execution.
For the elasto-acoustic version it has to be done at runtime, as some reordering takes place after the data is read.
[Figure: Z curve and Hilbert curve]
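
As an illustration of reordering along a space-filling curve, here is a minimal Python sketch based on the Z curve (Morton order), chosen because it is the shorter of the two curves to write down; it is not the C++ tool described above. It assumes each element is represented by a 2D centroid, and the names `interleave_bits` and `z_order_permutation` are hypothetical.

```python
def interleave_bits(x, y, bits=16):
    """Morton key: bit b of x goes to position 2b, bit b of y to position 2b+1."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key


def z_order_permutation(centroids, bits=16):
    """Return element indices sorted along the Z curve.

    centroids: list of (x, y) floats, one per element (an assumption of
    this sketch; the real tool works on the mesh files).
    """
    xs = [c[0] for c in centroids]
    ys = [c[1] for c in centroids]
    x_min, y_min = min(xs), min(ys)
    extent = max(max(xs) - x_min, max(ys) - y_min) or 1.0
    scale = (2 ** bits - 1) / extent  # map coordinates onto an integer grid

    def key(index):
        cx, cy = centroids[index]
        return interleave_bits(int((cx - x_min) * scale),
                               int((cy - y_min) * scale), bits)

    return sorted(range(len(centroids)), key=key)


if __name__ == "__main__":
    print(z_order_permutation([(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))
```

The returned permutation would then be applied to the element arrays and to the neighbor table. A Hilbert curve follows the same pattern with a different key function and avoids the long jumps of the Z curve.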

20 Changes to the program Data reordering
[Figure: element ordering after the reordering]
After the reordering there is a much greater chance that the neighboring elements will be in cache when they are needed.

21 Changes to the program Data reordering
At each iteration of the inner loop, 3 checks take place to decide whether the element being processed is part of the border.
A border element is marked by a -1 in the array that holds the neighbor information.
The checks can be avoided if the -1 is brought to the first column with cyclic permutations and the border elements are processed separately from the inner elements.
This functionality was added to the tool responsible for the reordering.
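
The cyclic permutation can be sketched as follows in Python. This assumes a neighbor table with 3 entries per element and at most one -1 per row; if a row held several -1 entries, only the first would be moved. The function names are hypothetical, and in the real code the same rotation would presumably also have to be applied to whatever per-element data is indexed by edge.

```python
def rotate_border_first(neighbors):
    """Cyclically rotate each row so that a -1, if present, lands in column 1."""
    rotated = []
    for row in neighbors:
        if -1 in row:
            k = row.index(-1)
            row = row[k:] + row[:k]  # cyclic shift keeps the edge order
        rotated.append(list(row))
    return rotated


def split_border_inner(neighbors):
    """After rotation a single check on the first column separates the two sets."""
    border = [i for i, row in enumerate(neighbors) if row[0] == -1]
    inner = [i for i, row in enumerate(neighbors) if row[0] != -1]
    return border, inner


if __name__ == "__main__":
    table = [[4, -1, 7], [2, 3, 5], [-1, 0, 1], [6, 8, -1]]
    rotated = rotate_border_first(table)
    print(rotated)
    print(split_border_inner(rotated))
```

Once the two index sets are known, the inner elements can be processed in a loop that contains no border checks at all.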

22 Changes to the program Tasks acoustic
Main loop, parallelized with OpenMP tasks:
  The loop is broken into smaller loops, each assigned to a task. The number of tasks is passed as input at runtime.
  One thread creates the tasks and passes them to the task pool. At the TASKWAIT directive, execution waits until all tasks have completed.
  The parallel region is placed outside the time loop, which is expected to reduce the overhead of creating and destroying threads.
  st_I (the start index of each task) is declared FIRSTPRIVATE so that each task gets the correct value at the moment of its creation.
Original main loop:
    DO N=0,ntfinal   ! TIME STEP
      !$OMP PARALLEL DO
      DO I=1,Ntri
        ...
      END DO
      !$OMP END PARALLEL DO
      ...
    END DO

23 Changes to the program Tasks acoustic
Main loop with OpenMP tasks:
    iter_task = Ntri/num_of_tasks
    !$OMP PARALLEL
    !$OMP SINGLE
    DO N=0,ntfinal   ! TIME STEP
      ptr_u = 1+iand(N,1)
      ptr_uold = 1+iand(N+1,1)
      DO st_I=1,Ntri,iter_task
        !$OMP TASK PRIVATE(I) FIRSTPRIVATE(st_I)
        DO I=st_I,min(st_I+iter_task-1,Ntri)
          ...
        END DO
        !$OMP END TASK
      END DO
      !$OMP TASKWAIT
      ...
    END DO
    !$OMP END SINGLE
    !$OMP END PARALLEL
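
The way the element loop is cut into per-task ranges can be checked with a short Python sketch. It reproduces only the index arithmetic of the loop over st_I, not the tasking itself. The function name `task_ranges` is hypothetical, and the `max(1, ...)` guard against a zero stride is an addition of this sketch that the slide code does not contain.

```python
def task_ranges(ntri, num_of_tasks):
    """Return the 1-based inclusive (start, end) range handled by each task."""
    # Integer division as in Fortran; the guard is an assumption of this sketch.
    iter_task = max(1, ntri // num_of_tasks)
    return [(st_i, min(st_i + iter_task - 1, ntri))
            for st_i in range(1, ntri + 1, iter_task)]


if __name__ == "__main__":
    print(task_ranges(8, 4))
    print(task_ranges(10, 3))
```

When Ntri is not a multiple of the requested number of tasks, the remainder ends up in one extra, shorter task, so the actual number of tasks can be larger than requested.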

24 Changes to the program Tasks acoustic
Another approach to assigning work to tasks is recursion.
One initial task divides the work into two equal parts and assigns a task to each of them.
The generated tasks repeat the same step until the desired amount of work is left to each task.
This approach exhibits good cache behavior on any platform.
[Figure: recursive splitting of 20 elements with a cutoff of 4]

25 Changes to the program Tasks acoustic
    RECURSIVE SUBROUTINE Calc(low,high,cutoff,ptr_U,ptr_Uold)
      ...
      IF (high-low > cutoff) THEN
        mid = low + (high - low)/2
        !$OMP TASK
        call Calc(low,mid,cutoff,ptr_U,ptr_Uold)
        !$OMP END TASK
        call Calc(mid+1,high,cutoff,ptr_U,ptr_Uold)
        !$OMP TASKWAIT
      ELSE
        DO I = low,high
          ...
        END DO
      END IF
    END SUBROUTINE Calc
It should be noted that recursion comes with considerable overhead. Combined with the overhead of task creation, this could lead to a performance decrease.
KAAPI could be the solution to this problem, as it has a much lower cost of task spawning, together with some other features that can be of use.
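
The splitting rule of Calc can be traced without any threading. The Python sketch below follows the same test (high-low > cutoff) and the same midpoint, and simply collects the ranges that would reach the ELSE branch. The function name `leaf_ranges` is hypothetical.

```python
def leaf_ranges(low, high, cutoff):
    """Return the (low, high) ranges on which the serial loop would run."""
    if high - low > cutoff:
        mid = low + (high - low) // 2  # same midpoint as in Calc
        # In the Fortran code the first half becomes a new task,
        # the second half is handled by the current task.
        return leaf_ranges(low, mid, cutoff) + leaf_ranges(mid + 1, high, cutoff)
    return [(low, high)]


if __name__ == "__main__":
    # 20 elements with a cutoff of 4
    print(leaf_ranges(1, 20, 4))
```

Because the test is a strict inequality on high-low, a leaf can hold up to cutoff+1 elements.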

26 Changes to the program Tasks acoustic
The recursive routine is called from the time loop:
    ...
    !$OMP PARALLEL
    !$OMP SINGLE
    DO N=0,ntfinal
      ptr_u = 1+iand(N,1)
      ptr_uold = 1+iand(N+1,1)
      call Calc(1,Ntri,cutoff,ptr_U,ptr_Uold)
      !$OMP TASKWAIT
      ...
    END DO
    !$OMP END SINGLE
    !$OMP END PARALLEL
    ...

27 Changes to the program Tasks elasto-acoustic
The same changes were applied to the elasto-acoustic version of the program, with some small modifications.
At runtime the data is split into 4 parts: one for the fluid domain, one for the solid domain and two for the respective interfaces.
  The Hilbert-curve reordering has to take place during execution, separately for each part.
The data in the counterparts of the acoustic version's U old array (now three arrays) are needed again after the current arrays are modified.
  Instead of a 2-column array, a 3-column array is used to hold the modified version of the arrays, and a set of 3 pointers is used to refer to them:
    ptr_ = 3-MOD(N+2,3)      ! P,Ux,Uy
    ptr_old = 3-MOD(N+1,3)   ! Pold,Uxold,Uyold
    ptr_buff = 3-MOD(N,3)
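
The rotation of the three pointers can be verified with a few lines of Python. The formulas are copied from the slide; the name `ptr_cur` is a hypothetical stand-in for the pointer written as ptr_ above, and the function name `buffer_pointers` is also an assumption of this sketch.

```python
def buffer_pointers(n):
    """Return (ptr_cur, ptr_old, ptr_buff) for time step n, 1-based."""
    ptr_cur = 3 - (n + 2) % 3   # slide: ptr_ = 3-MOD(N+2,3)
    ptr_old = 3 - (n + 1) % 3   # slide: ptr_old = 3-MOD(N+1,3)
    ptr_buff = 3 - n % 3        # slide: ptr_buff = 3-MOD(N,3)
    return ptr_cur, ptr_old, ptr_buff


if __name__ == "__main__":
    for n in range(4):
        print(n, buffer_pointers(n))
```

At every step the three pointers name three different columns, and the column that was current at step N is the old one at step N+1, so no data has to be copied between time-steps.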

29 Performance Evaluation Overview
Both versions of the program were compiled with GCC and ICC.
The GCC version was also run with KAAPI, overriding the GNU OpenMP runtime.
For all runs a mask was used to bind threads to cores. For GCC it is set with the environment variable GOMP_CPU_AFFINITY, and for ICC with the environment variable KMP_AFFINITY.
Threads were bound using the compact thread binding policy.
A PARALLEL DO version of the program, with all the optimizations mentioned before, was tested as well.

30 Performance Evaluation Testing setup
The machine used for the testing is IDROUILLE:
  4 NUMA nodes
  8-core Nehalem architecture CPU
  16 GB of RAM
L1 and L2 caches are private to each core.
L3 cache is shared among the cores of each CPU.

31 Performance Evaluation Acoustic
Gain plots show the performance increase over the original code at each thread configuration.
The ICC-compiled program consistently shows a better gain than its GCC-compiled counterpart.
Using KAAPI gave more or less the same performance as the GCC OpenMP runtime environment.
[Figure: Gain acoustic GCC (gain in %). Series: acoustic tasks gcc, acoustic do gcc, acoustic recur gcc, acoustic do kaapi, acoustic recur kaapi]
[Figure: Gain acoustic ICC (gain in %). Series: acoustic tasks icc, acoustic do icc, acoustic recur icc]

32 Performance Evaluation Acoustic
Speedup plots show the performance increase when using multiple threads compared to the single-threaded execution.
[Figure: Speedup acoustic GCC. Series: acoustic gcc, acoustic tasks gcc, acoustic do gcc, acoustic recur gcc, acoustic do kaapi, acoustic recur kaapi]
[Figure: Speedup acoustic ICC. Series: acoustic icc, acoustic tasks icc, acoustic do icc, acoustic recur icc]

33 Performance Evaluation Acoustic
A summary of the previous plots, showing the versions with the best absolute performance and the best scaling.
It is apparent that the ICC-compiled codes perform and scale much better than their GCC counterparts.
[Figure: Times acoustic. Series: acoustic do gcc, acoustic recur kaapi, acoustic do icc, acoustic recur icc]
[Figure: Speedup acoustic. Series: acoustic do gcc, acoustic recur kaapi, acoustic do icc, acoustic recur icc]

34 Performance Evaluation Elasto-acoustic
Gain plots are unavailable for the elasto-acoustic version, as the original code was not parallelized. Absolute execution times are shown instead.
Again the ICC-compiled versions greatly outperform the GCC-compiled ones, while the performance is more or less the same within each set compiled with the same compiler.
Note that for the ICC-compiled TASKS and PARALLEL DO codes, the single-threaded execution runs slower than the original code.
[Figure: Times elasto-acoustic GCC. Series: elasto-acoustic gcc, elasto-acoustic tasks gcc, elasto-acoustic do gcc, elasto-acoustic recur gcc, elasto-acoustic do kaapi]
[Figure: Times elasto-acoustic ICC. Series: elasto-acoustic icc, elasto-acoustic tasks icc, elasto-acoustic do icc, elasto-acoustic recur icc]

35 Performance Evaluation Elasto-acoustic
Speedup plots show that the elasto-acoustic version of the program exhibits the same behavior as the acoustic one, with the former scaling much better.
[Figure: Speedup elasto-acoustic GCC. Series: elasto-acoustic tasks gcc, elasto-acoustic do gcc, elasto-acoustic recur gcc, elasto-acoustic do kaapi, elasto-acoustic recur kaapi]
[Figure: Speedup elasto-acoustic ICC. Series: elasto-acoustic tasks icc, elasto-acoustic do icc, elasto-acoustic recur icc]

36 Performance Evaluation Elasto-acoustic
The difference in performance can be seen more clearly in these summary plots.
[Figure: Times elasto-acoustic. Series: elasto-acoustic do gcc, elasto-acoustic recur kaapi, elasto-acoustic do icc, elasto-acoustic recur icc]
[Figure: Speedup elasto-acoustic. Series: elasto-acoustic do gcc, elasto-acoustic recur kaapi, elasto-acoustic do icc, elasto-acoustic recur icc]

40 Epilogue
The goal of optimizing the Hou10ni program was achieved:
  the acoustic version now performs nearly 3 times faster than the best performance given by the original code
  the performance gain in the elasto-acoustic version is less apparent, as it was not initially parallelized
The scaling of the elasto-acoustic version is not as good as that of the acoustic version: because of the greater complexity of the code, more of it remained serial due to data dependencies.
Some promise lies in the recursive version once KAAPI is completed.
These optimizations can be applied to any similar code.
And one last tip: use ICC when available, at least on Intel machines.

41 Thank you for your time
