Parallel Code Optimisation


1 April 8, 2008

2 Terms and terminology; Identifying bottlenecks; Optimising communications; Optimising IO; Optimising the core code

3 Theoretical performance The theoretical floating-point performance of a processor is the clock speed of the processor multiplied by the maximum number of floating-point operations per cycle. As an example, consider a single core of a modern 3 GHz Intel processor: the clock frequency is 3×10⁹ Hz, with a maximum of 4 floating-point operations per clock, giving a theoretical performance of 1.2×10¹⁰ FLOPS (floating point operations per second). This will never be reached in practice, and the efficiency of a code is measured as the fraction of this theoretical rate that is obtained. Even this fraction will probably vary wildly on different architectures.

4 Latency and Bandwidth The bottlenecks which cause this slowdown below the theoretical maximum can be broken down into two classes of problem in various parts of the system. Latency problems - the processor is idle because the data that it requires isn't available yet. Bandwidth problems - the processor is idle because it can operate on the data faster than the data can be provided to it. Almost all parts of a computer system except the actual processor cores (ALUs and FPUs) are concerned with moving data around, and so can be described in terms of latency and bandwidth. Latency can be hidden, masked and optimised around, either in hardware or to an extent in software. Bandwidth is to a great extent a feature of the hardware used and must be accepted as a fundamental limit of the system.

5 Latency and Bandwidth Latency is the time after a request has been made before the data begins to become available. Bandwidth is the rate at which the data becomes available after the latency time has expired.

6 Optimisation strategy Before you can optimise code, you need to know which parts are the cause of the problem. Heavily optimising parts of the code which take little time to execute is a poor use of time. Use profilers, parallel profilers or equivalents, and hardware performance counters to identify the locations of the bottlenecks.

7 Parallel profilers for MPI Most MPI implementations can be compiled with a standard, free MPI profiler called MPE / Jumpshot. There are also commercial alternatives. We will consider MPE as an example, since it is fairly typical of MPI profilers.

8 MPE MPE provides wrapper scripts for compilers just as MPI does, so codes to be tested just have to be compiled using mpecc and mpefc. You can either simply compile the code using these compilers, or you can custom instrument the code with additional commands to produce more information in the log files. Instrumenting the code with additional data allows you to identify locations within your code explicitly, making it easier to identify where problems occur. Although it depends on exactly how you use the tool, the output will normally be a log file which is then viewed with the supplied (also free) Jumpshot tool.

9 MPE The results are presented as a Gantt chart, with time on the x axis and processor number on the y axis. Colours represent the MPI commands being executed; white lines represent the path between the start of a send operation and the completion of the matching receive.

10 MPE Long blocks for MPI_Send and MPI_Recv may well mean that you have a problem in that part of the code. The problem can sometimes be relieved by the use of non-blocking sends and receives, but sometimes it genuinely is impossible to proceed any further with the compute work until the communication is finished. If that is the case, you may have a load balancing problem where some processors have to work harder than others.

11 Profiling without a profiler Much of what MPE does is simply to call MPI_Wtime before and after MPI commands and record the results in an output file. The same technique can be copied manually, by calling MPI_Wtime yourself and printing the results to output files on a per-processor basis. Since this is a LOT of work, it requires some level of intuition about where to instrument, and it also takes more work to read back the results. Not to be recommended, but it is useful if you want to profile on a machine which doesn't have MPE or a similar program.
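
As a rough sketch of this manual approach (the per-rank log file name and the exchange being timed are made up for illustration):

```c
#include <stdio.h>
#include <mpi.h>

/* Sketch of hand-rolled profiling: time one exchange with MPI_Wtime and
   append the result to a per-processor log file (file name is made up). */
int main(int argc, char **argv)
{
    int rank, nprocs;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int src = 1; src < nprocs; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - t0;

    /* One output file per rank makes it easy to compare processors */
    char fname[64];
    snprintf(fname, sizeof(fname), "timing.%03d.log", rank);
    FILE *fp = fopen(fname, "a");
    if (fp) {
        fprintf(fp, "gather on rank %d took %g s\n", rank, elapsed);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}
```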

12 Code profilers If your code isn't limited by communication then you will have to optimise the core algorithm. You want to identify which parts of the core algorithm are causing the slowdown, i.e. how long the code spends in each subroutine. Once again there is a free example, gprof, which works with the gcc compilers. Many compiler vendors also have a profiler which works with their compiler. There are other types of profiler, such as Valgrind, which offer many more options.

13 gprof To use gprof the code must be compiled using a gcc compiler with the -pg compile-time option. The code is then run as normal. Note that the code must exit normally or the profiling output will not be written. The output is in a file called gmon.out unless otherwise specified. The gprof report contains two subsections, the flat profile and the call graph.

14 gprof [Flat profile excerpt: columns are % time, cumulative seconds, self seconds, calls, self ms/call, total ms/call and name, with entries for MAIN, shared_data_boundary_conditions and shared_data_set_dt.] The flat profile shows you how long was spent in different subroutines, and how many times each subroutine was called. In this trivial Fortran 90 code, almost all of the time was spent in the main program (referred to as MAIN after compiling). Note that the two other subroutines are called many times and yet still take almost no time. Due to the way gfortran compiles Fortran 90 codes, the module name is prepended to subroutine names.

15 gprof [Call graph excerpt: shows MAIN called from main, and shared_data_boundary_conditions and shared_data_set_dt called from MAIN.] The call graph shows you the calling stacks for all the calls in the program. From this we can see that boundary_conditions was called from MAIN, as was set_dt. Normally you can see all the information that you need to identify the bottleneck from the flat profile: just put the most work into the subroutines that take the longest fraction of the runtime.

16 Optimising in general Remember, getting the right answer is the key point. Always remember that in parallel you may get better results by using a different algorithm which scales better than by trying to optimise your first choice. Man hours are expensive, compute hours are relatively cheap; make sure that the optimisation is worth the effort. If you're in a hole, stop digging. Some algorithms will never be fast and highly scalable. If you can't change the algorithm and can't get it to scale then just learn to live with it. The harder you optimise your code, the quicker the optimisations are outdated by changes in compilers and hardware. A code optimised for a Cray 1 is not even close to optimal for a modern computer.

17 Optimising communications Problems with communications tend to manifest themselves as poor scaling performance. The ultimate limit to scaling performance is described by Amdahl's law: S = 1 / ((1 - P) + P/N), where S is the maximum speedup possible on N processors if a fraction P of the work is parallel (and 1 - P is not). If even 10% of the work done by a code is not parallelisable then the maximum speedup, even on an infinite number of processors, is only 10 times!
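
As a quick illustration of the formula, the following standalone snippet (not from the lecture) evaluates the speedup for a 90% parallel code on increasing processor counts:

```c
#include <stdio.h>

/* Amdahl's law: maximum speedup S on N processors when a fraction P
   of the work is parallel. */
static double amdahl(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void)
{
    double P = 0.9;                 /* 90% parallel, 10% serial */
    int sizes[] = {2, 8, 64, 1024};

    for (int i = 0; i < 4; i++)
        printf("N = %4d  ->  speedup %.2f\n", sizes[i], amdahl(P, sizes[i]));

    /* As N grows without bound the speedup approaches 1/(1-P) = 10 */
    return 0;
}
```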

18 Optimising communications Non-parallel work includes any time when the code is doing exactly the same thing on multiple processors, even if that means all processors doing nothing (waiting). This means that any time spent waiting for communication is non-parallel work and limits the scaling according to Amdahl's law. Therefore, you want to spend as small a fraction of the runtime waiting for communication as possible. Note that for many types of code this isn't actually a real concern, because as the problem size increases communication automatically becomes an ever smaller fraction of the total runtime. In these cases scaling performance is recovered by looking at larger problems, and the limit becomes one of maximum speedup for a given problem size.

19 Optimising communications For large domain-decomposed codes, non-parallel work fractions can be much smaller than 1%, allowing scaling to thousands of processors. It is generally easier to optimise for SMP machines, so this section will describe how to optimise MPI codes on clusters. Doing the same thing on an SMP machine will improve performance there too, but usually by a smaller amount.

20 Optimising for communication latency On modern cluster hardware, latency is usually only the limiting factor when sending many small messages (100s of bytes or smaller). If that is the case, try to coalesce the many small messages into a single larger one and then send that. If that is not possible, then the only option is to try to perform computation while the communication is underway by using non-blocking MPI commands. Normally you start all communications at the start of the timestep, and then put in MPI_Wait commands at the point where a particular piece of information is needed.
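
A minimal sketch of this pattern for a halo exchange is shown below; the neighbour ranks, buffers and compute routines (do_interior_work, do_boundary_work) are placeholders, not part of any real code:

```c
#include <mpi.h>

/* Placeholders for the parts of the timestep that do not (and do) need
   the incoming halo data; both are assumed to exist elsewhere. */
void do_interior_work(void);
void do_boundary_work(double *recv_l, double *recv_r);

/* Sketch of latency hiding: start the halo exchange first, compute on the
   interior while it is in flight, and only wait when the data is needed. */
void timestep(double *send_l, double *send_r,
              double *recv_l, double *recv_r, int n,
              int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Start all communications at the beginning of the timestep */
    MPI_Irecv(recv_l, n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_r, n, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_l, n, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(send_r, n, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* Work that does not depend on the halos overlaps the communication */
    do_interior_work();

    /* Wait only at the point where the received data is actually required */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    do_boundary_work(recv_l, recv_r);
}
```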

21 Optimising for communication latency BE CAREFUL! There is both a latency associated with non-blocking sends and receives and a compute overhead associated with the monitoring of the in-flight communications. This can mean that attempting to mask latency using this approach can actually make a code slower. MPI_Isend and MPI_Irecv are particularly bad in this sense; it is much better to set up persistent communication handles using MPI_Send_init and MPI_Recv_init if possible. If you have really large numbers of in-flight messages, then even this may be a poor choice due to the overhead of managing many open communications.

22 MPI_Send_init int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) CALL MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request, ierr) Description: creates a persistent communication handle, request, for a send operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Isend. If you wish to send part of an array, this is a good reason to use MPI custom types.

23 MPI_Recv_init int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) CALL MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request, ierr) Description: creates a persistent communication handle, request, for a receive operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Irecv. If you wish to receive into part of an array, this is a good reason to use MPI custom types.

24 MPI_Start int MPI_Start(MPI_Request *request) CALL MPI_START(request, ierr) Description: starts an instance of the persistent communication referenced by request. In most MPI implementations this is a low-latency command. The communication is started in a non-blocking manner; the code must explicitly wait for the command to complete using MPI_Wait.
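
Putting the three routines together, a persistent halo exchange might look roughly like the following sketch (halo size, neighbour ranks and buffers are illustrative):

```c
#include <mpi.h>

#define NHALO 128   /* halo size: illustrative only */

/* Sketch: set up persistent requests once, then restart them every step. */
void persistent_exchange(int left, int right, int nsteps, MPI_Comm comm)
{
    double send_l[NHALO], recv_l[NHALO];
    MPI_Request req[2];

    /* Create the handles once, at start-up */
    MPI_Send_init(send_l, NHALO, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Recv_init(recv_l, NHALO, MPI_DOUBLE, right, 0, comm, &req[1]);

    for (int step = 0; step < nsteps; step++) {
        /* ... fill send_l with this step's halo data here ... */
        MPI_Startall(2, req);                       /* start both transfers   */
        /* ... overlap computation here if possible ... */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* complete the transfers */
    }

    /* Free the persistent requests when they are no longer needed */
    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
}
```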

25 Optimising for communication bandwidth It is unusual for communication bandwidth to be the limiting factor in point-to-point communications in parallel codes, because usually a code which sends large quantities of data then performs many operations on the data that it has received. It is more common for collective communications to become bandwidth limited (covered in the next section). If a code is bandwidth limited, there are really only two things that can be done: change the algorithm to require less data to be communicated, or use the non-blocking commands to send and receive data while still computing. Since there are normally only a small number of send and receive pairs, you can use MPI_Isend and MPI_Irecv without too much of a penalty.

26 Optimising collective communication Optimising collective communication is much harder, since the user has less control over the operation of the command. Try to perform as much reduction as possible on local nodes before calling MPI_Reduce or MPI_Allreduce. If you have really specific requirements you can try to write your own algorithm for the collective operation using point-to-point commands, but this is only worth doing in very specific circumstances. Try to avoid collective communications if at all possible.
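
A minimal sketch of reducing locally before the collective call (the distributed-sum example is invented for illustration):

```c
#include <mpi.h>

/* Sketch: sum a distributed array. Each rank reduces its own data to a
   single partial sum first, so only one double goes into MPI_Reduce. */
double global_sum(const double *local, int n, MPI_Comm comm)
{
    double partial = 0.0, total = 0.0;

    for (int i = 0; i < n; i++)     /* local reduction: no communication */
        partial += local[i];

    /* One scalar per rank is all the collective has to move */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return total;   /* only meaningful on rank 0 */
}
```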

27 Load balancing If you have a code where the workload on different processors isn't guaranteed to be equal, you could have problems with load balancing. Load imbalance causes problems because the scaling of the whole system is no better than the scaling of its worst element. So if one processor has a workload which doesn't scale at all with the number of processors, then the parallel code will be no faster than the serial code. More usually, load imbalance appears as the code scaling suboptimally because some processors are less used than others. Addressing this problem is specific to the exact algorithm being implemented, but in many off-grid problems it can be the dominant limiting factor to performance, and you have to design some kind of dynamic load balancing algorithm.

28 Optimising IO Any code is only as useful as the data that it outputs; however, you want to minimise the time taken to write data to disk. Both latency and bandwidth to disk storage are very poor compared with either compute or communication. The best thing you can do with IO is to try to minimise it.

29 Optimising IO Using MPI-IO is about the best thing that you can do to improve performance. MPI-IO can be further improved by passing hints to the MPI layer during calls to open and write statements using the MPI_Info object. The exact form of the hints is not portable, but they are fairly easy to work with using MPI_Info_create and MPI_Info_set, and should be documented on a machine-by-machine basis. Turning off MPI-IO atomic mode using MPI_File_set_atomicity increases speed massively if you do not write to the same area of the file from several nodes.
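
A rough sketch of such a call sequence is shown below; the hint keys used (striping_factor, collective_buffering) are reserved hints from the MPI standard chosen purely as examples and, as noted above, their effect is not portable:

```c
#include <mpi.h>

/* Sketch: open a file with MPI-IO, passing hints and disabling atomic mode.
   The hint keys and values are illustrative; check the documentation for
   the machine you are running on. */
void open_output(const char *fname, MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");        /* example hint only */
    MPI_Info_set(info, "collective_buffering", "true"); /* example hint only */

    MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

    /* Safe to turn off atomicity if ranks never write to overlapping regions */
    MPI_File_set_atomicity(*fh, 0);

    MPI_Info_free(&info);
}
```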

30 Optimising IO Another possibility is to look at doing data reduction during the compute time. Depending on what you want to do with the data, it may be possible to reduce the amount of data to be dumped from a large 3D grid of several variables down to a single line, or possibly even a single number Moving to more modern machines with parallel storage systems will massively improve performance

31 Optimising core code performance In many classes of parallel code, the limiting factor in execution speed is the speed of computation on each node. In this case, the code must be optimised in the same manner as a serial code: identify the bottleneck subroutines as already suggested and then optimise them. There are a few standard tricks which should always be considered when optimising for speed, although note that all of them can sometimes impair performance and should be tested. These will be introduced as a checklist of things to do and check, and then explained.

32 Optimising core code performance READ YOUR COMPILER MANUAL. Many of the classical tricks to optimise code are now done automatically by compilers. Check different compiler optimisations; some will help, some will hinder. Don't expect that always using the most aggressive compiler options will lead to the fastest code. Always try compiler options before hand-optimising code, it's much quicker. Often highly hand-optimised code is much harder to read, so always ask whether readability is more important than speed. Note that in Fortran, because of the language structures, compilers can safely be more aggressive, and so improvements from hand tuning are usually smaller for Fortran codes than for C codes.

33 Loop optimisation The most common error when using loops is to loop over the elements of a multidimensional array in the wrong order. In Fortran, loops should be ordered so that the left-most index of an array changes fastest. In C, loops should normally be ordered so that the right-most index of an array changes fastest. For scientific codes this tends to be a very robust optimisation and improves speed in most cases. Compilers should perform this operation (loop interchange) automatically; however, in complex codes they are often unable to confirm that the interchange is safe and so don't perform the optimisation. Depending on hardware architecture and code structure, this optimisation can reduce runtimes by up to 50%!
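
In C this means the inner loop should run over the right-most index, as in this small sketch (array sizes are arbitrary):

```c
#define NX 1024
#define NY 1024

/* In C, arr[i][j] is stored row by row, so j (the right-most index)
   should be the inner, fastest-changing loop variable. */
void scale_good(double arr[NX][NY], double factor)
{
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)      /* contiguous, cache-friendly */
            arr[i][j] *= factor;
}

/* The same work with the loops interchanged walks memory with a large
   stride and is typically much slower. */
void scale_bad(double arr[NX][NY], double factor)
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)      /* stride of NY doubles each step */
            arr[i][j] *= factor;
}
```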

34 Other loop optimisations Most other loop optimisations are now efficiently dealt with by compilers, but some that you may want to look at are: Loop fusion - there is an overhead associated with starting loops, so two operations in one loop are usually faster than one operation in each of two loops. Loop peeling - if the first element of a loop must be dealt with in a special way, break that code out of the loop rather than using an IF statement in the loop. Both are sketched below. The key point in all loop optimisation is locality of reference, which will be explained in the next section.
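
Hypothetical sketches of the two transformations (the array operations are invented for illustration):

```c
#define N 100000

/* Loop fusion: one pass over i instead of two separate loops, so the
   loop start-up overhead is paid once. */
void fused(double *a, const double *b, double *c, const double *d)
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * 2.0;     /* was loop 1 */
        c[i] = d[i] + a[i];    /* was loop 2 */
    }
}

/* Loop peeling: handle the special first element outside the loop,
   so the loop body itself needs no IF statement. */
void peeled(double *a, const double *b)
{
    a[0] = b[0];                            /* special case, peeled off */
    for (int i = 1; i < N; i++)
        a[i] = 0.5 * (b[i] + b[i - 1]);
}
```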

35 Branch optimisations IF statements within loops are always bad. If you can remove the IF statements completely, either by handling special cases differently by hand or by moving the IF statements outside the loop, do so. Otherwise, you may find that you are better off doing additional compute work rather than using an IF: try computing both branches and then combining the two results with a 0/1 Boolean flag rather than branching, as in the sketch below.
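
An illustrative sketch (not taken from any real code):

```c
#define N 100000

/* Branch inside the loop: the processor must predict it every iteration. */
void with_branch(double *a, const double *b, const double *c)
{
    for (int i = 0; i < N; i++) {
        if (b[i] > 0.0)
            a[i] = b[i] * c[i];
        else
            a[i] = c[i];
    }
}

/* Branch-free version: compute both candidates and select with a 0/1 flag.
   More arithmetic per iteration, but no hard-to-predict branch. */
void branch_free(double *a, const double *b, const double *c)
{
    for (int i = 0; i < N; i++) {
        double flag = (b[i] > 0.0);          /* 1.0 or 0.0 */
        a[i] = flag * (b[i] * c[i]) + (1.0 - flag) * c[i];
    }
}
```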

36 Optimising in theory That fairly short list contains most of the optimisations which are worth trying to perform by hand with modern compilers. There are other things which can be done, but they normally lead to small performance improvements. However, for less common types of code, or for really heavy optimisation, you have to understand why the optimisation works. Once again this comes back to latency and bandwidth concerns, but now at the level of a single computer system. Normally, codes which can be optimised further are limited by memory latency; bandwidth limitation is also possible, but much harder to work around.

37 Cache Modern computers use cache memory to improve memory latency. Cache provides faster access to recently used memory by copying it into a smaller, faster, more expensive area of RAM which is usually situated on the CPU die. In this form, cache only speeds access to data that has been used recently. Scientific codes generally have working sets which are many megabytes in size, so simple caching provides only limited improvements. Further performance improvement is given by the use of cache prefetchers.

38 Cache prefetchers Historically, cache prefetching was a programming technique where data wanted on the next iteration was requested on the current iteration, bringing the data into cache in advance. In modern systems this duty is normally performed either by the compiler or by enabling hardware prefetch units on the CPU. In either case, the prefetcher normally just assumes that the data wanted next is local in memory to the data just accessed. This means that improving locality of reference in the code structure improves performance.

39 Locality of reference The easiest way to prefetch is to assume that the program is simply going to ask for the next piece of data in memory directly after the piece it just asked for. This means that the best way to optimise the performance of a code is to ensure that it accesses memory in as close to a linear fashion as possible. This explains why the order in which multidimensional arrays are accessed in loops is important. When the prefetcher is working well, the effect of the cache is to significantly reduce the latency of the main memory. Unless the entire dataset fits in the cache, or the code performs large amounts of calculation on a small block of data before moving on to the next block, the higher bandwidth of the cache memory has little effect on the effective bandwidth of the entire memory subsystem.

40 Bandwidth limitation The classical example of bandwidth limitation is video processing: relatively little work is done to each element of data, which is then never used again. The net effect of this is to almost remove the benefit of the cache memory, since the limiting factor is the rate at which the data can be transferred from the main memory. Although the number is highly workload dependent, a good rule of thumb is that you need 1 bit/s of memory bandwidth for every FLOP, although scientific codes often need more. Prefetchers can make the situation worse if they are not performing perfectly, because incorrectly prefetched data is simply wasted memory bandwidth.

41 Bandwidth limitation If your code is bandwidth limited then you have few options: buy a better computer with more memory bandwidth, or make your code do more work with the existing data before moving on to the next part of the data.

42 SIMD optimisation Most modern processors can operate in a Single Instruction Multiple Data (SIMD) mode, in which one instruction operates on several data elements at once; this is often called vector operation. SIMD allows processors to achieve a higher average IPC. Also, the older scalar FPUs are often legacy features of the architecture and are much slower than their SIMD counterparts (normal x86 processors are a very good example of this). Compilers will use the SIMD paths automatically unless they are prevented from doing so, which is usually because the compiler is unable to determine whether there is a dependency in the program flow.

43 SIMD optimisation The most common cause of a compiler being unable to vectorise is that data in a given iteration really does depend on data from a previous iteration. If that is the case, remember that compilers only try to vectorise the inner loop of a set of nested loops, so reordering your loops may allow vectorisation (see the sketch below). The second most common cause is a branch in the code which the compiler cannot prove is free of dependencies. Some compilers allow you to put in hints which force them to ignore the safety checks; a better approach is to try to remove the IF statements, as already mentioned.
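
An illustrative sketch of such a reordering (the recurrence is invented for the example):

```c
#define N 1024

/* Written with i as the inner loop, the inner loop carries a dependency
   (each a[i][j] needs a[i-1][j]) and cannot be vectorised. */
void hard_to_vectorise(double a[N][N], const double b[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)
            a[i][j] = a[i - 1][j] + b[i][j];
}

/* Interchanging the loops makes j the inner loop: its iterations are
   independent (and contiguous in memory), so the compiler can use SIMD. */
void easier_to_vectorise(double a[N][N], const double b[N][N])
{
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i - 1][j] + b[i][j];
}
```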
