
A Load Balancing Routine for the NAG Parallel Library

Rupert W. Ford (1) and Michael O'Brien (2)

(1) Centre for Novel Computing, Department of Computer Science, The University of Manchester, Manchester M13 9PL, U.K. rupert@cs.man.ac.uk
(2) Military Aircraft and Aerostructures, British Aerospace, Warton Aerodrome, Lancashire PR4 1AX, U.K. Michael.OBrien@bae.co.uk

Abstract. This paper describes a load balancing routine which has been developed for the NAG Parallel Library. The routine is designed for load balance problems where each task can be computed independently, and allows the user to choose from a number of different load balancing strategies. The benefits of the routine are discussed in terms of both performance and ease of use, and results are presented for a production RCS prediction code on a Cray T3D and an SGI Origin 2000.

1 Introduction

The load balance routine described in this paper has been developed for inclusion in the NAG Parallel Library [7]. This library is a collection of portable, memory-scalable, parallel Fortran 77 routines for the solution of numerical and statistical problems. This work forms part of the ESPRIT Framework IV project P20018 PINEAPL (Parallel Industrial Numerical Applications and Portable Libraries). The aim of the project is to develop an application-driven, general-purpose library of parallel numerical software that significantly extends the scope of the NAG Parallel Library. The project (coordinated by NAG) is driven by applications from four industrial end users, representing the needs of the numerical scientific and engineering market. In this project parallel library experts are paired with end users; in the University of Manchester's case the end user is British Aerospace (BAe). One of BAe's applications ("System AB3") involves the prediction of the radar cross section (RCS) of an aircraft's air intake duct. The particular technique requires ray tracing to calculate the RCS of an arbitrarily shaped duct. A ray tracer developed at the University of Manchester has been integrated into BAe's "System AB3".

Parallelism can be naturally exploited at the level of rays, as each ray can be calculated independently. Note that the geometry is simple enough to allow its replication on each processor. However, although rays can be calculated independently, their computational cost will vary significantly, depending on the path a ray takes. This means that a static, equal allocation of rays to processors will not necessarily give a load balanced solution.

The load balance routine (called Y01CAFP) was developed to solve the above and similar load balance problems. Y01CAFP is therefore designed to minimise the elapsed time for n independent tasks, where n is fixed and known, running on p processors. The solution to such a problem is often termed task farming, as tasks may be sent (farmed out) to other processors. The routine is primarily designed for problems where n >> p and the time for each task is variable and unknown; however, it can also be of benefit for problems where the time for each task is known but the distribution of tasks is not regular. It is also useful for distributing tasks when all data is held on the root (master) processor.

The next section summarises the design philosophy of the NAG Parallel Library and describes the main features of its implementation. This allows a detailed description of Y01CAFP in the subsequent section. Section 4 discusses the BAe RCS application and the test case used for evaluation. Section 5 presents the results of running the test case on a Cray T3D and an SGI Origin 2000 and, finally, Section 6 gives our conclusions.

2 NAG Parallel Library

The routine described in this paper has been developed for inclusion in the NAG Parallel Library [7]. This library is a collection of parallel Fortran 77 routines for the solution of numerical and statistical problems. The library is divided into chapters, each devoted to a branch of numerical analysis or statistics. The library is primarily intended for distributed memory parallel machines, including networks and clusters, although it can readily be used on shared memory parallel systems that implement PVM [6] or MPI [9]. The library supports parallelism and memory scalability, and has been designed to be portable across a wide range of parallel machines. The library assumes a Single Program Multiple Data (SPMD) model of parallelism, in which a single instance of the user's program executes on each of the logical processors.

The NAG Parallel Library uses the Basic Linear Algebra Communication Subprograms (BLACS) [5] for the majority of the communication within the library. Implementations of the BLACS, available in both PVM and MPI, provide a higher-level communication interface. However, there are a number of facilities that are not available in the BLACS, such as sending multiple data types in one message (multiple messages must be sent) and non-blocking sends and receives. There is, therefore, a clear trade-off between code portability (plus ease of maintenance) and performance. As performance is crucial in load balancing, much of the communication is written in PVM and MPI.
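The SPMD model described above can be made concrete with a minimal sketch. The fragment below is illustrative only: it uses raw MPI from Fortran 77 rather than the library's own Z01AAFP/Z01ABFP interface, and the program name is hypothetical. Every processor executes the same program; only the processor identifier returned by the message-passing layer distinguishes the instances.

      PROGRAM SPMD
C     Minimal SPMD sketch: one instance of this same program runs on
C     every logical processor; the rank alone distinguishes them.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROC
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      WRITE (*,*) 'PROCESSOR ', RANK, ' OF ', NPROC
      CALL MPI_FINALIZE(IERR)
      END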

The library is designed to minimise the user's concern with the use of the BLACS, PVM or MPI, and presents a higher-level interface using library calls. Task spawning and the definition of a logical processor grid and its context are handled by the parallel library routine Z01AAFP. On completion, the library routine Z01ABFP is called to undefine the grid and context. The routines Z01AAFP and Z01ABFP can be considered as left and right braces, respectively, around the parallel code.

3 Load Balance Routine (Y01CAFP)

3.1 Code integration

Y01CAFP assumes the problem requiring load balancing is written in the form

      DO I = 1, NLOCAL
         CALL TASK(I)
      END DO

where all data is passed into TASK through COMMON, the index I distinguishes the actual task to be performed, and all tasks are independent. A call to the routine then replaces the above code fragment. The user must therefore modify the program so that it conforms to this specification. The user must also supply a routine (whose specification is defined in the documentation) which will pack or unpack the data required to compute a task for a range of contiguous indices. Y01CAFP will call this routine to pack and unpack data into the appropriate indices. Y01CAFP supplies pack (NAGPACK) and unpack (NAGUNPACK) routines to facilitate this task. These routines are wrappers around the PVM and MPI versions and have a similar syntax.

As well as specifying how many tasks are on a particular processor (NLOCAL), the user must also specify the maximum number of tasks that could be computed on that processor (NMAX). The load balancer will use the space between NLOCAL and NMAX to perform any required remote computation. NMAX must be large enough to allow any required remote computation to take place. The actual amount is defined in the user documentation and the program will give an error if NMAX is not large enough. In the PVM implementation Y01CAFP makes use of the system buffers to buffer data. As MPI does not support system buffers, the user must supply a buffer large enough to send and receive the largest message.
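As an illustration of this calling convention, the sketch below shows a user code already in the required form: the task data lives in COMMON and the loop index alone selects the task. The TASK body and the COMMON layout are hypothetical examples; the loop is the fragment that the call to Y01CAFP would replace.

      PROGRAM USRTSK
C     Task data is passed through COMMON, as Y01CAFP requires.
      INTEGER NLOCAL, I
      DOUBLE PRECISION X(1000), Y(1000)
      COMMON /TDATA/ X, Y
      NLOCAL = 1000
      DO I = 1, NLOCAL
         X(I) = DBLE(I)
      END DO
C     The loop below is the fragment that a call to Y01CAFP replaces.
      DO I = 1, NLOCAL
         CALL TASK(I)
      END DO
      END

      SUBROUTINE TASK(I)
C     Each task touches only index I of the COMMON data, so tasks are
C     independent and may be computed on any processor.
      INTEGER I
      DOUBLE PRECISION X(1000), Y(1000)
      COMMON /TDATA/ X, Y
      Y(I) = X(I)*X(I)
      END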

3.2 Load Balancing options

Y01CAFP allows the user to select one of four different load balancing strategies: 'ASIS', 'BLOCK', 'CYCLIC' and 'GRAB'. In addition, 'CYCLIC' and 'GRAB' have a block size (BSIZE) which is set by the user. If all of the data is initially on the root processor (NLOCAL = 0 on all other processors) then the master/slave (MASLV) option can be set. This option dedicates the root processor to communication. In this case Y01CAFP is effectively parallelising the application.

Note that changing the load balancing options in Y01CAFP will not affect the results, only the load balance and, therefore, the solution time. Y01CAFP accepts any initial data distribution and the final distribution will be the same as the initial distribution. Y01CAFP provides an indication of how it has performed through the TINFO and NINFO arrays. These arrays give timing and counting information respectively.

ASIS: 'ASIS' performs no load balancing. It is useful for testing the correct working of the code when the load balancing routine is first used. It can also be used to determine the load imbalance inherent in the problem, using the NINFO and TINFO output arrays, and gives a non-load-balanced timing result allowing comparison with any load balanced results.

BLOCK: 'BLOCK' should be used when the computational costs of the tasks are the same but their distribution across processors is irregular. Note that if the distribution were regular in this case, the problem would already be load balanced. The implementation of 'BLOCK' takes a given task distribution and redistributes it so that each processor has no more than ⌈n/p⌉ tasks. It attempts to minimise the number of messages sent by a combination of sending tasks from the most loaded processor to the least loaded processor and looking for pairs of equally overloaded and underloaded processors [10]. A sketch of this redistribution is given at the end of this subsection.

CYCLIC: 'CYCLIC' should be used when the computational costs of successive tasks (in iteration space) are similar, but the load varies over many iterations. The distributed task indices are treated as a single global index ordered by processor identifier. The implementation of 'CYCLIC' firstly computes the global iteration space. Secondly, processors send all local tasks which require redistribution. Note that if a processor needs to send more than one block to the same remote processor, it does so in separate messages. Thirdly, processors compute any local tasks and, finally, processors compute any remote tasks and return the results.

GRAB: 'GRAB' should be used when the computational costs of the tasks are unknown. In this case a regular distribution of tasks (as given by the two previous strategies) may result in load imbalance. With this option each processor performs its own computation, then steals 'BSIZE' tasks at a time from any processors which are still computing, until all work has completed. The implementation of 'GRAB' checks for any task requests after computing each local block of 'BSIZE' tasks. If it receives a request and has more than 'BSIZE' tasks remaining, it sends these to the requesting processor; otherwise it sends a negative acknowledgement (NACK). When a processor has finished its own tasks it asks each processor in turn for work. It completes when it has received NACKs from all other processors and has sent NACKs to all other processors.

MASLV: The 'MASLV' option is only relevant when all tasks are on the root processor, i.e. NLOCAL = 0 on all processors except the root. If, in this case, MASLV is .TRUE., the root processor does not take part in any computation; it is used purely for communication. This option is useful when the cost of communication is high enough to significantly slow the root processor. For example, if the communication costs of sending and receiving data were equal to the computation costs, the root processor would take approximately twice as long as the other processors. This effect increases with the number of processors, the amount of data transferred and the speed of the processor; it decreases with the speed of the network. Output from NINFO and TINFO helps the user to gauge these effects.
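The following sketch illustrates the 'BLOCK' target distribution: no processor should hold more than ⌈n/p⌉ tasks. It repeatedly moves surplus tasks from the most loaded to the least loaded processor; the full message-minimising pairing heuristic of [10] is simplified here, and the initial task counts are invented for illustration.

      PROGRAM BLKPLN
C     Sketch of the 'BLOCK' redistribution: cap every processor at
C     CEIL(N/P) tasks.  NT holds illustrative initial counts.
      INTEGER P
      PARAMETER (P = 4)
      INTEGER NT(P), N, CAP, I, IMAX, IMIN, MOVE
      DATA NT /10, 0, 5, 1/
      N = 0
      DO I = 1, P
         N = N + NT(I)
      END DO
      CAP = (N + P - 1)/P
C     Repeatedly move surplus from the most loaded processor to the
C     least loaded one until every processor is at or below the cap.
   10 CONTINUE
      IMAX = 1
      IMIN = 1
      DO I = 2, P
         IF (NT(I) .GT. NT(IMAX)) IMAX = I
         IF (NT(I) .LT. NT(IMIN)) IMIN = I
      END DO
      IF (NT(IMAX) .GT. CAP) THEN
         MOVE = MIN(NT(IMAX) - CAP, CAP - NT(IMIN))
         NT(IMAX) = NT(IMAX) - MOVE
         NT(IMIN) = NT(IMIN) + MOVE
         WRITE (*,*) 'SEND', MOVE, ' TASKS FROM', IMAX, ' TO', IMIN
         GO TO 10
      END IF
      END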

4 RCS Example

4.1 Description

The purpose of this application is to predict the radar cross section (RCS) of an aircraft's air intake duct. Ducts are particularly important as they act as a waveguide, propagating energy (electromagnetic (EM) waves) back in the receiver direction. Therefore a large portion of an aircraft's RCS is due to duct reflection. Ray tracing techniques [2] are useful for RCS analysis as they allow the realistic modelling of physical systems with arbitrarily shaped ducts and different absorption characteristics [1].

Manchester University has developed a ray tracer for inclusion into BAe's "System AB3". This code uses ray tracing to calculate the RCS of an arbitrarily shaped duct. The duct geometry is designed using the CAD package CATIA, whose surfaces are output as parametric bi-cubic patches [2] in PATRAN [3]. A user-generated "ASPECT" file controls the position, direction, frequency, angle and polarisation of the initial EM rays. Rays are exclusively directed inside the duct as this is the area of interest. These rays are then ray traced by AB3. At each ray/surface intersection the EM characteristics of the ray are modified based on the intersected surface characteristics. The rays are terminated either when they emerge from the duct or when their energy falls below an appropriate threshold. AB3 integrates the emerging rays to obtain the RCS.

4.2 Integration into the NAG Parallel Library

In AB3, a set of rays, whose starting points are arranged in a two-dimensional grid, are traced into the duct. This was implemented as a double loop over the rays' initial coordinates. To make the AB3 code conform to the load balance specification this double loop had to be changed to a single index (a sketch of the collapsed loop is given below). The data was already passed into the routine using COMMON. The packing and unpacking routine was simple to implement as all data was dependent on the index.

The initial implementation added the NAG begin-parallel (Z01AAFP) and end-parallel (Z01ABFP) calls around the code. All non-root processors then skipped the initialisation and waited in Y01CAFP while the root processor set up all the data. In this case the load balancer distributes the work from the root to the remaining processors and acts as if it is parallelising the code. Whilst this version was a useful starting point, the memory requirements of the root processor meant that it would not scale to large problem sizes.
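The loop collapse mentioned above can be sketched as follows. The grid dimensions and the traversal order are illustrative (they are not taken from AB3): the double loop over grid coordinates becomes a single loop over one task index, from which the coordinates are recovered.

      PROGRAM COLAPS
C     Collapsing a double loop over a 2-D ray grid into the single
C     task index required by the load balance specification.
      INTEGER NX, NY, I, IX, IY
      PARAMETER (NX = 4, NY = 3)
C     Original form (sketch):
C         DO IY = 1, NY
C            DO IX = 1, NX
C               ... trace ray (IX, IY) ...
C     Collapsed form: one index I distinguishes each ray and the grid
C     coordinates are recovered from it inside the task.
      DO I = 1, NX*NY
         IX = MOD(I - 1, NX) + 1
         IY = (I - 1)/NX + 1
         WRITE (*,*) 'TASK', I, ' -> RAY (', IX, ',', IY, ')'
      END DO
      END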

[Fig. 1. External view of a de-classified BAe duct.]

To remove this limitation the data was pre-distributed amongst the processors. This was done in two ways. The first (termed block) assigned the first ⌈n/p⌉ rays to the first processor, the next ⌈n/p⌉ rays to the second processor, and so on. The second (termed cyclic) assigned the first ray to the first processor, the second ray to the second processor, and so on, wrapping round to the first processor after the last processor. A sketch of the two assignments follows.
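The two pre-distributions can be sketched as ownership functions. The values of n and p below are invented for illustration; processor identifiers run from 0 to p-1.

      PROGRAM OWNERS
C     Block assigns the first CEIL(N/P) rays to processor 0, the next
C     CEIL(N/P) to processor 1, and so on; cyclic assigns successive
C     rays to successive processors, wrapping round after the last.
      INTEGER N, P, I, IBLK, ICYC
      PARAMETER (N = 10, P = 3)
      DO I = 1, N
         IBLK = (I - 1)/((N + P - 1)/P)
         ICYC = MOD(I - 1, P)
         WRITE (*,*) 'RAY', I, ' BLOCK OWNER', IBLK,
     &               ' CYCLIC OWNER', ICYC
      END DO
      END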

The final RCS prediction code section (integrating the emerging rays to obtain the RCS) has not been parallelised. The results from each processor are sent to the root processor, which performs this computation. This section is included in the timing results presented and, for large numbers of processors, becomes an important factor.

4.3 Test case

The test case used in this paper is a duct which has the complexity of real ducts currently in use and/or being developed by BAe, but which has been modified so that it can be de-classified. The external visual ray tracing of this duct has been performed using the ray tracer developed at the University of Manchester (which is a modification of krt [4] to allow bi-cubic patch intersection). This is also the ray tracer which has been modified to form part of "System AB3". The patches have been artificially shaded to highlight them; see Figure 1.

5 Results

In all versions described in this section the grab option uses a block size of l/(5p²), where p is the number of processors and l is the total number of rays. Smaller block sizes were investigated but made no difference to the performance. In this section the problem sizes are given in terms of the ray density, which is the number of rays per wavelength. In the example code the wavelength is 3 cm and the frequency is approximately 10 GHz. The total number of rays is also given for reference. At the time the following results were taken, the TINFO and NINFO performance analysis arrays and the CYCLIC option described in Section 3.2 were not implemented.

[Fig. 2. SGI O2000, initial distribution all on root, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb block, lb grab, lb grab master/slave.]

In Figure 2 the initial data is stored on the root processor. The reciprocal of wall-clock time is given on the y-axis, giving the equivalent of a speedup graph without normalising the time taken. The 'naive ideal' line is simply the sequential time divided by the number of processors. The load balance block option suffers from load imbalance, which is improved by the load balance grab option. In these cases the root processor is both computing its own work and sending and receiving work to and from the other processors. To determine this performance penalty the grab option is repeated with the master/slave option set to true. Note that, to show this overhead, the root processor is not included in nprocs, which in this case is the number of computing processors. This shows that much of the remaining difference from the ideal line is due to this overhead.

The data was then pre-distributed in equal-sized blocks across the processors. With the load balance block option set, the result is identical to the block option in Figure 2 and is therefore not presented. This result shows that the load balance block option is efficient when the data is all on one processor. Note that in this pre-distributed case Y01CAFP does not have to perform any data redistribution. The load balance grab option is given in Figure 3 and it performs as well as Y01CAFP with the grab and master/slave options (originally shown in Figure 2), which is also displayed as a reference. This shows that pre-distributing the data removes the data transfer bottleneck from the root processor.

The data was then pre-distributed in a cyclic manner. Figure 3 shows that with the load balance block option the performance is as good as the other two options. The load balance block option actually performs no data redistribution in this case. This means that for this problem, pre-distributing the rays in a cyclic manner gives very good load balance. The load balance grab option gives no further improvement and is therefore not included.

[Fig. 3. SGI O2000, initial distribution block and cyclic, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb grab master/slave, lb block grab, lb cyclic block.]

Figure 4 presents results for the same test case, scaling up to a much larger number of processors on a Cray T3D. The initial block distribution with the load balance block option performs worst due to load imbalance. Changing this option to grab brings the performance close to that for the initial cyclic distribution with the load balance block option. The initial cyclic distribution with the load balance grab option performs the best by a small margin. The performance improvement falls off primarily due to the sequential fraction mentioned in Section 4.2.

[Fig. 4. Cray T3D, ray density 13 (106,912 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

[Fig. 5. Cray T3D, ray density 50 (1,586,356 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

Figure 5 presents results for the same options as the previous figure, with a much greater ray density of 50 (1,586,356 rays). The trends are very similar; however, the larger problem size is more scalable. At 256 processors the cyclic pre-distribution actually performs slightly worse with the load balance grab option than with the load balance block option.

Figure 6 again presents results for the same options as Figure 4, for a ray density of 200 (25,370,577 rays). At this density the problem will only run on 128 or more processors due to memory limitations. In this example all options scale linearly except the initial block distribution with the load balance block option, which suffers from load imbalance.

[Fig. 6. Cray T3D, ray density 200 (25,370,577 rays). Plot of 1/t against nprocs for: naive ideal, lb block block, lb cyclic block, lb block grab, lb cyclic grab.]

6 Conclusions

Y01CAFP has proven to be useful for distributing work from the root processor to the remaining processors. Note that the load balancer is effectively parallelising the application here. In this case the master/slave option helps reduce the communication bottleneck at the root processor by dedicating it to this task.

For larger problems the data must be pre-distributed (particularly for distributed memory machines), not only for performance but also so that the memory requirements per processor are not too high. In the example problem presented in this paper a cyclic pre-distribution of the data gives a near load balanced solution (as rays close to each other follow similar paths and thus have a similar computational cost). However, for all problem sizes an initial block distribution of data with the grab load balance option gives very similar performance results. This suggests that for a different dataset, or a different problem entirely, where an initial cyclic distribution is not feasible or does not give a load balanced solution, an initial block distribution of data with the grab load balance option will load balance the problem.

The initial integration of the load balancer into the RCS code was relatively simple, and BAe are now using the NAG Parallel Library and the Y01CAFP load balance routine in production runs with much greater ray densities and more realistic geometries than were previously possible. In summary, Y01CAFP has proven to be a very powerful, flexible and useful load balancing routine.

References

1. Ling, H., et al.: Shooting and Bouncing Rays: Calculating the RCS of an Arbitrarily Shaped Cavity. IEEE Transactions on Antennas and Propagation, Vol. 37, No. 2, February 1989.
2. Watt, A.: Fundamentals of Three Dimensional Computer Graphics. Addison Wesley.
3. PATRAN Plus User Manual.
4. Keates, M., Hubbold, J.: Accelerated Ray Tracing on the KSR1. Technical Report UMCS, University of Manchester.
5. Dongarra, J., Whaley, R. C.: A User's Guide to the BLACS v1.1. Technical Report CS, University of Tennessee, Knoxville, Tennessee (1997).
6. Geist, A., Beguelin, A., Dongarra, J., Manchek, R., Jiang, W., Sunderam, V.: PVM: A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts (1994).
7. N.A.G.: N.A.G. Parallel Library Manual, Release 2. N.A.G. Ltd., Oxford (1997).
8. N.A.G.: N.A.G. Fortran Library Manual, Mark 17. N.A.G. Ltd., Oxford (1997).
9. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference. The MIT Press, Cambridge, Massachusetts (1996).
10. Ford, R.: A Message Minimisation Algorithm. CNC Technical Report, Department of Computer Science, The University of Manchester, Manchester, U.K. (1998).
