Feedback Guided Scheduling of Nested Loops


T. L. Freeman¹, D. J. Hancock¹, J. M. Bull², and R. W. Ford¹

¹ Centre for Novel Computing, University of Manchester, Manchester, M13 9PL, U.K.
² Edinburgh Parallel Computing Centre, University of Edinburgh, Edinburgh, EH9 3JZ, U.K.

Abstract. In earlier papers ([2], [3], [6]), feedback guided loop scheduling algorithms have been shown to be very effective for certain loop scheduling problems which involve a sequential outer loop and a parallel inner loop, and for which the workload of the parallel loop changes only slowly from one execution to the next. In this paper the extension of these ideas to the case of nested parallel loops is investigated. We describe four feedback guided algorithms for scheduling nested loops and evaluate the performance of the algorithms on a set of synthetic benchmarks.

1 Introduction

Loops are a rich source of parallelism for many applications. As a result, a variety of loop scheduling algorithms have been suggested that aim to schedule the iterations of a parallel loop to the processors of a shared-memory machine in an almost optimal way. In a number of recent papers, Bull, Ford and co-workers ([2], [3], [6]) have shown that a Feedback Guided Dynamic Loop Scheduling (FGDLS) algorithm performs well for certain parallel single loops. Parallel nested loops are an even richer source of parallelism; in this paper we extend these feedback guided scheduling ideas to the more general case of nested loops, and present numerical results to evaluate the performance of the resulting algorithms.

2 Loop Scheduling

Many existing loop scheduling algorithms are based on either (variants of) guided self-scheduling (Polychronopoulos and Kuck [11], Eager and Zahorjan [5], Hummel et al. [7], Lucco [8], Tzen and Ni [13]) or (variants of) affinity scheduling (Markatos and LeBlanc [9], Subramaniam and Eager [12], Yan et al. [14]). The underlying assumption of these algorithms is that each execution of a parallel loop is independent of previous executions of the same loop and therefore has to be rescheduled from scratch. In a sequence of papers, Bull et al. [3] and Bull [2] (see also Ford et al. [6]) propose a substantially different loop scheduling algorithm, termed Feedback Guided Dynamic Loop Scheduling. The algorithm assumes that the workload changes only slowly from one execution of a loop to the next, so that observed timing information from the current execution of the loop can, and should, be used to guide the scheduling of the next execution of the same loop. In this way it is possible to limit the number of chunks into which the loop iterations are divided to be equal to the number of processors, and thereby to limit the overheads (loss of data locality, synchronisation costs, etc.) associated with guided self-scheduling and affinity scheduling algorithms.

Note that almost all scheduling algorithms are designed to deal with the case of a single loop. The application of these algorithms to nested parallel loops usually proceeds by either (a) treating only the outermost loop as parallel (see Section 4.4), or (b) coalescing the loops into a single loop (see Section 4.2). In Section 4.1 we introduce a scheduling algorithm that treats the multidimensionality of the iteration space corresponding to the nested loops directly.

3 Feedback Guided Loop Scheduling

The feedback guided algorithms described in [3] and [2] are designed to deal with the scheduling, across p processors, of a single loop of the form:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K = 1, NPOINTS
    CALL LOOP_BODY(K)

We assume that the scheduling algorithm has defined the lower and upper loop iteration bounds, $l_j^t, u_j^t \in \mathbb{N}$, $j = 1, 2, \ldots, p$, on iteration step $t$, and that the corresponding measured execution times are given by $T_j^t$, $j = 1, 2, \ldots, p$. This execution time data enables a piecewise constant approximation to the observed workload to be defined, and the iteration bounds for iteration $t+1$, $l_j^{t+1}, u_j^{t+1} \in \mathbb{N}$, $j = 1, 2, \ldots, p$, are then chosen to approximately equi-partition this piecewise constant function. This feedback guided algorithm has been shown to be particularly effective for problems where the workload of the parallel loop changes slowly from one execution to the next (see [2], [3]); this is a situation that occurs in many applications. In situations where the workload changes more rapidly, guided self-scheduling or affinity scheduling algorithms are likely to be more efficient.

In this paper we consider the extension of feedback guided loop scheduling algorithms to the case of nested parallel loops:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K1 = 1, NPOINTS1
    ...
      DO PARALLEL KM = 1, NPOINTSM
        CALL LOOP_BODY(K1,...,KM)
    ...
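Before turning to the nested case, it is worth making the one-dimensional feedback step concrete. The following is a minimal Python sketch, not the authors' implementation; all names are illustrative. The measured times $T_j^t$ define a piecewise constant workload estimate, which is then approximately equi-partitioned to give the bounds for step $t+1$.

def fgdls_repartition(bounds, times):
    # One feedback step of 1-D feedback guided scheduling (illustrative).
    # bounds: list of (l_j, u_j) inclusive iteration bounds used on step t,
    #         assumed to cover 1..NPOINTS contiguously.
    # times:  measured execution times T_j for those chunks.
    p = len(bounds)
    npoints = bounds[-1][1]
    # Piecewise constant workload estimate: cost per iteration in chunk j.
    cost = []
    for (l, u), t in zip(bounds, times):
        cost.extend([t / (u - l + 1)] * (u - l + 1))
    total = sum(cost)
    new_bounds, acc, lo, k = [], 0.0, 1, 0
    for j in range(1, p):
        target = j * total / p              # j-th equi-partition level
        while k < npoints and acc + cost[k] <= target:
            acc += cost[k]
            k += 1
        new_bounds.append((lo, k))          # chunk j covers iterations lo..k
        lo = k + 1
    new_bounds.append((lo, npoints))
    return new_bounds

# Example: four processors; the first chunk was observed to be expensive,
# so it shrinks on the next step while the cheap middle chunks grow.
print(fgdls_repartition([(1, 25), (26, 50), (51, 75), (76, 100)],
                        [4.0, 1.0, 1.0, 2.0]))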

As in the single loop case, the issues that arise in the scheduling of nested loops are load balancing and communication minimisation. In higher dimensions, the problem of minimising the amount of communication implied by the schedule assumes greater significance; there are two aspects to this problem: temporal locality and spatial locality. Temporal locality is a measure of the communication implied by the scheduling of different parts of the iteration space to different processors on successive iterations of the sequential loop. Spatial locality is a measure of the communication implied by the computation at each point of the iteration space.

4 Scheduling Nested Loops

Of the approaches described in this section, the first, described in Section 4.1, addresses the multi-dimensional problem directly by defining an explicit recursive partition of the M-dimensional iteration space. The next two approaches are based on the transformation of the nested loops into an equivalent single loop, followed by the application of the one-dimensional Feedback Guided Loop Scheduling algorithm to this single loop; of these one-dimensional approaches, the first, described in Section 4.2, is based on coalescing the nested loops into a single loop; the second, described in Section 4.3, uses a space-filling curve technique to define a one-dimensional traversal of the M-dimensional iteration space. The final approach, described in Section 4.4, is based on the application of the one-dimensional Feedback Guided Loop Scheduling algorithm to the outermost parallel loop only.

4.1 Recursive Partitioning

This algorithm generalises the algorithm described in Section 3 to multiple dimensions. We assume that the scheduling algorithm has defined a partitioning of the M-dimensional iteration space into p disjoint subregions on (sequential) iteration $t$,

$S_j^t = [l_{1,j}^t, u_{1,j}^t] \times [l_{2,j}^t, u_{2,j}^t] \times \cdots \times [l_{M,j}^t, u_{M,j}^t], \quad j = 1, 2, \ldots, p,$

so that

$\bigcup_{j=1}^{p} S_j^t = [1, \mathrm{NPOINTS1}] \times [1, \mathrm{NPOINTS2}] \times \cdots \times [1, \mathrm{NPOINTSM}]$

(i.e. so that the union of the disjoint subregions is the complete iteration space). We also assume that the corresponding measured execution times, $T_j^t$, $j = 1, 2, \ldots, p$, are known. Based on this information, an M-dimensional piecewise constant approximation $\hat{W}^t$ to the actual workload at iteration $t$ can be formed:

$\hat{W}^t(x_1, x_2, \ldots, x_M) = \dfrac{T_j^t}{(u_{1,j}^t - l_{1,j}^t + 1)(u_{2,j}^t - l_{2,j}^t + 1) \cdots (u_{M,j}^t - l_{M,j}^t + 1)}$

where $x_i \in [l_{i,j}^t, u_{i,j}^t]$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, p$. The recursive partitioning algorithm then defines new disjoint subregions $S_j^{t+1}$, $j = 1, 2, \ldots, p$, for scheduling the next iteration $t+1$, so that $\bigcup_{j=1}^{p} S_j^{t+1}$ is the complete iteration space and so that the piecewise constant workload function $\hat{W}^t$ is approximately equi-partitioned across these subregions. This is achieved by recursively partitioning the dimensions of the iteration space; in our current implementation, the dimensions are treated from outermost loop to innermost loop, although it would be easy to develop other, more sophisticated, strategies (to take account of loop length, for example). To give a specific example, if p is a power of 2, then the first partitioning (splitting) point is such that the piecewise constant workload function $\hat{W}^t$ is approximately bisected in its first dimension; the second partitioning points are in the second dimension and are such that the two piecewise constant workloads resulting from the first partitioning are approximately bisected in the second dimension; subsequent partitioning points further approximately bisect $\hat{W}^t$ in the other dimensions. It is also straightforward to deal with the case when p is not a power of 2; for example, if p is odd, the first partitioning point is such that the piecewise constant workload function $\hat{W}^t$ is approximately divided in the ratio $\lfloor p/2 \rfloor : \lceil p/2 \rceil$ in its first dimension ($\lfloor r \rfloor$ is the largest integer not greater than $r$ and $\lceil r \rceil$ is the smallest integer not less than $r$); subsequent partitioning points divide $\hat{W}^t$ in the appropriate ratios in the other dimensions.
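A minimal sketch of the recursive partitioning step for the two-dimensional case follows; it is illustrative Python rather than the paper's implementation, and assumes the workload estimate $\hat{W}^t$ has already been sampled into a dense array w. The splitting rule is the simple outermost-first strategy described above, dividing p in the ratio $\lfloor p/2 \rfloor : \lceil p/2 \rceil$ at each level.

import numpy as np

def recursive_partition(w, p, region=None, dim=0):
    # Partition a 2-D workload estimate w among p processors.
    # region: (r0, r1, c0, c1), half-open row/column ranges.
    # dim:    dimension split at this level (0 = outermost loop).
    if region is None:
        region = (0, w.shape[0], 0, w.shape[1])
    if p == 1:
        return [region]
    r0, r1, c0, c1 = region
    p_lo, p_hi = p // 2, p - p // 2
    # 1-D profile of the workload along the splitting dimension.
    profile = w[r0:r1, c0:c1].sum(axis=1 - dim)
    cum = np.cumsum(profile)
    # Split index that best matches the target ratio p_lo : p_hi.
    s = int(np.searchsorted(cum, cum[-1] * p_lo / p)) + 1
    s = min(max(s, 1), len(profile) - 1)    # keep both halves non-empty
    if dim == 0:
        lo, hi = (r0, r0 + s, c0, c1), (r0 + s, r1, c0, c1)
    else:
        lo, hi = (r0, r1, c0, c0 + s), (r0, r1, c0 + s, c1)
    # Alternate dimensions on the next level of the recursion.
    return (recursive_partition(w, p_lo, lo, 1 - dim) +
            recursive_partition(w, p_hi, hi, 1 - dim))

# Example: a linearly varying workload on a 16x16 space, 4 processors.
w = np.add.outer(np.arange(1.0, 17.0), np.arange(1.0, 17.0))
print(recursive_partition(w, 4))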

4.2 Loop Coalescing

A standard way of reducing a loop nest to a single loop is loop coalescing (see [1] or [11] for details). The M parallel loops can be coalesced into a single loop of length NPOINTS1*NPOINTS2*...*NPOINTSM. For example, for the case $M = 4$ the coalesced loop is given by:

DO SEQUENTIAL J = 1, NSTEPS
  DO PARALLEL K = 1, NPOINTS1*NPOINTS2*NPOINTS3*NPOINTS4
    K1 = ((K-1)/(NPOINTS4*NPOINTS3*NPOINTS2)) + 1
    K2 = MOD((K-1)/(NPOINTS4*NPOINTS3),NPOINTS2) + 1
    K3 = MOD((K-1)/NPOINTS4,NPOINTS3) + 1
    K4 = MOD(K-1,NPOINTS4) + 1
    CALL LOOP_BODY(K1,K2,K3,K4)

Having coalesced the loops, we can use any one-dimensional algorithm to schedule the resulting single loop. The numerical results of [2], [3] show that the feedback guided scheduling algorithms are very effective for problems where the workload of a (one-dimensional) parallel loop changes slowly from one execution to the next. This is the situation in our case and thus we use the one-dimensional Feedback Guided Loop Scheduling algorithm described in Section 3 to schedule the coalesced loop.

4.3 Space-filling Traversal

This algorithm is based on the use of space-filling curves to traverse the iteration space. Such curves visit each point in the space exactly once, and have the property that any point on the curve is spatially adjacent to its neighbouring points. Again, any one-dimensional scheduling algorithm can be used to partition the resulting one-parameter curve into p pieces (for the results of the next section the Feedback Guided Loop Scheduling algorithm described in Section 3 is used). This algorithm readily extends to M dimensions, although our current implementation is restricted to $M = 2$. A first-order Hilbert curve (see [4]) is used to fill the 2-dimensional iteration space, and the mapping of two-dimensional coordinates to curve distances is performed using a series of bitwise operations. One drawback of the use of such a space-filling curve is that the iteration space must have sides of equal lengths, and the length must be a power of 2.
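The paper's bit-level mapping is not reproduced here; the sketch below uses the standard iterative conversion of 2-D coordinates to a Hilbert-curve distance (in the spirit of Butz [4]), and is illustrative rather than the authors' code. Sorting the points of the iteration space by this distance yields the one-dimensional traversal that is then partitioned.

def xy_to_hilbert(n, x, y):
    # Distance of point (x, y) along the Hilbert curve filling an
    # n-by-n grid, n a power of 2, using only bitwise operations.
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in standard position.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Traverse an 8x8 iteration space in Hilbert order; any one-dimensional
# scheduler (e.g. that of Section 3) can partition the resulting sequence
# into p contiguous, spatially compact pieces.
order = sorted(((i, j) for i in range(8) for j in range(8)),
               key=lambda pt: xy_to_hilbert(8, *pt))
print(order[:8])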

4.4 Outer Loop Parallelisation

In this algorithm, only the outermost, K1, loop is treated as a parallel loop; the inner, K2, ..., KM, loops are treated as sequential loops. The one-dimensional Feedback Guided Loop Scheduling algorithm of Section 3 is used to schedule this single parallel loop. This approach is often recommended because of its simplicity and because it exploits parallelism at the coarsest level.

5 Numerical Results and Conclusions

This section reports the results of a set of experiments that simulate the performance of the four algorithms described in Section 4 under synthetic load conditions. Simulation has been used for three reasons: for accuracy (reproducible timing results are notoriously hard to achieve on large HPC systems; see Appendix A of Mukhopadhyay [10]), so that the instrumenting of the code does not affect performance, and to allow large values of p to be tested. Statistics were recorded as each of the algorithms attempted to load-balance a number of computations with different load distributions. At the end of each iteration, the load imbalance was computed as $(T_{\max} - T_{\mathrm{mean}})/T_{\max}$, where $T_{\max}$ was the maximum execution time of any processor for the given iteration and $T_{\mathrm{mean}}$ was the corresponding mean execution time over all processors. The results presented in Figures 1-4 show average load imbalance over all iterations versus number of processors; we plot average load imbalance to show more clearly the performance of the algorithms in response to a dynamic load. The degree of data reuse between successive iterations was calculated as follows. For each point in the iteration space, a record was kept of the processor to which it was allocated in the previous iteration. For a given processor, the reuse was calculated as the percentage of points allocated to that processor that were also allocated to it in the previous iteration. As with load imbalance, the results presented in Figures 5-8 show reuse averaged over all iterations plotted against number of processors.
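Both statistics are simple functions of the per-processor times and of the map from iteration-space points to processors; the following minimal sketch (illustrative names, not the authors' instrumentation) computes them for one iteration.

import numpy as np

def load_imbalance(times):
    # (T_max - T_mean) / T_max for one iteration's per-processor times.
    t = np.asarray(times, dtype=float)
    return (t.max() - t.mean()) / t.max()

def mean_reuse(prev_alloc, alloc, p):
    # Average over processors of the percentage of points allocated to a
    # processor that were also allocated to it on the previous iteration.
    reuse = []
    for proc in range(p):
        mine = (alloc == proc)
        if mine.any():
            reuse.append(100.0 * (prev_alloc[mine] == proc).mean())
    return sum(reuse) / len(reuse)

# Example: 4 processors on an 8x8 space; shifting a 2-row-per-processor
# partition by one row between iterations gives 50% reuse.
prev_alloc = np.repeat(np.arange(4), 16).reshape(8, 8)
alloc = np.roll(prev_alloc, 1, axis=0)
print(load_imbalance([1.0, 1.2, 0.9, 1.1]), mean_reuse(prev_alloc, alloc, 4))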

Four different workload configurations have been used in the experiments:

Gauss: a Gaussian workload where the centre of the workload rotates about the centre of the iteration space. Parameters control the width of the Gaussian curve, and the radius and time period of the rotation.

Pond: a constant workload within a circle centred at the centre of the iteration space, where the radius of the circle oscillates sinusoidally; the workload outside the circle is zero. Parameters control the initial radius of the circle and the amplitude and period of the radial oscillations.

Ridge: a workload described by a ridge with Gaussian cross-section that traverses the iteration space. Parameters control the orientation of the ridge, the width of the Gaussian cross-section and the period of transit across the iteration space.

Linear: a workload that varies linearly across the iteration space, but is static.

Only the linear workload is static; the other three workloads are dynamic. The Gauss workload, for example, can be generated as in the sketch below.
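This sketch is illustrative rather than the authors' benchmark code; the function and parameter names are assumptions (width, radius and period play the roles of the parameters described above), and the other three workloads can be generated analogously.

import numpy as np

def gauss_load(n, step, width=8.0, radius=16.0, period=100):
    # Per-point cost on an n-by-n iteration space at sequential step 'step':
    # a Gaussian bump whose centre rotates about the centre of the space.
    angle = 2.0 * np.pi * step / period
    cx = n / 2.0 + radius * np.cos(angle)
    cy = n / 2.0 + radius * np.sin(angle)
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * width ** 2))

# Cost array a quarter of the way through one rotation of a 64x64 space.
w = gauss_load(64, 25)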
The computation is assumed to be executed on a square iteration space. Each of the algorithms described in Section 4 was used to partition the loads for numbers of processors varying between 2 and 256. The results displayed are for 4 iterations of the loop scheduling algorithm (for the dynamic workloads, this corresponds to 2 complete cycles of the workload period). The algorithms were also run for 1 and 2 iterations and very similar results were obtained. The four algorithms are labelled RCB (the recursive partitioning, or recursive co-ordinate bisection, algorithm), Slice (the loop coalescing algorithm), SFC (the space-filling traversal, or space-filling curve, algorithm) and Outer (outer loop parallelisation).

We first consider the load balancing performance summarised in the graphs of Figures 1-4. For two of the dynamic loads (Gauss and Pond, Figures 1 and 2 respectively), there is little to choose between the four algorithms. For the third dynamic load (Ridge, Figure 3), Slice and Outer perform better than RCB and SFC; Slice and Outer traverse the iteration space in a line-by-line fashion, and this is a very effective strategy for this particular workload. For the static load (Linear, Figure 4), Slice and SFC outperform RCB and Outer, particularly for larger numbers of processors; we note that Slice and SFC achieve essentially perfect load balance.

These results can be explained by considering the structures of the four algorithms. Slice and SFC are one-dimensional algorithms adapted to solve the two-dimensional test problems; as such they adopt a relatively fine-grain approach to loop scheduling and this should lead to good load balance properties. In contrast, RCB is a multi-dimensional algorithm and, in its initial stages (the first few partitions), it adopts a relatively coarse-grain approach to loop scheduling. For example, the first partition approximately divides the iteration space into two parts, and any imbalance implied by this partition is inherited by subsequent recursive partitions. Outer, which schedules only the outermost loop, also adopts a coarse-grain approach to the scheduling problem. Given this, the load balance properties of RCB and Outer for the dynamic workload problems are surprisingly good and are often better than those of either Slice or SFC.

We would expect the structure of RCB to be advantageous when we consider the data reuse achieved by the algorithms, and this is borne out by the graphs in Figures 5-8, which show the levels of inter-iteration data reuse achieved. Note that data reuse is a measure of the temporal locality achieved by the different scheduling algorithms. In terms of inter-iteration data reuse, RCB clearly outperforms Slice, SFC and Outer for all the dynamic load types and for nearly all numbers of processors. In particular, for larger numbers of processors RCB substantially outperforms the other algorithms.

Again, the explanation for these results lies in the structure of the algorithms; in particular, the fact that RCB treats the multi-dimensional iteration space directly is advantageous for data reuse.

In broad terms we observe that the load balance properties of the four algorithms are very similar, and it is not possible to identify one of the algorithms as significantly, and uniformly, better than the others. In contrast, RCB is substantially better than the other algorithms in terms of inter-iteration data reuse, particularly for larger numbers of processors. How these observed performance differences manifest themselves on an actual parallel machine will depend on a number of factors, including: the relative performances of computation and communication on the machine; the granularity of the computations in the parallel nested loops; the rate at which the workload changes; and the extent of the imbalance of the work across the iteration space. A comprehensive set of further experiments to undertake this evaluation of the algorithms on a parallel machine will be reported in a forthcoming paper.

Fig. 1. Load Balance for Gauss Load

Acknowledgements

The authors acknowledge the support of the EPSRC research grant GR/K8574, Feedback Guided Affinity Loop Scheduling for Multiprocessors.

References

1. D. F. Bacon, S. L. Graham and O. J. Sharp, (1994) Compiler Transformations for High-Performance Computing, ACM Computing Surveys, vol. 26, no. 4, pp. 345-420.
2. J. M. Bull, (1998) Feedback Guided Loop Scheduling: Algorithms and Experiments, Proceedings of Euro-Par '98, Lecture Notes in Computer Science, Springer-Verlag.

Fig. 2. Load Balance for Pond Load

Fig. 3. Load Balance for Ridge Load

Fig. 4. Load Balance for Linear Load

Fig. 5. Reuse for Gauss Load

Fig. 6. Reuse for Pond Load

Fig. 7. Reuse for Ridge Load

Fig. 8. Reuse for Linear Load

3. J. M. Bull, R. W. Ford and A. Dickinson, (1996) A Feedback Based Load Balance Algorithm for Physics Routines in NWP, Proceedings of the Seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific.
4. A. R. Butz, (1971) Alternative Algorithm for Hilbert's Space-Filling Curve, IEEE Transactions on Computers, vol. C-20, pp. 424-426.
5. D. L. Eager and J. Zahorjan, (1992) Adaptive Guided Self-Scheduling, Technical Report, Department of Computer Science and Engineering, University of Washington.
6. R. W. Ford, D. F. Snelling and A. Dickinson, (1994) Controlling Load Balance, Cache Use and Vector Length in the UM, Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific.
7. S. F. Hummel, E. Schonberg and L. E. Flynn, (1992) Factoring: A Practical and Robust Method for Scheduling Parallel Loops, Communications of the ACM, vol. 35, no. 8, pp. 90-101.
8. S. Lucco, (1992) A Dynamic Scheduling Method for Irregular Parallel Programs, Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992.
9. E. P. Markatos and T. J. LeBlanc, (1994) Using Processor Affinity in Loop Scheduling on Shared Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400.
10. N. Mukhopadhyay, (1999) On the Effectiveness of Feedback-guided Parallelisation, PhD Thesis, Department of Computer Science, University of Manchester.
11. C. D. Polychronopoulos and D. J. Kuck, (1987) Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1425-1439.
12. S. Subramaniam and D. L. Eager, (1994) Affinity Scheduling of Unbalanced Workloads, Proceedings of Supercomputing '94, IEEE Computer Society Press.
13. T. H. Tzen and L. M. Ni, (1993) Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers, IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98.
14. Y. Yan, C. Jin and X. Zhang, (1997) Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 1, pp. 70-81.
