CODE GENERATION FOR GENERAL LOOPS USING METHODS FROM COMPUTATIONAL GEOMETRY


T. Andronikos, F. M. Ciorba, P. Theodoropoulos, D. Kamenopoulos and G. Papakonstantinou
Computing Systems Laboratory, National Technical University of Athens
9 Heroon Polytechniou, Zographou Campus, 157 80 Athens, Greece

ABSTRACT

This paper deals with general nested loops and proposes a novel dynamic scheduling technique. General loops contain complex loop bodies (consisting of arbitrary program statements, such as assignments, conditions and repetitions) that exhibit uniform loop-carried dependencies. It is therefore now possible to achieve efficient parallelization for a vast class of loops, mostly found in DSP, PDEs, and signal and video coding. At the core of this technique lies a simple and efficient dynamic rule (SDS, Successive Dynamic Scheduling) for determining the next ready-to-be-executed iteration at runtime. The central idea is to schedule the iterations on the fly by using SDS, along the optimal hyperplane (determined using the QuickHull algorithm). Furthermore, a tool (CRONUS/1) that implements this theory and automatically produces SPMD parallel code for message passing architectures is presented. As a case study, the FSBM motion estimation algorithm (used in video coding standards, e.g., MPEG-2 and H.261) was used. The tool was also tested on a suite of randomly generated loops. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.

KEY WORDS
General loops, dynamic scheduling, automatic SPMD code generation, message passing architectures.

1 Introduction

Parallelizing computationally intensive programs has led to dramatic performance improvements. Usually these programs contain repetitive computations, the majority in the form of nested for-loops. The iterations within a loop nest can be either independent or precedence constrained. The latter constraints can be uniform (constant) or non-uniform throughout the execution of the program. This paper tackles the uniform precedence constraints case by presenting a novel scheduling policy and a code generation technique for general nested loops. When parallelizing nested loops, the following tasks need to be carried out:

- Detection of the inherent parallelism, applying (when necessary) any program transformation that may enable it.
- Computation scheduling (specifying when the different computations are performed).
- Computation mapping (specifying where the different computations are performed).
- Explicit management of memory and communication (an additional task for distributed memory systems).
- Code generation, so that each processor executes its apportioned computation and communication explicitly.

Related Work. Scheduling nested loops with uniform dependencies was studied by Lamport, who partitioned the index space into hyperplanes [1]. The idea behind this is that all points that belong to the same hyperplane can be executed in parallel. Darte proved that this method is nearly optimal [2], while the problem of finding the hyperplane that results in the minimum makespan was solved in [3]. Moldovan, Shang, Darte and others applied the hyperplane method to find a linear optimal execution schedule, using diophantine equations [4], linear programming in subspaces [5], or integer programming [2]. All these approaches assume unit execution time for each iteration and zero communication for each communication step (the UET model).
Today the most prominent technique for the parallelization and code generation of nested loops is tiling [6], [7], [8], [9]. Nevertheless, parallelizing nested loops for distributed memory machines incurs communication overhead. One way to tackle the scheduling problem in the presence of communication delays is to model the algorithm by a directed acyclic task graph (DAG), in which the computation and communication times are represented as node and edge weights, respectively. Since the general scheduling problem is well known to be NP-complete [10], [11], researchers have turned their attention to other methods, such as heuristics and approximation algorithms. In order to minimize the inter-processor communication cost, many researchers focus on partitioning the index space

into groups of computations that are as independent as possible. Shang and Fortes presented in [12] a method for dividing the index space into independent sets of computations that are assigned to different processors, such that the communication cost is zeroed. Kshemkalyani and Singhal [13] identified different communication patterns in distributed computations in order to reduce the overall communication cost. Feautrier [14] explored the possibility of having the parallelizing compiler determine the distribution of data and computations, using only information available in the source program, provided that it contains only for-loops and arrays with affine subscripts. Engels et al. presented in [15] a polynomial-time algorithm for scheduling unit-length jobs on a fixed number of identical parallel machines, for the case where the precedence constraints graph is a forest of in/out-trees and the delays are bounded by a constant, such that the makespan of the resulting schedule is minimized. However, the asserted constraints restrict the applicability of their algorithm. In [16], T. Yang et al. present a tool for statically scheduling DAGs and generating the appropriate code for specific message passing MIMD architectures. Statically scheduling the DAG before generating the parallel code is a drawback in terms of makespan. On top of that, the parallel code is generated for specific architectures (nCUBE-2 and Intel-2), which shows a lack of portability. Calland et al. present in [17] a method for scheduling and mapping (columnwise/rowwise) tiles on limited computational resources, assuming communication/computation overlap, and in [18] they present a static technique for heterogeneous networks.

Generally, there are two basic computation scheduling methods: static and dynamic. Static scheduling implies the creation, at compilation time, of a timetable for the iterations of the loop nest (by traversing the index space once) that preserves the precedence constraints, followed by computation mapping. Dynamic scheduling implies finding a rule that instructs every processor at run time which loop iteration to compute, rather than an explicit specification at compile time. Herein, the computation mapping is embedded in the dynamic rule. Code generation is a complex problem, and most methods exploit parallelism across loop iterations (task-level parallelism), with some effort concerning parallelism within an iteration (instruction-level parallelism). A generic approach to code generation is not very practical due to the complexity of the problem. Instead, efficient code generation for classes of specific problems with high applicability leads to better performance than the generic approach. Generally, most of the scheduling complexity stems from the fact that precedence constraints (i.e., the computations must be executed in a specific order so as to preserve the meaning of the original sequential program) and resource constraints (i.e., the number of concurrent operations at any time step is bounded by the available resources of the multicomputer at hand) must be tackled simultaneously.

Problem Description. Given a sequential general loop structure, automatically produce an equivalent high-speedup parallel program using a bounded (but arbitrary) number of processors.

Our Approach. In this paper we tackle the computation scheduling, computation mapping and code generation for general loops (GLs) that have arbitrary execution and communication times.
GLs are nested loops whose loop body consists of generic program statements (such as assignments, conditions and repetitions) and which exhibit uniform loop-carried dependencies (or even no loop-carried dependencies at all). The loop iterations are considered as points in a Cartesian n-dimensional index space, and the loop-carried dependence vectors are considered as directed arcs between the corresponding index points. We focus on iteration-level parallelism under the assumption that the number of processors used is finite. Finally, we automatically produce efficient parallel code for these processors. Our approach to the above problem consists of three steps:

1. We determine the optimal family of hyperplanes (a hyperplane is optimal if no other hyperplane leads to a smaller makespan) using the well-known QuickHull algorithm [19].
2. We define the lexicographic ordering on these hyperplanes and, by applying SDS, we schedule the iterations on the fly.
3. The presented tool automatically produces the parallel code for a given number of processors.

Step 1 is performed at compile time, whereas steps 2 and 3 are performed at run time. The reason for performing step 2 at run time is the ability to exploit the uniformity of the index space in order to find an efficient adaptive rule. Moreover, when dealing with large index spaces, performing step 2 at compile time would be very tedious; in our approach, the overhead associated with step 2 is greatly reduced by performing it at run time. The run-time scheduling (SDS) introduced in this paper produces a very efficient time schedule while meeting the precedence constraints. It also scales with the available number of processors, in that for any given number of processors it produces an efficient time schedule. The tool, CRONUS/1, produces the parallel code automatically, using SDS routines and MPI primitives for the dynamic scheduling and execution of the sequential program. CRONUS/1 takes as input a sequential program, which is parsed, and the essential parameters for code generation are extracted: the depth of the loop nest (n), the size of the index space (|J|) and the set of dependence vectors (DS), if they exist. Once these crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator, the appropriate parallel code is generated for the given number of processors.

The generated parallel code contains run-time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand. DS = {d_1, ..., d_m}, m ≥ n, is the set of the m dependence vectors, which must be uniform, i.e., constant throughout the index space, and must have non-negative coordinates.

Contribution.

- The efficiency of the scheduling rule used by SDS. SDS uses this rule to instruct each processor which index point to execute next, as well as which processors to send data to and receive data from (if necessary). The efficiency of the rule ensures that, during the execution of the parallel code, no artificial communication or computation delays occur that are not present in the initial sequential program. Moreover, each processor spends most of its time executing the appropriate iteration rather than communicating data.
- The validity and optimality of SDS. The validity is ensured because the points on any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), whereas the optimality follows from the fact that all inherent parallelism is exploited by following the optimal hyperplane.
- Perfect load balancing of SDS, under the assumption that all processors are homogeneous. This is ensured by the dynamic rule, because it distributes the iterations evenly among processors (in a round-robin fashion).
- Automatic parallel code generation. The automatic code generator produces the parallel code without any help from the user. It therefore greatly reduces the overhead associated with hand-written coding, especially when conducting experiments for performance comparisons. The rationale behind the automatic code generator is independent of the dynamic scheduling methodology employed. Moreover, generating parallel code for the MPI platform yields high portability.

2 Terminology and Definitions

2.1 General Loops

General Loops (GLs) have the form shown in Fig. 1. In this model, the loop body LB(j) contains general program statements, including assignment statements, conditional if statements and repetitions such as for or while. The lower and upper bounds of the loop indices are l_i and u_i ∈ Z, respectively. The depth of the loop nest, n, determines the dimension of the index space J = {j = (i_1, ..., i_n) ∈ N^n | l_r ≤ i_r ≤ u_r, 1 ≤ r ≤ n}. Each point of the n-dimensional index space is a distinct iteration of the loop body. L = (l_1, ..., l_n) and U = (u_1, ..., u_n) are the initial and terminal points of the index space.

Figure 1. Computational model

The only requirement is that any loop instance LB(j) depends on previous loop instance(s) as follows:

LB(j) = f(LB(j − d_1), ..., LB(j − d_m))   (1)

where f must be a computable function. If this condition does not hold, the proposed methodology is not applicable to the particular loop. Deciding whether condition (1) holds for an arbitrary loop (containing if, while, exit, or continue statements) is a very difficult problem, and we do not claim to have a solution for it. However, we claim that if one verifies condition (1), our methodology is valid. In practice this is usually very easy. To demonstrate that for many real-life examples it is trivial to establish that condition (1) holds, we have chosen FSBM as a case study. A small illustrative GL is given below.
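As an illustration (our own example, not taken from the paper's benchmarks; the function f, the array size and the inner repetition are chosen arbitrarily), a minimal GL in the sense of Fig. 1 with the uniform dependence set DS = {(2, 5), (3, 3)} looks as follows; every instance LB(i, j) depends only on the instances at (i, j) − (2, 5) and (i, j) − (3, 3), so condition (1) holds:

/* gl_example.c: a minimal general loop (GL). The body mixes an assignment,
 * a condition and an inner while-repetition, yet each instance depends on
 * earlier instances only through the uniform vectors (2,5) and (3,3). */
#include <math.h>

#define U1 75
#define U2 90
static double A[U1 + 1][U2 + 1];

void gl(void)
{
    for (int i = 0; i <= U1; i++)
        for (int j = 0; j <= U2; j++) {      /* one index point (i, j)    */
            double t = 1.0;
            if (i >= 3 && j >= 5)            /* interior point: apply (1) */
                t = A[i - 2][j - 5] + A[i - 3][j - 3];
            int r = 0;                       /* inner repetition          */
            while (r++ < 4)
                t = 0.5 * (t + sqrt(fabs(t)));
            A[i][j] = t;                     /* LB(i, j) = f(...)         */
        }
}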
The nested loop for FSBM is shown in Figure 2.

Figure 2. Six-level nested loop full-search block-matching motion estimation algorithm

2.2 Finding the optimal scheduling hyperplane using convex hulls

The hyperplane (or wavefront) method [1] was one of the first methods for parallelizing uniform dependence loops. Consequently, it formed the basis of many heuristic algorithms developed in recent years. The choice of the optimal hyperplane for a given index space is one of the challenges that many researchers tackled, using diophantine equations, linear programming in subspaces and integer programming. In [3] and [20], the problem of finding the optimal hyperplane for uniform dependence loops was reduced to the problem of computing the convex hull of the dependence vectors and the terminal point. (The convex hull formed from the index points j_1, ..., j_m is defined as CH = {j ∈ N^n | j = λ_1 j_1 + ⋯ + λ_m j_m, where λ_1, ..., λ_m ≥ 0 and λ_1 + ⋯ + λ_m = 1}.) For instance, given the dependence vectors d_1 = (1, 8), d_2 = (2, 5), d_3 = (3, 3), d_4 = (6, 2) and d_5 = (8, 1), consider two index spaces: in the first the terminal point is U_1 = (75, 90) (see Figure 3(a)), and in the second U_2 = (105, 90) (see Figure 3(b)). A small sketch of this construction is given below.
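To make the last step concrete, here is a small sketch (ours, under the example values above; the QuickHull call itself is omitted) of how the coefficients of the optimal hyperplane family a_1 x_1 + a_2 x_2 = k are read off a 2-D hull edge once QuickHull has identified the edge facing the terminal point:

/* hull_edge_sketch.c: derive hyperplane coefficients from a 2-D hull edge
 * (illustrative; the convex hull computation itself is omitted).          */
#include <stdio.h>

/* hyperplane coefficients from the hull edge through points p and q */
static void edge_to_hyperplane(const int p[2], const int q[2], int a[2])
{
    a[0] = p[1] - q[1];      /* normal of the segment p-q ...              */
    a[1] = q[0] - p[0];      /* ... oriented to have non-negative entries  */
}

int main(void)
{
    const int d2[2] = {2, 5}, d3[2] = {3, 3}, d5[2] = {8, 1};
    int a[2];
    edge_to_hyperplane(d2, d3, a);                /* hull edge facing U_1  */
    printf("%d*x1 + %d*x2 = k\n", a[0], a[1]);    /* prints 2*x1 + 1*x2 = k */
    edge_to_hyperplane(d3, d5, a);                /* hull edge facing U_2  */
    printf("%d*x1 + %d*x2 = k\n", a[0], a[1]);    /* prints 2*x1 + 5*x2 = k */
    return 0;
}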

As shown in Figure 3, the convex hulls are the index points contained by the polygons A, B, C, E, U_1 and A, B, C, E, U_2, respectively. The optimal hyperplane in the first case is defined by the dependence vectors (d_2, d_3), whereas in the second case it is defined by (d_3, d_5). Hence, the equation of the optimal family of hyperplanes in the first case is 2x_1 + x_2 = k, k ∈ N, and in the second case 2x_1 + 5x_2 = k, k ∈ N. Scheduling the loop nest of Figure 3(a) along the hyperplane 2x_1 + x_2 = k leads to a parallel time of ⌈240/9⌉ + 1 = 28 time steps, assuming the initial point of the index space is L_1 = (0, 0). The algorithm used to compute the convex hulls is the well-known QuickHull algorithm [19].

Figure 3. Optimal hyperplane for two different index spaces

3 Lexicographic Ordering on Hyperplanes

The central issue in dynamic scheduling is finding an adaptive rule that instructs every processor what to do at run time, rather than specifying it explicitly at compile time. The adaptive rule is based on the ability to determine the next-to-be-executed point, or the required already-executed point, for any loop instance. In order to efficiently define such a rule, a partial ordering of the index points is necessary. This section describes the lexicographic ordering on hyperplanes, by which the index space is traversed lexicographically along hyperplanes, yielding a zigzag traversal in a 2D index space, or a spiral traversal in a 3D (or higher-dimensional) index space. To achieve this, the concepts of successor and predecessor of a point are introduced. The successor provides the means by which each processor can determine which iteration point to execute next, as well as a fast and reliable way for the currently running processor to determine the processors requiring the locally computed data. The converse holds for the predecessor.

All index points on a hyperplane can be lexicographically ordered. Therefore, given a family of hyperplanes of the general form

Π_k : a_1 x_1 + ⋯ + a_n x_n = k,   (2)

where a_i, x_i, k ∈ N, there exists a lexicographically minimum and a lexicographically maximum index point on every hyperplane that contains index points. (It is possible for the intersection of the index space with a particular hyperplane to be the empty set, in which case this hyperplane has no minimum and no maximum index point.) The algorithms that produce the minimum and maximum points of hyperplane Π_k, denoted min_{k,n} and max_{k,n}, are given in the Appendix. Both min_{k,n} and max_{k,n} depend on the current hyperplane's value k and the number of coordinates n. Let i and j be two index points that belong to the same hyperplane Π_k. j is the successor of i, denoted j = Succ(i), if i < j (i is lexicographically smaller than j) and for no other index point j′ of the same hyperplane does it hold that i < j′ < j. In the special case where i is the maximum index point of Π_k, Succ(i) is the minimum index point of Π_{k+1}. In a similar fashion, i is the predecessor of j, denoted i = Pred(j), if j is the successor of i. Finally, we define:

Succ^r(j) = j, if r = 0; Succ(j), if r = 1; Succ(Succ^{r−1}(j)), if r > 1.
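As a worked instance of these definitions (our own example, on the hyperplane family of Section 2.2): the index points of Π_9 : 2x_1 + x_2 = 9 with non-negative coordinates are (0, 9), (1, 7), (2, 5), (3, 3) and (4, 1), so min_{9,2} = (0, 9) and max_{9,2} = (4, 1). Consequently, Succ((1, 7)) = (2, 5), and Succ((4, 1)) = min_{10,2} = (0, 10), the minimum point of Π_10.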

Continuing the example of the previous section, we depict in Figure 4 the minimum and the maximum points of the hyperplanes 2x_1 + x_2 = 9 (Figure 4(a)) and 2x_1 + 5x_2 = 21 (Figure 4(b)).

Figure 4. Minimum and maximum points on hyperplanes

In this paper we advocate the use of successor index points along the optimal hyperplane as an adaptive dynamic rule. This rule is efficient in the sense that the overhead induced by its computation is negligible. Moreover, with regard to distributed memory platforms, it does not incur any additional communication cost. The index space is traversed hyperplane by hyperplane, and each hyperplane lexicographically, yielding a zigzag/spiral traversal. This ensures the rule's validity and optimality: the former because the points of any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), and the latter because, by following the optimal hyperplane, all inherent parallelism is exploited.

4 Overview of CRONUS/1

This section gives an overview of the tool and details the SDS method. CRONUS/1 is an existing semi-automatic parallelization tool; a detailed description is contained in the Cronus User's Guide, which can be found at cflorina/research/ongoing research.html. At this site one can also find the generated parallel code for the FSBM example given in the following section. The organization of CRONUS/1 is given in Figure 5. In the first stage (User Input), the user inputs a serial program. The next stage (Compile Time) starts with the loop nest detection phase (Parallelism Detection); if no loop nest can be found in the sequential program, the tool stops. During the second phase (Parameters Extraction), the program is parsed and the following essential parameters are extracted: the depth of the loop nest (n), the size of the (iteration) index space (|J|) and the set of dependence vectors (DS). Once these crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of an automatic code generator, the appropriate parallel code is generated (in the Automatic Code Generation phase) for the given number of processors. This is achieved with the help of a Perl script that operates on a configuration file, which contains all the required information. Because we tackle complex loop nests, the implementation of a general parser for such loops is beyond the scope of our research; hence, the user must manually define the loop body and the index space boundaries. In the configuration file the user must also define a startup function in C (automatically called by the generated parallel code) that performs data initialization on every processor, right before the actual parallel computation starts. The parallel code is written in C and contains run-time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the multicomputer at hand (in the Run Time stage).

Figure 5. Organization of CRONUS/1

4.1 Successive Dynamic Scheduling (SDS)

The scheduling method introduced in this paper was developed under the following assumptions: the multiprocessor system is uniform/homogeneous (i.e., the processors are identical) and non-preemptive (a processor completes its current task before executing a new one).
The most prominent features of SDS are:

- It is a dynamic scheduling strategy: both scheduling and task execution are performed at runtime, based on the availability of processors and on the release of iteration points from their dependencies (if any exist).
- It is a distributed scheduling strategy: the scheduling task and the scheduling information are distributed among the processors and their memories.
- It is a self-scheduling technique: an idle processor determines its next iteration of the loop nest by incrementing the loop indices in a synchronized way.

The scheduling policy is the following: assuming there are NP available processors (P_1, ..., P_NP), P_1 executes the initial index point L, P_2 executes Succ(L), P_3 executes Succ^2(L), and so on, until all processors are employed in execution for the first time. Upon completion of L, P_1 executes the next ready-to-be-executed point, found by skipping NP points in the index space (in the zigzag/spiral manner described in Section 3); the coordinates of this point are obtained by applying the Succ function NP times to the point currently executed by P_1, i.e., Succ^NP(L). Similarly, upon completion of its current point (call it j), P_2 executes the point given by Succ^NP(j), and so on, until all index points are exhausted. SDS ends when the terminal point U has been executed. Thus, SDS uses a perfect load distribution policy, because it assigns to a processor only one loop iteration at a time; in other words, iterations are assigned to processors in a round-robin fashion, which achieves this perfect load balancing. A minimal, self-contained sketch of this policy is given below.
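The following sketch is our illustration of SDS and the successor rule, not the CRONUS/1 source: hmin() computes the lexicographic minimum of a hyperplane (restricted to a suffix of the coordinates) and succ() the successor of an index point, in the spirit of the Appendix, for the running example's hyperplane family 2x_1 + x_2 = k; main() simulates the round-robin assignment of the first few points to NP = 4 processors (an arbitrary choice). Index-space upper bounds and all MPI communication are omitted for brevity.

/* sds_sketch.c: an illustrative sketch of SDS (not the CRONUS/1 source). */
#include <stdio.h>

#define N 2                       /* depth of the loop nest               */
static const int a[N] = {2, 1};   /* optimal hyperplane: 2*x1 + x2 = k    */

/* Fill out[s..N-1] with the lexicographically minimum non-negative point
 * satisfying a[s]*x[s] + ... + a[N-1]*x[N-1] = k; return 1 on success.   */
static int hmin(int s, int k, int out[N])
{
    if (s == N - 1) {
        if (k >= 0 && k % a[s] == 0) { out[s] = k / a[s]; return 1; }
        return 0;
    }
    for (int v = 0; v * a[s] <= k; v++) {  /* smallest leading coordinate */
        out[s] = v;
        if (hmin(s + 1, k - v * a[s], out)) return 1;
    }
    return 0;
}

/* Successor of j: next lexicographic point on the same hyperplane, or the
 * minimum point of the next hyperplane (every hyperplane is non-empty
 * here, since one coefficient of a[] is 1). Returns 1 on success.        */
static int succ(const int j[N], int out[N])
{
    int k = 0;
    for (int i = 0; i < N; i++) k += a[i] * j[i];
    for (int l = N - 2; l >= 0; l--) {
        int tail = 0;                       /* weight carried by x[l+1..] */
        for (int i = l + 1; i < N; i++) tail += a[i] * j[i];
        for (int inc = 1; a[l] * inc <= tail; inc++) {
            for (int i = 0; i < l; i++) out[i] = j[i];
            out[l] = j[l] + inc;            /* raise coordinate l ...     */
            if (hmin(l + 1, tail - a[l] * inc, out)) return 1; /* fix tail */
        }
    }
    return hmin(0, k + 1, out);             /* jump to hyperplane k + 1   */
}

int main(void)
{
    enum { NP = 4 };                        /* available processors       */
    int pt[N] = {0, 0};                     /* initial point L            */
    for (int step = 0; step < 12; step++) { /* first few SDS assignments  */
        printf("P%d executes (%d, %d)\n", step % NP, pt[0], pt[1]);
        int nxt[N];
        if (!succ(pt, nxt)) break;
        for (int i = 0; i < N; i++) pt[i] = nxt[i];
    }
    return 0;
}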

5 Experimental Validation

CRONUS/1 was coded in C, except for the Automatic Code Generator, which is written in Perl. The parallel code produced by CRONUS/1 uses point-to-point, synchronous send and receive MPI calls when required. The experiments were conducted on a cluster of 16 identical 500MHz Pentium III nodes; each node has 256MB of RAM and a 10GB hard drive, and runs Linux. We used MPI (MPICH) to run the experiments over the FastEthernet interconnection network.

5.1 Case Study: the FSBM ME Algorithm

Block motion estimation in video coding standards such as MPEG-1, 2, 4 and H.261 is perhaps one of the most computation-intensive multimedia operations, and hence also one of the most frequently implemented algorithms. The block matching algorithm is an essential element in video compression, used to remove the temporal redundancy among adjacent frames. The motion-compensated frame is reconstructed from motion-estimated blocks of pixels. Every pixel in each block is assumed to be displaced by the same 2D displacement, called the motion vector, which is obtained with the block ME algorithm. The Full-Search Block-Matching Motion Estimation algorithm (FSBM ME) [21] is a block matching method in which every pixel in the search area is tested in order to find the best matching block. Therefore, this algorithm offers the best match, at an extremely high computational cost. Assuming a current video frame is divided into N_h × N_v blocks in the horizontal and vertical directions, respectively, with each block containing N × N pixels, the most popular similarity criterion is the mean absolute distortion (MAD), defined as

MAD(m, n) = (1 / N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |x(i, j) − y(i + m, j + n)|   (3)

where x(i, j) and y(i + m, j + n) are pixels of the current and previous frames, respectively. The motion vector (MV) corresponding to the minimum MAD within the search area is given by

MV = arg min{MAD(m, n)},  −p ≤ m, n ≤ p,   (4)

where p is the search range parameter. The algorithm focuses on the situation where the search area is a region in the reference frame consisting of (2p + 1)² pixels. In FSBM, the MAD differences between the current block and all (2p + 1)² candidate blocks are computed, and the displacement that yields the minimum MAD among these (2p + 1)² positions is chosen as the motion vector of the present block. For the entire video frame, this highly regular FSBM can be described as a six-level nested loop algorithm, as shown in Fig. 2. As can be seen from the figure, the general loop nest is designated by the two outer loops, whereas the four inner loops represent the loop body. This algorithm does not have any loop-carried dependencies involving the two outer loop indices, i.e., the iterations of the FSBM loop body are completely independent of each other. This makes the FSBM ME algorithm a special case study for CRONUS/1, precisely because it has no loop-carried dependencies. We have also tested the algorithm with artificial dependencies involving the two outer loops, and the results are comparable with the ones presented in Section 5.2.
We did not present these results here due to space limitations. However, CRONUS/1 performs very well for FSBM as it is (as can be seen later in this section), thus preserving the advantage of the perfect load balancing provided by SDS.

5.2 Testing with FSBM

In this section we present the performance of CRONUS/1 for our case study, the FSBM algorithm. The parallel execution time is the sum of the communication time, the busy time (SDS overhead plus loop body computation) and the idle time (the time spent by a processor waiting for data to become available, or for points to become eligible for execution). The obtained speedup (defined as the sequential execution time over the parallel execution time) is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve that speedup, i.e., efficiency = speedup / #processors; the optimal efficiency is considered 100%. We experimented on the FSBM algorithm with different frame sizes and search ranges, and produced the serial and parallel code. Our results prove to be very close to the ideal speedup. The results in Fig. 6 were produced for a frame of 1024 × 768 pixels and a search range of 10; the results in Fig. 7 are for a 1280 × 1024 frame and a search range of 15; finally, Fig. 8 shows the performance of our tool for frames of 1600 × 1024 pixels and a search range of 15. The block size was the same in all three experiments. For all FSBM tests, CRONUS/1 used the available 16-node cluster. This case study does not exhibit loop-carried dependencies (with respect to the two outer loops); we nevertheless chose to present the FSBM algorithm because of its high practical importance in video coding and its extremely high computational cost (due to its complex loop body). A serial sketch of this loop nest is given below.
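As an illustration (our sketch, not the CRONUS/1-generated code), the six-level FSBM loop nest of Fig. 2 can be written as follows; the frame dimensions W and H, the block size B and the search range P are illustrative placeholders, and the sum of absolute differences (SAD) is used for the comparison, since it ranks candidates identically to the MAD of equation (3) (MAD = SAD / N²):

/* fsbm_sketch.c: illustrative serial form of the six-level FSBM loop nest
 * of Fig. 2. The two outer loops over blocks form the GL index space; the
 * four inner loops are the loop body.                                     */
#include <limits.h>
#include <stdlib.h>

#define W 64   /* frame width  (illustrative) */
#define H 64   /* frame height (illustrative) */
#define B 8    /* block size                  */
#define P 4    /* search range                */

static unsigned char cur[H][W], ref[H][W];          /* current/reference  */
static int mv_y[H / B][W / B], mv_x[H / B][W / B];  /* motion vectors     */

void fsbm(void)
{
    for (int bv = 0; bv < H / B; bv++)          /* loops 1-2: the blocks,  */
        for (int bh = 0; bh < W / B; bh++) {    /* i.e., the GL index space */
            long best = LONG_MAX;
            for (int m = -P; m <= P; m++)       /* loops 3-4: the (2p+1)^2 */
                for (int n = -P; n <= P; n++) { /* candidate displacements */
                    /* skip candidates outside the reference frame */
                    if (bv * B + m < 0 || bv * B + m + B > H ||
                        bh * B + n < 0 || bh * B + n + B > W)
                        continue;
                    long sad = 0;               /* loops 5-6: distortion   */
                    for (int i = 0; i < B; i++)
                        for (int j = 0; j < B; j++)
                            sad += labs((long)cur[bv * B + i][bh * B + j]
                                        - ref[bv * B + i + m][bh * B + j + n]);
                    if (sad < best) {           /* keep the best candidate */
                        best = sad;
                        mv_y[bv][bh] = m;
                        mv_x[bv][bh] = n;
                    }
                }
        }
}

int main(void) { fsbm(); return 0; }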

7 Figure. Speedup comparison for 1024 x 7 frame size In this section we present the performance of CRONUS/1 for a suite of randomly generated GLs with uniform loop carried dependencies. The parallel execution time is taken as above (the sum of communication time, busy time: SDS overhead + loop body computation, and idle time: the time spend by a processor on waiting for data to become available, or points to become eligible for execution). The obtained speedup is defined as above, and is reported in comparison with the ideal speedup for different number of processors. The efficiency (measured in percents) is defined as the speedup over the number of processors used to achieve Speedup # of processors, the respective speedup, i.e. efficiency = and the optimal efficiency is considered 100%. The prerequisite for parallelizing to obtain most favorable speedups is that the problem size must be sufficiently large. The two sets of randomly generated GLs have index space sizes of and 00 00, respectively, 2- uniform dependence vectors, with sizes of 1-. The results for these two sets are given in Fig. 9 and 10. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code. Figure 7. Speedup comparison for 120 x 1024 frame size Figure 9. Speedup comparison for random examples with 00 x 00 index space size Figure. Speedup comparison for 100 x 1024 frame size 5.3 Experiments on random examples Figure 10. Speedup comparison for random examples with 00 x 00 index space size Conclusion In this paper we propose a dynamic scheduling policy and present a tool that automatically generates parallel code for GLs based on this theory. Our philosophy is that simplicity and efficiency are the key factors for minimizing the

runtime of the parallel program. In our approach the compilation time is kept to a minimum, because the index space is not traversed at compile time and the issue of what iteration to compute next is solved at runtime. We have decided to make this trade-off because the successor concept is very simple and efficient, i.e., it does not incur a significant penalty, especially for heavy loop bodies. We strongly believe that an efficient parallelizing tool should strive to minimize the sum of compilation and run time relative to the cost of the original sequential algorithm. Further work will focus on porting the tool onto different interconnection networks, such as SCI and Myrinet, both intended to reduce the network latency of the currently used FastEthernet, and on reducing the communication costs.

7 Appendix

7.1 Defining the successor on hyperplanes

Consider the hyperplane Π_k : a_1 x_1 + ⋯ + a_n x_n = k, where a_i, x_i, k ∈ N. min_{k,n} and max_{k,n} can be defined recursively as follows:

min_{k,1} = max_{k,1} = (k / a_1), if a_1 divides k; undefined, otherwise.

for (i = ⌊k / a_{n+1}⌋; i >= 0; i--) {
    x_{n+1} = i;
    if (min_{(k − a_{n+1}·i), n} is defined)
        return (min_{(k − a_{n+1}·i), n}, x_{n+1});
}

Figure 11. Computing min_{k,n+1} recursively

Now let j = (j_1, ..., j_n) be an index point of hyperplane Π_k. Succ(j) is computed as follows:

for (l = n − 1; l >= 1; l--) {
    p = a_{l+1}·j_{l+1} + ⋯ + a_n·j_n;
    for (i = 1; i <= ⌊p / a_l⌋; i++) {
        q = p − a_l·i;
        if (min_{q, n−l} is defined)
            return (j_1, ..., j_{l−1}, j_l + i, min_{q, n−l});
    }
}

Figure 12. Computing the successor

7.2 Implementation and availability

The tool described here is available by request from the authors. More information and other related papers can be found at cflorina/research/ongoing research.html.

References

[1] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, February 1974.
[2] A. Darte, L. Khachiyan, and Y. Robert. Linear scheduling is nearly optimal. Parallel Processing Letters, 1(2):73–81, 1991.
[3] G. Papakonstantinou, T. Andronikos, and I. Drositis. On the parallelization of UET/UET-UCT loops. NPSC Journal on Computing.
[4] D. I. Moldovan and J. Fortes. Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1):1–11, 1986.
[5] W. Shang and J.A.B. Fortes. Time optimal linear schedules for algorithms with uniform dependencies. IEEE Transactions on Computers, 40(6):723–742, 1991.
[6] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 319–329, January 1988.
[7] Jingling Xue. On tiling as a loop transformation. Parallel Processing Letters, 7(4):409–424, 1997.
[8] G. Goumas, M. Athanasaki, and N. Koziris. Automatic code generation for executing tiled nested loops onto parallel architectures. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 876–881. ACM Press, 2002.
[9] G. Goumas, A. Sotiropoulos, and N. Koziris. Minimizing completion time for loop tiling with computation and communication overlapping. In 15th International Parallel and Distributed Processing Symposium, California, April 2001. IEEE Press.
[10] J. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10:384–393, 1975.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability: a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[12] W. Shang and J.A.B. Fortes. Independent partitioning of algorithms with uniform dependencies. IEEE Transactions on Computers, 41(2):190–206, 1992.
[13] A.D. Kshemkalyani and M. Singhal. Communication patterns in distributed computations. Journal of Parallel and Distributed Computing, 62, 2002.
[14] P. Feautrier. Automatic distribution of data and computations. Technical Report 2000/3, March 2000.

[15] D.W. Engels, J. Feldman, D.R. Karger, and M. Ruhl. Parallel processor scheduling with delay constraints. In 12th Annual ACM-SIAM Symposium on Discrete Algorithms, New York, NY, USA, 2001. ACM Press.
[16] T. Yang and A. Gerasoulis. PYRROS: Static task scheduling and code generation for message passing multiprocessors. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, 1992.
[17] Pierre-Yves Calland, Jack Dongarra, and Yves Robert. Tiling on systems with communication/computation overlap. Concurrency: Practice and Experience, 11(3):139–153, 1999.
[18] Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien. Static tiling for heterogeneous computing platforms. Parallel Computing, 25(5):547–568, 1999.
[19] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469–483, 1996.
[20] I. Drositis, T. Andronikos, M. Kalathas, G. Papakonstantinou, and N. Koziris. Optimal loop parallelization in n-dimensional index spaces.
[21] H. Yeo and Yu Hen Hu. A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 5(5):407–416, October 1995.


More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Crossing Families. Abstract

Crossing Families. Abstract Crossing Families Boris Aronov 1, Paul Erdős 2, Wayne Goddard 3, Daniel J. Kleitman 3, Michael Klugerman 3, János Pach 2,4, Leonard J. Schulman 3 Abstract Given a set of points in the plane, a crossing

More information

Parallel Algorithm Design. Parallel Algorithm Design p. 1

Parallel Algorithm Design. Parallel Algorithm Design p. 1 Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering T. Ropinski, F. Steinicke, K. Hinrichs Institut für Informatik, Westfälische Wilhelms-Universität Münster

More information

Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems

Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems Structural Advantages for Ant Colony Optimisation Inherent in Permutation Scheduling Problems James Montgomery No Institute Given Abstract. When using a constructive search algorithm, solutions to scheduling

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Parallel Job Scheduling

Parallel Job Scheduling Parallel Job Scheduling Lectured by: Nguyễn Đức Thái Prepared by: Thoại Nam -1- Scheduling on UMA Multiprocessors Schedule: allocation of tasks to processors Dynamic scheduling A single queue of ready

More information

Flash Drive Emulation

Flash Drive Emulation Flash Drive Emulation Eric Aderhold & Blayne Field aderhold@cs.wisc.edu & bfield@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison Abstract Flash drives are becoming increasingly

More information

On Covering a Graph Optimally with Induced Subgraphs

On Covering a Graph Optimally with Induced Subgraphs On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number

More information

Optimized energy aware scheduling to minimize makespan in distributed systems.

Optimized energy aware scheduling to minimize makespan in distributed systems. Biomedical Research 2017; 28 (7): 2877-2883 ISSN 0970-938X www.biomedres.info Optimized aware scheduling to minimize makespan in distributed systems. Rajkumar K 1*, Swaminathan P 2 1 Department of Computer

More information

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems Abstract Reconfigurable hardware can be used to build a multitasking system where tasks are assigned to HW resources at run-time

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation

Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation th International Conference on Advanced Computing and Communications Toward Optimal Pixel Decimation Patterns for Block Matching in Motion Estimation Avishek Saha Department of Computer Science and Engineering,

More information

Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle

Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle Parallel Combinatorial Search on Computer Cluster: Sam Loyd s Puzzle Plamenka Borovska Abstract: The paper investigates the efficiency of parallel branch-and-bound search on multicomputer cluster for the

More information

Decoupled Software Pipelining in LLVM

Decoupled Software Pipelining in LLVM Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Objective. A Finite State Machine Approach to Cluster Identification Using the Hoshen-Kopelman Algorithm. Hoshen-Kopelman Algorithm

Objective. A Finite State Machine Approach to Cluster Identification Using the Hoshen-Kopelman Algorithm. Hoshen-Kopelman Algorithm Objective A Finite State Machine Approach to Cluster Identification Using the Cluster Identification Want to find and identify homogeneous patches in a D matrix, where: Cluster membership defined by adjacency

More information

Geometric Algorithms in Three Dimensions Tutorial. FSP Seminar, Strobl,

Geometric Algorithms in Three Dimensions Tutorial. FSP Seminar, Strobl, Geometric Algorithms in Three Dimensions Tutorial FSP Seminar, Strobl, 22.06.2006 Why Algorithms in Three and Higher Dimensions Which algorithms (convex hulls, triangulations etc.) can be generalized to

More information

Automatic Parallelization of Sequential C Code

Automatic Parallelization of Sequential C Code Automatic Parallelization of Sequential C Code Pete Gasper Department of Mathematics and Computer Science South Dakota School of Mines and Technology peter.gasper@gold.sdsmt.edu Caleb Herbst Department

More information

On the Minimum k-connectivity Repair in Wireless Sensor Networks

On the Minimum k-connectivity Repair in Wireless Sensor Networks On the Minimum k-connectivity epair in Wireless Sensor Networks Hisham M. Almasaeid and Ahmed E. Kamal Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 Email:{hisham,kamal}@iastate.edu

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

arxiv: v2 [cs.cc] 29 Mar 2010

arxiv: v2 [cs.cc] 29 Mar 2010 On a variant of Monotone NAE-3SAT and the Triangle-Free Cut problem. arxiv:1003.3704v2 [cs.cc] 29 Mar 2010 Peiyush Jain, Microsoft Corporation. June 28, 2018 Abstract In this paper we define a restricted

More information

Fast and Simple Algorithms for Weighted Perfect Matching

Fast and Simple Algorithms for Weighted Perfect Matching Fast and Simple Algorithms for Weighted Perfect Matching Mirjam Wattenhofer, Roger Wattenhofer {mirjam.wattenhofer,wattenhofer}@inf.ethz.ch, Department of Computer Science, ETH Zurich, Switzerland Abstract

More information