Software pipelining of nested loops

J. Ramanujam
Dept. of Electrical and Computer Engineering
Louisiana State University, Baton Rouge, LA
May 1994

Abstract

This paper presents an approach to software pipelining of nested loops. While several papers have addressed software pipelining of inner loops, little work has been done in the area of extending it to nested loops. This paper solves the problem of finding the minimum iteration initiation interval (in the absence of resource constraints) for each level of a nested loop. The problem is formulated as one of finding a rational quasi-affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This allows us to treat iteration-dependent statement reordering and multidimensional loop unrolling in the same framework. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops, in the absence of resource constraints.

Keywords: Instruction level parallelism, fine-grain scheduling, nested loops, software pipelining, optimal scheduling.

Note: This is an expanded version of the paper titled "Optimal Software Pipelining of Nested Loops," which appeared in Proc. 8th International Parallel Processing Symposium (April 1994), pp. 335-342. Supported in part by an NSF Young Investigator Award CCR- , an NSF grant CCR- , and by the Louisiana Board of Regents through contract LEQSF ( )-RD-A-09.

1 Introduction

Exploiting parallelism in loops in scientific programs is an important factor in realizing the potential performance of highly parallel computers today. Programming these machines remains a difficult problem. Much progress has been made, resulting in a suite of techniques that extract coarse-grain parallelism from sequential programs [4, 28, 40, 42]. With the advent of VLIW, superscalar, horizontal microengines, and multiple RISC and pipelined processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for parallelizing compilers [7, 13, 20, 33, 37, 38]. The problem will become even more important as these processors form the building blocks of massively parallel machines.

Software pipelining [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41] has been proposed as an effective fine-grain scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time thus depends on the iteration initiation interval. While software pipelining of inner loops has received a lot of attention, little work has been done on applying it to nested loops. This paper presents an approach to software pipelining of nested loops by presenting a technique to find the minimum iteration initiation interval (in the absence of resource constraints). We formulate the problem as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. In contrast to most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling.
Thus, the schedules derived allow instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Due to space constraints, we do not discuss code generation issues in this paper; the reader is referred to [31] for details.

Section 2 provides the background and Section 3 discusses related work, placing this research in the context of extant work in the field. In Section 4, we formulate the problem of optimal fine-grain scheduling of nested loops (in the absence of resource constraints) as a linear programming problem, the solution to which gives a rational affine schedule for each statement in the body of a nested loop as a function of the loop indices. We show that the solution corresponds to the minimum initiation interval for each level of a nested loop. Section 5 provides examples. In Section 6, we discuss work in progress aimed at integrating resource constraints and handling conditionals. Section 7 concludes with a discussion.

2 Background

2.1 Data dependence

Good and thorough parallelization of a program critically depends on how precisely a compiler can discover the data dependence information [4, 5, 39, 40, 42]. These dependences imply precedence constraints among computations which have to be satisfied for a correct execution. In this paper,

we consider perfectly nested loops of the form:

for I_1 = L_1 to U_1 do
  ...
  for I_n = L_n to U_n do
    S_1(I)
    ...
    S_r(I)
  endfor
  ...
endfor

where L_j and U_j are integer-valued affine expressions involving I_1, ..., I_{j-1}, and I = (I_1, ..., I_n). Each I_j (j = 1, ..., n) is a loop index; S_1, ..., S_r are assignment statements of the form X_o = E(X_1, ..., X_K), where X_o is defined (i.e., written) and the expression E is evaluated using some variables X_1, X_2, ..., X_K. We assume that the increment of each loop is +1. Each computation is denoted by an index vector I = (I_1, ..., I_n). A loop instance is the loop iteration where the indices take on a particular value, I = i = (i_1, i_2, ..., i_n). The instance of statement S_k executed in iteration vector I is denoted S_k(I). The iteration set is a collection of iteration vectors and constitutes the iteration space. With the assumption on loop bound linearity, the sets of computations considered are finite convex polyhedra of some iteration space in Z^n, where n is the number of loops in the nest, which is also the dimensionality of the iteration space. The iteration set of a given nested loop is described as a set of integer points (or vectors) whose convex hull I is a non-degenerate (or full-dimensional) convex polyhedron. The loop iterations are executed in lexicographic order during sequential execution. A vector x = (x_1, x_2, ..., x_n) is a positive vector if its first (leading, read from left to right) non-zero component is positive [5]. We say that i = (i_1, ..., i_n) precedes j = (j_1, ..., j_n), written i < j, if j - i is a positive vector. Positive vectors capture the lexicographic ordering among iterations of a nested loop. A loop nest where the loop limits are constants is said to have a rectangular iteration space associated with it.
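The positive-vector test above is easy to state operationally. A small Python sketch (the helper names are ours, not the paper's):

```python
def is_positive(v):
    """A vector is positive if its first non-zero component
    (read left to right) is positive; the zero vector is not positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def precedes(i, j):
    """Iteration i lexicographically precedes j iff j - i is positive."""
    return is_positive([b - a for a, b in zip(i, j)])

print(precedes((1, 5), (2, 0)))   # True: (2,0) - (1,5) = (1,-5) is positive
print(precedes((1, 2), (1, 2)))   # False: the zero vector is not positive
```

This matches the sequential execution order of the nest: an iteration runs before exactly those iterations it precedes.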
Let X and Y be two p-dimensional arrays, and let f_i and g_i (i = 1, ..., p) be two sets of integer functions such that X(f_1(I), ..., f_p(I)) is a "defined" (i.e., written) variable and Y(g_1(I), ..., g_p(I)) is a "used" (i.e., read) variable. Let F(I) denote (f_1(I), ..., f_p(I)) and let G(I) denote (g_1(I), ..., g_p(I)). Given two statements S_k(I_1) and S_l(I_2), S_l(I_2) is dependent on S_k(I_1) (with distance vector d_{k,l}) iff [5, 28, 39, 40]: (i) (I_1 < I_2) or (I_1 = I_2 and k < l); (ii) f_i(I_1) = g_i(I_2) for i = 1, ..., p; and (iii) either X(F(I_1)) is written in statement S_k(I_1) or X(G(I_2)) is written in statement S_l(I_2). A flow dependence exists from statement S_k to statement S_l if S_k writes a value that is subsequently, in sequential execution, read by S_l. An anti-dependence exists from S_k to S_l if S_k reads a value

that is subsequently modified by S_l. An output dependence exists between S_k and S_l if S_k writes a value which is subsequently written by S_l. If I_1 = I_2, the dependence is called a loop-independent dependence; otherwise, it is called a loop-carried dependence. Many dependences that occur in practice have a constant distance in each dimension of the iteration space. In such cases, the vector d = I_2 - I_1 is called the distance vector. We limit our discussion to distance vectors in this paper.

2.2 Statement Level Dependence Graph (SLDG)

Dependence relations are often represented in Statement Level Dependence Graphs (SLDGs). For a perfectly n-nested loop with index set (i_1, i_2, ..., i_n) whose body contains statements S_1, ..., S_r, the SLDG has r nodes, one for each statement. For each dependence from statement S_k to S_l with a distance vector d_{k,l}, the graph has a directed edge from node S_k to S_l labeled with the distance vector d_{k,l}. A dependence from a node to itself is called a self-dependence. In addition to the three types of dependence mentioned above, there is another type of dependence known as control dependence. A control dependence exists between a statement with a conditional jump and another statement if the conditional jump statement controls the execution of the other statement. Control dependences can be handled by methods similar to those for data dependences [4]. In our analysis, we treat the different types of dependences uniformly. Methods to calculate data dependence vectors can be found in [4, 5, 39, 40, 42].

3 Related Work

Software pipelining of inner loops has been considered by several authors [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41]. All of these studies search for the minimum initiation interval by unrolling the loop several times. This is inadequate in situations where the minimum iteration initiation interval is non-integral.
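For references whose subscripts are the loop indices plus constant offsets, distance vectors and the SLDG edge list can be computed directly. The following Python sketch is our illustration (flow dependences only, with a hypothetical two-statement nest; the function names are not from the paper):

```python
def distance_vector(write_off, read_off):
    """Distance vector between a write X[I + w] and a read X[I + r]:
    the reading iteration lags the writing one by w - r, componentwise."""
    return tuple(w - r for w, r in zip(write_off, read_off))

def build_sldg(writes, reads):
    """writes: {stmt: (array, offset)}; reads: {stmt: [(array, offset), ...]}.
    Emits a flow-dependence edge k -> l whenever l reads what k writes and
    the distance vector is positive (loop-carried) or zero with k before l."""
    edges = []
    for k, (arr_w, off_w) in writes.items():
        for l, refs in reads.items():
            for arr_r, off_r in refs:
                if arr_w != arr_r:
                    continue
                d = distance_vector(off_w, off_r)
                first = next((x for x in d if x != 0), 0)
                if first > 0 or (first == 0 and k < l):
                    edges.append((k, l, d))
    return edges

# Hypothetical nest: S1: A[i,j] = B[i-1,j] ...  S2: B[i,j] = A[i,j-2] ...
writes = {1: ('A', (0, 0)), 2: ('B', (0, 0))}
reads = {1: [('B', (-1, 0))], 2: [('A', (0, -2))]}
print(build_sldg(writes, reads))  # edges S1->S2 with d=(0,2) and S2->S1 with d=(1,0)
```

Anti- and output dependences would be collected the same way from read/write and write/write pairs; the paper treats all of them uniformly as labeled edges.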
Moreover, these approaches use an ad hoc method to decide on the degree of loop unrolling, and are unacceptable in cases where the optimal solution can be found only after a very large amount of unrolling. Aiken and Nicolau [2] describe a procedure which yields an optimal schedule for inner sequential loops. The procedure works by simulating the execution of the loop body until a pattern evolves. The technique does not guarantee an upper bound on the amount of time it needs to find a solution. Zaky and Sadayappan [41] present a novel approach based on the eigenvalues of matrices that arise in a path algebra. Their algorithm has polynomial time complexity and exploits the connectivity properties of the loop dependence graph. While the algorithm of [2] requires unrolling to detect a pattern, the algorithm in [41] does not require any unrolling. In Section 5.1, we show that the technique developed in this paper for nested loops is equally applicable to inner loops and derives the same solution as [41] using a simpler framework. Iwano and Yeh [22] use network flow algorithms for optimal loop parallelization. Software pipelining of sequential loops on limited resources is discussed in [1, 11, 15, 25].

Trace scheduling [7, 12, 13] is a technique used in VLIW machines that extracts parallelism from sequential loops by unrolling them several times. The code for one iteration of the unrolled loop is then compacted using the acyclic dependence graph corresponding to only the code body of the unrolled loop. Weiss and Smith [38] discuss an adaptation of this technique for pipelined supercomputers.

While software pipelining of inner loops has received a lot of attention, very few authors have addressed software pipelining of nested loops. Cytron [9, 10] presents a technique for doacross loops that minimizes the delays between initiating successive iterations of a sequential loop with no reordering of statements in its body. Cytron [9, 10] does not explicitly attempt to exploit fine-grain parallelism. Munshi and Simons [27] study the problem of minimizing the iteration initiation interval when statement reordering is considered. They show that a close variant of the problem is NP-complete. Both these papers separate the issues of iteration initiation and the scheduling of operations within an iteration. In general, such a separation does not result in the minimum iteration initiation interval. Nicolau [29] suggests loop quantization as a technique for multidimensional loop unrolling in conjunction with tree-height reduction and percolation scheduling. He does not consider the problem of determining the optimal initiation interval for each loop. Loop quantization as described in [1, 29] deals with the problem at the iteration level rather than at the statement level. Recently, Gao et al. [17] presented a technique that works for rectangular loops but requires all components of all distance vectors to be positive. While unimodular transformations could be used to convert all distance vectors to have non-negative entries, the transformed iteration spaces are no longer rectangular; this limits the applicability of the results in [17].
The technique presented in this paper does not have the non-negativity restriction and hence is more general.

4 Statement level rational affine schedules

In this section, we formulate the problem of optimal fine-grain scheduling of nested loops in the absence of resource constraints as a Linear Programming (LP) problem [19, 34], which admits polynomial-time solutions and is extremely fast in practice. This paper generalizes the hyperplane technique for scheduling iterations of nested loops pioneered by Lamport [26] by deriving optimal schedules at the statement level rather than at the iteration level. The solutions derived give the minimum iteration initiation interval for each level of an n-nested loop.

Let G denote the statement level dependence graph. If G is acyclic, then list scheduling and tree-height reduction can be used to optimally schedule the computation [1]. If G is cyclic, we use Tarjan's algorithm [36, 40, 42] to find all the strongly connected components and schedule each strongly connected component separately. For the rest of the paper, we discuss the optimal scheduling of a single strongly connected component in G. In Section 5.1, we discuss the interleaving of the schedules of strongly connected components through Example 3.

Given a number x, the floor of x, written floor(x), is the largest integer that is less than or equal to x. Let floor(q_k(I)) denote the time at which statement S_k (k = 1, ..., r) in iteration I = (i_1, ..., i_n) (denoted S_k(I)) is scheduled, i.e., the time at which its execution starts. Let t_k be the time taken

to execute statement S_k. We assume that t_k >= 1 and is an integer. q_k(I) is a rational affine function, i.e., it is written as

q_k(I) = h_{k,1} i_1 + h_{k,2} i_2 + ... + h_{k,n} i_n + sigma_k.

Let h_k = (h_{k,1}, h_{k,2}, ..., h_{k,n}) for each k; the elements of the vector h_k and sigma_k are rational. Note that we could also use the ceiling of q_k(I) as the time at which S_k(I) is scheduled; we choose to use the floor function throughout this paper. We use a single h vector for each strongly connected component for the rest of the paper, i.e.,

q_k(I) = h . I + sigma_k,   k = 1, ..., r.

The schedule should satisfy all the dependences in the loop. A schedule is a tuple <h, sigma> where h = (h_1, ..., h_n) is an n-vector and sigma = (sigma_1, ..., sigma_r) is an r-vector. A schedule <h, sigma> is legal if for each edge from statement S_k to S_l with a distance vector d_{k,l} in G,

q_l(I) >= q_k(I - d_{k,l}) + t_k.

This states that statement S_l in iteration I can be scheduled only after statement S_k in iteration (I - d_{k,l}) has completed execution. Since S_k(I - d_{k,l}) starts execution at q_k(I - d_{k,l}), S_l(I) can start at the earliest at time q_k(I - d_{k,l}) + t_k. Thus,

h . I + sigma_l >= h . (I - d_{k,l}) + sigma_k + t_k,   i.e.,   h . d_{k,l} >= sigma_k - sigma_l + t_k

for all dependences in G. If d_{k,k} is a self-dependence on S_k, this condition translates to

h . d_{k,k} >= t_k.

For an n-nested loop with a schedule <h, sigma>, the execution time E is given by the expression

E = max over I, J in I and k, l in [1, r] of { q_k(I) - q_l(J) }.

The optimal execution time is the minimum value of the expression E:

E <= max over I, J in I of { h . (I - J) } + max over k in [1, r] of (sigma_k) - min over k in [1, r] of (sigma_k).

We assume that the number of iterations at each level of the loop nest is large; hence, we ignore the contribution from the term max_k(sigma_k) - min_k(sigma_k). The expression max over I, J of { h . (I - J) } can be approximated by max over I of h . I minus min over I of h . I. With the assumption that loop bounds are affine functions of outer loop variables, the iteration space is a convex polyhedron.
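The legality condition h . d_{k,l} >= sigma_k - sigma_l + t_k can be checked mechanically against the SLDG edge list. A minimal Python sketch (the helper names and the two-statement example are ours, not from the paper; exact rational arithmetic via fractions avoids floating-point surprises):

```python
from fractions import Fraction as F

def dot(h, d):
    return sum(a * b for a, b in zip(h, d))

def is_legal(h, sigma, t, edges):
    """A schedule <h, sigma> is legal iff h . d >= sigma[k] - sigma[l] + t[k]
    holds for every SLDG edge (k, l, d)."""
    return all(dot(h, d) >= sigma[k] - sigma[l] + t[k] for k, l, d in edges)

# Hypothetical single loop with two statements, each depending on the
# other from the previous iteration (distance 1 both ways), t_k = 1.
edges = [(0, 1, (1,)), (1, 0, (1,))]
t = [1, 1]
print(is_legal((F(1),), [F(0), F(0)], t, edges))     # True: one iteration per time unit
print(is_legal((F(1, 2),), [F(0), F(0)], t, edges))  # False: h = 1/2 initiates too fast
```

The LP developed below simply minimizes the objective over exactly this feasible set.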
The extrema of affine functions over I therefore occur at the corners of the polyhedron [34]. If the iteration space is rectangular, i.e., L_j and U_j (j = 1, ..., n) are integer constants, we can find an expression for the

optimal value of E using Banerjee's inequalities [5], as discussed below.

Definition 1 [5]: Given a number h, its positive part is h+ = max(h, 0) and its negative part is h- = max(-h, 0). Some properties are given below:

1. h+ >= 0 and h- >= 0.
2. h = h+ - h- and |h| = h+ + h- (|h| is the absolute value of h).
3. -h- <= h <= h+.

For rectangular loops, we assume that L_j <= i_j <= U_j for j = 1, ..., n, where L_j and U_j are constants. Using Banerjee's inequalities,

max over I of h . I = sum over j of (h_j+ U_j - h_j- L_j)   and   min over I of h . I = sum over j of (h_j+ L_j - h_j- U_j).

Therefore,

E <= sum over j of (h_j+ U_j - h_j- L_j) - sum over j of (h_j+ L_j - h_j- U_j) = sum over j of (h_j+ + h_j-)(U_j - L_j).

From the properties in Definition 1, this is equal to

E <= sum over j of |h_j| (U_j - L_j).

Thus, we can formulate the problem of finding the optimal schedule for an n-nested loop (with a rectangular iteration space in which the size of each level is the same) with r statements as that of finding a schedule <h, sigma>, i.e., h_1, ..., h_n and sigma_1, ..., sigma_r, that minimizes sum over j of |h_j| (U_j - L_j) subject to the dependence constraints:

Minimize sum over j of |h_j| (U_j - L_j)
subject to h . d_{k,l} >= sigma_k - sigma_l + t_k, k, l in [1, r], for every edge in G.

In many cases, the loop limits, though constants, are not known at compile time. In such cases, we aim at finding optimal schedules independent of the loop limits. We assume rectangular iteration spaces where the size of each loop, U_j - L_j + 1, is the same for all values of j; in such cases,

the optimal value of the expression E is a function of sum over j of |h_j|. Thus, the execution time depends on the loop limits, whereas the schedule <h, sigma> does not depend on L_j and U_j (j = 1, ..., n). If the sizes, i.e., the values of U_j - L_j + 1 (j = 1, ..., n), are different for different loop levels, then the technique presented in this paper is sub-optimal. With unknown loop limits, our problem then is that of finding a schedule <h, sigma> that minimizes sum over j of |h_j| subject to the dependence constraints:

Minimize sum over j of |h_j|
subject to h . d_{k,l} >= sigma_k - sigma_l + t_k, k, l in [1, r], for every edge in G.

The above formulation is not in standard linear programming form for two reasons:

1. lack of non-negativity constraints on <h, sigma>;
2. absolute values of variables in the objective function.

The first problem is handled by writing each variable h_j (j = 1, ..., n) and sigma_i (i = 1, ..., r) as the difference of two variables which are constrained to be non-negative; e.g., replace h_j with h_j^1 - h_j^2 with the constraints h_j^1 >= 0 and h_j^2 >= 0. The second problem is handled by adding a set of variables z_j, j = 1, ..., n; the new objective function is sum over j of z_j. For each variable h_j, we add two constraints, z_j - h_j >= 0 and z_j + h_j >= 0. With these modifications, the problem is now in standard Linear Programming (LP) form:

Minimize sum over j of z_j
subject to
z_j - h_j^1 + h_j^2 >= 0,   j = 1, ..., n
z_j + h_j^1 - h_j^2 >= 0,   j = 1, ..., n
sum over j of d_{k,l}^j (h_j^1 - h_j^2) >= (sigma_k^1 - sigma_k^2) - (sigma_l^1 - sigma_l^2) + t_k,   where k, l in [1, r] and (k, l) in edges(G)

with all variables non-negative. The formulation has 2n + m constraints and 3n + 2r variables, where m is the number of edges in G, for an n-nested loop with r statements. In practice, our implementation obtains solutions very quickly.

4.1 What does the LP solution mean?

The value of |h_j| denotes the iteration initiation interval for the jth loop in the nest. If h_j > 0, then the next loop iteration initiated at level j is numbered higher than the currently executing

iteration at level j. On the other hand, h_j < 0 means that the next iteration initiated has an iteration number less than that of the currently executing iteration, i.e., the loop at level j is unrolled in the reverse direction. If h_j is a fraction, i.e., h_j = a_j / b_j where a_j and b_j are integers and gcd(a_j, b_j) = 1, then in every |a_j| time units, |b_j| iterations at level j are unrolled; the unrolling is in the reverse direction if h_j < 0. If h_j = 0, the minimum initiation interval is zero, i.e., the loop is a parallel loop. Thus 1/|h_j| denotes the initiation or unrolling rate of the jth loop.

5 Examples

In this section, we show the effectiveness of our approach through examples. First, we show an example of a two-level nested loop with four statements for which the optimal initiation rate (with no bound on resources) is determined using the LP formulation described in this paper. Consider the following loop:

Example 1:
for i = 1 to N do
  for j = 1 to N do
    S_1: A[i, j] = B[i - 1, j - 3] + D[i - 1, j + 3]
    S_2: B[i, j] = C[i - 1, j + 4] + X[i, j]
    S_3: C[i, j] = A[i, j - 2] + Y[i, j]
    S_4: D[i, j] = A[i, j - 1] + Z[i, j]
  endfor
endfor

The statement level dependence graph is shown in Figure 1. We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The linear programming problem for this example (written before splitting the variables into non-negative parts) is:

Minimize z_1 + z_2
subject to
z_1 - h_1 >= 0,   z_1 + h_1 >= 0
z_2 - h_2 >= 0,   z_2 + h_2 >= 0
h_2 >= sigma_1 - sigma_4 + 1          (edge S_1 -> S_4, d = (0, 1))
2 h_2 >= sigma_1 - sigma_3 + 1        (edge S_1 -> S_3, d = (0, 2))
h_1 + 3 h_2 >= sigma_2 - sigma_1 + 1  (edge S_2 -> S_1, d = (1, 3))
h_1 - 3 h_2 >= sigma_4 - sigma_1 + 1  (edge S_4 -> S_1, d = (1, -3))
h_1 - 4 h_2 >= sigma_3 - sigma_2 + 1  (edge S_3 -> S_2, d = (1, -4))

[Figure 1: Statement level dependence graph for Example 1. Edges: S_1 -> S_4 labeled (0, 1), S_1 -> S_3 labeled (0, 2), S_2 -> S_1 labeled (1, 3), S_4 -> S_1 labeled (1, -3), S_3 -> S_2 labeled (1, -4).]

The optimal solution to this problem is: h_1 = 8/5, h_2 = -1/5, sigma_1 = 0, sigma_2 = 0, sigma_3 = 7/5, sigma_4 = 6/5. This means that

S_1(i, j) is scheduled at floor(8i/5 - j/5)
S_2(i, j) is scheduled at floor(8i/5 - j/5)
S_3(i, j) is scheduled at floor(8i/5 - j/5 + 7/5)
S_4(i, j) is scheduled at floor(8i/5 - j/5 + 6/5)

In every 8 units of time, 5 new iterations of the outer loop are initiated. In every unit of time, 5 new iterations of the inner loop are initiated in the reverse direction. The optimal execution time is 9N/5. The best execution time that can be derived using only the iteration space distance vectors is 6N, for the schedule 5i + j. The fine-grained solution runs 3.3 times faster than the best solution that can be obtained using the hyperplane technique [26].

5.1 Application to inner loops

The method presented here is equally applicable to inner loops. Consider the following example from page 45 of [18].

Example 2:
for i = 1 to N do

  S_1: A[i] = C[i - 1]
  S_2: B[i + 1] = A[i]
  S_3: C[i] = B[i]
endfor

The statement level dependence graph for Example 2 is shown in Figure 2.

[Figure 2: Statement level dependence graph for Example 2. Edges: S_1 -> S_2 labeled 0, S_2 -> S_3 labeled 1, S_3 -> S_1 labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = 1. The linear programming problem for this example is:

Minimize z
subject to
z - h >= 0,   z + h >= 0
0 >= sigma_1 - sigma_2 + 1   (edge S_1 -> S_2, d = 0)
h >= sigma_2 - sigma_3 + 1   (edge S_2 -> S_3, d = 1)
h >= sigma_3 - sigma_1 + 1   (edge S_3 -> S_1, d = 1)

The optimal solution to this problem is: h = 3/2, sigma_1 = 0, sigma_2 = 1, and sigma_3 = 1/2.

S_1(i) is scheduled at floor(3i/2)
S_2(i) is scheduled at floor(3i/2 + 1)
S_3(i) is scheduled at floor(3i/2 + 1/2)
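Assuming a stock LP solver is acceptable (the paper's own implementation is not described here), the Example 2 formulation can be reproduced with scipy, with z standing for |h|:

```python
# Solve the Example 2 LP numerically (an illustration, not the authors'
# implementation).  Variables: x = [z, h, s1, s2, s3], all free.
from scipy.optimize import linprog

c = [1, 0, 0, 0, 0]                 # minimize z, where z >= h and z >= -h
A_ub = [
    [-1,  1,  0,  0,  0],           # h - z <= 0
    [-1, -1,  0,  0,  0],           # -h - z <= 0
    [ 0,  0,  1, -1,  0],           # s1 - s2 <= -1      (edge S1 -> S2, d = 0)
    [ 0, -1,  0,  1, -1],           # s2 - s3 - h <= -1  (edge S2 -> S3, d = 1)
    [ 0, -1, -1,  0,  1],           # s3 - s1 - h <= -1  (edge S3 -> S1, d = 1)
]
b_ub = [0, 0, -1, -1, -1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 5)
print(res.fun)   # optimal |h| = 1.5
```

The reported optimum z = 3/2 matches the initiation interval derived above; the sigma values returned by the solver may differ from those in the text, since the optimal offsets are not unique.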

In every 3 units of time, 2 new iterations of the loop are initiated. The optimal iteration initiation interval is 3/2. The optimal execution time is 3N/2. The best execution time that can be derived using only the iteration space distance vectors is 3N, for sequential execution (which is the only possibility because of the loop-carried dependence of distance 1). The fine-grained solution runs 2 times faster than the best solution that can be obtained using the hyperplane technique [26].

Earlier we mentioned that we schedule strongly connected components separately. Next, we show an example that illustrates how we can interleave strongly connected components; we use the following example from page 124 of [1]:

Example 3:
for i = 1 to N do
  A: A[i] = f_1(B[i])
  B: B[i] = f_2(A[i], D[i - 1])
  C: C[i] = f_3(A[i], D[i - 1])
  D: D[i] = f_4(B[i], C[i])
endfor

The statement level dependence graph is shown in Figure 3.

[Figure 3: Statement level dependence graph for Example 3. Edges: A -> B labeled 0, A -> C labeled 0, B -> D labeled 0, C -> D labeled 0, D -> B labeled 1, D -> C labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The SLDG in Figure 3 has two strongly connected components, one consisting of just the node A and the other made up of nodes B, C, and D. We use the same value of h for all statements in the loop; this allows for interleaving

of different strongly connected components. The linear programming problem for this example is:

Minimize z
subject to
z - h >= 0,   z + h >= 0
0 >= sigma_1 - sigma_2 + 1   (edge A -> B, d = 0)
0 >= sigma_1 - sigma_3 + 1   (edge A -> C, d = 0)
0 >= sigma_2 - sigma_4 + 1   (edge B -> D, d = 0)
0 >= sigma_3 - sigma_4 + 1   (edge C -> D, d = 0)
h >= sigma_4 - sigma_2 + 1   (edge D -> B, d = 1)
h >= sigma_4 - sigma_3 + 1   (edge D -> C, d = 1)

The optimal solution to this problem is: h = 2, sigma_1 = 0, sigma_2 = 1, sigma_3 = 1, and sigma_4 = 2.

Statement A is scheduled at 2i
Statement B is scheduled at 2i + 1
Statement C is scheduled at 2i + 1
Statement D is scheduled at 2i + 2

This is the optimal solution for this example. In addition, our technique produces the optimal solution for codes such as the ones on pages 131 and 138 of [1], both of which require interleaving of strongly connected components in scheduling. Due to lack of space, we do not present these solutions here; see [31] for details.

6 Work in progress

We briefly discuss here our effort at integrating resource constraints such as the number of processors, functional units, etc. Since scheduling in the presence of resource constraints is NP-complete for inner loops, the problem is NP-complete for nested loops as well. In the area of dataflow machines, Culler [8] has proposed a technique known as loop bounding, which limits the number of iterations that can be active (i.e., started but not yet finished) at any time. This can be generalized to nested loops using loop quantization. Cytron [9, 10] addresses the same problem. Aiken and Nicolau [1, 29] use loop quantization and percolation scheduling. We are working on mitred quantizations [1] that keep all the available processors busy. The problem is related to iteration space tiling of nested loops [21, 30]. We are also working towards integrating conditionals in nested loops.

7 Conclusions

Software pipelining is an effective fine-grain loop scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time of a software-pipelined loop depends on the interval between two successive initiations of iterations. While software pipelining of single loops has been addressed in many papers, little work has been done on software pipelining of nested loops. In this paper, we have presented an approach to software pipelining of nested loops. We formulated the problem of finding the minimum iteration initiation interval for each level of a nested loop as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop; this is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Work is in progress on deriving near-optimal multidimensional loop unrolling in the presence of resource constraints and conditionals.

References

[1] A. Aiken. Compaction based parallelization. Ph.D. thesis, Technical Report, Cornell University.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, June.
[3] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Proc. 3rd Workshop on Languages and Compilers for Parallel Computing, Irvine, CA, August.
[4] R. Allen and K. Kennedy.
Automatic translation of FORTRAN programs to vector form. ACM Trans. Programming Languages and Systems, 9(4):491-542.
[5] U. Banerjee. Dependence analysis for supercomputing. Kluwer Academic Publishers, Boston, MA.
[6] A. Charlesworth. An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family. Computer, 14(3):18-27.
[7] R. Colwell, R. Nix, J. O'Donnell, D. Papworth, and P. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Trans. Comput., C-37(8):967-979, August.
[8] D. Culler and Arvind. Resource requirements for dataflow programs. In Proc. International Symposium on Computer Architecture, May.
[9] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1984.

[10] R. Cytron. Doacross: beyond vectorization for multiprocessors. In Proc. International Conference on Parallel Processing, August.
[11] C. Eisenbeis. Optimization of horizontal microcode generation for loop structures. In Proc. 2nd ACM International Conference on Supercomputing, pp. 453-465, June.
[12] J. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput., C-30(7):478-490, July.
[13] J. Fisher. Very long instruction word architectures and the ELI-512. In Proc. 10th International Symposium on Computer Architecture, pp. 140-150, June.
[14] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proc. 20th Annual Workshop on Microprogramming, December.
[15] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In Proc. ACM International Conference on Supercomputing, June.
[16] G. Gao, Y. Wong, and Q. Ning. A Petri-net model for fine-grain loop scheduling. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Toronto, Canada, June.
[17] G. Gao, Q. Ning, and V. van Dongen. Extending software pipelining for scheduling nested loops. In Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, August.
[18] F. Gasperoni. Compilation techniques for VLIW architectures. Technical Report 435, Department of Computer Science, New York University, March.
[19] S. I. Gass. Linear programming: methods and applications. McGraw-Hill, fourth edition.
[20] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers.
[21] F. Irigoin and R. Triolet. Supernode partitioning. In Proc. 15th Annual ACM Symp. Principles of Programming Languages, San Diego, CA, January 1988.
[22] K. Iwano and Yeh. An efficient algorithm for optimal loop parallelization. Lecture Notes in Computer Science, No. 450, Springer-Verlag, 1990, pp. 201-210.
[23] R. Jones and H.
Allan. Software pipelining: a comparison and improvement. In Proc. 23rd Annual Workshop on Microprogramming and Microarchitecture, Orlando, Florida, November.
[24] M. Lam. A systolic array optimizing compiler. PhD thesis, Carnegie Mellon University, May.
[25] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Atlanta, GA, June.
[26] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83-93, February.

16 Software pipelining of nested loops 16 [27] A. Munshi and B. Simons. Scheduling sequential loops on parallel processors. In Proc. ACM International Conference on Supercomputing, pp. 392{415, June [28] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184{1201, Dec [29] A. Nicolau. Loop quantization: A generalized loop unwinding technique. J. Parallel and Dist. Comput., 5(5):568{586, October [30] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. Journal of Parallel and Distributed Computing, 16(2):108{120, October [31] J. Ramanujam. Optimal multidimensional loop unwinding: A framework for software pipelining of nested loops. Technical Report TR , Dept. of Electrical and Computer Engineering, Louisiana State University, May [32] B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientic computing, In Proc. 14th Annual Workshop on Microprogramming, pp , [33] B. Rau. Cydra 5 directed dataow architecture. In Compcon 88, pp. 106{113. IEEE Computer Society, [34] A. Schrijver. Theory of linear and integer programming. Wiley-Interscience series in Discrete Mathematics and Optimization, John Wiley and Sons, New York, [35] B. Su, S. Ding and J. Xia. GURPR { A method for global software pipelining. In Proc. 20th Annual Workshop on Microprogramming, pp. 88{96, December [36] R. Tarjan. Depth-rst search and linear graph algorithms. SIAM J. Comput., 1(2):146{160, June [37] R. Touzeau. A FORTRAN compiler for the FPS-164 scientic computer. In Proc. SIGPLAN Symposium on Compiler Construction, pp , June [38] S. Weiss and J. Smith. A study of scalar compilation techniques for pipelined supercomputers. In Proc. 2nd Intl. Conf. Architectural Support for Programming Languages & Operating System, pp. 105{109, October [39] M. Wolfe and U. Banerjee. 
Data dependence and its application to parallel processing. International Journal of Parallel Programming, 16(2):137{178, [40] M. Wolfe. Optimizing supercompilers for supercomputers, MIT Press, [41] A. Zaky and P. Sadayappan. Optimal static scheduling of sequential loops on multiprocessors. In Proc. International Conference on Parallel Processing, volume 3, pp , [42] H. Zima and B. Chapman. Supercompilers for parallel and vector supercomputers. ACM Press Frontier Series, 1990.


More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Eulerian disjoint paths problem in grid graphs is NP-complete

Eulerian disjoint paths problem in grid graphs is NP-complete Discrete Applied Mathematics 143 (2004) 336 341 Notes Eulerian disjoint paths problem in grid graphs is NP-complete Daniel Marx www.elsevier.com/locate/dam Department of Computer Science and Information

More information

Allowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs

Allowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs To appear in: Int. Conf. on Parallel and Distributed Systems, ICPADS'96, June 3-6, 1996, Tokyo Allowing Cycle-Stealing Direct Memory Access I/O Concurrent with Hard-Real-Time Programs Tai-Yi Huang, Jane

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

An Overview to. Polyhedral Model. Fangzhou Jiao

An Overview to. Polyhedral Model. Fangzhou Jiao An Overview to Polyhedral Model Fangzhou Jiao Polyhedral Model A framework for performing loop transformation Loop representation: using polytopes to achieve fine-grain representation of program Loop transformation:

More information

Preemptive Scheduling of Equal-Length Jobs in Polynomial Time

Preemptive Scheduling of Equal-Length Jobs in Polynomial Time Preemptive Scheduling of Equal-Length Jobs in Polynomial Time George B. Mertzios and Walter Unger Abstract. We study the preemptive scheduling problem of a set of n jobs with release times and equal processing

More information

REDUCTION IN RUN TIME USING TRAP ANALYSIS

REDUCTION IN RUN TIME USING TRAP ANALYSIS REDUCTION IN RUN TIME USING TRAP ANALYSIS 1 Prof. K.V.N.Sunitha 2 Dr V. Vijay Kumar 1 Professor & Head, CSE Dept, G.Narayanamma Inst.of Tech. & Science, Shaikpet, Hyderabad, India. 2 Dr V. Vijay Kumar

More information

An Approach for Integrating Basic Retiming and Software Pipelining

An Approach for Integrating Basic Retiming and Software Pipelining An Approach for Integrating Basic Retiming and Software Pipelining Noureddine Chabini Department of Electrical and Computer Engineering Royal Military College of Canada PB 7000 Station Forces Kingston

More information

Optimal Partitioning of Sequences. Abstract. The problem of partitioning a sequence of n real numbers into p intervals

Optimal Partitioning of Sequences. Abstract. The problem of partitioning a sequence of n real numbers into p intervals Optimal Partitioning of Sequences Fredrik Manne and Tor S revik y Abstract The problem of partitioning a sequence of n real numbers into p intervals is considered. The goal is to nd a partition such that

More information

2 The MiníMax Principle First consider a simple problem. This problem will address the tradeos involved in a two-objective optimiation problem, where

2 The MiníMax Principle First consider a simple problem. This problem will address the tradeos involved in a two-objective optimiation problem, where Determining the Optimal Weights in Multiple Objective Function Optimiation Michael A. Gennert Alan L. Yuille Department of Computer Science Division of Applied Sciences Worcester Polytechnic Institute

More information

Dependence Analysis. Hwansoo Han

Dependence Analysis. Hwansoo Han Dependence Analysis Hwansoo Han Dependence analysis Dependence Control dependence Data dependence Dependence graph Usage The way to represent dependences Dependence types, latencies Instruction scheduling

More information

Compiling for Advanced Architectures

Compiling for Advanced Architectures Compiling for Advanced Architectures In this lecture, we will concentrate on compilation issues for compiling scientific codes Typically, scientific codes Use arrays as their main data structures Have

More information

Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages , August Timed automata have been proposed in [1, 8] to model nite-s

Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages , August Timed automata have been proposed in [1, 8] to model nite-s Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages 1243 1250, August 1992 1 Compiling Timed Algebras into Timed Automata Sergio Yovine VERIMAG Centre Equation, 2 Ave de Vignate, 38610 Gieres,

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

2.1. A motivating example for constraint SQL

2.1. A motivating example for constraint SQL MLPQ: A Linear Constraint Database System with Aggregate Operators Peter Z Revesz and Yiming Li University of Nebraska{Lincoln Dept of Computer Science and Engineering Lincoln, NE 68588, USA frevesz,ylig@cseunledu

More information

Interprocedural Dependence Analysis and Parallelization

Interprocedural Dependence Analysis and Parallelization RETROSPECTIVE: Interprocedural Dependence Analysis and Parallelization Michael G Burke IBM T.J. Watson Research Labs P.O. Box 704 Yorktown Heights, NY 10598 USA mgburke@us.ibm.com Ron K. Cytron Department

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Zhou B. B., Brent R. P. and Tridgell A. y Computer Sciences Laboratory The Australian National University Canberra,

More information

Embedding Formulations, Complexity and Representability for Unions of Convex Sets

Embedding Formulations, Complexity and Representability for Unions of Convex Sets , Complexity and Representability for Unions of Convex Sets Juan Pablo Vielma Massachusetts Institute of Technology CMO-BIRS Workshop: Modern Techniques in Discrete Optimization: Mathematics, Algorithms

More information