Software pipelining of nested loops

J. Ramanujam
Dept. of Electrical and Computer Engineering
Louisiana State University, Baton Rouge, LA
May 1994

Abstract

This paper presents an approach to software pipelining of nested loops. While several papers have addressed software pipelining of inner loops, little work has been done in the area of extending it to nested loops. This paper solves the problem of finding the minimum iteration initiation interval (in the absence of resource constraints) for each level of a nested loop. The problem is formulated as one of finding a rational quasi-affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This allows us to treat iteration-dependent statement reordering and multidimensional loop unrolling in the same framework. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops, in the absence of resource constraints.

Keywords: Instruction level parallelism, fine-grain scheduling, nested loops, software pipelining, optimal scheduling.

Note: This is an expanded version of the paper titled "Optimal Software Pipelining of Nested Loops," which appeared in Proc. 8th International Parallel Processing Symposium (April 1994), pp. 335-342. Supported in part by an NSF Young Investigator Award CCR- , an NSF grant CCR- , and by the Louisiana Board of Regents through contract LEQSF ( )-RD-A-09.

1 Introduction

Exploiting parallelism in loops in scientific programs is an important factor in realizing the potential performance of highly parallel computers today. Programming these machines remains a difficult problem. Much progress has been made, resulting in a suite of techniques that extract coarse-grain parallelism from sequential programs [4, 28, 40, 42]. With the advent of VLIW, superscalar, horizontal microengines, and multiple RISC and pipelined processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for parallelizing compilers [7, 13, 20, 33, 37, 38]. The problem will become even more important as these processors form the building blocks of massively parallel machines.

Software pipelining [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41] has been proposed as an effective fine-grain scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time thus depends on the iteration initiation interval. While software pipelining of inner loops has received a lot of attention, little work has been done on applying it to nested loops. This paper presents an approach to software pipelining of nested loops by presenting a technique to find the minimum iteration initiation interval (in the absence of resource constraints). We formulate the problem as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. In contrast to most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling.
Thus, the schedules derived allow instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Due to space constraints, we do not discuss code generation issues in this paper; the reader is referred to [31] for details.

Section 2 provides the background and Section 3 discusses related work, placing this research in the context of extant work in the field. In Section 4, we formulate the problem of optimal fine-grain scheduling of nested loops (in the absence of resource constraints) as a linear programming problem, the solution to which gives a rational affine schedule for each statement in the body of a nested loop as a function of the loop indices. We show that the solution corresponds to the minimum initiation interval for each level of a nested loop. Section 5 provides examples. In Section 6, we discuss work in progress aimed at integrating resource constraints and handling conditionals. Section 7 concludes with a discussion.

2 Background

2.1 Data dependence

Good and thorough parallelization of a program critically depends on how precisely a compiler can discover the data dependence information [4, 5, 39, 40, 42]. These dependences imply precedence constraints among computations which have to be satisfied for a correct execution. In this paper,

we consider perfectly nested loops of the form:

for I_1 = L_1 to U_1 do
  ...
  for I_n = L_n to U_n do
    S_1(I)
    ...
    S_r(I)
  endfor
  ...
endfor

where L_j and U_j are integer-valued affine expressions involving I_1, ..., I_{j-1}, and I = (I_1, ..., I_n). Each I_j (j = 1, ..., n) is a loop index; S_1, ..., S_r are assignment statements of the form X_o = E(X_1, ..., X_K), where X_o is defined (i.e., written) and the expression E is evaluated using some variables X_1, X_2, ..., X_K. We assume that the increment of each loop is +1. Each computation is denoted by an index vector I = (I_1, ..., I_n). A loop instance is the loop iteration where the indices take on a particular value, I = i = (i_1, i_2, ..., i_n). The instance of statement S_k executed in iteration vector I is denoted S_k(I). The iteration set is a collection of iteration vectors and constitutes the iteration space. With the assumption on loop bound linearity, the sets of computations considered are finite convex polyhedra of some iteration space in Z^n, where n is the number of loops in the nest, which is also the dimensionality of the iteration space. The iteration set of a given nested loop is described as a set of integer points (or vectors) whose convex hull I is a non-degenerate (or full-dimensional) convex polyhedron. The loop iterations are executed in lexicographic order during sequential execution. A vector x = (x_1, x_2, ..., x_n) is a positive vector if its first (leading, read from left to right) non-zero component is positive [5]. We say that i = (i_1, ..., i_n) precedes j = (j_1, ..., j_n), written i < j, if j - i is a positive vector. Positive vectors capture the lexicographic ordering among iterations of a nested loop. A loop nest where the loop limits are constants is said to have a rectangular iteration space associated with it.
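The positive-vector test above is easy to state operationally. A small Python sketch (the helper names are ours, not the paper's):

```python
def is_positive(v):
    """A vector is positive if its first non-zero component
    (read left to right) is positive; the zero vector is not positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def precedes(i, j):
    """Iteration i lexicographically precedes j iff j - i is positive."""
    return is_positive([b - a for a, b in zip(i, j)])

print(precedes((1, 5), (2, 0)))   # True: (2,0) - (1,5) = (1,-5) is positive
print(precedes((1, 2), (1, 2)))   # False: the zero vector is not positive
```

This matches the sequential execution order of the nest: an iteration runs before exactly those iterations it precedes.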
Let X and Y be two p-dimensional arrays, and let f_i and g_i (i = 1, ..., p) be two sets of integer functions such that X(f_1(I), ..., f_p(I)) is a "defined" (i.e., written) variable and Y(g_1(I), ..., g_p(I)) is a "used" (i.e., read) variable. Let F(I) denote (f_1(I), ..., f_p(I)) and let G(I) denote (g_1(I), ..., g_p(I)). Given two statements S_k(I_1) and S_l(I_2), S_l(I_2) is dependent on S_k(I_1) (with distance vector d_{k,l}) iff [5, 28, 39, 40]: (i) (I_1 < I_2) or (I_1 = I_2 and k < l); (ii) f_i(I_1) = g_i(I_2) for i = 1, ..., p; and (iii) either X(F(I_1)) is written in statement S_k(I_1) or X(G(I_2)) is written in statement S_l(I_2). A flow dependence exists from statement S_k to statement S_l if S_k writes a value that is subsequently, in sequential execution, read by S_l. An anti-dependence exists from S_k to S_l if S_k reads a value

that is subsequently modified by S_l. An output dependence exists between S_k and S_l if S_k writes a value which is subsequently written by S_l. If I_1 = I_2, the dependence is called a loop-independent dependence; otherwise, it is called a loop-carried dependence. Many dependences that occur in practice have a constant distance in each dimension of the iteration space. In such cases, the vector d = I_2 - I_1 is called the distance vector. We limit our discussion to distance vectors in this paper.

2.2 Statement Level Dependence Graph (SLDG)

Dependence relations are often represented in Statement Level Dependence Graphs (SLDGs). For a perfectly n-nested loop with index set (i_1, i_2, ..., i_n) whose body contains statements S_1, ..., S_r, the SLDG has r nodes, one for each statement. For each dependence from statement S_k to S_l with a distance vector d_{k,l}, the graph has a directed edge from node S_k to S_l labeled with the distance vector d_{k,l}. A dependence from a node to itself is called a self-dependence. In addition to the three types of dependence mentioned above, there is another type of dependence known as control dependence. A control dependence exists between a statement with a conditional jump and another statement if the conditional jump statement controls the execution of the other statement. Control dependences can be handled by methods similar to those for data dependences [4]. In our analysis, we treat the different types of dependences uniformly. Methods to calculate data dependence vectors can be found in [4, 5, 39, 40, 42].

3 Related Work

Software pipelining of inner loops has been considered by several authors [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41]. All of these studies search for the minimum initiation interval by unrolling the loop several times. This is inadequate in situations where the minimum iteration initiation interval is non-integral.
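For references whose subscripts are the loop indices plus constant offsets, distance vectors and the SLDG edge list can be computed directly. The following Python sketch is our illustration (flow dependences only, with a hypothetical two-statement nest; the function names are not from the paper):

```python
def distance_vector(write_off, read_off):
    """Distance vector between a write X[I + w] and a read X[I + r]:
    the reading iteration lags the writing one by w - r, componentwise."""
    return tuple(w - r for w, r in zip(write_off, read_off))

def build_sldg(writes, reads):
    """writes: {stmt: (array, offset)}; reads: {stmt: [(array, offset), ...]}.
    Emits a flow-dependence edge k -> l whenever l reads what k writes and
    the distance vector is positive (loop-carried) or zero with k before l."""
    edges = []
    for k, (arr_w, off_w) in writes.items():
        for l, refs in reads.items():
            for arr_r, off_r in refs:
                if arr_w != arr_r:
                    continue
                d = distance_vector(off_w, off_r)
                first = next((x for x in d if x != 0), 0)
                if first > 0 or (first == 0 and k < l):
                    edges.append((k, l, d))
    return edges

# Hypothetical nest: S1: A[i,j] = B[i-1,j] ...  S2: B[i,j] = A[i,j-2] ...
writes = {1: ('A', (0, 0)), 2: ('B', (0, 0))}
reads = {1: [('B', (-1, 0))], 2: [('A', (0, -2))]}
print(build_sldg(writes, reads))  # edges S1->S2 with d=(0,2) and S2->S1 with d=(1,0)
```

Anti- and output dependences would be collected the same way from read/write and write/write pairs; the paper treats all of them uniformly as labeled edges.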
Moreover, these approaches use an ad hoc method to decide on the degree of loop unrolling, and are unacceptable in cases where the optimal solution can be found only after a very large amount of unrolling. Aiken and Nicolau [2] describe a procedure which yields an optimal schedule for inner sequential loops. The procedure works by simulating the execution of the loop body until a pattern evolves. The technique does not guarantee an upper bound on the amount of time it needs to find a solution. Zaky and Sadayappan [41] present a novel approach based on the eigenvalues of matrices that arise in a path algebra. Their algorithm has polynomial time complexity and exploits the connectivity properties of the loop dependence graph. While the algorithm of [2] requires unrolling to detect a pattern, the algorithm in [41] does not require any unrolling. In Section 5.1, we show that the technique developed in this paper for nested loops is equally applicable to inner loops and derives the same solution as [41] using a simpler framework. Iwano and Yeh [22] use network flow algorithms for optimal loop parallelization. Software pipelining of sequential loops on limited resources is discussed in [1, 11, 15, 25].

Trace scheduling [7, 12, 13] is a technique used in VLIW machines that extracts parallelism from sequential loops by unrolling them several times. The code for one iteration of the unrolled loop is then compacted using the acyclic dependence graph corresponding to only the code body of the unrolled loop. Weiss and Smith [38] discuss an adaptation of this technique for pipelined supercomputers.

While software pipelining of inner loops has received a lot of attention, very few authors have addressed software pipelining of nested loops. Cytron [9, 10] presents a technique for doacross loops that minimizes the delays between initiating successive iterations of a sequential loop with no reordering of statements in its body. Cytron [9, 10] does not explicitly attempt to exploit fine-grain parallelism. Munshi and Simons [27] study the problem of minimizing the iteration initiation interval when statement reordering is considered. They show that a close variant of the problem is NP-complete. Both these papers separate the issues of iteration initiation and the scheduling of operations within an iteration. In general, such a separation does not result in the minimum iteration initiation interval. Nicolau [29] suggests loop quantization as a technique for multidimensional loop unrolling in conjunction with tree-height reduction and percolation scheduling. He does not consider the problem of determining the optimal initiation interval for each loop. Loop quantization as described in [1, 29] deals with the problem at the iteration level rather than at the statement level. Recently, Gao et al. [17] presented a technique that works for rectangular loops but requires all components of all distance vectors to be positive. While unimodular transformations could be used to convert all distance vectors to have non-negative entries, the transformed iteration spaces are no longer rectangular; this limits the applicability of the results in [17].
The technique presented in this paper does not have the non-negativity restriction and hence is more general.

4 Statement level rational affine schedules

In this section, we formulate the problem of optimal fine-grain scheduling of nested loops in the absence of resource constraints as a Linear Programming (LP) problem [19, 34], which admits polynomial-time solutions and is extremely fast in practice. This paper generalizes the hyperplane technique for scheduling iterations of nested loops pioneered by Lamport [26] by deriving optimal schedules at the statement level rather than at the iteration level. The solutions derived give the minimum iteration initiation interval for each level of an n-nested loop.

Let G denote the statement level dependence graph. If G is acyclic, then list scheduling and tree-height reduction can be used to optimally schedule the computation [1]. If G is cyclic, we use Tarjan's algorithm [36, 40, 42] to find all the strongly connected components and schedule each strongly connected component separately. For the rest of the paper, we discuss the optimal scheduling of a single strongly connected component in G. In Section 5.1, we discuss the interleaving of the schedules of strongly connected components through Example 3.

Given a number x, the floor of x, written floor(x), is the largest integer that is less than or equal to x. Let floor(q_k(I)) denote the time at which statement S_k (k = 1, ..., r) in iteration I = (i_1, ..., i_n) (denoted S_k(I)) is scheduled, i.e., the time at which its execution starts. Let t_k be the time taken

to execute statement S_k. We assume that t_k >= 1 and is an integer. q_k(I) is a rational affine function, i.e., it is written as

q_k(I) = h_{k,1} i_1 + h_{k,2} i_2 + ... + h_{k,n} i_n + sigma_k.

Let h_k = (h_{k,1}, h_{k,2}, ..., h_{k,n}) for each k; the elements of the vector h_k and sigma_k are rational. Note that we could also use the ceiling of q_k(I) as the time at which S_k(I) is scheduled; we choose to use the floor function throughout this paper. We use a single h vector for each strongly connected component for the rest of the paper, i.e.,

q_k(I) = h . I + sigma_k,   k = 1, ..., r.

The schedule should satisfy all the dependences in the loop. A schedule is a tuple <h, sigma> where h = (h_1, ..., h_n) is an n-vector and sigma = (sigma_1, ..., sigma_r) is an r-vector. A schedule <h, sigma> is legal if for each edge from statement S_k to S_l with a distance vector d_{k,l} in G,

q_l(I) >= q_k(I - d_{k,l}) + t_k.

This states that statement S_l in iteration I can be scheduled only after statement S_k in iteration (I - d_{k,l}) has completed execution. Since S_k(I - d_{k,l}) starts execution at q_k(I - d_{k,l}), S_l(I) can start at the earliest at time q_k(I - d_{k,l}) + t_k. Thus,

h . I + sigma_l >= h . (I - d_{k,l}) + sigma_k + t_k,   i.e.,   h . d_{k,l} >= sigma_k - sigma_l + t_k

for all dependences in G. If d_{k,k} is a self-dependence on S_k, this condition translates to

h . d_{k,k} >= t_k.

For an n-nested loop with a schedule <h, sigma>, the execution time E is given by the expression

E = max over I, J in I and k, l in [1, r] of { q_k(I) - q_l(J) }.

The optimal execution time is the minimum value of the expression E:

E <= max over I, J in I of { h . (I - J) } + max over k in [1, r] of (sigma_k) - min over k in [1, r] of (sigma_k).

We assume that the number of iterations at each level of the loop nest is large; hence, we ignore the contribution from the term max_k(sigma_k) - min_k(sigma_k). The expression max over I, J of { h . (I - J) } can be approximated by max over I of h . I minus min over I of h . I. With the assumption that loop bounds are affine functions of outer loop variables, the iteration space is a convex polyhedron.
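The legality condition h . d_{k,l} >= sigma_k - sigma_l + t_k can be checked mechanically against the SLDG edge list. A minimal Python sketch (the helper names and the two-statement example are ours, not from the paper; exact rational arithmetic via fractions avoids floating-point surprises):

```python
from fractions import Fraction as F

def dot(h, d):
    return sum(a * b for a, b in zip(h, d))

def is_legal(h, sigma, t, edges):
    """A schedule <h, sigma> is legal iff h . d >= sigma[k] - sigma[l] + t[k]
    holds for every SLDG edge (k, l, d)."""
    return all(dot(h, d) >= sigma[k] - sigma[l] + t[k] for k, l, d in edges)

# Hypothetical single loop with two statements, each depending on the
# other from the previous iteration (distance 1 both ways), t_k = 1.
edges = [(0, 1, (1,)), (1, 0, (1,))]
t = [1, 1]
print(is_legal((F(1),), [F(0), F(0)], t, edges))     # True: one iteration per time unit
print(is_legal((F(1, 2),), [F(0), F(0)], t, edges))  # False: h = 1/2 initiates too fast
```

The LP developed below simply minimizes the objective over exactly this feasible set.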
The extrema of affine functions over I therefore occur at the corners of the polyhedron [34]. If the iteration space is rectangular, i.e., L_j and U_j (j = 1, ..., n) are integer constants, we can find an expression for the

optimal value of E using Banerjee's inequalities [5], as discussed below.

Definition 1 [5]: Given a number h, its positive part is h+ = max(h, 0) and its negative part is h- = max(-h, 0). Some properties are given below:

1. h+ >= 0 and h- >= 0.
2. h = h+ - h- and |h| = h+ + h- (|h| is the absolute value of h).
3. -h- <= h <= h+.

For rectangular loops, we assume that L_j <= i_j <= U_j for j = 1, ..., n, where L_j and U_j are constants. Using Banerjee's inequalities,

max over I of h . I = sum over j of (h_j+ U_j - h_j- L_j)   and   min over I of h . I = sum over j of (h_j+ L_j - h_j- U_j).

Therefore,

E <= sum over j of (h_j+ U_j - h_j- L_j) - sum over j of (h_j+ L_j - h_j- U_j) = sum over j of (h_j+ + h_j-)(U_j - L_j).

From the properties in Definition 1, this is equal to

E <= sum over j of |h_j| (U_j - L_j).

Thus, we can formulate the problem of finding the optimal schedule for an n-nested loop (with a rectangular iteration space in which the size of each level is the same) with r statements as that of finding a schedule <h, sigma>, i.e., h_1, ..., h_n and sigma_1, ..., sigma_r, that minimizes sum over j of |h_j| (U_j - L_j) subject to the dependence constraints:

Minimize sum over j of |h_j| (U_j - L_j)
subject to h . d_{k,l} >= sigma_k - sigma_l + t_k, k, l in [1, r], for every edge in G.

In many cases, the loop limits, though constants, are not known at compile time. In such cases, we aim at finding optimal schedules independent of the loop limits. We assume rectangular iteration spaces where the size of each loop, U_j - L_j + 1, is the same for all values of j; in such cases,

the optimal value of the expression E is a function of sum over j of |h_j|. Thus, the execution time depends on the loop limits, whereas the schedule <h, sigma> does not depend on L_j and U_j (j = 1, ..., n). If the sizes, i.e., the values of U_j - L_j + 1 (j = 1, ..., n), are different for different loop levels, then the technique presented in this paper is sub-optimal. With unknown loop limits, our problem then is that of finding a schedule <h, sigma> that minimizes sum over j of |h_j| subject to the dependence constraints:

Minimize sum over j of |h_j|
subject to h . d_{k,l} >= sigma_k - sigma_l + t_k, k, l in [1, r], for every edge in G.

The above formulation is not in standard linear programming form for two reasons:

1. lack of non-negativity constraints on <h, sigma>;
2. absolute values of variables in the objective function.

The first problem is handled by writing each variable h_j (j = 1, ..., n) and sigma_i (i = 1, ..., r) as the difference of two variables which are constrained to be non-negative; e.g., replace h_j with h_j^1 - h_j^2 with the constraints h_j^1 >= 0 and h_j^2 >= 0. The second problem is handled by adding a set of variables z_j, j = 1, ..., n; the new objective function is sum over j of z_j. For each variable h_j, we add two constraints, z_j - h_j >= 0 and z_j + h_j >= 0. With these modifications, the problem is now in standard Linear Programming (LP) form:

Minimize sum over j of z_j
subject to
z_j - h_j^1 + h_j^2 >= 0,   j = 1, ..., n
z_j + h_j^1 - h_j^2 >= 0,   j = 1, ..., n
sum over j of d_{k,l}^j (h_j^1 - h_j^2) >= (sigma_k^1 - sigma_k^2) - (sigma_l^1 - sigma_l^2) + t_k,   where k, l in [1, r] and (k, l) in edges(G)

with all variables non-negative. The formulation has 2n + m constraints and 3n + 2r variables, where m is the number of edges in G, for an n-nested loop with r statements. In practice, our implementation obtains solutions very quickly.

4.1 What does the LP solution mean?

The value of |h_j| denotes the iteration initiation interval for the jth loop in the nest. If h_j > 0, then the next loop iteration initiated at level j is numbered higher than the currently executing

iteration at level j. On the other hand, h_j < 0 means that the next iteration initiated has an iteration number less than that of the currently executing iteration, i.e., the loop at level j is unrolled in the reverse direction. If h_j is a fraction, i.e., h_j = a_j / b_j where a_j and b_j are integers and gcd(a_j, b_j) = 1, then in every |a_j| time units, |b_j| iterations at level j are unrolled; the unrolling is in the reverse direction if h_j < 0. If h_j = 0, the minimum initiation interval is zero, i.e., the loop is a parallel loop. Thus 1/|h_j| denotes the initiation or unrolling rate of the jth loop.

5 Examples

In this section, we show the effectiveness of our approach through examples. First, we show an example of a two-level nested loop with four statements for which the optimal initiation rate (with no bound on resources) is determined using the LP formulation described in this paper. Consider the following loop:

Example 1:
for i = 1 to N do
  for j = 1 to N do
    S_1: A[i, j] = B[i - 1, j - 3] + D[i - 1, j + 3]
    S_2: B[i, j] = C[i - 1, j + 4] + X[i, j]
    S_3: C[i, j] = A[i, j - 2] + Y[i, j]
    S_4: D[i, j] = A[i, j - 1] + Z[i, j]
  endfor
endfor

The statement level dependence graph is shown in Figure 1. We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The linear programming problem for this example (written before splitting the variables into non-negative parts) is:

Minimize z_1 + z_2
subject to
z_1 - h_1 >= 0,   z_1 + h_1 >= 0
z_2 - h_2 >= 0,   z_2 + h_2 >= 0
h_2 >= sigma_1 - sigma_4 + 1          (edge S_1 -> S_4, d = (0, 1))
2 h_2 >= sigma_1 - sigma_3 + 1        (edge S_1 -> S_3, d = (0, 2))
h_1 + 3 h_2 >= sigma_2 - sigma_1 + 1  (edge S_2 -> S_1, d = (1, 3))
h_1 - 3 h_2 >= sigma_4 - sigma_1 + 1  (edge S_4 -> S_1, d = (1, -3))
h_1 - 4 h_2 >= sigma_3 - sigma_2 + 1  (edge S_3 -> S_2, d = (1, -4))

[Figure 1: Statement level dependence graph for Example 1. Edges: S_1 -> S_4 labeled (0, 1), S_1 -> S_3 labeled (0, 2), S_2 -> S_1 labeled (1, 3), S_4 -> S_1 labeled (1, -3), S_3 -> S_2 labeled (1, -4).]

The optimal solution to this problem is: h_1 = 8/5, h_2 = -1/5, sigma_1 = 0, sigma_2 = 0, sigma_3 = 7/5, sigma_4 = 6/5. This means that

S_1(i, j) is scheduled at floor(8i/5 - j/5)
S_2(i, j) is scheduled at floor(8i/5 - j/5)
S_3(i, j) is scheduled at floor(8i/5 - j/5 + 7/5)
S_4(i, j) is scheduled at floor(8i/5 - j/5 + 6/5)

In every 8 units of time, 5 new iterations of the outer loop are initiated. In every unit of time, 5 new iterations of the inner loop are initiated in the reverse direction. The optimal execution time is 9N/5. The best execution time that can be derived using only the iteration space distance vectors is 6N, for the schedule 5i + j. The fine-grained solution runs 3.3 times faster than the best solution that can be obtained using the hyperplane technique [26].

5.1 Application to inner loops

The method presented here is equally applicable to inner loops. Consider the following example from page 45 of [18].

Example 2:
for i = 1 to N do

  S_1: A[i] = C[i - 1]
  S_2: B[i + 1] = A[i]
  S_3: C[i] = B[i]
endfor

The statement level dependence graph for Example 2 is shown in Figure 2.

[Figure 2: Statement level dependence graph for Example 2. Edges: S_1 -> S_2 labeled 0, S_2 -> S_3 labeled 1, S_3 -> S_1 labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = 1. The linear programming problem for this example is:

Minimize z
subject to
z - h >= 0,   z + h >= 0
0 >= sigma_1 - sigma_2 + 1   (edge S_1 -> S_2, d = 0)
h >= sigma_2 - sigma_3 + 1   (edge S_2 -> S_3, d = 1)
h >= sigma_3 - sigma_1 + 1   (edge S_3 -> S_1, d = 1)

The optimal solution to this problem is: h = 3/2, sigma_1 = 0, sigma_2 = 1, and sigma_3 = 1/2.

S_1(i) is scheduled at floor(3i/2)
S_2(i) is scheduled at floor(3i/2 + 1)
S_3(i) is scheduled at floor(3i/2 + 1/2)
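Assuming a stock LP solver is acceptable (the paper's own implementation is not described here), the Example 2 formulation can be reproduced with scipy, with z standing for |h|:

```python
# Solve the Example 2 LP numerically (an illustration, not the authors'
# implementation).  Variables: x = [z, h, s1, s2, s3], all free.
from scipy.optimize import linprog

c = [1, 0, 0, 0, 0]                 # minimize z, where z >= h and z >= -h
A_ub = [
    [-1,  1,  0,  0,  0],           # h - z <= 0
    [-1, -1,  0,  0,  0],           # -h - z <= 0
    [ 0,  0,  1, -1,  0],           # s1 - s2 <= -1      (edge S1 -> S2, d = 0)
    [ 0, -1,  0,  1, -1],           # s2 - s3 - h <= -1  (edge S2 -> S3, d = 1)
    [ 0, -1, -1,  0,  1],           # s3 - s1 - h <= -1  (edge S3 -> S1, d = 1)
]
b_ub = [0, 0, -1, -1, -1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 5)
print(res.fun)   # optimal |h| = 1.5
```

The reported optimum z = 3/2 matches the initiation interval derived above; the sigma values returned by the solver may differ from those in the text, since the optimal offsets are not unique.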

In every 3 units of time, 2 new iterations of the loop are initiated. The optimal iteration initiation interval is 3/2. The optimal execution time is 3N/2. The best execution time that can be derived using only the iteration space distance vectors is 3N, for sequential execution (which is the only possibility because of the loop-carried dependence of distance 1). The fine-grained solution runs 2 times faster than the best solution that can be obtained using the hyperplane technique [26].

Earlier we mentioned that we schedule strongly connected components separately. Next, we show an example that illustrates how we can interleave strongly connected components; we use the following example from page 124 of [1]:

Example 3:
for i = 1 to N do
  A: A[i] = f_1(B[i])
  B: B[i] = f_2(A[i], D[i - 1])
  C: C[i] = f_3(A[i], D[i - 1])
  D: D[i] = f_4(B[i], C[i])
endfor

The statement level dependence graph is shown in Figure 3.

[Figure 3: Statement level dependence graph for Example 3. Edges: A -> B labeled 0, A -> C labeled 0, B -> D labeled 0, C -> D labeled 0, D -> B labeled 1, D -> C labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The SLDG in Figure 3 has two strongly connected components, one consisting of just the node A and the other made up of nodes B, C, and D. We use the same value of h for all statements in the loop; this allows for interleaving

of different strongly connected components. The linear programming problem for this example is:

Minimize z
subject to
z - h >= 0,   z + h >= 0
0 >= sigma_1 - sigma_2 + 1   (edge A -> B, d = 0)
0 >= sigma_1 - sigma_3 + 1   (edge A -> C, d = 0)
0 >= sigma_2 - sigma_4 + 1   (edge B -> D, d = 0)
0 >= sigma_3 - sigma_4 + 1   (edge C -> D, d = 0)
h >= sigma_4 - sigma_2 + 1   (edge D -> B, d = 1)
h >= sigma_4 - sigma_3 + 1   (edge D -> C, d = 1)

The optimal solution to this problem is: h = 2, sigma_1 = 0, sigma_2 = 1, sigma_3 = 1, and sigma_4 = 2.

Statement A is scheduled at 2i
Statement B is scheduled at 2i + 1
Statement C is scheduled at 2i + 1
Statement D is scheduled at 2i + 2

This is the optimal solution for this example. In addition, our technique produces the optimal solution for codes such as the ones on pages 131 and 138 of [1], both of which require interleaving of strongly connected components in scheduling. Due to lack of space, we do not present these solutions here; see [31] for details.

6 Work in progress

We briefly discuss here our effort at integrating resource constraints such as the number of processors, functional units, etc. Since scheduling in the presence of resource constraints is NP-complete for inner loops, the problem is NP-complete for nested loops as well. In the area of dataflow machines, Culler [8] has proposed a technique known as loop bounding, which limits the number of iterations that can be active (i.e., started but not yet finished) at any time. This can be generalized to nested loops using loop quantization. Cytron [9, 10] addresses the same problem. Aiken and Nicolau [1, 29] use loop quantization and percolation scheduling. We are working on mitred quantizations [1] that keep all the available processors busy. The problem is related to iteration space tiling of nested loops [21, 30]. We are also working towards integrating conditionals in nested loops.

7 Conclusions

Software pipelining is an effective fine-grain loop scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time of a software-pipelined loop depends on the interval between two successive initiations of iterations. While software pipelining of single loops has been addressed in many papers, little work has been done on software pipelining of nested loops. In this paper, we have presented an approach to software pipelining of nested loops. We formulated the problem of finding the minimum iteration initiation interval for each level of a nested loop as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop; this is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Work is in progress on deriving near-optimal multidimensional loop unrolling in the presence of resource constraints and conditionals.

References

[1] A. Aiken. Compaction based parallelization. Ph.D. thesis, Technical Report, Cornell University.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, June.
[3] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Proc. 3rd Workshop on Languages and Compilers for Parallel Computing, Irvine, CA, August.
[4] R. Allen and K. Kennedy.
Automatic translation of FORTRAN programs to vector form. ACM Trans. Programming Languages and Systems, 9(4):491-542.
[5] U. Banerjee. Dependence analysis for supercomputing. Kluwer Academic Publishers, Boston, MA.
[6] A. Charlesworth. An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family. Computer, 14(3):18-27.
[7] R. Colwell, R. Nix, J. O'Donnell, D. Papworth, and P. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Trans. Comput., C-37(8):967-979, August.
[8] D. Culler and Arvind. Resource requirements for dataflow programs. In Proc. International Symposium on Computer Architecture, May.
[9] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1984.

[10] R. Cytron. Doacross: beyond vectorization for multiprocessors. In Proc. International Conference on Parallel Processing, August.
[11] C. Eisenbeis. Optimization of horizontal microcode generation for loop structures. In Proc. 2nd ACM International Conference on Supercomputing, pp. 453-465, June.
[12] J. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput., C-30(7):478-490, July.
[13] J. Fisher. Very long instruction word architectures and the ELI-512. In Proc. 10th International Symposium on Computer Architecture, pp. 140-150, June.
[14] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proc. 20th Annual Workshop on Microprogramming, December.
[15] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In Proc. ACM International Conference on Supercomputing, June.
[16] G. Gao, Y. Wong, and Q. Ning. A Petri-net model for fine-grain loop scheduling. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Toronto, Canada, June.
[17] G. Gao, Q. Ning, and V. van Dongen. Extending software pipelining for scheduling nested loops. In Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, August.
[18] F. Gasperoni. Compilation techniques for VLIW architectures. Technical Report 435, Department of Computer Science, New York University, March.
[19] S. I. Gass. Linear programming: methods and applications. McGraw-Hill, fourth edition.
[20] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers.
[21] F. Irigoin and R. Triolet. Supernode partitioning. In Proc. 15th Annual ACM Symp. Principles of Programming Languages, San Diego, CA, January 1988.
[22] K. Iwano and Yeh. An efficient algorithm for optimal loop parallelization. Lecture Notes in Computer Science, No. 450, Springer-Verlag, 1990, pp. 201-210.
[23] R. Jones and H.
Allan. Software pipelining: a comparison and improvement. In Proc. 23rd Annual Workshop on Microprogramming and Microarchitecture, Orlando, Florida, November.
[24] M. Lam. A systolic array optimizing compiler. PhD thesis, Carnegie Mellon University, May.
[25] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Atlanta, GA, June.
[26] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83-93, February.

16 Software pipelining of nested loops 16 [27] A. Munshi and B. Simons. Scheduling sequential loops on parallel processors. In Proc. ACM International Conference on Supercomputing, pp. 392{415, June [28] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184{1201, Dec [29] A. Nicolau. Loop quantization: A generalized loop unwinding technique. J. Parallel and Dist. Comput., 5(5):568{586, October [30] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. Journal of Parallel and Distributed Computing, 16(2):108{120, October [31] J. Ramanujam. Optimal multidimensional loop unwinding: A framework for software pipelining of nested loops. Technical Report TR , Dept. of Electrical and Computer Engineering, Louisiana State University, May [32] B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientic computing, In Proc. 14th Annual Workshop on Microprogramming, pp , [33] B. Rau. Cydra 5 directed dataow architecture. In Compcon 88, pp. 106{113. IEEE Computer Society, [34] A. Schrijver. Theory of linear and integer programming. Wiley-Interscience series in Discrete Mathematics and Optimization, John Wiley and Sons, New York, [35] B. Su, S. Ding and J. Xia. GURPR { A method for global software pipelining. In Proc. 20th Annual Workshop on Microprogramming, pp. 88{96, December [36] R. Tarjan. Depth-rst search and linear graph algorithms. SIAM J. Comput., 1(2):146{160, June [37] R. Touzeau. A FORTRAN compiler for the FPS-164 scientic computer. In Proc. SIGPLAN Symposium on Compiler Construction, pp , June [38] S. Weiss and J. Smith. A study of scalar compilation techniques for pipelined supercomputers. In Proc. 2nd Intl. Conf. Architectural Support for Programming Languages & Operating System, pp. 105{109, October [39] M. Wolfe and U. Banerjee. 
Data dependence and its application to parallel processing. International Journal of Parallel Programming, 16(2):137{178, [40] M. Wolfe. Optimizing supercompilers for supercomputers, MIT Press, [41] A. Zaky and P. Sadayappan. Optimal static scheduling of sequential loops on multiprocessors. In Proc. International Conference on Parallel Processing, volume 3, pp , [42] H. Zima and B. Chapman. Supercompilers for parallel and vector supercomputers. ACM Press Frontier Series, 1990.


More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Eulerian disjoint paths problem in grid graphs is NP-complete

Eulerian disjoint paths problem in grid graphs is NP-complete Discrete Applied Mathematics 143 (2004) 336 341 Notes Eulerian disjoint paths problem in grid graphs is NP-complete Daniel Marx www.elsevier.com/locate/dam Department of Computer Science and Information

More information

Allowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs

Allowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs To appear in: Int. Conf. on Parallel and Distributed Systems, ICPADS'96, June 3-6, 1996, Tokyo Allowing Cycle-Stealing Direct Memory Access I/O Concurrent with Hard-Real-Time Programs Tai-Yi Huang, Jane

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

An Overview to. Polyhedral Model. Fangzhou Jiao

An Overview to. Polyhedral Model. Fangzhou Jiao An Overview to Polyhedral Model Fangzhou Jiao Polyhedral Model A framework for performing loop transformation Loop representation: using polytopes to achieve fine-grain representation of program Loop transformation:

More information

Preemptive Scheduling of Equal-Length Jobs in Polynomial Time

Preemptive Scheduling of Equal-Length Jobs in Polynomial Time Preemptive Scheduling of Equal-Length Jobs in Polynomial Time George B. Mertzios and Walter Unger Abstract. We study the preemptive scheduling problem of a set of n jobs with release times and equal processing

More information

REDUCTION IN RUN TIME USING TRAP ANALYSIS

REDUCTION IN RUN TIME USING TRAP ANALYSIS REDUCTION IN RUN TIME USING TRAP ANALYSIS 1 Prof. K.V.N.Sunitha 2 Dr V. Vijay Kumar 1 Professor & Head, CSE Dept, G.Narayanamma Inst.of Tech. & Science, Shaikpet, Hyderabad, India. 2 Dr V. Vijay Kumar

More information

An Approach for Integrating Basic Retiming and Software Pipelining

An Approach for Integrating Basic Retiming and Software Pipelining An Approach for Integrating Basic Retiming and Software Pipelining Noureddine Chabini Department of Electrical and Computer Engineering Royal Military College of Canada PB 7000 Station Forces Kingston

More information

Optimal Partitioning of Sequences. Abstract. The problem of partitioning a sequence of n real numbers into p intervals

Optimal Partitioning of Sequences. Abstract. The problem of partitioning a sequence of n real numbers into p intervals Optimal Partitioning of Sequences Fredrik Manne and Tor S revik y Abstract The problem of partitioning a sequence of n real numbers into p intervals is considered. The goal is to nd a partition such that

More information

2 The MiníMax Principle First consider a simple problem. This problem will address the tradeos involved in a two-objective optimiation problem, where

2 The MiníMax Principle First consider a simple problem. This problem will address the tradeos involved in a two-objective optimiation problem, where Determining the Optimal Weights in Multiple Objective Function Optimiation Michael A. Gennert Alan L. Yuille Department of Computer Science Division of Applied Sciences Worcester Polytechnic Institute

More information

Dependence Analysis. Hwansoo Han

Dependence Analysis. Hwansoo Han Dependence Analysis Hwansoo Han Dependence analysis Dependence Control dependence Data dependence Dependence graph Usage The way to represent dependences Dependence types, latencies Instruction scheduling

More information

Compiling for Advanced Architectures

Compiling for Advanced Architectures Compiling for Advanced Architectures In this lecture, we will concentrate on compilation issues for compiling scientific codes Typically, scientific codes Use arrays as their main data structures Have

More information

Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages , August Timed automata have been proposed in [1, 8] to model nite-s

Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages , August Timed automata have been proposed in [1, 8] to model nite-s Proc. XVIII Conf. Latinoamericana de Informatica, PANEL'92, pages 1243 1250, August 1992 1 Compiling Timed Algebras into Timed Automata Sergio Yovine VERIMAG Centre Equation, 2 Ave de Vignate, 38610 Gieres,

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

2.1. A motivating example for constraint SQL

2.1. A motivating example for constraint SQL MLPQ: A Linear Constraint Database System with Aggregate Operators Peter Z Revesz and Yiming Li University of Nebraska{Lincoln Dept of Computer Science and Engineering Lincoln, NE 68588, USA frevesz,ylig@cseunledu

More information

Interprocedural Dependence Analysis and Parallelization

Interprocedural Dependence Analysis and Parallelization RETROSPECTIVE: Interprocedural Dependence Analysis and Parallelization Michael G Burke IBM T.J. Watson Research Labs P.O. Box 704 Yorktown Heights, NY 10598 USA mgburke@us.ibm.com Ron K. Cytron Department

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Zhou B. B., Brent R. P. and Tridgell A. y Computer Sciences Laboratory The Australian National University Canberra,

More information

Embedding Formulations, Complexity and Representability for Unions of Convex Sets

Embedding Formulations, Complexity and Representability for Unions of Convex Sets , Complexity and Representability for Unions of Convex Sets Juan Pablo Vielma Massachusetts Institute of Technology CMO-BIRS Workshop: Modern Techniques in Discrete Optimization: Mathematics, Algorithms

More information