Software pipelining of nested loops

J. Ramanujam
Dept. of Electrical and Computer Engineering
Louisiana State University, Baton Rouge, LA
May 1994

Abstract

This paper presents an approach to software pipelining of nested loops. While several papers have addressed software pipelining of inner loops, little work has been done in the area of extending it to nested loops. This paper solves the problem of finding the minimum iteration initiation interval (in the absence of resource constraints) for each level of a nested loop. The problem is formulated as one of finding a rational quasi-affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This allows us to treat iteration-dependent statement reordering and multidimensional loop unrolling in the same framework. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops, in the absence of resource constraints.

Keywords: Instruction level parallelism, fine-grain scheduling, nested loops, software pipelining, optimal scheduling.

Note: This is an expanded version of the paper titled "Optimal Software Pipelining of Nested Loops," which appeared in Proc. 8th International Parallel Processing Symposium (April 1994), pp. 335-342. Supported in part by an NSF Young Investigator Award CCR- , an NSF grant CCR- , and by the Louisiana Board of Regents through contract LEQSF ( )-RD-A-09.
1 Introduction

Exploiting parallelism in loops in scientific programs is an important factor in realizing the potential performance of highly parallel computers today. Programming these machines remains a difficult problem. Much progress has been made, resulting in a suite of techniques that extract coarse-grain parallelism from sequential programs [4, 28, 40, 42]. With the advent of VLIW, superscalar, horizontal microengines, multiple RISC, and pipelined processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to parallelizing compilers [7, 13, 20, 33, 37, 38]. The problem will become even more important as these processors form the building blocks of massively parallel machines. Software pipelining [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41] has been proposed as an effective fine-grain scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time thus depends on the iteration initiation interval. While software pipelining of inner loops has received a lot of attention, little work has been done in the area of applying it to nested loops. This paper presents an approach to software pipelining of nested loops by presenting a technique to find the minimum iteration initiation interval (in the absence of resource constraints). We formulate the problem as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. In contrast to most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling.
Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Due to space constraints, we do not discuss code generation issues in this paper; the reader is referred to [31] for details. Section 2 provides the background and Section 3 discusses related work, placing this research in the context of extant work in the field. In Section 4, we formulate the problem of optimal fine-grain scheduling of nested loops (in the absence of resource constraints) as a linear programming problem, the solution to which gives a rational affine schedule for each statement in the body of a nested loop as a function of the loop indices. We show that the solution corresponds to the minimum initiation interval for each level of a nested loop. Section 5 provides examples. In Section 6, we discuss work in progress aimed at integrating resource constraints and handling conditionals. Section 7 concludes with a discussion.

2 Background

2.1 Data dependence

Good and thorough parallelization of a program critically depends on how precisely a compiler can discover the data dependence information [4, 5, 39, 40, 42]. These dependences imply precedence constraints among computations which have to be satisfied for a correct execution. In this paper,
we consider perfectly nested loops of the form:

for I_1 = L_1 to U_1 do
  ...
  for I_n = L_n to U_n do
    S_1(I)
    ...
    S_r(I)
  endfor
  ...
endfor

where L_j and U_j are integer-valued affine expressions involving I_1, ..., I_{j-1}, and I = (I_1, ..., I_n). Each I_j (j = 1, ..., n) is a loop index; S_1, ..., S_r are assignment statements of the form X_0 = E(X_1, ..., X_K), where X_0 is defined (i.e., written) in expression E, which is evaluated using some variables X_1, X_2, ..., X_K. We assume that the increment of each loop is +1. Each computation is denoted by an index vector I = (I_1, ..., I_n). A loop instance is the loop iteration where the indices take on a particular value, I = i = (i_1, i_2, ..., i_n). The instance of statement S_k executed in iteration vector I is denoted S_k(I). The iteration set is a collection of iteration vectors and constitutes the iteration space. With the assumption on loop bound linearity, the sets of computations considered are finite convex polyhedra of some iteration space in Z^n, where n is the number of loops in the nest, which is also the dimensionality of the iteration space. The iteration set of a given nested loop is described as a set of integer points (or vectors) whose convex hull I is a non-degenerate (or full-dimensional) convex polyhedron. The loop iterations are executed in lexicographic ordering during sequential execution. A vector x = (x_1, x_2, ..., x_n) is a positive vector if its first (leading, read from left to right) non-zero component is positive [5]. We say that i = (i_1, ..., i_n) precedes j = (j_1, ..., j_n), written i ≺ j, if j − i is a positive vector. Positive vectors capture the lexicographic ordering among iterations of a nested loop. A loop nest where the loop limits are constants is said to have a rectangular iteration space associated with it.
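The precedence relation just defined is easy to operationalize; the following short sketch (ours, not part of the paper) checks lexicographic positivity of a difference vector:

```python
def is_positive(v):
    """A vector is (lexicographically) positive if its first non-zero
    component, read left to right, is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not positive

def precedes(i, j):
    """Iteration i precedes iteration j iff j - i is a positive vector."""
    return is_positive([b - a for a, b in zip(i, j)])

# (1,5) precedes (2,0): the difference (1,-5) has a positive leading entry.
assert precedes((1, 5), (2, 0))
# Within the same outermost index, the inner index decides.
assert precedes((2, 0), (2, 3))
assert not precedes((2, 3), (2, 0))
```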
Let X and Y be two p-dimensional arrays, and let f_i and g_i (i = 1, ..., p) be two sets of integer functions such that X(f_1(I), ..., f_p(I)) is a "defined" (i.e., written) variable and Y(g_1(I), ..., g_p(I)) is a "used" (i.e., read) variable. Let F(I) denote (f_1(I), ..., f_p(I)) and let G(I) denote (g_1(I), ..., g_p(I)). Given two statements S_k(I_1) and S_l(I_2), S_l(I_2) is dependent on S_k(I_1) (with distance vector d_{k,l}) if and only if [5, 28, 39, 40]:
1. I_1 ≺ I_2, or I_1 = I_2 and k < l; and
2. f_i(I_1) = g_i(I_2) for i = 1, ..., p; and
3. either X(F(I_1)) is written in statement S_k(I_1) or X(G(I_2)) is written in statement S_l(I_2).
A flow dependence exists from statement S_k to statement S_l if S_k writes a value that is subsequently, in sequential execution, read by S_l. An anti-dependence exists from S_k to S_l if S_k reads a value
that is subsequently modified by S_l. An output dependence exists between S_k and S_l if S_k writes a value which is subsequently written by S_l. If I_1 = I_2, the dependence is called a loop-independent dependence; otherwise, it is called a loop-carried dependence. Many dependences that occur in practice have a constant distance in each dimension of the iteration space. In such cases, the vector d = I_2 − I_1 is called the distance vector. We limit our discussion to distance vectors in this paper.

2.2 Statement Level Dependence Graph (SLDG)

Dependence relations are often represented in Statement Level Dependence Graphs (SLDGs). For a perfectly n-nested loop with index set (i_1, i_2, ..., i_n) whose body contains statements S_1, ..., S_r, the SLDG has r nodes, one for each statement. For each dependence from statement S_k to S_l with a distance vector d_{k,l}, the graph has a directed edge from node S_k to S_l labeled with the distance vector d_{k,l}. A dependence from a node to itself is called a self-dependence. In addition to the three types of dependence mentioned above, there is another type of dependence known as control dependence. A control dependence exists between a statement with a conditional jump and another statement if the conditional jump statement controls the execution of the other statement. Control dependences can be handled by methods similar to data dependences [4]. In our analysis, we treat the different types of dependences uniformly. Methods to calculate data dependence vectors can be found in [4, 5, 39, 40, 42].

3 Related Work

Software pipelining of inner loops has been considered by several authors [1, 6, 11, 14, 15, 16, 23, 24, 25, 27, 32, 35, 41]. All of these studies search for the minimum initiation interval by unrolling the loop several times. This is inadequate in situations where the minimum iteration initiation interval is non-integral.
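For a single loop, a standard lower bound on the initiation interval is the maximum, over all dependence cycles, of the total statement execution time around the cycle divided by the total dependence distance around the cycle. The following sketch (our illustration, on a hypothetical three-statement cycle rather than one of the paper's examples) shows how this bound can be fractional:

```python
from fractions import Fraction

# Edges of a small dependence cycle: (src, dst, delay, distance).
# Three unit-time statements whose distances sum to 2.
edges = [("S1", "S2", 1, 0), ("S2", "S3", 1, 1), ("S3", "S1", 1, 1)]

def cycle_bound(cycle_edges):
    """Initiation-interval lower bound from one dependence cycle:
    sum of delays around the cycle / sum of distances around the cycle."""
    delay = sum(e[2] for e in cycle_edges)
    dist = sum(e[3] for e in cycle_edges)
    return Fraction(delay, dist)

# The cycle S1 -> S2 -> S3 -> S1 gives a bound of 3/2: no integral
# initiation interval matches it exactly, but unrolling by 2 lets
# two iterations start every 3 time units.
assert cycle_bound(edges) == Fraction(3, 2)
```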
Moreover, these approaches use an ad hoc method to decide on the degree of loop unrolling, and are unacceptable in cases where the optimal solution can be found only after a very large amount of unrolling. Aiken and Nicolau [2] describe a procedure which yields an optimal schedule for inner sequential loops. The procedure works by simulating the execution of the loop body until a pattern evolves. The technique does not guarantee an upper bound on the amount of time it needs to find a solution. Zaky and Sadayappan [41] present a novel approach that is based on eigenvalues of matrices that arise in path algebra. Their algorithm has polynomial time complexity and exploits the connectivity properties of the loop dependence graph. While the algorithm of [2] requires unrolling to detect a pattern, the algorithm in [41] does not require any unrolling. In Section 5.1, we show that the technique developed in this paper for nested loops is equally applicable to inner loops and derives the same solution as [41] using a simpler framework. Iwano and Yeh [22] use network flow algorithms for optimal loop parallelization. Software pipelining of sequential loops on limited resources is discussed in [1, 11, 15, 25].
Trace scheduling [7, 12, 13] is a technique used in VLIW machines that extracts parallelism from sequential loops by unrolling them several times. The code for one iteration of the unrolled loop is then compacted using the acyclic dependence graph corresponding to only the code body of the unrolled loop. Weiss and Smith [38] discuss an adaptation of this technique for pipelined supercomputers. While software pipelining of inner loops has received a lot of attention, very few authors have addressed software pipelining of nested loops. Cytron [9, 10] presents a technique for doacross loops that minimizes the delays between initiating successive iterations of a sequential loop with no reordering of statements in its body. Cytron [9, 10] does not explicitly attempt to exploit fine-grain parallelism. Munshi and Simons [27] study the problem of minimizing the iteration initiation interval while considering statement reordering. They show that a close variant of the problem is NP-complete. Both these papers separate the issues of iteration initiation and the scheduling of operations within an iteration. In general, such a separation does not result in the minimum iteration initiation interval. Nicolau [29] suggests loop quantization as a technique for multidimensional loop unrolling in conjunction with tree-height reduction and percolation scheduling. He does not consider the problem of determining the optimal initiation interval for each loop. Loop quantization as described in [1, 29] deals with the problem at the iteration level rather than at the statement level. Recently, Gao et al. [17] presented a technique that works for rectangular loops but requires all components of all distance vectors to be positive. While unimodular transformations could be used to convert all distance vectors to have non-negative entries, the transformed iteration spaces are no longer rectangular; this limits the applicability of the results in [17].
The technique presented in this paper does not have the restriction on non-negativity and hence is more general.

4 Statement level rational affine schedules

In this section, we formulate the problem of optimal fine-grain scheduling of nested loops in the absence of resource constraints as a Linear Programming (LP) problem [19, 34], which admits polynomial time solutions and is extremely fast in practice. This paper generalizes the hyperplane technique for scheduling iterations of nested loops pioneered by Lamport [26] by deriving optimal schedules at the statement level rather than at the iteration level. The solutions derived give the minimum iteration initiation interval for each level of an n-nested loop. Let G denote the statement level dependence graph. If G is acyclic, then list scheduling and tree-height reduction can be used to optimally schedule the computation [1]. If G is cyclic, we use Tarjan's algorithm [36, 40, 42] to find all the strongly connected components and schedule each strongly connected component separately. For the rest of the paper, we discuss the optimal scheduling of a single strongly connected component in G. In Section 5.1, we discuss the interleaving of the schedules of strongly connected components through Example 3. Given a number x, ⌊x⌋ is the largest integer that is less than or equal to x and is called the floor of x. Let ⌊q_k(I)⌋ denote the time at which statement S_k (k = 1, ..., r) in iteration I = (i_1, ..., i_n) (denoted S_k(I)) is scheduled, which is the time at which execution starts. Let t_k be the time taken
to execute statement S_k. We assume that t_k ≥ 1 and is an integer. q_k(I) is a rational affine function, i.e., it is written as

q_k(I) = h_{k,1} i_1 + h_{k,2} i_2 + ... + h_{k,n} i_n + θ_k.

Let h_k = (h_{k,1}, h_{k,2}, ..., h_{k,n}) for each k; the elements of the vector h_k and θ_k are rational. Note that we could also use ⌈q_k(I)⌉ as the time at which S_k(I) is scheduled. We choose to use the floor function throughout this paper. We use a single h vector for each strongly connected component for the rest of the paper, i.e.,

q_k(I) = h · I + θ_k,  k = 1, ..., r.

The schedule should satisfy all the dependences in the loop. A schedule is a tuple ⟨h, θ⟩ where h = (h_1, ..., h_n) is an n-vector and θ = (θ_1, ..., θ_r) is an r-vector. A schedule ⟨h, θ⟩ is legal if for each edge from statement S_k to S_l with a distance vector d_{k,l} in G,

q_l(I) ≥ q_k(I − d_{k,l}) + t_k.

This states that statement S_l in iteration I can be scheduled only after statement S_k in iteration (I − d_{k,l}) has completed execution. Since S_k(I − d_{k,l}) starts execution at q_k(I − d_{k,l}), S_l(I) can start at the earliest at time q_k(I − d_{k,l}) + t_k. Thus,

h · I + θ_l ≥ h · (I − d_{k,l}) + θ_k + t_k,  i.e.,  h · d_{k,l} ≥ θ_k − θ_l + t_k

for all dependences in G. If d_{k,k} is a self-dependence on S_k, this condition translates to

h · d_{k,k} ≥ t_k.

For an n-nested loop with a schedule ⟨h, θ⟩, the execution time E is given by the expression

E = max_{I,J ∈ I; k,l ∈ [1,r]} { q_k(I) − q_l(J) }.

The optimal execution time is the minimum value of the expression E:

E ≤ max_{I,J ∈ I} { h · (I − J) } + max_{k ∈ [1,r]} (θ_k) − min_{k ∈ [1,r]} (θ_k).

We assume that the number of iterations at each level of the loop nest is large; hence, we ignore the contribution from the term max_{k ∈ [1,r]} (θ_k) − min_{k ∈ [1,r]} (θ_k). The expression max_{I,J ∈ I} { h · (I − J) } can be approximated by max_{I ∈ I} h · I − min_{I ∈ I} h · I. With the assumption that loop bounds are affine functions of outer loop variables, the iteration space is a convex polyhedron.
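As an illustration of the legality condition h · d_{k,l} ≥ θ_k − θ_l + t_k, the following sketch (ours, not part of the paper) checks a candidate schedule against a set of dependence edges. The edges here are read off the loop body of Example 1 in Section 5, and the schedule checked is the optimal one reported there; both are assumptions of this sketch:

```python
from fractions import Fraction as F

def is_legal(h, theta, t, edges):
    """A schedule <h, theta> is legal iff h . d_{k,l} >= theta_k - theta_l + t_k
    holds for every dependence edge (k, l) with distance vector d_{k,l}."""
    for k, l, d in edges:
        lhs = sum(hi * di for hi, di in zip(h, d))
        if lhs < theta[k] - theta[l] + t[k]:
            return False
    return True

# Dependence edges (source, sink, distance vector) derived from the
# loop body of Example 1 (Section 5); all statement times t_k = 1.
edges = [(1, 4, (0, 1)), (4, 1, (1, -3)), (2, 1, (1, 3)),
         (1, 3, (0, 2)), (3, 2, (1, -4))]
t = {k: 1 for k in (1, 2, 3, 4)}

# The paper's reported optimum: h = (8/5, -1/5), theta = (0, 0, 7/5, 6/5).
h = (F(8, 5), F(-1, 5))
theta = {1: F(0), 2: F(0), 3: F(7, 5), 4: F(6, 5)}
assert is_legal(h, theta, t, edges)

# Dropping the statement offsets makes the same h illegal.
assert not is_legal(h, {k: F(0) for k in t}, t, edges)
```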
The extrema of affine functions over I therefore occur at the corners of the polyhedron [34]. If the iteration space is rectangular, i.e., L_j and U_j (j = 1, ..., n) are integer constants, we can find an expression for the
optimal value of E using Banerjee's inequalities [5], as discussed below.

Definition 1 [5]: Given a number h, its positive part is h^+ = max(h, 0) and its negative part is h^- = max(−h, 0). Some properties are given below:
1. h^+ ≥ 0 and h^- ≥ 0.
2. h = h^+ − h^- and |h| = h^+ + h^- (|h| is the absolute value of h).
3. −h^- ≤ h ≤ h^+.

For rectangular loops, we assume that L_j ≤ i_j ≤ U_j for j = 1, ..., n, where L_j and U_j are constants. Using Banerjee's inequalities,

max_{I ∈ I} h · I = Σ_{j=1}^n { h_j^+ U_j − h_j^- L_j }  and  min_{I ∈ I} h · I = Σ_{j=1}^n { h_j^+ L_j − h_j^- U_j }.

Therefore,

E ≤ Σ_{j=1}^n { h_j^+ U_j − h_j^- L_j } − Σ_{j=1}^n { h_j^+ L_j − h_j^- U_j } = Σ_{j=1}^n { h_j^+ + h_j^- } (U_j − L_j).

From the properties in Definition 1, this is equal to

E ≤ Σ_{j=1}^n { |h_j| (U_j − L_j) }.

Thus, we can formulate the problem of finding the optimal schedule for an n-nested loop (with a rectangular iteration space) with r statements as that of finding a schedule ⟨h, θ⟩, i.e., h_1, ..., h_n and θ_1, ..., θ_r, that minimizes Σ_{j=1}^n |h_j| (U_j − L_j) subject to dependence constraints:

Minimize Σ_{j=1}^n |h_j| (U_j − L_j)
subject to h · d_{k,l} ≥ θ_k − θ_l + t_k, k, l ∈ [1, r], for every edge in G.

In many cases, the loop limits, though constants, are not known at compile time. In such cases, we aim at finding optimal schedules independent of the loop limits. We assume rectangular iteration spaces where the size of each loop, U_j − L_j + 1, is the same for all values of j; in such cases,
the optimal value of the expression E is a function of Σ_{j=1}^n |h_j|. Thus, the execution time depends on the loop limits, whereas the schedule ⟨h, θ⟩ does not depend on L_j and U_j (j = 1, ..., n). If the sizes, i.e., the values of U_j − L_j + 1 (j = 1, ..., n), are different for different loop levels, then the technique presented in this paper is sub-optimal. With unknown loop limits, our problem then is that of finding a schedule ⟨h, θ⟩ that minimizes Σ_{j=1}^n |h_j| subject to dependence constraints:

Minimize Σ_{j=1}^n |h_j|
subject to h · d_{k,l} ≥ θ_k − θ_l + t_k, k, l ∈ [1, r], for every edge in G.

The above formulation is not in standard linear programming form for two reasons:
1. lack of non-negativity constraints on ⟨h, θ⟩;
2. absolute values of variables in the objective function.
The first problem is handled by writing each variable h_j (j = 1, ..., n) and θ_i (i = 1, ..., r) as the difference of two variables which are constrained to be non-negative, e.g., replace h_j with h_j^1 − h_j^2 with the constraints h_j^1 ≥ 0 and h_j^2 ≥ 0. The second problem is handled by adding a set of variables z_j, j = 1, ..., n; the new objective function is Σ_{j=1}^n z_j. For each variable h_j, we add two constraints, z_j − h_j ≥ 0 and z_j + h_j ≥ 0. With these modifications, the problem is now in standard Linear Programming (LP) form:

Minimize Σ_{j=1}^n z_j
subject to
  z_j − h_j^1 + h_j^2 ≥ 0,  j = 1, ..., n
  z_j + h_j^1 − h_j^2 ≥ 0,  j = 1, ..., n
  Σ_{j=1}^n d_{k,l,j} (h_j^1 − h_j^2) ≥ (θ_k^1 − θ_k^2) − (θ_l^1 − θ_l^2) + t_k

for every edge (k, l) in G, k, l ∈ [1, r], with all variables non-negative. The formulation has 2n + m constraints with 3n + 2r variables, where m is the number of edges in G, for an n-nested loop with r statements. In practice, our implementation obtains solutions very quickly.

4.1 What does the LP solution mean?

The value of |h_j| denotes the iteration initiation interval for the j-th loop in the nest. If h_j > 0, then the next loop iteration initiated at level j is numbered higher than the currently executing
iteration at level j. On the other hand, h_j < 0 means that the next iteration initiated has an iteration number less than the currently executing iteration, i.e., the loop at level j is unrolled in the reverse direction. If h_j is a fraction, i.e., h_j = a_j / b_j where a_j and b_j are integers and gcd(a_j, b_j) = 1, then in every |a_j| time units, |b_j| iterations at level j are unrolled; the unrolling is in the reverse direction if h_j < 0. If h_j = 0, the minimum initiation interval is zero, i.e., the loop is a parallel loop. Thus 1 / |h_j| denotes the initiation or unrolling rate of the j-th loop.

5 Examples

In this section, we show the effectiveness of our approach through examples. First, we show an example of a two-level nested loop with four statements for which the optimal initiation rate (with no bound on resources) is determined using the LP formulation described in this paper. Consider the following loop:

Example 1:
for i = 1 to N do
  for j = 1 to N do
    S_1: A[i,j] = B[i-1,j-3] + D[i-1,j+3]
    S_2: B[i,j] = C[i-1,j+4] + X[i,j]
    S_3: C[i,j] = A[i,j-2] + Y[i,j]
    S_4: D[i,j] = A[i,j-1] + Z[i,j]
  endfor
endfor

The statement level dependence graph is shown in Figure 1.

[Figure 1: Statement level dependence graph for Example 1. The edges are S_1 → S_4 labeled (0,1); S_4 → S_1 labeled (1,-3); S_2 → S_1 labeled (1,3); S_1 → S_3 labeled (0,2); and S_3 → S_2 labeled (1,-4).]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The linear programming problem for this example, shown here before the conversion to standard form, is:

Minimize |h_1| + |h_2|
subject to
  h_2 ≥ θ_1 − θ_4 + 1          (edge S_1 → S_4)
  h_1 − 3h_2 ≥ θ_4 − θ_1 + 1   (edge S_4 → S_1)
  h_1 + 3h_2 ≥ θ_2 − θ_1 + 1   (edge S_2 → S_1)
  2h_2 ≥ θ_1 − θ_3 + 1         (edge S_1 → S_3)
  h_1 − 4h_2 ≥ θ_3 − θ_2 + 1   (edge S_3 → S_2)

The optimal solution to this problem is: h_1 = 8/5, h_2 = −1/5, θ_1 = 0, θ_2 = 0, θ_3 = 7/5, θ_4 = 6/5. This means that

S_1(i,j) is scheduled at ⌊8i/5 − j/5⌋
S_2(i,j) is scheduled at ⌊8i/5 − j/5⌋
S_3(i,j) is scheduled at ⌊8i/5 − j/5 + 7/5⌋
S_4(i,j) is scheduled at ⌊8i/5 − j/5 + 6/5⌋

In every 8 units of time, 5 new iterations of the outer loop are initiated. In every unit of time, 5 new iterations of the inner loop are initiated in the reverse direction. The optimal execution time is 9N/5. The best execution time that can be derived using only the iteration space distance vectors is 6N, for the schedule 5i + j. The fine-grained solution runs 3.3 times faster than the best solution that can be obtained using the hyperplane technique [26].

5.1 Application to inner loops

The method presented here is equally applicable to inner loops. Consider the following example from page 45 of [18].

Example 2:
for i = 1 to N do
  S_1: A[i] = C[i-1]
  S_2: B[i+1] = A[i]
  S_3: C[i] = B[i]
endfor

The statement level dependence graph for Example 2 is shown in Figure 2.

[Figure 2: Statement level dependence graph for Example 2. The edges are S_1 → S_2 labeled 0, S_2 → S_3 labeled 1, and S_3 → S_1 labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = 1. The linear programming problem for this example is:

Minimize |h|
subject to
  0 ≥ θ_1 − θ_2 + 1   (edge S_1 → S_2)
  h ≥ θ_2 − θ_3 + 1   (edge S_2 → S_3)
  h ≥ θ_3 − θ_1 + 1   (edge S_3 → S_1)

The optimal solution to this problem is: h = 3/2, θ_1 = 0, θ_2 = 1, and θ_3 = 1/2.

S_1(i) is scheduled at ⌊3i/2⌋
S_2(i) is scheduled at ⌊3i/2 + 1⌋
S_3(i) is scheduled at ⌊3i/2 + 1/2⌋
In every 3 units of time, 2 new iterations of the loop are initiated. The optimal iteration initiation interval is 3/2. The optimal execution time is 3N/2. The best execution time that can be derived using only the iteration space distance vectors is 3N, for sequential execution (which is the only possibility because of the loop-carried dependence of distance 1). The fine-grained solution runs 2 times faster than the best solution that can be obtained using the hyperplane technique [26]. Earlier we had mentioned that we schedule strongly connected components separately. Next, we show an example that illustrates how we can interleave strongly connected components; we use the following example from page 124 of [1]:

Example 3:
for i = 1 to N do
  A: A[i] = f_1(B[i])
  B: B[i] = f_2(A[i], D[i-1])
  C: C[i] = f_3(A[i], D[i-1])
  D: D[i] = f_4(B[i], C[i])
endfor

The statement level dependence graph is shown in Figure 3.

[Figure 3: Statement level dependence graph for Example 3. The edges A → B, A → C, B → D, and C → D are labeled 0; the edges D → B and D → C are labeled 1.]

We assume that each statement takes one unit of time to execute, i.e., t_1 = t_2 = t_3 = t_4 = 1. The SLDG in Figure 3 has two strongly connected components, one consisting of just the node A and the other made up of nodes B, C, and D. We use the same value of h for all statements in the loop; this allows for interleaving
of different strongly connected components. The linear programming problem for this example is:

Minimize |h|
subject to
  0 ≥ θ_1 − θ_2 + 1   (edge A → B)
  0 ≥ θ_1 − θ_3 + 1   (edge A → C)
  0 ≥ θ_2 − θ_4 + 1   (edge B → D)
  0 ≥ θ_3 − θ_4 + 1   (edge C → D)
  h ≥ θ_4 − θ_2 + 1   (edge D → B)
  h ≥ θ_4 − θ_3 + 1   (edge D → C)

The optimal solution to this problem is: h = 2, θ_1 = 0, θ_2 = 1, θ_3 = 1, and θ_4 = 2.

Statement A is scheduled at 2i
Statement B is scheduled at 2i + 1
Statement C is scheduled at 2i + 1
Statement D is scheduled at 2i + 2

This is the optimal solution for this example. In addition, our technique produces the optimal solution for codes such as the ones on pages 131 and 138 of [1], both of which require interleaving of strongly connected components in scheduling. Due to lack of space, we do not present these solutions here; see [31] for details.

6 Work in progress

We briefly discuss here our effort at integrating resource constraints such as the number of processors, functional units, etc. Since scheduling in the presence of resource constraints is NP-complete for inner loops, the problem is NP-complete for nested loops as well. In the area of dataflow machines, Culler [8] has proposed a technique known as loop bounding, which limits the number of iterations that can be active (i.e., started but not yet finished) at any time. This can be generalized to nested loops using loop quantization. Cytron [9, 10] addresses the same problem. Aiken and Nicolau [1, 29] use loop quantization and percolation scheduling. We are working on mitred quantizations [1] that keep all the available processors busy. The problem is related to iteration space tiling of nested loops [21, 30]. We are also working towards integrating conditionals in nested loops.
7 Conclusions

Software pipelining is an effective fine-grain loop scheduling technique that restructures the statements in the body of a loop, subject to resource and dependence constraints, such that one iteration of a loop can start execution before another finishes. The total execution time of a software-pipelined loop depends on the interval between two successive initiations of iterations. While software pipelining of single loops has been addressed in many papers, little work has been done in the area of software pipelining of nested loops. In this paper, we have presented an approach to software pipelining of nested loops. We formulated the problem of finding the minimum iteration initiation interval for each level of a nested loop as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop; this is then solved using linear programming. This framework allows for an integrated treatment of iteration-dependent statement reordering and multidimensional loop unrolling. Unlike most work in scheduling nested loops, we treat each statement in the body as a unit of scheduling. Thus, the schedules derived allow for instances of statements from different iterations to be scheduled at the same time. Optimal schedules derived here subsume extant work on software pipelining of inner loops. Work is in progress on deriving near-optimal multidimensional loop unrolling in the presence of resource constraints and conditionals.

References

[1] A. Aiken. Compaction based parallelization. Ph.D. thesis, Technical Report, Cornell University.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, June.
[3] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Proc. 3rd Workshop on Languages and Compilers for Parallel Computing, Irvine, CA, August.
[4] R. Allen and K. Kennedy.
Automatic translation of FORTRAN programs to vector form. ACM Trans. Programming Languages and Systems, 9(4):491-542.
[5] U. Banerjee. Dependence analysis for supercomputing. Kluwer Academic Publishers, Boston, MA.
[6] A. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. Computer, 14(3):18-27.
[7] R. Colwell, R. Nix, J. O'Donnell, D. Papworth, and P. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Trans. Comput., C-37(8):967-979, August.
[8] D. Culler and Arvind. Resource requirements for dataflow programs. In Proc. International Symposium on Computer Architecture, May.
[9] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1984.
[10] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proc. International Conference on Parallel Processing, August.
[11] C. Eisenbeis. Optimization of horizontal microcode generation for loop structures. In Proc. 2nd ACM International Conference on Supercomputing, pp. 453-465, June.
[12] J. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput., C-30(7):478-490, July.
[13] J. Fisher. Very long instruction word architectures and the ELI-512. In Proc. 10th International Symposium on Computer Architecture, pp. 140-150, June.
[14] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proc. 20th Annual Workshop on Microprogramming, December.
[15] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In Proc. ACM International Conference on Supercomputing, June.
[16] G. Gao, Y. Wong, and Q. Ning. A Petri-net model for fine-grain loop scheduling. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Toronto, Canada, June.
[17] G. Gao, Q. Ning, and V. van Dongen. Extending software pipelining for scheduling nested loops. In Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, August.
[18] F. Gasperoni. Compilation techniques for VLIW architectures. Technical Report 435, Department of Computer Science, New York University, March.
[19] Saul I. Gass. Linear programming: Methods and applications. McGraw-Hill Book Company, fourth edition.
[20] J. L. Hennessy and D. A. Patterson. Computer architecture: A quantitative approach. Morgan Kaufmann Publishers.
[21] F. Irigoin and R. Triolet. Supernode partitioning. In Proc. 15th Annual ACM Symp. Principles of Programming Languages, San Diego, CA, Jan. 1988.
[22] K. Iwano and Yeh. An efficient algorithm for optimal loop parallelization. Lecture Notes in Computer Science, No. 450, Springer-Verlag, 1990, pp. 201-210.
[23] R. Jones and H.
Allan. Software pipelining: A comparison and improvement. In Proc. 23rd Annual Workshop on Microprogramming and Microarchitectures, Orlando, Florida, November.
[24] M. Lam. A systolic array optimizing compiler. PhD thesis, Carnegie Mellon University, May.
[25] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, Atlanta, GA, June.
[26] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83-93, Feb.
[27] A. Munshi and B. Simons. Scheduling sequential loops on parallel processors. In Proc. ACM International Conference on Supercomputing, pp. 392-415, June.
[28] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, Dec.
[29] A. Nicolau. Loop quantization: A generalized loop unwinding technique. J. Parallel and Dist. Comput., 5(5):568-586, October.
[30] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. Journal of Parallel and Distributed Computing, 16(2):108-120, October.
[31] J. Ramanujam. Optimal multidimensional loop unwinding: A framework for software pipelining of nested loops. Technical Report, Dept. of Electrical and Computer Engineering, Louisiana State University, May.
[32] B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. 14th Annual Workshop on Microprogramming.
[33] B. Rau. Cydra 5 directed dataflow architecture. In Compcon 88, pp. 106-113. IEEE Computer Society.
[34] A. Schrijver. Theory of linear and integer programming. Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley and Sons, New York.
[35] B. Su, S. Ding, and J. Xia. GURPR: A method for global software pipelining. In Proc. 20th Annual Workshop on Microprogramming, pp. 88-96, December.
[36] R. Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146-160, June.
[37] R. Touzeau. A FORTRAN compiler for the FPS-164 scientific computer. In Proc. SIGPLAN Symposium on Compiler Construction, June.
[38] S. Weiss and J. Smith. A study of scalar compilation techniques for pipelined supercomputers. In Proc. 2nd Intl. Conf. Architectural Support for Programming Languages and Operating Systems, pp. 105-109, October.
[39] M. Wolfe and U. Banerjee.
Data dependence and its application to parallel processing. International Journal of Parallel Programming, 16(2):137{178, [40] M. Wolfe. Optimizing supercompilers for supercomputers, MIT Press, [41] A. Zaky and P. Sadayappan. Optimal static scheduling of sequential loops on multiprocessors. In Proc. International Conference on Parallel Processing, volume 3, pp , [42] H. Zima and B. Chapman. Supercompilers for parallel and vector supercomputers. ACM Press Frontier Series, 1990.
More information