Automatic Translation of FORTRAN Programs to Vector Form


RANDY ALLEN and KEN KENNEDY
Rice University

The recent success of vector computers such as the Cray-1 and array processors such as those manufactured by Floating Point Systems has increased interest in making vector operations available to the FORTRAN programmer. The FORTRAN standards committee is currently considering a successor to FORTRAN 77, usually called FORTRAN 8x, that will permit the programmer to explicitly specify vector and array operations. Although FORTRAN 8x will make it convenient to specify explicit vector operations in new programs, it does little for existing code. In order to benefit from the power of vector hardware, existing programs will need to be rewritten in some language (presumably FORTRAN 8x) that permits the explicit specification of vector operations. One way to avoid a massive manual recoding effort is to provide a translator that discovers the parallelism implicit in a FORTRAN program and automatically rewrites that program in FORTRAN 8x. Such a translation from FORTRAN to FORTRAN 8x is not straightforward because FORTRAN DO loops are not always semantically equivalent to the corresponding FORTRAN 8x parallel operation. The semantic difference between these two constructs is precisely captured by the concept of dependence. A translation from FORTRAN to FORTRAN 8x preserves the semantics of the original program if it preserves the dependences in that program. The theoretical background is developed here for employing data dependence to convert FORTRAN programs to parallel form. Dependence is defined and characterized in terms of the conditions that give rise to it; accurate tests to determine dependence are presented; and transformations that use dependence to uncover additional parallelism are discussed.

Categories and Subject Descriptors: D.1.2 [Programming Techniques]: Automatic Programming; D.1.3 [Programming Techniques]: Concurrent Programming; D.3.4 [Processors]: Optimization

General Terms: Languages

Additional Key Words and Phrases: FORTRAN, vector computing, detection of parallelism, language translators

1. INTRODUCTION

With the advent of successful vector computers such as the Cray-1 [10, 30] and the popularity of array processors such as the Floating Point Systems AP-120 [13, 35], there has been increased interest in making vector operations available to the FORTRAN programmer. One common method is to supply a vectorizing

This work was supported by the IBM Corporation. Authors' address: Department of Computer Science, Brown School of Engineering, Rice University, P.O. Box 1892, Houston, TX.

FORTRAN compiler [11] as depicted in Figure 1. Here standard FORTRAN is accepted as input, and, as part of the optimization phase of the compiler, a vectorizing stage attempts to convert the innermost loops to vector operations. The code generator can then produce vector machine code for these operations. This scheme has two advantages. First, programmers need not learn a new language, since the FORTRAN compiler itself takes on the task of discovering where vector operations may be useful. Second, this scheme does not require a major conversion effort to bring old code across.

In practice, however, this system has drawbacks. Uncovering implicitly parallel operations in a program is a subtle intellectual activity, so subtle that most compilers to date have not been able to do a truly thorough job. As a result, the programmer often has to assist the compiler by recoding loops into a form that the compiler can handle. The Cray FORTRAN manual [11], for example, has several pages devoted to such recoding methods. With this system, the programmer is still obligated to rewrite his programs for a new machine, not because the compiler will not accept the old program, but because the compiler is unable to generate suitably efficient code. During a number of visits to Los Alamos Scientific Laboratory, which has several Crays, we have observed the widespread sentiment that every FORTRAN program will need to be rewritten to be acceptably efficient on the Cray.

This presents the question: If we are forced to rewrite FORTRAN programs into vector form anyway, why not write them in a language that permits explicit specification of vector operations, while still maintaining the flavor of FORTRAN? Many such languages have been proposed. VECTRAN [27, 28] is one of the earliest and most influential of such proposals, although there have been numerous others [7, 12, 34]. In fact, it seems clear that the next ANSI standard for FORTRAN, which we shall refer to as FORTRAN 8x, will contain explicit vector operations like those in VECTRAN [5, 26].

Suppose that, instead of a vectorizing FORTRAN compiler, we were to provide FORTRAN 8x compilers for use with vector machines. This would allow programmers to bypass the implicitly sequential semantics of FORTRAN and explicitly code vector algorithms in a language designed for that purpose. However, the basic problem will still be unresolved: What do we do about old code?

One answer is to provide a translator that will take FORTRAN 66 or FORTRAN 77 as input and produce FORTRAN 8x as output. This leads to the system depicted in Figure 2. An advantage of this system is that the translator need not be as efficient as a vectorizing stage embedded in a compiler must be, since the translation from FORTRAN to FORTRAN 8x is usually done only once. Therefore, the translator can attempt more ambitious program transformations, using techniques from program verification and artificial intelligence. Such a translator should uncover significantly more parallelism than a conventional vectorizing compiler.

There is another advantage to this method. If the translator should fail to discover a potential vector operation in a critical program region, the programmer need not try to trick the translator into recognizing it. Instead, he can correct the problem directly in the FORTRAN 8x version. This advantage is very significant, because some loops can be correctly run in vector form even when

a transformation to such a form appears to violate their semantics. Such loops can usually be recoded by a programmer into explicit vector statements in FORTRAN 8x.

Fig. 1. Vectorizing FORTRAN compiler.

Fig. 2. Vectorizing FORTRAN translator.

This paper discusses the theoretical concepts underlying a project at Rice University to develop an automatic translator, called PFC (for Parallel FORTRAN Converter), from FORTRAN to FORTRAN 8x. The Rice project, based initially upon the research of Kuck and others at the University of Illinois [6, 17-21, 24, 32, 36], is a continuation of work begun while on leave at IBM Research in Yorktown Heights, N.Y. Our first implementation was based on the Illinois PARAFRASE compiler [20, 36], but the current version is a completely new program (although it performs many of the same transformations as PARAFRASE). Other projects that have influenced our work are the Texas Instruments ASC compiler [9, 33], the Cray-1 FORTRAN compiler [15], and the Massachusetts Computer Associates Vectorizer [22, 25].

The paper is organized into seven sections. Section 2 introduces FORTRAN 8x and gives examples of its use. Section 3 presents an overview of the translation process along with an extended translation example. Section 4 develops the concept of interstatement dependence and shows how it can be applied to the problem of vectorization. Loop carried dependence and loop independent dependence are introduced in this section to extend dependence to multiple statements and multiple loops. Section 5 develops dependence-based algorithms for code generation and transformations for enhancing the parallelism of a statement. Section 6 describes a method for extending the power of data dependence to control statements by the process of IF conversion. Finally, Section 7 details the current state of PFC and our plans for its continued development.

2. FUNDAMENTALS OF FORTRAN 8x

It is difficult to describe any language whose definition is still evolving, much less write a language translator for it, but we need some language as the basis for our discussion. In this section, we describe a potential version of FORTRAN 8x,

one that is similar to the version presently under consideration by the ANSI X3J3 committee. Our version extends 1977 ANSI FORTRAN to include the proposed features for support of array processing and most of the proposed control structures.

2.1 Array Assignment

Vectors and arrays may be treated as aggregates in the assignment statement. Suppose X and Y are two arrays of the same dimension; then

      X = Y

copies Y into X, element by element. In other words, this assignment is equivalent to

      X(1) = Y(1)
      X(2) = Y(2)
      ...
      X(N) = Y(N)

Scalar quantities may be mixed with vector quantities using the convention that a scalar is expanded to a vector of the appropriate dimensions before operations are performed. Thus

      X = X + 5.0

adds the constant 5.0 to every element of array X.

Array assignments in FORTRAN 8x are viewed as being executed simultaneously; that is, the assignment must be treated so that all input operands are fetched before any output values are stored. For instance, consider

      X = X/X(2)

Even though the value of X(2) is changed by this statement, the original value of X(2) is used throughout, so that the result is the same as

      T = X(2)
      X(1) = X(1)/T
      X(2) = X(2)/T
      ...
      X(N) = X(N)/T

This is an important semantic distinction that has a significant impact on the translation process.
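The fetch-before-store rule is easy to demonstrate in ordinary scalar code. The following sketch (Python is used here and in later examples purely as executable pseudocode; none of this code comes from the paper or from PFC) contrasts a naive element-by-element execution of X = X/X(2) with the FORTRAN 8x semantics:

    # Python sketch of the fetch-before-store rule; not from the paper.
    def sequential_divide(x):
        # Naive element-by-element loop: x[1] (FORTRAN's X(2)) changes
        # partway through, so later elements see the wrong divisor.
        for i in range(len(x)):
            x[i] = x[i] / x[1]
        return x

    def simultaneous_divide(x):
        # FORTRAN 8x semantics: the divisor is fetched before any store.
        t = x[1]
        return [v / t for v in x]

    print(sequential_divide([2.0, 4.0, 8.0]))    # [0.5, 1.0, 8.0] -- wrong
    print(simultaneous_divide([2.0, 4.0, 8.0]))  # [0.5, 1.0, 2.0]

The two results differ exactly because the naive loop reuses a freshly stored value as an input.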

2.2 Array Sections

Sections of arrays, including individual rows and columns, may be assigned using triplet notation. Suppose A and B are two-dimensional arrays whose subscripts range from 1 to 100 in each dimension; then

      A(1:100, I) = B(J, 1:100)

assigns the Jth row of B to the Ith column of A. One may also define a range of iteration for vector assignment that is smaller than a whole row or column. Suppose you wish to assign the first M elements of the Jth row of B to the first M elements of the Ith column of A. In FORTRAN 8x, the following assignment could be used:

      A(1:M, I) = B(J, 1:M)

This statement has the effect of the assignments

      A(1, I) = B(J, 1)
      A(2, I) = B(J, 2)
      ...
      A(M, I) = B(J, M)

even though M might contain a value much smaller than the actual upper bound of these arrays.

The term triplet seems to imply that iteration range specifications such as the one above should have three components. Indeed, the third component, when it appears, specifies a stride for the index vector in that subscript position. For example, if we had wished to assign the first M elements of the Jth row of B to the first M elements of the Ith column of A in odd subscript positions, the following assignment could have been used:

      A(1:M*2-1:2, I) = B(J, 1:M)

The triplet notation is also useful in dealing with operations involving shifted sections. The assignment

      A(I, 1:M) = B(1:M, J) + C(I, 3:M+2)

has the effect

      A(I, 1) = B(1, J) + C(I, 3)
      A(I, 2) = B(2, J) + C(I, 4)
      ...
      A(I, M) = B(M, J) + C(I, M+2)
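For readers more familiar with slice notation, the triplet l:u:s corresponds closely to a Python slice, except that FORTRAN subscripts are 1-based and the triplet's upper bound is inclusive. A small illustrative sketch of the correspondence (the variable names are ours):

    # Illustrative mapping only: FORTRAN triplet l:u:s versus Python's
    # 0-based, exclusive-end slice l-1:u:s.
    M = 4
    b_row = list(range(1, 101))                # stands in for B(J, 1:100)
    first_m = b_row[0:M]                       # B(J, 1:M) -> values 1..M
    odd_subscripts = list(range(1, 2 * M, 2))  # 1:M*2-1:2 -> 1, 3, 5, 7
    print(first_m)                             # [1, 2, 3, 4]
    print(odd_subscripts)                      # [1, 3, 5, 7]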

2.3 Array Identification

Useful as it is, the triplet notation provides no way to skip through elements of a rotated array section, like the diagonal. To do that, one must use the IDENTIFY statement, which allows an array name to be mapped onto an existing array. For example,

      IDENTIFY /1:M/ D(I) = C(I, I+1)

defines the name D, dimensioned from 1 to M, to be the superdiagonal of C. Thus

      D = A(1:M, J)

has the effect

      C(1, 2) = A(1, J)
      C(2, 3) = A(2, J)
      ...
      C(M, M+1) = A(M, J)

It is important to note that D has no storage of its own; it is merely a pseudonym for a subset of the storage assigned to C.

2.4 Conditional Assignment

The FORTRAN 8x WHERE statement will permit an array assignment to be controlled by a conditional masking array. For example,

      WHERE (A .GT. 0.0) A = A + B

specifies that the vector sum of A and B be formed, but that stores back to A take place only in positions where A was originally greater than zero. The semantics of this statement require that it behave as if only components corresponding to the locations where the controlling condition is true are involved in the computation. In the special case of statements like

      WHERE (A .NE. 0.0) B = B/A

the semantics require that divide checks arising as a result of evaluating the right-hand side not affect the behavior of the program; the code must hide the error from the user. In other words, any error side effects that might occur as a result of evaluating the right-hand side in positions where the controlling vector is false are ignored.

2.5 Library Functions

Mathematical library functions, such as SQRT and SIN, are extended on an elementwise basis to vectors and arrays. In addition, new intrinsic functions are provided, such as inner product (DOTPRODUCT) and transpose (TRANSPOSE). The special function SEQ(1, N) returns an index vector from 1 to N. Reduction functions, much like those in APL, are also provided. For example, SUM applied to a vector returns the sum of all elements in that vector.

2.6 User-Defined Subprograms

There are several enhancements to the handling of user-defined subroutines and functions. First, arrays, even identified arrays, may be passed as parameters to subroutines. Second, an array may be returned as the value of a function.
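The "as if" masking behavior of WHERE can be modeled with a per-element conditional. The sketch below is a hypothetical helper, not FORTRAN 8x itself; it shows one way the masked positions can produce no divide checks, namely by never evaluating the right-hand side there:

    # Hypothetical helper modeling WHERE (A .NE. 0.0) B = B / A: the
    # right-hand side is evaluated only where the mask holds, so masked-off
    # positions can produce no divide checks.
    def where_divide(b, a):
        return [bi / ai if ai != 0.0 else bi for bi, ai in zip(b, a)]

    print(where_divide([1.0, 2.0, 3.0], [2.0, 0.0, 4.0]))  # [0.5, 2.0, 0.75]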

3. THE TRANSLATION PROCESS

Now we are ready to describe, in an idealized way, the process of translating a FORTRAN program into FORTRAN 8x. In so doing, we will illustrate some important aspects of the problem. Suppose the translator is presented with the following FORTRAN fragment:

      DO 20 I = 1, 100
S1       KI = I
         DO 10 J = 1, 300, 3
S2          KI = KI + 2
S3          U(J) = U(J) * W(KI)
S4          V(J+3) = V(J) + W(KI)
10       CONTINUE
20    CONTINUE

The goal is to convert statements S3 and S4 to vector assignments, removing them from the innermost loop. That will be possible if there is no semantic difference between executing them in a sequential loop and executing them as vector statements. Consider a somewhat simpler case:

      DO 10 I = 1, 100
         X(I) = X(I) + Y(I)
10    CONTINUE

If we are to convert this to the vector assignment

      X(1:100) = X(1:100) + Y(1:100)

we must be sure that no semantic difference arises. Specifically, a vector assignment requires that the right-hand side be fetched before any stores occur on the left. Thus, it can use only old values of its input operands. If the sequential loop computes a value on one iteration and uses it on a later iteration, it is not semantically equivalent to a vector statement. The following fragment

      DO 10 I = 1, 100
         X(I+1) = X(I) + Y(I)
10    CONTINUE

cannot be correctly converted to the vector assignment

      X(2:101) = X(1:100) + Y(1:100)

because each iteration after the first uses a value computed on the previous iteration. The vector assignment would use only old values of X. An iterated statement that depends upon itself in the manner shown is called a recurrence.

In order to distinguish the two cases above, the translator must perform a precise test to determine whether or not a statement depends upon itself, that is, whether or not it uses a value that it has computed on some previous iteration. Details of this dependence test will be provided in the next section; for now, it is enough to know that certain program transformations are required to make the test possible.
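The semantic difference between the two loops above can also be demonstrated by brute force: execute each loop once with ordinary sequential stores and once under the fetch-before-store rule, and compare the results. A sketch (all names and values are hypothetical):

    # All names hypothetical.  Arrays are padded so that FORTRAN-style
    # 1-based subscripts, including X(N+1), stay in range.
    N = 5

    def run_sequential(shift):
        x = [float(i) for i in range(N + 2)]
        y = [2.0] * (N + 1)
        for i in range(1, N + 1):           # DO 10 I = 1, N
            x[i + shift] = x[i] + y[i]
        return x

    def run_simultaneous(shift):
        x = [float(i) for i in range(N + 2)]
        y = [2.0] * (N + 1)
        old = x[:]                          # every input fetched up front
        for i in range(1, N + 1):
            x[i + shift] = old[i] + y[i]
        return x

    print(run_sequential(0) == run_simultaneous(0))  # True: vectorizable
    print(run_sequential(1) == run_simultaneous(1))  # False: a recurrence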

The first of these transformations, DO-loop normalization, transforms loops so that the loop induction variables iterate from 1 to some upper bound by increments of 1. Sometimes new induction variables must be introduced to accomplish this. Within the loop, every reference to the old loop induction variable is replaced by an expression in the new induction variable. The effect of DO-loop normalization on our example is

      DO 20 I = 1, 100
S1       KI = I
         DO 10 j = 1, 100
S2          KI = KI + 2
S3          U(3*j-2) = U(3*j-2) * W(KI)
S4          V(3*j+1) = V(3*j-2) + W(KI)
10       CONTINUE
S5       J = 301
20    CONTINUE

Note that the new variable j (written as a lowercase letter to signify that it has been introduced by the translator) is now the inner loop induction variable and that an assignment S5 has been introduced to define the previous induction variable on exit from the loop. In this form, the upper bound of the loop is precisely the number of times the loop will be executed.

A major goal of this sequence of normalizing transformations is to convert all subscripts to linear functions of loop induction variables. To accomplish this conversion, uses of auxiliary induction variables, such as KI in our example, must be replaced. This transformation, called induction variable substitution [36], replaces statements that increment auxiliary induction variables with statements that compute them directly using normal loop induction variables and loop constants. The effect in our example is as follows:

      DO 20 I = 1, 100
S1       KI = I
         DO 10 j = 1, 100
S3          U(3*j-2) = U(3*j-2) * W(KI + 2*j)
S4          V(3*j+1) = V(3*j-2) + W(KI + 2*j)
10       CONTINUE
S5       KI = KI + 200
S6       J = 301
20    CONTINUE

Here the computation of KI has been removed from the loop and all references to KI have been replaced by references to the initial value of KI plus the sum total of increments that can occur by the relevant iteration, expressed as a function of j. At the end of the loop, an assignment updates the value of KI by the aggregate total of all increments in the loop. Note that since it attempts to replace simple additions with multiplications, induction variable substitution is, in a sense, an inverse of the classical optimization technique operator strength reduction [2, 8].
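The arithmetic justifying this transformation is easy to check numerically. In the sketch below (illustrative only, not PFC's implementation), the incrementally computed KI is compared against the closed form KI = I + 2*j used in the substituted subscripts, along with the aggregate exit value:

    # Not PFC's code: checking that KI = I + 2*j on inner iteration j, and
    # that the loop-exit update is KI = KI + 200.
    I = 7
    ki = I
    incremental, closed_form = [], []
    for j in range(1, 101):                 # DO 10 j = 1, 100
        ki = ki + 2                         # the auxiliary increment
        incremental.append(ki)
        closed_form.append(I + 2 * j)       # the substituted subscript value
    assert incremental == closed_form
    assert ki == I + 200                    # aggregate of all increments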

The final transformation in preparation for dependence testing is expression folding, which substitutes integer expressions and constants forward into subscripts, with simplification where possible. The result in our example is

      DO 20 I = 1, 100
         DO 10 j = 1, 100
S3          U(3*j-2) = U(3*j-2) * W(I + 2*j)
S4          V(3*j+1) = V(3*j-2) + W(I + 2*j)
10       CONTINUE
S5       KI = I + 200
S6       J = 301
20    CONTINUE

In this example, the first assignment to KI in the outer loop has been removed and references to KI replaced by the right-hand side (I) in statements S3, S4, and S5. It should be noted that statements S5 and S6 could now be removed from the loop by forward substitution; this is, in fact, done in the actual translator.

Once the subscripts have been transformed, a standard data flow analysis phase can be applied to build the data flow graph for the whole program. This graph can be used to propagate constants throughout the program and to recognize dead statements, that is, statements whose output will never be used. In the example above, suppose that KI and J are both dead after the code segment shown. Then all assignments to those variables will be deleted, as shown below:

      DO 20 I = 1, 100
         DO 10 j = 1, 100
S3          U(3*j-2) = U(3*j-2) * W(I + 2*j)
S4          V(3*j+1) = V(3*j-2) + W(I + 2*j)
10       CONTINUE
20    CONTINUE

The point of this complex assortment of transformations is to attempt to convert all subscripts to a canonical form: linear functions of the DO loop induction variables. This form makes it possible to apply a powerful and precise test for interstatement dependence. In the example above, we have succeeded in putting all subscripts into the desired form, so we can use precise tests to determine what dependences exist among the statements in the inner loop.

Once the dependences have been identified, we are ready for vector code generation. Using dependence information, the translator determines which of the remaining statements do not depend on themselves. As it happens, statement S3 does not depend upon itself, while statement S4 does (and hence represents a recurrence). Therefore, statement S3 is converted to a vector assignment, while statement S4 is left in a sequential loop by itself:

      DO 20 I = 1, 100
S3       U(1:298:3) = U(1:298:3) * W(I+2:I+200:2)
         DO 10 j = 1, 100
S4          V(3*j+1) = V(3*j-2) + W(I + 2*j)
10       CONTINUE
20    CONTINUE

Figure 3 gives an overview of the translation process as implemented in PFC. The scanner-parser phase converts the input program to an abstract syntax tree that is used as the intermediate form throughout the translation. The pretty printer can reconstruct a source program from the abstract syntax tree; it is used throughout the translator.

Fig. 3. Overview of PFC.

The vector translation phase consists of three main subphases: (1) subscript standardization, which encompasses all the transformations that attempt to put subscripts into canonical form; (2) dependence analysis, which builds the interstatement dependence graph; (3) parallel code generation, which generates array assignments where possible. Each of these will be discussed in more detail. Since the dependence test is fundamental to these phases, it is the subject of the next section.

4. DEPENDENCE ANALYSIS

Since a statement can be directly vectorized only if it does not depend upon itself, the analysis of interstatement dependence is an important part of PFC. In this section we formalize the concept of dependence and introduce a precise test for interstatement dependence in a single loop. We then extend this concept to multiple loops with the concept of layered dependence.

4.1 Interstatement Dependence

Informally, a statement S2 depends upon statement S1 if some execution of S2 uses as input a value created by some previous execution of S1. In straight-line code, this condition is easy to determine. Since we are interested in determining whether a statement depends upon itself and since this can only happen if the execution flows from a statement back to itself via a loop, we must be able to determine dependence within loops. To illustrate the complexity of this problem, consider the following loop:

      DO 10 J = 1, N
         X(J) = X(J) + C
10    CONTINUE

The statement in this loop does not depend on itself because the input variable X(J) always refers to the old value at that location. By contrast, the similar loop

      DO 10 J = 1, N-1
         X(J+1) = X(J) + C
10    CONTINUE

forms a recurrence and cannot be directly converted to vector form. The input on any iteration i+1 is always the value of X computed on iteration i. As a result, the direct vector analog will not be equivalent.

In order to understand intrastatement dependence, we need to examine the generalized form of a (possibly dependent) single statement within a loop:

      DO 10 i = 1, N
(*)      X(f(i)) = F(X(g(i)))
10    CONTINUE

Here f and g are arbitrary subscript expressions, and F is some expression involving its input parameter.

Definition. Statement (*) depends upon itself if and only if there exist integers i1, i2 such that 1 ≤ i1 < i2 ≤ N and f(i1) = g(i2).

The integers i1 and i2 represent separate iterations of the i loop. On iteration i1, statement (*) computes a value that is subsequently used on iteration i2. To put it another way, statement (*) depends upon itself if and only if the dependence equation f(x) - g(y) = 0 has integer solutions in the region depicted in Figure 4.

If f and g are permitted to be arbitrary functions of the DO loop induction variable, then determining whether statement (*) depends upon itself is an extremely difficult problem. The problem becomes much more tractable when f and g are restricted to be linear functions of the induction variable, that is,

      f(i) = a0 + a1*i
      g(i) = b0 + b1*i.

This is by far the most common case encountered in practice. With this restriction, the dependence equation has solutions if and only if

      a1*x - b1*y = b0 - a0.

In order for x and y to be viable solutions to the dependence equation, they must be integers. As a result, we are seeking integer solutions to an equation with integer coefficients. Almost any text on number theory (e.g., [14]) will include the following theorem on Diophantine equations.

THEOREM 1. The linear Diophantine equation ax + by = n has a solution if and only if gcd(a, b) | n.
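Theorem 1 is constructive: the extended Euclidean algorithm not only decides solvability but produces a witness. A brief sketch of this standard result (not specific to the paper):

    # Standard number theory, not specific to the paper: solve a*x + b*y = n
    # or report that gcd(a, b) does not divide n.
    def extended_gcd(a, b):
        # returns (g, x, y) with a*x + b*y == g == gcd(a, b)
        if b == 0:
            return (a, 1, 0)
        g, x, y = extended_gcd(b, a % b)
        return (g, y, x - (a // b) * y)

    def diophantine_solution(a, b, n):
        g, x, y = extended_gcd(a, b)
        if n % g != 0:
            return None                     # Theorem 1: no integer solution
        return (x * (n // g), y * (n // g))

    print(diophantine_solution(3, 5, 1))    # (2, -1): 3*2 + 5*(-1) = 1
    print(diophantine_solution(2, 4, 7))    # None: gcd(2, 4) = 2 does not divide 7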

Fig. 4. The region of interest.

Immediately following from the above theorem is a necessary condition for dependence:

COROLLARY 1 (GCD TEST). Statement (*) with f(i) = a0 + a1*i and g(i) = b0 + b1*i depends upon itself only if gcd(a1, b1) | (b0 - a0).

Note that the gcd test is only necessary for dependence, because an integer solution to the dependence equation is not sufficient to guarantee self-dependence. For that, the solution must exist within the region depicted in Figure 4. Although the gcd test is interesting, it is of limited usefulness, because the most common case by far is that in which the gcd of a1 and b1 is 1.

A more effective test can be developed by examining the effects of region constraints on the existence of solutions. The mathematics of determining integer solutions to a Diophantine equation within a restricted region can lead to extremely expensive tests for dependence. As a result, it is more useful to investigate the real solutions to the dependence equation in the region of interest. Consider the real solutions of

      h(x, y) = f(x) - g(y) = 0

in the region R:

      1 ≤ x ≤ N-1
      2 ≤ y ≤ N
      x ≤ y-1.

A real solution to the dependence equation exists in R if and only if the level curve at 0 for h passes through R, as depicted in Figure 5.

Fig. 5. Real solutions in R.

If h meets fairly general continuity conditions, the intermediate value theorem guarantees that h has zeros in R if and only if there exist points (x1, y1) and (x2, y2) in R such that h(x1, y1) ≤ 0 ≤ h(x2, y2). The following theorem summarizes this observation.

THEOREM 2. If h(x, y) is continuous in R, then there exists a solution to h(x, y) = 0 in R if and only if

      min over R of h(x, y) ≤ 0 ≤ max over R of h(x, y).

COROLLARY 2. If f(x) and g(y) are continuous, then statement (*) depends upon itself only if

      min over R of (f(x) - g(y)) ≤ 0 ≤ max over R of (f(x) - g(y)).

Once again, this condition is necessary, but not sufficient; the existence of real solutions in R does not imply the existence of integer solutions. As a result, the requirements of Corollary 2 may be satisfied by a statement that is not self-dependent.

Corollary 2 is useful only if there is a fast way to find the maximum and minimum on a region. Such a way is provided by the following theorem, adapted from a result due to Banerjee [6].

THEOREM 3. If f(x) = a0 + a1*x and g(y) = b0 + b1*y, then

      max over R of (f(x) - g(y)) = a0 + a1 - b0 - 2*b1 + (a1+ - b1)+(N - 2)
      min over R of (f(x) - g(y)) = a0 + a1 - b0 - 2*b1 - (a1- + b1)+(N - 2)

where the superscript notation is defined by the following:

Definition. If t denotes a real number, then the positive part t+ and the negative part t- of t are defined as

      t+ = t if t ≥ 0, and 0 if t < 0
      t- = -t if t ≤ 0, and 0 if t > 0.

Thus t+ ≥ 0, t- ≥ 0, and t = t+ - t-.

The proof of a multidimensional variant of Theorem 3 is given in the Appendix. Theorem 2 and Theorem 3 establish the following result.

COROLLARY 3 (BANERJEE INEQUALITY). If f(x) = a0 + a1*x and g(y) = b0 + b1*y, then statement (*) depends on itself only if

      -b1 - (a1- + b1)+(N - 2) ≤ b0 + b1 - a0 - a1 ≤ -b1 + (a1+ - b1)+(N - 2).

PROOF. Immediate from Corollary 2 and Theorem 3, with subtraction of a0 + a1 - b0 - b1 from each side of the inequalities in Corollary 2. □

Corollaries 1 and 3 comprise a necessary test for self-dependence. This test may be expressed algorithmically as follows:

(1) Determine whether f and g are linear. If they are, then compute a0, b0, a1, and b1.
(2) If either (a) gcd(a1, b1) does not divide b0 - a0 or (b) Banerjee's inequality does not hold, then the statement does not depend upon itself. Otherwise, assume it does (even though it may not).

Testing for self-dependence in the presence of multiple loops is more complicated. Before developing that test, let us examine some applications of dependence.

4.2 Dependence Graphs and Their Application

While determining whether a statement depends upon itself or not is useful, it is clearly a simplified case of a more general phenomenon. In general, a statement may depend upon itself indirectly through a chain of zero (the direct case) or more statements, as the following example illustrates:

      DO 10 I = 1, 100
S1       T(I) = A(I) * B(I)
S2       S(I) = S(I) + T(I)
S3       A(I+1) = S(I) + C(I)
10    CONTINUE

Although statements S1, S2, and S3 all depend upon themselves indirectly, no statement depends directly upon itself. In order to uncover the recurrence, it is necessary to first uncover the individual statement-to-statement dependences.
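Before moving on to interstatement dependence, the two-step test just assembled can be transcribed almost directly into code. The function names below are hypothetical; the paper specifies only the mathematics. The sketch tests whether a statement X(a0 + a1*I) = F(X(b0 + b1*I)) inside DO I = 1, N may depend upon itself:

    # Hypothetical function names; the paper specifies only the mathematics.
    from math import gcd

    def pos(t): return max(t, 0)            # t+, the positive part
    def neg(t): return max(-t, 0)           # t-, the negative part

    def may_depend_on_self(a0, a1, b0, b1, N):
        # gcd test (Corollary 1); assumes a1, b1 not both zero
        if (b0 - a0) % gcd(a1, b1) != 0:
            return False
        # Banerjee's inequality (Corollary 3) for 1 <= x < y <= N
        mid = b0 + b1 - a0 - a1
        lo = -b1 - pos(neg(a1) + b1) * (N - 2)
        hi = -b1 + pos(pos(a1) - b1) * (N - 2)
        return lo <= mid <= hi              # a necessary condition only

    # X(I+1) = X(I) + Y(I): f(i) = i + 1, g(i) = i -- the recurrence above
    print(may_depend_on_self(1, 1, 0, 1, 100))   # True
    # X(2*I) = X(2*I-1) + C: even elements written, odd elements read
    print(may_depend_on_self(0, 2, -1, 2, 100))  # False (gcd test fails)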

Kuck and others at the University of Illinois [18, 32] have defined three types of dependence that can hold between statements.

Definition. If control flow within a program can reach statement S2 after passing through S1, then S2 depends on S1, written S1 Δ S2, if

(1) S2 uses the output of S1. This type of dependence is known as true dependence (denoted δ) and is illustrated by the following:

      S1: X = ...
      S2: ... = X

(2) S1 might wrongly use the output of S2 if they were reversed in order. This type of dependence is called antidependence (denoted δ̄) and is illustrated by the following:

      S1: ... = X
      S2: X = ...

(3) S2 recomputes the output of S1; thus, if they were reversed, later statements might wrongly use the output of S1. This type of dependence is termed output dependence (denoted δ°) and is illustrated by

      S1: X = ...
      S2: X = ...

Thus Δ = δ + δ̄ + δ°, where addition means set union. All three types of dependence must be considered when detecting recurrences that inhibit vectorization.

Note that dependence, in this sense, denotes a relation between two statements that captures the order in which they must be executed. This concept of dependence differs from that normally encountered in data flow analysis, where dependence implies that one statement must be present for another to receive the correct values. Antidependence and output dependence are meaningless in such a setting, since they only fix the order of statements. In particular, we would not wish to use either of these pseudodependences (as we will henceforth call them) in the dead statement eliminator; it would be ridiculous to refuse to eliminate a particular statement because some useful statement recomputes its output and hence depends on it.

In any case, the common element among these types of dependence is the use of the same memory location in two statements (or in two different executions of the same statement). The actual type of dependence created by a common use is determined by which statement (or statements) defines the location and which statement uses the location. As a result, all three types of dependence can be decided by the same test. The only change necessary is to switch the locations from which the subscript functions f and g are taken. We will therefore discuss only the test for true dependence between two statements in a loop, with the

understanding that the same methods are easily extended to all types of dependence.

In contrast to the case of self-dependence, there are two completely separate ways in which dependence can arise between different statements. One statement may store into a location on one iteration of the loop; the other statement may fetch from that location on a later iteration of the loop. The dependence of statement S1 on statement S3 in the previous example illustrates this type of dependence, known as loop carried dependence. The other possibility is that one statement may store into a location on an iteration of the loop, and on the same iteration another statement may fetch from that location. The dependence of statement S2 on statement S1 illustrates this type of dependence, known as loop independent dependence. In order for one statement to have a true dependence upon another, it is necessary that the statement defining the common memory location precede (in terms of execution) the statement using that location. Since these two types of dependence completely describe all possible ways that a definition can precede a use, they encapsulate all possible data dependences.

Before providing a more formal definition of these types of dependence, it is convenient to introduce some notation.

Definition. Let S1 and S2 be two statements that appear in the same DO-loop. We say that S2 follows S1, written S2 > S1, if S1 appears first in the loop and S1 ≠ S2.

Consider two statements S1 and S2, both contained in one loop with loop induction variable i. Suppose S1 is of the form

      S1: X(f(i)) = F(...)

where f is a subscript expression and F is an expression, and suppose S2 is of the form

      S2: A = G(X(g(i)))

where A is an arbitrary variable (possibly subscripted), G is an expression involving X(g(i)), and g is a subscript expression. Then the following definitions are obvious from the above discussion.

Definition. S2 has a loop carried dependence on S1 (denoted S1 δ S2) if there exist i1 and i2 such that 1 ≤ i1 < i2 ≤ N and f(i1) = g(i2).

Definition. S2 has a loop independent dependence on S1 (denoted S1 δ∞ S2) if there exists some iteration i, 1 ≤ i ≤ N, such that S2 > S1 and f(i) = g(i).

Note that self-dependence is merely a special case of loop carried dependence. It is true in the case of self-dependence (as in the case of all loop carried dependences) that the dependence arises because of the iteration of a loop. In particular, there is no way in which a single statement can first define and later use a value unless it is contained within a loop.
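Since the two definitions differ only in whether the defining iteration strictly precedes the using one, a brute-force classifier makes the distinction concrete. The sketch below is for intuition only; the whole point of this section is the precise test that avoids such enumeration:

    # Brute force, for intuition only; the precise tests above avoid this
    # enumeration entirely.
    def classify(f, g, N, s2_follows_s1):
        carried = any(f(i1) == g(i2)
                      for i1 in range(1, N + 1)
                      for i2 in range(i1 + 1, N + 1))
        independent = s2_follows_s1 and any(f(i) == g(i)
                                            for i in range(1, N + 1))
        return carried, independent

    # S1: X(I+1) = ...   S2: ... = X(I)   -- value crosses iterations
    print(classify(lambda i: i + 1, lambda i: i, 100, True))  # (True, False)
    # S1: X(I) = ...     S2: ... = X(I)   -- same iteration, S2 after S1
    print(classify(lambda i: i, lambda i: i, 100, True))      # (False, True)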

Fig. 6. A sample dependence graph D.

Loop independent dependences, on the other hand, arise not because of loop iterations, but because of the relative position of two statements within the loop. These dependences do not cross loop iterations. A loop carried dependence, by its very nature, cannot be limited to a single iteration.

Since the primary function of the translator is to detect recurrences, it is useful to see how the concept of dependence aids in that function. In order to apply dependence analysis, it is necessary to

(1) test each pair of statements for dependence (true, anti-, or output), building a dependence relation D;
(2) compute the transitive closure D+ of the dependence relation;
(3) execute each statement that does not depend upon itself in D+ in parallel; all others are part of a recurrence.

There is a small wrinkle, however. The parallel statements must be executed in an order that is consistent with the dependence relation D+. To view it in the manner suggested by Kuck [18], consider D as a graph in which individual statements are nodes and in which pairs in the relation are represented by directed edges. Figure 6 contains an example of such a graph. Cycles in this graph

represent recurrences. If each cycle and each single statement not part of a cycle are reduced to a single node (called a π-block), then the dependence graph D' derived from this transformation on D is acyclic (see Figure 7).

Fig. 7. The derived dependence graph of π-blocks.

Using a topological sort [16], we can then generate code for each π-block in an order that preserves the dependence relation D'. As an example, consider the program below.

      DO 10 I = 1, 99
S1       X(I) = I
S2       B(I) = 100 - I
10    CONTINUE
      DO 20 I = 1, 99
S3       A(I) = F(X(I))
S4       X(I+1) = G(B(I))
20    CONTINUE

Figure 8 depicts the dependences among the numbered statements in this program, ignoring dependences on the DO statements. Since there are no cycles, all the statements may be executed in vector mode, but we must be careful to choose an order that preserves dependences. In particular, S4 must come before S3 in the final code. Choosing the order (S1, S2, S4, S3), the result is

the FORTRAN 8x program

      X(1:99) = SEQ(1, 99, 1)
      B(1:99) = SEQ(99, 1, -1)
      X(2:100) = G(B(1:99))
      A(1:99) = F(X(1:99))

which is fully consistent with the original sequential semantics.

Fig. 8. Dependences in the example program.

Currently, the translator leaves a recurrence coded as a sequential DO-loop.
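The condensation-and-sort procedure of steps (1) through (3) above can be made concrete with any strongly connected components algorithm. The following sketch (an assumed edge-list representation; this is not PFC's code) uses Kosaraju's algorithm, whose second pass already emits the π-blocks in an order consistent with D':

    # Assumed edge-list representation; not PFC's code.  An edge (u, v)
    # means v depends on u, so u must be generated first.
    from collections import defaultdict

    def pi_block_order(vertices, edges):
        succ, pred = defaultdict(list), defaultdict(list)
        for u, v in edges:
            succ[u].append(v)
            pred[v].append(u)

        finish, seen = [], set()
        def dfs(u):                          # first pass: finish times
            seen.add(u)
            for w in succ[u]:
                if w not in seen:
                    dfs(w)
            finish.append(u)
        for u in vertices:
            if u not in seen:
                dfs(u)

        comp = {}
        def assign(u, label):                # second pass on reversed edges
            comp[u] = label
            for w in pred[u]:
                if w not in comp:
                    assign(w, label)
        blocks = []
        for u in reversed(finish):
            if u not in comp:
                blocks.append([])
                assign(u, len(blocks) - 1)
        for u in vertices:
            blocks[comp[u]].append(u)
        # Kosaraju discovers components of the condensation in topological
        # order, so blocks[] already respects D'.
        return blocks

    # Dependences of the example program (Figure 8): S1->S3, S2->S4, S4->S3
    print(pi_block_order(['S1', 'S2', 'S3', 'S4'],
                         [('S1', 'S3'), ('S2', 'S4'), ('S4', 'S3')]))
    # [['S2'], ['S4'], ['S1'], ['S3']] -- another valid order than the
    # (S1, S2, S4, S3) chosen above; every pi-block here is a single
    # statement because the graph is acyclic.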

4.3 Dependence in Multiple Loops

When extending the definition of dependence to multiple loops, it is convenient to precisely pinpoint the loop that creates a loop carried dependence. An example will illustrate this concept.

      DO 100 i = 1, 100
         DO 90 j = 1, 100
            DO 30 k = 1, 100
S1             X(i, j+1, k) = A(i, j, k) + 10
30          CONTINUE
            DO 80 l = 1, 50
S2             A(i+1, j, l) = X(i, j, l)
80          CONTINUE
90       CONTINUE
100   CONTINUE

First, statements S1 and S2 depend upon each other. On every iteration of the j loop other than the first, S2 uses a value that was computed on the previous iteration of the j loop by S1. Similarly, on every iteration of the i loop other than the first, S1 uses a value computed on the previous iteration by S2. Neither the k loop nor the l loop can carry a dependence between the statements, because two statements must be nested within a loop in order for it to carry a dependence between them.

It is important to recognize which loop carries a particular dependence if we are to do a good job of translation. This is aptly illustrated by the example above, because S1 and S2 may be executed in parallel in two dimensions even though they form a global recurrence. If the outermost loop is left sequential we get

      DO 100 i = 1, 100
         X(i, 2:101, 1:100) = A(i, 1:100, 1:100) + 10
         A(i+1, 1:100, 1:50) = X(i, 1:100, 1:50)
100   CONTINUE

Clearly, this partial vectorization is desirable. The test for loop carried dependence presented earlier can be generalized to detect which loop carries a dependence by the following. Let f and g be subscript mappings from Z^n1 and Z^n2, respectively, into Z^m, where Z is the set of all integers, n1 is the number of loops containing statement S1,

      S1: X(f(x1, x2, ..., xn1)) = F(...)

n2 is the number of loops containing statement S2,

      S2: A = G(X(g(x1, ..., xn2)))

and m is the number of subscripts for array X. The symbol F(...) denotes an arbitrary right-hand side. We use x1, x2, ... to denote the induction variables for the loops, with x1 being the induction variable for the outermost loop. In general, we will number the loops from the outermost to the innermost. The upper bound of the ith loop surrounding S1 is assumed to be Mi; the upper bound of the ith loop surrounding S2 is assumed to be Ni; hence Mi = Ni for 1 ≤ i ≤ n, where n is the number of common loops surrounding the two statements.

Definition. Statement S2 depends on S1 with respect to carrier k (k ≤ n), written S1 δk S2, if there exist (i1, i2, ..., ik-1), (jk+1, jk+2, ..., jn1), (lk+1, lk+2, ..., ln2), and integers ζ1, ζ2 in the following regions:

      1 ≤ iq ≤ Nq    for all q such that 1 ≤ q < k
      1 ≤ jq ≤ Mq    for all q such that k < q ≤ n1
      1 ≤ lq ≤ Nq    for all q such that k < q ≤ n2
      1 ≤ ζ1 < ζ2 ≤ Nk

such that the following equation holds:

      f(i1, i2, ..., ik-1, ζ1, jk+1, ..., jn1) = g(i1, i2, ..., ik-1, ζ2, lk+1, ..., ln2).

Intuitively, we test for dependence with respect to carrier loop k by holding the outer loop indices constant and letting the inner loop indices run free. Note that

the same definition can be used for antidependence and output dependence as well. Interstatement dependence can now be defined in terms of dependence with respect to a particular carrier.

Definition. S2 depends directly on S1, S1 Δ S2, if and only if there exists some k ≥ 1 such that S1 δk S2. If we view dependence as a relation, then

      Δ = Σ[k ≥ 1] δk

where addition is interpreted as set union.

Now we are ready for the main result of this section: testing for dependence on a particular carrier.

THEOREM 4. If

      f(x1, ..., xn1) = a0 + Σ[i=1..n1] ai*xi
      g(x1, ..., xn2) = b0 + Σ[i=1..n2] bi*xi

and S1 and S2 are of the form

      S1: X(f(x1, ..., xn1)) = F(...)
      S2: A = G(X(g(x1, ..., xn2)))

and are contained in n common loops (assumed normalized), n ≥ k, and the upper bounds of the loops surrounding S1 are Mi and the upper bounds of the loops surrounding S2 are Ni (Mi = Ni for i ≤ n), then S1 δk S2 only if

(a) gcd test:

      gcd(a1 - b1, a2 - b2, ..., ak-1 - bk-1, ak, ..., an1, bk, ..., bn2) | (b0 - a0)

(b) Banerjee inequality:

      -bk - Σ[i=1..k-1] (ai - bi)-(Ni - 1) - (ak- + bk)+(Nk - 2) - Σ[i=k+1..n1] ai-(Mi - 1) - Σ[i=k+1..n2] bi+(Ni - 1)
        ≤ Σ[i=0..n2] bi - Σ[i=0..n1] ai
        ≤ -bk + Σ[i=1..k-1] (ai - bi)+(Ni - 1) + (ak+ - bk)+(Nk - 2) + Σ[i=k+1..n1] ai+(Mi - 1) + Σ[i=k+1..n2] bi-(Ni - 1).

The long but straightforward proof is given in the Appendix. This theorem is an adaptation of a result by Banerjee. The gcd test has been slightly sharpened over Banerjee's, and the test has been formulated as a test for dependence with respect to a specific carrier k.
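As a sketch of how Theorem 4 might be applied, the test can be transcribed almost directly, one subscript position at a time (a simplification; the multidimensional variant is treated in the Appendix). Coefficient lists and loop bounds are assumed inputs, with index 0 holding a0 and b0:

    # One subscript position at a time (a simplification; the multidimensional
    # variant is in the Appendix).  a and b hold a0..an1 and b0..bn2; M and N
    # hold loop upper bounds, padded so that M[i] bounds loop i.
    from math import gcd
    from functools import reduce

    def pos(t): return max(t, 0)
    def neg(t): return max(-t, 0)

    def may_carry(a, b, M, N, k):
        n1, n2 = len(a) - 1, len(b) - 1
        # (a) gcd test
        coeffs = [a[i] - b[i] for i in range(1, k)] + a[k:] + b[k:]
        g = reduce(gcd, (abs(c) for c in coeffs), 0)
        if g != 0 and (b[0] - a[0]) % g != 0:
            return False
        # (b) Banerjee inequality for carrier k
        mid = sum(b) - sum(a)
        lo = (-b[k]
              - sum(neg(a[i] - b[i]) * (N[i] - 1) for i in range(1, k))
              - pos(neg(a[k]) + b[k]) * (N[k] - 2)
              - sum(neg(a[i]) * (M[i] - 1) for i in range(k + 1, n1 + 1))
              - sum(pos(b[i]) * (N[i] - 1) for i in range(k + 1, n2 + 1)))
        hi = (-b[k]
              + sum(pos(a[i] - b[i]) * (N[i] - 1) for i in range(1, k))
              + pos(pos(a[k]) - b[k]) * (N[k] - 2)
              + sum(pos(a[i]) * (M[i] - 1) for i in range(k + 1, n1 + 1))
              + sum(neg(b[i]) * (N[i] - 1) for i in range(k + 1, n2 + 1)))
        return lo <= mid <= hi

    # The j subscript of X in the Section 4.3 example: written as j+1, read
    # as j, inside a loop of 100 iterations treated as the only common loop.
    print(may_carry([1, 1], [0, 1], [0, 100], [0, 100], 1))  # True
    # Written as 2*j, read as 2*j + 1: the gcd test rules dependence out.
    print(may_carry([0, 2], [1, 2], [0, 100], [0, 100], 1))  # False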

When the theory of loop carried dependence is extended to account for multiple loops, it is convenient to determine which loop creates the dependence. The previous theorem does exactly that. Extending loop independent dependence to multiple loops is much simpler, since such dependences arise not from the iteration of loops, but from relative statement position. The gcd test and Banerjee's inequality can be modified to test for loop independent dependences as follows.

THEOREM 5. If

      f(x1, ..., xn1) = a0 + Σ[i=1..n1] ai*xi
      g(x1, ..., xn2) = b0 + Σ[i=1..n2] bi*xi

and S1 and S2 are of the form

      S1: X(f(x1, ..., xn1)) = F(...)
      S2: A = G(X(g(x1, ..., xn2)))

and are contained in n common loops (assumed normalized), and the upper bounds of the loops surrounding S1 are Mi and the upper bounds of the loops surrounding S2 are Ni (Mi = Ni for i ≤ n), then S1 δ∞ S2 (S2 has a loop independent dependence on S1) only if S2 follows S1 and

(a) gcd test:

      gcd(a1 - b1, a2 - b2, ..., an - bn, an+1, ..., an1, bn+1, ..., bn2) | (b0 - a0)

(b) Banerjee inequality:

      -Σ[i=1..n] (ai - bi)-(Ni - 1) - Σ[i=n+1..n1] ai-(Mi - 1) - Σ[i=n+1..n2] bi+(Ni - 1)
        ≤ Σ[i=0..n2] bi - Σ[i=0..n1] ai
        ≤ Σ[i=1..n] (ai - bi)+(Ni - 1) + Σ[i=n+1..n1] ai+(Mi - 1) + Σ[i=n+1..n2] bi-(Ni - 1).

4.4 The Depth of a Dependence

The test for dependence given in the previous section leads in a natural way to the concept of dependence depth. Recall that S2 depends directly on S1 (S1 Δ S2) if and only if there exists k > 0 such that S1 δk S2. Clearly, if we disregard some of the outer loops, holding them constant, the dependence may not exist. Therefore, let us introduce the concept of depth into our theory of dependence.

Definition. We say that S2 depends on S1 at depth d (denoted S1 Δd S2) if there exists a k ≥ d such that S1 δk S2. In other words,

      Δd = Σ[k ≥ d] δk.

Note that in this scheme, a loop independent dependence is a dependence of infinite depth. The reason for this will become clear shortly.

Definition. For statements S1 and S2, η'(S1, S2), the nesting level of the direct dependence of S2 on S1, is the maximum depth at which the dependence exists,

that is,

      η'(S1, S2) = max{d ≥ 1 | S1 Δd S2} if S1 Δ S2, and 0 otherwise.

LEMMA 2
(a) If d1 ≥ d2, then S1 Δd1 S2 implies S1 Δd2 S2.
(b) If S1 Δ S2 and τ = η'(S1, S2), then S1 Δτ S2 but not S1 Δτ+1 S2.

PROOF. Obvious. □

Clearly, η'(S1, S2) is easy to compute for any pair of statements.

It is customary to view dependence as a transitive relation. That is, if S2 depends on S1, and S3 depends on S2, then S3 depends on S1, albeit indirectly. Henceforth, we will say that S2 depends on S1 if S1 Δ+ S2, where Δ+ is the transitive closure of Δ, that is, Δ+ = Δ + Δ2 + Δ3 + .... In other words, S1 Δ+ S2 if there exist statements T0, T1, ..., Tn (n ≥ 1) such that T0 = S1, Tn = S2, and

      S1 = T0 Δ T1 Δ T2 Δ ... Δ Tn = S2.

We shall refer to the sequence (T0, T1, ..., Tn) as a path in the dependence graph. It is also possible to extend the notion of loop carried dependence by taking the transitive closure. That is, S1 Δd+ S2 if there exists a path T0, T1, ..., Tn (n ≥ 1) such that

      S1 = T0 Δd T1 Δd ... Δd Tn = S2.

Next we extend η' to dependence paths.

Definition. Let P = (T0, T1, ..., Tn) be a path in the dependence graph; in other words, T0 Δ T1 Δ ... Δ Tn. The nesting level of P, η'(P), is the maximum depth at which all the dependences in the path still exist:

      η'(P) = max{d ≥ 1 | Ti Δd Ti+1 for all i, 0 ≤ i ≤ n-1}.

LEMMA 3. If P = (T0, T1, ..., Tn), then η'(P) = min{η'(Ti, Ti+1) | 0 ≤ i ≤ n-1}.

PROOF. Let τ = min{η'(Ti, Ti+1) | 0 ≤ i ≤ n-1}. By Lemma 2, all of the dependences Ti Δτ Ti+1 exist, while at least one such dependence, the minimum, does not exist at level τ+1. □

Finally, we extend the concept of nesting level to arbitrary pairs of statements.

Definition. For arbitrary statements S1 and S2, η(S1, S2), the nesting level of the dependence, is the maximum depth d at which there exists a path (T0, T1, ..., Tn) such that S1 = T0 Δd T1 Δd ... Δd Tn = S2; that is,

      η(S1, S2) = max{d ≥ 1 | S1 Δd+ S2} if S1 Δ+ S2, and 0 otherwise.

Note that we must distinguish η'(S1, S2), the depth of a direct dependence, and η(S1, S2), the depth of a dependence. This is because it is possible that there exists a dependence path from S1 to S2 at a depth greater than that of the direct dependence. In other words, η(S1, S2) ≥ η'(S1, S2), and the inequality may be strict.

THEOREM 6. If S1 Δ+ S2, then η(S1, S2) = max{η'(P) | P a dependence path from S1 to S2}.

PROOF. Let P0 = (U0, U1, ..., Um) be the path from S1 to S2 with maximum nesting level, and let τ = η'(P0). Clearly S1 = U0 Δτ U1 Δτ ... Δτ Um = S2, so η(S1, S2) ≥ τ. Suppose η(S1, S2) > τ. Then there exists a path P = (T0, T1, ..., Tn) such that S1 = T0 Δd T1 Δd ... Δd Tn = S2, where d > τ. But then η'(P) = d > τ, contradicting the maximality of η'(P0). □

Lemma 3 and Theorem 6 establish that the computation of η(S1, S2) for each pair of statements in the program is just a shortest path problem, with min replacing + as the operation used to compose costs along a path (Lemma 3) and max replacing min as the operation to compute the resulting cost at a vertex where two paths join (Theorem 6). Hence, Kleene's algorithm can be used to compute η(S1, S2) for each pair of statements in time proportional to the cube of the number of statements [1].

The concept of depth of a dependence is useful because it permits partial vectorization.

Definition. Consider a statement S that depends upon itself (S Δ+ S). The parallelism index of S, p(S), is defined by

      p(S) = m - η(S, S)

where m is the number of loops containing S. Observe that if p(S) > 0, then S may be executed in parallel in the innermost p(S) loops surrounding it.
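The max-min closure suggested by Lemma 3 and Theorem 6 can be computed with a Floyd-Warshall-style iteration, which is cubic like the Kleene formulation cited above. The sketch below (a hypothetical matrix representation, not PFC's code) uses the direct nesting levels of the Section 4.3 example, whose values are derived just below, and recovers the parallelism indices:

    # Hypothetical matrix representation: direct[i][j] = eta'(Si+1, Sj+1),
    # 0 when there is no direct dependence, INF for loop independent edges.
    INF = float('inf')

    def nesting_levels(direct):
        # Floyd-Warshall over the (max, min) semiring: eta(i, j) is the
        # best bottleneck over all nonempty dependence paths.
        V = len(direct)
        eta = [row[:] for row in direct]
        for m in range(V):
            for i in range(V):
                for j in range(V):
                    eta[i][j] = max(eta[i][j], min(eta[i][m], eta[m][j]))
        return eta

    # Section 4.3 example: eta'(S1, S2) = 2 and eta'(S2, S1) = 1 (derived
    # just below).
    eta = nesting_levels([[0, 2],
                          [1, 0]])
    loops = [3, 3]                           # each statement is in 3 loops
    for s in range(2):
        print('eta(S%d, S%d) = %d, p(S%d) = %d'
              % (s + 1, s + 1, eta[s][s], s + 1, loops[s] - eta[s][s]))
    # eta(S1, S1) = 1, p(S1) = 2; eta(S2, S2) = 1, p(S2) = 2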

As an example, consider the multiple loop from Section 4.3:

      DO 100 i = 1, 100
         DO 90 j = 1, 100
            DO 30 k = 1, 100
S1             X(i, j+1, k) = A(i, j, k) + 10
30          CONTINUE
            DO 80 l = 1, 50
S2             A(i+1, j, l) = X(i, j, l)
80          CONTINUE
90       CONTINUE
100   CONTINUE

In this loop S1 Δ S2 and S2 Δ S1; however, η'(S1, S2) = 2, while η'(S2, S1) = 1. From the definitions of η and p, we have η(S1, S1) = η(S2, S2) = 1 and p(S1) = p(S2) = 2. Thus both inner loops surrounding each statement may be run in vector mode. The translated program would be

      DO 100 i = 1, 100
         X(i, 2:101, 1:100) = A(i, 1:100, 1:100) + 10
         A(i+1, 1:100, 1:50) = X(i, 1:100, 1:50)
100   CONTINUE

This is the same result that was obtained in Section 4.3.

The depth of a dependence represents the number of loops that, if iterated sequentially, will guarantee that the dependence is satisfied. That is, a level one dependence will be preserved so long as the outer loop is iterated sequentially, regardless of what is done to inner loops or to statement order within the loop. In this context, the depth of a loop independent dependence is correctly taken to be infinity, since it is impossible to guarantee that a loop independent dependence is satisfied by any iteration of loops. Rather, relative statement order preserves those dependences, regardless of the iteration of the surrounding loops. Although there exist infinite level dependences, η(S, S) can never be greater than the number of loops surrounding statement S. The reason is very simple: any path that has S as both start and end must contain a loop carried dependence, because loop independent dependences are always directed forward. Therefore, p(S) is always nonnegative. The next section presents a general procedure to find p(S) for each S in a program and to generate FORTRAN 8x code that runs the innermost p(S) loops in parallel.

5. GENERATION OF VECTOR CODE

In this section, we demonstrate how the test for dependence can be used to generate vector code. This material was briefly introduced in Section 4.4. This section generalizes the ideas presented there and discusses several techniques for improving the quality of the generated code.

5.1 The Augmented Dependence Graph

Earlier in the paper, we discussed the concept of a dependence graph, in which each statement was represented by a vertex and each dependence by a directed edge from the statement depended upon to the dependent statement (the edges indicate the direction in which control must flow). In the augmented dependence graph, we shall attach auxiliary information to each edge in the form of a label.

Definition. The augmented dependence graph D is an ordered pair (V, E), where V, the set of vertices, represents the statements in a program and E, the set of edges, represents interstatement dependences. Each edge e ∈ E may be viewed as a quadruple (S1, S2, t, k), where S1 and S2 are two statements such that S1 δk S2 and where t is the type of the dependence (true, anti-, or output). The pair (t, k) is the label of the dependence edge.
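A direct transcription of this definition into a data structure might look as follows (the field names are our own; the paper specifies only the quadruple):

    # Field names are our own; the paper specifies only the quadruple.
    from collections import namedtuple

    Edge = namedtuple('Edge', ['source', 'sink', 'dep_type', 'carrier'])

    # The Section 4.3 example: S1 delta_2 S2 and S2 delta_1 S1, both true
    # dependences, carried by the j loop and the i loop respectively.
    augmented = [Edge('S1', 'S2', 'true', 2),
                 Edge('S2', 'S1', 'true', 1)]
    for e in augmented:
        print('%s -> %s with label (%s, %d)'
              % (e.source, e.sink, e.dep_type, e.carrier))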


More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Math 5593 Linear Programming Lecture Notes

Math 5593 Linear Programming Lecture Notes Math 5593 Linear Programming Lecture Notes Unit II: Theory & Foundations (Convex Analysis) University of Colorado Denver, Fall 2013 Topics 1 Convex Sets 1 1.1 Basic Properties (Luenberger-Ye Appendix B.1).........................

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

Dynamic Programming Algorithms

Dynamic Programming Algorithms Based on the notes for the U of Toronto course CSC 364 Dynamic Programming Algorithms The setting is as follows. We wish to find a solution to a given problem which optimizes some quantity Q of interest;

More information

On the Maximum Throughput of A Single Chain Wireless Multi-Hop Path

On the Maximum Throughput of A Single Chain Wireless Multi-Hop Path On the Maximum Throughput of A Single Chain Wireless Multi-Hop Path Guoqiang Mao, Lixiang Xiong, and Xiaoyuan Ta School of Electrical and Information Engineering The University of Sydney NSW 2006, Australia

More information

Thunks. A Way of Compiling Procedure Statements with Some Comments on Procedure Declarations* P. Z. Ingerman

Thunks. A Way of Compiling Procedure Statements with Some Comments on Procedure Declarations* P. Z. Ingerman Reprinted from the~communications OF THE ASSOCIATION FOR COMPUTING MACHINERY Volume 4, Number 1, January 1961 Made in U.S.A. Thunks A Way of Compiling Procedure Statements with Some Comments on Procedure

More information

2 Introduction to operational semantics

2 Introduction to operational semantics 2 Introduction to operational semantics This chapter presents the syntax of a programming language, IMP, a small language of while programs. IMP is called an "imperative" language because program execution

More information

3.2 Recursions One-term recursion, searching for a first occurrence, two-term recursion. 2 sin.

3.2 Recursions One-term recursion, searching for a first occurrence, two-term recursion. 2 sin. Chapter 3 Sequences 3.1 Summation Nested loops, while-loops with compound termination criteria 3.2 Recursions One-term recursion, searching for a first occurrence, two-term recursion. In Chapter 2 we played

More information

STABILITY AND PARADOX IN ALGORITHMIC LOGIC

STABILITY AND PARADOX IN ALGORITHMIC LOGIC STABILITY AND PARADOX IN ALGORITHMIC LOGIC WAYNE AITKEN, JEFFREY A. BARRETT Abstract. Algorithmic logic is the logic of basic statements concerning algorithms and the algorithmic rules of deduction between

More information

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,

More information

BCN Decision and Risk Analysis. Syed M. Ahmed, Ph.D.

BCN Decision and Risk Analysis. Syed M. Ahmed, Ph.D. Linear Programming Module Outline Introduction The Linear Programming Model Examples of Linear Programming Problems Developing Linear Programming Models Graphical Solution to LP Problems The Simplex Method

More information

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into 2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into the viewport of the current application window. A pixel

More information

Disjoint Support Decompositions

Disjoint Support Decompositions Chapter 4 Disjoint Support Decompositions We introduce now a new property of logic functions which will be useful to further improve the quality of parameterizations in symbolic simulation. In informal

More information

Some Advanced Topics in Linear Programming

Some Advanced Topics in Linear Programming Some Advanced Topics in Linear Programming Matthew J. Saltzman July 2, 995 Connections with Algebra and Geometry In this section, we will explore how some of the ideas in linear programming, duality theory,

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

THE TRANSITIVE REDUCTION OF A DIRECTED GRAPH*

THE TRANSITIVE REDUCTION OF A DIRECTED GRAPH* SIAM J. COMPUT. Vol. 1, No. 2, June 1972 THE TRANSITIVE REDUCTION OF A DIRECTED GRAPH* A. V. AHO, M. R. GAREY" AND J. D. ULLMAN Abstract. We consider economical representations for the path information

More information

SECTION 5.1. Sequences

SECTION 5.1. Sequences SECTION 5.1 Sequences Sequences Problem: count number of ancestors one has 2 parents, 4 grandparents, 8 greatgrandparents,, written in a row as 2, 4, 8, 16, 32, 64, 128, To look for pattern of the numbers,

More information

GRAPH DECOMPOSITION BASED ON DEGREE CONSTRAINTS. March 3, 2016

GRAPH DECOMPOSITION BASED ON DEGREE CONSTRAINTS. March 3, 2016 GRAPH DECOMPOSITION BASED ON DEGREE CONSTRAINTS ZOÉ HAMEL March 3, 2016 1. Introduction Let G = (V (G), E(G)) be a graph G (loops and multiple edges not allowed) on the set of vertices V (G) and the set

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Generalized Network Flow Programming

Generalized Network Flow Programming Appendix C Page Generalized Network Flow Programming This chapter adapts the bounded variable primal simplex method to the generalized minimum cost flow problem. Generalized networks are far more useful

More information

COMPUTER THE DESIGN AND ANALYSIS OF ALGO R ITH M S. John E H opm Cornell University Jeffrey D. Ulhn Princeton University. Ahd V. Ah0 Bell Laboratories

COMPUTER THE DESIGN AND ANALYSIS OF ALGO R ITH M S. John E H opm Cornell University Jeffrey D. Ulhn Princeton University. Ahd V. Ah0 Bell Laboratories THE DESIGN AND ANALYSIS OF COMPUTER ALGO R ITH M S Ahd V. Ah0 Bell Laboratories John E H opm Cornell University Jeffrey D. Ulhn Princeton University A W Addison-Wesley Publishing Company Reading, Massachusetts

More information

Chapter 18 out of 37 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal.

Chapter 18 out of 37 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal. Chapter 8 out of 7 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal 8 Matrices Definitions and Basic Operations Matrix algebra is also known

More information

On Universal Cycles of Labeled Graphs

On Universal Cycles of Labeled Graphs On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States

More information

Dynamic Programming Algorithms

Dynamic Programming Algorithms CSC 364S Notes University of Toronto, Fall 2003 Dynamic Programming Algorithms The setting is as follows. We wish to find a solution to a given problem which optimizes some quantity Q of interest; for

More information

An Improved Measurement Placement Algorithm for Network Observability

An Improved Measurement Placement Algorithm for Network Observability IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper

More information

Unit-5 Dynamic Programming 2016

Unit-5 Dynamic Programming 2016 5 Dynamic programming Overview, Applications - shortest path in graph, matrix multiplication, travelling salesman problem, Fibonacci Series. 20% 12 Origin: Richard Bellman, 1957 Programming referred to

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information

Language Basics. /* The NUMBER GAME - User tries to guess a number between 1 and 10 */ /* Generate a random number between 1 and 10 */

Language Basics. /* The NUMBER GAME - User tries to guess a number between 1 and 10 */ /* Generate a random number between 1 and 10 */ Overview Language Basics This chapter describes the basic elements of Rexx. It discusses the simple components that make up the language. These include script structure, elements of the language, operators,

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information

Verifying a Border Array in Linear Time

Verifying a Border Array in Linear Time Verifying a Border Array in Linear Time František Franěk Weilin Lu P. J. Ryan W. F. Smyth Yu Sun Lu Yang Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

A matching of maximum cardinality is called a maximum matching. ANn s/2

A matching of maximum cardinality is called a maximum matching. ANn s/2 SIAM J. COMPUT. Vol. 2, No. 4, December 1973 Abstract. ANn s/2 ALGORITHM FOR MAXIMUM MATCHINGS IN BIPARTITE GRAPHS* JOHN E. HOPCROFT" AND RICHARD M. KARP The present paper shows how to construct a maximum

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

CS2 Algorithms and Data Structures Note 10. Depth-First Search and Topological Sorting

CS2 Algorithms and Data Structures Note 10. Depth-First Search and Topological Sorting CS2 Algorithms and Data Structures Note 10 Depth-First Search and Topological Sorting In this lecture, we will analyse the running time of DFS and discuss a few applications. 10.1 A recursive implementation

More information

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No. # 10 Lecture No. # 16 Machine-Independent Optimizations Welcome to the

More information

Lecture Notes on Program Equivalence

Lecture Notes on Program Equivalence Lecture Notes on Program Equivalence 15-312: Foundations of Programming Languages Frank Pfenning Lecture 24 November 30, 2004 When are two programs equal? Without much reflection one might say that two

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

5 The Theory of the Simplex Method

5 The Theory of the Simplex Method 5 The Theory of the Simplex Method Chapter 4 introduced the basic mechanics of the simplex method. Now we shall delve a little more deeply into this algorithm by examining some of its underlying theory.

More information

Module 7. Independent sets, coverings. and matchings. Contents

Module 7. Independent sets, coverings. and matchings. Contents Module 7 Independent sets, coverings Contents and matchings 7.1 Introduction.......................... 152 7.2 Independent sets and coverings: basic equations..... 152 7.3 Matchings in bipartite graphs................

More information

MATH 682 Notes Combinatorics and Graph Theory II

MATH 682 Notes Combinatorics and Graph Theory II 1 Matchings A popular question to be asked on graphs, if graphs represent some sort of compatability or association, is how to associate as many vertices as possible into well-matched pairs. It is to this

More information

Lecture 2 - Introduction to Polytopes

Lecture 2 - Introduction to Polytopes Lecture 2 - Introduction to Polytopes Optimization and Approximation - ENS M1 Nicolas Bousquet 1 Reminder of Linear Algebra definitions Let x 1,..., x m be points in R n and λ 1,..., λ m be real numbers.

More information

2 The Fractional Chromatic Gap

2 The Fractional Chromatic Gap C 1 11 2 The Fractional Chromatic Gap As previously noted, for any finite graph. This result follows from the strong duality of linear programs. Since there is no such duality result for infinite linear

More information

Concurrent Reading and Writing of Clocks

Concurrent Reading and Writing of Clocks Concurrent Reading and Writing of Clocks LESLIE LAMPORT Digital Equipment Corporation As an exercise in synchronization without mutual exclusion, algorithms are developed to implement both a monotonic

More information

Boolean networks, local models, and finite polynomial dynamical systems

Boolean networks, local models, and finite polynomial dynamical systems Boolean networks, local models, and finite polynomial dynamical systems Matthew Macauley Department of Mathematical Sciences Clemson University http://www.math.clemson.edu/~macaule/ Math 4500, Spring 2017

More information

Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have

Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have been red in the sequence up to and including v i (s) is deg(v)? s(v), and by the induction hypothesis this sequence

More information

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees.

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees. Tree 1. Trees and their Properties. Spanning trees 3. Minimum Spanning Trees 4. Applications of Minimum Spanning Trees 5. Minimum Spanning Tree Algorithms 1.1 Properties of Trees: Definition: A graph G

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Monotone Paths in Geometric Triangulations

Monotone Paths in Geometric Triangulations Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

CL i-1 2rii ki. Encoding of Analog Signals for Binarv Symmetric Channels A. J. BERNSTEIN, MEMBER, IEEE, K. STEIGLITZ, MEMBER,

CL i-1 2rii ki. Encoding of Analog Signals for Binarv Symmetric Channels A. J. BERNSTEIN, MEMBER, IEEE, K. STEIGLITZ, MEMBER, IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-12, NO. 4, OCTOBER 1966 425 Encoding of Analog Signals for Binarv Symmetric Channels A. J. BERNSTEIN, MEMBER, IEEE, K. STEIGLITZ, MEMBER, IEEE, AND J. E.

More information

5. Lecture notes on matroid intersection

5. Lecture notes on matroid intersection Massachusetts Institute of Technology Handout 14 18.433: Combinatorial Optimization April 1st, 2009 Michel X. Goemans 5. Lecture notes on matroid intersection One nice feature about matroids is that a

More information

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:!

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:! Class Notes 18 June 2014 Tufts COMP 140, Chris Gregg Detecting and Enhancing Loop-Level Parallelism Loops: the reason we can parallelize so many things If the compiler can figure out if a loop is parallel,

More information

5 Matchings in Bipartite Graphs and Their Applications

5 Matchings in Bipartite Graphs and Their Applications 5 Matchings in Bipartite Graphs and Their Applications 5.1 Matchings Definition 5.1 A matching M in a graph G is a set of edges of G, none of which is a loop, such that no two edges in M have a common

More information

ON THE STRONGLY REGULAR GRAPH OF PARAMETERS

ON THE STRONGLY REGULAR GRAPH OF PARAMETERS ON THE STRONGLY REGULAR GRAPH OF PARAMETERS (99, 14, 1, 2) SUZY LOU AND MAX MURIN Abstract. In an attempt to find a strongly regular graph of parameters (99, 14, 1, 2) or to disprove its existence, we

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

NODE LOCALIZATION IN WSN: USING EUCLIDEAN DISTANCE POWER GRAPHS

NODE LOCALIZATION IN WSN: USING EUCLIDEAN DISTANCE POWER GRAPHS CHAPTER 6 NODE LOCALIZATION IN WSN: USING EUCLIDEAN DISTANCE POWER GRAPHS Abstract Localization of sensor nodes in a wireless sensor network is needed for many practical purposes. If the nodes are considered

More information

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 2 Review Dr. Ted Ralphs IE316 Quiz 2 Review 1 Reading for The Quiz Material covered in detail in lecture Bertsimas 4.1-4.5, 4.8, 5.1-5.5, 6.1-6.3 Material

More information

8 Matroid Intersection

8 Matroid Intersection 8 Matroid Intersection 8.1 Definition and examples 8.2 Matroid Intersection Algorithm 8.1 Definitions Given two matroids M 1 = (X, I 1 ) and M 2 = (X, I 2 ) on the same set X, their intersection is M 1

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation

Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation and the type of an object. Even simple scaling turns a

More information

Will introduce various operators supported by C language Identify supported operations Present some of terms characterizing operators

Will introduce various operators supported by C language Identify supported operations Present some of terms characterizing operators Operators Overview Will introduce various operators supported by C language Identify supported operations Present some of terms characterizing operators Operands and Operators Mathematical or logical relationships

More information