
Compiling FORTRAN for Massively Parallel Architectures

Peter Brezany
University of Vienna
Institute for Software Technology and Parallel Systems
Brunnerstrasse 72, A-1210 Vienna, Austria

1 Introduction

One of the most fundamental challenges facing computer science today is the need to develop the algorithms and programming tools required to exploit the vast computing power of massively parallel computers. Distributed-memory multiprocessors (DMMPs) provide an attractive approach to high speed computing because their performance can easily be scaled up by increasing the number of processor-memory modules. Many high performance DMMPs are now commercially available. However, the enormous potential these machines provide can only be fully exploited when they are programmed effectively, which has proved to be a difficult task. The efficiency of parallel programs depends critically on the proper utilization of specific architectural features of the underlying hardware, which makes automatic support of the program development process highly desirable. Therefore, adequate programming environments supporting the program development process for the scientific user are urgently needed. A significant amount of software research on programming environments for DMMPs is currently underway, both in academia and in industry. The research effort can be broadly categorized into three classes, namely parallelizing compilers, languages, and support tools.

An important problem faced by DMMP installations is the conversion of the large body of existing scientific Fortran code into a form suitable for parallel processing on DMMPs. The exclusive use of new parallel languages would force the user to transform each program manually, an extremely time-consuming and error-prone process. This represents an enormous cost to the institutions concerned. On the other hand, Fortran is still the primary language for the development of scientific software. Therefore, the development of Fortran-oriented software tools for DMMPs is a high priority objective.

While DMMPs are less expensive to build than shared-memory systems (SMS) and easily scalable to a large number of processors, the programming paradigm associated with SMS offers clear advantages by providing all processes with uniform access to a global shared memory. Various efforts have been made to bridge that gap by implementing a virtual shared memory on top of a DMMP. This can be done either in hardware or by appropriate software mechanisms. A dominant method for providing a virtual shared memory on a DMMP is automatic parallelization: in this approach, sequential programs (usually written in Fortran) are automatically transformed into explicitly parallel programs in a Fortran superset, utilizing message-passing operations. During the last few years, the basic compilation techniques have been established, and a number of parallelizing systems have been successfully implemented. In virtually all these systems, parallelization is guided by exploiting data parallelism, whereby the data distribution has to be explicitly specified by the user. Data distribution involves partitioning the sequential program's data domain into disjoint sets of variables and mapping these sets to the processors of the DMMP. The task of the compiler then essentially consists of adapting the program code in such a way that each processor executes all assignments to the data which have been mapped to it, inserting communication where necessary (a minimal sketch of this transformation is given at the end of this section).

Recently, a consortium of researchers from industry, government labs and academia formed the High Performance Fortran Forum to develop a standard set of extensions for Fortran 90 which would provide a portable interface to a wide variety of parallel architectures. The forum has produced a draft proposal for a language, called High Performance Fortran (HPF) [11], which focuses mainly on issues of distributing data across the memories of a distributed memory multiprocessor. The main concepts in HPF have been derived from a number of predecessor languages, including DINO [21], CM Fortran [28], Kali [17], Fortran D [7], and Vienna Fortran, with the last two languages having the largest impact. Within the past few years, a standard technique for compiling FORTRAN for distributed memory has evolved, and several prototype systems have been developed, including the Vienna Fortran Compilation System (VFCS), which was among the very first tools of this kind.

In this paper, we outline the basic principles of compilers and languages for distributed memory machines, which are based on the data-parallel Single-Program-Multiple-Data (SPMD) paradigm. The remainder of this paper is organized as follows. Section 2 describes the basic features of Vienna Fortran and provides some definitions and terminology used in later sections. Section 3 provides an overview of the parallelization strategy used in VFCS. Section 4 describes related work and Section 5 contains some concluding remarks.
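To make the owner-computes scheme concrete, the following minimal sketch (a hypothetical fragment, not one of the paper's examples) contrasts a sequential loop with the corresponding SPMD node program. It assumes a blockwise distribution of the arrays a and b and borrows the convention, used by VFCS in Section 3, that $L and $R denote the bounds of a processor's local array segment:

c     sequential source loop
      do i = 2, n
         a(i) = b(i-1)
      enddo

c     SPMD node program executed by every processor: the loop
c     bounds are restricted so that each processor assigns only
c     the elements of a it owns; communication for the single
c     non-local element b($L-1) is inserted before the loop
      do i = max($L,2), min($R,n)
         a(i) = b(i-1)
      enddo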

2 The Vienna Fortran Language

In cooperation with ICASE, NASA, a machine-independent language extension to Fortran 77 called Vienna Fortran has been proposed. In this section, we present the basic language features of Vienna Fortran.

2.1 The Programming Model

Vienna Fortran assumes that a program will be executed by a machine with one or more processors according to the SPMD programming model as described above. This model requires that each participating processor execute the same program; parallelism is obtained by applying the computation to different parts of the data domain simultaneously. The generated code will store the local parts of arrays and the overlap areas locally and use message passing, optimized where possible, to exchange data. It will also map the logical processor structures declared by the user to the physical processors which execute the program. These transformations are, however, transparent to the Vienna Fortran programmer.

2.2 The Language Model Simplified

The data space A is the set of arrays declared in the program Q (scalar variables can be interpreted as one-dimensional arrays with one element).

Definition 1 (Index domain) An index domain of rank (dimension) n is any set I that can be represented in the form I = D_1 × D_2 × ... × D_n, where n ≥ 1 and, for all i with 1 ≤ i ≤ n, D_i is a nonempty, linearly ordered set of integer numbers.

Let A ∈ A denote an arbitrary array. The index domain of A is denoted by I^A.

2.2.1 Processors

The set of processors, P, is specified in the program by the declaration of a processor array, which provides a means of naming and accessing its elements. For a processor array R, I^R denotes the associated index domain, and index_R : P → I^R is the function mapping each processor to its index.
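As a small worked example of these definitions (the concrete declaration is illustrative only): a processor array declared with extents 2 and 3, say R(2,3), has the index domain I^R = [1:2] × [1:3], an index domain of rank 2 representable as D_1 × D_2 with D_1 = {1,2} and D_2 = {1,2,3}; index_R assigns to each of the six processors its coordinate pair in I^R.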

2.2.2 Array Distribution

A distribution of an array maps each array element to one or more processors, which become the owners of the element and, in this capacity, store the element in their local memory. We model distributions by functions between the associated index domains.

Definition 2 (Array Distribution) Let A ∈ A, and assume that R is a processor array. A distribution of the array A with respect to R is defined by a mapping

      μ^A_R : I^A → P(I^R) ∖ {∅}

where P(I^R) denotes the power set of I^R.

2.2.3 Array Files

Input/Output statements control the data flow between program variables and the file system. The file system of machine M may reside physically on a host system and/or a Concurrent Input/Output System.

Definition 3 (File System) The file system F of machine M is defined as the union of a set of standard FORTRAN files F_ST and a set of array files F_ARR.

When transferring elements of a distributed array to an array file, each processor performs the input/output operations controlling the transfer of the local part of the array to or from the corresponding part of the file. A suitable file structuring is necessary to achieve high transfer efficiency. Array files in Vienna Fortran may contain values from more than one array. Therefore, array files are structured into records. Each record contains an array distribution descriptor followed by a sequence of data elements associated with this array.

Definition 4 (Array File) An array file F ∈ F_ARR is a sequence of distributed array records

      < darec_1, darec_2, ... >

Each record can be associated with a distributed array, A, and has the form (d^A, O^A), where d^A is a distribution descriptor with the structure (I^A, I^R, μ^A_R). Here, μ^A_R is the distribution used for writing out the sequence of data elements, and I^A and I^R are the underlying array and processor index domains, respectively, used for defining this distribution. O^A is a sequence of data elements stored in this record.

2.3 The Language Extensions: An Overview

Vienna Fortran includes all of the following language extensions to Fortran 77. Many of them will be discussed in the examples below, where their use is further described in an informal manner. A complete and precise description of the language, examples of the use of these extensions and a demonstration of their expressiveness can be found in [3, 31].

The PROCESSORS statement The user may declare and name one or more processor arrays by means of the PROCESSORS statement. The first such array is called the primary processor array; others are declared using the keyword RESHAPE. They refer to precisely the same set of processors, providing different views of it: a correspondence is established between any two processor arrays by the column-major ordering of array elements defined in Fortran 77. Expressions for the bounds of processor arrays may contain symbolic names, whose values are obtained from the environment at load time. Assertions may be used to impose restrictions on the values that can be assumed by these variables. This allows the program to be parameterized by the number of processors. The PROCESSORS statement is optional in each program unit. For example:

      PROCESSORS MYP3(NP1, NP2, NP3) RESHAPE MYP2(NP1, NP2*NP3)

Processor References Processor arrays may be referred to in their entirety by specifying the name only. Array section notation, as introduced in Fortran 90, is used to describe subsets of processor arrays; individual processors may be referenced by the usual array subscript notation. The dimensions of a processor array may be permuted.

Processor Intrinsics The number of processors on which the program executes may be accessed through the intrinsic function $NP. A one-dimensional processor array, $P(1:$NP), is always implicitly declared and may be referred to; it is the default primary array if there is no PROCESSORS statement in a program. The index of the executing processor in $P is returned by the intrinsic function $MY_PROC.

Distribution Annotations Distribution annotations may be appended to array declarations to specify direct and implicit distributions of the arrays to processors. Direct distributions consist of the keyword DIST together with a parenthesized distribution expression and an optional TO clause. The TO clause specifies the set of processors to which the array(s) are distributed; if it is not present, the primary processor array is selected by default.

A distribution expression consists of a list of distribution functions. There is either one function describing the distribution of the entire array, which may have more than one dimension, or each function in the list distributes the corresponding array dimension to a dimension of the processor array. The elision symbol ":" is provided to indicate that an array dimension is not distributed. If there are fewer distributed dimensions in the data array than there are in the processor array, the array will be replicated across the remaining processor dimensions. Both intrinsic functions and user-defined functions may be used to specify the distribution of an array dimension.

      REAL A(L,N,M), B(M,M,M) DIST ( BLOCK, CYCLIC, BLOCK )
      REAL C(1200) DIST ( MYOWNFUNC ) TO $P

By default, an array which is not explicitly distributed is replicated on all processors.

Distribution Intrinsics Direct distributions may be specified by using the elision symbol, as described above, and the BLOCK and CYCLIC intrinsic functions. The BLOCK function distributes an array dimension to a processor dimension in evenly sized segments. The CYCLIC (or scatter) distribution maps the elements of a dimension of the data array in round-robin fashion to a dimension of the processor array; if a width is specified, contiguous segments of that width are distributed in a round-robin manner. The INDIRECT distribution intrinsic function enables the specification of a mapping array which allows each array element to be distributed individually to a single processor. The mapping array must be of the same size and shape as the array being distributed, and its values are processor numbers (in $P):

      INTEGER IAPROCS(1000)
      REAL A(1000) DIST ( INDIRECT(IAPROCS) )

Thus, for example, the value of IAPROCS(60) is the number of the processor to which A(60) is to be mapped. Note that IAPROCS must be defined before it is used to specify the distribution of A, and that each element of A can be mapped to only one processor.

Dynamic Distributions and the DISTRIBUTE Statement By default, the distribution of an array is static: it does not change within the scope of the declaration to which the distribution has been appended. The keyword DYNAMIC is provided to declare an array distribution to be dynamic.

This permits the array to be the target of a DISTRIBUTE statement. A dynamically distributed array may optionally be provided with an initial distribution in the manner described above for static distributions. A range of permissible distributions may be specified when the array is declared by giving the keyword RANGE and a set of explicit distributions; if this does not appear, the array may take on any permitted distribution with the appropriate dimensionality during execution of the program. Finally, the distribution of such an array may be dynamically connected to the distribution of another dynamically distributed array in a specified fixed manner. This is expressed by means of the CONNECT keyword: if the latter array is redistributed, then the connected array will automatically be redistributed as well.

      REAL F(200,200) DYNAMIC,
     &     RANGE (( BLOCK, BLOCK ), ( CYCLIC(5), BLOCK ))

The distribute statement begins with the keyword DISTRIBUTE and a list of the arrays which are to be distributed at runtime. Following the separator symbol "::", a direct, implicit or indirect distribution is specified using the same constructs as those for specifying static distributions. The statement has an optional NOTRANSFER clause; if it appears, it specifies that the arrays to which it applies are to be distributed according to the specification, but that the old data (if there is any) is not to be transferred. Thus only the access function is modified. For example, in the statement

      DISTRIBUTE A, B :: ( CYCLIC(10) ) NOTRANSFER (B)

both arrays A and B are redistributed with the new distribution CYCLIC(10); however, for the array B only the access function is changed, and the old values are not transferred to the new locations. Whenever an array is redistributed via a distribute statement, any arrays connected to it are automatically redistributed as well, so as to maintain the relationship between their distributions.

Procedures Dummy array arguments may be distributed in the same way as other arrays. If the given distribution differs from that of the actual argument, then redistribution will take place. If the actual argument is dynamically distributed, then it may be permanently modified in a procedure; if it is statically distributed, then the original distribution must be restored on procedure exit. This can always be enforced by the keyword RESTORE. While argument transmission is generally call by reference, there are situations in which arguments must be copied. The user can suppress this by specifying a NOCOPY.
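As an illustration of the redistribution rule just described (a hypothetical fragment using the DIST notation introduced above, not an example from the paper): the actual argument A is BLOCK-distributed while the dummy argument X requests CYCLIC, so A is redistributed on entry to SUB; since A is statically distributed, its original BLOCK distribution is restored on exit.

      REAL A(100) DIST ( BLOCK )
      ...
      CALL SUB(A)
      ...
      SUBROUTINE SUB(X)
      REAL X(100) DIST ( CYCLIC )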

Dummy array arguments may also inherit the distribution of the actual argument; this is specified by using an "*" as the distribution expression:

      CALL EX(A,B(1:N,10),N,3)
      ...
      SUBROUTINE EX(X,Y,N,J)
      REAL X(N,N) DIST (*)
      REAL Y(N) DIST ( BLOCK ) TO MYP2(1:N,J)

Array sections may be passed as arguments to subroutines using the syntax of Fortran 90.

The FORALL Loop The FORALL loop enables the user to assert that the iterations of a loop are independent and can be executed in parallel. (Note that the forall loop, as introduced here, is not the forall loop proposed during the development of Fortran 90 and in HPF.) A precondition for the correctness of this loop is the absence of loop-carried dependences. There is an implicit synchronization at the beginning and at the end of such a loop. Private variables are permitted within forall loops; they are known only in the forall loop in which they are declared, and each loop iteration has its own copy. The iterations of the loop may be assigned explicitly to processors if the user desires, or they may be performed by the processor which owns a specified datum. This can be done through the optional on clause specified in the forall loop header.

      FORALL I = 1,N ON MASTER(C(K(I)))
         Y(K(I)) = X(I) + C(K(I))*Y(K(I))
      END FORALL

      (a) Parallel loop with indirect accesses

      FORALL I = 1,N ON $P(NOP(I))
         REAL T
         ...
      END FORALL

      (b) Parallel loop with a private variable

Figure 1: Parallel loops.

The on-clause in the example shown in Figure 1a specifies that the i-th iteration (1 ≤ i ≤ N) of the loop is executed on the processor MASTER(C(K(i))), where MASTER(ref) returns the uniquely defined processor p which owns the array element denoted by ref. The processor may also be specified explicitly, as in ON R1(I), where R1 is a processor array.

In the second parallel loop (Figure 1b) the on-clause directly refers to the implicit processor array, with the i-th iteration assigned to the k-th processor, where k is the current value of the array element denoted by NOP(i). T is declared private in the forall loop; logically there are N copies of the real variable T, one for each iteration of the loop. Thus assignments to such variables do not cause loop-carried dependences.

A reduction statement may be used within forall loops to perform such operations as global sums (cf. ADD below); the result is not available until the end of the loop. The user may also define reduction functions for operations which are commutative and associative in the mathematical sense. The intrinsic reduction operators provided by Vienna Fortran are ADD, MULT, MAX and MIN. The forall loop in Figure 2a results in the values of the array A being summed and the result being placed in the variable X. In each iteration of the forall loop in Figure 2b, elements of D and E are multiplied, and the result is used to increment the corresponding element of B. In general, all of the arrays B, D, E, X, Y, and Z can be distributed.

      FORALL I = 1, N ON OWNER(A(I))
         ...
         REDUCE ( ADD, X, A(I) )
         ...
      END FORALL

      (a) Summing the values of a distributed array

      FORALL I = 1, N ON OWNER(B(X(I)))
         ...
         REDUCE ( ADD, B(X(I)), D(Y(I))*E(Z(I)) )
         ...
      END FORALL

      (b) Accumulating values onto a distributed array

Figure 2: Applying reduction statements.

Input/Output Files read or written by parallel programs may be stored in a distributed manner or on a single storage device. A separate set of I/O operations is provided to enable individual processor access to data stored across several devices.

I/O Operations The concurrent I/O operations supported by Vienna Fortran can be classified into three groups: data transfer, inquiry and file manipulation operations. These operations deal with whole arrays which are distributed across a set of processors.

Thus, a global synchronization of the processors is required before they cooperate to execute the operation.

Writing to a File The concurrent write statement, CWRITE, can be used to write multiple arrays to a file in a single statement. For each array a distributed array record is written onto the file. Vienna Fortran provides three forms of the concurrent write statement, which affect the order of the data elements written out to the distributed array record.

(i) In the simplest form, the individual distributions of the arrays determine the sequence of array elements written out to the file, as in the following statement:

      CWRITE (f) A1, A2, ..., Ar

where f denotes the I/O unit number and the A_i, 1 ≤ i ≤ r, are array identifiers. This form should be used when the data is going to be read into arrays with the "same" distribution as the A_i. In this situation, the sequence of elements in the file is generated by concatenating the linearized local segments of each array owned by the individual processors, in increasing order of the linearized processor indices. This is the most efficient form of writing out a distributed array, since each processor can independently (and in parallel) write out the piece of the array that it owns, thus utilizing the I/O capacity of the architecture to its fullest.

(ii) Consider the situation in which the data is to be read several times into an array B, where the distribution of B is different from that of the array being written out. In this case, the user may wish to optimize the sequence of data elements in the file according to the distribution of the array B so as to make the multiple read operations more efficient. Additional parameters of the CWRITE statement enable the user to specify (a) the shape of the distributed array to which the read operation will be applied, and (b) its distribution. These additional specifications can then be used by the compiler to determine the sequence of elements in the output file. If a shape is specified, the size of the arrays A1, ..., Ar has to be equal to the product of the extents of the specified index domain, and the resulting rank and shape have to match the distribution specification. For example, the following statement can be used if A is a two-dimensional array:

      CWRITE (f, PROCESSORS='R2D(N,N)',
     &        DIST='(BLOCK,CYCLIC) TO R2D') A

Here, the elements of the array A are written so as to optimize reading them into an array which is distributed as (BLOCK, CYCLIC). Depending on the sequence to be written, the processors (a) could synchronize so as to execute the correct sequence of the individual writes to secondary storage, or (b) could incur the overhead of redistributing the data internally before using a parallel write operation to output the data.

(iii) If the data in a file is to be subsequently read into arrays with different distributions, or if there is no information available about the distribution of the target arrays, the user may let the compiler choose the sequence of the elements to be written out. This is done by specifying 'SYSTEM' as the distribution in the CWRITE statement:

      CWRITE (f, DIST='SYSTEM') A1, ..., Ar

This allows the compiler and the runtime system to cooperate in determining the best possible sequence for writing out the data, given that nothing is known about the distribution of the target arrays.

Reading from a File A read operation on one or more distributed arrays is specified by a statement of the following form:

      CREAD (f) B1, B2, ..., Br

where again f denotes the I/O unit number and the B_i, 1 ≤ i ≤ r, are array identifiers. The operation reads the next r distributed array records in f; the data elements of the i-th record are read into B_i. Note that the semantics of standard FORTRAN I/O operations has to be maintained. That is, if an array A is written out to a file and then read into another array B, the column-major linearization of FORTRAN arrays determines which element of A is read into a given element of B. The actual transfer of data is thus performed by taking into account the distribution descriptor of the i-th record and the shape and distribution of B_i.

Accessing a Distribution Descriptor The distribution descriptor of the current distributed array record in the file can be accessed as follows:

      CDISTR (f)
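Putting the data transfer operations together, a hypothetical round trip might look as follows (the unit f and the array names are illustrative only; CREWIND is one of the file manipulation operations listed in the next subsection). The array A is written in a compiler-chosen element order, and the record is then read back into B, whose distribution may differ from that of A:

      CWRITE (f, DIST='SYSTEM') A
      CREWIND (f)
      CREAD (f) B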

Other Operations

      COPEN (colist)    - Open an array file.
      CCLOSE (cclist)   - Close an array file.
      CSKIP (f, *)      - Skip to the end of the file.
      CSKIP (f, n)      - Skip n distributed array records.
      CBACKARRAY (f)    - Move back to the previous array record.
      CREWIND (f)       - Rewind the file.
      CEOF (f)          - Check for end of file.

The operations COPEN and CCLOSE have the same meaning, and the lists colist and cclist the same form, as their counterparts in FORTRAN. Note that the concurrent I/O operations supported by Vienna Fortran can be applied only to the special array files defined here; conversely, array files can only be accessed through these operations.

3 Overview of the Parallelization Strategy used in VFCS

VFCS is an interactive system: the user and the system work as a team to produce good parallel code. During the transformation process, the user is able to inspect the internal information, supply special information to the system and select transformations. The ability of an interactive system to provide information about the program at a selected level is useful during the parallelization process as well as for the development of new transformations. For example, it is easy to identify parallelization-inhibiting factors by analysing data dependences. Since dependences guide the process of communication optimization, interactive dependence analysis is useful in this context.

The system also provides some assistance for automatic compilation. Many parts of the transformation process are similar in spite of different input programs. The system is able to save and re-execute sequences of transformations, to recompile the same source code automatically, and to define "macro sequences" of transformations for the automatic compilation of new source codes. Determining the best transformation in a given situation, or deciding whether the program is "near" optimal, is a difficult task and requires expert intelligence. Therefore, an intelligent subsystem for advice giving and for controlling the transformation process for a class of DMMPs is being developed.

It is supported by a Parameter Based Performance Prediction Tool [6] which statically computes a set of parameters that characterize the behaviour of the parallel program.

The structure of VFCS is shown in Figure 3. Input programs are written in Vienna Fortran. The main goal of VFCS is to generate a host program and a parallel node program with minimized load imbalance and communication costs. The system is made up of six main components, all of which utilize the program database: the kernel, the three frontend components, the backend and the reconstructor. The components perform the following tasks:

Frontend 1: scanning, parsing and normalization of individual program units
Frontend 2: collection of global information, e.g. the call graph; program splitting
Frontend 3: execution of a predefined transformation sequence on each program unit
Backend: machine dependent transformations
Reconstructor: reconstruction of the parallelized code from the syntax tree
Kernel: system organization, execution of analysis services and transformations

The system components are implemented as individual programs, except for frontend 3, which is integrated in the kernel system. This structure facilitates future extensions for other target machines (using this conception, VFCS is able to generate code for various target machines) and for other input languages.

The kernel implements the user interface. The user may activate other system parts and select analysis services and transformations. He is able to inspect the internal information via the services of the information component. Since the program database contains only the current version of the program, VFCS provides the additional service of saving program versions; the user may thus return to an earlier state of the parallelization process if the performed transformations were not successful. The system provides support for automatic recompilation of the same source via the tracing of the actions performed during an interactive session. The trace files can be executed automatically as long as changes in the source code do not affect the positioning commands which are necessary to identify the source code regions on which special transformations have been executed.

In VFCS, parallelization is guided by a user-defined data partition, specifying a set of processors, a set of distributed arrays and their individual distributions. The distribution of work results from this specification by the owner-computes rule, i.e. a process executes all assignments to array elements mapped to it. Accesses to non-local array elements are implemented via interprocess communication. The overlap concept is used to describe the non-local variables accessed by a processor.

[Figure 3: System Structure. Block diagram of VFCS: a Vienna Fortran program passes through Frontend 1 (building the syntax tree and call graph), Frontend 2, Frontend 3 and the Backend to the Reconstructor, which produces the parallelized program. All components operate on the program database (interprocedural database, dependence graph, partitioning information) under control of the kernel, which comprises the analysis, transformation, information, interactive and tracing components and manages trace files and saved program versions.]

The overlap area of a process consists of all non-local elements in an area around the rectangular section assigned to that process. The overlap concept simplifies storage allocation as well as the optimization of the communication between processors. VFCS performs interprocedural analysis to determine the maximum overlap area for each distributed array in the program. This information is used to statically allocate storage for copies of non-local data.

The overlap concept is especially tailored to the efficient handling of programs whose local computations adhere to a regular pattern. For such programs, the set of non-local variables of a process can be described by a small overlap area around its local segment. However, the overlap concept cannot adequately handle computations with irregular accesses, as they arise in sparse or unstructured problems, for example.

Here, subscript functions often depend on data that are available only at runtime. Because of this dependence on runtime data, worst-case compile-time assumptions must be made by VFCS in most of the cases mentioned above when determining an overlap description. This results in the allocation of memory for every potentially non-local variable and in additional overhead for the resulting communication, part of which may be superfluous. To effectively exploit distributed memory systems for irregular computations, techniques for runtime compilation ([14, 15, 23, 24, 25]) have been developed: the compiler generates code that carries out a runtime analysis of the corresponding parts of the input program. Besides these implementation techniques, languages have been designed that provide means for the specification of irregular distributions and support the efficient compilation of codes from sparse or unstructured applications [3, 7, 17]. In the following we describe how the runtime techniques are integrated with the advanced compile-time parallelization techniques of VFCS [1].

Currently, parallelization in VFCS is performed in five steps, which are illustrated by a simple example (Figure 4).

Step 1: Program Splitting. Program splitting transforms the input program into a host and a node program. All I/O statements are collected in the host program, and communication statements for the corresponding value transfers are inserted into both programs. In the resulting code, the host process is loosely synchronized with the node processes; thus, the host process may read input values before they are actually needed in the node processes.

Step 2: Initial Adaptation. A different kind of processing is applied to program parts that are enclosed by forall loops than to the rest of the node program (Figure 6). For the program parts not enclosed by forall loops, the initial adaptation distributes the entire work assigned to these node program parts across the set of all node processors according to the given array distributions, and resolves accesses to non-local data via communication. The basic rule governing the assignment of work to the node processors is that a node processor is responsible for executing all the assignments to its local data that occur in the original sequential program (owner-computes rule). The distribution of work is internally expressed by masks: a mask is a boolean guard that is attached to each statement. A statement is executed iff its mask evaluates to true; the mask of a statement is omitted if it is always TRUE. The mask owned(ref) is satisfied in a processor iff the variable accessed is local to it. After masking has been performed, the node program parts processed by this technique may contain references to non-local objects.

For every reference which may access a non-local variable, a communication statement EXSR (see [8]) is inserted which updates a private copy of the variable if necessary. The communication statements are extended by a description which determines those processes that exchange the array element accessed. The computation of this description is not discussed in this paper; the interested reader can find it in [8]. An EXSR (EXchange Send Receive) statement is syntactically described as

      EXSR(A(I_1,...,I_n), [l_1/u_1,...,l_n/u_n])

where v = A(I_1,...,I_n) is the array element inducing communication and ovp = [l_1/u_1,...,l_n/u_n] is the overlap description for the array A. For each i, l_i and u_i respectively specify the left and right extension of dimension i in the local segment of A. The semantics of an EXSR statement is described in Figure 5.

Example 1:

      program Example
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
      read (*,*) b
c     ... initialization of map and index ...
c     ... by some user defined algorithms ...
      do i = 3, 90
         c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 4: Input Program.

Various analysis techniques are applied to the forall loops. For each forall loop for which the user did not specify the work distribution, the initial adaptation derives an initial work distribution, which we call the automatic work distribution (the technique used is described in [1]).

The specification of this distribution appears in the on clause of the header of the forall loop. Lists of the variables (distributed and undistributed arrays, and scalar variables) occurring in the loop body are constructed for each forall loop; these lists can be viewed by the user via the information services of VFCS. In VFCS, distributed arrays are classified into distribution classes; a global list of the distribution classes encountered in the forall loops of the whole program is also constructed in the initial adaptation phase. The information about the work distribution derived by the system, and the lists constructed, are stored in the internal representation.

      IF (executing processor p owns v) THEN
         send v to all processors p' such that
            (1) p' reads v, and
            (2) p' does not own v, and
            (3) v is in the overlap area of p'
      ELSE IF (v is in the overlap area of p) THEN
         receive v
      ENDIF

Figure 5: Description of the EXSR statement.

Step 3: View/Modify Work Distribution. The user can change the work distribution derived by the system in the initial adaptation phase, or the work distribution specified by the user, to any type supported by VFCS; e.g., he or she can prescribe that iteration i of the forall loop be executed on the processor whose index is stored in the integer array element map(i). The forall loop with the modified work distribution can be seen in Figure 7. A suitable work distribution helps minimize the load imbalance of the forall loop.

Step 4: Optimization. The code resulting from the initial adaptation is usually not efficient, since updating is performed via single-element messages and the work distribution is enforced at the statement level. In the optimization phase, special transformations are applied to generate more efficient code (see Figure 8). First, communication is extracted from the surrounding do-loops, resulting in the fusion of messages; secondly, loop iterations which do not perform any computation on local variables are suppressed in the node processes. Loop bounds are parameterized according to the data distribution: the bounds of the local segment of c are stored on each processor in the private variables $L and $R, and the mask of the assignment is enforced in the loop bounds. If we look at processor p1d(1), for example, we see that it executes iterations 3 to 25 and thus only writes local variables.

Furthermore, VFCS detects standard reductions, such as the sum, product, maximum and minimum of vector elements and the dot product of two vectors, and treats them in an efficient way. The forall loops are optimized in the following step.

Example 2:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
      receive b
c     ... initialization of map and index ...
      do i = 3, 90
         EXSR (b(i+10), [0/10])
         EXSR (b(i-2), [-2/0])
         owned(c(i)) → c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 6: Node program after initial adaptation.

Step 5: Code Generation. In VFCS, the program that is being parallelized and all information collected at compile time are stored in an internal representation. In the last step, the back-end adapts the internal representation to the target Fortran language. Then the reconstructor produces files with the FORTRAN code of the host and node programs, which can be passed to the native FORTRAN compilers to generate object code for the host and node processors.

If a forall loop appears in the program unit processed, the back-end generates new data structures and statements for the runtime processing of this loop.

All new constructs are generated at the syntax tree level, using lists which describe the utilization of the variables occurring in the program statements. These lists are available for every statement and are constructed in the previous parallelization steps.

Example 3:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
c     ... initialization of map and index ...
      forall i = iexp1, iexp2 on p1d(map(i))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 7: Node program after modifying the work distribution specification.

A runtime processing strategy based on the inspector-executor paradigm is used: each forall loop is evaluated in two phases. The inspector phase generates a description of the communication necessary for the loop; the executor uses this information to perform the communication and the execution of the loop body.

There are three major tasks to be performed when compiling forall loops for DMMPs. The first is distributing the forall loop iterations across the processors, the second is generating communication statements, and the third is performing extensive optimizations to minimize the runtime preprocessing and the forall loop execution time. If, during execution of the forall body, data are used or modified on a processor other than their home, communication of these data is required. The semantics of the Vienna Fortran forall loop guarantees that the data needed during the loop are available before its execution begins. Similarly, if a processor modifies data stored on another one, the update can be deferred until execution of the forall loop finishes.

Example 4:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer index(100)
c     ... initialization of map and index ...
      EXSR (b(*), [0/10])
      EXSR (b(*), [-2/0])
      do i = max($L,3), min($R,90)
         c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 8: Node program after compile-time optimization.

Thus the compiler may implement a forall loop with the following strategy:

- Determine the global iteration set Iter_set. This set is computed from the forall loop header parameters.
- Derive the work distribution specification, if it has not been provided by the user.
- For each processor p:
  - Determine the local iteration set exec(p). This phase uses Iter_set and the work distribution specification as inputs.
  - Identify the sets of all non-local data used or modified on this processor and the sets of all local data that other processors access.
  - Exchange the sets of data to be used during the forall loop iterations with other processors.
  - Execute the iterations from exec(p) on this processor. The forall loop body must be transformed correspondingly.
  - Exchange the sets of data that were modified during the forall loop execution with other processors.

Much of the complexity of our implementation is contained in the PARTI routines ([26, 25, 23, 5]), which were originally developed as runtime support for irregular computations with primarily irregular distributions of data. On top of the PARTI routines, a VFCS runtime library was constructed to provide a suitable interface between VFCS and the PARTI routines. The VFCS implementation of the inspector/executor paradigm is described in [1].

4 Related Work

DINO [20, 21] is an extended version of C supporting general purpose computation. It is an early attempt to provide higher-level language constructs for the specification of algorithms on DMMPs. The description of SUPERB in [30] is the first journal publication in the area of compiling Fortran for DMMPs. The concept of defining processor arrays and distributing data to them was first introduced in the programming language BLAZE [16] in the context of shared memory systems with nonuniform access times. This research was continued with the Kali programming language [17] for distributed memory machines, which requires that the user specify data distributions in much the same way that Vienna Fortran does. The design of Kali has greatly influenced the development of Vienna Fortran; in particular, the parallel FORALL loops of Vienna Fortran were first defined in Kali and implemented with the inspector-executor paradigm. On top of PARTI, Joel Saltz and coworkers developed a compiler for a language called ARF (ARguably Fortran); the compiler automatically generates calls to the PARTI routines from distribution annotations and from distributed loops with an on clause for work distribution [29]. A commercially available system is the MIMDizer [18], which may be used to parallelize sequential Fortran programs according to the SPMD model. The programming language Fortran D [7] proposes a Fortran language extension in which the programmer specifies the distribution of data by aligning each array to a decomposition, which corresponds to an HPF template, and then specifying a distribution of the decomposition to a virtual machine. These are executable statements, and array distributions are dynamic only. A subset of Fortran D has been implemented for the iPSC/860 [10].

Cray Research Inc. has announced a set of language extensions to Cray Fortran (cf77) [19] which enable the user to specify the distribution of data and work. Several methods for distributing the iterations of loops are provided, and special directives for specifying concurrent file operations are also available.

5 Conclusions

In this paper, we have described the main features of Vienna Fortran and outlined the compilation strategy of VFCS. This system provides sophisticated analyses, communication and work optimizations, and runtime compilation. The back end of VFCS transforms the internal representation of the program into message passing Fortran code. Back ends generating Intel iPSC/860 and GENESIS-P Fortran code have been designed and implemented; the latter generates code with the PARMACS portable message-passing macros. By now, the PARMACS macros have been implemented on the iPSC/2, iPSC/860, NCUBE 2, Parsytec, SUPRENUM, and GENESIS-P machines, as well as on some workstation networks. This wide range of PARMACS implementations contributes significantly to the portability of programs parallelized by VFCS.

References

[1] P. Brezany, M. Gerndt, V. Sipkova, H. P. Zima. SUPERB Support for Irregular Scientific Computations. In Proceedings of the Scalable High Performance Computing Conference, Williamsburg, USA, April 1992.

[2] S. Benkner, B. Chapman, H. P. Zima. Vienna Fortran 90. In Proceedings of the Scalable High Performance Computing Conference, Williamsburg, USA, April 1992.

[3] B. Chapman, P. Mehrotra, H. P. Zima. Programming in Vienna Fortran. ACPC Technical Report Series, University of Vienna, Vienna, Austria.

[4] K. D. Cooper, K. Kennedy and L. Torczon. The Impact of Interprocedural Analysis and Optimization in the R^n Programming Environment. ACM Transactions on Programming Languages and Systems, Vol. 8, No. 4, October 1986, 491-523.

[5] R. Das, J. Saltz. A Manual for PARTI Runtime Primitives - Revision 2. Internal Research Report, University of Maryland.

[6] T. Fahringer, H. Zima. A Static Parameter Based Performance Prediction Tool for Parallel Programs. Austrian Center for Parallel Computation, Technical Report ACPC/TR 93-1, January 1993.

[7] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu. Fortran D Language Specification. Rice University, Technical Report COMP TR90-141, December 1990.

[8] H. M. Gerndt. Automatic Parallelization for Distributed-Memory Multiprocessing Systems. Ph.D. Dissertation, University of Bonn.

[9] V. A. Guarna Jr., D. Gannon, D. Jablonowski, A. D. Malony, Y. Gaur. Faust: An Integrated Environment for the Development of Parallel Programs. IEEE Software, July 1989.

[10] S. Hiranandani, K. Kennedy and C.-W. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, Vol. 35, No. 8, pages 66-80, August 1992.

[11] High Performance Fortran Forum. DRAFT High Performance Fortran Language Specification, Version 1.0 Draft, January 25, 1993. Technical Report, Rice University.

[12] C. Koelbel. Compiling Programs for Nonshared Memory Machines. Ph.D. Dissertation, Purdue University, West Lafayette, IN, November 1990.

[13] C. Koelbel. Compile-Time Generation of Regular Communications Patterns. In Proceedings Supercomputing '91, Albuquerque, 1991.

[14] C. Koelbel and P. Mehrotra. Compiling Global Name-Space Parallel Loops for Distributed Execution. IEEE Transactions on Parallel and Distributed Systems, October 1991.

[15] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting Shared Data Structures on Distributed Memory Architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186, March 1990.

[16] P. Mehrotra, J. Van Rosendale. The BLAZE Language: A Parallel Language for Scientific Programming. Parallel Computing, Vol. 5, 1987.

[17] P. Mehrotra, J. Van Rosendale. Programming Distributed Memory Architectures Using Kali. ICASE, NASA Langley Research Center.

[18] MIMDizer User's Guide, Version 8.0. Applied Parallel Research Inc., Placerville, CA.

[19] D. M. Pase. MPP Fortran Programming Model, Draft 1.0. Technical Report, Cray Research, October.

[20] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing Complex Parallel Algorithms in DINO. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers, and Applications, pages 553-560.

[21] M. Rosing, R. W. Schnabel, and R. P. Weaver. The DINO Parallel Programming Language. Technical Report CU-CS, University of Colorado, Boulder, CO, April.

[22] R. Ruehl, M. Annaratone. Parallelization of Fortran Code on Distributed-Memory Parallel Processors. In Proceedings of the 4th International Conference on Supercomputing, Amsterdam, 1990.

[23] J. Saltz, H. Berryman, and J. Wu. Runtime Compilation for Multiprocessors. Report 90-59, ICASE, 1990.

[24] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-Time Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing, 8(2):303-312, 1990.

[25] J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu. PARTI Procedures for Realistic Loops. In Proceedings of DMCC6, Portland, OR, 1991.

[26] R. Das, J. Saltz, H. Berryman. A Manual for PARTI Runtime Primitives - Revision 1. Interim Report 91-17, ICASE, 1991.

[27] J. McGraw, S. Skedzielewski, S. Allan, R. Oldehoeft, J. Glauert, C. Kirkham, W. Noyce, and R. Thomas. SISAL: Streams and Iteration in a Single Assignment Language: Language Reference Manual. Report M-146, Lawrence Livermore National Laboratory, March.

[28] CM Fortran Reference Manual, Version 5.2. Thinking Machines Corporation, Cambridge, MA.

[29] J. Wu, J. Saltz, H. Berryman, S. Hiranandani. Distributed Memory Compiler Design for Sparse Problems. ICASE Report, January.

[30] H. Zima, H.-J. Bast, and H. M. Gerndt. SUPERB - A Tool for Semi-Automatic MIMD/SIMD Parallelization. Parallel Computing, 6, 1-18, 1988.

[31] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran - A Language Specification. ACPC Technical Report Series, University of Vienna, Vienna, Austria. Also available as ICASE Interim Report 21, MS 132c, NASA, Hampton, VA.

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and

More information

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is

More information

Chapter 1. Reprinted from "Proc. 6th SIAM Conference on Parallel. Processing for Scientic Computing",Norfolk, Virginia (USA), March 1993.

Chapter 1. Reprinted from Proc. 6th SIAM Conference on Parallel. Processing for Scientic Computing,Norfolk, Virginia (USA), March 1993. Chapter 1 Parallel Sparse Matrix Vector Multiplication using a Shared Virtual Memory Environment Francois Bodin y Jocelyne Erhel y Thierry Priol y Reprinted from "Proc. 6th SIAM Conference on Parallel

More information

Chapter 3. Fortran Statements

Chapter 3. Fortran Statements Chapter 3 Fortran Statements This chapter describes each of the Fortran statements supported by the PGI Fortran compilers Each description includes a brief summary of the statement, a syntax description,

More information

Cedar Fortran Programmer's Manual 1. Jay Hoeinger. Center for Supercomputing Research and Development. Urbana, Illinois

Cedar Fortran Programmer's Manual 1. Jay Hoeinger. Center for Supercomputing Research and Development. Urbana, Illinois Cedar Fortran Programmer's Manual 1 Jay Hoeinger Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign Urbana, Illinois 61801 June 14, 1993 1 This work was supported

More information

High-Level Management of Communication Schedules in HPF-like Languages

High-Level Management of Communication Schedules in HPF-like Languages High-Level Management of Communication Schedules in HPF-like Languages Siegfried Benkner a Piyush Mehrotra b John Van Rosendale b Hans Zima a a Institute for Software Technology and Parallel Systems, University

More information

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

The S-Expression Design Language (SEDL) James C. Corbett. September 1, Introduction. 2 Origins of SEDL 2. 3 The Language SEDL 2.

The S-Expression Design Language (SEDL) James C. Corbett. September 1, Introduction. 2 Origins of SEDL 2. 3 The Language SEDL 2. The S-Expression Design Language (SEDL) James C. Corbett September 1, 1993 Contents 1 Introduction 1 2 Origins of SEDL 2 3 The Language SEDL 2 3.1 Scopes : : : : : : : : : : : : : : : : : : : : : : : :

More information

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995 ICC++ Language Denition Andrew A. Chien and Uday S. Reddy 1 May 25, 1995 Preface ICC++ is a new dialect of C++ designed to support the writing of both sequential and parallel programs. Because of the signicant

More information

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada Distributed Array Data Management on NUMA Multiprocessors Tarek S. Abdelrahman and Thomas N. Wong Department of Electrical and Computer Engineering University oftoronto Toronto, Ontario, M5S 1A Canada

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Solve the Data Flow Problem

Solve the Data Flow Problem Gaining Condence in Distributed Systems Gleb Naumovich, Lori A. Clarke, and Leon J. Osterweil University of Massachusetts, Amherst Computer Science Department University of Massachusetts Amherst, Massachusetts

More information

i=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8)

i=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8) Vectorization Using Reversible Data Dependences Peiyi Tang and Nianshu Gao Technical Report ANU-TR-CS-94-08 October 21, 1994 Vectorization Using Reversible Data Dependences Peiyi Tang Department of Computer

More information

New Programming Paradigms: Partitioned Global Address Space Languages

New Programming Paradigms: Partitioned Global Address Space Languages Raul E. Silvera -- IBM Canada Lab rauls@ca.ibm.com ECMWF Briefing - April 2010 New Programming Paradigms: Partitioned Global Address Space Languages 2009 IBM Corporation Outline Overview of the PGAS programming

More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

Advanced Compiler Construction

Advanced Compiler Construction CS 526 Advanced Compiler Construction http://misailo.cs.illinois.edu/courses/cs526 INTERPROCEDURAL ANALYSIS The slides adapted from Vikram Adve So Far Control Flow Analysis Data Flow Analysis Dependence

More information

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group SAMOS: an Active Object{Oriented Database System Stella Gatziu, Klaus R. Dittrich Database Technology Research Group Institut fur Informatik, Universitat Zurich fgatziu, dittrichg@ifi.unizh.ch to appear

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

The Pandore Data-Parallel Compiler. and its Portable Runtime. Abstract. This paper presents an environment for programming distributed

The Pandore Data-Parallel Compiler. and its Portable Runtime. Abstract. This paper presents an environment for programming distributed The Pandore Data-Parallel Compiler and its Portable Runtime Francoise Andre, Marc Le Fur, Yves Maheo, Jean-Louis Pazat? IRISA, Campus de Beaulieu, F-35 Rennes Cedex, FRANCE Abstract. This paper presents

More information

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION AN EMPIRICAL STUDY OF PERSPECTIVE-BASED USABILITY INSPECTION Zhijun Zhang, Victor Basili, and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20742, USA fzzj, basili,

More information

TECHNICAL RESEARCH REPORT

TECHNICAL RESEARCH REPORT TECHNICAL RESEARCH REPORT A Resource Reservation Scheme for Synchronized Distributed Multimedia Sessions by W. Zhao, S.K. Tripathi T.R. 97-14 ISR INSTITUTE FOR SYSTEMS RESEARCH Sponsored by the National

More information

Embedding Data Mappers with Distributed Memory Machine Compilers

Embedding Data Mappers with Distributed Memory Machine Compilers Syracuse University SURFACE Electrical Engineering and Computer Science Technical Reports College of Engineering and Computer Science 4-1992 Embedding Data Mappers with Distributed Memory Machine Compilers

More information

Chapter 4. Fortran Arrays

Chapter 4. Fortran Arrays Chapter 4. Fortran Arrays Fortran arrays are any object with the dimension attribute. In Fortran 90/95, and in HPF, arrays may be very different from arrays in older versions of Fortran. Arrays can have

More information

I R I S A P U B L I C A T I O N I N T E R N E DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN

I R I S A P U B L I C A T I O N I N T E R N E DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN I R I P U B L I C A T I O N I N T E R N E N o 787 S INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES A DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN 1166-8687

More information

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90 149 Fortran and HPF 6.2 Concept High Performance Fortran 6.2 Concept Fortran90 extension SPMD (Single Program Multiple Data) model each process operates with its own part of data HPF commands specify which

More information

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr Performance Evaluation of Automatically Generated Data-Parallel Programs L. Massari Y. Maheo DIS IRISA Universita dipavia Campus de Beaulieu via Ferrata 1 Avenue du General Leclerc 27100 Pavia, ITALIA

More information

International Standards Organisation. Parameterized Derived Types. Fortran

International Standards Organisation. Parameterized Derived Types. Fortran International Standards Organisation Parameterized Derived Types in Fortran Technical Report defining extension to ISO/IEC 1539-1 : 1996 {Produced 4-Jul-96} THIS PAGE TO BE REPLACED BY ISO CS ISO/IEC 1

More information

High Performance Fortran http://www-jics.cs.utk.edu jics@cs.utk.edu Kwai Lam Wong 1 Overview HPF : High Performance FORTRAN A language specification standard by High Performance FORTRAN Forum (HPFF), a

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26

Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Technical Report ANU-TR-CS-92- November 7, 992 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer

More information

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup.

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup. Sparse Implementation of Revised Simplex Algorithms on Parallel Computers Wei Shu and Min-You Wu Abstract Parallelizing sparse simplex algorithms is one of the most challenging problems. Because of very

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

In context with optimizing Fortran 90 code it would be very helpful to have a selection of

In context with optimizing Fortran 90 code it would be very helpful to have a selection of 1 ISO/IEC JTC1/SC22/WG5 N1186 03 June 1996 High Performance Computing with Fortran 90 Qualiers and Attributes In context with optimizing Fortran 90 code it would be very helpful to have a selection of

More information

Interprocedural Compilation of Fortran D for. MIMD Distributed-Memory Machines. Mary W. Hall. Seema Hiranandani. Ken Kennedy.

Interprocedural Compilation of Fortran D for. MIMD Distributed-Memory Machines. Mary W. Hall. Seema Hiranandani. Ken Kennedy. Interprocedural Compilation of Fortran D for MIMD Distributed-Memory Machines Mary W. Hall Seema Hiranandani Ken Kennedy Chau-Wen Tseng CRPC-TR 91195 November 1991 Center for Research on Parallel Computation

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

Y. Han* B. Narahari** H-A. Choi** University of Kentucky. The George Washington University

Y. Han* B. Narahari** H-A. Choi** University of Kentucky. The George Washington University Mapping a Chain Task to Chained Processors Y. Han* B. Narahari** H-A. Choi** *Department of Computer Science University of Kentucky Lexington, KY 40506 **Department of Electrical Engineering and Computer

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam Parallel Paradigms & Programming Models Lectured by: Pham Tran Vu Prepared by: Thoai Nam Outline Parallel programming paradigms Programmability issues Parallel programming models Implicit parallelism Explicit

More information

A Loosely Synchronized Execution Model for a. Simple Data-Parallel Language. (Extended Abstract)

A Loosely Synchronized Execution Model for a. Simple Data-Parallel Language. (Extended Abstract) A Loosely Synchronized Execution Model for a Simple Data-Parallel Language (Extended Abstract) Yann Le Guyadec 2, Emmanuel Melin 1, Bruno Ran 1 Xavier Rebeuf 1 and Bernard Virot 1? 1 LIFO - IIIA Universite

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Fortran 2008: what s in it for high-performance computing

Fortran 2008: what s in it for high-performance computing Fortran 2008: what s in it for high-performance computing John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory Fortran 2008 has been completed and is about to be published.

More information

Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP Shared Memory Programming with OpenMP (An UHeM Training) Süha Tuna Informatics Institute, Istanbul Technical University February 12th, 2016 2 Outline - I Shared Memory Systems Threaded Programming Model

More information

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988.

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988. editor, Proceedings of Fifth SIAM Conference on Parallel Processing, Philadelphia, 1991. SIAM. [3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam. A users' guide to PVM parallel

More information

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory This talk will explain the objectives of coarrays, give a quick summary of their history, describe the

More information

Acknowledgments 2

Acknowledgments 2 Program Slicing: An Application of Object-oriented Program Dependency Graphs Anand Krishnaswamy Dept. of Computer Science Clemson University Clemson, SC 29634-1906 anandk@cs.clemson.edu Abstract A considerable

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Point-to-Point Synchronisation on Shared Memory Architectures

Point-to-Point Synchronisation on Shared Memory Architectures Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:

More information

Interactive Parallel Programming Using. the ParaScope Editor. Department of Computer Science, Rice University, Houston, TX

Interactive Parallel Programming Using. the ParaScope Editor. Department of Computer Science, Rice University, Houston, TX Interactive Parallel Programming Using the ParaScope Editor Ken Kennedy Kathryn McKinley Chau-Wen Tseng Department of Computer Science, Rice University, Houston, TX 77251-1892 May 24, 1994 Abstract The

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Code Generator for HPF Library on Fujitsu VPP5000

Code Generator for HPF Library on Fujitsu VPP5000 UDC 681.325.3 Code Generator for HPF Library on Fujitsu VPP5000 VMatthijs van Waveren VCliff Addison VPeter Harrison VDave Orange VNorman Brown (Manuscript received October 23, 1999) The Fujitsu VPP5000

More information

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J.

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J. Compilation Issues for High Performance Computers: A Comparative Overview of a General Model and the Unied Model Abstract This paper presents a comparison of two models suitable for use in a compiler for

More information

ADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004

ADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004 ADR and DataCutter Sergey Koren CMSC818S Thursday March 4 th, 2004 Active Data Repository Used for building parallel databases from multidimensional data sets Integrates storage, retrieval, and processing

More information

director executor user program user program signal, breakpoint function call communication channel client library directing server

director executor user program user program signal, breakpoint function call communication channel client library directing server (appeared in Computing Systems, Vol. 8, 2, pp.107-134, MIT Press, Spring 1995.) The Dynascope Directing Server: Design and Implementation 1 Rok Sosic School of Computing and Information Technology Grith

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

A Fast Recursive Mapping Algorithm. Department of Computer and Information Science. New Jersey Institute of Technology.

A Fast Recursive Mapping Algorithm. Department of Computer and Information Science. New Jersey Institute of Technology. A Fast Recursive Mapping Algorithm Song Chen and Mary M. Eshaghian Department of Computer and Information Science New Jersey Institute of Technology Newark, NJ 7 Abstract This paper presents a generic

More information

High Performance Fortran. Language Specication. High Performance Fortran Forum. January 31, Version 2.0

High Performance Fortran. Language Specication. High Performance Fortran Forum. January 31, Version 2.0 High Performance Fortran Language Specication High Performance Fortran Forum January, Version.0 The High Performance Fortran Forum (HPFF), with participation from over 0 organizations, met from March to

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing

More information

NOTE: Answer ANY FOUR of the following 6 sections:

NOTE: Answer ANY FOUR of the following 6 sections: A-PDF MERGER DEMO Philadelphia University Lecturer: Dr. Nadia Y. Yousif Coordinator: Dr. Nadia Y. Yousif Internal Examiner: Dr. Raad Fadhel Examination Paper... Programming Languages Paradigms (750321)

More information

Co-arrays to be included in the Fortran 2008 Standard

Co-arrays to be included in the Fortran 2008 Standard Co-arrays to be included in the Fortran 2008 Standard John Reid, ISO Fortran Convener The ISO Fortran Committee has decided to include co-arrays in the next revision of the Standard. Aim of this talk:

More information

Keywords: networks-of-workstations, distributed-shared memory, compiler optimizations, locality

Keywords: networks-of-workstations, distributed-shared memory, compiler optimizations, locality Informatica 17 page xxx{yyy 1 Overlap of Computation and Communication on Shared-Memory Networks-of-Workstations Tarek S. Abdelrahman and Gary Liu Department of Electrical and Computer Engineering The

More information

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming CSE 262 Spring 2007 Scott B. Baden Lecture 4 Data parallel programming Announcements Projects Project proposal - Weds 4/25 - extra class 4/17/07 Scott B. Baden/CSE 262/Spring 2007 2 Data Parallel Programming

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

signature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1

signature i-1 signature i instruction j j+1 branch adjustment value if - path initial value signature i signature j instruction exit signature j+1 CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme

More information

Fortran 90 - A thumbnail sketch

Fortran 90 - A thumbnail sketch Fortran 90 - A thumbnail sketch Michael Metcalf CERN, Geneva, Switzerland. Abstract The main new features of Fortran 90 are presented. Keywords Fortran 1 New features In this brief paper, we describe in

More information

Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results

Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results Syracuse University SURFACE Northeast Parallel Architecture Center College of Engineering and Computer Science 1993 Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation,

More information

Parallelization System. Abstract. We present an overview of our interprocedural analysis system,

Parallelization System. Abstract. We present an overview of our interprocedural analysis system, Overview of an Interprocedural Automatic Parallelization System Mary W. Hall Brian R. Murphy y Saman P. Amarasinghe y Shih-Wei Liao y Monica S. Lam y Abstract We present an overview of our interprocedural

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Performance Comparison Between AAL1, AAL2 and AAL5

Performance Comparison Between AAL1, AAL2 and AAL5 The University of Kansas Technical Report Performance Comparison Between AAL1, AAL2 and AAL5 Raghushankar R. Vatte and David W. Petr ITTC-FY1998-TR-13110-03 March 1998 Project Sponsor: Sprint Corporation

More information

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742 Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

More information