
Compiling FORTRAN for Massively Parallel Architectures

Peter Brezany
University of Vienna
Institute for Software Technology and Parallel Systems
Brunnerstrasse 72, A-1210 Vienna, Austria

1 Introduction

One of the most fundamental challenges facing computer science today is the need to develop the algorithms and programming tools required to exploit the vast computing power of massively parallel computers. Distributed-memory multiprocessors (DMMPs) provide an attractive approach to high speed computing because their performance can easily be scaled up by increasing the number of processor-memory modules. Many high performance DMMPs are now commercially available. However, the enormous potential these machines provide can only be fully exploited when they are programmed effectively, which has proved to be a difficult task. The efficiency of parallel programs depends critically on the proper utilization of specific architectural features of the underlying hardware, which makes automatic support of the program development process highly desirable. Therefore, adequate programming environments supporting the program development process for the scientific user are urgently needed. A significant amount of software research on programming environments for DMMPs is currently underway, both in academia and in industry. The research effort can be broadly categorized into three classes, namely parallelizing compilers, languages, and support tools.

An important problem faced by DMMP installations is the conversion of the large body of existing scientific Fortran code into a form suitable for parallel processing on DMMPs. The exclusive use of new parallel languages would force the user to transform each program manually, an extremely time-consuming and error-prone process. This represents an enormous cost to the institutions concerned. On the other hand, Fortran is still the primary language for the development of scientific software. Therefore, the development of Fortran-oriented software tools for DMMPs is a high priority objective.

While DMMPs are less expensive to build than shared-memory systems (SMS) and easily scalable to a large number of processors, the programming paradigm associated with SMS offers clear advantages by providing all processes with uniform access to a global shared memory. Various efforts have been made to bridge that gap by implementing a virtual shared memory on top of a DMMP. This can be done either in hardware or by appropriate software mechanisms. A dominant method for providing a virtual shared memory on a DMMP is automatic parallelization: in this approach, sequential programs (usually written in Fortran) are automatically transformed into explicitly parallel programs in a Fortran superset, utilizing message-passing operations. During the last few years, the basic compilation techniques have been established, and a number of parallelizing systems have been successfully implemented. In virtually all these systems, parallelization is guided by exploiting data parallelism, whereby the data distribution has to be explicitly specified by the user. Data distribution involves partitioning the sequential program's data domain into disjoint sets of variables and mapping these sets to the processors of the DMMP. The task of the compiler then essentially consists of adapting the program code in such a way that each processor executes all assignments to the data which have been mapped to it, inserting communication where necessary (a minimal sketch of this transformation is given at the end of this section).

Recently, a consortium of researchers from industry, government labs and academia formed the High Performance Fortran Forum to develop a standard set of extensions for Fortran 90 which would provide a portable interface to a wide variety of parallel architectures. The forum has produced a draft proposal for a language, called High Performance Fortran (HPF) [11], which focuses mainly on issues of distributing data across the memories of a distributed memory multiprocessor. The main concepts in HPF have been derived from a number of predecessor languages, including DINO [21], CM Fortran [28], Kali [17], Fortran D [7], and Vienna Fortran, with the last two languages having the largest impact. Within the past few years, a standard technique for compiling FORTRAN for distributed memory has evolved, and several prototype systems have been developed, including the Vienna Fortran Compilation System (VFCS), which was among the very first tools of this kind.

In this paper, we outline the basic principles of compilers and languages for distributed memory machines, which are based on the data-parallel Single-Program-Multiple-Data (SPMD) paradigm. The remainder of this paper is organized as follows. Section 2 describes the basic features of Vienna Fortran and provides some definitions and terminology used in later sections. Section 3 provides an overview of the parallelization strategy used in VFCS. Section 4 describes related work and Section 5 contains some concluding remarks.
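To make the owner-computes scheme concrete, the following minimal sketch (a hypothetical fragment, not one of the paper's examples) contrasts a sequential loop with the corresponding SPMD node program. It assumes a blockwise distribution of the arrays a and b and borrows the convention, used by VFCS in Section 3, that $L and $R denote the bounds of a processor's local array segment:

c     sequential source loop
      do i = 2, n
         a(i) = b(i-1)
      enddo

c     SPMD node program executed by every processor: the loop
c     bounds are restricted so that each processor assigns only
c     the elements of a it owns; communication for the single
c     non-local element b($L-1) is inserted before the loop
      do i = max($L,2), min($R,n)
         a(i) = b(i-1)
      enddo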

2 The Vienna Fortran Language

In cooperation with ICASE, NASA, a machine-independent language extension to Fortran 77 called Vienna Fortran has been proposed. In this section, we present the basic language features of Vienna Fortran.

2.1 The Programming Model

Vienna Fortran assumes that a program will be executed by a machine with one or more processors according to the SPMD programming model as described above. This model requires that each participating processor execute the same program; parallelism is obtained by applying the computation to different parts of the data domain simultaneously. The generated code will store the local parts of arrays and the overlap areas locally and use message passing, optimized where possible, to exchange data. It will also map the logical processor structures declared by the user to the physical processors which execute the program. These transformations are, however, transparent to the Vienna Fortran programmer.

2.2 The Language Model Simplified

The data space A is the set of arrays declared in the program Q (scalar variables can be interpreted as one-dimensional arrays with one element).

Definition 1 (Index domain) An index domain of rank (dimension) n is any set I that can be represented in the form I = D_1 × D_2 × ... × D_n, where n ≥ 1 and, for all i with 1 ≤ i ≤ n, D_i is a nonempty, linearly ordered set of integer numbers.

Let A ∈ A denote an arbitrary array. The index domain of A is denoted by I^A.

2.2.1 Processors

The set of processors, P, is specified in the program by the declaration of a processor array, which provides a means of naming and accessing its elements. For a processor array R, I^R denotes the associated index domain, and index_R : P → I^R is the function mapping each processor to its index.
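As a small worked example of these definitions (the concrete declaration is illustrative only): a processor array declared with extents 2 and 3, say R(2,3), has the index domain I^R = [1:2] × [1:3], an index domain of rank 2 representable as D_1 × D_2 with D_1 = {1,2} and D_2 = {1,2,3}; index_R assigns to each of the six processors its coordinate pair in I^R.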

2.2.2 Array Distribution

A distribution of an array maps each array element to one or more processors, which become the owners of the element and, in this capacity, store the element in their local memory. We model distributions by functions between the associated index domains.

Definition 2 (Array Distribution) Let A ∈ A, and assume that R is a processor array. A distribution of the array A with respect to R is defined by a mapping

      μ^A_R : I^A → P(I^R) ∖ {∅}

where P(I^R) denotes the power set of I^R.

2.2.3 Array Files

Input/Output statements control the data flow between program variables and the file system. The file system of machine M may reside physically on a host system and/or a Concurrent Input/Output System.

Definition 3 (File System) The file system F of machine M is defined as the union of a set of standard FORTRAN files F_ST and a set of array files F_ARR.

When transferring elements of a distributed array to an array file, each processor performs the input/output operations controlling the transfer of the local part of the array to or from the corresponding part of the file. A suitable file structuring is necessary to achieve high transfer efficiency. Array files in Vienna Fortran may contain values from more than one array. Therefore, array files are structured into records. Each record contains an array distribution descriptor followed by a sequence of data elements associated with this array.

Definition 4 (Array File) An array file F ∈ F_ARR is a sequence of distributed array records

      < darec_1, darec_2, ... >

Each record can be associated with a distributed array, A, and has the form (d^A, O^A), where d^A is a distribution descriptor with the structure (I^A, I^R, μ^A_R). Here, μ^A_R is the distribution used for writing out the sequence of data elements, and I^A and I^R are the underlying array and processor index domains, respectively, used for defining this distribution. O^A is a sequence of data elements stored in this record.

2.3 The Language Extensions: An Overview

Vienna Fortran includes all of the following language extensions to Fortran 77. Many of them will be discussed in the examples below, where their use is further described in an informal manner. A complete and precise description of the language, examples of the use of these extensions and a demonstration of their expressiveness can be found in [3, 31].

The PROCESSORS statement The user may declare and name one or more processor arrays by means of the PROCESSORS statement. The first such array is called the primary processor array; others are declared using the keyword RESHAPE. They refer to precisely the same set of processors, providing different views of it: a correspondence is established between any two processor arrays by the column-major ordering of array elements defined in Fortran 77. Expressions for the bounds of processor arrays may contain symbolic names, whose values are obtained from the environment at load time. Assertions may be used to impose restrictions on the values that can be assumed by these variables. This allows the program to be parameterized by the number of processors. The PROCESSORS statement is optional in each program unit. For example:

      PROCESSORS MYP3(NP1, NP2, NP3) RESHAPE MYP2(NP1, NP2*NP3)

Processor References Processor arrays may be referred to in their entirety by specifying the name only. Array section notation, as introduced in Fortran 90, is used to describe subsets of processor arrays; individual processors may be referenced by the usual array subscript notation. The dimensions of a processor array may be permuted.

Processor Intrinsics The number of processors on which the program executes may be accessed through the intrinsic function $NP. A one-dimensional processor array, $P(1:$NP), is always implicitly declared and may be referred to; it is the default primary array if there is no PROCESSORS statement in a program. The index of the executing processor in $P is returned by the intrinsic function $MY_PROC.

Distribution Annotations Distribution annotations may be appended to array declarations to specify direct and implicit distributions of the arrays to processors. Direct distributions consist of the keyword DIST together with a parenthesized distribution expression and an optional TO clause. The TO clause specifies the set of processors to which the array(s) are distributed; if it is not present, the primary processor array is selected by default.

A distribution expression consists of a list of distribution functions. There is either one function describing the distribution of the entire array, which may have more than one dimension, or each function in the list distributes the corresponding array dimension to a dimension of the processor array. The elision symbol ":" is provided to indicate that an array dimension is not distributed. If there are fewer distributed dimensions in the data array than there are in the processor array, the array will be replicated across the remaining processor dimensions. Both intrinsic functions and user-defined functions may be used to specify the distribution of an array dimension.

      REAL A(L,N,M), B(M,M,M) DIST ( BLOCK, CYCLIC, BLOCK )
      REAL C(1200) DIST ( MYOWNFUNC ) TO $P

By default, an array which is not explicitly distributed is replicated on all processors.

Distribution Intrinsics Direct distributions may be specified by using the elision symbol, as described above, and the BLOCK and CYCLIC intrinsic functions. The BLOCK function distributes an array dimension to a processor dimension in evenly sized segments. The CYCLIC (or scatter) distribution maps the elements of a dimension of the data array in round-robin fashion to a dimension of the processor array; if a width is specified, contiguous segments of that width are distributed in a round-robin manner. The INDIRECT distribution intrinsic function enables the specification of a mapping array which allows each array element to be distributed individually to a single processor. The mapping array must be of the same size and shape as the array being distributed, and its values are processor numbers (in $P):

      INTEGER IAPROCS(1000)
      REAL A(1000) DIST ( INDIRECT(IAPROCS) )

Thus, for example, the value of IAPROCS(60) is the number of the processor to which A(60) is to be mapped. Note that IAPROCS must be defined before it is used to specify the distribution of A, and that each element of A can be mapped to only one processor.

Dynamic Distributions and the DISTRIBUTE Statement By default, the distribution of an array is static: it does not change within the scope of the declaration to which the distribution has been appended. The keyword DYNAMIC is provided to declare an array distribution to be dynamic.

This permits the array to be the target of a DISTRIBUTE statement. A dynamically distributed array may optionally be provided with an initial distribution in the manner described above for static distributions. A range of permissible distributions may be specified when the array is declared by giving the keyword RANGE and a set of explicit distributions; if this does not appear, the array may take on any permitted distribution with the appropriate dimensionality during execution of the program. Finally, the distribution of such an array may be dynamically connected to the distribution of another dynamically distributed array in a specified fixed manner. This is expressed by means of the CONNECT keyword: if the latter array is redistributed, then the connected array will automatically be redistributed as well.

      REAL F(200,200) DYNAMIC,
     &     RANGE (( BLOCK, BLOCK ), ( CYCLIC(5), BLOCK ))

The distribute statement begins with the keyword DISTRIBUTE and a list of the arrays which are to be distributed at runtime. Following the separator symbol "::", a direct, implicit or indirect distribution is specified using the same constructs as those for specifying static distributions. The statement has an optional NOTRANSFER clause; if it appears, it specifies that the arrays to which it applies are to be distributed according to the specification, but that the old data (if there is any) is not to be transferred. Thus only the access function is modified. For example, in the statement

      DISTRIBUTE A, B :: ( CYCLIC(10) ) NOTRANSFER (B)

both arrays A and B are redistributed with the new distribution CYCLIC(10); however, for the array B only the access function is changed, and the old values are not transferred to the new locations. Whenever an array is redistributed via a distribute statement, any arrays connected to it are automatically redistributed as well, so as to maintain the relationship between their distributions.

Procedures Dummy array arguments may be distributed in the same way as other arrays. If the given distribution differs from that of the actual argument, then redistribution will take place. If the actual argument is dynamically distributed, then it may be permanently modified in a procedure; if it is statically distributed, then the original distribution must be restored on procedure exit. This can always be enforced by the keyword RESTORE. While argument transmission is generally call by reference, there are situations in which arguments must be copied. The user can suppress this by specifying a NOCOPY.
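As an illustration of the redistribution rule just described (a hypothetical fragment using the DIST notation introduced above, not an example from the paper): the actual argument A is BLOCK-distributed while the dummy argument X requests CYCLIC, so A is redistributed on entry to SUB; since A is statically distributed, its original BLOCK distribution is restored on exit.

      REAL A(100) DIST ( BLOCK )
      ...
      CALL SUB(A)
      ...
      SUBROUTINE SUB(X)
      REAL X(100) DIST ( CYCLIC )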

Dummy array arguments may also inherit the distribution of the actual argument; this is specified by using an "*" as the distribution expression:

      CALL EX(A,B(1:N,10),N,3)
      ...
      SUBROUTINE EX(X,Y,N,J)
      REAL X(N,N) DIST (*)
      REAL Y(N) DIST ( BLOCK ) TO MYP2(1:N,J)

Array sections may be passed as arguments to subroutines using the syntax of Fortran 90.

The FORALL Loop The FORALL loop enables the user to assert that the iterations of a loop are independent and can be executed in parallel. (Note that the forall loop, as introduced here, is not the forall loop proposed during the development of Fortran 90 and in HPF.) A precondition for the correctness of this loop is the absence of loop-carried dependences. There is an implicit synchronization at the beginning and at the end of such a loop. Private variables are permitted within forall loops; they are known only in the forall loop in which they are declared, and each loop iteration has its own copy. The iterations of the loop may be assigned explicitly to processors if the user desires, or they may be performed by the processor which owns a specified datum. This can be done through the optional on clause specified in the forall loop header.

      FORALL I = 1,N ON MASTER(C(K(I)))
         Y(K(I)) = X(I) + C(K(I))*Y(K(I))
      END FORALL

      (a) Parallel loop with indirect accesses

      FORALL I = 1,N ON $P(NOP(I))
         REAL T
         ...
      END FORALL

      (b) Parallel loop with a private variable

Figure 1: Parallel loops.

The on-clause in the example shown in Figure 1a specifies that the i-th iteration (1 ≤ i ≤ N) of the loop is executed on the processor MASTER(C(K(i))), where MASTER(ref) returns the uniquely defined processor p which owns the array element denoted by ref. The processor may also be specified explicitly, as in ON R1(I), where R1 is a processor array.

In the second parallel loop (Figure 1b) the on-clause directly refers to the implicit processor array, with the i-th iteration assigned to the k-th processor, where k is the current value of the array element denoted by NOP(i). T is declared private in the forall loop; logically there are N copies of the real variable T, one for each iteration of the loop. Thus assignments to such variables do not cause loop-carried dependences.

A reduction statement may be used within forall loops to perform such operations as global sums (cf. ADD below); the result is not available until the end of the loop. The user may also define reduction functions for operations which are commutative and associative in the mathematical sense. The intrinsic reduction operators provided by Vienna Fortran are ADD, MULT, MAX and MIN. The forall loop in Figure 2a results in the values of the array A being summed and the result being placed in the variable X. In each iteration of the forall loop in Figure 2b, elements of D and E are multiplied, and the result is used to increment the corresponding element of B. In general, all of the arrays B, D, E, X, Y, and Z can be distributed.

      FORALL I = 1, N ON OWNER(A(I))
         ...
         REDUCE ( ADD, X, A(I) )
         ...
      END FORALL

      (a) Summing the values of a distributed array

      FORALL I = 1, N ON OWNER(B(X(I)))
         ...
         REDUCE ( ADD, B(X(I)), D(Y(I))*E(Z(I)) )
         ...
      END FORALL

      (b) Accumulating values onto a distributed array

Figure 2: Applying reduction statements.

Input/Output Files read or written by parallel programs may be stored in a distributed manner or on a single storage device. A separate set of I/O operations is provided to enable individual processor access to data stored across several devices.

I/O Operations The concurrent I/O operations supported by Vienna Fortran can be classified into three groups: data transfer, inquiry and file manipulation operations. These operations deal with whole arrays which are distributed across a set of processors.

Thus, a global synchronization of the processors is required before they cooperate to execute the operation.

Writing to a File The concurrent write statement, CWRITE, can be used to write multiple arrays to a file in a single statement. For each array a distributed array record is written onto the file. Vienna Fortran provides three forms of the concurrent write statement, which affect the order of the data elements written out to the distributed array record.

(i) In the simplest form, the individual distributions of the arrays determine the sequence of array elements written out to the file, as in the following statement:

      CWRITE (f) A1, A2, ..., Ar

where f denotes the I/O unit number and the A_i, 1 ≤ i ≤ r, are array identifiers. This form should be used when the data is going to be read into arrays with the "same" distribution as the A_i. In this situation, the sequence of elements in the file is generated by concatenating the linearized local segments of each array owned by the individual processors, in increasing order of the linearized processor indices. This is the most efficient form of writing out a distributed array, since each processor can independently (and in parallel) write out the piece of the array that it owns, thus utilizing the I/O capacity of the architecture to its fullest.

(ii) Consider the situation in which the data is to be read several times into an array B, where the distribution of B is different from that of the array being written out. In this case, the user may wish to optimize the sequence of data elements in the file according to the distribution of the array B so as to make the multiple read operations more efficient. Additional parameters of the CWRITE statement enable the user to specify (a) the shape of the distributed array to which the read operation will be applied, and (b) its distribution. These additional specifications can then be used by the compiler to determine the sequence of elements in the output file. If a shape is specified, the size of the arrays A1, ..., Ar has to be equal to the product of the extents of the specified index domain, and the resulting rank and shape have to match the distribution specification. For example, the following statement can be used if A is a two-dimensional array:

      CWRITE (f, PROCESSORS='R2D(N,N)',
     &        DIST='(BLOCK,CYCLIC) TO R2D') A

Here, the elements of the array A are written so as to optimize reading them into an array which is distributed as (BLOCK, CYCLIC). Depending on the sequence to be written, the processors (a) could synchronize so as to execute the correct sequence of the individual writes to secondary storage, or (b) could incur the overhead of redistributing the data internally before using a parallel write operation to output the data.

(iii) If the data in a file is to be subsequently read into arrays with different distributions, or if there is no information available about the distribution of the target arrays, the user may let the compiler choose the sequence of the elements to be written out. This is done by specifying 'SYSTEM' as the distribution in the CWRITE statement:

      CWRITE (f, DIST='SYSTEM') A1, ..., Ar

This allows the compiler and the runtime system to cooperate in determining the best possible sequence for writing out the data, given that nothing is known about the distribution of the target arrays.

Reading from a File A read operation on one or more distributed arrays is specified by a statement of the following form:

      CREAD (f) B1, B2, ..., Br

where again f denotes the I/O unit number and the B_i, 1 ≤ i ≤ r, are array identifiers. The operation reads the next r distributed array records in f; the data elements of the i-th record are read into B_i. Note that the semantics of standard FORTRAN I/O operations has to be maintained. That is, if an array A is written out to a file and then read into another array B, the column-major linearization of FORTRAN arrays determines which element of A is read into a given element of B. The actual transfer of data is thus performed by taking into account the distribution descriptor of the i-th record and the shape and distribution of B_i.

Accessing a Distribution Descriptor The distribution descriptor of the current distributed array record in the file can be accessed as follows:

      CDISTR (f)
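Putting the data transfer operations together, a hypothetical round trip might look as follows (the unit f and the array names are illustrative only; CREWIND is one of the file manipulation operations listed in the next subsection). The array A is written in a compiler-chosen element order, and the record is then read back into B, whose distribution may differ from that of A:

      CWRITE (f, DIST='SYSTEM') A
      CREWIND (f)
      CREAD (f) B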

Other Operations

      COPEN (colist)    - Open an array file.
      CCLOSE (cclist)   - Close an array file.
      CSKIP (f, *)      - Skip to the end of the file.
      CSKIP (f, n)      - Skip n distributed array records.
      CBACKARRAY (f)    - Move back to the previous array record.
      CREWIND (f)       - Rewind the file.
      CEOF (f)          - Check for end of file.

The operations COPEN and CCLOSE have the same meaning, and the lists colist and cclist the same form, as their counterparts in FORTRAN. Note that the concurrent I/O operations supported by Vienna Fortran can be applied only to the special array files defined here; conversely, array files can only be accessed through these operations.

3 Overview of the Parallelization Strategy used in VFCS

VFCS is an interactive system: the user and the system work as a team to produce good parallel code. During the transformation process, the user is able to inspect the internal information, supply special information to the system and select transformations. The ability of an interactive system to provide information about the program at a selected level is useful during the parallelization process as well as for the development of new transformations. For example, it is easy to identify parallelization-inhibiting factors by analysing data dependences. Since dependences guide the process of communication optimization, interactive dependence analysis is useful in this context.

The system also provides some assistance for automatic compilation. Many parts of the transformation process are similar in spite of different input programs. The system is able to save and re-execute sequences of transformations, to recompile the same source code automatically, and to define "macro sequences" of transformations for the automatic compilation of new source codes. Determining the best transformation in a given situation, or deciding whether the program is "near" optimal, is a difficult task and requires expert intelligence. Therefore, an intelligent subsystem for advice giving and for controlling the transformation process for a class of DMMPs is being developed.

It is supported by a Parameter Based Performance Prediction Tool [6] which statically computes a set of parameters that characterize the behaviour of the parallel program.

The structure of VFCS is shown in Figure 3. Input programs are written in Vienna Fortran. The main goal of VFCS is to generate a host program and a parallel node program with minimized load imbalance and communication costs. The system is made up of six main components, all of which utilize the program database: the kernel, the three frontend components, the backend and the reconstructor. The components perform the following tasks:

Frontend 1: scanning, parsing and normalization of individual program units
Frontend 2: collection of global information, e.g. the call graph; program splitting
Frontend 3: execution of a predefined transformation sequence on each program unit
Backend: machine dependent transformations
Reconstructor: reconstruction of the parallelized code from the syntax tree
Kernel: system organization, execution of analysis services and transformations

The system components are implemented as individual programs, except for frontend 3, which is integrated in the kernel system. This structure facilitates future extensions for other target machines (using this conception, VFCS is able to generate code for various target machines) and for other input languages.

The kernel implements the user interface. The user may activate other system parts and select analysis services and transformations. He is able to inspect the internal information via the services of the information component. Since the program database contains only the current version of the program, VFCS provides the additional service of saving program versions; the user may thus return to an earlier state of the parallelization process if the performed transformations were not successful. The system provides support for automatic recompilation of the same source via the tracing of the actions performed during an interactive session. The trace files can be executed automatically as long as changes in the source code do not affect the positioning commands which are necessary to identify the source code regions on which special transformations have been executed.

In VFCS, parallelization is guided by a user-defined data partition, specifying a set of processors, a set of distributed arrays and their individual distributions. The distribution of work results from this specification by the owner-computes rule, i.e. a process executes all assignments to array elements mapped to it. Accesses to non-local array elements are implemented via interprocess communication. The overlap concept is used to describe the non-local variables accessed by a processor.

[Figure 3: System Structure. Block diagram of VFCS: a Vienna Fortran program passes through Frontend 1 (building the syntax tree and call graph), Frontend 2, Frontend 3 and the Backend to the Reconstructor, which produces the parallelized program. All components operate on the program database (interprocedural database, dependence graph, partitioning information) under control of the kernel, which comprises the analysis, transformation, information, interactive and tracing components and manages trace files and saved program versions.]

The overlap area of a process consists of all non-local elements in an area around the rectangular section assigned to that process. The overlap concept simplifies storage allocation as well as the optimization of the communication between processors. VFCS performs interprocedural analysis to determine the maximum overlap area for each distributed array in the program. This information is used to statically allocate storage for copies of non-local data.

The overlap concept is especially tailored to the efficient handling of programs whose local computations adhere to a regular pattern. For such programs, the set of non-local variables of a process can be described by a small overlap area around its local segment. However, the overlap concept cannot adequately handle computations with irregular accesses, as they arise in sparse or unstructured problems, for example.

Here, subscript functions often depend on data that are available only at runtime. Because of this dependence on runtime data, worst-case compile-time assumptions must be made by VFCS in most of the cases mentioned above when determining an overlap description. This results in the allocation of memory for every potentially non-local variable and in additional overhead for the resulting communication, part of which may be superfluous. To effectively exploit distributed memory systems for irregular computations, techniques for runtime compilation ([14, 15, 23, 24, 25]) have been developed: the compiler generates code that carries out a runtime analysis of the corresponding parts of the input program. Besides these implementation techniques, languages have been designed that provide means for the specification of irregular distributions and support the efficient compilation of codes from sparse or unstructured applications [3, 7, 17]. In the following we describe how the runtime techniques are integrated with the advanced compile-time parallelization techniques of VFCS [1].

Currently, parallelization in VFCS is performed in five steps, which are illustrated by a simple example (Figure 4).

Step 1: Program Splitting. Program splitting transforms the input program into a host and a node program. All I/O statements are collected in the host program, and communication statements for the corresponding value transfers are inserted into both programs. In the resulting code, the host process is loosely synchronized with the node processes; thus, the host process may read input values before they are actually needed in the node processes.

Step 2: Initial Adaptation. A different kind of processing is applied to program parts that are enclosed by forall loops than to the rest of the node program (Figure 6). For the program parts not enclosed by forall loops, the initial adaptation distributes the entire work assigned to these node program parts across the set of all node processors according to the given array distributions, and resolves accesses to non-local data via communication. The basic rule governing the assignment of work to the node processors is that a node processor is responsible for executing all the assignments to its local data that occur in the original sequential program (owner-computes rule). The distribution of work is internally expressed by masks: a mask is a boolean guard that is attached to each statement. A statement is executed iff its mask evaluates to true; the mask of a statement is omitted if it is always TRUE. The mask owned(ref) is satisfied in a processor iff the variable accessed is local to it. After masking has been performed, the node program parts processed by this technique may contain references to non-local objects.

For every reference which may access a non-local variable, a communication statement EXSR (see [8]) is inserted which updates a private copy of the variable if necessary. The communication statements are extended by a description which determines those processes that exchange the array element accessed. The computation of this description is not discussed in this paper; the interested reader can find it in [8]. An EXSR (EXchange Send Receive) statement is syntactically described as

      EXSR(A(I_1,...,I_n), [l_1/u_1,...,l_n/u_n])

where v = A(I_1,...,I_n) is the array element inducing communication and ovp = [l_1/u_1,...,l_n/u_n] is the overlap description for the array A. For each i, l_i and u_i respectively specify the left and right extension of dimension i in the local segment of A. The semantics of an EXSR statement is described in Figure 5.

Example 1:

      program Example
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
      read (*,*) b
c     ... initialization of map and index ...
c     ... by some user defined algorithms ...
      do i = 3, 90
         c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 4: Input Program.

Various analysis techniques are applied to the forall loops. For each forall loop for which the user did not specify the work distribution, the initial adaptation derives an initial work distribution, which we call the automatic work distribution (the technique used is described in [1]).

The specification of this distribution appears in the on clause of the header of the forall loop. Lists of the variables (distributed and undistributed arrays, and scalar variables) occurring in the loop body are constructed for each forall loop; these lists can be viewed by the user via the information services of VFCS. In VFCS, distributed arrays are classified into distribution classes; a global list of the distribution classes encountered in the forall loops of the whole program is also constructed in the initial adaptation phase. The information about the work distribution derived by the system, and the lists constructed, are stored in the internal representation.

      IF (executing processor p owns v) THEN
         send v to all processors p' such that
            (1) p' reads v, and
            (2) p' does not own v, and
            (3) v is in the overlap area of p'
      ELSE IF (v is in the overlap area of p) THEN
         receive v
      ENDIF

Figure 5: Description of the EXSR statement.

Step 3: View/Modify Work Distribution. The user can change the work distribution derived by the system in the initial adaptation phase, or the work distribution specified by the user, to any type supported by VFCS; e.g., he or she can prescribe that iteration i of the forall loop be executed on the processor whose index is stored in the integer array element map(i). The forall loop with the modified work distribution can be seen in Figure 7. A suitable work distribution helps minimize the load imbalance of the forall loop.

Step 4: Optimization. The code resulting from the initial adaptation is usually not efficient, since updating is performed via single-element messages and the work distribution is enforced at the statement level. In the optimization phase, special transformations are applied to generate more efficient code (see Figure 8). First, communication is extracted from the surrounding do-loops, resulting in the fusion of messages; secondly, loop iterations which do not perform any computation on local variables are suppressed in the node processes. Loop bounds are parameterized according to the data distribution: the bounds of the local segment of c are stored on each processor in the private variables $L and $R, and the mask of the assignment is enforced in the loop bounds. If we look at processor p1d(1), for example, we see that it executes iterations 3 to 25 and thus only writes local variables.

Furthermore, VFCS detects standard reductions, such as the sum, product, maximum and minimum of vector elements and the dot product of two vectors, and treats them in an efficient way. The forall loops are optimized in the following step.

Example 2:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
      receive b
c     ... initialization of map and index ...
      do i = 3, 90
         EXSR (b(i+10), [0/10])
         EXSR (b(i-2), [-2/0])
         owned(c(i)) → c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 6: Node program after initial adaptation.

Step 5: Code Generation. In VFCS, the program that is being parallelized and all information collected at compile time are stored in an internal representation. In the last step, the back-end adapts the internal representation to the target Fortran language. Then the reconstructor produces files with the FORTRAN code of the host and node programs, which can be passed to the native FORTRAN compilers to generate object code for the host and node processors.

If a forall loop appears in the program unit processed, the back-end generates new data structures and statements for the runtime processing of this loop.

All new constructs are generated at the syntax tree level, using lists which describe the utilization of the variables occurring in the program statements. These lists are available for every statement and are constructed in the previous parallelization steps.

Example 3:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer map(100), index(100)
c     ... initialization of map and index ...
      forall i = iexp1, iexp2 on p1d(map(i))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 7: Node program after modifying the work distribution specification.

A runtime processing strategy based on the inspector-executor paradigm is used: each forall loop is evaluated in two phases. The inspector phase generates a description of the communication necessary for the loop; the executor uses this information to perform the communication and the execution of the loop body.

There are three major tasks to be performed when compiling forall loops for DMMPs. The first is distributing the forall loop iterations across the processors, the second is generating communication statements, and the third is performing extensive optimizations to minimize the runtime preprocessing and the forall loop execution time. If, during execution of the forall body, data are used or modified on a processor other than their home, communication of these data is required. The semantics of the Vienna Fortran forall loop guarantees that the data needed during the loop are available before its execution begins. Similarly, if a processor modifies data stored on another one, the update can be deferred until execution of the forall loop finishes.

Example 4:

c     program NODE
      processors p1d(4)
      real a(100), b(400), c(100) dist (block)
      real z
      integer index(100)
c     ... initialization of map and index ...
      EXSR (b(*), [0/10])
      EXSR (b(*), [-2/0])
      do i = max($L,3), min($R,90)
         c(i) = b(i-2) + b(i+10)
      enddo
      forall i = iexp1, iexp2 on owner(a(index(i)))
         z = c(i) * b(i)
         a(index(i)) = b(index(i)+1) + z
      end forall
      end

Figure 8: Node program after compile-time optimization.

Thus the compiler may implement a forall loop with the following strategy:

- Determine the global iteration set Iter_set. This set is computed from the forall loop header parameters.
- Derive the work distribution specification, if it has not been provided by the user.
- For each processor p:
  - Determine the local iteration set exec(p). This phase uses Iter_set and the work distribution specification as inputs.
  - Identify the sets of all non-local data used or modified on this processor and the sets of all local data that other processors access.
  - Exchange the sets of data to be used during the forall loop iterations with other processors.
  - Execute the iterations from exec(p) on this processor. The forall loop body must be transformed correspondingly.
  - Exchange the sets of data that were modified during the forall loop execution with other processors.

Much of the complexity of our implementation is contained in the PARTI routines ([26, 25, 23, 5]), which were originally developed as runtime support for irregular computations with primarily irregular distributions of data. On top of the PARTI routines, a VFCS runtime library was constructed to provide a suitable interface between VFCS and the PARTI routines. The VFCS implementation of the inspector/executor paradigm is described in [1].

4 Related Work

DINO [20, 21] is an extended version of C supporting general purpose computation. It is an early attempt to provide higher-level language constructs for the specification of algorithms on DMMPs. The description of SUPERB in [30] is the first journal publication in the area of compiling Fortran for DMMPs. The concept of defining processor arrays and distributing data to them was first introduced in the programming language BLAZE [16] in the context of shared memory systems with nonuniform access times. This research was continued with the Kali programming language [17] for distributed memory machines, which requires that the user specify data distributions in much the same way that Vienna Fortran does. The design of Kali has greatly influenced the development of Vienna Fortran; in particular, the parallel FORALL loops of Vienna Fortran were first defined in Kali and implemented with the inspector-executor paradigm. On top of PARTI, Joel Saltz and coworkers developed a compiler for a language called ARF (ARguably Fortran); the compiler automatically generates calls to the PARTI routines from distribution annotations and from distributed loops with an on clause for work distribution [29]. A commercially available system is the MIMDizer [18], which may be used to parallelize sequential Fortran programs according to the SPMD model. The programming language Fortran D [7] proposes a Fortran language extension in which the programmer specifies the distribution of data by aligning each array to a decomposition, which corresponds to an HPF template, and then specifying a distribution of the decomposition to a virtual machine. These are executable statements, and array distributions are dynamic only. A subset of Fortran D has been implemented for the iPSC/860 [10].

Cray Research Inc. has announced a set of language extensions to Cray Fortran (cf77) [19] which enable the user to specify the distribution of data and work. Several methods for distributing the iterations of loops are provided, and special directives for specifying concurrent file operations are also available.

5 Conclusions

In this paper, we have described the main features of Vienna Fortran and outlined the compilation strategy of VFCS. This system provides sophisticated analyses, communication and work optimizations, and runtime compilation. The back end of VFCS transforms the internal representation of the program into message passing Fortran code. Back ends generating Intel iPSC/860 and GENESIS-P Fortran code have been designed and implemented; the latter generates code with the PARMACS portable message-passing macros. By now, the PARMACS macros have been implemented on the iPSC/2, iPSC/860, NCUBE 2, Parsytec, SUPRENUM, and GENESIS-P machines, as well as on some workstation networks. This wide range of PARMACS implementations contributes significantly to the portability of programs parallelized by VFCS.

References

[1] P. Brezany, M. Gerndt, V. Sipkova, H. P. Zima. SUPERB Support for Irregular Scientific Computations. In Proceedings of the Scalable High Performance Computing Conference, Williamsburg, USA, April 1992.

[2] S. Benkner, B. Chapman, H. P. Zima. Vienna Fortran 90. In Proceedings of the Scalable High Performance Computing Conference, Williamsburg, USA, April 1992.

[3] B. Chapman, P. Mehrotra, H. P. Zima. Programming in Vienna Fortran. ACPC Technical Report Series, University of Vienna, Vienna, Austria.

[4] K. D. Cooper, K. Kennedy and L. Torczon. The Impact of Interprocedural Analysis and Optimization in the R^n Programming Environment. ACM Transactions on Programming Languages and Systems, Vol. 8, No. 4, October 1986, 491-523.

[5] R. Das, J. Saltz. A Manual for PARTI Runtime Primitives - Revision 2. Internal Research Report, University of Maryland.

[6] T. Fahringer, H. Zima. A Static Parameter Based Performance Prediction Tool for Parallel Programs. Austrian Center for Parallel Computation, Technical Report ACPC/TR 93-1, January 1993.

[7] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu. Fortran D Language Specification. Rice University, Technical Report COMP TR90-141, December 1990.

[8] H. M. Gerndt. Automatic Parallelization for Distributed-Memory Multiprocessing Systems. Ph.D. Dissertation, University of Bonn.

[9] V. A. Guarna Jr., D. Gannon, D. Jablonowski, A. D. Malony, Y. Gaur. Faust: An Integrated Environment for the Development of Parallel Programs. IEEE Software, July 1989.

[10] S. Hiranandani, K. Kennedy and C.-W. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, Vol. 35, No. 8, pages 66-80, August 1992.

[11] High Performance Fortran Forum. DRAFT High Performance Fortran Language Specification, Version 1.0 Draft, January 25, 1993. Technical Report, Rice University.

[12] C. Koelbel. Compiling Programs for Nonshared Memory Machines. Ph.D. Dissertation, Purdue University, West Lafayette, IN, November 1990.

[13] C. Koelbel. Compile-Time Generation of Regular Communications Patterns. In Proceedings Supercomputing '91, Albuquerque, 1991.

[14] C. Koelbel and P. Mehrotra. Compiling Global Name-Space Parallel Loops for Distributed Execution. IEEE Transactions on Parallel and Distributed Systems, October 1991.

[15] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting Shared Data Structures on Distributed Memory Architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186, March 1990.

[16] P. Mehrotra, J. Van Rosendale. The BLAZE Language: A Parallel Language for Scientific Programming. Parallel Computing, Vol. 5, 1987.

[17] P. Mehrotra, J. Van Rosendale. Programming Distributed Memory Architectures Using Kali. ICASE, NASA Langley Research Center.

[18] MIMDizer User's Guide, Version 8.0. Applied Parallel Research Inc., Placerville, CA.

[19] D. M. Pase. MPP Fortran Programming Model, Draft 1.0. Technical Report, Cray Research, October.

[20] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing Complex Parallel Algorithms in DINO. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers, and Applications, pages 553-560.

[21] M. Rosing, R. W. Schnabel, and R. P. Weaver. The DINO Parallel Programming Language. Technical Report CU-CS, University of Colorado, Boulder, CO, April.

[22] R. Ruehl, M. Annaratone. Parallelization of Fortran Code on Distributed-Memory Parallel Processors. In Proceedings of the 4th International Conference on Supercomputing, Amsterdam, 1990.

[23] J. Saltz, H. Berryman, and J. Wu. Runtime Compilation for Multiprocessors. Report 90-59, ICASE, 1990.

[24] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-Time Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing, 8(2):303-312, 1990.

[25] J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu. PARTI Procedures for Realistic Loops. In Proceedings of DMCC6, Portland, OR, 1991.

[26] R. Das, J. Saltz, H. Berryman. A Manual for PARTI Runtime Primitives - Revision 1. Interim Report 91-17, ICASE, 1991.

[27] J. McGraw, S. Skedzielewski, S. Allan, R. Oldehoeft, J. Glauert, C. Kirkham, W. Noyce, and R. Thomas. SISAL: Streams and Iteration in a Single Assignment Language: Language Reference Manual. Report M-146, Lawrence Livermore National Laboratory, March.

[28] CM Fortran Reference Manual, Version 5.2. Thinking Machines Corporation, Cambridge, MA.

[29] J. Wu, J. Saltz, H. Berryman, S. Hiranandani. Distributed Memory Compiler Design for Sparse Problems. ICASE Report, January.

[30] H. Zima, H.-J. Bast, and H. M. Gerndt. SUPERB - A Tool for Semi-Automatic MIMD/SIMD Parallelization. Parallel Computing, 6, 1-18, 1988.

[31] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran - A Language Specification. ACPC Technical Report Series, University of Vienna, Vienna, Austria. Also available as ICASE Interim Report 21, MS 132c, NASA, Hampton, VA.

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and

More information

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is

More information

Chapter 1. Reprinted from "Proc. 6th SIAM Conference on Parallel. Processing for Scientic Computing",Norfolk, Virginia (USA), March 1993.

Chapter 1. Reprinted from Proc. 6th SIAM Conference on Parallel. Processing for Scientic Computing,Norfolk, Virginia (USA), March 1993. Chapter 1 Parallel Sparse Matrix Vector Multiplication using a Shared Virtual Memory Environment Francois Bodin y Jocelyne Erhel y Thierry Priol y Reprinted from "Proc. 6th SIAM Conference on Parallel

More information

Chapter 3. Fortran Statements

Chapter 3. Fortran Statements Chapter 3 Fortran Statements This chapter describes each of the Fortran statements supported by the PGI Fortran compilers Each description includes a brief summary of the statement, a syntax description,

More information

Cedar Fortran Programmer's Manual 1. Jay Hoeinger. Center for Supercomputing Research and Development. Urbana, Illinois

Cedar Fortran Programmer's Manual 1. Jay Hoeinger. Center for Supercomputing Research and Development. Urbana, Illinois Cedar Fortran Programmer's Manual 1 Jay Hoeinger Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign Urbana, Illinois 61801 June 14, 1993 1 This work was supported

More information

High-Level Management of Communication Schedules in HPF-like Languages

High-Level Management of Communication Schedules in HPF-like Languages High-Level Management of Communication Schedules in HPF-like Languages Siegfried Benkner a Piyush Mehrotra b John Van Rosendale b Hans Zima a a Institute for Software Technology and Parallel Systems, University

More information

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

The S-Expression Design Language (SEDL) James C. Corbett. September 1, Introduction. 2 Origins of SEDL 2. 3 The Language SEDL 2.

The S-Expression Design Language (SEDL) James C. Corbett. September 1, Introduction. 2 Origins of SEDL 2. 3 The Language SEDL 2. The S-Expression Design Language (SEDL) James C. Corbett September 1, 1993 Contents 1 Introduction 1 2 Origins of SEDL 2 3 The Language SEDL 2 3.1 Scopes : : : : : : : : : : : : : : : : : : : : : : : :

More information

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995 ICC++ Language Denition Andrew A. Chien and Uday S. Reddy 1 May 25, 1995 Preface ICC++ is a new dialect of C++ designed to support the writing of both sequential and parallel programs. Because of the signicant

More information

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada Distributed Array Data Management on NUMA Multiprocessors Tarek S. Abdelrahman and Thomas N. Wong Department of Electrical and Computer Engineering University oftoronto Toronto, Ontario, M5S 1A Canada

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Solve the Data Flow Problem

Solve the Data Flow Problem Gaining Condence in Distributed Systems Gleb Naumovich, Lori A. Clarke, and Leon J. Osterweil University of Massachusetts, Amherst Computer Science Department University of Massachusetts Amherst, Massachusetts

More information

i=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8)

i=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8) Vectorization Using Reversible Data Dependences Peiyi Tang and Nianshu Gao Technical Report ANU-TR-CS-94-08 October 21, 1994 Vectorization Using Reversible Data Dependences Peiyi Tang Department of Computer

More information

New Programming Paradigms: Partitioned Global Address Space Languages

New Programming Paradigms: Partitioned Global Address Space Languages Raul E. Silvera -- IBM Canada Lab rauls@ca.ibm.com ECMWF Briefing - April 2010 New Programming Paradigms: Partitioned Global Address Space Languages 2009 IBM Corporation Outline Overview of the PGAS programming

More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

Advanced Compiler Construction

Advanced Compiler Construction CS 526 Advanced Compiler Construction http://misailo.cs.illinois.edu/courses/cs526 INTERPROCEDURAL ANALYSIS The slides adapted from Vikram Adve So Far Control Flow Analysis Data Flow Analysis Dependence

More information

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group SAMOS: an Active Object{Oriented Database System Stella Gatziu, Klaus R. Dittrich Database Technology Research Group Institut fur Informatik, Universitat Zurich fgatziu, dittrichg@ifi.unizh.ch to appear

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

The Pandore Data-Parallel Compiler. and its Portable Runtime. Abstract. This paper presents an environment for programming distributed

The Pandore Data-Parallel Compiler. and its Portable Runtime. Abstract. This paper presents an environment for programming distributed The Pandore Data-Parallel Compiler and its Portable Runtime Francoise Andre, Marc Le Fur, Yves Maheo, Jean-Louis Pazat? IRISA, Campus de Beaulieu, F-35 Rennes Cedex, FRANCE Abstract. This paper presents

More information

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION AN EMPIRICAL STUDY OF PERSPECTIVE-BASED USABILITY INSPECTION Zhijun Zhang, Victor Basili, and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20742, USA fzzj, basili,

More information

TECHNICAL RESEARCH REPORT

TECHNICAL RESEARCH REPORT TECHNICAL RESEARCH REPORT A Resource Reservation Scheme for Synchronized Distributed Multimedia Sessions by W. Zhao, S.K. Tripathi T.R. 97-14 ISR INSTITUTE FOR SYSTEMS RESEARCH Sponsored by the National

More information

Embedding Data Mappers with Distributed Memory Machine Compilers

Embedding Data Mappers with Distributed Memory Machine Compilers Syracuse University SURFACE Electrical Engineering and Computer Science Technical Reports College of Engineering and Computer Science 4-1992 Embedding Data Mappers with Distributed Memory Machine Compilers

More information

Chapter 4. Fortran Arrays

Chapter 4. Fortran Arrays Chapter 4. Fortran Arrays Fortran arrays are any object with the dimension attribute. In Fortran 90/95, and in HPF, arrays may be very different from arrays in older versions of Fortran. Arrays can have

More information

I R I S A P U B L I C A T I O N I N T E R N E DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN

I R I S A P U B L I C A T I O N I N T E R N E DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN I R I P U B L I C A T I O N I N T E R N E N o 787 S INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES A DISTRIBUTED ARRAY MANAGEMENT FOR HPF COMPILERS YVES MAHÉO, JEAN-LOUIS PAZAT ISSN 1166-8687

More information

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90 149 Fortran and HPF 6.2 Concept High Performance Fortran 6.2 Concept Fortran90 extension SPMD (Single Program Multiple Data) model each process operates with its own part of data HPF commands specify which

More information

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr Performance Evaluation of Automatically Generated Data-Parallel Programs L. Massari Y. Maheo DIS IRISA Universita dipavia Campus de Beaulieu via Ferrata 1 Avenue du General Leclerc 27100 Pavia, ITALIA

More information

International Standards Organisation. Parameterized Derived Types. Fortran

International Standards Organisation. Parameterized Derived Types. Fortran International Standards Organisation Parameterized Derived Types in Fortran Technical Report defining extension to ISO/IEC 1539-1 : 1996 {Produced 4-Jul-96} THIS PAGE TO BE REPLACED BY ISO CS ISO/IEC 1

More information

High Performance Fortran http://www-jics.cs.utk.edu jics@cs.utk.edu Kwai Lam Wong 1 Overview HPF : High Performance FORTRAN A language specification standard by High Performance FORTRAN Forum (HPFF), a

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26

Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Technical Report ANU-TR-CS-92- November 7, 992 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer

More information

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup.

Wei Shu and Min-You Wu. Abstract. partitioning patterns, and communication optimization to achieve a speedup. Sparse Implementation of Revised Simplex Algorithms on Parallel Computers Wei Shu and Min-You Wu Abstract Parallelizing sparse simplex algorithms is one of the most challenging problems. Because of very

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

In context with optimizing Fortran 90 code it would be very helpful to have a selection of

In context with optimizing Fortran 90 code it would be very helpful to have a selection of 1 ISO/IEC JTC1/SC22/WG5 N1186 03 June 1996 High Performance Computing with Fortran 90 Qualiers and Attributes In context with optimizing Fortran 90 code it would be very helpful to have a selection of

More information

Interprocedural Compilation of Fortran D for. MIMD Distributed-Memory Machines. Mary W. Hall. Seema Hiranandani. Ken Kennedy.

Interprocedural Compilation of Fortran D for. MIMD Distributed-Memory Machines. Mary W. Hall. Seema Hiranandani. Ken Kennedy. Interprocedural Compilation of Fortran D for MIMD Distributed-Memory Machines Mary W. Hall Seema Hiranandani Ken Kennedy Chau-Wen Tseng CRPC-TR 91195 November 1991 Center for Research on Parallel Computation

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

Y. Han* B. Narahari** H-A. Choi** University of Kentucky. The George Washington University

Y. Han* B. Narahari** H-A. Choi** University of Kentucky. The George Washington University Mapping a Chain Task to Chained Processors Y. Han* B. Narahari** H-A. Choi** *Department of Computer Science University of Kentucky Lexington, KY 40506 **Department of Electrical Engineering and Computer

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam Parallel Paradigms & Programming Models Lectured by: Pham Tran Vu Prepared by: Thoai Nam Outline Parallel programming paradigms Programmability issues Parallel programming models Implicit parallelism Explicit

More information

A Loosely Synchronized Execution Model for a. Simple Data-Parallel Language. (Extended Abstract)

A Loosely Synchronized Execution Model for a. Simple Data-Parallel Language. (Extended Abstract) A Loosely Synchronized Execution Model for a Simple Data-Parallel Language (Extended Abstract) Yann Le Guyadec 2, Emmanuel Melin 1, Bruno Ran 1 Xavier Rebeuf 1 and Bernard Virot 1? 1 LIFO - IIIA Universite

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Fortran 2008: what s in it for high-performance computing

Fortran 2008: what s in it for high-performance computing Fortran 2008: what s in it for high-performance computing John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory Fortran 2008 has been completed and is about to be published.

More information

Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP Shared Memory Programming with OpenMP (An UHeM Training) Süha Tuna Informatics Institute, Istanbul Technical University February 12th, 2016 2 Outline - I Shared Memory Systems Threaded Programming Model

More information

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988.

[8] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Programs. In D. B. Gannon L. H. Jamieson {24, August 1988. editor, Proceedings of Fifth SIAM Conference on Parallel Processing, Philadelphia, 1991. SIAM. [3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam. A users' guide to PVM parallel

More information

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory This talk will explain the objectives of coarrays, give a quick summary of their history, describe the

More information

Acknowledgments 2

Acknowledgments 2 Program Slicing: An Application of Object-oriented Program Dependency Graphs Anand Krishnaswamy Dept. of Computer Science Clemson University Clemson, SC 29634-1906 anandk@cs.clemson.edu Abstract A considerable

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Point-to-Point Synchronisation on Shared Memory Architectures

Point-to-Point Synchronisation on Shared Memory Architectures Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:

More information

Interactive Parallel Programming Using. the ParaScope Editor. Department of Computer Science, Rice University, Houston, TX

Interactive Parallel Programming Using. the ParaScope Editor. Department of Computer Science, Rice University, Houston, TX Interactive Parallel Programming Using the ParaScope Editor Ken Kennedy Kathryn McKinley Chau-Wen Tseng Department of Computer Science, Rice University, Houston, TX 77251-1892 May 24, 1994 Abstract The

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Code Generator for HPF Library on Fujitsu VPP5000

Code Generator for HPF Library on Fujitsu VPP5000 UDC 681.325.3 Code Generator for HPF Library on Fujitsu VPP5000 VMatthijs van Waveren VCliff Addison VPeter Harrison VDave Orange VNorman Brown (Manuscript received October 23, 1999) The Fujitsu VPP5000

More information

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J.

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J. Compilation Issues for High Performance Computers: A Comparative Overview of a General Model and the Unied Model Abstract This paper presents a comparison of two models suitable for use in a compiler for

More information

ADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004

ADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004 ADR and DataCutter Sergey Koren CMSC818S Thursday March 4 th, 2004 Active Data Repository Used for building parallel databases from multidimensional data sets Integrates storage, retrieval, and processing

More information

director executor user program user program signal, breakpoint function call communication channel client library directing server

director executor user program user program signal, breakpoint function call communication channel client library directing server (appeared in Computing Systems, Vol. 8, 2, pp.107-134, MIT Press, Spring 1995.) The Dynascope Directing Server: Design and Implementation 1 Rok Sosic School of Computing and Information Technology Grith

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

A Fast Recursive Mapping Algorithm. Department of Computer and Information Science. New Jersey Institute of Technology.

A Fast Recursive Mapping Algorithm. Department of Computer and Information Science. New Jersey Institute of Technology. A Fast Recursive Mapping Algorithm Song Chen and Mary M. Eshaghian Department of Computer and Information Science New Jersey Institute of Technology Newark, NJ 7 Abstract This paper presents a generic

More information

High Performance Fortran. Language Specication. High Performance Fortran Forum. January 31, Version 2.0

High Performance Fortran. Language Specication. High Performance Fortran Forum. January 31, Version 2.0 High Performance Fortran Language Specication High Performance Fortran Forum January, Version.0 The High Performance Fortran Forum (HPFF), with participation from over 0 organizations, met from March to

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing

More information

NOTE: Answer ANY FOUR of the following 6 sections:

NOTE: Answer ANY FOUR of the following 6 sections: A-PDF MERGER DEMO Philadelphia University Lecturer: Dr. Nadia Y. Yousif Coordinator: Dr. Nadia Y. Yousif Internal Examiner: Dr. Raad Fadhel Examination Paper... Programming Languages Paradigms (750321)

More information

Co-arrays to be included in the Fortran 2008 Standard

Co-arrays to be included in the Fortran 2008 Standard Co-arrays to be included in the Fortran 2008 Standard John Reid, ISO Fortran Convener The ISO Fortran Committee has decided to include co-arrays in the next revision of the Standard. Aim of this talk:

More information

Keywords: networks-of-workstations, distributed-shared memory, compiler optimizations, locality

Keywords: networks-of-workstations, distributed-shared memory, compiler optimizations, locality Informatica 17 page xxx{yyy 1 Overlap of Computation and Communication on Shared-Memory Networks-of-Workstations Tarek S. Abdelrahman and Gary Liu Department of Electrical and Computer Engineering The

More information

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming CSE 262 Spring 2007 Scott B. Baden Lecture 4 Data parallel programming Announcements Projects Project proposal - Weds 4/25 - extra class 4/17/07 Scott B. Baden/CSE 262/Spring 2007 2 Data Parallel Programming

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

signature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1

signature i-1 signature i instruction j j+1 branch adjustment value if - path initial value signature i signature j instruction exit signature j+1 CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme

More information

Fortran 90 - A thumbnail sketch

Fortran 90 - A thumbnail sketch Fortran 90 - A thumbnail sketch Michael Metcalf CERN, Geneva, Switzerland. Abstract The main new features of Fortran 90 are presented. Keywords Fortran 1 New features In this brief paper, we describe in

More information

Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results

Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results Syracuse University SURFACE Northeast Parallel Architecture Center College of Engineering and Computer Science 1993 Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation,

More information

Parallelization System. Abstract. We present an overview of our interprocedural analysis system,

Parallelization System. Abstract. We present an overview of our interprocedural analysis system, Overview of an Interprocedural Automatic Parallelization System Mary W. Hall Brian R. Murphy y Saman P. Amarasinghe y Shih-Wei Liao y Monica S. Lam y Abstract We present an overview of our interprocedural

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Performance Comparison Between AAL1, AAL2 and AAL5

Performance Comparison Between AAL1, AAL2 and AAL5 The University of Kansas Technical Report Performance Comparison Between AAL1, AAL2 and AAL5 Raghushankar R. Vatte and David W. Petr ITTC-FY1998-TR-13110-03 March 1998 Project Sponsor: Sprint Corporation

More information

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742 Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

More information