Cedar Fortran Programmer's Manual. Jay Hoeflinger. Center for Supercomputing Research and Development. Urbana, Illinois


Cedar Fortran Programmer's Manual (1)

Jay Hoeflinger

Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign
Urbana, Illinois

June 14,

(1) This work was supported by the U.S. Department of Energy under Grant No. DOE DE-FG02-85ER25001.

Contents

1  Introduction
2  Memory Allocation and Control
   2.1  Xylem Process and Memory Concepts
   2.2  Cedar Fortran Memory Concepts
   2.3  Controlling Cedar Memory Attributes
   2.4  Rules About Memory Attribute Declaration
   2.5  Memory Attributes with DATA and SAVE Statements
   2.6  Dynamic Variable Allocation
        2.6.1  The ALLOCATE Statement
        2.6.2  The MARK Statement
        2.6.3  The RELEASE Statement
3  Using Vectors in Cedar Fortran
   3.1  Array Sections
   3.2  Array Assignment
   3.3  Array Constructors
   3.4  The WHERE Statement
   3.5  The FORALL Statement
   3.6  Vector Reduction Functions
4  Parallel Loops in Cedar Fortran
   4.1  The Cluster Hardware
   4.2  The Global Hardware
   4.3  Runtime Support for Spread Loops
   4.4  Cedar Fortran Cluster Parallel Loops
        4.4.1  The Operation of a Cluster Parallel Loop
   4.5  Cedar Fortran Spread Parallel Loops
   4.6  Identifier Scope Rules
        4.6.1  Nested Concurrent Loop Statements
        4.6.2  Index Variable Considerations for Nested Loops
   4.7  Premature Termination of Parallel Loops
5  Unstructured Tasking and Synchronization
   5.1  Unstructured Tasking Routines
        5.1.1  Unstructured Tasking Via Macrotasking
        5.1.2  Unstructured Tasking Via Microtasking
   5.2  Synchronization Facilities
   5.3  CDOACROSS Loop Synchronization
        5.3.1  Advance and Await Routines
   5.4  Zhu-Yew Synchronization Primitives
        5.4.1  Specifications of Cedar Sync Functions
   5.5  Cray-XMP Synchronization Routines
   5.6  Using QLocks
        5.6.1  QLock Routines
6  The Runtime Environment
   6.1  Runtime Libraries
   6.2  Using Information About the Environment
        6.2.1  Your Current Cluster ID
   6.3  Controlling the Execution of Your Program
        6.3.1  Environment Variables
        6.3.2  The execute command
   6.4  Using the High-Resolution Counter
        6.4.1  hrcget
        6.4.2  Mapping the High-Resolution Counter
   6.5  mem_touch
7  cf Hints
   7.1  Getting Output from Parts of the Compiler
        7.1.1  Output from the cf Command Itself
        7.1.2  Output from the Cedar Restructurer
        7.1.3  Output from the Cedar Fortran Pre-processor
        7.1.4  Assembler Code from the Alliant Fortran Compiler
   7.2  Running cf at CSRD
        7.2.1  Paths
        7.2.2  Remote usage of A7 and SW
8  Floating Point Exceptions
9  The cf manual page
10 The kap manual page
11 The cftn manual page

List of Figures

1   The Fortran compilation system for Cedar
2   Xylem memory attributes
3   Cluster Parallel Loop Syntax
4   Spread Parallel Loop Syntax
5   Improper attempt to pass private data into an SDOALL loop
6   Program illustrating scope rules
7   Example of the Use of QUIT
8   Example of the Use of QQUIT
9   Barrier Synchronization for Two Tasks with QLocks
10  Unordered Critical Section Implemented with QLocks
11  Vertical and Horizontal Stripmining

[Figure 1: The Fortran compilation system for Cedar. User programs in Fortran 77 or Cedar Fortran flow through the parallelizer (the Cedar Restructurer, Kap), the Cedar Fortran pre-processor (fortran1p), the back-end Alliant Fortran Compiler, and the cftn post-processor to produce an object file.]

1 Introduction

The Fortran compiler for Cedar (see Figure 1) consists of many parts and takes two different input languages. Programs written in standard sequential Fortran77 can be automatically parallelized by the Cedar Restructurer. The restructurer, in turn, expresses the parallel version of the program in the Cedar Fortran language. Programs written directly in the Cedar Fortran language bypass the restructurer. Cedar Fortran gives a user access to all the key features of the machine. Parallelism (both concurrency and vector operation) is expressed directly in this language. This manual describes the Cedar Fortran language and the commands available to invoke the compiler.

The Cedar Multiprocessor [Sta91][KDLS86] is a hierarchical-memory multiprocessor. Processors are arranged in groups, called clusters. All processors have access to a large global memory. Additionally, each cluster has its own cluster memory that can only be accessed by the processors in that cluster. There are four clusters, each containing eight processors. Cedar clusters are CSRD-modified Alliant Computer Systems FX/8 multiprocessors [All86] that are connected through a network to a global memory.

The operating system for Cedar is called Xylem [Sta91][Emr85]. It is derived from the Alliant Concentrix operating system, which itself is derived from Berkeley 4.2 Unix. Xylem provides the facilities for multitasking and task synchronization necessary to fully exploit the power of Cedar.

The Cedar Restructurer is derived from a 1988 version of the KAP restructurer [Kuc88]

from Kuck and Associates. Extensive modifications were made to it to allow it to generate code for the Cedar.

Cedar Fortran is derived from Alliant FX/Fortran (1) [All87] with extensions for parallel loops, placing data in the memory hierarchy, multitasking and synchronization. FX/Fortran is a subset of the Fortran90 standard [Ame91] and a superset of the Fortran77 standard [Ame78]. Cedar Fortran has been specifically designed and implemented to give the user access to the full power of Cedar.

The Cedar system supports three levels of parallelism: vector parallelism, loop parallelism, and task parallelism. A Cedar Fortran program is executed as a process running under the Xylem operating system. The Fortran program can start new tasks to achieve task parallelism. Each task within the program executes on a Cedar cluster. Task parallelism allows parts of the program to execute asynchronously; it is the coarsest level of parallelism. The medium grain level, loop parallelism, allows the processors of a cluster to cooperate in the execution of a given loop. Loop parallelism is expressed in Cedar Fortran with one of the DOALL or DOACROSS statements in the language. These statements will be described in detail in section 4 of this document. The finest level of parallelism available through Cedar Fortran is vector parallelism. Vector instructions allow multiple elements of an array to be accessed or modified using a single instruction. A vector instruction is performed by a single processor within a cluster. Language constructs have been provided in Cedar Fortran to specify vector operations: array assignments, triplets, the WHERE statement, and the FORALL statement.

This manual will assume that the reader is already familiar with Fortran77 and will only describe the differences between Cedar Fortran and Fortran77 in detail. Many resources exist for those who need to learn or review Fortran77, one of them being [Met85].
(1) Alliant, FX/8, FX/Fortran, and Concentrix are all trademarks of Alliant Computer Systems Corporation.

2 Memory Allocation and Control

The concept of memory in Cedar Fortran departs from the traditional view of a linear memory space. Memory in the Cedar multiprocessor is hierarchical. There is a large global memory that is accessible to all processors. Each cluster has a cluster memory that is accessible only to the processors within that cluster.

2.1 Xylem Process and Memory Concepts

The Xylem operating system is a paged virtual memory system. It executes Xylem processes, which consist of one or more tasks. Each task sees a single, contiguous virtual address space, some pages of which are mapped to global memory, while others are mapped to cluster memory. Each task sees an identical virtual address space, but the mapping of those virtual addresses to physical memory is unique to each task. This mapping makes it possible to share some parts of the address space by mapping the same virtual pages in each task to the same physical memory, while other parts are made private by mapping the same virtual pages in each task to distinct places in physical memory.

The memory of a Xylem process is organized into sections. Sections have three types of attributes that are of concern to the Cedar Fortran programmer. These attributes are summarized in Figure 2.

    mode       access     locale
    -------    -------    -------
    read       process    global
    write      task       cluster
    execute

    Figure 2: Xylem memory attributes

The first attribute type specifies what a program may do to a section. These are similar to the Unix file system permissions. The mode of a section may be set to be readable, writable, executable, or any combination of these three. The Cedar Fortran programmer does not need to specify this attribute for data. All data in a Cedar Fortran program are assumed to be read/write.

The second attribute type specifies which processors may access a section; a section may be shared by all processors in all tasks in the process (the process attribute) or be accessible only to the processors within a single task (the task attribute).
Whenever a new cluster task is spawned (see section 5), it will share the process sections with all other tasks in the program. New copies of task sections will be created for the newly spawned task. If the task refers to the contents of such a section, it will receive a copy of the original (pre-execution) contents of the section.

The third type of attribute specifies where the data in a section will reside during execution (the locale of the section). It indicates whether the pages in the section should reside in global memory (the global attribute) or cluster memory (the cluster attribute).

2.2 Cedar Fortran Memory Concepts

Memory use in Cedar Fortran must necessarily be compatible with the Cedar memory hierarchy and the memory structure of the Xylem process. The Cedar Fortran programmer has control over the locale and access attributes of the data used in a program. Variables may be said to be process global, task cluster, task global, or process cluster, depending on the attributes assigned by the programmer.

At this point we must define some terms. Interface data is data which is passed to a subroutine or function via an argument list, or data which is made available to a routine by a COMMON statement. Routine-local data is data declared or used in a subroutine or function which is not interface data, and therefore not accessible outside a routine's boundaries. Routine-local data is typically stored on one of the stacks used at runtime, and may be classified according to which stack it is stored on. Routine-local data may be allocated on a global stack (stored in the Cedar global memory), on a cluster stack (shared by all processors on a cluster and accessed through the cluster stack pointer), or on a processor-private stack (accessed through a private stack pointer for each processor).

The behavior of routine-local data in Cedar Fortran requires some explanation. By default, Cedar Fortran dynamically allocates routine-local data. This departs from the more common static allocation of routine-local data in typical Fortran implementations, but this change is necessary to accommodate concurrency. When a processor enters a subroutine, each routine-local variable is allocated a place on a stack appropriate for its memory attributes, which is used only by that processor. When the processor returns from the subroutine, the routine-local data is deallocated. Routine-local data declared in the main program will be statically allocated. Data that is declared in common blocks will be allocated statically in a section with the appropriate attributes.
There are two types of common blocks, PROCESS COMMON and task COMMON. In PROCESS COMMON, one copy of the common block is shared by all tasks in the program. In task COMMON, a different copy of the common block is given to each task.

2.3 Controlling Cedar Memory Attributes

Specification of the locale for data in Cedar Fortran is possible through the GLOBAL and CLUSTER statements. The syntax for these memory locale statements is as follows:

    GLOBAL  { v | /cname/ } [, { v | /cname/ } ] ...
    CLUSTER { v | /cname/ } [, { v | /cname/ } ] ...

where

    v       is a variable name or array name
    cname   is a common name.

The access attribute is settable for data in common blocks through the COMMON statement. The task COMMON statement declares an access attribute of task (private to each

task) for the data in the common block. In addition, there is a PROCESS COMMON statement which declares an access attribute of process (shared by all tasks in the process) for the data in the common block. Therefore, a copy of a task common block exists for each task, while only one copy of a process common exists to be shared by all tasks.

The access attribute for routine-local data is fixed. It is process for variables declared in a GLOBAL statement, and task for variables declared in a CLUSTER statement or not declared in any memory attribute statement. Task common may be named or unnamed, but process common blocks must be named. The syntax of the COMMON statements is

    COMMON [ / cname / ] v [, v ] ...
    PROCESS COMMON / cname / v [, v ] ...

where

    cname   is the name associated with the common.
    v       is a variable name or an array name.

The default memory attributes for routine-local data and formal arguments to subroutines are task cluster. When routine-local data must be placed in global memory, it must be mentioned in a GLOBAL statement. A programmer has no way to force formal parameters to have a particular locale or access. The placement of a parameter and its accessibility attributes are totally controlled by the routine up the calling tree where the data originates. A programmer may assert a locale for a formal parameter, however, by declaring it in a locale statement. Whenever an array is declared to reside in global memory and is fetched in a vector operation, Cedar vector prefetch instructions are generated for the access, which increase access speed. See section 3 for more information about the vector prefetch mechanism. The default attributes for task common are task cluster, although the common name may be used in a GLOBAL statement to change the attributes of the whole common block to task global. The default attributes for process common are process global.
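The declarations described above can be pulled together in a short sketch (Cedar Fortran as described in this section; the variable and common names are illustrative, and the comments restate the defaults given in the text):

```fortran
      SUBROUTINE work ()
C     Routine-local data defaults to task cluster
C     (allocated on the cluster stack)
      REAL t(100)
C     A GLOBAL statement gives routine-local data the
C     process global attributes (allocated on the global
C     stack; vector fetches of it use the prefetch unit)
      REAL g(1000)
      GLOBAL g
C     A task common block (one copy per task), changed from
C     its task cluster default to task global by naming the
C     common in a GLOBAL statement
      GLOBAL /percl/
      COMMON /percl/ x, y
C     A process common block: one copy shared by all tasks,
C     process global by default
      PROCESS COMMON /shared/ a(10), b
      END
```

Note that g is named in the GLOBAL statement by itself, while the common block is named as /percl/; mixing the two forms for the same data is an error, as section 2.4 explains.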
The common name may be used in a CLUSTER statement to change the attributes to process cluster. (Note: in the current implementation, a PROCESS COMMON given the CLUSTER locale may not be written to by the program.)

2.4 Rules About Memory Attribute Declaration

Only variable names are allowed in CLUSTER and GLOBAL statements, not size specifications. Therefore:

      INTEGER x
      GLOBAL x(100)                    INVALID

is not valid; rather, it should be specified as:

      INTEGER x(100)
      GLOBAL x                         VALID

or

      DIMENSION x(100)
      INTEGER x
      GLOBAL x                         VALID

Defining a COMMON name to have a memory locale has the effect of assigning that locale to all variables appearing on the COMMON statement. Any attempt to assign a memory locale attribute to a single variable more than once, or to assign a memory locale attribute to both a variable and the name of the COMMON where it resides, will result in a Cedar Fortran error message. This is a situation similar to:

      REAL x
      REAL x                           INVALID

which is likewise treated as an error by many compilers. So, the following sequence of Cedar Fortran statements will generate an error:

      GLOBAL a, b, c
      GLOBAL b                         INVALID

It is also an error to attempt to define two differing sets of memory attributes for a variable:

      GLOBAL /comm1/
      GLOBAL a, b, c
      COMMON /comm1/ a, b, c           INVALID

The variables a, b, and c are set task global by the combination of the "GLOBAL /comm1/" and "COMMON /comm1/ a, b, c" statements, and process global by the "GLOBAL a, b, c" statement.

Common block variables may not be given an explicit locale attribute. The following code segment is invalid because it violates this rule:

      GLOBAL a, b, c
      COMMON /comm1/ a, b, c           INVALID

This program segment generates an error because the COMMON statement declares the memory attributes to be task cluster, while the GLOBAL statement declares them to be process global. A proper construction for that program segment is:

      GLOBAL /comm1/
      COMMON /comm1/ a, b, c           VALID

This sequence properly creates the common, comm1, with the identifiers a, b, and c, all with the attributes task global.

All identifiers in an EQUIVALENCE list must have the same memory attributes. Consider the following example:

      GLOBAL a, b, c
      EQUIVALENCE (a, b, c)
      EQUIVALENCE (a, d)               INVALID

The first EQUIVALENCE statement is correct; all identifiers in the list have the attributes process global and no problem arises. The second EQUIVALENCE statement, however, is in error. Identifier a is a process global variable, while the memory attributes of d default to task cluster. They may not be equivalenced because their attributes force them to reside in different program sections.

2.5 Memory Attributes with DATA and SAVE Statements

The DATA and SAVE statements require special attention in Cedar Fortran because of the potential problems they present when used with concurrent execution. The form of these statements is identical to their Fortran77 form, and their semantics is the same when they are used by only a single processor, but the issue becomes more complicated when there are many parallel threads of execution. Variables that appear in DATA or SAVE statements may be declared with any Cedar Fortran memory attributes. The memory attributes will determine how many copies of the variables will exist at execution time.
If the memory attributes are process global, then one copy of the variable will exist, will reside in global memory, and will be shared

by all processors in all tasks. If the memory attributes are task cluster, then one copy of the variable will exist for each cluster task, it will reside in cluster memory, and it will be shared by all the processors in a single task.

The sharing by multiple processors of a variable used in a DATA or SAVE statement is the crux of the problem. There is no way to specify a processor-private variable in a DATA or SAVE statement, so all such variables will be subject to sharing. In fact, care must be taken when using any variables that are shared by more than one concurrent execution thread. So, the combination of potentially non-deterministic updating of a variable with the potential of having more than one copy of the variable makes it important that the programmer take care when using DATA and SAVE in Cedar Fortran.

The problem is best illustrated with an example. Suppose that we have a SAVE statement within subroutine sub1 and that sub1 is called from within a CDOALL loop. (CDOALL is a concurrent DO loop which will be described in section 4.)

      PROGRAM example
      CDOALL i = 1, 4
        CALL sub1(i)
      END CDOALL
      CALL sub1(100)
      END

      SUBROUTINE sub1 (i)
      INTEGER i
      INTEGER int1
      SAVE int1
      IF (i .eq. 1) THEN
        int1 = 10
      ELSE IF (i .eq. 2) THEN
        int1 = 100
      ELSE IF (i .eq. 3) THEN
        int1 = 1000
      ELSE IF (i .eq. 4) THEN
        int1 = 10000
      ELSE IF (i .eq. 100) THEN
        PRINT *, int1
      ENDIF
      END

If the CDOALL loop were really a DO loop (executed on one processor), the value printed would be "10000". In the CDOALL version, though, the value printed could be "10", "100", "1000", or "10000" (and the value would probably change from run to run), depending on which iteration of the CDOALL actually assigned int1 last.

2.6 Dynamic Variable Allocation

Cedar Fortran allows the programmer to allocate variables with any memory attributes dynamically on a stack appropriate for those attributes. All dynamically allocated variables will be deallocated automatically when the routine in which they were allocated returns.
Three statements are involved with memory allocation and deallocation: MARK, ALLOCATE, and RELEASE.

2.6.1 The ALLOCATE Statement

The ALLOCATE statement allows for the dynamic allocation of array variables. The syntax of the ALLOCATE statement is:

    ALLOCATE ( array-spec [, array-spec ] ... )

where

    array-spec  ->  array-name ( doublet [, doublet ] ... )
    doublet     ->  exp [ : exp ]

Any variable that is to be allocated must be declared first. The declaration must be of the form:

    { Fortran type | DIMENSION }  array-name ( : [, :] ... )

for example

      REAL*4 a ( :, :, : )

This declares an array a that will have three dimensions when allocated. The number of dimensions in the declaration must match the number of dimensions when the array is allocated. Dynamically allocated variables and arrays cannot be used in COMMON blocks, DATA statements, or SAVE statements.

2.6.2 The MARK Statement

The MARK statement records the current address of the top of the task cluster stack in the integer variable supplied as its argument. Any task cluster space allocated after a MARK statement can be deallocated using the RELEASE statement with the integer variable used in the MARK statement as an argument. The syntax of the MARK statement is:

    MARK ( v )

where

    v   must be of type INTEGER*4.

2.6.3 The RELEASE Statement

The RELEASE statement deallocates memory from the task cluster stack by setting the stack pointer to the address stored in the integer variable supplied as its argument. The integer argument must have been set with the MARK statement or an error will occur. This

statement is useful when the programmer wishes to deallocate dynamic variables before the end of a routine. The syntax of the RELEASE statement is:

    RELEASE ( v )

where

    v   must be of type INTEGER*4.

It should also be noted that all dynamic variables will be deallocated automatically at the end of the subroutine or function. The MARK and RELEASE statements need only be used when the programmer wishes to remove dynamic space before the routine terminates. It must be emphasized that MARK and RELEASE deal only with the task cluster stack, while ALLOCATE can allocate both task cluster and process global variables. In the current implementation, using an ALLOCATE statement for global arrays inside a parallel region does not work correctly. Consequently, an ALLOCATE statement for global arrays must only appear outside parallel loops.

An example of using ALLOCATE, MARK, and RELEASE on the task cluster stack:

      SUBROUTINE Example ()
      INTEGER stack1, stack2
      REAL*4 a ( :, :, : ), b ( :, : )

      MARK (stack1)
      ALLOCATE ( a(10,10,10) )
C     perform computation with A
      MARK (stack2)
      ALLOCATE ( b(10,10) )
C     perform computation with A and B
C     deallocate array B
      RELEASE (stack2)
      ALLOCATE ( b(20,20) )
C     perform computation with A and this larger B
C     deallocate both A and B
      RELEASE (stack1)
C     RELEASE was not really necessary here because all
C     variables are deallocated on return
      RETURN
      END

3 Using Vectors in Cedar Fortran

Vector processing, or array processing, allows a processor to operate on many elements of an array with a single instruction. This allows data manipulations to be completed much faster than on a non-vector processor. Complex arithmetic operations can be done using vector instructions. The subscript expression used in a vector operation has a beginning, an end, and a stride (similar to a DO loop). In Cedar Fortran, vector operations can replace DO loops in many cases and thereby increase the execution speed of the program.

When arrays are declared to have the global attribute and are involved in a vector fetch, the compiler will automatically insert vector prefetch instructions prior to the fetch. The vector prefetch instructions trigger the prefetch unit on the Global Interface Board (GIB) attached to the processor issuing the instruction. This causes a pipelined read from global memory for the array, with the data ending up in the prefetch buffer on the GIB. When the data is available in the prefetch buffer, the processor can access it at a speed similar to that of accessing the cluster shared data cache.

3.1 Array Sections

Cedar Fortran allows the programmer to specify operations on sections of arrays with a single statement. A section is specified with a triplet (begin : end [: stride ]). The triplet specifies the beginning of the section, the end of the section, and the optional stride through the section (the number of elements to increment for the next location). This notation makes it possible to assign a section of one array to a section of another array. The two sections must be conformable. That is, they must have the same number of dimensions and the same number of elements in each dimension. For example, the instructions:

      INTEGER a(4), b(4)
      DATA b/2, 4, 6, 8/
      a(1:2) = b(3:4)

would assign a(1) the value 6 and a(2) the value 8, leaving the other elements unassigned.
The assignment

      a(1:3:2) = b(2:4:2)

would assign a(1) the value 4 and a(3) the value 8 and leave the other elements unassigned. Sections can also be specified for multidimensional arrays, with one triplet for

each dimension of the array. A full description of these instructions is given in the Alliant FX/Fortran Programmer's Handbook.

Another form of vector notation uses a starting position and a length, along with a possible stride. It is written with the :$ operator, as follows: (begin :$ length [: stride ]).

A third form of vector notation also specifies the length of the vector instruction, but it asserts to the compiler that the length is 32 or less, allowing more efficient code to be generated. It is written with the :$$ operator, as follows: (begin :$$ length [: stride ]).

A fourth form of vector notation makes the same assertion about the length of the vector instruction being 32 or less, but uses a form that specifies the beginning and ending indices of the access: (begin : $ end [: stride ]).

3.2 Array Assignment

It is also possible to assign an entire array with a single statement. The array elements may be assigned the value of a single scalar, or they may be assigned the elements of another array. If an array is to be assigned to another array, then the arrays must be conformable. This is really just a special case of the array section assignment, with the section being the entire array. For example:

      INTEGER a(4), b(4)
      DATA b/2, 4, 6, 8/
      a = b

After the assignment, all the elements of a will have the same values as the corresponding elements of b.
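As a sketch of the length-based section forms described above (Cedar Fortran as specified in this section; the arrays are illustrative, and this dialect is not accepted by modern Fortran compilers):

```fortran
      REAL x(100), y(5)
C     Triplet form: the five elements x(1), x(3), x(5), x(7), x(9)
      y(1:5) = x(1:9:2)
C     Length form (:$): five elements starting at x(1) with
C     stride 2 -- the same elements as x(1:9:2)
      y(1:5) = x(1:$5:2)
C     Short-vector form (:$$): the same access, additionally
C     asserting that the length (5) is 32 or less, so the
C     compiler may generate more efficient code
      y(1:5) = x(1:$$5:2)
```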

3.3 Array Constructors

Cedar Fortran provides array constructors that allow the values of an array to be specified in a concise form. Within angle brackets ( < > ), elements of the array may be specified individually, separated by commas, or a range of values may be specified using the triplet notation described above. Furthermore, the two notations may be combined in the same constructor. For example:

      REAL a(10)
      a = <9, 1:25:3>

The array a will contain the values 9, 1, 4, 7, 10, 13, 16, 19, 22, 25. Array constructors may only appear on the right-hand side of assignments.

A special case of the array constructor provided by Cedar Fortran is the SEQ function. The SEQ function produces a simple sequence of numbers. For example:

      SEQ (1, 10, 2)

is equivalent to <1 : 10 : 2> and would produce the sequence 1, 3, 5, 7, 9.

3.4 The WHERE Statement

It is also possible to conditionally assign the values of an array using vector instructions in a WHERE statement. The WHERE statement first evaluates a logical array expression. For each element that evaluates to .true., the body of the WHERE statement is executed. Cedar Fortran provides two forms of the WHERE statement, a single-statement WHERE and a block WHERE. The body of the WHERE statement, either a single statement or a block, may contain only array assignments. The right-hand side of every array expression in the body must be conformable to the logical array expression. WHERE is a vector statement, so the logical array expression and all right-hand side expressions in the body are evaluated before the assignments are performed. The syntax of the WHERE statement is:

Single-statement WHERE:

      WHERE ( logical array expression ) array assignment statement

Block WHERE:

      WHERE ( logical array expression )
        array assignment statements
    [ OTHERWISE
        array assignment statements ]
      END WHERE

If the OTHERWISE block is present, the array assignment statements of the OTHERWISE block will be performed for every element of the logical array expression of the WHERE statement whose value is .false.. Again, all operations are done as vectors. The logical array expression is evaluated, then the right-hand sides of all expressions of all assignments in both blocks are evaluated and stored in temporary vectors. Finally, the assignments are made in the WHERE block for each .true. value in the logical array expression, and then the assignments are made in the OTHERWISE block, if present, for each .false. value in the logical array expression.

A simple example of the use of the WHERE statement would be to zero all array elements that are negative.

      INTEGER a (100)
      WHERE (a(1:100) .lt. 0)
        a(1:100) = 0
      END WHERE

or

      WHERE (a(1:100) .lt. 0) a(1:100) = 0

In this example only the elements of a that have values less than zero will be set to zero. All the other elements will remain unchanged. This looks very much like an iterative loop, but it is not; it is implemented with vector instructions. Consult the Alliant FX/Fortran Programmer's Handbook for more details on the block WHERE statement.

3.5 The FORALL Statement

The FORALL statement is a Cedar Fortran construct that allows elements of multi-dimensional arrays to be assigned conditionally. It is the most general of the vector constructs. This is a vector statement, so all evaluation is done before any assignment. The logical array expression and the right-hand side of the expression of the assignment are evaluated for every iteration specified by the index variables; then the assignment is done for every iteration specified by the index variables for which the value of the logical array expression is .true.. The syntax of the FORALL statement is:

    FORALL ( loop-spec [, loop-spec ] ... [, logical-array-exp ] )
        array section expression = array section expression

where

    loop-spec          ->  integer variable = triplet
    logical-array-exp  ->  logical array expression

The logical-array-exp must be conformable to the array section expressions. The FORALL statement allows the specification of array section assignments that would not be possible using triplets or the WHERE statement. For example, it is not possible to access the diagonal elements of a matrix with any of the previously described statements. This can be done using the FORALL statement:

      INTEGER a(10,10), diag(10), i
      FORALL (i=1:10) diag(i) = a(i,i)

The FORALL statement makes it possible to express many complicated vector operations with a single statement. In the following example, the diagonal elements of a matrix are tested, and any element that is negative is set to zero.

      INTEGER a(10,10), i
      FORALL (i=1:10, a(i,i) .lt. 0) a(i,i) = 0

The programmer should be advised that there is nothing "magical" about the construct. FORALL statements must be translated by the Cedar Fortran compiler into a set of vector statements, and possibly extra DO loops and subroutine calls. The transformed statements may introduce some overhead that is not anticipated by the programmer.
3.6 Vector Reduction Functions

Cedar Fortran provides several functions that reduce vector or array expressions to a single result. In some cases, the result is an array reduced in dimensionality from the argument, or the result may be a scalar (the entire array reduced to a single scalar value).

The routines are all, any, count, dotproduct, firstmaxoffset, firstminoffset, firsttrueoffset, lastmaxoffset, lastminoffset, lasttrueoffset, maxval, minval, product, and sum.

The formal parameters for all of these functions (except dotproduct) are exactly the same. The calling sequence is:

    ret val = name ( array exp [, [ dim= ] dim ] [, [ mask= ] mask ]
                     [, forall=( index = triplet [, index = triplet ] ... ) ] )

where

    ret val  is either a logical or integer array expression of one dimension less
             than the argument expression, or a single logical or numeric scalar value.
    name     is one of { all, any, count, maxval, minval, product, sum }.
    dim      is the dimension along which the function is applied.
    mask     is an array expression conformable to array exp.
    forall   is a mechanism for using one or more index variables to specify parts
             of the array.

The calling sequence for dotproduct is:

    ret val = dotproduct ( array exp1, array exp2 [, [ dim= ] dim ] [, [ mask= ] mask ]
                           [, forall=( index = triplet [, index = triplet ] ... ) ] )

If the array exp has N dimensions (N > 1) and dim is specified, the ret val is an array of dimension N - 1. Given an array X with shape (d1, d2, ..., d(i-1), di, d(i+1), ..., dN) and applying an array reduction function along dimension i (dim=i), the result array has shape (d1, d2, ..., d(i-1), d(i+1), ..., dN). The value of the result array element (j1, j2, ..., j(i-1), j(i+1), ..., jN) is calculated by applying the function to the array section specified by array exp (j1, j2, ..., j(i-1), :, j(i+1), ..., jN) (with the mask applied if present).
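The shape rule above can be illustrated with a small Python sketch for rank-2 arrays. The function name reduce_along_dim is hypothetical; it models the dim semantics only, with Fortran-style dimension numbering (dim=1 reduces down the columns, dim=2 across the rows).

```python
def reduce_along_dim(x, func, dim, mask=None):
    """Reduce a rank-2 array along one dimension. The result has rank 1;
    each element is func applied to one array section (a column for dim=1,
    a row for dim=2), restricted to masked-in elements if a mask is given."""
    rows, cols = len(x), len(x[0])
    if dim == 1:   # sections are columns -> result has shape (cols,)
        sections = [[x[r][c] for r in range(rows)] for c in range(cols)]
        msk = [[mask[r][c] for r in range(rows)] for c in range(cols)] if mask else None
    else:          # dim == 2: sections are rows -> result has shape (rows,)
        sections = [list(row) for row in x]
        msk = [list(row) for row in mask] if mask else None
    out = []
    for i, sec in enumerate(sections):
        if msk:
            sec = [v for v, m in zip(sec, msk[i]) if m]
        out.append(func(sec))
    return out

x = [[1, 2, 3],
     [4, 5, 6]]
print(reduce_along_dim(x, sum, dim=1))  # column sums: [5, 7, 9]
print(reduce_along_dim(x, sum, dim=2))  # row sums: [6, 15]
```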

all
    Returns .TRUE. if all the elements of the array exp are .TRUE. (nonzero) and
    returns .FALSE. otherwise. The array exp argument should be a logical array
    expression.

any
    Returns .TRUE. if any element of the array exp is .TRUE. (nonzero) and returns
    .FALSE. otherwise. The array exp argument should be a logical array expression.

count
    Returns the number of elements in the array exp with non-zero (.TRUE.) values.
    The array exp argument should be a logical array expression.

dotproduct
    Returns the dot product of two vectors. The two array exp arguments must be of
    rank 1 or dim must be specified. The dim applies to both array exp arguments
    and the arguments must be of the same size and shape.

firstmaxoffset
    Returns the offset of the first occurrence (in the normal order of storage) of
    the maximum value in an array. If the mask is specified, only those elements
    whose corresponding elements in the mask array are true participate in the
    operation. If all elements of the mask array are false, the result is -1. The
    arguments must be of the same size and of rank 1. The offset of the first
    element in the array is 0.

firstminoffset
    Returns the offset of the first occurrence (in the normal order of storage) of
    the minimum value in an array. If the mask is specified, only those elements
    whose corresponding elements in the mask array are true participate in the
    operation. If all elements of the mask array are false, the result is -1. The
    arguments must be of the same size and of rank 1. The offset of the first
    element in the array is 0.

firsttrueoffset
    Returns the offset of the first occurrence (in the normal order of storage) of
    the value true in an array. If the mask is specified, only those elements whose
    corresponding elements in the mask array are true participate in the operation.
    If all elements of the mask array are false, the result is -1. The arguments
    must be of the same size and of rank 1. The offset of the first element in the
    array is 0.

lastmaxoffset
    Returns the offset of the last occurrence (in the normal order of storage) of
    the maximum value in an array. If the mask is specified, only those elements
    whose corresponding elements in the mask array are true participate in the
    operation. If all elements of the mask array are false, the result is -1. The
    arguments must be of the same size and of rank 1. The offset of the first
    element in the array is 0.

lastminoffset
    Returns the offset of the last occurrence (in the normal order of storage) of
    the minimum value in an array. If the mask is specified, only those elements
    whose corresponding elements in the mask array are true participate in the
    operation. If all elements of the mask array are false, the result is -1. The
    arguments must be of the same size and of rank 1. The offset of the first
    element in the array is 0.

lasttrueoffset
    Returns the offset of the last occurrence (in the normal order of storage) of
    the value true in an array. If the mask is specified, only those elements whose
    corresponding elements in the mask array are true participate in the operation.
    If all elements of the mask array are false, the result is -1. The arguments
    must be of the same size and of rank 1. The offset of the first element in the
    array is 0.

maxval
    Returns the maximum value in the array exp. The array exp argument should be an
    integer or real array expression.

minval
    Returns the minimum value in the array exp. The array exp argument should be an
    integer or real array expression.

product
    Returns the product of the elements of the array exp. The array exp argument
    should be an integer or real array expression.

sum
    Returns the sum of the elements of the array exp. The array exp argument should
    be an integer or real array expression.
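The offset functions above share one pattern: a 0-based offset, an optional mask, and -1 when the mask excludes everything. A Python sketch of firstmaxoffset (the name and behavior are taken from the description above; this is a model, not the library routine):

```python
def firstmaxoffset(arr, mask=None):
    """Model of firstmaxoffset: 0-based offset of the first occurrence of
    the maximum; only masked-in elements participate; -1 when the mask
    excludes every element."""
    pairs = [(i, v) for i, v in enumerate(arr) if mask is None or mask[i]]
    if not pairs:
        return -1                                  # all-false mask
    best = max(v for _, v in pairs)
    return next(i for i, v in pairs if v == best)  # first occurrence wins

print(firstmaxoffset([2, 9, 9, 1]))                             # 1
print(firstmaxoffset([2, 9, 9, 1], [True, False, True, True]))  # 2
print(firstmaxoffset([2, 9], [False, False]))                   # -1
```

The last* variants differ only in scanning for the final occurrence, and the *true* variants in searching for the value true rather than an extremum.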

4 Parallel Loops in Cedar Fortran

Loop parallelism is the medium grain level of parallelism supported by Cedar Fortran. There are three types of parallel loops at this time in Cedar Fortran:

1. cluster parallel loops (all processors on a single cluster execute the loop)
2. spread cluster loops (one processor on each cluster executes the loop)
3. spread processor loops (all processors on all clusters execute the loop)

Cedar Fortran offers two ways of scheduling parallel loops at this time:

1. iterations started in the same order as in a sequential DO loop (DOACROSS), and
2. iterations scheduled in an unspecified way (DOALL).

Cluster parallel loops have a loop verb with a prefix of "C" (CDOALL and CDOACROSS). Spread cluster loops (which activate one processor in each cluster task) have a loop verb with a prefix of "S" (SDOALL and SDOACROSS). Spread processor loops (which activate all processors on all cluster tasks) have a loop verb with a prefix of "X" (XDOALL and XDOACROSS). Programs may activate all processors of the Cedar by either using an SDOALL/CDOALL loop nest, or an XDOALL loop.

All parallel loops in Cedar Fortran offer the capability of declaring data "local to the loop". Functionally this means that each iteration of the loop can be considered to have its own copy of this data. Actually, this means that there is one copy of the data per processor which joins the loop.

The DOACROSS loops have been provided to make it possible to serialize regions within a parallel loop. These loops provide a known, fixed scheduling order that the user can take advantage of to correctly synchronize parts of the loop.

4.1 The Cluster Hardware

Special hardware within a Cedar cluster supports the execution of cluster parallel loops. One processor in the complex begins execution of the loop while the rest are idle.
When the processor executes a special hardware instruction, all processors in the cluster begin execution of the loop, communicating and synchronizing via the concurrency bus and the concurrency control unit (CCU) in the cluster. The CCU supports both self-scheduling and static scheduling of the processors which cooperate to execute iterations of the loop.

Self-scheduling of cluster parallel loops is achieved through a Cedar Fortran CDOALL statement, like:

    CDOALL i=1,n

In the given loop, processors will execute an iteration, then grab the next available iteration, until there are no more iterations left.

Static scheduling for cluster parallel loops is achieved through a statement like:

    CDOALL i=0,numpro$()

The numpro$ function (see section 6.2) returns a value which is "one less than the number of available processors". This form is special in that it causes the generation of code on a cluster which gives exactly one iteration to each processor, and when that iteration is complete, all but one of the processors go idle. The remaining processor continues executing the statement following the loop.

There are hardware instructions that help you split an original iteration space into a chunk per processor. Cedar Fortran supports these instructions via the functions vlv$, vlv$, vlh$, voh$, and vih$. See section 6.2 for more information.

4.2 The Global Hardware

There is no global hardware which is equivalent to the CCU. The Cedar global memory provides some synchronization hardware and a communication path between clusters, but no iteration-scheduling support. Runtime library routines are necessary to coordinate the execution of a spread loop.

4.3 Runtime Support for Spread Loops

The execution of spread loops involves a technique called microtasking [BM86]. The original task of a program starts running on the original cluster. As part of its startup code, it starts several Xylem tasks (called helper tasks), which will assist in the execution of spread loops. The helper tasks sit waiting for work as the original task begins execution of the program. When a spread loop is encountered, the helper tasks are notified and they begin executing iterations of the loop. After the spread loop is finished, the helpers resume waiting for work.

4.4 Cedar Fortran Cluster Parallel Loops

The syntax of a cluster parallel loop follows a form similar to that of a DO loop, with a few additions for declaring local data and initialization. The form is as shown in Figure 3. Those statements that appear before the LOOP statement (as a group, called the preamble) will be executed once for each processor which joins the loop.
Those statements that appear after the LOOP statement will be executed on every iteration. If the LOOP statement is not present, all statements will be executed once per iteration.

The data declaration statements should appear before any executable statement in the range of the loop. The function of these declaration statements is to declare loop-local variables(3) which may be referenced only inside the loop where they are declared. Each variable that is to be local to a cluster parallel loop must be declared explicitly within the loop. Any previously-undeclared variables that appear within the loop without a type declaration statement will be assumed to be implicitly declared outside the loop. Block structured scoping rules similar to those of Pascal are applied to the variables declared within a cluster parallel loop. A full description of the Cedar Fortran scoping convention appears in section 4.6.

(3) Character variables may not be declared inside any Cedar Fortran parallel loops.
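The "one copy per processor" behavior of loop-local variables can be modeled with per-thread storage. This is a Python sketch of the semantics, not the Cedar implementation; each worker thread stands in for a processor joining the loop.

```python
import threading

local = threading.local()   # one copy of the data per thread ("processor")

def iteration(results, i):
    # each worker's local.scratch is a distinct copy, invisible to the others,
    # just as a loop-local variable is private to the processor that joins the loop
    local.scratch = i * i
    results[i] = local.scratch

results = {}
threads = [threading.Thread(target=iteration, args=(results, i)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results.items()))  # [(0, 0), (1, 1), (2, 4), (3, 9)]
```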

Labeled Form:

    { CDOALL    }  label [,] i = low, high [, by]
    { CDOACROSS }
        [ data declarations ]
        [ statements ]          "Preamble"
    [ LOOP ]
        [ statements ]          "Loop Body"
    label  statement

Unlabeled Form:

    { CDOALL    }  i = low, high [, by]
    { CDOACROSS }
        [ data declarations ]
        [ statements ]          "Preamble"
    [ LOOP ]
        [ statements ]          "Loop Body"
    END { CDOALL    }
        { CDOACROSS }

where

    i                  is the name of an INTEGER variable, called the loop index.
    low, high, and by  are each integer expressions.

Figure 3: Cluster Parallel Loop Syntax

Loop-local variables will have memory attributes of task cluster. No memory attribute declarations are allowed within a parallel loop. The dimension bounds of arrays declared inside a cluster parallel loop may be arbitrarily complex integer expressions, and may use variables defined outside the body of the loop. Local arrays are limited to 6 dimensions instead of the normal 7 dimensions of Fortran.

4.4.1 The Operation of a Cluster Parallel Loop

Referring to the low, high, and by expressions from the syntax specification of a cluster loop (Figure 3), the number of iterations is

    N = floor( (high - low + by) / by )

When N = 0 the loop statement causes a transfer of control to the first executable statement following the loop. Otherwise the loop statement starts M = min(N, numproc()) processors(4). A copy of the loop index will be allocated to each processor. Initially, the first M terms of the iteration sequence will be assigned one-to-one to the M copies of the loop index. The program counter of each processor is set to the address of the first executable statement in the loop. That points to either the preamble or the body of the loop.

When a processor executes the last statement in the range of the loop, the processor attempts to get the next iteration of the loop. If the processor just completed the execution of the last iteration in the iteration sequence, then the processor waits until all other processors are idle, and then proceeds by executing the first statement after the loop. This may not be the same processor which started the loop! If the iteration just completed by the processor is not the last one, and all iterations have already been assigned, the processor goes into the idle state. Otherwise, the next unassigned iteration value is assigned to the processor's copy of the loop index, and the processor branches back to the first statement in the loop.

Jumps from inside the range of a cluster parallel loop to the outside or vice versa are not permitted in Cedar Fortran.
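The operation just described, computing N, starting M = min(N, numproc()) processors, and handing each idle processor the next unassigned iteration, can be sketched with threads in Python. The names cluster_loop and numproc here are stand-ins for the hardware and library mechanisms, and clamping N at zero for empty loops is an assumption consistent with the N = 0 case above.

```python
import threading

def cluster_loop(low, high, by, numproc, body):
    """Sketch of cluster parallel loop operation: each worker ("processor")
    takes the next unassigned iteration until the sequence is exhausted."""
    n = max(0, (high - low + by) // by)   # N = floor((high - low + by) / by)
    if n == 0:
        return                            # control passes to the statement after the loop
    m = min(n, numproc)                   # M processors start
    state = {"next": 0}
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                    # get the next unassigned iteration
                k = state["next"]
                state["next"] += 1
            if k >= n:
                return                    # all iterations assigned: go idle
            body(low + k * by)            # this processor's copy of the loop index

    workers = [threading.Thread(target=worker) for _ in range(m)]
    for w in workers: w.start()
    for w in workers: w.join()            # wait until all processors are idle

# CDOALL i = 1, 10, 3  ->  iteration sequence 1, 4, 7, 10
seen = []
guard = threading.Lock()
cluster_loop(1, 10, 3, 4, lambda i: (guard.acquire(), seen.append(i), guard.release()))
print(sorted(seen))  # [1, 4, 7, 10]
```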
To terminate a cluster parallel loop before completing all the iterations, the programmer should use the QUIT or QQUIT statements (described in Section 4.7). It must be noted here that QUIT and QQUIT are only supported in one version of the runtime library - the one you get when specifying -M on the cf command. See Section 6.1 about the features of the various runtime libraries.

If data or control dependences exist between the iterations of the loop, the programmer must use the CDOACROSS verb and insert proper synchronization instructions to ensure the correct execution of the loop. If no data dependences exist between the iterations of the loop, then the programmer should use the CDOALL verb instead of CDOACROSS.

(4) numproc() returns the number of available processors in the cluster (see section 5.1.1).

4.5 Cedar Fortran Spread Parallel Loops

Spread parallel loops are very similar in syntax to cluster parallel loops. The loop verb prefixes for spread parallel loops are "S" and "X" instead of "C". An SDOALL or SDOACROSS

loop is joined by one processor from each cluster. An XDOALL or XDOACROSS loop is joined by all processors from all clusters.

Spread parallel loops have one additional section - a postamble. The postamble is a section of code which lies after an ENDLOOP statement and continues until the end of the loop construct. The postamble is a counterpart to the preamble in that each processor which executes the preamble once prior to the loop body also executes the postamble once, after all its iterations are finished. Unlike cluster parallel loops, the same processor which starts a spread parallel loop also finishes it.

Since an SDOALL activates only one processor per cluster, a CDOALL loop may be nested inside it, to activate all processors on each cluster. An XDOALL loop may be used without an inner parallel loop to the same effect. The difference between the two forms is one of scheduling. The SDOALL/CDOALL nest does its cross-cluster scheduling on the cluster level and therefore provides a more rigidly-scheduled parallel loop, but requires less cross-cluster synchronization. In an XDOALL loop, cross-cluster scheduling goes on at the processor level, making it more adaptable to runtime events, but requiring more cross-cluster synchronization to achieve that.

As with cluster parallel loops, the variables declared locally are only accessible from within the loop, and they supersede any variables of the same name declared outside the spread parallel loop.

As was explained in section 2.1, the original task and all the helper tasks see an identical virtual address space. Shared data maps to the same physical memory in all tasks. Private data is mapped to different physical memory in all tasks. Consequently, any data which must be passed from outside a spread parallel loop into the loop must be shared data. If it is not, each task will look in a different physical memory location for that data.
For this reason, the compiler will produce an error message as shown in Figure 5 when this occurs.

4.6 Identifier Scope Rules

As was mentioned previously, identifiers in Cedar Fortran are subject to certain scope rules, depending on the location of the explicit declaration of those identifiers. Where concurrent loops are concerned, the scope rules resemble the block structured scoping of Pascal or C. Specifically, variables may be declared inside any concurrent construct. The newly declared variables will supersede any variable of the same name declared outside the construct. These declarations remain in effect until the construct terminates or until new declarations of a more nested construct supersede the older declarations. When a construct terminates, the declarations as they existed before the construct began are restored.

The program in Figure 6 serves as an example for our discussion of the rules. There are four scopes in the program in Figure 6. The first is the entire program unit itself (from the PROGRAM statement to the END statement). This is the outermost scope in the program. The identifiers x, y, and z are defined in the outermost scope in the sample program. The fact that identifier z is defined implicitly rather than explicitly defines it in the outermost scope. The second scope is that of the SDOALL i loop (lines 4 through 13). This loop has


More information

Symbol Tables Symbol Table: In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is

More information

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements Programming Languages Third Edition Chapter 9 Control I Expressions and Statements Objectives Understand expressions Understand conditional statements and guards Understand loops and variation on WHILE

More information

On the Automatic Parallelization of the Perfect Benchmarks

On the Automatic Parallelization of the Perfect Benchmarks IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 1, JANUARY 1998 5 On the Automatic Parallelization of the Perfect Benchmarks Rudolf Eigenmann, Member, IEEE, Jay Hoeflinger, and David

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

Run-time Environments. Lecture 13. Prof. Alex Aiken Original Slides (Modified by Prof. Vijay Ganesh) Lecture 13

Run-time Environments. Lecture 13. Prof. Alex Aiken Original Slides (Modified by Prof. Vijay Ganesh) Lecture 13 Run-time Environments Lecture 13 by Prof. Vijay Ganesh) Lecture 13 1 What have we covered so far? We have covered the front-end phases Lexical analysis (Lexer, regular expressions,...) Parsing (CFG, Top-down,

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 Operating Systems Overview Chapter 2 Operating System A program that controls the execution of application programs An interface between the user and hardware Masks the details of the hardware Layers and

More information

CS1622. Semantic Analysis. The Compiler So Far. Lecture 15 Semantic Analysis. How to build symbol tables How to use them to find

CS1622. Semantic Analysis. The Compiler So Far. Lecture 15 Semantic Analysis. How to build symbol tables How to use them to find CS1622 Lecture 15 Semantic Analysis CS 1622 Lecture 15 1 Semantic Analysis How to build symbol tables How to use them to find multiply-declared and undeclared variables. How to perform type checking CS

More information

Allows program to be incrementally parallelized

Allows program to be incrementally parallelized Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP

More information

PGI Fortran & C Accelerator Programming Model. The Portland Group

PGI Fortran & C Accelerator Programming Model. The Portland Group PGI Fortran & C Accelerator Programming Model The Portland Group Published: v0.72 December 2008 Contents 1. Introduction...3 1.1 Scope...3 1.2 Glossary...3 1.3 Execution Model...4 1.4 Memory Model...5

More information

Compositional C++ Page 1 of 17

Compositional C++ Page 1 of 17 Compositional C++ Page 1 of 17 Compositional C++ is a small set of extensions to C++ for parallel programming. OVERVIEW OF C++ With a few exceptions, C++ is a pure extension of ANSI C. Its features: Strong

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

CS201 Latest Solved MCQs

CS201 Latest Solved MCQs Quiz Start Time: 09:34 PM Time Left 82 sec(s) Question # 1 of 10 ( Start time: 09:34:54 PM ) Total Marks: 1 While developing a program; should we think about the user interface? //handouts main reusability

More information

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism

More information

G Programming Languages - Fall 2012

G Programming Languages - Fall 2012 G22.2110-003 Programming Languages - Fall 2012 Lecture 4 Thomas Wies New York University Review Last week Control Structures Selection Loops Adding Invariants Outline Subprograms Calling Sequences Parameter

More information

ENCM 501 Winter 2019 Assignment 9

ENCM 501 Winter 2019 Assignment 9 page 1 of 6 ENCM 501 Winter 2019 Assignment 9 Steve Norman Department of Electrical & Computer Engineering University of Calgary April 2019 Assignment instructions and other documents for ENCM 501 can

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

CSCI 171 Chapter Outlines

CSCI 171 Chapter Outlines Contents CSCI 171 Chapter 1 Overview... 2 CSCI 171 Chapter 2 Programming Components... 3 CSCI 171 Chapter 3 (Sections 1 4) Selection Structures... 5 CSCI 171 Chapter 3 (Sections 5 & 6) Iteration Structures

More information

ECS 142 Project: Code generation hints

ECS 142 Project: Code generation hints ECS 142 Project: Code generation hints Winter 2011 1 Overview This document provides hints for the code generation phase of the project. I have written this in a rather informal way. However, you should

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2010 Handout Decaf Language Tuesday, Feb 2 The project for the course is to write a compiler

More information

NAME CHSM-Java Concurrent, Hierarchical, Finite State Machine specification language for Java

NAME CHSM-Java Concurrent, Hierarchical, Finite State Machine specification language for Java NAME CHSM-Java Concurrent, Hierarchical, Finite State Machine specification language for Java SYNOPSIS declarations description user-code DESCRIPTION The CHSM specification language is a text-based means

More information

Java How to Program, 10/e. Copyright by Pearson Education, Inc. All Rights Reserved.

Java How to Program, 10/e. Copyright by Pearson Education, Inc. All Rights Reserved. Java How to Program, 10/e Copyright 1992-2015 by Pearson Education, Inc. All Rights Reserved. Data structures Collections of related data items. Discussed in depth in Chapters 16 21. Array objects Data

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

This section provides some reminders and some terminology with which you might not be familiar.

This section provides some reminders and some terminology with which you might not be familiar. Chapter 3: Functions 3.1 Introduction The previous chapter assumed that all of your Bali code would be written inside a sole main function. But, as you have learned from previous programming courses, modularizing

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory

Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory Fortran Coarrays John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory This talk will explain the objectives of coarrays, give a quick summary of their history, describe the

More information

Performance Analysis of Parallelizing Compilers on the Perfect. Benchmarks TM Programs. Center for Supercomputing Research and Development.

Performance Analysis of Parallelizing Compilers on the Perfect. Benchmarks TM Programs. Center for Supercomputing Research and Development. Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks TM Programs William Blume and Rudolf Eigenmann Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign

More information

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is

More information

FRAC: Language Reference Manual

FRAC: Language Reference Manual FRAC: Language Reference Manual Justin Chiang jc4127 Kunal Kamath kak2211 Calvin Li ctl2124 Anne Zhang az2350 1. Introduction FRAC is a domain-specific programming language that enables the programmer

More information

International Standards Organisation. Parameterized Derived Types. Fortran

International Standards Organisation. Parameterized Derived Types. Fortran International Standards Organisation Parameterized Derived Types in Fortran Technical Report defining extension to ISO/IEC 1539-1 : 1996 {Produced 4-Jul-96} THIS PAGE TO BE REPLACED BY ISO CS ISO/IEC 1

More information

Size, Alignment, and Value Ranges of Data Types. Type Synonym Size Alignment Value Range. BYTE INTEGER*1 8 bits Byte

Size, Alignment, and Value Ranges of Data Types. Type Synonym Size Alignment Value Range. BYTE INTEGER*1 8 bits Byte Chapter 2 2. Storage Mapping This chapter contains two sections: Alignment, Size, and Value Ranges describes how the Fortran compiler implements size and value ranges for various data types as well as

More information

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE?

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE? 1. Describe History of C++? The C++ programming language has a history going back to 1979, when Bjarne Stroustrup was doing work for his Ph.D. thesis. One of the languages Stroustrup had the opportunity

More information

Language-Based Parallel Program Interaction: The Breezy Approach. Darryl I. Brown Allen D. Malony. Bernd Mohr. University of Oregon

Language-Based Parallel Program Interaction: The Breezy Approach. Darryl I. Brown Allen D. Malony. Bernd Mohr. University of Oregon Language-Based Parallel Program Interaction: The Breezy Approach Darryl I. Brown Allen D. Malony Bernd Mohr Department of Computer And Information Science University of Oregon Eugene, Oregon 97403 fdarrylb,

More information

X Language Definition

X Language Definition X Language Definition David May: November 1, 2016 The X Language X is a simple sequential programming language. It is easy to compile and an X compiler written in X is available to simplify porting between

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 6 Decaf Language Wednesday, September 7 The project for the course is to write a

More information

Quiz Start Time: 09:34 PM Time Left 82 sec(s)

Quiz Start Time: 09:34 PM Time Left 82 sec(s) Quiz Start Time: 09:34 PM Time Left 82 sec(s) Question # 1 of 10 ( Start time: 09:34:54 PM ) Total Marks: 1 While developing a program; should we think about the user interface? //handouts main reusability

More information

Chapter 6 Control Flow. June 9, 2015

Chapter 6 Control Flow. June 9, 2015 Chapter 6 Control Flow June 9, 2015 Expression evaluation It s common in programming languages to use the idea of an expression, which might be a simple object function invocation over some number of arguments

More information

Introduction [1] 1. Directives [2] 7

Introduction [1] 1. Directives [2] 7 OpenMP Fortran Application Program Interface Version 2.0, November 2000 Contents Introduction [1] 1 Scope............................. 1 Glossary............................ 1 Execution Model.........................

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

Computer Architecture and Organization. Instruction Sets: Addressing Modes and Formats

Computer Architecture and Organization. Instruction Sets: Addressing Modes and Formats Computer Architecture and Organization Instruction Sets: Addressing Modes and Formats Addressing Modes Immediate Direct Indirect Register Register Indirect Displacement (Indexed) Stack Immediate Addressing

More information

Ch. 11: References & the Copy-Constructor. - continued -

Ch. 11: References & the Copy-Constructor. - continued - Ch. 11: References & the Copy-Constructor - continued - const references When a reference is made const, it means that the object it refers cannot be changed through that reference - it may be changed

More information

A Short Summary of Javali

A Short Summary of Javali A Short Summary of Javali October 15, 2015 1 Introduction Javali is a simple language based on ideas found in languages like C++ or Java. Its purpose is to serve as the source language for a simple compiler

More information

The SPL Programming Language Reference Manual

The SPL Programming Language Reference Manual The SPL Programming Language Reference Manual Leonidas Fegaras University of Texas at Arlington Arlington, TX 76019 fegaras@cse.uta.edu February 27, 2018 1 Introduction The SPL language is a Small Programming

More information

Chapter 3: Processes. Operating System Concepts 8 th Edition,

Chapter 3: Processes. Operating System Concepts 8 th Edition, Chapter 3: Processes, Silberschatz, Galvin and Gagne 2009 Chapter 3: Processes Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Silberschatz, Galvin and Gagne 2009

More information

lisp APL Algol Fortran like like like like A+ Delphi C J Algol60 Pascal Scheme S-algol Ada Java ML Hi Haskell Fixed rank mutable array size

lisp APL Algol Fortran like like like like A+ Delphi C J Algol60 Pascal Scheme S-algol Ada Java ML Hi Haskell Fixed rank mutable array size The nature of vector types Languages can be categorized by the degree of dynamism associated with their vector types and the implications this has for how vector operations are done. Continuum Most dynamic

More information

CHAPTER 4 FUNCTIONS. 4.1 Introduction

CHAPTER 4 FUNCTIONS. 4.1 Introduction CHAPTER 4 FUNCTIONS 4.1 Introduction Functions are the building blocks of C++ programs. Functions are also the executable segments in a program. The starting point for the execution of a program is main

More information

Brook Spec v0.2. Ian Buck. May 20, What is Brook? 0.2 Streams

Brook Spec v0.2. Ian Buck. May 20, What is Brook? 0.2 Streams Brook Spec v0.2 Ian Buck May 20, 2003 0.1 What is Brook? Brook is an extension of standard ANSI C which is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar,

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group

SAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group SAMOS: an Active Object{Oriented Database System Stella Gatziu, Klaus R. Dittrich Database Technology Research Group Institut fur Informatik, Universitat Zurich fgatziu, dittrichg@ifi.unizh.ch to appear

More information

Chapter 1 INTRODUCTION. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1 INTRODUCTION. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 INTRODUCTION SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Facilities and features of PL/1. Structure of programs written in PL/1. Data types. Storage classes, control,

More information

Control Flow. COMS W1007 Introduction to Computer Science. Christopher Conway 3 June 2003

Control Flow. COMS W1007 Introduction to Computer Science. Christopher Conway 3 June 2003 Control Flow COMS W1007 Introduction to Computer Science Christopher Conway 3 June 2003 Overflow from Last Time: Why Types? Assembly code is typeless. You can take any 32 bits in memory, say this is an

More information

PROCESS STATES AND TRANSITIONS:

PROCESS STATES AND TRANSITIONS: The kernel contains a process table with an entry that describes the state of every active process in the system. The u area contains additional information that controls the operation of a process. The

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

The PCAT Programming Language Reference Manual

The PCAT Programming Language Reference Manual The PCAT Programming Language Reference Manual Andrew Tolmach and Jingke Li Dept. of Computer Science Portland State University September 27, 1995 (revised October 15, 2002) 1 Introduction The PCAT language

More information