Parallel Programming in Fortran with Coarrays


John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory

Fortran 2008 is now in FDIS ballot: only typos are permitted at this stage. The biggest change from Fortran 2003 is the addition of coarrays. This talk will introduce coarrays and explain why we believe that they will lead to easier development of parallel programs, faster execution times, and better maintainability.

Advanced Programming Specialist Group, BCS, London, 6 May 2010.

Design objectives

Coarrays are the brain-child of Bob Numrich (Minnesota Supercomputing Institute, formerly Cray). The original design objectives were:
   - A simple extension to Fortran
   - Small demands on the implementors
   - Retain optimization between synchronizations
   - Make remote references apparent
   - Provide scope for optimization of communication
A subset has been implemented by Cray for some ten years. Coarrays have been added to the g95 compiler, are being added to gfortran, and for Intel are at the top of our development list.

Summary of coarray model

SPMD: Single Program, Multiple Data. The program is replicated to a number of images (probably as executables), and the number of images is fixed during execution. Each image has its own set of variables. Coarrays are like ordinary variables but have a second set of subscripts in [ ] for access between images. Images mostly execute asynchronously.
Synchronization: sync all, sync images, lock, unlock, allocate, deallocate, critical construct.
Intrinsics: this_image, num_images, image_index.
Full summary: WG5 N1824 (Google "WG5").
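As a minimal illustration of the SPMD model (not from the slides), every image runs the same program and distinguishes itself with this_image(); with four images the line below is printed four times, once per image:

```fortran
program hello_images
   implicit none
   ! Every image executes this identical code;
   ! only the value returned by this_image() differs.
   write (*,*) 'Hello from image', this_image(), 'of', num_images()
end program hello_images
```

A coarray-capable compiler is assumed, e.g. gfortran with -fcoarray=single for a one-image build, or Cray ftn.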

Examples of coarray syntax

   real :: r[*], s[0:*]           ! Scalar coarrays
   real, save :: x(n)[*]          ! Array coarray
   type(u), save :: u2(m,n)[np,*] ! Coarrays always have assumed
                                  ! cosize (equal to number of images)
   real :: t                      ! Local
   integer :: p, q, index(n)      ! variables
   :
   t = s[p]
   x(:) = x(:)[p]                 ! Reference without [] is to local part
   x(:)[p] = x(:)
   u2(i,j)%b(:) = u2(i,j)[p,q]%b(:)

Implementation model

Usually, each image resides on one core. However, several images may share a core (e.g. for debugging) and one image may execute on a cluster (e.g. with OpenMP).
A coarray has the same set of bounds on all images, so the compiler may arrange that it occupies the same set of addresses within each image (known as symmetric memory). On a shared-memory machine, a coarray might be implemented as a single large array. On any machine, a coarray may be implemented so that each image can calculate the memory address of an element on another image.

Synchronization

With a few exceptions, the images execute asynchronously. If syncs are needed, the user supplies them explicitly.
Barrier on all images:
   sync all
Wait for others:
   sync images(image-set)
Limit execution to one image at a time:
   critical
      :
   end critical
Limit execution in a more flexible way:
   lock(lock_var[6])
   p[6] = p[6] + 1
   unlock(lock_var[6])
These are known as image control statements.

The sync images statement

Ex 1: make the other images wait for image 1:
   if (this_image() == 1) then
      ! Set up coarray data for other images
      sync images(*)
   else
      sync images(1)
      ! Use the data set up by image 1
   end if
Ex 2: impose the fixed order 1, 2, ... on the images:
   real :: a, asum[*]
   integer :: me, ne
   me = this_image()
   ne = num_images()
   if (me == 1) then
      asum = a
   else
      sync images( me-1 )
      asum = asum[me-1] + a
   end if
   if (me < ne) sync images( me+1 )

Execution segments

On an image, the sequence of statements executed before the first image control statement, or between two of them, is known as a segment. For example, this code reads a value on image 1 and broadcasts it:
   real :: p[*]
   :                           ! Segment 1
   sync all
   if (this_image()==1) then   ! Segment 2
      read (*,*) p             !    :
      do i = 2, num_images()   !    :
         p[i] = p              !    :
      end do                   !    :
   end if                      ! Segment 2
   sync all
   :                           ! Segment 3

Execution segments (cont)

Here we show three segments. On any image, these are executed in order: Segment 1, Segment 2, Segment 3. The sync all statements ensure that Segment 1 on any image precedes Segment 2 on any other image, and similarly for Segments 2 and 3. However, the Segment 1s on two different images are unordered with respect to each other. Overall, we have a partial ordering.
Important rule: if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.

Dynamic coarrays

The only dynamic form is the allocatable coarray:
   real, allocatable :: a(:)[:], s[:,:]
   :
   allocate ( a(n)[*], s[-1:p,0:*] )
All images synchronize at an allocate or deallocate statement so that they can all perform their allocations and deallocations in the same order. The bounds, cobounds, and length parameters must not vary between images.
An allocatable coarray may be a component of a structure provided the structure and all its ancestors are scalars that are neither pointers nor coarrays. A coarray is not allowed to be a pointer.

Non-coarray dummy arguments

A coarray may be associated as an actual argument with a non-coarray dummy argument (nothing special about this). A coindexed object (with square brackets) may also be associated as an actual argument with a non-coarray dummy argument; copy-in copy-out is to be expected.
These properties are very important for reusing existing code.
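Both cases can be sketched as follows (a hedged illustration, not from the slides; the subroutine and its name are hypothetical):

```fortran
program pass_coindexed
   implicit none
   real :: x(100)[*]              ! a coarray used as an actual argument
   x = real(this_image())
   call scale(x, 2.0)             ! local part passed: no communication
   sync all                       ! order the definitions before remote use
   if (this_image() == 2) &
      call scale(x(:)[1], 2.0)    ! coindexed actual: copy-in copy-out expected
contains
   subroutine scale(v, f)         ! ordinary non-coarray dummy arguments
      real, intent(inout) :: v(:)
      real, intent(in)    :: f
      v = f * v
   end subroutine scale
end program pass_coindexed
```

The subroutine is plain Fortran with no knowledge of coarrays, which is why existing code can be reused unchanged.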

Coarray dummy arguments

A dummy argument may be a coarray. It may be of explicit shape, assumed size, assumed shape, or allocatable:
   subroutine subr(n,w,x,y,z)
      integer :: n
      real :: w(n)[n,*]   ! Explicit shape
      real :: x(n,*)[*]   ! Assumed size
      real :: y(:,:)[*]   ! Assumed shape
      real, allocatable :: z(:)[:,:]
Where the bounds or cobounds are declared, there is no requirement for consistency between images; the local values are used to interpret a remote reference. Different images may be working independently. There are rules to ensure that copy-in copy-out of a coarray is never needed.

Coarrays and SAVE

Unless it is allocatable or a dummy argument, a coarray must be given the SAVE attribute. This is to avoid the need for synchronization when coarrays go out of scope on return from a procedure.
Similarly, automatic-array coarrays
   subroutine subr (n)
      integer :: n
      real :: w(n)[*]
and array-valued functions
   function fun (n)
      integer :: n
      real :: fun(n)[*]
are not permitted, since they would require synchronization.

Structure components

A coarray may be of a derived type with allocatable or pointer components. Pointers must have targets in their own image:
   q => z[i]%p       ! Not allowed
   allocate(z[i]%p)  ! Not allowed
This provides a simple but powerful mechanism for cases where the size varies from image to image, avoiding loss of optimization.
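The varying-size pattern the slide describes might be sketched like this (a hedged example, not from the slides; the type and names are hypothetical):

```fortran
program ragged
   implicit none
   type :: vec
      real, allocatable :: p(:)   ! component is allocated locally, per image
   end type vec
   type(vec) :: z[*]
   integer :: me
   me = this_image()
   allocate (z%p(me))             ! size differs between images; purely local,
                                  ! so no synchronization is implied here
   z%p = real(me)
   sync all                       ! all components allocated before remote use
   ! Referencing z[i]%p from another image is allowed;
   ! pointing at it or allocating it remotely is not.
   if (me == 1 .and. num_images() > 1) write (*,*) size(z[2]%p)
end program ragged
```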

Program termination

The aim is for an image that terminates normally (stop or end program) to remain active so that its data is available to other executing images, while an error condition leads to quick termination of all images.
Normal termination occurs in three steps: initiation, synchronization, and completion. Data on an image is available to the others until they all reach the synchronization.
Error termination occurs if any image hits an error condition or executes an error stop statement. All other images that have not initiated error termination do so as soon as possible.

Input/output

Default input (*) is available on image 1 only. Default output (*) and error output are available on every image. The files are separate, but their records will be merged into a single stream, or into one stream for the output files and one for the error files. To order the writes from different images, synchronization and the flush statement are needed.
The open statement connects a file to a unit on the executing image only. Whether a file on one image is the same as a file with the same name on another image is processor dependent.
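One way to order the writes, as the slide suggests, is to serialize them with sync all and flush (a sketch; how the merged stream interleaves records remains processor dependent):

```fortran
program ordered_output
   use, intrinsic :: iso_fortran_env, only : output_unit
   implicit none
   integer :: i
   do i = 1, num_images()
      if (this_image() == i) then
         write (*,*) 'image', i, 'reporting'
         flush (output_unit)       ! push the records out before handing over
      end if
      sync all                     ! the next image writes in a later segment
   end do
end program ordered_output
```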

Optimization

Most of the time, the compiler can optimize as if the image were on its own, using its temporary storage such as cache, registers, etc. There is no coherency requirement while unordered segments are executing. The programmer is required to follow the rule: if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.
The compiler also has scope to optimize communication.

Planned extensions

The following features were part of the proposal but have been moved into a planned Technical Report on Enhanced Parallel Computing Facilities:
   1. The collective intrinsic subroutines.
   2. Teams and features that require teams.
   3. The notify and query statements.
   4. Files connected on more than one image, other than default output and default error.

A comparison with MPI

A colleague (Ashby, 2008) recently converted most of a large code, SBLI, a finite-difference formulation of Direct Numerical Simulation (DNS) of turbulence, from MPI to coarrays using a small Cray X1E (64 processors). Since MPI and coarrays can be mixed, he was able to do this gradually, and he left the solution writing and the restart facilities in MPI.
Most of the time was taken in halo exchanges, and the code parallelizes well with this number of processors. The speeds were very similar. The code clarity (and maintainability) was much improved: the code for halo exchanges, excluding comments, was reduced from 176 lines to 105, and the code to broadcast global parameters from 230 lines to 117.
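In coarray form, a halo exchange reduces to one-sided pulls (a hedged one-dimensional sketch, not the SBLI code; n and u are illustrative):

```fortran
program halo_sketch
   implicit none
   integer, parameter :: n = 8
   real :: u(0:n+1)[*]            ! interior cells 1..n plus one halo each side
   integer :: me, ne
   me = this_image()
   ne = num_images()
   u = real(me)                   ! stand-in for a compute step
   sync all                       ! neighbours have defined u before we read it
   if (me > 1)  u(0)   = u(n)[me-1]  ! pull halo from the left neighbour
   if (me < ne) u(n+1) = u(1)[me+1]  ! pull halo from the right neighbour
   sync all                       ! halos in place before the next compute step
end program halo_sketch
```

Each pull replaces a matched MPI send/receive pair, which is where much of the reduction in line count comes from.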

Advantages of coarrays

- Easy to write code: the compiler looks after the communication.
- References to local data are obvious as such.
- Easy to maintain code: more concise than MPI and easy to see what is happening.
- Integrated with Fortran: type checking, type conversion on assignment, ...
- The compiler can optimize communication.
- Local optimizations are still available.
- Does not make severe demands on the compiler, e.g. for coherency.

References

Ashby, J.V. and Reid, J.K. (2008). Migrating a scientific application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015, see http://www.numerical.rl.ac.uk/reports/reports.shtml
Reid, John (2010). Coarrays in the next Fortran Standard. ISO/IEC/JTC1/SC22/WG5 N1824, see ftp://ftp.nag.co.uk/sc22wg5/n1801-n1850
Reid, John (2010). The new features of Fortran 2008. ISO/IEC/JTC1/SC22/WG5 N1828, see ftp://ftp.nag.co.uk/sc22wg5/n1801-n1850
WG5 (2010). FDIS revision of the Fortran Standard. ISO/IEC/JTC1/SC22/WG5 N1826, see ftp://ftp.nag.co.uk/sc22wg5/n1801-n1850

Atomic subroutines

   call atomic_define(atom[p], value)
   call atomic_ref(value, atom[p])
The effect of executing an atomic subroutine is as if the action occurs instantaneously, and thus does not overlap with other atomic actions that might occur asynchronously. It acts on a scalar variable of type integer(atomic_int_kind) or logical(atomic_logical_kind); the kinds are defined in an intrinsic module. The variable must be a coarray or a coindexed object.
Atomics are allowed to break the rule that if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.

Sync memory

The sync memory statement is an image control statement that defines a boundary on an image between two segments, each of which can be ordered in some user-defined way with respect to a segment on another image. The compiler will almost certainly treat this as a barrier for code motion, flush all data that might be accessed by another image to memory, and subsequently reload such data from memory.

Spin-wait loop

Atomics and sync memory allow the spin-wait loop, e.g.
   use, intrinsic :: iso_fortran_env
   logical(atomic_logical_kind) :: locked[*] = .true.
   logical :: val
   integer :: iam, p, q
   :
   iam = this_image()
   if (iam == p) then
      sync memory
      call atomic_define(locked[q], .false.)
   else if (iam == q) then
      val = .true.
      do while (val)
         call atomic_ref(val, locked)
      end do
      sync memory
   end if
Here, segment 1 on p and segment 2 on q are unordered, but locked is atomic, so this is OK. Image q will emerge from its spin when it sees that locked has become false.