Parallel Programming with Coarray Fortran


Parallel Programming with Coarray Fortran
SC10 Tutorial, November 15th 2010
David Henty, Alan Simpson (EPCC); Harvey Richardson, Bill Long, Nathan Wichmann (Cray)

Tutorial Overview
- The Fortran Programming Model
- Basic coarray features
- Practical Session 1
- Further coarray features
- Practical Session 2
- Advanced coarray features
- Practical Session 3
- Experiences with coarrays

The Fortran Programming Model

Motivation
Fortran now supports parallelism as a full first-class feature of the language:
- Changes are minimal
- Performance is maintained
- Flexibility in expressing communication patterns

Programming models for HPC
The challenge is to efficiently map a problem to the architecture we have:
- Take advantage of all computational resources
- Manage distributed memories etc.
- Make optimal use of any communication networks
The HPC industry has long experience in parallel programming: vector, threading, data-parallel, message-passing etc.
We would like to have models, or combinations of models, that are efficient, safe, and easy to learn and use.

Why consider new programming models?
Next-generation architectures bring new challenges:
- Very large numbers of processors with many cores
- Complex memory hierarchy; even today (2010) we are at 200k cores
Parallel programming is hard; we need to make this simpler.
Some of the models we currently use are bolt-ons to existing languages, as APIs or directives. This makes it hard to program for the underlying architecture, and programs may be unable to scale due to overheads.
So, is there an alternative to the models prevalent today? The most popular are OpenMP and MPI.

Shared-memory directives and OpenMP
(diagram: a process containing multiple threads sharing memory)

OpenMP: work distribution

!$OMP PARALLEL DO
do i=1,32
   a(i)=a(i)*2
end do

(diagram: the 32 iterations split across four threads as ranges 1-8, 9-16, 17-24 and 25-32)

OpenMP implementation
(diagram: threads within a single process mapped onto the cpus of a node)

Shared Memory Directives
- Multiple threads share global memory
- Most common variant: OpenMP
- Program loop iterations are distributed to threads; more recent task features
- Each thread has a means to refer to private objects within a parallel context
- Terminology: thread, thread team
- Implementation: threads map to user threads running on one SMP node
- Extensions to distributed memory not so successful
- OpenMP is a good model to use within a node
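The loop-splitting fragment shown earlier can be made into a complete, compilable program; this is an illustrative sketch (the array size and contents are arbitrary):

```fortran
program omp_demo
  implicit none
  integer :: i
  real :: a(32)

  a = 1.0
  ! iterations of this loop are divided among the threads of the team
  !$omp parallel do
  do i = 1, 32
     a(i) = a(i) * 2.0
  end do
  !$omp end parallel do

  print *, 'a(1) =', a(1), ' a(32) =', a(32)
end program omp_demo
```

Compile with the compiler's OpenMP flag (e.g. -fopenmp for gfortran) and set OMP_NUM_THREADS to control the size of the thread team.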

Cooperating Processes Models
(diagram: a problem divided among a set of cooperating processes)

Message Passing, MPI
(diagram: separate processes, each with its own cpu and memory)

MPI
(diagram: process 0 executes MPI_Send(a,...,1,...) while process 1 executes the matching MPI_Recv(a,...,0,...))

Message Passing
- Participating processes communicate using a message-passing API
- Remote data can only be communicated (sent or received) via the API
- MPI (the Message Passing Interface) is the standard
- Implementation: MPI processes map to processes within one SMP node or across multiple networked nodes
- The API provides process numbering, point-to-point and collective messaging operations
- Mostly used in a two-sided way: each endpoint coordinates in sending and receiving
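The two-sided send/receive pairing sketched in the diagram can be written out as a minimal MPI program (the value exchanged is arbitrary; run with at least two processes):

```fortran
program mpi_demo
  use mpi
  implicit none
  integer :: rank, ierr, a
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     a = 42
     ! two-sided: the send on rank 0 pairs with the receive on rank 1
     call MPI_Send(a, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(a, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     print *, 'rank 1 received', a
  end if

  call MPI_Finalize(ierr)
end program mpi_demo
```

Launch with e.g. `mpirun -np 2 ./mpi_demo`.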

SHMEM
(diagram: process 0 executes shmem_put(a,b,...) to write directly into the memory of process 1)

SHMEM
- Participating processes communicate using an API
- Fundamental operations are based on one-sided PUT and GET
- Need to use symmetric locations
- The remote side of the communication does not participate
- Can test for completion
- Barriers and collectives
- Popular on Cray and SGI hardware; also a Blue Gene version
- To make sense, needs hardware support for low-latency RDMA-type operations
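A one-sided PUT between symmetric locations might look like the sketch below. The routine names follow the OpenSHMEM 1.x Fortran bindings (start_pes, my_pe, shmem_integer_put, shmem_barrier_all); treat the exact calls as an assumption and check your implementation's documentation:

```fortran
program shmem_demo
  implicit none
  include 'shmem.fh'      ! declares MY_PE, NUM_PES, etc.
  integer :: src
  integer :: dest
  common /symm/ dest      ! variables in COMMON are symmetric (remotely accessible)

  call start_pes(0)
  src = 0
  dest = 0
  if (my_pe() == 0) then
     src = 42
     ! one-sided PUT: PE 0 writes directly into dest on PE 1;
     ! PE 1 does not participate in the transfer
     call shmem_integer_put(dest, src, 1, 1)
  end if
  call shmem_barrier_all()
  if (my_pe() == 1) print *, 'PE 1 has', dest
end program shmem_demo
```

Note that dest must be symmetric (here via a COMMON block) so that every PE has it at a known location.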

UPC
(diagram: UPC threads, each on its own cpu, sharing a partitioned global address space)

UPC

upc_forall (i=0; i<32; i++; affinity)
    a[i] = a[i]*2;

(diagram: iterations of the loop distributed across the UPC threads according to affinity)

UPC
- Extension to ISO C99
- Participating threads
- New shared data structures: shared pointers to distributed data (block or cyclic); pointers to shared data local to a thread
- Synchronization
- Language constructs to divide up work on shared data: upc_forall() to distribute iterations of a for() loop
- Extensions for collectives
- Both commercial and open-source compilers available: Cray, HP, IBM; Berkeley UPC (from LBL), GCC UPC

Fortran 2008 coarray model
- An example of a Partitioned Global Address Space (PGAS) model
- Set of participating processes, like MPI
- Participating processes have access to local memory via standard program mechanisms
- Access to remote memory is directly supported by the language

Fortran coarray model
(diagram: processes, each with its own cpu and memory, with direct access to coarray data on the other processes)

Fortran coarrays
Remote access is a full feature of the language:
- Type checking
- Opportunity to optimize communication
- No penalty for local access
The single-sided programming model is more natural for some algorithms, and a good match for modern networks with RDMA.

Fortran coarrays: Basic Features

Coarray Fortran
"Coarrays were designed to answer the question: What is the smallest change required to convert Fortran into a robust and efficient parallel language? The answer: a simple syntactic extension. It looks and feels like Fortran and requires Fortran programmers to learn only a few new rules."
John Reid, ISO Fortran Convener

Some History
- Introduced in their current form by Numrich and Reid in 1998 as a simple extension to Fortran 95 for parallel processing
- Many years of experience, mainly on Cray hardware
- A set of core features is expected to form part of the Fortran 2008 standard
- Additional features are expected to be published in a Technical Report in due course

How Does It Work?
- SPMD - Single Program, Multiple Data: a single program is replicated a fixed number of times
- Each replication is called an image
- Images are executed asynchronously: the execution path may differ from image to image, and some situations cause images to synchronize
- Images access remote data using coarrays
- Normal rules of Fortran apply
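The SPMD model is easiest to see in the smallest possible coarray program, which simply reports which image it is (this is essentially the first practical exercise):

```fortran
program hello_images
  implicit none
  ! this_image() and num_images() are Fortran 2008 intrinsics;
  ! every image executes this same program asynchronously
  print *, 'Hello from image', this_image(), 'of', num_images()
end program hello_images
```

Build with a coarray-capable compiler, e.g. `ifort -coarray` or `gfortran -fcoarray=lib` linked against OpenCoarrays (the flags shown are illustrative and compiler-specific).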

What are coarrays?
Arrays or scalars that can be accessed remotely: images can access data objects on any other image. There is additional Fortran syntax for coarrays: specifying a codimension declares a coarray.

real, dimension(10), codimension[*] :: x
real :: x(10)[*]

These are equivalent declarations of an array x of size 10 on each image. x is now remotely accessible. Coarrays have the same size on each image!

Accessing coarrays

integer :: a(4)[*], b(4)[*]   ! declare coarrays
b(:) = a(:)[n]                ! copy

Integer arrays a and b are declared to be of size 4 on all images. The second statement copies array a from remote image n into the local array b: () for local access, [] for remote access. E.g. for two images and n = 2, image 1's b receives a copy of the values of a held on image 2 (diagram: the contents of a and b on images 1 and 2 before and after the copy).

Synchronisation
Be careful when updating coarrays:
- If we get remote data, was it valid?
- Could another process send us data and overwrite something we have not yet used?
- How do we know that data sent to us has arrived?
Fortran provides intrinsic synchronisation statements. For example, a barrier for synchronisation of all images:

sync all

Do not make assumptions about execution timing on images unless executed after synchronisation. Note there is implicit synchronisation at program start.

Retrieving information about images
Two intrinsics provide the index of this image and the number of images:
- this_image() (image indexes start at 1)
- num_images()

real :: x[*]
if (this_image() == 1) then
   read *, x
   do image = 2, num_images()
      x[image] = x
   end do
end if
sync all

Making remote references
We used a loop over images:

do image = 2, num_images()
   x[image] = x
end do

Note that array indexing within the coindex is not allowed, so we cannot write:

x[2:num_images()] = x   ! illegal

Data usage
Coarrays have the same size on every image.
Declarations:
- round brackets () describe the rank, shape and extent of the local data
- square brackets [] describe the layout of the images that hold the local data
Many HPC problems have physical quantities mapped to n-dimensional grids. You need to implement your view of the global data from the local coarrays, as Fortran does not provide the global view. You can be flexible with the coindexing (see later) and can use any access pattern you wish.

Data usage
Print out a 16-element global integer array A from 4 processors: 4 elements per processor = coarrays of size 4 on 4 images.

integer :: ca(4)[*]
do image = 1, num_images()
   print *, ca[image]
end do

(diagram: A assembled block-wise from ca(1:4)[1], ca(1:4)[2], ca(1:4)[3] and ca(1:4)[4] on images 1 to 4)

1D cyclic data access
The coarray declarations remain unchanged, but we use a cyclic access pattern:

integer :: ca(4)[*]
do i = 1, 4
   do image = 1, num_images()
      print *, ca(i)[image]
   end do
end do

(diagram: the elements of A interleaved cyclically across images 1 to 4)

Synchronisation
- Code execution on images is independent; the programmer has to control execution using synchronisation
- Synchronise before accessing coarrays: ensure the content is not updated from remote images before you can use it
- Synchronise after accessing coarrays: ensure the new content is available to all images
- Implicit synchronisation after variable declarations, at the first executable statement: guarantees coarrays exist on all images when your first program statement is executed
We will revisit this topic later.

Example: maximum of array

real :: a(10)
real :: maximum[*]   ! implicit synchronisation after declarations
call random_number(a)
maximum = maxval(a)
sync all             ! ensure all images have set their local maximum
if (this_image() == 1) then
   do image = 2, num_images()
      maximum = max(maximum, maximum[image])
   end do
   do image = 2, num_images()
      maximum[image] = maximum
   end do
end if
sync all             ! ensure all images have a copy of the maximum value

Recap
We now know the basics of coarrays:
- declarations
- references with []
- this_image() and num_images()
- sync all
Now consider a full program example...

Example 2: Calculate density of primes

program pdensity
   implicit none
   integer, parameter :: n = 8000000, nimages = 8
   integer :: start, end, i
   integer, dimension(nimages) :: nprimes[*]
   real :: density

   start = (this_image()-1) * n/num_images() + 1
   end = start + n/num_images() - 1
   nprimes(this_image())[1] = num_primes(start, end)
   sync all

   if (this_image() == 1) then
      nprimes(1) = sum(nprimes)
      density = real(nprimes(1))/n
      print *, "Calculating prime density on", num_images(), "images"
      print *, nprimes(1), 'primes in', n, 'numbers'
      write(*,'(" density is ",2P,f5.2,"%")') density
      write(*,'(" asymptotic theory gives ",2P,f5.2,"%")') 1.0/(log(real(n))-1.0)
   end if

Example 2: output

Calculating prime density on 2 images
539778 primes in 8000000 numbers
density is 6.75%
asymptotic theory gives 6.71%
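The program above calls a helper num_primes(start, end) that is not shown on the slides. A straightforward trial-division version might look like the following (the interface is taken from the call above; the body is an assumption and could live in a contains section of pdensity):

```fortran
integer function num_primes(start, finish)
  implicit none
  integer, intent(in) :: start, finish
  integer :: i, j
  logical :: isprime

  ! count the primes in [start, finish] by trial division
  num_primes = 0
  do i = max(start, 2), finish
     isprime = .true.
     do j = 2, int(sqrt(real(i)))
        if (mod(i, j) == 0) then
           isprime = .false.
           exit
        end if
     end do
     if (isprime) num_primes = num_primes + 1
  end do
end function num_primes
```

Each image calls this on its own subrange, so the cost of the count is divided across images.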

Observations so far on coarrays
- A natural extension, easy to learn
- Makes the parallel parts of the program obvious (syntax)
- Part of the Fortran language (type checking, etc.)
- No mapping of data to buffers (or copying), and no creation of complex types (as we might have with MPI)
- The compiler can optimize for communication
More observations later...

Exercise Session 1
- Look at the Exercise Notes document for full details
- Write, compile and run a "Hello World" program that prints out the value of the running image's image index and the number of images
- Extend the simple Fortran code provided in order to perform operations on parts of a picture using coarrays

Backup Slides: HPF model

High Performance Fortran (HPF)
- Data-parallel programming model
- Single thread of control
- Arrays can be distributed and operated on in parallel
- Loosely synchronous
- Parallelism mainly from Fortran 90 array syntax, FORALL and intrinsics
This model was popular on SIMD hardware (AMT DAP, Connection Machines) but was extended to clusters, where the control thread is replicated.
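The data-parallel style HPF builds on is already visible in standard Fortran 90 array syntax and FORALL; HPF adds distribution directives on top. A sketch (the !HPF$ DISTRIBUTE directive is a no-op comment to a plain Fortran compiler, so this also runs serially):

```fortran
program hpf_style
  implicit none
  integer :: i
  real :: a(16)
!HPF$ DISTRIBUTE a(BLOCK)     ! HPF directive: spread a block-wise across processors

  a = [(real(i), i = 1, 16)]
  a = sqrt(a)                 ! whole-array operation, executed in parallel under HPF
  forall (i = 1:16) a(i) = a(i) * 2.0
  print *, a(1), a(16)
end program hpf_style
```

The single control thread expresses what happens to the whole array; the compiler and runtime decide how the distributed pieces are updated.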

HPF
(diagram: a single control thread on one cpu driving an array of PEs)

HPF
(diagram: the whole-array operation A=SQRT(A) applied in parallel to the distributed array A across the PEs)