High Performance Fortran

6.2 Concept
- Fortran90 extension
- SPMD (Single Program Multiple Data) model: each process operates on its own part of the data
- HPF directives specify which processor gets which part of the data
- concurrency is defined by HPF directives on top of Fortran90
- HPF directives are written as comments:
    !HPF$ <directive>
- most of the directives are declarative, concerning the distribution of data between processes
- the INDEPENDENT directive (and its NEW attribute) is an exception: it applies to executable statements
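A minimal sketch (not in the original slides) of how these pieces fit together in one program; the array names and sizes are illustrative, and the PROCESSORS and DISTRIBUTE directives used here are introduced in the following sections:

    PROGRAM hpf_sketch
    IMPLICIT NONE
    REAL, DIMENSION(1000) :: A, B
    !HPF$ PROCESSORS, DIMENSION(4) :: P      ! declarative: conceptual processor array
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: A, B  ! declarative: data distribution
    B = 1.0
    A = 2.0*B + 1.0   ! Fortran90 array syntax; concurrency follows the distribution
    END PROGRAM hpf_sketch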

6.3 PROCESSORS declaration
- declares a conceptual processor array (which need not reflect the actual hardware structure)
- Example:
    !HPF$ PROCESSORS, DIMENSION(4) :: P1
    !HPF$ PROCESSORS, DIMENSION(2,2) :: P2
    !HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3
- the number of processors needs to remain the same throughout a program
- if two different processor arrays have the same shape, one can assume that corresponding processors map to the same physical process
- !HPF$ PROCESSORS :: P defines a scalar processor

6.4 DISTRIBUTE directive
- says how to distribute data between the processors. Example:
    REAL, DIMENSION(50)    :: A
    REAL, DIMENSION(10,10) :: B, C, D
    !HPF$ DISTRIBUTE (BLOCK) ONTO P1 :: A            ! 1D
    !HPF$ DISTRIBUTE (CYCLIC,CYCLIC) ONTO P2 :: B,C  ! 2D
    !HPF$ DISTRIBUTE D(BLOCK,*) ONTO P1              ! alternative syntax
- the distribution rank and the processor array rank need to conform
- A(1:50) onto P1(1:4): each processor gets ceil(50/4) = 13 elements, except P1(4), which gets the rest (11 elements)
- * marks a dimension to be ignored => in this example D is distributed row-wise
- CYCLIC: elements are dealt out one by one (like cards from a pile among players)

- BLOCK: array elements are distributed block-wise (each block holding elements from a contiguous range)
- Example (9-element array distributed over 3 processors):
    CYCLIC: 1 2 3 1 2 3 1 2 3
    BLOCK:  1 1 1 2 2 2 3 3 3

Example: DISTRIBUTE block-wise:
    PROGRAM Chunks
    REAL, DIMENSION(20) :: A
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: A

Example: DISTRIBUTE cyclically:
    PROGRAM Round_Robin
    REAL, DIMENSION(20) :: A
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE (CYCLIC) ONTO P :: A

Example: DISTRIBUTE with a 2-dimensional layout:
    PROGRAM Skwiffy
    IMPLICIT NONE
    REAL, DIMENSION(4,4) :: A, B, C
    !HPF$ PROCESSORS, DIMENSION(2,2) :: P
    !HPF$ DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: A, B, C
    B = 1; C = 1; A = B + C
    END PROGRAM Skwiffy

Block-wise in the first dimension, cyclic in the second; element (i,j) is owned by the processor shown:
    (11)(12)(11)(12)
    (11)(12)(11)(12)
    (21)(22)(21)(22)
    (21)(22)(21)(22)

Example: DISTRIBUTE with *:
    PROGRAM Skwiffy
    IMPLICIT NONE
    REAL, DIMENSION(4,4) :: A, B, C
    !HPF$ PROCESSORS, DIMENSION(4) :: Q
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO Q :: A, B, C
    B = 1; C = 1; A = B + C; PRINT *, A
    END PROGRAM Skwiffy

Remarks about DISTRIBUTE:
- without ONTO, a default processor arrangement is used (given e.g. with program arguments)
- BLOCK is better if the algorithm uses many neighbouring elements of the array => less communication, faster
- CYCLIC is good for an even load distribution
- ignoring a dimension (with *) is good if calculations are done with a whole row or column

- all scalar variables are replicated by default; keeping them up to date is the compiler's task

6.5 Distribution of allocatable arrays
- similar, except that the distribution happens right after the memory allocation:
    REAL, ALLOCATABLE, DIMENSION(:,:) :: A
    INTEGER :: ierr
    !HPF$ PROCESSORS, DIMENSION(10,10) :: P
    !HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: A
    ...
    ALLOCATE(A(100,20), stat=ierr)
    ! A automatically distributed here
    ! block size in dim=1 is 10 elements
    ...
    DEALLOCATE(A)
    END
- the block size is determined right after the ALLOCATE statement

6.6 HPF rule: Owner Calculates
- the processor owning the left-hand side of an assignment performs the computation. Example:
    DO i = 1,n
      a(i-1) = b(i-6) / c(i+j) * a(i*i)
    END DO
- the calculation is performed by the process owning a(i-1)
- NOTE that the rule is not obligatory, only advisory to the compiler; the compiler may (to reduce communication in the program as a whole) compute elsewhere and leave only the assignment to the processor owning a(i-1)

6.7 Scalar variables
    REAL, DIMENSION(100,100) :: X
    REAL :: Scal
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) :: X
    ...
    Scal = X(i,j)
    ...
- the owner of X(i,j) assigns the value to Scal and sends it to the other processes (replication)

6.8 Examples of good DISTRIBUTE subdivision
Example:
    A(2:99)  = (A(:98) + A(3:))/2   ! neighbour calculations
    B(22:56) = 4.0*ATAN(1.0)        ! section of B calculated
    C(:)     = SUM(D, DIM=1)        ! sum down a column
From the owner calculates rule we get:
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: A
    !HPF$ DISTRIBUTE (CYCLIC) ONTO P :: B
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: C     ! or (CYCLIC)
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: D   ! or (*,CYCLIC)

Example (SSOR):
    DO j = 2,n-1
      DO i = 2,n-1
        a(i,j) = (omega/4)*(a(i,j-1)+a(i,j+1)+ &
                 a(i-1,j)+a(i+1,j)) + (1-omega)*a(i,j)
      END DO
    END DO
Best is BLOCK distribution in both dimensions

6.9 HPF programming methodology
- need to find a balance between concurrency and communication: the more processes, the more communication
- aim for a balanced load based on the owner calculates rule
- data locality
- use array syntax and intrinsic functions on arrays
- avoid deprecated Fortran features (like assumed size and storage association)
- it is easy to write a program in HPF but difficult to gain good efficiency
Programming in HPF goes more or less like this:
1. Write a correctly working serial program; test and debug it
2. Add distribution directives, introducing as little communication as possible

3. Add INDEPENDENT directives where semantically valid
4. Perform data alignment (with the ALIGN directive)
But the first thing is to choose a good parallel algorithm!
Issues reducing efficiency:
- complicated indexing (may confuse the compiler about where particular elements are located)
- array syntax is very powerful, but complex constructions may cause the compiler to fail to optimise
- sequential loops are left sequential (or replicated)
- redistribution of objects is time consuming

Additional advice:
- use array syntax instead of loops
- use intrinsic functions where possible (for possibly better optimisation)
- before parallelising, think about whether the algorithm is parallelisable at all
- use INDEPENDENT and ALIGN directives
- it is possible to give explicit block sizes in BLOCK and CYCLIC if you are sure the algorithm's efficiency requires it; otherwise it is better to leave the decision to the compiler

6.10 BLOCK(m) and CYCLIC(m)
- predefine the block size; in general this makes the code less efficient due to more complex bookkeeping of ownership
    REAL, DIMENSION(20) :: A, B
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE A(BLOCK(9)) ONTO P
    !HPF$ DISTRIBUTE B(CYCLIC(2)) ONTO P
2D example:
    REAL, DIMENSION(4,9) :: A
    !HPF$ PROCESSORS, DIMENSION(2) :: P
    !HPF$ DISTRIBUTE (BLOCK(3),CYCLIC(2)) ONTO P :: A

6.11 Array alignment
- improves data locality
- minimises communication
- distributes workload
Simplest example: A = B + C. With correct ALIGNment it needs no communication. Two equivalent ways:
    !HPF$ ALIGN (:,:) WITH T(:,:) :: A, B, C
is equivalent to:
    !HPF$ ALIGN A(:,:) WITH T(:,:)
    !HPF$ ALIGN B(:,:) WITH T(:,:)
    !HPF$ ALIGN C(:,:) WITH T(:,:)

Example:
    REAL, DIMENSION(10) :: A, B, C
    !HPF$ ALIGN (:) WITH C(:) :: A, B
only C needs to appear in a DISTRIBUTE directive.
Example, using an align-dummy symbol instead of ':':
    !HPF$ ALIGN (j) WITH C(j) :: A, B
means: for each j, align elements A(j) and B(j) with C(j). (':' instead of j is a stronger requirement)

Example 2 (2-dimensional case):
    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(i,j) WITH B(i,j)
The form
    !HPF$ ALIGN A(:,:) WITH B(:,:)
is a stronger requirement than the one above, which does not assume arrays of the same size.
Good for performing operations like A = B+C+B*C (everything local!)
Transposed alignment:
    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(i,:) WITH B(:,i)
(the second dimension of A and the first dimension of B must have the same length!)

Similarly:
    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(:,j) WITH B(j,:)
or:
    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(i,j) WITH B(j,i)
good for performing the operation:
    A = A + TRANSPOSE(B)*A   ! everything local!

6.12 Strided Alignment
Example: align the elements of D with every second element of E:
    REAL, DIMENSION(5)  :: D
    REAL, DIMENSION(10) :: E
    !HPF$ ALIGN D(:) WITH E(1::2)
which could also be written:
    !HPF$ ALIGN D(i) WITH E(i*2-1)
Operation:
    D = D + E(::2)   ! local

Example: reverse strided alignment:
    REAL, DIMENSION(5)  :: D
    REAL, DIMENSION(10) :: E
    !HPF$ ALIGN D(:) WITH E(UBOUND(E)::-2)
which could also be written:
    !HPF$ ALIGN D(i) WITH E(2+UBOUND(E)-i*2)

6.13 Example on Alignment
    PROGRAM Warty
    IMPLICIT NONE
    REAL, DIMENSION(4) :: C
    REAL, DIMENSION(8) :: D
    REAL, DIMENSION(2) :: E
    C = 1; D = 2
    E = D(::4) + C(::2)
    END PROGRAM Warty
Minimal (zero) communication is achieved with:
    !HPF$ ALIGN C(:) WITH D(::2)
    !HPF$ ALIGN E(:) WITH D(::4)
    !HPF$ DISTRIBUTE (BLOCK) :: D

6.14 Alignment with Allocatable Arrays
- alignment is performed together with memory allocation
- an existing object cannot be aligned to an unallocated object
Example:
    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(:) WITH B(:)
then
    ALLOCATE (B(100), stat=ierr)
    ALLOCATE (A(100), stat=ierr)
is OK;
    ALLOCATE (B(100),A(100), stat=ierr)
is also OK (allocation proceeds from left to right),

but
    ALLOCATE (A(100), stat=ierr)
    ALLOCATE (B(100), stat=ierr)
or
    ALLOCATE (A(100),B(100), stat=ierr)
give an error!
A non-allocatable array cannot be aligned with an allocatable one:
    REAL, DIMENSION(:) :: X
    REAL, DIMENSION(:), ALLOCATABLE :: A
    !HPF$ ALIGN X(:) WITH A(:)   ! ERROR

One more problem:
    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(:) WITH B(:)
    ALLOCATE(B(100), stat=ierr)
    ALLOCATE(A(50), stat=ierr)
the ':' form says that A and B should have the same length (but they do not!)
But this is OK:
    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(i) WITH B(i)
    ALLOCATE(B(100), stat=ierr)
    ALLOCATE(A(50), stat=ierr)
still: A cannot be larger than B.

6.15 Dimension collapsing
- one or several dimensions can be aligned with a single element:
    !HPF$ ALIGN (*,:) WITH Y(:) :: X
- each element of Y is aligned with a whole column of X (the first dimension of X is collapsed)

6.16 Dimension replication
    !HPF$ ALIGN Y(:) WITH X(*,:)
- each processor holding a part of X(:,i) also gets a copy of Y(i)
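A minimal sketch (not in the original slides) putting collapsing and replication into one program; the sizes and the loop are illustrative:

    PROGRAM collapse_replicate
    IMPLICIT NONE
    INTEGER :: i
    REAL, DIMENSION(8,8) :: X
    REAL, DIMENSION(8)   :: Y
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE X(*,BLOCK) ONTO P  ! first dimension collapsed: each column stays whole
    !HPF$ ALIGN Y(:) WITH X(*,:)        ! Y(i) replicated wherever column X(:,i) lives
    X = 1.0
    Y = 2.0
    DO i = 1, 8
      X(:,i) = X(:,i) + Y(i)            ! local: Y(i) resides with column X(:,i)
    END DO
    END PROGRAM collapse_replicate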

Example: 2D Gauss elimination, kernel of the program:
    ...
    DO j = i+1, n
      A(j,i) = A(j,i)/Swap(i)
      A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
      Y(j) = Y(j) - A(j,i)*Temp
    END DO
Y(k) together with A(k,i) =>
    !HPF$ ALIGN Y(:) WITH A(:,*)
Swap(k) together with A(i,k) =>
    !HPF$ ALIGN Swap(:) WITH A(*,:)
No neighbouring elements of A appear in the same expression => CYCLIC:

    !HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)

Example: matrix multiplication
    PROGRAM ABmult
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 100
    INTEGER, DIMENSION(N,N) :: A, B, C
    INTEGER :: i, j
    !HPF$ PROCESSORS square(2,2)
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    ! replicate copies of row A(i,:) onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    ! replicate copies of column B(:,j) onto processors which compute C(i,j)
    A = 1
    B = 2
    C = 0
    DO i = 1, N
      DO j = 1, N
        ! all the work is local due to the ALIGNs
        C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
      END DO
    END DO
    WRITE(*,*) C
    END

6.17 HPF Intrinsic Functions
- NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE give information about the physical hardware, needed for portability:
    !HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
    !HPF$ PROCESSORS P2(4,4,NUMBER_OF_PROCESSORS()/16)
    !HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
    !HPF$               0:NUMBER_OF_PROCESSORS(2)-1)
- on a 2048-processor hypercube:
    PRINT *, PROCESSORS_SHAPE()
would return:
    2 2 2 2 2 2 2 2 2 2 2

6.18 HPF Template Syntax
- TEMPLATE: a conceptual object, uses no RAM, defined statically (like 0-sized arrays that are never assigned to)
- templates are declared, distributed and can be used to align arrays
Example:
    REAL, DIMENSION(10) :: A, B
    !HPF$ TEMPLATE, DIMENSION(10) :: T
    !HPF$ DISTRIBUTE (BLOCK) :: T
    !HPF$ ALIGN (:) WITH T(:) :: A, B
(here only T may be an argument to DISTRIBUTE)
Combined TEMPLATE directive:

    !HPF$ TEMPLATE, DIMENSION(100,100), &
    !HPF$   DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: T
    !HPF$ ALIGN A(:,:) WITH T(:,:)
which is equivalent to:
    !HPF$ TEMPLATE, DIMENSION(100,100) :: T
    !HPF$ ALIGN A(:,:) WITH T(:,:)
    !HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO P

Example:
    PROGRAM Warty
    IMPLICIT NONE
    REAL, DIMENSION(4) :: C
    REAL, DIMENSION(8) :: D
    REAL, DIMENSION(2) :: E
    !HPF$ TEMPLATE, DIMENSION(8) :: T
    !HPF$ ALIGN D(:) WITH T(:)
    !HPF$ ALIGN C(:) WITH T(::2)
    !HPF$ ALIGN E(:) WITH T(::4)
    !HPF$ DISTRIBUTE (BLOCK) :: T
    C = 1; D = 2
    E = D(::4) + C(::2)
    END PROGRAM Warty
(similar to the strided alignment example)

More examples of using templates:
    ALIGN A(:) WITH T1(:,*)
for each i, element A(i) is replicated according to row T1(i,:).
    ALIGN C(i,j) WITH T2(j,i)
the transpose of C is aligned with T2.
    ALIGN B(:,*) WITH T3(:)
for each i, row B(i,:) is aligned with template element T3(i).
    DISTRIBUTE (BLOCK,CYCLIC) :: T1, T2
    DISTRIBUTE T1(CYCLIC,*) ONTO P
T1 rows are distributed cyclically.

6.19 FORALL
Syntax:
    FORALL(<forall-triple-list>[,<scalar-mask>]) &
      <assignment>
Example:
    FORALL (i=1:n, j=1:m, A(i,j).NE.0) A(i,j) = 1/A(i,j)

Circumstances where Fortran90 syntax is not enough but FORALL makes it simple:
- index expressions:
    FORALL (i=1:n, j=1:n, i /= j) A(i,j) = REAL(i+j)
- intrinsic or PURE functions (which have no side effects):
    FORALL (i=1:n:3, j=1:n:5) A(i,j) = SIN(A(j,i))
- subindexing (vector subscripts):
    FORALL (i=1:n, j=1:n) A(VS(i),j) = i+VS(j)

- unusual array parts can be accessed:
    FORALL (i=1:n) A(i,i) = B(i)    ! diagonal
    !...
    DO j = 1, n
      FORALL (i=1:j) A(i,j) = B(i)  ! triangular
    END DO
To parallelise, also add before the DO:
    !HPF$ INDEPENDENT, NEW(i)
or write nested FORALL statements:
    FORALL (j = 1:n)
      FORALL (i=1:j) A(i,j) = B(i)  ! triangular
    END FORALL

FORALL statement execution:
1. evaluation of the triple-list
2. evaluation of the scalar mask
3. for each .TRUE. mask element, evaluation of the right-hand-side value
4. assignment of the right-hand-side values to the left-hand side
In HPF there is synchronisation between the steps!
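A small sketch (not in the original slides) of why the separation of steps 3 and 4 matters: in FORALL every right-hand side is evaluated before any assignment happens, unlike in an ordinary DO loop. Program and variable names are illustrative:

    PROGRAM forall_semantics
    IMPLICIT NONE
    INTEGER, PARAMETER :: n = 5
    INTEGER :: i
    REAL, DIMENSION(n) :: a, b
    a = (/ 1., 2., 3., 4., 5. /)
    b = (/ 1., 2., 3., 4., 5. /)
    ! FORALL: all right-hand sides use the old values of a
    FORALL (i = 2:n) a(i) = a(i-1)   ! a becomes 1 1 2 3 4
    ! DO loop: each iteration sees the previous assignment
    DO i = 2, n
      b(i) = b(i-1)                  ! b becomes 1 1 1 1 1
    END DO
    PRINT *, a
    PRINT *, b
    END PROGRAM forall_semantics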

6.20 PURE procedures
    PURE REAL FUNCTION F(x, y)
    PURE SUBROUTINE G(x, y, z)
- without side effects, i.e.: no external I/O and no ALLOCATE (communication is still allowed)
- do not change the global state of the program
- intrinsic functions are PURE
- can be used in FORALL and in PURE procedures
- no PAUSE or STOP
- FUNCTION formal parameters have the attribute INTENT(IN)

Example (function):
    PURE REAL FUNCTION F(x, y)
    IMPLICIT NONE
    REAL, INTENT(IN) :: x, y
    F = x*x + y*y + 2*x*y + ASIN(MIN(x/y, y/x))
    END FUNCTION F
Example of usage:
    FORALL (i=1:n, j=1:n) &
      A(i,j) = b(i) + F(1.0*i, 1.0*j)

Example (subroutine):
    PURE SUBROUTINE G(x, y, z)
    IMPLICIT NONE
    REAL, INTENT(OUT), DIMENSION(:) :: z
    REAL, INTENT(IN),  DIMENSION(:) :: x, y
    INTEGER i
    INTERFACE
      PURE REAL FUNCTION F(x, y)
      REAL, INTENT(IN) :: x, y
      END FUNCTION F
    END INTERFACE
    !...
    FORALL(i=1:SIZE(z)) z(i) = F(x(i), y(i))
    END SUBROUTINE G

MIMD example:
    REAL FUNCTION F(x, i)   ! PURE
    IMPLICIT NONE
    REAL, INTENT(IN)    :: x   ! element
    INTEGER, INTENT(IN) :: i   ! index
    IF (x > 0.0) THEN
      F = x*x
    ELSEIF (i==1 .OR. i==n) THEN
      F = 0.0
    ELSE
      F = x
    END IF
    END FUNCTION F

6.21 INDEPENDENT
- placed directly in front of a DO or FORALL:
    !HPF$ INDEPENDENT
    DO i = 1,n
      x(i) = i*2
    END DO
- in front of FORALL: no synchronisation is needed between right-hand-side evaluation and assignment
If an INDEPENDENT loop...
- ...assigns more than once to the same element, parallelisation is lost!
- ...includes an EXIT, STOP or PAUSE statement, the iterations need to execute sequentially to be sure of ending in the right iteration
- ...has jumps out of the loop or I/O => sequential execution

Independent:
    !HPF$ INDEPENDENT
    DO i = 1, n
      b(i) = b(i) + b(i)
    END DO
Not independent:
    DO i = 1, n
      b(i) = b(i+1) + b(i)
    END DO
Not independent:
    DO i = 1, n
      b(i) = b(i-1) + b(i)
    END DO

This loop is independent:
    !HPF$ INDEPENDENT
    DO i = 1, n
      a(i) = b(i-1) + b(i)
    END DO
The question to ask: does a later iteration depend on a previous one?

6.22 INDEPENDENT NEW command
- creates an independent (private) copy of the variable on each process:
    !HPF$ INDEPENDENT, NEW(s1, s2)
    DO i = 1,n
      s1 = SIN(a(i))
      s2 = COS(a(i))
      a(i) = s1*s1 - s2*s2
    END DO

Rules for NEW variables:
- cannot be used outside the loop without redefinition
- cannot be used in FORALL
- must not be pointers or formal parameters
- cannot have the SAVE attribute
Disallowed:
    !HPF$ INDEPENDENT, NEW(s1, s2)
    DO i = 1,n
      s1 = SIN(a(i))
      s2 = COS(a(i))
      a(i) = s1*s1 - s2*s2
    END DO
    k = s1+s2   ! not allowed!

Example (only the outer loops can be executed independently):
    !HPF$ INDEPENDENT, NEW(i2)
    DO i1 = 1, n1
      !HPF$ INDEPENDENT, NEW(i3)
      DO i2 = 1, n2
        !HPF$ INDEPENDENT, NEW(i4)
        DO i3 = 1, n3
          DO i4 = 1, n4
            a(i1,i2,i3) = a(i1,i2,i3) &
                        + b(i1,i2,i4) * c(i2,i3,i4)
          END DO
        END DO
      END DO
    END DO

6.23 EXTRINSIC
- used in INTERFACE blocks to declare that a routine is not an HPF routine
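A minimal sketch (not in the original slides) of such an interface declaration, assuming the HPF_LOCAL extrinsic kind defined by the HPF standard; the routine name is illustrative:

    INTERFACE
      EXTRINSIC(HPF_LOCAL) SUBROUTINE local_work(x)
        REAL, DIMENSION(:), INTENT(INOUT) :: x  ! each process sees only its local part of x
      END SUBROUTINE local_work
    END INTERFACE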

7 MPI
See the separate slides on the course homepage!

Part III: Parallel Algorithms

8 Parallel Algorithm Design Principles
- identifying portions of the work that can be performed concurrently
- mapping the concurrent pieces of work onto multiple processes running in parallel
- distributing the input, output and intermediate data associated with the program
- managing access to data shared by multiple processes
- synchronising the processes at various stages of the parallel program execution

8.1 Decomposition, Tasks and Dependency Graphs
- subdividing a computation into smaller components is called decomposition
- Example: dense matrix-vector multiplication y = Ab, where
    y[i] = \sum_{j=1}^{n} A[i,j] b[j]
(Figure: computational problem -> decomposition -> tasks [1])
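A small sketch (not in the original slides) of the row-wise decomposition written in the notation of the HPF part: each y(i) is a separate task and, with the alignment shown, an independent one. Names and sizes are illustrative:

    PROGRAM matvec_tasks
    IMPLICIT NONE
    INTEGER, PARAMETER :: n = 8
    REAL, DIMENSION(n,n) :: A
    REAL, DIMENSION(n)   :: b, y
    INTEGER :: i
    !HPF$ DISTRIBUTE A(BLOCK,*)   ! whole rows stay on one processor
    !HPF$ ALIGN y(:) WITH A(:,*)  ! y(i) lives with row A(i,:)
    !HPF$ ALIGN b(:) WITH A(*,:)  ! b replicated wherever it is needed
    A = 1.0
    b = 2.0
    !HPF$ INDEPENDENT
    DO i = 1, n
      y(i) = DOT_PRODUCT(A(i,:), b)  ! the task for row i: entirely local
    END DO
    PRINT *, y
    END PROGRAM matvec_tasks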

- in the case of y = Ab the tasks (computing each row's contribution y[i]) are independent
(Figure: tasks and their relative order as an abstraction [1])
Task dependency graph:
- a directed acyclic graph
- nodes = tasks

- directed edges = dependences
- (the graph can be disconnected; the edge set can be empty)
Mapping tasks onto processors
Processors vs. processes

8.2 Decomposition Techniques
Recursive decomposition
- Example: Quicksort (see the sketch below)
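A compact sketch (not in the original slides) of the recursive decomposition in quicksort: after partitioning, the two recursive calls work on disjoint ranges and are independent tasks that could be given to different processes. Plain recursive Fortran without parallel constructs; in a real program the routine would live in a module so the assumed-shape dummy argument has an explicit interface:

    RECURSIVE SUBROUTINE quicksort(a, lo, hi)
    IMPLICIT NONE
    REAL, DIMENSION(:), INTENT(INOUT) :: a
    INTEGER, INTENT(IN) :: lo, hi
    INTEGER :: i, j
    REAL :: pivot, tmp
    IF (lo >= hi) RETURN
    pivot = a(lo); i = lo; j = hi
    DO WHILE (i <= j)                 ! partition around the pivot
      DO WHILE (a(i) < pivot)
        i = i + 1
      END DO
      DO WHILE (a(j) > pivot)
        j = j - 1
      END DO
      IF (i <= j) THEN
        tmp = a(i); a(i) = a(j); a(j) = tmp
        i = i + 1; j = j - 1
      END IF
    END DO
    ! the two recursive calls form independent tasks on disjoint subranges
    CALL quicksort(a, lo, j)
    CALL quicksort(a, i, hi)
    END SUBROUTINE quicksort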

Data decomposition
- input data
- output data
- intermediate results
- owner calculates rule
Exploratory decomposition
- Example: the 15-puzzle game
Speculative decomposition
Hybrid decomposition

8.3 Tasks and Interactions
Task generation
- static
- dynamic
Task sizes
- uniform
- non-uniform
- knowledge of task sizes
- size of data associated with tasks

Characteristics of inter-task interactions:
- static versus dynamic
- regular versus irregular
- read-only versus read-write
- one-way versus two-way

8.4 Mapping Techniques for Load Balancing
Static mapping
- mappings based on data partitioning
- array distribution schemes:
  - block distribution
  - cyclic and block-cyclic distribution (see the sketch after this slide)
  - randomised block distribution
- graph partitioning
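A small sketch (not in the original slides) of how a block-cyclic distribution maps indices to processors, matching the CYCLIC(m) behaviour described in the HPF part; the function name is illustrative:

    ! owner (0..p-1) of the 1-based global index i under a block-cyclic
    ! distribution with block size m over p processors; blocks of m
    ! consecutive elements are dealt out round-robin, so CYCLIC = CYCLIC(1)
    ! and BLOCK corresponds to CYCLIC(ceil(n/p)) for an n-element array
    INTEGER FUNCTION block_cyclic_owner(i, m, p)
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: i, m, p
    block_cyclic_owner = MOD((i - 1) / m, p)
    END FUNCTION block_cyclic_owner

For example, with m = 2 and p = 4 a 9-element array is dealt out as blocks 1-2, 3-4, 5-6, 7-8 to processors 0..3, and element 9 wraps back to processor 0.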

Dynamic mapping schemes
- centralised schemes
  - master-slave
  - self scheduling: single task scheduling, chunk scheduling
- distributed schemes
  - each process can send work to / receive work from any other process
  - how are sending and receiving processes paired together?
  - is the initiator the sender or the receiver?
  - how much work is exchanged?
  - when is the work transfer performed?

Methods for reducing interaction overheads:
- maximising data locality: minimise the volume of data exchanged, minimise the frequency of interactions
- overlapping computations with interactions
- replicating data or computation
- overlapping interactions with other interactions

8.5 Parallel Algorithm Models
- data-parallel model
- task graph model
  - typically the amount of data is relatively large compared to the amount of computation
- work pool model (task pool model)
  - the pool can be centralised or distributed, statically or dynamically created
- master-slave model

- pipeline (or producer-consumer) model
  - stream parallelism: a stream of data triggers computations
  - a pipeline is a chain of producer-consumer processes
  - the shape of the pipeline can be: a linear or multidimensional array, a tree, or a general graph with or without cycles
- bulk-synchronous parallel (BSP) model
  - synchronisation steps needed in a regular or irregular pattern
  - point-to-point synchronisation and/or global synchronisation