149 Fortran and HPF 6.2 Concept
High Performance Fortran: an extension of Fortran 90
SPMD (Single Program Multiple Data) model: each process operates on its own part of the data
HPF directives specify which processor gets which part of the data
Concurrency is defined by HPF directives on top of Fortran 90
HPF directives are written as comments: !HPF$ <directive>
Most directives are declarative, concerning the distribution of data between processes
The INDEPENDENT directive (and its NEW attribute) is an exception: it concerns the execution of statements
150 Fortran and HPF 6.3 PROCESSORS declaration
Determines a conceptual processor array (which need not reflect the actual hardware structure)
Example:
!HPF$ PROCESSORS, DIMENSION(4) :: P1
!HPF$ PROCESSORS, DIMENSION(2,2) :: P2
!HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3
The number of processors must remain the same throughout a program
If two different processor arrays have the same shape, one can assume that corresponding processors refer to the same physical process
!HPF$ PROCESSORS :: P defines a scalar processor
151 Fortran and HPF 6.4 DISTRIBUTE-directive
Says how to distribute data between processors. Example:
REAL, DIMENSION(50) :: A
REAL, DIMENSION(10,10) :: B, C, D
!HPF$ DISTRIBUTE (BLOCK) ONTO P1 :: A              ! 1D
!HPF$ DISTRIBUTE (CYCLIC,CYCLIC) ONTO P2 :: B, C   ! 2D
!HPF$ DISTRIBUTE D(BLOCK,*) ONTO P1                ! alternative syntax
The array rank and the processor-array rank need to conform
A(1:50) onto P1(1:4): each processor gets ceiling(50/4) = 13 elements, except P1(4), which gets the rest (11 elements)
* marks a dimension to be ignored => in this example D is distributed row-wise
CYCLIC: elements are dealt out one by one (like cards from a pile between players)
152 Fortran and HPF 6.4 DISTRIBUTE-directive
BLOCK: array elements are distributed block-wise (each block holding elements from a contiguous range)
Example (9-element array distributed over 3 processors):
CYCLIC: 1 2 3 1 2 3 1 2 3
BLOCK:  1 1 1 2 2 2 3 3 3
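The two mappings above boil down to simple index arithmetic. A small sketch, in Python rather than HPF since only the arithmetic matters (function names are ours; indices are 0-based internally):

```python
import math

def block_owner(i, n, p):
    """BLOCK: element i of an n-element array goes to processor i // ceil(n/p)."""
    return i // math.ceil(n / p)

def cyclic_owner(i, p):
    """CYCLIC: elements are dealt out round-robin, so element i goes to i mod p."""
    return i % p

# 9-element array on 3 processors (owners printed 1-based, as on the slide):
print([block_owner(i, 9, 3) + 1 for i in range(9)])   # 1 1 1 2 2 2 3 3 3
print([cyclic_owner(i, 3) + 1 for i in range(9)])     # 1 2 3 1 2 3 1 2 3
```

The same formula reproduces the earlier A(1:50) example: the block size is ceiling(50/4) = 13, so the last processor is left with only 11 elements.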
153 Fortran and HPF 6.4 DISTRIBUTE-directive
Example: DISTRIBUTE block-wise:
PROGRAM Chunks
  REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (BLOCK) ONTO P :: A
Example: DISTRIBUTE cyclically:
PROGRAM Round_Robin
  REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (CYCLIC) ONTO P :: A
154 Fortran and HPF 6.4 DISTRIBUTE-directive
Example: DISTRIBUTE with a 2-dimensional layout:
PROGRAM Skwiffy
  IMPLICIT NONE
  REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(2,2) :: P
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: A, B, C
  B = 1; C = 1; A = B + C
END PROGRAM Skwiffy
Block-wise in one dimension, cyclic in the other (owning processor of each element):
(1,1) (1,2) (1,1) (1,2)
(1,1) (1,2) (1,1) (1,2)
(2,1) (2,2) (2,1) (2,2)
(2,1) (2,2) (2,1) (2,2)
155 Fortran and HPF 6.4 DISTRIBUTE-directive
Example: DISTRIBUTE with *:
PROGRAM Skwiffy
  IMPLICIT NONE
  REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(4) :: Q
!HPF$ DISTRIBUTE (*,BLOCK) ONTO Q :: A, B, C
  B = 1; C = 1; A = B + C; PRINT *, A
END PROGRAM Skwiffy
Remarks about DISTRIBUTE:
Without ONTO, a default processor arrangement is used (given e.g. via program arguments)
BLOCK is better if the algorithm uses many neighbouring elements of the array => less communication, faster
CYCLIC is good for an even load distribution
Ignoring a dimension (with *) is good if calculations are done on whole rows or columns
156 Fortran and HPF 6.5 Distribution of allocatable arrays
All scalar variables are replicated by default; keeping them up to date is the compiler's task
Distribution of allocatable arrays is similar, except that distribution happens right after memory allocation:
REAL, ALLOCATABLE, DIMENSION(:,:) :: A
INTEGER :: ierr
!HPF$ PROCESSORS, DIMENSION(10,10) :: P
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: A
...
ALLOCATE(A(100,20), stat=ierr)  ! A automatically distributed here
                                ! block size in dim=1 is 10 elements
...
DEALLOCATE(A)
END
The block size is determined right after the ALLOCATE statement
157 Fortran and HPF 6.6 HPF rule: Owner Calculates
The processor owning the left-hand side of an assignment performs the computation. Example:
DO i = 1, n
  a(i-1) = b(i-6) / c(i+j) * a(i*i)
END DO
The calculation is performed by the process owning a(i-1)
NOTE: the rule is not obligatory but advisory to the compiler!
The compiler may (to reduce communication in the program as a whole) leave only the final assignment to the processor owning a(i-1)
158 Fortran and HPF 6.7 Scalar variables
REAL, DIMENSION(100,100) :: X
REAL :: Scal
!HPF$ DISTRIBUTE (BLOCK,BLOCK) :: X
...
Scal = X(i,j)
...
The owner of X(i,j) assigns the value to Scal and sends it to the other processes (replication)
159 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision
Example:
A(2:99) = (A(:98) + A(3:))/2   ! neighbour calculations
B(22:56) = 4.0*ATAN(1.0)       ! section of B calculated
C(:)     = SUM(D, DIM=1)       ! sum down a column
From the owner-calculates rule we get:
!HPF$ DISTRIBUTE (BLOCK) ONTO P :: A
!HPF$ DISTRIBUTE (CYCLIC) ONTO P :: B
!HPF$ DISTRIBUTE (BLOCK) ONTO P :: C    ! or (CYCLIC)
!HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: D  ! or (*,CYCLIC)
160 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision
Example (SSOR):
DO j = 2, n-1
  DO i = 2, n-1
    a(i,j) = (omega/4)*(a(i,j-1) + a(i,j+1) + &
             a(i-1,j) + a(i+1,j)) + (1-omega)*a(i,j)
  END DO
END DO
Best is a BLOCK distribution in both dimensions
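The update pattern can be checked with a small sequential sketch (plain Python, not HPF; the grid setup and omega value are illustrative assumptions). Each assignment reads only the four nearest neighbours, which is why BLOCK distribution in both dimensions keeps almost all accesses local:

```python
def ssor_sweep(a, omega):
    """One forward sweep of the slide's SSOR update over the interior points."""
    n = len(a)
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            a[i][j] = (omega / 4) * (a[i][j-1] + a[i][j+1] +
                                     a[i-1][j] + a[i+1][j]) + (1 - omega) * a[i][j]
    return a

# 5x5 grid: boundary fixed at 1, interior starts at 0
grid = [[1.0] * 5 for _ in range(5)]
for i in range(1, 4):
    for j in range(1, 4):
        grid[i][j] = 0.0
ssor_sweep(grid, 1.0)
print(grid[1][1])  # first interior update: 0.25*(1 + 0 + 1 + 0) = 0.5
```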
161 Fortran and HPF 6.9 HPF programming methodology
A balance must be found between concurrency and communication: the more processes, the more communication
Aim for balanced load, based on the owner-calculates rule and data locality
Use array syntax and the intrinsic functions on arrays
Avoid deprecated Fortran features (like assumed-size arrays and memory association)
It is easy to write a program in HPF but difficult to gain good efficiency
The HPF programming technique is roughly this:
1. Write a correctly working serial program; test and debug it
2. Add distribution directives, introducing as little communication as possible
162 Fortran and HPF 6.9 HPF programming methodology
It is advisable to add INDEPENDENT directives (where semantically valid) and to perform data alignment (with the ALIGN directive)
The first thing, however, is to choose a good parallel algorithm!
Issues reducing efficiency:
Complicated indexing (may confuse the compiler about where particular elements are situated)
Array syntax is very powerful, but overly complex constructions may make the compiler fail to optimise
Sequential loops should be left sequential (or replicated)
Redistribution of objects is time-consuming
163 Fortran and HPF 6.9 HPF programming methodology
Additional advice:
Use the possibilities of array syntax instead of loops
Use intrinsic functions where possible (for potentially better optimisation)
Before parallelising, consider whether the algorithm is parallelisable at all
Use the INDEPENDENT and ALIGN directives
Block lengths may be specified in BLOCK(m) and CYCLIC(m) if you are sure the algorithm needs them for efficiency; otherwise it is better to leave the decision to the compiler
164 Fortran and HPF 6.10 BLOCK(m) and CYCLIC(m)
Predefining the block size in general makes the code less efficient, due to more complex bookkeeping of ownership
REAL, DIMENSION(20) :: A, B
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE A(BLOCK(9)) ONTO P
!HPF$ DISTRIBUTE B(CYCLIC(2)) ONTO P
2D example:
REAL, DIMENSION(4,9) :: A
!HPF$ PROCESSORS, DIMENSION(2) :: P
!HPF$ DISTRIBUTE (BLOCK(3),CYCLIC(2)) ONTO P :: A
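The ownership arithmetic behind BLOCK(m) and CYCLIC(m) can be sketched as follows (Python, 0-based indices; a sketch of the mapping, not HPF itself):

```python
def block_m_owner(i, m):
    """BLOCK(m): processor k owns elements k*m .. k*m+m-1 (later procs may get none)."""
    return i // m

def cyclic_m_owner(i, m, p):
    """CYCLIC(m): blocks of m consecutive elements are dealt round-robin over p procs."""
    return (i // m) % p

# A(20) with BLOCK(9) on 4 processors: P0 and P1 get 9 elements, P2 gets 2, P3 none.
print([block_m_owner(i, 9) for i in range(20)])
# B(20) with CYCLIC(2) on 4 processors: pairs dealt out round-robin.
print([cyclic_m_owner(i, 2, 4) for i in range(20)])
```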
165 Fortran and HPF 6.11 Array alignment
Alignment improves data locality, minimises communication and distributes the workload
Simplest example: A = B + C
With a correct ALIGNment this needs no communication
Two ways:
!HPF$ ALIGN (:,:) WITH T(:,:) :: A, B, C
equivalent to:
!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ ALIGN B(:,:) WITH T(:,:)
!HPF$ ALIGN C(:,:) WITH T(:,:)
166 Fortran and HPF 6.11 Array alignment
Example:
REAL, DIMENSION(10) :: A, B, C
!HPF$ ALIGN (:) WITH C(:) :: A, B
Only C needs to appear in a DISTRIBUTE directive.
Example with a symbol instead of ":" :
!HPF$ ALIGN (j) WITH C(j) :: A, B
means: for each j, align arrays A and B with array C (":" instead of j is a stronger requirement)
167 Fortran and HPF 6.11 Array alignment
!HPF$ ALIGN A(i,j) WITH B(i,j)
Example 2 (2-dimensional case):
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,:) WITH B(:,:)
is a stronger requirement than the (i,j) form, since it assumes arrays of the same size
Good for performing operations like A = B + C + B*C (everything local!)
Transposed alignment:
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,:) WITH B(:,i)
(the second dimension of A and the first dimension of B must have the same length!)
168 Fortran and HPF 6.11 Array alignment
Similarly:
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,j) WITH B(j,:)
or:
REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,j) WITH B(j,i)
Good for performing the operation:
A = A + TRANSPOSE(B)*A   ! everything local!
169 Fortran and HPF 6.12 Strided Alignment
Example: align the elements of matrix D with every second element of E:
REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(1::2)
which could also be written:
!HPF$ ALIGN D(i) WITH E(i*2-1)
Operation: D = D + E(::2)   ! local
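That the two spellings name the same elements of E can be verified with a one-line index check (Python, 1-based positions; purely illustrative):

```python
# Elements of E selected by E(1::2) versus by the formula E(i*2-1), i = 1..5
positions_slice   = list(range(1, 11))[::2]           # every 2nd element from 1
positions_formula = [i * 2 - 1 for i in range(1, 6)]  # the i*2-1 spelling
print(positions_slice == positions_formula)  # True
```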
170 Fortran and HPF 6.12 Strided Alignment
Example: reverse strided alignment:
REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(UBOUND(E)::-2)
which could also be written:
!HPF$ ALIGN D(i) WITH E(2+UBOUND(E)-i*2)
171 Fortran and HPF 6.13 Example on Alignment
PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
  C = 1; D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty
Minimal (zero) communication is achieved with:
!HPF$ ALIGN C(:) WITH D(::2)
!HPF$ ALIGN E(:) WITH D(::4)
!HPF$ DISTRIBUTE (BLOCK) :: D
172 Fortran and HPF 6.14 Alignment with Allocatable Arrays
Alignment is performed together with memory allocation
An existing object cannot be aligned to an unallocated object
Example:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)
then
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(100), stat=ierr)
is OK;
ALLOCATE(B(100), A(100), stat=ierr)
is also OK (allocation proceeds from the left)
173 Fortran and HPF 6.14 Alignment with Allocatable Arrays
but
ALLOCATE(A(100), stat=ierr)
ALLOCATE(B(100), stat=ierr)
or
ALLOCATE(A(100), B(100), stat=ierr)
give an error!
A simple array cannot be aligned with an allocatable one:
REAL, DIMENSION(:) :: X
REAL, DIMENSION(:), ALLOCATABLE :: A
!HPF$ ALIGN X(:) WITH A(:)   ! ERROR
174 Fortran and HPF 6.14 Alignment with Allocatable Arrays
One more problem:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)
The ":" form says that A and B should have the same length (but they do not!)
But this is OK:
REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(i) WITH B(i)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)
Still, A cannot be larger than B.
175 Fortran and HPF 6.15 Dimension collapsing
One element can be aligned with one or several whole dimensions:
!HPF$ ALIGN (*,:) WITH Y(:) :: X
Each element of Y is aligned with a whole column of X (the first dimension of X is collapsed)
6.16 Dimension replication
!HPF$ ALIGN Y(:) WITH X(*,:)
Each processor getting an arbitrary column X(:,i) also gets a copy of Y(i)
176 Fortran and HPF 6.16 Dimension replication
Example: 2D Gauss elimination; kernel of the program:
...
DO j = i+1, n
  A(j,i) = A(j,i)/Swap(i)
  A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
  Y(j) = Y(j) - A(j,i)*Temp
END DO
Y(k) together with A(k,i) =>
!HPF$ ALIGN Y(:) WITH A(:,*)
Swap(k) together with A(i,k) =>
!HPF$ ALIGN Swap(:) WITH A(*,:)
No neighbouring elements of matrix A occur in the same expression => CYCLIC:
177 Fortran and HPF 6.16 Dimension replication
!HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)
178 Fortran and HPF 6.16 Dimension replication
Example: matrix multiplication
PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION(N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
  ! replicate copies of row A(i,:) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
  ! replicate copies of column B(:,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to the ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END
179 Fortran and HPF 6.17 HPF Intrinsic Functions
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE give information about the physical hardware, needed for portability:
!HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS P2(4,4,NUMBER_OF_PROCESSORS()/16)
!HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
!HPF$               0:NUMBER_OF_PROCESSORS(2)-1)
On a 2048-processor hypercube
PRINT *, PROCESSORS_SHAPE()
would return: 2 2 2 2 2 2 2 2 2 2 2
180 Fortran and HPF 6.18 HPF Template Syntax
TEMPLATE: a conceptual object that uses no RAM, defined statically (like a zero-sized array that is never assigned to)
Templates are declared, distributed and can be used to align arrays
Example:
REAL, DIMENSION(10) :: A, B
!HPF$ TEMPLATE, DIMENSION(10) :: T
!HPF$ DISTRIBUTE (BLOCK) :: T
!HPF$ ALIGN (:) WITH T(:) :: A, B
(here only T may be an argument to DISTRIBUTE)
Combined TEMPLATE directive:
181 Fortran and HPF 6.18 HPF Template Syntax
!HPF$ TEMPLATE, DIMENSION(100,100), &
!HPF$   DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)
which is equivalent to:
!HPF$ TEMPLATE, DIMENSION(100,100) :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO P
182 Fortran and HPF 6.18 HPF Template Syntax
Example:
PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
!HPF$ TEMPLATE, DIMENSION(8) :: T
!HPF$ ALIGN D(:) WITH T(:)
!HPF$ ALIGN C(:) WITH T(::2)
!HPF$ ALIGN E(:) WITH T(::4)
!HPF$ DISTRIBUTE (BLOCK) :: T
  C = 1; D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty
(similar to the strided-alignment example)
183 Fortran and HPF 6.18 HPF Template Syntax
More examples of using templates:
!HPF$ ALIGN A(:) WITH T1(:,*)
for each i, element A(i) is replicated along row T1(i,:)
!HPF$ ALIGN C(i,j) WITH T2(j,i)
the transpose of C is aligned with T2
!HPF$ ALIGN B(:,*) WITH T3(:)
for each i, row B(i,:) is aligned with template element T3(i)
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: T1, T2
!HPF$ DISTRIBUTE T1(CYCLIC,*) ONTO P
the rows of T1 are distributed cyclically
184 Fortran and HPF 6.19 FORALL
Syntax:
FORALL(<forall-triple-list>[,<scalar-mask>]) <assignment>
Example:
FORALL (i=1:n, j=1:m, A(i,j) .NE. 0) A(i,j) = 1/A(i,j)
185 Fortran and HPF 6.19 FORALL
Circumstances where Fortran 90 syntax is not enough, but FORALL makes it simple:
Index expressions:
FORALL (i=1:n, j=1:n, i /= j) A(i,j) = REAL(i+j)
Intrinsic or PURE functions (which have no side effects):
FORALL (i=1:n:3, j=1:n:5) A(i,j) = SIN(A(j,i))
Subindexing:
FORALL (i=1:n, j=1:n) A(VS(i),j) = i + VS(j)
186 Fortran and HPF 6.19 FORALL
Unusual array parts can be accessed:
FORALL (i=1:n) A(i,i) = B(i)      ! diagonal
...
DO j = 1, n
  FORALL (i=1:j) A(i,j) = B(i)    ! triangular
END DO
To parallelise, also add before the DO:
!HPF$ INDEPENDENT, NEW(i)
or write nested FORALL statements:
FORALL (j=1:n)
  FORALL (i=1:j) A(i,j) = B(i)    ! triangular
END FORALL
187 Fortran and HPF 6.19 FORALL
FORALL statement execution:
1. evaluation of the triple-list
2. evaluation of the scalar mask
3. for each .TRUE. mask element, evaluation of the right-hand-side value
4. assignment of the right-hand-side values to the left-hand side
In HPF there is synchronisation between the steps!
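Steps 3 and 4 are what distinguish FORALL from an ordinary DO loop: all right-hand sides are evaluated before any assignment happens. A sketch of this semantics (Python, 0-based; illustrative only):

```python
def forall_assign(a, indices, rhs):
    """FORALL semantics: evaluate every RHS first (step 3), then assign (step 4)."""
    values = [rhs(i) for i in indices]   # all RHS values from the old array
    for i, v in zip(indices, values):
        a[i] = v

a = [1, 2, 3, 4]
forall_assign(a, range(1, 4), lambda i: a[i - 1])  # like FORALL (i=2:4) a(i)=a(i-1)
print(a)  # [1, 1, 2, 3] -- a sequential DO loop would propagate a(1) everywhere
```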
188 Fortran and HPF 6.20 PURE-procedures
PURE REAL FUNCTION F(x,y)
PURE SUBROUTINE G(x,y,z)
Without side effects, i.e.:
no external I/O and no ALLOCATE (communication is still allowed)
do not change the global state of the program
no PAUSE or STOP
FUNCTION formal parameters must have the INTENT(IN) attribute
Intrinsic functions are PURE!
Can be used in FORALL and in PURE procedures
189 Fortran and HPF 6.20 PURE-procedures
Example (function):
PURE REAL FUNCTION F(x, y)
  IMPLICIT NONE
  REAL, INTENT(IN) :: x, y
  F = x*x + y*y + 2*x*y + ASIN(MIN(x/y, y/x))
END FUNCTION F
Example of usage:
FORALL (i=1:n, j=1:n) &
  A(i,j) = b(i) + F(1.0*i, 1.0*j)
190 Fortran and HPF 6.20 PURE-procedures
Example (subroutine):
PURE SUBROUTINE G(x, y, z)
  IMPLICIT NONE
  REAL, INTENT(OUT), DIMENSION(:) :: z
  REAL, INTENT(IN),  DIMENSION(:) :: x, y
  INTEGER :: i
  INTERFACE
    PURE REAL FUNCTION F(x, y)
      REAL, INTENT(IN) :: x, y
    END FUNCTION F
  END INTERFACE
  !...
  FORALL (i=1:SIZE(z)) z(i) = F(x(i), y(i))
END SUBROUTINE G
191 Fortran and HPF 6.20 PURE-procedures
MIMD example:
REAL FUNCTION F(x, i)   ! PURE
  IMPLICIT NONE
  REAL,    INTENT(IN) :: x   ! element
  INTEGER, INTENT(IN) :: i   ! index
  IF (x > 0.0) THEN
    F = x*x
  ELSEIF (i == 1 .OR. i == n) THEN
    F = 0.0
  ELSE
    F = x
  END IF
END FUNCTION F
192 Fortran and HPF 6.21 INDEPENDENT
Placed directly in front of a DO or FORALL:
!HPF$ INDEPENDENT
DO i = 1, n
  x(i) = i*2
END DO
In front of FORALL: no synchronisation is needed between right-hand-side evaluation and assignment
If an INDEPENDENT loop...
...assigns more than once to the same element, parallelisation is lost!
...includes an EXIT, STOP or PAUSE statement, the iterations must execute sequentially to be sure of ending at the right iteration
...has jumps out of the loop, or I/O => sequential execution
193 Fortran and HPF 6.21 INDEPENDENT
Independent:
!HPF$ INDEPENDENT
DO i = 1, n
  b(i) = b(i) + b(i)
END DO
Not independent:
DO i = 1, n
  b(i) = b(i+1) + b(i)
END DO
Not independent:
DO i = 1, n
  b(i) = b(i-1) + b(i)
END DO
194 Fortran and HPF 6.21 INDEPENDENT
This loop is independent:
!HPF$ INDEPENDENT
DO i = 1, n
  a(i) = b(i-1) + b(i)
END DO
Question to ask: does a later iteration depend on an earlier one?
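The dependence question can be made concrete: for b(i) = b(i-1) + b(i), executing the iterations in order gives a different result from evaluating all right-hand sides from the original values, so the loop carries a true dependence. A small check (Python, illustrative):

```python
b = [1, 1, 1, 1]

seq = b[:]                      # sequential DO: each iteration sees updated values
for i in range(1, 4):
    seq[i] = seq[i-1] + seq[i]  # produces the prefix sums

par = b[:]                      # order-free version: all RHS from original values
rhs = [b[i-1] + b[i] for i in range(1, 4)]
for i, v in zip(range(1, 4), rhs):
    par[i] = v

print(seq, par)  # the results differ, so the loop must NOT be marked INDEPENDENT
```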
195 Fortran and HPF 6.22 INDEPENDENT NEW command
Creates a private copy of the variable for each iteration (on each process)!
!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO
196 Fortran and HPF 6.22 INDEPENDENT NEW command
Rules for NEW variables:
cannot be used outside the loop without redefinition
cannot be used in FORALL
must not be pointers or formal parameters
cannot have the SAVE attribute
Disallowed:
!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO
k = s1 + s2   ! not allowed!
197 Fortran and HPF 6.23 EXTRINSIC
Example (only the outer loops can be executed independently):
!HPF$ INDEPENDENT, NEW(i2)
DO i1 = 1, n1
!HPF$ INDEPENDENT, NEW(i3)
  DO i2 = 1, n2
!HPF$ INDEPENDENT, NEW(i4)
    DO i3 = 1, n3
      DO i4 = 1, n4
        a(i1,i2,i3) = a(i1,i2,i3) &
                    + b(i1,i2,i4)*c(i2,i3,i4)
      END DO
    END DO
  END DO
END DO
6.23 EXTRINSIC
Used in INTERFACE blocks to declare that a routine does not belong to HPF
198 Fortran and HPF 6.23 EXTRINSIC 7 MPI See separate slides on course homepage!
199 // Alg.Design Part III Parallel Algorithms
8 Parallel Algorithm Design Principles
Identifying portions of the work that can be performed concurrently
Mapping the concurrent pieces of work onto multiple processes running in parallel
Distributing the input, output and intermediate data associated with the program
Managing access to data shared by multiple processes
Synchronising the processes at various stages of the parallel program execution
200 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs
Subdividing a computation into smaller components is called decomposition
Example: dense matrix-vector multiplication y = Ab, i.e. y[i] = sum_{j=1}^{n} A[i,j] * b[j]
[Figure: computational problem -> decomposition -> tasks ([1])]
201 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs
In the case of y = Ab, the calculation of each row of the result is an independent task ([1])
The tasks and their relative order are abstracted as a task-dependency graph:
a directed acyclic graph whose nodes are tasks
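The row-wise decomposition of y = Ab can be sketched directly (Python; each task closure could in principle run on a different process, since the tasks share no mutable state):

```python
def matvec_by_row_tasks(A, b):
    """Decompose y = A b into one independent task per row of A."""
    tasks = [lambda row=row: sum(a_ij * b_j for a_ij, b_j in zip(row, b))
             for row in A]
    return [task() for task in tasks]  # tasks are independent: any order works

A = [[1, 2], [3, 4]]
b = [1, 1]
print(matvec_by_row_tasks(A, b))  # [3, 7]
```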
202 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs
and whose directed edges are dependences (the graph can be disconnected and the edge set can be empty)
Mapping tasks onto processors
Processors vs. processes
203 // Alg.Design 8.2 Decomposition Techniques
Recursive decomposition
Example: Quicksort
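A sketch of recursive decomposition in Quicksort (Python): every partition step produces two subproblems that can be solved as independent tasks, giving a task tree:

```python
def quicksort(xs):
    """Each call partitions the data; the two recursive calls are independent tasks."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    left  = [x for x in rest if x <  pivot]   # independent subtask 1
    right = [x for x in rest if x >= pivot]   # independent subtask 2
    return quicksort(left) + [pivot] + quicksort(right)

print(quicksort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```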
204 // Alg.Design 8.2 Decomposition Techniques
Data decomposition: partitioning of input data, output data or intermediate results; the owner-calculates rule
Exploratory decomposition
Example: the 15-puzzle
Speculative decomposition
Hybrid decomposition
205 // Alg.Design 8.3 Tasks and Interactions
Task generation: static or dynamic
Task sizes: uniform or non-uniform; knowledge of task sizes; size of the data associated with tasks
206 // Alg.Design 8.3 Tasks and Interactions
Characteristics of inter-task interactions:
static versus dynamic
regular versus irregular
read-only versus read-write
one-way versus two-way
207 // Alg.Design 8.4 Mapping Techniques for Load Balancing
Static mapping
Mappings based on data partitioning
array distribution schemes: block distribution, cyclic and block-cyclic distribution, randomised block distribution
Graph partitioning
208 // Alg.Design 8.4 Mapping Techniques for Load Balancing
Dynamic mapping schemes
centralised schemes: master-slave; self-scheduling: single-task scheduling, chunk scheduling
distributed schemes: each process can send work to or receive work from any other process
  how are the sending and receiving processes paired?
  is the initiator the sender or the receiver?
  how much work is exchanged?
  when is the work transfer performed?
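A minimal centralised (master-slave) scheme with chunk self-scheduling can be sketched with threads standing in for slave processes (Python; the chunk size and the squaring task are illustrative assumptions):

```python
import queue
import threading

def run_pool(items, nworkers=3, chunk=2):
    """Master fills a shared queue with chunks; workers self-schedule from it."""
    q = queue.Queue()
    for i in range(0, len(items), chunk):
        q.put(items[i:i + chunk])         # master creates the work pool
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()     # self-scheduling: grab the next chunk
            except queue.Empty:
                return
            partial = [x * x for x in task]
            with lock:
                results.extend(partial)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sorted(results)                # sort: chunk completion order varies

print(run_pool([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```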
209 // Alg.Design 8.4 Mapping Techniques for Load Balancing
Methods for reducing interaction overheads:
maximise data locality: minimise the volume of data exchanged and the frequency of interactions
overlap computations with interactions
replicate data or computation
overlap interactions with other interactions
210 // Alg.Design 8.5 Parallel Algorithm Models
data-parallel model
task-graph model: typically the amount of data associated with the tasks is large relative to the amount of computation
work-pool (task-pool) model: the pool can be centralised or distributed, and statically or dynamically created
master-slave model
211 // Alg.Design 8.5 Parallel Algorithm Models
pipeline (or producer-consumer) model: stream parallelism, where a stream of data triggers computations; a pipeline is a chain of producer-consumer processes
the shape of a pipeline can be: a linear or multidimensional array, a tree, or a general graph with or without cycles
bulk-synchronous parallel (BSP) model: synchronisation steps are needed in a regular or irregular pattern; point-to-point and/or global synchronisation
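A linear pipeline can be sketched with Python generators, each stage consuming the stream produced by the previous one (in the real model the stages would be separate processes; the stage functions here are illustrative):

```python
def produce(n):
    """Source stage: emits the data stream."""
    for i in range(n):
        yield i

def square(stream):
    """Middle stage: consumes upstream items, produces squared items."""
    for x in stream:
        yield x * x

def add_one(stream):
    """Sink-side stage: consumes squared items, produces final results."""
    for x in stream:
        yield x + 1

# Chain the stages: each item flows through the pipeline as it is produced.
print(list(add_one(square(produce(4)))))  # [1, 2, 5, 10]
```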