Fortran and HPF

6.2 Concept

High Performance Fortran (HPF):
- an extension of Fortran90
- SPMD (Single Program Multiple Data) model: each process operates on its own part of the data
- HPF directives specify which processor gets which part of the data
- concurrency is defined by HPF directives on top of Fortran90
- HPF directives are written as comments: !HPF$ <directive>
- most directives are declarative, concerning the distribution of data between processes; the INDEPENDENT directive (and its NEW attribute) is the exception, applying to executable statements
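As a minimal sketch of these ideas (the array names and sizes are chosen for illustration, not from the slides), the following complete program distributes two arrays block-wise and lets each processor work on its own block; because the directives are comments, any Fortran compiler can also run it serially:

    PROGRAM hpf_hello
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 16
      REAL, DIMENSION(n) :: a, b
      INTEGER :: i
    !HPF$ PROCESSORS, DIMENSION(4) :: p
    !HPF$ DISTRIBUTE (BLOCK) ONTO p :: a, b
      b = (/ (REAL(i), i = 1, n) /)
      a = 2.0*b        ! array syntax: each processor updates its own block
      PRINT *, SUM(a)  ! reduction over the distributed array
    END PROGRAM hpf_hello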

6.3 PROCESSORS declaration

Declares a conceptual processor array (which need not reflect the actual hardware structure). Example:

    !HPF$ PROCESSORS, DIMENSION(4) :: P1
    !HPF$ PROCESSORS, DIMENSION(2,2) :: P2
    !HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3

- the number of processors needs to remain the same throughout a program
- if two different processor arrays have similar shapes, one can assume that corresponding processors map to the same physical processor
- !HPF$ PROCESSORS :: P defines a scalar processor

6.4 DISTRIBUTE directive

Says how to divide data between the processors. Example:

    REAL, DIMENSION(50)    :: A
    REAL, DIMENSION(10,10) :: B, C, D
    !HPF$ DISTRIBUTE (BLOCK) ONTO P1 :: A            ! 1D
    !HPF$ DISTRIBUTE (CYCLIC,CYCLIC) ONTO P2 :: B, C ! 2D
    !HPF$ DISTRIBUTE D(BLOCK,*) ONTO P1              ! alternative syntax

- the ranks of the array and of the processor array need to conform
- A(1:50) onto P1(1:4): each processor gets ceiling(50/4) = 13 elements, except P1(4), which gets the rest (11 elements)
- * marks a dimension to be ignored; in this example D is distributed row-wise
- CYCLIC: elements are dealt out one by one (like cards from a pile between players)

BLOCK: array elements are divided block-wise, each block holding elements from a contiguous range. Example (9-element array divided between 3 processors; the owning processor of each element):

    CYCLIC: 1 2 3 1 2 3 1 2 3
    BLOCK:  1 1 1 2 2 2 3 3 3
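The two mappings can also be computed directly. A hedged sketch (the formulas are the standard block/cyclic ownership rules; the program itself is illustrative and runs under any Fortran compiler):

    PROGRAM owners
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 9, p = 3
      INTEGER :: i, blk
      blk = (n + p - 1)/p          ! block size: ceiling(n/p), here 3
      DO i = 1, n
         PRINT '(A,I2,A,I2,A,I2)', 'element', i, &
              '  BLOCK -> P', (i-1)/blk + 1, &
              '  CYCLIC -> P', MOD(i-1, p) + 1
      END DO
    END PROGRAM owners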

Example, DISTRIBUTE block-wise:

    PROGRAM Chunks
      REAL, DIMENSION(20) :: A
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: A

Example, DISTRIBUTE cyclically:

    PROGRAM Round_Robin
      REAL, DIMENSION(20) :: A
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE (CYCLIC) ONTO P :: A

Example, DISTRIBUTE with a 2-dimensional layout:

    PROGRAM Skwiffy
      IMPLICIT NONE
      REAL, DIMENSION(4,4) :: A, B, C
    !HPF$ PROCESSORS, DIMENSION(2,2) :: P
    !HPF$ DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: A, B, C
      B = 1; C = 1; A = B + C
    END PROGRAM Skwiffy

Cyclic in one dimension, block-wise in the other (owning processor of each element):

    (1,1) (1,2) (1,1) (1,2)
    (1,1) (1,2) (1,1) (1,2)
    (2,1) (2,2) (2,1) (2,2)
    (2,1) (2,2) (2,1) (2,2)

Example, DISTRIBUTE with *:

    PROGRAM Skwiffy
      IMPLICIT NONE
      REAL, DIMENSION(4,4) :: A, B, C
    !HPF$ PROCESSORS, DIMENSION(4) :: Q
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO Q :: A, B, C
      B = 1; C = 1; A = B + C; PRINT *, A
    END PROGRAM Skwiffy

Remarks about DISTRIBUTE:
- without ONTO, a default processor arrangement is used (specified e.g. via program arguments)
- BLOCK is better if the algorithm uses many neighbouring elements of the array: less communication, faster
- CYCLIC is good for an even distribution of work
- ignoring a dimension (with *) is good if calculations are done with a whole row or column

Note: all scalar variables are by default replicated; keeping them up to date is the compiler's task.

6.5 Distribution of allocatable arrays

Distribution of allocatable arrays is similar, except that the division happens right after memory allocation:

    REAL, ALLOCATABLE, DIMENSION(:,:) :: A
    INTEGER :: ierr
    !HPF$ PROCESSORS, DIMENSION(10,10) :: P
    !HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: A
    ...
    ALLOCATE(A(100,20), stat=ierr)
    ! A automatically distributed here
    ! block size in dim=1 is 10 elements
    ...
    DEALLOCATE(A)
    END

The block size is determined right after the ALLOCATE statement.

6.6 HPF rule: Owner Calculates

The processor owning the left-hand side of an assignment performs the computation. Example:

    DO i = 1,n
      a(i-1) = b(i-6)/c(i+j)*a(i*i)
    END DO

The calculation is performed by the process owning a(i-1). NOTE that the rule is not obligatory but advisory to the compiler: the compiler may arrange the computation differently (to reduce communication in the program as a whole), leaving only the assignment to the processor owning a(i-1).
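An illustrative sketch of the rule (the distributions and names here are assumptions, not from the slides): mixing distributions makes the rule visible, since the owner of each a(i) computes and off-processor b values have to be fetched:

    PROGRAM owner_rule
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 100
      REAL, DIMENSION(n) :: a, b
      INTEGER :: i
    !HPF$ PROCESSORS, DIMENSION(4) :: p
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p
    !HPF$ DISTRIBUTE b(CYCLIC) ONTO p
      b = 1.0
      DO i = 1, n-1
         a(i) = b(i+1)   ! owner of a(i) computes; b(i+1) generally
      END DO             ! resides on another processor and is fetched
      a(n) = 0.0
      PRINT *, SUM(a)
    END PROGRAM owner_rule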

6.7 Scalar variables

    REAL, DIMENSION(100,100) :: X
    REAL :: Scal
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) :: X
    ...
    Scal = X(i,j)
    ...

The owner of X(i,j) assigns the value to Scal and sends it to the other processes (replication).

6.8 Examples of good DISTRIBUTE subdivision

Example:

    A(2:99) = (A(:98) + A(3:))/2   ! neighbour calculations
    B(22:56) = 4.0*ATAN(1.0)       ! section of B calculated
    C(:) = SUM(D, DIM=1)           ! sum down a column

From the owner-calculates rule we get:

    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: A
    !HPF$ DISTRIBUTE (CYCLIC) ONTO P :: B
    !HPF$ DISTRIBUTE (BLOCK) ONTO P :: C     ! or (CYCLIC)
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: D   ! or (*,CYCLIC)

Example (SSOR):

    DO j = 2,n-1
      DO i = 2,n-1
        a(i,j) = (omega/4)*(a(i,j-1) + a(i,j+1) + &
                 a(i-1,j) + a(i+1,j)) + (1-omega)*a(i,j)
      END DO
    END DO

Best is BLOCK distribution in both dimensions.
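A complete hedged sketch of this kernel with the suggested distribution (the grid size, omega and the processor shape are assumptions for illustration; it also runs serially under any Fortran compiler):

    PROGRAM ssor_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 64
      REAL, PARAMETER :: omega = 1.5
      REAL, DIMENSION(n,n) :: a
      INTEGER :: i, j
    !HPF$ PROCESSORS, DIMENSION(2,2) :: p
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO p :: a   ! stencil access stays mostly local
      a = 1.0
      DO j = 2, n-1
         DO i = 2, n-1
            a(i,j) = (omega/4)*(a(i,j-1) + a(i,j+1) + &
                     a(i-1,j) + a(i+1,j)) + (1-omega)*a(i,j)
         END DO
      END DO
      PRINT *, a(n/2, n/2)
    END PROGRAM ssor_sketch

With (BLOCK,BLOCK) each processor only needs a halo of its block from its neighbours; a CYCLIC distribution would turn every stencil access into communication.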

6.9 HPF programming methodology

- find a balance between concurrency and communication: the more processes, the more communication
- aim for balanced load, based on the owner-calculates rule and data locality
- use array syntax and the intrinsic functions on arrays
- avoid deprecated Fortran features (like assumed-size arrays and memory association)

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:
1. Write a correctly working serial program; test and debug it.
2. Add distribution directives, introducing as little communication as possible.

3. Add INDEPENDENT directives (where semantically admissible).
4. Perform data alignment (with the ALIGN directive).

But the first thing is to choose a good parallel algorithm!

Issues reducing efficiency:
- difficult indexing (may confuse the compiler about where particular elements are situated)
- array syntax is very powerful, but overly complex constructions may make the compiler fail to optimise
- sequential cycles are left sequential (or replicated)
- redelivery of objects is time-consuming

Additional advice:
- use the possibilities of array syntax instead of cycles
- use intrinsic functions where possible (for possibly better optimisation)
- before parallelisation, consider whether the algorithm is parallelisable at all
- use the INDEPENDENT and ALIGN directives
- explicit block sizes can be given in BLOCK(m) and CYCLIC(m) if you are sure the algorithm's efficiency needs them; otherwise it is better to leave the decision to the compiler

6.10 BLOCK(m) and CYCLIC(m)

Predefining the block size in general makes the code less efficient, due to more complex bookkeeping of ownership:

    REAL, DIMENSION(20) :: A, B
    !HPF$ PROCESSORS, DIMENSION(4) :: P
    !HPF$ DISTRIBUTE A(BLOCK(9)) ONTO P
    !HPF$ DISTRIBUTE B(CYCLIC(2)) ONTO P

2D example:

    REAL, DIMENSION(4,9) :: A
    !HPF$ PROCESSORS, DIMENSION(2,2) :: P
    !HPF$ DISTRIBUTE (BLOCK(3),CYCLIC(2)) ONTO P :: A

6.11 Array alignment

Alignment:
- improves data locality
- minimises communication
- distributes workload

Simplest example: A = B + C. With the correct ALIGNment this involves no communication. Two ways to write it:

    !HPF$ ALIGN (:,:) WITH T(:,:) :: A, B, C

is equivalent to:

    !HPF$ ALIGN A(:,:) WITH T(:,:)
    !HPF$ ALIGN B(:,:) WITH T(:,:)
    !HPF$ ALIGN C(:,:) WITH T(:,:)

Example:

    REAL, DIMENSION(10) :: A, B, C
    !HPF$ ALIGN (:) WITH C(:) :: A, B

Only C needs to appear in a DISTRIBUTE directive. Example with a symbol instead of ::

    !HPF$ ALIGN (j) WITH C(j) :: A, B

means: for each j, align arrays A and B with array C. (Using : instead of j imposes a stronger requirement.)

Example 2 (2-dimensional case):

    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(:,:) WITH B(:,:)

is a stronger requirement than

    !HPF$ ALIGN A(i,j) WITH B(i,j)

which does not assume arrays of the same size. Good for performing operations like A = B + C + B*C (everything local!).

Transposed alignment:

    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(i,:) WITH B(:,i)

(the second dimension of A and the first dimension of B must have the same length!)

Similarly:

    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(:,j) WITH B(j,:)

or:

    REAL, DIMENSION(10,10) :: A, B
    !HPF$ ALIGN A(i,j) WITH B(j,i)

is good for performing the operation A = A + TRANSPOSE(B)*A  ! everything local!

6.12 Strided Alignment

Example: align the elements of matrix D with every second element of E:

    REAL, DIMENSION(5)  :: D
    REAL, DIMENSION(10) :: E
    !HPF$ ALIGN D(:) WITH E(1::2)

which could also be written:

    !HPF$ ALIGN D(i) WITH E(i*2-1)

Operation: D = D + E(::2)  ! local

Example: reverse strided alignment:

    REAL, DIMENSION(5)  :: D
    REAL, DIMENSION(10) :: E
    !HPF$ ALIGN D(:) WITH E(UBOUND(E)::-2)

which could also be written:

    !HPF$ ALIGN D(i) WITH E(2+UBOUND(E)-i*2)

6.13 Example on Alignment

    PROGRAM Warty
      IMPLICIT NONE
      REAL, DIMENSION(4) :: C
      REAL, DIMENSION(8) :: D
      REAL, DIMENSION(2) :: E
      C = 1; D = 2
      E = D(::4) + C(::2)
    END PROGRAM Warty

Minimal (zero) communication is achieved with:

    !HPF$ ALIGN C(:) WITH D(::2)
    !HPF$ ALIGN E(:) WITH D(::4)
    !HPF$ DISTRIBUTE (BLOCK) :: D

6.14 Alignment with Allocatable Arrays

Alignment is performed together with memory allocation; an existing object cannot be aligned to an unallocated object. Example:

    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(:) WITH B(:)

Then

    ALLOCATE (B(100), stat=ierr)
    ALLOCATE (A(100), stat=ierr)

is OK;

    ALLOCATE (B(100), A(100), stat=ierr)

is also OK (allocation proceeds from the left).

But

    ALLOCATE (A(100), stat=ierr)
    ALLOCATE (B(100), stat=ierr)

or

    ALLOCATE (A(100), B(100), stat=ierr)

give an error! A simple array cannot be aligned with an allocatable one:

    REAL, DIMENSION(:) :: X
    REAL, DIMENSION(:), ALLOCATABLE :: A
    !HPF$ ALIGN X(:) WITH A(:)   ! ERROR

One more problem:

    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(:) WITH B(:)
    ALLOCATE(B(100), stat=ierr)
    ALLOCATE(A(50), stat=ierr)

The : says that A and B should have the same length (but they do not!). This, however, is OK:

    REAL, DIMENSION(:), ALLOCATABLE :: A, B
    !HPF$ ALIGN A(i) WITH B(i)
    ALLOCATE(B(100), stat=ierr)
    ALLOCATE(A(50), stat=ierr)

Still, A cannot be larger than B.

6.15 Dimension collapsing

With one element it is possible to align one or several dimensions:

    !HPF$ ALIGN (*,:) WITH Y(:) :: X

Each element of Y is aligned with a column of X (the first dimension of matrix X is collapsed).

6.16 Dimension replication

    !HPF$ ALIGN Y(:) WITH X(*,:)

Each processor owning part of column X(:,i) also gets a copy of Y(i).

Example: 2D Gauss elimination; the kernel of the program:

    ...
    DO j = i+1, n
      A(j,i) = A(j,i)/Swap(i)
      A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
      Y(j) = Y(j) - A(j,i)*Temp
    END DO

Y(k) belongs together with A(k,i):

    !HPF$ ALIGN Y(:) WITH A(:,*)

Swap(k) belongs together with A(i,k):

    !HPF$ ALIGN Swap(:) WITH A(*,:)

No neighbouring elements of matrix A appear in the same expression, hence CYCLIC:

    !HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)

Example: matrix multiplication:

    PROGRAM ABmult
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER, DIMENSION(N,N) :: A, B, C
      INTEGER :: i, j
    !HPF$ PROCESSORS square(2,2)
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    ! replicate copies of row A(i,:) onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    ! replicate copies of column B(:,j) onto processors which compute C(i,j)
      A = 1
      B = 2
      C = 0
      DO i = 1, N
        DO j = 1, N
          ! All the work is local due to the ALIGNs
          C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
        END DO
      END DO
      WRITE(*,*) C
    END

6.17 HPF Intrinsic Functions

NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE give information about the physical hardware, needed for portability:

    !HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
    !HPF$ PROCESSORS P2(4,4,NUMBER_OF_PROCESSORS()/16)
    !HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
    !HPF$               0:NUMBER_OF_PROCESSORS(2)-1)

On a 2048-processor hypercube,

    PRINT *, PROCESSORS_SHAPE()

would return: 2 2 2 2 2 2 2 2 2 2 2
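A hedged sketch of a portable query pattern (the report format is an assumption; these are HPF system inquiry functions, so an HPF compiler is required):

    PROGRAM query
      IMPLICIT NONE
    !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
      PRINT *, 'number of processors:', NUMBER_OF_PROCESSORS()
      PRINT *, 'processor shape     :', PROCESSORS_SHAPE()
    END PROGRAM query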

6.18 HPF Template Syntax

A TEMPLATE is a conceptual object that uses no RAM and is defined statically (like a 0-sized array that is never assigned to). Templates:
- are declared
- are distributed
- can be used to align arrays

Example:

    REAL, DIMENSION(10) :: A, B
    !HPF$ TEMPLATE, DIMENSION(10) :: T
    !HPF$ DISTRIBUTE (BLOCK) :: T
    !HPF$ ALIGN (:) WITH T(:) :: A, B

(here only T may be an argument to DISTRIBUTE)

Combined TEMPLATE directive:

    !HPF$ TEMPLATE, DIMENSION(100,100), &
    !HPF$   DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: T
    !HPF$ ALIGN A(:,:) WITH T(:,:)

which is equivalent to:

    !HPF$ TEMPLATE, DIMENSION(100,100) :: T
    !HPF$ ALIGN A(:,:) WITH T(:,:)
    !HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO P

Example:

    PROGRAM Warty
      IMPLICIT NONE
      REAL, DIMENSION(4) :: C
      REAL, DIMENSION(8) :: D
      REAL, DIMENSION(2) :: E
    !HPF$ TEMPLATE, DIMENSION(8) :: T
    !HPF$ ALIGN D(:) WITH T(:)
    !HPF$ ALIGN C(:) WITH T(::2)
    !HPF$ ALIGN E(:) WITH T(::4)
    !HPF$ DISTRIBUTE (BLOCK) :: T
      C = 1; D = 2
      E = D(::4) + C(::2)
    END PROGRAM Warty

(similar to the earlier example of alignment with stride)

More examples of using templates:

    !HPF$ ALIGN A(:) WITH T1(:,*)

for each i, element A(i) is replicated along row T1(i,:);

    !HPF$ ALIGN C(i,j) WITH T2(j,i)

transposed C is aligned with T2;

    !HPF$ ALIGN B(:,*) WITH T3(:)

for each i, row B(i,:) is aligned with template element T3(i);

    !HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: T1, T2
    !HPF$ DISTRIBUTE T1(CYCLIC,*) ONTO P

the rows of T1 are distributed cyclically.

6.19 FORALL

Syntax:

    FORALL(<forall-triple-list>[,<scalar-mask>]) <assignment>

Example:

    FORALL (i=1:n, j=1:m, A(i,j).NE.0) A(i,j) = 1/A(i,j)

Circumstances where Fortran90 syntax is not enough but FORALL makes it simple:

index expressions:

    FORALL (i=1:n, j=1:n, i/=j) A(i,j) = REAL(i+j)

intrinsic or PURE functions (which have no side effects):

    FORALL (i=1:n:3, j=1:n:5) A(i,j) = SIN(A(j,i))

subindexing:

    FORALL (i=1:n, j=1:n) A(VS(i),j) = i+VS(j)

Unusual parts of arrays can be accessed:

    FORALL (i=1:n) A(i,i) = B(i)     ! diagonal
    ...
    DO j = 1, n
      FORALL (i=1:j) A(i,j) = B(i)   ! triangular
    END DO

To parallelise the DO as well, add before it:

    !HPF$ INDEPENDENT, NEW(i)

or write nested FORALL statements:

    FORALL (j=1:n)
      FORALL (i=1:j) A(i,j) = B(i)   ! triangular
    END FORALL

FORALL statement execution:
1. evaluation of the triple-list
2. evaluation of the scalar mask
3. for each .TRUE. mask element, computation of the right-hand-side value
4. assignment of the right-hand-side values to the left-hand sides

In HPF there is synchronisation between the steps!
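These semantics differ from an ordinary DO loop: all right-hand sides are evaluated before any assignment happens. A small sketch (array values chosen for illustration) that any Fortran compiler can run:

    PROGRAM forall_vs_do
      IMPLICIT NONE
      INTEGER :: i
      INTEGER, DIMENSION(5) :: a, b
      a = (/ 1, 2, 3, 4, 5 /)
      b = a
      FORALL (i=2:5) a(i) = a(i-1)   ! all RHS values read before any write
      DO i = 2, 5
         b(i) = b(i-1)               ! each iteration sees the previous write
      END DO
      PRINT *, a   ! 1 1 2 3 4
      PRINT *, b   ! 1 1 1 1 1
    END PROGRAM forall_vs_do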

6.20 PURE procedures

    PURE REAL FUNCTION F(x, y)
    PURE SUBROUTINE G(x, y, z)

PURE procedures are without side effects, i.e.:
- no external I/O and no ALLOCATE (communication is still allowed)
- they do not change the global state of the program
- the intrinsic functions are all PURE
- they can be used in FORALL and in other PURE procedures
- no PAUSE or STOP
- FUNCTION formal parameters must have the attribute INTENT(IN)

Example (function):

    PURE REAL FUNCTION F(x, y)
      IMPLICIT NONE
      REAL, INTENT(IN) :: x, y
      F = x*x + y*y + 2*x*y + ASIN(MIN(x/y, y/x))
    END FUNCTION F

Example of usage:

    FORALL (i=1:n, j=1:n) &
      A(i,j) = b(i) + F(1.0*i, 1.0*j)

Example (subroutine):

    PURE SUBROUTINE G(x, y, z)
      IMPLICIT NONE
      REAL, INTENT(OUT), DIMENSION(:) :: z
      REAL, INTENT(IN), DIMENSION(:) :: x, y
      INTEGER i
      INTERFACE
        PURE REAL FUNCTION F(x, y)
          REAL, INTENT(IN) :: x, y
        END FUNCTION F
      END INTERFACE
      ! ...
      FORALL(i=1:SIZE(z)) z(i) = F(x(i), y(i))
    END SUBROUTINE G

MIMD example:

    REAL FUNCTION F(x, i)   ! PURE
      IMPLICIT NONE
      REAL, INTENT(IN) :: x      ! element
      INTEGER, INTENT(IN) :: i   ! index
      IF (x > 0.0) THEN
        F = x*x
      ELSE IF (i == 1 .OR. i == n) THEN
        F = 0.0
      ELSE
        F = x
      END IF
    END FUNCTION F

6.21 INDEPENDENT

Placed directly in front of a DO or FORALL:

    !HPF$ INDEPENDENT
    DO i = 1,n
      x(i) = i*2
    END DO

In front of FORALL: no synchronisation is needed between right-hand-side evaluation and assignment.

If an INDEPENDENT loop...
- ...assigns more than once to the same element, parallelisation is lost!
- ...includes an EXIT, STOP or PAUSE statement, the iterations need to execute sequentially to be sure of ending in the right iteration
- ...has jumps out of the loop, or I/O, it is executed sequentially

This is independent:

    !HPF$ INDEPENDENT
    DO i = 1, n
      b(i) = b(i) + b(i)
    END DO

Not independent:

    DO i = 1, n
      b(i) = b(i+1) + b(i)
    END DO

Not independent:

    DO i = 1, n
      b(i) = b(i-1) + b(i)
    END DO

This is an independent loop:

    !HPF$ INDEPENDENT
    DO i = 1, n
      a(i) = b(i-1) + b(i)
    END DO

The question to ask: does a later iteration depend on an earlier one?

6.22 INDEPENDENT NEW command

NEW creates an independent instance of a variable on each process:

    !HPF$ INDEPENDENT, NEW(s1, s2)
    DO i = 1,n
      s1 = SIN(a(i))
      s2 = COS(a(i))
      a(i) = s1*s1 - s2*s2
    END DO

Rules for NEW variables:
- cannot be used outside the cycle without redefinition
- cannot be used in FORALL
- must not be pointers or formal parameters
- cannot have the SAVE attribute

Disallowed:

    !HPF$ INDEPENDENT, NEW(s1, s2)
    DO i = 1,n
      s1 = SIN(a(i))
      s2 = COS(a(i))
      a(i) = s1*s1 - s2*s2
    END DO
    k = s1 + s2   ! not allowed!

Example (only the outer cycles can be executed independently):

    !HPF$ INDEPENDENT, NEW(i2)
    DO i1 = 1, n1
    !HPF$ INDEPENDENT, NEW(i3)
      DO i2 = 1, n2
    !HPF$ INDEPENDENT, NEW(i4)
        DO i3 = 1, n3
          DO i4 = 1, n4
            a(i1,i2,i3) = a(i1,i2,i3) &
                        + b(i1,i2,i4)*c(i2,i3,i4)
          END DO
        END DO
      END DO
    END DO

6.23 EXTRINSIC

Used in INTERFACE blocks to declare that a routine does not belong to HPF.

7 MPI

See the separate slides on the course homepage!

Part III: Parallel Algorithms

8 Parallel Algorithm Design

Principles:
- identifying portions of the work that can be performed concurrently
- mapping the concurrent pieces of work onto multiple processes running in parallel
- distributing the input, output and intermediate data associated with the program
- managing access to data shared by multiple processes
- synchronising the processes at various stages of the parallel program execution

8.1 Decomposition, Tasks and Dependency Graphs

Subdividing calculations into smaller components is called decomposition. Example: dense matrix-vector multiplication y = Ab,

    y[i] = sum_{j=1..n} A[i,j]*b[j]

Computational problem -> decomposition -> tasks ([1])
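A hedged sketch of this decomposition in the HPF terms of Part II (the sizes, names and chosen distribution are assumptions): each y(i) is a separate task over one row of A, so the rows can be distributed and the i loop marked independent:

    PROGRAM matvec_tasks
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 4
      REAL, DIMENSION(n,n) :: a
      REAL, DIMENSION(n) :: b, y
      INTEGER :: i
    !HPF$ DISTRIBUTE (BLOCK,*) :: a     ! whole rows stay on one processor
    !HPF$ ALIGN y(:) WITH a(:,*)        ! y(i) lives with row i
      a = 1.0
      b = 2.0                           ! b is replicated by default
    !HPF$ INDEPENDENT
      DO i = 1, n
         y(i) = DOT_PRODUCT(a(i,:), b)  ! task i: one row, no dependences
      END DO
      PRINT *, y
    END PROGRAM matvec_tasks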

In the case of y = Ab, the calculation for each row is an independent task.

An abstraction of the tasks and their relative order ([1]): the task dependency graph,
- a directed acyclic graph
- nodes = tasks

- directed edges = dependences
- the graph can be disconnected, and the edge set can even be empty

Mapping the tasks onto processors; processors vs. processes.

8.2 Decomposition Techniques

Recursive decomposition. Example: quicksort, where each recursive call is an independent subtask (see the sketch below).
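A minimal runnable sketch of recursive decomposition (plain Fortran; the pivot choice and input values are illustrative): after partitioning, the two recursive calls touch disjoint subarrays and could be executed concurrently as separate tasks:

    PROGRAM qsort_demo
      IMPLICIT NONE
      REAL, DIMENSION(8) :: a = (/ 5., 2., 8., 1., 9., 3., 7., 4. /)
      CALL quicksort(a)
      PRINT *, a
    CONTAINS
      RECURSIVE SUBROUTINE quicksort(a)
        REAL, INTENT(INOUT), DIMENSION(:) :: a
        REAL :: pivot, tmp
        INTEGER :: i, p
        IF (SIZE(a) <= 1) RETURN
        pivot = a(SIZE(a))             ! last element as pivot
        p = 0
        DO i = 1, SIZE(a)              ! partition: a(1:p) <= pivot
           IF (a(i) <= pivot) THEN
              p = p + 1
              tmp = a(p); a(p) = a(i); a(i) = tmp
           END IF
        END DO
        CALL quicksort(a(:p-1))        ! independent subtask 1
        CALL quicksort(a(p+1:))        ! independent subtask 2
      END SUBROUTINE quicksort
    END PROGRAM qsort_demo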

Data decomposition:
- based on input data, output data, or intermediate results
- owner-calculates rule

Exploratory decomposition. Example: the 15-puzzle game.

Speculative decomposition. Hybrid decomposition.

8.3 Tasks and Interactions

Task generation:
- static
- dynamic

Task sizes:
- uniform
- non-uniform
- knowledge of task sizes
- size of the data associated with the tasks

Characteristics of inter-task interactions:
- static versus dynamic
- regular versus irregular
- read-only versus read-write
- one-way versus two-way

8.4 Mapping Techniques for Load Balancing

Static mapping:
- mappings based on data partitioning
- array distribution schemes: block distribution, cyclic and block-cyclic distribution, randomised block distribution (block versus cyclic is compared in the sketch below)
- graph partitioning
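A small plain-Fortran sketch (the work model and sizes are assumptions) of why a cyclic distribution balances a triangular workload, where processing row i costs i units, better than a block distribution:

    PROGRAM balance
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 12, p = 4
      INTEGER :: i, blk
      INTEGER, DIMENSION(p) :: work_blk, work_cyc
      blk = (n + p - 1)/p                ! block size: ceiling(n/p)
      work_blk = 0; work_cyc = 0
      DO i = 1, n                        ! row i costs i units of work
         work_blk((i-1)/blk + 1) = work_blk((i-1)/blk + 1) + i
         work_cyc(MOD(i-1,p) + 1) = work_cyc(MOD(i-1,p) + 1) + i
      END DO
      PRINT *, 'BLOCK :', work_blk      ! 6 15 24 33  (imbalanced)
      PRINT *, 'CYCLIC:', work_cyc      ! 15 18 21 24 (more even)
    END PROGRAM balance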

Dynamic mapping schemes:
- centralised schemes: master-slave; self-scheduling; single-task scheduling; chunk scheduling
- distributed schemes: each process can send work to / receive work from any other process. Questions: how are sending and receiving processes paired? is the initiator the sender or the receiver? how much work is exchanged? when is the work transfer performed?

Methods for reducing interaction overheads:
- maximising data locality: minimise the volume of data exchange, minimise the frequency of interactions
- overlapping computations with interactions
- replicating data or computation
- overlapping interactions with other interactions

8.5 Parallel Algorithm Models

- data-parallel model
- task graph model: typically the amount of data is relatively large compared to the amount of computation
- work pool (task pool) model: the pool can be centralised or distributed, created statically or dynamically
- master-slave model

- pipeline (producer-consumer) model: stream parallelism, where a stream of data triggers computations; a pipeline is a chain of producer-consumer processes; the shape of the pipeline can be a linear or multidimensional array, a tree, or a general graph with or without cycles
- bulk-synchronous parallel (BSP) model: synchronisation steps are needed in a regular or irregular pattern; point-to-point synchronisation and/or global synchronisation
