HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90
|
|
- Carmella Lang
- 6 years ago
- Views:
Transcription
1 149 Fortran and HPF 6.2 Concept High Performance Fortran 6.2 Concept Fortran90 extension SPMD (Single Program Multiple Data) model each process operates with its own part of data HPF commands specify which processor gets which part of the data Concurrency is defined by HPF commands based on Fortran90 HPF directives as comments:! HPF$ <directive> Most of the commands of declatrative type concerning data distribution between processes Command INDEPENDENT is an exception (and its attributel NEW) which execute commands
2 150 Fortran and HPF 6.3 PROCESSORS declaration 6.3 PROCESSORS declaration determines conceptual processor array (which need not reflect the actual hardware structure) Example:! HPF$ PROCESSORS, DIMENSION( 4 ) : : P1! HPF$ PROCESSORS, DIMENSION( 2, 2 ) : : P2! HPF$ PROCESSORS, DIMENSION( 2, 1, 2 ) : : P3 Number of processors needs to remain the same in a program If 2 different processor arrays similar, one can assume thet corresponding processors have the same physical process!hpf$ PROCESSORS :: P defines a scalar processor
3 151 Fortran and HPF 6.4 DISTRIBUTE-directive 6.4 DISTRIBUTE-directive says how to deliver data between processors. Example: REAL, DIMENSION(50) : : A REAL, DIMENSION(10,10) : : B, C, D! HPF$ DISTRIBUTE (BLOCK) ONTO P1 : : A!1 D! HPF$ DISTRIBUTE (CYCLIC, CYCLIC) ONTO P2 : : B,C!2 D! HPF$ DISTRIBUTE D(BLOCK, ) ONTO P1! a l t e r n a t i v e syntax array and processor array rank need to confirm A(1:50) : P1(1:4) each processor gets 50/4 = 13 elements except P1(4) which gets the rest (11 elements) * dimension to be ignored => in this example D distributed row-wise CYCLIC elements delivered one-by-one (as cards from pile between players)
4 152 Fortran and HPF 6.4 DISTRIBUTE-directive BLOCK - array elements are delivered block-wise (each block having elements from close range) Example (9 -element array delivered to 3 processors) : CYCLIC: BLOCK:
5 153 Fortran and HPF 6.4 DISTRIBUTE-directive Example: DISTRIBUTE block-wise: PROGRAM Chunks REAL, DIMENSION(20) : : A! HPF$ PROCESSORS, DIMENSION( 4 ) : : P! HPF$ DISTRIBUTE (BLOCK) ONTO P : : A Example: DISTRIBUTE cyclically: PROGRAM Round_Robin REAL, DIMENSION(20) : : A! HPF$ PROCESSORS, DIMENSION( 4 ) : : P! HPF$ DISTRIBUTE ( CYCLIC) ONTO P : : A
6 154 Fortran and HPF 6.4 DISTRIBUTE-directive Example: DISTRIBUTE 2-dimensional layout: PROGRAM Skwiffy IMPLICIT NONE REAL, DIMENSION( 4, 4 ) : : A, B, C! HPF$ PROCESSORS, DIMENSION( 2, 2 ) : : P! HPF$ DISTRIBUTE (BLOCK, CYCLIC) ONTO P : : A, B, C B = 1; C = 1; A = B + C END PROGRAM Skwiffy cyclic in one dimension; block-wise in other: (11)(12)(11)(12) (11)(12)(11)(12) (21)(22)(21)(22) (21)(22)(21)(22)
7 155 Fortran and HPF 6.4 DISTRIBUTE-directive Example: DISTRIBUTE *: PROGRAM Skwiffy IMPLICIT NONE REAL, DIMENSION( 4, 4 ) : : A, B, C! HPF$ PROCESSORS, DIMENSION( 4 ) : : Q! HPF$ DISTRIBUTE (,BLOCK) ONTO Q : : A, B, C B = 1; C = 1; A = B + C; PRINT, A END PROGRAM Skwiffy Remarks about DISTRIBUTE Without ONTO default structure (given with program arguments, e.g.) BLOCK better if algorithm uses a lot of neighbouring elements in the array => less communication, faster CYCLIC good for even distribution ignoring a dimension (with *) good if calculations due with a whole row or column
8 156 Fortran and HPF 6.5 Distribution of allocatable arrays all scalar variables are by default replicated; updating compiler task 6.5 Distribution of allocatable arrays is similar, except that delivery happens right after the memory allocation REAL, ALLOCATABLE, DIMENSION ( :, : ) : : A INTEGER : : i e r r! HPF$ PROCESSORS, DIMENSION(10,10) : : P! HPF$ DISTRIBUTE (BLOCK, CYCLIC) : : A... ALLOCATE(A(100,20), stat= i e r r )! A a u t o m a t i c a l l y d i s t r i b u t e d here! block size i n dim=1 i s 10 elements... DEALLOCATE(A) END blocksize is determined right after ALLOCATE command
9 157 Fortran and HPF 6.6 HPF rule: Owner Calcuates 6.6 HPF rule: Owner Calcuates Processor on the left of assignment performs the computations Example: DO i = 1,n a ( i 1) = b ( i 6) / c ( i + j ) a ( i i ) END DO calculation performed by process owning a(i-1) NOTE that the rule is not obligatory but advisable to the compiler! compiler may (for the purpose of less communication in the whole program) to leave only the assignment to the a(i-1) owner-processor
10 158 Fortran and HPF 6.7 Scalar variables 6.7 Scalar variables REAL, DIMENSION(100,100) : : X REAL : : Scal! HPF$ DISTRIBUTE (BLOCK,BLOCK) : : X.... Scal = X( i, j ).... owner of X(i,j) assigns Scal value and sends to other processes (replication)
11 159 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision 6.8 Examples of good DISTRIBUTE subdivision Example: A( 2 : 9 9 ) = (A( : 9 8 ) +A( 3 : ) ) /2! neighbour c a l c u l a t i o n s B( 2 2 : 5 6 ) = 4.0 ATAN( 1. 0 )! s e c t i o n of B c a l c u l a t e d C ( : ) = SUM(D, DIM=1)! Sum down a column From the owner calculates rule we get:! HPF$ DISTRIBUTE (BLOCK) ONTO P : : A! HPF$ DISTRIBUTE ( CYCLIC) ONTO P : : B! HPF$ DISTRIBUTE (BLOCK) ONTO P : : C! or ( CYCLIC)! HPF$ DISTRIBUTE (,BLOCK) ONTO P : : D! or (, CYCLIC)
12 160 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision Example (SSOR): DO j = 2,n 1 DO i = 2,n 1 a ( i, j ) =(omega/ 4 ) ( a ( i, j 1)+a ( i, j +1)+ & a ( i 1, j ) +a ( i +1, j ) ) +(1 omega) a ( i, j ) END DO END DO Best is BLOCK-distribution in both dimensions
13 161 Fortran and HPF 6.9 HPF programming methodology 6.9 HPF programming methodology Need to find balance between concurrency and communication the more processes the more communication aiming to find balanced load based from the owner calculates rule data locality use array syntax and intrinsic functions on arrays avoid deprecated Fortran features (like assumed size and memory associations) Easy to write a program in HPF but difficult to gain good efficiency Programming in HPF technique is more or less like this: 1. Write a correctly working serial program, test and debug it 2. add distribution directives introducing as less as possible communication
14 162 Fortran and HPF 6.9 HPF programming methodology Advisable to add the INDEPENDENT directives (where semantically relevant), Perform data alignment (with ALIGN-directive). First thing is to: Choose a good parallel algorithm! Issues reducing efficiency: Difficult indexing (may confuse the compiler where particular elements are situated) array syntax is very powerful but with complex constructions may make compiler to fail to optimise sequential cycles to be left sequential (or replicate) object redelivery is time consuming
15 163 Fortran and HPF 6.9 HPF programming methodology Additional advice: Use array syntax possibilities instead of cycles Use intrinsic functions where possible (for possibly better optimisation) Before parallelisation think whether the argorithm is parallelisable at all? Use INDIPENDENT and ALIGN directives It is possible to assign sizes to the lengths in BLOCK and CYCLE if sure in need for the algorithm efficiency, otherwise better leave the decision to the compiler
16 164 Fortran and HPF 6.10 BLOCK(m) and CYCLIC(m) 6.10 BLOCK(m) and CYCLIC(m) predefining block size; in general make the code less efficient due to more complex accounting on ownership REAL, DIMENSION(20) : : A, B! HPF$ PROCESSORS, DIMENSION( 4 ) : : P! HPF$ DISTRIBUTE A(BLOCK( 9 ) ) ONTO P! HPF$ DISTRIBUTE B(CYCLIC ( 2 ) ) ONTO P 2D example: REAL, DIMENSION( 4, 9 ) : : A! HPF$ PROCESSORS, DIMENSION( 2 ) : : P! HPF$ DISTRIBUTE (BLOCK( 3 ),CYCLIC ( 2 ) ) ONTO P : : A
17 165 Fortran and HPF 6.11 Array alignment 6.11 Array alignment improves data locality minimises communication distributes workload Simplest example: A = B + C With correct ALIGN ment it is without communication 2 ways:! HPF$ ALIGN ( :, : ) WITH T ( :, : ) : : A, B, C equivalent to:! HPF$ ALIGN A ( :, : ) WITH T ( :, : )! HPF$ ALIGN B ( :, : ) WITH T ( :, : )! HPF$ ALIGN C ( :, : ) WITH T ( :, : )
18 166 Fortran and HPF 6.11 Array alignment Example: REAL, DIMENSION(10) : : A, B, C! HPF$ ALIGN ( : ) WITH C ( : ) : : A, B need only C be in DISTRIBUTE-command. Example, symbol instead of : :! HPF$ ALIGN ( j ) WITH C( j ) : : A, B means: for j align arrays A and B with array C. ( : instead of j stronger requirement)
19 167 Fortran and HPF 6.11 Array alignment! HPF$ ALIGN A( i, j ) WITH B( i, j ) Example 2 (2-dimensional case): REAL, DIMENSION(10,10) : : A, B! HPF$ ALIGN A ( :, : ) WITH B ( :, : ) is stronger requirement than not assuming same size arrays Good to perform A=B+C+B*C operations (everything local!) transposed alignment: REAL, DIMENSION(10,10) : : A, B! HPF$ ALIGN A( i, : ) WITH B ( :, i ) (second dimension of A and first dimension of B with same length!)
20 168 Fortran and HPF 6.11 Array alignment Similarly: REAL, DIMENSION(10,10) : : A, B! HPF$ ALIGN A ( :, j ) WITH B( j, : ) or: REAL, DIMENSION(10,10) : : A, B! HPF$ ALIGN A( i, j ) WITH B( j, i ) good to perform operation: A=A+TRNSPOSE(B) A! e v e r y t h i n g l o c a l!
21 169 Fortran and HPF 6.12 Strided Alignment 6.12 Strided Alignment Example: Align elements of matrix D with every second element in E: REAL, DIMENSION( 5 ) : : D REAL, DIMENSION(10) : : E! HPF$ ALIGN D ( : ) WITH E ( 1 : : 2 ) could be written also:! HPF$ ALIGN D( i ) WITH E( i 2 1) Operation: D = D + E ( : : 2 )! l o c a l
22 170 Fortran and HPF 6.12 Strided Alignment Example: reverse strided alignment: REAL, DIMENSION( 5 ) : : D REAL, DIMENSION(10) : : E! HPF$ ALIGN D ( : ) WITH E(UBOUND(E) :: 2) could be written also:! HPF$ ALIGN D( i ) WITH E(2+UBOUND( E) i 2)
23 171 Fortran and HPF 6.13 Example on Alignment 6.13 Example on Alignment PROGRAM Warty IMPLICIT NONE REAL, DIMENSION( 4 ) : : C REAL, DIMENSION( 8 ) : : D REAL, DIMENSION( 2 ) : : E C = 1; D = 2 E = D ( : : 4 ) + C ( : : 2 ) END PROGRAM Warty minimal (0) communication is achieved with:! HPF$ ALIGN C ( : ) WITH D ( : : 2 )! HPF$ ALIGN E ( : ) WITH D ( : : 4 )! HPF$ DISTRIBUTE (BLOCK) : : D
24 172 Fortran and HPF 6.14 Alignment with Allocatable Arrays 6.14 Alignment with Allocatable Arrays alignment is performed togehter with memory allocation existing object cannot be aligned to unallocated object Example: REAL, DIMENSION ( : ), ALLOCATABLE : :! HPF$ ALIGN A ( : ) WITH B ( : ) then ALLOCATE (B(100), stat= i e r r ) ALLOCATE (A(100), stat= i e r r ) is OK ALLOCATE (B(100),A(100), stat= i e r r ) also OK (allocation starts from left), A,B
25 173 Fortran and HPF 6.14 Alignment with Allocatable Arrays but, ALLOCATE (A(100), stat= i e r r ) ALLOCATE (B(100), stat= i e r r ) or ALLOCATE (A(100),B(100), stat= i e r r ) give error! Simple array cannot be aligned with allocatable: REAL, DIMENSION ( : ) : : X REAL, DIMENSION ( : ), ALLOCATABLE : :! HPF$ ALIGN X ( : ) WITH A ( : )! ERROR A
26 174 Fortran and HPF 6.14 Alignment with Allocatable Arrays One more problem: REAL, DIMENSION ( : ), ALLOCATABLE : :! HPF$ ALIGN A ( : ) WITH B ( : ) ALLOCATE(B(100), stat= i e r r ) ALLOCATE(A(50), stat= i e r r ) A, B : says that A and B should be with same length (but they are not!) But this is OK: REAL, DIMENSION ( : ), ALLOCATABLE : : A, B! HPF$ ALIGN A( i ) WITH B( i ) ALLOCATE(B(100), stat= i e r r ) ALLOCATE(A(50), stat= i e r r ) still: A cannot be larger than B.
27 175 Fortran and HPF 6.16 Dimension replication 6.15 Dimension collapsing With one element it is possible to align one or severeal dimensions:! HPF$ ALIGN (, : ) WITH Y ( : ) : : X each element from Y aligned to a column from X (First dimension of matrix X being collapsed) 6.16 Dimension replication! HPF$ ALIGN Y ( : ) WITH X (, : ) each processor getting arbitrary row X(:,i) gets also a copy of Y(i)
28 176 Fortran and HPF 6.16 Dimension replication Example: 2D Gauss elimination kernel of the program:... DO j = i +1, n A( j, i ) = A( j, i ) /Swap( i ) A( j, i +1:n ) = A( j, i +1:n ) A( j, i ) Swap( i +1:n ) Y( j ) = Y( j ) A( j, i ) Temp END DO Y(k) together with A(k,i) =>! HPF$ ALIGN Y ( : ) WITH A ( :, ) Swap(k) together with A(i,k) =>! HPF$ ALIGN Swap ( : ) WITH A (, : ) No matrix A neighbouring elements in same expression => CYLCIC:
29 177 Fortran and HPF 6.16 Dimension replication! HPF$ DISTRIBUTE A( CYCLIC, CYCLIC)
30 178 Fortran and HPF 6.16 Dimension replication Example: matrix multiplication PROGRAM ABmult IMPLICIT NONE INTEGER, PARAMETER : : N = 100 INTEGER, DIMENSION ( N, N) : : A, B, C INTEGER : : i, j! HPF$ PROCESSORS square ( 2, 2 )! HPF$ DISTRIBUTE (BLOCK, BLOCK) ONTO square : : C! HPF$ ALIGN A( i, ) WITH C( i, )! r e p l i c a t e copies of row A( i, ) onto processors which compute C( i, j )! HPF$ ALIGN B(, j ) WITH C(, j )! r e p l i c a t e copies of column B(, j ) ) onto processors which compute C( i, j ) A = 1 B = 2 C = 0 DO i = 1, N DO j = 1, N! A l l the work i s l o c a l due to ALIGNs C( i, j ) = DOT_PRODUCT(A( i, : ), B( :, j ) ) END DO END DO WRITE(, ) C END
31 179 Fortran and HPF 6.17 HPF Intrinsic Functions 6.17 HPF Intrinsic Functions NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE information about physical hardware needed for portability:! HPF$ PROCESSORS P1 (NUMBER.OF.PROCESSORS( ) )! HPF$ PROCESSORS P2(4, 4,NUMBER.OF.PROCESSORS( ) / 1 6 )! HPF$ PROCESSORS P3 ( 0 :NUMBER.OF.PROCESSORS( 1 ) 1, &! HPF$ 0:NUMBER.OF.PROCESSORS( 2 ) 1) 2048-processor hyprcube: PRINT, PROCESSORS. SHAPE( ) would return:
32 180 Fortran and HPF 6.18 HPF Template Syntax 6.18 HPF Template Syntax TEMPLATE - conceptual object, does not use any RAM, defined statically (like 0- sized arrays being not assigned to) are declared distributed can be used to align arrays Example: REAL, DIMENSION(10) : : A, B! HPF$ TEMPLATE, DIMENSION(10) : : T! HPF$ DISTRIBUTE (BLOCK) : : T! HPF$ ALIGN ( : ) WITH T ( : ) : : A, B (here only T may be argument to DISTRIBUTE) Combined TEMPLATE directive:
33 181 Fortran and HPF 6.18 HPF Template Syntax! HPF$ TEMPLATE, DIMENSION(100,100), &! HPF$ DISTRIBUTE (BLOCK, CYCLIC) ONTO P : : T! HPF$ ALIGN A ( :, : ) WITH T ( :, : ) which is equivalent to:! HPF$ TEMPLATE, DIMENSION(100,100) : : T! HPF$ ALIGN A ( :, : ) WITH T ( :, : )! HPF$ DISTRIBUTE T (BLOCK, CYCLIC) ONTO P
34 182 Fortran and HPF 6.18 HPF Template Syntax Example: PROGRAM Warty IMPLICIT NONE REAL, DIMENSION( 4 ) : : C REAL, DIMENSION( 8 ) : : D REAL, DIMENSION( 2 ) : : E! HPF$ TEMPLATE, DIMENSION( 8 ) : : T! HPF$ ALIGN D ( : ) WITH T ( : )! HPF$ ALIGN C ( : ) WITH T ( : : 2 )! HPF$ ALIGN E ( : ) WITH T ( : : 4 )! HPF$ DISTRIBUTE (BLOCK) : : T C = 1; D = 2 E = D ( : : 4 ) + C ( : : 2 ) END PROGRAM Warty (similar to the example of alignment with stride)
35 183 Fortran and HPF 6.18 HPF Template Syntax More examples on using Templates: ALIGN A(:)WITH T1(:, ) i, element A(i) replicated according to row T1(i,:). ALIGN C(i,j) WITH T2(j,i) transposed C aligned with T2 ALIGN B(:, )WITH T3(:) i, matrix B(i,:) aligned with template element T2(i), DISTRIBUTE (BLOCK,CYCLIC):: T1, T2 DISTRIBUTE T1(CYCLIC, )ONTO P T1 rows are distributed cyclically
36 184 Fortran and HPF 6.19 FORALL 6.19 FORALL Syntax: example: FORALL(<forall-triple-list>[,<scalar-mask>])& <assignment> FORALL ( i =1:n, j =1:m,A( i, j ).NE. 0 ) A( i, j ) = 1/A( i, j )
37 185 Fortran and HPF 6.19 FORALL Circumstances, where Fortran90 syntax is not enough, but FORALL makes it simple: index expressions: FORALL ( i =1:n, j =1:n, i /= j ) A( i, j ) = REAL( i + j ) intrinsic or PURE-functions (which have no side-effects): FORALL ( i =1:n : 3, j =1:n : 5 ) A( i, j ) = SIN (A( j, i ) ) subindexing: FORALL ( i =1:n, j =1:n ) A(VS( i ), j ) = i +VS( j )
38 186 Fortran and HPF 6.19 FORALL unusual parts can be accessed: FORALL ( i =1:n ) A( i, i ) = B( i )! diagonal!... DO j = 1, n FORALL ( i =1: j ) A( i, j ) = B( i )! t r i a n g u l a r END DO To parallelise add before DO also:! HPF$ INDEPENDENT, NEW( i ) or write nested FORALL commands: FORALL ( j = 1:n) FORALL ( i =1: j ) A( i, j ) = B( i )! t r i a n g u l a r END FORALL
39 187 Fortran and HPF 6.19 FORALL FORALL-command execution: 1. triple-list evaluation 2. scalar matrix evaluation 3. for each.true. mask elements find right hand side value 4. assignment of right hand side value to the left hand side In case of HPF synchronisation between the steps!
40 188 Fortran and HPF 6.20 PURE-procedures 6.20 PURE-procedures PURE REAL FUNCTION F ( x, y ) PURE SUBROUTINE G( x, y, z ) without side effects, i.e.: no outer I/O nor ALLOCATE (communication still allowed) does not change global state of the program intrinsic functions are of type PURE! can be used in FORALL and in PURE-procedures no PAUSE or STOP FUNCTION formal parameters with attribute INTENT(IN)
41 189 Fortran and HPF 6.20 PURE-procedures Example (function): PURE REAL FUNCTION F ( x, y ) IMPLICIT NONE REAL, INTENT ( IN ) : : x, y F = x x + y y + 2 x y + ASIN ( MIN ( x / y, y / x ) ) END FUNCTION F example of usage: FORALL ( i =1:n, j =1:n ) & A( i, j ) = b ( i ) + F(1.0 i, 1. 0 j )
42 190 Fortran and HPF 6.20 PURE-procedures Example (subroutine): PURE SUBROUTINE G( x, y, z ) IMPLICIT NONE REAL, INTENT (OUT), DIMENSION ( : ) : : z REAL, INTENT ( IN ), DIMENSION ( : ) : : x, y INTEGER i INTERFACE REAL FUNCTION F ( x, y ) REAL, INTENT ( IN ) : : x, y END FUNCTION F END INTERFACE!... FORALL( i =1: SIZE ( z ) ) z ( i ) = F ( x ( i ), y ( i ) ) END SUBROUTINE G
43 191 Fortran and HPF 6.20 PURE-procedures MIMD example: REAL FUNCTION F ( x, i )! PURE IMPLICIT NONE REAL, INTENT ( IN ) : : x! element INTEGER, INTENT ( IN ) : : i! index IF ( x > 0. 0 ) THEN F = x x ELSEIF ( i ==1.OR. i ==n ) THEN F = 0.0 ELSE F = x END IF END FUNCTION F
44 192 Fortran and HPF 6.21 INDEPENDENT 6.21 INDEPENDENT Directly in front of DO or FORALL! HPF$ INDEPENDENT DO i = 1,n x ( i ) = i 2 END DO In front of FORALL: no synchronisation needed between right hand side expression evaluation and assignment If INDEPENDENT loop......assigns more than once to a same element, parallelisation is lost!...includes EXIT, STOP or PAUSE command, iteration needs to execute sequentially to be sure to end in right iteration...has jumps out of the loop or I/O => sequential execution
45 193 Fortran and HPF 6.21 INDEPENDENT Is independent:! HPF$ INDEPENDENT DO i = 1, n b ( i ) = b ( i ) + b ( i ) END DO Not independent: DO i = 1, n b ( i ) = b ( i +1) + b ( i ) END DO Not independent: DO i = 1, n b ( i ) = b ( i 1) + b ( i ) END DO
46 194 Fortran and HPF 6.21 INDEPENDENT This is independent loop:! HPF$ INDEPENDENT DO i = 1, n a ( i ) = b ( i 1) + b ( i ) END DO Question to ask: does a later iteration depend on a previous one?
47 195 Fortran and HPF 6.22 INDEPENDENT NEW command 6.22 INDEPENDENT NEW command create an independent variable on each process!! HPF$ INDEPENDENT, NEW( s1, s2 ) DO i =1,n s1 = Sin ( a ( 1 ) ) s2 = Cos( a ( 1 ) ) a ( 1 ) = s1 s1 s2 s2 END DO
48 196 Fortran and HPF 6.22 INDEPENDENT NEW command Rules for NEW-variable: cannot be used outside cycle without redefinition cannot be used in FORALL not pointers or formal parameters cannot have SAVE atribute Disallowed:! HPF$ INDEPENDENT, NEW( s1, s2 ) DO i = 1,n s1 = SIN ( a ( i ) ) s2 = COS( a ( i ) ) a ( i ) = s1 s1 s2 s2 END DO k = s1+s2! not allowed!
49 197 Fortran and HPF 6.23 EXTRINSIC Example (only outer cycles can be executed independently):! HPF$ INDEPENDENT, NEW ( i 2 ) DO i1 = 1, n1! HPF$ INDEPENDENT, NEW ( i 3 ) DO i2 = 1, n2! HPF$ INDEPENDENT, NEW ( i 4 ) DO i3 = 1, n3 DO i4 = 1, n4 a ( i1, i2, i3 ) = a ( i1, i2, i3 ) & + b ( i1, i2, i4 ) c ( i2, i3, i4 ) END DO END DO END DO END DO 6.23 EXTRINSIC Used in INTERFACE-commands to declare that a routine does not belong to HPF
50 198 Fortran and HPF 6.23 EXTRINSIC 7 MPI See separate slides on course homepage!
51 199 // Alg.Design Part III Parallel Algorithms 8 Parallel Algorithm Design Principles Identifying portions of the work that can be performed concurrently Mapping the concurrent pieces of work onto multiple processes running in parallel Distributing the input, output and also intermediate data associated with the program Access management of data shared by multiple processes Process synchronisation in various stages in parallel program execution
52 200 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs 8.1 Decomposition, Tasks and Dependency Graphs Subdividing calculations into smaller components is called decomposition Example: Dense matrix-vector multiplication y = Ab y[i] = n j=1 A[i, j]b[ j] Computational Problem Decomposition Tasks ([1])
53 201 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs in case of y = Ab tasks calculation of each row independent Tasks and their relative order abstraction: ([1]) Task Dependency Graph directed acyclic graph nodes tasks
54 202 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs directed edges dependences (can be disconnected) (edge-set can be empty) Mapping tasks onto processors Processors vs. processes
55 203 // Alg.Design 8.2 Decomposition Techniques 8.2 Decomposition Techniques Recursive decomposition Example: Quicksort
56 204 // Alg.Design 8.2 Decomposition Techniques Data decomposition Input data Output data intermediate results Owner calculates rule Exploratory decomposition Example: 15-square game Speculative decomposition Hybrid decomposition
57 205 // Alg.Design 8.3 Tasks and Interactions 8.3 Tasks and Interactions Task generation static dynamic Task sizes uniform non-uniform knowledge of task sizes size of data associated with tasks
58 206 // Alg.Design 8.3 Tasks and Interactions Characteristics of Inter-Task Interactions static versus dynamic regular versus irregular read-only versus read-write one-way versus two-way
59 207 // Alg.Design 8.4 Mapping Techniques for Load balancing 8.4 Mapping Techniques for Load balancing Static mapping Mappings based on data partitioning array distribution schemes block distribution cyclic and block-cyclic distribution randomised block distribution Graph partitioning
60 208 // Alg.Design 8.4 Mapping Techniques for Load balancing Dynamic mapping schemes centralised schemes master-slave self scheduling single task scheduling chunk scheduling distributed schemes each process can send/receive work from whichever process how are sending receiving processes paired together? initiator sender or receiver? how much work exchanged? when the work transfer is performed?
61 209 // Alg.Design 8.4 Mapping Techniques for Load balancing Methods for reducing interaction overheads maximising data locality minimise volume of data exchange minimise frequency of interactions overlapping computations with interactions replicating data or computation overlapping interactions with other interactions
62 210 // Alg.Design 8.5 Parallel Algorithm Models 8.5 Parallel Algorithm Models data-parallel model task graph model typically amount of data relatively large to the amount of computation work pool model or task pool model pool can be centralised distributed statically/dynamically created, master-slave model
63 211 // Alg.Design 8.5 Parallel Algorithm Models pipeline (or producer-consumer model) stream parallelism stream of data triggers computations pipelines is a chain of consumer-produces processes shape of pipeline can be: linear or multidimesional array trees general graphs with or without cycles Bulk-sunchronous parallel (BSP) model Synchronisation steps needed in a regular or irregular pattern p2p synchronisation and/or global synchronisation
Parallel Programming. March 15,
Parallel Programming March 15, 2010 1 Some Definitions Computational Models and Models of Computation real world system domain model - mathematical - organizational -... computational model March 15, 2010
More informationHigh Performance Fortran. James Curry
High Performance Fortran James Curry Wikipedia! New Fortran statements, such as FORALL, and the ability to create PURE (side effect free) procedures Compiler directives for recommended distributions of
More informationSystolic arrays Parallel SIMD machines. 10k++ processors. Vector/Pipeline units. Front End. Normal von Neuman Runs the application program
SIMD Single Instruction Multiple Data Lecture 12: SIMD-machines & data parallelism, dependency analysis for automatic vectorizing and parallelizing of serial program Part 1 Parallelism through simultaneous
More informationHigh Performance Fortran http://www-jics.cs.utk.edu jics@cs.utk.edu Kwai Lam Wong 1 Overview HPF : High Performance FORTRAN A language specification standard by High Performance FORTRAN Forum (HPFF), a
More informationCSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming
CSE 262 Spring 2007 Scott B. Baden Lecture 4 Data parallel programming Announcements Projects Project proposal - Weds 4/25 - extra class 4/17/07 Scott B. Baden/CSE 262/Spring 2007 2 Data Parallel Programming
More informationLecture V: Introduction to parallel programming with Fortran coarrays
Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationMPI Casestudy: Parallel Image Processing
MPI Casestudy: Parallel Image Processing David Henty 1 Introduction The aim of this exercise is to write a complete MPI parallel program that does a very basic form of image processing. We will start by
More informationAMath 483/583 Lecture 8
AMath 483/583 Lecture 8 This lecture: Fortran subroutines and functions Arrays Dynamic memory Reading: class notes: Fortran Arrays class notes: Fortran Subroutines and Functions class notes: gfortran flags
More informationIntroduction to OpenMP. Lecture 4: Work sharing directives
Introduction to OpenMP Lecture 4: Work sharing directives Work sharing directives Directives which appear inside a parallel region and indicate how work should be shared out between threads Parallel do/for
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday
More informationExtrinsic Procedures. Section 6
Section Extrinsic Procedures 1 1 1 1 1 1 1 1 0 1 This chapter defines the mechanism by which HPF programs may call non-hpf subprograms as extrinsic procedures. It provides the information needed to write
More informationParallelisation. Michael O Boyle. March 2014
Parallelisation Michael O Boyle March 2014 1 Lecture Overview Parallelisation for fork/join Mapping parallelism to shared memory multi-processors Loop distribution and fusion Data Partitioning and SPMD
More information2 3. Syllabus Time Event 9:00{10:00 morning lecture 10:00{10:30 morning break 10:30{12:30 morning practical session 12:30{1:30 lunch break 1:30{2:00 a
1 Syllabus for the Advanced 3 Day Fortran 90 Course AC Marshall cuniversity of Liverpool, 1997 Abstract The course is scheduled for 3 days. The timetable allows for two sessions a day each with a one hour
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationChapter 3. Fortran Statements
Chapter 3 Fortran Statements This chapter describes each of the Fortran statements supported by the PGI Fortran compilers Each description includes a brief summary of the statement, a syntax description,
More informationParallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville
Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information
More informationChapter 4. Fortran Arrays
Chapter 4. Fortran Arrays Fortran arrays are any object with the dimension attribute. In Fortran 90/95, and in HPF, arrays may be very different from arrays in older versions of Fortran. Arrays can have
More informationWelcome. Modern Fortran (F77 to F90 and beyond) Virtual tutorial starts at BST
Welcome Modern Fortran (F77 to F90 and beyond) Virtual tutorial starts at 15.00 BST Modern Fortran: F77 to F90 and beyond Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Fortran Ancient History (1967)
More informationMPI Programming. Henrik R. Nagel Scientific Computing IT Division
1 MPI Programming Henrik R. Nagel Scientific Computing IT Division 2 Outline Introduction Finite Difference Method Finite Element Method LU Factorization SOR Method Monte Carlo Method Molecular Dynamics
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationArray Processing { Part II. Multi-Dimensional Arrays. 1. What is a multi-dimensional array?
Array Processing { Part II Multi-Dimensional Arrays 1. What is a multi-dimensional array? A multi-dimensional array is simply a table (2-dimensional) or a group of tables. The following is a 2-dimensional
More informationIntroduction to Fortran95 Programming Part II. By Deniz Savas, CiCS, Shef. Univ., 2018
Introduction to Fortran95 Programming Part II By Deniz Savas, CiCS, Shef. Univ., 2018 Summary of topics covered Logical Expressions, IF and CASE statements Data Declarations and Specifications ARRAYS and
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationPACKAGE SPECIFICATION HSL 2013
PACKAGE SPECIFICATION HSL 2013 1 SUMMARY Given a rank-one or rank-two allocatable array, reallocates the array to have a different size, and can copy all or part of the original array into the new array.
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationReview More Arrays Modules Final Review
OUTLINE 1 REVIEW 2 MORE ARRAYS Using Arrays Why do we need dynamic arrays? Using Dynamic Arrays 3 MODULES Global Variables Interface Blocks Modular Programming 4 FINAL REVIEW THE STORY SO FAR... Create
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationSubroutines, Functions and Modules
Subroutines, Functions and Modules Subdividing the Problem Most problems are thousands of lines of code. Few people can grasp all of the details. Good design principle: Exhibit the overall structure in
More informationAutomatic Translation of Fortran Programs to Vector Form. Randy Allen and Ken Kennedy
Automatic Translation of Fortran Programs to Vector Form Randy Allen and Ken Kennedy The problem New (as of 1987) vector machines such as the Cray-1 have proven successful Most Fortran code is written
More informationOverpartioning with the Rice dhpf Compiler
Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationProgramming for High Performance Computing in Modern Fortran. Bill Long, Cray Inc. 17-May-2005
Programming for High Performance Computing in Modern Fortran Bill Long, Cray Inc. 17-May-2005 Concepts in HPC Efficient and structured layout of local data - modules and allocatable arrays Efficient operations
More informationIntroduction to Fortran Programming. -Internal subprograms (1)-
Introduction to Fortran Programming -Internal subprograms (1)- Subprograms Subprograms are used to split the program into separate smaller units. Internal subprogram is not an independent part of a program.
More informationStorage and Sequence Association
2 Section Storage and Sequence Association 1 1 HPF allows the mapping of variables across multiple processors in order to improve parallel 1 performance. FORTRAN and Fortran 0 both specify relationships
More informationOptimisation p.1/22. Optimisation
Performance Tuning Optimisation p.1/22 Optimisation Optimisation p.2/22 Constant Elimination do i=1,n a(i) = 2*b*c(i) enddo What is wrong with this loop? Compilers can move simple instances of constant
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More informationG Programming Languages - Fall 2012
G22.2110-003 Programming Languages - Fall 2012 Lecture 4 Thomas Wies New York University Review Last week Control Structures Selection Loops Adding Invariants Outline Subprograms Calling Sequences Parameter
More informationPrinciples of Parallel Algorithm Design: Concurrency and Decomposition
Principles of Parallel Algorithm Design: Concurrency and Decomposition John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 2 12 January 2017 Parallel
More informationAppendix D. Fortran quick reference
Appendix D Fortran quick reference D.1 Fortran syntax... 315 D.2 Coarrays... 318 D.3 Fortran intrisic functions... D.4 History... 322 323 D.5 Further information... 324 Fortran 1 is the oldest high-level
More informationHigh Performance Computing Lecture 41. Matthew Jacob Indian Institute of Science
High Performance Computing Lecture 41 Matthew Jacob Indian Institute of Science Example: MPI Pi Calculating Program /Each process initializes, determines the communicator size and its own rank MPI_Init
More informationIntroduction to OpenMP
Introduction to OpenMP p. 1/?? Introduction to OpenMP More Syntax and SIMD Nick Maclaren nmm1@cam.ac.uk September 2017 Introduction to OpenMP p. 2/?? C/C++ Parallel for (1) I said that I would give the
More informationMixed Mode MPI / OpenMP Programming
Mixed Mode MPI / OpenMP Programming L.A. Smith Edinburgh Parallel Computing Centre, Edinburgh, EH9 3JZ 1 Introduction Shared memory architectures are gradually becoming more prominent in the HPC market,
More informationDense Matrix Algorithms
Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationLecture 4: Principles of Parallel Algorithm Design (part 3)
Lecture 4: Principles of Parallel Algorithm Design (part 3) 1 Exploratory Decomposition Decomposition according to a search of a state space of solutions Example: the 15-puzzle problem Determine any sequence
More informationNumerical Algorithms
Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI One-sided Communication Nick Maclaren nmm1@cam.ac.uk October 2010 Programming with MPI p. 2/?? What Is It? This corresponds to what is often called RDMA
More informationCompiling High Performance Fortran for Distributedmemory
Compiling High Performance Fortran for Distributedmemory Systems Jonathan Harris John A. Bircsak M. Regina Bolduc Jill Ann Diewald Israel Gale Neil W. Johnson Shin Lee C. Alexander Nelson Carl D. Offner
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationPACKAGE SPECIFICATION HSL 2013
HSL MC73 PACKAGE SPECIFICATION HSL 2013 1 SUMMARY Let A be an n n matrix with a symmetric sparsity pattern. HSL MC73 has entries to compute the (approximate) Fiedler vector of the unweighted or weighted
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationGoals for This Lecture:
Goals for This Lecture: Learn about multi-dimensional (rank > 1) arrays Learn about multi-dimensional array storage Learn about the RESHAPE function Learn about allocatable arrays & the ALLOCATE and DEALLOCATE
More informationParallelization of an Example Program
Parallelization of an Example Program [ 2.3] In this lecture, we will consider a parallelization of the kernel of the Ocean application. Goals: Illustrate parallel programming in a low-level parallel language.
More informationParallel Programming with OpenMP. CS240A, T. Yang
Parallel Programming with OpenMP CS240A, T. Yang 1 A Programmer s View of OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for defining multi-threaded shared-memory programs
More informationIntroduction to Modern Fortran
Introduction to Modern Fortran p. 1/?? Introduction to Modern Fortran Advanced Use Of Procedures Nick Maclaren nmm1@cam.ac.uk March 2014 Introduction to Modern Fortran p. 2/?? Summary We have omitted some
More informationDeclaration and Initialization
6. Arrays Declaration and Initialization a1 = sqrt(a1) a2 = sqrt(a2) a100 = sqrt(a100) real :: a(100) do i = 1, 100 a(i) = sqrt(a(i)) Declaring arrays real, dimension(100) :: a real :: a(100) real :: a(1:100)!
More informationARRAYS COMPUTER PROGRAMMING. By Zerihun Alemayehu
ARRAYS COMPUTER PROGRAMMING By Zerihun Alemayehu AAiT.CED Rm. E119B BASIC RULES AND NOTATION An array is a group of variables or constants, all of the same type, that are referred to by a single name.
More informationCS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009.
CS4961 Parallel Programming Lecture 5: Data and Task Parallelism, cont. Administrative Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following
More informationIntroduction to Programming with Fortran 90
Introduction to Programming with Fortran 90 p. 1/?? Introduction to Programming with Fortran 90 Array Concepts Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 November 2007 Introduction to Programming
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous
More informationInterfacing With Other Programming Languages Using Cython
Lab 19 Interfacing With Other Programming Languages Using Cython Lab Objective: Learn to interface with object files using Cython. This lab should be worked through on a machine that has already been configured
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More informationComputer Science & Engineering 150A Problem Solving Using Computers
Computer Science & Engineering 150A Problem Solving Using Computers Lecture 06 - Stephen Scott Adapted from Christopher M. Bourke 1 / 30 Fall 2009 Chapter 8 8.1 Declaring and 8.2 Array Subscripts 8.3 Using
More informationParallelization Principles. Sathish Vadhiyar
Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs
More informationCS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012
CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night
More informationLecture 4: Principles of Parallel Algorithm Design (part 3)
Lecture 4: Principles of Parallel Algorithm Design (part 3) 1 Exploratory Decomposition Decomposition according to a search of a state space of solutions Example: the 15-puzzle problem Determine any sequence
More informationMore Coarray Features. SC10 Tutorial, November 15 th 2010 Parallel Programming with Coarray Fortran
More Coarray Features SC10 Tutorial, November 15 th 2010 Parallel Programming with Coarray Fortran Overview Multiple Dimensions and Codimensions Allocatable Coarrays and Components of Coarray Structures
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges
ELE 455/555 Computer System Engineering Section 4 Class 1 Challenges Introduction Motivation Desire to provide more performance (processing) Scaling a single processor is limited Clock speeds Power concerns
More informationLecture 16: Recapitulations. Lecture 16: Recapitulations p. 1
Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently
More informationSynchronous Shared Memory Parallel Examples. HPC Fall 2012 Prof. Robert van Engelen
Synchronous Shared Memory Parallel Examples HPC Fall 2012 Prof. Robert van Engelen Examples Data parallel prefix sum and OpenMP example Task parallel prefix sum and OpenMP example Simple heat distribution
More information(Refer Slide Time 01:41 min)
Programming and Data Structure Dr. P.P.Chakraborty Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture # 03 C Programming - II We shall continue our study of
More informationUvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP
Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel
More informationIntroduction to OpenMP
Introduction to OpenMP Lecture 4: Work sharing directives Work sharing directives Directives which appear inside a parallel region and indicate how work should be shared out between threads Parallel do/for
More informationG. Colin de Verdière CEA
A methodology to port production codes to GPU learned from our experiments G. Colin de Verdière CEA GCdV CEA, DAM, DIF, F-91297 Arpajon 1 Moving towards GPGPU Potentially important speedups COMPLEX(8)
More information1 Introduction to MATLAB
1 Introduction to MATLAB 1.1 Quick Overview This chapter is not intended to be a comprehensive manual of MATLAB R. Our sole aim is to provide sufficient information to give you a good start. If you are
More informationCS4961 Parallel Programming. Lecture 4: Data and Task Parallelism 9/3/09. Administrative. Mary Hall September 3, Going over Homework 1
CS4961 Parallel Programming Lecture 4: Data and Task Parallelism Administrative Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following command:
More informationArrays, Vectors Searching, Sorting
Arrays, Vectors Searching, Sorting Arrays char s[200]; //array of 200 characters different type than class string can be accessed as s[0], s[1],..., s[199] s[0]= H ; s[1]= e ; s[2]= l ; s[3]= l ; s[4]=
More informationCMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)
CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI
More informationRun-Time Data Structures
Run-Time Data Structures Static Structures For static structures, a fixed address is used throughout execution. This is the oldest and simplest memory organization. In current compilers, it is used for:
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationPoint-to-Point Synchronisation on Shared Memory Architectures
Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:
More informationParallel Processing. Parallel Processing. 4 Optimization Techniques WS 2018/19
Parallel Processing WS 2018/19 Universität Siegen rolanda.dwismuellera@duni-siegena.de Tel.: 0271/740-4050, Büro: H-B 8404 Stand: September 7, 2018 Betriebssysteme / verteilte Systeme Parallel Processing
More informationExploring XcalableMP. Shun Liang. August 24, 2012
Exploring XcalableMP Shun Liang August 24, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract This project has implemented synthetic and application
More informationLecture Notes on Loop Transformations for Cache Optimization
Lecture Notes on Loop Transformations for Cache Optimization 5-: Compiler Design André Platzer Lecture Introduction In this lecture we consider loop transformations that can be used for cache optimization.
More informationSECTION 5: STRUCTURED PROGRAMMING IN MATLAB. ENGR 112 Introduction to Engineering Computing
SECTION 5: STRUCTURED PROGRAMMING IN MATLAB ENGR 112 Introduction to Engineering Computing 2 Conditional Statements if statements if else statements Logical and relational operators switch case statements
More informationAlgorithms and Applications
Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers
More informationExample of a Parallel Algorithm
-1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationA Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh
A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory
More informationSynchronous Shared Memory Parallel Examples. HPC Fall 2010 Prof. Robert van Engelen
Synchronous Shared Memory Parallel Examples HPC Fall 2010 Prof. Robert van Engelen Examples Data parallel prefix sum and OpenMP example Task parallel prefix sum and OpenMP example Simple heat distribution
More informationFortran. (FORmula TRANslator) History
Fortran (FORmula TRANslator) History FORTRAN vs. Fortran 1954 FORTRAN first successful high level language John Backus (IBM) 1958 FORTRAN II (Logical IF, subroutines, functions) 1961 FORTRAN IV 1966 FORTRAN
More informationIntroduction to Programming in C Department of Computer Science and Engineering. Lecture No. #43. Multidimensional Arrays
Introduction to Programming in C Department of Computer Science and Engineering Lecture No. #43 Multidimensional Arrays In this video will look at multi-dimensional arrays. (Refer Slide Time: 00:03) In
More informationAffine Loop Optimization using Modulo Unrolling in CHAPEL
Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers
More information1 Introduction to MATLAB
1 Introduction to MATLAB 1.1 General Information Quick Overview This chapter is not intended to be a comprehensive manual of MATLAB R. Our sole aim is to provide sufficient information to give you a good
More informationCUDA Fortran COMPILERS &TOOLS. Porting Guide
Porting Guide CUDA Fortran CUDA Fortran is the Fortran analog of the NVIDIA CUDA C language for programming GPUs. This guide includes examples of common language features used when porting Fortran applications
More information