SIMD - Single Instruction Multiple Data
Lecture 12: SIMD machines & data parallelism; dependency analysis for automatic vectorizing and parallelizing of serial programs. Part 1.

Fine-Grain Parallelism
- Parallelism through simultaneous operations on different data
- Systolic arrays
- Parallel SIMD machines, 10k++ processors
- Vector/pipeline units

Systolic Array
- Network of processors, with memory around it
- Performance by doing all computations before storing back
- Often hardware implementations solving one problem
- Special topologies

SIMD Machine
- Front end: a normal von Neumann machine; runs the application program
- Processor array: synchronous - every processor performs the same operation at the same time, or is idle; extends the FPU's instructions
- Small memory per processor
- Smart memory I/O
- Host and controller
- Examples: ILLIAC IV, IBM GF 11, MasPar, CM200 (Bellman, 16k)

Data Parallel Programming
- Idea: update the elements of an array at the same time
- Divides the work between the programmer and the compiler
- The programmer solves the problem in their model: concentrates on structure and concepts at a high level, uses collective operations on large data structures, keeps data in large arrays with mapping information
- The compiler maps the program onto a physical machine: fills in all the details (gladly receives hints from the user) and optimizes computations and communications

Building Blocks in Data Parallel Programming
- The user controls the placement of data on processors: minimize communication, keep all processors busy
- Operations on whole arrays: apply one operation to each element of the array in parallel
- Methods to access parts of an array; operations can operate on these parts. Example: where element < 0, set element := 1
- Reduction operations on arrays produce a result from a combination of many array elements: sum, max, min, ...
- Shift operations along the axes of multidimensional arrays
- Scan operations (prefix/suffix operations)
- Generalized communication
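To make these building blocks concrete, here is a minimal sketch in standard Fortran 90 (the array language that HPF, introduced later in this lecture, builds on) showing a whole-array operation, a masked update, and reductions. The array contents are illustrative, not from the slides:

      program building_blocks
        implicit none
        real :: a(8)
        a = (/ 3.0, -1.0, 4.0, -1.0, 5.0, -9.0, 2.0, 6.0 /)
        a = a * 2.0                ! one operation applied to every element
        where (a < 0.0) a = 1.0    ! masked update: only the negative elements
        print *, 'sum =', sum(a), ' max =', maxval(a)   ! reductions
      end program building_blocks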
C*
- C* supports broadcast, reduction and interprocessor communication
- Parallel variables have a type and a shape; the shape defines the number of elements and their organization:
      shape [16384] employees;   /* 1-D */
      shape [512][512] image;    /* 2-D */
- Left indexing: indexing that refers to parallel variables; the 1st dimension is axis 0, the 2nd is axis 1, etc.
      int:employees employee_id;
      [2]employee_id   /* refers to the 3rd element in employee_id */
- Parallel structures:
      shape [16384] employees;
      struct date {
          int month;
          int day;
          int year;
      };
      struct date:employees birthday;
  Each element of the parallel variable birthday contains a date; birthday.month specifies all month fields in the parallel variable birthday.

C* - Parallel Operations
- Overloading: x = y + z adds y and z in each position of the shape
- New operators (a, b scalar or parallel):
      a <? b   /* min of two variables */
      a >? b   /* max of two variables */
- Selection of shape (with):
      shape [16384] numbers;
      int:numbers x, y, z;
      with (numbers)
          x = y + z;

where - C* Setting the Context
- Limits the area where an operation is performed:
      with (numbers)
          where (z != 0)   /* sets the active positions */
              x = y/z;
          else             /* reverses the active positions */
              x = y;
- everywhere: all positions active, independently of earlier context

C* - Communication
- Grid communication: pcoord(axis) gives my index along an axis of the shape
- Example: send the value of source to the element dest one position higher up:
      [pcoord(0) + 1]dest = source;
- Dot (.) is sometimes used instead of pcoord:
      [. + 1]dest = source;
      [. + 1][. - 2]dest = source;

Compute Pi in C*
Pi ≈ (1/N) * Σ_{i=0}^{N-1} 4/(1 + x_i * x_i), where x_i = (i + 1/2)/N

      #define N 16384   /* value assumed; matches the shapes above */
      shape [N] chunk;
      double:chunk x;

      main() {
          double sum;
          double width;
          width = 1.0/N;
          with (chunk) {                        /* in parallel */
              x = (pcoord(0) + 0.5) * width;
              sum = (+= (4.0/(1.0 + x*x)));     /* reduction over the shape */
          }
          sum = sum * width;
          printf("Estimate of Pi = %14.12f\n", sum);
      }
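For comparison, a Fortran 90 counterpart of the Pi program above, using an array expression and the SUM reduction instead of a parallel shape (a sketch; n = 16384 mirrors the shapes used earlier, and any value works):

      program pi_estimate
        implicit none
        integer, parameter :: n = 16384
        integer :: i
        real(kind=8) :: width, x(n), pi_est
        width = 1.0d0 / n
        x = (/ ( (i - 0.5d0) * width, i = 1, n ) /)   ! midpoints x_i
        pi_est = sum(4.0d0 / (1.0d0 + x*x)) * width   ! reduction, then scale
        print '(a,f14.12)', 'Estimate of Pi = ', pi_est
      end program pi_estimate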
Compute Partial Sums in an Array (C*)

      #define N 1024
      shape [N] ArrayShape;      /* select shape */
      int:ArrayShape x;
      int i;

      main() {
          with (ArrayShape)
              for (i = 1; i <= log2(N); i++)
                  where (pcoord(0) >= pow(2, i-1))        /* active positions */
                      x += [pcoord(0) - pow(2, i-1)]x;    /* left indexing */
      }

High Performance Fortran
- Data parallel language (many similarities to CM FORTRAN)
- For SIMD and MIMD (NUMA) machines
- Based on F90, which extends F77 with array operations, user-defined data types, recursion and dynamic memory allocation, and pointers
- HPF adds control of data distribution: data mapping directives, FORALL statements and constructs, the INDEPENDENT directive, etc.
- (Slide figure: F77 -> F90 -> HPF; the HPF compiler produces an SPMD executable, e.g. F77 plus message passing.)

The PROCESSORS directive
- Declares an abstract processor arrangement onto which data is mapped
- Each element of this arrangement corresponds to a node of the physical machine
- The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS:
      !HPF$ PROCESSORS p(NUMBER_OF_PROCESSORS()/2, 2)
- To an ordinary Fortran compiler, the !HPF$ prefix is just a comment

The DISTRIBUTE directive
- Controls the mapping of data onto processors
- BLOCK distribution: each processor stores a consecutive block of the array
      REAL a(16)
      !HPF$ PROCESSORS p(4)
      !HPF$ DISTRIBUTE a(BLOCK) ONTO p
- (BLOCK, BLOCK) distribution: for multidimensional arrays, separate blocking in each dimension
      REAL a(7,7)
      !HPF$ PROCESSORS p(2,2)
      !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p
- CYCLIC distribution: elements are dealt out to the processors round-robin
      REAL a(16)
      !HPF$ PROCESSORS p(4)
      !HPF$ DISTRIBUTE a(CYCLIC) ONTO p
- It is not necessary to have the same distribution in all dimensions:
      REAL a(7,7)
      !HPF$ PROCESSORS p(2,2)
      !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
      !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p
      !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
- (Slide figures show which elements land on processors P1..P4 under each distribution.)
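The element-to-processor mapping in the lost figures is easy to recompute. A small sketch (my own, not from the slides) that prints the owner of each element of a 16-element array on 4 processors under BLOCK and CYCLIC, with processors numbered P1..P4 as in the figures:

      program distributions
        implicit none
        integer, parameter :: n = 16, p = 4
        integer :: i
        do i = 0, n - 1
           ! BLOCK: consecutive chunks of n/p; CYCLIC: round-robin
           print '(a,i2,a,i1,a,i1)', 'a(', i + 1, ')  block: P', &
                i/(n/p) + 1, '  cyclic: P', mod(i, p) + 1
        end do
      end program distributions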
The ALIGN directive
- Describes mapping relations between interacting objects; aligned objects are allocated on the same processor
      REAL a(6), b(6)
      !HPF$ ALIGN a(i) WITH b(i)

      REAL a(4,4), b(4,10)
      !HPF$ ALIGN a(i,j) WITH b(i, 2*j+1)
- (Slide figure: with the first alignment, a(1)..a(6) line up with b(1)..b(6); with the second, a(i,1)..a(i,4) are stored with b(i,3), b(i,5), b(i,7), b(i,9).)

Example: Simple Matrix Multiplication

      PROGRAM ABmult
        INTEGER, PARAMETER :: N = 100
        INTEGER, DIMENSION (N,N) :: A, B, C
        INTEGER :: i, j
      !HPF$ PROCESSORS SQ(2,2)
      !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
      !HPF$ ALIGN A(i,*) WITH C(i,*)
      ! replicate copies of row A(i,*) onto processors which compute C(i,j)
      !HPF$ ALIGN B(*,j) WITH C(*,j)
      ! replicate copies of column B(*,j) onto processors which compute C(i,j)
        A = 1
        B = 2
        C = 0
        DO i = 1, N
          DO j = 1, N
            ! All the work is local due to the ALIGNs
            C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
          END DO
        END DO
      END

The FORALL statement
- Generalization of array assignment and masked array assignment (NOT a loop)
- Single-statement FORALL: FORALL (index, mask) forall-assignment
  Equivalent to array assignment in F90: for every index, check the mask; compute the right-hand side for the unmasked values; then carry out the assignments to the left-hand side
- Multi-statement FORALL:
      FORALL (index, mask)
          forall-body
      END FORALL
  The forall-body can contain FORALL, WHERE, or ordinary forall assignments; semantically an abbreviation of a series of single-statement FORALLs

The INDEPENDENT directive
- States that no iteration affects any other iteration in any way
- Used to give the compiler extra information about the execution of a DO or FORALL
- Applied to a DO: states that there are no loop-carried dependencies
- Applied to a FORALL: states that no index points to an address used by any other object
      !HPF$ INDEPENDENT
      DO I = 1, N
          A(INDX(I)) = B(I)
      END DO
- Timing example: in
      FORALL (I=1:3)
          L1(I) = R1(I)
          L2(I) = R2(I)
      END FORALL
  assume that R1(3) and R2(1) take longer due to communication. Without INDEPENDENT there is a synchronization after each statement, so every index waits for R1(3) and then for R2(1); with !HPF$ INDEPENDENT in front of the FORALL, each index proceeds through both assignments on its own, and time is gained. (Slide figure contrasts the two schedules.)

Game of LIFE

      INTEGER LIFE(64,64), NCOUNT(64,64)
      !HPF$ ALIGN LIFE WITH NCOUNT
      !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
      ...  ! init LIFE
      NCOUNT = 0
      DO M = 1, NUMBER_OF_GENERATIONS
        FORALL (I=2:63, J=2:63)
          NCOUNT(I,J) = SUM(LIFE(I-1:I+1, J-1:J+1)) - LIFE(I,J)
        END FORALL
        ! Create next generation
        WHERE ((LIFE.EQ.0) .AND. (NCOUNT.EQ.3))
          LIFE = 1
        END WHERE
        WHERE ((LIFE.EQ.1) .AND. (NCOUNT.NE.2) .AND. (NCOUNT.NE.3))
          LIFE = 0
        END WHERE
      END DO
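The "NOT a loop" point deserves a concrete example. In a FORALL, all right-hand sides are evaluated before any assignment takes place, so the result can differ from the corresponding DO loop. A small Fortran sketch (my own illustration, not from the slides):

      program forall_vs_do
        implicit none
        integer :: a(5), b(5), i
        a = (/ 1, 2, 3, 4, 5 /)
        b = a
        forall (i = 2:5) a(i) = a(i-1)   ! uses only old values: a = 1 1 2 3 4
        do i = 2, 5
           b(i) = b(i-1)                 ! propagates b(1):     b = 1 1 1 1 1
        end do
        print *, a
        print *, b
      end program forall_vs_do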
Summary: Data Parallelism
- Scalable
- Data parallel programming is simpler than message passing
- Data parallel languages: C*, CM Fortran, HPF
- SIMD-style: single program, single instruction flow
- SPMD-style: single program, multiple data - different instruction flows locally
- Machines: SIMD (CM2, MasPar, ...), or MIMD with SPMD programming

Lecture 12b: Dependency analysis for automatic vectorization and parallelization of serial programs

Automatic Parallelization
- Loops are the largest source of parallelism
- Loop parallelization: different iterations on different processors; different tasks within an iteration on different processors
- Vectorization/pipelining:
  - Pipeline: breaks instructions down into substeps that are overlapped
  - Vector: the pipelined instructions operate on a vector register of fixed length

Contents
- Vector hardware
- Data dependency analysis: dependency graphs, dependency tests
- Vectorization: standard transformations, vector code generation
- Parallelization: loop scheduling

Vector Supercomputer (Register-to-Register)
(Slide figure: a host computer and control unit issue instructions; data moves from mass storage over I/O data pipes into main memory (program & data), then through vector registers into the arithmetic pipes.)

Transformation of a Loop to a Sequence of Vector Instructions

      do I = 1, N
          C(I) = A(I) + B(I)
      end do

Vectorization gives C[1:N] = A[1:N] + B[1:N], generated as:

      L     G0, N         Load vector length N
      LA    G3, C         Load addr for C
      LA    G2, B         Load addr for B
      LA    G1, A         Load addr for A
LOOP  VLVCU G0            Set up loop for 128 elements
      VLD   V1, G1        Load 128 A in V1
      VLD   V2, G2        Load 128 B in V2
      VAD   V3, V1, V2    A + B -> V3
      VSTD  V3, G3        V3 -> C
      BC    2, LOOP       If more elements, loop
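The hardware loop above processes the array in strips of at most 128 elements, one strip per trip through LOOP. The same strip-mining idea expressed as a Fortran sketch (array names and sizes are illustrative):

      program stripmine
        implicit none
        integer, parameter :: n = 1000, vl = 128
        real :: a(n), b(n), c(n)
        integer :: i, len
        a = 1.0
        b = 2.0
        do i = 1, n, vl
           len = min(vl, n - i + 1)                     ! current vector length
           c(i:i+len-1) = a(i:i+len-1) + b(i:i+len-1)   ! one vector operation
        end do
        print *, c(1), c(n)
      end program stripmine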
Expected Speedup

      do I = 1, N
          C(I) = A(I) + B(I)
      end do

Scalar code, per element (7 cycles, i.e. 7*128 for 128 elements):

      instruction                    cycles
      Load A(I) into register          1
      Load B(I) into register          1
      ADD A(I) + B(I)                  3
      Store C(I) from register         1
      Decrement counter                1

Vector code C[1:N] = A[1:N] + B[1:N], per strip of 128 elements (4*128 cycles):

      instruction                    cycles
      Load A(1:128)                   128
      Load B(1:128)                   128
      ADD A(1:128) + B(1:128)         128
      Store C(1:128)                  128

Speedup = 7/4 = 1.75

What Can Be Vectorized?
- Only DO (for) loops can be vectorized
- Only one loop in a loop nest can be vectorized
- Vectorizable loops may NOT contain: data dependencies; jumps into/out of the loop (entry/stop); loop variables other than integers; I/O statements; side effects; calls to external subprograms
- In some cases the compiler can rewrite the loop and then vectorize it partially

Different Types of Dependencies
- True/flow dependence: a variable is defined before it is used (DEF -> USE)
      S1: A = B + C
      S2: D = A + 2
      S3: E = A * 3
  (S1 δt S2, S1 δt S3)
- Anti dependence: a variable is used before it is defined
      S1: A = B + C
      S2: B = X * 3
  (S1 δa S2)
- Output dependence: a variable is assigned a value several times
      S1: A = B + C
      S2: A = X * 3
  (S1 δo S2)

Execution Order and Data Dependency
- Execution order: S(i,j,k) << S'(i',j',k') iff (i,j,k) < (i',j',k')
- Input & output sets:
      DEF(S) = the set of all variables defined by the statement S
      USE(S) = the set of all variables used by the statement S
- Data dependency between two statements S and T (S δ T) if:
  - S << T,
  - there exists a variable v such that v is in both DEF(S) and USE(T), or v is in both USE(S) and DEF(T), or v is in both DEF(S) and DEF(T), and
  - there does not exist a statement S1 such that S << S1 << T and v is in DEF(S1)

Data Dependency in Loops
- Independent loops: no iteration depends on data from any other iteration
- Dependent loops: statement S_l is dependent on statement S_k if the execution of S_k must occur before the execution of S_l
- Loop-carried dependency: the dependency depends on a loop index
- Loop-independent dependency: the dependency does not depend on a loop index

Basic Concepts
- Iteration vector: points to a specific iteration of a loop nest, i = (i_1, i_2, ..., i_n), where i_1 is the outermost index
- Distance vector: the distance between two iteration vectors, i' - i
- Dependency distance vector: if S and S' are instances of statements in a loop nest and S(i) δ S'(i'), then the dependency distance vector is dist(i, i') = i' - i
- Dependency direction vector: the same as the dependency distance vector, but only the direction is shown; (<, =, >) corresponds to (+, 0, -)
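The three dependence types can be seen in one piece of straight-line code. A self-contained Fortran sketch (statement labels follow the slide; the numeric values are illustrative):

      program dependences
        implicit none
        real :: a, b, c, d, e, x
        b = 1.0
        c = 2.0
        x = 3.0
        a = b + c      ! S1
        d = a + 2.0    ! S2: true (flow) dependence on S1 - A defined, then used
        b = x * 3.0    ! S3: anti dependence on S1 - B used by S1, then redefined
        a = x * 3.0    ! S4: output dependence on S1 - A assigned twice
        e = a * 3.0    ! S5: flow dependence on S4, the latest definition of A
        print *, a, b, d, e
      end program dependences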
Dependency Distance & Direction

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i-1)
      end do

Loop-carried dependency:
      i = 2:  S1: A(2) = B(2) + C(2)
              S2: D(2) = A(1)
      i = 3:  S1: A(3) = B(3) + C(3)
              S2: D(3) = A(2)
DEF, USE -> S1 δt S2, distance 3 - 2 = 1, direction <

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i)
      end do

Loop-independent dependency:
      i = 2:  S1: A(2) = B(2) + C(2)
              S2: D(2) = A(2)
      i = 3:  S1: A(3) = B(3) + C(3)
              S2: D(3) = A(3)
DEF, USE -> S1 δt S2, distance 2 - 2 = 0, direction =

Representation of Data Dependency
- Dependency graph: a directed graph G(V, E), where V is a set of statements and E is a set of edges representing dependencies
- Dependency cycles: chains of dependencies starting and ending at the same statement S
      S1: A = B + E
      S2: B = C
      S3: C = A
  V = {S1, S2, S3}, E = {(S1, S2), (S1, S3), (S2, S3)}: S1 δa S2, S2 δa S3, S1 δt S3

Loop Dependencies

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i-1)
      end do

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i+1)
      end do

      do i = 2, 100
          do j = 2, 99
              S1: A(i+1, j-1) = A(i, j) + C(i, j)
          end do
      end do

Review Questions
- Which dependencies exist in the code snippets on the previous slide?
- What are their direction vectors?
- What do the dependency graphs look like?
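As a hint for the third snippet (a worked sketch of my own, not from the slides): the write to A(i+1, j-1) in iteration (i, j) is read as A(i', j') in iteration (i', j') = (i+1, j-1), so there is a true dependence with distance vector (1, -1) and direction vector (<, >), carried by the outer loop:

      program nested_dep
        implicit none
        real :: a(101, 100), c(100, 100)
        integer :: i, j
        a = 1.0
        c = 2.0
        do i = 2, 100
           do j = 2, 99
              a(i+1, j-1) = a(i, j) + c(i, j)   ! S1: source at (i,j), sink at (i+1,j-1)
           end do
        end do
        print *, a(50, 50)
      end program nested_dep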