Parallelization Techniques


The parallelization techniques for loops normally follow three steps:

1. Perform a data dependence test to detect potential parallelism. These tests may be performed over the statements in a loop, over consecutive iterations of the loop, or even over more than one loop of the original program.

2. Restructure the loop into one of the possible forms that represent total, partial or no parallelism: DOALL, DOACROSS or DOSEQ. We may also want to restructure the internal layout of the statements to find an order more suitable for other transformations, for example vectorization. Using different transformations we can obtain the greatest degree of parallelism in a program.

3. Generate parallel code for a particular computer and/or architecture by scheduling the iterations on specific processors, and then synthesizing a convenient mechanism for achieving parallelism in a shared-memory system or in a distributed-memory system.

Dependence Graph

A dependence graph is a precedence graph where nodes are statements and arcs are dependences: a directed graph G(V, E), where V = {S1, S2, ..., Sn} is a set of nodes corresponding to statements in a program, and E = {eij = (Si, Sj) | Si, Sj in V} is a set of arcs representing data dependences between statements.

    S1:   X = Y + 1
    L1:   DO I = 2, 20
    S2:     C(I) = X + B(I)
    S3:     A(I) = C(I - 1) + Z
    S4:     C(I + 1) = B(I) + A(I)
    L2:     DO J = 2, 20
    S5:       F(I, J) = F(I, J - 1) + X
    S6:   Z = Y + 2

Dependence Distance and Distance Vector

Suppose a statement S is inside a nested loop L. Let the first instance of S occur when the loop index is I1, and the second instance occur when the index is I2, where iteration I1 is the source of a dependence and I2 is the sink of the dependence relation. The dependence distance is I2 - I1 for this dependence.

    L1:   DO I = 1, N
    L2:     DO J = 1, N
    S1:       A(I, J) = B(I, J) + C(I, J)
    S2:       B(I, J + 1) = A(I, J) + B(I, J)

The distances for the above example are 0 and 1 for the I and J loops, respectively.
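To make the distance vector concrete, here is a small brute-force check written as a C sketch (our own; the bound N = 6 and all names are hypothetical). It enumerates pairs of iterations of the loop nest above, finds the pairs where S2's write of B(I, J+1) matches S1's read of B(I, J), and prints the resulting distance vector.

    #include <stdio.h>

    #define N 6  /* hypothetical loop bound, just for illustration */

    /* Brute-force dependence test for:
     *   S1: A(I,J) = B(I,J) + C(I,J)
     *   S2: B(I,J+1) = A(I,J) + B(I,J)
     * S2 writes B(i1, j1+1) and S1 reads B(i2, j2); a flow dependence
     * exists when the subscripts match (which here already implies that
     * iteration (i1,j1) precedes (i2,j2)). */
    int main(void) {
        for (int i1 = 1; i1 <= N; i1++)
            for (int j1 = 1; j1 <= N; j1++)
                for (int i2 = 1; i2 <= N; i2++)
                    for (int j2 = 1; j2 <= N; j2++)
                        if (i1 == i2 && j1 + 1 == j2) {
                            /* every matching pair yields the same vector (0, 1) */
                            printf("distance vector: (%d, %d)\n", i2 - i1, j2 - j1);
                            return 0;
                        }
        return 0;
    }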

Vectorization

The aim of vectorization is the automatic transformation of a sequential structure into code suitable for vector machines. To do this, the compiler must check all the dependences existing inside the loop. In the simplest case, when no dependences exist, the compiler can distribute the loop around each statement of the loop and create a vector statement for each one.

          DO I = 1, N
    S1:     A(I) = B(I) * C(I)
    S2:     D(I) = B(I) * K

    MODIFIED LOOP
    S1:   A(1:N) = B(1:N) * C(1:N)
    S2:   D(1:N) = K * B(1:N)

Vectorization of the simple loop (without data dependences).

Vectorization

          DO I = 1, N
    S1:     A(I+1) = B(I-1) + C(I)
    S2:     B(I) = A(I) * K
    S3:     C(I) = B(I) - 1

Program and corresponding dependence graph. [Graph: S1 and S2 are joined by flow (t) and anti (a) arcs in both directions; S3 is reached by arcs in one direction only.]

A simple analysis of the graph shows that statements S1 and S2 are strongly connected, because data dependences exist in more than one direction, and they cannot be vectorized. Statement S3, on the other hand, can be vectorized, because that condition does not hold between S3 and the other statements.

          DO I = 1, N
    S1:     A(I+1) = B(I-1) + C(I)
    S2:     B(I) = A(I) * K

    S3:   C(1:N) = B(1:N) - 1

Program after the vectorization transformation.

Vectorization (1/2) - Loop Reordering

          DO I = 1, 100
    S1:     D(I) = A(I-1) * D(I)
    S2:     A(I) = B(I) + C(I)

Program and corresponding dependence graph. After the statement reordering transformation, the program and the dependence graph are as follows:

          DO I = 1, 100
    S2:     A(I) = B(I) + C(I)
    S1:     D(I) = A(I-1) * D(I)

Modified program and corresponding dependence graph. Note that a dependence relation still exists between S2 and S1, but it is now in one direction only (from S2 to S1), which is what makes the next transformation legal.

Vectorization (2/2) - Loop Reordering

The next step of the transformation is loop distribution:

    L2:   DO I = 1, 100
    S2:     A(I) = B(I) + C(I)
    L3:   DO I = 1, 100
    S1:     D(I) = A(I-1) * D(I)

Now there are no dependences within L2 or L3, and vectorization can be done:

    A(1:100) = B(1:100) + C(1:100)
    D(1:100) = A(0:99) * D(1:100)

Program after vectorization.

Loop Fusion (also known as Loop Jamming) (1/2)

Loop fusion merges two separate loops into a single one. A data dependence test must be performed between the statements inside the two loops, to ensure that no dependence relation is being violated or created by the fusion of the loops.

    L1:   DO I = 1, N
    S1:     A(I) = B(I) + C(I+1)
    L2:   DO I = 1, N
    S2:     C(I) = A(I+2)

    FUSED LOOP
          DO I = 1, N
    S1:     A(I) = B(I) + C(I+1)
    S2:     C(I) = A(I+2)

Two original loops that can NOT be fused into one loop: in the original program S2 reads values of A produced by the whole first loop, but in the fused loop S2 would read A(I+2) two iterations before S1 computes it, changing the semantics.

Loop Fusion (also known as Loop Jamming) (2/2)

    L1:   DOALL I = 1, N
    S1:     D(I) = E(I) + F(I) + X(I)
    L2:   DOALL J = 1, N
    S2:     E(J) = D(J) * F(J)

    FUSED LOOP
    L1:   DOALL I = 1, N
    S1:     D(I) = E(I) + F(I) + X(I)
    S2:     E(I) = D(I) * F(I)

Two original loops that CAN be fused into one loop; a C sketch of this case follows.
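Where the slides use abstract DOALL loops, the legal fusion case can be rendered in C; the sketch below is our own (the function and array names are hypothetical, and the OpenMP pragma stands in for DOALL). S2 only consumes the D(I) produced by S1 in the same iteration, so fusing keeps every dependence inside one iteration and the loop remains fully parallel.

    /* Sketch of the legal fusion case, assuming arrays of length n.
     * The OpenMP pragma plays the role of the slides' DOALL. */
    void fused_doall(long n, double *D, double *E,
                     const double *F, const double *X) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            D[i] = E[i] + F[i] + X[i];  /* S1 */
            E[i] = D[i] * F[i];         /* S2: uses D[i] from this iteration */
        }
    }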

Loop Distribution

The idea of this transformation is to distribute, or separate, a complete loop around each statement in its body, or around modules inside the loop. In general, this distribution of statements is legal if there is no data dependence between a pair of statements, or if the data dependences go in one direction only.

In the example, a loop with three statements is analyzed and some dependences are found; the distribution is made taking into account the dependences found in this previous phase. In the original loop there is a flow dependence from S2 to S1: in each iteration, the value used as B(I-1) in S1 is the value calculated in the previous iteration by S2. There is also a flow dependence from S1 to S2, because the value of A(I) used in S2 was computed by the assignment to A(I+1) in S1 during the previous iteration. An antidependence exists between S1 and S3, and between S2 and S3 there is a flow dependence. These two latter dependences are directed towards S3, and in one direction only; thus, we can make the separation.

    DOSEQ I = 1, N
    S1:   A(I + 1) = B(I - 1) + C(I)
    S2:   B(I) = A(I) * K
    S3:   C(I) = B(I) - 1
    END DO

    TRANSFORMED LOOP
    DOSEQ I = 1, N
    S1:   A(I + 1) = B(I - 1) + C(I)
    S2:   B(I) = A(I) * K
    END DO
    DOALL I = 1, N
    S3:   C(I) = B(I) - 1

A loop with three statements inside, and the program after the transformation.

Loop Interchange (1/3)

The loop interchange of two nested loops is a permutation of the loop statements so that the outer loop becomes the inner loop and vice versa. Naturally, the transformation can be applied repeatedly to interchange more than two loops when the program is composed of a set of nested loops. When interchange is used for parallelization on a non-vector machine, in contrast, the most suitable strategy is to bring the parallelizable loop to the outermost position, to achieve maximum parallelism. The rule in this case, contrary to the previous one, is that the further out we bring the parallel loop, the more iterations can be launched in parallel.

    L1:   DOALL I = 2, N
    L2:     DOSEQ J = 2, M
    S1:       A(I, J) = A(I, J-1) + 1

    TRANSFORMED LOOP
    L1:   DOSEQ J = 2, M
    L2:     DOALL I = 2, N
    S1:       A(I, J) = A(I, J-1) + 1

Loop Interchange (2/3)

The original loop is not vectorizable, since the innermost loop must be executed serially, but the outer loop is a parallel one. Interchanging makes vectorization possible:

    LOOP AFTER VECTORIZATION
    L1:   DOSEQ J = 2, M
    VS:     A(2:N, J) = A(2:N, J - 1) + 1

The inner loop is transformed into one vector statement; a C rendering of this interchange follows.
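In C (our own rendering, assuming a row-major array a[N][M] with hypothetical extents), the interchanged form of the example looks as follows. The J loop carries the dependence, so it stays sequential on the outside, while the dependence-free I loop moves inside, where a compiler can vectorize it.

    #define N 1024  /* hypothetical extents for this sketch */
    #define M 1024
    static double a[N][M];

    /* Interchanged loop nest: J (which carries a(i,j) <- a(i,j-1)) is the
     * sequential outer loop; the inner I loop has no carried dependence
     * and corresponds to the vector statement A(2:N, J) = A(2:N, J-1) + 1. */
    void interchanged(void) {
        for (int j = 1; j < M; j++)        /* DOSEQ J */
            for (int i = 1; i < N; i++)    /* DOALL I: vectorizable */
                a[i][j] = a[i][j - 1] + 1.0;
    }

Note that in row-major C the inner i loop strides by M elements; a column-major (Fortran) layout would give unit stride, but the legality argument for the interchange is the same.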

Loop Interchange (3/3)

This example shows how to use loop interchange to achieve maximum parallelism. The interchange here is done to place the DOALL loop in the outermost position, so that everything inside this loop can be launched in parallel. Bringing this loop to the outermost position increases the grain of what is executed in parallel: the further out we bring the parallel loop, the more statements and loops lie inside it, and the more code can be executed in parallel.

    L1:   DOSEQ I = 2, N
    L2:     DOALL J = 1, N
    S1:       A(I,J) = A(I-1,J) + B(I)

    TRANSFORMED LOOP
    L1:   DOALL J = 1, N
    L2:     DOSEQ I = 2, N
    S1:       A(I,J) = A(I-1,J) + B(I)

Node Splitting - Loop Partitioning

Two ideas are important when considering the partitioning of a loop. One possibility is to split the statements that form a loop into parts, to eliminate some kind of data dependence existing between them. The other idea consists in simply partitioning the statements of a loop to convert the problem into independent smaller problems.

    L1:   DOSEQ I = 1, N
    S1:     A(I) = B(I) + C(I)
    S2:     D(I) = A(I-1) * A(I+1)

A loop with two statements inside, where a dependence cycle exists: in every iteration A is updated, and the value is used in the next iteration. The distance of the dependence is 1.

    INTRODUCING TEMPORARY VARIABLE AND RENAMING
    L1:   DOSEQ I = 1, N
    S3:     TEMP(I) = A(I+1)
    S1:     A(I) = B(I) + C(I)
    S2:     D(I) = A(I-1) * TEMP(I)

    REORDERING STATEMENTS
    L1:   DOSEQ I = 1, N
    S1:     A(I) = B(I) + C(I)
    S3:     TEMP(I) = A(I+1)
    S2:     D(I) = A(I-1) * TEMP(I)

Node Splitting - Loop Partitioning

The two steps of the transformation: in the first step a new variable is introduced and a renaming is done; in the second step a reordering is performed. After these steps we arrive at a form suitable for performing the distribution of the loop. The distribution creates three DOALL loops, that is, three loops that can be executed totally in parallel; a C sketch of the result follows.

    AFTER DISTRIBUTION
    L1:   DOALL I = 1, N
    S3:     TEMP(I) = A(I+1)
    L2:   DOALL I = 1, N
    S1:     A(I) = B(I) + C(I)
    L3:   DOALL I = 1, N
    S2:     D(I) = A(I-1) * TEMP(I)
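Rendered in C (a sketch under our own naming; the arrays are assumed to be allocated with valid indices 0 .. n+1), the distributed result looks like this. Copying the old A(I+1) into TEMP before A is overwritten breaks the antidependence, so each of the three loops is free of carried dependences:

    /* Node splitting, after distribution: three dependence-free loops.
     * Arrays are assumed indexed 0 .. n+1 so A[i-1] and A[i+1] are valid. */
    void node_split(int n, double *A, const double *B, const double *C,
                    double *D, double *TEMP) {
        for (int i = 1; i <= n; i++)   /* L1 (DOALL): capture the old A(I+1) */
            TEMP[i] = A[i + 1];
        for (int i = 1; i <= n; i++)   /* L2 (DOALL) */
            A[i] = B[i] + C[i];
        for (int i = 1; i <= n; i++)   /* L3 (DOALL) */
            D[i] = A[i - 1] * TEMP[i];
    }

Each loop can now run under, for example, an OpenMP parallel for, since the only remaining dependences are between the loops, not inside them.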

Node Splitting

The next example presents a special possibility of node splitting. Here we have a dependence between instances of a statement with a distance of two between them.

    L1:   DO I = 1, M
    S1:     A(I) = A(I - 2) - 2

A loop with a dependence cycle of distance 2. The iterations where a dependence exists must be performed serially, but we can perform in parallel the two groups of independent statement instances that exist in the loop:

    S1:   A(I)     <-  A(I - 2)
    S2:   A(I + 1) <-  A(I - 1)
    S3:   A(I + 2) <-  A(I)
    S4:   A(I + 3) <-  A(I + 1)
    S5:   A(I + 4) <-  A(I + 2)

Consecutive instances of the statement and the values each one reads.

Node Splitting

As the figure shows, there is a dependence relation between instance S1 and instance S3, and in general every other instance S(2i+1); the same situation holds for instances S2, S4 and so on. But the two groups are independent of each other, so we perform a split of the original loop into two independent loops:

    NODE SPLITTING, PART ONE
    DO I = 1, (M - 1)/2 * 2 + 1, STEP 2
      A(I) = A(I - 2) - 2

    NODE SPLITTING, PART TWO
    DO I = 2, M/2 * 2, STEP 2
      A(I) = A(I - 2) - 2

The loop is split into two loops, each of them performing half of the original work. We must take care with the loop indexes in this transformation.

Loop Shrinking

The purpose of loop shrinking, also known in the literature as cycle shrinking, can be considered similar to a partial loop partition. The difference is that loop shrinking always gives results at least as good as partitioning, as we will see later. Let us consider the following example to introduce the technique and how it is performed. The following loop with K statements is involved in a dependence cycle of the form

    S1 -> S2 -> ... -> Sk -> S1

where φi is the distance of the i-th dependence in the cycle (i = 1, 2, ..., k).

    DO I = 1, N
      S1
      S2
      ...
      SK

Original loop for the example.

Loop Shrinking

    DOALL J = 1, g
      DO I = J, N, g
        S1
        S2
        ...
        SK

Modified loop after a Partition transformation, where g = GCD(φ1, φ2, ..., φk) is the greatest common divisor of all k distances in the dependence cycle.

The same loop is transformed by cycle shrinking into the following loop, where λ = min(φ1, φ2, ..., φk):

    DO J = 1, N, λ
      DOALL I = J, J + λ - 1
        S1
        S2
        ...
        SK

Modified loop after a Loop Shrinking transformation.

Comparing Loop Shrinking and Loop Partitioning

In other terms, the size of the DOALL loop created by cycle shrinking is always greater than or equal to the size of the DOALL created by the Partition transformation, since g divides every distance and therefore g <= λ. The philosophy of each transformation is different:

Partition tries to group together all iterations of a DO loop that form a dependence chain. Each such group is executed serially, while different groups can execute in parallel. Dependences are confined within the iterations of each group, and dependences across groups do not exist.

Cycle shrinking groups together independent iterations and executes them in parallel. Dependences exist only across groups and are satisfied by executing the different groups in their natural sequential order.

Comparing Loop Shrinking and Loop Partitioning

    L1:   DOSEQ I = 4, N
    S1:     A(I) = B(I-2) - 1
    S2:     B(I) = A(I-3) * K

    MODIFIED LOOP
    L1:   DOSEQ J = 4, N, 2
    L2:     DOALL I = J, J+1
    S1:       A(I) = B(I-2) - 1
    S2:       B(I) = A(I-3) * K

An example of the Loop Shrinking transformation: the distances are 2 and 3, so λ = min(2, 3) = 2 and each pair of consecutive iterations executes in parallel. A C sketch of this example follows. [Iteration space graphs for the above example.]
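As a C sketch (our own rendering of the example above, with the OpenMP pragma standing in for DOALL and arrays assumed indexed so that B[i-2] and A[i-3] are valid): the loop advances in sequential blocks of λ = 2 iterations, and the two iterations inside each block are independent of each other.

    /* Cycle shrinking with lambda = min(2, 3) = 2. */
    void cycle_shrink(int n, double *A, double *B, double K) {
        for (int j = 4; j <= n; j += 2) {        /* DOSEQ J = 4, N, 2 */
            int hi = (j + 1 <= n) ? j + 1 : n;   /* guard the last block */
            #pragma omp parallel for
            for (int i = j; i <= hi; i++) {      /* DOALL I = J, J+1 */
                A[i] = B[i - 2] - 1.0;           /* S1 */
                B[i] = A[i - 3] * K;             /* S2 */
            }
        }
    }

Inside one block, iteration i+1 reads only B[i-1] and A[i-2], both produced in earlier blocks, so the two iterations never conflict; all dependences cross block boundaries and are satisfied by the sequential outer loop.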

Loop Skewing

Loop skewing extracts parallelism from multiple nested loops, in many cases where parallelism cannot be found in any single loop.

    DOSEQ I = 2, N-1
      DOSEQ J = 2, N-1
        A(I,J) = (A(I+1,J) + A(I-1,J) + A(I,J+1) + A(I,J-1)) / 4

[Iteration Space Graph for the example: the diagonal lines correspond to the wavefronts found in this iteration space.]

Loop Skewing

The transformation consists in a shift of the index set of the original loop, creating a rhomboid iteration space out of what was a square. The corresponding restructured code is the following:

    MODIFIED LOOP
    DOALL I = 2, N-1
      DO J = I+2, I+N-1
        A(I,J-I) = (A(I+1,J-I) + A(I-1,J-I) + A(I,J+1-I) + A(I,J-1-I)) / 4

[Iteration Space Graph for the modified program using the loop skewing technique: iterations on a vertical line are executed concurrently.]
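The DOALL-outer form above relies on synchronization between the skewed rows; a common executable rendering, sketched below in C (our own; N, the array, and the helper functions are assumptions of this sketch), runs the wavefronts J' = I + J sequentially and all iterations on one wavefront in parallel, which is exactly the concurrency shown by the vertical lines of the skewed iteration space graph.

    #define N 512  /* hypothetical problem size for this sketch */
    static double A[N + 1][N + 1];

    static int imax(int a, int b) { return a > b ? a : b; }
    static int imin(int a, int b) { return a < b ? a : b; }

    /* Wavefront execution of the skewed stencil: all points with equal
     * J' = I + J are mutually independent, because each point reads its
     * north/west neighbors from wavefront J'-1 and its old south/east
     * neighbors, which are not rewritten until wavefront J'+1. */
    void skewed_wavefront(void) {
        for (int jp = 4; jp <= 2 * (N - 1); jp++) {   /* sequential wavefronts */
            int lo = imax(2, jp - (N - 1));
            int hi = imin(N - 1, jp - 2);
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {          /* parallel within a front */
                int j = jp - i;                        /* back to original coords */
                A[i][j] = (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]) / 4.0;
            }
        }
    }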
