A Preliminary Assessment of the ACRI 1 Fortran Compiler


Joan M. Parcerisa, Antonio González, Josep Llosa, Toni Jerez
Computer Architecture Department
Universitat Politècnica de Catalunya

Report No. UPC-DAC (also available as UPC-CEPBA-94-17)

1. INTRODUCTION

UPC is currently developing a series of tests intended to validate the performance advantages of the ACRI architecture and to assess the difference between the compiler-generated version and the hand coded version of various kernels. The subject of the following study belongs to the second of these issues. The compiler, however, is still under development, so this work has to be regarded only as a preliminary approach. Our aim is to offer some early diagnostics as soon as possible, even though they risk being superseded by later versions of the compiler.

This work explores how the current version of the af90 Fortran compiler (v0.3.0) deals with a particular set of routines that we have studied well in the past: the axpy, dot product, matrix by vector, and matrix by matrix products. Other routines and kernels will be used in the future. First, we present an outline of each of the tests being used. For each algorithm, some of the most useful transformations are described, intended either to obtain the best utilization of the resources of the architecture or to reduce memory traffic. Next, we present the performance obtained by simulation with the compiled and the hand coded version of each test, and we try to identify the reasons for the differences. The simulator that has been used is the asim (v ) architecture simulator.

2. AN OVERVIEW OF THE TESTS

In this section we review some of the algorithm transformations proposed by UPC for several linear algebra routines [1]. They are intended to exploit data locality, thus reducing the memory traffic, and also to reduce the negative effect of data dependencies, in order to achieve the maximum utilization of the functional units in the Stream Units, avoiding stalls or idle cycles. All of the following routines take advantage of the efficient use of the guard register and the shifting register file when applying software pipelining to the loops. The guard register is useful to eliminate the code corresponding to the prologue and the epilogue of the pipelined loop, by executing each instruction conditionally on its guard (which is derived from the pipestage where it is scheduled). The shifting register file is useful for holding temporary values in a pipelined loop, as it provides a means of renaming registers, thus avoiding having to unroll the loop to prevent data dependencies.
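To make the role of the guard register more concrete, the following is a minimal Fortran sketch (ours, not ACRI code, and assuming N >= 1) of a two-stage software-pipelined loop written without guard support, i.e. with an explicit prologue and epilogue; the scalar T stands for the register that carries the product between pipestages.

      T = ALPHA * X(1)             ! prologue: product of the first iteration
      DO I = 2, N
         Y(I-1) = Y(I-1) + T       ! addition of iteration I-1 ...
         T = ALPHA * X(I)          ! ... overlapped with the product of iteration I
      ENDDO
      Y(N) = Y(N) + T              ! epilogue: the last addition

With the guard register, this prologue and epilogue code is unnecessary: each kernel instruction is simply executed conditionally on its guard during the filling and draining pipestages.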

2.1. AXPY

The following is the basic algorithm, in Fortran:

DO I=1, N
   Y(I) = Y(I) + ALPHA * X(I)
ENDDO

It has no recurrences, and it can be coded efficiently by applying software pipelining, using the guard register and the shifting register file. The product must be held in a register for several cycles until it is read by the addition. Since the loop is pipelined, this register must be protected from being overwritten by the products issued in the following pipestages before the addition takes place. One method would be to unroll the loop as many iterations as needed to ensure that each product and its corresponding addition are issued in the same pipestage, so that each unrolled product can be coded to write to a different register. The second method consists of using the shifting register file as explained in the previous section. This is more efficient, since the code obtained is shorter and, unlike the unrolled loop, it can deal with any value of the loop counter. With these two features, the loop body of the DU code looks like this:

.bflags 4, 0, 5, 5
.block 2
g1 mult $3, $lq, $s0; g6 addt $s5, $rq, $sq
.block

2.2. The DOT product

The Fortran code for the basic algorithm is:

DO I=1, N
   DOT = DOT + X(I) * Y(I)
ENDDO

Here the loop has a recurrence with a distance of one iteration. Only one packet (two instructions) has to be issued per iteration, but the addition uses the same register as source and destination operand. Since the latency of the addition is 3 cycles, the scoreboarding will stall the DU for 2 additional cycles at every iteration, that is, it triples the execution time.

Transformation: To avoid the effect of the recurrence, we split the variable DOT into three instances and unroll three iterations of the loop, so that each one performs the addition with a different variable. Obviously, at the end of the loop it is necessary to reduce the split variable by adding the three values. The following pseudo-code illustrates this idea:

dot1 = DOT
dot2 = 0
dot3 = 0
for i=1 to n step 3 do
   dot1 = dot1 + X(i)  * Y(i)
   dot2 = dot2 + X(i+1)* Y(i+1)
   dot3 = dot3 + X(i+2)* Y(i+2)
endfor
DOT = dot1 + dot2 + dot3

Again, as in the AXPY routine, the shifting register feature will split the variable into different registers automatically by renaming it, instead of unrolling the loop. The loop body of the DU code will be:

.bflags 4, 0, 5, 5
.block 2
g1 mult $lq, $rq, $s0; g6 addt $s5, $s9, $s6
.block

Again, there is an additional benefit of using the register shifting: it works with any value of the loop counter, while the 3-iteration unrolled loop only deals with sizes that are multiples of 3 and thus needs some additional code. On the other hand, the drawback of this transformation is an additional cost of 9 cycles due to the reduction (3 additions).

2.3. Matrix by Vector

The Fortran code for the basic algorithm is:

DO I=1, M
   DO J=1, N
      Y(I) = Y(I) + A(I,J) * X(J)
   ENDDO
ENDDO

Here there are 4 references to memory per iteration, but if we hold Y(I) in a register during the calculation of the innermost loop, the number of references is reduced to 2, and only one packet (two instructions) must be issued per iteration. But, just like in the DOT routine, it will take 3 cycles per iteration due to the one-iteration recurrence.
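As an illustration of keeping Y(I) in a register across the innermost loop, here is a minimal Fortran sketch; the scalar S is ours (it stands for the register holding Y(I)) and does not appear in the original routine.

      DO I=1, M
         S = Y(I)                   ! Y(I) held in a register during the inner loop
         DO J=1, N
            S = S + A(I,J) * X(J)   ! only A(I,J) and X(J) are read from memory
         ENDDO
         Y(I) = S                   ! written back once per row
      ENDDO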

Transformation 1: If we hold Y(I) in a register during the innermost loop calculation, this loop is exactly the same as that of the DOT product, so the same transformation can be applied to it, and the code generated for the innermost loop body will also be the same as that of the DOT.

Transformation 2: The recurrence can be avoided by applying strip-mine and interchange plus loop unrolling. That is, for example, we apply strip mining to the outer loop so that it processes the matrix in strips of 3 rows (each strip is then processed in a new nested loop which iterates only 3 times). Next, we interchange this new loop with the innermost loop. And finally, we completely unroll the new innermost loop (3 iterations), so that we again have 2 loops:

for i=1 to M step 3 do
   for j=1 to N do
      Y(i)   = Y(i)   + A(i,j)   * X(j)
      Y(i+1) = Y(i+1) + A(i+1,j) * X(j)
      Y(i+2) = Y(i+2) + A(i+2,j) * X(j)
   endfor
endfor

And the DU code for the body of the innermost loop will look like this:

.bflags 4, 0, 5, 2
.block 2
g1 mult $lq, $rq, $s0; g3 addt $s4, $r4, $r4
g1 mult $lq, $rq, $s2; g3 addt $s7, $r5, $r5
g1 mult $lq, $rq, $s5; g2 addt $s1, $r3, $r3
.block

After the transformation, the innermost loop calculates in parallel the dot products associated with 3 consecutive rows. The recurrence still has a distance of 1 iteration, but now we execute 3 packets per iteration, so that when an addition takes place the previous one has already written its result, and there is no stall due to the dependence. Unlike Transformation 1, this technique does not imply any additional cost due to the reduction, but it requires 3 registers to hold the values of Y(i), Y(i+1) and Y(i+2). Furthermore, there is an additional benefit of this transformation: the value of X(J) can be reused 3 times by keeping it in a register. This idea is easily extended to minimize the memory references to X by increasing the width of the strips (thus increasing the unrolling degree of the innermost loop) as much as possible. Although 57 iterations is the maximum unrolling length due to the available number of registers, 56 is a handier unrolling factor. Thus, the number of references to X is reduced from MN to MN/56, and the total number of references becomes MN(1+1/56) + 2M. As the cycle count is about MN cycles, the memory traffic generated is near 1 reference/cycle, that is, it is almost halved.
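The reuse of X(J) mentioned above can be sketched in Fortran as follows for the 3-row strip; the scalar XJ is ours and stands for the register that holds the reused element.

      DO I=1, M, 3
         DO J=1, N
            XJ = X(J)                        ! loaded once, reused by the 3 products
            Y(I)   = Y(I)   + A(I,J)   * XJ
            Y(I+1) = Y(I+1) + A(I+1,J) * XJ
            Y(I+2) = Y(I+2) + A(I+2,J) * XJ
         ENDDO
      ENDDO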

On the other hand, moving X(J) to a register takes 3 additional cycles per iteration (because of the dependence with the first product), so the cycle count for the loop body increases from 56 to 59 cycles. There is an even more sophisticated schedule which unrolls the loop only 52 times but reduces this overhead to a single cycle per iteration, that is, 53 cycles; it is so irregular, however, that we do not consider it a reference for our compiler diagnostics.

2.4. Matrix by Matrix

The Fortran code for the basic algorithm is:

DO J=1, N
   DO I=1, M
      DO K=1, P
         C(I,J) = C(I,J) + A(I,K) * B(K,J)
      ENDDO
   ENDDO
ENDDO

Again, the 4 memory references per iteration can be reduced to 2 by allocating C(I,J) to a register, and a single packet would be enough to code the innermost loop in each stream unit. It would also take 3 cycles per iteration in the DU due to the one-iteration recurrence.

Transformation 1: Assuming that we hold C(I,J) in a register, the innermost loop has the same structure as DOT, so the same transformation applied to DOT can be used here, and the code generated for the innermost loop body can also be the same. The memory references are reduced from 4MNP to (2+2N)MP, that is, nearly half the memory traffic.

Transformation 2: The two innermost loops are identical to the Matrix by Vector algorithm, so the transformation of the previous section, based on strip mining, loop interchange and unrolling, can be applied to them, and the code generated for the innermost loop body will also be the same. The memory references are reduced to (2+57N/56)MP, that is, nearly half that of Transformation 1.

Transformation 3: Here, strip mining, loop interchange and unrolling can be extended by adding an additional dimension. Now we update the C matrix by blocks instead of strips, exploiting more of the locality of this algorithm by using the register file as much as possible. That is, we perform the strip mining on both the outermost and the middle loops, with widths a and b. Next, we interchange loops so that the two new loops become the innermost loops, and finally

we unroll them, so that we again get only three loops: J, I and K. But now the K loop body has a·b lines of code. On the current architecture, the optimal values to minimize the memory traffic are a=7 and b=7. The algorithm will look like this:

DO J=1, N, b
   DO I=1, M, a
      DO K=1, P
         C(I, J) = C(I, J) + A(I, K) * B(K, J)
         C(I+1, J) = C(I+1, J) + A(I+1, K) * B(K, J)
         C(I+2, J) = C(I+2, J) + A(I+2, K) * B(K, J)
         C(I+3, J) = C(I+3, J) + A(I+3, K) * B(K, J)
         ...
         C(I+a-1,J) = C(I+a-1,J) + A(I+a-1,K) * B(K, J)
         C(I, J+1) = C(I, J+1) + A(I, K) * B(K, J+1)
         C(I+1, J+1) = C(I+1, J+1) + A(I+1, K) * B(K, J+1)
         C(I+2, J+1) = C(I+2, J+1) + A(I+2, K) * B(K, J+1)
         C(I+3, J+1) = C(I+3, J+1) + A(I+3, K) * B(K, J+1)
         ...
         C(I+a-1,J+1) = C(I+a-1,J+1) + A(I+a-1,K) * B(K, J+1)
         ...
         C(I, J+b-1) = C(I, J+b-1) + A(I, K) * B(K, J+b-1)
         C(I+1, J+b-1) = C(I+1, J+b-1) + A(I+1, K) * B(K, J+b-1)
         C(I+2, J+b-1) = C(I+2, J+b-1) + A(I+2, K) * B(K, J+b-1)
         C(I+3, J+b-1) = C(I+3, J+b-1) + A(I+3, K) * B(K, J+b-1)
         ...
         C(I+a-1,J+b-1) = C(I+a-1,J+b-1) + A(I+a-1,K) * B(K, J+b-1)
      ENDDO
   ENDDO
ENDDO

The reutilization is done by holding A(I,K) to A(I+a-1,K) and B(K,J) to B(K,J+b-1) in registers during the calculation. Moving those values to registers has an additional cost of 8 cycles per iteration, so the innermost loop body cycle count increases from 49 to 57 cycles. But the number of memory references is reduced to (2+N+N/56)MP, which is nearly half that of the previous transformation.
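To make the register reuse of Transformation 3 concrete without listing the full 7x7 block, here is a hedged Fortran sketch with small block widths a=2 and b=2 (the report uses a=b=7); the scalars A0, A1, B0 and B1 are ours and stand for the registers holding the reused A and B elements, and the C(I,J) terms are likewise assumed to be kept in registers across the K loop.

      DO J=1, N, 2
         DO I=1, M, 2
            DO K=1, P
               A0 = A(I,K)                       ! a = 2 elements of A, each reused b times
               A1 = A(I+1,K)
               B0 = B(K,J)                       ! b = 2 elements of B, each reused a times
               B1 = B(K,J+1)
               C(I,  J)   = C(I,  J)   + A0 * B0
               C(I+1,J)   = C(I+1,J)   + A1 * B0
               C(I,  J+1) = C(I,  J+1) + A0 * B1
               C(I+1,J+1) = C(I+1,J+1) + A1 * B1
            ENDDO
         ENDDO
      ENDDO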

3. SIMULATION RESULTS

Table 1 compares the performance of the compiled axpy routine to the hand coded version. The differences are not very important, except for some additional overhead: the compiler generates a kernel loop code similar to the hand coded version.

Table 1: Cycle count with the axpy routine. Columns: N, hand coded, compiled.

Table 2 includes results for 2 versions of the dot routine. In the first one, the successive products are accumulated on the parameter of the routine. The second version first copies this parameter to a local variable, then uses it to do the calculations, and finally copies the result back to the parameter. In the first case, the compiler generates inefficient code which issues a dispatch at every iteration. But in the second case, the generated code has a kernel loop with a body similar to that of the hand coded version, i.e. the compiler performs the transformation explained in the previous section. Again, the differences are in the start-up code of the loop.

Table 2: Cycle count with the DOT routine. Columns: N, hand coded, compiled 1, compiled 2 (with local var).

Table 3 describes the results of 2 versions of the Matrix by Vector routine. First, we have compiled the original Fortran routine; the code generated by the compiler follows Transformation 1 explained in the previous section. Secondly, we have applied Transformation 2 (unrolling 3 iterations) directly to the Fortran source program, and then the code generated for the body of the kernel loop is well optimized, just as we would do by hand. However, there is a significant loss of

performance in the compiled routine with respect to the hand coded version. An effort has been made to find out why it loses so much time, and it is described in the following section.

Table 3: Cycle count (and Mflops) with the MxV routine. Columns: N (rows), M (cols), hand coded (56 unrolled), compiled 1 (not unrolled), compiled 2 (3 unrolled).

Table 4 shows the results of the Matrix by Matrix routine. As with the MxV, the compiler by itself is only capable of applying Transformation 1 explained in the previous section, and the compiler results are considerably worse than those of the hand coded version. Here the differences are even bigger than in the previous algorithm, but the reasons for them are the same (refer to the next section).

Table 4: Cycle count (and Mflops) on the MxM routine. Columns: M, N, P, hand coded (49 unrolled), compiled (not unrolled). Note: The cells marked (*) denote unavailable results due to simulator failure.

4. SOME EARLY DIAGNOSTICS

From the tests analyzed above, we can see that the current version of the compiler (version 0.3.0) can efficiently apply loop pipelining and some loop transformations, such as that of the DOT routine. Other transformations, like strip-mine plus interchange and unroll, which are oriented towards block algorithms that exploit temporal locality more efficiently, are not currently performed. They can be programmed directly in the Fortran code but, if the number of unrolled iterations is too large, the compiler generates code with a separate dispatch for each iteration.

A significant loss of performance has been detected when testing the MxV routine (and also with MxM). Table 5 illustrates those lost cycles and shows how closely they are related to the number of dispatches being issued. As a near-constant overhead seems to be added to each dispatch, we have focused our attention on the setup sections (stream units) of the innermost loop, and we have identified in the AU code, in the setup section of the innermost loop, some sequences of instructions that produce Loss of Decoupling.

Table 5: Loss of cycles of the MxV (3 unrolled) compiled routine with respect to a hand coded version (56 unrolled). Columns: N (rows), M (cols), dispatches, lost cycles, lost cycles/dispatch.

Table 6: Loss of cycles of the MxM (not unrolled) compiled routine with respect to a hand coded version (49 unrolled). Columns: N, M, P, dispatches, lost cycles, lost cycles/dispatch.

Typically, these sequences look like the following:

ldq aq, address     ; begins an access to memory
(a few instructions)
mov aq, $57         ; aq is still empty => AU stalls waiting for memory

The AU stalls for approximately one memory latency period. Another variant of this sequence has been found, preceded by a store-queue instruction:

stq address         ; puts the address into the SAQ
ldq aq, address     ; a hit in the SAQ => bypass
(a few instructions)
mov aq, $57         ; aq is still empty => AU stalls waiting for the DU

Here, as the data comes from the DU (via a bypass), the AU must wait until it gets synchronized with the DU. If the same sequence appears later, its impact will then be smaller. But if, instead, the AU issues a ldq for a DU queue and the DU immediately requires that data element, the DU will be stalled until memory delivers the data. That is to say, these sequences produce, by themselves or in combination, a severe degradation of performance. The stalls have more impact when the loop count of the dispatches is small, because they are placed in the setup section of the loop. We believe that many of these problems could be alleviated by a higher reutilization of data at the AU, by simplifying the communication protocol between the stream units, or by improving the instruction scheduling on the AU.

REFERENCES

[1] J. Cortadella et al., "Linear Algebra Routines and FFT on the ACRI Architecture", SHIPS P Deliverable, June 92 - May 93, Univ. Politècnica de Catalunya.
