Pipelining and Vector Processing


8-1  If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leaves the other stages idle for part of every cycle.

8-2  Pipeline stalls can be caused by three types of hazards: resource, data, and control hazards. Resource hazards result when two or more instructions in the pipeline want to use the same resource; such resource conflicts can lead to serialized execution, reducing the scope for overlapped execution. Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written its result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands, so we have to worry about correctness first. Control hazards are caused by control dependencies. As an example, consider the flow control altered by a branch instruction. If the branch is not taken, we can proceed with the instructions already in the pipeline. But if the branch is taken, we have to throw away all the instructions that are in the pipeline and refill the pipeline with instructions from the branch target.

8-3  Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism, so only a simple buffer is needed between stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer. The instruction fetch unit can then prefetch instructions and place them in the instruction queue, and the decode stage has ample instructions to work on even if instruction fetch is occasionally delayed by a cache miss or a resource conflict.
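
A minimal C sketch of this fetch/decode decoupling, with a software ring buffer standing in for the instruction queue (the queue size, the miss pattern, and the fetch/decode interface are illustrative assumptions, not the text's hardware):

    #include <stdio.h>

    #define QSIZE 8
    static int queue[QSIZE];          /* prefetched instruction "words" */
    static int head, tail, count;

    /* Fetch stage: prefetch into the queue whenever there is room. */
    static void fetch(int instr) {
        if (count < QSIZE) { queue[tail] = instr; tail = (tail + 1) % QSIZE; count++; }
    }

    /* Decode stage: drains the queue; returns 0 if it would starve. */
    static int decode(int *instr) {
        if (count == 0) return 0;
        *instr = queue[head]; head = (head + 1) % QSIZE; count--;
        return 1;
    }

    int main(void) {
        int next = 0, decoded;
        for (int i = 0; i < 4; i++) fetch(next++);   /* warm up the queue */
        for (int cycle = 0; cycle < 20; cycle++) {
            /* Every fourth cycle the fetch stage misses in the cache and
               delivers nothing; decode keeps going on queued instructions. */
            if (cycle % 4 != 3) fetch(next++);
            if (decode(&decoded)) printf("cycle %2d: decode instr %d\n", cycle, decoded);
            else                  printf("cycle %2d: decode idle\n", cycle);
        }
        return 0;
    }

With the warm-up, the queue absorbs four of the five fetch misses; decode idles only once, instead of once per miss as it would with a single buffer.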

8-4  Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written its result so that I2 reads the correct input.

There are two techniques used to handle data dependencies: register interlocking and register forwarding.

Register forwarding works if the two instructions involved in the dependency are both in the pipeline. The basic idea is to provide the output result as soon as it is available in the datapath. This technique is demonstrated in the following figure. For example, if we provide the output of I1 to I2 as we write into the destination register of I1, we reduce the number of stall cycles by one (see part (a)). We can do even better if we feed the output from the IE stage, as shown in part (b); in this case, we completely eliminate the pipeline stalls.

[Figure: (a) forward scheme 1 and (b) forward scheme 2, showing instructions I1 to I4 against clock cycles.]

Register interlocking is a general technique to solve the correctness problem associated with data dependencies. In this method, a bit is associated with each register to specify whether the contents are correct. If the bit is 0, the contents of the register can be used; instructions should not read the contents of a register while this interlocking bit is 1, as the register is locked by another instruction. The following figure shows how register interlocking works for the example given below:

I1: add R2,R3,R4 /* R2 = R3 + R4 */
I2: sub R5,R6,R2 /* R5 = R6 - R2 */

I1 locks register R2 for clock cycles 3 to 5 so that I2 cannot proceed with an incorrect R2 value. Clearly, register forwarding is more efficient than the interlocking method.
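
The interlock mechanism can be sketched in C as a one-bit scoreboard (a minimal illustration; the cycle loop, the exact lock/unlock timing, and the register count are assumptions, not the text's hardware):

    #include <stdio.h>
    #include <stdbool.h>

    #define NREGS 8

    static bool locked[NREGS];   /* interlock bit per register: 1 = do not read */
    static int  stalls;

    /* I2's operand fetch checks the interlock bit; each failed check is a stall. */
    static bool try_read(int r) {
        if (locked[r]) { stalls++; return false; }
        return true;
    }

    int main(void) {
        /* Walk the example cycle by cycle: I1 (add R2,R3,R4) locks R2 during
           cycles 3..5; I2 (sub R5,R6,R2) keeps trying to fetch R2. */
        for (int cycle = 1; cycle <= 8; cycle++) {
            if (cycle == 3) locked[2] = true;    /* I1 locks its destination */
            if (cycle >= 3 && try_read(2)) {     /* I2 attempts operand fetch */
                printf("I2 reads R2 in cycle %d after %d stalls\n", cycle, stalls);
                break;
            }
            if (cycle == 5) locked[2] = false;   /* I1 writes back: unlock R2 */
        }
        return 0;
    }

Running it shows I2 stalling while R2 is locked and proceeding only after I1's writeback clears the bit.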

[Figure: register interlocking timing. R2 is locked while I1 passes through the ID, OF, IE, and WB stages; I2 stalls until R2 is released, and I3 and I4 follow behind it.]

8-5  Flow-altering instructions such as branches require special handling in pipelined processors. In the following figure, part (a) shows the impact of a branch instruction on our pipeline. Here we assume that instruction Ib is a branch instruction; if the branch is taken, it transfers control to instruction It. If the branch is not taken, the instructions already in the pipeline are useful. However, for a taken branch we have to discard all the instructions that are in the pipeline at various stages. In our example, we have to discard instructions I2, I3, and I4 and start fetching instructions at the target address. This causes our pipeline to do wasteful work for three clock cycles, which is called the branch penalty.

[Figure: (a) the branch decision is known during the IE stage: I2, I3, and I4 are discarded before the branch target instruction It enters the pipeline. (b) the branch decision is known during the ID stage: only I2 is discarded.]
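
One way to see the cost: with an ideal CPI of 1, each taken branch adds the penalty cycles to the average. The following C fragment works a small example; the branch frequency and taken fraction are invented numbers, not figures from the text:

    #include <stdio.h>

    int main(void) {
        double branch_frac = 0.20;  /* fraction of instructions that are branches */
        double taken_frac  = 0.60;  /* fraction of branches that are taken */
        int    penalty     = 3;     /* cycles wasted per taken branch (IE-stage decision) */

        /* Ideal CPI is 1; each taken branch adds 'penalty' stall cycles. */
        double cpi = 1.0 + branch_frac * taken_frac * penalty;
        printf("effective CPI = %.2f\n", cpi);   /* 1 + 0.2 * 0.6 * 3 = 1.36 */
        return 0;
    }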

8-6  Several techniques can be used to reduce the branch penalty. If we don't do anything clever, we wait until the execution (IE) stage before initiating the instruction fetch at the branch target address. We can reduce the delay if we can determine the outcome earlier. For example, if we find out whether the branch is taken, along with the target address, during the decode (ID) stage, we pay a penalty of just one cycle, as shown in part (b) of the figure in 8-5.

Delayed branch execution reduces the branch penalty further. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this slot, which is called the delay slot; the branching is delayed until after the instruction in the delay slot is executed. Some processors, such as the SPARC and MIPS, use delayed execution for both branching and procedure calls. When we apply this technique, we need to modify our program to put a useful instruction in the delay slot. We illustrate this with an example. Consider the following code segment:

        add    R2,R3,R4
        branch target
        sub    R5,R6,R
target: mult   R8,R9,R

If the branch is delayed, we can reorder the instructions so that the branch instruction is moved ahead by one instruction, as shown below:

        branch target
        add    R2,R3,R4    /* branch delay slot */
        sub    R5,R6,R
target: mult   R8,R9,R

Programmers do not have to worry about moving instructions into delay slots; this job is done by compilers and assemblers. When no useful instruction can be moved into the delay slot, a no-operation (NOP) is placed there.

Branch prediction is traditionally used to handle the branch penalty problem. We discussed three branch prediction strategies: fixed, static, and dynamic. In the fixed strategy, as the name implies, the prediction is fixed; such strategies are simple to implement and assume that the branch is either never taken or always taken. The static strategy uses the instruction opcode to predict whether the branch is taken. For example, if the instruction is an unconditional branch, we use a branch-always-taken decision. The dynamic strategy looks at run-time history to make more accurate predictions. The basic idea is to take the past n executions of the branch in question and use this information to predict the next one.

8-7  Delayed branch execution effectively reduces the branch penalty. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this slot, which is called the delay slot. In other words, the branching is delayed until after the instruction in the delay slot is executed.

8-8  In delayed branch execution, when the branch is not taken, we sometimes do not want to execute the delay slot instruction; that is, we want to nullify the delay slot instruction. Some processors, such as the SPARC, provide this nullification option.

8-9  Branch prediction is traditionally used to handle the branch problem. We discussed three branch prediction strategies: fixed, static, and dynamic.

1. Fixed branch prediction: In this strategy, as the name implies, the prediction is fixed. These strategies are simple to implement and assume that the branch is either never taken or always taken. The Motorola and VAX 11/780 processors use the branch-never-taken approach. The advantage of the never-taken strategy is that the processor can continue to fetch instructions sequentially to fill the pipeline; this involves minimum penalty in case the prediction is wrong. If, on the other hand, we use the always-taken approach, the processor prefetches the instruction at the branch target address. In a paged environment, this may lead to a page fault, and a special mechanism is needed to handle that situation. Furthermore, if the prediction turns out to be wrong, we have done a lot of unnecessary work.

The branch-never-taken approach, however, is not appropriate for a loop structure. If a loop iterates 200 times, the branch is taken 199 out of 200 times; for loops, the always-taken approach is better. Similarly, the always-taken approach is preferred for procedure calls and returns.

2. Static branch prediction: This strategy, rather than following a fixed rule, uses the instruction opcode to predict whether the branch is taken. For example, if the instruction is an unconditional branch, we use a branch-always-taken decision. We use a similar decision for loop and call/return instructions. On the other hand, for conditional branches we may use a never-taken decision. It has been shown that this strategy improves prediction accuracy.

3. Dynamic branch prediction: The dynamic strategy looks at run-time history to make more accurate predictions. The basic idea is to take the past n executions of the branch in question and use this information to predict the next one. Will this work in practice? How much additional benefit can we derive over the static approach? An empirical study suggests that we can get a significant improvement in prediction accuracy.

8-10  To show why the static strategy gives high prediction accuracy, we present sample data for commercial environments. In such environments, of all the branch-type operations, branches are about 70%, loops are 10%, and the remaining 20% are procedure calls/returns. Of the total branches, 40% are unconditional. If we use a never-taken guess for conditional branches and an always-taken guess for the rest of the branch-type operations, we get a prediction accuracy of about 82%, as shown in the following table.

    Instruction type       Distribution (%)   Branch taken?   Correct prediction (%)
    Unconditional branch   70 x 0.4 = 28      Yes             28
    Conditional branch     70 x 0.6 = 42      No              42 x 0.6 = 25.2
    Loop                   10                 Yes             10 x 0.9 = 9
    Call/return            20                 Yes             20

    Overall prediction accuracy = 82.2%

The data in this table assume that conditional branches are not taken about 60% of the time; thus, our prediction that a conditional branch is never taken is correct only 60% of the time. This gives us 42 x 0.6 = 25.2% as the prediction accuracy for conditional branches. Similarly, loops jump back with 90% probability. Since loops appear about 10% of the time, the prediction is right 9% of the time. Surprisingly, even this simple static prediction strategy gives about 82% accuracy!

If we instead apply the never-taken decision to all branch-type operations, the prediction accuracy drops to 26.2%, as shown in the following table, while the always-taken approach gives a prediction accuracy of 73.8%. In either case, the static strategy gives higher prediction accuracy.

    Instruction type       Distribution (%)   Correct prediction:   Correct prediction:
                                              never taken (%)       always taken (%)
    Unconditional branch   28                 0                     28
    Conditional branch     42                 42 x 0.6 = 25.2       42 x 0.4 = 16.8
    Loop                   10                 10 x 0.1 = 1          10 x 0.9 = 9
    Call/return            20                 0                     20

    Overall prediction accuracy =             26.2%                 73.8%
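
These accuracies follow directly from the stated instruction mix, as the following C check shows (the percentages are the sample data from this answer):

    #include <stdio.h>

    int main(void) {
        /* Sample instruction mix (percent of all branch-type operations). */
        double uncond = 70 * 0.4;   /* unconditional branches: 28% */
        double cond   = 70 * 0.6;   /* conditional branches:   42% */
        double loop   = 10, callret = 20;

        double p_taken_cond = 0.4;  /* conditional branches taken 40% of the time */
        double p_loop_back  = 0.9;  /* loops branch back 90% of the time */

        /* Static: always-taken, except conditional branches use never-taken. */
        double statics = uncond + cond * (1 - p_taken_cond)
                       + loop * p_loop_back + callret;
        /* Fixed never-taken / always-taken applied to everything. */
        double never  = cond * (1 - p_taken_cond) + loop * (1 - p_loop_back);
        double always = uncond + cond * p_taken_cond + loop * p_loop_back + callret;

        printf("static %.1f%%  never %.1f%%  always %.1f%%\n", statics, never, always);
        /* Prints: static 82.2%  never 26.2%  always 73.8% */
        return 0;
    }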

8-11  The static prediction strategy uses the instruction opcode to predict whether the branch is taken. The dynamic strategy, on the other hand, looks at run-time history to make more accurate predictions. The basic idea is to take the past n executions of the branch in question and use this information to predict the next one. Since this takes run-time conditions into account, it can potentially perform better than the static strategy. How much additional benefit can we derive over the static approach? The empirical study by Lee and Smith suggests that we can get a significant improvement in prediction accuracy. A summary of their study is presented in the following table. The algorithm they implemented is simple: the prediction for the next branch is the majority of the previous n branch executions. For example, for n = 3, if the branch was taken two or more times in the past three executions, the prediction is that the branch will be taken.

[Table: prediction accuracy (%) as a function of n for compiler, business, and scientific instruction mixes.]

The data in this table suggest that looking at the past two branch executions gives over 90% prediction accuracy for most mixes; beyond that, we get only marginal improvement.
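
A small C sketch of the Lee and Smith majority algorithm described above (the history-window handling and the sample outcome stream are invented for illustration):

    #include <stdio.h>

    /* Predict taken iff the majority of the last n outcomes were taken. */
    static int predict(const int *hist, int n) {
        int taken = 0;
        for (int i = 0; i < n; i++) taken += hist[i];
        return 2 * taken >= n;            /* ties predict taken */
    }

    int main(void) {
        int hist[3] = {1, 1, 0};          /* last three outcomes: taken, taken, not */
        int outcomes[] = {1, 0, 1, 1, 0, 1, 1, 1}, correct = 0;
        int n = sizeof outcomes / sizeof outcomes[0];

        for (int i = 0; i < n; i++) {
            if (predict(hist, 3) == outcomes[i]) correct++;
            /* Shift the actual outcome into the history window. */
            hist[2] = hist[1]; hist[1] = hist[0]; hist[0] = outcomes[i];
        }
        printf("correct: %d of %d\n", correct, n);
        return 0;
    }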

8-12  The static prediction strategy uses the instruction opcode to predict whether the branch is taken. The dynamic strategy, on the other hand, looks at run-time history to make more accurate predictions. The static strategy is simpler to implement than the dynamic strategy: implementing the dynamic strategy requires maintaining two bits for each branch instruction. However, the dynamic strategy improves prediction accuracy. In the example presented in the text (Section 8.4.2), the dynamic strategy gives over 90% prediction accuracy, whereas the prediction accuracy of the static strategy is about 82%.

8-13  A summary of the empirical study by Lee and Smith is presented in the following table. The algorithm they implemented is simple: the prediction for the next branch is the majority of the previous n branch executions. For example, for n = 3, if the branch was taken two or more times in the past three executions, the prediction is that the branch will be taken.

[Table: prediction accuracy (%) as a function of n for compiler, business, and scientific instruction mixes.]

The data in this table suggest that looking at the past two branch executions gives over 90% prediction accuracy for most mixes; beyond that, we get only marginal improvement. This implies that we need just two bits to record the history of the past two branch executions. The basic idea is simple: keep the current prediction unless the past two predictions were wrong. Specifically, we do not want to change our prediction just because our last prediction was wrong.
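
This two-bit scheme is commonly realized as a saturating counter. A minimal C sketch under that assumption (the state encoding and the outcome stream are illustrative):

    #include <stdio.h>

    /* Two-bit saturating counter: states 0,1 predict not-taken; 2,3 predict
       taken. The prediction flips only after two consecutive mispredictions. */
    static int state = 3;   /* start strongly taken, e.g. for a loop branch */

    static int predict(void) { return state >= 2; }

    static void update(int taken) {
        if (taken  && state < 3) state++;
        if (!taken && state > 0) state--;
    }

    int main(void) {
        /* A loop-like outcome stream: taken nine times, then the loop exit. */
        int outcomes[] = {1,1,1,1,1,1,1,1,1,0}, correct = 0;
        int n = sizeof outcomes / sizeof outcomes[0];
        for (int i = 0; i < n; i++) {
            if (predict() == outcomes[i]) correct++;
            update(outcomes[i]);
        }
        printf("correct: %d of %d\n", correct, n);  /* one miss, at the loop exit */
        return 0;
    }

Because one not-taken outcome only moves the counter from 3 to 2, the predictor still predicts taken when the loop is re-entered, which is exactly the "don't change on a single wrong prediction" behavior described above.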

8-14  Superscalar processors improve performance by replicating the pipeline hardware. One simple technique is to have multiple pipelines. The following figure shows a dual pipeline design, somewhat similar to that present in the Pentium. The instruction fetch unit fetches two instructions each cycle and loads the two pipelines with one instruction each. Since these two pipelines are independent, instruction execution can proceed in parallel.

[Figure: dual pipeline design. A common instruction fetch unit feeds the U pipeline and the V pipeline, each with its own instruction decode, operand fetch, instruction execution, and result write back stages.]

We can also improve performance by providing multiple execution units linked to a single pipeline, as shown in the following figure. In this figure, we are using four execution units: two integer units and two floating-point units. Such designs are referred to as superscalar processors.

[Figure: a single instruction fetch, instruction decode, and operand fetch pipeline feeding two integer execution units and two floating-point execution units, followed by a common result write back stage.]

8-15  Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units).

8-16  Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units). Superpipelined systems improve performance by increasing the pipeline depth.

8-17  The main difference is that vector machines are designed to operate at the vector level, whereas traditional processors are designed to work on scalars. Vector machines also exploit pipelining (i.e., overlapped execution) to the maximum extent. They use pipelining not only for integer and floating-point operations but also to feed data from one functional unit to another; this process is known as chaining. In addition, load and store operations are also pipelined.

8-18  Vector processing offers improved performance for several reasons, some of which are listed below:

- Flynn's bottleneck can be reduced by using vector instructions, as each vector instruction specifies a large amount of work.
- Data hazards can be eliminated due to the structured nature of the data used by vector machines.
- Memory latency can be reduced by using pipelined load and store operations.
- Control hazards are reduced as a result of specifying a large number of iterations in a single vector instruction.
- Pipelining can be exploited to the maximum extent. This is facilitated by the absence of data and control hazards. Vector machines use pipelining not only for integer and floating-point operations but also to feed data from one functional unit to another; this process is known as chaining. In addition, as mentioned before, load and store operations also use pipelining.

8-19  The vector length register holds the valid vector length (VL). All vector operations are done on the first VL elements, that is, on the elements in the range 0 to VL-1.

8-20  Larger vectors are handled by a technique known as strip mining. As an example, assume that the vector length supported by the machine is 64 and the vector to be processed consists of N = 200 elements. In strip mining, the vector is partitioned into strips of 64 elements, which leaves one odd-size piece that may be shorter than 64 elements. The size of this piece is given by (N mod 64), and the number of strips is given by (N/64) + 1, using integer division. We load each strip into a vector register and apply the vector operation. For our example, we divide the 200 elements into four pieces: three pieces with 64 elements and one odd piece with 8 elements. We use a loop that iterates four times: one iteration sets VL to 8, and the remaining three iterations set the VL register to 64.
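
Expressed in scalar C, strip mining has the following loop structure (a sketch; the element-wise addition stands in for whatever vector operation the machine performs):

    #include <stdio.h>

    #define MVL 64              /* maximum vector length supported by the machine */

    int main(void) {
        enum { N = 200 };
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        int vl = N % MVL;       /* first (odd-size) strip: 8 elements here */
        if (vl == 0) vl = MVL;
        for (int start = 0; start < N; ) {
            /* One vector instruction would process elements start..start+vl-1. */
            for (int i = start; i < start + vl; i++)
                a[i] = b[i] + c[i];
            start += vl;
            vl = MVL;           /* every remaining strip is a full 64 elements */
        }
        printf("a[199] = %.0f\n", a[199]);   /* 199 + 398 = 597 */
        return 0;
    }

The loop runs four times for N = 200, once with vl = 8 and three times with vl = 64, mirroring the iteration count derived above.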

8-21  To understand vector stride, we have to consider how the elements are stored in memory. Since vectors are one-dimensional arrays, storing a vector in memory is straightforward: vector elements are stored as sequential words. If we want to fetch 40 elements, we read 40 contiguous words from memory. These elements are said to have a stride of 1; that is, to get to the next element, we advance by one element position. Note that the distance between successive elements is measured in number of elements, not in bytes.

We need non-unit strides for multidimensional arrays. To see why, let us focus on two-dimensional matrices. If we want to store a two-dimensional matrix in memory, we have to linearize it. We can do this in one of two ways: row-major or column-major order. Most languages, except FORTRAN, use row-major order, in which elements are stored row by row: row 0, row 1, row 2, and so on. In column-major order, which is used by FORTRAN, elements are stored column by column: column 0, column 1, and so on. As an example, consider a 4 x 4 matrix A; it is stored in memory as shown in the following figure.

[Figure: (a) row-major order stores A as row 0, row 1, row 2, row 3; (b) column-major order stores A as column 0, column 1, column 2, column 3.]

Assuming row-major storage, how do we access all the elements of column 0? Clearly, these elements are not stored contiguously: we have to access elements 0, 4, 8, and 12 of the memory array. Since successive elements are separated by 4 elements, we say that the stride is 4. This is why vector machines provide load and store instructions that take the stride into account.
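
The same access patterns in C, on a row-major 4 x 4 matrix stored as a one-dimensional array (a sketch; the matrix contents are arbitrary):

    #include <stdio.h>

    int main(void) {
        enum { ROWS = 4, COLS = 4 };
        int a[ROWS * COLS];                   /* row-major: a[r*COLS + c] */
        for (int i = 0; i < ROWS * COLS; i++) a[i] = i;

        /* Row 1: contiguous elements, stride 1. */
        for (int c = 0; c < COLS; c++)
            printf("%d ", a[1 * COLS + c]);   /* elements 4 5 6 7 */
        printf("\n");

        /* Column 0: successive elements are COLS apart, stride 4. */
        for (int r = 0; r < ROWS; r++)
            printf("%d ", a[r * COLS + 0]);   /* elements 0 4 8 12 */
        printf("\n");
        return 0;
    }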

8-22  Chaining is useful when there is a conflict, as in the following example:

    A1 5            (load 5 into A1)
    VL A1           (VL = 5)
    V1 V2+FV3       (floating-point addition of V2 and V3)
    V4 V5*FV1       (floating-point multiplication of V5 and V1)

The multiplication operation takes values from V1, which are produced by the addition operation. Because of this dependency, we cannot schedule these two instructions independently. As we have seen, sequential execution of these two instructions takes 33 clocks. This is where chaining is useful: chaining allows feeding data from one operation to the next without waiting for the first operation to complete. For example, once the first result of the addition is placed in V1, we can make that value available to the multiplication instruction. The Cray X-MP allows using the first result after a delay of two clock cycles. The execution timing with chaining is shown in the following figure.

[Figure: chained execution timing. Each instruction issues (I); each vector instruction spends three setup cycles (S), flows through the floating-point pipeline (F), and delivers results R1 to R5 followed by drain cycles (D). The multiply is held (H) until the first addition result reaches it.]

The vector multiplication instruction uses chaining to get its values from the V1 vector register. The addition operation produces its first result, R1, during clock cycle 12; this result is available for use by the multiplication two clocks later. Until that time, the multiplication pipeline is in the hold (H) state. From that point onward, execution proceeds in the normal fashion. Thus, by using chaining, we complete both operations in 25 cycles rather than 33, about a 24% improvement.

8-23  The timing diagram is shown below.

[Figure: timing for A1 10; VL A1; V1 V2+V3; V4 V5+FV6. Each vector instruction issues (I), spends three setup cycles (S), flows through its pipeline (F), and delivers results R1 to R10 followed by drain cycles (D).]

8-24  The timing diagram is shown below.

[Figure: as in 8-23, with a third instruction V7 V4*FV1 that is held (H) until its chained operands become available and then flows through the multiply pipeline to deliver results R1 to R10.]

8-25  If we ignore other factors, speedup improves as we increase the number of pipeline stages, also called the pipeline depth. In practice, however, a larger pipeline depth can reduce performance. This reduction is mainly caused by data and control hazards: the probability of a pipeline stall due to these dependencies increases with the pipeline depth, and both kinds of dependencies can cause the work in the pipeline to be thrown away. In addition, longer pipelines lose more cycles on each branch when control hazards occur.

8-26  We get improved speedup as the vector size increases from 1 to VL, due to the amortization of the pipeline fill cost. As an example, consider a 10-stage pipeline with 64-element vector registers (VL = 64). Neglecting all other influences, a nonpipelined unit takes 640 time units to operate on 64 elements, whereas a pipelined unit takes 10 + 63 = 73 time units, giving a speedup of 640/73 = 8.77. Now look at what happens to the speedup when we increase the vector size by one element: the nonpipelined unit takes 650 time units, whereas the pipelined unit takes 73 + 10 = 83, since the extra element forms a strip of its own. Thus, the speedup drops to 650/83 = 7.83. Similarly, you can verify that the speedup reaches 8.77 for 128-element vectors but drops to 8.27 for 129 elements. Performance therefore peaks at multiples of VL and drops immediately after, leading to the saw-tooth shaped performance curve.
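
The saw-tooth behavior is easy to reproduce in C under this answer's assumptions (a 10-stage pipeline, VL = 64, and strip mining for longer vectors):

    #include <stdio.h>

    enum { STAGES = 10, VL = 64 };

    /* Pipelined time for n elements, processed in strips of at most VL:
       each strip pays the fill cost, then delivers one result per cycle. */
    static long pipelined(long n) {
        long t = 0;
        while (n > 0) {
            long strip = n < VL ? n : VL;
            t += STAGES + strip - 1;
            n -= strip;
        }
        return t;
    }

    int main(void) {
        long sizes[] = {64, 65, 128, 129};
        for (int i = 0; i < 4; i++) {
            long n = sizes[i];
            /* Nonpipelined: STAGES time units per element. */
            printf("n = %3ld: speedup = %.2f\n",
                   n, (double)(n * STAGES) / pipelined(n));
        }
        /* Prints 8.77, 7.83, 8.77, 8.27: peaks at multiples of VL. */
        return 0;
    }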
