Module 4c: Pipelining


1 Module 4c: Pipelining. References: Stallings, Computer Organization and Architecture; Morris Mano, Computer Organization and Architecture; Patterson and Hennessy, Computer Organization and Design.

2 Pipeline. The performance of a computer can be increased by increasing the performance of the CPU. One way to do this is to execute more than one task at a time, a technique referred to as pipelining. The idea of pipelining is to allow the processing of a new task to begin even though the processing of the previous task has not ended.

3 The root of the single-cycle processor's problems is that the cycle time has to be long enough for the slowest instruction. Solution: break the instruction into smaller steps and execute each step (instead of the entire instruction) in one cycle. The cycle time is then the time it takes to execute the longest step, so all the steps should be kept at similar lengths. This is the essence of the multiple-cycle processor.

4 The advantages of the multiple-cycle processor: the cycle time is much shorter, and different instructions take different numbers of cycles to complete (a load takes five cycles, a jump only three). It also allows a functional unit to be used more than once per instruction.

5 Pipeline. A single task is divided into several small independent processes (segments). [Figure: processes T1, T2, T3 flowing through Segment 1, Segment 2 and Segment 3.]

6 Analogy: Pipelined Laundry. Non-pipelined approach: 1. run one load of clothes through the washer, 2. run the load through the dryer, 3. fold the clothes (optional step for students), 4. put the clothes away (also optional). Two loads? Start all over.

7 Analogy: Pipelined Laundry. While the first load is drying, put the second load in the washing machine. When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.

8 [Figure: time versus task-order diagram for four loads A, B, C, D starting at 6 PM.] Non-pipelined: 16 units of time. Pipelined: 7 units of time. Notice that all the processes have the same length.

9 Pipeline: Space-Time Diagram. Illustrates the behaviour of a pipeline. [Figure: segments S1-S4 versus clock cycles, with processes P1-P4 advancing one segment per clock; S = segment, P = process.]

10 Pipelining Lessons. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. The pipeline rate is limited by the slowest pipeline stage, and unbalanced lengths of pipe stages reduce speedup. The time to fill the pipeline and the time to drain it also reduce speedup. Stall for dependences.

11 Design Issues. Since the segments are connected in sequence, the next segment cannot start execution until it has received the result from the previous segment (in this case, pipelining is not ideal). So the cycle time of the segments must be the same. However, the execution time of each segment is generally not the same; therefore, for synchronization, the cycle time of the pipeline is based on the longest execution time among the segments.

12 Pipeline performance: Degree of Speedup. Let t_n be the cycle time without pipelining and t_p the cycle time with pipelining. An ideal pipeline divides a task into k independent sequential processes, each requiring t_p time units, so the task itself requires k*t_p time units. For n iterations of the task, the execution times are: with no pipelining, n*t_n time units; with pipelining, k*t_p + (n-1)*t_p = (k + (n-1))*t_p time units. The degree of speedup is thus S = (execution time without pipelining) / (execution time with pipelining) = n*t_n / [(k + (n-1))*t_p]. If n is much larger than k (n >> k) and t_n = k*t_p, the maximum speedup is S_max = k. (In an ideal pipeline, each process requires the same time unit.)
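As a quick aid, here is a minimal Python sketch (not part of the original slides; the function names are our own) that evaluates the formulas above:

def pipeline_time(k, n, t_p):
    """Execution time of n tasks on a k-segment pipeline: (k + (n-1)) * t_p."""
    return (k + (n - 1)) * t_p

def speedup(k, n, t_n, t_p):
    """S = n*t_n / [(k + (n-1)) * t_p]."""
    return (n * t_n) / pipeline_time(k, n, t_p)

# When t_n = k*t_p and n >> k, S approaches the maximum speedup k:
print(speedup(k=4, n=1_000_000, t_n=4 * 50, t_p=50))   # ~3.99999, i.e. -> 4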

13 The Laundromat. In a Laundromat the following processes occur: run one load of clothes through the washer: 8; run the load through the dryer: 4; fold the clothes: 5; put the clothes away: 3. Answer these: Determine the cycle time for the pipelined and non-pipelined process. Determine the execution time for both pipelined and non-pipelined execution. What is the maximum speedup possible? What is the real speedup value?

14 Non-Ideal Pipeline Structure: Example. The operands pass through all segments in a fixed sequence. Segments are separated by registers that hold the intermediate results between stages; each segment consists of a circuit that performs a sub-operation. [Figure: a control unit driving registers R1..Rm and segments S1..Sm, with Data In entering Segment 1 and Data Out leaving Segment m.]

15 Example 1: Pipeline (Non-ideal case). Given a 4-segment pipeline where each segment has the following delay time: Segment 1: 40 ns, Segment 2: 25 ns, Segment 3: 45 ns, Segment 4: 45 ns. The delay time for the interface register is 5 ns. Calculate: i) the cycle time of the non-pipelined and pipelined versions, ii) the execution time for 100 tasks, iii) the real speedup, and iv) the maximum speedup.

16 Example 1: Pipeline. [Figure: control unit with Data Input flowing through Segment 1 (40 ns), Segment 2 (25 ns), Segment 3 (45 ns) and Segment 4 (45 ns) to Data Output.] i) Cycle time: t_n = (40 + 25 + 45 + 45 + 5) ns = 160 ns; t_p = longest segment delay + interface register delay = (45 + 5) ns = 50 ns.

17 Example 1: Pipeline. ii) Execution time for 100 tasks = k*t_p + (n-1)*t_p = (k + (n-1))*t_p = ((4 * 50) + (99 * 50)) ns = (50 * 103) ns = 5150 ns. For the non-pipelined system, the total execution time for 100 tasks = n*t_n = 100 * 160 ns = 16000 ns. iii) The real speedup for 100 tasks: speedup = execution time non-pipelined / execution time pipelined = 16000 / 5150 ≈ 3.11. iv) Maximum speedup: S_max = k = 4.
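The numbers in Example 1 can be re-checked with the helper functions sketched earlier (delays in ns; the variable names are illustrative, not from the slides):

segments = [40, 25, 45, 45]
reg = 5                          # interface register delay
k, n = len(segments), 100

t_n = sum(segments) + reg        # 160 ns (non-pipelined cycle time)
t_p = max(segments) + reg        # 50 ns  (pipelined cycle time)

print(pipeline_time(k, n, t_p))  # 5150 ns
print(n * t_n)                   # 16000 ns
print(speedup(k, n, t_n, t_p))   # ~3.11 real speedup; S_max = k = 4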

18 Instruction Pipeline. The instruction cycle clearly shows the sequence of operations that take place to execute a single instruction. A good design goal of any system is to have all of its components performing useful work all of the time (high efficiency). Following the instruction cycle in a purely sequential fashion does not permit this level of efficiency. Analogy: an automobile assembly line performs all tasks concurrently, but on different (sequential) instructions. The result is temporal parallelism, i.e. the instruction pipeline. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.

19 Implementation of Instruction Pipeline: Case 1. Divide the instruction cycle into two processes: instruction fetch (fetch cycle) and everything else (execution phase). While one instruction is in execution, pre-fetch the next instruction. This assumes the memory bus will be idle at some point during the execution phase and, in the ideal situation, reduces the time to fetch an instruction to zero. Instruction prefetching is also known as fetch overlap.

20 Implementation of Instruction Pipeline: Case 1. [Figure: sequential execution runs Fetch #1, Execute #1, Fetch #2, Execute #2 one after another; the pipelined version overlaps Fetch #2 with Execute #1.]

21 Implementation of Instruction Pipeline: Case 1. Fetch and execute are not the same size: they take different times to operate, so this is not really neat or ideal; execution is generally longer than fetch. Whenever the execution unit is not using memory, the control unit increments the program counter and reads consecutive instructions from memory (efficient): fetch them and put them into a queue. But sometimes execution needs to branch, so all the prefetching becomes useless (e.g. instruction 123 is needed rather than 102): flush the queue and fetch again. Also, if there is a branch, fetch must wait for the target address from execute, and execute must wait until that address is fetched.

22 Implementation of Instruction Pipeline: Case 2. Complex computer instructions require other, finer phases, and having more stages can give further speedup. Example: use a 6-stage pipeline: Fetch instruction (FI): read the next instruction into a buffer; Decode instruction (DI): determine the opcode and operand specifiers; Calculate operands (CO): calculate the effective address of each operand; Fetch operands (FO): fetch the operands from memory; Execute instruction (EI); Write (store) operand (WO): store the result in memory. The various stages will be closer to equal duration; let's assume equal duration.

23 Implementation of Instruction Pipeline: Case 2. Use multiple execution functional units to parallelize the actual execution phase of several instructions. Use branching strategies to minimize the impact of branches.

24 Space-Time Diagram. A 6-stage pipeline can reduce the execution time of 9 instructions from 54 to 14 time units. Non-pipelined: 9 instructions x 6 stages = 54 time units. Pipelined: 6 + (9 - 1) = 14 time units.
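A minimal sketch of our own (not from the slides) that prints this ideal space-time diagram for the 6-stage pipeline and 9 instructions:

stages = ["FI", "DI", "CO", "FO", "EI", "WO"]
n_instr = 9
width = len(stages) + n_instr - 1              # total clock cycles = 14

for s, name in enumerate(stages):
    row = [""] * width
    for i in range(n_instr):
        row[i + s] = f"I{i + 1}"               # instruction i occupies stage s at cycle i+s
    print(f"{name}: " + " ".join(f"{c:>3}" for c in row))

print("Pipelined time units:", width)            # 14
print("Non-pipelined:", len(stages) * n_instr)   # 54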

25 Comments on the diagram. It assumes each instruction goes through all 6 stages, which is not always true (example: a load instruction does not need the WO stage). It assumes there are no memory conflicts (example: FI, FO and WO all involve memory access and cannot be done simultaneously, unless the FO or WO stage is null or the value needed is already in the cache, in which case this is not a problem). It assumes no interrupt or branching happens. It assumes no data dependency, where the CO stage may depend on the contents of a register that could be altered by a previous instruction still in the pipeline.

26 6-stage CPU Instruction Pipeline

27 Example 2. The estimated timings for each stage of an instruction pipeline: instruction fetch 2 ns, register read 1 ns, ALU operation 2 ns, data memory access 2 ns, register write 1 ns.

28 [Figure: program execution of mov ax, num; mov bx, num2; mov cx, num3 through the stages instruction fetch, register read, ALU, data access and register write. Without pipelining a new instruction starts every 8 ns (the sum of the stage times); with pipelining a new instruction starts every 2 ns (the longest stage time).]
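A quick check of Example 2 in Python (all times in ns; stage names as on the slide, the dictionary itself is our own framing):

stage_times = {"Instruction fetch": 2, "Register read": 1,
               "ALU operation": 2, "Data memory": 2, "Register write": 1}

t_n = sum(stage_times.values())   # 8 ns: a new instruction every 8 ns without pipelining
t_p = max(stage_times.values())   # 2 ns: a new instruction every 2 ns with pipelining
print(t_n, t_p)                   # 8 2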

29 Pipeline Limitations

30 Pipeline Limitations. These difficulties are sometimes called limitations or hazards. In general there are 3 major difficulties that cause an instruction pipeline to deviate from its normal operation (i.e. affect pipeline performance): Resource conflict (or resource hazard or structural hazard), caused by access to memory by 2 segments at the same time. Data dependency (or data hazard): a conflict arises when an instruction depends on the result of a previous instruction, but this result is not yet available. Branch difficulties (or control hazard): arise from branch and other instructions that change the PC value. Pipeline depth is often not included in the list, but it does have an effect on pipeline performance (so it will be discussed).

31 Pipeline Limitations. Hazards in the pipeline can make the pipeline stall. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, instructions issued later than the stalled instruction are stopped, while the ones issued earlier must continue. No new instructions are fetched during the stall.

32 Pipeline Limitation: Pipeline depth. If the speedup is based on the number of stages, why not build lots of stages? Each stage uses latches at its input (output) to buffer the next set of inputs, so more stages mean more hardware overhead. If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead. There is also a time overhead in the propagation time through the latches, which limits the rate at which data can be clocked through the pipeline. The logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth.

33 "6 deep", i.e. a 6-stage pipeline.

34 Pipeline Limitation: Data dependency. Pipelining must ensure that computed results are the same as if the computation were performed in strict sequential order. With multiple stages, two instructions in execution in the pipeline may have data dependencies, so the pipeline must be designed to prevent this. Data dependencies limit when an instruction can be input to the pipeline. Data dependency is shown in the following portion of a program: A = B + C; D = E + A; C = G x H; A = D / H. Here D needs A, but the value of A cannot be read because it is still being written by the previous instruction.

35 Say that this is a 5-stage pipeline: I1: ADD AX, BX ; [AX] <-- [AX] + [BX] and I2: SUB AX, CX ; [AX] <-- [AX] - [CX]. The ADD instruction does not update register AX until stage 5, but the SUB instruction needs the value at the beginning of stage 3, and it cannot fetch something that is not ready. To maintain correct operation, the pipeline must stall for 2 clock cycles, resulting in inefficient pipeline usage.

36 Solutions to Data Dependency. Hardware interlocks: an interlock is a circuit that detects instructions with a data dependency and inserts the required delays to resolve conflicts. Operand forwarding: use a circuit to detect a possible conflict, then use extra hardware to keep results for future use down the pipeline, allowing the result of the ALU to be sent directly as an input to the ALU for use by ALU operations in the next instruction cycle. Delayed load (NOP): the compiler is used to detect conflicts and reorder instructions to delay the loading of conflicting data, using the NOP instruction. A small sketch of the NOP idea follows below.
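The following toy Python sketch illustrates the compiler-side "delayed load" idea: insert NOPs when an instruction reads a register that the previous instruction writes. The (dest, sources) tuples are a simplification of our own, not a real ISA model; the 2-cycle stall matches the 5-stage ADD/SUB example above.

def insert_nops(program, stalls=2):
    """program: list of (dest_reg, (src_regs...)) tuples."""
    out, prev_dest = [], None
    for dest, srcs in program:
        if prev_dest is not None and prev_dest in srcs:
            out.extend(["NOP"] * stalls)      # delay the dependent instruction
        out.append((dest, srcs))
        prev_dest = dest
    return out

prog = [("AX", ("AX", "BX")),   # ADD AX, BX
        ("AX", ("AX", "CX"))]   # SUB AX, CX (needs the AX written by ADD)
print(insert_nops(prog))        # two NOPs appear between the instructions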

37 Solution to Data Dependency: Operand Forwarding. [Figure: ALU with inputs Src1, Src2 and result RSLT; a forwarding data path feeds the ALU result back to the ALU input, allowing the result of the ALU to be used directly as an ALU input instead of waiting for the operand store.]

38 Solution to Data Dependency: Delayed Load (NOP). [Space-time diagram: I1, NOP, I2, I3, I4 flow through the stages Fetch Instruction, Decode Instruction, Execute Instruction, Store Result, with the NOP separating I1 and I2 in every stage.]

39 Pipeline Limitation: Conflict of Resources. Occurs when two segments need to access memory at the same time. Can be solved by implementing modular memory. [Space-time diagram: instructions I1-I5 in a 4-stage pipeline (Fetch Instruction, Decode Instruction, Execute Instruction, Store Result); the fetch of I4, the fetch of an indirect operand for I3 and the store/write of I1 all need access to memory at the same time.]

40 5-stage pipeline, ideal case. We have a conflict with 2 instructions needing the same resource. Assume that memory has a single port, so data reads and writes can only happen one at a time, and assume that the source operand for I1 is in memory. A delay in any stage can cause pipeline stalls: the FI stage for I3 must idle for 1 clock cycle before beginning.

41 Handling Resource Conflict. This scenario describes a memory conflict caused by the instruction fetch of I3 and the memory-resident operand fetch of I1.

42 Handling Resource Conflict. A Harvard architecture (separate instruction and data memories) alleviates this issue.

43 Pipeline Limitation 4: Branching. For the pipeline to have the desired operational speedup, we must feed it with long strings of instructions; there must be a steady flow of instructions. However, 15-20% of instructions in an assembly-level stream are (conditional) branches, and of these, 60-70% take the branch to a target address. For a conditional branch it is impossible to determine whether the branch will be taken until it is actually executed. The impact of branches is that the pipeline never really operates at its full capacity, limiting the performance improvement that is derived from the pipeline.

44 Example 4: Branching. [Space-time diagram: once the branch is resolved in the Execute stage, the instructions already fetched and decoded behind it must be flushed out, leaving idle slots in the Fetch Instruction, Decode Instruction, Execute Instruction and Store Result stages.]

45 Solutions for Branching: delayed branch (using NOP), rearranging the instructions, and implementation of an instruction queue.

46 Solution for Branching: Delayed Branch Using NOP. When the compiler detects a branch, it automatically inserts several NOPs so that there are no interruptions in the pipeline. Example: the sequence I4 MOV, I5 INC, I6 ADD, I7 RET KK becomes I4 MOV, I5 INC, I6 ADD, I7 RET KK, NOP, NOP, NOP (before I8). The number of NOPs used is 1 less than the number of segments: if k = number of segments, then NOPs to be used = k - 1 = 4 - 1 = 3. A small sketch of this rule follows below.
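A toy Python sketch of the delayed-branch rule just described: after every branch the compiler inserts k - 1 NOPs (k = number of pipeline segments). Instructions here are plain strings, and treating "JMP"/"RET" mnemonics as branches is our own simplifying assumption.

def delay_branches(program, k=4, branches=("JMP", "RET")):
    out = []
    for instr in program:
        out.append(instr)
        if instr.split()[0] in branches:      # k - 1 NOPs after each branch
            out.extend(["NOP"] * (k - 1))
    return out

print(delay_branches(["MOV", "INC", "ADD", "RET KK"]))
# ['MOV', 'INC', 'ADD', 'RET KK', 'NOP', 'NOP', 'NOP']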

47 Solution for Branching: Delayed Branch Using NOP. [Space-time diagram: I4, I5, I6, I7, NOP, NOP, NOP, KK flow through the four stages Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result.] With 4 segments, 3 NOPs are used, wasting 3 pipeline clock cycles per stage.

48 Solution for Branching: Rearranging Instructions. Tasks are rearranged so that the pipeline can operate efficiently. How to rearrange? Bring the branch instruction up a few notches; the number of notches = number of segments - 1 = k - 1 = 4 - 1 = 3. [Space-time diagram: the original order Mov, Add, Inc, Ret becomes Ret, Mov, Add, Inc, and the instructions flow through the stages Fetch Instruction, Decode Instruction, Execute Instruction, Store Result without idle slots.]

49 Solution for Branching: Rearranging Instructions. Original order: I4 MOV, I5 INC, I6 ADD, I7 RET KK, I8. How to rearrange? Bring the branch instruction up a few notches; the number of notches = number of segments - 1 = k - 1 = 4 - 1 = 3, so the order becomes I7 RET KK, I4 MOV, I5 INC, I6 ADD. [Space-time diagram: I7, I4, I5, I6, KK flow through Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result with no flushed instructions.]

50 Example: Rearranging Instructions. I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I4 JMP LOSSY; I5 MOV DX,1; I6 INC BX; ... I11 LOSSY: DEC AX. Pipeline with branching hazard: [Space-time diagram: I1-I4 proceed normally through Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result; I5 and I6 are fetched but flushed out when the JMP executes, and the pipeline sits idle until I11 enters.]

51 Example: Rearranging Instructions. Bring the JMP instruction up by 3 notches: the order I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I4 JMP LOSSY; I5 MOV DX,1; I6 INC BX; ... I11 LOSSY: DEC AX becomes I4 JMP LOSSY; I1 MOV AX,BX; I2 ADD BX,CX; I3 INC CX; I5 MOV DX,1; I6 INC BX; ... Pipeline after rearranging instructions: [Space-time diagram: I4, I1, I2, I3, I11 flow through Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result with no flushed instructions.]

52 Solution for Branching: Instruction Queue. Prefetch the target instruction: prefetch both possible next instructions in the case of a conditional branch. [Figure: an instruction queue holding S1, S2, JMP, S4 feeding the stages Fetch Instruction, Fetch Operand, Execute Instruction, Store Result.] When the FI segment detects a JMP, the next instruction address corresponding to the JMP is determined; S4 is deleted from the queue and the new instruction is fetched.

53 Example 5: Instruction Pipeline. Given a pipeline that consists of 5 segments: Fetch Instruction, Decode Instruction, Fetch Operand, Execute Instruction and Store Result. Say n = 3: I_n: MOV AX, NUM ; AX <-- NUM. I_n+1: ADD BX, AX ; BX <-- BX + AX. I_n+2: JMP LOOP1 ; jump to I_n+10. I_n+3: ... I_n+10: LOOP1: ...

54 I3 MOV AX,NUM ; AX <-- NUM. I4 ADD BX, AX ; BX <-- BX + AX. I5 JMP LOOP1 ; jump to I13. I6 ... I13 LOOP1: Here we have a branching problem. Draw the space-time diagram for the execution of the instructions and identify the branching and data dependency hazards. [Space-time diagram: I3-I8 enter the stages Fetch Inst. (FI), Decode Inst. (DI), Fetch Opr. (FO), Execute Instr. (EI), Store Result; when I5 (JMP) is executed, all instructions fetched after it are flushed, and the pipeline is idle until the JMP completes and I13 is fetched and completed.]

55 I3 MOV AX,NUM ; AX <-- NUM. I4 ADD BX, AX ; BX <-- BX + AX. I5 JMP LOOP1 ; jump to I13. I6 ... I13 LOOP1: Here we also have a data dependency problem: I4 needs AX, but AX is still being processed by I3. [Space-time diagram: the same execution as before; the data dependency shows up where I4 must wait for the result of I3.]

56 Solve the data dependency problem using NOP instructions: insert a NOP between I3 and I4, giving the sequence I3, NOP, I4, I5, ... [Space-time diagram: I3, NOP, I4, I5, ..., I13 flow through Fetch Inst. (FI), Decode Inst. (DI), Fetch Opr. (FO), Execute Instr. (EI), Store Result.]

57 Solve the branching problem by rearranging the instructions. The number of notches = number of segments - 1 = 5 - 1 = 4, so the sequence I3, NOP, I4, I5, ... becomes I5, I2, I3, NOP, I4, ... [Space-time diagram: I5, I2, I3, NOP, I4, ..., I13 flow through Fetch Inst. (FI), Decode Inst. (DI), Fetch Opr. (FO), Execute Instr. (EI), Store Result.]

58 Solve the branching problem using NOP instructions. The sequence is I3, NOP, I4, I5, ... and the number of NOPs = number of segments - 1 = 5 - 1 = 4. [Space-time diagram: I3, NOP, I4, I5, NOP, NOP, NOP, NOP, I13 flow through Fetch Inst. (FI), Decode Inst. (DI), Fetch Opr. (FO), Execute Instr. (EI), Store Result.]

59 Try this out. Given a pipeline that consists of 4 segments: Fetch Instruction, Decode Instruction, Execute Instruction and Store Result. Use this information, with the program I1 MOV AX,BX; I2 SUB BX,2; I3 ADD BX,NUMB; I4 JMP LOSSY; I5 INC BX; I6 DEC NUMB; I7 LOSSY: DEC AX, to answer the following: Draw the space-time diagram for the execution. Label clearly the data dependency problem, the flushed instructions and the empty/idle slots. Solve the data dependency using NOP instructions. Solve the branching problem by rearranging the instructions. Draw a new space-time diagram.

60 I1 MOV AX,BX; I2 SUB BX,2; I3 ADD BX,NUMB; I4 JMP LOSSY; I5 INC BX; I6 DEC NUMB; I7 LOSSY: DEC AX. Draw the space-time diagram for the execution; label the data dependency problem, the flushed instructions and the empty/idle slots. [Space-time diagram: I1-I7 in the stages Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result; the data dependency appears between I2 and I3 (I3 needs the BX value written by I2), I5 and I6 are flushed when the JMP executes, and idle slots appear until I7 enters the pipeline.]

61 I1 MOV AX,BX; I2 SUB BX,2; I3 ADD BX,NUMB; I4 JMP LOSSY; I5 INC BX; I6 DEC NUMB; I7 LOSSY: DEC AX. A) Solve the data dependency using NOP instructions: I1, I2, NOP, I3, I4. B) Solve the branching problem by rearranging the instructions: I1, I4, I2, NOP, I3. C) Draw a new space-time diagram. [Space-time diagram: I1, I4, I2, NOP, I3, I7 flow through Fetch Inst. (FI), Decode Inst. (DI), Execute Instr. (EI), Store Result.]

62 Arithmetic Pipeline. The processing of a complex arithmetic operation is subdivided into several task segments. Every parameter involved in the arithmetic operation is processed in the segments, controlled by the CPU clock. Examples: S_i = A_i * B_i + C_i; Z_i = A_i * B_i + C_i * D_i; K_i = A_i * B_i + C_i / D_i.

63 Example 6: Designing an Arithmetic Pipeline. Design an arithmetic pipeline for the operation S = A_i * B_i + C_i where i = 1, 2, 3, ..., n. Specifications: a 3-segment pipeline; each segment is allowed to process only two tasks. Components: register (R), adder (A), multiplier (M).

64 Example 6. S = A_i * B_i + C_i where i = 1, 2, 3, ..., n. The sequence of operations (following the standard mathematical order): access the required data into the registers, perform the multiplication, then perform the addition.

65 Example 6: S = A_i * B_i + C_i. Segment definition: Segment 1: Reg1 <-- A_i, Reg2 <-- B_i. Segment 2 (multiply): Reg3 <-- A_i * B_i, Reg4 <-- C_i. Segment 3 (add): Reg5 <-- Reg3 + Reg4, S <-- Reg5. Notice that each segment executes only 2 sub-tasks.
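A minimal Python model of this 3-segment pipeline (our own sketch, not from the slides); for simplicity C_i is carried along from segment 1 instead of being read directly in segment 2, which produces the same results:

def arithmetic_pipeline(A, B, C):
    seg1 = None        # output register of segment 1: (A_i, B_i, C_i)
    seg2 = None        # output register of segment 2: (A_i*B_i, C_i)
    results = []
    for i in range(len(A) + 2):                        # +2 cycles to drain the pipe
        if seg2 is not None:                           # segment 3: add
            results.append(seg2[0] + seg2[1])
        seg2 = (seg1[0] * seg1[1], seg1[2]) if seg1 is not None else None   # segment 2: multiply
        seg1 = (A[i], B[i], C[i]) if i < len(A) else None                   # segment 1: load registers
    return results

print(arithmetic_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))    # [11, 18, 27]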
