Microcomputer Systems, Chapter 6: Enhancing Performance with Pipelining. Prof. 陳伯寧, Department of Communications Engineering, National Chiao Tung University


Pipelining is natural!
Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- The washer takes 30 minutes.
- The dryer takes 40 minutes.
- The folder takes 20 minutes.

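Sequential laundry
[Figure: timeline of loads A, B, C, D in task order, each load started only after the previous one finishes.]
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

Pipelined laundry: start work ASAP
[Figure: timeline of loads A, B, C, D in task order, each load started as soon as the washer is free.]
Pipelined laundry takes 3.5 hours for 4 loads.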

Pipelining lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of pipeline stages.
- Unbalanced lengths of pipe stages reduce speedup.
- Time to fill the pipeline and time to drain it reduce speedup.
- Stall for dependences.

Pipelining is natural! (second example, balanced stages)
Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- The washer takes 30 minutes.
- The dryer takes 30 minutes.
- The folder takes 30 minutes.
- The storer takes 30 minutes to put clothes into drawers.

Sequential laundry
[Figure: timeline of loads A, B, C, D in task order, one after another.]
Sequential laundry takes 8 hours for 4 loads. If they learned pipelining, how long would laundry take?

Pipelined laundry: start work ASAP
[Figure: timeline of loads A, B, C, D in task order, overlapped.]
Pipelined laundry takes 3.5 hours for 4 loads!
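The laundry totals above can be checked with a short calculation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, the stage times (30/40/20 minutes and 30/30/30/30 minutes) are taken from the two examples, and the model assumes a load may wait between machines for as long as needed.

```python
def sequential_time(stage_minutes, loads):
    """Total time when each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Total time when each load enters a stage as soon as the load and the stage are both free.

    Assumption: unlimited waiting room between stages.
    """
    # stage_free[s] = time at which stage s finishes the load it handled most recently
    stage_free = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, stage_free[s])
            stage_free[s] = start + duration
            ready = stage_free[s]
    return stage_free[-1]


if __name__ == "__main__":
    for stages in ([30, 40, 20], [30, 30, 30, 30]):
        seq = sequential_time(stages, 4)
        pipe = pipelined_time(stages, 4)
        print(stages, "sequential:", seq / 60, "h", "pipelined:", pipe / 60, "h")
```

In the first example the 40-minute dryer sets the pace, which is why the unbalanced three-stage pipeline finishes no sooner than the balanced four-stage one.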

Pipelining lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of pipeline stages.
- The pipeline rate is limited by the slowest pipeline stage.
- Unbalanced lengths of pipe stages reduce speedup.
- Time to fill the pipeline and time to drain it reduce speedup.
- Stall for dependences.

The five stages of a load instruction
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr
- Ifetch: instruction fetch. Fetch the instruction from the instruction memory.
- Reg/Dec: register fetch and instruction decode.
- Exec: calculate the memory address.
- Mem: read the data from the data memory.
- Wr: write the data back to the register file.

Pipelined execution
[Figure: six instructions in program-flow order, each passing through IFetch, Dcd, Exec, Mem, WB and each starting one cycle after the previous one.]
Utilization? Now we just have to make it work.

Single-cycle, multiple-cycle, and pipelined implementations
[Figure: clock timing over cycles 1 to 10 for three implementations.
- Single-cycle implementation: Load fills one long cycle; Store takes the same cycle and wastes part of it.
- Multiple-cycle implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch, ...).
- Pipelined implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped one cycle apart.]

Why pipeline?
Time per step for each instruction class (R-format = add, sub, and, or, slt):

Step                lw       sw       R-format   branch
Instruction fetch   200 ps   200 ps   200 ps     200 ps
Register read       100 ps   100 ps   100 ps     100 ps
ALU operation       200 ps   200 ps   200 ps     200 ps
Data access         200 ps   200 ps   -          -
Register write      100 ps   -        100 ps     -
Total time          800 ps   700 ps   600 ps     500 ps

Why pipeline? Single-cycle vs. pipelined performance
[Figure: program execution order against time for
    lw $1, 100($0)
    lw $2, 200($0)
    lw $3, 300($0)
Single-cycle: each instruction takes 800 ps, one after another (2400 ps in total). Pipelined: a new instruction starts every 200 ps (1400 ps in total).]
Speed-up factor = 2400/1400 = 1.7

Why pipeline? Single-cycle vs. pipelined performance
Suppose we add N additional instructions to the previous 3 instructions.
- Single-cycle machine: 2400 + N * 800 ps
- Pipelined machine: 1400 + N * 200 ps
The speed-up factor becomes (2400 + 800N)/(1400 + 200N), which approaches 4 as N grows.

Speed-up of pipeline
The ideal speed-up from pipelining equals the number of pipeline stages: in the previous example, 5. However, because the stages are imbalanced, the achievable speed-up is the time required by the longest instruction divided by the time required by the longest stage, i.e., 800 ps / 200 ps = 4. As the instruction count increases, the speed-up of the pipeline approaches this value, i.e., 4 in the previous example. Notably, 5 is a non-achievable speed-up ratio.
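The speed-up argument above can be tried out numerically. The sketch below is illustrative only; the function name and its default parameters (800 ps single-cycle time, 200 ps stage time, 5 stages, 3 starting instructions) are assumptions taken from the example, not a general performance model.

```python
def speedup(extra_instructions, single_cycle_ps=800, stage_ps=200,
            stages=5, base_instructions=3):
    """Speed-up of the pipelined machine over the single-cycle machine."""
    n = base_instructions + extra_instructions
    single_cycle_total = n * single_cycle_ps
    # The first instruction needs `stages` cycles; every later one finishes one cycle after.
    pipelined_total = (stages - 1 + n) * stage_ps
    return single_cycle_total / pipelined_total


if __name__ == "__main__":
    for extra in (0, 10, 1000, 1_000_000):
        print(extra, round(speedup(extra), 4))
```

With no extra instructions the ratio is 2400/1400, and it climbs toward 800/200 = 4 as the fill time is amortized over more instructions; it never reaches the stage count of 5.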

Why pipeline? Because the resources are there!
[Figure: instructions Inst 0 to Inst 4 in instruction order against time (clock cycles); each uses instruction memory (Im), registers, ALU, data memory (Dm), and registers in successive cycles.]

Can pipelining get us into trouble?
Yes: pipeline hazards, defined as the moments when the next instruction cannot be executed in the following clock cycle.
1. Structural hazards: an attempt to use the same resource in two different ways at the same time. E.g., the next instruction and the data to be written are placed in the same memory chip.
2. Control hazards: an attempt to make a decision before the condition is evaluated. E.g., branch instructions.

Can pipelining get us into trouble?
[Figure: pipeline stall on a conditional branch, with times in ps.]
If one cannot resolve the branch in the second stage, then an even larger slowdown will occur for stalls on branches. E.g., gcc consists of 7% of conditional branches.

Can pipelining get us into trouble?
Why is a pipeline stall nicknamed a bubble? [Figure: an example explanation.]
Some may wish to use prediction to resolve the stall on branches.
- Just execute the next instruction anyway (guess that the condition will fail).
- If the guess of executing the next instruction is wrong, just turn the wrongly started instruction into a bubble.
- A dynamic prediction based on the branch history is also possible!

Can pipelining get us into trouble?
[Figure: two pipeline diagrams, one labeled "Guess is right!" and one labeled "Guess is wrong!".]

Can pipelining get us into trouble?
Another way to solve the control hazard: delayed branch.
Ask the assembler to reorder the program such that an instruction that is not affected by the branch is placed after the conditional branch instruction. E.g., the assembly code in the previous slide can be reordered in this way. [Figure: the reordered assembly code.]

Can pipelining get us into trouble?
3. Data hazards: an attempt to use an item before it is ready; more specifically, an instruction depends on the results of a previous instruction still in the pipeline. E.g.,
    add $s0, $t0, $t1    ; $s0 = $t0 + $t1
    sub $t2, $s0, $t3    ; $t2 = $s0 - $t3
Since the second instruction has to wait for the first instruction to pass the fifth stage, three additional bubbles may need to be added to the pipeline.

Can pipelining get us into trouble?
Solution to data hazards: forwarding or bypassing. [Figure: the result of add is passed directly from the ALU output to the ALU input of sub.]

Can pipelining get us into trouble?
Sometimes an additional stall is needed even with forwarding or bypassing. [Figure: a stall inserted between two dependent instructions.]

Summary
- Pipelining is a fundamental concept: multiple steps using distinct resources.
- Utilize the capabilities of the datapath by pipelined instruction processing:
  - start the next instruction while working on the current one;
  - limited by the length of the longest stage (plus fill/flush latency);
  - detect and resolve hazards.
- Pipeline control must detect the hazard and take action (or delay action) to resolve it.

Pipelining
What makes it easy (for MIPS)?
- All instructions are the same length.
- There are just a few instruction formats.
- Memory operands appear only in loads and stores.
What makes it hard (in general)?
- Structural hazards: suppose we had only one memory.
- Control hazards: need to worry about branch instructions.
- Data hazards: an instruction depends on a previous instruction.

Pipelining
- We'll build a simple pipeline and look at these issues.
- We'll talk about modern processors and what really makes it hard:
  - exception handling;
  - trying to improve performance with out-of-order execution, etc.

Basic idea: what do we need to add to actually split the datapath into stages?
- IF: instruction fetch
- ID: instruction decode / register file read
- EX: execute / address calculation
- MEM: memory access
- WB: write back
[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, shift left 2, ALU, data memory) divided into the five stages, annotated with one path marked "possible cause of control hazard" and one marked "possible cause of data hazard".]

Pipelined datapath
[Figure: the same datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

Pipeline example: lw $s, 2($t)
[Figures: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) drawn once per stage. Shaded area = active area; right-half shading = read and left-half shading = write.]
- 1st stage: instruction fetch
- 2nd stage: instruction decode
- 3rd stage: execution
- 4th stage: memory
- 5th stage: write back

Problem with the previous design
By the time the load reaches the write-back stage, the write register number has been changed by a subsequent instruction.

Corrected datapath
[Figure: pipelined datapath in which the write register number is carried along through ID/EX, EX/MEM, and MEM/WB.]

Graphically representing pipelines
- Multiple-clock-cycle pipeline diagrams: an overall view across multiple clock cycles.
- Single-clock-cycle pipeline diagrams: each view shows a snapshot of one clock cycle.

Multiple-clock-cycle pipeline diagram
[Figure: program execution order (in instructions) against time (in clock cycles) for
    lw  $10, 20($1)
    sub $11, $2, $3
]
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths.

Single-clock-cycle pipeline diagram
[Figure: datapath snapshot for one clock cycle, with lw $10, 20($1) in the first stage.]

Single-clock-cycle pipeline diagrams
[Figures: one datapath snapshot per clock cycle as lw $10, 20($1) and then sub $11, $2, $3 advance through IF/ID, ID/EX, EX/MEM, and MEM/WB; the last snapshot shows only sub.]

Pipeline control
[Figure: pipelined datapath with its control lines: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg. The instruction fields [15-0], [20-16], and [15-11] feed the sign extender, the ALU control, and the RegDst multiplexer.]

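Pipeline control
We have 5 stages. What needs to be controlled in each stage?
- Instruction fetch and PC increment
- Instruction decode / register fetch
- Execution
- Memory stage
- Write back
How would control be handled in an automobile plant?
- A fancy control center telling everyone what to do?
- Should we use a finite state machine?

Pipeline control
Pass control signals along just like the data.

              Execution/address calculation    Memory access               Write-back
              stage control lines              stage control lines         stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc   Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0        0       0        0          1         0
lw            0       0       0       1        0       1        0          1         1
sw            X       0       0       1        0       0        1          0         X
beq           X       0       1       0        1       0        0          0         X

[Figure: the control unit generates the EX, M, and WB control fields in the ID stage; they travel through ID/EX, EX/MEM, and MEM/WB, each stage using its own field and passing on the rest.]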

Datapath with control
[Figure: pipelined datapath with the control unit and the WB, M, and EX control fields stored in the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Dependences
Problem with starting the next instruction before the first one is finished: dependences that go backward in time are data hazards.

Dependences
[Figure: multiple-clock-cycle diagram showing the value of register $2 in each clock cycle for
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
]

Software solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
becomes
    sub $2,  $1, $3
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Problem: this really slows us down!
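The nop placement in the software solution can be reproduced mechanically. The following is a rough sketch rather than a real assembler pass: the tuple layout `(text, dest, sources)` and the function name are hypothetical, there is no forwarding, and it assumes the register file is written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer.

```python
def insert_nops(program, min_distance=3):
    """Insert nops so every reader is at least `min_distance` slots after its writer.

    program: list of (text, dest, sources); dest is None when nothing is written.
    """
    window = min_distance - 1  # how many preceding slots can still conflict
    emitted = []
    for text, dest, sources in program:
        recent = emitted[-window:] if window > 0 else []
        needed = 0
        # back = 1 is the slot immediately before the current instruction.
        for back, (_, prev_dest, _) in enumerate(reversed(recent), start=1):
            if prev_dest is not None and prev_dest != "$0" and prev_dest in sources:
                needed = max(needed, min_distance - back)
        emitted.extend([("nop", None, ())] * needed)
        emitted.append((text, dest, sources))
    return [text for text, _, _ in emitted]


PROGRAM = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]

if __name__ == "__main__":
    print("\n".join(insert_nops(PROGRAM)))
```

Only the `and` needs padding: once two nops separate it from `sub`, every later reader of `$2` is already far enough away.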

Forwarding
Use temporary results; don't wait for them to be written.
- Register file forwarding to handle a read and a write to the same register in the same cycle.
- ALU forwarding.

Forwarding
[Figure: multiple-clock-cycle diagram showing, per clock cycle, the value of register $2, the value of EX/MEM (-20 in the cycle after sub executes, X otherwise), and the value of MEM/WB (-20 one cycle later, X otherwise), for
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
]

Forwarding
[Figure: pipelined datapath with a forwarding unit. Multiplexers at the ALU inputs choose among the register file value, the EX/MEM result, and the MEM/WB result. The forwarding unit compares the Rs and Rt fields of the instruction in EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd. The two Rt's shown are identical.]

Load word can still cause a (data) hazard: an instruction tries to read a register right after a load instruction that writes to the same register. Thus, in addition to a forwarding unit, we need a (time-backward) hazard detection unit to stall the pipeline after the load instruction.
[Figure: multiple-clock-cycle diagram for
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
The dependence from lw to and goes backward in time; this time-backward hazard cannot be solved by forwarding.]
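The decision made by a forwarding unit for one ALU operand can be sketched as follows. This follows the usual textbook-style conditions rather than anything spelled out on the slide; the `PipeReg` type, the function name, and the 2-bit select encoding (10 = EX/MEM, 01 = MEM/WB, 00 = register file) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two fields of a pipeline register that the forwarding unit inspects."""
    reg_write: bool  # will the instruction held here write a register?
    rd: int          # destination register number


def forward_select(source_reg, ex_mem, mem_wb):
    """Choose the ALU operand source for register `source_reg` (Rs or Rt of the instruction in EX)."""
    # Prefer the most recent result: the instruction one stage ahead (EX/MEM).
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == source_reg:
        return 0b10
    # Otherwise take the result that is about to be written back (MEM/WB).
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == source_reg:
        return 0b01
    return 0b00  # no hazard: use the value read from the register file


if __name__ == "__main__":
    # sub $2,... is in EX/MEM while and $12, $2, $5 is in EX.
    print(forward_select(2, PipeReg(True, 2), PipeReg(False, 0)))
```

Register 0 is excluded because it always reads as zero in MIPS, so a result "written" to it must never be forwarded.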

We can stall the pipeline by keeping an instruction in the same stage
[Figure: multiple-clock-cycle diagram for
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
The and instruction repeats its decode stage and the or instruction repeats its fetch stage in clock cycle 4, leaving a bubble behind the load.]

We can stall the pipeline by keeping an instruction in the same stage
[Figure: the same stall drawn as an inserted instruction:
    lw  $2, 20($1)
    (and becomes nop)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
]
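The condition under which the pipeline must stall for a load can be written as a single test. This is a sketch of the common textbook formulation, not of the slide's exact hardware; the function and parameter names are hypothetical stand-ins for the ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt signals.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register that a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


if __name__ == "__main__":
    # lw $2, 20($1) is in EX (Rt = 2); and $4, $2, $5 is in ID (Rs = 2, Rt = 5).
    print(must_stall(True, 2, 2, 5))
    # One cycle later the load has left EX, so the retried and instruction may proceed.
    print(must_stall(False, 0, 2, 5))
```

When the test is true, the PC and the IF/ID register keep their values for one cycle and the control signals entering ID/EX are zeroed, which is what turns the slot behind the load into a bubble.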

Hazard detection unit
Stall by letting an instruction that won't write anything go forward.
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and controls the writes to the PC and to IF/ID as well as a multiplexer that zeroes the control signals.]

Branch hazard
When we decide to branch, other instructions are in the pipeline!
[Figure: multiple-clock-cycle diagram for
    40 beq $1,  $3, 72
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4,  50($7)
]

Branch hazard
- We are predicting "branch not taken".
- We need to add hardware for flushing instructions (as well as the necessary inter-stage register contents) if we are wrong.

Flushing instructions: an early decision at the ID stage
[Figure: pipelined datapath in which the branch target adder (shift left 2, sign extend) and an equality comparator on the register outputs are moved to the ID stage; an IF.Flush signal clears IF/ID. The hazard detection unit and forwarding unit are also shown.]

Dynamic branch prediction
Assuming branch-not-taken is not the only prediction strategy.
- If the branch is taken, we have a penalty of one cycle.
- For our simple design, this is reasonable.
- With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance.
- Solution: dynamic branch prediction. A branch prediction buffer (or branch history table) can be added to record whether a branch was taken the last time. This helps improve prediction accuracy, especially for loops.
  - 1-bit prediction: the prediction changes as soon as one prediction error occurs.
  - 2-bit prediction: the prediction changes only if the prediction is wrong twice consecutively.

Dynamic branch prediction
[Figure: state diagram of a 2-bit prediction scheme with two "predict taken" states and two "predict not taken" states, connected by transitions labeled "Taken" and "Not taken".]
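The difference between 1-bit and 2-bit prediction on a loop branch can be seen in a small simulation. This is only a sketch: the class and function names are hypothetical, and the 2-bit predictor is modeled as a saturating counter, a common variant whose transitions out of the two middle states may differ from the state diagram on the slide.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, last_taken=True):
        self.last_taken = last_taken

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(predictor, outcomes):
    """Count wrong predictions over a sequence of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop-closing branch: taken three times, then not taken on exit; the loop runs three times.
LOOP_OUTCOMES = [True, True, True, False] * 3

if __name__ == "__main__":
    print("1-bit:", mispredictions(OneBitPredictor(), LOOP_OUTCOMES))
    print("2-bit:", mispredictions(TwoBitPredictor(), LOOP_OUTCOMES))
```

The 1-bit scheme misses both the loop exit and the first iteration of the next run, while the 2-bit scheme misses only the exit, which is why it helps for loops.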

Dynamic branch prediction
Sophisticated techniques:
- Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction is based on what happened in previous branches).
- Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.
- A branch delay slot, which the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA).
Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective! Modern processors predict correctly 95% of the time!

Improving performance
Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
as
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)
    sw $t2, 0($t1)
Dynamic pipeline scheduling:
- Hardware chooses which instructions to execute next.
- It will execute instructions out of order (e.g., it doesn't wait for a dependency to be resolved, but rather keeps going!).
- It speculates on branches and keeps the pipeline full (and may need to roll back if a prediction is incorrect).
- It is trying to exploit instruction-level parallelism.
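The benefit of the instruction reordering above can be checked by counting load-use stalls. The sketch below is a simplification under stated assumptions: the tuple layout `(opcode, dest, sources)` and the function name are hypothetical, forwarding is assumed to cover every other dependence, and one stall cycle is charged whenever an instruction reads the register loaded by the `lw` immediately before it.

```python
def load_use_stalls(program):
    """Count adjacent pairs in which the second instruction reads the register loaded by the first."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "lw" and dest in current[2]:
            stalls += 1
    return stalls


ORIGINAL = [
    ("lw", "$t0", ("$t1",)),        # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),        # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),   # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),   # sw $t0, 4($t1)
]

# Same four instructions with the two stores swapped.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

if __name__ == "__main__":
    print("original: ", load_use_stalls(ORIGINAL))
    print("reordered:", load_use_stalls(REORDERED))
```

Swapping the stores puts an independent instruction between the second load and the store that needs its result, so the stall disappears without changing what the program does.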

Advanced pipelining
- Increase the depth of the pipeline.
- Start more than one instruction each cycle (multiple issue).
- Loop unrolling to expose more instruction-level parallelism (better scheduling).
- Superscalar processors:
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue.
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes").
- Very long instruction word (VLIW): static multiple issue (relies more on compiler technology).
This class has given you the background you need to learn more!

Summary
Pipelining does not improve latency, but does improve throughput.
[Figure: two charts positioning Single-cycle (Section 5.4), Multicycle (Section 5.5), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), and Multiple issue with deep pipeline (Section 6.10). One chart has an axis running from slower to faster and an axis of instructions per clock (IPC = 1/CPI); the other has an axis of use latency in instructions, running up to several.]

Exceptions/Interrupts
Another form of control hazard is the exception. E.g., add $1, $2, $1 happens to overflow. Then we need to transfer control immediately to the exception routine at some specific location, because we do not want this invalid value to contaminate other registers or memory locations.
Result: another flush signal should be generated after the execution unit.

Basic Interrupt Processing for X86
How does an 8086 system handle interrupts? The software view:
- Addresses 00000H to 003FFH hold the vectors for 256 interrupt types. Notably, type = interrupt vector number.
- Basically, the first 32 interrupt vectors are reserved for the system.
- The remaining 224 interrupt vectors are user interrupt vectors.
[Figure: memory map from 00000H to FFFFFH; the region 00000H to 003FFH is reserved for interrupts, and each entry holds a CS:IP pair.]

Basic Interrupt Processing for X86
Example.
    INT 12
    ; The above is equivalent to
    ;   CS = [4*12+2] = [50] = [32H] = 1234H
    ;   IP = [4*12]   = [30H] = 0005H
    ; Then the type 12 interrupt subroutine is placed at 12345H.
[Figure: memory map. Within the region 00000H to 003FFH reserved for interrupts, the bytes at 33H, 32H, 31H, 30H hold 12H, 34H, 00H, 05H. The currently executing program transfers to the interrupt subroutine at CS:IP = 12345H and later returns.]

Basic Interrupt Processing for X86
Assignment of software interrupts.
Type 0: Divide error
Caused by DIV and IDIV when the quotient exceeds the maximum value that the division instruction allows.
Example. DIV CX computes (DX*65536+AX)/CX, leaving the quotient in AX and the remainder in DX. If CX = DX, the quotient is at least 10000H > FFFFH, which does not fit in AX, so a type 0 interrupt is launched.
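The address arithmetic in the INT 12 example can be replayed in a few lines. This is a minimal sketch with hypothetical helper names; the memory contents are the four bytes assumed in the example (05H, 00H, 34H, 12H at 30H to 33H), stored little-endian as on the 8086.

```python
def vector_entry_addresses(int_type):
    """Return the addresses of the IP word and the CS word for an interrupt type."""
    base = 4 * int_type  # each vector occupies 4 bytes
    return base, base + 2


def read_word(memory, address):
    """Read a 16-bit little-endian word from a byte-addressed memory mapping."""
    return memory[address] | (memory[address + 1] << 8)


def physical_address(cs, ip):
    """Real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((cs << 4) + ip) & 0xFFFFF


# Bytes assumed for the type 12 vector in the example above.
MEMORY = {0x30: 0x05, 0x31: 0x00, 0x32: 0x34, 0x33: 0x12}

if __name__ == "__main__":
    ip_addr, cs_addr = vector_entry_addresses(12)
    ip = read_word(MEMORY, ip_addr)
    cs = read_word(MEMORY, cs_addr)
    print(f"IP at {ip_addr:05X}H = {ip:04X}H, CS at {cs_addr:05X}H = {cs:04X}H")
    print(f"subroutine at {physical_address(cs, ip):05X}H")
```

Because 256 types times 4 bytes is 1024 bytes, the whole table fits exactly in 00000H to 003FFH.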


Basic Interrupt Processing for X86
Hardware interrupt
[Figure: hardware interrupt handling.]

Interrupt Vector Function Table

Number   CPU interrupt                      PC interrupt
00H      Divide error                       Divide error
01H      Single step                        Single step (debug)
02H      NMI (hardware interrupt)           Nonmaskable interrupt pin
03H      Breakpoint                         Breakpoint
04H      Interrupt on overflow              Arithmetic overflow
05H      BOUND interrupt                    Print screen key and BOUND instruction
06H      Invalid opcode                     Illegal instruction error
07H      Coprocessor emulation interrupt    Coprocessor not present interrupt
08H      Double fault                       Timer tick (hardware) (approximately 18.2 Hz)
09H      Coprocessor segment overrun        Keyboard (hardware)
0AH      Invalid task state segment         Hardware interrupt 2 (system bus) (cascade in AT)
0BH      Segment not present                Hardware interrupt 3 (system bus)
0CH      Stack fault                        Hardware interrupt 4 (system bus)
0DH      General protection fault           Hardware interrupt 5 (system bus)
0EH      Page fault                         Hardware interrupt 6 (system bus)
0FH      Reserved*                          Hardware interrupt 7 (system bus)
10H      Floating-point error               Video BIOS
11H      Alignment check interrupt          Equipment environment
12H      Reserved*                          Conventional memory size
13H      Reserved*                          Direct disk service
14H      Reserved*                          Serial COM port service
15H      Reserved*                          Miscellaneous service
16H      Reserved*                          Keyboard service
17H      Reserved*                          Parallel port LPT service
18H      Reserved*                          ROM BASIC
19H      Reserved*                          Reboot
1AH      Reserved*                          Clock service
1BH      Reserved*                          Control-break handler
1CH      Reserved*                          User timer service

Interrupt Vector Function Table (continued)

Number     CPU interrupt      PC interrupt
1DH        Reserved*          Pointer for video parameter table
1EH        Reserved*          Pointer for disk drive parameter table
1FH        Reserved*          Pointer to graphics character pattern table
20H        User interrupts    Terminate program
21H        User interrupts    DOS services
22H        User interrupts    Program termination handler
23H        User interrupts    Control-C handler
24H        User interrupts    Critical error handler
25H        User interrupts    Read disk
26H        User interrupts    Write disk
27H        User interrupts    Terminate and stay resident
28H        User interrupts    DOS idle
29H-2EH    User interrupts    unused
2FH        User interrupts    Multiplex handler
30H-6FH    User interrupts    unused
70H        User interrupts    Hardware interrupt 8 (AT style computer)
71H        User interrupts    Hardware interrupt 9 (AT style computer)
72H        User interrupts    Hardware interrupt 10 (AT style computer)
73H        User interrupts    Hardware interrupt 11 (AT style computer)
74H        User interrupts    Hardware interrupt 12 (AT style computer)
75H        User interrupts    Hardware interrupt 13 (AT style computer)
76H        User interrupts    Hardware interrupt 14 (AT style computer)
77H        User interrupts    Hardware interrupt 15 (AT style computer)
78H-FFH    User interrupts    unused

Superscalar and dynamic pipelining
- Superpipelining = longer pipelines.
- Superscalar = replicate the internal components of the computer so that multiple instructions can be processed in every pipeline stage. The hardware may still issue only one instruction if a certain condition for parallel issue is not met.
- Dynamic pipelining = later ready-to-go instructions can proceed in parallel even if a hazard has occurred in a previous instruction and is still being resolved.
Sections after 6.9 will not be included in exams.

Suggested exercises
6., 6.2, 6.3, 6.4, 6.6, 6.7, 6.9, 6.2, 6.3, 6.35
