Computer Architecture ELEC3441


Lecture 5: Pipelining
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering

RISC vs CISC

Iron Law:
$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$
(Slide annotations on the terms: L4; L5,6)

Microarchitecture                 CPI   Cycle Time
CISC                              >1    short
RISC single cycle, unpipelined    1     long
RISC pipelined                    1     short

Pipeline Motivation: Buying Food from Canteen
(Figure: Order, Food, Drink steps for 1 customer and for 2 customers)
- Getting food from the canteen involves 3 steps:
    Place order (P)
    Pick up food (F)
    Pick up drink (D)
- If there is only 1 customer: P → F → D
- How to serve 2 customers? One after the other is slow.

Food Ordering Pipeline
(Figure: 4 customers with a pipeline vs. 4 customers with no pipeline)
- Serving customers one after another → slow: assuming each step takes 1 unit of time, N customers → 3N units of time
- Better solution: pipeline
    Overlap the different steps in parallel
    N customers → 2 + N units of time
    Prerequisite: all steps must be able to operate independently in parallel
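The 3N versus 2 + N counts above can be checked with a small sketch. The function names are illustrative only and are not part of the lecture; every step is assumed to take exactly 1 unit of time.

```python
def unpipelined_time(customers: int, steps: int = 3) -> int:
    """Each customer finishes all steps before the next one starts."""
    return customers * steps


def pipelined_time(customers: int, steps: int = 3) -> int:
    """The first customer needs `steps` units; every later customer
    finishes 1 unit after the previous one."""
    if customers == 0:
        return 0
    return (steps - 1) + customers


for n in (1, 2, 4):
    print(n, unpipelined_time(n), pipelined_time(n))
```

With the default of 3 steps, `pipelined_time(n)` reduces to the 2 + N expression from the slide.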


Pipeline Observations
(Figure: 4 customers with a pipeline, with an unbalanced pipeline, and with no pipeline)
- In a pipeline, all stages (P, F, D) are busy all the time
    Non-pipeline: each stage is busy 1/3 of the time
- Balanced pipeline: 1 customer per 1 unit of time
- The longest stage dictates the overall performance
    1 customer per time-of-longest-stage
    Balanced delay on each stage → best performance

2 Views of Pipeline

Timeline view (one row per customer, time runs left to right):
Customer 0    P    F    D
Customer 1         P    F    D
Customer 2              P    F    D
Customer 3                   P    F    D

Resource view (one row per stage, time runs left to right):
Order (P)     c0   c1   c2   c3
Food (F)           c0   c1   c2   c3
Drink (D)               c0   c1   c2   c3

Instruction Pipelining
- Recall there are 5 steps to execute 1 instruction in RISC-V:
    Instruction Fetch
    Instruction Decode
    Execution
    Memory operation
    Write Back
- The 5 steps can be pipelined if they can operate independently → pipeline registers. And more.

Pipelined Datapath
(Figure: Instruction Fetch | Instruction Decode & Reg-fetch | Execute | Memory Access | Write-back)

2nd Semester 2013, ELEC3441, HS
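To illustrate why the longest stage dictates performance, here is a minimal sketch. It assumes a clocked pipeline in which all stages advance in lockstep, so the clock period equals the slowest stage delay; the stage delays used are made-up numbers, not values from the lecture.

```python
def clocked_pipeline_time(n_items: int, stage_delays: list) -> float:
    """Total time when every stage advances once per clock tick and the
    clock period is set by the slowest stage (assumption of this sketch)."""
    if n_items == 0:
        return 0
    period = max(stage_delays)
    ticks = len(stage_delays) + n_items - 1  # fill the pipeline, then 1 item per tick
    return ticks * period


def unpipelined_time(n_items: int, stage_delays: list) -> float:
    """Each item passes through all stages before the next one starts."""
    return n_items * sum(stage_delays)


balanced = [1, 1, 1]     # hypothetical delays for P, F, D
unbalanced = [1, 2, 1]   # hypothetical: food pickup takes twice as long

print(clocked_pipeline_time(4, balanced), unpipelined_time(4, balanced))
print(clocked_pipeline_time(4, unbalanced), unpipelined_time(4, unbalanced))
```

In the unbalanced case the pipeline delivers one customer per 2 units of time, which is the delay of the longest stage.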


5-Stage Pipelined Execution
Stages: I-Fetch (IF), Decode / Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)

time            t0   t1   t2   t3   t4   t5   t6   t7   t8
instruction 1   IF1  ID1  EX1  MA1  WB1
instruction 2        IF2  ID2  EX2  MA2  WB2
instruction 3             IF3  ID3  EX3  MA3  WB3
instruction 4                  IF4  ID4  EX4  MA4  WB4
instruction 5                       IF5  ID5  EX5  MA5  WB5

5-Stage Pipelined Execution: Resource Usage Diagram

time   t0   t1   t2   t3   t4   t5   t6   t7   t8
IF     I1   I2   I3   I4   I5
ID          I1   I2   I3   I4   I5
EX               I1   I2   I3   I4   I5
MA                    I1   I2   I3   I4   I5
WB                         I1   I2   I3   I4   I5

(Figure: pipelined datapath)
- What is wrong with this?
    The write back to the regfile comes from an instruction 3 cycles ago

Example instruction sequence:
    add x1, x2, x3
    lw  x4, 20(x5)
    ori x6, x7, 1
    sub x8, x9, x
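The two diagrams above are the same schedule seen from different sides. A minimal sketch, assuming an ideal pipeline with no stalls (the helper names are hypothetical), generates both views:

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def pipeline_timeline(n_instructions: int) -> list:
    """Timeline view: row i holds the stage occupied by instruction i+1
    in each cycle, or "" when the instruction is not in the pipeline."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s at cycle i + s
        rows.append(row)
    return rows


def resource_usage(n_instructions: int) -> dict:
    """Resource view: for each stage, the 1-based instruction number
    occupying it in each cycle, or None when the stage is idle."""
    total_cycles = n_instructions + len(STAGES) - 1
    usage = {name: [None] * total_cycles for name in STAGES}
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            usage[name][i + s] = i + 1
    return usage


for row in pipeline_timeline(5):
    print(" ".join(f"{cell:>3}" for cell in row))
```

Five instructions occupy 5 + 5 - 1 = 9 cycles (t0 to t8), matching the diagrams.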


Pipelined Execution Control
(Figure: pipelined RISC-V datapath without jumps; stages F, D, E, M, W; control signals RegWriteEn, FuncSel, Op2Sel, MemWrite, WBSel)
- Replicate the instruction register to every stage
- Distributed decoding for each stage, based on the current instruction of that stage

Control Points Need to Be Connected

Benefit of Instruction Pipelining
$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$
- When the pipeline is filled, CPI = 1
- Shorter cycle time because there is less work to do per cycle
    In fact, more pipeline stages → shorter cycle time
    Commercial processors can have pipelines of up to 20 stages

Pipeline is Difficult
- Structural Hazard, Data Hazard, Control Hazard
- On every cycle, the hardware needs to detect and resolve all types of hazards, while keeping the pipeline as filled as possible to achieve CPI = 1
    In real systems, CPI suffers slightly in return for higher clock speed
- Need to make sure the hardware adheres to the ISA contract with the programmer
- Difficult but worth it
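The trade-off "CPI suffers slightly in return for higher clock speed" can be made concrete with the iron law. The numbers below are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def cpu_time_ns(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Iron law: instructions x cycles-per-instruction x time-per-cycle."""
    return instructions * cpi * cycle_time_ns


n = 1000  # hypothetical program length in instructions

# Hypothetical single-cycle design: CPI = 1 but a long 5 ns cycle.
single_cycle = cpu_time_ns(n, cpi=1.0, cycle_time_ns=5.0)

# Hypothetical 5-stage pipeline: 1.2 ns cycle (a bit more than 5/5 because
# of pipeline register overhead) and CPI = 1.1 because of occasional stalls.
pipelined = cpu_time_ns(n, cpi=1.1, cycle_time_ns=1.2)

print(single_cycle, pipelined, single_cycle / pipelined)
```

Under these assumed numbers the pipelined design is still several times faster, even though its CPI is worse than 1.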

Structural Hazard
- A structural hazard arises when more than 1 pipeline stage requires access to the same physical hardware
- Solutions:
    1. Add extra copies of the resource
    2. Change the resource so that it can handle concurrent use
    3. Require different stages to access the hardware at different times
         Stall one (some) of the conflicting stages
         Avoid the concurrent use

Structural Hazard Example: Memory
(Figure: IF, ID, EX, MEM, WB stages over t0 to t9)
- Physically there is only 1 main memory in a computer
    It holds both instructions and data
- During run-time, both the IF and MEM stages need access to the main memory → structural hazard
- Solution so far: replicate the memory
    Instruction memory + data memory
    Reality: instruction cache + data cache

Structural Hazard Example: Regfile
(Figure: IF, ID, EX, MEM, WB stages over t0 to t9)
- Physically there is only 1 register file
- During run-time, there are chances that both the ID and WB stages need access to the regfile
    ID: read regfile (rs1, rs2)
    WB: write regfile (rd)
    → structural hazard
- Solution so far: the regfile supports concurrent read and write
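As a rough illustration of the memory example, the sketch below lists the cycles in which a single-ported unified memory would be requested by both IF and MEM. It assumes an ideal 5-stage pipeline with no stalls, where instruction i is fetched in cycle i and reaches MEM in cycle i + 3; the function name and opcode set are illustrative only.

```python
MEMORY_OPS = {"lw", "sw"}  # instructions that use the MEM stage for data access


def memory_port_conflicts(opcodes: list) -> list:
    """Return the cycles in which IF (fetching instruction `cycle`) and MEM
    (serving instruction `cycle - 3`) would both need a single memory port."""
    conflicts = []
    for cycle in range(len(opcodes)):  # an instruction is fetched every cycle
        in_mem = cycle - 3             # instruction currently in the MEM stage
        if in_mem >= 0 and opcodes[in_mem] in MEMORY_OPS:
            conflicts.append(cycle)
    return conflicts


print(memory_port_conflicts(["lw", "add", "add", "add", "add"]))
```

With separate instruction and data memories (or caches), the two accesses go to different resources and the conflict list is irrelevant.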


Data Hazard
- A data hazard arises when pipeline stages access data locations in ways that are incompatible with the ISA contract with the programmer
- Technically there are 3 types of data hazards:
    Read After Write hazards (RAW)
    Write After Read hazards (WAR)
    Write After Write hazards (WAW)
- What may go wrong?
    RAW: a later read happens before an earlier write
    WAR: a later write happens before an earlier read
    WAW: a later write happens before an earlier write
- Data hazards happen on register AND memory locations
- In our 5-stage pipeline, only RAW can happen

Example sequence:
    x1 ← x0 + 10
    x4 ← x1 + 17      (RAW on x1 with the instruction above)
    x4 ← x2 + x3      (WAW on x4 with the instruction above)
    x2 ← x4 + 1       (WAR on x2 with the instruction above)

Data Hazard Example
Stages: I-Fetch (IF), Decode / Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)

time                t0   t1   t2   t3   t4   t5
x1 ← x0 + 10        IF1  ID1  EX1  MA1  WB1
x4 ← x1 + 17             IF2  ID2  EX2  MA2  WB2

- t2: new value of x1 is calculated (EX1) while the old value of x1 is read (ID2)
- t4: the value of x1 is written (WB1)

Resolving Data Hazards
- Strategy 1: Stalling
    Wait for the result to be available by freezing earlier pipeline stages → interlocks
- Strategy 2: Forwarding
    Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass

Feedback to Resolve Hazards
(Figure: stages 1 to 4 with feedback paths FB1 to FB4)
- Later stages provide dependence information to earlier stages, which can stall (or kill) instructions
- Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)
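The three definitions above translate directly into register comparisons. The following is a minimal sketch for register operands only (memory locations are ignored); the `Instr` type is a hypothetical helper, and x0 is excluded because it is hardwired to zero in RISC-V.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # destination register, e.g. "x4"
    srcs: Tuple[str, ...]   # source registers


def hazards(earlier: Instr, later: Instr) -> List[str]:
    """Classify the register dependences between two instructions,
    where `earlier` comes first in program order."""
    found = []
    if earlier.dest != "x0" and earlier.dest in later.srcs:
        found.append("RAW")  # later reads what earlier writes
    if later.dest != "x0" and later.dest in earlier.srcs:
        found.append("WAR")  # later writes what earlier reads
    if later.dest != "x0" and later.dest == earlier.dest:
        found.append("WAW")  # both write the same register
    return found


i1 = Instr("x1", ("x0",))        # x1 <- x0 + 10
i2 = Instr("x4", ("x1",))        # x4 <- x1 + 17
i3 = Instr("x4", ("x2", "x3"))   # x4 <- x2 + x3
i4 = Instr("x2", ("x4",))        # x2 <- x4 + 1

print(hazards(i1, i2), hazards(i2, i3), hazards(i3, i4))
```

Note that the last pair carries two dependences at once: a RAW on x4 and a WAR on x2.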


Interlocks to Resolve Data Hazards
(Figure: datapath with stall condition logic; x1 ← x0 + 10 is followed by x4 ← x1 + 17)
- The stall condition is detected when the 2nd instruction is decoded
- Freeze the 2nd instruction and send a bubble in its place
- The 1st instruction proceeds

Stalled Stages and Pipeline Bubbles

time                   t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11
(I1) x1 ← (x0) + 10    IF1  ID1  EX1  MA1  WB1
(I2) x4 ← (x1) + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                             IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                 IF4  ID4  EX4  MA4  WB4
(I5)                                                      IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are the stalled stages.

Resource Usage ("-" marks a pipeline bubble)

time   t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11
IF     I1   I2   I3   I3   I3   I3   I4   I5
ID          I1   I2   I2   I2   I2   I3   I4   I5
EX               I1   -    -    -    I2   I3   I4   I5
MA                    I1   -    -    -    I2   I3   I4   I5
WB                         I1   -    -    -    I2   I3   I4   I5

Pipeline Bubbles
- A pipeline bubble is a logical concept
- Can be implemented using
    NOP instruction
    Special control decoding
- Stall the pipeline by disabling pipeline registers
- Causes: pipeline stalls, pipeline ...

Bubbles Turn into NOPs

Original sequence:
    addi x1, x0, 10
    addi x4, x1, 17
    ori  x6, x7, 1
    sub  x8, x9, x10

With bubbles expressed as NOPs:
    addi x1, x0, 10
    NOP
    NOP
    NOP
    addi x4, x1, 17
    ori  x6, x7, 1
    sub  x8, x9, x10

Hazards due to Loads & Stores
(Figure: datapath with stall condition)
    M[x1+7] ← x2
    x4 ← M[x3+5]
- What if x1+7 = x3+5?
- Is there any possible data hazard in this instruction sequence?

Load & Store Hazards
    M[x1+7] ← x2
    x4 ← M[x3+5]
- x1+7 = x3+5 → data hazard
- However, the hazard is avoided because our memory system completes writes in one cycle!
- Load/store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.

Pipeline CPI Examples
- Measure from when the first instruction finishes to when the last instruction in the sequence finishes.
- No bubbles: 3 instructions finish in 3 cycles, CPI = 3/3 = 1
- 1 bubble: 3 instructions finish in 4 cycles, CPI = 4/3 ≈ 1.33
- 2 bubbles: 3 instructions finish in 5 cycles, CPI = 5/3 ≈ 1.67
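The NOP expansion above can be sketched as a small software pass. This is only an illustration of the idea: `insert_nops` and its tuple format are hypothetical, and it assumes no bypassing, so a consumer must sit at least 4 instruction slots after its producer (3 bubbles for back-to-back dependent instructions, as in the stall diagram).

```python
def insert_nops(program: list, min_distance: int = 4) -> list:
    """program: list of (text, dest, srcs) tuples in program order.
    Returns the instruction texts with "NOP" inserted wherever a source
    register was written fewer than `min_distance` slots earlier."""
    out = []         # emitted instruction texts, including NOPs
    last_write = {}  # register name -> index in `out` of its latest writer
    for text, dest, srcs in program:
        for src in srcs:
            if src in last_write:
                needed = last_write[src] + min_distance - len(out)
                out.extend(["NOP"] * max(0, needed))
        if dest is not None and dest != "x0":  # x0 is hardwired to zero
            last_write[dest] = len(out)
        out.append(text)
    return out


program = [
    ("addi x1, x0, 10", "x1", ("x0",)),
    ("addi x4, x1, 17", "x4", ("x1",)),
    ("ori x6, x7, 1",   "x6", ("x7",)),
    ("sub x8, x9, x10", "x8", ("x9", "x10")),
]
padded = insert_nops(program)
print("\n".join(padded))
print("CPI for the 4 useful instructions:", len(padded) / len(program))
```

Counting every slot as one cycle, the padded sequence takes 7 slots for 4 useful instructions, which shows how bubbles push the CPI above 1.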

Resolving Data Hazards
- Strategy 1: Stalling
    Wait for the result to be available by freezing earlier pipeline stages → interlocks
- Strategy 2: Forwarding
    Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass

Bypassing

With stalls:
time                 t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) x1 ← x0 + 10    IF1  ID1  EX1  MA1  WB1
(I2) x4 ← x1 + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                           IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                               IF4  ID4  EX4
(I5)                                                    IF5  ID5

- Each stall or kill introduces a bubble in the pipeline → CPI > 1
- A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

With a bypass:
time                 t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) x1 ← x0 + 10    IF1  ID1  EX1  MA1  WB1
(I2) x4 ← x1 + 17         IF2  ID2  EX2  MA2  WB2
(I3)                           IF3  ID3  EX3  MA3  WB3
(I4)                                IF4  ID4  EX4  MA4  WB4
(I5)                                     IF5  ID5  EX5  MA5  WB5

Adding a Bypass
(Figure: datapath with the ASrc mux selecting the bypassed value; x4 ← x1 ... is in D while x1 ← ... is in E)
- When does this bypass help?
    (I1) x1 ← x0 + 10        (I2) x4 ← x1 + 17     yes
    (I1) x1 ← M[x0 + 10]     (I2) x4 ← x1 + 17     no
    (I1) JAL x1, 500         (I2) x4 ← x1 + 17     no

Fully Bypassed Datapath
(Figure: datapath with ASrc and BSrc bypass muxes, including the PC path for JAL)
- Is there still a need for the stall signal?

Questions about LW and Forwarding
(Figure: ID (Decode), EX, MEM, WB stages with WE and MemToReg control, and forwarding mux logic fed from WB)
    ADDIU R1, R1, 24
    OR    R3, R3, R2
    LW    R1, 128(R29)
- Do we need to stall?

Questions about LW and Forwarding
(Figure: same datapath as above)
    ADDIU R1, R1, 24
    LW    R1, 128(R29)
    OR    R1, R3, R1
- Do we need to stall?

Fully Bypassed Datapath
(Figure: datapath with ASrc and BSrc bypass muxes, including the PC path for JAL)
- Is there still a need for the stall signal?

$$\text{stall} = (rs1_D = ws_E)\cdot(opcode_E = LW_E)\cdot(ws_E \neq 0)\cdot re1_D + (rs2_D = ws_E)\cdot(opcode_E = LW_E)\cdot(ws_E \neq 0)\cdot re2_D$$
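The stall condition for the fully bypassed datapath can be written out as a small predicate. This is a sketch that mirrors the equation term by term; the parameter names follow the subscripts in the equation, and the encoding (register numbers as integers, the opcode as a string) is an assumption made for illustration.

```python
def load_use_stall(rs1_d: int, rs2_d: int, re1_d: bool, re2_d: bool,
                   opcode_e: str, ws_e: int) -> bool:
    """True when the instruction in Decode reads a register that a load
    currently in Execute will write, so a bypass cannot supply it yet.

    rs1_d, rs2_d : source register numbers of the instruction in D
    re1_d, re2_d : whether the instruction in D actually reads rs1 / rs2
    opcode_e     : opcode of the instruction in E
    ws_e         : destination register number of the instruction in E
    """
    load_in_e = (opcode_e == "LW") and (ws_e != 0)
    uses_rs1 = re1_d and (rs1_d == ws_e)
    uses_rs2 = re2_d and (rs2_d == ws_e)
    return bool(load_in_e and (uses_rs1 or uses_rs2))


# A load into register 1 followed immediately by an instruction reading it.
print(load_use_stall(rs1_d=3, rs2_d=1, re1_d=True, re2_d=True,
                     opcode_e="LW", ws_e=1))
```

Only a load in Execute triggers the stall, because its data is not available until after the memory stage, whereas an ALU result can be bypassed directly.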


Control Hazard
- Control hazards occur as a result of branches and jumps
    The next instruction is not necessarily at PC+4
- Unconditional jumps:
    The next instruction is determined by the jump instruction
- Conditional branches:
    The next instruction depends on the result of the branch comparison
- Possible solutions:
    Stall
    Change ISA
    (Forward) speculation
- Important questions to ask yourself:
    When do we know the address of the next instruction to execute?
    What happens to the instructions in the rest of the pipeline?

Pipelining Branches
(Figure: stages F, D, E, M, W; PCSel picks the correct target depending on Bcomp; branch logic: Bcomp? Calc target. Take branch?)
- Challenge: the hardware does not know the target address until the EX stage

Not So Good Solution: Stalling

time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD
(I4) 108: ADD
(I5) 300: SUB                              IF5  ID5  EX5  MA5  WB5

- Stalling: wait 2 cycles
    Fetch the correct target after the address calculation is completed in the EX stage
- Stalling doesn't quite work:
    The hardware doesn't know it is a branch instruction until the ID stage → what should happen at t2?
    Huge performance penalty if the hardware always stalls 2 cycles regardless of instruction → every instruction effectively takes 3 cycles

Solution 1: Change ISA
- Expose the fact that there is a pipeline in the hardware
- Change the ISA: the 2 instructions following a branch will ALWAYS be executed, regardless of the branch comparison result
- The extra cycle in which an instruction is always executed regardless of the comparison result is called a branch delay slot
- The compiler may insert useful instructions in the branch delay slot, or NOPs
    e.g. an instruction that may be executed regardless of the branch target


Branch Delay Slot Example

Original:
         addi x2, x1, 4
         lw   x4, 16(x2)
         beq  x1, x0, err
    ok:  add  x5, x3, x4
         ori  x6, x0, 23
    err: sub  x5, x3, x4

Rearranged to fill the delay slots:
         beq  x1, x0, err
         addi x2, x1, 4        (delay slot)
         lw   x4, 16(x2)       (delay slot)
    ok:  add  x5, x3, x4
         ori  x6, x0, 23
    err: sub  x5, x3, x4

- Instructions in the delay slots must not affect the branch decision
    e.g. in the above, they cannot modify x1
- Is the value of x4 ok?

Real Processor: MIPS-I
- MIPS: Microprocessor without Interlocked Pipeline Stages
- The first generation of MIPS processors has 1 delay slot defined
- The branch decision is moved to the ID stage
    Only very simple branches are supported: beqz on 1 register
- The compiler must find an instruction to fill the delay slot, or put a NOP there

Solution 2: Speculate + Kill
- Step 1: Speculate that the instructions in the delay slots will be executed.
- Step 2: Determine at the EX stage:
    if the branch is taken, then kill the instructions in the IF and ID stages
    if the branch is not taken, then do nothing
- Pro: wastes cycles only when the branch is taken
- Cons:
    Complicates the hardware
    Interacts with stall
    Branch/jump in delay slots?

Killing Instructions in IF, ID

Branch taken (instructions in the pipeline are killed):
time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                    IF3  ID3  (killed)
(I4) 108: ADD                         IF4  (killed)
(I5) 300: SUB                              IF5  ID5  EX5  MA5  WB5

Branch not taken (instructions continue, and a new instruction is fetched):
time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                    IF3  ID3  EX3  MA3  WB3
(I4) 108: ADD                         IF4  ID4  EX4  MA4  WB4
(I5) 112: SLL                              IF5  ID5  EX5  MA5  WB5
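The benefit of speculate + kill over always stalling can be estimated with a simple model. This is a sketch under stated assumptions: the branch frequency and taken rate are hypothetical inputs, the kill penalty is the 2 cycles shown in the branch-taken diagram, and no other stalls are modelled.

```python
def always_stall_cpi(stall_cycles: int = 2, base_cpi: float = 1.0) -> float:
    """Every instruction waits `stall_cycles` before the next fetch."""
    return base_cpi + stall_cycles


def speculate_kill_cpi(branch_fraction: float, taken_fraction: float,
                       kill_penalty: int = 2, base_cpi: float = 1.0) -> float:
    """Cycles are wasted only when a branch is actually taken."""
    return base_cpi + branch_fraction * taken_fraction * kill_penalty


# Hypothetical workload: 20% of instructions are branches, half are taken.
print(always_stall_cpi())
print(speculate_kill_cpi(branch_fraction=0.2, taken_fraction=0.5))
```

Under these assumed numbers the speculative scheme stays close to CPI = 1, while unconditional stalling triples it.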


Killing Instructions
(Figure: stages F, D, E, M, W with kill signals into the instruction memory output and the D stage; PCSel picks the correct target depending on Bcomp; branch logic: Bcomp? Calc target. Take branch?)
- Note: the kill signal is not the same as the stall signal, as the instruction in ID is invalid

Pipelining JAL
(Figure: same datapath with kill signals, a brjmp target calculation, and branch logic)
- Save PC+4 from the JAL instruction

Pipelining Jumps (JAL)
- Unconditional jumps can be implemented similarly to branches, with the branch condition being always true
- JAL has the additional requirement of storing the return address (PC+4) in the destination register rd
    Proceed until the WB stage to write the data back into the register file
    Need to be careful with data forwarding and stalling on rd
- Always kill the instructions after JAL

Branch Delay Slots
- Post-1990s processors rarely have branch delay slots
- Performance: an I-cache miss at a delay slot causes a significant performance penalty
- Delay slots complicate advanced microarchitectures
    e.g. superscalar processors with multiple instructions issued per cycle
- It is difficult to find instructions to fill the delay slots of deeply pipelined processors
    Modern processors can have up to 30 pipeline stages
- Other techniques are helpful:
    branch prediction, predicated instructions, etc.

In Conclusion
- Pipelining is a well-studied digital system design technique
- Pipelining allows concurrent execution of multiple steps
- 5 stages of the RISC-V pipeline:
    Instruction Fetch
    Instruction Decode
    Instruction Execute
    Memory Access
    Write Back
- 3 types of hazards:
    Structural hazard
    Data hazard
    Control hazard

Acknowledgements
- These slides contain material developed and copyright by:
    Arvind (MIT)
    Krste Asanovic (MIT/UCB)
    Joel Emer (Intel/MIT)
    James Hoe (CMU)
    John Kubiatowicz (UCB)
    David Patterson (UCB)
- MIT material derived from course
- UCB material derived from course CS152, CS
