ECE 4750 Computer Architecture, Fall 2016 T02 Fundamental Processor Microarchitecture


ECE 4750 Computer Architecture, Fall 2016
T02 Fundamental Processor Microarchitecture
School of Electrical and Computer Engineering, Cornell University

1. Processor Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture: Control/Datapath Split
2. TinyRV1 Single-Cycle Processor
   2.1. High-Level Idea for Single-Cycle Processors
   2.2. Single-Cycle Processor Datapath
   2.3. Single-Cycle Processor Control Unit
   2.4. Analyzing Performance
3. TinyRV1 FSM Processor
   3.1. High-Level Idea for FSM Processors
   3.2. FSM Processor Datapath
   3.3. FSM Processor Control Unit
   3.4. Analyzing Performance
4. TinyRV1 Pipelined Processor
   4.1. High-Level Idea for Pipelined Processors
   4.2. Pipelined Processor Datapath and Control Unit

5. Pipeline Hazards: RAW Data Hazards
   Expose in Instruction Set Architecture
   Hardware Stalling
   Hardware Bypassing/Forwarding
   RAW Data Hazards Through Memory
Pipeline Hazards: Control Hazards
   Expose in Instruction Set Architecture
   Hardware Speculation
   Interrupts and Exceptions
Pipeline Hazards: Structural Hazards
   Expose in Instruction Set Architecture
   Hardware Stalling
   Hardware Duplication
Pipeline Hazards: WAW and WAR Name Hazards
   Software Renaming
   Hardware Stalling
Summary of Processor Performance
Case Study: Transition from CISC to RISC
   Example CISC: IBM 360/M
   Example RISC: MIPS R2K

1. Processor Microarchitectural Design Patterns

Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle)

Instructions / program depends on source code, compiler, ISA
Avg cycles / instruction (CPI) depends on ISA, microarchitecture
Time / cycle depends on microarchitecture and implementation

Microarchitecture        CPI   Cycle Time
Single-Cycle Processor   1     long
FSM Processor            >1    short
Pipelined Processor      1     short

1.1. Transactions and Steps

We can think of each instruction as a transaction, and executing a transaction involves a sequence of steps. For the TinyRV1 instructions (add, addi, mul, lw, sw, jal, jr, bne), the possible steps are: Fetch Instruction, Decode Instruction, Read Registers, Register Arithmetic, Read Memory, Write Memory, Write Registers, and Update PC.
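The first-order performance equation above can be turned into a small calculator. A Python sketch; the instruction count, CPIs, and cycle times below are hypothetical illustrations, not values from the lecture:

```python
# Iron Law of processor performance:
#   Time/Program = (Insts/Program) x (Avg Cycles/Inst) x (Time/Cycle)
def execution_time(insts, avg_cpi, cycle_time):
    """Total execution time, in the same units as cycle_time."""
    return insts * avg_cpi * cycle_time

# Hypothetical comparison on the same 1000-instruction program:
single_cycle = execution_time(1000, 1.0, 80)  # CPI 1, long 80-tau cycle
fsm          = execution_time(1000, 6.0, 35)  # CPI 6, short 35-tau cycle
```

Note how a single-cycle design with CPI of 1 can still lose to a multi-cycle design if its cycle time is long enough, which is exactly the trade-off the table above summarizes.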

1.2. Microarchitecture: Control/Datapath Split

The processor microarchitecture is split into a control unit and a datapath. The control unit drives control signals into the datapath, and the datapath returns status signals to the control unit. The datapath connects to memory through an instruction memory port (imem req/resp, with an imem req_val signal) and a data memory port (dmem req/resp, with a dmem req_val signal).

2. TinyRV1 Single-Cycle Processor

Recall the first-order performance equation: Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle). A single-cycle processor has a CPI of 1 but a long cycle time.

Technology Constraints
- Assume technology where logic is not too expensive, so we do not need to overly minimize the number of registers and combinational logic
- Assume a multi-ported register file with a reasonable number of ports is feasible
- Assume a dual-ported combinational memory: both the instruction and data memory requests complete combinationally in less than one cycle

2.1. High-Level Idea for Single-Cycle Processors

Every instruction executes all of its steps in a single long cycle. For example:
- lw:  Fetch Inst, Decode Inst, Read Registers, Arithmetic, Read Mem, Write Registers, Update PC
- add: Fetch Inst, Decode Inst, Read Registers, Arithmetic, Write Registers, Update PC
- jal: Fetch Inst, Decode Inst, Write Registers, Update PC

2.2. Single-Cycle Processor Datapath

ADD   add rd, rs1, rs2
      R[rd] ← R[rs1] + R[rs2];  PC ← PC + 4

ADDI  addi rd, rs1, imm
      R[rd] ← R[rs1] + sext(imm);  PC ← PC + 4

(The datapath figures show the PC register, the +4 incrementer, and the register file read and write ports used by each instruction.)

Implementing ADD and ADDI Instructions

The datapath fetches the instruction (imemreq/imemresp), extracts the register specifiers (ir[11:7] for rd, ir[19:15] for rs1, ir[24:20] for rs2) and the immediate field (ir[31:7], via the immediate generator), selects the second ALU operand with op2_sel, and writes the ALU result back to the register file.

MUL   mul rd, rs1, rs2
      R[rd] ← R[rs1] × R[rs2];  PC ← PC + 4

Adding MUL requires a multiplier in parallel with the ALU and a wb_sel mux to choose which result is written back.

LW    lw rd, imm(rs1)
      R[rd] ← M[ R[rs1] + sext(imm) ];  PC ← PC + 4

Adding LW requires a data memory port (dmemreq/dmemresp) and extends wb_sel to select the memory read data.

SW    sw rs2, imm(rs1)
      M[ R[rs1] + sext(imm) ] ← R[rs2];  PC ← PC + 4
      imm = { inst[31:25], inst[11:7] }

Adding SW requires an imm_type select for the S-type immediate, a path from the rs2 read port to the data memory write port, and an rf_wen signal so stores do not write the register file.

JAL   jal rd, imm
      R[rd] ← PC + 4;  PC ← PC + sext(imm)
      imm = { inst[31], inst[19:12], inst[20], inst[30:21], 0 }

Adding JAL requires a jalbr_targ adder for the jump target and a pc_sel mux to choose the next PC.

JR    jr rs1
      PC ← R[rs1]

Adding JR requires a jr_targ path from the rs1 read port to the pc_sel mux.

BNE   bne rs1, rs2, imm
      if ( R[rs1] != R[rs2] ) PC ← PC + sext(imm) else PC ← PC + 4
      imm = { inst[31], inst[7], inst[30:25], inst[11:8], 0 }

Adding BNE requires an eq status signal from the ALU comparison back to the control unit, which factors it into pc_sel.

Adding a New Auto-Incrementing Load Instruction

Draw on the datapath diagram which paths we need to use, as well as any new paths we will need to add, in order to implement the following auto-incrementing load instruction:

LW.AI  lw.ai rd, imm(rs1)
       R[rd] ← M[ R[rs1] + sext(imm) ];  R[rs1] ← R[rs1] + 4;  PC ← PC + 4

2.3. Single-Cycle Processor Control Unit

The control signal table has one row per instruction (add, addi, mul, lw, sw, jal, jr, bne) and one column per control signal: pc_sel, imm_type, op2_sel, alu_func, wb_sel, rf_wen, imemreq val, and dmemreq val. For example, add selects pc+4 for the next PC, the register operand for op2, the ALU + function, and the ALU result for writeback; lw uses the i immediate and the memory data for writeback; jr selects the jr target for the next PC. Note that we need to factor the eq status signal into the pc_sel signal for BNE!

2.4. Analyzing Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

Instructions / program depends on source code, compiler, ISA
Cycles / instruction (CPI) depends on ISA, microarchitecture
Time / cycle depends on microarchitecture and implementation

Estimating cycle time

There are many paths through the design that start at a state element and end at a state element. The critical path is the longest of these paths. We can usually use a simple first-order static timing estimate to estimate the cycle time (i.e., the clock period, and thus also the clock frequency). Assume the following component delays:

register read = 1τ       register write = 1τ
regfile read = 10τ       regfile write = 10τ
memory read = 20τ        memory write = 20τ
+4 unit = 4τ             immgen = 2τ
mux = 3τ                 multiplier = 20τ
alu = 10τ                adder = 8τ
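Given these component delays, the first-order estimate is just a sum along each state-element-to-state-element path and a max across paths. A Python sketch; the two component lists below are illustrative guesses at paths through the single-cycle datapath, not an exhaustive enumeration:

```python
# First-order static timing: cycle time = max over all
# state-element-to-state-element paths of the summed component delays.
DELAY = {
    "reg_read": 1, "reg_write": 1,     # PC register
    "rf_read": 10, "rf_write": 10,     # register file ports
    "mem_read": 20, "mem_write": 20,   # combinational memories
    "inc4": 4, "immgen": 2, "mux": 3,
    "mul": 20, "alu": 10, "adder": 8,
}

def path_delay(components):
    return sum(DELAY[c] for c in components)

# Illustrative candidate paths (the component lists are assumptions):
lw_path  = ["reg_read", "mem_read", "rf_read", "mux", "alu",
            "mem_read", "mux", "rf_write"]   # fetch, addr gen, load, writeback
mul_path = ["reg_read", "mem_read", "rf_read", "mul", "mux", "rf_write"]

cycle_time = max(path_delay(lw_path), path_delay(mul_path))
```

With these assumed paths the lw path dominates at 77τ, which is why the load instruction typically sets the cycle time of a single-cycle design.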

Estimating execution time

Using our first-order equation for processor performance, how long in nanoseconds will it take to execute the vector-vector add example, assuming n is 64?

loop: lw   x5, 0(x13)
      lw   x6, 0(x14)
      add  x7, x5, x6
      sw   x7, 0(x12)
      addi x13, x12, 4
      addi x14, x14, 4
      addi x12, x12, 4
      addi x15, x15, -1
      bne  x15, x0, loop
      jr   x1

Using our first-order equation for processor performance, how long in nanoseconds will it take to execute the mystery program, assuming n is 64 and that we find a match on the last element?

      addi x5, x0, 0
loop: lw   x6, 0(x12)
      bne  x6, x14, foo
      addi x10, x5, 0
      jr   x1
foo:  addi x12, x12, 4
      addi x5, x5, 1
      bne  x5, x13, loop
      addi x10, x0, -1
      jr   x1

3. TinyRV1 FSM Processor

Recall the first-order performance equation: Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle). An FSM processor has a CPI greater than 1 but a short cycle time.

Technology Constraints
- Assume legacy technology where logic is expensive, so we want to minimize the number of registers and combinational logic
- Assume an (unrealistic) combinational memory: a single mem req/resp port that completes in less than one cycle
- Assume multi-ported register files and memories are too expensive; these structures can only have a single read/write port

3.1. High-Level Idea for FSM Processors

Instead of performing all of an instruction's steps (Fetch Instruction, Decode Instruction, Read Registers, Register Arithmetic, Read Memory, Write Memory, Write Registers, Update PC) in one long cycle, the FSM processor performs one step (or a few steps) per short cycle. An lw spreads Fetch Inst, Decode Inst, Read Registers, Arithmetic, Read Mem, Write Registers, and Update PC across many cycles; add and jal similarly take one state per step.

3.2. FSM Processor Datapath

Implementing an FSM datapath requires thinking about the required FSM states, but we will defer discussion of how to implement the control logic to the next section.

Implementing Fetch Sequence

The FSM datapath is organized around a single shared datapath bus. The PC, IR, A, and RD registers connect to the bus through register enables (pc_en, ir_en, a_en) and tri-state bus drivers (pc_bus_en, alu_bus_en, rd_bus_en). In pseudo-control-signal syntax, the fetch sequence takes three states:

F0: memreq ← PC; A ← PC
F1: IR ← RD
F2: PC ← A + 4; dispatch on opcode

Implementing ADD Instruction

The datapath adds a B register, an ALU (with functions +4: A + 4 and +: A + B), and a single-ported register file RF (with an rf sel for x0/rs1/rs2/rd and an rf_wen write enable). In pseudo-control-signal syntax, add rd, rs1, rs2 takes three states after the fetch sequence F0–F2:

A0: A ← RF[rs1]
A1: B ← RF[rs2]
A2: RF[rd] ← A + B; goto F0

Full Datapath for TinyRV1 FSM Processor

The full datapath adds the immediate generator (imm_type selects the i, s, b, or j immediate), a C register with shifters for the multiplier (B << 1, C >> 1), a WD register for store data, and an eq comparison status signal. The ALU functions are:
  +4:  A + 4
  +:   A + B
  +?:  A +? B (conditional add, used by the multiplier)
  cmp: A == B
  −4:  A − 4

The immediate generator computes:
  i: sext(ir[31:20])
  s: sext({ir[31:25], ir[11:7]})
  b: sext({ir[31], ir[7], ir[30:25], ir[11:8], 0})
  j: sext({ir[31], ir[19:12], ir[20], ir[30:21], 0})

The state diagram consists of the fetch states F0–F2 followed by per-instruction state sequences: A0–A2 (add), AI0–AI2 (addi), M0–M35 (mul), L0–L3 (lw), S0–S3 (sw), JA0–JA2 (jal), JR0 (jr), and B0–B5 (bne).

MUL Instruction    mul rd, rs1, rs2
M0:  A ← RF[x0]
M1:  B ← RF[rs1]
M2:  C ← RF[rs2]
M3:  A ← A +? B; B ← B << 1; C ← C >> 1
M4:  A ← A +? B; B ← B << 1; C ← C >> 1
...
M35: RF[rd] ← A +? B; goto F0

LW Instruction    lw rd, imm(rs1)
L0: A ← RF[rs1]
L1: B ← sext(imm_i)
L2: memreq ← A + B
L3: RF[rd] ← RD; goto F0

SW Instruction    sw rs2, imm(rs1)
S0: WD ← RF[rs2]
S1: A ← RF[rs1]
S2: B ← sext(imm_s)
S3: memreq ← A + B; goto F0

JAL Instruction    jal rd, imm
JA0: RF[rd] ← PC
JA1: B ← sext(imm_j)
JA2: PC ← A + B; goto F0

JR Instruction    jr rs1
JR0: PC ← RF[rs1]; goto F0

BNE Instruction    bne rs1, rs2, imm
B0: A ← RF[rs1]
B1: B ← RF[rs2]
B2: B ← sext(imm_b); if A == B goto F0
B3: A ← PC
B4: A ← A − 4
B5: PC ← A + B; goto F0
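The state listings above determine the CPI of each instruction on the FSM processor: every instruction pays the three-cycle fetch sequence plus its own execute states. A small Python bookkeeping sketch, with the per-instruction state counts read off the listings above:

```python
FETCH_CYCLES = 3  # F0, F1, F2
EXECUTE_CYCLES = {
    "add": 3, "addi": 3,   # A0-A2, AI0-AI2
    "mul": 36,             # M0-M35, iterative shift-and-add
    "lw": 4, "sw": 4,      # L0-L3, S0-S3
    "jal": 3, "jr": 1,     # JA0-JA2, JR0
    "bne_not_taken": 3,    # B0-B2 (B2 exits early when A == B)
    "bne_taken": 6,        # B0-B5
}

def fsm_cycles(inst):
    """Total cycles for one instruction: fetch plus execute states."""
    return FETCH_CYCLES + EXECUTE_CYCLES[inst]
```

For example, fsm_cycles("lw") is 7 and fsm_cycles("mul") is 39, which shows why the iterative multiplier dominates the average CPI of multiply-heavy code on this design.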

Adding a Complex Instruction

FSM processors simplify adding complex instructions. New instructions usually do not require datapath modifications, only additional states. Consider a memory-memory add:

add.mm rd, rs1, rs2
  M[ R[rd] ] ← M[ R[rs1] ] + M[ R[rs2] ]

Adding a New Auto-Incrementing Load Instruction

Implement the following auto-incrementing load instruction using pseudo-control-signal syntax. Modify the datapath if necessary.

lw.ai rd, imm(rs1)
  R[rd] ← M[ R[rs1] + sext(imm_i) ];  R[rs1] ← R[rs1] + 4

3.3. FSM Processor Control Unit

The state diagram has fetch states F0–F2 that dispatch to the per-instruction sequences (A0–A2, AI0–AI2, M0–M35, L0–L3, S0–S3, JA0–JA2, JR0, B0–B5). We will study three techniques for implementing FSM control units:
- Hardwired control units are high-performance, but inflexible
- Horizontal µcoding increases flexibility, but requires a large control store
- Vertical µcoding is an intermediate design point

A hardwired control unit uses random logic to compute the control signals (22 of them) and the state transitions from the current FSM state and the status signals (1, the eq signal).

Control signal output table for hardwired control unit

Consider the fetch and add sequences:

F0: memreq ← PC; A ← PC
F1: IR ← RD
F2: PC ← A + 4; goto inst
A0: A ← RF[rs1]
A1: B ← RF[rs2]
A2: RF[rd] ← A + B; goto F0

Each state maps to one row of the control signal table, with columns for the bus enables (pc, immgen, alu, rf, rd), the register enables (pc, ir, a, b, c, wd), the mux selects (b, c), the function selects (immgen, alu), the register file controls (sel, wen), and the memory request (val, op).

Vertically Microcoded FSM

A vertically microcoded control unit replaces the random logic with a microprogram counter (µPC) and a control store. An opcode decoder supports dispatch, and a +1 incrementer steps through sequential states. The next-state encoding is:

n : increment µPC by one
d : dispatch based on opcode
f : goto state F0
b : goto state F0 if A == B

Each control store entry holds the control signals (22: bus enables, mux selects, and so on) plus the next-state field; the eq status signal (1) feeds the b encoding. We use a memory array (called the control store) instead of random logic to encode both the control signal logic and the state transition logic. This enables a more systematic approach to implementing complex multi-cycle instructions. Microcoding can produce good performance if accessing the control store is much faster than accessing main memory. Read-only control stores might be replaceable, enabling in-field updates, while read-write control stores can simplify diagnostics and microcode patches.
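The next-state encoding can be sketched as a tiny sequencer. Everything below is a toy model: the µPC values, the control-signal strings, and the dispatch table entry are hypothetical stand-ins for real control store contents, chosen only to mirror the n/d/f/b encoding above.

```python
# Vertical-microcode sequencer sketch. Each control store entry pairs
# control signals with a next-state field:
#   "n" increment uPC, "d" dispatch on opcode,
#   "f" goto F0, "b" goto F0 if A == B.
CONTROL_STORE = {
    0:  ("memreq<-PC; A<-PC", "n"),   # F0
    1:  ("IR<-RD",            "n"),   # F1
    2:  ("PC<-A+4",           "d"),   # F2: dispatch on opcode
    10: ("A<-RF[rs1]",        "n"),   # A0 (add sequence, made-up address)
    11: ("B<-RF[rs2]",        "n"),   # A1
    12: ("RF[rd]<-A+B",       "f"),   # A2: back to fetch
}
DISPATCH = {"add": 10}  # opcode decoder output (hypothetical)

def next_upc(upc, opcode, eq):
    """Compute the next microprogram counter value."""
    _, next_field = CONTROL_STORE[upc]
    if next_field == "n":
        return upc + 1
    if next_field == "d":
        return DISPATCH[opcode]
    if next_field == "f":
        return 0
    if next_field == "b":
        return 0 if eq else upc + 1
    raise ValueError(next_field)
```

Stepping an add through this sequencer visits µPC 0, 1, 2, 10, 11, 12 and then returns to 0, matching the F0–F2, A0–A2 sequence in the state diagram.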

Control signal store for microcoded control unit

Consider the bne sequence:

B0: A ← RF[rs1]
B1: B ← RF[rs2]
B2: B ← sext(imm_b); if A == B goto F0
B3: A ← PC
B4: A ← A − 4
B5: PC ← A + B; goto F0

Each state becomes one control store row holding its bus enables, register enables, mux and function selects, register file controls (sel, wen), memory request signals (val, op), and next-state field: n for B0, B1, B3, and B4; b for B2; and f for B5.

3.4. Analyzing Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

Estimating cycle time

Assume the following component delays for the FSM datapath:

register read/write = 1τ
regfile read/write = 10τ
mem read/write = 20τ
immgen = 2τ
mux = 3τ
alu = 10τ
1b shifter = 1τ
tri-state buf = 1τ

Estimating execution time

Using our first-order equation for processor performance, how long in units of τ will it take to execute the vector-vector add example, assuming n is 64?

loop: lw   x5, 0(x13)
      lw   x6, 0(x14)
      add  x7, x5, x6
      sw   x7, 0(x12)
      addi x13, x12, 4
      addi x14, x14, 4
      addi x12, x12, 4
      addi x15, x15, -1
      bne  x15, x0, loop
      jr   x1

Using our first-order equation for processor performance, how long in units of τ will it take to execute the mystery program, assuming n is 64 and that we find a match on the last element?

      addi x5, x0, 0
loop: lw   x6, 0(x12)
      bne  x6, x14, foo
      addi x10, x5, 0
      jr   x1
foo:  addi x12, x12, 4
      addi x5, x5, 1
      bne  x5, x13, loop
      addi x10, x0, -1
      jr   x1

4. TinyRV1 Pipelined Processor

Recall the first-order performance equation: Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle). A pipelined processor achieves both a CPI of 1 and a short cycle time.

Technology Constraints
- Assume modern technology where logic is cheap and fast (e.g., fast integer ALU)
- Assume multi-ported register files with a reasonable number of ports are feasible
- Assume a small amount of very fast memory (caches) backed by large, slower memory

4.1. High-Level Idea for Pipelined Processors

Anne, Brian, Cathy, and Dave each have one load of clothes. Washing, drying, folding, and storing each take 30 minutes. With fixed time-slot laundry, each person uses all four resources before the next person starts, so the four loads take from 7pm until 3am. With pipelined laundry, each person starts as soon as the washer is free, so the four loads finish shortly after 10pm. With slow dryers, the drying stage limits the rate at which loads complete, stretching the finish time toward midnight.

Pipelining lessons:
- Multiple transactions operate simultaneously using different resources
- Pipelining does not help the transaction latency
- Pipelining does help the transaction throughput
- Potential speedup is proportional to the number of pipeline stages
- Potential speedup is limited by the slowest pipeline stage
- Potential speedup is reduced by the time to fill the pipeline
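The laundry lessons can be quantified. For n transactions through k balanced stages of time t each, a Python sketch (assuming no stalls and perfectly balanced stages):

```python
def unpipelined_time(n, k, t):
    # Each transaction uses all k stages before the next one starts.
    return n * k * t

def pipelined_time(n, k, t):
    # k*t to fill the pipeline with the first transaction,
    # then one transaction completes every t thereafter.
    return (k + n - 1) * t

# Four laundry loads, four 30-minute stages (wash, dry, fold, store):
seq  = unpipelined_time(4, 4, 30)   # 480 min: 7pm to 3am
pipe = pipelined_time(4, 4, 30)     # 210 min
speedup = seq / pipe
```

The speedup here is about 2.3 rather than the ideal 4, because filling the pipeline costs three extra stage times; as n grows, the speedup approaches k.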

Applying pipelining to processors

In a pipelined processor, the steps of different instructions overlap in time: while an lw reads memory, the following add reads its registers, and the jal behind it is being fetched. Each instruction flows through the same sequence of stages, one stage per cycle, unlike the single-cycle design (all steps in one cycle) or the FSM design (one step per cycle, one instruction at a time).

4.2. Pipelined Processor Datapath and Control Unit

- Incrementally develop an unpipelined datapath
- Keep data flowing from left to right
- Position the control signal table early in the diagram
- Divide the datapath/control into stages by inserting pipeline registers
- Keep the pipeline stages roughly balanced
- Forward arrows should avoid skipping pipeline registers
- Backward arrows will need careful consideration

The resulting stages are F (fetch), D (decode and register read), X (execute), M (memory), and W (writeback). A pipeline diagram shows each instruction occupying one stage per cycle:

addi x1, x2, 1   F D X M W
addi x3, x4, 1     F D X M W
addi x5, x6, 1       F D X M W

Adding a new auto-incrementing load instruction

Draw on the pipelined datapath diagram which paths we need to use, as well as any new paths we will need to add, in order to implement the following auto-incrementing load instruction:

lw.ai rd, imm(rs1)
  R[rd] ← M[ R[rs1] + sext(imm) ];  R[rs1] ← R[rs1] + 4

The pipelined datapath carries valid bits (val_f, val_fd, val_dx, val_xm, val_mw) and control signals (cs_dx, cs_xm, cs_mw) alongside the data through the F, D, X, M, and W stages.

Pipeline diagrams

addi x1, x2, 1
addi x3, x4, 1
addi x5, x6, 1

What would be the total execution time if these three instructions were repeated 10 times?

Hazards occur when instructions interact with each other in the pipeline:
- RAW Data Hazards: An instruction depends on a value produced by an earlier instruction
- Control Hazards: Whether or not an instruction should be executed depends on a control decision made by an earlier instruction
- Structural Hazards: An instruction in the pipeline needs a resource being used by another instruction in the pipeline
- WAW and WAR Name Hazards: An instruction in the pipeline is writing a register that an earlier instruction in the pipeline is either writing or reading

Stalling and squashing instructions:
- Stalling: An instruction originates a stall due to a hazard, causing all instructions earlier in the pipeline to also stall. When the hazard is resolved, the instruction no longer needs to stall and the pipeline starts flowing again.
- Squashing: An instruction originates a squash due to a hazard, and squashes all previous instructions in the pipeline (but not itself). We restart the pipeline to begin executing a new instruction sequence.

Control logic with no stalling and no squashing

Each stage carries a valid bit, and the pipeline register for stage B is updated every cycle:

@( posedge clk )
  if ( reset ) val_b <= 0
  else         val_b <= next_val_a

next_val_b = val_b

Control logic with stalling and no squashing

The control unit computes ostall and stall signals per stage, and the pipeline register only latches when the stage is not stalled:

@( posedge clk )
  if ( reset )         val_b <= 0
  else if ( reg_en_b ) val_b <= next_val_a

reg_en_b   = !stall_b
ostall_b   = val_b && ( ostall_hazard1_b || ostall_hazard2_b )
stall_b    = val_b && ( ostall_b || ostall_c || ... )
next_val_b = val_b && !stall_b

ostall_b is the originating stall due to hazards detected in the B stage. stall_b asks: should we actually stall the B stage? It factors in ostalls due to hazards and ostalls from later pipeline stages. next_val_b only sends the transaction to the next stage if the transaction in the B stage is valid and we are not stalling the B stage.
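The stall equations translate directly into code. A Python model of one stage's stall logic; the names mirror the equations above, and the hazard inputs are simply passed in as booleans:

```python
def stage_b_stall(val_b, ostall_hazards_b, ostalls_later):
    """Compute the stall-related control signals for stage B.

    ostall_hazards_b: hazard bits detected in stage B itself
    ostalls_later:    ostall bits from later stages (C, ...)
    """
    ostall_b   = val_b and any(ostall_hazards_b)
    stall_b    = val_b and (ostall_b or any(ostalls_later))
    reg_en_b   = not stall_b             # pipeline register holds its value
    next_val_b = val_b and not stall_b   # only pass valid, unstalled work
    return ostall_b, stall_b, reg_en_b, next_val_b
```

Note how an invalid transaction can never originate a stall, and how a stall from any later stage freezes stage B even when B itself has no hazard.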

Control logic with stalling and squashing

The control unit adds osquash and squash signals:

@( posedge clk )
  if ( reset )         val_b <= 0
  else if ( reg_en_b ) val_b <= next_val_a

reg_en_b   = !stall_b
ostall_b   = val_b && ( ostall_hazard1_b || ostall_hazard2_b )
stall_b    = val_b && ( ostall_b || ostall_c || ... )
osquash_b  = val_b && !stall_b && ( osquash_hazard1_b || ... )
squash_b   = val_b && ( osquash_c || ... )
next_val_b = val_b && !stall_b && !squash_b

osquash_b is the originating squash due to hazards detected in the B stage; if this stage is stalling, it does not originate a squash. squash_b asks: should we squash the B stage? It factors in the originating squashes from later pipeline stages. An originating squash from the B stage means to squash all stages earlier than B, so osquash_b is not factored into squash_b. next_val_b only sends the transaction to the next stage if the transaction in the B stage is valid and we are not stalling or squashing the B stage.
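Extending the stall model with the squash equations gives the full per-stage control logic. A Python sketch, again mirroring the equations above:

```python
def stage_b_ctrl(val_b, ostall_hazards_b, ostalls_later,
                 osquash_hazards_b, osquashes_later):
    """Full stall + squash control signals for stage B."""
    ostall_b  = val_b and any(ostall_hazards_b)
    stall_b   = val_b and (ostall_b or any(ostalls_later))
    # A stalling stage must not originate a squash.
    osquash_b = val_b and not stall_b and any(osquash_hazards_b)
    # osquash_b squashes stages *earlier* than B, so it is not
    # factored into squash_b for B itself.
    squash_b  = val_b and any(osquashes_later)
    next_val_b = val_b and not stall_b and not squash_b
    return ostall_b, stall_b, osquash_b, squash_b, next_val_b
```

For instance, a branch resolving in stage B originates a squash of the younger instructions behind it while continuing down the pipeline itself, whereas a squash arriving from a later stage invalidates the transaction currently in B.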

5. Pipeline Hazards: RAW Data Hazards

RAW hazards occur when one instruction depends on a value produced by a preceding instruction still in the pipeline. We use architectural dependency arrows to illustrate RAW dependencies in assembly code sequences:

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Using pipeline diagrams to illustrate RAW hazards

We use microarchitectural dependency arrows to illustrate RAW hazards on pipeline diagrams. Each addi writes its destination register in W, yet the next instruction wants to read that register in D two cycles earlier:

addi x1, x2, 1   F D X M W
addi x3, x1, 1     F D X M W
addi x4, x3, 1       F D X M W

Approaches to resolving RAW data hazards

- Expose in Instruction Set Architecture: Expose hazards in the ISA, forcing the compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)
- Hardware Scheduling: Hardware dynamically schedules instructions to avoid RAW hazards, potentially allowing instructions to execute out of order
- Hardware Stalling: Hardware includes control logic that freezes later instructions until the earlier instruction has finished producing the value; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)
- Hardware Bypassing/Forwarding: Hardware allows values to be sent from an earlier instruction to a later instruction before the earlier instruction has left the pipeline (sometimes called forwarding)
- Hardware Speculation: Hardware guesses that there is no hazard and allows later instructions to potentially read invalid data; it detects when there is a problem, squashes, and then re-executes the instructions that operated on invalid data

5.1. Expose in Instruction Set Architecture

Insert nops to delay the read of an earlier write. These nops count as real instructions, increasing instructions per program:

addi x1, x2, 1
nop
nop
nop
addi x3, x1, 1
nop
nop
nop
addi x4, x3, 1

Insert independent instructions to delay the read of an earlier write, and only use nops if there is not enough useful work:

addi x1, x2, 1
addi x6, x7, 1
addi x8, x9, 1
nop
addi x3, x1, 1
nop
nop
nop
addi x4, x3, 1

A pipeline diagram of the second sequence shows every instruction flowing without interlocks, because the hazards have been scheduled away. Note: If the hazard is exposed in the ISA, software scheduling is required for correctness! A scheduling mistake can cause undefined behavior.

5.2. Hardware Stalling

Hardware includes control logic that freezes later instructions (in the front of the pipeline) until the earlier instruction (in the back of the pipeline) has finished producing its value.

Pipeline diagram showing hardware stalling for RAW hazards:

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Note: Software scheduling is not required for correctness, but can improve performance! The programmer or compiler schedules independent instructions to reduce the number of cycles spent stalling.

Modifications to the datapath/control to support hardware stalling: the control unit gains a stall table alongside the control signal table, and the F and D pipeline registers gain enables (reg_en_f, reg_en_d) so those stages can be frozen while the hazard resolves.

Deriving the stall signal

Each instruction type (add, addi, mul, lw, sw, jal, jr, bne) determines the rs1_en, rs2_en, and rf_wen control signals. The stall conditions compare the source registers of the instruction in D against the destination registers of the instructions in X, M, and W:

ostall_w_x_rs1_d = val_d && rs1_en_d && val_x && rf_wen_x
                   && (inst_rs1_d == rf_w_x) && (rf_w_x != 0)

ostall_w_m_rs1_d = val_d && rs1_en_d && val_m && rf_wen_m
                   && (inst_rs1_d == rf_w_m) && (rf_w_m != 0)

ostall_w_w_rs1_d = val_d && rs1_en_d && val_w && rf_wen_w
                   && (inst_rs1_d == rf_w_w) && (rf_w_w != 0)

... similar ostall signals for the rs2 source register ...

ostall_d = val_d && ( ostall_w_x_rs1_d || ostall_w_x_rs2_d
                   || ostall_w_m_rs1_d || ostall_w_m_rs2_d
                   || ostall_w_w_rs1_d || ostall_w_w_rs2_d )

5.3. Hardware Bypassing/Forwarding

Hardware allows values to be sent from an earlier instruction (in the back of the pipeline) to a later instruction (in the front of the pipeline) before the earlier instruction has left the pipeline. This is sometimes called forwarding.
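The ostall terms can be checked in a small Python model. A sketch of the rs1-against-X term; the M and W terms are identical up to the stage suffix:

```python
def ostall_w_x_rs1_d(val_d, rs1_en_d, inst_rs1_d,
                     val_x, rf_wen_x, rf_w_x):
    # Stall if a valid instruction in D reads rs1, a valid instruction
    # in X writes that same register, and the register is not x0
    # (x0 is hardwired to zero, so it can never create a real hazard).
    return (val_d and rs1_en_d
            and val_x and rf_wen_x
            and inst_rs1_d == rf_w_x
            and rf_w_x != 0)
```

The rf_w_x != 0 term is easy to forget: without it, back-to-back instructions targeting x0 would stall the pipeline for no reason.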

Pipeline diagram showing hardware bypassing for RAW hazards:

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Adding a single bypass path to support limited hardware bypassing: the D stage gains an op1_byp_sel_d mux that can select the ALU result from the X stage instead of the register file read data.

Deriving the bypass and stall signals:

ostall_w_x_rs1_d = 0

bypass_w_x_rs1_d = val_d && rs1_en_d && val_x && rf_wen_x
                   && (inst_rs1_d == rf_w_x) && (rf_w_x != 0)
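With bypassing, the same match condition that previously produced a stall now steers the operand mux instead. A sketch of the operand-select logic for full bypassing; the priority order (X, then M, then W) matters because the youngest producer holds the most recent value of the register:

```python
def op1_byp_sel_d(byp_x, byp_m, byp_w):
    """Select the op1 source in D from the per-stage bypass match bits."""
    if byp_x:
        return "bypass_from_x"   # youngest producer wins
    if byp_m:
        return "bypass_from_m"
    if byp_w:
        return "bypass_from_w"
    return "rf_read"             # no hazard: use register file data
```

If two in-flight instructions write the same register, only the younger one's value is architecturally correct for the reader in D, which is why the X bypass must take priority over M and W.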

Pipeline diagram showing multiple hardware bypass paths:

addi x2, x10, 1
addi x2, x11, 1
addi x1, x2, 1
addi x3, x4, 1
addi x5, x3, 1
add  x6, x1, x3
sw   x5, 0(x1)
jr   x6

Adding all bypass paths to support full hardware bypassing: the D stage gains op1_byp_sel_d and op2_byp_sel_d muxes with bypass paths from the X, M, and W stages (bypass_from_x, bypass_from_m, bypass_from_w).

5. Pipeline Hazards: RAW Data Hazards    5.3. Hardware Bypassing/Forwarding

Handling load-use RAW dependencies

ALU-use latency is only one cycle, but load-use latency is two cycles.

lw   x1, 0(x2)
addi x3, x1, 1

lw   x1, 0(x2)
addi x3, x1, 1

ostall_load_use_x_rs1_d = val_d && rs1_en_d && val_x && rf_wen_x
                       && (inst_rs1_d == rf_waddr_x) && (rf_waddr_x != 0)
                       && (op_x == lw)

ostall_load_use_x_rs2_d = val_d && rs2_en_d && val_x && rf_wen_x
                       && (inst_rs2_d == rf_waddr_x) && (rf_waddr_x != 0)
                       && (op_x == lw)

ostall_d = val_d && ( ostall_load_use_x_rs1_d || ostall_load_use_x_rs2_d )

bypass_w_x_rs1_d = val_d && rs1_en_d && val_x && rf_wen_x
                && (inst_rs1_d == rf_waddr_x) && (rf_waddr_x != 0)
                && (op_x != lw)

bypass_w_x_rs2_d = val_d && rs2_en_d && val_x && rf_wen_x
                && (inst_rs2_d == rf_waddr_x) && (rf_waddr_x != 0)
                && (op_x != lw)
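The split between stalling and bypassing for loads can be sketched as two Python predicates; this is an illustrative model of the equations above (signal names follow the notes, with `op_x` as a plain string), not the real control logic.

```python
def _raw_match(val_d, rs_en_d, inst_rs_d, val_x, rf_wen_x, rf_waddr_x):
    """True when the D-stage source register matches the X-stage writer."""
    return (val_d and rs_en_d and val_x and rf_wen_x
            and inst_rs_d == rf_waddr_x and rf_waddr_x != 0)

def load_use_stall(val_d, rs_en_d, inst_rs_d,
                   val_x, rf_wen_x, rf_waddr_x, op_x):
    """Stall only when the X-stage producer is a load: its data is not
    available until the end of M, so the load-use latency is two cycles."""
    return (_raw_match(val_d, rs_en_d, inst_rs_d,
                       val_x, rf_wen_x, rf_waddr_x)
            and op_x == "lw")

def bypass_from_x(val_d, rs_en_d, inst_rs_d,
                  val_x, rf_wen_x, rf_waddr_x, op_x):
    """Bypass from the end of X only for ALU producers; loads are excluded."""
    return (_raw_match(val_d, rs_en_d, inst_rs_d,
                       val_x, rf_wen_x, rf_waddr_x)
            and op_x != "lw")
```

A dependent instruction behind a `lw` stalls one cycle and is then satisfied by a bypass from M or W; behind an `addi` it is bypassed immediately.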

5. Pipeline Hazards: RAW Data Hazards    5.4. RAW Data Hazards Through Memory

Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor. Include microarchitectural dependency arrows to illustrate how data is transferred along various bypass paths.

loop:
  lw   x1, 0(x2)
  lw   x3, 0(x4)
  add  x5, x1, x3
  sw   x5, 0(x6)
  addi x2, x2, 4
  addi x4, x4, 4
  addi x6, x6, 4
  addi x7, x7, -1
  bne  x7, x0, loop

5.4. RAW Data Hazards Through Memory

So far we have only studied RAW hazards through registers, but we must also carefully consider RAW hazards through memory.

sw x1, 0(x2)
lw x3, 0(x4)   # RAW dependency occurs if R[x2] == R[x4]

sw x1, 0(x2)
lw x3, 0(x4)
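Unlike register hazards, a RAW hazard through memory depends on effective addresses, which are only known after the address computation in X. A small Python sketch of the overlap check (the function and the 4-byte default are illustrative assumptions, not part of the TinyRV1 design):

```python
def mem_raw_hazard(store_addr, load_addr, nbytes=4):
    """A RAW hazard through memory exists when a later load reads bytes
    written by an earlier in-flight store.  With fixed-size accesses this
    reduces to an address-range overlap check; note the hardware cannot
    perform this check until both effective addresses are computed."""
    return not (load_addr + nbytes <= store_addr
                or store_addr + nbytes <= load_addr)
```

In the TinyRV1 pipeline the check is unnecessary because stores write and loads read memory in the same stage, in order; in more aggressive pipelines this comparison is the basis of store-to-load forwarding logic.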

6. Pipeline Hazards: Control Hazards

Control hazards occur when whether or not an instruction should be executed depends on a control decision made by an earlier instruction.

We use architectural dependency arrows to illustrate control dependencies in assembly code sequences.

Static Instr Sequence:

  addi x1, x0, 1
  jal  x0, foo
  opa
  opb
foo:
  addi x2, x3, 1
  bne  x0, x1, bar
  opc
  opd
  ope
bar:
  addi x4, x5, 1

Dynamic Instr Sequence:

  addi x1, x0, 1
  jal  x0, foo
  addi x2, x3, 1
  bne  x0, x1, bar
  addi x4, x5, 1

Using pipeline diagrams to illustrate control hazards

We use microarchitectural dependency arrows to illustrate control hazards on pipeline diagrams.

addi x1, x0, 1
jal  x0, foo
addi x2, x3, 1
bne  x0, x1, bar
addi x4, x5, 1

6. Pipeline Hazards: Control Hazards    6.1. Expose in Instruction Set Architecture

The jump resolution latency and branch resolution latency are the number of cycles we need to delay the fetch of the next instruction in order to avoid any kind of control hazard. Jump resolution latency is two cycles, and branch resolution latency is three cycles.

addi x1, x0, 1
jal  x0, foo
addi x2, x3, 1
bne  x0, x1, bar
addi x4, x5, 1

Approaches to resolving control hazards

Expose in Instruction Set Architecture: Expose control hazards in the ISA, forcing the compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)

Software Predication: Programmer or compiler converts control flow into data flow by using instructions that conditionally execute based on a data value

Hardware Speculation: Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; it detects when there is a problem and re-executes the instructions that are along the correct control flow

Software Hints: Programmer or compiler provides hints about whether a conditional branch will be taken or not taken, and hardware can use these hints for more efficient hardware speculation

6. Pipeline Hazards: Control Hazards    6.1. Expose in Instruction Set Architecture

6.1. Expose in Instruction Set Architecture

Expose branch delay slots as part of the instruction set. Branch delay slots are instructions that follow a jump or branch and are always executed regardless of whether the jump or branch is taken or not taken. The compiler tries to insert useful instructions, otherwise it inserts nops.

  addi x1, x0, 1
  jal  x0, foo
  nop
  opa
  opb
foo:
  addi x2, x3, 1
  bne  x0, x1, bar
  nop
  nop
  opc
  opd
  ope
bar:
  addi x4, x5, 1

Assume we modify the TinyRV1 instruction set to specify that the JAL and JR instructions have a single-instruction branch delay slot (i.e., the one instruction after a JAL or JR is always executed) and the BNE instruction has a two-instruction branch delay slot (i.e., the two instructions after a BNE are always executed).

Pipeline diagram showing using branch delay slots for control hazards

addi x1, x0, 1
jal  x0, foo
nop
addi x2, x3, 1
bne  x0, x1, bar
nop
nop
addi x4, x5, 1

6. Pipeline Hazards: Control Hazards    6.2. Hardware Speculation

6.2. Hardware Speculation

Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; it detects when there is a problem and re-executes the instructions that are along the correct control flow.

For now, we will only consider a simple branch prediction scheme where the hardware always predicts not taken.

Pipeline diagram when branch is not taken

addi x1, x0, 1
jal  x0, foo
opa
addi x2, x3, 1
bne  x0, x1, bar
opc
opd

Pipeline diagram when branch is taken

addi x1, x0, 1
jal  x0, foo
opa
addi x2, x3, 1
bne  x0, x1, bar
opc
opd
addi x4, x5, 1

6. Pipeline Hazards: Control Hazards    6.2. Hardware Speculation

Modifications to datapath/control to support hardware speculation

[Datapath diagram: fully bypassed five-stage pipeline; the control unit now includes a stall, bypass, and squash table, and the PC select mux in the F stage chooses among pc_plus4, the jump targets (jtarg, jbtarg), and the branch target (btarg).]

Deriving the squash signals

osquash_j_d  = (op_d == jal) || (op_d == jr)
osquash_br_x = (op_x == bne) && !eq_x

Our generic stall/squash scheme gives priority to squashes over stalls.

osquash_d = val_d && !stall_d && osquash_j_d
squash_d  = val_d && osquash_x

osquash_x = val_x && !stall_x && osquash_br_x
squash_x  = 0

Important: The PC select logic must give priority to older instructions (i.e., prioritize branches over jumps)! Good quiz question?
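The priority rule for the PC select logic can be sketched in a few lines of Python; this is an illustrative model only (argument names are assumptions), not the real mux select equation.

```python
def pc_sel(branch_taken_x, jump_d, pc_plus4, jump_target, branch_target):
    """PC select must give priority to the older instruction: a taken
    branch in X wins over a jump sitting in D, which wins over PC+4.
    If the jump in D is itself being squashed by the branch in X,
    choosing the branch target first makes that squash implicit."""
    if branch_taken_x:          # oldest control-flow decision wins
        return branch_target
    if jump_d:
        return jump_target
    return pc_plus4
```

If the priority were reversed, a jump fetched along a mispredicted path could redirect the front end even though the older branch is about to squash it.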

6. Pipeline Hazards: Control Hazards    6.2. Hardware Speculation

Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor that uses hardware speculation which always predicts not-taken. Unlike the standard TinyRV1 processor, you should also assume that we add a single-instruction branch delay slot to the instruction set. So this processor will partially expose the control hazard in the instruction set, but also use hardware speculation. Include microarchitectural dependency arrows to illustrate both data and control flow.

  addi x1, x2, 1
  bne  x0, x3, foo   # assume R[x3] != 0
  addi x4, x5, 1     # instruction is in branch delay slot
  addi x6, x7, 1
  ...
foo:
  add  x8, x1, x4
  addi x9, x1, 1

6. Pipeline Hazards: Control Hazards    6.3. Interrupts and Exceptions

6.3. Interrupts and Exceptions

Interrupts and exceptions alter the normal control flow of the program. They are caused by an external or internal event that needs to be processed by the system, and these events are usually unexpected or rare from the program's point of view.

Asynchronous Interrupts
 - Input/output device needs to be serviced
 - Timer has expired
 - Power disruption or hardware failure

Synchronous Exceptions
 - Undefined opcode, privileged instruction
 - Arithmetic overflow, floating-point exception
 - Misaligned memory access for instruction fetch or data access
 - Memory protection violation
 - Virtual memory page faults
 - System calls (traps) to jump into the operating system kernel

Interrupt and Exception Semantics
 - Interrupts are asynchronous with respect to the program, so the microarchitecture can decide when to service the interrupt
 - Exceptions are synchronous with respect to the program, so they must be handled immediately

6. Pipeline Hazards: Control Hazards    6.3. Interrupts and Exceptions

To handle an interrupt or exception the hardware/software must:
 - Stop the program at the current instruction (I), ensuring previous instructions have finished
 - Save the cause of the interrupt or exception in privileged architectural state
 - Save the PC of instruction I in a special register (EPC)
 - Switch to privileged mode
 - Set the PC to the address of either the interrupt or the exception handler
 - Disable interrupts
 - Save the user architectural state
 - Check the type of interrupt or exception
 - Handle the interrupt or exception
 - Enable interrupts
 - Switch to user mode
 - Set the PC to EPC if I should be restarted
 - Potentially set the PC to EPC+4 if we should skip I

Handling misaligned address and syscall exceptions

Static Instr Sequence:

  addi x1, x0, 0x2001
  lw   x2, 0(x1)
  syscall
  opb
  opc
  ...
exception_handler:
  opd               # disable interrupts
  ope               # save user registers
  opf               # check exception type
  opg               # handle exception
  oph               # enable interrupts
  addi EPC, EPC, 4
  eret

Dynamic Instr Sequence:

  addi x1, x0, 0x2001
  lw   x2, 0(x1)    (excep)
  opd
  ope
  opf
  opg
  oph
  addi EPC, EPC, 4
  eret
  syscall           (excep)
  opd
  ope
  opf

6. Pipeline Hazards: Control Hazards    6.3. Interrupts and Exceptions

Interrupts and Exceptions in a RISC-V Pipelined Processor

  F: instruction address exceptions
  D: illegal instruction
  X: arithmetic overflow
  M: data address exceptions

How should we handle a single instruction which generates multiple exceptions in different stages as it goes down the pipeline?
 - Exceptions in earlier pipeline stages override later exceptions for a given instruction

How should we handle multiple instructions generating exceptions in different stages at the same or different times?
 - We always want the execution to appear as if we have completely executed one instruction before going on to the next instruction
 - So we want to process the exception corresponding to the earliest instruction in program order first
 - Hold exception flags in the pipeline until the commit point
 - The commit point is after all exceptions could be generated but before any architectural state has been updated
 - To handle an exception at the commit point: update the cause and EPC, squash all stages before the commit point, and set the PC to the exception handler

How and where should we handle external asynchronous interrupts?
 - Inject asynchronous interrupts at the commit point
 - Asynchronous interrupts will then naturally override exceptions caused by instructions earlier in the pipeline
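The "earliest instruction in program order wins" rule at the commit point can be sketched in Python; this is an illustrative model (the list-of-tuples representation is an assumption), not the commit-point hardware.

```python
def commit_exception(inflight):
    """Exception flags travel down the pipeline with each instruction and
    are only acted on at the commit point.  `inflight` lists, in program
    order (oldest first), each instruction and the exception it has
    raised so far (None if it has raised none).  The exception belonging
    to the earliest instruction in program order is processed first,
    regardless of which pipeline stage raised it."""
    for inst, exc in inflight:
        if exc is not None:
            return inst, exc
    return None   # nothing to handle this cycle
```

So even if a younger `add` has already flagged an overflow in X, an older `lw` whose misaligned-address exception is detected later in M is still handled first.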

6. Pipeline Hazards: Control Hazards    6.3. Interrupts and Exceptions

Modifications to datapath/control to support exceptions

[Datapath diagram: fully bypassed five-stage pipeline with exception status (xcept_fd, xcept_dx, xcept_xm) carried down the pipeline to an exception-handling unit at the commit point; the PC select mux gains an exception-handler target.]

Deriving the squash signals

osquash_j_d     = (op_d == jal) || (op_d == jr)
osquash_br_x    = (op_x == bne) && !eq_x
osquash_xcept_m = exception_m

Control logic needs to redirect the front end of the pipeline just like for a jump or branch. Again, squashes take priority over stalls, and the PC select logic must give priority to older instructions (i.e., prioritize exceptions over branches, over jumps)!

6. Pipeline Hazards: Control Hazards    6.3. Interrupts and Exceptions

Pipeline diagram of exception handling

  addi x1, x0, 0x2001
  lw   x2, 0(x1)   # assume causes misaligned address exception
  syscall          # causes a syscall exception
  opb
  opc
  ...
exception_handler:
  opd              # disable interrupts
  ope              # save user registers
  opf              # check exception type
  opg              # handle exception
  oph              # enable interrupts
  addi EPC, EPC, 4
  eret

7. Pipeline Hazards: Structural Hazards

Structural hazards occur when an instruction in the pipeline needs a resource being used by another instruction in the pipeline. The TinyRV1 processor pipeline is specifically designed to avoid structural hazards.

Let's introduce a structural hazard by allowing instructions that do not do any real work in the M stage (i.e., non-memory instructions) to effectively skip that stage. This would require adding an extra path which skips over the pipeline register between the X and M stages and connects directly to the writeback mux at the end of the M stage. For non-memory instructions we set wb_sel_m to choose the value from the end of the X stage, while for memory instructions we set wb_sel_m to choose the value coming back from memory.

[Datapath diagram: five-stage pipeline with an extra path from the end of the X stage directly into the wb_sel_m writeback mux, skipping the X/M pipeline register.]

7. Pipeline Hazards: Structural Hazards    7.1. Expose in Instruction Set Architecture

Using pipeline diagrams to illustrate structural hazards

We use structural dependency arrows to illustrate structural hazards.

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
addi x7, x8, 1

Note that the key shared resources causing the structural hazard are the pipeline registers at the end of the M stage. We cannot write these pipeline registers with the transaction that is in the X stage and also the transaction that is in the M stage at the same time.

Approaches to resolving structural hazards

Expose in Instruction Set Architecture: Expose structural hazards in the ISA, forcing the compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)

Hardware Stalling: Hardware includes control logic that freezes later instructions until an earlier instruction has finished using the shared resource; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)

Hardware Duplication: Add more hardware so that each instruction can access separate resources at the same time

7.1. Expose in Instruction Set Architecture

Insert independent instructions or nops to delay non-memory instructions if they follow a LW or SW instruction.

7. Pipeline Hazards: Structural Hazards    7.2. Hardware Stalling

Pipeline diagram showing exposing structural hazards in the ISA

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
nop
addi x7, x8, 1

7.2. Hardware Stalling

Hardware includes control logic that stalls a non-memory instruction if it follows a LW or SW instruction.

Pipeline diagram showing hardware stalling for structural hazards

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
addi x7, x8, 1

Deriving the stall signal

ostall_wport_hazard_d = val_d && !mem_inst_d && val_x && mem_inst_x

where mem_inst is true for a LW or SW instruction and false otherwise.

We can stall far before the structural hazard actually occurs, because we know exactly how instructions move down the pipeline. It is also possible to use dynamic arbitration in the back-end of the pipeline.
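The structural-hazard stall equation above translates directly into Python; this is an illustrative model of the condition, with names taken from the notes.

```python
def ostall_wport_hazard(val_d, mem_inst_d, val_x, mem_inst_x):
    """With the early-writeback path, a non-memory instruction in D would
    arrive at the end-of-M pipeline registers in the same cycle as the
    memory instruction one stage ahead of it, so stall it in D.  The
    hazard is detected early because instruction movement through the
    pipeline is fully deterministic."""
    return val_d and (not mem_inst_d) and val_x and mem_inst_x
```

An `addi` in D behind a `lw` in X stalls one cycle; a `lw` behind a `lw` does not, since both take the full path through M.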

7. Pipeline Hazards: Structural Hazards    7.3. Hardware Duplication

7.3. Hardware Duplication

Add more pipeline registers at the end of the M stage and a second register-file write port so that non-memory and memory instructions can write back to the register file at the same time.

[Datapath diagram: five-stage pipeline with separate non_mem_result_MW and mem_result_MW pipeline registers at the end of the M stage feeding a register file with two write ports (rf_waddr_w/rf_wen_w and rf_waddr1_w/rf_wen1_w).]

Does allowing early writeback help performance in the first place?

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
addi x5, x4, 1
addi x6, x5, 1
addi x7, x6, 1

8. Pipeline Hazards: WAW and WAR Name Hazards

WAW dependencies occur when an instruction overwrites a register that an earlier instruction has already written. WAR dependencies occur when an instruction writes a register that an earlier instruction needs to read. We use architectural dependency arrows to illustrate WAW and WAR dependencies in assembly code sequences.

mul  x1, x2, x3
addi x4, x1, 1
addi x1, x5, 1

WAW name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in the back of the pipeline) has had a chance to write that same register. WAR name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in the back of the pipeline) has had a chance to read that same register.

The TinyRV1 processor pipeline is specifically designed to avoid any WAW or WAR name hazards. Instructions always write the register file in-order in the same stage, and instructions always read registers in the front of the pipeline and write registers in the back of the pipeline.

Let's introduce a WAW name hazard by using an iterative variable-latency multiplier, and allowing other instructions to continue executing while the multiplier is working.

8. Pipeline Hazards: WAW and WAR Name Hazards    8.1. Software Renaming

Using pipeline diagrams to illustrate WAW name hazards

We use microarchitectural dependency arrows to illustrate WAW hazards on pipeline diagrams.

mul  x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1

Approaches to resolving name hazards

Software Renaming: Programmer or compiler changes the register names to avoid creating name hazards

Hardware Renaming: Hardware dynamically changes the register names to avoid creating name hazards

Hardware Stalling: Hardware includes control logic that freezes later instructions until an earlier instruction has finished either writing or reading the problematic register name

8.1. Software Renaming

As long as we have enough architectural registers, renaming registers in software is easy. WAW and WAR dependencies occur because we have a finite number of architectural registers.

mul  x1, x2, x3
addi x4, x1, 1
addi x7, x5, 1
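The WAW stall check in the next subsection compares destination registers between D and the stage holding the long-latency multiply. A minimal Python sketch (the generic "stage Z" naming follows the notes; everything else here is illustrative):

```python
def ostall_waw_hazard(val_d, rf_wen_d, rf_waddr_d,
                      val_z, rf_wen_z, rf_waddr_z):
    """Stall a writer in D while an older, still-executing instruction
    (in some later stage Z, e.g. waiting on the iterative multiplier)
    will write the same destination register.  Without this stall the
    younger write could land first and then be clobbered by the older
    one, leaving the register file with a stale value."""
    return (val_d and rf_wen_d and val_z and rf_wen_z
            and rf_waddr_d == rf_waddr_z and rf_waddr_z != 0)
```

In the `mul x1 / addi x1` example, the `addi` stalls in D until the multiply has written x1; renaming (in software or hardware) removes the stall entirely.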

9. Summary of Processor Performance    8.2. Hardware Stalling

8.2. Hardware Stalling

The simplest approach is to add stall logic in the decode stage, similar to the approach used to resolve other hazards.

mul  x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1

Deriving the stall signal

ostall_struct_hazard_d = val_d && (op_d == MUL) && !imul_rdy_d

ostall_waw_hazard_d = val_d && rf_wen_d && val_z && rf_wen_z
                   && (rf_waddr_d == rf_waddr_z) && (rf_waddr_z != 0)

9. Summary of Processor Performance

Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)

Results for vector-vector add example

Microarchitecture        Inst   CPI   Cycle Time   Exec Time
Single-Cycle Processor                τ            43 kτ
FSM Processor                         τ            154 kτ
Pipelined Processor
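The first-order performance equation is simple enough to compute directly; this Python helper is illustrative, and the numbers in the example call are made up for demonstration (the actual vvadd instruction counts and CPIs from the lecture table are not reproduced here).

```python
def exec_time(num_insts, cpi, cycle_time):
    """First-order processor performance:
    time/program = (insts/program) * (cycles/inst) * (time/cycle)."""
    return num_insts * cpi * cycle_time

# Hypothetical illustration of the single-cycle vs. pipelined trade-off:
# a CPI-1 design with a long cycle can lose to a design whose CPI is
# slightly above 1 but whose cycle is much shorter.
single_cycle = exec_time(1000, 1.0, 50)    # 50 tau per (long) cycle
pipelined    = exec_time(1000, 1.2, 15)    # 15 tau per (short) cycle
```

Here the pipelined design wins despite its higher CPI, which is exactly the pattern the summary table is meant to show.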

9. Summary of Processor Performance

Estimating cycle time for pipelined processor

[Datapath diagram: fully bypassed five-stage pipeline annotated with the component delays below.]

register read  =  1τ     register write  =  1τ
regfile read   = 10τ     regfile write   = 10τ
memory read    = 20τ     memory write    = 20τ
+4 unit        =  4τ     immgen          =  2τ
mux            =  3τ     multiplier      = 20τ
alu            = 10τ     adder           =  8τ
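Given these component delays, the cycle time is set by the slowest stage's critical path. This Python sketch uses the delay table from the notes, but the stage compositions in the example are hypothetical simplifications of the real datapath (real paths share components and include mux and pipeline-register overheads not modeled here).

```python
# Component delays from the notes, in units of tau.
DELAYS = {
    "regread": 1, "regwrite": 1,      # pipeline register read/write
    "rfread": 10, "rfwrite": 10,      # register file read/write
    "memread": 20, "memwrite": 20,
    "plus4": 4, "immgen": 2, "mux": 3,
    "mul": 20, "alu": 10, "adder": 8,
}

def stage_delay(components):
    """Delay of one stage = sum of delays along its critical path."""
    return sum(DELAYS[c] for c in components)

def cycle_time(stages):
    """The clock period is set by the slowest stage."""
    return max(stage_delay(s) for s in stages)
```

For instance, a memory stage modeled as register read, memory access, register write costs 1 + 20 + 1 = 22τ and would dominate an ALU stage costing 1 + 10 + 1 = 12τ.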

9. Summary of Processor Performance

Estimating execution time

Using our first-order equation for processor performance, how long in τ will it take to execute the vvadd example assuming n is 64?

loop:
  lw   x5, 0(x13)
  lw   x6, 0(x14)
  add  x7, x5, x6
  sw   x7, 0(x12)
  addi x13, x13, 4
  addi x14, x14, 4
  addi x12, x12, 4
  addi x15, x15, -1
  bne  x15, x0, loop
  jr   x1

[Pipeline diagram worksheet: lw, lw, add, sw, addi, addi, addi, addi, bne, opa, opb, lw]

9. Summary of Processor Performance

Using our first-order equation for processor performance, how long in τ will it take to execute the mystery program assuming n is 64 and that we find a match on the last element?

  addi x5, x0, 0
loop:
  lw   x6, 0(x12)
  bne  x6, x14, foo
  addi x10, x5, 0
  jr   x1
foo:
  addi x12, x12, 4
  addi x5, x5, 1
  bne  x5, x13, loop
  addi x10, x0, -1
  jr   x1

10. Case Study: Transition from CISC to RISC

Microcoding thrived in the 1970s:
 - ROMs were significantly faster than DRAMs
 - For complex instruction sets, microcode was cheaper and simpler
 - New instructions could be supported without modifying the datapath
 - Fixing bugs in the controller is easier
 - ISA compatibility across models is relatively straightforward

    -- Maurice Wilkes

10. Case Study: Transition from CISC to RISC    Example CISC: IBM 360/M30

Example CISC: IBM 360/M30

                           M30     M40    M50     M65
Datapath width (bits)
µinst width (bits)
µcode size (1K µinsts)
µstore technology          CCROS   TROS   BCROS   BCROS
µstore cycle (ns)
Memory cycle (ns)
Rental fee ($K/month)

TROS  = transformer read-only storage (magnetic storage)
BCROS = balanced capacitor read-only storage (capacitive storage)
CCROS = card capacitor read-only storage (metal punch cards, replaceable in the field)

Only the fastest models (75, 95) were hardwired.

IBM 360/M30 microprogram for register-register logical OR:
  Instruction Fetch
  Fetch first byte of operands
  OR
  Writeback Result
  Prepare for Next Byte

10. Case Study: Transition from CISC to RISC    Example CISC: IBM 360/M30

IBM 360/M30 microprogram for register-register binary ADD

Analyzing Microcoded Machines

John Cocke and group at IBM
 - Working on a simple pipelined processor, the 801, and advanced compilers
 - Ported the experimental PL8 compiler to the IBM 370, and only used simple register-register and load/store instructions similar to the 801
 - Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS, whereas 2 MIPS was considered good before)

Joel Emer and Douglas Clark at DEC
 - Measured the VAX-11/780 using external hardware
 - Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine
 - 20% of VAX instrs = 60% of µcode, but only 0.2% of the dynamic execution

VAX 8800, high-end VAX in 1984
 - µstore: 16K x 147b RAM; unified cache: 64K x 8b RAM
 - 4.5x more microstore RAM than cache RAM!

10. Case Study: Transition from CISC to RISC    Example CISC: IBM 360/M30

From CISC to RISC

Key changes in technology constraints:
 - Logic, RAM, and ROM all implemented with MOS transistors
 - RAM is the same speed as ROM

Use fast RAM to build a fast instruction cache of user-visible instructions, not fixed hardware microroutines
 - Change the contents of the fast instruction memory to fit what the app needs

Use a simple ISA to enable a hardwired pipelined implementation
 - Most compiled code only used a few of the CISC instructions
 - Simpler encoding allowed pipelined implementations
 - Load/Store ISA as opposed to a Mem-Mem ISA

Further benefit with integration
 - In the early 1980s it became possible to fit a 32-bit datapath and small caches on a single chip
 - No chip crossings in the common case allows faster operation

[Figure: vertical µcode controller (µPC, ROM for µinsts, small decoder) versus RISC controller (user PC, RAM for instruction cache, "larger" decoder).]

10. Case Study: Transition from CISC to RISC    Example RISC: MIPS R2K

Example RISC: MIPS R2K

MIPS R2K is one of the first popular pipelined RISC processors. The MIPS R2K implements the MIPS I instruction set.
MIPS = Microprocessor without Interlocked Pipeline Stages

[Figure: MIPS ISA family tree -- MIPS I, MIPS II, MIPS III, MIPS IV, MIPS16, MIPS32, MIPS64.]

MIPS I used software scheduling to avoid some RAW hazards by including a single-instruction load-use delay slot.

MIPS I used software scheduling to avoid some control hazards by including a single-instruction branch delay slot.

One-Instr Branch Delay Slot

  addiu r1, r2, 1
  j     foo
  addiu r3, r4, 1    # BDS
  ...
foo:
  addiu r5, r6, 1
  bne   r7, r8, bar
  addiu r9, r10, 1   # BDS
  ...
bar:

Branch delay slots are present in all MIPS instruction sets; it is not possible to deprecate them and still enable legacy code to execute on new microarchitectures.

One-Instr Load-Use Delay Slot

  lw    r1, 0(r2)
  lw    r3, 0(r4)
  addiu r2, r2, 4    # LDS
  addu  r5, r1, r3

The load-use delay slot was deprecated in the MIPS II instruction set; legacy code can still execute on new microarchitectures, but code using the MIPS II instruction set can rely on hardware stalling.

10. Case Study: Transition from CISC to RISC    Example RISC: MIPS R2K

MIPS R2K Microarchitecture

The pipelined datapath and control were located on a single die. Cache control and the memory management unit were also integrated on-die, but the actual tag and data storage for the caches was located off-chip.

[Block diagram: on-die control unit and datapath with I$ and D$ controllers; off-chip I$ (4-32KB) and D$ (4-32KB).]

Used two-phase clocking to enable five pipeline stages (IF, RD, ALU, MEM, WB) to fit into four clock cycles. This avoided the need for explicit bypassing from the W stage to the end of the D stage.

Two-phase clocking also enabled a single-cycle branch resolution latency, since register read, branch address generation, and branch comparison can fit in a single cycle.

[Excerpt figures: FIGURE 1.2 "MIPS five-stage pipeline" and FIGURE 1.3 "The pipeline and branch delays."]

10. Case Study: Transition from CISC to RISC    Example RISC: MIPS R2K

MIPS R2K VLSI Design

Process: 2 µm, two metal layers
Clock Frequency: 8-15 MHz
Size: 110K transistors, 80 mm²

10. Case Study: Transition from CISC to RISC    Example RISC: MIPS R2K

[Chart: ratio of MIPS to VAX instructions executed and CPI across benchmarks (li, eqntott, espresso, doduc, tomcatv, fpppp, nasa7, spice matrix): the MIPS machine executes roughly 2x more instructions but achieves roughly 6x lower CPI, for 2-4x higher performance. -- H&P, Appendix J, from Bhandarkar and Clark, 1991]

CISC/RISC Convergence

The MIPS R10000 uses a sophisticated out-of-order engine; the branch delay slot is no longer useful.

[Excerpts from "MIPS R10000 Uses Decoupled Architecture," L. Gwennap, Microprocessor Report, Vol. 8, No. 14, October 24, 1994: deep queues decouple the instruction fetch logic from the five function units, so instructions that are ready to execute can jump ahead of those waiting for operands (out-of-order execution); a four-entry resume cache supports speculative execution beyond four branches and reduces the mispredicted-branch penalty by one cycle.]

The Intel Nehalem front-end breaks x86 CISC instructions into smaller RISC-like µops; a µcode engine handles rarely used complex instructions. -- Kanter, Real World Technologies, 2009


More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University CSE 533: Advanced Computer Architectures Pipelining Instructor: Gürhan Küçük Yeditepe University Lecture notes based on notes by Mark D. Hill and John P. Shen Updated by Mikko Lipasti Pipelining Forecast

More information

CS 61C: Great Ideas in Computer Architecture Control and Pipelining

CS 61C: Great Ideas in Computer Architecture Control and Pipelining CS 6C: Great Ideas in Computer Architecture Control and Pipelining Instructors: Vladimir Stojanovic and Nicholas Weaver http://inst.eecs.berkeley.edu/~cs6c/sp6 Datapath Control Signals ExtOp: zero, sign

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 9 Instruction-Level Parallelism Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4 IC220 Set #9: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life Return to Chapter 4 Midnight Laundry Task order A B C D 6 PM 7 8 9 0 2 2 AM 2 Smarty Laundry Task order A B C D 6 PM

More information