Design of Digital Circuits Lecture 15: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2017


1 Design of Digital Circuits, Lecture 15: Pipelining. Prof. Onur Mutlu, ETH Zurich, Spring 2017 (April 2017)

2 Agenda for Today & Next Few Lectures
- Single-cycle Microarchitectures
- Multi-cycle and Microprogrammed Microarchitectures
- Pipelining
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, ...
- Out-of-Order Execution
- Issues in OoO Execution: Load-Store Handling, ...

3 Readings for This Week
- H&H, Chapter 7.5 (keep reading)

4 Wrap Up Microprogramming

5 Remember: An Exercise in Microprogramming

6 Handouts
- 7 pages of Microprogrammed LC-3b design
- infk/inst-infsec/system-security-group-dam/education/ Digitaltechnik_7/lecture/lc3b-figures.pdf

7 A Simple LC-3b Control and Datapath

8 [Figure: LC-3b state machine. Fetch: states 18/19 (MAR <- PC, PC <- PC + 2), state 33 (MDR <- M), state 35 (IR <- MDR). Decode: state 32 (BEN <- IR[11] & N + IR[10] & Z + IR[9] & P, then dispatch on IR[15:12]). Per-opcode state sequences follow for ADD, AND, XOR, TRAP, SHF, LEA, LDB, LDW, STW, STB, JSR, JMP, BR and RTI, each returning to the fetch states.]
NOTES: B+off6 : Base + SEXT[offset6]; PC+off9 : PC + SEXT[offset9]; *OP2 may be SR2 or SEXT[imm5]; ** [15:8] or [7:0] depending on MAR[0]

9 A Simple Datapath Can Become Very Powerful
[Figure: LC-3b datapath with PC, register file, ALU, shifter (SHF), IR, condition codes (N, Z, P), MAR, MDR, memory, memory-mapped I/O registers (KBDR, KBSR, DDR, DSR), the muxes (PCMUX, MARMUX, ADDR1MUX, ADDR2MUX, SR2MUX, INMUX) and the bus gates (GatePC, GateMARMUX, GateALU, GateSHF, GateMDR)]

10 State Machine for LDW
[Figure: microsequencer next to the LDW state sequence: State 18, State 33, State 35, State 32, State 6, State 25, State 27]

11 [Figure: (a) DRMUX, (b) SRMUX, (c) BEN logic driven by IR[11:9] and N, Z, P]

13 Simple Design of the Control Structure
[Figure: microsequencer (inputs R, IR, BEN) produces a 6-bit address into a control store of 2^6 microinstructions; each microinstruction has 9 sequencing bits (J, COND, IRD) and 26 datapath control bits]

14 [Figure: microsequencer logic. COND1, COND0, BEN, R and IR[11] implement the Branch, Ready and Addr. Mode conditions that modify J[5:0]; IRD selects between that result and 0,0,IR[15:12] as the 6-bit Address of Next State]

15 [Figure: control store contents, one row per state (State 0 through State 63). Columns: J, IRD, Cond, LD.MDR, LD.IR, LD.BEN, LD.REG, LD.CC, LD.MAR, GatePC, GateMDR, GateALU, LD.PC, GateMARMUX, GateSHF, PCMUX, DRMUX, SRMUX, ADDR1MUX, ADDR2MUX, MARMUX, ALUK, MIO.EN, R.W, DATA.SIZE, LSHF1]

16 End of the Exercise in Microprogramming

17 Variable-Latency Memory
- The ready signal (R) enables memory read/write to execute correctly
  - Example: transition from state 33 to state 35 is controlled by the R bit asserted by memory when memory data is available
- Could we have done this in a single-cycle microarchitecture?
- What did we assume about memory and registers in a single-cycle microarchitecture?

18 The Microsequencer: Advanced Questions
- What happens if the machine is interrupted?
- What if an instruction generates an exception?
- How can you implement a complex instruction using this control structure?
  - Think REP MOVS instruction in x86

19 The Power of Abstraction
- The concept of a control store of microinstructions gives the hardware designer a new abstraction: microprogramming
- The designer can translate any desired operation to a sequence of microinstructions
- All the designer needs to provide is
  - The sequence of microinstructions needed to implement the desired operation
  - The ability for the control logic to correctly sequence through the microinstructions
  - Any additional datapath elements and control signals needed (no need if the operation can be translated into existing control signals)

20 Let's Do Some More Microprogramming
- Implement REP MOVS in the LC-3b microarchitecture
- What changes, if any, do you make to the
  - state machine?
  - datapath?
  - control store?
  - microsequencer?
- Show all changes and microinstructions
- Extra Credit Assignment

21 x86 REP MOVS (String Copy)
- REP MOVS (DEST SRC)
- How many instructions does this take in MIPS ISA?
- How many microinstructions does this take to add to the LC-3b microarchitecture?

22 Aside: Alignment Correction in Memory
- Unaligned accesses
- LC-3b has byte load and byte store instructions that move data not aligned at the word-address boundary
  - Convenience to the programmer/compiler
- How does the hardware ensure this works correctly?
  - Take a look at state 29 for LDB
  - States 24 and 17 for STB
  - Additional logic to handle unaligned accesses
- P&P, Revised Appendix C.5

23 Aside: Memory Mapped I/O
- Address control logic determines whether the address specified by an LDW or STW refers to memory or to an I/O device
- Correspondingly enables memory or I/O devices and sets up muxes
- An instance where the final control signals of some datapath elements (e.g., MEM.EN or INMUX/2) cannot be stored in the control store
  - These signals are dependent on memory address
- P&P, Revised Appendix C.6

24 Advantages of Microprogrammed Control
- Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer)
  - High-level ISA translated into microcode (sequence of u-instructions)
  - Microcode (u-code) enables a minimal datapath to emulate an ISA
  - Microinstructions can be thought of as a user-invisible ISA (u-ISA)
- Enables easy extensibility of the ISA
  - Can support a new instruction by changing the microcode
  - Can support complex instructions as a sequence of simple microinstructions (e.g., REP MOVS, INC [MEM])
- Enables update of machine behavior
  - A buggy implementation of an instruction can be fixed by changing the microcode in the field
- Easier if datapath provides ability to do the same thing in different ways

25 Update of Machine Behavior
- The ability to update/patch microcode in the field (after a processor is shipped) enables
  - Ability to add new instructions without changing the processor!
  - Ability to fix buggy hardware implementations
- Examples
  - IBM 370 Model 145: microcode stored in main memory, can be updated after a reboot
  - IBM System z: Similar to 370/145.
    - Heller and Farrell, Millicode in an IBM zSeries processor, IBM JR&D, May/Jul 2004.
  - B1700 microcode can be updated while the processor is running
    - User-microprogrammable machine!
    - Wilner, Microprogramming environment on the Burroughs B1700, CompCon

26 Multi-Cycle vs. Single-Cycle uarch
- Advantages
- Disadvantages
- For you to fill in

27 Can We Do Better?

28 Can We Do Better?
- What limitations do you see with the multi-cycle design?
- Limited concurrency
  - Some hardware resources are idle during different phases of instruction processing cycle
  - Fetch logic is idle when an instruction is being decoded or executed
  - Most of the datapath is idle when a memory access is happening

29 Can We Use the Idle Hardware to Improve Concurrency?
- Goal: More concurrency -> Higher instruction throughput (i.e., more work completed in one cycle)
- Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction
  - E.g., when an instruction is being decoded, fetch the next instruction
  - E.g., when an instruction is being executed, decode another instruction
  - E.g., when an instruction is accessing data memory (ld/st), execute the next instruction
  - E.g., when an instruction is writing its result into the register file, access data memory for the next instruction

30 Pipelining

31 Pipelining: Basic Idea
- More systematically:
  - Pipeline the execution of multiple instructions
  - Analogy: Assembly line processing of instructions
- Idea:
  - Divide the instruction processing cycle into distinct stages of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
- Benefit: Increases instruction processing throughput (1/CPI)
- Downside: Start thinking about this

32 Example: Execution of Four Independent ADDs
- Multi-cycle: 4 cycles per instruction
  [Timeline: F D E W, F D E W, F D E W, F D E W, one instruction after another]
- Pipelined: 4 cycles per 4 instructions (steady state)
  [Timeline: each instruction's F D E W starts one cycle after the previous instruction's]
- Is life always this beautiful?
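
The cycle counts in the four-ADD example can be checked with a short sketch. It assumes the 4-stage F/D/E/W split from the slide, fully independent instructions and no stalls; the function names are hypothetical and only illustrate the arithmetic.

```python
def multicycle_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Multi-cycle: an instruction finishes all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Pipelined: the first instruction takes num_stages cycles to drain through,
    and every later instruction completes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


print(multicycle_cycles(4))  # 16
print(pipelined_cycles(4))   # 7
```

Once the pipeline is full, one instruction completes every cycle, which is where the "4 cycles per 4 instructions (steady state)" figure comes from; the extra 3 cycles in the total are the one-time fill cost.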

33 The Laundry Analogy
[Figure: loads A, B, C, D processed one after another in task order along a time axis starting at 6 PM]
- place one dirty load of clothes in the washer
- when the washer is finished, place the wet load in the dryer
- when the dryer is finished, take out the dry load and fold
- when folding is finished, ask your roommate (??) to put the clothes away
- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not share resources
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

34 Pipelining Multiple Loads of Laundry
[Figure: sequential vs. overlapped schedule of loads A, B, C, D along a time axis starting at 6 PM]
- 4 loads of laundry in parallel
- no additional resources
- throughput increased by 4
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

35 Pipelining Multiple Loads of Laundry: In Practice
[Figure: overlapped schedule of loads A, B, C, D with steps of unequal length]
- the slowest step decides throughput
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

36 Pipelining Multiple Loads of Laundry: In Practice
[Figure: overlapped schedule of loads A, B, C, D with a second dryer]
- throughput restored (2 loads per hour) using 2 dryers
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

37 An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing cycle?

38 Ideal Pipelining
- Combinational logic (F,D,E,M,W), T ps: BW = ~(1/T)
- Two stages, T/2 ps (F,D,E) and T/2 ps (M,W): BW = ~(2/T)
- Three stages, T/3 ps (F,D), T/3 ps (E,M) and T/3 ps (M,W): BW = ~(3/T)

39 More Realistic Pipeline: Throughput
- Nonpipelined version with delay T
  - BW = 1 / (T + S), where S = latch delay
- k-stage pipelined version (each stage T/k ps)
  - BW_k-stage = 1 / (T/k + S)
  - BW_max = 1 / (1 gate delay + S)
- Latch delay reduces throughput (switching overhead b/w stages)
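
A quick numerical sketch of the throughput formula shows how the latch delay S caps the benefit of adding stages. The values of T and S below are illustrative assumptions, not figures from the lecture, and `pipeline_bandwidth` is a hypothetical helper name.

```python
def pipeline_bandwidth(logic_delay_ps: float, latch_delay_ps: float,
                       num_stages: int = 1) -> float:
    """Results per picosecond for a k-stage pipeline: 1 / (T/k + S)."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return 1.0 / (logic_delay_ps / num_stages + latch_delay_ps)


T_PS = 1000.0  # assumed total combinational delay (not from the lecture)
S_PS = 50.0    # assumed latch delay (not from the lecture)

base = pipeline_bandwidth(T_PS, S_PS, 1)
for k in (1, 2, 5, 10, 100):
    speedup = pipeline_bandwidth(T_PS, S_PS, k) / base
    print(f"k={k:3d}  speedup over nonpipelined = {speedup:.2f}")
```

With these assumed numbers, 5 stages give a speedup of 4.2 rather than 5, and no number of stages can exceed (T + S) / S = 21, because each stage still pays the full latch delay.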

40 More Realistic Pipeline: Cost
- Nonpipelined version with combinational cost G (G gates)
  - Cost = G + L, where L = latch cost
- k-stage pipelined version (each stage G/k gates)
  - Cost_k-stage = G + Lk
- Latches increase hardware cost
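
The cost formula can be evaluated the same way. This is a minimal sketch with made-up values for G and L (both are assumptions chosen only for illustration), using a hypothetical helper `pipeline_cost`.

```python
def pipeline_cost(logic_gates: int, latch_cost: int, num_stages: int = 1) -> int:
    """Hardware cost of a k-stage pipeline: G + L*k."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return logic_gates + latch_cost * num_stages


G = 10_000  # assumed combinational gate count (not from the lecture)
L = 500     # assumed cost of one set of latches (not from the lecture)

nonpipelined = pipeline_cost(G, L, 1)
for k in (1, 5, 10):
    cost = pipeline_cost(G, L, k)
    print(f"k={k:2d}  cost={cost}  relative to nonpipelined={cost / nonpipelined:.2f}")
```

The logic cost G stays fixed while the latch cost grows linearly with k, so deeper pipelines pay for throughput with extra state.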

41 Pipelining Instruction Processing

42 Remember: The Instruction Processing Cycle
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)
Corresponding phases of the basic cycle: Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result

43 Remember the Single-Cycle Uarch
[Figure: single-cycle datapath with control signals RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken. The whole datapath takes time T: BW = ~(1/T)]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

44 Dividing Into Stages
- IF: Instruction fetch (200 ps)
- ID: Instruction decode/register file read (100 ps)
- EX: Execute/address calculation (200 ps)
- MEM: Memory access (200 ps)
- WB: Write back (100 ps)
[Figure: the single-cycle datapath cut into these five stages; the PC-select path is marked "ignore for now"]
- Is this the correct partitioning?
- Why not 4 or 6 stages? Why not different boundaries?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

45 Pipeline Throughput
[Figure: three consecutive lw instructions in program execution order, each passing through Instruction fetch, Reg, ALU, Data access, Reg. Non-pipelined: a new instruction starts every 800 ps. Pipelined: a new instruction starts every 200 ps, with every stage allotted 200 ps.]
- 5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
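
One way to see where the speedup of 4 comes from is to compute the clock periods directly. The sketch below assumes per-stage latencies of 200, 100, 200, 200 and 100 ps (a reconstruction that is consistent with the 800 ps and 200 ps periods on the slide) and ignores latch overhead.

```python
# Assumed stage latencies in picoseconds; they sum to the 800 ps single-cycle period.
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Single-cycle: one instruction must pass through every stage within one clock.
single_cycle_period_ps = sum(STAGE_LATENCY_PS.values())

# Pipelined: all stages share one clock, so the slowest stage sets the period.
pipelined_period_ps = max(STAGE_LATENCY_PS.values())

speedup = single_cycle_period_ps / pipelined_period_ps
ideal_speedup = len(STAGE_LATENCY_PS)

# Time each stage sits idle within a pipelined clock cycle.
slack_ps = {stage: pipelined_period_ps - latency
            for stage, latency in STAGE_LATENCY_PS.items()}

print(single_cycle_period_ps, pipelined_period_ps, speedup, ideal_speedup)
print(slack_ps)
```

Under these assumptions ID and WB each waste 100 ps per cycle, so the unbalanced stages, not the stage count, determine the achieved speedup.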

46 Enabling Pipelined Processing: Pipeline Registers
- No resource is used by more than 1 stage!
[Figure: datapath divided into IF, ID, EX, MEM and WB by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Latched values include PC_F, IR_D, PC_D+4, A_E, B_E, Imm_E, PC_E+4, nPC_M, Aout_M, B_M, MDR_W and Aout_W. Each stage takes T/k ps instead of T.]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

47 Pipelined Operation Example
[Figure: an lw instruction moving through Instruction fetch, Instruction decode, Execution, Memory and Write back, one pipeline register (IF/ID, ID/EX, EX/MEM, MEM/WB) per clock]
- All instruction classes must follow the same path and timing through the pipeline stages.
- Any performance impact?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

48 Pipelined Operation Example
[Figure: lw $10, 20($1) followed by sub $11, $2, $3, shown clock by clock (Clock 1 through Clock 6) as they advance through Instruction fetch, Instruction decode, Execution, Memory and Write back, with sub one stage behind lw]
- Is life always this beautiful?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

49 Illustrating Pipeline Operation: Operation View
         t0   t1   t2   t3   t4   t5
Inst0    IF   ID   EX   MEM  WB
Inst1         IF   ID   EX   MEM  WB
Inst2              IF   ID   EX   MEM
Inst3                   IF   ID   EX
Inst4                        IF   ID
Steady state (full pipeline) from t4 on: every stage holds an instruction.

50 Illustrating Pipeline Operation: Resource View
       t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF     I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID          I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX               I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                   I0   I1   I2   I3   I4   I5   I6   I7
WB                         I0   I1   I2   I3   I4   I5   I6
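
The resource view can be generated mechanically: in cycle t, the stage at depth d holds instruction t - d. The following is a minimal sketch assuming an ideal 5-stage pipeline with no stalls; `resource_view` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def resource_view(num_instructions: int, num_cycles: int) -> dict:
    """Map each stage to the list of instructions it holds in cycles t0..t(n-1).

    Assumes one instruction enters IF per cycle and nothing ever stalls.
    Empty slots are shown as "--".
    """
    rows = {}
    for depth, stage in enumerate(STAGES):
        row = []
        for t in range(num_cycles):
            index = t - depth  # instruction that entered IF `depth` cycles ago
            row.append(f"I{index}" if 0 <= index < num_instructions else "--")
        rows[stage] = row
    return rows


for stage, row in resource_view(num_instructions=11, num_cycles=11).items():
    print(f"{stage:4s}", " ".join(f"{cell:>3s}" for cell in row))
```

Reading a column of the output gives the contents of the whole pipeline in one cycle; reading a row gives the stream of instructions seen by one stage.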

51 Control Points in a Pipeline
[Figure: pipelined datapath with control points PCSrc, RegWrite, Branch, ALUSrc, ALUOp, RegDst, MemWrite, MemRead and MemtoReg]
- Identical set of control points as the single-cycle datapath!!
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

52 Control Signals in a Pipeline
- For a given instruction
  - same control signals as single-cycle, but
  - control signals required at different cycles, depending on stage
- Option 1: decode once using the same logic as single-cycle and buffer signals until consumed
  [Figure: the Control output is split into EX, M and WB groups; ID/EX holds WB, M and EX; EX/MEM holds WB and M; MEM/WB holds WB]
- Option 2: carry relevant instruction word/field down the pipeline and decode locally within each stage or in a previous stage
- Which one is better?
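
Option 1 can be pictured as a bundle of control signals that shrinks as it moves down the pipeline. The sketch below is only an illustration: the signal values follow the usual textbook settings for a load and an R-type instruction, and the table and function names are hypothetical.

```python
# Illustrative control values (assumed), grouped by the stage that consumes them.
CONTROL_TABLE = {
    "lw":  {"EX": {"RegDst": 0, "ALUSrc": 1},
            "M":  {"MemRead": 1, "MemWrite": 0, "Branch": 0},
            "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sub": {"EX": {"RegDst": 1, "ALUSrc": 0},
            "M":  {"MemRead": 0, "MemWrite": 0, "Branch": 0},
            "WB": {"RegWrite": 1, "MemtoReg": 0}},
}


def decode(opcode: str) -> dict:
    """Decode once in ID; the full bundle is latched into ID/EX."""
    return {group: dict(signals) for group, signals in CONTROL_TABLE[opcode].items()}


def to_ex_mem(id_ex: dict) -> dict:
    """EX has consumed its group; only M and WB are latched into EX/MEM."""
    return {"M": id_ex["M"], "WB": id_ex["WB"]}


def to_mem_wb(ex_mem: dict) -> dict:
    """MEM has consumed its group; only WB is latched into MEM/WB."""
    return {"WB": ex_mem["WB"]}


id_ex = decode("lw")
ex_mem = to_ex_mem(id_ex)
mem_wb = to_mem_wb(ex_mem)
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

The trade-off against Option 2 is visible here: buffering costs extra pipeline-register bits for every signal that is decoded early, while local decoding would instead carry instruction fields and repeat decode logic in later stages.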

53 Pipelined Control Signals
[Figure: pipelined datapath in which the Control unit's outputs are latched into ID/EX as WB, M and EX groups, passed on to EX/MEM as WB and M, and to MEM/WB as WB, where they drive RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead, MemtoReg and PCSrc]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

54 Another Example: Single-Cycle and Pipelined
[Figure: the single-cycle datapath (top) and the same datapath with pipeline registers (bottom), divided into Fetch, Decode, Execute, Memory and Writeback. Pipelined signals carry a stage suffix, e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW.]

55 Another Example: Correct Pipelined Datapath
[Figure: WriteReg is carried through the pipeline as WriteRegE, WriteRegM and WriteRegW before reaching the register file]
- WriteReg must arrive at the same time as Result

56 Another Example: Pipelined Control
[Figure: the control unit decodes Op (Instr[31:26]) and Funct (Instr[5:0]) in Decode; RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc and RegDst are carried forward with D, E, M and W suffixes]
- Same control unit as single-cycle processor
- Control delayed to proper pipeline stage

57 Remember: An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing cycle?

58 Instruction Pipeline: Not An Ideal Pipeline
- Identical operations ... NOT!
  - different instructions -> not all need the same stages
  - Forcing different instructions to go through the same pipe stages -> external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  - different pipeline stages -> not the same latency
  - Need to force each stage to be controlled by the same clock -> internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  - instructions are not independent of each other
  - Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results -> pipeline stalls (pipeline is not always moving)

59 Issues in Pipeline Design
- Balancing work in pipeline stages
  - How many stages and what is done in each stage
- Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
  - Handling dependences
    - Data
    - Control
  - Handling resource contention
  - Handling long-latency (multi-cycle) operations
- Handling exceptions, interrupts
- Advanced: Improving pipeline throughput
  - Minimizing stalls

60 Causes of Pipeline Stalls
- Stall: A condition when the pipeline stops moving
- Resource contention
- Dependences (between instructions)
  - Data
  - Control
- Long-latency (multi-cycle) operations

61 Dependences and Their Types
- Also called dependency or, less desirably, hazard
- Dependences dictate ordering requirements between instructions
- Two types
  - Data dependence
  - Control dependence
- Resource contention is sometimes called resource dependence
  - However, this is not fundamental to (dictated by) program semantics, so we will treat it separately

62 Handling Resource Contention
- Happens when instructions in two pipeline stages need the same resource
- Solution 1: Eliminate the cause of contention
  - Duplicate the resource or increase its throughput
    - E.g., use separate instruction and data memories (caches)
    - E.g., use multiple ports for memory structures
- Solution 2: Detect the resource contention and stall one of the contending stages
  - Which stage do you stall?
  - Example: What if you had a single read and write port for the register file?

63 Example Resource Dependence: RegFile
- The register file can be read and written in the same cycle:
  - write takes place during the 1st half of the cycle
  - read takes place during the 2nd half of the cycle => no problem!!!
  - However, operations that involve the register file have only half a clock cycle to complete the operation!!
[Figure: pipeline diagram over time (cycles) for add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5, each passing through IM, RF, ALU, DM and RF]
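
The write-then-read ordering within a cycle can be modeled in a few lines. This is a behavioral sketch only, assuming one write port and any number of read ports; the class and method names are hypothetical, and special registers such as a hardwired zero register are not modeled.

```python
class RegisterFile:
    """Register file whose write happens in the first half of a cycle
    and whose reads happen in the second half of the same cycle."""

    def __init__(self, num_registers: int = 32):
        self.regs = [0] * num_registers

    def cycle(self, write=None, reads=()):
        """Simulate one clock cycle.

        write: optional (register_index, value) pair from the Writeback stage.
        reads: register indices requested by the Decode stage.
        """
        # First half of the cycle: commit the write.
        if write is not None:
            index, value = write
            self.regs[index] = value
        # Second half of the cycle: reads observe the value just written.
        return [self.regs[index] for index in reads]


rf = RegisterFile()
# Writeback stores 42 into register 16 while Decode reads register 16 in the same cycle.
print(rf.cycle(write=(16, 42), reads=(16, 17)))  # [42, 0]
```

Because the read sees the freshly written value, an instruction in Decode and an instruction in Writeback can share the register file in one cycle without a stall, at the price of each operation having only half a cycle.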

