Design of Digital Circuits Lecture 13: Multi-Cycle Microarch. Prof. Onur Mutlu ETH Zurich Spring April 2017

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 13: Multi-Cycle Microarch. Prof. Onur Mutlu ETH Zurich Spring April 2017"

Rafe Clarke
5 years ago
Views:

1 Design of Digital Circuits Lecture 3: Multi-Cycle Microarch. Prof. Onur Mutlu ETH Zurich Spring 27 6 April 27

2 Agenda for Today & Next Few Lectures! Single-cycle Microarchitectures! Multi-cycle and Microprogrammed Microarchitectures! Pipelining! Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery,! Out-of-Order Execution! Issues in OoO Execution: Load-Store Handling, 2

3 Readings for Today! P&P, Chapter 4, 5 " Microarchitecture! P&P, Revised Appendix C " Microarchitecture of the LC-3b " Appendix A (LC-3b ISA) will be useful in following this! H&H, Chapter 7.4 (keep reading)! Optional " Maurice Wilkes, The Best Way to Design an Automatic Calculating Machine, Manchester Univ. Computer Inaugural Conf., 95. 3

4 Implementing the ISA: Microarchitecture Basics (Review) 4

5 How Does a Machine Process Instructions?! What does processing an instruction mean?! We will assume the von Neumann model (for now) AS = Architectural (programmer visible) state before an instruction is processed Process instruction AS = Architectural (programmer visible) state after an instruction is processed! Processing an instruction: Transforming AS to AS according to the ISA specification of the instruction 5

6 The Von Neumann Model/Architecture! Also called stored program computer (instructions in memory). Two key properties:! Stored program " Instructions stored in a linear memory array " Memory is unified between instructions and data! The interpretation of a stored value depends on the control signals When is a value interpreted as an instruction?! Sequential instruction processing " One instruction processed (fetched, executed, and completed) at a time " Program counter (instruction pointer) identifies the current instr. " Program counter is advanced sequentially except for control transfer instructions 6

7 The Von Neumann Model/Architecture! Recommended reading " Burks, Goldstein, von Neumann, Preliminary discussion of the logical design of an electronic computing instrument, 946.! Required reading " Patt and Patel book, Chapter 4, The von Neumann Model! Stored program! Sequential instruction processing 7

8 The Von Neumann Model (of a Computer) MEMORY Mem Addr Reg Mem Data Reg INPUT PROCESSING UNIT ALU TEMP OUTPUT CONTROL UNIT IP Inst Register 8

9 The Von Neumann Model (of a Computer)! Q: Is this the only way that a computer can operate?! A: No.! Qualified Answer: But, it has been the dominant way " i.e., the dominant paradigm for computing " for N decades 9

10 The Dataflow Model (of a Computer)! Von Neumann model: An instruction is fetched and executed in control flow order " As specified by the instruction pointer " Sequential unless explicit control flow instruction! Dataflow model: An instruction is fetched and executed in data flow order " i.e., when its operands are ready " i.e., there is no instruction pointer " Instruction ordering specified by data flow dependence! Each instruction specifies who should receive the result! An instruction can fire whenever all operands are received " Potentially many instructions can execute at the same time! Inherently more parallel

11 Von Neumann vs Dataflow! Consider a Von Neumann program " What is the significance of the program order? " What is the significance of the storage locations? a b v <= a + b; w <= b * 2; x <= v - w + *2 y <= v + w z <= x * y Sequential - + * Dataflow! Which model is more natural to you as a programmer? z

12 More on Data Flow! In a data flow machine, a program consists of data flow nodes " A data flow node fires (fetched and executed) when all it inputs are ready! i.e. when all inputs have tokens! Data flow node and its ISA representation 2

13 Data Flow Nodes 3

14 An Example Data Flow Program OUT 4

15 ISA-level Tradeoff: Instruction Pointer! Do we need an instruction pointer in the ISA? " Yes: Control-driven, sequential execution! An instruction is executed when the IP points to it! IP automatically changes sequentially (except for control flow instructions) " No: Data-driven, parallel execution! An instruction is executed when all its operand values are available (data flow)! Tradeoffs: MANY high-level ones " Ease of programming (for average programmers)? " Ease of compilation? " Performance: Extraction of parallelism? " Hardware complexity? 5

16 ISA vs. Microarchitecture Level Tradeoff! A similar tradeoff (control vs. data-driven execution) can be made at the microarchitecture level! ISA: Specifies how the programmer sees instructions to be executed " Programmer sees a sequential, control-flow execution order vs. " Programmer sees a data-flow execution order! Microarchitecture: How the underlying implementation actually executes instructions " Microarchitecture can execute instructions in any order as long as it obeys the semantics specified by the ISA when making the instruction results visible to software! Programmer should see the order specified by the ISA 6

17 The Process Instruction Step! ISA specifies abstractly what AS should be, given an instruction and AS " It defines an abstract finite state machine where! State = programmer-visible state! Next-state logic = instruction execution specification " From ISA point of view, there are no intermediate states between AS and AS during instruction execution! One state transition per instruction! Microarchitecture implements how AS is transformed to AS " There are many choices in implementation " We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction! Choice : AS # AS (transform AS to AS in a single clock cycle)! Choice 2: AS # AS+MS # AS+MS2 # AS+MS3 # AS (take multiple clock cycles to transform AS to AS ) 7

18 A Very Basic Instruction Processing Engine! Each instruction takes a single clock cycle to execute! Only combinational logic is used to implement instruction execution " No intermediate, programmer-invisible state updates AS = Architectural (programmer visible) state at the beginning of a clock cycle Process instruction in one clock cycle AS = Architectural (programmer visible) state at the end of a clock cycle 8

19 A Very Basic Instruction Processing Engine! Single-cycle machine Combina.onal Logic AS Sequen.al Logic (State) AS! What is the clock cycle time determined by?! What is the critical path of the combinational logic determined by? 9

20 Remember: Programmer Visible (Architectural) State M[] M[] M[2] M[3] M[4] Registers - given special names in the ISA (as opposed to addresses) - general vs. special purpose M[N-] Memory array of storage loca.ons indexed by an address Program Counter memory address of the current instruc.on Instruc.ons (and programs) specify how to transform the values of programmer visible state 2

21 Single-cycle vs. Multi-cycle Machines! Single-cycle machines " Each instruction takes a single clock cycle " All state updates made at the end of an instruction s execution " Big disadvantage: The slowest instruction determines cycle time # long clock cycle time! Multi-cycle machines " Instruction processing broken into multiple cycles/stages " State updates can be made during an instruction s execution " Architectural state updates made only at the end of an instruction s execution " Advantage over single-cycle: The slowest stage determines cycle time! Both single-cycle and multi-cycle machines literally follow the von Neumann model at the microarchitecture level 2

22 Instruction Processing Cycle! Instructions are processed under the direction of a control unit step by step.! Instruction cycle: Sequence of steps to process an instruction! Fundamentally, there are six steps:! Fetch! Decode! Evaluate Address! Fetch Operands! Execute! Store Result! Not all instructions require all six steps (see P&P Ch. 4) 22

23 Instruction Processing Cycle vs. Machine Clock Cycle! Single-cycle machine: " All six phases of the instruction processing cycle take a single machine clock cycle to complete! Multi-cycle machine: " All six phases of the instruction processing cycle can take multiple machine clock cycles to complete " In fact, each phase can take multiple clock cycles to complete 23

24 Instruction Processing Viewed Another Way! Instructions transform Data (AS) to Data (AS )! This transformation is done by functional units " Units that operate on data! These units need to be told what to do to the data! An instruction processing engine consists of two components " Datapath: Consists of hardware elements that deal with and transform data signals! functional units that operate on data! hardware structures (e.g. wires and muxes) that enable the flow of data into the functional units and registers! storage units that store data (e.g., registers) " Control logic: Consists of hardware elements that determine control signals, i.e., signals that specify what the datapath elements should do to the data 24

25 Single-cycle vs. Multi-cycle: Control & Data! Single-cycle machine: " Control signals are generated in the same clock cycle as the one during which data signals are operated on " Everything related to an instruction happens in one clock cycle (serialized processing)! Multi-cycle machine: " Control signals needed in the next cycle can be generated in the current cycle " Latency of control processing can be overlapped with latency of datapath operation (more parallelism)! We will see the difference clearly in microprogrammed multi-cycle microarchitectures 25

26 Many Ways of Datapath and Control Design! There are many ways of designing the data path and control logic! Single-cycle, multi-cycle, pipelined datapath and control! Single-bus vs. multi-bus datapaths! Hardwired/combinational vs. microcoded/microprogrammed control " Control signals generated by combinational logic versus " Control signals stored in a memory structure! Control signals and structure depend on the datapath design 26

27 Performance Analysis! Execution time of an instruction " {CPI} x {clock cycle time}! Execution time of a program " Sum over all instructions [{CPI} x {clock cycle time}] " {# of instructions} x {Average CPI} x {clock cycle time}! Single cycle microarchitecture performance " CPI = " Clock cycle time = long! Multi-cycle microarchitecture performance " CPI = different for each instruction! Average CPI # hopefully small " Clock cycle time = short Now, we have two degrees of freedom to optimize independently 27

28 Review: Complete Single-Cycle Processor SignImm A RD Instruction Memory + 4 A A3 WD3 RD2 RD WE3 A2 Sign Extend Register File A RD Data Memory WD WE PC PC' Instr 25:2 2:6 5: 5: SrcB 2:6 5: <<2 + ALUResult ReadData WriteData SrcA PCPlus4 PCBranch WriteReg 4: Result 3:26 RegDst Branch MemWrite MemtoReg ALUSrc RegWrite Op Funct Control Unit Zero PCSrc ALUControl 2: ALU

29 Another Example Single-Cycle Processor Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump 4 Add PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data M u x bcond Zero ALU ALU result Address Write data Data memory Read data M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] **Based on original figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] JAL, JR, JALR omiaed 29

30 Evaluating the Single-Cycle Microarchitecture 3

31 A Single-Cycle Microarchitecture! Is this a good idea/design?! When is this a good design?! When is this a bad design?! How can we design a better microarchitecture? 3

32 A Single-Cycle Microarchitecture: Analysis! Every instruction takes cycle to execute " CPI (Cycles per instruction) is strictly! How long each instruction takes is determined by how long the slowest instruction takes to execute " Even though many instructions do not need that long to execute! Clock cycle time of the microarchitecture is determined by how long it takes to complete the slowest instruction " Critical path of the design is determined by the processing time of the slowest instruction 32

33 What is the Slowest Instruction to Process?! Let s go back to the basics! All six phases of the instruction processing cycle take a single machine clock cycle to complete " Fetch " Decode " Evaluate Address " Fetch Operands " Execute " Store Result. Instruction fetch (IF) 2. Instruction decode and register operand fetch (ID/RF) 3. Execute/Evaluate memory address (EX/AG) 4. Memory operand fetch (MEM) 5. Store/writeback result (WB)! Do each of the above phases take the same time (latency) for all instructions? 33

34 Example Single-Cycle Datapath Analysis! Assume (for the design in Slide 29) " memory units (read or write): 2 ps " ALU and adders: ps " register file (read or write): 5 ps " other combinational logic: ps steps IF ID EX MEM WB resources mem RF ALU mem RF Delay R-type I-type LW SW Branch Jump

35 Let s Find the Critical Path Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump 4 Add PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data M u x bcond Zero ALU ALU result Address Write data Data memory Read data M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 35

36 R-Type and I-Type ALU Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump ps 4 Add ps PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] 2ps Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data 4ps 25ps M u x bcond Zero ALU ALU result Address 35ps Write data Data memory Read data M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 36

37 LW Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump ps 4 Add ps PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] 2ps Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write 6ps data 25ps M u x bcond Zero ALU ALU result Address 35ps Write data Data memory Read data 55ps M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 37

38 SW Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump ps 4 Add ps PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] 2ps Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data 25ps M u x bcond Zero ALU ALU result 35ps Address Read data Data Write 55ps memory data M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 38

39 Branch Taken 35ps PC 4 Read address Instruction memory Add Instruction [3 ] ps Instruction [25 ] Shift Jump address [3 ] left ps PC+4 [3 28] Instruction [3 26] Instruction [25 2] Instruction [2 6] Instruction [5 ] Instruction [5 ] Control M u x RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Read register Read data Read register 2 Registers Read Write data 2 register Write data 6 Sign 32 extend Shift left 2 25ps M u x ALU control 2ps Add ALU result bcond Zero ALU ALU result 35ps ALU opera.on Write data Address PCSrc =Jump M u x Data memory M u x PCSrc 2 =Br Taken Read data M u x Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 39

40 Jump Instruction [25 ] Shift Jump address [3 ] left PCSrc =Jump 2ps 4 Add ps PC+4 [3 28] Instruction [3 26] Control RegDst Jump Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x M u x PCSrc 2 =Br Taken PC Read address Instruction memory Instruction [3 ] 2ps Instruction [25 2] Instruction [2 6] Instruction [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data M u x bcond Zero ALU ALU result Address Write data Data memory Read data M u x Instruction [5 ] 6 Sign 32 extend ALU control ALU opera.on Instruction [5 ] [Based on original figure from P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 4

41 What About Control Logic?! How does that affect the critical path?! Food for thought for you: " Can control logic be on the critical path? " A note on CDC 56: control store access too long 4

42 What is the Slowest Instruction to Process?! Memory is not magic! What if memory sometimes takes ms to access?! Does it make sense to have a simple register to register add or jump to take {ms+all else to do a memory operation}?! And, what if you need to access memory more than once to process an instruction? " Which instructions need this? " Do you provide multiple ports to memory? 42

43 Single Cycle uarch: Complexity! Contrived " All instructions run as slow as the slowest instruction! Inefficient " All instructions run as slow as the slowest instruction " Must provide worst-case combinational resources in parallel as required by any instruction " Need to replicate a resource if it is needed more than once by an instruction during different parts of the instruction processing cycle! Not necessarily the simplest way to implement an ISA " Single-cycle implementation of REP MOVS (x86) or INDEX (VAX)?! Not easy to optimize/improve performance " Optimizing the common case does not work (e.g. common instructions) " Need to optimize the worst case all the time 43

44 (Micro)architecture Design Principles! Critical path design " Find and decrease the maximum combinational logic delay " Break a path into multiple cycles if it takes too long! Bread and butter (common case) design " Spend time and resources on where it matters most! i.e., improve what the machine is really designed to do " Common case vs. uncommon case! Balanced design " Balance instruction/data flow through hardware components " Design to eliminate bottlenecks: balance the hardware for the work 44

45 Single-Cycle Design vs. Design Principles! Critical path design! Bread and butter (common case) design! Balanced design How does a single-cycle microarchitecture fare in light of these principles? 45

46 Aside: System Design Principles! When designing computer systems/architectures, it is important to follow good principles! Remember: principled design from our first lecture " Frank Lloyd Wright: architecture [ ] based upon principle, and not upon precedent 46

47 Aside: From Lecture! architecture [ ] based upon principle, and not upon precedent 47

48 Aside: System Design Principles! We will continue to cover key principles in this course! Here are some references where you can learn more! Yale Patt, Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution, Proc. of IEEE, 2. (Levels of transformation, design point, etc)! Mike Flynn, Very High-Speed Computing Systems, Proc. of IEEE, 966. (Flynn s Bottleneck # Balanced design)! Gene M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference, April 967. (Amdahl s Law # Common-case design)! Butler W. Lampson, Hints for Computer System Design, ACM Operating Systems Review, 983. " 48

! Albert Einstein And, keep it low cost: An engineer is a person who can do for a dime what any

49 A Key System Design Principle!! Keep it simple Everything should be made as simple as possible, but no simpler. "!! Albert Einstein And, keep it low cost: An engineer is a person who can do for a dime what any fool can do for a dollar. For more, see: " " Butler W. Lampson, Hints for Computer System Design, ACM Operating Systems Review,

50 Multi-Cycle Microarchitectures 5

51 Multi-Cycle Microarchitectures! Goal: Let each instruction take (close to) only as much time it really needs! Idea " Determine clock cycle time independently of instruction processing time " Each instruction takes as many clock cycles as it needs to take! Multiple state transitions per instruction! The states followed by each instruction is different 5

52 Remember: The Process instruction Step! ISA specifies abstractly what AS should be, given an instruction and AS " It defines an abstract finite state machine where! State = programmer-visible state! Next-state logic = instruction execution specification " From ISA point of view, there are no intermediate states between AS and AS during instruction execution! One state transition per instruction! Microarchitecture implements how AS is transformed to AS " There are many choices in implementation " We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction! Choice : AS # AS (transform AS to AS in a single clock cycle)! Choice 2: AS # AS+MS # AS+MS2 # AS+MS3 # AS (take multiple clock cycles to transform AS to AS ) 52

53 Multi-Cycle Microarchitecture AS = Architectural (programmer visible) state at the beginning of an instruction Step : Process part of instruction in one clock cycle Step 2: Process part of instruction in the next clock cycle AS = Architectural (programmer visible) state at the end of a clock cycle 53

54 Benefits of Multi-Cycle Design! Critical path design " Can keep reducing the critical path independently of the worstcase processing time of any instruction! Bread and butter (common case) design " Can optimize the number of states it takes to execute important instructions that make up much of the execution time! Balanced design " No need to provide more capability or resources than really needed! An instruction that needs resource X multiple times does not require multiple X s to be implemented! Leads to more efficient hardware: Can reuse hardware components needed multiple times for an instruction 54

55 Downsides of Multi-Cycle Design! Need to store the intermediate results at the end of each clock cycle " Hardware overhead for registers " Register setup/hold overhead paid multiple times for an instruction 55

56 Remember: Performance Analysis! Execution time of an instruction " {CPI} x {clock cycle time}! Execution time of a program " Sum over all instructions [{CPI} x {clock cycle time}] " {# of instructions} x {Average CPI} x {clock cycle time}! Single cycle microarchitecture performance " CPI = " Clock cycle time = long! Multi-cycle microarchitecture performance " CPI = different for each instruction! Average CPI # hopefully small " Clock cycle time = short Not easy to optimize design We have two degrees of freedom to optimize independently 56

57 A Multi-Cycle Microarchitecture A Closer Look 57

58 How Do We Implement This?! Maurice Wilkes, The Best Way to Design an Automatic Calculating Machine, Manchester Univ. Computer Inaugural Conf., 95.! An elegant implementation: " The concept of microcoded/microprogrammed machines 58

59 Multi-Cycle uarch! Key Idea for Realization " One can implement the process instruction step as a finite state machine that sequences between states and eventually returns back to the fetch instruction state " A state is defined by the control signals asserted in it " Control signals for the next state determined in current state 59

60 The Instruction Processing Cycle " Fetch " Decode " Evaluate Address " Fetch Operands " Execute " Store Result 6

61 A Basic Multi-Cycle Microarchitecture! Instruction processing cycle divided into states! A stage in the instruction processing cycle can take multiple states! A multi-cycle microarchitecture sequences from state to state to process an instruction! The behavior of the machine in a state is completely determined by control signals in that state! The behavior of the entire processor is specified fully by a finite state machine! In a state (clock cycle), control signals control two things:! How the datapath should process the data! How to generate the control signals for the (next) clock cycle 6

62 One Example Multi-Cycle Microarchitecture 62

63 Carnegie Mellon Remember: Single-Cycle MIPS Processor Jump 3:26 5: MemtoReg Control MemWrite Unit Branch ALUControl 2: Op ALUSrc Funct RegDst RegWrite PCSrc PC' PC A RD Instruction Memory Instr 25:2 2:6 A A2 A3 WD3 WE3 Register File RD RD2 SrcA SrcB ALU Zero ALUResult WriteData A WE RD Data Memory WD ReadData Result PCJump 4 + PCPlus4 2:6 5: 5: Sign Extend WriteReg 4: SignImm <<2 + PCBranch 27: 3:28 25: <<2 63

64 Carnegie Mellon MulB-cycle MIPS Processor! Single-cycle microarchitecture: + simple - cycle.me limited by longest instruc.on (lw) - three adders/alus and two memories! MulB-cycle microarchitecture: + higher clock speed + simpler instruc.ons run faster + reuse expensive hardware on mul.ple cycles - sequencing overhead paid many.mes - hardware overhead for storing intermediate results! Same design steps: datapath & control 64

65 Carnegie Mellon What Do We Want To OpBmize! Single Cycle Architecture uses two memories $ One memory stores instruc.ons, the other data $ We want to use a single memory (Smaller size) 65

66 Carnegie Mellon What Do We Want To OpBmize! Single Cycle Architecture uses two memories $ One memory stores instruc.ons, the other data $ We want to use a single memory (Smaller size)! Single Cycle Architecture needs three adders $ ALU, PC, Branch address calcula.on $ We want to use the ALU for all opera.ons (smaller size) 66

67 Carnegie Mellon What Do We Want To OpBmize! Single Cycle Architecture uses two memories $ One memory stores instruc.ons, the other data $ We want to use a single memory (Smaller size)! Single Cycle Architecture needs three adders $ ALU, PC, Branch address calcula.on $ We want to use the ALU for all opera.ons (smaller size)! In Single Cycle Architecture all instrucbons take one cycle $ The most complex opera.on slows down everything! $ Divide all instruc.ons into mul.ple steps $ Simpler instruc.ons can take fewer cycles (average case may be faster) 67

68 Carnegie Mellon Consider the lw instrucbon! For an instrucbon such as: lw $t, x2($t)! We need to: $ Read the instruc.on from memory $ Then read $t from register array $ Add the immediate value (x2) to calculate the memory address $ Read the content of this address $ Write to the register $t this content 68

69 Carnegie Mellon MulB-cycle Datapath: instrucbon fetch! First consider execubng lw $ STEP : Fetch instruc.on IRWrite PC' b PC A WE RD EN Instr A A2 WE3 RD RD2 Instr / Data Memory WD A3 WD3 Register File read from the memory location [rs]+imm to location [rt] I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 69

70 Carnegie Mellon MulB-cycle Datapath: lw register read IRWrite PC' b PC A WE RD EN Instr 25:2 A A2 WE3 RD RD2 A Instr / Data Memory WD A3 WD3 Register File I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 7

71 Carnegie Mellon MulB-cycle Datapath: lw immediate IRWrite PC' b PC A WE RD EN Instr 25:2 A A2 WE3 RD RD2 A Instr / Data Memory WD A3 WD3 Register File SignImm 5: Sign Extend I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 7

72 Carnegie Mellon MulB-cycle Datapath: lw address IRWrite ALUControl 2: PC' b PC A RD Instr / Data Memory WD WE EN Instr 25:2 A A2 A3 WD3 WE3 Register File RD RD2 A SrcA SrcB ALU ALUResult ALUOut SignImm 5: Sign Extend I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 72

73 Carnegie Mellon MulB-cycle Datapath: lw memory read IorD IRWrite ALUControl 2: PC' b PC Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 A A2 A3 WD3 WE3 Register File RD RD2 A SrcA SrcB ALU ALUResult ALUOut SignImm 5: Sign Extend I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 73

74 Carnegie Mellon MulB-cycle Datapath: lw write register IorD IRWrite RegWrite ALUControl 2: PC' b PC Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 A A2 A3 WD3 WE3 Register File RD RD2 A SrcA SrcB ALU ALUResult ALUOut SignImm 5: Sign Extend I-Type op rs rt imm 6 bits 5 bits 5 bits 6 bits 74

75 Carnegie Mellon MulB-cycle Datapath: increment PC PCWrite IorD IRWrite RegWrite ALUSrcA ALUSrcB : ALUControl 2: PC' b EN PC Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 A A2 A3 WD3 WE3 Register File RD RD2 A 4 SrcA SrcB ALU ALUResult ALUOut 5: Sign Extend SignImm 75

76 Carnegie Mellon MulB-cycle Datapath: sw! Write data in rt to memory PCWrite IorD MemWrite IRWrite RegWrite ALUSrcA ALUSrcB : ALUControl 2: PC' PC EN b Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 2:6 A A2 A3 WD3 WE3 Register File RD RD2 A B 4 SrcA SrcB ALU ALUResult ALUOut 5: Sign Extend SignImm 76

77 Carnegie Mellon MulB-cycle Datapath: R-type InstrucBons! Read from rs and rt $ Write ALUResult to register file $ Write to rd (instead of rt) PCWrite IorD MemWrite IRWrite RegDst MemtoReg RegWrite ALUSrcA ALUSrcB : ALUControl 2: PC' PC EN b Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 2:6 5: A A2 A3 WD3 WE3 Register File RD RD2 A B 4 SrcA SrcB ALU ALUResult ALUOut 5: Sign Extend SignImm 77

78 Carnegie Mellon MulB-cycle Datapath: beq! Determine whether values in rs and rt are equal $ Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC+4) PCEn IorD MemWrite IRWrite RegDst MemtoReg RegWrite ALUSrcA ALUSrcB : ALUControl 2: BranchPCWrite PCSrc PC' PC EN b Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 2:6 5: A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 78

79 Carnegie Mellon Complete MulB-cycle Processor IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' PC EN Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 2:6 5: RegDst MemtoReg A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 79

80 Carnegie Mellon Control Unit Control Unit Opcode 5: Main Controller (FSM) MemtoReg RegDst IorD PCSrc ALUSrcB : ALUSrcA IRWrite MemWrite PCWrite Branch RegWrite Multiplexer Selects Register Enables ALUOp : Funct 5: ALU Decoder ALUControl 2: 8

81 Carnegie Mellon Main Controller FSM: Fetch Reset S: Fetch IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' WE PC Instr Adr RD EN A EN Instr / Data Memory WD Data 25:2 2:6 2:6 5: RegDst X MemtoReg X A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 8

82 Carnegie Mellon Main Controller FSM: Fetch S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' WE PC Instr Adr RD EN A EN Instr / Data Memory WD Data 25:2 2:6 2:6 5: RegDst X MemtoReg X A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 82

83 Carnegie Mellon Main Controller FSM: Decode S: Fetch S: Decode IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' X WE PC Instr Adr RD EN A EN Instr / Data Memory WD Data 25:2 2:6 2:6 5: RegDst X MemtoReg X A A2 A3 WD3 WE3 Register File RD RD2 A B X SrcA XXX Zero X XX ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 83

84 Carnegie Mellon Main Controller FSM: Address CalculaBon S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite S: Decode S2: MemAdr Op = LW or Op = SW IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' X WE PC Instr Adr RD EN A EN Instr / Data Memory WD Data 25:2 2:6 2:6 5: RegDst X MemtoReg X A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero X ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 84

85 Carnegie Mellon Main Controller FSM: Address CalculaBon S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite S: Decode S2: MemAdr ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW IorD MemWrite IRWrite 3:26 5: Control Unit Op Funct PCWrite Branch PCSrc ALUControl 2: ALUSrcB : ALUSrcA RegWrite PCEn PC' X WE PC Instr Adr RD EN A EN Instr / Data Memory WD Data 25:2 2:6 2:6 5: RegDst X MemtoReg X A A2 A3 WD3 WE3 Register File RD RD2 A B SrcA Zero X ALUResult ALUOut 4 SrcB <<2 ALU 5: Sign Extend SignImm 85

86 Carnegie Mellon Main Controller FSM: lw S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite S: Decode S2: MemAdr Op = LW or Op = SW ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead IorD = S4: Mem Writeback RegDst = MemtoReg = RegWrite 86

87 Carnegie Mellon Main Controller FSM: sw S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite S: Decode S2: MemAdr Op = LW or Op = SW ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead Op = SW S5: MemWrite IorD = IorD = MemWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 87

88 Carnegie Mellon Main Controller FSM: R-Type S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite S: Decode S2: MemAdr Op = LW or Op = SW Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 88

89 Carnegie Mellon Main Controller FSM: beq S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 89

90 Carnegie Mellon Complete MulB-cycle Controller FSM S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 9

91 Carnegie Mellon Main Controller FSM: addi S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ Op = ADDI S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch S9: ADDI Execute Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback S: ADDI Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 9

92 Carnegie Mellon Main Controller FSM: addi S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ Op = ADDI S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch S9: ADDI Execute ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback S: ADDI Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 92

93 Carnegie Mellon Extended FuncBonality: j PCEn IorD MemWrite IRWrite RegDst MemtoReg RegWrite ALUSrcA ALUSrcB : ALUControl 2: BranchPCWrite PCSrc : PC' PC EN Adr A RD Instr / Data Memory WD WE Instr EN Data 25:2 2:6 2:6 5: A A2 A3 WD3 WE3 Register File RD RD2 A B 3:28 4 <<2 SrcA SrcB ALU Zero ALUResult ALUOut PCJump <<2 27: SignImm 5: Sign Extend 25: (jump) 93

94 Carnegie Mellon Control FSM: j S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ Op = J Op = ADDI S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch S: Jump S9: ADDI Execute ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback S: ADDI Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 94

95 Carnegie Mellon Control FSM: j S2: MemAdr S: Fetch IorD = Reset AluSrcA = ALUSrcB = ALUOp = PCSrc = IRWrite PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW or Op = SW S: Decode ALUSrcA = ALUSrcB = ALUOp = Op = R-type S6: Execute ALUSrcA = ALUSrcB = ALUOp = Op = BEQ Op = J Op = ADDI S8: Branch ALUSrcA = ALUSrcB = ALUOp = PCSrc = Branch S: Jump S9: ADDI Execute PCSrc = PCWrite ALUSrcA = ALUSrcB = ALUOp = Op = LW S3: MemRead Op = SW S5: MemWrite S7: ALU Writeback S: ADDI Writeback IorD = IorD = MemWrite RegDst = MemtoReg = RegWrite RegDst = MemtoReg = RegWrite S4: Mem Writeback RegDst = MemtoReg = RegWrite 95

96 Design of Digital Circuits Lecture 3: Multi-Cycle Microarch. Prof. Onur Mutlu ETH Zurich Spring 27 6 April 27

Design of Digital Circuits 2017 Srdjan Capkun Onur Mutlu (Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan)

Design of Digital Circuits 2017 Srdjan Capkun Onur Mutlu (Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan) Microarchitecture Design of Digital Circuits 27 Srdjan Capkun Onur Mutlu (Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan) http://www.syssec.ethz.ch/education/digitaltechnik_7 Adapted from Digital