Computer Architecture Lecture 6: Multi-cycle Microarchitectures. Prof. Onur Mutlu, Carnegie Mellon University, Spring 2012, 2/6/2012

18-447 Computer Architecture
Lecture 6: Multi-cycle Microarchitectures
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2012, 2/6/2012

Reminder: Homeworks
- Homework solutions
  - Check and study the solutions!
  - Learning now is better than rushing later
- Homework 2
  - Already out
  - Due February 13
  - ISA concepts, ISA vs. microarchitecture, microcoded machines

Reminder: Lab Assignment 2
- Lab Assignment 1.5
  - Verilog practice
  - Not to be turned in
- Lab Assignment 2
  - Due Friday, Feb 17, at the end of the lab
  - Individual assignment
  - No collaboration; please respect the honor code

Extra Credit for Lab Assignment 2
- Complete your normal (single-cycle) implementation first, and get it checked off in lab.
- Then, implement the MIPS core using a microcoded approach similar to what we are discussing in class.
- We are not specifying any particular details of the microcode format or the microarchitecture; you should be creative.
- For the extra credit, the microcoded implementation should execute the same programs that your ordinary implementation does, and you should demo it by the normal lab deadline.

Feedback on Lab Assignment 1
- Chris, Lavanya, and Abeer are working hard on grading
- We will have very comprehensive tests for all labs
  - Lab tests exercise every case of each instruction as well as long programs (e.g., REP MOVS)
  - We will release test cases and register dumps
- Be thorough and test all possible cases
- Follow directions; they are there for a reason
  - No modifications to shell code!
  - No unaligned accesses to memory
  - Remove all your debugging printf's before handing in code
- Do the extra credit work if the lab is too easy!

Readings for Today
- P&P, Revised Appendix C
  - Microarchitecture of the LC-3b
  - Appendix A (LC-3b ISA) will be useful in following this
- P&H, Appendix D
  - Mapping Control to Hardware
- Optional
  - Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.

Readings for Next Lecture
- Pipelining
  - P&H Chapter 4.5-4.8

Review of Last Lecture: Single-Cycle Uarch
- What phases of the instruction processing cycle does the MIPS JAL instruction exercise?
- How many cycles does it take to process an instruction in the single-cycle microarchitecture?
- What determines the clock cycle time?
- What is the difference between datapath and control logic?
- What about combinational vs. sequential control?
- What is the semantics of a delayed branch?
  - Why this is so will become clear when we cover pipelining

Review: Instruction Processing Cycle
- Instructions are processed under the direction of a "control unit" step by step.
- Instruction cycle: sequence of steps to process an instruction
- Fundamentally, there are six phases:
  - Fetch
  - Decode
  - Evaluate Address
  - Fetch Operands
  - Execute
  - Store Result
- Not all instructions require all six stages (see P&P Ch. 4)

Review: Datapath vs. Control Logic
- Instructions transform Data (AS) to Data' (AS')
- This transformation is done by functional units
  - Units that "operate" on data
- These units need to be told what to do to the data
- An instruction processing engine consists of two components
  - Datapath: consists of hardware elements that deal with and transform data signals
    - functional units that operate on data
    - hardware structures (e.g., wires and muxes) that enable the flow of data into the functional units and registers
    - storage units that store data (e.g., registers)
  - Control logic: consists of hardware elements that determine control signals, i.e., signals that specify what the datapath elements should do to the data

Today's Agenda
- Finish single-cycle microarchitectures
- Critical path
- Microarchitecture design principles
- Performance evaluation primer
- Multi-cycle microarchitectures
- Microprogrammed control

A Note: How to Make the Best Out of 447?
- Do the readings
  - P&P Appendixes A and C
  - Wilkes 1951 paper
  - Today's lecture will be easy to understand if you read these
  - And, you can ask more in-depth questions and learn more
- Do the assignments early
  - You can do things for extra credit if you finish early
  - We will describe what to do for extra credit
- Study the material and buzzwords daily
  - Lecture notes, videos
  - Buzzwords: take notes during class

Review: The Full Single-Cycle Datapath
[Figure: the full single-cycle MIPS datapath with its control unit. Control signals shown: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken. JAL, JR, JALR omitted. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Arithmetic and Logical Instructions

Review: Datapath for R and I-Type ALU Insts.
- Semantics: if MEM[PC] == ADDI rt rs immediate
  - GPR[rt] ← GPR[rs] + sign-extend(immediate)
  - PC ← PC + 4
- Stages: IF, ID, EX, MEM, WB; combinational state update logic
[Figure: datapath for R-type and I-type ALU instructions. The RegDest and ALUSrc multiplexers are labeled isItype. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Data Movement Instructions

Review: Datapath for Non-Control-Flow Insts.
[Figure: datapath for all non-control-flow instructions. Signal labels on the figure: RegDest = isItype, RegWrite = !isStore, ALUSrc = isItype, MemWrite = isStore, MemRead = isLoad, MemtoReg = isLoad. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Control Flow Instructions

Review: Unconditional Jump Instructions
- Assembly
  - J immediate26
- Machine encoding (J-type)
  - J (6-bit opcode) | immediate (26-bit)
- Semantics
  - if MEM[PC] == J immediate26
    - target = { PC[31:28], immediate26, 2'b00 }
    - PC ← target
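The target computation above can be sketched in a few lines of Python. This is only an illustration of the bit concatenation { PC[31:28], immediate26, 2'b00 }; the function name `jump_target` is made up for this example.

```python
def jump_target(pc: int, immediate26: int) -> int:
    """Return { PC[31:28], immediate26, 2'b00 } as a 32-bit value."""
    upper_bits = pc & 0xF0000000                   # keep PC[31:28]
    word_index = immediate26 & 0x03FFFFFF          # 26-bit immediate field
    return upper_bits | (word_index << 2)          # append two zero bits


# A jump in the 0x1000_0000 region to word index 3 lands at byte address 0x1000000C.
print(hex(jump_target(0x10000008, 3)))
```

Shifting the immediate left by two is what turns a word index into a byte address, which is why the two low-order target bits are always zero.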

Review: Unconditional Jump Datapath
- if MEM[PC] == J immediate26
  - PC = { PC[31:28], immediate26, 2'b00 }
- What about JR, JAL, JALR?
[Figure: datapath with a PCSrc multiplexer, selected by isJ, that chooses the concatenated jump target; register write, ALU, and memory controls are marked as don't cares. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Conditional Branch Instructions
- Assembly (e.g., branch if equal)
  - BEQ rs_reg rt_reg immediate16
- Machine encoding (I-type)
  - BEQ (6-bit opcode) | rs (5-bit) | rt (5-bit) | immediate (16-bit)
- Semantics (assuming no branch delay slot)
  - if MEM[PC] == BEQ rs rt immediate16
    - target = PC + 4 + sign-extend(immediate) × 4
    - if GPR[rs] == GPR[rt] then PC ← target
    - else PC ← PC + 4
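As a rough sketch of the BEQ semantics above (no branch delay slot), the next-PC computation might look like the following. The helper names `sign_extend16` and `next_pc_beq` are hypothetical, and the 32-bit wraparound mask is an assumption about how the PC is stored.

```python
MASK32 = 0xFFFFFFFF  # assumption: the PC is a 32-bit register


def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(pc: int, rs_value: int, rt_value: int, imm16: int) -> int:
    """target = PC + 4 + sign-extend(immediate) * 4, taken only if GPR[rs] == GPR[rt]."""
    target = (pc + 4 + sign_extend16(imm16) * 4) & MASK32
    fall_through = (pc + 4) & MASK32
    return target if rs_value == rt_value else fall_through


# An immediate of 0xFFFF is -1, so a taken branch at 0x100 goes back to 0x100.
print(hex(next_pc_beq(0x100, 5, 5, 0xFFFF)))
```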

Conditional Branch Datapath (For You to Fix)
- Watch out
- How to uphold the delayed branch semantics?
[Figure: datapath with a PCSrc multiplexer choosing among PC + 4, the concatenated jump target, and the branch target (PC + 4 from the instruction datapath, plus the sign-extended 16-bit immediate shifted left by 2); the ALU produces bcond, which goes to the branch control logic. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Putting It All Together
[Figure: the full single-cycle MIPS datapath with its control unit. Control signals shown: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken. JAL, JR, JALR omitted. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Control Logic

Single-Cycle Hardwired Control
- As combinational function of Inst = MEM[PC]

  R-type: opcode (6-bit) [31:26] | rs (5-bit) [25:21] | rt (5-bit) [20:16] | rd (5-bit) [15:11] | shamt (5-bit) [10:6] | funct (6-bit) [5:0]
  I-type: opcode (6-bit) [31:26] | rs (5-bit) [25:21] | rt (5-bit) [20:16] | immediate (16-bit) [15:0]
  J-type: opcode (6-bit) [31:26] | immediate (26-bit) [25:0]

- Consider
  - All R-type and I-type ALU instructions
  - LW and SW
  - BEQ, BNE, BLEZ, BGTZ
  - J, JR, JAL, JALR

Single-Bit Control Signals
- RegDest
  - When de-asserted: GPR write select according to rt, i.e., inst[20:16]
  - When asserted: GPR write select according to rd, i.e., inst[15:11]
  - Equation: opcode == 0
- ALUSrc
  - When de-asserted: 2nd ALU input from 2nd GPR read port
  - When asserted: 2nd ALU input from sign-extended 16-bit immediate
  - Equation: (opcode != 0) && (opcode != BEQ) && (opcode != BNE)
- MemtoReg
  - When de-asserted: steer ALU result to GPR write port
  - When asserted: steer memory load to GPR write port
  - Equation: opcode == LW
- JAL and JALR require additional RegDest and MemtoReg options

Single-Bit Control Signals
- MemRead
  - When de-asserted: memory read disabled
  - When asserted: memory read port returns load value
  - Equation: opcode == LW
- MemWrite
  - When de-asserted: memory write disabled
  - When asserted: memory write enabled
  - Equation: opcode == SW
- PCSrc1
  - When de-asserted: according to PCSrc2
  - When asserted: next PC is based on 26-bit immediate jump target
  - Equation: (opcode == J) || (opcode == JAL)
- PCSrc2
  - When de-asserted: next PC = PC + 4
  - When asserted: next PC is based on 16-bit immediate branch target
  - Equation: (opcode == Bxx) && "bcond is satisfied"
- JR and JALR require additional PCSrc options
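The single-bit control equations on these two slides are pure combinational functions of the opcode (plus bcond), so they can be written down directly. The sketch below is illustrative only: the function name `control_signals` is made up, the opcode constants are the standard MIPS encodings, and JR, JALR, and the extra JAL options noted on the slides are deliberately left out.

```python
# Standard MIPS opcode field values (inst[31:26]).
OP_RTYPE, OP_J, OP_JAL = 0x00, 0x02, 0x03
OP_BEQ, OP_BNE, OP_BLEZ, OP_BGTZ = 0x04, 0x05, 0x06, 0x07
OP_LW, OP_SW = 0x23, 0x2B
BRANCH_OPCODES = {OP_BEQ, OP_BNE, OP_BLEZ, OP_BGTZ}  # the "Bxx" group


def control_signals(opcode: int, bcond: bool = False) -> dict:
    """Evaluate the slide equations for one instruction (hardwired control)."""
    return {
        "RegDest": opcode == OP_RTYPE,
        "ALUSrc": opcode != OP_RTYPE and opcode not in (OP_BEQ, OP_BNE),
        "MemtoReg": opcode == OP_LW,
        "MemRead": opcode == OP_LW,
        "MemWrite": opcode == OP_SW,
        "PCSrc1": opcode in (OP_J, OP_JAL),
        "PCSrc2": opcode in BRANCH_OPCODES and bcond,
    }


print(control_signals(OP_LW))
```

Because nothing here depends on stored state, the same table could equally be held in a small memory indexed by opcode, which is the idea behind a control store.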

ALU Control
- case opcode
  - 0: select operation according to funct
  - ALUi: select operation according to opcode
  - LW: select addition
  - SW: select addition
  - Bxx: select bcond generation function
  - otherwise: don't care
- Example ALU operations
  - ADD, SUB, AND, OR, XOR, NOR, etc.
  - bcond on equal, not equal, LE zero, GT zero, etc.

R-Type ALU
[Figure: full single-cycle datapath with the control settings for an R-type ALU instruction; ALU control selects the operation from funct.]

I-Type ALU
[Figure: full single-cycle datapath with the control settings for an I-type ALU instruction; ALU control selects the operation from opcode.]

LW
[Figure: full single-cycle datapath with the control settings for LW; ALU control selects Add.]

SW
[Figure: full single-cycle datapath with the control settings for SW; ALU control selects Add; signals marked * are don't cares.]

Branch Not Taken
[Figure: full single-cycle datapath with the control settings for a branch that is not taken; ALU control selects bcond; signals marked * are don't cares.]

Branch Taken
[Figure: full single-cycle datapath with the control settings for a taken branch; ALU control selects bcond; signals marked * are don't cares.]

Jump
[Figure: full single-cycle datapath with the control settings for a jump; signals marked * are don't cares.]

[The seven figures above are based on an original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

What is in That Control Box?
- Combinational Logic → Hardwired Control
  - Idea: control signals generated combinationally based on instruction
- Sequential Logic → Sequential/Microprogrammed Control
  - Idea: a memory structure contains the control signals associated with an instruction
  - Control Store

Evaluating the Single-Cycle Microarchitecture

A Single-Cycle Microarchitecture
- Is this a good idea/design?
- When is this a good design?
- When is this a bad design?
- How can we design a better microarchitecture?

A Single-Cycle Microarchitecture: Analysis
- Every instruction takes 1 cycle to execute
  - CPI (cycles per instruction) is strictly 1
- How long each instruction takes is determined by how long the slowest instruction takes to execute
  - Even though many instructions do not need that long to execute
- Clock cycle time of the microarchitecture is determined by how long it takes to complete the slowest instruction
  - Critical path of the design is determined by the processing time of the slowest instruction

What is the Slowest Instruction to Process?
- Let's go back to the basics
- All six phases of the instruction processing cycle take a single machine clock cycle to complete
  - Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result
  - In the MIPS datapath these map to five steps:
    1. Instruction fetch (IF)
    2. Instruction decode and register operand fetch (ID/RF)
    3. Execute/Evaluate memory address (EX/AG)
    4. Memory operand fetch (MEM)
    5. Store/writeback result (WB)
- Does each of the above phases take the same time (latency) for all instructions?

Single-Cycle Datapath Analysis
- Assume
  - memory units (read or write): 200 ps
  - ALU and adders: 100 ps
  - register file (read or write): 50 ps
  - other combinational logic: 0 ps

steps      IF    ID    EX    MEM   WB    Delay
resources  mem   RF    ALU   mem   RF    (ps)
R-type     200   50    100         50    400
I-type     200   50    100         50    400
LW         200   50    100   200   50    600
SW         200   50    100   200         550
Branch     200   50    100               350
Jump       200                           200
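The per-instruction delays in this analysis are sums of the delays of the resources each instruction class uses, and the single-cycle clock must cover the largest sum. A small sketch, assuming unit delays of 200 ps for memory, 100 ps for the ALU, and 50 ps for the register file (the dictionary and function names are made up for illustration):

```python
# Assumed unit delays in picoseconds; other combinational logic is taken as 0 ps.
UNIT_DELAY_PS = {"mem": 200, "alu": 100, "rf": 50}

# Resources exercised, in order, by each instruction class.
RESOURCES_USED = {
    "R-type": ["mem", "rf", "alu", "rf"],
    "I-type": ["mem", "rf", "alu", "rf"],
    "LW":     ["mem", "rf", "alu", "mem", "rf"],
    "SW":     ["mem", "rf", "alu", "mem"],
    "Branch": ["mem", "rf", "alu"],
    "Jump":   ["mem"],
}


def instruction_delay_ps(kind: str) -> int:
    """Total combinational delay needed to process one instruction class."""
    return sum(UNIT_DELAY_PS[unit] for unit in RESOURCES_USED[kind])


def single_cycle_clock_ps() -> int:
    """The clock period is set by the slowest instruction (the critical path)."""
    return max(instruction_delay_ps(kind) for kind in RESOURCES_USED)


for kind in RESOURCES_USED:
    print(kind, instruction_delay_ps(kind), "ps")
print("clock period:", single_cycle_clock_ps(), "ps")
```

Under these assumptions the load sets the clock period, so a jump that needs only an instruction fetch still occupies a full load-length cycle.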

Let's Find the Critical Path
[Figure: the full single-cycle datapath, used to trace the critical path of each instruction class.]

R-Type and I-Type ALU
[Figure: critical path of R-type and I-type ALU instructions, with cumulative delay annotations ending at 400 ps.]

LW
[Figure: critical path of LW, with cumulative delay annotations ending at 600 ps.]

SW
[Figure: critical path of SW, with cumulative delay annotations ending at 550 ps.]

Branch Taken
[Figure: critical path of a taken branch, with cumulative delay annotations ending at 350 ps.]

Jump
[Figure: critical path of a jump, with cumulative delay annotations ending at 200 ps.]

[The six figures above are based on an original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

What About Control Logic?
- How does that affect the critical path?
- Food for thought for you:
  - Can control logic be on the critical path?
  - A note on CDC 5600: control store access too long

What is the Slowest Instruction to Process?
- Memory is not magic
- What if memory sometimes takes 100ms to access?
- Does it make sense to have a simple register-to-register add or jump take {100ms + all else to do a memory operation}?
- And, what if you need to access memory more than once to process an instruction?
  - Which instructions need this?
  - VAX INDEX instruction
  - Do you provide multiple ports to memory?

Single Cycle uArch: Complexity
- Contrived
  - All instructions run as slow as the slowest instruction
- Inefficient
  - All instructions run as slow as the slowest instruction
  - Must provide worst-case combinational resources in parallel as required by any instruction
  - Need to replicate a resource if it is needed more than once by an instruction during different parts of the instruction processing cycle
- Not necessarily the simplest way to implement an ISA
  - Single-cycle implementation of REP MOVS, INDEX, POLY?
- Not easy to optimize/improve performance
  - Optimizing the common case does not work (e.g., common instructions)
  - Need to optimize the worst case all the time

Microarchitecture Design Principles
- Critical path design
  - Find the maximum combinational logic delay and decrease it
- Bread and butter (common case) design
  - Spend time and resources on where it matters, i.e., improve what the machine is really designed to do
  - Common case vs. uncommon case
- Balanced design
  - Balance instruction/data flow through hardware components
  - Balance the hardware needed to accomplish the work
- How does a single-cycle microarchitecture fare in light of these principles?

Multi-cycle Microarchitectures

Multi-cycle Microarchitectures
- Goal: let each instruction take (close to) only as much time as it really needs
- Idea
  - Determine clock cycle time independently of instruction processing time
  - Each instruction takes as many clock cycles as it needs to take
    - Multiple state transitions per instruction
    - The states followed by each instruction are different

Remember: The "Process instruction" Step
- ISA specifies abstractly what AS' should be, given an instruction and AS
  - It defines an abstract finite state machine where
    - State = programmer-visible state
    - Next-state logic = instruction execution specification
  - From the ISA point of view, there are no "intermediate states" between AS and AS' during instruction execution
    - One state transition per instruction
- Microarchitecture implements how AS is transformed to AS'
  - There are many choices in implementation
  - We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction
    - Choice 1: AS → AS' (transform AS to AS' in a single clock cycle)
    - Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS' (take multiple clock cycles to transform AS to AS')

Multi-cycle Microarchitecture
- AS = Architectural (programmer visible) state at the beginning of an instruction
- Step 1: Process part of instruction in one clock cycle
- Step 2: Process part of instruction in the next clock cycle
- ...
- AS' = Architectural (programmer visible) state at the end of a clock cycle

Benefits of Multi-cycle Design
- Critical path design
  - Can keep reducing the critical path independently of the worst-case processing time of any instruction
- Bread and butter (common case) design
  - Can optimize the number of states it takes to execute "important" instructions that make up much of the execution time
- Balanced design
  - No need to provide more capability or resources than really needed
    - An instruction that needs resource X multiple times does not require multiple X's to be implemented
    - Leads to more efficient hardware: can reuse hardware components needed multiple times for an instruction

Performance Analysis
- Execution time of an instruction
  - {CPI} × {clock cycle time}
- Execution time of a program
  - Sum over all instructions [{CPI} × {clock cycle time}]
  - {# of instructions} × {Average CPI} × {clock cycle time}
- Single cycle microarchitecture performance
  - CPI = 1
  - Clock cycle time = long
- Multi-cycle microarchitecture performance
  - CPI = different for each instruction
    - Average CPI → hopefully small
  - Clock cycle time = short
  - Now, we have two degrees of freedom to optimize independently
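To make the execution-time formula concrete, here is a small worked sketch. Every number in it is hypothetical and chosen only for easy arithmetic: the instruction mix, the per-class cycle counts, and the 200 ps and 600 ps clock periods are assumptions, not measurements of any real design.

```python
def average_cpi(mix: dict, cycles: dict) -> float:
    """Weighted average of per-class cycle counts; mix fractions must sum to 1."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


def execution_time_ps(instruction_count: int, avg_cpi: float, clock_cycle_ps: float) -> float:
    """{# of instructions} x {average CPI} x {clock cycle time}."""
    return instruction_count * avg_cpi * clock_cycle_ps


# Hypothetical instruction mix and hypothetical multi-cycle state counts.
MIX = {"alu": 0.5, "load": 0.25, "store": 0.125, "branch": 0.125}
CYCLES = {"alu": 4, "load": 5, "store": 4, "branch": 3}

multi_cpi = average_cpi(MIX, CYCLES)                      # 4.125
multi_time = execution_time_ps(1000, multi_cpi, 200)      # short clock, CPI > 1
single_time = execution_time_ps(1000, 1, 600)             # long clock, CPI = 1
print(multi_cpi, multi_time, single_time)
```

With these made-up numbers the multi-cycle design does not come out ahead, which illustrates that the benefit is not automatic: it depends on how short the clock can be made and on how few states the common instructions need.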

An Aside: CPI vs. Frequency
- CPI vs. clock cycle time
- At odds with each other
  - Reducing one increases the other for a single instruction
  - Why?
- Average CPI can be amortized/reduced via concurrent processing of multiple instructions
  - The same cycle is devoted to multiple instructions
  - Example: pipelining, superscalar execution

A Multi-cycle Microarchitecture: A Closer Look

How Do We Implement This?
- Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.
- The concept of microcoded/microprogrammed machines
- Realization
  - One can implement the "process instruction" step as a finite state machine that sequences between states and eventually returns back to the "fetch instruction" state
  - A state is defined by the control signals asserted in it
  - Control signals for the next state are determined in the current state

The Instruction Processing Cycle
- Fetch
- Decode
- Evaluate Address
- Fetch Operands
- Execute
- Store Result

A Basic Multi-cycle Microarchitecture
- Instruction processing cycle divided into "states"
  - A stage in the instruction processing cycle can take multiple states
- A multi-cycle microarchitecture sequences from state to state to process an instruction
  - The behavior of the machine in a state is completely determined by control signals in that state
- The behavior of the entire processor is specified fully by a finite state machine
- In a state (clock cycle), control signals control
  - How the datapath should process the data
  - How to generate the control signals for the next clock cycle

Microprogrammed Control Terminology
- Control signals associated with the current state
  - Microinstruction
- Act of transitioning from one state to another
  - Determining the next state and the microinstruction for the next state
  - Microsequencing
- Control store stores control signals for every possible state
  - Store for microinstructions for the entire FSM
- Microsequencer determines which set of control signals will be used in the next clock cycle (i.e., next state)
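The terminology above can be illustrated with a toy model: a control store that maps each state to its microinstruction (the control signals asserted in that state), and a microsequencing rule that picks the next state. The state names, signal names, and instruction classes below are all hypothetical and are not the LC-3b's; this is a minimal sketch of the idea only.

```python
# Hypothetical control store: state -> (control signals asserted, next-state rule).
CONTROL_STORE = {
    "FETCH":     (("MemRead", "LoadIR"), lambda op: "DECODE"),
    "DECODE":    (("ReadRegs",),         lambda op: "ADDR" if op in ("LOAD", "STORE") else "EXECUTE"),
    "EXECUTE":   (("ALUOp",),            lambda op: "WRITEBACK"),
    "ADDR":      (("ALUAdd",),           lambda op: "MEM_READ" if op == "LOAD" else "MEM_WRITE"),
    "MEM_READ":  (("MemRead",),          lambda op: "WRITEBACK"),
    "MEM_WRITE": (("MemWrite",),         lambda op: "FETCH"),
    "WRITEBACK": (("RegWrite",),         lambda op: "FETCH"),
}


def run_instruction(op: str) -> list:
    """Sequence through states for one instruction; one state per clock cycle."""
    state = "FETCH"
    visited = []
    while True:
        signals, microsequence = CONTROL_STORE[state]   # read the microinstruction
        visited.append((state, signals))                # signals drive the datapath
        state = microsequence(op)                       # choose the next state
        if state == "FETCH":                            # back at fetch: instruction done
            return visited


for name in ("ADD", "LOAD", "STORE"):
    print(name, [state for state, _ in run_instruction(name)])
```

In this toy, different instruction classes take different numbers of cycles simply because they visit different numbers of states before returning to fetch.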

What Happens In A Clock Cycle?
- The control signals (microinstruction) for the current state control
  - Processing in the datapath
  - Generation of control signals (microinstruction) for the next cycle
  - See Supplemental Figure 1
- Datapath and microsequencer operate concurrently
- Question: why not generate control signals for the current cycle in the current cycle?
  - This will lengthen the clock cycle
  - Why would it lengthen the clock cycle?
  - See Supplemental Figure 2

A Clock Cycle

A Bad Clock Cycle!

A Simple LC-3b Control and Datapath

What Determines Next-State Control Signals?
- What is happening in the current clock cycle
  - See the 9 control signals coming from the "Control" block
  - What are these for?
- The instruction that is being executed
  - IR[15:11] coming from the datapath
- Whether the condition of a branch is met, if the instruction being processed is a branch
  - BEN bit coming from the datapath
- Whether the memory operation is completing in the current cycle, if one is in progress
  - R bit coming from memory

A Simple LC-3b Control and Datapath

The State Machine for Multi-cycle Processing
- The behavior of the LC-3b uarch is completely determined by
  - the 35 control signals and
  - additional 7 bits that go into the control logic from the datapath
- 35 control signals completely describe the state of the control structure
- We can completely describe the behavior of the LC-3b as a state machine, i.e., a directed graph of
  - Nodes (one corresponding to each state)
  - Arcs (showing flow from each state to the next state(s))

An LC-3b State Machine
- Patt and Patel, App C, Figure C.2
- Each state must be uniquely specified
  - Done by means of state variables
- 31 distinct states in this LC-3b state machine
  - Encoded with 6 state variables
- Examples
  - States 18, 19 correspond to the beginning of the instruction processing cycle
  - Fetch phase: state 18, 19 → state 33 → state 35
  - Decode phase: state 32

[Figure: LC-3b state machine diagram (Patt and Patel, App. C, Figure C.2). Fetch: states 18, 19 (MAR ← PC, PC ← PC + 2), state 33 (MDR ← M, repeated until R), state 35 (IR ← MDR). Decode: state 32 (BEN ← IR[11] & N + IR[10] & Z + IR[9] & P, then a branch on IR[15:12]) leading to the per-instruction states for ADD, AND, XOR, TRAP, SHF, LEA, LDB, LDW, STW, STB, JSR, JMP, BR, and RTI, each of which eventually returns to state 18 (or 19).]
NOTES
- B+off6: Base + SEXT[offset6]
- PC+off9: PC + SEXT[offset9]
- *OP2 may be SR2 or SEXT[imm5]
- ** [15:8] or [7:0] depending on MAR[0]
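The branch enable bit computed in the decode state, BEN ← IR[11] & N + IR[10] & Z + IR[9] & P, is a simple AND-OR of instruction bits and condition codes. A minimal sketch follows; the function name `branch_enable` is made up, and the instruction words in the example are just BR encodings with the n, z, p bits in IR[11:9].

```python
def branch_enable(ir: int, n: int, z: int, p: int) -> int:
    """BEN = (IR[11] AND N) OR (IR[10] AND Z) OR (IR[9] AND P)."""
    ir11 = (ir >> 11) & 1
    ir10 = (ir >> 10) & 1
    ir9 = (ir >> 9) & 1
    return (ir11 & n) | (ir10 & z) | (ir9 & p)


# 0x0401 has only IR[10] set among bits 11:9 (a BRz-style instruction word):
print(branch_enable(0x0401, n=0, z=1, p=0))  # condition met
print(branch_enable(0x0401, n=1, z=0, p=0))  # condition not met
```

BEN is one of the 7 datapath bits fed back into the control logic, alongside IR[15:11] and R.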

LC-3b State Machine: Some Questions
- How many cycles does the fastest instruction take?
- How many cycles does the slowest instruction take?
- Why does the BR take as long as it takes in the FSM?
- What determines the clock cycle?
- Is this a Mealy machine or a Moore machine?

LC-3b Datapath
- Patt and Patel, App C, Figure C.3
- Single-bus datapath design
  - At any point only one value can be "gated" on the bus (i.e., can be driving the bus)
  - Advantage: low hardware cost: one bus
  - Disadvantage: reduced concurrency; if an instruction needs the bus twice for two different things, these need to happen in different states
- Control signals (26 of them) determine what happens in the datapath in one clock cycle
  - Patt and Patel, App C, Table C.1

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

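[Figure C.6 (Patt and Patel, App. C, Section C.4, The Control Structure): Additional logic required to provide control signals. (a) DRMUX selects the DR register number, with IR[11:9] as an input; (b) SR1MUX selects the SR1 register number from IR[11:9] or IR[8:6]; (c) logic combining IR[11:9] with N, Z, P to produce BEN.]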

LC-3b Datapath: Some Questions
- How does instruction fetch happen in this datapath according to the state machine?
- What is the difference between gating and loading?
- Is this the smallest hardware you can design?

LC-3b Microprogrammed Control Structure
- Patt and Patel, App C, Figure C.4
- Three components:
  - Microinstruction, control store, microsequencer
- Microinstruction: control signals that control the datapath (26 of them) and determine the next state (9 of them)
- Each microinstruction is stored in a unique location in the control store (a special memory structure)
- Unique location: address of the state corresponding to the microinstruction
  - Remember each state corresponds to one microinstruction
- Microsequencer determines the address of the next microinstruction (i.e., next state)

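[Figure: LC-3b control structure (Patt and Patel, App. C, Figure C.4). R, IR[15:11], and BEN feed the microsequencer, which produces a 6-bit address into the control store (2^6 × 35 bits). The 35-bit microinstruction read out splits into 26 bits that control the datapath and 9 bits (J, COND, IRD) that go back to the microsequencer.]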

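[Figure C.5 (Patt and Patel, Appendix C, The Microarchitecture of the LC-3b, Basic Machine): The microsequencer of the LC-3b base machine. Inputs COND1, COND0, BEN, R, and IR[11] form the Branch, Ready, and Addr. Mode terms that modify J[5:0]; IRD selects between the resulting 6-bit value and 0,0,IR[15:12] to produce the 6-bit address of the next state.]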

[Figure: contents of the LC-3b control store, one microinstruction per row for states 0 through 63. Column fields: J, IRD, COND, LD.MDR, LD.IR, LD.BEN, LD.REG, LD.CC, LD.MAR, GatePC, GateMDR, GateALU, LD.PC, GateMARMUX, GateSHF, PCMUX, DRMUX, SR1MUX, ADDR1MUX, ADDR2MUX, MARMUX, ALUK, MIO.EN, R.W, DATA.SIZE, LSHF1.]

LC-3b Microsequencer
- Patt and Patel, App C, Figure C.5
- The purpose of the microsequencer is to determine the address of the next microinstruction (i.e., next state)
- Next address depends on 9 control signals
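A rough sketch of the next-address computation follows. It is an interpretation to be checked against Figure C.5, not a transcription of it: the COND encodings (1 = memory ready, 2 = branch, 3 = addressing mode) and the choice of which J bit each condition sets are assumptions, chosen because they reproduce the state numbers used in the LC-3b state machine (33 to 35 on R, 18 to 22 on BEN, 20 to 21 on IR[11]). The function name is made up.

```python
def next_state_address(j: int, cond: int, ird: int, ir: int, ben: int, r: int) -> int:
    """Compute the 6-bit control store address of the next state."""
    if ird:
        # Decode: the next state number is 0,0,IR[15:12] (the opcode).
        return (ir >> 12) & 0xF
    address = j & 0x3F
    if cond == 1 and r:                      # assumed: memory ready sets J[1]
        address |= 0b000010
    elif cond == 2 and ben:                  # assumed: branch enable sets J[2]
        address |= 0b000100
    elif cond == 3 and (ir >> 11) & 1:       # assumed: addressing mode sets J[0]
        address |= 0b000001
    return address


# Waiting in state 33 until memory is ready, then moving on to state 35:
print(next_state_address(j=33, cond=1, ird=0, ir=0, ben=0, r=0))
print(next_state_address(j=33, cond=1, ird=0, ir=0, ben=0, r=1))
```

The 9 control signals the slide refers to are the 6 bits of J, the 2 bits of COND, and IRD.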

The Microsequencer: Some Questions
- When is the IRD signal asserted?
- What happens if an illegal instruction is decoded?
- What are condition (COND) bits for?
- How is variable latency memory handled?
- How do you do the state encoding?
  - Minimize number of state variables
  - Start with the 16-way branch
  - Then determine constraint tables and states dependent on COND

Variable-Latency Memory
- The ready signal (R) enables memory read/write to execute correctly
  - Example: after state 18, the machine stays in state 33 until the R bit is asserted by memory (when memory data is available), and only then moves on to state 35
- Could we have done this in a single-cycle microarchitecture?
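The effect of the ready signal on instruction fetch can be sketched as a tiny simulation of the fetch states 18, 33, and 35. The latency model is an assumption made for illustration: memory is taken to assert R in the N-th cycle spent in state 33, where N is the `memory_latency_cycles` parameter (a hypothetical name).

```python
def fetch_state_trace(memory_latency_cycles: int) -> list:
    """Return the states visited, one per clock cycle, until decode (state 32)."""
    state = 18
    cycles_in_33 = 0
    trace = []
    while state != 32:
        trace.append(state)
        if state == 18:              # MAR <- PC, PC <- PC + 2
            state = 33
        elif state == 33:            # MDR <- M[MAR]; repeat until R is asserted
            cycles_in_33 += 1
            ready = cycles_in_33 >= memory_latency_cycles
            state = 35 if ready else 33
        elif state == 35:            # IR <- MDR
            state = 32
    return trace


print(fetch_state_trace(1))   # fast memory
print(fetch_state_trace(3))   # slower memory: state 33 repeats
```

The same state machine works for any memory latency because the fetch simply takes more cycles; the clock period itself does not have to stretch to cover the slowest access.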

The Microsequencer: Advanced Questions
- What happens if the machine is interrupted?
- What if an instruction generates an exception?
- How can you implement a complex instruction using this control structure?
  - Think REP MOVS
