CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - Especially problematic for more complex instructions like floating point multiply
  [Timing diagram: Clk with Cycle 1 and Cycle 2; lw fills its cycle, sw leaves part of its cycle unused (waste)]
- May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
- But it is simple and easy to understand

Multicycle Datapath Approach (1)
- Let an instruction take more than 1 clock cycle to complete
  - Break up instructions into steps, where each step takes a cycle, while trying to
    - balance the amount of work to be done in each step
    - restrict each cycle to use only one major functional unit
  - Not every instruction takes the same number of clock cycles
- In addition to faster clock rates, multicycle allows functional units to be used more than once per instruction, as long as they are used on different clock cycles. As a result
  - only one memory is needed, but only one memory access per cycle
  - only one ALU/adder is needed, but only one ALU operation per cycle

Multicycle Datapath Approach (2)
- At the end of a cycle
  - Store values needed in a later cycle by the current instruction in an internal register (not visible to the programmer). All of these registers (except IR) hold data only between a pair of adjacent clock cycles (no write control signal needed)
  [Figure: datapath with PC, memory, IR, MDR, register file, A, B, ALU, and ALUOut]
    IR: Instruction Register
    MDR: Memory Data Register
    A, B: regfile read data registers
    ALUOut: ALU output register
  - Data used by subsequent instructions are stored in programmer visible registers (i.e., register file, PC, or memory)

Additional Registers Needed
- Data used by the same instruction in a later cycle must be stored in additional registers
- Position is determined by two factors
  - which units fit into the same clock cycle
  - what data is needed for later cycles
- Instruction register (IR) and memory data register (MDR) added to save the output of the memory for instruction read and data read
- A and B registers added to hold register operand values read from the register file
- ALUOut register holds the output of the ALU
- All registers (except IR) hold data just between a pair of adjacent cycles, thus do not need a write control signal

Multicycle Datapath
- Functional units shared for different purposes
  - Multiplexer needed for memory access: address from PC or ALUOut
  - Three ALUs replaced by one ALU
    - additional multiplexer for first ALU input (A or PC)
    - added 4-way multiplexer for second ALU input
- More registers and multiplexers, but
  - Fewer memory units (1 instead of 2)
  - Fewer adders (2)
  - Reduced hardware cost

More Control Lines Needed
- Multiple clock cycles per instruction
- State units (PC, memory, registers) need write control lines
- Memory needs a read signal
- Additional multiplexers need a control line each, 4-way needs 2
- PC has three possible sources
  - PCWrite: unconditional write of PC
  - PCWriteCond: causes a write of PC if the branch condition is true

Actions of the 1-bit control signals

RegDst
  Deasserted: The register destination number for the Write register comes from the rt field (bits 20:16).
  Asserted:   The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite
  Deasserted: None.
  Asserted:   The register on the Write register input is written with the value on the Write data input.
ALUSrcA
  Deasserted: The first ALU operand comes from the PC.
  Asserted:   The first ALU operand comes from the A register.
MemRead
  Deasserted: None.
  Asserted:   Content of memory at the location specified by the Address input is put on the Memory data output.
MemWrite
  Deasserted: None.
  Asserted:   Memory contents at the location specified by the Address input are replaced by the value on the Write data input.
MemtoReg
  Deasserted: The value fed to the register Write data input comes from ALUOut.
  Asserted:   The value fed to the register Write data input comes from the MDR.
IorD
  Deasserted: The PC is used to supply the address to the memory unit.
  Asserted:   ALUOut is used to supply the address to the memory unit.
IRWrite
  Deasserted: None.
  Asserted:   The output of the memory is written into the IR.
PCWrite
  Deasserted: None.
  Asserted:   The PC is written; the source is controlled by PCSource.
PCWriteCond
  Deasserted: None.
  Asserted:   The PC is written if the Zero output from the ALU is also active.

Actions of the 2-bit control signals

Signal name  Value (binary)  Effect
ALUOp        00              The ALU performs an add operation.
             01              The ALU performs a subtract operation.
             10              The funct field of the instruction determines the ALU operation.
ALUSrcB      00              The second input to the ALU comes from the B register.
             01              The second input to the ALU is the constant 4.
             10              The second input to the ALU is the sign-extended, lower 16 bits of the IR.
             11              The second input to the ALU is the sign-extended, lower 16 bits of the IR shifted left 2 bits.
PCSource     00              Output of the ALU (PC + 4) is sent to the PC for writing.
             01              The contents of ALUOut (branch target address) are sent to the PC for writing.
             10              The jump target address (IR[25:0]) shifted left 2 bits and concatenated with PC + 4[31:28] is sent to the PC for writing.
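The 4-way multiplexer on the second ALU input can be sketched in a few lines of Python. This is only an illustration of the ALUSrcB table entries, assuming 32-bit unsigned values; the function names `sign_extend16` and `alu_src_b` are made up for this sketch and are not part of the lecture.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Sign-extend the lower 16 bits of value to an unsigned 32-bit word."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value

def alu_src_b(select: int, b: int, ir: int) -> int:
    """Model of the ALUSrcB mux (hypothetical helper).

    00 -> B register, 01 -> constant 4,
    10 -> sign-extended IR[15:0], 11 -> sign-extended IR[15:0] << 2.
    """
    imm = sign_extend16(ir)
    options = (b, 4, imm, (imm << 2) & MASK32)
    return options[select]

# Example: an instruction word whose lower 16 bits encode -4 (0xFFFC)
print(hex(alu_src_b(0b10, b=7, ir=0xFFFC)))  # sign-extended immediate
print(hex(alu_src_b(0b11, b=7, ir=0xFFFC)))  # immediate shifted left by 2
```

Selecting 01 yields the constant 4 used for PC + 4, and selecting 11 yields the word offset used for the branch target.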

The Multicycle Datapath with Control Signals
[Figure: multicycle datapath with the control unit and its control signals (PCWriteCond, PCWrite, PCSource, IorD, MemRead, MemWrite, MemtoReg, IRWrite, ALUOp, ALUSrcA, ALUSrcB, RegWrite, RegDst)]

Instructions from ISA Perspective
- Move from one-cycle to multi-cycle
- Identifying steps that take one cycle
  - Equal distribution of execution time
  - At most one operation for each of the modules
    - ALU
    - Register file
    - Memory
- New registers if
  - The signal is computed in one cycle and used in another cycle
  - The inputs of the block generating the signal may change in the second cycle

The Five Execution Steps
1. Instruction fetch
   - Move the instruction from the instruction memory to the instruction register IR
2. Instruction decode and register fetch
   - Provide the register contents for the ALU
3. Execution, memory address computation or branch completion
4. Memory access or R-type instruction completion
5. Write back step

Step 1: Instruction Fetch (1)
- Load instruction from memory: IR = Memory[PC]
  - Set address mux (IorD) = 0: select instruction
  - Set MemRead = 1
  - Set IRWrite = 1
- Increment PC: PC = PC + 4
  - Set ALUSrcA = 0: get operand from PC
  - Set ALUSrcB = 01: get operand 4
  - Set ALUOp = 00: add
  - Allow storing new PC in PC register

Step 1: Instruction Fetch (2)
[Figure: multicycle datapath with control signals for this step]

Step 2: Instruction Decode & Register Fetch (1)
- Switch registers to the output of the register block
  - A <= Register[IR[25:21]] (rs)
  - B <= Register[IR[20:16]] (rt)
  - No signal setting required
- (Always) calculate the branch target address
  - ALUOut <= PC + (sign-extend(IR[15:0]) << 2)
  - Value can just be ignored if the instruction is not a branch
  - Stored in the ALUOut register
  - Set ALUSrcB = 11
  - Set ALUOp = 00: add
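The register transfers of the decode step can be mimicked in Python. This is a minimal sketch, assuming a 32-entry register list and 32-bit unsigned arithmetic; the names `decode_step` and `sign_extend16` are hypothetical, and the PC passed in is the already incremented PC from step 1.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Sign-extend the lower 16 bits of value to an unsigned 32-bit word."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value

def decode_step(ir: int, pc: int, regs: list) -> tuple:
    """Return (A, B, ALUOut) as latched at the end of step 2."""
    a = regs[(ir >> 21) & 0x1F]          # rs field, IR[25:21]
    b = regs[(ir >> 16) & 0x1F]          # rt field, IR[20:16]
    # Speculative branch target: PC + (sign-extend(IR[15:0]) << 2)
    alu_out = (pc + (sign_extend16(ir) << 2)) & MASK32
    return a, b, alu_out

# Example: beq $1, $2, offset -3 (opcode 0x04), with PC already at 0x1004
regs = [0] * 32
regs[1], regs[2] = 5, 5
beq = (0x04 << 26) | (1 << 21) | (2 << 16) | 0xFFFD
a, b, target = decode_step(beq, 0x1004, regs)
print(a, b, hex(target))  # target is 0x1004 - 12 = 0xff8
```

The target is computed for every instruction because the opcode is not yet known when the ALU is free in this cycle; non-branch instructions simply overwrite ALUOut later.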

Step 2: Instruction Decode & Register Fetch (2)
[Figure: multicycle datapath with control signals for this step]

Step 3: Execution, Memory Address Computation or Branch Completion
- First cycle where the step depends on the instruction
- Selection performed by interpretation of the op + function field of the instruction
- Memory reference: calculate address
  - ALUOut <= A + sign-extend(IR[15:0])
  - Set ALUSrcA = 1: get operand from A
  - Set ALUSrcB = 10: get operand from sign extension unit
  - Set ALUOp = 00: add

Step 3: Memory Reference
[Figure: multicycle datapath with control signals for this step]

Step 3: Execution, Memory Address Computation or Branch Completion
- Arithmetic-logical instruction (R-type): ALUOut = A op B
  - Set ALUSrcA = 1: get operand from A
  - Set ALUSrcB = 00: get operand from B
  - Set ALUOp = 10: code from IR

Step 3: Arithmetic-Logical Instruction
[Figure: multicycle datapath with control signals for this step]

Step 3: Execution, Memory Address Computation or Branch Completion
- Branch: if (A == B) PC <= ALUOut
  - Set ALUSrcA = 1: get operand from A
  - Set ALUSrcB = 00: get operand from B
  - Set ALUOp = 01: subtraction
  - Write ALUOut to PC register

Step 3: Branch
[Figure: multicycle datapath with control signals for this step]

Step 3: Execution, Memory Address Computation or Branch Completion
- Jump: PC <= {PC[31:28], (IR[25:0] << 2)}
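The concatenation in the jump transfer can be written out with masks and shifts. The sketch below assumes 32-bit words and uses a hypothetical helper name `jump_target`; the PC value passed in is the incremented PC, as in the control signal table.

```python
def jump_target(pc: int, ir: int) -> int:
    """Return {PC[31:28], IR[25:0] << 2} as a 32-bit address."""
    upper = pc & 0xF0000000            # keep the top 4 bits of the PC
    lower = (ir & 0x03FFFFFF) << 2     # 26-bit field becomes a 28-bit byte address
    return upper | lower

# Example: j with target field 0x100 (opcode 0x02), PC in the 0x4... region
j_instr = (0x02 << 26) | 0x100
print(hex(jump_target(0x40000008, j_instr)))  # 0x40000400
```

Because only 28 bits come from the instruction, a jump stays within the 256 MB region selected by the upper four PC bits.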

Step 3: Jump
[Figure: multicycle datapath with control signals for this step]

Step 4: Memory Access or R-type Instruction Completion
- Memory reference: ALU controls must remain stable
  - Set IorD = 1: address from ALU
  - Load from memory: MDR <= Memory[ALUOut]
    - Set MemRead = 1
  - Store to memory: Memory[ALUOut] <= B
    - Set MemWrite = 1

Step 4: Memory Reference (load word)
[Figure: multicycle datapath with control signals for this step]

Step 4: Memory Reference (store word)
[Figure: multicycle datapath with control signals for this step]

Step 4: Memory Access or R-type Instruction Completion
- Arithmetic-logical instruction completion: Register[IR[15:11]] <= ALUOut
  - Set RegDst = 1: select write register
  - Set RegWrite = 1: allow write operation
  - Set MemtoReg = 0: select ALU data
  - ALUOp, ALUSrcA, ALUSrcB = constant

Step 4: Arithmetic-Logical Instruction Completion
[Figure: multicycle datapath with control signals for this step]

Step 5: Write Back
- Write data from memory to the register: Register[IR[20:16]] <= MDR
  - Set RegDst = 0: select rt as target register
  - Set RegWrite = 1: allow write operation
  - Set MemtoReg = 1: select memory data
  - ALUOp, ALUSrcA, ALUSrcB = constant

Step 5: Memory Reference (load word)
[Figure: multicycle datapath with control signals for this step]

Multicycle Control Unit
- Multicycle datapath control signals are not determined solely by the bits in the instruction
  - e.g., op code bits tell what operation the ALU should be doing, but not what instruction cycle is to be done next
- Must use a finite state machine (FSM) for control
  - a set of states (current state stored in State Register)
  - next state function (determined by current state and the input)
  - output function (determined by current state and the input)
[Figure: combinational control logic with inputs from the instruction opcode and the State Register, and outputs to the datapath control points and the next state]
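A next-state function of this kind can be sketched as a small Python state machine. The state names below are invented for this illustration, and only the five instruction classes discussed in the lecture are handled; the opcode values are the standard MIPS encodings for lw, sw, R-type, beq, and j. The output function (the control signal values per state) is left out.

```python
LW, SW, RTYPE, BEQ, J = 0x23, 0x2B, 0x00, 0x04, 0x02

def next_state(state: str, opcode: int) -> str:
    """Next-state function: depends on the current state and the opcode."""
    if state == "IFetch":
        return "Dec"
    if state == "Dec":
        if opcode in (LW, SW):
            return "MemAddr"
        return {RTYPE: "Exec", BEQ: "Branch", J: "Jump"}[opcode]
    if state == "MemAddr":
        return "MemRead" if opcode == LW else "MemWrite"
    if state == "MemRead":
        return "WB"
    if state == "Exec":
        return "RComplete"
    # MemWrite, WB, RComplete, Branch, Jump: instruction done, fetch the next one
    return "IFetch"

def trace(opcode: int) -> list:
    """List the states one instruction passes through, starting at IFetch."""
    states = ["IFetch"]
    while True:
        following = next_state(states[-1], opcode)
        if following == "IFetch":
            return states
        states.append(following)

print(trace(LW))   # ['IFetch', 'Dec', 'MemAddr', 'MemRead', 'WB']
print(trace(BEQ))  # ['IFetch', 'Dec', 'Branch']
```

The first two states are shared by all instructions; only from the third state on does the opcode select the path, which is why the opcode alone cannot tell the control unit which cycle comes next.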

Graphic Representation of FSM
[Figure: state diagram of the control FSM, with a common part and an instruction-specific part]

The Five Steps of the Load Instruction
[Timing diagram: lw occupies Cycle 1 to Cycle 5 with IFetch, Dec, Exec, Mem, WB]
1. IFetch: Instruction Fetch and Update PC
2. Dec: Instruction Decode, Register Read, Sign Extend Offset
3. Exec: Execute R-type; Calculate Memory Address; Branch Comparison; Branch and Jump Completion
4. Mem: Memory Read; Memory Write Completion; R-type Completion (RegFile write)
5. WB: Memory Read Completion (RegFile write)
- Instructions take 3 to 5 cycles
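Since instructions take between 3 and 5 cycles, the average number of cycles per instruction (CPI) depends on the instruction mix. The sketch below uses the cycle counts implied by the five steps (lw 5, sw 4, R-type 4, beq 3, j 3); the instruction mix is a made-up assumption for illustration and does not come from the lecture.

```python
# Cycles per instruction class, following the five steps above
CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

# Hypothetical instruction mix (fractions sum to 1); not from the lecture
MIX = {"lw": 0.25, "sw": 0.10, "R-type": 0.50, "beq": 0.13, "j": 0.02}

def average_cpi(cycles: dict, mix: dict) -> float:
    """Weighted average of cycle counts over the instruction mix."""
    return sum(cycles[name] * share for name, share in mix.items())

cpi = average_cpi(CYCLES, MIX)
print(f"average CPI = {cpi:.2f}")  # average CPI = 4.10
```

With this assumed mix the average is a little above 4 cycles, so the multicycle clock has to be clearly more than four times faster than the single-cycle clock to win overall.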

Multicycle Advantages & Disadvantages
- Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  [Timing diagram: Clk with Cycle 1 to Cycle 10; lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; R-type starts with IFetch...]
- Multicycle implementations allow functional units to be used more than once per instruction, as long as they are used on different clock cycles
- But
  - Requires additional internal state registers, more muxes, and more complicated (FSM) control

Single Cycle vs. Multiple Cycle Timing
- Single Cycle Implementation:
  [Timing diagram: Clk with Cycle 1 and Cycle 2; lw fills Cycle 1, sw leaves part of Cycle 2 unused (waste)]
- Multiple Cycle Implementation:
  [Timing diagram: Clk with Cycle 1 to Cycle 10; lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; R-type starts with IFetch...]
  - The multicycle clock is slower than 1/5th of the single cycle clock due to state register overhead

Summary
- If we understand the instructions, we can build a simple processor
- If instructions take different amounts of time, multi-cycle is better
- Datapath implemented using:
  - Combinational logic for arithmetic
  - State holding elements to remember bits
- Control implemented using:
  - Combinational logic for single-cycle implementation
  - Finite state machine for multi-cycle implementation

How Can We Make It Even Faster?
- Split the multiple instruction cycle into smaller and smaller steps
  - There is a point of diminishing returns where as much time is spent loading the state registers as doing the work
- Start fetching and executing the next instruction before the current one has completed
  - (All?) modern processors are pipelined for performance
  - Remember the performance equation: CPU time = CPI * CC * IC
- Fetch (and execute) more than one instruction at a time
  - Superscalar processing
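The performance equation can be evaluated directly. In the sketch below CPI is cycles per instruction, CC the clock cycle time, and IC the instruction count; all numbers are assumptions chosen for illustration, and the pipelined case is an idealized one without stalls.

```python
def cpu_time_ps(cpi: int, clock_cycle_ps: int, instruction_count: int) -> int:
    """CPU time = CPI * CC * IC, in picoseconds."""
    return cpi * clock_cycle_ps * instruction_count

IC = 1_000_000  # assumed instruction count

# Assumed designs: single cycle (CPI 1, long cycle) vs. ideal pipeline (CPI 1, short cycle)
single_cycle = cpu_time_ps(cpi=1, clock_cycle_ps=800, instruction_count=IC)
pipelined_ideal = cpu_time_ps(cpi=1, clock_cycle_ps=200, instruction_count=IC)

print(single_cycle, pipelined_ideal)   # 800000000 200000000
print(single_cycle / pipelined_ideal)  # 4.0
```

Pipelining attacks the CC term while keeping CPI near 1, whereas superscalar processing aims at a CPI below 1.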

Preconditions
- Instruction set design
  - Instructions (ideally) of equal length: enables fetch in the first stage and decode in the second
  - Few instruction formats, source register at the same place for each instruction: read register and determine type of instruction
  - Memory operands only in load & store
  - Aligned data: only one memory access per read operation
- Sources of problems
  - Instructions with variable length: multiple memory accesses
  - Unaligned data: multiple memory accesses for one data item

An Analogous Example (1)
- Laundry problem
  - Four processing stages (wash, dry, fold, put away)
  - Identical time (30 minutes)
  - Fixed sequence of usage
  - Total time for n loads: n * 2 hours

An Analogous Example (2)
- Laundry optimization
  - Units operate independently
  - Overlapping use of resources
  - Total time, 4 loads: 1 * 2 hours + (n-1) * 1/2 hour = 3.5 hours
  - Average time per load of laundry: 3.5 h / 4 = 52.5 min
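The laundry arithmetic generalizes to any number of loads. The following sketch assumes four stages of 30 minutes each, as in the example; the function names are made up for this illustration.

```python
STAGES = 4          # wash, dry, fold, put away
STAGE_MINUTES = 30  # every stage takes the same time

def sequential_minutes(loads: int) -> int:
    """Each load runs through all stages before the next one starts."""
    return loads * STAGES * STAGE_MINUTES

def pipelined_minutes(loads: int) -> int:
    """The first load takes the full 2 hours, every further load adds one stage time."""
    if loads == 0:
        return 0
    return STAGES * STAGE_MINUTES + (loads - 1) * STAGE_MINUTES

print(sequential_minutes(4))     # 480 minutes = 8 hours
print(pipelined_minutes(4))      # 210 minutes = 3.5 hours
print(pipelined_minutes(4) / 4)  # 52.5 minutes per load on average
```

For large numbers of loads the average time per load approaches one stage time (30 minutes), which is the factor-of-four speedup over 2 hours per load.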

An Analogous Example (3)
- All stages operate concurrently
- Many tasks are being done in parallel: pipelining improves the throughput of the laundry, while the time to complete a single load (instruction...) does not change (latency is not reduced)
  - Pipelining is only faster for many loads
  - Throughput is the far more important metric, because programs execute billions of instructions
- If stages take the same amount of time, and if all stages can be used, the speedup due to pipelining is equal to the number of stages in the pipeline
- But these are two ifs..., and there is a limit for the length of a pipeline beyond which no further speedup will be seen

Single Cycle Datapath
[Figure: single cycle datapath]

Real Pipeline
- MIPS pipeline steps
  1. IF: Fetch instruction from memory
  2. ID: Read registers while decoding
  3. EX: Execute the operation or calculate an address
  4. MEM: Access an operand in data memory
  5. WB: Write back results in register
- Unequal time for steps (in ps)
- Single cycle: cycle depends on slowest instruction

Instruction class                  Instruction fetch  Register read  ALU operation  Data access  Register write  Total time
Load Word (lw)                     200 ps             100 ps         200 ps         200 ps       100 ps          800 ps
Store Word (sw)                    200 ps             100 ps         200 ps         200 ps                       700 ps
R-format (add, sub, and, or, slt)  200 ps             100 ps         200 ps                      100 ps          600 ps
Branch (beq)                       200 ps             100 ps         200 ps                                      500 ps
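From these step times, the clock periods of both designs follow directly: the single-cycle clock must fit the slowest instruction, while a pipelined clock must fit the slowest stage. The sketch below assumes the usual values of 200 ps for instruction fetch, ALU operation, and data access and 100 ps for register read and write, and an ideal five-stage pipeline without hazards or stalls; the names are hypothetical.

```python
# Stage times in ps per instruction class (stages an instruction skips are omitted)
STAGE_PS = {
    "lw":       {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100},
    "sw":       {"IF": 200, "ID": 100, "EX": 200, "MEM": 200},
    "R-format": {"IF": 200, "ID": 100, "EX": 200, "WB": 100},
    "beq":      {"IF": 200, "ID": 100, "EX": 200},
}

# Single cycle: the clock must accommodate the slowest instruction
single_cycle_clock = max(sum(stages.values()) for stages in STAGE_PS.values())

# Pipelined: the clock must accommodate the slowest stage
pipeline_clock = max(t for stages in STAGE_PS.values() for t in stages.values())

def single_cycle_time(n: int) -> int:
    """Time for n instructions, one full clock cycle each."""
    return n * single_cycle_clock

def pipelined_time(n: int, depth: int = 5) -> int:
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    return (depth + n - 1) * pipeline_clock if n > 0 else 0

print(single_cycle_clock, pipeline_clock)          # 800 200
print(single_cycle_time(3), pipelined_time(3))     # 2400 1400
```

For three instructions the speedup is modest because filling the pipeline dominates; for long instruction sequences the ratio approaches 800 / 200 = 4 rather than 5, because the stages are not perfectly balanced.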

Single Cycle vs. Pipelined Execution
[Figure: timing of single cycle versus pipelined execution]
- regfile write in first half of the clock cycle
- regfile read in second half of the clock cycle