Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

Full Datapath
[Figure: the full datapath, with the Instruction Fetch, Immediate, and Branch Target paths labeled.]

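Today's Contents
- We have looked at how to design a datapath.
- Sections 4.4, 4.5
- Today, we will design a control unit for a single-cycle processor (i.e., how to set the 8 control signals).
- We will also learn about a new pipelined processor (i.e., a much faster implementation).
[Figure: Processor (Control, Datapath), Memory, Input, Output.]

How to Control the Instruction Fetch Unit?
- Our 1st control signal: PC_sel
- PC_sel works as follows:
  1. Increment the PC by 4 if PC_sel = 0
  2. Use the branch target if PC_sel = 1
- After fetching the instruction, we will execute it →
[Figure: instruction fetch unit. The PC (clocked by Clk) drives the address (Adr) of the instruction memory, which outputs Instruction<31:0>. One adder computes PC + 4, a second adder adds the sign-extended imm16 to form the branch target, and a mux controlled by PC_sel selects the next PC.]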
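The PC_sel mux described above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's circuit: the function names `sign_extend_16` and `next_pc` are hypothetical, the encoding (0 selects PC + 4, 1 selects the branch target) is assumed, and the branch target is taken to be PC + 4 + sign_ext(imm16) * 4, as in the BEQ register transfer later in the deck.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, imm16: int, pc_sel: int) -> int:
    """Model the next-PC mux (assumed encoding: 0 = PC + 4, 1 = branch target)."""
    pc_plus_4 = (pc + 4) & MASK32
    if pc_sel == 0:
        return pc_plus_4
    # The immediate counts words, so it is scaled by 4 before being added.
    return (pc_plus_4 + sign_extend_16(imm16) * 4) & MASK32
```

For example, `next_pc(0x1000, 3, 1)` skips three instructions past PC + 4, while a negative immediate such as `0xFFFF` branches backward by one word.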

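Activated Datapath for Executing Add
- Instruction format: op (31:26), rs (25:21), rt (20:16), rd (15:11), shamt (10:6), funct (5:0)
- Settings of the 8 control signals for add:
  - PC_sel = incr
  - RegDst = rd
  - RegWr = asserted
  - ExtOp = don't care
  - ALUSrc = busB
  - ALUctr = Add
  - MemWr = not asserted
  - MemtoReg = ALU result
[Figure: single-cycle datapath with the add path highlighted. The Instruction Fetch Unit supplies Instruction<31:0>; Rs, Rt, and Rd select registers Ra, Rb, and Rw in the file of 32 32-bit registers; busA and busB feed the ALU (through the ALUSrc mux, whose other input is the extended imm16); the ALU output and Zero flag leave the ALU; the data memory (WrEn, Adr, Data In) and the MemtoReg mux drive busW.]

But How Do We Generate the Correct Control Signals?
- Control signals are derived from the instruction.

  R-type:      0 (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
  Load/Store:  35 or 43 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
  Branch:      4 (31:26) | rs (25:21) | rt (20:16) | address (15:0)

- Bits 31:26: opcode
- Bits 25:21 (rs): always read
- Bits 20:16 (rt): read, except for load
- Destination register: written for R-type (rd) and load (rt)
- Bits 15:0 (address): sign-extend, then add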

Adding Control to Datapath
- Control Unit: a combinational logic circuit
- Inputs: blue variables
- Outputs: red variables
[Figure: the instruction memory outputs Instruction<31:0>. Op <31:26> and Fun <5:0> feed the control unit; Rs <25:21>, Rt <20:16>, Rd <15:11>, and Imm16 <15:0> feed the datapath. The control unit drives PC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the datapath; the datapath returns Zero?.]

A Few Examples of Control Signals

inst    Register Transfer
ADD     R[rd] ← R[rs] + R[rt]; PC ← PC + 4
        ALUsrc = BusB, ALUctr = add, RegDst = rd, RegWr, PC_sel = incr
SUB     R[rd] ← R[rs] - R[rt]; PC ← PC + 4
        ALUsrc = BusB, ALUctr = sub, RegDst = rd, RegWr, PC_sel = incr
ORi     R[rt] ← R[rs] OR zero_ext(imm16); PC ← PC + 4
        ALUsrc = Im, Extop = Z, ALUctr = or, RegDst = rt, RegWr, PC_sel = incr
LOAD    R[rt] ← MEM[R[rs] + sign_ext(imm16)]; PC ← PC + 4
        ALUsrc = Im, Extop = Sign, ALUctr = add, MemtoReg, RegDst = rt, RegWr, PC_sel = incr
STORE   MEM[R[rs] + sign_ext(imm16)] ← R[rt]; PC ← PC + 4
        ALUsrc = Im, Extop = Sign, ALUctr = add, MemWr, PC_sel = incr
BEQ     if (R[rs] == R[rt]) then PC ← PC + 4 + sign_ext(imm16) * 4; else PC ← PC + 4
        PC_sel = output of ALU (Zero), ALUctr = sub
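To see how these settings drive the register transfers, here is a minimal Python sketch of a one-instruction step function. It is an illustration under stated assumptions, not the course's reference model: the names `CONTROL`, `ALU_OPS`, `extend`, and `step` are hypothetical, signals that a row above does not mention are assumed to be deasserted, and memory is modeled as a dictionary keyed by address.

```python
MASK = 0xFFFFFFFF

# Symbolic settings taken from the examples above; unlisted signals are
# assumed deasserted (False or None). The key names are illustrative.
CONTROL = {
    "add": dict(ALUSrc="busB", ExtOp=None, ALUctr="add", RegDst="rd", RegWr=True, MemWr=False, MemtoReg=False),
    "sub": dict(ALUSrc="busB", ExtOp=None, ALUctr="sub", RegDst="rd", RegWr=True, MemWr=False, MemtoReg=False),
    "ori": dict(ALUSrc="imm", ExtOp="zero", ALUctr="or", RegDst="rt", RegWr=True, MemWr=False, MemtoReg=False),
    "lw": dict(ALUSrc="imm", ExtOp="sign", ALUctr="add", RegDst="rt", RegWr=True, MemWr=False, MemtoReg=True),
    "sw": dict(ALUSrc="imm", ExtOp="sign", ALUctr="add", RegDst=None, RegWr=False, MemWr=True, MemtoReg=False),
    "beq": dict(ALUSrc="busB", ExtOp=None, ALUctr="sub", RegDst=None, RegWr=False, MemWr=False, MemtoReg=False),
}

ALU_OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "or": lambda a, b: (a | b) & MASK,
}


def extend(imm16, ext_op):
    """Zero- or sign-extend a 16-bit immediate to a 32-bit pattern."""
    imm16 &= 0xFFFF
    if ext_op == "sign" and imm16 & 0x8000:
        return (imm16 - 0x10000) & MASK
    return imm16


def step(inst, rs, rt, rd, imm16, regs, mem, pc):
    """Execute one instruction; update regs/mem in place and return the next PC."""
    c = CONTROL[inst]
    bus_a, bus_b = regs[rs], regs[rt]
    operand_b = bus_b if c["ALUSrc"] == "busB" else extend(imm16, c["ExtOp"])
    alu_out = ALU_OPS[c["ALUctr"]](bus_a, operand_b)
    if c["MemWr"]:
        mem[alu_out] = bus_b
    if c["RegWr"]:
        dest = rd if c["RegDst"] == "rd" else rt
        regs[dest] = mem[alu_out] if c["MemtoReg"] else alu_out
    if inst == "beq" and alu_out == 0:  # PC_sel follows the ALU's Zero output
        return (pc + 4 + (extend(imm16, "sign") << 2)) & MASK
    return (pc + 4) & MASK
```

Keeping the per-instruction settings in a table and the datapath in one generic function mirrors the slide's split between the control unit and the datapath.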

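Summary of 8 Control Signals (for 7 instructions)
- See the MIPS reference for the func and op encodings.
- The first 2 columns (add, sub) are identical except for the last row, so they can be combined!
[Table: one column per instruction (add, sub, ori, lw, sw, beq, jump), headed by its func and op bits (func is N/A for the non-R-type instructions); one row per control signal: RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, PCsel, ExtOp, ALUctr<3:0>. The ALUctr row reads Add, Subtract, Or, Add, Add, Subtract for add, sub, ori, lw, sw, beq.]

Instruction formats:
  R-type (add, sub):          op | rs | rt | rd | shamt | funct
  I-type (ori, lw, sw, beq):  op | rs | rt | immediate
  J-type (jump):              op | target address

The Concept of Local Decoding
- The first two columns of the previous slide are collapsed into one R-type column.
- Main Control takes op (6 bits) and produces ALUop (2 bits) along with the other control signals.
- ALU Control (Local) takes func (6 bits) and ALUop and produces ALUctr (4 bits) for the ALU; ALUctr is generated locally based on the funct code.
- ALUop has 4 classes (R-type, Or, Add, Subtract), so it needs 2 bits.
[Table: columns R-type, ori, lw, sw, beq, jump; rows RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, ExtOp, ALUop<1:0>.]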

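The ALU Control
- Assume a 2-bit ALUOp derived from the opcode.
- Next, combinational logic derives the ALU control.

  opcode   Operation          funct    ALU function
  lw       load word          XXXXXX   add
  sw       store word         XXXXXX   add
  beq      branch equal       XXXXXX   subtract
  ori      or immediate       XXXXXX   OR
  R-type   add                add      add
  R-type   subtract           subtract subtract
  R-type   AND                AND      AND
  R-type   OR                 OR       OR
  R-type   set-on-less-than   slt      set-on-less-than

[The slide's table also has ALUOp and ALUCtr columns giving the binary encoding for each row, and the R-type funct column holds the 6-bit funct code of each operation.]

- The ALU Control block drives the ALU's 4-bit ALUCtr input, which selects one of these ALU functions: AND, OR, add, subtract, set-on-less-than, NOR.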
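The two-level decoding above can be sketched as a pair of lookups. This is a minimal illustration: `ALUOP_CLASS`, `FUNCT_TO_ALU`, and `alu_control` are hypothetical names, the ALUOp classes are kept symbolic rather than as 2-bit codes, and the funct values are the standard MIPS encodings (add 0x20, sub 0x22, and 0x24, or 0x25, slt 0x2A), which are supplied here as an assumption rather than copied from the slide.

```python
# Level 1: main control maps the opcode to a symbolic ALUOp class.
ALUOP_CLASS = {
    "lw": "Add",
    "sw": "Add",
    "beq": "Subtract",
    "ori": "Or",
    "R-type": "R-type",
}

# Level 2: for R-type only, the funct field picks the ALU function.
# Standard MIPS funct encodings (assumed, not taken from the slide).
FUNCT_TO_ALU = {
    0x20: "add",
    0x22: "subtract",
    0x24: "AND",
    0x25: "OR",
    0x2A: "set-on-less-than",
}


def alu_control(aluop_class, funct=None):
    """Local ALU control: funct is a don't-care unless the class is R-type."""
    if aluop_class == "Add":
        return "add"
    if aluop_class == "Subtract":
        return "subtract"
    if aluop_class == "Or":
        return "OR"
    if aluop_class == "R-type":
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unknown ALUOp class: {aluop_class}")
```

For instance, `alu_control(ALUOP_CLASS["beq"])` yields `"subtract"` without looking at funct, while `alu_control("R-type", 0x2A)` yields `"set-on-less-than"`.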

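Truth Table for the Main Control
- Main Control: input op (6 bits); outputs RegDst, ALUSrc, ..., and ALUop (2 bits).
- ALU Control (Local): inputs func (6 bits) and ALUop; output ALUctr (4 bits).
[Table: columns R-type, ori, lw, sw, beq, jump; rows RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, Jump, ExtOp, ALUop (symbolic), ALUop<1>, ALUop<0>. The symbolic ALUop row reads R-type, Or, Add, Add, Subtract for R-type, ori, lw, sw, beq.]
- These columns don't need func.

A Simple Datapath + Control
- Based on the previous truth table.
[Figure: datapath with control added. OP feeds the main control, which sends 2 bits (ALUop) to the ALU control; the ALU control combines them with Func to produce the 4 bits of ALUctr.]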

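R-Type Instruction
[Figure: datapath activity for an R-type instruction; the ALU control derives ALUctr from func.]

Load Instruction
[Figure: datapath activity for a load instruction; the ALU performs add.]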

Branch-on-Equal Instruction (beq)
[Figure: datapath activity for beq; the ALU performs sub.]

Finally, Implementing Jumps (j)
  J-type:  2 (31:26) | address (25:0)
- Jump uses word addressing.
- It updates the PC with a concatenation of:
  - the most significant 4 bits of <current PC + 4>
  - the 26-bit jump address (shifted left by 2 bits to get a 32-bit address)
- Now, we need a new control signal, decoded from the opcode, for jump.
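The concatenation described above is easy to check with bit operations. A minimal sketch follows; the function name `jump_target` is hypothetical.

```python
def jump_target(pc: int, addr26: int) -> int:
    """Form the 32-bit jump target from PC + 4 and the 26-bit address field."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000        # most significant 4 bits of PC + 4
    lower28 = (addr26 & 0x03FFFFFF) << 2   # word address -> byte address
    return upper4 | lower28
```

For example, a jump at PC = 0x40000008 with address field 0x100 lands at 0x40000400: the top 4 bits come from PC + 4 and the low 28 bits are the field shifted left by 2.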

Datapath With Jumps Added
[Figure: the datapath extended with a jump path; the upper 4 bits of PC + 4 are combined with the shifted jump address.]

Performance Issues
- Yes, the previous single-cycle CPU works correctly.
- But the longest delay determines the CPU clock cycle.
- What is the critical (i.e., longest) path in the processor? The load instruction:
  Instruction memory → register file → ALU → data memory → register file
- The path could be even longer if you deal with floating-point numbers.
- The design works, but it violates the design principle of making the common case fast.
- Next, we will improve it using pipelining.

Pipelining Analogy
- Pipelining is natural!
- The classic laundry example: Washer, Dryer, Folder, Storer
- 4 loads:
  - Sequential: Total = 8 hours
  - Pipelined: Total = 3.5 hours
  - Speedup = 8/3.5 = 2.3X
- Non-stop (in a steady state):
  - Speedup ≈ 4 (2N/0.5N) = number of stages

Important Lessons about Pipelining
- Pipelining doesn't help the latency of a single task, but it helps the throughput of the entire workload.
- Multiple tasks can operate simultaneously using different resources (i.e., in parallel).
- Potential speedup = number of pipeline stages.
- The pipeline rate is limited by the slowest stage.
- Unbalanced stage lengths reduce speedup.
- The time to fill the pipeline and the time to drain it reduce speedup.
- The pipeline may stall for dependences.
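The laundry numbers can be reproduced with a short calculation. This sketch assumes four stages of 0.5 hours each, which is what the 2N/0.5N ratio and the 8-hour total for 4 loads imply; the function names are hypothetical.

```python
def sequential_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """The first load fills the pipeline; each later load adds one stage time."""
    return (stages + loads - 1) * stage_hours


def laundry_speedup(loads: int) -> float:
    return sequential_hours(loads) / pipelined_hours(loads)
```

With 4 loads this gives 8 hours versus 3.5 hours, and as the number of loads grows the ratio approaches the number of stages (4), because the fill and drain time stops mattering.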

The MIPS CPU Pipeline
5 pipeline stages on MIPS processors:
1. IF: Instruction Fetch from memory
2. ID: Instruction Decode and Register Read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register file

Pipeline Stage Performance
- Assume the time for the different stages is:
  - 100ps for the ID stage
  - 100ps for the WB stage
  - 200ps for all the other stages
- Performance of the old single-cycle datapath design:

  Instruction example | Instr fetch | Register read | ALU op | Memory access | Register write back | Total time
  lw                  | 200ps       | 100ps         | 200ps  | 200ps         | 100ps               | 800ps
  sw                  | 200ps       | 100ps         | 200ps  | 200ps         |                     | 700ps
  R-format            | 200ps       | 100ps         | 200ps  |               | 100ps               | 600ps
  beq                 | 200ps       | 100ps         | 200ps  |               |                     | 500ps
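The totals in the table follow directly from the stage times. A minimal sketch, with hypothetical names (`STAGE_PS`, `STAGES_USED`, `latency_ps`), that recomputes them and the resulting single-cycle clock period:

```python
# Stage times from the slide, in picoseconds.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which stages each instruction class actually uses (per the table above).
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def latency_ps(inst: str) -> int:
    """Total time of one instruction in the single-cycle design."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[inst])


# The single-cycle clock must fit the slowest instruction (lw).
single_cycle_clock_ps = max(latency_ps(inst) for inst in STAGES_USED)

# A pipelined clock only has to fit the slowest stage.
pipelined_clock_ps = max(STAGE_PS.values())
```

Because every instruction shares one clock in the single-cycle design, even a 500ps beq is charged the full 800ps needed by lw.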

Single-Cycle vs Pipeline CPU
- Single-cycle (CC = 800ps)
- Pipelined (CC = 200ps), because 200ps is the slowest stage time
- 4× speedup!

A Simple Convenient Pipeline Representation
Time runs left to right; program flow runs top to bottom.

  IFetch  ID      Exec    Mem     WB
          IFetch  ID      Exec    Mem     WB
                  IFetch  ID      Exec    Mem     WB
                          IFetch  ID      Exec    Mem     WB
                                  IFetch  ID      Exec    Mem     WB
                                          IFetch  ID      Exec    Mem     WB
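The staircase representation is mechanical enough to generate. The sketch below is illustrative only (the names `pipeline_diagram` and `total_cycles` are hypothetical) and assumes an ideal pipeline with no stalls, where each instruction starts one cycle after the previous one.

```python
STAGES = ["IFetch", "ID", "Exec", "Mem", "WB"]


def pipeline_diagram(num_instructions: int) -> str:
    """Return a text staircase: one row per instruction, one column per cycle."""
    width = max(len(stage) for stage in STAGES) + 1
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + STAGES  # instruction i is delayed by i cycles
        rows.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return "\n".join(rows)


def total_cycles(num_instructions: int) -> int:
    """Cycles until the last instruction leaves WB in an ideal pipeline."""
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    print(pipeline_diagram(6))
```

Reading a column of the output top to bottom shows which stage each in-flight instruction occupies during that cycle.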

Pipeline Speedup
- If all stages are balanced (i.e., all stages take the same time):
    t_pipelined = t_nonpipelined / (# of stages)
- If the stages are not balanced, the speedup is smaller.
- The speedup comes from increased throughput.
- Note: latency (i.e., the time of each instruction) does not necessarily improve!
- Under ideal conditions and with many instructions, the speedup is equal to the number of stages.

MIPS's ISA Design Is Suitable for Pipelining
- All MIPS instructions are 32 bits.
  - Easier to fetch and decode
  - vs. x86 (CISC): 1- to 17-byte instructions, which are more difficult to fetch. So the next PC is PC + ?? (it depends).
- It has very regular instruction formats.
  - So we can decode and read registers simultaneously in one stage.
- Only load/store can access memory.
  - Can calculate the address in the EX stage and access memory in the MEM stage (i.e., EX, MEM, WB).
- Memory operands are aligned.
  - So a memory access takes only one cycle (in one stage): one data transfer.
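The speedup claims on the Pipeline Speedup slide can be checked numerically. This is a sketch under stated assumptions: an ideal 5-stage pipeline with no stalls, the 800ps single-cycle clock and 200ps pipelined clock from the earlier slides as defaults, and hypothetical function names.

```python
def single_cycle_time_ps(n: int, clock_ps: int = 800) -> int:
    """n instructions, one long clock cycle each."""
    return n * clock_ps


def pipelined_time_ps(n: int, stages: int = 5, clock_ps: int = 200) -> int:
    """Fill the pipeline once, then finish one instruction per cycle."""
    return (stages + n - 1) * clock_ps


def speedup(n: int, single_clock_ps: int = 800, pipe_clock_ps: int = 200) -> float:
    return single_cycle_time_ps(n, single_clock_ps) / pipelined_time_ps(n, 5, pipe_clock_ps)
```

A single instruction takes 1000ps pipelined versus 800ps single-cycle, so latency gets worse (`speedup(1)` is 0.8). For a long instruction stream the ratio approaches 800/200 = 4, and it approaches the stage count of 5 only in the balanced case, for example `speedup(10**6, single_clock_ps=1000)`.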

Pipeline Hazards
Hazards: situations in which the next instruction cannot execute in the next cycle.
1. Structural hazards
   - A required resource (e.g., memory) is occupied.
   - See more details on the next slide.
2. Data hazards
   - Need to wait for a previous instruction to complete its data read/write.
3. Control hazards
   - Depend on a control action from the previous instruction (a branch instruction: beq).

Structural Hazards
- Occur when there is a conflict for the use of a resource that is already occupied (e.g., one memory unit).
- Suppose the MIPS pipeline has a single memory:
  - Load/store instructions require the memory unit.
  - Instruction fetch would have to stall for that cycle.
  - This causes a pipeline bubble.
- Hence, MIPS pipelines require separate instruction and data memories (2 memories) in order to avoid a structural hazard.
- Fortunately, MIPS is well designed so that there is no structural hazard.
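The single-memory conflict can be made concrete by listing which instructions would need the memory in the same cycle. This is a simplified sketch, assuming an ideal 5-stage pipeline in which instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4; the name `memory_conflicts` is hypothetical and stalls are not simulated.

```python
MEMORY_OPS = {"lw", "sw"}


def memory_conflicts(program):
    """Return (cycle, mem_instr_index, fetch_instr_index) for each clash
    between a load/store in MEM and a later instruction in IF."""
    conflicts = []
    for i, op in enumerate(program):
        mem_cycle = i + 4      # IF in cycle i + 1, so MEM in cycle i + 4
        fetch_index = i + 3    # the instruction being fetched in that cycle
        if op in MEMORY_OPS and fetch_index < len(program):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts
```

For the program `["lw", "add", "sub", "or"]`, the lw uses memory in cycle 4, which is exactly when the fourth instruction is being fetched. With separate instruction and data memories, the two accesses go to different units and the list of conflicts is irrelevant.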

How about Data Hazards?
- An instruction depends on the completion of a data access by a previous instruction:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
- The sub must wait for 3 cycles.
Figure: Graphical representation of the instruction pipeline. Shading on the right means the register is read; shading on the left means the register is written.
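The 3-cycle wait can be derived from the stage positions. The sketch below uses hypothetical names and a simplifying assumption: there is no forwarding, and a register written in WB (stage 5) can be read in ID (stage 2) only in a later cycle.

```python
def parse(inst: str):
    """Split 'add $s0, $t0, $t1' into (op, destination, [sources])."""
    op, operands = inst.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]


def has_raw_hazard(first: str, second: str) -> bool:
    """True if the second instruction reads the register the first one writes."""
    _, dest, _ = parse(first)
    _, _, sources = parse(second)
    return dest in sources


def stall_cycles(distance: int, write_stage: int = 5, read_stage: int = 2) -> int:
    """Bubbles needed when the consumer is `distance` instructions after the producer.

    The producer is fetched in cycle 1 and writes in cycle `write_stage`.
    Without stalls, the consumer would read in cycle `distance + read_stage`.
    Assumption: the read must happen in a cycle after the write.
    """
    earliest_read = write_stage + 1
    natural_read = distance + read_stage
    return max(0, earliest_read - natural_read)
```

For the add/sub pair above, `has_raw_hazard` is true and `stall_cycles(1)` is 3. If the register file could be written and read within the same cycle, the count would drop by one.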