3/6/8
CSCI 42: Computer Architectures
The Processor (2)
Fengguang Song
Department of Computer & Information Science, IUPUI

Today's Content
- We have looked at how to design a datapath.
- 4.4, 4.5
- We will design a control unit for the single-cycle processor (i.e., how to set up the 8 control signals).
- We will also learn about the pipelined processor (i.e., a much faster implementation).
[Figure: computer organization: Processor (Control, Datapath), Memory, Input, Output]
How to Control the Instruction Fetch Unit?
- The first control signal: PC_sel
- PC_sel works as follows:
  1. Increase PC by 4 if PC_sel = 0
  2. Use the branch target if PC_sel = 1
[Figure: instruction fetch unit: PC, instruction memory (Adr, Instruction<31:0>), constant 4, imm16 sign extender, two adders, and a mux selected by PC_sel]

Activated Datapath for Executing Add
- 8 control signals: PC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg
- For add: PC_sel = incr, ALUctr = Add, ExtOp = don't care
- R-type format: op | rs | rt | rd | shamt | funct (field boundaries at bits 31, 26, 21, 16, 11, 6, 0)
[Figure: single-cycle datapath with the path used by add highlighted: instruction fetch unit, 32 32-bit registers (Rw, Ra, Rb, busW, busA, busB), extender, ALU (Zero output), data memory (WrEn, Adr, Data In), and the muxes driven by RegDst, ALUSrc, and MemtoReg; Rs = Instruction<21:25>, Rt = Instruction<16:20>, Rd = Instruction<11:15>, Imm16 = Instruction<0:15>]
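But, How to Generate Correct Control Signals?
- Control signals are derived from the instruction.

  R-type:      opcode 0 [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
  Load/Store:  opcode 35 or 43 [31:26] | rs [25:21] | rt [20:16] | address [15:0]
  Branch:      opcode 4 [31:26] | rs [25:21] | rt [20:16] | address [15:0]

- opcode: always in bits 31:26
- rs: always read
- rt: read, except for load
- rd (R-type) or rt (load): written
- address: sign-extend, then add

Adding Control to Datapath
- Control is a combinational logic circuit.
- Inputs: blue variables in the figure
- Outputs: red variables in the figure
[Figure: instruction memory output Instruction<31:0> split into Op <31:26>, Fun <0:5>, Rs <21:25>, Rt <16:20>, Rd <11:15>, Imm16 <0:15>; the control block drives PC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg into the datapath; Zero comes back from the datapath]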
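The field layout above can be illustrated with a few shifts and masks. This is a minimal sketch, not part of the original slides: the function names `decode_fields` and `sign_extend16` are hypothetical, and the bit positions are the standard MIPS ones (op 31:26, rs 25:21, rt 20:16, rd 15:11, shamt 10:6, funct 5:0, immediate 15:0).

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields (hypothetical helper)."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31:26
        "rs": (instr >> 21) & 0x1F,     # bits 25:21
        "rt": (instr >> 16) & 0x1F,     # bits 20:16
        "rd": (instr >> 11) & 0x1F,     # bits 15:11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,   # bits 10:6  (R-type only)
        "funct": instr & 0x3F,          # bits 5:0   (R-type only)
        "imm16": instr & 0xFFFF,        # bits 15:0  (I-type only)
    }


def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


# lw $t0, -4($sp): opcode 35, rs = 29 ($sp), rt = 8 ($t0), immediate = -4
word = (35 << 26) | (29 << 21) | (8 << 16) | (-4 & 0xFFFF)
fields = decode_fields(word)
print(fields["op"], fields["rs"], fields["rt"], sign_extend16(fields["imm16"]))
```

The same word is decoded for every format; the control unit decides from `op` which of the fields are meaningful.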
Examples of Control Signals

inst    Register Transfer
ADD     R[rd] ← R[rs] + R[rt]; PC ← PC + 4
        ALUsrc = BusB, ALUctr = add, RegDst = rd, RegWr, PC_sel = incr
SUB     R[rd] ← R[rs] - R[rt]; PC ← PC + 4
        ALUsrc = BusB, ALUctr = sub, RegDst = rd, RegWr, PC_sel = incr
ORi     R[rt] ← R[rs] OR zero_ext(imm16); PC ← PC + 4
        ALUsrc = Im, ExtOp = Zero, ALUctr = or, RegDst = rt, RegWr, PC_sel = incr
LOAD    R[rt] ← MEM[R[rs] + sign_ext(imm16)]; PC ← PC + 4
        ALUsrc = Im, ExtOp = Sign, ALUctr = add, MemtoReg, RegDst = rt, RegWr, PC_sel = incr
STORE   MEM[R[rs] + sign_ext(imm16)] ← R[rt]; PC ← PC + 4
        ALUsrc = Im, ExtOp = Sign, ALUctr = add, MemWr, PC_sel = incr
BEQ     if (R[rs] == R[rt]) then PC ← PC + 4 + sign_ext(imm16) * 4; else PC ← PC + 4
        PC_sel = Zero output of ALU, ALUctr = sub

Summary of Control Signals (for 7 instructions)
- See MIPS reference.
- Columns: add, sub, ori, lw, sw, beq, jump (func applies to add and sub; N/A for the others)
- Rows: RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, PCsel, ExtOp, ALUctr<3:0>
- ALUctr row: add = Add, sub = Subtract, ori = Or, lw = Add, sw = Add, beq = Subtract
- The first 2 columns are identical except for the last row, so they can be combined!

Instruction formats (field boundaries at bits 31, 26, 21, 16, 11, 6, 0):
  R-type:  op | rs | rt | rd | shamt | funct     (add, sub)
  I-type:  op | rs | rt | immediate              (ori, lw, sw, beq)
  J-type:  op | target address                   (jump)
The Concept of Local Decoding
- Columns: R-type, ori, lw, sw, beq, jump (the first two columns of the previous slide are collapsed into one)
- Rows: RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, ExtOp, ALUop<N:0>
- ALUop row (symbolic): R-type = "R-type", ori = Or, lw = Add, sw = Add, beq = Subtract
- ALUctr is generated locally, based on the funct code.
[Figure: op (6 bits) → Main Control → ALUop (2 bits; this could be more bits) → ALU Control (Local), which also takes func (6 bits) → ALUctr (4 bits) → ALU]

The ALU Control
- Assume a 2-bit ALUOp derived from the opcode.
- Next, combinational logic derives the ALU control.

  opcode   Operation          funct            ALU function
  lw       load word          XXXXXX           add
  sw       store word         XXXXXX           add
  beq      branch equal       XXXXXX           subtract
  ori      or immediate       XXXXXX           OR
  R-type   add                funct of add     add
  R-type   subtract           funct of sub     subtract
  R-type   AND                funct of and     AND
  R-type   OR                 funct of or      OR
  R-type   set-on-less-than   funct of slt     set-on-less-than
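The two-level decoding described above can be sketched as a small lookup. The bit patterns are assumptions, since they are not legible in this copy of the slides: the ALUOp values 00/01/10, the funct codes, and the 4-bit ALU control codes follow the usual Patterson and Hennessy MIPS convention, and `0b11` for ori is a purely hypothetical encoding chosen for illustration.

```python
# Assumed 4-bit ALU control codes (textbook convention, not taken from the slides).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Assumed MIPS funct codes for the R-type instructions listed above.
FUNCT_TO_ALUCTR = {
    0x20: ALU_ADD,  # add
    0x22: ALU_SUB,  # sub
    0x24: ALU_AND,  # and
    0x25: ALU_OR,   # or
    0x2A: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Local ALU decoder: 2-bit ALUOp plus 6-bit funct gives 4-bit ALUctr."""
    if alu_op == 0b00:   # lw / sw: address calculation, funct is don't care
        return ALU_ADD
    if alu_op == 0b01:   # beq: compare by subtracting
        return ALU_SUB
    if alu_op == 0b11:   # ori: hypothetical encoding, funct is don't care
        return ALU_OR
    return FUNCT_TO_ALUCTR[funct]  # 0b10: R-type, decided by funct


print(format(alu_control(0b10, 0x22), "04b"))  # R-type sub
print(format(alu_control(0b00, 0x3F), "04b"))  # lw with an arbitrary funct
```

Keeping the funct lookup local to the ALU means the main control only has to distinguish a handful of opcode classes.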
ALU Control
- ALUCtr selects the ALU function. Functions, in table order: AND, OR, add, subtract, set-on-less-than, NOR

Logic Function for Each Signal
- Mostly just a simple function: f(op)

  PC_sel   ← if (OP == BEQ) then Zero (equal) else 0
  ALUsrc   ← if ((OP == Rtype) || (OP == BEQ)) then BusB else immed
  ALUctr   ← if (OP == Rtype) then check funct
             elseif (OP == ORi) then OR
             elseif (OP == BEQ) then sub
             else add
  ExtOp    ← if (OP == ORi) then zero else sign
  MemWr    ← (OP == Store)
  MemtoReg ← (OP == Load)
  RegWr    ← if ((OP == Store) || (OP == BEQ)) then 0 else 1
  RegDst   ← if ((OP == Load) || (OP == ORi)) then Rt else Rd
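The per-signal logic functions can be written out as one small function. This is a sketch under stated assumptions, not the course's implementation: the names `ControlSignals` and `main_control` are hypothetical, opcodes are passed symbolically (`"rtype"`, `"ori"`, `"load"`, `"store"`, `"beq"`), PC_sel is encoded as 0 for PC + 4 and 1 for the branch target, and beq is given ALUsrc = BusB because it compares two registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignals:
    pc_sel: int       # assumed encoding: 0 = PC + 4, 1 = branch target
    alu_src: str      # "BusB" or "immed"
    alu_ctr: str      # "add", "sub", "or", or "funct" (decided by the local ALU control)
    ext_op: str       # "zero" or "sign"
    mem_wr: bool
    mem_to_reg: bool
    reg_wr: bool
    reg_dst: str      # "Rt" or "Rd"


def main_control(op: str, zero: bool = False) -> ControlSignals:
    """Derive the control signals from a symbolic opcode and the ALU Zero flag."""
    if op not in {"rtype", "ori", "load", "store", "beq"}:
        raise ValueError(f"unsupported opcode class: {op}")
    if op == "rtype":
        alu_ctr = "funct"
    elif op == "ori":
        alu_ctr = "or"
    elif op == "beq":
        alu_ctr = "sub"
    else:
        alu_ctr = "add"
    return ControlSignals(
        pc_sel=1 if (op == "beq" and zero) else 0,
        alu_src="BusB" if op in ("rtype", "beq") else "immed",
        alu_ctr=alu_ctr,
        ext_op="zero" if op == "ori" else "sign",
        mem_wr=(op == "store"),
        mem_to_reg=(op == "load"),
        reg_wr=op not in ("store", "beq"),
        reg_dst="Rt" if op in ("load", "ori") else "Rd",
    )


print(main_control("load"))
print(main_control("beq", zero=True))
```

Some outputs are don't-care values in hardware (for example RegDst for store); the sketch simply returns whatever the formula yields.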
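Truth Table for the Main Control
[Figure: op (6 bits) → Main Control → RegDst, ALUSrc, ..., ALUop (2 bits) → ALU Control (Local), which also takes func (6 bits) → ALUctr (4 bits)]
- Columns: R-type, ori, lw, sw, beq, jump
- Rows: RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, Jump, ExtOp, ALUop (symbolic), ALUop<1>, ALUop<0>
- ALUop (symbolic) row: R-type = "R-type", ori = Or, lw = Add, sw = Add, beq = Subtract
- The columns whose ALUop is not "R-type" don't need func.

A Simple Datapath + Control
- Based on the previous truth table
[Figure: single-cycle datapath with the main control and ALU control blocks; labels: 2 bits (ALUop), 4 bits (ALUctr)]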
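R-Type Instruction
[Figure: datapath with the path used by an R-type instruction highlighted; the ALU control derives ALUctr from func]

Load Instruction
[Figure: datapath with the path used by a load highlighted; the ALU performs add]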
Branch-on-Equal Instruction (beq)
[Figure: datapath with the path used by beq highlighted; the ALU performs sub]

Finally, Implementing Jumps (j)

  J-type:  opcode 2 [31:26] | address [25:0]

- Jump uses word addressing.
- It updates the PC with the concatenation of:
  - the most significant 4 bits of <current PC + 4>
  - the 26-bit jump address (shifted left by 2 bits to get a byte address)
- Now we need a new control signal, decoded from the opcode, for jump.
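The concatenation described above can be checked with a few lines of bit arithmetic. This is an illustrative sketch; the function name `jump_target` and the example addresses are hypothetical.

```python
def jump_target(pc: int, instr: int) -> int:
    """Compute the new PC for a MIPS j instruction located at address pc."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    addr26 = instr & 0x03FFFFFF          # low 26 bits: word address
    upper4 = pc_plus_4 & 0xF0000000      # most significant 4 bits of PC + 4
    return upper4 | (addr26 << 2)        # shift left by 2 to get a byte address


# A jump at 0x00400018 whose 26-bit field holds 0x00400000 >> 2
instr = (2 << 26) | (0x00400000 >> 2)
print(hex(jump_target(0x00400018, instr)))
```

Because only 26 bits come from the instruction, a jump can reach targets only within the 256 MB region selected by the upper 4 bits of PC + 4.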
Datapath With Jumps Added
[Figure: single-cycle datapath and control with the jump path added; label: 4 bits]

Performance Issues
- Yes, the single-cycle CPU works correctly.
- But the longest delay determines the CPU clock cycle.
- What is the critical (longest) path? The load instruction:
  instruction memory → register file → ALU → data memory → register file
- It could be worse if you deal with floating-point numbers.
- This violates a design principle: make the common case fast.
- Next, we will improve it using pipelining.
Pipelining Analogy
- Pipelining is natural!
- The classic laundry example: washer, dryer, folder, storer
- Four loads, one after another: total = 8 hours
- Four loads, pipelined: total = 3.5 hours
- Four loads: speedup = 8/3.5 = 2.3X
- Non-stop (steady state): speedup ≈ 4 (2N / 0.5N) = number of stages

Important Lessons about Pipelining
- Pipelining doesn't help the latency of a single task, but it helps the throughput of the entire workload.
- Multiple tasks operate simultaneously using different resources (in parallel).
- Potential speedup = number of pipe stages
- The pipeline rate is limited by the slowest stage.
- Unbalanced stage lengths reduce speedup.
- The time to fill the pipeline and the time to drain it reduce speedup.
- The pipeline may stall for dependences.
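The laundry numbers can be reproduced with a short calculation. This sketch assumes four stages of 0.5 hours each, which is what the 8-hour and 3.5-hour totals imply; the function names are hypothetical.

```python
def sequential_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Each load finishes all stages before the next one starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """The first load takes `stages` steps; each later load adds one more step."""
    return (stages + loads - 1) * stage_hours


def speedup(loads: int) -> float:
    return sequential_hours(loads) / pipelined_hours(loads)


print(sequential_hours(4), pipelined_hours(4), round(speedup(4), 2))
print(round(speedup(1_000_000), 3))  # approaches the number of stages
```

The gap between 2.3X and 4X for four loads is the fill and drain time, which becomes negligible as the number of loads grows.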
The MIPS Pipeline
Five pipeline stages on MIPS processors:
1. IF: Instruction Fetch from memory
2. ID: Instruction Decode and register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to the register file

Pipeline Performance
- Assume the time for the different stages is:
  - 100ps for the ID stage
  - 100ps for the WB stage
  - 200ps for all the other stages
- Performance of the single-cycle datapath design:

  Instruction  Instr fetch  Register read  ALU op  Memory access  Register write back  Total time
  lw           200ps        100ps          200ps   200ps          100ps                800ps
  sw           200ps        100ps          200ps   200ps          -                    700ps
  R-format     200ps        100ps          200ps   -              100ps                600ps
  beq          200ps        100ps          200ps   -              -                    500ps
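The totals in the table, and the clock periods they imply, can be recomputed as follows. This is a sketch assuming the stage times of 100 ps for register read and write and 200 ps for every other stage; the names `STAGE_PS`, `STAGES_USED`, and the helper functions are hypothetical.

```python
# Assumed stage latencies in picoseconds.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which stages each instruction class actually needs.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(name: str) -> int:
    return sum(STAGE_PS[s] for s in STAGES_USED[name])


def single_cycle_clock() -> int:
    """The single-cycle clock must fit the slowest instruction."""
    return max(instruction_time(n) for n in STAGES_USED)


def pipelined_clock() -> int:
    """The pipelined clock must fit the slowest stage."""
    return max(STAGE_PS.values())


def total_time(n_instructions: int, pipelined: bool) -> int:
    if pipelined:
        return (n_instructions + len(STAGE_PS) - 1) * pipelined_clock()
    return n_instructions * single_cycle_clock()


for name in STAGES_USED:
    print(name, instruction_time(name))
print(single_cycle_clock(), pipelined_clock())
print(total_time(3, pipelined=False), total_time(3, pipelined=True))
```

For three instructions the pipeline is still mostly filling and draining, so the measured gain is well below the ratio of the two clock periods.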
Single-Cycle vs Pipeline
- Single-cycle (CC = 800ps)
- Pipelined (CC = 200ps); 200ps is the slowest stage time
- 4x speedup!

Convenient Pipelined Representation
[Figure: time on the horizontal axis, program flow on the vertical axis; six instructions, each passing through IFetch, ID, Exec, Mem, WB]
Pipeline Speedup
- If all stages are balanced (i.e., all take the same time):
  $t_{\text{pipelined}} = t_{\text{nonpipelined}} / (\text{number of stages})$
- If the stages are NOT balanced, the speedup is less.
- The speedup is due to increased throughput.
- Latency (the time for each instruction) does not necessarily improve.
- Under ideal conditions and with a large number of instructions, speedup = number of stages.

ISA Design is Suitable for Pipelining
- All MIPS instructions are 32 bits.
  - Much easier to fetch and decode
  - In contrast, x86 has 1- to 17-byte instructions, which is more difficult.
- Very regular instruction formats
  - So we can decode and read registers simultaneously in one stage.
- Only load/store can access memory.
  - Can calculate the address in the EX stage and access memory in the MEM stage (i.e., EX, MEM, WB)
- Alignment of memory operands
  - Always a single data transfer
  - So a memory access takes only one cycle (in one stage).
Pipeline Hazards
Hazards exist: situations when the next instruction cannot execute in the next cycle.
1. Structural hazards
   - A required resource (e.g., memory) is occupied/busy (more details on the next slide).
2. Data hazards
   - Need to wait for a previous instruction to complete its data read/write.
3. Control hazards
   - Depend on a control action from a previous instruction (e.g., a branch instruction: beq).

Structural Hazards
- A conflict for the use of a resource that is already occupied (e.g., only one memory!)
- MIPS is well designed, so there is no structural hazard.
- Suppose the MIPS pipeline had a single memory:
  - Load/store requires a memory access.
  - Instruction fetch would have to stall for that cycle.
  - This would cause a pipeline bubble.
- Hence, MIPS pipelines require separate instruction and data memories to avoid a structural hazard.
Data Hazards
- An instruction depends on the completion of a data access by a previous instruction:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

- Waited for 3 cycles
[Figure: graphical representation of the instruction pipeline; shading on the right half of a stage means the register is read, shading on the left half means the register is written]

Data Dependencies
- A RAW (read-after-write) data dependency need not always be a data hazard:

  add $s0, $t0, $t1      IF ID EX MEM WB
  sub $t4, $t0, $t1         IF ID EX MEM WB
  and $t5, $t0, $t1            IF ID EX MEM WB
  sub $t2, $s0, $t3               IF ID EX MEM WB

- There is a RAW dependency on $s0, but there is no pipeline data hazard!
- How to solve a data hazard? Either stall or forward. EX already has the result!
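The difference between a RAW dependency and a RAW hazard can be sketched as a distance check. The assumptions here are not from the slides: instructions are modeled as hypothetical `(op, dest, src1, src2)` tuples, the register numbers are illustrative, and a dependency counts as a hazard (without forwarding) when the consumer is at most 2 instructions behind the producer, which corresponds to a register file that is written in the first half of a cycle and read in the second half.

```python
def raw_dependencies(program, hazard_window=2):
    """Return (producer_index, consumer_index, register, is_hazard) tuples.

    A dependency is flagged as a hazard when the consumer follows the most
    recent writer of the register by at most `hazard_window` instructions.
    """
    results = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for j, (_op, dest, src1, src2) in enumerate(program):
        for src in dict.fromkeys((src1, src2)):  # de-duplicate, keep order
            if src in last_writer:
                i = last_writer[src]
                results.append((i, j, src, (j - i) <= hazard_window))
        last_writer[dest] = j
    return results


back_to_back = [
    ("add", "$s0", "$t0", "$t1"),
    ("sub", "$t2", "$s0", "$t3"),
]
spaced_out = [
    ("add", "$s0", "$t0", "$t1"),
    ("sub", "$t4", "$t0", "$t1"),
    ("and", "$t5", "$t0", "$t1"),
    ("sub", "$t2", "$s0", "$t3"),
]
print(raw_dependencies(back_to_back))
print(raw_dependencies(spaced_out))
```

Both programs contain the same dependency on `$s0`; only the first needs a stall or forwarding.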
Forwarding (aka Bypassing)
- Use a result as soon as it is computed/available.
  - Don't need to wait until the data is stored in a register.
  - Requires extra connections in the datapath.
- Works most of the time
- Need to modify the hardware

Load-Use Data Hazard
- Unfortunately, we can't always avoid stalls, even with forwarding.
  - The values are still not available when needed.
- Must stall one cycle for a load-use data hazard
  - A special form of data hazard
How to Use Code Scheduling to Avoid Stalls: a software solution
- First, find the load-use data hazards (i.e., a load result used by the immediately following instruction).
- Reorder the code to avoid using a load result in the next instruction.
- C code: A = B + E; C = B + F;

  Original order (13 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
  stall   add $t3, $t1, $t2
          sw  $t3, 12($t0)
          lw  $t4, 8($t0)
  stall   add $t5, $t1, $t4
          sw  $t5, 16($t0)

  Reordered (11 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          lw  $t4, 8($t0)
          add $t3, $t1, $t2
          sw  $t3, 12($t0)
          add $t5, $t1, $t4
          sw  $t5, 16($t0)

Other Types of Data Hazards
- We have discussed the RAW (read after write) data hazard.
- The other two data hazards are avoided by design!
  - Eliminate WAR by always fetching operands early (in ID) in the pipe.
  - Eliminate WAW by doing all WBs in order (always at the last stage, static).
- WAR example:
  ADD R3, R2, R1
  SUB R2, R4, R5
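The 13-cycle and 11-cycle counts can be reproduced with a small model. This is a sketch under stated assumptions: a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the `lw` immediately before it. Instructions are hypothetical `(op, dest, sources)` tuples, and the register numbers follow the usual textbook version of this example.

```python
def count_cycles(program, stages=5):
    """Cycles to run `program` on a pipeline with forwarding and load-use stalls."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _prev_srcs = prev
        _cur_op, _cur_dest, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1  # load result is not ready for the very next instruction
    return len(program) + (stages - 1) + stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # uses $t2 right after its load
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # uses $t4 right after its load
    ("sw", None, ["$t5", "$t0"]),
]
reordered = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]
print(count_cycles(original), count_cycles(reordered))
```

Moving the third load up separates each load from its first use by at least one instruction, which removes both stalls without changing the result.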