Computer Science 61C Spring 2017
Working on the Pipeline

Datapath Control Signals
- MemWr: 1 = write memory
- MemtoReg: 0 = ALU result; 1 = memory data
- RegDst: 0 = rt; 1 = rd
- RegWr: 1 = write register
- ALUctr: "add", "sub", "or", ...
- Extender (ExtOp): 0 = zero-extend; 1 = sign-extend
- npc_sel: 0 = PC+4; 1 = branch-if-equal
[Figure: single-cycle datapath with its control points: PC with next-PC adders and mux (npc_sel, Equal), instruction address, register file (Rw, Ra, Rb, busA, busB, busW; RegDst, RegWr), extender on Imm16 (ExtOp), ALU (ALUSrc, ALUctr), data memory (MemWr, WrEn, Adr, Data In) and the MemtoReg mux]

Summary of the Control Signals (1/2)
inst  Register transfer, then control settings
add   R[rd] <- R[rs] + R[rt]; PC <- PC + 4
      ALUsrc=RegB, ALUctr=ADD, RegDst=rd, RegWr, npc_sel=+4
sub   R[rd] <- R[rs] - R[rt]; PC <- PC + 4
      ALUsrc=RegB, ALUctr=SUB, RegDst=rd, RegWr, npc_sel=+4
ori   R[rt] <- R[rs] | zero_ext(imm16); PC <- PC + 4
      ALUsrc=Im, ExtOp=Z, ALUctr=OR, RegDst=rt, RegWr, npc_sel=+4
lw    R[rt] <- MEM[R[rs] + sign_ext(imm16)]; PC <- PC + 4
      ALUsrc=Im, ExtOp=sn, ALUctr=ADD, MemtoReg, RegDst=rt, RegWr, npc_sel=+4
sw    MEM[R[rs] + sign_ext(imm16)] <- R[rt]; PC <- PC + 4
      ALUsrc=Im, ExtOp=sn, ALUctr=ADD, MemWr, npc_sel=+4
beq   if (R[rs] == R[rt]) then PC <- PC + 4 + (sign_ext(imm16) << 2) else PC <- PC + 4
      npc_sel=br, ALUctr=SUB

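The register transfers above can be sketched as a tiny interpreter. This is an illustrative model, not the course's reference code: the function names, the list-based register file, and the dict-based memory are assumptions. It follows the standard MIPS conventions that sw stores R[rt] and that the branch target is PC + 4 + (sign_ext(imm16) << 2).

```python
MASK = 0xFFFFFFFF  # keep values within 32 bits


def sign_ext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def execute(op, R, MEM, pc, rs=0, rt=0, rd=0, imm=0):
    """Apply one instruction's register transfer and return the next PC.

    R is a list of 32 register values, MEM a dict from address to word.
    """
    if op == "add":
        R[rd] = (R[rs] + R[rt]) & MASK
    elif op == "sub":
        R[rd] = (R[rs] - R[rt]) & MASK
    elif op == "ori":
        R[rt] = R[rs] | (imm & 0xFFFF)          # zero-extended immediate
    elif op == "lw":
        R[rt] = MEM.get((R[rs] + sign_ext16(imm)) & MASK, 0)
    elif op == "sw":
        MEM[(R[rs] + sign_ext16(imm)) & MASK] = R[rt]
    elif op == "beq":
        if R[rs] == R[rt]:
            return (pc + 4 + (sign_ext16(imm) << 2)) & MASK
    else:
        raise ValueError(f"unsupported instruction: {op}")
    return (pc + 4) & MASK


R = [0] * 32
MEM = {}
R[1], R[2] = 5, 7
pc = execute("add", R, MEM, 0, rs=1, rt=2, rd=3)      # R[3] = 12
pc = execute("sw", R, MEM, pc, rs=0, rt=3, imm=16)    # MEM[16] = 12
pc = execute("lw", R, MEM, pc, rs=0, rt=4, imm=16)    # R[4] = 12
```

Every non-branch instruction returns PC + 4, which mirrors the npc_sel=+4 setting in the table.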
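Summary of the Control Signals (2/2)
(op and func encodings: see Appendix A; x = we don't care :-))

            add      sub      ori      lw       sw       beq      jump
func        10 0000  10 0010  x        x        x        x        x
op          00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst      1        1        0        0        x        x        x
ALUSrc      0        0        1        1        1        0        x
MemtoReg    0        0        0        1        x        x        x
RegWrite    1        1        1        1        0        0        0
MemWrite    0        0        0        0        1        0        0
npcsel      0        0        0        0        0        1        ?
Jump        0        0        0        0        0        0        1
ExtOp       x        x        0        1        1        x        x
ALUctr<2:0> Add      Subtract Or       Add      Add      Subtract x

Instruction formats (bit positions 31..0):
R-type (add, sub):          op [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
I-type (ori, lw, sw, beq):  op [31:26] | rs [25:21] | rt [20:16] | immediate [15:0]
J-type (jump):              op [31:26] | target address [25:0]
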
Boolean Expressions for Controller
RegDst    = add + sub
ALUSrc    = ori + lw + sw
MemtoReg  = lw
RegWrite  = add + sub + ori + lw
MemWrite  = sw
npcsel    = beq
Jump      = jump
ExtOp     = lw + sw
ALUctr[0] = sub + beq   (assume ALUctr is 00 = ADD, 01 = SUB, 10 = OR)
ALUctr[1] = ori

Where:
rtype = ~op5 · ~op4 · ~op3 · ~op2 · ~op1 · ~op0
ori   = ~op5 · ~op4 ·  op3 ·  op2 · ~op1 ·  op0
lw    =  op5 · ~op4 · ~op3 · ~op2 ·  op1 ·  op0
sw    =  op5 · ~op4 ·  op3 · ~op2 ·  op1 ·  op0
beq   = ~op5 · ~op4 · ~op3 ·  op2 · ~op1 · ~op0
jump  = ~op5 · ~op4 · ~op3 · ~op2 ·  op1 · ~op0
add   = rtype · func5 · ~func4 · ~func3 · ~func2 · ~func1 · ~func0
sub   = rtype · func5 · ~func4 · ~func3 · ~func2 ·  func1 · ~func0

How do we implement this in gates?

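As a minimal sketch, the equations above can be written as a decode function: the per-instruction tests play the role of the AND terms, and the returned signals are the ORs of those terms. The function name `decode` and the dict return type are illustrative assumptions. The bit patterns read the minterms as op5..op0 and func5..func0, and don't-care outputs simply come out as 0 here.

```python
def decode(op, func=0):
    """Return the control signals for a 6-bit opcode (plus func for R-type)."""
    # "AND plane": one boolean per instruction, matching the minterms.
    rtype = op == 0b000000
    add = rtype and func == 0b100000
    sub = rtype and func == 0b100010
    ori = op == 0b001101
    lw = op == 0b100011
    sw = op == 0b101011
    beq = op == 0b000100
    jump = op == 0b000010

    # "OR plane": each control signal is an OR of instruction terms.
    return {
        "RegDst": int(add or sub),
        "ALUSrc": int(ori or lw or sw),
        "MemtoReg": int(lw),
        "RegWrite": int(add or sub or ori or lw),
        "MemWrite": int(sw),
        "npcsel": int(beq),
        "Jump": int(jump),
        "ExtOp": int(lw or sw),
        # ALUctr[1] = ori, ALUctr[0] = sub + beq (00 = ADD, 01 = SUB, 10 = OR)
        "ALUctr": (int(ori) << 1) | int(sub or beq),
    }


lw_signals = decode(0b100011)
sub_signals = decode(0b000000, 0b100010)
```

For example, `lw_signals` has ALUSrc, MemtoReg, RegWrite, and ExtOp set, while `sub_signals` has RegDst and RegWrite set with ALUctr equal to 1 (SUB).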
Controller Implementation
[Figure: opcode and func feed an "AND" logic block that produces one line per instruction (add, sub, ori, lw, sw, beq, jump); an "OR" logic block combines those lines into RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, npcsel, Jump, ExtOp, ALUctr[0], ALUctr[1]]

P&H Figure 4.7

Summary: Single-cycle Processor
Five steps to design a processor:
1. Analyze instruction set -> datapath requirements
2. Select set of datapath components & establish clock methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
5. Assemble the control logic
   - Formulate logic equations
   - Design circuits
[Figure: block diagram of Processor (Control, Datapath), Memory, Input, Output]

Single Cycle Performance
- Assume time for actions are 100ps for register read or write; 200ps for other events

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps

- What can we do to improve clock rate?
- Will this improve performance as well?
  - Want increased clock rate to mean faster programs

Gotta Do Laundry
- Alice, Bob, Carol, and Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away
  - Washer takes 30 minutes
  - Dryer takes 30 minutes
  - "Folder" takes 30 minutes
  - "Stasher" takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, each taking four back-to-back 30-minute steps, one load after another]
- Sequential laundry takes 8 hours for 4 loads

Pipelined Laundry
[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, each starting 30 minutes after the previous one so that their steps overlap]
- Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons (1/2)
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Multiple tasks operating simultaneously and independently using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce speedup: 2.3x (8/3.5) v. 4x (8/2) in this example
[Figure: pipelined laundry timeline for loads A, B, C, D]

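The fill-and-drain arithmetic can be checked with a short sketch. The function names are illustrative assumptions, and the model assumes four balanced 30-minute stages as in the laundry example.

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load runs all of its stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load takes `stages` steps; each later load finishes one step after."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(4, 4, 30)     # 480 minutes = 8 hours
pipe = pipelined_minutes(4, 4, 30)     # 210 minutes = 3.5 hours
speedup_4_loads = seq / pipe           # about 2.3x

# With many loads, fill and drain time matters less and the speedup
# approaches the number of stages.
speedup_many = sequential_minutes(1000, 4, 30) / pipelined_minutes(1000, 4, 30)
```

With only 4 loads the speedup is about 2.3x; with 1000 loads it is just under the 4x potential speedup.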
Pipelining Lessons (2/2)
- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
  - Pipeline rate limited by slowest pipeline stage
- Suppose Bob doesn't bother folding his laundry?
  - Idle steps in the pipeline don't enable others to fold
- Unbalanced lengths and idle stages reduce speedup
[Figure: pipelined laundry timeline for loads A, B, C, D]

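A small simulation makes the "slowest stage" point concrete. This is a sketch under stated assumptions: the function name is hypothetical, each stage handles one load at a time, and a finished load can wait in a basket until the next stage is free. The 20-minute washer and stasher times are taken from the question above.

```python
def pipeline_finish_times(stage_minutes, loads):
    """Return the time at which each load leaves the last stage."""
    prev = [0] * len(stage_minutes)   # when the previous load left each stage
    done = []
    for _ in range(loads):
        ready = 0                     # when this load is ready for the next stage
        cur = []
        for s, minutes in enumerate(stage_minutes):
            # A stage starts once the load has arrived AND the stage is free.
            start = max(ready, prev[s])
            ready = start + minutes
            cur.append(ready)
        done.append(ready)
        prev = cur
    return done


balanced = pipeline_finish_times([30, 30, 30, 30], 4)
faster_ends = pipeline_finish_times([20, 30, 30, 20], 4)
```

Under these assumptions the total time drops only from 210 to 190 minutes, and loads still complete 30 minutes apart, because the 30-minute dryer and folder set the rate.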
Execution Steps in MIPS Datapath
1) IFtch/IF: Instruction Fetch & Increment PC
2) Dcd/ID: Instruction Decode & Read Registers
3) Exec/EX:
   - Mem-ref: Calculate Address
   - Arith-log: Perform ALU Operation
4) Mem:
   - Load: Read Data from Memory
   - Store: Write Data to Memory
   - Memory is now synchronous
5) WB: Write Data Back to Register

Single Cycle Datapath
[Figure: single-cycle datapath (PC, +4, instruction memory, register file with rs/rt/rd, imm, ALU, data memory), divided into 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

Pipeline registers
- Need registers between stages
  - To hold information produced in previous cycle
[Figure: the same datapath (PC, +4, instruction memory, register file with rs/rt/rd, imm, ALU, data memory) with registers between 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

More Detailed Pipeline
IF for Load, Store, ...
ID for Load, Store, ...
EX for Load
MEM for Load
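WB for Load: Oops!
- Wrong register number! By the time the load reaches write-back, the register-number field coming from the IF/ID register belongs to a later instruction.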
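Corrected Datapath for Load
[Figure: the load's destination register number is carried along through the pipeline registers and supplied to the register file at write-back]
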
Pipelined Execution Representation
- Every instruction must take same number of steps, so some stages will idle
  - e.g. MEM stage for any arithmetic instruction

Time ->
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
                    IF  ID  EX  MEM WB

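A staggered diagram like the one above can be generated mechanically: instruction i occupies stage s in cycle i + s. The sketch below assumes an ideal pipeline with no stalls; the function name and the instruction labels i1..i6 are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one text row per instruction, one 4-character column per cycle."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage      # instruction i is in stage s at cycle i + s
        body = "".join(f"{c:<4}" for c in cells).rstrip()
        rows.append(f"{name:<6}" + body)
    return rows


rows = pipeline_diagram(["i1", "i2", "i3", "i4", "i5", "i6"])
for row in rows:
    print(row)
```

Six instructions through five stages need 6 + 5 - 1 = 10 cycles in this model, compared with 30 if each instruction had to finish before the next one started.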
Graphical Pipeline Diagrams
- Use datapath figure below to represent pipeline:
[Figure: datapath (PC, MUX, +4, instruction memory, register file with rs/rt/rd, imm, ALU, data memory) split into 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back; each stage is abbreviated by a symbol: IF = I$, ID = Reg, EX = ALU, Mem = D$, WB = Reg]

Graphical Pipeline Representation
- RegFile: left half is write, right half is read

Instr order   Time (clock cycles) ->
Load          I$  Reg ALU D$  Reg
Add               I$  Reg ALU D$  Reg
Store                 I$  Reg ALU D$  Reg
Sub                       I$  Reg ALU D$  Reg
Or                            I$  Reg ALU D$  Reg

Pipelining Performance (1/3)
- Use T_c ("time between completion of instructions") to measure speedup
  - T_c,pipelined >= T_c,single-cycle / (number of stages)
  - Equality only achieved if stages are balanced (i.e. take the same amount of time)
- If not balanced, speedup is reduced
- Speedup due to increased throughput
  - Latency for each instruction does not decrease

Pipelining Performance (2/3)
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps

- What is pipelined clock rate?
  - Compare pipelined datapath with single-cycle datapath

Pipelining Performance (3/3)
- Single-cycle: T_c = 800 ps, f = 1.25GHz
- Pipelined: T_c = 200 ps, f = 5GHz

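These two clock periods follow from the stage times, reading them as 100 ps for register read or write and 200 ps for the other stages. A minimal sketch, assuming the single-cycle clock must fit the slowest whole instruction and the pipelined clock must fit the slowest single stage, with register overhead ignored; the variable names are illustrative.

```python
# Stage times in ps: fetch, register read, ALU, memory access, register write.
# 0 marks a stage the instruction does not use.
STAGE_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}


def ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000 / period_ps


# Single-cycle: one clock must cover the longest instruction end to end.
single_cycle_tc = max(sum(times) for times in STAGE_PS.values())

# Pipelined: one clock must cover the longest individual stage.
pipelined_tc = max(max(times) for times in STAGE_PS.values())

speedup = single_cycle_tc / pipelined_tc
```

The speedup comes out as 4x rather than 5x for five stages, because the 100 ps register stages are padded out to the 200 ps clock.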
Clicker/Peer Instruction
- Logic in some stages takes 200ps and in some 100ps. Clk-Q delay is 30ps and setup-time is 20ps. What is the maximum clock frequency at which a pipelined design can operate?
  A: 10GHz
  B: 5GHz
  C: 6.7GHz
  D: 4.35GHz
  E: 4GHz

Project 3...
- Project 3.1 will be released in a couple of hours
- In project 3, you will build a CPU in Logisim
  - 3.1 is the ALU and register file
  - 3.2 is putting together the control logic

We Grossly Simplified The Project... Why?
- Last semester and last year, the project was effectively building a full MIPS
- Now it is a much smaller architecture with narrower words
- Why are we cheating you out of the experience?
  - We use Logisim for pedagogical reasons
  - Almost all design these days uses an "HDL" (Hardware Description Language) like VHDL or Verilog
  - In an HDL, doing a 32b, 32-register register file is no harder than doing a 16b, 8-register one
  - But in Logisim, it is at least 4x more work... And 4x more chance to make an error

Why Nick's Ph.D. Was An Incredibly Stupid Idea...
- My Ph.D. was on a highly pipelined FPGA architecture
  - FPGA -> Field Programmable Gate Array: basically programmable hardware
- The design was centered around being able to pipeline multiple independent tasks
- We will see on Wednesday how to handle "pipeline hazards" and "forwarding":
    add $s0, $s1, $s2
    add $s3, $s0, $s4
  - This is critical to get real performance gains
  - But my dissertation design didn't have this ability
- I also showed how you could use the existing registers in the FPGA to heavily pipeline it automatically

But pipelining is not free!
- Not only does pipelining not improve latency... It actually makes it worse!
  - Two sources:
    - Unbalanced pipeline stages
    - The setup & clk->q time for the pipeline registers
- Pipelining only independent tasks also can't "forward"
- So independent task pipelining is only about reducing cost
  - You can always just duplicate logic instead
- Latency is fundamental; independent task throughput can always be solved by throwing $$$ at the problem
- So I proved my Ph.D. design was no better than the conventional FPGA on throughput/$ and far far far worse on latency!
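Both latency costs can be seen in a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, the stage logic delays are the 200 ps / 100 ps figures used earlier in the lecture, and the 25 ps clk->q and 15 ps setup values are made-up illustration numbers, not measurements.

```python
def pipelined_period_ps(stage_logic_ps, clk_to_q_ps, setup_ps):
    """Clock period: slowest stage's logic plus the pipeline register overhead."""
    return max(stage_logic_ps) + clk_to_q_ps + setup_ps


def pipelined_latency_ps(stage_logic_ps, clk_to_q_ps, setup_ps):
    """Every stage takes a full clock period, even the short ones."""
    period = pipelined_period_ps(stage_logic_ps, clk_to_q_ps, setup_ps)
    return len(stage_logic_ps) * period


stage_logic = [200, 100, 200, 200, 100]   # ps of logic per stage
unpipelined_latency = sum(stage_logic)     # logic delay with no pipeline registers

# Hypothetical register timing, chosen only for illustration.
period = pipelined_period_ps(stage_logic, clk_to_q_ps=25, setup_ps=15)
latency = pipelined_latency_ps(stage_logic, clk_to_q_ps=25, setup_ps=15)
```

With these numbers one instruction needs 1200 ps through the pipeline versus 800 ps of raw logic: 200 ps is lost to padding the two short stages and another 200 ps to register overhead across the five stages.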