Simple Instruction Pipelining


1 Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology

2 Processor Performance Equation
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
- Instructions per program depends on source code, compiler technology, and ISA
- Microcoded DLX from last lecture had cycles per instruction (CPI) of around 7 minimum
- Time per cycle for microcoded DLX fixed by microcode cycle time: mostly ROM access + next µPC select logic
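As a rough illustration of the performance equation, the sketch below multiplies the three factors for two hypothetical machines. The instruction count and the 10 ns cycle time are made-up values, and holding the cycle time equal for both machines is an assumption made only to isolate the effect of CPI.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Time per program = instructions x cycles per instruction x time per cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical workload: 1 million instructions at a 10 ns cycle time.
microcoded = execution_time(1_000_000, cpi=7, cycle_time_ns=10)
single_cycle = execution_time(1_000_000, cpi=1, cycle_time_ns=10)

# With everything else held equal, the ratio of run times equals the ratio of CPIs.
print(microcoded / single_cycle)
```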

3 Pipelined DLX Asanovic/Devadas
To pipeline DLX:
- First build unpipelined DLX with CPI=1
- Next, add pipeline registers to reduce cycle time while maintaining CPI=1

4 A Simple Memory Model
[Figure: MAGIC RAM block with inputs Clock, WriteEnable, Address, WriteData and output ReadData]
Reads and writes are always completed in one cycle:
- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled
- the write address, data, and enable must be stable at the clock edge
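The behaviour of this idealized memory can be sketched in a few lines of Python. The class name `MagicRAM` and the method names are hypothetical; the model only captures "read any time, write on an enabled rising edge" and ignores setup and hold timing.

```python
class MagicRAM:
    """Idealized memory: combinational reads, writes at the rising clock edge."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, address):
        # Combinational read: returns the stored value at any time.
        return self.cells[address]

    def rising_edge(self, write_enable, address, write_data):
        # Inputs are assumed stable at the edge; the write happens only if enabled.
        if write_enable:
            self.cells[address] = write_data


ram = MagicRAM(16)
ram.rising_edge(write_enable=True, address=3, write_data=42)
print(ram.read(3))
ram.rising_edge(write_enable=False, address=3, write_data=99)
print(ram.read(3))  # a disabled write leaves the cell unchanged
```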

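5 Datapath for ALU Instructions
[Figure: datapath with PC, +4 adder (0x4), instruction memory, GPR register file (rs1, rs2, rd1, ws, wd, rd2), immediate extender, ALU with z output, and a control block driven by inst<31:26> (opcode) and inst<5:0> (func). Register fields come from inst<25:21>, inst<20:16>, inst<15:11>; the immediate from inst<15:0>. Control signals: RegWrite, RegDst (rf2 / rf3), ExtSel, OpSel, BSrc (Reg / Imm)]
Instruction formats:
- 0 | rf1 | rf2 | rf3 | 0 | func: rf3 ← (rf1) func (rf2)
- opcode | rf1 | rf2 | immediate: rf2 ← (rf1) op immediate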

6 Datapath for Memory Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
 - read-only program memory
 - read/write data memory
 - at some level the two memories have to be the same
Princeton style: the same (von Neumann's influence)
 - a Load or Store instruction requires accessing the memory more than once during its execution

7 Load/Store Instructions: Harvard-Style Datapath
[Figure: the ALU-instruction datapath extended with a data memory (address, wdata), a MemWrite control signal, and a WBSrc (ALU / Mem) mux in front of the register write port; base and disp label the address operands]
Instruction format: opcode | rf1 | rf2 | displacement
Addressing mode: (rf1) + displacement
- rf1 is the base register
- rf2 is the destination of a Load or the source for a Store

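8 Memory Hierarchy c.2002 Desktop & Small Server
[Figure: Proc with on-chip caches (I$, D$, L2), off-chip L3 cache (SRAM/eDRAM), interleaved banks of DRAM, hard disk]

Level                          | Access time     | Clock cycles | Capacity
I$ / D$                        | 0.5-2ns         | 2-3          | 8~64KB
L2                             | <10ns           | 5~[...]      | [...] MB
Off-chip L3 cache (SRAM/eDRAM) | <25ns           | 15~50        | 1~8MB
Interleaved banks of DRAM      | ~150ns          | 100~300      | 64M~1GB
Hard disk                      | ~10ms seek time | ~[...]       | [...]~100GB

Our memory model is a good approximation of the hierarchical memory system when we hit in the on-chip cache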

9 Conditional Branches
[Figure: Harvard-style datapath extended with a PCSrc ( ~j / j ) mux selecting the next PC and a zero? test on the register value feeding the control block. Other control signals as before: RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]

10 Register-Indirect Jumps
[Figure: same datapath with PCSrc widened to three choices ( ~j / j RInd / j PCR ), so that the next PC can come from a register. The slide asks: Jump & Link?]

11 Jump & Link
[Figure: same datapath with the PC routed to the register write port and register 31 available as write destination. WBSrc now selects ALU / Mem / PC and RegDst selects rf3 / rf2 / R31]

12 PC-Relative Jumps
[Figure: same datapath as for Jump & Link; the immediate extender now selects Ext16 / Ext26]
No new datapath required

13 Hardwired Control is Pure Combinational Logic: Unpipelined DLX
[Figure: combinational logic block with inputs op code and zero?, and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]

14 ALU Control & Immediate Extension
[Figure: Inst<5:0> (Func) and Inst<31:26> (Opcode) pass through a Decode Map to produce the ALU operation and the extension mode]
- OpSel ( Func, Op, +, 0? )
- ExtSel ( sext16, uext16, sext26, High16 )

15 Hardwired Control worksheet
[Figure: complete unpipelined datapath with every control point labelled]
Control signals and their possible values:
- PCSrc: PCR / RInd / ~j
- WBSrc: ALU / Mem / PC
- RegDst: rf2 / rf3 / R31
- ExtSel: sext16 / uext16 / sext26 / High16
- OpSel: Func / Op / + / 0?
- BSrc: Reg / Imm
- RegWrite, MemWrite

16 Hardwired Control Table
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rf2 / rf3 / R31
PCSrc1 = j / ~j
PCSrc2 = PCR / RInd

Opcode   | ExtSel | BSrc | OpSel | MemWrite | RegWrite | WBSrc | RegDst | PCSrc
ALU      | *      | Reg  | Func  | no       | yes      | ALU   | rf3    | ~j
ALUu     | *      | Reg  | Func  | no       | yes      | ALU   | rf3    | ~j
ALUi     | sext16 | Imm  | Op    | no       | yes      | ALU   | rf2    | ~j
ALUiu    | uext16 | Imm  | Op    | no       | yes      | ALU   | rf2    | ~j
LW       | sext16 | Imm  | +     | no       | yes      | Mem   | rf2    | ~j
SW       | sext16 | Imm  | +     | yes      | no       | *     | *      | ~j
BEQZ z=0 | sext16 | *    | 0?    | no       | no       | *     | *      | PCR
BEQZ z=1 | sext16 | *    | 0?    | no       | no       | *     | *      | ~j
J        | sext26 | *    | *     | no       | no       | *     | *      | PCR
JAL      | sext26 | *    | *     | no       | yes      | PC    | R31    | PCR
JR       | *      | *    | *     | no       | no       | *     | *      | RInd
JALR     | *      | *    | *     | no       | yes      | PC    | R31    | RInd
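Because hardwired control is pure combinational logic, it can be modelled as a stateless lookup from (opcode, z) to a tuple of control signals. The sketch below transcribes the control table into a Python dictionary. The `Control` type and `decode` function are hypothetical names, and the "ALU" row names and the "ALU" write-back source are assumptions recovered from context, since those labels are missing in the extracted slide text.

```python
from typing import NamedTuple


class Control(NamedTuple):
    ext_sel: str
    b_src: str
    op_sel: str
    mem_write: bool
    reg_write: bool
    wb_src: str
    reg_dst: str
    pc_src: str


X = "*"  # don't care

# One entry per row of the control table.
CONTROL_TABLE = {
    "ALU":      Control(X, "Reg", "Func", False, True, "ALU", "rf3", "~j"),
    "ALUu":     Control(X, "Reg", "Func", False, True, "ALU", "rf3", "~j"),
    "ALUi":     Control("sext16", "Imm", "Op", False, True, "ALU", "rf2", "~j"),
    "ALUiu":    Control("uext16", "Imm", "Op", False, True, "ALU", "rf2", "~j"),
    "LW":       Control("sext16", "Imm", "+", False, True, "Mem", "rf2", "~j"),
    "SW":       Control("sext16", "Imm", "+", True, False, X, X, "~j"),
    "BEQZ z=0": Control("sext16", X, "0?", False, False, X, X, "PCR"),
    "BEQZ z=1": Control("sext16", X, "0?", False, False, X, X, "~j"),
    "J":        Control("sext26", X, X, False, False, X, X, "PCR"),
    "JAL":      Control("sext26", X, X, False, True, "PC", "R31", "PCR"),
    "JR":       Control(X, X, X, False, False, X, X, "RInd"),
    "JALR":     Control(X, X, X, False, True, "PC", "R31", "RInd"),
}


def decode(opcode, z=None):
    """Pure function of its inputs: no state, just like combinational logic."""
    key = f"{opcode} z={z}" if opcode == "BEQZ" else opcode
    return CONTROL_TABLE[key]


print(decode("LW"))
print(decode("BEQZ", z=0).pc_src)
```

Only BEQZ consults the z input; every other opcode maps to a single row.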

17 Hardwired Unpipelined Machine
- Simple
- One instruction per cycle
Why wasn't this a popular machine style?

18 Unpipelined DLX
Clock period must be sufficiently long for all of the following steps to be completed:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$
At the rising edge of the following clock, the PC, the register file and the memory are updated

19 Pipelined DLX Datapath
[Figure: datapath with an IR register after the instruction memory, divided into fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} = t_{DM}$ (probably)
However, CPI will increase unless instructions are pipelined

20 An Ideal Pipeline
[Figure: chain of pipeline stages, starting with stage 1]
- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?

21 Pipelining History
- Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII)
  - Usually overlap fetch of next instruction with current execution
- IBM Stretch: first major supercomputer incorporating extensive pipelining, result bypassing, and branch prediction
  - project started in 1954, delivered in 1961
  - didn't meet initial performance goal of 100x faster with 10x faster circuits
  - up to 11 macroinstructions in pipeline at same time
  - microcode engine highly pipelined also (up to 6 microinstructions in pipeline at same time)
  - Stretch was origin of 8-bit byte and lower case characters, carried on into IBM 360

22 How to divide the datapath into stages
Suppose memory is significantly slower than other stages. In particular, suppose
- $t_{IM} = t_{DM} = 10$ units
- $t_{ALU} = 5$ units
- $t_{RF} = t_{RW} = 1$ unit
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

23 Minimizing Critical Path
[Figure: pipelined datapath divided into fetch phase, combined decode & Reg-fetch & execute phase, memory phase, and write-back phase]
$t_C > \max\{t_{IM}, t_{RF} + t_{ALU}, t_{DM}, t_{RW}\}$
Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase: this would increase the critical path by 10%

24 Maximum Speedup by Pipelining
For the 4-stage pipeline, given $t_{IM} = t_{DM} = 10$ units, $t_{ALU} = 5$ units, $t_{RF} = t_{RW} = 1$ unit:
- $t_C$ could be reduced from 27 units to 10 units, so speedup = 2.7
However, if $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$ units:
- The same 4-stage pipeline can reduce $t_C$ from 25 units to 10 units, so speedup = 2.5
But, since $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW}$, it is possible to achieve higher speedup with more stages in the pipeline:
- A 5-stage pipeline can reduce $t_C$ from 25 units to 5 units, so speedup = 5
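The speedup figures on this slide can be checked with a short calculation. The function names below are hypothetical. The 4-stage version assumes that decode/register-fetch and execute share one stage, as in the critical-path formula of the previous slide, and the missing delay symbol is assumed to be the ALU delay.

```python
def unpipelined_cycle(t_im, t_rf, t_alu, t_dm, t_rw):
    # One long cycle must cover every step in sequence.
    return t_im + t_rf + t_alu + t_dm + t_rw


def four_stage_cycle(t_im, t_rf, t_alu, t_dm, t_rw):
    # Decode/register-fetch and execute are combined into one stage.
    return max(t_im, t_rf + t_alu, t_dm, t_rw)


def five_stage_cycle(t_im, t_rf, t_alu, t_dm, t_rw):
    # Every step gets its own stage; the slowest one sets the clock.
    return max(t_im, t_rf, t_alu, t_dm, t_rw)


slow_memory = dict(t_im=10, t_rf=1, t_alu=5, t_dm=10, t_rw=1)
balanced = dict(t_im=5, t_rf=5, t_alu=5, t_dm=5, t_rw=5)

print(unpipelined_cycle(**slow_memory) / four_stage_cycle(**slow_memory))  # 27 / 10
print(unpipelined_cycle(**balanced) / four_stage_cycle(**balanced))        # 25 / 10
print(unpipelined_cycle(**balanced) / five_stage_cycle(**balanced))        # 25 / 5
```

These ratios compare clock periods only, which equals the speedup as long as CPI stays at 1.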

25 Technology Assumptions
We will assume
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALUs (at least for integers)
- Multiported register files (slower!)
This makes the following timing assumption valid:
$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$
A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design

26 5-Stage Pipelined Execution
[Figure: pipelined datapath divided into fetch phase (IF), decode & Reg-fetch phase (ID), execute phase (EX), memory phase (MA), and write-back phase (WB)]

time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
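The timing diagram follows a simple rule: in an ideal pipeline with no stalls, instruction i enters stage s at time t(i + s), counting both from zero. The sketch below generates the diagram from that rule. The function names are hypothetical, and the absence of hazards is an assumption.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {instruction number: {cycle: stage}} for an ideal pipeline with no stalls."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i + 1 is fetched in cycle i and advances one stage per cycle.
        schedule[i + 1] = {i + s: stage for s, stage in enumerate(stages)}
    return schedule


def render(schedule, stages=STAGES):
    """Format the schedule as a text timing diagram, one row per instruction."""
    total_cycles = len(schedule) + len(stages) - 1
    header = "time".ljust(14) + "".join(f"t{c}".ljust(5) for c in range(total_cycles))
    lines = [header.rstrip()]
    for instr, slots in schedule.items():
        row = "".join(slots.get(c, "").ljust(5) for c in range(total_cycles))
        lines.append((f"instruction{instr}".ljust(14) + row).rstrip())
    return "\n".join(lines)


print(render(pipeline_schedule(5)))
```

Reading the same schedule column by column instead of row by row shows which instruction occupies each stage in a given cycle.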

27 5-Stage Pipelined Execution: Resource Usage Diagram
[Figure: pipelined datapath divided into fetch phase (IF), decode & Reg-fetch phase (ID), execute phase (EX), memory phase (MA), and write-back phase (WB)]

time       t0   t1   t2   t3   t4   t5   t6   t7   ....
Resources
IF         I1   I2   I3   I4   I5
ID              I1   I2   I3   I4   I5
EX                   I1   I2   I3   I4   I5
MA                        I1   I2   I3   I4   I5
WB                             I1   I2   I3   I4   I5

28 Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with pipeline registers IR, A, B, Y, R, MD1, MD2 between the stages; annotated "not quite correct!"]

29 Pipelined Execution: Need for Several IRs
[Figure: the same pipelined datapath with a separate IR in each stage, so that each instruction's control information travels with it through registers A, B, Y, R, MD1, MD2]

30 IRs and Control points
[Figure: pipelined datapath with one IR per stage and the pipeline registers A, B, Y, R, MD1, MD2]
Are control points connected properly?
- Load/Store instructions
- ALU instructions

31 Pipelined DLX Datapath without jumps
[Figure: pipelined datapath with one IR per stage, each control signal driven from the IR of the stage where it is used: ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite]

32 How Instructions can Interact with each other in a Pipeline
- An instruction in the pipeline may need a resource being used by another instruction in the pipeline: structural hazard
- An instruction may produce data that is needed by a later instruction: data hazard
- In the extreme case, an instruction may determine the next instruction to be executed: control hazard (branches, interrupts, ...)
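To make the data-hazard case concrete, the sketch below scans a short instruction sequence for pairs in which a later instruction reads a register that an earlier one writes. The `Instr` type, the register names, and the `window` parameter are all hypothetical. How many following instructions can actually be affected depends on the pipeline design, so `window` is left to the caller.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]        # register written, or None
    sources: Tuple[str, ...]   # registers read


def find_data_hazards(program, window):
    """Return (producer index, consumer index, register) for each
    read-after-write pair at most `window` instructions apart."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].sources:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # register overwritten; later readers depend on the newer write
    return hazards


program = [
    Instr("ADD", "r1", ("r2", "r3")),
    Instr("SUB", "r4", ("r1", "r5")),
    Instr("LW", "r6", ("r1",)),
    Instr("SW", None, ("r7", "r8")),
]
print(find_data_hazards(program, window=2))
```

This only detects dependences; deciding whether to stall or otherwise resolve them is a separate step.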

33 Feedback to Resolve Hazards
[Figure: pipeline stages 1 to 4 with feedback signals FB 1 to FB 4]
- Controlling the pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)
- Feedback to previous stages is used to stall or kill instructions
