Computer Architectures


1 Computer Architectures
Pipelined instruction execution: hazards, stage balancing, super-scalar systems
Pavel Píša, Michal Štepanovský, Miroslav Šnorek
Main source of inspiration: Patterson
Czech Technical University in Prague, Faculty of Electrical Engineering
English version partially supported by: European Social Fund Prague & EU: We invest in your future.
AEB36APO Computer Architectures Ver..

2 Motivation: AMD Bulldozer 15h (FX, Opteron), 2011

3 Motivation: Intel Nehalem (Core i7), 2008

4 The goal of today's lecture
Convert/extend the CPU presented in lecture 2 into a pipelined CPU design.
The following instructions are considered for our CPU design: add, sub, and, or, slt, addi, lw, sw and beq.
Instruction formats (field width in bits, columns give bit positions):
Type  31:26      25:21   20:16   15:11   10:6      5:0
R     opcode(6)  rs(5)   rt(5)   rd(5)   shamt(5)  funct(6)
I     opcode(6)  rs(5)   rt(5)   immediate(16), bits 15:0
J     opcode(6)  address(26), bits 25:0
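The field layout of the three formats can be checked with a small decoder. This is only an illustrative sketch in Python, assuming the standard 32-bit MIPS field positions (opcode 31:26, rs 25:21, rt 20:16, rd 15:11, shamt 10:6, funct 5:0, immediate 15:0, address 25:0); the function name `decode` is hypothetical and not part of the lecture.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into all R/I/J fields.

    Every field is extracted regardless of the actual format; the
    control unit decides later which of them are meaningful.
    """
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31:26 (all formats)
        "rs": (word >> 21) & 0x1F,      # bits 25:21 (R, I)
        "rt": (word >> 16) & 0x1F,      # bits 20:16 (R, I)
        "rd": (word >> 11) & 0x1F,      # bits 15:11 (R)
        "shamt": (word >> 6) & 0x1F,    # bits 10:6  (R)
        "funct": word & 0x3F,           # bits 5:0   (R)
        "imm": word & 0xFFFF,           # bits 15:0  (I)
        "addr": word & 0x3FFFFFF,       # bits 25:0  (J)
    }


# add $s0, $s2, $s3 encoded by hand: opcode 0, rs=18, rt=19, rd=16, funct=0x20
fields = decode(0x02538020)
print(fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

In hardware these are plain wire selections from the instruction register, which is why the bit ranges reappear as labels throughout the datapath figures.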

5 Single cycle CPU together with memories (from lecture 2)
[Figure: single-cycle datapath. Control unit (inputs Opcode 31:26 and Funct 5:0; outputs MemToReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, RegWrite), PC, instruction memory, register file, sign extension, ALU and data memory.]

6 Single cycle CPU performance: $IPS = IC / T = IPC_{avg} \cdot f_{CLK}$ (from lecture 2)
What is the maximal possible frequency of this CPU? It is given by the latency of the critical path, which is lw in our case:
$T_c = t_{PC} + t_{Mem} + t_{RFread} + t_{ALU} + t_{Mem} + t_{Mux} + t_{RFsetup}$
[Figure: single-cycle datapath, as on the previous slide.]

7 Single cycle CPU throughput: $IPS = IC / T = IPC_{avg} \cdot f_{CLK}$ (from lecture 2)
$T_c = t_{PC} + t_{Mem} + t_{RFread} + t_{ALU} + t_{Mem} + t_{Mux} + t_{RFsetup}$
Consider the following parameters:
t_PC = 30 ns
t_Mem = 300 ns
t_RFread = 150 ns
t_ALU = 200 ns
t_Mux = 20 ns
t_RFsetup = 20 ns
Then T_c = 1020 ns --> f_CLK,max = 980 kHz, IPS = 980e3 = 980 000 instructions per second
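The arithmetic on this slide can be reproduced with a few lines of Python. The sketch assumes the delay values t_PC = 30 ns, t_Mem = 300 ns, t_RFread = 150 ns, t_ALU = 200 ns, t_Mux = 20 ns and t_RFsetup = 20 ns; the dictionary and function names are illustrative only.

```python
# Assumed component delays in nanoseconds (see the lead-in).
delays_ns = {
    "t_PC": 30,
    "t_Mem": 300,
    "t_RFread": 150,
    "t_ALU": 200,
    "t_Mux": 20,
    "t_RFsetup": 20,
}


def lw_critical_path_ns(d: dict) -> int:
    """Sum the delays along the lw path; memory is traversed twice
    (instruction fetch and data read)."""
    return (d["t_PC"] + d["t_Mem"] + d["t_RFread"] + d["t_ALU"]
            + d["t_Mem"] + d["t_Mux"] + d["t_RFsetup"])


t_c_ns = lw_critical_path_ns(delays_ns)  # cycle time of the single-cycle CPU
f_clk_khz = 1e6 / t_c_ns                 # 1 / ns expressed in kHz
print(t_c_ns, "ns ->", int(f_clk_khz), "kHz")
```

With IPC = 1 for a single-cycle design, the instruction rate equals the clock frequency.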

8 Pipelined instruction execution
Suppose that instruction execution can be divided into 5 stages: IF ID EX MEM WB
IF Instruction Fetch, ID Instruction Decode (and operand fetch), EX Execute, MEM Memory Access, WB Write Back
and $\tau = \max\{\tau_i\}_{i=1}^{k}$, where $\tau_i$ is the time required for signal propagation (propagation delay) through the i-th stage.
IF: set up the PC for the memory and fetch the instruction it points to. Update PC = PC + 4.
ID: decode the opcode and read the registers specified by the instruction, check them for equality (for a possible beq instruction), sign-extend the offset, and compute the branch target address for the branch case (that is, extend the offset and add it to the PC).
EX: execute the function / pass the register values through the ALU.
MEM: read/write main memory for load/store instructions.
WB: write the result into the RF for register-register instructions or loads (the result source is the ALU or the memory).

9 Instruction-level parallelism - pipelining
time -->
IF   I1  I2  I3  I4  I5  I6  I7  I8  I9  I10
ID       I1  I2  I3  I4  I5  I6  I7  I8  I9
EX           I1  I2  I3  I4  I5  I6  I7  I8
MEM              I1  I2  I3  I4  I5  I6  I7
ST                   I1  I2  I3  I4  I5  I6
The time to execute n instructions in the k-stage pipeline: $T_k = k\tau + (n-1)\tau$
Speedup: $S_k = \frac{T_1}{T_k} = \frac{nk\tau}{k\tau + (n-1)\tau}$, $\lim_{n \to \infty} S_k = k$
Prerequisite: the pipeline is optimally balanced and the circuit can be divided arbitrarily.
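The speedup formula can be explored numerically. This is a minimal sketch assuming an ideally balanced k-stage pipeline with stage delay tau and no stalls, so that n instructions take (k + n - 1) * tau; the function names are illustrative.

```python
def pipeline_time(n: int, k: int, tau: float) -> float:
    """Time for n instructions in an ideal k-stage pipeline:
    k*tau to fill the pipeline, then one instruction completes every tau."""
    return (k + (n - 1)) * tau


def speedup(n: int, k: int) -> float:
    """Speedup over non-pipelined execution, which needs n*k*tau."""
    return (n * k) / (k + n - 1)


for n in (1, 10, 100, 10_000):
    print(n, round(speedup(n, 5), 3))
```

For a single instruction the pipeline brings no gain, and the speedup approaches k only for long instruction sequences.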

10 Instruction-level parallelism - pipelining
Pipelining does not reduce the execution time of an individual instruction; the effect is just the opposite.
Hazards:
structural (resolved by duplication),
data (the result of data dependencies: RAW, WAR, WAW),
control (caused by instructions which change the PC).
Hazard prevention can result in a pipeline stall or a pipeline flush.
Remark: a deeper pipeline (more stages) means shorter sequences of gates in each stage, which allows the operating frequency of the processor to be increased. However, more stages also mean higher overhead: instructions have to be arranged better for the pipeline, and a stall or pipeline flush causes a more significant lag.

11 Instruction-level parallelism: semantics violations
Data hazard:
ADD R1,R2,R3
SUB R4,R1,R3
ADD writes a new value to R1, but SUB reads an incorrect (old) value from R1.
[Figure: flow of instructions and expected effect; each instruction passes through IF ID EX MEM WB, shifted by one cycle.]
Control hazard:
BEQZ R3, M
ADD R6,R1,R2
instruction 3
instruction 4
M: ADD R4,R6,R7
The condition and the new PC are evaluated first; only then is the PC set to the branch target. Should the instructions following the branch be fetched (and then executed)?
[Figure: IF ID EX MEM WB timing of the branch and the instructions following it.]

12 Non-pipelined execution (from lecture 2)
[Figure: single-cycle datapath with PC, instruction memory, register file, sign extension, ALU and data memory.]

13 Pipelined execution
[Figure: the datapath divided by inter-stage registers into pipeline stages (labels: Fetch, Decode, Execute, WriteBack). Signals carry a stage suffix, e.g. PCPlus4F, PCPlus4D, PCPlus4E, WriteDataE, WriteRegE, AluOutM, WriteDataM, WriteRegM, AluOutW, WriteRegW.]

14 Pipelined execution
[Figure: the pipelined datapath from the previous slide with the control unit added (inputs Opcode 31:26 and Funct 5:0; outputs MemToReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, RegWrite).]

15 The same design, drawn scaled down
[Figure: pipelined datapath with control signals carried through the inter-stage registers: RegWriteD/E/M/W, MemToRegD/E/M/W, MemWriteD/E/M, ALUControlD/E, ALUSrcD/E, RegDstD/E, BranchD/E, PCSrcM. Data signals: InstrD, SignImmD/E, RtD/E, RdD/E, SrcAE, SrcBE, WriteDataE/M, WriteRegE/M/W, ALUOutM/W, ReadDataW, ResultW, PCPlus4F/D, PCBranchD.]

16 Cause of the data hazards
The register file is accessed from two pipeline stages (Decode, WriteBack). The actual write occurs in the first half of the clock cycle and the read in the second half, so there is no hazard for the $s0 input operand of sub.
RAW (Read After Write) hazard: and (or) requires $s0 in cycle 3 (4).
How can such a hazard be prevented without degrading pipeline throughput?

17 Forwarding to avoid data hazards
If a result is available (computed) before the subsequent instruction(s) require the value, the data hazard can be avoided by forwarding.
A hazard is indicated when one of the source registers in the EX stage is the same as the destination register in the MEM or WB stage.
The register numbers are fed to the Hazard Unit.
The RegWrite signals from the MEM and WB stages have to be monitored as well, to check that the register number on the WriteReg lines is actually written (lw writes a register, sw does not, etc.).
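The forwarding decision described above can be sketched as a small function. This is a behavioural model in Python, not the lecture's actual hazard unit: the function name, the returned labels and the explicit check for register 0 (which always reads as zero in MIPS and so never needs forwarding) are assumptions made for illustration.

```python
def forward_select(src_reg_e: int,
                   write_reg_m: int, reg_write_m: bool,
                   write_reg_w: int, reg_write_w: bool) -> str:
    """Choose the source of one ALU operand in the EX stage.

    Returns "MEM" (forward ALUOutM), "WB" (forward ResultW) or
    "RF" (use the value read from the register file).
    """
    # The MEM stage holds the most recent result, so it has priority.
    if src_reg_e != 0 and reg_write_m and src_reg_e == write_reg_m:
        return "MEM"
    if src_reg_e != 0 and reg_write_w and src_reg_e == write_reg_w:
        return "WB"
    return "RF"


# add $s0,... is in MEM while the next instruction reads $s0 (register 16) in EX
print(forward_select(16, write_reg_m=16, reg_write_m=True,
                     write_reg_w=8, reg_write_w=True))
```

The same function would be evaluated twice per cycle, once with the Rs number and once with the Rt number of the instruction in EX, yielding the two multiplexer selects.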

18 CPU after previous design steps
[Figure: pipelined datapath with control unit, identical to slide 15.]

19 Data hazards solved by forwarding
[Figure: pipelined datapath extended with a Hazard unit. New signals: RsD/RsE (instruction bits 25:21) carried to the Execute stage, and ForwardAE, ForwardBE driven by the Hazard unit from RsE, RtE, WriteRegM, WriteRegW, RegWriteM and RegWriteW.]

20 Data hazard avoided by pipeline stall
If subsequent instructions require a result before it is available in the CPU, the pipeline has to be stalled (a stall state is inserted).
The stall is a means to resolve the hazard, but it reduces system throughput.
The pipeline stages preceding the one affected by the hazard are stalled until all results required by the subsequent instructions are available; the results are then forwarded to the sink which requires their value.

21 Data hazard avoided by pipeline stall
The stall is realized by holding the content of the inter-stage registers (gating their clocks or blocking their latch enable signals).
Results from the colliding stages have to be discarded: certain control signals in the CPU (RF or memory write enable, branch gating) are reset (held low).
Both are achieved by introducing control signals to hold and/or reset the inter-stage registers.
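The hold/reset signals can be illustrated for the typical load-use case, where an lw in the Execute stage is followed by an instruction that needs the loaded register. This Python sketch follows the common textbook formulation; the names `lw_stall`, `hazard_controls` and `FlushE` are assumptions (the figures label the corresponding register inputs EN and CLR).

```python
def lw_stall(rs_d: int, rt_d: int, rt_e: int, mem_to_reg_e: bool) -> bool:
    """True when the instruction in Decode reads the register that the
    load currently in Execute (MemToRegE set) is going to write (RtE)."""
    return bool(mem_to_reg_e and (rs_d == rt_e or rt_d == rt_e))


def hazard_controls(rs_d: int, rt_d: int, rt_e: int, mem_to_reg_e: bool) -> dict:
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {
        "StallF": stall,  # hold the PC (enable input deasserted)
        "StallD": stall,  # hold the Fetch/Decode register
        "FlushE": stall,  # clear the Decode/Execute register: inserts a bubble
    }


# lw $s0, 40($0) in Execute (RtE = 16), and $t0,$s0,$s1 in Decode (RsD = 16)
print(hazard_controls(rs_d=16, rt_d=17, rt_e=16, mem_to_reg_e=True))
```

After one stall cycle the loaded value is available in the WB stage and reaches the waiting instruction by forwarding.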

22 Processor design built so far
[Figure: pipelined datapath with forwarding and Hazard unit, identical to slide 19.]

23 Processor with data hazards avoided by stall
[Figure: the datapath from slide 22 extended with enable (EN) inputs on the PC and the Fetch/Decode register, a clear (CLR) input on the Decode/Execute register, and the Hazard unit outputs StallF and StallD in addition to ForwardAE and ForwardBE.]

24 Control hazards (branch and jump)
The result is not known before the 4th cycle. Why?

25 Control hazards: better to know the result earlier
If the result of the comparison can be evaluated in the 2nd cycle, the misprediction penalty can be reduced.
But processing the comparison at an earlier stage can introduce new RAW hazards!

26 Resolve control hazards by early evaluation and flush
[Figure: pipelined datapath with the branch resolved in the Decode stage: an equality comparator (EqualD) on the register file outputs, PCBranchD computed in Decode, PCSrcD selecting the next PC, and a CLR input on the Fetch/Decode register to flush the wrongly fetched instruction. Hazard unit signals: StallF, StallD, ForwardAE, ForwardBE.]

27 Resolve RAW hazards by forwarding or stalling
[Figure: the datapath from slide 26 with the producer stages annotated by the action the early branch comparison requires: "Stall", "Forward / Stall" and "No Action Required". The Hazard unit additionally receives BranchD and drives a forwarding select for the Decode stage (ForwardBD).]

28 We are finished: the pipelined processor is designed
[Figure: complete pipelined datapath with control unit, early branch evaluation in Decode (EqualD, PCSrcD, PCBranchD) and Hazard unit (StallF, StallD, BranchD, ForwardBD, ForwardAE, ForwardBE, RegWriteM, RegWriteW).]

29 Pipelined CPU performance: $IPS = IC / T = IPC_{avg} \cdot f_{CLK}$
What is the maximal acceptable frequency for the CPU? Which stage is the slowest one?
The cycle time is determined by the slowest stage. For our case: T_c = 300 ns --> 3333 kHz
If the pipeline fill overhead is neglected (i.e. no pipeline stalls and flushes are considered), then the ideal IPC = 1.
IPS = 3 333e3 = 3 333 000 instructions per second
Introducing the 5-stage pipeline increases performance (throughput) 3 333 000 / 980 000 = 3.4 times (considering IPC = 1)!
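The comparison with the single-cycle design can be recomputed directly. The sketch below assumes a single-cycle period of 1020 ns, a pipelined period of 300 ns (the slowest stage) and an ideal IPC of 1; the constant and function names are illustrative.

```python
T_SINGLE_NS = 1020  # assumed single-cycle period: the whole lw critical path
T_PIPE_NS = 300     # assumed pipelined period: the slowest stage


def ips(t_c_ns: float, ipc: float = 1.0) -> float:
    """Instructions per second for a given cycle time and average IPC."""
    return ipc * 1e9 / t_c_ns


gain = ips(T_PIPE_NS) / ips(T_SINGLE_NS)
print(round(ips(T_PIPE_NS) / 1e3), "thousand instructions per second")
print("throughput gain:", round(gain, 2))
```

The gain stays below the stage count of 5 because the stages are not balanced: the memory stages dominate the cycle time. Stalls and flushes lower the average IPC and reduce the gain further.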

30 What is the result of the design? Return to the non-pipelined CPU version
[Figure: non-pipelined datapath with control unit, instruction memory, register file, ALU and data memory.]

31 What is the result of the design? Return to the non-pipelined CPU version
[Figure: the same datapath with the instruction and data memories drawn outside the processor and the remaining logic grouped into the control unit (control path) and the ALU with registers (data path).]

32 What is the result of the design?
[Figure: block diagram. The processor consists of the control unit and the data path (ALU, registers). The instruction memory receives the PC as its address and returns the instruction. The data memory receives the address for data read/write, the data to write and the write enable, and returns the read data.]

33 CPU design result: pipelined version
[Figure: complete pipelined datapath with control unit and Hazard unit, as on slide 28.]

34 Pipelined CPU timing
The timing (AC) characteristics of a synchronous sequential circuit:
t_setup: input setup time
t_hold: input hold time
t_pd: combinational logic propagation delay
Signal integrity constraint for the setup time before the clock edge: $T_c \ge t_{pcq} + t_{pd} + t_{setup}$

35 Pipelined processor timing
Constraint for the setup time (taking the clock distribution jitter into account): $T_c \ge t_{pcq} + t_{pd} + t_{setup} + t_{skew}$
Clock distribution jitter becomes the limiting factor if it reaches or exceeds the value of t_pd (pipeline too deep / too many stages).
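The effect of the fixed overhead terms can be illustrated numerically. All delay values below are hypothetical and chosen only to show the trend; the function names are illustrative as well.

```python
def min_clock_period(t_pcq: float, t_pd: float,
                     t_setup: float, t_skew: float = 0.0) -> float:
    """Smallest clock period satisfying the setup-time constraint."""
    return t_pcq + t_pd + t_setup + t_skew


def useful_fraction(t_pcq: float, t_pd: float,
                    t_setup: float, t_skew: float = 0.0) -> float:
    """Share of the clock period spent in combinational logic."""
    return t_pd / min_clock_period(t_pcq, t_pd, t_setup, t_skew)


# Hypothetical values in ns: register overhead 0.05 + 0.05, skew 0.1
for t_pd in (1.0, 0.4, 0.2, 0.1):
    print(t_pd, round(useful_fraction(0.05, t_pd, 0.05, 0.1), 2))
```

As the logic per stage shrinks, the register and skew overhead takes a growing share of each cycle, which is why splitting the pipeline further eventually stops paying off.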

36 Pipeline stage balancing
Linear pipelining (applies to tree-based adders, multipliers, (unrolled) iterative dividers, ...).
Balancing: the goal is to divide the processing into N stages in such a way that the stage propagation delays are roughly the same.
The number of stages reflects the preference for performance (throughput) versus latency.

37 Superpipeline and beyond
A 5-stage pipeline that is not well balanced:
Units:  IM RF DM RF
Stages: IF ID EX MEM WB
A deeper pipeline is the result of decomposing stages into more stages:
Units:  IM RF DM RF
Stages: IF IS RF EX DF DS TC WB
This allows the CPU to work at higher frequencies but introduces many problems as well: complex forwarding, more pipeline stalls, and hazards that need to be resolved by complex logic.

38 Typical pipeline depths in today's CPUs
P5 (Pentium): 5
P6 (Pentium 3): 10
P6 (Pentium Pro): 14
NetBurst (Willamette, 180 nm) - Celeron, Pentium 4: 20
NetBurst (Northwood, 130 nm) - Celeron, Pentium 4, Pentium 4 HT: 20
NetBurst (Prescott, 90 nm) - Celeron D, Pentium 4, Pentium 4 HT, Pentium 4 ExEd: 31
NetBurst (Cedar Mill, 65 nm): 31
NetBurst (Presler, 65 nm) - Pentium D: 31
Core: 14
Bonnell: 16
K7 Architecture - Athlon: 10-15
K8 - Athlon 64, Sempron, Opteron, Turion 64: 12-17
ARM 8-9: 5
ARM 11: 8
Cortex A7: 8-10
Cortex A8: 13
Cortex A15: 15-25
The Optimum Pipeline Depth for a Microprocessor:

39 Branch stall discussion and delay slots
Reading and fetching from instruction memory is expensive, and the result of the condition evaluation in branch instructions (or, even worse, the target of indirect branch instructions) has to be known before the next instruction is fetched and executed.
The stall state is a waste of cycles. The options to use those cycle(s) are:
Start fetching and executing the instruction(s) following the branch, and flush/discard the results if it turns out that they should not have been executed.
Extend the above by adding a condition result/branch predictor (taken/not-taken) and a branch target cache (BTB).
Execute one or more instructions after the branch unconditionally in a (so-called) delay slot.
Unconditional execution of delay slots is common in many DSP (digital signal processor) and some RISC architectures (MIPS, SPARC).
