DLX computer. Electronic Computers M


RISC architectures
RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer).
In CISC architectures 10% of the instructions are used in 90% of cases: waste of silicon. Bottleneck: the bus.
Mid 80s, a new architecture: RISC.
Solution: reduction of the number and complexity of the instructions (fewer, simpler machine instructions).
Fixed instruction format (simpler instruction decoders).
Simpler control logic network.
Increased number of on-chip registers.
Reduction of bus/memory accesses.
More machine instructions are needed for a given job, but this is (in many cases) more than compensated, in terms of time, by the reduction of bus accesses.
CISC and RISC are each the best solution in different application fields.
Nowadays both architectures coexist in the same processor: analysis at the end of the course.
A simplified RISC architecture: DLX (implemented as a real processor in the 80s as R4000).

DLX (fixed) instruction format

R format
  bits    31-26     25-21   20-16   15-11   10-0
  width   6 bit     5 bit   5 bit   5 bit   11 bit
  field   Op-code   Ra      Rb      Rc      Op-code extension
Arithmetic or logic instructions (i.e. Ra <= Rb op Rc) and Set Condition between registers.

I format
  bits    31-26     25-21   20-16   15-0
  field   Op-code   Ra      Rb      Immediate operand or offset
Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand.
In Load and ALU instructions Ra = destination; in Store Ra = source. Rb supplies the register operand to the ALU in the immediate instructions.

J format
  bits    31-26     25-0
  field   Op-code   26 bit (PC relative) offset
Direct, unconditional control transfer (J and JAL).
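
The three field layouts above can be sketched as a small decoder. This is only an illustration: the function names and the opcode value used in the example are hypothetical, not part of the DLX definition given in these slides.

```python
def bits(word, hi, lo):
    """Return the field word[hi..lo] (both inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_r(word):
    """R format: Op-code | Ra | Rb | Rc | 11-bit op-code extension."""
    return {
        "opcode": bits(word, 31, 26),
        "ra": bits(word, 25, 21),
        "rb": bits(word, 20, 16),
        "rc": bits(word, 15, 11),
        "ext": bits(word, 10, 0),
    }


def decode_i(word):
    """I format: Op-code | Ra | Rb | 16-bit immediate operand or offset."""
    return {
        "opcode": bits(word, 31, 26),
        "ra": bits(word, 25, 21),
        "rb": bits(word, 20, 16),
        "imm": bits(word, 15, 0),
    }


def decode_j(word):
    """J format: Op-code | 26-bit (PC relative) offset."""
    return {"opcode": bits(word, 31, 26), "offset": bits(word, 25, 0)}


# Example word: opcode 0x23 is an arbitrary value chosen for illustration.
example = (0x23 << 26) | (5 << 21) | (6 << 16) | 0x0064
print(decode_i(example))
```

The immediate and offset fields are returned unsigned here; sign extension is a separate step.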

DLX non floating-point instructions
(31 x 32-bit registers R1..R31; R0 = 0 fixed; Ra and Rb can be any of the 32 registers)

Data Transfer
  LW, LB, LBU, LH, LHU, SW, SH, SB     Ra, offset(Rb)
  LHI                                   Ra, value

Arithmetic/Logic
  ADD, ADDU, SUB, SUBU, DIV, MULU, SLL, SHR, SLA, OR, XOR, AND             Ra,Rb,Rc
  ADDI, ADDUI, SUBI, SUBUI, DIVI, MULI, SLLI, SHRI, SLAI, ORI, XORI, ANDI   Ra,Rb,value

Control
  SETx     Ra,Rb,Rc
  SETIx    Ra,Rb,value
  BEQZ     Ra, offset   (offset added to [PC])
  BNEQZ    Ra, offset   (offset added to [PC])
  J        offset
  JR       Ra
  JL       offset       (offset added to [PC])
  JLR      Ra

No STACK registers.
N.B. Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE.
JL (via register or not) -> Jump and Link, saving PC in R31.
Offset is a value within the instruction.
Postfix I means «immediate» (value within the instruction).
Postfix A means «arithmetic» (sign extension).
Postfix U means «unsigned».
Value is the immediate within the instruction.

DLX ALU operations
Two data inputs, S1 and S2 (32 bit each), plus Controls; one 32-bit data output (OUT) plus Flags.
Operations:
  S1 + S2
  S1 - S2
  S1 and S2
  S1 or S2
  S1 exor S2
  Left shift of S1 by S2 positions
  Right shift of S1 by S2 positions
  Arithmetic right shift of S1 by S2 positions
  S1
  S2
  0
  1
Output flags: Zero, Negative (sign).
The ALU is a combinatorial circuit!
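
A minimal behavioural sketch of such an ALU follows. The operation names and the choice of using only the low 5 bits of S2 as shift amount are assumptions made for the example, not taken from the slide.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, s1, s2):
    """Combinatorial ALU model: returns (OUT, flags) for 32-bit inputs S1, S2."""
    s1 &= MASK32
    s2 &= MASK32
    shift = s2 & 31  # assumption: only the low 5 bits select the shift amount
    if op == "add":
        out = s1 + s2
    elif op == "sub":
        out = s1 - s2
    elif op == "and":
        out = s1 & s2
    elif op == "or":
        out = s1 | s2
    elif op == "xor":
        out = s1 ^ s2
    elif op == "sll":
        out = s1 << shift
    elif op == "srl":
        out = s1 >> shift
    elif op == "sra":
        out = to_signed(s1) >> shift  # Python's >> on a negative int keeps the sign
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    out &= MASK32
    flags = {"zero": out == 0, "negative": bool(out & 0x80000000)}
    return out, flags


print(alu("sub", 0, 1))  # result wraps to 0xFFFFFFFF and sets the negative flag
```

The function has no internal state, mirroring the fact that the ALU is combinatorial.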

Sequential DLX (abstract instruction execution)
INSTRUCTION FETCH: [REG INSTR] <= [PC] (the state is held until the memory is Ready)
INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra] ([X]: number of the destination register)
INSTRUCTION EXECUTION: one of Data transfer, ALU, Set, Jump, Branch
PC is the Program Counter, A and B are two internal scratchpad registers, REG INSTR is the register where the newly fetched instruction is stored. All these registers are unknown to the programmer.
This is a synchronous state diagram.

Example: LB (LOAD BYTE, format I)
  bits    31-26     25-21   20-16   15-0
  field   Op-code   Ra      Rb      offset
LB Ra, offset(Rb)
Fetch and decode: INSTR <= [PC]; [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num[Ra]
LOAD Byte, byte address computation: Addr <= [B] + (Instr15)^16 ## Instr15..0
  Instr15..0 is the instruction offset. Instruction bit 15 (sign) is extended 16 times to the left, because the address is always 32 bit (bit 31 = MSbit, bit 0 = LSbit).
  ## => JOIN operator
Load with sign extension: [Ra] <= ([Addr]7)^24 ## [Addr]7..0
Example: [Addr]7..0 = A7 H => (10100111) b; the sign-extended byte written in the register is FFFFFFA7 H.
Then: next instruction.
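
The sign extension and the address computation of LB can be checked with a short sketch. The helper names and the dictionary used as memory are hypothetical, introduced only for this example.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a 32-bit pattern (unsigned int)."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):                   # sign bit set?
        value |= 0xFFFFFFFF & ~((1 << bits) - 1)    # fill the upper bits with 1s
    return value


def load_byte(memory, base, offset16, signed=True):
    """LB / LBU sketch: Addr <= [B] + sign-extended offset, then extend the byte."""
    addr = (base + sign_extend(offset16, 16)) & 0xFFFFFFFF
    byte = memory[addr] & 0xFF
    return sign_extend(byte, 8) if signed else byte


memory = {0x1000: 0xA7}  # hypothetical one-byte memory image
print(hex(load_byte(memory, 0x1004, 0xFFFC)))         # 0xffffffa7 (LB)
print(hex(load_byte(memory, 0x1004, 0xFFFC, False)))  # 0xa7 (LBU)
```

The offset 0xFFFC is the 16-bit encoding of -4, so the access goes to 0x1004 - 4 = 0x1000.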

Sign extension - example with IR: (IR15)^16 ## IR15..0
[Figure: IR bits 15-0 pass through unchanged; tri-state devices driven from the Control Unit replicate IR15 onto bits 31-16 of the 32-bit output.]

Data transfer instructions (I format)
Addr <= [B] + (Instr15)^16 ## Instr15..0
Examples:
  LW  Ra, offset(Rb)
  LB  Ra, offset(Rb)
  LBU Ra, offset(Rb)   unsigned
  LHU Ra, offset(Rb)   unsigned
  SW  Ra, offset(Rb)
LB (byte, signed):         [Ra] <= ([Addr]7)^24 ## [Addr]7..0
LBU (byte, unsigned):      [Ra] <= (0)^24 ## [Addr]7..0
LH (half word, signed):    [Ra] <= ([Addr]15)^16 ## [Addr]15..0
LHU (half word, unsigned): [Ra] <= (0)^16 ## [Addr]15..0
LW:                        [Ra] <= [Addr]
SW:                        [Addr] <= [A]

ALU instructions: examples
Second operand:
  Register (format R):   [T] <= [Rc]
  Immediate (format I):  [T] <= (Instr15)^16 ## Instr15..0
(T is a hidden register, unknown to the programmer, storing temporary data. The register content is treated as signed in arithmetic operations.)
  ADD   [Ra] <= [Rb] + [T]
  SUB   [Ra] <= [Rb] - [T]
  AND   [Ra] <= [Rb] and [T]
  OR    [Ra] <= [Rb] or [T]
  XOR   [Ra] <= [Rb] xor [T]
Instruction forms: ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value
The same scheme applies to the shifts etc. A and B are generic registers (Ra, Rb).

SET instructions (see branch)
Second operand:
  Register (format R):   [T] <= [Rc]
  Immediate (format I):  [T] <= (Instr15)^16 ## Instr15..0
(T is a hidden register, unknown to the programmer, storing temporary data.)
Example: SLT Ra,Rb,Rc sets Ra = 1 if Rb is less than Rc, otherwise Ra = 0. Register contents are treated as signed.
  SEQ   [Ra] = 1 if [Rb] = [T]
  SNE   [Ra] = 1 if [Rb] != [T]
  SLT   [Ra] = 1 if [Rb] < [T]
  SGT   [Ra] = 1 if [Rb] > [T]
  SLE   [Ra] = 1 if [Rb] <= [T]
  SGE   [Ra] = 1 if [Rb] >= [T]
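
The six set conditions can be sketched as follows, treating register contents as signed 32-bit values. The function and table names are hypothetical.

```python
def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


# One comparison per postfix x of SETx / SETIx.
CONDITIONS = {
    "EQ": lambda a, b: a == b,
    "NE": lambda a, b: a != b,
    "LT": lambda a, b: a < b,
    "GT": lambda a, b: a > b,
    "LE": lambda a, b: a <= b,
    "GE": lambda a, b: a >= b,
}


def set_condition(cond, rb, t):
    """Return the value written to Ra: 1 if [Rb] cond [T] holds, otherwise 0."""
    return 1 if CONDITIONS[cond](to_signed32(rb), to_signed32(t)) else 0


# 0xFFFFFFFF is -1 when read as signed, so it is less than 1.
print(set_condition("LT", 0xFFFFFFFF, 1))
```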

JUMP instructions
  J offset     (jump address)                format J
  JL offset    (jump and link address)       format J
  JR Ra        (jump register)               format I
  JLR Ra       (jump and link register)      format I
States of the diagram: JMP, JAL (format J); JR, JALR (format I).
JAL and JALR, for saving [PC] in R31: first [T] <= [PC]
JMP and JAL: [PC] <= [PC] + (Instr25)^6 ## Instr25..0
JR and JALR: [PC] <= [Ra]
JAL and JALR, last step: [R31] <= [T]

Branch instructions (format I)
BEQZ: the branch is taken if [Ra] = 0. BNEZ: the branch is taken if [Ra] != 0.
  YES: [PC] <= [PC] + (Instr15)^16 ## Instr15..0, then INIT
  NO:  INIT
Ex. BNEQZ R5, 100: jump to PC+100 if R5 is not equal to 0.

The Pipelining Principle
Pipelining is the main basic technique used for speeding up a CPU. The key idea of pipelining is general, and it is currently applied in several industrial fields (production lines, oil pipelines, etc.).
A system S must operate N times on a task Ai, producing result Ri:
  A1, A2, A3 ... AN  ->  S  ->  R1, R2, R3 ... RN
Latency: time elapsing between the beginning and the end of task A (TA).
Throughput: frequency of task completion.

The Pipelining Principle
1) Sequential system: a new instruction starts when the previous instruction has finished.
     A1 | A2 | A3 | ... | An  -> t
   Latency (execution time of a single instruction) = TAn, where An is the n-th instruction. Execution times may differ.
2) Pipelined system: instructions are subdivided into stages, each stage lasting one n-th (1/4 in this example) of the entire instruction time. The stages of successive instructions overlap.
     A = P1 P2 P3 P4  -> t
     S = S1 S2 S3 S4   (Si: pipeline stage)

The Pipelining Principle
  A1   P1 P2 P3 P4
  A2      P1 P2 P3 P4
  A3         P1 P2 P3 P4
  A4            P1 P2 P3 P4
  ...
  An                  P1 P2 P3 P4   -> t
TP: pipeline cycle (ideally one clock), the duration of each Pi.
In each cycle one instruction terminates: in the figure A1 terminates at tx, A2 terminates at ty in the next cycle, etc.
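
The effect of the overlap on total execution time can be sketched as below, under the idealised assumptions that every stage takes exactly one pipeline cycle and that no stall ever occurs. The function names are hypothetical.

```python
def sequential_cycles(n_instr, n_stages):
    """Each instruction runs all its stages before the next one starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages):
    """The first instruction needs n_stages cycles; then one completes per cycle."""
    if n_instr == 0:
        return 0
    return n_stages + n_instr - 1


def speedup(n_instr, n_stages):
    """Ratio between sequential and pipelined cycle counts."""
    return sequential_cycles(n_instr, n_stages) / pipelined_cycles(n_instr, n_stages)


print(pipelined_cycles(4, 4))  # 7 cycles for the four instructions of the figure
print(speedup(1000, 5))        # approaches 5 as the instruction count grows
```

For long instruction sequences the speedup tends to the number of stages, which is why the throughput improves even though the latency of a single instruction does not.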

Typical instruction stages: IF ID EX MEM WB
  IF    Instruction fetch (from memory)
  ID    Instruction decode
  EX    Instruction execution (ALU)
  MEM   Data memory access (if needed; register instructions do not need it)
  WB    Write-back (if needed; jumps do not need it)
N.B. The execution time (latency) of all instructions must be the same, in order to maintain the order of the results. Some stages are not used by some instructions (the stage is a NOP for them), e.g. the MEM stage for register operations.

Pipelining of a CPU (DLX)
Instruction sequence: I1, I2, I3 ... IN
Instruction j: IF | ID | EX | MEM | WB  -> t
CPU (datapath): the combinatorial circuits IF, ID, EX, MEM, WB are separated by the pipeline registers (D flip-flops) IF/ID, ID/EX, EX/MEM, MEM/WB.
Pipeline cycle = clock cycle = delay of the slowest stage.
Clocks Per Instruction (CPI) = 1 (ideally!)

DLX Pipeline
  Instr i     IF  ID  EX  MEM WB
  Instr i+1       IF  ID  EX  MEM WB
  Instr i+2           IF  ID  EX  MEM WB
  Instr i+3               IF  ID  EX  MEM WB
  Instr i+4                   IF  ID  EX  MEM WB
CPI (ideally) = 1
Overhead introduced by the pipeline registers: Tclk = Td + TP + Tsu
  Tclk  clock cycle
  Td    switching delay of the input stage register
  TP    delay of the slowest combinatorial stage
  Tsu   set-up time of the output stage register
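
As a worked instance of Tclk = Td + TP + Tsu, the sketch below computes the clock period from a list of stage delays. All numeric values are invented for the example (in picoseconds) and do not describe any real implementation.

```python
def clock_period(stage_delays, t_d, t_su):
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest stage."""
    t_p = max(stage_delays)
    return t_d + t_p + t_su


# Hypothetical delays in ps for IF, ID, EX, MEM, WB.
stages_ps = [800, 600, 1000, 900, 500]
t_clk = clock_period(stages_ps, t_d=50, t_su=30)
print(t_clk)                    # 1080 ps
print(sum(stages_ps) / t_clk)   # speedup over an unpipelined datapath, below 5
```

With these numbers the register overhead and the unbalanced stages keep the gain well under the ideal factor of 5.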

[Figure: a combinatorial circuit with delay TP placed between two D registers. Td: switching delay of the input stage register; TP: delay of the slowest combinatorial stage; Tsu: set-up time of the output stage register.]

Pipeline implementation requirements
Each stage is active at each clock cycle.
The PC is incremented in the IF stage, so an ADDER must be introduced there (PC <= PC + 4, one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), therefore a 30-bit register only (a programmable counter, loadable for jumps) is used, incremented by 1 at each clock cycle. PC bits 31-2 are held in the counter; bits 1-0 are always 0.
Two Memory Data Registers are required (referred to as LMDR and SMDR). In fact, when a LOAD is immediately followed by a STORE the WB and MEM stages overlap, and two data are therefore waiting to be written (one to the memory, the other to a register of the RF).
In each clock cycle up to 2 memory accesses may have to be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. Harvard architecture.
The CPU clock is determined by the slowest stage.
Pipeline registers store both data and control information (distributed control unit).

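DLX Pipelined Datapath
[Figure: datapath with the stages IF, ID, EX, MEM, WB separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.
 IF: PC (actually a programmable counter), adder (+4), instruction memory (INSTR MEM); the PC is reloaded if jump.
 ID: decoder (DEC); register file RF read through Ra, Rb, Rc; sign extension (SE) for operations with immediates; Num[Ra]: number of the destination register (1-31) in case of LOAD and ALU instructions.
 EX: ALU, also used for computing the new PC value when branching; =0? test for Branch; for Set Condition also <0 and >0 (it acts on the output).
 MEM: data memory (DATA MEM), MDR.
 WB: data (from register, from memory, or from the PC for link) written to the RF; JL and JLR save the PC in R31.]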

ID stage (N.B. stage layout different from previous slide!)
[Figure: logic between IF/ID (IR, PC) and ID/EX.
 IR31-26 (Opcode) and IR10-0 (R instr.) feed the decoder DEC; its output is information travelling with the instruction.
 IR25-21, IR20-16 and IR15-11 address the RF as Ra, Rb, Rc, whose contents go to A, B, C.
 IR15-0 (Offset/Immediate; bits 15-11 as destination register in R instructions): sign extension replicates IR15 on bits 31-16 (Immed./Branch).
 IR25-16 together with IR15-0 form the 26-bit field of J and JL: sign extension replicates IR25 on bits 31-26 (Jump; Jump and Link).
 PC31-0 is passed on for JL and JLR.
 The data to be written and the number of the destination register come back from the WB stage.]

DLX Pipelined Datapath
SMDR => Store Memory Data Register
LMDR => Load Memory Data Register
IRi => Instruction Register i
X: computed data, memory address or branch address
Y: computed data from the previous stage
[Figure: the datapath of the stages IF, ID, EX, MEM, WB with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. The copies PC1..PC4 and IR1..IR4 travel with the instruction (PC saved in R31 for JL and JLR). EX contains the ALU, the =0? test for Branch, the COND register and Z; for Set Condition also <0 and >0 (it acts on the output). SMDR and LMDR sit around the data memory; the destination register number reaches the RF from the last stage.]

Pipelined execution of an ALU instruction
The result of each stage is sampled at the end of its cycle. The decoded opcode travels through all stages.
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    Z <= A op B  or  Z <= A op [(IR2_15)^16 ## IR2_15..0]; [IR3 <= IR2]; [PC3 <= PC2]
  MEM   Y <= Z (temporary storage for WB); [IR4 <= IR3]; [PC4 <= PC3]
  WB    Ra <= Y
NOTE: IRi bits are dropped stage by stage when no longer needed, for all instructions. Why? JAL, JALR!!

Pipelined execution of a MEM instruction
The decoded opcode travels through all stages.
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    MAR <= B op (IR2_15)^16 ## IR2_15..0; SMDR <= A; [IR3 <= IR2]; [PC3 <= PC2]
  MEM   LMDR <= [MAR] (if LOAD)  or  [MAR] <= SMDR (if STORE); [IR4 <= IR3]; [PC4 <= PC3]
  WB    Ra <= LMDR (if LOAD) [sign ext.]

Pipelined execution of a BRANCH instruction (normally after a SCn instruction, see later)
The decoded opcode travels through all stages.
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    Z <= PC2 op (IR2_15)^16 ## IR2_15..0 (computed new PC address); Cond <= A op 0 (branch on Reg A value, 0/1); [IR3 <= IR2]; [PC3 <= PC2]
  MEM   if (Cond) PC <= Z; [IR4 <= IR3]; [PC4 <= PC3]
  WB    (NOP)
The new value is in the PC at the end of the MEM cycle. When the branch is taken, 3 new unwanted instructions have already started.
X: BTA (BRANCH TARGET ADDRESS)

Pipelined execution of a JR instruction
The decoded opcode travels through all stages.
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    Z <= A (new PC address); [IR3 <= IR2]; [PC3 <= PC2]
  MEM   PC <= Z; [IR4 <= IR3]; [PC4 <= PC3]
  WB    (NOP)
The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.
Which would be the stage sequence for a J instruction?

Pipelined execution of a JL or JLR instruction
The decoded opcode travels through all stages.
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    Z <= A (if JLR)  or  Z <= PC2 + (IR25)^6 ## IR25..0 (if JL); PC3 <= PC2; [IR3 <= IR2]
  MEM   PC <= Z; PC4 <= PC3; [IR4 <= IR3]
  WB    R31 <= PC4
In this case the PCi values are used.
NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.
The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.

Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3)?
  IF    IR <= [PC]; PC <= PC + 4; PC1 <= PC + 4
  ID    A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX    ???
  MEM   ???
  WB    ???

Pipeline Hazards
A hazard occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.
Structural hazards: the same resource is used by two different pipeline stages; the instructions currently in those stages cannot be executed simultaneously.
Data hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
Control hazards: instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled (pipeline stall or pipeline bubbling), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

Hazards and stalls
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.
          Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7
  I i-3   IF   ID   EX   MEM  WB
  I i-2        IF   ID   EX   MEM  WB
  I i-1             IF   ID   EX   MEM  WB
  I i     fetched in Clk4, three stalls (S S S) in ID, WB in Clk11
  I i+1   fetched in Clk5, three stalls (S S S) in IF, WB in Clk12
Stall: the clock signal for Ii, Ii+1 etc. is blocked for three periods.
  Ti = 8 * CLK = (5 + 3) * CLK
Normally the three stalled instructions are transformed into NOPs to avoid blocking the clock.
  Ti = 5 * (1 + 3/5) * CLK, where the 3/5 term accounts for the instruction stalls.

Forwarding
Data are read from registers in the ID stage.
                              Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8 Clk9
  ADD R3, R1, R4              IF   ID   EX   MEM  WB
  SUB R7, R3, R5   hazard          IF   ID   EX   MEM  WB
  OR  R1, R3, R5   hazard               IF   ID   EX   MEM  WB
  LW  R6, 100(R3)  hazard                    IF   ID   EX   MEM  WB
  AND R9, R5, R3   no hazard                      IF   ID   EX   MEM  WB
For the LW too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!).
Forwarding allows eliminating almost all RAW hazards of the pipeline without stalling it. (NOTE: in DLX, registers are modified only in the WB stage.)
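
The hazards marked above can be found mechanically. The sketch below assumes, as in the figure, that a register is read in ID and written at the end of WB, so that a consumer up to three instructions after its producer sees a stale value. The tuple encoding of the instructions and the function name are hypothetical.

```python
def raw_hazards(program, window=3):
    """Return (producer_index, consumer_index, register) for each RAW hazard.

    program: list of (mnemonic, destination, sources) tuples.
    window:  how many following instructions read too early without forwarding.
    """
    hazards = []
    for i, (_, _, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest = program[j][1]
            # R0 is fixed at 0, so writes to it never create a dependency.
            if dest is not None and dest != "R0" and dest in sources:
                hazards.append((j, i, dest))
    return hazards


program = [
    ("ADD", "R3", ("R1", "R4")),
    ("SUB", "R7", ("R3", "R5")),
    ("OR",  "R1", ("R3", "R5")),
    ("LW",  "R6", ("R3",)),
    ("AND", "R9", ("R5", "R3")),
]
print(raw_hazards(program))  # [(0, 1, 'R3'), (0, 2, 'R3'), (0, 3, 'R3')]
```

The AND is four instructions after the ADD, so it reads R3 after the write-back and is not reported.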

Forward implementation
A, B, C: source registers (1-31). Rd1, Rd2: destination registers (1-31).
FU (Forwarding Unit): combinatorial!! comparison between A, B, C and Rd1, Rd2, and the opcodes (Rd1/OpCode, Rd2/OpCode).
Control: IF opcode and comparison of Rd with the Ra, Rb and Rc numbers.
RF Bypass: often performed inside the RF; it allows the anticipation of the register on ID/EX.
[Figure: the forwarding paths FD1, FD2, FD3 feed the ALU inputs across the pipeline registers ID/EX, EX/MEM, MEM/WB; IR3 and IR4 supply the destination register numbers and opcodes to the FU; PC, Mem, ALU and Offset are the other sources.]

Forward Unit implementation
[Flowchart selecting among the forwarding paths FD1, FD2, FD3. The decisions it contains are:
 Does the instruction in the MEM stage want to write a register?
 Does the instruction in the WB stage want to write a register?
 Will the instruction in the MEM or WB stage write a register whose number is identical to the Ra or Rb or Rc number?
 Is the destination register number identical to the Ra or Rb or Rc number?
 Is the destination register number identical to the Ra or Rb or Rc number and different from the register which will be written by the MEM stage?
 Does the fetched instruction need the register in the MEM stage?
 Does the fetched instruction need the register being written by the WB stage?
 Each Yes/No outcome enables or disables one of FD1, FD2, FD3.]
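
The core of these decisions can be sketched for a single source operand as below. This is a simplified reading of the flowchart, not a reconstruction of it: the mapping to FD1, FD2, FD3 is not recoverable from the slide, so the sketch returns descriptive labels, and all names are hypothetical.

```python
def forward_source(src, mem_dest, wb_dest):
    """Choose where one ALU source operand should come from.

    src:      register number read by the instruction (Ra, Rb or Rc)
    mem_dest: register the instruction in the MEM stage will write, or None
    wb_dest:  register the instruction in the WB stage will write, or None
    """
    if src == 0:
        return "RF"        # R0 is fixed at 0, never forwarded
    if mem_dest is not None and mem_dest == src:
        return "EX/MEM"    # most recent producer wins
    if wb_dest is not None and wb_dest == src:
        return "MEM/WB"    # older producer, used only if MEM does not match
    return "RF"            # no dependency: value read from the register file


print(forward_source(3, mem_dest=3, wb_dest=3))     # EX/MEM
print(forward_source(3, mem_dest=None, wb_dest=3))  # MEM/WB
```

Checking the MEM stage first corresponds to the condition "different from the register which will be written by the MEM stage" applied to the WB path.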

Data hazard due to LOAD instructions
  LW  R1,32(R6)   IF  ID  EX  MEM WB
  ADD R4,R1,R7        IF  ID  EX  MEM
  SUB R5,R1,R8            IF  ID  EX
  AND R6,R1,R7                IF  ID
NOTE: the data required by the ADD is available only at the end of the MEM stage of the LW. This hazard cannot be eliminated by forwarding (unless there is an additional input in the [...]s between memory and ALU: delays!).
The pipeline needs to be stalled: the ADD already started is transformed into a NOP and the ADD is fetched again (PC <= PC - 4, re-fetch), followed by SUB and AND:
  LW  R1,32(R6)               IF ID EX MEM WB
  ADD R4,R1,R7 (-> NOP)
  ADD R4,R1,R7 (re-fetched)   IF ID EX MEM
  SUB R5,R1,R8                IF ID EX
  AND R6,R1,R7                IF ID
From the end of the MEM stage of the LW onwards: standard forwarding.

Delayed load
In many RISC CPUs the special hazard associated with the LOAD instruction (which would in any case lead to a stall) is not handled by stalling the pipeline but by software, through the compiler (delayed load).
  LOAD instruction
  delay slot
  Next instruction
The compiler tries to fill the delay slot with a useful instruction (worst case: NOP).
In this example R3 is needed by the ADD instruction while it is still being read from memory [instruction LW R3, 10(R4)]:
  Original             Reordered by the compiler
  LW  R1,32(R6)        LW  R1,32(R6)
  LW  R3,10(R4)        LW  R3,10(R4)
  ADD R5,R1,R3         LW  R6,20(R7)
  LW  R6,20(R7)        ADD R5,R1,R3
  LW  R8,40(R9)        LW  R8,40(R9)
Please notice that in any case a hardware forwarding network is required.
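
The effect of the reordering can be checked with a small load-use detector: it reports every LOAD whose destination is read by the instruction immediately following it. The tuple encoding and the function name are hypothetical.

```python
def load_use_hazards(program):
    """Indices of LW instructions whose result is read by the very next instruction."""
    return [
        i
        for i in range(len(program) - 1)
        if program[i][0] == "LW" and program[i][1] in program[i + 1][2]
    ]


original = [
    ("LW",  "R1", ("R6",)),
    ("LW",  "R3", ("R4",)),
    ("ADD", "R5", ("R1", "R3")),
    ("LW",  "R6", ("R7",)),
    ("LW",  "R8", ("R9",)),
]
# Same instructions with LW R6 moved into the delay slot of LW R3.
reordered = [original[0], original[1], original[3], original[2], original[4]]

print(load_use_hazards(original))   # [1]: ADD reads R3 right after LW R3
print(load_use_hazards(reordered))  # []
```

The move is legal here because the ADD neither reads nor writes R6 and the moved LW does not read R5.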

Control Hazards
  PC      BEQZ R4, 200
  PC+4    SUB R7, R3, R5
  PC+8    OR R1, R3, R5
  PC+12   LW R6, 100(R8)
Next instruction address:
  R4 = 0:  Branch Target Address PC+4+200 (BTA) (taken)
  R4 != 0: PC+4 (not taken)
                     Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8
  BEQZ R4, 200       IF   ID   EX   MEM  WB
  SUB R7, R3, R5          IF   ID   EX   MEM  WB
  OR R1, R3, R5                IF   ID   EX   MEM  WB
  LW R6, 100(R8)                    IF   ID   EX   MEM  WB
  AND R9, R5, R3                         IF   ID   EX   MEM  WB
EX of BEQZ: new computed PC value (ALU out).
One clock after: new value in PC (the new value must be clocked into the PC).
IF of the last instruction: fetch with the new PC.

DLX Branch or JMP
Detailed datapath slide: see DLX Pipelined Datapath. Here we assume that the JMP instruction (BEQZ R4, 200) is the I-th instruction.
[Figure: the pipelined datapath with the JMP in the MEM stage, JMP I+1 in EX, JMP I+2 in ID and the following instruction being fetched in IF; the new PC is fed back from Z to the PC.]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of Z, the required stalls would be only two: slower clock!

Handling the Control Hazards
Example: BEQZ R4,200 followed by SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8).
Always Stall (a three-clock block is propagated):
  BEQZ R4, 200: IF ID EX MEM WB; the following slots become NOP NOP NOP.
  During the IF of the next instruction the previous instruction (BEQZ) has not yet been decoded. Real situation: the IF is repeated (PC <= PC - 4) and stalled (S) until the branch completes.
  The new value of PC is computed, then sampled by the PC in the following clock; after that the fetch takes place at the new PC.
Predict Not Taken:
  SUB, OR and LW are fetched normally after the BEQZ.
  At branch completion, if the branch is taken: flush. They become NOPs. No data has been written yet: no problem, because no instruction has reached the WB stage.

Stalls with jumps (1/3)
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.
[Figure: the pipelined datapath (IF, ID, EX, MEM, WB; registers IF/ID, ID/EX, EX/MEM, MEM/WB) with a forced NOP, active if jump, at the input of IF/ID, ID/EX and EX/MEM.]

Stalls with jump (2/3)
NOTE: in this case the jump condition detection and the new PC value are input to the [...] in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.
[Figure: the pipelined datapath with a forced NOP when jump, active if jump, at the input of IF/ID and ID/EX.]

Stalls with jump (3/3)
NOTE: in this case the jump condition and the new PC act on the [...] in the same period in which the condition is detected. Very slow clock solution!
A NOP MUST replace the unwanted instruction already started.
[Figure: the pipelined datapath; the instruction entering IF/ID becomes a NOP if jump.]

Delayed branch
Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled by software through the compiler (delayed branch):
  BRANCH instruction
  delay slot
  delay slot
  delay slot
  Next instruction
The compiler tries to fill the delay slots with useful instructions (worst case: NOP).

Delayed branch/jump
  Original                               Compiled
  Add R5, R4, R3                         Sne R1, R8, R9 ; branch condition
  Sub R6, R5, R2                         Br R1, +100
  Or R14, R6, R21                        Add R5, R4, R3
  Sne R1, R8, R9 ; branch condition      Sub R6, R5, R2
  Br R1, +100                            Or R14, R6, R21
The three instructions after the Br in the compiled code are executed in both cases. Obviously in this group of instructions there must be no jumps!!!
When no suitable instructions are available to be postponed, the compiler inserts NOPs instead.

Handling the Control Hazards
Dynamic prediction: Branch Target Buffer => no stall (almost..)
[Figure: the PC is compared (=) with the TAGS of the buffer; each entry holds the Predicted PC and a T/NT field.]
T/NT: taken/not taken
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.
HIT: fetch with the predicted PC
MISS: fetch with PC + 4
Correct prediction: no stalls
Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)
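
The lookup performed during IF can be sketched with a dictionary standing in for the buffer. The entry layout, the function name and the choice of falling back to PC + 4 on a hit predicted as not taken are assumptions made for this example.

```python
def next_fetch_address(pc, btb):
    """Return the address to fetch after pc, given a Branch Target Buffer.

    btb maps the address of a branch (the tag) to (predicted_pc, taken).
    """
    entry = btb.get(pc)
    if entry is None:
        return pc + 4                  # MISS: fetch with PC + 4
    predicted_pc, taken = entry
    # HIT: fetch with the predicted PC if the branch is predicted taken.
    return predicted_pc if taken else pc + 4


# Hypothetical buffer contents.
btb = {0x100: (0x1CC, True), 0x200: (0x300, False)}
print(hex(next_fetch_address(0x100, btb)))  # 0x1cc
print(hex(next_fetch_address(0x104, btb)))  # 0x108
```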

Prediction Buffer: the simplest implementation uses a single bit that indicates what happened the last time the branch was executed.
[Figure: two nested loops, Loop1 (outer) and Loop2 (inner).]
When the program exits Loop2 the prediction fails (the branch is predicted as taken but is actually untaken); then it fails again when Loop2 is entered once more, because the branch is now predicted as untaken.
When one outcome is predominant, each occurrence of the opposite outcome causes two consecutive errors.

Usually two bits.
[Figure: state diagram of the two-bit predictor; the states predict TAKEN or UNTAKEN and the transitions are labelled with the actual outcome, TAKEN or UNTAKEN.]
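
The difference between the two schemes can be seen by counting mispredictions on the closing branch of an inner loop. The two-bit scheme below is a saturating counter, which is one common choice and only an assumption here, since the exact state diagram of the slide is not recoverable. The function names and the loop pattern are invented for the example.

```python
def mispredictions_1bit(outcomes, last_taken=True):
    """One bit: predict whatever the branch did the last time."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """Two-bit saturating counter: values 2 and 3 predict taken, 0 and 1 untaken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken three times, then untaken at loop exit;
# the inner loop is executed three times by the outer loop.
inner_loop = ([True] * 3 + [False]) * 3
print(mispredictions_1bit(inner_loop))  # 5
print(mispredictions_2bit(inner_loop))  # 3
```

With one bit every loop exit costs two errors (the exit itself and the next entry); with two bits only the exit is mispredicted.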