DLX computer. Electronic Computers M


1 DLX computer. Electronic Computers M

2 RISC architectures
RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer).
In CISC architectures, 10% of the instructions are used in 90% of cases: a waste of silicon. Bottleneck: the bus.
Mid 80s, a new architecture: RISC.
Solution: reduce the number and complexity of the instructions (fewer, simpler machine instructions).
  Fixed instruction format (simpler instruction decoders)
  Simpler control logic network
  More on-chip registers
  Fewer bus/memory accesses
A job needs more machine instructions, but in many cases this is more than compensated (in terms of time) by the reduction in bus accesses.
CISC and RISC are each the best solution in different application fields.
Nowadays both architectures coexist in the same processor: analysis at the end of the course.
A simplified RISC architecture: DLX (implemented as real processor in the 80s as R4000).

3 DLX (fixed) instruction format
Format R: Op-code (6 bit) | Ra (5 bit) | Rb (5 bit) | Rc (5 bit) | Op-code extension (11 bit)
  Arithmetic or logic instructions (i.e. Ra <= Rb op Rc) or Set Condition between registers.
Format I: Op-code (6 bit) | Ra (5 bit) | Rb (5 bit) | Immediate operand or offset (16 bit)
  Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand.
  In Load and ALU instructions Ra = destination; in Store Ra = source. Rb is the ALU value for the immediate instructions.
Format J: Op-code (6 bit) | (PC-relative) offset (26 bit)
  Direct, unconditional control transfer (J and JAL).
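As an illustration of the fixed format, the following sketch splits a 32-bit word into the fields of the three formats. It assumes the op-code occupies the 6 most significant bits and that the fields are packed from the most significant bit downwards in the order listed on the slide; the function names `decode` and `encode_r` are hypothetical helpers, not part of any DLX toolchain.

```python
def decode(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of the R, I and J formats.

    Assumption: fields are packed MSB-first (op-code in bits 31..26).
    """
    word &= 0xFFFFFFFF
    return {
        "opcode": (word >> 26) & 0x3F,   # 6 bits, common to all formats
        "ra": (word >> 21) & 0x1F,       # 5 bits (R and I formats)
        "rb": (word >> 16) & 0x1F,       # 5 bits (R and I formats)
        "rc": (word >> 11) & 0x1F,       # 5 bits (R format only)
        "ext": word & 0x7FF,             # 11-bit op-code extension (R format)
        "imm16": word & 0xFFFF,          # 16-bit immediate/offset (I format)
        "off26": word & 0x3FFFFFF,       # 26-bit offset (J format)
    }


def encode_r(opcode: int, ra: int, rb: int, rc: int, ext: int) -> int:
    """Pack an R-format word under the same field-order assumption."""
    return (opcode << 26) | (ra << 21) | (rb << 16) | (rc << 11) | ext


fields = decode(encode_r(0, 3, 1, 4, 0x20))
print(fields["ra"], fields["rb"], fields["rc"], fields["ext"])
```

Because every format keeps the op-code in the same position, the decoder can extract all candidate fields in parallel and let the op-code select which ones are meaningful.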

4 DLX non floating-point instructions
31 x 32-bit registers (R31 ... R1); R0 = 0 fixed. Ra and Rb can be any of the 32 registers. No STACK registers.
Data transfer
  LW, LB, LBU, LHU, LH, SW, SH, SB: Ra, offset(Rb)
  LHI: Ra, value
Arithmetic/Logic
  ADD, ADDU, SUB, SUBU, DIV, MULU, SLL, SHR, SLA, OR, XOR, AND: Ra,Rb,Rc
  ADDI, ADDUI, SUBI, SUBUI, DIVI, MULI, SLLI, SHRI, SLAI, ORI, XORI, ANDI: Ra,Rb,value
Control
  SETx: Ra,Rb,Rc
  SETIx: Ra,Rb,value
  BEQZ, BNEQZ: Ra, offset([PC])
  J: offset
  JR: Ra
  JL: offset([PC])
  JLR: Ra
N.B.
  Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE.
  JL (via register or not) -> jump and link, saving PC in R31.
  Offset is a value within the instruction.
  Postfix I means "immediate" (value within the instruction).
  Postfix A means "arithmetic" (sign extension).
  Postfix U means "unsigned".
  Value is the immediate within the instruction.

5 DLX ALU operations
Two data inputs, one data output plus flags.
[Figure: ALU block with inputs S1 and S2, ALU controls, 32-bit output OUT and flags.]
S1, S2: ALU inputs (32 bit).
Operations: S1 + S2; S1 - S2; S1 and S2; S1 or S2; S1 exor S2; left shift of S1 by S2 positions; right shift of S1 by S2 positions; arithmetic right shift of S1 by S2 positions; S1; S2; 0; 1.
Output flags: Zero, Negative (sign).
The ALU is a combinatorial circuit!
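The ALU behaviour listed above can be modelled as a pure function of its inputs, which mirrors the fact that it is a combinatorial circuit. This is only a sketch: the operation names (`"add"`, `"sra"`, ...) are hypothetical labels for the ALU controls, and using only the low 5 bits of S2 as the shift amount is an assumption not stated on the slide.

```python
MASK = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op: str, s1: int, s2: int):
    """Return (out, zero_flag, negative_flag) for 32-bit inputs s1 and s2."""
    s1 &= MASK
    s2 &= MASK
    shift = s2 & 31  # assumption: shift amount taken from the low 5 bits of S2
    if op == "add":
        out = s1 + s2
    elif op == "sub":
        out = s1 - s2
    elif op == "and":
        out = s1 & s2
    elif op == "or":
        out = s1 | s2
    elif op == "xor":
        out = s1 ^ s2
    elif op == "sll":
        out = s1 << shift
    elif op == "srl":
        out = s1 >> shift
    elif op == "sra":
        out = to_signed(s1) >> shift  # arithmetic shift keeps the sign bit
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    out &= MASK
    return out, out == 0, bool(out & 0x80000000)


print(alu("sub", 1, 2))
```

The function has no internal state: the same inputs always give the same output and flags, as expected from a combinatorial network.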

6 Sequential DLX
Abstract instruction execution (this is a synchronous state diagram):
  INSTRUCTION FETCH: Ready? [REG INSTR] <= M[PC]
  INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num [Ra]
  INSTRUCTION EXECUTION: Data transfer, ALU, Set, Jump, Branch
[X] is the number of the destination register. PC is the Program Counter, A and B are two internal scratchpad registers, REG INSTR is the register where the newly fetched instruction is stored. All these registers are unknown to the programmer.

7 Example: LB (LOAD BYTE, format I)
Instruction: Op-code | Ra | Rb | offset, written LB Ra, offset(Rb)
  Fetch: INSTR <= M[PC]
  Decode: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]; [C] <= [Rc]; [X] <= num [Ra]
  Load byte:
    Addr <= [B] + (Instr_15)^16 ## Instr_15..0      (byte address computation)
    [Ra] <= (M[Addr]_7)^24 ## M[Addr]_7..0          (byte in register, sign extended)
  Next instruction
Notation: ## is the JOIN operator; (x)^n is bit x repeated n times. Instr_15..0 is the offset field of the instruction. Instruction bit 15 (the sign) is extended to the left 16 times, because the address is always 32 bit (bit 31 = MSbit, bit 0 = LSbit).
Sign extension example: M[Addr]_7..0 = A7H => sign-extended value FFFFFFA7H.
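The two sign extensions of LB (16-bit offset to 32-bit address, loaded byte to 32-bit register value) can be sketched as follows. The memory is modelled as a plain dictionary from byte address to byte value, and `sign_extend` and `load_byte` are hypothetical helper names chosen for this example.

```python
def sign_extend(value: int, bits: int) -> int:
    """Replicate bit (bits - 1) of value into the upper bits of a 32-bit word."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def load_byte(memory: dict, b: int, instr: int) -> int:
    """LB: Addr <= [B] + sign-extended offset; result is the sign-extended byte."""
    addr = (b + sign_extend(instr & 0xFFFF, 16)) & 0xFFFFFFFF
    return sign_extend(memory[addr], 8)


# The slide's example: byte A7H becomes FFFFFFA7H once in the register.
print(hex(sign_extend(0xA7, 8)))
```

An unsigned load (LBU) would differ only in the last step, filling the upper 24 bits with zeros instead of copies of bit 7.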

8 Sign extension: example with IR
(IR_15)^16 ## IR_15..0
[Figure: sign-extension circuit for IR built with tri-state devices, enabled from the Control Unit.]

9 Data transfer instructions (I format)
Addr <= [B] + (Instr_15)^16 ## Instr_15..0
Examples: LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb) (unsigned); LHU Ra, offset(Rb) (unsigned); SW Ra, offset(Rb)
  LB (byte, signed): [Ra] <= (M[Addr]_7)^24 ## M[Addr]_7..0
  LBU (byte, unsigned): [Ra] <= (0)^24 ## M[Addr]_7..0
  LH (half word, signed): [Ra] <= (M[Addr]_15)^16 ## M[Addr]_15..0
  LHU (half word, unsigned): [Ra] <= (0)^16 ## M[Addr]_15..0
  LW: [Ra] <= M[Addr]
  SW: M[Addr] <= [A]

10 ALU instructions: examples
Register (format R): [T] <= [Rc]
Immediate (format I): [T] <= (Instr_15)^16 ## Instr_15..0
(T is a hidden register, unknown to the programmer, storing temporary data.) Register content is treated as signed in arithmetic operations.
  ADD: [Ra] <= [Rb] + [T]
  SUB: [Ra] <= [Rb] - [T]
  AND: [Ra] <= [Rb] and [T]
  OR: [Ra] <= [Rb] or [T]
  XOR: [Ra] <= [Rb] xor [T]
The same scheme applies to the shifts etc.
Examples: ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value
A and B: generic registers (Ra, Rb).

11 SET instructions (see branch)
Register (format R): [T] <= [Rc]
Immediate (format I): [T] <= (Instr_15)^16 ## Instr_15..0
(T is a hidden register, unknown to the programmer, storing temporary data.)
Example: SLT Ra,Rb,Rc sets Ra = 1 if Rb is less than Rc, otherwise Ra = 0. Register content is treated as signed.
  SEQ: [Ra] = 1 if [Rb] = [T]
  SNE: [Ra] = 1 if [Rb] != [T]
  SLT: [Ra] = 1 if [Rb] < [T]
  SGT: [Ra] = 1 if [Rb] > [T]
  SGE: [Ra] = 1 if [Rb] >= [T]
  SLE: [Ra] = 1 if [Rb] <= [T]

12 JUMP instructions
  J offset (jump address), format J
  JR Ra (jump register), format I
  JL offset (jump and link address, JAL), format J
  JLR Ra (jump and link register, JALR), format I
State sequence:
  JAL, JALR (to save [PC] in R31): [T] <= [PC]
  JR, JALR: [PC] <= [Ra]
  J, JAL: [PC] <= [PC] + (Instr_25)^6 ## Instr_25..0
  JAL, JALR: [R31] <= [T]

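13 Branch instructions
BEQZ, BNEZ (format I, as listed in the instruction-format slide).
[State diagram: the two branches test the content of Ra ([Ra] = 1 and [Ra] != 1 in the figure); YES: [PC] <= [PC] + (Instr_15)^16 ## Instr_15..0; NO: back to INIT.]
Example: BNEQZ R5, 100 jumps to PC+100 if R5 is not equal to 0.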

14 The Pipelining Principle
Pipelining is the main basic technique used for speeding up a CPU. The key idea of pipelining is general, and is currently applied in several industrial fields (production lines, oil pipelines, ...).
A system S must operate N times on a task A_i, producing result R_i:
  A_1, A_2, A_3 ... A_N -> S -> R_1, R_2, R_3 ... R_N
Latency: time elapsing between the beginning and the end of task A (T_A).
Throughput: frequency of task completion.

15 The Pipelining Principle
1) Sequential system: a new instruction starts when the previous instruction is finished.
   [Timeline: A_1, A_2, A_3 ... A_n executed one after another along t.]
   A_n: n-th instruction. Latency (execution time of a single instruction) = T_An. Execution times may differ.
2) Pipelined system: instructions are subdivided into stages, each stage lasting one n-th (1/4 in this example) of the entire instruction time. The stages of successive instructions overlap.
   [Figure: task A split into P_1 P_2 P_3 P_4 along t; system S built from pipeline stages S_1 S_2 S_3 S_4.]
   S_i: pipeline stage.

16 The Pipelining Principle
[Timing diagram: A_1, A_2, A_3, A_4 ... A_n, each made of P_1 P_2 P_3 P_4, each starting one T_P after the previous one.]
T_P: pipeline cycle (ideally one clock).
In each cycle one instruction terminates: in the figure A_1 terminates at t_x, in the next cycle A_2 terminates at t_y, etc.
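The gain from overlapping stages can be checked with a small calculation. The sketch below assumes that every instruction uses the same number of equally long stages and that no stalls occur; `sequential_time` and `pipelined_time` are hypothetical names introduced here.

```python
def sequential_time(n: int, stages: int, t_p: int) -> int:
    """Total time when each instruction starts only after the previous one ends."""
    return n * stages * t_p


def pipelined_time(n: int, stages: int, t_p: int) -> int:
    """Total time when a new instruction enters the pipeline every cycle t_p.

    The first instruction needs `stages` cycles; each later one completes
    one cycle after its predecessor.
    """
    return (stages + n - 1) * t_p if n > 0 else 0


for n in (4, 1000):
    seq = sequential_time(n, 4, 1)
    pipe = pipelined_time(n, 4, 1)
    print(n, seq, pipe, seq / pipe)
```

For long instruction sequences the ratio approaches the number of stages, while the latency of each single instruction stays the same.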

17 Typical instruction stages
  IF: Instruction fetch (from memory)
  ID: Instruction decode
  EX: Instruction execution (ALU)
  MEM: Data memory access (if needed; register instructions do not need it)
  WB: Write-back (if needed; jumps do not need it)
N.B. The execution time (latency) of all instructions must be the same, to maintain the order of the results. Some stages are not used by some instructions (the stage is a NOP for them, e.g. the MEM stage for register operations).

18 Pipelining of a CPU (DLX)
Instruction sequence: I_1, I_2, I_3 ... I_N
[Figure: instruction j flows along t through the combinatorial circuits IF, ID, EX, MEM, WB of the CPU datapath, separated by the pipeline registers (D flip-flops) IF/ID, ID/EX, EX/MEM, MEM/WB.]
Pipeline cycle = clock cycle = delay of the slowest stage.
Clocks Per Instruction (CPI) = 1 (ideally!)

19 DLX Pipeline
[Timing diagram: Instr i, i+1, i+2, i+3, i+4, each going through IF ID EX MEM WB and each starting one clock after the previous one.]
CPI (ideally) = 1
Overhead introduced by the pipeline registers: T_clk = T_d + T_P + T_su
  T_clk: clock cycle
  T_d: switch delay of the input stage register
  T_P: delay of the slowest combinatorial stage
  T_su: set-up time of the output stage register
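A worked instance of T_clk = T_d + T_P + T_su may help. All delays below are made-up illustrative numbers in picoseconds, not measurements of any real DLX implementation, and `clock_period` is a hypothetical helper name.

```python
def clock_period(t_d: int, stage_delays: list, t_su: int) -> int:
    """T_clk = T_d + T_P + T_su, where T_P is the delay of the slowest stage."""
    return t_d + max(stage_delays) + t_su


# Hypothetical combinatorial delays of IF, ID, EX, MEM, WB (picoseconds).
stages = [200, 150, 250, 300, 100]
t_clk = clock_period(t_d=20, stage_delays=stages, t_su=30)
print(t_clk)                 # clock period imposed by the slowest stage
print(sum(stages) / t_clk)   # speed-up over an unpipelined datapath
```

The speed-up stays below the number of stages for two reasons visible in the formula: the stages are not perfectly balanced, and each pipeline register adds T_d + T_su to every cycle.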

20 [Figure: a combinatorial circuit with delay T_P placed between two D registers. The clock period covers the switch delay of the input stage register, the delay of the slowest combinatorial stage and the set-up time of the output stage register.]

21 Pipeline implementation requirements
Each stage is active at each clock cycle.
The PC is incremented in the IF stage. An ADDER should be introduced in the IF stage (PC <= PC + 4: one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), and therefore only a 30-bit register (a programmable counter, for jumps) is used, incremented by 1 each clock cycle; the two least significant PC bits are always 0.
Two Memory Data Registers are required (referred to as LDR and SDR). In fact, when a LOAD is immediately followed by a STORE, the WB and MEM stages overlap: two data items are waiting to be written (one to memory, the other to a register of the RF).
In each clock cycle up to 2 memory accesses may have to be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a Harvard architecture.
The CPU clock is determined by the slowest stage.
Pipeline registers store both data and control information (distributed control unit).
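The 30-bit programmable counter can be sketched as a small class: it stores only the word index, counts by 1, and exposes a byte address whose two low bits are always 0. The class name and methods are hypothetical and only illustrate the idea described above.

```python
class ProgramCounter:
    """30-bit word counter standing in for a 32-bit, 4-byte-aligned PC."""

    WORD_MASK = 0x3FFFFFFF  # 30 bits

    def __init__(self, byte_address: int = 0):
        self.load(byte_address)

    def load(self, byte_address: int) -> None:
        """Programmable load, used for jumps and taken branches."""
        if byte_address % 4 != 0:
            raise ValueError("instruction addresses must be 4-byte aligned")
        self.count = (byte_address >> 2) & self.WORD_MASK

    def step(self) -> None:
        """Sequential fetch: increment by 1 word, i.e. PC <= PC + 4."""
        self.count = (self.count + 1) & self.WORD_MASK

    @property
    def address(self) -> int:
        """Byte address seen by the instruction memory (low two bits are 0)."""
        return self.count << 2


pc = ProgramCounter(0x100)
pc.step()
print(hex(pc.address))
```

Counting words instead of bytes removes the need for a full 32-bit adder with a constant 4 on one input.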

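22 DLX Pipelined Datapath
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, +4 adder, instruction memory (IM), decoder (DEC), register file (RF) with ports Ra, Rb, Rc, sign extension (SE), Num [Ra], =0? comparators, ALU, data memory (DM).]
Annotations in the figure:
  The PC is actually a programmable counter; the new PC input is active if jump.
  =0? in EX: for Branch. For Set Condition (also <0 and >0) it acts on the output.
  The PC is carried along for computing the new PC value when branching, and for JL and JLR (PC in R31).
  SE: sign extension, for operations with immediates.
  Num [Ra]: number of the destination register (1-31) in case of LOAD and ALU instructions.
  Data written to the RF come from a register, from memory, or from the PC (for link).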

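23 ID stage (N.B. stage layout different from previous slide!)
[Figure: detail of the ID stage between IF/ID and ID/EX, with IR, decoder (DEC), register file (RF) with ports Ra, Rb, Rc feeding A, B, C, sign extension (SE), PC and Num Ra.]
Annotations in the figure:
  IR opcode field -> DEC (examples: LB, SW); the decoded information travels with the instruction.
  IR_15..0: offset/immediate (or destination register in R instructions), sign extended to 32 bits by copying IR_15 into bits 31-16 (Immediate/Branch).
  IR_25..0 (26 bits, J and JL): sign extended to 32 bits by copying IR_25 into bits 31-26 (Jump; Jump and Link).
  PC_31..0 is passed on for JL and JLR.
  The data and the number of the destination register come from the WB stage.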

24 DLX Pipelined Datapath
SDR => Store Memory Data Register; LDR => Load Memory Data Register; IRi => Instruction Register i
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) showing the per-stage registers PC1 ... PC4 and IR1 ... IR4, plus A, B, C, X, COND, Z, SDR, LDR and Y, between the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]
Annotations in the figure:
  X: computed data, or memory address, or branch address.
  Y: computed data from the previous stage.
  =0?: for Branch; for Set Condition (also <0 and >0) it acts on the output.
  PC3 -> PC4: PC saved in R31 (JL, JLR).
  Num [Ra]: destination register number.

25 Pipelined execution of an ALU instruction
The result of each stage is sampled at the end of its cycle. The decoded opcode travels through all stages.
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX: Z <= A op B, or Z <= A op [(IR2_15)^16 ## IR2_15..0]; [IR3 <= IR2]; [PC3 <= PC2]
  MEM: Y <= Z (temporary storage for WB); [IR4 <= IR3]; [PC4 <= PC3]
  WB: Ra <= Y
NOTE: IRi bits are dropped stage by stage when no longer needed. PCi is propagated for all instructions. Why? JAL, JALR!

26 Pipelined execution of a MEM instruction
The decoded opcode travels through all stages.
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX: AR <= B op (IR2_15)^16 ## IR2_15..0; SDR <= A; [IR3 <= IR2]; [PC3 <= PC2]
  MEM: LDR <= M[AR] (if LOAD) or M[AR] <= SDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
  WB: Ra <= DR (if LOAD) [sign ext.]

27 Pipelined execution of a BRANCH instruction (normally after a SCn instruction, see later)
The decoded opcode travels through all stages.
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX: Z <= PC2 op (IR_15)^16 ## IR_15..0 (computed new PC address); Cond <= A op 0 (branch on register A value, 0/1); [PC3 <= PC2]; [IR3 <= IR2]
  MEM: if (Cond) PC <= Z; [PC4 <= PC3]; [IR4 <= IR3]
  WB: (NOP)
X: BTA (Branch Target Address).
The new value is in the PC at the end of the MEM cycle. When the branch is taken, 3 new unwanted instructions have already started.

28 Pipelined execution of a JR instruction
The decoded opcode travels through all stages.
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX: Z <= A (new PC address); [IR3 <= IR2]; [PC3 <= PC2]
  MEM: PC <= Z; [IR4 <= IR3]; [PC4 <= PC3]
  WB: (NOP)
The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.
Which would be the stage sequence for a J instruction?

29 Pipelined execution of a JL or JLR instruction
The decoded opcode travels through all stages.
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX: Z <= A (if JLR) or Z <= PC2 + (IR_25)^6 ## IR_25..0 (if JL); PC3 <= PC2; [IR3 <= IR2]
  MEM: PC <= Z; PC4 <= PC3; [IR4 <= IR3]
  WB: R31 <= PC4
In this case the PCi values are used.
NOTE: the write on R31 CANNOT be performed on the fly, since it could overlap with another register write.
The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already started.

30 Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3)?
  IF: IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID: A <= Ra; B <= Rb; C <= Rc; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode; [X] <= num[Ra]
  EX, MEM, WB: ???

31 Pipeline Hazards
A hazard occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.
Structural hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
Data hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
Control hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled (pipeline stall or pipeline bubbling), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

32 Hazards and stalls
The consequence of a data hazard: if instruction I_i needs the result of instruction I_i-1 (data are read in the ID stage), it must wait until after the WB of I_i-1.
[Timing diagram, Clk 1 to Clk 12: I_i-3, I_i-2 and I_i-1 go through IF ID EX MEM WB one clock apart; I_i performs IF and ID, waits three stall cycles (S S S), then repeats ID and proceeds to WB; I_i+1 performs IF, waits three stall cycles, then repeats IF and proceeds to WB.]
Stall: the clock signal for I_i, I_i+1 etc. is blocked for three periods.
  T_i = 8 * CLK = (5 + 3) * CLK
Normally the three stalled instructions are turned into NOPs, to avoid blocking the clock.
  T_i = 5 * (1 + 3/5) * CLK

33 Forwarding
Data are read from registers in the ID stage.
  Clk               1    2    3    4    5    6    7    8    9
  ADD R3, R1, R4    IF   ID   EX   MEM  WB
  SUB R7, R3, R5         IF   ID   EX   MEM  WB                  hazard
  OR  R1, R3, R5              IF   ID   EX   MEM  WB             hazard
  LW  R6, 100(R3)                  IF   ID   EX   MEM  WB        hazard
  AND R9, R5, R3                        IF   ID   EX   MEM  WB   no hazard
For LW too, the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (and the register value is read in ID!).
Forwarding eliminates almost all RAW hazards of the pipeline without stalling it. (NOTE: in DLX, registers are modified only in the WB stage.)
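The hazard column of this example can be reproduced with a few lines of code. The sketch assumes, as stated on the slide, that registers are read in ID and written at the end of WB, so a consumer up to three instructions after its producer reads a stale value; `Instr` and `raw_hazards` are hypothetical names introduced for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]        # destination register number, None if none
    sources: Tuple[int, ...]   # source register numbers


def raw_hazards(program: List[Instr], window: int = 3) -> List[List[int]]:
    """For each instruction, list the earlier instructions (at most `window`
    positions back) whose destination register it reads."""
    result = []
    for i, ins in enumerate(program):
        producers = [
            j
            for j in range(max(0, i - window), i)
            # R0 is fixed at 0, so writing it never creates a dependency
            if program[j].dest not in (None, 0) and program[j].dest in ins.sources
        ]
        result.append(producers)
    return result


PROGRAM = [
    Instr("ADD R3, R1, R4", 3, (1, 4)),
    Instr("SUB R7, R3, R5", 7, (3, 5)),
    Instr("OR R1, R3, R5", 1, (3, 5)),
    Instr("LW R6, 100(R3)", 6, (3,)),
    Instr("AND R9, R5, R3", 9, (5, 3)),
]

for ins, producers in zip(PROGRAM, raw_hazards(PROGRAM)):
    print(ins.text, "hazard" if producers else "no hazard")
```

A forwarding unit performs essentially the same register-number comparison in hardware, but uses the result to select a bypass path instead of reporting a hazard.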

34 Forward implementation
A, B, C: source registers (1-31). Rd1, Rd2: destination registers (1-31).
The Forwarding Unit (FU) performs a combinatorial comparison between A, B, C and Rd1, Rd2, together with the opcodes.
[Figure: EX, MEM and WB portion of the datapath (ID/EX, EX/MEM, MEM/WB, with IR3, IR4, PC, Mem, ALU, Offset) showing the FU and the forwarding paths FD1, FD2 and FD3 towards the ALU inputs.]
FD3 is the RF bypass, often performed inside the RF: it allows the register value to be anticipated on ID/EX.
Control: IF opcode and comparison of Rd with the Ra, Rb and Rc numbers.

35 Forward Unit implementation
[Flowchart, not recoverable in detail from the transcription. It selects among the forwarding paths FD1, FD2 and FD3 by asking the following questions, each with a Yes/No outcome:]
  Does the instruction in the MEM stage want to write a register?
  Is the destination register number identical to the Ra or Rb or Rc number?
  Does the fetched instruction need the register in the MEM stage?
  Does the instruction in the WB stage want to write a register?
  Is the destination register number identical to the Ra or Rb or Rc number, and different from the register which will be written by the MEM stage?
  Does the fetched instruction need the register being written by the WB stage?
  Will the instruction in the MEM or WB stage write a register whose number is identical to the Ra or Rb or Rc number?

36 Data hazard due to LOAD instructions
  LW  R1,32(R6)    IF   ID   EX   MEM  WB
  ADD R4,R1,R7          IF   ID   EX   MEM
  SUB R5,R1,R8               IF   ID   EX
  AND R6,R1,R7                    IF   ID
NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless an additional input is added to the MUXs between memory and ALU, which adds delay!).
The pipeline needs to be stalled: the ADD is transformed into a NOP and re-fetched (PC <= PC - 4).
[Timing diagram: LW R1,32(R6) goes through IF ID EX MEM WB; the first copy of ADD R4,R1,R7 becomes a NOP; the re-fetched ADD R4,R1,R7, then SUB R5,R1,R8 and AND R6,R1,R7, follow one clock later. From the end of the stalled stage onwards: standard forwarding.]

37 Delayed load
In many RISC CPUs, the special hazard associated with the LOAD instruction (which would in any case lead to a stall) is not handled by stalling the pipeline but in software, by the compiler (delayed load). In this example R3 is needed by the ADD instruction while it is still being read from memory [instruction LW R3, 10(R4)]. Please notice that a hardware forwarding network is required in any case.
  Original code       Scheduled code
  LW  R1,32(R6)       LW  R1,32(R6)
  LW  R3,10(R4)       LW  R3,10(R4)
  ADD R5,R1,R3        LW  R6,20(R7)
  LW  R6,20(R7)       ADD R5,R1,R3
  LW  R8,40(R9)       LW  R8,40(R9)
Sequence: LOAD instruction, delay slot, next instruction (with forwarding hardware).
The compiler tries to fill the delay slot with a useful instruction (worst case: NOP).
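The effect of the reordering can be checked by counting load-use pairs, i.e. a load immediately followed by an instruction that reads the loaded register. This is a simplified sketch: instructions are hypothetical (op, destination, sources) tuples, and only adjacent pairs are examined, which matches the one-slot load delay described above.

```python
LOADS = {"LW", "LB", "LBU", "LH", "LHU"}


def load_use_stalls(program) -> int:
    """Count instructions that read a register loaded by the instruction
    immediately before them (each such pair costs one delay slot)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op in LOADS and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (op, destination register, source registers), from the two listings above
ORIGINAL = [
    ("LW", 1, (6,)),
    ("LW", 3, (4,)),
    ("ADD", 5, (1, 3)),
    ("LW", 6, (7,)),
    ("LW", 8, (9,)),
]
SCHEDULED = [
    ("LW", 1, (6,)),
    ("LW", 3, (4,)),
    ("LW", 6, (7,)),
    ("ADD", 5, (1, 3)),
    ("LW", 8, (9,)),
]

print(load_use_stalls(ORIGINAL), load_use_stalls(SCHEDULED))
```

Moving the independent LW R6, 20(R7) into the slot after LW R3, 10(R4) removes the only load-use pair without changing the result of the program.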

38 Control Hazards
  PC      BEQZ R4, 200
  PC+4    SUB  R7, R3, R5
  PC+8    OR   R1, R3, R5
  PC+12   LW   R6, 100(R8)
Next instruction address:
  R4 = 0: Branch Target Address (BTA) (taken)
  R4 != 0: PC+4 (not taken)
[Timing diagram, Clk 1 to Clk 8: BEQZ R4, 200 goes through IF ID EX MEM WB while SUB, OR and LW enter the pipeline in the following clocks. The new PC value is computed by the ALU (ALU output); the new value is in the PC one clock later, because it must be clocked into the PC; the fetch with the new PC follows.]

39 DLX Branch or JMP
Detailed datapath slide: see DLX Pipelined Datapath. Here we assume that the JMP instruction is the I-th instruction.
[Figure: pipelined datapath (IF, ID, EX, MEM, WB; IF/ID, ID/EX, EX/MEM, MEM/WB) while BEQZ R4, 200 is executed, with the new PC fed back from Z to the PC.]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of Z, only two stalls would be required, but with a slower clock!

40 Handling the Control Hazards
Always stall (three-clock block being propagated):
  [Timing diagram, Clk 1 to Clk 8: BEQZ R4,200 goes through IF ID EX MEM WB; the following instruction is fetched and then stalled (S) until the new PC is available. Annotations: "IF here: the previous instruction (BEQZ) has not yet been decoded"; "Here the new value of PC has been computed"; "Here the new value is sampled by the PC"; "Fetch at new PC"; "Real situation: repeated IF, PC <= PC - 4".]
Predict not taken:
  [Timing diagram, Clk 1 to Clk 8: BEQZ R4, 200 is followed by SUB R7, R3, R5, OR R1, R3, R5 and LW R6, 100(R8), which enter the pipeline normally. Annotations: "Branch completion. If branch taken: flush. They become NOPs. No data yet written"; "No problem because no instruction in WB stage".]

41 Stalls with jumps (1/3)
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which a signal, active if jump, forces a NOP into IF/ID, ID/EX and EX/MEM.]

42 Stalls with jumps (2/3)
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which a signal, active if jump, forces a NOP into two pipeline registers.]

43 Stalls with jumps (3/3)
NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected. Very slow clock solution!
A NOP MUST replace the unwanted instruction already started.
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which a signal, active if jump, turns the instruction entering IF/ID into a NOP.]

44 Delayed branch
Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled in software, by the compiler (delayed branch):
  BRANCH instruction
  delay slot
  delay slot
  delay slot
  Next instruction
The compiler tries to fill the delay slots with useful instructions (worst case: NOP).

45 Delayed branch/jump
  Original                              Compiled
  Add R5, R4, R3                        Sne R1, R8, R9   ; branch condition
  Sub R6, R5, R2                        Br  R1, +100
  Or  R14, R6, R21                      Add R5, R4, R3
  Sne R1, R8, R9   ; branch condition   Sub R6, R5, R2
  Br  R1, +100                          Or  R14, R6, R21
In the compiled code, Add, Sub and Or are executed in both cases (branch taken or not taken). Obviously, there must be no jumps in this group of instructions!
The compiler inserts NOPs in place of one or more postponed instructions when no suitable instructions are available.

46 Handling the Control Hazards
Dynamic prediction: Branch Target Buffer => no stall (almost...)
[Figure: the PC is compared (=) with the TAGS column of the buffer; each entry holds TAGS, Predicted PC and T/NT.] T/NT: taken/not taken.
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.
  HIT: fetch with predicted PC
  MISS: fetch with PC + 4
  Correct prediction: no stalls
  Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)

47 Prediction Buffer
The simplest implementation uses a single bit that indicates what happened the last time the branch occurred.
[Figure: code with two loops, Loop1 and Loop2.]
When the program exits loop2, the prediction fails (the branch is predicted as taken but is actually not taken); it then fails again, predicting not taken, when loop2 is entered once more.
When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors.

48 Usually two bits are used.
[Figure: state diagram of the two-bit predictor, with transitions labelled TAKEN and UNTAKEN.]
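The difference between the one-bit and the two-bit scheme can be seen by replaying the outcomes of a loop-closing branch. The sketch below assumes the common two-bit saturating counter (states 0 to 3, predict taken when the counter is 2 or 3); the exact state diagram of the slide is not recoverable from the transcription, so this encoding and the function names are assumptions.

```python
def one_bit_misses(outcomes, last_taken=True) -> int:
    """Predict that the branch repeats its previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def two_bit_misses(outcomes, counter=3) -> int:
    """Two-bit saturating counter: predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Branch closing an inner loop: taken 3 times, then not taken on exit,
# with the inner loop executed 3 times by an outer loop.
outcomes = ([True] * 3 + [False]) * 3
print(one_bit_misses(outcomes), two_bit_misses(outcomes))
```

With one bit, every loop exit costs two mispredictions after the first pass (the exit itself and the next entry); with two bits a single opposite outcome does not flip the prediction, so only the exit is mispredicted.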


More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Pipelining. Maurizio Palesi

Pipelining. Maurizio Palesi * Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer

More information

ISA and RISCV. CASS 2018 Lavanya Ramapantulu

ISA and RISCV. CASS 2018 Lavanya Ramapantulu ISA and RISCV CASS 2018 Lavanya Ramapantulu Program Program =?? Algorithm + Data Structures Niklaus Wirth Program (Abstraction) of processor/hardware that executes 3-Jul-18 CASS18 - ISA and RISCV 2 Program

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Pipelining: Basic Concepts

Pipelining: Basic Concepts Pipelining: Basic Concepts Prof. Cristina Silvano Dipartimento di Elettronica e Informazione Politecnico di ilano email: silvano@elet.polimi.it Outline Reduced Instruction Set of IPS Processor Implementation

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

Processor. Han Wang CS3410, Spring 2012 Computer Science Cornell University. See P&H Chapter , 4.1 4

Processor. Han Wang CS3410, Spring 2012 Computer Science Cornell University. See P&H Chapter , 4.1 4 Processor Han Wang CS3410, Spring 2012 Computer Science Cornell University See P&H Chapter 2.16 20, 4.1 4 Announcements Project 1 Available Design Document due in one week. Final Design due in three weeks.

More information

DLX: A Simplified RISC Model

DLX: A Simplified RISC Model 1 DLX Pipeline DLX: A Simplified RISC Model Integer ALU Floating Point Unit (FPU) definition based on MIPS 2000 commercial microprocessor 32 bit machine address, integer, register width, instruction length

More information

14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath 14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle

More information

1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor 1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Pipeline Architecture RISC

Pipeline Architecture RISC Pipeline Architecture RISC Independent tasks with independent hardware serial No repetitions during the process pipelined Pipelined vs Serial Processing Instruction Machine Cycle Every instruction must

More information

6.823 Computer System Architecture Datapath for DLX Problem Set #2

6.823 Computer System Architecture Datapath for DLX Problem Set #2 6.823 Computer System Architecture Datapath for DLX Problem Set #2 Spring 2002 Students are allowed to collaborate in groups of up to 3 people. A group hands in only one copy of the solution to a problem

More information

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will

More information

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)

More information

Laboratory Exercise 6 Pipelined Processors 0.0

Laboratory Exercise 6 Pipelined Processors 0.0 Laboratory Exercise 6 Pipelined Processors 0.0 Goals After this laboratory exercise, you should understand the basic principles of how pipelining works, including the problems of data and branch hazards

More information

Anne Bracy CS 3410 Computer Science Cornell University. See P&H Chapter: , , Appendix B

Anne Bracy CS 3410 Computer Science Cornell University. See P&H Chapter: , , Appendix B Anne Bracy CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. See P&H Chapter: 2.16-2.20, 4.1-4.4,

More information

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Vertieferlabor Mikroelektronik (85-324) & Embedded Processor Lab (85-546) Task 5

Vertieferlabor Mikroelektronik (85-324) & Embedded Processor Lab (85-546) Task 5 FB Elektrotechnik und Informationstechnik Prof. Dr.-Ing. Norbert Wehn Dozent: Uwe Wasenmüller Raum 12-213, wa@eit.uni-kl.de Task 5 Introduction Subject of the this task is the extension of the fundamental

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Pipelining: Hazards Ver. Jan 14, 2014

Pipelining: Hazards Ver. Jan 14, 2014 POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Pipelining: Hazards Ver. Jan 14, 2014 Marco D. Santambrogio: marco.santambrogio@polimi.it Simone Campanoni:

More information

Instruction Set Architecture of. MIPS Processor. MIPS Processor. MIPS Registers (continued) MIPS Registers

Instruction Set Architecture of. MIPS Processor. MIPS Processor. MIPS Registers (continued) MIPS Registers CSE 675.02: Introduction to Computer Architecture MIPS Processor Memory Instruction Set Architecture of MIPS Processor CPU Arithmetic Logic unit Registers $0 $31 Multiply divide Coprocessor 1 (FPU) Registers

More information

Improving Performance: Pipelining

Improving Performance: Pipelining Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

Computer Architecture (TT 2011)

Computer Architecture (TT 2011) Computer Architecture (TT 2011) The MIPS/DLX/RISC Architecture Daniel Kroening Oxford University, Computer Science Department Version 1.0, 2011 Outline ISAs Overview MIPS/DLX Instruction Formats D. Kroening:

More information

Anne Bracy CS 3410 Computer Science Cornell University. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Anne Bracy CS 3410 Computer Science Cornell University. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Anne Bracy CS 3410 Computer Science Cornell University [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Understanding the basics of a processor We now have the technology to build a CPU! Putting it all

More information

Instruction Set Principles. (Appendix B)

Instruction Set Principles. (Appendix B) Instruction Set Principles (Appendix B) Outline Introduction Classification of Instruction Set Architectures Addressing Modes Instruction Set Operations Type & Size of Operands Instruction Set Encoding

More information

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19 CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA)... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data

More information

6.004 Tutorial Problems L22 Branch Prediction

6.004 Tutorial Problems L22 Branch Prediction 6.004 Tutorial Problems L22 Branch Prediction Branch target buffer (BTB): Direct-mapped cache (can also be set-associative) that stores the target address of jumps and taken branches. The BTB is searched

More information

The MIPS Instruction Set Architecture

The MIPS Instruction Set Architecture The MIPS Set Architecture CPS 14 Lecture 5 Today s Lecture Admin HW #1 is due HW #2 assigned Outline Review A specific ISA, we ll use it throughout semester, very similar to the NiosII ISA (we will use

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002, Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called

More information

Chapter 2. Instruction Set Design. Computer Architectures. software. hardware. Which is easier to change/design??? Tien-Fu Chen

Chapter 2. Instruction Set Design. Computer Architectures. software. hardware. Which is easier to change/design??? Tien-Fu Chen Computer Architectures Chapter 2 Tien-Fu Chen National Chung Cheng Univ. chap2-0 Instruction Set Design software instruction set hardware Which is easier to change/design??? chap2-1 Instruction Set Architecture:

More information

Appendix C. Abdullah Muzahid CS 5513

Appendix C. Abdullah Muzahid CS 5513 Appendix C Abdullah Muzahid CS 5513 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero) Single address mode for load/store: base + displacement no indirection

More information

MIPS Instruction Set

MIPS Instruction Set MIPS Instruction Set Prof. James L. Frankel Harvard University Version of 7:12 PM 3-Apr-2018 Copyright 2018, 2017, 2016, 201 James L. Frankel. All rights reserved. CPU Overview CPU is an acronym for Central

More information

Reduced Instruction Set Computer (RISC)

Reduced Instruction Set Computer (RISC) Reduced Instruction Set Computer (RISC) Focuses on reducing the number and complexity of instructions of the ISA. RISC Goals RISC: Simplify ISA Simplify CPU Design Better CPU Performance Motivated by simplifying

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 35: Final Exam Review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Material from Earlier in the Semester Throughput and latency

More information

Programmable Machines

Programmable Machines Programmable Machines Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Quiz 1: next week Covers L1-L8 Oct 11, 7:30-9:30PM Walker memorial 50-340 L09-1 6.004 So Far Using Combinational

More information

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

More information

Very Simple MIPS Implementation

Very Simple MIPS Implementation 06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,

More information

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Hakim Weatherspoon CS 3410 Computer Science Cornell University Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer. memory inst register

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

CHAPTER 2: INSTRUCTION SET PRINCIPLES. Prepared by Mdm Rohaya binti Abu Hassan

CHAPTER 2: INSTRUCTION SET PRINCIPLES. Prepared by Mdm Rohaya binti Abu Hassan CHAPTER 2: INSTRUCTION SET PRINCIPLES Prepared by Mdm Rohaya binti Abu Hassan Chapter 2: Instruction Set Principles Instruction Set Architecture Classification of ISA/Types of machine Primary advantages

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

P = time to execute I = number of instructions executed C = clock cycles per instruction S = clock speed

P = time to execute I = number of instructions executed C = clock cycles per instruction S = clock speed RISC vs CISC Reduced Instruction Set Computer vs Complex Instruction Set Computers for a given benchmark the performance of a particular computer: P = 1 II CC 1 SS where P = time to execute I = number

More information

Computer Architecture

Computer Architecture CS3350B Computer Architecture Winter 2015 Lecture 4.2: MIPS ISA -- Instruction Representation Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design,

More information

EECS 322 Computer Architecture Improving Memory Access: the Cache

EECS 322 Computer Architecture Improving Memory Access: the Cache EECS 322 Computer Architecture Improving emory Access: the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow

More information

Programmable Machines

Programmable Machines Programmable Machines Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Quiz 1: next week Covers L1-L8 Oct 11, 7:30-9:30PM Walker memorial 50-340 L09-1 6.004 So Far Using Combinational

More information

Instruction Pipelining

Instruction Pipelining Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 6 Pipelining Part 1 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

Advanced Computer Architecture Pipelining

Advanced Computer Architecture Pipelining Advanced Computer Architecture Pipelining Dr. Shadrokh Samavi Some slides are from the instructors resources which accompany the 6 th and previous editions of the textbook. Some slides are from David Patterson,

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP Overview Appendix A Pipelining: Basic and Intermediate Concepts Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 1 2 Unpipelined

More information

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations Principles of pipelining Pipelining Simple pipelining Structural Hazards Data Hazards Control Hazards Interrupts Multicycle operations Pipeline clocking ECE D52 Lecture Notes: Chapter 3 1 Sequential Execution

More information

Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access

More information

101 Assembly. ENGR 3410 Computer Architecture Mark L. Chang Fall 2009

101 Assembly. ENGR 3410 Computer Architecture Mark L. Chang Fall 2009 101 Assembly ENGR 3410 Computer Architecture Mark L. Chang Fall 2009 What is assembly? 79 Why are we learning assembly now? 80 Assembly Language Readings: Chapter 2 (2.1-2.6, 2.8, 2.9, 2.13, 2.15), Appendix

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information