14:332:331 Pipelined Datapath


Sarah Hamilton
6 years ago

1 14:332:331 Pipelined Datapath

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction.
- Is wasteful of area, since some functional units (e.g., adders, memory units) must be duplicated because they cannot be shared during a clock cycle.
- But it is simple and easy to understand.
[Figure: single-cycle implementation timing, with lw in Cycle 1, sw in Cycle 2, and the unused part of the cycle marked as waste.]

2 Multi-cycle Advantages & Disadvantages
- Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step.
  - Balance the amount of work to be done in each step.
  - Restrict each step to use only one major functional unit.
- Multi-cycle implementations allow functional units to be used more than once per instruction, as long as they are used on different clock cycles.
- Allows faster clock rates than the single-cycle architecture.
- Different instructions take a different number of clock cycles.
- But it requires additional internal state registers, multiplexers, and more complicated (finite state machine) control.

The Five Stages of Load Instruction
We will consider only a subset of instructions (lw, sw, add, sub, and, or, slt, beq).
- IFetch: instruction fetch and PC update
- Dec: register read and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- WB: write the data back to the register file
[Figure: lw occupying Cycle 1 to Cycle 5 as IFetch, Dec/RR, Exec, Mem, WB; figure label: 2 ns.]

3 What if...
- Several instructions were worked on by the CPU at the same time?
- Each major logic unit works on a different stage of a different instruction, like doing laundry for different roommates.

Pipelined MIPS Processor
- Start the next instruction while still working on the current one.
- Improves throughput: the total amount of work done in a given time.
- If the pipeline is full (the ideal situation):
  $$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{non-pipelined}}}{\text{Number of pipeline stages}}$$
- Instruction latency (the time from the start of an instruction to its completion) is not reduced.
[Figure: lw, sw and an R-type instruction, each passing through IFetch, Dec/RR, Exec, Mem, WB in overlapped fashion on a time axis of Cycle 1 to Cycle 8.]

4 Single Cycle, Multi Cycle, vs. Pipelined
[Figure: three timing diagrams on a common clock.
- Single-cycle implementation: Load in Cycle 1, Store in Cycle 2, with the unused part of the cycle marked as waste.
- Multiple-cycle implementation (Cycle 1 to Cycle 10): lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; the R-type instruction then starts with IFetch.
- Pipeline implementation: lw, sw and the R-type instruction each pass through IFetch, Dec, Exec, Mem, WB in overlapped fashion; one slot is marked as a wasted cycle.]

Single Cycle vs. Pipelined
Assume memory and ALU operations take 200 ps, and register operations take 100 ps.
[Figure: single-cycle vs. pipelined implementation timing, marking the time savings for instruction 2 and instruction 3.]
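The timing comparison can be sketched numerically. This is a minimal illustration, assuming the stage latencies stated on the slide (200 ps for memory and ALU operations, 100 ps for register operations), an ideal pipeline with no hazards, and a run of lw instructions; the names `STAGE_PS`, `single_cycle_time` and `pipelined_time` are hypothetical.

```python
# Assumed per-stage latencies in picoseconds for a lw instruction.
STAGE_PS = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}


def single_cycle_time(n_instructions):
    """Single-cycle design: every instruction takes one long cycle,
    sized for the slowest instruction (lw uses all five steps)."""
    cycle = sum(STAGE_PS.values())  # 800 ps
    return n_instructions * cycle


def pipelined_time(n_instructions):
    """Ideal pipeline: the clock is set by the slowest stage; the first
    instruction needs all stages, each later one adds a single cycle."""
    cycle = max(STAGE_PS.values())  # 200 ps
    return (len(STAGE_PS) + n_instructions - 1) * cycle


print(single_cycle_time(3))  # 2400 ps
print(pipelined_time(3))     # 1400 ps
```

Because the stages are not perfectly balanced (the register stages need only 100 ps but still get a 200 ps cycle), the speed-up for long instruction streams approaches 800/200 = 4 rather than the stage count of 5.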

5 Designing MIPS Instructions for Pipelining
What makes it easy:
- All instructions are the same length (32 bits), so the first two pipeline stages are the same for all instructions.
- Few instruction formats (three), with symmetry across formats: register addresses are in the same location and thus can be read while instructions are being decoded.
- Memory operations occur only in loads and stores, so memory addresses can be computed in the EX stage.
- Operands are aligned in memory, so a single data transfer requires only one memory access.

MIPS Pipeline Datapath Modifications
What do we need to add or modify in our single-cycle datapath to make it pipelined?
A MIPS instruction has (up to) five stages, thus the pipeline has five stages:
- IFetch, to fetch the instruction from instruction memory
- Dec, to decode the instruction and read the register file registers
- Exec, to do the operations
- Mem, to read from or write into data memory
- WB, to write back into the register file
So we need a way to separate the datapath into five pieces without losing intermediate results. We will introduce pipeline registers between pipeline stages to isolate them.

6 MIPS Pipeline Datapath Modifications
All instructions advance during one clock cycle from one pipeline register to the next.
[Figure: pipelined datapath with stages IFetch, Dec, Exec, Mem, WB separated by the IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB pipeline registers, all driven by the system clock. Labeled components: PC, +4 adder, instruction memory, register file, 16-to-32-bit sign extend, shift left, adder, data memory.]

MIPS Pipeline Datapath Modifications
Because all data is passed through the pipeline, the address of the register into which data is to be loaded (lw) also needs to be passed along.
[Figure: the same pipelined datapath.]


7 MIPS Pipeline Control Path Modifications
All control signals are determined during Decode and held in the pipeline registers between pipeline stages.
[Figure: pipelined datapath with a Control block added in the Dec stage.]

MIPS Pipeline Control Path Modifications
The modified control path is [figure not reproduced in this transcript].

8 Pipeline Example
How does the following non-dependent instruction sequence execute in a pipeline (no support for forwarding)?
  before <4>
  before <3>
  before <2>
  before <1>
  lw $10, 20($1)
  sub $11, $2, $3
  and $12, $4, $5
  or $13, $6, $7
  add $14, $8, $9
  after <1>
  after <2>

Pipeline Example - before <4> completes
[Figure: datapath state.]

9 Pipeline Example - before <3> completes
[Figure: datapath state.]

Pipeline Example - before <2> completes
[Figure: datapath state.]

10 Pipeline Example - before <1> completes
[Figure: datapath state.]

Pipeline Example - lw completes
[Figure: datapath state; annotations: data memory not used (MEM control lines 0); destination register.]

11 Pipeline Example - sub completes
[Figure: datapath state; annotation: data memory not used (MEM control lines 0).]

Pipeline Example - and completes
[Figure: datapath state; annotation: normal PC+4 increment (PCSrc=0).]

12 Pipeline Example - or completes
[Figure: datapath state.]

Pipeline Example - add completes
[Figure: datapath state.]

13 Graphically Representing MIPS Pipeline
- So far we have seen single-clock-cycle pipeline diagrams, which show the state of the entire datapath during one clock cycle (instructions are identified above the pipeline stages).
- Multi-clock-cycle pipeline diagrams are simpler, and can help answer:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during a certain cycle?
- They can represent multiple instructions in a single figure.
- If there is a hazard, they show why it occurs and how it can be fixed.

Why Pipeline? For Throughput!
Once the pipeline is full, one instruction is completed every cycle.
[Figure: Inst 0 to Inst 4 in instruction order against time (clock cycles), marking the time to fill the pipeline.]


14 Example of graphical representation
[Figure: multi-clock-cycle pipeline diagram.] It can be converted into a single-clock-cycle pipeline diagram.

Example of single-clock-cycle pipeline representation
[Figure: single-clock-cycle pipeline diagram.]

15 Pipelining the MIPS ISA
What makes it hard:
- structural hazards: what if we had only one memory? Then the pipeline cannot have one instruction read from memory (fetch stage) while, at the same time, another instruction writes into memory (sw).
- control hazards: we need to make a decision based on the results of one instruction while that instruction is still executing. What about branches? Stalling.

Impact of branch stalling
We assume that all instructions in the pipeline have a CPI of 1, except branches, which are always followed by a stall and so have a CPI of 2. In a typical program branches occur 13% of the time. Thus we can compute the aggregate CPI of the always-stall-for-branch architecture as:
$$CPI = \sum_{i=1}^{n} CPI_i \times F_i$$
$$CPI_{\text{always stall}} = 1 \times 87\% + 2 \times 13\% = 1.13 \text{ cycles/instruction}$$
Thus
$$\frac{\text{Perform.}_{\text{always stall}}}{\text{Perform.}_{\text{no stall}}} = \frac{\text{Inst. Count} \times CPI_{\text{no stall}} \times \text{Clock}}{\text{Inst. Count} \times CPI_{\text{always stall}} \times \text{Clock}} = \frac{1}{1.13} \approx 0.885 \ (88.5\%)$$
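The aggregate-CPI arithmetic can be checked with a few lines of code. This is a small sketch using the slide's figures (13% branches at CPI 2, everything else at CPI 1); the function name `aggregate_cpi` is hypothetical.

```python
def aggregate_cpi(classes):
    """Weighted CPI: sum of CPI_i * F_i over instruction classes,
    where F_i is the fraction of instructions in class i."""
    return sum(cpi * fraction for cpi, fraction in classes)


# (CPI, frequency) pairs: non-branches, then always-stalling branches.
cpi_always_stall = aggregate_cpi([(1, 0.87), (2, 0.13)])
cpi_no_stall = 1.0

# Same instruction count and clock, so relative performance is the
# inverse ratio of the CPIs.
relative_performance = cpi_no_stall / cpi_always_stall

print(round(cpi_always_stall, 2))            # 1.13
print(round(relative_performance * 100, 1))  # 88.5
```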


16 Pipelining the MIPS ISA
- control hazards: another approach is prediction, either static (always execute the instruction following a branch, i.e., always assume that the branch is not taken) or dynamic (keep a history of each branch as taken or not taken; accurate 90% of the time).
[Figure: pipeline diagrams for the branch-not-taken and branch-taken cases.]

Pipelining the MIPS ISA
We can represent the pipeline in a simplified way, shading the blocks that are used in a given clock cycle.
- data hazards: what if an instruction's input operands depend on the output of a previous instruction that has not finished? Example: an add followed by a sub.
[Figure: shaded pipeline diagram, labeled Forwarding.]

17 Pipelining the MIPS ISA
Forwarding will fail for a lw followed immediately by an instruction that uses the result of the lw. Example: lw followed by a sub.

Pipelining the MIPS ISA
- Solution: stall the pipeline for one clock cycle, then forward from the MEM/WB pipeline register.
- Another solution: have the compiler reorder the code so that lw is followed by an instruction that does not depend on the loaded word.

18 How About Register File Access?
The register file access hazard can be fixed by doing writes in the first half of the cycle and reads in the second half.
[Figure: add, Inst 1, Inst 2, add, Inst 4 in instruction order against time (clock cycles).]

Branch Instructions Cause Control Hazards
Dependencies backward in time cause hazards.
[Figure: add, beq, lw, Inst 3, Inst 4 in instruction order against time.]

19 One Way to Fix a Control Hazard
The branch hazard can be fixed by waiting (stalling), but this affects throughput.
[Figure: add, beq, stall, stall, lw, Inst 3 in instruction order.]

Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards.
  add r1,r2,r3
  sub r4,r1,r5
  and r6,r1,r7
  or r8,r1,r9
  xor r4,r1,r5
[Figure annotations: data hazard; no data hazard.]

20 One Way to Fix a Data Hazard
The data hazard can be fixed by waiting (stalling), but this affects throughput.
[Figure: add r1,r2,r3; stall; stall; sub r4,r1,r5; and r6,r1,r7 in instruction order.]

Loads Can Cause Data Hazards
Dependencies backward in time cause hazards.
  lw r1,100(r2)
  sub r4,r1,r5
  and r6,r1,r7
  or r8,r1,r9
  xor r4,r1,r5

21 Stores Can Cause Data Hazards
Dependencies backward in time cause hazards.
  add r1,r2,r3
  sw r1,100(r5)
  and r6,r1,r7
  or r8,r1,r9
  xor r4,r1,r5

Pipeline Changes to accommodate Forwarding
- To avoid slowing down throughput, we need to add hardware that detects data hazards. We call this the forwarding unit.
- Data needs to be forwarded to the ALU when a data hazard is detected. Thus the forwarding unit controls the forwarding of data through additional multiplexers at the ALU inputs.
- This logic unit needs input from the three pipeline registers.
- It also needs to detect whether the RegWrite control signal is asserted, so it needs input from the control lines as well.
- No forwarding if EX/MEM.RegisterRd = $0 and MEM/WB.RegisterRd = $0.

22 Pipeline Changes to accommodate Forwarding
It needs to detect one of four cases of data hazards:
- if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) then forward: ForwardA = 10
- similarly, if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) then forward: ForwardB = 10

Pipeline Changes to accommodate Forwarding
- similarly, if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) then forward: ForwardA = 01
- similarly, if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) then forward: ForwardB = 01

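23 Pipeline Changes to accommodate Forwarding
[Figure: datapath with the forwarding unit and the multiplexers at the ALU inputs.]

Pipeline Changes to accommodate Forwarding
ForwardA (multiplexer for ALU input 1):
- 00: ID/EX is the input to ALU input 1 (no forwarding)
- 01: MEM/WB is the input to ALU input 1
- 10: EX/MEM is the input to ALU input 1
ForwardB (multiplexer for ALU input 2):
- 00: ID/EX is the input to ALU input 2
- 01: MEM/WB is the input to ALU input 2
- 10: EX/MEM is the input to ALU input 2
- 11: sign extension is the input to ALU input 2, OR add another multiplexer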
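The four hazard conditions and the ForwardA/ForwardB encodings can be modeled in software. The following is an illustrative sketch, not the hardware itself: the function name `forwarding_unit` and its parameter names are hypothetical, and it assumes that when both EX/MEM and MEM/WB match the same source register, the EX/MEM value (the more recent result) wins.

```python
def forwarding_unit(ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd,
                    id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) mux selects as 2-bit integers:
    0b00 = ID/EX (no forwarding), 0b01 = MEM/WB, 0b10 = EX/MEM."""

    def select(source_reg):
        # EX/MEM hazard: checked first (assumed priority, most recent value).
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return 0b10
        # MEM/WB hazard.
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == source_reg:
            return 0b01
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


# "and $4, $2, $5" in EX while "sub $2, $1, $3" sits in EX/MEM.
print(forwarding_unit(True, 2, False, 0, id_ex_rs=2, id_ex_rt=5))  # (2, 0)

# "or $4, $4, $2" in EX: "and" ($4) is in EX/MEM, "sub" ($2) in MEM/WB.
print(forwarding_unit(True, 4, True, 2, id_ex_rs=4, id_ex_rt=2))   # (2, 1)
```

Writes to register $0 are never forwarded, which matches the "RegisterRd ≠ 0" term in each condition.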

24 Forwarding Pipeline Example
How does the following dependent instruction sequence execute in a pipeline with support for forwarding?
  before <4>
  before <3>
  before <2>
  before <1>
  sub $2, $1, $3
  and $4, $2, $5
  or $4, $4, $2
  add $9, $4, $2
  after <1>
  after <2>

Forw. Pipeline Example - before <2> completes
[Figure: datapath state.]


25 Forw. Pipeline Example - before <1> completes
- Use this value of $2, not the one fetched from the register file.
- EX/MEM.RegWrite is asserted.
- EX/MEM.RegisterRd = ID/EX.RegisterRs

Forw. Pipeline Example - sub completes
- Both $4 and $2 are forwarded.
- EX/MEM.RegWrite is asserted.
- MEM/WB.RegWrite is asserted.
- EX/MEM.RegisterRd = ID/EX.RegisterRs
- MEM/WB.RegisterRd = ID/EX.RegisterRt


26 Forwarding Pipeline Example - and completes
- Use this value of $4, not the one fetched from the register file.
- EX/MEM.RegWrite is asserted.
- EX/MEM.RegisterRd = ID/EX.RegisterRs

Pipeline Changes to accommodate Stalls
Forwarding does not work when an instruction following a lw tries to read the value from the destination register of the lw:
  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
The pipeline needs to be stalled, and the data then forwarded from the MEM/WB pipeline register.
[Figure annotation: forwarding does not work.]


27 How Stalls are inserted
- Stalls happen in the EX stage, such that the subsequent two instructions in the pipeline both repeat what they were doing for one cycle. This allows forwarding to work.
- [Figure annotation: stall in CC4; the and and or instructions repeat what they did in CC3.]

Pipeline Changes to accommodate Stalls
- We need a logic unit that detects hazards and then stalls.
- The hazard detection unit operates in the instruction decode stage. It tests whether the instruction ahead of it, now in EX, is a load (the ID/EX.MemRead control line is asserted).
- It then checks whether either of the source registers of the instruction currently being decoded is the same as the target/destination register of the lw being executed (that is, whether ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt).
- During stalling, the PC is prevented from incrementing and the instruction in the IF/ID pipeline register is preserved. This needs additional control lines for the IF/ID register and for the PC.
- The bubble is inserted by setting the pipelined control signals in the ID/EX pipeline register to 0, so we need a way to change the values of the control lines.
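The load-use check performed by the hazard detection unit can be written as a small predicate. This is a sketch under the conditions stated on the slide; the function name `hazard_detection` and the dictionary keys are hypothetical stand-ins for the PC write enable, the IF/ID write enable, and the select of the multiplexer that zeroes the control lines.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Detect a load-use hazard between the load in EX and the
    instruction being decoded, and derive the stall controls."""
    stall = bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
    return {
        "PCWrite": not stall,      # hold the PC during a stall
        "IF_ID_Write": not stall,  # keep the instruction in IF/ID
        "zero_controls": stall,    # insert a bubble into ID/EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: $2 is the load target.
signals = hazard_detection(True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(signals["zero_controls"])  # True
```

Only the one instruction directly behind the load needs this check; by the time the instruction two slots behind reaches EX, the loaded word is available from MEM/WB through forwarding.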

28 Pipeline Changes to do Hazard Detection
[Figure: hazard detection unit; labeled inputs: ID/EX.RegisterRt and the instruction source registers.]

Pipeline Changes to do Hazard Detection
[Figure: hazard detection unit; labeled outputs: stall by zeroing all 9 control lines, PCWrite, IF/IDWrite.]

29 Pipeline stalling example - before <3> completes
[Figure: datapath state.]

Pipeline stalling example - before <2> completes
[Figure: datapath state; annotation: hazard is detected.]

30 Pipeline stalling example - before <1> completes
[Figure: datapath state; annotations: PCWrite is asserted; bubble inserted; IF/IDWrite is asserted; registers continue to be read.]

Pipeline stalling example - lw completes
[Figure: datapath state; annotation: the forwarding unit sets the source multiplexer to use the value from the WB register.]


31 Pipeline stalling example - bubble completes
[Figure: datapath state; annotation: the forwarding unit sets the source multiplexer to use the value from the EX/MEM register.]

Example
Consider executing the following code:
  add $5, $6, $7
  lw $6, 100($7)
  sub $7, $6, $8
- How many cycles will it take to execute the code?
- Draw a diagram that illustrates the dependencies that need to be resolved.
[Figure: multi-clock-cycle diagram of the three instructions, with a time axis extending to CC 8.]

32 Example - continued
Draw a diagram that illustrates how the code will actually be executed (incorporating any stalls or forwarding needed to solve the identified problems).
[Figure: multi-clock-cycle diagram with a time axis extending to CC 8: add, then lw $6,100($7), then a one-cycle stall followed by forwarding into sub $7, $6, $8.]

MIPS Pipeline Control Path Modifications
[Figure: datapath with the branch decision made in the MEM stage.]
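The cycle count for this example can be reproduced with a simple model. This is a sketch under stated assumptions: a five-stage pipeline with full forwarding, no branches, and exactly one stall cycle whenever an instruction uses the result of the lw directly before it. The tuple encoding of instructions and the name `count_cycles` are hypothetical.

```python
def count_cycles(program, stages=5):
    """Cycles to run `program` on an ideal forwarding pipeline.

    Each instruction is (opcode, destination register, source registers).
    The only stall modeled is the one-cycle load-use stall.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    # Fill the pipeline once, then one instruction per cycle, plus stalls.
    return stages + len(program) - 1 + stalls


program = [
    ("add", 5, (6, 7)),  # add $5, $6, $7
    ("lw", 6, (7,)),     # lw  $6, 100($7)
    ("sub", 7, (6, 8)),  # sub $7, $6, $8  (uses the loaded $6)
]
print(count_cycles(program))  # 8
```

The result of 8 cycles is consistent with the diagrams in this example, whose time axis runs to CC 8. Note that add writes $5 and lw reads $7, so there is no dependence between the first two instructions.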

33 Pipeline Changes to accommodate Control Hazards
- Control hazards are due to branches and to exceptions (I/O interrupts, requests from the OS, overflow, or an unknown instruction).
- Branch hazards occur less frequently than data hazards, and a branch is detected in the MEM stage of the pipeline.
- If we assume branch not taken, the three instructions following a branch that is taken will already be in the pipeline and need to be flushed.
[Figure: branch detected in CC4 for the sequence
  40 beq $1,$3,7
  44 and $12,$2,$5
  48 or $13,$6,$2
  52 add $14,$2,$
  lw $4,50($7)]

Pipeline Changes to accommodate Branch Hazards
- The pipeline throughput can be improved by moving the decision on whether the branch is taken to the Decode stage of the pipeline. Then, if the branch is taken, only one instruction needs to be flushed (discarded): the instruction immediately after the branch.
- Thus we need a new logic circuit that compares the register file outputs.
- Since the decision is made in the decode stage, the branch address also needs to be computed in the decode stage, in case the branch is to be taken.
- Thus we need a new adder in the decode stage, as well as an IF Flush control line to flush the IF/ID pipeline register.

34 Pipeline Changes to accommodate Branch Hazards
[Figure: datapath with branch resolution in the Decode stage; annotations: Branch; switch to branch address; compute branch address; check for equality.]

Pipelined branch example - before <2> completes
[Figure: datapath state; annotations: PC-relative branch, *4=72; Branch; IF Flush; flushing means the instruction field is 0s.]
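The PC-relative target in this example (a beq at address 40 with offset 7, continuing at address 72) follows from the usual MIPS rule: sign-extend the 16-bit offset, shift it left by two, and add it to PC + 4. A minimal sketch of that arithmetic, with hypothetical helper names `sign_extend16` and `branch_target`:

```python
def sign_extend16(immediate):
    """Interpret the low 16 bits of `immediate` as a signed value."""
    immediate &= 0xFFFF
    return immediate - 0x10000 if immediate & 0x8000 else immediate


def branch_target(pc, immediate):
    """Branch target = (PC + 4) + (sign-extended offset << 2)."""
    return pc + 4 + (sign_extend16(immediate) << 2)


print(branch_target(40, 7))  # 72
```

The shift by two converts the word offset into a byte offset (7 * 4 = 28), which is why the adder in the datapath is preceded by a shift-left unit.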


35 Pipelined branch example - before <1> completes
[Figure: datapath state.]

Pipeline Changes to accommodate Branch Hazards
The above scheme will fail if we have the following series of instructions:
  36 add $1, $6, $7
  40 beq $1, $3,
  and $12, $2, $5
  72 lw
This is because the correct value of register $1 is not yet in the decode stage (in the register file) at the time the comparator needs it. The pipeline needs to be stalled, and the value of $1 needs to be forwarded from the EX/MEM pipeline register.

36 Pipeline Changes to accommodate Branch Hazards
[Figure: pipeline diagram for 36 add $1, $6, $7; 40 beq $1, $3,; and $12, $2, $5; 72 lw.]

Pipeline Changes to accommodate Branch Hazards
[Figure: pipeline diagram for 36 add $1, $6, $7; stall; 40 beq $1, $3, 28; flush; 72 lw.]

37 Pipeline Changes to accommodate Branch Hazards
Example: how can the following code be modified to make use of a delayed branch slot?
  Loop: lw $2, 100($3)
        addi $3, $3, 4
        beq $3, $4, Loop
- We cannot put addi after the beq, since it modifies register $3.
- We cannot just put lw after the beq, since register $3 has changed by then.
First we rewrite the code as
  Loop: addi $3, $3, 4
        lw $2, 96($3)
        beq $3, $4, Loop
Then we can move the lw after the beq:
  Loop: addi $3, $3, 4
        beq $3, $4, Loop
        lw $2, 96($3)

38 Example 2
Consider the pipelined datapath that does not accommodate branch hazards. Can an attempt to flush and an attempt to stall occur simultaneously? You may want to consider the following code sequence to help you answer this question:
          beq $1, $2, TARGET   # assume the branch is taken
          lw $3, 40($4)
          add $3, $3, $3
          sw $3, 40($4)
  TARGET: or $10, $11, $12
If the beq is resolved in the MEM stage and the branch is taken, it requires a flush of the IF/ID pipeline register (which means the register needs to be written to) and a change of the PC to the branch address; this happens in clock cycle 4.

Example 2 - continued
At the same time, a hazard is detected between lw and the next instruction (add), which depends on it because $3 is used as a source register. Thus the hazard detection unit issues a stall, and requests that the PC and the IF/ID register not be written to.
The answer is YES, a flush and a stall are issued simultaneously.
- If there are any conflicting actions, which should take priority? Flush should take priority.
- Is there a simple change you can make to the datapath to ensure the necessary priority?


39 Example 2 - continued
The hazard detection unit should be changed to see the RegWrite signal in the execution stage after it goes through the MUX used to flush the pipeline.
[Figure annotation: RegWrite.]

Dynamic branch prediction
- The static scheme, which predicts that a branch will not be taken and then flushes if it was taken, works for simple pipelines, but it wastes performance in aggressively pipelined architectures (such as the multiple-issue Pentium IV).
- One approach is to have a branch prediction buffer (a small memory unit indexed by the lower portion of the address of the branch instruction). It contains a bit that says whether the branch was recently taken or not.
- The value of the prediction bit is inverted if the prediction turned out to be wrong.
- When the branch is almost always taken, this 1-bit predictor will predict wrong twice (at the start and at the end of the run of branches).
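The "wrong twice" behaviour of the 1-bit predictor can be seen in a short simulation. This is an illustrative sketch for a single branch; the outcome sequence (a loop branch taken nine times and then not taken, executed twice) is made up for the example, and the function name is hypothetical.

```python
def one_bit_mispredictions(outcomes, initial_prediction=False):
    """Count mispredictions of a 1-bit predictor for one branch.

    `outcomes` is a sequence of booleans (True = taken). The single
    prediction bit is inverted whenever the prediction was wrong.
    """
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken  # invert the bit after a wrong guess
    return misses


# Loop branch: taken 9 times, then falls through; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
print(one_bit_mispredictions(outcomes))  # 4
```

Each run of the loop costs two mispredictions: one on the first iteration (the bit still says "not taken" from the previous exit) and one on the exit itself.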


40 Dynamic branch prediction
- A better approach is to use a two-bit scheme, which must be wrong twice to change the direction of prediction.
- The branch prediction is stored in a special buffer which is accessed with the beq instruction in the IF stage. If the beq is predicted as taken, then fetching begins from the target once the beq is in ID.
- [Figure: two-bit predictor state diagram with Taken and Not Taken transitions.]
- Further optimization is possible with a global predictor, which takes into consideration the global behavior of recently executed branches. Each branch has two predictors, and a tournament predictor keeps track of them and favors the one that has been more accurate.

Dynamic branch prediction with compiler optimization
Furthermore, compilers place instructions that always execute in the delay slot.
[Figure annotations: for mostly taken branches; best choice.]

41 Pipeline Changes to accommodate Exceptions
- Overflow is discovered at the end of the execute stage, when the ALU sends a signal to the control unit.
- Following notification of an overflow, the control unit has to flush the two instructions that followed the one causing the overflow. These instructions are now in the IF and ID stages of the pipeline.
- Thus we add an input to the MUX in the ID stage that zeroes the control signals, using an ID.Flush signal.
[Figure signals: ID.Flush, IF.Flush, Overflow.]

Pipeline Changes to accommodate Exceptions
- The instruction that causes the overflow (which is detected in the EX stage) needs to be flushed from the pipeline. This means that an EX.Flush signal needs to be sent to two multiplexers to zero the control signals for the last two stages of the pipeline.
- Overflow is only one of many possible exception causes. The cause is stored in the Cause register, with codes such as:
  4   address error exception (load)
  5   address error exception (store)
  10  unknown or reserved instruction
  12  arithmetic overflow
  15  floating point exception

42 Pipeline Changes to accommodate Exceptions
- An additional input is added to the PC MUX that sends a fixed hex address to the PC (the system-reserved memory address for overflow).
- The address of the instruction following the offending instruction is saved in the Exception Program Counter (EPC) register, and the cause in the Cause register.
- If there are multiple exceptions, their causes are stored in the Cause register, so that hardware can interrupt based on later exceptions once the earliest exception has been serviced.
- In case of an I/O interrupt, execution jumps to the system routine needed to deal with the I/O, followed by a return to the address stored in the EPC so the program can complete.
- The OS responds to an exception either by terminating the process that caused the exception or by performing some action. A process whose exception is due to an unimplemented instruction is killed by the OS.

Pipeline Changes to accommodate Overflow
[Figure signals: Branch, EX.Flush, Overflow.]

43 Pipeline Changes to accommodate Unknown Instruction
[Figure signals: Branch, EX.Flush (LOW).]

Pipelined exception example: and completes
[Figure: datapath state; annotations: Overflow; 50 add causes an overflow.]

[Figure labels from the "or completes" exception example: hex values 80000180 and 80000184.]


44 Pipelined exception example - or completes
[Figure: datapath state; annotation: OS instruction fetched.]

Pipelining Speed-ups
- One way to speed up pipelines is to have more stages (up to eight), which results in shorter clock cycles.
- Another way is superscalar architectures, which have a CPI of less than 1. Multiple instructions can be launched at the same time (multiple issue), so the instruction execution rate exceeds the clock rate. We then talk of the number of instructions per clock cycle (IPC) instead of CPI. Architectures try to issue 3 to 8 instructions every clock cycle.
- A third way is to balance load through dynamic pipeline scheduling, to avoid hazards (stalls).
- The price for these speed-ups is more hardware, more complicated control, and a more complicated instruction execution model.
- If instructions are launched in pairs, only the first instruction is launched if dynamic conditions are not met.

45 Static Multiple Issue
- Used in embedded processors and VLIW processors.
- Can improve performance by up to 200%.
- Layout is restricted to simplify decoding and instruction issue.
- Instructions are issued in pairs, aligned on a 64-bit boundary, with the ALU and branch portion operating first. If one instruction of the pair cannot be used, it is replaced by a no-op.
- The hardware detects data hazards and generates stalls between two issue packets, but the compiler is required to avoid all dependencies within an instruction pair.
- A load will cause the next two instructions to stall if they use the loaded word.
[Figure: multiple-issue pipeline diagram with a time axis extending to CC 8; instructions shown: add, lw, beq, sw, sub, lw.]

46 Static two-issue datapath
We need two output ports for the instruction memory, two more read ports and one more write port for the register file, two ALUs (one handles address computation for data memory access), and two sign-extending units.

Three Primary Units of Dynamically Scheduled Pipeline
Dynamic pipeline scheduling chooses which instruction to execute next, re-ordering instructions to avoid stalls.
[Figure annotations:]
- Buffer holding all the operands and the operation.
- Results are sent to other reservation stations or to the commit unit.
- The commit unit buffers results until it is safe to put them in the register file or in data memory (store).
- The commit unit also serves as a forwarding station for operands that are needed before they have been written back to the register file.


47 AMD Opteron X4
- 12-stage pipeline.
- Speculative pipeline that executes 3 instructions per clock cycle.
- Register renaming removes antidependencies.
- In case of incorrect speculation, the mapping between architectural and physical registers is undone.
[Figure annotations: memory address calculation; actual memory access.]

Intel Core pipeline
- Each core can execute 4 instructions simultaneously.
- A Core Duo can execute 8 instructions simultaneously.
- Better branch prediction.
- Enhanced
- Less power consumption.
