Part IV Data Path and Control. Feb Computer Architecture, Data Path and Control Slide 1


Suzanna Conley
6 years ago

1 Part IV: Data Path and Control

2 About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. Behrooz Parhami
First edition: released July 2003; revised July 2004, July 2005, Mar. 2006, Feb. 2007, Feb. 2008, Feb. 2009, and again in a later Feb.

3 A Few Words About Where We Are Headed
Performance = 1 / Execution time, simplified to 1 / CPU execution time
CPU execution time = Instructions × CPI / (Clock rate)
Performance = Clock rate / (Instructions × CPI)
Try to achieve CPI = 1 with a clock that is as high as that for CPI > 1 designs; is CPI < 1 feasible? (Chap 15-16)
Design memory & I/O structures to support ultrahigh-speed CPUs (Chap 17-24)
Define an instruction set; make it simple enough to require a small number of cycles and allow a high clock rate, but not so simple that we need many instructions, even for very simple tasks (Chap 5-8)
Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14)
Design ALU for arithmetic & logic ops (Chap 9-12)
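The performance equation on this slide can be checked numerically. The sketch below is only an illustration: the function names and the one-million-instruction program are assumptions, while the 125 MHz / CPI = 1 and 500 MHz / CPI = 4 design points are the ones quoted later in the deck.

```python
def cpu_time_seconds(instructions, cpi, clock_hz):
    """CPU execution time = Instructions x CPI / Clock rate."""
    return instructions * cpi / clock_hz


def performance(instructions, cpi, clock_hz):
    """Performance = Clock rate / (Instructions x CPI), in programs per second."""
    return clock_hz / (instructions * cpi)


# Hypothetical program of one million instructions.
N = 1_000_000
single_cycle = cpu_time_seconds(N, cpi=1, clock_hz=125e6)
multi_cycle = cpu_time_seconds(N, cpi=4, clock_hz=500e6)
```

Both design points give the same CPU time here, which is why raising the clock rate alone does not help unless CPI stays low.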

4 IV Data Path and Control
Design a simple computer (MicroMIPS) to learn about:
Data path: the part of the CPU where data signals flow
Control unit: guides data signals through the data path
Pipelining: a way of achieving greater performance
Topics in This Part
Chapter 13 Instruction Execution Steps
Chapter 14 Control Unit Synthesis
Chapter 15 Pipelined Data Paths
Chapter 16 Pipeline Performance Limits

5 13 Instruction Execution Steps
A simple computer executes instructions one at a time:
Fetches an instruction from the location pointed to by PC
Interprets and executes the instruction, then repeats
Topics in This Chapter
13.1 A Small Set of Instructions
13.2 The Instruction Execution Unit
13.3 A Single-Cycle Data Path
13.4 Branching and Jumping
13.5 Deriving the Control Signals
13.6 Performance of the Single-Cycle Design

6 13.1 A Small Set of Instructions
Fig. 13.1 MicroMIPS instruction formats and naming of the various fields (each instruction is 32 bits).
R format: op (opcode, 6 bits), rs (source 1 or base, 5 bits), rt (source 2 or destination, 5 bits), rd (destination, 5 bits), sh (unused, 5 bits), fn (opcode extension, 6 bits)
I format: op (6 bits), rs (5 bits), rt (5 bits), imm (operand or offset, 16 bits)
J format: op (6 bits), jta (jump target address, 26 bits)
Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor)
Six I-format ALU instructions (lui, addi, slti, andi, ori, xori)
Two I-format memory access instructions (lw, sw)
Three I-format conditional branch instructions (bltz, beq, bne)
Four unconditional jump instructions (j, jr, jal, syscall)
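The field layout can be made concrete with a small decoder. This is a minimal sketch assuming the standard MIPS bit positions (op in bits 31:26, rs in 25:21, rt in 20:16, rd in 15:11, sh in 10:6, fn in 5:0); the function name `decode_fields` and the sample word are illustrative and do not come from the slides.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into every field of the R, I, and J formats.

    All fields are extracted unconditionally; which ones are meaningful
    depends on the format selected by op.
    """
    return {
        "op": (word >> 26) & 0x3F,   # 6-bit opcode
        "rs": (word >> 21) & 0x1F,   # 5-bit source or base register
        "rt": (word >> 16) & 0x1F,   # 5-bit source or destination register
        "rd": (word >> 11) & 0x1F,   # 5-bit destination register (R format)
        "sh": (word >> 6) & 0x1F,    # 5-bit shift amount, unused in MicroMIPS
        "fn": word & 0x3F,           # 6-bit opcode extension (R format)
        "imm": word & 0xFFFF,        # 16-bit operand or offset (I format)
        "jta": word & 0x3FFFFFF,     # 26-bit jump target address (J format)
    }


# Sample word chosen for illustration: an R-format pattern with rs=16, rt=17, rd=8.
fields = decode_fields(0x02114020)
```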

7 The MicroMIPS Instruction Set
Table 13.1 (class, instruction, usage)
Copy
  Load upper immediate      lui rt,imm
Arithmetic
  Add                       add rd,rs,rt
  Subtract                  sub rd,rs,rt
  Set less than             slt rd,rs,rt
  Add immediate             addi rt,rs,imm
  Set less than immediate   slti rt,rs,imm
Logic
  AND                       and rd,rs,rt
  OR                        or rd,rs,rt
  XOR                       xor rd,rs,rt
  NOR                       nor rd,rs,rt
  AND immediate             andi rt,rs,imm
  OR immediate              ori rt,rs,imm
  XOR immediate             xori rt,rs,imm
Memory access
  Load word                 lw rt,imm(rs)
  Store word                sw rt,imm(rs)
Control transfer
  Jump                      j L
  Jump register             jr rs
  Branch less than 0        bltz rs,L
  Branch equal              beq rs,rt,L
  Branch not equal          bne rs,rt,L
  Jump and link             jal L
  System call               syscall

8 13.2 The Instruction Execution Unit
Fig. 13.2 Abstract view of the instruction execution unit for MicroMIPS (Harvard architecture). The PC and next-address logic feed the instruction fetch; the fields rs, rt, rd, imm, and jta are routed to the data path, and op and fn drive the control unit. For naming of instruction fields, see Fig. 13.1.

9 13.3 A Single-Cycle Data Path
Fig. 13.3 Key elements of the single-cycle MicroMIPS data path. Steps shown: instruction fetch, register access and decode, ALU operation, data access, register writeback. Control signals: Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc.

10 An ALU for MicroMIPS
Fig. 10.19 A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation.
Function class: lui (in place of shift), set less, arithmetic, logic
Shift function: no shift, logical left, logical right, arithmetic right; shift amount is constant or variable (5 LSBs)
Arithmetic: Add'Sub selects add or subtract
Logic function: AND, OR, XOR, NOR
Outputs: result s, Zero, Ovfl

11 13.4 Branching and Jumping
Update options for PC:
(PC)31:2 + 1          Default option
(PC)31:2 + 1 + imm    When instruction is branch and condition is met
(PC)31:28 | jta       When instruction is j or jal
(rs)31:2              When the instruction is jr
SysCallAddr           Start address of an operating system routine
The lowest 2 bits of PC are always 00.
Fig. 13.4 Next-address logic for MicroMIPS (see top part of Fig. 13.3). A branch condition checker compares (rs) and (rt) under control of BrType and produces BrTrue; PCSrc selects among the update options.
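The PC update options can be sketched in software. The code below is an illustration under stated assumptions: the function name and argument names are hypothetical, the numeric PCSrc encoding (0 = IncrPC, 1 = jta, 2 = (rs), 3 = SysCallAddr) follows the order in which Table 13.2 lists the options, and addresses are handled as byte addresses whose lowest 2 bits are 00.

```python
MASK30 = (1 << 30) - 1  # the PC holds a 30-bit word address, (PC)31:2


def next_pc(pc, pc_src, br_true=False, imm16=0, jta26=0, rs=0, syscall_addr=0):
    """Return the next byte address of the PC for one MicroMIPS instruction."""
    pc_word = pc >> 2
    incr = (pc_word + 1) & MASK30              # default: (PC)31:2 + 1
    if pc_src == 0:
        if br_true:                            # taken branch adds sign-extended imm
            simm = imm16 - (1 << 16) if imm16 & 0x8000 else imm16
            nxt = (incr + simm) & MASK30
        else:
            nxt = incr
    elif pc_src == 1:                          # j, jal: (PC)31:28 | jta
        nxt = ((pc_word >> 26) << 26) | (jta26 & 0x3FFFFFF)
    elif pc_src == 2:                          # jr: (rs)31:2
        nxt = (rs >> 2) & MASK30
    else:                                      # syscall
        nxt = (syscall_addr >> 2) & MASK30
    return nxt << 2                            # restore the two 0 bits
```

Note that the branch offset is in words and is added to the incremented PC, so an offset of −1 branches back to the branch instruction itself.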

12 13.5 Deriving the Control Signals
Table 13.2 Control signals for the single-cycle MicroMIPS implementation (meaning of each setting 0, 1, 2, 3).
Reg file
  RegWrite:             0 = Don't write, 1 = Write
  RegDst1, RegDst0:     0 = rt, 1 = rd, 2 = $31
  RegInSrc1, RegInSrc0: 0 = Data out, 1 = ALU out, 2 = IncrPC
ALU
  ALUSrc:               0 = (rt), 1 = imm
  Add'Sub:              0 = Add, 1 = Subtract
  LogicFn1, LogicFn0:   0 = AND, 1 = OR, 2 = XOR, 3 = NOR
  FnClass1, FnClass0:   0 = lui, 1 = Set less, 2 = Arithmetic, 3 = Logic
Data cache
  DataRead:             0 = Don't read, 1 = Read
  DataWrite:            0 = Don't write, 1 = Write
Next addr
  BrType1, BrType0:     0 = No branch, 1 = beq, 2 = bne, 3 = bltz
  PCSrc1, PCSrc0:       0 = IncrPC, 1 = jta, 2 = (rs), 3 = SysCallAddr

13 Control Signal Settings
Table 13.3 Control signal settings for each MicroMIPS instruction.
Rows: load upper immediate, add, subtract, set less than, add immediate, set less than immediate, AND, OR, XOR, NOR, AND immediate, OR immediate, XOR immediate, load word, store word, jump, jump register, branch on less than 0, branch on equal, branch on not equal, jump and link, system call.
Columns: op, fn, RegWrite, RegDst, RegInSrc, ALUSrc, Add'Sub, LogicFn, FnClass, DataRead, DataWrite, BrType, PCSrc.

14 Control Signals in the Single-Cycle Data Path
Fig. 13.3 Key elements of the single-cycle MicroMIPS data path, annotated with the control signals PCSrc, BrType, RegDst, RegWrite, ALUSrc, ALUFunc (Add'Sub, LogicFn, FnClass), DataRead, DataWrite, and RegInSrc.

15 Instruction Decoding
Fig. 13.5 Instruction decoder for MicroMIPS built of two 6-to-64 decoders.
op decoder outputs: RtypeInst, bltzInst, jInst, jalInst, beqInst, bneInst, addiInst, sltiInst, andiInst, oriInst, xoriInst, luiInst, lwInst, swInst
fn decoder outputs (enabled for R-type instructions): jrInst, syscallInst, addInst, subInst, andInst, orInst, xorInst, norInst, sltInst

16 Control Signal Settings: Repeated for Reference
Table 13.3 (repeated from slide 13).

17 Control Signal Generation
Auxiliary signals identifying instruction classes:
arithInst = addInst ∨ subInst ∨ sltInst ∨ addiInst ∨ sltiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
immInst = luiInst ∨ addiInst ∨ sltiInst ∨ andiInst ∨ oriInst ∨ xoriInst
Example logic expressions for control signals:
RegWrite = luiInst ∨ arithInst ∨ logicInst ∨ lwInst ∨ jalInst
ALUSrc = immInst ∨ lwInst ∨ swInst
Add'Sub = subInst ∨ sltInst ∨ sltiInst
DataRead = lwInst
PCSrc0 = jInst ∨ jalInst ∨ syscallInst
The control block takes the decoder outputs (addInst, subInst, jInst, ..., sltInst) as inputs.
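The OR expressions on this slide translate directly into set-membership tests. The sketch below is illustrative only: it models the decoder outputs as an instruction mnemonic string, and the function name `control_signals` is an assumption. Only the five signals whose expressions appear on the slide are computed.

```python
ARITH = {"add", "sub", "slt", "addi", "slti"}
LOGIC = {"and", "or", "xor", "nor", "andi", "ori", "xori"}
IMM = {"lui", "addi", "slti", "andi", "ori", "xori"}


def control_signals(inst):
    """Evaluate the example control-signal expressions for one mnemonic."""
    arith_inst = inst in ARITH
    logic_inst = inst in LOGIC
    imm_inst = inst in IMM
    return {
        "RegWrite": inst == "lui" or arith_inst or logic_inst or inst in ("lw", "jal"),
        "ALUSrc": imm_inst or inst in ("lw", "sw"),
        "AddSub": inst in ("sub", "slt", "slti"),   # 1 selects subtraction
        "DataRead": inst == "lw",
        "PCSrc0": inst in ("j", "jal", "syscall"),
    }


lw_signals = control_signals("lw")
```

In hardware each expression is a single OR gate over decoder outputs, which is why the single-cycle control unit is memoryless.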

18 Putting It All Together
Composite of Fig. 10.19 (the ALU), Fig. 13.3 (the single-cycle data path), and Fig. 13.4 (the next-address logic), with the control block driven by the decoder outputs (addInst, subInst, jInst, ..., sltInst).

19 Performance Estimation for Single-Cycle MicroMIPS
Instruction access    2 ns
Register read         1 ns
ALU operation         2 ns
Data cache access     2 ns
Register write        1 ns
Total                 8 ns
Single-cycle clock = 125 MHz
Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. Paths are shown for ALU-type, load, store, branch (and jr), and jump (except jr & jal) instructions, with unused blocks marked "Not used".
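The clock rate follows from the longest path through the data path. The sketch below assumes the component latencies of 2, 1, 2, 2, and 1 ns (they sum to the 8 ns total on the slide) and assumes that a load exercises every block, so its path sets the cycle time; the variable names are illustrative.

```python
# Assumed latency of each step, in nanoseconds.
latencies_ns = {
    "instruction access": 2,
    "register read": 1,
    "ALU operation": 2,
    "data cache access": 2,
    "register write": 1,
}

# A load uses every block, so the clock period must cover the full sum.
cycle_ns = sum(latencies_ns.values())
clock_mhz = 1000 / cycle_ns   # 1 / (cycle in ns) gives GHz; times 1000 gives MHz
```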

20 How Good is Our Single-Cycle Design?
Clock rate of 125 MHz is not impressive.
How does this compare with current processors on the market? Not bad, where latency is concerned.
Instruction access 2 ns, register read 1 ns, ALU operation 2 ns, data cache access 2 ns, register write 1 ns; total 8 ns; single-cycle clock = 125 MHz.
A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 ns/cycle × 20 cycles = 8 ns.
Throughput, however, is much better for the pipelined processor:
Up to 20 times better with single issue
Perhaps up to 100 times better with multiple issue

21 14 Control Unit Synthesis
The control unit for the single-cycle design is memoryless.
Problematic when instructions vary greatly in complexity
Multiple cycles needed when resources must be reused
Topics in This Chapter
14.1 A Multi-Cycle Implementation
14.2 Choosing the Clock Cycle
14.3 The Control State Machine
14.4 Performance of the Multi-Cycle Design

22 14.1 A Multi-Cycle Implementation
Analogy: appointment book for a dentist, scheduled single-cycle versus multicycle. Assume the longest treatment takes one hour.

23 Single-Cycle vs. Multi-Cycle MicroMIPS
Fig. 14.1 Single-cycle versus multi-cycle instruction execution. In the single-cycle design every instruction is allotted the same long clock period regardless of the time it needs; in the multi-cycle example the four instructions take 3, 5, 3, and 4 short cycles, and the difference is time saved.

24 A Multi-Cycle Data Path
Fig. 14.2 Abstract view of a multi-cycle instruction execution unit for MicroMIPS (von Neumann, or Princeton, architecture: a single cache holds instructions and data). Registers x, y, and z hold values between cycles. For naming of instruction fields, see Fig. 13.1.

25 Multi-Cycle Data Path with Control Signals Shown
Three major changes relative to the single-cycle data path:
1. Instruction & data caches combined
2. ALU performs double duty for address calculation
3. Registers added for intercycle data
Fig. 14.3 Key elements of the multi-cycle MicroMIPS data path. Control signals: Inst'Data, MemRead, MemWrite, IRWrite, PCWrite, RegDst, RegInSrc, RegWrite, ALUSrcX, ALUSrcY, ALUFunc, JumpAddr, PCSrc.

26 Execution Cycles
Table 14.1 Execution cycles for multi-cycle MicroMIPS (cycle, instruction class, operations, signals involved).
Cycle 1, Fetch & PC incr
  Any: Read out the instruction and write it into the instruction register, increment PC. Signals: Inst'Data, MemRead, IRWrite, ALUSrcX, ALUSrcY, ALUFunc = '+', PCSrc = 3, PCWrite
Cycle 2, Decode & reg read
  Any: Read out rs & rt into x & y registers, compute branch address and save in z register. Signals: ALUSrcX, ALUSrcY = 3, ALUFunc = '+'
Cycle 3, ALU oper & PC update
  ALU type: Perform ALU operation and save the result in z register. Signals: ALUSrcX, ALUSrcY, ALUFunc (varies)
  Load/Store: Add base and offset values, save in z register. Signals: ALUSrcX, ALUSrcY, ALUFunc = '+'
  Branch: If (x reg) =, ≠, or < (y reg), set PC to branch target address. Signals: ALUSrcX, ALUSrcY, ALUFunc, PCSrc, PCWrite = ALUZero, ALUZero', or ALUOut31
  Jump: Set PC to the target address jta, SysCallAddr, or (rs). Signals: JumpAddr, PCSrc, PCWrite
Cycle 4, Reg write or mem access
  ALU type: Write back z reg into rd. Signals: RegDst, RegInSrc, RegWrite
  Load: Read memory into data reg. Signals: Inst'Data, MemRead
  Store: Copy y reg into memory. Signals: Inst'Data, MemWrite
Cycle 5, Reg write for lw
  Load: Copy data register into rt. Signals: RegDst, RegInSrc, RegWrite

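27 14.3 The Control State Machine
Fig. 14.4 The control state machine for multi-cycle MicroMIPS.
Cycle 1, State 0 (start): Inst'Data, MemRead, IRWrite, ALUSrcX, ALUSrcY, ALUFunc = '+', PCSrc = 3, PCWrite
Cycle 2, State 1: ALUSrcX, ALUSrcY = 3, ALUFunc = '+'
Cycle 3, State 5 (jump/branch): ALUSrcX, ALUSrcY, ALUFunc, JumpAddr = %, PCSrc = @, PCWrite = #
Cycle 3, State 2 (lw/sw): ALUSrcX, ALUSrcY, ALUFunc = '+'
Cycle 3, State 7 (ALU type): ALUSrcX, ALUSrcY, ALUFunc = varies
Cycle 4, State 6 (sw): Inst'Data, MemWrite
Cycle 4, State 3 (lw): Inst'Data, MemRead
Cycle 4, State 8 (ALU type): RegDst, RegInSrc, RegWrite
Cycle 5, State 4 (lw): RegDst, RegInSrc, RegWrite
Transitions: 0 → 1; 1 → 5 (jump or branch), 2 (lw or sw), or 7 (ALU type); 2 → 3 (lw) or 6 (sw); 3 → 4; 7 → 8; states 4, 5, 6, and 8 return to state 0.
Notes for State 5: % (JumpAddr) distinguishes j or jal from syscall and is a don't-care for other instructions; @ (PCSrc) takes one value for j, jal, and syscall, another for jr, and a third for branches; # (PCWrite) is 1 for j, jr, jal, and syscall, ALUZero (its complement) for beq (bne), and bit 31 of ALUout for bltz. For jal, RegDst, RegInSrc, and RegWrite are also set so that the return address is saved.
Note for State 7: ALUFunc is determined based on the op and fn fields.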
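The state transitions of the control state machine can be traced in a few lines. This is a minimal sketch: it assumes the three unnumbered states in cycles 1 to 3 are states 0, 1, and 2, and it models only the transitions (not the signal values); the instruction-class labels and function names are hypothetical.

```python
def next_state(state, instr_class):
    """Next control state for one of: 'alu', 'lw', 'sw', 'branch', 'jump'."""
    if state == 0:                      # fetch
        return 1
    if state == 1:                      # decode and register read
        return {"jump": 5, "branch": 5, "lw": 2, "sw": 2, "alu": 7}[instr_class]
    if state == 2:                      # address calculation
        return 3 if instr_class == "lw" else 6
    if state == 3:                      # memory read for lw
        return 4
    if state == 7:                      # ALU operation
        return 8
    return 0                            # states 4, 5, 6, 8 go back to fetch


def states_visited(instr_class):
    """List the states one instruction passes through, one per clock cycle."""
    sequence = [0]
    while True:
        following = next_state(sequence[-1], instr_class)
        if following == 0:
            return sequence
        sequence.append(following)


cycles = {c: len(states_visited(c)) for c in ("alu", "lw", "sw", "branch", "jump")}
```

The resulting cycle counts (4, 5, 4, 3, 3) match the per-class cycle counts used in the performance estimate for the multi-cycle design.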

28 14.4 Performance of the Multi-Cycle Design
Instruction mix and cycle counts:
R-type   44%   4 cycles
Load     24%   5 cycles
Store    12%   4 cycles
Branch   18%   3 cycles
Jump      2%   3 cycles
Contribution to CPI:
R-type   0.44 × 4 = 1.76
Load     0.24 × 5 = 1.20
Store    0.12 × 4 = 0.48
Branch   0.18 × 3 = 0.54
Jump     0.02 × 3 = 0.06
Average CPI ≅ 4.04
Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies.
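The average CPI is a weighted sum over the instruction mix. The sketch below recomputes it from the percentages and cycle counts on the slide; the 500 MHz clock used for the MIPS figure is the multi-cycle clock rate quoted elsewhere in the deck, and the variable names are illustrative.

```python
# (fraction of executed instructions, cycles per instruction)
mix = {
    "R-type": (0.44, 4),
    "Load": (0.24, 5),
    "Store": (0.12, 4),
    "Branch": (0.18, 3),
    "Jump": (0.02, 3),
}

total_fraction = sum(fraction for fraction, _ in mix.values())
average_cpi = sum(fraction * cycles for fraction, cycles in mix.values())

# With a 500 MHz clock, the instruction rate in MIPS is clock (MHz) / CPI.
mips = 500 / average_cpi
```

The result, a little under 125 MIPS, is roughly what the single-cycle design achieves at 125 MHz with CPI = 1.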

29 How Good is Our Multi-Cycle Design?
Clock rate of 500 MHz is better than the 125 MHz of the single-cycle design, but still unimpressive.
How does the performance compare with current processors on the market? Not bad, where latency is concerned.
A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 × 20 = 8 ns.
Throughput, however, is much better for the pipelined processor:
Up to 20 times better with single issue
Perhaps up to 100 times better with multiple issue
Cycle time = 2 ns; clock rate = 500 MHz
Average CPI ≅ 4.04, from R-type 44% × 4 cycles, load 24% × 5, store 12% × 4, branch 18% × 3, and jump 2% × 3.

30 15 Pipelined Data Paths
Pipelining is now used in even the simplest of processors.
Same principles as assembly lines in manufacturing
Unlike in assembly lines, instructions are not independent
Topics in This Chapter
15.1 Pipelining Concepts
15.2 Pipeline Stalls or Bubbles
15.3 Pipeline Timing and Performance
15.4 Pipelined Data Path Design
15.5 Pipelined Control


32 Single-Cycle Data Path of Chapter 13
Clock rate = 125 MHz, CPI = 1 (125 MIPS)
Fig. 13.3 Key elements of the single-cycle MicroMIPS data path.

33 Multi-Cycle Data Path of Chapter 14
Clock rate = 500 MHz, CPI ≅ 4 (≅ 125 MIPS)
Fig. 14.3 Key elements of the multi-cycle MicroMIPS data path.

34 Getting the Best of Both Worlds
Single-cycle: clock rate = 125 MHz, CPI = 1
Multi-cycle: clock rate = 500 MHz, CPI ≅ 4
Pipelined: clock rate = 500 MHz, CPI ≅ 1
Single-cycle analogy: doctor appointments scheduled for 60 min per patient
Multi-cycle analogy: doctor appointments scheduled in 15-min increments

35 15.1 Pipelining Concepts
Strategies for improving performance:
Use multiple independent data paths accepting several instructions that are read out at once: multiple-instruction-issue or superscalar
Overlap execution of several instructions, starting the next instruction before the previous one has run to completion: (super)pipelined
Fig. 15.1 Pipelining in the student registration process. Stations from start to exit: Approval, Cashier, Registrar, ID photo, Pickup.

36 Pipelined Instruction Execution
Fig. 15.2 Pipelining in the MicroMIPS instruction execution process (time dimension: cycles 1 through 9; task dimension: successive instructions).

37 Alternate Representations of a Pipeline
Except for start-up and drainage overheads, a pipeline can execute one instruction per clock tick; IPS is dictated by the clock frequency.
Stage key: f = fetch, r = register read, a = ALU op, d = data access, w = writeback
Fig. 15.3 Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions): (a) task-time diagram, (b) space-time diagram, with the start-up and drainage regions marked.
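A task-time diagram like the one in part (a) is easy to generate. The sketch below is illustrative, with a hypothetical function name; it assumes an ideal pipeline with no stalls, in which task i enters stage s in cycle i + s, so n tasks on q stages finish in n + q − 1 cycles.

```python
def task_time_diagram(n_tasks=7, stages="fradw"):
    """Return (total cycles, one text row per task) for an ideal pipeline."""
    q = len(stages)
    total_cycles = n_tasks + q - 1          # start-up plus drainage overhead
    rows = []
    for task in range(n_tasks):
        row = [" "] * total_cycles
        for stage_index, stage_name in enumerate(stages):
            row[task + stage_index] = stage_name
        rows.append("".join(row))
    return total_cycles, rows


total_cycles, rows = task_time_diagram()
for row in rows:
    print(row)
```

Reading the printed rows column by column gives the space-time view: each column shows which task occupies each stage in that cycle.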

38 15.2 Pipeline Stalls or Bubbles
First type of data dependency
  $5 = $6 + $7
  $8 = $8 + $6
  $9 = $8 + $2
  sw $9, 0($3)
Fig. 15.4 Read-after-write data dependency and its possible resolution through data forwarding.

39 Inserting Bubbles in a Pipeline
Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency: the instruction that writes into $8 is followed by three bubbles before the instruction that reads from $8.
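The count of three bubbles can be derived from the stage positions. The sketch below is a simplified model, not the deck's own method: it assumes a 5-stage pipeline in which registers are read in stage 2 and written in stage 5, no forwarding, and a read that must occur in a later cycle than the write. The function name and the `distance` parameter (1 means the reader immediately follows the writer) are hypothetical.

```python
def bubbles_needed(distance, read_stage=2, write_stage=5):
    """Bubbles required between a register writer and a later reader.

    The writer reaches write_stage (write_stage - 1) cycles after entering
    the pipeline; the reader enters `distance` cycles later and reaches
    read_stage (read_stage - 1) cycles after that. Bubbles delay the reader
    until its read falls strictly after the write.
    """
    return max(0, (write_stage - read_stage) - (distance - 1))
```

Under this model an adjacent dependent instruction needs three bubbles, and the requirement shrinks by one for each independent instruction placed in between.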

40 Second Type of Dependency
  sw $6, ...
  lw $8, ...
  $9 = $8 + $2
Possible remedies: reorder the instructions, or insert a bubble.
Without data forwarding, three bubbles are needed to resolve a read-after-load data dependency.
Fig. 15.5 Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.

41 Control Dependency in a Pipeline
  $6 = $3 + $5
  beq $1, $2, ...
  $9 = $8 + $2
Possible remedies: reorder the instructions (delayed branch), or insert a bubble.
The diagram marks the point at which the branch is assumed to be resolved; resolving it later would require 1-2 more bubbles.
Fig. 15.6 Control dependency due to conditional branch.

42 15.3 Pipeline Timing and Performance
Fig. 15.7 Pipelined form of a function unit with latching overhead: a function unit of latency t is divided into q stages of latency t/q each, and latching of results adds an overhead τ per stage.

43 Throughput Increase in a q-Stage Pipeline
Throughput improvement factor = t / (t/q + τ), or q / (1 + qτ/t)
Curves plotted against the number q of pipeline stages: ideal (τ/t = 0), τ/t = 0.05, τ/t = 0.1
Fig. 15.8 Throughput improvement due to pipelining as a function of the number of pipeline stages for different pipelining overheads.
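The throughput improvement factor can be tabulated for the three overhead ratios. This is a minimal sketch assuming the formula q / (1 + qτ/t), where t is the unpipelined latency and τ the per-stage latching overhead; the function and parameter names are illustrative.

```python
def improvement_factor(q, tau_over_t):
    """Throughput gain of a q-stage pipeline with relative latching overhead tau/t."""
    return q / (1 + q * tau_over_t)


# Gains for a few stage counts under each overhead ratio.
table = {
    ratio: [round(improvement_factor(q, ratio), 2) for q in (1, 5, 10, 20)]
    for ratio in (0.0, 0.05, 0.1)
}
```

With zero overhead the gain equals q; with any nonzero overhead the gain saturates at t/τ as q grows, which is why very deep pipelines give diminishing returns.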

44 15.4 Pipelined Data Path Design
Fig. 15.9 Key elements of the pipelined MicroMIPS data path (stages 1 through 5). Signals shown: SeqInst, Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, RegInSrc.

45 15.5 Pipelined Control
Fig. 15.10 Pipelined control signals. The control signals generated in stage 2 (SeqInst, Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, RegInSrc) travel along the pipeline with the instruction to the stages where they are used.

46 16 Pipeline Performance Limits
Pipeline performance is limited by data & control dependencies.
Hardware provisions: data forwarding, branch prediction
Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Advanced Pipelining

47 16.1 Data Dependencies and Hazards
  $2 = $1 - $3
  followed by instructions that read register $2
Fig. 16.1 Data dependency in a pipeline.

48 Resolving Data Dependencies via Forwarding
  $2 = $1 - $3
  followed by instructions that read register $2
Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.

49 Pipelined MicroMIPS: Repeated for Reference
Fig. 15.10 Pipelined control signals.

50 Certain Data Dependencies Lead to Bubbles
A lw instruction followed by instructions that read the loaded register.
Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.

51 16.2 Data Forwarding
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path. An upper and a lower forwarding unit each choose an ALU input from the register read-out in stage 2 or from the values x3, y3, x4, y4 held in stages 3 and 4, using d3, d4, RetAddr3, RegWrite3, RegWrite4, RegInSrc3, and RegInSrc4.

52 Design of the Data Forwarding Units
Let's focus on designing the upper data forwarding unit.
Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path. Inputs: RegWrite3, RegWrite4, s2matchesd3, s2matchesd4, RetAddr3, RegInSrc3, RegInSrc4. Output: the choice among x2, x3, y3, x4, and y4. (Slide note: incorrect in textbook.)
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.

53 Augmentations to Pipelined Data Path and Control
Fig. 15.10 augmented with a branch predictor, next-address forwarders, a hazard detector, ALU forwarders, and a data cache forwarder.

54 16.3 Pipeline Branch Hazards
Software-based solutions:
Compiler inserts a no-op after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions:
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction: always predict that the branch is not taken, flush if mistaken
More elaborate branch prediction strategies possible

55 16.4 Branch Prediction
Predicting whether a branch will be taken:
Always predict that the branch will not be taken
Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
Allow programmer or compiler to supply clues
Decide based on past history (maintain a small history table); to be discussed later
Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
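The first two strategies in this list are static and can be compared on a toy branch trace. The sketch below is only an illustration: the function names, the addresses, and the ten-iteration trace are assumptions, and "backward" is taken to mean a target address lower than the branch address.

```python
def predict_not_taken(branch_pc, target_pc):
    """Strategy 1: always predict that the branch is not taken."""
    return False


def predict_by_direction(branch_pc, target_pc):
    """Strategy 2: backward branches are predicted taken, forward ones not taken."""
    return target_pc < branch_pc


def accuracy(predictor, trace):
    """Fraction of (branch_pc, target_pc, taken) records predicted correctly."""
    correct = sum(predictor(pc, target) == taken for pc, target, taken in trace)
    return correct / len(trace)


# Hypothetical loop-closing branch at 0x40 jumping back to 0x20:
# taken on nine iterations, then not taken on loop exit.
loop_trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
```

On this trace the direction-based rule is right nine times out of ten, while always-not-taken is right only on the loop exit.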

56 Forward and Backward Branches
Example 5.5 List A is stored in memory beginning at the address given in $s1. List length is given in $s2. Find the largest integer in the list and copy it into $t0.
Solution: Scan the list, holding the largest element identified thus far in $t0.
      lw   $t0,0($s1)      # initialize maximum to A[0]
      addi $t1,$zero,0     # initialize index i to 0
loop: addi $t1,$t1,1       # increment index i by 1
      beq  $t1,$s2,done    # if all elements examined, quit (forward branch)
      add  $t2,$t1,$t1     # compute 2i in $t2
      add  $t2,$t2,$t2     # compute 4i in $t2
      add  $t2,$t2,$s1     # form address of A[i] in $t2
      lw   $t3,0($t2)      # load value of A[i] into $t3
      slt  $t4,$t0,$t3     # maximum < A[i]?
      beq  $t4,$zero,loop  # if not, repeat with no change (backward branch)
      addi $t0,$t3,0       # if so, A[i] is the new maximum
      j    loop            # change completed; now repeat
done: ...                  # continuation of the program

57 Hardware Implementation of Branch Prediction
Fig. 16.7 Hardware elements for a branch prediction scheme. Low-order bits of the PC are used as an index into a table holding addresses of recent branch instructions, target addresses, and history bit(s). The read-out table entry is compared with the PC, and logic chooses the next PC between the incremented PC and the stored target address.
The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18).
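The table lookup described here can be modelled in software. The sketch below is an assumption-laden illustration, not the circuit in the figure: the class name, the 16-entry table size, and the single history bit that simply records the last outcome are all hypothetical choices.

```python
class BranchPredictionTable:
    """Direct-mapped table of (branch address, target address, history bit)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        # Low-order bits of the word address select the table entry.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict_next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        # Use the stored target only if the entry belongs to this branch
        # (address compare) and its history bit says "taken".
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4                      # otherwise fall back to incremented PC

    def update(self, pc, taken, target):
        # Record the actual outcome once the branch is resolved.
        self.entries[self._index(pc)] = (pc, target, taken)


table = BranchPredictionTable()
before = table.predict_next_pc(0x400010)
table.update(0x400010, taken=True, target=0x400000)
after = table.predict_next_pc(0x400010)
```

As in a direct-mapped cache, two branches whose addresses share the same low-order bits compete for one entry; the address compare prevents one from using the other's target.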

58 16.5 Advanced Pipelining
Deep pipeline = superpipeline; also, superpipelined, superpipelining
Parallel instruction issue = superscalar, j-way issue (2-4 is typical)
Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement. Sections: instruction fetch, instruction decode, operand prep, instruction issue, function units 1 through 3 (variable number of stages), retirement & commit stages.

59 CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch prediction methods, and speculative execution on CPI (architecture; methods used in practice; CPI).
Nonpipelined, multicycle; strict in-order instruction issue and exec; 5-10
Nonpipelined, overlapped; in-order issue, with multiple function units; 3-5
Pipelined, static; in-order exec, simple branch prediction; 2-3
Superpipelined, dynamic; out-of-order exec, adv branch prediction; 1-2
Superscalar; 2- to 4-way issue, interlock & speculation; 0.5-1
Advanced superscalar; 4- to 8-way issue, aggressive speculation
Instruction rate in GIPS = (instructions per cycle) × (gigacycles per second), here with a 3 gigacycles/s clock.

60 The Shrinking Supercomputer

61 The Three Hardware Designs for MicroMIPS
Single-cycle: 125 MHz, CPI = 1
Multi-cycle: 500 MHz, CPI ≅ 4
Pipelined: 500 MHz, CPI just above 1
The three data paths are those of Fig. 13.3 (single-cycle), Fig. 14.3 (multi-cycle), and Fig. 15.10 (pipelined).

62 Where Do We Go from Here?
Memory design: how to build a memory unit that responds in 1 clock
Input and output: peripheral devices, I/O programming, interfacing, interrupts
Higher performance: vector/array processing, parallel processing
