CS356 Unit 12: Processor Hardware Organization and Pipelining


1 12.1 CS356 Unit 12 Processor Hardware Organization Pipelining

2 BASIC HW 12.2

3 12.3 Logic Circuits
Combinational logic: performs a specific function (a mapping of 2^n input combinations to desired output combinations), usually operations like +, -, *, /, &, |, <<. No internal state or feedback: given a set of inputs, we will always get the same output after some time (propagation) delay. Outputs depend only on the current inputs.
Sequential logic (storage devices): registers are the fundamental building blocks. A register remembers a set of bits for later use and acts like a variable in software; it is controlled by a "clock" signal. The stored values feed back and provide "memory," so outputs depend on current inputs and on previous inputs (with the previous inputs summarized via the stored state).

4 12.4 Combinational Logic Gates
Main Idea: circuits called "gates" perform logic operations to produce desired outputs from some digital inputs.
(Figure: a small example circuit built from AND, OR, and NOT gates.)

5 12.5 Propagation Delay
Main Idea: all digital logic circuits have propagation delay, the time it takes for the output to change after the inputs change.
(Figure: it takes a few gate delays for an input change to propagate to the outputs.)

6 12.6 Combinational Logic Functions
Map input combinations of n bits to the desired m-bit output. We can describe the function with a truth table and then find its circuit implementation.
(Figure: a 3-input truth table with inputs IN0-IN2 and its realization as a logic circuit.)

7 12.7 ALUs
Perform a selected operation on two input numbers; FS[5:0] selects the desired operation.

Func. Code  Op.            Func. Code  Op.
00_0000     A SHL B        10_0000     A + B
00_0010     A SHR B        10_0010     A - B
00_0011     A SAR B        10_0100     A AND B
01_1000     A * B          10_0101     A OR B
01_1001     A * B (uns.)   10_0110     A XOR B
01_1010     A / B          10_0111     A NOR B
01_1011     A / B (uns.)   10_1010     A < B

(Figure: a 32-bit ALU with inputs X[31:0], Y[31:0], and FS[5:0], producing RES[31:0] and the flags ZF, CF, OF.)
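To make the FS[5:0] selection concrete, here is a minimal C sketch of a combinational ALU that switches on a function-select code. The opcode values come from the table above; the function name aluOp, the 32-bit width, and the omission of the flag outputs are assumptions for illustration, not the actual hardware.

#include <stdint.h>

/* Minimal sketch of a combinational ALU: the FS code selects the
   operation. Flags and the remaining opcodes are omitted. */
uint32_t aluOp(uint32_t a, uint32_t b, uint8_t fs) {
    switch (fs) {
        case 0x20: return a + b;                    /* 10_0000: A + B   */
        case 0x22: return a - b;                    /* 10_0010: A - B   */
        case 0x24: return a & b;                    /* 10_0100: A AND B */
        case 0x25: return a | b;                    /* 10_0101: A OR B  */
        case 0x26: return a ^ b;                    /* 10_0110: A XOR B */
        case 0x27: return ~(a | b);                 /* 10_0111: A NOR B */
        case 0x00: return a << (b & 31);            /* 00_0000: A SHL B */
        case 0x02: return a >> (b & 31);            /* 00_0010: A SHR B */
        case 0x2a: return (int32_t)a < (int32_t)b;  /* 10_1010: A < B   */
        default:   return 0;                        /* not modeled here */
    }
}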

8 12.8 Sequential Devices (Registers)
Registers capture the D input value when a control input (aka the clock signal) transitions from 0 to 1 (the clock edge) and hold that value at the Q output until the next clock edge.
A register is like a variable in software: it stores a value for later use. We can choose to clock the register only at certain times when we want it to capture a new value (i.e., when it is the destination of an instruction).
Key Idea: registers store data while we operate on those values.
(Figure: block diagram of a register with D input, Q output, and CP clock input. A changing data input d(t) is sampled only on each clock pulse and held until the next pulse; for example, %rax holds the ALU sum of add %rbx,%rax until the result of add %rcx,%rax is clocked in.)
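As a software analogy only (this is not how the hardware is built), the sampling behavior of a positive-edge-triggered register can be sketched in a few lines of C: q takes on d only when the clock input goes from 0 to 1, and holds it otherwise. The names reg_t and reg_tick are made up for this sketch.

typedef struct { int d, q, last_clk; } reg_t;

/* Copy d to q only on a rising clock edge (0 -> 1); at all other
   times q simply holds the previously captured value. */
void reg_tick(reg_t *r, int clk) {
    if (clk && !r->last_clk)   /* rising edge detected */
        r->q = r->d;
    r->last_clk = clk;
}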

9 12.9 Clock Signal
An alternating high/low voltage pulse train that controls the ordering and timing of operations performed in the processor. One cycle is usually measured from rising edge to rising edge.
Clock frequency = number of cycles per second (e.g., 2.8 GHz = 2.8 * 10^9 cycles per second, or about 0.357 ns/cycle).
(Figure: clock waveform alternating between 1 (5V) and 0 (0V), with one operation performed per cycle.)
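A quick check of the frequency-to-period arithmetic (a throwaway sketch; the 2.8 GHz value is just the example above):

#include <stdio.h>

int main(void) {
    double freq_hz   = 2.8e9;             /* 2.8 GHz clock       */
    double period_ns = 1e9 / freq_hz;     /* ~0.357 ns per cycle */
    printf("%.3f ns/cycle\n", period_ns); /* prints 0.357        */
    return 0;
}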

10 12.10 Basic HW organization for a simplified instruction set FROM X86 TO RISC

11 12.11 From CISC to RISC
Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute. The benefit is that fewer instructions are needed to accomplish a task.
Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps. It often requires more instructions to describe the overall task (larger code size); see the example below.
RISC makes the hardware design easier, so let's tweak our x86 instructions to be more RISC-like.

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC equivalent w/ 1 mem. or ALU op. per instruction
mov %rsi, %rbx    # use %rbx as a temp.
shl $2, %rbx      # %rsi * 4
add %rdi, %rbx    # %rdi + (%rsi*4)
add $0x40, %rbx   # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax  # %rax = *%rbx

12 12.12 A RISC Subset of x86
Split mov instructions that access memory into separate instructions:
  ld = Load/Read from memory
  st = Store/Write to memory
Limit ld & st instructions to at most indirect-with-displacement addressing:
  No ld 0x40(%rdi, %rsi, 4), %rax (too much work for one instruction)
  At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)
Limit arithmetic & logic instructions to operate only on registers:
  No add (%rsp), %rax since this implicitly accesses (dereferences) memory
  Only add %reg1, %reg2

// 3 x86 memory read instructions
mov (%rdi), %rax           // 1
mov 0x40(%rdi), %rax       // 2
mov 0x40(%rdi,%rsi), %rax  // 3

// Equiv. load sequences
ld 0x0(%rdi), %rax         // 1
ld 0x40(%rdi), %rax        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
ld 0x40(%rbx), %rax        // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)           // 1
mov %rax, 0x40(%rdi)       // 2
mov %rax, 0x40(%rdi,%rsi)  // 3

// Equiv. store sequences
st %rax, 0x0(%rdi)         // 1
st %rax, 0x40(%rdi)        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
st %rax, 0x40(%rbx)        // 3c

// CISC instruction
add %rax, (%rsp)

// Equiv. RISC sequence (w/ ld and st)
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

13 12.13 Developing a Processor Organization
Identify which hardware components each instruction type would use, and in what order. Components: PC, I-Cache / I-MEM, Registers (%rax, %rbx, etc., aka the RegFile), ALU (result, zero, condition codes), D-Cache / D-MEM.
ALU-Type (add %rax,%rbx): 1. PC supplies the addr. of the instruction. 2. I-Cache fetches the instruction. 3. Registers supply %rax and %rbx. 4. ALU sums %rax+%rbx. 5. Registers save the result to %rbx.
LD (ld 8(%rax),%rbx): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. Registers supply %rax. 4. ALU sums %rax+8. 5. D-Cache reads data at that address. 6. Registers save the data to %rbx.
ST (st %rbx, 8(%rax)): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. Registers supply %rax (and the %rbx data). 4. ALU sums %rax+8. 5. D-Cache writes the %rbx data at that address.
JE (je label/displacement): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. The condition codes are read. 4. ALU: if cond = true, PC = PC + disp.

14 12.14 Processor Block Diagram
Sections: Fetch, Decode, Exec., Mem, WB.
Datapath: PC -> I-Cache (supplies the instruction / machine code) -> Registers (aka RegFile, supplying operands) -> ALU (output is an address or result, plus condition codes ZF/OF/CF/SF) -> D-Cache (Addr/Data) -> data written back to the destination register. Control signals (e.g., ALU operation, Read/Write D-Cache, etc.) steer each section.
With each section taking about 10 ns, the clock cycle time = sum of the delays through the worst-case pathway = 5 x 10 ns = 50 ns.

15 12.15 Processor Execution (add)
Instruction: add %rax,%rdx [Machine Code: 48 01 c2]
Fetch: the PC addresses the I-Cache and the instruction is fetched. Decode: %rax and %rdx are read from the RegFile as operands and control signals are generated. Exec: the ALU computes %rax+%rdx (setting ZF/OF/CF/SF). Mem: no D-Cache access. WB: the result is written to the destination register, %rdx = %rax+%rdx.

16 12.16 Processor Execution (load)
Instruction: ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]
Fetch: the instruction is fetched from the I-Cache. Decode: %rbx and the displacement 0x40 are produced as operands. Exec: the ALU computes the address %rbx+0x40. Mem: the D-Cache is read at that address. WB: the data is written to the destination register, %rax = data.

17 12.17 Processor Execution (store)
Instruction: st %rax,0x40(%rbx)
Fetch: the instruction is fetched from the I-Cache. Decode: %rbx, the displacement 0x40, and the data register %rax are read. Exec: the ALU computes the address %rbx+0x40. Mem: the %rax data is written to the D-Cache at that address. WB: nothing to write back.

18 12.18 Processor Execution (branch/jump)
Instruction: je L1 (disp. = 0x08) [Machine Code: 74 08]
Fetch: the instruction is fetched from the I-Cache. Decode: the displacement 0x08 is produced and the condition codes are read (here ZF=1, so the branch will be taken). Exec: the ALU computes the target PC + 0x08; if the condition is true, the PC is updated to that target. Mem/WB: nothing to do.

19 PIPELINING 12.19

20 Example
for(i=0; i < 100; i++) C[i] = (A[i] + B[i]) / 4;
Without pipelining, each input set takes 10 ns: 10 ns per input set * 100 iterations = 1000 ns total.
(Figure: a counter i indexes arrays A, B, and C in memory, feeding A[i] and B[i] to the add/divide logic.)

21 12.21 Pipelining Example
for(i=0; i < 100; i++) C[i] = (A[i] + B[i]) / 4;
Split the work into two stages: Stage 1 computes the sum, Stage 2 divides by 4.
  Clock 0:  Stage 1: A[0] + B[0]
  Clock 1:  Stage 1: A[1] + B[1]    Stage 2: (A[0] + B[0]) / 4
  Clock 2:  Stage 1: A[2] + B[2]    Stage 2: (A[1] + B[1]) / 4
Pipelining refers to the insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., creating an assembly line).
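Here is a minimal C sketch of that two-stage assembly line: stage 1 adds the current A[i] and B[i] while stage 2 divides the previous iteration's sum by 4, with one extra clock at the end to drain the pipeline. The array contents and the assumption that each stage takes about 5 ns are illustrative only.

#include <stdio.h>
#define N 100

int main(void) {
    int A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    int stage1 = 0;                              /* pipeline register between stages */
    for (int clk = 0; clk <= N; clk++) {
        if (clk > 0) C[clk - 1] = stage1 / 4;    /* stage 2: divide last cycle's sum */
        if (clk < N) stage1 = A[clk] + B[clk];   /* stage 1: add the current inputs  */
    }
    /* 100 input sets finish after 101 overlapped clocks; at ~5 ns per stage
       that is ~505 ns versus 100 x 10 ns = 1000 ns without pipelining. */
    printf("C[0]=%d C[99]=%d\n", C[0], C[99]);   /* prints C[0]=0 C[99]=74 */
    return 0;
}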

22 12.22 Need for Registers
Registers provide separation between combinational functions. Without registers, fast signals could catch up to data values in the next operation's stage.
Performing an operation yields signals with different paths and delays (e.g., one 5 ns path and one 2 ns path). We don't want signals from two different data values mixing, so we must collect and synchronize the values from the previous operation in a register before passing them on to the next.

23 12.23 Processors & Pipelines
Pipelining overlaps the execution of multiple instructions. There is a natural breakdown into stages: Fetch, Decode, Execute. We can fetch one instruction while decoding another and executing yet another.
Pipelining (instruction view): each instruction still passes through Fetch, then Decode, then Execute, one stage per clock, with a new instruction starting every clock.
Pipelining (stage view):
  Clk 1: Fetch = Inst. 1
  Clk 2: Fetch = Inst. 2, Decode = Inst. 1
  Clk 3: Fetch = Inst. 3, Decode = Inst. 2, Exec. = Inst. 1
  Clk 4: Fetch = Inst. 4, Decode = Inst. 3, Exec. = Inst. 2
  Clk 5: Fetch = Inst. 5, Decode = Inst. 4, Exec. = Inst. 3

24 Balancing Pipeline Stages
The clock period must equal the LONGEST delay from register to register.
  Fig. 1: one 20 ns block of processor logic (Fetch + Decode + Execute) => 50 MHz. Throughput: 1 instruc. / 20 ns.
  Fig. 2: unbalanced stage delays (5 ns, 5 ns, 10 ns) limit the clock speed to the slowest stage (worst case). Throughput: 1 instruc. / 10 ns => 100 MHz.
  Fig. 3: better to split into more, balanced stages (four stages of 5 ns). Throughput: 1 instruc. / 5 ns => 200 MHz.
  Fig. 4: are more stages better? Ideally 2x stages => 2x throughput (eight stages of 2.5 ns). Throughput: 1 instruc. / 2.5 ns => 400 MHz.
Each pipeline register adds extra delay, so at some point deeper pipelines don't pay off.
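The figures' numbers can be reproduced with a short calculation: the clock period is the longest register-to-register (stage) delay, the frequency is its reciprocal, and the steady-state throughput is one instruction per cycle. A sketch that ignores the extra per-register delay mentioned above:

#include <stdio.h>

static void report(const char *name, const double d[], int n) {
    double period = 0.0;                       /* ns */
    for (int i = 0; i < n; i++)
        if (d[i] > period) period = d[i];      /* clock = slowest stage */
    printf("%s: period %.1f ns, %.0f MHz, 1 instruc. / %.1f ns\n",
           name, period, 1000.0 / period, period);
}

int main(void) {
    double fig1[] = {20};                      /* unpipelined          */
    double fig2[] = {5, 5, 10};                /* unbalanced stages    */
    double fig3[] = {5, 5, 5, 5};              /* balanced, 4 stages   */
    double fig4[] = {2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5};
    report("Fig. 1", fig1, 1);                 /* 20 ns  ->  50 MHz    */
    report("Fig. 2", fig2, 3);                 /* 10 ns  -> 100 MHz    */
    report("Fig. 3", fig3, 4);                 /*  5 ns  -> 200 MHz    */
    report("Fig. 4", fig4, 8);                 /* 2.5 ns -> 400 MHz    */
    return 0;
}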

25 Balancing Pipeline Stages
Main points:
  The latency of any single instruction is unaffected (the non-pipelined design, a 4-stage pipeline, and an 8-stage pipeline all have a 20 ns latency here).
  Throughput, and thus overall program performance, can be dramatically improved: the 4-stage pipeline gives 4x throughput and the 8-stage pipeline 8x relative to the non-pipelined design.
  Ideally a K-stage pipeline leads to a throughput increase by a factor of K. In reality, splitting stages adds some delay and thus hits a point of diminishing returns.

26 5-Stage Pipeline
Stages: Fetch, Decode, Exec., Mem, WB.
(Figure: the pipeline registers between stages carry the instruction machine code, the operands & control signals, the ALU output (addr. or result), and the result of the instruction. The datapath contains the PC, I-Cache, Reg. File, ALU, condition codes ZF/OF/CF/SF, and D-Cache.)

27 12.27 Pipelining
Let's see how a sequence of instructions can be executed:
  ld 0x8(%rbx), %rax
  add %rcx,%rdx
  je L1

28 Sample Sequence - 1
Stage occupancy: Fetch (LD), Decode (-), Exec. (-), Mem (-), WB (-).
Action: fetch the LD instruction from the I-Cache.

29 Sample Sequence - 2
Stage occupancy: Fetch (ADD), Decode (LD), Exec. (-), Mem (-), WB (-).
Actions: fetch the ADD instruction; decode ld 0x8(%rbx), %rax and fetch its operands.

30 Sample Sequence - 3
Stage occupancy: Fetch (JE), Decode (ADD), Exec. (LD), Mem (-), WB (-).
Actions: fetch the JE instruction; decode the ADD and fetch its operands; Exec: add the displacement 0x8 to %rbx to form the load address.

31 Sample Sequence - 4
Stage occupancy: Fetch (i+1), Decode (JE), Exec. (ADD), Mem (LD), WB (-).
Actions: fetch instruction i+1; decode the JE and fetch its operands (displacement and condition codes); Exec: add %rcx+%rdx; Mem: read the word from memory at %rbx+0x8.

32 Sample Sequence - 5
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (JE), Mem (ADD), WB (LD).
Actions: fetch instruction i+2; decode instruction i+1 and fetch its operands; Exec: check whether the JE condition is true and compute PC + displacement; Mem: just pass the ADD sum to the next stage; WB: write the loaded word to %rax.

33 Sample Sequence - 6
Stage occupancy: Fetch (i+3), Decode (i+2), Exec. (i+1), Mem (JE), WB (ADD).
Actions: fetch instruction i+3; decode instruction i+2 and fetch its operands; Exec: instruction i+1 uses the ALU; Mem: the JE updates the PC (the branch is taken); WB: write the sum to %rdx.

34 Sample Sequence - 7
Stage occupancy: Fetch (target), Decode (i+3, deleted), Exec. (i+2, deleted), Mem (i+1, deleted), WB (JE).
Actions: fetch the branch-target instruction; delete (flush) the wrongly fetched i+3, i+2, and i+1; WB: the JE has nothing to write back.

35 12.35 Problems from overlapping instruction execution HAZARDS

36 12.36 Hazards
Hazards prevent parallel or overlapped execution!
Control Hazards. Problem: we don't know what instruction to fetch next, but we need to fetch something. Examples: jumps (branches) and calls.
Data Hazards / Data Dependencies. Problem: a later instruction needs data from a previous instruction. Example:
  sub %rdx,%rax
  add %rax,%rcx
Structural Hazards. Problem: due to limited resources, the HW doesn't support overlapping a certain sequence of instructions. Examples: see the next slides.

37 12.37 Structural Hazards
Example structural hazard: a single cache rather than separate instruction & data caches. A structural hazard occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction each clock cycle and both accesses would need the one cache in the same cycle.
(Figure: an LD performing its memory access and instruction i+3 being fetched both contend for the single cache: hazard!)

38 Data Hazard - 1
Program order: sub %rdx,%rax followed by add %rax,%rcx.
Stage occupancy: Fetch (i+1), Decode (ADD), Exec. (SUB), Mem (-), WB (-).
Actions: fetch i+1; decode the ADD and read its register operands (do we get the desired %rax value?); Exec: perform %rax-%rdx.

39 Data Hazard - 2
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (SUB), WB (-).
Actions: fetch i+2; decode i+1; Exec: perform %rax+%rcx using the wrong value! The new value of %rax (from the SUB) has not been written back yet.

40 Stalling
Solution 1: halt/stall the ADD instruction in the Decode stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance.
Stage occupancy: Fetch (i+1), Decode (ADD, stalled), Exec. (nop), Mem (nop), WB (SUB writing the new %rax).

41 Forwarding
Solution 2: create new hardware paths to hand off (forward) the data from the producing instruction in the pipeline to the consuming instruction.
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (SUB), WB (-); the new %rax value is forwarded from the SUB back to the ADD so the ADD uses the correct value without stalling.

42 12.42 Solving Data Hazards
Key Point: data dependencies (i.e., instructions needing values produced by earlier ones) limit performance.
Forwarding solves many of the data hazards (data dependencies) that exist. It allows instructions to continue to flow through the pipeline without the need to stall and waste time. The cost is additional hardware and added complexity.
Even forwarding cannot solve all the issues: a hazard still exists when a LD produces a value needed by the very next instruction (the load-use case).
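To make the rules above concrete, here is a hedged C sketch of the checks a decode stage might perform: forward from the EX/MEM or MEM/WB pipeline register when its destination matches a source of the instruction being decoded, and stall exactly when the producer in EX is a load (the load-use case). The struct fields, register numbering, and function names are invented for this sketch and are not the actual control logic.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  dest;      /* destination register number, -1 if none              */
    bool is_load;   /* true for ld (value only available after the Mem stage) */
} InFlight;

/* Which path should supply source register src to the instruction
   currently in Decode? (Sketch only.) */
const char *forward_from(InFlight ex, InFlight mem, int src) {
    if (src == ex.dest && !ex.is_load) return "EX/MEM forward";
    if (src == mem.dest)               return "MEM/WB forward";
    return "register file";
}

/* One stall (nop) is needed exactly when a load in EX produces a
   register that the instruction in Decode reads. */
bool needs_stall(InFlight ex, int src1, int src2) {
    return ex.is_load && (ex.dest == src1 || ex.dest == src2);
}

int main(void) {
    /* Arbitrary numbering for the sketch: %rax = 0, %rcx = 2 */
    InFlight alu_rax = { 0, false };   /* e.g. sub %rdx,%rax in EX   */
    InFlight ld_rax  = { 0, true  };   /* e.g. ld 8(%rdx),%rax in EX */
    InFlight none    = { -1, false };

    /* add %rax,%rcx sitting in Decode in each case: */
    printf("%s\n", forward_from(alu_rax, none, 0));   /* EX/MEM forward */
    printf("stall=%d\n", needs_stall(ld_rax, 0, 2));  /* 1: load-use    */
    return 0;
}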

43 LD + Dependent Instruction Hazard
Even forwarding cannot prevent the need to stall when a load instruction produces a value needed by the instruction right behind it; forwarding the loaded value in time would require performing 2 cycles' worth of work in only a single cycle.
Program order: ld 8(%rdx),%rax followed by add %rax,%rcx.
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD, holding the old %rax), Mem (LD, producing the new %rax), WB (-).

44 LD + Dependent Instruction Hazard
We need to introduce 1 stall cycle (nop) into the pipeline to get the timing correct: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (nop), WB (LD). The new %rax value can then be forwarded from the LD to the ADD. Keep this in mind as we move through the next slides.

45 12.45 Control Hazards
Branches/jumps require us to know:
  Where we want to jump to (aka the branch/jump target location), which is really just the new value of the PC.
  Whether we should branch or not (checking the jump condition).
Problem: we often don't know those values until deep in the pipeline and thus are not sure what instructions should be fetched in the interim. This requires us to flush unwanted instructions and waste time.

46 Control Hazard - 1
Stage occupancy: Fetch (i+3), Decode (i+2), Exec. (i+1), Mem (JE), WB (ADD).
Actions: fetch instruction i+3; decode instruction i+2 and fetch its operands; Exec: i+1 uses the ALU; Mem: the JE resolves and updates the PC; WB: write the sum to %rdx.

47 Control Hazard - 2
Stage occupancy: Fetch (target), Decode (i+3, deleted), Exec. (i+2, deleted), Mem (i+1, deleted), WB (JE).
Actions: fetch the branch-target instruction; "flush" (delete) the wrongly fetched i+3, i+2, and i+1; WB: nothing to do for the JE.

48 12.48 Enlisting the help of the compiler A FIRST LOOK: CODE REORDERING

49 12.49 Two Sides of the Coin
If the hardware has some problems it just can't solve, can software (i.e., the compiler) help? Yes!! Compilers can re-order instructions to take best advantage of the processor (pipeline) organization: identify the dependencies that will incur stalls and slow performance, such as a load followed by a dependent add, or jump instructions.

// C code
void sum(int* data, int n, int x) {
  for(int i=0; i < n; i++){
    data[i] += x;
  }
}

// Assembly translation
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      add $1, %ecx
      j   L1
L2:   retq

50 12.50 How Can the Compiler Help
Compilers are written with general parsing and semantic-representation front ends but architecture-specific back ends that generate code optimized for a particular processor.
Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?
A: By finding independent instructions and reordering the code.

// Original code (incurring 1 stall cycle)
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      (stall/nop)
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      add $1, %ecx
      j   L1
L2:   retq

// Updated code (w/ compiler reordering): add $1, %ecx fills the load-use slot
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      add $1, %ecx
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      j   L1
L2:   retq

Could we have moved any other instruction into that slot? No!

51 12.51 Taken or Not Taken: Branch Behavior
When a conditional jump/branch condition is True, we say the branch is Taken; when False, Not Taken.
Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken. Effectively, our pipeline "predicts" that each branch is Not Taken.
In the loop below, the j L1 instruction is always taken and thus will incur wasted clock cycles each time it is executed. Most of the time the jge L2 will be not taken and perform well.

sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2           # usually Not Taken
      ld  0(%rdi), %eax
      add $1, %ecx
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      j   L1           # always Taken
L2:   retq

52 12.52 Branch Delay Slots Problem: After a jump/branch we fetch instructions that we are not sure should be executed Idea: Find an instruction(s) that should ALWAYS be executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is Branch delay slot(s) = # of instructions that the HW will always execute (not flush) after a jump/branch instruction

53 12.53 Branch Delay Slot Example
Assume a single-instruction delay slot.

// Before
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rdx
  je  NEXT
  (delay slot instruc.)
  NOT TAKEN CODE
NEXT:
  TAKEN CODE

// After: move an ALWAYS-executed instruction down into the delay slot
// and let it execute no matter what
  ld  0(%rdi), %rcx
  cmp %rcx, %rdx
  je  NEXT
  add %rbx, %rax    # delay slot
  NOT TAKEN CODE
NEXT:
  TAKEN CODE

(Figure: flowchart perspective of the delay slot; the delay-slot instruction executes on both the taken and not-taken paths.)

54 12.54 Implementing Branch Delay Slots
The HW defines the number of branch delay slots (usually a small number, 1 or 2). The compiler is responsible for arranging instructions to fill the delay slots: it must find instructions that the branch does NOT depend on. If no instructions can be rearranged, it can always insert a 'nop' and just waste those cycles.

// Cannot move the ld into the delay slot because the branch (via cmp)
// needs the %rcx value it generates
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rdx
  je  NEXT
  (delay slot instruc.)

// Here the branch depends on both %rcx and %rax, so no instruction can
// be found and the compiler inserts a 'nop' in the delay slot
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rax
  je  NEXT
  (delay slot instruc.)

55 12.55 A Look Ahead: Branch Prediction
Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch.
Could we build a pipeline that could predict Taken? Not yet: the location to jump to (the branch target) is not known until later stages. But suppose we could overcome those problems; would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?
  We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT).
  We could allow dynamic prediction per instruction (use its runtime history).
Loops: the loop-closing branch has a high probability of being Taken, so the prediction can be static. If statements: may exhibit data-dependent behavior, so the prediction may need to be dynamic.
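One common form of the dynamic prediction mentioned above is a 2-bit saturating counter kept per branch: predict Taken when the counter is 2 or 3, and nudge the counter toward each actual outcome. This is a generic textbook sketch, not the predictor of any particular processor; note how the loop-closing branch below is mispredicted only once.

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 0,1 -> predict Not Taken; 2,3 -> predict Taken */
typedef struct { unsigned ctr; } Predictor;

bool predict(const Predictor *p) { return p->ctr >= 2; }

void update(Predictor *p, bool taken) {
    if (taken  && p->ctr < 3) p->ctr++;   /* strengthen toward Taken     */
    if (!taken && p->ctr > 0) p->ctr--;   /* strengthen toward Not Taken */
}

int main(void) {
    Predictor p = { 2 };                  /* start "weakly Taken" */
    /* A loop-closing branch like "j L1": taken 99 times, then not taken. */
    int correct = 0;
    for (int i = 0; i < 100; i++) {
        bool actual = (i < 99);           /* taken until the final iteration */
        if (predict(&p) == actual) correct++;
        update(&p, actual);
    }
    printf("correct predictions: %d / 100\n", correct);  /* 99 / 100 */
    return 0;
}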

56 Demo 12.56

57 12.57 Summary 1 Pipelining is an effective and important technique to improve the throughput of a processor Overlapping execution creates hazards which lead to stalls or wasted cycles Data, Control, Structural More hardware can be investigated to attempt to mitigate the stalls (e.g. forwarding) The compiler can help reorder code to avoid stalls and perform useful work (e.g. delay slots)
