CS356 Unit 2a: Processor Hardware Organization & Pipelining (2a.1-2a.2)

BASIC HW: LOGIC CIRCUITS

Combinational vs. Sequential Logic (2a.3)
Combinational logic performs a specific function (a mapping of input combinations to desired output combinations). It has no internal state or feedback: given a set of inputs, we will always get the same output after some time (the propagation delay). These are usually operations like +, -, *, /, &, |, <<; outputs depend only on the current inputs.
Sequential logic (storage devices): registers are the fundamental building blocks. A register remembers a set of bits for later use, acting like a variable from software, and is controlled by a "clock" signal. Sequential elements feed stored values back and provide "memory": outputs depend on the current inputs AND on previous inputs (previous inputs summarized via the stored state).

Logic Gates (2a.4)
Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from digital inputs. (Figure: a small circuit built from AND, OR, and NOT gates.)
Propagation Delay (2a.5)
Main Idea: All digital logic circuits have propagation delay: the time it takes for the output to change after the inputs are changed. (Figure: a circuit in which an input change takes 4 gate delays to propagate to the outputs.)

Combinational Logic Functions (2a.6)
Map input combinations of n bits to a desired m-bit output. We can describe the function with a truth table and then find its circuit implementation. (Figure: a logic circuit with inputs In0-In2 and outputs Out0-Out1, plus its truth table.)

ALUs (2a.7)
An ALU performs a selected operation on two input numbers: inputs X[31:0] and Y[31:0] (32 bits each), a 6-bit function-select code FS[5:0] that selects the desired operation, and outputs RES[31:0] plus condition flags (OF, ZF, CF). Representative function codes: A SHL B, A SHR B, A SAR B, A+B, A-B, A AND B, A OR B, A XOR B, A NOR B, A*B (signed and unsigned), A/B (signed and unsigned), A < B.

Sequential Devices (Registers) (2a.8)
A register samples the D input value when a control input (aka the clock signal) transitions from low to high (the clock edge) and stores that value at the Q output until the next clock edge. A register is like a variable in software: it stores a value for later use. We can choose to clock the register only at "certain" times when we want it to capture a new value (i.e., at the end of an instruction). Key Idea: A register holds data while we operate on those values, e.g., keeping a running sum in %rax across add %rbx,%rax and add %rcx,%rax. (Figure: block diagram of a register; the clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.)
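As a concreteness check, the ALU's select-an-operation behavior can be sketched in a few lines of Python. This is a hypothetical model, not the actual hardware: the function-code names (strings here rather than 6-bit FS encodings) and the flag subset (ZF only) are illustrative assumptions.

```python
# Minimal sketch of an ALU's function-select behavior (2a.7).
# Function-code names and values are made up for illustration.
MASK32 = 0xFFFFFFFF

def alu(x, y, fs):
    """Return (result, zero_flag) for the selected operation."""
    ops = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "AND": lambda a, b: a & b,
        "OR":  lambda a, b: a | b,
        "XOR": lambda a, b: a ^ b,
        "SHL": lambda a, b: a << (b & 31),
    }
    res = ops[fs](x, y) & MASK32   # truncate the result to 32 bits
    return res, int(res == 0)     # ZF is set when the result is zero
```

For example, alu(7, 7, "SUB") returns (0, 1): a zero result sets ZF, which is exactly what a later je would test.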
Clock Signal (2a.9)
An alternating high/low voltage pulse train that controls the ordering and timing of operations performed in the processor. A cycle is usually measured from rising edge to rising edge. Clock frequency = number of cycles per second (e.g., 2.8 GHz = 2.8 * 10^9 cycles per second, or about 0.357 ns/cycle). (Figure: clock waveform alternating between 0V and 5V, with operations Op. 1, Op. 2, Op. 3 performed in successive cycles.)

FROM X86 TO RISC: Basic HW organization for a simplified instruction set (2a.10)

From CISC to RISC
Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute. The benefit is that fewer instructions are needed to accomplish a task. Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps; it often requires more instructions to describe the overall task (larger code size). See the example below. RISC makes the hardware design easier, so let's tweak our x86 instructions to be more RISC-like.

CISC vs. RISC Equivalents (2a.11)
    // CISC instruction
    movq 0x4(%rdi, %rsi, 4), %rax

    // RISC equivalent w/ 1 memory access or 1 op per instruction
    mov %rsi, %rbx    # use %rbx as a temp
    shl $2, %rbx      # %rsi * 4
    add %rdi, %rbx    # %rdi + (%rsi*4)
    add $0x4, %rbx    # 0x4 + %rdi + (%rsi*4)
    mov (%rbx), %rax  # %rax = *%rbx

A RISC Subset of x86
Split mov instructions that access memory into separate instructions: ld = load from memory, st = store to memory. Limit ld & st instructions to at most a base register plus a displacement: no ld 0x4(%rdi, %rsi, 4), %rax (too much work); at most ld 0x4(%rdi), %rax or st %rax, 0x4(%rdi). Limit arithmetic & logic instructions to operate only on registers: no add (%rsp), %rax, since this implicitly accesses (dereferences) memory; only register-register forms like add %rbx, %rax.

    // 3 x86 memory read instructions
    mov (%rdi), %rax          // 1
    mov 0x4(%rdi), %rax       // 2
    mov 0x4(%rdi,%rsi), %rax  // 3

    // Equivalent load sequences
    ld 0x0(%rdi), %rax        // 1
    ld 0x4(%rdi), %rax        // 2
    mov %rsi, %rbx            // 3a
    add %rdi, %rbx            // 3b
    ld 0x4(%rbx), %rax        // 3c

    // 3 x86 memory write instructions
    mov %rax, (%rdi)          // 1
    mov %rax, 0x4(%rdi)       // 2
    mov %rax, 0x4(%rdi,%rsi)  // 3

    // Equivalent
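The CISC-to-RISC decomposition above can be sanity-checked numerically: the four simple RISC steps must land on the same effective address as the single scaled-index instruction. A small Python sketch, with arbitrary sample register values:

```python
# Check that the RISC sequence computes the same address as the
# CISC addressing mode 0x4(%rdi, %rsi, 4). Register values arbitrary.
rdi, rsi = 0x1000, 3

# CISC: one instruction computes disp + base + index*scale
cisc_addr = 0x4 + rdi + rsi * 4

# RISC equivalent: one simple operation per instruction
rbx = rsi            # mov %rsi, %rbx
rbx = rbx << 2       # shl $2, %rbx   (%rsi * 4)
rbx = rbx + rdi      # add %rdi, %rbx
rbx = rbx + 0x4      # add $0x4, %rbx
risc_addr = rbx      # mov (%rbx), %rax would now load from this address
```

Both paths yield 0x1010 here; the RISC version simply spends four cycles exposing the address arithmetic that the CISC instruction buries in one.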
    // store sequences
    st %rax, 0x0(%rdi)        // 1
    st %rax, 0x4(%rdi)        // 2
    mov %rsi, %rbx            // 3a
    add %rdi, %rbx            // 3b
    st %rax, 0x4(%rbx)        // 3c

    // CISC instruction
    add %rax, (%rsp)

    // Equivalent RISC sequence (w/ ld and st)
    ld (%rsp), %rbx
    add %rax, %rbx
    st %rbx, (%rsp)

(2a.12)
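Likewise, the ld/add/st sequence is just the CISC read-modify-write made explicit. A minimal Python model (memory as a dict; the address and values are arbitrary sample data):

```python
# Model: CISC "add %rax,(%rsp)" vs. the RISC ld/add/st sequence.
mem = {0x7ff0: 10}       # pretend (%rsp) points here; values arbitrary
rsp, rax = 0x7ff0, 5

# CISC: one instruction reads memory, adds, and writes memory back
cisc_result = mem[rsp] + rax

# RISC: explicit ld / add / st, one memory access per instruction
rbx = mem[rsp]           # ld (%rsp), %rbx
rbx = rbx + rax          # add %rax, %rbx
mem[rsp] = rbx           # st %rbx, (%rsp)
```

After the sequence, mem[0x7ff0] holds 15, matching the single CISC instruction's result; the RISC version needs three instructions but each one maps onto simple hardware.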
Developing a Processor Organization (2a.13)
Identify which hardware components each instruction type would use, and in what order:
- R-Type (e.g., add %rax,%rbx): Instruction Memory (I-MEM) -> Register File -> ALU -> Register File
- LD (e.g., ld 8(%rax),%rbx): I-MEM -> Register File -> ALU (address) -> Data Memory (D-MEM) -> Register File
- ST (e.g., st %rbx, 8(%rax)): I-MEM -> Register File -> ALU (address) -> D-MEM
- JE (e.g., je label/displacement): I-MEM -> Condition Codes (zero result) -> PC update

Processor Block Diagram (2a.14)
The fetched instruction (machine code) is decoded into control signals (e.g., ALU operation, memory Read/Write, etc.); operands are read from the register file; the ALU produces an output (an address or a result); and the data to write to the destination register flows back to the register file. Clock cycle time = sum of the delay through the worst-case pathway (the per-component delays here sum to 5 ns).

Processor Execution (add) (2a.15)
Instruction: add %rax,%rdx [machine code: 48 01 c2]. The register file supplies the operands %rax and %rdx; the ALU computes %rax+%rdx; the result is written back to the destination register: %rdx = %rax+%rdx.

Processor Execution (load) (2a.16)
Instruction: ld 0x4(%rbx),%rax [machine code: 48 8b 43 04]. The register file supplies %rbx; the ALU adds the displacement 0x4 to %rbx to form the address; data memory is read at that address; the loaded data is written back to the destination register: %rax = data.
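In a single-cycle design like this, the clock period must cover the worst-case path through every component in series. The sketch below reproduces that calculation; the per-stage delays are assumed placeholders (the slide elides the individual numbers), chosen to sum to the stated 5 ns.

```python
# Single-cycle datapath: every instruction flows through all
# components in one long clock cycle, so the period is the SUM of
# the component delays (worst-case pathway), not the max.
# These per-stage delays are illustrative assumptions.
stage_delay_ns = {"fetch": 1.0, "decode": 1.0, "execute": 1.0,
                  "memory": 1.0, "writeback": 1.0}

cycle_time_ns = sum(stage_delay_ns.values())  # worst-case path
freq_mhz = 1000.0 / cycle_time_ns             # ns period -> MHz
```

A 5 ns period is a 200 MHz clock. Contrast this with the pipelined designs later in the unit, where the period is set by the slowest single stage rather than the sum of all of them.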
Processor Execution (store) (2a.17)
Instruction: st %rax,0x4(%rbx) [machine code: 48 89 43 04]. The register file supplies %rbx and %rax; the ALU adds the displacement 0x4 to %rbx to form the address; the value of %rax is written to data memory at that address.

Processor Execution (branch/jump) (2a.18)
Instruction: je L1 (disp. = 0x8) [machine code: 74 08]. The condition codes (ZF, OF, CF, SF) are examined; an adder computes PC + 0x8 as the branch target; if the condition is true, the PC is updated to that target.

Example (2a.19-2a.20)
    for(i=0; i < 100; i++)
        C[i] = (A[i] + B[i]) / 4;
A counter produces i; memories A and B supply A[i] and B[i]; an adder and a divide-by-4 produce each C[i], which is stored into memory C. At 10 ns per input set, this takes 1000 ns total.

PIPELINING
Pipelining Example (2a.21)
    for(i=0; i < 100; i++)
        C[i] = (A[i] + B[i]) / 4;
Pipelining refers to the insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., to create an assembly line).

Need for Registers (2a.22)
Registers provide separation between combinational functions. Performing an operation yields signals with different paths and delays (e.g., Signal i arriving after 5 ns, Signal j after 2 ns); without registers, fast signals could race ahead to data values in the next operation stage. We don't want signals from two different data values mixing, therefore we must collect and synchronize the values from the previous operation in a stage register before passing them on to the next.

Processors & Pipelines
A pipelined processor overlaps the execution of multiple instructions. There is a natural breakdown into stages: fetch an instruction, while decoding another, while executing another.

Pipelining (Instruction View / Stage View)
    Clk 1: Fetch Inst. 1
    Clk 2: Fetch Inst. 2, Decode Inst. 1
    Clk 3: Fetch Inst. 3, Decode Inst. 2, Execute Inst. 1
    Clk 4: Fetch Inst. 4, Decode Inst. 3, Execute Inst. 2
    Clk 5: Fetch Inst. 5, Decode Inst. 4, Execute Inst. 3

Balancing Pipeline Stages (2a.23-2a.24)
The clock period must equal the LONGEST delay from register to register.
- Fig. 1: If the total logic delay (Fetch + Decode + Execute) is 20 ns, no pipelining gives a 50 MHz clock. Throughput: 1 instruction / 20 ns.
- Fig. 2: Unbalanced stage delays (5 ns, 5 ns, 10 ns) limit the clock speed to the slowest stage (worst case). Throughput: 1 instruction / 10 ns, i.e. 100 MHz.
- Fig. 3: Better to split into more, balanced stages (5 ns, 5 ns, 5 ns, 5 ns). Throughput: 1 instruction / 5 ns, i.e. 200 MHz.
- Fig. 4: Are more stages always better? Ideally 2x stages gives 2x throughput: eight 2.5 ns stages (F1, F2, D1, D2, E1a, E1b, E2a, E2b) yield 1 instruction / 2.5 ns, i.e. 400 MHz. But each register adds extra delay, so at some point deeper pipelines don't pay off.
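Restoring the stripped digits, the four figures use 20 ns of total logic, split as 5/5/10 ns, then 4 x 5 ns, then 8 x 2.5 ns. The arithmetic is easy to reproduce:

```python
def clock_mhz(stage_delays_ns):
    """The clock is limited by the slowest register-to-register stage."""
    return 1000.0 / max(stage_delays_ns)   # ns period -> MHz

fig1 = clock_mhz([20])            # Fig. 1: no pipelining
fig2 = clock_mhz([5, 5, 10])      # Fig. 2: unbalanced 3-stage
fig3 = clock_mhz([5, 5, 5, 5])    # Fig. 3: balanced 4-stage
fig4 = clock_mhz([2.5] * 8)       # Fig. 4: balanced 8-stage
```

A real design would add each pipeline register's own setup/propagation delay to every stage, which is why doubling the stage count eventually stops doubling the throughput.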
Balancing Pipeline Stages (2a.25)
Main Points: The latency of any single instruction is unchanged. Throughput, and thus overall program performance, can be dramatically improved. Ideally a K-stage pipeline leads to a throughput increase by a factor of K; the reality is that splitting stages adds some register overhead and thus hits a point of diminishing returns.
- Fig. 1: Non-pipelined (latency = 20 ns, throughput = 1x)
- Fig. 2: 4-stage pipeline (latency = 20 ns, throughput = 4x)
- Fig. 3: 8-stage pipeline (latency = 20 ns, throughput = 8x)

5-Stage Pipeline (2a.26)
Stages: Fetch (instruction machine code), Decode (operands & control signals), Execute (output: address or result), Memory, Writeback.

Pipelining Sample Sequence - 1 (2a.27-2a.28)
Let's see how a sequence of instructions is executed:
    ld 0x8(%rbx), %rax
    add %rcx,%rdx
    je L1
Cycle 1: the LD instruction is fetched.
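The sample-sequence walkthrough that follows advances each instruction one stage per cycle. The short Python sketch below reproduces that stage-view table for an ideal (hazard-free) pipeline; the instruction labels match the slides' sequence, with i+1 and i+2 as stand-ins for whatever follows.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def stage_view(instrs, cycles):
    """Return {cycle: {stage: instr}} for an ideal 5-stage pipeline."""
    table = {}
    for c in range(1, cycles + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = c - 1 - s          # index of the instr in this stage
            if 0 <= i < len(instrs):
                row[stage] = instrs[i]
        table[c] = row
    return table

t = stage_view(["ld", "add", "je", "i+1", "i+2"], 5)
```

In cycle 5 the ld reaches Writeback while i+2 is being fetched, which is exactly the fully overlapped state the walkthrough builds toward.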
Sample Sequence - 2 (2a.29)
Fetch: the ADD. Decode: the LD (fetch its operand %rbx).

Sample Sequence - 3 (2a.30)
Fetch: the JE. Decode: the ADD (fetch operands %rcx and %rdx). Execute: the LD adds the displacement 0x4 to %rbx and issues a memory READ.

Sample Sequence - 4 (2a.31)
Fetch: next instruction i+1 (machine code). Decode: the JE (fetch its displacement). Execute: the ADD computes %rcx+%rdx. Memory: the LD reads the word from memory.

Sample Sequence - 5 (2a.32)
Fetch: next instruction i+2. Decode: instruction i+1 (machine code / displacement, fetch operands). Execute: the JE checks whether its condition is true. Memory: the ADD just passes the sum to the next stage. Writeback: the LD writes the word from memory to %rax.
Sample Sequence - 6 (2a.33)
Fetch: next instruction i+3. Decode: instruction i+2 (machine code, fetch operands). Execute: instruction i+1 (operands). Memory: the JE uses the condition outcome to update the PC (the new PC). Writeback: the ADD writes the sum to %rdx.

Sample Sequence - 7 (2a.34)
If the branch was taken: Fetch: the branch target. The wrongly fetched instructions are flushed: delete i+3, delete i+2, delete i+1. Writeback: the JE does nothing.

HAZARDS

Hazards (2a.35-2a.36)
Problems arising from overlapping instruction execution. Hazards prevent parallel or overlapped execution!
- Control hazards. Problem: we don't know what instruction comes next, but we need to keep fetching. Examples: jumps/branches and calls.
- Data hazards. Problem: a later instruction needs data from an earlier instruction. Example: sub %rdx,%rax followed by add %rax,%rcx.
- Structural hazards. Problem: due to limited hardware, the HW doesn't support overlapping a certain combination of instructions. Examples: see next slides.
Structural Hazards (2a.37)
Example structural hazard: a single cache rather than separate instruction & data caches. A structural hazard then occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction each clock cycle, and both accesses would need the single cache at once.

Data Hazard - 1 (2a.38)
    sub %rdx,%rax
    add %rax,%rcx
While the SUB is in Execute performing %rax-%rdx, the ADD is in Decode fetching its register operands. Do we get the desired %rax value? No: hazard!

Data Hazard - 2 (2a.39)
The new value for %rax has not been written back yet, so the ADD would perform %rax+%rcx using the wrong (stale) value.

Stalling (2a.40)
Solution 1: Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance:
    sub %rdx,%rax
    nop
    nop
    add %rax,%rcx    # now sees the new value of %rax
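The two nops in Solution 1 fall out of a simple distance rule. Assuming, as the slides depict, that a register written in Writeback can be read by a Decode in that same cycle, a consumer must trail its producer by at least three instruction slots; anything closer must stall. A hypothetical sketch of that bookkeeping:

```python
# Count stall cycles needed with NO forwarding for a RAW dependence.
# Assumption: regfile write (Writeback) and read (Decode) can happen
# in the same cycle, so 2 instructions in between suffice.
NEEDED_GAP = 2

def stalls_without_forwarding(producer_idx, consumer_idx):
    """Stalls to insert given the two instructions' program positions."""
    gap = consumer_idx - producer_idx - 1   # instructions in between
    return max(0, NEEDED_GAP - gap)
```

sub at slot 0 followed immediately by add at slot 1 yields 2 stalls, matching the sub/nop/nop/add listing above; with two independent instructions already between them, no stall is needed.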
Forwarding (2a.41)
Solution 2: Create new hardware paths to hand off ("forward") the data from the earlier instruction in the pipeline to the later instruction that needs it. The new value for %rax is passed from the SUB's Execute output directly into the ADD's Execute input, with no stall.

Solving Data Hazards (2a.42)
Key Point: Data dependencies (i.e., instructions needing values produced by earlier ones) limit performance. Forwarding solves many of the data hazards (data dependencies) that exist: it allows instructions to continue to flow through the pipeline without the need to stall and waste time. The cost is additional hardware and added complexity. Even forwarding cannot solve all the issues: a hazard still exists when a load produces a value needed by the next instruction.

LD + Dependent Instruction Hazard (2a.43-2a.44)
Even forwarding cannot prevent the need to stall when a Load instruction produces a value needed by the instruction directly behind it: that would require performing two stages' worth of work in only a single cycle. We need to introduce one stall cycle (nop) into the pipeline to get the timing correct:
    ld 8(%rdx),%rax
    nop              # stall
    add %rax,%rcx    # new value of %rax forwarded from the Memory stage
Keep this in mind as we move through the next slides.
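With forwarding the same bookkeeping collapses: an ALU result can be handed to the very next instruction for free, and only the load-use case keeps a single stall. A minimal sketch of that rule, using the same 5-stage staging assumptions as above:

```python
def stalls_with_forwarding(producer_is_load, gap):
    """Stalls needed for a RAW pair when forwarding paths exist.

    gap = number of instructions between producer and consumer.
    ALU results forward out of Execute, so gap 0 is fine; a load's
    value exists only after Memory, so gap 0 still costs one stall.
    """
    if producer_is_load and gap == 0:
        return 1
    return 0
```

So sub-then-add costs nothing, ld-then-add costs one bubble, and an ld with any independent instruction placed after it costs nothing, which is exactly the opening the compiler exploits in the reordering slides later.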
Control Hazards (2a.45)
Branches/jumps require us to know: (1) where we want to jump to (aka the branch/jump target location, really just the new value of the PC), and (2) whether we branch or not (checking the jump condition). Problem: we often don't know those values until late in the pipeline, and thus we are not sure what instructions should be fetched in the interim. This requires us to flush unwanted instructions and waste time.

Control Hazard - 1 (2a.46)
(Pipeline snapshot: i+3 in Fetch, i+2 in Decode, i+1 in Execute, the JE in Memory using the condition to update the PC, and the ADD in Writeback writing the sum to %rdx.)

Control Hazard - 2 (2a.47)
If the branch is taken: fetch the target, and "flush" the wrongly fetched instructions (delete i+3, i+2, and i+1); the JE itself does nothing further.

A FIRST LOOK: CODE REORDERING: Enlisting the help of the compiler (2a.48)
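The flush cost in the two slides above reduces to one line of arithmetic: if the branch resolves while in stage 4 (Memory) of the 5-stage pipeline, the three younger instructions occupying stages 1-3 must be deleted. A tiny sketch of that cost:

```python
RESOLVE_STAGE = 4   # JE checks the condition / updates the PC in Memory

def taken_branch_penalty(resolve_stage):
    """Wrongly fetched instructions to flush on a taken branch:
    everything in stages 1 .. resolve_stage-1 behind the branch."""
    return resolve_stage - 1
```

taken_branch_penalty(4) gives 3 wasted slots (i+1, i+2, i+3), which is why resolving branches earlier in the pipeline, or predicting them, matters so much.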
Two Sides of the Coin (2a.49)
If the hardware has some problems it just can't solve, can software (i.e., the compiler) help? Compilers can reorder instructions to take best advantage of the processor (pipeline) organization, identifying the dependencies that will incur stalls and slow performance (load-use pairs, jump instructions). Compilers are written with general parsing and semantic-representation front ends, but with back ends that generate code optimized for a particular processor.

C code and its assembly translation:
    void sum(int* data, int n, int x) {
        for(int i=0; i < n; i++){
            data[i] += x;
        }
    }

    sum: mov $0, %ecx
    L1:  cmp %esi, %ecx
         jge L2
         ld (%rdi), %eax
         add %edx, %eax
         st %eax, (%rdi)
         add $4, %rdi
         add $1, %ecx
         j L1
    L2:  retq

How Can the Compiler Help? (2a.50)
Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?
A: By finding independent instructions and reordering the code.

Original code (incurring a stall cycle after the load):
         ld (%rdi), %eax
         stall/nop
         add %edx, %eax
         ...
Updated code (w/ compiler reordering: add $1, %ecx moved into the load-delay slot):
         ld (%rdi), %eax
         add $1, %ecx
         add %edx, %eax
         st %eax, (%rdi)
         add $4, %rdi
         j L1
Could we have moved any other instruction into that slot?

Taken or Not Taken: Branch Behavior (2a.51)
When a conditional jump/branch condition is True, we say it is Taken (T); when False, Not Taken (NT). Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken; effectively, our pipeline "predicts" that each branch is Not Taken. In the loop above, the j L1 instruction is always Taken and thus will incur wasted clock cycles each time it is executed, while most of the time the jge L2 will be Not Taken and perform well.

Branch Delay Slots (2a.52)
Problem: After a jump/branch we fetch instructions that we are
not sure should be executed. Idea: Find instruction(s) that should be executed regardless of whether the branch is taken or not, move those instructions to directly follow the branch, and have the HW just let them be executed (not flushed) no matter what the branch outcome is. Branch delay slot(s) = instruction slot(s) after a jump/branch that the HW will always execute (never flush).
Branch Delay Slot Example (2a.53)
Assume a single-instruction delay slot. Before:
        ld (%rdi), %rcx
        add %rbx, %rax
        cmp %rcx, %rdx
        je NEXT
        <delay slot instruc.>
        NOT TAKEN CODE
    NEXT: TAKEN CODE
Move an ALWAYS-executed instruction down into the delay slot and let it execute no matter what. After:
        ld (%rdi), %rcx
        cmp %rcx, %rdx
        je NEXT
        add %rbx, %rax    # delay slot
        NOT TAKEN CODE
    NEXT: TAKEN CODE
(Flowchart perspective: the delay-slot instruction sits between the branch and both the taken and not-taken paths, so it runs on either outcome.)

Implementing Branch Delay Slots (2a.54)
The HW defines the number of branch delay slots (usually a small number, 1 or 2). The compiler is responsible for reordering instructions to fill the delay slots; it must find instructions that the branch does not depend on. Here we cannot move the ld into the delay slot, because the je depends (via cmp %rcx, %rdx) on the %rcx value it generates. If the comparison were instead cmp %rcx, %rax, the add %rbx, %rax could not move either; if no instruction can be found, a nop can be inserted by the compiler and those cycles are simply wasted.

A Look Ahead: Branch Prediction (2a.55-2a.56)
Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch. Could we build a pipeline that could predict Taken?! The location to jump to (branch target) is not known until the instruction is decoded. But suppose we could overcome those problems: would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?
- We could allow a static prediction per instruction (give a bit with the branch that indicates T or NT).
- We could allow dynamic prediction per instruction (use its history).
Loops: high probability of being Taken, so prediction can be static. If statements: may exhibit data-dependent behavior, so prediction may need to be dynamic.
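The "use its history" option is usually realized as a small saturating counter per branch. The 2-bit scheme below is a standard textbook design, offered as an illustrative sketch rather than anything the slides specify: counter values 0-1 predict Not Taken, 2-3 predict Taken, and each actual outcome nudges the counter.

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch: 0,1 -> NT; 2,3 -> T."""

    def __init__(self):
        self.counters = {}           # branch PC -> counter state

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

# A loop-closing branch (like j L1) is taken every iteration:
p = TwoBitPredictor()
correct = 0
for _ in range(10):
    if p.predict(0x400):             # hypothetical branch address
        correct += 1
    p.update(0x400, True)
```

After two warm-up mispredictions the counter saturates and every later prediction is correct (8 of 10 here), whereas the pipeline's current static "predict Not Taken" policy would mispredict this branch on every single iteration.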
Summary (2a.57)
Pipelining is an effective and important technique to improve the throughput of a processor. Overlapping execution creates hazards (data, control, and structural), which lead to stalls or wasted cycles. More hardware can be invested to mitigate the stalls (e.g., forwarding), and the compiler can help by reordering code to avoid stalls and perform useful work (e.g., filling branch delay slots).