CS356 Unit 12a: Logic Circuits, Combinational Logic Gates (Basic HW), Processor Hardware Organization, Pipelining


12a.1-12a.2 CS356 Unit 12a: Processor Hardware Organization, Pipelining -- BASIC HW

12a.3 Logic Circuits: Combinational Logic Gates

12a.4 Combinational vs. Sequential Logic

Combinational logic performs a specific function (a mapping of input combinations to desired output combinations). It has no internal state or feedback: given a set of inputs, we will always get the same output after some time (propagation). Combinational logic usually performs operations like +, -, *, /, &, |, <<, and its outputs depend only on the current inputs.

Sequential logic (storage devices) remembers a set of bits for later use; registers are the fundamental building blocks. A register acts like a variable from software and is controlled by a "clock" signal. Sequential circuits feed values back and provide "memory" by holding "state": outputs depend on the current inputs and on previous inputs (previous inputs summarized via the state).

Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from some digital inputs. (Figure: an example circuit built from AND, OR, and NOT gates.)

12a.5 Propagation Delay

Main Idea: All digital logic circuits have propagation delay -- the time it takes for the output to change when the inputs are changed. In the example circuit, it takes 4 gate delays for an input change to propagate to the outputs.

12a.6 Combinational Logic Functions

A combinational function maps input combinations of n bits to a desired m-bit output. We can describe the function with a truth table (e.g., inputs IN0, IN1, IN2 and outputs OUT0, OUT1) and then find its circuit implementation.

12a.7 ALUs

An ALU performs a selected operation on two 32-bit input numbers A (X[31:0]) and B (Y[31:0]). A function-select input FS[5:0] selects the desired operation; the ALU produces RES[31:0] along with condition flags (OF, ZF, CF). Typical operations (binary function codes omitted here): A SHL B, A SHR B, A SAR B, A + B, A - B, A * B (signed and unsigned), A / B (signed and unsigned), A AND B, A OR B, A XOR B, A NOR B, A < B.

12a.8 Registers (Sequential Devices)

A register captures the D input value when a control input (aka the clock signal) transitions from 0 to 1 (a clock edge) and stores that value at the Q output until the next clock edge. A register is like a variable in software: it stores a value for later use. We can choose to clock the register only at "appropriate" times when we want it to capture a new value (i.e., at the end of an instruction).

Key Idea: Registers hold data while we operate on those values. Example: accumulating a sum in %rax across "add %rbx,%rax" then "add %rcx,%rax". The clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.

12a.9 Clock Signal

The clock is an alternating high/low (e.g., 5V/0V) voltage pulse train that controls the ordering and timing of operations performed in the processor. A cycle is usually measured from rising edge to rising edge. Clock frequency = number of cycles per second (e.g., 2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle). (Figure: clock waveform with operations Op. 1, Op. 2, Op. 3 each taking one cycle.)

12a.10 Processor

Basic HW organization for a simplified instruction set: FROM X86 TO RISC.

12a.11 From CISC to RISC

Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute. The benefit is that fewer instructions are needed to accomplish a task. Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps. It often requires more instructions to describe the overall task (larger code size); see the example below. RISC makes the hardware design easier, so let's tweak our x86 instructions to be more RISC-like.

// CISC instruction
movq 0x4(%rdi, %rsi, 4), %rax

// RISC equiv. w/ 1 mem. access or 1 op. per instruction
mov %rsi, %rbx    # use %rbx as a temp.
shl $2, %rbx      # %rsi * 4
add %rdi, %rbx    # %rdi + (%rsi*4)
add $0x4, %rbx    # 0x4 + %rdi + (%rsi*4)
mov (%rbx), %rax  # %rax = *%rbx

CISC vs. RISC Equivalents

12a.12 A RISC Subset of x86

Split mov instructions that access memory into separate instructions: ld = load from memory, st = store to memory. Limit ld and st instructions to use at most one base register plus a displacement. No "ld 0x4(%rdi, %rsi, 4), %rax" -- too much work; at most "ld disp(reg), %rax" or "st %rax, disp(reg)". Limit arithmetic and logic instructions to operate only on registers: no "add (%rsp), %rax", since this implicitly accesses (dereferences) memory; only register-register operations like "add %rbx, %rax".

// 3 x86 memory read instructions
mov (%rdi), %rax          // 1
mov 0x4(%rdi), %rax       // 2
mov 0x4(%rdi,%rsi), %rax  // 3

// Equiv. load sequences
ld (%rdi), %rax           // 1
ld 0x4(%rdi), %rax        // 2
mov %rsi, %rbx            // 3a
add %rdi, %rbx            // 3b
ld 0x4(%rbx), %rax        // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)          // 1
mov %rax, 0x4(%rdi)       // 2
mov %rax, 0x4(%rdi,%rsi)  // 3

// Equiv. store sequences
st %rax, (%rdi)           // 1
st %rax, 0x4(%rdi)        // 2
mov %rsi, %rbx            // 3a
add %rdi, %rbx            // 3b
st %rax, 0x4(%rbx)        // 3c

// CISC instruction
add %rax, (%rsp)

// Equiv. RISC sequence (w/ ld and st)
ld (%rsp), %rbx
add %rax, %rbx
st %rbx, (%rsp)

12a.13 Developing a Processor Organization

Identify which hardware components each instruction type would use and in what order:
R-Type (add %rax,%rbx): instruction memory (I-MEM), registers (%rax, %rbx, etc., aka the register file), ALU, register write-back.
LD (ld 8(%rax),%rbx): I-MEM, register file, ALU (address computation), data memory (D-MEM), register write-back.
ST (st %rbx, 8(%rax)): I-MEM, register file, ALU, D-MEM.
JE (je label/displacement): I-MEM, condition codes (e.g., Zero), ALU/PC update.

12a.14 Processor Block Diagram

The instruction (machine code) fetched from I-MEM drives the register file (source operands), the control signals (e.g., ALU operation, memory Read/Write), the ALU (whose output is an address or a result), D-MEM, and finally the data to write back to the destination register. Clock cycle time = sum of the delay through the worst-case pathway = 5 ns in this example.

12a.15 Processor Execution (add)

add %rax,%rdx [machine code: 48 01 c2]: fetch the instruction, read operands %rax and %rdx from the register file, compute %rax+%rdx in the ALU, and write the result back: %rdx = %rax+%rdx.

12a.16 Processor Execution (load)

ld 0x4(%rbx),%rax [machine code: 48 8b 43 04]: fetch the instruction, read %rbx, add the displacement 0x4 to form the address, read the data from D-MEM at that address, and write it to the destination: %rax = data.

12a.17 Processor Execution (store)

st %rax,0x4(%rbx) [machine code: 48 89 43 04]: fetch the instruction, read %rbx and %rax from the register file, add the displacement 0x4 to %rbx to form the address, and write the %rax data to D-MEM at that address.

12a.18 Processor Execution (branch/jump)

je L1 (disp. = 0x8) [machine code: 74 08]: fetch the instruction, check the condition codes (ZF, OF, CF, SF), and if the condition holds, add the displacement 0x8 to the PC to form the address of the next instruction.

12a.19 Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

A counter generates i; memories supply A[i] and B[i]; combinational logic computes the sum and the divide-by-4, and the result is stored to C[i]. At 10 ns per input set, this takes 1000 ns total.

12a.20 PIPELINING

12a.21-12a.22 Pipelining and the Need for Registers

Pipelining refers to the insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., create an assembly line). The registers provide separation between combinational functions. Performing an operation yields signals with different paths and delays; without registers, fast signals could race ahead into the data values of the next operation's stage. We don't want signals from two different data values mixing, so we must collect and synchronize the values from the previous operation (at the stage registers, on the clock edge) before passing them on to the next.

12a.23 Pipelining (Instruction and Stage Views)

Pipelining overlaps the execution of multiple instructions, following the natural breakdown into stages: fetch an instruction, while decoding another, while executing another.

Clk 1: Inst. 1 fetches
Clk 2: Inst. 2 fetches, Inst. 1 decodes
Clk 3: Inst. 3 fetches, Inst. 2 decodes, Inst. 1 executes
Clk 4: Inst. 4 fetches, Inst. 3 decodes, Inst. 2 executes
Clk 5: Inst. 5 fetches, Inst. 4 decodes, Inst. 3 executes

12a.24 Balancing Pipeline Stages

The clock period must equal the LONGEST delay from register to register.
Fig. 1: If the total (unpipelined) logic delay is 20 ns => 50 MHz. Throughput: 1 instruction / 20 ns.
Fig. 2: Unbalanced stage delays (5 ns + 5 ns + 10 ns) limit the clock speed to the slowest stage (worst case). Throughput: 1 instruction / 10 ns => 100 MHz.
Fig. 3: Better to split into more, balanced stages (four 5 ns stages). Throughput: 1 instruction / 5 ns => 200 MHz.
Fig. 4: Are more stages always better? Ideally 2x stages => 2x throughput: eight 2.5 ns stages (F1, F2, D1, D2, E1a, E1b, E2a, E2b). Throughput: 1 instruction / 2.5 ns => 400 MHz. But each register adds extra delay, so at some point deeper pipelines don't pay off.

12a.25 Balancing Pipeline Stages: Main Points

Latency of any single instruction is unchanged (or slightly worse). Throughput, and thus overall program performance, can be dramatically improved. Ideally a K-stage pipeline leads to a throughput increase by a factor of K:
Non-pipelined: Latency = 20 ns, Throughput = 1x
4-stage pipeline: Latency = 20 ns, Throughput = 4x
8-stage pipeline: Latency = 20 ns, Throughput = 8x
In reality, splitting stages adds some overhead (register delay) and thus hits a point of diminishing returns.

12a.26 5-Stage Pipeline

Fetch (instruction/machine code), Decode (operands & control signals), Execute (ALU output: address or result), Memory, Write-back.

12a.27-12a.28 Pipelining Sample Sequence - 1

Let's see how a sequence of instructions is executed:
ld 0x8(%rbx), %rax
add %rcx,%rdx
je L1
Cycle 1: the LD is fetched.

12a.29-12a.32 Sample Sequence - 2 through 5

Cycle 2: Fetch the ADD while the LD is decoded and fetches its operands.
Cycle 3: Fetch the JE; decode the ADD and fetch its operands (%rcx, %rdx); the LD executes, adding the displacement 0x8 to %rbx (with the READ control signal set).
Cycle 4: Fetch instruction i+1; decode the JE and fetch its operands (the displacement); the ADD executes (%rcx + %rdx); the LD reads its word from memory.
Cycle 5: Fetch the next instruction i+2; decode instruction i+1 and fetch its operands; the JE checks whether its condition is true; the ADD just passes its sum to the next stage; the LD writes the value from memory to %rax.

12a.33-12a.34 Sample Sequence - 6 and 7

Cycle 6: Fetch the next instruction i+3; decode instruction i+2 and fetch its operands; instruction i+1 executes; the JE uses the condition codes and updates the PC; the ADD writes its sum to %rdx.
Cycle 7: The JE was taken, so fetch from the branch target and delete (flush) the wrongly fetched i+1, i+2, and i+3; the later stages do nothing for them.

12a.35-12a.36 Hazards

Hazards are problems that arise from overlapping instruction execution; they prevent parallel or overlapped execution.

Control Hazards. Problem: we don't know what instruction to fetch next, but we need to keep fetching. Examples: jumps/branches and calls.

Data Hazards. Problem: a later instruction needs data produced by an earlier instruction that is still in the pipeline. Example:
sub %rdx,%rax
add %rax,%rcx

Structural Hazards. Problem: due to limited resources, the HW doesn't support overlapping a certain combination of instructions. Examples: see next slides.

12a.37 Structural Hazards

Example structural hazard: a single cache rather than separate instruction & data caches. A structural hazard occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction from that same cache each clock cycle.

12a.38 Data Hazard - 1

sub %rdx,%rax
add %rax,%rcx

The SUB executes (%rax - %rdx) while the ADD is decoded and gets its register operands. Hazard: do we get the desired %rax value?

12a.39 Data Hazard - 2

No: the ADD would perform %rax+%rcx using the wrong value, because the new value for %rax has not been written back yet.

12a.40 Stalling

Solution 1: stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance:
sub %rdx,%rax / nop / nop / add %rax,%rcx.

12a.41 Forwarding

Solution 2: create new hardware paths to hand off (forward) the data from the producing instruction later in the pipeline to the consuming instruction behind it, e.g., routing the SUB's new %rax value directly to the ADD's ALU input.

12a.42 Solving Data Hazards

Key Point: data dependencies (i.e., instructions needing values produced by earlier ones) limit performance. Forwarding solves many of the data hazards (data dependencies) that exist: it allows instructions to continue to flow through the pipeline without the need to stall and waste time. The cost is additional hardware and added complexity. Even forwarding cannot solve all the issues: a hazard still exists when a load produces a value needed by the next instruction.

12a.43-12a.44 LD + Dependent Instruction Hazard

Even forwarding cannot prevent the need to stall when a load instruction produces a value needed by the instruction directly behind it: that would require performing two stages' worth of work (memory access and execute) in only a single cycle. We need to introduce one stall cycle (nop) into the pipeline to get the timing correct:
ld 8(%rdx),%rax / nop / add %rax,%rcx.
Keep this in mind as we move through the next slides.

12a.45 Control Hazards

Branches/jumps require us to know where we want to jump (aka the branch/jump target location -- really just the new value of the PC) and whether we branch or not (checking the jump condition). Problem: we often don't know those values until late in the pipeline, and thus we are not sure which instructions should be fetched in the interim. This requires us to flush unwanted instructions and waste time.

12a.46 Control Hazard - 1

While the JE is deep in the pipeline (using the condition codes and updating the PC), the sequential instructions i+1, i+2, and i+3 have already been fetched and decoded behind it.

12a.47 Control Hazard - 2

If the branch is taken, those wrongly fetched instructions must be deleted: we need to "flush" i+1, i+2, and i+3, fetch from the target, and waste the cycles they occupied.

12a.48 A FIRST LOOK: CODE REORDERING

Enlisting the help of the compiler.

12a.49 Two Sides of the Coin

If the hardware has some problems it just can't solve, can software (i.e., the compiler) help? Compilers can reorder instructions to take best advantage of the processor (pipeline) organization: identify the dependencies that will incur stalls and slow performance, and the jump instructions. Compilers are written with general parsing and semantic-representation front ends, but back ends that generate code optimized for a particular processor.

void sum(int* data, int n, int x) {
    for(int i=0; i < n; i++){
        data[i] += x;
    }
}

sum:  mov $0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld (%rdi), %eax
      add %edx, %eax
      st %eax, (%rdi)
      add $4, %rdi
      add $1, %ecx
      j L1
L2:   retq

C code and its assembly translation.

12a.50 How Can the Compiler Help?

Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?
A: By finding independent instructions and reordering the code. The ld's result is needed immediately by the following add, incurring a stall (nop) cycle; moving the independent "add $1, %ecx" into that slot hides the stall.

Original code (incurring a stall cycle):
      ld (%rdi), %eax
      (stall/nop)
      add %edx, %eax
      st %eax, (%rdi)
      add $4, %rdi
      add $1, %ecx
      j L1

Updated code (w/ compiler reordering):
      ld (%rdi), %eax
      add $1, %ecx
      add %edx, %eax
      st %eax, (%rdi)
      add $4, %rdi
      j L1

Could we have moved any other instruction into that slot?

12a.51 Taken or Not Taken: Branch Behavior

When a conditional jump/branch condition is true, we say it is taken (T); when false, we say it is not-taken (NT). Currently our pipeline fetches sequentially and then potentially flushes if the branch is taken: effectively, our pipeline "predicts" that each branch is not-taken. The "j L1" instruction is always taken and thus will incur wasted clock cycles each time it is executed; most of the time the "jge L2" will be not-taken and perform well.

12a.52 Branch Delay Slots

Problem: After a jump/branch we fetch instructions that we are not sure should be executed. Idea: find an instruction(s) that should always be executed (independent of whether the branch is taken or not), move those instructions to directly follow the branch, and have the HW just let them be executed (not flushed) no matter what the branch outcome is. Branch delay slot(s) = instruction slot(s) that the HW will always execute (not flush) after a jump/branch instruction.

12a.53 Branch Delay Slot Example

Assume a single-instruction delay slot.

Before:
ld (%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je NEXT
(delay slot instruc.)
NOT TAKEN CODE
NEXT: TAKEN CODE

After (move an ALWAYS-executed instruction down into the delay slot and let it execute no matter what):
ld (%rdi), %rcx
cmp %rcx, %rdx
je NEXT
add %rbx, %rax    # delay slot
NOT TAKEN CODE
NEXT: TAKEN CODE

12a.54 Implementing Branch Delay Slots

HW defines the number of branch delay slots (usually a small number, 1 or 2). The compiler is responsible for reordering instructions to fill the delay slots; it must find instructions that the branch does not depend on. In the example above, we cannot move the ld into the delay slot because the je (via cmp %rcx, %rdx) needs the %rcx value it generates, but "add %rbx, %rax" is independent and fits. If instead the compare were "cmp %rcx, %rax", the add could not be moved either; when no instruction can be found, the compiler can always insert a 'nop' and just waste those cycles.

12a.55 A Look Ahead: Branch Prediction

Currently our pipeline assumes not-taken and fetches down the sequential path after a jump/branch. Could we build a pipeline that could predict taken? The location to jump to (branch target) is not known until decode. But suppose we could overcome those problems: would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline? We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT), or a dynamic prediction per instruction (use its history).

Loops (the backward branch to the loop body): high probability of being taken; prediction can be static. If statements (the if..else branch): may exhibit data-dependent behavior; prediction may need to be dynamic.

12a.56 Demo

2a.57 Summary Pipelining is an effective and important technique to improve the throughput of a processor Overlapping execution creates hazards which lead to stalls or wasted cycles Data, Control, Structural More hardware can be investigated to attempt to mitigate the stalls (e.g. forwarding) The compiler can help reorder code to avoid stalls and perform useful work (e.g. delay slots)