Processor Design Pipelined Processor. Hung-Wei Tseng

Pipelining

Pipelining: break up the logic with registers into pipeline stages. Each stage can act on a different instruction or piece of data. The state and control signals of each instruction are held in the pipeline registers (latches) between stages.

Pipelining, cycles #1 to #5: after the 5th cycle, the processor can work on 5 instructions in parallel.

Pipelining, cycles #6 to #10: the processor can complete 1 instruction each cycle. CPI == 1 if everything works perfectly!
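The cycle counts on these slides can be sketched with a small helper. This is a minimal sketch that assumes an ideal pipeline with no hazards; the function names `pipeline_cycles` and `cpi` are illustrative and do not come from the slides.

```python
def pipeline_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Cycles needed by an ideal pipeline with no hazards or stalls."""
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles to drain through the
    # pipeline; every later instruction completes one cycle after the previous one.
    return num_stages + (num_instructions - 1)


def cpi(num_instructions: int, num_stages: int = 5) -> float:
    """Average cycles per instruction for the same ideal pipeline."""
    return pipeline_cycles(num_instructions, num_stages) / num_instructions
```

For short programs the fill time dominates (5 instructions take 9 cycles), while for long programs the CPI approaches 1.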

Single-cycle v.s. pipeline v.s.

Cycle time of a pipeline processor: the critical path is the longest possible delay between two registers in a design. The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path. Lengthening or shortening non-critical paths does not change performance. Ideally, all paths are about the same length.
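A rough illustration of how the critical path sets the cycle time follows. The stage delays in `STAGE_DELAYS_PS` are hypothetical numbers chosen for the example; they are not taken from the slides.

```python
# Hypothetical per-stage logic delays in picoseconds (assumed for illustration).
STAGE_DELAYS_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(delays: dict) -> int:
    """A single-cycle design must fit all stages into one clock period."""
    return sum(delays.values())


def pipelined_cycle_time(delays: dict) -> int:
    """A pipelined design is limited by its slowest stage (the critical path)."""
    return max(delays.values())


# Shortening a non-critical stage does not change the pipelined cycle time.
faster_decode = dict(STAGE_DELAYS_PS, ID=50)
# Lengthening a critical stage does.
slower_execute = dict(STAGE_DELAYS_PS, EX=250)
```

With these assumed delays the single-cycle period is 800 ps and the pipelined period is 200 ps, so the ideal speedup is 4 rather than 5 because the stages are not perfectly balanced.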

Designing a 5-stage pipeline processor for MIPS

Basic steps of execution. Instruction fetch: where is the instruction? In instruction memory. Decode: what's the instruction, and where are the operands? In the registers. Execute: on the ALUs. Memory access: where is my data? In data memory. Write back: where to put the result? In the registers. Finally, determine the next PC. [Figure: a processor containing the PC, the registers, and an ALU, connected to an instruction memory and a data memory whose contents are shown as address/value listings.]

Pipeline a MIPS processor. Instruction Fetch (IF): read the instruction from instruction memory. Instruction Decode (ID): figure out the incoming instruction and fetch the operands from the registers. Execution (EX): perform ALU functions. Memory Access (MEM): read/write data memory. Write Back (WB): write the results to the register file.

From single-cycle to pipeline. [Figure: the single-cycle MIPS datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and control, with PCSrc = Branch & Zero) cut into Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write Back by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.] Will this work?

Pipelined processor. [Figure sequence: five instructions (add, lw, sub, sub, sw) move through the pipelined datapath one stage per cycle. The control signals needed by later stages are carried along in the pipeline registers ("Where can I find these?"). One slide asks "Is this right?" about RegWrite; in the following slides RegDst, RegWrite, and MemtoReg are drawn at the later stages instead of next to the decode stage.]

5-stage pipelined processor. [Figure: the resulting datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Simplified pipeline diagram: use symbols to represent the physical resources, labeled with the abbreviations for the pipeline stages: IF, ID, EX, MEM, WB. The horizontal axis represents the timeline, and the vertical axis represents the instruction stream. Example: add $, $2, $3; lw $4, ($5); sub $6, $7, $8; sub $9,$,$; sw $, ($2).

Pipeline hazards

Pipeline hazards: even though we perfectly divide the pipeline stages, it's still hard to achieve CPI == 1. There are three kinds of pipeline hazards. Structural hazard: the hardware does not allow two pipeline stages to work concurrently. Data hazard: a later instruction in a pipeline stage depends on the outcome of an earlier instruction in the pipeline. Control hazard: the processor is not clear about which instruction to fetch next.

Can we get the right result? Given the current 5-stage pipeline, how many of the following MIPS code sequences can work correctly? Each sequence has five instructions, labeled a to e. I: add $, $2, $3; lw $4, ($); sub $6, $7, $8; sub $9,$,$; sw $, ($2). II: add $, $2, $3; lw $4, ($5); sub $6, $7, $8; sub $9, $, $; sw $, ($2). III: add $, $2, $3; lw $4, ($5); bne $, $7, L; sub $9,$,$; sw $, ($2). IV: add $, $2, $3; lw $4, ($5); sub $6, $7, $8; sub $9,$,$; sw $, ($2). Data hazard: b cannot get the $ produced by a before a writes it back. Structural hazard: both a and d are accessing $ at the 5th cycle. Control hazard: we don't know whether d and e will be executed or not.

Structural hazard

Structural hazard: the hardware cannot support the combination of instructions that we want to execute in the same cycle. The original pipeline incurs a structural hazard when two instructions compete for the same register. Solution: write early, read late. Writes occur at the clock edge and complete long enough before the end of the clock cycle, which leaves enough time for the outputs to settle for reads. The revised register file is the default one from now on! Example: add $, $2, $3; lw $4, ($5); sub $6, $7, $8; sub $9,$, $; sw $, ($2).

Structural hazard: the design of the hardware causes structural hazards, so we need to modify the hardware design to avoid them.

Data hazard

Data hazard: occurs when an instruction in the pipeline needs a value that is not available yet. Data dependences: the output of an instruction is the input of a later instruction. A dependence may result in a data hazard if the later instruction needs the result while the earlier instruction is still in the pipeline.
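As a small sketch of what a data dependence looks like, the helper below lists read-after-write dependences in a straight-line instruction sequence. The `Instr` tuple, the function name, and the register numbers in `PROGRAM` are assumptions made for this example (several register numbers in the slide examples were lost in transcription).

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[int]       # destination register number, or None
    srcs: Tuple[int, ...]     # source register numbers


def raw_dependences(program: List[Instr]) -> List[Tuple[int, int, int]]:
    """Return (producer_index, consumer_index, register) for each RAW dependence."""
    last_writer = {}
    deps = []
    for index, instr in enumerate(program):
        for reg in instr.srcs:
            # Register $0 is hardwired to zero in MIPS, so it never carries a dependence.
            if reg != 0 and reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        if instr.dest is not None and instr.dest != 0:
            last_writer[instr.dest] = index
    return deps


# Hypothetical sequence: lw needs $1 from add, and sub needs $4 from lw.
PROGRAM = [
    Instr("add $1, $2, $3", 1, (2, 3)),
    Instr("lw $4, 0($1)", 4, (1,)),
    Instr("sub $5, $2, $4", 5, (2, 4)),
]
```

Whether a dependence turns into a hazard depends on how close producer and consumer are in the pipeline and on whether the hardware forwards results.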

Sol. of data hazard I: Stall. When the source operand of an instruction is not ready, stall the pipeline: suspend the instruction and the instructions following it, and allow the previous instructions to proceed. This introduces a pipeline bubble: a bubble does nothing and propagates through the pipeline like a nop instruction. How to stall the pipeline? Disable the PC update and disable the pipeline registers of the earlier pipeline stages. When the stall is over, re-enable the pipeline registers and the PC update.

Hazard detection & stall. [Figure: the 5-stage datapath with a hazard detection unit that controls PCWrite and IF/IDWrite and takes ID/EX.MemRead as an input.] Check if the destination register of the instruction in EX == a source register of the instruction in ID. Check if the destination register of the instruction in MEM == a source register of the instruction in ID. Insert a noop if we need to stall.

Performance of stall: insert a noop into the pipeline; if the stall continues, insert another noop while the previous noop moves on to the next stage. Example: add $, $2, $3; lw $4, ($); sub $5, $2, $4; sub $, $3, $; sw $, ($5). 15 cycles! CPI == 3. (If there is no stall, CPI should be just 1!)
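The cost of stalling can be estimated with a simple timing model. This sketch assumes a 5-stage pipeline without forwarding and a write-early, read-late register file, so a consumer can sit in ID during the same cycle in which its producer is in WB. The register numbers in `PROGRAM` are hypothetical, chosen so that each dependence resembles the slide example.

```python
def cycles_without_forwarding(program):
    """program: list of (dest, srcs) register numbers; dest is None if nothing is written."""
    wb_cycle = {}   # register -> cycle in which its latest producer is in WB
    prev_id = 1     # the first instruction is in IF during cycle 1
    last_wb = 0
    for dest, srcs in program:
        # Earliest cycle in which every source has been written back.
        ready = max((wb_cycle.get(reg, 0) for reg in srcs), default=0)
        # In-order pipeline: at least one cycle after the previous instruction's ID.
        id_cycle = max(prev_id + 1, ready)
        last_wb = id_cycle + 3          # ID -> EX -> MEM -> WB
        if dest is not None:
            wb_cycle[dest] = last_wb
        prev_id = id_cycle
    return last_wb


# Hypothetical register numbers (assumed for illustration).
PROGRAM = [
    (1, (2, 3)),      # add $1, $2, $3
    (4, (1,)),        # lw  $4, 0($1)
    (5, (2, 4)),      # sub $5, $2, $4
    (6, (3, 1)),      # sub $6, $3, $1
    (None, (6, 5)),   # sw  $6, 0($5)
]
total_cycles = cycles_without_forwarding(PROGRAM)
stall_cpi = total_cycles / len(PROGRAM)
```

Under these assumptions the model gives 15 cycles for the five instructions, that is, a CPI of 3, compared with 9 cycles when no instruction depends on another.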

Sol. of data hazard II: Forwarding. The result is available after the EX and MEM stages, but is only published in WB! The data is already there, so we should use it right away! This is also called bypassing. Example: add $, $2, $3; lw $4, ($); sub $5, $2, $4; sub $, $3, $; sw $, ($5). We can obtain the result here! [Figure: pipeline diagram marking where each result first becomes available.]

Sol. of data hazard II: Forwarding. Take the values, wherever they are! For the same five instructions: 10 cycles! CPI == 2. (Not optimal, but much better!)

When can/should we forward data? If the instruction entering the EX stage consumes a result from a previous instruction that is entering the MEM stage or the WB stage. That is, a source of the instruction entering the EX stage is the destination of an instruction entering the MEM/WB stage, and that previous instruction must be an instruction that updates the register file.
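The forwarding condition can be written as a small selection function. This is a sketch of the usual textbook formulation rather than the exact logic on the slides; the parameter names and the returned labels are assumptions, and the check against register 0 reflects that MIPS register $0 is always zero.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU operand of the instruction entering EX should come from.

    Returns "EX/MEM", "MEM/WB", or "REGFILE" (labels chosen for this example).
    """
    # Prefer the most recent producer: the instruction entering MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise use the older producer: the instruction entering WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No producer in flight: the value read from the register file is current.
    return "REGFILE"
```

The same function would be evaluated twice per instruction, once for rs (ForwardA) and once for rt (ForwardB).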

Forwarding in hardware. [Figure: the 5-stage datapath with a forwarding unit. For a previous instruction (Ins#1) and the current instruction (Ins#2), the unit compares Rs and Rt of Ins#2 against the destination (Rd) of Ins#1, checks the RegWrite control of Ins#1, and drives ForwardA and ForwardB, which select between the register file values and the ALU/MEM result of Ins#1 at the ALU inputs.] How about load?

There is still a case where we have to stall... Revisit the following code: add $, $2, $3; lw $4, ($); sub $5, $2, $4; sub $, $3, $; sw $, ($5). lw generates its result at the MEM stage, so we have to stall: if the instruction entering the EX stage depends on a load instruction that has not finished its MEM stage yet, we have to stall! We call this hazard detection. We need to know the following: 1. If an instruction in EX/MEM updates a register (RegWrite). 2. If an instruction in EX/MEM reads memory (MemRead). 3. If the destination register of EX/MEM is a source of ID/EX (rs, rt of ID/EX == rt of EX/MEM).
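The three conditions listed above combine into one boolean check. The sketch below names the two instructions generically as the load (producer) and the consumer so that it does not depend on which pipeline registers hold them; all parameter names are assumptions for this example.

```python
def load_use_stall(load_regwrite, load_memread, load_rt, consumer_rs, consumer_rt):
    """True if the consumer must wait one cycle for the value a load is fetching."""
    return bool(
        load_regwrite                               # 1. the earlier instruction updates a register
        and load_memread                            # 2. it is a load (reads memory)
        and load_rt in (consumer_rs, consumer_rt)   # 3. its destination is one of our sources
    )
```

For a lw that writes register 4 followed by an instruction that reads registers 2 and 4, `load_use_stall(True, True, 4, 2, 4)` evaluates to True, so one bubble is inserted; forwarding covers the remaining distance after that.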

Hazard detection with forwarding. [Figure: the 5-stage datapath with both the hazard detection unit (PCWrite, IF/IDWrite, ID/EX.MemRead) and the forwarding unit (ForwardA, ForwardB).]

Control hazard

Control hazard: the processor cannot determine the next PC to fetch. Example: LOOP: lw $t3, ($s); addi $t, $t,; add $v, $v, $t3; addi $s, $s, 4; bne $t, $t, LOOP. The fetch of the next lw $t3, ($s) has to stall until the branch is resolved: 7 cycles per loop.

Reducing the overhead of control hazards

Solution I: Delayed branches. An agreement between the ISA and the hardware. Branch delay slots: the next N instructions after a branch are always executed. The compiler decides which instructions go in the branch delay slots; reordering the instructions must not affect the correctness of the program. MIPS has one branch delay slot. Good: simple hardware. Bad: N cannot change, and sometimes the compiler cannot find good candidates for the slot.

Solution I: Delayed branches. Original: LOOP: lw $t3, ($s); addi $t, $t,; add $v, $v, $t3; addi $s, $s, 4; bne $t, $t, LOOP. With the branch delay slot filled: LOOP: lw $t3, ($s); addi $t, $t,; add $v, $v, $t3; bne $t, $t, LOOP; addi $s, $s, 4 (branch delay slot). The next lw $t3, ($s) still stalls: 6 cycles per loop.

Solution II: always predict not-taken. Always predict that the next PC is PC+4. Example: LOOP: lw $t3, ($s); addi $t, $t,; add $v, $v, $t3; addi $s, $s, 4; bne $t, $t, LOOP; sw $v, ($s); add $t4, $t3, $t5. If the branch is not taken: no stalls! If the branch is taken: it doesn't hurt, because we flush the instructions fetched incorrectly (they become nops) and then fetch lw $t3, ($s): 7 cycles per loop, no worse than stalling.

Solution III: always predict taken. [Figure sequence: the 5-stage datapath with the hazard detection and forwarding units; the shift-left-2 unit and the branch target adder are drawn earlier in the pipeline.] Still have to stall 1 cycle. [Figure: a Branch Target Buffer is added.] Consult the BTB in the fetch stage.

Branch Target Buffer. [Figure: the PC indexes the Branch Target Buffer; each entry holds a branch PC and its target address or target instruction.]
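A Branch Target Buffer can be pictured as a small lookup table keyed by the branch PC. The sketch below uses an unbounded dictionary, which ignores the limited capacity and tag matching of a real BTB; the class name, method names, and addresses are assumptions for this example.

```python
class BranchTargetBuffer:
    """Maps the PC of a branch to the target address seen last time."""

    def __init__(self):
        self.entries = {}

    def lookup(self, pc):
        # Returns None on a miss, i.e. the PC is not a known taken branch.
        return self.entries.get(pc)

    def update(self, pc, target):
        self.entries[pc] = target


def next_fetch_pc(btb, pc):
    """Fetch-stage decision: use the predicted target on a hit, else fall through."""
    target = btb.lookup(pc)
    return target if target is not None else pc + 4


btb = BranchTargetBuffer()
btb.update(0x400, 0x3F0)   # hypothetical loop branch at 0x400 jumping back to 0x3F0
```

Because the lookup uses only the PC, it can happen in the fetch stage, before the instruction has even been decoded.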

Solution III: always predict taken. Always predict taken with the help of the BTB. Example: LOOP: lw $t3, ($s); addi $t, $t,; add $v, $v, $t3; addi $s, $s, 4; bne $t, $t, LOOP, followed immediately by the next iteration's lw $t3, ($s); addi $t, $t,; add $v, $v, $t3. 5 cycles per loop (CPI == 1!!!). But what if the branch is not always taken?

Dynamic branch prediction

1-bit counter: predict that this branch will go the same way as the last time it executed: 1 for taken, 0 for not taken. [Figure: a Branch Target Buffer with four entries, each holding a branch PC, a target address, and a 1-bit counter; the current PC matches the first entry, whose bit says Taken!]
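A last-outcome predictor is short enough to simulate directly. In this sketch the class name, the default prediction of not taken for an unseen branch, and the outcome sequence in `LOOP_OUTCOMES` are assumptions made for illustration.

```python
class OneBitPredictor:
    """Predicts that each branch repeats its most recent outcome."""

    def __init__(self):
        self.last_outcome = {}   # branch PC -> True (taken) / False (not taken)

    def predict(self, pc):
        # Assumption: a branch that has never been seen is predicted not taken.
        return self.last_outcome.get(pc, False)

    def update(self, pc, taken):
        self.last_outcome[pc] = taken


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch that is taken 9 times and then falls through, executed twice.
LOOP_OUTCOMES = ([True] * 9 + [False]) * 2
misses = count_mispredictions(OneBitPredictor(), 0x400, LOOP_OUTCOMES)
```

The predictor mispredicts twice per loop execution: once on the exit and once more when the loop is entered again, because the exit flipped its single bit.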

2-bit counter: a 2-bit counter for each branch. Predict taken if the counter value >= 2. If the prediction is in one of the taken states, fetch from the target PC; otherwise, use PC+4. [Figure: a four-state diagram with two Taken states (3 and 2) and two Not Taken states; a taken branch moves the counter toward 3 and a not-taken branch moves it toward the Not Taken states. The Branch Target Buffer now holds a 2-bit counter per entry; the current PC matches the first entry, which predicts Taken!]
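The state machine can be sketched as a saturating counter. The class name and the choice of initial state are assumptions for this example; the prediction rule (taken if the value is at least 2) follows the slide.

```python
class TwoBitCounter:
    """Saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken


counter = TwoBitCounter(state=3)
counter.update(False)                 # a single not-taken outcome...
still_predicts_taken = counter.predict()   # ...does not flip the prediction
```

Compared with a 1-bit scheme, one surprising outcome only weakens the prediction; it takes two in a row to change it.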

Performance of 2-bit counter: a 2-bit state machine for each branch. Example: for(i = 0; i < 10; i++) { sum += a[i]; }. The loop branch is predicted taken (T) every time; the actual outcome is taken for every execution except the last, which is not taken (NT). 90% accuracy! Application: 80% ALU, 20% branch, and branches resolved in the EX stage. Average CPI? 1 + 20% * (1 - 90%) * 2 = 1.04.
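The CPI estimate above is a one-line formula. The helper below is a sketch with illustrative parameter names; the 2-cycle penalty corresponds to a branch resolved in the EX stage, as stated on the slide.

```python
def average_cpi(branch_fraction, accuracy, penalty_cycles, base_cpi=1.0):
    """Base CPI plus the expected misprediction penalty per instruction."""
    return base_cpi + branch_fraction * (1.0 - accuracy) * penalty_cycles


cpi_two_bit = average_cpi(branch_fraction=0.20, accuracy=0.90, penalty_cycles=2)
cpi_no_prediction = average_cpi(branch_fraction=0.20, accuracy=0.0, penalty_cycles=2)
```

With 20% branches and 90% accuracy the result is about 1.04, whereas paying the 2-cycle penalty on every branch would give 1.4.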

Make the prediction better. Consider the following code:

i = 0;
do {
    if( i % 3 != 0 ) // Branch Y, taken if i % 3 == 0
        a[i] *= 2;
    a[i] += i;
} while ( ++i < ) // Branch X

Can we capture the pattern?

i   branch  result
0   Y       T
0   X       T
1   Y       NT
1   X       T
2   Y       NT
2   X       T
3   Y       T
3   X       T
4   Y       NT
4   X       T
5   Y       NT
5   X       T
6   Y       T
6   X       T
7   Y       NT

Predict using history: instead of using the PC to choose the predictor, use a bit vector (global history register, GHR) made up of the previous branch outcomes. Each entry in the history table has its own counter. [Figure: an n-bit GHR, for example (T, NT, T), is used as the index into a history table with 2^n entries; the selected counter predicts Taken!]

Performance of global history predictor. Consider the following code:

i = 0;
do {
    if( i % 3 != 0 ) // Branch Y, taken if i % 3 == 0
        a[i] *= 2;
    a[i] += i;
} while ( ++i < ) // Branch X

Assume that we start with a 4-bit GHR and all counters in the same initial state. The GHR and BHT columns of the slide's table did not survive; the prediction and actual outcome columns are:

i   branch  prediction  actual
0   Y       T           T
0   X       T           T
1   Y       T           NT
1   X       T           T
2   Y       T           NT
2   X       T           T
3   Y       T           T
3   X       T           T
4   Y       T           NT
4   X       T           T
5   Y       NT          NT
5   X       T           T
6   Y       T           T
6   X       T           T
7   Y       NT          NT
7   X       T           T
8   Y       NT          NT
8   X       T           T
9   Y       T           T
9   X       T           T
10  Y       NT          NT

Nearly perfect after the warm-up: every prediction from i = 5 on is correct.
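The behavior of a global-history predictor on this branch pattern can be checked with a short simulation. This is a sketch under assumptions: the table is indexed by the GHR alone, every counter is a 2-bit saturating counter that starts at 0, the GHR starts at all zeros, and the loop exit is not modeled. The slide's own initial values were lost, so the warm-up here may differ from the table above.

```python
def simulate_global_history(outcomes, history_bits=4, init_counter=0):
    """Return a list of booleans: True where the prediction matched the outcome."""
    mask = (1 << history_bits) - 1
    table = [init_counter] * (1 << history_bits)   # one 2-bit counter per history
    ghr = 0                                        # global history register
    correct = []
    for taken in outcomes:
        predicted_taken = table[ghr] >= 2
        correct.append(predicted_taken == taken)
        # Train the counter selected by the current history.
        if taken:
            table[ghr] = min(3, table[ghr] + 1)
        else:
            table[ghr] = max(0, table[ghr] - 1)
        # Shift the newest outcome into the history.
        ghr = ((ghr << 1) | int(taken)) & mask
    return correct


# Interleave branch Y (taken when i % 3 == 0) and branch X (loop back-edge, taken).
outcomes = []
for i in range(100):
    outcomes.append(i % 3 == 0)   # branch Y
    outcomes.append(True)         # branch X
correct = simulate_global_history(outcomes)
```

The outcome stream repeats with period 6 (T, T, NT, T, NT, T), and each 4-outcome window in that stream is followed by a unique outcome, so once the counters are trained the predictor makes no further mistakes.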

Branch prediction and modern processors

Deeper pipeline: higher frequencies by shortening the pipeline stages. Higher marketing value, since consumers usually link performance with frequency. Potentially higher power consumption, as dynamic/active power = aCV^2 f. If the execution time is better, the processor can still consume less energy.
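The power and energy trade-off can be made concrete with a few lines. The parameter values below are arbitrary illustrative numbers, not measurements of any real processor.

```python
import math


def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """Dynamic (active) power in watts: a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def energy(power_w, seconds):
    """Energy in joules for running at a given power for a given time."""
    return power_w * seconds


# Hypothetical chip: doubling the frequency doubles the dynamic power.
base_power = dynamic_power(0.5, 1e-9, 1.0, 1e9)
fast_power = dynamic_power(0.5, 1e-9, 1.0, 2e9)
```

If the faster design finishes the same work in less than half the time, its energy is lower even though its power is higher, for example `energy(fast_power, 0.4)` versus `energy(base_power, 1.0)`.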

Case Study

Intel Pentium 4 Microarch.

Intel Pentium 4: a very deep pipeline, in order to achieve high frequency (starting from 1.5GHz)! 20 stages in NetBurst: 1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive. 31 stages in Prescott: 3W (3.6GHz, 65nm). Reference: The Microarchitecture of the Pentium 4 Processor.

AMD Athlon 64

AMD Athlon 64: 12-stage pipeline. Stages: 1 Inst. Addr Decode, 2 Inst Mem, 3 Inst. Byte Pick, 4 [...], 5 [...] 2, 6 Inst. Dbl. & Pack, 7 [...] and Pack, 8 Dispatch, 9 Scheduling, 10 Execution, 11 D-Cache Address, 12 D-cache Access. 89W TDP (Opteron 2.2GHz, 90nm).

Demo revisited: why does sorting the array speed up the code despite the increased instruction count? if(option) std::sort(data, data + arraysize); for (unsigned i = ; i < ; ++i) { int threshold = std::rand(); for (unsigned i = ; i < arraysize; ++i) { if (data[i] >= threshold) sum ++; } }
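One way to see the effect is to feed the outcomes of the `data[i] >= threshold` test to a 2-bit counter, once for unsorted and once for sorted data. This is a sketch, not the demo itself: the data size, value range, seed, and threshold are assumptions, and a real processor uses more elaborate predictors.

```python
import random


def mispredictions(data, threshold):
    """Count mispredictions of a single 2-bit counter on the branch data[i] >= threshold."""
    state = 0      # 0 and 1 predict not taken, 2 and 3 predict taken
    misses = 0
    for value in data:
        taken = value >= threshold
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


rng = random.Random(0)                               # fixed seed for repeatability
data = [rng.randrange(256) for _ in range(10000)]    # hypothetical input array
unsorted_misses = mispredictions(data, 128)
sorted_misses = mispredictions(sorted(data), 128)
```

On sorted data the branch outcome changes only once, so the counter mispredicts just a couple of times around that point; on unsorted data the outcome is close to a coin flip and a large share of the branches are mispredicted. The extra instructions spent sorting are outweighed by the pipeline flushes that are avoided.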

Deep pipelining and data hazards

Data hazard revisited: how many cycles does it take to execute the following code? Draw the pipeline execution diagram, assuming that we have full data forwarding. lw $t, ($a); lw $a, ($t); bne $a, $zero, EX. Answer: 9 cycles.
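The 9-cycle answer can be reproduced with a small counting model. It assumes a 5-stage pipeline with full forwarding, where the only remaining stall is one bubble when an instruction uses the result of the load directly before it, and where the branch needs its operands in EX. The register names in `PROGRAM` are hypothetical stand-ins, since some register numbers were lost in the transcript.

```python
def cycles_with_forwarding(program, num_stages=5):
    """program: list of (op, dest, srcs); counts cycles until the last instruction finishes."""
    if not program:
        return 0
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_srcs) in zip(program, program[1:]):
        # Load-use hazard: the loaded value only exists after MEM, one cycle
        # too late for the next instruction's EX stage, so insert one bubble.
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return num_stages + len(program) - 1 + stalls


# Hypothetical register names standing in for the slide's example.
PROGRAM = [
    ("lw", "t0", ("a0",)),           # lw  t0, 0(a0)
    ("lw", "a0", ("t0",)),           # lw  a0, 0(t0)   (uses t0 from the first lw)
    ("bne", None, ("a0", "zero")),   # bne a0, zero, ... (uses a0 from the second lw)
]
total = cycles_with_forwarding(PROGRAM)
```

Three instructions would take 7 cycles without hazards; the two back-to-back load-use dependences add one bubble each, giving 9. In a deeper pipeline the distance between the stage that produces a value and the stage that consumes it grows, so each such dependence costs more bubbles.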

Intel's latest Skylake. [Figure: block diagram. Front end: BPU, 32K L1 Instruction Cache, MSROM (4 uops/cycle), Decoded Icache (DSB) (6 uops/cycle), Legacy Decode Pipeline (5 uops/cycle), Instruction Decode Queue (IDQ, or micro-op queue). Then Allocate/Rename/Retire/MoveElimination/ZeroIdiom and the Scheduler, which issues to the ports. Port 0: Int ALU, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Divide, Branch2. Port 1: Int ALU, Fast LEA, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Int MUL, Slow LEA. Port 5: Int ALU, Fast LEA, Vec SHUF, Vec ALU, CVT. Port 6: Int ALU, Int Shft, Branch1. Port 2: LD/STA. Port 3: LD/STA. Port 4: STD. Port 7: STA. Memory: 32K L1 Data Cache, 256K L2 Cache (Unified).] Good reference for Intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf