Current Microprocessors. Efficient Utilization of Hardware Blocks. Pipeline

Current Microprocessors: Pipeline

Efficient Utilization of Hardware Blocks

Execution steps for an instruction:
1. Send instruction address
2. Instruction fetch
3. Store instruction
4. Decode instruction, fetch operands
5. Address computation
6. Memory access (ME)
7. Execution
8. Write back

When a single instruction (for example an ADD) goes through these steps alone, only one block is used every cycle.

Efficient Utilization of Hardware Blocks

Example: a loop that adds a constant to every element of an array (a[i] = a[i] + constant), compiled into six instructions: LDR, ADD, STR, ADD, ADD, BRn LOOP.
- When the instructions overlap, all blocks are used.
- One instruction terminates each cycle.

[Diagram: cycle-by-cycle occupancy of the hardware blocks by the loop instructions]

Pipeline

- All blocks are used: the loop instructions (LDR, ADD, STR, ADD, ADD, BRn LOOP, then the LDR and ADD of the next iteration) overlap in time.
- Program execution is up to 8 times faster.
- The execution time of an individual instruction is barely changed.

[Diagram: instructions versus cycles for the pipelined loop]

Pipeline Implementation

- Datapath blocks: PC, incrementer (+), instruction memory, IR, registers, address computation, data memory, ALU, registers.
- Pipeline registers separate the stages.
- All information (data and control) is stored in pipeline registers.

Pipeline Implementation

Depending on the stage, the pipeline registers carry the PC, the address, the instruction, the opcode, register numbers, data and the result.
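The claim above (program execution up to 8 times faster, individual instruction time barely changed) can be checked with a small cycle count. This is a minimal sketch assuming an ideal pipeline with no hazards; the function names are illustrative and do not come from the slides.

```python
def sequential_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction goes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then one instruction ends per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of sequential to pipelined execution time (n_instructions must be > 0)."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    # Assumption: 8 stages, matching the eight execution steps listed earlier.
    for n in (1, 6, 600, 60000):
        print(n, round(speedup(n, 8), 2))
```

With a single instruction the speed-up is 1 (its latency is unchanged); as the number of instructions grows, the speed-up approaches the number of stages without reaching it.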

Trends

- The Pentium IV pipeline adds 8 conversion stages (x86 instructions to µinstructions) to its execution stages.
- The initial motivation for the pipeline was:
  - efficiently exploiting architecture blocks;
  - increasing the instruction execution rate.
- The current motivation is increasing clock frequency:
  - stages are split into sub-stages;
  - the maximum duration of a sub-stage is reduced;
  - this enables a clock frequency increase.
- Sustained performance is difficult to reach.

Pipeline Hazards

- It is not always possible to issue/commit one instruction per cycle.
- When an instruction cannot proceed, it is a pipeline hazard.
- Three types of pipeline hazards: resource hazard, data hazard, control hazard.
- A hazard induces a pipeline stall: the control circuit injects one or several bubbles.
- Performance metric: IPC (Instructions Per Cycle); IPC < 1 (the optimum) if there are hazards.

Pipeline Hazard

[Diagram: a bubble injected into the pipeline delays the instructions behind it; the same instructions take more cycles with bubbles than without]
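The IPC metric above can be illustrated with a short calculation. This is a sketch assuming a scalar pipeline in steady state (pipeline fill time ignored), where every bubble costs exactly one cycle; the function name is hypothetical.

```python
def steady_state_ipc(n_instructions: int, n_bubbles: int) -> float:
    """Instructions per cycle when each instruction and each bubble takes one issue slot."""
    cycles = n_instructions + n_bubbles
    return n_instructions / cycles


if __name__ == "__main__":
    print(steady_state_ipc(100, 0))   # no hazard: the optimum for a scalar pipeline
    print(steady_state_ipc(100, 25))  # 25 bubbles injected by the control circuit
```

Without bubbles the result is the optimum of 1; any stall pushes the IPC below 1.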

Resource Hazards

- Two instructions access the same block in the same cycle.
- Solutions:
  - accept the pipeline stall (SPARC);
  - increase resources (memory ports).

[Diagram: resource conflict in the sequence LDR, ADD, STR, ADD, ADD]

Resource Hazards

- No memory access, no resource hazard.

[Diagram: pipelined execution of LDR, ADD, STR, ADD, ADD, BRn LOOP, LDR, ADD]

Data Hazards

- Data dependence between two instructions, for example between an LDR and the ADD that follows it.
- An instruction can fetch its operands only when they are available: the dependent instruction waits until the register contains the expected value.

[Diagram: LDR followed by a dependent ADD]

Forwarding

- Data is often available in the processor before it is written to the register.
- Forwarding = immediately passing the data to the block that expects it.
- The pipeline stalls only when the data is absolutely necessary.
- Example (LDR followed by ADD): the data is available, so the pipeline stall is avoided.

Implementing forwarding

Forwarding requires:
- additional data paths;
- adding muxes or increasing their size;
- modifying the control circuit (to detect and activate forwarding).

Forwarding

- Forwarding cannot avoid all pipeline stalls.

[Diagram: sequence LDR, ADD, STR in which a stall remains despite forwarding]
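The effect of forwarding on stalls can be sketched with a small formula. The stage numbering used in the example is an assumption (a classic five-stage pipeline: fetch = 1, decode = 2, execute = 3, memory = 4, write back = 5), chosen for brevity and not the eight-step pipeline of these slides; `stall_cycles` is a hypothetical helper.

```python
def stall_cycles(produce_stage: int, consume_stage: int, distance: int = 1) -> int:
    """Bubbles needed between a producer and a dependent consumer.

    Stages are numbered from 1. The producer's result becomes usable at the
    end of `produce_stage`; the consumer needs it at the start of
    `consume_stage`; `distance` is how many instructions after the producer
    the consumer enters the pipeline (1 = the very next instruction).
    """
    return max(0, produce_stage - (consume_stage + distance - 1))


if __name__ == "__main__":
    # Assumed five-stage numbering: decode = 2, execute = 3, memory = 4, write back = 5.
    print(stall_cycles(5, 2))     # no forwarding: wait for write back before decode
    print(stall_cycles(3, 3))     # ALU result forwarded to the next ALU operation
    print(stall_cycles(4, 3))     # loaded data forwarded to the next instruction
    print(stall_cycles(4, 3, 2))  # same, with one independent instruction in between
```

Under these assumptions forwarding removes the stall between two ALU operations, but a load followed immediately by a dependent instruction still costs one bubble, which matches the remark that forwarding cannot avoid all stalls.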

Multi-Cycle Instructions

- Example: floating-point instructions (FPADD and FPMUL each take several execution cycles).
- Resource hazard if FPADD and FPMUL are in the same block.
- New data dependences: registers are written out of order.
- New resource conflicts (register bank ports).

[Diagram: an FPMUL followed by two FPADD instructions; out-of-order write]

Pipeline and Exceptions

- The pipeline makes exception management harder.
- Example: an LDR and the ADD that follows it each have a page fault, in different pipeline stages.
- Precise exception on instruction I:
  - all instructions before I finish normally;
  - all instructions after I can be interrupted, then re-executed from the start after the exception is handled.
- It is necessary to implement precise exceptions.

Pipeline and Exceptions

- An exception vector is attached to each pipeline register.
- After an exception, no more state modification.
- Exceptions are dealt with in the same order as the instructions.

Multi-Cycle Instructions and Exceptions

- Example: an FPADD that follows an FPMUL raises a NaN exception.
- The processor state is modified before the exception is detected.
- Solution: forbid out-of-order state modification.

Control Hazards

- Branch: the destination (and possibly the condition value) must be known before fetching the next instruction.
- Example: LDR followed by BRn LOOP. The condition (bits n, p, z) is only known at the end of a late stage of the LDR, and the branch destination address is only available at the end of a later stage of the BRn.

[Diagram: BRn LOOP followed by the LDR at the branch destination]

Current Microprocessors: Branch Prediction

Branch Prediction

- Usually, the branch destination address is constant (except for RET and indirect branches).
- Predict the destination address:
  - destination addresses are stored in a table at each branch execution;
  - the table is indexed by the branch instruction PC;
  - when the PC is sent to memory, it is also sent to the table;
  - the table says whether it is a branch, and gives the destination address.

[Diagram: the PC indexes a table of known branch instructions, which outputs "is/is not a branch" and the predicted destination address]

Address Prediction

- If the PC corresponds to a branch, update the PC: PC = destination address.
- Example: JMPR LABEL.
  - Without address prediction, the next fetch waits for the computed destination address.
  - With address prediction, the next fetch uses the predicted destination address.

Prediction Errors

- Detect the error (the destination address is always computed).
- Squash the speculatively fetched instructions.
- Speculated instructions only modify the machine state after the check.
- Branch squashing costs cycles.

[Diagram: JMPR followed by wrongly fetched instructions (err); the prediction error is detected, the speculated instructions have not modified the machine state, and the correct instruction (OK) is fetched]
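The destination-address table described above can be sketched as a small lookup structure. This is an illustrative model only: the class name, the unbounded dictionary (a real table has a fixed number of entries) and the `instruction_size` parameter are assumptions, not details from the slides.

```python
class BranchTargetTable:
    """Table indexed by branch PC that stores the last computed destination address."""

    def __init__(self, instruction_size: int = 1) -> None:
        self.instruction_size = instruction_size
        self._targets = {}  # branch PC -> destination address

    def is_known_branch(self, pc: int) -> bool:
        return pc in self._targets

    def predict_next_pc(self, pc: int) -> int:
        """Predicted destination if `pc` is a known branch, else the next sequential PC."""
        return self._targets.get(pc, pc + self.instruction_size)

    def update(self, pc: int, computed_target: int) -> bool:
        """Record the computed destination; return True if the prediction was wrong."""
        mispredicted = self.predict_next_pc(pc) != computed_target
        self._targets[pc] = computed_target
        return mispredicted


if __name__ == "__main__":
    table = BranchTargetTable()
    print(table.update(0x10, 0x40))           # first execution: prediction error
    print(hex(table.predict_next_pc(0x10)))   # now predicted from the table
```

The destination is still computed on every execution, so `update` both trains the table and detects the prediction error that triggers squashing.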

Recovery After Misprediction

[Diagram: after JMPR Label, the wrongly fetched instructions (err) are squashed in the pipeline and the instructions at Label (OK) are fetched]

Condition Prediction

- Conditional branches: the condition value must be predicted.
- The condition value can change often from one branch execution to another, so prediction is difficult.
- Example: branch taken (BR LABEL).
  - With address prediction but no condition prediction, the next fetch waits until the condition is computed.
  - With condition and address prediction, the next fetch uses the predicted condition.

Prediction Strategies

Static prediction:
- Always taken: works well with loops.
- Compile-time analysis (EPIC); limitations of static analysis.
- Hit rate: up to about 90%.
- Example: in the loop LDR, ADD, STR, ADD, ADD, BRn LOOP, the BRn is taken at every iteration except the last one.

Prediction Strategies

Dynamic prediction:
- Commonplace in processors.
- Recent mechanisms: hit rate up to 99% for certain applications.
- Principle: learn individual branch behavior.
- A first mechanism, local history: one 4-state automaton per branch.
  - n bits of the PC index a table of 2^n entries; each entry holds the state of one automaton and gives the prediction.
  - States: Taken, Weakly Taken, Weakly Not Taken, Not Taken. A taken branch moves the state towards Taken, a not-taken branch towards Not Taken.
  - The table is updated with the actual condition.

Example

Loop LDR, ADD, STR, ADD, ADD, BRn LOOP:
- first iterations: BRn taken;
- last iteration (i=99): BRn not taken.
From the initial state, the successive predictions are: not taken, taken, taken, taken, taken, and so on.

Improving Dynamic Prediction

- A small hit rate increase can have a significant impact on overall processor performance.
- Example: three successive if statements. The first tests a against a constant and then assigns a (branch B1), the second does the same for b (branch B2), the third tests a == b (branch B3).
- To improve prediction accuracy, use the behavior of preceding branches: global history.
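Going back to the local-history mechanism and its loop example, the 4-state automaton can be simulated in a few lines. This sketch assumes the common saturating-counter variant of the automaton and a weakly-not-taken initial state (which reproduces the prediction sequence "not taken, taken, taken, ..." given above); class and function names are hypothetical.

```python
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = range(4)


class LocalPredictor:
    """Table of 4-state automata indexed by the low-order bits of the branch PC."""

    def __init__(self, index_bits: int = 4, initial_state: int = WEAK_NOT_TAKEN) -> None:
        self._mask = (1 << index_bits) - 1
        self._table = [initial_state] * (1 << index_bits)

    def predict(self, pc: int) -> bool:
        """True means 'predicted taken'."""
        return self._table[pc & self._mask] >= WEAK_TAKEN

    def update(self, pc: int, taken: bool) -> None:
        """Move one state towards Taken or Not Taken, saturating at both ends."""
        i = pc & self._mask
        if taken:
            self._table[i] = min(STRONG_TAKEN, self._table[i] + 1)
        else:
            self._table[i] = max(STRONG_NOT_TAKEN, self._table[i] - 1)


def count_mispredictions(predictor: LocalPredictor, pc: int, outcomes) -> int:
    """Predict then update for each outcome; return the number of wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    loop_branch = [True] * 99 + [False]  # taken for i = 0..98, not taken for i = 99
    predictor = LocalPredictor()
    print(count_mispredictions(predictor, 0xA, loop_branch))  # first run of the loop
    print(count_mispredictions(predictor, 0xA, loop_branch))  # loop executed again
```

On the first run only the first and last iterations are mispredicted; when the loop is executed again, the automaton is still on the taken side, so only the loop exit is mispredicted.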

Global History

- The history of the last p branches is kept in p bits (one bit per branch: taken or not taken).
- n bits of the PC and the p history bits form an index of p+n bits into a table of 2^(n+p) entries, which gives the prediction.
- The table is updated with the actual outcome.

[Diagram: entries selected by the PC of branches B1, B2, B3 combined with the history of the preceding branches]

Impact of Branch Prediction on Processor Performance

- On average, 1 instruction out of 5 is a branch.
- In current pipelines, a misprediction costs many cycles: 8 cycles between fetch and branch resolution, plus the cycles needed to restart the pipeline.
- For a processor executing 1 instruction per cycle, the cycle count for N instructions has the form
  cycles = N × (0.8 × c + 0.2 × ((1 − m) × c + m × p))
  where c is the cost in cycles of a correctly handled instruction, m the fraction of wrong predictions and p the cost in cycles of a mispredicted branch. The slide evaluates it for three decreasing misprediction rates, the last two being 20% and 10%.
- The same comparison is made for a processor executing several instructions per cycle.

Current Microprocessors: Caches

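Memory Access Latency

- Moore's law: processor (µproc, CPU) performance grows much faster each year than DRAM performance (DRAM: 9% per year), so the DRAM/processor gap widens every year.
- Processor cycle time << memory access time.

[Figure: performance over time for processors and DRAM]

Cache Memory

- Fast (SRAM) but small (cost) memory: access in a few cycles (pipelined).
- The processor sends memory requests to the cache.
  - Data in cache: hit.
  - Data not in cache: miss.
- Performance: hit rate, average memory access time.
- Organization: Processor, Cache, Memory.

1-bit SRAM Cell

- SRAM = Static Random Access Memory.
- The cell stores one bit between two bit lines (bit and its complement), with a selection line and an amplifier.
- Writing: the bit line is set to the value and the complemented bit line to its complement, then selection is activated.
- Reading: with selection inactive, both bit lines are set to V_DD; selection is then activated. If the stored value is 1 (V_DD), the voltage decreases on the complemented bit line; if it is 0 (V_SS), the voltage decreases on the bit line.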

Caches and Locality Properties

- Most programs have strong locality properties.
- Temporal locality: an address A referenced at time t has a strong probability of being referenced again within a short time interval.
- Spatial locality: when address A is referenced at time t, there is a strong probability that a neighboring address is referenced within a short time interval.
- Example:

    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        y[i] = y[i] + a[i][j] * x[j];
      }
    }

  - y[i]: spatial and temporal locality
  - a[i][j]: spatial locality
  - x[j]: spatial and temporal locality

Data and Instructions Locality

- Locality properties hold for instructions as well.
- Temporal locality: just keep the address in the cache.
- Spatial locality: load addresses by blocks.
- In the loop LDR, ADD, STR, ADD, ADD, BRn LOOP: the loop reuses instructions (temporal locality) and fetches consecutive instructions (spatial locality).

Cache Architecture

[Diagram: the processor request (memory address) goes through a latch to the cache. The cache stores data blocks (lines) together with their addresses in memory (tags). A comparator (=) checks the tag: if it matches, the data is sent to the processor; if not, the request goes to memory.]

Address Mapping

- Hardware mapping: the programmer does not manage it; it is transparent for the programmer.
- Mapping is a simple function of the address. The address is split into three fields, from most to least significant: tag, # line in cache, # byte in line.
- For a cache of C_S bytes with lines of L_S bytes:
  - # byte: the log2(L_S) least significant bits of the address;
  - # line: the next log2(C_S / L_S) bits of the address.

Cache Line

- Data is fetched by blocks.
- Address = byte address.
- Block: the most significant bits of the byte addresses are identical; only the least significant bits vary.
- Example with 8-byte blocks: consecutive addresses that share the same block address fall in the same cache line; addresses with different block addresses fall in distinct lines.

Reading Data

- Example: a cache of C_S bytes with L_S = 8 bytes; the requested address is split into # line and # byte in line.
- A request can have a variable size: byte, half-word, word.
- Request = address + number of bytes; address = address of the first byte.
- Example: for a word request, the bytes of the word are sent to the processor.
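The address mapping described above (tag, # line in cache, # byte in line) can be written as a few bit operations. This is a sketch for a direct-mapped cache whose sizes are powers of two; the function names and the 1024-byte cache used in the example are assumptions for illustration.

```python
from typing import Tuple


def _log2_exact(value: int) -> int:
    """Return log2(value), requiring value to be a power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError("sizes must be powers of two")
    return value.bit_length() - 1


def split_address(address: int, cache_size: int, line_size: int) -> Tuple[int, int, int]:
    """Split a byte address into (tag, line number in cache, byte number in line)."""
    byte_bits = _log2_exact(line_size)                # log2(L_S) low-order bits
    line_bits = _log2_exact(cache_size // line_size)  # next log2(C_S / L_S) bits
    byte_in_line = address & (line_size - 1)
    line_in_cache = (address >> byte_bits) & ((1 << line_bits) - 1)
    tag = address >> (byte_bits + line_bits)
    return tag, line_in_cache, byte_in_line


if __name__ == "__main__":
    # Assumed sizes: C_S = 1024 bytes, L_S = 8 bytes.
    print(split_address(0x1234, 1024, 8))
    print(split_address(0x1237, 1024, 8))  # same 8-byte block, same line
    print(split_address(0x1238, 1024, 8))  # next block, next line
```

Addresses that differ only in their byte field land in the same line, which is the "same cache line / distinct lines" distinction made for consecutive addresses.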

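Associativity

- Physical memory size >> cache size.
- The mapping function can breed data conflicts.
- Conflicts are reduced by increasing cache associativity.
- Example: a loop nest that repeatedly copies y[j] into x[j]. If x[j] and y[j] map to the same cache line, each access evicts the data of the other array (cache conflict).

[Diagram: horizontal view of the cache lines, with x and y mapped to the same lines]

Associative Cache Structure

[Diagram: the processor request (memory address) goes through a latch to several banks, each with its own tag comparator (=); a MUX selects the data of the matching bank; misses go to memory]

Associative Cache Operations

- Degree of associativity n: a piece of data can be stored in n different entries.
- Set = the set of possible blocks for an address.
- Upon a cache miss, choose the block/bank to replace within the set:
  - LRU: Least Recently Used;
  - Random;
  - Pseudo-LRU: the most recently used line is not replaced, random choice among the others;
  - FIFO: First In First Out.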
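The conflict between x and y, and the benefit of associativity with LRU replacement, can be reproduced with a small simulator. This is a sketch under stated assumptions: a 1024-byte cache with 8-byte lines, 4-byte array elements, and x and y placed exactly one cache size apart so that they collide; all names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Cache model with LRU replacement; ways=1 gives a direct-mapped cache."""

    def __init__(self, cache_size: int, line_size: int, ways: int) -> None:
        self.line_size = line_size
        self.ways = ways
        self.n_sets = cache_size // (line_size * ways)
        # One ordered dict per set: keys are block numbers, oldest use first.
        self.sets = [OrderedDict() for _ in range(self.n_sets)]

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss (the block is then loaded)."""
        block = address // self.line_size
        lines = self.sets[block % self.n_sets]
        if block in lines:
            lines.move_to_end(block)   # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        return False


def copy_loop_trace(x_base, y_base, n_outer, n_inner, element_size=4):
    """Addresses touched by: for i: for j: x[j] = y[j]."""
    trace = []
    for _ in range(n_outer):
        for j in range(n_inner):
            trace.append(y_base + j * element_size)  # read y[j]
            trace.append(x_base + j * element_size)  # write x[j]
    return trace


def count_misses(cache: SetAssociativeCache, addresses) -> int:
    return sum(0 if cache.access(a) else 1 for a in addresses)


if __name__ == "__main__":
    trace = copy_loop_trace(x_base=0, y_base=1024, n_outer=4, n_inner=16)
    print(count_misses(SetAssociativeCache(1024, 8, ways=1), trace))
    print(count_misses(SetAssociativeCache(1024, 8, ways=2), trace))
```

With these addresses the direct-mapped cache misses on every access because x[j] and y[j] keep evicting each other, while the 2-way cache only misses when each line is loaded for the first time.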

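Writing a Data (Write-Through)

[Diagram: the processor request (memory address) and the data to write go through a latch to the cache banks and their tag comparators (=); the data to write is also sent to memory]

Writing a Data (Write-Back)

[Diagram: the processor request (memory address) and the data to write go through a latch to the cache banks and their tag comparators (=); memory is reached through a second latch upon a cache miss]

Virtual Memory/Physical Memory: TLB

- The processor uses virtual addresses.
- Data have an address in physical memory.
- Virtual/physical address translation is therefore required.
- TLB (Translation Lookaside Buffer): a cache of address translations.
- One TLB entry = one page.
- The TLB is often fully associative (n = number of lines).

[Diagram: the processor request (virtual address) is compared (=) against all TLB entries; the matching entry provides the physical address]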
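The TLB lookup described above can be modelled as a page-number dictionary. This is a simplified sketch: the 4096-byte page size, the unlimited capacity and the method names are assumptions, and a miss simply returns `None` instead of walking a page table.

```python
class TLB:
    """Fully associative translation cache: any page can sit in any entry."""

    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self.entries = {}  # virtual page number -> physical page number

    def map_page(self, virtual_page: int, physical_page: int) -> None:
        """Install one translation (one entry = one page)."""
        self.entries[virtual_page] = physical_page

    def translate(self, virtual_address: int):
        """Return the physical address, or None on a TLB miss."""
        virtual_page, offset = divmod(virtual_address, self.page_size)
        physical_page = self.entries.get(virtual_page)
        if physical_page is None:
            return None
        # The offset inside the page is unchanged by the translation.
        return physical_page * self.page_size + offset


if __name__ == "__main__":
    tlb = TLB()
    tlb.map_page(virtual_page=2, physical_page=7)
    print(tlb.translate(2 * 4096 + 12))  # hit: same offset, different page
    print(tlb.translate(5 * 4096))       # miss: no entry for this page
```

Only the page number is translated; the offset within the page is carried over unchanged.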

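Summary

- Instructions: cache & TLB. Data: cache & TLB.
- Simultaneous: virtual/physical address translation and cache access using the virtual address.

[Diagram: processor, TLB and cache]

Several Cache Levels

Example: Alpha processor
- Instruction cache (first level, sized in KB)
- Data cache (first level, sized in KB)
- Shared on-chip cache (sized in KB)
- Shared off-chip cache (sized in MB)
- Memory

Impact of Cache Misses on Processor Performance

- On average, a fixed fraction of the instructions are loads/stores.
- Instruction cache misses must also be taken into account.
- Cache hierarchies reduce the average memory latency.
- Example for a processor whose clock is given in GHz and a memory access time given in ns: the cycle count for N instructions has the form
  cycles = N × ((1 − f) × c + f × ((1 − m) × c + m × L))
  where f is the fraction of load/store instructions, c the cost in cycles of an instruction that hits in the cache, m the cache miss rate and L the memory access latency in cycles. The slide evaluates it for several miss rates, one of them being 10%.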
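The cycle-count formula for cache misses can be evaluated directly. The numbers in this sketch (2 GHz clock, 50 ns memory access, 25% loads/stores) are illustrative assumptions, since the values on the slide are not recoverable; the function names are hypothetical.

```python
def latency_in_cycles(clock_ghz: float, latency_ns: float) -> int:
    """Memory latency in cycles: a clock of f GHz runs f cycles per nanosecond."""
    return round(clock_ghz * latency_ns)


def total_cycles(n_instructions: int, memory_fraction: float, miss_rate: float,
                 miss_cycles: float, hit_cycles: float = 1.0) -> float:
    """cycles = N * ((1 - f) * c + f * ((1 - m) * c + m * L))."""
    per_memory_access = (1.0 - miss_rate) * hit_cycles + miss_rate * miss_cycles
    per_instruction = (1.0 - memory_fraction) * hit_cycles + memory_fraction * per_memory_access
    return n_instructions * per_instruction


if __name__ == "__main__":
    miss_cycles = latency_in_cycles(clock_ghz=2.0, latency_ns=50.0)  # assumed values
    for miss_rate in (0.0, 0.05, 0.10):
        print(miss_rate, total_cycles(1000, 0.25, miss_rate, miss_cycles))
```

Even a modest miss rate multiplies the execution time several times over when the miss latency is around a hundred cycles, which is why cache hierarchies matter.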

Current Microprocessors: Superscalar Execution

Superscalar Processor

- Pipeline: at most one instruction per cycle.
- Superscalar of degree n: up to n instructions complete per cycle (in practice, n is bounded).
- Requirements for a superscalar processor:
  - an uninterrupted flow of instructions;
  - determining which instructions can execute in parallel;
  - propagating data among instructions (the result of instruction i is an operand of instruction j);
  - several functional units.
- Constraint: precise interrupts.
- Superscalar implementations share a lot of features.

Superscalar Processor Architecture

[Figure: architecture and pipeline of a superscalar processor. Source: Proceedings of the IEEE]

Instruction-Level Parallelism (ILP) (fine-grain parallelism)

Instruction Fetch

- Instruction flow disruptions: branches, instruction cache misses.
- Avoiding disruptions:
  - the fetched instructions may span several cache lines (multi-port cache);
  - a buffer stores pre-fetched instructions.

[Diagram: cache lines feeding the instruction buffer; the instructions that follow a branch predicted taken are discarded (X)]

Dependences

- Find instruction dependences.
- Avoid false dependences due to register aliasing.
- RAW (Read After Write): true dependence.
- WAW (Write After Write): out-of-order write (false dependence).
- WAR (Write After Read): too early write (false dependence).
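The three dependence types listed above can be detected by comparing the registers two instructions read and write. This is a minimal sketch; the `Instruction` tuple and the register names in the example are hypothetical, and memory dependences are ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Instruction(NamedTuple):
    name: str
    dest: Optional[str]        # register written, or None (for example a store)
    sources: Tuple[str, ...]   # registers read


def dependences(first: Instruction, second: Instruction) -> set:
    """Dependences of `second` on `first`, where `first` comes earlier in program order."""
    found = set()
    if first.dest is not None and first.dest in second.sources:
        found.add("RAW")  # second reads what first writes: true dependence
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # both write the same register: false dependence
    if second.dest is not None and second.dest in first.sources:
        found.add("WAR")  # second overwrites what first reads: false dependence
    return found


if __name__ == "__main__":
    add = Instruction("add", "r1", ("r2", "r3"))
    sub = Instruction("sub", "r4", ("r1", "r5"))
    move = Instruction("move", "r2", ("r6",))
    print(dependences(add, sub))   # sub reads r1 written by add
    print(dependences(add, move))  # move overwrites r2 read by add
```

Only RAW constrains the order of execution for correctness of the data flow; WAW and WAR come from reusing register names, which is what renaming removes.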

Register Aliasing: Renaming

- Binary compatibility limits the number of registers.
- Technology allows more registers.
- Physical registers + ReOrder Buffer (ROB).
- Each instruction is mapped to a ROB entry.
- A table maps each logical register to either a physical register or a ROB entry.
- Renaming eliminates register aliasing and finds the true dependences.

[Table: for the sequence move, lw, add, each logical register is mapped to its physical storage, either a physical register or the ROB entry of the instruction (move, lw or add) that produces its value]

Dispatch

- After dependences are found, instructions are dispatched.
- Reservation stations: a buffer for each functional unit.
- An instruction is executed when all its operands are available and a functional unit is available.
- Tomasulo algorithm.

[Diagram: reservation stations for the ALUs. Each entry holds the operation, then for each of the two operands its source (ROB entry, register or immediate), its data and a valid bit, and finally the ROB entry that receives the result]

Issue

- Ready instructions are sent to the functional units (FU).
- Results are propagated through buses (Common Data Bus) to physical storage (ROB entries) and to reservation stations.
- Pending instructions that become ready are immediately issued.
- The internal model is closer to dataflow than to von Neumann.

[Diagram: reservation stations, ROB and functional units connected by the Common Data Bus]
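Returning to renaming, the mapping table can be sketched as follows. This is a simplified model and not the exact scheme of the slides: every write receives a fresh physical register number instead of a ROB entry, physical registers are never freed, and the function name and register numbers are hypothetical.

```python
def rename(instructions, n_logical: int):
    """Rename logical registers to physical ones.

    `instructions` is a list of (dest, sources) pairs of logical register
    numbers in program order. Returns a list of (physical_dest,
    physical_sources) pairs in the same order.
    """
    mapping = {r: r for r in range(n_logical)}  # logical -> physical, identity at start
    next_free = n_logical                       # first unused physical register
    renamed = []
    for dest, sources in instructions:
        # Sources are read through the current mapping: this keeps the RAW dependences.
        physical_sources = tuple(mapping[s] for s in sources)
        # Every write gets a new physical register: this removes WAW and WAR.
        mapping[dest] = next_free
        renamed.append((next_free, physical_sources))
        next_free += 1
    return renamed


if __name__ == "__main__":
    program = [
        (1, (2, 3)),  # r1 = r2 op r3
        (4, (1, 5)),  # r4 = r1 op r5   (RAW on r1)
        (1, (6, 7)),  # r1 = r6 op r7   (WAW with the first, WAR with the second)
        (8, (1,)),    # r8 = r1         (RAW on the second write of r1)
    ]
    for line in rename(program, n_logical=9):
        print(line)
```

After renaming, the two writes to logical r1 target different physical registers, so the third instruction no longer has to wait for the first two, while each reader still points at the producer of its value.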

Commit

- An instruction can only commit when it reaches the end of the ROB: commit order = program order.
- The logical architecture state is only modified at commit, so precise interrupts are possible in out-of-order (OoO) processors.
- Logical state: registers and memory.
- When an instruction leaves the ROB, its result is written into a register or its data is sent to memory.
- Superscalar processor of degree n: n instructions can commit simultaneously.
- If the instruction at the top of the ROB has not completed, the processor stalls.

Pentium IV

[Figure: Pentium IV. Source: Tom's Hardware]

Trace Cache

[Diagram: instruction cache versus trace cache. The instruction cache is accessed with an address and holds the blocks A, B, C, D. The trace cache is accessed with an address and n predictions (taken/not taken) from the predictor, and holds traces of blocks in execution order, such as A B D and A B C.]
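The in-order commit rule can be sketched with a queue. This is an illustrative model in which each ROB entry is a small dictionary with hypothetical `name` and `done` fields; writing the result to a register or to memory is left out.

```python
from collections import deque


def commit_cycle(rob: deque, width: int) -> list:
    """Commit up to `width` completed instructions from the head of the ROB.

    Commit stops at the first instruction that has not completed, even if
    younger instructions behind it are done: commit order = program order.
    """
    committed = []
    while rob and len(committed) < width and rob[0]["done"]:
        committed.append(rob.popleft()["name"])
    return committed


if __name__ == "__main__":
    rob = deque([
        {"name": "move", "done": True},
        {"name": "lw", "done": False},   # still waiting, for example on memory
        {"name": "add", "done": True},   # finished out of order
    ])
    print(commit_cycle(rob, width=2))  # only the move can commit
    print(commit_cycle(rob, width=2))  # the lw blocks the head: nothing commits
    rob[0]["done"] = True
    print(commit_cycle(rob, width=2))  # lw and add commit together
```

The add finishes before the lw but cannot modify the logical state until the lw has committed, which is what keeps interrupts precise.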

Pentium IV

[Figure: Pentium IV. Source: Tom's Hardware]

Ideally

[Chart: instruction issues per cycle for the programs gcc, espresso, li, fpppp, doducd and tomcatv]

Factoring in main constraints

[Chart: number of instructions per cycle for the programs gcc, espresso, li, fpppp, doducd and tomcatv]

In reality

[Chart: instructions per cycle, with the maximum and the average number of instructions per cycle]


More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures Advanced Computer Architecture Chapter 4: More sophisticated CPU architectures Lecturer: Paul H J Kelly Autumn 2001 Department of Computing Imperial College Room 423 email: phjk@doc.ic.ac.uk Course web

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Chapter 06: Instruction Pipelining and Parallel Processing

Chapter 06: Instruction Pipelining and Parallel Processing Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

ILP: Instruction Level Parallelism

ILP: Instruction Level Parallelism ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal

More information

Case Study IBM PowerPC 620

Case Study IBM PowerPC 620 Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units 6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

Photo David Wright STEVEN R. BAGLEY PIPELINES AND ILP

Photo David Wright   STEVEN R. BAGLEY PIPELINES AND ILP Photo David Wright https://www.flickr.com/photos/dhwright/3312563248 STEVEN R. BAGLEY PIPELINES AND ILP INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Memory Hierarchies 2009 DAT105

Memory Hierarchies 2009 DAT105 Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III) COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the

More information

Lecture 19: Instruction Level Parallelism

Lecture 19: Instruction Level Parallelism Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information