Current Microprocessors: Pipeline, Efficient Utilization of Hardware Blocks
- Lynne Douglas
Efficient Utilization of Hardware Blocks

Execution steps for an instruction:
1. Send instruction address
2. Instruction fetch
3. Store instruction
4. Decode instruction, fetch operands
5. Address computation
6. Memory access (ME)
7. Execution
8. Write back

ADD R, R, # : only one block is used every cycle.

for (i = ; i < ; i++) {
  a[i] = a[i] +
}

LOOP LDR R, R, #
     ADD R, R, #
     STR R, R, #
8    ADD R, R, #
9    ADD R, R, R
A    BRn LOOP

With the pipeline, all blocks are used and one instruction terminates each cycle.
[Pipeline timing diagram: stage occupancy of each instruction, cycle by cycle]
Pipeline

All blocks used: the successive instructions of the loop (LDR, ADD, STR, ADD, ADD, BRn, then the LDR and ADD of the next iteration) occupy different stages in the same cycle.
[Pipeline timing diagram]

Program execution is up to 8 times faster; the execution time of a single instruction is barely changed.

Pipeline Implementation

PC, IR, memory, register file, address computation and ALU, with pipeline registers between the stages. All information (data and control) is stored in the pipeline registers: PC, instruction, opcode, register contents, data and result propagate from stage to stage.
[Datapath figure: pipeline registers between the stages]
Trends

The Pentium IV pipeline has stages plus 8 conversion stages (x86 instructions are converted into µinstructions). The initial motivation for the pipeline was to exploit the architecture blocks efficiently and so increase the instruction execution rate. The current motivation is increasing the clock frequency: stages are split into sub-stages; the maximum duration of a sub-stage is reduced, which enables clock frequency increases; sustained performance is difficult to reach.

Pipeline Hazards

It is not always possible to issue/commit one instruction per cycle. When an instruction cannot proceed, there is a pipeline hazard. Three types of pipeline hazards:
- Resource hazard
- Data hazard
- Control hazard
A hazard induces a pipeline stall: the control circuit injects one or several bubbles. Performance metric: IPC (Instructions Per Cycle), below the optimum of 1 when hazards occur.
[Pipeline diagram: bubbles injected between instructions increase the total cycle count]
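The IPC metric can be illustrated with a minimal Python sketch; the instruction and bubble counts below are invented, not taken from the slides:

```python
# Effective IPC of a scalar pipeline with stalls (hypothetical numbers).
# Each hazard injects bubble cycles, so total cycles exceed instruction count.

def effective_ipc(n_instructions, bubbles):
    """IPC = instructions / total cycles; the ideal scalar pipeline has IPC = 1."""
    cycles = n_instructions + bubbles  # one cycle per instruction, plus bubbles
    return n_instructions / cycles

print(effective_ipc(100, 0))   # no hazards: 1.0
print(effective_ipc(100, 25))  # 25 bubbles: 0.8
```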
Resource Hazards

Two instructions access the same block in the same cycle. Solutions:
- Accept the pipeline stall (SPARC)
- Increase resources (e.g., memory ports)
[Pipeline diagram: the LDR/ADD/STR/ADD/ADD sequence with a resource conflict]

With no memory access, there is no resource hazard.

Data Hazards

Data dependence between two instructions:
LDR R, R, #
ADD R, R, #
An instruction can fetch its operands only when they are available: R must contain the value expected from the LDR.
[Pipeline diagram: the dependent ADD waits for the LDR]
Forwarding

Data is often available in the processor before it is written to the register file: immediately pass the data to the block expecting it (forwarding). The pipeline stalls only when the data is absolutely necessary.
[Pipeline diagram: data available, pipeline stall avoided]

Implementing forwarding requires:
- Additional data paths
- Adding muxes, or increasing their size
- Modifying the control circuit (to detect and activate forwarding)

Forwarding cannot avoid all pipeline stalls:
LDR R, R, #
ADD R, R, #
STR R, R, #
[Pipeline diagram: an instruction immediately dependent on a load still stalls despite forwarding]
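The residual load-use stall can be modeled with a small sketch. This is a simplification assuming the classic pipeline where an ALU result can be forwarded from the end of its execute stage, while a load's result only exists after its memory stage:

```python
# Stall cycles a consumer needs despite forwarding (simplified model).
# "distance" = number of cycles between producer issue and consumer issue.

def stall_cycles(producer, distance):
    """ALU results forward one cycle after issue; load results two cycles after."""
    ready = 2 if producer == 'load' else 1
    return max(0, ready - distance)

assert stall_cycles('alu', 1) == 0    # ADD then dependent ADD: forwarded, no stall
assert stall_cycles('load', 1) == 1   # LDR then dependent ADD next cycle: 1 bubble
assert stall_cycles('load', 2) == 0   # one instruction in between: forwarded in time
```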
Multi-Cycle Instructions

Example: floating-point instructions. FPADD and FPMUL each take several execution cycles. There is a resource hazard if FPADD and FPMUL need the same block in the same cycle.
[Pipeline diagram: FPMUL F, F, F / FPADD F, F, F / FPADD F, F, F]

New data dependences: registers can be written out of order. New resource conflicts (register bank ports).
[Pipeline diagram: out-of-order register writes]

Pipeline and Exceptions

The pipeline makes exception management harder. Example: an LDR has a page fault in one stage while a later ADD has a page fault in an earlier stage.

Precise exception on instruction I:
- All instructions before I finish normally
- All instructions after I can be interrupted, then re-executed from the start after the exception is handled
It is necessary to implement precise exceptions.

An exception vector is attached to each pipeline register. After an exception, no further state modification occurs; exceptions are dealt with in the same order as instructions.
[Datapath figure: exception vectors alongside the pipeline registers]
Multi-Cycle Instructions and Exceptions

Example: FPADD raises a NaN exception. The processor state would be modified by instructions issued after FPADD before the exception is detected: out-of-order state modification must be forbidden.
[Pipeline diagram: FPMUL F, F, F / FPADD F, F, F]

Control Hazards

Branch: the destination (and possibly the condition value) must be known before fetching the next instruction.
LDR R, R, #
BRn LOOP
The condition (bits n, p, z) is known at the end of one stage; the branch destination address is available at the end of another.
[Pipeline diagram: fetching the instruction after BRn LOOP is delayed]

Current Microprocessors: Branch Prediction
Branch Prediction

Usually, the branch destination address is constant (except for RET and indirect branches). Predict the destination address:
- Store destination addresses in a table, filled at each branch execution
- The table is indexed by the branch instruction PC
- When the PC is sent to memory, it is also sent to the table
- The table says whether this is a branch, and gives the destination address
[Figure: table of known branch instructions with destination address, "is/is not a branch" flag, and predicted address]

Address Prediction

If the PC corresponds to a branch, update the PC: PC = destination address.
Example: conditional branch JMPR LABEL. Without address prediction, fetch waits until the destination address is computed; with address prediction, fetch continues from the predicted destination address.

Prediction Error

- Detect the error (the destination address is always computed)
- Squash the speculatively fetched instructions
- Speculated instructions only modify the machine state after the check
Branch squashing costs cycles; the speculated instructions did not modify the machine state.
[Pipeline diagram: JMPR followed by wrongly fetched instructions, squashed when the prediction error is detected]
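The destination-address table can be sketched as a small branch target buffer; the addresses and the 4-byte sequential step below are invented for illustration:

```python
# Minimal branch target buffer (BTB) sketch: a table indexed by branch PC
# that records the last destination address seen at branch resolution.

btb = {}

def fetch_predict(pc):
    """Predicted next PC: the BTB target if pc is a known branch, else pc + 4."""
    return btb.get(pc, pc + 4)

def resolve(pc, target):
    """At branch resolution, record/update the destination address."""
    btb[pc] = target

resolve(0x1000, 0x2000)                  # a branch at 0x1000 jumped to 0x2000
assert fetch_predict(0x1000) == 0x2000   # predicted from the table
assert fetch_predict(0x1004) == 0x1008   # not a known branch: sequential fetch
```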
Recovery After Misprediction

[Figure: pipeline state cycle by cycle while the wrongly fetched instructions after JMPR Label are squashed and fetch restarts at the correct target]

Condition Prediction

For conditional branches, the condition value must also be predicted. The condition value can change often from one branch execution to another, which makes prediction difficult.
Example: branch BR LABEL taken. With address prediction only, fetch waits until the condition is computed; with condition and address prediction, fetch continues immediately.

Prediction Strategies

Static prediction:
- Always taken: works well with loops
- Compile-time analysis (EPIC); limited by what static analysis can see
- Hit rate: from % to 9%

LOOP LDR R, R, #
     ADD R, R, #
     STR R, R, #
8    ADD R, R, #
9    ADD R, R, R
A    BRn LOOP
Dynamic prediction:
- Commonplace in processors
- Recent mechanisms: hit rates up to 99% for certain applications
- Principle: learn the behavior of each individual branch
- A first mechanism, local history: one 4-state automaton per branch

The table holds one n-bit entry per tracked branch; it is indexed by the PC and updated with the branch condition. The four states are Taken, Weakly Taken, Weakly Not Taken, Not Taken: a taken branch moves the automaton toward Taken, a not-taken branch toward Not Taken, and the prediction is read from the current state.
[Figure: 4-state automaton and prediction table indexed by the PC]

Example: for the loop branch BRn LOOP, the branch is taken on every iteration and not taken on the last one (i = 99). After the first iterations train the automaton, the prediction is "taken" until the loop exit.
[State trace: prediction per iteration for the BRn LOOP branch]

Improving Dynamic Prediction

A small hit-rate increase can have a significant impact on overall processor performance.

if (a == ) a = ;   /* branch B */
if (b == ) b = ;   /* branch B */
if (a == b) ;      /* branch B */

To improve prediction accuracy, use the behavior of the preceding branches: global history.
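The 4-state automaton is a 2-bit saturating counter. A minimal Python simulation of one such counter on the loop-branch pattern from the slides (100 iterations assumed for concreteness):

```python
# Two-bit saturating counter for one branch: states 0-1 predict not taken,
# states 2-3 predict taken; the counter saturates at 0 and 3.

def simulate(outcomes, state=0):
    """Count mispredictions of a single 2-bit counter over a branch history."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions

# Loop branch: taken 99 times, then not taken on loop exit.
outcomes = [True] * 99 + [False]
print(simulate(outcomes))  # 3: two warm-up mispredictions, plus the loop exit
```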
Global History

Keep the history of the last p branches in a shift register. The prediction table is indexed by p + n bits: the p history bits concatenated with n bits of the PC (2^(p+n) entries). The table is updated with each branch condition, and the outcome (taken / not taken) is shifted into the history.
[Figure: history register and PC forming the table index, illustrated on the three branches B above]

Impact of Branch Prediction on Processor Performance

On average, about one instruction out of every few is a branch. In current pipelines the misprediction cost is several cycles: the cycles between fetch and branch resolution, plus the cycles needed to restart the pipeline.
[Worked examples: total cycle counts for several misprediction rates, at one and at several instructions per cycle]

Current Microprocessors: Caches
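The misprediction-cost computation follows a simple model. The numbers below (branch fraction, misprediction rate, penalty) are illustrative assumptions, since the slide's values were lost in transcription:

```python
# Rough cost model: each mispredicted branch adds a fixed cycle penalty
# on top of the base execution time.

def total_cycles(n_inst, branch_frac, mispredict_rate, penalty, ipc=1.0):
    """Base cycles plus the cycles lost to branch mispredictions."""
    base = n_inst / ipc
    return base + n_inst * branch_frac * mispredict_rate * penalty

# 1000 instructions, 20% branches, 10% of them mispredicted, 10-cycle penalty:
print(total_cycles(1000, 0.20, 0.10, 10))  # 1200.0 cycles instead of 1000
```

Note that at higher IPC the base term shrinks while the penalty term does not, which is why misprediction cost matters more on wider processors.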
Performance: Memory Access Latency

[Figure: processor performance (Moore's law) grows much faster than DRAM performance; the processor/DRAM gap widens every year]
Processor cycle time << memory access time.

Cache Memory

A fast (SRAM) but small (for cost reasons) memory, accessed in a few cycles (pipelined). The processor sends its memory requests to the cache. Data present in the cache: hit. Data absent: miss. Performance metrics: hit rate and average memory access time.

1-bit SRAM Cell

SRAM = Static Random Access Memory.
- Writing: drive the bit and bit-bar lines to the value and its complement, with the selection line asserted
- Reading: precharge bit and bit-bar to V_DD, assert the selection line; the voltage decreases on the line opposite to the stored value, and an amplifier senses the difference
[Figure: SRAM cell with bit/bit-bar lines, selection line W, and sense amplifier]
Caches and Locality Properties

Most programs have strong locality properties:
- Temporal locality: an address A referenced at time t has a strong probability of being referenced again within a short time interval
- Spatial locality: when an address A is referenced at time t, there is a strong probability that a neighboring address is referenced within a short time interval

for (i = ; i < n; i++) {
  for (j = ; j < n; j++) {
    y[i] = y[i] + a[i][j] * x[j]
  }
}

y[i]: spatial and temporal locality; a[i][j]: spatial locality; x[j]: spatial and temporal locality.

Data and Instruction Locality

Locality properties hold for instructions as well:
- Temporal locality: just keep the address in the cache
- Spatial locality: load addresses by blocks
The loop reuses its instructions (temporal locality); consecutive instructions give spatial locality.

Cache Architecture

[Figure: the processor request (memory address) is compared with the tags of the addresses stored in the cache; on a match, the cache block (or line) is returned; otherwise the request goes to memory]
Hardware Mapping

The programmer does not manage the cache; it is transparent. Mapping is a simple function of the address: a line number in the cache and a byte number in the line.

Address Mapping

For a cache of C_S bytes with lines of L_S bytes:
- byte number in the line: the log2(L_S) least significant bits of the address
- line number in the cache: the next log2(C_S / L_S) bits
- tag: the remaining most significant bits

Cache Line

Data is fetched by blocks. Within a block, the most significant bits of the byte addresses are identical; only the least significant bits vary. Example with 8-byte blocks: consecutive addresses in the same block share a cache line; addresses in different blocks map to distinct lines.

Reading Data

Example: C_S bytes, L_S = 8 bytes. The requested address is split into a line number and a byte number in the line. A request can have a variable size: byte, half-word, word. A request = an address + a number of bytes, where the address is that of the first byte; the requested bytes are sent to the processor.
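The address split can be sketched directly in Python. The sizes below (C_S = 256 bytes, L_S = 8 bytes, so 32 lines) are example values, not the slide's:

```python
# Splitting an address into (tag, line index, byte offset) for a
# direct-mapped cache: 256-byte cache, 8-byte lines -> 32 lines.

C_S, L_S = 256, 8
offset_bits = (L_S).bit_length() - 1           # log2(8)  = 3
index_bits = (C_S // L_S).bit_length() - 1     # log2(32) = 5

def split(addr):
    offset = addr & (L_S - 1)                      # 3 least significant bits
    index = (addr >> offset_bits) & (C_S // L_S - 1)  # next 5 bits
    tag = addr >> (offset_bits + index_bits)       # remaining high bits
    return tag, index, offset

print(split(0x1234))  # (18, 6, 4)
```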
Associativity

Physical memory size >> cache size, so the mapping function can create data conflicts. Conflicts are reduced by increasing cache associativity.

for (i = ; i < n; i++) {
  for (j = ; j < ; j++) {
    x[j] = y[j]
  }
}

[Figure: x and y map to the same cache lines, causing a cache conflict]

Associative Cache Structure

[Figure: the cache is split into banks; the address is compared with the tag in each bank in parallel, and a multiplexer selects the data from the matching bank]

Associative Cache Operations

Degree of associativity n: a data item can be stored in n different entries. Upon a cache miss, choose the block/bank to replace within the set of possible blocks (the set):
- LRU: Least Recently Used
- Random
- Pseudo-LRU: the most recently used line is not replaced; random among the others
- FIFO: First In, First Out
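LRU replacement within one set can be sketched as an ordered list (front = most recently used); the 4-way associativity below is an example value:

```python
# LRU replacement within one set of an n-way associative cache.
# The set is an ordered list of tags: front = most recently used.

def access(lru_set, tag, ways=4):
    """Return True on hit; update the LRU order, evicting on a miss if full."""
    if tag in lru_set:
        lru_set.remove(tag)
        lru_set.insert(0, tag)     # hit: move to most-recently-used position
        return True
    if len(lru_set) == ways:
        lru_set.pop()              # miss in a full set: evict the LRU tag
    lru_set.insert(0, tag)
    return False

s = []
for t in [1, 2, 3, 4]:
    access(s, t)                   # fill the set; LRU order is now [4, 3, 2, 1]
assert access(s, 1) is True        # tag 1 still resident
assert access(s, 5) is False       # miss: evicts tag 2, the least recently used
assert 2 not in s
```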
Writing a Data (Write-Through)

[Figure: on a write hit, the data is written both to the matching cache line and to memory]

Writing a Data (Write-Back)

[Figure: on a write hit, the data is written only to the cache line; memory is updated later, when the line is evicted on a cache miss]

Virtual Memory / Physical Memory: TLB

The processor uses virtual addresses, but data have addresses in physical memory, so a virtual-to-physical address translation is needed. The TLB (Translation Lookaside Buffer) is a cache of address translations; a TLB entry covers a page. The TLB is often fully associative (n = number of lines).
[Figure: the virtual address is compared with every TLB entry in parallel to produce the physical address]
Summary

Instruction cache & TLB, data cache & TLB. Done simultaneously: the virtual/physical address translation and the cache access using the virtual address.

Several Cache Levels

Example: Alpha processor.
- Instruction cache: 8 KB, n-way
- Data cache: 8 KB, n-way
- Shared on-chip cache: n-way
- Off-chip shared cache: several MB, n-way

Impact of Cache Misses on Processor Performance

On average, a large fraction of instructions are loads/stores, and instruction fetches also miss in the cache. Cache hierarchies reduce the average memory latency.
[Worked examples: total cycle counts for several cache miss rates, for a GHz processor with memory access measured in ns]
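The average memory access time (AMAT) mentioned above composes across cache levels. The latencies and miss rates below are illustrative assumptions, not the slide's lost values:

```python
# Average memory access time for a hierarchy: each level contributes its
# hit time plus its miss rate times the cost of going one level down.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: L1 hits in 2 cycles; 5% of accesses miss to an L2 that
# hits in 12 cycles; 10% of those miss to a 200-cycle memory.
l2_cost = amat(12, 0.10, 200)   # 32.0 cycles seen by an L1 miss
print(amat(2, 0.05, l2_cost))   # 3.6 cycles on average
```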
Current Microprocessors: Superscalar Execution

Superscalar Processor

A pipeline completes at most one instruction per cycle. A superscalar processor of degree n completes up to n instructions per cycle (in practice, n is small). Requirements for a superscalar processor:
- An uninterrupted flow of instructions
- Determining which instructions can execute in parallel
- Propagating data among instructions (the result of instruction i is an operand of instruction j)
- Several functional units
Constraint: precise interrupts. Superscalar implementations share a lot of features.

Superscalar Processor Architecture

[Figure: architecture and pipeline of a superscalar processor. Source: Proceedings of the IEEE]
Instruction-Level Parallelism (ILP)

(Fine-grain parallelism.)

Instruction Fetch

Instruction flow disruptions: branches and instruction cache misses. To avoid disruptions:
- Fetch several instructions per cycle, possibly from several cache lines (multi-port cache)
- Use a buffer to store the pre-fetched instructions
[Figure: fetched cache lines around a branch predicted taken feeding the instruction buffer]

Dependences

Find the dependences between instructions, and avoid the false dependences due to register aliasing:
- RAW (Read After Write): true dependence
- WAW (Write After Write): out-of-order write (false dependence)
- WAR (Write After Read): too-early write (false dependence)
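The three dependence kinds can be classified mechanically from the destination and source registers of an instruction pair; the register names below are invented:

```python
# Classify the dependences between instruction i (earlier) and j (later)
# from their destination register and source registers.

def dependences(i_dest, i_srcs, j_dest, j_srcs):
    kinds = set()
    if i_dest in j_srcs:
        kinds.add('RAW')   # j reads what i writes: true dependence
    if i_dest == j_dest:
        kinds.add('WAW')   # both write the same register: false dependence
    if j_dest in i_srcs:
        kinds.add('WAR')   # j overwrites a source of i: false dependence
    return kinds

# move r1, r2  then  lw r8, 0(r1): lw truly depends on the move.
assert dependences('r1', ['r2'], 'r8', ['r1']) == {'RAW'}
# lw r8, 0(r1)  then  add r1, r1, 1: only a false (WAR) dependence... on r8/r1.
assert dependences('r8', ['r1'], 'r1', ['r1']) == {'WAR'}
```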
Register Aliasing: Renaming

Binary compatibility limits the number of logical registers, while technology allows more physical registers. Physical registers + ReOrder Buffer (ROB):
- Each instruction is mapped to a ROB entry
- A table maps each logical register either to a physical register or to a ROB entry
This eliminates register aliasing and exposes the true dependences.

L: move r, r
   lw r8, (r)
   add r, r,

[Figure: mapping table from logical registers to physical registers or ROB entries; the ROB records which value is produced by the move, lw, and add]

Dispatch

After dependence analysis, instructions are dispatched to reservation stations (a buffer in front of each functional unit). An instruction is executed when:
- all of its operands are available
- the functional unit is available
(Tomasulo's algorithm.)
[Figure: reservation station entries for the ALU, holding the operation, its sources (data, or the producing ROB entry), and the result tag]

Issue

Ready instructions are sent to the functional units. Results are propagated through the common data bus:
- to physical storage (the ROB entries)
- to the reservation stations
Pending instructions can then issue immediately. This internal model is closer to dataflow than to von Neumann.
[Figure: reservation stations, ROB, and functional units connected by the common data bus]
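The renaming table can be sketched in a few lines of Python; the structure is simplified (no commit or free list) and the register names are those of the slide's example:

```python
# Register renaming sketch: a map from each logical register to the ROB
# entry that will produce its next value. None = value already committed.

rename = {}   # logical register -> producing ROB entry
rob = []      # rob[k] = destination register of in-flight instruction k

def dispatch(dest, sources):
    """Allocate a ROB entry; sources read the current mapping (true deps only)."""
    deps = [rename.get(s) for s in sources]  # None: read the register file
    entry = len(rob)
    rob.append(dest)
    if dest is not None:
        rename[dest] = entry                 # later readers see this entry
    return entry, deps

# move r1, r2 ; lw r8, 0(r1) ; add r1, r1, 1  -- WAW/WAR on r1 disappear:
e0, d0 = dispatch('r1', ['r2'])
e1, d1 = dispatch('r8', ['r1'])
e2, d2 = dispatch('r1', ['r1'])
assert d1 == [e0]   # lw truly depends on the move
assert d2 == [e0]   # add reads the move's value, not its own new mapping
```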
Commit

An instruction can only commit when it reaches the end of the ROB: commit order = program order. The logical architectural state (registers and memory) is modified only at commit, which makes precise interrupts possible in out-of-order processors. When an instruction leaves the ROB, its result is written into a register, or its data is sent to memory. A superscalar processor of degree n can commit n instructions simultaneously. If the instruction at the head of the ROB has not completed, the processor stalls.

Pentium IV

[Figure: Pentium IV block diagram. Source: Tom's Hardware]

Trace Cache

An ordinary instruction cache stores instructions in memory order (A B C D); the trace cache stores them in predicted execution order (for example, A B D when the branch in B is predicted taken), so the predictor can make several predictions per fetch.
[Figure: instruction cache vs. trace cache, with the predictor driving fetch]
Pentium IV

[Figure: Pentium IV pipeline. Source: Tom's Hardware]

Ideally: Instruction Issues per Cycle

[Chart: ideal number of instruction issues per cycle for gcc, espresso, li, fpppp, doducd, tomcatv]

Factoring in the Main Constraints

[Chart: instructions per cycle for the same programs once the main constraints are taken into account]
In Reality

The achievable number of instructions per cycle is small at its maximum, and lower still on average.
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationChapter 06: Instruction Pipelining and Parallel Processing
Chapter 06: Instruction Pipelining and Parallel Processing Lesson 09: Superscalar Processors and Parallel Computer Systems Objective To understand parallel pipelines and multiple execution units Instruction
More informationAnnouncements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)
Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationTi Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr
Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationPhoto David Wright STEVEN R. BAGLEY PIPELINES AND ILP
Photo David Wright https://www.flickr.com/photos/dhwright/3312563248 STEVEN R. BAGLEY PIPELINES AND ILP INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks
More informationLocality. Cache. Direct Mapped Cache. Direct Mapped Cache
Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationMemory Hierarchies 2009 DAT105
Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationLecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )
Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use
More information