COSC 6385 Computer Architecture. Instruction Level Parallelism
|
|
- Lilian Ferguson
- 6 years ago
- Views:
Transcription
1 COSC 6385 Computer Architecture Instruction Level Parallelism Spring 2013 Instruction Level Parallelism Pipelining allows for overlapping the execution of instructions Limitations on the (pipelined) execution of instruction Data Dependencies Control Dependencies Minimizing the effect of the limitations can be done in Hardware Software (Compiler) 1
2 Compiler techniques Scheduling instructions to minimize the no. of stall cycles Loop unrolling: modify a loop such that multiple iterations of the loop are executed at once Reduces the no. of instructions that control the loop Increases binary size for ( i=0; i < N ; i++ ) x[i] = x[i] + s; for (i=0; i < N ; i+=k) { } x[i] = x[i] + s; x[i+1] = x[i+1] + s; x[i+2] = x[i+2] + s; x[i+k] = x[i+k] + s; Loop unrolling Might require two loops to deal with situation where N%k!= 0 } for (i=0; i< N%k; i++ ) { x[i] = x[i] + s; for ( ; i < N ; i+=k) { x[i] = x[i] + s; x[i+1] = x[i+1] + s; x[i+2] = x[i+2] + s; x[i+k] = x[i+k] + s; 2
3 An example for ( i=1000; i > 0 ; i-- ) { x[i] = x[i] + s; } Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Assumptions Instruction producing results Instruction using results FP ALU FP ALU 3 FP ALU Store 2 Load FP FP ALU 1 Load FP Store 0 Latency in clock cycles Latency: number of intervening cycles between an instruction that produces a result and instruction that uses the result 1 cycle branch delay 3
4 An example (III) Loop: L.D F0, 0(R1) 1 s 2 ADD.D F4, F0, F2 3 s 4 s 5 S.D F4, 0(R1) 6 DADDUI R1, R1, #-8 7 s 8 BNE R1, R2, Loop 9 s 10 wait for F0 to propagate wait for ADD to complete wait for ADD to complete wait for R1 to propagate branch delay slot Rescheduling the code Loop: L.D F0, 0(R1) 1 DADDUI R1, R1, #-8 2 ADD.D F4, F0, F2 3 s 4 BNE R1, R2, Loop 5 S.D F4, 8(R1) 6 Delayed branch slot Each loop iteration consists of 3 instructions of actual work (load, add, store) and 3 cycles loop overhead 4
5 Loop unrolling Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) /* Drop DADDUI & BNE */ L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) /* Drop DADDUI & BNE */ L.D F10, -16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) /* Drop DADDUI & BNE */ L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop Loop unrolling (II) Eliminates three branches and three decrements Reduces loop overhead The previous code sequence still contains many stalls, since many operations are still dependent on each other Requires more registers Increase of the code size 5
6 Scheduled version of the unrolled loop Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10, -16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, #-32 S.D F12, 16(R1) BNE R1, R2, Loop S.D F16, 8(R1) Scheduled version of the unrolled loop (II) Non trivial transformations in the previous code Adjust the offsets of S.D instructions Determine that it is legal to move the S.D after DADDUI and BNE Determine that loads and stores of difference iterations can be interchanged 6
7 Branch prediction No instruction is allowed to initiate execution until all branches preceding the instruction have completed Static techniques to avoid branch hazards Stall Predict not taken Predict taken Delayed branch -> do not take the previous behavior of branches into account Dynamic branch prediction Algorithms using the previous execution of a branch to predict the outcome of the next execution Seven techniques for dynamic branch prediction 1bit branch prediction buffer 2bit branch prediction buffer Correlating Branch Prediction Buffer Branch Target Buffer Return Address Predictors Tournament predictors 7
8 1bit Branch prediction buffer (I) Branch prediction buffer: Small memory area indexed by the lower portion of the address of the branch instruction Records whether the branch was taken the last time or not (1 bit is sufficient) Please note: Several branches might share the same address since we do not use the full branch instruction address for accessing the branch prediction buffer 1bit Branch Prediction Buffer (II) Limitations Even for a regular loop (embedded in another large loop) the 1bit Branch Prediction Buffer will mispredict at least the first and the last iteration 1 st iteration: the bit has been set by the last iteration of the same loop to not-taken, but the branch will be taken Last iteration: the bit says taken, but the branch won t be taken 8
9 2bit Branch Prediction Buffer A prediction must miss twice before the prediction is changed Can be extended to n-bits Taken Predict taken 11 Taken Not taken Taken Predict taken 10 Not taken Predict not taken 01 Not taken Taken Predict not taken 00 Correlated branches For a (1,1) predictor: each branch has two different branch prediction buffers: Predictor used in case the previous branch in the application has not been taken X / Y Predictor used in case the previous branch in the application has been taken The content of the two branch prediction buffers are determined by the branch to which they belong Which of the two branch prediction buffers are used is depending on the outcome of the previous branch in the application 9
10 Correlated branches - example if ( d==0 ) d = 1; if ( d==1 ) BNEZ R1, L1 DADDIU R1, R0, #1 L1: DADDIU R3, R1, #-1 BNEZ R3, L2 L2:!branch b1!branch b2 Initial value of d d==0? b1 Value of d before b2 d==1? 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken b2 Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT NT/NT the branch prediction buffers for the branches b1 and b2 are assumed to hold the prediction Not taken for both option (previous branch not taken/taken) 10
11 Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT NT/NT assuming BPB for b1 uses the Not Taken predictor because the previous branch in the application has not been taken BPB for b1 predicts that b1 will not be taken Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT T NT/NT BPB for b1 predicts that b1 will not be taken b1 is taken (see table for d=2) Initial value of d d==0? b1 Value of d before b2 d==1? 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken b2 11
12 Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT T NT/NT T/NT updating the Previous branch has not been taken part of BPB for b1 to Taken because b1 has been taken, the last branch has been taken part of BPB b2 will be used BPB b2 predicts, that b2 will not be taken Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT T NT/NT T T/NT NT/T b2 is taken (see table for d=2) updating the Previous branch has been taken part of BPB for b2 to Taken because b2 has been taken, the last branch has been taken part of BPB b1 will be used Initial BPB value b1 predicts, d==0? that b1 will b1 not be Value taken of d d==1? b2 of d before b2 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken 12
13 Correlated branches - example d=? BPB b1 b1 act. BPB b2 B2 act. 2 NT/NT T NT/NT T 0 T/NT NT NT/T b1 is not taken (see table for d=0) matches prediction! update of BPB b1 does not modify any entry taken because b1 has not been taken, the last branch has not been taken part of BPB b2 will be used BPB b2 predicts that b2 will not be taken Initial value of d d==0? b1 Value of d before b2 d==1? 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken b2 Correlated branches A (2,1) correlated branch predictor Uses the behavior of the last 2 branches to choose from 2 2 different predictions Uses a 1 bit predictor for each of the 4 prediction buffers Predictor used in case the previous 2 branches in the application have both not been taken (00) Predictor used in case the previous branches have the history :second last branch not taken, last branch taken (01) Predictor used in case the previous branches have the history: second last branch taken, last branch not taken (10) Predictor used in case the previous 2 branches in the application have both been taken (11) A / B / C / D 13
14 Correlated branches How do we know which of the four sections of our branch predictor to use Need to record the behavior of all branches in the application Initial value of d d==0? b1 Value of d before b2 d==1? 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken 2 No Taken 2 No Taken 0 Yes Not taken 1 Yes Not taken b2 e.g Correlated branches For a (2,n) branch predictor, the last two branches are relevant 11 2-bit global branch history (implemented using a 2bit shift register)
15 Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior) Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction (2,2) predictor: 2-bit global, 2-bit local Correlated Branches Branch address (4 bits) 2-bits per branch local predictors Prediction 2-bit global branch history (01 = not taken then taken) Slide based on a lecture by David A. Patterson, University of California, Berkley Frequency of Mispredictions Frequency of Mispredictions 16% 14% 12% 10% 8% 6% 2% 0% Accuracy of Different Schemes 20% 18% 18% 4% 0% 1% 0% 4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT 1% 5% 6% 6% nasa7 matrix300 tomcatv doducd spice fpppp gcc espresso eqntott li 11% 4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2) 4% Slide based on a lecture by David A. Patterson, University of California, Berkley 6% 5% 15
16 Branch Target Buffers Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken) PC of instruction FETCH No: branch not predicted, proceed normally (Next PC = PC+4) Branch PC =? Predicted PC Yes: instruction is branch and use predicted PC as next PC Slide based on a lecture by David A. Patterson, University of California, Berkley Extra prediction state bits Need Address at Same Time as Prediction (II) Send PC to memory and branch target buffer (BTB) No Entry found in BTB? Yes Is instruction a taken branch? Send out predicted PC No Yes No Taken branch? Yes Normal execution Enter branch address and next PC count into BTB Mispredicted branch, kill fetched instruction Branch correctly predicted 16
17 Return Addresses and Tournament predictor Return Address: Register Indirect branch hard to predict address Save return address in small stack 8 to 16 entries reduces miss rate dramatically Tournament predictors: Combine multiple prediction algorithms Keeps track which predictor was the most accurate for the last execution of a branch Allows to use different prediction algorithms for different types of branches 17
COSC 6385 Computer Architecture Dynamic Branch Prediction
COSC 6385 Computer Architecture Dynamic Branch Prediction Edgar Gabriel Spring 208 Pipelining Pipelining allows for overlapping the execution of instructions Limitations on the (pipelined) execution of
More informationTomasulo Loop Example
Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop This time assume Multiply takes 4 clocks Assume 1st load takes 8 clocks, 2nd load takes 1 clock Clocks for SUBI,
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationAdministrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies
Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu
More informationDynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers
Dynamic Hardware Prediction Importance of control dependences Branches and jumps are frequent Limiting factor as ILP increases (Amdahl s law) Schemes to attack control dependences Static Basic (stall the
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCSE4201 Instruction Level Parallelism. Branch Prediction
CSE4201 Instruction Level Parallelism Branch Prediction Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) 1 Introduction With dynamic scheduling that
More informationInstruction-Level Parallelism. Instruction Level Parallelism (ILP)
Instruction-Level Parallelism CS448 1 Pipelining Instruction Level Parallelism (ILP) Limited form of ILP Overlapping instructions, these instructions can be evaluated in parallel (to some degree) Pipeline
More informationInstruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.
Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationCS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction
Outline CMSC Computer Systems Architecture Lecture 9 Instruction Level Parallelism (Static & Dynamic Branch ion) ILP Compiler techniques to increase ILP Loop Unrolling Static Branch ion Dynamic Branch
More informationCS422 Computer Architecture
CS422 Computer Architecture Spring 2004 Lecture 14, 18 Feb 2004 Bhaskaran Raman Department of CSE IIT Kanpur http://web.cse.iitk.ac.in/~cs422/index.html Dealing with Control Hazards Software techniques:
More informationLecture 15: Instruc.on Level Parallelism -- Introduc.on, Compiler Techniques, and Advanced Branch Predic.on
Lecture 15: Instruc.on Level Parallelism -- Introduc.on, Compiler Techniques, and Advanced Branch Predic.on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong
More informationBranch Prediction Chapter 3
1 Branch Prediction Chapter 3 2 More on Dependencies We will now look at further techniques to deal with dependencies which negatively affect ILP in program code. A dependency may be overcome in two ways:
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationPipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &
More informationLecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)
Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationLecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism
Lecture 8: Compiling for ILP and Branch Prediction Kunle Olukotun Gates 302 kunle@ogun.stanford.edu http://www-leland.stanford.edu/class/ee282h/ 1 Advanced pipelining and instruction level parallelism
More informationInstruction Level Parallelism. Taken from
Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationLecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections )
Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 4.4) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationMulticycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline
//11 Limitations of Our Simple stage Pipeline Diversified Pipelines The Path Toward Superscalar Processors HPCA, Spring 11 Assumes single cycle EX stage for all instructions This is not feasible for Complex
More informationNOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism
Outline Csci 211 Computer System Architecture Lec 4 Instruction Level Parallelism Xiuzhen Cheng Department of Computer Sciences The George Washington University ILP Compiler techniques to increase ILP
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationNOW Handout Page 1. Review from Last Time. CSE 820 Graduate Computer Architecture. Lec 7 Instruction Level Parallelism. Recall from Pipelining Review
Review from Last Time CSE 820 Graduate Computer Architecture Lec 7 Instruction Level Parallelism Based on slides by David Patterson 4 papers: All about where to draw line between HW and SW IBM set foundations
More informationESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism
Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling 1 Outline ILP Compiler techniques to increase ILP Loop Unrolling Static
More informationInstruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties
Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,
More informationPipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...
CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100
More informationLecture: Pipeline Wrap-Up and Static ILP
Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationCSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism
CSE 502 Graduate Computer Architecture Lec 8-10 Instruction Level Parallelism Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,
More informationHY425 Lecture 05: Branch Prediction
HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware
More informationInstruction-Level Parallelism and Its Exploitation
Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic
More informationReduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction
ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch
More informationAdvanced Computer Architecture
Advanced Computer Architecture Instruction Level Parallelism Myung Hoon, Sunwoo School of Electrical and Computer Engineering Ajou University Outline ILP Compiler techniques to increase ILP Loop Unrolling
More informationBranch prediction ( 3.3) Dynamic Branch Prediction
prediction ( 3.3) Static branch prediction (built into the architecture) The default is to assume that branches are not taken May have a design which predicts that branches are taken It is reasonable to
More informationECE473 Computer Architecture and Organization. Pipeline: Control Hazard
Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationLecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationExercises (III) Output dependencies: Instruction 6 has output dependence with instruction 1 for access to R1
1) a)solution: Let s number instructions: 1. LD R1, 50(R2) 2. ADD R3, R1, R4 3. LD R5, 100(R3) 4. MUL R6, R5, R7 5. STORE R6, 50(R2) 6. ADD R1, R1, #100 7. SUB R2, R2, #8 b) Exercises (III) Data dependencies:
More informationTopics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation
Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,
More informationECSE 425 Lecture 11: Loop Unrolling
ECSE 425 Lecture 11: Loop Unrolling H&P Chapter 2 Vu, Meyer Textbook figures 2007 Elsevier Science Last Time ILP is small within basic blocks Dependence and Hazards Change code, but preserve program correctness
More informationFloating Point/Multicycle Pipelining in DLX
Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or
More informationAppendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation
More informationComputer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution
CSCE 6 (Fall 07) Computer Architecture Homework Set # COVER SHEET Please turn in with your own solution Eun Jung Kim Write your answers on the sheets provided. Submit with the COVER SHEET. If you need
More informationFunctional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory
The Big Picture: Where are We Now? CS152 Computer Architecture and Engineering Lecture 18 The Five Classic Components of a Computer Processor Input Control Dynamic Scheduling (Cont), Speculation, and ILP
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationCSE 490/590 Computer Architecture Homework 2
CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information/ : Computer Architecture and Design Fall Midterm Exam October 16, Name: ID #:
16.482 / 16.561: Computer Architecture and Design Fall 2014 Midterm Exam October 16, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationDYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING
DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationLecture: Static ILP. Topics: loop unrolling, software pipelines (Sections C.5, 3.2)
Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) 1 Loop Example for (i=1000; i>0; i--) x[i] = x[i] + s; Source code FPALU -> any: 3 s FPALU -> ST : 2 s IntALU -> BR :
More informationLecture 4: Introduction to Advanced Pipelining
Lecture 4: Introduction to Advanced Pipelining Prepared by: Professor David A. Patterson Computer Science 252, Fall 1996 Edited and presented by : Prof. Kurt Keutzer Computer Science 252, Spring 2000 KK
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationEEC 581 Computer Architecture. Lec 4 Instruction Level Parallelism
EEC 581 Computer Architecture Lec 4 Instruction Level Parallelism Chansu Yu Electrical and Computer Engineering Cleveland State University Acknowledgement Part of class notes are from David Patterson Electrical
More informationInstruction-Level Parallelism (ILP)
Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: 1. Rely on hardware to help discover and exploit
More informationInstruction Level Parallelism (ILP)
1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationNOW Handout Page 1. COSC 5351 Advanced Computer Architecture
COSC 5351 Advanced Computer Slides modified from Hennessy CS252 course slides ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction Overcoming Data Hazards
More informationAnnouncements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory
ECE4750/CS4420 Computer Architecture L10: Branch Prediction Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab2 and prelim grades Back to the regular office hours 2 1 Overview
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationReview Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction
CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations
More informationLecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )
Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use
More informationLecture 5: VLIW, Software Pipelining, and Limits to ILP. Review: Tomasulo
Lecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998 DAP.F96 1 Review: Tomasulo Prevents Register as bottleneck Avoids WAR, WAW hazards
More informationLecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998
Lecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998 DAP.F96 1 Review: Tomasulo Prevents Register as bottleneck Avoids WAR, WAW hazards
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science ! CPI = (1-branch%) * non-branch CPI + branch% *
More informationCS433 Homework 2 (Chapter 3)
CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit
More informationLooking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction
Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 15-20% of instructions
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationPage 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough
ILP ILP Basics & Branch Prediction Today s topics: Compiler hazard mitigation loop unrolling SW pipelining Branch Prediction Parallelism independent enough e.g. avoid s» control correctly predict decision
More informationPage 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP
CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause
More information/ : Computer Architecture and Design Fall 2014 Midterm Exam Solution
16.482 / 16.561: Computer Architecture and Design Fall 2014 Midterm Exam Solution 1. (8 points) UEvaluating instructions Assume the following initial state prior to executing the instructions below. Note
More informationTDT 4260 TDT ILP Chap 2, App. C
TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other
More informationCS433 Homework 3 (Chapter 3)
CS433 Homework 3 (Chapter 3) Assigned on 10/3/2017 Due in class on 10/17/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies
More informationCOSC 6385 Computer Architecture - Instruction Level Parallelism (II)
COSC 6385 Computer Architecture - Instruction Level Parallelism (II) Edgar Gabriel Spring 2016 Data fields for reservation stations Op: operation to perform on source operands S1 and S2 Q j, Q k : reservation
More informationCS252 Graduate Computer Architecture Midterm 1 Solutions
CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationStatic, multiple-issue (superscaler) pipelines
Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue
More informationCompiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation
Lecture 7 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2013 Reading: Textbook, Ch. 3 Complexity-Effective Superscalar Processors, PhD Thesis by Subbarao Palacharla, Ch.1
More informationComputer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction
More information