Chapter 3: Instruction Level Parallelism and Its Exploitation
1 Chapter 3: Instruction Level Parallelism and Its Exploitation - Abdullah Muzahid
2 Hardware-Based Speculation (Section 3.6)
- In multiple-issue processors, stalls due to branches would be frequent: you may need to execute 1 branch per cycle.
- Solution: speculate on the outcome of the branch and execute as if the guess were correct. This is called hardware speculation, and it requires the ability to recover from mispredictions.
- Hardware-based speculation combines:
  - dynamic branch prediction,
  - speculation, to allow instructions to execute (EX) before branches resolve, and
  - aggressive dynamic scheduling across basic blocks (because now we aggressively execute past basic blocks).
3 Speculative Execution Based on Tomasulo
- Used in most processors: Intel Core i3/i5/i7, AMD Phenom, IBM Power7.
- Key: separate the bypassing of results among instructions from full instruction completion:
  - allow a speculative instruction to pass its result to others,
  - but do not complete the instruction (do not allow it to update any state that cannot be undone).
- When the instruction is no longer speculative, allow it to modify the register file or memory -> the Instruction Commit step.
- Overall: instructions execute out of order (OOO), but they commit in order. Prevent any irrevocable action until an instruction commits.
4 Reorder Buffer (ROB)
- A hardware buffer that holds the results of instructions that have completed but not committed.
- Also used to pass results among instructions; it provides additional registers, like the reservation stations.
- Difference between the ROB and Tomasulo's reservation stations:
  - Tomasulo: once the FU writes its result, subsequent instructions find the value in the register set.
  - ROB: the ROB supplies the results of completed but not-yet-committed instructions.
5 ROB Fields in an Entry
- Instruction type: BR (no destination), SD (destination is memory), Other (destination is a register).
- Destination field: destination register or destination memory address.
- Value: the result value, held until commit.
- Ready: is the instruction finished (and therefore the result ready)?
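As an illustration only (not part of the lecture), these four fields translate into a minimal Python sketch; the class names and the FIFO wrapper are hypothetical:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    itype: str                    # 'BR' (no dest), 'SD' (dest is memory), 'OTHER' (dest is reg)
    dest: Optional[str] = None    # destination register name or memory address
    value: Optional[int] = None   # result value, held here until commit
    ready: bool = False           # has the instruction finished (result present)?

class ROB:
    """FIFO of ROBEntry: new instructions append at the tail, commits pop the head."""
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()

    def full(self) -> bool:
        return len(self.entries) == self.size
```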
7 Functions of RS and ROB
- Reservation stations: a place to buffer operations and operands from the time an instruction issues until it can execute. Each RS tracks the ROB entry assigned to its instruction.
- ROB: does the renaming. Every instruction has a position in the ROB from when it issues until it commits, and we tag a result using the ROB entry number rather than the RS number.
8 Overall Instruction Execution
- ISSUE (or DISPATCH; see the code sketch below):
  - Get an instruction from the instruction queue; issue it if there is an empty RS and an empty ROB entry.
  - Send the operands to the RS if they are available in any register or ROB entry.
  - Send the number of the ROB entry to the RS, so that it can be used to tag the result of the execution when it is dumped on the CDB.
  - If there is no free RS or no free ROB entry: stall the instruction and its successors.
- EXECUTE:
  - Monitor the CDB for any missing input to appear (this enforces RAW); when it does, read it into the RS.
  - When both operands are ready and the FU is free, execute. Execution may take several cycles.
  - Loads require 2 steps (address generation and memory access); stores need only 1 step (address generation).
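Continuing the hypothetical Python model from above, a hedged sketch of the issue step; the dict-based reservation stations and the register-status table `reg_status` are illustrative simplifications, not the lecture's exact structures:

```python
def issue(instr, rob, stations, reg_status, regs):
    """ISSUE: needs one free reservation station and one free ROB entry, else stall."""
    rs = next((s for s in stations if not s['busy']), None)
    if rs is None or rob.full():
        return False                        # stall this instruction and its successors

    entry = ROBEntry(itype=instr['itype'], dest=instr.get('dest'))
    rob.entries.append(entry)
    rs.update(busy=True, op=instr['op'], rob_entry=entry)  # the entry object serves as the tag

    for i, src in enumerate(instr.get('srcs', [])):
        producer = reg_status.get(src)      # ROB entry that will produce src, if any
        if producer is None:
            rs[f'V{i}'], rs[f'Q{i}'] = regs[src], None       # operand is in a register
        elif producer.ready:
            rs[f'V{i}'], rs[f'Q{i}'] = producer.value, None  # operand is in the ROB
        else:
            rs[f'V{i}'], rs[f'Q{i}'] = None, producer        # wait for this tag on the CDB

    if entry.itype == 'OTHER' and entry.dest is not None:
        reg_status[entry.dest] = entry      # later readers of dest will see this ROB tag
    return True
```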
9 Overall Instruction Execution (Contd.)
- WRITE RESULT:
  - When execution completes, dump the result on the CDB together with the ROB entry number, and mark the RS as available. The data goes to the other RSs waiting for it and to the correct ROB entry.
  - For stores, if the data value is already available, it is written to the ROB entry. Otherwise, the store waits in the ROB, monitoring the CDB; as soon as the data value appears on the CDB, it is written into the store's ROB entry.
- COMMIT (or GRADUATION; see the sketch below):
  - When an instruction reaches the head of the ROB and its result is present in the entry -> update the register file and remove the ROB entry.
  - If the instruction is a store: do the same, except that memory is updated instead of the register file.
  - If the instruction is a branch with an incorrect prediction: flush the ROB and restart execution at the correct successor.
  - If the instruction is a branch with a correct prediction: commit as usual.
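In the same toy model, a minimal commit sketch; the `mispredicted` flag is a hypothetical extra field that a write-result step would set on branch entries, and fetch redirection is left unmodeled:

```python
def commit(rob, regs, memory):
    """COMMIT: retire the instruction at the head of the ROB, in program order."""
    if not rob.entries or not rob.entries[0].ready:
        return                                   # head not finished: nothing commits yet
    head = rob.entries.popleft()
    if head.itype == 'SD':
        memory[head.dest] = head.value           # stores update memory only at commit
    elif head.itype == 'BR':
        if getattr(head, 'mispredicted', False): # hypothetical field set at write-result
            rob.entries.clear()                  # flush everything younger than the branch
            # ...then restart fetch at the correct branch successor (not modeled here)
    else:
        regs[head.dest] = head.value             # update the architectural register file
```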
10 Workout Table

Instruction        | Issue | Exec B-E | Mem Rd | CDB Write | Commit/Mem Wr
LD   F6, 32(R2)    |       |          |        |           |
LD   F2, 44(R3)    |       |          |        |           |
MULD F0, F2, F4    |       |          |        |           |
SUBD F8, F2, F6    |       |          |        |           |
DIVD F10, F0, F6   |       |          |        |           |
ADDD F6, F8, F2    |       |          |        |           |
11 Example of Branch Misprediction
- Example on page 189 and Figure 3.13.
- After a branch is mispredicted, recover by clearing the ROB of all instructions that follow the branch, letting the branch and the previous instructions commit, and fetching at the correct branch successor.
- Exceptions:
  - Record the exception in the ROB entry.
  - If the excepting instruction is on the mispredicted path, simply squash the instruction and flush the exception.
  - When the excepting instruction reaches the head of the ROB: process it and flush all pending instructions.
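This exception rule also fits the toy model above; a hedged fragment in which the `exception` field is a hypothetical addition recorded in the ROB entry at execute time (a faulting instruction is assumed to be marked ready with its exception recorded):

```python
def commit_with_exceptions(rob, regs, memory):
    """Process an exception only when its instruction reaches the head of the ROB."""
    if not rob.entries or not rob.entries[0].ready:
        return
    exc = getattr(rob.entries[0], 'exception', None)   # recorded when the fault occurred
    if exc is not None:
        # Entries on a mispredicted path were already flushed when the branch committed,
        # so an exception that reaches the head is on the true path: it is precise.
        rob.entries.clear()                            # flush all pending instructions
        raise RuntimeError(f"precise exception: {exc}")  # stand-in for the trap handler
    else:
        commit(rob, regs, memory)                      # otherwise retire the head as usual
```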
12 Workout Table

Instruction          | Issue | Exec B-E | Mem Rd | CDB Write | Commit/Mem Wr
LD    F0, 0(R1)      |       |          |        |           |
MULD  F4, F0, F2     |       |          |        |           |
SD    F4, 0(R1)      |       |          |        |           |
DADDI R1, R1, #-8    |       |          |        |           |
BNE   R1, R2, Loop   |       |          |        |           |
LD    F0, 0(R1)      |       |          |        |           |
MULD  F4, F0, F2     |       |          |        |           |
SD    F4, 0(R1)      |       |          |        |           |
DADDI R1, R1, #-8    |       |          |        |           |
BNE   R1, R2, Loop   |       |          |        |           |
13 Handling Stores
- Tomasulo's algorithm: stores update caches/memory as soon as they complete the WB stage.
- Speculative processor: stores update caches/memory only when they reach the head of the ROB.
- Therefore: memory is never updated by a speculative instruction.
14 Dependences Through Memory
- WAW: eliminated, because the updating of memory occurs in order, when the store is at the head of the ROB; therefore, no earlier load or store can still be pending.
- WAR: eliminated for the same reason as WAW.
- RAW: maintained with two restrictions:
  - a load cannot be sent to memory if there is an earlier store in the ROB with the same address;
  - a load computes its effective address in order relative to all earlier stores.
- Note: some speculative machines actually allow bypassing a value from an earlier store to a later load without the load having to go to memory (sketched below).
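A hedged sketch of this load check in the Python model above; thanks to the in-order effective-address rule, earlier store addresses are assumed already computed, and the optional store-to-load bypass from the note is included:

```python
def resolve_load(rob_entries, load_index, load_addr):
    """Decide whether a load may access memory, must wait, or can take a bypass.
    rob_entries: list view of the ROB, oldest first; load_index: the load's position."""
    for entry in reversed(rob_entries[:load_index]):   # earlier instructions, youngest first
        if entry.itype != 'SD' or entry.dest != load_addr:
            continue
        if entry.value is not None:
            return ('forward', entry.value)   # bypass the store's value out of the ROB
        return ('wait', None)                 # same-address store, data not ready: stall
    return ('memory', None)                   # no earlier same-address store: go to memory
```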
15 Multiple Issue Processors
- If all of the techniques described are successful, the best achievable CPI = 1.
- To reduce CPI below 1, issue multiple instructions per cycle:
  - superscalar processors
  - VLIW or EPIC processors
16 Superscalar
- The hardware examines the instruction stream for parallel instructions.
- Static scheduling:
  - Issue multiple instructions that are quickly detected to be independent.
  - If one instruction in the set has a dependence, only the prior instructions are issued (this maintains in-order execution).
- Must increase the instruction fetch rate: widen the icache bus and pipeline IF (often by having an integrated instruction fetch unit).
17 Superscalar (Contd.)
- Dynamic scheduling: we use Tomasulo's algorithm.
- Example: issue 2 instructions per cycle; these instructions may even have dependences with each other.
- In general, for an n-issue width, use a combination of two approaches:
  - run the issue step at higher speed, so that e.g. two instructions can be issued in one cycle;
  - build wider logic, so that multiple instructions can be issued at the same time and all dependences are checked (see the sketch below).
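A toy fragment of the "wider logic" approach, reusing the hypothetical issue() from the earlier sketch:

```python
def issue_packet(instr_queue, rob, stations, reg_status, regs, width=2):
    """Try to issue up to `width` instructions in one cycle, oldest first."""
    issued = 0
    while issued < width and instr_queue:
        if not issue(instr_queue[0], rob, stations, reg_status, regs):
            break                 # no free RS/ROB entry: this and younger instructions stall
        instr_queue.pop(0)
        issued += 1
    return issued
```

Because issue() updates the register-status table before the next slot is examined, the second instruction of a packet can issue even when it depends on the first: it simply picks up the first one's ROB tag, matching the slide's point that co-issued instructions may have dependences with each other.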
18 VLIW: Very Long Instruction Word
- The compiler puts parallel instructions into a VLIW packet; Intel modified the idea a little and called it EPIC.
- The compiler has the responsibility to package multiple independent instructions together into a VLIW.
- Can issue many instructions in parallel: past machines have varied from 2 to 32 instructions.
- Simplifies the issuing hardware.
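As a toy illustration of the compiler's packaging job (a greedy sketch, not the scheduling algorithms named on the next slide), instructions are modeled as (dests, srcs) register sets and bundled until a dependence or the packet width forces a new word:

```python
def pack_vliw(instrs, width=4):
    """Greedy packing: an op joins the current packet only if it neither reads nor
    writes a register written by an op already in that packet (no RAW/WAW inside)."""
    packets, current, written = [], [], set()
    for dests, srcs in instrs:
        if (dests | srcs) & written or len(current) == width:
            packets.append(current)          # dependence or packet full: start a new word
            current, written = [], set()
        current.append((dests, srcs))
        written |= dests
    if current:
        packets.append(current)
    return packets

# pack_vliw([({'r1'}, {'r2'}), ({'r4'}, {'r5'}), ({'r6'}, {'r1'})])
# -> the first two independent ops share a packet; the third, which reads r1, starts a new one
```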
19 VLIW (Contd.)
- Performance is limited by the amount of ILP the compiler can discover.
- With many functional units to feed, this requires very complicated compiler scheduling support:
  - loop unrolling
  - software pipelining
  - trace scheduling
  - superblock scheduling
20

Common name               | Issue structure  | Hazard detection   | Scheduling               | Distinguishing characteristic                                       | Examples
Superscalar (static)      | dynamic          | hardware           | static                   | in-order execution                                                  | mostly in the embedded space: MIPS and ARM
Superscalar (dynamic)     | dynamic          | hardware           | dynamic                  | some out-of-order execution, but no speculation                     | none at the present
Superscalar (speculative) | dynamic          | hardware           | dynamic with speculation | out-of-order execution with speculation                             | Pentium 4, MIPS R12K, IBM Power5
VLIW/LIW                  | static           | primarily software | static                   | all hazards determined and indicated by compiler (often implicitly) | most examples are in the embedded space, such as the TI C6x
EPIC                      | primarily static | primarily software | mostly static            | all hazards determined and indicated explicitly by the compiler     | Itanium
21 Example of a 2-Issue Dynamically Scheduled Processor

Loop:  LD     R2, 0(R1)
       DADDIU R2, R2, #1
       SD     R2, 0(R1)
       DADDIU R1, R1, #8
       BNE    R2, R3, Loop

- Assume separate FUs for: effective address calculation, ALU ops, and branch condition evaluation.
- Assume 2 instructions of any type can commit per cycle.
- Assume 2 CDBs.
22 Workout Table

Instruction          | Issue | Exec begin-end | Mem Access | CDB write
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
23 Workout Table (repeat of the table above; the filled-in values were not preserved in the transcription)
24 Workout Table (with the issue cycles filled in)

Instruction          | Issue | Exec begin-end | Mem Rd | CDB write | Commit/Mem Write
LD    R2, 0(R1)      | 1     |                |        |           |
DADDI R2, R2, #1     | 1     |                |        |           |
SD    R2, 0(R1)      | 2     |                |        |           |
DADDI R1, R1, #8     | 2     |                |        |           |
BNE   R2, R3, Loop   | 3     |                |        |           |
LD    R2, 0(R1)      | 3     |                |        |           |
DADDI R2, R2, #1     | 4     |                |        |           |
SD    R2, 0(R1)      | 4     |                |        |           |
DADDI R1, R1, #8     | 5     |                |        |           |
BNE   R2, R3, Loop   | 5     |                |        |           |
LD    R2, 0(R1)      | 6     |                |        |           |
DADDI R2, R2, #1     | 6     |                |        |           |
SD    R2, 0(R1)      | 7     |                |        |           |
DADDI R1, R1, #8     | 7     |                |        |           |
BNE   R2, R3, Loop   | 8     |                |        |           |
25 Workout Table (repeat; the remaining filled-in values were not preserved in the transcription)
26 Register Renaming
- Key: think of architectural registers as names, not locations. We can have more locations than names, and we dynamically map names to locations.
- A map table contains the current mappings [name -> location].
- Register write: allocate a location and record it in the map table.
- Register read: find the location of the last write to that name in the map table.
- Minor detail: de-allocate locations carefully.

Copyright 2012 by Muzahid, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
27 An Example
- names: r1, r2, r3; locations: l1, l2, l3, l4, l5, l6, l7
- original mapping: r1 -> l1, r2 -> l2, r3 -> l3 (l4-l7 free)

raw instructions | map table (r1 r2 r3) | free locations | renamed instructions
(initial state)  | l1 l2 l3             | l4,l5,l6,l7    |
add r1,r2,r3     | l4 l2 l3             | l5,l6,l7       | add l4,l2,l3
sub r3,r2,r1     | l4 l2 l5             | l6,l7          | sub l5,l2,l4
mul r1,r2,r3     | l6 l2 l5             | l7             | mul l6,l2,l5
div r2,r1,r3     | l6 l7 l5             | (none)         | div l7,l6,l5

Copyright 2012 by Muzahid, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
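The table maps directly onto a few lines of illustrative Python (destination-first operand order, as in the slide; de-allocation is omitted, since the slide flags it as the part that needs care):

```python
def rename(instrs, map_table, free):
    """Rename (op, dest, src1, src2) instructions via a map table and a free list."""
    renamed = []
    for op, dest, s1, s2 in instrs:
        v1, v2 = map_table[s1], map_table[s2]  # reads use the current mappings
        loc = free.pop(0)                      # a write allocates a fresh location...
        map_table[dest] = loc                  # ...and records it in the map table
        renamed.append((op, loc, v1, v2))
    return renamed

code = [('add', 'r1', 'r2', 'r3'), ('sub', 'r3', 'r2', 'r1'),
        ('mul', 'r1', 'r2', 'r3'), ('div', 'r2', 'r1', 'r3')]
print(rename(code, {'r1': 'l1', 'r2': 'l2', 'r3': 'l3'}, ['l4', 'l5', 'l6', 'l7']))
# [('add','l4','l2','l3'), ('sub','l5','l2','l4'),
#  ('mul','l6','l2','l5'), ('div','l7','l6','l5')]
```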
28 Considerations for Speculative Machines
1. ROB vs. register renaming:
- Using a ROB: the architectural registers are contained in the combination of the register set, the reservation stations, and the ROB. If we do not issue instructions for a while, then as instructions commit, the register values will appear in the register file.
- Explicit set of physical registers + register renaming: the physical registers hold both the architecturally visible registers and the temporary values. The extended registers play the role of both the reservation stations and the ROB. WAW and WAR hazards are eliminated by renaming the destination register.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013
29 Considerations for Speculative Machines (Contd.)
2. How much to speculate:
- Advantage of speculation: we can execute past branches.
- Disadvantage: we may execute a lot of useless instructions.
- Consequently, execute in speculative mode only through low-cost events:
  - misses in the L1 cache: yes;
  - TLB misses: no. In this case, wait until the instruction causing the event becomes non-speculative.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013
30 Considerations for Speculative Machines (Contd.)
3. Speculating through multiple branches:
- Some programs have a high frequency of branches or branch clustering -> speculate across multiple branches.
- This complicates the process of speculation recovery.
- It is hard to speculate on more than one branch per cycle.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013