Chapter 3: Limits of Instruction-Level Parallelism

Ultimately, how much instruction-level parallelism is there?
- Consider the study by Wall (summarized in Hennessy & Patterson)
- First, assume perfect/infinite hardware
- Then successively refine to more realistic hardware

ILP Limit: Perfect/Infinite Hardware
- Infinite rename registers: no WAW or WAR hazards
- Perfect branch prediction
- Perfect memory hazard analysis
- Unlimited issues per cycle
- Looking an unlimited distance into the instruction stream: an infinite issue "window"
- Single-cycle execution

ILP Limit (see figure in book)
- Larger window => more ILP
- For finite window sizes, integer programs have less ILP than floating point

Narrow Window Size
- O(n^2) complexity of the dependence-checking logic for window size n, assuming all instructions in the window are simultaneously considered for issue (see figure in book)

ECE565 Lecture Notes: Chapters 2 & 3
Realistic Branch/Jump Prediction
- 2K-entry window, 64 simultaneous issues; prediction choices (see figure in book):
  - Perfect
  - Selective history predictor: correlating and non-correlating with selection; 97% accurate on the benchmarks
  - Standard 512x2 predictor with a 16-entry return address buffer
  - Static predictor, based on profile
  - No branch prediction; jumps predicted
- Going from no branch prediction to selective is a big improvement

Finite Rename Registers
- Limit the number of registers available for renaming (see figure in book)
- Assume a 2K-entry window and 64 simultaneous issues
- Assume a 2-level, 8K-entry branch predictor and a 2K jump/return predictor
- Going from no renaming to about 128 renaming tags is the big jump
Imperfect Alias Analysis
- Choices (see figure in book):
  - Global/stack perfect; all heap references conflict
  - Inspection of the object code
  - None; all references may conflict
- Assume a 2K-entry window and 64 simultaneous issues
- Assume a 2-level, 8K-entry branch predictor and a 2K jump/return predictor
- Assume 256 rename registers
- Huge difference between inspection and global/stack perfect
  - Inspection is close to what compilers can do
  - Global/stack perfect is what we could do if we had better compilers

Realizable Machine
- Consider even more restrictive hardware (see figure in book)
- Selective predictor with 1K entries; 16-entry return predictor
- Perfect disambiguation of memory references, done dynamically within the window
- Register renaming with 64 additional registers
- Variable window size
- Integer programs have less ILP than floating point
  - Integer programs' ILP levels off with larger windows
  - Floating-point programs' ILP keeps going up with a larger window
ILP Limit: Discussion
- Single-cycle execution assumption?
- What is the performance bottom line?
- Window assumptions?
- Memory alias assumptions?

Simultaneous Multithreading (SMT)
- We build a 6-wide superscalar for high performance on one program
- But real programs commit only 1-2 instructions per cycle
  - Stalls due to RAW hazards and memory latencies
  - Wasted issue slots due to mispredictions
  - So the pipeline goes unused
- This does not mean the pipe was unnecessarily wide
  - A 4-wide machine gives 1 instruction per cycle where a 1-wide machine would give only 0.25 instructions per cycle!
  - Parallelism is uneven and bursty (chapter 1 slides)

SMT Idea: run multiple programs at the same time through the pipe
- E.g., fetch and execute from 2 programs (or 2 threads of a parallel program)
- No dependences between the threads/programs => the pipeline is utilized better
- SMT does NOT improve single-thread performance
- SMT improves job throughput and CPU utilization
  - Better throughput is good if you have more than 1 program
  - Better utilization is good if you are a data-center manager
- Intel calls it Hyperthreading
- The pipeline has in-flight instructions from multiple programs; in ANY stage, instructions from more than 1 program can be processed in the SAME cycle
- 1 hardware context per program (e.g., 4 contexts for at most 4 programs)
- Some hardware is replicated for each context; the rest is shared by all contexts
  - Large hardware should be shared; small hardware can be replicated
  - What hardware should be separate? Stages: F D Rename Issue Regrd EX Mem WB
SMT
- For correctness, the following are replicated per context:
  - Rename map, ROB -- why?
- For performance, the following are replicated per context:
  - Branch history buffer, return address stack, load/store queue
- Shared hardware (shared among all contexts):
  - Fetch, decode, issue, physical registers, EX units, memory
  - Larger physical register files, caches, TLBs
- Fortunately, the replicated hardware is small and the large hardware is shared
- The OS thinks of each context as a virtual CPU
  - Hardware people say "contexts"; OS people say "virtual CPUs"
  - The OS assigns as many programs as possible to one real CPU (up to the maximum number of contexts)
- SMT allows multiple programs to share MOST of the pipeline
  - Improves CPU utilization and increases job throughput
- If you replicate the hardware all the way, you end up with multicores
  - In multicores, each core is SMT-capable

Latency Hierarchy
- L1 hit (~2 cycles) and L1 miss/L2 hit (near) (~12 cycles): overlapped by out-of-order issue (within 1 thread)
- L1 miss/L2 hit (far) (~40 cycles) and L2 miss/memory hit (~300 cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores (across multiple programs/threads)
- Memory miss (page fault) (~tens of millions of cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores + OS multitasking (across multiple programs/threads)

Other ILP Approaches: Vectors
- A vector is a one-dimensional array of numbers
- Many multimedia/graphics/scientific programs operate on vectors:
    do i = 1, 64
      c[i] = a[i] + b[i]
- ONE vector instruction performs an operation on EACH element of the ENTIRE vector:
    addv c, a, b
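The semantics of a single vector instruction can be sketched in plain Python; the 64-element length and the operand names follow the slide's example, and the function name `addv` is just illustrative:

```python
# Sketch of what one 'addv c, a, b' vector instruction does: an elementwise
# add over an entire 64-element vector, replacing a 64-iteration scalar loop
# with its per-iteration branch and hazard checks.
def addv(a, b):
    assert len(a) == len(b), "vector operands must have the same length"
    return [x + y for x, y in zip(a, b)]

a = list(range(64))     # a[i] = i
b = [2.0] * 64
c = addv(a, b)          # the whole do i = 1, 64 loop in one operation
```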
Why Vectors?
- We want deeper pipelines, but:
  - Interlock logic is hard to divide into more stages (e.g., rename, issue, bypass logic)
  - Bubbles due to data hazards increase
  - It is hard to issue multiple instructions per cycle: the fetch-and-issue bottleneck (Flynn bottleneck)
- Vector instructions allow deeper pipelines
  - No intra-vector interlocks: no intra-vector data hazards
  - Inner-loop control hazards are eliminated
  - Need not issue multiple instructions to get multiple operations
  - Vectors can present the memory access pattern to the hardware
- Simple, super-fast pipeline - much faster than OoO (why?)
- Who converts high-level code to vector instructions? The compiler
  - Called automatic vectorization (non-trivial analyses)

Vector Architectures
- Vector-register machines
  - Load/store architectures: vector operations use vector registers, except ld/st
  - Register ports are cheaper than memory ports
  - Optimized for small vectors
- Memory-memory vector machines
  - All vectors reside in memory
  - Long startup latency; memory ports are expensive
  - Optimized for long vectors
- Fact: most vectors are short
- Early machines were memory-memory (TI ASC, CDC STAR-100); modern vector machines use vector registers
DLXV Architecture
- Strongly based on the CRAY-1
- Extends DLX (the baby pipeline) with vector instructions
- Eight vector registers (V0-V7), each holding 64 double-precision FP values (4K bytes total)
- Five vector functional units: FP add, FP multiply, FP divide, integer, and logical
  - Fully pipelined, with 2-20 stages
- Vector load/store units, fully pipelined

- Vector-vector instructions operate on two vectors and produce a third vector:
    do i = 1, 64
      v1[i] = v2[i] + v3[i]

    addv v1, v2, v3
  - The ENTIRE loop in one instruction: no branches, no hazards
- Vector-scalar instructions operate on one vector and one scalar:
    do i = 1, 64
      v1[i] = f0 + v3[i]

    addsv v1, f0, v3
DLXV Architecture: Vector Load/Store
- Vector load/store instructions move a vector between memory and a vector register
  - Operate on contiguous addresses:
    lv v1, r1         ; v1[i] = M[r1+i]
    sv r1, v1         ; M[r1+i] = v1[i]
- Load/store vector with stride: vectors are not always contiguous in memory
  - Add a non-unit stride on each access:
    lvws v1, (r1,r2)  ; v1[i] = M[r1+i*r2]
    svws (r1,r2), v1  ; M[r1+i*r2] = v1[i]
- Vector load/store indexed: indirect accesses through an index vector:
    lvi v1, (r1+v2)   ; v1[i] = M[r1+v2[i]]
    svi (r1+v2), v1   ; M[r1+v2[i]] = v1[i]

DLXV Example: DAXPY
- Double-precision a * x + y (daxpy):
    do i = 1, 64
      y[i] = a * x[i] + y[i]
- With VLR = 64:
    ld f0, a
    lv v1, rx
    multsv v2, f0, v1
    lv v3, ry
    addv v4, v2, v3
    sv ry, v4
- 6 DLXV instructions instead of roughly 600 DLX instructions
  - Remember: MIPS is a useless measure of performance!
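The computation performed by that six-instruction DAXPY sequence can be written out in Python; the function name `daxpy` follows the slide's terminology:

```python
# daxpy: y[i] = a * x[i] + y[i] -- the work done by the DLXV sequence
# ld / lv / multsv / lv / addv / sv in the slide, for 64-element vectors.
def daxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0] * 64
y = [2.0] * 64
result = daxpy(3.0, x, y)   # every element becomes 3.0*1.0 + 2.0 = 5.0
```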
Vector Length
- Not all vectors are 64 elements long
- The vector length register (VLR) controls the length of vector operations: 0 < VLR <= MVL = 64
- Example:
    do i = 1, 100
      x[i] = a * x[i]

    ld f0, a
    movi2s VLR, 36    ; first strip: 100 mod 64 = 36 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2
    add rx, rx, 36
    movi2s VLR, 64    ; remaining full strip of 64 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2

Strip Mining
- Use strip mining for a loop whose length n is not known to fit in one vector:
    do i = 1, n
      x[i] = a * x[i]
- Strip-mined version:
    low = 1
    VL = n mod MVL
    do j = 0, (n/MVL)
      do i = low, low+VL-1
        x[i] = a * x[i]
      low = low + VL
      VL = MVL
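The strip-mining pseudocode above can be made concrete in Python (MVL = 64 as in DLXV; the function name `strip_mined_scale` is illustrative). The first strip handles the odd-sized n mod MVL elements, and every later strip is a full MVL:

```python
MVL = 64  # maximum vector length of the machine

def strip_mined_scale(x, a):
    """x[i] = a * x[i] for arbitrary n, in strips of at most MVL elements."""
    n = len(x)
    low = 0
    vl = n % MVL or MVL        # VL = n mod MVL; a full strip if n divides evenly
    while low < n:
        # one vector operation of length vl (shown here as an explicit loop)
        for i in range(low, low + vl):
            x[i] = a * x[i]
        low += vl
        vl = MVL               # all remaining strips are full length
    return x
```

For n = 100 this runs one strip of 36 followed by one strip of 64, matching the VLR example above.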
Vector Masks
- Use a vector mask register for vectorizing if statements:
    do i = 1, 64
      if a[i] < 0.0 then a[i] = -a[i]
- Using the vector mask:
    lv v1, ra
    ld f0, 0.0
    sltsv f0, v1    ; set vector mask[i] to 1 if v1[i] < f0
    subv v1, 0, v1  ; executes only where the mask is set
    cvm             ; clear the vector mask
    sv ra, v1

Vector Chaining
- Use vector chaining (vector bypass) for RAW hazards:
    multv v1, --, --
    addv --, v1, --

Vector Scatter/Gather
- Use gather/scatter for sparse matrices:
    do i = 1, 64
      a[k[i]] = a[k[i]] + c[d[i]]

    lv v1, rd
    lvi v3, (rc+v1)  ; load c[d[i]]
    lv v1, rk
    lvi v2, (ra+v1)  ; load a[k[i]]
    addv v4, v3, v2
    svi (ra+v1), v4

Short Vectors
- Effect of short vectors: time for a vector operation = startup + n * initiation rate
- (figure in book: time per element vs. vector length)
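The mask and gather/scatter idioms above can be modeled in Python, with memory as a plain list; the helper names (`masked_negate`, `gather`, `scatter`) are illustrative, not DLXV mnemonics:

```python
def masked_negate(v):
    """Vector-mask idiom: sltsv builds the mask, subv executes under it."""
    mask = [x < 0.0 for x in v]                   # mask[i] = 1 if v[i] < 0.0
    return [-x if m else x for x, m in zip(v, mask)]

def gather(mem, base, index):
    """lvi v, (base+index): v[i] = mem[base + index[i]]"""
    return [mem[base + k] for k in index]

def scatter(mem, base, index, v):
    """svi (base+index), v: mem[base + index[i]] = v[i]"""
    for k, val in zip(index, v):
        mem[base + k] = val
```

With these, the sparse update a[k[i]] = a[k[i]] + c[d[i]] becomes: gather both operands through their index vectors, add elementwise, then scatter the result back.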
Vectors: Final Points
- What kind of memory hierarchy would you use for vectors?
- Compiler techniques
- Final word: make the scalar unit fast! Remember Amdahl's law
  - The CRAY-1 was the fastest scalar computer

Connection to Graphics/MMX
- MMX/graphics is called SIMD: single instruction, multiple data
- Vector and SIMD are the same thing: a vector instruction is a SIMD instruction
- Intel and Sun multimedia extensions and Nvidia graphics all use SIMD
- SIMD: 2 options
  - Option 1: full-blown vector units, like the Cray
  - Option 2: mini-vectors: pack eight 1-byte elements into one 64-bit word
    - 8-bit and 16-bit data are common in multimedia
    - Use the normal datapath but do 4 or 8 operations in one shot (MMX)

MMX: Basics
- Most multimedia apps work on short integers: 8-bit pixels, 16-bit audio
- Pack data into 64-bit words and operate on the packed data like short vectors
- Single instruction, multiple data (SIMD): around since the Livermore S-1 (20 years)

MMX Enhanced Instructions
- Also MOVs: move MMX datatypes to and from memory (loads followed by stores)
- Pack/Unpack: go back and forth between MMX and normal datatypes
  - Needed in multimedia computations
- Integrated into the x86 FP registers
- Can improve performance by 8x (in theory); benchmarks show less than 8x, but still very good
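Option 2 (mini-vectors) can be illustrated in Python: pack eight bytes into one 64-bit word, then add lane by lane with no carry crossing between lanes. The names `pack8` and `paddb` echo MMX's pack and packed-add operations but are simplified sketches, not exact instruction semantics:

```python
def pack8(lanes):
    """Pack eight 8-bit values into one 64-bit word, lane 0 in the low byte."""
    word = 0
    for i, b in enumerate(lanes):
        word |= (b & 0xFF) << (8 * i)
    return word

def paddb(x, y):
    """Packed 8-bit add: each lane wraps mod 256; carries never cross lanes."""
    out = 0
    for i in range(8):
        shift = 8 * i
        lane = ((x >> shift) + (y >> shift)) & 0xFF  # low byte of the lane sum
        out |= lane << shift
    return out
```

A hardware MMX unit does all eight lane additions in one cycle on the normal 64-bit datapath, simply by suppressing the carry at each byte boundary.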
MMX Constraints: Integrating into the Pentium
- Shares registers with the FP unit: ISA extensions, but perfect backward compatibility
- 100% OS compatible (no extra registers, flags, or exceptions)
- A bit in the CPUID instruction lets applications test for MMX and include MMX code
- Uses 64-bit datapaths; the pipeline is capable of 2 MMX instructions per cycle
- Cascades the memory and execution stages to avoid stalls

MMX's Relationship to Vectors
- Vector length: no VL; data must be a multiple of 64 total bits
- Memory load/store: stride-one only
- Arithmetic: integer only
- Conditionals: builds a byte mask, like a vector mask
- No trap problems: no trapping instructions
- Data movement: pack/unpack, analogous to vector scatter/gather but minimal (only pack/unpack)
Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average
More informationInstruction-Level Parallelism and Its Exploitation
Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic
More informationComputer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationLecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue
Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1 Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 5B: Data Level Parallelism Avinash Kodi, kodi@ohio.edu Thanks to Morgan Kauffman and Krtse Asanovic Agenda 2 Flynn s Classification Data Level Parallelism Vector
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationPage 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP
CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause
More informationSpring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Spring 2010 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic C/C++ program Compiler Assembly Code (binary) Processor 0010101010101011110 Memory MAR MDR INPUT Processing Unit OUTPUT ALU TEMP PC Control
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationMultithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others
Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationCSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics
CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics Computing system: performance, speedup, performance/cost Origins and benefits of scalar instruction pipelines and caches
More informationUNIT III DATA-LEVEL PARALLELISM IN VECTOR, SIMD, AND GPU ARCHITECTURES
UNIT III DATA-LEVEL PARALLELISM IN VECTOR, SIMD, AND GPU ARCHITECTURES Flynn s Taxonomy Single instruction stream, single data stream (SISD) Single instruction stream, multiple data streams (SIMD) o Vector
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationProblem M5.1: VLIW Programming
Problem M5.1: VLIW Programming Last updated: Ben Bitdiddle and Louis Reasoner have started a new company called Transbeta and are designing a new processor named Titanium. The Titanium processor is a single-issue
More information