ECE565 Lecture Notes, Chapter 3: Limits of Instruction Level Parallelism


Chapter 3: Limits of Instruction Level Parallelism

- Ultimately, how much instruction level parallelism is there?
- Consider the study by Wall (summarized in H&P)
- First, assume perfect/infinite hardware
- Then successively refine to more realistic hardware

ILP Limit: Perfect/Infinite Hardware

- Infinite rename registers: no WAW or WAR hazards
- Perfect branch prediction
- Perfect memory hazard analysis
- Unlimited issues per cycle
- Looking an unlimited distance into the instruction stream: infinite issue "window"
- Single-cycle execution

ILP Limit: see figure in book

- Larger window => more ILP
- For finite window sizes, integer programs have less ILP than floating point

Narrow Window Size

- n^2 complexity of dependence-checking logic for window size n, assuming all instructions in the window are simultaneously being considered for issue
- See figure in book
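To see where the n^2 comes from, here is a minimal C sketch (the Instr struct and function names are illustrative, not from the lecture): every later instruction in the window must be compared against every earlier one, giving n*(n-1)/2 comparisons.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified instruction: one destination and two source registers. */
typedef struct {
    int dest;
    int src1, src2;
} Instr;

/* True if the later instruction has a RAW dependence on the earlier one. */
static bool raw_dep(const Instr *earlier, const Instr *later) {
    return later->src1 == earlier->dest || later->src2 == earlier->dest;
}

/* O(n^2) pairwise check over the whole issue window: for window size n,
 * n*(n-1)/2 comparisons are needed if all n instructions are considered
 * for issue simultaneously. */
size_t count_raw_deps(const Instr *win, size_t n) {
    size_t deps = 0;
    for (size_t j = 1; j < n; j++)
        for (size_t i = 0; i < j; i++)
            if (raw_dep(&win[i], &win[j]))
                deps++;
    return deps;
}
```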

Realistic Branch/Jump Prediction

- Assume a 2K-entry window and 64 simultaneous issues
- Prediction choices:
  - Perfect
  - Selective history predictor: correlating and non-correlating with selection; 97% accurate on the benchmarks
  - Standard 512x2 predictor; 16-entry return address buffer
  - Static predictor, based on profile
  - No branch prediction; jumps predicted
- See figure in book: going from no branch prediction to the selective predictor is a big improvement

Finite Rename Registers

- Limit the number of registers available for renaming
- Assume a 2K-entry window; 64 simultaneous issues
- Assume a 2-level, 8K-entry branch predictor and a 2K jump/return predictor
- See figure in book: going from no renaming to about 128 renaming tags is a big jump
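The slides above assume history-based predictors without giving their internals. As a rough, hedged illustration of the 2-level idea, here is a minimal gshare-style sketch in C; the 8K table size matches the assumption above, but the hashing and structure are illustrative, not the study's exact configurations.

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal gshare-style 2-level predictor (illustrative only). */
#define PHT_BITS 13                 /* 8K-entry pattern history table */
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];      /* 2-bit saturating counters */
static uint32_t ghr;                /* global branch history register */

bool predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghr) & (PHT_SIZE - 1);
    return pht[idx] >= 2;           /* predict taken if counter is 2 or 3 */
}

void update(uint32_t pc, bool taken) {
    uint32_t idx = (pc ^ ghr) & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = (ghr << 1) | (taken ? 1u : 0u);   /* shift outcome into history */
}
```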

Imperfect Alias Analysis

- Assume a 2K-entry window; 64 simultaneous issues
- Assume a 2-level, 8K-entry branch predictor and a 2K jump/return predictor
- Assume 256 rename registers
- Choices (see figure in book):
  - Global/stack perfect; all heap references conflict
  - Inspection of object code
  - None; all references may conflict
- Huge difference between inspection and global/stack perfect:
  - inspection is close to what compilers can do
  - global/stack perfect is what we could do if we had better compilers

Realizable Machine

- Consider even more restrictive hardware (see figure in book):
  - Selective predictor with 1K entries; 16-entry return predictor
  - Perfect disambiguation of memory references, done dynamically within the window
  - Register renaming with 64 additional registers
  - Variable window size
- Integer programs have less ILP than floating point:
  - integer programs level off at around 10-15 ILP
  - floating-point programs keep going up with larger windows
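A small, hedged C example of why alias analysis limits ILP: unless the compiler or hardware can prove that p and q never point to the same location, the load below cannot be moved above the store, so the two cannot issue in parallel.

```c
/* If p and q may alias, the load of *q has a potential RAW dependence
 * through memory on the store to *p and cannot be reordered above it.
 * "Inspection"-level analysis can often prove independence for stack
 * and global references, but must assume heap references conflict. */
void scale(double *p, const double *q) {
    *p = 2.0;          /* store */
    double y = *q;     /* load: depends on the store iff p == q */
    *p = y + 1.0;
}
```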

ILP Limit: Discussion

- Single-cycle execution assumption?
- Window assumptions?
- Memory alias assumption?
- What is the performance bottom line?

Simultaneous Multithreading (SMT)

- We build a 6-wide superscalar for high performance for one program, but real programs commit only 1-2 instructions per cycle:
  - stalls due to RAW hazards and memory latencies
  - wasted issue slots due to mispredictions
- So the pipeline goes underused; this does not mean the pipe was unnecessarily wide:
  - if 4-wide gives 1 instruction per cycle, 1-wide would give only 0.25 instructions per cycle!
  - parallelism is uneven and bursty (chapter 1 slides)

SMT Idea

- Run multiple programs at the same time through the pipe
  - e.g., fetch and execute from 2 programs (or threads in a parallel program)
- No dependences between threads/programs => pipeline utilized better
- SMT does NOT improve single-thread performance
- SMT improves job throughput and CPU utilization
  - better throughput: good if you have more than one program
  - better utilization: good if you are a data-center manager
- Intel calls it Hyperthreading
- Pipeline has in-flight instructions from multiple programs in ANY stage; instructions from more than one program could be processed in the SAME cycle
- One hardware context per program (e.g., 4 contexts for a max of 4 programs)
- Some hardware is replicated for each context; the rest is shared by all contexts
  - large hardware should be shared; small hardware can be replicated
  - what hardware should be separate? stages: F D Rename Issue RegRd EX Mem WB
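As a toy illustration of "instructions from more than one program in the same cycle", here is a minimal round-robin fetch-selection sketch in C. The context count, fields, and round-robin policy are assumptions for illustration; real SMT fetch policies (e.g., ICOUNT) are more sophisticated.

```c
#include <stdbool.h>

#define NUM_CONTEXTS 4   /* hardware contexts (illustrative) */
#define FETCH_WIDTH  6   /* fetch slots per cycle (illustrative) */

typedef struct {
    unsigned pc;         /* per-context PC: replicated hardware */
    bool     stalled;    /* e.g., waiting on an icache miss */
} Context;

/* Each cycle, fill the shared fetch slots round-robin from ready
 * contexts, so one thread's stall does not idle the whole front end. */
int select_fetch(Context ctx[NUM_CONTEXTS], int start,
                 unsigned slots[FETCH_WIDTH], int owner[FETCH_WIDTH]) {
    int filled = 0;
    for (int pass = 0; pass < FETCH_WIDTH && filled < FETCH_WIDTH; pass++) {
        bool any = false;
        for (int i = 0; i < NUM_CONTEXTS && filled < FETCH_WIDTH; i++) {
            int t = (start + i) % NUM_CONTEXTS;
            if (ctx[t].stalled)
                continue;
            slots[filled] = ctx[t].pc;    /* fetch from this thread's PC */
            owner[filled] = t;            /* slot belongs to thread t */
            ctx[t].pc += 4;               /* next sequential instruction */
            filled++;
            any = true;
        }
        if (!any)
            break;                        /* every context is stalled */
    }
    return filled;
}
```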

SMT: Replicated vs. Shared Hardware

- For correctness, the following are replicated for each context: rename map, ROB -- why?
- For performance, the following are replicated for each context: branch history buffer, return address stack, load/store queue
- Shared hardware, shared among all contexts: fetch, decode, issue, physical registers, EX units, memory
  - with larger physical register files, caches, and TLBs
- Fortunately, the replicated hardware is small and the large hardware is shared

SMT and the OS

- The OS thinks of each context as a virtual CPU
  - hardware people say "contexts"; OS people say "virtual CPUs"
- The OS assigns as many programs to one real CPU as possible (up to the max contexts)
- SMT allows multiple programs to share MOST of the pipeline
  - improves CPU utilization and increases job throughput
- If you increase the replicated hardware all the way, you end up with multicores
  - in multicores, each core can itself be SMT-capable

Latency Hierarchy

- L1 hit (~2 cycles) and L1 miss/L2 hit (near, ~12 cycles): overlapped by out-of-order issue (within 1 thread)
- L1 miss/L2 hit (far, ~40 cycles) and L2 miss/memory hit (~300 cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores (across multiple programs/threads)
- Memory miss (page fault, ~tens of millions of cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores (across multiple programs/threads) + OS multitasking (across multiple programs/threads)

Other ILP Approaches: Vectors

- A vector is a one-dimensional array of numbers
- Many multimedia/graphics/scientific programs operate on vectors:
    do i = 1, 64
      c[i] = a[i] + b[i]
- ONE vector instruction performs an operation on EACH element of the ENTIRE vector:
    addv c, a, b
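A minimal C sketch of what that one vector instruction expresses (MVL = 64 matches DLXV's vector length; the C function name is just illustrative):

```c
#define MVL 64   /* maximum vector length, as in DLXV */

/* Semantics of a single vector-vector instruction, addv c, a, b:
 * one instruction applies the operation to every element, with no
 * branches and no intra-vector hazards. */
void addv(double c[MVL], const double a[MVL], const double b[MVL]) {
    for (int i = 0; i < MVL; i++)
        c[i] = a[i] + b[i];
}
```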

Why Vectors?

- We want deeper pipelines, but:
  - interlock logic is hard to divide into more stages (e.g., rename, issue, bypass logic)
  - bubbles due to data hazards increase
  - it is hard to issue multiple instructions per cycle: the fetch & issue bottleneck (Flynn bottleneck)
- Vector instructions allow deeper pipelines:
  - no intra-vector interlocks, no intra-vector data hazards
  - inner-loop control hazards eliminated
  - need not issue multiple instructions to get multiple operations
  - vectors can present the memory access pattern to hardware
- Simple, super-fast pipeline: much faster than OoO (why?)
- Who converts high-level code to vector instructions? The compiler; this is called automatic vectorization (non-trivial analyses)

Vector Architectures

- Vector-register machines
  - load/store architectures: vector operations use vector registers, except ld/st
  - register ports are cheaper than memory ports
  - optimized for short vectors
- Memory-memory vector machines
  - all vectors reside in memory
  - long startup latency; memory ports are expensive
  - optimized for long vectors
- Fact: most vectors are short
- Early machines were memory-memory: TI ASC, CDC STAR-100; modern vector machines use vector registers

DLXV Architecture

- Strongly based on the CRAY-1
- Extends DLX (the "baby" pipeline) with vector instructions
- Eight vector registers (V0-V7), each holding 64 double-precision FP values (4K bytes total)
- Five vector functional units: FP+, FP*, FP/, integer, and logical
  - fully pipelined, with 2-20 stages
- Vector load/store units, fully pipelined with 10-50 stages

DLXV: Vector-Vector and Vector-Scalar Instructions

- Vector-vector instructions operate on two vectors and produce a third vector:
    do i = 1, 64
      v1[i] = v2[i] + v3[i]

    addv v1, v2, v3
  - the ENTIRE loop in one instruction: no branches, no hazards
- Vector-scalar instructions operate on one vector and one scalar:
    do i = 1, 64
      v1[i] = f0 + v3[i]

    addsv v1, f0, v3

DLXV: Vector Load/Store

- Vector load/store instructions move a vector between memory and a vector register, operating on contiguous addresses:
    lv v1, r1          ; v1[i] = M[r1+i]
    sv r1, v1          ; M[r1+i] = v1[i]
- Load/store vector with stride: vectors are not always contiguous in memory; add a non-unit stride on each access:
    lvws v1, (r1,r2)   ; v1[i] = M[r1+i*r2]
    svws (r1,r2), v1   ; M[r1+i*r2] = v1[i]
- Vector load/store indexed: indirect accesses through an index vector:
    lvi v1, (r1+v2)    ; v1[i] = M[r1+v2[i]]
    svi (r1+v2), v1    ; M[r1+v2[i]] = v1[i]

DLXV Example: DAXPY

- Double-precision a*x + y (daxpy), assuming VLR = 64:
    do i = 1, 64
      y[i] = a * x[i] + y[i]

    ld f0, a           ; load scalar a
    lv v1, rx          ; load vector x
    multsv v2, f0, v1  ; vector-scalar multiply
    lv v3, ry          ; load vector y
    addv v4, v2, v3    ; vector add
    sv ry, v4          ; store result
- 6 DLXV instructions instead of roughly 600 DLX instructions
- Remember: MIPS is a useless measure of performance!
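For reference, here is the same DAXPY loop as minimal scalar C, i.e., what the roughly 600 DLX instructions implement (the element count 64 matches the DLXV example):

```c
/* Scalar DAXPY: y = a*x + y over 64 elements. Each iteration costs
 * several DLX instructions (two loads, multiply, add, store, index
 * update, branch), versus 6 DLXV instructions for the whole loop. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}
```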

Vector Length

- Not all vectors are 64 elements long
- The vector length register (VLR) controls the length of vector operations: 0 < VLR <= MVL = 64
- Example with n = 100: first a strip of 100 mod 64 = 36 elements, then a full strip of 64:
    do i = 1, 100
      x[i] = a * x[i]

    ld f0, a
    movi2s VLR, 36     ; first strip: 36 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2
    add rx, rx, 288    ; advance past 36 doubles (36 * 8 bytes)
    movi2s VLR, 64     ; second strip: full 64 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2

Strip Mining

- Use strip mining when the trip count n is not known to fit in one vector operation:
    do i = 1, n
      x[i] = a * x[i]

    low = 1
    VL = n mod MVL            ; odd-size first piece
    do j = 0, (n / MVL)
      do i = low, low+VL-1    ; one strip of length VL
        x[i] = a * x[i]
      low = low + VL          ; start of next strip
      VL = MVL                ; all later strips are full length
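The same strip-mining idea as a minimal C sketch (MVL = 64 as in DLXV; the inner loop stands in for one vector instruction executed with VLR = vl):

```c
#define MVL 64   /* maximum vector length, as in DLXV */

/* Strip-mined x[i] = a * x[i] for arbitrary n: an odd-size first strip
 * of (n % MVL) elements, then full MVL-element strips. The inner loop
 * models a single vector instruction executed with VLR = vl. */
void scale_strip_mined(double *x, int n, double a) {
    int low = 0;
    int vl  = n % MVL;                       /* odd-size first piece */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + vl; i++) /* one "vector op" */
            x[i] = a * x[i];
        low += vl;                           /* start of next strip */
        vl = MVL;                            /* later strips are full */
    }
}
```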

Vector Masks

- Use a vector mask register for vectorizing if statements:
    do i = 1, 64
      if (a[i] < 0.0) then a[i] = -a[i]

    lv v1, ra
    ld f0, 0.0
    sltsv f0, v1       ; set vector mask[i] to 1 if v1[i] < f0
    subsv v1, f0, v1   ; under mask: v1[i] = f0 - v1[i] = -a[i]
    cvm                ; clear the vector mask (re-enable all elements)
    sv ra, v1

Vector Chaining

- Use vector chaining (vector bypass) for RAW dependences between vector instructions:
    multv v1, --, --
    addv  --, v1, --   ; chained: consumes v1 elements as they are produced

Vector Scatter/Gather

- Use gather/scatter for sparse matrices:
    do i = 1, 64
      a[k[i]] = a[k[i]] + c[d[i]]

    lv  v1, rd
    lvi v3, (rc+v1)    ; gather: load c[d[i]]
    lv  v1, rk
    lvi v2, (ra+v1)    ; gather: load a[k[i]]
    addv v4, v3, v2
    svi (ra+v1), v4    ; scatter: store a[k[i]]

Short Vectors

- Effect of short vectors: time for a vector of length n = startup + n * initiation rate
- (Figure in notes plots time per element against vector length)
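The scatter/gather example above as a minimal C sketch (array length 64 and the index vectors k and d match the loop above):

```c
/* Gather/scatter semantics for the sparse update
 * a[k[i]] = a[k[i]] + c[d[i]], i = 0..63. The indexed loads (lvi)
 * are gathers; the indexed store (svi) is a scatter. */
void sparse_update(double *a, const double *c,
                   const int k[64], const int d[64]) {
    for (int i = 0; i < 64; i++) {
        double cv = c[d[i]];   /* gather c[d[i]] */
        double av = a[k[i]];   /* gather a[k[i]] */
        a[k[i]] = av + cv;     /* scatter the result back */
    }
}
```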

Vectors: Final Words

- What kind of memory hierarchy would you use for vectors?
- Compiler techniques matter
- Final word: make the scalar unit fast! Remember Amdahl's law
  - the CRAY-1 was also the fastest scalar computer of its day

Connection to Graphics/MMX

- MMX/graphics-style execution is called SIMD: single instruction, multiple data
- Vector and SIMD are the same thing: a vector instruction is a SIMD instruction
- Intel and Sun multimedia extensions and Nvidia graphics all use SIMD
- SIMD: two options
  - option 1: full-blown vector units, like Cray
  - option 2: mini-vectors: pack eight 1-byte elements into one 64-bit word
    - 8-bit and 16-bit data are common in multimedia
    - use the normal datapath but do 4 or 8 operations in one shot (MMX)

MMX: Basics

- Most multimedia apps work on short integers: 8-bit pixels, 16-bit audio
- Pack data into 64-bit words; operate on the packed data like short vectors
- Single instruction, multiple data (SIMD); the idea had been around since the Livermore S-1, some 20 years before MMX

MMX: Enhanced Instructions

- Also MOVs: move MMX datatypes to and from memory (loads followed by stores)
- Pack/Unpack: go back and forth between MMX and normal datatypes; needed in multimedia computations
- Integrated into the x86 FP registers
- Can improve performance by 8x (in theory); benchmarks show less than 8x, but still very good
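A minimal sketch of the "mini-vector" idea: eight unsigned 8-bit saturating adds carried out on data packed into one 64-bit word. This is plain C standing in for an MMX packed add such as PADDUSB; the function name is illustrative.

```c
#include <stdint.h>

/* "Mini-vector" SIMD sketch: eight unsigned 8-bit saturating adds on
 * operands packed into 64-bit words; MMX's PADDUSB performs this in
 * one instruction, here unpacked and repacked explicitly in C. */
uint64_t padd_usb(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned ea = (unsigned)((a >> (8 * i)) & 0xFF);  /* byte i of a */
        unsigned eb = (unsigned)((b >> (8 * i)) & 0xFF);  /* byte i of b */
        unsigned sum = ea + eb;
        if (sum > 0xFF) sum = 0xFF;             /* saturate, don't wrap */
        result |= (uint64_t)sum << (8 * i);     /* repack byte i */
    }
    return result;
}
```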

MMX Constraints: Integrating into Pentium

- Shares registers with FP: an ISA extension, but with perfect backward compatibility
  - 100% OS compatible (no extra registers, flags, or exceptions)
  - a bit in the CPUID instruction, so applications can test for MMX and include alternate code
- Uses the 64-bit datapaths
- Pipeline capable of 2 MMX instructions per cycle
- Cascades the memory and execution stages to avoid stalls

Relationship to Vectors

- Vector length: no VL register; data must be a multiple of 64 total bits
- Memory load/store: stride-one only
- Arithmetic: integer only
- Conditionals: builds a byte mask, like a vector mask
- No trap problems: no trapping instructions
- Data movement: pack/unpack; like vector scatter/gather, but minimal (only pack/unpack)