Chapter 3: Limits of Instruction-Level Parallelism

Ultimately, how much instruction-level parallelism is there?
- Consider the study by Wall (summarized in H&P)
- First, assume perfect/infinite hardware
- Then successively refine to more realistic hardware

ILP Limit: Perfect/Infinite Hardware
- Infinite rename registers: no WAW or WAR hazards
- Perfect branch prediction
- Perfect memory hazard analysis
- Unlimited issues per cycle
- Looking an unlimited distance into the instruction stream: an infinite issue "window"
- Single-cycle execution

ECE565 Lecture Notes: Chapters 2 & 3 160

ILP Limit: see figure in book
- Larger window => more ILP
- For finite window sizes, integer programs have less ILP than floating point

Narrow Window Size
- n^2 complexity of dependence-checking logic for window size n, assuming all instructions in the window are simultaneously being considered for issue (see fig. in book)
Realistic Branch/Jump Prediction
- 2K-entry window, 64 simultaneous issues; prediction choices (see fig. in book):
  - Perfect
  - Selective history predictor: correlating and non-correlating with selection; 97% accurate on the benchmarks
  - Standard 512x2 predictor; 16-entry return-address buffer
  - Static predictor, based on profile
  - No branch prediction; jumps predicted
- Going from no branch prediction to selective prediction is a big improvement

Finite Rename Registers
- Limit the number of registers available for renaming
- Assume 2K-entry window; 64 simultaneous issues
- Assume 2-level, 8K-entry branch predictor; 2K jump/return predictor
- See fig. in book: going from no renaming to about 128 renaming tags is a big jump
Imperfect Alias Analysis
- Choices (see fig. in book):
  - Global/stack perfect; all heap references conflict
  - Inspection of object code
  - None; all references may conflict
- Assume 2K-entry window; 64 simultaneous issues
- Assume 2-level, 8K-entry branch predictor; 2K jump/return predictor
- Assume 256 rename registers
- Huge difference between inspection and global/stack perfect
  - inspection is close to what compilers can do
  - global/stack perfect is what we could do if we had better compilers

Realizable Machine
- Consider even more restrictive hardware (see fig. in book)
- Selective predictor with 1K entries; 16-entry return predictor
- Perfect disambiguation of memory references, dynamic within the window
- Register renaming with 64 additional registers
- Variable window size
- Integer programs have less ILP than floating point
  - integer programs level off at around 10-15 ILP
  - floating-point programs keep going up with larger windows
ILP Limit: Discussion
- Single-cycle execution assumption?
- Window assumptions?
- Memory alias assumption?
- What is the performance bottom line?

Simultaneous Multithreading (SMT)
- We build 6-wide superscalars for high performance on 1 program
- But real programs commit only 1-2 instrs per cycle
  - stalls due to RAW hazards and memory latencies
  - wasted issue slots due to mispredictions
- So the pipeline goes underused, but that does not mean the pipe was unnecessarily wide
  - if a 4-wide pipe gives 1 instr per cycle, a 1-wide pipe would give only 0.25 instrs per cycle!
  - parallelism is uneven and bursty (chapter 1 slides)

SMT Idea: run multiple programs at the same time through the pipe
- E.g., fetch and execute from 2 programs (or threads in a parallel program)
- No dependences between threads/programs => pipeline utilized better
- SMT does NOT improve single-thread performance
- SMT improves job throughput and CPU utilization
  - better throughput is good if you have more than 1 program
  - better utilization is good if you are a data-center manager
- Intel calls it Hyperthreading
- The pipeline has in-flight instrs from multiple programs; in ANY stage, instrs from more than 1 program could be processed in the SAME cycle
- 1 hardware context per program (e.g., 4 contexts for a max of 4 programs)
- Some h/w is replicated for each context, the rest shared by all contexts
  - large h/w should be shared; small h/w can be replicated
  - what h/w should be separate? stages: F D Rename Issue Regread EX Mem WB
SMT
- For correctness, the following are replicated for each context: rename map, ROB (why?)
- For performance, the following are replicated for each context: branch history buffer, return-address stack, load/store queue
- Shared h/w, shared among all contexts: fetch, decode, issue, physical registers, EX units, memory
  - larger physical regfiles, caches, TLBs
- Fortunately, the replicated h/w is small and the large h/w is shared
- The OS thinks of each context as a virtual CPU
  - h/w people say "contexts"; OS people say "virtual CPUs"
  - the OS assigns as many programs to one real CPU as possible (up to max contexts)
- SMT allows multiple programs to share MOST of the pipeline
  - improves CPU utilization and increases job throughput
- If you replicate the h/w all the way, you end up with multicores; in multicores, each core can itself be SMT-capable

Latency Hierarchy
- L1 hit (~2 cycles) and L1 miss/L2 hit (near, ~12 cycles): overlapped by out-of-order issue (within 1 thread)
- L1 miss/L2 hit (far, ~40 cycles) and L2 miss/memory hit (~300 cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores (across multiple programs/threads)
- Memory miss (page fault, ~tens of millions of cycles): overlapped by OoO issue (within 1 thread) + SMT and multicores (across multiple programs/threads) + OS multitasking (across multiple programs/threads)

Other ILP Approaches: Vectors
- A vector is a one-dimensional array of numbers
- Many multimedia/graphics/scientific programs operate on vectors:
    do i = 1, 64
      c[i] = a[i] + b[i]
- ONE vector instruction performs an operation on EACH element of the ENTIRE vector:
    addv c, a, b
Why Vectors?
- We want deeper pipelines, but interlock logic is hard to divide into more stages (e.g., rename, issue, bypass logic)
- Bubbles due to data hazards increase
- Hard to issue multiple instructions per cycle: the fetch-and-issue bottleneck (Flynn bottleneck)
- Vector instructions allow deeper pipelines
  - no intra-vector data hazards, so no intra-vector interlocks
  - inner-loop control hazards eliminated
  - need not issue multiple instrs to get multiple operations
  - vectors can present the memory access pattern to h/w
- Simple, super-fast pipeline: much faster than OoO (why?)
- Who converts high-level code to vector instructions? The compiler; this is called automatic vectorization (non-trivial analyses)

Vector Architectures
- Vector-register machines
  - load/store architectures: vector operations use vector registers, except ld/st
  - register ports are cheaper than memory ports
  - optimized for short vectors
- Memory-memory vector machines
  - all vectors reside in memory
  - long startup latency; memory ports expensive
  - optimized for long vectors
- Fact: most vectors are short
- Early machines were memory-memory (TI ASC, CDC STAR-100); modern vector machines use vector registers
DLXV Architecture
- Strongly based on the CRAY-1
- Extend DLX (baby pipeline) with vector instructions
- Eight vector registers (V0-V7), each holding 64 double-precision FP elements (4K bytes total)
- Five vector functional units: FP+, FP*, FP/, integer, and logical
  - fully pipelined with 2-20 stages
- Vector load/store units, fully pipelined with 10-50 stages

DLXV Architecture
- Vector-vector instructions: operate on two vectors, produce a third vector
    do i = 1, 64
      v1[i] = v2[i] + v3[i]
    addv v1, v2, v3
  - the ENTIRE loop in one instr: no branches, no hazards
- Vector-scalar instructions: operate on one vector and one scalar
    do i = 1, 64
      v1[i] = f0 + v3[i]
    addsv v1, f0, v3
DLXV Architecture
- Vector load/store instructions: load/store a vector between memory and a vector register, over contiguous addresses
    lv v1, r1          ; v1[i] = M[r1+i]
    sv r1, v1          ; M[r1+i] = v1[i]
- Load/store vector with stride: vectors are not always contiguous in memory, so add a non-unit stride on each access
    lvws v1, (r1,r2)   ; v1[i] = M[r1+i*r2]
    svws (r1,r2), v1   ; M[r1+i*r2] = v1[i]
- Vector load/store indexed: indirect accesses through an index vector
    lvi v1, (r1+v2)    ; v1[i] = M[r1+v2[i]]
    svi (r1+v2), v1    ; M[r1+v2[i]] = v1[i]

DLXV Architecture
- Double-precision a*x + y (daxpy):
    do i = 1, 64
      y[i] = a * x[i] + y[i]
- With VLR = 64:
    ld f0, a
    lv v1, rx
    multsv v2, f0, v1
    lv v3, ry
    addv v4, v2, v3
    sv ry, v4
- 6 DLXV instructions instead of ~600 DLX instructions
- Remember: MIPS is a useless measure of performance!
Vector Length
- Not all vectors are 64 elements long
- The vector-length register (VLR) controls the length of vector operations: 0 < VLR <= MVL = 64
- Example:
    do i = 1, 100
      x[i] = a * x[i]

    ld f0, a
    movi2s VLR, 36     ; first strip: 100 mod 64 = 36 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2
    add rx, rx, 288    ; advance 36 doubles (36 * 8 bytes)
    movi2s VLR, 64     ; second strip: the remaining 64 elements
    lv v1, rx
    multsv v2, f0, v1
    sv rx, v2

Strip Mining
- Use strip mining for loops of arbitrary length:
    do i = 1, n
      x[i] = a * x[i]
- becomes:
    low = 1
    VL = n mod MVL           ; odd-sized first strip
    do j = 0, n/MVL          ; one outer iteration per strip
      do i = low, low+VL-1   ; inner loop runs at most MVL elements
        x[i] = a * x[i]
      low = low + VL
      VL = MVL               ; all remaining strips are full-length
Vector Masks
- Use the vector-mask register for vectorizing if statements:
    do i = 1, 64
      if a[i] < 0.0 then a[i] = -a[i]

    lv v1, ra
    ld f0, 0.0
    sltsv f0, v1      ; set vector mask[i] to 1 if v1[i] < f0
    subsv v1, f0, v1  ; under the mask: v1[i] = 0.0 - v1[i]
    cvm               ; clear vector mask
    sv ra, v1

Vector Chaining
- Use vector chaining (vector bypass) for RAWs:
    multv v1, --, --
    addv --, v1, --   ; chained: consumes elements of v1 as they are produced

Vector Scatter/Gather
- Use gather/scatter for sparse matrices:
    do i = 1, 64
      a[k[i]] = a[k[i]] + c[d[i]]

    lv v1, rd
    lvi v3, (rc+v1)   ; gather: load c[d[i]]
    lv v1, rk
    lvi v2, (ra+v1)   ; gather: load a[k[i]]
    addv v4, v3, v2
    svi (ra+v1), v4   ; scatter: store a[k[i]]

Short Vectors
- Effect of short vectors: time for vector = startup + n * initiation rate, where the initiation rate is the time per element and n is the vector length
Vectors
- What kind of memory hierarchy would you use for vectors?
- Compiler techniques matter
- Final word: make the scalar unit fast! Remember Amdahl's law: the CRAY-1 was also the fastest scalar computer

Connection to Graphics/MMX
- MMX/graphics extensions are called SIMD: single instruction, multiple data
- Vector and SIMD are the same thing: a vector instr is a SIMD instr
- Intel and Sun multimedia extensions and Nvidia graphics all use SIMD
- SIMD: 2 options
  - option 1: full-blown vector units like Cray's
  - option 2: mini-vectors: pack eight 1-byte elements into one 64-bit word
    - 8-bit and 16-bit data are common in multimedia
    - use the normal datapath but do 4 or 8 ops in one shot (MMX)

MMX: Basics
- Most multimedia apps work on short integers: 8-bit pixels, 16-bit audio
- Pack data into 64-bit words; operate on packed data like short vectors
- Single instruction, multiple data (SIMD); the idea has been around since the Livermore S-1 (20 years)

MMX Enhanced Instructions
- Also MOVs: move MMX datatypes to and from memory (loads followed by stores)
- Pack/Unpack: go back and forth between MMX and normal datatypes; needed in multimedia computations
- Integrated into the x86 FP registers
- Can improve performance by 8x (in theory); benchmarks show less than 8x but still very good
MMX Constraints: Integrating into the Pentium
- Share registers with the FP stack: ISA extensions, but perfect backward compatibility
- 100% OS compatible (no extra registers, flags, or exceptions)
- A bit in the CPUID instruction lets applications test for MMX and include MMX code
- Use 64-bit datapaths; pipeline capable of 2 MMX IPC
- Cascade memory and execution stages to avoid stalls

Relationship to Vectors
- Vector length: no VL register; length must be a multiple of 64 total bits
- Memory load/store: stride-one only
- Arithmetic: integer only
- Conditionals: builds a byte mask, like a vector mask
- No trap problems: no trapping instructions
- Data movement: only pack/unpack, a minimal version of vector scatter/gather