COMP 506 Rice University Spring 2017
Loops and Locality, with an introduction to the memory hierarchy
source code → Front End → IR → Optimizer → IR → Back End → target code
Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 506 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Most of this material is not in EaC2e.
Optimization (from Lecture 14)
Compilers operate at multiple granularities, or scopes:
Local techniques: work on a single basic block, a maximal-length sequence of straight-line code.
Regional techniques: consider multiple blocks, but less than a whole procedure; a single loop, a loop nest, a dominator region, ...
Intraprocedural, or global, techniques: operate on an entire procedure (but just one), the common unit of compilation.
Interprocedural, or whole-program, techniques: operate on more than one procedure, up to the whole program; the logistical issues relate to accessing the code (optimize in the linker?).
COMP 506, Spring 2017
The Opportunities: Loop Optimization
Compilers have always focused on loops:
They have higher execution counts than code outside loops.
They have repeated operations and related operations.
Much of the real work of computing takes place inside loops.
There are several effects to attack:
Loop overhead: decrease the control-structure cost for each iteration.
Locality: spatial locality is use of co-resident data; temporal locality is reuse of the same data at different times.
Parallelism: move loops with independent operations to the inner or outer position.¹
¹ The innermost loop makes sense for vector machines; the outermost loop makes sense for multiprocessors. See COMP 515.
Eliminating Overhead: Loop Unrolling (the oldest trick in the book)
To reduce overhead, replicate the body. The overhead is the increment, test, and branch.
do i = 1 to 100 by 1
  a(i) = a(i) + b(i)
becomes (unroll by 4)
do i = 1 to 100 by 4
  a(i) = a(i) + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
Sources of improvement:
Less overhead per useful operation.
Longer basic blocks for local optimization.
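The transformation above can be sketched in Python (the slides use Fortran-style, 1-based pseudocode; the lists here are 0-based, so the bounds shift accordingly, and the function names are mine):

```python
def add_rolled(a, b):
    # Original loop: one increment, test, and branch per element.
    for i in range(100):
        a[i] = a[i] + b[i]

def add_unrolled4(a, b):
    # Unrolled by 4: one increment, test, and branch per four elements,
    # and a longer straight-line body for local optimization.
    for i in range(0, 100, 4):
        a[i]     = a[i]     + b[i]
        a[i + 1] = a[i + 1] + b[i + 1]
        a[i + 2] = a[i + 2] + b[i + 2]
        a[i + 3] = a[i + 3] + b[i + 3]
```

Because 4 divides the trip count of 100 evenly, no cleanup loop is needed in this case.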
Eliminating Overhead: Loop Unrolling With Unknown Bounds
Generate an extra loop to handle the iterations left over when the trip count is not a multiple of the unroll factor.
do i = 1 to n by 1
  a(i) = a(i) + b(i)
becomes (unroll by 4)
i = 1
while (i+3 ≤ n) do
  a(i) = a(i) + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
  i = i + 4
while (i ≤ n) do
  a(i) = a(i) + b(i)
  i = i + 1
The while loop needs an explicit update for the variable i. You will find code like this in the BLAS and in BitBlt.
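A 0-based Python sketch of the same guarded unrolling (names and toy sizes are mine; the main loop handles groups of four and the cleanup loop handles the leftovers):

```python
def add_unrolled4(a, b, n):
    # Main loop: valid while i + 3 is still a legal index (0-based,
    # so the guard is i + 3 < n rather than the slide's 1-based i+3 <= n).
    i = 0
    while i + 3 < n:
        a[i]     = a[i]     + b[i]
        a[i + 1] = a[i + 1] + b[i + 1]
        a[i + 2] = a[i + 2] + b[i + 2]
        a[i + 3] = a[i + 3] + b[i + 3]
        i = i + 4
    # Cleanup loop: the 0-3 leftover iterations.
    while i < n:
        a[i] = a[i] + b[i]
        i = i + 1
```

Note the explicit update of i, exactly as the slide describes for the while-loop form.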
Eliminating Overhead: One Other Use For Unrolling
Eliminate copies at the end of a loop.
t1 = b(0)
do i = 1 to 100
  t2 = b(i)
  a(i) = a(i) + t1 + t2
  t1 = t2
becomes (unroll by 2 and rename)
t1 = b(0)
do i = 1 to 100 by 2
  t2 = b(i)
  a(i) = a(i) + t1 + t2
  t1 = b(i+1)
  a(i+1) = a(i+1) + t2 + t1
More complex cases:
Multiple cycles of cross-iteration copies: use the LCM of the cycle lengths as the unroll factor.
The result has been rediscovered many times. [214]
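The unroll-and-rename trick can be checked directly in Python (0-based lists over indices 0..100; function names are mine). The renamed version produces the same values while dropping the cross-iteration copy t1 = t2:

```python
def smooth(a, b):
    # Original: the copy t1 = t2 carries b(i) into the next iteration.
    t1 = b[0]
    for i in range(1, 101):
        t2 = b[i]
        a[i] = a[i] + t1 + t2
        t1 = t2

def smooth_unrolled2(a, b):
    # Unrolled by 2 and renamed: t1 and t2 swap roles each half,
    # so the explicit copy disappears.
    t1 = b[0]
    for i in range(1, 101, 2):
        t2 = b[i]
        a[i] = a[i] + t1 + t2
        t1 = b[i + 1]
        a[i + 1] = a[i + 1] + t2 + t1
```

The unroll factor of 2 is the length of the copy cycle here; longer cycles need the LCM, as the slide notes.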
Locality-Driven Improvement: Loop Fusion
Two loops iterate over the same iteration space; convert them into a single loop.
do i = 1 to n
  c(i) = a(i) + b(i)
do j = 1 to n
  d(j) = a(j) * e(j)
becomes (fuse)
do i = 1 to n
  c(i) = a(i) + b(i)
  d(i) = a(i) * e(i)
Advantages:
Fewer total operations (lower overhead).
Longer basic blocks for local optimization and scheduling.
Can convert reuse between loops to reuse within a loop: in the original, for large enough arrays, a(x) will not be in the cache by the time the second loop tries to reuse it; in the fused loop, a(x) will almost certainly be in the cache at the second use.
This transformation is safe if and only if the fused loop does not change the values used or defined by any statement in either loop. Safety is expressed in terms of dependences: essentially, the same values flow to the same places.
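The two forms can be sketched in Python (0-based lists; function names are mine, not from the slides). Since neither loop writes a, b, or e, fusion is safe and both versions compute the same c and d:

```python
def loops_separate(a, b, e):
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):          # first pass touches every a[i] ...
        c[i] = a[i] + b[i]
    for j in range(n):          # ... second pass touches them all again
        d[j] = a[j] * e[j]
    return c, d

def loops_fused(a, b, e):
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):
        c[i] = a[i] + b[i]      # a[i] is referenced here ...
        d[i] = a[i] * e[i]      # ... and reused while it is still warm
    return c, d
```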
Locality-Driven Improvement: Loop Distribution (or Fission)
A single loop with multiple independent statements can be transformed into multiple independent loops.
do i = 1 to n        (reads b, c, e, f, h, & k; writes a, d, & g)
  a(i) = b(i) + c(i)
  d(i) = e(i) * f(i)
  g(i) = h(i) - k(i)
becomes (fission)
do i = 1 to n        (reads b & c; writes a)
  a(i) = b(i) + c(i)
do i = 1 to n        (reads e & f; writes d)
  d(i) = e(i) * f(i)
do i = 1 to n        (reads h & k; writes g)
  g(i) = h(i) - k(i)
Advantages:
Each loop in the transformed code can have a smaller cache footprint.
More reuse in the cache leads to faster execution.
Enables other transformations, such as vectorization.
Distribution is safe if all the statements that form a cycle in the dependence graph end up in the same loop (see COMP 515).
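A 0-based Python sketch of the fission (names are mine). The three statements touch disjoint arrays, so there is no dependence cycle and the distributed form is safe:

```python
def loop_single(b, c, e, f, h, k):
    n = len(b)
    a = [0] * n; d = [0] * n; g = [0] * n
    for i in range(n):          # one loop reads six arrays, writes three
        a[i] = b[i] + c[i]
        d[i] = e[i] * f[i]
        g[i] = h[i] - k[i]
    return a, d, g

def loop_distributed(b, c, e, f, h, k):
    # Each loop now reads two arrays and writes one: a smaller
    # cache footprint per loop.
    n = len(b)
    a = [0] * n; d = [0] * n; g = [0] * n
    for i in range(n):
        a[i] = b[i] + c[i]
    for i in range(n):
        d[i] = e[i] * f[i]
    for i in range(n):
        g[i] = h[i] - k[i]
    return a, d, g
```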
Locality-Driven Improvement: Loop Interchange
Interchange reorders loops to improve locality: swap the inner and outer loops to rearrange the iteration space.
do i = 1 to 50
  do j = 1 to 100
    a(i,j) = b(i,j) * c(i,j)
becomes (interchange)
do j = 1 to 100
  do i = 1 to 50
    a(i,j) = b(i,j) * c(i,j)
In Fortran's column-major order, a(4,4) lays out as
1,1  2,1  3,1  4,1 | 1,2  2,2  3,2  4,2 | 1,3  2,3  3,3  4,3 | 1,4  2,4  3,4  4,4
with each group of four filling one cache line. With j in the inner loop, as little as one element may be used per cache line. After interchange, the direction of iteration changes: the inner loop runs down the cache lines. This is the root cause of the speed difference in the array example from the first COMP 506 lecture.
If arrays are stored in row-major order, the same effects occur with the opposite order of loops and subscripts.
Effect:
Improves spatial reuse by using more elements per cache line.
The goal is to get as much reuse into the inner loop as possible.
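Python's nested lists do not expose cache lines, so the sketch below (names mine) only demonstrates the safety side of interchange: every iteration of this nest is independent, so both loop orders compute the same result, and a compiler is free to pick whichever order matches the storage layout:

```python
def multiply_ij(a, b, c):
    # i outer, j inner: in column-major storage this strides
    # across cache lines in the inner loop.
    for i in range(50):
        for j in range(100):
            a[i][j] = b[i][j] * c[i][j]

def multiply_ji(a, b, c):
    # Interchanged: j outer, i inner, running down the columns.
    for j in range(100):
        for i in range(50):
            a[i][j] = b[i][j] * c[i][j]
```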
Locality-Driven Improvement: Loop Permutation
Permutation generalizes interchange to multiple loops. Interchange (2 loops) is the degenerate case of two perfectly nested loops; in more general settings, the transformation is called permutation.
Safety:
Permutation is safe iff no dependences are reversed; that is, the flow of values from definitions to uses is preserved.
Effects:
Changes the order of access and the order of computation.
Moves accesses closer together in time: increased temporal locality.
Moves computations further apart in time: covers pipeline latencies.
The Big Picture
Loop optimizations can radically change locality. For programs that are memory bound, loop optimization is the primary way to find improvements: change the order of iteration, and you change the patterns of memory accesses.
Safety conditions and opportunities:
The formal statements of the safety conditions typically involve dependence analysis (see COMP 515). Safety is expressed in terms of dependences: essentially, the same values flow to the same places.
There are many formulations of the transformations: polyhedral analysis, unimodular transformations, and ad-hoc, one-off techniques.
Improving memory-bound programs is possible, but takes some knowledge. Most run-of-the-mill compilers do not perform optimizations this complex.
Address Space Layout
We have seen this drawing several times in COMP 506. Most language runtimes lay out the address space in a similar way.
[Figure: one virtual address space containing stacks (with growth space for the stacks), heap, code, and globals, in the style of the Java memory layout.]
The pieces (stack, heap, code, & globals) may move, but all will be there. The stack and heap grow toward each other (if the heap grows). Arrays live on one of the stacks, in the global area, or in the heap.
The picture shows one virtual address space. The hardware supports one virtual address space per process. How does a virtual address space map into physical memory?
How Does Address Space Mapping Work? The Big Picture
[Figure: the compiler's view is many virtual address spaces, each with its own stack, heap, code, and globals & statics; in the OS view, each maps through the TLB into a single physical address space, from 0 to high (the 1980 hardware view).]
The TLB is an address cache used by the OS to speed virtual-to-physical address translation. A processor may have more than one level of TLB.
More Address Space Mapping
Of course, the hardware view is no longer that simple.
[Figure: main memory (data & code, from 0 to high) sits behind a shared L2 cache; each processor core has its own registers and split L1 data and code caches; the TLB sits alongside.]
Cache structure matters for performance, not correctness. Many processors now include L3 caches; L4 caches are on their way.
Cache Memory
[Figure: registers in the core; split L1 data and code caches; unified L2 and L3 caches holding data & code; the TLB. The L2 is typically shared among 2 cores.]
Modern hardware features multiple levels of cache and of TLB:
L1 is typically private to a core.
L2 (and beyond) is typically shared between cores and between code (I) and data (D).
Most caches are inclusive: an item in L1 is also in L2 and in L3. Some are exclusive (an item in L1 is not in L2).
Most caches are set associative: 2, 4, or 8 way. TLBs are also associative.
There is little documentation of these properties, and they are difficult to detect or measure.
Cache Memory
The primary function of a cache is to provide fast memory near the core:
L1 takes a couple of cycles and is small.
L2 is slower than L1 and larger; L3 is slower still, and larger.
This laptop (Core i7):
L1:  5 cycles,    32 KB
L2: 13 cycles,   256 KB
L3: 36 cycles, 4,096 KB
Cache Memory
The other function of a cache is to map addresses:
The cache is organized into blocks, or lines. Each line consists of a tag and a set of words. A full cache is a set of lines.
An address maps into 3 parts: tag, index, and offset. The index is a many-to-one map.
To make good use of cache memory, the code must reuse values. Spatial reuse refers to the use of more than one word in a line. Temporal reuse refers to reuse of the same word over time.
Cache Memory
Caches differ in how they apportion the tag and index bits.
A direct-mapped cache has one line per index, so cache lookup is simple: the index bits are an ordinal index into the set of lines.
[Figure: an address split into tag (t bits), index (s bits), and offset (o bits); the index selects one of lines 0 through 2^s - 1, and the lookup asks, "Do the tags match?"]
A direct-mapped cache has 2^s lines. Its capacity is the sum of the sizes of the lines.
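The field split and the direct-mapped lookup can be sketched in Python. The sizes below are toy values chosen for illustration, not taken from any real cache, and the names are mine:

```python
O_BITS = 3   # 2**3 = 8-byte lines   (assumed toy size)
S_BITS = 2   # 2**2 = 4 lines        (assumed toy size)

def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << O_BITS) - 1)
    index = (addr >> O_BITS) & ((1 << S_BITS) - 1)
    tag = addr >> (O_BITS + S_BITS)
    return tag, index, offset

def dm_lookup(cache, addr):
    """Direct-mapped lookup: one line per index; hit iff the tags match."""
    tag, index, offset = split_address(addr)
    line = cache[index]                  # ordinal index into the lines
    if line is not None and line["tag"] == tag:
        return line["words"][offset]     # hit: return the word
    return None                          # miss
```

Because the index is a many-to-one map, two addresses that share index bits but differ in tag bits evict each other in a direct-mapped cache.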
Cache Memory
A set-associative cache has multiple lines per index: the index maps to a set, and lookup matches tags within the set, using a small content-addressable memory¹ for each set.
[Figure: a 2-way set-associative cache; the address splits into tag, index, and offset, and the index selects one of sets 0 through 2^s - 1, each holding way 0 and way 1.]
A set-associative cache has 2^s sets. For a given total size, s is smaller than in a direct-mapped cache; the tag is longer and the index is shorter.
¹ Sometimes called associative memory.
What Happens on a Load?
The hardware must find the data in this complex hierarchy. Assume that the address is in a register, e.g., load r0 => r1; assume a set-associative cache; assume cache tags are virtual addresses.
Sequence of events for a load:
1. The processor looks in the L1 cache. The index maps to a set, then an associative search matches the tags in the set. If found (a cache hit), return the value; otherwise
2. The processor looks in the L2 cache. The index maps to a set, then an associative search matches the tags in the set. If found (a cache hit), return the value; otherwise
3. And so on.
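This lookup sequence can be sketched as a walk over cache levels, each modeled here as a dict from index to a list of (tag, words) ways. The field widths are toy values and all names are mine; real hardware does the associative search in parallel, not with a loop:

```python
O_BITS, S_BITS = 3, 2   # toy field widths (assumed)

def load(levels, memory, addr):
    """Walk the cache levels in order; fall back to memory on a miss.
    Returns (value, level), where level is 1 for L1, 2 for L2, ...,
    or None when every level missed."""
    offset = addr & ((1 << O_BITS) - 1)
    index = (addr >> O_BITS) & ((1 << S_BITS) - 1)
    tag = addr >> (O_BITS + S_BITS)
    for depth, cache in enumerate(levels, start=1):
        # Index maps to a set; associative search over the ways' tags.
        for way_tag, words in cache.get(index, []):
            if way_tag == tag:
                return words[offset], depth    # cache hit at this level
    return memory[addr], None                  # missed in every cache
```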
What Happens on a Load?
What about virtual-to-physical address translation? The address in the load is a virtual address; if the load misses in all caches, we need a physical address.
Caches can be designed to operate on virtual or physical addresses: L1 is typically indexed by virtual addresses, while L2 and above are typically indexed by physical addresses.
A physically-addressed cache requires virtual address translation during lookup. Translation involves understanding the map from virtual pages to physical pages, and requires cooperation between the hardware and the operating system. The worst-case behavior involves walking the page tables (often locked in L2 or L3).
The design of virtual memory systems is covered in a good OS course.
Cache Memory
The TLB plays a key role in virtual-to-physical address mapping:
It is a small cache that maps virtual addresses to physical addresses; it holds the subset of (active) pages that are in virtual memory. The tag is a virtual address; the content is a physical address.
A physically-tagged cache must translate the virtual address to a physical address. On a TLB hit, the access can continue; on a TLB miss, a search brings the page entry into the TLB, and the access then continues (or reissues). A page fault on the way to an L1 lookup is a lot of delay.
Most processors use a virtually-tagged L1 cache, with physical tags in the upper-level caches:
This removes the TLB's role in the L1 lookup. The TLB can be as fast as L1, so it is not a problem for L2 and beyond.
Physical tags are smaller than virtual tags: fewer gates, less area, lower power consumption.
What Happens on a Load?
Careful design can let the TLB lookup and the index-set lookup run in parallel. By playing with the sizes of the tag (t), index (s), and offset (o) fields, the cache designer can separate index lookup from virtual-to-physical translation:
If s + o ≤ log2(pagesize), then the index and offset bits are the same in physical and virtual addresses, so the processor can start the L1 lookup to find the set and the TLB lookup to translate the address at the same time. By the time it has found the set, it should have the tag from the physical address (unless the lookup misses in the TLB).
In effect, associativity lets cache capacity grow without increasing the number of bits in the index field of the address.
Do manufacturers play this game? Absolutely. My laptop has a 32,768-byte L1 cache, with 64-byte lines, for 512 lines. It is 8-way set associative, which means 64 sets. Thus, s = 6, o = 6, and s + o = 12 bits. 2^12 = 4,096, which is the pagesize. At the same capacity, a 4-way set-associative cache would have 2^13 bytes per way, so s + o = 13 for any line size; keeping s + o ≤ 12 would require either higher associativity or a smaller cache.
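The laptop arithmetic above can be checked directly. This sketch (function name mine) derives the index and offset widths from a cache's capacity, line size, and associativity:

```python
import math

def index_and_offset_bits(capacity_bytes, line_bytes, ways):
    """Compute the index (s) and offset (o) field widths for a
    set-associative cache of the given geometry."""
    sets = capacity_bytes // (line_bytes * ways)
    s = int(math.log2(sets))        # index bits select the set
    o = int(math.log2(line_bytes))  # offset bits select the byte in a line
    return s, o

# The laptop's L1: 32 KB, 64-byte lines, 8-way => 64 sets.
s, o = index_and_offset_bits(32768, 64, 8)
```

Here s + o = 12 and 2^12 = 4,096, the pagesize, so the index and offset fall entirely within the page-offset bits and the set lookup can overlap the TLB lookup. Rerunning the function with 4 ways gives s + o = 13, which is why the same trick fails at lower associativity for this capacity.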