CACHE AWARENESS
Effect of memory latency
Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume the processor has two ALU units and can execute two instructions in each 1 ns cycle; the peak processor rating is therefore 2 GFLOPS. Since the memory latency equals 100 cycles, every time a memory request is made the processor must wait 100 cycles before it can process the data. Consider the problem of computing the dot-product of two vectors: each multiply-add operates on a single pair of vector elements, i.e., each floating-point operation requires one data fetch. It is easy to see that the peak speed of this computation is limited to one operation every 100 ns, i.e., 10 MFLOPS.
Cache Memory
A smaller and faster memory placed between the processor and the DRAM. The data needed by the processor is first fetched into the cache; all subsequent accesses to data items residing in the cache are serviced by the cache. Performance improves in the presence of high locality.
Def. Cache hit ratio: the fraction of memory references resolved by the cache.
With cache
Consider the same processor: 1 GHz (1 ns clock), DRAM latency 100 ns, two ALU units, two instructions per 1 ns cycle, peak rating 2 GFLOPS. Now suppose there is a 32 KB cache with a latency of 1 ns per word, and we must multiply two matrices A and B of size 32x32 (N.B.: A, B and A*B all fit in cache).
- The time needed to load A and B into cache is 32*32*2 * 100 ns ≈ 205 µs
- Multiplying two n x n matrices takes 2*n^3 multiply-add steps, in our case 2*32^3 ≈ 66K, which implies ≈ 66 µs
- Total time: 205 + 66 = 271 µs
- Throughput: 66K*2 / 271 µs ≈ 488 MFLOPS (> 10 MFLOPS, < 2 GFLOPS peak)
Locality: n^3 operations on n^2 memory locations!
Effect of memory bandwidth
Consider the previous example.
- If a cache block is one word wide, loading the two matrices into cache takes 32*32*2 * 100 ns ≈ 205 µs
- If a cache line is four words wide, loading the two matrices takes 32*32*2/4 * 100 ns ≈ 51 µs
- Total time: 51 + 66 = 117 µs
- Throughput: 66K*2 / 117 µs ≈ 1128 MFLOPS (> 488 MFLOPS, > 10 MFLOPS, < 2 GFLOPS peak)
Warning! We are assuming the data is laid out linearly in memory.
Other approaches for hiding memory latency
- Multi-threading: split the problem into multiple sub-problems and run an independent thread for each sub-problem. When a thread is idle on a miss, another thread can execute computational tasks.
- Pre-fetching: anticipate load operations, so that data is already available when needed.
Drawback: both techniques increase bandwidth pressure and cache pollution.
Cache-to-Memory coherence
After updating/writing data in cache, when should it be written to memory? Other devices might read the same data from memory.
- Write-through policy: data is immediately written to memory; each write pays the memory delay (~100 cycles).
- Write-back policy: memory is updated upon eviction of the cache line; fewer memory operations (especially in case of re-use/locality).
On a write to data not present in cache:
- Write-allocate: first load the line into cache, then update the cache (only).
- Write-no-allocate: write (stream) directly to memory (no load into cache).
Cache-to-Cache coherence in Symmetric Multi-Processors
[Figure: processors P1..Pn, each with its own cache ($), connected by a shared BUS to memory and I/O devices; two processors cache u = 5, then one of them writes u = 7.] After the write, the processors see different values of u.
Snooping protocols
- (Most) assume a write-through policy and a shared communication channel (bus) among processors/caches.
- The cache controller snoops all bus transactions; a transaction is relevant for a cache if the referenced data line (uniquely identified by its block address) is present in that cache.
- The possible actions to guarantee cache coherence are:
  - Invalidate: the cache line is invalidated and must be re-loaded from memory before the next access (write-through guarantees correctness).
  - Update: the cache line is updated with the new value.
Note: there are strategies for non-write-through policies. Modern processors use point-to-point links among (multi-core) CPUs rather than a shared bus.
Cache invalidate vs. cache update
Update
- Cons: can waste bus bandwidth unnecessarily, e.g. when a cache block is updated remotely but no longer read by the associated processor, or when subsequent writes by the same processor cause multiple updates.
- Pros: multiple R/W copies are kept coherent after each write, avoiding misses on subsequent read accesses (thus saving bus bandwidth).
Invalidate
- Pros: multiple writes by the same processor do not cause any additional bus traffic.
- Cons: an access that follows an invalidation causes a miss.
The pros and cons of the two approaches depend on the application and its read/write patterns.
False sharing
Coherence protocols work in terms of cache blocks/lines, rather than single words/bytes, so the block size plays an important role in the coherence protocol:
- with small blocks, the protocol is more efficient (less data to transmit on an update or flush);
- large blocks are better for spatial locality.
What happens when multiple processors access the same cache block?
False sharing
[Figure: P1 and P2 both cache the line (13, 14, 17, 18); P2 overwrites 18 with 23, and the update protocol transmits the whole line (13, 14, 17, 23) to P1's cache, even though P1 never reads that word.]
- Consider two unrelated variables (e.g., variables that are logically private to distinct threads) which are allocated in the same block.
- Write accesses to the block by the different processors running those threads are treated as conflicts by the coherence protocol, even if the two processors access disjoint words of the same block.
- It is therefore necessary to place related variables on the same block, e.g. all those that are logically private to a given thread (via compiling techniques, or by the programmer).
Intel Core i7 (4770)
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB
- L1 data cache: 32 KB, 64 B/line
- L1 instruction cache: 32 KB, 64 B/line
- L2 cache: 256 KB, 64 B/line
- L3 cache: 8 MB, 64 B/line
- L1 data cache latency: 4-5 cycles
- L2 cache latency: 12 cycles
- L3 cache latency: 36 cycles
- RAM latency: 36 + ~200 cycles
So far
The cache has a significant impact on the performance of modern applications. How can we study the cache access patterns of an algorithm? How can we improve algorithm design? Two examples: sorting and matrix multiplication.
External Memory Model
We use the terms cache vs. disk to make the relative costs clear.
- Transfers occur in blocks of size B
- The cache has size M >= B, i.e. M/B block entries
Model properties:
- Simple
- Asks to minimize the I/O cost
- Optimizes for a specific M and B
- Read/write operations are issued explicitly
- The cache is managed explicitly (can you do this?)
[J.S. Vitter, ACM Computing Surveys, 2001]
Example: linear scan
Theorem: scanning N elements stored in a contiguous segment of memory costs at most ⌈N/B⌉ + 1 memory transfers.
Cache-Oblivious Model
Simple idea: design an algorithm that is optimal for any B and M.
[Figure: CPU performing W work against an ideal cache of M/B lines of length B, incurring Q cache misses against main memory.]
Properties:
- No need to know M and B, which can be hard to discover and can harm the generality of an algorithm
- Only one cache level is modeled
- The cache is not managed explicitly
Cache assumptions:
- Tall-cache assumption: M = Ω(B^2)
- Ideal cache model: optimal cache replacement (vs. FIFO, LRU)
- Full associativity (vs. n-way associativity)
[Frigo et al., FOCS '99]
Cache-Oblivious Model: generalization to multiple cache levels
[Figure: hierarchy CPU - (M_1, B_1) - (M_2, B_2) - (M_3, B_3) - ... - (M_N, B_N).]
Theorem (from one level to many): if algorithm A is cache-oblivious optimal, then it is optimal on any two adjacent memory levels in a complex hierarchy.
Proof sketch: if the inclusion property holds, i.e. M_i ⊆ M_{i+1}, consider M_{i+1} as the external memory; A is then optimal w.r.t. M_i and B_i.
Theorem (levels of different cost): let C_i be the cost of accessing memory M_i; if A is cache-oblivious optimal up to a constant factor, then A is optimal for any possible set of constant factors C_i.
Cache-Oblivious Model: generalization to multiple cache levels
The following theorems make the model feasible in practice:
Theorem (from optimal replacement to LRU/FIFO): if A takes T transfers with optimal replacement and cache size M/2, then A takes at most 2T transfers on a cache of size M with LRU or FIFO replacement.
Theorem (from full associativity to 1-way associativity): an LRU/FIFO fully associative cache of size M and block size B can be simulated in O(M) space, such that an access to a block takes O(1) expected time.
Conclusion: a cache-oblivious algorithm can be translated to a FIFO/LRU cache with 1-way associativity paying only constant factors.
Matrix multiplication
Compute C = A * B, with A and B being N x N matrices.
Preliminaries: how should the matrices be stored? Row-major order vs. column-major order.
Cache cost/complexity
For each element C_ij: scan the i-th row of A and the j-th column of B, computing a multiply-add per pair. Assuming the layouts make both scans sequential (A row-major, B column-major), each element of C involves O(1 + N/B) transfers. Since there are N^2 elements in C, the complexity is O(N^2 + N^3/B).
Simple approaches to reduce this cost:
- If M > N, row i of A can be kept in cache and re-used to compute all the C_ix values
- To keep the whole matrix A in cache, M should be > N^2
Improved algorithm
This can be improved to the optimal complexity O(N^2/B + N^3/(B*√M)) with block matrices:
  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
Improved algorithm
What is the best strategy given M and B? What is the best cache-conscious algorithm, i.e. the best algorithm according to the external memory model? Simple: use blocks of size s x s, such that 3*s^2 = M (one block of each of A, B and C fits in cache), and use a blocked memory layout.
Cache-oblivious algorithm
Divide-and-conquer approach: split each matrix into four quadrants and compute
  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
recursing on each of the eight sub-products, and again on their sub-products, and so on. At some point the recursion will fit in the cache, whatever its size.
Data layout
We don't know when the recursion will fit in cache! We need a recursive data layout, such that however we recursively split the matrix, at some point all the data lies in (almost) consecutive memory locations that can be easily loaded into cache. Space-filling curves: the Z-order.
How to implement the Z-order
For any subscript of a 2-dimensional array, such as array[2, 3]:
- binary value of row 2 -> 1 0
- binary value of col 3 -> 1 1
Interleaving the bits gives 1 1 0 1, i.e. the value is stored at location 13.
Complexity
Sketch: the base case of the recursion is when three blocks (one each of A, B and C) fit in cache, since further recursion steps cause no additional misses.
- Assume the base case occurs with blocks of size k*√M x k*√M, for some constant k
- Such blocks fill the cache with O(M/B) misses
- The number of block multiplications in the block-based algorithm is (N/(k*√M))^3
- Resulting in (N/(k*√M))^3 * M/B = O(N^3/(B*√M)) misses
- The sums across the various blocks generate O(N^2/B) misses
- The total cost is thus O(N^2/B + N^3/(B*√M))
End of first part
References:
- Parallel Programming for Multicore and Cluster Systems, Sec. 2.7, Cache and Memory Hierarchy
- Cache-Oblivious Algorithms and Data Structures, Erik D. Demaine, Sec. 1, 2, 3.1.1, 3.2.3