Cache-Oblivious Algorithms
Paper Reading Group
Authors: Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran
Presented by: Maksym Planeta, 03.09.2015
Table of Contents
Introduction
Cache-oblivious algorithms: Matrix multiplication, Matrix transposition, Fast Fourier Transform, Sorting
Relaxed system model
Experimental evaluation
Conclusion
Matrix multiplication
ORD-MULT(A, B, C)
1 for i ← 1 to m
2   for j ← 1 to p
3     for k ← 1 to n
4       C_ij ← C_ij + A_ik · B_kj
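The pseudocode above can be sketched in Python with plain nested lists (a minimal illustration of ORD-MULT, not a performance implementation; m, n, p name the dimensions as on the slide):

```python
def ord_mult(A, B, C):
    """C += A * B, where A is m x n, B is n x p, C is m x p."""
    m, n, p = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
```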
Matrix layout
Like in C...
Figure: row-major order (element indices 0-63 of an 8×8 matrix).
Matrix layout
Or like in Fortran...
Figure: column-major order (element indices 0-63 of an 8×8 matrix).
Cache-friendly algorithm
BLOCK-MULT(A, B, C, n)
1 for i ← 1 to n/s
2   for j ← 1 to n/s
3     for k ← 1 to n/s
4       ORD-MULT(A_ik, B_kj, C_ij, s)
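A minimal sketch of the blocked scheme, with ORD-MULT inlined on the s × s subblocks (assumes, for simplicity, square matrices with n divisible by s):

```python
def block_mult(A, B, C, s):
    """C += A * B for n x n matrices, processed in s x s blocks."""
    n = len(A)
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                # ORD-MULT on the subblocks A[i..i+s, k..k+s] and B[k..k+s, j..j+s]
                for ii in range(i, i + s):
                    for jj in range(j, j + s):
                        for kk in range(k, k + s):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
```

When s is tuned so that three s × s blocks fit in cache, each block triple is read once per use, which is exactly the cache-awareness the next slide complains about.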
BLOCK-MULT issues
Being cache aware is hard:
Cumbersome structure
Complicated choice of s
Picking s badly is expensive
Problematic if n mod s ≠ 0
Motivation
Keeping the algorithm simple is nice. But cache effectiveness is a must.
Table of Contents
Introduction
Cache-oblivious algorithms: Matrix multiplication, Matrix transposition, Fast Fourier Transform, Sorting
Relaxed system model
Experimental evaluation
Conclusion
System model
Figure: the ideal-cache model. A CPU performing W work operates on a cache of Z words, organized in lines of L words each and managed by an optimal replacement strategy; the Q cache misses are served from main memory.
Two-level memory
Fully associative
Strictly optimal replacement
Automatic replacement
Tall cache: Z = Ω(L²),
where Z is the number of words in the cache and L is the number of words in a cache line.
Matrix multiplication
Given: A[m × n], B[n × p], C[m × p]. REC-MULT splits the largest dimension in half:
\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}, \quad m \ge \max(n, p) \quad (1)
\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2, \quad n \ge \max(m, p) \quad (2)
A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}, \quad p \ge \max(n, m) \quad (3)
C_{ij} := C_{ij} + A_{ik} B_{kj}, \quad m = n = p = 1 \quad (4)
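The four cases can be sketched directly as a recursion over index ranges (a simplified illustration of REC-MULT on nested lists; real implementations coarsen the base case rather than recurse to 1 × 1):

```python
def rec_mult(A, B, C, i0=0, k0=0, j0=0, m=None, n=None, p=None):
    """C[i0:i0+m, j0:j0+p] += A[i0:i0+m, k0:k0+n] * B[k0:k0+n, j0:j0+p]."""
    if m is None:
        m, n, p = len(A), len(A[0]), len(B[0])
    if m == n == p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]          # case (4): scalar update
    elif m >= max(n, p):                             # case (1): halve rows of A and C
        h = m // 2
        rec_mult(A, B, C, i0, k0, j0, h, n, p)
        rec_mult(A, B, C, i0 + h, k0, j0, m - h, n, p)
    elif n >= max(m, p):                             # case (2): A1 B1 + A2 B2
        h = n // 2
        rec_mult(A, B, C, i0, k0, j0, m, h, p)
        rec_mult(A, B, C, i0, k0 + h, j0, m, n - h, p)
    else:                                            # case (3): halve columns of B and C
        h = p // 2
        rec_mult(A, B, C, i0, k0, j0, m, n, h)
        rec_mult(A, B, C, i0, k0, j0 + h, m, n, p - h)
```

Because the recursion always halves the largest dimension, the subproblems eventually fit entirely in cache at every level of the memory hierarchy, without the algorithm ever knowing Z or L.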
Bounds
REC-MULT: Work Θ(n³); cache misses Θ(n + n²/L + n³/(L√Z))
vs BLOCK-MULT: Work Θ(n³); cache misses Θ(1 + n²/L + n³/(L√Z))
vs Strassen's [2] (cache oblivious): Work Θ(n^{log₂ 7}); cache misses Θ(1 + n²/L + n^{log₂ 7}/(L√Z))
Matrix transposition
Given: A[m × n], B[n × m], compute B = Aᵀ. Split along the larger dimension:
A = \begin{pmatrix} A_1 & A_2 \end{pmatrix}, \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \quad (5)
where B_i = A_i^{\mathsf T}.
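A minimal sketch of REC-TRANSPOSE following equation (5): split along the larger dimension until a single element remains (illustrative only; the paper's version coarsens the base case):

```python
def rec_transpose(A, B, ai=0, aj=0, m=None, n=None):
    """Write the transpose of A[ai:ai+m, aj:aj+n] into B[aj:aj+n, ai:ai+m]."""
    if m is None:
        m, n = len(A), len(A[0])
    if m == 1 and n == 1:
        B[aj][ai] = A[ai][aj]
    elif n >= m:                       # A = (A1 A2) by columns, B = (B1; B2) by rows
        h = n // 2
        rec_transpose(A, B, ai, aj, m, h)
        rec_transpose(A, B, ai, aj + h, m, n - h)
    else:                              # split by rows instead
        h = m // 2
        rec_transpose(A, B, ai, aj, h, n)
        rec_transpose(A, B, ai + h, aj, m - h, n)
```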
Bounds
REC-TRANSPOSE: Work Θ(mn); cache misses Θ(1 + mn/L), asymptotically optimal
Naïve: Work Θ(mn); cache misses Θ(mn)
Discrete Fourier Transform (DFT)
Compute:
Y[i] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{-ij}, \quad \text{where } \omega_n = e^{2\pi\sqrt{-1}/n}
Assume n = 2^k, k ∈ ℕ
Choose n_1 = 2^{\lceil (\log_2 n)/2 \rceil}, n_2 = 2^{\lfloor (\log_2 n)/2 \rfloor}
Factorized Y (Cooley-Tukey algorithm):
Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}
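A compact way to see the recursive structure is the simplest radix-2 Cooley-Tukey variant (a sketch only: the paper instead splits n into n₁·n₂ with n₁, n₂ ≈ √n to obtain the cache-oblivious bound; the ω_n^{-ij} sign convention matches the definition above):

```python
import cmath

def fft(X):
    """Recursive radix-2 Cooley-Tukey DFT; len(X) must be a power of 2."""
    n = len(X)
    if n == 1:
        return X[:]
    even = fft(X[0::2])
    odd = fft(X[1::2])
    Y = [0] * n
    for i in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * i / n)   # twiddle factor omega_n^{-i}
        Y[i] = even[i] + w * odd[i]
        Y[i + n // 2] = even[i] - w * odd[i]
    return Y
```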
Sorting
Mergesort is not optimal with respect to cache misses. Two alternatives:
1. Funnelsort
2. Distribution sort
Both are recursive and asymptotically cache-optimal. Not every recursive sort is cache-optimal.
Funnelsort
1. Split the input into n^{1/3} contiguous arrays of size n^{2/3}, and sort these arrays recursively
2. Merge the n^{1/3} sorted sequences using an n^{1/3}-merger
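The two steps above can be sketched as follows. Note this shows only the recursion structure, not the cache behaviour: the cache-optimal k-merger is replaced by Python's heapq.merge, and the base-case cutoff is an arbitrary choice for the sketch.

```python
import heapq

def funnelsort(xs):
    """Funnelsort recursion sketch: n^(1/3) pieces of size ~ n^(2/3)."""
    n = len(xs)
    if n <= 8:                               # small base case (cutoff chosen for the sketch)
        return sorted(xs)
    k = max(2, round(n ** (1 / 3)))          # ~ n^(1/3) pieces
    size = -(-n // k)                        # ceil(n / k), each piece ~ n^(2/3)
    runs = [funnelsort(xs[i:i + size]) for i in range(0, n, size)]
    return list(heapq.merge(*runs))          # stand-in for the k-merger
```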
k-merger
Figure: illustration of a k-merger. A k-merger is built recursively out of √k left √k-mergers L_1, L_2, ..., L_√k, a series of buffers, and one right √k-merger R. Each buffer can hold 2k^{3/2} elements; to output k³ elements, the k-merger invokes R k^{3/2} times.
Bounds
Work: O(n log₂ n)
Cache misses: O(1 + (n/L)(1 + log_Z n)), asymptotically optimal
Relaxed system model
LRU instead of optimal replacement: still Θ(Q(n; Z; L)) cache misses
Multilevel caches: the results carry over to inclusive caches
Table of Contents
Introduction
Cache-oblivious algorithms: Matrix multiplication, Matrix transposition, Fast Fourier Transform, Sorting
Relaxed system model
Experimental evaluation
Conclusion
Micro-benchmarks
Figure 5: average time taken to multiply two N×N matrices, divided by N³ (iterative vs recursive).
Figure 4: average time to transpose an N×N matrix, divided by N² (iterative vs recursive).
The recursive codes were tuned: the divide-and-conquer structure produces exact powers of 2 as submatrix sizes where possible, and base cases were coarsened by inlining the recursion near the leaves to overcome procedure-call overhead.
Real benchmarks [1]
Comparison of cache-aware and cache-oblivious static search trees.
Fig. 4.8: average cache misses per lookup for static search algorithms, for 100 to 10⁶ items. Variants: classic binary search, cache-oblivious, and cache-aware layouts, each with explicit and implicit pointers.
Real benchmarks [1]
Fig. 4.9: average instruction count per lookup for static search algorithms (same six variants), for 100 to 10⁶ items.
Real benchmarks [1]
Fig. 4.10: execution time on Windows for static search algorithms (same six variants), in microseconds per lookup.
Table of Contents
Introduction
Cache-oblivious algorithms: Matrix multiplication, Matrix transposition, Fast Fourier Transform, Sorting
Relaxed system model
Experimental evaluation
Conclusion
FFMK tribute slide...
"The FFTW library uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line codelets, which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture."
Open questions
Is there a gap in asymptotic complexity between cache-aware and cache-oblivious algorithms?
Is there a limit to how much better a cache-aware algorithm can be in practice?
Conclusion
Cache-oblivious algorithms:
Seem to be somewhat slower in practice
Provide cache optimality without knowing the cache size
Are based on recursion
[1] Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, pages 78-92. Springer, 2002.
[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354-356, 1969.