Cache-Oblivious Algorithms



1 Cache-Oblivious Algorithms
Paper Reading Group
Paper by: Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran
Presented by: Maksym Planeta

2 Table of Contents
Introduction
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  Fast Fourier Transform
  Sorting
  Relieved system model
Experimental evaluation
Conclusion


4 Matrix multiplication
ORD-MULT(A, B, C)
    for i ← 1 to m
        for j ← 1 to p
            for k ← 1 to n
                C_ij ← C_ij + A_ik · B_kj
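The ORD-MULT loop nest can be sketched in Python as follows. This is only an illustration: the function name `ord_mult` and the list-of-lists representation are assumptions made here, not something the slides prescribe.

```python
def ord_mult(A, B, C):
    """Accumulate A * B into C with the three nested loops of ORD-MULT.

    A is m x n, B is n x p, C is m x p, all stored as lists of lists
    (a representation assumed for this sketch).
    """
    m, n, p = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Note that the innermost loop walks down a column of B, which is the access pattern that hurts when B is stored row by row.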

5 Matrix layout
Like in C...
[Figure: Row-major order]

6 Matrix layout
Like in C...
[Figure: Row-major order]
Or like in Fortran...
[Figure: Column-major order]
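The two layouts differ only in how a pair (i, j) is mapped to a position in linear memory. A minimal sketch, with hypothetical helper names chosen for this example:

```python
def row_major_index(i, j, rows, cols):
    """Offset of element (i, j) when rows are stored contiguously (as in C)."""
    return i * cols + j


def col_major_index(i, j, rows, cols):
    """Offset of element (i, j) when columns are stored contiguously (as in Fortran)."""
    return j * rows + i


def flatten(matrix, index_fn):
    """Lay a list-of-lists matrix out in a flat list using the given index mapping."""
    rows, cols = len(matrix), len(matrix[0])
    flat = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            flat[index_fn(i, j, rows, cols)] = matrix[i][j]
    return flat
```

For the 2 × 3 matrix `[[0, 1, 2], [3, 4, 5]]`, the row-major layout is `[0, 1, 2, 3, 4, 5]` and the column-major layout is `[0, 3, 1, 4, 2, 5]`.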

7 Cache-friendly algorithm
BLOCK-MULT(A, B, C, n)
    for i ← 1 to n/s
        for j ← 1 to n/s
            for k ← 1 to n/s
                ORD-MULT(A_ik, B_kj, C_ij, s)
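A rough Python rendering of BLOCK-MULT is given below. It assumes square n × n matrices stored as lists of lists and a tile size `s` that divides n; the explicit parameter `s` and the error raised otherwise are choices made for this sketch.

```python
def block_mult(A, B, C, n, s):
    """Accumulate A * B into C for n x n matrices, working on s x s tiles.

    Assumes s divides n; tuning s to the cache is what makes this
    algorithm cache aware.
    """
    if n % s != 0:
        raise ValueError("this sketch assumes that s divides n")
    for bi in range(0, n, s):
        for bj in range(0, n, s):
            for bk in range(0, n, s):
                # ORD-MULT restricted to one tile of A, B and C.
                for i in range(bi, bi + s):
                    for j in range(bj, bj + s):
                        acc = C[i][j]
                        for k in range(bk, bk + s):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The intent is that three s × s tiles fit in cache together, which is why the right value of `s` depends on the machine.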

8 BLOCK-MULT issues
Being cache aware is hard:
Cumbersome structure
Complicated choice of s
A badly chosen s is expensive
Problematic if n mod s ≠ 0

9 Motivation
Keeping the algorithm simple is nice, but cache efficiency is a must.


11 System model
[Figure 1: The ideal-cache model. A CPU performing W work is attached to a cache of Z words, organized into lines of length L under an optimal replacement strategy; Q counts the cache misses between the cache and main memory.]
Word size is assumed constant; the particular constant does not affect the asymptotic analyses.
Two-level memory
Fully associative cache
Strictly optimal replacement
Automatic replacement
Tall cache: $Z = \Omega(L^2)$, where $Z$ is the number of words in the cache and $L$ is the number of words in a cache line
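To make the model concrete, the sketch below counts the misses Q of a fully associative cache with optimal replacement on a given sequence of word addresses. It uses the classic offline rule (evict the line whose next use lies farthest in the future). The function name `opt_misses` and the representation of the trace as a list of integers are assumptions of this example.

```python
def opt_misses(trace, Z, L):
    """Count cache misses for a trace of word addresses in the ideal-cache model.

    Z is the cache size in words, L the line length in words; the cache
    holds Z // L lines and evicts the line reused farthest in the future.
    """
    lines = [addr // L for addr in trace]
    capacity = Z // L

    # next_use[t] is the position of the next access to the same line.
    next_use = [0] * len(lines)
    last_seen = {}
    for t in range(len(lines) - 1, -1, -1):
        next_use[t] = last_seen.get(lines[t], float("inf"))
        last_seen[lines[t]] = t

    cache = {}  # resident line -> position of its next use
    misses = 0
    for t, line in enumerate(lines):
        if line not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[line] = next_use[t]
    return misses
```

A sequential scan of 16 words with L = 4 costs 4 misses, one per line, which is the 1/L factor that appears in the bounds.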

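12 Matrix multiplication
Given: $A[m \times n]$, $B[n \times p]$, $C[m \times p]$

$\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}, \quad m \ge \max(n, p)$  (1)

$\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2, \quad n \ge \max(m, p)$  (2)

$A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}, \quad p \ge \max(n, m)$  (3)

$C_{ij} := C_{ij} + A_{ik} B_{kj}, \quad m = n = p = 1$  (4)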
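Cases (1) to (4) translate almost directly into a recursive routine. The sketch below works on index ranges of list-of-lists matrices instead of copying submatrices; the function name `rec_mult` and its offset parameters are introduced for this example only.

```python
def rec_mult(A, B, C, i0=0, j0=0, k0=0, m=None, n=None, p=None):
    """Accumulate an (m x n) block of A times an (n x p) block of B into C.

    The blocks start at row i0 / column k0 of A, row k0 / column j0 of B,
    and row i0 / column j0 of C. The largest dimension is halved each time.
    """
    if m is None:
        m, n, p = len(A), len(B), len(B[0])
    if m == 0 or n == 0 or p == 0:
        return C
    if m == n == p == 1:                      # case (4)
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif m >= max(n, p):                      # case (1): split A by rows
        h = m // 2
        rec_mult(A, B, C, i0, j0, k0, h, n, p)
        rec_mult(A, B, C, i0 + h, j0, k0, m - h, n, p)
    elif n >= max(m, p):                      # case (2): split the shared dimension
        h = n // 2
        rec_mult(A, B, C, i0, j0, k0, m, h, p)
        rec_mult(A, B, C, i0, j0, k0 + h, m, n - h, p)
    else:                                     # case (3): split B by columns
        h = p // 2
        rec_mult(A, B, C, i0, j0, k0, m, n, h)
        rec_mult(A, B, C, i0, j0 + h, k0, m, n, p - h)
    return C
```

No cache parameter appears anywhere in the code: the recursion eventually reaches subproblems that fit in cache, whatever Z and L are. Recursing all the way down to 1 × 1 blocks is slow in practice, so a real implementation would use a larger base case.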

13 Bounds
REC-MULT
  Work: $\Theta(n^3)$
  Cache misses: $\Theta(n + n^2/L + n^3/(L\sqrt{Z}))$
vs. BLOCK-MULT
  Work: $\Theta(n^3)$
  Cache misses: $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$
vs. Strassen's [2] (cache oblivious)
  Work: $\Theta(n^{\log_2 7})$
  Cache misses: $\Theta(1 + n^2/L + n^{\log_2 7}/(L\sqrt{Z}))$

14 Matrix transposition
Given: $A[m \times n]$, $B[n \times m]$

$A = \begin{pmatrix} A_1 & A_2 \end{pmatrix}, \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$  (5)
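The partition in (5) suggests a recursion that halves the larger dimension and transposes each half into the matching part of B. A minimal sketch, assuming list-of-lists matrices and a hypothetical function name `rec_transpose`:

```python
def rec_transpose(A, B, i0=0, j0=0, m=None, n=None):
    """Write the transpose of the (m x n) block of A at (i0, j0) into B.

    B must already have the shape of the transposed matrix.
    """
    if m is None:
        m, n = len(A), len(A[0])
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif n >= m:
        # Split A by columns, which splits B by rows, as in (5).
        h = n // 2
        rec_transpose(A, B, i0, j0, m, h)
        rec_transpose(A, B, i0, j0 + h, m, n - h)
    else:
        # Symmetric case: split A by rows.
        h = m // 2
        rec_transpose(A, B, i0, j0, h, n)
        rec_transpose(A, B, i0 + h, j0, m - h, n)
    return B
```

As with multiplication, the 1 × 1 base case keeps the sketch short; a practical version would stop at a larger block and transpose it with plain loops.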

15 Bounds
REC-TRANSPOSE
  Work: $\Theta(nm)$
  Cache misses: $\Theta(1 + mn/L)$
  Asymptotically optimal
Naïve
  Work: $\Theta(nm)$
  Cache misses: $\Theta(nm)$

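16 Discrete Fourier Transform (DFT)
Compute:

$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$, where $\omega_n = e^{2\pi\sqrt{-1}/n}$

Assume $n = 2^k$, $k \in \mathbb{N}$
Choose $n_1 = 2^{\lceil \log_2 n / 2 \rceil}$, $n_2 = 2^{\lfloor \log_2 n / 2 \rfloor}$
Factorized $Y$ (Cooley-Tukey algorithm):

$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$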
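The factorization can be checked numerically with a short sketch: n₂ inner DFTs of length n₁, a multiplication by the twiddle factors ω_n^(-i₁j₂), then n₁ outer DFTs of length n₂. The function names are invented for this example. For brevity it copies data into temporary lists, so it illustrates the index arithmetic only and makes no claim about cache behaviour.

```python
import cmath


def dft_direct(x):
    """Evaluate the DFT definition directly in O(n^2) operations."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
        for i in range(n)
    ]


def dft_factored(x):
    """DFT of a power-of-two length via the n = n1 * n2 factorization."""
    n = len(x)
    if n <= 2:
        return dft_direct(x)
    k = n.bit_length() - 1          # n = 2**k
    n1 = 1 << ((k + 1) // 2)        # 2**ceil(k / 2)
    n2 = 1 << (k // 2)              # 2**floor(k / 2)

    # Inner sums: for each j2, an n1-point DFT over j1 of X[j1*n2 + j2].
    inner = [dft_factored([x[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]

    # Twiddle factors omega_n^(-i1*j2).
    for j2 in range(n2):
        for i1 in range(n1):
            inner[j2][i1] *= cmath.exp(-2j * cmath.pi * i1 * j2 / n)

    # Outer sums: for each i1, an n2-point DFT over j2.
    y = [0j] * n
    for i1 in range(n1):
        outer = dft_factored([inner[j2][i1] for j2 in range(n2)])
        for i2 in range(n2):
            y[i1 + i2 * n1] = outer[i2]
    return y
```

Comparing `dft_factored(x)` with `dft_direct(x)` on a length-16 input should give agreement up to floating-point rounding.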

17 Sorting
Mergesort is not optimal with respect to cache misses.
1. Funnelsort
2. Distribution sort
Recursive
Asymptotically cache-optimal
Not every recursive sort is cache-optimal

18 Funnelsort
1. Split the input into $n^{1/3}$ contiguous arrays of size $n^{2/3}$, and sort these arrays recursively
2. Merge the $n^{1/3}$ sorted sequences using an $n^{1/3}$-merger
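The two steps give the following recursion skeleton. Important caveat: the merge step below uses the standard library's `heapq.merge` as a stand-in, not a real k-merger, so this sketch reproduces only the split-and-recurse structure and none of funnelsort's cache guarantees. The name `funnelsort_sketch` and the base-case cutoff of 8 are arbitrary choices for this example.

```python
import heapq


def funnelsort_sketch(a):
    """Sort a list using funnelsort's recursion shape with a stand-in merge."""
    n = len(a)
    if n <= 8:
        return sorted(a)
    k = max(2, round(n ** (1 / 3)))   # about n^(1/3) pieces
    size = -(-n // k)                 # ceil(n / k), about n^(2/3) elements each
    pieces = [funnelsort_sketch(a[i:i + size]) for i in range(0, n, size)]
    # Stand-in for the n^(1/3)-merger: a heap-based k-way merge.
    return list(heapq.merge(*pieces))
```

Replacing the last line with a recursively built merger and its buffers is where the actual work of funnelsort lies.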

19 k-merger
[Figure 3: Illustration of a k-merger. A k-merger is built recursively out of $\sqrt{k}$ left $\sqrt{k}$-mergers $L_1, L_2, \ldots, L_{\sqrt{k}}$, a series of buffers, and one right $\sqrt{k}$-merger $R$.]

20 Bounds
Work: $O(n \log_2 n)$
Cache misses (optimal): $O(1 + (n/L)(1 + \log_Z n))$

21 Relieved system model
LRU replacement: cache misses remain $\Theta(Q(n; Z, L))$
Multilevel cache
Inclusive cache
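To experiment with the LRU variant of the model, misses on a concrete address trace can be counted with a small simulator. This is a sketch of a fully associative, single-level LRU cache; the name `lru_misses` and the list-of-addresses trace format are assumptions of this example.

```python
from collections import OrderedDict


def lru_misses(trace, Z, L):
    """Count misses of a fully associative LRU cache of Z words with lines of L words."""
    capacity = Z // L
    cache = OrderedDict()  # resident lines, least recently used first
    misses = 0
    for addr in trace:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses
```

On the cyclic trace `[0, 1, 2, 0, 1, 2]` with room for two one-word lines, LRU misses on every access (6 misses), a simple case where it does worse than an optimal policy by a constant factor.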


23 Micro-benchmarks
[Figure 4: Average time to transpose an N × N matrix, divided by N². Series: iterative, recursive. Axes: N, time (microseconds).]
[Figure 5: Average time taken to multiply two N × N matrices, divided by N³. Series: iterative, recursive. Axes: N, time (microseconds).]
The divide-and-conquer structure was modified to produce exact powers of 2 as submatrix sizes wherever possible. In addition, the base cases were coarsened by inlining the recursion near the leaves to increase their size and overcome the overhead of procedure calls. (A good research problem is to determine an effective compiler strategy for coarsening base cases automatically.)
These results must be considered preliminary.

24 Real benchmarks [1]
[Figure: Cache misses per lookup for static search algorithms. Axes: number of items, average number of cache misses per lookup. Series: Classic Binary Search (explicit, implicit), Cache Oblivious (explicit, implicit), Cache Aware (explicit, implicit).]

25 Real benchmarks [1]
[Figure: Instruction count per lookup for static search algorithms. Axes: number of items, average number of instructions per lookup. Series: Classic Binary Search (explicit, implicit), Cache Oblivious (explicit, implicit), Cache Aware (explicit, implicit).]

26 Real benchmarks [1]
[Figure: Execution time on Windows for static search algorithms. Axes: number of items, time in microseconds per lookup. Series: Classic Binary Search (explicit, implicit), Cache Oblivious (explicit, implicit), Cache Aware (explicit, implicit).]


28 FFMK tribute slide...
The FFTW library uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line codelets, which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.

29 Open questions
Is there a gap in asymptotic complexity?
Is there a limit as to how much better a cache-aware algorithm can be?

30 Conclusion
Cache-oblivious algorithms:
Seem to be slower than cache-aware ones in practice
Provide cache optimality without knowing the cache size
Are based on recursion

31 References
[1] Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics. Springer.
[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4), 1969.


Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science.