3.2 Cache Oblivious Algorithms

Laura Greene
5 years ago


2 Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October 1999, New York, NY, USA.

3 Outline
Cache complexity
Cache aware algorithms
Cache oblivious algorithms
  Matrix multiplication
  Matrix transposition
  FFT
Conclusion

4 Assumption
Only two levels in the memory hierarchy:
  An ideal cache
    Fully associative
    Optimal replacement strategy
    Tall cache: $Z = \Omega(L^2)$
  A very large memory

5 An Ideal Cache Model
An ideal cache model (Z, L)
  Z: total number of words in the cache
  L: number of words in one cache line

6 Cache Complexity
An algorithm with input size n is measured by:
  Work complexity W(n): its conventional running time.
  Cache complexity Q(n; Z, L): the number of cache misses it incurs on an ideal cache (Z, L).
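To make Q(n; Z, L) concrete, here is a minimal sketch that counts misses for a sequence of word addresses on a fully associative cache. It is only an illustration: the function name `count_misses` is hypothetical, and LRU replacement is assumed as a stand-in for the optimal (offline) replacement strategy of the ideal-cache model.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count misses on a fully associative cache of Z words with L-word lines.

    Assumption: LRU replacement approximates the optimal replacement
    strategy of the ideal-cache model.
    """
    num_lines = Z // L
    cache = OrderedDict()  # line id -> None, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)        # hit: mark the line as most recently used
        else:
            misses += 1                    # miss: load the line
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

n, Z, L = 64, 256, 8
row_scan = [i * n + j for i in range(n) for j in range(n)]  # row-major matrix read row by row
col_scan = [i * n + j for j in range(n) for i in range(n)]  # same matrix read column by column
print(count_misses(row_scan, Z, L))  # 512, i.e. n*n/L
print(count_misses(col_scan, Z, L))  # 4096, every access misses
```

Both scans perform the same work W(n), but their cache complexities differ by a factor of L under these assumed parameters.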

8 Cache Aware Algorithms
Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L).
The parameters must be adjusted when running on a different platform.

9 Example: A blocked matrix multiplication algorithm
[Figure: an n x n matrix A partitioned into s x s blocks such as A11]
s is a tuning parameter chosen to make the algorithm run fast.

10 Example (2)
Cache complexity
The three $s \times s$ submatrices should fit into the cache together. They occupy $\Theta(\max(s, s^2/L)) = \Theta(s + s^2/L)$ cache lines.
Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
$Z/L$ cache misses are needed to bring the three submatrices into the cache.
$n^2/L$ cache misses are needed to read the $n^2$ elements.
The cache complexity is $\Theta(1 + n^2/L + (n/s)^3 (Z/L)) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.
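The blocked algorithm described above can be sketched as follows. This is an illustrative version, not the presenters' code: the name `block_mult` is an assumption, matrices are plain lists of lists, and the `min` bounds are added so that s need not divide n.

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s blocks.

    s is the tuning parameter: for good cache behaviour it should be chosen
    so that three s x s blocks fit in the cache together.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply block A[i0.., k0..] by block B[k0.., j0..]
                # and accumulate into block C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C

print(block_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1))  # [[19, 22], [43, 50]]
```

The result does not depend on s; only the memory access pattern does, which is why s has to be retuned for each platform.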

12 Cache Oblivious Algorithms
Contain no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
No tuning is needed; the algorithms are platform independent.
The algorithms presented next are proved to have optimal cache complexity.

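13 Matrix Multiplication
Partition matrices A and B in half along the largest dimension, where A is n x m and B is m x p.
  If $n \ge \max(m, p)$: split A into top and bottom halves.
  If $m \ge \max(n, p)$: split A into left and right halves and B into top and bottom halves.
  If $p \ge \max(n, m)$: split B into left and right halves.
Proceed recursively until reaching the base case of a single element.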
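The recursion on the largest dimension can be sketched as below. This is a minimal illustration under stated assumptions: the names `rec_mult` and `multiply` are hypothetical, matrices are non-empty lists of lists, and submatrices are passed as index ranges so nothing is copied.

```python
def rec_mult(A, B, C, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    n, m, p = i1 - i0, k1 - k0, j1 - j0
    if n == 0 or m == 0 or p == 0:
        return
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]  # base case: one element
        return
    if n >= max(m, p):
        mid = i0 + n // 2   # split the rows of A
        rec_mult(A, B, C, i0, mid, k0, k1, j0, j1)
        rec_mult(A, B, C, mid, i1, k0, k1, j0, j1)
    elif m >= max(n, p):
        mid = k0 + m // 2   # split the shared dimension: A1*B1 + A2*B2
        rec_mult(A, B, C, i0, i1, k0, mid, j0, j1)
        rec_mult(A, B, C, i0, i1, mid, k1, j0, j1)
    else:
        mid = j0 + p // 2   # split the columns of B
        rec_mult(A, B, C, i0, i1, k0, k1, j0, mid)
        rec_mult(A, B, C, i0, i1, k0, k1, mid, j1)

def multiply(A, B):
    """Multiply an n x m matrix A by an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C

print(multiply([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))  # [[58, 64], [139, 154]]
```

Note that no cache parameter appears anywhere in the code; the recursion alone produces subproblems of every size.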

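14 Matrix Multiplication (2)
Assume A is $n \times 4n$ and B is $4n \times n$.
First split, along the shared dimension: $A = (A_1 \; A_2)$ and $B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$, so $AB = A_1 B_1 + A_2 B_2$.
Second split: $A_1 = (A_{11} \; A_{12})$, $A_2 = (A_{21} \; A_{22})$, and B is split correspondingly, so $AB = A_{11} B_{11} + A_{12} B_{12} + A_{21} B_{21} + A_{22} B_{22}$.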

15 Matrix Multiplication (3)
Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

16 Matrix Multiplication (4)
Cache complexity
Matches the cache complexity of the cache-aware Block-MULT algorithm.
For square matrices, this cache complexity is optimal.

18 Matrix Transposition
Transpose A (m x n) into B = A^T (n x m):
for i ← 1 to m
    for j ← 1 to n
        B(j, i) = A(i, j)
If n is very large, the column-wise access to B causes a cache miss on every access (no spatial locality in B).

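19 Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.
$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, $A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$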
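A minimal sketch of the recursive transposition follows. The names `rec_transpose` and `transpose` are hypothetical, matrices are non-empty lists of lists, and the recursion runs down to single elements for simplicity (a practical version would likely stop at a small block).

```python
def rec_transpose(A, B, i0, i1, j0, j1):
    """Write the transpose of A[i0:i1, j0:j1] into B[j0:j1, i0:i1]."""
    m, n = i1 - i0, j1 - j0
    if m == 0 or n == 0:
        return
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]   # base case: one element
        return
    if m >= n:
        mid = i0 + m // 2       # rows are the longer dimension: split them
        rec_transpose(A, B, i0, mid, j0, j1)
        rec_transpose(A, B, mid, i1, j0, j1)
    else:
        mid = j0 + n // 2       # columns are the longer dimension: split them
        rec_transpose(A, B, i0, i1, j0, mid)
        rec_transpose(A, B, i0, i1, mid, j1)

def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B

print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```

Because the longer dimension is always halved, the subproblems stay roughly square, so both the block of A being read and the block of B being written eventually fit in cache at the same time.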

20 Matrix Transposition (3)
Cache complexity
It has the optimal cache complexity $Q(m, n) = \Theta(1 + mn/L)$.

21 Fast Fourier Transform
$Y[i] = \sum_{j=0}^{n-1} X[j] \, \omega_n^{-ij}$, where $\omega_n = e^{2\pi\sqrt{-1}/n}$
Use the Cooley-Tukey algorithm, which recursively re-expresses a DFT of composite size $n = n_1 n_2$ as:
1. Perform $n_2$ DFTs of size $n_1$.
2. Multiply by complex roots of unity called twiddle factors.
3. Perform $n_1$ DFTs of size $n_2$.

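22 $Y[i] = \sum_{j=0}^{n-1} X[j] \, \omega_n^{-ij}$
$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2] \, \omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$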

23 Assume X is a row-major $n_1 \times n_2$ matrix
Steps:
1. Transpose X in place.
2. Compute $n_2$ DFTs of size $n_1$.
3. Multiply by the twiddle factors.
4. Transpose X in place.
5. Compute $n_1$ DFTs of size $n_2$.
6. Transpose X in place.
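The six steps above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's procedure verbatim: the names `fft` and `transpose_flat` are hypothetical, n is assumed to be a power of 2, the DFT uses the negative-exponent convention, and the transposes copy into new lists for readability instead of working in place.

```python
import cmath

def transpose_flat(x, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list (out of place)."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """DFT of x (length a power of 2), following the six-step structure."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    k = n.bit_length() - 1          # log2(n)
    n1 = 1 << ((k + 1) // 2)        # n1 = 2^ceil(k/2)
    n2 = n // n1                    # so n = n1 * n2
    # Step 1: view x as n1 x n2 and transpose; row j2 now holds X[j1*n2 + j2].
    a = transpose_flat(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per row.
    rows = [fft(a[r * n1:(r + 1) * n1]) for r in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(-i1*j2).
    a = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2 so each size-n2 DFT input is contiguous.
    a = transpose_flat(a, n2, n1)
    # Step 5: n1 DFTs of size n2; entry (i1, i2) is Y[i1 + i2*n1].
    rows = [fft(a[r * n2:(r + 1) * n2]) for r in range(n1)]
    a = [v for row in rows for v in row]
    # Step 6: final transpose puts Y[i1 + i2*n1] at flat index i2*n1 + i1.
    return transpose_flat(a, n1, n2)

print([round(abs(v), 6) for v in fft([1, 0, 0, 0, 0, 0, 0, 0])])  # eight values of 1.0
```

For n = 8 this choice gives n1 = 4 and n2 = 2, and the size-4 subproblems split again as 2 x 2. Since the transposes here allocate new lists, the sketch shows the index arithmetic rather than the cache behaviour of the in-place version.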

24 Fast Fourier Transform
Example with n1 = 4, n2 = 2:
1. Transpose to select the n2 DFTs of size n1.
2. Call FFT recursively with n1 = 2, n2 = 2.
3. Reach the base case and return.
4. Multiply by the twiddle factors.
5. Transpose to select the n1 DFTs of size n2.
6. Transpose and return.

25 Fast Fourier Transform
Cache complexity
Optimal for a Cooley-Tukey algorithm when n is an exact power of 2:
$Q(n) = O(1 + (n/L)(1 + \log_Z n))$

26 Other Cache Oblivious Algorithms
Funnelsort
Distribution sort
LU decomposition without pivots

28 Questions
How large is the range of practicality of cache-oblivious algorithms?
What are the relative strengths of cache-oblivious and cache-aware algorithms?

29 Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided by $N^2$]

30 Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices, divided by $N^3$]

31 Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
FFTW library
No answer yet.

32 References
Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October 1999, New York, NY, USA.
Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June
Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesus Garzarán. LCPC
