Dense Matrix Multiplication
1 Dense Matrix Multiplication
Abhishek Somani, Debdeep Mukhopadhyay
Mentor Graphics, IIT Kharagpur
October 7, 2015
Abhishek, Debdeep (IIT Kgp) Matrix Mult. October 7, 2015 1 / 56
2 Overview
1 The Problem
2 Analysis
3 Improvement
4 Better Cache utilization
5 Cache Oblivious Method
6 Distributed Matrix Multiplication
4 Why is matrix multiplication important?
5 Matrix Representation
- A single array contains the entire matrix
- The matrix is arranged in row-major format
- An m × n matrix contains m rows and n columns
- A(i, j) is the matrix entry at the i-th row and j-th column of matrix A
- It is the (i·n + j)-th entry in the matrix array
6 Triple nested loop

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }

Total number of multiplications: n^3
7 Row-based data decomposition in matrix C
8 Parallel Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }
9 (Almost) Perfect Scaling for matrix of size
10 How good is the serial performance?
11 How good is the serial performance?
- Normalized time grows by almost 4x as the matrix size grows from 1000 to 6000
- Experiments done on a 3.2 GHz machine
- More than 5 clock cycles taken per double-precision multiplication!!!
13 Memory Hierarchy Model for analysis
- L lines of capacity m double-precision numbers each
- Tall Cache assumption: L > m
- Replacement Policy: Least Recently Used
- No hardware prefetching
14 Memory Access Pattern during multiplication
- A, B and C are square matrices of size n × n
- n is large, i.e., n > L
15 Memory Access Pattern for A
Sequential access: accessing a row requires n/m cache misses
16 Memory Access Pattern for B
Strided access: accessing a column requires n cache misses
17 Total cache misses
- For computing every C(i, j), the number of cache misses: 1 + n/m + n
- If n < mL, total cache misses: 2n^2/m + n^3
- If n > mL, total cache misses: n^2/m + n^3/m + n^3
18 Is n < mL a practical assumption?
- A 64-byte cache line means m = 8 doubles per line
- A 256 KB L2 cache then gives mL = 32768
- For practical problems n < 10-15k
- Thus n < mL, and the total cache misses: 2n^2/m + n^3 = Θ(n^3)
- Can this be improved?
20 Alternate Memory Access Pattern
- For computing the row C(i, :), cache misses: 2n/m + n^2/m
- Total cache misses: 2n^2/m + n^3/m = Θ(n^3/m)
21 Improved Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int k = 0; k < n; k++) {
                const int kOffset = k * n;
                for (int j = 0; j < n; ++j)
                    C[iOffset + j] += A[iOffset + k] * B[kOffset + j];
            }
        }
    }

Triple-nested loop with the order of (i, j, k) changed to (i, k, j)
22 ikj versus ijk
23 (Almost) Perfect Scaling for matrix of size
25 Blocking / Tiling
- Assumptions for analysis: n % b = 0 and b % m = 0
- Cache misses in loading a block: b^2/m
- Cache misses in computing a block of C: b^2/m + (2n/b)(b^2/m) = b^2/m + 2nb/m
- Total cache misses: (n/b)^2 (b^2/m + 2nb/m) = n^2/m + 2n^3/(mb) = Θ(n^3/(mb))
26 Choosing blocking parameter b
- The 3 blocks for A, B and C should just fit in the cache
- 3b^2 = mL, i.e., b = sqrt(mL/3)
- For an L1 cache of capacity 32 KB, mL = 4096 and b ≈ 37
- A good value for b is 32
- Total cache misses: 2n^3/(mb) = 2·sqrt(3)·n^3/(m·sqrt(mL)) = Θ(n^3/(m·sqrt(Cache Size)))
27 Tiled Multiply

    /* min as used on the slides, e.g. #define min(a,b) (((a) < (b)) ? (a) : (b)) */
    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; i += BLOCK_SIZE) {
            const int iOffset = i * n;
            for (int j = 0; j < n; j += BLOCK_SIZE) {
                for (int k = 0; k < n; k += BLOCK_SIZE) {
                    /* Correct block dimensions if block "goes off edge of" the matrix */
                    int M = min(BLOCK_SIZE, n - i);
                    int N = min(BLOCK_SIZE, n - j);
                    int K = min(BLOCK_SIZE, n - k);
                    /* Perform individual block dgemm */
                    do_block(n, M, N, K, A + iOffset + k, B + k * n + j, C + iOffset + j);
                }
            }
        }
    }
28 Tiled Multiply...

    static void do_block(int n, int M, int N, int K,
                         double* A, double* B, double* C)
    {
        for (int i = 0; i < M; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < N; ++j) {
                double cij = 0.0;
                for (int k = 0; k < K; ++k)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }
29 Tiled versus Normal
30 Tiled MT scaling
31 Instruction Level Parallelism
- Now that the data being worked upon is available in the cache closest to the processor, we can exploit some ILP
- ILP kicks in when there is a significant amount of independent work available in a single block of code
- Loop unrolling can help us achieve that
- Compilers also unroll loops, but in this case there are too many nesting levels for the compiler to do the right thing
32 Tiled Multiply with unrolling

Original inner loop:

    for (int k = 0; k < K; ++k)
        cij += A[iOffset + k] * B[k * n + j];

Unrolled by 8:

    for (int k = 0; k < K; k += 8) {
        const double d0 = A[iOffset + k]     * B[k * n + j];
        const double d1 = A[iOffset + k + 1] * B[(k + 1) * n + j];
        const double d2 = A[iOffset + k + 2] * B[(k + 2) * n + j];
        const double d3 = A[iOffset + k + 3] * B[(k + 3) * n + j];
        const double d4 = A[iOffset + k + 4] * B[(k + 4) * n + j];
        const double d5 = A[iOffset + k + 5] * B[(k + 5) * n + j];
        const double d6 = A[iOffset + k + 6] * B[(k + 6) * n + j];
        const double d7 = A[iOffset + k + 7] * B[(k + 7) * n + j];
        cij += (d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7);
    }
33 Tiled Multiply with unrolling...
34 What about the L2 cache?
- In addition to L1, blocking can be done for the L2 cache also: a 2-level tiled code
- For an L2 cache of capacity 256 KB, mL = 32768 and b ≈ 105
- The L2 cache is shared between data and instruction memory accesses
- Finding the best L2 tile size is non-trivial and requires experimentation with different sizes
- We choose 88 as the block size for L2
35 Tiling for L1 and L2
36 Unrolling diminishes the difference
37 Multithreading shock
38 Automatic tuning
- Manual optimization and tuning is tedious and error-prone
- The entire process needs to be redone in full for any new architecture
- Multi-threaded optimization adds further complexity
- Code generated automatically by parameterized code generators
- Automatically Tuned Linear Algebra Software (ATLAS)
- Portable High Performance ANSI C (PHiPAC)
- Essentially a search problem
40 A different formulation

C = A · B

    [ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

    C11 = A11·B11 + A12·B21
    C12 = A11·B12 + A12·B22
    C21 = A21·B11 + A22·B21
    C22 = A21·B12 + A22·B22

Solve for C11, C12, C21, C22 recursively
Recursion formula: W(n) = 8·W(n/2) + Θ(1), so W(n) = Θ(n^3)
41 How is it Cache oblivious?
42 Cache Miss Analysis

    Q(n) = Θ(n^2/m)          if n^2 < cmL, where 0 < c ≤ 1
    Q(n) = 8·Q(n/2) + Θ(1)   otherwise

(Recursion tree: depth d has 8^d nodes; internal nodes cost O(1) misses each, leaves O(cL).)
43 Cache Miss Analysis...

    Q(n) = Θ(n^2/m)          if n^2 < cmL, where 0 < c ≤ 1
    Q(n) = 8·Q(n/2) + Θ(1)   otherwise

- Let n_d be the value of n at depth d; the recursion bottoms out when n_d = sqrt(cmL)
- n_0 = n and n_i = n_{i-1}/2 = n/2^i, therefore n_d = n/2^d and 2^d = n/sqrt(cmL)
- d = lg(n/sqrt(cmL)) and the number of leaves l = 8^d = n^3/(cmL)^(3/2)
- Number of cache misses at the leaf level: l·Θ(cL) = Θ(n^3/(m·sqrt(mL)))
- Total number of cache misses: Θ(n^3/(m·sqrt(mL))) = Θ(n^3/(m·sqrt(Cache Size)))
- Recall that the total number of cache misses for tiled multiplication was also Θ(n^3/(m·sqrt(Cache Size)))
44 Recursive Multiply

    void multiply(const int n, const int m, double* A, double* B, double* C)
    {
        // Base cases
        if (m == 2) {
            basecase2(n, m, A, B, C);
            return;
        } else if (m == 1) {
            C[0] += A[0] * B[0];
            return;
        }
        // Recursive multiply
        const int offset12 = m / 2;
        const int offset21 = n * m / 2;
        const int offset22 = offset21 + offset12;
        multiply(n, m/2, A,          B,          C);
        multiply(n, m/2, A+offset12, B+offset21, C);
        multiply(n, m/2, A,          B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
        multiply(n, m/2, A+offset21, B,          C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
        return;
    }
45 Recursive multiply serial performance
Can be made more consistent by writing base-case code for matrices of sizes > 2
46 How far have we come?
47 Parallelization
How about 6, 10, 12, 14, 18 or 20 threads?
48 OpenMP task construct

    #pragma omp parallel
    #pragma omp single
    {
        ...
        #pragma omp task
        { ... }
        #pragma omp task
        { ... }
    }

- Tasks generated by a single thread
- Scheduling of tasks on available threads managed by OpenMP
- Useful in recursive functions, linked lists
49 task based recursive multiplication

    const int offset12 = m / 2;
    const int offset21 = n * m / 2;
    const int offset22 = offset21 + offset12;
    #pragma omp task
    {
        multiply(n, m/2, A,          B,          C);
        multiply(n, m/2, A+offset12, B+offset21, C);
    }
    #pragma omp task
    {
        multiply(n, m/2, A,          B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B,          C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
    }
50 Recursive Multiplication Scaling
51 Improved Parallel Recursive Multiplication
- Too many tasks of varying sizes
- Assignment of tasks to threads is arbitrary; it does not consider locality
- At present, gcc support for OpenMP tasks is not great
- Explicit control of task creation: create tasks only at a pre-computed depth in the recursion tree

    int criticaldepth = 0;
    int leaves = 1;
    int branches = 4;
    while (2 * numthreads > leaves) {
        criticaldepth++;
        leaves *= branches;
    }
52 Improved Parallel Recursive Multiplication...

    if (depth == criticaldepth) {
        #pragma omp task
        {
            multiply(n, m/2, A,          B,          C,          depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset21, C,          depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A,          B+offset12, C+offset12, depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B,          C+offset21, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
        }
    } else {
        multiply(n, m/2, A,          B,          C,          depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset21, C,          depth+1, criticaldepth);
        multiply(n, m/2, A,          B+offset12, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B,          C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
    }
53 Improved Parallel Recursive Multiplication...
55 Key differences from shared memory parallel methods
- Matrices can be much larger
- Each node in the network holds only a part of the matrices
- Focus is on reducing inter-process communication, and on scalability
- All methods discussed so far are applicable to the local multiplication happening at a node
56 In future classes
- Network topologies
- Analysis of the cost of communication in different topologies
- Isoefficiency function to understand scalability with number of processors and problem size
- Distributed methods
CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based
More informationParallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop
Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCache-Oblivious Traversals of an Array s Pairs
Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious
More informationCS377P Programming for Performance Single Thread Performance Caches I
CS377P Programming for Performance Single Thread Performance Caches I Sreepathi Pai UTCS September 21, 2015 Outline 1 Introduction 2 Caches 3 Performance of Caches Outline 1 Introduction 2 Caches 3 Performance
More informationCS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra
CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationTHE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing
THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2010 COMP3320/6464/HONS High Performance Scientific Computing Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable
More informationAutotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology
Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology Jaewook Shin 1, Mary W. Hall 2, Jacqueline Chame 3, Chun Chen 2, Paul D. Hovland 1 1 ANL/MCS 2 University
More informationCache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010
Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of
More information5.12 EXERCISES Exercises 263
5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from
More informationCaches III. CSE 351 Spring Instructor: Ruth Anderson
Caches III CSE 351 Spring 2017 Instructor: Ruth Anderson Teaching Assistants: Dylan Johnson Kevin Bi Linxing Preston Jiang Cody Ohlsen Yufang Sun Joshua Curtis Administrivia Office Hours Changes check
More informationKathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui
Caches III CSE 410 Winter 2017 Instructor: Justin Hsia Teaching Assistants: Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui Please stop charging your phone in public ports "Just by plugging
More informationGiving credit where credit is due
CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based
More informationPARALLEL AND DISTRIBUTED COMPUTING
PARALLEL AND DISTRIBUTED COMPUTING 2013/2014 1 st Semester 1 st Exam January 7, 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc. - Give your answers
More informationCache Memories. Lecture, Oct. 30, Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition
Cache Memories Lecture, Oct. 30, 2018 1 General Cache Concept Cache 84 9 14 10 3 Smaller, faster, more expensive memory caches a subset of the blocks 10 4 Data is copied in block-sized transfer units Memory
More informationCaches III. CSE 351 Winter Instructor: Mark Wyse
Caches III CSE 351 Winter 2018 Instructor: Mark Wyse Teaching Assistants: Kevin Bi Parker DeWilde Emily Furst Sarah House Waylon Huang Vinny Palaniappan https://what-if.xkcd.com/111/ Administrative Midterm
More informationUvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP
Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms Matteo Frigo, Charles Leiserson, Harald Prokop, Sridhar Ramchandran Slides Written and Presented by William Kuszmaul THE DISK ACCESS MODEL Three Parameters: B M P Block Size
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory bound computation, sparse linear algebra, OSKI Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh ATLAS Mflop/s Compile Execute
More informationDenison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud
Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationHPCG Performance Improvement on the K computer ~10min. brief~
HPCG Performance Improvement on the K computer ~10min. brief~ Kiyoshi Kumahata, Kazuo Minami RIKEN AICS HPCG BoF #273 SC14 New Orleans Evaluate Original Code 1/2 Weak Scaling Measurement 9,500 GFLOPS in
More informationStatistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationLast class. Caches. Direct mapped
Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place
More informationIterative Compilation with Kernel Exploration
Iterative Compilation with Kernel Exploration Denis Barthou 1 Sébastien Donadio 12 Alexandre Duchateau 1 William Jalby 1 Eric Courtois 3 1 Université de Versailles, France 2 Bull SA Company, France 3 CAPS
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationSome features of modern CPUs. and how they help us
Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before
More informationParallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS
Parallel & Concurrent Programming: ZPL Emery Berger CMPSCI 691W Spring 2006 Department of Computer Science Outline Previously: MPI point-to-point & collective Complicated, far from problem abstraction
More informationGo Multicore Series:
Go Multicore Series: Understanding Memory in a Multicore World, Part 2: Software Tools for Improving Cache Perf Joe Hummel, PhD http://www.joehummel.net/freescale.html FTF 2014: FTF-SDS-F0099 TM External
More informationCache Memories : Introduc on to Computer Systems 12 th Lecture, October 6th, Instructor: Randy Bryant.
Cache Memories 15-213: Introduc on to Computer Systems 12 th Lecture, October 6th, 2016 Instructor: Randy Bryant 1 Today Cache memory organiza on and opera on Performance impact of caches The memory mountain
More information