Determinant Computation on the GPU using the Condensation AMMCS Method / 1

Size: px

Start display at page:

Download "Determinant Computation on the GPU using the Condensation AMMCS Method / 1"

Candace Barrett
5 years ago
Views:

1 Determinant Computation on the GPU using the Condensation Method Sardar Anisul Haque Marc Moreno Maza Ontario Research Centre for Computer Algebra University of Western Ontario, London, Ontario AMMCS 20, Waterloo, July 25, 20 Determinant Computation on the GPU using the Condensation AMMCS Method20 /

2 Determinant Computation on the GPU using the Condensation AMMCS Method20 2 /

3 Plan Determinant Computation on the GPU using the Condensation AMMCS Method20 3 /

4 Dodgson s condensation Algorithm Example of a condensation step: => Reference: C. L. Dodgson, Condensation of Determinants, Proceedings of the Royal Society of London, 5(866), Determinant Computation on the GPU using the Condensation AMMCS Method20 4 /

5 Dodgson s condensation Algorithm (cont.) Condensation step (cont.) => = 20 And the determinant is: 20/ 5 = 4. Determinant Computation on the GPU using the Condensation AMMCS Method20 5 /

6 Salem and Said s condensation Algorithm Condensation step with the same example: => A formula is needed before concluding (see next slide). Reference: Abdelmalek Salem, and Kouachi Said, Condensation of Determinants, Determinant Computation on the GPU using the Condensation AMMCS Method20 6 /

7 Salem and Said s condensation Algorithm (cont.) The input of a condensation step is a matrix A = (a i,j 0 i, j n ) of order n > 2. It produces a matrix B = (b i,j 0 i, j n ) of order n such that b i,j = a 0,l a 0,j+ a i+,l a i+,j+ for j l and by b i,j = a i+,j a 0,l for j < l. The key relation between A and B is the following: det(a) = det(b)/(a 0,l ) n 2 Determinant Computation on the GPU using the Condensation AMMCS Method20 7 /

8 Salem and Said s condensation Algorithm (cont.) Returning to our example, we obtain: => 2 2 => 4 So the determinant is: 4/( ) 3 2 = 4. Determinant Computation on the GPU using the Condensation AMMCS Method20 8 /

9 Complexity estimates for Salem and Said s Algorithm For the usual RAM model, In the worst case, the work is n 3 3/2n 2 + /2n 3 coefficient operations. Asymptotically, on (Z, L) ideal cache, the number of cache misses is in the order of (n Z) ( n 2 n+z 2 Z + Zn++4 L ) Hence, the ratio between the two is L, similarly to Gaussian Elimination. L However, the condensation method is more data-oblivious which is good for the hardware scheduling of a GPU. Determinant Computation on the GPU using the Condensation AMMCS Method20 9 /

10 Plan Determinant Computation on the GPU using the Condensation AMMCS Method 20 0 /

11 Data mapping Each condensation step is performed by one kernel call. No data copied back to the host until n = 2. 2 At each condensation step, the input A and output B are stored in global memory. Shared memory is used for efficiency issues. 3 Salem and Said s Algorithm suggest to store A and B in column major layout. 4 We use a -D grid of -D thread blocks. 5 With T threads per block and t elements of B written per thread, (n ) 2 /(Tt) blocks are required. For t = 4 and n > 200 this leads to about 0, 000 threads in flight. 6 The j-th thread in the i-th block computes-and-writes B[Ttj + it, Ttj + it +,...Ttj + it + t ]. Determinant Computation on the GPU using the Condensation AMMCS Method 20 /

12 Data mapping Each condensation step is performed by one kernel call. No data copied back to the host until n = 2. 2 At each condensation step, the input A and output B are stored in global memory. Shared memory is used for efficiency issues. 3 Salem and Said s Algorithm suggest to store A and B in column major layout. 4 We use a -D grid of -D thread blocks. 5 With T threads per block and t elements of B written per thread, (n ) 2 /(Tt) blocks are required. For t = 4 and n > 200 this leads to about 0, 000 threads in flight. 6 The j-th thread in the i-th block computes-and-writes B[Ttj + it, Ttj + it +,...Ttj + it + t ]. Determinant Computation on the GPU using the Condensation AMMCS Method 20 /

13 Data mapping Each condensation step is performed by one kernel call. No data copied back to the host until n = 2. 2 At each condensation step, the input A and output B are stored in global memory. Shared memory is used for efficiency issues. 3 Salem and Said s Algorithm suggest to store A and B in column major layout. 4 We use a -D grid of -D thread blocks. 5 With T threads per block and t elements of B written per thread, (n ) 2 /(Tt) blocks are required. For t = 4 and n > 200 this leads to about 0, 000 threads in flight. 6 The j-th thread in the i-th block computes-and-writes B[Ttj + it, Ttj + it +,...Ttj + it + t ]. Determinant Computation on the GPU using the Condensation AMMCS Method 20 /

14 Finding the l-th column: finite field case A condensation step produces a matrix B = (b i,j 0 i, j n ) of order n such that b i,j = a 0,l a 0,j+ a i+,l a i+,j+ for j l and by b i,j = a i+,j a 0,l for j < l. Recall that we have det(a) = det(b)/(a 0,l ) n 2 The above formula requires a 0,l to be the first non-zero on the first row: we call it the pivot. It is computed by one kernel call. The successive pivots are in the global memory of GPU and used to compute the determinant of the original matrix. Determinant Computation on the GPU using the Condensation AMMCS Method 20 2 /

15 Finding the l-th column: finite field case A condensation step produces a matrix B = (b i,j 0 i, j n ) of order n such that b i,j = a 0,l a 0,j+ a i+,l a i+,j+ for j l and by b i,j = a i+,j a 0,l for j < l. Recall that we have det(a) = det(b)/(a 0,l ) n 2 The above formula requires a 0,l to be the first non-zero on the first row: we call it the pivot. It is computed by one kernel call. The successive pivots are in the global memory of GPU and used to compute the determinant of the original matrix. Determinant Computation on the GPU using the Condensation AMMCS Method 20 2 /

16 Finding the l-th column: floating point number Case On the first row, we choose the element p whose absolute value is the closest to : we call it the pivot. Then we divide each element of the first row by p and we have det(a) = det(b) p The successive pivots are stored in the GPU global memory. After all condensation steps are completed, the pivots are multiplied together so as to avoid overflow/underflow, if possible: Step L := {p Pivots p }; L 2 := {p Pivots p L }; m := ; Step 2 While L and L 2 not empty do m := m pop(l ) pop(l ) Step 3 Finish wih the non-empty stack, if any. Determinant Computation on the GPU using the Condensation AMMCS Method 20 3 /

17 Finding the l-th column: floating point number Case On the first row, we choose the element p whose absolute value is the closest to : we call it the pivot. Then we divide each element of the first row by p and we have det(a) = det(b) p The successive pivots are stored in the GPU global memory. After all condensation steps are completed, the pivots are multiplied together so as to avoid overflow/underflow, if possible: Step L := {p Pivots p }; L 2 := {p Pivots p L }; m := ; Step 2 While L and L 2 not empty do m := m pop(l ) pop(l ) Step 3 Finish wih the non-empty stack, if any. Determinant Computation on the GPU using the Condensation AMMCS Method 20 3 /

18 Finding the l-th column: floating point number Case On the first row, we choose the element p whose absolute value is the closest to : we call it the pivot. Then we divide each element of the first row by p and we have det(a) = det(b) p The successive pivots are stored in the GPU global memory. After all condensation steps are completed, the pivots are multiplied together so as to avoid overflow/underflow, if possible: Step L := {p Pivots p }; L 2 := {p Pivots p L }; m := ; Step 2 While L and L 2 not empty do m := m pop(l ) pop(l ) Step 3 Finish wih the non-empty stack, if any. Determinant Computation on the GPU using the Condensation AMMCS Method 20 3 /

19 Plan Determinant Computation on the GPU using the Condensation AMMCS Method 20 4 /

20 Experimental setup The order of our test matrices varies from 0 to We conduct all our experiments on a GPU NVIDIA Tesla 2050 C. Our GPU code is written using CUDA. Our CPU is intel core 2 processor Q6600. It has L2 cache of 8MB and the CPU frequency is 2.40 GHz. Reference: NVIDIA developer zone, Determinant Computation on the GPU using the Condensation AMMCS Method 20 5 /

21 Effective memory bandwidth We use effective memory bandwidth to evaluate our GPU code. The effective memory bandwidth (measured in GB/seconds) of a kernel run is amount of data traversed in the global memory of GPU during the kernel run divided by the running time of the kernel. It is compared against a simple CUDA code, called copy kernel, that just performs one copy memory from one place to other place in the global area of GPU. Reference: Greg Ruetsch, and Paulius Micikevicius, Optimizing Matrix Transpose in CUDA, NVIDIA Corporation, Determinant Computation on the GPU using the Condensation AMMCS Method 20 6 /

22 CPU Vs GPU Time (s) Condensation Method for determinant CPU Vs GPU CPU GPU matrix order Determinant Computation on the GPU using the Condensation AMMCS Method 20 7 /

23 CPU Vs GPU (cont. ) Time (s) Condensation Method for determinant CPU Vs GPU CPU GPU matrix order Determinant Computation on the GPU using the Condensation AMMCS Method 20 8 /

24 Effective Memory Bandwidth (cont.) Bandwidth (GB/s) Memory Bandwidth of Condensation Method Bandwidth (GB/s) matrix order Determinant Computation on the GPU using the Condensation AMMCS Method 20 9 /

25 Finite Filed Case time (s) Condensation Vs Maple code for computing determinant Maple Condensation method matrix order Reference: Maple: Determinant Computation on the GPU using the Condensation AMMCS Method /

26 Finite Filed Case 2 time (s) Condensation Vs NTL code for computing determinant NTL Condensation method matrix order Reference: NTL: A library for doing number theory, Determinant Computation on the GPU using the Condensation AMMCS Method 20 2 /

27 Floating point number Case Determinant on MAPLE Vs Condensation Method on GPU MAPLE GPU Time (s) matrix order Determinant Computation on the GPU using the Condensation AMMCS Method /

28 Floating point number Case (cont.) Determinant on MAPLE Vs Condensation Method on GPU GPU MAPLE Time (s) matrix order Determinant Computation on the GPU using the Condensation AMMCS Method /

29 Floating point number Case 2 Determinant on MATLAB Vs Condensation Method on GPU MATLAB GPU Time (s) Reference: Matlab: matrix order Determinant Computation on the GPU using the Condensation AMMCS Method /

30 Hilbert Matrices In order to investigate the numerical stability of our GPU implementation of the condensation method, we use the infamous Hilbert matrix H ij = i+j, which is a canonical example of ill-conditioned (and invertible) matrix. For example, for n = 5, we have Determinant Computation on the GPU using the Condensation Method AMMCS /

31 Hilbert Matrices (cont.) Matrix order MAPLE MATLAB Condensation on GPU software double double floats floats floats plus lists e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e-369 Table: Determinant of Hilbert Matrix by MAPLE, MATLAB, and condensation method on both CPU and GPU. Determinant Computation on the GPU using the Condensation AMMCS Method /

32 Hilbert Matrices (cont.) Matrix order MAPLE MATLAB Condensation Method on GPU Table: Time(s) Required to compute determinant of Hilbert Matrix by MAPLE, MATLAB, and condensation method on both CPU and GPU. Determinant Computation on the GPU using the Condensation AMMCS Method /

33 Plan Determinant Computation on the GPU using the Condensation AMMCS Method /

34 Conclusion The condensation method implemented on GPU is a promising candidate to compute determinant of matrices with both modular integer and floating point number coefficients. We believe that it can be used to improve the efficiency, in terms of running time and numerical stability, of existing mathematical software packages. Acknowledgements. We are grateful to Dr. Wei Pan and Dr. Jürgen Gerhard for helpful discussion. Determinant Computation on the GPU using the Condensation AMMCS Method /

35 Thank you Determinant Computation on the GPU using the Condensation AMMCS Method /

Determinant Computation on the GPU using the Condensation Method

Determinant Computation on the GPU using the Condensation Method Sardar Anisul Haque, Marc Moreno Maza 2 University of Western Ontario, London N6A M8, Canada E-mail: shaque4@csd.uwo.ca, 2 moreno@csd.uwo.ca