Applications of Berkeley's Dwarfs on Nvidia GPUs



1 Applications of Berkeley's Dwarfs on Nvidia GPUs. Seminar: Topics in High-Performance and Scientific Computing. Team N2: Yang Zhang, Haiqing Wang


2 Overview: CUDA; The Dwarfs (Dynamic Programming, Sparse Linear Algebra, Unstructured Grids, Combinational Logic, Graphical Model); Summary


3 CUDA. A parallel computing platform and programming model for GPGPU. It supports various languages including C/C++ and Fortran. Lots of libraries are available (e.g. cuSPARSE, cuBLAS, NPP).

4 CUDA: Execution Model. Each thread gets an ID. A group of threads forms a block, and a group of blocks forms a grid. Each thread is executed by a core, and each block is executed by a streaming multiprocessor (SM). A block is further split into warps. Blocks are independent of each other.
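The thread, block and warp hierarchy can be made concrete with a small emulation. The following is only a sketch in plain Python, not CUDA code: it assumes a one-dimensional grid of one-dimensional blocks and a warp size of 32, and the helper name `thread_layout` is hypothetical.

```python
WARP_SIZE = 32  # assumed warp width (threads per warp)

def thread_layout(grid_dim, block_dim):
    """List (block index, thread index, global ID, warp index in block)
    for a 1-D grid of 1-D blocks, in the order a serial loop visits them."""
    layout = []
    for block_idx in range(grid_dim):          # blocks are independent
        for thread_idx in range(block_dim):    # threads inside one block
            # Usual CUDA-style global ID: blockIdx * blockDim + threadIdx
            global_id = block_idx * block_dim + thread_idx
            warp_in_block = thread_idx // WARP_SIZE
            layout.append((block_idx, thread_idx, global_id, warp_in_block))
    return layout

layout = thread_layout(grid_dim=2, block_dim=64)
print(layout[0])   # first thread of the first block
print(layout[-1])  # last thread of the last block
```

With 2 blocks of 64 threads, each block is split into two warps and the global IDs run from 0 to 127.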

5 CUDA: Memory Model. Each thread has a private local memory. Each block has a shared memory, which allows communication between its threads. All threads can access the global memory. Constant memory is a read-only memory.


7 Dynamic Programming [1]: Matrix Chain Product. Goal: minimize the total number of multiplications. An example with two parenthesizations of the same chain: ((A1 A2 A3 A4) (A5 A6)) needs 2*9*3 + 2*3*1 + 2*1*4 + 4*11*5 + 2*4*5 = 328 multiplications, whereas (A1 (A2 A3) (A4 A5) A6) needs 9*3*1 + 2*9*1 + 1*4*11 + 2*1*11 + 2*11*5 = 221.

8 Dynamic Programming [1]: Algorithm

9 Dynamic Programming [1]: Algorithm (n=6). Figures: table m and table s.
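Since the algorithm slides survive only as titles, the following sketch shows the textbook dynamic program for the matrix chain product, filling a cost table `m` and a split table `s`. It is not taken from [1]. The dimension list `p` is an assumption inferred from the terms of the matrix chain example (A1 is 2x9, A2 is 9x3, A3 is 3x1, A4 is 1x4, A5 is 4x11, A6 is 11x5), and `l` here denotes the chain length j - i + 1.

```python
def matrix_chain_order(p):
    """Textbook O(n^3) DP. Matrix A_i has dimensions p[i-1] x p[i].
    Returns 1-based tables m (minimal cost) and s (best split point)."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):              # chain length
        for i in range(1, n - l + 2):      # entries of one diagonal
            j = i + l - 1
            best_cost, best_k = None, None
            for k in range(i, j):          # try every split A_i..A_k | A_k+1..A_j
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if best_cost is None or cost < best_cost:
                    best_cost, best_k = cost, k
            m[i][j], s[i][j] = best_cost, best_k
    return m, s

p = [2, 9, 3, 1, 4, 11, 5]   # assumed dimensions, inferred from the example
m, s = matrix_chain_order(p)
print(m[1][6], s[1][6])      # minimal cost for A1..A6 and its top-level split
```

Under these assumed dimensions the optimum is lower than both parenthesizations in the example, because the best top-level split falls at the 1-wide dimension between A3 and A4.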

10 Dynamic Programming [1]: Implementation. Table m (n=8): the entries of one diagonal are computed independently of each other, so they can be computed in parallel.

11 Dynamic Programming [1]: Implementation. The amount of computation for each l is determined by the number of (i,j) for each l and by the number of k for each (i,j) of each l, so the performance depends on various factors. Three different kernels are therefore used: OneThreadPerOneEntry, OneBlockPerOneEntry and BlocksPerOneEntry.
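How the amount of work changes with l can be tabulated directly. The sketch below assumes that l is the chain length j - i + 1 (the slides may index l differently) and uses a hypothetical helper name; it counts table entries and split points only, not actual GPU time.

```python
def workload_per_length(n):
    """For each chain length l, return (l, number of entries (i,j),
    number of split points k per entry, total candidate terms)."""
    rows = []
    for l in range(2, n + 1):
        entries = n - l + 1      # independent entries on this diagonal
        splits = l - 1           # candidate k values per entry
        rows.append((l, entries, splits, entries * splits))
    return rows

rows = workload_per_length(8)
for row in rows:
    print(row)
```

Short chains offer many independent entries with little work each, while long chains offer few entries with many candidates each, which is one plausible reading of why a single kernel does not fit every l.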


12 Dynamic Programming [1]: Implementation, OneThreadPerOneEntry. Allocates one thread to compute one entry. For example, $m_{1,5}$, $m_{2,6}$, $m_{3,7}$ and $m_{4,8}$ are each computed concurrently, and all of them use previously computed entries. The memory mapping is changed. Figure: memory mapping direction.

13 Dynamic Programming [1]: Implementation, OneThreadPerOneEntry. Allocates one thread to compute one entry. For example, $m_{1,5}$, $m_{2,6}$, $m_{3,7}$ and $m_{4,8}$ are each computed concurrently by one core. All of them use previous entries in shared memory, and each result is stored in global memory after computing. Figures: layout stored in global memory; CUDA architecture.

14 Dynamic Programming [1]: Implementation, OneBlockPerOneEntry. Allocates one block to compute one entry. For example, $m_{1,5} = \min_{1 \le k < 5} (m_{1,k} + m_{k+1,5} + p_0 p_k p_5)$ is computed by one streaming multiprocessor: each term $m_{1,k} + m_{k+1,5} + p_0 p_k p_5$ is computed by one core, and another core is used for the selection. Figure: CUDA architecture.

15 Dynamic Programming [1]: Implementation, BlocksPerOneEntry. Allocates multiple blocks to compute one entry. For example, $m_{1,5} = \min_{1 \le k < 5} (m_{1,k} + m_{k+1,5} + p_0 p_k p_5)$ is computed by a few streaming multiprocessors: each term $m_{1,k} + m_{k+1,5} + p_0 p_k p_5$ is computed by one core, possibly on different streaming multiprocessors, and another core in any streaming multiprocessor is used for the selection. Figure: CUDA architecture.
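The difference between the per-entry strategies can be emulated serially. This is a sketch under stated assumptions, not the kernels from [1]: all function names are hypothetical, list comprehensions stand in for cores working in parallel, and the dimension list `p` is the one inferred from the matrix chain example.

```python
def candidates(m, p, i, j):
    """One term per split point k; on the GPU each term would go to one core."""
    return [m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j] for k in range(i, j)]

def one_block_per_entry(m, p, i, j):
    """All terms of one entry handled together, then a single selection (min)."""
    return min(candidates(m, p, i, j))

def blocks_per_entry(m, p, i, j, num_blocks=2):
    """Terms split into chunks (one chunk per block); each chunk yields a
    partial minimum, and a final selection combines the partial results."""
    terms = candidates(m, p, i, j)
    chunk = -(-len(terms) // num_blocks)   # ceiling division
    partial = [min(terms[b:b + chunk]) for b in range(0, len(terms), chunk)]
    return min(partial)

def fill_table(p, entry_kernel):
    """Fill table m diagonal by diagonal using the given per-entry strategy."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            m[i][i + l - 1] = entry_kernel(m, p, i, i + l - 1)
    return m

p = [2, 9, 3, 1, 4, 11, 5]   # assumed dimensions, inferred from the example
cost_one_block = fill_table(p, one_block_per_entry)[1][6]
cost_multi_block = fill_table(p, blocks_per_entry)[1][6]
print(cost_one_block, cost_multi_block)
```

Both strategies return the same table; they differ only in how the minimum over k is partitioned, which is what matters for occupancy on the real hardware.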


16 Dynamic Programming [1]: Evaluation. GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors with 32 processing cores each), 1.4 GHz, 3 GB memory. Figure: total time of each kernel for different numbers of threads and blocks (n = 16384).

17 Dynamic Programming [1]: Evaluation. GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors with 32 processing cores each), 1.4 GHz, 3 GB memory. Figures: running time of each kernel (OneThreadPerOneEntry, OneBlockPerOneEntry, BlocksPerOneEntry) as a function of l; fastest kernel for different l.

18 Dynamic Programming [1]: Evaluation GPU vs. CPU. GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors with 32 processing cores each), 1.4 GHz, 3 GB memory, running a combination of the three kernels (the fastest kernel for each l). CPU: Intel Core i7 870, 2.93 GHz, 8 GB memory, running a sequential program in C. The total computing time is compared for a single problem size n. The speedup factor is unfair.



20 Sparse Linear Algebra [2]. Goal: accelerate the sparse matrix-matrix (SpMM) product on the GPU. SpMM product: compute C = AB, where A is a sparse matrix and B is a dense matrix.

21 Sparse Linear Algebra [2]: FastSpMM. Approach: an extension of the ELLR-T kernel called FastSpMM. It relies on the ELLPACK-R storage format and outperforms common libraries for SpMM (e.g. cuSPARSE).
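To make the storage format concrete, here is a minimal serial sketch of an SpMM product C = AB on an ELLPACK-R-like layout. It assumes the common description of ELLPACK-R (padded per-row value and column-index arrays plus an array of row lengths); it is not the FastSpMM kernel from [2], the function names are hypothetical, and the column-major layout GPUs use for coalesced access is not modeled.

```python
def to_ellpack_r(dense):
    """Convert a dense matrix (list of rows) to padded value and column
    arrays plus row_len, the number of real nonzeros in each row."""
    row_len = [sum(1 for v in row if v != 0) for row in dense]
    width = max(row_len) if row_len else 0
    values, cols = [], []
    for row in dense:
        nonzeros = [(v, c) for c, v in enumerate(row) if v != 0]
        pad = width - len(nonzeros)
        values.append([v for v, _ in nonzeros] + [0] * pad)
        cols.append([c for _, c in nonzeros] + [0] * pad)
    return values, cols, row_len

def spmm_ellpack_r(values, cols, row_len, B):
    """C = A * B with A in the layout above and B dense."""
    n_cols_b = len(B[0])
    C = []
    for r in range(len(values)):       # rows are independent of each other
        out = [0] * n_cols_b
        for t in range(row_len[r]):    # row_len lets the loop skip padding
            a, c = values[r][t], cols[r][t]
            for j in range(n_cols_b):
                out[j] += a * B[c][j]
        C.append(out)
    return C

A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 6]]
B = [[1, 2],
     [3, 4],
     [5, 6]]
values, cols, row_len = to_ellpack_r(A)
C = spmm_ellpack_r(values, cols, row_len, B)
print(row_len, C)
```

The row-length array is what distinguishes this layout from plain ELLPACK: each row stops after its own nonzeros instead of iterating over the padding.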


22 Sparse Linear Algebra [2]: Evaluation SpMM. Three versions of SpMM routines are evaluated on two Nvidia GPUs (GTX480 and Tesla C2050) with NxN test sparse matrices: FastSpMM vs. ELLR-T (ELLPACK-R storage format) vs. cuSPARSE (CRS storage format).

23 Sparse Linear Algebra [2]: Evaluation GPU vs. CPU. GTX480 and Tesla C2050 using FastSpMM vs. Intel Xeon E5640 with 4 cores using the MKL library. Table: runtimes (in seconds) on test matrices. Speedups compared to CPU: GTX480: 2,8 6,2; Tesla C2050: 1,7 3,


25 Unstructured Grids [3]: Compressible Flows. Simulation of compressible flows on 3-D unstructured grids. Compressible flow is the branch of fluid mechanics that deals with flows having significant changes in fluid density. An example: subsonic flow past a sphere.


26 Unstructured Grids [3]: DG Method. Discontinuous Galerkin (DG) methods form a class of numerical methods for solving differential equations. The DG method can be implemented in parallel. An example: subsonic flow past a sphere.

27 Unstructured Grids [3]: Evaluation GPU vs. CPU. GPU: NVIDIA Tesla K20c containing 2496 CUDA cores (OpenACC-based program). CPU: AMD Opteron 6128 containing 16 cores (MPI-based parallel program). Table: timing measurements for subsonic flow past a sphere, where Nelem is the number of elements and Ntime is the number of time steps.


29 Combinational Logic [4]: Parallel AES. Goal: efficient encryption and decryption of data streams in web server applications. Approach: design of a parallel AES on the GPU. Two design choices: fine-grained, which focuses on thread-level parallelism and requires a lot of communication and synchronization; and coarse-grained, which focuses on higher-level parallelism, i.e. blocks.

30 Combinational Logic [4]: Evaluation. Comparison: fine-grained vs. coarse-grained on an Nvidia 8800 GT (112 cores).

31 Combinational Logic [4]: Evaluation GPU vs. CPU. Throughput (in Mbps) comparisons on two Nvidia GPUs and two high-end CPUs (as of 2009). The CPU implementation is taken from the OpenSSL toolkit.


33 Graphical Model [5]: Speech Recognition System. ANN: Artificial Neural Network. HMM: Hidden Markov Model. The ANN model recognizes the acoustic content of a time frame (a word or a phoneme). The HMM model warps and adjusts the whole acoustic signal by combining these words or phonemes from the ANN.

34 Graphical Model [5]: ANN Training. Input: a vector that represents the acoustic content of a time frame. Output: a vector that represents the most likely corresponding word or phoneme. Hidden vector = input vector × weight vector 1; output vector = hidden vector × weight vector 2 (inner products). Training is the process of adjusting weight vector 1 and weight vector 2.

35 Graphical Model [5]: Block ANN Training. Input: a matrix made up of many input vectors. Output: a matrix made up of many output vectors. Hidden matrix = input matrix × weight vector 1; output matrix = hidden matrix × weight vector 2. Training can therefore be expressed with linear algebra operations.
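The step from per-vector to block processing can be checked on a toy example. The sketch below is not the implementation from [5]: it reads "weight vector 1" and "weight vector 2" as weight matrices `W1` and `W2` (an assumption), omits any activation function just as the slides do, and uses made-up numbers.

```python
def matmul(X, W):
    """Plain triple-loop matrix product of X (r x d) and W (d x h)."""
    rows, inner, cols = len(X), len(W), len(W[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for t in range(inner):
                out[r][c] += X[r][t] * W[t][c]
    return out

def forward_single(x, W1, W2):
    """One input vector at a time: two rounds of inner products."""
    hidden = [sum(x[t] * W1[t][c] for t in range(len(x))) for c in range(len(W1[0]))]
    return [sum(hidden[t] * W2[t][c] for t in range(len(hidden))) for c in range(len(W2[0]))]

W1 = [[1, 0], [0, 1], [1, 1]]   # hypothetical 3 x 2 weights
W2 = [[2], [3]]                 # hypothetical 2 x 1 weights
X = [[1, 2, 3], [4, 5, 6]]      # two input vectors stacked as rows

per_vector = [forward_single(x, W1, W2) for x in X]
block = matmul(matmul(X, W1), W2)   # same work as two matrix-matrix products
print(per_vector, block)
```

Stacking the inputs turns many small inner products into two matrix-matrix products, which is the shape of work that BLAS-style GPU libraries handle well.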

36 Graphical Model [5]: Evaluation GPU vs. CPU. GPU: NVIDIA GTX280 in a system with 1600 MHz FSB and 8 GB RAM (cuBLAS library). CPU: a quad-core 3.0 GHz CPU (Intel MKL library). Table: training time and relative speed-up for the WSJ0 corpus.

37 Summary. What is it good for? It provides extremely high parallelism, accelerates scientific computations by a considerable factor, reduces CPU workload, and achieves high performance at low cost. Learning curve? Rather smooth, since languages like C/C++ are supported, but precise knowledge of the hardware architecture is necessary. Easy to implement? Fairly easy: basically a C implementation with some added keywords and CPU/GPU memory management. Example task: given a scalar α and two vectors x and y, compute x = αx + y. Disclaimer: some comparisons to the CPU are not really representative or not clearly specified.
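The operation x = αx + y from the summary is a typical data-parallel task: every element can be updated independently. The following is a serial Python sketch with a hypothetical function name, not the CUDA code the slide may have shown; each loop iteration corresponds to the work one GPU thread would do for its index.

```python
def scaled_add_in_place(alpha, x, y):
    """Overwrite x with alpha * x + y, element by element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):   # on a GPU, index i would map to one thread
        x[i] = alpha * x[i] + y[i]
    return x

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = scaled_add_in_place(2.0, x, y)
print(result)
```

Note that the form given on the slide overwrites x, whereas the BLAS AXPY routine computes y = αx + y and overwrites y.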

38 References 1
[1] K. Nishida, Y. Ito, K. Nakano. Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU. Networking and Computing (ICNC), 2011 Second International Conference on, Nov./Dec.
[2] F. Vazquez, G. Ortega, J. J. Fernandez, I. Garcia and E. M. Garzon. Fast sparse matrix matrix product based on ELLR-T and GPU computing. Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, July.
[3] Y. Xia, H. Luo, L. Luo, J. Edwards, J. Lou and F. Mueller. OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method. 52nd Aerospace Sciences Meeting, January.

39 References 2
[4] A. di Biagio, A. Barenghi, G. Agosta, G. Pelosi. Design of a Parallel AES for Graphics Hardware using the CUDA framework. Parallel & Distributed Processing, IPDPS IEEE International Symposium on, pp. 1-8, May.
[5] S. Scanzio, S. Cumani, R. Gemello, F. Mana, P. Laface. Parallel implementation of artificial neural network training. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, March 2010.

40 Credits. Yang Zhang: Haiqing Wang: CUDA, Sparse Linear Algebra, Combinational Logic, Summary; Dynamic Programming (in detail), Unstructured Grids, Graphical Model
