Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes


Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes

Pramod Kumbhar

August 19, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

PETSc is a scalable solver library developed at Argonne National Laboratory (ANL). It is widely used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). GPU support has recently been added to PETSc to exploit the performance of GPUs. This support is quite new and currently only available in the PETSc development release. The goal of this MSc project is to evaluate the performance of the current GPU implementation, especially the iterative solvers, on the HECToR GPU cluster. In the current implementation, a new sub-class of matrix was added which stores matrices in Compressed Sparse Row (CSR) format. We have extended the current PETSc GPU implementation to improve performance using different sparse matrix storage schemes such as ELL, Diagonal and Hybrid. For structured matrices, the current GPU implementation shows a 4x speedup compared to an Intel Xeon quad-core CPU. For multi-GPU applications, the speedup starts decreasing due to high communication costs on the HECToR GPU cluster. Our implementation with the new storage schemes shows a 50% performance improvement on sparse matrix-vector operations. For structured matrices, the new implementation shows a 7x speedup and significantly improves the performance of vector operations on the GPU.

Contents

Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Related Work
  1.4 Contributions and Outline
  1.5 Change in Project Plan

Chapter 2 Background
  2.1 GPGPU
  2.2 GPU Programming Models
    2.2.1 CUDA
  2.3 CUSP and Thrust
    2.3.1 Thrust
    2.3.2 CUSP
  2.4 Partial Differential Equations: Source of Sparse Matrices
  2.5 Iterative Methods for Sparse Linear Systems

Chapter 3 PETSc GPU Implementation
  3.1 PETSc
    3.1.1 PETSc Kernels
    3.1.2 PETSc Components
    3.1.3 PETSc Object Design
  3.2 PETSc GPU Implementation
    3.2.1 Sequential Implementation
    3.2.2 Parallel Implementation
  3.3 Applications Running with PETSc GPU Support

Chapter 4 Sparse Matrices
  4.1 Sparse Matrix Representation
  4.2 Sparse Matrix Storage Schemes
    4.2.1 Coordinate List
    4.2.2 Compressed Sparse Row
    4.2.3 Diagonal
    4.2.4 ELL or Padded ITPACK
    4.2.5 Hybrid
    4.2.6 Jagged Diagonal Storage (JDS)
    4.2.7 Skyline or Variable Band
  4.3 Performance of Storage Schemes

Chapter 5 Implementation of Sparse Storage Support in PETSc
  Design Approach
  Implementation Details
    New Matrix Types for GPU
    PETSc Mat Object
    New User Level API
    PETSc Mat Objects on GPU
    Conversion of PETSc MatAIJ to CUSP CSR
    Conversion of PETSc MatAIJ to CUSP DIA/ELL/HYB/COO
    Matrix-Vector Multiplication for Different Sparse Formats
    Other Important Notes
  Sample Use Case and Validation

Chapter 6 Wrapper Codes and Benchmarks
  Testing Codes
  Matrix Market to PETSc Binary Format
  Benchmarking Codes
  Benchmarking Approach

Chapter 7 Performance Analysis
  Benchmarking System
  Single GPU Performance
    Structured Matrices
    Semi-Structured Matrices
    Unstructured Matrices
  7.3 Multi-GPU Performance
  Comparing Multi-GPU Performance with HECToR
  CUSP Matrix Conversion Cost

Chapter 8 Discussion
  Challenges for Multi-GPU Parallelisation
    CPU-GPU and GPU-GDRAM Memory Transfer
    GPU-GPU Communication
  Future Work

Chapter 9 Conclusion

Bibliography

List of Figures

Figure 1.1: Main stages involved in a single iteration of the Fluidity framework [7]
Figure 1.2: 1-D model problem with mesh spacing of and domain [-10, 10]
Figure 3.1: PETSc Kernels (implementation: petsc/src/sys)
Figure 3.2: PETSc Library Organisation [28]
Figure 3.4: PETSc Objects and Application Level Interface
Figure 3.5: VecAXPY implementation in PETSc using CUSP & CUBLAS [11]
Figure 3.6: Parallel Matrix with on-diagonal and off-diagonal elements for two MPI processes
Figure 3.7: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]
Figure 4.1: MxN: 4,690,002x4,690,002 NNZ: 20,316,253 id:
Figure 4.2: MxN: 1,391,349x1,391,349 NNZ: 64,531,701 id:
Figure 4.3: MxN: 3,542,400x3,542,400 NNZ: 96,845,792 id:
Figure 4.4: MxN: 4,802,000x4,802,000 NNZ: 85,362,744 id:
Figure 4.5: MxN: 16,614x16,614 NNZ: 1,096,948 id:
Figure 4.6: MxN: 999,999x999,999 NNZ: 4,995,991 id:
Figure 4.7: MxN: 1,489,752x1,489,752 NNZ: 10,319,760 id:
Figure 4.8: MxN: 1,971,281x1,971,281 NNZ: 5,533,214 id:
Figure 4.9: MxN: 1,61,070x1,61,070 NNZ: 8,185,136 id:
Figure 4.10: MxN: 1,20,216x1,20,216 NNZ: 3,121,160 id:
Figure 5.1: PETSc Objects creation in current GPU implementation
Figure 5.2: PETSc Object creation and new sparse matrix support in new GPU implementation
Figure 5.3: New User level API registration with Mat class (petsc/src/mat/interface/matreg.c)
Figure 5.4: Modified Mat_SeqAIJCUSP class with ELL, DIA, HYB and COO storage support using the CUSP library
Figure 5.5: Converting PETSc AIJ Matrix to CUSP CSR matrix
Figure 5.6: Transparent conversion between different sparse formats with CUSP
Figure 5.7: Converting PETSc MatAIJ to CUSP ELL matrix (algorithmic details)
Figure 5.8: Converting PETSc MatAIJ to CUSP ELL format (algorithmic implementation)
Figure 5.9: Sparse Matrix-Vector operation support for different matrix formats using CUSP
Figure 5.10: Simple example of KSP with the use of new sparse matrix format
Figure 5.11: Convergence of KSP CG solver with different sparse matrix formats on CPU & GPUs for a simple example of a 2-D Laplacian from PETSc
Figure 6.1: Converting Matrix Market format to PETSc binary format (algorithmic implementation)
Figure 7.1: HECToR GPGPU Testbed System consisting of NVIDIA and AMD GPUs connected by an Infiniband network
Figure 7.2: Tesla C2050/C2070 Specification [58]
Figure 7.3: Total execution time with different sparse matrix formats on GPU (using GMRES method)
Figure 7.4: Performance with different sparse matrix formats on GPU (using GMRES method)
Figure 7.5: Execution time of SpMV with different sparse matrix formats
Figure 7.6: Execution time of SpMV+VecMDot+VecMAXPY with different sparse matrix formats on GPU
Figure 7.7: Performance on CPU with CSR, GPU with CSR and GPU with DIA
Figure 7.8: Achieved speedup compared to Intel Xeon quad-core
Figure 7.9: Execution time of different sparse matrix formats for semi-structured matrix on GPU
Figure 7.10: Performance for Semi-Structured Matrices
Figure 7.11: Sparse Matrix-Vector execution time for different sparse matrix formats
Figure 7.12: Unstructured matrix of size x with 36,816,170 non-zero elements
Figure 7.13: Total execution time on GPU with CSR and HYB format
Figure 7.14: Performance of CSR and HYB on the GPU
Figure 7.15: Execution time of SpMV on CPU (CSR), GPU (CSR) and GPU (HYB)
Figure 7.16: Performance on HECToR GPU cluster with CSR and DIA matrix format
Figure 7.17: Performance with CSR and DIA matrix formats with different numbers of GPUs
Figure 7.18: Execution time for SpMV using CSR and DIA matrix formats on HECToR GPU
Figure 7.19: Performance comparison between HECToR GPU system
Figure 8.1: Overall system architecture considering bandwidth of different sub-systems (pre Sandy-Bridge architecture)
Figure 8.2: HECToR GPU: Infiniband network with switched fibre topology (schematic layout)
Figure 8.3: Speedup using the default Block Jacobi preconditioner on CPU and GPU with CSR, ELL
Figure 8.4: Diagonal matrix with few independent nonzero numbers
Figure 8.5: User implemented SpMV in PETSc using MatShell (design)

Acknowledgements

I am very grateful to Dr. Michele Weiland and Dr. Chris Maynard for their advice and supervision during this dissertation. I would also like to thank Dr. Lawrence Mitchell for providing valuable advice during project discussions. I am also indebted to my friends and family for their continued support during my study.


Chapter 1 Introduction

1.1 Background

PETSc (Portable, Extensible Toolkit for Scientific Computation) is an open source, scalable solver library developed over the past twenty years at Argonne National Laboratory (ANL). It is used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). Developing parallel, nontrivial PDE solvers for high-end computing systems that scale over thousands of processors is still a difficult and time-consuming task. PETSc is designed to ease this task and reduce development time. It provides parallel algorithms, debugging support and a low-overhead profiling interface that help in the development of large and complex applications. PETSc has been used to solve large linear systems with 500B unknowns on supercomputers like Jaguar and Jugene with more than 200K processors [1]. It is used in the modelling of many scientific applications in the areas of geosciences, computational fluid dynamics, weather modelling, seismology, surface water flow, polymer injection modelling etc. We will discuss one such application of PETSc, called Fluidity, that we have analysed.

Fluidity is an open source, general purpose computational fluid dynamics framework [2] developed by the Applied Modelling and Computational Group (AMCG) at Imperial College London. This framework is used in many scientific simulations in the areas of fluid dynamics, ocean modelling, atmospheric modelling etc. It solves the Navier-Stokes equations [3] on arbitrary unstructured, adaptive meshes using finite element methods. While solving this system, we impose a grid on the problem domain to calculate the numerical solution of the PDEs. The accuracy and computational cost of the solution depend on the grid spacing. To compute accurate solutions, one has to use a finer grid, but this increases the computational cost. Hence the Adaptive Mesh Refinement (AMR) technique is used to reduce the computational cost. AMR uses a coarse grid at the start of the simulation and, as the solution progresses, it identifies areas of interest (i.e. parts of the grid which exhibit large changes in the solution) where the grid needs to be refined. These methods are discussed in more detail in [4], [5], [6]. Figure 1.1 shows the main stages involved in simulations using the Fluidity framework:

Figure 1.1: Main stages involved in a single iteration of the Fluidity framework [7] (mesh generation, assembly phase, solver phase, update solution, output solution, repeated each timestep)

During a simulation, a new mesh may need to be generated using AMR techniques to maintain the accuracy of the solution. During the assembly stage, a system of simultaneous equations is assembled using the finite element mesh. In the solver stage, the system of equations assembled in the assembly stage is solved using iterative methods. Fluidity uses iterative solvers from PETSc to solve the large sparse systems. The sparse matrices that arise are positive definite, non-symmetric and hence the Generalised Minimum Residual (GMRES) algorithm is normally used [7]. The update stage involves updating the solution variables, calculating new timesteps and estimating the error. Finally, the current solution can be written to disk. All these stages are discussed in more detail in [7] and [8]. Currently, Fluidity uses various libraries like MPI (Message Passing Interface), ParMETIS and PETSc to support parallelisation [9].

1.2 Motivation

Despite various parallelisation and optimisation techniques, simulations of complex phenomena like tidal modelling or tsunami simulation take from hours to a few days on modern supercomputers. For Fluidity, the main computationally expensive stages are the assembly phase and the solver phase. The initial idea of the project was to improve the performance of the Fluidity framework using directive-based GPU programming models like HMPP (Hybrid Multicore Parallel Programming) and PGI Accelerators. For initial profiling and performance optimisation, we decided to use the 1-D non-linear Burgers' equation [10]. The Burgers' equation is a fundamental PDE which occurs in various applications of fluid dynamics and takes the form

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = \nu \frac{\partial^{2} u}{\partial x^{2}},$$

where $u$ is the velocity and $\nu$ is the viscosity coefficient. This is basically the 1-D Navier-Stokes equation (discussed in Section 2.4) without the pressure and volume force terms. Figure 1.2 shows the profiling result of this problem on a single Intel Xeon processor.

Figure 1.2: Profiling of the 1-D model problem with mesh spacing of and domain [-10, 10]

The profiling results show that most of the execution time (84%) is spent in the PETSc solver library. For this model problem, when we increase the resolution (i.e. finer mesh spacing), the assembly phase time increases proportionally and the solver time increases exponentially. As this is a simple 1-D example, the assembly time is relatively small. But the condition number of the matrix is very high and hence the solver takes a long time to converge. Hence, we decided to deviate from our original plan and optimise the solver phase of Fluidity, which ultimately means the PETSc solvers.

Graphics Processing Units (GPUs) are becoming more popular due to their performance to cost ratio and potential performance gains compared to CPUs. To improve the performance of the solver phase, we decided to use the newly implemented GPU support in PETSc. We also identified potential performance improvements in PETSc using different sparse matrix storage schemes, which are more suitable for GPUs.

1.3 Related Work

In the last year, basic GPU support has been added to PETSc, which is currently available in the PETSc development release [11]. To the best of our knowledge, there are no published benchmarking results for this GPU implementation. Also, the current implementation only supports the CSR (Compressed Sparse Row) matrix storage format and there is no development effort to support other matrix storage schemes. Nathan Bell and Michael Garland from NVIDIA Research have published performance results [12] of sparse matrix-vector operations on GPUs using different sparse matrix storage schemes. These results show that storage schemes like DIA (Diagonal), ELL (ELLPACK) and HYB (Hybrid) are well suited for GPUs. Compared to the CSR storage scheme, the DIA and ELL formats can achieve 4-6x speedup.

There are two main goals of this MSc project. The first goal is to improve the performance of the current PETSc GPU implementation using the DIA, ELL and HYB sparse matrix storage formats. The second goal is to evaluate the performance of the PETSc GPU implementation on the HECToR GPU cluster. Specifically, we are looking at the performance of Krylov subspace solvers for solving large sparse linear systems.

1.4 Contributions and Outline

The contributions of this project report are as follows:

- To discuss the performance of the CUSP and Thrust libraries with large sparse matrices from real world applications;
- To present an initial implementation to support different sparse matrix storage schemes in PETSc using CUSP and Thrust;
- To evaluate the performance of the PETSc GPU implementation for solving large sparse linear systems;
- To evaluate the performance benefits of the new implementation in single and multi-GPU applications;
- To compare the overall performance of the PETSc GPU implementation on the HECToR GPU cluster and the HECToR (Phase 2b) system.

Chapter 2 presents background information, which includes GPGPU, the CUDA programming model, the CUSP & Thrust libraries, PDEs and iterative methods for sparse linear algebra. Chapter 3 discusses the design of the PETSc library and the implementation of GPU support in PETSc. Chapter 4 presents different sparse matrix storage schemes suitable for vector processors which are available in the CUSP library; we have also evaluated the performance of CUSP with large sparse matrices from real world applications. Chapter 5 presents our initial implementation to support sparse matrix storage schemes in PETSc using the CUSP library. Chapter 6 discusses the wrapper codes developed for matrix conversion, performance analysis and benchmarking. Chapter 7 presents performance results of the PETSc GPU implementation with different sparse matrix formats on the HECToR GPU cluster as well as the main HECToR (Phase 2b) system. Chapter 8 discusses the challenges of multi-GPU parallelisation encountered during the performance analysis and outlines future work in this area. Chapter 9 presents the conclusions of this project and summarises the results.

1.5 Change in Project Plan

During the project preparation phase (Semester II), the idea of the MSc project was to extend the HMPP programming model for the C++ language. Specifically, our aim was to implement a generic meta-programming framework for HMPP using C++ templates. HMPP is now an open standard developed by CAPS Enterprise and PathScale Inc., and provides a directive-based GPU programming model similar to PGI Accelerators. For this MSc project, an external organisation was expected to provide the ENZO compiler suite with HMPP C++ support by May 2011. But this compiler was not available until the first week of June 2011 due to the complexity of the C++ compiler implementation. So we decided to change our project plan. With the great support of Dr. Michele Weiland and Dr. Chris Maynard we were quickly able to work out an alternative project plan and

decided to work on the Fluidity project, especially the PETSc GPU implementation. This change in project affected the planned schedule of the project, but with the continuous advice and support of my supervisors, I was able to complete this project successfully.

Chapter 2 Background

2.1 GPGPU

GPUs (Graphics Processing Units) have a distinct architecture specifically designed for high floating-point throughput and fine grained concurrency. In the past, GPUs were mostly used for improving the performance of graphics operations like pixel shading, texture mapping and rendering. But in the last few years, GPUs have been effectively used to speed up the performance of non-graphics applications from different areas of science like computational fluid dynamics, molecular dynamics, medical imaging, climate modelling etc. The term GPGPU (General Purpose Computing on GPUs) is normally used to refer to the use of GPUs for accelerating non-graphics applications traditionally executing on CPUs. The primary reason for the popularity of GPUs in the area of scientific computing is their performance to cost ratio. For example, the NVIDIA Tesla C2070 GPU has 448 cores capable of achieving a theoretical peak performance of 515 GFlops, which is 50 times more than the Intel Xeon (E5620) quad-core processor. But if we compare the prices, the Tesla GPU is only about five times more expensive than the Intel Xeon processor. Various applications ported to GPUs show significant performance benefits (10-50x speedup) compared to CPUs. More details about these applications can be found in [13].

2.2 GPU Programming Models

Programming models like ARB, OpenGL, Direct3D and Cg were commonly used for the development of graphics applications. But these programming models do not fit well for the development of HPC applications. Research in GPU technology helped to understand the use of GPUs in general purpose computing. Various programming models like CUDA, OpenCL, AMD Stream and PGI directives are available for programming these special purpose devices, and CUDA is the most popular programming model among them.

2.2.1 CUDA

NVIDIA introduced the CUDA programming model, which enables a large developer community to exploit GPUs for general purpose computing. The programming

interfaces are exposed through the C, C++ and FORTRAN languages, and third party wrappers exist for other languages like Python, Ruby etc. A CUDA application consists of code that runs on the CPU as well as the GPU. The compute intensive functions of the program which execute on the GPU are called kernels. The nvcc compiler translates this kernel code to PTX assembly code, which is executed on the GPU. More details about the CUDA programming model can be found in [14].

CUDA Architecture: We will discuss the CUDA architecture considering a Tesla 10-series device. A Tesla C1060 GPU device consists of 30 multithreaded streaming multiprocessors (SMs) and each SM consists of 8 streaming processors (SPs), two special function units, on-chip shared memory and an instruction unit. Figure 2.1 shows the organisation of SMs, SPs, registers and shared memory in a Tesla device. The SM creates, manages and schedules groups of 32 threads in batches called warps. A single SM has hardware resources that can hold the state of three warps at a time [15]. For the C1060 device there can be 23,040 threads (30 SMs * 8 SPs * 32 threads * 3 warps) available for execution. Out of these, only 960 threads (30 SMs * 32 threads) can be executed concurrently at a given time. All threads within a warp execute in SIMT (Single Instruction Multiple Threads) fashion.

Figure 2.1: CUDA Architecture and Memory Hierarchy (adapted from [15] and [56])
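As a minimal illustration of the kernel concept described above (our own sketch, not code from the thesis; the kernel name, sizes and launch configuration are hypothetical), the following CUDA snippet launches a simple vector-add kernel with a 1-D grid of 1-D thread blocks:

#include <cstdio>
#include <cuda_runtime.h>

/* A trivial kernel: each thread adds one pair of elements. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    /* Host data */
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* Device data and host-to-device copies */
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Launch configuration: a 1-D grid of 1-D thread blocks */
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}

Each thread computes its global index from blockIdx, blockDim and threadIdx; this is the same pattern used by the SpMV kernels discussed later in this dissertation.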

There are different memory types: register, shared, local, global, and caches. Each of these types has different sizes, latencies, bandwidths and performance characteristics. Each SM has on-chip registers and shared memory. These memories are small in size and have very low latency. Local and global memory are the largest in size and have very high latency: data access from global or local memory is very costly and requires several hundred clock cycles. Texture and constant memory have similar latency, but they can be automatically cached by the hardware and hence can be used effectively if a kernel exhibits temporal locality. L1 and L2 caches are introduced in the newer Fermi architecture, giving benefits similar to CPU caches. More detailed descriptions of the memory organisation and performance can be found in [16].

Whenever a CUDA kernel is launched on a GPU, thousands of threads are created, which are organised into a grid. The grid is a 1-D or 2-D array of thread blocks and each thread block is a 1-D, 2-D or 3-D array of threads. The thread blocks are assigned to available SMs. All threads within a thread block execute in a time-multiplexed fashion on a single SM. The grid and block dimensions largely depend on the hardware resource requirements of the executing kernel. More information on this can be found in [14].

2.3 CUSP and Thrust

CUSP and Thrust are open source C++ template libraries developed using CUDA and providing high level interfaces for GPU programming. We have used these libraries for implementing the sparse matrix storage scheme support in PETSc.

2.3.1 Thrust

Thrust is an open source template library [17] developed on top of CUDA. The main advantage of Thrust is that it provides a high level interface for GPU programming and enables rapid development of complex HPC applications. Another important benefit is that Thrust, being a C++ template library, supports generic programming and Object Oriented (OO) paradigms. The three main components of Thrust are: containers, iterators and algorithms.

Containers: A container can store a collection of objects. The containers are usually implemented as template objects so that they can be used with different data types. For example, common data structures used in programming languages like linked lists, stacks, queues, heaps and arrays are implemented as containers. In Thrust, there are two main containers: thrust::host_vector and thrust::device_vector. A host_vector and a device_vector represent an array of elements in CPU (host) and GPU (device) memory respectively. The major benefit of containers is that they handle memory management for the underlying objects. For example, whenever we create a host_vector, memory is automatically allocated on the CPU. Similarly, the device_vector container handles memory allocation and deallocation on the GPU. Whenever we assign a host_vector to a device_vector, Thrust automatically performs the host-to-device copy; the underlying CUDA calls such as

cudaMalloc, cudaMemcpy and cudaFree are completely hidden from application developers.

Iterators: An iterator is a generalisation of pointers in C and can be thought of as an object in C++ which can point to other objects. Iterators are usually used for traversing over container objects and are similar to C pointers, so we can perform pointer arithmetic on them. There are different types of Thrust iterators like input, output, constant, permutation or transform iterators [17]. For example, the input iterator provides the functionality of accessing the value of an element of a container. It is possible to write generic algorithms by using templates parameterized by iterators.

Algorithms: Thrust implements more than sixty basic algorithms like merge sort, radix sort, inclusive scan, reduce or parallel prefix. These algorithms are implemented as templated objects so that they can work with all basic data types. With the help of iterators, the algorithmic implementation does not have to worry about the underlying object type or object access methods. Algorithms do not directly access the container data, but use iterators to access the underlying data elements. For example, there is a single implementation of the radix sort for all data types; depending on the data type, an iterator provides a way to access the data elements. The mechanism of using containers, iterators and algorithms together can be explained with the following simple example:

/* Thrust headers */
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    /* allocate storage for one million numbers using the host container */
    thrust::host_vector<float> vec_h(1000000);

    /* generate one million numbers on the host using iterators */
    thrust::generate(vec_h.begin(), vec_h.end(), rand);

    /* transparent copy of host vector to device vector */
    thrust::device_vector<float> vec_d = vec_h;

    /* use of Thrust algorithms: passing iterators as parameters */
    thrust::sort(vec_d.begin(), vec_d.end());

    /* transparent copy from device to host memory */
    thrust::copy(vec_d.begin(), vec_d.end(), vec_h.begin());

    return 0;
}

Figure 2.3: Simple example to sort one million float elements on GPU using Thrust

In the above example, we first create the host container to store one million float elements. We then randomly fill this vector using the thrust::generate method. The vec_h.begin() and vec_h.end() calls provide iterators pointing to the start and end of the vec_h container respectively. When we assign the host container to the device container, Thrust automatically allocates memory on the GPU using cudaMalloc() and

calls cudaMemcpy() to make a host-to-device memory copy. In this example, we are using the thrust::sort method to invoke the default sorting algorithm (Merrill's radix sort [18]) on the GPU. Finally, we use the thrust::copy method to copy the vector data back from GPU to CPU memory.

2.3.2 CUSP

CUSP is also an open source C++ template library [19] developed on top of CUDA, but it specifically targets sparse linear algebra and sparse matrix computations. Similar to Thrust, this library provides a high-level programming interface and internally uses the functionality of Thrust and CUBLAS. CUSP provides the following five sparse matrix storage schemes:

- Compressed Sparse Row (CSR)
- Coordinate (COO)
- ELLPACK (ELL)
- Diagonal (DIA)
- Hybrid (HYB)

We will discuss these storage formats in detail in Section 4.2. CUSP provides an easy interface for building different sparse matrix formats and a transparent conversion between these formats. This is explained in the following example:

/* CUSP headers */
#include <cusp/coo_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/gallery/poisson.h>

int main()
{
    /* sparse matrix in COO format on the host */
    cusp::coo_matrix<int, float, cusp::host_memory> coo_mat;

    /* matrix corresponding to a 2-D Poisson problem on a 15x15 mesh */
    cusp::gallery::poisson5pt(coo_mat, 15, 15);

    cusp::ell_matrix<int, float, cusp::device_memory> ell_mat;

    /* performs memory allocation on the device, conversion from COO to ELL,
       and the copy of the matrix data to the device */
    ell_mat = coo_mat;

    return 0;
}

Figure 2.4: Sparse matrix construction and transparent conversion using CUSP

In the above example, we create a sparse matrix object in COO format. CUSP provides the cusp::gallery interface for generating sample matrices for a Poisson or Diffusion problem on a 2-D mesh. When we assign the COO matrix object on the host to the ELL matrix object on the device, CUSP automatically allocates memory on the GPU, performs the COO to ELL conversion, and copies the matrix data from CPU to GPU. We discussed this mechanism in Section 2.3.1.
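Since sparse matrix-vector multiplication (SpMV) is the operation this project ultimately targets, it is worth noting that CUSP also exposes it directly through cusp::multiply. The sketch below (our own illustration, not from the thesis) multiplies the same hypothetical 15x15 Poisson matrix, converted to ELL on the device, by a vector of ones:

#include <cusp/coo_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/array1d.h>
#include <cusp/multiply.h>
#include <cusp/gallery/poisson.h>

int main()
{
    /* build the 2-D Poisson matrix on the host and convert it to ELL on the device */
    cusp::coo_matrix<int, float, cusp::host_memory> coo_mat;
    cusp::gallery::poisson5pt(coo_mat, 15, 15);
    cusp::ell_matrix<int, float, cusp::device_memory> ell_mat(coo_mat);

    /* x = 1, y = 0, both resident in device memory */
    cusp::array1d<float, cusp::device_memory> x(ell_mat.num_cols, 1.0f);
    cusp::array1d<float, cusp::device_memory> y(ell_mat.num_rows, 0.0f);

    /* y = A * x, executed on the GPU with the ELL-specific SpMV kernel */
    cusp::multiply(ell_mat, x, y);

    return 0;
}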

In addition to sparse matrix storage and operations, CUSP provides the following features:

- A file I/O interface for reading and writing large sparse matrices to/from Matrix Market files.
- Krylov subspace solvers like Conjugate Gradient (CG), Multi-mass Conjugate Gradient (CG-M), Biconjugate Gradient (BiCG) and Generalized Minimum Residual (GMRES) on GPUs.
- Preconditioners like Algebraic Multigrid (AMG), Diagonal and Approximate Inverse (AINV).

We have used the file I/O interface for converting matrices stored in Matrix Market format (an ASCII format) to PETSc binary format. For implementing the sparse storage schemes in PETSc, we have extensively used CUSP and Thrust. We have also developed a small benchmark to measure the performance of the CUSP linear solvers on GPUs.

2.4 Partial Differential Equations: Source of Sparse Matrices

Partial Differential Equations (PDEs) provide a mathematical model for many scientific and engineering applications. These equations relate partial derivatives of physical quantities like force, velocity, momentum, temperature etc. In fluid dynamics, the Navier-Stokes equations [20] are a set of nonlinear PDEs which describe the flow of incompressible fluids as

$$\rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v}\cdot\nabla\mathbf{v}\right) = -\nabla P + \mu\nabla^{2}\mathbf{v}, \qquad \nabla\cdot\mathbf{v} = 0,$$

where $\mathbf{v}$ is the flow velocity, $\mu$ is the viscosity, $P$ is the pressure, $\rho$ is the density of the fluid and $\nabla$ is the vector differential operator. Most commonly, we solve these PDEs by approximating them with equations with a finite number of unknowns. This process of approximation is called discretisation. There are two commonly used techniques available, Finite Difference Methods (FDM) and Finite Element Methods (FEM), explained in [21].

We will illustrate the process of discretisation using the common example of the Poisson equation

$$-\nabla^{2} u = -\left(\frac{\partial^{2} u}{\partial x^{2}} + \frac{\partial^{2} u}{\partial y^{2}}\right) = f(x,y),$$

where $u$ is a real-valued function of the two space variables $x$ and $y$. Consider the simple problem where we want to find a function $u$ such that $-\nabla^{2} u = 1$

in the solution domain and $u = 0$ on the boundary. To find the numerical approximation of $u$, we discretise the PDE using finite differences and subdivide the domain into a grid of points. We solve for the unknowns $u_{i,j} \approx u(x_i, y_j)$ where $i,j = 0, 1, 2, 3, \ldots$ In this case, the grid spacing is given by $h = 1/(n+1)$.

Figure 2.2: 2-D Grid and Five-point stencil

On the 2-D grid, we can write the discretised equations (using a forward difference for the first derivative and a backward difference for the second derivative) as

$$\nabla^{2} u_{i,j} \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^{2}}.$$

The right hand side of the above expression is called a five-point stencil because every point on the lattice is averaged with its four nearest neighbours, as shown in Figure 2.2. A finite difference approximation to the above equation is given by

$$4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = h^{2}$$

at the interior grid points. This results in $n^{2}$ linear equations with $n^{2}$ unknowns. The resulting matrix A from the linear system is very large, sparse and with a banded structure. For example, for N=4, the matrix shown in Figure 2.3 is of order 16x16 and only contains 25% nonzero elements.
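To make the structure of this linear system concrete, the following sketch (our own helper, not code from the thesis) assembles the five-point stencil matrix for an n x n interior grid as COO triplets; for n = 4 it reproduces the 16x16 matrix with 25% nonzero entries mentioned above:

#include <cstdio>
#include <vector>

/* One COO triplet of the assembled matrix. */
struct Triplet { int row, col; double val; };

/* Assemble the five-point stencil (negative Laplacian) for an n x n interior
   grid, giving an n^2 x n^2 sparse matrix in COO form. */
std::vector<Triplet> assemblePoisson5pt(int n)
{
    std::vector<Triplet> A;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int row = i * n + j;                                /* unknown u(i,j) */
            A.push_back({row, row, 4.0});                       /* stencil centre */
            if (i > 0)     A.push_back({row, row - n, -1.0});   /* south neighbour */
            if (i < n - 1) A.push_back({row, row + n, -1.0});   /* north neighbour */
            if (j > 0)     A.push_back({row, row - 1, -1.0});   /* west neighbour  */
            if (j < n - 1) A.push_back({row, row + 1, -1.0});   /* east neighbour  */
        }
    }
    return A;
}

int main()
{
    int n = 4;                                                  /* 4x4 grid -> 16x16 matrix */
    std::vector<Triplet> A = assemblePoisson5pt(n);
    printf("order %d, nonzeros %zu (%.0f%% of entries)\n",
           n * n, A.size(), 100.0 * A.size() / double(n * n * n * n));
    return 0;
}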

Figure 2.2: Sparse matrix for 5x5 grid (Poisson problem, 25% non-zero elements)

There are different storage schemes available to store these sparse matrices. Some formats like ELL or Diagonal are better suited for GPUs. We will discuss these formats in Section 4.2, considering their performance on GPUs.

2.5 Iterative Methods for Sparse Linear Systems

Iterative methods are commonly used for solving large linear systems. These methods try to find the solution of the linear system of equations Ax=b by generating a sequence of improving approximate solutions (here, iterative means the repetitive application of operations to improve the approximate solution). These methods use an initial guess as the first approximate solution and then improve this solution over successive iterations. There are two main classes of iterative methods: stationary iterative methods and Krylov subspace methods. The Jacobi, Gauss-Seidel and Successive Over-Relaxation (SOR) methods are examples of stationary methods; they are easy to implement and analyse, but their convergence is not guaranteed for all classes of matrices. Krylov subspace methods are a class of iterative methods which are considered the most important iterative techniques currently available for solving linear and non-linear systems of equations. These methods are widely adopted because they are efficient and reliable. Examples of Krylov subspace methods are Conjugate Gradient, Biconjugate Gradient and GMRES (Generalized Minimal Residual). These methods are based on the Krylov subspace. The m-order Krylov subspace is defined as

$$\mathcal{K}_m(A, b) = \operatorname{span}\{\,b,\; Ab,\; A^{2}b,\; \ldots,\; A^{m-1}b\,\},$$

where A is an n x n matrix and b is a vector of length n. Research in Krylov subspace techniques has brought various new methods. A detailed explanation of all methods is beyond the scope of this project. We will discuss one such Krylov subspace solver, GMRES, which we have used in our performance analysis example.

GMRES Method: GMRES is an iterative method which approximates the solution by the vector in the Krylov subspace with minimal residual [22]. GMRES approximates the solution by minimising the Euclidean norm of the residual Ax-b over the Krylov subspace. This method is designed to solve non-symmetric linear systems. The most popular form of GMRES is based on the Gram-Schmidt orthogonalisation process. The Gram-Schmidt process takes a set of linearly independent vectors $S = \{v_1, \ldots, v_k\}$ in Euclidean space and computes a set of orthogonal vectors $S' = \{u_1, \ldots, u_k\}$ which spans the same subspace, for $k \le n$. More information about this can be found in [23].

The major problem with iterative methods is a lack of robustness. Despite their suitability for large sparse linear systems, these methods are not widely accepted in industrial applications. Also, a typical application from the field of computational fluid dynamics or electronic device simulation suffers because of the slow convergence of iterative methods. Hence, preconditioning techniques play a key role in the success of Krylov subspace methods.

Preconditioning Technique: The condition number asymptotically measures the worst case of how much the result of a function changes with a small change in its arguments. For the linear equation Ax=b, the condition number of the matrix A gives an idea of the rate at which the solution x will change with respect to a change in b. So for matrices with a large condition number, a small error in b causes a large error in the solution x and results in numerical instabilities. The main goal of preconditioning is to transform the original linear system into an equivalent linear system which has the same solution but is likely to be easier to solve. A preconditioner M of the matrix A is computed such that the preconditioned matrix (for example $M^{-1}A$) has a smaller condition number than A. Preconditioning techniques are widely used with iterative methods to improve the rate of convergence of iterative solvers. There are various preconditioning techniques available, like Jacobi, Block-Jacobi and Incomplete LU factorisation (ILU), that can be used with iterative methods. More information about preconditioning techniques and their implementation can be found in [24].

Note that the PETSc library allows the transparent use of various Krylov subspace solvers and preconditioners. PETSc applications running in serial as well as parallel do not have to write specialised code to use different PETSc solvers or preconditioners. More information about the available Krylov subspace solvers and preconditioners in PETSc can be found in [25].
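As a small illustration of this transparency (our own sketch using the present-day PETSc API, which differs slightly from the 2011 development release used in this thesis), the Krylov method and preconditioner of a KSP object can be selected at run time from the options database, so the same code switches between, say, GMRES with Jacobi and CG with Block-Jacobi without recompilation:

#include <petscksp.h>

int main(int argc, char **argv)
{
    Vec         x, b;
    Mat         A;
    KSP         ksp;
    PetscInt    i, n = 10, col[3];
    PetscScalar value[3];

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a small 1-D Laplacian (tridiagonal) test matrix */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    for (i = 0; i < n; i++) {
        value[0] = -1.0; value[1] = 2.0; value[2] = -1.0;
        col[0] = i - 1;  col[1] = i;     col[2] = i + 1;
        if (i == 0)          MatSetValues(A, 1, &i, 2, &col[1], &value[1], INSERT_VALUES);
        else if (i == n - 1) MatSetValues(A, 1, &i, 2, col, value, INSERT_VALUES);
        else                 MatSetValues(A, 1, &i, 3, col, value, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* Right hand side b = 1 and solution vector x */
    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Solver and preconditioner come from the options database,
       e.g.  -ksp_type gmres -pc_type jacobi   or   -ksp_type cg -pc_type bjacobi */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
}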

Chapter 3 PETSc GPU Implementation

3.1 PETSc

PETSc is a scalable solver library which has been used in the development of a large number of HPC applications [26]. It provides infrastructure for rapid prototyping and algorithmic design, which eases the development of scientific applications while maintaining scalability on large numbers of processors. The design of PETSc allows the transparent use of different linear/non-linear solvers and preconditioners in applications. The programming interface is provided through C, C++, FORTRAN and Python. In this section we will discuss the PETSc design and architecture, which will help to understand our implementation of sparse matrix storage scheme support.

3.1.1 PETSc Kernels

PETSc kernels are the basic set of services on top of which the scalable solver library is built. These kernels are shown in Figure 3.1. They have a modular structure and are designed to maintain portability across different architectures and platforms. For example, instead of float or integer data types, PETSc provides new data types like PetscInt, PetscScalar or PetscMPIInt. These data types are internally mapped to the corresponding int, float, float64 or double data types supported on the underlying platform. For our implementation, if we want to add new memory management routines, we can implement those in the corresponding kernel and make them available to applications and other kernels. These PETSc kernels are explained in more detail in [27].

Figure 3.1: PETSc Kernels

3.1.2 PETSc Components

PETSc is developed using object oriented paradigms and its architecture allows easy integration of new features from external developer communities. PETSc consists of the various sub-components listed below:

- Vectors
- Matrices
- Distributed Arrays
- Preconditioners
- Krylov Subspace Solvers
- Non-linear Solvers
- Index Sets
- Timesteppers

PETSc allows easy customisation and extension of these components. For example, we can implement a new matrix subclass or preconditioner that can be transparently used by all KSP solvers without any modifications. The algorithmic implementation is separated from the parallel library layer, which allows code reusability and easy addition of new solvers, preconditioners and data structures. Figure 3.2 shows the organisation of the different PETSc libraries and the levels of abstraction at which they are exposed.

Figure 3.2: PETSc Library Organisation [54]

PETSc internally uses a number of libraries like BLAS, ParMetis, MPI and HDF5 to provide the infrastructure required for large HPC applications. PETSc provides much flexibility for users to choose among different libraries for different classes of applications, but most of the functionality of the underlying libraries is hidden from application developers by the parallel library layer.

3.1.3 PETSc Object Design

In PETSc, classes like Vector, Matrix and Distributed Array represent data objects. These objects define various methods for data manipulation in the sequential or parallel implementation. The internal representation of an object, i.e. the data structure, is not exposed to applications and is only available through the exposed APIs. This is shown in Figure 3.3. For example, the Vector class can be used for representing the right hand side of a linear system Ax=b or discrete solutions of PDEs, and stores values in a simple array format similar to the C or FORTRAN array convention. This class defines various methods for vector operations like the dot product, the vector norm, scaling, and scatter or gather operations. For parallel applications, PETSc automatically distributes the vector elements within the communicator and uses the functionality of the underlying MPI library to perform collective or point-to-point MPI operations.

Figure 3.3: PETSc Objects and Application Level Interface (applications call the exposed APIs, which abstract the data manipulation routines and the underlying data structures of Matrix, Vector and Index Set objects)

In the parallel implementation, the Matrix or Preconditioner objects do not have access to the internal data structure directly. Instead, they just call the exposed APIs through the PETSc interface and the internal object representation manages communication within an MPI communicator. For example, for parallel vectors, a VecScatter object is created internally to manage data communication across MPI processes. The VecScatterBegin() and VecScatterEnd() routines are used to perform vector scatter operations across the communicator. To access internal Vector data, the application uses subroutines like VecGetArray(). Only Preconditioner (PC) objects are implemented in a data structure specific way, so they access and manipulate Vector or Matrix data structures directly.
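To make the data-hiding idea concrete, the short sketch below (our own example, again using the present-day PETSc API) creates a parallel vector and touches its raw values only through the exposed VecGetArray()/VecRestoreArray() interface:

#include <petscvec.h>

int main(int argc, char **argv)
{
    Vec          x;
    PetscInt     i, n = 8, istart, iend;
    PetscScalar *vals;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Create a vector; PETSc decides the parallel layout across the communicator */
    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);
    VecSet(x, 1.0);

    /* The internal storage is reached only through the exposed API */
    VecGetOwnershipRange(x, &istart, &iend);
    VecGetArray(x, &vals);
    for (i = 0; i < iend - istart; i++) vals[i] *= (istart + i);   /* local entries only */
    VecRestoreArray(x, &vals);

    VecView(x, PETSC_VIEWER_STDOUT_WORLD);
    VecDestroy(&x);
    PetscFinalize();
    return 0;
}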

3.2 PETSc GPU Implementation

Recently, GPU support has been added to the PETSc solver library. It is currently under development and available in the PETSc development release [11]. The initial implementation allows for the transparent use of GPUs without modifying existing application source code. Instead of writing completely new CUDA code, PETSc uses the open source CUSP and Thrust libraries discussed in Section 2.3. This helps to keep the GPU implementation separate from the existing PETSc code. We will discuss this new implementation in more detail, as our development work is an extension of it.

3.2.1 Sequential Implementation

The current implementation assumes that every MPI process has access to a single GPU. A new GPU specific Vector class called VecCUSP has been implemented. It uses CUBLAS, CUSP, as well as Thrust library routines to perform vector operations on a GPU. The idea behind using these libraries is to use already developed, fine tuned CUDA implementations with PETSc instead of developing new ones. The PETSc implementation acts as an interface between PETSc data structures and the external CUDA libraries, i.e. Thrust and CUSP.

Whenever we execute a program with GPU support, two copies of any vector are created, one on the CPU and another on the GPU. In the existing Vec class, a new flag called valid_gpu_array is added. This flag has the following four possible values and corresponding meanings:

- PETSC_CUSP_UNALLOCATED: the object is not yet allocated on the GPU
- PETSC_CUSP_CPU: the object is allocated and a valid copy is available on the CPU only
- PETSC_CUSP_GPU: the object is allocated and a valid copy is available on the GPU only
- PETSC_CUSP_BOTH: the object is allocated and valid copies are on both the CPU and GPU

Initially this flag has the value PETSC_CUSP_UNALLOCATED. When an application creates a Vector object, the VecCUSPCopyToGPU() subroutine creates a new vector copy on the GPU and sets the valid_gpu_array flag to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. Now all vector operations can be performed on the GPU. Whenever the VecCUSPCopyToGPU() function gets called, it makes a copy to the GPU only if the vector object has been modified on the CPU, i.e. the value of the valid_gpu_array flag has changed. Memory copies between host and device are managed through the subroutines VecCUDACopyToGPU() and VecCUDACopyFromGPU(). For example, when an application calls VecGetArrayRead() to access vector data, internally it first calls VecCUDACopyFromGPU() to copy the recent vector values from the GPU and then sets valid_gpu_array to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. This mechanism can be illustrated by the implementation of the simple vector operation AXPY, i.e. y = alpha*x + y:

VecAXPY()
{
    /* copy vector x from CPU to GPU: only if modified */
    ierr = VecCUDACopyToGPU(xin);
    /* copy vector y from CPU to GPU: only if modified */
    ierr = VecCUDACopyToGPU(yin);

    try {
        /* perform AXPY using the CUBLAS/CUSP library routine */
        cusp::blas::axpy(*((Vec_CUDA*)xin->spptr)->GPUarray,
                         *((Vec_CUDA*)yin->spptr)->GPUarray, alpha);
        /* the updated copy is now present on the GPU only */
        yin->valid_gpu_array = PETSC_CUDA_GPU;
        /* wait until all threads finish */
        ierr = WaitForGPU();
    } catch(char *ex) {
        ...
    }
}

Figure 3.4: VecAXPY implementation in PETSc using CUSP & CUBLAS [11]

For the above vector operation, the VecCUDACopyToGPU() subroutine allocates memory and copies the vector data onto the GPU if the flag value is PETSC_CUSP_UNALLOCATED. If the flag value is PETSC_CUSP_CPU, memory is already allocated on the GPU but the copy on the CPU has recently been modified, so it makes a CPU to GPU vector copy. It then calls the CUBLAS library routine and sets valid_gpu_array to PETSC_CUDA_GPU.

3.2.2 Parallel Implementation

In the parallel implementation, the parallel Vector and Matrix objects are implemented on top of the sequential implementation. The rows of a matrix are partitioned among the processes in a communicator. This is shown in Figure 3.5. In the PETSc implementation, a sparse matrix is stored in two parts: the on-diagonal part and the off-diagonal part. The on-diagonal portion of the matrix, say Ad, stores the values of the columns associated with the rows owned by that process; these matrix elements are shown in red in the figure. All remaining entries, the off-diagonal portion, are stored in another component, say Ao.

Figure 3.5: Parallel matrix with on-diagonal and off-diagonal elements for two MPI processes
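This on-diagonal/off-diagonal split is also visible in the user-level API: when preallocating a parallel AIJ matrix, the expected nonzeros per row are given separately for the diagonal block (Ad) and the off-diagonal block (Ao). A minimal sketch with the present-day MatCreateAIJ() call (our example; the sizes and nonzero counts are hypothetical):

#include <petscmat.h>

int main(int argc, char **argv)
{
    Mat      A;
    PetscInt mlocal = 1000;   /* rows owned by this process (hypothetical size) */

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Preallocate a parallel AIJ matrix: at most 5 nonzeros per row fall in the
       on-diagonal block Ad and at most 2 in the off-diagonal block Ao. */
    MatCreateAIJ(PETSC_COMM_WORLD,
                 mlocal, mlocal,                    /* local rows/columns          */
                 PETSC_DETERMINE, PETSC_DETERMINE,  /* global sizes from local ones */
                 5, NULL,                           /* d_nz, d_nnz[]               */
                 2, NULL,                           /* o_nz, o_nnz[]               */
                 &A);

    /* ... MatSetValues() calls and assembly would follow here ... */

    MatDestroy(&A);
    PetscFinalize();
    return 0;
}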

The sparse matrix-vector product is calculated in two steps. First, we calculate the product related to the on-diagonal entries of the matrix, i.e. Ad, using the associated entries of the vector x, i.e. xd. Then we calculate the product of the off-diagonal matrix entries and the associated vector entries xo, which gets added to the previous result yd:

    yd  = Ad * xd
    yd += Ao * xo

For this operation, the updated entries of the vector xo must be communicated within the communicator. As Ao typically contains only a few nonzero columns, each process needs only a small subset of the xo entries. This communication is managed through the VecScatter object, which handles parallel gather and scatter operations using non-blocking MPI calls. The VecScatter object stores two arrays of indices: the first array stores the global indices of the vector elements that will be received as updated entries from other processes in the communicator; these received vector elements are stored in a local array. The second array stores the mapping between the global index of a vector element and its position in the local array. Communication starts with the VecScatterBegin() call, which copies data into message buffers. For the GPU implementation, the updated vector entries are first copied from GPU memory to the CPU using the VecCUDACopyFromGPU() function. The communication completes after the VecScatterEnd() call, which waits for the completion of the non-blocking MPI calls posted by VecScatterBegin(). The implementation of the parallel matrix-vector operation is shown below:

VecScatterBegin(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMult(Ad, xd, yd);
VecScatterEnd(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMultAdd(hatAo, hatxo, yd, yd);

Figure 3.6: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]

More information about this implementation can be found in [11] and [28].

3.3 Applications Running with PETSc GPU Support

PETSc allows the transparent use of GPUs without any changes to the application source code, so most existing PETSc applications can run on GPUs. New Vector and Matrix classes, i.e. VecCUSP and MatCUSP, have been added to PETSc which perform all matrix-vector operations on GPUs. To run an existing application on the GPU, the user has to

set the Vector type to VECCUSP and the Matrix type to MATCUSP using the VecSetType() and MatSetType() routines respectively. The user can also set these Vector and Matrix types using the options database keys -vec_type seqcusp and -mat_type seqaijcusp. All of the Krylov subspace methods except KSPIBCGS (the Improved Stabilized version of BiConjugate Gradient Squared) are supported on the GPU. Currently, the Jacobi, AMG (Algebraic Multigrid) and AINV (Approximate Inverse) preconditioners are supported on the GPU.
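In practice, assuming a PETSc build configured with CUSP support as in the development release used here, moving an application to the GPU amounts to selecting the CUSP types, either in code or from the options database. A hypothetical sketch (type names taken from the text above; the sizes and object names are ours):

#include <petscksp.h>

/* Run an existing KSP example on the GPU simply by selecting the CUSP types,
   either in code (below) or on the command line, e.g.
       ./app -vec_type seqcusp -mat_type seqaijcusp -ksp_type gmres -pc_type jacobi */
int main(int argc, char **argv)
{
    Vec x;
    Mat A;

    PetscInitialize(&argc, &argv, NULL, NULL);

    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, 1000);
    VecSetType(x, VECSEQCUSP);        /* vector data mirrored on the GPU */

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 1000, 1000);
    MatSetType(A, MATSEQAIJCUSP);     /* CSR matrix mirrored on the GPU */
    MatSetUp(A);

    /* ... assembly and KSPSolve() proceed exactly as in the CPU version ... */

    MatDestroy(&A);
    VecDestroy(&x);
    PetscFinalize();
    return 0;
}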

Chapter 4 Sparse Matrices

As discussed in Section 2.4, the discretisation of PDEs results in large sparse matrices. A matrix with only a few non-zero elements can be considered sparse. In a practical sense, a matrix can be considered sparse if specialised techniques can be used to take advantage of the sparsity and the sparsity pattern of the matrix. Depending on the sparsity pattern, we can divide matrices into two broad categories: structured and unstructured. A matrix with non-zero elements in a specific regular pattern is called a structured sparse matrix, for example when all non-zero elements lie along a few diagonals of the matrix, or in small dense sub-blocks, which results in regular patterns. The application of FDM or linear FEM on rectangular grids results in structured sparse matrices. On the other hand, irregular meshes result in unstructured sparse matrices with no specific structure or pattern of non-zero elements. Figure 4.1 and Figure 4.2 show examples of structured and unstructured matrices respectively.

Figure 4.1: Example of a structured matrix from a structured problem
Figure 4.2: Example of an unstructured matrix from a bipartite graph

Depending on the sparsity pattern, different storage schemes or data structures can be used. Importantly, the performance of matrix operations depends on these storage schemes and the processor architecture. This becomes more apparent for vector processors and GPUs. In this chapter we will discuss sparse matrix representation and different storage schemes with their storage efficiency and performance.

4.1 Sparse Matrix Representation

The structure of a sparse matrix can be ideally represented by an adjacency graph. Graph theory techniques have been used effectively for parallelising various iterative methods and implementing preconditioners [29]. A graph G = (V, E) is represented by a set of vertices V and a set of edges E, where each edge connects two elements of V. In the 2-D plane, the graph G is represented by a set of points which are connected by edges between these points. In the case of the adjacency graph of a sparse matrix, the n vertices in V represent the n unknown variables, and the edges in E represent the binary relation between those vertices: there is an edge from node i to node j when the matrix element a_ij is non-zero. An adjacency graph can be directed or undirected depending on the symmetry of the non-zeros. When a sparse matrix has a symmetric non-zero pattern (i.e. a_ij is non-zero whenever a_ji is non-zero), the adjacency graph is undirected; otherwise it is directed.

Figure 4.3: Sparse matrix representation with directed adjacency graph

This adjacency graph representation can be used for parallelisation. In the case of parallelising Gaussian elimination, at a given stage of the elimination we can find unknowns which are independent of each other from the above binary relation. For example, in the case of a diagonal matrix all unknowns are independent of each other, which is not true for dense matrices. More information about sparse matrix representation and parallelisation strategies can be found in [29].

4.2 Sparse Matrix Storage Schemes

There are two main reasons for different sparse matrix storage formats: memory requirements and computational efficiency. It may not be feasible to store a large sparse matrix in main memory and, importantly, it is not necessary to store the zero matrix elements. Various storage schemes (i.e. data structures) have been proposed to effectively utilise the sparsity and sparsity patterns of matrices. No single scheme is best for all sparse matrices; a few are suitable for matrices with structured sparsity patterns, some are general purpose and others are storage schemes for matrices with arbitrary nonzero patterns. Each storage scheme has different storage costs, computational costs and performance characteristics. In this section we will discuss various storage schemes and their performance on GPUs.

4.2.1 Coordinate List

The coordinate list (COO) is a simple and the most flexible storage format, where we store every non-zero element of a matrix using three vectors: data, row and indices. The data vector stores the nonzero elements of the matrix in row major order. The row and indices vectors explicitly store the associated row and column index of every element in the data vector. This is explained in the following figure:

Figure 4.4: Sparse matrix and corresponding COO storage representation (data, row and indices vectors)

This is a general purpose and robust storage scheme, which can be used for matrices with arbitrary sparsity patterns. The above example shows that the storage cost of the COO format is proportional to the number of nonzero elements: for an MxN sparse matrix with k non-zero elements, it requires k * (sizeof(value) + 2*sizeof(index)) bytes.

4.2.2 Compressed Sparse Row

Compressed Sparse Row (CSR) is a popular and the most general purpose storage format. It can be used for storing matrices with arbitrary sparsity patterns as it makes no assumptions about the structure of the nonzero elements. Like COO, this format also stores only the nonzero elements. These elements are stored using three vectors: data, indices and row_ptr. The data and indices vectors are the same as for the COO format. For an MxN sparse matrix, the row_ptr vector has length M+1 and stores the index where each row of the matrix starts in the data vector. The last entry of row_ptr corresponds to the number of nonzero elements in the matrix. This storage scheme is explained in the figure below:

Figure 4.4: Sparse matrix and corresponding CSR storage representation (data, indices and row_ptr vectors)
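Since the original figures are not reproduced in this transcription, the following sketch shows both layouts for a small hypothetical 4x4 matrix (values chosen by us purely for illustration):

#include <stdio.h>

/* Hypothetical 4x4 example matrix (zero entries are not stored):
       [ 1 0 2 0 ]
       [ 0 3 0 0 ]
       [ 4 0 5 6 ]
       [ 0 0 0 7 ]      k = 7 nonzeros                                */

/* COO: one entry per nonzero, row and column stored explicitly */
double coo_data[]    = { 1, 2, 3, 4, 5, 6, 7 };
int    coo_row[]     = { 0, 0, 1, 2, 2, 2, 3 };
int    coo_indices[] = { 0, 2, 1, 0, 2, 3, 3 };

/* CSR: same data/indices, but rows compressed into row_ptr of length M+1 */
double csr_data[]    = { 1, 2, 3, 4, 5, 6, 7 };
int    csr_indices[] = { 0, 2, 1, 0, 2, 3, 3 };
int    csr_row_ptr[] = { 0, 2, 3, 6, 7 };   /* row i occupies [row_ptr[i], row_ptr[i+1]) */

int main(void)
{
    /* the number of nonzeros in row 2 follows directly from row_ptr */
    printf("nnz in row 2 = %d\n", csr_row_ptr[3] - csr_row_ptr[2]);   /* prints 3 */
    return 0;
}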

There are some advantages of using CSR over COO. The CSR format takes less storage than COO due to the compression of the row indices illustrated above. Also, with the row_ptr vector we can easily compute the number of nonzero elements in row i as row_ptr[i+1] - row_ptr[i]. In parallel algorithms, the row_ptr values allow fast row slicing operations and fast access to matrix elements using pointer indirection. This is the most commonly used sparse matrix storage scheme on CPUs. For an MxN sparse matrix with k non-zero elements, it requires k * (sizeof(value) + sizeof(index)) + (M+1) * sizeof(index) bytes.

4.2.3 Diagonal

The application of stencils to regular grids results in banded sparse matrices, where the nonzero elements are restricted to a few sub-diagonals of the matrix. For these matrices, the diagonal (DIA) format can be used effectively. The DIA format uses only two vectors, data and offsets. The data vector stores the nonzero elements of the sub-diagonals of the matrix. The offsets vector stores the offset of every sub-diagonal from the main diagonal of the matrix. By convention, the main diagonal has offset 0, the diagonals below the main diagonal have negative offsets and those above the main diagonal have positive offsets. This is illustrated with an example in the figure below:

Figure 4.5: Sparse matrix and corresponding DIA storage scheme (data and offsets vectors; * marks padding)

Unlike CSR and COO, this storage format stores a few zero elements explicitly. As we can see in the above figure, the diagonal with offset -3 has only two non-zero elements, but to store it in diagonal format the elements of this diagonal are padded with zeros. Nevertheless, there are storage benefits due to the fact that we do not have to store column or row indices explicitly. Usually, the data vector stores the non-zero elements in column major order, which ensures memory coalescing on GPU devices. We will discuss this in more detail in the performance analysis in Section 4.3. For an MxN square matrix with d sub-diagonals having at least one non-zero element, it requires M * d * sizeof(value) + d * sizeof(index) bytes.

This is not a general purpose storage scheme like CSR and COO. It is very sensitive to the sparsity pattern and is useful for matrices with an ordered banded structure. For example, consider the banded matrix shown in the following figure. This matrix has a banded structure, but it is not suitable for the DIA storage scheme: the nonzero element structure is exactly in the opposite order of what is ideally suited for the DIA format.

Figure 4.5: Banded nonzero pattern which is not suitable for the DIA format

When we store the above matrix with the diagonal storage format, we end up storing all the sub-diagonals, each containing a single nonzero element and four padding elements.

4.2.4 ELL or Padded ITPACK

Like DIA, the ELL format is also suitable for vector architectures. This format can be used for storing sparse matrices arising from semi-structured meshes where the average number of non-zero elements per row is nearly the same. For an MxN sparse matrix with a maximum of k non-zeros per row, we store the matrix in an M x k dense data array. If a particular row has fewer than k non-zeros, that row is padded with zeros. The indices array stores the column index of every element in the data array. These elements are stored in column major order. Figure 4.6 illustrates an example of the ELL storage scheme:

Figure 4.6: Sparse matrix and corresponding ELL storage scheme (data and indices arrays; * marks padding)

Compared to DIA, ELL is again a more general format: the matrix does not need to have a banded structure of non-zero elements, but the number of non-zeros must be nearly the same across all rows of the matrix, otherwise we end up padding a large number of zero elements. For an MxN sparse matrix with a maximum of NNZ_PER_ROW non-zeros per row, it requires M * NNZ_PER_ROW * (sizeof(value) + sizeof(index)) bytes of storage.

4.2.5 Hybrid

Although the ELL format is well suited to vector architectures, most of the time sparse matrices arising from complex geometries do not have the same number of non-zeros per row [12]. As the number of non-zero elements starts to vary to a larger extent, we end up storing a large number of padding elements. Consider the example of the sparse matrix shown in Figure 4.7. In this case, except for the first row, all other rows have

same number of non-zeros, i.e. two. So except for the first row, all other matrix elements can be stored effectively in ELL format.

Figure 4.7: Sparse matrix suited for hybrid storage format

If we store this matrix in ELL format, we end up storing many extra elements due to padding, which is very inefficient. An alternative approach is to use a combination of the ELL and CSR/COO storage schemes, getting the performance benefits of ELL and the flexibility of CSR/COO. In the HYB format, rows having nearly the same number of nonzero elements are stored using the ELL format and the remaining rows are stored in the COO format. There is an additional overhead of calculating the number of nonzero elements per row. One approach, which is used in the CUSP [12] implementation, is to calculate the histogram of nonzero elements per row, from which the maximum number of non-zeros per row for the ELL part can easily be determined.

4.2.6 Jagged Diagonal Storage (JDS)

Another storage scheme which is well suited for vector processors is the Jagged Diagonal Storage (JDS). Like DIA, this storage scheme stores sub-diagonals in vectors of equal length by using appropriate padding. However, in the case of DIA, padding of sub-diagonals becomes costly if the number of non-zeros per diagonal varies. Hence, JDS uses a compression technique like CSR, where the rows of the matrix are reordered according to the number of non-zeros per row and the column index of every element is stored in an indices vector. We have not considered this storage format, as CUSP only supports the basic DIA storage. More details can be found in [30], [31].

4.2.7 Skyline or Variable Band

The Skyline representation is popular for direct solvers, especially when pivoting is not necessary. For symmetric matrices, this representation stores only the lower triangular matrix and hence requires half the storage space. The matrix elements are stored using two vectors: value and row_ptr. The value vector stores all nonzero elements. The row_ptr vector entries point to the start of every row. For non-symmetric matrices, the lower triangular part of the matrix is stored using the Skyline format and the upper triangular part is stored column-wise. CUSP does not support this storage scheme; more details about it can be found in [30], [32], [31].
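To make the storage schemes in this section concrete, the sketch below shows how one small 4x4 example matrix (values chosen purely for illustration) is laid out in the CSR, DIA and ELL formats. The DIA and ELL arrays are written row-wise here for readability; on the GPU they are stored in column major order as discussed above.

/* A small 4x4 example (values chosen only for illustration):
 *
 *     [ 1 7 0 0 ]
 * A = [ 0 2 8 0 ]
 *     [ 5 0 3 9 ]
 *     [ 0 6 0 4 ]
 */

/* CSR: values, column indices and row pointers (0-based) */
double csr_values[] = { 1, 7, 2, 8, 5, 3, 9, 6, 4 };
int    csr_colidx[] = { 0, 1, 1, 2, 0, 2, 3, 1, 3 };
int    csr_rowptr[] = { 0, 2, 4, 7, 9 };  /* row i holds entries rowptr[i]..rowptr[i+1]-1 */

/* DIA: three occupied diagonals at offsets -2, 0 and +1; each diagonal is
 * stored with length M = 4, unused positions (written as 0 here) are padding */
int    dia_offsets[] = { -2, 0, 1 };
double dia_data[3][4] = { { 0, 0, 5, 6 },    /* offset -2 */
                          { 1, 2, 3, 4 },    /* offset  0 */
                          { 7, 8, 9, 0 } };  /* offset +1 */

/* ELL: at most 3 non zeros per row, shorter rows are padded
 * (padded column indices are typically marked with a sentinel such as -1) */
double ell_data[4][3]    = { { 1, 7, 0 }, { 2, 8, 0 }, { 5, 3, 9 }, { 6, 4, 0 } };
int    ell_indices[4][3] = { { 0, 1, -1 }, { 1, 2, -1 }, { 0, 2, 3 }, { 1, 3, -1 } };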

4.3 Performance of Storage Schemes

In this section we analyse the performance of the different storage schemes discussed in Section 4.2. For our performance analysis we use the Sparse Matrix-Vector Multiplication (SpMV) benchmark routines developed by Nathan Bell [12]. These SpMV implementations are used in the CUSP library, so the benchmark results give an idea about the possible performance improvement in PETSc. In the case of iterative methods, SpMV is the most computationally expensive operation and the convergence time of the solver largely depends on the efficiency of SpMV. An efficient implementation of SpMV can significantly improve the performance of iterative solvers. But sparse matrix storage schemes lead to irregular memory access, which limits the achievable bandwidth. We discuss these issues in more detail below.

Memory Bandwidth: In current systems, CPU-GPU memory bandwidth (8 GB/Sec, PCIe x16 Gen 2) is significantly lower than GPU-GDRAM bandwidth (144 GB/Sec, Tesla C2050), so memory transfers between CPU and GPU have to be minimised. On GPUs, more transistors are devoted to floating point units than to caching mechanisms, so computations are almost free; it is therefore sometimes better to re-compute results than to cache them or copy them back to host memory. Accessing global memory is expensive, and it is better to transfer a single large memory block than to access multiple smaller chunks. The achievable speedup largely depends on whether a kernel is memory bound or compute bound. Thus it is necessary to exploit low latency memories like registers, shared memory and constant/texture caches. Various CUDA memory optimisation techniques are discussed in [33].

Memory Coalescing: Another major factor affecting performance is memory coalescing. Global memory is divided into 128 or 64 byte segments. In simple terms, memory access is termed coalesced if the kth thread in a half warp accesses the kth word in a memory segment. When threads within the half-warp access memory with such a pattern, the memory transactions of sixteen threads can be combined and completed in a few transactions. The order in which half-warp threads access global memory and the compute capability of the GPU determine whether memory access is coalesced or not. GPUs with compute capability 1.0 and 1.1 have stricter memory access requirements for coalescing. This is explained in the Figures below:

Figure 4.8: Coalesced access: single memory transaction for all compute capabilities
Figure 4.9: Non-coalesced for 1.0 & 1.1 and coalesced for higher compute capability

Figure 4.10: Non-coalesced for 1.0 & 1.1 and coalesced for higher compute capability
Figure 4.11: Non-coalesced for all compute capabilities

When all threads within a half warp access consecutive memory locations within a segment, the sixteen memory accesses can be completed in a single memory transaction of 64 bytes (Figure 4.8). But when memory access is out of sequence as shown in Figure 4.9, or misaligned as shown in Figure 4.10, GPUs with compute capability 1.0 or 1.1 require sixteen memory transactions to complete the requests from sixteen threads. Due to the high latency of global memory, a single transaction takes several hundred clock cycles and hence it takes a long time to complete such a request. Also, a large percentage of the transferred data is wasted, as every memory transaction has to be at least 32 bytes. For random memory access patterns, like the one shown in Figure 4.11, memory coalescing is not possible even for the newer Tesla/Fermi devices, which have much more relaxed coalescing requirements.

Now we will discuss the performance of the SpMV operation in the CUSP implementation for different unstructured matrices on GPUs. These matrices are selected from the University of Florida Sparse Matrix Collection [34]. As all of these matrices come from real applications, the benchmarking results give an idea of how much performance benefit we can expect from the PETSc implementation using the different sparse matrix storage schemes.

Diagonal SpMV: For a diagonal matrix, every thread processes a single row and computes the product of the nonzero elements of that row with the corresponding vector elements. The sub-diagonal elements are stored in column major order in fixed length vectors. Threads in a warp access matrix elements in consecutive memory locations and hence exhibit memory coalescing. Additionally, every thread of a warp corresponds to consecutive columns of the matrix (due to the banded structure), which ensures contiguous memory access of the vector x as well. This ensures good memory coalescing behaviour and makes DIA an ideal format for GPUs. The performance results for the matrices in Figures 4.16 and 4.17 show that DIA outperforms all other storage schemes for perfectly diagonal matrices. With the DIA storage format, we are able to achieve a performance of ~14 GFlop/Sec and ~12 GFlop/Sec for these two matrices respectively. The benchmarking results also show memory bandwidth utilisation of about 85% and 77% of the theoretical peak of the Tesla GPU for the same matrices. This has also been discussed for relatively small structured matrices in [12]. Though the DIA format is very well suited to GPUs, matrices that arise from real applications often cannot be stored in this format. Of our sample matrices, only two can be stored in DIA format on the Tesla C2050. So the DIA storage scheme should be preferred when the matrix has a regular banded or diagonal structure.
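To illustrate why the thread-per-row scheme coalesces well for DIA and ELL, the following simplified CUDA kernel sketches an ELL SpMV in the spirit of the kernels described in [12]. This is not the CUSP kernel itself: the names are ours, and it assumes the ELL arrays are stored column major with leading dimension num_rows and that padded entries carry column index -1.

/* Simplified ELL SpMV sketch (not the actual CUSP kernel). The data and
 * indices arrays are assumed to be stored in column major order with
 * leading dimension num_rows; padded entries carry column index -1. */
__global__ void ell_spmv(int num_rows, int max_cols_per_row,
                         const int *indices, const double *data,
                         const double *x, double *y)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int n = 0; n < max_cols_per_row; n++) {
            /* consecutive threads read consecutive addresses: coalesced */
            int    col = indices[n * num_rows + row];
            double val = data   [n * num_rows + row];
            if (col >= 0)
                sum += val * x[col];
        }
        y[row] = sum;
    }
}

At any point in the loop, consecutive threads (which handle consecutive rows) read consecutive addresses of indices and data, so the accesses are coalesced; a simple thread-per-row CSR kernel does not have this property, which is the main source of the performance gap discussed below.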

ELL SpMV: ELL also stores matrix elements in fixed length vectors with appropriate padding and uses the same parallelisation strategy as DIA, i.e. each thread processes a single row. This format also exhibits good memory coalescing. But as it stores the column index of every nonzero element explicitly, there is an extra memory access cost, and hence for a perfectly banded matrix (Figure 4.16) ELL performance is slightly lower than DIA (11 GFlop/Sec and 11.2 GFlop/Sec respectively). Also, due to the extra padding, not all threads participate in the SpMV operation, which results in non-contiguous memory access for the vectors. Most of our sample matrices with regular sparsity patterns (except Figure 4.14) can be stored in ELL format. The ELL kernel shows performance up to 13.2 GFlop/Sec with a correspondingly high effective bandwidth. Almost all matrices in our benchmarks show higher performance compared to the CSR and COO storage formats. A few matrices with a very irregular structure (like Figure 4.14) cannot be stored in ELL format due to the large number of nonzero elements in a few rows. For these matrices, the HYB storage scheme exhibits higher performance, which is discussed in subsequent sections. So whenever possible, we prefer the ELL or HYB storage schemes for further benchmarking of the PETSc implementation.

CSR SpMV: CSR is the most generic storage scheme and different optimisation techniques have been developed for GPUs, which are documented in [35]. The major problems with the CSR storage scheme are thread divergence and non-coalesced memory access. In this format the nonzero elements of a row are stored contiguously in the data vector. But if we use a thread-per-row parallelisation scheme similar to DIA/ELL, the data elements are not accessed contiguously by the threads in a warp. There is also a performance hit due to the indirection through the row pointers. CUSP introduces a new SpMV implementation for the CSR format, which is discussed in [12]. In this implementation, instead of assigning a thread per row, a single row is assigned to a warp. As all elements of a row are stored contiguously and all threads within a warp are scheduled at the same time, data and indices are accessed contiguously. This results in coalesced memory access and a performance improvement by a factor of 4. We have considered this implementation for benchmarking as it is the default implementation in the CUSP library. However, this scheme involves the extra overhead of a reduction operation to calculate the sum of the partial results of all threads within a warp. Also, if there is a large variation in the number of nonzero elements per row, the threads corresponding to rows with few nonzero elements remain idle and there is a large load imbalance. All of our sample matrices can be stored in CSR format. In most cases we are able to achieve performance of 3-6 GFlop/Sec, which is significantly lower than the performance of the corresponding ELL and DIA formats. In a few cases, shown in Figures 4.12 and 4.14, CSR is four times slower than ELL and DIA. Thread divergence and non-coalesced memory access reduce the effective memory bandwidth by a

large factor; for matrices such as those shown in Figures 4.14 and 4.17, only 10-20% of the theoretical peak bandwidth is achieved.

COO SpMV: The ELL and DIA storage formats exhibit good performance and are well suited to matrices with regular sparsity patterns. The CSR storage format is more generic and suitable for arbitrary sparsity structures, but has lower performance due to thread divergence and non-coalesced memory access. The COO format sits in between these two and its performance is insensitive to the sparsity structure of the matrix. The COO SpMV implementation uses the segmented reduction scheme from [12], where multiple rows are assigned to a single thread. Almost all of our sample matrices show nearly equal performance (3-5 GFlop/Sec), which shows that COO is robust and insensitive to the sparsity structure. The performance is low because of non-coalesced memory access and the extra memory overhead of explicitly storing row as well as column indices.

Hybrid SpMV: In the CUSP implementation, the HYB format is a combination of ELL and COO. For a matrix whose nonzero elements are stored partly in ELL format and partly in COO format, the overall performance of HYB is the combined performance of these two formats. So if it is possible to store the whole matrix in ELL format, the performance of HYB is identical to that of ELL. For our sample matrices, the performance of HYB is slightly lower than that of the corresponding ELL kernel due to the disadvantages of the COO kernel (see COO SpMV). But HYB is very useful for matrices like the one shown in Figure 4.14, where a few rows have a very large number of nonzero elements and the matrix hence cannot be stored in ELL or DIA format.

Note that each of the following Figures consists of three graphs: the first shows the sparsity structure of the matrix, the second shows the performance in Gflops achieved for the SpMV operation and the last shows the bandwidth attained for the SpMV operation. Each matrix is described by the following information:
M: number of rows
N: number of columns
NNZ: number of non zero elements
Id: id of the sparse matrix in the University of Florida Sparse Matrix Collection (more information about the source problem is available in [34]).

[Plots omitted for Figures 4.12-4.21: each figure shows the sparsity structure, the SpMV performance (Gflop/Sec) and the attained bandwidth (GBytes/Sec) for the matrix named in its caption.]

Figure 4.12: MxN: 4,802,000x4,802,000 NNZ: 85,362,744 id: 2496
Figure 4.13: MxN: 3,542,400x3,542,400 NNZ: 96,845,792 id: 1902
Figure 4.14: MxN: 4,690,002x4,690,002 NNZ: 20,316,253 id: 1398
Figure 4.15: MxN: 1,391,349x1,391,349 NNZ: 64,531,701 id:

Figure 4.16: MxN: 1,489,752x1,489,752 NNZ: 10,319,760 id: 2267
Figure 4.17: MxN: 999,999x999,999 NNZ: 4,995,991 id: 1883
Figure 4.18: MxN: 16,614x16,614 NNZ: 1,096,948 id: 409
Figure 4.19: MxN: 1,971,281x1,971,281 NNZ: 5,533,214 id: 374

Figure 4.20: MxN: 1,61,070x1,61,070 NNZ: 8,185,136 id: 2336
Figure 4.21: MxN: 1,20,216x1,20,216 NNZ: 3,121,160 id: 2228
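The SpMV measurements summarised in the Figures above can be reproduced with only a few lines of CUSP. The following sketch is our own illustration, not the benchmark code of [12]; the file name is hypothetical. It loads a Matrix Market file into a chosen device format and times repeated cusp::multiply calls:

#include <cstdio>
#include <cuda_runtime.h>
#include <cusp/io/matrix_market.h>
#include <cusp/hyb_matrix.h>
#include <cusp/array1d.h>
#include <cusp/multiply.h>

int main(void)
{
    /* load a test matrix from a Matrix Market file into HYB format on the GPU;
       CUSP performs the format conversion and host to device copy automatically */
    cusp::hyb_matrix<int, double, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "test_matrix.mtx");

    cusp::array1d<double, cusp::device_memory> x(A.num_cols, 1.0);
    cusp::array1d<double, cusp::device_memory> y(A.num_rows, 0.0);

    /* time a fixed number of SpMV operations */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < 100; i++)
        cusp::multiply(A, x, y);               /* y = A*x */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    /* 2*nnz floating point operations per SpMV */
    double gflops = 2.0 * A.num_entries * 100 / (ms * 1e6);
    printf("average SpMV performance: %.2f GFlop/Sec\n", gflops);
    return 0;
}

Swapping cusp::hyb_matrix for cusp::csr_matrix, cusp::ell_matrix or cusp::dia_matrix repeats the experiment for the other storage formats.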

Chapter 5
Implementation of Sparse Storage Support in PETSc

Currently, the PETSc GPU implementation only supports the CSR sparse matrix format. We have added an implementation supporting other formats like DIA, ELL and HYB, which are well suited to vector architectures. As discussed in Section 4.3, sparse matrix formats like ELL and DIA can perform three to four times faster on GPUs than the CSR format. We expect this new implementation to greatly improve the performance of iterative solvers, for which sparse matrix-vector operations are the key to performance. In this chapter we discuss the design approach and the new implementation that we have developed within the PETSc library.

5.1 Design Approach

For implementing the new sparse matrix support in PETSc, we have chosen an approach similar to the current GPU implementation, i.e. to allow transparent use of GPUs from existing PETSc applications. Also, new CUDA code is not added directly to the library; the idea is to keep the GPU implementation separate from the core PETSc kernels. So, instead of developing a new CUDA implementation for the sparse matrix formats, we have used the CUDA template libraries CUSP and Thrust. CUSP already supports sparse matrix types like COO, DIA, ELL and HYB. As the current PETSc GPU implementation is developed on top of the CUSP and Thrust libraries, it is easier to integrate our new implementation with the existing PETSc development.

As discussed in Section 3.2, in the current implementation new subclasses of Vector and Matrix were added to support vector, matrix and matrix-vector operations on GPUs. Our new implementation is added as an extension to this to support the new sparse matrix formats. Our goal is to improve the performance of sparse matrix-vector operations, which will ultimately improve the performance of iterative solvers. So during this development, we consider the performance of different KSP methods.

When an application creates a Mat object, PETSc internally creates instances of various other classes. For the current GPU implementation, a new subclass of AIJ, called AIJCUSP, was added to PETSc. Figure 5.1 describes how these objects are created within PETSc when sequential applications run with GPU support. When an application creates a Mat object, an instance of the MatSeqAIJ subclass is created internally. For applications running with GPU support, an instance of the newly implemented MatSeqAIJCUSP class is also created. This internally creates a CUSP CSR matrix object, which actually stores the matrix data on the GPU. All matrix and matrix-vector operations in MatSeqAIJCUSP are implemented using the CUSP and Thrust libraries. This application workflow is shown in Figure 5.1 below:

Figure 5.1: PETSc Object creation in current GPU implementation

Our new design is based on the above implementation and has a similar application workflow, but we have modified PETSc classes and added new implementations for matrix conversion. This is shown in Figure 5.2. With our new implementation, PETSc applications are able to choose the sparse matrix format used on the GPU for MatAIJ objects. For this, a new user level API is exposed and the Mat object is modified to support it. The MatSeqAIJCUSP class is now able to store the matrix object in various sparse storage formats such as ELL, DIA and HYB on the GPU. For this we have added an implementation to convert a PETSc MatAIJ object to the corresponding CUSP object on the GPU. The matrix and matrix-vector operations are performed in the same way using the CUSP and Thrust libraries.

Figure 5.2: PETSc Object creation and new sparse matrix support in the new GPU implementation

5.2 Implementation Details

In this section we discuss our new implementation in detail. Due to the modular structure of PETSc, it is easy to extend the current functionality and to add new features. But the different Vector and Matrix classes are used by other PETSc components, and hence we have to understand the dependencies between them.

5.2.1 New Matrix types for GPU

PETSc supports various matrix types like AIJ, Block AIJ, Shell or Dense on CPUs. Depending upon the problem being solved, a user can set the appropriate matrix format from the application code using the MatSetType() routine or using the option database key mat_type as a command line option. But in the initial GPU implementation, PETSc does not provide a choice of matrix type on the GPU: by default PETSc creates a copy of the PETSc Mat object in CSR format on the GPU. So, in our implementation we have added COO, DIA, ELL and HYB matrix types, which are defined below:

#define PETSC_CUSP_MAT_CSR 0 /* CSR matrix type on GPU */
#define PETSC_CUSP_MAT_ELL 1 /* ELL matrix type on GPU */
#define PETSC_CUSP_MAT_HYB 2 /* HYB matrix type on GPU */
#define PETSC_CUSP_MAT_DIA 3 /* DIA matrix type on GPU */
#define PETSC_CUSP_MAT_COO 4 /* COO matrix type on GPU */

Now, depending upon the sparsity structure of the matrix, a user can choose the appropriate matrix storage format on the GPU. These new sparse matrix formats are exposed to user applications by adding these definitions to the header petscmat.h. Note that these matrix types represent the storage format used on the GPU; users still need to set the appropriate type of the PETSc Mat object on the CPU.

5.2.2 PETSc Mat Object

PETSc defines a common matrix data structure which is used as a base for all matrix types, such as SeqAIJ, MPIAIJ, BAIJ or Shell. This common data structure is defined in the struct _p_Mat. In our implementation, we have added a new property, sparse_gpu_matrix_type, to _p_Mat which indicates the sparse storage format that will be used when the GPU copy is created. This property is accessible only if the PETSc library is built with GPU support. Note that the current PETSc implementation supports only the MatAIJ class on the GPU, so it would be sufficient to add the new property to the Mat_SeqAIJ class. But considering future development for different matrix classes, we have added the new property to the matrix base class. The sparse_gpu_matrix_type property can take one of the values defined in Section 5.2.1. Applications can set this property using the new user level API, which we discuss in Section 5.2.3.
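A minimal sketch of this change is shown below; only the new member is shown, and the guard macro and its placement within the structure are illustrative rather than the exact PETSc source.

/* sketch: the new member added to the common matrix structure
   (all other members omitted; guard macro name illustrative) */
struct _p_Mat {
  ...
#if defined(PETSC_HAVE_CUSP)
  PetscCUSPMatSparseType sparse_gpu_matrix_type; /* CUSP format used for the GPU copy,
                                                    e.g. PETSC_CUSP_MAT_ELL */
#endif
  ...
};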

5.2.3 New User level API

PETSc applications cannot directly manipulate or access the internal data structures of PETSc objects; the internal representation of all objects is hidden from applications. Therefore PETSc applications cannot directly access the new property sparse_gpu_matrix_type added to the Mat class. In our new implementation, the user is able to select an appropriate sparse matrix format on the GPU through a new user level API called MatSetCUSPSparseType. In PETSc, in order to provide any new API we first have to register it with the corresponding class register interface. The implementation of this new user level API is shown in Figure 5.3:

PetscErrorCode MatSetCUSPSparseType(Mat mat, PetscCUSPMatSparseType type)
{
  PetscBool isValidSparseType = PETSC_FALSE;

  /* check whether the user has provided a valid sparse matrix format */
  MatCheckCUSPSparseType(&isValidSparseType, type);

  /* if not a valid type, display a PETSc error and abort execution */
  if (isValidSparseType == PETSC_FALSE)
    SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_ARG_UNKNOWN_TYPE,
             "UNKNOWN CUSP SPARSE TYPE ON GPU: %d",type);

  /* otherwise, set the requested sparse matrix format for the GPU copy */
  mat->sparse_gpu_matrix_type = type;
  PetscFunctionReturn(0);
}

Figure 5.3: New user level API registration with the Mat class (petsc-src/mat/interface/matreg.c)

PETSc applications use the MatSetCUSPSparseType() routine to set the sparse matrix format of the Mat object on the GPU. Users can use one of the sparse matrix types defined in Section 5.2.1 as the value of the PetscCUSPMatSparseType parameter. We have added a new method, MatCheckCUSPSparseType(), that checks whether the user has provided a sparse matrix type that is supported by our implementation (i.e. COO, CSR, DIA, ELL and HYB). Once a user sets this property, our new implementation creates the appropriate sparse matrix copy on the GPU, as discussed in the subsequent sections.
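From an application, a typical call sequence might then look as follows (a sketch modelled on the example in Section 5.3; n is a placeholder dimension and error checking is omitted):

Mat A;
MatCreate(PETSC_COMM_WORLD, &A);
MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
MatSetType(A, MATSEQAIJCUSP);                 /* run the AIJ matrix on the GPU  */
MatSetCUSPSparseType(A, PETSC_CUSP_MAT_HYB);  /* store the GPU copy in HYB form */
/* ... assemble the matrix and use it with a KSP solver as usual ... */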

5.2.4 PETSc Mat objects on GPU

When a PETSc application runs on the CPU, every Mat object contains an instance of MatSeqAIJ or MatMPIAIJ, depending on whether the application is serial or parallel. For the GPU implementation, every Mat object now contains another instance of either Mat_SeqAIJCUSP or Mat_MPIAIJCUSP. These are sub-classes that hold a copy of the PETSc Mat object on the GPU; more details about these objects can be found in [36]. To support different sparse matrix formats, we have to modify these class implementations, as discussed below.

Sequential Implementation

In the original implementation, the Mat_SeqAIJCUSP class stores the Mat object data in CSR format only. We have modified this class to support the other sparse storage formats. The new class now contains mat_ell, mat_dia, mat_hyb and mat_coo pointers, which can hold the matrix data in ELL, DIA, HYB or COO format on the GPU. The modified Mat_SeqAIJCUSP class is shown in Figure 5.4 below:

/* these are the new CUSP sparse matrix formats that the new implementation supports */
/* for a CUSP object, we have to specify the index type, value type and memory space */
#define CUSPMATRIX_DIA cusp::dia_matrix<PetscInt,PetscScalar,cusp::device_memory>
#define CUSPMATRIX_ELL cusp::ell_matrix<PetscInt,PetscScalar,cusp::device_memory>
#define CUSPMATRIX_HYB cusp::hyb_matrix<PetscInt,PetscScalar,cusp::device_memory>
#define CUSPMATRIX_COO cusp::coo_matrix<PetscInt,PetscScalar,cusp::device_memory>

/* modified Mat_SeqAIJCUSP class */
struct Mat_SeqAIJCUSP {
  ...
  /* depending upon the user request, one of the following is used on the GPU */
  CUSPMATRIX*     mat;      /* pointer to the matrix on the GPU in CSR format */
  CUSPMATRIX_ELL* mat_ell;  /* pointer to the ELL matrix on the GPU */
  CUSPMATRIX_DIA* mat_dia;  /* pointer to the DIA (diagonal) matrix on the GPU */
  CUSPMATRIX_HYB* mat_hyb;  /* pointer to the HYB (hybrid) matrix on the GPU */
  CUSPMATRIX_COO* mat_coo;  /* pointer to the COO matrix on the GPU */
  ...
};

Figure 5.4: Modified Mat_SeqAIJCUSP class with ELL, DIA, HYB and COO storage support using the CUSP library

With the original implementation, when PETSc applications run with GPU support, an instance of Mat_SeqAIJCUSP is created and CUSPMATRIX* mat holds the matrix data on the GPU in CSR format. In our new implementation, depending upon the requested sparse format (i.e. the value of the property sparse_gpu_matrix_type), we make a

transparent conversion from CSR to the other sparse matrix formats using the CUSP library. We discuss this implementation in Section 5.2.6.

Parallel Implementation

For parallel PETSc applications, an object of Mat_MPIAIJ is created instead of Mat_SeqAIJ. Note that for parallel applications the PETSc Mat object is stored internally as two matrix objects: one for the diagonal block elements and the other for the off-diagonal block elements. So when a PETSc application with GPU support creates an instance of a MatMPIAIJ object (the parent), PETSc internally creates two instances of MatSeqAIJCUSP objects (the children). In this case, when an application sets the sparse_gpu_matrix_type property using our MatSetCUSPSparseType(), it only sets the property of the parent object. So during the pre-allocation of the MPIAIJCUSP object, it is important to set the property of the two child objects to the value held by the PETSc MatMPIAIJ parent object.

5.2.5 Conversion of PETSc MatAIJ to CUSP CSR

The AIJ storage format is nothing but the CSR (Yale) format. Note that the entire PETSc Mat object is not copied; only the non zero matrix elements are copied to the GPU. For a CSR matrix we require three vectors: data, column indices and row_ptr. The PETSc MatAIJ object already contains all the data structures required for the CUSP CSR format, so converting a PETSc Mat object to a CUSP CSR object is easy and is shown in Figure 5.5: we just have to copy these vectors to the CUSP CSR matrix object on the GPU. We have used this implementation to convert the MATAIJ type to a CUSP CSR object. The original PETSc GPU implementation also uses the same technique for this conversion.

/* consider a PETSc Mat object pMat, with maij the instance of Mat_SeqAIJ inside it */
Mat_SeqAIJ *maij = (Mat_SeqAIJ*)pMat->data;

/* create a CUSP matrix object in CSR format on the GPU, i.e. in device memory */
cusp::csr_matrix<PetscInt,PetscScalar,cusp::device_memory> csrmat;

/* allocate memory for the CUSP object: number of rows, columns and non zeros */
csrmat.resize(pMat->rmap->n, pMat->cmap->n, maij->nz);

/* copy the pMat->rmap->n (i.e. M) + 1 row offsets from PETSc to the CUSP matrix */
csrmat.row_offsets.assign(maij->i, maij->i + pMat->rmap->n + 1);

/* copy the column indices of all non zero elements */
csrmat.column_indices.assign(maij->j, maij->j + maij->nz);

/* finally, copy all non zero elements of the matrix */
csrmat.values.assign(maij->a, maij->a + maij->nz);

Figure 5.5: Converting a PETSc AIJ matrix to a CUSP CSR matrix

5.2.6 Conversion of PETSc MatAIJ to CUSP DIA/ELL/HYB/COO

To support the different matrix formats on the GPU, we have to convert PETSc MatAIJ objects to the DIA, ELL, HYB and COO formats of CUSP. There are two approaches that can be implemented: transparent matrix format conversion using CUSP, or implementing new conversion routines in PETSc. These two approaches are discussed below.

Matrix conversions with CUSP: The CUSP library provides the functionality to convert a matrix between different sparse storage schemes. It is also possible to convert a matrix object stored in one format on the host to a matrix stored in a different format on the device. This is illustrated with the example in Figure 5.6.

/* create a CUSP matrix object in COO format on the host */
cusp::coo_matrix<int, float, cusp::host_memory> hostcoo;

/* create the matrix for a Poisson problem on a 2-D 100x100 grid */
cusp::gallery::poisson5pt(hostcoo, 100, 100);

/* create a CUSP matrix object in ELL format on the GPU */
cusp::ell_matrix<int, float, cusp::device_memory> deviceell;

/* transparent conversion of the COO matrix to the ELL matrix, including the
   host to device memory copy, handled by CUSP */
deviceell = hostcoo;

Figure 5.6: Transparent conversion between different sparse formats with CUSP

In the above example we have used the CUSP gallery interface. The cusp::gallery interface can be used to create matrices for standard Poisson or diffusion problems on a 2-D grid with a 5-point stencil. This interface automatically performs the memory allocation on the host or device depending on the memory space parameter. When we assign the COO matrix object on the host to the ELL matrix on the device, CUSP automatically handles the memory allocation for the new object on the GPU as well as the data transfer from the host to the GPU. CUSP hides all this complexity of memory allocation and data transfer and internally uses the cudaMalloc and cudaMemcpy routines. So in our new implementation we can use this functionality of CUSP to convert a CSR matrix to any of the other formats, which is straightforward. The only disadvantage of this scheme is that we first have to construct a CUSP CSR object from the PETSc Mat object, which can then be converted transparently to any other format. So we create one extra CUSP CSR copy on the CPU, which results in a small memory overhead. For our initial implementation and evaluation we have used this approach, as the conversion has to be done only once; the overhead can be avoided by implementing the conversion routines within PETSc. We have also implemented such a conversion scheme for ELL matrices, which is discussed next.
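As a concrete illustration of this first, CUSP-based approach (the PETSc-internal alternative follows next), the conversion selected through sparse_gpu_matrix_type could be driven roughly as follows. This is a sketch only: the variable names follow the earlier figures and are not the exact code added to the library.

/* sketch: build the CSR copy on the host (as in Figure 5.5, but in host memory)
   and let CUSP convert it to the requested device format */
cusp::csr_matrix<PetscInt,PetscScalar,cusp::host_memory> hostcsr;
hostcsr.resize(pMat->rmap->n, pMat->cmap->n, maij->nz);
hostcsr.row_offsets.assign(maij->i, maij->i + pMat->rmap->n + 1);
hostcsr.column_indices.assign(maij->j, maij->j + maij->nz);
hostcsr.values.assign(maij->a, maij->a + maij->nz);

if (pMat->sparse_gpu_matrix_type == PETSC_CUSP_MAT_DIA) {
  cuspstruct->mat_dia = new CUSPMATRIX_DIA;
  *cuspstruct->mat_dia = hostcsr;   /* CSR->DIA conversion plus host to device copy */
} else if (pMat->sparse_gpu_matrix_type == PETSC_CUSP_MAT_HYB) {
  cuspstruct->mat_hyb = new CUSPMATRIX_HYB;
  *cuspstruct->mat_hyb = hostcsr;   /* CSR->HYB conversion plus host to device copy */
}
/* ... ELL and COO are handled in the same way; note that CUSP refuses a
   CSR->DIA or CSR->ELL conversion if it would require excessive padding */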

Matrix conversion within PETSc: To avoid the extra memory overhead of converting the PETSc Mat object to CUSP CSR on the host and then CUSP CSR to CUSP ELL/DIA/HYB objects on the device, we can implement new conversion routines within PETSc. We have implemented this scheme for the ELL matrix format. The algorithmic implementation for converting a PETSc MatAIJ object to the CUSP ELL format is shown in Figure 5.7. For an ELL matrix, we have to calculate the maximum number of non zero elements per row. From a CSR matrix, we can calculate this by finding the maximum difference between successive row offsets using the thrust::inner_product routine. The CUSP library also provides various conversion utilities that we have used; for example, we can expand the row offsets of a CSR matrix into per-element row indices by using the cusp::detail::offsets_to_indices routine.

/* calculate max entries per row, i.e. the max difference between successive row offsets */
max_entries_per_row = thrust::inner_product(pmat_start+1, pmat_end, pmat_start,
                                            int(0),
                                            thrust::maximum<int>(),
                                            thrust::minus<int>());

/* create the ELL matrix and allocate memory */
cuspstruct->mat_ell = new CUSPMATRIX_ELL;
cuspstruct->mat_ell->resize(pmat_rows, pmat_cols, pmat_nnz, max_entries_per_row);

/* expand the row offsets of the PETSc Mat into per-element row indices */
cusp::detail::offsets_to_indices(pmat_row_offsets, ell_row_indices);

/* fill all column indices with -1 (padding marker) */
thrust::fill(cuspstruct->mat_ell->column_indices.values.begin(),
             cuspstruct->mat_ell->column_indices.values.end(), int(-1));

/* fill all values of the ELL matrix with 0 */
thrust::fill(cuspstruct->mat_ell->values.values.begin(),
             cuspstruct->mat_ell->values.values.end(), int(0));

/* ... here, code for calculating the scatter map which describes the mapping of
   CSR elements to ELL matrix elements ... */

/* scatter the column indices of the PETSc MatAIJ into the CUSP ELL matrix using the map */
thrust::scatter(pmat_column_indices.begin(), pmat_column_indices.end(),
                scattermap.begin(), cuspstruct->mat_ell->column_indices.values.begin());

/* scatter the values of the PETSc MatAIJ into the CUSP ELL matrix */
thrust::scatter(pmat_values.begin(), pmat_values.end(),
                scattermap.begin(), cuspstruct->mat_ell->values.values.begin());

Figure 5.8: Converting PETSc MatAIJ to CUSP ELL format (Algorithmic Implementation)
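The scatter map computation elided in Figure 5.8 (and described in the next paragraph) could, for instance, be implemented with Thrust along the following lines. This is only a sketch of one possible approach, not the project's code: it assumes the ELL arrays use a column major layout whose leading dimension equals the number of rows, the variable names are illustrative, and Thrust's transform, counting_iterator and permutation_iterator headers are assumed to be included.

/* For the n-th CSR entry, its position within its row is
 *     pos = n - row_offsets[row[n]]
 * and, for a column major ELL layout with leading dimension pmat_rows,
 * its destination index is
 *     scattermap[n] = pos * pmat_rows + row[n]
 */
struct EllIndex
{
  PetscInt rows;
  EllIndex(PetscInt r) : rows(r) {}
  __host__ __device__ PetscInt operator()(PetscInt pos, PetscInt row) const
  { return pos * rows + row; }   /* column major ELL index */
};

/* destination index for every non zero element */
cusp::array1d<PetscInt,cusp::device_memory> scattermap(pmat_nnz);

/* within-row position of every entry: n - row_offsets[row[n]]
   (re-using the ell_row_indices produced by offsets_to_indices in Figure 5.8) */
thrust::transform(thrust::counting_iterator<PetscInt>(0),
                  thrust::counting_iterator<PetscInt>(pmat_nnz),
                  thrust::make_permutation_iterator(pmat_row_offsets.begin(),
                                                    ell_row_indices.begin()),
                  scattermap.begin(),
                  thrust::minus<PetscInt>());

/* combine with the row index to obtain the final ELL destination index */
thrust::transform(scattermap.begin(), scattermap.end(),
                  ell_row_indices.begin(),
                  scattermap.begin(),
                  EllIndex(pmat_rows));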

As we use padding for the rows of an ELL matrix, we have to re-compute the new column indices for the non zero elements of the CSR matrix. For this, we first calculate the permutation from CSR indices to ELL indices and from it the scatter map. The scatter map basically tells us how the column indices of the CSR elements correspond to the column indices of the ELL elements. This is discussed in more detail in [37] and in our code implementation. Once the scatter map is ready, we use the Thrust scatter routine to copy the column indices and non zero elements. For this implementation, we have used CUSP library routines that are used internally by the library for transparent conversion between different sparse matrix formats and are not exposed to users. As CUSP is a template library (the implementation comes as headers), we have used these routines directly by including those implementations. Note that we have illustrated the implementation for converting a PETSc Mat object to the CUSP ELL matrix type; for the other sparse matrix formats, a similar approach can be used with the CUSP and Thrust libraries. More information about these conversion routines can be found in [38].

5.2.7 Matrix-Vector multiplication for different sparse formats

The important part of supporting different sparse matrix formats in PETSc is to use an efficient implementation of sparse matrix-vector multiplication. The CUSP library provides sparse matrix-vector multiplication routines for the different sparse matrix formats on the CPU as well as the GPU. In the original implementation of PETSc GPU support, the sparse CSR matrix-vector multiplication routines from CUSP are used; these routines are already optimised for NVIDIA GPUs. So we have also used these routines to support SpMV with DIA, ELL, HYB and COO sparse matrices on the GPU. CUSP provides the same cusp::multiply interface for the different sparse matrix types. As these are templated functions, the compiler generates the appropriate instance of cusp::multiply depending on the types of the parameters. In our new implementation, depending on the sparse matrix type of the Mat object, we pass the appropriate CUSP matrix object to cusp::multiply. This implementation is simple and is shown in Figure 5.9 below:

Mat_SeqAIJCUSP *cuspstruct = ... /* Mat_SeqAIJCUSP pointer from the Mat object */

/* depending upon the sparse type, pass the appropriate CUSP pointer */
if (pmat->sparse_gpu_matrix_type == PETSC_CUSP_MAT_ELL)
  cusp::multiply(*cuspstruct->mat_ell, *xarray, *yarray);
else if (pmat->sparse_gpu_matrix_type == PETSC_CUSP_MAT_HYB)
  cusp::multiply(*cuspstruct->mat_hyb, *xarray, *yarray);
else if ... /* same for the CSR, COO and DIA types */ ...
else /* unknown sparse matrix type provided */
  SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"Mat Sparse Type Error!");

Figure 5.9: Sparse Matrix-Vector operation support for different matrix formats using CUSP

5.2.8 Other Important notes

This is our initial implementation, in which we support sparse matrix-vector multiplication for the different sparse formats on the GPU. We have modified routines such as MatCreateCUSPCopy, MatCUSPCopyToGPU, MatMult_SeqAIJCUSP, MatMultAdd_SeqAIJCUSP and MatDestroy_SeqAIJCUSP. These routines are used by the KSP solvers that we have used for testing and performance evaluation. Other PETSc routines, such as MatInodeCUSPCopyToGPU, MatMult_SeqAIJCUSP_Inode or MatCUSPCopyFromGPU, also manipulate the CUSP matrix object on the GPU. Hence, to support these routines for other methods or solvers, some more work is needed, which we have outlined in the future work section.

5.3 Sample Use Case and Validation

This new implementation works with the current PETSc GPU implementation. Applications running with Vector type VECSEQCUSP or VECMPICUSP and Matrix type MATSEQAIJCUSP or MATMPIAIJCUSP run on the GPU as in the original implementation. If a user wants to use a new sparse matrix format on the GPU, they have to set the type explicitly by using MatSetCUSPSparseType. Currently this method supports the DIA, ELL, HYB and COO sparse matrix types which are implemented in CUSP. If the sparse matrix type is not set, the CSR matrix format is used by default on the GPU. Note that, in future, new matrix formats from CUSP can easily be supported. Figure 5.10 shows a sample use of this implementation; the only difference from existing applications is the call to MatSetCUSPSparseType.

...Create PETSc Mat object pMat...

/* now applications can explicitly set the sparse matrix type used on the GPU */
ierr = MatSetCUSPSparseType(pMat, PETSC_CUSP_MAT_ELL);
ierr = MatSetSizes(pMat,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);

...matrix assembly & vector setup code...
ierr = VecCreate(PETSC_COMM_WORLD,&u);
ierr = VecSetSizes(u,PETSC_DECIDE,m*n);CHKERRQ(ierr);
ierr = VecDuplicate(b,&x);

...create KSP context & set tolerances...
ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);
KSPSetOperators(ksp,pMat,pMat,DIFFERENT_NONZERO_PATTERN);

...solve linear system...
ierr = KSPSolve(ksp,b,x);

...post processing code...

Figure 5.10: Simple example of KSP with the use of a new sparse matrix format

We have tested our implementation with KSP solvers such as Conjugate Gradient (CG), Biconjugate Gradient (BiCG) and the Generalised Minimal Residual method (GMRES) on GPUs with different sparse matrix formats, and compared these results with the results from the application running on the CPU (i.e. without the new implementation). Figure 5.11 shows the convergence of the CG method for the simple 2-D Laplacian example [39] from PETSc. The graph shows a single line, as at every iteration the residual norm remains the same for the different sparse matrix formats on the GPU and on the CPU.

[Plot omitted: residual norm against number of CG iterations for runs on 1, 2 and 4 processes, on the CPU and on the GPU with the CSR, DIA, ELL and HYB formats; all curves coincide.]

Figure 5.11: Convergence of KSP CG solver with different sparse matrix formats on CPU & GPUs for a simple 2-D Laplacian example from PETSc

We have tested this example in sequential and in parallel, on CPUs as well as GPUs. The results (residual norm per iteration) for all cases (sequential/parallel on CPUs/GPUs) are the same irrespective of the sparse matrix format used. This shows that the new implementation works correctly with the different sparse matrix formats.

Chapter 6
Wrapper Codes and Benchmarks

In this chapter we discuss the wrapper codes that were developed for benchmarking the new implementation, for converting the Matrix Market format to the PETSc binary format and for measuring the cost of the CUSP matrix conversion routines.

6.1 Testing Codes

In order to test the new implementation, we do not need to write new test cases; all existing KSP examples and test cases can be used. Note that all PETSc examples can run transparently on GPUs with the CSR sparse matrix format without modifying the existing source code. We have used different KSP examples and test examples that come with the PETSc development source release. We have to set the matrix type to MATSEQAIJCUSP or MATMPIAIJCUSP using the option database key mat_type, which is discussed in [40]. For our new implementation, in addition to setting the matrix type to MATSEQAIJCUSP or MATMPIAIJCUSP, the application has to set the sparse format used on the GPU using MatSetCUSPSparseType, as discussed in Section 5.3. With this modification, we have used the same PETSc test cases. We have also tested different KSP solvers and preconditioners with the same examples, using different values of the option database keys ksp_type and pc_type.

6.2 Matrix Market to PETSc binary format

For this project, we have used matrices from the UFL Sparse Matrix Collection [34]. These matrices are provided in formats such as Matrix Market (MM), Matlab (MAT) and Rutherford/Boeing (RB); more information about these formats can be found in [41]. In order to use these matrices directly with the PETSc examples, we have to convert them from Matrix Market format to PETSc binary format. Matrix Market is a widely used ASCII exchange format for matrices. A Matrix Market file consists of four parts:

Header line: specifies the identifier, pattern, format, etc.
Comment lines: information or comments about the matrix
Size line: specifies the number of rows, columns and non zero elements
Data lines: specify the row index, column index and value of each non zero entry

To convert this format to the PETSc binary format, we first create a PETSc Mat object from the Matrix Market file and then write the Mat object directly to a binary file using a PetscViewer. The important thing to keep in mind about this conversion process is the pre-allocation of the PETSc Mat object. As these sparse matrices are very large (Matrix Market file sizes vary from a few MBs up to GBs), it is important to pre-allocate the memory for the PETSc Mat object. Otherwise, every time we set values of the Mat object from the Matrix Market file, PETSc internally reallocates memory, as PETSc has no information about the exact matrix size. Without pre-allocation the conversion process for large sparse matrices takes a few hours due to the high cost of memory reallocation. PETSc provides an option to pre-allocate memory for a Mat object: we pass an array describing the distribution of non zero elements per row. This implementation is shown in Figure 6.1 below:

/* histogram: distribution of non zero elements per row */
while (...) {
  fscanf(filein,"%d %d %le\n",&row,&col,&val);
  row = row-1;
  rowDistribution[row]++;
}

/* create the Mat object: important to use pre-allocation */
MatCreateSeqAIJ(PETSC_COMM_WORLD,M,N,0,rowDistribution,&pMat);

/* matrix assembly code here */

/* create a PETSc binary viewer for binary file I/O */
PetscViewerBinaryOpen(PETSC_COMM_WORLD,outputfile,FILE_MODE_WRITE,&view);

/* with the PetscViewer, we can write the matrix object to the binary file */
MatView(pMat,view);

Figure 6.1: Converting Matrix Market format to PETSc binary format (Algorithmic Implementation)

Note that the UFL sparse matrix collection provides right hand side vectors (i.e. b for the linear system Ax=b) for a few matrices. These vectors are also in Matrix Market format; we have used a similar approach and implemented routines for converting a vector to binary format.

6.3 Benchmarking codes

We have measured the performance of our new implementation on the HECToR GPU system and compared it with the performance on HECToR (Phase 2b). We have also benchmarked the cost of sparse matrix format conversion in CUSP. The codes developed for benchmarking are discussed below.

1) CUSP matrix format conversion

As we have used the sparse matrix conversion routines from CUSP for the new implementation in PETSc, we have measured the cost associated with these conversion routines. We have written a small benchmark that takes a Matrix Market file as input and then converts this matrix from CSR format on the host to the DIA, ELL and HYB formats on the device. Additionally, in the same benchmark we have implemented routines that measure the performance of the iterative solvers from the CUSP library on the GPU with different sparse matrix formats. As PETSc uses preconditioners from CUSP, this benchmark can also be used to measure the performance of the different sparse matrix formats with the CUSP iterative solvers, for comparison with the PETSc solvers.

2) PETSc KSP Solvers

For benchmarking the performance of the new sparse matrix support in PETSc, we have used a parallel KSP solver example. There are various such examples in the PETSc library; we have modified them to work with a binary matrix created from the Matrix Market format. This code takes a binary matrix and, optionally, a right hand side vector as input in binary format. PETSc provides the functionality to load a matrix directly from a binary file through the PetscViewerBinaryOpen and MatLoad routines. As previously discussed, this example can also be used with different KSP solvers and preconditioners by using the option database keys -pc_type, -ksp_type, etc. More details can be found in [42].

6.4 Benchmarking Approach

Our goal is to compare the performance of the current PETSc GPU implementation with our new implementation on the HECToR GPU cluster and the HECToR (Phase 2b) service. All KSP (Krylov subspace) codes run with the GMRES method and without any preconditioner. Note that the benchmarking codes do not wait for convergence of the system; typically a code executes 20-30K KSP iterations, which gives sufficient profiling information for performance analysis.

For the benchmarking results the following terminology is used:

Total execution time: the time required to finish the benchmark for the specified sparse matrix type
Performance (Gflops/Sec): the number of floating point operations performed during the execution divided by the total execution time
SpMV performance: the amount of time spent in sparse matrix-vector operations

Note: we have used the PETSc inbuilt profiling interface to collect the profiling data.
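For illustration, a run of the KSP benchmark with profiling enabled might be launched as follows. The executable name and the -f option are hypothetical (they depend on the modified example used), while mat_type, vec_type, ksp_type, pc_type, ksp_max_it and log_summary are standard PETSc option database keys, the last one producing the profiling summary mentioned above; the explicit seq/mpi type names from Section 6.1 can be given instead of the generic aijcusp/cusp values.

    mpiexec -n 4 ./ksp_benchmark -f matrix.petsc -mat_type aijcusp -vec_type cusp \
            -ksp_type gmres -pc_type none -ksp_max_it 20000 -log_summary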

Chapter 7
Performance Analysis

In this chapter we discuss the performance of our PETSc GPU implementation on the HECToR GPGPU testbed. We compare the performance of our new implementation with the existing GPU support available in the PETSc library, focusing mainly on the DIA, ELL and HYB sparse matrix storage schemes, which are the most suitable for GPUs. We also compare the performance of PETSc on the main HECToR (Phase 2b) system with the HECToR GPU system. Note that these two systems have different architectures and interconnects, but the results nonetheless give an idea of the performance and scalability on multi-GPU systems.

We have used the PETSc inbuilt profiling interface for collecting the performance information. Note that when this profiling interface is used, PETSc internally synchronises CUDA calls to get exact timings for the various phases (communication, data copy, CPU-GPU transfer), so the performance data may not exactly match the results obtained without the profiler interface. These results are nevertheless valid for comparing our new implementation with the existing GPU support, as the same profiler interface is used for both sets of results.

7.1 Benchmarking System

For the performance analysis of PETSc, we have used the HECToR GPGPU testbed [43] facility, which is part of HECToR, the UK National Supercomputing Service. Currently this GPU cluster consists of four compute nodes connected with a quad-data-rate Infiniband interconnect. Each compute node has a quad-core Intel Xeon (2.4 GHz) CPU with 8 GB of main memory. Three of the nodes have four NVIDIA C2050 (Fermi) GPU cards each; the remaining node has one NVIDIA and one AMD FireStream GPU card. The system layout is shown in Figure 7.1 below:

Figure 7.1: HECToR GPGPU testbed system, consisting of NVIDIA and AMD GPUs connected by an Infiniband network

For our benchmarking results, we have used the three nodes with NVIDIA GPUs. The specification of the NVIDIA C2050/C2070 card is given below:

Figure 7.2: Tesla C2050/C2070 specification [58]

For comparing the performance of PETSc 3.1 (i.e. the CPU version), we have used the HECToR (Phase 2b) service. More information about HECToR can be found in [44].

7.2 Single GPU Performance

In this section we discuss the performance of our PETSc GPU implementation on a single GPU node. This helps us analyse the performance without considering the Infiniband interconnect between the compute nodes. We have chosen large structured as well as unstructured sparse matrices from the UFL sparse matrix collection. Our new implementation is more suitable for matrices with structured sparsity patterns, so the structured sparse matrices are used to measure the performance benefit of our implementation in PETSc. The unstructured matrices give a more general idea of the performance of the PETSc GPU implementation for matrices without any specific sparsity pattern.

7.2.1 Structured Matrices

If the matrix has a regular sparsity pattern, we can store it with the newly implemented formats like DIA or ELL. We consider the example of a matrix arising from the 2-D Laplacian equation discretised with a 5-point stencil. The dimension of the matrix is 1,000,000 x 1,000,000 and the number of non zero elements is 5,000,000. This matrix has a banded diagonal structure and all non zero elements are located in a few sub-diagonals near the main diagonal. Due to the banded diagonal structure, we can store this matrix effectively in the diagonal format; we can also store it in the COO, ELL, CSR and HYB formats. So we have chosen this matrix for comparing the performance of all of the provided sparse storage schemes on the GPU. Figure 7.3 shows the execution time of the program using the GMRES method with the different sparse matrix storage formats.

Figure 7.3: Total execution time with different sparse matrix formats on GPU
Figure 7.4: Performance with different sparse matrix formats on GPU

The current implementation only supports the CSR matrix format on the GPU. The execution time for this format is higher than for any other sparse matrix format, because of the non-coalesced memory access and thread divergence on the GPU; we have already discussed this problem in Section 4.3. The COO format takes slightly less time than CSR. In the CUSP library, the COO sparse matrix-vector operation is implemented using a segmented reduction scheme where multiple rows are assigned to a single thread. This exhibits better memory coalescing than CSR (Section 4.3), but there is an extra memory overhead because of the explicit storage of row indices and hence only a small performance benefit.

The diagonal matrix format outperforms all other storage schemes. In the diagonal sparse matrix format, all non zero elements of the matrix are stored in column major order in fixed length vectors. Also, the threads of a warp correspond to consecutive rows, and hence to consecutive columns of the matrix (as we partition the matrix by rows). Because of this, all threads in the warp access contiguous memory locations, which results in good memory coalescing. So the diagonal storage format gives the minimum execution time. The ELL storage format performs better than CSR and COO but delivers lower performance than DIA. As ELL stores the matrix in equal length vectors with appropriate padding, memory access is well coalesced, but compared to DIA there is an extra memory overhead due to the explicit storage of column indices.

The performance of HYB is almost the same as for the ELL storage format. As discussed in Section 4.3, the HYB format stores the matrix in two parts: an ELL matrix containing the rows having nearly the same number of non zero elements, and a COO matrix containing the rows showing a large deviation in the number of non zero elements. For a perfectly diagonal matrix, all rows are stored in the ELL part (i.e. 100% ELL and 0% COO), so the HYB sparse matrix-vector multiplication is the same as the ELL sparse matrix-vector operation. But at the time of creating the HYB matrix, we have to compute the distribution of non zero elements per row and check whether there is a large variation, so there is a small performance penalty due to the extra computations.

Figure 7.4 shows the achieved Gflops for the different sparse matrix types on the GPU in double precision. The flop rate is a reasonable measure of CPU or GPU performance and gives an idea of how well the code performs on a particular system. For our example, the total number of floating point operations remains nearly the same for all matrix formats, so the Gflops reflect the trend of the execution time. The DIA sparse matrix format shows a 31% improvement in Gflops compared to the default CSR storage format. This example illustrates the performance improvement due to the new sparse matrix support added to PETSc for structured sparse matrices.

Performance Improvements in SpMV

The main goal of supporting new sparse matrix types is to improve the performance of the matrix-vector operation in iterative methods. Here we discuss the performance improvement for the example above. Figure 7.5 shows the execution time of the matrix-vector operation for the different sparse matrix formats on the GPU. The SpMV execution for the DIA format takes the lowest time and is about two times faster than the CSR format. But if we look at the total execution times for CSR and DIA (Figure 7.3), the gap between them is considerably smaller than this factor of two. The main reason is the orthogonalisation step of the GMRES method. During the Gram-Schmidt orthogonalisation of the set of vectors produced by the Krylov iteration, the PETSc implementation of classical Gram-Schmidt uses two key routines: VecMDot (multiple dot products) and VecMAXPY (multiple AXPY operations, y = y + sum_i alpha_i x_i). These two routines are effectively matrix-vector products where the matrix is stored as pointers to the rows [45]. So when considering the total improvement due to the sparse matrix formats, we have to consider the MatMult operation together with the VecMDot and VecMAXPY operations. This profiling result is shown in Figure 7.6. If we compare the total execution time for each sparse matrix type (Figure 7.3) with the corresponding combined operation time (Figure 7.6), both show the same trend, i.e. the improvement in these operations is directly reflected in the total execution time for all sparse matrix types.

Figure 7.5: Execution time of SpMV with different sparse matrix formats
Figure 7.6: Execution time of SpMV+VecMDot+VecMAXPY with different sparse matrix formats on GPU
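For reference, one classical Gram-Schmidt step expressed in terms of these two routines looks roughly as follows (a sketch with illustrative names, not PETSc's GMRES code; error checking omitted):

/* orthogonalise w against the k existing Krylov basis vectors V[0..k-1] */
PetscScalar dots[k];
PetscInt    i;
VecMDot(w, k, V, dots);            /* dots[i] = V[i]^H . w  (multiple dot products) */
for (i = 0; i < k; i++) dots[i] = -dots[i];
VecMAXPY(w, k, dots, V);           /* w = w - sum_i (V[i]^H . w) V[i]               */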

Performance comparison with quad-core Intel Xeon

We have compared the performance of a single NVIDIA GPU with a single compute node of the HECToR GPU system. Figure 7.8 shows the speedup achieved with the CSR and DIA matrix formats on the GPU compared to the quad-core Intel Xeon (utilising all four cores). The original PETSc GPU implementation runs 4.89x faster with the CSR format and the new implementation runs 6.28x faster with the DIA matrix format. On the CPU we are able to achieve about 20% of the peak performance of the compute node, but on the GPU the utilisation is still low, reaching only about 2.5% of peak. This is shown in Figure 7.7 below:

Figure 7.8: Achieved speedup compared to Intel Xeon quad-core
Figure 7.7: Performance on CPU with CSR, GPU with CSR and GPU with DIA

We discuss the performance and scalability of the PETSc library further in Section 7.4, where we compare the performance with the HECToR (Phase 2b) system.

Note: from the next section onwards, the timings of the MatMult, VecMDot and VecMAXPY routines are together reported as SpMV.

7.2.2 Semi-Structured Matrices

Matrices from real life applications do not always have regular non zero sparsity patterns like the previous example. So, to measure the performance of our PETSc GPU implementation, it is necessary to choose a matrix representing a class of scientific applications. We consider one such matrix from a 3D PDE-constrained optimisation problem [46], of size 3,542,400 x 3,542,400 and with 95,117,592 non zero elements. This is a relatively large matrix and gives an idea of the performance of the PETSc GPU implementation for real life applications.

Figures 7.9 and 7.10 show the behaviour of the different storage formats for this large sparse matrix: the total execution time and the performance for the CSR, ELL and HYB formats. It is not possible to convert the matrix into DIA format, as this matrix does not have a banded or diagonal pattern. We have also not considered the COO format, as its performance is lower than DIA, ELL and HYB for most of the matrices discussed in Section 4.3. Interestingly, even though this is a large sparse matrix, the ELL and HYB formats show large performance improvements. The total execution time is reduced by 32% compared to the CSR format, and the Hybrid format shows a similar improvement in execution time. Figure 7.10 shows that the performance (i.e. flops) for the ELL matrix on the GPU is improved by 40% compared to the CSR matrix.

Figure 7.9: Execution time of different sparse matrix formats for a semi-structured matrix on GPU
Figure 7.10: Performance for semi-structured matrices

The main reason for this performance improvement is the distribution of the non zero elements of the matrix. Though this matrix does not have a regular sparsity pattern like the diagonal matrices, there is no large variation in the number of non zero elements per row. When we convert the CSR matrix to HYB, the CUSP library stores 100% of the non zero elements of this matrix in the ELL part itself. The ELL sparse matrix-vector operation benefits from contiguous memory access and memory coalescing, as discussed in the previous section. But as this matrix is very large, the CSR sparse matrix-vector operation becomes more costly due to the high cost of non-coalesced memory access and thread divergence. The execution time of SpMV for CSR, ELL and HYB is shown in Figure 7.11 below:

Figure 7.11: Sparse matrix-vector execution time for different sparse matrix formats

7.2.3 Unstructured Matrices

Some matrices have entirely unstructured sparsity patterns and cannot be stored in the CUSP DIA or ELL formats. One such example is the matrix shown in Figure 7.12 below:

Figure 7.12: Unstructured matrix of size x with 36,816,170 non zero elements

As we can see in the figure above, this matrix does not have a diagonal or banded structure. Additionally, the number of non-zero elements per row varies greatly, so it is not possible to store this matrix in the DIA or ELL format. The only possible storage format with CUSP is HYB. For this matrix there are a few large dense blocks on the diagonal, for which a few rows contain a large number of non-zero elements. The CUSP Hybrid format is able to store this matrix with 82% of the rows in the ELL format and the remaining 18% in the COO format. As a CUSP HYB matrix is stored internally as an ELL and a COO matrix, SpMV is implemented as two independent operations: an ELL SpMV followed by a COO SpMV. As discussed in Section 4.3, the performance of a HYB matrix decreases when more elements need to be stored in the COO format. For this matrix, 18% of the matrix is stored in the COO format, so its performance is lower than that of matrices stored entirely in the ELL format. Figure 7.13 shows the total execution time of the application. The HYB format shows a 16% improvement in the total execution time. In the case of HYB SpMV, although the ELL part performs well, the COO part is slower, and hence there is no major improvement in either the execution time or the Gflops achieved.

Figure 7.13: Total execution time on GPU with CSR and HYB format
Figure 7.14: Performance of CSR and HYB on the GPU
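The ELL/COO split reported above can be inspected directly after the conversion. The sketch below assumes the .ell and .coo members and the num_entries field exposed by CUSP's hyb_matrix; the file name is hypothetical.

// Sketch: inspect how CUSP's HYB conversion splits the entries between
// its ELL and COO parts.
#include <cusp/csr_matrix.h>
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cstdio>

int main(void)
{
    cusp::csr_matrix<int, double, cusp::host_memory> A;
    cusp::io::read_matrix_market_file(A, "unstructured.mtx");

    // The conversion decides, per row, how many entries fit into the padded
    // ELL part; the overflow entries go to the COO part.
    cusp::hyb_matrix<int, double, cusp::device_memory> H(A);

    double total = static_cast<double>(A.num_entries);
    std::printf("ELL part: %5.1f%% of entries\n", 100.0 * H.ell.num_entries / total);
    std::printf("COO part: %5.1f%% of entries\n", 100.0 * H.coo.num_entries / total);
    return 0;
}

An SpMV on such a matrix then performs the ELL multiply and the COO multiply back to back, which is why a large COO fraction drags down the overall rate.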

Figure 7.15 shows the execution time of SpMV on the CPU with CSR, on the GPU with CSR and on the GPU with HYB. The HYB format shows a 20% improvement in the execution time of SpMV compared to the CSR format.

Figure 7.15: Execution time of SpMV on CPU (CSR), GPU (CSR) and GPU (HYB)

HECToR GPU Interconnect Issues: The HECToR GPU cluster is set up with an Infiniband interconnect. However, during the performance analysis on the multi-GPU system, we found that the cluster uses the Ethernet network for communication rather than the Infiniband network. This may be due to an improper configuration of MPI or the interconnect on this system. The issue was not resolved by the end of the project. We will discuss the Infiniband interconnect in more detail in the next section.

7.3 Multi-GPU Performance

Considering the future trend of GPU-based HPC systems, it is important to consider the performance of PETSc on a GPU cluster. For large problems, scalability to large numbers of GPUs is important. The results discussed in the previous section for a single-GPU system do not give an idea of the performance on a GPU cluster, where the communication cost is the deciding factor. In this section we will discuss the performance of PETSc on a multi-GPU system. We will consider the scalability issues, specifically communication cost, bandwidth and latency. In order to compare the performance of PETSc on a multi-GPU system, we have used the matrix from the landscape ecology problem [47], of size 999,999 x 999,999 with 4,995,991 non-zero elements. This matrix has a diagonal structure and can be stored in CUSP formats such as DIA, ELL, HYB and COO. Figure 7.16 shows the total execution time of the benchmark code (GMRES) on the HECToR GPU cluster with the CSR and DIA formats. The current CSR implementation in PETSc does not scale very well: the total execution time decreases very slowly. For the DIA matrix format, up to four GPUs the execution time decreases linearly and is almost two times lower than with the CSR format. However, for six GPUs the execution time jumps to a considerably higher value. This is shown in Figure 7.16 and we will discuss it next.

Figure 7.16: Performance on the HECToR GPU cluster with CSR and DIA matrix formats
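For context, the benchmark amounts to a standard PETSc KSP solve with GMRES, with the GPU matrix and vector classes selected at run time. The following driver is a minimal sketch rather than the thesis benchmark code: the file name matrix.dat is a placeholder, the two-argument KSPSetOperators assumes PETSc 3.5 or later, and the aijcusp/cusp type names follow the PETSc-dev CUSP support of that period, so treat them as assumptions for other versions.

/* Minimal GMRES benchmark sketch, e.g. run as
 *   mpiexec -n 4 ./bench -ksp_type gmres -mat_type aijcusp -vec_type cusp
 */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscViewer viewer;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Load the matrix from a PETSc binary file (name is a placeholder). */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_READ, &viewer);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetFromOptions(A);                 /* honours -mat_type */
  MatLoad(A, viewer);
  PetscViewerDestroy(&viewer);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);           /* two-argument form, PETSc >= 3.5 */
  KSPSetType(ksp, KSPGMRES);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  MatDestroy(&A);
  VecDestroy(&x);
  VecDestroy(&b);
  PetscFinalize();
  return 0;
}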

A single compute node on the HECToR GPU system has four NVIDIA GPUs. When we increase the number of GPUs up to four, the MPI communication remains within a node. When an application uses more than four GPUs, the communication is supposed to go through the Infiniband interconnect, but due to configuration issues the HECToR GPU cluster is using Ethernet for communication. The Ethernet network has very high latency (a few milliseconds) compared to the PCIe bus. Thus the communication cost is higher, and hence the graph shows a sudden jump in the execution time for six GPUs. Interestingly, despite the high communication cost between multiple GPUs, the execution time for eight GPUs is half of the execution time with six GPUs. When we want to use six GPUs, we have to allocate two compute nodes: four GPUs from one node and two from another. In this case, processes in the first compute node can communicate faster with each other than with the processes in the second node. So with six GPUs, we pay the high cost of communication for only 50% extra computational resources (compared to four GPUs). But for eight GPUs, we allocate all GPUs from two nodes, so although there is a high communication cost, we have twice the computational resources. Communication and computation are then better balanced, and hence the graph shows a high execution time for six GPUs but not for eight GPUs. The graph for the CSR matrix on the GPU does not show the sudden jump for six GPUs. The main performance bottleneck for a CSR matrix is memory coalescing and thread divergence: the CSR matrix format pays a higher cost for non-coalesced memory access and thread divergence than for communication. This hides the communication cost, and hence there is no sudden jump in the execution time for the CSR matrix. The Gflops achieved with the CSR and DIA matrix formats for different numbers of GPUs is shown in Figure 7.17. The reason for the sudden fall in performance of the DIA format for six GPUs is discussed above.

Figure 7.17: Performance with CSR and DIA matrix formats with different numbers of GPUs

SpMV Performance on the GPU Cluster

SpMV is the key operation for the performance of parallel PETSc applications running on the GPU cluster, because it requires communication of vector entries between all GPUs. In this section we will discuss it in detail.

How SpMV is implemented for multi-GPU: For the performance analysis, it is important to understand the implementation of SpMV in the PETSc GPU implementation. During every iteration of the solver, updated entries of the vector need to be communicated across all processes. For this, PETSc creates a VecScatter object which manages all communication of vector entries. The VecScatter object stores two sets of indices: one which tells where the vector entries are coming from, and a second which tells at what locations the updated entries will be copied into the local vector. The SpMV operation is completed in the four steps below (a code sketch of these steps follows after Figure 7.18):

1. VecScatterBegin: copy the required vector entries from the GPU to the CPU and post the MPI non-blocking send and receive operations.
2. MatMult: multiply the diagonal components of the matrix with the vector (for this operation, the local copy of the vector is already up to date).
3. VecScatterEnd: wait for the updated vector entries from all other processes and copy them back to the GPU.
4. MatMultAdd: multiply the off-diagonal components of the matrix with the updated vector entries from step three.

The major bottlenecks for parallel SpMV are copying vector entries from the GPU to the CPU, the CPU-to-CPU communication, and copying the updated entries back from the CPU to the GPU. Figure 7.18 shows the execution time of the matrix-vector operation. The DIA format shows a large improvement compared to the CSR format, but, as discussed in the previous section, for six GPUs SpMV takes considerably longer due to the inter-GPU communication cost.

Figure 7.18: Execution time for SpMV using CSR and DIA matrix formats on the HECToR GPU cluster
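The four steps map directly onto PETSc's public API for the distributed matrix-vector product. The sketch below is a simplified illustration of that pattern, not PETSc's internal code; Ad (diagonal components), Ao (off-diagonal components), lvec (gathered ghost entries) and scatter are placeholder names for objects that PETSc manages internally.

/* Simplified sketch of the four-step parallel SpMV pattern (not PETSc's
 * internal implementation). */
#include <petscmat.h>

PetscErrorCode SpMV_Sketch(Mat Ad, Mat Ao, VecScatter scatter,
                           Vec x, Vec lvec, Vec y)
{
  /* 1. Start gathering remote vector entries: for a GPU vector this is where
   *    the required entries are copied from GPU to CPU and the non-blocking
   *    MPI sends/receives are posted. */
  VecScatterBegin(scatter, x, lvec, INSERT_VALUES, SCATTER_FORWARD);

  /* 2. Multiply the diagonal components with the locally owned entries;
   *    no communication is needed for this part. */
  MatMult(Ad, x, y);

  /* 3. Wait for the remote entries and (for a GPU vector) copy them back
   *    to the device. */
  VecScatterEnd(scatter, x, lvec, INSERT_VALUES, SCATTER_FORWARD);

  /* 4. Multiply the off-diagonal components with the gathered entries and
   *    accumulate the result into y. */
  MatMultAdd(Ao, lvec, y, y);
  return 0;
}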

7.4 Comparing Multi-GPU Performance with HECToR

Today most HPC applications are still running on non-GPU computing systems like HECToR (Phase 2b). To get an idea of how these applications (specifically PETSc) will perform on the next generation of GPU-based HPC systems, we have compared the performance of the PETSc application running on the HECToR service and on the HECToR GPU cluster. Note that it is difficult to compare the performance exactly, as the two systems have different architectures, network interconnects and sets of optimised libraries, but the comparison gives a general idea of the performance on both systems. Figure 7.19 shows the performance comparison between the CSR matrix format on the GPU, the DIA matrix format on the GPU and the CSR matrix format on HECToR (Phase 2b). Note that for comparing the HECToR (Phase 2b) system with the HECToR GPU cluster, we have compared the performance of twelve cores with a single GPU. HECToR shows linear scaling up to 96 cores (4 compute nodes, using all 24 cores). The execution time for the CSR format on the GPUs is high compared to HECToR and does not scale very well. Interestingly, the DIA format on the GPU shows better performance than HECToR up to four GPUs, but for more than four GPUs the execution time increases considerably compared to HECToR. This is shown in Figure 7.19 below.

Figure 7.19: Performance comparison between the HECToR GPU system (with CSR and DIA) and HECToR (Phase 2b)

The major bottleneck for the scalability of the DIA matrix format on the HECToR GPU system is the interconnect and its latency. Though this system is set up with Infiniband interconnects, it is currently using 10Gig Ethernet for network communication due to configuration issues. The Infiniband network has an MPI latency of 5-6 µs, while that of 10Gig Ethernet is µs. (These are theoretical values and also depend on the vendor.) On the other hand, HECToR (a Cray XE6) uses the Cray Gemini interconnect, which is designed to scale beyond one million cores. One Gemini router chip is used per two XE nodes, and ten network links are used to implement a 3D torus topology of processors. This interconnect has a much shorter MPI latency of µs [48]. So the total cost of communication is very low compared to the HECToR GPU system, and hence HECToR is more scalable.

7.5 CUSP Matrix Conversion Cost

In our PETSc GPU implementation, we have used CUSP's built-in functionality to convert a CSR matrix into other formats like COO, DIA, ELL or HYB. There is a fixed cost associated with this conversion. For our example of a linear system solved with the KSP solvers, we have to perform this conversion only once, as the source matrix remains unchanged. This may not be true for other problems or for non-linear solvers, in which case we may have to repeat the conversion after a fixed number of iterations. So we have measured the cost of matrix format conversion using the CUSP library. Figure 7.20 shows the cost of such a conversion from a CSR matrix on the host to other formats like ELL, DIA, HYB or COO on the device. The sparse matrix id is a unique id given to sparse matrices in the UFL sparse matrix collection; the sizes of these matrices and their numbers of non-zero elements are shown in Figure 7.21.

Figure 7.20: Cost of converting a CSR matrix on the host to ELL/DIA/COO/HYB on the device
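These conversion timings can be reproduced by placing CUDA events around CUSP's convert routine; the sketch below is not the thesis benchmark harness, and the .mtx file name is hypothetical.

// Sketch: time the host-CSR -> device-ELL conversion with CUDA events.
#include <cusp/csr_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/convert.h>
#include <cusp/io/matrix_market.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    cusp::csr_matrix<int, double, cusp::host_memory> A;
    cusp::io::read_matrix_market_file(A, "matrix.mtx");

    cusp::ell_matrix<int, double, cusp::device_memory> B;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cusp::convert(A, B);          // analyse structure, allocate, copy to the GPU
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("CSR (host) -> ELL (device) conversion: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}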

Figure 7.21: Matrix id from the UFL sparse matrix collection, number of rows, number of columns and number of non-zero elements

This conversion cost consists of two parts: the computational cost of deciding the new data structure and the memory transfer cost of moving the matrix elements. For example, to convert a CSR host matrix to an ELL matrix on the device, we first have to calculate the maximum number of non-zeros in any row (sketched below), then create the data structure and allocate the memory, and only then can we copy the matrix elements. Figure 7.20 shows that the cost is directly proportional to the size of the matrix and the number of non-zero elements. This is expected, because all of those matrix elements must be manipulated, and the memory copy cost increases with the number of non-zero elements. Converting CSR to ELL or HYB is computationally more expensive and hence takes more time. CSR-to-CSR conversion takes very little time, as it only requires a memory copy.
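As a concrete illustration of the first part of that cost, the ELL row width (the maximum number of non-zeros in any row) can be read off the CSR row offsets with a single pass; a minimal host-side sketch follows, with placeholder array and function names.

// Sketch: compute the ELL row width from CSR row offsets on the host.
// row_offsets has num_rows + 1 entries; row i owns the index range
// [row_offsets[i], row_offsets[i+1]).
#include <vector>
#include <algorithm>
#include <cstddef>

std::size_t ell_row_width(const std::vector<int>& row_offsets)
{
    std::size_t max_nnz = 0;
    for (std::size_t i = 0; i + 1 < row_offsets.size(); ++i)
        max_nnz = std::max<std::size_t>(max_nnz, row_offsets[i + 1] - row_offsets[i]);

    // The ELL arrays are then allocated as num_rows * max_nnz (padded), which
    // is why matrices with a few very long rows are better served by HYB.
    return max_nnz;
}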

Chapter 8 Discussion

In this chapter we will discuss the challenges of multi-GPU parallelisation that we encountered during our PETSc GPU implementation and the performance analysis on the HECToR GPU system. As a result of our initial implementation, we came across a range of possible development work on the PETSc GPU implementation, which we outline in the future work section.

8.1 Challenges for multi-GPU parallelisation

The two major challenges when parallelising HPC applications on GPU clusters are CPU-to-GPU memory transfer and GPU-to-GPU communication. We will discuss these challenges in the context of PETSc multi-GPU applications.

CPU-GPU and GPU-GDRAM Memory Transfer

Although GPUs are attractive for HPC applications, memory transfer between CPU and GPU and between GPU and GDRAM can be a major bottleneck for memory-bound applications. Figure 8.1 shows a schematic representation of a motherboard (generic, pre-Sandy-Bridge) with the different memory buses and their bandwidths. Note that in current architectures like Sandy Bridge, the functionality of the North and South Bridges is combined within the processor chip, but for this discussion we consider the more general case, as we are only interested in the memory bandwidth and latency between CPU-GPU and GPU-GDRAM. The theoretical peak bandwidth of PCI Express (Gen 2.0) is 8 GB/s, and that between the GPU and its GDRAM is 144 GB/s (Tesla). The data copy between CPU and GPU is a major bottleneck, especially for small data sizes; e.g. for the GeForce GTX 285 (an NVIDIA desktop GPU), the latency of a data transfer from CPU to GPU memory is µs for data smaller than 10 KB. For our PETSc applications with the preconditioning stage performed on the CPU, in every iteration we have to copy vectors from the GPU to the CPU and then back to the GPU. This becomes a major bottleneck for the performance of PETSc applications. To improve the performance, we have to further investigate various optimisation techniques, including zero-copy transfers, pinned memory, CUDA streams and so on.
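Of the techniques listed above, pinned (page-locked) host memory and asynchronous copies on CUDA streams are the most directly applicable to the vector traffic described here. The following is only a minimal CUDA sketch of that combination; buffer and function names are placeholders.

// Sketch: pinned host memory and an asynchronous host-to-device copy on a
// CUDA stream, two of the transfer optimisations mentioned above.
#include <cuda_runtime.h>
#include <stddef.h>

void copy_vector_async(double* d_vec, size_t n)
{
    double*      h_pinned = NULL;
    cudaStream_t stream;

    // Page-locked host memory gives higher PCIe bandwidth than pageable
    // memory and is required for truly asynchronous copies.
    cudaMallocHost((void**)&h_pinned, n * sizeof(double));
    cudaStreamCreate(&stream);

    // ... fill h_pinned on the CPU (e.g. the preconditioned vector) ...

    // The copy is queued on the stream; the CPU can continue working and a
    // kernel launched on the same stream will wait for the data to arrive.
    cudaMemcpyAsync(d_vec, h_pinned, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
}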

Figure 8.1: Overall schematic representation of a motherboard showing the bandwidths of the different memory buses (pre-Sandy-Bridge architecture)

The same is the case with the GPU GDRAM. Though we have 144 GB/s of bandwidth (for Tesla), global memory access is very expensive and requires more than cycles. So for large sparse matrices with irregular sparsity patterns, global memory access becomes very expensive due to latency and memory non-coalescing issues. We have discussed this in Section

GPU-GPU Communication

As we have previously seen, PETSc GPU applications do not scale beyond four GPUs on the HECToR GPU system. This is because of the high communication cost across GPU nodes. The HECToR GPU system has a quad-band Infiniband network with a switched fabric topology. The Infiniband network has an end-to-end MPI latency of up to 5 µs (for the QLogic interconnect). A schematic diagram of this system with the Infiniband network is shown in Figure 8.2.
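Part of this cost arises because, without GPUDirect-style support, every exchange between GPUs on different nodes is staged through the host CPUs: device to host, host to host over the network, and host back to device. The following sketch illustrates that staged pattern for a single neighbouring rank; the buffer names and the single-neighbour layout are placeholders, not PETSc code.

// Sketch of the staged GPU -> CPU -> network -> CPU -> GPU exchange paid in
// every solver iteration without GPUDirect (not PETSc code).
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(double* d_send, double* d_recv,
                   double* h_send, double* h_recv,   // host staging buffers
                   int count, int neighbour, MPI_Comm comm)
{
    MPI_Request reqs[2];

    // 1. Stage the outgoing entries from device to host.
    cudaMemcpy(h_send, d_send, count * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. Exchange the staged buffers over MPI (here with a single neighbour).
    MPI_Irecv(h_recv, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
    MPI_Isend(h_send, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // 3. Copy the received entries back to the device.
    cudaMemcpy(d_recv, h_recv, count * sizeof(double), cudaMemcpyHostToDevice);
}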

Figure 8.2: HECToR GPU: Infiniband network with switched fabric topology (schematic layout)

At a later stage of the project, we found that the MPI latency of the HECToR GPU interconnect is high and that inter-GPU communication is very expensive. A possible reason could be an improper MPI setup on this system, which leads to communication through Ethernet rather than the Infiniband network. This issue was not resolved by the end of the project, and hence we were not able to fully test the scalability on a large number of GPUs. For PETSc applications, during every iteration of the solver we have to update the vector entries. For this, vector entries need to be copied from the GPU to the CPU, then through the network, and then back from the CPU to the GPU. This is very expensive and takes more time than the actual computations performed on the GPUs. More work is needed to improve the performance of communication, possibly using technologies like GPUDirect, which allows GPU-to-GPU communication without the involvement of the host CPUs.

8.2 Future Work

The results discussed in Chapter 7 show large performance improvements for structured matrices. The DIA and ELL matrix formats show a factor-of-two performance improvement compared to the CSR matrix format available in the current PETSc GPU implementation.
