Sparse Convex Optimization on GPUs


Sparse Convex Optimization on GPUs

by Marco Maggioni
B.A. (Politecnico di Milano) 2006
M.S. (University of Illinois at Chicago) 2008
M.S. (Politecnico di Milano) 2010
M.S. (University of Illinois at Chicago) 2012

Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Chicago, 2015

Chicago, Illinois

Defense Committee:
Tanya Berger-Wolf, Chair and Advisor
Ajay Kshemkalyani
Jie Liang
Ashfaq Khokhar (Illinois Institute of Technology)
Marco Domenico Santambrogio (Politecnico di Milano)

To my Kenyan Principessa. I have found you during my journey, and now you are the most important part of it.

ACKNOWLEDGMENTS

I would like to thank Prof. Tanya Berger-Wolf. She has been the best PhD advisor I could have ever wished for, giving me the freedom to be creative in my research. I also have to thank her for the sequence of events that brought me to Kenya. I wish that my aunt Leila were still here today to cheer for me. She loved me like a son and made me feel at home here in Chicago. She has also been a great life teacher, and I will always carry her lessons with me. Another big thank-you is due to my entire family back in Italy. They have always supported me, making my PhD journey safe. Finally, I would like to thank all the members of my PhD committee and, in particular, Marco Domenico Santambrogio. Without his encouragement, my academic career would not have led me toward studying in the US.

M.M.

TABLE OF CONTENTS

1 INTRODUCTION
   A Brief Historical Review
   Contributions
   Research Workflow
   Outline
2 BACKGROUND
   Convex Optimization
      Linear Programming
      Quadratic Programming
      Integer Linear Programming
   Interior Point Method
   Primal-Dual Central Path
      Multiple Centrality Correctors
      Practical Issues
   GPU Computing
      GPU Architecture
      CUDA
3 SPARSE LINEAR ALGEBRA ON GPUS
   Sparse Matrix-Vector Multiplication
      Matrix Characterization
      Related Work
   Basics
   Improving ELL with Warp Granularity
   Unrolling Nonzeros for Memory Hierarchy Efficiency
   Exploiting Dense Substructures with Blocking
   Lossless Compression for Indexing Data
   Improving Adaptive ELL
   Putting It All Together
   Online Auto-Tuning
   Comparison with the State-of-the-Art
4 SOLVING THE NEWTON DIRECTION
   Normal Equation
   Sparse Cholesky Factorization
   Conjugate Gradient
   Regularization and Preconditioning
   Accelerating CG with AdELL
5 GPU-BASED TECHNIQUES FOR IPM
   Building the Normal Equation
   Matrix Free Approach
   Adaptive IPMs
   An Optimized Hybrid CPU-GPU Implementation
   Performance Evaluation
6 EXTENSION TO ILP
   Combinatorial Techniques for ILP
   Data Structures for Branch & Cut
7 CONCLUSION
CITED LITERATURE

LIST OF TABLES

I     Description of the regular benchmark suite
II    Description of the irregular benchmark suite
III   The distribution of nonzeros per row in the regular matrices
IV    The distribution of nonzeros per row in the irregular matrices
V     Efficiency of ELL versus WELL for the regular benchmarks
VI    Efficiency of ELL versus WELL for the irregular benchmarks
VII   Performance results for the nonzero unrolling with WELL on the regular benchmarks
VIII  Performance results for the nonzero unrolling with WELL on the irregular benchmarks
IX    SMX occupancy of SpMV kernels for WELL
X     Incremental performance for the blocking technique on the regular benchmarks
XI    Incremental performance for the blocking technique on the irregular benchmarks
XII   Incremental performance for delta compression on the regular benchmarks
XIII  Incremental performance for delta compression on the irregular benchmarks
XIV   Incremental performance for the adaptivity technique on the irregular benchmarks
XV    Incremental performance for the adaptivity technique on the regular benchmarks
XVI   Preprocessing time for CSR and AdELL
XVII  FEM/Harbor parameter tuning with line search
XVIII Memory footprint
XIX   Symmetric positive definite systems
XX    Normalized CG performance
XXI   LP test set
XXII  SpMM performance

LIST OF FIGURES

1  Research workflow
2  CPU-GPU interconnection architecture
3  Kepler SMX architecture
4  Sparse matrix A [a] represented as COO [b], CSR [c] and ELL [d]
5  ELL sparse format [a] and its column-major memory layout [b]
6  An arbitrary matrix represented as ELL [a] and WELL [b]
7  Processing nonzeros without unrolling [a] and with 2x unrolling [b]
8  Sparse matrix [a], WELL [b], sparse blocked matrix [c] and WELL with blocking [d]
9  Sparse blocked matrix [a] and its interleaved memory layout [b]
10 WELL [a] and WELL with delta-based index compression [b]
11 An example of the nonzeros distribution unfavorable to row-based parallelization (WELL [a] vs AdELL [b])
12 Online auto-tuning CPU/GPU timeline
13 AdELL+ single-precision performance on the regular benchmarks
14 AdELL+ single-precision performance on the irregular benchmarks
15 AdELL+ double-precision performance on the regular benchmarks
16 AdELL+ double-precision performance on the irregular benchmarks
17 Dense [a] and supernodal [b] partial Cholesky factorization
18 Original [a] and scaled [b] spaces
19 CG profiling
20 SpMM [a] and its implementation as SpMV [b]
21 IPM profiling
22 Comparison with CPLEX
23 Incremental SpMV overhead
24 Extending SpMM to ILP

LIST OF ABBREVIATIONS

ACSR    Adaptive CSR
AdELL   Adaptive ELL
ArgCSR  Adaptive row-grouped CSR
BCCOO   Block Compressed COO
BCSR    Blocked CSR
BELL    Blocked ELL
BRC     Blocked Row-Column
CMRS    Compressed Multi-Row Storage
COO     COOrdinate format
CPU     Central Processing Unit
CSC     Compressed Sparse Column
CSR     Compressed Sparse Row
DAG     Direct Acyclic Graph
DFS     Depth First Search
DIA     DIAgonal (sparse format)
DMA     Direct Memory Access
EA      Ellipsoid Algorithm
ELL     ELLiptic package (sparse format)
FPGA    Field Programmable Gate Array
GPU     Graphic Processing Unit
HYB     HYBrid (sparse format)
IC      Incomplete Cholesky
ILP     Integer Linear Programming (optimization)
ILP     Instruction Level Parallelism (architecture)
KKT     Karush-Kuhn-Tucker
LP      Linear Programming
MILP    Mixed Integer Linear Programming
NP      Nondeterministic Polynomial
PCG     Preconditioned Conjugate Gradient
PCI     Peripheral Component Interconnect
QP      Quadratic Programming
RgCSR   Row-grouped CSR
SDP     SemiDefinite Programming
SELL    Sliced ELL
SIC     Segmented Interleave Combination
SIMD    Single Instruction Multiple Data
SIMT    Single Instruction Multiple Thread
SM      Simplex Method
SMX     Streaming Multiprocessor (Kepler)
SpMM    Sparse Matrix-Matrix multiplication
SpMV    Sparse Matrix-Vector multiplication
WELL    Warp-grained ELL

SUMMARY

Convex optimization is a fundamental mathematical framework used for general problem solving. The computational time taken to optimize problems formulated as Linear Programming, Integer Linear Programming or Quadratic Programming has an immediate impact on countless application fields, and it is critical to determining which problems we will be able to solve in the future. Since the very beginning, the research community has been investigating new algorithmic and numerical techniques to speed up convex optimization. Recently, the focus has included parallel computer architectures and their ability to perform high-throughput computation. This dissertation continues in the same research direction, developing novel computational techniques tailored for modern GPUs. We focus on problems with sparse structure, which are, arguably, the most challenging to solve on throughput-oriented many-core architectures naturally well-suited for dense computations.

As an original contribution, we combine the leading ideas in SpMV optimization on GPUs into an advanced sparse format known as AdELL+. We also speed up the class of optimization algorithms known as Interior Point Methods with GPU-based adaptive strategies to select between Cholesky factorization and Conjugate Gradient. Last, we design an incremental matrix data structure that provides the foundation for implementing branch-and-cut ILP solvers.

The goal of this dissertation is to bridge the gap between GPU computing and sparse convex optimization. This will provide a potential foundation to build a new generation of GPU-based optimization solvers, leading to a broad and long-lasting impact beyond the specific results achieved here.

Supported by solid experimental evidence, we already encourage practitioners in industry and academia to consider our GPU-based computational techniques as efficient building blocks for their convex optimization code.

CHAPTER 1
INTRODUCTION

Optimization is a fundamental process that pervades all spheres of life. As humans, we make everyday decisions to optimize a goal, such as selecting a route to minimize the time to reach a destination or looking for the best restaurant within a certain budget. Nature itself follows optimization principles to regulate evolution (survival of the fittest) or protein folding (energy minimization). After World War II, optimization theory and algorithmic techniques were developed as an important branch of computational mathematics. In the literature, mathematical optimization is defined as the minimization or maximization of a function subject to a set of constraints on its variables [1]. When the function is convex and the constraints define a convex set, we can restrict ourselves to a subfield of optimization known as convex optimization. Roughly speaking, the convexity property makes the problem easier because any local optimum must be a global optimum. Yet, this mathematical framework is still powerful and general enough to have applications in a wide range of disciplines, such as combinatorial optimization, operations research, control theory, structural optimization, economics, computational biology and several other engineering fields.

In this dissertation, we cover three fundamental classes of convex optimization problems. The first class is known as Linear Programming (LP) and consists of a linear objective function subject to linear constraints. LP is the most natural mechanism to formulate many real-world problems, and it has been proven effective even on problems that are actually nonlinear [2].

The second class of convex optimization problems covered in this research is known as Quadratic Programming (QP) [3] and is defined by a quadratic function with linear constraints on real variables. The importance of QP goes beyond the ability to solve quadratic problems that naturally map to it. In fact, an iterative technique known as sequential quadratic programming [4] uses the solution of several QP subproblems to solve nonlinear optimization problems. The last important class covered here is known as Integer Linear Programming (ILP) [5], which is basically LP with variables restricted to be integers. ILP has practical applications (such as fleet assignment and crew scheduling in the airline industry) as well as great theoretical importance. In fact, ILP is known to be an NP-Hard problem, and the decision version of its 0-1 special case has been listed as one of Karp's 21 NP-complete problems [6].

The computational time taken to solve convex optimization has an immediate impact on countless application fields, and it is critical to determining which problems we will be able to solve in the future. Since the very beginning, the research community has been investigating new algorithmic and numerical techniques to speed up convex optimization. Recently, the focus has included parallel computer architectures and their ability to perform high-throughput computation. Our research continues in this direction, developing novel computational techniques tailored for modern Graphic Processing Units (GPUs). We focus on problems with sparse structure, which are, arguably, the most challenging to solve on throughput-oriented many-core architectures naturally well-suited for dense computations.

1.1 A Brief Historical Review

During the last three decades, the convex optimization field has had an important development due to the introduction of Interior Point Methods (IPMs) [7]. This development has rapidly led to the design of new and efficient algorithms that have, for the first time in fifty years, offered a valid alternative to Dantzig's Simplex Method (SM) [8], especially for large LP problems. SM relies on the idea of seeking the optimal solution by walking from vertex to vertex along the edges of the feasible polytope, each time following an edge (direction) with a favorable gradient. Over the years, many progressively better variants of this algorithm have been proposed to reduce the computational cost [9] [10] [11]. Although the state-of-the-art of SM has achieved successful practical performance [12], its worst-case time complexity has been proven exponential using an artificial problem known as the Klee-Minty cube [13].

From a computational point of view, SM heavily relies on algebraic operations applied to a tableau. On one hand, sparse problems can be efficiently represented and solved by using sparse data structures and operations [14]. On the other hand, dense problems provide an excellent degree of fine-grain parallelism and, hence, an opportunity for parallel implementation [15-17]. Recently, the research community has explored the use of massively parallel architectures such as throughput-oriented GPUs [18-22] and data-flow FPGAs [23], achieving good speedups for reasonably large and dense LP problems. Unfortunately, sparse problems pose a computational challenge for those SM implementations. First, the dense representation of a sparse tableau is often not feasible due to memory constraints. Even when feasible, the computation is intrinsically inefficient due to the large amount of operations wasted on zero entries.

Second, SM does not preserve sparsity after each iteration (i.e. the problem becomes denser and denser). This is an issue for GPUs and FPGAs since they cannot efficiently handle dynamic data structures that change their allocation in memory.

The need for a faster algorithm for LP stimulated the research community and led to the blossoming of IPMs, a class of optimization algorithms that seek the solution by following a trajectory of interior feasible points, avoiding the combinatorial complexity of vertex-following algorithms. IPMs were extensively studied in the 1960s for solving nonlinear optimization problems [24], but they were later abandoned due to their expensive computational steps and discouraging experimental results. The first polynomial IPM algorithm to solve LP problems was the Ellipsoid Algorithm (EA) in 1979 [25]. This iterative method makes use of ellipsoids whose volumes decrease at a constant rate and achieves an exact solution in $O(n^2 L)$, where n is the number of variables in the problem and L is the total size of the problem's input data (including variables and constraints). EA never became very popular from a practical point of view due to its limited performance compared to SM. However, EA represented a theoretical breakthrough because it proved that LP is solvable in polynomial time. A few years later, in 1984, Karmarkar [26] proposed a new polynomial algorithm that held great promise for performing well in practice, achieving an exact solution in $O(nL)$. The appearance of this algorithm started a new explosive development in the area of IPMs, leading a few years later to an improved worst-case complexity $O(\sqrt{n}\,L)$ [27]. Gill et al. [28] showed the relation between Karmarkar's algorithm and the theoretical foundations of nonlinear programming (e.g. barrier and Newton-type methods), laying out the categorization of potential-reduction algorithms [29] and path-following algorithms [30].

After decades of continuous evolution, today's IPM implementations are efficient and robust enough to solve LP problems substantially faster than the state-of-the-art SM code [31]. The major work in a single iteration of any IPM consists of solving a system of linear equations, the so-called Newton equation system. As a consequence, the efficiency of the underlying linear algebra kernels has a key role in achieving high computational performance. Parallel IPM implementations based on multiple CPUs [32, 33] have been successful due to the advances in sparse Cholesky factorization. FPGA-based implementations available in the literature are instead focusing on custom IPM optimization blocks as part of real-time controllers (rather than building a general solver) [34, 35].

GPUs are naturally well-suited for implementing IPMs tailored for dense problems. Jung and O'Leary [36] proposed a pioneering CPU-GPU implementation where the GPU was used for computationally intensive tasks. Due to the early stage of GPU programming at the time, the authors were forced to directly program GPU shaders (rather than using a general-purpose GPU programming abstraction such as CUDA [37] or OpenCL [38]). Not surprisingly, their GPU solver achieved a modest speedup only on large dense synthetic problems. A more up-to-date GPU-based IPM implementation for dense problems has been recently proposed by Dikavar [39]. Despite a claimed 22x speedup over the serial implementation and a 2x speedup over the time results reported in [36], we find the presented results questionable due to an unfair comparison method (i.e. comparing against an older hardware platform). A report from Smith et al. [40] has shown that the class of matrix-free IPMs [41] can be accelerated due to their heavy dependency on Sparse Matrix-Vector multiplication (SpMV), a fundamental computational kernel that can be efficiently executed on GPUs [42].

Gade-Nielsen's PhD dissertation [43] is arguably the most notable research work on GPU-based IPMs. The author focused on solving test problems from model predictive control and developed different IPM implementations based on available GPU-accelerated libraries, achieving speedups up to 4x over a multi-threaded CPU implementation written in MATLAB [44].

1.2 Contributions

The main contribution of this dissertation is the design of novel GPU-based computational techniques to advance the state-of-the-art of convex optimization. We target and optimize the sparse linear algebra representing the most computationally intensive tasks in IPMs. Since the early days, researchers have given attention to speeding up the solution of the Newton equation system. We observe that the ability to adaptively switch between direct and iterative methods (as proposed by Wang and O'Leary [45]) opens up the possibility of fully taking advantage of modern-day GPU architectures. On one hand, we are able to use the state-of-the-art in sparse Cholesky factorization [46]. On the other hand, iterative solvers such as Conjugate Gradient (CG) are known to perform well on GPUs [47]. However, we show that improving the fundamental SpMV kernel can directly speed up CG and, hence, IPMs. As an original contribution, we combine the leading ideas in SpMV optimization on GPUs into a lightweight, general, adaptive, efficient and high-performance method for sparse linear algebra computation. The result is known as AdELL+, an advanced sparse matrix format that explicitly addresses the performance bottlenecks of the SpMV kernel. Building upon our initial experience [48, 49], we carefully compose several optimization techniques into a data structure with warp granularity that naturally suits the Single Instruction Multiple Data (SIMD) vectorization associated with many-core GPUs.

Indeed, the main challenge of sparse computation on GPUs concerns the memory layout. This work proposes an innovative mapping that promotes regularity and, hence, that is substantially architecture-independent, with foreseen benefits to irregular applications. We cope with the bandwidth-limited nature of the SpMV kernel by using blocking and delta-based index compression (to reduce the memory footprint) plus nonzero unrolling (to improve the memory hierarchy utilization). We use a parametrized adaptive warp-balancing heuristic in order to implement the idea of adaptivity. Moreover, we propose a novel online auto-tuning scheme to minimize the preprocessing cost. Our experimental results show that AdELL+ achieves comparable or better performance than other state-of-the-art SpMV sparse formats proposed in academia (BCCOO [50]) and industry (CSR+ [51] and CSR-Adaptive [52]). This, in turn, has a practical impact that goes beyond CG and IPM implementations. In fact, the SpMV kernel is fundamental to numerous areas of applied mathematics, science and engineering.

The additional contributions of this dissertation to the state-of-the-art of convex optimization are the following. First, we develop an efficient GPU-based implementation of the specialized Sparse Matrix-Matrix multiplication (SpMM) needed to build a new Newton equation system at each iteration of the IPM algorithm. Second, we propose some strategies to promote hybrid CPU-GPU computation during Cholesky factorization [46] and to avoid data transfers in case of matrix-free preconditioning [41]. Last, we design an incremental matrix data structure that provides the foundation for implementing branch-and-cut ILP solvers.

The proposed techniques are general to any IPM code, but we validate them by providing a GPU-based implementation of the well-known primal-dual infeasible IPM [53] with multiple correctors [54] and adaptive direct/iterative method selection [45]. Note that it is beyond the scope of this dissertation to explore all the best mathematical optimizations or to compete directly with commercial optimization software such as CPLEX [55]. Indeed, the proprietary mathematical know-how derived from years of industrial research gives such software a position of advantage. The focus is instead on a proof-of-concept to provide enough evidence to support the use of our GPU-based computational techniques in state-of-the-art optimization software. That said, our experimental results show that GPUs have enough computational power to make our primal-dual infeasible IPM code already competitive in terms of running time.

The research topics touched on in this dissertation overlap with previous work in implementing IPMs on GPUs. However, we clearly distinguish ourselves because we focus on the underlying sparse linear algebra, proposing novel computational kernels (rather than solely relying on available libraries such as cusparse [56]). We can easily prove that our approach AdELL+ gives much better SpMV performance than the sparse formats used in [40, 43]. Similarly, we note that the sparse Cholesky factorization used in [43] is not performed on the GPU due to software library issues, providing a suboptimal solution in terms of performance which we overcome in this work. Last, the adaptive approach to the solution of the Newton equation system allows us to select the best between GPU-based sparse direct and GPU-based iterative methods. Overall, our research work provides a better foundation to effectively accelerate sparse convex optimization on GPUs.

1.3 Research Workflow

Figure 1 presents a graphical representation that summarizes how this research is organized.

[Figure 1. Research workflow]

In the first stage, on the left, we have the set of GPU-based computational techniques. AdELL+ (i.e. SpMV), SpMM and Incremental (i.e. the incremental matrix for ILP) are novel (and, thus, highlighted with a solid border). Cholesky represents the state-of-the-art GPU-based Cholesky factorization [46] used as a direct method. Similarly, Precond (i.e. the matrix-free preconditioner [41]) is a modification of the efficient approach proposed in [43] that produces a sparse (rather than dense) structure. In the second stage, we have the set of optimization algorithms that can be accelerated by plugging in the GPU-based computational techniques.

First, we can speed up the solution of symmetric positive definite linear systems (such as the Newton normal equation) by improving the iterative CG algorithm. Second, IPMs can be accelerated by using the specialized SpMM to generate the Newton equation system and by solving it with the GPU-based direct/iterative method of choice. Third, we can implement a branch-and-cut ILP algorithm by using the incremental matrix to represent the different LP relaxations and solving them with IPMs. On the right, the practical impact of this dissertation is represented as the potential ability to solve LP, ILP and QP faster. Note that unconstrained QP is equivalent to a system of linear equations (and, thus, solvable with direct/iterative methods), whereas the class of IPMs is the approach used in the case of generic QP with linear constraints.

1.4 Outline

Chapter 2 introduces different key concepts used in the dissertation, laying down the mathematical definitions of some optimization problems and introducing the class of IPM algorithms. In addition, it briefly covers the topic of GPU computing. Chapter 3 presents our research on SpMV optimization on GPUs. We propose an efficient data structure named AdELL+ that addresses the bottlenecks of sparse computation and achieves comparable or better performance than other state-of-the-art SpMV sparse formats. This optimized SpMV kernel provides the foundation to accelerate sparse convex optimization. Chapter 4 focuses on the solution of the Newton equation systems, which is the main computational effort of primal-dual IPM algorithms. We review both sparse direct and iterative solution methods, analyzing strengths and drawbacks of their GPU-based implementations. Moreover, we optimize the CG method using the SpMV kernel developed in Chapter 3.

Chapter 5 discusses further computational aspects associated with the implementation of primal-dual IPM algorithms on GPUs. First, we focus on the efficient generation of the system $A \Theta A^T$, proposing a specialized SpMM technique based on SpMV. Then, we explore the use of adaptive IPMs [45] as an effective strategy to take advantage of both the best direct and iterative GPU-based methods. Last, we propose some strategies for hybrid CPU-GPU computation. Chapter 6 is dedicated to the solution of ILP problems through relaxations. After introducing the classic branch-and-bound and branch-and-cut algorithms, we propose an extension of our GPU-based computational techniques to enable the use of the adaptive IPM algorithm as the core solver for LP relaxations. Finally, Chapter 7 summarizes the main contributions of this dissertation, providing conclusions and future directions.

CHAPTER 2
BACKGROUND

In this chapter, we introduce different key concepts used in the dissertation. There are whole books devoted to the subject of convex optimization. This section is only intended to lay down the mathematical definitions of some optimization problems and to introduce the class of IPM algorithms. In addition, we also provide a brief background on GPU computing and GPU architectures.

2.1 Convex Optimization

Many realistic problems in the scientific world, engineering, economics and even in ordinary everyday life can be solved using a powerful mathematical framework known as convex optimization. In general, we can formulate an optimization problem as a search space $\Omega$ (the set of feasible solutions) and an objective function $f : \Omega \to \mathbb{R}$ that we desire to minimize (or maximize, if that is the goal). A feasible point $x^* \in \Omega$ is an optimal solution for the minimization problem when

$$\forall x \in \Omega, \quad f(x^*) \le f(x). \qquad (2.1)$$

The domain $\Omega$ is typically a subset of the Euclidean space $\mathbb{R}^n$ defined by a set of constraints (equalities or inequalities). Since its early days, a branch of computational mathematics known as optimization has developed analytical and algorithmic techniques to find $x^*$.

A generic function f may have a landscape with one or more local solutions. A feasible point $x_l$ is a local optimum if the optimality condition $f(x_l) \le f(x)$ holds in some local region $\|x - x_l\| \le \delta$ with $\delta > 0$ (i.e. a neighborhood of $x_l$). The presence of multiple local optima makes optimization harder because algorithmic techniques may mistake a local solution $x_l$ for the global $x^*$.

Convex optimization is a subfield of mathematical optimization that considers convex functions over convex sets. More formally, a convex optimization problem has the goal to find an optimal solution $x^* \in \Omega$ such that

$$f(x^*) = \min\{f(x) : x \in \Omega\}, \qquad (2.2)$$

where the feasible set $\Omega \subseteq \mathbb{R}^n$ is a closed convex set

$$\forall a, b \in \Omega, \; \forall \alpha \in [0, 1] : \quad \alpha a + (1 - \alpha) b \in \Omega \qquad (2.3)$$

and $f(x) : \mathbb{R}^n \to \mathbb{R}$ is convex on $\mathbb{R}^n$

$$\forall a, b \in \mathbb{R}^n, \; \forall \alpha \in [0, 1] : \quad f(\alpha a + (1 - \alpha) b) \le \alpha f(a) + (1 - \alpha) f(b). \qquad (2.4)$$

This convexity introduces some useful properties for the underlying optimization problem. First, any local optimum is also a global optimum. Second, there exist necessary and sufficient conditions for optimality. Third, it is possible to define a duality theory. These properties provide the foundation to develop efficient computational methods for convex optimization.

Problems arising from a wide range of disciplines can be modeled and solved as convex optimization problems. Hence, the convexity assumption is not a significant restriction on the practical uses in combinatorial optimization, operations research, control theory, structural optimization, economics, computational biology and several other engineering fields.

Linear Programming

LP is a class of convex optimization problems where both the objective function and the constraints are linear. An LP problem can be defined in its standard form using matrix representation. Given $A \in \mathbb{R}^{m \times n}$, $c \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$, find a solution $x \in \mathbb{R}^n$ that solves the problem

$$\begin{aligned} \text{minimize} \quad & c^T x \\ \text{such that} \quad & Ax = b \\ & x \ge 0. \end{aligned} \qquad (2.5)$$

The feasible set $\Omega$ lies in the positive orthant ($x \ge 0$) and is defined by a linear system ($Ax = b$). Assuming A has rank m, this LP formulation leads to an optimization problem only if we have some degrees of freedom $d = n - m > 0$. When $m = n$, $\Omega$ contains only one feasible point, given by the solution of the linear system $Ax = b$. When $n < m$, the LP problem is infeasible (i.e. $\Omega = \emptyset$). Any maximization goal can be easily adapted to this standard LP formulation by using $-c^T x$ as the objective function to minimize. Similarly, this standard LP formulation can be derived from problems with inequality constraints (introducing slack variables) or with free variables (substituted with pairs of positive variables).
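As a concrete (non-GPU) illustration of the standard form (2.5), the following minimal sketch poses a tiny LP with made-up data A, b, c and solves it with SciPy's linprog. It is not part of the dissertation's code; it only shows how the equality constraints and the nonnegativity bound map onto a solver interface.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up LP in standard form (2.5): minimize c^T x subject to Ax = b, x >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0],
              [2.0, 0.5, 0.0]])
b = np.array([4.0, 3.0])

# bounds=(0, None) enforces x >= 0 on every variable.
res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
print(res.x, res.fun)   # optimal solution and objective value
```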

Geometrically, the feasible set $\Omega = \{x \in \mathbb{R}^n : Ax = b, x \ge 0\}$ is a convex polyhedron, and the optimal solution $x^*$, when it exists, always lies on the boundary of this polytope. In fact, the contours of the linear objective function are parallel hyperplanes defined by $c^T x = h$. The contour with the lowest h may intersect the polytope on a vertex (unique solution $x^*$) or on a facet (infinitely many solutions). Hence, the convexity implies that every (local) minimum is global. Moreover, an optimal solution may not exist because the problem is infeasible (i.e. $\Omega = \emptyset$) or because the problem is unbounded (i.e. there is no constraint that prevents the objective function from decreasing indefinitely along a descent direction).

The duality theory has been developed to provide bounds and to estimate the gap from the optimal solution. Given the previous LP formulation (known as the primal), we can define the dual problem as

$$\begin{aligned} \text{maximize} \quad & b^T y \\ \text{such that} \quad & A^T y + s = c \\ & s \ge 0, \end{aligned} \qquad (2.6)$$

where $y \in \mathbb{R}^m$ is the dual variable, $s \in \mathbb{R}^n$ is the dual slack variable and $\{(y, s) : A^T y + s = c, \; s \ge 0\}$ is the dual feasible region. The weak duality theorem states that for any feasible solution $x \in \Omega$ and any dual feasible pair $(y, s)$

$$b^T y \le c^T x. \qquad (2.7)$$

In other words, any solution of the dual maximization problem is a lower bound for the primal minimization problem. Similarly, any solution of the primal is an upper bound for the dual problem. This allows us to define the duality gap as

$$c^T x - b^T y \ge 0, \qquad (2.8)$$

providing an estimate of how close the current primal-dual solutions x and (y, s) are to the optimal. The strong duality theorem instead states that

$$b^T y^* = c^T x^*. \qquad (2.9)$$

In other words, if an optimal solution $x^*$ exists for the primal problem then its dual also has an optimal solution $(y^*, s^*)$ that cancels the duality gap, and vice versa. Moreover, if the primal problem is infeasible (unbounded) then its dual is unbounded (infeasible). A necessary condition for optimality known as complementary slackness can be derived as

$$c^T x^* - b^T y^* = (x^*)^T s^* = 0. \qquad (2.10)$$

At the optimum, primal and slack variables become strongly complementary pairs since each product term $x_i^* s_i^*$ is zero (i.e. if $x_i^* \neq 0$ then its dual slack $s_i^* = 0$, and vice versa).

Quadratic Programming

There exist problems where the nature of the objective function cannot be adequately represented as LP. As a consequence, QP is often used to provide a better approximation for nonlinearity. This class of convex optimization problems is characterized by a (convex) quadratic function over a set of linear constraints. More formally, given $Q \in \mathbb{R}^{n \times n}$ symmetric ($Q = Q^T$) and positive definite ($\forall x \neq 0 : x^T Q x > 0$), $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, find a solution $x \in \mathbb{R}^n$ that solves the problem

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} x^T Q x + c^T x \\ \text{such that} \quad & Ax = b \\ & x \ge 0. \end{aligned} \qquad (2.11)$$

This is also known as the constrained quadratic minimization problem. In addition, an unconstrained version of the problem can be obtained by dropping the system of linear equalities $Ax = b$ and the nonnegativity of the solution. The importance of QP goes beyond the ability to solve quadratic problems that naturally map to it. For example, an iterative technique known as sequential quadratic programming [4] uses the solution of several QP subproblems to solve nonlinear optimization problems. Given a nonlinear function $f(x)$ with continuous gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ as the optimization goal, the key idea is to improve the current solution $x_k$ by setting up a QP problem where $Q = \nabla^2 f(x_k)$ and $c = \nabla f(x_k)$ (i.e. a second-order approximation of the nonlinear function around $x_k$) and calculating a search direction used to move closer to the optimal $x^*$.
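The observation made in Section 1.3 that unconstrained QP reduces to a linear system can be illustrated in a few lines of NumPy. This is an illustrative sketch with made-up Q and c, not code from the dissertation: without constraints, the gradient $Qx + c$ of (2.11) vanishes at the optimum, so the minimizer solves $Qx = -c$.

```python
import numpy as np

# Made-up symmetric positive definite Q and linear term c.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([-1.0, -2.0])

# Unconstrained convex QP: minimize 1/2 x^T Q x + c^T x.
# Setting the gradient Qx + c to zero gives the linear system Qx = -c.
x_star = np.linalg.solve(Q, -c)
print(x_star)
```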

Integer Linear Programming

ILP is a class of convex optimization problems that can be derived from LP by adding the integrality constraint, reducing the set of feasible solutions from $\Omega = \{x \in \mathbb{R}^n : Ax = b, x \ge 0\}$ to $\Omega_Z = \{x \in \mathbb{Z}^n : Ax = b, x \ge 0\}$, where $\Omega$ is the convex hull of $\Omega_Z$. Given any ILP problem, we can derive a bound on its optimal value by relaxing the integer constraint and solving its so-called LP relaxation. Indeed, the associated LP problem (which can be solved efficiently) has a solution at least as good as the original integer problem due to being less constrained. ILP is a very general computational framework since problems in combinatorial optimization can be formulated as ILP instances (for example, the fundamental class of covering problems). In general, ILP is known to be an NP-Hard problem, and the decision version of its 0-1 special case has been listed as one of Karp's 21 NP-complete problems [6]. If only a subset of the variables is constrained to be integer, the problem is classified as Mixed Integer Linear Programming (MILP). Despite the partial relaxation, MILP is considered even more general than ILP and, hence, still NP-hard to solve. If the optimal integer solution lies on the convex polyhedron $\Omega$, ILP can be solved efficiently as an LP relaxation. Otherwise, its complexity remains NP-hard.

2.2 Interior Point Method

IPMs are a class of iterative algorithms that can be used to solve different kinds of mathematical optimization problems. For the sake of the explanation, we introduce IPMs focusing only on LP problems. However, these methods are basically a nonlinear search approach with extensions to QP [57] and other nonlinear optimizations.

Given the primal (2.5) and dual (2.6) LP formulations, we can compose a system of nonlinear equations describing the Karush-Kuhn-Tucker (KKT) conditions for optimality [58] as

$$F(x, y, s) = \begin{bmatrix} Ax - b \\ A^T y + s - c \\ Xs \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \qquad (2.12)$$

where primal feasibility, dual feasibility and complementarity must be satisfied. Note that $X \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $x \in \mathbb{R}^n$ on its diagonal. Newton's method is a numerical algorithm with local quadratic convergence, and it is commonly used to find the roots of nonlinear functions such as $F(x, y, s)$. At each iteration, Newton's method linearizes around the current $(x, y, s)$ to obtain the search direction $(\Delta x, \Delta y, \Delta s)$ by solving the following linear system

$$\begin{bmatrix} A & 0 & 0 \\ 0 & A^T & I \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ \Delta s \end{bmatrix} = \begin{bmatrix} b - Ax \\ c - A^T y - s \\ -Xs \end{bmatrix}. \qquad (2.13)$$

Note that $S$ is an $\mathbb{R}^{n \times n}$ diagonal matrix with $s$ on its diagonal. The direction $(\Delta x, \Delta y, \Delta s)$ is known as the pure Newton or affine-scaling direction. Once this is calculated, Newton's method updates the solution taking a full step as follows

$$(x, y, s)^{[k+1]} = (x, y, s)^{[k]} + (\Delta x, \Delta y, \Delta s)^{[k]}. \qquad (2.14)$$
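A minimal dense NumPy sketch of how the block system (2.13) can be assembled and solved is given below. It is illustrative only: practical IPM codes, including the ones discussed in later chapters, exploit sparsity and reduce the system to normal equations rather than solving the full 3x3 block form. The optional sigma_mu argument anticipates the centered right-hand side of (2.19).

```python
import numpy as np

def newton_direction(A, b, c, x, y, s, sigma_mu=0.0):
    """Assemble and solve the Newton system (2.13) with dense linear algebra.

    With sigma_mu = 0 this returns the pure affine-scaling direction; with
    sigma_mu > 0 the centering term of (2.19) is added to the last block.
    """
    m, n = A.shape
    X, S, I = np.diag(x), np.diag(s), np.eye(n)

    K = np.block([
        [A,                np.zeros((m, m)), np.zeros((m, n))],
        [np.zeros((n, n)), A.T,              I               ],
        [S,                np.zeros((n, m)), X               ],
    ])
    rhs = np.concatenate([
        b - A @ x,                       # primal residual
        c - A.T @ y - s,                 # dual residual
        -x * s + sigma_mu * np.ones(n),  # complementarity (plus optional centering)
    ])
    d = np.linalg.solve(K, rhs)
    return d[:n], d[n:n + m], d[n + m:]  # (dx, dy, ds)
```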

Unfortunately, this may lead to points that violate the inequality constraints $x \ge 0$ and $s \ge 0$. Each iteration can be adjusted in order to include a line search that selects a feasible step $\alpha \in [0, 1]$ along $(\Delta x, \Delta y, \Delta s)$. However, it has been shown that the affine-scaling direction usually does not make much progress towards the KKT solution [59], leading to a very slow convergence of Newton's method.

2.3 Primal-Dual Central Path

The central path is a trajectory in the feasible region that converges to the primal-dual optimal solution $(x^*, y^*, s^*)$. The theory of IPMs is based on the key idea of approximately tracing this central path as a way to efficiently solve the KKT conditions. The central path can be defined by adding a logarithmic barrier $\mu$ to the primal LP problem

$$\begin{aligned} \text{minimize} \quad & c^T x - \mu \sum_{i=1}^{n} \ln x_i \\ \text{such that} \quad & Ax = b \\ & x > 0. \end{aligned} \qquad (2.15)$$

Indeed, the logarithmic term inherently penalizes solutions on the boundary of the feasible region. The perturbation $\mu$ moves the optimum $x^*$ from a vertex of the polytope to a point $x^*_\mu$, called the $\mu$-center, in the interior $\Omega^0 = \{x \in \mathbb{R}^n : Ax = b, x > 0\}$. Those $\mu$-centers converge to the optimal $x^*$ in the limit $\mu \to 0$, tracing a continuous smooth curve called the central path. The key algorithmic idea in IPMs is to compute the Newton direction for the perturbed KKT conditions and to take a feasible step toward the neighborhood of the central path (rather than just using the affine-scaling direction).

By reducing $\mu$ at each iteration, we can usually trace a good trajectory toward the optimal $x^*$. The perturbed KKT conditions can be derived from the Lagrangian of the barrier problem

$$L(x, y) = c^T x - \mu \sum_{i=1}^{n} \ln x_i - y^T (Ax - b). \qquad (2.16)$$

Note that the linear constraints are now embedded into $L(x, y)$ such that $Ax = b$ holds at each stationary point where $\nabla L(x, y) = 0$. The perturbed KKT conditions are then

$$\begin{aligned} \nabla_x L(x, y) &= c - \mu X^{-1} e - A^T y = 0 \\ \nabla_y L(x, y) &= b - Ax = 0 \\ x &> 0, \end{aligned} \qquad (2.17)$$

where $e \in \mathbb{R}^n$ is the unit vector and $\mu > 0$ is the perturbation. By complementarity, we introduce the slack variable $s = \mu X^{-1} e$, obtaining the nonlinear system

$$F_\mu(x, y, s) = \begin{bmatrix} Ax - b \\ A^T y + s - c \\ Xs - \mu e \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \qquad (2.18)$$

The Newton direction for the perturbed KKT conditions is calculated by solving

$$\begin{bmatrix} A & 0 & 0 \\ 0 & A^T & I \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ \Delta s \end{bmatrix} = \begin{bmatrix} r_p \\ r_d \\ r_g + \sigma \mu e \end{bmatrix}, \qquad (2.19)$$

where $r_p = b - Ax$ is the primal residual, $r_d = c - s - A^T y$ is the dual residual, $r_g = -Xs$ is the gap residual and $\sigma \in [0, 1]$ is the centering parameter. This last parameter is used to move the trajectory toward the central path or toward the affine-scaling direction. For $\sigma = 1$, we direct the trajectory toward the current $\mu$-center. For $\sigma = 0$, we ignore the logarithmic barrier and only use the affine-scaling direction (i.e. $\mu = 0$). Many IPM implementations reduce the barrier parameter $\mu$ according to the proximity of the optimal solution. Specifically, they heuristically set

$$\mu = \frac{x^T s}{n}, \qquad (2.20)$$

using the duality gap as an estimate of optimality at each iteration. Moreover, the mathematical technique used to select the centering parameter $\sigma$ is critical for fast convergence. Intuitively, a trajectory close to the boundary often results in small steps due to hitting the boundary of the interior feasible region. On the other hand, a trajectory close to the central path often has the opportunity to take longer steps, although this may require some centering toward the current $\mu$-center during the IPM iterations.

Indeed, selecting the centering parameter $\sigma$ is a trade-off between increasing centrality and making good progress towards the optimal solution.

One of the main theoretical properties of IPMs is to show convergence as long as the approximate trajectory is in the neighborhood of the central path. A common way to formalize a tight horn neighborhood is

$$\mathcal{N}_2(\beta) = \{(x, s) : \|Xs - \mu e\|_2 \le \beta \mu\}. \qquad (2.21)$$

Algorithms based on this neighborhood are of theoretical importance, although they take short steps and converge slowly in practice [31]. Today's most efficient IPM implementations are instead based on wider neighborhoods and faster barrier parameter reduction. This, in turn, results in longer steps and (generally) faster convergence. Once the Newton direction has been calculated, we take a step $\alpha \in [0, 1]$ such that $(x, s) > 0$ remains positive. In practice, $\alpha$ is calculated with a ratio test as follows

$$\begin{aligned} \alpha_p &= \min \left\{ -\frac{x_i}{\Delta x_i} : \Delta x_i < 0, \; i = 1, \ldots, n \right\} \\ \alpha_d &= \min \left\{ -\frac{s_i}{\Delta s_i} : \Delta s_i < 0, \; i = 1, \ldots, n \right\} \\ \alpha &= \min \{\alpha_p, \alpha_d\}. \end{aligned} \qquad (2.22)$$

In addition, we use a scaling parameter $\eta$ to avoid hitting the boundary of the interior feasible region. Typically $\eta$ is chosen very close to (but strictly smaller than) one. The solution is then updated as

$$(x, y, s)^{[k+1]} = (x, y, s)^{[k]} + \eta \alpha^{[k]} (\Delta x, \Delta y, \Delta s)^{[k]}. \qquad (2.23)$$

All the algorithmic components discussed so far are summarized by Algorithm 1. Note that we still have not covered topics such as how to select the starting point $(x, y, s)^{[0]}$, how to select the centering parameter $\sigma$ and how to test convergence. These issues will be discussed shortly in the next subsections.

Algorithm 1 Primal-Dual Interior Point Method
Input: $(x, y, s)^{[0]}$ and $\eta$
Output: $(x^*, y^*, s^*)$
1: for k = 1, 2, ... and not converged do
2:   Choose $\sigma^{[k]} \in [0, 1]$
3:   Compute $(\Delta x, \Delta y, \Delta s)^{[k]}$ by solving (2.19)
4:   Compute step length $\alpha^{[k]}$ by the ratio test (2.22)
5:   Update $(x, y, s)^{[k]}$ by taking step (2.23)
6: end for
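The loop body of Algorithm 1 can be sketched compactly as follows. This is an illustration rather than the dissertation's GPU implementation; it assumes the newton_direction helper from the sketch after (2.13) is in scope, and the choices of sigma and eta are placeholders.

```python
import numpy as np

def ipm_iteration(A, b, c, x, y, s, sigma, eta=0.99):
    """One pass of the loop in Algorithm 1 (assumes newton_direction() is defined)."""
    n = x.size
    mu = x @ s / n                                   # barrier parameter as in (2.20)
    dx, dy, ds = newton_direction(A, b, c, x, y, s, sigma_mu=sigma * mu)

    def max_step(v, dv):
        # Ratio test (2.22): largest step in (0, 1] keeping v + alpha*dv > 0.
        neg = dv < 0
        return min(1.0, float(np.min(-v[neg] / dv[neg]))) if neg.any() else 1.0

    alpha = min(max_step(x, dx), max_step(s, ds))
    # Damped update (2.23): eta < 1 keeps the new iterate strictly interior.
    return x + eta * alpha * dx, y + eta * alpha * dy, s + eta * alpha * ds
```

A driver would call ipm_iteration repeatedly, choosing sigma with a predictor-corrector heuristic such as the one described next, until the termination criteria of (2.29) are met.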

Multiple Centrality Correctors

An effective approach to choose the centering parameter $\sigma$ is to split the computation of the Newton direction into two steps. First, a predictor direction is calculated by solving (2.19), assessing the quality of the affine-scaling step. This information is later used to implement a corrector step, where a centering direction is calculated by solving the same system with a different right-hand side vector. This powerful predictor-corrector technique was initially proposed by Mehrotra [60]. We recall some of its key features. The affine-scaling prediction direction $\Delta_a = (\Delta_a x, \Delta_a y, \Delta_a s)$ is used to evaluate a predicted complementarity gap after the maximum feasible step as

$$g_a = (x + \alpha_p \Delta_a x)^T (s + \alpha_d \Delta_a s). \qquad (2.24)$$

The ratio $g_a / (x^T s)$ is used as a measure of the quality of the predictor. Very little progress is achievable in direction $\Delta_a$ when this ratio is close to one, requiring strong centering. On the other hand, a more aggressive optimization is possible if less centering is needed. Mehrotra's predictor-corrector technique uses the choice $\sigma = (g_a / x^T s)^3$ for the centering parameter, adjusting the barrier parameter as follows

$$\mu = \sigma \frac{x^T s}{n}. \qquad (2.25)$$

This new value is used to calculate a corrector direction $\Delta_c$ that not only adds the centrality term but also corrects for the error made by the predictor. Mehrotra's predictor-corrector technique has, however, a few drawbacks. First, it optimistically assumes that a full step in the corrected direction will be possible. Second, the attempt to correct each direction component with the same $\mu$ may occasionally be too aggressive. Last, Mehrotra's corrector does not provide additional benefit if applied recursively.
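The choice of $\sigma$ in (2.24)-(2.25) amounts to a few vector operations. The sketch below is illustrative rather than the dissertation's code and assumes the affine-scaling direction and its feasible step lengths have already been computed.

```python
import numpy as np

def mehrotra_sigma(x, s, dx_aff, ds_aff, alpha_p, alpha_d):
    """Centering parameter and adjusted barrier parameter from (2.24)-(2.25)."""
    gap = x @ s
    # Predicted complementarity gap after the maximum feasible affine-scaling step.
    g_a = (x + alpha_p * dx_aff) @ (s + alpha_d * ds_aff)
    sigma = (g_a / gap) ** 3          # close to 1 when the predictor makes little progress
    mu = sigma * gap / x.size         # barrier parameter used by the corrector
    return sigma, mu
```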

The multiple centrality correctors technique [54] has been designed to address these drawbacks. Given a predictor direction $\Delta_p$ (initially calculated from the affine-scaling direction), this technique looks for a centrality corrector $\Delta_m$ such that larger steps can be made in the composed direction $\Delta = \Delta_p + \Delta_m$. Given the corresponding primal $\alpha_p$ and dual $\alpha_d$ steps over the predicted direction $\Delta_p$, we attempt to expand those step sizes to

$$\tilde{\alpha}_p = \min \{\alpha_p + \delta, 1\}, \qquad \tilde{\alpha}_d = \min \{\alpha_d + \delta, 1\}, \qquad (2.26)$$

for some aspiration level $\delta \in (0, 1)$ (usually set to 0.1). Then, we use those increased steps to compute a trial point

$$\tilde{x} = x + \tilde{\alpha}_p \Delta_p x, \qquad \tilde{s} = s + \tilde{\alpha}_d \Delta_p s. \qquad (2.27)$$

The key idea of the multiple centrality correctors technique is to correct only the complementarity products $\tilde{x}_i \tilde{s}_i$ that are significantly smaller than $\mu$ (including any negative components) or that significantly exceed $\mu$. Specifically, we do not apply any correction to elements which are close to the target value $\mu$ and fall in the range $\gamma \mu \le \tilde{x}_i \tilde{s}_i \le \gamma^{-1} \mu$, where $\gamma \in (0, 1)$ (usually set to 0.1). On the other hand, small product outliers ($\tilde{x}_i \tilde{s}_i \le \gamma \mu$) are moved to $\gamma \mu$, and large product outliers ($\tilde{x}_i \tilde{s}_i \ge \gamma^{-1} \mu$) are moved to $\gamma^{-1} \mu$.

Those corrections are stored in a special right-hand side $(0, 0, t)^T$, where the target $t \in \mathbb{R}^n$ is defined as

$$t_i = \begin{cases} \gamma \mu - \tilde{x}_i \tilde{s}_i & \text{if } \tilde{x}_i \tilde{s}_i \le \gamma \mu \\ \gamma^{-1} \mu - \tilde{x}_i \tilde{s}_i & \text{if } \tilde{x}_i \tilde{s}_i \ge \gamma^{-1} \mu \\ 0 & \text{otherwise.} \end{cases} \qquad (2.28)$$

The corrector direction $\Delta_m$ is computed by solving the usual system of equations (2.19) with this new right-hand side. This strategy is very effective, and the step sizes in the primal and dual spaces for the composite direction are larger than for the predictor direction only. Moreover, the multiple centrality correctors technique can be applied recursively on the direction $\Delta_p = \Delta_p + \Delta_m$ as long as the step sizes increase by a fraction of the aspiration level $\delta$ (up to a maximum number of times).
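The target (2.28) is a simple elementwise clipping of the trial complementarity products; a short illustrative sketch (not the dissertation's implementation) is:

```python
import numpy as np

def corrector_target(x_trial, s_trial, mu, gamma=0.1):
    """Right-hand-side target t of (2.28) for one centrality corrector.

    Products already close to mu are left alone; small outliers are pushed up
    to gamma*mu and large outliers pushed down to mu/gamma.
    """
    v = x_trial * s_trial                 # trial complementarity products
    t = np.zeros_like(v)
    small = v <= gamma * mu
    large = v >= mu / gamma
    t[small] = gamma * mu - v[small]
    t[large] = mu / gamma - v[large]
    return t                              # used in the right-hand side (0, 0, t)
```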

Practical Issues

One of the difficulties arising in implementing the primal-dual method is the choice of the initial solution. Some implementations use a two-phase approach where an artificial problem is initially solved in order to derive a feasible solution for the original problem [61, 62]. Primal-dual infeasible algorithms [53] are regarded as the most efficient IPMs today due to their ability to reach the solution with trajectories on the positive orthant (i.e. $(x, s) > 0$) that may or may not be completely primal or dual feasible, avoiding the need for additional solving phases. On the other hand, the selection of an initial solution is still critical for their good convergence. Ideally, one would like the initial point to be as well centered and as close to primal and dual feasibility as possible. Here, Algorithm 2 describes a commonly used method known as Mehrotra's initial point heuristic [60] that attempts to satisfy the primal and dual equality constraints while shifting away from the boundary.

Algorithm 2 Mehrotra's Initial Point Heuristic
Output: $(x, y, s)^{[0]}$
1: $\bar{x} = A^T (A A^T)^{-1} b$
2: $\bar{y} = (A A^T)^{-1} A c$
3: $\bar{s} = c - A^T \bar{y}$
4: $\delta_x = \max\{-1.5 \min\{\bar{x}\}, 0\}$, $\delta_s = \max\{-1.5 \min\{\bar{s}\}, 0\}$
5: $\hat{x} = \bar{x} + \delta_x e$, $\hat{s} = \bar{s} + \delta_s e$
6: $\hat{\delta}_x = \frac{1}{2} \frac{\hat{x}^T \hat{s}}{e^T \hat{s}}$, $\hat{\delta}_s = \frac{1}{2} \frac{\hat{x}^T \hat{s}}{e^T \hat{x}}$
7: $x^{[0]} = \hat{x} + \hat{\delta}_x e$, $y^{[0]} = \bar{y}$, $s^{[0]} = \hat{s} + \hat{\delta}_s e$
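In NumPy, Algorithm 2 can be written in a few lines. This is a dense illustrative sketch that forms $A A^T$ explicitly, which a production code would avoid.

```python
import numpy as np

def mehrotra_initial_point(A, b, c):
    """Starting point (x, y, s) following Algorithm 2 (dense, for illustration)."""
    AAT = A @ A.T
    x_bar = A.T @ np.linalg.solve(AAT, b)     # least-squares point with A x = b
    y_bar = np.linalg.solve(AAT, A @ c)
    s_bar = c - A.T @ y_bar

    # Steps 4-5: shift away from the boundary so that x, s > 0.
    dx = max(-1.5 * x_bar.min(), 0.0)
    ds = max(-1.5 * s_bar.min(), 0.0)
    x_hat, s_hat = x_bar + dx, s_bar + ds

    # Steps 6-7: balance the complementarity products.
    dot = x_hat @ s_hat
    return (x_hat + 0.5 * dot / s_hat.sum(),
            y_bar,
            s_hat + 0.5 * dot / x_hat.sum())
```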

The termination criteria for primal-dual infeasible IPMs should check whether the KKT conditions are satisfied up to some predefined tolerance. This translates to the following criteria imposed on the relative primal residual, dual residual and duality gap

$$\frac{\|r_p\|}{\|b\|_2} \le \epsilon_p, \qquad \frac{\|r_d\|}{\|c\|_2} \le \epsilon_d, \qquad \frac{x^T s}{1 + (x^T c + y^T b)/2} \le \epsilon_g. \qquad (2.29)$$

Note that the duality gap is normalized with an intermediate value between the primal and dual objectives. In practice, it is rare that the termination condition on the gap is satisfied while primal or dual feasibility does not hold. This is a consequence of the fact that the primal and dual residuals are linear functions and, thus, easier to solve for Newton's method than the nonlinear gap residual. Commonly used tolerances are $\epsilon_p = 10^{-4}$ and $\epsilon_d = 10^{-4}$, with a corresponding tolerance $\epsilon_g$ on the gap. Even with stricter tolerances, IPMs do not exactly find a solution on a vertex of the convex polytope. This, instead, requires a refinement step such as the one proposed in [63]. Another relevant issue for termination is to reliably detect infeasibility or unboundedness of the LP problem. This usually manifests in a rapid growth of the primal or dual objective function and immediately leads to numerical problems. Techniques such as the homogeneous and self-dual linear feasibility model [64] have been developed in order to correctly detect infeasibility for at least one of the primal or dual problems.

2.4 GPU Computing

Originally dedicated to dense 3D rendering, throughput-oriented manycore architectures have evolved into more general-purpose parallel processors available for scientific and engineering computing, providing good speedups over multicore CPUs for well-known computational kernels [65]. GPUs offer a large degree of fine-grain parallelism which is naturally well-suited for dense computational kernels. In addition, GPUs also represent an attractive solution to optimize bandwidth-limited sparse computational kernels due to their high peak memory bandwidth. In this section, we provide a brief introduction to GPU computing, taking as a reference the specific GPU device used for our experimental results, the current state-of-the-art NVIDIA Tesla K40 based on the Kepler architecture [66].

GPU Architecture

Multicore CPUs and GPUs are parallel processors tailored for different computational workloads. On one hand, multicore CPUs have few large cores optimized for fast sequential performance and minimum latency.

On the other hand, GPUs focus on massive parallelism to achieve high throughput. Indeed, hardware multithreading is the key idea used to design GPU architectures. Memory latencies are hidden by the ability to switch between a large pool of threads ready to execute, keeping most of the computational cores busy. As a consequence, the use of large and complex cache memories (such as those available on CPUs) becomes less crucial, and more silicon area can be dedicated to implementing computational cores.

[Figure 2. CPU-GPU interconnection architecture]

High-performance computing-oriented GPUs such as the Tesla K40 are located on discrete PCI-express acceleration boards, as illustrated in Figure 2. The GPU can read and write an off-chip memory (called device memory) with size up to 12 GB and theoretical peak bandwidth up to 288 GB/s, depending on the specific acceleration board. Data can be moved between CPU and GPU using the PCI-express bus, with speed up to 16 GB/s for PCI-express 3.0. Due to the relatively small bandwidth, this may become a performance bottleneck for those applications that move too much data back and forth between CPU and GPU.

In general, data movement should be minimized as much as possible. Alternatively, the use of asynchronous GPU streams can facilitate the overlapping between computation and memory transfers.

GPUs are considered manycore architectures due to their large number of computational cores. These are organized into a number of vectorized processing units called Streaming Multiprocessors (SMXs), each containing 192 simple scalar cores known as CUDA [37] cores (see Figure 3). The physical execution of threads is organized according to the Single-Instruction Multiple-Thread (SIMT) model with warp granularity, where a warp is composed of 32 threads. Each core executes a single thread in lockstep fashion. That is, the same warp-level instruction is scheduled for the entire warp, and each thread executes the same operation on its own data.

[Figure 3. Kepler SMX architecture]

The SIMT model also supports branching using serialization (i.e. threads are divided into active and inactive depending on their control flow). However, this obviously leads to a decrease in performance. SMXs are connected to a random access memory through a cache hierarchy. This global memory has high bandwidth (up to 288 GB/s) but high latency (up to 800 clock cycles). The cache hierarchy is organized on two levels. A coherent L2 cache (1536 KB) is shared among all the SMXs and provides a means to reduce the global memory bandwidth usage (whereas latency is not improved). Each SMX also has a local on-chip memory (64 KB) with low latency and very high bandwidth (about 1.33 TB/s). This fast memory can be split and configured (16/48 KB, 32/32 KB or 48/16 KB) to work as shared memory (software managed) or as L1 cache (compiler managed). In addition, each SMX has 48 KB of dedicated read-only data cache and a large register file composed of 32-bit registers (used in pairs in case of double-precision arithmetic) with very low latency and extremely high bandwidth (about 7.99 TB/s). Fast context switching between warps is possible by statically assigning different registers to different threads.

A crucial factor for GPU efficiency is the memory access pattern. Any GPU architecture is indeed optimized for regular accesses. Specifically, memory requests within a warp are converted into L2 cache line requests (32 bytes). Hence, memory performance is maximized only when memory addresses can be coalesced into a single aligned transaction (as opposed to inefficient scattered accesses). Being a manycore architecture, GPUs have been designed for fine-grain parallelism and regular computation. Each SMX can manage up to 2048 threads (forming up to 64 warps), reaching massive parallelism with up to 30,720 simultaneous threads (960 warps) on the 15 SMXs of a Tesla K40.

CUDA

CUDA is a leading abstract parallel computing architecture and programming model [37]. Actively developed by NVIDIA, CUDA has facilitated the use of GPU computing in many application domains. CUDA is based on a hierarchy of concurrent lightweight threads to model the computation according to a data-parallel model. That is, the algorithm is expressed in such a way that each data element is processed by an independent lightweight thread. All threads execute the same program on different data. Threads are logically grouped into equally sized blocks. These represent abstract SMXs capable of running all their threads simultaneously. However, blocks are permanently assigned to one of the available SMXs, broken down into warps and executed in an arbitrary order. Cooperation and synchronization within a block are possible by means of shared memory and synchronization primitives. On the other hand, threads within the same warp can benefit from the implicit synchronization due to the hardware lockstep execution (SIMT model). Blocks are grouped into a grid covering all the data to process.
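The thread/block/grid hierarchy can be made concrete with a toy kernel. The sketch below uses Numba's CUDA Python dialect (rather than the CUDA C used for the kernels in this dissertation) so that it stays in the same language as the other sketches; the kernel, sizes and data are made up for illustration.

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)            # absolute thread index: blockIdx.x * blockDim.x + threadIdx.x
    if i < x.shape[0]:          # guard threads beyond the end of the data
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.arange(n, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block   # grid covers all n elements
saxpy[blocks, threads_per_block](np.float32(2.0), x, y, out)
print(out[:4])
```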

CHAPTER 3
SPARSE LINEAR ALGEBRA ON GPUS

Sparse linear algebra is fundamental to the field of convex optimization, where efficiency is highly dependent on the computational techniques used. In this chapter, we propose an efficient data structure named AdELL+ for optimizing the SpMV kernel on GPUs, focusing on the performance bottlenecks of sparse computation. Our experimental results show that AdELL+ achieves comparable or better performance than other state-of-the-art SpMV sparse formats proposed in academia (BCCOO) and industry (CSR+ and CSR-Adaptive).

3.1 Sparse Matrix-Vector Multiplication

The SpMV is, arguably, the most important kernel in sparse matrix computations. In this dissertation, we specifically refer to the operation $y = y + Ax$, where y and x are dense vectors of length m and n, respectively, while A is a sparse matrix of size $m \times n$. Typical matrices in applications have thousands or even millions of rows and columns with only a small fraction of nonzero entries. Calculation of the SpMV essentially reduces to many independent dot products such as the following

$$y_i = y_i + \sum_{j : a_{ij} \in A_i} a_{ij} x_j, \qquad (3.1)$$

where $A_i$ is a sparse row vector of the matrix A and $a_{ij}$ corresponds to its nonzero entries.
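For reference, the row-wise dot products of (3.1) over a CSR representation look as follows. This is a plain CPU-side SciPy/NumPy sketch for illustration, not the GPU kernel developed in this chapter, where rows (or groups of rows) are mapped to threads or warps instead of a sequential loop.

```python
import numpy as np
from scipy.sparse import csr_matrix

def spmv_csr(A: csr_matrix, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """y = y + A @ x computed row by row, mirroring (3.1)."""
    vals, cols, ptr = A.data, A.indices, A.indptr
    for i in range(A.shape[0]):
        acc = 0.0
        for k in range(ptr[i], ptr[i + 1]):   # nonzeros of row i
            acc += vals[k] * x[cols[k]]
        y[i] += acc
    return y

# Tiny made-up example.
A = csr_matrix(np.array([[4.0, 0.0, 1.0],
                         [0.0, 3.0, 0.0]]))
print(spmv_csr(A, np.array([1.0, 2.0, 3.0]), np.zeros(2)))   # same result as A @ x
```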

While SpMV computation may seem intrinsically parallel, several challenges are faced in achieving high performance. First, an irregular structure may lead to an unbalanced workload with uneven nonzeros across different working threads. Such an imbalance problem is even more severe in many-core GPUs since warps operate in SIMD fashion, leading to potential control divergence (e.g. a thread processing a heavyweight row forces adjacent threads to sit idle). Second, the bandwidth-limited nature of the SpMV kernel puts high pressure on the memory hierarchy. Each nonzero $a_{ij}$ in matrix A is only used once for computing the corresponding dot product, whereas the elements of the dense vector x may be reused across different dot products (although we expect poor locality in case of irregular structure). Roughly speaking, SpMV can be seen as a streaming computation, where performance is limited by the bandwidth at which matrix data can be streamed from memory. Moreover, modern GPUs can benefit from a cache hierarchy (usually organized on two levels) that performs best in case of aligned and coalesced memory operations (e.g. different threads in a warp accessing data within the same cache line).

Matrix Characterization

The sparse structure provides the opportunity to represent a matrix in memory by only its nonzero elements (plus additional information associated with row and column indices) as opposed to using a two-dimensional dense array (which would considerably limit the sizes of the problems that can be computed). In some cases, the pattern of nonzero elements is quite regular (e.g. for problems resulting from discretization on regular grids), but the most interesting (and challenging) matrices are those whose distribution of nonzeros per row appears to be highly variable. The matrix sparsity pattern can be roughly evaluated by some simple metrics based on the number of nonzeros per row in A.

Here we introduce the characterization of the benchmark suite used to provide a comprehensive evaluation of our work. All the sparse matrices (except a synthetically generated dense one) can be downloaded from the well-known University of Florida Sparse Matrix Collection [67] as Matrix Market files [68]. As suggested by Langr and Tvrdik [69] in their evaluation criteria for sparse matrix formats, we provide a fair comparison by choosing relatively large matrices that have been widely adopted in previous SpMV optimization research [42, 48-50, 52, 70-76]. This benchmark suite contains 20 sparse matrices of heterogeneous sizes, structures, and application domains. We categorize them as regular (the 14 matrices in Table I) and irregular (the 6 matrices in Table II), based on the coefficient of variation σ/µ as well as on the distribution of the nonzeros per row.

TABLE I. Description of the regular benchmark suite (columns: nnz, dimension m × n, nonzeros per row µ and σ). Matrices: Circuit, Dense, Economics, Epidemiology, FEM/Accelerator, FEM/Cantilever, FEM/Harbor, FEM/Ship, FEM/Spheres, Ga41As41H72, Protein, QCD, Si41Ge41H72, Wind Tunnel. [Numeric values omitted.]

TABLE II. Description of the irregular benchmark suite (columns: nnz, dimension m × n, nonzeros per row µ and σ). Matrices: Circuit5M, Eu, In, LP, Mip, Webbase. [Numeric values omitted.]

The adopted classification is more evident from Table III and Table IV. Each irregular matrix has a large range [min, max], and its associated histogram (built with uniform bins within the range [0, max]) highlights a skewed distribution of nonzeros per row. Because of the large range, rows falling in the last bins account for a significant percentage of the total nonzeros (e.g. 13% of the total nonzeros in Circuit5M are associated with rows falling in the last bin).

TABLE III. The distribution of nonzeros per row in the regular matrices (columns: min, median, max and binned histogram). Matrices: Circuit, Dense, Economics, Epidemiology, FEM/Accelerator, FEM/Cantilever, FEM/Harbor, FEM/Ship, FEM/Spheres, Ga41As41H72, Protein, QCD, Si41Ge41H72, Wind Tunnel. [Numeric values omitted.]

TABLE IV. The distribution of nonzeros per row in the irregular matrices (columns: min, median, max and binned histogram). Matrices: Circuit5M, Eu, In, LP, Mip, Webbase. [Numeric values omitted.]

Clearly, this sparsity pattern is hard to parallelize. On the other hand, regular matrices distribute their rows more evenly along the range of nonzeros per row. Some benchmarks, such as QCD, are perfectly balanced since σ = 0. Other benchmarks, such as Circuit or Ga41As41H72, still appear skewed, but their smaller range (compared with the irregular matrices) makes the computing load associated with long rows less severe.

Related Work

The literature on implementing SpMV on throughput-oriented many-core processors is extensive. The main focus has been on proposing and tuning novel formats to efficiently represent the sparse matrix, with the aim of optimizing memory access efficiency, memory footprint, load balancing and thread divergence. Bell and Garland presented perhaps the best-known work on GPU-based SpMV [42], including basic sparse matrix formats such as COO, DIA, CSR, ELL, and a new hybrid format HYB (which combines ELL and COO). The vendor-tuned library cuSPARSE [56] provides a well-maintained implementation of the above formats. However, none of them is consistently superior, since each has been designed to leverage a specific matrix structure.

COO and CSR are suited for unstructured matrices, with COO rarely used in practice due to its space inefficiency. ELL is instead suited for vector architectures and performs well on regular structures. DIA provides a clear advantage for strongly diagonal matrices. Several derivative formats have subsequently been proposed to improve on the baseline performance. ELL-R [77] uses an auxiliary data structure to store row lengths and avoid wasteful computation. Baskaran and Bordawekar [78] improve the CSR-based SpMV kernel by ensuring memory alignment and by caching repeatedly accessed values. Another format known as SELL [71] (and its generalization for heterogeneous vectorized units, SELL-C-σ [72]) reduces the memory footprint of the basic ELL format by horizontally partitioning the original matrix into several slices. Choi et al. [73] added the idea of blocking to CSR and ELL, creating two novel sparse formats known as BCSR and BELL. This technique is beneficial for matrices with block substructure and intrinsically reduces the memory footprint (by using indices to entire blocks instead of individual nonzero entries). Row and column reordering is another strategy that can be used to reduce matrix bandwidth and improve cache locality during SpMV execution [79]. However, original matrices often have inherent locality, so reordering may not bring any practical benefit (especially considering the preprocessing overhead). Some works focus on compression to reduce the matrix footprint and, consequently, improve the bandwidth-limited SpMV kernel. A bit-representation index compression scheme has been proposed by Tang et al. [80]. Another work by Xu et al. [81] focused on index compression for matrices with diagonal structure. In general, the challenge of compression techniques is the complexity of decompression.

This operation becomes embedded in the SpMV kernel and should be as lightweight as possible so as not to cancel out the performance gain (which is also limited by the intrinsic compressibility of the matrix). Another approach to SpMV optimization is tuning the sparse format parameters (e.g. block size or slicing factor) to achieve the best possible performance. This is usually done by means of auto-tuning frameworks that explore the parameter search space (usually a reasonable subset of it). While auto-tuning is beneficial, it also introduces additional preprocessing and benchmarking costs that should be accounted for. The success of ensemble approaches like clSpMV [74] relies heavily on auto-tuning. Since different matrix sparsity patterns can be matched to specific optimizations, the clSpMV framework provides an attractive approach based on analyzing the matrix structure and selecting the best representation (or a combination of representations) among an ensemble of available GPU-based sparse formats. The greatest challenge in SpMV optimization is coping with irregular matrix structure. Row-based parallelization (i.e. assigning one working thread to each row) leads to load imbalance, since nonzeros are unevenly distributed across rows. Roughly speaking, this problem can be solved by distributing threads to rows according to their computational load, a technique known as adaptivity. The first step towards implementing this technique was the ability to use collaborative threads to process a single row (as opposed to the row-based parallelization just described). Advanced sparse matrix formats such as ELL-T [82], SIC [75], CMRS [83] or RgCSR [84] use slightly different approaches to map a row to one or more threads in a warp. Unfortunately, this thread mapping is not flexible (i.e. each row is mapped to exactly t threads, where t is a parameter), and hence unbalanced workloads remain problematic.

A more flexible implementation was later proposed by Heller and Oberhuber [85]. However, their ArgCSR composes balanced blocks using a very simple policy that fails in the case of very skewed nonzero distributions (e.g. matrices with a single dense row). A complete (and arguably superior) adaptive sparse matrix format is provided by AdELL, with its warp-balancing heuristic for composing warps [48]. More recent work on adaptivity has used two-dimensional blocking to compose balanced computations (BRC [76]) and dynamic parallelism to process heavyweight rows (ACSR [86]). Focusing on pure performance, the state-of-the-art in SpMV optimization is arguably BCCOO [50]. This advanced sparse matrix format is an evolution of COO based on blocking and row index compression, where load balancing is achieved by means of a highly efficient segmented reduction (which, however, relies on a non-portable synchronization-free mechanism that stalls on modern AMD GPUs). BCCOO, like many other advanced sparse formats, introduces a significant preprocessing overhead for data structure generation and tuning. This set-up cost is acceptable in applications where it can be amortized over repeated SpMV iterations on the same matrix or on matrices with exactly the same sparse structure. For this reason, GPU vendors have focused their research on performance with lightweight preprocessing over CSR. NVIDIA Research recently released the ModernGPU library [51], a collection of advanced GPU coding approaches including CSR+. Based on segmented reduction for load balancing, CSR+ provides very promising performance with negligible preprocessing overhead. On the same note, AMD Research recently proposed CSR-Adaptive [52].

With a slightly different interpretation of adaptivity, CSR-Adaptive uses an approach that is unconventional (compared to previous work) to process short and medium rows (segmented reduction over nonzeros buffered in local memory). Despite the impressive progress achieved in SpMV optimization, we observe that no previous work has explicitly combined different optimization techniques in order to address all the performance bottlenecks of sparse linear algebra computation at once. This, arguably, provides a promising way to improve the state-of-the-art, and it motivates the research presented in this chapter. Our approach combines the existing techniques in an efficient and structured way to achieve maximum performance. In addition, we recognize the need to keep the preprocessing as small as possible in order to amortize the additional overhead.

Basics

In this subsection, we provide a brief description of the ELL sparse matrix representation (and its GPU-based SpMV kernel) as the necessary foundation for understanding the remainder of this chapter. ELL is a sparse format particularly well-suited to SIMD vectorized architectures. Its basic idea is to compress the sparse m × n matrix into a dense m × k data structure, where k corresponds to max, the maximum number of nonzeros per row.

Figure 4. Sparse matrix A [a] represented as COO [b], CSR [c] and ELL [d].

As shown in Figure 4, the sparse matrix A is stored in memory by means of two 4 × 3 dense arrays, Value[] and Column[] (row indices remain implicit, as in a dense matrix). Hence, zero-padding is necessary for rows with fewer than k = 3 elements. Indeed, the memory footprint associated with ELL depends strictly on the matrix regularity. Whenever the rows have approximately the same number of nonzeros, ELL can be as efficient as CSR. On the other hand, skewed distributions may lead to an unfeasible memory footprint (i.e. as large as the dense matrix). Similarly to CSR, ELL works at the granularity of one thread per row. However, its memory layout is optimized for coalesced access by adopting column-major ordering. Figure 5 provides a detailed view of how the data are stored in memory for efficiency. Thanks to column-major ordering, a vectorized warp-level instruction (for the sake of simplicity, w = 4) can process a batch of w distinct nonzeros (e.g. [a,d,f,h]) by coalescing the loads of adjacent memory addresses. Given a matrix of arbitrary size, it may also be necessary to add a few padding rows at the bottom of the two m × k arrays in order to guarantee w-alignment (indeed, this optimization is a consequence of how the GPU memory hierarchy is designed). In other words, we round m up to the next multiple of w (i.e. m̄ = ⌈m/w⌉ · w).

Figure 5. ELL sparse format [a] and its column-major memory layout [b].
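A minimal sketch of the thread-per-row ELL kernel implied by this layout is shown below (illustrative code, assuming m has already been padded to m_padded, a multiple of w, and that padded entries store a zero value and column 0 as described above).

__global__ void spmv_ell(const float *Value, const int *Column,
                         const float *x, float *y,
                         int m, int m_padded, int k)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row >= m) return;
    float sum = 0.0f;
    for (int j = 0; j < k; ++j) {
        int idx = j * m_padded + row;                 // column-major: the w loads of a warp coalesce
        sum += Value[idx] * x[Column[idx]];           // padded entries contribute 0 * x[0]
    }
    y[row] += sum;                                    // y = y + A x
}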

ELL certainly achieves good performance on regular structures (i.e. with an equal number of nonzeros in each row). On the other hand, irregular matrices inevitably lead to memory footprint inefficiency and wasted computation (i.e. short rows leave their thread idle most of the time). We can quantify efficiency as

$$e_{ELL} = \frac{nnz}{m \cdot k}, \qquad (3.2)$$

where nnz is the overall number of nonzeros in A and e_ELL is a [0, 1] metric that measures the fraction of actual nonzeros in Value[] and Column[] (intuitively, 1 − e_ELL corresponds to padding). Using ELL as the reference sparse format, we can provide a roofline-type [87] performance model that characterizes the SpMV kernel as bandwidth-limited. Assuming no latency effects and infinitely fast caches, the arithmetic intensity i (the ratio between floating-point operations performed and bytes moved from/to memory) associated with each nonzero can be calculated as

$$i = \frac{2}{m_A + m_x + m_y} \;\left[\frac{\text{Flops}}{\text{Byte}}\right]. \qquad (3.3)$$

Here the factor 2 comes from the dot product (i.e. one floating-point multiply and one add), m_A accounts for reading the nonzero value and column index from A, m_x is the traffic incurred by the partially indirect access to vector x, and m_y is the data volume for updating the result y. Assuming single precision and four-byte integer indices, each nonzero is represented by m_A = 4 + 4 = 8 bytes. However, we should also take into account the efficiency e_ELL itself (which may increase the effective amount of data loaded).

The effect of the memory hierarchy on accesses to vector x is characterized by a factor α ∈ [0, 1]. In the worst-case scenario, every access causes a cache miss, so α = 1. In the ideal situation we only account for cold misses (i.e. those necessary to fetch x from global memory), so each cache element is reused a number of times equal to the average number of nonzeros per column (α = 1/µ for square matrices). The cost of updating the elements of y (one read and one write) is instead amortized over µ nonzeros when a thread processes an entire row. Substituting all these quantities into formula (3.3) gives the expression

$$i = \frac{2}{(4+4)/e_{ELL} + 4\alpha + 2 \cdot 4/\mu} \;\left[\frac{\text{Flops}}{\text{Byte}}\right]. \qquad (3.4)$$

Let us assume the best case, where e_ELL = 1, α → 0 and µ is large. We can then derive an upper bound of 2/8 = 0.25 Flops/Byte for the single-precision arithmetic intensity. Let us now consider the specific Tesla K40 GPU, based on the Kepler architecture [66], used in our experimental section. Given the peak bandwidth measured with ECC disabled (somewhat below the theoretical 288 GB/s), the upper bound on SpMV performance measured in FLoating-point Operations Per Second (FLOPS) is only a small fraction of the 4.29 TFLOPS theoretical computational peak of the Tesla K40, proving that SpMV is a bandwidth-limited kernel due to its low arithmetic intensity.

We can repeat the same analysis for double-precision computation, obtaining

$$i = \frac{2}{(8+4)/e_{ELL} + 8\alpha + 2 \cdot 8/\mu} \;\left[\frac{\text{Flops}}{\text{Byte}}\right], \qquad (3.5)$$

which corresponds to an even lower upper bound of 2/12 ≈ 0.17 Flops/Byte and, therefore, to an even more tightly capped performance bound, still a small fraction of the 1.43 TFLOPS double-precision theoretical peak of the Tesla K40.

3.2 Improving ELL with Warp Granularity

A common optimization pattern in GPU computing consists of tailoring both the data structure and the computation around the SIMD width (i.e. the warp size). On the one hand, this provides an easy way to guarantee coalesced and aligned memory accesses. On the other hand, computation that involves only the threads of a warp performs better, since it can exploit the implicit lockstep execution for synchronization instead of explicit primitives [88]. Similarly, collective operations between threads can be performed efficiently by building upon warp intrinsics (e.g. the shuffle instruction) rather than relying on local/shared or global memory. Inspired by these ideas, we proposed an optimized sparse format known as WELL (Warp-grained ELL) [89]. WELL partitions the matrix into (warp-sized) slices and represents each of them with a local ELL data structure, leading to a fundamental advantage in terms of the overall memory footprint: intuitively, the dimension k_i of each slice depends on the longest row in that warp, rather than on a global k. WELL requires two additional vectors, K[] and Offset[], of dimension m_w = ⌈m/w⌉ (the number of warps) and m_w + 1, respectively. K[] keeps track of the local k_i, whereas Offset[] identifies the incremental starting location of each local ELL structure within Value[] and Column[].

Figure 6. An arbitrary matrix represented as ELL [a] and WELL [b].

Figure 6 shows an example of WELL and depicts its advantage over ELL. The memory footprint improvement can be quantified by updating the efficiency formula (3.2) as

$$e_{WELL} = \frac{nnz}{\sum_{i=1}^{m_w} w \cdot k_i} = \frac{nnz}{\text{Offset}[m_w]}, \qquad (3.6)$$

where the summation in the denominator accounts for the contribution of each local ELL structure to the overall WELL memory footprint.
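As an illustration, the host-side sketch below (hedged; the row-length input and all names are illustrative) builds the K[] and Offset[] metadata for warp-sized slices; the efficiency (3.6) then follows directly as nnz / Offset[m_w].

#include <algorithm>
#include <vector>

void build_well_metadata(const std::vector<int> &row_len, int w,
                         std::vector<int> &K, std::vector<int> &Offset)
{
    int m = (int)row_len.size();
    int m_w = (m + w - 1) / w;                  // number of warp-sized slices
    K.assign(m_w, 0);
    Offset.assign(m_w + 1, 0);
    for (int i = 0; i < m_w; ++i) {
        int end = std::min(m, (i + 1) * w);
        for (int r = i * w; r < end; ++r)
            K[i] = std::max(K[i], row_len[r]);  // local k_i: longest row in the slice
        Offset[i + 1] = Offset[i] + w * K[i];   // start of the next local ELL structure
    }
}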

We evaluated e_WELL for our regular benchmark suite using both (3.2) and (3.6), obtaining the results in Table V. Using warp granularity, WELL substantially improves over e_ELL, by a 4.73x factor (calculated using the harmonic mean), leading to immediate benefits in terms of memory footprint and, hence, of the overall SpMV computation. Table VI shows the results for the irregular matrices.

TABLE V. Efficiency of ELL versus WELL for the regular benchmarks (e_ELL and e_WELL per matrix). [Numeric values omitted.]

TABLE VI. Efficiency of ELL versus WELL for the irregular benchmarks (e_ELL and e_WELL per matrix). [Numeric values omitted.]

The extremely low e_ELL associated with ELL is alone sufficient to establish that WELL is the only viable option between the two sparse formats for representing irregular matrices (i.e. ELL is so inefficient that some of those matrices do not even fit in memory). WELL has been designed to promote efficient SIMD vectorization at warp level. Intuitively, each local ELL structure is mapped to a warp that executes independently of any other warp in the SpMV kernel. Moreover, there is a conceptual analogy between the threads of a warp and SIMD lanes (the two terms can be used interchangeably in the context of warp computation).

It is worth noting that the WELL memory layout intrinsically guarantees coalescing and alignment: each local ELL structure is still stored using column-major ordering for coalescing and, moreover, each local ELL structure has size w · k_i, so that each and every offset in Offset[] is w-aligned. A few further design choices keep the kernel simple and uniform. First, we do not apply any local row rearrangement, in order to avoid an extra layer of complexity (and to benefit from the intrinsic locality of the original matrix). Second, we use the texture (read-only) cache for accesses to the vector x. Third, we let the padding zeros be processed the same way as the rest of the nonzeros in the matrix. This additional arithmetic performs the same 0 · x_0 operation over and over, introducing no overhead thanks to the lockstep execution; similarly, the repeated memory accesses to x_0 are cheap due to caching. In other words, we gain efficiency by maintaining uniform code and data structures for the entire matrix. WELL is very similar to SELL [71], but its implementation of warp granularity goes beyond the simple choice of slices matching the warp size (w = 32 for the Kepler architecture). GPU programming models such as CUDA [37] or OpenCL [38] logically organize threads into blocks that are mapped to vectorized units (known as Streaming Multiprocessors, or SMXs, on the Kepler architecture) and then broken down into warps for the actual execution on the GPU hardware. It is well known that the block size b must be chosen carefully so as not to undermine performance. The key difference between SELL and WELL is that the former binds the slice size s to the block size b. Hence, warp granularity s = w (which is the best in terms of e_WELL) is a poor choice in terms of block size (i.e. b = w achieves very low utilization of the GPU hardware). WELL, on the other hand, implements warp granularity by decoupling s from b. In other words, we are able to launch SpMV kernels with the optimal b while the data structure, as well as the actual computation, still takes advantage of warp granularity.

3.3 Unrolling Nonzeros for Memory Hierarchy Efficiency

Loop unrolling is a well-known compiler optimization technique used to expose Instruction Level Parallelism (ILP) or to reduce the overhead associated with the loop control structure. Murthy et al. [90] provide an in-depth analysis of loop unrolling in the context of GPU programming. In general, the technique is applied transparently to optimize the execution of short loops with a fixed number of iterations (e.g. when processing a dense 2 × 2 matrix sub-block). However, loop unrolling can also be useful in the context of bandwidth-limited kernels such as SpMV. Consider a loop in which data (e.g. nonzeros) are loaded from memory and processed at each iteration. The memory transactions are issued in a serialized way, leading to potential memory latency inefficiency. Unrolling, on the other hand, allows multiple independent memory transactions to be issued per iteration. This scenario clearly facilitates latency hiding: those memory transactions are served simultaneously, improving the overall bandwidth utilization. Hence, loop unrolling can improve the efficiency of the GPU memory hierarchy by increasing the number of in-flight memory transactions.

Figure 7. Processing nonzeros without unrolling [a] and with 2x unrolling [b].

Figure 7 shows the unrolling technique applied to the SpMV kernel. Without unrolling, we have a loop that iterates k_i times over the nonzeros. At each iteration, we load the nonzero value and column index from matrix A, perform an indirect access to vector x, and then accumulate the dot product (i.e. multiply and add). Consider lane 0, which processes the first row: during the first iteration, value a and column 0 are loaded into two local registers, x_0 is fetched into another local register, and then a · x_0 is computed. With unrolling, we aggressively buffer multiple nonzeros (and x elements) into local registers at each loop iteration (e.g. lane 0 buffers [a, b], [0, 1] and then [x_0, x_1]). This technique (which we call nonzero unrolling) therefore allows multiple memory transactions to be issued simultaneously, improving bandwidth utilization and latency hiding. The overall number of iterations k_i may not be a multiple of the loop unrolling factor lu. In that case, buffering simply pads the buffers with a few zeros during the last iteration. This approach clearly promotes code regularity: the added padding changes neither the way the buffers are processed nor the correctness of the result.
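The device-side sketch below illustrates the buffering pattern just described for a single WELL lane (illustrative code with lu = 4, assuming k has already been padded to a multiple of lu and that Value and Column point to the warp's local column-major ELL data).

__device__ float dot_unrolled(const float *Value, const int *Column,
                              const float *x, int k, int lane, int w)
{
    const int lu = 4;                              // unrolling factor
    float sum = 0.0f;
    for (int it = 0; it < k; it += lu) {
        float v[lu]; int c[lu];
        #pragma unroll
        for (int u = 0; u < lu; ++u) {             // issue lu independent loads back-to-back
            int idx = (it + u) * w + lane;         // column-major ELL layout
            v[u] = Value[idx];
            c[u] = Column[idx];
        }
        #pragma unroll
        for (int u = 0; u < lu; ++u)
            sum += v[u] * x[c[u]];                 // padded zeros contribute 0 * x[0]
    }
    return sum;
}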

For all our computational experiments (including those presented in the next sections), we used a single NVIDIA Tesla K40 (2880 CUDA cores at 745 MHz) equipped with 12 GB of ECC GDDR5 memory at 3004 MHz as the hardware platform. We implemented all the necessary GPU-based SpMV kernels in CUDA (nvcc compiler version 6.5), and we measured the average performance in GFLOPS over 100 repetitions, without considering additional overhead (e.g. moving the sparse matrix from CPU to GPU). We also disabled the Tesla K40 ECC feature and used the 48KB texture (read-only) cache for vector x.

Our first goal is to provide an empirical analysis of the effectiveness of the proposed nonzero unrolling technique in terms of performance. Table VII reports the SpMV performance on the regular benchmark suite using WELL kernels with different lu factors. These results generalize as long as warp granularity holds. Not surprisingly, higher unrolling factors lead to better performance, thanks to a better utilization of the GPU memory hierarchy.

TABLE VII. Performance results for nonzero unrolling with WELL on the regular benchmarks (GFLOPS for lu = 1x, 2x, 4x, 8x and 16x; single and double precision). [Numeric values omitted.]

The best result is usually obtained with 16x unrolling (i.e. buffering and processing 16 nonzeros per iteration). Some matrices achieve their best with a smaller lu because their low µ makes higher-order buffering unnecessary (e.g. Epidemiology has µ = 3.99, so it achieves its best with lu = 4). Over the regular matrices, nonzero unrolling generally provides a performance speedup (calculated using the harmonic mean) of 1.98x for single-precision computation and 1.60x for double-precision computation. The smaller speedup in double precision is associated with a larger memory footprint that intrinsically provides a more optimized baseline and, hence, less room for improvement. Last, we also notice that the benchmark Dense obtains relatively low performance despite being perfectly regular. This is due to poor GPU hardware utilization: its m = 4000 rows are mapped to only m_w = 125 warps, a small portion of the 960 warps that the Tesla K40 can manage. Table VIII reports the analysis on the irregular matrices. As expected, the baseline performance without unrolling is generally poor compared to the regular benchmarks (for which WELL is better suited). On the other hand, there is a substantial performance improvement (approximately 4x for both single and double precision) when choosing lu = 16. Indeed, a large unrolling factor provides more efficient processing for skewed warps (i.e. those containing very irregular rows). There are additional architectural factors that should be taken into consideration when predicting the best unrolling factor. As a general guideline, SMX occupancy (i.e. the fraction of active threads on an SMX) [37] should be maximized by taking into account constraints such as registers per thread (or shared memory usage) and by choosing an appropriate block size b.

TABLE VIII. Performance results for nonzero unrolling with WELL on the irregular benchmarks (GFLOPS for lu = 1x, 2x, 4x, 8x and 16x; single and double precision). [Numeric values omitted.]

Given an arbitrary GPU kernel, we immediately know the number of registers per thread r. Moreover, we can evaluate how r limits the number of active blocks and, hence, SMX occupancy. For example, r = 32 does not impose any register-related limit. On the other hand, r = 128 and b = 96 (i.e. a multiple of the warp size) imposes a limit of 5 active blocks (each one using 12,288 registers out of the 65,536 available on each SMX), achieving a 23% occupancy. By selecting block size b = 64, the imposed limit rises to 8 blocks, with a 25% occupancy. Similarly, b = 32 corresponds to 16 active blocks with the same occupancy. As we can see, the selection of the optimal block size b is a convoluted tradeoff. However, we tackle this problem with a simple but effective exhaustive approach. We test all the warp multiples b = i · w such that i ∈ {1, 2, ..., 32}, and we choose the smallest b that leads to the highest SMX occupancy (in general, small blocks provide a better turnover). Referring to the example just presented, we choose b = 32 over b = {64, 128, 256}, even though all these options have the same 25% occupancy. As an aside, this selection approach has been used for tuning b in all our tests.
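One possible realization of this exhaustive selection, using the CUDA occupancy API available since CUDA 6.5 (a hedged sketch, not necessarily the exact procedure used in our experiments), is the following.

#include <cuda_runtime.h>

// Try every warp multiple and keep the smallest block size b achieving the
// best occupancy (ties are broken in favor of the smaller b).
int select_block_size(const void *kernel, int warp_size = 32)
{
    int best_b = warp_size, best_active = 0;
    for (int i = 1; i <= 32; ++i) {
        int b = i * warp_size, blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, kernel, b, 0);
        int active_threads = blocks * b;           // proxy for SMX occupancy
        if (active_threads > best_active) {
            best_active = active_threads;
            best_b = b;
        }
    }
    return best_b;
}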

TABLE IX. SMX occupancy of the WELL SpMV kernels (columns: unrolling factor, registers per thread r, block size b and SMX occupancy; single and double precision). [Numeric values omitted.]

Table IX reports register usage and SMX occupancy for all the SpMV kernels presented so far. As we can see, buffering with large unrolling factors involves more complex code and, hence, a higher number of registers. This may decrease SMX occupancy, with potential negative effects on performance. The trend is more severe for an already complex baseline kernel (i.e. one using far more than 15 or 18 registers), a scenario that we will encounter when combining different optimization techniques. The best unrolling factor is a tradeoff between providing good buffering and maintaining a decent SMX occupancy. In other words, complex SpMV kernels will likely perform better with lu = {2, 4} rather than lu = 16, due to a better compromise in terms of SMX occupancy. So far, we have presented how to design the data layout to promote efficient SIMD vectorization and how to process nonzeros to optimize memory hierarchy efficiency. In the next sections, we introduce two techniques aimed at mitigating the bandwidth-limited nature of the SpMV kernel.

3.4 Exploiting Dense Substructures with Blocking

Dense substructures are inherently present in many sparse matrices, such as those derived from partial differential equation models, providing an opportunity to further optimize performance.

Blocking is a technique, similar to compression, that leverages dense substructure to reduce the memory footprint associated with indexing. The idea is to extract a blocked matrix from the original one and to store nonzero blocks (rather than single nonzero entries). Blocking can be applied to WELL as described by Figure 8. Given the sparse matrix A with a dense substructure, its WELL representation needs 32 elements for Value[] and Column[], and the data structure spans two warps (w = 4). Given the blocked matrix B extracted from A, we can instead build WELL using [2 × 2] nonzero blocks rather than single nonzeros. We can immediately notice that this new WELL structure has 8 (block) elements rather than 32. As a result, the Column[] structure is more compact, leading to an overall space saving: each [2 × 2] block needs to store only a single column index, corresponding to sub-element [0, 0] (everything else can be derived directly from it). We can also notice that WELL now spans a single warp, with 4 lanes that process two blocks each.

Figure 8. Sparse matrix [a], WELL [b], sparse blocked matrix [c] and WELL with blocking [d].

In other words, WELL now works at block granularity, processing multiple rows simultaneously depending on the blocking factor (2 in this example). In order to guarantee coalesced memory access at warp level, the memory layout of Value[] needs to be redesigned. The overall idea is to transform each 2D block into a 1D array (e.g. [[a, b], [e, f]] becomes [a, b, e, f]) and then store those arrays in an interleaved fashion at warp granularity, repeating the process for each batch of nonzero blocks. Figure 9 clarifies this memory layout with an example. The interleaved placement is highlighted with different shades of gray. The first coalesced memory access loads [a][h][l][r] (one element per lane), which is the first element of each block. The second access loads [b][i][m][s], and so on until the entire first batch of blocks is loaded. This is then repeated for all the other block batches until the warp completes the processing of its local (blocked) ELL structure.

Figure 9. Sparse blocked matrix [a] and its interleaved memory layout [b].
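For illustration, the device-side sketch below shows how one lane could process a single [2 x 2] nonzero block from this interleaved layout (hedged code; Value and Column are assumed to point to the warp's local blocked data, and sum[0], sum[1] accumulate the two rows covered by the lane).

__device__ void process_block_2x2(const float *Value, const int *Column,
                                  const float *x, float sum[2],
                                  int block, int lane, int w)
{
    int base = Column[block * w + lane];        // column of sub-element [0,0]; one index per block
    const float *v = Value + block * 4 * w;     // first batch element of this block (4 = 2x2)
    float x0 = x[base], x1 = x[base + 1];
    sum[0] += v[0 * w + lane] * x0 + v[1 * w + lane] * x1;   // first block row
    sum[1] += v[2 * w + lane] * x0 + v[3 * w + lane] * x1;   // second block row
}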

Blocking can naturally take advantage of buffering. Moreover, blocking can be integrated with nonzero unrolling (which now happens at block granularity). For example, if we apply 2x unrolling to Figure 8, thread lane 0 buffers two nonzero blocks (i.e. the values [a, b, e, f, c, d, g, 0] and the column indices [0, 2]) plus 4 elements from x (i.e. [x_0, x_1, x_3, x_4]). The resulting dot product computation can be aggressively optimized with loop unrolling.

The blocking technique applied to WELL may be more or less effective, depending on how well an arbitrary blocking factor [bm × bn] suits the underlying dense substructure of the sparse matrix A. First, we define a [0, 1] metric called block density,

$$d_{[bm \times bn]} = \frac{nnz}{bm \cdot bn \cdot nnz_{[bm \times bn]}}, \qquad (3.7)$$

where nnz_[bm×bn] is the number of nonzero blocks in the blocked matrix B_[bm×bn] extracted from the original A. Note that d_[bm×bn] = 1 when no padding needs to be added. In the example of Figure 8, d_[2×2] = 21/24 = 0.875, because we added some zeros to densely fill three of the blocks. The next factor to analyze is the impact of blocking on the WELL data structure. Let m_w^[bm×bn] = ⌈m / (bm · w)⌉ be the number of warps necessary for the blocked matrix B_[bm×bn], and let k_i^[bm×bn] be the longest blocked row in the local (blocked) ELL structure associated with warp i. We then modify the efficiency formula (3.6) as follows:

$$e_{[bm \times bn]} = \frac{nnz_{[bm \times bn]}}{\sum_{i=1}^{m_w^{[bm \times bn]}} w \cdot k_i^{[bm \times bn]}} = \frac{nnz_{[bm \times bn]}}{\text{Offset}[m_w^{[bm \times bn]}]}. \qquad (3.8)$$

Considering again the example in Figure 8, e_[2×2] = 0.75, because 2 blocks out of 8 have been inserted as zero padding. The last factor to analyze is the effect of blocking on the memory footprint. Given a nonzero block, the amount of data necessary for column indexing is compressed by a factor of bm · bn.

Assuming single-precision computation, we define the compression c_[bm×bn] associated with blocking as

$$c_{[bm \times bn]} = \frac{8}{4 + \dfrac{4}{bm \cdot bn}} \le 2, \qquad (3.9)$$

where the upper bound of 2 corresponds to the scenario in which the sparse matrix is represented by a single dense block. It might be argued that the compression c_[bm×bn] depends only on the given blocking factor [bm × bn], whereas the density d_[bm×bn] and the efficiency e_[bm×bn] also depend on the original sparse matrix A. We can repeat the same analysis for double-precision computation, obtaining

$$c_{[bm \times bn]} = \frac{12}{8 + \dfrac{4}{bm \cdot bn}} \le 1.5, \qquad (3.10)$$

where the upper bound is smaller due to the smaller impact of column index compression (and, hence, of blocking) on the overall memory footprint. Finally, we summarize all the previous considerations into a quality metric that provides a rough estimate of which blocking factor [bm × bn] to choose:

$$q_{[bm \times bn]} = d_{[bm \times bn]} \cdot e_{[bm \times bn]} \cdot c_{[bm \times bn]}. \qquad (3.11)$$

In other words, q_[bm×bn] estimates the combined effect of zero padding and compression on the memory footprint associated with a particular choice of [bm × bn].
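A compact host-side sketch of this estimate is given below (hedged; the statistics in BlockStats are assumed to have been gathered while extracting the blocked matrix B, and all names are illustrative).

struct BlockStats {
    long long nnz;          // nonzeros of the original matrix A
    long long nnz_blocks;   // nonzero blocks in the blocked matrix B
    long long padded;       // sum over warps of w * k_i for the blocked WELL (in blocks)
};

double quality(const BlockStats &s, int bm, int bn, bool double_precision)
{
    double d = (double)s.nnz / ((double)bm * bn * s.nnz_blocks);   // block density (3.7)
    double e = (double)s.nnz_blocks / (double)s.padded;            // efficiency (3.8)
    double value_bytes = double_precision ? 8.0 : 4.0;
    double c = (value_bytes + 4.0) /
               (value_bytes + 4.0 / (bm * bn));                    // compression (3.9)/(3.10)
    return d * e * c;                                              // quality metric (3.11)
}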

The definitions of density d_[bm×bn], efficiency e_[bm×bn] and compression c_[bm×bn] are generalizations of the base case [1 × 1] with no blocking: from (3.11), q_[1×1] = d_[1×1] · e_[1×1] · c_[1×1] = 1 · e_WELL · 1 = e_WELL. By comparison, the quality metric q_[bm×bn] can estimate whether applying blocking is a good idea (i.e. whether a dense substructure is present) and, eventually, which blocking factor provides the more compact memory footprint (and, likely, good SpMV performance). Here we analyze the effectiveness of the blocking technique just presented, with the secondary goal of directly correlating the proposed quality metric with measured performance. We considered all the blocking factors with area bm · bn ∈ [1, 9] (e.g. for area bm · bn = 4 we consider [1 × 4], [2 × 2] and [4 × 1]). Table X reports the SpMV performance on the regular matrices, focusing on the [1 × 1] baseline and on the best performing blocking factor. These results are incremental with respect to those of Section 3.3, in the sense that the implemented SpMV kernels use both blocking and nonzero unrolling (we report the best-tuned result).

TABLE X. Incremental performance for the blocking technique on the regular benchmarks (GFLOPS and quality metric for the [1x1] baseline and for the best blocking factor; single and double precision). Best blocking factors (single / double precision): Circuit [1x1] / [1x1], Dense [1x9] / [1x9], Economics [1x1] / [1x1], Epidemiology [1x1] / [1x1], FEM/Accelerator [2x1] / [1x1], FEM/Cantilever [3x3] / [3x3], FEM/Harbor [1x3] / [1x2], FEM/Ship [1x6] / [1x6], FEM/Spheres [3x3] / [3x3], Ga41As41H72 [1x2] / [1x1], Protein [3x3] / [1x3], QCD [3x3] / [3x3], Si41Ge41H72 [1x2] / [1x1], Wind Tunnel [1x3] / [1x3]. [Other numeric values omitted.]

TABLE XI. Incremental performance for the blocking technique on the irregular benchmarks (GFLOPS and quality metric for the [1x1] baseline and for the best blocking factor; single and double precision). Best blocking factors (single / double precision): Circuit5M [1x9] / [1x4], Eu [1x2] / [1x1], In [1x3] / [1x2], LP [1x2] / [1x1], Mip [1x9] / [1x9], Webbase [1x1] / [1x1]. [Other numeric values omitted.]

As we can see from Table X, most of the regular matrices have a dense substructure that can be leveraged with blocking to achieve superior performance. Indeed, we measure an average speedup of 1.17x (1.09x) for single precision (double precision). Not surprisingly, the quality metric estimates fairly well when it is convenient to apply blocking; in particular, the best performing blocking factor always has a higher quality metric. For example, matrix QCD has a [3 × 3] dense substructure on which blocking clearly improves single-precision performance, and this improvement is mirrored by a corresponding increase of the quality metric. We also analyzed the relationship between measured performance and the quality metric across all the blocking factors more directly, finding a positive correlation for both single and double precision. For the sake of completeness, we also report the performance tests on the irregular matrices (Table XI). Even here, despite the poor baseline (we have not yet addressed the irregularity issue), we can observe that the blocking technique provides a performance edge that we can integrate with the other advanced techniques for SpMV optimization. On the other hand, the skewed distribution of some of the irregular matrices makes the proposed quality metric less reliable, although a positive correlation with performance remains for both single and double precision.

3.5 Lossless Compression for Indexing Data

In general, compressing the sparse matrix structure (i.e. the indexing data) is a target for optimization, since it reduces the memory footprint. For example, the blocking technique just analyzed substantially reduces the column indices associated with each nonzero block, providing an implicit bm · bn compression ratio. There are additional opportunities for compression within the nonzero pattern. However, any technique we adopt should have a decompression phase that is as simple as possible, so that it can be embedded efficiently into the SpMV kernel. Following this principle, we proposed a technique based on delta encoding of the column indices [49]. Consider the way thread lanes process nonzeros within a warp. Whenever the column distance between two consecutive elements is small enough, it is convenient to reconstruct the next column from a delta with respect to the current index. This, in turn, allows us to store an 8-bit or 16-bit delta rather than a 32-bit absolute index. This compression technique can be tailored to SIMD vectorized execution. Specifically, we apply the differential encoding only when all the deltas within the warp (i.e. within the local ELL data structure) are representable as 8-bit or 16-bit integers. This provides a regular memory layout as well as a divergence-free execution. Figure 10 shows an example of our compression technique integrated into a sparse matrix format with warp granularity. The nonzero structure associated with the first warp is stored as an absolute base column index (from the first nonzero) followed by a sequence of compressed delta values occupying, in this example, half the space of 32-bit integers.

Figure 10. WELL [a] and WELL with delta-based index compression [b].

The corresponding portion of Column[] is resized down from dimension 4 to 2.5. However, we may decide to introduce some padding in order to w-align the next warp (i.e. moving the second warp from location 10 to 12). This delta-based compression technique needs some additional data structures. First, it is necessary to store ColumnOffset[], an incremental offset used to locate the local indexing data within Column[]. Second, we need a flag array Flags[] that encodes the compression applied to each warp (e.g. b_1 b_0 = 00 means no compression, b_1 b_0 = 01 means 16-bit encoding and b_1 b_0 = 10 means 8-bit encoding). The second warp in Figure 10 is not compressed despite being eligible. In general, it is not profitable to apply delta encoding to warps with small k_i = {1, 2}, because they do not improve the local memory footprint after the padding necessary for w-alignment. Finally, this compression technique can be easily integrated with blocking, thanks to the independence of the column indexing data. Similarly, it is possible to apply nonzero unrolling by buffering deltas rather than absolute column indices.
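The host-side sketch below (hedged; it assumes that padded entries repeat the last real column index of their lane so that deltas are non-negative, and all names are illustrative) shows how the per-warp encoding decision could be made.

#include <climits>

enum Encoding { ENC_NONE = 0, ENC_16BIT = 1, ENC_8BIT = 2 };   // values stored in Flags[]

Encoding choose_encoding(const int *Column, int w, int k)
{
    if (k <= 2) return ENC_NONE;            // not profitable after w-alignment padding
    int max_delta = 0;
    for (int lane = 0; lane < w; ++lane)
        for (int j = 1; j < k; ++j) {
            // column-major layout: element j of a lane sits at j*w + lane
            int delta = Column[j * w + lane] - Column[(j - 1) * w + lane];
            if (delta < 0) return ENC_NONE; // a decreasing column breaks the scheme
            if (delta > max_delta) max_delta = delta;
        }
    if (max_delta <= UCHAR_MAX) return ENC_8BIT;    // every delta fits in 8 bits
    if (max_delta <= USHRT_MAX) return ENC_16BIT;   // every delta fits in 16 bits
    return ENC_NONE;
}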

We can now revise the quality metric q_[bm×bn] defined in Section 3.4 to incorporate our delta-based compression technique. Given an arbitrary warp w[i], we define its delta compression as

$$\delta^i_{[bm \times bn]} = \begin{cases} 1 & \text{no delta encoding} \\[4pt] \dfrac{4 k_i}{4 \left\lceil \left(4 + 2(k_i - 1)\right)/4 \right\rceil} & \text{16-bit delta encoding} \\[6pt] \dfrac{4 k_i}{4 \left\lceil \left(4 + (k_i - 1)\right)/4 \right\rceil} & \text{8-bit delta encoding} \end{cases} \qquad (3.12)$$

Here the achieved ratio has a different upper bound depending on which encoding can be applied. The definition takes w-alignment into account: the numerator is the uncompressed indexing footprint per lane (in bytes), while the ceiling function in the denominator evaluates the number of aligned memory chunks required to store one absolute 32-bit base index followed by k_i − 1 compressed deltas. Note also how the delta compression δ^i_[bm×bn] depends on the blocking factor [bm × bn]: for example, doubling the block dimension bn halves all the column deltas of the associated blocked matrix B, so some warps may become eligible for compression. The overall delta compression δ_[bm×bn] is defined as

$$\delta_{[bm \times bn]} = \frac{\sum_{i=1}^{m_w^{[bm \times bn]}} k_i^{[bm \times bn]}}{\sum_{i=1}^{m_w^{[bm \times bn]}} k_i^{[bm \times bn]} / \delta^i_{[bm \times bn]}} = \frac{\text{Offset}[m_w^{[bm \times bn]}]}{\text{ColumnOffset}[m_w^{[bm \times bn]}]}. \qquad (3.13)$$

This aggregate ratio across the warps can be obtained quickly by dividing the size of Value[] by the size of Column[]. Referring to the example in Figure 10, δ = 24/20 = 1.2.

Last, we update definitions (3.9) and (3.10) to integrate the newly defined delta compression δ_[bm×bn]:

$$c_{[bm \times bn]} = \frac{8}{4 + \dfrac{4}{bm \cdot bn \cdot \delta_{[bm \times bn]}}} \le 2, \qquad (3.14)$$

$$c_{[bm \times bn]} = \frac{12}{8 + \dfrac{4}{bm \cdot bn \cdot \delta_{[bm \times bn]}}} \le 1.5. \qquad (3.15)$$

Intuitively, the ability to apply delta compression to a sparse matrix increases c_[bm×bn] and, hence, the quality metric q_[bm×bn]. We performed computational experiments with the aim of evaluating the incremental benefit of delta-based index compression on top of the other optimization techniques. Table XII reports the best-tuned (in terms of blocking and unrolling factors) SpMV performance with and without delta compression on the regular benchmark suite. Overall, we observed an average speedup of 1.09x (1.05x) for single-precision (double-precision) SpMV computation over a fairly optimized baseline. As we can see, almost all the matrices have a nonzero pattern that can be successfully exploited to reduce the memory footprint and improve SpMV performance. The only exception is the matrix Dense: as mentioned, the row-based parallelization implemented by the underlying WELL does not generate a sufficient number of warps to properly exploit the GPU hardware for that matrix, so we can treat it as an outlier. Moreover, the revised quality metric is now able to capture the memory footprint reduction associated with the compression technique.

TABLE XII. Incremental performance for delta compression on the regular benchmarks (GFLOPS and quality metric, with and without delta compression; single and double precision). [Numeric values omitted.]

In other words, a higher value of q_[bm×bn] correlates with superior performance. Table XIII reports the results for the other set of matrices. Observing the negligible increments of the quality metric, we can deduce that there is no real opportunity to exploit the irregular nonzero patterns to improve the memory footprint. As a consequence, there is no substantial effect on the average SpMV performance. On the other hand, we can observe that the additional complexity of integrating the compression technique into an SpMV kernel does not introduce any visible performance overhead. This provides supporting evidence for always applying the proposed optimization, without the need to introduce an additional variable into the space of tuning parameters.

TABLE XIII. Incremental performance for delta compression on the irregular benchmarks (GFLOPS and quality metric, with and without delta compression; single and double precision). [Numeric values omitted.]

In the last sections, we presented two complementary optimization techniques explicitly targeted at reducing the memory footprint. We observed tangible performance improvements, especially for matrices with regular structure. We now switch our focus to coping with matrix irregularity.

3.6 Improving Adaptive ELL

The ability to adapt the nonzero workload of each thread (as opposed to the row-based parallelization used so far) provides an effective way to cope with matrix irregularity. We proposed AdELL [48], a warp-grained sparse matrix format implementing the adaptivity technique. AdELL is based on the idea of allocating an adaptive number of warp lanes l_r to each row r, depending on the row's nonzero count nnz_r. In other words, each warp w[i] now provides a more flexible computation strategy in which an arbitrary number {1, 2, ..., w} of rows can be used to keep all the processing lanes occupied (i.e. Σ_{r ∈ w[i]} l_r = w). In general, thread lanes need to collaborate to aggregate the dot product of the corresponding row r. This can be implemented efficiently as a warp-level segmented reduction built upon intrinsics (e.g. the warp shuffle instruction).
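As a building block, the warp-level sum needed by such a segmented reduction can be written with shuffle intrinsics as in the sketch below (hedged: this is the unsegmented core only; a full AdELL reduction would additionally consult the Map[] pattern to stop the summation at row boundaries).

__device__ float warp_reduce(float val)
{
    // Butterfly reduction: after log2(32) steps every lane holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor(val, offset);   // __shfl_xor_sync(0xffffffff, ...) on newer CUDA versions
    return val;
}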

A careful lane assignment can greatly improve the overall memory footprint efficiency (which is still calculated as e_WELL). Specifically, each warp workload k_i is now determined by the largest ⌈nnz_r / l_r⌉ among its rows (rather than by the absolute nnz_r). Irregular rows (i.e. those with a large number of nonzeros) can now be processed by up to w lanes, or even across multiple warps by means of atomic operations, providing an effective strategy for load balancing.

Figure 11. An example of a nonzero distribution unfavorable to row-based parallelization (WELL [a] vs AdELL [b]).

The example in Figure 11 illustrates the AdELL data structure and its efficiency benefit over WELL. A reduction map array Map[] is used to specify each local segmented reduction pattern (e.g. warp w[0] has Map[0] = 1001 since l_0 = 3 and l_1 = 1). Moreover, an additional array RowOffset[] stores the first row associated with each warp. This is necessary because the implicit assignment of w rows to each warp no longer holds. AdELL has the same fundamental warp-grained structure as WELL, so all the optimization techniques presented so far can be directly applied.

Considering blocking, the adaptivity technique simply focuses on allocating lanes to rows of the blocked matrix. Similarly, delta-based compression can be applied to Column[], and each lane can unroll the processing of its nonzero blocks. Figure 11 does not show any example of rows distributed across multiple warps. In that case, we may use an additional flag (e.g. b_2 in Flags[]) to signal that the final step of the segmented reduction is an atomic operation on the result vector y. AdELL is a data structure that supports adaptivity, a technique that can be leveraged to optimize SpMV performance for irregular matrices. However, it is the actual warp assignment that achieves memory footprint efficiency and load balancing across the GPU hardware. We can define a simplified version of this Warp-Balancing Problem [48]. Given the set of rows of the (blocked) matrix A, we perform the following (constrained) tasks:

Lane Assignment: assign an adaptive number of lanes l_r ∈ {1, 2, ..., w} to each row r.
Warp Partitioning: compose each warp w[i] such that Σ_{r ∈ w[i]} l_r = w.

The multi-objective function maximizes the memory footprint efficiency e_WELL while minimizing the global workload unbalance u = (max_i k_i − k_median) / k_median across the warps. Here we do not combine the two objectives into one (e.g. through a linear combination), since it is more appropriate to keep the formulation general rather than commit to a specific one. The optimization problem is then to find a lane assignment and a partition that optimize the given multi-objective function. Note that the problem just stated belongs to the class of NP-hard load-balancing problems [91]. We focus on a variation in which rows can be distributed across multiple warps and the original row ordering must be maintained. Here we propose a revised heuristic to efficiently solve the Warp-Balancing Problem and, consequently, improve SpMV performance.

This O(n)-runtime approach is designed to be faster than the original self-balancing heuristic used by AdELL [48], with the aim of reducing the cost of preprocessing. Our revised heuristic, in turn, performs both an efficient lane assignment (Algorithm 3) and the warp partitioning (Heuristic 4).

Algorithm 3 Adaptive Warp Expansion
Input: adaptive warp w[i] = {1, 2, ..., r−1} with minimal k_i, containing at most w−1 rows
Input: row r to add
Output: adaptive warp w[i] = {1, 2, ..., r−1, r} with minimal k_i
1: l_r ← w
2: for each row j in w[i] do
3:   l_j ← ⌈nnz_j / k_i⌉
4:   l_r ← l_r − l_j
5: end for
6: while nnz_r > l_r · k_i do
7:   C ← { rows j in w[i] with l_j > 1 }
8:   search the row j ∈ C with minimum ⌈nnz_j / (l_j − 1)⌉
9:   if C ≠ ∅ then
10:    k_i ← min( ⌈nnz_j / (l_j − 1)⌉, ⌈nnz_r / l_r⌉ )
11:  else
12:    k_i ← ⌈nnz_r / l_r⌉
13:  end if
14:  repeat lines 1 to 5
15: end while
16: add r to w[i]

Let us first describe the lane assignment approach used to compose candidate warps with minimal k_i and, thus, maximal e_ELL. Given a set of r ≤ w consecutive rows {1, ..., r}, we solve the subproblem of assigning warp lanes to the rows so that the resulting warp w[i] has minimal k_i and, thus, maximal e_ELL^i. The greedy approach presented in Algorithm 3 is incremental: we use the optimal lane assignment for {1, ..., r−1} to construct the one for {1, ..., r−1, r}. Given an optimal warp w[i] with minimal k_i, we calculate the lanes l_r potentially available for row r (lines 1 to 5). l_r may be sufficient to process the nnz_r nonzeros. Otherwise, we have to increase k_i, removing one of the lanes assigned to the existing rows whenever possible (lines 8 to 10). If no removal is possible (i.e. all rows have a single lane), we update k_i considering only how to accommodate nnz_r nonzeros over l_r lanes (line 12). Finally, we add the new row r (line 16).

We can prove the optimality of this approach by induction on the rows incrementally added to the warp. The base case is an empty warp w[i] with k_i = 0: in that case, k_i is updated to the minimal ⌈nnz_r / w⌉ needed to accommodate nnz_r nonzeros. The induction step can be proved by focusing on the lane assignment. When l_r · k_i^(r−1) is sufficient to process nnz_r, we simply keep k_i^r = k_i^(r−1). Otherwise, we need to increase k_i^(r−1), either to make the current l_r lanes longer (i.e. capable of processing more nonzeros) or to move the assignment of one or more lanes from some row j ∈ {1, 2, ..., r−1} to row r (i.e. increasing l_r). Note that the assignment between each row j and its l_j lanes follows the ceiling (integer) arithmetic: whenever we increase k_i by one, each row j may have either l_j = ⌈nnz_j / (k_i + 1)⌉ = ⌈nnz_j / k_i⌉ (the same assignment) or l_j = ⌈nnz_j / (k_i + 1)⌉ < ⌈nnz_j / k_i⌉ (a reduced number of lanes). On the other hand, the quantity l_r · k_i is monotonically increasing (recall that l_r = w − Σ_{j=1}^{r−1} l_j). Given k_i^(r−1), we could therefore find the optimal k_i by unit increments over k_i^(r−1) until k_i · l_r ≥ nnz_r. Algorithm 3 goes from the optimal k_i^(r−1) for rows {1, 2, ..., r−1} to the optimal k_i for rows {1, 2, ..., r−1, r} with steps that are equivalent to multiple unit increments done at once. Whenever we search for a row j with minimum ⌈nnz_j / (l_j − 1)⌉ (line 8), we aim to move an additional lane from row j to row r. Because of the ceiling arithmetic, all the intermediate unit increments between the old and the new value of k_i do not increase l_r; in other words, only the last step can modify the overall lane assignment and lead to the optimal k_i. This reasoning also applies to the case in which it is sufficient to expand k_i without increasing l_r (line 10), proving the optimality of the presented greedy approach.

Heuristic 4 Warp-balancing heuristic
Input: sparse (blocked) matrix A
Input: nonzeros per lane k_min (default 4)
Input: nonzeros per lane k_max (default ∞)
Output: AdELL data structure
1: r ← 0
2: while r < m do
3:   C ← { candidate warps from (blocked) rows {r, ..., r + w − 1} built with Adaptive Warp Expansion }
4:   C_k ← { only the w[i] ∈ C with k_min ≤ k_i ≤ k_max }
5:   if C_k = ∅ then
6:     C_k ← { only the w[i] ∈ C with k_i > k_max }
7:     if C_k = ∅ then
8:       C_k ← C
9:     end if
10:  end if
11:  search the warp w[i] ∈ C_k with the highest efficiency e_ELL^i
12:  if k_i ≤ k_max then
13:    W ← { w[i] }
14:  else
15:    W ← { split w[i] into ⌈k_i / k_max⌉ atomic warps }
16:  end if
17:  append W to AdELL
18:  r ← r + i
19: end while
20: while m_w < occupancy do
21:  split each w[i] ∈ AdELL into ⌈k_i / k_median⌉ atomic warps
22: end while

The warp-balancing heuristic itself is described by Heuristic 4. As we can see, there are two additional parameters, k_min and k_max, available to tune the way AdELL is composed. Specifically, k_min suggests favoring warps with k_i ≥ k_min. This is motivated by the empirical observation that warps below k_min = 4 do not use the underlying GPU architecture efficiently. On the other hand, k_max imposes an upper bound on k_i that can be used to distribute rows across multiple warps, providing a tuning factor for load balancing.

The proposed heuristic incrementally processes the rows of matrix A. At each step, it generates a set of w optimal candidate warps using Algorithm 3 (line 3). Then, it greedily selects the best w[i] in terms of efficiency such that k_min ≤ k_i ≤ k_max (lines 4 to 11). If this latter condition does not hold for any candidate, we also consider warps satisfying k_i > k_max. Whenever this happens, the upper limit k_max is subsequently enforced by splitting w[i] into multiple atomic warps with at most k_max workload each (lines 12 to 16). The last part of the heuristic (lines 20 to 22) takes care of the case in which the number of warps m_w is too low to fully utilize the GPU hardware (e.g. matrix Dense with WELL). The proposed heuristic clearly has O(n) runtime: the outer loop over the rows performs constant work (the incremental construction of w candidate warps) at each iteration.

The main target of adaptivity is the irregular benchmark suite. As for the other optimization techniques, we performed computational tests with the aim of evaluating the incremental benefit in terms of SpMV performance. We composed the AdELL data structure considering all the following upper limits: k_max ∈ {4, 8, 16, 32, 64, 128, ∞} (where ∞ corresponds to not imposing any k_max).

TABLE XIV. Incremental performance for the adaptivity technique on the irregular benchmarks (GFLOPS and quality metric, with and without adaptivity; single and double precision). [Numeric values omitted.]

Table XIV reports the best-tuned (in terms of blocking factor, unrolling factor and upper limit k_max) SpMV performance with and without adaptivity on the set of irregular matrices. From now on, we will use the term AdELL+ to refer to AdELL with all the optimizations active. Not surprisingly, addressing the irregular matrix structure brings a huge performance improvement across the whole benchmark suite. We observe an astonishing 19.93x (14.03x) speedup for single-precision (double-precision) SpMV computation. The simultaneous increase of the quality metric q_[bm×bn,kmax] (notice that we have added k_max to the notation) gives a hint of the effect of adaptivity. The warps selected by the warp-balancing heuristic have, by construction, a good efficiency e_ELL that contributes to improving the overall memory footprint. On the other hand, the imposed upper limit k_max is fundamental for load balancing. Let us take matrix Circuit5M as an example. The best-tuned result is obtained with k_max = 16. This upper limit generates an AdELL+ data structure in which heavyweight rows are split into atomic warps, corresponding to a nearly perfectly load-balanced execution where those rows (accounting for 13% of the total nonzeros) are broken down into small pieces of computation. Adaptivity is also useful for regular matrices, as shown by Table XV. We observed another substantial 1.57x (1.52x) speedup for single-precision (double-precision) SpMV computation. Most of the regular matrices benefit from an improved memory footprint efficiency (measured as an increase in the quality metric q_[bm×bn,kmax]). A notable example is the matrix Circuit, where the performance improves by more than a factor of 2 over its baseline without adaptivity.

TABLE XV. Incremental performance for the adaptivity technique on the regular benchmarks (GFLOPS and quality metric, with and without adaptivity, in single and double precision, for Circuit, Dense, Economics, Epidemiology, FEM/Accelerator, FEM/Cantilever, FEM/Harbor, FEM/Ship, FEM/Spheres, Ga41As41H72, Protein, QCD, Si41Ge41H72 and Wind Tunnel, with their harmonic mean).

The best absolute performance is associated with the matrix Dense. Not surprisingly, AdELL+ provides a sufficient number of warps n_w to achieve full utilization. On the other hand, there are examples in which adaptivity does not add any benefit. Consider the special case of the matrix Epidemiology. Its structure is composed of rows with an average of µ = 4 nonzeros and almost zero variability. As a result, there is virtually no room for improvement and it is advisable not to apply the adaptivity technique in order not to incur any implementation overhead (e.g. segmented reduction) on the performance. Fortunately, this feature can be easily integrated into AdELL+ by adding a flag b_3 into Flags[] to select between the two options.

Putting It All Together

To recap, we designed AdELL+, an advanced ELL-based sparse matrix format that addresses the performance bottlenecks of the SpMV kernel. We tailored the underlying data structure to warp granularity in order to suit the vectorized GPU hardware architecture. We introduced nonzero unrolling to optimize the memory hierarchy utilization. We mitigated the bandwidth-limited nature of the SpMV kernel by reducing the matrix memory footprint with blocking and indexing compression. Finally, we coped with matrix irregularity using adaptivity and our warp-balancing heuristic. We still have not analyzed and addressed the additional complexity associated with preprocessing. Indeed, selecting the best-tuned parameters may represent a practical issue when we use AdELL+ for real-world applications. Therefore, we dedicate the next section to efficient (and effective) parameter auto-tuning.

3.7 Online Auto-Tuning

Parameter tuning is a key step to achieving the best possible performance from any advanced sparse matrix format. The overhead of evaluating points in the optimization parameter search space is determined by preprocessing and by actual kernel execution on the GPU hardware. In general, it is desirable to keep this one-time cost as small as possible in order for it to be amortizable over a reasonable number of SpMV iterations. This, in turn, means avoiding an exhaustive search by pruning the parameter space with some clever strategy. Here we propose an online auto-tuning approach for AdELL+ that uses the quality metric q_[bm×bn,kmax] to drive the process. The main novelty of this approach is the idea of tuning while doing useful SpMV computation (as opposed to an initial offline tuning phase), completely hiding all the preprocessing overhead associated with composing AdELL+.

Figure 12. Online auto-tuning CPU/GPU timeline (the CPU loads the matrix and composes tuning configurations 0 to 3 while the GPU keeps executing SpMV kernels, eventually switching to the tuned SpMV).

The proposed auto-tuning schema is depicted in Figure 12. We have a timeline with the CPU (where each tuning configuration is composed) and the GPU (where it can actually be benchmarked). After loading the sparse matrix from the disk, the CPU processes the sparse format data structure and transfers it to the GPU. From that moment, the SpMV execution (e.g. a sparse solver) can begin. We can evaluate the performance by simply timing the SpMV kernel within the application context. In the meantime, the CPU is available to compose the second tuning configuration. Let us assume that the GPU has enough memory to contain two differently-tuned sparse matrices. Let us also assume that computation and data transfer can overlap (as in the Kepler architecture [66]). The GPU can keep executing the original SpMV kernel until the entire second configuration has been transferred. Then, it can switch the computation to the latter for a few iterations, evaluating the performance. Depending on which configuration is faster, we will keep one and discard the other. This process can be repeated as many times as needed on multiple configurations, providing an incremental online improvement towards the best-tuned configuration.
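The overlap between useful SpMV work and the transfer of the next candidate configuration can be expressed with CUDA streams. The sketch below is illustrative rather than a drop-in implementation: the spmv_adell kernel is a dummy placeholder for the actual AdELL+ kernel, the timing policy is an example, and only the CUDA runtime calls (streams, asynchronous copies, events) are taken as given.

#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel: a real AdELL+ SpMV kernel would go here.
__global__ void spmv_adell(const double *cfg, const double *x, double *y, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) y[i] = cfg[0] * x[i];   // dummy work standing in for the actual SpMV
}

// Time a few SpMV iterations of one configuration on the given stream.
static float timed_spmv(const double *d_cfg, const double *d_x, double *d_y,
                        int m, cudaStream_t s, int iters) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, s);
    for (int i = 0; i < iters; ++i)
        spmv_adell<<<(m + 127) / 128, 128, 0, s>>>(d_cfg, d_x, d_y, m);
    cudaEventRecord(t1, s);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms / iters;
}

// One online tuning step: upload a candidate configuration on a copy stream
// while useful SpMV work keeps running on the compute stream, then compare.
void online_tuning_step(double *&d_best, float &t_best,
                        const double *h_candidate, size_t bytes,
                        const double *d_x, double *d_y, int m) {
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute); cudaStreamCreate(&copy);

    double *d_cand = nullptr;
    cudaMalloc(&d_cand, bytes);
    // h_candidate should be page-locked (cudaMallocHost) for true copy/compute overlap.
    cudaMemcpyAsync(d_cand, h_candidate, bytes, cudaMemcpyHostToDevice, copy);

    t_best = timed_spmv(d_best, d_x, d_y, m, compute, 25);    // useful work meanwhile

    cudaStreamSynchronize(copy);                              // candidate is now resident
    float t_cand = timed_spmv(d_cand, d_x, d_y, m, compute, 25);

    if (t_cand < t_best) { cudaFree(d_best); d_best = d_cand; t_best = t_cand; }
    else                 { cudaFree(d_cand); }

    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
}

The double-buffering assumed here matches the schema of Figure 12: one resident configuration always does useful work while the next one is staged, so the tuning cost never stalls the solver.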

TABLE XVI. Preprocessing time for CSR and AdELL+ (loading, composition and q_[bm×bn] evaluation times in seconds, in single and double precision, with their arithmetic means).

The proposed auto-tuning schema can be concretely applied to AdELL+ by introducing a few additional considerations. We first analyze some preprocessing timing data reported in Table XVI. The experimental computing platform on which those data have been generated was a dual-socket system with two 8-core Intel Xeon E5-2650@2GHz processors and 64GB@1.6GHz DDR3 memory. The operating system was 64-bit CentOS 6.6, and the compiler used was gcc. As we can see, the typical time to compose a single-precision (double-precision) AdELL+ data structure from an intermediate representation derived from the original Matrix Market file [68] is 2.84x (2.57x) higher than for the simpler CSR.

Indeed, the design choices associated with the warp-balancing heuristic have led to an overhead (i.e. the first tuning configuration) that can be reasonably amortized. Second, we define the parameter space to include all the [bm×bn] blocking factors used in Section 3.4 and all the k_max factors used in Section 3.6 (plus k_max = 0 to indicate that no adaptivity is applied). The nonzero unrolling technique is implemented by means of differently-coded SpMV kernels applied to the same data structure. We empirically determined that 5 SpMV executions are enough to get a performance evaluation for each unrolling factor l_u. Hence, 25 SpMV iterations (within the context of our online auto-tuning approach) are sufficient to find the best l_u associated with any new tuning configuration just transferred to the GPU. The optimal CUDA block size is instead selected using the strategy outlined in Section 3.3. The goal is now to use the quality metric q_[bm×bn,kmax] to prune the search space. We have already described the strong correlation between performance and quality metric. Hence, we may decide to evaluate q_[bm×bn,kmax] for the whole search space and pick the top-ranked configurations to be benchmarked on the GPU. Unfortunately, evaluating q_[bm×bn,kmax] cannot be done without explicitly building the entire AdELL+ data structure. Note, however, that q_[bm×bn] can approximate q_[bm×bn,kmax] due to its ability to identify the most suitable blocking factors for a sparse matrix. In addition, q_[bm×bn] can be efficiently calculated without the need to compose AdELL+ (i.e. using only its accessory data structures). Table XVI reports the time necessary to calculate q_[bm×bn] for all the considered blocking factors using an OpenMP [92] implementation (indeed, each q_[bm×bn] can be calculated independently). We conjecture that q_[bm×bn] can also be calculated while parsing the sparse matrix from disk, completely hiding any additional cost (the loading times, as reported in Table XVI, are large enough to implement this strategy).
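A minimal sketch of the independent q_[bm×bn] evaluations is given below. The quality_metric function is only a simplified stand-in (block-density of the bm×bn tiling) for the actual metric defined earlier in this chapter, and the candidate list is an example; the point illustrated is merely that each factor can be scored by a separate OpenMP thread once the matrix is in memory.

#include <algorithm>
#include <omp.h>
#include <set>
#include <utility>
#include <vector>

struct CsrMatrix { int m, n; std::vector<int> row_ptr, col; std::vector<double> val; };

// Simplified stand-in for the quality metric: fraction of useful nonzeros in
// the bm x bn blocked footprint (higher is better). The real metric differs.
static double quality_metric(const CsrMatrix &A, int bm, int bn) {
    long long blocks = 0;
    long long nnz = A.row_ptr[A.m];
    for (int br = 0; br < (A.m + bm - 1) / bm; ++br) {
        std::set<int> cols;                                   // block columns touched
        for (int r = br * bm; r < std::min(A.m, (br + 1) * bm); ++r)
            for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
                cols.insert(A.col[k] / bn);
        blocks += (long long)cols.size();
    }
    return (double)nnz / (double)(blocks * bm * bn);
}

// Score every candidate blocking factor in parallel and return the best one.
std::pair<int, int> best_blocking_factor(const CsrMatrix &A) {
    const std::pair<int, int> cand[] = {{1,1},{1,2},{2,1},{2,2},{1,3},{3,1},{1,4},{4,1}};
    const int nc = sizeof(cand) / sizeof(cand[0]);
    std::vector<double> score(nc);

    #pragma omp parallel for                                  // each q is independent
    for (int i = 0; i < nc; ++i)
        score[i] = quality_metric(A, cand[i].first, cand[i].second);

    int best = 0;
    for (int i = 1; i < nc; ++i)
        if (score[i] > score[best]) best = i;
    return cand[best];
}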

Once we pick a subset of top-ranked blocking factors, we can iteratively benchmark all the combinations with k_max. Alternatively, we can exploit the conceptual independence between blocking and adaptivity to reduce the search space, applying the line search strategy outlined in Table XVII. First, we optimize k_max by testing all the feasible possibilities associated with the best blocking factor. In the given example, [2x1] is the top-ranked factor, so we explore its [2x1, k_max] combinations with k_max from ∞ (no bound) to 64 (we omit k_max = 128 because FEM/Harbor does not have any blocked row with more than 64 nonzero blocks). Second, we optimize [bm×bn, 16] by exploring the other top-ranked blocking factors (i.e. [1x2], [3x1], [2x2], [1x3]). As a result, we identify [2x1, 16] as the best-tuned configuration in terms of GFLOPS.

TABLE XVII. FEM/Harbor parameter tuning with line search (single-precision performance in GFLOPS as a function of k_max for the blocking factors [2x1], [1x2], [3x1], [2x2] and [1x3]).
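The two-phase line search amounts to a few lines of host code. The sketch below assumes a benchmark_gflops routine that composes a candidate configuration, transfers it to the GPU and times a few SpMV iterations (as in the online schema above); that routine, the configuration type and the candidate lists are illustrative placeholders rather than the actual implementation.

#include <utility>
#include <vector>

struct TuningPoint { int bm, bn, kmax; };      // kmax = 0 encodes "no adaptivity"

// Stub: in the real system this composes the AdELL+ configuration for p,
// transfers it to the GPU and measures SpMV GFLOPS over a few iterations.
static double benchmark_gflops(const TuningPoint &p) {
    (void)p;
    return 0.0;
}

// Two-phase line search: sweep k_max at the best blocking factor, then sweep
// the remaining top-ranked blocking factors at the winning k_max.
TuningPoint line_search(const std::vector<std::pair<int, int>> &top_blockings,
                        const std::vector<int> &kmax_values) {
    TuningPoint best{top_blockings[0].first, top_blockings[0].second, kmax_values[0]};
    double best_gflops = benchmark_gflops(best);

    for (size_t i = 1; i < kmax_values.size(); ++i) {          // phase 1: k_max
        TuningPoint p{best.bm, best.bn, kmax_values[i]};
        double g = benchmark_gflops(p);
        if (g > best_gflops) { best_gflops = g; best = p; }
    }
    for (size_t i = 1; i < top_blockings.size(); ++i) {        // phase 2: blocking
        TuningPoint p{top_blockings[i].first, top_blockings[i].second, best.kmax};
        double g = benchmark_gflops(p);
        if (g > best_gflops) { best_gflops = g; best = p; }
    }
    return best;
}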

3.8 Comparison with the State-of-the-Art

In this section, we provide a performance-focused comparison of auto-tuned AdELL+ with other advanced sparse matrix formats. Specifically, we considered BCCOO [50] (which is arguably the state-of-the-art in SpMV optimization), CSR+ [51] and CSR-Adaptive [52]. Regarding the first two sparse formats, we used the code made available by their respective authors. Regarding CSR-Adaptive, we used the implementation contained in the linear algebra library ViennaCL [93]. Our experimental results on the Tesla K40 for single-precision computation are illustrated in Figure 13 and Figure 14. As we can observe, AdELL+ consistently achieves comparable or better performance than the state-of-the-art. On the regular benchmark suite, AdELL+ has a 1.22x speedup over BCCOO, a 1.52x speedup over CSR+ and a 2.95x speedup over CSR-Adaptive. AdELL+ performs better on every individual matrix except Economics, where its performance is comparable with CSR+. Indeed, these results are directly correlated with the SIMD-oriented layout and the nonzero unrolling that leverage the GPU hardware at its best. As a side note, the ViennaCL implementation of CSR-Adaptive does not perform too well on the NVIDIA Kepler architecture, at least when compared with the results reported by Greathouse and Daga [52] on a comparable AMD GPU.

Figure 13. AdELL+ single-precision performance on the regular benchmarks (GFLOPS for CSR-Adaptive, CSR+, BCCOO and AdELL+ on Circuit, Dense, Economics, Epidemiology, FEM/Accelerator, FEM/Cantilever, FEM/Harbor, FEM/Ship, FEM/Spheres, Ga41As41H72, Protein, QCD, Si41Ge41H72 and Wind Tunnel, with their harmonic mean).

Figure 14. AdELL+ single-precision performance on the irregular benchmarks (GFLOPS for CSR-Adaptive, CSR+, BCCOO and AdELL+ on Circuit5M, Eu, In, LP, Mip1 and Webbase, with their harmonic mean).

On the irregular benchmark suite, AdELL+ still has an edge over the competition. Specifically, we measured a 1.05x speedup over BCCOO, a 1.16x speedup over CSR+ and a 2.28x speedup over CSR-Adaptive. Perhaps the reduced speedup is due to the fact that both BCCOO and CSR+ already do a good job in terms of load balancing. Moreover, there is very little structure in terms of dense subblocks and nonzero patterns to leverage for compression. Last, we can notice that BCCOO clearly obtains the best performance for matrix LP. This particular benchmark has only 4284 rows associated with atomic warps that create a high level of conflict when reducing the result into vector y. Although BCCOO obtained its advantage using a non-portable synchronization-free segmented reduction, we conjecture that the problem can be mitigated by implementing a reduction strategy more similar to the one in CSR+. In Section 3.1.3, we estimated the theoretical single-precision performance peak (in GFLOPS) associated with ELL-based formats. AdELL+ breaks this limit for the regular benchmark

suite, whereas it achieves a fairly high portion of the peak for the irregular matrix suite. The former result (even surpassing the theoretical peak) comes from the use of blocking and delta-based compression. Figure 15 and Figure 16 report the same computational tests in double precision. Here, we could not include BCCOO due to the lack of a double-precision implementation. As we can see, AdELL+ still consistently outperforms the other advanced sparse matrix formats. On the regular benchmark suite, AdELL+ is the best for all the matrices, with an overall 1.31x speedup over CSR+ and a 3.08x speedup over CSR-Adaptive. On the irregular benchmarks, AdELL+ still achieves comparable or better performance, with the unique exception of the matrix Webbase (where AdELL+ is only slightly behind CSR+). Again, we conjecture that a little variation in the atomic segmented reduction mechanism may bridge the performance gap.

Figure 15. AdELL+ double-precision performance on the regular benchmarks (GFLOPS for CSR-Adaptive, CSR+ and AdELL+ on Circuit, Dense, Economics, Epidemiology, FEM/Accelerator, FEM/Cantilever, FEM/Harbor, FEM/Ship, FEM/Spheres, Ga41As41H72, Protein, QCD, Si41Ge41H72 and Wind Tunnel, with their harmonic mean).

Figure 16. AdELL+ double-precision performance on the irregular benchmarks (GFLOPS for CSR-Adaptive, CSR+ and AdELL+ on Circuit5M, Eu, In, LP, Mip1 and Webbase, with their harmonic mean).

For double precision, we estimated the corresponding theoretical performance peak as well. This time AdELL+ does not break the limit for regular matrices but it comes very close, and it still achieves a fairly high portion of the peak for irregular matrices. Last, we are interested in analyzing the memory footprint associated with the best-tuned AdELL+. Table XVIII provides a comparison with CSR, taken as baseline for its well-known compact representation. In general, AdELL+ provides a more compact memory footprint, measured as a 0.86x ratio for the single-precision representation and as a 0.93x ratio for the double-precision representation. This reduction is primarily due to the blocking and delta-based index compression techniques, and secondarily to adaptivity (which arranges nonzeros more efficiently across the warps). On the irregular matrices, the compression has in general less room for improvement. Hence, AdELL+ cannot always be more compact than CSR. This also provides an additional explanation regarding the double-precision performance on matrix Webbase.

TABLE XVIII. Memory footprint in MB for CSR and AdELL+ on the regular and irregular benchmarks, in single and double precision, with their arithmetic means.

As we can see, AdELL+ has a slightly bigger memory footprint than CSR (44.92MB vs 39.35MB), which can be correlated with the slightly inferior SpMV performance.

CHAPTER 4

SOLVING THE NEWTON DIRECTION

In this chapter, we focus on the main computational effort of primal-dual IPM algorithms, the solution of the Newton equation systems. We review both sparse direct and iterative solution methods, analyzing strengths and drawbacks of their GPU-based implementations. Finally, we use the SpMV kernel developed in Chapter 3 to optimize the CG method, comparing the performance with other sparse linear algebra libraries.

4.1 Normal Equation

Calculating the Newton direction is the most compute-intensive step of each IPM iteration. The system (2.20) is usually not solved directly, but it is first simplified algebraically. The term Δs can be expressed as

$\Delta s = X^{-1}(r_g - S\,\Delta x).$  (4.1)

Note that X and S are diagonal, so multiplication and inversion can be implemented as efficient element-wise operations. The original system (2.20) is then reduced to

$\begin{bmatrix} -\Theta^{-1} & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \begin{bmatrix} r_d - X^{-1} r_g \\ r_p \end{bmatrix},$  (4.2)

where Θ = XS^{-1} is the diagonal scaling matrix resulting from the barrier terms. Note that this reduced system is symmetric and does not introduce any off-diagonal fill-in. The system (4.2) can be further reduced to the well-known normal equation. The term Δx can be expressed as

$\Delta x = -\Theta(r_d - X^{-1} r_g - A^T \Delta y),$  (4.3)

obtaining the following

$(A \Theta A^T)\,\Delta y = r_p + A\Theta(r_d - X^{-1} r_g).$  (4.4)

Assuming full row rank for the constraint matrix A, the matrix AΘA^T ∈ R^{m×m} is symmetric positive definite. Thus, the Cholesky factorization exists and CG is readily available as an iterative solution method. For these reasons, the normal equation is usually preferred over the symmetric indefinite system (4.2), where numerical pivoting is necessary for stability. Moreover, AΘA^T may provide a substantial dimensionality reduction, especially when m ≪ n. This advantage, however, may be counterbalanced by the fill-in induced when composing AΘA^T from sparse problems. The sparsity pattern of the normal equation is constant throughout the entire IPM algorithm execution. Indeed, AA^T induces the structure, whereas the change in the diagonal scaling matrix Θ causes different numerical values at each IPM iteration. This provides an opportunity for computational optimizations such as reuse (e.g. for the sparse matrix indexing structure).
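Because X, S and Θ are diagonal, everything around the normal equation except the solve for Δy reduces to element-wise kernels plus one SpMV with A^T. The sketch below illustrates this back-substitution on the CPU, following equations (4.1) and (4.3); the vector layout and the csr_transpose_spmv helper are assumptions of this example rather than part of the actual implementation.

#include <vector>

// Hypothetical helper: y += A^T * x for a CSR-stored constraint matrix A (m x n).
void csr_transpose_spmv(const std::vector<int> &row_ptr, const std::vector<int> &col,
                        const std::vector<double> &val, const std::vector<double> &x,
                        std::vector<double> &y) {
    for (size_t r = 0; r + 1 < row_ptr.size(); ++r)
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            y[col[k]] += val[k] * x[r];
}

// Recover dx and ds once dy has been obtained from the normal equation (4.4).
void recover_directions(const std::vector<int> &row_ptr, const std::vector<int> &col,
                        const std::vector<double> &val,          // CSR of A
                        const std::vector<double> &x, const std::vector<double> &s,
                        const std::vector<double> &r_d, const std::vector<double> &r_g,
                        const std::vector<double> &dy,
                        std::vector<double> &dx, std::vector<double> &ds) {
    const size_t n = x.size();
    std::vector<double> At_dy(n, 0.0);
    csr_transpose_spmv(row_ptr, col, val, dy, At_dy);            // A^T * dy

    for (size_t i = 0; i < n; ++i) {
        double theta = x[i] / s[i];                              // Theta = X S^{-1}, element-wise
        dx[i] = -theta * (r_d[i] - r_g[i] / x[i] - At_dy[i]);    // equation (4.3)
        ds[i] = (r_g[i] - s[i] * dx[i]) / x[i];                  // equation (4.1)
    }
}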

The matrix AΘA^T has, however, an unwelcome feature. When the IPM algorithm approaches the optimal solution, each element x_i/s_i on the Θ diagonal may spread from zero to infinity due to strong complementarity (i.e. either x_i → 0 or s_i → 0). This inevitably leads to an ill-conditioned system, making its solution more challenging. Surprisingly, sparse direct methods do not seem to suffer from this property. On the other hand, the use of iterative methods may become problematic unless a good preconditioning is applied.

4.2 Sparse Cholesky Factorization

The Cholesky factorization of a symmetric positive definite matrix M ∈ R^{m×m} is a lower triangular matrix L such that

$M = LL^T.$  (4.5)

The Cholesky factorization is in the class of direct methods for solving linear systems. Once the factor L is obtained, we can solve Mx = b through forward and backward substitution

$Ly = b$  (4.6)
$L^T x = y,$  (4.7)

corresponding to the solution of two triangular systems. The Cholesky factorization has O(n^3) time complexity, although the symmetric structure reduces the number of operations compared to the general case. The major advantage of this approach is the ability to solve the same system for additional right-hand sides in O(n^2) time with forward and backward substitutions.
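As a concrete illustration of equations (4.6)-(4.7), the following sketch solves Mx = b given a dense lower-triangular factor L stored in row-major order. It is a minimal reference version, assuming a dense factor and ignoring the sparse and blocked variants discussed next.

#include <vector>

// Solve L y = b (forward substitution) followed by L^T x = y (backward
// substitution), with L dense, lower triangular, row-major, of order n.
std::vector<double> cholesky_solve(const std::vector<double> &L, int n,
                                   const std::vector<double> &b) {
    std::vector<double> y(n), x(n);

    // Forward substitution: L y = b.
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int j = 0; j < i; ++j) sum -= L[i * n + j] * y[j];
        y[i] = sum / L[i * n + i];
    }
    // Backward substitution: L^T x = y (traverse L by columns).
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int j = i + 1; j < n; ++j) sum -= L[j * n + i] * x[j];
        x[i] = sum / L[i * n + i];
    }
    return x;
}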

Algorithm 5 In-place Cholesky Factorization
Input: Symmetric positive definite matrix M
Output: Lower triangular Cholesky factorization L
1: for j = 1..m do
2:   l_jj ← √(m_jj)
3:   L_(j+1:m)j ← M_(j+1:m)j / l_jj
4:   M_(j+1:m)(j+1:m) ← M_(j+1:m)(j+1:m) − L_(j+1:m)j L_(j+1:m)j^T
5: end for

Algorithm 5 describes the basic in-place Cholesky factorization. Note that L_(j+1:m)j is the notation used for the bottom portion (i.e. from j+1 to m) of the column vector at position j. Similarly, M_(j+1:m)(j+1:m) is the bottom-right square submatrix, also known as the Schur complement. Each iteration of this algorithm performs a partial factorization on a single column. After this, the computation reduces to the Schur complement (which remains to be factored). The outer loop of this basic Cholesky factorization has a straightforward sequential implementation. However, the element-wise linear algebra operations on submatrices naturally expose fine-grain parallelism. Algorithm 5 can be reorganized in a block form as shown by Algorithm 6, increasing the cache friendliness of the memory accesses and exposing even more fine-grain parallelism.

Algorithm 6 In-place Block Cholesky Factorization
Input: Symmetric positive definite matrix M
Output: Lower triangular Cholesky factorization L
1: for j = 1..m_b do
2:   L_jj ← Cholesky(M_jj)
3:   for i = j+1..m_b do
4:     L_ij ← M_ij L_jj^{-T}
5:     for k = j+1..i do
6:       M_ik ← M_ik − L_ij L_kj^T
7:     end for
8:   end for
9: end for
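For reference, Algorithm 5 translates almost line by line into the following unblocked routine; this is a plain illustrative version operating on a dense row-major matrix, with no pivoting, ordering or sparsity handling.

#include <cmath>
#include <vector>

// In-place unblocked Cholesky factorization (Algorithm 5): on return the lower
// triangle of M (dense, row-major, order n) holds L such that M = L L^T.
// Returns false if a non-positive pivot is encountered.
bool cholesky_in_place(std::vector<double> &M, int n) {
    for (int j = 0; j < n; ++j) {
        double pivot = M[j * n + j];
        if (pivot <= 0.0) return false;          // not (numerically) positive definite
        double ljj = std::sqrt(pivot);
        M[j * n + j] = ljj;
        for (int i = j + 1; i < n; ++i)          // scale the column below the pivot
            M[i * n + j] /= ljj;
        for (int i = j + 1; i < n; ++i)          // rank-1 update of the Schur complement
            for (int k = j + 1; k <= i; ++k)
                M[i * n + k] -= M[i * n + j] * M[k * n + j];
    }
    return true;
}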

This formulation can be summarized as an element-wise Cholesky factorization (Algorithm 5) plus several blocked linear algebra operations (i.e. symmetric/general matrix-matrix multiplications, triangular solvers and matrix updates). Intuitively, the use of optimized dense BLAS routines [94] can guarantee very high dense performance both on the CPU and on the GPU. In addition, it is possible to parallelize the two inner loops while partially overlapping subsequent outer iterations [95]. An alternative implementation, known as the DAG-based approach, uses a dependency graph to schedule block operations, allocating tasks to cores either statically or dynamically [96]. Performing an efficient Cholesky factorization on sparse matrices is a challenging task. First, we may not be able to benefit from dense BLAS routines. Second, the update operations done on the Schur complement may introduce additional nonzeros (the so-called fill-in), making the factor L denser than the original matrix M. The amount of fill-in can be dramatically reduced by using graph theory. Indeed, sparse symmetric matrix structures are equivalent to undirected graphs where matrix columns are nodes and nonzeros are edges. Let us apply a partial Cholesky factorization to one column (as done in each outer iteration of Algorithm 5). This is equivalent to removing a node from the graph and forming the clique of its neighbors (with new edges representing fill-in). Moreover, this may introduce up to (r − 1)^2 nonzeros into the factor L, where r is the degree of the removed node. The potential fill-in can, however, be reduced by ordering. Thus, the sparse Cholesky factorization has an initial stage called analysis that calculates a symmetric permutation PMP^T that minimizes the number of nonzeros of L. The most common ordering heuristics are based on minimum degree [97] or nested dissection [98].

Traditionally, the analysis phase does not only find a good ordering. In fact, it is also important to determine the storage requirements for the factor L and to precompute the elimination tree. The latter is a specialized data structure that describes the operation dependencies during the actual numerical factorization. Thus, the sparsity of the matrix M is critical to performance. On one hand, independent branches of the elimination tree expose parallelism. On the other hand, the ability to aggregate nonzeros into dense submatrices gives an opportunity to improve efficiency with dense BLAS routines. This idea is implemented with supernodes. A supernode is a set of consecutive columns with the same nonzero pattern (i.e. a clique in the adjacency graph) and can be stored as a dense block in the elimination tree. Unfortunately, the number and the size of complete supernodes in sparse matrices are typically small.

Figure 17. Dense [a] and supernodal [b] partial Cholesky factorization (POTRF, TRSM and GEMM/SYRK compose a trapezoidal portion of the factor and perform the partial update of the Schur complement).

However, the introduction of explicit zeros, when necessary, provides an effective strategy to aggregate columns with similar nonzero patterns into supernodes. Figure 17 shows how dense BLAS routines can be used to accelerate the sparse supernodal factorization. On the top, we see the dense case where direct Cholesky factorization (POTRF), triangular system solution (TRSM) and general/symmetric matrix-matrix multiplication (GEMM/SYRK) are used to compose a trapezoidal portion of the factor L and to update the Schur complement. On the bottom, we have instead a supernode from which we can extract some dense windows (i.e. rows). It follows that efficient operations such as TRSM and GEMM can be directly used by first gathering the computation into a dense matrix and then scattering the results into the Schur complement. The supernodal approach organizes the elimination tree as a number of small dense matrix operations and is, in general, well-suited for sparse structures with relatively high density. CHOLMOD is one of the fastest sparse supernodal Cholesky implementations [99] and provides support for GPU acceleration of the dense matrix algebra. The underlying idea is to move the processing of large supernodes to the GPU, whereas the remaining small blocks, which do not generate enough arithmetic intensity, are still handled by the CPU. There are, however, some additional issues related to GPU device utilization, PCI-express bandwidth and kernel latency that may prevent achieving a compelling speedup. Recently, Rennich et al. [46] suggested an approach based on GPU streams and batch processing that overcomes those limitations, providing an average 2.2x speedup (and peaks up to 4.1x) over the CPU-based CHOLMOD on a benchmark of sparse real symmetric positive definite matrices from different application domains [67].
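For completeness, the sketch below shows how a factorization of this kind is typically driven through CHOLMOD's C interface: analyze once, factorize for each new numerical matrix, and solve. The matrix-reading helper, the minimal error handling and the useGPU flag are assumptions of this example; the exact option names should be checked against the installed SuiteSparse release.

#include <cholmod.h>
#include <cstdio>

int main() {
    cholmod_common c;
    cholmod_start(&c);
    c.useGPU = 1;                                  // assumed field: enable the GPU supernodal path

    // Read a sparse SPD matrix in Matrix Market format from stdin (CHOLMOD helper).
    cholmod_sparse *M = cholmod_read_sparse(stdin, &c);
    cholmod_dense  *b = cholmod_ones(M->nrow, 1, M->xtype, &c);   // unit right-hand side

    cholmod_factor *L = cholmod_analyze(M, &c);    // ordering + symbolic analysis (done once)
    cholmod_factorize(M, L, &c);                   // numerical factorization (per IPM step)
    cholmod_dense *x = cholmod_solve(CHOLMOD_A, L, b, &c);        // forward/backward solve

    std::printf("factorization flop count: %g\n", c.fl);

    cholmod_free_dense(&x, &c);
    cholmod_free_dense(&b, &c);
    cholmod_free_factor(&L, &c);
    cholmod_free_sparse(&M, &c);
    cholmod_finish(&c);
    return 0;
}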

4.3 Conjugate Gradient

The conjugate gradient [100] [101] is the most prominent iterative method for solving symmetric positive definite systems of linear equations. In general, it can be applied to matrices where direct approaches such as Gaussian elimination are not feasible. Like other iterative methods, CG only requires the ability to compute the matrix-vector multiplication Ax and, assuming a low iteration count is achieved, it can converge to the solution faster than factorization. Let us take the linear system Mx = b. We define its quadratic form as

$f(x) = \frac{1}{2} x^T M x - b^T x.$  (4.8)

When M is symmetric positive definite, the landscape of f(x) looks like a paraboloid bowl with a global minimum x*. Note that solving the linear system Mx = b is equivalent to minimizing f(x). Any point x can be described by an offset from the solution x*

$x = x^* + e,$  (4.9)

where e is the error term. Given m orthogonal directions (d_0, ..., d_{m−1}), the goal is the design of an iterative procedure that moves along each of those orthogonal directions d_k and cancels out the corresponding error component e_k. At each step of the algorithm, we perform a line search on d_k

$x_{k+1} = x_k + \alpha_k d_k$  (4.10)

such that the step α_k cancels the component e_k. In geometric terms, this is equivalent to making the current search direction d_k orthogonal to the (left-over) error e_{k+1} at the next step. Thus, imposing d_k^T e_{k+1} = 0, we can derive the step length as

$\alpha_k = -\frac{d_k^T e_k}{d_k^T d_k}.$  (4.11)

Unfortunately, this approach is not meaningful because it directly requires the error vector e_k (in other words, the a priori knowledge of the solution x*). Let us introduce a definition. Two vectors x and y are conjugate (with respect to M) if

$x^T (M y) = 0.$  (4.12)

An intuitive geometrical representation of conjugate vectors is presented in Figure 18. On one hand, the original space has elliptical contours defined by the quadratic form f(x). On the other hand, the linear transformation M stretches the scaled space, transforming the ellipses into circles. Note that conjugate vectors in the original space are orthogonal in the scaled space. This observation is the key to designing an iterative method that moves along orthogonal directions in the scaled space using n conjugate directions (d_0, ..., d_{n−1}) in the original space. At each iteration, we perform a line search to select a step α_k that cancels out the component e_k in the scaled space (making e_{k+1} orthogonal to d_k). Conveniently, this is equivalent to being conjugate in the original space.

Figure 18. Original [a] and scaled [b] spaces.

Thus, imposing d_k^T (M e_{k+1}) = 0, we can derive the step length as

$\alpha_k = -\frac{d_k^T (M e_k)}{d_k^T (M d_k)} = \frac{d_k^T r_k}{d_k^T (M d_k)},$  (4.13)

where r_k = b − Mx_k is the residual at step k (a vector that we can actually calculate). The next step towards the solution is the construction of the n conjugate directions (d_0, ..., d_{n−1}) needed by the approach. A simple way to generate those directions involves the use of the well-known Gram-Schmidt process [102]. However, this has O(n^3) time complexity (equivalent to solving the system directly with Gaussian elimination) and O(n^2) space complexity (to store all the direction vectors), so the approach is not feasible in practice. Instead, the idea is to

iteratively generate the conjugate search directions one at a time, without storing the entire set of vectors (d_0, ..., d_{n−1}). This can be done using the so-called Krylov subspace

$D_k = \mathrm{span}\{r_0, M r_0, M^2 r_0, \ldots, M^{k-1} r_0\}.$  (4.14)

Given D_k, we can efficiently derive k conjugate directions such that d_0, ..., d_{k−1} ∈ D_k. The Krylov subspace is very popular in numerical algorithms due to its iterative construction based on repeated matrix-vector multiplications. Let us take the initial residual r_0 as the initial d_0. Let us also introduce the following alternative definition

$r_{k+1} = -M e_{k+1} = -M(e_k + \alpha_k d_k) = r_k - \alpha_k M d_k.$  (4.15)

Note that now the residual r_{k+1} is a linear combination of the current residual r_k and Md_k. We observe that both r_k and d_k belong to D_{k+1}. Therefore, this iterative construction expands the Krylov subspace from D_{k+1} to D_{k+2} due to the new spanning vector M^{k+1} r_0 introduced by the term α_k Md_k. There is a useful property associated with how we select α_k. We first observe that the condition d_k^T (M e_{k+1}) = 0 used in (4.13) is equivalent to d_k^T r_{k+1} = 0. In other words, the current step automatically gives a residual r_{k+1} that is orthogonal to the current search direction d_k. Moreover, r_{k+1} is orthogonal to the entire D_{k+1} since, by construction, the residual components along the conjugate directions defining D_{k+1} have been canceled out. Last, we observe that MD_k ⊆ D_{k+1}, so r_{k+1} is also conjugate to D_k. This key property provides an effective strategy to iteratively build conjugate directions without using

the full Gram-Schmidt process. Since r_{k+1} is already conjugate to all the previous d_0, ..., d_{k−1}, we can find the search direction d_{k+1} by simply canceling out the conjugate projection on d_k from r_{k+1}. This corresponds to performing

$d_{k+1} = r_{k+1} + \beta_{k+1} d_k$  (4.16)

where

$\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.$  (4.17)

Assuming no floating-point roundoff error, CG takes n iterations to perfectly cancel out the error e. Note that this is computationally equivalent to a direct approach. However, the goal of iterative methods is to solve the system of linear equations up to a certain accuracy, where the magnitude of the residual norm ‖r_k‖ is the metric used due to its proportionality to e_k. In general, this leads to a reduced number of iterations and, thus, a computational advantage over direct methods. Iterative methods are indeed critical for large sparse systems where it is not feasible to run n iterations or where factorization causes an explosive fill-in. The convergence behavior of CG depends on the eigenvalues of the matrix M. Given infinite floating-point precision, the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues. In practice, we have a quick convergence when the

condition number κ(M) = λ_max/λ_min is small and the eigenvalues are clustered together. The convergence of CG is expressed by the following inequality

$\|e_k\|_M \le 2 \left( \frac{\sqrt{\kappa(M)} - 1}{\sqrt{\kappa(M)} + 1} \right)^k \|e_0\|_M$

and it is faster for well-conditioned matrices (κ(M) ≈ 1). The time complexity of the CG iteration is dominated by the matrix-vector product. Given an accuracy ɛ, we require at most

$\frac{1}{2} \sqrt{\kappa(M)} \ln\left(\frac{2}{\epsilon}\right)$

iterations to solve the system with such accuracy. The CG method is summarized by Algorithm 7. Note that d_k^T r_k = r_k^T r_k due to a property of the Gram-Schmidt conjugation (see [101] for details).

Algorithm 7 Conjugate Gradient
Input: M ∈ R^{n×n}, b ∈ R^n, x_0 ∈ R^n, ɛ > 0
Output: x such that ‖r‖ < ɛ
1: r_0 ← b − Mx_0
2: d_0 ← r_0
3: for k = 0, 1, ... and ‖r_k‖ ≥ ɛ do
4:   α_k ← (r_k^T r_k) / (d_k^T (M d_k))
5:   x_{k+1} ← x_k + α_k d_k
6:   r_{k+1} ← r_k − α_k M d_k
7:   β_{k+1} ← (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
8:   d_{k+1} ← r_{k+1} + β_{k+1} d_k
9: end for
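Algorithm 7 maps directly onto a handful of vector kernels plus one matrix-vector product per iteration. The following self-contained C++ sketch uses a dense matrix-vector product for clarity; in the GPU setting the same structure holds with the SpMV, dot products and AXPY operations replaced by device kernels.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Dense stand-in for the SpMV kernel: y = M x, with M row-major of order n.
static void matvec(const Vec &M, const Vec &x, Vec &y) {
    size_t n = x.size();
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += M[i * n + j] * x[j];
        y[i] = s;
    }
}

// Conjugate gradient (Algorithm 7): solves M x = b for SPD M, starting from x.
int conjugate_gradient(const Vec &M, const Vec &b, Vec &x, double eps, int max_iter) {
    size_t n = b.size();
    Vec r(n), d(n), Md(n);
    matvec(M, x, Md);
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - Md[i];   // r_0 = b - M x_0
    d = r;
    double rr = dot(r, r);
    int k = 0;
    while (std::sqrt(rr) >= eps && k < max_iter) {
        matvec(M, d, Md);
        double alpha = rr / dot(d, Md);                    // step length
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * d[i]; r[i] -= alpha * Md[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;                         // Gram-Schmidt coefficient
        for (size_t i = 0; i < n; ++i) d[i] = r[i] + beta * d[i];
        rr = rr_new;
        ++k;
    }
    return k;   // number of iterations performed
}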

4.4 Regularization and Preconditioning

The convergence of CG strongly depends on the condition number of the linear system to solve. Unfortunately, in the context of IPMs, the normal equation AΘA^T becomes more and more numerically intractable as we approach the optimum. As a consequence, the basic CG described in Section 4.3 may converge very slowly (or even not at all), with severe implications in terms of performance. The only way to mitigate this issue is to address the ill-conditioning with appropriate numerical techniques. One common approach, called regularization [103], aims to introduce bounds on the condition number κ(AΘA^T). Indeed, it is possible to show that the original normal equation has

$\kappa(A\Theta A^T) \le \kappa(A)\,O(\mu^{-2}).$  (4.18)

Note that as the barrier parameter µ goes to zero the bound on κ(AΘA^T) goes to infinity, indicating that the original normal equation gets very ill-conditioned. The regularization technique modifies the primal LP formulation by adding a regularization term as follows

minimize  $c^T x + \frac{1}{2}(x - x_0)^T R_p (x - x_0)$
such that  $Ax = b,\; x \ge 0,$  (4.19)

where R_p ∈ R^{n×n} is diagonal and x_0 ∈ R^n is the primal reference point. Similarly, the dual problem is regularized as

maximize  $b^T y - \frac{1}{2}(y - y_0)^T R_d (y - y_0)$
such that  $A^T y + s = c,\; s \ge 0,$  (4.20)

where R_d ∈ R^{m×m} is diagonal and y_0 ∈ R^m is the dual reference point. These two formulations lead to the following regularized normal equation

$\left[ A(\Theta^{-1} + R_p)^{-1} A^T + R_d \right] \Delta y = h,$  (4.21)

where $h = r_p - R_d(y - y_0) + A(\Theta^{-1} + R_p)^{-1}\big(r_d - R_p(x - x_0) - X^{-1} r_g\big)$. Note that now it is possible to modify the conditioning of the given linear system through the regularization matrices R_p and R_d. The reference points x_0 and y_0 can be changed dynamically and set to the current primal and dual solutions, simplifying the right-hand side h and minimizing the effect of the approximation. Let us assume to have upper and lower bounds on the primal diagonal, γ^2 ≤ r_(p)ii ≤ Γ^2, as well as on the dual diagonal, δ^2 ≤ r_(d)ii ≤ Δ^2. It can be shown that the condition number of the regularized normal equation satisfies

$\kappa\big(A(\Theta^{-1} + R_p)^{-1} A^T + R_d\big) \le \frac{\sigma_m^2 + \gamma^2 \Delta^2}{\gamma^2 \delta^2},$  (4.22)

where σ_m is the largest singular value of the constraint matrix A. Here, the bound does not depend on the barrier parameter µ but only on the level of regularization and σ_m. Assuming a moderate σ_m, the regularization terms γ^2 and δ^2 can be kept large enough to guarantee a good bound. On the other hand, large γ^2 and δ^2 mean a worse approximation of the original normal equation. This may lead to an erroneous Newton direction and, eventually, hamper the convergence of the primal-dual IPM. Therefore, γ^2 and δ^2 should be chosen as a compromise between a reasonable bound and a good approximation of the normal equation.

The regularization technique imposes bounds on the condition number of the normal equation, but it is not sufficient to provide a satisfactory convergence for CG. The idea is then to build and solve an equivalent linear system with a much better condition number. This technique is known as preconditioning. Let us assume that P_c ∈ R^{m×m} is a symmetric positive definite matrix that is cheap to calculate and invert. Let us also assume that P_c approximates the linear system M. We can indirectly solve the original problem by solving

$P_c^{-1} M x = P_c^{-1} b.$  (4.23)

In general, κ(P_c^{-1} M) ≪ κ(M) since P_c^{-1} M ≈ I. Hence, CG can solve the preconditioned linear system in fewer iterations. The Preconditioned Conjugate Gradient (PCG) is presented by Algorithm 8. Note that now the PCG iteration is heavier than the CG iteration due to the time spent inverting and multiplying by the preconditioning matrix P_c.

Algorithm 8 Preconditioned Conjugate Gradient
Input: M ∈ R^{n×n}, P_c ∈ R^{n×n}, b ∈ R^n, x_0 ∈ R^n, ɛ > 0
Output: x such that ‖r‖ < ɛ
1: r_0 ← b − Mx_0
2: z_0 ← P_c^{-1} r_0
3: d_0 ← z_0
4: for k = 0, 1, ... and ‖r_k‖ ≥ ɛ do
5:   α_k ← (r_k^T z_k) / (d_k^T (M d_k))
6:   x_{k+1} ← x_k + α_k d_k
7:   r_{k+1} ← r_k − α_k M d_k
8:   z_{k+1} ← P_c^{-1} r_{k+1}
9:   β_{k+1} ← (z_{k+1}^T r_{k+1}) / (z_k^T r_k)
10:  d_{k+1} ← z_{k+1} + β_{k+1} d_k
11: end for
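The cheapest choice for P_c is the diagonal (Jacobi) preconditioner, whose application in line 8 of Algorithm 8 reduces to an element-wise division and is therefore a natural fit for the GPU. The kernel below is a minimal illustration of that single step; the launch configuration and naming are examples, not taken from the actual code.

#include <cuda_runtime.h>

// Apply a Jacobi preconditioner: z_i = r_i / diag_i (element-wise), as in
// line 8 of Algorithm 8 when P_c is the diagonal of M.
__global__ void jacobi_apply(const double *diag, const double *r, double *z, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) z[i] = r[i] / diag[i];
}

// Host-side launch helper (one thread per entry).
void apply_preconditioner(const double *d_diag, const double *d_r, double *d_z, int m) {
    int block = 256;
    int grid = (m + block - 1) / block;
    jacobi_apply<<<grid, block>>>(d_diag, d_r, d_z, m);
}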

In practice, those operations are fairly cheap since they are implicitly implemented as an element-wise division (diagonal P_c) or as a triangular solve (lower/upper triangular P_c). In the literature, there are different techniques to construct P_c^{-1} such as Jacobi, incomplete Cholesky or approximate inverse [102]. The additional overhead to build and apply preconditioning is usually low enough to be easily amortized by the reduced number of iterations, leading to a faster solution than the original CG. In the context of IPMs, Oliveira and Sorensen [104] designed a special preconditioning technique that works better near the solution, where the normal equation AΘA^T is highly ill-conditioned. GPU libraries such as cuBLAS [105] provide full support for a high-performance implementation of the PCG. Triangular solvers are available for both dense and sparse CSR matrices. Moreover, cuSPARSE provides routines to build the incomplete Cholesky preconditioner, separating the analysis and the factorization phases. Similarly, the approximate inverse preconditioner has been successfully implemented on the GPU [106].

4.5 Accelerating CG with AdELL+

This last section of Chapter 4 is dedicated to computational experiments, providing an empirical analysis of the effectiveness of AdELL+ in the context of solving the Newton direction. Here we focus on the performance of the PCG algorithm using CSR-based sparse linear algebra kernels as baseline implementations (MKL for multicore and cuSPARSE for GPU). The computing platform is still composed of a dual-socket system with two 8-core Xeon E5-2650 processors and a Tesla K40 GPU. We selected a set of symmetric positive definite square matrices from the well-known University of Florida Sparse Matrix Collection [67] and solved them with unit right-hand side vector b = {1, ..., 1}^T and initial guess x_0 = {0, ..., 0}^T. The benchmark set is

described in Table XIX. The system sizes range from small to large in order to test different scenarios. Moreover, the sparse structures of those benchmarks are mostly regular, except for two matrices arising from thermal problems.

TABLE XIX. Symmetric positive definite systems (dimension m, number of nonzeros, average nonzeros per row µ and application domain for Trefethen, Fv, Bodyy, Muu, S1rmt3m, Crystm, Thermomech TK, Dubcova, Pdb1HYS, Af shell and Inline, spanning combinatorial, 2D/3D, structural, materials, thermal, weighted undirected graph and CFD problems).

We measured the performance over 100 iterations of the CG algorithm using different sparse linear algebra kernels. The normalized runtime results are shown by Table XX. As we can see (highlighted and in bold), the best performance is always the one based on AdELL+. The only exception is the small benchmark Trefethen 700. On one hand, MKL can take full advantage of the lower CPU cache levels. On the other hand, there is not enough computation to fully utilize the GPU and to hide the kernel launching overhead. On all the other matrices, AdELL+ achieves an average 3.54x speedup over MKL. Note that, instead, the GPU baseline cuSPARSE achieves an average 2.45x speedup, closer to the 2.8x gap between CPU and GPU memory

bandwidths. The additional performance improvement is obviously due to the optimization techniques embedded into AdELL+.

TABLE XX. Normalized CG performance (MKL, cuSPARSE and AdELL+ on Trefethen, Fv, Bodyy, Muu, S1rmt3m, Crystm, Thermomech TK, Dubcova, Pdb1HYS, Af shell and Inline).

We repeated similar computational experiments for PCG, observing different performance behaviors depending on the preconditioning. Figure 19 provides a detailed profiling of the average runtime distribution for the PCG iteration based on MKL and AdELL+. In general, any GPU implementation benefits from the available memory bandwidth for linear algebra operations such as SpMV (in blue), element-wise vector addition (Axpy, in green) and dot product (Dot, in purple). On the other hand, preconditioning (Prec, in red) may be the bottleneck operation at each PCG iteration. Let us consider CG with no preconditioning (on top). The SpMV kernel is the most expensive operation for the MKL implementation. On the other hand, the highly-optimized AdELL+ kernel now takes a smaller portion of the total runtime, changing the overall distribution in agreement with Amdahl's law.

Figure 19. CG profiling (runtime breakdown into SpMV, preconditioner, Axpy and Dot for the MKL and AdELL+ implementations with no preconditioner, diagonal, approximate inverse and incomplete Cholesky preconditioning).

As we can see, SpMV, Axpy and Dot have approximately the same cost. Preconditioning adds another operation to each CG iteration, and this may be more or less suited to GPU acceleration. The inversion of a diagonal matrix can be implicitly implemented as an element-wise division. Therefore, diagonal preconditioning is well suited for GPU implementation. On the other hand, the inversion of the Incomplete Cholesky (IC) factor involves the solution of (sparse) triangular systems, an operation with a limited degree of parallelism. As a consequence, preconditioning becomes the main performance bottleneck, as clearly shown in Figure 19. Considering the GPU implementation based on AdELL+, the achieved speedup entirely depends on the sparse triangular solver (we used the one available from the cuSPARSE library), whereas the benefits of AdELL+ are limited to a small portion of the running time. Note, however, that this does not prevent the GPU-based implementation from being faster than the MKL-based version. This is particularly true in the context of IPM, where the cuSPARSE triangular solver may be used to calculate the Newton direction from the sparse Cholesky factor. Last, we observe that the approximate inverse is a preconditioner more suited to the GPU due to its exclusive dependence on an additional SpMV at each CG iteration. In other words, we can fully exploit all the SpMV optimization techniques proposed in this research to accelerate the PCG iteration.

CHAPTER 5

GPU-BASED TECHNIQUES FOR IPM

In this chapter, we discuss further computational aspects associated with the implementation of primal-dual IPM algorithms on GPUs. First, we focus on the efficient generation of the system AΘA^T, proposing a specialized SpMM technique based on SpMV. In addition, we discuss the matrix-free approach proposed by Gondzio [41] for cases in which the normal equation cannot be represented explicitly. Then, we explore the use of adaptive IPMs [45] as an effective strategy to take advantage of both the best direct and iterative GPU-based methods. Last, we propose some strategies for hybrid CPU-GPU computation, comparing our primal-dual IPM implementation with the state-of-the-art commercial optimization software CPLEX.

5.1 Building the Normal Equation

The time complexity of IPM algorithms is dominated by the solution of the normal equation. However, in practice, forming AΘA^T at each iteration consumes a significant portion of the total computation effort. This operation can be initially described in terms of generic SpMMs. Given two sparse matrices M_a ∈ R^{m×k} and M_b ∈ R^{k×n}, the SpMM kernel computes M = M_a M_b where M ∈ R^{m×n}. Although related to both SpMV and dense matrix-matrix multiplication (i.e. GEMM), SpMM is highly unstructured and gives rise to complex and unpredictable data access patterns. In addition, techniques to improve performance through sparsity pattern analysis are less effective due to the inability to amortize the overhead.

Over the years, GPU-based SpMM kernels operating on the CSR format have been proposed and implemented as part of cuSPARSE [107,108]. However, we observe that AΘA^T has some special features that can be exploited for further optimization. First, the scaling matrix Θ is diagonal, so the intermediate SpMM can be cheaply implemented as an element-wise multiplication. Second, the result matrix M has a constant AA^T structure. Thus, once this is calculated, it can be reused throughout all the IPM iterations without further preprocessing. Third, we can exploit symmetry and reduce the amount of computation to the lower/upper triangular part of AΘA^T. The actual SpMM kernel is always preceded by the analysis of the structural nonzeros in the result matrix M. This is mandatory to identify the number of nonzeros nnz_m and to compose the AA^T structure according to the sparse format of choice (e.g. CSR). In our case, the computation associated with each output element m_ij is

$m_{ij} = \sum_{k=1}^{n} a_{ik} a_{jk} \Theta_k.$  (5.1)

Note that, due to sparsity, only a few terms of this summation will contribute to the result element m_ij. This formulation immediately provides fine-grain parallelism (i.e. each m_ij can be calculated independently). Let us assume to have a mechanism (e.g. an index) to update each element m_ij within the sparse structure storing the matrix M. The computation can be organized such that each GPU thread accesses row i and row j of the constraint matrix A to

process its m_ij. Unfortunately, the memory access pattern associated with this computation is often highly irregular and, thus, not well-suited for GPU computing. The key idea to optimize the generation of the normal equation AΘA^T is to connect the expression (5.1) to the SpMV kernel. For the sake of simplicity, let us assume that M is stored as CSR. Thus, each element m_ij will correspond to an ordered location of the array Value[] of size nnz_m. Let us now define a generating matrix G ∈ R^{nnz_m×n} where each row is associated with an element m_ij. Similarly, each row embeds an appropriate pattern of nonzero values that replicates the same operations performed in (5.1). More specifically, each row performs the following summation

$m_{ij} = \sum_{k=1}^{n} a_{ik} a_{jk} \Theta_k = \sum_{k=1}^{n} g_{(ij)k} \Theta_k,$  (5.2)

which is equivalent to calculating the element m_ij. We can observe that now the sparse matrix-vector product between the generating matrix G and the vector of scaling factors Θ can automatically and efficiently create the normal equation. Figure 20 shows this process in more detail. On the top, the matrix-matrix operations produce a sparse symmetric positive definite matrix M = AΘA^T. On the bottom, the lower triangular part of M is stored as CSR. Moreover, we see a generating matrix G that embeds the necessary operations to generate a different normal equation depending on Θ. For example, m_00 = a_00 a_00 Θ_0 + a_02 a_02 Θ_2 is calculated from the first row in G as the contribution of the nonzeros in the range [0,1] (i.e. from CSR[0] to CSR[1] − 1). Note that the coefficients a_00 a_00 and a_02 a_02 are precomputed and stored as g_(00)0 and g_(00)2.

Figure 20. SpMM [a] and its implementation as SpMV [b]: the lower triangle of the normal equation M = AΘA^T is stored as CSR, and each row of the generating matrix G holds the precomputed coefficients that, multiplied by Θ, produce the corresponding entry of M.

One of the strengths of the proposed SpMM technique is the ability to use all the SpMV optimizations developed in Chapter 3 by simply representing the generating matrix G as AdELL+. Similarly, the matrix M can be stored in any advanced format as long as the structural zeros (i.e. padding) are treated like empty rows in G. The memory requirement for storing the generating matrix G depends on the amount of intermediate calculations. Despite a worst-case O(n^3) space complexity, our SpMM technique is as expensive as other data structures proposed in the past (e.g. the outer-product list for Karmarkar's algorithm [109]) and is thus still practical in a reasonable context, although special precautions may be necessary (e.g. handling dense columns separately).
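The construction of G is a one-time preprocessing step that only depends on the sparsity pattern of A. The sketch below builds, for every entry of the lower triangle of AA^T, the corresponding row of G in COO form on the host; it is a simplified illustration (dense row intersection, no blocking, no AdELL+ packing) of the idea rather than the optimized implementation.

#include <vector>

struct Csr { int rows, cols; std::vector<int> ptr, idx; std::vector<double> val; };
struct Coo { std::vector<int> row, col; std::vector<double> val; int rows, cols; };

// Build the generating matrix G (one row per structural nonzero of the lower
// triangle of A*A^T). Row r of G holds a_ik * a_jk at column k, so that
// (G * theta)[r] = sum_k a_ik * a_jk * theta_k = m_ij, as in equation (5.2).
Coo build_generating_matrix(const Csr &A) {
    Coo G; G.cols = A.cols;
    std::vector<double> dense_i(A.cols, 0.0);
    int out_row = 0;
    for (int i = 0; i < A.rows; ++i) {
        // Scatter row i of A into a dense work array for fast intersection.
        for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) dense_i[A.idx[k]] = A.val[k];
        for (int j = 0; j <= i; ++j) {
            bool touched = false;
            for (int k = A.ptr[j]; k < A.ptr[j + 1]; ++k) {
                double aik = dense_i[A.idx[k]];
                if (aik != 0.0) {                      // column shared by rows i and j
                    G.row.push_back(out_row);
                    G.col.push_back(A.idx[k]);
                    G.val.push_back(aik * A.val[k]);   // precomputed a_ik * a_jk
                    touched = true;
                }
            }
            if (touched) ++out_row;                    // m_ij is a structural nonzero
        }
        for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) dense_i[A.idx[k]] = 0.0;
    }
    G.rows = out_row;
    return G;
}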

5.2 Matrix Free Approach

There exist optimization problems where an explicit representation of the normal equation is not possible due to memory constraints (i.e. AΘA^T is too large to be stored in memory). Gondzio [41] recently proposed a matrix-free IPM that circumvents this limitation and that works as long as we can perform SpMV on the constraint matrix and its transpose. Gondzio's approach calculates the Newton directions using CG, implicitly performing the matrix-vector product on AΘA^T as a composition of SpMV kernels without the need to explicitly store M = AΘA^T (as opposed to the Cholesky factorization). In order to improve the numerical tractability, the normal equation system is appropriately regularized and preconditioned. This is done implicitly as well. On one hand, regularization is equivalent to the use of the matrix M = A(Θ^{-1} + R_p)^{-1} A^T + R_d. Any time a matrix-vector product on M is required, we can still break down the implementation into a concatenation of linear algebra operations.
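As a small illustration of this composition, the routine below applies the regularized operator v ↦ A(Θ^{-1} + R_p)^{-1}A^T v + R_d v using one SpMV with A^T, one SpMV with A and element-wise scalings; the CSR helpers follow the conventions of the earlier sketches and are illustrative rather than the actual GPU kernels.

#include <algorithm>
#include <vector>

using Vec = std::vector<double>;
struct CsrA { int m, n; std::vector<int> ptr, idx; std::vector<double> val; };

static void spmv(const CsrA &A, const Vec &x, Vec &y) {          // y = A x
    for (int i = 0; i < A.m; ++i) {
        double s = 0.0;
        for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) s += A.val[k] * x[A.idx[k]];
        y[i] = s;
    }
}
static void spmv_t(const CsrA &A, const Vec &x, Vec &y) {        // y = A^T x
    std::fill(y.begin(), y.end(), 0.0);
    for (int i = 0; i < A.m; ++i)
        for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) y[A.idx[k]] += A.val[k] * x[i];
}

// Matrix-free application of M v = A (Theta^{-1} + R_p)^{-1} A^T v + R_d v.
// theta, rp (length n) and rd (length m) hold the diagonals of Theta, R_p, R_d.
void apply_regularized_normal(const CsrA &A, const Vec &theta, const Vec &rp,
                              const Vec &rd, const Vec &v, Vec &out) {
    Vec t(A.n), u(A.m);
    spmv_t(A, v, t);                                     // t = A^T v
    for (int j = 0; j < A.n; ++j)
        t[j] /= (1.0 / theta[j] + rp[j]);                // t = (Theta^{-1} + R_p)^{-1} t
    spmv(A, t, u);                                       // u = A t
    for (int i = 0; i < A.m; ++i)
        out[i] = u[i] + rd[i] * v[i];                    // add R_d v
}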

On the other hand, Gondzio designed a preconditioner based on partial Cholesky factorization with pivoting, formulated as follows

$P_c = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{bmatrix}.$  (5.3)

Note that P_c is composed using a partial LL^T Cholesky factorization. Thus, L_11 ∈ R^{k×k} is lower triangular, L_21 ∈ R^{(m−k)×k} is rectangular and S ∈ R^{(m−k)×(m−k)} is the diagonal of the Schur complement. The key idea of this preconditioning technique is to explicitly calculate only k columns of the Cholesky factor, selecting at each step the largest diagonal pivot in order to condition the largest eigenvalues. Indeed, the dimension k provides a knob to transition from a diagonal preconditioner (k = 0) to an exact Cholesky factorization with complete pivoting (k = m). The computation of the matrix-free preconditioner has been designed in order to be as cheap as possible. However, there are cases where the combined dimension m × k is substantial. Thus, implementing this operation directly on the GPU is beneficial to avoid an additional CPU-GPU data transfer. The construction of the matrix-free preconditioner starts from explicitly calculating the diagonal of the matrix AΘA^T (for the sake of notation simplicity, we do not consider the regularization). The most efficient way to perform this operation is to apply the GPU-based SpMM technique just presented in Section 5.1, with the restriction of generating only the diagonal elements. Once the diagonal is obtained, we can use a GPU-based sorting routine such as those available in Thrust [110] to perform pivoting, storing the permutation matrix P as an index vector. After selecting the largest k pivots, we can compose the following trapezoidal matrix

$M_t = \begin{bmatrix} M_{11} \\ M_{21} \end{bmatrix},$  (5.4)

where M_t ∈ R^{m×k} includes the k pivoted columns and M_11 ∈ R^{k×k} is lower triangular. Note also that M_t is stored as dense despite being sparse. This allows us to use the efficient cuBLAS

routines (i.e. POTRF and TRSM) to calculate the partial Cholesky factorization. Let us refer to the steps described in Figure 17. On one hand, the factor L_11 can be derived by direct factorization of the dense M_11. On the other hand, the remaining factor L_21 is the solution of the following triangular system

$L_{11} L_{21}^T = M_{21}^T.$  (5.5)

After this operation, the matrix-free preconditioner is almost complete. The only step remaining is the update of the diagonal Schur complement S. This involves (m − k) independent quadratic sums over the rows of the matrix L_21 according to the formula

$s_{ii} = s_{ii} - l_{(21)i}^T l_{(21)i},$  (5.6)

where l_(21)i denotes the i-th row of the matrix L_21. Assuming data stored in column-major order, this operation can be implemented very efficiently on the GPU due to coalescing. The matrix-free preconditioner is implicitly applied within the PCG algorithm as the solution of two triangular linear systems plus an element-wise division. In addition, a permutation kernel shuffles the data before and after those operations according to the pivoting permutation P. As the preconditioner P_c is applied in every iteration of the PCG algorithm, it is important for performance to have a very efficient execution of these algebra routines. In particular, we should focus on the triangular system solution. So far, we have treated the trapezoidal matrix composed of L_11 and L_21 as dense despite its actual structure being sparse. In general, the ability to reduce

the amount of computation wasted on zero entries is beneficial to the performance. In addition, the cuSPARSE library provides a very efficient sparse triangular solver that extracts additional parallelism from dependency analysis. In the context of PCG, this latter preprocessing step can be done once and amortized throughout the iterations. Following one of the future improvements suggested by Gade-Nielsen [43], we convert the preconditioner P_c from dense to CSR in order to reduce the workload and to benefit from more opportunities for parallelism. Such conversion can be performed directly on the GPU by first counting and then copying the nonzeros into an appropriate CSR structure.

5.3 Adaptive IPMs

The ability to choose between direct and iterative methods is critical to provide an efficient IPM implementation. Unfortunately, there is no trivial way to select a priori which approach works the best. On one hand, sparse Cholesky factorization is used by most IPM codes due to its state-of-the-art performance and numerical stability. On the other hand, PCG may be the better option when a good preconditioner is available cheaply and convergence can be reached in a few iterations. The theory of IPMs supports CG by allowing inexact Newton direction calculations. In general, the accuracy requirement can be quite low at the beginning, such that only a few PCG iterations are needed. In order to guarantee global convergence for the primal-dual infeasible IPM algorithm, Baryamureeba and Steihaug [111] stated the following bound

$\|r^{[k]}\| \le \eta^{[k]} (x^{[k]})^T s^{[k]},$  (5.7)

where 0 ≤ η^[k] < 1 and r^[k] is the residual of the Newton direction at each iteration k. Note that this bound becomes stricter as the primal and dual variables approach the optimal solution. In other words, the convergence tolerance must be tightened with the increase of the ill-conditioning of the normal equation AΘA^T. The approach proposed by Wang and O'Leary [45] suggests some heuristic guidelines to adaptively switch between direct and iterative methods depending on the runtime performance. The main strategy follows the ill-conditioning of the system AΘA^T. Initially, the Newton direction is solved by Cholesky factorization. Then, there is an intermediate phase where performance is monitored and the best solution method is selected accordingly. When iterates are close to the optimal solution and accuracy requirements are too high, the adaptive approach enters a final phase where only the Cholesky factorization is used. The performance of the PCG algorithm is heavily influenced by preconditioning. The aspects to consider are the cost of building a preconditioner, the cost to apply it and the associated saving in PCG iterations. A good starting point for preconditioning is given by the initial Cholesky factor L. However, the adaptive approach takes into account the variation of the scaling matrix Θ. Let us assume to have the initial factorization AΘA^T = LL^T. After the scaling matrix changes to Θ̄ = Θ + ΔΘ, the new factorization L̄L̄^T can be composed as

$A\bar{\Theta}A^T = A\Theta A^T + A(\Delta\Theta)A^T = LL^T + \sum_{i=1}^{n} \delta_{ii}\, a_i a_i^T = \bar{L}\bar{L}^T,$  (5.8)

where a_i is the i-th column of the constraint matrix A and δ_ii are the diagonal entries of ΔΘ. Note that LL^T and L̄L̄^T have the same sparsity structure. Instead of calculating a new Cholesky factorization at every interior point step, the adaptive approach reuses the preconditioner that was computed for one previous barrier parameter µ, applying α small-rank updates (i.e. a few terms of the summation Σ_{i=1}^{n} δ_ii a_i a_i^T) in order to approximate the exact L̄L̄^T. Assuming that α is large enough to include most of the large-magnitude terms, this update strategy provides a good preconditioner that is also easy to compute. During the intermediate phase, the adaptive approach focuses on the decision to refactor the preconditioner (i.e. Cholesky factorization) or to improve it (i.e. small-rank updates). This decision is based upon a prediction. On one hand, we have a direct cost t_direct that is constant. On the other hand, we have an iterative cost t_pcg^last for the last PCG run based on the number of updates α, the cost of each update t_update, the number of PCG iterations β and their individual cost t_iter. More specifically, the cost of the last PCG run is

$t_{pcg}^{last} = \alpha\, t_{update} + \beta\, t_{iter}.$  (5.9)

If the last run has been too expensive (e.g. t_pcg^last > 0.8 t_direct, as suggested in [111]), the adaptive approach solves the normal equation using the Cholesky factorization, obtaining a new (and better) preconditioner to use in the subsequent IPM iterations. Otherwise, it constructs a

prediction γ of the iterations needed for the current PCG run. This prediction is based on linear interpolation of the last two runs. The predicted cost is then

$t_{pcg}^{current} = \alpha\, t_{update} + \gamma\, t_{iter}.$  (5.10)

If this prediction is smaller than the direct cost, then the preconditioner is improved with α small-rank updates and the PCG algorithm is used for the current IPM iteration. In addition, the adaptive approach modifies the parameter α in order to provide a good preconditioning and keep the cost associated with the PCG iterations reasonable (see [111] for details). In the context of this dissertation, there are additional GPU-related considerations that should be taken into account as heuristic guidelines for adaptive IPMs. The sparse Cholesky factorization can be performed with or without GPU acceleration. Depending on the problem, we can adaptively select which one of the options gives better performance. Small-rank updates are natively supported by CHOLMOD [99]. Thus, the adaptive preconditioning strategy just described can be implemented on GPUs as well. Alternatively, it is possible to use the GPU-based matrix-free preconditioning introduced in Section 5.2. Due to the advances in SpMV optimization, the PCG algorithm observes better speedups than its sparse Cholesky factorization counterpart when implemented on GPUs. This opens up more opportunities for using GPU-based iterative methods and enjoying higher performance. On the other hand, high-order techniques like multiple correctors are less suited for PCG. Their main drawback is the need to solve multiple linear systems (as opposed to reusing the same factorization again and again).
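The refactor-versus-update decision of equations (5.9)-(5.10) amounts to a few comparisons per IPM iteration. The sketch below captures that logic in isolation; the 0.8 threshold follows the suggestion quoted above, the linear extrapolation used for γ is one possible reading of the interpolation mentioned in the text, and the structure holding the timings is an assumption of this example.

struct SolverStats {
    double t_direct;        // measured cost of a full Cholesky factorization + solve
    double t_update;        // cost of one small-rank preconditioner update
    double t_iter;          // cost of one PCG iteration
    double iters_prev2;     // PCG iteration counts of the two previous runs
    double iters_prev1;
    int    alpha;           // number of small-rank updates to apply
};

enum class NewtonSolver { Cholesky, PCG };

// Decide how to solve the next Newton system (equations (5.9)-(5.10)).
NewtonSolver choose_solver(const SolverStats &s) {
    double t_pcg_last = s.alpha * s.t_update + s.iters_prev1 * s.t_iter;      // (5.9)
    if (t_pcg_last > 0.8 * s.t_direct)        // last PCG run too expensive: refactor
        return NewtonSolver::Cholesky;

    // Linear extrapolation of the iteration count from the last two runs.
    double gamma = s.iters_prev1 + (s.iters_prev1 - s.iters_prev2);
    double t_pcg_pred = s.alpha * s.t_update + gamma * s.t_iter;              // (5.10)
    return (t_pcg_pred < s.t_direct) ? NewtonSolver::PCG : NewtonSolver::Cholesky;
}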

5.4 An Optimized Hybrid CPU-GPU Implementation

The computational techniques presented so far are the foundation for implementing the adaptive IPM on GPUs. All the required linear algebra kernels have a pure GPU-based implementation. The only exception is the supernodal sparse Cholesky factorization provided by CHOLMOD [99]. This library has a hybrid CPU-GPU implementation that uses GPU acceleration to factor some of the dense sub-blocks. As a requirement, the matrix M to be factorized must reside entirely in CPU memory as a sparse CSC matrix. This leads to a series of data transfers between the CPU and the GPU. Due to the symmetry of the normal equation, the rows of the CSR matrix M on the GPU (lower triangle) can be directly copied into the columns of the CSC matrix M_cpu on the CPU (upper triangle), avoiding the need for a costly conversion. In addition, all CPU-GPU transfers can be performed over the PCI-Express bus at full DMA speed as long as the data on the CPU side are kept in page-locked memory.

The use of asynchronous GPU streams provides an opportunity for overlapping GPU kernels with data transfers and CPU computation, with a substantial benefit in terms of algorithm runtime. GPU streams are essentially asynchronous command queues into which we can push GPU kernels and data transfers. Within the same stream, operations are serialized, whereas operations in different streams can overlap and execute simultaneously. In addition, stream execution is asynchronous with respect to the CPU threads, allowing hybrid CPU-GPU implementations.

In our case, GPU streams optimize the data transfers and the computation associated with the Cholesky factorization. Algorithm 9 shows a simplified version of our hybrid adaptive IPM implementation. For the sake of simplicity, we do not show the performance monitoring necessary to select the best option at each iteration (i.e. we assume a priori knowledge). Similarly, we do not include the multiple centrality correctors and the regularization techniques. The light green color indicates the operations that can be overlapped. Once the GPU-based SpMM kernel calculates AΘA^T, we either compute or update the Cholesky factor L depending on the adaptive selection (line 4). In the first case, we move the matrix M to M_cpu and perform the Cholesky factorization (lines 5-6). In the second case, we move the scaling factor Θ to Θ_cpu and perform the small-rank updates (lines 8-9). In both cases, there is an opportunity to overlap with the subsequent linear algebra operations (lines 11-15), since those are executed as pure GPU kernels. Note that CHOLMOD itself is implemented using hybrid CPU-GPU computation. This does not create conflicts because modern GPUs can execute multiple kernels simultaneously in order to fully utilize the hardware. The algorithm synchronizes when the Cholesky factor is transferred back to be used directly as a triangular solver or as a preconditioner (line 16). Note that this last data transfer can be overlapped as well. Another opportunity for optimization is the ability to reuse the support data associated with the supernodal Cholesky factorization, the triangular system solution and the SpMM kernel. Those operations rely on intermediate structures that stay constant throughout the IPM iterations and, thus, can be conveniently computed once at the beginning.
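A minimal sketch of this overlap pattern is shown below. Only the CUDA runtime calls reflect the real API; residual_kernel and the surrounding function are hypothetical placeholders for the actual SpMV-based kernels and the CHOLMOD factorization call.

// Sketch: the lower-triangular normal equation M is copied to page-locked host
// memory on one stream while independent GPU kernels (residual updates) run on
// another stream; the CPU-side factorization then overlaps with those kernels.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void residual_kernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = -in[i];   // placeholder for r_p = b - Ax, r_d = c - s - A^T y
}

void overlap_factorization_step(const double* d_M, double* h_M_pinned,
                                std::size_t bytes_M, const double* d_in,
                                double* d_out, int n) {
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Stream 1: asynchronous DMA transfer of M (CSR values) into pinned memory.
    cudaMemcpyAsync(h_M_pinned, d_M, bytes_M, cudaMemcpyDeviceToHost, copy_stream);

    // Stream 2: residual computations that do not depend on the factorization.
    residual_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);

    // CPU: once the copy has landed, the (hybrid) CHOLMOD factorization of
    // M_cpu can run here, overlapping with the kernels still in compute_stream.
    cudaStreamSynchronize(copy_stream);
    // ... factorize h_M_pinned on the CPU ...

    cudaStreamSynchronize(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}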

Algorithm 9 Hybrid CPU-GPU Adaptive Interior Point Method
Input: (x, y, s)^[0] and σ
Output: (x*, y*, s*)
1: for k = 1, 2, . . . while ‖r_p‖_2/(1 + ‖b‖_2) ≥ ε_p or ‖r_d‖_2/(1 + ‖c‖_2) ≥ ε_d or x^T s/(1 + |x^T c + y^T b|/2) ≥ ε_g do
2:   Θ ← XS^{-1}
3:   M ← AΘA^T
4:   if Cholesky then
5:     M_cpu ← M
6:     L_cpu L_cpu^T ← M_cpu
7:   else
8:     Θ_cpu ← Θ
9:     L_cpu L_cpu^T ← L_cpu L_cpu^T + Σ_{i=1}^{α} δ_ii a_i a_i^T
10:  end if
11:  r_p ← b − Ax
12:  r_d ← c − s − A^T y
13:  µ ← x^T s / n
14:  r_g ← −Xs + σµe
15:  h ← r_p + AΘ(r_d − X^{-1} r_g)
16:  P_c ← L_cpu
17:  if Cholesky then
18:    Δy ← (P_c P_c^T)^{-1} h
19:  else
20:    Δy ← PCG(M, h, P_c)
21:  end if
22:  Δx ← Θ(A^T Δy − r_d + X^{-1} r_g)
23:  Δs ← X^{-1}(r_g − S Δx)
24:  (x^[k+1], y^[k+1], s^[k+1]) ← (x^[k], y^[k], s^[k]) + α(Δx, Δy, Δs)
25: end for

5.5 Performance Evaluation

We evaluated the effectiveness of the proposed GPU-based optimization techniques using the same experimental setting described in Section 4.5. We selected a set of eight well-known LP problems from the NETLIB repository [112] and from Mittelmann's test set [113]. All the benchmarks, listed in Table XXI, were directly available in standard form with no upper bounds on the primal and slack variables. We initially focused our attention on the SpMM kernel. Given the constraint matrix A, we composed the generating matrix G needed to calculate the normal equation M = AΘA^T at each step of the algorithm.

TABLE XXI. LP test set. For each LP problem, the table reports the dimensions m and n, the number of nonzeros nnz and the author of the instance: Rail507, Rail2586, Rail4284 and Watson2 (Mittelmann); Degme, Tp6, Karted and Ts-palko (Meszaros).

The SpMM performance was then directly obtained from the SpMV kernel. Here, we compare MKL and AdELL+ in order to evaluate the benefit of moving the computation to the GPU. We focus on double-precision calculations because better numerical accuracy helps the convergence of IPM algorithms. The results are reported in Table XXII. The proposed SpMM technique has a very efficient GPU-based implementation that always outperforms the baseline MKL implementation.

TABLE XXII. SpMM performance. For each LP problem, the table reports the MKL and AdELL+ performance in GFLOPS and the corresponding speedup, together with the harmonic mean.

We observe a substantial average speedup of 8.3x (harmonic mean). This provides enough experimental evidence to use our novel SpMM technique as a building block for an efficient GPU-based IPM algorithm.

We implemented the adaptive technique presented in Section 5.3 as the core component of a primal-dual infeasible IPM algorithm with multiple centrality correctors. In addition, we integrated the use of GPU streams as outlined in Section 5.4. Figure 21 provides a runtime profile showing how the various computational kernels compose the running time of our GPU-based IPM algorithm: the Cholesky factorization (in red), the triangular backsubstitution (in blue), the PCG method (in green), the SpMV kernel (in purple) and the SpMM kernel (in light blue). The runtime distribution changes depending on the LP problem but, in general, most of the running time is spent solving the normal equation, either with the Cholesky factorization (plus backsubstitution) or with PCG (using the most recent Cholesky factor as preconditioner). On the other hand, the impact of SpMV and SpMM on the runtime is marginal thanks to the aggressive optimizations. For three out of eight LP benchmarks, the adaptive technique prefers the GPU-based PCG over CHOLMOD during some of the intermediate IPM iterations, providing a speedup of 1.31x (Watson2), 1.06x (Degme) and 1.04x (Karted) over a Cholesky-only solution. Moreover, the adaptive technique chooses to accelerate CHOLMOD with the GPU in half of the cases (Degme, Tp6, Karted and TsPalko). The triangular backsubstitution has instead a variable impact on the runtime: for some of the LP problems, cuSPARSE can extract and exploit the underlying parallelism whereas, in the other cases, the backsubstitution does not offer such an opportunity. Overall, we observed that GPUs have enough computational power to make our primal-dual infeasible IPM code efficient in practice.

Figure 21. IPM profiling: per-benchmark runtime breakdown (Rail507, Rail2586, Rail4284, Watson2, Degme, Tp6, Karted, TsPalko) across Cholesky factorization, triangular backsubstitution, PCG, SpMV and SpMM.

Note that it is beyond the scope of this dissertation to compete directly with commercial state-of-the-art optimization software. Nevertheless, we present a performance comparison with CPLEX [55]. We set ε_p = 10^-4, ε_d = 10^-4 and ε_g = 10^-6 as reasonable convergence criteria. Figure 22 shows the runtime performance of CPLEX (normalized over our GPU-based IPM code). As we can see, the performance is comparable (or slightly better) in all cases except one, where CPLEX is much more efficient (Watson2). Note that CPLEX uses a preprocessing step to reduce the LP problem size before applying its sophisticated (and proprietary) version of the barrier algorithm. In addition, CPLEX is implemented as a multithreaded application and, thus, can take full advantage of the 16 cores available in our experimental computing platform, so the direct comparison with our IPM implementation is not biased in favor of the powerful Tesla K40 GPU.

Figure 22. Comparison with CPLEX: runtime of CPLEX normalized over our GPU-based IPM code for each LP benchmark.

The achieved results provide a solid proof-of-concept that promotes the use of the proposed GPU-based techniques in commercial software, with a general and long-lasting impact on the field of convex optimization.

CHAPTER 6
EXTENSION TO ILP

In this last chapter, we focus on the solution of ILP problems through relaxations. After introducing the classic branch-and-bound and branch-and-cut algorithms, we propose an extension of our GPU-based computational techniques to enable the use of the adaptive IPM algorithm as the core solver for LP relaxations.

6.1 Combinatorial Techniques for ILP

The solution of ILP problems (such as those arising in combinatorial optimization) is based on an abstract computational tree where each node is a restricted version of the original ILP. Given any node i, we can relax its integrality constraints and solve its LP relaxation, obtaining an optimal x^[i]. This solution may be used in different ways. If x^[i] is integral, then it is also optimal for the ILP problem described by node i. Otherwise, the fractional solution x^[i] provides a lower bound (assuming minimization) for the same ILP instance. Heuristic rounding can be used to find a nearby integral feasible point x̄^[i] to take as an upper bound (indeed, any integral feasible solution is an upper bound). Moreover, these considerations are used in a more global context for the solution of the original ILP problem. The overall computational tree is structured in such a way that each branch corresponds to the addition of linear constraints, either to fix the value of some variables or to reduce the feasible space. Thus, any child node represents a more constrained version of all its ancestors, up to the original ILP problem at the root node.

Note that any integral solution x^[i] ∈ Ω_Z^[i] found at node i, either optimal or not, can be backtracked to the root since Ω_Z^[i] is a subset of the original Ω_Z. In addition, this gives the opportunity to update a global upper bound x̄ ∈ Ω_Z (i.e. the best integral solution found so far) for the original ILP.

Branch-and-bound is a general algorithmic technique for solving combinatorial problems. Let us first assume we have a 0-1 ILP problem. Given a node in the abstract computational tree, we can branch and create two new restricted subproblems by fixing a variable to zero for one child node (x = 0) and to one for the other (x = 1). Let us now assume we have a generic ILP problem. We can still create child nodes by splitting the feasible space Ω_Z over a variable: for example, one subproblem may have the linear constraint x ≤ 3 whereas the other may have x ≥ 4. Note that the addition of constraints restricts the feasible space up to the point where only a single integral solution is left (leaf nodes). In other words, the abstract computational tree is a representation used to enumerate all the feasible ILP solutions. Unfortunately, this is equivalent to an exponentially large search space. The branch-and-bound technique is then essential to reduce the search to a small portion of the abstract computational tree and yet achieve the optimal ILP solution. The technique is based on pruning unfruitful subtrees according to the bounds derived from LP relaxations. Any time a node has a solution x^[i] for its LP relaxation that is also integral, we can prune its subtree and update the global upper bound x̄ if the objective function is improved (c^T x^[i] < c^T x̄). Otherwise, the non-integral solution x^[i] provides a lower bound for the ILP instance rooted at the current node. If this lower bound is worse than the global upper bound (c^T x̄ ≤ c^T x^[i]), we can safely prune the current subtree since there is no opportunity to improve the best integral solution found so far.

However, if c^T x̄ > c^T x^[i], then the current ILP instance may contain an integral solution that improves the global upper bound x̄ and, thus, the search should continue in the current subtree.

Cutting planes are linear constraints used to restrict a convex space to the convex hull of its integral restriction. The key idea is to restrict the relaxed Ω = {x ∈ R^n : Ax = b, x ≥ 0} without leaving out any feasible integral point in Ω_Z = {x ∈ Z^n : Ax = b, x ≥ 0}. Indeed, the goal is to bring the optimal ILP solution onto the facets of the polyhedron. Assuming the appropriate cutting planes are used, the LP relaxation yields an integral solution that is also optimal for the original ILP problem. The cutting-plane technique for general ILP problems was first proposed by Gomory [114] but was neglected for several years due to slow convergence. Fortunately, the development of polyhedral theory led to stronger problem-specific cutting planes [115]. In general, it is difficult to solve a generic ILP problem efficiently using the cutting-plane technique alone. Instead, a balanced combination of branch-and-bound and cutting planes can considerably speed up the solution. This so-called branch-and-cut technique [116] is incorporated into optimization software, providing an efficient and practical way to solve ILP problems. Depending on the implementation, cutting planes are added to strategic nodes of the abstract computational tree, leading to a considerable pruning of the search space. Algorithm 10 shows the branch-and-cut technique in more detail. Note that the most expensive operation of each branch-and-cut iteration is the solution of an LP relaxation (corresponding to a node of the abstract computational tree). The use of IPM algorithms as core LP solvers is a very attractive option, especially for large sparse LP instances. However, this introduces a well-known difficulty related to warm starting [117]. As described in Section 2.3.2, the selection of the initial solution x^[0] is crucial for the efficient convergence of IPMs.

Algorithm 10 Branch and Cut
Input: List L, upper bound integral ILP solution x̄
Output: Integral ILP solution x̄
1: Put the original ILP into L
2: while L ≠ ∅ do
3:   Get node i from L
4:   Solve its LP relaxation
5:   if there exists a feasible x^[i] then
6:     if c^T x^[i] < c^T x̄ then
7:       if x^[i] ∈ Ω_Z then
8:         x̄ ← x^[i]
9:       else
10:        Get an integral x̄^[i] close to x^[i]
11:        if c^T x̄^[i] < c^T x̄ then
12:          x̄ ← x̄^[i]
13:        end if
14:        Add cutting planes
15:        Split over a variable to create child nodes
16:        Add child nodes to L
17:      end if
18:    end if
19:  end if
20: end while
21: return x̄

In general, x^[0] can be found with a heuristic for each LP relaxation (cold starting). On the other hand, it is computationally more attractive to reuse a starting point taken from one of the LP relaxations just solved (warm starting). However, x^[0] should also be picked as centered as possible in order to produce a faster interior trajectory to the optimal solution. Computational experiments have shown that branch-and-bound performs better in practice when we solve the LP relaxations with moderate accuracy. Given an approximate solution x^[i], this may provide a centered warm start thanks to its distance from the polytope walls.

However, the level of accuracy has additional implications for the branch-and-bound algorithm. First, we need enough accuracy to potentially reach an integral x^[i]. Second, we need enough accuracy to provide a good lower bound and prune as many subtrees as possible. Primal-dual IPM algorithms may be used to simultaneously derive both a lower and an upper bound for the current node. This gives the ability to stop the LP relaxation solver as soon as the optimum of the current node cannot be better than the best integral solution x̄ already achieved (i.e. as soon as b^T y^[i] ≥ c^T x̄). The literature on branch-and-cut describes additional heuristics to terminate the relaxation at the right moment, as well as effective techniques for centering x^[i] during warm starting [117].

6.2 Data Structures for Branch & Cut

The implementation of branch-and-cut strategies relies on the ability to compose new ILP instances by updating the constraint matrix A, the right-hand side vector b and, possibly, the objective function vector c. For example, suppose we fix a variable such that x_i = b_{m+1}. This is implemented by adding a nonzero a_{(m+1)i} = 1 to matrix A and a new entry b_{m+1} to vector b. Note that this increases the row dimensionality by one. Suppose instead that we impose a simple inequality x_i ≤ b_{m+1}. This constraint can be reduced to standard form by introducing an artificial slack variable x_{n+1} and, thus, a zero entry in the objective function c. The standard constraint x_i + x_{n+1} = b_{m+1} is then implemented as the addition of two nonzeros in A and an entry in b. Indeed, any generic equality constraint, including those derived from cutting planes, corresponds to a new row in A (with nonzeros describing the variable dependencies) and a right-hand side value in b.
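The row-append pattern just described can be sketched as follows. The CsrMatrix container and the append_row helper are illustrative assumptions rather than the data structures used in the thesis code; they only show that adding a fixed variable or a cutting plane touches the end of the CSR arrays and of the vector b.

// Sketch: appending one equality row to a CSR matrix without modifying the
// existing structure. Illustrative container, not the thesis implementation.
#include <vector>
#include <utility>

struct CsrMatrix {
    int ncols = 0;
    std::vector<int>    row_ptr{0};  // size = nrows + 1
    std::vector<int>    col_idx;
    std::vector<double> values;
};

// Append one equality row with the given (column, value) pairs and the
// right-hand side rhs; returns the new row index. Used both for fixed
// variables and for cutting planes.
int append_row(CsrMatrix& A, std::vector<double>& b,
               const std::vector<std::pair<int, double>>& entries, double rhs) {
    for (const auto& e : entries) {
        A.col_idx.push_back(e.first);
        A.values.push_back(e.second);
    }
    A.row_ptr.push_back(static_cast<int>(A.col_idx.size()));
    b.push_back(rhs);
    return static_cast<int>(A.row_ptr.size()) - 2;
}

// Example uses:
//   append_row(A, b, {{i, 1.0}}, 3.0);                      // fix x_i = 3
//   A.ncols += 1;                                           // new slack column x_{n+1}
//   append_row(A, b, {{i, 1.0}, {A.ncols - 1, 1.0}}, 4.0);  // x_i + x_{n+1} = 4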

The sparse matrix formats commonly used for GPU-based implementations are, in general, not well-suited for structural updates. The random addition of a few nonzeros may require a preprocessing step that is substantially equivalent to composing the sparse format from scratch. On the other hand, the incremental addition (and removal) of new rows can be done with little or no effort for flexible formats such as COO or CSR. The key idea is to append the new rows and their nonzeros at the end of the data structure, avoiding any modification of the existing matrix. A similar approach can also be used for warp-grained data structures such as AdELL+. We observe that this incremental mechanism is appropriate when the branch-and-cut algorithm uses DFS to explore the abstract computational tree. Given the ILP problem, we can use a base matrix A_base = [A 0]^T ∈ R^{(m+l)×n} and an incremental matrix A_inc = [0 R_new]^T ∈ R^{(m+l)×n} to represent the constraint matrix A^[l] at the current level l of the abstract computational tree. Note that R_new ∈ R^{l×n} collects the added equality constraints. If A_inc can be stored as a COO (CSR) extension of a COO (CSR) base A_base, the SpMV operation can be seamlessly executed as a single kernel. Otherwise, the SpMV operation can simply be decomposed as

A^[l] x = A_base x + A_inc x,    (6.1)

where each part can be independently stored using a different sparse format. The runtime of this aggregated SpMV is mostly dependent on A_base x. Moreover, A_base is fixed for the entire branch-and-cut algorithm. Thus, we can conveniently apply all the SpMV optimization techniques proposed in Chapter 3.
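A minimal sketch of the aggregated SpMV of Equation (6.1) is given below, with plain CPU loops standing in for the AdELL+ and CSR GPU kernels of the actual implementation; the function names are illustrative.

// Sketch of Equation (6.1): the fixed base matrix and the incremental
// extension are multiplied separately and accumulated into the same vector y.
#include <vector>
#include <algorithm>
#include <cstddef>

static void csr_spmv_accumulate(const std::vector<int>& row_ptr,
                                const std::vector<int>& col_idx,
                                const std::vector<double>& val,
                                const std::vector<double>& x,
                                std::vector<double>& y) {        // y += A * x
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r)
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            y[r] += val[k] * x[col_idx[k]];
}

// y = A^[l] x = A_base x + A_inc x; both operands share the (m + l) x n shape,
// so the two partial products simply accumulate into the same output vector.
void aggregated_spmv(const std::vector<int>& base_ptr, const std::vector<int>& base_col,
                     const std::vector<double>& base_val,
                     const std::vector<int>& inc_ptr, const std::vector<int>& inc_col,
                     const std::vector<double>& inc_val,
                     const std::vector<double>& x, std::vector<double>& y) {
    std::fill(y.begin(), y.end(), 0.0);
    csr_spmv_accumulate(base_ptr, base_col, base_val, x, y);   // A_base x
    csr_spmv_accumulate(inc_ptr,  inc_col,  inc_val,  x, y);   // + A_inc x
}

Keeping the two operands separate lets A_base stay in the heavily optimized AdELL+ layout while A_inc remains cheap to extend.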

On the other hand, the size of A_inc can potentially grow and become a substantial part of the aggregated SpMV runtime. It is then reasonable to refactor A_base in order to absorb A_inc once a certain level of the abstract computational tree is reached. There is also a strategy to optimize the transpose operation A^[l]T y. Due to its computational load, it is beneficial to explicitly compose the transpose A_base^T as AdELL+. On the other hand, the remaining A_inc still needs the flexibility to add constraints, so it uses COO or CSR. Note that those sparse formats directly support transpose SpMV with no need to explicitly compose the transpose matrix.

We tested the overhead associated with the incremental matrix technique by considering the set of LP problems used in Section 5.5. We measured the SpMV performance on the constraint matrix A_base (stored as AdELL+) and compared it with the overhead associated with an incremental matrix A_inc (stored as CSR). The latter was composed by randomly inserting 100 unit nonzeros on new rows (i.e. fixing 100 variables) to simulate an abstract computational tree of depth 100. Figure 23 shows the average normalized performance for nontranspose and transpose SpMV applied to the LP benchmarks. As we can see, the overhead associated with A_inc is negligible. On the other hand, A_inc^T introduces a substantial overhead of 32%. This inefficiency may be explained by two reasons. First, the transpose SpMV kernel available in cuSPARSE provides reduced performance (as low as half of the nontranspose SpMV). Second, we explicitly store A_base^T in order to fully take advantage of AdELL+; this, however, provides a very optimized baseline that magnifies the overhead. Modern GPU architectures have the ability to run multiple kernels simultaneously and maximize the utilization of the underlying computing resources. This potentially allows us to hide the overhead associated with A_inc^T. In fact, we can launch the base and the incremental kernels together.

Figure 23. Incremental SpMV overhead: average normalized performance of nontranspose and transpose SpMV with A_base (AdELL+) and A_inc (CSR) on the LP benchmarks.

Despite its latency, the incremental SpMV kernel processes only a limited number of nonzeros using just a few warps. Hence, the base SpMV kernel has enough resources to preserve its performance and, at the same time, to completely hide the overhead.

The incremental matrix technique can also be adapted to the SpMM kernel. The key idea is that the additional rows in A_inc generate additional nonzeros at the bottom of the lower triangular part of the matrix AΘA^T. Those nonzeros can be incrementally appended at the end of the matrix M. Similarly, it is possible to embed the associated summation pattern at the end of the generating matrix G. An example of this technique is shown in Figure 24. For the sake of simplicity, we use CSR to store the entire matrices, although the AdELL+ format represents a more efficient solution for M_base and G_base.

Figure 24. Extending SpMM to ILP: (a) the product AΘA^T with one additional constraint row appended to A; (b) the CSR storage of the lower triangular normal equation M and of the generating matrix G, incrementally extended with the corresponding entries.

At the top of the figure, the matrix A has an additional constraint involving two nonzeros (highlighted in red). This constraint induces three nonzeros in the last row of the matrix AΘA^T. At the bottom, we can observe how the matrix M is incrementally updated to accommodate a new row with three additional nonzeros. Similarly, the pattern of summations necessary to generate those nonzeros is incrementally embedded at the end of the generating matrix G. For example, the element m_40 is the summation of two nonzeros in G because row 4 and row 0 of A have nonzeros intersecting on two columns (2 and 4, respectively). In general, the incremental structures M_inc and G_inc are substantially smaller than the fixed M_base and G_base and, thus, are conveniently stored as CSR.
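The summation pattern behind the incremental rows of M can be sketched as follows, using dense rows for clarity; this is an illustration of the arithmetic only, not the GPU SpMM kernel or its CSR/AdELL+ data structures.

// Sketch of one incremental normal-equation row: every entry of the new
// (lower triangular) row of M = A*Theta*A^T is the sum of products over the
// columns where the appended row a_new and an existing row of A intersect.
#include <vector>
#include <cstddef>

// Returns m_new with m_new[j] = sum_k a_new[k] * theta[k] * A_rows[j][k],
// plus the diagonal entry of a_new against itself in the last position.
std::vector<double> incremental_normal_row(
        const std::vector<std::vector<double>>& A_rows,  // existing rows of A
        const std::vector<double>& a_new,                // appended constraint row
        const std::vector<double>& theta) {              // diagonal of Theta
    std::vector<double> m_new(A_rows.size() + 1, 0.0);
    for (std::size_t j = 0; j < A_rows.size(); ++j)
        for (std::size_t k = 0; k < a_new.size(); ++k)
            m_new[j] += a_new[k] * theta[k] * A_rows[j][k];
    for (std::size_t k = 0; k < a_new.size(); ++k)        // diagonal term
        m_new.back() += a_new[k] * theta[k] * a_new[k];
    return m_new;
}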
